Cut and paste culture: The hidden cost of reusing data

Updated: Jun 17



Recently, I was on an email thread where a templated email was reused out of context, leading to a slightly confusing and funny exchange about a non-existent maternity leave. Making things worse, I added to the thread by sending out a questionnaire where the first question was clearly out of context. Oops! I think we've all had this happen. The cut and paste function is an easy default, and most of the time it saves us time. Yet it also creates a dynamic for how we work. Cut and paste, as a technology feature, shapes our discourse in a way that is largely invisible until there is a glaring error.

Reusing data. Asking new questions of existing datasets. That lies at the heart of machine learning. It's cut and paste of a different magnitude.

Tiny images, big ethical issues

The 80 Million Tiny Images dataset, curated by MIT, was removed this summer after it was found to contain racist and offensive content. As this article in VentureBeat explains, many of the images were labelled with blatantly racist and sexist terms, and the dataset contained pornographic images of women that were gathered without consent. Because the images are tiny and the dataset is so large, there was no human oversight or review until two researchers, one of them at University College Dublin, wrote a paper auditing the dataset (this is the pre-print) and exposing these issues (Prabhu & Birhane, 2020). The dataset had been in use since 2006 and was cited over 1,700 times (Johnson, 2020).

Cut and paste on steroids

It's easy to scrape data from the internet and then use and reuse it without really knowing much about the data. In the case of 80 Million Tiny Images, the creators drew on another resource, a lexical database called WordNet, as a means of automated data collection and labelling. There was no human in the data collection loop. The importance of how a dataset is labelled, and of who gets to make decisions about data labels, is something I learned about while doing my own research. In interviewing AI researchers working on healthcare datasets, I found that some used the data as is with no label modifications, while some used a small human-labelled set and then applied automated tools to "scale up" the labelling. Others noted that even human labelling can encode bias, either through errors or through judgement calls by the labellers. Labelling data is tedious work, often outsourced to gig-economy workers with no training, for little pay. That itself is an ethical issue.
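To make the idea of a human in the loop a little more concrete, here is a minimal Python sketch of a first-pass label check that could be run before reusing an automatically labelled dataset. Everything in it is an assumption for illustration: the record format, the function name `audit_labels`, and the reviewer-supplied list of flagged terms are all hypothetical, and it is not the method used in the Prabhu and Birhane audit. It only counts how often each label appears and sets aside records whose label matches a flagged term so that a person can look at them.

```python
from collections import Counter


def audit_labels(records, flagged_terms):
    """Count labels and collect records whose label a human should review.

    records: iterable of (item_id, label) pairs (hypothetical format).
    flagged_terms: labels a human reviewer has marked as needing attention.
    """
    records = list(records)
    flagged = {term.lower() for term in flagged_terms}

    # How often each label occurs, ignoring case.
    counts = Counter(label.lower() for _, label in records)

    # Records whose label matches a flagged term go to a review queue.
    needs_review = [
        (item_id, label)
        for item_id, label in records
        if label.lower() in flagged
    ]
    return counts, needs_review


# Hypothetical example data, not drawn from any real dataset.
records = [
    ("img_001", "dog"),
    ("img_002", "Placeholder_Offensive_Term"),
    ("img_003", "dog"),
    ("img_004", "nurse"),
]
flagged_terms = ["placeholder_offensive_term"]

counts, needs_review = audit_labels(records, flagged_terms)
print(counts.most_common())
print(needs_review)
```

A check like this is only a starting point. An exact-match list will miss offensive labels nobody thought to include, and it says nothing about whether the images themselves were collected with consent, so it supplements human review rather than replacing it.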

Ethics boards missing big data issues

There are two issues with 80 Million Tiny Images: the dataset itself and the culture that accommodates its construction and use. In my own research, I found that research ethics boards have little if any governance over research involving the use of what is considered a public dataset. This makes it very easy to continue to reuse a problematic dataset without any formal ethics oversight. It took a focused audit of 80 Million Tiny Images, a dataset that was in use for close to 15 years, for MIT to realize there was a problem and address it by removing the dataset. During the time it existed, other researchers might have viewed MIT's curation of the 80 Million Tiny Images dataset as a sign of trustworthiness, a shorthand that signals the data is credible because it meets MIT's standards. This dataset also gave rise to other datasets, including CIFAR-10 and CIFAR-100, which have not been audited but which used different labelling methods (students were hired to label the images).

To be clear, I'm not against cut and paste. It's part of how we manage in our digital culture. However, I think we need to find some balance so that we don't blindly rely on it as a feature or defer to it as an ideology.

-- Katrina Ingram

#cutandpaste #DataBias #AIEthics #DataReuse

___________

Resources

Johnson, K. (2020, July 1). MIT takes down 80 Million Tiny Images data set due to racist and offensive content. VentureBeat. https://venturebeat.com/2020/07/01/mit-takes-down-80-million-tiny-images-data-set-due-to-racist-and-offensive-content/

Prabhu, V. U., & Birhane, A. (2020). Large image datasets: A pyrrhic win for computer vision? In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2006.16923

Abeba Birhane is one of the PhD students behind the 80 Million Tiny Images audit. Here's a podcast where she explores the concept of relational ethics.

Are you wondering who invented cut and paste? It was Larry Tesler.