We should all care about the dangers of data collection

May 5, 2022 | 05:09 pm PT
Dang Nguyen Researcher
Back in 2008, Google claimed that it could outperform the U.S. Centers for Disease Control and Prevention (CDC) in predicting when and where flu outbreaks would peak.

The idea was simple: when people are sick with the flu, many search for flu-related information on Google. These searches, when taken as data points, can be treated as proxies for overall flu prevalence. Google Flu Trends was developed on the premise that search data, if successfully tuned to the CDC's flu tracking information, could produce accurate estimates of flu prevalence two weeks earlier than the CDC's data. Every anxious search can be turned into potentially life-saving insight, so goes the logic behind this web service. In an analysis paper published in 2012, Google claimed that Google Flu Trends predictions were 97 percent accurate compared with CDC data.

Google Flu Trends then went on to miss the peak of the 2013 flu season by 140 percent. From 2011 to 2013, it consistently overestimated relative flu incidence; over one interval in the 2012–2013 flu season, it predicted twice as many doctors' visits as the CDC recorded, a failure of spectacular proportions.

Google quietly stopped the program in 2015. Google Flu Trends has since entered the big data history book of failures as one of the most iconic examples of what Northeastern computer scientist David Lazer calls "big data hubris": the implicit assumption that big data can substitute for, rather than complement, traditional data collection and analysis.

What went wrong? For one, the correlation between Googling flu symptoms and actual instances of influenza is spurious: being expressly concerned about flu symptoms is not indicative of having the flu. Flu-like symptoms could also be indicative of illnesses other than the flu: a lesson not lost on anyone living with and through the Covid-19 pandemic, yet one that might not have been readily obvious to computer scientists in the early 2010s.

In a paper published in Science, a team from Northeastern University, the University of Houston, and Harvard University found that Google's model was vulnerable to overfitting to seasonal terms unrelated to the flu. The paper pointed out that Google's methodology was to find the best matches among 50 million search terms to fit 1,152 data points in the CDC dataset. With so many candidate terms and so few data points, it was inevitable that some search terms (such as ‘high school basketball’) would correlate strongly with flu cases by pure chance; such terms were unlikely to be driven by actual flu cases, or to be predictive of future trends.
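The chance-correlation problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of Google's method: the "flu" series and every candidate "search term" are pure random noise, the sizes are scaled far down from 50 million terms and 1,152 data points, and the function names (pearson, best_spurious_match) are hypothetical. It picks whichever noise series correlates best with the target on a training window, then checks the same series on a held-out window.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_match(n_candidates, n_train, n_test, seed=0):
    """Pick the noise series that best matches a noise target in training.

    Returns (training correlation, held-out correlation) for the winner.
    Every series here is random by construction, so any strong training
    correlation is spurious.
    """
    rng = random.Random(seed)
    n_total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n_total)]  # stand-in for flu data

    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_total)]  # stand-in term
        r = pearson(candidate[:n_train], target[:n_train])
        if best_series is None or abs(r) > abs(best_r):
            best_r, best_series = r, candidate

    holdout_r = pearson(best_series[n_train:], target[n_train:])
    return best_r, holdout_r


if __name__ == "__main__":
    for n in (10, 200, 2000):
        train_r, holdout_r = best_spurious_match(n, n_train=30, n_test=30)
        print(f"{n:>5} candidates: train r = {train_r:+.2f}, "
              f"held-out r = {holdout_r:+.2f}")
```

The more candidates are searched, the stronger the best training correlation becomes, even though nothing in the data is related to anything else. That selected correlation carries no guarantee of holding on new data, which is the overfitting risk the Science paper describes.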

Search behavior on the web also changes over time. The training data used for Google Flu Trends could have reflected the search behavior of users at a time when using Google for health-related information was not as common. As people rely more and more on online search engines to find health-related information, the 45 undisclosed search terms that Google identified as strongly correlated with flu outbreaks might no longer correlate with actual flu cases in any reliable way.

Google itself is also a living technology: it routinely tweaks its recommender system to be more 'useful' to its users. The 'autosuggest' feature could make people more likely to search for terms related to flu cures simply because these terms have been searched many times before, making these terms correlate better with the Google Flu Trends model than with the actual CDC data that Google was trying to predict.

The dangers of dataism

This is not to say that all of big data is hubris. The strongest applications of big data are probably found within the private sector. Here’s one example: a recent Perspective headline reads, "It's nearly impossible to do mankind any good without proper data", citing Starbucks as an example of big data best practice. Seemingly frustrated that big data isn’t bigger, the author wrote:

"As someone who has been in the Big Data trenches for nearly a decade, I have realized that the toughest challenge in data collection comes from the protests of the masses. The main reason, as I can see, is that people think data collection only benefits the "big guys" while they suffer the indignity of privacy invasion, not unlike sheep who were fleeced and then carved up."

Unnecessarily evocative language aside, the mentality behind this characterization is very common in Silicon Valley. Timnit Gebru, a renowned computer scientist who co-led a team on the ethics of artificial intelligence at Google from 2018 to 2020, has frequently commented on Silicon Valley's tendency to fantasize about technological solutions that supposedly help "all of humanity", as if humanity were a monolith devoid of context. In these fantasies, humanity is not shared among humans as complex sentient beings; it is rather something that technologists optimize, monetize, and once in a while, even rescue. Technologists have a special talent for separating themselves from the humanity they desperately try to save: we see it in the language of the capital "I" vs. "the masses", the "big guys" vs. "sheep who were fleeced and then carved up."

This distance from humanity is perhaps also what drives their peculiar zest for big data as the ultimate solution to all of the world's problems: technocratic elites see themselves as having access to a special kind of vision and knowledge that the rest of humanity would do well to accept and follow. In an article in Surveillance & Society in 2014, Jose van Dijck, an influential media studies scholar, refers to this belief as "dataism", a "belief in the objective quantification and potential tracking of all kinds of human behavior." The assumption here, of course, is that those collecting and analyzing data can be trusted to collect, interpret, and share this data. It does not matter whether we wish to become known to technologists in this way; why take "protests of the masses" seriously? Neither does it matter whether we agree with what our data say about us; whether our "data-doubles" align with how we understand ourselves as individuals also seems beside the point. Humanity is to become known to technologists through data, which, despite enthusiastic evangelism, can only tell the story of humanity from a particular vantage point. More often than not, that vantage point inflicts actual harm on the diverse groups of human beings that make up our common humanity.

Dataism and its harms

The moral imperative behind dataism, that more data is always better, rests on the idea that individuals are completely knowable from afar, once all possible data about them have been collected.

Consider facial recognition technology, one of the first machine learning-based applications to be banned by U.S. governmental bodies. Facial recognition technology relies on data, understood as images of the human face captured through cameras, to learn about the human face so that it can distinguish faces from non-faces, as well as different kinds of faces from one another. In May 2019, the city of San Francisco, home to many Silicon Valley tech developers, barred the police and other municipal agencies from using the technology because of both the real and the speculative harms it can do.

What kind of harm, you might wonder? The chances of you getting accurately classified by facial recognition technology vary significantly depending on who you are and how you look. In the landmark 2018 "Gender Shades" project, three gender classification algorithms, including those developed by IBM and Microsoft, were assessed to identify differences in gender classification error rates between groups. The project selected 1,270 images from three African countries and three European countries, then grouped these images into four categories: darker-skinned females, darker-skinned males, lighter-skinned females, and lighter-skinned males. All three algorithms performed the worst on darker-skinned females, with error rates up to 34 percentage points higher than for lighter-skinned males. Independent assessment by the National Institute of Standards and Technology (NIST) has also confirmed that face recognition technologies across 189 algorithms are least accurate on women of color.
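The kind of disaggregated evaluation described above, where error rates are reported per group rather than as one overall figure, can be sketched in a few lines. This is an illustrative sketch only: the records below are made-up toy data, not Gender Shades results, and the function names (error_rates_by_group, largest_gap) are hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Compute the classification error rate for each group.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, true_label, predicted_label in records:
        totals[group] += 1
        if predicted_label != true_label:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def largest_gap(rates):
    """Return (worst group, best group, gap in error rate between them)."""
    worst = max(rates, key=rates.get)
    best = min(rates, key=rates.get)
    return worst, best, rates[worst] - rates[best]


# Toy, made-up records for illustration only (not real benchmark data).
toy_records = (
    [("darker-skinned female", "F", "M")] * 3
    + [("darker-skinned female", "F", "F")] * 7
    + [("lighter-skinned male", "M", "M")] * 10
)

if __name__ == "__main__":
    rates = error_rates_by_group(toy_records)
    overall = sum(t != p for _, t, p in toy_records) / len(toy_records)
    worst, best, gap = largest_gap(rates)
    print(f"overall error rate: {overall:.0%}")
    for group, rate in rates.items():
        print(f"{group}: {rate:.0%}")
    print(f"gap between {worst} and {best}: {gap * 100:.0f} percentage points")
```

A single overall error rate can look acceptable while hiding a large gap between groups, which is why audits of this kind report each group separately.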

What could be the harms of being incorrectly identified or classified by a machine? After all, wouldn’t technology eventually get better with time if, alas, you just gave technologists more and better data? Unfortunately, machine biases are not our only worry. Even if a facial recognition technology with 100 percent accuracy across all demographic groups could be developed, the flawed belief that humanity could be fully known from a distance would still continue to drive misguided, even perverse, technological "solutions."

In her latest book "Discriminating Data", Wendy Hui Kyong Chun dissected two instances where facial recognition technology is harmful not only because of its well-documented biases, but also in the ways in which it threatens to become an "authenticity machine." One instance involves an unpublished yet widely covered paper posted online in 2016, where scientists from Shanghai Jiao Tong University claimed to have developed a system to discriminate between criminals and non-criminals. The other instance involves a machine learning program by computer scientist Yilun Wang and computational social scientist Michal Kosinski in 2018 that claims to detect sexual orientation.

Both instances demonstrate the harmful idea that machines can be the true purveyor of knowledge about humans, by means of making vast troves of existing data speak. Data never speak for themselves; the shape of your nose or the size of your forehead does not directly point to whether you have committed a crime, or reveal your sexuality. For data to make these predictions, they have to be made to speak, or rather, spoken for, on their own limited terms. The fact that physiognomy, the long-discredited practice of inferring someone's character from their appearance, is resurrected in contemporary big data practices using elaborate, energy-intensive methodologies such as deep neural networks is the direct product of dataism.

To be clear, vast amounts of data automatically collected by surveillance cameras mean nothing until a data scientist steps in and makes the decision to shoehorn these datasets towards "solving" a problem of his own making. How these problems are framed depends on what kind of data are already being collected (and can be further collected with existing technologies), so that a "solution" can be arrived at. And so we see: search terms "solving" the "problem" of predicting flu outbreaks, the shape of human jaws "solving" the "problem" of guessing someone's sexuality, the color of human skin "solving" the "problem" of guessing whether someone has committed a crime.

Inevitability is a lie

I've always been wary of people telling me that something is inevitable because it's already happening elsewhere. People who are convinced of the inevitability of a technological paradigm are usually those who stand to benefit the most from it materializing. Usually, they also lack the imagination and creativity to think about how things could be otherwise. Maybe there is a deluded comfort in reducing the full complexity of humanity to the amorphous "masses" that do not know any better. As Shoshana Zuboff put it quite bluntly in her 2019 best-selling book "The Age of Surveillance Capitalism": "Inevitability rhetoric is a cunning fraud designed to render us helpless and passive in the face of implacable forces that are and must always be indifferent to the merely human."

The good news is, you and I do not have to play along. We all can, and should, care about the harms and dangers of data. Anyone who pretends to have all the answers without including everyone else in the process of inventing these answers is doing just that: pretending. We can and should call their bluff; the stakes are too high for us not to.

* Dang Nguyen is a Research Fellow at the Australian Research Council Center of Excellence for Automated Decision-Making & Society. The opinions expressed are her own.

The opinions expressed here are personal and do not necessarily match VnExpress's viewpoints.