Illustration: Ajit Bajaj

The dark side of big data

The latest Facebook data breach episode is a reminder that the rise of Big Data represents a massive engineering of society with ominous implications for notions of freedom, privacy and justice

If August 15, 1947, and January 26, 1950, were independent India’s first two Promethean moments, one freeing us from foreign rule, the other gifting us a democratic republic, August 24, 2017, should, in all fairness, go down in history as its third, for on that blessed day the Supreme Court declared privacy a fundamental right, thereby changing our lives forever in ways it may take us a while to unravel and fathom.

To quote generously from the judgement on which all nine judges of the Constitution Bench put their imprimatur, “Privacy includes at its core the preservation of personal intimacies, the sanctity of family life, marriage, procreation, the home and sexual orientation. Privacy also connotes a right to be left alone. Privacy safeguards individual autonomy and recognises the ability of the individual to control vital aspects of his or her life…While the legitimate expectation of privacy may vary from the intimate zone to the private zone and from the private to the public arenas, it is important to underscore that privacy is not lost or surrendered merely because the individual is in a public place. Privacy attaches to the person since it is an essential facet of the dignity of the human being.”

It’s difficult to imagine a more libertarian credo. Coming at a moment as it does when the government in power is trying to curb, sometimes subtly, but often overtly, certain individual freedoms in the name of questionable public or national good, many see the verdict as a fresh lease of life for civil liberty activists fighting laws that discriminate against minorities such as homosexuals and beef buffs. It also offers a ray of hope to star-crossed lovers who are often hounded for crossing arbitrary Lakshmanrekhas of religion, caste, community, class, race, and even gender.

But perhaps more momentously, in the wake of Edward Snowden’s disclosures on global surveillance, the judgement puts a big question mark on the right of governments and corporations to collect, share, sell and manipulate personal data in ways that may infringe individual privacy and dignity. While it takes care of the conceptual part of the challenge to Aadhaar, namely whether privacy is a fundamental right, a five-judge Constitution Bench will determine later this November whether Aadhaar itself violates privacy.

Nevertheless, without alluding directly to Aadhaar, the judges have expressed their anxieties about the dangers of an Orwellian state riding on Big Data. To quote from the judgement again, “The contemporary age has been aptly regarded as ‘an era of ubiquitous dataveillance, or the systematic monitoring of citizen’s communications or actions through the use of information technology’. It is also an age of ‘big data’ or the collection of data sets. These data sets are capable of being searched; they have linkages with other data sets; and are marked by their exhaustive scope and the permanency of collection. The challenge which big data poses to privacy interests emanate [sic] from state and non-state entities.”

To be sure, Aadhaar, touted as the world’s largest biometric ID undertaking, is Big Data. Under the Aadhaar (Targeted Delivery of Financial and other Subsidies, Benefits and Services) Act, 2016, each resident Indian will be branded with a unique 12-digit digital tattoo representing not only regular personal data such as name, address and date of birth, but also, controversially, scans of all the 10 fingers and the iris. All this data is stored in a centralised vault in Manesar, Haryana. As of August 15 this year, the project had issued about 1.171 billion cards at an expense of over Rs 9,000 crore.



It’s quite possible that the Supreme Court may not find Aadhaar to be in breach of the right to privacy. It may try, as some legal minds have suggested, to balance public interest and privacy by instructing the state to enact a robust data protection law. Nevertheless, the government’s almost fanatical defence of Aadhaar, not to mention the fact that it argued against privacy being a fundamental right, suggests an abiding faith in the seductive power of numbers. As anthropologist James Scott documented in his compelling Seeing Like a State, the state has always worshipped data, as it gives more power to the powerful, even if often at the expense of people’s happiness.

Guided by the same instinct, the current craze to quantify almost everything stems from Big Data’s supposed extraordinary power to extract new truths about the world. It is based on the premise that everything in the world can be captured in ciphers of 0 and 1, and that if we could capture enough of it, preferably all of it, we can rummage in it using slick mathematical combs called algorithms and pull out non-obvious insights into practically every problem on earth—how to stop a terrorist, catch a tax cheat, prevent train accidents, predict and tackle extreme weather events.

Such is its seduction that all the world’s political and business elites with privileged access to it—multinationals, IT giants, the G20 group, World Bank and the UN, to name the top few—are paying handsome obeisance to it. Aadhaar is a perfect example. They obviously believe it holds promise (of more power and more profit, to be precise). But it is being sold to us as a magic wand that will add the prefix smart to everything—smart cities, smart devices, smart babies, smart humans, smart nature. As Kenneth Cukier and Viktor Mayer-Schönberger inform us in their almost gushing Big Data: A Revolution That Will Transform How We Live, Work, and Think, “the benefits to society will be myriad, as big data becomes part of the solution to pressing global problems like addressing climate change, eradicating disease, and fostering good governance and economic development.”

On the other hand, in the light of Snowden’s revelations, many critics see in the rise of Big Data the germs of a “new mind control”. For some others like Shoshana Zuboff of the Harvard Business School, Big Data, together with artificial intelligence (AI), represents a massive disruptive engineering of the human soul with ominous and as yet unclear implications for notions of freedom, privacy, justice, moral reasoning and autonomy.

But what really is Big Data and how is it different from older forms of data that we thought were huge, census data, for example? What explains its lure, not to mention its wickedness?

At first glance, what distinguishes Big Data from older forms is the sheer volume of it: the world has never seen such a glut of data, especially digital data. To give you a handle on it, consider the following: 95 per cent of all data created since the dawn of human history was created in the past two years; data doubles in size every two years; by 2030 almost every person in the world will be sporting a smartphone; by 2020, an estimated 50-200 billion smart devices will be “talking” to each other. Most tellingly, Big Data enthusiasts never forget to mention that currently less than one per cent of all data is ever analysed or used.

But volume or size is not the only defining feature of Big Data. People add two more Vs to it: high velocity, meaning data is being created in real-time, and greater variety, meaning it is both structured, such as credit card transactions, and unstructured, such as random browsing on the Internet.

While the world has been collecting and archiving data from yore—in the form of books, films, music, government documents, financial records and scientific data—the ongoing data revolution was triggered by the explosion of the Internet. Again, to grasp what this means, consider this: before you can say Big Data, the world would have logged 7,745 tweets, 800 Instagram posts, 2,750 Skype calls, 62,300 Google searches, 70,660 YouTube videos and 2,621,171 emails.

What also separates the era of Big Data from the past is the enormously enhanced capacity to store data, and forever. This endless data stream is now captured and stored in behemoth data servers (unjustly called clouds) scattered around the world. So as you read this piece online, some digital critter probably sitting in a cloud somewhere in the Arctic is recording the fact that you are curious about Big Data. In fact, the unpalatable yet unavoidable truth is that anyone who conducts her life online—through cellphones, smartphones, Internet and plastic cards—is being tracked 24x365. Every time you go online, you leave traces of your presence in what is called digital exhaust, all of which, no matter how trivial, is stashed away as data.

Sources: Cisco; comScore; MapReduce; Radicati Group; Twitter; YouTube; * 10^15 bytes; ** 10^18 bytes

THE END OF THEORY?

However, the most crucial aspect of Big Data is how we are interpreting it, what new meanings and facts we are sifting from it. In the last century, statisticians evolved mathematical tricks, such as random sampling, which allowed social scientists to glean statistically significant “truths” about a randomly selected part of a large group of things or people and then assume that the “truths” would hold for the whole too. It was practical, cheap and reliable, provided it was carried out carefully and honestly. Indeed, much of what psychologists, sexologists, nutritionists, epidemiologists, doctors, salespersons and election trackers tell you about how the world works is based on this trick.

It defies intuition that what is true of a randomly chosen sample of 1,100 yes-or-no responses would also hold for the whole population, whether a million or a billion, within a 3 per cent margin of error. The theory behind it is that beyond a certain, fairly small, sample size, more or bigger doesn’t necessarily yield something new.
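The arithmetic behind that 3 per cent figure can be sketched in a few lines of Python. This is only an illustration, assuming a simple random sample, a yes-or-no question, a 95 per cent confidence level and the worst-case 50:50 split of answers; the function names are made up for the purpose.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a yes/no share p
    measured on a simple random sample of n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def margin_with_population(n, population, p=0.5, z=1.96):
    """Same margin, with the finite population correction applied
    to show how little the size of the whole group matters."""
    fpc = math.sqrt((population - n) / (population - 1))
    return margin_of_error(n, p, z) * fpc

sample_size = 1100
base = margin_of_error(sample_size)                              # population treated as unlimited
million = margin_with_population(sample_size, 1_000_000)
billion = margin_with_population(sample_size, 1_000_000_000)

print(f"{base:.2%} {million:.2%} {billion:.2%}")
```

Under these assumptions all three figures come out just under 3 per cent: the margin depends on the size of the sample, hardly at all on the size of the population. Quadrupling the sample to 4,400 would only halve the margin, which is why pollsters rarely bother.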

However, if the random sample is not precise or is fudged, or not sufficiently random, things can go wrong, sometimes horribly so if the results reinforce subtle prejudices of caste, gender, race or class, like blacks being inherently less bright or women being bad with numbers. Aren’t we all familiar with the popular lampoon, Lies, Damned Lies, and Statistics? That said, since it is logistically difficult and expensive to collect all data points, researchers, and hence we, have had to make do with sampling’s approximate truths.

Big Data represents a three-way shift in the way we extract truth about the world from data. First, unlike the small data of sampling, Big Data deals with very large data sets, often including almost everything (imagine a world where everyone’s connected via a smartphone, for example), and it gathers them quickly, cheaply and at regular intervals to boot. Second, in Big Data analysis, getting detail is more important than accuracy. So the messier and bigger the data, the more likely it is to yield new insights. And last, which is a consequence of the first two, Big Data does not care about causes; it merely looks for hidden patterns and correlations, which, in Cukier’s words, “may not tell us precisely why something is happening, but they alert us that it is happening. And in many situations this is good enough.”

In 2008, a team of software geeks at Google with zilch training in medicine was able to track the passage of flu in the US without looking at the result of a single medical check-up. They simply put millions of carefully chosen Google searches like “flu symptoms” and “drug stores nearby” through a thresher of algorithms, which predicted with reasonable accuracy the odds of when and where the flu is likely to break out. Most fascinatingly, their prediction beat the government’s by a week. Before long Google Flu Trends (GFT), as the algorithm was labelled, was hailed as the poster child of Big Data.

This was spooky as it flies in the face of how we have tried to make sense of the world—take a general theory about how things work, draw from it a hypothesis about how two particular things might be related, say environment and disease, and gather data to see if it bears out the hypothesis. Google just threw theory out of the window. They simply fiddled around in a large heap of messy data and found a strong correlation between Google searches for flu and its outbreak.

In probably the most-cited spiel on Big Data that proclaimed the end of theory, Chris Anderson, former editor-in-chief of Wired, waxed eloquent thus: “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”

Cukier will not go as far as Anderson but he is truly impressed with how Big Data is challenging common sense. Consider the making of Google Translate. Google engineers put their cyborgs to work on billions of pages of translations of good to middling quality (a fine example of messy data). What came out was a fairly workable, but by no means perfect, digital translator. It is the best in the game and can translate among 60 languages.

Cukier explains that Google Translate works better than others not because it has smarter algorithms, or because computers can crunch data much faster now, but because Google was able to use thousands of times more data as it wasn’t bothered by messiness. In fact, in a paper titled “The Unreasonable Effectiveness of Data”, Google’s AI maven Peter Norvig argued that messiness is the key.

That messiness is probably what makes Big Data work is a mere intuition, not an inference drawn from a theory of how Big Data works. Indeed, nobody knows why it sometimes works, or why at others it doesn’t. No surprise then that GFT failed spectacularly in 2013 when it predicted a strong flu outbreak where there was almost none. As the cautionary tale goes, they mistook correlation for causation. Anyway, stung by the betrayal of their algorithms, they quickly buried the programme.

But, as history is witness, some ideas have a way of persisting against better judgement. While no one’s claiming that Big Data is all bunkum, the GFT story should be a sobering caveat to not turn Big Data into a theology and oversell its promise.

In a much-discussed paper, “Critical Questions for Big Data”, sociologists Danah Boyd and Kate Crawford offer six sobering “provocations” to spark a critical conversation about the tall claims of Big Data. They argue that Big Data does not change the way we make sense of the world (meaning theory is alive and kicking); that it is not necessarily objective and accurate; that bigger is not necessarily better; that it loses meaning if taken out of context; that just because it is available does not make it ethical; and finally that privileged access to data creates Big Data haves and have-nots.

Disturbingly, they argue, Big Data creates a state of “seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions.”
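A toy experiment shows how easily this happens. The sketch below is not how Google Flu Trends or any real system worked; it simply assumes 20 weeks of made-up "outbreak" figures and 2,000 equally made-up "search term" series, all pure random noise and unrelated to one another, and asks how strong the best accidental correlation looks as more candidates are searched.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(42)   # fixed seed so the run is repeatable
WEEKS = 20                # hypothetical number of weekly observations

# A made-up "outbreak" series and 2,000 made-up "search term" series, all noise.
target = [rng.gauss(0, 1) for _ in range(WEEKS)]
correlations = [
    pearson(target, [rng.gauss(0, 1) for _ in range(WEEKS)])
    for _ in range(2000)
]

def strongest(k):
    """Strongest absolute correlation found among the first k candidates."""
    return max(abs(r) for r in correlations[:k])

for k in (20, 200, 2000):
    print(k, round(strongest(k), 2))
```

The more candidate series one rummages through, the more impressive the best match becomes, even though by construction none of them has anything to do with the target.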

In fact, this state of being that privileges the “what” at the expense of the “why” appears to have seeped into the popular psyche, and is probably influencing political and social narratives on the ground. So ordinary folks readily lap up fancy gizmos like the smartphone or download free apps without bothering to wonder what they might be losing in the bargain. Truth is understood as something that works, heals, fixes, makes things easier; no one has the patience, or competence, to look at what’s behind it.

Post-truth politics is a child of this world view. So if false propaganda helps you win elections, so be it. Donald Trump made good use of this weapon when in the 2016 Presidential elections he trumped everyone, politicians, pundits and pollsters alike, with a deadly cocktail of Machiavellian instincts and Big Data.

But in the Indian context no one’s a finer connoisseur of Big Data than our Prime Minister Narendra Modi, who leveraged it in the 2014 elections. A la Scott, Modi knows how the state sees. So he dreams of connecting every Indian through the smartphone so that while he does his Mann Ki Baat with 1.25 billion Bharatwasis, he can also sense and tweet their mood. His flagship project of Jan Dhan, Aadhaar, Mobile (JAM) is geared towards quantifying all Indians. Alongside, he wants to integrate all manner of other data, currently locked in silos, such as data on crime, land use, forests, banking, finance, insurance, agriculture, education and law, with JAM. In fact, even demonetisation, which did not meet its original objectives, is being passed off as yet another attempt to digitise India.

In Big Data parlance, both the state and the corporation are aggregators—the former aggregates power while the latter capital. It suits both to get as many Indians as possible on the smartphone board, so that both can milk Big Data.

So far, the government has little to show for Big Data insights into governance, except for the claim that it has saved about Rs 50,000 crore because of Aadhaar-based direct benefit transfers in the past two-and-a-half years. However, at the same time many researchers have shown how making Aadhaar mandatory has excluded many poor and needy people from getting subsidised rations because of one or the other technical glitch.

While Big Data could potentially yield new insights into pressing problems like air pollution, flooding in cities, sharing of river waters, or managing waste, the government seems more interested in catching crooks and criminals. This year it flagged off two projects—Project Insight, a Rs 10,000 crore worth attempt to catch tax evaders by tapping into data on income tax, bank accounts, and social media, and the Rs 2,000 crore worth Crime and Criminal Tracking Networks and Systems (CCTNS), which seeks to digitise all crime records in the country and use that data to make predictions about crimes, criminals and victims.

Massive digitisation of personal data poses serious problems. A study by the Bengaluru-based Centre for Internet and Society (CIS) of several schemes under the Modi government’s Digital India Project concluded that “the project aims to enhance the delivery of services to the citizens at the cost of exposing their personal information to cyber security threats.” (See “Glaring gaps in privacy”.) The report also raises a red flag on informed consent, as CIS investigations revealed that a large number of citizens were not clear how their personal data were being used. As Amber Sinha of CIS put it, “transparency in government activities is pivotal in all democratic processes.”

The fact that the government is going about amassing huge amounts of personal data about its citizens without first putting into place necessary safeguards that protect their sense of privacy and dignity demonstrates that it doesn’t care. Why, in the privacy case hearing, Attorney General K K Venugopal argued that privacy is an elitist notion and that in India “it is not fair and right to talk about right to privacy for such poor people”.

It is precisely the shadowy nature of data collection, its storage, and, even more insidious, the shadiness of algorithms, the black magicians of Big Data, that gives it an Orwellian tinge.

In the opening scene of the 2002 sci-fi film Minority Report, following a tip-off by a bunch of clairvoyant mutants, a police posse barges into a man’s house and arrests him for the crime of murdering his wife the next day. With the help of the mutants’ crime forecasts, the “pre-crime” unit is able to cleanse the city of crime by jailing anyone who may have been thinking of committing it. However, when the chief of the “pre-crime” branch, played by Tom Cruise, begins to have doubts about the ethics of what he is doing, his department views him as a threat and frames him for murder by tweaking his thought data. Moral of the story: it’s ok to fudge data in the larger interest of society.

In a scary adaptation of art to reality, the police in Los Angeles and Chicago are rounding up suspects on the grounds of thinking of doing bad things. Except that here the mutants in the movie have been replaced by an algorithm called Palantir, which can predict who is likely to commit a crime, when and where.

Of late, as part of CCTNS, Delhi Police too has been toying with something similar: it uses software that mashes up real-time data from its 100 helpline with the satellite map of Delhi, maps it onto its crime data, and then spits out probabilities of crime in different neighbourhoods. But Delhi Police’s psychic robot is no match for the predictive power of probably the most powerful algorithm in the world. Peter Thiel, the billionaire co-founder of PayPal, created Palantir in 2004 along with his PayPal pals so that the US military could track the movements of subversives during the Iraq war. Backed and partly underwritten by the CIA, Palantir had privileged access to the largest data trove in the world. Having outlived its military purpose, at least for now, it is now being used to track all kinds of bad guys: terrorists, tax cheats, petty criminals and illegal aliens. In 2011, Samuel Reading, a former marine, told Bloomberg.com, “It’s the combination of every analytical tool you could ever dream of. You will know every single bad guy in your area.”

As it is in cahoots with the secret arms of the state, it cannot but keep a very low profile. It doesn’t even have an address—rumours have it that it is located in some nondescript Palo Alto street in a vault with walls so thick that nothing, not radio waves, not phone signals, not even Internet can pass through. You can imagine what Palantir means to both its founders and to their masters. In 2015, it was valued at US $20 billion.

Source: 'Privacy Gaps in India's Digital India Project' by the Centre for Internet and Society; *Sensitive personal data or information; **NLRMP: National Land Records Modernisation Programme

Palantir is a frightening reminder of the risks of Big Data predictions. With CCTV profiling of crime-prone neighbourhoods, which tend to be mostly poor, policing by algorithm only ends up reinforcing old prejudices. As Cukier and Mayer-Schönberger caution in their book, penalising someone for an act that has not yet happened “negates the very idea of the presumption of innocence, the principle upon which our legal system, as well as our sense of fairness, is based. And if we hold people responsible for predicted future acts, ones they may never commit, we also deny that humans have a capacity for moral choice.”

The risks of Big Data prediction go beyond criminal justice. As societies start setting greater store by efficient and risk-averse norms, there is a real danger of algorithms replacing human judgement. So, based on psychological profiles cobbled together by algorithms, a company may deny a job to an applicant, or a spouse may file for divorce, or a bank may deny a loan to a customer.

The rise of predictive analytics does not bode well for the future of certain subject-area experts either. Some jobs are already becoming redundant: online papers like Huffington Post now employ algorithms to choose content in consultation with editors; in the creation of Google Translate, linguists played second fiddle to software engineers; and we have all heard about how Amazon chief Jeff Bezos fired his team of book reviewers when he found that algorithms were far better at seducing customers to buy books. Even species like yours truly may have to adapt to the new world as algorithms created by companies like Narrative Science become better at writing reports.

Indeed, there is a real fear that in a world dominated by Big Data and AI, humans may lose their preeminence. Just last week, the Russian President, Vladimir Putin, predicted that countries that invest heavily in AI would dominate the world. Elon Musk, the billionaire technologist, agreed with Putin and added that AI may trigger World War III “if it decides that a preemptive strike is most probable path to victory”.

Who would want to live in such a world? In fact, the sinister success of Palantir should alert us to the ominous portents of the Big Data nexus between the state and the corporation.

Snowden’s whistle blew the cover on how the US National Security Agency (NSA) created a spy dragnet called PRISM that siphoned off personal data not just of Americans but also of people around the world from the databanks of giant IT and Internet companies like Google, Apple, Verizon and AT&T, and stashed it away in large clouds like the Utah Data Center. Snowden told The Guardian that anyone with access to the PRISM database could spy on the intimate lives of anyone in the world. In fact, the just-retired Union home secretary Rajiv Mehrishi told a parliamentary panel last month that, unbeknownst to them, 40 per cent of Indians using smartphones share their data with US intelligence agencies. The corollary to this is that once the corporation-state nexus entraps, either by coercion or by seduction, the remaining 60 per cent into the smartphone stream, we are all sitting ducks for surveillance.

In fact, we are naive to think that the communications networks that connect us to the rest of the world are scattered and diffuse, and hence not so amenable to eavesdropping. The truth is almost all the telecom and Internet traffic physically flows through the US. In 2006, a whistleblower at the American telecom giant AT&T revealed how he had helped NSA set up a device that sucked up huge amounts of digital data—emails, Skype chats and calls, Internet browsing histories. To decode this massive heist, the US government spawned a cottage industry of algorithms that could tease out information on any individual.

Shocking? Time to wise up. As Bruce Schneier, a data security expert who has written extensively on the surveillance society, disillusions us in his Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World, “we’re all open books to both governments and corporations; their ability to peer into our collective personal lives is greater than it has ever been before. The bargain you make, again and again, with various companies is surveillance in exchange for free service.”

Schneier calls it the hegemon. For him, “being stripped of privacy is fundamentally dehumanizing, and it makes no difference whether the surveillance is conducted by an undercover policeman following us around or by a computer algorithm tracking our every move.”

The question is how does one resist it, if at all.

In an essay titled “What Is an Apparatus?”, Italian philosopher Giorgio Agamben argues that “ever since Homo sapiens first appeared, there have been apparatuses, but we could say that today there is not even a single instant in which the life of individuals is not modeled, contaminated, or controlled by some apparatus.” By an apparatus he means “literally anything that has in some way the capacity to capture, orient, determine, intercept, model, control, or secure the gestures, behaviours, opinions, or discourses of living beings.”

For him, the cellphone or smartphone is an apparatus. He makes it clear that he finds it an abomination and wishes he could destroy them all. But then he takes a pause and wonders if by destroying it we might not in the process also destroy a part of us that not only created the apparatus but was also created by it.

This is the paradox that makes us complicit in modern surveillance. In a piece titled “Of Being Numerous”, published in the online journal The New Inquiry, journalist Natasha Lennard argues that we can’t “ignore the fact that it is no mere accident of history that millions of us have chosen to live with and through these devices. These devices require and in turn produce trackable, numerable and, therefore, surveillable subjects.”

That said, it is also true that we had no real choice once we were hooked. Whether we would have voluntarily consented to being tracked, had we been told at the very beginning what was being done with all our personal data, is a moot question. Maybe we wouldn’t have, but that doesn’t help now.

It seems there is no way one can break the corporate-state surveillance nexus except perhaps to make it less rapacious by extracting a few privacy clauses. Colin Koopman, who teaches philosophy at the University of Oregon, USA, believes “every form of power has its vulnerabilities, and the specific weakness of what I call ‘infopower’ is shutting off the data feed that supplies the algorithm.” But then he wonders “who among us would be audacious enough to stop churning out the data that increasingly defines our very selfhood?” However, Mark Andrejevic, media scholar at the University of Queensland, Australia, is not too worried about the privacy question. In his 2013 book Infoglut: How Too Much Information Is Changing the Way We Think and Know, he argues that Big Data is interested not so much in personal histories as in broader trends in a population. And since what is normal is determined by the statistically average person, Big Brother is more curious about an average law-abiding, conforming individual than about an outlier.

In the light of such confusion over whether one is watched individually or as part of an anonymous crowd, it would be interesting to see how the Indian Supreme Court makes sense of Big Data with respect to whether Aadhaar violates the right to privacy. It’s clear that creating a law to protect data will not be enough, as algorithms can find individual needles in an anonymised haystack. The court took the first radical step by declaring privacy a fundamental right. But will it push the envelope further by breaking the state-corporation surveillance nexus?
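How a needle is pulled out of an anonymised haystack can be sketched with a simple linkage exercise. Everything below is hypothetical: the records, the field names and the choice of PIN code, date of birth and sex as the linking attributes are assumptions made purely for illustration, not a description of any real database or of how Aadhaar data is stored.

```python
# Hypothetical "anonymised" records: names removed, other attributes kept.
anonymised_records = [
    {"pin": "110001", "dob": "1980-03-14", "sex": "F", "diagnosis": "diabetes"},
    {"pin": "110001", "dob": "1975-11-02", "sex": "M", "diagnosis": "asthma"},
    {"pin": "560034", "dob": "1990-07-21", "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical public list (say, a membership roll) that carries names.
public_list = [
    {"name": "Person A", "pin": "110001", "dob": "1980-03-14", "sex": "F"},
    {"name": "Person B", "pin": "560034", "dob": "1990-07-21", "sex": "F"},
    {"name": "Person C", "pin": "400050", "dob": "1985-01-30", "sex": "M"},
]

def quasi_identifier(record):
    """Attributes that are harmless one by one but telling in combination."""
    return (record["pin"], record["dob"], record["sex"])

def reidentify(anonymised, public):
    """Attach names to anonymised records whose quasi-identifier
    matches exactly one person in the public list."""
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(quasi_identifier(person), []).append(person["name"])

    matches = {}
    for record in anonymised:
        names = names_by_key.get(quasi_identifier(record), [])
        if len(names) == 1:            # a unique match gives the person away
            matches[names[0]] = record["diagnosis"]
    return matches

print(reidentify(anonymised_records, public_list))
```

In this made-up example, two of the three nameless records are tied back to named people without any identifier ever having been stored alongside the sensitive data.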

Viewed from outside the box, Big Data is essentially an advanced feature of the military-industrial complex in the service of profit and power. It allows the new potentates to extract both profit and power from the datafication of life at a scale and complexity the world has never seen before. For, in what way does it matter to the villager marooned in the Bihar flood? It doesn’t. Or, for that matter, to a fisherwoman eking out her livelihood in the boondocks of Assam. They represent the last frontier against Big Data’s predations and depredations. Is it a surprise then if NaMo, a fervent devotee of Big Data, wants to trap all of them in its catchment?

In The Cloud of Unknowing, a 14th-century mystical work written by an unnamed English monk, the author chats up a young monk about what could be considered a good life. The senior monk advises the apprentice that if he wants to grow as a person, he should unburden his mind of unnecessary baggage. He likens a mind stuffed with thoughts to a cloud that is so beautiful that it is actually devoid of wisdom, a cloud of unknowing.

Moral of the story: Beware of the cloud.

The article was first published in the November 1-15, 2017 issue of Down To Earth under the headline 'The Faustian temptations of big data'.

Down To Earth
www.downtoearth.org.in