14.3. Ethics and digital data

As we’ve seen, conducting ethical research can be very challenging. It requires careful thinking about, for example, who is helped or harmed by a study, how the benefits and harms are distributed, and what the second-, third-, or even later-order effects of research might be, both over time and across populations. It also requires a willingness – an eagerness, even – to keep updating our ideas about what counts as “ethical” as we learn more about the consequences of our research, including whom we are including, whom we are leaving out, and how the results are being implemented.

All of this becomes even more difficult and uncertain as data science becomes a more prominent part of our lives. The two big areas where new ethical problems are being discovered and discussed (and, unfortunately, also often ignored) are the data itself and the algorithms that churn through it. In this section, we focus on the data, and on four problems in particular that pose especially thorny ethical challenges:

  1. Trace data and informed consent

  2. Data privacy and ownership

  3. Unintended secondary uses

  4. Data considered as objective

2. Data privacy and ownership

A closely related problem – and one whose lack of resolution is likely complicating our ability to settle questions about informed consent – is that there is no widespread agreement on who owns, or gets to see, the data you generate. For example, if you post a photo on Instagram, who owns the photo: you, or Instagram (or Meta)? While more laws and terms of service are emerging to clarify these matters, progress is largely idiosyncratic and depends on the will of any particular company. And if you’ve ever made the mistake of, say, donating to a non-profit organization (never mind a political campaign) one time long ago, only to spend the next three years drowning in (physical!) mail from dozens of other organizations you’ve never heard of, you know firsthand that your data (in this case, your mailing address) is a hot commodity that someone, somewhere, sold to someone else. In this particular case, the primary harm might just be a nuisance, but having one’s personal information out there can be downright dangerous, and it raises a question of justice: why should some company or group make money by selling your data?

One of the challenging aspects of wrestling with ethical questions in the digital age is that there are simply so many transactions and so much data, much of it of little consequence, that it can be very difficult to sift through which cases are worthy of further attention. For example, while I’m not particularly put out that someone made some money off of my address (though I don’t love it), if the problem stays where it is, then it’s nothing too horrible.

But there are other cases where a similar data ownership question is highly consequential, for example the case of Henrietta Lacks, who died of cancer in 1951 but whose cells have properties that make them extremely useful for scientists to study; these “HeLa” cells have been used in research for decades since. Ms. Lacks was not paid for her cells (i.e., her data), nor was her family compensated in any way, even though these cells have been the basis for a great deal of progress in cancer research. If we also consider that Ms. Lacks was a low-income Black woman, the case takes on additional layers of injustice in terms of who benefits from this research and who bears its costs.

3. Unintended secondary uses

Another challenge, related both to data privacy and ownership and to the fact that we are all generating data all the time, is that even if we could in principle resolve those issues and make all data private (which itself runs counter to other research norms about data transparency), it is very difficult to guard against data that is not inherently sensitive on its own falling into the wrong hands. Much data that seems innocuous – perhaps it’s anonymous, or includes variables that don’t seem to be all that sensitive – can actually become quite problematic. For example, two anonymous datasets that share variables in common can be combined and thus effectively de-anonymized. Or, while someone’s viewing history on Netflix might not seem like something we should worry about, it turns out that (often thanks to algorithms) we can “learn”, or simply suspect, things about people – such as their sexual orientation, political views, and so on – based on the movies they watch. Not only is this a violation of someone’s right to privacy about aspects of their own identity, but in some contexts it can be downright dangerous, for example if a Netflix viewer lives under a regime where homosexuality is considered a crime.
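To make the re-identification risk concrete, here is a minimal sketch in Python with pandas, using entirely made-up data. The dataset names, columns, and values are hypothetical; the point is only that a handful of shared “quasi-identifier” variables (ZIP code, birth year, sex) can be enough to link a supposedly anonymous dataset back to named individuals.

```python
import pandas as pd

# A "de-identified" health dataset: names removed, but quasi-identifiers remain.
health = pd.DataFrame({
    "zip_code":   ["02139", "02139", "60615"],
    "birth_year": [1985,     1992,    1985],
    "sex":        ["F",      "M",     "F"],
    "diagnosis":  ["asthma", "flu",   "diabetes"],
})

# A separate, public dataset (say, a voter roll) that includes names
# alongside the same quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["A. Smith", "B. Jones"],
    "zip_code":   ["02139",    "60615"],
    "birth_year": [1985,       1985],
    "sex":        ["F",        "F"],
})

# Joining on the shared variables re-attaches names to the "anonymous" records:
#   A. Smith (02139, 1985, F) -> asthma
#   B. Jones (60615, 1985, F) -> diabetes
reidentified = voters.merge(health, on=["zip_code", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```

Neither table looks especially sensitive on its own; the problem only appears when they are combined, which is why “the data are anonymous” is not, by itself, a guarantee of privacy.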

4. Data considered as objective

A fourth problem, one we have already covered in this book but that is worth mentioning in this context as well, is that simply calling a piece of information “data”, or assigning a number to something, can lead people to mistakenly treat it as “true”. In some cases it’s pretty close: you either clicked on a website or you didn’t (though the decision to track that click is itself potentially biased). In other cases, such as manually rating an employee at your company on some numeric performance metric – say, 1-5, where 5 is “best” – your “data” is more than likely a codified list of people’s subjective assessments of others. With a large dataset we might hope that some of that subjectivity is minimized, but if we consider that, for example, much of corporate leadership in the US is white and male, it is difficult to believe that deep-seated biases and assumptions will simply “cancel out”; it is much more likely that we will have systematic errors in our data based on leaders’ own subjective ideas about what “top talent” looks like. This doesn’t (necessarily) mean we shouldn’t collect data on performance (though it does motivate finding better measures), but it does mean that when we hear the results of “the data”, we must remember that in this case the data is a set of numbers reflecting opinions, not truth. The danger, then, lies in forgetting where the data comes from.
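To see why systematic bias does not simply “cancel out” the way random noise does, here is a small simulation sketch (hypothetical numbers, using NumPy): every employee is drawn from the same true-performance distribution, but raters add both random error and a constant bump for an arbitrarily favored group.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100_000  # number of employees rated (hypothetical)

# Everyone is drawn from the same "true performance" distribution...
true_skill = rng.normal(loc=3.0, scale=0.5, size=n)

# ...and group membership has nothing to do with that performance.
favored = rng.random(n) < 0.5

# Raters add random error, which averages out as n grows...
noise = rng.normal(loc=0.0, scale=1.0, size=n)

# ...plus a systematic bump for the favored group, which does not.
bias = np.where(favored, 0.4, 0.0)

rating = true_skill + noise + bias

print(f"favored group mean rating:   {rating[favored].mean():.2f}")   # roughly 3.4
print(f"unfavored group mean rating: {rating[~favored].mean():.2f}")  # roughly 3.0
```

Collecting more ratings shrinks the influence of the random noise on the group averages, but the systematic 0.4-point gap remains untouched; a bigger dataset just makes us more confident in a skewed number.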