In contrast to April’s silly (but serious!) entry, I’ve elected to devote May and June’s entries to pieces that made me throw away my highlighter because the pages became solid blocks of neon yellow. This month’s is a piece I’ve referenced over the last couple of years, and June’s is one I’ll be referencing for years to come.
Hold on to your hats: we’re going to talk about data, inequality, harm, and our societal responsibilities.
Catch up on previous entries in the Influential Articles series.
What Gets Counted Counts, by Catherine D’Ignazio and Lauren Klein
What Gets Counted Counts (archive) by Catherine D’Ignazio, Director of the Data + Feminism Lab at MIT, and Dr. Lauren Klein, Director of the Digital Humanities Lab at Emory University, is a chapter in the amazing Data Feminism book. (I highly recommend picking up a copy; I promise it’ll change your worldview.) In this chapter, we learn how the very act of data collection can reinforce harmful or outdated societal norms, how data collection can also empower and heal, and why we should continually question how we collect and classify data.
The chapter centers the fourth principle in the book:
“Principle #4 of Data Feminism is to Rethink Binaries and Hierarchies. Data feminism requires us to challenge the gender binary, along with other systems of counting and classification that perpetuate oppression.” (link)
In our ongoing quest for nice, clean, structured data, we have devalued insights that can be gained from qualitative data. I get it; qualitative data are messy. They require work. You can’t just throw a standard algorithm at qualitative data and expect understanding.
Data collection, data reporting, and data recycling
But focusing solely on quantitative data raises questions about the systems that collect those data—whether they are forms, registration flows, or surveys. And questions about the systems raise questions about the designers of those systems! Are we inheriting biases two, three, four levels deep in the system?
For example, when the National Institutes of Health (NIH) awards grants, the Principal Investigator(s) (PIs) are required to gather data on all human participants in a specified format (archive). Values entered within the template are checked to make sure they are acceptable to the NIH’s system. This system conflates sex and gender, has a limited selection for race, and even places an upper bound on age. If you study geriatric populations, you may find the idea of equating a 90-year-old participant with one who is 103 laughable. While this system does not preclude gathering more nuanced data, the nuance-stripped data reported through the system are used in subsequent studies and made available on explorable sites such as the NIH National Cancer Institute GDC Data Portal.
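To make the flattening concrete, here’s a minimal sketch in Python. The field names and allowed values are entirely hypothetical (the NIH’s actual template differs); the point is how a rigid validation step erases nuance before the data are ever reported:

```python
from dataclasses import dataclass

# Hypothetical allowed values -- illustrative only, not the NIH's actual lists.
ALLOWED_SEX_GENDER = {"Male", "Female", "Unknown"}   # conflates sex and gender
ALLOWED_RACE = {"White", "Black", "Asian", "Other"}  # limited selection
AGE_CAP = 90                                         # upper bound on age

@dataclass
class EnrollmentRecord:
    sex_gender: str
    race: str
    age: int

def validate(record: EnrollmentRecord) -> EnrollmentRecord:
    """Coerce a participant record into the template's accepted values.

    Anything outside the allowed sets is forced into a catch-all bucket,
    so the nuance is gone before the record leaves the collection system.
    """
    if record.sex_gender not in ALLOWED_SEX_GENDER:
        record.sex_gender = "Unknown"      # nonbinary participants vanish here
    if record.race not in ALLOWED_RACE:
        record.race = "Other"
    record.age = min(record.age, AGE_CAP)  # a 103-year-old becomes "90"
    return record

print(validate(EnrollmentRecord("Nonbinary", "Two or More Races", 103)))
# EnrollmentRecord(sex_gender='Unknown', race='Other', age=90)
```

Notice that nothing in this sketch is malicious; each check looks like routine data hygiene. The harm is baked into the schema itself, which is exactly why D’Ignazio and Klein ask us to question who designed it.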
“What is counted—like being a man or a woman—often becomes the basis for policymaking and resource allocation. By contrast, what is not counted—like being nonbinary—becomes invisible.” (link)
These systems of data collection and data reporting can do harm by dissuading people from participating in studies, surveys, and more, because they cannot provide the appropriate information and do not view themselves as the intended audience. They become invisibly uncounted. And because opting out is so often an invisible act (people don’t tell you why they failed to create an account, didn’t sign up for a study, or left a survey halfway through), we simultaneously make people feel othered and cause further harm by failing to account for them in our analyses.
Power structures, incentives, and systemic inequalities
We have to question the power dynamics at play when collecting and analyzing data and constantly reevaluate our assumptions. Who created the system? What were their motivations and biases? When was the last time it was revised? Rinse, repeat.
“Over the course of the eighteenth century, increasingly racist systems of classification began to emerge, along with pseudosciences like comparative anatomy and physiognomy. These allowed elite white men to provide a purportedly scientific basis for the differential treatment of people of color, women, disabled people, and gay people, among other groups.” (link)
While we hope that studies and surveys are executed with scientific rigor, the unfortunate truth is that they, too, are products of the humans who lead them, and therefore inherit the context of their time, views, and biases.
…Or, naiveté! In industry, surveys are often thrown together with an attitude of “the more data, the better”. But data, even de-identified and anonymized data, can be used against people. Whether it’s through erasure, being classified as an “outlier”, getting disqualified from a job, or being denied healthcare, data have the potential to disrupt lives. This is doubly true for marginalized populations: without good data hygiene, it becomes easy to deanonymize people simply because they are underrepresented within the data.
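To see how thin “anonymization” can be, here’s a toy k-anonymity check in Python, using made-up records. The idea: count how many records share each combination of quasi-identifiers (ZIP code, age band, gender). Anyone who lands in a group of one is effectively re-identifiable, and underrepresented people land there first:

```python
from collections import Counter

# Made-up "de-identified" survey records: no names, just quasi-identifiers.
records = [
    {"zip": "02139", "age_band": "30-39", "gender": "woman"},
    {"zip": "02139", "age_band": "30-39", "gender": "woman"},
    {"zip": "02139", "age_band": "30-39", "gender": "man"},
    {"zip": "02139", "age_band": "30-39", "gender": "man"},
    {"zip": "02139", "age_band": "30-39", "gender": "nonbinary"},  # group of one
]

# Count records per combination of quasi-identifiers.
groups = Counter((r["zip"], r["age_band"], r["gender"]) for r in records)

# k-anonymity: every combination should describe at least k people.
k = 2
for combo, size in groups.items():
    if size < k:
        print(f"Re-identifiable: {combo} matches only {size} record(s)")
# Re-identifiable: ('02139', '30-39', 'nonbinary') matches only 1 record(s)
```

No names were collected, yet the sole nonbinary respondent in that ZIP code is uniquely identified by the “anonymous” attributes alone. The very people a careless survey counts poorly are the ones it exposes most.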
Storage is cheap but people are priceless
I don’t usually end these entries with a takeaway. This month and next, I’ll make an exception.
As a society, we are careless with data. This has gotten worse with the availability of cheap storage, and even worse with increased access to powerful data processing tools. We need to take a step back. When people voluntarily give us data, whether it’s during an event registration or as part of a research study, they are giving us their trust.
Let’s be worthy of it.