Mea culpa. I’m late on June’s influential article. Life happens and June was a year of a month.

However, like May’s entry, this is a big one. And similarly, I’ll be focusing on one paper that addresses data, methodologies, and systemic issues that appear in both.

Catch up on previous entries in the Influential Articles series.

Name-based demographic inference and the unequal distribution of misrecognition, by Jeffrey W. Lockhart, Molly M. King, and Christin Munsch

This paper is hot off the metaphorical press, having only just been published in early April of this year (2023). A cross-university collaboration between three sociologists, Name-based demographic inference and the unequal distribution of misrecognition (open access) by Jeffrey W. Lockhart, Molly M. King, and Christin Munsch examines the practice of using a person’s name to draw conclusions about their gender, race, class, and more. Not only do we get a disturbing survey of tools used to make these inferences, we also get a glimpse of how spectacularly they fail.

In the paper, the authors use listed authors of articles in sociology, economics, and communication journals in Web of Science between 2015 and 2020 as their data set.

A brief aside

You might be wondering why we keep having these conversations. Perhaps you remember this study on gender and skin-type bias (archive) or this study on “predicting” sexual orientation (archive) or this article on artificial intelligence and gender (archive). Maybe you ask yourself why people keep building these tools when the harm done can be so severe, like underdiagnosing diseases (archive).

I have two thoughts on that:

  1. Humans really like categories (charitable opinion)
  2. There is a lot of money to be made in categorizing data, especially data about people, and using that categorized data to make even more money (less charitable opinion)

Whether we take the charitable or less charitable perspective, we do keep having these conversations, and likely will continue to have these conversations for the foreseeable future.

Unacceptable error rates

Similar to how we addressed data recycling last time, tools that try to infer demographics from names have the potential to recycle bias, perpetuating those biases in whatever or whoever uses them. The demographics that name inference tools cover include many categories that qualify as protected classes in the United States, such as those mentioned above as well as ethnicity, religion, and age. Miscategorizations that stem from the use of these tools can negatively affect already marginalized populations.

While on the surface, these tools claim to have relatively low error rates (if they publicize their error rates at all), it’s important to remember that those claims are overall error rates. For some subpopulations, the actual error rate can be much higher. One of the most popular gender prediction tools, genderize.io, will only predict two genders (male or female) and therefore misgenders nonbinary people 100% of the time. genderize.io is used by researchers, mainstream news sites, and industry, and reinforces the gender binary everywhere it is used.
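To make that limitation concrete, here is a minimal sketch (in Python, using the requests library) of what a single genderize.io lookup looks like. The response fields shown reflect the service’s public API at the time of writing and may change, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch of a genderize.io lookup. Response fields (name, gender,
# probability, count) reflect the service's public API at the time of
# writing; treat them as illustrative.
import requests

def infer_gender(first_name: str) -> dict:
    resp = requests.get("https://api.genderize.io/", params={"name": first_name})
    resp.raise_for_status()
    return resp.json()

result = infer_gender("Sasha")
# The "gender" field is only ever "male", "female", or None (unknown);
# the service has no way to return a nonbinary identity.
print(result.get("gender"), result.get("probability"))
```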

“The algorithm was wrong 3.5 times more often for women than men, and some subgroups like Chinese women have error rates over 43%. For scientists, these disparities will bias results and inferences. For individuals, misgendering and misclassification of race/ethnicity can produce substantial harms, the ethical implications of which are heightened by the unequal distribution of harm across groups.” (link)

Tools that aim to infer race from names have wildly varying results depending on the population, which makes (unfortunate) sense given systemic discrimination against culturally distinctive names. People in the majority do not often consider the amount of privilege that factors into prospective parents’ naming decisions, but as inferential tools become more integral to automated decision-making, we need to be wary of the ways that using them can recycle and perpetuate biases.

“What name-based demographic imputation tools measure, then, is not the ‘ground truth’ of a person’s or name’s gender or race (which does not exist) but rather the cultural ‘consensus estimates of how each name is gendered’ or racialized.” (link)

Returning to the subject of error rates, perhaps these tools should publish not just the aggregate error rates, but also the error rates across and within subgroups. I know that if I were evaluating such a tool, I would want to know exactly where its weaknesses lie.
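For anyone sitting on a labeled evaluation set, that breakdown is cheap to produce. Here is a hedged sketch in Python with pandas; the column names (self_reported_gender, predicted_gender, race_ethnicity) are placeholders I made up, not the paper’s replication data.

```python
# Sketch: error rates overall, per subgroup, and at intersections, assuming
# an evaluation DataFrame with self-reported labels and tool predictions.
# Column names are hypothetical placeholders, not the paper's data.
import pandas as pd

def error_report(df: pd.DataFrame) -> dict:
    errors = df["predicted_gender"] != df["self_reported_gender"]
    return {
        "overall": errors.mean(),
        "by_race_ethnicity": errors.groupby(df["race_ethnicity"]).mean(),
        "by_intersection": errors.groupby(
            [df["race_ethnicity"], df["self_reported_gender"]]
        ).mean(),
    }

# df = pd.read_csv("evaluation_set.csv")
# print(error_report(df))
```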

Fruit of the poisonous tree

There is the question of how the training data for such tools were collected, by whom, and for what purpose. Can we trust data that were originally gathered by someone with ulterior motives? This is not a theoretical question: even the categories and labels themselves can reflect systemic biases and inequalities (which we also read about last month in the Data Feminism chapter on “What Gets Counted Counts”).

When discussing categorizations of race, ethnicity, gender, disability, and more, it is critical to keep in mind historical context. Motivations for collecting this information have deep roots in eugenics.

“Such uses directly extend the long history of scientific and administrative actors exerting control over populations through gender classification, which is intimately bound up with colonial and eugenic projects.” (link)

“Moreover, colonial and eugenic projects of controlling populations by imposing categorizations on them for scientific or administrative ends live on in automated race/ethnicity imputation systems.” (link)

In cases where collected data were hand-labeled or categorized after collection, we have to consider who was doing the labeling. Unless we can identify a, shall we say, data chain of custody, using tools built on such data introduces unknown unknowns into our analyses. Bias can creep into training data at every point in the process, making every analysis, model, and application fruit of the poisonous data tree.

The extent of failure

Lockhart, King, and Munsch look at two primary failure cases: misgendering and misrecognition of race/ethnicity.

Based on what we know about data and bias recycling, it should not be a surprise that inequalities in the performance of inference tools compound when looking at intersections of demographics. I highly recommend reviewing the charts in the paper, which are extremely detailed.

Misgendering happens far more frequently for LGBTQIA+ people, disabled people, and Asian people. Recall that nonbinary people are misgendered 100% of the time, which the paper sees as having spillover effects on other demographics as well. Spillover also happens when looking at the intersection of gender and race/ethnicity, with Vietnamese and Chinese women being misgendered far more often than others (88% and 76%, respectively).

Misrecognition of race/ethnicity occurs at a much higher rate than the overall error rate for Middle Eastern and North African (MENA), Black, and Filipino names. The highest rate of misrecognition is for people who indicated “Other Race/Ethnicity” (close to 80%!).

This makes sense to me for multiple reasons. First, if the race/ethnicity that describes you isn’t an available option, then you get combined with every other race/ethnicity that isn’t specified. The “other” category becomes somewhat meaningless in its lack of granularity. Remember, what gets counted, counts.

Second, the underlying data set for predictrace’s surname predictions is the United States census data. While the census does allow for respondents to specify multiple races (see this sample survey from the 2020 census), predictrace collapses results for multiple races into one output column, 2races.
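To illustrate why that collapse is lossy, here is a toy sketch; the data frame and column names are invented for this example and do not reflect predictrace’s or the Census Bureau’s actual schemas.

```python
# Toy illustration of collapsing multi-race, census-style responses into a
# single "2races" bucket. Column names are invented for this example and
# do not reflect predictrace's or the Census Bureau's actual schemas.
import pandas as pd

responses = pd.DataFrame({
    "surname": ["garcia", "nguyen", "washington", "lee"],
    "races_selected": [["White"], ["Asian"], ["Black", "White"], ["Asian", "White"]],
})

# Anyone who selected more than one race is folded into a single category,
# erasing which races were actually selected.
responses["race_output"] = responses["races_selected"].apply(
    lambda races: races[0] if len(races) == 1 else "2races"
)
print(responses[["surname", "race_output"]])
```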

Systemic issues encoded in the data

Demographic inference tools are generally based on data that have already been collected, and therefore inherit all the problems bundled into those data. Those problems extend beyond categorization and representation alone, though.

“English-language publications Romanize other languages by converting writing, including personal names, to Latin characters. English scientific databases like Web of Science and computational researchers often go further, standardizing writing to a narrow subset of Latin characters with few or no diacritics, such as ASCII, for the sake of computational processing. For some languages, especially tonal languages, this removes linguistic information that often carries demographic associations.” (link)

This sort of modification at time of collection or processing usually is not documented anywhere, and therefore neither are its implications. While most tool makers acknowledge that the data sets they work with are imperfect, they often do not know the extent of that imperfection.

Additionally, we have to consider the sociocultural context of the data.

“Due to the long history of slavery, there is considerable overlap between Black and White names in the US.” (link)

Forced name changes, whether due to slavery, prejudice, or other reasons, are reflected in the source data and contribute to disparities in the error rates of inference tools. People affected by these forced name changes experience not only the initial trauma of the name change itself, but also ongoing systemic issues encoded in the data and perpetuated by the tools and processes that use those data.

“The high and highly heterogeneous error rates we demonstrated should give the many research, government and corporate users of name-based demographic inference pause.” (link)

Guiding principles for name-based demographic inference

The paper proposes five principles for anyone considering name-based demographic inference, which I’ll briefly summarize here:

  1. Don’t
  2. Apply the inference tools to questions that examine perception, not self-identification (external ascription)
  3. Create or use models that are trained specifically on the population in question
  4. Use inference tools only on subgroups where they have low error rates (high accuracy)
  5. Look at the aggregate estimates, not individual ones (see the sketch after this list)
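To unpack that last principle, here is a tiny sketch of the difference between an aggregate estimate and individual labels; the names and probabilities are made up for illustration.

```python
# Sketch of principle 5: use probabilistic predictions to estimate an
# aggregate proportion rather than assigning a label to any individual.
# The names and probabilities below are made up for illustration.
predictions = [
    {"name": "author_1", "p_woman": 0.93},
    {"name": "author_2", "p_woman": 0.08},
    {"name": "author_3", "p_woman": 0.55},  # low confidence: a hard label here is a coin flip
    {"name": "author_4", "p_woman": 0.71},
]

# Aggregate estimate: expected share of women in the sample.
expected_share = sum(p["p_woman"] for p in predictions) / len(predictions)
print(f"Estimated share of women authors: {expected_share:.2f}")

# What we should NOT do: treat any single thresholded prediction
# (e.g., p_woman > 0.5) as that person's gender.
```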

Given the potential for harm from misuse and misunderstanding, I’m inclined towards “don’t”, at least for use in industry.

CPU cycles are cheap but people are priceless

The current machine learning fad has everyone wondering what else we can do with math to (again, charitably) reduce our cognitive load. To outsource expertise to systems that presumably know better than we do. To lean on cold, hard logic and statistical computation to justify decisions as impartial and unbiased.

We have access to cheap storage for immense amounts of data and cheap processing power to manipulate that data. The temptation to use that access to glean some insight from it is compelling, I’ll admit! That said, math is not magic. It relies on what we give it.

And…the data we have to give it aren’t unbiased.

The systems built upon them are as fallible as humans are, as humans were, and as humans will continue to be if we keep using them uncritically. It’s not that name-based demographic inference and similar applications are bad, per se. It’s that a lack of understanding of and accounting for the historic context, data provenance, and methodologies behind them leads to erroneous conclusions at best and harm (systemic and individual) at worst.

We have a responsibility to avoid reinforcing societal failings in the tools we create and use. It may be hard work, involve a lot of difficult questions, and require introspection. It may lead to abandoning approaches altogether.

Society is worth it.