Best practices for recording race/ethnicity

What are some examples of race/ethnicity questionnaire items that honor people’s right to identify and describe themselves while also complying with US OMB guidelines and leading to manageable data?

One example I have seen is used in the All of Us research project; see “The Basics.” I would love to see good examples that other people have found.


I am still learning how to embed tweets, but I wanted to share this one from Jonathan Jackson, PhD. When he says “this question,” he’s referring to Dr. Consuelo Wilkins’s question, “Scientists: how would you measure a drop of Black blood?”

Indeed. Fellow scientists, if you can't answer this question, why are you continuing to measure race in your studies as a nominal-scale variable? Or at all?

Why not measure racism instead? Or individual systemic deprivation? Cleaner and more accurate.

— Jonathan Jackson, PhD (@egaly) June 11, 2020

On a related issue, I have lots of teaching examples where race as a variable is oversimplified. What is the best language that I can add as a disclaimer when I can’t collect better data?


Perhaps because racism or individual systemic deprivation is harder to measure. Can measurement as a nominal variable not lead to occasional useful insights?

Another related Twitter thread:


I’ve been thinking about concrete actions epidemiologists can take to dismantle racism in our science. How we define & analyze race can be racist & paternalistic, stubbornly valuing (what we perceive as) “objectivity” over accuracy. #epitwitter 1/n

— Amy Baugher (@AmyBaugher) June 11, 2020

Frank, I think that how to handle teaching examples is such an important question, and I’m still thinking through all of it. It seems to me that with teaching examples, there’s a whole additional layer added onto the medical and social problems that arise from inappropriate representations of race in research. That is, there’s also the power differential between teacher and students, and the personal message that the teacher might be inadvertently sending to the students by presenting such an example without comment. For teaching examples that treat race as a biological construct, the message might be, “People like you are fundamentally biologically different from people like me.” For examples that present race as a set of mutually exclusive categories, it might be, “People like you don’t really exist (or are not important enough to study or to mention).” So I definitely think it’s important to directly address the flaws in our current examples while we work to find better ones. But right now I don’t have a suggestion about specific language. I’m hoping other people will respond with ideas, and also that we can come up with a) a systematic way of evaluating current textbook examples and b) a repository of better ones.


Another thing to consider in presenting older examples is whether the example presents “race” as an explanatory factor, when the true underlying explanatory factor is racism. Dr. Camara P. Jones discusses this with regard to disparities in COVID-19 deaths (but not specifically tied to teaching examples) here: Why Racism, Not Race, Is a Risk Factor for Dying of COVID-19. It seems to me that in a teaching setting, making the distinction between racism and race is especially important, even (or maybe even more so) when race is not the focus of the example. Although she also is not writing specifically about the classroom setting, Dr. Mary Bassett makes a strong argument for explicitly naming racism in Public Health Meets the Problem of the Color Line:

…we must name racism in our research proposals, in our theories, in our oral presentations and conference tracks, and even in our hypotheses. The essence of naming racism is this: how we frame a problem is inextricable from how we solve it.


Great points. In other settings it is not racism that is being studied, but biological differences. Still, calling it race is misleading; a broader genetic metric should be substituted instead. I just added this language in a case study:

**Note**: Categorization of race is arbitrary and represents a gross oversimplification of genetic differences in physiology.

I hope that’s OK. This is in the context of studying differences in peripheral blood flow.

A purely statistical caveat about categorization of a variable might not be as helpful as placing the statistical work into its fuller scientific context. The data underlying this teaching example came from a 1995 NEJM article [1] which does a fairly good job of establishing such context in its Introduction. Note the multiple sources and nested levels of variation discussed, which seem ripe for translation into statisticians’ language [2,3].

Interestingly, the pattern of sound statistical thinking in [1] is exhibited again 6 years later [4] when several of the same authors argue “Study of the relationship between specific phenotypes and genotypes, both within and across ethnic groups, is more likely to advance our understanding of the regulation of blood pressure than studies focused on race and blood pressure.”

  1. Lang CC, Stein CM, Brown RM, et al. Attenuation of isoproterenol-mediated vasodilatation in blacks. NEJM. 1995;333(3):155-160. doi:10.1056/NEJM199507203330304

  2. Gelman A. Analysis of variance—why it is more important than ever. The Annals of Statistics. 2005;33(1):1-53. doi:10.1214/009053604000001048

  3. Senn S. Mastering variation: variance components and personalised medicine. Stat Med. 2016;35(7):966-977. doi:10.1002/sim.6739

  4. Stein CM, Lang CC, Xie HG, Wood AJ. Hypertension in black people: study of specific genotypes and phenotypes will provide a greater understanding of interindividual and interethnic variability in blood pressure regulation than studies based on race. Pharmacogenetics. 2001;11(2):95-110. doi:10.1097/00008571-200103000-00001


Excellent. That’s actually the dataset I use in the case study. I should have looked more at the original paper.


@davidcnorrismd, thank you for sharing those articles. @f2harrell, it looks like both the Lang et al. paper and the Stein et al. paper cite environmental factors as a possible contributor to racial disparities in hypertension. So in your caveat above, I think it would make sense to add “and/or environmental factors.” And since many of the environmental stresses experienced by black Americans but not white Americans are due to racism, this could be a situation where racism is being studied, whether intentionally or not; so I think it would be helpful to expand that to something like “and/or environmental factors stemming from racism.” I should clarify here that I’m using “racism” in the sense of systemic/structural racism, as in this definition from CP Jones, from the interview cited above:

Racism is a system of structuring opportunity and assigning value based on the social interpretation of how one looks (which is what we call “race”) that unfairly disadvantages some individuals and communities, unfairly advantages other individuals and communities, and saps the strength of the whole society through the waste of human resources.


Jonathan Jackson @egaly is exactly correct. As an analyst, I ask clients why they are collecting this data. Race is a poor biologic proxy and a social/political construct. If you have a well-formulated data question, race won’t answer it in a way that reveals an actionable solution. List the specific measure you seek.


I also, when coerced, defer to the categories as defined in the Census. This allows better narratives and comparisons to historical data.


Consuelo Wilkins and colleagues make a similar argument in a recent piece in JAMA Neurology:

Unfortunately, analyzing findings by race/ethnicity without appropriate contextual data could lead to inaccurate, misleading, or stigmatizing conclusions that may detract from the overall goals of diversity in research: to enhance the accuracy, utility, and generalizability of scientific evidence.

They use Alzheimer disease research as an example, and recommend that in this context, in addition to race/ethnicity, researchers collect detailed information related to primary language, education, annual household income, perceived social class, neighborhood characteristics, and perceived discrimination. The article gives specifics about the recommended level of detail for each of these. For example, under “education” they recommend asking about “Total years of education; school characteristics (public vs private, rural vs urban vs suburban); parents’ total years of education.”


The nice thing about the more detailed race/ethnicity questionnaire used by All of Us is that it allows people more flexibility in describing themselves, but the results can still be collapsed into the current OMB standard that the Census Bureau is required to use.
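To make the "collapsing" idea concrete, here is a minimal Python sketch of how multi-select responses might be rolled up into the five race categories and the separate Hispanic/Latino ethnicity item of the 1997 OMB standard. The detailed option labels, the `OMB_1997_RACE` mapping, and the `collapse_to_omb` function are my own illustrative assumptions, not the actual All of Us coding scheme, so please treat this as a starting point rather than a definitive recode.

```python
# Illustrative only: the option labels and this mapping are assumptions,
# not the actual All of Us questionnaire coding.
OMB_1997_RACE = {
    "American Indian or Alaska Native": "American Indian or Alaska Native",
    "Asian": "Asian",
    "Black, African American, or African": "Black or African American",
    # The 1997 OMB definition of "White" includes Middle Eastern and
    # North African origins.
    "Middle Eastern or North African": "White",
    "Native Hawaiian or other Pacific Islander": "Native Hawaiian or Other Pacific Islander",
    "White": "White",
}
HISPANIC_OPTION = "Hispanic, Latino, or Spanish"  # hypothetical label


def collapse_to_omb(selected):
    """Collapse one respondent's selections to (ethnicity, list of OMB races).

    Options not in the mapping (for example, free-text answers) are ignored
    here, so the original detailed responses should be stored alongside.
    """
    selected = set(selected)
    if not selected:
        # No answer given: do not assume "Not Hispanic or Latino".
        return "Unknown", []
    if HISPANIC_OPTION in selected:
        ethnicity = "Hispanic or Latino"
    else:
        ethnicity = "Not Hispanic or Latino"
    # A set removes duplicates when two detailed options map to one category.
    races = sorted({OMB_1997_RACE[s] for s in selected if s in OMB_1997_RACE})
    return ethnicity, races
```

The point of keeping the detailed selections and deriving the collapsed fields from them is that the recode can be redone if the reporting standard changes, while people's own descriptions are never overwritten.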
