Colorectal cancer is among the main causes of cancer-related deaths in the U.S., and preventive screenings are a critical tool in catching the disease in its early stages. Clinicians often use risk prediction tools to identify those at greater risk of the disease. Some of these algorithms include race as a variable, a practice that has been deeply debated over the past few years because it can worsen health disparities.
Now, a new study suggests that a variable for race can — under certain conditions — help reduce health disparities. Anna Zink, a researcher who studies machine learning and health at the University of Chicago, and her colleagues focused on a well-known aspect of building algorithms that is often left out of discussions of race: underlying data quality.
A person’s age, sex, health history, and family history of cancer are all key in determining their cancer risk. But family history data are more likely to be incomplete or uncertain among Black populations because of historical disparities in access to health care. Zink and her colleagues tested the implications of this racial difference for a calculator they built to predict colorectal cancer risk. Including a race variable, they found, could compensate for the disparity in the data.
What’s in the new study
Zink and her colleagues used data from 77,836 adults in the Southern Community Cohort Study who were cancer-free at the start of the study. About two-thirds of the participants reported being Black, and the rest reported their race as white. Black participants were more likely to have an unknown family history of cancer and less likely to report a positive family history. But over the course of the study, they also had higher cancer rates, suggesting that family history data were less complete for Black people in the study.
The researchers used about half the data to build their model, then tested the equations they developed on the other half of the data. They used the NIH Colorectal Cancer Risk Calculator as a basis to build two models: one that included race as a variable and another that did not.
When they tested the two, the race-adjusted equation was a better predictor of cancer risk among Black participants than the race-free equation. “You’re not able to predict colorectal cancer outcomes for the participants as well when you exclude the race variable,” Zink said.
In other words, the race-free equation flagged fewer Black participants as being at high risk. The calculators developed in the study were only an experiment. But “if you assume that one of these risk calculators is used to decide who to recommend for screening, there would be fewer Black participants recommended for screening,” Zink said.
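For readers who want to see the mechanism, the split-sample comparison can be sketched with a toy example. Everything here is an assumption made for illustration: the cohort is synthetic (not Southern Community Cohort Study data), the groups "A" and "B" are hypothetical, the 4% screening threshold is arbitrary, and a simple observed-rate-per-category estimate stands in for the regression equations the researchers actually built. Both groups are given the same underlying risk, but group B reports only half of its true family histories.

```python
from collections import defaultdict

# Synthetic cohort, NOT study data:
# (group, age_band, reported_family_history, n_people, n_cancers).
# Group "B" reports only half of its true family histories, so its
# "no reported history" rows hide some higher-risk people.
CELLS = [
    ("A", "younger", 1, 400, 20), ("A", "younger", 0, 1600, 16),
    ("A", "older",   1, 400, 60), ("A", "older",   0, 1600, 48),
    ("B", "younger", 1, 200, 10), ("B", "younger", 0, 1800, 26),
    ("B", "older",   1, 200, 30), ("B", "older",   0, 1800, 78),
]

def build_records(cells):
    """Expand the summary table into one record per person."""
    records = []
    for group, age, fh, n, cases in cells:
        records += [(group, age, fh, 1)] * cases
        records += [(group, age, fh, 0)] * (n - cases)
    return records

def fit(records, use_group):
    """Estimate risk for each predictor combination as its observed cancer rate."""
    totals, cases = defaultdict(int), defaultdict(int)
    for group, age, fh, cancer in records:
        key = (group if use_group else None, age, fh)
        totals[key] += 1
        cases[key] += cancer
    return {key: cases[key] / totals[key] for key in totals}

def evaluate(model, use_group, records, group, threshold):
    """Return (people flagged, cancers among flagged, all cancers) for one group."""
    flagged = caught = all_cases = 0
    for g, age, fh, cancer in records:
        if g != group:
            continue
        all_cases += cancer
        risk = model[(g if use_group else None, age, fh)]
        if risk >= threshold:
            flagged += 1
            caught += cancer
    return flagged, caught, all_cases

records = build_records(CELLS)
# Build on one half, test on the other (alternating records keeps halves balanced).
train, test = records[0::2], records[1::2]

THRESHOLD = 0.04  # hypothetical cutoff for "recommend screening"
race_free = fit(train, use_group=False)
race_adjusted = fit(train, use_group=True)

free_result = evaluate(race_free, False, test, "B", THRESHOLD)
adjusted_result = evaluate(race_adjusted, True, test, "B", THRESHOLD)
print("race-free     (flagged, cancers caught, all cancers):", free_result)
print("race-adjusted (flagged, cancers caught, all cancers):", adjusted_result)
```

In this toy setup, the race-free version flags fewer group B participants and misses more of their cancers, because pooling the two groups dilutes the extra risk hidden among group B members with no reported family history. It illustrates the logic only and says nothing about the size of the effect in the actual study.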
The use of race in a clinical algorithm perpetuates incorrect ideas conflating race and biology and can worsen racial disparities. But some clinicians argue that race should not be removed from risk prediction tools without evaluating the implications. In this instance, including race improved the algorithm’s ability to identify high-risk Black participants so their cancer could be caught and treated early.
“The discussion around whether to include or exclude race is really important,” Zink said, and should include considerations of data quality across different variables included in the equation, not just the race variable alone. “This is one part of that conversation that should be happening.”
Key takeaways
- In coverage, highlight that the use of race in an algorithm does not imply biological differences between races, but it may, in some cases, help to reduce inequities that are a result of historical or ongoing racism.
- The current study focused on family history data, but similar racial disparities could exist for other kinds of medical information. “While we focus on family history, because it’s something that’s been well studied, we don’t think this is an isolated issue,” Zink said. Other data types skewed by historical racism could offer story fodder.
- Considering how an algorithm might be used for disease screening or to allocate treatment — and how it might impact disparities in access — can inform the decision of whether or not to include race. “It’s helpful to think through the context in which the algorithm is being used,” Zink said.
- Ask whether a race variable in an algorithm aims to compensate for specific data quality issues that are a result of inequities in health care and research. “If you have a patient who is certain they have a family history, or you have a dataset where you know the family history data is high quality, then race adjustment might not be needed,” Zink said. “But if we’re working with data that we know has high rates of missing this, it might be helpful.”





