AI-powered analysis uncovers hidden COVID-19 death tolls across the U.S., exposing deep inequities in how pandemic deaths are recorded.

Research: Apply machine learning to identify unconfirmed COVID-19 deaths recorded as other causes of death in the United States. Image credit: Design_Cells / Shutterstock
In a recent study published in the journal Science Advances, researchers developed a new machine learning (ML) model to estimate the number of previously unrecognized deaths from coronavirus disease 2019 (COVID-19) in the United States (US), rather than to calculate the “true” death toll of the pandemic. The analysis covered the period from March 2020 to December 2021.
Algorithmic estimates show that the U.S. medical reporting system likely failed to identify 155,536 deaths from COVID-19 that were officially attributed to other causes. Additionally, the model found that these predicted “unrecognized” deaths occurred disproportionately among marginalized racial groups, including Hispanics, American Indians/Alaska Natives, Blacks, and Asians.
Underreporting also appeared to be significantly higher than the national average among less educated individuals and residents of the American South, which suggests, but does not conclusively prove, systemic inequities in the nation’s death surveillance system.
Limitations of conventional COVID-19 mortality estimates
Accurate epidemiological public health reporting, especially mortality data, is widely considered to be the foundation of modern health systems, as it allows authorities to allocate resources and develop effective policies during emergencies.
However, the COVID-19 pandemic is often cited as an example of where this system broke down, and there is growing evidence that reporting was frequently delayed or incomplete.
Traditionally, studies have primarily used “excess mortality” statistical models to estimate pandemic death tolls by comparing the actual number of deaths with historical trends. Unfortunately, while these models have proven useful in estimating the total number of deaths in a particular region, they cannot accurately determine the cause of death.
As a result, excess mortality approaches alone have not previously been able to distinguish between those who died directly from the virus (COVID-19) and those who died from indirect factors related to the pandemic, such as delays in heart surgery or the economic stress of lockdowns.
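The excess mortality idea described above can be sketched in a few lines. This is only an illustration: the function name `excess_deaths` and the monthly counts are hypothetical and do not come from the study, and real analyses derive the expected baseline from a statistical model of historical trends rather than from fixed numbers.

```python
def excess_deaths(observed, expected):
    """Excess deaths per period: observed deaths minus the expected baseline."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must cover the same periods")
    return [o - e for o, e in zip(observed, expected)]


# Hypothetical monthly death counts for one region (not from the study).
observed = [1200, 1500, 1350]   # deaths actually recorded
expected = [1000, 1020, 1010]   # baseline implied by historical trends

per_month = excess_deaths(observed, expected)
total_excess = sum(per_month)

print(per_month)      # [200, 480, 340]
print(total_excess)   # 1020
```

The total says how many more people died than expected, but nothing in the calculation separates deaths caused directly by the virus from deaths caused indirectly by the pandemic.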
Machine learning models and research design
The study aimed to address this knowledge gap within the context of the US death surveillance system. The researchers leveraged recent computational advances to train a predictive ML model on a large national death certificate dataset, treating inpatient deaths as a high-quality (“gold standard”) reference under key assumptions.
The training set came not from a proprietary dataset but from US death certificates for inpatient deaths, a setting in which testing for COVID-19 was nearly universal and cause-of-death reporting was assumed to be highly accurate. The dataset covered the period from March 2020 to December 2021, during which 1.88 million deaths were reported.
Sixteen different ML models were trained on this reference dataset, with a particular focus on the contributing causes and other death certificate characteristics that may indicate a death due to COVID-19. The Extreme Gradient Boosting (XGBoost) model was chosen for its consistently high prediction accuracy on the training dataset.
The model was then applied to 3.85 million “out-of-hospital” death certificates from adults aged 25 and older. This dataset included up to 20 underlying and contributing causes of death, along with age, gender, race, education level, pre-existing chronic conditions, median household income, and geographic location.
Importantly, this approach assumes that patterns learned from in-hospital deaths can be validly applied to out-of-hospital deaths. This is an important but potentially limiting assumption of the model.
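The train-on-inpatient, predict-on-out-of-hospital design can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a small hand-written logistic regression stands in for XGBoost, the two binary features and all records are invented, and summing predicted probabilities to get an expected count is one plausible aggregation, not necessarily the one the authors used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent (stand-in for XGBoost)."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(model, x):
    """Predicted probability that a death record is a COVID-19 death."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical features per certificate: [pneumonia listed, respiratory failure listed]
# Reference set: inpatient deaths, where the COVID-19 label is assumed reliable.
inpatient_rows = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [1, 0], [0, 0]]
inpatient_covid = [1, 1, 1, 0, 0, 0, 0, 0]

model = train_logistic(inpatient_rows, inpatient_covid)

# Hypothetical out-of-hospital deaths officially attributed to other causes.
out_of_hospital_rows = [[1, 1], [0, 0], [1, 0], [0, 0]]
probs = [predict_proba(model, x) for x in out_of_hospital_rows]

# Assumed aggregation: expected number of unrecognized COVID-19 deaths.
estimated_unrecognized = sum(probs)
print([round(p, 2) for p in probs], round(estimated_unrecognized, 2))
```

The sketch makes the key assumption visible: the model only ever sees labels from inpatient deaths, so its output for out-of-hospital deaths is trustworthy only if the same certificate patterns signal COVID-19 in both settings.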
Estimated underreporting and mortality disparities
The XGBoost model estimated a total of 995,787 deaths due to COVID-19 during the study period (95% uncertainty interval (UI): 990,313 to 1,001,363). This number is about 19% (n = 155,536) higher than official records (n = 840,251) and points to a large reporting gap in the US death surveillance system.
The model further indicated that these discrepancies in official records were most severe for deaths that occurred at home, where the predicted number of deaths was 160% higher than the number of reported deaths (adjusted reporting ratio (ARR) = 2.60; 95% UI: 2.56 to 2.65). Unexpectedly, the model also identified substantial reporting gaps among deaths in hospice care and emergency rooms.
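The headline figures above can be checked with simple arithmetic. The sketch below uses only numbers quoted in the article; treating a ratio as plain predicted-divided-by-reported is an assumption made here for illustration, since the study's ARR is an adjusted quantity and may be computed differently. The helper name `ratio_to_percent_higher` is hypothetical.

```python
predicted_total = 995_787   # model-estimated COVID-19 deaths (from the article)
official_total = 840_251    # officially recorded COVID-19 deaths (from the article)

# Gap between the model estimate and official records.
unrecognized = predicted_total - official_total

# Crude (unadjusted) ratio of predicted to reported deaths.
overall_ratio = predicted_total / official_total
percent_higher = (overall_ratio - 1) * 100


def ratio_to_percent_higher(ratio):
    """Convert a predicted/reported ratio into 'percent higher than reported'."""
    return (ratio - 1) * 100


home_percent = ratio_to_percent_higher(2.60)   # ARR quoted for deaths at home

print(unrecognized)               # 155536
print(round(percent_higher, 1))   # 18.5
print(round(home_percent))        # 160
```

The gap of 155,536 matches the reported figure, the 18.5% excess rounds to the 19% quoted, and an ARR of 2.60 corresponds to the stated 160% excess for deaths at home.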
When estimating the relative contributions of various sociodemographic and medical conditions associated with misclassification, the model indicated that unrecognized mortality was highest in the southern United States. Alabama (ARR 1.67), Oklahoma (ARR 1.51), and South Carolina (ARR 1.47) had the highest levels of underreporting in the nation.
The model also identified racial and ethnic disparities in reporting, with deaths among Hispanic people the most likely to go unrecognized as COVID-19 deaths (ARR 1.31, 95% UI: 1.30 to 1.32). Underreporting was also elevated among American Indian/Alaska Native (ARR 1.24), Asian (ARR ~1.24), and Black (ARR 1.19) decedents.
Finally, COVID-19 deaths among individuals with less than a high school education were significantly more likely to go unreported than those among individuals with a higher education level (ARR 1.29). Similarly, counties with the lowest household incomes and the worst existing health indicators had the highest rates of unrecognized mortality.
Implications for public health and equity
The study concluded that the U.S. death surveillance system undercounted COVID-19 deaths in a “systematically unfair” manner. The findings of the XGBoost model suggest that the system inadvertently hid the true depth of the pandemic’s impact on marginalized communities.
Although this study is limited by the assumption that hospital-trained models can be generalized to home deaths, the researchers argue that this approach offers a potentially more specific alternative to traditional excess mortality models. The authors also stress that these estimates should be interpreted in parallel with other methodologies and not as a final count.
Future research should aim to apply similar ML frameworks to investigate other “hidden” mortality crises, such as drug overdoses and the effects of extreme heat.
Journal reference:
- Kian, M. V., et al. (2026). Apply machine learning to identify unconfirmed COVID-19 deaths recorded as other causes of death in the United States. Science Advances, 12(12). DOI: 10.1126/sciadv.aef5697, https://www.science.org/doi/10.1126/sciadv.aef5697

