New data section highlights common large datasets used in studies

We are well into the age of Big Data, in which researchers may draw on databases or other datasets containing information from tens of thousands or even millions of individuals.

These massive datasets offer many advantages: researchers can narrow down a specific population through inclusion or exclusion criteria, include enough participants to achieve adequate statistical power, analyze and compare subgroups based on demographics or other differences, and study diverse, representative populations.

“With the advent of administrative databases and patient registries, big data is increasingly accessible to researchers,” three JAMA Surgery statistical editors wrote recently in an open-access editorial offering researchers tips on using large datasets. But, they cautioned, this brave new world of big data has pitfalls too.

For one thing, as they note, “no database is completely free of bias and measurement error.” And with larger datasets, even minor random patterns may show up as statistically significant simply because the sample size is so large. Huge amounts of data make data dives so easy that researchers can readily produce statistically significant results that are not clinically significant, or not even necessarily true. It’s far easier to go p-hacking, intentionally or even unintentionally, with such large datasets.
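To see how sample size alone can push a trivial difference past the usual significance threshold, here is a minimal Python sketch. It is an illustration only, not something from the editorial: the function name `two_sample_p_value` and the blood pressure numbers (a 0.5 mmHg difference between two groups, with a standard deviation of 15 mmHg) are hypothetical, and it assumes a simple two-sided z-test with equal group sizes and a known, shared standard deviation.

```python
import math


def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized groups.

    Assumes a z-test with a known standard deviation shared by both groups.
    """
    # Standard error of the difference between two group means.
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal variable, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical numbers: the same 0.5 mmHg difference (clinically negligible)
# evaluated at increasingly large sample sizes.
for n in (100, 10_000, 1_000_000):
    p = two_sample_p_value(mean_diff=0.5, sd=15.0, n_per_group=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>9,}: p = {p:.3g} ({verdict} at 0.05)")
```

The size of the difference never changes in this sketch; only the number of people does. That is why a p-value from a very large database says little on its own about whether a finding matters clinically.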

It may be helpful for journalists to know a bit about some of the databases and other datasets that researchers most commonly use in conducting retrospective studies. A new section in the Medical Studies Core Topic Data tab provides an overview of several of these massive databases, a sort of primer on each one that discusses what it is, who is included in it, what information it does or does not contain, possible cautions or limitations to using it, and things you should look for in a study that uses it (such as certain types of statistical adjustment or sensitivity analyses).

The databases listed below are the ones that the JAMA Surgery editors specifically drew attention to because they see them used most frequently in the papers submitted to their journal, but the list is by no means comprehensive. Other datasets will be added as they seem appropriate or come to our attention. Please send any suggestions to tara@healthjournalism.org.

Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project databases: National Inpatient Sample, State Inpatient Databases, and Kids’ Inpatient Database
Surveillance, Epidemiology, and End Results Program
Medicare Claims Data
Military Health System Tricare Encounter Data
Veterans Affairs Surgical Quality Improvement Program
National Surgical Quality Improvement Program
Metabolic and Bariatric Surgery Accreditation and Quality Improvement Program
National Cancer Database
National Trauma Data Bank
Society for Vascular Surgery Vascular Quality Initiative
The Society of Thoracic Surgeons National Database