“Big data” datasets and databases used by researchers

We are well into the age of Big Data, in which researchers may use databases or other datasets with information on tens of thousands or even millions of individuals. It’s easy, even unwittingly, to manipulate this data and end up with “significant” results that may not actually be significant or true.

Below is a rundown of each of several large databases or other large datasets that researchers may use in observational studies. Knowing the characteristics of each particular data source may make it easier for journalists to assess whether that was an appropriate data source to use or, at the least, ask the researcher why they chose it.

If you have a suggestion for an additional database primer to include, please send your suggestion to tara@healthjournalism.org.

National Surgical Quality Improvement Program (NSQIP) and Pediatric NSQIP
Adapted from JAMA Surgery’s overview for researchers

The National Surgical Quality Improvement Program (NSQIP) was established in 2004 by the American College of Surgeons based on similar programs in Veterans Affairs. The ACS added the pediatric NSQIP in 2008. Both databases contain detailed demographic and clinical data about surgical procedures with up to 30 days of postoperative follow-up. Approximately 700 hospitals participate in the NSQIP, resulting in inclusion of more than 1 million cases a year, and more than 100 sites participate in the pediatric database (NSQIP-P), which includes approximately 150,000 annual cases.

What to know:

  • The database contains national, reliable, risk-adjusted data on very specific surgical procedures, including 30-day outcomes, complications, readmissions and length of stay, but data sets are not necessarily nationally representative.
  • Cases also include detailed information about comorbidities (preexisting conditions) and complications that develop during or after surgery.
  • As of 2018, more than 1,500 peer-reviewed studies used data from the NSQIP.
  • Patient and hospital-level data is de-identified.

Study uses

  • Often used to develop quality improvement interventions and, according to the JAMA guide, “have led to improvements in morbidity and mortality, cost savings from prevention of complications, and a platform for disease-specific, procedure-specific, or regional or system-based collaboratives.”
  • Can be used to identify or quantify risk factors based on demographics, comorbidities or types/subtypes of specific procedures (such as different types of gastric bypass surgery as opposed to gastric bypass surgery in general).
  • Can be used to study trends over time or for comparative effectiveness research.

Limitations and considerations

  • Data may be missing: read through the methods and limitations sections to see if researchers note missing data. If so, the study should note how the researchers handled the missing data, typically by imputation or sensitivity analyses.
  • Though studies using NSQIP usually have large sample sizes with highly detailed clinical information, outcomes are limited to 30 days of follow-up, and even large sample sizes may not be large enough to identify very rare outcomes or “accurately portray outcomes for rare cases.” An option to use data with oversampling of specific cases may offset this in some cases, but this is an area worth asking researchers about during interviews.
  • Some variables have been redefined, added or removed over time, so longer term studies will need to account for changes in variables or simply cannot study certain long-term trends accurately. Check the methods section to see how the researchers define specific variables and outcomes and whether these definitions remain consistent across the time period studied.
  • Financial outcomes are not included in the database, so any cost-effectiveness studies relying on NSQIP must estimate costs from other sources.
  • Patient satisfaction, quality of life, patient-reported adverse effects or experiences and similar patient-reported outcomes are not included in the database. Some of these may be added in the future.

Society for Vascular Surgery Vascular Quality Initiative (SVS VQI)
Adapted from JAMA Surgery’s overview for researchers

The Society for Vascular Surgery set up the Vascular Quality Initiative as a patient safety organization in 2011 to assess/improve the safety and effectiveness of 12 vascular procedures:

  • carotid artery stenting
  • carotid endarterectomy
  • endovascular abdominal aortic aneurysm repair
  • hemodialysis access
  • infrainguinal bypass
  • inferior vena cava filter
  • lower extremity amputations
  • open abdominal aortic aneurysm repair
  • peripheral vascular interventions
  • suprainguinal bypass
  • thoracic and complex endovascular abdominal aortic aneurysm repair (2 separate procedures)
  • varicose vein treatment

What to know:

  • Reporting to the database is optional, and some of the participating physicians and hospitals may choose to report on some procedures but not others.
  • As of 2017, the SVS VQI included data on 390,270 procedures performed by more than 3,200 doctors at 431 US and Canadian institutions: 40% community hospitals, 29% teaching hospitals and 31% academic hospitals.
  • Each case includes the patient’s demographics and clinical characteristics as well as the physician and hospital involved. Clinical data are reported at the time of the procedure and one year later, allowing for information on mortality, additional procedures/interventions, complications and other medium-term outcomes.
  • Some electronic health record systems’ data can be matched/integrated with SVS VQI data. The database can also be linked with patient identification information to Medicare claims, the Social Security Death Index and other similar data sets.

Study uses

  • Primary use is safety improvement, though it could be used for comparative effectiveness research as well.
  • Expect outcomes to be reported in odds ratios or effect sizes. Studies should also include flow diagrams to clarify the patients included and excluded and the reasons.

Limitations and considerations

  • The biggest challenge with using the SVS VQI for research is selection bias since it’s voluntary and physician/hospital self-reported, and participants may not report for all procedures. “There may also be some debate about whether participation in a quality registry leads to meaningful improvement in institutional health care quality,” the JAMA Surgery authors note.
  • Pay attention to which registries (that is, which procedures) are included, because no data are included on which patients were excluded from procedures, again possibly introducing selection bias.
  • The database collects information prospectively, but analyses can only be done retrospectively. It’s not possible to draw conclusions about causation from this database.
  • The researchers should have established their hypothesis and outcomes before conducting the study; changes to primary or secondary outcomes afterward could be subject to p-hacking. “The emphasis of the article should be on practical clinical findings, not incidental statistically significant results,” the JAMA Surgery authors note.
  • The number of variables can be so vast — sometimes hundreds of variables — that the researchers should have made adjustments to account for statistical significance occurring by chance. A Bonferroni correction or similar statistical calculation can correct for this.
  • If the researchers conducted subgroup analyses, make sure the subgroups were planned ahead of time (otherwise, again, an area ripe for p-hacking) and that the characteristics not being studied (demographics, clinical variables, etc.) are evenly distributed across subgroups to avoid bias.
  • For missing data, be sure the researchers have explained how they accounted for it. This database typically requires imputation of missing values — simply removing those values altogether could introduce bias — and the researchers should have explained what assumptions they used to do so. (This is another area worth asking an outside source about.)
  • The JAMA Surgery authors suggest that researchers “analyze whether the missing values are owing to an underlying bias in the data such that multiple imputation could skew the results” — something journalists can ask outside experts about.
  • The researchers should note a power calculation in their study—what number of participants/cases they needed to include to achieve statistical significance. You could ask an outside source, such as a biostatistician, whether their description of this calculation looks appropriate if other red flags suggest concern about the study.
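
To make the multiple-comparisons point above concrete, here is a minimal sketch of a Bonferroni correction in Python. The p-values are invented for illustration and do not come from any SVS VQI study; the function name `bonferroni` is likewise just a label for this example. The idea is simply that the usual 0.05 threshold is divided by the number of tests performed.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the Bonferroni-adjusted threshold and, for each p-value,
    whether it is still significant under that stricter threshold."""
    m = len(p_values)          # number of tests performed
    threshold = alpha / m      # the stricter per-test cutoff
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values from testing 10 variables against one outcome
p_values = [0.001, 0.004, 0.03, 0.04, 0.049, 0.20, 0.35, 0.51, 0.76, 0.92]

threshold, still_significant = bonferroni(p_values)
naive_hits = sum(p < 0.05 for p in p_values)   # hits with no correction
bonferroni_hits = sum(still_significant)       # hits after correction

print(f"Adjusted threshold: {threshold}")
print(f"Significant without correction: {naive_hits}")
print(f"Significant with Bonferroni: {bonferroni_hits}")
```

In this made-up example, several results that look significant at 0.05 no longer clear the bar once the threshold accounts for ten tests. Bonferroni is one of the more conservative corrections; researchers may reasonably use other methods, which is worth asking about.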

Healthcare Cost and Utilization Project National Inpatient Sample (NIS)
Adapted from JAMA Surgery overview for researchers

The federal Agency for Healthcare Research and Quality runs the Healthcare Cost and Utilization Project (HCUP), which includes multiple data resources. These include all the administrative information reported to payers since 1988, pulled from the federal government, private and state data organizations and hospital associations.

One of these HCUP resources, the National Inpatient Sample (NIS), is sometimes used in surgical research even though the data it provides is administrative instead of clinical. The NIS is just one HCUP database resource, and this data primer does not include information on the others (though some of the limitations apply to all of them).

What to know:

  • Administrative data sets include diagnostic codes, procedures codes and costs — but not clinical data — for a 20% nationally representative sampling of all US hospital inpatient encounters.
  • Researchers can use tools within the database itself for risk-adjustment.
  • Data are de-identified, but it’s still possible to track a single individual over time through added features in the system that use other variables (instead of personally identifiable information) to track patients.

Study uses

  • Best used for generating hypotheses and research questions and looking for or examining trends over time — but not for conclusions related to causation.
  • HCUP databases in general “have been used to help shape policy decisions, assess the effectiveness of surgical techniques, examine disparities in surgical care, perform comparative effectiveness research and drive quality-improvement efforts,” the JAMA Surgery authors note.
  • It’s “ideal for performing basic descriptive studies, deriving national estimates, studying costs, studying rare disease, and understanding trends over time.”

Limitations and considerations

  • The NIS database was redesigned in 2012, and a major change included using a sample of discharges from all hospitals instead of all discharges from a sample of hospitals. While the new method is more nationally representative, it means comparing data before 2012 to after 2012 could be problematic.
  • Sample sizes are in the millions, so statistical significance requires a P value much lower than 0.05. You should be seeing a lot of zeroes after the decimal point for true statistical significance with research using this database.
  • Since the JAMA Surgery authors suggest that researchers start “with a clearly thought-out research question” and “understand the limitations of asking this question using the data available,” journalists could ask outside sources whether the question is indeed well thought out and appropriate for this data set.
  • Since the standards used for coding change over time, researchers may be limited in how they compare procedures or diagnoses over time. Most of the original data would have used ICD-9 coding, but reported data now include ICD-10 and will soon include ICD-11.
  • The data were collected for billing purposes, not for clinical purposes, so using them to study clinical outcomes has to take this different original purpose into account (including any bias associated with doing that, which is something to ask outside experts about).
  • Surveillance bias (the more you look for something, the more likely you are to find it) is possible with this data set.
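
The point above about P values and very large samples can be illustrated with a short sketch. It uses a standard pooled two-proportion z-test, written with only Python's standard library, on invented numbers: the same 5.0% versus 5.1% complication rates, once in a study of 2,000 patients per group and once in a study of 2 million per group. None of these figures come from the NIS.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference between two proportions,
    using the pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical: identical rates (5.0% vs 5.1%), very different sample sizes
p_small = two_proportion_p_value(100, 2_000, 102, 2_000)
p_large = two_proportion_p_value(100_000, 2_000_000, 102_000, 2_000_000)

print(f"2,000 per group:     p = {p_small:.3g}")
print(f"2,000,000 per group: p = {p_large:.3g}")
```

The difference between the groups is the same tenth of a percentage point in both cases, yet only the huge sample produces a tiny P value. That is why a small P value from a database this size says little on its own about whether a difference matters clinically.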

Society of Thoracic Surgeons (STS) National Database
Adapted from JAMA Surgery overview for researchers

The Society of Thoracic Surgeons (STS) National Database is a voluntary clinical registry that collects data on cardiac surgery outcomes for the purpose of quality and safety improvement. The database uses risk-adjustment to account for differences in patients and institutions when reporting mortality and morbidity rates. The rates are divided into three types of cardiothoracic surgery: adult cardiac surgery (dating back to 1989), congenital heart surgery (dating back to 1994) and general thoracic surgery (2002).

What to know:

  • The 1,119 participating healthcare centers that report (as of September 2016) are not necessarily individual hospitals but rather institutional surgical programs, of which there could be multiple within a single institution (or possibly shared across institutions). They come from all 50 states, plus 29 participating centers in 8 other countries. As of September 2016, the database had 6.1 million patient records and involved 3,100 surgeons.
  • Despite being voluntary, participation is high: the database includes “data on 94% of Medicare beneficiaries undergoing coronary artery bypass grafting surgery and from 90% of the sites providing care to Medicare beneficiaries.”
  • The data collected primarily includes demographic info, covariates specific to the subspecialty and short-term outcomes data, including morbidity, mortality and length of stay.

Study uses

  • Can compare outcomes, including mortality and short-term morbidity, across different institutions and across types of procedures.

Limitations and considerations

  • Studies using STS database data are retrospective observational cohort studies and therefore subject to all the usual limitations of any retrospective observational study (especially confounding from unmeasured variables). The data is also collected according to the procedure a patient had, not their diagnosis.
  • The database does not provide data on characteristics of individual clinicians, long-term outcomes or patient-reported outcomes. The outcomes may also not be generalizable, depending on the population the study focuses on.
  • Any study should use multivariable regression, propensity score analysis, propensity-matching, instrumental variable analysis or some other method of adjusting for covariates. (You don’t necessarily need to know what they did or how unless it’s especially germane to your reporting, but you should at least note that they did something.)
  • “Because of the large number of patients in the database, it is easy to discover statistically significant associations that are clinically unimportant. Therefore investigators should prespecify predictors, outcomes of interest, and the definition of a clinically important difference across variables.” In other words, watch out for p-hacking or statistical findings that don’t actually change the course of medical care.

Because it’s procedure-based (and not illness/disease/condition/diagnosis-based), the database cannot be used to compare non-surgical interventions to surgical interventions. It can only compare different types of surgical procedures. However, researchers sometimes link this database to others to do comparisons with non-surgical interventions.

Veterans Affairs Surgical Quality Improvement Program (VASQIP)
Adapted from JAMA Surgery overview for researchers

This database, located in the Veterans Administration (VA) National Surgery Office (NSO), tracks all patients who undergo surgery with the VA. Nurse data managers record the data with the primary purpose of care quality improvement. The data collection is legally mandated by Congress, includes cardiac and non-cardiac surgery and is not publicly available.

What to know:

  • Cases are tracked on eight-day cycles (up to 36 cases per cycle) to ensure none are missed, and patients are followed for 30 days post-operation, regardless of hospital length of stay.
  • Cases include data on more than 200 variables, including pre-operative condition, lab results, other procedures (up to 10), procedure start/end times, anesthesia type, presence of trainees and 30-day post-op outcomes (mortality, morbidity of 22 complications of varying severity), among others.
  • Hospital location is tracked, so outcomes can be compared across institutions.

Study uses

  • Compare post-operative complications across different institutions.
  • Identify associations between specific procedures and complications, including associations related to patient characteristics and to the procedure’s characteristics (such as duration of procedure, number of trainees present, type of anesthesia used, etc.).
  • Cases can be cross-linked to other VA databases, so it’s theoretically possible to track a patient before and after their procedure through those other databases.

Limitations and considerations

  • Retrospective, observational.
  • Pay close attention to the specific definition of procedures and outcomes so you don’t misinterpret them (or so you can see whether the researchers misinterpreted them).
  • Sample size is large enough for statistical significance to translate to clinical significance, but when many different variables are studied, look for statistical adjustment/corrections to account for statistical significance happening by chance. Researchers can control for a lot, but controlling for too much can lead to statistical flukes.
  • With so many variables and conditions to choose from, researchers must choose very carefully and be sure the models they use fit the exposures and outcomes they are studying.
  • Patients who undergo a second operation within 30 days will have two sets of records, so researchers need to remove the second or otherwise account for duplicate or overlapping data.

Look for whether researchers conducted a “reliability adjustment” to account for natural changes over time and/or differences in hospital volume.
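
As a rough illustration of the duplicate-records point in the list above, the sketch below shows one possible rule for handling patients who have a second operation within 30 days: keep the first operation and drop any later one that falls inside the window. This is an assumption for illustration only. The record layout (patient ID plus operation date) and the function name `index_operations` are hypothetical, and actual VASQIP studies may handle overlapping records differently.

```python
from datetime import date


def index_operations(records, window_days=30):
    """Keep each patient's first operation and drop any later operation
    within `window_days` of that patient's most recently kept operation."""
    kept = []
    last_kept = {}  # patient_id -> date of the most recently kept operation
    for patient_id, op_date in sorted(records):
        previous = last_kept.get(patient_id)
        if previous is None or (op_date - previous).days > window_days:
            kept.append((patient_id, op_date))
            last_kept[patient_id] = op_date
    return kept


# Hypothetical records: (patient ID, date of operation)
records = [
    ("A", date(2020, 1, 5)),
    ("A", date(2020, 1, 20)),  # reoperation 15 days later: dropped
    ("A", date(2020, 6, 1)),   # months later, a separate episode: kept
    ("B", date(2020, 3, 10)),
]

cohort = index_operations(records)
for patient_id, op_date in cohort:
    print(patient_id, op_date.isoformat())
```

Whatever rule a study uses, the methods section should say so, because counting the same patient's complications twice can inflate complication rates.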

National Trauma Data Bank (NTDB)
Adapted from JAMA Surgery overview for researchers

The National Trauma Data Bank (NTDB) contains more than 7.5 million electronic records from over 900 trauma centers, making it the largest database of trauma incidents in the world. It grew out of the Major Trauma Outcomes Study that ended in 1989, and then the American College of Surgeons Committee on Trauma formed a subcommittee in 1997 to continue it.

What to know:

  • Individual hospitals voluntarily submit data every February-May.
  • The NTDB has its own inclusion/exclusion criteria for what qualifies to be reported.
  • Each trauma incident is reported independently; a person with repeated injuries can show up in the database multiple times under different records.
  • The most reliable data starts in 2007, when a National Trauma Data Standard was adopted; combining 2002-2006 data with later data is likely unreliable or inconsistent.
  • Data quality has improved as the number of hospitals participating in the American College of Surgeons Trauma Quality Improvement Program, begun in 2010, increases.

Study uses

  • Most studies are retrospective, cross-sectional and matched case-control studies.
  • Commonly studied outcomes include mortality, length of hospitalization and complications, including rare injuries or outcomes because the database is sufficiently large.
  • Researchers should define outcome variables before beginning the study and should provide a rationale for using NTDB along with a flow diagram.
  • The database only includes data up through discharge and does not record deaths occurring after discharge; “transfers-to-hospice” may frequently be counted as deaths.
  • If researchers select their outcome or variables based on what is statistically significant in the dataset, their study should be aimed at generating hypotheses, not finding conclusive associations.

Limitations and considerations

  • Not nationally representative because of voluntary reporting, but nearly all level I/II trauma centers currently report.
  • Does not include data on costs, lab test results or long-term outcomes.
  • Selection bias may occur with hip fractures and transferred patients because different hospitals have different protocols for these two categories.
  • Dead-on-arrival patients or those who die soon after arrival may have incomplete information on their injury and its severity.
  • Hospitals may vary in how they report injury severity scores, comorbidities and complications; some hospitals may use more sensitive imaging and report more incidental or minor findings.
  • Researchers may need to make adjustments to deal with missing data.

Big Data Source: Medicare Claims Data
Adapted from JAMA Surgery overview for researchers

The Centers for Medicare and Medicaid Services (CMS) is the agency responsible for managing Medicare, the health insurance program used by most people at least 65 years old in the US. The public program, which AHCJ discusses in greater detail in the Insurance Core Topic, includes four parts: hospital insurance (Part A), medical insurance (Part B), a CMS-approved private insurance called Medicare Advantage (Part C) and prescription drug coverage (Part D).

Researchers can use de-identified datasets of claims reimbursed by CMS, except claims for the private insurance in Part C.

What to know:

  • The datasets include age, birthdate, sex, race/ethnicity and place of residence and cover 70 percent of all US adults older than 64. The large size of the population allows for subgroup analyses with a greater statistical power than many other subgroup analyses may have.
  • The data is national and therefore includes a nationally representative population for those age 65 and older, including care in a wide variety of healthcare settings (surgical centers, physician offices, hospitals, nursing home care, etc.).
  • The data can be linked both to other CMS data sets — health care utilization, insurance enrollment, and clinician characteristics — and to non-CMS data, including the US Census, cancer registries (including SEER), the Social Security death index, clinician information (such as the American Hospital Association data), and other government insurance programs, such as Medicaid and Tricare.
  • Missing data is rare since hospitals and physicians cannot get paid without complete, accurate claims.

Study uses:

  • It’s possible to track patients over a long period of time in longitudinal studies on outcomes and health care utilization. This same longitudinal data can be used to reliably identify trends over time.
  • This data is excellent for comparative effectiveness studies and comparing outcomes across different geographical areas and different types of health care settings. It may also be used to assess health policy needs and effects.

Limitations and considerations:

  • Comparative effectiveness studies should adjust for selection bias, and studies comparing practice pattern or outcomes variations across clinicians should adjust for patient risk. Studies evaluating health policy should adjust for background trends occurring over time, such as general improvement over time.
  • Diagnoses are identified with International Classification of Diseases, Ninth Revision (ICD-9) or ICD-10 codes, so specifics about chronic conditions and comorbidities aren’t always available. In studies examining surgery outcomes, researchers cannot precisely, accurately determine complications using ICD codes alone.
  • The data does not include time stamps during hospital stays, so the order in which events occur during a hospital stay cannot be determined from the claims associated with a single stay.
  • Claims include only services billed and diagnoses, not vital signs, lab test results, pathology results, imaging results or other physiological information about individual patients.
  • The claims only include services covered and billed for under Medicare; information on non-covered services or any care or diagnoses sought or received elsewhere is not included.
  • Billing data cannot reliably provide information about severity of a condition, causes or comorbidities.

Military Health System Tricare Data
Adapted from JAMA Surgery overview for researchers

Tricare is the insurance used by the Department of Defense for all active, retired and disabled (with certain conditions) military personnel and their dependents up to age 64, with two exceptions: it does not cover services by the Veterans Administration (VA) or health care provided in combat zones. About 80 percent of the 9 million nationwide beneficiaries are civilians, and the other 20 percent are active military personnel.

What to know:

  • Data is generally considered diverse and nationally representative in terms of patients’ sociodemographic, vocational, educational and occupational characteristics up to age 64.
  • More generalizable than data from Medicare, private insurance or national registries.
  • Treatment may be administered at DoD healthcare centers (“direct care”) or at civilian centers (“purchased care”).
  • Covers inpatient and outpatient care, doctor fees, dental care, prescriptions and medical equipment.
  • Data includes age, voluntarily self-reported race/ethnicity, marital status, residence region based on census and the sponsor’s rank, which is sometimes used as a proxy for socioeconomic status.
  • Data includes diagnoses (based on ICD-9 and ICD-10), comorbidities, injury severity scores and length of hospital stays.

Study uses

  • Can study mortality, health care utilization, prescription drug use, comorbidity associations, quality of care and morbidity after surgeries, among other things.
  • Can often be used for long-term studies among career military personnel and their dependents.
  • Can serve as a model of single-payer universal health insurance in the US, including study of how universal care affects healthcare disparities.

Limitations and considerations

  • Data may not include extreme detail, such as vital signs and specific information about individual visits, surgeries, treatments, etc.
  • Medication misclassification or undercounting can occur because beneficiaries may purchase medication out-of-pocket and never submit a claim.
  • Payments are often in lump sums, preventing cost comparisons for specific services, encounters or surgeries.
  • Missing data may require statistical adjustment.
  • Researchers should account for whether the services were provided at DoD facilities or civilian facilities.
  • Optional self-reporting of race/ethnicity means it’s frequently a missing data point (up to 31 percent in some studies) or may not be accurate; excluding incomplete files may bias results unless adjusted for.

National Cancer Database (NCDB)
Adapted from JAMA Surgery overview for researchers

This database relies on reporting from more than 1,500 US hospitals and provides information on more than 70 percent of all new cancer diagnoses, dating back to 1989. The American College of Surgeons Commission on Cancer and the American Cancer Society run it together.

What to know:

  • Data collected on patient characteristics, patient comorbidities, cancer staging, tumor characteristics, treatments/interventions and survival outcomes.
  • Includes healthcare “facility type” and IDs for individual facilities; facilities not submitting at least one case per year are frequently excluded.
  • Multi-year studies should use a standardized variable when looking at staging and/or tumor data.
  • Mortality data includes 30-day, 60-day and 90-day mortality rates as well as 5-year survival rates.

Limitations and considerations:

  • Smaller facilities may have fewer cases or lower volume, with implications for bias or challenges in separating signals from noise. For example, one death in a low-volume hospital will appear as a much higher mortality rate than multiple deaths in a much higher volume hospital.
  • Reported treatments are detailed but only cover the first 6 months after diagnosis; long-term treatment data or planned treatment protocols are not included.
  • Only the most significant/definitive surgical procedure is included if multiple surgeries occurred.
  • Survival rate data may be subject to lead time bias and other biases.
  • 30-day readmission data only refers to readmission to the same hospital as the first admission and therefore has high potential for bias.
  • Not population-based or generalizable beyond the facilities included, but generally representative of nationwide care because of its size.
  • Long-term studies may require time as an additional variable to be accounted for because of changes over time in available treatments, diagnoses, intervention protocols, screenings and patients.
  • Missing data is common and should be accounted for; data specific to individual sites should not be used if less than 50 percent of a particular data point is available.
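
The low-volume hospital example in the list above can be worked through in a few lines. The sketch below compares an invented small facility (1 death in 10 cases) with an invented large one (20 deaths in 1,000 cases) and attaches an approximate 95% Wilson score interval to each rate. The Wilson interval is one common choice for proportions from small samples, not necessarily what any particular NCDB study used.

```python
import math


def wilson_interval(deaths, cases, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = deaths / cases
    denom = 1 + z**2 / cases
    center = (p + z**2 / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z**2 / (4 * cases**2)) / denom
    return center - half, center + half


# Hypothetical facilities: (deaths, cases)
small_hospital = (1, 10)
large_hospital = (20, 1_000)

small_rate = small_hospital[0] / small_hospital[1]
large_rate = large_hospital[0] / large_hospital[1]
small_low, small_high = wilson_interval(*small_hospital)
large_low, large_high = wilson_interval(*large_hospital)

print(f"Small: {small_rate:.1%} (95% CI {small_low:.1%} to {small_high:.1%})")
print(f"Large: {large_rate:.1%} (95% CI {large_low:.1%} to {large_high:.1%})")
```

The small hospital's headline mortality rate is five times higher, but its interval is so wide that it overlaps the large hospital's. A single death is driving the number, which is the signal-versus-noise problem described above.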

Surveillance, Epidemiology, and End Results (SEER) Database
Adapted from JAMA Surgery overview for researchers

The Centers for Disease Control and Prevention (CDC), National Cancer Institute and regional and state cancer registries collaborate on SEER, a federally funded and therefore publicly available cancer reporting system begun in 1974. The data is population-based and therefore nationally representative and generalizable because the 18 states reporting data come from all geographic regions of the US.

What to know:

  • Data includes patients of all ages, independent of insurance payer or status, reported at the local level.
  • The data covers 28 percent of the U.S. population with oversampling of racial/ethnic minorities, people born overseas and residents living below the federal poverty line.
  • Mortality statistics come from U.S. Census data, and death certificates provide data on dates and causes of death.
  • Does not include individual or family income, but does include age at diagnosis, year and place of birth, race/ethnicity, sex, marital status and the education and income of the individual’s census tract.
  • Cancer and pathology data collected includes the cancer site, stage, grade, advancement of disease, tumor markers, lymph node status, lymphovascular invasion, perineural invasion and margin status.
  • Includes method of diagnosis, surgeries and radiation and order of treatment.

Study uses:

  • Studies covering long periods of time can identify trends in cancer incidence, prevalence and survival.
  • Can study rare cancer and specific subpopulations.
  • Incidence and mortality rates require adjustment for age and are best reported as cases per 100,000 person-years.
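
To show why age adjustment matters for the rates mentioned above, here is a minimal sketch of direct age standardization, expressed per 100,000 person-years. Everything in it is invented for illustration: the two regions, the case counts, the three age bands and the standard-population weights (which are not the official U.S. standard population that real analyses would typically use).

```python
def crude_rate(strata, per=100_000):
    """Total cases divided by total person-years, ignoring age."""
    cases = sum(c for c, _, _ in strata)
    person_years = sum(py for _, py, _ in strata)
    return per * cases / person_years


def age_adjusted_rate(strata, per=100_000):
    """Direct standardization: weight each age group's rate by that
    group's share of a standard population (weights must sum to 1)."""
    total_weight = sum(w for _, _, w in strata)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    return per * sum(w * cases / py for cases, py, w in strata)


# Hypothetical data: (cases, person-years, standard weight) per age band.
# Both regions have identical age-specific rates; only the age mix differs.
younger_region = [(80, 800_000, 0.6), (150, 150_000, 0.3), (250, 50_000, 0.1)]
older_region = [(40, 400_000, 0.6), (300, 300_000, 0.3), (1_500, 300_000, 0.1)]

crude_young, crude_old = crude_rate(younger_region), crude_rate(older_region)
adj_young = age_adjusted_rate(younger_region)
adj_old = age_adjusted_rate(older_region)

print(f"Crude rates:        {crude_young:.0f} vs {crude_old:.0f} per 100,000")
print(f"Age-adjusted rates: {adj_young:.0f} vs {adj_old:.0f} per 100,000")
```

The crude rates make the older region look far worse, even though people of the same age have exactly the same risk in both places. After adjustment the two rates match, which is why unadjusted comparisons across populations or time periods deserve a follow-up question.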

Limitations and considerations:

  • Long-term trend data should stay within three predetermined cohorts—SEER-9 data from 1974, SEER-13 from 1992 and SEER-18 from 2000—or else account for changes in staging definitions, diagnostic criteria, radiation or surgery protocols, etc.
  • Missing clinical and pathologic data is common and should be noted or accounted for.
  • Does not include data on surgical approach, radiation dose, chemotherapy, hormonal therapy or immunotherapy.
  • Includes past cancer history but not cancer recurrence or comorbidities, medications or disability level.
  • Not helpful for comparative effectiveness studies since data on comorbidities, cancer recurrence and chemotherapy are unavailable.
