
Data

Abortion

Agriculture and food data

“Big data” datasets and databases used by researchers

Deaths from law enforcement

Disability statistics

Drug shortages

FDA-approved drugs

FDA data

Firearm deaths and injuries and mass shootings data

General statistics

Infectious disease

International data

Lead poisoning research

Medical devices and equipment

Medications

Opioids

Prisons

Number needed to treat

Suicide statistics

Supplements and nutrition products

Supplemental Nutrition Assistance Program (SNAP) Data

Vaccines and Immunization Data

Abortion

Court cases related to abortion continue to make their way to the Supreme Court every few years, and the issue remains as divisive as ever across the nation. Here are resources both for statistics on abortion and general research that might be relevant for stories related to state legislation, court cases and local stories.

For overall statistics, the Guttmacher Institute has an extensive section on abortion, including laws, statistics, trends and research. You can customize and download a data set by region (world regions, countries, states and U.S. counties) and by specific statistics. The institute also maintains a regularly updated fact sheet on abortion that is chock-full of information, including incidence, demographics, gestational timing, types of abortion, safety, insurance coverage and law/policy, and links to a study on the reasons women have abortions.

The CDC provides an overall section on reproductive health and abortion surveillance, with summaries such as this MMWR report (Wikipedia provides a handy list of links to each year’s report). Pew Research offers an in-depth look at public views on abortion as well as a quick look at the big picture with public opinion and knowledge. For trends, check out this study on the incidence of and access to abortions through 2008.

Data on unintended pregnancy and contraception is available here, here and here at Journalist’s Resource. The partisan site Abort73 (opposing abortion) also provides a helpful list of stats with links, and Planned Parenthood has a quick fact sheet of their numbers.

If writing about a court case or proposed legislation, it’s helpful to be familiar with the major SCOTUS rulings on abortion linked here: Roe v. Wade, Webster v. Reproductive Health Services, Planned Parenthood of Southeastern Pennsylvania v. Casey, Stenberg v. Carhart, Gonzales v. Carhart, and, most recently, Whole Woman's Health v. Hellerstedt (full opinion here). An even more extensive list of rulings that includes lower courts (primarily state supreme courts and circuit courts) can be found here, though it’s a partisan site promoting the reversal of Roe. Pew Research offers an excellent overview of the history of abortion rulings in the US if you need a quick bone-up on the big picture (briefer version here). The Guttmacher Institute also gives a statistical overview of state laws related to abortion.

Agriculture and food data

Those reporting on food, farming, food-borne illnesses and related topics may find several links from the Department of Agriculture helpful. The USDA Census of Agriculture offers a “comprehensive summary of agricultural activity for the United States and for each state,” including “number of farms by size and type, inventory and values for crops and livestock, operator characteristics and much more.” Get even more specific with cropland data from the interactive National Agricultural Statistics Service CropScape page.

The Food Safety and Inspection Service has information on recalls available here, and quarterly reports are available here. If you are looking into school lunches (National School Lunch Program) or any other food and nutrition programs, such as the Emergency Food Assistance Program or the Food Distribution Program on Indian Reservations, the USDA Web Based Supply Chain Management page has information relevant to the federal Food and Nutrition Service, Farm Service Agency, Agriculture Marketing Service and Foreign Agricultural Service as well as USAID.

“Big data” datasets and databases used by researchers

We are well into the age of Big Data, in which researchers may use databases or other datasets with information on tens of thousands or even millions of individuals. It’s easy, even unwittingly, to manipulate this data and end up with “significant” results that may not actually be significant or true.

Below is a rundown of each of several large databases or other large datasets that researchers may use in observational studies. Knowing the characteristics of each particular data source may make it easier for journalists to assess whether that was an appropriate data source to use or, at the least, ask the researcher why they chose it.

If you have a suggestion for an additional database primer to include, please send your suggestion to tara@healthjournalism.org.

 

National Surgical Quality Improvement Program (NSQIP) and Pediatric NSQIP — Adapted from JAMA Surgery’s overview for researchers

The National Surgical Quality Improvement Program (NSQIP) was established in 2004 by the American College of Surgeons, based on similar programs in Veterans Affairs. The ACS added the pediatric NSQIP in 2008. Both databases contain detailed demographic and clinical data about surgical procedures, with up to 30 days of postoperative follow-up. Approximately 700 hospitals participate in the NSQIP, resulting in the inclusion of more than 1 million cases a year, and more than 100 sites participate in the pediatric database (NSQIP-P), which includes approximately 150,000 cases annually.

What to know:

  • The database contains national, reliable, risk-adjusted data on very specific surgical procedures, including 30-day outcomes, complications, readmissions and length of stay, but data sets are not necessarily nationally representative.

  • Cases also include detailed information about comorbidities (preexisting conditions) and complications that develop during or after surgery.

  • As of 2018, more than 1,500 peer-reviewed studies used data from the NSQIP.  

  • Patient and hospital-level data is de-identified. 

Study uses

  • Often used to develop quality improvement interventions and, according to the JAMA guide, “have led to improvements in morbidity and mortality, cost savings from prevention of complications, and a platform for disease-specific, procedure-specific, or regional or system-based collaboratives.”

  • Can be used to identify or quantify risk factors based on demographics, comorbidities or types/subtypes of specific procedures (such as different types of gastric bypass surgery as opposed to gastric bypass surgery in general).

  • Can be used to study trends over time or for comparative effectiveness research.

Limitations and considerations

  • Data may be missing: read through the methods and limitations sections to see if researchers note missing data. If so, the study should note how the researchers handled the missing data, typically by imputation or sensitivity analyses.

  • Though studies using NSQIP usually have large sample sizes with highly detailed clinical information, outcomes are limited to 30 days of follow-up, and even large sample sizes may not be large enough to identify very rare outcomes or “accurately portray outcomes for rare cases.” An option to use data with oversampling of specific cases may offset this in some cases, but this is an area worth asking researchers about during interviews.

  • Some variables have been redefined, added or removed over time, so longer term studies will need to account for changes in variables or simply cannot study certain long-term trends accurately. Check the methods section to see how the researchers define specific variables and outcomes and whether these definitions remain consistent across the time period studied. 

  • Financial outcomes are not included in the database, so any cost-effectiveness studies relying on NSQIP must estimate costs from other sources.

Patient satisfaction, quality of life, patient-reported adverse effects or experiences and similar patient-reported outcomes are not included in the database. Some of these may be added in the future.
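Missing-data handling comes up repeatedly with NSQIP studies. A minimal sketch of the two approaches mentioned above, complete-case analysis versus imputation, using hypothetical BMI values (the numbers and variable are illustrative only):

```python
import statistics

# Hypothetical patient records with missing BMI values (None = missing)
bmi = [24.1, None, 31.5, 27.8, None, 22.9]

# Complete-case analysis: drop records with missing data (can introduce bias
# if the missingness is not random)
complete = [v for v in bmi if v is not None]

# Mean imputation: fill gaps with the observed mean. This is the simplest
# form; published studies more often use multiple imputation.
mean_bmi = statistics.mean(complete)
imputed = [v if v is not None else mean_bmi for v in bmi]
```

A sensitivity analysis would then re-run the study's models on both versions of the data to see whether the conclusions change.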

Society for Vascular Surgery Vascular Quality Initiative (SVS VQI) — Adapted from JAMA Surgery’s overview for researchers

The Society for Vascular Surgery set up the Vascular Quality Initiative as a patient safety organization in 2011 to assess/improve the safety and effectiveness of 12 vascular procedures:

  • carotid artery stenting

  • carotid endarterectomy

  • endovascular abdominal aortic aneurysm repair

  • hemodialysis access

  • infrainguinal bypass

  • inferior vena cava filter

  • lower extremity amputations

  • open abdominal aortic aneurysm repair

  • peripheral vascular interventions

  • suprainguinal bypass

  • thoracic and complex endovascular abdominal aortic aneurysm repair (2 separate procedures)

  • varicose vein treatment

What to know:

  • Reporting to the database is optional, and some of the participating physicians and hospitals may choose to report on some procedures but not others.

  • As of 2017, the SVS VQI included data on 390,270 procedures performed by more than 3,200 doctors at 431 US and Canadian institutions: 40% community hospitals, 29% teaching hospitals and 31% academic hospitals.

  • Each case includes the patient’s demographics and clinical characteristics as well as the physician and hospital involved. Clinical data are reported at the time of the procedure and one year later, allowing for information on mortality, additional procedures/interventions, complications and other medium-term outcomes.

  • Some electronic health record systems’ data can be matched/integrated with SVS VQI data. The database can also be linked, via patient identification information, to Medicare claims, the Social Security Death Index and other similar data sets.

Study uses

  • Primary use is safety improvement, though it could be used for comparative effectiveness research as well.

  • Expect outcomes to be reported in odds ratios or effect sizes. Studies should also include flow diagrams to clarify the patients included and excluded and the reasons. 

Limitations and considerations

  • The biggest challenge with using the SVS VQI for research is selection bias since it’s voluntary and physician/hospital self-reported, and participants may not report for all procedures. “There may also be some debate about whether participation in a quality registry leads to meaningful improvement in institutional health care quality,” the JAMA Surgery authors note.

  • Pay attention to which registries are included (which procedures) because no data are included regarding which patients were excluded from procedures—again possibly introducing selection bias.

  • The database collects information prospectively, but all analyses can only be done retrospectively. It’s not possible to draw conclusions about causation from this database.

  • The researchers should have established their hypothesis and outcomes before conducting the study; changes to primary or secondary outcomes afterward could be subject to p-hacking. “The emphasis of the article should be on practical clinical findings, not incidental statistically significant results,” the JAMA Surgery authors note.

  • The number of variables can be so vast — sometimes hundreds of variables — that the researchers should have made adjustments to account for statistical significance occurring by chance. A Bonferroni correction or similar statistical calculation can correct for this.

  • If the researchers conducted subgroup analyses, make sure the subgroups were planned ahead of time (again, an area ripe for p-hacking otherwise), and the characteristics not being studied (demographics, clinical variables, etc.) should be evenly distributed across subgroups to avoid bias.

  • For missing data, be sure the researchers have explained how they accounted for it. This database typically requires imputation of missing values — simply removing those values altogether could introduce bias — and the researchers should have explained what assumptions they used to do so. (This is another area worth asking an outside source about.)

  • The JAMA Surgery authors suggest that researchers “analyze whether the missing values are owing to an underlying bias in the data such that multiple imputation could skew the results” — something journalists can ask outside experts about.

  • The researchers should note a power calculation in their study—what number of participants/cases they needed to include to achieve statistical significance. You could ask an outside source, such as a biostatistician, whether their description of this calculation looks appropriate if other red flags suggest concern about the study.
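Several of the cautions above come down to adjusting the significance threshold when hundreds of variables are tested. A minimal sketch of the Bonferroni correction mentioned earlier, using hypothetical p-values:

```python
# Bonferroni correction: with m tests, require p < alpha / m rather than
# p < alpha, so that chance findings across many tests don't slip through.
alpha = 0.05
m = 200            # e.g., hundreds of variables examined (hypothetical)
threshold = alpha / m   # 0.00025

p_values = [0.04, 0.0004, 0.0001, 0.03]  # hypothetical test results
significant = [p for p in p_values if p < threshold]
# Only 0.0001 survives; 0.04 and 0.03 would have looked "significant"
# against the unadjusted 0.05 cutoff.
```

Other corrections (such as false discovery rate methods) are less conservative; the point for a reader is simply that *some* adjustment should be described in the methods.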

Healthcare Cost and Utilization Project National Inpatient Sample (NIS) — Adapted from JAMA Surgery overview for researchers

The federal Agency for Healthcare Research and Quality (AHRQ) runs the Healthcare Cost and Utilization Project (HCUP), which includes multiple data resources. These include all the administrative information reported to payers since 1988, pulled from the federal government, private and state data organizations and hospital associations.

One of these HCUP resources, the National Inpatient Sample (NIS), is sometimes used in surgical research even though the data it provides is administrative instead of clinical. The NIS is just one HCUP database resource, and this data primer does not include information on the others (though some of the limitations apply to all of them).

What to know:

  • Administrative data sets include diagnostic codes, procedure codes and costs — but not clinical data — for a 20% nationally representative sample of all US hospital inpatient encounters.

  • Researchers can use tools within the database itself for risk-adjustment. 

  • Data are de-identified, but it’s still possible to track a single individual over time through added features in the system that use other variables (instead of personally identifiable information) to track patients.

Study uses

  • Best used for generating hypotheses and research questions and looking for or examining trends over time — but not for conclusions related to causation.

  • HCUP databases in general “have been used to help shape policy decisions, assess the effectiveness of surgical techniques, examine disparities in surgical care, perform comparative effectiveness research and drive quality-improvement efforts,” the JAMA Surgery authors note.

  • It’s “ideal for performing basic descriptive studies, deriving national estimates, studying costs, studying rare disease, and understanding trends over time.”

Limitations and considerations

  • The NIS database was redesigned in 2012, and a major change included using a sample of discharges from all hospitals instead of all discharges from a sample of hospitals. While the new method is more nationally representative, it means comparing data before 2012 to after 2012 could be problematic.

  • Sample sizes are in the millions, so statistical significance requires a P value much lower than .05. You should be seeing a lot of zeroes after the decimal point for true statistical significance in research using this database.

  • Since the JAMA Surgery authors suggest that researchers start “with a clearly thought-out research question” and “understand the limitations of asking this question using the data available,” journalists could ask outside sources whether the question is indeed well thought out and appropriate for this data set.


  • Since the standards used for coding change over time, researchers may be limited in how they compare procedures or diagnoses across years. Most of the older data used ICD-9 coding, but more recent data use ICD-10, and ICD-11 will eventually be included as well.

  • The data were collected for billing purposes, not for clinical purposes, so any study using them to examine clinical outcomes has to take this original purpose into account (including any bias associated with doing so — something to ask outside experts about).

  • Surveillance bias — the more you look for something, the more likely you are to find it — is possible with this data set.
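The point about P values can be made concrete: with samples in the millions, even a clinically trivial difference produces a very small p-value. A stdlib-only sketch, using a standard two-proportion z-test on hypothetical complication rates:

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# A clinically trivial difference: 10.0% vs. 10.2% complication rates,
# with one million patients in each group (hypothetical numbers)
z, p = two_prop_ztest(100_000, 1_000_000, 102_000, 1_000_000)
# p comes out far below 0.0001 even though a 0.2-percentage-point
# difference would rarely change clinical practice.
```

This is why "statistically significant" and "clinically meaningful" must be assessed separately in studies built on the NIS.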

Society of Thoracic Surgeons (STS) National Database — Adapted from JAMA Surgery overview for researchers

The Society of Thoracic Surgeons (STS) National Database is a voluntary clinical registry that collects data on cardiac surgery outcomes for the purpose of quality and safety improvement. The database uses risk-adjustment to account for differences in patients and institutions when reporting mortality and morbidity rates. The rates are divided into three types of cardiothoracic surgery: adult cardiac surgery (dating back to 1989), congenital heart surgery (dating back to 1994) and general thoracic surgery (2002).

What to know:

  • The 1,119 participating healthcare centers that report (as of September 2016) are not necessarily individual hospitals but rather institutional surgical programs, of which there could be multiple within a single institution (or possibly shared across institutions). They come from all 50 states, plus 29 participating centers in 8 other countries. As of September 2016, the database had 6.1 million patient records and involved 3,100 surgeons. 

  • Despite being voluntary, participation is high: the database includes “data on 94% of Medicare beneficiaries undergoing coronary artery bypass grafting surgery and from 90% of the sites providing care to Medicare beneficiaries.”

  • The data collected primarily includes demographic info, covariates specific to the subspecialty and short-term outcomes data, including morbidity, mortality and length of stay.

Study uses

  • Can compare outcomes, including mortality and short-term morbidity, across different institutions and across types of procedures.

Limitations and considerations

  • Studies using STS database data are retrospective observational cohort studies and therefore subject to all the usual limitations of any retrospective observational study (especially confounding from unmeasured variables). The data is also collected according to the procedure a patient had, not their diagnosis.

  • The database does not provide data on characteristics of individual clinicians, long-term outcomes or patient-reported outcomes. The outcomes may also not be generalizable, depending on the population the study focuses on.

  • Any study should use multivariable regression, propensity score analysis, propensity-matching, instrumental variable analysis or some other method of adjusting for covariates. (You don’t necessarily need to know what they did or how unless it’s especially germane to your reporting, but you should at least note that they did something.)

  • “Because of the large number of patients in the database, it is easy to discover statistically significant associations that are clinically unimportant. Therefore investigators should prespecify predictors, outcomes of interest, and the definition of a clinically important difference across variables.” In other words, watch out for p-hacking or statistical findings that don’t actually change the course of medical care.

Because it’s procedure-based (and not illness/disease/condition/diagnosis-based), the database cannot be used to compare non-surgical interventions to surgical interventions; it can only compare different types of surgical procedures. However, researchers sometimes link this database to others to make comparisons with non-surgical interventions.

Veterans Affairs Surgical Quality Improvement Program (VASQIP)
— Adapted from JAMA Surgery overview for researchers

This database, located in the Veterans Administration (VA) National Surgery Office (NSO), tracks all patients who undergo surgery with the VA. Nurse data managers record the data with the primary purpose of care quality improvement. The data collection is legally mandated by Congress, includes cardiac and non-cardiac surgery and is not publicly available.

What to know:

  • Cases are tracked on eight-day cycles (up to 36 cases per cycle) to ensure none are missed, and patients are followed for 30 days post-operation, regardless of hospital length of stay.

  • Cases include data on more than 200 variables, including pre-operative condition, lab results, other procedures (up to 10), procedure start/end times, anesthesia type, presence of trainees and 30-day post-op outcomes (mortality and morbidity across 22 complications of varying severity), among others.

  • Hospital location is tracked, so outcomes can be compared across institutions.

Study uses

  • Compare post-operative complications across different institutions.

  • Identify associations between specific procedure and complications, including associations related to patient characteristics and to the procedure’s characteristics (such as duration of procedure, number of trainees present, type of anesthesia used, etc.).

  • Cases can be cross-linked to other VA databases, so it’s theoretically possible to track a patient before and after their procedure through those other databases.  

Limitations and considerations

  • Retrospective, observational.

  • Pay close attention to the specific definition of procedures and outcomes so you don’t misinterpret them (or so you can see whether the researchers misinterpreted them).

  • Sample size is large enough for statistical significance to translate to clinical significance, but when many different variables are studied, look for statistical adjustment/corrections to account for statistical significance happening by chance. Researchers can control for a lot, but controlling for too much can lead to statistical flukes.

  • With so many variables and conditions to choose from, researchers must choose very carefully and be sure the models they use fit the exposures and outcomes they are studying.

  • Patients who undergo a second operation within 30 days will have two sets of records, so researchers need to remove the second or otherwise account for duplicate or overlapping data.

Look for whether researchers conducted a “reliability adjustment” to account for natural changes over time and/or differences in hospital volume.
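One common form of reliability adjustment is shrinking each hospital's observed rate toward the overall rate in proportion to its case volume, so that a small hospital's noisy rate counts for less. A simplified sketch (the smoothing constant `k` and all rates are hypothetical illustrations, not VASQIP's actual method):

```python
def reliability_adjusted(observed_rate, n_cases, overall_rate, k=50):
    """Shrink an observed rate toward the overall rate; small-volume
    sites get pulled further toward the mean (k is a smoothing constant)."""
    weight = n_cases / (n_cases + k)
    return weight * observed_rate + (1 - weight) * overall_rate

overall = 0.02  # hypothetical 2% overall mortality across hospitals

# Both hospitals observed 5% mortality, but with very different volumes:
small = reliability_adjusted(1 / 20, 20, overall)       # 1 death in 20 cases
large = reliability_adjusted(50 / 1000, 1000, overall)  # 50 deaths in 1,000
# The small hospital's 5% shrinks to roughly 2.9%; the large hospital's
# 5% barely moves, because its rate is backed by far more cases.
```

The design intuition: a single unlucky death shouldn't brand a 20-case program as a high-mortality outlier.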

National Trauma Data Bank (NTDB)
Adapted from JAMA Surgery overview for researchers

The National Trauma Data Bank (NTDB) contains more than 7.5 million electronic records from over 900 trauma centers, making it the largest database of trauma incidents in the world. It grew out of the Major Trauma Outcomes Study that ended in 1989, and then the American College of Surgeons Committee on Trauma formed a subcommittee in 1997 to continue it.

What to know:

  • Individual hospitals voluntarily submit data every February-May.

  • The NTDB has its own inclusion/exclusion criteria for what qualifies to be reported.

  • Each trauma incident is reported independently; a person with repeated injuries can show up in the database multiple times under different records.

  • The most reliable data starts in 2007, when a National Trauma Data Standard was adopted; combining 2002-2006 data with later data is likely to be unreliable or inconsistent.

  • Data quality has improved as the number of hospitals participating in the American College of Surgeons Trauma Quality Improvement Program, begun in 2010, has increased.

Study uses

  • Most studies are retrospective, cross-sectional and matched case-control studies.

  • Commonly studied outcomes include mortality, length of hospitalization and complications, including rare injuries or outcomes because the database is sufficiently large.

  • Researchers should define outcome variables before beginning the study and should provide a rationale for using the NTDB, along with a flow diagram.

  • The database only includes data up through discharge and does not record deaths occurring after discharge; “transfers-to-hospice” may frequently be counted as deaths.

  • If researchers selected their outcome or variables based on what is statistically significant in the dataset, their study should be aimed at generating hypotheses, not finding conclusive associations.

Limitations and considerations

  • Not nationally representative because of voluntary reporting, but nearly all level I/II trauma centers currently report.

  • Does not include data on costs, lab test results or long-term outcomes.

  • Selection bias may occur with hip fractures and transferred patients because different hospitals have different protocols for these two categories.

  • Dead-on-arrival patients or those who die soon after arrival may have incomplete information on their injury and its severity.

  • Hospitals may vary in how they report injury severity scores, comorbidities and complications; some hospitals may use more sensitive imaging and report more incidental or minor findings.

  • Researchers may need to make adjustments to deal with missing data.

Big Data Source: Medicare Claims Data
—Adapted from JAMA Surgery overview for researchers 

The Centers for Medicare and Medicaid Services (CMS) is the agency responsible for managing Medicare, the health insurance program used by most people at least 65 years old in the US. The public program, which AHCJ discusses in greater detail in the Insurance Core Topic, includes four parts: hospital insurance (Part A), medical insurance (Part B), a CMS-approved private insurance called Medicare Advantage (Part C) and prescription drug coverage (Part D).

Researchers can use de-identified datasets of claims reimbursed by CMS, except claims for the private insurance in Part C.

What to know:

  • The datasets include age, birthdate, sex, race/ethnicity and place of residence and cover 70 percent of all US adults older than 64. The large size of the population allows for subgroup analyses with a greater statistical power than many other subgroup analyses may have.

  • The data is national and therefore includes a nationally representative population for those age 65 and older, including care in a wide variety of healthcare settings (surgical centers, physician offices, hospitals, nursing home care, etc.).

  • The data can be linked both to other CMS data sets — health care utilization, insurance enrollment, and clinician characteristics — and to non-CMS data, including the US Census, cancer registries (including SEER), the Social Security death index, clinician information (such as the American Hospital Association data), and other government insurance programs, such as Medicaid and Tricare.

  • Missing data is rare since hospitals and physicians cannot get paid without complete, accurate claims.

Study uses:

  • It’s possible to track patients over a long period of time in longitudinal studies on outcomes and health care utilization. This same longitudinal data can be used to reliably identify trends over time.

  • This data is excellent for comparative effectiveness studies and for comparing outcomes across different geographical areas and types of health care settings. It may also be used to assess health policy needs and effects.

Limitations and considerations:

  • Comparative effectiveness studies should adjust for selection bias, and studies comparing practice pattern or outcomes variations across clinicians should adjust for patient risk. Studies evaluating health policy should adjust for background trends occurring over time, such as general improvement over time.

  • Diagnoses are identified with International Classification of Diseases, Ninth Revision (ICD-9) or ICD-10 codes, so specifics about chronic conditions and comorbidities aren’t always available. In studies examining surgery outcomes, researchers cannot precisely, accurately determine complications using ICD codes alone.

  • The data does not include time stamps during hospital stays, so the order in which events occur during a hospital stay cannot be determined from the claims associated with a single stay.

  • Claims include only services billed and diagnoses, not vital signs, lab test results, pathology results, imaging results or other physiological information about individual patients.

  • The claims only include services covered and billed for under Medicare; information on non-covered services or any care or diagnoses sought or received elsewhere are not included.

  • Billing data cannot reliably provide information about severity of a condition, causes or comorbidities.

Military Health System Tricare Data
— Adapted from JAMA Surgery overview for researchers

Tricare is the insurance used by the Department of Defense for all active, retired and disabled (with certain conditions) military personnel and their dependents up to age 64, with two exceptions: it does not cover services by the Veterans Administration (VA) or health care provided in combat zones. About 80 percent of the 9 million nationwide beneficiaries are civilians; the other 20 percent are active military personnel.

What to know:

  • Data is generally considered diverse and nationally representative in terms of patients’ sociodemographic, vocational, educational and occupational characteristics up to age 64.

  • More generalizable than data from Medicare, private insurance or national registries.

  • Treatment may be administered at DoD healthcare centers (“direct care”) or at civilian centers (“purchased care”).

  • Covers inpatient and outpatient care, doctor fees, dental care, prescriptions and medical equipment.

  • Data includes age, voluntarily self-reported race/ethnicity, marital status, residence region based on census and the sponsor’s rank, which is sometimes used as a proxy for socioeconomic status.

  • Data includes diagnoses (based on ICD-9 and ICD-10), comorbidities, injury severity scores and length of hospital stays.

Study uses

  • Can study mortality, health care utilization, prescription drug use, comorbidity associations, quality of care and morbidity after surgeries, among other things.

  • Can often be used for long-term studies among career military personnel and their dependents.

  • Can serve as a model of single-payer universal health insurance in the US, including study of how universal care affects healthcare disparities.

Limitations and considerations

  • Data may not include extreme detail, such as vital signs and specific information about individual visits, surgeries, treatments, etc.

  • Medication misclassification or undercounting can occur because beneficiaries may purchase medication out-of-pocket and never submit a claim.

  • Payments are often in lump sums, preventing cost comparisons for specific services, encounters or surgeries.

  • Missing data may require statistical adjustment.

  • Researchers should account for whether the services were provided at DoD facilities or civilian facilities.

  • Optional self-reporting of race/ethnicity means it’s frequently a missing data point (up to 31 percent in some studies) or may not be accurate; excluding incomplete files may bias results unless adjusted for.

National Cancer Database (NCDB)
— Adapted from JAMA Surgery overview for researchers

This database relies on reporting from more than 1,500 US hospitals and provides information on more than 70 percent of all new cancer diagnoses, dating back to 1989. The American College of Surgeons Commission on Cancer and the American Cancer Society run it together.

What to know:

  • Data collected on patient characteristics, patient comorbidities, cancer staging, tumor characteristics, treatments/interventions and survival outcomes.

  • Includes healthcare “facility type” and IDs for individual facilities; facilities not submitting at least one case per year are frequently excluded.

  • Multi-year studies should use a standardized variable when looking at staging and/or tumor data.

  • Mortality data includes 30-day, 60-day and 90-day mortality rates as well as 5-year survival rates.

Limitations and considerations:

  • Smaller facilities may have fewer cases or lower volume, with implications for bias or challenges in separating signals from noise. For example, one death in a low-volume hospital will appear as a much higher mortality rate than multiple deaths in a much higher volume hospital.

  • Reported treatments are detailed but only cover the first 6 months after diagnosis; long-term treatment data or planned treatment protocols are not included.

  • Only the most significant/definitive surgical procedure is included if multiple surgeries occurred.

  • Survival rate data may be subject to lead time bias and other biases.

  • 30-day readmission data only refers to readmission to the same hospital as the first admission and therefore has high potential for bias.

  • Not population-based or generalizable beyond the facilities included, but generally representative of nationwide care because of its size.

  • Long-term studies may require time as an additional variable to be accounted for because of changes over time in available treatments, diagnoses, intervention protocols, screenings and patients.

  • Missing data is common and should be accounted for; data specific to individual sites should not be used if less than 50 percent of a particular data point is available.
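As a rough illustration of that 50-percent rule, a completeness check before using a site-specific variable might look like the sketch below. This is our own illustrative code, not anything published by the NCDB; the function name and the treatment of empty strings as missing are assumptions.

```python
def variable_is_usable(values, threshold=0.5):
    """Return True if at least `threshold` of the values are present.

    Mirrors the rule of thumb above: site-specific data shouldn't be
    used when less than 50 percent of a particular data point is
    available. None and empty strings are treated as missing here.
    """
    present = sum(1 for v in values if v is not None and v != "")
    return present / len(values) >= threshold

# A site reporting tumor grade for 3 of 4 cases clears the bar...
print(variable_is_usable(["G1", "G2", None, "G3"]))   # True
# ...but a site reporting it for only 1 of 4 does not.
print(variable_is_usable([None, "", "G2", None]))     # False
```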

Surveillance, Epidemiology, and End Results (SEER) Database
— Adapted from JAMA Surgery overview for researchers

The Centers for Disease Control and Prevention (CDC), National Cancer Institute and regional and state cancer registries collaborate on SEER, a federally funded and therefore publicly available cancer reporting system begun in 1974. The data is population-based and therefore nationally representative and generalizable because the 18 states reporting data come from all geographic regions of the US.

What to know:

  • Data includes patients of all ages, independent of insurance payer or status, reported at the local level.

  • The data covers 28 percent of the U.S. population with oversampling of racial/ethnic minorities, people born overseas and residents living below the federal poverty line.

  • Mortality statistics come from U.S. Census data, and death certificates provide data on dates and causes of death.

  • Does not include individual or family income, but does include age at diagnosis, year and place of birth, race/ethnicity, sex, marital status and the education and income of the individual’s census tract.

  • Cancer and pathology data collected includes the cancer site, stage, grade, advancement of disease, tumor markers, lymph node status, lymphovascular invasion, perineural invasion and margin status.

  • Includes method of diagnosis, surgeries and radiation and order of treatment.

Study uses:

  • Studies covering long periods of time can identify trends in cancer incidence, prevalence and survival.

  • Can study rare cancer and specific subpopulations.

  • Incidence and mortality rates require adjustment for age and are best reported as cases per 100,000 person-years.
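For context, the arithmetic behind a per-100,000-person-years rate is simple. The sketch below is illustrative only; it computes a crude rate and omits the age adjustment the bullet above calls for.

```python
def crude_rate_per_100k(cases, person_years):
    """Crude incidence rate: cases per 100,000 person-years.

    Person-years sum each person's time under observation, so 600,000
    person-years could be 60,000 people followed for 10 years each.
    """
    return cases / person_years * 100_000

# 150 new cases observed across 600,000 person-years of follow-up
print(crude_rate_per_100k(150, 600_000))  # 25.0 per 100,000 person-years
```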

Limitations and considerations:

  • Long-term trend data should stay within three predetermined cohorts—SEER-9 data from 1974, SEER-13 from 1992 and SEER-18 from 2000—or else account for changes in staging definitions, diagnostic criteria, radiation or surgery protocols, etc.

  • Missing clinical and pathologic data is common and should be noted or accounted for.

  • Does not include data on surgical approach, radiation dose, chemotherapy, hormonal therapy or immunotherapy.

  • Includes past cancer history but not cancer recurrence or comorbidities, medications or disability level.

  • Not helpful for comparative effectiveness studies since data on comorbidities, cancer recurrence and chemotherapy are unavailable.

Deaths from law enforcement

Police-associated deaths and police brutality are increasingly being reported as a public health issue, helped by the fact that the American Public Health Association has an official policy statement on the issue. It therefore helps to know where to find data on these incidents. Unfortunately, it can be difficult to find all the information a journalist might want in a convenient single place, and it’s often necessary to cobble together different statistics or data sets. If just diving into this issue for the first time, a helpful primer at Journalist’s Resource can give you the big picture along with many resources to check out. They offer a wealth of resources and data on deaths that occur in police custody in the U.S.


One study that provides a nice overview of using health care administrative datasets to track injuries resulting from police interaction, both justified and unjustified, is “Perils of police action: a cautionary tale from US data sets,” in BMJ, though it’s unfortunately behind a paywall. The data resources below include both “official” sources, such as federal agencies, as well as media-based, nonprofit or informal collections of information, so it will require a bit of picking through to find precisely what is needed for a particular story or project.

The Bureau of Justice Statistics’ Use of Force page is an official federal resource that regularly publishes reports, including the Police–Public Contact Survey (conducted every three years) and data from the Arrest-Related Deaths program. It also includes annual data from the FBI’s Law Enforcement Officers Killed and Assaulted report.

Reports on specific cities, such as Los Angeles (Harvard, May 2009) and Baltimore (federal/DOJ, August 2016), are available from the agencies or departments that conducted the report.

Interestingly, some of the better data comes from news sources. The Guardian’s investigation “The Counted,” which the paper says “revealed the true number of people killed by law enforcement” and related trends, actually led to a response from the U.S. government. The Washington Post similarly maintains a database called Fatal Force, which annually compiles data on people shot and killed by police, provides their methodology and allows anyone to download the data. Fatal Encounters is a website run by a single journalist in Reno, Nev., who attempts to track all deaths caused by law enforcement.

The Cato Institute offers a daily newsfeed recap of police misconduct reported in the media across the U.S. and provides quarterly, semi-annual and annual statistical reports and various ancillary reports.

Finally, Mapping Police Violence does exactly what it sounds like and contains several graphs and charts of police violence with an option to download the source data.

Disability statistics

The word “disability” encompasses a wide range of individuals in the U.S. For example, many people may not think of depression as a disability, yet the World Health Organization describes depression as the “leading cause of disability worldwide” in their fact sheet on depression. But “disability” can also have very precise meanings, especially when it comes to federal and state law and government programs. (Here are the Census Bureau’s definitions, for example.) The following are reliable sources of data, definitions and statistics related to disability:

Drug shortages

Drug shortages, especially shortages of cancer drugs, have driven recent news stories. Want to write one of your own? Check this list of drug shortages maintained by the FDA. The American Society of Health Systems Pharmacists maintains a separate list of drugs in short supply here.

FDA-approved drugs

Sometimes researchers study drugs that are already FDA-approved to see if they may have other uses. If you want to find out more about approved drugs (what they’re approved to treat, for example, or what their major side effects are) check out these two resources:

  • Daily Med – from the National Library of Medicine, provides drug label information for a growing list of prescription drugs.

  • Drugs@FDA is the FDA’s searchable database of currently approved drugs. It’s a great way to find out whether drugs are still on the market, who manufactures them, and in what forms and dosages they’re currently offered.

FDA data

OpenFDA is an initiative to make it easier for web developers, researchers, and the public to access large, important public health datasets collected by the agency. The FDA phased in openFDA beginning in June 2014 with millions of reports of drug adverse events and medication errors submitted to the agency from 2004 to 2013. Previously, the data was only available through difficult-to-use reports or Freedom of Information Act requests. The pilot will be expanded to include the FDA’s databases on product recalls and product labeling.
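openFDA exposes these datasets through a documented REST API at api.fda.gov. As a rough sketch of how a query is assembled (the specific search and count values here are illustrative; check the openFDA documentation for current endpoints and syntax):

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fda.gov/drug/event.json"

def build_openfda_query(search, count=None, limit=10):
    """Assemble an openFDA drug adverse event query URL.

    `search` filters records; `count` asks the API to tally a field
    instead of returning full report records.
    """
    params = {"search": search, "limit": limit}
    if count:
        params["count"] = count
    return f"{BASE_URL}?{urlencode(params)}"

# Tally the most commonly reported reactions among 2004-2013 reports
url = build_openfda_query(
    "receivedate:[20040101 TO 20131231]",
    count="patient.reaction.reactionmeddrapt.exact",
)
print(url)
```

Fetching the URL with any HTTP client returns JSON; the free tier is rate-limited, so cache responses when iterating on a story.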

Firearm deaths and injuries and mass shootings data

It can be tricky to find reliable stats related to firearms and firearm injuries and deaths, but journalists can compile a pretty good big picture by visiting several sites and pooling their data. For basic numbers, the best starting place is the CDC’s FastStats Injuries page, where you can download tables that break down firearm deaths and injuries by age and other demographics. The injury counts here, however, are an underestimate, since not all firearm-related injuries are reported (especially those where a person didn’t seek medical attention).

The FBI’s National Instant Criminal Background Check System provides data on the number of background checks that NICS conducts monthly in the U.S., but they can’t be used as a proxy for sales since a background check may occur without a sale or a private transaction may occur without a background check. It also includes the numbers for federal denials according to the reasons.

Then, the FBI’s Uniform Crime Reports offers several tabs, including “Violent Crime,” “Murder” and “Expanded Homicide,” but it’s important to note these stats are underreported and sometimes inconsistent. For example, both Florida and Alabama report the total number of homicides in their state without breaking them down by weapon or mode (the tables are downloadable in Excel format).

The Bureau of Justice Statistics offers information on background checks for transfers, stolen firearms, homicide trends and other data. For mass shootings, no state or federal agency tracks these consistently, but the Gun Violence Archive on Mass Shootings is a good place to start. It also provides a Creative Commons map and a description of methodology. Everytown for Gun Safety, the advocacy group that grew out of the Newtown tragedy at Sandy Hook Elementary School, has also posted an Analysis of Mass Shootings with some great infographics.

Finally, the overall site of Gun Violence Archive pulls together probably the most comprehensive stats you’ll find. This nonprofit, nonpartisan group compiles data from a wide range of state and federal reported data, and the site transparently describes all its methodology.

General statistics

Pew Research Center: Most journalists are aware of Pew Research Center, but they may not be aware of how much the site has to offer health journalism. On the center’s data page, anyone can download their complete datasets, divided into seven categories: U.S. Politics & Policy; Journalism & Media; Internet, Science & Tech; Religion & Public Life; Hispanic Trends; Global Attitudes & Trends; and Social & Demographic Trends. Each of these areas contains information on public attitudes that could be ripe for story ideas or for providing context in a story, such as how people view the CDC, attitudes toward a wide range of health topics, a survey on aging, Latino attitudes, environmental concerns in China, polls about the Sandwich Generation and a survey of LGBT Americans, among many others. They also provide a post walking visitors through how to download the data.

Reproductive, Maternal and Child Statistics: The best resource for any data related to reproductive health or maternal/neonatal health is the CDC page on Reproductive Health Data and Statistics. Although a wide range of resources are available on the page, one to highlight is PRAMStat, which includes a database searchable by state or topic for more than 250 child and maternal health indicators tracked in the Pregnancy Risk Assessment Monitoring System (PRAMS). Everything from prenatal care statistics to smoking in pregnancy to breastfeeding stats and more is available here. The CDC explains its surveillance of pregnancy mortality here, and data on sudden unexpected infant deaths (SUIDs) and sudden infant death syndrome (SIDS) are available here. The March of Dimes also offers a variety of data on perinatal statistics. Other resources for data on the CDC page include links for contraception, abortion, assisted reproductive technology, sexually transmitted infections and birth data, among others.

FastStats comes from the National Center for Health Statistics at the CDC. Frequently updated and easy to use, this page is an invaluable resource for reporters who need statistics for context in a flash.

Need to know how many knee replacements were performed in the U.S. this year? How about hysterectomies? Want to know how trends in procedures have changed over time? Then you need the Centers for Disease Control and Prevention’s National Hospital Discharge Survey. Be aware, though, the CDC is integrating that survey into a larger dataset that will include procedures from the emergency department and ambulatory care centers. The new National Hospital Care Survey doesn't have results yet, but when they're posted, they'll be here.

Infectious disease

Any time there is an outbreak of an infectious disease, the public generally wants to know how common it is and what their risk of getting it is. Journalists will want to stay on top of new and continuing cases as well, but to provide context to those stories, or to others about infectious diseases or vaccines, they may want information on historical incidence or trends as well. Below are resources for infectious diseases exclusively within the U.S.

Domestically, the Centers for Disease Control and Prevention maintain a National Notifiable Diseases Surveillance System (NNDSS) that tracks all Nationally Notifiable Conditions, diseases that health departments are required to report when a local case occurs. (Clicking on each disease tells you the case definition and how long it has been a notifiable condition.) This spreadsheet tells you whether a disease was reported during each of the years included. These diseases are reported for the week, month and year-to-date in each Morbidity and Mortality Weekly Report. To see how many cases have been reported during a particular month or up through the year-to-date, look for the Notifiable Diseases and Mortality Tables link for the month and year you need from 2015 or 2016. The CDC also provides MMWR summaries of cases for nationally notifiable conditions during each past year back to 1993. Be sure to read what data users should know about the National Notifiable Diseases Surveillance System before you dig into the data; definitions of key terms are available here. Other helpful information for understanding the system is here.

You can also look up state-level data on specific notifiable diseases, and other data from the NNDSS can be accessed here, part of the CDC’s overall data site. The State Health Statistic page of the MMWRs contains links to the MMWR Notifiable Diseases Data Tables, NNDSS Morbidity Tables, and Mortality Tables by week, year, and any of 122 cities. You may also find helpful information from specific data sets in the CDC’s Wide-ranging Online Data for Epidemiologic Research (WONDER).

Influenza is tracked through the extensive and granular Weekly U.S. Influenza Surveillance Report (FluView). For emerging viruses, the CDC will often set up a disease-specific page as they did for MERS, SARS and Zika, whose U.S. cases are tracked here.

You can query specific data sets for HIV/AIDS, hepatitis, sexually transmitted diseases and tuberculosis at the National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention (NCHHSTP) Atlas. The National Center for Health Statistics has links for tracking new annual tuberculosis, salmonella, Lyme and meningococcal cases as well as trends and data for AIDS/HIV, influenza, measles, pneumonia, sexually transmitted diseases, viral hepatitis and whooping cough/pertussis. Most vaccine-preventable diseases have their own pages for trends and historical data, such as a page for measles and one for pertussis. (Frustratingly, it can sometimes be difficult to find precisely the information you need. This historical table of pertussis cases by year is not accessible from the main pertussis page or the pertussis surveillance page, for some reason.)

International data

The U.S. Agency for International Development, or USAID, plays a significant role in aiding foreign governments with infrastructure support, education support, international food programs and a variety of health-related programs. The agency’s main data retrieval site provides an extensive list of databases, spreadsheets and other searchable information, including ones organized by program and by country, that can be searched individually, or journalists can search all the information at once with a search box (though the site can be buggy). The data dump is so large that it could take some time to get your bearings, so spend some time perusing their Data Resources page to gain a sense of what's available. Specific databases from the main page that could be of particular use for health journalists will be added here gradually.

Lead poisoning research

If you are reporting on the lead-contaminated water crisis in Flint, Michigan, on follow-up stories in other locations, or on any other stories related to lead exposure, it helps to have handy an overview of the facts and research related to lead exposure.

This graph from the CDC provides a nice summary of how lead levels in children’s blood have fallen from 1997 to 2014. The CDC also provides a very extensive list of the physiological effects of lead on adults and children in both chronic low-level amounts and acute high-level amounts. It’s part of the CDC’s educational case study from the Agency for Toxic Substances and Disease Registry. One such condition is encephalopathy, on which the NIH has a quick fact sheet. Among other CDC resources are pages on where lead is found, exposure routes, persons most at risk, diagnostic tests, a collection of surveillance data, prevention tips, and general information for consumers. An MMWR from the CDC discusses how low-income families and minorities are at disproportionately higher risk for exposure. Also helpful is a more concise overview on lead at the Environmental Protection Agency webpage and this fact sheet from the World Health Organization.

Some particularly relevant studies include the following: a 2000 study on the cognitive effects of blood lead levels even below 5 microg/dL, which was below the CDC’s upper limit recommendation of 10 microg/dL until the CDC lowered it to 5 microg/dL in 2012; this 2008 review on the neurodevelopmental effects of even low levels of lead exposure; a 2009 study suggesting that each additional blood lead concentration of 1 microg/dL correlates with a reduction of 1 IQ point; and this 2003 non-paywalled article focusing on “the reasons for the child's exquisite sensitivity, the behavioral effects of lead, how these effects are best measured, and the long-term outlook for the poisoned child.”

The Shorenstein Center’s Journalist’s Resource offers a nice overview summary of the effects of lead exposure and how many people the problem affects throughout the U.S. and the world. Their resource page also includes a helpful list of studies and their summaries for reporters looking for data about specific effects or specific populations.

Medical devices and equipment

Want to find out if doctors or companies have reported problems with a medical device to the FDA? Check MAUDE – Manufacturer and User Facility Device Experience.

Want to find out if a piece of hospital equipment has malfunctioned at more than one facility recently? You'll want to check MedSun.

Medications

DrugAlert.org is a comprehensive database featuring information and news alerts about potentially dangerous drugs currently on the market or previously available worldwide. The website posts information about drug recalls, side effects, and pending litigation associated with various drugs and their manufacturers.

Opioids

In the not-too-distant past, vehicle crashes were the leading cause of injury deaths in the U.S. Then gun violence began claiming more lives than cars a few years ago, and deaths from opioid overdoses soon caught up as well. Overdose deaths from opioids now lead the pack for injury deaths in the U.S., so if you haven’t already reported extensively on opioids, you inevitably will soon. The following sources of data about opioids, opioid use disorder and overdoses may help with reporting, though don’t forget to scour PubMed for incidence and prevalence studies and other research. 

U.S. and world prisons

Reporting on prisons is typically a beat for criminal justice reporters, but as more research reveals failures in prison health care systems, the mental health effects of solitary confinement and the abuses of some private, for-profit prisons, it is increasingly becoming a beat for health reporters as well. The measles outbreak in Arizona in the summer of 2016, for example, highlighted low immunization rates and inadequate rules and oversight regarding employee vaccinations. 

One of the best places to start is the U.S. Bureau of Justice Statistics, which has statistics and costs on total correctional population, prison population, jail population, probation population and parole population. All their annual surveys are archived here as well as various reports on recidivism, capital punishment, sexual assault in prison, deaths in custody and related topics.

A wealth of worldwide comparative information is available at the International Centre for Prison Studies, “an online database comprising information on prisons and the use of imprisonment around the world” that has recently merged with the Institute for Criminal Policy Research. They have a 15-page fact sheet full of big-picture stats, and their world prison briefs provide contact information for prison systems in every country in the world as well as statistics on overall prison population and rate; juvenile, female, foreign and pre-trial populations and rates; system institutions and capacity; and trends over time. They also have a section on research and publications worth perusing if you’re seeking general information or aren’t sure what you need yet.

A report from the U.S. Department of Justice offers a detailed breakdown of prison and parole/probation populations in the U.S. from 2000 through 2014, including a per-state breakdown. A National Academies Press publication provides an overview of the causes of the increase in incarceration and recommendations for addressing it (complete report here). For more than 100 graphic representations of federal, state and historical prison populations, check out the Prison Policy Initiative report on tracking state prison growth. The site offers dozens of other reports as well.

For solitary confinement stats, a very extensive 155-page report from Yale Law School updates numbers for U.S. solitary confinement/isolation (which goes by several euphemistic names); it also includes findings related to demographics, living conditions, duration of time spent in isolation and how that time is spent. A separate Yale study focused on state and federal policies related to isolation, and a report from the Government Accountability Office makes recommendations for improvements to policies within the Bureau of Prisons. A 2014 American Journal of Public Health study investigates self-harm among inmates in isolation, and the ACLU has a special report on female inmates in solitary confinement.

Additional resources are available at the Journalist’s Resource here, here (solitary confinement) and here (father incarceration’s impact on children). Looking for ideas to localize? Check out Frontline’s “Locked Up in America” series.

Number needed to treat

One of the most easily understandable ways to talk about risk is the number needed to treat (NNT). Researchers are catching on to its value, and it’s cropping up more and more often in studies. A group of enterprising docs has started collecting these stats in a searchable website. It’s a good one to bookmark if you cover medical studies.
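The NNT itself is just the reciprocal of the absolute risk reduction (ARR): if 10 percent of untreated patients have the outcome versus 6 percent of treated patients, the ARR is 0.04 and the NNT is 1/0.04 = 25, meaning 25 patients must be treated for one to benefit. A minimal sketch of that arithmetic (illustrative code, with rounding up as the conventional convention):

```python
import math

def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / absolute risk reduction, rounded up to whole patients."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("No absolute risk reduction; NNT is undefined.")
    return math.ceil(1 / arr)

# 10% event rate untreated vs. 6% treated -> ARR 0.04 -> NNT 25
print(number_needed_to_treat(0.10, 0.06))  # 25
```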

Suicide statistics

The coverage of suicide and prominent suicides in the news can have unintended consequences, such as an increase in copycat suicides, so reporters must be cognizant of the research on suicide reporting and how they can minimize that impact. Examples include not describing suicide methods in detail and not reporting on suicides unless there is a pressing news need, such as the death of a prominent person. The Poynter Institute offers a course that goes more in depth, and the Journalist’s Resource provides a good overview about reporting on suicide and the relevant research.

Just as important as the way suicide is reported on, however, is ensuring that the statistics and facts are accurate and placed in context. The following data resources can help. First, a page at Western Michigan University explains how to understand suicide data and the importance (and pitfalls) of doing so. The World Health Organization has a robust selection of data sources and databases related to suicide across the world.

The CDC page on National Suicide Statistics provides data on trends and patterns about suicide and even lets you create a map of suicide data for your area. They also have a page listing a wide range of other data sources for suicide, including a fact sheet, the National Electronic Injury Surveillance System-All Injury Program, the National Violent Death Reporting System, the National Vital Statistics System, Youth Risk Behavior Surveillance System, and the National Survey on Drug Use and Health, which began asking about suicidal thoughts and behaviors of all adults starting in 2008.

The American Association of Suicidology has a page of detailed annual reports on suicide statistics that also include breakdowns by age, gender and geography. It also has some infographics available to reproduce. Similarly, the American Foundation for Suicide Prevention has an overview of statistics along with graphs that can be adjusted to reflect each state’s data.

The Agency for Healthcare Research and Quality has some reports related to suicide (you’ll need to search for them). Mental Health America provides an overview of risk factors, general statistics and treatment for suicidality or suicidal ideation (suicidal thoughts or plans). The statistics page at Suicide Awareness Voices of Education includes extra stats on gender and age trends.

Supplements and nutrition products

It can be challenging to gather information on supplements and nutrition products, such as vitamins and minerals sold over the counter and not regulated by the FDA in the same way that approved drugs and medical devices are. The website Examine.com is an “independent and unbiased encyclopedia on supplementation and nutrition” that provides extensive information on each vitamin, mineral or other supplement you might need to look up. In addition to a basic summary, list of alternative names and recommended dosage, the site provides a “Human Effect Matrix” that goes over every possible effect/outcome the supplement might affect, what evidence does (or doesn’t) exist for those effects, and the strength of that evidence along with links to the individual studies. Each page also contains a Scientific Research tab that makes researching studies on the supplement far easier than a PubMed keyword search, and the citations are frequently several hundred items long. Although they are a commercial company, they are not affiliated with any supplement companies, instead gaining all income from three products: Examine.com Research Digest, Supplement-Goals Reference, and the Supplement Stack Guides. The summaries are compiled by editors, physicians, scientists and other experts.

Supplemental Nutrition Assistance Program (SNAP) Data

Medical studies might focus on specific populations, including Medicaid and/or lower-income populations. If reporting on one of these studies for a local market, reporters might want to try to localize the data, since the study population is likely to be either national data or regional data from a place outside the reporter’s coverage area. If the study focuses specifically on food stamps or individuals using food stamps, reporters can find state-level and county-level estimates of participation in the Supplemental Nutrition Assistance Program (SNAP). The SNAP Data System on the USDA website also includes “area estimates of total population, the number of persons in poverty, and selected socio-demographic characteristics of the population,” each measured at a specific point in time each year, along with benefit levels. The data is three to five years old but can provide an overview reporters can use to localize the data found in a study relating to lower-income populations or SNAP recipients.

Vaccines and Immunization Data

The most complete record of national and state rates of immunization coverage for each vaccine is in the National Immunization Surveys. Anyone can download the data sets in various forms for each of the most recent five years for which data is available. Adverse events occurring after vaccination are reported to the Vaccine Adverse Event Reporting System (VAERS), a passive surveillance system that allows anyone (doctors, patients, parents, other health care providers, etc.) to report any adverse event that occurred after receiving a vaccine. However, because VAERS is a passive system – it only collects information, which anyone can submit as many times as they like – it does not accurately represent “side effects” that are linked to vaccines. (It is similar to MAUDE at the FDA.) Reports may be duplicates and may reflect coincidence rather than actual side effects from vaccines. (Some reports include car accidents, for example.) A YouTube training video explains how to search the VAERS database. Reports are available as CSV or ZIP files by year dating back to 1990. An active surveillance system for vaccines is the Vaccine Safety Datalink. Research findings from the VSD are frequently published in medical studies (complete list here), and two datasets are available by public request.
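Because duplicate reports are a known limitation, deduplicating on the report ID is a sensible first step when working with the downloaded CSVs. A rough sketch with the standard library (the column names mimic the layout of the real VAERS data files, but the rows here are invented for illustration):

```python
import csv
import io

# Invented rows in the shape of a VAERS data file; VAERS_ID is the
# report identifier used in the real downloads.
sample = io.StringIO(
    "VAERS_ID,RECVDATE,STATE\n"
    "100001,01/02/1990,CA\n"
    "100001,01/02/1990,CA\n"
    "100002,01/05/1990,TX\n"
)

seen = set()
unique_reports = []
for row in csv.DictReader(sample):
    if row["VAERS_ID"] not in seen:   # keep the first copy of each report
        seen.add(row["VAERS_ID"])
        unique_reports.append(row)

print(len(unique_reports))  # 2 unique reports out of 3 rows
```

The same pattern scales to the full yearly files; just swap the in-memory sample for `open("2016VAERSDATA.csv")` (filename hypothetical).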