Court cases related to abortion continue to make their way to the Supreme Court every few years, and the issue remains as divisive as ever across the nation. Here are resources both for statistics on abortion and general research that might be relevant for stories related to state legislation, court cases and local stories.
Data on unintended pregnancy and contraception is available here, here and here at Journalist’s Resource. The partisan site Abort73 (opposing abortion) also provides a helpful list of stats with links, and Planned Parenthood has a quick fact sheet of their numbers.
Those reporting on food, farming, food-borne illnesses and related topics may find several links from the Department of Agriculture helpful. The USDA Census of Agriculture offers a “comprehensive summary of agricultural activity for the United States and for each state,” including “number of farms by size and type, inventory and values for crops and livestock, operator characteristics and much more.” Get even more specific with cropland data from the interactive National Agricultural Statistics Service CropScape page.
“Big data” datasets and databases used by researchers
We are well into the age of Big Data, in which researchers may use databases or other datasets with information on tens of thousands or even millions of individuals. It’s easy, even unwittingly, to manipulate this data and end up with “significant” results that may not actually be significant or true.
Below is a rundown of several large databases and other large datasets that researchers may use in observational studies. Knowing the characteristics of each data source may make it easier for journalists to assess whether it was an appropriate source to use or, at the least, to ask the researcher why they chose it.
If you have a suggestion for an additional database primer to include, please send your suggestion to firstname.lastname@example.org.
The National Surgical Quality Improvement Program (NSQIP) was established in 2004 by the American College of Surgeons based on similar programs in Veterans Affairs. The ACS added the pediatric NSQIP in 2008. Both databases contain detailed demographic and clinical data about surgical procedures with up to 30 days of postoperative follow-up. Approximately 700 hospitals participate in the NSQIP, resulting in the inclusion of more than 1 million cases a year, and more than 100 sites participate in the pediatric database (NSQIP-P), which includes approximately 150,000 annual cases.
What to know:
The database contains national, reliable, risk-adjusted data on very specific surgical procedures, including 30-day outcomes, complications, readmissions and length of stay, but data sets are not necessarily nationally representative.
Cases also include detailed information about comorbidities (preexisting conditions) and complications that develop during or after surgery.
As of 2018, more than 1,500 peer-reviewed studies used data from the NSQIP.
Patient and hospital-level data is de-identified.
The data are often used to develop quality improvement interventions and, according to the JAMA guide, “have led to improvements in morbidity and mortality, cost savings from prevention of complications, and a platform for disease-specific, procedure-specific, or regional or system-based collaboratives.”
Can be used to identify or quantify risk factors based on demographics, comorbidities or types/subtypes of specific procedures (such as different types of gastric bypass surgery as opposed to gastric bypass surgery in general).
Can be used to study trends over time or for comparative effectiveness research.
Limitations and considerations
Data may be missing: read through the methods and limitations sections to see if researchers note missing data. If so, the study should note how the researchers handled the missing data, typically by imputation or sensitivity analyses.
Though studies using NSQIP usually have large sample sizes with highly detailed clinical information, outcomes are limited to 30 days of follow-up, and even large sample sizes may not be large enough to identify very rare outcomes or “accurately portray outcomes for rare cases.” An option to use data with oversampling of specific cases may offset this in some cases, but this is an area worth asking researchers about during interviews.
Some variables have been redefined, added or removed over time, so longer term studies will need to account for changes in variables or simply cannot study certain long-term trends accurately. Check the methods section to see how the researchers define specific variables and outcomes and whether these definitions remain consistent across the time period studied.
Financial outcomes are not included in the database, so any cost-effectiveness studies relying on NSQIP must estimate costs from other sources.
Patient satisfaction, quality of life, patient-reported adverse effects or experiences and similar patient-reported outcomes are not included in the database. Some of these may be added in the future.
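The imputation mentioned in the missing-data point above can be sketched minimally. This is simple mean imputation with made-up values; published studies more often use multiple imputation, which also models the uncertainty introduced by filling in missing data:

```python
# Hypothetical illustration of mean imputation: replace each missing
# value with the mean of the observed values. (Real studies typically
# use multiple imputation or sensitivity analyses instead.)
values = [3.2, None, 4.1, 3.8, None, 4.4]  # None marks missing data

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)  # mean of the four observed values
imputed = [v if v is not None else mean for v in values]

print(imputed)  # missing entries replaced by the observed mean
```

Simply dropping the missing rows instead (complete-case analysis) can bias results if the data are not missing at random, which is why studies should explain their approach.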
thoracic and complex endovascular abdominal aortic aneurysm repair (2 separate procedures)
varicose vein treatment
What to know:
Reporting to the database is optional, and some of the participating physicians and hospitals may choose to report on some procedures but not others.
As of 2017, the SVS VQI included data on 390,270 procedures performed by more than 3,200 doctors at 431 US and Canadian institutions: 40% community hospitals, 29% teaching hospitals and 31% academic hospitals.
Each case includes the patient’s demographics and clinical characteristics as well as the physician and hospital involved. Clinical data are reported at the time of the procedure and one year later, allowing for information on mortality, additional procedures/interventions, complications and other medium-term outcomes.
Some electronic health record systems’ data can be matched or integrated with SVS VQI data. The database can also be linked, via patient identification information, to Medicare claims, the Social Security Death Index and other similar data sets.
Primary use is safety improvement, though it could be used for comparative effectiveness research as well.
Expect outcomes to be reported in odds ratios or effect sizes. Studies should also include flow diagrams to clarify the patients included and excluded and the reasons.
Limitations and considerations
The biggest challenge with using the SVS VQI for research is selection bias since it’s voluntary and physician/hospital self-reported, and participants may not report for all procedures. “There may also be some debate about whether participation in a quality registry leads to meaningful improvement in institutional health care quality,” the JAMA Surgery authors note.
Pay attention to which registries are included (which procedures) because no data are included regarding which patients were excluded from procedures—again possibly introducing selection bias.
The database collects information prospectively, but all analyses can only be done retrospectively. It’s not possible to draw conclusions about causation from this database.
The researchers should have established their hypothesis and outcomes before conducting the study; changes to primary or secondary outcomes afterward could be subject to p-hacking. “The emphasis of the article should be on practical clinical findings, not incidental statistically significant results,” the JAMA Surgery authors note.
The number of variables can be so vast — sometimes hundreds of variables — that the researchers should have made adjustments to account for statistical significance occurring by chance. A Bonferroni correction or similar statistical calculation can correct for this.
If the researchers conducted subgroup analyses, make sure the subgroups were planned ahead of time (otherwise, again, an area ripe for p-hacking), and check that the characteristics not being studied (demographics, clinical variables, etc.) are evenly distributed across subgroups to avoid bias.
For missing data, be sure the researchers have explained how they accounted for it. This database typically requires imputation of missing values — simply removing those values altogether could introduce bias — and the researchers should have explained what assumptions they used to do so. (This is another area worth asking an outside source about.)
The JAMA Surgery authors suggest that researchers “analyze whether the missing values are owing to an underlying bias in the data such that multiple imputation could skew the results” — something journalists can ask outside experts about.
The researchers should note a power calculation in their study: how many participants/cases they needed to include to reliably detect an effect of the size they were looking for. You could ask an outside source, such as a biostatistician, whether their description of this calculation looks appropriate if other red flags suggest concern about the study.
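Two of the checks above lend themselves to a quick sketch: the Bonferroni correction for testing many variables, and a textbook sample-size (power) calculation for comparing two proportions. All numbers below are illustrative assumptions, not drawn from any study:

```python
import math

def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction:
    divide the usual alpha by the number of tests performed."""
    return alpha / num_tests

def sample_size_two_proportions(p1: float, p2: float) -> int:
    """Approximate per-group sample size needed to detect a difference
    between two proportions at a two-sided alpha of 0.05 with 80% power
    (1.96 and 0.84 are the corresponding normal quantiles)."""
    z_alpha, z_beta = 1.96, 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Testing 200 variables: each p-value must fall below 0.05 / 200 = 0.00025
# to be called significant under Bonferroni.
print(bonferroni_threshold(0.05, 200))

# Detecting a drop in complication rate from 10% to 8% requires roughly
# 3,200 cases per group, more than many single-center studies have.
print(sample_size_two_proportions(0.10, 0.08))
```

The formulas themselves are standard; only the example inputs are invented here.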
The federal Agency for Healthcare Research and Quality runs the Healthcare Cost and Utilization Project (HCUP), which includes multiple data resources. These include all the administrative information reported to payers since 1988, pulled from the federal government, private and state data organizations and hospital associations.
One of these HCUP resources, the National Inpatient Sample (NIS), is sometimes used in surgical research even though the data it provides is administrative instead of clinical. The NIS is just one HCUP database resource, and this data primer does not include information on the others (though some of the limitations apply to all of them).
What to know:
Administrative data sets include diagnostic codes, procedure codes and costs — but not clinical data — for a 20% nationally representative sampling of all US hospital inpatient encounters.
Researchers can use tools within the database itself for risk-adjustment.
Data are de-identified, but it’s still possible to track a single individual over time through added features in the system that use other variables (instead of personally identifiable information) to track patients.
Best used for generating hypotheses and research questions and looking for or examining trends over time — but not for conclusions related to causation.
HCUP databases in general “have been used to help shape policy decisions, assess the effectiveness of surgical techniques, examine disparities in surgical care, perform comparative effectiveness research and drive quality-improvement efforts,” the JAMA Surgery authors note.
It’s “ideal for performing basic descriptive studies, deriving national estimates, studying costs, studying rare disease, and understanding trends over time.”
Limitations and considerations
The NIS database was redesigned in 2012, and a major change included using a sample of discharges from all hospitals instead of all discharges from a sample of hospitals. While the new method is more nationally representative, it means comparing data before 2012 to after 2012 could be problematic.
Sample sizes are in the millions, so statistical significance requires a P value MUCH lower than the conventional 0.05. You should be seeing a lot of zeroes after the decimal point for true statistical significance with research using this database.
Since the JAMA Surgery authors suggest that researchers start “with a clearly thought-out research question” and “understand the limitations of asking this question using the data available,” journalists could ask outside sources whether the question is indeed well thought out and appropriate for this data set.
Since coding standards change over time, researchers may be limited in how they compare procedures or diagnoses across years. Most of the original data used ICD-9 coding, but ICD-10 has since been adopted, and ICD-11 will eventually appear in reported data.
The data were collected for billing purposes, not clinical purposes, so any study of clinical outcomes has to take this original purpose into account (including any bias associated with doing so — something to ask outside experts about).
Surveillance bias — the more you look for something, the more likely you are to find it — is possible with this data set.
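The point above about P values and huge samples can be made concrete with a standard two-proportion z-test. The rates and sample sizes below are invented for illustration; the formula is the textbook pooled z-statistic:

```python
import math

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Pooled two-proportion z-test statistic (standard textbook formula)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# A 0.1-percentage-point difference in complication rates is clinically
# trivial, yet with 2 million cases per group the test statistic lands
# above 4 (p well below 0.0001): "significant" purely through scale.
z = two_proportion_z(0.051, 2_000_000, 0.050, 2_000_000)
print(round(z, 2))
```

This is why reviewers of NIS-based studies look for clinically meaningful effect sizes, not just small P values.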
The Society of Thoracic Surgeons (STS) National Database is a voluntary clinical registry that collects data on cardiac surgery outcomes for the purpose of quality and safety improvement. The database uses risk-adjustment to account for differences in patients and institutions when reporting mortality and morbidity rates. The rates are divided into three types of cardiothoracic surgery: adult cardiac surgery (dating back to 1989), congenital heart surgery (dating back to 1994) and general thoracic surgery (2002).
What to know:
The 1,119 participating healthcare centers that report (as of September 2016) are not necessarily individual hospitals but rather institutional surgical programs, of which there could be multiple within a single institution (or possibly shared across institutions). They come from all 50 states, plus 29 participating centers in 8 other countries. As of September 2016, the database had 6.1 million patient records and involved 3,100 surgeons.
Despite being voluntary, participation is high: the database includes “data on 94% of Medicare beneficiaries undergoing coronary artery bypass grafting surgery and from 90% of the sites providing care to Medicare beneficiaries.”
The data collected primarily includes demographic info, covariates specific to the subspecialty and short-term outcomes data, including morbidity, mortality and length of stay.
Can compare outcomes, including mortality and short-term morbidity, across different institutions and across types of procedures.
Limitations and considerations
Studies using STS database data are retrospective observational cohort studies and therefore subject to all the usual limitations of any retrospective observational study (especially confounding from unmeasured variables). The data is also collected according to the procedure a patient had, not their diagnosis.
The database does not provide data on characteristics of individual clinicians, long-term outcomes or patient-reported outcomes. The outcomes may also not be generalizable, depending on the population the study focuses on.
Any study should use multivariable regression, propensity score analysis, propensity-matching, instrumental variable analysis or some other method of adjusting for covariates. (You don’t necessarily need to know what they did or how unless it’s especially germane to your reporting, but you should at least note that they did something.)
“Because of the large number of patients in the database, it is easy to discover statistically significant associations that are clinically unimportant. Therefore investigators should prespecify predictors, outcomes of interest, and the definition of a clinically important difference across variables.” In other words, watch out for p-hacking or statistical findings that don’t actually change the course of medical care.
Because it’s procedure-based (and not illness/disease/condition/diagnosis-based), the database cannot be used to compare non-surgical interventions to surgical interventions. It can only compare different types of surgical procedures. However, researchers sometimes link this database to others to do comparisons with non-surgical interventions.
This database, located in the Veterans Administration (VA) National Surgery Office (NSO), tracks all patients who undergo surgery within the VA system. Nurse data managers record the data with the primary purpose of care quality improvement. The data collection is legally mandated by Congress, includes cardiac and non-cardiac surgery and is not publicly available.
What to know:
Cases are tracked on eight-day cycles (up to 36 cases per cycle) to ensure none are missed, and patients are followed for 30 days post-operation, regardless of hospital length of stay.
Cases include data on more than 200 variables, including pre-operative condition, lab results, other procedures (up to 10), procedure start/end times, anesthesia type, presence of trainees and 30-day post-op outcomes (mortality, morbidity of 22 complications of varying severity), among others.
Hospital location is tracked, so outcomes can be compared across institutions.
Compare post-operative complications across different institutions.
Identify associations between specific procedure and complications, including associations related to patient characteristics and to the procedure’s characteristics (such as duration of procedure, number of trainees present, type of anesthesia used, etc.).
Cases can be cross-linked to other VA databases, so it’s theoretically possible to track a patient before and after their procedure through those other databases.
Limitations and considerations
Pay close attention to the specific definition of procedures and outcomes so you don’t misinterpret them (or so you can see whether the researchers misinterpreted them).
Sample size is large enough for statistical significance to translate to clinical significance, but when many different variables are studied, look for statistical adjustment/corrections to account for statistical significance happening by chance. Researchers can control for a lot, but controlling for too much can lead to statistical flukes.
With so many variables and conditions to choose from, researchers must choose very carefully and be sure the models they use fit the exposures and outcomes they are studying.
Patients who undergo a second operation within 30 days will have two sets of records, so researchers need to remove the second or otherwise account for duplicate or overlapping data.
Look for whether researchers conducted a “reliability adjustment” to account for natural changes over time and/or differences in hospital volume.
The National Trauma Data Bank (NTDB) contains more than 7.5 million electronic records from over 900 trauma centers, making it the largest database of trauma incidents in the world. It grew out of the Major Trauma Outcomes Study that ended in 1989, and then the American College of Surgeons Committee on Trauma formed a subcommittee in 1997 to continue it.
What to know:
Individual hospitals voluntarily submit data each year between February and May.
The NTDB has its own inclusion/exclusion criteria for what qualifies to be reported.
Each trauma incident is reported independently; a person with repeated injuries can show up in the database multiple times under different records.
The most reliable data starts in 2007, when a National Trauma Data Standard was adopted; combining 2002-2006 data with later data is likely to produce unreliable or inconsistent results.
Data quality has improved as the number of hospitals participating in the American College of Surgeons Trauma Quality Improvement Program, begun in 2010, has increased.
Most studies are retrospective, cross-sectional and matched case-control studies.
Commonly studied outcomes include mortality, length of hospitalization and complications; the database is large enough to support study of even rare injuries and outcomes.
Researchers should define outcome variables before beginning the study and should provide a rationale for using the NTDB along with a flow diagram.
The database only includes data up through discharge and does not record deaths occurring after discharge; “transfers-to-hospice” may frequently be counted as deaths.
If researchers select their outcomes or variables based on what is statistically significant in the dataset, their study should be aimed at generating hypotheses, not finding conclusive associations.
Limitations and considerations
Not nationally representative because of voluntary reporting, but nearly all level I/II trauma centers currently report.
Does not include data on costs, lab test results or long-term outcomes.
Selection bias may occur with hip fractures and transferred patients because different hospitals have different protocols for these two categories.
Dead-on-arrival patients or those who die soon after arrival may have incomplete information on their injury and its severity.
Hospitals may vary in how they report injury severity scores, comorbidities and complications; some hospitals may use more sensitive imaging and report more incidental or minor findings.
Researchers may need to make adjustments to deal with missing data.
The Centers for Medicare and Medicaid Services (CMS) is the agency responsible for managing Medicare, the health insurance program used by most people at least 65 years old in the US. The public program, which AHCJ discusses in greater detail in the Insurance Core Topic, includes four parts: hospital insurance (Part A), medical insurance (Part B), a CMS-approved private insurance called Medicare Advantage (Part C) and prescription drug coverage (Part D).
Researchers can use de-identified datasets of claims reimbursed by CMS, except claims for the private insurance in Part C.
What to know:
The datasets include age, birthdate, sex, race/ethnicity and place of residence and cover 70 percent of all US adults older than 64. The large size of the population allows for subgroup analyses with a greater statistical power than many other subgroup analyses may have.
The data is national and therefore includes a nationally representative population for those age 65 and older, including care in a wide variety of healthcare settings (surgical centers, physician offices, hospitals, nursing home care, etc.).
The data can be linked both to other CMS data sets — health care utilization, insurance enrollment, and clinician characteristics — and to non-CMS data, including the US Census, cancer registries (including SEER), the Social Security Death Index, clinician information (such as the American Hospital Association data), and other government insurance programs, such as Medicaid and Tricare.
Missing data is rare since hospitals and physicians cannot get paid without complete, accurate claims.
It’s possible to track patients over a long period of time in longitudinal studies on outcomes and health care utilization. The same longitudinal data can be used to reliably identify trends over time.
This data is excellent for comparative effectiveness studies and for comparing outcomes across different geographical areas and types of health care settings. It may also be used to assess health policy needs and effects.
Limitations and considerations:
Comparative effectiveness studies should adjust for selection bias, and studies comparing practice pattern or outcomes variations across clinicians should adjust for patient risk. Studies evaluating health policy should adjust for background trends occurring over time, such as general improvement over time.
Diagnoses are identified with International Classification of Diseases, Ninth Revision (ICD-9) or ICD-10 codes, so specifics about chronic conditions and comorbidities aren’t always available. In studies examining surgery outcomes, researchers cannot precisely, accurately determine complications using ICD codes alone.
The data does not include time stamps during hospital stays, so the order in which events occur during a hospital stay cannot be determined from the claims associated with a single stay.
Claims include only services billed and diagnoses, not vital signs, lab test results, pathology results, imaging results or other physiological information about individual patients.
The claims only include services covered and billed for under Medicare; information on non-covered services or any care or diagnoses sought or received elsewhere are not included.
Billing data cannot reliably provide information about severity of a condition, causes or comorbidities.
Tricare is the insurance used by the Department of Defense for all active, retired and disabled (with certain conditions) military personnel and their dependents up to age 64, with two exceptions: it does not cover services by the Veterans Administration (VA) or health care provided in combat zones. About 80 percent of the 9 million nationwide beneficiaries are civilians, and the other 20 percent are active military personnel.
What to know:
Data is generally considered diverse and nationally representative in terms of patients’ sociodemographic, vocational, educational and occupational characteristics up to age 64.
More generalizable than data from Medicare, private insurance or national registries.
Treatment may be administered at DoD healthcare centers (“direct care”) or at civilian centers (“purchased care”).
Covers inpatient and outpatient care, doctor fees, dental care, prescriptions and medical equipment.
Data includes age, voluntarily self-reported race/ethnicity, marital status, residence region based on census and the sponsor’s rank, which is sometimes used as a proxy for socioeconomic status.
Data includes diagnoses (based on ICD-9 and ICD-10), comorbidities, injury severity scores and length of hospital stays.
Can study mortality, health care utilization, prescription drug use, comorbidity associations, quality of care and morbidity after surgeries, among other things.
Can often be used for long-term studies among career military personnel and their dependents.
Can serve as a model of single-payer universal health insurance in the US, including study of how universal care affects healthcare disparities.
Limitations and considerations
Data may not include extreme detail, such as vital signs and specific information about individual visits, surgeries, treatments, etc.
Medication misclassification or undercounting can occur because beneficiaries may purchase medication out-of-pocket and never submit a claim.
Payments are often in lump sums, preventing cost comparisons for specific services, encounters or surgeries.
Missing data may require statistical adjustment.
Researchers should account for whether the services were provided at DoD facilities or civilian facilities.
Optional self-reporting of race/ethnicity means it’s frequently a missing data point (up to 31 percent in some studies) or may not be accurate; excluding incomplete files may bias results unless adjusted for.
This database, the National Cancer Database (NCDB), relies on reporting from more than 1,500 US hospitals and provides information on more than 70 percent of all new cancer diagnoses, dating back to 1989. The American College of Surgeons Commission on Cancer and the American Cancer Society run it together.
What to know:
Data collected on patient characteristics, patient comorbidities, cancer staging, tumor characteristics, treatments/interventions and survival outcomes.
Includes healthcare “facility type” and IDs for individual facilities; facilities not submitting at least one case per year are frequently excluded.
Multi-year studies should use a standardized variable when looking at staging and/or tumor data.
Mortality data includes 30-day, 60-day and 90-day mortality rates as well as 5-year survival rates.
Limitations and considerations:
Smaller facilities may have fewer cases or lower volume, making it harder to separate signal from noise and introducing potential bias. For example, one death in a low-volume hospital will appear as a much higher mortality rate than multiple deaths in a much higher-volume hospital.
Reported treatments are detailed but only cover the first 6 months after diagnosis; long-term treatment data or planned treatment protocols are not included.
Only the most significant/definitive surgical procedure is included if multiple surgeries occurred.
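The arithmetic behind the low-volume example above is worth making explicit (the hospital sizes here are hypothetical):

```python
# One death at a 20-case hospital looks ten times worse than five deaths
# at a 1,000-case hospital, even though the larger hospital lost more
# patients in absolute terms.
small_rate = 1 / 20 * 100     # mortality rate at the low-volume hospital, in percent
large_rate = 5 / 1000 * 100   # mortality rate at the high-volume hospital, in percent
print(small_rate, large_rate)
```

This is why studies comparing facilities often use risk- and reliability-adjusted rates rather than raw percentages.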
The Centers for Disease Control and Prevention (CDC), the National Cancer Institute and regional and state cancer registries collaborate on the Surveillance, Epidemiology, and End Results (SEER) program, a federally funded and therefore publicly available cancer reporting system begun in 1974. The data is population-based and considered nationally representative and generalizable because the 18 reporting registries come from all geographic regions of the US.
What to know:
Data includes patients of all ages, independent of insurance payer or status, reported at the local level.
The data covers 28 percent of the U.S. population with oversampling of racial/ethnic minorities, people born overseas and residents living below the federal poverty line.
Mortality statistics come from U.S. Census data, and death certificates provide data on dates and causes of death.
Does not include individual or family income, but does include age at diagnosis, year and place of birth, race/ethnicity, sex, marital status and the education and income of the individual’s census tract.
Cancer and pathology data collected includes the cancer site, stage, grade, advancement of disease, tumor markers, lymph node status, lymphovascular invasion, perineural invasion and margin status.
Includes method of diagnosis, surgeries and radiation and order of treatment.
Studies covering long periods of time can identify trends in cancer incidence, prevalence and survival.
Can study rare cancer and specific subpopulations.
Incidence and mortality rates require adjustment for age and are best reported as cases per 100,000 person-years.
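A quick sketch of the rate calculation just described (the case count and follow-up time below are invented for illustration):

```python
def rate_per_100k(cases: int, person_years: float) -> float:
    """Incidence or mortality rate expressed per 100,000 person-years:
    divide cases by total follow-up time, then scale to 100,000."""
    return cases / person_years * 100_000

# 450 cases observed over 3 million person-years of follow-up works out
# to a rate of 15 per 100,000 person-years.
print(rate_per_100k(450, 3_000_000))
```

Note this sketch omits the age adjustment the bullet above calls for; age-adjusted rates additionally weight age-specific rates by a standard population.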
Limitations and considerations:
Long-term trend data should stay within three predetermined cohorts—SEER-9 data from 1974, SEER-13 from 1992 and SEER-18 from 2000—or else account for changes in staging definitions, diagnostic criteria, radiation or surgery protocols, etc.
Missing clinical and pathologic data is common and should be noted or accounted for.
Does not include data on surgical approach, radiation dose, chemotherapy, hormonal therapy or immunotherapy.
Includes past cancer history but not cancer recurrence or comorbidities, medications or disability level.
Not helpful for comparative effectiveness studies since data on comorbidities, cancer recurrence and chemotherapy are unavailable.
One study that provides a nice overview of using health care administrative datasets to track injuries resulting from police interaction, both justified and unjustified, is “Perils of police action: a cautionary tale from US data sets,” in BMJ, though it’s unfortunately behind a paywall. The data resources below include both “official” sources, such as federal agencies, as well as media-based, nonprofit or informal collections of information, so it will require a bit of picking through to find precisely what is needed for a particular story or project.
Reports on specific cities, such as Los Angeles (Harvard, May 2009) and Baltimore (federal/DOJ, August 2016), are available from the agencies or departments that conducted the report.
Interestingly, some of the better data comes from news sources. The Guardian’s investigation “The Counted,” as they state it, “revealed the true number of people killed by law enforcement” and related trends, actually leading to a response from the U.S. government. The Washington Post similarly maintains a database called Fatal Force, which annually compiles data on people shot and killed by police, provides their methodology and allows anyone to download the data. Fatal Encounters is a website run by a single journalist in Reno, Nev., who attempts to track all deaths caused by law enforcement.
The Cato Institute offers a daily newsfeed recap of police misconduct reported in the media across the U.S. and provides quarterly, semi-annual and annual statistical reports and various ancillary reports.
Finally, Mapping Police Violence does exactly what it sounds like and contains several graphs and charts of police violence with an option to download the source data.
Drug shortages, especially shortages of cancer drugs, have driven recent news stories. Want to write one of your own? Check this list of drug shortages maintained by the FDA. The American Society of Health Systems Pharmacists maintains a separate list of drugs in short supply here.
Sometimes researchers study drugs that are already FDA-approved to see if they may have other uses. If you want to find out more about approved drugs (what they’re approved to treat, for example, or what their major side effects are) check out these two resources:
DailyMed – from the National Library of Medicine, provides drug label information for a growing list of prescription drugs.
Drugs@FDA is the FDA’s searchable database of currently approved drugs. It’s a great way to find out whether drugs are still on the market, who manufactures them, and in what forms and dosages they’re currently offered.
OpenFDA is an initiative to make it easier for web developers, researchers, and the public to access large, important public health datasets collected by the agency. The FDA phased in openFDA beginning in June 2014 with millions of reports of drug adverse events and medication errors that were submitted to the FDA from 2004 to 2013. Previously, the data was available only through difficult-to-use reports or Freedom of Information Act requests. The pilot will be expanded to include the FDA’s databases on product recalls and product labeling.
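For readers who want to try openFDA themselves: the sketch below builds a query URL for the drug adverse event endpoint (api.fda.gov/drug/event.json) without making a network call. The endpoint and `search`/`limit` parameters come from openFDA’s public documentation, though the field name used here should be verified against the current API reference before relying on it.

```python
from urllib.parse import urlencode

# openFDA's drug adverse event endpoint (documented at open.fda.gov)
OPENFDA_DRUG_EVENTS = "https://api.fda.gov/drug/event.json"

def build_event_query(drug_name, limit=10):
    """Build an openFDA query URL for adverse event reports mentioning a drug.

    The field name `patient.drug.medicinalproduct` is taken from openFDA's
    published adverse event schema; confirm it against the live API docs.
    """
    params = {
        "search": f'patient.drug.medicinalproduct:"{drug_name}"',
        "limit": limit,
    }
    return f"{OPENFDA_DRUG_EVENTS}?{urlencode(params)}"

url = build_event_query("ASPIRIN", limit=5)
print(url)
```

Fetching the URL (with any HTTP client) returns JSON with a `results` list of individual reports; remember that adverse event reports are unverified submissions, not confirmed side effects.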
Firearm deaths and injuries and mass shootings data
It can be tricky to find reliable stats related to firearms and firearm injuries and deaths, but journalists can compile a pretty good big picture by visiting several sites and pooling their data. For basic numbers, the best starting place is the CDC’s FastStats Injuries page, where you can download tables that break down firearm deaths and injuries by age and other demographics. The injury figures here, however, are an underestimate, since not all firearm-related injuries (especially those where a person didn’t seek medical attention) are reported.
The FBI’s National Instant Criminal Background Check System provides data on the number of background checks that NICS conducts monthly in the U.S., but these can’t be used as a proxy for sales, since a background check may occur without a sale or a private transaction may occur without a background check. It also includes the number of federal denials, broken down by reason.
Then, the FBI’s Uniform Crime Reports offers several tabs, including “Violent Crime,” “Murder” and “Expanded Homicide” (tables are downloadable as Excel files), but it’s important to note these stats are underreported and sometimes inconsistent. For example, both Florida and Alabama report the total number of homicides in their state without breaking them down by weapon or mode.
The Bureau of Justice Statistics offers information on background checks for transfers, stolen firearms, homicide trends and other data. For mass shootings, no state or federal agency tracks these consistently, but the Gun Violence Archive on Mass Shootings is a good place to start. It also provides a Creative Commons map and a description of methodology. Everytown for Gun Safety, the advocacy group that grew out of the Newtown tragedy at Sandy Hook Elementary School, has also posted an Analysis of Mass Shootings with some great infographics.
Finally, the overall site of Gun Violence Archive pulls together probably the most comprehensive stats you’ll find. This nonprofit, nonpartisan group compiles data from a wide range of state and federal reported data, and the site transparently describes all its methodology.
Reproductive, Maternal and Child Statistics: The best resource for any data related to reproductive health or maternal/neonatal health is the CDC page on Reproductive Health Data and Statistics. Although a wide range of resources is available on the page, one to highlight is PRAMStat, a database searchable by state or topic for more than 250 child and maternal health indicators tracked in the Pregnancy Risk Assessment Monitoring System (PRAMS). Everything from prenatal care statistics to smoking in pregnancy to breastfeeding stats and more is available here. The CDC explains its surveillance of pregnancy mortality here, and data on sudden unexpected infant deaths (SUIDs) and sudden infant death syndrome (SIDS) are available here. The March of Dimes also offers a variety of data on perinatal statistics. Other resources for data on the CDC page include links for contraception, abortion, assisted reproductive technology, sexually transmitted infections and birth data, among others.
FastStats comes from the National Center for Health Statistics at the CDC. Frequently updated and easy to use, this page is an invaluable resource for reporters who need statistics for context in a flash.
Need to know how many knee replacements were performed in the U.S. this year? How about hysterectomies? Want to know how trends in procedures have changed over time? Then you need the Centers for Disease Control and Prevention's National Hospital Discharge Survey. Be aware, though, that the CDC is integrating that survey into a larger dataset that will include procedures from the emergency department and ambulatory care centers. The new National Hospital Care Survey doesn't have results yet, but when they're posted, they'll be here.
Any time there is an outbreak of an infectious disease, the public generally wants to know how common it is and what their risk of getting it is. Journalists will want to stay on top of new and continuing cases as well, but to provide context to those stories, or to others about infectious diseases or vaccines, they may want information on historical incidence or trends as well. Below are resources for infectious diseases exclusively within the U.S.
Domestically, the Centers for Disease Control and Prevention maintain a National Notifiable Diseases Surveillance System (NNDSS) that tracks all Nationally Notifiable Conditions, diseases that health departments are required to report when they have a local case. (Clicking on each disease tells you the case definition and how long it has been a notifiable condition.) This spreadsheet tells you whether a disease was reported during each of the years included. These diseases are reported for the week, month and year-to-date in each Morbidity and Mortality Weekly Report. To see how many cases have been reported during a particular month or through the year-to-date, look for the Notifiable Diseases and Mortality Tables link for the month and year you need from 2015 or 2016. The CDC also provides MMWR summaries of cases for nationally notifiable conditions during each past year back to 1993. Be sure to read what data users should know about the National Notifiable Diseases Surveillance System before you dig into the data; definitions of key terms are available here, and other helpful information for understanding the system is here.
The U.S. Agency for International Development, or USAID, plays a significant role in aiding foreign governments with infrastructure support, education support, international food programs and a variety of health-related programs. Its main data retrieval site provides an extensive list of databases, spreadsheets and other searchable information, including resources organized by program and by country, that can be searched individually, or journalists can search all the information at once with a search box (though the site can be buggy). The data dump is so large that it could take some time to familiarize yourself with it, so spend some time perusing the Data Resources page to gain a sense of what's available. Specific databases from the main page that could be of particular use for health journalists will be added here gradually.
Lead poisoning research
If you are reporting on the lead-contaminated water crisis in Flint, Michigan, on follow-up stories in other locations, or on any other stories related to lead exposure, it helps to have handy an overview of the facts and research related to lead exposure.
The Shorenstein Center’s Journalist’s Resource offers a nice overview summary of the effects of lead exposure and how many people the problem affects throughout the U.S. and the world. Its resource page also includes a helpful list of studies and their summaries for reporters looking for data about specific effects or specific populations.
Medical devices and equipment
Want to find out if doctors or companies have reported problems with a medical device to the FDA? Check MAUDE – Manufacturer and User Facility Device Experience.
Want to find out if a piece of hospital equipment has malfunctioned at more than one facility recently? You'll want to check MedSun.
DrugAlert.org is a comprehensive database featuring information and news alerts about potentially dangerous drugs currently on the market or previously available worldwide. The website posts information about drug recalls, side effects, and pending litigation associated with various drugs and their manufacturers.
U.S. and world prisons
Reporting on prisons is typically a beat for criminal justice reporters, but as more research reveals failures in prison health care systems, the mental health effects of solitary confinement and the abuses of some private, for-profit prisons, it is increasingly becoming a beat for health reporters as well. The measles outbreak in Arizona in the summer of 2016, for example, highlighted low immunization rates and inadequate rules and oversight regarding employee vaccinations.
One of the best places to start is the U.S. Bureau of Justice Statistics, which has statistics and costs on total correctional population, prison population, jail population, probation population and parole population. All their annual surveys are archived here as well as various reports on recidivism, capital punishment, sexual assault in prison, deaths in custody and related topics.
A wealth of worldwide comparative information is available at the International Centre for Prison Studies, “an online database comprising information on prisons and the use of imprisonment around the world” that has recently merged with the Institute for Criminal Policy Research. They have a 15-page fact sheet full of big-picture stats, and their world prison briefs provide contact information for prison systems in every country in the world as well as statistics on overall prison population and rate; juvenile, female, foreign and pre-trial populations and rates; system institutions and capacity; and trends over time. They also have a section on research and publications worth perusing if you’re seeking general information or aren’t sure what you need yet.
A report from the U.S. Department of Justice offers a detailed breakdown of prison and parole/probation populations in the U.S. from 2000 through 2014, including a per-state breakdown. A National Academies Press publication provides an overview of the causes of the increase in incarceration and recommendations for addressing it (complete report here). For more than 100 graphic representations of federal, state and historical prison populations, check out the Prison Policy Initiative report on tracking state prison growth. The site offers dozens of other reports as well.
Additional resources are available at the Journalist’s Resource here, here (solitary confinement) and here (father incarceration’s impact on children). Looking for ideas to localize? Check out Frontline’s “Locked Up in America” series.
Number needed to treat
One of the most easily understandable ways to talk about risk is the number needed to treat (NNT). Researchers are catching on to its value, and it’s cropping up more and more often in studies. A group of enterprising docs has started collecting these stats in a searchable website. It’s a good one to bookmark if you cover medical studies.
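For reporters who want to compute the NNT themselves from a study's reported event rates, the arithmetic is simple: the NNT is 1 divided by the absolute risk reduction (the control group's event rate minus the treatment group's event rate), conventionally rounded up to the next whole patient. A minimal sketch:

```python
import math

def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / absolute risk reduction (ARR).

    Event rates are proportions (0.20 means 20% of that group had the
    outcome). NNT is conventionally rounded up to a whole patient.
    """
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        # Treatment showed no benefit; NNT is undefined (a "number
        # needed to harm" calculation may apply instead).
        raise ValueError("no absolute risk reduction")
    return math.ceil(1 / arr)

# Example: outcome occurs in 20% of controls vs. 15% of treated patients,
# so 20 patients must be treated to prevent one additional outcome.
print(number_needed_to_treat(0.20, 0.15))  # → 20
```

A small NNT (close to 1) means nearly every patient treated benefits; a large NNT means most patients see no benefit from the treatment for that particular outcome.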
The coverage of suicide and prominent suicides in the news can have unintended consequences—such as an increase in copycat suicides—so reporters must be cognizant of the research on suicide reporting and how they can minimize that impact. Examples include not describing suicide methods in detail and not reporting on suicides unless there is a pressing news need, such as the death of a prominent person. The Poynter Institute offers a course that goes more into depth, and the Journalist’s Resource provides a good overview about reporting on suicide and the relevant research.
Just as important as the way suicide is reported on, however, is that the statistics and facts are accurate and placed in context. The following data resources can help. First, a page at Western Michigan University explains how to understand suicide data and the importance (and pitfalls) of doing so. The World Health Organization has a robust selection of data sources and databases related to suicide across the world.
It can be challenging to gather information on supplements and nutrition products, such as vitamins and minerals sold over the counter and not regulated by the FDA in the same way that approved drugs and medical devices are. The website Examine.com is an “independent and unbiased encyclopedia on supplementation and nutrition” that provides extensive information on each vitamin, mineral or other supplement you might need to look up. In addition to a basic summary, list of alternative names and recommended dosage, the site provides a “Human Effect Matrix” that goes over every possible effect/outcome the supplement might affect, what evidence does (or doesn’t) exist for those effects, and the strength of that evidence along with links to the individual studies. Each page also contains a Scientific Research tab that makes researching studies on the supplement far easier than a PubMed keyword search, and the citations are frequently several hundred items long. Although they are a commercial company, they are not affiliated with any supplement companies, instead gaining all income from three products: Examine.com Research Digest, Supplement-Goals Reference, and the Supplement Stack Guides. The summaries are compiled by editors, physicians, scientists and other experts.
Supplemental Nutrition Assistance Program (SNAP) Data
Medical studies might focus on specific populations, including Medicaid and/or lower-income populations. If reporting on one of these studies for a local market, reporters might want to localize the data, since the study population is likely to be either national data or regional data from a place outside the reporter’s coverage area. If the study focuses specifically on food stamps or individuals using food stamps, reporters can find state-level and county-level estimates of participation in the Supplemental Nutrition Assistance Program (SNAP). The SNAP Data System on the USDA website also includes “area estimates of total population, the number of persons in poverty, and selected socio-demographic characteristics of the population,” each for a specific point in time each year and including benefit levels. The data is three to five years old but can provide an overview reporters can use to localize the data found in a study relating to lower-income populations or SNAP recipients.
Vaccines and Immunization Data
The most complete record of national and state rates of immunization coverage for each vaccine is in the National Immunization Surveys. Anyone can download the data sets in various forms for each of the most recent five years for which data is available. Adverse events occurring after vaccination are reported to the Vaccine Adverse Event Reporting System (VAERS), a passive surveillance system available for anyone (doctors, patients, parents, other health care providers, etc.) to report any adverse event that occurred after receiving a vaccine. However, because VAERS is a passive system – it only collects information, which anyone can submit as many times as they like – it does not accurately represent “side effects” that are linked to vaccines. (It is similar to MAUDE at the FDA.) Reports may be duplicates and may be coincidental or actual side effects from vaccines. (Some reports include car accidents, for example.) A YouTube training video explains how to search the VAERS database. Reports are available as CSV or ZIP files by year dating to 1990. An active surveillance system for vaccines is the Vaccine Safety Datalink. Research findings from the VSD are frequently published in medical studies (complete list here), and two datasets are available by public request.
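Since the VAERS downloads are plain CSV files, a few lines of code are enough to start tallying reports. The sketch below uses a tiny inline sample standing in for a downloaded VAERS data file; the column names shown are assumptions modeled on the public VAERS files and should be checked against the current VAERS data use guide before use.

```python
import csv
import io

# Tiny inline sample standing in for a downloaded VAERS CSV file.
# Column names (VAERS_ID, STATE, AGE_YRS, DIED) are assumed here;
# verify them against the actual file and the VAERS data use guide.
sample = """VAERS_ID,STATE,AGE_YRS,DIED
0001,CA,34,
0002,TX,67,Y
0003,CA,2,
"""

def count_reports_by_state(csv_text):
    """Tally the number of VAERS reports per state.

    Note: VAERS is passive surveillance, so these are raw report
    counts (possibly duplicated or coincidental), not confirmed
    vaccine side effects.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        state = row["STATE"]
        counts[state] = counts.get(state, 0) + 1
    return counts

print(count_reports_by_state(sample))  # → {'CA': 2, 'TX': 1}
```

For a real story, the same loop would run over the downloaded yearly file (opened with `csv.DictReader` directly), and any counts should carry the passive-surveillance caveat described above.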