Key concepts

Acute vs. chronic conditions

Blinding (or masking)

Bonferroni correction

Clinical significance



Correlation vs. causation

Disease elimination vs. eradication

Dose response

Effectiveness vs. efficacy


False balance

Hill Criteria for Evaluating Observational Studies

Impact factor

Incidence and prevalence

Ingelfinger rule

Institutional Review Boards

In vitro vs. in vivo (and in silico)



P value

Phases of clinical testing

Placebo and nocebo effect

Quality-adjusted life-year

Reporting biases


Run-in phase

Sensitivity analysis

Sex vs. gender

Social desirability bias

Specificity and sensitivity

Screening vs. diagnostic tests

Self-controlled case series

Statistical significance

Surrogate endpoints

Acute vs. chronic conditions

The simplest explanation of the difference between acute and chronic conditions is that acute conditions are short-term while chronic conditions are long-term. However, these two ways of categorizing an illness, disease, pain, or other condition involve many other differences aside from duration of the condition. Acute illnesses are those that tend to have a definitive start and end (the patient or their physician can identify when the condition started and when it has stopped). They also generally affect one or a few specific, identifiable body parts, organs, or systems. Most of the time, acute illnesses respond to medication or other treatments, or they resolve on their own over time (e.g., a broken bone healing, a common cold infection eventually defeated by the immune system, etc.). The causes of acute infections or conditions are also usually pretty straightforward.

Chronic illnesses, in addition to affecting a person over a longer period of time, tend to be more complex in general and may not have easily identifiable or isolated causes, or they may have multiple causes. Chronic conditions often involve multiple body areas, organs or systems, or even the entire body. They may or may not respond to treatment, or treatment options for them may not exist. Chronic conditions may require multiple treatment strategies or shifts in treatment. Cancer, for example, is a chronic condition that can affect up to the entire body and may require radiation, surgery, and/or multiple drugs, and physicians may need to switch strategies over the course of the illness. (An Ebola virus infection, however, is acute even though it may affect the entire body.) Treatment of chronic conditions, then, is typically more complex than that of acute conditions and may focus on management, quality of life, self-care and coping skills. With acute conditions, the goal is a cure or healing.

Broadly speaking, chronic illnesses tend to have a greater impact on quality of life, but an acute illness can certainly have intense short-term effects on quality of life and can result in a chronic condition. For example, measles is an acute disease (with a seriously unpleasant short-term experience), but if complications from measles cause deafness, then the hearing loss becomes a chronic condition requiring management, even if that management is occupational therapy to learn new ways of living. Or an acute condition may cause a temporary chronic one, such as frequent coughing over several months caused by a pertussis infection, even if the active pertussis infection resolves within the first month.

Blinding (or masking)

When researchers conduct clinical trials that randomize participants, they have to contend with the possibility that participants will have psychosomatic responses to their treatments, such as expecting that they will get better — and subsequently feeling better — if they know they are receiving a medication that is intended to treat a condition they have. Similarly, the person administering a treatment might convey through body language, verbal language or some other way, even inadvertently, that a participant is receiving the real medication as opposed to the placebo. If participants know who is receiving the actual, pharmacologically active medication and who is receiving the placebo, it could bias the results based on what is known about the placebo effect and nocebo effect and other types of biases.

Therefore, when possible, researchers employ blinding, sometimes (less often) called masking, during their trials, in which participants and/or those administering a medication or treatment do not know who is receiving the real treatment and who is receiving the placebo or control. Those administering the treatments may also be blinded from knowing which participants have a condition and which don’t if the control is a group of participants who do not have a condition.

During a single-blind trial, only the participants (or, more rarely, only the researchers/administrators of a medication/treatment) do not know who is in the control group and who is in the experimental group. With double blinding or double masking, both the participants and the researchers are prevented from knowing who is in the control group and who is in the experimental group. (Note: Sometimes “triple blinding” is used if the participants, the researchers and the person administering a treatment do not know who is in the control versus experimental groups, but this same scenario is often covered under double blinding as well.)

Blinding can also refer to removing from an analysis results or a time period that could bias the analysis if left in. For example, if a long-term study involving women looks at side effects that could result from use of birth control, and some women in the study become pregnant, their period of gestation may be “blinded” in the analysis since any of the outcomes studied that occur during that time would likely be caused by the pregnancy or some other factor and not by the birth control.

In some single-blind procedures, patients are not receiving a placebo but instead a “sham treatment” that resembles a non-pharmacological intervention being tested. In acupuncture, for example, a sham procedure would involve inserting needles into the control participants but at different body locations than what traditional acupuncture would call for.

Bonferroni correction

A Bonferroni correction is a calculation intended to reduce the likelihood of a false positive in study results. As described in the key concept of P-hacking, if the data from a study are examined in enough different ways, some kind of association will usually emerge, based purely on the statistical chance that you’ll find something if you look hard enough in enough different places. The Bonferroni correction is a statistical adjustment that’s intended to offset this possibility by establishing a lower threshold for statistical significance.

The mathematical mechanics of it (described here and here) aren’t important so much as simply understanding that it exists for two reasons. First, you might come across it in a study’s methods, which tells you that the researchers are aware of how many different outcome possibilities might be in their paper and they want to be sure none of them result from statistical chance. Second, you want to note the absence of Bonferroni correction if you’re reading a study that seems to have a lot of possible outcomes (and/or sub-analyses), but the authors don’t note any attempts to control for statistically significant results that emerge by chance. Having multiple dependent or independent variables or statistical tests conducted on a single data set is a situation where a Bonferroni correction should be used to ensure the results are “real.”
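For readers who do want to see the arithmetic, here is a minimal sketch in Python. It assumes the most common form of the correction, in which the usual significance threshold (0.05 here) is divided by the number of tests run. The function names and the P values are hypothetical and exist only for illustration.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Return one True/False flag per P value, using the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical P values from five sub-analyses of a single data set.
p_values = [0.003, 0.02, 0.04, 0.30, 0.008]
flags = significant_after_correction(p_values)
print(flags)  # [True, False, False, False, True]
```

In this made-up case, three of the five results would pass the usual 0.05 cut-off, but only two survive the corrected threshold of roughly 0.01.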

Clinical significance

Statistical significance addresses whether a research finding is likely to reflect a real effect rather than chance, but whether that finding is actually meaningful for doctors and patients is a separate issue. Clinical significance, also called practical significance or clinical importance, attempts to answer whether a new finding will make a big enough difference to change the way a doctor treats a patient’s condition. While statistical significance is usually measured with P values in clear, objective numbers (even if they have arbitrary cut-offs), clinical significance is more subjective. It relies on clinical judgment and various other factors, such as the condition being treated, side effects of an intervention, details about the patient population, the cost of a drug or intervention, a doctor and/or patient’s comfort with trying something new, and various other risks and benefits.

Clinical significance most often depends on whether a drug exceeds the threshold of a minimal clinically important difference, or the smallest effect needed to cause a doctor to change how they manage a patient’s condition. For example, let’s say the minimal clinically important difference for a new pain medication is that it reduces someone’s pain by at least two points on a scale of 1 to 10.

Now, consider a new pain drug assessed in a series of studies that use a pain scale of 1 to 10 to measure its effectiveness. The drug proves to be effective multiple times with a P value of less than 0.01. That means results this strong would be very unlikely if the drug had no real effect. However, let’s say the improvement in pain is a change of 0.4 points on the scale. In other words, if a person’s pain is an 8, then taking this medication will have an effect, but it will only drop their pain, on average, to a 7.6. Such a small change in pain is unlikely to make it worth taking the drug, especially if it is expensive or has unpleasant side effects. Measured against the minimal clinically important difference of 2, it doesn’t meet the threshold. How that threshold is determined does vary. For a condition that is excruciatingly painful, even a slight reduction in pain from a new drug may be worthwhile. But for another condition that is only mildly painful, a patient might barely notice a slight reduction in pain.

Although clinical significance does not have a single, objective measurement tool, there are several objective numbers that can contribute to clinical judgment. Two are Number Needed to Treat (NNT) and Number Needed to Harm (NNH). If a drug has a big effect and is statistically significant but 100 people must be treated for just one of them to experience that benefit, then a doctor may be more likely to stick with a drug that is slightly less effective but works for more people. Switching to a new drug might risk having too many people who get little benefit. Or, if a drug is very effective and works for a lot of people (say one in every two people benefits, an NNT of 2) but harms one of every four people who uses it (NNH of 4), then a doctor may determine the benefit isn’t worth the risk for a group of patients.
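The arithmetic behind these two numbers can be sketched in a few lines of Python. This is an illustration under a stated assumption, not a clinical tool: it takes the absolute difference in outcome rates between the treated and untreated groups as given, and the function name is made up for this example. Exact fractions are used to avoid rounding surprises.

```python
from fractions import Fraction
import math


def number_needed(absolute_difference: Fraction) -> int:
    """Number needed to treat (or harm): 1 divided by the absolute
    difference in outcome rates between groups, rounded up."""
    if absolute_difference <= 0:
        raise ValueError("absolute_difference must be positive")
    return math.ceil(1 / absolute_difference)


# One in every two people benefits -> NNT of 2.
nnt = number_needed(Fraction(1, 2))
# One in every four people is harmed -> NNH of 4.
nnh = number_needed(Fraction(1, 4))
# Only one in 100 people benefits -> NNT of 100.
nnt_rare_benefit = number_needed(Fraction(1, 100))
print(nnt, nnh, nnt_rare_benefit)  # 2 4 100
```

A low NNT is good news and a low NNH is bad news, so the two are usually read side by side.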

Confidence intervals are another objective measurement often used in assessing clinical significance. They provide the upper and lower ranges of the effect likely to occur 95 percent of the time. If the range is narrow, the doctor can be more confident about the expected effects of the treatment. If the range is wide, the extent to which the drug works for a patient becomes less predictable.

For example, let’s say a new drug for psoriasis reduces the number of flare-ups by an average of 10 per year. If the confidence interval is between 8.2 and 11.4, a doctor can be fairly confident that a patient taking this medication will have 8 to 11 fewer flare-ups in a year. However, if the confidence interval is 1.4 to 20.1, then some patients may see a huge benefit (20 fewer flare-ups) while others see very little benefit (just one or two fewer flare-ups). Or, if we consider the pain drug example above, a confidence interval of 0.1 to 4 might make the drug worth trying if some individuals will experience a reduction in pain from an 8 to a 4: that’s a 50 percent reduction in pain.
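One simple way to combine a confidence interval with a minimal clinically important difference is sketched below in Python. The three labels and the function name are assumptions made for illustration. They reflect one common reading of an interval, not a formal rule, and clinical judgment still applies.

```python
def compare_to_mcid(ci_low: float, ci_high: float, mcid: float) -> str:
    """Compare a confidence interval for a treatment effect with the
    minimal clinically important difference (MCID)."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low >= mcid:
        return "likely clinically important"
    if ci_high < mcid:
        return "unlikely to be clinically important"
    return "uncertain: interval straddles the threshold"


# Pain drug example: interval of 0.1 to 4 points, MCID of 2 points.
wide_interval = compare_to_mcid(0.1, 4.0, 2.0)
# Hypothetical narrow interval around the 0.4-point average improvement.
narrow_interval = compare_to_mcid(0.3, 0.5, 2.0)
print(wide_interval)
print(narrow_interval)
```

The wide interval lands in the "uncertain" category, which matches the text: some patients may benefit meaningfully while others barely notice a change.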

Finally, comparing relative risk and absolute risk may offer clues to whether a finding should be regarded as clinically significant. If, for example, taking antidepressants during pregnancy increases the likelihood of a specific birth defect by five times, that sounds pretty frightening. But if the birth defect in question only occurs in 1 in 1 million babies, then a fivefold increase in risk means that 5 in 1 million babies will experience it, perhaps not enough to justify telling a woman not to take them.
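The same comparison can be worked through in a short Python sketch. It uses the hypothetical numbers from the example (a baseline risk of 1 in 1 million and a fivefold relative risk), and the function name is invented for illustration.

```python
from fractions import Fraction


def absolute_risk_increase(baseline_risk: Fraction, relative_risk: int) -> Fraction:
    """Extra risk added by an exposure, given the baseline risk and
    the relative risk (how many times higher the risk becomes)."""
    return baseline_risk * relative_risk - baseline_risk


baseline = Fraction(1, 1_000_000)   # 1 in 1 million babies
exposed = baseline * 5              # fivefold relative risk
increase = absolute_risk_increase(baseline, 5)

# Express both as cases per 1 million babies.
exposed_per_million = exposed * 1_000_000
extra_per_million = increase * 1_000_000
print(exposed_per_million, extra_per_million)  # 5 4
```

A relative risk of 5 sounds dramatic, but here it amounts to four additional cases per million births, which is the figure that matters most for an individual patient's decision.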

As illustrated with that example, clinical significance can also relate to harms. If a risk from a procedure is blood loss, but the average blood loss amounts to 70 mL with a narrow confidence interval, that’s not enough blood loss to pose a serious risk. Sometimes a research finding may not offer much clinical significance because it’s simply not ready for prime time: the research was conducted in animals, the findings can’t yet be generalized to a broader population, or there is not yet enough evidence from multiple studies to support them. If a study finds that eating apples during pregnancy is associated with a higher risk that the child will later develop ADHD, for example, much more research is needed before doctors start telling pregnant women not to eat apples. The finding may be interesting and statistically significant, but not yet clinically significant.

Any time a journalist is covering a study, they should ask the researcher and other interviewees about the clinical significance of the findings: what are the implications for patients and doctors now or in the future?

For additional reading on how researchers and doctors consider objective measures of clinical significance, read this review paper.


Comorbidity

A comorbidity refers to having two or more conditions or diseases at the same time in a person. Comorbidities may coexist because they are related to one another (one causes the other or they both result from a shared underlying factor) or because of coincidence. In a person with an anxiety disorder and severe tooth decay, the conditions are most likely not related. But sometimes conditions can have unexpected links. A person with obsessive compulsive disorder, a type of anxiety disorder, may develop severe dental problems if their OCD leads to excessive teeth-brushing that damages their enamel. Other times, the conditions may not be directly related, but one can exacerbate another. A person might have cirrhosis of the liver and major depressive disorder at the same time that were not initially related, but reduced quality of life from cirrhosis might worsen depression symptoms.

Comorbidities typically refer to chronic conditions, but a person may have an acute condition along with chronic comorbidities. Most often, the comorbid conditions are related because they share risk factors or affect the same body systems, or because one increases the risk of the other. Certain mental health disorders often co-occur with substance use disorders in those who self-medicate and/or did not receive appropriate treatment for their mental disorder. Diabetes is a risk factor for cardiovascular disease and often occurs with a comorbidity of atherosclerosis. Unsurprisingly, comorbidities increase with age. Comorbid conditions are also associated with more severe disease and more complex care and treatment needs. Some of the most common conditions that involve comorbidities are arthritis, asthma, chronic pain, hepatitis, dementia, back problems, cancer, diabetes, cardiovascular disease, mental health conditions, autoimmune conditions and chronic obstructive pulmonary disease.

In covering medical studies, understanding comorbidities becomes especially important because presence or absence of certain comorbidities may influence study findings, especially if the researchers don’t control for them. If researchers were to conduct a study on risk factors for cardiovascular disease, and they did not account for participants with high cholesterol, diabetes, obesity, smoking, or other comorbidities, the results would not be valid. Looking at what comorbidities a study does or doesn’t account for may be a place reporters find significant limitations in a study, or at least questions to ask the researchers.


Context

No matter what study you’re covering, or what dramatic truth it seems to tell, the fact is that science doesn’t happen in a vacuum. Evidence came before; evidence will come after. No one study is ever the answer.

Karl Popper, the noted scientific philosopher, said it like this in his book Conjectures and Refutations: The Growth of Scientific Knowledge: “The scientific tradition … passes on its theories; but it also passes on a critical attitude towards them. The theories are passed on not as dogma, but rather with the challenge to discuss and improve upon them.”

Popper believed that attempting to disprove a theory was an important test of its validity. The more testing a theory is subjected to, and the more it survives, the closer it might be to the truth.

All this testing and refuting may be good for science, but it can be really bad for readers who are trying to make choices about their own health.

Health reporters can help keep readers from getting whiplash with every headline by putting studies in context.

Think of context as showing readers the lay of the land. Here’s what doctors and patients have known about this idea in the past. Here’s what previous studies have shown. In a sense, you’re trying to give them an idea of how much certainty or skepticism to use as they read about the findings.

Is there a mountain of evidence to back up the findings of the study you’re covering (e.g., smoking causes lung cancer), or is your study the kind of research that turns everything we thought we knew on its head (e.g., lower sodium diets may actually be bad for health)?

Many reporters who cover medical studies may start by working with just a copy of the study and perhaps a press release that goes along with it.

Study authors usually provide some background in the introduction section of their studies, which is another reason to always read the full text of a paper, not just the abstract.

Press releases may offer some context, but keep in mind that the aim of the press release is to promote the study. Press releases may ignore previous evidence if it doesn’t support the validity of the research they’re selling.

When a medical journal flags a study as noteworthy, it will sometimes pair the study with an editorial or commentary — opinion pieces that can be valuable sources of contextual information. That’s a good place to start, but it’s never a place to stop since commentaries are usually the opinion of just one or two people.

Medical journals also are increasingly aware that their audience extends beyond doctors and researchers. Some make an effort to quickly provide some contextual information about the study they’re presenting. The journal Pediatrics is an example. They post a little blue box at the top of each study letting readers know what came before and what the study adds. It looked like this on a recent study of background television exposure in young children:

Other sources you can use to find out the context of a study include PubMed, the searchable database of medical literature. Searching with a couple of key terms can help you find other recent studies on the same topics you’re covering. PubMed also helps you quickly find research reviews. Research reviews are studies of studies that are designed to summarize the body of literature available on a given question.

Article types

To filter the reviews from your PubMed search, read down the left column of the screen until you see the heading “Article Types.”  Check “Systematic Reviews.”

It also may be helpful to interview the author of a recent review since they’re likely to be up to speed on the latest thinking on the subject you’re interested in.

The Cochrane Collaboration is another source of in-depth reviews on medical topics. You get free access to the entire Cochrane library as a benefit of AHCJ membership. Read more about how to sign up here.

Correlation vs. causation

A common mistake reporters make when writing about medical studies is confusing correlation and causation.

Two variables in a study can be related without one actually being directly caused by the other. Many people who suffer from alcoholism also smoke cigarettes. Alcohol addiction and tobacco addiction are correlated, but one doesn’t cause the other.

In a study that compares drinkers and non-drinkers, heavy drinkers might have higher rates of pancreatic cancer than non-drinkers.

But it’s impossible to know if the cancer was caused by their drinking or by something else that made them different from the non-drinkers, like higher rates of smoking.

Here’s another example of how correlation can cloud the interpretation of a study. The amount of sodium a person gets in their diet is closely correlated to the total calories they eat. In other words, the more a person eats, the more sodium they’re likely to take in.  Eating a lot of calories often also leads to obesity. Both obesity and high-sodium diets are believed to contribute to high blood pressure. So what’s the primary driver of high blood pressure in a scenario like this, sodium? Or obesity? Those are the kinds of questions researchers try to disentangle in their studies.

Observational studies can only show correlation. They can’t show causation.

When covering observational studies, it’s important to use language that makes the limits of the research clear.

Seasoned health reporters will eschew wording in their leads or headlines that reads like this:

A new study shows that short sleep may cause weight gain.

Instead, they aim for wording that suggests a less direct relationship:

A new study shows that people who don’t get at least seven hours of sleep a night are more likely to gain weight compared to those who snooze more.

That’s the most accurate way of describing the comparison that’s being made in the study, but it can also be a little wordy. Here’s another way that would work if you’re tight on space:

A new study shows short sleep is linked to (or tied to or associated with) weight gain.

Disease elimination vs. eradication

The elimination of rubella in the Americas was announced in April 2015, followed by the elimination of measles across the Americas in September 2016. Yet measles cases still occur in North and South America, and news is still being reported on the eradication of polio, which has not been seen in the Americas in years. The key here is that “elimination” and “eradication” are different things, though they are often confused by readers and sometimes even by journalists.

Eradication refers to a disease being completely, literally eradicated from the earth: no cases occur at all, from any source. The best-known example is the eradication of smallpox in 1980. A lesser-known example is rinderpest, a viral disease of livestock. Campaigns to eradicate polio and Guinea worm are officially underway, and it could be argued that public health officials are, so far unofficially, working toward eradication of hookworm, measles, rubella, malaria and other diseases. Only certain diseases can be eradicated with current tools. A disease whose cause persists in the environment, rather than depending on human hosts, is an example of one that cannot.

Elimination, however, refers to a permanent interruption in indigenous transmission of a disease, making it no longer endemic, but the disease can still be introduced by a case from another geographical region. Or, as it was put in an article about the measles elimination, “Measles no longer lives in the Americas though it occasionally visits.” For example, measles has been eliminated from the U.S. since 2000, but there have been a number of measles outbreaks in the U.S. since then. All of those outbreaks, however, began with a case imported from outside the U.S. None of them arose from ongoing transmission within the U.S. because the virus no longer circulates on its own in the U.S., thanks to the effectiveness of the measles vaccine.

The distinction is important because an eliminated disease can always return if conditions allow for it, such as a sufficiently deep, sustained drop in immunization rates that allows measles to begin circulating again.

Dose response

This term describes a pattern all of us are intuitively familiar with already: a relationship between a substance or exposure and an outcome or effect. For example, the more calories you consume, the more weight you gain over time. The less water you drink, the more dehydrated you become. The more miles you drive (exposure), the more gas you use (effect).

These are all examples of clear cause and effect, but what happens when a study is looking at an exposure and an outcome that are associated, but it’s not clear yet if it’s cause-and-effect? A dose response effect is a phenomenon to look for in the study. It’s most commonly seen in pharmaceutical or toxicology studies, such as the decreasing IQ points that accompany increasing levels of lead exposure. In toxicology, dose response is basically the technical term for “the dose makes the poison.” In pharmaceutical studies, dose response curves can illustrate the proportion of the study population that responds to the medication (and the extent to which they respond).

Although a clear dose-response relationship (x follows a predictable curve with y) cannot show causation on its own, it’s one of the nine criteria used to judge from observational studies whether an exposure and an associated outcome are, indeed, causally linked when randomized controlled studies are not available or ethical. For example, it’s now established that smoking causes lung cancer. The more a person smokes (amount and duration), the higher their risk for lung cancer is. That dose-response is part of what helped scientists determine the causal link. These curves typically taper off at the low end (one lifetime cigarette isn’t that much different than 10 lifetime cigarettes) and at the high end (smoking three packs a day isn’t that much different than four packs a day).

One value of understanding dose response is that NOT seeing it may indicate a flaw in a study or another explanation for an association. If substance X and outcome Y are related to each other statistically, but Y doesn’t increase as X does (maybe it stays the same or has no clear pattern), the lack of a dose response could be a red flag.

For example, say you are provided the following information in a study about a toxic substance and tumors occurring in mice:

  • 5 mcg of the substance is linked with 6 malignant tumors

  • 10 mcg, 3 malignant tumors

  • 15 mcg, 7 malignant tumors

  • 20 mcg, 1 malignant tumor

  • 25 mcg, 5 malignant tumors

If you were to graph those numbers, the line would go up and down without a clear pattern, meaning there is no dose response relationship. However, it’s important to realize that not all dose response relationships are linear or inversely linear. Some substances have a different dose response curve, such as endocrine disruptors, which typically have an upside down U-shaped dose response. A very low and a very high exposure may have similar modest effects, whereas an exposure somewhere in the middle has the greatest effect. It’s therefore important to be aware of what the natural dose response curve should look like for the type of substance/exposure the study includes.
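For readers who prefer a number to a graph, the Python sketch below runs two rough checks on the mouse example, assuming all five doses are in micrograms. It tests whether the tumor counts rise or fall consistently with dose, and it computes Spearman's rank correlation, a standard measure of how consistently two lists move together (1 means always together, -1 means always opposite, and values near 0 mean no consistent trend). The helper names are made up for illustration, and the simple formula used here assumes no tied values.

```python
def is_monotonic(values) -> bool:
    """True if the values never decrease or never increase."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys) -> float:
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("this simple formula assumes no tied values")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


doses_mcg = [5, 10, 15, 20, 25]
tumors = [6, 3, 7, 1, 5]

print(is_monotonic(tumors))                       # False
print(round(spearman_rho(doses_mcg, tumors), 2))  # -0.3
```

A weak correlation like this one supports the "no clear pattern" reading, but it is only a check for a steady rise or fall. An upside down U-shaped curve would also score near zero, so a low value does not by itself rule out that kind of relationship.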

Effectiveness vs. efficacy

At first glance, it would seem the only difference between “effectiveness” and “efficacy” is a handful of letters. But these terms actually have very different meanings in the context of medical research and should never be used interchangeably. Efficacy refers to how well a drug, device or other intervention performs under ideal conditions. Efficacy can only be determined in clinical trials, and these trials typically have very specific criteria for participants. The criteria nearly always have age requirements and often have sex, gender, race or ethnicity requirements as well. Other potential criteria could relate to having or not having certain underlying conditions, having a certain severity of a particular disease, living in a particular geographical area, living in a rural or urban or suburban area or having public, private or no insurance, among other things. The stricter the criteria are, the less generalizable the findings might be. If, for example, a trial includes only black women in their 40s who were diagnosed with type 2 diabetes within the past two years, would the effects of a drug being tested also apply to white men in their 20s who have had type 2 diabetes since they were children? It’s often impossible to know without doing a trial with that population as well.

Effectiveness, on the other hand, is how well a drug, device or intervention performs in everyday real life for a broader range of patients. Effectiveness studies look at the intervention’s effects in a large, often diverse population. There may still be criteria for inclusion in the study, but they generally will be broader and often may include people who are sicker or healthier than the groups included in efficacy trials. Effectiveness has much more relevance for doctors and other clinicians.

Why not just always test the effectiveness of an intervention then? The problem with this approach is that the diversity of characteristics among participants could mask potential therapeutic effects or side effects because it’s harder to tell whether the outcomes occur because of the drug or because of differences among the participants.

Another difference between efficacy and effectiveness relates to how a drug is prescribed and how well the patients adhere to it, or how different physicians might perform different procedures. An efficacious drug may not actually be effective the way it’s used. Further, efficacy trials are designed to answer the question of whether a drug, device or intervention works at all while effectiveness studies help answer whether that intervention can help patients. In this sense, efficacy relies more on results’ statistical significance whereas effectiveness relates more to results’ clinical significance. Read more on efficacy and effectiveness here and here.


Epidemiology

Epidemiology is the study of disease behavior, particularly at the population level. Epidemiology includes study of both chronic and acute conditions and of both infectious and non-infectious diseases. This science encompasses the incidence and prevalence of a disease or condition, disease risk factors, surveillance strategies and efforts, biostatistics, prevention strategies, treatment strategies, transmission modalities and risks (particularly for infectious diseases), etiology of the condition, patterns in disease behavior, the natural history of a disease, and any other information necessary to understand disease behavior and to prevent, stop, or treat disease. Observational studies, which comprise the vast majority of studies that are not randomized controlled trials, are also commonly called epidemiological studies because they nearly always involve observing a large population and drawing inferences from those observations. In circumstances where randomized controlled trials cannot answer a particular question, whether for ethical or logistical reasons, epidemiological research may be able to if enough data exist from enough well-designed studies. (See Hill criteria for more on how epidemiological/observational studies can contribute to conclusions about causation.)

False balance

This lapse in responsible reporting refers to using outliers’ voices to state opinions that contradict the facts simply to provide “balance” to a story. Stories about any topic certainly need to include as many perspectives on an issue as possible as long as those perspectives are purely opinion-based (something that science cannot show to be true or untrue either way) or those perspectives are supported by some scientific evidence, even if that evidence diverges from other evidence. However, if such a strong consensus from the evidence exists that something is regarded as a fact, then including a person who doesn’t believe that fact does not provide accurate or appropriate balance to a story — it just confuses the reader about what the facts are. A flip example would be including a quote from someone who believes the earth is flat in a story related to weather or the curvature of the earth, or quoting someone who believes the moon landing was a hoax in a memorial story about the moonwalk. In reporting on medical research, this becomes tricky because scientists are learning more information all the time, and it’s reasonable for journalists to seek countering opinions particularly on new research, such as new findings about the gut microbiome or a new treatment. Other topics, such as breast cancer screening, may have contradictory evidence or involve controversial opinions on what to do about the evidence, all of which should be considered for a story.

One of the most common examples of a topic that falls prey to false balance, or false equivalency, in reporting is vaccines, mostly in smaller markets or by general assignment reporters who are less familiar with the health or science beat. The way the media’s falsely balanced vaccine reporting damaged public health reporting (and consequently public health) is such a well-worn case study that CJR featured outstanding coverage of it in Curtis Brainard’s Sticking with the Truth. Quoting “both sides” on concerns about a safety issue in vaccines that has been demonstrably shown not to exist makes it appear that there is a controversy among experts when there is not. The group Voices for Vaccines offers an excellent primer to false balance and how to avoid it in accurate news stories about vaccines.

The danger of false equivalence remains for any issue on which a broad medical or scientific consensus exists based on the evidence and a handful of outliers attempt to discredit that information for various reasons, often motivated by personal financial gain. Avoiding false balance doesn’t mean journalists take off their skeptical hat in covering these issues, but they should report these scientifically outlier positions only if solid evidence supports them, not just because someone somewhere believes them.

Hill Criteria for Evaluating Observational Studies

If there's one phrase that most reporters who cover medical studies can repeat in their sleep, it's the caution that observational studies only show associations; they don't prove cause-and-effect.

But even though these studies are considered to be less definitive than experiments, there are reasons not to dismiss the findings of observational studies. In some cases, it's the only kind of research that can ethically or practically be done to understand a health problem. Observational studies are also usually more reflective of the kinds of messy and multi-faceted conditions faced by people in the real world. Thus, they can be more applicable to clinical practice than carefully controlled ivory tower experiments like randomized, controlled trials.

The challenge, then, becomes understanding how sound the conclusions of an observational study may be and communicating either your confidence or your doubts about those study findings to your readers.

One way to evaluate observational research is to use The Hill Criteria, a set of nine tests developed by Sir Austin Bradford Hill, a British epidemiologist and statistician. He published these criteria in a 1965 essay called "The Environment and Disease: Association or Causation?" and they are still being used today.

The nine criteria are:

  1. Strength – Stronger associations are more likely to prove causal than weaker associations. Hill gave an example that chimney sweeps were 200 times more likely to die of a particular type of cancer that affects the skin of a man's scrotum than men in other occupations. "Chimney Sweep Cancer," identified in 1775, was the first kind of cancer to be tied to a person's occupation. An association between smoking and lung cancer was also very strong. Observational studies found smokers were nine to 10 times more likely to die of lung cancer than non-smokers.

  2. Consistency – Has the association been repeated in different studies by different researchers in different places and at different times?

  3. Specificity – How narrow is the observed relationship? Is it limited to one group of people who are dying from one kind of a disease? Or is the same group of people dying from many different causes, making it difficult to pinpoint the tie?

  4. Temporality – In order to prove, or at least to suggest with a high degree of suspicion, that smoking causes lung cancer, people had to start smoking before they developed lung cancer. Hill believed this criterion to be the most important for determining causality. The horse has to come before the cart. But temporality isn't always easy to determine (see reverse causality).

  5. Dose-response – Studies that demonstrate a biological gradient, that is, the higher the dose or exposure, the more likely a person is to develop the outcome under study, are more likely to be causal than those that don't find a dose-response relationship. The more cigarettes a person smokes, the more likely they are to develop lung cancer, for example.

  6. Biological plausibility – Is there a biological mechanism that helps to explain the observed relationship?

  7. Coherence – Is the observation in line with previous studies on the same question? Findings that are repeated are stronger than those that are new or contradictory. (Covering a new-to-you topic? A search of PubMed or the Cochrane Library can help you check for coherence.)

  8. Experiment – Does removing the exposure change the observed outcome? Answering this question usually requires additional studies, but in some cases, observational studies may support the conclusions of clinical trials.

  9. Analogy – Do similar exposures result in similar outcomes? Hill argued that observing the birth defects in babies born to mothers who took the drug thalidomide should surely make doctors think twice about prescribing chemically similar drugs to pregnant women.

For a practical demonstration of the Hill Criteria in action, see this recent commentary in the Journal of the American Medical Association. In it, Drs. Sanjay Kaul and George Diamond use the nine criteria to evaluate a study that looked at the association between aspirin and macular degeneration.

Ingelfinger rule

This refers to the New England Journal of Medicine submission policy outlined in 1969 by then-editor Franz J. Ingelfinger. He wrote an editorial clarifying NEJM’s policy that “articles are accepted for consideration with the understanding that they are contributed for publication solely in this journal.”

Ingelfinger wrote: “The understanding is that material submitted to the Journal has not been offered to any book, journal or newspaper. If an author willingly and actively has contributed the same material to any other publication — whether as text to a standard medical journal, or as a ‘letter to the editor,’ or as a feature in a lay magazine — that understanding has been disregarded.” At the end, he explicitly laid out the precise text that came to be known as the Ingelfinger rule: “Papers are submitted to the Journal with the understanding that they, or their essential substance, have been neither published nor submitted elsewhere (including news media and controlled-circulation publications). This restriction does not apply to (a) abstracts published in connection with meetings, or (b) press reports resulting from formal and public oral presentation.” It was later modified to be included in letters to submitting authors: “The Journal undertakes review with the understanding that neither the substance of the article nor any of its pictures or tables have been published or will be submitted for publication elsewhere... This restriction does not apply to abstracts published in connection with scientific meetings, or to news reports based solely on formal and public oral presentations at such meetings, but press conferences at these meetings are discouraged.”

Because NEJM is a top medical journal whose policies influence medical publishing in general, the policy became a generally accepted rule for all research submitted to all journals. Often criticized, the Ingelfinger rule has sometimes made coverage of medical research more challenging. For one thing, researchers presenting abstracts at meetings may not be willing to talk to journalists about a presentation that hasn’t been published in a journal. The rule makes an exception for abstracts printed in meeting programs, but Ingelfinger directly addressed the situation in which a journalist seeks an interview with the researcher after a publication: “Here a decision may be difficult, but in the Journal's opinion the material has been contributed elsewhere if the speaker makes illustrations available to the interviewer, or if the published interview covers practically all the principal points contained in a subsequently submitted manuscript.” Being aware of this restriction can help journalists understand why they might be brushed off by a researcher who actually may want to talk about his research but fears jeopardizing its chance at publication. The NEJM’s later clarification muddied this further by suggesting researchers can talk with journalists to ensure information is accurately reported, as long as they don’t provide extra information not in the presentation (something many journalists might want).

The Ingelfinger rule contributed to expanding the practice of instituting media embargoes, but misunderstandings about what the Ingelfinger rule and embargoes actually limit can interfere with reporting. Because they often lack media training, researchers may incorrectly believe they are not allowed to discuss an unpublished study that is under embargo lest they jeopardize its appearance in the journal or, worse, any future articles in that publication. Researchers may discuss their embargoed study with a journalist who has agreed to abide by the embargo, and journalists sometimes need to educate the researchers about this to get the interview they need. Other times, the researcher may just play it safe and not provide any interview until the study is published, even if that’s after most journalists’ deadlines. Consider turning to a public information officer, if available, to help sort things out.

Impact Factor

In the world of research publishing, a loose hierarchy of journals exists both overall and within individual fields. In science publishing generally, a handful of journals are widely regarded as the most prestigious, such as Science and Nature. Within medicine, The New England Journal of Medicine, The Lancet and JAMA are among those considered top journals. While there is no official, agreed-upon list ranking journals by prestige, the most objective feature is a journal's “impact factor.” The impact factor represents how frequently, on average, an article in that journal has been cited in another paper in a particular year. Thomson Reuters calculates the impact factor for each journal annually in their Journal Citation Reports.

The calculation is the ratio of A over B, in which A represents the number of times all items published in a particular journal during two consecutive years were cited in the third consecutive year, and B represents the total number of "citable items" published in that journal during those two consecutive years. Citable items typically refer to studies, reviews, proceedings or notes but not editorials or letters to the editor. For example, in 2013-2014 Nature had an impact factor of 42.351. The New England Journal of Medicine’s factor for that period was 54.42, and for JAMA it was 30.387. Other journals generally have impact factors in the lower single digits.
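The A-over-B ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the citation counts are hypothetical and are not actual Journal Citation Reports figures.

```python
def impact_factor(citations_in_third_year: int, citable_items_prior_two_years: int) -> float:
    """Return A / B, where A is the number of citations received in year 3
    to items published in years 1 and 2, and B is the number of citable
    items published in years 1 and 2."""
    if citable_items_prior_two_years <= 0:
        raise ValueError("the journal must have published at least one citable item")
    return citations_in_third_year / citable_items_prior_two_years


# Hypothetical journal (made-up numbers): 400 citable items over two years,
# cited a total of 1,200 times during the third year.
example = impact_factor(1200, 400)
print(example)  # 3.0
```

In this made-up case, the average citable item was cited three times, which would put the journal in the "lower single digits" range mentioned above.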

The expectation is that the more frequently a journal's research is cited, the stronger, more reliable and more impactful the science in that journal is, and the more highly the publication is regarded. That said, impact factor is not a perfect measure of journal quality and certainly not of individual papers in the journal. An infamously fraudulent retracted study that attempted to link autism to the MMR (measles-mumps-rubella) vaccine was published in The Lancet, by all accounts a prestigious journal with a high impact factor (39.207 in 2013-2014). Some critics have also pointed out that a journal's impact factor can be strongly influenced by only a small proportion of papers, which reveals little about all the other papers published in that journal but cited far less often. For example, a 2005 editorial in Nature notes that 89 percent of Nature's impact factor the previous year was generated by just a quarter of the papers published in the preceding years. Finally, a New York Times op-ed by the creators of Retraction Watch notes that journals with a higher impact factor tend to have more retractions than those with lower impact factors, for unknown reasons.

While imperfect, the impact factor still provides a rough guide as to a journal's reliability, which can be helpful if a reporter comes across a study published in an unfamiliar journal. Several sites provide ways to search for impact factors. The International Scientific Institute offers an index. Affiliates of universities, such as the University of Virginia Health System, sometimes include the impact factor in their listings.

Incidence and prevalence

Many of the medical studies journalists cover are epidemiological, which are observational studies focusing on the health of populations. These studies tend to report on the incidence and prevalence of diseases and other conditions, so it’s important that journalists understand the difference between these two commonly confused terms in epidemiology.

In the plainest terms, “incidence” refers to new cases of a disease or injury or condition. “Prevalence” refers to the total existing cases of a disease or injury or condition, whether newly occurring or ongoing from a previous diagnosis or occurrence. Although these terms can refer to any condition studied, such as gunshot wounds, short-term infectious disease or chronic conditions, this section will primarily focus on diseases for the sake of simplicity.

Whether a study, or a journalist, uses incidence or prevalence depends on what’s being communicated. For example, to have a sense of how quickly a disease is spreading through a population, incidence is more relevant because it describes new cases.

But to understand the burden of a disease, especially a chronic condition, in a population, prevalence is more relevant because it focuses on how many people are suffering, regardless of whether they were diagnosed yesterday or ten years ago.

These concepts involve more complexity, but first, here is a visual analogy to make sense of the difference: imagine a bathtub that has the faucet turned on and the drain open. The water pouring into the bathtub is the incidence: the new cases getting diagnosed. The water that is in the bathtub is the prevalence: how many people currently have the condition. The water exiting the tub through the drain represents the people leaving the prevalence pool, either because they died from the condition or because they recovered from it.

Two more examples: If 500 people are diagnosed with diabetes each year, that refers to incidence, but if 15 million people are currently living with diabetes, that refers to prevalence. If 6 million people caught the flu in the first week of February, that’s incidence, but if only 4 million people are currently suffering from symptoms of the flu on February 7, that refers to the prevalence of influenza; the other 2 million recovered or died from the flu during that week.
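For readers who like to see the arithmetic, the bathtub analogy and the flu example can be sketched as a tiny Python calculation. The function name is hypothetical, and the sketch assumes, purely for illustration, that no one had the flu before the week began.

```python
def update_prevalence(existing_cases: int, new_cases: int, recovered_or_died: int) -> int:
    """Bathtub model: water already in the tub, plus water from the faucet
    (incidence), minus water leaving through the drain (recoveries and deaths)."""
    return existing_cases + new_cases - recovered_or_died


# Flu example from the text: 6 million new cases during the first week of
# February and 2 million people who recovered or died during that week.
# Assumption for illustration only: zero existing cases at the start of the week.
prevalence_feb_7 = update_prevalence(
    existing_cases=0,
    new_cases=6_000_000,
    recovered_or_died=2_000_000,
)
print(prevalence_feb_7)  # 4000000
```

The 6 million figure is the incidence for the week; the 4 million figure is the prevalence on February 7.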

Incidence is typically described in one of two ways: incidence proportion or incidence rate. The incidence proportion is also called cumulative incidence, attack rate, or risk of a condition — the probability of developing it. Incidence proportion is expressed as a ratio where the numerator (top number) is the total number of new cases of a condition during a specified time interval, and the denominator (bottom number) is the population of people who are at risk for the condition.

For example, the incidence proportion of HIV in a particular country might be 25 people per 100,000 individuals per year. Similarly, the incidence of cervical cancer in the same country might be 10 per 50,000 women. Even though it’s the same population, the denominator must reflect the population that is at risk. Both males and females can get HIV, but only females can get cervical cancer, so the denominator can only include women in the second example. (The second example would probably actually be expressed as 20 per 100,000, but it’s important to know that the denominator still only contains women and that the HIV rate and the cervical cancer rates given here cannot be directly compared since the denominators refer to different populations within the same country.)
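A short Python sketch shows how an incidence proportion is rescaled to a common "per 100,000" figure. The function name is hypothetical; the inputs are the illustrative HIV and cervical cancer numbers used in this section, not real surveillance data.

```python
def incidence_per(new_cases: int, population_at_risk: int, per: int = 100_000) -> float:
    """Incidence proportion rescaled to a standard population size.

    The denominator must be the population actually at risk for the
    condition (for cervical cancer, women only)."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases * per / population_at_risk


hiv = incidence_per(25, 100_000)       # at-risk population: everyone
cervical = incidence_per(10, 50_000)   # at-risk population: women only

print(hiv)       # 25.0 per 100,000 individuals
print(cervical)  # 20.0 per 100,000 women
```

Both results are now "per 100,000," but they still describe different at-risk populations, so the two figures should not be compared directly.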

Even incidence proportion can be described different ways. For example, the overall incidence proportion, or attack rate, of a listeriosis outbreak refers to the total number of individuals getting newly diagnosed with the food-borne illness out of the total population. But if the source of the outbreak is determined to be cantaloupe, the food-specific attack rate refers to the number of new cases of illness among people who ate that food. (It can get even more specific if the denominator is limited to the people who ate the cantaloupe from the farm where the outbreak originated.)

The incidence rate is less familiar to journalists even though they will come across it in studies; it refers to the number of newly diagnosed cases in the population over a set amount of time. It’s often expressed in “person-years,” which incorporates time into the denominator. In writing about this type of incidence in layperson terms, one way to express it is to do a quick division and use “cases per year” (or whatever the unit of time is, usually days or years). For example, if the rate of norovirus in Pleasantville over a 10-year period is 25,000 cases per 1 million person-years, then that actually means the population is approximately 100,000 people (100,000 people times 10 years is 1 million person-years), and 2,500 people a year got sick. (If 25,000 cases occur over that time, the annual rate is estimated by dividing by 10.) The reason researchers might express a condition in person-years instead of annual rate is that the population might change over that time and person-years is more precise and accurate for researchers. Usually, for a journalist’s purposes, that level of precision is not necessary, and the estimate of 2,500 cases per year is sufficient.
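The Pleasantville arithmetic can be checked with a minimal Python sketch. The function names are hypothetical, and the sketch assumes a population that stays constant at 100,000 for all 10 years, which is exactly the simplification that person-years lets researchers avoid.

```python
def person_years(population: int, years: int) -> int:
    """Total observation time, assuming the population size stays constant."""
    return population * years


def cases_per_year(total_cases: int, years: int) -> float:
    """Rough annual case count for lay readers: total cases divided by years."""
    return total_cases / years


# Pleasantville example from the text.
total_cases = 25_000
observed = person_years(population=100_000, years=10)

# Incidence rate expressed per 1 million person-years.
rate_per_million = total_cases * 1_000_000 / observed

print(observed)                         # 1000000
print(rate_per_million)                 # 25000.0
print(cases_per_year(total_cases, 10))  # 2500.0
```

If the population grew or shrank during the decade, the person-years total would be the sum of each year's population rather than a simple multiplication, and the quick "divide by 10" estimate would be only approximate.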

Prevalence can also be discussed in two different ways: point prevalence and period prevalence. Just as it sounds, point prevalence refers to the number of people with a certain condition at a precise moment in time, such as a day or “right now” throughout the U.S. The numerator is the number of current cases, and the denominator is the total current population. The estimated point prevalence of HIV in the U.S. is 1.2 million people. Meanwhile, period prevalence refers to the number of current cases over a period of time, such as over a year. The point prevalence of the flu in February may be 300,000 cases, but the period prevalence of the flu over the entire year might be 9 million (which includes the 300,000 cases in February). Whether this refers to the cases in a nation, a state, a county, a city, a school or some other group depends on the study and the needs of the journalist’s story.

For additional discussion of incidence versus prevalence, review this Paediatric Nursing primer or this explanation from the University of North Carolina School of Public Health (which includes an illustration of the bathtub analogy).

For more nitty gritty specifics, check out this lesson on the CDC website on Morbidity Frequency Measures.

Institutional Review Boards

The history of medicine is full of sordid experiments, from the Nazi experimentation on concentration camp prisoners to the tragic and unethical Tuskegee syphilis study. Institutional Review Boards (IRBs) came about to prevent future unethical human research such as these incidents (and many others). Although the concept of having ethical standards for human experimentation dates back centuries, one of the most influential codified versions in recent history was the Nuremberg Code developed after World War II.

Today, countries throughout the world have laws regarding IRBs, also called independent ethics committees, ethical review boards, research ethics boards or similar names. They all serve the same essential purpose: to look after the safety, welfare and rights of human participants in studies.

IRBs, which can be independent or formed by institutions such as universities and hospitals, are groups that formally convene to approve, review and monitor any kind of biomedical research involving living human subjects. Even studies which simply ask participants questions, whether through surveys or in-depth interviews, must undergo IRB approval. If the study contains “identifiable private information,” the researchers must seek IRB approval. Most biography, journalism, art and similar activities as well as research involving federal data sets do not require approval even if they involve people. Most social science research, such as psychology and sociology, does require approval, though studies might qualify for an exemption. Some research also qualifies for an expedited review.

IRB guidelines in the U.S. must follow regulations laid out by the U.S. Food and Drug Administration and the U.S. Department of Health and Human Services (specifically Title 45 Code of Federal Regulations Part 46) and typically follow other institutional and ethical guidelines. The path to the federal regulations began with the National Research Act, passed in 1974 partly in response to the Tuskegee experiments. The resulting Belmont Report, a summary of ethical principles that formed the basis of current U.S. policy on human research, led to current law in 1981 (expanded in 1991). No trials for products governed by the FDA can proceed without IRB approval and oversight, and all IRBs overseeing FDA-regulated studies must register with the agency.

The key concepts underlying IRBs are informed consent of the participants (specific requirements here) and a risk-benefit ratio that involves significant benefit to society with acceptable levels of risk for the participants. IRBs consider a variety of factors in ensuring that research upholds these goals, such as how participants are recruited.

An extensive (66-question) FAQ on IRBs is available at the FDA website here. Notable points include the following:

  • The first priority of IRBs is to protect the welfare and rights of human subjects, not of the institution.

  • Institutions without their own IRBs may use outside/independent ones, including those of hospitals.

  • FDA regulations do not provide guidance regarding compensation for participant injury or regarding institutional liability for malpractice lawsuits, even if all IRB regulations were followed.

  • IRBs must have a diverse membership that meets certain parameters, such as having members with a non-scientific background (such as lawyers, clergy or ethicists) and including diversity in terms of “race, gender, cultural backgrounds and sensitivity to such issues as community attitudes.”

  • If researchers submit a study for IRB approval that has been previously denied approval by a different IRB, their documentation for the new approval request must include this information.

  • Clinical investigators must report all adverse events occurring during a study to the IRB overseeing that research.

  • Parents must provide informed consent for their children’s participation in research, but FDA regulations do not necessarily require that the children themselves provide assent (agreement to participate).

  • Some research for investigational drugs, biological products or devices may qualify for an exemption for IRB approval (see #56 in the FDA FAQ).

  • Medical devices have their own set of requirements discussed here.

  • Exceptions to IRB approval can be granted for humanitarian reasons, such as emergency use of a drug or biologic or for medical devices, discussed in the FDA FAQ here.

Any time a journalist is covering a study, he or she should check to see that the study received IRB approval (usually described in the methods section) or that the research qualified for an exemption. The study should note whether participants provided informed consent.

In vitro vs. in vivo (and in silico)

Experimental research involving new drugs, environmental exposures, or other chemicals or interventions will occur in one of three environments: in a human, in a non-human animal or in a petri dish, test tube, beaker or other environment outside a living multi-celled organism. An “in vitro” experiment occurs outside of a living multi-celled organism whereas an “in vivo” experiment occurs in a human or other animal. These terms are important to know because studies may not clearly explain the experimental setting beyond using in vitro and in vivo.

Typically, anything not involving a living multi-celled organism is preclinical research, used to show proof of concept or to help generate hypotheses for what might be seen in non-human animals. Animal research also usually comprises preclinical research, though research outside and inside non-human animals also occurs in environmental health research to better understand the effects of substances found in the environment. Some studies, however, involve a combination of settings. Researchers might conduct an epidemiological study on humans and then use what they find to conduct a series of experiments in a petri dish or test tube and/or non-human animal. Or, they may conduct a series of progressive experiments that start with an in vitro setting and progress to in vivo experiments in a non-human animal and then in a human.

A relatively new type of study that you might see described is “in silico.” This term is less used because authors typically simply describe what it is: computer simulation or modeling. It can also refer to gene sequencing that is performed exclusively on the computer.


Non-inferiority

To understand the concept of non-inferiority in medical research, it's helpful to remember Miller Lite's long-running slogan "great taste… less filling."

Miller Lite was the first successful light beer. And, to persuade people to try it, advertising executives had to find a way to convince people that it would taste as good as regular beer, but that it had added benefits.

The tag line for these spots was "Lite Beer from Miller: Everything you've always wanted in a beer. And less."

Non-inferiority studies are aiming to prove essentially the same thing. They aim to show that a new drug will work about as well as standard treatments but that it comes with other important advantages – it costs less, it has fewer side effects or it's easier to take, for example.

So what exactly does non-inferiority mean? It might sound like it means one drug is "as good as" another. But statistically, when a drug is non-inferior to another treatment, it actually means something a little different: “not much worse than.”


P-hacking

P-hacking is data diving, data fishing, data mining, or any other term (dredging, snooping, etc.) that describes manipulating or rearranging the results of a study in enough different ways that SOMETHING significant eventually emerges. It doesn’t require manipulation in the sense of making up, fudging or changing any of your numbers or other data. It simply means a researcher plays around with the numbers enough that they can eventually uncover an association based purely on the statistical likelihood that if you look hard enough for something in enough different places, you’ll eventually find something. As one study described it, “researchers collect or select data or statistical analyses until nonsignificant results become significant.”

One of the best explanations of P hacking is in Christie Aschwanden’s award-winning piece Science Isn’t Broken, which we’ve previously written about. The ability to P hack rests on a drawback of setting an arbitrary threshold for statistical significance: it virtually guarantees coming up with SOMETHING significant if a study has enough variables and/or outcome possibilities. A P value is intended to represent how often results at least as strong as those observed would turn up by chance alone if no real association existed. Considering that the arbitrary cutoff for significance is generally accepted as a P value of 0.05, that means you would expect chance alone to produce a statistically significant result about 5% of the time. So imagine that you’re looking for 12 different possible outcomes and how they’re associated with 8 different variables. That’s basically 96 possible combinations that could result. But that’s not much different than running a study 100 times: statistically speaking, about 5 of those combinations would probably show an association simply based on chance alone.
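The arithmetic behind that "about 5" can be sketched in Python. The function names are hypothetical, and the second calculation assumes the 96 comparisons are statistically independent and that no real associations exist, a simplification that rarely holds exactly in real studies.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of 'significant' results expected from chance alone."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one test crosses the threshold by chance,
    assuming the tests are independent (a simplifying assumption)."""
    return 1 - (1 - alpha) ** n_tests


combinations = 12 * 8  # 12 outcomes x 8 variables = 96 comparisons

print(combinations)                                       # 96
print(round(expected_false_positives(combinations), 1))   # 4.8
print(round(chance_of_at_least_one(combinations), 3))     # roughly 0.993
```

Under these assumptions, a study with 96 comparisons is all but certain to turn up at least one "significant" finding even when nothing real is going on.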

To watch out for P hacking, look at how many outcomes researchers are looking at, how many variables they’re considering in the study, and how many subgroups they consider and/or subanalyses they calculate. The more they have, the more likely they’ll “find” something by chance alone, and the more skeptical you should be that the finding is “real.”

Phases of Clinical Testing

New drugs and devices typically go through four, and sometimes five, phases of clinical testing. Three of these happen before treatments are approved and sold to the public. The fourth and fifth phases, which are types of post-marketing studies, occur after products are more widely released.

In Phase 1, new treatments that have been developed in test tubes and vetted in animals are given to human patients for the first time. These studies typically enroll very small numbers of patients who take a new treatment for a limited period of time. The goal of phase 1 studies is to demonstrate safety. They often use different drug dosages to measure responses and side effects.

Once basic safety is demonstrated, new treatments enter phase 2 studies. In phase 2, researchers give drugs to larger groups for longer periods of time. The goal of phase 2 studies is to demonstrate efficacy, or that a drug works as doctors hope it will. Side effects and adverse events continue to be monitored in phase 2 trials, as well. Many new treatments fail during phase 2. Only about 18 percent of new treatments succeed in phase 2 studies, according to a study published in May 2011 in Nature Reviews Drug Discovery.

Phase 3 studies are large and often expensive studies that pit an experimental remedy against an inert placebo or an established treatment for the same disease (see non-inferiority). Phase 3 studies are designed to evaluate how well a drug will work in the real world. Phase 3 studies also sometimes shed light on the patient population that's most likely to benefit from a new therapy. Regulators rely most heavily on data from phase 3 studies when they weigh whether a drug should be approved.

Phase 4 studies, or post-marketing surveillance, are sometimes required by the FDA when regulators want drug makers to continue to monitor the safety of a drug after approval.

Phase 5, or comparative effectiveness, studies are sometimes undertaken to see which agents in a group of similar treatments work best.

Placebo and nocebo effect

Nearly all randomized controlled trials involve using a placebo, a biologically inactive substance (sometimes called a “sugar pill”) intended to appear like the actual medication or treatment that the participants in the real treatment group receive. A placebo should resemble the real treatment as much as possible so that participants do not know if they are receiving the real treatment or the “fake” one. Using placebos allows researchers to evaluate treatment effects above and beyond the “placebo effect,” the phenomenon in which a person feels better or experiences improvement of their condition based at least partly on the expectation that they will improve.

In a substantial number of people – estimated to be up to a third of all individuals – receiving a placebo for a condition improves their symptoms. Scientists do not fully understand all the mechanisms involved in the placebo effect (some possibilities are discussed here), but the expectation of feeling better may activate the body’s endorphins, which relieve pain and can improve mood, similar to the way cursing and screaming have been shown to alleviate pain. The interaction with a doctor can also contribute to the placebo effect. Double-blind placebo-controlled trials ensure that neither the participant nor the researcher knows which is the real treatment and which is placebo so that it’s less likely for those receiving placebo to know. If a medication in a trial leads to improvement, but that improvement is not statistically or clinically significant compared to the placebo, it calls into question whether it’s really the medication itself leading to improved outcomes or a mirrored placebo effect. This is also true with other interventions, even surgeries sometimes. In acupuncture trials, the control group should receive "sham" acupuncture, or needles in non-acupuncture locations, to adequately account for symptom improvements that may result from the placebo effect.

The lesser known nocebo effect describes a somewhat opposite phenomenon: a person experiences what they believe to be the side effects of the medication when they are actually taking a placebo, at least in part as a result of the expectation that they will experience side effects. Although it’s also been studied quite a bit, the nocebo effect has not been explored nearly as much as the placebo effect, or at least not until recently. Yet, just as the placebo effect can lead to great improvement in some people, the nocebo effect can actually be dangerous, almost leading to a fatality in one case study. The nocebo effect may relate to anxiety, and both placebo and nocebo effects have been observed in brain scan studies.

Not all improvements in symptoms in a control group are necessarily a result of a placebo effect: the American Cancer Society website discusses other phenomena that can be confused with the placebo effect. Also discussed in detail at the ACS website, placebos can be used to answer the following questions in research:

•        Does this treatment work?

•        Does it work better than what we’re now using?

•        What side effects does it cause?

•        Do the benefits of the treatment outweigh the risks?

•        Which patients are most likely to find this treatment helpful?

Quality-adjusted life-year

A quality-adjusted life-year, abbreviated as QALY, is a mathematically derived measurement intended to capture both quality and quantity of life. It’s used in many cost effectiveness studies, and its purpose is to assess the benefit from a particular intervention or outcome, even if that outcome is simply survival. It takes into account a person’s life expectancy (quantity) as well as the quality of those remaining years. Researchers may calculate how many QALYs are gained by a particular treatment, intervention or preventive measure, or they may calculate how many are lost by not using a particular intervention or due to some event that happens, such as the loss of QALYs following a stroke or a specific cancer diagnosis.

The actual formula for calculating QALYs is multiplying the number of remaining years by a number intended to represent the utility value of each of those years: 1 represents a year of perfect health/quality, and 0 represents death. For example, if taking a particular medication for a chronic condition extends that person’s life by 10 years, but each of those years will have a slightly reduced quality estimated at 0.8, then the QALY increase from taking that medication is 8 QALYs (10 x 0.8). (That example is oversimplified because the need to take medication would be factored in as reducing quality of life as well.)
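The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the simplified formula described here (years multiplied by a utility value between 0 and 1); the function name `qalys` is made up for this example, and real cost-effectiveness studies use more elaborate methods.

```python
def qalys(years: float, utility: float) -> float:
    """Quality-adjusted life-years: years lived times a utility weight.

    utility runs from 0 (death) to 1 (a year of perfect health).
    """
    if not 0.0 <= utility <= 1.0:
        raise ValueError("utility must be between 0 and 1")
    return years * utility


# The example from the text: 10 extra years at an estimated quality of 0.8.
gained = qalys(10, 0.8)
print(f"QALYs gained: {gained:.1f}")
```

Changing the utility value shows how sensitive the result is to the quality estimate: the same 10 years at a utility of 0.5 would count as 5 QALYs.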

Reporting biases

According to the Cochrane Collaboration, reporting biases arise when the dissemination of information is skewed by the "nature and direction" of the study results. Compared with studies that find the drug or treatment being tested has no effect relative to placebo, studies with positive results are more likely to be published in the medical literature (publication bias), to be published more quickly (time lag bias), to be published more than once (multiple publication bias), to be repeatedly referenced by other researchers (citation bias), to be published in English (language bias), and to be published in peer-reviewed journals with high indexing standards that make the study easier to find (location bias).

Taken together, reporting biases mean that published studies and reviews of published studies may overestimate the effects of a drug or treatment.

For more on the causes and consequences of reporting biases, see this 2010 study in the journal Trials.


Risk: absolute vs. relative

The word "risk" often connotes danger: The risk of getting cancer. But in medicine, risk is a ratio that’s used to show the chance that something will happen. There are two main kinds of risk to find and report when covering a medical study: absolute risk and relative risk.

Relative risk is often the headline generating number that’s pulled out of a medical study:  “Study shows Experimental Drug Halves Heart Attack Risk.”

Without also reporting absolute risk, however, relative risk can exaggerate either the chance of getting sick or the chance that a drug or treatment may help. [See "Tanning beds: What do the numbers really mean?" for an example.]

Absolute risk is the overall chance that something will happen. It’s also sometimes referred to as the starting risk. In randomized-controlled trials, absolute or starting risk is usually found by looking at what happens to the group assigned to take the placebo.

For example, in a study testing a new drug against a placebo to prevent heart attacks, if 40 people out of 1,000 who are taking a placebo die of heart attacks during the study, the absolute risk of dying of a heart attack during the study was 40 out of 1,000, or 40 ÷ 1,000 = .04, or 4 percent.

Now let’s say that in the same study, the risk of dying of a heart attack among people taking an experimental drug was 20 out of 1,000 or 20 ÷ 1,000 = .02 or 2 percent.

Relative risk is calculated by comparing the risk of a heart attack in the group that received the experimental treatment to the risk seen in the placebo group. In this example, it would be calculated like this: .02 ÷ .04 = .5, or 50 percent.

The study would say that the relative risk of a heart attack among people who were taking the study drug was .50, or 50 percent.

In reporting the study, a writer might say: Overall, 4 percent of people in the placebo group suffered heart attacks during the course of the study compared to 2 percent in the group on the experimental drug. The new drug reduced a person’s chances of having a heart attack by 50 percent.

This kind of paragraph is helpful for readers because it shows two things: First, that the starting risk of having a heart attack during the study was pretty low; and second, the relative effect of the new drug.

Obviously, this is a very simplified example. The numbers presented in actual scientific papers are generally more complex, and they may be expressed in different ways. If you’re not easily able to find the absolute risk behind a relative risk, don’t be afraid to ask the researcher. They’re usually happy to explain.
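For reporters who want to check the math themselves, the heart attack example can be worked through in a short Python sketch. The counts (40 and 20 out of 1,000) come from the example in this section; the function and variable names are just labels chosen for this illustration. The last line adds the absolute risk reduction, a standard companion figure that is not discussed above: it is simply the difference between the two absolute risks.

```python
def absolute_risk(events: int, total: int) -> float:
    """Share of a group that experienced the event during the study."""
    return events / total


# Counts from the example: 40 of 1,000 on placebo, 20 of 1,000 on the drug.
placebo_risk = absolute_risk(40, 1000)
drug_risk = absolute_risk(20, 1000)

# Relative risk compares the treated group to the placebo group.
relative_risk = drug_risk / placebo_risk

# How much the drug cut the risk, in relative and in absolute terms.
relative_risk_reduction = 1 - relative_risk
absolute_risk_reduction = placebo_risk - drug_risk

print(f"Placebo group risk: {placebo_risk:.0%}")
print(f"Drug group risk: {drug_risk:.0%}")
print(f"Relative risk: {relative_risk:.2f}")
print(f"Relative risk reduction: {relative_risk_reduction:.0%}")
print(f"Absolute risk reduction: {absolute_risk_reduction * 100:.0f} percentage points")
```

Seen side by side, the 50 percent relative reduction and the 2 percentage point absolute reduction describe the same result, which is why reporting both gives readers a fuller picture.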

Run-in phase

A run-in phase is a period before the trial starts when all possible study participants are given the placebo, or the medication, or sometimes both, to screen out certain patients. Sometimes run-in phases sharpen a study's findings in important ways, but they can also skew the results.

For example, patients might be asked to take a placebo to screen out those who are irregular pill takers. This might result in a study of highly motivated patients, who aren't necessarily representative of the real world. Other run-in phases are meant to screen out patients who report side effects early on, which may understate the true proportion of people who experience side effects on the drug. Ask about the purpose of the run-in phase and find out whether the results from the run-in were included in the final analysis.

Sensitivity analysis

Any time researchers calculate results in an observational study, they have to make certain assumptions about what does and does not contribute to those results. Even if (or especially if) they are controlling for or adjusting for certain factors that could confound the results, they will want to test whether the assumptions they made to calculate the data are strong enough to hold up when changing (removing or adding) one or more of those factors or some other input into the calculation. This process is conducting a sensitivity analysis — how sensitive are the results to changes in the mathematical model or other inputs?

In plainer terms, it’s playing the “what if” game. What happens to the results if they no longer control for one confounder? What happens if they add in a different confounder? What if they only analyze the individuals for whom they had 100% complete data? The goal is to find out if the results hold up with each of these different calculations or to predict or discover potential alternative results/outcomes. Sometimes a sensitivity analysis might reveal that a subgroup has a greater or lesser risk than another, or that age doesn’t actually play a role as it was thought to.

Here are some questions a sensitivity analysis might help answer:

  • How robust are these findings?

  • What are the biggest factors driving these findings?

  • If we tweak X, what happens to Y?

  • What are the weakest areas of the analysis or model? (Where is there uncertainty?)

  • What inputs or factors can we remove without significantly changing the results? (Useful for fine tuning a list of risk factors, for example)

  • What future areas of research might be worth pursuing based on these findings or the influence of certain factors?
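One version of the “what if” game can be sketched in Python. Everything here is hypothetical: the cohort is invented, and `Person`, `risk_ratio` and `make_group` are names made up for this illustration. The sketch compares a risk ratio calculated from everyone in a study with the same calculation restricted to people who had complete data. A real sensitivity analysis would involve far more sophisticated models.

```python
from dataclasses import dataclass


@dataclass
class Person:
    exposed: bool
    had_outcome: bool
    complete_data: bool


def risk_ratio(people: list[Person]) -> float:
    """Risk of the outcome among the exposed divided by risk among the unexposed."""
    exposed = [p for p in people if p.exposed]
    unexposed = [p for p in people if not p.exposed]
    risk_exposed = sum(p.had_outcome for p in exposed) / len(exposed)
    risk_unexposed = sum(p.had_outcome for p in unexposed) / len(unexposed)
    return risk_exposed / risk_unexposed


def make_group(n: int, exposed: bool, had_outcome: bool, complete_data: bool) -> list[Person]:
    """Build n identical hypothetical participants."""
    return [Person(exposed, had_outcome, complete_data) for _ in range(n)]


# An invented cohort of 400 people, 32 of whom have incomplete data.
cohort = (
    make_group(30, True, True, True)
    + make_group(10, True, True, False)
    + make_group(150, True, False, True)
    + make_group(10, True, False, False)
    + make_group(18, False, True, True)
    + make_group(2, False, True, False)
    + make_group(170, False, False, True)
    + make_group(10, False, False, False)
)

# Primary analysis: everyone. Sensitivity analysis: complete data only.
primary = risk_ratio(cohort)
complete_only = risk_ratio([p for p in cohort if p.complete_data])

print(f"Risk ratio, all participants: {primary:.2f}")
print(f"Risk ratio, complete data only: {complete_only:.2f}")
```

If the two numbers are close, the finding is fairly robust to that particular assumption. If they diverge sharply, that is a good question to bring to the researchers.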

Sensitivity analyses typically involve complex biostatistics, so most reporters would need to ask a biostatistician to look one over to ensure it looks solid. But not every study requires an in-depth understanding of the sensitivity analysis. It’s helpful to read in a study because it provides insight into how the researchers were thinking, what assumptions they were making, and what they didn’t account for. A biostatistician’s help would generally be necessary only if something seemed off, didn’t add up, or left out factors that you think should have been considered.

Social desirability bias

Social desirability bias is a type of bias that can commonly occur with any type of self-reported data. It refers to the tendency of a person, whether consciously or not, to respond to questions with answers that they perceive or expect will be received in a positive way. This bias is especially prevalent in research that asks individuals about behaviors or attitudes or sensitive issues in general, such as drug or alcohol use or frequency of exercise. People may overreport perceived-positive behaviors (e.g., claiming to get more exercise than they actually get) or underreport perceived-negative behaviors (e.g., saying they smoke less often than they actually do). Perception is key: a behavior such as masturbation or sexual intercourse may be overreported in some contexts and underreported in others.

Social desirability bias can also occur with reporting diagnoses or conditions, particularly if the condition is one associated with stigma. For example, an epilepsy diagnosis has been stigmatized in some communities or may lead to restriction of privileges, such as loss of a driving license. Studies that rely on self-reported data about epilepsy prevalence may therefore end up with underreporting, depending on the circumstances and the way the question is asked. Other topics with a higher risk of social desirability bias include the following: any sexual activity, use of illicit or restricted substances, self-assessed abilities or skills, personality traits, healthy habits (brushing teeth, flossing, fruit and vegetable consumption, etc.), feelings/emotions, mental health, medication or treatment compliance, excretory activity, intelligence, involvement in or experience of violence, religion, patriotism, bigotry and self-perception/self-esteem (among many others).

Researchers attempt to reduce social desirability bias through several strategies: careful and neutral phrasing of their questions; use of interviewers or interview situations that increase an individual’s comfort level or increase their likelihood of being candid; offering confidentiality and/or anonymity; and collecting data without a person (such as computer-based questions). It may not be possible to completely eliminate the risk of social desirability bias in some research.

Specificity and sensitivity

Specificity and sensitivity refer to different characteristics of screening or diagnostic tests. They describe the likelihood that a positive or negative result is accurate. Sensitivity refers to the proportion of people who have a disease and also test positive, so it tells you the true positive rate: how often the test returns a positive result when the person truly is positive. A negative result from a highly sensitive test means it’s highly likely that the person does not have the disease. It’s reported as a ratio: true positives / (true positives + false negatives).

Specificity refers to the opposite: the proportion of people without a disease who test negative, or the true negative rate. A positive result from a test with very high specificity therefore means it’s more likely that the person does have the disease. Specificity is also reported as a ratio: true negatives / (true negatives + false positives).

Where things get tricky is in understanding how these relate to false positives and false negatives. With most tests, it’s difficult to achieve both a high specificity and a high sensitivity. Generally speaking, the higher the sensitivity is, the lower the specificity is, and vice versa. (When both are high, it’s an extremely accurate test.) A test with a sensitivity of 100 percent correctly identifies everyone who is positive, but it also tends to pick up more people who are not truly positive. A test with a sensitivity of 75 percent correctly identifies 75 percent of the people who have the disease, but the other 25 percent go undetected as false negatives. The risk of a screening test with a very high sensitivity is a higher risk of false positives.

Because specificity does the opposite, a highly specific test has a higher risk of missing people who do have the disease. A specificity of 100 percent correctly identifies everyone who does not have a disease, but it tends to come with more false negatives. A specificity of 80 percent correctly identifies 80 percent of the people who do not have the disease, but the other 20 percent receive false positives.
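The two ratios can be checked with a short Python sketch. The counts below are invented to match the 75 percent and 80 percent figures used above (100 people with the disease and 900 without), and the function names are simply labels for this illustration.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: share of people WITH the disease who test positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: share of people WITHOUT the disease who test negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical results: 100 people have the disease, 900 do not.
tp, fn = 75, 25      # among the 100 with the disease
tn, fp = 720, 180    # among the 900 without the disease

print(f"Sensitivity: {sensitivity(tp, fn):.0%}")
print(f"Specificity: {specificity(tn, fp):.0%}")
```

Note that each ratio uses only one group: sensitivity looks only at people who have the disease, and specificity looks only at people who do not.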

If your head is spinning a bit, you’re not alone. It can be difficult to wrap your head around these concepts, and it generally becomes easier with more familiarity and time. Reading specific examples can also be helpful.

Screening vs. diagnostic tests

Does a patient have a particular condition? Most of the time, at least part of the answer rests in giving that patient a test, whether it’s a questionnaire to identify a likely mental illness, an imaging test to see inside the body, an analysis of a biological specimen (such as urine or a piece of skin) or some other test. All these examples fall into one of two categories of medical tests, and it’s very important that journalists (and patients) do not confuse the two: screening tests and diagnostic tests.

Screening tests look at presumably healthy individuals to find out whether there is any indication that a disease might exist. They are designed to offer the probability of a disease’s existence and are used to determine whether further exploration (with additional screening tests or, more often, with diagnostic tests) is warranted. A diagnostic test, on the other hand, is given to establish whether a condition is or is not present in a person suspected of having a condition because they show symptoms or have an abnormality/anomaly of some kind.

For example, a clinical breast exam is a screening test: a health care provider feels the breast of a presumably healthy woman and looks for lumps, discoloration, unusual discharge or anything unusual that might suggest a problem. A biopsy, on the other hand, is a diagnostic test: a piece of a tissue mass is tested in a lab to determine what it is and whether it’s benign or malignant.

Screening tests always involve probability. They cannot predict with 100 percent certainty that a person has a condition, but they can suggest what a person’s likelihood is of having a condition within a certain range. Diagnostic tests may not always be definitive either, but the doctor is basically making a decision about the presence of a condition so that the next step in treatment can be decided. (Some conditions prevent total certainty because they are based only on symptoms identified through a clinical exam and don’t have a biological way to test for them, such as the majority of mental illnesses and many autoimmune conditions and seizure disorders.)

All tests have the potential to be wrong, whether by incorrectly pointing toward a condition the patient does not have (false positive) or by failing to identify a problem the patient does have (false negative). But when a screening test is “wrong,” it really means that a person has ended up on the minority side of the probability curve. For example, if a screening test suggests a 90 percent likelihood that a woman has a child with a chromosomal disorder, and then a diagnostic test determines through an amniocentesis that no chromosomal disorder exists, then the screening test was not really “wrong.” Rather, the woman falls into that 10 percent of lesser probability. Where it gets admittedly a bit more confusing is that a diagnosis can certainly take into account the results of one or more screening tests. However, a screening test alone cannot a diagnosis make.

Another key difference that can exist between screening and diagnostic tests, most of the time, is how insurance companies and other payers treat them. As a (usually) preventive intervention, screening tests are nearly always covered 100 percent by insurance companies. Diagnostic tests, on the other hand, often require a copay and are frequently more expensive than screening tests. Screening tests also tend to be simpler and noninvasive whereas diagnostic tests may be invasive or otherwise more involved.

Two key questions can help a reporter determine whether a test is screening or diagnostic (aside from asking a clinician, which is the best idea when possible): Is the person receiving the test presumed to be healthy and non-symptomatic at the start? If so, it’s a screening test. Is the test designed to lead to a definite diagnosis? If so, then it’s a diagnostic test. This link offers a helpful chart of the differences as well.

Self-controlled case series

In a “controlled” study, the participants receiving an intervention are compared to a control group of participants who don’t receive the intervention or receive a placebo. Although the intervention and control groups should contain participants as similar as possible, it’s not possible to use clones, so it’s always possible other confounders could affect the results. But in a self-controlled case series study design, the participants comprise both the intervention group and the control group — they are compared against themselves at different periods in time. Instead of looking at the background rate of an overall population, researchers look at how often a particular event occurs in each individual person over a set amount of time without an intervention. This personal background rate is then compared to what happened during or after the intervention when any possible effects would be expected to occur. This type of design was developed originally to study adverse events from vaccines, but researchers may also sometimes use it to look at effects of other interventions, such as short experimental programs or medications.

This kind of study cannot be used for just any kind of intervention. For example, if you tried to compare a person taking medication X to themselves before medication X, you still wouldn’t know if whatever is observed happens because of medication X or something else. But if a specific time period exists during which a reaction would be expected, then researchers can compare what happens within that window to what happens outside that window. For example, if an inactivated vaccine is going to cause any kind of reaction in a person, it will happen within the first 48 hours. That’s the only time window when it’s biologically possible for the vaccine to affect a person because that’s the time period during which the immune response occurs. If researchers are looking to see whether a particular vaccine causes a specific reaction, such as a seizure, they can compare what happens in those 48 hours after the vaccine’s administration to several months before and after the vaccine to see if seizures occur more frequently in those 48 hours than at other times across the whole group. Similarly, reactions from live vaccines occur approximately one to two weeks after the vaccine is given, so researchers looking for live vaccine reactions would compare what happens in those few weeks to what happens in the months leading up to the vaccine, a few months after the vaccine, and in the period between the vaccine and the window when a reaction is biologically plausible.
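The basic comparison behind this design, how often an event happens inside the risk window versus outside it, can be sketched in Python. The numbers below are entirely hypothetical, and the calculation is a crude rate ratio meant only to show the logic. Published self-controlled case series analyses use more rigorous statistical models, so treat this as an assumption-laden illustration rather than the actual method.

```python
def event_rate(events: int, people: int, days_per_person: int) -> float:
    """Events per person-day of observation."""
    return events / (people * days_per_person)


# Hypothetical data: 1,000 vaccinated children, each observed during a
# 2-day risk window after the shot and a 180-day comparison period.
people = 1000
risk_window_rate = event_rate(events=6, people=people, days_per_person=2)
comparison_rate = event_rate(events=90, people=people, days_per_person=180)

# A ratio above 1 means the event was more frequent inside the risk window.
rate_ratio = risk_window_rate / comparison_rate

print(f"Rate in risk window: {risk_window_rate:.4f} events per person-day")
print(f"Rate in comparison period: {comparison_rate:.4f} events per person-day")
print(f"Rate ratio: {rate_ratio:.1f}")
```

Because the same people supply both rates, differences between individuals (such as an underlying seizure disorder) affect both the numerator and the denominator.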

A self-controlled case series can also be used for studies involving specific experimental conditions that occur within a clearly defined time period. For example, if a person in a sleep study went through a rigorous experimental schedule for one week to see how their body reacted to different sleeping conditions, it would make sense to compare their body’s processes (blood pressure, blood sugar, fatigue, etc.) during the sleep experiment to those same body processes before the experiment and a few weeks after the experiment, when the body has readjusted to normal. If their blood pressure increased only during the sleep study but was consistent in the weeks before and several weeks after the experiment, it’s much more likely that the experimental sleep conditions, or at least something that occurred during that experiment, influenced blood pressure.

The advantage of self-controlled case series studies, when the conditions are appropriate for them, is that no other cause is likely to lead to the effects being studied in a person except the short time-sensitive intervention; they act as their own controls so that any unidentified, underlying conditions cannot confound the results when compared to other people. More reading is available here, here and here.

Sex vs. gender

Two of the most commonly confused concepts in everyday language are sex and gender. Most often, the confusion is a belief that these terms mean the same thing, but they do not. Sex refers to a biological designation based on presence of sexual characteristics associated with “maleness” or “femaleness.” Most often, these sexual characteristics — primarily female or male reproductive organs — correspond with chromosomal sex determination. That is, a person with XX chromosomes typically would have female sexual organs and characteristics and would be designated as the female sex, whereas a person with XY chromosomes would have male sexual characteristics and organs and be designated the male sex. However, sex designation becomes more complex with other sexual chromosomal combinations, such as XYY or XXY. Individuals with these chromosomal combinations, typically called intersexual, may still receive an assigned male or female sex based on the most prominent or dominant sexual characteristics.

Gender, on the other hand, is not biological. It is a social construct of identity, and it is often defined by an individual. (A community or society at large may attempt to assign a gender to an individual who rejects that gender label, but the ultimate determination of an individual’s gender should ethically be defined only by that person.) “Male” and “female” are also terms used with gender, but many other gender identities exist as well, whether informally or formally recognized.

The distinction between sex and gender is particularly important for journalists to understand because many researchers do not appropriately distinguish between them in their research studies. A study may use the term “gender” when the researchers actually intended “sex,” or vice versa. If a conscientious journalist is aware of the distinction, they can hopefully recognize when a study misuses the term and/or ask the researchers whether it was actually gender or sex that was recorded.

Another reason journalists must understand gender and sex is to ensure they use appropriate identifiers and pronouns when discussing an individual, especially if that individual’s presenting gender does not conform to traditionally understood or recognized gender labels. The safest way to ensure you are properly identifying a person is to ask them which pronouns they prefer. These preferred pronouns should supersede any publication style guide rules as a matter of human dignity.

Statistical significance

Statistical significance is a test that researchers apply to their results to find out if their results represent real effects or if they could have occurred simply by chance.

There are two main ways that statistical significance is reported in research: the P value and the 95 percent confidence interval.

The P value is usually said to be “significant,” or unlikely to be the result of chance, if it is less than .05. Sometimes that threshold is .01, but it’s typically .05. A P value less than .05 means that, if there were truly no effect, there would be a less than 5 percent probability of seeing a result at least this extreme by chance alone.

(Note: Study results can pass a test of statistical significance and still be wrong. By definition, the P value and 95 percent confidence interval carry a 95 percent chance of being right and a 5 percent chance of being wrong. It’s unlikely, but still possible.) See a tutorial from the Cochrane Collaboration about P values and statistics.

A different way of stating significance is the 95 percent confidence interval. Confidence intervals that do not cross the number 1 (the risk assigned to the reference group) are said to be statistically significant.

Confidence intervals are a bit more descriptive than P values, and thus they can be more informative to reporters who are looking at a study. Rather than boiling the test of significance down to a yes/no answer as the P value does, the confidence interval shows the range over which the results may be true.

For example, consider this table from a recent study, published in the Archives of Internal Medicine, which reported a person’s chance of developing end-stage kidney disease if they were overweight or obese as a teen:

Reading the table tells us that people who were underweight at age 17 seem to have a slightly reduced chance of developing kidney disease as adults. Looking at the row highlighted by the red arrow, we see their hazard ratio is .49, after researchers adjusted their data to account for other things that might be influencing the risk of kidney disease. A hazard ratio of .49 means that their risk of developing kidney disease is reduced by about 51 percent (100-49=51) compared to the reference group, which was made up of people who were at normal weights as teenagers. Now check out the confidence interval, which is reported in parentheses beside the hazard ratio.

The confidence interval shows that underweight individuals had chances of developing end-stage kidney disease that ranged from .18 to 1.34. That means the real risk of kidney disease for underweight people was anywhere from 82 percent lower than the reference group (100-18=82) to 34 percent higher than the reference group. That’s pretty well all over the map. Because that confidence interval includes the number 1, the result is not statistically significant, and we can’t be confident that people who are underweight really have a 51 percent reduced chance of getting kidney disease compared to people who are normal weight. Their true risk is not known.

Now look at the row highlighted with the red arrow. These results are the chances that a person who was obese as a teenager would develop end-stage kidney disease as an adult. Their hazard ratio is 19.37 with a confidence interval that ranges from 14.13 to 26.55. That confidence interval shows us that the risk of end-stage kidney disease in this group is somewhere between 14 times and 26 times higher than it is for someone who is normal weight. Because that range doesn’t include the number 1, it’s unlikely that researchers observed these results due to chance alone.
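The rule of thumb used in these two examples, checking whether the confidence interval includes 1 and converting the ratio into a percent difference, can be written as a small Python sketch. The hazard ratios and intervals are the ones quoted above; the function name `describe_ratio` is made up for this illustration, and the check applies only to ratio measures such as hazard ratios and relative risks.

```python
def describe_ratio(ratio: float, ci_low: float, ci_high: float) -> tuple[bool, float]:
    """Return (is_significant, percent_difference) for a ratio and its 95% CI.

    The result counts as statistically significant when the interval
    does not include 1, the value assigned to the reference group.
    """
    is_significant = not (ci_low <= 1.0 <= ci_high)
    percent_difference = (ratio - 1.0) * 100
    return is_significant, percent_difference


# Figures quoted in the text for the two groups.
rows = {
    "underweight at 17": (0.49, 0.18, 1.34),
    "obese at 17": (19.37, 14.13, 26.55),
}

for label, (hr, low, high) in rows.items():
    significant, pct = describe_ratio(hr, low, high)
    verdict = "significant" if significant else "not significant"
    print(f"{label}: hazard ratio {hr} ({low} to {high}), {pct:+.0f} percent, {verdict}")
```

The same function flags the underweight result as not significant, because its interval spans 1, and the obese result as significant.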

Confidence intervals that are very wide, meaning that they encompass a very broad range of results, are less reliable than confidence intervals that are narrow. Narrow confidence intervals show that the intervention or variable had about the same effect on all the people in the group.

Now look at the lower half of the chart. The researchers knew that people who were overweight or obese were more likely than people at normal weight to develop diabetes later in life. Diabetes is a major contributor to kidney disease. They wanted to see if the relationship between weight and kidney disease was still there if they excluded people who had developed diabetes. Remarkably, when they looked at the data this way, the confidence intervals for the risks of being overweight or obese narrow even more, suggesting that those effects are even more reliable and likely to be real (yellow arrow).

Surrogate endpoints: Handle with care

Clinical trials often rely on surrogate endpoints to determine whether treatments work. In medicine, surrogates are biomarkers (e.g., blood pressure, cholesterol, proteins) that are thought to lie in the causal pathway of harder health outcomes like heart attacks, strokes, deaths, and dementias.

Researchers use surrogate or intermediate endpoints because it’s usually faster and less expensive to see effects on these stand-ins than to wait for actual events.

A 2008 study in the Journal of the American Medical Association found that surrogate endpoints were the rule more than the exception in 436 clinical trials of diabetes treatments. In that study, patient-important outcomes — including cardiovascular events, death, pain, function, and quality of life — were chosen as primary outcomes just 18 percent of the time. They were primary or secondary outcomes in less than half of the studies.

Sometimes when drugs affect a surrogate endpoint, they also affect the risk of a health outcome. That’s true in the case of statins, which lower LDL, or “bad,” cholesterol and also reduce the risk of heart attacks. But it’s proving not to be true in many other instances:

  • Two Alzheimer’s drugs — bapineuzumab and solanezumab — reduce the buildup of beta amyloid protein in the brain. But both ultimately failed to improve thinking or memory better than placebos in patients with mild to moderate Alzheimer’s disease.

  • The drug Tredaptive increases HDL or “good” cholesterol and lowers LDL, but a large trial found it didn’t prevent heart attacks, strokes, or heart procedures.

  • Tight control of blood sugar was long thought to be the best way to keep diabetic patients healthy. But more and more studies are describing a paradox — patients with the tightest control of blood sugar sometimes have worse health outcomes than those who don’t lower their blood sugar as aggressively.

  • The cancer drug Avastin extends progression-free survival in patients with advanced breast cancer, but not overall survival, leading the FDA to revoke its approval for that indication.

For this reason, reporters should always be clear about the limits of studies that use surrogate endpoints.

Here’s how Andrew Pollack handled it in a story for The New York Times on experimental injections that lower cholesterol:

And there are still some caveats. One is that while the drugs lower cholesterol, it has not yet been shown that they actually reduce the risk of heart attacks, strokes or other cardiovascular problems.