
Batea initiative hopes to help improve Wikipedia health content

Date: 12/14/15

By Tara Haelle

It's well known that Google, Facebook and dozens of other companies mine the browsing histories of their users and use that data for advertising. But imagine what we might learn if we homed in on the browsing histories of medical students visiting clinical research sites during their studies. That's exactly the idea behind a new browser extension called Batea, recently released by the company DocGraph.

The data collected will be shared with Wikipedia for WikiProject Medicine, which focuses on updating medical and health content on Wikipedia. With its crowdsourced platform, Wikipedia content often may not be ideal for journalists to rely on (though it's great for a quick refresher or to skim the sources). But its more than 25,000 medical articles get 200 million views per month. In its press release, DocGraph cites a 2014 study finding that Wikipedia is the single leading source of medical information for both patients AND health care professionals. So they need to get their info right. That's part of the rationale behind using the browsing histories of medical students, and it's possible this data may be useful to journalists as well.

DocGraph was founded by Fred Trotter, an AHCJ member who has been working for years to make big data accessible and useful from the consumer side. The company's mission is "to create, maintain, and improve open health care datasets" while aiming "to grow the open health data movement and build a community of data scientists, journalists, and clinical enterprises who use open data to understand and help evolve the health care system."

I asked Trotter about Batea, which was created with support from the Robert Wood Johnson Foundation, and how it might serve journalists in the future.


Q: How can the browsing histories of medical students be useful for Wikipedia?

A: Our basic thesis is that medical students use Wikipedia specifically, and the Internet generally, much differently than they will later in their careers. As they struggle to keep pace with the demands of learning so much information, Wikipedia should be a great help to them, and sometimes it is. Wikipedia has a goal of having accurate articles that do not use technical jargon, which can be very helpful when you are being introduced to a difficult subject like medicine.

But we also know that medical students are experiencing, later in their education, the cutting edge of medicine. They are in a very good position to ensure that Wikipedia articles get the finer points right and to ensure that important medical content is not overlooked. If you put those two things together — learning the cutting edge and leveraging Wikipedia all the time — we think medical student browsing patterns and comments will be extremely helpful to the clinical editors of Wikipedia.

Having said that, we are also inviting clinicians and members of the patient community to contribute too. We need Wikipedia clinical content to work from every perspective.

Q: What might be done with the data?

A: In the beginning it will be basic triage. Medical students will complain about specific content in specific articles, and we will pass those comments along to Wikipedia editors for repair. We are coordinating our efforts with the University of California, San Francisco and other programs that formally encourage medical students to become Wikipedia editors. The goal is to establish a feedback loop where first- and second-year medical students discover problems in medical articles, and fourth-year medical students fix them. Eventually we expect both critiquing and improving Wikipedia to become a core part of medical education not just for doctors, but for nurses, pharmacists and other health care professionals too.

Q: Is this data something journalists could ever have access to?

A: DocGraph is a health care data journalism organization, and it is our mission to provide journalists with data about the health care system. There are a few reasons why we might have data that is not provided automatically to any journalist who asks for it, such as protecting patient/contributor privacy, but given a very short list of caveats, everything else we do is available at no charge to journalists. Once we have enough data contributed to protect everyone's privacy, we will be contacting AHCJ to let everyone know that this data is available.

Q: How might the data be useful to a journalist?

A: I expect this data is going to be especially valuable to health care journalists who focus on informing the public about specific health care topics. I think we have a pretty good understanding of what is available on the Internet when a patient searches for something generic like "diabetes." But what happens when things get more and more specific, as we step from "diabetes" to "insulin" to "diabetic retinopathy" to "macula," and how do medical students rate those articles? How do clinicians? Where do patients stumble?

If you think of the health care journalistic obligation as "tell the public what is changing in the world of health care," I am not sure Batea data will be that helpful. But if the purpose of health care journalism is "inform the public about what they do not currently know, but that they need to know about health care," then Batea will be a gold mine. (“Batea” is the Spanish word for “gold pan.”) In reality, I think different health care journalists take one or both of those approaches to health care journalism, so I expect this data to be really useful to some journalists and completely useless to others.

We have very little "at scale" information about how patients consume health care information. This project will be part of fixing that problem.

Q: For example, could they sort the data by subject matter to see what the top sites are for learning clinical information about a particular medical condition?

A: The right way to think about browsing patterns is a "hierarchy in time". It’s a hierarchy because of the back button, the ultimate "that was not helpful" signal that people can send using Batea. It’s "in time" because it is fundamentally a sequenced time-stamped data set. Batea aims to create thousands or even millions of these hierarchies and merge them into useful patterns. Google and Bing already ask the question "Do people find what they want?" using this type of information. But health care journalists, in some sense, have a much deeper responsibility. They should be asking "Do patients find the truth?" That’s a much different question that requires us to model carefully what people were looking for, and what they are finding.
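To make the "hierarchy in time" idea concrete, here is a minimal sketch of how a sequence of time-stamped browsing events could be folded into a tree, with the back button marking a branch the reader abandoned. The article does not describe Batea's actual data format, so the event layout, the `PageVisit` class, the `build_hierarchy` function and the page names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PageVisit:
    """One page view in a browsing session (hypothetical record layout)."""
    url: str
    timestamp: float  # seconds since the session started
    parent: Optional["PageVisit"] = field(default=None, repr=False)
    children: List["PageVisit"] = field(default_factory=list)
    backed_out: bool = False  # True if the reader pressed "back" from this page


def build_hierarchy(events):
    """Fold (timestamp, action, url) events into a tree of PageVisit nodes.

    action is "visit" (the reader followed a link to url) or "back"
    (the reader returned to the previous page; url is ignored).
    """
    root = None
    current = None
    for timestamp, action, url in sorted(events, key=lambda e: e[0]):
        if action == "visit":
            node = PageVisit(url=url, timestamp=timestamp, parent=current)
            if current is None:
                root = node
            else:
                current.children.append(node)
            current = node
        elif action == "back" and current is not None and current.parent is not None:
            # The "that was not helpful" signal: mark the page and step up.
            current.backed_out = True
            current = current.parent
    return root


# Invented example session, following the diabetes example from the interview.
events = [
    (0.0, "visit", "search:diabetes"),
    (5.0, "visit", "wiki/Diabetes"),
    (40.0, "visit", "wiki/Insulin"),
    (55.0, "back", ""),
    (60.0, "visit", "wiki/Diabetic_retinopathy"),
]
tree = build_hierarchy(events)
diabetes = tree.children[0]
```

In this invented session, the reader opens the insulin page from the diabetes page, backs out, and then tries the diabetic retinopathy page, so both pages end up as siblings under the diabetes page, and only the insulin page is flagged as backed out. Merging many such trees would be a separate step that this sketch does not attempt.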

Once you have the right picture in your head about what the data is, then it’s pretty easy to start slicing it in ways that are useful for journalists. The top three questions that I would want to ask would be something like:

  • "How do people start searching for the health care topic I am interested in?"

  • "How often do they flow through a given website that has some unique information that I know is important?"

  • "Does the place that they 'give up' their clinical search actually have the right answers to the underlying question?"
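As a rough illustration of how the first two questions, and the starting point for the third, might be asked of such data, the sketch below counts where sessions begin, how often they pass through a given page, and where they end. The session format, the function names and the sample data are hypothetical. Judging whether a "give up" page actually holds the right answer would still require human review.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one reader visited
# while researching a topic. The data is invented for illustration.
sessions = [
    ["search:diabetes", "wiki/Diabetes", "wiki/Insulin"],
    ["search:diabetes", "wiki/Diabetes", "wiki/Diabetic_retinopathy", "wiki/Macula"],
    ["wiki/Insulin", "example.org/insulin-guide"],
]


def entry_points(sessions):
    """Count the first page of each session: how do people start searching?"""
    return Counter(s[0] for s in sessions if s)


def pass_through_rate(sessions, page):
    """Fraction of sessions that visit the given page at any point."""
    if not sessions:
        return 0.0
    return sum(page in s for s in sessions) / len(sessions)


def give_up_pages(sessions):
    """Count the last page of each session: where do people stop looking?"""
    return Counter(s[-1] for s in sessions if s)


starts = entry_points(sessions)
rate = pass_through_rate(sessions, "wiki/Diabetes")
ends = give_up_pages(sessions)
```

The last page of a session is only a proxy for giving up; a reader may also stop because the answer was found, which is why the third question needs a person to check the page itself.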

Q: Can you describe other applications in journalism?

A: I think there will probably be some stories about large meta-patterns in the data. I hope websites such as ProPublica and FiveThirtyEight decide to take that challenge up.

I think there are likely to be some "industry cover-up" type stories, where you find that some part of the Internet is being "funded" (probably indirectly) to maintain a clinical "party line" about an issue. I think combining the browsing pattern data with some of the open payment data will be fascinating for this kind of investigation. I think people outside health care would be pretty surprised at the degree to which information about medications, devices and treatments is sourced from companies with direct profit interests. I expect (and hope) that outright conspiracies will be rare, but instead I think that there will be lots of cases where a "sloppy echo chamber" ensures that patients cannot get correct information easily. [Trotter clarified that these are conjectures, and it remains to be seen what actually pans out.]

Q: Where does most health care "big data" come from currently?

A: HHS generally, and CMS specifically, have really opened up and started treating their claims data source as a public resource. I could spend 400 percent of my time thinking just about that and enjoy every minute of that, but claims data is financial data about clinical data, which is a far cry from clinical data. We need to start getting richer data sets, and the kind of very clinical topic-focused information that Batea could generate is a solid alternative to what we can do effectively with claims.

[Trotter noted that a tremendous amount of "-omic" (genomic, proteomic, etc.) data is also publicly available but not easily accessed — perhaps an opportunity for other entrepreneurs.]

Q: How is using browsing histories different and beneficial?

A: Browsing the Internet is a little bit like choosing a path to walk. We have all seen "desire paths" that do not conform to sidewalk designs. Wikipedia is organic and naturally responds to users’ needs, and my hope is to accelerate that process using browsing pattern data.

Thankfully, Batea also generates comment data. Using Batea, you can communicate your motivation rather than merely your browsing pattern. No Batea users are forced to use that feature, but if even a few do, it will mean a much richer browsing data set than has ever been released to the public before. Hopefully that means that we can both detect and solve completely new types of health care knowledge dissemination problems. If we are lucky and careful, it means we could dramatically improve the decisions patients make, and ensure longer and healthier lives for everyone.