Why Sabotage is Rarely an Issue in Citizen Science

Following a recent Nature editorial, the Citizen Science researcher-practitioner community has been having a lively discussion. Muki Haklay made a great initial analysis of the editorial, and you should read that before continuing on.

OK, now that you know the back story, a related comment from Sam Droege on the cit-sci-discuss listserv included the following observation:

Statistically, to properly corrupt a large survey without being detected would require the mass and secret work of many of the survey participants and effectively would be so complicated and nuanced that it would be impossible to manage when you have such complex datasets as the Breeding Bird Survey.

I agree, and I frequently have to explain this to reviewers in computing, who are often concerned about the risk of vandalism (as seen all over Wikipedia).

Based on a very small number of reports from projects with very large contributor bases—projects that are statistically more likely to attract malcontents due to size and anti-science saboteurs due to visibility—only around 0.0001% of users (if that) are blacklisted for deliberately (and repeatedly) submitting “bad” data.

If we presume that we’re failing to detect such behavior in substantially more people than we actually catch, say a couple of orders of magnitude more, we’d still only be talking about 0.01% of the users, who pretty much always submit less than 0.01% of the data (these are not your more prolific “core” contributors). In no project that I’ve ever encountered has this issue been considered a substantial problem; it’s just an annoyance. Most ill-intentioned individuals quickly give up their trolling ways when they are repeatedly shut down without any fanfare. From a few discussions with project leaders, it seems that each of those individuals has a rather interesting story, and their unique participation profiles make their behaviors obvious as…aberrant.
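To make that arithmetic concrete, here is a back-of-envelope sketch in Python; every number in it is invented for illustration and not drawn from any real project.

```python
# Back-of-envelope sketch; all numbers are invented for illustration.
users = 1_000_000                # hypothetical contributor base
caught_rate = 0.000001           # ~0.0001% of users blacklisted for bad-faith data
undetected_multiplier = 100      # assume we miss two orders of magnitude more

saboteurs = users * caught_rate * undetected_multiplier
share_of_users = saboteurs / users

print(f"estimated saboteurs: {saboteurs:.0f} ({share_of_users:.4%} of users)")
# -> estimated saboteurs: 100 (0.0100% of users)
# And since these are not core contributors, their share of the records is
# plausibly no larger than their share of the users.
```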

In fact, the way most citizen science projects work makes it unlikely that they would be seen as good targets for malicious data-bombing anyway. Why? For better or worse, a lot of citizen science sites provide relatively little support for social interaction: less access to an audience means they’re not going to get a rise out of people. Those projects that do have vibrant online communities rarely tolerate that kind of thing; their own participants quickly flag such behavior and if the project is well-managed, the traces are gone in no time, further disincentivizing additional vandalism.

From a social psychological standpoint, it seems that the reality of the situation is actually more like this:

  1. convincingly faking scientific data is (usually a lot) more work than collecting good data in the first place;
  2. systematically undermining data quality for any specific nefarious purpose requires near-expert knowledge and skill to accomplish, and people fitting that profile are unlikely to be inclined to pull such shenanigans;
  3. anyone who genuinely believes their POV is scientifically sound should logically be invested in demonstrating it via sound science and good data quality;
  4. most citizen science projects do not reward this kind of behavior well enough to encourage ongoing sabotage, as discussed above; and
  5. as Sam noted, effectively corrupting a large-scale project’s data without detection requires a lot of smarts and more collaboration than is reasonable to assume anyone would undertake, no matter how potentially contentious the content of the project. They’d be more likely to succeed in producing misleading results by starting their own citizen-counter-science project than trying to hijack one. And frankly, such a counter-science project would probably be easy to identify for what it was.

Seriously, under those conditions, who’s going to bother trying to ruin your science?

Citizen Science Data Quality is a Design Problem

I’ve been giving talks for years that boil down to, “Hey citizen science organizers, it’s up to you to design things so your volunteers can give you good data.” I genuinely believe that most data quality issues in citizen science are either 1) mismatched research question and methodology, or 2) design problems. In either case, the onus should fall on the researcher to know when citizen science is not the right approach or to design the project so that participants can succeed in contributing good data.

So it’s disheartening to see a headline like this in my Google alerts: Study: Citizen scientist data collection increases risk of error.

Well. I can only access the abstract for the article, but in my opinion, the framing of the results is all wrong. I think that the findings may contribute a useful, albeit veiled, summary of the improvements to data quality that can be achieved through successive refinements of the study design. If you looked at it that way, the paper would say what others have: “after tweaking things so that normal people could successfully follow procedures, we got good data.” But that’s not particularly sensational, is it?

Instead, the news report makes it sound like citizen science data is bad data. Not so, I say! Bad citizen science project design makes for bad citizen science data. Obviously. (So I was really excited to see this other headline recently: Designing a Citizen Science and Crowdsourcing Toolkit for the Federal Government!)

The framing suggests that the authors, like most scientists and by extension most reviewers, probably aren’t very familiar with how most citizen science actually works. This is also completely understandable. We don’t yet have much in the way of empirical literature warning of the perils, pitfalls, and sure-fire shortcuts to success in citizen science. I suspect a few specific issues probably led to the unfortunate framing of the findings.

The wrong demographic: an intrinsically-motivated volunteer base is typically more attentive and careful in their work. The authors saw this in better results from students in thematically aligned science classes than from students in general science classes. The usual self-selection that occurs in most citizen science projects that draw upon volunteers from the general public might have yielded even better results. My take-away: high school students are a special participant population. They are not intrinsically-motivated volunteers, so they must be managed differently.

The wrong trainers and/or training requirements: one of the results was that university researchers were the best trainers for data quality. That suggests that the bar was too high to begin with, because train-the-trainer works well in many citizen science projects. My take-away: if you can’t successfully train the trainer, your procedures are probably too complicated to succeed at any scale beyond a small closely-supervised group.

The wrong tasks: students struggled to find and mark the right plots; they also had lower accuracy in more biodiverse areas. There are at least four problems here.

  1. Geolocation and plot-making are special skills. No one should be surprised that students had a hard time with those tasks. As discussed in gory detail in my dissertation, marking plots in advance is a much smarter approach than relying on geolocation; using distinctive landmarks like trail junctions is also reasonable.
  2. Species identification is hard. Some people are spectacularly good at it, but only because they have devoted substantial time and attention to a taxon of interest. Most people have limited skills and interest in species identification, and therefore probably won’t get enough practice to retain any details of what they learned.
  3. There was no mention of the information resources the students were provided, which would also be very important to successful task completion.
  4. To make this task even harder, it appears to be a landscape survey in which every species in the plot is recorded. That means that species identification is an extra-high-uncertainty task; the more uncertainty you allow, the more ways you’re enabling participants to screw up.

On top of species identification, the students took measurements, and there was naturally some variation in accuracy there too. There are a lot of ways the project could have supported data quality, but I didn’t see enough detail to assess how well they did. My take-away: citizen science project design usually requires piloting several iterations of the procedures. If there’s an existing protocol that you can adopt or adapt, don’t start from scratch!

To sum it up, the citizen science project described here looks like a pretty normal start-up, despite the slightly sensational framing of the news article. Although one of the authors inaccurately claims that no one is keeping an eye on data quality (pshah!), the results are not all that surprising given some project design issues, and most citizen science projects are explicitly structured to overcome such problems. For the sharp-eyed reader, the same old message shines through: when we design it right, we can generate good data.

Crowdsourcing session, CSCW 2013

ACM Conference on Computer Supported Cooperative Work and Social Computing
26 February, 2013
San Antonio, TX


——

Tammy Waterhouse – Pay by the Bit: Information-theoretic metric for collective human judgment

Collective human judgment: using people to answer well-posed objective questions [RIGHT/WRONG]. Collective human computation in this context – related questions grouped into tasks, e.g. birthdays of each Texan legislator.

Gave example of Galaxy Zoo. Issues of measuring human computation performance: faster? Rewarding speed encourages poor quality. More accurate? Percent correct isn’t always useful/meaningful.

Using info entropy – self-information of a random outcome (the surprise associated w/ the outcome); entropy of a random variable is its expected self-information. Resolving collective judgment – model uses Bayesian techniques. Then looked at the entropy remaining after conditioning on the observed judgments – conditional entropy. Used data from Galaxy Zoo to look at question scheduling; new approach improved overall performance.
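For readers who want the quantities spelled out, here is a minimal sketch of entropy and information gain; the distributions and labels are invented, and this is only the flavor of the approach, not Waterhouse’s actual model.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical posterior over the true label of one Galaxy Zoo-style question,
# before and after conditioning on an additional volunteer judgment.
prior = [0.5, 0.3, 0.2]                  # e.g. spiral / elliptical / merger
posterior_after_vote = [0.8, 0.15, 0.05]

remaining = entropy(posterior_after_vote)
gained = entropy(prior) - remaining
print(f"prior: {entropy(prior):.2f} bits, remaining: {remaining:.2f} bits, "
      f"gained: {gained:.2f} bits")

# A scheduler in this spirit sends volunteers to the questions with the most
# remaining entropy, i.e. where another judgment buys the most information.
```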

——

Shih-Wen Huang – Enhancing reliability using peer consistency evaluation in human computation

Human computation not reliable – when tested, many people couldn’t count the nouns in a 15-word list. Without quality control, accuracy was around 70%. Believes quality control is the most important thing in human computation.

Gold standard evaluation: objectively determined correct answer [notably, not always possible]. Favored by researchers but not scalable because gold standard answers are costly to generate.

Peer consistency in GWAP (games with a purpose): sometimes use inter-player consistency to reward/score. Mechanism significantly improves outcomes. Using peer consistency evaluation as a scalable mechanism – can it work? Used Amazon Mechanical Turk (AMT) to test it. Concludes peer consistency is scalable and effective for quality control.
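As a rough illustration of the general idea (not necessarily the paper’s exact mechanism), peer consistency scoring can be as simple as rewarding agreement with the peer majority on each item; the data below are made up.

```python
from collections import Counter, defaultdict

# Toy peer-consistency scoring: reward a worker whose answer matches the
# majority of their peers on the same item. All data below are invented.
answers = {
    "item1": {"w1": "7", "w2": "7", "w3": "6"},
    "item2": {"w1": "4", "w2": "5", "w3": "5"},
}

scores = defaultdict(int)
for item, by_worker in answers.items():
    for worker, answer in by_worker.items():
        peers = [a for w, a in by_worker.items() if w != worker]
        peer_majority, _ = Counter(peers).most_common(1)[0]  # ties break arbitrarily
        scores[worker] += int(answer == peer_majority)

print(dict(scores))  # higher score = more consistent with peers
```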

——

Derek Hansen – Quality Control Mechanisms for Crowdsourcing: Peer Review, Arbitration, & Expertise at FamilySearch Indexing

FamilySearch Indexing is one of the largest crowdsourcing projects around. Volunteers transcribe old records – 400K contributors.

Looked at several models to improve efficiency while reducing added time. Volunteers use a downloaded package to do tasks, so keystroke logging with idle time can be used to evaluate task efficiency. Compared the arbitration process with a simple peer review. A-B agreement varied by form field. Experienced contributors had better agreement.
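For concreteness, field-level A-B agreement can be computed along these lines; the field names and records are invented, not FamilySearch data.

```python
# Sketch of per-field agreement between two independent transcriptions (A and B).
records = [
    {"A": {"surname": "Smith",  "birth_year": "1872"},
     "B": {"surname": "Smith",  "birth_year": "1879"}},
    {"A": {"surname": "Garcia", "birth_year": "1901"},
     "B": {"surname": "Garcia", "birth_year": "1901"}},
]

for field in records[0]["A"]:
    matches = sum(r["A"][field] == r["B"][field] for r in records)
    print(f"{field}: {matches / len(records):.0%} A-B agreement")
# Low-agreement fields (here, birth_year) are where arbitration or peer
# review earns its keep.
```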

Implications: retention is important – experienced workers are faster and more accurate; encourages novices and experts to do more; contextualized knowledge and specialized skills needed for some tasks. Tension between recruitment and retention in crowdsourcing – the assumption that more people makes up for losing an experienced person is not always true. In this context it would take 4 new recruits to replace 1 experienced volunteer.
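The 4-to-1 figure is from the talk; the toy arithmetic below just shows, with made-up numbers, how differences in speed and accuracy can produce a ratio like that.

```python
# Illustrative arithmetic only (rates invented): how speed and accuracy gaps
# can add up to a "4 recruits per experienced volunteer" ratio.
experienced_rate, experienced_accuracy = 40, 0.95   # records/hour, share correct
novice_rate, novice_accuracy = 11, 0.85

experienced_output = experienced_rate * experienced_accuracy  # correct records/hour
novice_output = novice_rate * novice_accuracy

print(f"recruits per experienced volunteer: {experienced_output / novice_output:.1f}")
# -> about 4.1 with these made-up numbers
```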

Findings: no need for a second round of review/arbitration – only slight reduction of error and arbitration adds more time (than it’s really worth).

Implications: peer review offers considerable efficiency gains, with nearly as good quality as the arbitration process. Can prime reviewers to find errors, highlight potential problems (e.g., flagging), etc. Integrate human and algorithmic transcription – use algorithms on easy fields, integrated with human reviews.