Citizen Science Data Quality is a Design Problem

I’ve been giving talks for years that boil down to, “Hey citizen science organizers, it’s up to you to design things so your volunteers can give you good data.” I genuinely believe that most data quality issues in citizen science are either 1) mismatched research question and methodology, or 2) design problems. In either case, the onus should fall on the researcher to know when citizen science is not the right approach or to design the project so that participants can succeed in contributing good data.

So it’s disheartening to see a headline like this in my Google alerts: Study: Citizen scientist data collection increases risk of error.

Well. I can only access the abstract for the article, but in my opinion, the framing of the results is all wrong. I think the findings actually offer a useful, if veiled, summary of the improvements to data quality that can be achieved through successive refinements of the study design. Looked at that way, the paper would say what others have: “after tweaking things so that normal people could successfully follow procedures, we got good data.” But that’s not particularly sensational, is it?

Instead, the news report makes it sound like citizen science data is bad data. Not so, I say! Bad citizen science project design makes for bad citizen science data. Obviously. (So I was really excited to see this other headline recently: Designing a Citizen Science and Crowdsourcing Toolkit for the Federal Government!)

The framing suggests that the authors, like most scientists and by extension most reviewers, probably aren’t very familiar with how most citizen science actually works. This is also completely understandable. We don’t yet have much in the way of empirical literature warning of the perils, pitfalls, and sure-fire shortcuts to success in citizen science. I suspect a few specific issues probably led to the unfortunate framing of the findings.

The wrong demographic: an intrinsically-motivated volunteer base is typically more attentive and careful in its work. The authors saw this in the better results from students in thematically aligned science classes than from those in general science classes. The usual self-selection that occurs in most citizen science projects drawing volunteers from the general public might have yielded even better results. My take-away: high school students are a special participant population. They are not intrinsically-motivated volunteers, so they must be managed differently.

The wrong trainers and/or training requirements: one of the results was that university researchers were the best trainers for data quality. That suggests that the bar was too high to begin with, because train-the-trainer works well in many citizen science projects. My take-away: if you can’t successfully train the trainer, your procedures are probably too complicated to succeed at any scale beyond a small closely-supervised group.

The wrong tasks: students struggled to find and mark the right plots; they also had lower accuracy in more biodiverse areas. There are at least four problems here.

  1. Geolocation and plot-marking are special skills. No one should be surprised that students had a hard time with those tasks. As discussed in gory detail in my dissertation, marking plots in advance for participants is a much smarter approach; using distinctive landmarks like trail junctions is also reasonable.
  2. Species identification is hard. Some people are spectacularly good at it, but only because they have devoted substantial time and attention to a taxon of interest. Most people have limited skills and interest in species identification, and therefore probably won’t get enough practice to retain any details of what they learned.
  3. There was no mention of the information resources the students were provided, which would also be very important to successful task completion.
  4. To make this task even harder, it appears to be a landscape survey in which every species in the plot is recorded. That means that species identification is an extra-high-uncertainty task; the more uncertainty you allow, the more ways you’re enabling participants to screw up.

On top of species identification, the students took measurements, and there was naturally some variation in accuracy there too. There are a lot of ways the project could have supported data quality, but I didn’t see enough detail to assess how well they did. My take-away: citizen science project design usually requires piloting several iterations of the procedures. If there’s an existing protocol that you can adopt or adapt, don’t start from scratch!

To sum it up, the citizen science project described here looks like a pretty normal start-up, despite the slightly sensational framing of the news article. Although one of the authors inaccurately claims that no one is keeping an eye on data quality (pshah!), the results are not all that surprising given some project design issues, and most citizen science projects are explicitly structured to overcome such problems. For the sharp-eyed reader, the same old message shines through: when we design it right, we can generate good data.

Citizen Science session, CSCW 2013

ACM Conference on Computer Supported Cooperative Work and Social Computing
27 February, 2013
San Antonio, TX


——

Sunyoung Kim – Sensr

Intro to types of citizen science and the diversity of project types. Common underlying characteristic: using volunteers’ time to advance science. Many typologies; projects can be divided by activity type into primarily data collection and data analysis/processing. Focus here is field observation, which has great opportunities for mobile technologies.

Problem is that most citizen science projects are resource-poor and can’t handle mobile technologies on their own. Goal is supporting people with no technical expertise in creating mobile data collection apps for their own citizen science projects. Terms used: campaign – project; author – person who creates a campaign; volunteer – someone who contributes data collection or analysis.

Design considerations include: 1) current technology use, similar available tools, and practitioners’ needs. Reviewed 340+ existing projects (campaigns) from scistarter.com; found only 11% provide mobile tools for data collection. Looked at the types of data they’re collecting – primarily location, pictures, and text entry. 2) Data quality is paramount, and the data also contains personal information. 3) How to recruit volunteers. Also looked at similar mobile data collection tools like EpiCollect and ODK. They’re pretty similar in terms of available functionality, but Sensr is the simplest to use. Most comparable platforms are open source, so you need programming skills to make them work (free as in puppies!) – even the term “open source” can be very techie for the target users.

Built Sensr as a visual environment combined with a mobile app for authoring mobile data collection tools for citizen science. Demo video shows setting up a data collection form for “eBird”: pick the fields to have on the form. Just a few steps create the back-end database and front-end mobile interface. Very straightforward interface for assembling a mobile app for citizen science data collection.
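(The talk didn’t go into implementation details; as a rough sketch of the idea, a campaign definition of this kind might boil down to something like the following. All class names, field types, and the schema shape are assumptions for illustration, not Sensr’s actual API.)

```python
# Hypothetical sketch of a Sensr-style campaign definition; class and field
# names are illustrative assumptions, not Sensr's actual API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FormField:
    name: str             # label shown on the mobile form
    kind: str             # "photo", "location", "text", ...
    required: bool = False

@dataclass
class Campaign:
    title: str
    author: str                                   # person who creates the campaign
    fields: List[FormField] = field(default_factory=list)

    def form_schema(self) -> dict:
        """Rough shape of the generated back-end table / mobile form."""
        return {"campaign": self.title,
                "columns": [(f.name, f.kind, f.required) for f in self.fields]}

# An author assembles a campaign by picking fields - no programming required:
bird_survey = Campaign(
    title="eBird-style bird survey",
    author="project organizer",
    fields=[FormField("photo", "photo"),
            FormField("location", "location", required=True),
            FormField("species notes", "text")])
print(bird_survey.form_schema())
```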

A couple of features: you can define a geographic boundary, but you can’t prevent people outside the boundary from joining (the App Store is global); you can, however, help users target the correct places. Authors can review the data before it becomes publicly viewable or goes into the scientific data set.
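(Again purely illustrative, with made-up names: a minimal sketch of how the boundary and review features described above could work – flag, rather than block, submissions outside the author-defined area, and hold everything for review before it becomes public.)

```python
# Minimal sketch (hypothetical names) of the two features mentioned above:
# a geographic boundary that flags, but does not block, out-of-area
# submissions, and a review step before data becomes publicly viewable.
def within_boundary(lat, lon, bbox):
    """bbox = (min_lat, min_lon, max_lat, max_lon) defined by the campaign author."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def queue_for_review(submissions, bbox):
    """Hold everything for author review; flag submissions outside the boundary."""
    for s in submissions:
        s["in_area"] = within_boundary(s["lat"], s["lon"], bbox)
        s["public"] = False   # nothing is visible until the author approves it
    return submissions

pending = queue_for_review(
    [{"lat": 37.3, "lon": -121.9, "obs": "trash present"},
     {"lat": 48.9, "lon": 2.4, "obs": "clear water"}],
    bbox=(36.0, -123.0, 39.0, -120.0))
print(pending)
```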

Did case studies (betas with existing projects) before launching the tool to see how nontechnical users fared with it. Strong enthusiasm for the app, especially from projects interested in attracting younger participants. Main contribution: Sensr lowers the barriers to implementing mobile data collection for citizen science.

Question about native apps versus HTML5 mobile browser apps due to need for cross-OS support.

Question about whether there’s a way to help support motivation; not the focus of this study. Case study projects didn’t ask for it because they were so thrilled to have an app at all.

——

Christine Robson – Comparing use of social networking and social media channels for citizen science

One of the main questions from practitioners at the Minnowbrook workshop on Design for Citizen Science (organized by Kevin Crowston and me) was how to get people to adopt technologies for citizen science, and how to engage them. These were questions that could be tested, so she ran some experiments.

Built a simple platform (sponsored by IBM Research) to address big-picture questions about water quality for a local project; the app development was advised by the California EPA. The app went global, and they’ve gotten data from around the world for 3 years now. Data can be browsed at creekwatch.org, and you can also download it as CSV if you want to work on it. The “Available on the App Store” button on the website was important for tracking adoption.

The Creek Watch iPhone app asks for only 3 data points: water level, flow rate, and presence of trash. These came from the CA Water Rapid Assessment survey, whose definitions help guide people on what to put in the app; images are timestamped, and users can look for nearby points as well. More in the CHI 2011 paper. Very specific use pattern: almost everyone submits data in the morning, probably while walking the dog, taking a run, something like that.
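(A sketch of the observation record as described in the talk: three categorical data points plus a timestamped, geolocated photo. The category labels and field names below are assumptions for illustration, not the app’s actual schema.)

```python
# Sketch of a Creek Watch observation as described in the talk: three data
# points (water level, flow rate, trash) plus a timestamped, geolocated photo.
# Category labels and field names are assumptions, not the app's real schema.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

WATER_LEVELS = ("dry", "some", "full")      # illustrative category labels
FLOW_RATES   = ("still", "slow", "fast")
TRASH        = ("none", "some", "a lot")

@dataclass
class CreekObservation:
    water_level: str
    flow_rate: str
    trash: str
    lat: float
    lon: float
    photo_path: Optional[str] = None
    timestamp: Optional[datetime] = None

    def __post_init__(self):
        # Constrain entries to the survey's categories to keep the data clean.
        assert self.water_level in WATER_LEVELS
        assert self.flow_rate in FLOW_RATES
        assert self.trash in TRASH
        if self.timestamp is None:
            self.timestamp = datetime.now(timezone.utc)

obs = CreekObservation("some", "slow", "none", lat=37.33, lon=-121.89,
                       photo_path="creek.jpg")
print(obs)
```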

Ran 3 experimental campaigns to investigate mobile app adoption for citizen science.

Experiment #1: Big international press release – listed by IBM as one of the top 5 things that were going to change the world. It’s a big worldwide thing when IBM makes a press release – 23 original news articles were generated, not counting republication in smaller venues. With all that press, they could track how many new users came from it by comparing the normal signup rate to post-article signups: +233 users.

Experiment #2: Local recruitment with campaign “snapshot day”, driven by two groups in CA and Korea. Groups used local channels, mailing lists, and flyers. +40 users

Experiment #3: Social networking campaign: launched new version of app with new feature, spent a day sending messages via FB and Twitter, guest speaker blog posts, YouTube video, really embedded social media campaign. Very successful, +254 new users.

Signups aren’t the full story – Snapshot Day generated the most data in one day. So if you want more people, go for the social media campaign, but if you want more data, just ask for more data.

Implemented sharing on Twitter and Facebook – simple updates of the kind usually seen in both systems. Tracked the sharing feature via conversions on the App Store button. Can’t link a clickthrough to an actual download, just know that they went to iTunes to look at it, but it’s a good conversion indicator. Many more visits resulted from FB than from Twitter. Conversion by social media platform was dramatically different – 2.5x more from FB versus Twitter or the web, which were pretty much the same.
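(To make the conversion comparison concrete, a back-of-the-envelope sketch: App Store clickthroughs divided by visits, per referring channel. The counts below are invented for illustration; the talk reported only the relative result, roughly 2.5x higher conversion from FB.)

```python
# Invented counts, just to illustrate the conversion comparison: App Store
# clickthroughs divided by visits, per referring channel. The talk reported
# only the relative result (~2.5x higher for Facebook).
visits       = {"facebook": 1000, "twitter": 400, "web": 600}
store_clicks = {"facebook": 125, "twitter": 20, "web": 30}

for channel in visits:
    rate = store_clicks[channel] / visits[channel]
    print(f"{channel}: {rate:.1%} clicked through to the App Store")
```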

Effects of these sharing posts over time: posts are transient, and almost all of the clicks occur in the first 2-5 hours; after that the effect is nearly negligible. Most people clicked through from posts in the morning; there are also peaks later in the evening when people check FB after work, and then the next morning they do data submission.

However, social media sharing was not that popular – only 1 in 5 wanted to use the Twitter/FB feature. Did a survey to find out why. The problem wasn’t that they didn’t know about the sharing feature; 50% just didn’t want to use it for a variety of reasons. Conversely, those uninterested in contributing data were happy to “like” Creek Watch and be affiliated with it on Facebook, but also didn’t want to clutter their FB walls with it.

The Facebook campaign was as effective as – or more effective than – a massive international news campaign from a major corporation (though the corporate affiliation may have some effect there), and much easier to conduct. Obviously there are some generalizability questions, but if you want more data, then a participation campaign would be the way to go. The sharing feature shows some promise, but it was also a lot of work for a smaller payoff. With limited resources, it would be more useful to cultivate a Facebook community than to build social media sharing into a citizen science app.