Trip Report: Workshop on Human Computation for Science & Computational Sustainability

Human Computation for Science and Computational Sustainability workshop at Neural Information Processing Systems conference, 12/7/2012, Lake Tahoe, NV.


Workshop details:

Timo Honkela also posted great summary notes, which are much briefer than what follows…


Keynote speaker: Eric Horvitz, MSR

HCOMP and AI: programmatic access to people. This generates new opportunities with applications in science, sustainability, and society.

Programmatic access brings people into machine intelligence. We can apply this to constructing case libraries for learning classifiers; debugging and refining perception and reasoning; using human expertise as components of larger systems; probing frontiers of MI competency; and addressing “open world” challenges.

Classic example is ReCAPTCHA. Using AI to optimize HCOMP – mutual construction. Coherent fusion of contributions – predictions, measures, recommendations; guiding human effort via active learning, expected info value; ideal plans for decomp & sequence of efforts; incentivizing design.

Study in citizen science to combine ML and decision making at MSR. Came out of a conversation w/ Jim Gray about 15 years ago. Focusing on consensus tasks in citizen science – classic crowdsourcing. Questions: how to fuse automated analysis with humans, whose vote counts more, etc. Ran into Lintott when Galaxy Zoo was starting. Describes how GZ works and how awesome it is, what they’ve discovered, etc. Core science is coming out of citizen science.

CrowdSynth: machine learning for fusion and task routing; learning from machine vision and votes. Combines machine & human perceptions – different sources of cognition and intelligence. GZ relies solely on people, but SDSS has applied machine vision to all of these data for about 450 features, very specific details that are completely incomprehensible to non-astronomers, e.g. “isophotal axes.” With both of these information sources, can we combine them for machine learning, prediction & action?

So layering human abilities onto machine abilities, depending on cost and ability, using them together to get the most out of the complementarity of the systems. CrowdSynth gives a task that machine figures out how to assign, humans do their thing, etc. What’s inside the box: machine features and cases w/ task and worker databases. That feeds into answer models and vote models, both of which go into the planner. The task features are for machine vision. Worker features involve experience and competency – employees can be tracked – dwell time, experience, accuracy. Vote features – distributional aspects of votes on tasks – # votes, entropy, mode class, vote of most accurate or experienced worker, etc.
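To make the vote features concrete, here is a minimal sketch of how the distributional features listed above (number of votes, entropy, mode class) could be computed for one task. The function name, the dictionary keys, and the example labels are my own assumptions, not anything shown in the talk.

```python
import math
from collections import Counter

def vote_features(votes):
    """Summarize the votes on one task: count, entropy in bits, and mode class.

    `votes` is a non-empty list of class labels (hypothetical format).
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    n = len(votes)
    # Shannon entropy of the empirical vote distribution; 0 means unanimous.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    mode_class, mode_count = counts.most_common(1)[0]
    return {
        "n_votes": n,
        "entropy": entropy,
        "mode_class": mode_class,
        "mode_share": mode_count / n,
    }

# Illustrative labels only.
features = vote_features(["spiral", "spiral", "elliptical", "spiral"])
```

Low entropy with many votes suggests the crowd has converged; high entropy suggests the task may need more votes or a more experienced worker.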

Vote models predict the next vote, and answer models predict correct answers. They used a Bayesian model selection to construct models that can predict outcomes – looks like 100+ variables. Model sensitivity to number of votes – answers require 5-9 people; votes require more like 50 people.

Soul of the system: planning goal to optimize hiring decisions to maximize expected utility. Consensus tasks are modeled as finite-horizon MDP with partial observability. Long evidential sequence problems – each incremental vote/answer is worthless until they’re aggregated at a certain point [pooled interdependence, could apply coordination theory?] The value of additional human computation was simulated – fairly complex process. It worked really well – new efficiencies and better stopping criteria to maximize accuracy. Decision theoretic methods ideally route tasks to individuals, can get better results with fewer people.
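A toy version of the hire-or-stop decision may help here. The sketch below is not CrowdSynth: it assumes a binary task, a single known worker accuracy, a utility of 1 for a correct final answer, and a fixed cost per vote (all my assumptions), and it plans over the belief about the answer with a finite lookahead horizon.

```python
def update(belief, vote, accuracy):
    """Bayes update of P(answer == 1) after one vote from a worker with the given accuracy."""
    like1 = accuracy if vote == 1 else 1 - accuracy
    like0 = 1 - accuracy if vote == 1 else accuracy
    return belief * like1 / (belief * like1 + (1 - belief) * like0)

def plan(belief, horizon, accuracy=0.7, cost=0.01):
    """Return (expected utility, action) when at most `horizon` more votes can be hired."""
    # Stopping now: report the more likely class, correct with this probability.
    stop_value = max(belief, 1 - belief)
    if horizon == 0:
        return stop_value, "stop"
    # Probability that the next vote says 1, averaged over both possible answers.
    p_vote1 = belief * accuracy + (1 - belief) * (1 - accuracy)
    hire_value = -cost
    for vote, p in ((1, p_vote1), (0, 1 - p_vote1)):
        next_belief = update(belief, vote, accuracy)
        hire_value += p * plan(next_belief, horizon - 1, accuracy, cost)[0]
    if hire_value > stop_value:
        return hire_value, "hire"
    return stop_value, "stop"

# With a belief of 0.7, looking one vote ahead says "stop",
# but looking two votes ahead says "hire".
one_step = plan(0.7, horizon=1)
two_step = plan(0.7, horizon=2)
```

This shows the long evidential sequence issue in miniature: a single extra vote cannot change the decision at a belief of 0.7, so it looks worthless on its own, while two votes together are worth paying for. The recursion is exponential in the horizon, so it only suits small horizons.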

Did real-time deployment within Zooniverse project, Planet Hunters. First test of a real-time system that’s not based on retrospective data. Moving beyond consensus tasks, the Milky Way project – people labeling images for “interestingness” – should be really interesting to see how that works. Predicting and promoting engagement – how to engage volunteers and figure out when they will disengage. Goal is designing interventions for engagement. Trying to figure out how many tasks before they disengage, want to learn about short versus long sessions and how long they disengage between sessions. Given training data, can we predict when they will disengage?

Features in engagement models. Task features include mean agreement and related metrics; session features include log-based metrics, e.g. time on task (very interesting inferences to be made); long-term historical focuses on things like total experience of tasks, # sessions, accumulated info about a person based on their participation.

Found that session and history features were more important than task features or session features alone; found average engagement of 4 minutes and 20 tasks – bite-sized participation. Also working on rich models of individuals’ abilities – things like user activity and experience to figure out accuracy. Personalizing delivery of tasks by current skill level will also let them tailor skill development if there’s a specific pattern of errors.

Examples he finds exciting to leverage in-stream activity of crowds. Ambient data (trace data) from crowd behavior in science and sustainability – like eBird SoCS project. Example of Lac Kivu earthquake in 2008. Looked at cell communications data on 6 days around the event, used it to detect anomalies as earthquake hit. The predicted epicenter was a few km from true epicenter. Inferring opportunities for assistance based on % increase in calls. Had coherent measures of uncertainty for optimizing emergency response. Could do really interesting things with intentionally collected data, not just trace data.

Example: Identify drug interactions with the crowd. The FDA Adverse Event Reporting System (AERS) exists because pharmaceutical companies do limited testing. But in the wild, they’re finding new interactions: Paxil and Pravachol don’t cause hyperglycemia individually, but they do in combination [metabolic syndrome]. Can use large-scale web logs on searches related to “pharmacovigilance” – people searching on hyperglycemic symptoms along with drug names in queries. Do disproportionality analysis of reporting ratios – observations versus expected in Venn diagram style, high statistical significance.
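For readers unfamiliar with disproportionality analysis, the core idea is an observed-to-expected ratio. The sketch below shows one common form of it, the ratio of observed co-occurrences to the count expected if drug and symptom were independent. The function name and all counts are made up for illustration; they are not figures from the talk.

```python
def reporting_ratio(n_both, n_drug, n_event, n_total):
    """Observed-to-expected ratio for a drug (or drug pair) and an event.

    n_both  : users who searched for the drug(s) AND the symptom
    n_drug  : users who searched for the drug(s)
    n_event : users who searched for the symptom
    n_total : all users in the log sample
    """
    # Count expected under independence of drug and symptom searches.
    expected = n_drug * n_event / n_total
    return n_both / expected

# Hypothetical counts: 10,000 users, 200 searched for both drugs,
# 500 searched for hyperglycemia symptoms, 30 did both.
ratio = reporting_ratio(n_both=30, n_drug=200, n_event=500, n_total=10_000)
```

A ratio well above 1 for the drug pair, but near 1 for each drug alone, is the kind of signal that points to an interaction; a real analysis would also need confidence intervals and controls for confounding.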

Final example: Crowd physics [crowd coordination] – coordination of people in space and time. Collaborations and synchronization in time and space – flash search & rescue team, transport packages between arbitrary locations quickly with low-cost hand-offs. Disrupt flow of disease from epidemiological models. Used geocoded Tweets to identify feasible coordination by distance and time, can think about how to incentivize to change those limits. Changing slack and wait time for coordination; use small-world opportunistic routing; incentives to modify graph properties.

Animated visualization of individuals’ movements based on tweets and their location change between tweets. Found a lot of tweets at airport hubs, realized they could route most packages in 3 hours that way.

Summary: learning and inference for harnessing human & machine intelligence in citizen science. Great opportunities for doing science with crowd & potential to coordinate crowd on physical tasks as well as virtual tasks.


A meta-theory of boundary detection benchmarks, X. Hou et al., Caltech & UCLA

Looking at how to find edges in images, drawing the lines on a photo. Huge variability in the detail that human labelers included, which isn’t being addressed by current computer vision benchmarks. Increasing label consistency has to do with looking at which labels are identified by everyone, and which are only detected by a few (orphan labels).

Did experiment to test the strength of line segments, whether human boundary is considered stronger by a third party, specifically the orphan labels – “false alarms” – like grasses highlighted in a photo of a pheasant. Algorithm will identify false boundaries based on that, developed a way to evaluate risk of false positives by subject number – orphan labels are about as strong as machine labels, which is a big problem.

Type I/Type II errors – believe all existing labels are correct, can always rationalize the boundaries behind it, but there are a lot of misses – labeling weak boundaries but not strong ones, while algorithms highlight everything. How to identify the ideal overlapping regions between orphan labels and machine labels? Gets a bit technical about how they approached the question, but managed to use an inference process to substantially improve boundary strength over initial noisy data. Another experiment worked in reducing risk.


Evaluating crowdsourcing participants in the absence of ground truth, R. Subramanian et al., Northeastern University, LinkedIn & Siemens

Problem setting: supervised/semi-supervised learning – ground truth exists but not available or expensive; multiple sources of annotation. Question is evaluating annotators – are they adversarial, spammers, helpful? Example uses: identify helpful/unhelpful annotators as early as possible; evaluate data collection/annotation process/mechanisms.

Using binary classification; not all annotators label all data points, and ground-truth not available. Example scenario: diagnosing coronary artery disease, much difference in cardiologists’ expert diagnosis. CAD can be diagnosed by measuring and scoring regional heart-wall motion in echocardiography; quality of diagnosis highly dependent on skill & training; increasingly common scenario.

Challenges: variability in annotators – comparative reliability between people, internal reliability by person, maliciousness. Practical questions for healthcare – how to diagnose if docs don’t agree, how to tell which docs are skilled enough.

Multi-annotator model – gets into much more technical detail with probabilities for graphical model. Read paper for more details.


Using community structure detection to rank annotators when ground truth is subjective, H. Dutta et al., Columbia University

Chronicling America project – National Endowment for the Humanities and Library of Congress project. Developed online searchable database of historically significant newspapers from NY Public Library collection from 1830 to 1922. Question of how to improve indexing and retrieval of the content. Many notable events. Describes how a historical newspaper is made searchable – scanned, metadata assigned, OCR.

Data pre-processing involved NLP and text mining, then similarity graphs for articles with edges based on cosine similarity of TF/IDF beyond a certain threshold. Choice of threshold will generate multiple ground truths, however, so subjectivity is introduced even in this automatic process.
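The threshold effect is easy to see in a small example. This is a minimal pure-Python sketch of building a similarity graph from TF-IDF vectors and a cosine threshold; the tokenization, the unsmoothed IDF formula, and the three toy headlines are my assumptions, not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_edges(docs, threshold):
    """Return index pairs of documents whose cosine similarity meets the threshold."""
    vecs = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]

# Toy headlines, already tokenized.
docs = [
    "fire destroys city hall".split(),
    "fire destroys old mill".split(),
    "senate passes tariff bill".split(),
]
loose = similarity_edges(docs, threshold=0.1)
strict = similarity_edges(docs, threshold=0.2)
```

The two fire stories are linked at the looser threshold and unlinked at the stricter one, so any communities detected downstream depend on a choice made before any annotator is involved.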

Did community structure detection using modularity maximization; that’s an NP-hard max-cut problem. Approximation techniques are therefore the best approach.
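For reference, the quantity being maximized can be written down compactly. The sketch below only scores a given partition using the standard modularity formula for an undirected, unweighted graph; it does not search for the best partition, which is the hard part the approximation techniques address. The edge-list format and the example graph are assumptions for illustration.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges     : list of (u, v) pairs, at least one edge, no duplicates
    community : dict mapping each node to a community label
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # total degree of the community's nodes
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Fraction of internal edges minus the fraction expected at random.
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Splitting the graph at the bridge scores higher than putting every node in one community, which scores exactly zero.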


Crowdsourcing citizen science data quality with a human-computer learning network, A. Wiggins et al., DataONE, Cornell Lab of Ornithology



Human computation for combinatorial materials discovery, R. Le Bras et al., Cornell University

Goal is developing new fuel cells – current electrocatalyst is platinum but still not great and way too expensive. Process for finding alternatives is rather technical. Using CHESS to identify the resulting crystal structures, resulting in graphs of the underlying lattice on the silicon wafers. The experiments can be run for about 2 weeks a year, costs $1M/day to use CHESS.

Satisfiability Modulo Theories (SMT) approach. It all gets very technical.

UDiscoverIt UI: download a client, several complex data displays for pattern identification. Select slices, look for patterns in the heatmap of “Q-values”. Including user input speeds up the process by 1-2 orders of magnitude, even though people are involved in only a minimal part of the process: they provide useful information about the structure of the problem, which reduces the search space for the SMT solver and improves overall performance. Lots more to do if they’re going to get this to implementation, including getting it to a point where they can run it on AMT.


Dynamic Bayesian combination of multiple imperfect classifiers & An information theoretic approach to managing multiple decision makers, E. Simpson et al., Oxford University, University of Southampton & Zooniverse

Human computation: people are unreliable; they either learn or get bored. How to optimize the crowd, maximize scale, and maintain accuracy? Zooniverse is their example domain – people do pattern recognition tasks and label the objects. The problems they’re addressing involve tracking worker reliability, combining classifications, and mixing their computational agents where possible to assist workers and scale up.

Probabilistic model of changing worker behavior – treats artificial and human agents as base classifiers, with conditionally independent responses given the type of object. Uses Bayesian approach to combining the decisions – dynamic extension of IBCC (Independent Bayesian Classifier Combination) incorporating prior info like known expertise. Faster version with Variational Bayes, semi-supervised and dealing with limited training data.

Technical details with Dirichlet distributions and the like.
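As a rough intuition for the combination step, here is a heavily simplified, static sketch: each agent gets a confusion matrix whose rows are summarized by Dirichlet pseudo-counts, and responses are combined assuming independence given the true class. It plugs in posterior-mean confusion matrices rather than doing the full Bayesian or variational inference, and it has no dynamics, so it is not the authors' model. All names and numbers are hypothetical.

```python
def expected_confusion(alpha):
    """Posterior-mean confusion matrix from Dirichlet pseudo-counts.

    alpha[j][l] is the pseudo-count for responding l when the true class is j.
    """
    return [[a / sum(row) for a in row] for row in alpha]

def combine(responses, alphas, class_prior):
    """P(true class | responses), with responses independent given the true class."""
    scores = list(class_prior)
    for agent, label in responses.items():
        pi = expected_confusion(alphas[agent])
        for j in range(len(scores)):
            scores[j] *= pi[j][label]
    total = sum(scores)
    return [s / total for s in scores]

def observe(alphas, agent, true_class, label):
    """Update an agent's pseudo-counts after a response on an object with known class."""
    alphas[agent][true_class][label] += 1

# Prior pseudo-counts encode assumed expertise: rows are true class, columns are responses.
alphas = {
    "expert": [[9, 1], [1, 9]],
    "novice": [[6, 4], [4, 6]],
    "bot": [[8, 2], [3, 7]],
}
posterior = combine({"expert": 1, "novice": 0, "bot": 1}, alphas, class_prior=[0.5, 0.5])
```

The novice's dissenting vote barely moves the result because their confusion matrix is close to uninformative, which is the basic reason to track per-agent reliability instead of counting votes equally.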


Building the Visipedia Field Guide to NA Birds, Serge Belongie, UCSD

Working with CLO; this is a status update on a bird recognition system. Visipedia is a visual counterpart to Wikipedia. Subordinate categories for recognition. Similar in some ways to Leafsnap. Lots of crowdsourcing involved at multiple steps; breaking down something really complicated into smaller easy tasks. Oh, wait – it’s Merlin!

Processing images – MTurk – labeling attributes. MTurkers liked it so much that they complained when they took it down! There are no accounts or scores or anything – the reward for finishing the task is getting another task! People like pretty flying things.

“Taster” sets – bitter vs sweet – pretty colorful birds like scarlet tanagers, vs “little brown jobs”. Need to give people pretty things sometimes. Setting up as a visual 20 questions. Current iPad app – 200 species, pick a bird. Film strip on the left, sorted the 200 species in order of likelihood of match to a photo taken by the user. Trying to find the bird parts that people click, heatmapping where the computer thinks the body parts are.