ACM Conference on Computer Supported Cooperative Work and Social Computing
26 February, 2013
San Antonio, TX
Tammy Waterhouse – Pay by the Bit: Information-theoretic metric for collective human judgment
Collective human judgment: using people to answer well-posed objective questions [RIGHT/WRONG]. Collective human computation in this context – related questions grouped into tasks, e.g. birthdays of each Texan legislator.
Gave example of Galaxy Zoo. Issues of measuring human computation performance. Fast? Encourages poor quality. Better? Percent correct isn’t always useful/meaningful.
Using info entropy – self-information of random outcome (surprise associated w/ outcome); entropy of random variable is its expected information. Resolving collective judgment – model uses Bayesian techniques. Then looked at entropy remaining after conditional information – conditional entropy. Used data from Galaxy Zoo to look at question scheduling; new approach improved overall performance.
Shih-Wen Huang – Enhancing reliability using peer consistency evaluation in human computation
Human computation not reliable – when tested, many people couldn’t count the nouns in 15-word list. Without quality control, they have 70% accuracy. Believes quality control is most important thing in human computation.
Gold standard evaluation: objectively determined correct answer [notably, not always possible]. Favored by researchers but not scalable because gold standard answers are costly to generate.
Peer consistency in GWAP: sometimes use inter-player consistency to reward/score. Mechanism significantly improves outcomes. Using peer consistency evaluation as scalable mechanism – can it work? Used AMT to test it. Concludes peer consistency is scalable and effective for quality control.
Derek Hansen – Quality Control Mechanisms for Crowdsourcing: Peer Review, Arbitration, & Expertise at FamilySearch Indexing
FamilySearch Index is one of largest crowdsourcing projects around. Volunteers transcribe old records – 400K contributors.
Looked at several models to improve efficiency while reducing added time. Use a downloaded package to do tasks, can use keystroke logging with idle time to evaluate task efficiency. Comparing arbitration process with a simple review. A-B agreement by form field varied. Experienced contributors had improved agreement.
Implications: retention is important – experienced workers faster, more accurate; encourages novices and experts to do more; contextualized knowledge, specialized skills needed for some tasks. Tension between recruitment and retention with crowdsourcing – assumption that more people makes up for losing an experienced person, which is not always true. In this context it would take 4 new recruits to replace 1 experienced volunteer.
Findings: no need for a second round of review/arbitration – only slight reduction of error and arbitration adds more time (than it’s really worth).
Implications: peer review has considerable efficiency gains, nearly as good quality as arbitration process. Can prime reviewers to find errors, highlight potential problems (e.g., flagging), etc. Integrate human and algorithmic transcription – use algorithms on easy fields integrated with human reviews.