9/12/12 USGS Community Data Integration Citizen Science workshop
Ice Core Lab, Federal Center, Denver, CO
Data Management session
Austin Mast, Florida State University, iDigBio
Public Participation in the Digitization of Biodiversity Collections
iDigBio is a national resource for advancing digitization of biodiversity collections. Key objectives include digitizing data from all US biological collections, large and small, and integrating these in a web-accessible interface using shared standards and formats – huge challenges b/c estimates suggest about 1 billion specimens in bio collections at thousands of institutions, and 90% are not accessible online. Community collaboration and technology are clearly important to this. Only one mention of public/citizen in the document – may contribute to digitization workforce.
Public participation is going to be necessary to accomplish this in 10 years. Produced strategic plan in 2010 from 2 workshops, and that yielded an NSF program (ADBC), which is a collaboration between Bio and Geo directorates to fund digitization by thematic collections networks (TCNs) on RQs, and a national hub to coordinate. Currently have 7 TCNs, example of New England Vascular Plant Specimen Data to Track Environmental Changes project. Goal is 3 centuries of data with about 1.3M specimens and images from herbaria in New England. Focus of other TCNs includes lichens, plants/herbivores/parasitoids, arthropods, macrofungi, vascular plants, fossils, and integrative platform development. Geo distribution of 130 institutions currently involved in these groups – all across the country.
Goals of iDigBio are enabling digitization, providing portal access in a cloud environment, engaging users in research and outreach, and planning for long-term sustainability. Initially 5 years, $10M; if all goes well, possibility of another 5 years. Project is not going away soon, which means they can collaborate more intensely. There is a PPSR-like WG in iDigBio that is working on engaging the public in digitization at earlier stages than just data use. It cuts across their components of digitization, CI, research, education & outreach.
Models for digitization processes – getting data to distribution points, classifying the different workflows into three models. Have since focused on types of specimens being digitized and how that intersects with workflows. Hope that public can be engaged in specimen curation & imaging, text transcription and specimen description from images, & georeferencing. They do specimen curation w/ about 3 volunteers per semester, requires on-site presence. The other two tasks can be done in distributed format.
Examples of text transcription projects include the Apiary project: volunteers ID regions of interest in the image (drawing a rectangle around the specimen label and calling it a label), then those regions get fed to volunteers who either correct the OCR or do straight-up transcription, and who categorize the elements, e.g. who collected it and where. Specimen descriptions from images are also realistic, e.g. in classrooms. Games are another way to do it, with example of Forgotten Island.
Georeferencing tasks have been done with students for records for their own counties due to familiarity. They use GEOLocate to annotate map layers, very successful so far. Often observations are collected repeatedly in the same spot, don’t want to georeference over and over. Volunteers use zooming tools to specify precision of locations based on records.
Challenge: Provide opportunities for not just contributory participation, but also collaborative and co-created participation. Big challenge but worth taking on. Example of CitSci.org: what CI is needed for this community to build a historical dataset of relevant specimens from Milwaukee by digitizing target specimens from across US collections? And what is necessary for sharing the info to CitSci.org as it is constructed? Want to allow comparison of current data to historical data.
What CI is needed for a student to gain recognition for volunteer hours participating in this kind of science? What is necessary for her to gain service learning credit, e.g. for President’s service award?
iDigBio has a special role it can play – the cloud-based strategies are the resting place for the data, thinking about it earlier on in the workflow to make it a source for content that needs to be worked upon.
Mike Fienen, USGS Wisconsin Water Science Center
Social.Water and CrowdHydrology
Audubon Christmas Bird Count started as response to shotgun ornithology. Participation has grown tremendously over 100 years, and the data are meaningfully used in science. He runs a groundwater monitoring network with over 100 wells regularly monitored, half of them by citizen observers, and it has been going on for a really long time – they used to submit data on postcards, now by email.
Harnessing crowd for scientific data and analysis – his ideas sparked by wildlifecrossing.net which uses photos of roadkill to figure out where animals cross roads. Found CreekWatch and thought they did it already, but not exactly, and they didn’t want to rely on smartphones due to a number of issues and wanted low barrier of entry for both themselves and the participants. Only need to send a text message to participate in CrowdHydrology project.
Guy at Buffalo set up a little infrastructure for 10 sites he needed to monitor: put a note on the sites asking people to text him the water levels, which he manually put into a spreadsheet. Mike helped develop a software stack for this using Google Voice, IMAP and Python, using stuff that’s already there with standard email protocols. People can text a local number with Google Voice; the system automatically logs into the account, checks the data, parses them into a CSV and plots the data in almost realtime on the web. Important for this to be open source to share – not building real infrastructure but building on existing data.
Basic signs included no attempt to tell people how to format messages – afraid if they said to type in station number and value, they would type in exactly that, weren’t sure how it would work for interpretation. They’ve since improved the signs on the gauges. There are also shore signs with more detail that they worked pretty carefully on making sure it was understandable. Worked to make sure there was no language of art – picture of guy “find the ruler” to measure “water height” (not water level, that confused people.) Sign tells people that they can see the data point within minutes, so people can look at data almost right away to see their own data points and start looking at trends.
Cost of generality: “fuzzy wuzzy” – texts came in many formats, some very basic, others very descriptive. Got a few messages where the station prefix came through not as NY but as By or My, due to iPhone autocorrect and fat fingers hitting adjacent keys. Why no typos on y? Found research on favorable/unfavorable accuracy so now they understand typos better.
They use regex to trim out irrelevant info after checking that they contain some of the keywords, using FuzzyWuzzy open source code to make it a lot easier, made by a ticket scalper, found in 2 minutes online, easily implemented. They now use better regex to find the value after identifying station number. Shockingly sophisticated database (csv file) with four fields – date/time has some drawbacks in case data submitted later. Data integrity something of an issue – incorrect observations quite obvious but not removed from data for several reasons. Validated the station with transducer data, verified that the American public can read a ruler. Records for precipitation incorporated, saw major rainfall event lining up with increasing set of measurements.
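A minimal sketch of the parsing idea described above, not the Social.Water code itself: find a station-like token, fuzzy-match it against known stations, then pull the first number that follows. Assumptions: the standard-library `difflib` stands in for FuzzyWuzzy, the station list is hypothetical apart from NY1000 (the only site named in these notes), and the 0.8 similarity cutoff is an arbitrary choice.

```python
import re
from difflib import get_close_matches

# Hypothetical station list; only NY1000 is named in these notes.
KNOWN_STATIONS = ["NY1000", "NY1001", "NY1002"]

# Two letters, an optional space, then four digits (assumed ID shape).
STATION_TOKEN = re.compile(r"\b([A-Za-z]{2}\s?\d{4})\b")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_message(text):
    """Return (station, value), or None if either part cannot be found."""
    token = STATION_TOKEN.search(text)
    if token is None:
        return None
    candidate = token.group(1).replace(" ", "").upper()
    # Fuzzy step: accept near misses such as "MY1000" for "NY1000".
    matches = get_close_matches(candidate, KNOWN_STATIONS, n=1, cutoff=0.8)
    if not matches:
        return None
    # Look for the reading only in the text after the station token.
    number = NUMBER.search(text[token.end():])
    if number is None:
        return None
    return matches[0], float(number.group())
```

For example, `parse_message("My1000 water height is 2.45 ft")` recovers the autocorrected prefix and returns the station with its reading, while a message with no station-like token returns `None` and could be set aside for manual review.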
200 values at site NY1000, but max of 8 observations at other sites. Why? People go there to check out beavers, and it’s near a nature center so they’re primed to participate. Other locations are trout fishing holes that get a lot of visits, had one person at a bait shop ask about whether anyone participates – said wife walks by every day and wondered if anyone contributed data, but has never done so herself!
Collaborating with social scientist to get a grip on social aspects, found Trout Lake LTER to work on this together, they’re already doing surveys and such that complement his skills. Future plans also include looking at lakes and streams in the glacial aquifer system. Other plans – publishing papers, Social.Water code available on github.
Handling the data – using USGS cycle, have really focused on first four steps, haven’t found good ways to validate all data points, have focused on validating process for now. PII considerations play into storing contact info, and because of that the data doesn’t live at USGS but instead at a university. Found another group already planning to do it, so if they had waited for bureaucratic approvals, they would have been behind. Asking about value of recruiting trained observers, funding level from inception to two papers was zero, did it on his own time. Were criticized in paper for not having town meetings and training people, but that’s not free!
Crowdsourcing hydrologic info may be secondary source of data – $100 per site investment rather than $20K instruments – but is a primary source of public engagement.
Tim Kern, USGS Fort Collins Science Center
Framework for Public Participation GIS: Options for Federal Agencies
Working with a variety of DOI agencies to help them engage public in routine monitoring efforts. Need is obvious, resources keep decreasing, responsibility keeps increasing, and office turnover means that organizational memory is faltering. Impediments are PRA, PII, technology policies – especially TOS reviews and security and purchasing blocks, and data integrity requirements. The fact that we’ve seen so many examples of work-arounds tells us there are big issues.
Have put together framework for implementation within policy. PRA limits inputs to general comments on an identified area. PII/PIA is handled with hashing personal info. Tech policy means writing for mobile web and using approved social media APIs. Data integrity requirements – they’re working toward developing metadata and starting to connect to USGS publication process. Really do need ombudsperson to enable the work.
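The notes say only that PII is handled by hashing personal info. One possible way to do that is sketched below, assuming a keyed hash (HMAC-SHA256) so that identifiers cannot be recovered by simply hashing guessed values. The function name `pseudonymize`, the normalization step and the key handling are illustrative assumptions, not the framework's actual implementation.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a personal identifier with a stable keyed hash (hex digest)."""
    # Normalize so trivially different spellings map to the same pseudonym.
    normalized = identifier.strip().lower().encode("utf-8")
    # Keyed hash: without the secret key, guessing inputs does not help.
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
```

The same contributor always maps to the same pseudonym under a given key, so comments can still be grouped by contributor without storing the contact info itself.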
Their workflow is being implemented across several agencies. Workflow includes elements such as: study data, catalog and repository, advertise study, block for social media/custom web/mobile option, publication, & reporting. Repository for the data, building the pipes for getting the data from those sources so others don’t need to do it.
Starting off with developing study metadata. Then working with systems built around secure enterprise repo – currently using ScienceBase with agency-specific portals, and point people to other repos like DataBasin. ScienceBase has full DOI security, provides spaces for multiple projects/studies within projects, etc. Study area is loaded into repo, example of complex map data, gives view for public comment and interaction. System redelivers in web services, so there are other service endpoints.
Needed to make sure agencies could work w/in their own agency contexts and systems. They can put their products through an approval workflow w/in USFS, for example, gives a lot of flexibility in developing good materials. Once data is in repo, it can go to client device for comment.
Data collection marketing is critical – it’s not a build-it-and-they-will-come thing. Outreach guidelines include community meetings, local media, social media, agency websites. Also need to advertise access options – hashtag publication, site URL, app download. Borrowed from USGSted for incorporating data from Twitter, ripped off ideas from others w/in agencies to build something usable and useful.
Helped write clients for public data input options – multiple screen sizes, limitations of mobile views, need native app for full capability – especially important for offline data collection. That’s a problem because right now USGS doesn’t allow it, but USDA has approved Apple Store, so they can send the software to USFS because USFS are permitted to use it and distribute it. Automated input processing – Twitter term harvest, location inference, user metadata obfuscation. Web/mobile automation involves comment parsing, logic/term identification, comment screening, image metadata scraping, location and spatial data capture. A lot of high schools have blocked social media, but they can access USGS, and students have found this and are using it to do their Facebook updates.
Data management and distribution: Working on data publication with metadata development and approval workflow. For data discovery and distribution – search, preview, feedback/analytics, download, multiple service endpoints back to the data. So the data is getting out there.
Derek Masaki, USGS CSAS, Eco-Science Synthesis, National Geospatial Program
Vision for Data Management in Support of Future Science
Data coordinator with BISON project, looking to increase his participation in citizen science. Send a species observation to @speciesobs or email@example.com.
TechCrunch wants to disrupt government IT priorities. Strategy points: open data will be new default; anywhere, anytime, any device; everything should be an API; make government data social; change meaning of social participation. Instead of treating people as though they’re going to muck up the data, start thinking about leveraging those resources more productively.
Embrace change, explore disruptive technology! Need to fight misconceptions – social/volunteered info is inferior: lots of evidence that this is not true, e.g. comparison of Britannica and Wikipedia, Google bought Frommer’s for $20M but Yelp market cap $1B – all crowdsourced info, cognitive surplus that we have because we’re a privileged society. Online info can’t be trusted: obviously not so w/ Wikipedia studies. Who would read a book on an iPhone? We want to do everything on our smartphones!
What to do about open data: stop hoarding data! The crowd will make it a superior product and find better uses. E.g. GMaps base data source of TIGER US Census Bureau data, improved through millions of edits, ground truthing, image algorithms – USGS is great at scientific data, but bad at enterprise software development, so stop doing it! Push data out, let others figure out what to do with it and add value.
Changing social participation: empower the crowd! Power to the people – how many scientists in USGS? 10K employees in 400 locations, compared to 55M K-12 students in 100K schools.
Embracing change with disruptive technology – handling massive participation through volunteer county network. Pick 3K US Counties, 25 schools per county, each school posting four records per week – every month it would generate 1.2M data points! People really do want to do this, so let’s involve them.
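The arithmetic behind the 1.2M figure can be checked directly. This is a sketch assuming a four-week month, which is the assumption that reproduces the number in the notes.

```python
counties = 3_000
schools_per_county = 25
records_per_school_per_week = 4
weeks_per_month = 4  # assumption: a four-week month

# 3,000 counties x 25 schools x 4 records = records per week
weekly_records = counties * schools_per_county * records_per_school_per_week
monthly_records = weekly_records * weeks_per_month

print(f"{weekly_records:,} per week, {monthly_records:,} per month")
```

That is 75K participating schools producing 300K records per week, or 1.2M per month.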
Data system framework – source data to dataset registry, then to workflow processing for derived data index, curated data, and then data delivery through HTML, REST, SOAP service APIs (could push to SciencePipes!) Does his software development in his free time and isn’t great at it, wants others who are good at it to make it good stuff and make it scalable and usable.
Citizen science is about people – need to invest in capacity, technical training, building monitoring networks (Marine Survey, Kauai Algae Restoration, Coastal Dune Restoration in Maui). Thinks kids are best resource – teach kids how to do quality data and monitoring, and they will keep doing it throughout their lives. This will help address constantly decreasing budgets.
Think big, consider data management and human management. Can coordinate volunteer network of 10K, conduct biodiversity survey of every US County, generate 1M obs records next year, develop mobile bio observation form standard, implement standards and generate resources.
Technology & Tools session
Jen Hammock Encyclopedia of Life (WebEx)
The Encyclopedia of Life as a Source of Materials and a Venue for Showing off Your Work
Intro to EOL – open access materials aggregated from many sources in a public venue, with context and provenance, plus species-level info and also higher taxa. What they don’t do is archival storage, or specimen-level or observation-level info.
Diveboard – program using info from EOL to let divers create dive lists! Intro to Scratchpad – difficult learning curve but powerful. You can enter your own info for any species, load up taxonomic structures from EOL, etc. Working with other API users who need image, video, and text content.
Other projects they’re pushing out include educational tools and apps – Field Guides to customize your own field guide as desired – potentially very helpful for many projects. Any export of content is public domain info and CC info, which does come with license requirements; there is no ND content in their collection, so you can actually do what you like with the images so long as you give attribution.
EOL has a lot of resources to use the content. Next set of functionality is posting content to pages – CC still applies, Flickr pics in EOL groups. Quality of contributed images is pretty high, Flickr group is most prolific content source. Many sources of images being shared, iNaturalist just got added and others in the works, Morphbank is another one for scientific images – good bulk upload tool, properly licensed content automatically goes to EOL. Videos from YouTube and Vimeo as well, working on SoundCloud for audio records. Not much direct upload of media, but other platforms are better at it so they haven’t rebuilt tools that are already working quite well, just connecting to them instead.
A similar content partner that isn’t biodiversity data is Encyclopedia of Earth (eoearth.org), they are cross-linking taxa to habitats across the two platforms, so river info currently in EOL and linked to all species that occur in that area (Amur River Benthopelagic Habitat).
Review status of content varies by partner: iNaturalist info has already been verified and comes in as trusted, but curators can change the status. Curators’ favorite activity is looking at unreviewed content, so depending on the project they ask for a judgment call as to whether the data come in as either trusted or unreviewed.
Jessica Zelt, USGS Patuxent
The North American Bird Phenology Program: Reviving a Historic Program in the Digital Age
Program started w/ Wells Woodbridge Cooke, really into bird migration and made it a research project in 1881, got friends to record arrival/departure dates in their areas. Started w/ 20 participants, but AOU was founded and network let them grow to about 3K observers. Program ran successfully for 90 years, never had a name, Chandler Robbins closed it in 1970 to focus on NA BBS. Many original observers were some of the most notable naturalists and ornithologists of their time, but also ordinary citizens – diverse contributors. Collected 6M records, everything that was known about bird migrations at the time, including publications, breeding and nest records, records of extinct and exotic species. Contributed to AOU checklists and first field guides.
Records stored in 52 filing cabinets in leaky attics and basements, then offsite storage facilities, the records got forgotten. Chan prevented the destruction of the records, many were actually stored in his house. In 2009 funding was acquired to hire a coordinator and scanner to revive the program because it could be used for tracking climate change as baseline data.
Cards were scanned for a full year before they went live with a website and data entry page. Goals were to curate, organize, prioritize data records, scan and key cards w/ QA, create digital format for data, generate and develop network of volunteers to do record transcriptions, automated transcription verification system, etc. Very ambitious.
Workflow is scanning of images to PDF, retained as raw records, images then sent to website, people signup and watch 15 minute video and then they can start transcribing. Card formats differed across program coordinators – originals were hand transcribed to notecards from mailed records. More formalized data formats were created later that didn’t require manual transcription. Cards look different but contain same info.
Constantly refined data entry interface, most important part for motivating contributors is “My Stats” bar that shows transcription numbers – for session, individual, all. Database structure is pretty simple overall. Pretty cool system for transcription verification – if first two transcribers don’t match, third transcriber takes a shot; if still no match, then it goes to “rectification system”. Calling it a “validate-o-rama”.
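The verification rule described above can be sketched as follows. Assumptions: entries are compared after whitespace and case normalization, and a third transcription that matches either of the first two counts as validated. The notes do not spell out either detail, and the status names are made up for illustration.

```python
def normalize(entry):
    """Collapse whitespace and case so trivial differences do not block a match."""
    return " ".join(entry.split()).lower()


def verify(transcriptions):
    """Return (status, value) for up to three transcriptions of one card."""
    cleaned = [normalize(t) for t in transcriptions[:3]]
    if len(cleaned) < 2:
        return ("needs_more", None)
    first, second = cleaned[0], cleaned[1]
    if first == second:
        return ("validated", first)
    if len(cleaned) < 3:
        # First two disagree: a third transcriber takes a shot.
        return ("needs_third", None)
    third = cleaned[2]
    if third in (first, second):
        return ("validated", third)
    # Still no match: hand the card to the rectification system.
    return ("rectification", None)
```

For example, `verify(["a", "b", "b"])` is validated on the third pass, while three mutually different entries are routed to rectification.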
Volunteer recruitment: tried lots of strategies, got most people from a press release that was picked up by a number of content-related venues plus ABC. Volunteers are from all over the world, though 80% in US. Retention strategies: let people choose own level of involvement, establish a community by interacting, give a reason to donate time, allow them to feel needed, keep lines of communication open proactively, satisfaction survey, allow contributions of suggestions and improvements anytime, and recognize each volunteer for their work. Monthly newsletter to give feedback, announcements, news, a look ahead at what’s coming, plus a volunteer of the month who’s been a strong contributor (they write their own profile). Also picks observer of the month, does write-up about them, and that helps create a connection across time.
Includes a trivia question that isn’t a quick Google, has been really popular and usually gets an answer very quickly. Has also supported competition with leaderboards (typical long tail distribution) and names shown as handles. Very popular, have had to add/update “top” transcriber lists because people are so into it. Woman in 1st place has transcribed 150K cards, but takes time off for bird banding – and sometimes does transcription during banding!
Clear annual patterns of transcription on monthly basis. So far have scanned over 1M records, have transcribed about 700K and have 120 participants. All data released on the website, soon releasing about 160K validated records, making headway with publications as well. System is being repurposed for crowdsourcing in museum collections, citizen science, other large datasets. Other future goals include depositing in repositories, either start collecting migration dates or merge with another program. Also starting to work on “stomach content cards” but it’s a lot more complicated.
Derek Masaki, Sam Droege, USGS
Mobile Application use in the 2012 Baltimore Cricket Crawl
Twitter-based citizen science observation platform, initially in NYC, then Baltimore/DC – very successful with mobile platform, soon to be replicated in Hawaii. Web-based viz tool with citizen science info from Twitter client & API, Twitter-based submission protocol, scripts mine Twitter stream API, then output mapping, tabular data.
Ran project as an event, could submit data with Tweets, email, SMS, voice, got lots of media coverage. Only 8 species to look for, one minute survey, wherever you want. Send results right from the site, HQ processed the info as it came in and then mapped it in real time on Discover Life and U of Hawaii. Got 400 sites surveyed by 300 individuals, still getting submissions for phenology data to see when cricket calls fall off. About 1800 species observations, most went out in small groups, 75% of data came in email, mostly from mobile, and 10% from Twitter so that was worthwhile.
Data sheet simple – shortcodes for the 8 species w/ space for counts, name/date/start time/location. Tweets pop up on the map in the correct location with optional image/sound files attached. Used to have problems with audio filtering that removed high audio range and made it impossible to use smartphones for recording crickets, but now it’s no problem and they use a free app to collect and upload the audio clips. Future goals – phenology program regionally; national coordination of volunteer monitoring of singing insects; reuse technology for frogs and birds; Hawaii K-12 Critter Crawl in October, and USGS support for social media/mobile. Reuse of tech works well because frogs stop calling right about when crickets start up. Currently operating through Google Maps and Gmail, hosted by U Hawaii because they can’t run it off USGS as yet. Eventually want to do some audio vouchers on it, and finding that observers are accurate through ground truthing and they don’t report questionable IDs.
Findings – these species haven’t gotten attention in a century b/c scientists didn’t think it was that interesting. Wanted to learn if certain katydid had been extirpated, turned out that there were 7 species they thought were extirpated but were actually still around. Jessica did it, found it easy, and liked that it used systems that people were already familiar with – plus “cool factor.” Can also verify with waveforms. Challenging if multiple things are calling at once.
John Pickering, Discover Life
Discover Life – Networking Study Sites to Predict the Impact of Climate Change and Other Factors on Species and Their Interactions
Discover Life – big site, lots of traffic, sunsetting plan is to have project adopted by federal agency or NGO. Mothing project started by Sam Droege, they take photos of moths early in the morning. Interested in climate effects on moths, indicators of air quality, etc. Go outside to porch light at 4 AM, photograph every moth you see, upload to the site, document where you took them, then the crowd comes in and IDs them. About 500 people take moth photos regularly and upload them, but asking them to do it at 4 AM every day takes a special person. These data could be collected by student teams.
Great story about student who saw a rare ladybug and became local expert on moths in county through internship, was uninterested in nature and now on a STEM career path. Reporting data requires taking photo of GPS/camera timestamp and one of cell phone to check time offset errors. Takes photos of the moths with rulers, also frogs if they’re around, and then volunteers start assigning names. ID’ing moths starts w/ location and time of year (much like birds) to narrow down the number of species.
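A sketch of how a time offset check like this could work, assuming the reference photo shows a clock with the true time (e.g. the cell phone screen) and the camera stamps that photo with its own, possibly wrong, time. How Discover Life actually applies the correction is not stated in these notes, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def clock_offset(camera_time, reference_time):
    """Offset to add to camera timestamps so they agree with the reference clock."""
    return reference_time - camera_time


def correct(camera_time, offset):
    """Apply the offset to the timestamp of any other photo from the same camera."""
    return camera_time + offset


# Example: the camera stamped the reference photo 04:00:00,
# but the phone in that photo read 04:03:30.
offset = clock_offset(datetime(2012, 9, 12, 4, 0, 0),
                      datetime(2012, 9, 12, 4, 3, 30))
moth_photo_time = correct(datetime(2012, 9, 12, 4, 10, 0), offset)
```

One reference photo per session is enough as long as the camera clock does not drift much during the session.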
Series of tools to do identification – progression albums, mapper, customized guides. Start w/ shape at rest, further reduces the search space, etc. Very simple series of characters lets you narrow it down pretty quickly, but some are horrible to identify, so they just don’t even try with those.
Can also do mashups with other data sources, harvested on the fly and no local caching. Happy Moths game, going mobile with optimized mobile browser version rather than app, done in HTML5 to get location (etc) automatically. So far have about 120K images, 90% ID’d to species, 5% to genus, other 5% gotta deal with later.
Isla Young, Maui Economic Development Board
Women in Technology/STEMworks- Mobilizing K-12 Student Scientists
Hawaii is a very expensive place to live, so many people have multiple jobs and most students have parents working a lot. Goal is to develop a stronger economy, especially in tech, to make it a more sustainable economy for residents. There are a few high-tech companies and she has to fly between islands to visit.
Living wage in Maui County: a single mom with one or two kids would need to earn over $50K a year for bare minimum survival. The people at high tech companies, who make a lot more money, are there. Cashiers make only about $20K/year. All the talent is being imported and they neither reflect nor are invested in their community.
Working to develop a home grown workforce as key to growth, focusing on women in technology program launched in 2000. Goal is encouraging girls/women, Native Hawaiians and other underrepresented groups to pursue STEM education and careers – which means pretty much everybody on the islands, because there’s lots of cross-breeding so to speak. Want to build resident technical workforce in the state as a pipeline, starting in elementary school. Works with relationships with high tech companies to help students get internships and jobs. About 21K students/year in the program, about 450 teachers, summer camps, mentoring, afterschool programs, much more – very broad set of approaches to getting people involved. Found that kids are much faster to get the technology than teachers, so they teach teachers to just get out of the way and let the students figure it out.
All activities align science with culture. STEMworks program is first STEM/service-learning program in Hawaii, project and service based, software training from industry professionals with advanced tech tools, high tech industry connections, and hands-on, real-world internships. They work with GIS, CAD, game design, web design, digital media, 3D viz, cybersecurity. So far involving 17 schools statewide and 1K middle and high school students.
Key elements of STEMworks are the active involvement, community values and stewardship; they get to use high-end software when they often don’t have computers at home, working to create critical thinkers who can be self-directed learners, and they learn to collaborate in teams with a technical advisor/mentor from their community. Teachers are not teachers, they are facilitators who help guide students to find their own solutions. Already generating cool projects, citizen science focus is an exciting shift as they move forward. They have to ID the problem/opportunity in their community, design project, test solution, develop partnerships, deliver on their project, and maintain the partnership – all key skills for getting students into the workforce.
Works with ESRI for state-wide licenses to manage access to geotech tools, Google Sketchup as local authority – over 215 schools requested the GIS software and over 200 teachers trained in it. Have a STEM conference every year – with an astronomy star party, software competitions, program showcase. One of the projects is “Island Energy Inquiry”, realtime clean energy monitoring that has curriculum to go with it for classroom use, partnered with energy companies to get this going.
Citizen science in HI: only single district school system in the country! Over 400 schools and 170K students throughout state, going to use trust network and partnerships, connecting scientists and educators, create relevant and meaningful experiences through civic engagement at a young age, along with mentoring and leadership skill building. The standardized testing requirements don’t focus on science, the teachers only have to spend 1/2 hr/wk on science so if they’re uncomfortable with it they minimize science in the classroom. Seeing citizen science as an accessible way to get them into science in a different way that’s more approachable.
Starting w/ Hawaii Cricket, Coqui and Gecko Crawl – species relevant to their community. Worked on initial planning and teacher workshop, partnerships w/ USGS and several other groups. Integrating tools including smartphones, iPad, computers, GPS, and using social media, students in their programs do have access to some of these tools.