Research Article

When is crowdsourced data most useful for research?

In crowdsourced data collection, researchers turn to internet communities to answer research, survey or feedback questions. The approach is gaining popularity because it is convenient, inexpensive and relatively quick. Understanding how it compares with more comprehensive, traditional approaches is important for guiding future research.

That’s why researchers compared crowdsourced data from Amazon Mechanical Turk, an internet marketplace where researchers can post questions to collect data, with the National Adult Tobacco Survey (NATS), a comprehensive survey from the U.S. Centers for Disease Control and Prevention.

Specifically, researchers crowdsourced demographic, tobacco perception and tobacco warning label exposure data from nearly 4,000 young adults ages 18-30, and compared their data with NATS data after using a statistical technique to make the samples comparable in gender, race/ethnicity, educational attainment and age.
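
The article does not name the statistical technique the researchers used. As a minimal sketch of one common way to make two samples comparable, the example below applies post-stratification weighting, which is an assumption here and not necessarily the study's method. The age groups, target shares and responses are hypothetical toy values, not study data.

```python
from collections import Counter

def poststratification_weights(sample_groups, target_shares):
    """Return one weight per respondent so that the weighted share of
    each group in the sample matches its share in the target population."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    # weight = target share of the group / observed share of the group
    return [target_shares[g] / (counts[g] / n) for g in sample_groups]

def weighted_share(flags, weights):
    """Weighted proportion of respondents whose flag is True."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)

# Hypothetical toy sample: 6 respondents aged 18-24 and 4 aged 25-30.
sample_groups = ["18-24"] * 6 + ["25-30"] * 4
uses_product = [True, True, True, False, False, False,
                True, False, False, False]

# Hypothetical shares of each age group in the reference survey.
target_shares = {"18-24": 0.4, "25-30": 0.6}

weights = poststratification_weights(sample_groups, target_shares)
raw = sum(uses_product) / len(uses_product)
adjusted = weighted_share(uses_product, weights)

print(f"raw share: {raw:.2f}, weighted share: {adjusted:.2f}")
```

In this toy case the younger group is over-represented and uses the product more often, so weighting pulls the estimate down. Weighting of this kind only aligns the variables it is given; it cannot correct for differences in who chooses to respond.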

The results, published in Preventive Medicine, reveal some important insights into the research method.

Different surveys, different results

The two surveys yielded different results in several areas, including tobacco use, demographics, perceptions of harm and warning label exposure. The crowdsourced sample included significantly more non-cigarette tobacco users, particularly e-cigarette users (13.9 percent compared with 2.6 percent in NATS). Crowdsourced participants were also less likely to report that smoking is very harmful (81.8 percent vs. 88.8 percent), but were equally likely to report that smoking is very addictive.

When it comes to warning labels, fewer crowdsourced participants (16.9 percent, compared with 21.7 percent in NATS) indicated that they were exposed to cigarette warning labels “very often.” They were also less likely than NATS smokers to report that smoking is “very harmful” across almost all levels of warning label exposure. Consistent with past studies, researchers found that the crowdsourced data over-sampled lower-income populations.

These results demonstrate that crowdsourced data are not generalizable to the population at large, and should not be used for the “monitoring of population trends in behaviors and other outcomes,” according to the authors.

Opportunities for crowdsourcing

Although crowdsourced data can’t be used to represent the general population, several types of research would benefit from this form of data collection. Examples include research on high tobacco-use populations and the earlier phases of studies, including “idea development, designing public health interventions, and ascertaining feedback on alternative approaches,” write the authors.

The framing of crowdsourced surveys is also important for yielding good results. For example, the survey was framed as a tobacco survey, which may have attracted more tobacco-using participants. The authors recommend “framing descriptions for crowdsourced data collection more generally (e.g., a survey on health) to reduce the likelihood that those familiar with the survey subject will be more likely to respond.”

Key takeaways


13.9 percent: The percentage of crowdsourced participants who used e-cigarettes, compared with 2.6 percent from the National Adult Tobacco Survey


16.9 percent: The percentage of crowdsourced participants who indicated exposure to cigarette warning labels “very often,” compared with 21.7 percent from NATS


The percentage of crowdsourced participants who reported income under $20,000 compared to 12 percent from NATS
