21 July 2009 | Markus Jakobsson
Performing human-subjects experiments on Amazon Mechanical Turk offers many benefits, including very low experiment costs, quick turn-around rates, and relatively simple approvals from human subjects boards. But you have to be careful to avoid bias and error; we describe some techniques below. Feel free to add your insights in the comments.
Amazon Mechanical Turk, or MTurk for short, is a cloud computing platform that permits outsourcing of tasks to other users, using a built-in payment scheme to compensate users. People (often referred to as “Turkers”) perform MTurk tasks, which are called Human Intelligence Tasks (HITs), and are paid just a few cents for completing them.
Low cost. Turkers seldom earn more than $5-10 an hour. Although such low earnings might be expected to skew the demographics severely, that doesn’t actually seem to be the case. A large portion of users are not from developing countries, since there are only two ways to cash out: spend the earnings on products sold by Amazon, or transfer earnings to a U.S. bank account. [Panos Ipeirotis provides an excellent overview of MTurk demographics.]
Anonymity. The MTurk terms of service do not allow you to ask for identifying information, use the platform to commit crimes (e.g., clickfraud), or use it to violate the terms of service of other providers. Since the service doesn’t offer “verified” user profiles, users can lie about their demographic group to qualify for a study. But given the built-in anonymity of the system, they may also be more honest in answering stigmatizing questions. We have evidence that this is often the case.
Ease of approval. Many Institutional Review Boards (IRBs) treat MTurk studies as exempt from review since prospective subjects have accepted its terms of use and anonymity. However, you can’t be certain that subjects are not from particularly “vulnerable” groups (e.g., minors) as defined by the Belmont Report. This is a common problem with network-based human subjects research, of course, and not specific to MTurk.
Speed and access to subjects. It’s possible to get several hundred subjects in just a few days — though of course the number depends on the task, how much you’re willing to pay, and how many restrictive qualifications you place on participants.
Whatever your motive for hiding your study (protecting an organization, obscuring an idea, or, most commonly, avoiding selection bias), the solution is simple. First, run a first-phase study that doesn’t introduce any bias but lets you select subjects for follow-up studies. Then, as a second step, select subjects from this pool of turkers and ask them to visit a website to perform the “real” task. This way, you don’t have to worry about a biased selection, such as passionate baseball players signing up for a study of average baseball knowledge. And if you only want people from a certain demographic (say, males between the ages of 30 and 35), screen for them in the first phase.
In many situations, subjects may be biased by the knowledge that they are participating in a study, or by knowing the goals of a study. They may think that in order to please you – the experimenter – they should respond in a particular way. They may lie in order to hide an embarrassing truth about themselves. They may pay more attention to some aspects than they normally would have, because they know that this is what they are tested on. [See "Phishing IQ Tests Measure Fear, Not Ability" in Financial Cryptography and Data Security (2008) for more about this bias.]
For example: if you show people a website and ask whether it is a phishing site, they are far more likely to scrutinize the site, detect an aberration, and respond “yes” than they would be if they came across the site in a real-life situation. A large body of recent research (including mine) has examined how to perform such studies. In the particular case of phishing, the resulting technique is referred to as a “phishing stint” or as a “naturalistic phishing experiment”. [For addressing ethical aspects of such experiments, see "Designing and Conducting Phishing Experiments" in IEEE Technology and Society magazine's Special Issue on Usability and Security (2007).]
To perform any such naturalistic study, you need to convey a different task to your subject than what you are observing – essentially deceive them – to see how they react when faced with the situation of interest. You may, for example, say that you are studying the common reaction to online e-commerce sites, and ask them to rate how helpful various sites are, adding an additional free-text input field where they can add other observations. You first show them three or four perfectly legitimate websites, asking them to rate and describe them; then you show them a phishing site and do the same. Will they tell you that this is a site run by fraudsters? If they do, they noticed signs of fraud without you prompting them.
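A first pass over the free-text responses for the phishing site can be automated. The sketch below is only an illustration: the term list, the function names, and the sample comments are all assumptions, and keyword matching is crude, so it should complement reading the responses rather than replace it.

```python
# Hypothetical list of word stems that suggest a subject suspected fraud.
SUSPICION_TERMS = ("phish", "fraud", "scam", "fake", "spoof")


def mentions_suspicion(comment):
    """Return True if the free-text comment contains any suspicion term."""
    text = comment.lower()
    return any(term in text for term in SUSPICION_TERMS)


def unprompted_detection_rate(comments):
    """Fraction of comments that mention fraud without being prompted."""
    if not comments:
        return 0.0
    flagged = sum(mentions_suspicion(comment) for comment in comments)
    return flagged / len(comments)


# Made-up responses to the phishing site, for illustration only.
sample_comments = [
    "Clean layout, easy to find the login box.",
    "This looks like a scam, the address bar is wrong.",
    "Helpful, but the colors are ugly.",
    "Pretty standard shop page.",
]
print(unprompted_detection_rate(sample_comments))
```

Comparing this rate with the rate for the legitimate sites gives a rough sense of how often the terms appear by accident.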
You can also perform much more invasive studies where you actually attempt to defraud subjects to see what portion of them fall for it. But this has to be done with extreme care, or you’ll become a criminal! Your IRB will offer you plenty of advice if you decide to try an experiment of this type; be sure to read up on some ways in which it has been done successfully before submitting your application. This area is full of pitfalls, and deserves a separate explanation. [See "Social Phishing" in the Communications of the ACM (2007) for an example.]
People may supply arbitrary information (to save time, hide personal information, increase their chances of getting paid, or simply because they’re lazy) or even lie (to preserve self-image or to intentionally destroy a study). Here are some techniques for detecting and deterring these kinds of behavior:
Once you have selected subjects and asked them to participate in a follow-up study, you can ask additional “error-detecting” questions (or even some of the same questions from the first phase). This improves your chances of catching cheaters, especially lazy liars or liars who filled the first phase form in an arbitrary manner. They won’t know how to answer in a consistent manner.
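The consistency check described above can be sketched in a few lines of Python. The data layout (answers keyed by worker ID and question name), the question names, and the sample answers are assumptions made for illustration.

```python
def find_inconsistent_workers(phase_one, phase_two, repeated_questions):
    """Return IDs of workers whose repeated answers differ between phases."""
    flagged = []
    for worker_id, second_answers in phase_two.items():
        first_answers = phase_one.get(worker_id, {})
        for question in repeated_questions:
            first = first_answers.get(question)
            second = second_answers.get(question)
            # Compare only questions the worker answered in both phases.
            if first is not None and second is not None and first != second:
                flagged.append(worker_id)
                break
    return flagged


# Hypothetical answers: worker W2 changes age group between the two phases.
phase_one = {
    "W1": {"age_group": "30-35", "gender": "male"},
    "W2": {"age_group": "30-35", "gender": "male"},
}
phase_two = {
    "W1": {"age_group": "30-35", "gender": "male"},
    "W2": {"age_group": "18-24", "gender": "male"},
}
print(find_inconsistent_workers(phase_one, phase_two, ["age_group", "gender"]))
# prints ['W2']
```

A flagged worker is not necessarily a liar (people mistype), so it may be worth looking at flagged responses before discarding them.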
One of the most compelling benefits of MTurk is how inexpensive it is to carry out experiments on it. Some researchers may be tempted to pay no more than what is needed to get the work done. I am against that: I believe that if you pay peanuts, you get peanuts. If you are very clearly trying to minimize your payments, subjects will respond by minimizing their effort or avoiding the HIT altogether. An average HIT requiring a minute of the user’s time may pay 5-10 cents, which corresponds to an hourly wage of $3-6. But why pay minimum wages if paying four times as much is still an incredible deal for you? I would pay about 25 cents for a minute’s effort.
To determine the best price, I’ve performed simple experiments where I ask people to answer a question at different prices. When you pay a bit more, the results often improve, and the higher payment also signals to subjects the level of effort you expect.
But paying well also introduces problems. If you pay more than others, you may skew your subject distribution by attracting people who focus excessively on the payment. They, in turn, may be willing to rationalize a bit more than you want. My approach is to first use a screening study (like the ones I describe above) where I don’t offer to pay above the norm. Then, I pay the users the two cents they expected, plus an immediate bonus of another two cents (which doesn’t cost a lot, but gets people’s attention). Finally, I offer a follow-up study in which I pay quite respectably: say, 60 cents for a two-minute effort. That is very inexpensive as far as subject reimbursement goes, but still means an hourly wage of $18, significantly above the hourly wage for average MTurk tasks.
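The conversion from per-task pay to an hourly wage is simple arithmetic, and a small helper makes it easy to check a price before posting a HIT. The function name is an assumption for illustration.

```python
def hourly_wage_dollars(pay_cents, minutes):
    """Convert pay for one task (in cents) into an hourly wage in dollars."""
    return pay_cents * 60 / minutes / 100


# The per-minute prices discussed above: 5, 10, and 25 cents.
for pay_cents in (5, 10, 25):
    print(pay_cents, "cents per minute ->", hourly_wage_dollars(pay_cents, 1), "dollars per hour")
```

The same helper works for longer tasks by passing the expected duration in minutes.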
It’s also a good idea to set realistic expectations for when a subject is to be paid. This is particularly true when your technique involves bonus payments. Some people get skittish and wonder if they will be paid, and if so, when. You don’t want a few hundred inquiries. Tell them that it may take a few days, because you pay in batches.
MTurk is not optimized for following up with a subject after a few months. It does allow you to assign predicates with each user who performs a task, and later offer HITs only to users who have (or who don’t have) predicates of your choice. That’s a bit complicated, though, and leads to much lower opt-in rates than directly contacting the desired subjects. Here’s what you can do:
For constructing and deploying more complicated surveys, MTurk offers a programmable tool. But it isn’t easy to use, and doesn’t offer an easy visualization of results. Instead, I set up a survey on SurveyMonkey and link to that in my recruiting message.
It’s also possible to ask your subjects to visit a URL you provide, perform a task there, and report an observation back on another site, including MTurk itself. Or, you can ask subjects to input their observations on the visited site.
For example: you can ask subjects to visit 3 websites and rank them. One of them is yours or your client’s; the others belong to competitors. Which one is nicest, and why? Similarly, you may have constructed a site that dispenses user advice – such as the SecurityCartoon one I run – and want to know whether typical Internet users understand a particular piece of advice. You can ask them to read the advice (include a URL for them to visit); then ask them to tell you, in their words, what it means. Or ask them to judge five fictional situations and tell you which ones are safe. If they understood the advice, that should impact their selection. Or ask them simply whether they learnt anything new, and whether they would suggest to friends of theirs concerned with the associated issue to read the advice.
Using MTurk may seem to complicate matters in terms of getting truthful answers, including demographic information. As described above, there are many ways to identify “casual liars”. But how about subjects who take great pains to lie in a way that cannot be detected? How can we test for their presence?
First of all, turkers enjoy anonymity: by the nature of the underlying medium, they are pseudonymous to experimenters, on an already-anonymous Internet. Even though many dispute the concept of anonymity on the Internet, and there are methods to identify pseudonymous records, turkers feel that they’re anonymous. Given these assumptions, it’s possible that they won’t bother lying if there is no apparent advantage associated with doing so. It’s also possible that those inclined to lie don’t make the effort to construct particularly good lies. After all, the effort of constructing a convincing lie far exceeds both the effort the task requires and the payment it earns.
I conducted 2 parallel studies: one on MTurk, and one proprietary (and much more expensive) study I helped a client perform through an established, independent survey company. The results were statistically indistinguishable from each other.
While I’m not claiming that survey results would always be the same in both traditional and MTurk mediums, you can test whether your method is likely to mimic traditional survey results. You could include components in your study whose results you can compare to representative results. If they don’t match, it’s likely that the rest of your results may not represent what you would have obtained using a traditional approach. You can then go back and look at how you ask questions, how you recruit, what demographics you drew from. Maybe one of these aspects introduced a bias. If you find a problem, just re-run the study with new participants.
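One way to compare a component of your study against a representative result is a two-proportion z-test on a yes/no question. This is a sketch using a standard textbook test, not necessarily the comparison used in the studies mentioned above; the function name and all counts are hypothetical.

```python
import math


def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Uses the pooled standard error, which
    assumes samples are large enough for the normal approximation.
    """
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("Both samples are all-yes or all-no; the test is undefined.")
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: "yes" answers in an MTurk sample and a reference survey.
z, p_value = two_proportion_z_test(yes_a=132, n_a=300, yes_b=221, n_b=500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A large p-value does not prove that the two samples behave the same; it only means this particular question gives no evidence of a difference.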
In fact, it’s hard to design studies, and easy to make mistakes. How should you approach this? Expect mistakes. You can run a small (10-15 participant, one day) study to review responses and identify problems in the design. Fix the flaws, then run a new version and observe those results. When you’re satisfied that things work, you can run your real study with a much larger number of subjects. If you’re concerned that your early experiments may introduce bias into the results of your real experiment, then use an MTurk feature to effectively “ban” early participants from later experiments, by assigning them credentials that later disqualify their participation.
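If you also keep your own list of pilot participants, the same exclusion can be applied when you select subjects or clean the submitted data. The sketch below is independent of any MTurk feature; the worker IDs and the function name are hypothetical.

```python
def eligible_workers(candidates, pilot_participants):
    """Return the candidates who did not take part in any pilot run, in order."""
    excluded = set(pilot_participants)
    return [worker_id for worker_id in candidates if worker_id not in excluded]


# Hypothetical worker IDs: W2 and W4 took part in a pilot run.
pilot_participants = ["W2", "W4"]
candidates = ["W1", "W2", "W3", "W4", "W5"]
print(eligible_workers(candidates, pilot_participants))
# prints ['W1', 'W3', 'W5']
```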
Iterate until you get it right. The beauty of MTurk is the low cost of testing each iteration.
“Crowdsourcing User Studies With Mechanical Turk”
Greg, thank you for your question. And yes, I am referring to qualifications. To allow people *without* qualifications, you need to use a macro, as the MTurk interface does not allow for “negative” demands.
Another way of doing something like this is to use cookies, and disallow people who have the cookie. This will automatically be done if you use a service like SurveyMonkey, where people cannot fill out a survey twice without clearing their cookies in between.
Good answer. Correct, the way to do “negative” qualifications is to define a new qualification in terms of another, with scripting. Cookies <- awesome tip.
Not that it matters, but Amazon AWS almost never refers to the workers as "turkers"; they are "workers"
July 23rd, 2009 at 11:26am
Posted by Greg Little
This is a technical question:
You say: “It does allow you to assign predicates with each user who performs a task, and later offer HITs only to users who have (or who don’t have) predicates of your choice.”
Are you referring to “qualifications”? If so, how do you prevent workers with a qualification from doing your HIT?