Experimental Design Details
We will recruit participants through Prolific, aiming for N = 600; we may increase N if additional grant funding becomes available.
Participants will complete a screening survey that elicits their occupation, and people in eligible occupations will be asked whether they consent to be invited to a one-hour follow-up survey involving writing tasks and signing up for an online account. The screening survey will apply pre-screeners restricting the sample to people with college degrees who are employed full-time in a set of “Industry Roles” that encompass the occupations we are interested in. If we struggle to reach N = 600, we will relax the college-degree requirement as well as the Industry pre-screeners (to capture people in our occupations of interest who did not answer the Industry pre-screener questions when signing up for Prolific).
The occupations we are interested in are marketers, PR professionals, copywriters, freelance writers, grant/proposal writers, technical writers, policy analysts, management consultants, business strategists, HR professionals, managers (office jobs), and data analysts. Managers appear to be the most abundant occupation, so we will recruit all other occupations at the maximum feasible rate while intentionally moderating the pace of manager recruitment, aiming for about 30% managers in the final sample. However, if data collection for the other occupations slows considerably, we will resume recruiting managers to fill out our sample size.
Participants who consent will be invited to the follow-up survey via custom invite lists on Prolific, where they will again be asked to consent to participate.
Our survey requires treatment-group respondents to sign up for ChatGPT, and ChatGPT signups are frequently unavailable. We will aim to keep main survey collection active only during periods when ChatGPT is reliably up (roughly 5-11pm EST). If we nonetheless collect responses while ChatGPT is down, we will drop all responses (treatment and control) collected during the down period.
The main survey proceeds as follows:
Respondents are cross-randomized into one of three incentive conditions that are applied to their writing tasks:
Linear incentives (40% of sample): each task is graded on a 1-7 scale, and the respondent earns $1 per point on each task. The task is expected to take 20-30 minutes.
Convex incentives (40% of sample): same as above, except the respondent receives a $3 bonus payment for getting a grade of 6 or 7. The task is expected to take 20-30 minutes.
Exact time (20% of sample): same as linear incentives, except the respondent is told they will be required to spend exactly 15 minutes on the task, their activity will be tracked, and if they are not on-task for the full 15 minutes their bonus payment will be withheld.
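As an illustrative sketch, the per-task bonus under the three conditions can be written as follows (the function name, condition labels, and on-task flag are our own conventions, not finalized implementation details):

```python
def bonus_payment(grade, condition, on_task_ok=True):
    """Bonus in dollars for one task, given its 1-7 grade and the
    incentive condition ('linear', 'convex', or 'exact_time')."""
    if condition == "exact_time" and not on_task_ok:
        return 0.0                # bonus withheld if not on-task
    pay = float(grade)            # $1 per grade point in all conditions
    if condition == "convex" and grade >= 6:
        pay += 3.0                # extra bonus for a grade of 6 or 7
    return pay
```

For example, a grade of 7 pays $7 under linear incentives and $10 under convex incentives.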
We operationalize the check for on-task activity by recording a snapshot of the respondent’s essay each minute, along with the number of words the respondent has added or modified during that minute.
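A minimal sketch of how the per-minute word-change counts could be turned into a withhold-the-bonus decision (the idle-minute threshold is our assumption; the exact rule is not yet finalized):

```python
def flag_off_task(word_changes, required_minutes=15, max_idle_minutes=2):
    """Decide whether to withhold the bonus in the exact-time condition.

    word_changes: list of ints, words added or modified in each one-minute
    snapshot interval. Flags the respondent if fewer than `required_minutes`
    of activity were recorded, or if too many minutes show zero changes.
    """
    if len(word_changes) < required_minutes:
        return True  # left the task before the required 15 minutes
    idle = sum(1 for n in word_changes[:required_minutes] if n == 0)
    return idle > max_idle_minutes
```

A respondent with steady activity across all 15 minutes would not be flagged; one with long idle stretches or an early exit would be.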
Respondents are told the survey will involve two occupation-specific writing tasks under the incentive scheme they are randomized into, as well as signing up for an online account, and are asked whether they consent to proceed.
Respondents who consent are shown a more comprehensive set of instructions.
Respondents answer two comprehension questions about the instructions, and respondents who fail twice to get both questions correct are screened out.
We elicit demographics:
Salary, hours worked, job tenure
Which of a list of websites/software tools they are aware of and have used before (including ChatGPT)
Respondents are shown task instructions again, then shown their first task prompt and asked to complete it. Each occupation is associated with two task prompts, and the order in which the respondents receive the prompts is randomized.
Respondents answer a few questions about the first task:
How they split their time between brainstorming, writing a rough draft, and editing
How realistic they find the task, whether they have done a similar task before, and how frequently
Job satisfaction and self-efficacy
Respondents are randomized into treatment and control.
The treatment group is required to sign up for ChatGPT and upload three screenshots of them entering prompts we’ve asked them to enter into ChatGPT.
The control group is required to sign up for Overleaf, and upload three screenshots of them compiling documents with sentences we’ve asked for.
Respondents complete their second writing task. Treatment group respondents are told they are allowed to use ChatGPT if they find it helpful.
Respondents are asked follow-up questions:
How they split their time
Job satisfaction and self-efficacy, as well as concerns about automation
All respondents are asked whether they used ChatGPT on the second task
If they answer yes, they are given a set of questions about how they used it
All respondents are asked what software they used on the first task
End of survey
We are considering two grading methods:
Hiring people in the relevant occupations on Upwork to grade the tasks. We can budget for roughly 2 Upwork grades per task if we choose this option.
Using people on Prolific in the relevant occupations to grade the tasks. We can budget for roughly 4-5 Prolific grades per task if we choose this option.
We may add additional screening for selecting graders to ensure that they have expertise in the task area, such as asking about job tenure or whether they have done certain types of tasks before.
Once we have collected some essays, we will pilot both grading methods. We will check two things:
How long graders on each platform spend per essay (we are aiming for ~5 minutes per essay ideally).
How correlated the graders are with each other.
Based on pilot evidence, we will choose a grading method to use for the full set of tasks.
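The two pilot checks above can be sketched as follows (the data layout, grader → {essay: grade}, and the minimum-overlap rule are our assumptions for illustration):

```python
from itertools import combinations
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length grade lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (stdev(xs) * stdev(ys) * (len(xs) - 1))

def pilot_summary(times_seconds, grades):
    """times_seconds: grading time for each grader-essay pair.
    grades: dict mapping grader -> {essay_id: grade}.
    Returns (mean minutes per essay, mean pairwise grader correlation),
    computing each correlation over essays both graders scored."""
    corrs = []
    for g1, g2 in combinations(grades, 2):
        common = sorted(set(grades[g1]) & set(grades[g2]))
        if len(common) >= 3:  # need some overlap for a meaningful r
            corrs.append(pearson([grades[g1][e] for e in common],
                                 [grades[g2][e] for e in common]))
    avg_corr = mean(corrs) if corrs else float("nan")
    return mean(times_seconds) / 60, avg_corr
```

We would then compare mean minutes per essay against the ~5-minute target and compare mean inter-grader correlation across the two platforms.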
If we use the Prolific method (4-5 graders per essay), we may apply Empirical Bayes corrections to individual graders’ grades, shrinking them toward the mean grade for that essay.
Depending on how successful we are at getting graders to assign a roughly uniform distribution of grades, we may also decide to standardize grades at the grader level.
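A minimal sketch of both adjustments, assuming grades are stored as grader → {essay: grade} mappings; here the shrinkage weight is a fixed placeholder, whereas the full Empirical Bayes version would estimate it from the grader-noise-to-essay-signal variance ratio:

```python
from statistics import mean, pstdev

def shrink_toward_essay_mean(grades, weight=0.7):
    """Shrink each grade toward the mean grade for that essay.
    `weight` is the grader's own weight (placeholder; an EB estimate
    would replace it)."""
    by_essay = {}
    for d in grades.values():
        for e, v in d.items():
            by_essay.setdefault(e, []).append(v)
    essay_mean = {e: mean(vs) for e, vs in by_essay.items()}
    return {g: {e: weight * v + (1 - weight) * essay_mean[e]
                for e, v in d.items()}
            for g, d in grades.items()}

def standardize_by_grader(grades):
    """Z-score each grader's grades, removing grader-level location
    and scale differences (e.g., uniformly harsh or lenient graders)."""
    out = {}
    for g, d in grades.items():
        m, s = mean(d.values()), pstdev(d.values())
        out[g] = {e: (v - m) / s if s > 0 else 0.0 for e, v in d.items()}
    return out
```

For example, if one grader gives essay 1 a 7 and another gives it a 3, shrinkage with weight 0.7 pulls those grades to 6.4 and 3.6 respectively.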
We are planning a 14- or 30-day follow-up survey with experimental participants, asking whether they are using ChatGPT in their real job (and potentially other questions, depending on main-experiment results; TBD).