Limits and capabilities of LLM

Last registered on July 04, 2023

Pre-Trial

Trial Information

General Information

Title
Limits and capabilities of LLM
RCT ID
AEARCTR-0011584
Initial registration date
June 16, 2023

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
June 23, 2023, 4:46 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Last updated
July 04, 2023, 5:26 AM EDT

Last updated is the most recent time when changes to the trial's registration were published.

Locations

Region

Primary Investigator

Affiliation
University of Lausanne

Other Primary Investigator(s)

PI Affiliation
University of Lausanne
PI Affiliation
University of Lausanne

Additional Trial Information

Status
In development
Start date
2023-06-18
End date
2023-09-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
We are conducting an experiment to determine whether state-of-the-art language prediction models can outperform humans in areas of creativity and strategic intelligence. Comparing human performance to these advanced models sheds light on the level of competition that language models can present and on the range of their potential applications, especially in areas traditionally seen as beyond their capabilities. Additionally, we study the effect of competition with artificial intelligence on the creativity of human subjects, both for those who perform the tasks and for the evaluators, who may alter their perceptions of creativity when informed that some of the responses were generated by AI.
External Link(s)

Registration Citation

Citation
Bohren, Noah, Rustamdjan Hakimov and Rafael Lalive. 2023. "Limits and capabilities of LLM." AEA RCT Registry. July 04. https://doi.org/10.1257/rct.11584-1.1
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
In the first experiment, the interventions vary the information about competitors.

In the second experiment, participants rate texts, and interventions vary information about those who submitted the text for ratings.
Intervention Start Date
2023-06-18
Intervention End Date
2023-09-30

Primary Outcomes

Primary Outcomes (end points)
We will estimate the impact of the treatment on various outcomes: (1) creativity measures (overall, uniqueness, surprise, and value) rated by Prolific workers; (2) creativity measures (overall, uniqueness, surprise, and value) rated by RAs and creative workers; (3) the number of points won in strategic tasks in the last 12 rounds; and (4) the number of "rock" moves in the last 12 rounds.
Primary Outcomes (explanation)
(1) Creativity measures (overall, uniqueness, surprise, and value) by Prolific participants from stage 2. Since each participant will only rate 10 responses, we will regress ratings on answer fixed effects and rater fixed effects. The estimated answer fixed effects will be used as the ratings for the responses.

(2) Creativity measures (overall, uniqueness, surprise, and value) by RAs. Given that each RA rates all the stories, we will calculate the average rating for each response from all RAs.

(3) The number of points won in the strategic task in the last 12 rounds.

(4) The number of "rock" moves in the last 12 rounds.
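The rater adjustment in (1) can be sketched as a two-way fixed-effects regression. This is an illustrative sketch, not the registered analysis code: the column names (answer_id, rater_id, rating) and the plain least-squares implementation are assumptions.

```python
import numpy as np
import pandas as pd

def answer_scores(ratings: pd.DataFrame) -> pd.Series:
    """Estimate answer fixed effects from a rating = answer FE + rater FE model.

    `ratings` is assumed to have columns answer_id, rater_id, rating
    (hypothetical names). The first answer and first rater serve as the
    omitted baseline categories.
    """
    A = pd.get_dummies(ratings["answer_id"], prefix="a", drop_first=True)
    R = pd.get_dummies(ratings["rater_id"], prefix="r", drop_first=True)
    X = np.column_stack([np.ones(len(ratings)),
                         A.to_numpy(float), R.to_numpy(float)])
    beta, *_ = np.linalg.lstsq(X, ratings["rating"].to_numpy(float), rcond=None)
    # Answer effects: the baseline answer is scored 0, others are relative to it.
    k = A.shape[1]
    return pd.Series(beta[1:1 + k], index=A.columns)
```

Because the baseline answer is normalized to zero, the returned values are scores relative to that answer, which is sufficient for ranking responses and identifying the top 10%.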

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
In the first experiment, Prolific participants undertake creative and strategic tasks. The interventions vary the information about competitors.

In the second experiment, Prolific participants rate creative responses to one of two questions. The interventions vary information about who submitted the texts being rated.
Experimental Design Details
Experimental Design
In the first experiment, Prolific participants undertake creative and strategic tasks. In one treatment, they are notified that they are competing with AI-generated texts when performing the creative task. In the strategic task, they play against either a player who employs an equilibrium strategy or a player who uses a non-equilibrium strategy. Later addition: In an additional treatment, we screen for participants who have access to one of the AI LLM platforms and encourage them to generate responses to the creative questions using AI. The strategic task remains unchanged.

In the second experiment, Prolific participants rate creative responses to one of two questions. These responses are either AI-generated or produced by human subjects. In the baseline scenario, raters are unaware that some of the answers are generated by AI, and human responses are not edited. In one treatment, although raters are still unaware that some responses are AI-generated, human responses are corrected for language mistakes by AI before the rating. In another treatment, participants are informed that some of the responses are generated by AI, and human responses are likewise corrected for language errors by AI prior to rating.

Experimental Design Details
The study consists of three stages.
First Stage is experiment 1:
Participants from the US recruited from Prolific will perform two tasks in random order.
Task 1. Creativity Task:
We use the open creativity task of Charness & Grieco (2019). Participants have a choice between two open creativity questions:
1. “If you had the talent to invent things just by thinking of them, what would you create?”
2. “Imagine and describe a town, city, or society in the future.”

Participants have 10 minutes to respond in at most 1000 characters. They understand that other participants will evaluate their answers on a subjective creativity rating scale of 0 to 10 points. They know that participants with the 10% best scores receive a bonus payment of 5 dollars.

There are two treatments for this task between-subjects:
1. Participants compete exclusively amongst themselves (700 in the representative sample). Note that the top 10% of texts in this sample receive the bonus.
2. Participants compete with AI and are thus aware that some responses are generated by AI (an additional 300 participants). AI-generated texts also enter the creativity rankings; that is, to receive the bonus, a text must be within the top 10% of all texts in the sample that includes the AI-generated texts.

Task 2. Strategic Intelligence Task:
After task 1, participants play "Rock, Paper, Scissors" against a computer that implements the pre-determined moves of a human player. Participants know that the strategy of the opponent is fixed. They play 24 rounds against the same opponent, and the opponent's action and the winner are disclosed after each round. Participants earn 1 point for each victory, 0 points for each loss, and 0.5 points for a draw. Each point is worth 20 cents at the end of the game.

There are two treatments for this task between-subjects:
1. The computerized opponent employs one of the equilibrium strategies involving the three actions.
2. The opponent utilizes a non-equilibrium strategy and never uses the "scissors" action.
The prediction in the case of out-of-equilibrium play is that players should be able to recognize it and exploit it to gain more points in the last 12 rounds by never using the dominated "rock" action.
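The scoring rule and the "never scissors" opponent described above can be sketched as follows. This is a hypothetical illustration: in the actual experiment the opponent plays a fixed pre-determined sequence of human moves, which is replaced here by a random draw over rock and paper for brevity.

```python
import random

# Each move beats the move it maps to.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def round_points(player: str, opponent: str) -> float:
    """1 point for a win, 0.5 for a draw, 0 for a loss (each point = 20 cents)."""
    if player == opponent:
        return 0.5
    return 1.0 if BEATS[player] == opponent else 0.0

def never_scissors_opponent() -> str:
    """Non-equilibrium opponent: mixes only rock and paper, never scissors."""
    return random.choice(["rock", "paper"])
```

Against such an opponent, "paper" weakly dominates "rock" (paper wins against rock and draws against paper, while rock draws against rock and loses to paper), which motivates outcome (4), the count of "rock" moves in the last 12 rounds.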


Additional measures:
After these tasks, participants complete a questionnaire as in Charness & Grieco (2019). The questionnaire comprises ten questions on creative and cognitive style and sensation-seeking behavior, based on questions by Nielsen, Pickett, and Simonton (2008) on creative style and Zuckerman et al. (1964) on sensation-seeking attitude. It also contains seven demographic questions concerning gender, age, sibling count, birth order, handedness, and parental marital status. Furthermore, it includes six questions about past involvement in creative activities (Hocevar, 1980). Finally, we ask for a non-incentivized measure of risk preferences (Dohmen et al., 2009).

Treatments:
1. Human
2. Bard
3. ChatGPT

For the AI treatments in the creativity task, we generate 200 unedited responses, 100 for each creativity question. We submit 25 queries of the form:
Give 4 alternative unique and creative answers to the following question within 1000 characters for each answer: “If you had the talent to invent things just by thinking of them, what would you create?”
And another 25 queries of the form:
Give 4 alternative unique and creative answers to the following question within 1000 characters for each answer: “Imagine and describe a town, city, or society in the future.”

For the AI treatments in the intelligence task, we interact with the AI within a single chat session using the following script of commands:
"Let's play 24 rounds of Rock, Paper, Scissors. I have my moves fixed for all 24 rounds and will reveal them to you honestly after each round, so you can potentially adjust your strategy to win the most rounds. Note that your goal should be to win as many rounds as possible. What is your first move?"

Next, after each response, we reveal the human player's move and ask for the next move, and so on for all 24 rounds.

Each treatment (the equilibrium human move sequence or the "never scissors" sequence) is replicated 200 times with each AI.


Second Stage is Experiment 2:
Raters:
All responses to the creativity task are anonymized, and Prolific participants are invited to rate them on a 0 to 10 scale. Additionally, we ask for subjective measures of novelty, value, and surprise (Boden 1980):

• "To what extent is this response new or original?"
• "How would you rate the usefulness of this response?"
• "How surprised were you by this response?"

Each participant rates a random selection of 10 responses. The selection is randomized so that each answer is evaluated by at least 10 different raters. After a response has been evaluated by 10 raters, it is dropped from the randomization for future raters. No variable incentives are offered for this task; only a fixed payment of £1.7.
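The assignment rule above (random batches of 10, with responses retired once they reach 10 ratings) could be implemented along these lines; the function name and data structure are illustrative assumptions, not the registered implementation.

```python
import random

def assign_batch(rating_counts: dict, k: int = 10, cap: int = 10) -> list:
    """Draw up to k responses for the next rater, excluding responses that
    already have `cap` ratings, so every response eventually reaches
    at least `cap` raters. `rating_counts` maps response id -> count so far.
    """
    eligible = [r for r, c in rating_counts.items() if c < cap]
    batch = random.sample(eligible, min(k, len(eligible)))
    for r in batch:
        rating_counts[r] += 1
    return batch
```

Repeatedly calling this for successive raters exhausts the pool: with N responses and a cap of 10 there are exactly 10·N rating slots, so the procedure terminates with every response rated by 10 raters.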

There are three treatments for this task between-subjects:
1. Raters evaluate AI responses alongside original human responses, unaware that some of the responses are generated by AI. 'Please rate responses to one of the following two questions.'
2. Raters evaluate AI responses and human responses that have been edited for typos and language by AI. 'Please rate responses to one of the following two questions.'
3. Raters evaluate AI responses and human responses that have been edited for typos and language by AI. They are also asked to guess the probability that the response was generated by an AI. 'Please rate responses to one of the following two questions. Some responses are given by participants like you and edited by AI for language mistakes, while others are originally generated by AI. Please also rate the probability of each response being generated by AI.'

Third Stage. RA work:
Three research assistants (RAs) from the field of economics and three individuals from the art industry are hired to evaluate all creativity responses while remaining unaware of the treatment assignment. For human responses, they rate versions that have been edited for language errors by AI. In addition, the RAs assess the uniqueness of responses using plagiarism-detection tools and search engines.
Randomization Method
Based on randomization in Qualtrics
Randomization Unit
participant
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
1,000 for experiment 1 + 300 for the new treatment
4,200 for experiment 2
Sample size: planned number of observations
1,000 for experiment 1 + 300 for the new treatment; 4,200 for experiment 2
Sample size (or number of clusters) by treatment arms
1,000 for experiment 1 + 300 for the new treatment
4,200 for experiment 2
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
LABEX University of Lausanne
IRB Approval Date
2023-06-18
IRB Approval Number
Not yet known

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials