Domain-specific knowledge and LLM performance

Last registered on November 15, 2024

Pre-Trial

Trial Information

General Information

Title
Domain-specific knowledge and LLM performance
RCT ID
AEARCTR-0014755
Initial registration date
November 02, 2024

First published
November 15, 2024, 1:16 PM EST


Locations

Region

Primary Investigator

Affiliation
Catholic University Eichstaett-Ingolstadt

Other Primary Investigator(s)

PI Affiliation
Catholic University Eichstaett-Ingolstadt
PI Affiliation
Dresden University of Technology

Additional Trial Information

Status
In development
Start date
2024-11-02
End date
2024-11-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
Large language models (LLMs), the most well-known form of generative artificial intelligence, have proven remarkably useful in the finance domain. Thus far, studies on the capabilities of LLMs in financial applications have investigated general-purpose models (mostly drawing on OpenAI's cutting-edge GPT models) without domain-specific (i.e., finance-specific) knowledge. It stands to reason that the performance of LLMs could be further enhanced by adding domain-specific knowledge to pre-trained LLMs. We therefore study how injecting domain-specific knowledge affects the performance of LLMs in a simple portfolio allocation task. We vary the type and scope of the injected domain-specific information and compare performance across treatment conditions and against a human benchmark derived from an incentivized survey among financial advisors.
External Link(s)

Registration Citation

Citation
Hornuf, Lars, David Streich and Niklas Töllich. 2024. "Domain-specific knowledge and LLM performance." AEA RCT Registry. November 15. https://doi.org/10.1257/rct.14755-1.0
Experimental Details

Interventions

Intervention(s)
We will vary the type and scope of domain-specific information injected into the LLMs. More information is provided in the hidden intervention description.
Intervention (Hidden)
For the LLM arm, we define 5 experimental conditions:
1. Baseline: No domain-specific information is provided to the LLMs. Human financial advisors will have the same information as LLMs in the baseline condition.
2. Experiment 1: Basic investment theory is provided to the LLMs in the form of a summary of relevant textbook chapters (Berk/DeMarzo, Chapters 9-11). The summary has been created by GPT-4 (which is not part of the LLM sample) in a two-step procedure and cross-checked by the PIs.
3. Experiment 2: Quantitative indicators for each stock are provided to the LLMs. Building on the asset pricing literature, these include the stocks' CAPM betas, market capitalization, book-to-market ratio, momentum (cumulative return over the previous 12 months), earnings-to-price ratio, and Thomson Reuters ESG scores. The injected information also contains a brief definition of all six metrics (a sketch of this payload follows the list below).
4. Experiment 3: Qualitative information on previous firm performance is provided to the LLMs in the form of a summary of their latest 10-K filing. Building on previous studies on the information content of 10-K filings, we provide a summary of the MD&A section of the respective stock's 10-K filing. The summaries have been created by GPT-4 (which is not part of the LLM sample) and cross-checked by the PIs.
5. Experiment 4: The information from Experiments 1 through 3 is combined.
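As an illustration of the Experiment 2 payload, the following minimal Python sketch assembles the indicator record for one stock. All field names and example values are our own assumptions and not part of the registration; momentum is computed as the cumulative return over the previous 12 months, and the remaining metrics are taken as given inputs.

import json

def momentum(monthly_returns):
    # Cumulative return over the trailing 12 monthly returns.
    m = 1.0
    for r in monthly_returns[-12:]:
        m *= 1.0 + r
    return m - 1.0

def indicator_record(ticker, beta, market_cap, book_to_market,
                     monthly_returns, earnings_to_price, esg_score):
    return {
        "ticker": ticker,
        "capm_beta": beta,                 # sensitivity to market excess returns
        "market_cap": market_cap,          # size
        "book_to_market": book_to_market,  # value indicator
        "momentum_12m": momentum(monthly_returns),
        "earnings_to_price": earnings_to_price,
        "esg_score": esg_score,            # Thomson Reuters ESG score
    }

# Made-up example values, serialized as the record would be injected.
print(json.dumps(indicator_record("PEP", 0.55, 230e9, 0.12,
                                  [0.01] * 12, 0.035, 75.0), indent=2))
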
Intervention Start Date
2024-11-02
Intervention End Date
2024-11-30

Primary Outcomes

Primary Outcomes (end points)
Stock-level analyses: individual portfolio weights, portfolio weights of groups of stocks (e.g., value stocks, large stocks)
Portfolio-level analyses: diversification measures (e.g., HHI, number of securities, risky share, IVOL), risk-adjusted performance (e.g., excess returns, Sharpe ratio, CAPM alpha, FF6 alpha) both in and out of sample.
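For concreteness, a minimal Python sketch of two of these portfolio-level outcomes, computed from a weight vector and a periodic return series; the function names and the monthly annualization convention are our own assumptions.

import numpy as np

def hhi(weights):
    # Herfindahl-Hirschman Index of concentration: sum of squared portfolio weights.
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w ** 2))

def sharpe_ratio(portfolio_returns, risk_free_returns, periods_per_year=12):
    # Annualized Sharpe ratio from periodic portfolio and risk-free returns.
    excess = np.asarray(portfolio_returns) - np.asarray(risk_free_returns)
    return float(excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year))

# Example: an equal-weighted portfolio of all 13 securities has HHI = 1/13 ≈ 0.077.
print(hhi([1 / 13] * 13))
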
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
Portfolio recommendations will be elicited from 7 state-of-the-art LLMs, for 12 hypothetical investor profiles and in 5 experimental conditions. As a human benchmark, we elicit portfolio recommendations for each of the 12 profiles from 100 human financial advisors in an incentivized online survey. More information is provided in the hidden experimental design description.
Experimental Design Details
Portfolio allocation task:
The task consists of assigning portfolio weights to a pre-defined universe of 13 securities: 1 US bond fund (Vanguard Total Bond Market ETF, BND) and a heterogeneous range of 12 US stocks (Berkshire Hathaway, PepsiCo, Lockheed Martin, Kimberly-Clark, Cincinnati Financial, Eastman Chemical, Air Lease, Alkermes, St Joe, Evertec, S&T Bancorp, Sturm Ruger & Company). The stocks have been chosen in a stratified sampling approach from the S&P 500 and S&P 600 indices to ensure that they vary with respect to size and book-to-market ratio (two of our most important financial metrics). The portfolio recommendations are supposed to match the respective investor profile. Investor profiles differ with respect to their level of risk tolerance (high/low), sustainability preference (yes/no), and investment horizon (1 month/6 months/12 months), yielding 12 profiles in total (enumerated in the sketch below).
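The 12 profiles follow directly from the Cartesian product of the three attributes described above (2 risk levels x 2 sustainability preferences x 3 horizons); the Python sketch below enumerates them, with labels of our own choosing.

from itertools import product

RISK_TOLERANCE = ["high", "low"]
SUSTAINABILITY = ["yes", "no"]
HORIZON = ["1 month", "6 months", "12 months"]

profiles = [
    {"risk_tolerance": r, "sustainability_preference": s, "investment_horizon": h}
    for r, s, h in product(RISK_TOLERANCE, SUSTAINABILITY, HORIZON)
]
assert len(profiles) == 12  # 2 * 2 * 3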

LLM data collection:
For each of the 7 LLMs, we elicit 60 (12 investor profiles * 5 experimental conditions) portfolio recommendations (i.e., vectors of portfolio weights for the 13 securities). We inject the additional information in the 4 experimental conditions other than the baseline using JSON files, in which the order of the securities is randomized. We formulate standardized prompts that take into account recent research on the performance impact of various prompting techniques (e.g., role prompting, chain-of-thought prompting). After each request, all previous correspondence with the model is deleted to avoid learning effects (see the sketch below).
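A hypothetical sketch of this elicitation loop follows; query_llm stands in for whichever model API is called, and its name and signature are assumptions rather than part of the registration. The essential points are that the security order inside the injected JSON is shuffled per request and that each request is single-turn, so no prior correspondence can induce learning effects.

import json
import random

SECURITIES = ["BND", "BRK.B", "PEP", "LMT", "KMB", "CINF", "EMN",
              "AL", "ALKS", "JOE", "EVTC", "STBA", "RGR"]  # illustrative tickers

def build_payload(info_by_ticker):
    # Serialize the injected information with the security order randomized.
    order = random.sample(SECURITIES, k=len(SECURITIES))
    return json.dumps([info_by_ticker[t] for t in order])

def elicit(query_llm, prompt_template, profile, payload):
    # A fresh, single-turn request: no previous correspondence is included.
    prompt = prompt_template.format(profile=profile, payload=payload)
    return query_llm(messages=[{"role": "user", "content": prompt}])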

Human financial advisor collection:
As a human benchmark, we will elicit portfolio recommendations for all 12 profiles from 100 human financial advisors (targeted number, subject to response rate) in an incentivized online survey. Respondents are recruited via Prolific, and participation is restricted to financial advisors residing in the US. Respondents are paid a fixed remuneration of 3 GBP (≈3.90 USD) for completing the survey, which takes 15-20 minutes. In addition, respondents are paid a variable remuneration of up to 3 GBP (≈3.90 USD) based on the performance of their portfolio recommendations. In particular, one of the 12 investor profiles will be randomly selected, and all survey respondents will be ranked according to the risk-adjusted performance of their portfolios for this profile. Those with higher risk-adjusted performance will receive higher variable payments (see the sketch below). We also include an attention check (which risk tolerance did the previous investor profile display?) and questions targeting the respondents' experience in the financial advisory industry and their financial knowledge. We will use the attention check (as well as response time) as a potential indicator of inattentiveness and restrict our analyses to attentive respondents as a robustness check. We might also explore heterogeneity in the quality of recommendations based on experience or financial knowledge.
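The variable-payment rule can be sketched in Python as follows; the linear payment schedule is an illustrative assumption of ours, since the registration states only that better-performing respondents receive higher payments.

import random

def variable_payments(performance, max_payment_gbp=3.0):
    # performance: {respondent_id: {profile_id: risk_adjusted_performance}}.
    # Draw one of the 12 profiles at random and rank respondents on it.
    profiles = list(next(iter(performance.values())).keys())
    drawn = random.choice(profiles)
    ranked = sorted(performance, key=lambda r: performance[r][drawn], reverse=True)
    n = len(ranked)
    # Assumed linear schedule: best rank earns the full amount, worst earns 0.
    return {r: max_payment_gbp * (n - 1 - i) / max(n - 1, 1)
            for i, r in enumerate(ranked)}
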
Randomization Method
Randomization done in office by a computer
Randomization Unit
Not applicable
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
7 LLMs + 1 platform for human advisor recommendations (100 advisors targeted)
Sample size: planned number of observations
1620 (7 LLMs * 12 profiles * 5 experimental conditions = 420 LLM-generated observations; 100 financial advisors * 12 profiles = 1200 human-generated observations)
Sample size (or number of clusters) by treatment arms
84 observations (7 LLMs * 12 profiles) per experimental condition in the LLM arm; 1200 observations (100 advisors * 12 profiles) in the human financial advisor arm.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
IRB Approval Date
IRB Approval Number

Post-Trial

Post Trial Information

Study Withdrawal

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials