|
Field
Trial Status
|
Before
on_going
|
After
in_development
|
|
Field
Abstract
|
Before
Online news feeds are responsive attention markets: the content shown to users affects the interactions that determine future rankings. This feedback loop can generate cumulative advantage, concentrating attention among incumbent publishers and making outcomes sensitive to early luck. We propose a randomized parallel-world field experiment on the Graze Trending News feed to test how ranking design shapes these dynamics. Users are persistently assigned to one of nine worlds implementing three ranking rules: a status-quo local-interaction rule, a lightly smoothed interaction-per-impression rule, and a heavily smoothed interaction-per-impression rule. Because items decay out of the feed within 12 hours, the experiment consists of repeated short-lived attention markets. The primary outcomes are user engagement, position-weighted quality allocation, variance in exposure among similarly high-quality posts, outlet-level concentration, and cross-world divergence in market shares. Quality is measured using a pre-specified external label estimated from chronological-feed interactions. The central question is whether exposure-normalized ranking can reduce concentration and path dependence while preserving or improving the quality of attention allocated to news.
|
After
Online news feeds are responsive attention markets: the content shown to users affects the interactions that determine future rankings. This feedback loop can generate cumulative advantage, concentrating attention among incumbent publishers and making outcomes sensitive to early luck. We propose a randomized parallel-world field experiment on the Graze Trending News feed to test how ranking design shapes these dynamics. Users are persistently assigned to one of nine worlds implementing three ranking rules: a status-quo local-interaction rule, a lightly smoothed interaction-per-impression rule, and a heavily smoothed interaction-per-impression rule. Because items decay out of the feed within 48 hours, the experiment consists of repeated short-lived attention markets. The primary outcomes are user engagement, position-weighted quality allocation, variance in exposure among similarly high-quality posts, outlet-level concentration, and cross-world divergence in market shares. Quality is measured using a pre-specified external label estimated from chronological-feed interactions. The central question is how designs, such as exposure-normalized ranking, trade off concentration, unpredictability, and user-facing quality/engagement metrics. The central hypothesis is that interventions that reduce concentration may do so by increasing unpredictability.
|
|
Field
Trial Start Date
|
Before
March 27, 2026
|
After
June 11, 2026
|
|
Field
Trial End Date
|
Before
October 31, 2026
|
After
March 23, 2027
|
|
Field
Last Published
|
Before
May 04, 2026 11:47 AM
|
After
June 22, 2026 09:56 PM
|
|
Field
Intervention Start Date
|
Before
April 27, 2026
|
After
June 23, 2026
|
|
Field
Intervention End Date
|
Before
June 30, 2026
|
After
September 23, 2026
|
|
Field
Primary Outcomes (End Points)
|
Before
1. User engagement metrics: positive interactions (likes, replies, reposts, quote posts, see more), sessions, and session depth. A session is defined as a request to the Graze Trending News feed. Session depth is defined as the number of items viewed within a request. Positive interactions and sessions are the primary engagement outcomes, while session depth is a secondary engagement outcome. The primary engagement outcomes are measured on a user-day panel and analyzed intent-to-treat. Included users are those who used the Graze Trending News feed at least once in the month before the experimental period, and the user-day panel begins on the calendar day of that user's first observed Graze Trending News feed impression during the experiment. For positive interactions, our primary measure is "likes", with a secondary measure using the full set of interaction types, weighted in the same way as the ranking rule in the experiment.
2. User-facing quality allocation. The main quality outcome is the average external quality of the visibility allocated by the ranking rule, using position-weighted exposure rather than raw impressions. This will primarily be defined as the average (unweighted) quality of posts in the Top 3 positions sent to users. For these Top 3 exposure measures, an opportunity is a Graze Trending News feed request while the post is eligible. The inference unit is each post entry cohort: we will compute the quality-allocation metric for each design x cohort, then compare designs across cohorts (properly accounting for how world averaging affects variance of the estimate). We average across worlds within each design.
3. Within-quality variance in post visibility. This measures how differently similarly high-quality posts are treated by the ranking rule. We will bin posts into quality quantiles and measure the variance in exposure among posts in each quantile. The primary exposure measure is the fraction of Graze Trending News feed requests while eligible in which a post appears in the Top 3, with a secondary measure on the number of impressions. Same as user-facing quality allocation, the inference unit is the post-entry cohort, and we will average across the 3 worlds within a design before comparing designs across cohorts.
4. Outlet concentration. The main concentration outcome is the Gini coefficient of outlet exposure shares among outlets with at least one eligible post in a given design x world x post-entry cohort. Outlet exposure is defined using the same Top 3 opportunity-based exposure measure as above, averaged across all posts from the outlet in that cohort. The inferential unit is the post-entry cohort: we compute the Gini coefficient separately for each design x cohort, and then compare designs across cohorts. We average across worlds within each design.
5. Cross-world unpredictability. This measures how much item-level market shares diverge across the three parallel worlds within a design. This is defined as the average pairwise absolute difference in item-level market shares across the three worlds within a design. Here, the inference unit is the post-entry cohort; for each design x cohort, we will compute the average pairwise divergence across the 3 worlds for each post, then average across posts within the cohort, and then compare designs across cohorts.
|
After
We will group posts into post-entry cohorts: sets of posts that were created in the same time interval.
We will primarily examine the render log, which records the time, the user identifier, and the posts sent to the user for each request to the Graze Trending News feed (a “session”). In particular, we examine news posts sent in the top 6 positions for requests with post limit > 1. We use the absolute position in the feed, so non-news posts such as announcements and surveys count toward the 6 positions. Since the feed always includes a pinned post, the top 6 typically contain 5 news posts when the survey is not shown.
We also construct a “shadow global” render log. For each render log entry at time t with n news posts, we create an entry in the shadow global render log at time t with the n posts that would have been shown by the “Shadow Global” Design 0 based on global interaction counts up to time t.
**Relevant definitions for outcomes**
1. *Exposure market share for posts.* An exposure of a post is one appearance in the top 6 positions of the posts sent to a user. The market share of a post is its number of exposures divided by the total exposures of all posts in the same post-entry cohort.
2. *Residual from shadow global.* For each post in a given world, the residual from the shadow global is the difference between its market share based on the posts actually sent to users and the market share it would have had if the shadow global ranking had been used to rank posts in that world.
3. *Item quality from reverse chronological feed.* For each post, we will define its quality as explained in detail below. Given a measure of quality, we will bin eligible posts (posts rendered to any user across any Design 0-3) into quality quintiles (5 buckets), where each quintile has the same number of posts.
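The three definitions above can be sketched in Python as follows. This is a minimal illustration, not the registered analysis code: the function names, the input shapes (a list of post ids with one entry per top-6 exposure, restricted to one world and one cohort), and the tie-breaking by post id are assumptions made for the example.

```python
from collections import Counter

def market_shares(exposures):
    """Share of exposures per post.

    exposures: iterable of post ids, one entry per top-6 exposure,
    already restricted to a single world and post-entry cohort (assumed).
    """
    counts = Counter(exposures)
    total = sum(counts.values())
    return {post: n / total for post, n in counts.items()} if total else {}

def residuals(actual, shadow):
    """Per-post residual: actual market share minus shadow-global market share.

    Posts missing from one side are treated as having zero share there.
    """
    posts = set(actual) | set(shadow)
    return {p: actual.get(p, 0.0) - shadow.get(p, 0.0) for p in posts}

def quality_quintiles(quality):
    """Map each post to a bucket 0 (lowest quality) .. 4 (highest quality).

    Buckets have equal size up to rounding; ties are broken by post id,
    which is an arbitrary choice for this sketch.
    """
    ranked = sorted(quality, key=lambda p: (quality[p], p))
    n = len(ranked)
    return {p: (i * 5) // n for i, p in enumerate(ranked)}

# Hypothetical toy logs for one world and one cohort.
actual = market_shares(["a", "a", "b", "c", "a", "b"])
shadow = market_shares(["a", "a", "a", "b", "b", "d"])
res = residuals(actual, shadow)
```

Because both sets of shares sum to one, the residuals within a cohort sum to zero; a post shown only in the actual feed has a positive residual and a post shown only under shadow global has a negative one.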
**Primary outcomes**
1. *Movement from Shadow Global.* How much each design's post-level market shares differ from those under shadow global. It is defined as the sum of the absolute values of the post-level market-share residuals in each cohort.
2. *Inequality (concentration) of post-level market shares in each cohort.* This measures how differently similarly high-quality posts are treated by the ranking rule. Our primary measure of inequality is the Gini coefficient of exposure market shares in each cohort. We will measure this both across all posts, and within post-quality groups, focusing on the top 20% of posts by estimated quality. The inference unit is the post-entry cohort, and we will average across the 3 worlds within a design before comparing designs across cohorts. As a secondary illustration, we will plot this metric for quality groups beyond the top 20%, including more granular bins.
3. *Cross-world unpredictability (arbitrariness) of post-level market shares across worlds in the same design.* This measures how much item-level market shares diverge across the three parallel worlds within a design. The unpredictability of a post i under a design is the Gini Mean Difference (GMD) of its market share across the 3 worlds in that design. We will report the GMD across all posts and across the top 20% of posts by estimated quality. As a secondary illustration, we will plot this metric for quality groups beyond the top 20%, including more granular bins. We will also report the GMD conditional on the mean movement (the residual from shadow global, bucketed into quintiles), to separate unpredictability caused by a design changing average exposures relative to shadow global from unpredictability caused by variance in that change. More precisely, we will primarily test the difference in GMD averaged with uniform weight over movement quintiles. Here, the inference unit is the post-entry cohort; for each design x cohort, we will compute the GMD across the 3 worlds for each post, then average across posts within the cohort, and then compare designs across cohorts.
4. *User facing outcomes.*
(a) *Exposure quality:* The main quality outcome is the average external quality of the visibility allocated by the ranking rule, using position-weighted exposure. This will primarily be defined as the average (unweighted) quality of the top-6 posts sent to users, as in the exposure metric. The inference unit is each post entry cohort: we will compute the quality-allocation metric for each design x cohort, then compare designs across cohorts (properly accounting for how world averaging affects variance of the estimate). We average across worlds within each design.
(b) *User engagement metrics:* positive interactions (likes, replies, reposts, quote posts, see more), sessions, and session depth. A session is defined as a request to the Graze Trending News feed. Session depth is defined as the number of items viewed within a request. Positive interactions and sessions are the primary engagement outcomes, while session depth is a secondary engagement outcome. For positive interactions, our primary measure is "on-feed likes per session", with secondary measures using “on-feed likes per exposure” and the full set of interaction types, weighted in the same way as the ranking rule in the experiment.
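For concreteness, the three market-level metrics above (movement, Gini coefficient, and GMD across worlds) might be computed along the following lines. This is a hedged sketch: the function names are hypothetical, posts absent from a world are assumed to have zero share, and whether zero-exposure posts enter the Gini calculation is a choice the text above does not fix.

```python
from itertools import combinations

def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def gini_mean_difference(values):
    """Mean absolute difference over all distinct pairs of values."""
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def movement(actual, shadow):
    """Sum of absolute post-level residuals between actual and shadow shares."""
    posts = set(actual) | set(shadow)
    return sum(abs(actual.get(p, 0.0) - shadow.get(p, 0.0)) for p in posts)

def design_unpredictability(world_shares):
    """Average per-post GMD across the worlds of one design in one cohort.

    world_shares: list of dicts (one per world) mapping post id -> market share.
    """
    posts = set().union(*world_shares)
    if not posts:
        return 0.0
    per_post = [
        gini_mean_difference([w.get(p, 0.0) for w in world_shares])
        for p in posts
    ]
    return sum(per_post) / len(per_post)

# Hypothetical shares for the three worlds of one design in one cohort.
worlds = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, {"a": 1.0}]
print(gini([0.0, 0.0, 0.0, 1.0]))        # 0.75, the maximum for four posts
print(design_unpredictability(worlds))   # about 0.333
```

With three worlds, the GMD of a post is the average of its three pairwise absolute share differences.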
|
|
Field
Primary Outcomes (Explanation)
|
Before
The experiment is designed to measure both user value and producer-side allocation. The quality and fairness outcomes focus on how the ranking rule allocates visibility across posts and outlets. The unpredictability outcome measures path dependence by asking whether nominally identical worlds diverge because of feedback and early luck. The user-engagement outcome measures whether a ranking rule preserves or improves the value of the feed for users.
Quality is measured using an external proxy constructed from the chronological news feed rather than from the experimental Graze Trending News feed. For post i, the primary quality label is the rate of positive interactions per chronological-feed impression, using the same positive interaction types and impression normalization used in the experimental ranking design, using equal weights across interaction types. The primary quality analyses will use these impression-normalized quality labels rather than raw interaction counts. This quality proxy is intended to be less contaminated by treatment-induced feedback on the Graze Trending News feed than interactions observed directly on the Graze Trending News feed. We will only include posts with at least 10 impressions in the chronological feed to ensure a minimum level of reliability in the quality proxy. We also will smooth these scores additively toward the overall interaction rate on the chronological news feed, similar to the treatment. In particular, we will compute the average rate of interactions per impression r. For each post, we will then add 10 x r and 10 to the interaction and impression counts respectively.
The primary confirmatory hypotheses are as follows. We expect that user engagement will not change significantly between the three designs, but that there will be a tradeoff between good performance on the second endpoint (high user-facing quality), and the remaining three (low within-quality variance, fair outlet exposure, and low cross-world unpredictability). In particular, relative to Design 1, we expect both Designs 2 and 3 to decrease user-facing quality and within-quality variance, increase fairness, and decrease unpredictability. We expect Design 3 to interpolate between Designs 1 and 2: relative to Design 2, we expect Design 3 to increase user-facing quality but also increase within-quality variance, decrease fairness, and increase cross-world unpredictability.
|
After
The experiment is designed to measure both user value and producer-side allocation. The quality and fairness outcomes focus on how the ranking rule allocates visibility across posts and outlets. The unpredictability outcome measures path dependence by asking whether nominally identical worlds diverge because of feedback and early luck. The user-engagement outcome measures whether a ranking rule preserves or improves the value of the feed for users.
Quality is measured using an external proxy constructed from the chronological news feed rather than from the experimental Graze Trending News feed. This proxy is intended to be less contaminated by treatment-induced feedback than interactions observed directly on the Graze Trending News feed. For post i, the primary quality label is the rate of positive interactions per chronological-feed impression, using the same positive interaction types and the same impression normalization as the experimental ranking designs, with equal weights across interaction types. The primary quality analyses will use these impression-normalized quality labels rather than raw interaction counts. We will only include posts with at least 10 impressions in the chronological feed, to ensure a minimum level of reliability in the quality proxy, and we will exclude engagement from users with over 600 on-feed views and interactions per day (a threshold informed by the 99.9th percentile of on-feed engagement). We will also smooth these scores additively toward the overall interaction rate on the chronological news feed, similar to the treatment: we will compute the average interaction score per impression, r, and then add 10 x r to each post's interaction count and 10 to its impression count. We will only include the first feed load per user per 5-minute window, and we will exclude a user-post pair once the user has liked the post.
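The additive smoothing in the paragraph above can be illustrated with a short Python sketch. The function name and input format are hypothetical, and computing r over all posts (including those later dropped by the 10-impression filter) is an assumption; the user-level exclusions and the 5-minute deduplication are assumed to have been applied upstream.

```python
def smoothed_quality(posts, prior_strength=10, min_impressions=10):
    """Smoothed interaction rate per chronological-feed impression.

    posts: dict mapping post id -> (interaction_score, impressions).
    Returns a dict of quality labels for posts meeting the impression floor.
    """
    total_score = sum(score for score, _ in posts.values())
    total_impressions = sum(n for _, n in posts.values())
    if total_impressions == 0:
        return {}
    r = total_score / total_impressions  # overall interaction rate
    return {
        post: (score + prior_strength * r) / (n + prior_strength)
        for post, (score, n) in posts.items()
        if n >= min_impressions
    }

# Hypothetical counts: the overall rate is 10 / 100 = 0.1.
labels = smoothed_quality({"a": (6, 20), "b": (3, 75), "c": (1, 5)})
# "a": (6 + 1) / (20 + 10); "b": (3 + 1) / (75 + 10); "c" is dropped.
```

The smoothing pulls posts with few impressions toward the overall rate r, so a post with 20 impressions moves further from its raw rate than one with 75.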
As a robustness check on the quality metric, we will run the following regression at the post-impression level:
Positive interaction ~ Post_fixed_effect + post_age + position_in_ranking + {current_global_engagement_counts_per_interaction_type}.
Then, we will use the post_fixed_effect, regularized toward the mean, as a measure of quality. This regression attempts to control for social influence effects, in which a post may be more likely to be engaged with if it already has many engagements. We will also report the coefficients on the other components of the regression, as a measure of such social influence and algorithmic ranking effects on interaction rates.
As a secondary robustness check, we will also compute both the primary quality metric and the regression-based metric above using data from the treatment feeds themselves (all worlds and designs pooled). This version measures quality on the same user population that receives the treatment, but it is contaminated by the experimental rankings. We will report the correlation between the primary quality metric and these robustness metrics, and re-evaluate some of our primary hypotheses using them.
|
|
Field
Experimental Design (Public)
|
Before
Users are randomly and persistently assigned to one of nine cells:
- 1a, 1b, 1c
- 2a, 2b, 2c
- 3a, 3b, 3c
The numeric index denotes the ranking design and the letter index denotes the parallel world within design. Assignment is fixed for the duration of the study.
The primary item unit is the post. Some analyses, especially concentration and follow behavior, additionally aggregate posts to the outlet level.
Because posts decay out of the feed after 12 hours, the primary market unit is an entry cohort rather than a fixed calendar window. Posts are grouped by first eligibility time into cohorts of width Delta. The primary Delta unit will be 1 hour, with robustness analyses using 30 minutes, 2 hours, and 4 hours.
|
After
Users are randomly and persistently assigned to one of nine cells:
- 1a, 1b, 1c
- 2a, 2b, 2c
- 3a, 3b, 3c
The numeric index denotes the ranking design and the letter index denotes the parallel world within design. Assignment is fixed for the duration of the study.
The primary item unit is the post. Some analyses, especially concentration and follow behavior, additionally aggregate posts to the outlet level.
Because posts decay out of the feed after 48 hours, the primary market unit is an entry cohort rather than a fixed calendar window. Posts are grouped by first eligibility time into cohorts of width Delta. The primary Delta unit will be 1 hour, with robustness analyses using 30 minutes, 2 hours, and 4 hours.
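As an illustration of the cohort construction described above, posts can be bucketed by flooring their first eligibility time to a multiple of Delta. This is a sketch under assumptions: the function names are hypothetical, and aligning cohort boundaries to the UTC epoch (so that 1-hour cohorts start on the hour) is a choice the registration does not specify.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def cohort_start(first_eligible, width=timedelta(hours=1)):
    """Start time of the entry cohort containing a timezone-aware timestamp."""
    index = (first_eligible - EPOCH) // width  # whole widths since the epoch
    return EPOCH + index * width

def group_into_cohorts(first_eligibility, width=timedelta(hours=1)):
    """Map cohort start time -> list of post ids that entered in that window.

    first_eligibility: dict mapping post id -> first eligibility datetime.
    """
    cohorts = defaultdict(list)
    for post, ts in first_eligibility.items():
        cohorts[cohort_start(ts, width)].append(post)
    return dict(cohorts)

t = datetime(2026, 6, 23, 14, 37, tzinfo=timezone.utc)
print(cohort_start(t))                          # 2026-06-23 14:00:00+00:00
print(cohort_start(t, timedelta(minutes=30)))   # 2026-06-23 14:30:00+00:00
print(cohort_start(t, timedelta(hours=4)))      # 2026-06-23 12:00:00+00:00
```

Passing a different `width` reproduces the 30-minute, 2-hour, and 4-hour robustness variants without changing the rest of the pipeline.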
|
|
Field
Planned Number of Observations
|
Before
All users who used the Graze Trending News feed at least once in the month before the experimental period are included. For user-level outcomes, counting begins on the calendar day of the user's first observed impression on the Graze Trending News feed during the experiment. The exact number of users is not fixed in advance but is expected to be about 9k, and so about 1k per cell.
|
After
For post-level aggregate metrics (such as cohort market share), we consider all posts that had at least one exposure in any of the 9 treatment cells or in the shadow global. On average, a cohort (1-hour window) has about 170 posts. As discussed below, analyses are largely at the cohort level, with appropriate bootstrap sampling. For each post, we consider all of its interactions and exposures in the treatment cell, by any user.
For user-level outcomes such as sessions, all users who used the Graze Trending News feed at least once in the month before the pilot period are included. Counting begins on the calendar day of the user's first observed request on the Graze Trending News feed during the pilot period or experiment. We cannot predict in advance how many users will use the feed during the experiment, but we expect the number of included users to be about 9k, and so about 1k per cell.
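One way the cohort-level bootstrap mentioned above could look is a paired resampling of cohorts when comparing two designs. The sketch below is an assumption-laden illustration rather than the registered procedure: the function name, the percentile interval, and the treatment of cohorts as independent draws are all choices made for the example (with a 48-hour post lifetime, adjacent cohorts overlap in time, so a block bootstrap may be more appropriate).

```python
import random

def cohort_bootstrap_diff(metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-cohort difference (A minus B).

    metric_a, metric_b: per-cohort metric values for two designs, in the
    same cohort order, each already averaged across the design's worlds.
    Returns (point_estimate, lower, upper).
    """
    if len(metric_a) != len(metric_b) or not metric_a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample cohorts with replacement and recompute the mean difference.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper

# Hypothetical per-cohort Gini values for two designs.
point, lower, upper = cohort_bootstrap_diff(
    [0.50, 0.60, 0.55, 0.70], [0.40, 0.45, 0.50, 0.60]
)
```

Resampling whole cohorts keeps each cohort's posts and worlds together, which matches the use of the post-entry cohort as the inference unit.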
|
|
Field
Additional Keyword(s)
|
Before
Social Media, News, Bluesky, Recommendation, Attention Markets
|
After
Online platforms, news feeds, ranking algorithms, social influence, market design, Graze, Bluesky, trending news, feed ranking, producer fairness, inequality and predictability, user engagement
|
|
Field
Intervention (Hidden)
|
Before
The experiment compares three ranking designs.
Design 1 is a ranking rule that weights local interactions (interactions while using the Trending News feed from users in the same world) and global interactions (aggregate interactions from across the ecosystem) to determine post visibility. This is the baseline design that most closely resembles the current Graze Trending News feed ranking rule. This design is a function of likes, replies, reposts, and a time decay.
Design 2 takes the weighted interaction scores, and normalizes by world-specific impression counts (the number of times a post was seen on the Trending News feed). The normalization is smoothed to reduce sensitivity to noisy early interactions.
Design 3 uses the same smoothed interaction-per-impression scoring as Design 2, but increases the smoothing.
Within each design, three parallel worlds are run in parallel. Ranking-state variables used by the experimental algorithm are computed separately within each world, including world-specific interaction counts, impression counts, and any other popularity state used in scoring. Publicly displayed popularity counts remain global across worlds.
|
After
The experiment compares three ranking designs.
Design 1 is a ranking rule that weights local interactions (interactions while using the Trending News feed from users in the same world) and global interactions (aggregate interactions from across the ecosystem) to determine post visibility. This is the design that most closely resembles the current Graze Trending News feed ranking rule. This design is a function of likes, replies, reposts, and a time decay.
Design 2 takes the weighted interaction scores, and normalizes by world-specific impression counts (the number of times a post was seen on the Trending News feed). The normalization is smoothed to reduce sensitivity to noisy early interactions.
Design 3 uses the same smoothed interaction-per-impression scoring as Design 2, but reduces the smoothing, which should make the feed more responsive to local interactions.
We will also analyze a “Shadow Global” Design 0, which no user will see directly but which will serve as a comparison for the other designs in the analysis. It is the same as Design 1, except that it places no weight on local interactions and uses only global interactions.
Within each design, three worlds are run in parallel. Ranking-state variables used by the experimental algorithm are computed separately within each world, including world-specific interaction counts, impression counts, and any other popularity state used in scoring. Publicly displayed popularity counts remain global across worlds.
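To illustrate why the amount of smoothing changes how quickly a feed responds to local interactions, the sketch below ranks two posts by a smoothed interaction-per-impression score. The scoring formula, the parameter `k`, and all numbers are assumptions made for illustration; the actual Design 2 and Design 3 rules also involve interaction weights and a time decay that are omitted here.

```python
def smoothed_rate(score, impressions, prior_rate, k):
    """Interaction rate shrunk toward prior_rate with k pseudo-impressions."""
    return (score + k * prior_rate) / (impressions + k)

def rank(posts, prior_rate, k):
    """Post ids ordered from highest to lowest smoothed rate.

    posts: dict mapping post id -> (world-specific score, impressions).
    """
    return sorted(
        posts,
        key=lambda p: smoothed_rate(*posts[p], prior_rate, k),
        reverse=True,
    )

# Hypothetical world state: one established post, one with little data.
posts = {"established": (30, 300), "newcomer": (2, 5)}
print(rank(posts, prior_rate=0.05, k=5))    # ['newcomer', 'established']
print(rank(posts, prior_rate=0.05, k=100))  # ['established', 'newcomer']
```

With light smoothing (small `k`), two early interactions on five impressions are enough to lift the newcomer above the established post; with heavy smoothing, the same early signal is mostly absorbed by the prior.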
|
|
Field
Secondary Outcomes (End Points)
|
Before
1. Small-high-quality lift: whether high-quality content from smaller outlets receives more visibility.
2. Turnover: how long a post remains in the top 3.
3. Dynamic event-time trajectories for the main market-level outcomes over the 12-hour life of a post.
4. Early-interaction trajectory plots that compare later trajectories for posts with different early interaction signals.
5. Secondary user-experience measures such as "show more / show less" and an explicit survey satisfaction measure.
6. Coverage and exploration outcomes: cohort-level coverage of eligible posts, and user-level exploration outcomes measuring how much visibility goes to outlets not recently seen by the same user.
7. New follows to outlets surfaced in the Graze Trending News feed during the experimental period, including the number of newly followed outlets and the share or count of those new follows going to small outlets.
8. Relative-prominence engagement: for user-post first impressions, whether users are more likely to engage with or follow content when that post is ranked more prominently in their own world than in the other parallel worlds under the same design.
9. Behavioral-convergence classification: whether post-treatment user behavior vectors can predict a user's assigned design, and within design the user's assigned cell, more accurately than chance using out-of-sample regularized multinomial regression with permutation inference.
As in the primary outcomes, market-level secondary outcomes are largely constructed within each design x world x post-entry cohort and use the post-entry cohort as the inferential unit; secondary user-experience and follow outcomes use user-day or user-level analyses, and relative-prominence engagement uses the user-post first-impression as the analysis unit.
|
After
- Direct measurement of rich-get-richer effects. Do early interactions predict future exposure and outcomes, including after controlling for post quality? This will be measured for both the shadow global feed and each of the designs. Within a design, we will further compare post outcomes across worlds as a function of the early interactions in each world.
- The number of local-feed interactions needed in each design for a post's outcomes to differ from those it would have under shadow global. This will illustrate the mechanism by which Design 3 “moves faster” on local information, increasing arbitrariness but also reducing inequality more effectively.
- High-quality global underdog exposure. For posts that are in the highest quality buckets but have 0 shadow global exposure: what is the overall exposure of these posts?
- Regression coefficients for explanatory variables in the regression on post-impression level interaction rates, as measures of social influence and algorithm ranking effects.
- Using the above regression coefficients (alongside uncertainty), we may develop a simulator to evaluate different ranking algorithms given such feedback effects.
- Outlet concentration and arbitrariness. These are the same as the post-level measures, but with exposure first aggregated to the outlet (poster) level. Are smaller outlets, as measured by pre-experiment follow counts, prioritized in some designs?
Other explanatory measures
- Turnover: how long a post remains in the top 6.
- Dynamic event-time trajectories for the main market-level outcomes over the 48-hour life of a post.
- Coverage and exploration outcomes: cohort-level coverage of eligible posts, and user-level exploration outcomes measuring how much visibility goes to outlets not recently seen by the same user.
- New follows to outlets surfaced in the Graze Trending News feed during the experimental period, including the number of newly followed outlets and the share or count of those new follows going to small outlets, where outlet size is defined at experiment start.
Robustness/additional measures on primary outcomes
- Alternative parameters/definitions of primary metrics: e.g., the quality metric; defining exposure using a different N for Top N posts; different bucketing of post quality; for movement, the number of unique posts with changed exposure.
- Repeating the analyses using the view/engagement data reported to Graze by Bluesky instead of the render log to define exposure.
- Secondary user-experience measures such as "show more / show less" and an explicit survey satisfaction measure.
- We will analyze top-4 exposure and turnover in addition to top-6.
As in the primary outcomes, market-level secondary outcomes are largely constructed within each design x world x post-entry cohort and use the post-entry cohort as the inferential unit; secondary user-experience and follow outcomes use user-day or user-level analyses.
|
|
Field
Secondary Outcomes (Explanation)
|
Before
These outcomes are intended to clarify mechanism and user experience. In particular, the trajectory analyses are meant to show whether early interaction differences are amplified into later exposure differences, while coverage, exploration, new follows to surfaced outlets, and relative-prominence engagement speak to novelty, discovery, and within-design reinforcement of what a user's own feed promotes. The behavioral-convergence classification outcome asks whether users in the same design or cell become behaviorally distinguishable from users in other designs or cells based on post-treatment interaction and follow patterns. The primary version predicts design from post-treatment behavior vectors; a secondary version predicts cell within design. Performance will be measured out of sample using multinomial log loss, and label permutations will be conducted within design for the cell-classification test.
|
After
These outcomes are intended to clarify mechanisms and provide robustness analyses.
|