
Fields Changed

Registration

Trial Title
Before: The Impacts of Automated Essay Scoring on Writing Skills and Access to College
After: Can Artificial Intelligence Improve Writing? Experimental Evidence on AWE-Based ed-techs

Abstract
Before: Artificial intelligence has the potential to substitute for and/or complement time-intensive teacher tasks, perfectly or imperfectly. However, evidence on whether and how this impacts learning is still scant. We propose a randomized evaluation of a program that uses automated essay scoring to provide personalized feedback on students' writing skills. Teachers also receive a summary of error patterns, which may help them tailor class content to students' specific needs. The current structure of the program, which integrates automated scoring and human graders, motivated the evaluation design: we will test not only the impact of the algorithm but also the marginal effect of human graders, a feature relevant for scalability. The results will shed light on whether and how this ed-tech can be used to improve learning in a low-quality (public school) post-primary education context.
After: Automated writing evaluation (AWE) systems use artificial intelligence to score and comment on essays. We designed an experiment with two treatment arms in Brazil to study the effects of AWE-based pedagogy programs on writing skills. In both treatments, teachers were encouraged to use an AWE system that provides students with instantaneous performance signals on their essays, and students receive a final grade and formative feedback. In one of the arms, however, this grade is set by human graders, who supervise the feedback and deliver a delayed but arguably richer assessment. The mechanisms we describe range from changes in the amount of training to the reallocation of teachers' time across tasks. The results help address the question of whether and how these technologies can be used to improve writing skills in a post-primary, developing-country education context. More generally, we provide information on the potentials and limitations of artificial intelligence.

JEL Code(s): I21, I25, I28, J22, J45

Last Published
Before: December 30, 2018 10:03 PM
After: August 28, 2019 12:40 PM

Intervention (Public)
Before: There are two intervention arms: (i) providing schools with access to an online platform with an automated essay scoring (AES) algorithm that gives feedback on a broad set of writing features to seniors in Brazilian public secondary schools; (ii) providing schools with the same platform, but with the automatically corrected essays assigned to independently hired professional human graders, who validate the algorithm's corrections/feedback and typically add more personalized comments on the essay's content and structure and on ways to improve it. In both cases, all the information generated by the artificial and human intelligence is available to teachers.
After: In both treatment arms, teachers were encouraged to use an automated writing evaluation (AWE) system that provides students with instantaneous performance signals on their essays. The first treatment arm (standard treatment) uses an algorithm to give students instantaneous feedback on syntactic text features, such as spelling mistakes, together with a noisy signal of achievement: a performance bar with five levels. About three days after submitting their essays on the program's platform, students receive a final grading prepared by human graders hired by the implementer, who correct the essays so as to mimic the real-world exam. This grading includes the final essay grade, comments on the skills valued in the exam, and a general comment on essay quality. In the second treatment arm (alternative treatment), the whole writing task is completed at once and is based only on interactions with the artificial intelligence: after submitting their essays, students receive the instantaneous feedback on text features and the noisy achievement signal (as in the first arm), but they are also presented with the AWE-predicted final grade and with comments selected from the implementer's database, drawn from a list of comments suited to each skill score. In both treatment arms, the essays and the aggregate and individual grading information generated throughout the year (produced by the artificial intelligence supervised by human graders in the standard treatment, and by the artificial intelligence alone in the alternative treatment) are presented to teachers on a personal dashboard.

Primary Outcomes (End Points)
Before: The key outcome variables are: (a) enrollment and participation in the National Secondary Education Exam (“Exame Nacional do Ensino Medio”, ENEM), a non-compulsory standardized test that has been used by a large number of post-secondary institutions for admission purposes; (b) achievement in the argumentative essay of the ENEM (and in its other parts, which are standardized test scores in Mathematics, Language, and Natural Sciences); (c) students' proficiency in language and mathematics, using administrative data from the state's standardized test; (d) writing skills, captured by independently hired graders, who will grade short writing prompts both on form (handwriting, spelling, punctuation, sentence structure/grammar) and on content (vocabulary, organization/overall structure, and ideas); (e) admission and progress in post-secondary education; (f) future labor market outcomes, such as formality and earnings.
After: Achievement in the argumentative essay of the National Secondary Education Exam (“Exame Nacional do Ensino Medio”, ENEM).

Primary Outcomes (Explanation): For the primary outcome, we will combine administrative data from the ENEM exam with an essay that has the same structure as the ENEM essay and will be included in the standardized state exam administered by the state secretary of education.

Experimental Design (Public)
Before: The evaluation design will be based on the random allocation of a sample of public schools with at least 5 computers in Espírito Santo state into one of three conditions for the 2019 academic year: (i) control; (ii) platform with the AES algorithm; and (iii) platform with the AES algorithm and human essay graders. More details on the interventions can be found above.
After: The evaluation design will be based on the random allocation of a sample of 178 public schools in Espírito Santo state into one of three conditions for the 2019 academic year: (i) control (68 schools); (ii) standard treatment (55 schools); and (iii) alternative treatment (55 schools). More details on the interventions can be found above.

Planned Number of Clusters
Before: 150 schools
After: 178 schools

Planned Number of Observations
Before: 15,000 students
After: Approximately 20,000 students

Sample size (or number of clusters) by treatment arms
Before: 50 control schools, 50 schools assigned to the treatment giving access to the platform with the AES algorithm, and 50 schools assigned to the additional treatment involving human graders
After: Control: 68 schools; standard treatment: 55 schools; alternative treatment: 55 schools

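For concreteness, below is a minimal sketch of how a school-level draw with these arm sizes could be implemented. The school identifiers, the fixed seed, and the unstratified design are placeholder assumptions; the registration does not describe the actual assignment procedure.

```python
# A minimal sketch of a school-level random assignment with the registered
# arm sizes. School IDs, the seed, and the absence of stratification are
# placeholders; the actual procedure is not described in the registration.
import random

def assign_schools(school_ids, n_control=68, n_standard=55, n_alternative=55, seed=2019):
    """Shuffle the school list and split it into the three experimental arms."""
    assert len(school_ids) == n_control + n_standard + n_alternative
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    shuffled = rng.sample(school_ids, k=len(school_ids))
    return {
        "control": shuffled[:n_control],
        "standard": shuffled[n_control:n_control + n_standard],
        "alternative": shuffled[n_control + n_standard:],
    }

arms = assign_schools([f"school_{i:03d}" for i in range(178)])
print({arm: len(ids) for arm, ids in arms.items()})
# -> {'control': 68, 'standard': 55, 'alternative': 55}
```
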
Power calculation: Minimum Detectable Effect Size for Main Outcomes
Before: Based on simulations using administrative data from the state's 2017 standardized language test scores, for a significance level of 0.05 and power of 0.8, we expect a minimum detectable effect (MDE) of around 0.1 standard deviations for the effect of each treatment, once we control for individual-level pre-intervention test scores. The same simulations yield an MDE of 0.1 standard deviations for the comparison between the two treatments.
After: Based on simulations using administrative data from ENEM 2018 essay scores, for a significance level of 0.05 and power of 0.8, we expect a minimum detectable effect (MDE) of around 0.1 standard deviations for the effect of each treatment. The same simulations yield an MDE of 0.1 standard deviations for the comparison between the two treatments. Since we will use not only the ENEM data but also the essay administered in the state exam, we see these numbers as (loose) upper bounds on the actual MDEs.

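The registration attributes the MDE figures to simulations on administrative data. As an illustration only, the sketch below shows one common way such a simulation-based power check can be set up for a cluster-randomized comparison of two arms. The intra-cluster correlation, cluster size, and replication count are assumed values, and the registered calculations also exploit baseline data, so the printed numbers will not reproduce the 0.1 SD figure.

```python
# Illustrative simulation-based power check for a two-arm, school-level
# cluster-randomized comparison. The ICC, cluster size, and replication
# count are assumptions for this sketch, not the registered values.
import numpy as np
from scipy import stats

def simulated_power(effect_sd, n_clusters_a=68, n_clusters_b=55,
                    cluster_size=112, icc=0.15, reps=1000,
                    alpha=0.05, seed=0):
    """Share of replications in which a t-test on school means rejects H0."""
    rng = np.random.default_rng(seed)

    def school_means(n_clusters, shift):
        # Outcomes in SD units: school random effect plus student-level noise.
        school = rng.normal(0.0, np.sqrt(icc), size=(n_clusters, 1))
        pupils = rng.normal(0.0, np.sqrt(1.0 - icc),
                            size=(n_clusters, cluster_size))
        return (shift + school + pupils).mean(axis=1)

    rejections = 0
    for _ in range(reps):
        _, p_value = stats.ttest_ind(school_means(n_clusters_a, 0.0),
                                     school_means(n_clusters_b, effect_sd))
        rejections += p_value < alpha
    return rejections / reps

# The MDE is roughly the smallest effect whose simulated power reaches 0.8:
for effect in (0.05, 0.10, 0.15, 0.20):
    print(f"effect = {effect:.2f} SD -> power ~ {simulated_power(effect):.2f}")
```

In practice one would replace the t-test on school means with the registered regression specification and feed in the actual baseline data rather than synthetic draws.
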
Additional Keyword(s)
Before: writing skills, ed-techs, automated essay scoring
After: writing skills, ed-techs, automated essay scoring, technological change, routine tasks, nonroutine tasks

Secondary Outcomes (End Points): Mechanisms: number of essays written in preparation for the real ENEM essay; amount, speed, and quality of feedback; students' aspirations to enter a post-secondary education institution; teachers' time allocation. Secondary outcomes: general writing skills; achievement in language (non-writing) subjects; achievement in non-language subjects. Outcomes for follow-up papers: enrollment and achievement in post-secondary education; labor market outcomes.

PI as first author
Before: No
After: Yes