Validation and enumerator effects in phone-based assessments of learning

Last registered on September 24, 2021

Pre-Trial

Trial Information

General Information

Title
Validation and enumerator effects in phone-based assessments of learning
RCT ID
AEARCTR-0006913
Initial registration date
December 17, 2020

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
December 18, 2020, 12:19 PM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Last updated
September 24, 2021, 11:58 AM EDT

Last updated is the most recent time when changes to the trial's registration were published.

Locations

Region

Primary Investigator

Affiliation
University of Virginia

Other Primary Investigator(s)

PI Affiliation
University of Virginia

Additional Trial Information

Status
Completed
Start date
2020-12-07
End date
2020-12-23
Secondary IDs
Prior work
This trial is based on or builds upon one or more prior RCTs.
Abstract
The school closures induced by the COVID-19 outbreak have placed heightened emphasis on alternative ways to measure and track student learning besides in-person assessments. A potential option for assessing students remotely is phone-based assessment, in which an assessor calls students and asks them to solve questions over the phone. However, to the best of our knowledge, there has not yet been a formal validation of these learning assessments, in which the scores obtained over the phone are correlated with the same students’ classroom scores or other measures of achievement. It is also unclear whether these correlations hold up across sub-groups of interest, such as by gender, baseline performance, grade, or degree of rurality of each school (which may be particularly important for access to technology). Furthermore, important survey features, such as whether there are strong enumerator effects, are also unknown for this type of assessment. We leverage the full random assignment of assessors to primary school children in Kenya to validate this type of assessment and to understand the extent to which enumerator effects explain part of the variance in the recorded outcomes.
External Link(s)

Registration Citation

Citation
Rodriguez Segura, Daniel and Beth Schueler. 2021. "Validation and enumerator effects in phone-based assessments of learning." AEA RCT Registry. September 24. https://doi.org/10.1257/rct.6913-1.1
Experimental Details

Interventions

Intervention(s)
The school closures induced by the COVID-19 outbreak have placed heightened emphasis on alternative ways to measure and track student learning besides in-person assessments. Even beyond school closures, situations like humanitarian and natural disasters, or students simply living in physically remote locations, might hinder the proper assessment of their learning profile. A potential option is phone-based assessments, where an assessor calls students and asks them to solve some questions remotely. Work such as Angrist et al. (2020a) has already used these assessments as outcomes, and their work has also led to the identification of practical recommendations for assessing children over the phone (Angrist et al., 2020b). However, to the best of our knowledge, there has not yet been a formal validation of these learning assessments, where the scores obtained over the phone are correlated with the same students’ classroom scores or other measures of achievement. Furthermore, it is still unclear whether these correlations hold up across sub-groups of interest, such as by gender, baseline performance, grade, or different degrees of rurality of each school (which may be particularly important for access to technology).

Beyond the validation of these assessments, other important features of phone-based assessments are yet to be studied (Lupu and Michelitch, 2018). Social scientists have long discussed and identified “enumerator effects” in in-person surveys (Bischoping and Schuman, 1992; Lupu and Michelitch, 2018; Schaeffer et al., 2010; West and Blom, 2017), where observable characteristics of enumerators drive differential response rates and scores for seemingly similar populations in developing countries (Adida et al., 2016; Benstead, 2014a; Benstead, 2014b; Blaydes and Gillum, 2013; Blom et al., 2007; Durrant et al., 2010; Flores-Macías and Lawson, 2008; Kane and Macaulay, 1993; Liu and Stainback, 2013; Olson, 2007). For instance, Di Maio and Fiala (2018) find that in Uganda most observable characteristics yield minimal enumerator effects, except when enumerators ask highly sensitive political preference questions, where enumerator effects account for over 30 percent of the variation in responses. To cleanly identify enumerator effects and avoid confounding enumerator and respondent characteristics, researchers would ideally fully randomize the assignment of assessors to assessees, creating what West and Blom (2017) call “fully interpenetrated designs”. In spite of the large body of work suggesting the presence of enumerator effects in in-person assessments, the physical logistics of fully interpenetrated designs can be challenging, and the few studies that have implemented them were conducted only in the United States and with small samples (Di Maio and Fiala, 2018). Typically, the logistical issues have been dealt with by assigning assessors to areas small enough for them to move within feasibly, while still capturing as much variability in assessee-assessor assignments as possible (Lupu and Michelitch, 2018). For example, in the Uganda study, the most disaggregated unit to which assessors could feasibly be assigned was the village.

Phone-based assessments lend themselves to a more rigorous documentation of enumerator effects, in general and for learning assessments more specifically, as the enumerators are centrally located and can be randomly allocated across the full sample. One could hypothesize that in such a personal level of assessment between assessor and assessee, especially one with a degree of power dynamics between students and teachers, the level of comfort in the relationship could indeed lead to differential response rates.

We leverage the data collection process from a phone-based assessment in Kenya to add to the literature described above. As part of the outcomes measured in another RCT evaluation, students are given a short, phone-based assessment consisting of math questions, a student survey question, and a few parent survey questions. Assessors are teachers from within the educational system where the RCT is conducted. Students in 3rd, 5th, and 6th grade across all 105 schools in the RCT sample were randomly selected to receive a phone-based assessment. Students selected to receive a phone-based assessment were then fully randomized to an enumerator, as well as to the order in which they are called. Therefore, the match between assessor and assessee, the day each student is called, and the order in which students are reached are all randomly assigned. Strong protocols are in place to ensure that this order is preserved.
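As a rough illustration of this fully interpenetrated design, the sketch below generates a random student-assessor match and a random call order. It is only a sketch: the study's actual randomization was implemented in Stata, and all names here (student_id, assessor_id, call_order) are hypothetical placeholders.

```python
# Illustrative sketch only; the study's randomization was done in Stata and
# these column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)

students = pd.DataFrame({"student_id": range(6000)})   # ~6,000 sampled students
assessors = [f"assessor_{k:02d}" for k in range(20)]   # ~20 assessors

# Fully interpenetrated design: any student can be matched to any assessor,
# with caseloads of roughly 300 students each.
assignment = np.tile(assessors, int(np.ceil(len(students) / len(assessors))))[: len(students)]
students["assessor_id"] = rng.permutation(assignment)

# Within each assessor's caseload, randomize the order in which students are called.
students["call_order"] = (
    students.groupby("assessor_id")["student_id"]
    .transform(lambda s: rng.permutation(len(s)) + 1)
)
```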
In particular, we will explore the following research questions:

1. Are phone-based assessments valid measures of learning?
2. To what extent are there differential response rates by enumerators?
3. Does the match on observable characteristics of assessors and assessees (e.g. same gender) drive differential response rates and scores?
4. Do assessor experience and teaching skill change the average and variability of scores?

References
Adida, C.L., Ferree, K.E., Posner, D.N., Robinson, A.L. (2016). Who's asking? Interviewer coethnicity effects in African survey data. Comparative Political Studies. 49: 1630–60

Angrist, N., Bergman, P., Matsheng, M. (2020a). School’s Out: Experimental Evidence on Limiting Learning Loss Using 'Low-Tech' in a Pandemic. Working Paper. https://papers.ssrn.com/sol3/Papers.cfm?abstract_id=3735967

Angrist, N., Bergman, P., Evans, D. K., Hares, S., Jukes, M. C. H., & Letsomo, T. (2020b). Practical lessons for phone-based assessments of learning. BMJ Global Health, 5(7), e003030. https://doi.org/10.1136/bmjgh-2020-003030

Benstead, L.J. (2014a). Does interviewer religious dress affect survey responses? Evidence from Morocco. Politics and Religion 7: 734–60

Benstead, L.J. (2014b). Effects of interviewer–respondent gender interaction on attitudes toward women and politics: findings from Morocco. International Journal of Public Opinion Research. 26: 369–83

Bischoping, K., Schuman, H. (1992). Pens and polls in Nicaragua: an analysis of the 1990 preelection surveys. American Journal of Political Science. 36: 331–50

Blaydes, L., Gillum, R.M. (2013). Religiosity-of-interviewer effects: assessing the impact of veiled enumerators on survey response in Egypt. Politics and Religion 6: 459–82

Blom, M., Hox, J., Koch, A. (2007). The influence of interviewers’ contact behavior on the contact and cooperation rate in face-to-face household surveys. International Journal of Public Opinion Research. 19: 97–111

Di Maio, M., Fiala, N. (2018). Be Wary of Those Who Ask: A Randomized Experiment on the Size and Determinants of the Enumerator Effect. Policy Research Working Paper No. 8671. World Bank, Washington, DC. https://openknowledge.worldbank.org/handle/10986/30993 License: CC BY 3.0 IGO.

Durrant, G.B., Groves, R.M., Staetsky, L., Steele, F. (2010). Effects of interviewer attitudes and behaviors on refusal in household surveys. Public Opinion Quarterly. 74: 1–36

Flores-Macías, F., Lawson, C. (2008). Effects of interviewer gender on survey responses: findings from a household survey in Mexico. International Journal of Public Opinion Research. 20: 100–10

Kane, E. W., and Macaulay, L. J. (1993). Interviewer gender and gender attitudes. Public Opinion Quarterly. 57:1–28

Liu, M., Stainback, K. (2013). Interviewer gender effects on survey responses to marriage-related questions. Public Opinion Quarterly. 77: 606–18

Lupu, N., & Michelitch, K. (2018). Advances in Survey Methods for the Developing World. Annual Review of Political Science, 21(1), 195–214. https://doi.org/10.1146/annurev-polisci-052115-021432

Olson, K. P. A. (2007). Effect of interviewer experience on interview pace and interviewer attitudes. Public Opinion Quarterly. 71: 273–86

Schaeffer, N.C., Dykema J., Maynard, D.W. (2010). Interviewers and interviewing. In Handbook of Survey Research, ed. PV Marsden, JD Wright, pp. 437–70. Bingley, UK: Emerald Group. 2nd ed.

West, B.T., Blom, A.G. (2017). Explaining interviewer effects: a research synthesis. Journal of Survey Statistics and Methodology. 5: 175–211
Intervention Start Date
2020-12-07
Intervention End Date
2020-12-23

Primary Outcomes

Primary Outcomes (end points)
The outcomes to be used here are the total math score in this phone-based assessment, the individual scores in the core numeracy and curriculum-aligned sections, a survey question for the pupils, survey questions for the parents about education at home, and a survey question for parents about COVID-related shocks. The baseline assessment data that we will use to validate the phone-based measure consist of standardized, in-school test score data from February 2020, as well as July and October 2019. Given the data infrastructure of our implementing partner, the in-person assessments are standardized across all schools, ensuring the comparability of baseline scores across students.
Primary Outcomes (explanation)
In terms of covariates, we will know for each assessor the grade and school where they teach, their age, their gender, their average attendance rate, and their lesson completion rates. For each pupil, we will know their baseline scores, their grade and school, gender, average attendance rate, age, and years attending schools within our partner’s system.

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
1. Are phone-based assessments valid measures of learning?
For the full sample, we will correlate the baseline in-person test scores with the overall phone-based assessment score. We will further do this for the core numeracy and curriculum-aligned sections separately. We will then repeat this exercise for student sub-groups of interest, such as by gender, baseline performance, grade, and degree of rurality of each school. We will also examine correlations with SMS text-based quiz results, although it is currently unclear whether the sample of respondents will be large enough to make this analysis feasible. Eventually, we will also examine correlations between the phone-based assessment results and a post-intervention, post-school-reopening in-person assessment. Furthermore, for the purpose of validating the assessments, we will examine whether results vary with other assessor characteristics, such as teacher value-added and overall experience. Finally, we will examine whether the correlation between phone-based assessments and in-class assessments weakens for students who experienced a COVID-related shock.
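A minimal sketch of the validation correlations is below. The DataFrame `df` and its columns (phone_score, baseline_score, gender, grade) are assumed placeholder names, not the project's actual variables.

```python
# Sketch only; `df` and its columns are hypothetical placeholders.
import pandas as pd

def validation_corr(df, by=None):
    """Pearson correlation between the phone-based score and the in-person baseline
    score, for the full sample or within each level of a grouping variable."""
    if by is None:
        return df["phone_score"].corr(df["baseline_score"])
    return df.groupby(by).apply(lambda g: g["phone_score"].corr(g["baseline_score"]))

# Example calls (once `df` is loaded):
# validation_corr(df)            # full sample
# validation_corr(df, "gender")  # separately by gender
# validation_corr(df, "grade")   # grades 3, 5, and 6
```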

2. To what extent are there differential response rates by enumerators?
For each of the outcomes Y mentioned above, for student i and teacher/assessor j, we will explore whether any specific teacher characteristic predicts the outcome. The null hypothesis is that all teacher-level characteristics are orthogonal to the outcomes.
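One way this test could be implemented is sketched below: regress the outcome on assessor characteristics (plus a student-level control), cluster standard errors by assessor, and jointly test the assessor coefficients against zero. All variable names (score, assessor_age, assessor_female, assessor_attendance, baseline_score, assessor_id) are assumptions for illustration; the exact specification may differ.

```python
# Sketch only; variable names are hypothetical and the specification may differ.
import statsmodels.formula.api as smf

def enumerator_effects_test(df):
    """OLS of the outcome on assessor characteristics with assessor-clustered SEs,
    plus a joint F-test that all assessor-level coefficients are zero."""
    model = smf.ols(
        "score ~ assessor_age + assessor_female + assessor_attendance + baseline_score",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["assessor_id"]})
    joint_test = model.f_test(
        "assessor_age = 0, assessor_female = 0, assessor_attendance = 0"
    )
    return model, joint_test
```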

3. Does the match on observables between assessors and assessees (e.g. same gender) drive differential response rates and scores?
For each student, we will create binary “match” variables. For instance, a female student assessed by a female assessor would have a 1 in the gender match binary variable. We will create these binary variables by gender, grade (i.e. the grade that the student is in, and the grade the assessor teaches), school, whether the assessor is specifically the student’s teacher, county and province of origin.
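A sketch of how these indicators could be constructed follows, using hypothetical column names for the student and their randomly assigned assessor.

```python
# Sketch only; all column names are hypothetical placeholders.
import pandas as pd

def add_match_indicators(df):
    """Add 0/1 indicators for whether a student and their assessor share a characteristic."""
    out = df.copy()
    out["gender_match"] = (out["student_gender"] == out["assessor_gender"]).astype(int)
    out["grade_match"] = (out["student_grade"] == out["assessor_grade_taught"]).astype(int)
    out["school_match"] = (out["student_school"] == out["assessor_school"]).astype(int)
    out["county_match"] = (out["student_county"] == out["assessor_county"]).astype(int)
    out["own_teacher"] = (out["student_teacher_id"] == out["assessor_id"]).astype(int)
    return out
```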

4. Do assessor characteristics change the average and variability of scores?
The order in which students are assessed is also fully random, and the assessment period is expected to last about 15 days. Therefore, in expectation, students assessed on day 1 should be similar to those assessed on day 15. To test this hypothesis, we will explore the extent to which the day when a student was assessed predicts their outcomes. Furthermore, we will estimate the daily variability by assessor*grade, and run a similar model where we check whether the variance in scores by assessor changes with increased testing experience.
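The sketch below illustrates both checks: a regression of scores on the randomly assigned assessment day, and a cell-level regression of score dispersion on the day as a rough proxy for accumulated testing experience. Variable names (score, assessment_day, assessor_id, grade) are hypothetical.

```python
# Sketch only; variable names are hypothetical placeholders.
import statsmodels.formula.api as smf

def day_effect(df):
    """Does the randomly assigned day of assessment predict the score?"""
    return smf.ols("score ~ assessment_day", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["assessor_id"]}
    )

def dispersion_by_experience(df):
    """Within each assessor-by-grade cell, compute the daily standard deviation of
    scores and check whether dispersion changes as assessors gain experience."""
    cell_sd = (
        df.groupby(["assessor_id", "grade", "assessment_day"])["score"]
        .std()
        .reset_index(name="score_sd")
    )
    return smf.ols("score_sd ~ assessment_day", data=cell_sd).fit()
```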
Experimental Design Details
Randomization Method
Randomization using Stata
Randomization Unit
Individual students randomly assigned to individual assessors
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
105 schools across Kenya, 3 grades per school.
Sample size: planned number of observations
~6,000 students with ~20 assessors
Sample size (or number of clusters) by treatment arms
~300 students per assessor
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
University of Virginia
IRB Approval Date
2020-06-08
IRB Approval Number
Protocol Number: 3751

Post-Trial

Post Trial Information

Study Withdrawal


Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials