Thwaites, Peter
[UCL]
Kollias, Charalambos
[Polytomous Ltd.]
Paquot, Magali
[UCL]
Comparative Judgement (CJ) is a method of assessment in which judges (who may be experts, peers or even novices) decide which of two pieces of student work is “better”. These comparisons can be used to rank the pieces of work automatically, and from this ranking a grade can be derived. Research from numerous fields has demonstrated that CJ generates reliable, valid, and efficient evaluations of learner outputs (Jones et al., 2019; Wheadon et al., 2020). This presentation reports on findings from a project which explores the potential for a community-driven, crowdsourced form of CJ to contribute to the assessment of L2 texts included in widely used learner corpora, which unfortunately often lack reliable proficiency information. A recently published preliminary study (Author, 2022) provided initial support for this approach. However, by focusing on relatively short writing samples (median = 272 words), using a single essay prompt, and covering the full spectrum of L2 proficiency, that study provided optimal conditions for such results. A next step is to find out whether CJ remains valid and reliable under less optimal conditions. We report on two studies which investigated this question. In each study, participants were crowdsourced from the linguistics community. Each participant completed 5–10 comparative judgements of a set of argumentative essays written by learners of L2 English. All texts were taken from the International Corpus of Learner English and were both longer (median length = 548 words) than the texts in earlier studies and drawn from a narrower proficiency range (roughly B1–C2). In the first study, the essays were written in response to a single prompt; in the second, they comprised responses to five different topics. Initial results from the first study reveal slightly lower reliability levels than those reported in the preliminary study (SSR around 0.8). This suggests that the increased length and homogeneity of these texts made the task more difficult for judges, in line with previous research on the influence of task complexity in CJ (van Daal et al., 2019). Nevertheless, the reliability levels remain sufficient to “reflect good dependability of scores” (Hoyt, 2018). Study 2 will be completed in late July. In addition to being assessed using CJ, all of the essays in the two studies were triple-graded by expert graders using a CEFR writing rubric. These judgements will be used to provide a measure of the concurrent validity of the CJ data (i.e. the extent to which they correspond with a better-established grading method), as well as a comparison of the reliability and efficiency of the two grading methods. The results contribute both to learner corpus research, where CJ can potentially enrich learner corpora with accurate measures of text proficiency, and to the more general goal of exploring reliable, valid, and efficient alternatives to rubric-based L2 writing assessment.
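As background to the method summarised above, the sketch below illustrates one common way of turning pairwise judgements into a ranking: fitting a Bradley–Terry model by simple maximum-likelihood iteration. The abstract does not state which scaling model or software the project used, so the function, text identifiers, and judgement data here are illustrative assumptions only, not the authors' pipeline.

    # Minimal sketch (assumption, not the project's actual analysis): converting
    # pairwise comparative judgements into a ranking with a Bradley-Terry model.
    # Assumes every text wins at least one comparison, so the MLE exists.

    import math
    from collections import defaultdict

    def bradley_terry(comparisons, n_iter=200, tol=1e-8):
        """Estimate a quality score for each text from (winner, loser) pairs.

        comparisons: list of (winner_id, loser_id) tuples, one per judgement.
        Returns a dict mapping text id -> strength on a log scale.
        """
        texts = {t for pair in comparisons for t in pair}
        wins = defaultdict(int)            # comparisons won by each text
        pair_counts = defaultdict(int)     # times each unordered pair was compared
        for w, l in comparisons:
            wins[w] += 1
            pair_counts[frozenset((w, l))] += 1

        # Zermelo / minorisation-maximisation updates for Bradley-Terry strengths.
        p = {t: 1.0 for t in texts}
        for _ in range(n_iter):
            new_p = {}
            for t in texts:
                denom = 0.0
                for pair, n in pair_counts.items():
                    if t in pair:
                        other = next(o for o in pair if o != t)
                        denom += n / (p[t] + p[other])
                new_p[t] = wins[t] / denom if denom > 0 else p[t]
            # Normalise so the geometric mean is 1 (fixes the arbitrary scale).
            g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
            new_p = {t: v / g for t, v in new_p.items()}
            if max(abs(new_p[t] - p[t]) for t in texts) < tol:
                p = new_p
                break
            p = new_p
        return {t: math.log(v) for t, v in p.items()}   # log-strengths

    # Hypothetical judgements: each tuple is (preferred text, other text).
    judgements = [("essay_A", "essay_B"), ("essay_A", "essay_C"),
                  ("essay_B", "essay_C"), ("essay_A", "essay_B"),
                  ("essay_C", "essay_A"), ("essay_B", "essay_C")]
    scores = bradley_terry(judgements)
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)   # ['essay_A', 'essay_B', 'essay_C'] for these toy data

In practice, the resulting log-strengths can be mapped onto grades or proficiency bands; reliability indices such as the SSR reported above are then computed over these estimated scores.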
Bibliographic reference: Thwaites, Peter; Kollias, Charalambos; Paquot, Magali. Crowdsourced comparative judgement for L2 writing assessment: is high reliability still possible when texts are long, homogeneous in proficiency, and topically diverse? BKL-CBL Linguists Day 2023 (Antwerp, Belgium, 13/10/2023).
Permanent URL: http://hdl.handle.net/2078.1/279771