Building reliable and generalizable clerkship competency assessments: Impact of 'hawk-dove' correction

Med Teach. 2021 Dec;43(12):1374-1380. doi: 10.1080/0142159X.2021.1948519. Epub 2021 Sep 17.

Abstract

Purpose: Systematic differences among raters' approaches to student assessment may result in unduly lenient or stringent assessment scores. This study examines the generalizability of medical student workplace-based competency assessments, including the impact of adjusting scores for rater leniency and stringency.

Methods: Data were collected from summative clerkship assessments completed for 204 students during the 2017-2018 clerkship year at a single institution. Generalizability theory was used to explore the variance attributed to different facets (rater, learner, item, and competency domain) through three unbalanced random-effects models per clerkship, including models that applied assessor stringency-leniency adjustments.
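For orientation, the facet structure described here corresponds to a standard G-study variance decomposition. The display below is a generic person-by-rater-by-item formulation (notation introduced here for illustration; it is not necessarily the exact unbalanced models fitted in the study):

    \[
    X_{pri} = \mu + \nu_p + \nu_r + \nu_i + \nu_{pr} + \nu_{pi} + \nu_{ri} + \varepsilon_{pri},
    \qquad
    \sigma^2_X = \sigma^2_p + \sigma^2_r + \sigma^2_i + \sigma^2_{pr} + \sigma^2_{pi} + \sigma^2_{ri} + \sigma^2_{\varepsilon},
    \]

where p indexes students, r raters, and i items. The percentages reported in the Results correspond to the student component \sigma^2_p expressed as a share of the total observed variance.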

Results: In the original assessments, only 4-8% of the variance was attributable to the student, with the remainder reflecting rater variance and error. Aggregating items into a composite score increased the variance attributable to the student (5-13%). Applying a stringency-leniency ('hawk-dove') correction substantially increased both the variance attributed to the student (14.8-17.8%) and the reliability. Controlling for assessor leniency/stringency reduced measurement error, decreasing the number of assessments required for adequate generalizability from 16-50 to 11-14.
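To make the reliability claim concrete, the sketch below illustrates (i) a simple mean-centring version of a hawk-dove correction and (ii) a D-study style calculation of how many ratings per student are needed to reach a target generalizability coefficient. All data, variance values, and the 0.70 target are hypothetical, and mean-centring is only one plausible implementation; the study's own correction, variance components, and thresholds may differ.

    import math
    import numpy as np

    # Hypothetical long-format ratings: each student is scored by a few raters,
    # and raters differ systematically in stringency ('hawks') or leniency ('doves').
    rng = np.random.default_rng(0)
    n_students, n_raters = 30, 8
    rater_effect = rng.normal(0.0, 0.6, n_raters)
    records = []
    for s in range(n_students):
        true_ability = rng.normal(0.0, 0.4)
        for r in rng.choice(n_raters, size=4, replace=False):
            score = 3.0 + true_ability + rater_effect[r] + rng.normal(0.0, 0.5)
            records.append((s, r, score))
    students, raters, scores = (np.array(col) for col in zip(*records))

    # Hawk-dove correction, sketched as mean-centring: shift each rater's scores
    # so that the rater's average matches the grand mean.
    grand_mean = scores.mean()
    adjusted = scores.copy()
    for r in np.unique(raters):
        mask = raters == r
        adjusted[mask] -= scores[mask].mean() - grand_mean

    def ratings_needed(var_person, var_error, target=0.70):
        """Smallest n with var_person / (var_person + var_error / n) >= target."""
        return math.ceil((target / (1.0 - target)) * var_error / var_person)

    # Illustrative numbers only (not the study's variance components): shifting
    # variance from error to the student sharply reduces the ratings required.
    print(ratings_needed(0.08, 0.92))   # 27 ratings when student variance is 8%
    print(ratings_needed(0.16, 0.84))   # 13 ratings after error is reduced

The point of the sketch is the direction of the effect: when less of the observed variance is rater noise, each rating carries more information about the student, so fewer assessments are needed to reach the same generalizability coefficient.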

Conclusions: Consistent with prior research, most of the variance in competency assessment scores was attributable to raters, with only a small proportion attributable to the student. Applying stringency-leniency corrections to produce rater-adjusted scores improved the psychometric characteristics of the assessment scores.

Keywords: Clinical; assessment; medicine; psychometrics; undergraduate.

MeSH terms

  • Clinical Competence
  • Educational Measurement*
  • Humans
  • Reproducibility of Results
  • Students, Medical*