Introduction: No standardized evaluation tool for fellowship applicant assessment exists. Assessment tools are subject to biases and scoring tendencies that can skew scores and affect rankings. We aimed to develop and evaluate an objective assessment tool for fellowship applicants.
Methods: We detected rater effects in our numerically scaled assessment tool (NST), which consisted of 10 domains rated from 0 to 9. We evaluated each domain, consolidated redundant categories, and removed subjective categories. For the 7 remaining domains, we described each quality and developed a question with a behaviorally anchored rating scale (BARS). Applicants were rated by 6 attendings. Ratings from the 2018 NST were compared with those from the 2020 BARS for distribution of data, skewness, and inter-rater reliability.
Results: Thirty-four applicants were evaluated with the NST and 38 with the BARS. Demographics were similar between groups. The median score on the NST was 8 out of 9; scores <5 appeared in less than 1% of all evaluations. The distribution of data improved with the BARS tool. In the NST, scores from 6 of 10 domains demonstrated moderate skewness and 3 demonstrated high skewness. Three of the 7 domains in the BARS showed moderate skewness and none showed high skewness. Two of 10 domains in the NST versus 5 of 7 domains in the BARS achieved good inter-rater reliability.
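The skewness and reliability comparisons above can be illustrated with a minimal sketch, not drawn from the study itself (which does not name its software or the exact reliability statistic): assuming ratings for one domain are stored as an applicants-by-raters matrix, skewness can be checked with scipy.stats.skew, and a two-way random-effects ICC(2,1), one common choice of inter-rater reliability index, can be computed from the ANOVA mean squares. The rating values below are hypothetical.

import numpy as np
from scipy.stats import skew

# Hypothetical ratings for a single domain: rows = applicants,
# columns = the 6 attending raters.
ratings = np.array([
    [8, 9, 8, 7, 9, 8],
    [6, 7, 7, 6, 8, 7],
    [9, 9, 8, 9, 9, 9],
    [5, 6, 6, 5, 7, 6],
])

# Skewness of the pooled domain scores; common rules of thumb treat
# |skew| between 0.5 and 1 as moderate and |skew| > 1 as high.
domain_skew = skew(ratings.ravel())
print(f"domain skewness: {domain_skew:.2f}")

# Two-way random-effects ICC (absolute agreement, single rater),
# one common inter-rater reliability index; values >= 0.75 are often
# described as "good".
n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)   # per-applicant means
col_means = ratings.mean(axis=0)   # per-rater means

ss_rows = k * ((row_means - grand) ** 2).sum()
ss_cols = n * ((col_means - grand) ** 2).sum()
ss_total = ((ratings - grand) ** 2).sum()
ss_err = ss_total - ss_rows - ss_cols

ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc_2_1 = (ms_rows - ms_err) / (
    ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
)
print(f"ICC(2,1): {icc_2_1:.2f}")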
Conclusion: Replacing a standard numeric scale with a BARS normalized the distribution of data, reduced skewness, and enhanced inter-rater reliability in our evaluation tool. This provides some validity evidence for improved applicant assessment and ranking.
Keywords: applicants; assessment; raters.
Copyright © 2021 Academic Pediatric Association. Published by Elsevier Inc. All rights reserved.