Performance evaluation of automated scoring for the descriptive similarity response task

Sci Rep. 2024 Mar 14;14(1):6228. doi: 10.1038/s41598-024-56743-6.

Abstract

We examined whether a machine-learning-based automated scoring system can reproduce human scoring of a similarity task. We trained a Bidirectional Encoder Representations from Transformers (BERT) model on data from the semantic similarity test (SST), which presents participants with a word pair and asks them to describe in writing how the two concepts are similar. In Experiment 1, using fivefold cross-validation, we showed that the model trained on the combination of responses (N = 1600) and classification criteria (the scoring rubric of the SST; N = 616) assigned the correct labels with 83% accuracy. In Experiment 2, using test data collected from different participants at a different time from Experiment 1, we showed that both the model trained on the responses alone and the model trained on the combination of responses and classification criteria assigned the correct labels with 80% accuracy. In addition, human-model scoring showed an inter-rater reliability of 0.63, close to that of human-human scoring (0.67 to 0.72). These results suggest that the machine learning model can reach human-level performance in scoring the Japanese version of the SST.
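
The evaluation described above rests on three quantities: fivefold cross-validation splits, label accuracy, and an inter-rater reliability coefficient. The following is a minimal Python sketch of how such quantities could be computed, not the authors' code. The abstract does not say which reliability statistic was used, so Cohen's kappa is assumed here purely for illustration. The function names, the label values (0, 1, 2), and the toy score lists are all hypothetical and are not data from the study.

```python
from collections import Counter


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for a simple k-fold split.

    Items are assigned to folds round-robin; a real pipeline would
    usually shuffle or stratify by label first (assumption).
    """
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(gold, pred):
    """Proportion of items on which the two label lists agree."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = accuracy(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy scores, for illustration only.
human_scores = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
model_scores = [0, 1, 2, 1, 1, 0, 2, 2, 0, 2]

agreement = accuracy(human_scores, model_scores)
kappa = cohen_kappa(human_scores, model_scores)
splits = list(k_fold_indices(len(human_scores), k=5))
```

Kappa is lower than raw agreement whenever some agreement is expected by chance, which is one reason an accuracy figure and a reliability coefficient are reported separately.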