"I understand why I got this grade": Automatic Short Answer Grading with Feedback

Dishank Aggarwal, Pushpak Bhattacharyya, Bhaskaran Raman
Department of Computer Science
Indian Institute of Technology, Bombay
{dishankaggarwal, pb, br}@cse.iitb.ac.in
Abstract

The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks, as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF), a dataset of $\sim 5.8k$ student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains. We leverage the generative capabilities of state-of-the-art large language models (LLMs) with our Label-Aware Synthetic Feedback Generation (LASFG) strategy to include feedback in our dataset. This paper underscores the importance of enhanced feedback in practical educational settings, outlines the dataset annotation and feedback generation processes, conducts a thorough analysis of EngSAF, and provides several LLM-based zero-shot and fine-tuned baselines for future comparison. Our code and dataset are available at https://github.com/dishankaggarwal/EngSAF. Additionally, we demonstrate the efficiency and effectiveness of the ASAG system through its deployment in a real-world end-semester exam at the Indian Institute of Technology Bombay (IITB), showcasing its practical viability and potential for broader implementation in educational institutions.

1 Introduction

Technology integration in education has resulted in transformative changes, redefining traditional pedagogical approaches and assessment methodologies. Effective education relies on feedback and explanations provided during assessments to ensure quality learning outcomes Shute (2008). Grading questions in tests and examinations has proven to be a good way to assess students’ learning and understanding of a topic or a subject. An exam could include various question types, such as multiple choice, fill-in-the-blank, short answers, and essays. Among these question types, short answers and essays are more complicated to analyze than multiple-choice or fill-in-the-blank questions because the responses are flexible and written in natural language. Automating the grading process becomes crucial, especially in countries with extremely high student-to-teacher ratios, as it can significantly reduce instructors’ workloads and improve the assessment process. Significant advancements have been achieved in this field in recent years, primarily after the introduction of the SemEval Semantic Textual Similarity (STS) task Agirre et al. (2012, 2013, 2014, 2015).

This challenge can be approached as a machine learning problem, where the objective is to grade a student’s response based on how similar it is to the reference answers. However, simply assigning a score or label to a learner’s response is often inadequate in practical educational contexts. Ahea et al. (2016) highlight the value and effectiveness of feedback in improving students’ learning and professionalizing teaching in higher education.

1.1 Problem Statement

Given a question, a reference answer, and a student’s answer, the aim is to provide content-focused elaborated feedback and assign a label indicating the degree of correctness in the student’s answer compared to the reference answer. Here, we focus on questions where the answer type is a sentence or a short paragraph. This task involves evaluating the alignment between the student’s and reference answers. Figure 1 illustrates the problem statement, showing the input and output using an example from the EngSAF dataset.

Figure 1: ASAG with feedback model which accepts a question, reference answer, and student answer as input and outputs a grade from ‘correct,’ ‘partially correct,’ or ‘incorrect’ along with the feedback/explanation of the assigned grade.
Table 1: An example showing a question, reference answer, and three student answers (Student#1, Student#2, and Student#3) alongside their corresponding labels and synthetically generated Feedback/Explanation for the assigned label from the EngSAF Dataset
Question What is the difference between basin order and channel order?
Reference Answer Basin order is highest order of any stream in that basin whereas channel order is order of stream which denotes that in what order of streams has joined the channel.
Student Answer 1 Highest order channel is the basin order whereas channel order is the order of channel from tributaries to reaches to main river stream.
Label 2 (Correct response)
Feedback Excellent! You have a clear understanding of the distinction between basin order and channel order.
Student Answer 2 Channel order reflects to the number of streams coming together to form a channel.
Label 1 (Partially Correct response)
Feedback Your answer includes a part of the distinction. Channel order indeed indicates the number of streams joining together to make a channel, but the difference between basin order and channel order is not mentioned.
Student Answer 3 Channel order is the order of the highest order streams. For example, two first order streams (or more) will make a second order stream and similarly for highest orders.
Label 0 (Incorrect response)
Feedback The student answer confuses basin order with channel order. Basin order refers to the highest order of streams within a basin, while channel order refers to the order of streams based on the sequence of junctions.

1.2 Motivation

The increasing demand for technology in education has led to a need for more efficient and effective methods of grading short-answer assessments. In the case of short answer grading, feedback with clear explanations goes beyond simply conveying a grade. It offers valuable insights into students’ strengths, weaknesses, and areas of improvement. However, the effectiveness of feedback is largely unexplored due to the lack of public, content-centered, elaborated feedback datasets in different domains. These datasets are crucial for training and developing automated feedback systems that can provide personalized and nuanced feedback. These limitations create the need for a more efficient and effective ASAG system that provides feedback alongside the grade, and this is where the Engineering Short Answer Feedback (EngSAF) dataset steps in, consisting of questions and student answers from multiple engineering domains for the ASAG task.

Our contributions are:

  1. EngSAF dataset containing around 5.8K student responses to 119 questions from multiple engineering domains along with synthetically generated feedback explaining the assigned grade for the task of ASAG. To the best of our knowledge, this is the first dataset containing questions and responses from multiple engineering domains. (Section 3)

  2. Benchmark scores on the EngSAF dataset using different Large Language Models (LLMs) for future comparison and research. (Section 5)

  3. Real-world deployment of the EngSAF fine-tuned ASAG model in an end-semester exam at Indian Institute of Technology, Bombay (IITB). (Section 6)

2 Related work

2.1 Automatic Short Answer Grading (ASAG)

ASAG is an essential area of research that has garnered significant attention in recent years. Several approaches have been proposed for traditional ASAG, ranging from rule-based methods to more sophisticated machine-learning techniques. One early approach for ASAG was based on keyword or pattern matching, where the presence or absence of certain keywords in the student’s answer was used to determine its accuracy Mitchell et al. (2002); Sukkarieh et al. (2004); Nielsen et al. (2009). To overcome the limitations of keyword matching, researchers have developed more sophisticated methods that use natural language processing (NLP) techniques. One such method is based on Latent Semantic Analysis (LSA), which represents texts as high-dimensional vectors and compares them to the reference answers using cosine similarity LaVoie et al. (2020). In a related study, the task of ASAG is addressed by incorporating features such as answer length, grammatical correctness, and semantic similarity to the reference answers Sultan et al. (2016).

More recently, deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been applied to the ASAG task. These models are trained on large amounts of annotated data and can capture the semantic relationships between words in a student’s answer and the reference answers Surya et al. (2019); Zhang et al. (2022). Pre-trained Language Models (PLMs), such as BERT Devlin et al. (2018), GPT Radford et al. (2019), RoBERTa Liu et al. (2019), DistilBERT Sanh et al. (2019), and ALBERT Lan et al. (2019), have performed exceptionally well on various NLP tasks and have also been used for ASAG. Sentence-BERT or SBERT Reimers and Gurevych (2019) performs exceptionally well on current publicly available traditional ASAG datasets Condor et al. (2021). A common limitation of the prior work is that only grades are assigned, and no feedback is provided. We seek to fill this gap with the introduction of the EngSAF dataset.

2.2 Short Answer Grading Datasets

Several publicly accessible datasets have been curated, each designed to facilitate research and benchmarking in the ASAG task. In 2009, Mohler and Mihalcea published a short answer dataset on a data structures course that contained 630 records Mohler and Mihalcea (2009). In 2011, they published an extended dataset on the same course that contained 2273 records Mohler et al. (2011), named the Texas Extended dataset. The Beetle corpus was created to support the annotation of student-authored answers Nielsen et al. (2008). Another dataset that is publicly available on Kaggle and could be used for the task of ASAG is ASAP-AES (https://www.kaggle.com/c/asap-aes/data), provided by the Hewlett Foundation as part of the Automated Student Assessment Prize competition on essay scoring.

However, all the aforementioned datasets are conventional ASAG datasets consisting solely of grades or scores. Our extensive literature survey found only one short answer grading dataset with content-focused elaborated feedback, i.e., Filighera et al. (2022), which introduces an inaugural dataset for short-answer feedback comprising bilingual responses in English and German. Human annotations were meticulously collected and refined to uphold the feedback’s quality. However, this dataset is constrained by its limited number of student responses ($\sim 2k$) in English and its exclusive focus on questions from a single domain, specifically computer networks, thereby lacking diversity across different engineering domains. To address these limitations, we introduce EngSAF, which contains $\sim 5.8k$ data points and questions from multiple engineering domains.

3 Engineering Short Answer Feedback (EngSAF) Dataset

This dataset contains 119 questions drawn from different undergraduate and graduate engineering courses, accompanied by approximately 5.8k student responses. Each question is also accompanied by an instructor-provided correct answer (reference answer). These questions and answers have been taken from actual quizzes and exams at the Indian Institute of Technology, Bombay (IITB), covering a diverse range of 25 courses. The questions and responses span multiple domains, including image processing, water quality management, and operating systems, to name a few. The Technical Appendix contains the complete list of courses included in the dataset. Based on the instructor’s assigned marks, each response has been categorized as “correct,” “partially correct,” or “incorrect.”

Table 1 provides an example from the EngSAF dataset showcasing a question, its corresponding reference answer, and three different student responses alongside their associated output labels and synthetically generated feedback/explanation. The Technical Appendix contains more examples from the EngSAF dataset. The dataset is applicable to both traditional automatic short answer grading and the generation of elaborated feedback.

3.1 Dataset Construction

EngSAF contains questions and student responses from different undergraduate and graduate engineering courses. The instructor provided a correct answer (reference answer) for each question. The Instructor/Teaching Assistant (TA) of the respective course assigns marks to each student’s answer. The students’ responses and grades are obtained from a reputed engineering university. The quality assessment of the proposed dataset is covered in Section 3.5. If the maximum mark for a question is $k$ and a student’s answer is graded $x$ marks by the instructor, then the output label is annotated according to the scheme in Table 2.

Table 2: Labelling scheme used for the annotation. $x$ denotes the marks given by the Instructor/TA for a student’s answer, and $k$ denotes the maximum marks for that particular question.
Condition Output Label
$x = k$ 2 (Correct Response)
$0 < x < k$ 1 (Partially Correct Response)
$x = 0$ 0 (Incorrect Response)
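The labelling scheme in Table 2 can be written as a small function. This is a minimal sketch: the function name `marks_to_label` and the input validation are illustrative assumptions, not part of the released code.

```python
def marks_to_label(x: float, k: float) -> int:
    """Map instructor marks x (out of a maximum of k) to the 3-way label of Table 2."""
    if k <= 0:
        raise ValueError("maximum marks k must be positive")
    if not 0 <= x <= k:
        raise ValueError("marks x must lie in the range [0, k]")
    if x == k:
        return 2  # Correct response: full marks
    if x == 0:
        return 0  # Incorrect response: zero marks
    return 1      # Partially correct response: 0 < x < k


# Human-readable names for the three labels.
LABEL_NAMES = {2: "Correct", 1: "Partially Correct", 0: "Incorrect"}

print(LABEL_NAMES[marks_to_label(2.5, 5)])
```

Any non-zero mark short of the maximum maps to the same "partially correct" label, so the scheme deliberately discards how close to full marks the answer was.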

3.2 Challenges and Requirements

  • Diversity in student responses: Human language is inherently subjective, and different individuals may interpret the same content differently. This subjectivity extends to the grading process, making it challenging to provide universally applicable feedback.

  • Response Variability: Short answer responses can vary significantly regarding quality, coherence, and relevance. Some responses may be well-structured and articulate, while others may be incomplete or contain grammatical errors. Designing feedback that caters to this variability in response quality poses a considerable challenge.

  • Feedback Impact: Wrong feedback can negatively shape learners’ perceptions of themselves and their abilities. If the feedback consistently highlights their shortcomings or mistakes, they may internalize these negative perceptions and develop a fixed mindset about their capabilities. This can hinder their willingness to take risks, seek challenges, and persist in facing difficulties.

3.3 Label Aware Synthetic Feedback Generation (LASFG)

Leveraging the advanced language generation and reasoning capabilities of state-of-the-art Large Language Models (LLMs) like Gemini (https://gemini.google.com/) and ChatGPT, we enhance educational assessment with feedback, particularly in Automatic Short Answer Grading (ASAG). The approach uses Gemini’s ability to comprehend an input prompt consisting of a question, a student’s answer, a reference answer, and the corresponding grading label provided by an instructor or teaching assistant to generate content-focused elaborated feedback. The generated feedback covers the reasoning or explanation for the gold output label. Grammarly (https://www.grammarly.com/) is then used to check the synthetically generated feedback and remove any grammatical errors. Prompt details used for synthetic feedback generation can be found in Technical Appendix 10.1. The quality estimation of the generated feedback is described in Section 3.5.
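To make the label-aware idea concrete, the sketch below assembles a prompt from the four inputs named above. The template wording and the function name `build_feedback_prompt` are hypothetical; the actual prompt is the one given in the Technical Appendix, and no LLM call is made here.

```python
# Hypothetical label wording; the labels follow the 0/1/2 scheme of Table 2.
LABEL_TEXT = {0: "incorrect", 1: "partially correct", 2: "correct"}


def build_feedback_prompt(question: str, reference_answer: str,
                          student_answer: str, label: int) -> str:
    """Assemble a label-aware prompt: the gold label is supplied to the model,
    so it only has to explain the grade rather than decide it."""
    if label not in LABEL_TEXT:
        raise ValueError("label must be 0, 1, or 2")
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        f"The instructor graded this answer as {LABEL_TEXT[label]}.\n"
        "Write content-focused feedback that explains this grade to the student."
    )


print(build_feedback_prompt(
    "What is the difference between basin order and channel order?",
    "Basin order is the highest order of any stream in that basin ...",
    "Channel order reflects the number of streams forming a channel.",
    1,
))
```

Supplying the gold label in the prompt is what distinguishes this setup from asking the model to grade and explain in a single step.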

3.4 Corpus Statistics

Table 3: Distribution of gold label outputs across train and test split on EngSAF dataset. The test set is further subdivided into Unseen Answers (UA) and Unseen Questions (UQ).
Label Train UA UQ Total
Correct 1716 403 321 2440
Partially Correct 1412 344 278 2034
Incorrect 941 233 166 1340
Total 4069 980 765 5814

Following Dzikovska et al. (2013), we partitioned the data into a training set, comprising approximately 70% of the dataset, and two test sets, unseen answers (approximately 16%) and unseen questions (approximately 14%), as shown in Table 3. The test split with unseen answers includes fresh responses to the questions used in training. In contrast, the test split with unseen questions comprises entirely new questions, testing the model’s ability to generalize to questions without prior exposure. Technical Appendix 10.3 contains insights about the distribution of text length for student answers, reference answers, questions, and feedback in the EngSAF dataset.
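The difference between the two test splits can be illustrated with a short sketch. It assumes each record carries a `question_id` field, and the two share parameters are illustrative defaults rather than the procedure actually used to build EngSAF.

```python
import random
from collections import defaultdict


def split_ua_uq(records, uq_question_share=0.15, ua_answer_share=0.2, seed=0):
    """Split records into train, unseen-answers (UA), and unseen-questions (UQ).

    UQ holds every answer to a held-out subset of questions; UA holds a share
    of the answers to each remaining question, whose other answers go to train.
    """
    rng = random.Random(seed)
    by_question = defaultdict(list)
    for rec in records:
        by_question[rec["question_id"]].append(rec)

    question_ids = sorted(by_question)
    rng.shuffle(question_ids)
    n_uq = max(1, round(uq_question_share * len(question_ids)))
    uq_ids = set(question_ids[:n_uq])

    train, unseen_answers, unseen_questions = [], [], []
    for qid in question_ids:
        answers = by_question[qid][:]
        if qid in uq_ids:
            unseen_questions.extend(answers)  # question never appears in train
            continue
        rng.shuffle(answers)
        # Keep at least one answer per seen question in the training set.
        n_ua = min(len(answers) - 1, round(ua_answer_share * len(answers)))
        unseen_answers.extend(answers[:n_ua])
        train.extend(answers[n_ua:])
    return train, unseen_answers, unseen_questions
```

By construction, every question in the UA split also appears in training, while no question in the UQ split does.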

3.5 Quality Estimation

To show the reliability and credibility of our dataset and synthetically generated feedback, we randomly sampled 300 data points, distributed equally across the output labels. This sample is a good representation of the EngSAF dataset.

Each synthetically generated feedback was scored by three human annotators over three aspects. Each aspect is scored on a scale (1-5), with a high score indicating a better response. Each feedback is analyzed on the following aspects.

  1. Fluency and Grammatical Correctness: This aspect tests whether the generated feedback is fluent in English and grammatically correct.

  2. Feedback Correctness/Accuracy: This aspect assesses the overall quality of the generated feedback regarding content, relevance, quality, and explanation for the assigned grade.

  3. Emotional Impact: This aspect assesses how feedback affects the learner’s emotional state. Annotators are tasked with rating whether the feedback avoids triggering negative emotions or impacts by refraining from using words such as "fail" that may evoke feelings of discouragement or distress in the learner.

Table 4: Human annotators provided scores ranging from 1 to 5, indicating their assessment of various aspects, with higher scores denoting better performance.
Aspect Avg. Score
Fluency & Grammatical Correctness 4.73
Feedback Correctness/Accuracy 4.55
Emotional Impact 4.61

Three human annotators evaluated the correctness of each output label for the sampled data points, achieving an accuracy of 98% and a pair-wise average Cohen’s Kappa ($\kappa$) score of 0.65 (substantial agreement), showcasing the high reliability of the assigned output labels. Table 4 shows the average score for each designed aspect used to measure the reliability of the proposed dataset. Human evaluation across the designed aspects consistently yields an average score greater than 4.5 out of 5, which underscores the reliability and quality of the EngSAF dataset. The details of the human annotators can be found in the Technical Appendix.
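For reference, a pair-wise average Cohen’s Kappa over three annotators can be computed as sketched below. The function names are illustrative, and the toy annotations are not taken from the dataset.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy example with three annotators and four items.
toy = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(round(mean_pairwise_kappa(toy), 3))
```

With three annotators there are three pairs, so the reported value is the mean of three kappa scores.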

4 Experiments

Building on the work of Filighera et al. (2022), our primary objective is to conduct experiments that establish baselines for our EngSAF dataset. We also examine the impact of including the question in the input on the generated feedback and the assigned labels. Traditionally, in Automatic Short Answer Grading (ASAG), assessments have focused solely on comparing reference answers and student responses. However, Lv et al. (2021) challenge this convention by demonstrating that integrating questions into the evaluation process enhances performance on traditional ASAG tasks. Our study therefore builds upon these insights to explore the effects of incorporating questions in ASAG, with the aim of refining the evaluation criteria and methodologies in this domain.

4.1 Experimental Setting

To establish baselines for EngSAF, we have conducted experiments in two settings.

  1. Fine-tuning Large Language Models (LLMs): We use the Llama-2 model Touvron et al. (2023) and the Mistral 7b model (https://mistral.ai/), which are fine-tuned to predict the output label for each student response (correct, partially correct, or incorrect) and to jointly generate feedback explaining the assigned output label. We conducted this experiment with two input configurations.

    (a) Without Question: Student answer and Reference answer are passed as input.

    (b) With Question: Question, Student answer, and Reference answer are passed as input.

    The following models are used in this experiment.

    llama-2-13B-chat
    Mistral-7B-Instruct-v0.1

  2. Zero-shot experiment using ChatGPT: For this experiment, we prompted ChatGPT to assign an output label along with feedback to each student’s response by evaluating its correctness compared to the reference answer for the ASAG task. Prompt details are present in the Technical Appendix. We used the ChatGPT API (https://openai.com/) to perform this experiment on the EngSAF dataset. The following model has been used in this experiment.

    gpt-3.5-turbo-16k

Hyperparameters used for fine-tuning LLMs on EngSAF can be found in Technical Appendix 10.2. All experiments were performed on 2 NVIDIA A100-SXM GPUs with 80 GiB of memory. Fine-tuning takes about 1 hour per epoch.
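The two fine-tuning input configurations described in this section (with and without the question) could be assembled as in the sketch below. The field names and ordering are assumptions for illustration; the exact input format used for fine-tuning is not specified here.

```python
from typing import Optional


def build_model_input(student_answer: str, reference_answer: str,
                      question: Optional[str] = None) -> str:
    """Build the text input for one example.

    question=None gives the "Without Question" configuration; passing a
    question string gives the "With Question" configuration.
    """
    parts = []
    if question is not None:
        parts.append(f"Question: {question}")
    parts.append(f"Reference answer: {reference_answer}")
    parts.append(f"Student answer: {student_answer}")
    return "\n".join(parts)


# The same example rendered in both configurations.
without_q = build_model_input("It is the highest stream order.", "Basin order is ...")
with_q = build_model_input("It is the highest stream order.", "Basin order is ...",
                           question="What is basin order?")
print(with_q)
```

Keeping everything else fixed and toggling only the question isolates its contribution to label and feedback quality.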

5 Results

Table 5 shows the results of a majority baseline, Llama-2 fine-tuning, Mistral-7b fine-tuning, and the ChatGPT zero-shot experiment. The majority baseline predicts the most frequent label and feedback from the EngSAF train set. The most common label is “correct response,” and the most common feedback is “Well done! You have answered the question correctly, covering all the required aspects.” We conducted the ChatGPT experiment only on the unseen answers test set because, in the zero-shot setting, the unseen questions and unseen answers splits are equivalent.

Table 5: Results of the majority baseline, fine-tuned Llama-2 and Mistral-7b, and zero-shot ChatGPT on the EngSAF unseen answers and unseen questions test splits. w_quest models additionally received the questions as input, while wo_quest models did not. Please note that the text similarity measures, accuracy, and F1 scores are in percent. The ChatGPT experiment was conducted solely on the unseen answers test set because, in the zero-shot setting, the unseen questions and unseen answers splits are equivalent.
                      Unseen Answers                            Unseen Questions
Model                 Acc.   F1     BLEU   MET.   ROU.   BERT   Acc.   F1     BLEU   MET.   ROU.   BERT
Majority              43.3   26.1   1.2    8.6    12.7   16.2   40.5   23.4   0.1    8.64   2.78   12.32
Llama-2 (wo_quest)    74.4   73.0   13.3   34.1   16.8   31.9   55.6   53.6   12.5   31.4   16.2   28.9
Llama-2 (w_quest)     73.9   73.7   11.7   35.9   16.9   35.1   56.3   54.9   9.2    31.4   13.9   32.6
Mistral (wo_quest)    72.8   73.1   11.3   33.3   14.82  37.4   54.7   55.4   11.3   32.0   15.2   36.2
Mistral (w_quest)     75.4   75.7   13.9   38.3   19.5   41.6   58.7   57.9   14.9   36.7   19.7   40.9
ChatGPT               41.1   24.0   7.4    31     11.3   28.8   -      -      -      -      -      -

5.1 Analysis

In this section, we discuss the insights drawn from the experimental results on the EngSAF dataset for the ASAG task. The evaluation metrics used for labels are accuracy and macro-averaged F1. To evaluate the feedback, we measure ROUGE-2, SacreBLEU Post (2018) (https://pypi.org/project/sacrebleu/), METEOR Banerjee and Lavie (2005), and BERTScore Zhang et al. (2019).
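The two label metrics can be computed as sketched below. This is a plain-Python illustration rather than the evaluation script used for the reported numbers, and it assumes the convention that a label absent from both gold and predicted sets contributes an F1 of 0.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-label F1 scores over the three labels."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the label never occurs at all.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = [2, 2, 1, 0]
pred = [2, 1, 1, 0]
print(accuracy(gold, pred), round(macro_f1(gold, pred), 4))
```

Macro-averaging weights the three labels equally, so it penalizes a model that ignores the less frequent "incorrect" label more than accuracy does.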

Table 5 shows that Mistral-7b significantly outperforms the majority baseline in both label and feedback metrics. However, performance differs markedly between unseen questions and unseen answers. We observe a decrease of approximately 23% in accuracy for unseen questions compared to unseen answers. This underscores the challenge even fine-tuned models face when generalizing to new questions. Additionally, there is an approximately 4% decrease in BERTScore for unseen questions compared to unseen answers, suggesting the need for new evaluation metrics that assess text at the contextual level instead of the lexical level. Including questions in the input has led to a notable improvement in BERTScore, with a 10% increase for unseen answers and a 12% increase for unseen questions. Similarly, for label accuracy, we observe a 3.5% increase for unseen answers and a 7.3% increase for unseen questions in the test set. The Technical Appendix includes a qualitative analysis of test examples. Overall, the experiments suggest that including the question in the input improves the feedback predicted by the model. The ChatGPT zero-shot experiment yields the lowest accuracy at 41.1%, even falling below that of the majority baseline, highlighting the complexity of the ASAG task and the difficulty that large language models like ChatGPT face in a zero-shot setting. Another contributing factor to this performance disparity could be the inherent bias present in teacher grading, which models can only learn when trained on such data. Interestingly, the smaller Mistral 7B model outperformed the LLaMA-2-13B model for ASAG, demonstrating the potential of more compact models in this task.
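One reading of the accuracy gains, which the text does not state explicitly, is that they are relative changes computed from the Mistral rows of Table 5. The sketch below checks this reading using only values from the table; the helper name `relative_change` is illustrative.

```python
def relative_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Mistral accuracy from Table 5 as (wo_quest, w_quest) pairs.
mistral_acc = {"unseen answers": (72.8, 75.4), "unseen questions": (54.7, 58.7)}

for split, (without_q, with_q) in mistral_acc.items():
    print(f"{split}: {relative_change(with_q, without_q):.2f}% relative gain")
```

Under this reading the values come out at about 3.57% and 7.31%, close to the 3.5% and 7.3% quoted above, whereas the absolute differences are 2.6 and 4.0 percentage points.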

6 Deployment

Table 6: Subject Matter Experts (SME) provided scores ranging from 1 to 5 for Feedback Quality/ Correctness and Emotional Impact, with higher scores denoting better performance. SME also validated the correctness of the predicted output label. Accuracy value is in percent.
Aspect Avg. Score
Output Label Accuracy 92.5%
Feedback Quality/Correctness 4.5
Emotional Impact 4.9

For the real-world deployment, our fine-tuned ASAG model was integrated into an end-semester exam on ET 623 (Learning Analytics Course) at IIT Bombay for the 2024 academic year. The setup included the students enrolled in the course who consented to participate in the experiment. The end-semester exam included 2 short-answer questions, each accompanied by the instructor’s correct/reference answer. We randomly sampled 25 student answers for each question and used our fine-tuned ASAG with Feedback model to predict the output label and the feedback/explanation for the predicted output. Each predicted output label was checked for correctness by a subject matter expert. Each predicted feedback was analyzed in terms of Feedback Correctness/Accuracy and Emotional Impact, as discussed under Quality Estimation (Section 3.5). Each aspect is scored on a scale (1-5), with a high score indicating a better response.

Table 6 presents the average scores across the designed aspects used to measure the reliability of the deployed model. The subject matter experts’ scores are at least 4.5 for both aspects: 4.5 for Feedback Quality/Correctness and 4.9 for Emotional Impact. This demonstrates the reliability and effectiveness of the ASAG model in real-world scenarios. Additionally, the model’s predicted output labels achieved an accuracy of 92.5%, further showcasing its performance and reliability. The Technical Appendix contains more details about the deployment.

7 Summary, Conclusion, and Future Work

This study presents a novel and extensive dataset for grading short answers across multiple engineering domains. Our dataset addresses the need for a standardized evaluation platform by covering diverse subjects and a wide range of short answers. We meticulously curated and annotated this dataset to ensure its quality and applicability. Using the Label Aware Synthetic Feedback Generation (LASFG) strategy, we have synthetically incorporated feedback in the EngSAF dataset. Through experimentation, we established benchmarks on the dataset using fine-tuned LLMs and ChatGPT zero-shot experiments.

In conclusion, our research endeavors culminated in successfully creating and evaluating a cutting-edge multi-domain short answer grading dataset. The dataset’s diverse content and meticulous annotations provide a solid platform for training and assessing grading models across various subject domains. Our benchmarking experiments showcased promising results, indicating that our dataset has the potential to significantly improve the quality and reliability of ASAG systems. Additionally, we deployed the ASAG system in a real-world setting for an IITB course exam, demonstrating its practical applicability. This work marks a crucial step forward in educational technology, equipping educators and researchers with a valuable resource to boost advancements in automated assessment techniques.

While this research presents a novel dataset for effective short answer grading along with feedback, there are several possibilities for future investigation. One area of focus could be the refinement and expansion of our dataset to include more nuanced and complex short answers, enabling us to accommodate a broader range of grading scenarios. Baseline figures could be improved by expansion or augmentation of the dataset. External knowledge inclusion through knowledge graphs could be explored in this direction.

8 Limitations

Our study, "I understand why I got this grade": Automatic Short Answer Grading with Feedback, has provided valuable insights into multi-domain ASAG. However, it’s important to acknowledge certain limitations. The synthetic nature of the generated feedback may introduce biases, inconsistencies, factual inaccuracies, or overly generic content that could affect the implications of the results. One key finding from Table 3 is an imbalance in output labels. Specifically, output label Zero (0) is less frequent than labels One (1) and Two (2). This imbalance could impact the performance of automatic grading models trained on this dataset. Despite our efforts to curate diverse short-answer responses across multiple engineering domains, the dataset’s overall size might limit the complexity and depth of models that can be trained on it. Variations in data distribution, language patterns, and task requirements across different domains and contexts may affect how well our results apply to real-world scenarios. While our dataset’s strength lies in its coverage of multiple engineering domains, it’s important to note that the range of domains might not be exhaustive. Researchers focusing on specific domains or aiming to enhance domain-specific performance may need to consider domain-specific or subject-specific fine-tuning or gather additional data.

9 Ethical Statement

The EngSAF dataset contains questions, reference answers, and student answers along with the output label assigned for each student answer. This dataset is taken from actual exams and quizzes conducted at the university. We are committed to protecting those students’ privacy and anonymity. Hence, no student-specific information is disclosed within the dataset. The process of collecting and curating this dataset adhered to the ethical norms of informed consent. Prior to any contribution, instructors of each course were provided details concerning the data collection’s intent, the utilization of their questions and responses, and any potential implications. While the ASAG models created with this dataset offer great potential in educational assessment, they must be used properly and ethically. As with any AI technology, its impact on education must be regularly monitored and evaluated. Ensuring an appropriate balance between the advantages derived from automation and safeguarding the educational experience and human judgement is of utmost importance.

References

  • Agirre et al. [2012] Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, 2012.
  • Agirre et al. [2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. * sem 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (* SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, pages 32–43, 2013.
  • Agirre et al. [2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 81–91, 2014.
  • Agirre et al. [2015] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 252–263, 2015.
  • Ahea et al. [2016] Md Mamoon-Al-Bashir Ahea, Md Rezaul Kabir Ahea, and Ismat Rahman. The value and effectiveness of feedback in improving students’ learning and professionalizing teaching in higher education. Journal of Education and Practice, 7(16):38–41, 2016.
  • Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • Condor et al. [2021] Aubrey Condor, Max Litster, and Zachary Pardos. Automatic short answer grading with sbert on out-of-sample questions. International Educational Data Mining Society, 2021.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dzikovska et al. [2013] Myroslava O Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, and Hoa Trang Dang. Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 263–274, 2013.
  • Filighera et al. [2022] Anna Filighera, Siddharth Parihar, Tim Steuer, Tobias Meuser, and Sebastian Ochs. Your answer is incorrect… would you like to know why? introducing a bilingual short answer feedback dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8577–8591, 2022.
  • Lan et al. [2019] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • LaVoie et al. [2020] Noelle LaVoie, James Parker, Peter J Legree, Sharon Ardison, and Robert N Kilcullen. Using latent semantic analysis to score short answer constructed responses: Automated scoring of the consequences test. Educational and Psychological Measurement, 80(2):399–414, 2020.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Lv et al. [2021] Gaoyan Lv, Wei Song, Miaomiao Cheng, and Lizhen Liu. Exploring the effectiveness of question for neural short answer scoring system. In 2021 IEEE 11th International Conference on Electronics Information and Emergency Communication (ICEIEC), pages 1–4. IEEE, 2021.
  • Mitchell et al. [2002] Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. Towards robust computerised marking of free-text responses. 2002.
  • Mohler and Mihalcea [2009] Michael Mohler and Rada Mihalcea. Text-to-text semantic similarity for automatic short answer grading. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 567–575, 2009.
  • Mohler et al. [2011] Michael Mohler, Razvan Bunescu, and Rada Mihalcea. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 752–762, 2011.
  • Nielsen et al. [2008] Rodney D Nielsen, Wayne H Ward, James H Martin, and Martha Palmer. Annotating students’ understanding of science concepts. In LREC. Citeseer, 2008.
  • Nielsen et al. [2009] Rodney D Nielsen, Wayne Ward, and James H Martin. Recognizing entailment in intelligent tutoring systems. Natural Language Engineering, 15(4):479–501, 2009.
  • Post [2018] Matt Post. A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771, 2018.
  • Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
  • Sanh et al. [2019] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  • Shute [2008] Valerie J Shute. Focus on formative feedback. Review of educational research, 78(1):153–189, 2008.
  • Sukkarieh et al. [2004] Jana Z Sukkarieh, Stephen G Pulman, and Nicholas Raikes. Auto-marking 2: An update on the ucles-oxford university research into using computational linguistics to score short, free text responses. International Association of Educational Assessment, Philadelphia, 2004.
  • Sultan et al. [2016] Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070–1075, 2016.
  • Surya et al. [2019] K Surya, Ekansh Gayakwad, and MK Nallakaruppan. Deep learning for short answer scoring. Int. J. Recent. Technol. Eng.(IJRTE), 7(6), 2019.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Zhang et al. [2019] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
  • Zhang et al. [2022] Lishan Zhang, Yuwei Huang, Xi Yang, Shengquan Yu, and Fuzhen Zhuang. An automatic short-answer grading model for semi-open-ended questions. Interactive learning environments, 30(1):177–190, 2022.

10 Appendix

10.1 Prompt used for Synthetic Feedback Generation:

To effectively tackle the challenges mentioned above and meet the outlined requirements, the instruction provided to the LLM should contain several key components:

  1. It should articulate the ASAG task, specifying input and output.

  2. It should specify guidelines for generating concise feedback and emphasize excluding words or phrases that may evoke negative emotions in the learner.

  3. The instruction should outline the expected format for the input data to ensure compatibility and consistency in processing.

Prompt Used:

You are an automatic short-answer feedback generator.

Given a question, a student answer, a reference/correct answer, and a correctness label (correct/partially correct/incorrect), your task is to provide constructive feedback or reasoning for the assigned label.

Ensure the feedback does not reference the provided reference answer. Keep it concise (3-4 lines) and aim to guide the learner without invoking any negative emotions.

<START> and <END> token shows the starting and ending of each given input.

<START> {{Question}} <END>

<START> {{Correct Answer}} <END>

<START> {{Student Answer}} <END>

<START> {{Output Label}} <END>
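As an illustration of how the four placeholders in the prompt above might be filled for a single data point, the following is a minimal sketch. The names PROMPT_TEMPLATE and build_feedback_prompt are hypothetical and are not taken from the released code; the template text simply mirrors the prompt quoted in this section, and how the label is rendered (here as plain text such as "incorrect") is an assumption.

```python
# Illustrative sketch only: PROMPT_TEMPLATE and build_feedback_prompt are
# hypothetical names; the wording follows the prompt quoted in this appendix.
PROMPT_TEMPLATE = """You are an automatic short-answer feedback generator.

Given a question, a student answer, a reference/correct answer, and a correctness label (correct/partially correct/incorrect), your task is to provide constructive feedback or reasoning for the assigned label.

Ensure the feedback does not reference the provided reference answer. Keep it concise (3-4 lines) and aim to guide the learner without invoking any negative emotions.

<START> and <END> token shows the starting and ending of each given input.

<START> {question} <END>

<START> {correct_answer} <END>

<START> {student_answer} <END>

<START> {output_label} <END>"""


def build_feedback_prompt(question, correct_answer, student_answer, output_label):
    """Fill the four input slots of the feedback-generation prompt."""
    return PROMPT_TEMPLATE.format(
        question=question.strip(),
        correct_answer=correct_answer.strip(),
        student_answer=student_answer.strip(),
        output_label=output_label.strip(),
    )


if __name__ == "__main__":
    print(build_feedback_prompt(
        "What do you mean by empirical equations?",
        "Equations derived from experiments/observed data.",
        "these equations which are formed by empirical (experimental) data.",
        "partially correct",
    ))
```

Because the label is passed in alongside the answers, the generated feedback is conditioned on the grade it is meant to justify, which is the idea behind the label-aware strategy.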

10.2 Hyperparameters used

The hyperparameters used for fine-tuning the llama-2 and mistral-7b models are as follows:

    bf16 = True
    number_of_training_epochs = 4
    per_device_eval_batch_size = 8
    per_device_train_batch_size = 8
    gradient_accumulation_steps = 1
    learning_rate = 2e-4
    warmup_ratio = 0.03
    weight_decay=0.001
    lr_scheduler_type = cosine
    LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
    )
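To make two of the listed settings more concrete, the sketch below computes the learning rate implied by a cosine schedule with a warmup ratio of 0.03, and the LoRA scaling factor lora_alpha / r. It assumes the common variant of the cosine scheduler (linear warmup followed by cosine decay to zero). The total number of optimizer steps is not reported here, so the value of 1000 is purely hypothetical, as is the function name lr_at_step.

```python
import math

# Values copied from the hyperparameter listing above.
LEARNING_RATE = 2e-4
WARMUP_RATIO = 0.03
LORA_ALPHA = 16
LORA_R = 64


def lr_at_step(step, total_steps, base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate under linear warmup followed by cosine decay to zero.

    Assumption: this mirrors the usual "cosine with warmup" schedule; the
    exact scheduler implementation used for the experiments is not stated.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scale = LORA_ALPHA / LORA_R

if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total))
    print("LoRA scale:", lora_scale)
```

With r = 64 and lora_alpha = 16, the adapter update is scaled by 0.25 under the standard LoRA formulation.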

10.3 Dataset Insight

To gain insight into the distribution of text lengths in our dataset, Figure 2 plots the distribution of text length for student answers, reference answers, questions, and feedback. Table 7 shows the average length and standard deviation of the text fields in the EngSAF dataset. We used NLTK’s word_tokenize function (https://www.nltk.org/) to tokenize the text.

Table 7: Average text length and standard deviation (in tokens) of the question, student answer, reference answer, and feedback fields.
Field | Average Length | Std. Dev.
Question | 17.50 | 16.06
Student Answer | 25.85 | 20.03
Reference Answer | 26.47 | 18.17
Feedback | 41.95 | 14.26
Figure 2: Histograms representing the distribution of text lengths (in tokens): (a) question length, (b) reference answer length, (c) student answer length, (d) feedback length.
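The per-field statistics in this section can be computed along the following lines. This is a minimal sketch, not the authors' script: the function name length_stats is hypothetical, whitespace splitting stands in for NLTK’s word_tokenize so that the snippet has no dependencies, and the use of the population standard deviation is an assumption, since the variant used for Table 7 is not stated.

```python
from statistics import mean, pstdev


def length_stats(texts, tokenize=str.split):
    """Return (average length, standard deviation) of texts, in tokens.

    `tokenize` defaults to whitespace splitting as a dependency-free
    stand-in; pass nltk.word_tokenize to follow the paper's tokenization.
    """
    lengths = [len(tokenize(text)) for text in texts]
    return mean(lengths), pstdev(lengths)


if __name__ == "__main__":
    answers = [
        "these equations which are formed by empirical (experimental) data.",
        "the equation which is derived for specific area of land .",
    ]
    print(length_stats(answers))
```

Token counts from whitespace splitting will generally be lower than those from word_tokenize, which separates punctuation into its own tokens.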

11 EngSAF Dataset: Courses and Subjects

The EngSAF dataset contains questions from the following courses. Our code and dataset are available at https://github.com/dishankaggarwal/EngSAF

  1. Sustainability Assessment of Urban Systems
  2. Introduction to Philosophy
  3. Business Valuation, Mergers and Acquisitions
  4. Environmental Management
  5. Fundamentals of Environmental Chemistry
  6. Environmental Chemistry
  7. Municipal waste and biomedical waste management
  8. Solid Waste Management - Basic Principles and Technical Aspects
  9. Mechanical Behavior of Materials
  10. Water Resource Engineering
  11. Advanced Hydrological Analysis and Design
  12. Solid and Hazardous Waste Laboratory
  13. Embedded Systems
  14. Water Quality Management
  15. Abstractions and Paradigms for Programming
  16. Manufacturing Processes
  17. Operating Systems
  18. Water Resources and Environmental Hydraulics
  19. Fiber Reinforced Composites
  20. Electrochemical Materials Science
  21. Environmental Microbiology and Ecology
  22. Probabilistic and Statistical Methods in Civil Engineering
  23. Image Processing
  24. Computer-integrated Manufacturing
  25. Mass Transfer Processes in Environmental Systems

12 Zero-Shot Experiment Prompt

The following prompt is used for the zero-shot experiment with the gpt-3.5-turbo-16k model.

You are an automatic short-answer feedback generator.

Given a question, a student answer, and a reference answer. Evaluate student answers against a reference answer for correctness, providing labels (correct/partially correct/incorrect) and constructive feedback in about 3-4 lines.

Ensure feedback guides the learner without invoking negative emotions and does not reference the provided reference answer.

Format output with the label on the first line followed by feedback starting from the next line.
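Since the prompt asks for the label on the first line and the feedback on the following lines, a response can be split back into a numeric label and a feedback string roughly as follows. This is a hedged sketch: parse_zero_shot_output and LABEL_TO_SCORE are hypothetical names, the 0/1/2 mapping follows the label convention used in the tables of this paper, and real model outputs may deviate from the requested format in ways this does not handle.

```python
# Hypothetical helper; label scores follow the dataset's convention
# (0 = incorrect, 1 = partially correct, 2 = correct).
LABEL_TO_SCORE = {"incorrect": 0, "partially correct": 1, "correct": 2}


def parse_zero_shot_output(text):
    """Split a model response into (label_score, feedback).

    Expects the label on the first non-empty line and the feedback on the
    remaining lines; raises ValueError if the label is not recognized.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower().rstrip(".:")
    if label not in LABEL_TO_SCORE:
        raise ValueError(f"unrecognized label: {lines[0]!r}")
    feedback = " ".join(lines[1:])
    return LABEL_TO_SCORE[label], feedback


if __name__ == "__main__":
    response = "Partially correct\nYour answer covers the empirical nature.\nTry to add more detail."
    print(parse_zero_shot_output(response))
```

Raising an error on an unrecognized first line, rather than guessing a label, keeps malformed responses from silently entering the accuracy computation.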

13 More Examples

Table 11 presents two additional questions from the EngSAF dataset. Each question is accompanied by a reference answer and multiple student answers, each labeled with their respective output label and feedback summary.

14 Qualitative Analysis

Table 8 shows an example of the fine-tuned ASAG model’s output on a sample from the EngSAF test set used for qualitative analysis. The gold label classifies the student’s answer as incorrect, emphasizing that the student did not address how corners benefit specific image analysis applications. The model classifies the answer as partially correct, acknowledging that the student captured the essence of what corners represent but failed to elaborate on their practical applications. The model’s feedback correctly identifies the main issue with the student’s answer: it lacks specific examples of how corners are useful in image analysis applications. The reference answer provides specific applications such as image segmentation, edge detection, and object classification, and the feedback suggests that the student include these applications to improve their answer. This guidance is important for complete understanding and improvement. The feedback is educational: it directs the student toward the practical uses of the concept, which both corrects the current answer and aids the student’s overall learning by showing the importance of practical applications of theoretical concepts. The distinction between an incorrect and a partially correct answer is quite subtle. Therefore, although the model predicted the student’s response as partially correct rather than incorrect, the feedback explained the shortcomings of the answer accurately.

Question 1 | Which image analysis applications benefit from interest points like corners?
Reference Answer | Harris Corner detection from the corner, we can Improve image segmentation; if we have corners, we can detect edges more efficiently. By using corner details, we can improve object classification. It is used in many Computer vision applications.
Student Answer | A corner implies that there is a change in all the directions and this change is particularly significant where helps in analysis.
Gold Output Label | 0 (Incorrect response)
Gold Feedback | Your answer provides a basic definition of a corner but does not address the image analysis applications that benefit from corner detection techniques. Try to relate the characteristics of corners to specific image analysis tasks or applications where they are commonly used.
Pred. Label | 1 (Partially correct response)
Pred. Feedback | Your answer captures the essence of corners as points of significant change in an image. However, it does not directly address how corners benefit image analysis applications. Try to focus on the specific advantages of corners in the context of image analysis, such as improving segmentation, edge detection, and object classification.
Table 8: Example #1: Output from the fine-tuned ASAG Model on a sample from the EngSAF test set.

Table 9 shows an example of the fine-tuned ASAG model’s output on a sample from the EngSAF test set used for qualitative analysis. The model’s feedback accurately recognizes the comprehensiveness of the student’s answer, which aligns well with the reference answer. The student correctly notes that multilayer perceptron (MLP) networks with multiple hidden layers are beneficial for learning non-linear features and handling complex feature discrimination. The feedback is positive, acknowledging the correctness of the student’s response and providing encouragement. This accurate assessment demonstrates the model’s capability to deliver meaningful and constructive feedback, crucial for the student’s learning process. By validating the student’s understanding and encouraging further exploration, the feedback contributes to a supportive educational environment.

Question 1 | Under which circumstances is a multilayer perceptron neural network with multiple hidden layers (>1 layer) desirable?
Reference Answer | when more deep learning or critical learning of features are required, then more hidden layers are required to extract out and learn features properly and efficiently so as to perform further tasks well. example- multiclass classification, XOR function.
Student Answer | it is desirable when we want to learn non-linear features. as the number of hidden layers increases, we can learn more complex, non-linear features. thus, we use it for a problem involving discrimination of complex features.
Gold Output Label | 2 (Correct response)
Gold Feedback | Your answer grasps the key concept of using multilayer perceptron neural networks with multiple hidden layers for learning complex, non-linear features. This is indeed a correct and insightful response. Keep exploring different types of neural network architectures and their applications to various problems.
Pred. Label | 2 (Correct response)
Pred. Feedback | Your answer captures the essence of when multilayer perceptron neural networks with multiple hidden layers are desirable. It correctly highlights the purpose of increasing the number of hidden layers for learning complex, non-linear features. Well done!
Table 9: Example #2: Output from the fine-tuned ASAG Model on a sample from the EngSAF test set.

15 Deployment details

For the real-world deployment, our fine-tuned ASAG model was integrated into the end-semester exam of ET 623 (Learning Analytics) at IIT Bombay in the 2024 academic year. The participants were the students enrolled in the course who consented to take part in the experiment. The exam included two short-answer questions, each accompanied by the instructor’s reference answer; Table 10 lists the questions and reference answers. Each predicted feedback was analyzed by three Subject Matter Experts (SMEs) in terms of Feedback Correctness/Accuracy and Emotional Impact, as discussed in the Quality Estimate section of the main paper. Each aspect was scored on a scale of 1 to 5, with a higher score indicating a better response. Each annotator is a current PhD student and an expert in the education technology domain, which supports the reliability of the evaluations during the real-world deployment of the ASAG model in the Learning Analytics course.

The average scores exceeded 4.5 out of 5 for both the Feedback Quality/Correctness and Emotional Impact aspects, indicating that the ASAG model’s feedback is reliable and effective in a real-world setting. Additionally, the model’s predicted output labels achieved an accuracy of 92.5%. The three annotators achieved a Fleiss’ Kappa score of 0.83 (almost perfect agreement) for the Feedback Quality/Correctness aspect, indicating strong agreement among them.
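For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss’ Kappa formula for a fixed number of raters per item. It is not the script used for the reported value; the function name fleiss_kappa and the example ratings are hypothetical, and each 1-5 score is treated as a nominal category, which is an assumption about how the scores were compared.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of category labels per item,
    with the same number of raters (at least two) for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("need the same number (>= 2) of ratings per item")

    counts = [Counter(item) for item in ratings]
    # Mean observed agreement: average over items of the fraction of
    # rater pairs that assigned the same category.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical scores from three raters on three feedback items.
    print(fleiss_kappa([[5, 5, 5], [4, 4, 4], [5, 5, 4]]))
```

Treating ordinal scores as nominal categories penalizes a 4-versus-5 disagreement as much as a 1-versus-5 one; a weighted statistic would be needed to account for the distance between scores.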

Question 1 | Explain the role of different stakeholders in learning analytics.
Reference Answer | Key stakeholders are 1) Educators, 2) Students, 3) Policymakers or Administration Educators who benefit from learning analytics by gaining real-time insights into learner performance, including identifying students who may be underperforming. This information enables educators to improve their teaching activities and methodology to meet the specific needs of individual learners, thereby improving overall teaching effectiveness. For students, learning analytics provides valuable feedback on their performance relative to their peers and progress toward personal learning goals. This feedback serves as a source of motivation and encouragement. It helps them make better decisions for their future career. Policymakers and administrators face complex challenges in the education landscape, including budget constraints and global competition. Learning analytics offers valuable data-driven insights that inform decision-making processes regarding resource allocation, curriculum development, and strategic planning. It helps significantly in improving the quality of education.
Question 2 | We have the process models below for two groups of students divided based on their final grades in a course. These are based on log data in a MOOC. Group A consists of students in the top 30 percentile; the rest are in group B. Thicker arrows mean more frequently followed paths while thinner transitions are less frequent. Based on these models, what conclusions can be drawn regarding their learning patterns and what methodology should be encouraged in the classroom? Refer to figure 3
Reference Answer | Group A seems first to cover learning content more thoroughly - reading, watching videos, and referring/participating in discussions done in conjunction - and then proceeding to practice questions. They also refer back to course material when doing questions and then do assignments. Meanwhile, group B students focus more on practice questions and go to reading/videos in between. There is considerably less interaction with the discussions page as well. This might indicate only referring to content relevant to those questions. Thorough course content consumption should be encouraged.
Table 10: Two questions used to evaluate the EngSAF fine-tuned ASAG model in real-world deployment during an end-semester exam for the ET 623 (Learning Analytics) course at IIT Bombay.
Figure 3: Figure for Question 2 from real-world deployment
Question 1 | What do you mean by empirical equations?
Reference Answer | Equations derived from experiments/observed data. They don’t have a mathematical proof behind them.
Student Answer 1a | The equations which are obtained through extensive experimentation involving constant numerical terms are called empirical equations.
Label | 2 (Correct response)
Feedback | Well done! Your answer demonstrates a clear understanding of the concept of empirical equations. You have correctly highlighted the key characteristics of empirical equations, including their derivation from experiments and the presence of constant numerical terms.
Student Answer 1b | these equations which are formed by empirical (experimental) data.
Label | 1 (Partially Correct response)
Feedback | Your answer is partially correct. While you mentioned the empirical nature of these equations, you missed the key point that they lack a mathematical proof. Try to include this aspect in your definition.
Student Answer 1c | the equation which is derived for specific area of land .
Label | 0 (Incorrect response)
Feedback | Empirical equations are derived from experiments or observed data across a wide range of conditions, not from a specific area of land. Consider broadening your definition to include the general concept of deriving equations from experimental data.
Question 2 | What issues arise when the number of hidden layers is very large?
Reference Answer | When the number of hidden layers is very large, then the model contains a high variance, and overfitting occurs, so it cannot generalize new unseen test data but performs well on training data. Backpropagation also becomes slow.
Student Answer 2a | Issues that arise are: Training becomes slow. If the training data is small, then the model doesn’t generalize well. Also, overfitting occurs,
Label | 2 (Correct response)
Feedback | Well done! Your answer demonstrates a clear understanding of the concept of empirical equations. You have correctly highlighted the key characteristics of empirical equations, including their derivation from experiments and the presence of constant numerical terms.
Student Answer 2b | When number of hidden layers are very large it can cause overfitting and also increases the training time.
Label | 1 (Partially Correct response)
Feedback | Your answer is partially correct. While you mentioned the empirical nature of these equations, you missed the key point that they lack a mathematical proof. Try to include this aspect in your definition.
Student Answer 2c | Error Backproparation arises when number of hidden layers is very large it means during training, error propogated back.
Label | 0 (Incorrect response)
Feedback | Your answer captures only one aspect of the challenges faced when the hidden layers are large. It doesn’t mention issues like high variance, overfitting, and the impact on backpropagation speed. Try to cover all aspects of the concerns associated with a large number of hidden layers.
Table 11: Two examples, each showing a question, a reference answer, and three student answers (a, b, and c) alongside their corresponding labels and the synthetically generated feedback/explanation for the assigned label from the EngSAF dataset.

16 Annotator Details

We engaged three human evaluators to assess the quality and reliability of the EngSAF dataset’s output labels and synthetic feedback. Each evaluator is pursuing a master’s degree, is in their final or pre-final year, and has substantial experience in evaluating short answers. Each annotator was provided with comprehensive guidelines and requirements to execute their assigned task proficiently. Each evaluator was fairly compensated ($5 per hour) for reviewing grades and the synthetically generated feedback. Each feedback item from the 300 randomly sampled data points was scored independently, without access to the other evaluators’ scores. Human evaluation across the aspects described in the Quality Estimate section consistently yields an average score greater than 4.5 out of 5, which indicates the high reliability of the feedback. The same annotators evaluated the correctness of each output label for the sampled data points, achieving an accuracy of 98% and a pairwise average Cohen’s Kappa (κ) score of 0.65 (substantial agreement), indicating the high reliability of the assigned output labels.
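A pairwise average Cohen’s Kappa of the kind reported above can be computed as in the following minimal sketch. It assumes that each annotator’s judgments are stored as a list of categorical values aligned by data point; the function names cohen_kappa and mean_pairwise_kappa and the example judgments are hypothetical and do not come from the released code.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' categorical judgments."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items with identical judgments.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's category frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical judgments from three annotators on four data points.
    annotators = [[0, 1, 2, 2], [0, 1, 2, 1], [0, 1, 2, 2]]
    print(mean_pairwise_kappa(annotators))
```

With three annotators there are three pairs, so the reported value corresponds to the mean of three kappa scores.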