Despite numerous studies on the magnitude of differential item functioning (DIF), existing DIF detection methods often define effect sizes inconsistently and fail to adequately account for testing conditions. To address these limitations, this study introduces the unified M-DIF model, which defines the magnitude of DIF as the difference in item difficulty parameters between the reference and focal groups. The M-DIF model can incorporate various DIF detection methods and test conditions into a single quantitative model. A pretraining approach was employed to leverage a sufficiently representative large sample as the training set and to ensure the model's generalizability; once the pretrained model is constructed, it can be applied directly to new data. Specifically, the training dataset comprised 144 combinations of test conditions and 144,000 potential DIF items, each characterized by 29 statistical metrics, and XGBoost was adopted for modeling. Results show that, in terms of root mean square error (RMSE) and bias, the M-DIF model outperforms the baseline model on both validation sets: one with test conditions consistent with the training set and one with inconsistent conditions. Across all 360 combinations of test conditions (144 consistent and 216 inconsistent with the training set), the M-DIF model yields lower RMSE in 357 cases (99.2%), demonstrating its robustness. Finally, we provide an empirical example to illustrate the practical feasibility of implementing the M-DIF model.
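For readers who want a concrete picture of the pretrain-then-apply workflow described above, the minimal sketch below fits an XGBoost regressor on item-level statistics and evaluates its predicted DIF magnitudes on new data with RMSE and bias. The feature layout, hyperparameters, and synthetic data are illustrative assumptions only and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): a pretrained XGBoost regressor mapping
# item-level DIF statistics and test-condition features to a DIF magnitude
# (difference in item difficulty between reference and focal groups).
# All feature names and the synthetic data below are illustrative assumptions.
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical design: each row is one potential DIF item under some test
# condition, described by a vector of statistical metrics (29 per item in the paper).
n_train, n_new, n_metrics = 10_000, 1_000, 29
X_train = rng.normal(size=(n_train, n_metrics))
y_train = rng.normal(size=n_train)          # stand-in for true DIF magnitudes
X_new = rng.normal(size=(n_new, n_metrics))
y_new = rng.normal(size=n_new)

# "Pretraining": fit once on a large, representative training set ...
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# ... then apply the fixed model directly to new data and score it with RMSE and bias.
pred = model.predict(X_new)
rmse = np.sqrt(mean_squared_error(y_new, pred))
bias = np.mean(pred - y_new)
print(f"RMSE = {rmse:.3f}, bias = {bias:+.3f}")
```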
Keywords: bias; differential item functioning (DIF); magnitude; pretrained model; root mean square error (RMSE).