A Comparative Evaluation of Statistical Product and Service Solutions (SPSS) and ChatGPT-4 in Statistical Analyses

Cureus. 2024 Oct 28;16(10):e72581. doi: 10.7759/cureus.72581. eCollection 2024 Oct.

Abstract

Background: The objective of this study was to assess the accuracy of Chat Generative Pre-trained Transformer 4.0 (ChatGPT-4; OpenAI, San Francisco, CA) compared to Statistical Product and Service Solutions (SPSS; IBM SPSS Statistics for Windows, Armonk, NY) in performing statistical analyses commonly used in medical and dental research.

Methods: The datasets were analysed using SPSS (version 26) and ChatGPT-4. Statistical tests included the independent t-test, paired t-test, ANOVA, chi-square test, Wilcoxon signed-rank test, Mann-Whitney U test, Pearson and Spearman correlation, regression analysis, kappa statistic, intraclass correlation coefficient (ICC), Bland-Altman analysis, and sensitivity and specificity analysis. Descriptive statistics were used to report results, and differences between the two tools were noted.
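
As an illustration of how two of the listed analyses can be recomputed independently of either tool, the following minimal Python sketch calculates a pooled-variance independent t statistic and sensitivity and specificity from a 2x2 table. It is not the authors' procedure: the function names (pooled_t_statistic, sensitivity_specificity) are hypothetical, the input values are invented for illustration, and equal variances are assumed for the t-test.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Student's independent-samples t statistic, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    df = n_a + n_b - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, df


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative data only; these values are not taken from the study.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, dof = pooled_t_statistic(group_a, group_b)
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)
print(f"t = {t_stat:.3f}, df = {dof}")
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A hand-rolled calculation of this kind gives a third reference point when two tools disagree, although it covers only the point estimates and not the p-values or confidence intervals.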

Results: SPSS and ChatGPT-4 produced identical results for the independent t-test, paired t-test, and simple linear regression. In one-way ANOVA, both tools provided consistent F-values, but post-hoc analysis revealed discrepancies in mean differences and confidence intervals. The Pearson chi-square and Wilcoxon signed-rank tests showed variations in p-values and Z-values. The Mann-Whitney U test showed differences in interquartile range (IQR), U, and Z-values. Pearson and Spearman correlations were consistent, although the IQRs reported for Spearman differed. Sensitivity, specificity, and area under the curve (AUC) analyses were consistent, though differences in standard errors and confidence intervals were observed.

Conclusion: ChatGPT-4 produced accurate results for several statistical tests, matching SPSS in simpler analyses. However, discrepancies in post-hoc analyses, confidence intervals, and more complex tests indicate that careful validation is required when using ChatGPT-4 for detailed statistical work. Researchers should exercise caution and cross-validate results with established tools such as SPSS.
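
One way such cross-validation might be carried out is sketched below in Python: the statistics reported by a reference tool and by a second tool are compared value by value, and any that differ beyond a tolerance are flagged for review. This is an assumption-laden illustration, not a method described in the study; the function name find_discrepancies, the tolerance defaults, and all numbers are hypothetical.

```python
import math


def find_discrepancies(reference, candidate, rel_tol=1e-3, abs_tol=1e-6):
    """Return statistics whose values differ between two tools' outputs.

    Both arguments map a statistic's name to its numeric value. A name
    missing from `candidate` is reported with None as its value.
    """
    mismatches = {}
    for name, ref_value in reference.items():
        if name not in candidate:
            mismatches[name] = (ref_value, None)
            continue
        cand_value = candidate[name]
        if not math.isclose(ref_value, cand_value,
                            rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[name] = (ref_value, cand_value)
    return mismatches


# Made-up numbers for illustration; not results from the study.
reference_output = {"F": 4.210, "p": 0.0231, "ci_lower": -1.52}
candidate_output = {"F": 4.210, "p": 0.0231, "ci_lower": -1.47}

flagged = find_discrepancies(reference_output, candidate_output)
for name, (ref_value, cand_value) in flagged.items():
    print(f"{name}: reference={ref_value}, candidate={cand_value}")
```

The tolerance should reflect the rounding used in each tool's output; a value that is too tight flags harmless rounding differences, while one that is too loose can hide a genuine disagreement in a confidence bound.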

Keywords: artificial intelligence; data analysis; machine learning; natural language processing; software comparison; statistical software.