Enhancing semantical text understanding with fine-tuned large language models: A case study on Quora Question Pair duplicate identification

PLoS One. 2025 Jan 10;20(1):e0317042. doi: 10.1371/journal.pone.0317042. eCollection 2025.

Abstract

Semantical text understanding is central to natural language processing (NLP), and numerous datasets, such as Quora Question Pairs (QQP), have been devised to evaluate it. In our previous study, we developed a Siamese Convolutional Neural Network (S-CNN) that achieved an F1 score of 82.02% (95% C.I.: 81.83%-82.20%). Given the growing attention toward large language models (LLMs) like ChatGPT, we aimed to explore their effectiveness in text similarity tasks. In this research, we used 5 pretrained LLMs, applied several adaptation approaches (prompt engineering, n-shot learning, and supervised fine-tuning with low-rank adaptation [LoRA]), and compared their performance using the F1 score. To ensure a fair comparison, we followed our previous study's design and dataset, employing 10-fold cross-validation for supervised model training and evaluation. Additionally, we conducted a secondary study that introduced a more recent, larger LLM with 70B parameters and compared it with the 7B model on the GLUE benchmark; both models were fine-tuned with the corpus. The fine-tuned LLaMA model with 7B parameters (qLLaMA_LoRA-7B), trained on a 100,000-example QQP corpus, yielded the best results, achieving an F1 score of 84.9% (95% C.I.: 84.13%-85.67%). It outperformed Alpaca_LoRA-65B (fine-tuned from LLaMA-65B) (F1: 64.98% [64.72%-65.25%]; P<0.01) and improved by about 3 percentage points over our previously published best model, S-CNN. The fine-tuned LLaMA3.1-70B model (qLLaMA3.1_LoRA-70B; F1: 74.4%) outperformed qLLaMA_LoRA-7B (F1: 71.9%) on the GLUE benchmark. The study demonstrates an effective LLM fine-tuning framework and highlights the importance of fine-tuning LLMs for improved performance. Task-specific supervised fine-tuning outperformed larger pretrained models with or without n-shot learning; moreover, fine-tuning a larger LLM improved performance further compared with fine-tuning a smaller one. Our LLM-based fine-tuning framework may improve various document similarity tasks, such as matching resumes with job descriptions, recommending subject-matter experts, or identifying potential reviewers for grant proposals or manuscript submissions.
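
The evaluation described in the abstract (F1 score under 10-fold cross-validation, reported with a 95% confidence interval) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `f1_score`, `k_fold_indices`, and `mean_ci` are hypothetical, the folds are contiguous and unshuffled for simplicity, and the interval uses a normal approximation over per-fold scores, which is an assumption because the abstract does not state how its intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = duplicate pair, 0 = not duplicate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_ci(scores, z=1.96):
    """Mean of per-fold scores with a normal-approximation 95% interval (assumed method)."""
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width
```

In use, each fold returned by `k_fold_indices` would serve once as the held-out test set while a model is trained on the remaining nine; the ten resulting F1 values are then passed to `mean_ci`.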

MeSH terms

  • Humans
  • Language
  • Natural Language Processing*
  • Neural Networks, Computer*
  • Semantics

Grants and funding

The Children’s Hospital of Philadelphia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.