PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Chen, Linqing; Wang, Weilei; Bai, Zilong; Xu, Peng; Fang, Yan; Fang, Jie; Wu, Wentao; Zhou, Lizhi; Zhang, Ruiji; Xia, Yubin; Xu, Chaobo; Hu, Ran; Xu, Licong; Cai, Qijun; Hua, Haoran; Sun, Jing; Liu, Jin; Qiu, Tian; Liu, Haowen; Hu, Meng; Li, Xiuwen; Gao, Fei; Wang, Yufu; Tie, Lin; Wang, Chaochao; Lu, Jianping; Sun, Cheng; Wang, Yixin; Yang, Shengjie; Li, Yuancheng; Jin, Lu; Zhang, Lisha; Bian, Fu; Ye, Zhongkai; Pei, Lidong; Tu, Changyang

arXiv:2406.18045 (cs)

[Submitted on 26 Jun 2024 (v1), last revised 9 Jul 2024 (this version, v3)]

Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmaGPT, a suite of domain-specialized LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus tailored to the bio-pharmaceutical and chemical domains. Our evaluation shows that PharmaGPT surpasses existing general models on domain-specific benchmarks such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. Remarkably, this performance is achieved with a model that has only a fraction, sometimes just one-tenth, of the parameters of general-purpose large models. This advancement establishes a new benchmark for LLMs in the bio-pharmaceutical and chemical fields, addressing the existing gap in specialized language modeling. It also suggests a promising path for enhanced research and development, paving the way for more precise and effective NLP applications in these areas.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.18045 [cs.CL]
	(or arXiv:2406.18045v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.18045

Submission history

From: Linqing Chen
[v1] Wed, 26 Jun 2024 03:43:09 UTC (8,976 KB)
[v2] Wed, 3 Jul 2024 12:56:40 UTC (8,976 KB)
[v3] Tue, 9 Jul 2024 06:52:17 UTC (8,976 KB)
