\addbibresource{sample.bib}

MTFinEval: A Multi-domain Chinese Financial Benchmark with Eurypalynous Questions

Xinyu Liu1, Ke Jin1
1Beihang University
Abstract

With the emergence of more and more economy-specific LLMs, how to determine whether they can be safely deployed in production has become a pressing problem. Previous research has primarily focused on evaluating the performance of LLMs within specific application scenarios. However, these benchmarks cannot reflect a model's theoretical level or generalization ability, and their aging datasets are increasingly unsuitable for problems in real scenarios. In this paper, we compile a new benchmark, MTFinEval, focusing on LLMs' basic knowledge of economics, which can serve as a stable basis for judgment. To examine theoretical knowledge as exclusively as possible, MTFinEval is built from foundational questions drawn from university textbooks and exam papers in economics and management majors. Because the overall performance of LLMs does not depend solely on one subdiscipline of economics, MTFinEval comprises 360 questions refined from six major disciplines of economics and therefore reflects capabilities more comprehensively. Experimental results show that all evaluated LLMs perform poorly on MTFinEval, which indicates that a benchmark built on basic knowledge effectively exposes gaps in current models. Our research not only offers guidance for selecting the appropriate LLM for specific use cases, but also argues for improving the rigor and reliability of LLMs starting from the basics.

1 Introduction

In the realm of economics, LLMs offer economists and policymakers unique insights \parencite{Li2023CFGPTCF}, thereby enhancing the efficiency of economic industry development \parencite{Zha2023TableGPTTU,Zhang2024TableLLMET}. For example, CatMemo \parencite{Cao2024CatMemoAT} uses an LLM for stock trading, FINANCEBENCH \parencite{Islam2023FinanceBenchAN} uses LLMs for company earnings analysis, and FinPT \parencite{Yin2023FinPTFR} uses LLMs for risk forecasting. Financial data grows exponentially in both volume and complexity. However, task-oriented benchmarks \parencite{Lei2023CFBenchmarkCF,Araci2019FinBERTFS}, which are composed of specific past events, are gradually drifting away from actual conditions. Whether it is a drastic policy change between countries or a disruptive innovation in new technologies such as AI, such events lead to dramatic changes in economic phenomena. Furthermore, the comprehensive capabilities of LLMs in the field of economics cannot be fully assessed by a single task requirement. We therefore create a benchmark, MTFinEval, to examine the theoretical knowledge of LLMs across a wide range of economics fields.

In this article, we narrow our focus to the cognitive level of theoretical knowledge within LLMs, because theoretical knowledge is the most fundamental requirement. It not only shapes the model's grasp of problems but also forms the basis for task execution. Our analysis breaks down the comprehensive capabilities of LLMs into numerous sub-aspects. This approach aims to identify the reasons behind the subpar performance of LLMs when tackling complex tasks, offering a clearer direction for understanding their limitations.

The benchmark, MTFinEval, covers six fields (management, accounting, e-commerce, strategic management of enterprises, macroeconomics, and microeconomics) and spans multiple dimensions such as economic indicators, financial technology, financial law, and economic phenomena. For all questions, the model's ability is examined in question-and-answer form, and its accuracy is calculated directly. Through this dataset, we aim to assess the multi-faceted capabilities \parencite{Liu2019MultiTaskDN,Ma2018ModelingTR,Kendall2017MultitaskLU,Liu2018EndToEndML} of LLMs in the field of finance, including but not limited to data understanding, logical reasoning, and situational adaptation.

Figure 1: Answer pairs on different question types.

2 Related Work

In finance and economics, the application of Large Language Models (LLMs) is emerging as an essential tool for in-depth market analysis \parencite{Zhang2023XuanYuan2A,Cornucopia-LLaMA-Fin-Chinese,paeg}, precise investment advice, and effective risk assessment \parencite{ganlm}. This paper aims to delve into the specialized development and potential of LLMs within the financial domain, highlighting the significance of systematic evaluation and theoretical integration.

Firstly, BloombergGPT \parencite{Wu2023BloombergGPTAL}, an LLM tailored for the financial sector, exhibits significant potential in performing financial natural language processing (NLP) tasks. Its capability to manage complex financial data and tasks, such as analyzing market trends and generating investment reports \parencite{Koa2024LearningTG}, illustrates the promise of LLM applications in finance. Similarly, FinMA, through instruction tuning, adeptly handles a variety of financial NLP tasks \parencite{Sinha2020ImpactON,Zhou2021TradeTE,Maia2018WWW18OC,CIKM2020MAEC}, including sentiment analysis, event detection \parencite{cortis-etal-2017-semeval}, and risk assessment \parencite{Alvarado2015DomainAO,10386611}, further demonstrating the broad application potential of LLMs in the financial sector.

Nonetheless, systematic evaluation of LLMs' performance in finance is crucial. The introduction of the EconNLI dataset \parencite{Guo2024EconNLIEL}, designed to assess LLMs' knowledge and competence in economic reasoning, exposes potential deficiencies in LLMs' economic reasoning abilities \parencite{Park2023MachineLB}. This underscores the necessity for thorough evaluation of these models. Additionally, models that incorporate economic theory excel in financial analysis and forecasting, emphasizing the importance of integrating economic knowledge when developing financial LLMs. Evaluating LLMs' performance in specialized areas is crucial for gauging their true capabilities. While existing assessments often concentrate on general NLP tasks relevant to finance, such as causal reasoning, text classification, and predictive analytics, there is a dearth of systematic evaluations tailored to the financial and economic sectors. It becomes evident that a deep grasp of economic principles significantly enhances LLMs' financial task performance \parencite{Zhai2024ActionsSL}.

The application of multimodal learning in financial forecasting also presents new opportunities. For instance, MONOPOLY \parencite{Mathur2022MONOPOLYFP} leverages multimodal cues from monetary policy meeting videos for financial forecasting, offering a novel perspective on understanding market dynamics. Concurrently, the exploration of cross-lingual and zero-shot learning capabilities \parencite{Jin2024ZeroShotCR}, such as the zero-shot cross-lingual named entity recognition method introduced by CROP \parencite{Yang2022CROPZC,Yang2020AlternatingLM}, opens up new possibilities for multilingual \parencite{um4} financial applications and aids in processing financial documents in various languages.

In conclusion, LLMs in the financial and economic fields demonstrate unique value and potential. To enhance the performance and reliability of these models \parencite{Huang2022MBCTTF}, a deeper understanding and integration of economic theory \parencite{10.1007/978-3-030-45439-5_46,Elman1993LearningAD,Fan2018LearningTT} is required.

3 MTFinEval Benchmark

3.1 Dataset Statistics

Our research is dedicated to strengthening the capacity of these models to tackle economic challenges fundamentally and systematically.

The MTFinEval dataset comprises 360 university economics questions spanning six major subtopics: macroeconomics, microeconomics, accounting, management, e-commerce, and strategic management. Each subtopic includes single-choice, multiple-choice, and true or false questions. All questions and answers are manually extracted from college textbooks \parencite{Li2023TextbooksAA} and exam papers to ensure they are foundational and introductory. The data collection process was scrutinized through a series of systematic checks. We began by manually entering the paper questions into CSV files, ensuring the completeness of each entry.

Subsequently, to verify the accuracy of the questions and answers, we enlisted six economists, each highly proficient in one or more of the subfields. They were tasked with reviewing the clarity of the descriptions and the correctness of the answers, for which they were compensated at a rate of \$3 per question. When a question was deemed questionable during the expert review stage, all experts discussed it collectively to determine the correct answer and to decide whether the question should be omitted. The types and numbers of questions for all subtopics are shown in Figure 2.
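
As an illustration of the storage step described above, the following minimal sketch reads such a CSV file and tallies the questions by subtopic and question type. The column names (`subject`, `type`, `question`, `answer`) and the placeholder rows are hypothetical; the paper does not specify the actual schema.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout with placeholder rows; the real MTFinEval schema
# and question texts are not given in the paper.
SAMPLE_CSV = """subject,type,question,answer
macroeconomics,true_false,Placeholder true or false statement.,yes
accounting,single_choice,Placeholder single-choice question?,B
management,multiple_choice,Placeholder multiple-choice question?,A B C
"""

REQUIRED_FIELDS = ("subject", "type", "question", "answer")


def count_questions(csv_text):
    """Return a Counter keyed by (subject, type) for every complete row."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip incomplete entries, mirroring the completeness check on each entry.
        if not all((row.get(field) or "").strip() for field in REQUIRED_FIELDS):
            continue
        counts[(row["subject"], row["type"])] += 1
    return counts


if __name__ == "__main__":
    for key, n in sorted(count_questions(SAMPLE_CSV).items()):
        print(key, n)
```

A tally of this kind is one way the per-subtopic counts reported in the figure could be cross-checked against the stored files.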

Figure 2: Types and numbers of questions.

3.2 Formulation

For a problem type $T$, given a large model $\mathcal{M}$ and the corpus $d$ used for training and fine-tuning it, the model takes a question $q$ as input and returns an answer $a$ \parencite{Narayan2018DontGM}.

In the zero-shot scenario for true or false statements, the objective function is defined as follows:

\begin{equation}
r = \underset{a \in \{0,1\}}{\arg\max}\; P(a \mid q^{T}; \mathcal{M}, d^{T})
\end{equation}

Similarly, for single-choice questions in the zero-shot scenario, the objective function is specified as:

\begin{equation}
r = \underset{a \in \{A,B,C,D\}}{\arg\max}\; P(a \mid q^{T}; \mathcal{M}, d^{T})
\end{equation}

Lastly, for multiple-choice questions in the zero-shot scenario, the objective function is:

\begin{equation}
r = \underset{a \in \{A,B,C,AB,AC,BC,ABC,\ldots\}}{\arg\max}\; P(a \mid q^{T}; \mathcal{M}, d^{T})
\end{equation}

On the surface, choice questions and true or false questions are both classification problems. Fundamentally, however, for a decoder-only model the underlying task is generation. During training, minimizing cross-entropy amounts to maximum likelihood estimation, which drives the model to reproduce sentences similar to those in the corpus. Therefore, even though the model only selects an option or makes a judgment, it still draws on the relevant corpus content seen during training.
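
The three objectives differ only in the candidate answer set over which the argmax is taken. The following is a minimal sketch, assuming the model's scores are already available as a dictionary of log-probabilities. The function names, the default option letters A to D, and the dictionary interface are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def candidate_answers(question_type, options=("A", "B", "C", "D")):
    """Enumerate the answer set used in the argmax for each question type."""
    if question_type == "true_false":
        return ["0", "1"]
    if question_type == "single_choice":
        return list(options)
    if question_type == "multiple_choice":
        # Every non-empty subset of the options: A, B, ..., AB, AC, ..., ABCD.
        return [
            "".join(subset)
            for size in range(1, len(options) + 1)
            for subset in combinations(options, size)
        ]
    raise ValueError(f"unknown question type: {question_type}")


def predict(log_probs, question_type, options=("A", "B", "C", "D")):
    """Return r = argmax_a P(a | q; M, d) over the candidate set.

    `log_probs` maps an answer string to a hypothetical model log-probability;
    candidates missing from the mapping are treated as having probability zero.
    """
    candidates = candidate_answers(question_type, options)
    return max(candidates, key=lambda a: log_probs.get(a, float("-inf")))
```

With four options the multiple-choice candidate set has 15 elements, compared with 4 for single-choice and 2 for true or false questions, so a uniformly random guess is correct far less often on multiple-choice questions.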

4 Experiments

4.1 Experimental Setup

To simulate the most direct use cases, we prepend a brief prompt to each question. The prompt describes the question type, and all evaluation is conducted under zero-shot conditions \parencite{Brown2020LanguageMA}. The prompts for the three question types are similar to the following multiple-choice prompt \parencite{Wei2022ChainOT,xcot,low_resource_template,hlt_mt,Tai2023ExploringCS}: ``You are a financial knowledge expert, please combine your financial knowledge to answer the following multiple choice, note that there is not only one answer to multiple choice, to think and reason whether each option is correct but do not output thinking process, directly return the option letter in markdown format, each option in the answer is separated by a space.''

For true or false questions, the LLM is instructed to limit its answer to ``yes'', ``wrong'', or ``do not know''. Keywords are then extracted from the response and mapped onto these three options.

In the realm of traditional economics, LLMs are typically based on the Llama series \parencite{Touvron2023Llama2O,Dubey2024TheL3}. To promote decentralization, the models participating in this evaluation are exclusively open-source models released in 2024.
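
The answer-extraction and scoring steps can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the helper names, the keyword lists, and the exact-match criterion for multiple-choice answers are all assumptions.

```python
import re

# Assumed keyword lists for mapping free-form replies onto the three allowed
# options. Order matters: "do not know" is checked first, then "wrong".
JUDGMENT_KEYWORDS = {
    "do not know": ("do not know", "don't know", "not sure"),
    "wrong": ("wrong", "incorrect", "false", "no"),
    "yes": ("yes", "correct", "true", "right"),
}


def map_judgment(reply):
    """Map a model reply to 'yes', 'wrong', or 'do not know'."""
    text = reply.lower()
    for label, keywords in JUDGMENT_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            return label
    # Replies without any recognizable keyword are treated as abstentions.
    return "do not know"


def parse_choices(reply):
    """Extract upper-case option letters A-D from a reply such as 'A C'."""
    return frozenset(re.findall(r"\b[A-D]\b", reply))


def accuracy(predictions, references):
    """Percentage of predictions that exactly match their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

Comparing multiple-choice answers as sets makes the score independent of the order in which the model lists the letters. Whether partially correct selections receive credit is not stated in the text, so this sketch gives them none.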

4.2 Results

\begin{table}
\centering
\begin{tabular}{lrrrrrrr}
\hline
Model & Macroeconomics & Microeconomics & Accounting & Management & E-commerce & Strategic management & Comprehensive \\
\hline
Qwen/Qwen2-7B-Instruct & 55.0 & 66.67 & 80.0 & 51.67 & 65.0 & 61.67 & 63.33 \\
Qwen/Qwen2-1.5B-Instruct & 23.33 & 30.0 & 50.0 & 40.0 & 50.0 & 46.67 & 40.0 \\
Qwen/Qwen1.5-7B-Chat & 40.0 & 50.0 & 66.67 & 45.0 & 60.0 & 51.67 & 52.22 \\
THUDM/glm-4-9b-chat & 53.33 & 55.0 & 70.0 & 46.67 & 66.67 & 56.67 & 58.06 \\
THUDM/chatglm3-6b & 26.67 & 35.0 & 35.0 & 30.0 & 50.0 & 35.0 & 35.28 \\
01-ai/Yi-1.5-9B-Chat-16K & 48.33 & 55.0 & 80.0 & 48.33 & 60.0 & 56.67 & 58.06 \\
01-ai/Yi-1.5-6B-Chat & 40.0 & 53.33 & 58.33 & 48.33 & 68.33 & 53.33 & 53.61 \\
google/gemma-2-9b-it & 56.67 & 50.0 & 56.67 & 50.0 & 58.33 & 65.0 & 56.11 \\
meta-llama/Meta-Llama-3-8B-Instruct & 30.0 & 33.33 & 45.0 & 38.33 & 38.33 & 55.0 & 40.0 \\
\hline
\end{tabular}
\caption{Accuracy (\%) of different models on different sub-disciplines.}
\end{table}
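
The comprehensive column in Table 1 is consistent with the unweighted mean of the six subject accuracies. The short check below recomputes it from the values copied from the table. That the authors computed the column this way is an inference from the numbers, not something the text states.

```python
# Subject accuracies (%) and reported comprehensive score, copied from Table 1.
# Subject order: macroeconomics, microeconomics, accounting, management,
# e-commerce, strategic management.
TABLE_ROWS = {
    "Qwen/Qwen2-7B-Instruct": ([55.0, 66.67, 80.0, 51.67, 65.0, 61.67], 63.33),
    "Qwen/Qwen2-1.5B-Instruct": ([23.33, 30.0, 50.0, 40.0, 50.0, 46.67], 40.0),
    "Qwen/Qwen1.5-7B-Chat": ([40.0, 50.0, 66.67, 45.0, 60.0, 51.67], 52.22),
    "THUDM/glm-4-9b-chat": ([53.33, 55.0, 70.0, 46.67, 66.67, 56.67], 58.06),
    "THUDM/chatglm3-6b": ([26.67, 35.0, 35.0, 30.0, 50.0, 35.0], 35.28),
    "01-ai/Yi-1.5-9B-Chat-16K": ([48.33, 55.0, 80.0, 48.33, 60.0, 56.67], 58.06),
    "01-ai/Yi-1.5-6B-Chat": ([40.0, 53.33, 58.33, 48.33, 68.33, 53.33], 53.61),
    "google/gemma-2-9b-it": ([56.67, 50.0, 56.67, 50.0, 58.33, 65.0], 56.11),
    "meta-llama/Meta-Llama-3-8B-Instruct": ([30.0, 33.33, 45.0, 38.33, 38.33, 55.0], 40.0),
}


def comprehensive(scores):
    """Unweighted mean of the per-subject accuracies."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for model, (scores, reported) in TABLE_ROWS.items():
        # The table entries are themselves rounded, so compare with a tolerance.
        agrees = abs(comprehensive(scores) - reported) < 0.01
        print(f"{model}: recomputed={comprehensive(scores):.3f} reported={reported} agrees={agrees}")
```

An unweighted mean equals overall accuracy only if every subject contributes the same number of questions, which would be the case if the 360 questions are split evenly across the six subjects.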

Because the evaluated LLMs differ little in architecture, the differences among their answers mainly reveal the extent and quality of the training data used. This insight can then be used to reasonably gauge each model's capability level.

Specialization in Subjects: Certain models demonstrate strength in specific domains. For instance, the 01-ai/Yi-1.5-9B-Chat-16K \parencite{Young2024YiOF} model ties for the highest accuracy in accounting (80.0) and scores 60.0 in e-commerce and 56.67 in strategic management, suggesting comparatively rich training data in these areas. This proficiency may stem from its capacity to interpret complex financial data, manage online platforms efficiently, and formulate strategic business plans.

Macroeconomic Understanding: The THUDM/glm-4-9b-chat \parencite{Zeng2024ChatGLMAF,Bommasani2021OnTO} model's macroeconomics score of 53.33, among the three highest in Table 1, indicates a solid grasp of economic trends and policy impacts. This proficiency may be due to its exposure to a diverse range of economic data, enabling precise assessments.

Strategic Management Insight: The google/gemma-2-9b-it model's strategic management score of 65.0, the highest in Table 1, reflects its potential to support business planning. Its ability to integrate information from various sources to offer valuable strategic insights is likely the reason for this performance.

Overall Performance: The Qwen/Qwen2-7B-Instruct \parencite{Bai2023QwenTR,Yang2024Qwen2TR} model leads in overall performance with a score of 63.33, marking it as a leading contender among the evaluated models for financial language tasks. Its comprehensive score not only reflects its proficiency in individual subjects but also suggests a robust and well-rounded training approach that has equipped it to handle a variety of economic analyses. This performance indicates its potential to become the new base model \parencite{Nori2023CanGF} for financial LLMs \parencite{Chen2023TigerBotAO}, possibly outperforming or replacing previous models such as Llama \parencite{Cui2023EfficientAE} in certain applications.

Reflection on Weaker Performances: Models such as meta-llama/Meta-Llama-3-8B-Instruct, which scored lower overall (40.0), may have been trained on less diverse or less pertinent data for certain subjects, or they may have architectural limitations that hinder effective processing and analysis of economic information.

Implications for Model Improvement: To enhance performance in specific subjects, models should be trained with more domain-specific data. Moreover, ongoing updates and refinements to the model architecture, informed by performance data, can further augment capabilities.

In conclusion, LLM performance in economic subjects mirrors the quality of the training data and architecture. While specialized models are better suited for particular applications, a well-rounded model like Qwen/Qwen2-7B-Instruct serves as a multifaceted tool for a range of economic analyses. Ongoing refinement of training data and model architecture is essential for advancing capabilities in economics.

5 Conclusion

Figure 3: Radar charts by subject and question type.

In this study, we introduced MTFinEval, a comprehensive multi-domain benchmark consisting of 360 questions across six major economic disciplines. It provides a rigorous tool for evaluating the fundamental economic knowledge of LLMs, with the aim of ensuring that an LLM still produces sound analysis in a rapidly changing environment. The experimental results show that current LLMs generally perform poorly on this benchmark, highlighting significant gaps in their theoretical understanding of economics. This research contributes to the field by offering a robust evaluation framework that can guide future improvements in LLMs, ensuring they are better equipped for complex and dynamic economic environments.

\printbibliography