NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Mo Li1,2, Songyang Zhang1‡, Yunxin Liu2, Kai Chen1‡
1Shanghai AI Laboratory 2Tsinghua University
{limo,zhangsongyang}@pjlab.org.cn
Abstract

In evaluating the long-context capabilities of large language models (LLMs), identifying content relevant to a user’s query from original long documents is a crucial prerequisite for any LLM to answer questions based on long text. We present NeedleBench, a framework consisting of a series of progressively more challenging tasks for assessing bilingual long-context capabilities. It spans multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and different depth ranges, allowing the strategic insertion of critical data points in different text depth zones to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. We use the NeedleBench framework to assess how well the leading open-source models can identify key information relevant to the question and apply that information to reasoning in bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge (ATC) to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks, providing a simple method for evaluating LLMs in complex long-context situations. Our results suggest that current LLMs have significant room for improvement in practical long-context applications, as they struggle with these logical reasoning challenges. All code and resources are available at OpenCompass: https://github.com/open-compass/opencompass.

‡ Corresponding author.

1 Introduction

The capability of LLMs to process long texts is particularly crucial across various situations. LLMs can rapidly identify and summarize relevant information within lengthy documents, making them invaluable for legal document retrieval, academic research, and aggregating business intelligence, among other applications.

Modern LLMs have recently been developed to support longer context windows. For example, GPT-4 Turbo (OpenAI, 2023) offers long-context capabilities up to 128K tokens. Similarly, Claude 2.1 (Anthropic, 2024a) has been enhanced to manage contexts up to 200K tokens, with the Claude 3 series (Anthropic, 2024b) being specifically engineered to digest inputs surpassing 1 million tokens. Furthermore, Gemini 1.5 supports context windows of millions of tokens (Gemini Team, 2024). The recent open-source models GLM4-9B-Chat and InternLM2.5-7B-Chat also support a context window of 1 million tokens. As models accommodate longer text lengths, verifying their comprehension of details within the text becomes increasingly essential.

Considering the critical role of LLMs in handling long texts, numerous approaches have been suggested to evaluate their long-context capabilities. Existing datasets, such as the LongBench dataset (Bai et al., 2023), provide a bilingual (Chinese and English) benchmark for long text comprehension, featuring tasks with lengths generally ranging from 5k to 15k tokens. However, accurately assessing the performance of long-context LLMs, especially at the 1M token level, continues to be a significant challenge.

There are some initial attempts in this direction. Mohtashami & Jaggi (2023) introduce the passkey testing approach, which embeds passkeys (key pieces of information) in repetitive, similar texts and queries for this information at the end, to evaluate the basic information extraction capability of the LLaMA-7B model at text lengths of 32K. InfiniteBench by Zhang et al. (2023) extends the passkey method to lengths beyond 100K, inserting passkeys at various depths within long texts. Kamradt (2023) develops the Needle In A Haystack (NIAH) test, which uses a more diverse set of non-repetitive personal essays as filler information unrelated to the passkeys. This test further extends the context window to 200K and performs stress tests on the Claude 2.1 and GPT-4 Turbo (GPT-4 1106-preview version) models. The results from Kamradt (2023) and Anthropic (2024b) indicate that leading models are generally capable of passing the NIAH test with excellent performance.

Figure 1: Our ATC test, even with a modest context length of 2K, reveals surprising limitations of these leading models in real-world complex long-context questions.

Does passing the "needle-in-a-haystack" test (extracting key information from lengthy texts) really indicate that LLMs can handle complex real-world long-context problems? Typically, real-world tasks require models to retrieve and integrate multiple pieces of dispersed, question-related information rather than just a single piece. For instance, in legal case analysis, a model must extract various relevant facts and legal provisions from a case file and synthesize this information to answer specific legal questions. Similarly, in business intelligence analysis, a model may need to aggregate information on market trends, competitor strategies, and consumer behaviors from multiple reports to provide a comprehensive market analysis. This requires models that can identify multiple key information points and integrate a wide range of relevant content within the text, offering in-depth and accurate analyses.

Therefore, improving the long-context ability of LLMs for realistic applications requires not only accurate information retrieval but also strong reasoning capabilities. To address the limitations of existing long-context information extraction evaluation methods and their misalignment with real-world application scenarios, we introduce the NeedleBench dataset. NeedleBench comprises a series of advanced long-context information capability evaluation methods, aimed at providing a comprehensive and targeted assessment of models’ abilities to extract and analyze information within the context of long texts. Furthermore, we develop the Ancestral Trace Challenge (ATC) test as a simplified proxy for measuring multi-step logical reasoning. Our findings demonstrate that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens (see Figure 1). Our major contributions are as follows:

  • We introduce NeedleBench, a customizable dataset framework that includes tasks for evaluating the bilingual long-context capabilities of LLMs across multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k and beyond) and various text depth ranges, allowing the strategic insertion of critical data points in different text depth zones to rigorously test both retrieval and reasoning capabilities of models in a diverse context.

  • We propose the ATC to simulate complex long-context tasks in real-world scenarios, providing a simple method for evaluating LLMs in complicated long-context situations. Through the experimental results of the ATC, we discover that all current LLMs have significant room for improvement in practical long-context applications, struggling with the complexity of real-world long-context tasks.

  • We conduct a fine-grained evaluation and analysis of the performance of mainstream models in identifying key question-relevant information and reasoning. Additionally, all reproducible scripts, code, and datasets are provided in OpenCompass Contributors (2023).

2 Tasks and Datasets

Figure 2: NeedleBench Framework

To construct a dataset capable of comprehensively evaluating models’ capabilities in long-context information extraction and reasoning, we design a series of progressively more challenging test schemes. We show the composition of the NeedleBench framework in Figure 2. Specifically, we divide the overall task into three subtasks: "Single-Needle Retrieval Task (S-RT)" (where a single piece of information is inserted at one depth), "Multi-Needle Retrieval Task (M-RT)", and "Multi-Needle Reasoning Task (M-RS)". Each subtask includes the design of the key information (needle design) and the unrelated text (haystack design). In addition, we design the Ancestral Trace Challenge (ATC) to test the ability of LLMs to handle multi-step logical challenges that are likely present in real-world long-context scenarios. The instantiation of each task is detailed in Appendix C, where actual examples of each task are provided.

2.1 NeedleBench Tasks

We first introduce the subtasks within NeedleBench.

  • Single-Needle Retrieval Task (S-RT): Tests LLMs’ ability to recall a single piece of key information inserted at various positions in a long text, highlighting their precision in navigating and recalling a single detail within extensive texts.

  • Multi-Needle Retrieval Task (M-RT): Explores LLMs’ ability to retrieve multiple pieces of related information scattered across a lengthy text, simulating complex real-world queries that require extracting several data points from comprehensive documents.

  • Multi-Needle Reasoning Task (M-RS): Evaluates LLMs’ ability for complex reasoning by extracting multiple pieces of information from long texts and using them to logically answer questions that demand an integrated understanding of, and reasoning over, various text segments.

2.2 Ancestral Trace Challenge (ATC)

In extreme scenarios, we envision models capable of recalling and analytically understanding every detail of every sentence within the input long text on demand. Thus, we introduce the Ancestral Trace Challenge (ATC) to simulate complex long-context tasks in real-world scenarios.

In ATC experiments, we construct the problem using a series of simple first-order logical inferences, forming an information chain that LLMs need to fully understand in order to answer the question. Forgetting any key information in the long context directly results in the inability of the LLMs to provide the correct answer. LLMs are asked to choose the one correct answer from four given options. This method can be extended to the more challenging logical relationships studied by Besta et al. (2024).

In practical scenarios, longer text lengths inevitably mean that LLMs need to understand more complex layers of logical relationships. Using the ATC test, we can easily stress test the multi-step reasoning capabilities of LLMs. This methodology is scalable and can be extended to extensive context lengths. While the NIAH test aims to stress test the information retrieval capabilities of LLMs within long texts, the ATC test specifically aims to stress test their reasoning abilities in similar long-context scenarios. Actual examples of the ATC test can be found in the Appendix, Figures 21 and 22.
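To make the idea of an information chain concrete, the following is a minimal sketch of how an ATC-style item could be generated. It is an illustration only: the function name `build_atc_question`, the sentence template, the question wording, and the input format are all hypothetical and are not taken from the released NeedleBench code. Each link in the chain becomes one needle, and the question can only be answered by following every link.

```python
import random


def build_atc_question(names, relations, rng):
    """Build one ATC-style item from a chain of kinship facts.

    names: at least four distinct names, ordered youngest to eldest
           (hypothetical input format, not taken from the paper).
    relations: relation words such as "father" or "mother".
    rng: a random.Random instance, so results are reproducible.
    """
    if len(names) < 4:
        raise ValueError("need at least four names to form four options")
    # One needle per link in the chain: "<elder> is <younger>'s <relation>."
    needles = [
        f"{names[i + 1]} is {names[i]}'s {rng.choice(relations)}."
        for i in range(len(names) - 1)
    ]
    rng.shuffle(needles)  # scatter the facts so their order gives no hint
    answer = names[-1]
    # Three distractors drawn from the other people in the chain.
    options = rng.sample(names[:-1], 3) + [answer]
    rng.shuffle(options)
    labels = "ABCD"
    return {
        "needles": needles,
        "question": f"Who is the eldest relative that {names[0]} can trace back to?",
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(answer)],
    }


item = build_atc_question(
    ["Ava", "Ben", "Cleo", "Dan", "Eve"], ["father", "mother"], random.Random(0)
)
print(len(item["needles"]), "needles; correct option:", item["answer"])
```

In this sketch the number of reasoning steps equals the number of needles, so lengthening `names` directly raises the difficulty of the item.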

2.3 Dataset Construction

2.3.1 Needle Design

Figure 3: Distribution of Reasoning Steps in Multi-Needle Reasoning Task

To prevent the model’s inherent knowledge from interfering with its ability to retrieve information, we deliberately design the needles in both the Single-Needle Retrieval (S-RT) and Multi-Needle Retrieval (M-RT) tasks to be abstract and nonexistent in the real world. Models are required to answer questions based solely on the key information provided during the test.

For the Multi-Needle Reasoning task, we construct the needle corpus utilizing the $\mathcal{R}^{4}\mathcal{C}$ dataset (Inoue et al., 2019), an enhancement of the HotpotQA dataset (Yang et al., 2018) that incorporates "Derivation" information detailing each reasoning step needed to answer questions, while also addressing the issue of unclear pronouns present in HotpotQA. The corpus is then translated into Chinese to ensure a high-quality dataset for evaluating the bilingual reasoning capabilities of the models. As illustrated in Figure 3, the majority of reasoning questions involve two or three steps, with fewer questions requiring four or more steps.

2.3.2 Haystack Design

We follow the method described in Kamradt (2023), using the PaulGrahamEssays dataset to extend the prompt to the target length.

In the design of the Chinese Haystack, we utilize the ChineseDomainModelingEval dataset released by Wei et al. (2023) to ensure the diversity and quality of Chinese text sources. This dataset covers a wide range of topics from finance to technology, providing high-quality, up-to-date Chinese articles and a stable benchmark for evaluating the ability of different models to handle domain-specific long texts.
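The interaction between needle and haystack can be sketched as follows. This is a simplified illustration under stated assumptions, not the OpenCompass implementation: tokens are plain list items here (the paper counts tokens with a tokenizer), `insert_needle` and its parameters are hypothetical names, and depth is expressed as a percentage of the filler text.

```python
def insert_needle(haystack_tokens, needle_tokens, depth_percent, context_length):
    """Place a needle at a relative depth inside filler text.

    The result has at most `context_length` tokens: the filler is trimmed
    so that filler plus needle fit, then the needle is spliced in at
    `depth_percent` (0 = start of the text, 100 = end of the text).
    """
    budget = context_length - len(needle_tokens)
    if budget < 0:
        raise ValueError("context_length is too small to hold the needle")
    filler = haystack_tokens[:budget]
    position = round(len(filler) * depth_percent / 100)
    return filler[:position] + needle_tokens + filler[position:]


# Whitespace splitting stands in for a real tokenizer in this sketch.
haystack = ("filler " * 50).split()
needle = "Hidden on Emerald Island is the legendary Stardust Shard.".split()
prompt_tokens = insert_needle(haystack, needle, depth_percent=50, context_length=30)
print(" ".join(prompt_tokens))
```

Sweeping `depth_percent` and `context_length` over a grid yields one test case per depth and length combination.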

3 Experiments

Because of the significant token consumption involved, we primarily evaluate the performance of mainstream open-source LLMs on NeedleBench at token lengths of 4K, 8K, 32K, and 200K. For the ATC experiment, we expand our evaluation to include leading API models, such as GPT-4 Turbo and Claude-3-Opus, and perform multiple logical stress tests across both open-source and API models. The complete list of LLMs evaluated in our research is given in Table 1. Detailed experimental settings for each task can be found in Table 7.

| Series | Models |
| --- | --- |
| Claude | Claude-3-Opus |
| OpenAI | GPT4-Turbo |
| Zhipu AI | GLM4, GLM4-9B-Chat-1M, ChatGLM3-6B, ChatGLM3-6B-32K |
| LLaMA-2 | LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B |
| Baichuan2 | Baichuan2-7B, Baichuan2-13B |
| Gemma | Gemma 2B, Gemma 7B |
| Yi | Yi-6B |
| OrionStarAI | Orion-14B-LongChat, OrionStar-Yi-34B |
| DeepSeek | DeepSeek-67B |
| WizardLM | WizardLM-70B |
| Qwen | Qwen-7B, Qwen-14B, Qwen-72B, Qwen-72B-vLLM, Qwen-1.5-0.5B, Qwen-1.5-1.8B, Qwen-1.5-4B, Qwen-1.5-14B-vLLM, Qwen-1.5-72B-vLLM |
| InternLM | InternLM-7B, InternLM2-7B, InternLM2-20B, InternLM2-7B-200K, InternLM2-20B-200K, InternLM2.5-7B-Chat-1M |
| Zephyr | Zephyr-7B Beta |
| Mistral | Mistral-7B Instruct v0.2, Mixtral-8x7B Instruct v0.1 |

Table 1: Evaluated LLMs. The "-200K" suffix indicates models configured for context lengths of up to 200,000 tokens, deployed with LMDeploy (Contributors, 2023). "vLLM" denotes deployment via vLLM (Kwon et al., 2023).

3.1 Performance of NeedleBench Tasks

Experimental Setting

We use the recall accuracy of needles placed at different positions as a metric to evaluate the performance of the models. By sequentially averaging the performance across datasets of different lengths and depths, we obtain the performance of the models on each task within NeedleBench. Further averaging of the scores from different tasks provides an overall score, ensuring a balanced representation of each task’s contribution to the overall score. To ensure the stability of our results, we run multiple iterations of each test. The number of tokens is calculated uniformly using the tokenizer from GPT-4 (tiktoken, https://github.com/openai/tiktoken). In addition, we introduce different buffer sizes to mitigate the problem of significant differences between the tokenizers of different models, which could prevent a model from receiving the full prompt. In our experiments, we consistently position the question command prompts at the end of the extended texts. We examine the impact of the question prompt’s placement in Section 4.
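This section does not spell out how the three task scores are combined into the overall score. As a sanity check, the sketch below uses weights of 0.4, 0.3, and 0.3 for Single-Retrieval, Multi-Retrieval, and Multi-Reasoning. These weights are an assumption inferred from the numbers rather than stated in the text, but they reproduce the Overall column of Table 2 for the rows tried here. The function name `overall_score` is hypothetical.

```python
def overall_score(single_retrieval, multi_retrieval, multi_reasoning,
                  weights=(0.4, 0.3, 0.3)):
    """Weighted mean of the three task scores.

    The default weights are an assumption that happens to match Table 2;
    they are not quoted from the paper.
    """
    w_srt, w_mrt, w_mrs = weights
    return (w_srt * single_retrieval
            + w_mrt * multi_retrieval
            + w_mrs * multi_reasoning)


# Task-level "Overall" values copied from Table 2 (32K setting).
rows = {
    "Qwen-1.5-4B": (98.22, 92.09, 55.24),
    "ChatGLM3-6B-32K": (89.34, 85.20, 71.88),
    "InternLM2-7B-200K": (100.00, 41.64, 74.25),
}
for model, scores in rows.items():
    print(model, round(overall_score(*scores), 2))
```

A plain unweighted mean of the same three columns does not match the table, which is why the weighted form is shown.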

To evaluate the similarity between predictions and references under each specific task, we use the Levenshtein distance, with $P$ and $R$ representing the lists of predictions and references, respectively, each of length $n$. The score for each pair $P_i$ and $R_i$ is adjusted for the presence of core keywords within $P_i$, defined by the set $W_i$ for each $R_i$. A penalty factor $\alpha = 0.2$ is applied to the score calculation for predictions missing any core keywords from $R_i$. The formula for an individual score, $\text{Score}_i$, is as follows:

$$\text{Score}_i = \begin{cases} 100 & \text{if } P_i \cap W_i \neq \emptyset, \\ 100 \cdot \alpha \cdot \left(1 - \dfrac{d(P_i, R_i)}{\max(|P_i|, |R_i|)}\right) & \text{otherwise,} \end{cases}$$

where $d(P_i, R_i)$ is the Levenshtein distance between $P_i$ and $R_i$, and $|P_i|$ and $|R_i|$ denote their lengths. The final score is the average of $\text{Score}_i$ across $n$ repetitions.
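The scoring rule can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the evaluator shipped with OpenCompass: the names `levenshtein` and `needle_score` are hypothetical, the condition $P_i \cap W_i \neq \emptyset$ is interpreted as "at least one core keyword occurs as a substring of the prediction", and the handling of two empty strings is an added assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def needle_score(prediction, reference, keywords, alpha=0.2):
    """Score one prediction: 100 on a keyword hit, else a penalized similarity."""
    # Assumption: a keyword "hit" means substring containment.
    if any(keyword in prediction for keyword in keywords):
        return 100.0
    longest = max(len(prediction), len(reference))
    if longest == 0:
        # Assumption: two empty strings count as identical (similarity 1).
        return 100.0 * alpha
    similarity = 1 - levenshtein(prediction, reference) / longest
    return 100.0 * alpha * similarity


reference = "The legendary item hidden on the Emerald Island is the Stardust Shard"
print(needle_score("It is the Stardust Shard.", reference, ["Stardust Shard"]))
print(needle_score("The document does not say.", reference, ["Stardust Shard"]))
```

With $\alpha = 0.2$, a prediction that misses every keyword can score at most 20, so keyword hits dominate the averaged result.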

Focusing on key findings, we present the main results for the 32K and 200K context lengths in Tables 2 and 3. More comprehensive results are available in Appendix B.

| Model | Single-Retrieval Chinese | Single-Retrieval English | Single-Retrieval Overall | Multi-Retrieval Chinese | Multi-Retrieval English | Multi-Retrieval Overall | Multi-Reasoning Chinese | Multi-Reasoning English | Multi-Reasoning Overall | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Models with Fewer Than 7B Parameters | | | | | | | | | | |
| Qwen-1.5-4B | 99.80 | 96.64 | 98.22 | 90.68 | 93.50 | 92.09 | 55.79 | 54.70 | 55.24 | 83.49 |
| ChatGLM3-6B-32K | 98.43 | 80.24 | 89.34 | 82.50 | 87.91 | 85.20 | 69.86 | 73.90 | 71.88 | 82.86 |
| InternLM2-7B-200K | 100.00 | 100.00 | 100.00 | 24.77 | 58.50 | 41.64 | 58.72 | 89.79 | 74.25 | 74.77 |
| Mistral-7B Instruct v0.2 | 79.84 | 47.92 | 63.88 | 79.36 | 92.05 | 85.70 | 49.92 | 71.86 | 60.89 | 69.53 |
| Qwen-1.5-1.8B | 71.62 | 39.44 | 55.53 | 37.45 | 25.64 | 31.55 | 46.34 | 29.42 | 37.88 | 43.04 |
| Qwen-7B | 56.71 | 29.47 | 43.09 | 22.09 | 14.86 | 18.48 | 35.60 | 21.62 | 28.61 | 31.36 |
| Qwen-1.5-0.5B | 40.09 | 28.99 | 34.54 | 5.45 | 6.18 | 5.82 | 24.48 | 11.04 | 17.76 | 20.89 |
| Zephyr-7B Beta | 13.73 | 21.33 | 17.53 | 0.23 | 4.09 | 2.16 | 19.80 | 33.79 | 26.79 | 15.70 |
| Models with 7-20B Parameters | | | | | | | | | | |
| Qwen-1.5-14B-vLLM | 82.68 | 88.83 | 85.75 | 94.27 | 96.91 | 95.59 | 67.05 | 66.05 | 66.55 | 82.94 |
| Orion-14B-LongChat | 99.37 | 89.40 | 94.39 | 88.05 | 87.36 | 87.70 | 64.63 | 59.42 | 62.02 | 82.67 |
| InternLM2-20B-200K | 100.00 | 100.00 | 100.00 | 13.86 | 12.91 | 13.39 | 64.15 | 89.26 | 76.70 | 67.03 |
| Models Larger Than 20B Parameters | | | | | | | | | | |
| Mixtral-8x7B Instruct v0.1 | 94.74 | 93.23 | 93.99 | 93.00 | 99.45 | 96.23 | 73.84 | 78.93 | 76.39 | 89.38 |
| Qwen-1.5-72B-vLLM | 96.67 | 63.97 | 80.32 | 84.95 | 82.41 | 83.68 | 82.53 | 85.02 | 83.77 | 82.36 |
| Qwen-72B-vLLM | 95.59 | 35.70 | 65.65 | 90.77 | 79.18 | 84.98 | 74.82 | 59.76 | 67.29 | 71.94 |

Table 2: Main Results of NeedleBench 32K.

| Model | Single-Retrieval Chinese | Single-Retrieval English | Single-Retrieval Overall | Multi-Retrieval Chinese | Multi-Retrieval English | Multi-Retrieval Overall | Multi-Reasoning Chinese | Multi-Reasoning English | Multi-Reasoning Overall | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Models with Fewer Than 7B Parameters | | | | | | | | | | |
| InternLM2-7B-200K | 98.98 | 98.69 | 98.83 | 12.45 | 25.50 | 18.98 | 48.90 | 63.96 | 56.43 | 62.15 |
| Models with 7-20B Parameters | | | | | | | | | | |
| InternLM2-20B-200K | 100.00 | 100.00 | 100.00 | 10.68 | 12.00 | 11.34 | 50.72 | 70.98 | 60.85 | 61.66 |
| Orion-14B-LongChat | 78.07 | 32.84 | 55.45 | 33.45 | 30.32 | 31.89 | 33.14 | 27.36 | 30.25 | 40.82 |

Table 3: Main Results of NeedleBench 200K.

NeedleBench 32K and 200K

InternLM2-7B-200K achieves a perfect score in Single-Retrieval, indicating strong capabilities in accurately retrieving single information points. Qwen-1.5-72B-vLLM, with its substantial parameter count of 72B, excels in Multi-Reasoning, demonstrating its advantage in understanding and reasoning about complex relationships between pieces of information. Mixtral-8x7B Instruct v0.1 achieves the highest overall score, with particularly strong results in retrieval tasks. Models with larger parameter counts tend to achieve higher average scores (see Figure 5).

Figure 4: Model Performance Comparison on NeedleBench 32K
Figure 5: Selective Visualization Results in NeedleBench 200K

As the context length extends further to 200K, fewer open-source LLMs are capable of supporting such extended contexts, as illustrated in Figure 5. InternLM2-7B-200K continues to exhibit consistent strength in Single-Retrieval tasks but experiences a significant performance decline in Multi-Retrieval tests. Upon further investigation, we confirm that this is due to InternLM2 overfitting to the Single-Retrieval tasks in the training corpus, leading it to often respond with only one needle in the Multi-Retrieval tests. This indicates that a longer context length may require enhanced capabilities in instruction following and information retrieval from the model. On the other hand, Orion-14B-LongChat is more adept at Multi-Retrieval tasks but fails to effectively manage the challenges of extended texts in Single-Retrieval tasks, showing a decline when the context length reaches approximately 80K.

NeedleBench 1000K

In our extended evaluation, we push the context length further to 1000K tokens. We evaluate the InternLM2.5-7B-Chat-1M and GLM4-9B-Chat-1M models under this context length.

First, we present the comprehensive results of the two models on NeedleBench in Table 4. In the vast majority of tasks, InternLM2.5-7B-Chat-1M performs better than GLM4-9B-Chat-1M.

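| Model | Single-Retrieval Chinese | Single-Retrieval English | Single-Retrieval Overall | Multi-Retrieval Chinese | Multi-Retrieval English | Multi-Retrieval Overall | Multi-Reasoning Chinese | Multi-Reasoning English | Multi-Reasoning Overall | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Models Supporting 1M Tokens | | | | | | | | | | |
| InternLM2.5-7B-Chat-1M | 95.73 | 95.13 | 95.43 | 57.91 | 70.95 | 64.43 | 72.37 | 69.89 | 71.13 | 78.84 |
| GLM4-9B-Chat-1M | 89.99 | 11.6 | 50.8 | 74.64 | 24 | 49.32 | 52.58 | 42.35 | 47.47 | 49.35 |

Table 4: Main Results of NeedleBench 1000K.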

We investigate why the performance of GLM4-9B-Chat-1M is significantly worse than that of InternLM2.5-7B-Chat-1M. One of the tasks with a large performance difference is the Single-Retrieval task. We show the results under the default prompt settings in Figure 6. We find that GLM4-9B-Chat-1M performs poorly on the English version (indicated by almost all red scores), even at relatively short context lengths.

(a) GLM4-9B-Chat-1M: ZH
(b) GLM4-9B-Chat-1M: EN
(c) InternLM2.5-7B-Chat-1M: ZH
(d) InternLM2.5-7B-Chat-1M: EN
Figure 6: Performance Comparison of InternLM2.5-7B-Chat-1M and GLM4-9B-Chat-1M Models at 1000K Context Length under Default Settings

By examining the model outputs, we find that GLM4-9B-Chat-1M often fails to answer questions because it concludes that the original text contains no relevant content. Figure 7 shows a typical example:

Prompt:
You are an intelligent AI assistant skilled in answering user questions based on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers.
The document given to you by the user is:
Hidden on Emerald Island is the legendary Stardust Shard. [... Paul Graham Essays filler text ...]
Now, the question is: What legendary item is hidden on Emerald Island? Before answering, please consider what in the document is most relevant to this question. Please answer in the format ’The legendary item hidden on the Emerald Island is ______’

GLM4-9B-Chat-1M’s Response:
The document does not provide any information about a legendary item hidden on Emerald Island. It discusses programming languages, startup funding, and the nature of work and creativity.
Figure 7: GLM4-9B-Chat-1M’s Most Common Error Cases in Single Retrieval

Therefore, we remove the requirement for the model to first recall the sentence in the document most relevant to the question, and instead instruct the model to answer directly. The results are shown in Figure 8, where performance is now nearly perfect (all green).

(a) GLM4-9B-Chat-1M: EN (w/o Recall Step)
(b) GLM4-9B-Chat-1M: ZH (w/o Recall Step)
(c) InternLM2.5-7B-Chat-1M: EN (w/o Recall Step)
(d) InternLM2.5-7B-Chat-1M: ZH (w/o Recall Step)
Figure 8: Performance comparison of InternLM2.5-7B-Chat-1M and GLM4-9B-Chat-1M models at 1000K context length after removing the requirement for the model to first recall the most relevant sentence in the document to the question.

The significant difference caused by this prompt indicates that the GLM4-9B-Chat-1M is highly sensitive to the prompt used. This suggests that the GLM4-9B-Chat-1M has the inherent capability to handle 1000K context length but may not always demonstrate this ability under different prompt settings. This implies that more sophisticated fine-tuning or alignment strategies may need to be introduced to fully utilize the model’s capabilities.

3.2 Ancestral Trace Challenge (ATC)

We perform extreme stress testing with multi-step tests on the ATC for mainstream LLMs. Specifically, we test LLMs on bilingual multi-step logical reasoning problems, with the number of steps ranging from 2 to 19. For each step count setting, models are required to respond in a single-choice format. We employ few-shot learning to instruct the LLMs, using 4 few-shot examples. In the Direct version of the test, models are asked to provide option A, B, C, or D without having to demonstrate their reasoning path. In contrast, in the Reasoning (Rea.) version, the responses in the few-shot examples include reasoning paths that show the LLMs how to answer each question.

To mitigate the risk of LLMs randomly guessing correctly on multiple-choice questions, we use the Circular-Eval (CE) method (Liu et al., 2023), which systematically changes the options on each question.
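A minimal sketch of circular evaluation is given below. The text above only says that the options are changed systematically; the specific rule used here (rotating the option list through every position and requiring a correct choice each time) is an assumed reading of the method in Liu et al. (2023), and the names `circular_variants`, `circular_eval`, and the `predict` callback are hypothetical.

```python
def circular_variants(options, answer_index):
    """Yield each rotation of the options with the shifted answer index."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # The option originally at `answer_index` moves left by `shift`.
        yield rotated, (answer_index - shift) % n


def circular_eval(options, answer_index, predict):
    """Return True only if `predict` is right under every rotation.

    `predict` is a stand-in for the model: it receives the option list
    and returns the index of the option it selects.
    """
    return all(
        predict(rotated) == index
        for rotated, index in circular_variants(options, answer_index)
    )


options = ["Ann", "Bob", "Cid", "Dee"]
knows_answer = lambda opts: opts.index("Cid")  # always finds the right name
always_first = lambda opts: 0                   # position-biased guesser
print(circular_eval(options, 2, knows_answer))
print(circular_eval(options, 2, always_first))
```

Under this rule a guesser that always picks the same position passes only one of the four rotations and therefore fails the question.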

Let $P_{\text{Step}}$ represent the score achieved in a task, which consists of questions designed to be solved in a specific number of steps, precisely, in $\text{Step}$ steps, where $P_{\text{Step}} = \sum_{i=1}^{R} CE(Q_{\text{Step}}^{(i)})$, and $CE(Q_{\text{Step}}^{(i)})$ denotes the circular evaluation score for the $i$th iteration of question $Q$ at a given number of steps, with the score being $100/R$ if the LLM accurately identifies the correct answer regardless of its position among the options A, B, C, and D, and 0 otherwise. Here, $R$ represents the number of repetitions, and in our experiments, $R$ is set to 10. $L$ denotes the number of reasoning steps in the most complex task, with $L$ being set to 19 in our case. The overall score (Overall Score) is obtained by averaging the scores for each task. The score for each task setting (Task Score) is calculated as follows:

$$P_{\text{Step}} = \sum_{i=1}^{R} CE(Q_{\text{Step}}^{(i)}), \qquad \text{Task Score} = \frac{\sum_{\text{Step}=2}^{L} \left(P_{\text{Step}} \times \text{Step}\right)}{\sum_{\text{Step}=2}^{L} \text{Step}}$$
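The two formulas translate directly into code. The sketch below is illustrative, with hypothetical function names `step_score` and `task_score`; it assumes the circular evaluation outcome of each repetition is already available as a boolean.

```python
def step_score(ce_passed):
    """P_Step: each of the R repetitions adds 100/R when it passes."""
    repetitions = len(ce_passed)
    return sum(100.0 / repetitions for passed in ce_passed if passed)


def task_score(p_by_step):
    """Step-weighted mean of P_Step, so harder questions count for more.

    p_by_step maps a step count (2..L) to its P_Step value.
    """
    numerator = sum(p * step for step, p in p_by_step.items())
    denominator = sum(p_by_step.keys())
    return numerator / denominator


# Toy example: a model that is perfect at 2 steps and fades by 4 steps.
p_by_step = {
    2: step_score([True] * 10),
    3: step_score([True] * 5 + [False] * 5),
    4: step_score([False] * 10),
}
print(task_score(p_by_step))
```

Because each $P_{\text{Step}}$ is weighted by its step count, a model that only solves short chains receives a low Task Score even when its unweighted accuracy looks moderate.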

| Model | ZH-Direct | EN-Direct | ZH-Rea. | EN-Rea. | Overall |
| --- | --- | --- | --- | --- | --- |
| API Models | | | | | |
| Claude-3-Opus | 45.34 | 56.77 | 58.15 | 70.16 | 57.61 |
| GPT4-Turbo | 21.69 | 66.24 | 40.26 | 64.97 | 48.29 |
| GLM-4 | 28.47 | 57.46 | 35.82 | 50.21 | 42.99 |
| Models with Fewer Than 7B Parameters | | | | | |
| InternLM2-7B | 11.53 | 18.10 | 25.34 | 34.23 | 22.30 |
| InternLM2-7B-200K | 12.49 | 16.03 | 23.33 | 32.43 | 21.07 |
| Yi-6B | 7.14 | 11.53 | 13.33 | 10.69 | 10.67 |
| Qwen-7B | 1.96 | 11.01 | 10.26 | 14.55 | 9.45 |
| Zephyr-7B Beta | 2.28 | 6.72 | 5.82 | 16.03 | 7.71 |
| Mistral-7B Inst. v0.2 | 6.35 | 6.35 | 6.56 | 10.95 | 7.55 |
| InternLM-7B | 4.23 | 6.88 | 7.88 | 10.16 | 7.29 |
| Baichuan2-7B | 0.32 | 4.87 | 12.38 | 6.40 | 5.99 |
| ChatGLM3-6B-32K | 3.17 | 5.66 | 5.71 | 6.88 | 5.35 |
| Gemma-7B | 3.49 | 2.22 | 5.19 | 8.68 | 4.89 |
| ChatGLM3-6B | 4.55 | 4.50 | 4.60 | 5.24 | 4.72 |
| Qwen-1.5-4B | 0.63 | 2.06 | 3.07 | 1.75 | 1.88 |
| LLaMA-2-7B | 0.74 | 0.00 | 0.42 | 4.13 | 1.32 |
| Gemma-2B | 1.06 | 0.21 | 0.00 | 0.26 | 0.38 |
| Qwen-1.5-1.8B | 0.00 | 0.11 | 0.63 | 0.26 | 0.25 |
| Qwen-1.5-0.5B | 0.00 | 0.21 | 0.00 | 0.00 | 0.05 |
| Models with 7-20B Parameters | | | | | |
| InternLM2-20B | 26.72 | 42.43 | 23.65 | 31.06 | 30.96 |
| InternLM2-20B-200K | 26.14 | 35.08 | 32.75 | 29.47 | 30.86 |
| Qwen-14B | 12.01 | 27.46 | 31.85 | 23.92 | 23.81 |
| Qwen-1.5-14B-vLLM | 19.84 | 8.25 | 36.72 | 2.96 | 16.94 |
| Baichuan2-13B | 1.06 | 12.86 | 21.90 | 10.11 | 11.48 |
| LLaMA-2-13B | 1.75 | 6.08 | 3.39 | 12.22 | 5.86 |
| Orion-14B-LongChat | 1.38 | 7.94 | 1.43 | 11.43 | 5.54 |
| Models Larger Than 20B Parameters | | | | | |
| DeepSeek-67B | 27.88 | 46.51 | 38.68 | 63.07 | 44.03 |
| Qwen-72B-vLLM | 22.91 | 23.49 | 29.89 | 35.24 | 27.88 |
| Qwen-1.5-72B-vLLM | 35.66 | 12.75 | 36.93 | 2.80 | 22.04 |
| OrionStar-Yi-34B | 2.91 | 28.25 | 18.99 | 25.98 | 19.03 |
| Mixtral-8x7B Inst. v0.1 | 15.50 | 13.65 | 19.63 | 26.51 | 18.82 |
| Yi-34B | 11.48 | 20.95 | 18.68 | 18.10 | 17.30 |
| LLaMA-2-70B | 7.83 | 15.45 | 17.14 | 25.82 | 16.56 |
| WizardLM-70B | 12.75 | 18.36 | 10.79 | 19.05 | 15.24 |

Table 5: ATC Results

Table 5 presents the scores of various models on the ATC subtasks as well as their overall performance. We observe several key insights:

Adding Reasoning Paths Improves Model Performance: The versions with reasoning paths generally score higher than those without. For instance, Claude-3-Opus scores 70.16 on the EN-Rea. version, compared with 56.77 on the corresponding EN-Direct version. This aligns with the observation that chain-of-thought (CoT) reasoning can enhance model performance.

Larger Parameter Count Improves Model ATC Scores: Furthermore, for models within the same series, such as Baichuan and LLaMA, there is a noticeable trend where an increase in parameter count corresponds to higher scores. For example, Baichuan2-7B has an overall score of 5.99, while Baichuan2-13B achieves a significantly higher score of 11.48. Similarly, LLaMA-2-7B scores 1.32 overall, LLaMA-2-13B scores 5.86, and LLaMA-2-70B reaches 16.56, demonstrating the same upward trend. This indicates that larger models tend to perform better in these tasks.

Superiority of Closed-Source Models: We also observe a clear superiority of API models over open-source LLMs, particularly highlighted by the performances of Claude-3-Opus and GPT4-Turbo. However, DeepSeek-67B stands out with reasoning abilities that are close to those of the top API models, suggesting a promising direction for bridging the performance gap between API and open-source models.

Figure 9: Performance decline trend of various models on ATC

Figure 9 shows how the performance of various models on ATC declines as the number of reasoning steps increases. The vertical axis represents the accuracy of each model on questions with a given number of reasoning steps (number of needles). As the number of reasoning steps increases, the majority of models struggle to cope with the growing complexity: the accuracy of nearly all open-source LLMs drops below 10% when the number of reasoning steps reaches around 16, at which point the prompt length is only about 1120 tokens. This indicates their difficulty in handling complex logical problems that are likely present in long-context questions.
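To make the link between "number of needles" and "number of reasoning steps" concrete, the following is a minimal sketch of how a scrambled kinship chain of this kind could be generated. The relation list, the sentence template, and the function name `build_atc_question` are assumptions made for illustration; they are not the benchmark's actual templates or code.

```python
import random

# Hypothetical relation phrases; the real benchmark uses its own templates.
RELATIONS = ["father", "mother", "paternal grandfather", "maternal grandmother"]


def build_atc_question(names, seed=0):
    """Build a scrambled kinship chain over `names`, ordered youngest to eldest.

    names[0] is the person asked about and names[-1] is the eldest ancestor.
    Each adjacent pair yields one statement (one needle / one reasoning step).
    Returns (question_text, eldest_name).
    """
    if len(names) < 2:
        raise ValueError("need at least two names to form one relationship")
    rng = random.Random(seed)
    statements = []
    for younger, elder in zip(names, names[1:]):
        relation = rng.choice(RELATIONS)
        statements.append(f"{elder} is {younger}'s {relation}.")
    # Shuffle so the chain has to be reassembled rather than read in order.
    rng.shuffle(statements)
    question = (
        " ".join(statements)
        + " Given the scrambled family relationships described above, "
        + f"who is the eldest relative that '{names[0]}' can trace back to in the context?"
    )
    return question, names[-1]


# Three statements, so three reasoning steps are needed to reach the answer.
question, answer = build_atc_question(["Person A", "Person B", "Person C", "Person D"])
print(question)
print("Expected answer:", answer)
```

Under this construction, a chain of n + 1 names produces n statements, so the prompt grows only linearly with the number of reasoning steps, which is why even 16 steps fit in a short prompt.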

4 Discussion

In this section, we present a series of ablation studies to explore the factors influencing model performance on long-context tasks. We focus on three key aspects: the effect of model size on long-text capabilities, the impact of instruction following in multi-retrieval tasks, and the influence of prompt position on model scores, with the aim of providing a comprehensive understanding of how these factors affect the performance and reliability of LLMs in different scenarios.

Effect of Model Size on Long-text Capabilities
Model Parameters Score Task Type
Baichuan2-7B 7B 81.30 Overall
Baichuan2-13B 13B 92.28 Overall
LLaMA-2-7B 7B 62.09 Overall
LLaMA-2-13B 13B 61.82 Overall
LLaMA-2-70B 70B 69.35 Overall
Qwen-1.5-0.5B 0.5B 59.40 Overall
Qwen-1.5-1.8B 1.8B 82.49 Overall
Qwen-1.5-4B 4B 86.98 Overall
Qwen-1.5-14B 14B 77.88 Overall
Qwen-1.5-72B 72B 94.76 Overall
Qwen-7B 7B 74.70 Overall
Qwen-14B 14B 83.71 Overall
Qwen-72B 72B 59.55 Overall
Qwen-7B 7B 80.18 S-RT
Qwen-14B 14B 94.12 S-RT
Qwen-72B 72B 33.04 S-RT
InternLM2-7B 7B 85.19 M-RS
InternLM2-20B 20B 90.66 M-RS
Table 6: Model Size Impact on Scores by Series

We examine the relationship between model size and performance on long-text tasks in NeedleBench 4K, as detailed in Table 6. Our results show a pattern that is in most cases consistent with the scaling laws (Kaplan et al., 2020), which argue that larger models typically exhibit superior performance, as seen in the Baichuan, LLaMA, and InternLM2 model series.

However, for the Qwen series LLMs, as detailed in Table 6, the Qwen-72B model shows a significant performance degradation, especially on the NeedleBench 4K single retrieval task. Upon verification, we confirm that the Qwen-72B model often fails by assuming that there is no content relevant to the query within the documents. Such findings indicate the need for more focused training and optimization strategies tailored to the specifics of long-text tasks, so that larger models can fully exploit their increased capacity and conform to the predictions of the scaling law.
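As a quick check of this comparison, the sketch below computes, for each series in Table 6, the score of the largest model minus the score of the smallest one. The scores are copied from Table 6; the dictionary layout and the helper name `size_gain` are illustrative choices, not part of the benchmark code.

```python
# Scores copied from Table 6 (NeedleBench 4K), ordered from smallest to largest model.
TABLE_6 = {
    ("Baichuan2", "Overall"): [("7B", 81.30), ("13B", 92.28)],
    ("LLaMA-2", "Overall"): [("7B", 62.09), ("13B", 61.82), ("70B", 69.35)],
    ("Qwen-1.5", "Overall"): [
        ("0.5B", 59.40), ("1.8B", 82.49), ("4B", 86.98), ("14B", 77.88), ("72B", 94.76),
    ],
    ("Qwen", "Overall"): [("7B", 74.70), ("14B", 83.71), ("72B", 59.55)],
    ("Qwen", "S-RT"): [("7B", 80.18), ("14B", 94.12), ("72B", 33.04)],
    ("InternLM2", "M-RS"): [("7B", 85.19), ("20B", 90.66)],
}


def size_gain(rows):
    """Score of the largest model minus score of the smallest model in a series."""
    return round(rows[-1][1] - rows[0][1], 2)


for (series, task), rows in TABLE_6.items():
    print(f"{series:10s} {task:8s} {size_gain(rows):+7.2f}")
```

With these numbers, only the two Qwen rows yield a negative gain. Note that this endpoint comparison deliberately ignores intermediate sizes, so it does not show whether scores rise at every step within a series.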

Exploring Instruction Following in Multi-Retrieval Tasks

In our evaluation of the Multi-Needle Retrieval tasks in NeedleBench 8K, we investigate how changing the number of “needles,” the key pieces of information that models need to retrieve from extensive texts, affects their information retrieval capabilities (Figure 10).

(a) Models with Stable Scores
(b) Models with Fluctuating Scores
Figure 10: Initial needle recall rate in multi-retrieval tasks by requested needle count.

Interestingly, rather than observing a straightforward decline in recall scores with an increase in the number of needles, we note exceptions in the performance of certain models. For instance, the Mixtral-8x7B Instruct v0.1 model, presented in Figure 10(b), demonstrates an unexpected behavior in its ability to recall the very first needle. When the model is tasked with recalling a larger set of needles at different positions (including the initially requested first needle), its accuracy in recalling the initial piece of key information actually improves.

This pattern indicates that while these models have the inherent capability to accurately recall initial key information, this potential is not always fully utilized, suggesting room for improvement in their instruction-following abilities. The fluctuating performance, which paradoxically improves with more demanding instructions, reflects a need for enhancing the robustness of the models’ capability to follow instructions, particularly for tasks that require precise information retrieval.
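The quantity plotted in Figure 10 can be illustrated with a small sketch that measures how often the first needle's answer appears in model responses, grouped by how many needles were requested. Substring matching, the function name `first_needle_recall`, and the example responses are all assumptions made for illustration; they are not the benchmark's scoring rule or real model outputs.

```python
def first_needle_recall(responses, first_answer):
    """Fraction of responses that mention the first needle's answer.

    Case-insensitive substring matching is an assumption of this sketch,
    not necessarily the scoring rule used in the benchmark.
    """
    if not responses:
        return 0.0
    hits = sum(first_answer.lower() in r.lower() for r in responses)
    return hits / len(responses)


# Hypothetical model outputs, keyed by the number of needles requested.
responses_by_count = {
    1: [
        "There is no relevant content in the document.",
        "The ruler of the Polaris star system is Orion the Hunter.",
    ],
    5: [
        "The ruler of the Polaris star system is Orion the Hunter, and ...",
        "Orion the Hunter rules Polaris; the item on Heaven Island is the Lucky Clover.",
    ],
}

recall = {
    n: first_needle_recall(rs, "Orion the Hunter")
    for n, rs in responses_by_count.items()
}
print(recall)
```

A model whose first-needle recall is higher in the five-needle setting than in the one-needle setting, as in this made-up data, would show the fluctuating pattern described for Figure 10(b).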

Effect of Prompt Position on Model Score
Figure 11: Performance Changes

In NeedleBench 32K, we study how prompt position (whether the question prompt is placed before or after the long text) affects model performance on various subtasks. Figure 11 shows the changes in performance caused by moving the question prompt from before the long text to after it. The predominance of red blocks suggests that the majority of models perform worse when the question prompt is placed at the beginning of the text. However, for InternLM2-7B-200K, placing the question prompt of the M-RT task before the long text significantly improves the model’s instruction-following ability, resulting in higher scores.
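The two prompt layouts compared in this ablation can be sketched as follows. The function name `build_prompt`, the `question_first` flag, and the blank-line separator are assumptions made for illustration and do not reflect the exact templates used in the experiments.

```python
def build_prompt(document, question, question_first):
    """Assemble a prompt with the question either before or after the long text.

    The blank-line separator is an arbitrary choice for this sketch.
    """
    if question_first:
        return f"{question}\n\n{document}"
    return f"{document}\n\n{question}"


document = "Hidden on Emerald Island is the legendary Stardust Shard. (filler text)"
question = "What legendary item is hidden on Emerald Island?"

prompt_before = build_prompt(document, question, question_first=True)
prompt_after = build_prompt(document, question, question_first=False)
```

Scoring the same model on both variants and subtracting the two scores gives one cell of a comparison like the one in Figure 11.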

5 Conclusion

In this report, we thoroughly evaluate the ability of LLMs to handle long-context information retrieval and reasoning. Our research uncovers notable limitations in the current open-source LLMs’ ability to interpret and reason over long texts. Despite the progress in single information retrieval capabilities under extended context lengths in models such as GPT-4 Turbo and Claude 3, our findings underscore the difficulties these models encounter due to the complexity of logical reasoning challenges that are likely present in real-world long-context tasks. This suggests that there is significant room for improving the utility of LLMs in scenarios requiring intricate information retrieval and numerous reasoning steps.

Our report reveals the importance of concentrated efforts within the AI research community to improve LLMs’ long-context comprehension and reasoning abilities. Addressing the shortcomings identified in the NeedleBench assessments could enable future models to perform more accurate and sophisticated analyses, equipping them more effectively for intricate long-context tasks in real-world scenarios.

6 Limitations and Future Work

Limitations: One limitation identified in our experiments is related to the multi-needle reasoning task. The needles used in this task are derived from datasets curated from Wikipedia. It is challenging to determine whether the tested models are genuinely reasoning over the context or simply utilizing internal knowledge. We plan to address it in future work by incorporating ATC needles into the multi-needle reasoning task to further refine this aspect.

Future Work: The ATC test in this study demonstrates that for LLMs, long-text comprehension should not be confined to retrieval tasks alone. It reveals that logical reasoning within long texts is inherently more complex. Future LLMs could be specifically optimized for long-context reasoning to handle the intricate logical challenges that are likely to arise in such scenarios.

Additionally, future work should focus on improving the robustness of long-context models’ abilities, as our findings in NeedleBench 1000K experiments indicate that these models are highly prompt sensitive. Developing strategies to mitigate prompt sensitivity will be crucial to ensure consistent performance across different prompts and enhance the overall reliability of these models in handling extensive textual data.

References

  • Anthropic (2024a) Anthropic. Introducing claude 2.1. https://www.anthropic.com/index/claude-2-1, 2024a. Accessed: 2024-01-23.
  • Anthropic (2024b) Anthropic. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family, 2024b. Accessed: 2024-03-27.
  • Bai et al. (2023) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2023.
  • Besta et al. (2024) Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Onur Mutlu, and Torsten Hoefler. Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts, 2024.
  • Contributors (2023) LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023.
  • Gemini Team (2024) Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf, 2024. Accessed: 2024-03-17.
  • Inoue et al. (2019) Nozomu Inoue, Pontus Stenetorp, and Kentaro Inui. R4c: A benchmark for evaluating rc systems to get the right answer for the right reason. arXiv: Computation and Language, Oct 2019.
  • Kamradt (2023) Greg Kamradt. LLMs Need Needle In A Haystack Test-Pressure Testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  • Liu et al. (2023) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023.
  • Mohtashami & Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers, 2023.
  • OpenAI (2023) OpenAI. Gpt-4 and gpt-4 turbo - documentation. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo, 2023. Accessed: 2024-01-23.
  • OpenCompass Contributors (2023) OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • Wei et al. (2023) Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahui Zhou. Skywork: A more open bilingual foundation model, 2023.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Jan 2018. doi: 10.18653/v1/d18-1259. URL http://dx.doi.org/10.18653/v1/d18-1259.
  • Zhang et al. (2023) Xinrong Zhang, Yingfa Chen, Shengding Hu, Qihao Wu, Junhao Chen, Zihang Xu, Zhenning Dai, Xu Han, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Infinitebench: 128k long-context benchmark for language models, 2023.

7 Appendix

Appendix A Parameter Settings

We have carefully adjusted parameters such as buffersize, repeats, depths, and lengths for various tasks, languages, and content lengths in NeedleBench. This ensures test durations are kept within acceptable limits, experiments remain reproducible, and the influence of tokenizer discrepancies among different models is minimized. Building on this foundation, we further refine our evaluation approach by quantifying the overall performance of models using a weighted scoring system. This system is designed to capture the comprehensive abilities of models in performing a range of tasks, from single-item retrieval to complex reasoning across multiple contexts.

Following the meticulous configuration of these parameters, the computation of the Overall Score ($O$) is derived from the weighted averages of scores from different tasks. This is formalized as:

$$O = W_{S-RT} \cdot Score_{S-RT} + W_{M-RT} \cdot Score_{M-RT} + W_{M-RS} \cdot Score_{M-RS}$$

Where $W_{S-RT} = 0.4$, $W_{M-RT} = 0.3$, and $W_{M-RS} = 0.3$ are the weights assigned to the scores of the Single-Retrieval Task (S-RT), Multi-Retrieval Task (M-RT), and Multi-Reasoning Task (M-RS) respectively.
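As a worked example, the weighted sum can be checked against a row of Table 8. The sketch below uses the per-task overall scores reported for InternLM2-7B-200K; the dictionary keys and the function name `overall_score` are illustrative choices rather than names from the benchmark code.

```python
# Task weights as stated in the text.
WEIGHTS = {"S-RT": 0.4, "M-RT": 0.3, "M-RS": 0.3}


def overall_score(task_scores):
    """Weighted sum of the per-task overall scores."""
    return sum(WEIGHTS[task] * task_scores[task] for task in WEIGHTS)


# Per-task overall scores for InternLM2-7B-200K, copied from Table 8.
internlm2_7b_200k = {"S-RT": 100.00, "M-RT": 96.95, "M-RS": 84.73}

# 0.4 * 100.00 + 0.3 * 96.95 + 0.3 * 84.73 = 94.504
print(f"{overall_score(internlm2_7b_200k):.2f}")
```

The result rounds to the 94.50 listed in the Overall column for that model, which suggests the table values were produced with these weights.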

The Levenshtein distance $d$ we use in score calculation is defined as follows:

$$
d(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min\left\{
\begin{array}{l}
d(i-1,j) + 1, \\
d(i,j-1) + 1, \\
d(i-1,j-1) + 1_{(s_1[i] \neq s_2[j])}
\end{array}
\right\} & \text{otherwise.}
\end{cases}
$$

Here, $1_{(s_1[i] \neq s_2[j])}$ denotes an indicator function that equals 1 when $s_1[i] \neq s_2[j]$ and 0 otherwise. The term $d(i,j)$ represents the Levenshtein distance between the first $i$ characters of $s_1$ and the first $j$ characters of $s_2$.
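The recurrence translates directly into a short dynamic-programming routine. The following is a minimal sketch that keeps only one previous row of the table; it computes the distance itself and makes no assumption about how the benchmark converts that distance into a score.

```python
def levenshtein(s1, s2):
    """Levenshtein distance between s1 and s2, keeping one DP row at a time."""
    # Base case d(0, j) = j: turning an empty prefix into s2[:j] takes j insertions.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # Base case d(i, 0) = i: deleting all i characters of s1[:i].
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # d(i-1, j) + 1
                curr[j - 1] + 1,           # d(i, j-1) + 1
                prev[j - 1] + (c1 != c2),  # d(i-1, j-1) + indicator
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The one-row formulation uses memory proportional to the length of `s2` rather than the product of both lengths, which matters when comparing long model outputs.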

Table 7: NeedleBench Parameter Settings
Task Repeats Buffersize # of Depth Intervals # of Length Intervals
2-Needle-Reasoning-EN-4K 10 600 20 4
3-Needle-Reasoning-EN-4K 10 600 20 4
4-Needle-Reasoning-EN-4K 10 600 20 4
5-Needle-Reasoning-EN-4K 10 600 20 4
Single-Needle-Retrieval-EN-4K 10 600 20 4
Multi-Needle-Retrieval-EN-4K 25 1000 20 4
2-Needle-Reasoning-ZH-4K 10 200 20 4
3-Needle-Reasoning-ZH-4K 10 200 20 4
4-Needle-Reasoning-ZH-4K 10 200 20 4
5-Needle-Reasoning-ZH-4K 10 200 20 4
Single-Needle-Retrieval-ZH-4K 10 200 20 4
Multi-Needle-Retrieval-ZH-4K 25 200 20 4
2-Needle-Reasoning-EN-8K 10 1000 20 4
3-Needle-Reasoning-EN-8K 10 1000 20 4
4-Needle-Reasoning-EN-8K 10 1000 20 4
5-Needle-Reasoning-EN-8K 10 1000 20 4
Single-Needle-Retrieval-EN-8K 10 800 20 4
Multi-Needle-Retrieval-EN-8K 25 1300 20 4
2-Needle-Reasoning-ZH-8K 10 200 20 4
3-Needle-Reasoning-ZH-8K 10 200 20 4
4-Needle-Reasoning-ZH-8K 10 200 20 4
5-Needle-Reasoning-ZH-8K 10 200 20 4
Single-Needle-Retrieval-ZH-8K 10 200 20 4
Multi-Needle-Retrieval-ZH-8K 25 200 20 4
2-Needle-Reasoning-EN-32K 10 3000 10 8
3-Needle-Reasoning-EN-32K 10 3000 10 8
4-Needle-Reasoning-EN-32K 10 3000 10 8
5-Needle-Reasoning-EN-32K 10 3000 10 8
Single-Needle-Retrieval-EN-32K 10 3000 10 8
Multi-Needle-Retrieval-EN-32K 25 3000 10 8
2-Needle-Reasoning-ZH-32K 10 200 10 8
3-Needle-Reasoning-ZH-32K 10 200 10 8
4-Needle-Reasoning-ZH-32K 10 200 10 8
5-Needle-Reasoning-ZH-32K 10 200 10 8
Single-Needle-Retrieval-ZH-32K 10 200 10 8
Multi-Needle-Retrieval-ZH-32K 25 200 10 8
2-Needle-Reasoning-EN-200K 10 600 10 8
3-Needle-Reasoning-EN-200K 10 600 10 8
4-Needle-Reasoning-EN-200K 10 600 10 8
5-Needle-Reasoning-EN-200K 10 600 10 8
Single-Needle-Retrieval-EN-200K 10 600 10 8
Multi-Needle-Retrieval-EN-200K 25 3000 10 8
2-Needle-Reasoning-ZH-200K 10 200 10 8
3-Needle-Reasoning-ZH-200K 10 200 10 8
4-Needle-Reasoning-ZH-200K 10 200 10 8
5-Needle-Reasoning-ZH-200K 10 200 10 8
Single-Needle-Retrieval-ZH-200K 10 200 10 8
Multi-Needle-Retrieval-ZH-200K 25 200 10 8

Appendix B Detailed Experimental Results

In this section, we present the performance of models for different token lengths (4K, 8K, 32K, and 200K) in NeedleBench. The models are grouped according to their maximum sequence length, allowing a focused analysis on appropriate subtasks for each group. Detailed results for each length and task are presented in the corresponding tables: Table 8 for the 4K text length and Table 9 for the 8K text length.

B.1 NeedleBench 4K

Model Single-Retrieval Multi-Retrieval Multi-Reasoning Overall
Chinese English Overall Chinese English Overall Chinese English Overall
Models with Fewer Than 7B Parameters
InternLM2-7B-200K 100.00 100.00 100.00 94.00 99.90 96.95 74.17 95.29 84.73 94.50
InternLM2-7B 100.00 100.00 100.00 91.15 99.85 95.50 74.98 95.39 85.19 94.21
ChatGLM3-6B-32K 99.77 99.53 99.65 94.50 91.10 92.80 71.55 80.67 76.11 90.53
Qwen-1.5-4B 100.00 100.00 100.00 92.55 93.20 92.88 58.45 68.97 63.71 86.98
Yi-6B 83.75 99.77 91.76 86.05 84.80 85.43 71.91 81.93 76.92 85.41
Qwen-1.5-1.8B 99.16 96.45 97.81 77.80 78.60 78.20 67.68 65.01 66.34 82.49
Baichuan2-7B 91.42 94.13 92.78 80.00 69.45 74.72 72.33 72.80 72.57 81.30
InternLM-7B 95.12 89.77 92.44 62.00 82.85 72.42 53.06 75.60 64.33 78.00
Qwen-7B 79.26 81.10 80.18 75.25 79.25 77.25 63.19 66.48 64.83 74.70
Gemma-7B 75.32 74.31 74.82 82.15 90.80 86.47 55.74 60.33 58.03 73.28
ChatGLM3-6B 87.12 84.15 85.64 78.75 86.30 82.53 43.69 45.06 44.38 72.33
Mistral-7B Inst. v0.2 68.02 38.23 53.12 91.75 99.05 95.40 34.11 76.92 55.52 66.52
LLaMA-2-7B 32.42 93.84 63.13 41.20 84.25 62.73 31.61 88.52 60.07 62.09
Qwen-1.5-0.5B 75.67 92.13 83.90 35.75 49.90 42.82 41.40 45.20 43.30 59.40
Zephyr-7B Beta 69.22 45.70 57.46 52.25 53.30 52.78 53.63 54.60 54.12 55.05
Gemma-2B 21.47 20.29 20.88 56.90 80.20 68.55 12.72 31.68 22.20 35.57
Models with 7-20B Parameters
Orion-14B-LongChat 99.51 99.29 99.40 98.15 98.55 98.35 68.24 87.29 77.76 92.59
Baichuan2-13B 99.31 98.73 99.02 94.45 96.90 95.67 80.76 79.01 79.89 92.28
InternLM2-20B-200K 100.00 100.00 100.00 65.75 86.70 76.23 84.49 96.77 90.63 90.06
InternLM2-20B 100.00 100.00 100.00 63.20 82.40 72.80 84.49 96.83 90.66 89.04
Qwen-14B 98.82 89.43 94.12 93.40 92.30 92.85 66.24 55.15 60.69 83.71
Qwen-1.5-14B-vLLM 49.78 84.57 67.18 94.80 93.60 94.20 71.62 83.36 77.49 78.38
LLaMA-2-13B 44.22 86.94 65.58 25.20 87.55 56.38 37.22 87.25 62.24 61.82
Models Larger Than 20B Parameters
Qwen-1.5-72B-vLLM 99.32 95.67 97.50 95.05 98.85 96.95 87.10 95.91 91.50 95.53
Yi-34B 92.86 94.41 93.63 99.15 99.50 99.33 78.14 83.26 80.70 91.46
Mixtral-8x7B Inst. v0.1 87.22 84.89 86.05 99.15 99.45 99.30 71.19 83.76 77.48 87.45
DeepSeek-67B 95.44 85.21 90.33 80.40 78.35 79.38 85.11 76.60 80.86 84.20
Qwen-72B-vLLM 73.32 29.37 51.34 99.15 92.55 95.85 71.45 63.76 67.60 69.57
LLaMA-2-70B 41.35 98.43 69.89 40.70 98.55 69.62 43.37 93.30 68.34 69.35
OrionStar-Yi-34B 82.42 41.24 61.83 82.05 72.25 77.15 63.76 77.44 70.60 69.06
WizardLM-70B 63.95 42.61 53.28 49.25 77.45 63.35 53.93 62.98 58.46 57.85
Table 8: Main Results of NeedleBench 4K. Overall stands for the score calculated from an average of metrics on all subsets. (bold denotes the best score among all models, and underline denotes the best score under the same model scale. For the following tables, the same notation applies.)

From Table 8, it is evident that the InternLM2 models are particularly adept at Single-Retrieval tasks, while also demonstrating commendable cross-document inference abilities in Multi-Reasoning tasks. Notably, InternLM2-7B-200K achieves the best overall score among models with fewer than 7 billion parameters, although it is outperformed by the larger Qwen-1.5-72B-vLLM model.

Figure 12: Selective Visualization in NeedleBench 4K

Figure 12 shows detailed graphs for selected representative models at various lengths and depths. The InternLM2 models show superior overall performance, while the LLaMA models show significantly weaker Chinese language capabilities compared to English. The Gemma series shows a significant improvement in combined scores as the parameter size increases from 2B to 7B.

Figure 13: Model Performance Comparison on NeedleBench 4K

B.2 NeedleBench 8K

Model Single-Retrieval Multi-Retrieval Multi-Reasoning Overall
Chinese English Overall Chinese English Overall Chinese English Overall
Models with Fewer Than 7B Parameters
ChatGLM3-6B-32K 100.00 99.04 99.52 87.90 89.55 88.73 73.26 82.48 77.87 89.79
InternLM2-7B 100.00 100.00 100.00 49.90 98.90 74.40 65.29 90.62 77.96 85.71
InternLM2-7B-200K 100.00 100.00 100.00 46.25 99.00 72.62 65.09 91.38 78.23 85.26
Qwen-1.5-4B 99.89 97.96 98.92 92.05 92.35 92.20 56.25 57.41 56.83 84.28
Qwen-1.5-1.8B 99.20 94.01 96.61 65.85 62.75 64.30 61.15 51.65 56.40 74.85
Mistral-7B Instruct v0.2 76.63 40.92 58.78 86.55 97.90 92.22 42.59 76.48 59.54 69.04
ChatGLM3-6B 85.42 48.67 67.04 57.45 65.85 61.65 42.97 24.14 33.55 55.38
Gemma-7B 76.51 55.55 66.03 54.50 44.05 49.28 50.10 31.36 40.73 53.41
Qwen-1.5-0.5B 69.53 78.60 74.07 13.80 23.35 18.58 36.67 29.00 32.84 45.05
Qwen-7B 63.41 53.55 58.48 22.95 17.90 20.43 49.45 51.71 50.58 44.69
InternLM-7B 59.62 38.87 49.25 12.65 12.65 12.65 48.26 52.32 50.29 38.58
Zephyr-7B Beta 34.16 42.73 38.45 0.10 11.35 5.73 34.57 44.88 39.73 29.01
Gemma-2B 15.57 5.59 10.58 16.90 14.80 15.85 11.66 10.12 10.89 12.25
Models with 7-20B Parameters
Orion-14B-LongChat 100.00 99.41 99.71 95.05 94.75 94.90 65.33 79.96 72.65 90.15
Qwen-1.5-14B-vLLM 67.67 83.33 75.50 88.55 94.30 91.42 66.64 74.57 70.60 78.81
InternLM2-20B 100.00 100.00 100.00 23.75 47.55 35.65 73.87 94.73 84.30 75.99
InternLM2-20B-200K 100.00 100.00 100.00 26.55 44.55 35.55 73.84 94.76 84.30 75.96
Qwen-14B 90.35 46.28 68.32 79.85 59.15 69.50 57.46 34.74 46.10 62.01
Models Larger Than 20B Parameters
Qwen-1.5-72B-vLLM 99.39 79.27 89.33 92.40 93.65 93.03 86.23 93.87 90.05 90.65
Mixtral-8x7B Instruct v0.1 94.83 83.67 89.25 98.70 100.00 99.35 71.55 81.31 76.43 88.43
Qwen-72B-vLLM 89.19 16.29 52.74 99.10 73.05 86.08 66.79 45.42 56.11 63.75
Table 9: Main Results of NeedleBench 8K.

In NeedleBench 8K, we observe a general upward trend in scores with increasing model parameters, with Qwen-1.5-72B, the model with the largest number of parameters, achieving the highest score of 90.65. The InternLM2 series, however, does not maintain a leading position, primarily due to noticeable issues with command following in Multi-Retrieval tasks (recalling only a single needle instead of multiple needles as per the command). Among models with fewer than 7B parameters, ChatGLM3-6B-32K scores the highest at the 8K length (89.79 in Table 9) and closely trails Qwen-1.5-72B in performance.

Figure 14: Model performance comparison on NeedleBench 8K

Figure 15: Selective Visualization Results in NeedleBench 8K. The InternLM2-20B model excels in Single-Retrieval but shows a significant decline in Multi-Retrieval due to command-following issues, while demonstrating strong English Multi-Reasoning abilities. Orion-14B-LongChat achieves the highest score among the 7-20B models thanks to its outstanding performance in the Multi-Retrieval task.

Appendix C NeedleBench Prompt Examples

Single-Needle Retrieval (Needle First - Demonstration with English Version) Prompt: You are an intelligent AI assistant skilled in answering user questions base on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers.The document given to you by the user is : Hidden on Emerald Island is the legendary Stardust Shard. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Now, the question is: What legendary item is hidden on Emerald Island? Before answering, please consider what in the document is most relevant to this question. Please answer in the format ’The legendary item hidden on the Emerald Island is ______’
Figure 16: An example prompt of Single-Needle Retrieval showcasing key information with the single needle positioned at the very beginning. In actual tests, the needle is placed at various depths within extended texts to evaluate performance under different conditions.
Single-Needle Retrieval (Needle Middle - Demonstration with English Version) Prompt: You are an intelligent AI assistant skilled in answering user questions base on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers.The document given to you by the user is : Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Hidden on Emerald Island is the legendary Stardust Shard. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Now, the question is: What legendary item is hidden on Emerald Island? Before answering, please consider what in the document is most relevant to this question. Please answer in the format ’The legendary item hidden on the Emerald Island is ______’
Figure 17: An example prompt of Single-Needle Retrieval showcasing key information with the single needle positioned in the middle.
Single-Needle Retrieval (Needle Last - Demonstration with English Version) Prompt: You are an intelligent AI assistant skilled in answering user questions base on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers.The document given to you by the user is : Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Hidden on Emerald Island is the legendary Stardust Shard. Now, the question is: What legendary item is hidden on Emerald Island? Before answering, please consider what in the document is most relevant to this question. Please answer in the format ’The legendary item hidden on the Emerald Island is ______’
Figure 18: An example prompt of Single-Needle Retrieval showcasing key information with the single needle positioned at the end.
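The three single-needle layouts in Figures 16 to 18 differ only in where the needle sits inside the filler text. A minimal sketch of such depth-controlled insertion is given below. Working at sentence granularity, the function name `insert_needle`, and the `depth_percent` parameter are assumptions for illustration; the benchmark's own implementation may operate on tokens and use buffers as listed in Table 7.

```python
def insert_needle(haystack_sentences, needle, depth_percent):
    """Insert `needle` at a relative depth (0 = very beginning, 100 = very end).

    Sentence-level placement is an assumption of this sketch.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    index = round(len(haystack_sentences) * depth_percent / 100)
    return haystack_sentences[:index] + [needle] + haystack_sentences[index:]


filler = ["Filler sentence 1.", "Filler sentence 2.", "Filler sentence 3.", "Filler sentence 4."]
needle = "Hidden on Emerald Island is the legendary Stardust Shard."

needle_first = insert_needle(filler, needle, 0)     # layout of Figure 16
needle_middle = insert_needle(filler, needle, 50)   # layout of Figure 17
needle_last = insert_needle(filler, needle, 100)    # layout of Figure 18
print(" ".join(needle_middle))
```

Sweeping `depth_percent` over a grid of values and repeating for several document lengths yields the depth-by-length evaluation grid that the figures visualize.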
Multi-Needle Retrieval(Demonstration with Five Needles English Version) Prompt: You are an intelligent AI assistant skilled in answering user questions base on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers.The document given to you by the user is : Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays The ruler of the Polaris star system is Orion the Hunter. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Hidden on Heaven Island is the legendary Lucky Clover. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Hidden on Mysterious Island is the legendary Counterclockwise Crystal. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays The ruler of the Orion star system is Guardian of Time Lightspeed. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Hidden on Phantom Island is the legendary Goodness Heart. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Now, the questions are: Who is the ruler of the Polaris star system?, What legendary item is hidden on Heaven Island?, What legendary item is hidden on Mysterious Island?, Who is the ruler of the Orion star system?, What legendary item is hidden on Phantom Island?Before answering, please consider what in the document is most relevant to this question. Please answer in the format of ’The ruler of the Polaris star system is________, The legendary item hidden on the Heaven Island is________, The legendary item hidden on the Mysterious Island is________, The ruler of the Orion star system is________, The legendary item hidden on the Phantom Island is________’
Figure 19: An example prompt of Multi-Needle Retrieval
Multi-Needle Reasoning (Demonstration with Three Needles English Version) Prompt: You are an intelligent AI assistant skilled in answering user questions base on documents provided by the user. Please keep your answers concise and clear. Do not talk about irrelevant topics or repeat your answers. The document given to you by the user is: Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays The Love for Three Oranges is known as L’amour des trois oranges. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays The Love for Three Oranges is a satirical opera by Sergei Prokofiev. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Sergei Prokofiev died on 5 March 1953. Paul Graham Essays— —Paul Graham Essays— —Paul Graham Essays Now, the question is: When did the Soviet composer of French language title L’amour des trois oranges die? Before answering, please consider what in the document is most relevant to this question.
Figure 20: An example prompt of Multi-Needle Reasoning
Ancestral Trace Challenge (4-shot Six-Step Examples without Reasoning Paths)

Few-shot Prompt 1:
Question: David House, as Jessica Stewart’s father, has a significant impact on Jessica Stewart’s upbringing. Given the scrambled family relationships described above, who is the eldest relative that ’Jessica Stewart’ can trace back to in the context?
A. Jack Burch
B. Jessica Stewart
C. David House
D. Carolyn Jackson
Bot Response: Answer: C

Few-shot Prompt 2:
Question: For Roberta Hill, Michael Scott is not just a paternal grandfather, but also a friend. Jacob Oconnor’s paternal grandmother is Roberta Hill. Given the scrambled family relationships described above, who is the eldest relative that ’Jacob Oconnor’ can trace back to in the context?
A. Laura Holland
B. Roberta Hill
C. Jacob Oconnor
D. Michael Scott
Bot Response: Answer: D

[…Continue with other few-shot examples…]

Final Challenge Question:
Question: Victoria Dean is not only Danielle Yates’s maternal grandmother but also Danielle Yates’s role model. Carlos Smith, as Martin Gary’s grandfather, has a significant impact on Martin Gary’s upbringing. Victoria Dean considers Martin Gary as their grandfather. Marcus Miller, as Carlos Smith’s paternal grandfather, has a significant impact on Carlos Smith’s upbringing. Danielle Yates is not only Davina Hernandez’s paternal grandmother but also Davina Hernandez’s role model. Davina Hernandez is Jennifer Williams’s mom. Given the scrambled family relationships described above, who is the eldest relative that ’Jennifer Williams’ can trace back to in the context?
A. Marcus Miller
B. Carlos Smith
C. Martin Gary
D. Victoria Dean
Example Answer from GPT4-Turbo: Answer: A
Figure 21: Examples of 4-shot prompts and responses in the Ancestral Trace Challenge (ATC) without reasoning paths for six-step logic relations
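Responses in this format end with an option letter, either directly ("Answer: C") or after a reasoning chain ("The answer is: A"). A minimal sketch of how such a letter could be pulled out of a response is given below; the function name `extract_choice` and the regular expression are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
import re

# Matches "Answer: C" or "The answer is: A". The trailing \b prevents the
# "B" in "Answer: Based on ..." from being read as an option letter.
_CHOICE_PATTERN = re.compile(r"(?:[Aa]nswer is|Answer)\s*:?\s*([A-D])\b")


def extract_choice(response):
    """Return the last option letter (A-D) stated in a response, or None.

    Hypothetical helper for illustration only.
    """
    matches = _CHOICE_PATTERN.findall(response)
    return matches[-1] if matches else None


direct = extract_choice("Answer: C")
with_reasoning = extract_choice(
    "Answer: Based on the provided information, we can construct the "
    "following family relationship chain: ... The answer is: A"
)
```

Taking the last match rather than the first is a design choice for responses that restate intermediate candidates before the final answer.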
Ancestral Trace Challenge (4-shot Six-Step Examples with Reasoning Paths)

Few-shot Prompt 1:
Question: David House, as Jessica Stewart’s father, has a significant impact on Jessica Stewart’s upbringing. Given the scrambled family relationships described above, who is the eldest relative that ’Jessica Stewart’ can trace back to in the context?
A. Jack Burch
B. Jessica Stewart
C. David House
D. Carolyn Jackson
Bot Response: Answer: Based on the provided information, we can construct the following family relationship chain:
- David House, as Jessica Stewart’s father, has a significant impact on Jessica Stewart’s upbringing.
Therefore, the eldest relative that ’Jessica Stewart’ can trace back to in the context is David House. The answer is: C

[…Continue with other few-shot examples…]

Final Challenge Question:
Question: Carlos Smith, as Martin Gary’s grandfather, has a significant impact on Martin Gary’s upbringing. Victoria Dean considers Martin Gary as their grandfather. Victoria Dean is not only Danielle Yates’s maternal grandmother but also Danielle Yates’s role model. Danielle Yates is not only Davina Hernandez’s paternal grandmother but also Davina Hernandez’s role model. Marcus Miller, as Carlos Smith’s paternal grandfather, has a significant impact on Carlos Smith’s upbringing. Davina Hernandez is Jennifer Williams’s mom. Given the scrambled family relationships described above, who is the eldest relative that ’Jennifer Williams’ can trace back to in the context?
A. Marcus Miller
B. Carlos Smith
C. Martin Gary
D. Victoria Dean
Example Answer from GPT4-Turbo: Answer: Based on the provided information, we can construct the following family relationship chain:
- Davina Hernandez is Jennifer Williams’s mom.
- Danielle Yates is not only Davina Hernandez’s paternal grandmother but also Davina Hernandez’s role model.
- Victoria Dean is not only Danielle Yates’s maternal grandmother but also Danielle Yates’s role model.
- Victoria Dean considers Martin Gary as their grandfather.
- Carlos Smith, as Martin Gary’s grandfather, has a significant impact on Martin Gary’s upbringing.
- Marcus Miller, as Carlos Smith’s paternal grandfather, has a significant impact on Carlos Smith’s upbringing.
Therefore, the eldest relative that ’Jennifer Williams’ can trace back to in the context is Marcus Miller. The answer is: A
Figure 22: Examples of 4-shot prompts and responses in the Ancestral Trace Challenge (ATC) with reasoning paths for six-step logic relations
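The reasoning path in Figure 22 amounts to following "younger to elder" links until no older relative is recorded. The sketch below reproduces that chain for the final challenge question, assuming the six relations have already been extracted by hand into (younger, elder) pairs; the function name `eldest_ancestor` and this pair representation are illustrative assumptions, and the natural-language extraction step that the model must perform is not covered.

```python
def eldest_ancestor(person, relations):
    """Follow (younger, elder) links upward from `person`.

    Returns the last person reached, i.e. the one with no recorded elder.
    Hypothetical helper for illustration only.
    """
    elder_of = {younger: elder for younger, elder in relations}
    seen = {person}
    while person in elder_of:
        person = elder_of[person]
        if person in seen:
            raise ValueError("cycle detected in family relations")
        seen.add(person)
    return person


# Relations from the final challenge question in Figure 22, kept in the same
# scrambled order as the prompt and written as (younger, elder) pairs.
FINAL_CHALLENGE = [
    ("Martin Gary", "Carlos Smith"),
    ("Victoria Dean", "Martin Gary"),
    ("Danielle Yates", "Victoria Dean"),
    ("Davina Hernandez", "Danielle Yates"),
    ("Carlos Smith", "Marcus Miller"),
    ("Jennifer Williams", "Davina Hernandez"),
]

answer = eldest_ancestor("Jennifer Williams", FINAL_CHALLENGE)
```

Here `answer` is "Marcus Miller", which corresponds to option A in the figure. Because the lookup is keyed by the younger person, the order in which the relations are listed does not affect the result.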