
DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems

Paper: DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems (https://arxiv.org/abs/2407.10701)

Introduction

DocBench is a benchmark in which a system takes raw PDF files and accompanying questions as input and must generate the corresponding textual answers. It comprises 229 real-world documents and 1,102 questions, spanning five domains and four major question types.
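Each benchmark instance thus pairs a document with a question and a gold answer. As a minimal sketch of how one instance might be represented in Python (the field names are assumptions, not the released schema):

from dataclasses import dataclass

@dataclass
class DocBenchInstance:
    # Hypothetical fields; consult the released data for the actual schema.
    pdf_path: str   # path to the raw PDF document
    domain: str     # one of the five domains
    question: str   # natural-language question about the document
    answer: str     # gold textual answer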

The construction pipeline consists of three phases: (a) Document Collection; (b) QA-pair Generation; (c) Quality Check.

Dataset Overview

Data

Data can be downloaded from: https://drive.google.com/drive/folders/1yxhF1lFF2gKeTNc8Wh0EyBdMT3M4pDYr?usp=sharing

Implementations

Keys from Hugging Face and OpenAI are required: replace HF_KEY and OPENAI_API_KEY in secret_key.py with your own keys.
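A minimal secret_key.py would just define those two constants (the values below are placeholders):

# secret_key.py -- replace the placeholders with your own credentials
HF_KEY = "hf_xxx"          # Hugging Face access token
OPENAI_API_KEY = "sk-xxx"  # OpenAI API key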

a. Download

Download the models to evaluate:

bash download.sh
  • YOUR_OWN_DIR: directory where the downloaded models are saved
  • MODEL_TO_DOWNLOAD: model name on the Hugging Face Hub
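download.sh is presumably a thin wrapper around a Hugging Face download; an equivalent Python sketch using the real huggingface_hub API (the placeholders match the parameters above):

from huggingface_hub import snapshot_download
from secret_key import HF_KEY

# Fetch every file of the model repo into the target directory.
snapshot_download(
    repo_id="MODEL_TO_DOWNLOAD",  # model name from the Hugging Face Hub
    local_dir="YOUR_OWN_DIR",     # where to save the downloaded model
    token=HF_KEY,                 # token from secret_key.py
)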

b. Run

First, we deploy vLLM as a server:

python -m vllm.entrypoints.openai.api_server \
  --model your_merged_model_output_path \
  --served-model-name my_model \
  --worker-use-ray \
  --tensor-parallel-size 8 \
  --port 8081 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --max-model-len 8192
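Because vLLM exposes an OpenAI-compatible API, the server can be smoke-tested with the standard openai client before running the benchmark (the port and model name follow the command above; the question is purely illustrative):

from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any placeholder API key.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my_model",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Summarize the attached document in one sentence."}],
)
print(response.choices[0].message.content)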

Second, we run the models for inference:

python run.py \
  --system gpt4 \
  --model_dir MODEL_DIR \
  --initial_folder 0

Omit the --model_dir line when using API-based models.
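Conceptually, the run step walks over the document folders, pairs each PDF with its questions, and queries the chosen system. A simplified, self-contained sketch of that loop (the helper signature and folder layout are assumptions, not run.py's actual code):

import os
from typing import Callable

def run_all(pdf_dir: str,
            questions: dict[str, list[str]],
            ask: Callable[[str, str], str],
            initial_folder: int = 0) -> dict:
    """Answer every question for every document folder.

    questions maps folder name -> list of questions (hypothetical layout);
    ask is any callable (pdf_path, question) -> answer, e.g. an API call
    or a request to the local vLLM server; initial_folder mirrors the
    --initial_folder flag for resuming an interrupted run.
    """
    answers = {}
    for folder in sorted(os.listdir(pdf_dir))[initial_folder:]:
        pdf_path = os.path.join(pdf_dir, folder)
        for q in questions.get(folder, []):
            answers[(folder, q)] = ask(pdf_path, q)
    return answers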

c. Evaluate

Evaluate the results:

python evaluate.py \
  --system gpt4 \
  --resume_id 0

Note: warnings may be raised for unexpected outputs; check the flagged outputs according to the warning hints.
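evaluate.py implements the benchmark's own scoring. Purely as an illustration of one generic way to compare a generated answer against a reference (not necessarily DocBench's metric), a normalized exact-match check looks like:

import re
import string

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

# exact_match("The answer is 42.", "the answer is 42") -> True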

Citation

If you find this work useful, please cite our paper:

@misc{zou2024docbenchbenchmarkevaluatingllmbased,
      title={DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems}, 
      author={Anni Zou and Wenhao Yu and Hongming Zhang and Kaixin Ma and Deng Cai and Zhuosheng Zhang and Hai Zhao and Dong Yu},
      year={2024},
      eprint={2407.10701},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10701}, 
}
