First Place Solution of 2023 Global Artificial Intelligence Technology
Innovation Competition Track 1

Xiangyu Wu¹ Hailiang Zhang¹ Yang Yang¹ Jianfeng Lu¹ ¹ Nanjing University of Science and Technology
{wxy_yyjhl,121106022667,yyang,lujf}@njust.edu.cn

Abstract

In this paper, we present our champion solution to the Global Artificial Intelligence Technology Innovation Competition Track 1: Medical Imaging Diagnosis Report Generation. We select CPT-BASE as the base model for the text generation task. In the pre-training stage, we remove the masked language modeling task of CPT-BASE, rebuild the vocabulary, and perform the denoising auto-encoder pre-training task with a span mask strategy whose masking ratio is gradually increased. In the fine-tuning stage, we design iterative retrieval augmentation and noise-aware similarity bucket prompt strategies. Retrieval augmentation constructs a mini-knowledge base that enriches the input of the model, while the similarity bucket makes the model aware of the noise within the mini-knowledge base and guides it to generate higher-quality diagnostic reports based on the similarity prompts. Our single model achieves a score of 2.321 on leaderboard A, and the multi-model fusion scores 2.362 and 2.320 on the A and B leaderboards respectively, securing first place in the rankings.

1 Introduction

Medical imaging diagnosis report generation [5, 13, 21, 3] is an important research topic in the field of medical artificial intelligence. With the development of natural language processing [17, 16, 25, 19] technology, it has become feasible to automatically generate medical imaging diagnostic reports. The traditional diagnostic process relies on the professional knowledge and experience of radiologists to write detailed diagnostic reports, a process that is time-consuming and easily influenced by subjective factors.

Figure 1: A sample from the dataset. Clinical and Description are the inputs, while Diagnosis is the output. Note that in the competition all text has been desensitized at the character level and is given as numbers separated by spaces (e.g., 88 29 17 55 72).

In recent years, with the advancement of artificial intelligence technology [24, 29, 4, 23], researchers have begun to explore how to use deep learning models [30, 18, 12] to automatically generate accurate diagnostic reports. These models typically generate natural language descriptions based on sequence generation models such as Recurrent Neural Networks (RNN) [31, 33, 28] or Transformers [22, 14, 27, 26] and require a large amount of annotated data, including descriptions of medical imaging diagnoses and corresponding expert-written diagnostic reports.

Due to data desensitization, it is not feasible to directly fine-tune open-source pre-trained models. Therefore, we select Chinese CPT-BASE [20] as our base model and append the desensitized numbers to the vocabulary of the pre-trained model. To reduce the gap between the pre-training and downstream tasks, we remove the Masked Language Modeling task from CPT-BASE and adopt a span mask [11] strategy to perform the Denoising Auto-Encoding pre-training task, with a progressively larger mask ratio that increases the difficulty of the pre-training task. This helps pre-training capture deeper textual context information.

In the fine-tuning stage, we note the rapid development of retrieval augmentation [6, 9, 7] strategies in the field of natural language processing. Therefore, we introduce retrieval augmentation and adapt it to the competition. For each input sample, we use the embedding of Description as the query and key to retrieve similar Description-Diagnosis pairs, and add them to the input as a mini-knowledge base for the sample, enriching the input information of the model. Additionally, we design a novel noise-aware similarity bucketing prompt strategy, which distributes the training data into different buckets according to their noise levels, so that different buckets represent different levels of sample quality. This training method pushes the model to generate higher-quality diagnostic reports during the inference stage. With additional general tricks (i.e., FGM, R-Dropout, EMA, and Model Ensemble), we achieve first place with a score of 2.320 on the final leaderboard.

2 Related Works

2.1 Text Generation in NLP

Text generation [34, 10, 8, 32, 1] is a widely researched direction in the field of natural language processing. It involves using computer systems to generate text that resembles human language, and it has extensive applications in areas such as machine translation, intelligent customer service, and literary creation. In recent years, significant progress has been made in text generation techniques, notably in text summarization and in general natural language generation. In text summarization, algorithms based on TF-IDF and sequence-to-sequence (Seq2Seq) models from deep learning are commonly used. TF-IDF measures the importance of words in documents, while Seq2Seq models consist of an encoder and a decoder for generating summaries. In natural language generation, large language models based on the Transformer architecture have also gained increasing attention. They implement text generation through an attention mechanism and an encoder-decoder structure.

2.2 Retrieval Augmentation in NLP

Retrieval Augmented Generation (RAG) technology [6, 9, 7, 15, 2] in the field of natural language processing represents an innovative breakthrough. Traditional NLP techniques primarily rely on large language models, but their accuracy and depth may be limited when dealing with complex queries that require extensive background knowledge. To overcome this limitation, RAG combines conventional information retrieval methods with modern generative language models, aiming to enhance the model’s text generation capabilities by incorporating external knowledge sources. The core principle is to integrate retrieval and generation techniques, allowing the model to access and utilize a vast amount of external information before generating text. RAG excels in addressing knowledge-intensive NLP tasks such as question answering, fact verification, and more. In recent years, RAG systems have evolved from a primary stage to an advanced stage, and then to a modular stage, to improve performance, cost-effectiveness, and efficiency.

3 Methods

3.1 Pre-training Stage

Task definition: We define Clinical information as $\mathcal{C}$ , Description information as $\mathcal{D}$ , and the output Diagnostic report as $\mathcal{O}$ . This competition can therefore be regarded as a Diagnostic report generation task conditioned on Clinical and Description information, i.e., $\mathbf{F}(\mathcal{C},\mathcal{D})\longrightarrow\mathcal{O}$ , where $\mathbf{F}$ is the encoder-decoder language model.

As shown in Figure 2, we select Chinese CPT-Base as the base model for text generation, which consists of 12 transformer layers as the encoder and 2 transformer layers as the decoder. The original vocabulary of the CPT model contains 51271 tokens. We sequentially append the anonymized numbers of this competition to the end of the vocabulary, skipping the numbers that already exist in the original vocabulary, resulting in a new vocabulary size of $51271+347=51618$ .
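The vocabulary extension described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extend_vocab`, the toy base vocabulary, and the choice to append new tokens in ascending numeric order are all assumptions.

```python
def extend_vocab(vocab, corpus_lines):
    """Append desensitized number tokens that the base vocabulary lacks.

    Assumption: every new token is a decimal number, and new tokens are
    appended in ascending numeric order.
    """
    known = set(vocab)
    seen = {tok for line in corpus_lines for tok in line.split()}
    new_tokens = sorted(seen - known, key=int)
    extended = list(vocab) + new_tokens
    token_to_id = {tok: idx for idx, tok in enumerate(extended)}
    return extended, token_to_id


# Toy stand-in for the 51271-entry CPT vocabulary (hypothetical content).
base_vocab = ["[PAD]", "[SEP]", "[MASK]", "1", "2"]
corpus = ["88 29 17 55 72", "29 1 88 2"]
vocab, token_to_id = extend_vocab(base_vocab, corpus)
print(len(base_vocab), "->", len(vocab))
```

Tokens such as "1" and "2" keep their original ids, while the five unseen numbers receive fresh ids at the end. With a real model, the embedding matrix would also have to grow to match the new vocabulary size.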

In the pre-training stage, we remove the MLM (Masked Language Modeling) pre-training task from the CPT model and retain only the DAE (Denoising Auto-encoder) task for pre-training. This is done to maintain consistency between the pre-training task and the downstream task, reducing the gap between them. We concatenate $\mathcal{C}$ (Clinical), $\mathcal{D}$ (Description), and $\mathcal{O}$ (Diagnostic report), and use the $[\mathbf{SEP}]$ token to separate them, resulting in the input $[\mathcal{C},\mathcal{D},\mathcal{O}]$ for the pre-training stage.
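As a small illustration of the input format, the concatenation can be written as below. The helper name `build_pretrain_input` and the whitespace tokenization are assumptions made for this sketch.

```python
SEP = "[SEP]"


def build_pretrain_input(clinical, description, diagnosis, sep=SEP):
    """Join Clinical, Description and Diagnosis into one token list."""
    fields = [clinical, description, diagnosis]
    tokens = []
    for i, field in enumerate(fields):
        if i > 0:
            tokens.append(sep)  # separator between consecutive fields
        tokens.extend(field.split())
    return tokens


example = build_pretrain_input("88 29", "17 55 72", "29 17")
print(example)
```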

Regarding the mask strategy, we notice that the text content in $\mathcal{C}$ , $\mathcal{D}$ , and $\mathcal{O}$ usually appears in chunks, so we choose the span mask strategy as our final masking approach. We use a Poisson distribution to generate the mask length, biasing it towards smaller values to match the typical length of these chunks. Furthermore, we find that pre-training in the first competition stage could run for 150 epochs, while in the second competition stage training saturated at 40 epochs. This inspired us to increase the difficulty of the pre-training task by gradually raising the proportion of masked tokens as the number of epochs grows. Specifically, we set an initial mask proportion of 0.3, and after every 10 epochs of pre-training, we fine-tune on the downstream task. If the fine-tuning performance is lower than that of the previous check, we increase the mask proportion by 0.05 and continue pre-training. Ultimately, we extend pre-training to 140 epochs, which significantly improves the text generation performance on the downstream task.
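A minimal sketch of the two ideas in this paragraph, span masking with Poisson-distributed lengths and the mask-ratio schedule, is given below. Several details are assumptions rather than the authors' settings: the Poisson mean `lam=3.0`, replacing every token of a span with `[MASK]` (instead of collapsing the span to one token), never masking `[SEP]`, and the upper bound `max_ratio`.

```python
import math
import random

MASK, SEP = "[MASK]", "[SEP]"


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def span_mask(tokens, mask_ratio, rng, lam=3.0):
    """Mask about mask_ratio of the non-[SEP] tokens in contiguous spans."""
    tokens = list(tokens)
    n_candidates = sum(1 for t in tokens if t != SEP)
    budget = int(round(mask_ratio * n_candidates))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * len(tokens):
        attempts += 1
        length = max(1, sample_poisson(lam, rng))
        length = min(length, budget - len(masked))
        start = rng.randrange(len(tokens))
        span = range(start, min(start + length, len(tokens)))
        # Reject spans that touch a separator or an already masked position.
        if any(tokens[i] == SEP or i in masked for i in span):
            continue
        masked.update(span)
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    return corrupted, sorted(masked)


def next_mask_ratio(ratio, prev_score, new_score, step=0.05, max_ratio=0.9):
    """Raise the mask ratio when the latest fine-tuning score dropped."""
    if prev_score is not None and new_score < prev_score:
        return min(max_ratio, round(ratio + step, 2))
    return ratio


rng = random.Random(0)
tokens = "88 29 17 [SEP] 55 72 11 12 13 14 [SEP] 29 17 55".split()
corrupted, masked = span_mask(tokens, mask_ratio=0.3, rng=rng)
print(corrupted)
```

In a training loop, `next_mask_ratio` would be called after each 10-epoch block with the previous and current downstream scores, and the returned ratio would be passed to `span_mask` for the next block.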

3.2 Fine-tuning Stage

3.2.1 Retrieval Augmentation.

Figure 3 illustrates our iterative retrieval augmentation strategy, which consists of three main parts: the construction of the retrieval knowledge base, nearest neighbor retrieval, and retrieval iterations.

Retrieval knowledge base. We divide the entire training set into training and validation sets in a $9:1$ ratio, where the training set is used to construct the retrieval knowledge base, and the validation set is used for testing the performance of the model. For each sample in the knowledge base, we extract the embedding of $\mathcal{D}$ (Description) as the key and the original $\mathcal{O}$ (Diagnostic report) corresponding to the $\mathcal{D}$ (Description) as the value. Therefore, each sample in the training set can form a key-value pair.

Nearest Neighbor Retrieval. To construct the training set with retrieval knowledge, we use $\mathcal{D}$ (Description) as the query and calculate its similarity to the key of each key-value pair in the knowledge base (e.g., vector inner product, L2 distance, or cosine similarity). If the similarity is larger than the threshold k, we call it an effective retrieval. We retrieve this key-value pair and concatenate the value to the end of the query, forming the new training sample corresponding to the query. For the validation set and the test set, we use the same retrieval method to construct their versions with retrieval knowledge.
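The retrieval step can be sketched as below, using cosine similarity as the similarity measure. The tuple layout of the knowledge base, the `exclude_id` argument (which stops a training sample from retrieving its own label), and the `[SEP]` joiner are assumptions for this illustration; the text does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, knowledge_base, threshold, exclude_id=None):
    """Return (similarity, value) of the best key above threshold, or None."""
    best = None
    for sample_id, key_emb, value in knowledge_base:
        if sample_id == exclude_id:  # assumption: skip the query's own entry
            continue
        sim = cosine(query_emb, key_emb)
        if sim > threshold and (best is None or sim > best[0]):
            best = (sim, value)
    return best


def augment(description, query_emb, knowledge_base, threshold,
            exclude_id=None, sep="[SEP]"):
    """Append the retrieved Diagnosis to the Description when one is found."""
    hit = retrieve(query_emb, knowledge_base, threshold, exclude_id)
    if hit is None:
        return description  # no effective retrieval: input is unchanged
    return f"{description} {sep} {hit[1]}"


# Toy knowledge base: (sample id, Description embedding, Diagnosis text).
kb = [
    (0, [1.0, 0.0], "11 12"),
    (1, [0.9, 0.1], "21 22"),
    (2, [0.0, 1.0], "31 32"),
]
print(augment("5 6", [1.0, 0.0], kb, threshold=0.8, exclude_id=0))
```

A linear scan is enough for a toy example; at the scale of the competition data, an approximate nearest-neighbor index would be the more practical choice.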

Retrieval Iterations. For the first retrieval augmentation, the embeddings of key-value pairs are computed using a model trained on a training set without a knowledge base. However, as retrieval augmentation progresses, the performance of the model also gradually improves, which suggests that we can use the augmented model to recalculate the embeddings of key-value pairs. This not only results in more accurate representations of the embeddings but also improves the accuracy of retrieval. Therefore, we design an iterative retrieval augmentation strategy that uses the augmented model to continue the retrieval augmentation process, iteratively training an even better model.
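The iteration can be summarized by the loop below. It is only a skeleton: `train_fn`, `build_kb_fn`, and `augment_fn` are hypothetical callables standing in for fine-tuning, embedding the knowledge base with the current model, and the nearest-neighbor step, and the toy implementations exist solely to make the sketch runnable.

```python
def iterative_retrieval(samples, rounds, train_fn, build_kb_fn, augment_fn):
    """Alternate between training a model and re-retrieving with it."""
    model = train_fn(samples)  # round 0: trained without a knowledge base
    augmented = samples
    for _ in range(rounds):
        kb = build_kb_fn(model, samples)  # re-embed keys with the new model
        augmented = [augment_fn(model, kb, s) for s in augmented]
        model = train_fn(augmented)  # train on the enriched inputs
    return model, augmented


# Toy stand-ins (hypothetical) so the loop can be executed.
train_calls = []


def toy_train(data):
    train_calls.append(len(data))
    return {"version": len(train_calls)}


def toy_build_kb(model, samples):
    return {s["id"]: s["diagnosis"] for s in samples}


def toy_augment(model, kb, sample):
    neighbour = (sample["id"] + 1) % len(kb)  # pretend this is the nearest key
    return {**sample, "retrieved": sample["retrieved"] + [kb[neighbour]]}


data = [
    {"id": 0, "description": "1 2", "diagnosis": "9", "retrieved": []},
    {"id": 1, "description": "3 4", "diagnosis": "8", "retrieved": []},
]
final_model, final_data = iterative_retrieval(
    data, 2, toy_train, toy_build_kb, toy_augment
)
print(final_model, final_data[0]["retrieved"])
```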

3.2.2 Similarity Bucketing.

It is worth noting that each iteration of retrieval augmentation may add a newly retrieved Diagnostic report to the end of the sample. After n iterations, a sample therefore carries between 0 and n pseudo Diagnostic reports. While more Diagnostic reports bring a greater diversity of information, they also mean that the sample contains more noisy information. So, how can we add as many pseudo Diagnostic reports as possible while ensuring that the model is influenced as little as possible by the noise?

As shown in Figure 4, we design a noise-aware similarity bucketing prompt strategy. For each sample in the training set, we calculate the similarity between the input and output, where the input refers to $\mathcal{C}$ (Clinical), $\mathcal{D}$ (Description), and the retrieved $\mathcal{O}$ (Diagnostic reports), while the output is the corresponding $\mathcal{O}$ (Diagnostic report) label for the sample. Based on this similarity, we divide the training set into n buckets, with the first bucket containing samples with high similarity and the last bucket containing samples with low similarity. We believe that higher similarity indicates that the retrieved diagnostic reports are closer to the labeled diagnostic report, meaning it is a high-quality sample. To integrate the bucket signal into the model’s input, we use ['best match', 'good match', 'not good match', 'noisy match'] to represent the different buckets, and add these signals to the front of the $\mathcal{C}$ (Clinical) information.

The similarity values over the dataset are approximately normally distributed. Higher similarity indicates that the sample’s $\mathcal{D}$ (Description) and $\mathcal{O}$ (Diagnostic reports) are highly relevant. Lower similarity suggests that the $\mathcal{D}$ (Description) and $\mathcal{O}$ (Diagnostic reports) are less relevant, so the sample is likely to be noisy. Each sample belongs to exactly one bucket, and through training each sample is associated with the prompt of its bucket. During inference, we fix the prompt to 'best match', forcing the model to generate the most similar, best-matched, and highest-quality $\mathcal{O}$ (Diagnostic reports).
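The bucketing and prompting steps can be sketched as follows. The similarity values are taken as given, and splitting them into equal-frequency buckets by rank is an assumption of this sketch; the text does not say how the bucket boundaries are chosen.

```python
PROMPTS = ["best match", "good match", "not good match", "noisy match"]


def assign_buckets(similarities, n_buckets=len(PROMPTS)):
    """Rank samples by similarity and split them into equal-sized buckets.

    Bucket 0 holds the most similar (cleanest) samples.
    """
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    buckets = [0] * len(similarities)
    for rank, idx in enumerate(order):
        buckets[idx] = min(n_buckets - 1, rank * n_buckets // len(similarities))
    return buckets


def add_prompt(clinical, bucket=0):
    """Prefix the Clinical text with the prompt of the given bucket."""
    return f"{PROMPTS[bucket]} {clinical}"


sims = [0.9, 0.2, 0.6, 0.4, 0.8, 0.1, 0.7, 0.3]
train_buckets = assign_buckets(sims)
train_inputs = [add_prompt("88 29", b) for b in train_buckets]
inference_input = add_prompt("88 29")  # bucket fixed to 'best match'
print(train_buckets)
print(inference_input)
```

During training each sample sees the prompt of its own bucket, while at inference the default `bucket=0` always selects 'best match'.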

3.3 Model Tricks

FGM. FGM (Fast Gradient Method) adds adversarial perturbations along the gradient direction to the embeddings during training, which regularizes the model parameters and enhances the robustness and generalization ability of the model.
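As a minimal sketch of the perturbation used by the Fast Gradient Method, the embedding is shifted by a step of size epsilon along the L2-normalized gradient. The function name and the value of `epsilon` are assumptions; the settings used in the competition are not given.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2, or zeros for a zero gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


embedding = [0.2, -0.1]
grad = [3.0, 4.0]  # gradient of the loss w.r.t. the embedding (toy values)
delta = fgm_perturbation(grad, epsilon=0.5)
adversarial_embedding = [e + d for e, d in zip(embedding, delta)]
print(delta)
```

In a training loop, the loss would be computed a second time on the perturbed embedding, its gradients accumulated, and the embedding restored before the optimizer step.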

R-Dropout. R-Dropout feeds each input through the model twice with different dropout masks and adds a regularization term that constrains the two output predictions to be consistent, thereby improving the model’s robustness and generalization.
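A sketch of the usual R-Drop objective for a single prediction: the negative log-likelihood of both forward passes plus a weighted symmetric KL divergence between their output distributions. The weight `alpha` and the toy distributions are assumptions, and all probabilities are assumed to be strictly positive.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(p1, p2, target, alpha=1.0):
    """NLL of two dropout passes plus alpha times their symmetric KL."""
    nll = -(math.log(p1[target]) + math.log(p2[target]))
    sym_kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * sym_kl


# Two output distributions for the same input under different dropout masks.
pass_1 = [0.7, 0.2, 0.1]
pass_2 = [0.6, 0.3, 0.1]
loss = r_drop_loss(pass_1, pass_2, target=0)
print(round(loss, 4))
```

When the two passes agree exactly, the KL term vanishes and the loss reduces to the sum of the two likelihood terms.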

EMA. EMA (Exponential Moving Average) averages the weights of the model over training steps, which makes the weight updates smoother and enhances the model’s generalization and stability.
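The exponential moving average of the weights can be sketched in a few lines. The decay value is an assumption; the setting used in the competition is not given.

```python
def ema_update(shadow, weights, decay=0.999):
    """Blend the running average with the current weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


shadow = [1.0, -2.0]
for step_weights in ([0.0, 0.0], [0.0, 0.0]):  # toy weights after two steps
    shadow = ema_update(shadow, step_weights, decay=0.9)
print(shadow)
```

The averaged (shadow) weights are the ones that would be used for evaluation, while the optimizer keeps updating the raw weights.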

Model Ensemble. Given the predictions of n models for a sample, each prediction in turn is treated as the candidate answer, and the remaining predictions are used as references. The CIDEr score between the candidate answer and all references is calculated, and the total is used as the score for that candidate answer. The candidate answer with the highest score is selected as the final integrated answer.
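The selection rule can be sketched as below. To keep the example short and self-contained, CIDEr is replaced by a simple token-overlap F1 (`overlap_f1`), and the score of a candidate is the sum of its pairwise scores against the other predictions. Both are stand-ins for illustration only; a real CIDEr implementation would be passed in as `score_fn`.

```python
from collections import Counter


def overlap_f1(candidate, reference):
    """Token-overlap F1; a hypothetical stand-in for CIDEr."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    common = sum((c & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(c.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def select_consensus(predictions, score_fn=overlap_f1):
    """Pick the prediction that agrees most with all the other predictions."""
    best_idx, best_total = 0, float("-inf")
    for i, candidate in enumerate(predictions):
        total = sum(score_fn(candidate, ref)
                    for j, ref in enumerate(predictions) if j != i)
        if total > best_total:
            best_idx, best_total = i, total
    return predictions[best_idx]


model_outputs = ["1 2 3", "1 2 4", "1 2 3 5", "7 8 9"]
print(select_consensus(model_outputs))
```

The outlier "7 8 9" shares no tokens with the other outputs and therefore scores lowest, while "1 2 3" overlaps most with the rest and is selected.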

4 Experiments

4.1 Dataset.

The training set for the first stage of the competition consists of 20,000 samples, and Test Sets A and B each have 3,000 samples. In the second stage, the training set comprises 80,000 samples, and Test Sets A and B each have 7,500 samples. Clinical information is only provided in the second stage.

4.2 Leaderboards.

Method	CIDEr	BLEU	Score
Baseline	3.0793	0.4043	2.1876
Span Mask	3.1446	0.4058	2.2317
Retrieval-1	3.2130	0.4241	2.2834
Retrieval-2	3.2374	0.4291	2.3013
Bucketing	3.2553	0.4288	2.3132
Tricks	3.2735	0.4342	2.3271
Ensemble	3.3242	0.4384	2.3622

Table 1: Results of each component.

Table 1 shows the improvement in model performance contributed by each of our components. The span mask strategy with an increasing mask ratio and the first retrieval iteration give the largest gains in text generation performance. The score of the single model reaches 2.3271, surpassing the model-ensemble scores of most teams. In the end, we ensemble 10 CPT-Base models, achieving scores of 2.362 and 2.320 on Leaderboards A and B, respectively, and securing first place in the final competition.
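The scoring formula is not stated in this text. As an observation only, every row of Table 1 is numerically consistent with Score = (2 x CIDEr + BLEU) / 3, up to rounding of the reported values. The check below treats this weighting as an inferred assumption, not as the official competition definition.

```python
# (CIDEr, BLEU, Score) copied from Table 1.
TABLE_1 = {
    "Baseline": (3.0793, 0.4043, 2.1876),
    "Span Mask": (3.1446, 0.4058, 2.2317),
    "Retrieval-1": (3.2130, 0.4241, 2.2834),
    "Retrieval-2": (3.2374, 0.4291, 2.3013),
    "Bucketing": (3.2553, 0.4288, 2.3132),
    "Tricks": (3.2735, 0.4342, 2.3271),
    "Ensemble": (3.3242, 0.4384, 2.3622),
}


def combined_score(cider, bleu):
    """Assumed weighting inferred from Table 1, not an official formula."""
    return (2.0 * cider + bleu) / 3.0


max_gap = max(abs(combined_score(c, b) - s) for c, b, s in TABLE_1.values())
for name, (cider, bleu, score) in TABLE_1.items():
    print(f"{name:12s} reported={score:.4f} "
          f"recomputed={combined_score(cider, bleu):.4f}")
print("largest gap:", max_gap)
```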

References

Balepur et al. [2023] Nishant Balepur, Jie Huang, and Kevin Chen-Chuan Chang. Expository text generation: Imitate, retrieve, paraphrase. In EMNLP, pages 11896–11919, 2023.
Chen et al. [2022] Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W. Cohen. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In EMNLP, pages 5558–5570, 2022.
Fotin et al. [2016] Sergei V. Fotin, Yin Yin, Hrishikesh Haldankar, Jeffrey W. Hoffmeister, and Senthil Periaswamy. Workflow improvements for digital breast tomosynthesis: computerized generation of enhanced synthetic images. In Medical Imaging 2016: Computer-Aided Diagnosis, San Diego, California, United States, 27 February - 3 March 2016, page 97850K. SPIE, 2016.
Fu et al. [2024] Zhongtian Fu, Kefei Song, Luping Zhou, and Yang Yang. Noise-aware image captioning with progressively exploring mismatched words. In AAAI, pages 12091–12099, 2024.
Gao et al. [2023] Xingyu Gao, Feng Shi, Dinggang Shen, and Manhua Liu. Multimodal transformer network for incomplete image generation and diagnosis of alzheimer’s disease. Comput. Medical Imaging Graph., 110:102303, 2023.
Gou et al. [2023] Qi Gou, Zehua Xia, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, and Cam-Tu Nguyen. Diversify question generation with retrieval-augmented style transfer. In EMNLP, pages 1677–1690, 2023.
Hofstätter et al. [2023] Sebastian Hofstätter, Jiecao Chen, Karthik Raman, and Hamed Zamani. Fid-light: Efficient and effective retrieval-augmented text generation. In SIGIR, pages 1437–1447, 2023.
Hu and Wan [2023] Xinyu Hu and Xiaojun Wan. RST discourse parsing as text-to-text generation. IEEE ACM Trans. Audio Speech Lang. Process., 31:3278–3289, 2023.
Huang et al. [2023] Wenyu Huang, Mirella Lapata, Pavlos Vougiouklis, Nikos Papasarantopoulos, and Jeff Z. Pan. Retrieval augmented generation with rich answer encoding. In IJCNLP, pages 1012–1025, 2023.
Hussein and Savas [2024] Mustafa Abbas Hussein Hussein and Serkan Savas. Lstm-based text generation: A study on historical datasets. CoRR, abs/2403.07087, 2024.
Joshi et al. [2020] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguistics, 8:64–77, 2020.
Li et al. [2024] Xin-Chun Li, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, Yang Yang, and De-Chuan Zhan. MAP: model aggregation and personalization in federated learning with incomplete classes. CoRR, abs/2404.09232, 2024.
Liu et al. [2023a] Ruhan Liu, Tianqin Wang, Huating Li, Ping Zhang, Jing Li, Xiaokang Yang, Dinggang Shen, and Bin Sheng. Tmm-nets: Transferred multi- to mono-modal generation for lupus retinopathy diagnosis. IEEE Trans. Medical Imaging, 42(4):1083–1094, 2023a.
Liu et al. [2023b] Yucheng Liu, Zipeng Gao, Xiangyang Liu, Pengfei Luo, Yang Yang, and Hui Xiong. QTIAH-GNN: quantity and topology imbalance-aware heterogeneous graph neural network for bankruptcy prediction. In KDD, pages 1572–1582, 2023b.
Long et al. [2022] Alexander Long, Wei Yin, Thalaiyasingam Ajanthan, Vu Nguyen, Pulak Purkait, Ravi Garg, Alan Blair, Chunhua Shen, and Anton van den Hengel. Retrieval augmented classification for long-tail visual recognition. In CVPR, pages 6949–6959, 2022.
Lopez-Gazpio [2024] Iñigo Lopez-Gazpio. Revisiting challenges and hazards in large language model evaluation. Proces. del Leng. Natural, 72:15–30, 2024.
Manaka et al. [2024] Thokozile Manaka, Terence L. van Zyl, Deepak Kar, and Alisha N. Wade. Multi-step transfer learning in natural language processing for the health domain. Neural Process. Lett., 56(3):177, 2024.
Meng et al. [2024] Lingwu Meng, Jing Wang, Ran Meng, Yang Yang, and Liang Xiao. A multiscale grouping transformer with CLIP latents for remote sensing image captioning. IEEE Trans. Geosci. Remote. Sens., 62:1–15, 2024.
Ni et al. [2024] Shiwen Ni, Jiawen Li, Min Yang, and Hung-Yu Kao. Dropattack: A random dropped weight attack adversarial training for natural language understanding. IEEE ACM Trans. Audio Speech Lang. Process., 32:364–373, 2024.
Shao et al. [2024] Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Hang Yan, Fei Yang, Li Zhe, Hujun Bao, and Xipeng Qiu. CPT: a pre-trained unbalanced transformer for both chinese language understanding and generation. Sci. China Inf. Sci., 67(5), 2024.
Shmueli et al. [2022] Omer Zucker Shmueli, Chen Solomon, Noam Ben-Eliezer, and Hayit Greenspan. Deep learning based multiple sclerosis lesion detection utilizing synthetic data generation and soft attention mechanism. In Medical Imaging 2022: Computer-Aided Diagnosis, San Diego, CA, USA, February 20-24, 2022 / online, March 21-27, 2022. SPIE, 2022.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
Wu et al. [2024] Xiangyu Wu, Qing-Yuan Jiang, Yang Yang, Yi-Feng Wu, Qingguo Chen, and Jianfeng Lu. TAI++: text as image for multi-label image classification by co-learning transferable prompt. CoRR, abs/2405.06926, 2024.
Xi et al. [2023] Wenjuan Xi, Xin Song, Weili Guo, and Yang Yang. Robust semi-supervised learning for self-learning open-world classes. In ICDM, pages 658–667, 2023.
Yáñez-Romero et al. [2024] Fabio Yáñez-Romero, Andrés Montoyo, Rafael Muñoz, Yoan Gutiérrez, and Armando Suárez. Ontolm: Integrating knowledge bases and language models for classification in the medical domain. Proces. del Leng. Natural, 72:137–148, 2024.
Yang et al. [2022] Yang Yang, Jingshuai Zhang, Fan Gao, Xiaoru Gao, and Hengshu Zhu. DOMFN: A divergence-orientated multi-modal fusion network for resume assessment. In ACM MM, pages 1612–1620, 2022.
Yang et al. [2023a] Yang Yang, Yurui Huang, Weili Guo, Baohua Xu, and Dingyin Xia. Towards global video scene segmentation with context-aware transformer. In AAAI, pages 3206–3213, 2023a.
Yang et al. [2023b] Yang Yang, Jia-Qi Yang, Ran Bao, De-Chuan Zhan, Hengshu Zhu, Xiaoru Gao, Hui Xiong, and Jian Yang. Corporate relative valuation using heterogeneous multi-modal graph neural network. TKDE, 35(1):211–224, 2023b.
Yang et al. [2023c] Yang Yang, Yuxuan Zhang, Xin Song, and Yi Xu. Not all out-of-distribution data are harmful to open-set active learning. In NeurIPS, 2023c.
Yang et al. [2024] Yang Yang, Nan Jiang, Yi Xu, and De-Chuan Zhan. Robust semi-supervised learning by wisely leveraging open-set data. CoRR, abs/2405.06979, 2024.
Zaremba et al. [2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.
Zheng et al. [2023] Carolina Zheng, Claudia Shi, Keyon Vafa, Amir Feder, and David M. Blei. An invariant learning characterization of controlled text generation. In ACL, pages 3186–3206, 2023.
Zhou et al. [2022] Da-Wei Zhou, Yang Yang, and De-Chuan Zhan. Learning to classify with incremental new class. IEEE Trans. Neural Networks Learn. Syst., 33(6):2429–2443, 2022.
Zhu et al. [2024] Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, and Rui Wang. Improving open-ended text generation via adaptive decoding. CoRR, abs/2402.18223, 2024.
