LoFTI: Localization and Factuality Transfer to Indian Locales

Sona Elza Simon‡, Soumen Kumar Mondal‡, Abhishek Singhania§,
Sayambhu Sen§, Preethi Jyothi‡

‡ Indian Institute of Technology Bombay, Mumbai, India
§ Amazon Alexa
{sona.simon,23m2157,pjyothi}@iitb.ac.in, {mrabhsin,sensayam}@amazon.com
Abstract

Large language models (LLMs) encode vast amounts of world knowledge acquired via training on large web-scale datasets crawled from the internet. However, these datasets typically exhibit a geographical bias towards English-speaking Western countries. This results in LLMs producing biased or hallucinated responses to queries that require answers localized to other geographical regions. In this work, we introduce a new benchmark named LoFTI (Localization and Factuality Transfer to Indian Locales) that can be used to evaluate an LLM’s localization and factual text transfer capabilities. LoFTI consists of factual statements about entities in source and target locations; the source locations are spread across the globe and the target locations are all within India with varying degrees of hyperlocality (country, states, cities). The entities span a wide variety of categories. We use LoFTI to evaluate Mixtral, GPT-4 and two other Mixtral-based approaches well-suited to the task of localized factual transfer. We demonstrate that LoFTI is a high-quality evaluation benchmark and all the models, including GPT-4, produce skewed results across varying levels of hyperlocality.

1 Introduction

Large language models (LLMs) are proficient in text generation and are also extensive repositories of world knowledge, owing to their pretraining and fine-tuning on vast and diverse internet data. This suggests that LLMs might be effective at transferring factual knowledge across geographical locations. They can generate localized text in a given target location by transferring from a reference text in a source location. However, there is no existing benchmark that helps assess this specific form of localization and fact-driven transfer. Benchmarks that measure LLMs’ ability to understand cultural concepts and their transference across geographical regions are steadily emerging in recent work (Li et al., 2024a, c, b; Rao et al., 2024). We argue that it is also important to evaluate the ability of models to transfer factual knowledge across geographical regions. Figure 1 illustrates this point by showing two use-cases: 1) Generating a localized response given a common question that can be asked across locations and, 2) accurate factuality transfer from one locale to another.

Figure 1: Illustration of LLM’s localized factual text transfer capabilities.

In this work, we introduce a new evaluation benchmark called LoFTI (Localization and Factuality Transfer to Indian Locales). Notable features of LoFTI are:

  • It contains factual statements in source and target locations involving source and target entities.

  • The statements are grounded in various source locations across the globe, while all the target locations are in India.

  • The target locations are at different levels of hyperlocality: specific to India as a whole, or specific to states or cities within India.

  • The entities in the statements span a diverse set of categories including food, sports, nature, etc.

  • Each parallel set of statements is accompanied by (one or more) common questions that can be answered at any location.

The motivation behind creating LoFTI stems from the lack of comprehensive multi-locale (and multilingual) data on the internet, which is essential for both training and evaluating LLMs. Simple translations of English datasets are inadequate because they predominantly feature Western entities and facts, introducing biases that are irrelevant or inaccurate for non-Western locales. LoFTI can be used as a benchmark to help improve LLMs on factuality transfer in English from reference to target locations. Once we have high-performing LLMs on this task in English, we could potentially create multilingual factual data using direct translations of the target text into languages specific to the target locations. LoFTI can also be used to benchmark multilingual/multi-locale LLMs by evaluating their performance on localized question answering with different context locations.

In this work, we define three different metrics to evaluate the quality of both localization and factuality transfer on LoFTI. We evaluate the performance of powerful open-source (Mixtral) and closed-source models (GPT-4) on LoFTI. We also develop two variants of Mixtral that leverage external sources of evidence to significantly improve performance on all three metrics. While GPT-4 is expectedly superior in performance compared to all Mixtral variants, it shows degradation in performance across target locations of varying hyperlocality, thus revealing clear gaps in coverage across geographical regions. We publicly release LoFTI under the Apache 2.0 license. The LoFTI dataset and codebase are available at https://huggingface.co/datasets/sonasimon/LoFTI and https://github.com/csalt-research/LoFTI.

2 Methodology for Dataset Creation

Figure 2 describes the overall dataset creation pipeline with the help of an example. Next, we outline the details of each step in the dataset creation process.

Figure 2: Illustration of the dataset creation pipeline with an example.

2.1 Generation of Entity-pairs

For dataset creation, we compile pairs of entities $(e_{\text{ref}}, e_{\text{tar}})$, where $e_{\text{ref}}$ is an entity from a reference location outside India and $e_{\text{tar}}$ is an entity from India that serves as a suitable substitute for $e_{\text{ref}}$. These pairs are curated by human annotators and cover diverse categories and hyperlocal regions.

2.2 Reference Text Generation

Given the reference entity $e_{\text{ref}}$, a fact-based reference text $T_{\text{ref}}$ is obtained from the entity’s description on the internet. We use the Google API Client or Wikipedia for this purpose. If no entity description is found from these sources, human annotators are tasked with providing the reference text.

2.3 Text Localization

Given a reference text $T_{\text{ref}}$ and a target entity $e_{\text{tar}}$ (paired with $e_{\text{ref}}$) from a target location $L_{\text{tar}}$, text localization aims to generate a target text $T_{\text{tar}}$ localized to $L_{\text{tar}}$ that retains the stylistic and semantic features of $T_{\text{ref}}$. This process involves localizing the entities and facts present in $T_{\text{ref}}$ while ensuring factual correctness. For text localization, we employ the Mixtral-8x7b-instruct-v0.1.Q4_K_M model. Given the target location $L_{\text{tar}}$, target entity $e_{\text{tar}}$, and the reference text $T_{\text{ref}}$, we prompt the Mixtral model to generate the localized target text $T_{\text{tar}}$. The prompt used for text localization is given in Figure A2.

LoFTI Dataset Details
| Statistic | Value |
|---|---|
| No. of entity pairs | 1100 |
| No. of categories | 99 |
| No. of entities from US/Europe | 651 |
| No. of entity pairs from other places | 449 |
| No. of entities with high cardinality | 835 |
| No. of entities with low cardinality | 265 |
| No. of entities with hyperlocal score = 1 | 369 |
| No. of entities with hyperlocal score = 2 | 391 |
| No. of entities with hyperlocal score = 3 | 340 |

Example:

| Field | Value |
|---|---|
| Region | US/Europe |
| Category | Monument |
| Reference Location | US |
| Reference Entity | Statue of Liberty |
| Reference Text | The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor in New York City, United States. |
| Target Location | Gujarat |
| High Cardinality | Yes |
| Hyperlocal Score | 2 |
| Target Entity | Statue of Unity |
| Target Text | The Statue of Unity is a colossal sculpture on the island of Sadhu Bet near Vadodara in the Gujarat, India. |
| Common Questions | (i) Can you name a famous colossal sculpture? (ii) Can you name a famous statue? |
Table 1: The statistics of LoFTI dataset and an example with all its metadata.
Figure 3: Illustrates the global distribution of the reference entities and the spread of target entities in India.

2.4 Common Question Generation

In addition to the reference and target text pairs, LoFTI also contains questions that capture common aspects shared by $T_{\text{ref}}$ and $T_{\text{tar}}$. Given a pair of texts $(T_{\text{ref}}, T_{\text{tar}})$, we generate these questions by identifying shared properties or descriptions of the entities mentioned in the text pair. We use few-shot prompting with the Mixtral-8x7b-instruct-v0.1.Q4_K_M model for common question generation; the prompt used is given in Figure A1.

2.5 Human Annotators

To ensure the correctness of the LoFTI dataset, all the generations were carefully checked by human annotators at each stage. These annotators represent diverse demographics and have knowledge about samples from different geographic and hyperlocal regions. Each sample undergoes verification by three annotators. Guidelines used by the human annotators at each stage are detailed in Appendix A.1.

3 Properties of LoFTI Dataset

LoFTI consists of factual texts that are localized from a non-Indian reference location to a location in India. The reference locations are spread across the globe, mainly in USA/Europe. The target locations are spread across India covering different regions. Figure 3 shows the distribution of entities across reference and target locations.

Table 1 presents salient statistics of LoFTI and an example with all its metadata detailed below.

  • Region: The region of the reference location.

  • Category: The category of the entity in the factual text.

  • Reference Location: A non-Indian location.

  • Reference Entity: An entity from the reference location.

  • Reference Text: Factual text about the reference entity.

  • Target Location: A location in India.

  • True Target Entity: An example of a correct localization of the reference entity in the target location.

  • True Target Text: A localized factual text of the true target entity.

  • Hyperlocal Score: The degree of hyperlocality within the Indian context. The dataset includes three hyperlocality scores: 1, 2, and 3. These scores correspond to the target locations ‘India,’ ‘any state in India,’ and ‘any city in India,’ respectively.

  • High Cardinality: Cardinality denotes the potential count of replaceable entities for the reference entity within the target location. A high cardinality suggests there are many such replaceable entities. This feature is denoted by ’yes’ or ’no’ values.

  • Common Questions: Questions extracted from the reference and the target texts.
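The metadata fields listed above can be pictured as a single record type. The following is a minimal sketch, populated with the example from Table 1; the class name `LoFTIRecord` and the field names are illustrative assumptions and may not match the column names of the released dataset.

```python
from dataclasses import dataclass
from typing import List

# Hyperlocal score -> granularity of the target location, as defined above.
HYPERLOCAL_LEVELS = {1: "country", 2: "state", 3: "city"}


@dataclass
class LoFTIRecord:
    # Field names are hypothetical; the released dataset may name them differently.
    region: str
    category: str
    reference_location: str
    reference_entity: str
    reference_text: str
    target_location: str
    target_entity: str
    target_text: str
    hyperlocal_score: int
    high_cardinality: bool
    common_questions: List[str]

    def __post_init__(self) -> None:
        # Only the three hyperlocality levels described in the text are valid.
        if self.hyperlocal_score not in HYPERLOCAL_LEVELS:
            raise ValueError(f"hyperlocal_score must be 1, 2, or 3, got {self.hyperlocal_score}")


# The example row from Table 1.
example = LoFTIRecord(
    region="US/Europe",
    category="Monument",
    reference_location="US",
    reference_entity="Statue of Liberty",
    reference_text=(
        "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island "
        "in New York Harbor in New York City, United States."
    ),
    target_location="Gujarat",
    target_entity="Statue of Unity",
    target_text=(
        "The Statue of Unity is a colossal sculpture on the island of Sadhu Bet "
        "near Vadodara in the Gujarat, India."
    ),
    hyperlocal_score=2,
    high_cardinality=True,
    common_questions=[
        "Can you name a famous colossal sculpture?",
        "Can you name a famous statue?",
    ],
)

print(HYPERLOCAL_LEVELS[example.hyperlocal_score])  # state
```
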

Category Distribution.

The dataset consists of 99 unique categories which can be grouped into 10 domains namely Entertainment, Buildings/Monuments/Companies, Food & Lifestyle, Professions, Nature, Finance & Economy, Sports, Incidents, Places & Landmarks, and Others. The category clusters and the category distribution are shown in Table A5 and Figure A3, respectively.

4 Evaluation Metrics

4.1 Entity Correctness

To evaluate entity correctness of a generated target text, the human annotator checks if the entity present in the target text is correctly localized to a target location given the reference entity in the reference text. Note that there can be multiple correct localized entities for a given target location. If the entity localization is correct, then a score of 1 is assigned, else it is 0. Thus, for each generated target text $T^i$, a binary score $E^i$ is assigned. Across $N$ generated target text sequences, the entity correctness metric is computed as $\textbf{EC} = \frac{1}{N}\sum_{i=1}^{N} E^i$.
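As a small illustration of the formula above, the sketch below averages binary annotator judgements. The function name and the five scores are made up for the example and are not taken from LoFTI.

```python
from typing import Sequence


def entity_correctness(scores: Sequence[int]) -> float:
    """Mean of the binary entity-correctness judgements E^i over N generated texts."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("scores must be 0 or 1")
    return sum(scores) / len(scores)


# Hypothetical annotations for five generated target texts.
ec = entity_correctness([1, 0, 1, 1, 0])
print(ec)  # 0.6
```
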

4.2 Common Question Correctness

For each target text with $\text{EC} = 1$, the common questions present in LoFTI are further used to evaluate the localization capability of the model. Human evaluators check if the target text correctly answers the common questions given the target location. Each question is evaluated separately and receives a binary score of 1 if it is answered correctly and 0 otherwise.

For a generated target text $T^i$, let the number of predefined common questions be $m_i$ and the binary scores for these questions be $\{C_j^i\}_{j=1}^{m_i}$. Then, the common question correctness metric across $N$ texts is calculated as $\textbf{CQ} = \frac{1}{\sum_{i=1}^{N} m_i}\sum_{i=1}^{N}\sum_{j=1}^{m_i} C_j^i$. This metric aggregates the scores across all questions for all target texts, providing an overall measure of the model’s effectiveness in generating contextually accurate and relevant responses.
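The CQ formula pools all questions before dividing, so texts with more questions carry more weight than in a per-text average. The sketch below uses made-up scores to show the difference. It assumes that a text with an incorrect entity contributes zeros for all of its questions, as in the human row of Table 5.

```python
from typing import Sequence


def common_question_correctness(per_text_scores: Sequence[Sequence[int]]) -> float:
    """Pooled fraction of correctly answered common questions over all texts."""
    total_questions = sum(len(scores) for scores in per_text_scores)
    if total_questions == 0:
        raise ValueError("need at least one question")
    total_correct = sum(sum(scores) for scores in per_text_scores)
    return total_correct / total_questions


# Hypothetical scores C_j^i: three texts with 2, 2, and 3 common questions.
# The second text is assumed to have EC = 0, so both of its questions score 0.
scores = [[1, 1], [0, 0], [1, 0, 1]]

cq = common_question_correctness(scores)
# For contrast only: an unweighted mean of per-text accuracies (not the paper's metric).
per_text_mean = sum(sum(s) / len(s) for s in scores) / len(scores)

print(round(cq, 3), round(per_text_mean, 3))
```
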

4.3 Factual Correctness

For each target text $T^i$ with $\text{EC} = 1$, the human annotator checks if every detail in the text is factually correct and provides a binary score $F^i = 1$ if every fact is correct, else $F^i = 0$. The factual correctness metric across $N$ texts is calculated as $\textbf{FC} = \frac{1}{N}\sum_{i=1}^{N} F^i$.
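The following sketch shows one way to read the definition above: factual correctness is judged only for texts whose entity is correct, but the average still runs over all $N$ texts. Treating a text with an incorrect entity as $F^i = 0$ is an assumption taken from the human row of Table 5. The scores are made up.

```python
from typing import Sequence


def factual_correctness(entity_ok: Sequence[int], facts_ok: Sequence[int]) -> float:
    """Mean of F^i over all N texts, with F^i forced to 0 when E^i = 0 (assumed gating)."""
    if not entity_ok or len(entity_ok) != len(facts_ok):
        raise ValueError("need two non-empty score lists of equal length")
    gated = [e * f for e, f in zip(entity_ok, facts_ok)]
    return sum(gated) / len(gated)


# Hypothetical judgements for four texts. The second text has a wrong entity,
# so its factual score is not counted even though its details were marked correct.
fc = factual_correctness(entity_ok=[1, 0, 1, 1], facts_ok=[1, 1, 0, 1])
print(fc)  # 0.5
```

Under this reading FC can never exceed EC, which is consistent with every row of Table 2.
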

5 Models and Approaches

5.1 Models

We evaluate the performance of two state-of-the-art LLMs on LoFTI: Mixtral (Jiang et al., 2024) and GPT-4 (OpenAI, 2023). The Mixtral-8x7B LLM (Jiang et al., 2024) is a pre-trained generative sparse mixture-of-experts model. It has a decoder-only architecture with its feedforward block selecting from a set of 8 distinct groups of parameters. For our analysis, we utilize the quantized Mixtral model Mixtral-8x7b-instruct-v0.1.Q4_K_M with zero-shot prompting (the quantized Mixtral-8x7b models Q6 and Q8 gave similar performance to Q4). Interestingly, we observed that few-shot prompting did not improve performance compared to the zero-shot setting and adding more localization examples appeared to confuse the model. We also evaluate the performance of the state-of-the-art GPT-4 model on LoFTI. We use the same prompt for both Mixtral and GPT-4 (detailed in Appendix A.7).

5.2 Approaches

Mixtral + RARR.

LLM generations, while fluent, are known to be prone to hallucinations and factual inaccuracies. To address this, Gao et al. (2022) proposed RARR (Retrofit Attribution using Research and Revision), an attribution mechanism that leverages external evidence from the web to validate and edit LLM-generated text while aiming to maintain the original style of the output. We utilize RARR to factually correct the generations produced by Mixtral.

RARR consists of three modules: (i) Question Generation Module, (ii) Evidence Retrieval Module, and (iii) Editor Module. The Question Generation Module formulates questions from the text to be edited and the Evidence Retrieval Module queries these questions on the web for factual evidence. While querying, the target location of the text is appended to the start of each question to extract evidence relevant to that location. The retrieval module also checks if the text to be edited disagrees with the evidence. The Editor Module then utilizes all the disagreed evidence to make factual edits to the text. We employ the Mixtral-8x7b-instruct-v0.1.Q4_K_M model in both the Question Generation and Editor Modules. As in the original RARR pipeline, we utilize Microsoft Bing for evidence retrieval. We adhere to the RARR pipeline, except for one detail. We aggregate all the evidence obtained for all the generated questions and make a single edit, whereas RARR makes edits for each question individually. We found that sequential editing increased the text context and disrupted the style. Making a single edit helped maintain the text length and style better.
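The single-edit variant described above can be sketched as a short control-flow skeleton. The four callables are hypothetical stand-ins for the Question Generation Module, the web search, the agreement check, and the Editor Module; none of them is the actual Mixtral or Bing interface. Returning the text unchanged when no evidence disagrees is also an assumption.

```python
from typing import Callable, List


def rarr_single_edit(
    text: str,
    target_location: str,
    generate_questions: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    disagrees: Callable[[str, str], bool],
    edit: Callable[[str, List[str]], str],
) -> str:
    """Collect conflicting evidence for all questions, then edit the text once."""
    conflicting: List[str] = []
    for question in generate_questions(text):
        # The target location is placed at the start of each web query.
        query = f"{target_location} {question}"
        for evidence in search(query):
            if disagrees(text, evidence):
                conflicting.append(evidence)
    # Assumption: with no conflicting evidence the text is returned unchanged.
    if not conflicting:
        return text
    return edit(text, conflicting)


# Toy stand-ins, loosely based on the Cactus example in Table 6.
queries_sent: List[str] = []


def toy_questions(text: str) -> List[str]:
    return ["When was Cactus formed?"]


def toy_search(query: str) -> List[str]:
    queries_sent.append(query)
    return ["Cactus was formed in 1992."]


def toy_disagrees(text: str, evidence: str) -> bool:
    return "1992" not in text


def toy_edit(text: str, evidence: List[str]) -> str:
    return text.replace("2006", "1992")


draft = "Cactus is a heavy metal band from Kolkata, formed in 2006."
revised = rarr_single_edit(
    draft, "Kolkata", toy_questions, toy_search, toy_disagrees, toy_edit
)
print(revised)
```

Calling the editor once on the pooled evidence, rather than once per question, is the only structural difference from a sequential editing loop in this sketch.
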

Mixtral Revised.

To improve the factual accuracy of the Mixtral generations, we propose a revised version (henceforth referred to as Mixtral Revised). Motivated by RARR, we use the Question Generation and Evidence Retrieval Modules as discussed in Section 5.2. However, we replace the Editor Module with a Re-generation Module, which filters the evidence and re-generates the text using the Mixtral-8x7b-instruct-v0.1.Q4_K_M model. The evidence retrieved by the Evidence Retrieval Module is filtered for relevance to the context and then added to the localized text transfer prompt to obtain a more factually correct re-generation. This approach focuses on improving the factual correctness of the entity generated by Mixtral while preserving the style.
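A rough sketch of the re-generation flow described above follows. The prompt template in `build_prompt` is hypothetical and is not the prompt from the paper's appendix, and all callables are illustrative stand-ins rather than real model or search interfaces.

```python
from typing import Callable, List


def build_prompt(reference_text: str, target_location: str, evidence: List[str]) -> str:
    # Hypothetical template: the localized text transfer inputs plus filtered evidence.
    lines = [
        f"Reference text: {reference_text}",
        f"Target location: {target_location}",
        "Evidence:",
    ]
    lines += [f"- {snippet}" for snippet in evidence]
    lines.append("Write the localized text in the style of the reference text.")
    return "\n".join(lines)


def regenerate_with_evidence(
    reference_text: str,
    target_location: str,
    draft: str,
    generate_questions: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str], str],
) -> str:
    """Retrieve evidence for the draft, keep relevant snippets, and re-generate."""
    evidence: List[str] = []
    for question in generate_questions(draft):
        for snippet in search(f"{target_location} {question}"):
            if is_relevant(draft, snippet):
                evidence.append(snippet)
    return generate(build_prompt(reference_text, target_location, evidence))


# Toy run: the "model" simply echoes its prompt so the evidence filter is visible.
output = regenerate_with_evidence(
    reference_text="Avenged Sevenfold is an American heavy metal band formed in 1999.",
    target_location="Kolkata",
    draft="Cactus is a heavy metal band from Kolkata, formed in 2006.",
    generate_questions=lambda text: ["When was Cactus formed?"],
    search=lambda query: ["Cactus was formed in 1992.", "Cacti grow in deserts."],
    is_relevant=lambda draft, snippet: "formed" in snippet,
    generate=lambda prompt: prompt,
)
print(output)
```
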

All the prompts used in the two above-mentioned approaches are detailed in Appendix A.9 and A.10.

6 Experiments and Results

6.1 Comparison of Mixtral and GPT-4 Performance for Localized Text Transfer

Table 2 compares the performance of state-of-the-art models GPT-4 and Mixtral on LoFTI using all three metrics that collectively help evaluate localization and factual transfer. We observe that GPT-4 significantly outperforms Mixtral in EC, CQ, and FC by 17, 14, and 27 percentage points, respectively. As hyperlocal scores increase, both models show a decline in accuracy, indicating the difficulty in recalling knowledge about these locales. Benefiting from training on diverse and larger datasets and tasks, GPT-4 surpasses Mixtral by achieving superior localization even at higher hyperlocal scores, resulting in more accurate outputs.
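The gaps between the two models can be recomputed from the overall rows of Table 2. The short check below treats them as absolute differences in percentage points; the dictionary layout is only for illustration.

```python
# Overall scores copied from Table 2.
table2_overall = {
    "Mixtral": {"EC": 0.63, "CQ": 0.50, "FC": 0.35},
    "GPT-4": {"EC": 0.80, "CQ": 0.64, "FC": 0.62},
}

# Absolute gap (GPT-4 minus Mixtral) in percentage points, per metric.
gaps = {
    metric: round((table2_overall["GPT-4"][metric] - table2_overall["Mixtral"][metric]) * 100)
    for metric in ("EC", "CQ", "FC")
}
print(gaps)  # {'EC': 17, 'CQ': 14, 'FC': 27}
```
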

| Model | # Samples | EC | CQ | FC |
|---|---|---|---|---|
| Mixtral | 1100 | 0.63 | 0.50 | 0.35 |
| Hyperlocal 1 | 369 | 0.72 | 0.58 | 0.41 |
| Hyperlocal 2 | 391 | 0.63 | 0.49 | 0.38 |
| Hyperlocal 3 | 340 | 0.54 | 0.43 | 0.25 |
| GPT-4 | 1100 | 0.80 | 0.64 | 0.62 |
| Hyperlocal 1 | 369 | 0.85 | 0.71 | 0.67 |
| Hyperlocal 2 | 391 | 0.80 | 0.63 | 0.63 |
| Hyperlocal 3 | 340 | 0.75 | 0.59 | 0.56 |
Table 2: Comparison of Mixtral and GPT-4 Performance for Localized Text Transfer.
| Reference Sentence | Target Location | Mixtral | GPT-4 |
|---|---|---|---|
| The Reichenbach Falls are a waterfall cascade of seven steps on the stream called Rychenbach in the Bernese Oberland region of Switzerland. They drop over a total height of about 250 meters (820 ft). | India | The Dudhsagar Falls is a multi-step waterfall cascade on the Mandovi river in the Indian state of Goa, dropping over a total height of about 310 meters (1017 ft). | The Jog Falls are a segmented waterfall on the Sharavathi River located in Karnataka, India. They drop from a height of about 253 meters (830 ft). |
| Masashi Kishimoto is a Japanese manga artist. His manga series, Naruto, which was in serialization from 1999 to 2014, has sold over 250 million copies worldwide in 46 countries as of May 2019. | Maharashtra | Padmashri Rahul Aware is a renowned Marathi cartoonist. His comic series, ’Toofan Tai’, which was in serialization from 2005 to 2016, has sold over 3 million copies within Maharashtra as of May 2021. | Anant Pai was an Indian comic book creator. His comic book series, Amar Chitra Katha, which was launched in 1967, has sold over 100 million copies worldwide in multiple languages. |
| Eric Otto Valdemar Lemming was a Swedish track and field athlete who competed at the 1900, 1906, 1908, and 1912 Olympics in a wide variety of events, which mostly involved throwing and jumping. | Khandra | Surendra Kumar Singh was a renowned Indian athlete who competed in several Athletics competitions at the 1956, 1960, and 1964 Olympics, primarily excelling in events such as shot put and javelin throw. | Milkha Singh was an Indian track and field athlete from Khandra who competed in the 1956, 1960, and 1964 Olympics, specializing in the 400 meters and other sprint events. |

Red represents the error in the text.

Table 3: Examples from different hyperlocal levels to illustrate the limitations of Mixtral and GPT-4 for localized text transfer.

Table 3 shows examples highlighting the limitations of both Mixtral and GPT-4 at different hyperlocality values. For the example from hyperlocal score = 1 (India), we observe that both models localize well but Mixtral tends to make errors in the factual details (e.g., the height of the waterfall). For hyperlocal score = 2 (Maharashtra), Mixtral hallucinates and creates an imaginary entity (“Padmashri Rahul Aware”) while GPT-4 localizes correctly. For hyperlocal score = 3 (Khandra), both models fail to localize the reference entity “Eric Otto Valdemar Lemming” correctly. Mixtral returns an entity from a different category and location (“Surendra Kumar Singh” is a politician from Madhya Pradesh), while GPT-4 returns an entity from the correct category but a different location (“Milkha Singh” is a track-and-field athlete from Chandigarh).

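| Evaluator | Metric | Mixtral (Overall) | Mixtral (1) | Mixtral (2) | Mixtral (3) | Mixtral + RARR (Overall) | Mixtral + RARR (1) | Mixtral + RARR (2) | Mixtral + RARR (3) | Mixtral Revised (Overall) | Mixtral Revised (1) | Mixtral Revised (2) | Mixtral Revised (3) | GPT-4 (Overall) | GPT-4 (1) | GPT-4 (2) | GPT-4 (3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | EC | 0.60 | 0.70 | 0.63 | 0.45 | 0.65 | 0.77 | 0.65 | 0.48 | 0.66 | 0.78 | 0.65 | 0.52 | 0.81 | 0.83 | 0.83 | 0.75 |
| Human | CQ | 0.48 | 0.54 | 0.53 | 0.36 | 0.57 | 0.68 | 0.61 | 0.39 | 0.56 | 0.64 | 0.57 | 0.45 | 0.65 | 0.68 | 0.66 | 0.60 |
| Human | FC | 0.35 | 0.38 | 0.44 | 0.20 | 0.48 | 0.60 | 0.49 | 0.31 | 0.51 | 0.61 | 0.53 | 0.34 | 0.63 | 0.63 | 0.66 | 0.58 |
| GPT-4 | EC | 0.72 | 0.79 | 0.71 | 0.62 | 0.76 | 0.85 | 0.76 | 0.63 | 0.76 | 0.82 | 0.73 | 0.72 | 0.81 | 0.80 | 0.80 | 0.84 |
| GPT-4 | CQ | 0.61 | 0.68 | 0.60 | 0.52 | 0.64 | 0.78 | 0.61 | 0.49 | 0.61 | 0.68 | 0.59 | 0.54 | 0.66 | 0.70 | 0.63 | 0.63 |
| GPT-4 | FC | 0.44 | 0.54 | 0.46 | 0.28 | 0.47 | 0.57 | 0.49 | 0.30 | 0.53 | 0.63 | 0.48 | 0.44 | 0.62 | 0.62 | 0.61 | 0.61 |

Column suffixes (Overall, 1, 2, 3) denote the hyperlocal score of the evaluated samples.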
Table 4: Performance of Mixtral, Mixtral + RARR, Mixtral Revised and GPT-4 models for localized text generation on a LoFTI subset using both human and GPT-4 evaluations. The subset consists of 250 randomly sampled examples, with 96, 83, and 71 samples from hyperlocal scores 1, 2, and 3, respectively.
| Field | Value |
|---|---|
| Reference Text | The Free Spirits was an American band credited as the first jazz-rock group. The band also incorporated elements of pop and garage rock. Their first album Out of Sight and Sound was recorded in 1966 and released in 1967. |
| Target Location | Delhi |
| Common Questions | (i) Can you name a band credited as a jazz-rock band? (ii) Can you provide an example of a band which has released albums? |
| True Target Entity | Indian Ocean |
| True Target Sentence | The Indian Ocean is an Indian band credited as jazz-rock fusion group. The band also incorporated elements of folk and classical. Their first album Indian Ocean was recorded in 1992 and released in 1993. |
| Mixtral | The Mystic Moods, a Delhi-based band, was credited as one of the first to experiment with fusion of jazz and rock in India. They also incorporated elements of Indian classical music and pop. Their first album "A New Dawn" was recorded in 1972 and released in 1973. |

| Evaluator | EC | Reason | CQ | Reason | FC | Reason |
|---|---|---|---|---|---|---|
| GPT-4 | 1 | The Mystic Moods is also a music band based in India, and shares a similar category with Indian Ocean, although it is not the exact same band. | [1, 1] | It mentions that ’The Mystic Moods,’ a Delhi-based jazz-rock band and have released an album titled "A New Dawn". | 1 | The Mystic Moods is a Delhi-based jazz-rock band. It accurately describes their origin, genre, and album details, making it relevant and truthful in the context of Delhi. |
| Human | 0 | Entity does not exist. | [0, 0] | EC = 0 | 0 | EC = 0 |
Table 5: An example to illustrate the limitation of GPT-4 as an evaluator for text localization by comparing it with human evaluation.

6.2 Comparison of Models/Approaches for Localized Text Transfer

In Table 4, we compare the performance of Mixtral, Mixtral + RARR, Mixtral Revised, and GPT-4 on a subset of 250 randomly chosen samples from LoFTI using human evaluators. (We restrict this evaluation to a 250-sample subset due to annotation costs. However, the Mixtral and GPT-4 scores in Tables 2 and 4 are very similar, thus affirming that the 250-sample subset of LoFTI is representative of the full set.) Attribution using factual evidence helps Mixtral + RARR in improving Mixtral generations, especially in the CQ and FC metrics, where the scores improve by 9% and 13%, respectively. However, the length of text obtained by RARR attribution is usually more than the original length of the text, and it fails to preserve the style.

Mixtral Revised utilizes factual evidence similar to RARR but regenerates the text instead of editing it. Including factual evidence in the prompt enhances the Mixtral outputs and results in improvements in both FC and CQ. The approach focuses mainly on revising the factual correctness of the text while largely retaining the entity present in it. However, we still see an enhancement in EC as factual evidence provides a richer context for the effective localization of the entity. While both Mixtral Revised and Mixtral + RARR use evidence, the former re-generates the text and the latter edits the text by retaining the entity. Re-generation helps in obtaining a factually correct entity. GPT-4 surpasses all the Mixtral models due to its extensive training and diverse world knowledge. With increasing hyperlocal scores, even with GPT-4, performance degrades. Nonetheless, the revision step in Mixtral Revised significantly improves the scores across all metrics, particularly for regions with a hyperlocal score of 3.

In Table 4, we also analyze the capability of GPT-4 as an evaluator for the task of localized text transfer. Compared to humans, GPT-4 shows a 0.10 to 0.15 increase across all the metrics and models, due to the presence of false positives. Table 5 illustrates this limitation using an example. Mixtral hallucinates and returns the entity “Mystic Moods”. GPT-4 incorrectly claims it is a factually correct localization and assigns a score of 1 for all the metrics. The comparison clearly shows that GPT-4 is not a reliable evaluator for absolute numbers. However, we observe similar overall trends in both human and GPT-4 evaluations. This shows that GPT-4 could be used as an LLM evaluator for localized text transfer to study the trends across models. Table 6 shows a detailed example for all the models discussed.

| Field | Value |
|---|---|
| Reference Text | Avenged Sevenfold is an American heavy metal band from Huntington Beach, California, formed in 1999. The band’s current lineup consists of vocalist M. Shadows. |
| Target Location | Kolkata |
| Common Questions | (i) Name a music band which was formed many years ago? (ii) Which band’s current lineup consists of a vocalist? |

| Model | Text Generated | EC | CQ | FC |
|---|---|---|---|---|
| Mixtral | Cactus is a popular Indian heavy metal band from Kolkata, formed in 2006. The band’s current lineup consists of vocalist Bipul Das, lead guitarist Sidharth Roy, rhythm guitarist Swarnendu Biswas, and drummer Shayan Ghosh. | 1 | [0, 0] | 0 |
| Mixtral + RARR | Cactus is a popular Indian heavy metal band from Kolkata, formed in 1992. The current lineup consists of vocalist Bipul Das, lead guitarist Sidharth Roy, rhythm guitarist Swarnendu Biswas, and drummer Shayan Ghosh. The band has named Deep Purple, Pink Floyd, Led Zeppelin, Iron Maiden and many others as their inspiration. Cactus is one of the first professional Bangla rock bands, with their professional debut occurring on 6th March 1993 at Aban Mahal (CLT), Kolkata. | 1 | [1, 0] | 0 |
| Mixtral Revised | Cactus is a renowned Kolkata heavy metal band, formed in 1992. The band’s current lineup includes vocalists Sidhartha Sankar "Sidhu" Ray and Abhijit "Pata" Barman, guitarists Boidurjyo Chowdhury and Samrat Banerjee, keyboardist Sayantan Chatterjee, bassist Proshanto Mahato, and drummer Arnab Tabla Dasgupta. | 1 | [1, 1] | 1 |
| GPT-4 | Fossils is an Indian rock band from Kolkata, West Bengal, formed in 1998. The band’s current lineup consists of vocalist Rupam Islam. | 1 | [1, 1] | 1 |

Red represents the error in the text, green represents the correct edits and underline represents the extra text generated.

Table 6: An example to illustrate the various text localization approaches: Mixtral, Mixtral + RARR, Mixtral Revised and GPT-4.

6.3 LoFTI as a Benchmark for Localized Question Answering

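| Mixtral | Overall | 1 | 2 | 3 |
|---|---|---|---|---|
| # Samples | 250 | 96 | 83 | 71 |
| # Questions | 447 | 168 | 145 | 134 |
| Human Evaluation: EC | 0.64 | 0.81 | 0.63 | 0.45 |
| Human Evaluation: CQ | 0.63 | 0.79 | 0.60 | 0.45 |
| Human Evaluation: FC | 0.59 | 0.77 | 0.58 | 0.37 |
| GPT-4 Evaluation: EC | 0.77 | 0.83 | 0.77 | 0.69 |
| GPT-4 Evaluation: CQ | 0.71 | 0.79 | 0.70 | 0.62 |
| GPT-4 Evaluation: FC | 0.61 | 0.74 | 0.53 | 0.52 |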
Table 7: LoFTI Dataset Benchmark for Localized Text Generation using Questions

LoFTI can also be used as a benchmark to evaluate localized question answering. Given a target location and a question, the model has to generate text that answers the question while being correctly localized to the given target location. To aid this task, we also provide the reference location and the reference text as an example to guide localization and the style of generation.

Table 7 shows the performance of prompting Mixtral on this benchmark task. Mixtral obtains accuracies of 64%, 63%, and 59% on the EC, CQ, and FC metrics, respectively. Consistent with our previous observations, the model struggles to localize accurately as the hyperlocal score increases. Some examples of Mixtral generations are shown in Table 8. Table 7 also reports the performance of GPT-4 as an evaluator for this benchmark task. GPT-4 nearly matches human evaluation when targeting India as a whole (hyperlocal score = 1), but substantially overestimates most scores for regions with hyperlocal scores of 2 and 3 (FC at hyperlocal score 2 is the one exception). The overall trends of human evaluation are maintained by GPT-4. We also show this comparison on the full LoFTI dataset for the Mixtral model in Table A7.
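To make the evaluator comparison concrete, the following minimal sketch computes the difference between the GPT-4 and human scores for every metric and hyperlocal level. The numbers are copied from Table 7; the function name `evaluator_gap` and the dictionary layout are illustrative choices, not part of the LoFTI release.

```python
# Scores copied from Table 7; columns are overall, then hyperlocal scores 1, 2, 3.
COLUMNS = ["overall", "1", "2", "3"]
HUMAN = {
    "EC": [0.64, 0.81, 0.63, 0.45],
    "CQ": [0.63, 0.79, 0.60, 0.45],
    "FC": [0.59, 0.77, 0.58, 0.37],
}
GPT4 = {
    "EC": [0.77, 0.83, 0.77, 0.69],
    "CQ": [0.71, 0.79, 0.70, 0.62],
    "FC": [0.61, 0.74, 0.53, 0.52],
}


def evaluator_gap(human, gpt4):
    """Return (GPT-4 score - human score) for each metric and column."""
    return {
        metric: {
            col: round(g - h, 2)
            for col, h, g in zip(COLUMNS, human[metric], gpt4[metric])
        }
        for metric in human
    }


gaps = evaluator_gap(HUMAN, GPT4)
for metric, row in gaps.items():
    # A positive gap means GPT-4 scored the generations higher than humans did.
    print(metric, row)
```

A positive value indicates that GPT-4 is more lenient than the human annotators for that cell, so the largest values point to where automatic evaluation is least trustworthy.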

| Reference Text | Target Location | Question | Mixtral | EC | CQ | FC |
| --- | --- | --- | --- | --- | --- | --- |
| Morgan Freeman is an American actor, producer, and narrator known for his distinctive, deep voice. Throughout a career spanning five decades and multiple film genres, he has received numerous accolades, including an Academy Award, a Golden Globe Award, and a Screen Actors Guild Award as well as a nomination for a Tony Award. | Thanjavur | Name an actor who has worked for more than two decades. | Kamal Haasan is a renowned Indian actor who has been active in the film industry for over four decades, featuring in more than 200 films across various languages. | 0 | 0 | 0 |
| Duronto TV is a Bangladeshi Bengali-language satellite and cable children’s television channel, and the first of its kind in the country. It was initially licensed in 2013 as "Renaissance TV" but was later renamed to its current name. | Mumbai | Can you name a children’s television channel that is the first of its kind in a country? | Nickelodeon Sonic is a children’s television channel in India, and it was the first of its kind when it launched in 2001. | 1 | 1 | 0 |
Table 8: Examples of Mixtral generations for the benchmark task of localized question answering.

7 Discussion

Localization.

The LoFTI dataset caters to a very specific style of localization involving factual transfer. Localization can be much broader in scope, extending to different vocabulary choices for measurements (miles vs. meters), daily objects (lift vs. elevator), food (cookie vs. biscuit), etc., depending on the target location. We elaborate on this further in the Limitations section, and we intend to develop LoFTI further to include more forms of localization.

GPT-4 evaluations.

From Table 4, we observe that human and GPT-4 evaluations are most similar for GPT-4 generations. For all other model generations, GPT-4 gives inflated scores for all metrics (particularly EC) compared to the human evaluations. However, the trends in GPT-4 evaluations across models for both EC and FC mimic the trends observed in human evaluations. (This is not as clear for the CQ metric.) This suggests that one could use GPT-4 evaluations (instead of very expensive human evaluations) to observe the trends in scores across multiple models and assess which model performs the best (or worst). The GPT-4 evaluation could be enhanced with retrieval-augmented generation (RAG) techniques to improve its factuality assessments. We leave such enhancements for future work.

8 Related Work

Factual Correction, Transfer and Localization.

Improving the factual accuracy of LM generations is an important problem that has gathered recent interest. Evidence integration, LLM post-editing modules, and Rank-One Model Editing (ROME) are some of the recent techniques used to correct factual errors, but they all struggle with consistency, specificity and generalizability (Thorne and Vlachos, 2021; Cao et al., 2021; Meng et al., 2023). Evaluating factual accuracy is another important problem. FActScore (Min et al., 2023) is a fine-grained measure that decomposes a generation into multiple atomic facts and computes the fraction of facts supported by a knowledge source. This has also been extended to multilingual models (Shafayat et al., 2024). However, all such measures are prone to biases across languages and regions (Mirza et al., 2024). We empirically demonstrate such a regional bias using our LoFTI dataset.

In factual transfer, we also want the text style and intent of the reference text to be preserved, as in standard text style transfer tasks (Jin et al., 2021). ModQGA is a framework that transfers facts without altering style (Balepur et al., 2023). Techniques like inverse prompting (Zou et al., 2021) have been used to improve the generation quality of LLMs for factual transfer. However, LLMs struggle with self-correction, indicating limitations in such intrinsic mechanisms (Huang et al., 2024). The RARR system improves reliability and attribution by correcting unsupported content using external evidence (Gao et al., 2023). Hence, we adopt RARR as one of our approaches to test LoFTI.

Cultural Adaptability and Diversity.

LLMs tend to be geographically biased on various dimensions such as culture, race, language, and politics because their training is dominated by Western/English-centric datasets (Manvi et al., 2024). To address this challenge, CultureLLM uses semantic data augmentation to better represent multiple cultures (Li et al., 2024a, c, b). Another recent study shows that LLMs, when evaluated on the NORMAD dataset, struggle with cultural reasoning across different contexts, showing better adaptability to English-centric cultures compared to those from the Global South (Rao et al., 2024). In our work, we focus on an arguably simpler task of factual transfer across geographical regions for which there is no existing benchmark.

9 Conclusion

This work introduces a new evaluation benchmark LoFTI to test the localization and factual transfer capabilities of LLMs. We attempt to localize factual statements from across the globe to multiple target locations within India spanning different levels of hyperlocality. We establish various baselines (Mixtral, GPT-4, etc.) and multiple benchmark tasks for the different models. We find that GPT-4 struggles with localization at higher levels of hyperlocality (i.e., when localizing to Indian cities), so much so that it cannot be reliably used as an automatic evaluator. We hope LoFTI helps the research community in designing improved localization and factual transfer techniques.

Limitations

The LoFTI dataset is not without its limitations. A few of them are detailed below:

  • GPT-4 is not good at identifying hyperlocal entities and facts about them. Hence, it cannot reliably evaluate whether a localization is correct, and human evaluators are still needed for this check. A possible remedy is to add, for each reference entity and target location, multiple possible target entities and facts about them. We plan to add this to our dataset in the near future, and we hope it will eventually reduce the need for human evaluators to check for correctness.

  • There can be several correct target entities localized to a target location which we refer to as high cardinality. High cardinality can make it hard to make the resulting evaluations precise, especially since some entities can be added in the future with respect to localization.

  • This dataset consists only of factual data. However, localization can take place with respect to actions as well. For example, suppose we are localizing a conversation between a human and a shopkeeper about a special dinner. In the West, this would typically include conversations about buying steaks, lobsters, etc., while in India, the conversation would more likely be about buying spices, rice and chicken. This is a broader style of localization that we intend to explore further as future work.

  • The dataset is designed for localization from different locations in the world to India only. To perform localization to regions outside India, we will need additional annotations. This is also reserved for a future release.

  • LoFTI is entirely in English and does not contain any multilingual localizations. It is possible to use simple translation models to translate the data, but this approach is not robust. This is a significant extension that we also intend to explore as future work.

References

Appendix A Appendix

A.1 Annotation Process and Guidelines

The LoFTI dataset was annotated by humans at various stages of its generation. The annotation was performed by an annotation company in India. The annotators were from diverse locations, occupations, age groups (21-40 yrs), and gender. The following guidelines were provided to the human annotators.

A.1.1 Generation of Entities

  • Entities should cover a diverse set of 99 categories. Examples of categories: Politician, Music Band, Historical Monument, Airline, Web Series, etc.

  • Provide, on average, 10 entity pairs under each category. Note: A reference entity can be repeated, but do not repeat a target entity.

  • Ensure the target entity is sufficiently similar to the reference entity selected. For example, refer to row 1 of Table A1.

  • Ensure the new entities are spread over India and have different hyperlocal scores. For example, refer to rows 2-4 of Table A1.

  • The reference entities of the dataset should be spread across different countries, with 60% from the US/Europe and the remaining 40% from other parts of the world.
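The guidelines above lend themselves to a simple automated sanity check. The sketch below is illustrative only: the record keys (`category`, `reference_region`, `target_entity`), the region tags, and the function name `check_entity_pairs` are assumptions made for this example and do not describe tooling used by the annotators.

```python
from collections import Counter

# Hypothetical region tags; the guidelines only distinguish "US/Europe"
# from "other parts of the world".
WESTERN_REGIONS = {"US", "Europe"}


def check_entity_pairs(pairs):
    """Summarize how a list of entity pairs compares with the guidelines.

    Each pair is a dict with the (assumed) keys 'category',
    'reference_region' and 'target_entity'.
    """
    target_counts = Counter(p["target_entity"] for p in pairs)
    # Target entities must not be repeated; reference entities may be.
    repeated_targets = sorted(t for t, n in target_counts.items() if n > 1)
    pairs_per_category = Counter(p["category"] for p in pairs)
    western = sum(p["reference_region"] in WESTERN_REGIONS for p in pairs)
    return {
        "repeated_targets": repeated_targets,
        "pairs_per_category": dict(pairs_per_category),
        # The guidelines ask for roughly 0.6 here.
        "western_share": western / len(pairs) if pairs else 0.0,
    }


# Toy input; the region tags are assigned by hand for illustration.
pairs = [
    {"category": "Educational Institution", "reference_region": "Other",
     "target_entity": "Indian Institute of Technology, Bombay"},
    {"category": "Educational Institution", "reference_region": "US",
     "target_entity": "Central University of Kerala"},
    {"category": "Educational Institution", "reference_region": "US",
     "target_entity": "Bharathidasan University"},
]
report = check_entity_pairs(pairs)
print(report)
```

Such a report only flags structural issues; judging whether a target entity is sufficiently similar to its reference entity still requires a human annotator.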

| Category | Reference Location | Reference Entity | Target Location | Target Entity | Hyperlocal Score |
| --- | --- | --- | --- | --- | --- |
| Singer | US | Taylor Swift | India | Neha Kakkar Ravi Shankar | 1 |
| Educational Institution | Australia | The University of Melbourne | India | Indian Institute of Technology, Bombay | 1 |
| Educational Institution | Florida | University of Central Florida | Kerala | Central University of Kerala | 2 |
| Educational Institution | Miami | University of Miami | Tiruchirappali | Bharathidasan University | 3 |

Struck-out text marks the incorrect entity.

Table A1: Example to illustrate how to create correct entity pairs for LoFTI dataset.

A.1.2 Correction of Target Sentences

  • Check if the target sentences are factually correct and localized correctly.

  • Altering multiple elements within the target sentence might be necessary to guarantee factual accuracy within the specific domain.

  • Check for fluency, grammar, and vocabulary accuracy in the sentences while eliminating unnecessary symbols or words.

  • Align the structure of the target sentence with that of the reference sentence. Remove or add any additional or missing content/information present in the reference sentence. For example, refer to Table A2.

| Category | Reference Location | Reference Entity | Target Location | Target Entity | Reference sentence | Target sentence |
| --- | --- | --- | --- | --- | --- | --- |
| Automotive company | US | Ford Motor | India | Tata Motors | Ford Motor Company is an American multinational automobile manufacturer headquartered in Dearborn, Michigan, United States. It was founded by Henry Ford and incorporated on June 16, 1903. | Tata Motors Limited is an Indian multinational automotive manufacturing company. [manufacturer headquartered in Mumbai, Maharashtra, India]. It was founded by J. R. D. Tata and incorporated on September 1, 1945. The company sells passenger cars, trucks, vans, coaches, buses, sports cars, construction equipment and military vehicles under the Tata brand. Tata Motors is the largest automobile manufacturer in India with a revenue of over 470 billion Indian rupees. |

[Text in square brackets] is the additional content added, and struck-out text is the additional content that has to be removed.

Table A2: An example to illustrate the annotation process for the target sentence generated for the LoFTI dataset.

A.1.3 Common questions

  • It should be generated based on the common description of the entities in the pairs of text provided.

  • It should be of the type such that it can be asked in any target location and still be valid.

  • It should be free from specific details such as locations, timings, or unique identifiers connected to either event.

  • Remove or correct any incorrect questions present. There should be a minimum of one correct common question for each sample sentence pair. Add more questions if needed.

For common question correction, refer to the example in Table A3.

Reference sentence Target sentence Common questions
Rishi Sunak is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2022. Narendra Modi is an Indian politician who has served as Prime Minister of India and President of the Bharatiya Janata Party since 2014. (i) Can you name a current Prime Minister?
(ii) Who is a well-known politician serving as the head of a major political party?
Poshmark is a social commerce marketplace where users can buy and sell new and secondhand fashion, home goods, and electronics. The platform has over 80 million users, with over 200M available listings. The company is headquartered in Redwood City, California, with offices in Canada, Australia, and India. Meesho is a social commerce marketplace based in India where users can buy and sell new and secondhand fashion, home goods, and electronics. The platform has over 60 million users, with millions of available listings. The company is headquartered in Bengaluru, India, and operates independently. (i) Name a social commerce marketplace ?
(ii) Tell me about a company in the social commerce space?
(iii) Name a social commerce marketplace in California?
(iv) Who operates Poshmark as an independent subsidiary since January 2023?
(v) Where is Meesho headquartered, and do they have any connections to Naver Corporation or headquarters outside of India?
KLVE is a commercial radio station licensed to Los Angeles, California with a Spanish AC format. The station is owned by TelevisaUnivision, and is the flagship station for the Uforia Audio Network. Radio Mango 91.9 FM is a private radio station licensed to Kochi, Kerala with a Malayalam language format. The station is owned by the Malayala Manorama Group and serves as the flagship station for their radio network. (i) What kind of radio station is KLVE?
(ii) Who owns the radio station "Radio Mango 91.9 FM"?
(iii) Can you mention a radio station that is a flagship station for a network?
Mindhunter is an American psychological crime thriller television series created by Joe Penhall, which debuted in 2017, based on the 1995 true-crime book Mindhunter: Inside the FBI’s Elite Serial Crime Unit by John E. Douglas and Mark Olshaker. Kerala Crime Files is a Malayalam-language psychological crime drama web series directed by Ahammed Khabeer, which debuted in 2023. (i) Can you name an American psychological crime thriller television series that debuted in 2017?
(ii) Is there a television series based on a true-crime book that released recently?
(iii) Mention a series focusing on criminal psychology.

Blue represents the correct questions and Red represents the incorrect questions.

Table A3: Examples to illustrate the annotation process for the common question generated for the LoFTI dataset.
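The requirement that a common question names no entity or location can be approximated with a simple term filter. The sketch below is a rough heuristic under stated assumptions: the banned-term list is hand-built from the Poshmark/Meesho pair in Table A3, and the function name `is_common_question` is hypothetical. It does not detect timing details such as years, which the guidelines also exclude.

```python
def is_common_question(question, banned_terms):
    """Return True if the question mentions none of the banned terms.

    banned_terms is an assumed, hand-built list of the entity and location
    names that appear in the reference and target sentences.
    """
    lowered = question.lower()
    return not any(term.lower() in lowered for term in banned_terms)


# Entity and location names taken from the Poshmark/Meesho pair in Table A3.
banned = ["Poshmark", "Meesho", "California", "India", "Naver Corporation"]

questions = [
    "Name a social commerce marketplace?",
    "Tell me about a company in the social commerce space?",
    "Name a social commerce marketplace in California?",
    "Who operates Poshmark as an independent subsidiary since January 2023?",
    "Where is Meesho headquartered, and do they have any connections to "
    "Naver Corporation or headquarters outside of India?",
]

kept = [q for q in questions if is_common_question(q, banned)]
for q in kept:
    print(q)
```

On this pair the filter keeps questions (i) and (ii) and rejects (iii) to (v), which agrees with the correct and wrong questions listed in the few-shot prompt of Figure A1.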

A.2 Human Evaluation Guidelines

The outputs generated by the models were evaluated by humans to assess Entity Correctness, Common Question Correctness, and Factual Correctness. The following guidelines were provided to the human annotators.

A.2.1 Entity Correctness (EC)

  • The entity detected from the sentence should be from the target location.

  • Check if the entity is a correct localization of the reference entity provided.

  • If the entity is an exact match to the true target entity, please mention "Exact match" in the reason.

  • Always provide a reason when the score is 0.

A.2.2 Common Question Correctness (CQ)

  • Each sample will have multiple questions; evaluate each (sample, question) pair separately.

  • For each sample, return the score as a list of 0’s and 1’s with the scores indexed at the question number.

  • Common Question Correctness for all questions should be given a score of 0 if that sample’s entity correctness (EC) is 0.

  • Check if the sentence correctly answers the question for the "target location".

  • Ensure factual correctness in these answers.

  • Always provide a reason when the score is 0.

A.2.3 Factual Correctness (FC)

  • Factual correctness should be given a score of 0 if that sample’s entity correctness (EC) is 0.

  • Assign a score of 1, if the sentence is fully factually correct, else assign a score of 0.

  • If the sentence contains any information that lacks factual evidence online, assign a score of 0.

  • Always provide a reason when the score is 0.

Refer to Table A4, for examples of human evaluation.

Claim ID Category Reference Location Reference Entity Target Location True Target Entity Reference sentence True Target sentence Questions Model Generation
114 Books London A Tale of Two Cities Cheruvathur The House of Blue Mangoes A Tale of Two Cities is a historical novel published in 1859 by Charles Dickens, set in London and Paris before and during the French Revolution. The House of Blue Mangoes is a historical novel published in 2002 by David Davidar, set in Chevathar during the historical events like caste conflicts, World War II, and India’s struggle for independence. (i) What is an example of a historical novel?
(ii) Can you name a novel that is set during the political change happening in the country?
"Kaalam Maarunnu Kaalangal" is a historical novel written in 1916 by C. V. Raman Pillai, set in Kilimanoor (Kerala) and Thiruvithamkoor (Travancore) during the late 18th and early 19th centuries.
255 Educational institution US Johns Hopkins University India Indian Institute of Science Johns Hopkins University is a private research university in Baltimore, Maryland. Founded in 1876, Johns Hopkins was the first U.S. university based on the European research institution model. The Indian Institute of Science is a public research university in Bangalore, Karnataka, India. Established in 1909, the Indian Institute of Science was the first to introduce a master’s programs in engineering. (i) Can you give an example of a research university?
(ii) Can you provide name of a university which introduced something new?
The Indian Institute of Science (IISc) is a public research university in Bangalore, Karnataka. Established in 1909, IISc was modeled after the European research institution and is one of the first institutions of its kind in India.
Evaluation
Claim ID EC Reason FC Reason CQ Reason
114 0 wrong target location 0 EC = 0 [0, 0] EC = 0
255 1 Exact match 0 some details like European are incorrect [1, 0] (ii) The sentence does not answer the question.
Table A4: Examples to illustrate human evaluation.
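The dependency between the three metrics (CQ and FC are forced to 0 whenever EC is 0) can be written down as a short sketch. The function names below are hypothetical, and the aggregation rule is an assumption: EC and FC are averaged per sample and CQ per question, which is suggested by Table 7 reporting sample and question counts separately but is not stated explicitly in the guidelines.

```python
def apply_ec_gate(ec, fc, cq):
    """Force FC and every CQ score to 0 when entity correctness (EC) is 0."""
    if ec not in (0, 1) or fc not in (0, 1) or any(s not in (0, 1) for s in cq):
        raise ValueError("all scores must be 0 or 1")
    if ec == 0:
        return 0, 0, [0] * len(cq)
    return ec, fc, list(cq)


def aggregate(samples):
    """Average the gated scores over (ec, fc, cq_list) triples.

    Assumption: EC and FC are averaged per sample, CQ per question.
    """
    gated = [apply_ec_gate(*s) for s in samples]
    all_cq = [score for _, _, cq in gated for score in cq]
    if not gated or not all_cq:
        raise ValueError("need at least one sample and one question")
    n = len(gated)
    return {
        "EC": sum(ec for ec, _, _ in gated) / n,
        "FC": sum(fc for _, fc, _ in gated) / n,
        "CQ": sum(all_cq) / len(all_cq),
    }


# The two annotated examples from Table A4 (claim IDs 114 and 255).
table_a4 = [(0, 0, [0, 0]), (1, 0, [1, 0])]
scores = aggregate(table_a4)
print(scores)
```

Applying the gate before averaging keeps the aggregate consistent even if an annotator accidentally assigns a nonzero FC or CQ score to a sample whose entity is wrong.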
You are tasked with generating basic questions from common property or common description of the entities in pairs of sentences provided. The goal is to create 2 or more questions such that they can be asked in any location and still be valid. The questions should not have any entity or location mentioned in it. Example: Given the following pair of sentences: (1) Poshmark is a social commerce marketplace where users can buy and sell new and secondhand fashion, home goods, and electronics. The platform has over 80 million users, with over 200M available listings. The company is headquartered in Redwood City, California, with offices in Canada, Australia, and India; (2) Meesho is a social commerce marketplace based in India where users can buy and sell new and secondhand fashion, home goods, and electronics. The platform has over 60 million users, with millions of available listings. The company is headquartered in Bengaluru, India, and operates independently. The correct questions are: (i) Name a social commerce marketplace. (ii) Tell me about a company in the social commerce space. The wrong questions are: (i) Name a social commerce marketplace in California. (ii) Who operates Poshmark as an independent subsidiary since January 2023? (iii) Where is Meesho headquartered, and do they have any connections to Naver Corporation or headquarters outside of India? As shown in the examples, the correct questions should be free from specific details such as locations, timings, or unique identifiers connected to either event. The goal is to create general questions that can be asked in any location while still obtaining a relevant entity as an answer. Keep the questions simple. Now generate only correct questions for the following pair: Sentence 1: Rishi Sunak is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2022. 
Sentence 2: Narendra Modi is an Indian politician who has served as Prime Minister of India, since 2014 and is a member of the Bharatiya Janata Party.
Figure A1: Few-shot prompt for common question generation on Mixtral
You are a localization assistant. Convert the reference entity sentence from English to the Indian domain by replacing the source entity with the target entity. Make the needed modifications in the sentence to make it factually correct for the target entity. Output answers in English using multi-entity localization. Reference entity: Rishi Sunak Reference entity location: UK Reference entity sentence: Rishi Sunak is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2022. Target entity: Narendra Modi Target entity location: India
Figure A2: Prompt for text localization on Mixtral

A.3 Implementation Details

We ran our Mixtral-8x7b-instruct-v0.1.Q4_K_M model experiments on a single NVIDIA DGX A100 GPU. A maximum sequence length of 32768 was used. For GPT-4 experiments we used the gpt-4-turbo version of OpenAI.

A.4 Prompts used for Dataset Creation

The prompt used for text localization is given in Figure A2, and the prompt used for generating common questions from the reference and target text is given in Figure A1.

A.5 Category Distribution of LoFTI dataset

The LoFTI dataset contains 99 diverse categories such as Movies, Accidents, Currency, Sports, etc. The entities are distributed uniformly across categories, with an average of 10 entities per category. Figure A3 shows the distribution of entities across the categories. As shown in Table A5, the categories can be grouped into 10 clusters, namely Entertainment, Professions, Buildings/Monuments/Companies, Food & Lifestyle, Places & Landmarks, Nature, Sports, Incidents, Finance & Economy and Others.

| Category Cluster | Categories | # Samples |
| --- | --- | --- |
| Entertainment | Actor/Actress, TV Serial, Cartoon, Film Festival, Event, Magazine, Mobile App, Movie, Music Band, Radio, Sitcoms, Online Game, Web Series, News Channel, Newspaper, Production House, Awards, Books, Reality Shows, Dance Forms, Musical Instruments, Entertainment & Sports Channel | 274 |
| Professions | Business Tycoon, Comedian, Doctors, Film Director, Lyricist, Journalist, Motivational Speaker, Music Director, Poet, Writer, Singer, Scientist, Painter, Youtuber, Sound Designer, Photographer, Political Figure, Nobel Laureates | 220 |
| Buildings/ Monuments/ Companies | Automotive Company, Company, Airlines, Educational Institution, FMCG Companies, Hospital, Hotel, Library, Temples, Pharmaceutical Companies, Airport, Tech Company, Space Agency, Monument, Railway Company, Museums, Internet Provider | 174 |
| Others | Traditional Attires, Train, Language, Kings & Dynasty, National Symbols, Artificial Satellite, Historical Figure, Festival, Freedom Fighter | 93 |
| Food & Lifestyle | Beverages, Chocolate Brands, Coffee Chain, Cosmetics Brand, Food, Dessert, Shopping Malls, Retail store, E-commerce company | 90 |
| Places & Landmarks | Landmark, Place Name, Haunted Place, Zoo, Amusement Park, National Park, World Heritage Site | 73 |
| Nature | Caves, Forests, Hills & Mountains, Lakes, River, Waterfalls | 60 |
| Sports | Sports, Sportsperson, Olympian, Stadium | 49 |
| Incidents | Historical Event, War, Accidents, Natural Calamity | 40 |
| Finance & Economy | Bank, Currency, Export Goods | 27 |

Table A5: Category Clusters and Categories in LoFTI dataset.
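| Category Cluster | # Samples | Mixtral EC | Mixtral FC | Mixtral CQ | GPT-4 EC | GPT-4 CQ | GPT-4 FC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Entertainment | 274 | 0.57 | 0.27 | 0.44 | 0.75 | 0.57 | 0.56 |
| Professions | 220 | 0.56 | 0.25 | 0.45 | 0.71 | 0.55 | 0.54 |
| Building/Monument/Company | 174 | 0.80 | 0.59 | 0.68 | 0.88 | 0.76 | 0.71 |
| Others | 93 | 0.71 | 0.52 | 0.56 | 0.88 | 0.75 | 0.79 |
| Food & Lifestyle | 90 | 0.67 | 0.35 | 0.51 | 0.83 | 0.70 | 0.65 |
| Places & Landmarks | 73 | 0.81 | 0.46 | 0.63 | 0.93 | 0.75 | 0.73 |
| Nature | 60 | 0.75 | 0.42 | 0.58 | 0.83 | 0.65 | 0.58 |
| Sports | 49 | 0.61 | 0.24 | 0.35 | 0.79 | 0.56 | 0.60 |
| Incidents | 40 | 0.57 | 0.30 | 0.45 | 0.75 | 0.63 | 0.50 |
| Finance & Economy | 27 | 0.85 | 0.63 | 0.67 | 0.74 | 0.66 | 0.63 |
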
Table A6: Category-wise Performance Analysis of Mixtral and GPT-4 Generation
Figure A3: LoFTI dataset category distribution

A.6 Category-wise Performance Analysis of Models

In this section, we compare the performance of Mixtral and GPT-4 outputs across different categories. LoFTI has 99 unique categories, and we have grouped them into 10 category clusters for our analysis.

Table A6 shows that performance varies across category clusters. ‘Professions’, ‘Entertainment’, and ‘Incidents’ obtain among the lowest scores for both Mixtral and GPT-4. These clusters contain diverse entities such as Web Series, Movies, YouTubers, Motivational Speakers, and Accidents, which have higher cardinality and less factual evidence available. Both Mixtral and GPT-4 perform well in clusters like ‘Buildings/Monuments/Companies’, ‘Places & Landmarks’, and ‘Nature’ due to the sufficient amounts of factual evidence available during training.
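The cluster-level comparison can be reproduced from the table with a few lines of code. The sketch below uses the Mixtral EC column of Table A6; the helper names `lowest_clusters` and `weighted_mean` are illustrative, and the sample-weighted mean is only an approximation of the overall score because it is computed from rounded cluster-level values.

```python
# (cluster, number of samples, Mixtral EC) copied from Table A6.
MIXTRAL_EC = [
    ("Entertainment", 274, 0.57),
    ("Professions", 220, 0.56),
    ("Building/Monument/Company", 174, 0.80),
    ("Others", 93, 0.71),
    ("Food & Lifestyle", 90, 0.67),
    ("Places & Landmarks", 73, 0.81),
    ("Nature", 60, 0.75),
    ("Sports", 49, 0.61),
    ("Incidents", 40, 0.57),
    ("Finance & Economy", 27, 0.85),
]


def lowest_clusters(rows, k=3):
    """Return the names of the k clusters with the lowest score."""
    return [name for name, _, _ in sorted(rows, key=lambda r: r[2])[:k]]


def weighted_mean(rows):
    """Sample-weighted mean of the per-cluster scores (approximate)."""
    total = sum(n for _, n, _ in rows)
    return sum(n * score for _, n, score in rows) / total


print(lowest_clusters(MIXTRAL_EC))
print(f"{weighted_mean(MIXTRAL_EC):.3f}")
```

Swapping in another column of Table A6 (for example GPT-4 EC) gives the corresponding ranking for that model and metric.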

A.7 Prompt for localized text transfer

The prompt used for localized text transfer is given in Figure A4. We use the same prompt for both Mixtral and GPT-4 models.

You are a localization assistant. Convert the reference entity sentence from English to the Indian domain by localizing the reference sentence with a similar entity from the target location. Make the needed modifications in the sentence to make it factually correct for the target location. Output answers in English using multi-entity localization. Use the below format. My reference sentence: <reference_claim> Target location: <target_location> Target sentence: <localized_target_sentence> Reason: <reason_for_the_localization> My reference sentence: Rishi Sunak is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2022. Target location: India Target sentence: <fill_your_answer_here> Reason: <fill_your_answer_here>
Figure A4: The prompt used for localized text transfer in Mixtral and GPT-4 models.
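As an illustration of how the template in Figure A4 might be filled for each sample, the sketch below assembles the prompt from a reference sentence and a target location. The instruction text is copied from the figure; the line breaks and the function name `build_transfer_prompt` are assumptions, since the original layout of the prompt is not recoverable from the figure.

```python
INSTRUCTION = (
    "You are a localization assistant. Convert the reference entity sentence "
    "from English to the Indian domain by localizing the reference sentence "
    "with a similar entity from the target location. Make the needed "
    "modifications in the sentence to make it factually correct for the "
    "target location. Output answers in English using multi-entity "
    "localization. Use the below format."
)

# Format description shown to the model before the actual query.
FORMAT_LINES = [
    "My reference sentence: <reference_claim>",
    "Target location: <target_location>",
    "Target sentence: <localized_target_sentence>",
    "Reason: <reason_for_the_localization>",
]


def build_transfer_prompt(reference_sentence, target_location):
    """Fill the Figure A4 template; line breaks are an assumed layout."""
    query_lines = [
        f"My reference sentence: {reference_sentence}",
        f"Target location: {target_location}",
        "Target sentence: <fill_your_answer_here>",
        "Reason: <fill_your_answer_here>",
    ]
    return "\n".join([INSTRUCTION, *FORMAT_LINES, *query_lines])


prompt = build_transfer_prompt(
    "Rishi Sunak is a British politician who has served as Prime Minister of "
    "the United Kingdom and Leader of the Conservative Party since 2022.",
    "India",
)
print(prompt)
```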

A.8 Prompt for localized question answering

The prompt used for localized question answering is given in Figure A5. We use the same prompt for both Mixtral and GPT-4 models.

Given a question and a target location, generate a factually correct sentence such that it answers the given question using an entity from the target location. A reference location and sentence is given as an example. Output the answers in English. Use the below format. Question: <question_to_be_answered> Reference location: <example_reference_location> Reference sentence: <example_reference_sentence> Target location: <target_location> Target sentence: <target_sentence> Reason: <reason_for_the_localization> Question: Can you name a current Prime Minister? Reference location: UK Reference sentence: Rishi Sunak is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2022. Target location: India Target sentence: <fill_your_answer_here> Reason: <fill_your_answer_here>
Figure A5: The prompt used for localized question answering in Mixtral and GPT-4 models.
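Both prompts ask the model to answer with a "Target sentence" field followed by a "Reason" field. A minimal sketch of extracting these fields is shown below, assuming the model follows that format; the function name `parse_response` and the example reply are hypothetical, and real generations may need more defensive handling.

```python
import re


def parse_response(text):
    """Extract the 'Target sentence' and 'Reason' fields from a reply.

    Returns None for a field that is missing from the reply.
    """
    target = re.search(
        r"Target sentence:\s*(.*?)(?=\s*Reason:|\Z)", text, flags=re.DOTALL
    )
    reason = re.search(r"Reason:\s*(.*)", text, flags=re.DOTALL)
    return {
        "target_sentence": target.group(1).strip() if target else None,
        "reason": reason.group(1).strip() if reason else None,
    }


# Hypothetical model reply used only to exercise the parser.
reply = (
    "Target sentence: Narendra Modi is an Indian politician who has served "
    "as Prime Minister of India since 2014.\n"
    "Reason: Both entities are heads of government."
)
parsed = parse_response(reply)
print(parsed["target_sentence"])
print(parsed["reason"])
```

Only the extracted target sentence would then be passed on for EC, CQ, and FC scoring; the reason field is useful mainly for inspecting why a model chose a particular entity.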

A.9 Mixtral + RARR Prompts

The prompts used in the Question Generation module, Evidence Retrieval module (to check whether the evidence agrees/disagrees with the text to be edited), and the Editor module are given in Figures A6, A7 and A8, respectively.

Given a target sentence corresponding to a specific target location, your task is to ask questions about the target entity. Each question should be specific to the target entity and should not contain pronouns such as ’he,’ ’she,’ ’it,’ or ’they.’ The questions should seek relevant information about the target entity, its attributes, actions, or associations with the target location. Additionally, the questions should be structured in a way that the answer contains the target entity and/or the target location. Avoid general questions like ’Who is he?’ or ’Where does he live?’ Instead, focus on extracting detailed insights about the target entity. Ensure that the questions are clear, concise, and relevant to the context of the target sentence. Questions should be able to interrogate the factual information in the claim. Do not generate irrelevant questions based on other entities that have no relation with the target entity or target locations. For example: Target location: Delhi Target sentence: The India Gate is a war memorial made of sandstone located in the heart of New Delhi, India. It is named after the engineer Sir Edwin Lutyens, who designed and built the monument in 1931 to honor the Indian soldiers who died during World War I and the Third Anglo-Afghan War. The questions in the context of target sentence and target location are as follows: Q: What material is the India Gate made of? Q: In which city is the India Gate located? Q: Who is the engineer credited with designing and building the India Gate? Q: When was the India Gate constructed? Target location: India Target sentence: Narendra Modi is an Indian politician who has served as Prime Minister of India and leader of the Bharatiya Janata Party since 2014. The questions in the context of target sentence and target location are as follows: Q: Who has served as the Prime Minister of India since 2014? Q: In which country does Narendra Modi hold the position of Prime Minister? 
Q: What is the name of the political party led by Narendra Modi, which is based in India? Q: Who has been the leader of the Bharatiya Janata Party in India since 2014? Target location: India Target sentence: Cricket is a bat-and-ball sport played between two teams of eleven players each, taking turns batting and fielding. The game occurs over the course of several overs, with each over consisting of six deliveries (pitches) generally made by a player on the fielding team, called the bowler, which a player on the batting team, called the batter, tries to hit with a bat. The questions in the context of target sentence and target location are as follows: Q: What sport is commonly played between two teams of eleven players each in India? Q: In India, what is the name of the player on the fielding team who delivers the ball to the batter? Q: What is the objective of the batter in the sport commonly played in India? Q: In India, what is the term for a single set of deliveries made by a bowler in the sport? Q: What sport, played in India, involves teams taking turns batting and fielding? Target location: Telengana Target sentence: S. S. Rajamouli is an Indian director and screenwriter, known for his work in Telugu industry based in Telengana, India. He is considered one of the leading filmmakers in the Indian film industry, having directed some of the highest-grossing Indian films of all time. His most notable works include the "Telugu-language fantasy action film series", Baahubali and RRR which broke several box office records and gained international recognition. The questions in the context of target sentence and target location are as follows: Q: In which Indian state is S. S. Rajamouli primarily associated with for his filmmaking? Q: What are some of the notable works directed by S. S. Rajamouli in the Telugu film industry? Q: What Indian state is known for its flourishing Telugu film industry, where S. S. Rajamouli has made significant contributions? 
Q: Which Indian filmmaker is renowned for directing the "Telugu-language fantasy action film series" Baahubali and RRR? Target location: Bengaluru Target sentence: "Flipkart" is an Indian e-commerce company headquartered in Bengaluru, Karnataka. Founded by Sachin Bansal and Binny Bansal in 2007, it started as an online bookstore before diversifying into a wide range of product categories, including electronics, fashion, and home goods. With its user-friendly interface, extensive product offerings, and competitive pricing, Flipkart has emerged as one of India’s leading e-commerce platforms, revolutionizing the way millions of people shop online in the country. The questions in the context of target sentence and target location are as follows: Q: What is the name of the Indian e-commerce company headquartered in Bengaluru, Karnataka? Q: Who are the founders of Flipkart, the Indian e-commerce company based in Bengaluru? Q: In which Indian city is Flipkart headquartered? Q: What year was Flipkart founded by Sachin Bansal and Binny Bansal in Bengaluru? Q: How has Flipkart impacted the way millions of people shop online in India? Target location: West Bengal Target sentence: Prosenjit Chatterjee is a renowned Bengali actor, whose career has been marked by critical acclaim in his early life, followed by personal challenges and a resurgence in popularity and commercial success in his later years. The questions in the context of target sentence and target location are as follows:
Figure A6: Mixtral + RARR: The prompt used for generating questions from the sentence and target location for evidence retrieval.
I will check some things you said. 1. You said: Your nose switches back and forth between nostrils. When you sleep, you switch about every 45 minutes. This is to prevent a buildup of mucus. It’s called the nasal cycle. 2. I checked: How often do your nostrils switch? 3. I found this article: Although we don’t usually notice it, during the nasal cycle one nostril becomes congested and thus contributes less to airflow, while the other becomes decongested. On average, the congestion pattern switches about every 2 hours, according to a small 2016 study published in the journal PLOS One. 4. Reasoning: The article said the nose’s switching time is about every 2 hours, and you said the nose’s switching time is about every 45 minutes. 5. Therefore: This disagrees with what you said. 1. You said: The Little House books were written by Laura Ingalls Wilder. The books were published by HarperCollins. 2. I checked: Who published the Little House books? 3. I found this article: These are the books that started it all – the stories that captured the hearts and imaginations of children and young adults worldwide. Written by Laura Ingalls Wilder and published by HarperCollins, these beloved books remain a favorite to this day. 4. Reasoning: The article said the Little House books were published by HarperCollins and you said the books were published by HarperCollins. 5. Therefore: This agrees with what you said. 1. You said: Real Chance of Love was an American reality TV show. Season 2 of the show was won by Cali, who chose to be with Chance. 2. I checked: Who won season 2 of Real Chance of Love? 3. I found this article: Real Chance of Love 2: Back in the Saddle is the second season of the VH1 reality television dating series Real Chance of Love. Ahmad Givens (Real) and Kamal Givens (Chance), former contestants on I Love New York are the central figures. 4. Reasoning: The article doesn’t answer the question and you said that Cali won season 2 of Real Chance of Love. 5. 
Therefore: This is irrelevant to what you said. 1. You said: The Havel-Hakimi algorithm is an algorithm for converting the adjacency matrix of a graph into its adjacency list. It is named after Vaclav Havel and Samih Hakimi. 2. I checked: What is the Havel-Hakimi algorithm? 3. I found this article: The Havel-Hakimi algorithm constructs a special solution if a simple graph for the given degree sequence exists, or proves that one cannot find a positive answer. This construction is based on a recursive algorithm. The algorithm was published by Havel (1955), and later by Hakimi (1962). 4. Reasoning: The article said the Havel-Hakimi algorithm is for constructing a special solution if a simple graph for the given degree sequence exists and you said the Havel-Hakimi algorithm is for converting the adjacency matrix of a graph. 5. Therefore: This disagrees with what you said. 1. You said: "Time of My Life" is a song by American singer-songwriter Bill Medley from the soundtrack of the 1987 film Dirty Dancing. The song was produced by Michael Lloyd. 2. I checked: Who was the producer of "(I’ve Had) The Time of My Life"? 3. I found this article: On September 8, 2010, the original demo of this song, along with a remix by producer Michael Lloyd , was released as digital files in an effort to raise money for the Patrick Swayze Pancreas Cancer Resarch Foundation at Stanford University. 4. Reasoning: The article said that a demo was produced by Michael Lloyd and you said "Time of My Life" was produced by Michael Lloyd. 5. Therefore: This agrees with what you said. 1. You said: Tiger Woods is the only player who has won the most green jackets. He has won four times. The Green Jacket is one of the most coveted prizes in all of golf. 2. I checked: What is the Green Jacket in golf? 3. I found this article: The green jacket is a classic, three-button, single-breasted and single-vent, featuring the Augusta National Golf Club logo on the left chest pocket. 
The logo also appears on the brass buttons. 4. Reasoning: The article said the Green Jacket is a classic three-button single-breasted and single-vent and you said the Green Jacket is one of the most coveted prizes in all of golf. 5. Therefore: This is irrelevant to what you said. 1. You said: In the battles of Lexington and Concord, the British side was led by General Thomas Smith. 2. I checked: Who led the British side in the battle of Lexington and Concord? 3. I found this article: Interesting Facts about the Battles of Lexington and Concord. The British were led by Lieutenant Colonel Francis Smith. There were 700 British regulars. 4. Reasoning: The article said the British side was led by Lieutenant Colonel Francis Smith and you said the British side was led by General Thomas Smith. 5. Therefore: This disagrees with what you said. 1. You said: Prosenjit Chatterjee is a renowned Bengali actor, whose career has been marked by critical acclaim in his early life, followed by personal challenges and a resurgence in popularity and commercial success in his later years. 2. I checked: West Bengal: What type of recognition has marked Prosenjit Chatterjee’s early life in his film career? 3. I found this article: June 4, 2023 National recognition and accolades did not lure Prosenjit away from West Bengal’s entertainment industry. He began this year with a stellar performance in Kaushik Ganguly’s period thriller, Kaberi Antardhan, shot against the backdrop of the Naxalite movement and the Emergency. 4. Reasoning:
Figure A7: Mixtral + RARR: The prompt used by RARR (Gao et al., 2022) for checking the agreement of the retrieved evidence for editing.
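Each few-shot example in the agreement prompt ends with a fixed "Therefore: This ... what you said." sentence, so the model's completion can be mapped to one of three labels. The following is a minimal sketch of such a parser, not the authors' implementation; the function name `parse_verdict` and the fallback label "unknown" are assumptions introduced here for illustration.

```python
import re

# Matches the closing sentence used by every example in the agreement prompt.
_VERDICT_RE = re.compile(
    r"Therefore:\s*This\s+(disagrees with|agrees with|is irrelevant to)\s+what you said",
    re.IGNORECASE,
)

# Hypothetical short labels for the three verdicts.
_LABELS = {
    "disagrees with": "disagrees",
    "agrees with": "agrees",
    "is irrelevant to": "irrelevant",
}


def parse_verdict(completion: str) -> str:
    """Return 'agrees', 'disagrees', 'irrelevant', or 'unknown' for a completion."""
    match = _VERDICT_RE.search(completion)
    if match is None:
        # The model did not follow the expected format.
        return "unknown"
    return _LABELS[match.group(1).lower()]
```

A completion that does not contain the expected sentence is reported as "unknown" rather than guessed, so malformed outputs can be inspected separately.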
This task involves processing a claim by attributing it based on a set of evidences. The aim is to refine the initial claim into an attributed claim that incorporates insights from all provided evidences. Instructions: 1. Identify the main entity discussed in the provided claim. Carefully review all associated evidences. Note that the evidences may or may not be relevant to the main entity of the claim. 2. Determine the relevance of each piece of evidence to the main entity in the claim. Synthesize the factual information from relevant evidences to assess how they support, refute, or modify the initial claim. 3. Generate an attributed claim that effectively integrates the initial claim with the relevant evidences, ensuring that the main entity of the claim remains unchanged, especially in the context of any irrelevant evidence. 4. Do not include unnecessary evidence sentences in the modified claim which were not present in the original claim. You are required to check only the factual correctness of the claim without adding extra information to the claim. Example: Claim: Tata Motors is an Indian multinational automobile manufacturing company headquartered in Mumbai, Maharashtra, India. It was established in 1954. Evidences: 1. Mahindra & Mahindra Limited (M&M) is an Indian multinational automotive manufacturing corporation headquartered in Mumbai. It was established in 1945 as Mahindra & Mohammed and later renamed Mahindra & Mahindra. 2. Tata Motors was founded in 1945, as a locomotive manufacturer. Tata Group entered the commercial vehicle sector in 1954 after forming a joint venture with Daimler-Benz of Germany in which Tata developed a manufacturing facility in Jamshedpur for Daimler lorries. Attributed Claim: Tata Motors is an Indian multinational automobile manufacturing company headquartered in Mumbai, Maharashtra, India. It was established in 1945. 
Claim: Feluda is a detective novel written by renowned Bengali actor Sandip Ray, first published in West Bengal in 1965 by Ananda Publishers. The book has been adapted into a film and several television series. Evidences: 1. Feluda is an Indian-Bengali detective media franchise created by Indian-Bengali film director and writer Satyajit Ray, featuring the character, Feluda. 2. In 1965, at the age of 44, soon after the release of his landmark film Charulata, Satyajit Ray wrote the first draft of a short story, which featured a young boy, barely into his teens, describing the superlative analytical and detection powers of his older cousin brother." Attributed Claim: "Feluda is a detective novel written by renowned Bengali author Satyajit Ray, first published in West Bengal in 1965 by Ananda Publishers. The book has been adapted into a film and several television series. Claim: Leonardo DiCaprio won his first Oscar for Best Actor for his role in the film ’Titanic’ in 1996. Evidences: 1. Leonardo DiCaprio has been nominated for the Best Actor Oscar multiple times, beginning with his role in ’What’s Eating Gilbert Grape’ in 1993. 2. DiCaprio’s performance in ’The Revenant’ was universally acclaimed, and he won the Academy Award for Best Actor in 2016, which was his first Oscar win. 3. Leonardo DiCaprio is an active environmentalist who has donated millions to conservation efforts. Attributed Claim: Leonardo DiCaprio won his first Oscar for Best Actor for his role in ’The Revenant’ in 2016, after several nominations for other films including his first for ’What’s Eating Gilbert Grape.’ Claim: Avengers: Endgame was released worldwide in April 2018 and became the highest-grossing film of all time by surpassing ’Titanic’. Evidences: 1. Avengers: Endgame was released in April 2019. It quickly garnered acclaim for its dramatic conclusion of the Infinity Saga." 2. 
In July 2019, ’Avengers: Endgame’ surpassed ’Avatar’ to become the highest-grossing film ever, a record it held until ’Avatar’ reclaimed the title after a re-release." 3. The soundtrack for ’Avengers: Endgame’ was composed by Alan Silvestri, who also composed music for ’Back to the Future.’" Attributed Claim: Avengers: Endgame was released worldwide in April 2019 and became the highest-grossing film of all time by surpassing ’Avatar’ in July of that year, although ’Avatar’ later reclaimed the top spot. For this claim and evidences, generate the attributed claim as instructed. Claim: Prosenjit Chatterjee is a renowned Bengali actor, whose career has been marked by critical acclaim in his early life, followed by personal challenges and a resurgence in popularity and commercial success in his later years. Evidences: 1. June 4, 2023 National recognition and accolades did not lure Prosenjit away from West Bengal’s entertainment industry. He began this year with a stellar performance in Kaushik Ganguly’s period thriller, Kaberi Antardhan, shot against the backdrop of the Naxalite movement and the Emergency. Attributed Claim:
Figure A8: Mixtral + RARR: The prompt used for the non-sequential editing of the text.

A.10 Mixtral Revised Prompts

The prompt used for verifying the relevance of the evidence for the target context is given in Figure A9. The text re-generation prompt of the Mixtral Revised model is given in Figure A10.

Given a claim, query and an evidence, check the following: (i) if the evidence ANSWERS the query and (ii) if the claim INCORRECTLY answers the query, make this judgement based only on the evidence. If both the conditions are satisfied, then return a score 1 else return a score 0. Also provide a reason for your score. For example, Claim: Revathy is a renowned Indian actress and humanitarian, who has won several accolades including two National Film Awards and three Filmfare Awards. Query: Kerala: What are some of the accolades won by Revathy, the Indian actress from Kerala, including National Film Awards and Filmfare Awards? Evidence: She has won several accolades, including three National Film Awards , and six Filmfare Awards South. She has also won the Kerala State Film Award for Best Actress for her performance in Bhoothakaalam (2022). Early life Revathi was born as Asha Kelunni Nair in Cochin (present-day Kochi) to Malank Kelunni Nair, a major in the Indian Army , who hails from Palakkad, and Lalitha Kelunni who hails from a Palakkad Tamil family. When she was in school, she took part in a fashion show. The score and reason are: Score: 1 Reason: The evidence answers the query and the claim claim incorrectly answers the query based on my knowledge from the evidence. The evidence said Revathy has won three National Film Awards, six Filmfare Awards South, and Kerala State Film Award for Best Actress but the claim said two National Film Awards and three Filmfare Awards. Claim: Tata Motors is an Indian multinational automobile manufacturer headquartered in Mumbai, Maharashtra. It was founded by Jamsetji Tata and established the company on August 1, 1945. Query: India: Which industry does Tata Motors operate in, as a prominent player in the Indian market? Evidence: Tata Motors has established itself as a leading player in the Indian automotive market, enjoying a substantial market share and a strong customer base. 
Competitive advantage in low-cost production: With a low-cost labor base in India, Tata Motors has a competitive advantage in producing economical segment vehicles. This advantage allows the company to target not only the Indian market but also other emerging markets, leading to substantial profits. Innovation and research and development: Tata Motors is known for its excellent innovation and research and development efforts in the automotive sector. The score and reason are: Score: 0 Reason: The evidence answers the query but the claim correctly answers the query based on my knowledge from the evidence. The evidence said that Tata Motors is a prominent player in India’s automotive market and Tata Motors is an Indian multinational automobile manufacturer headquartered in Mumbai, Maharashtra. Now answer for the below sample, Claim: Prosenjit Chatterjee is a renowned Bengali actor, whose career has been marked by critical acclaim in his early life, followed by personal challenges and a resurgence in popularity and commercial success in his later years. Query: West Bengal: What type of recognition has marked Prosenjit Chatterjee’s early life in his film career? Evidence: June 4, 2023 National recognition and accolades did not lure Prosenjit away from West Bengal’s entertainment industry. He began this year with a stellar performance in Kaushik Ganguly’s period thriller, Kaberi Antardhan, shot against the backdrop of the Naxalite movement and the Emergency. The score and reason are:
Figure A9: Mixtral Revised: The prompt used for filtering the evidences that are relevant to the entity in the text and for the target location.
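The prompt in Figure A9 asks the model to return "Score: 0/1" followed by "Reason: ...". A minimal sketch of how such completions could be parsed and used to select evidences is shown below. The names `Judgement`, `parse_judgement`, and `select_evidences` are hypothetical, and the choice to keep only score-1 evidences is an assumption about the downstream use of the score.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Judgement:
    score: int   # 1 if the evidence answers the query and contradicts the claim
    reason: str  # free-text justification returned by the model


def parse_judgement(completion: str) -> Optional[Judgement]:
    """Extract the score and reason; return None if no valid score is found."""
    score_match = re.search(r"Score:\s*([01])\b", completion)
    if score_match is None:
        return None
    reason_match = re.search(r"Reason:\s*(.*)", completion, re.DOTALL)
    reason = reason_match.group(1).strip() if reason_match else ""
    return Judgement(score=int(score_match.group(1)), reason=reason)


def select_evidences(evidences: List[str], completions: List[str]) -> List[str]:
    """Keep evidences whose completion parsed to score 1 (assumed selection rule)."""
    kept = []
    for evidence, completion in zip(evidences, completions):
        judgement = parse_judgement(completion)
        if judgement is not None and judgement.score == 1:
            kept.append(evidence)
    return kept
```

Completions without a parsable score are dropped here; a stricter pipeline might instead re-query the model.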
You are a localization assistant. Convert the reference entity sentence from English to the Indian domain by localizing the reference sentence with a similar entity from the target location. Make the needed modifications in the sentence to make it factually correct for the target location with the help of evidences. Output answers in English using multi-entity localization. Provide a reason for your localization. Keep the word count of the generated sentence almost the same as the reference sentence. Examples: My reference sentence: Ford Motor Company is an American multinational automobile manufacturer headquartered in Dearborn, Michigan, United States. It was founded by Henry Ford and incorporated on June 16, 1903. Target location: India Evidences: [ "1: Tata motors were founded by J. R. D. Tata.", "2: Tata Motors was founded in 1945, as a locomotive manufacturer. Tata Group entered the commercial vehicle sector in 1954 after forming a joint venture with Daimler-Benz of Germany in which Tata developed a manufacturing facility in Jamshedpur for Daimler lorries."] The target sentence and the reason are: Target sentence: Tata Motors Limited is an Indian multinational automobile manufacturer headquartered in Mumbai, Maharashtra, India. It was founded by J. R. D. Tata and incorporated in 1945. Reason: Tata Motors Limited is a good localization for Ford Motor Company in the Indian context. From the evidences, it is a multinational automobile manufacturer headquartered in Mumbai and it was founded by J. R. D. Tata and incorporated in 1945. My reference sentence: A train derailment occurred on February 3, 2023, at 8:55 p.m. EST, when 38 cars of a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States. 
Target location: Andhra Pradesh Evidences: [ "1: On 29 October 2023, around 7:00 pm, the collision occurred on the Howrah–Chennai main line after Visakhapatnam-Palasa Express service train stopped due to a break in an overhead cable when it was hit by an oncoming passenger train travelling from Visakhapatnam in Andhra Pradesh, to Rayagada in Odisha, derailing its three carriages, in the Vizianagaram district of Andhra Pradesh, India. The collision occurred between Kantakapalli and Alamanda railway stations resulting in severe damage to three coaches of the Palasa passenger, and the locomotive and two coaches of the Rayagada passenger. At least 14 people were killed and 50 others were injured as a result."] The target sentence and the reason are: Target sentence: A train derailment occurred on October 29, 2023, around 07:00 p.m. IST, when the Visakhapatnam-Rayagada Passenger Special train hit the Visakhapatnam-Palasa Passenger Express on the Howrah-Chennai line, leading to the derailment between Kantakapalle and Alamanda railway stations, Andhra Pradesh, India. Reason: An example of train derailment in Andra Pradesh would be Vizianagaram train derailment. The localization was done by replacing the location, date, and time details of the reference sentence with those from the provided evidence related to a train derailment in Andhra Pradesh, India. My reference sentence: Robert John Downey Jr. is an American actor. His career has been characterized by critical success in his youth, followed by a period of substance abuse and legal troubles, and a surge in popular and commercial success later in his career. Target location: West Bengal Evidences: ["1: Prosenjit Chatterjee (born 30 September 1962) is an Indian actor and producer. He is widely regarded as one of the leading actors in modern Bengali cinema. He predominantly works in Bengali cinema . 
He is the son of veteran Bollywood actor Biswajit Chatterjee.", "2: June 4, 2023 National recognition and accolades did not lure Prosenjit away from West Bengal’s entertainment industry. He began this year with a stellar performance in Kaushik Ganguly’s period thriller, Kaberi Antardhan, shot against the backdrop of the Naxalite movement and the Emergency.", "3: Prosenjit Chatterjee began his film career in West Bengal’s entertainment industry with critical acclaim in the 1980s. He received national recognition and accolades for his roles in period thriller Kaberi Antardhan, shot against the backdrop of the Naxalite movement and the Emergency, released this year. After facing personal challenges, he has experienced a resurgence in popularity and commercial success in recent years."] The target sentence and the reason are:
Figure A10: Mixtral Revised: The prompt used for re-generating text with the help of the retrieved evidence.
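The final block of the re-generation prompt follows a fixed layout: reference sentence, target location, a numbered list of evidences, and the closing cue. The sketch below shows one way to assemble that block. The helper names `format_evidences` and `build_query` are hypothetical, and the exact whitespace and line breaks are assumptions, since the figure does not preserve them.

```python
from typing import List


def format_evidences(evidences: List[str]) -> str:
    """Render evidences as a bracketed list of numbered, quoted strings."""
    items = ", ".join(
        f'"{index}: {text}"' for index, text in enumerate(evidences, start=1)
    )
    return f"[{items}]"


def build_query(reference: str, target_location: str, evidences: List[str]) -> str:
    """Build the query block appended after the few-shot examples."""
    return (
        f"My reference sentence: {reference}\n"
        f"Target location: {target_location}\n"
        f"Evidences: {format_evidences(evidences)}\n"
        "The target sentence and the reason are:"
    )
```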

A.11 GPT-4 Evaluation of Mixtral for the full LoFTI Dataset

We also analyze the performance of GPT-4 as an evaluator for localized text transfer on the full LoFTI dataset. In Table A7, we compare the human and GPT-4 evaluations on the Mixtral model for the full dataset. Similar to our observation on the 250-sample subset (Table 4), GPT-4 closely aligns with human evaluation for regions with a hyperlocal score of 1 but significantly overestimates scores for regions with hyperlocal scores of 2 and 3. Despite this, GPT-4 maintains the overall trends observed in human evaluation.

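| Mixtral | Overall | Hyperlocal score 1 | Hyperlocal score 2 | Hyperlocal score 3 |
|---|---|---|---|---|
| # Samples | 1100 | 369 | 391 | 340 |
| # Questions | 447 | 168 | 145 | 134 |
| Human Evaluation | | | | |
| EC | 0.63 | 0.72 | 0.63 | 0.54 |
| CQ | 0.50 | 0.58 | 0.49 | 0.43 |
| FC | 0.35 | 0.41 | 0.38 | 0.25 |
| GPT-4 Evaluation | | | | |
| EC | 0.71 | 0.76 | 0.71 | 0.67 |
| CQ | 0.58 | 0.65 | 0.56 | 0.52 |
| FC | 0.42 | 0.52 | 0.42 | 0.33 |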
Table A7: Comparison of human and GPT-4 evaluation on Mixtral outputs on the full LoFTI dataset.
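The gap between the two evaluators can be read directly off Table A7. The short sketch below computes the per-column difference (GPT-4 minus human) for each metric and checks whether both evaluators rank the three hyperlocal scores in the same order. The values are copied from Table A7; the variable and function names are illustrative only.

```python
# Columns: Overall, hyperlocal score 1, 2, 3 (values copied from Table A7).
HUMAN = {
    "EC": [0.63, 0.72, 0.63, 0.54],
    "CQ": [0.50, 0.58, 0.49, 0.43],
    "FC": [0.35, 0.41, 0.38, 0.25],
}
GPT4 = {
    "EC": [0.71, 0.76, 0.71, 0.67],
    "CQ": [0.58, 0.65, 0.56, 0.52],
    "FC": [0.42, 0.52, 0.42, 0.33],
}


def gaps(human, gpt4):
    """GPT-4 score minus human score, per metric and column, rounded to 2 places."""
    return {
        metric: [round(g - h, 2) for h, g in zip(human[metric], gpt4[metric])]
        for metric in human
    }


def same_trend(human, gpt4):
    """True if both evaluators order hyperlocal scores 1 to 3 identically per metric."""
    def ranking(values):
        # Indices 1..3 are the hyperlocal columns; sort them by descending score.
        return sorted(range(1, 4), key=lambda i: -values[i])
    return all(ranking(human[m]) == ranking(gpt4[m]) for m in human)


for metric, diffs in gaps(HUMAN, GPT4).items():
    print(metric, diffs)
print("same ordering across hyperlocal scores:", same_trend(HUMAN, GPT4))
```

GPT-4 scores higher than the human annotators in every cell, and both evaluators order the three hyperlocal scores identically for each metric.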