-
AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Authors:
Giovanni Puccetti,
Anna Rogers,
Chiara Alzetta,
Felice Dell'Orletta,
Andrea Esuli
Abstract:
Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that…
▽ More
Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic.
We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real "content farm". We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge.
Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection
Authors:
Yuxia Wang,
Jonibek Mansurov,
Petar Ivanov,
Jinyan Su,
Artem Shelmanov,
Akim Tsvigun,
Osama Mohammed Afzal,
Tarek Mahmoud,
Giovanni Puccetti,
Thomas Arnold,
Chenxi Whitehouse,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych,
Preslav Nakov
Abstract:
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual…
▽ More
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian
Authors:
Andrea Esuli,
Giovanni Puccetti
Abstract:
While Italian is a high resource language, there are few Italian-native benchmarks to evaluate Large Language Models (LLMs) generative abilities in this language. This work presents two new benchmarks: Invalsi MATE to evaluate models performance on mathematical understanding in Italian and Invalsi ITA to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, w…
▽ More
While Italian is a high resource language, there are few Italian-native benchmarks to evaluate Large Language Models (LLMs) generative abilities in this language. This work presents two new benchmarks: Invalsi MATE to evaluate models performance on mathematical understanding in Italian and Invalsi ITA to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students of age between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy.
We use these benchmarks to evaluate 9 powerful language models showing that current language models are bound by 70% accuracy in mathematical understanding, achieved by Llama 3 70b and by 85% in language understanding. We also compare LLMs with the average performance of Italian students to show that Llama 3 is the only one to perform better than students on Invalsi MATE while most models outperform students on Invalsi ITA.
We will make data and evaluation code openly available to pave the way for the future development of larger and harder benchmarks to evaluate LLMs' mathematical and linguistic understanding in Italian.
△ Less
Submitted 17 June, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Authors:
Yuxia Wang,
Jonibek Mansurov,
Petar Ivanov,
Jinyan Su,
Artem Shelmanov,
Akim Tsvigun,
Osama Mohanned Afzal,
Tarek Mahmoud,
Giovanni Puccetti,
Thomas Arnold,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych,
Preslav Nakov
Abstract:
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific…
▽ More
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain, and multi-generator corpus of MGTs -- M4GT-Bench. The benchmark is compiled of three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection where one need to identify, which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires an access to the training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.
△ Less
Submitted 27 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Outliers Dimensions that Disrupt Transformers Are Driven by Frequency
Authors:
Giovanni Puccetti,
Anna Rogers,
Aleksandr Drozd,
Felice Dell'Orletta
Abstract:
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude o…
▽ More
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
△ Less
Submitted 22 October, 2022; v1 submitted 23 May, 2022;
originally announced May 2022.