
Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator

Data curation is the first, and arguably the most important, step in the pretraining and continuous training of large language models (LLMs) and small language models (SLMs). NVIDIA recently announced the open-source release of NVIDIA NeMo Curator, a data curation framework that prepares large-scale, high-quality datasets for pretraining generative AI models. 

NeMo Curator, which is part of NVIDIA NeMo, offers out-of-the-box workflows to download and curate data from various public sources, such as Common Crawl, Wikipedia, and arXiv. It also gives developers the flexibility to customize data curation pipelines to address their unique requirements and create custom datasets.

This post walks you through creating a custom data curation pipeline using NeMo Curator. Doing so enables you to:

  • Customize the data curation pipeline to fit the specific needs of your generative AI project.
  • Ensure data quality by applying rigorous filters and deduplication to train your model with the best possible dataset.
  • Protect privacy by identifying and removing personally identifiable information (PII) and adhering to data protection regulations.
  • Streamline development by automating the curation process, saving time and resources so you can focus on solving your business-specific problems.

Overview

This tutorial focuses on creating a simple data curation pipeline that can download, process, and filter the TinyStories dataset. TinyStories is a dataset of around 2.2 million short stories generated by GPT-3.5 and GPT-4, featuring English words that are understood by 3- to 4-year-olds. It is publicly available on Hugging Face. To learn more about the dataset, see TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine. The dataset is split into training and validation files. This tutorial primarily uses the validation file, which contains about 22,000 records.

Defining the data curation pipeline involves the following high-level steps:

  1. Define custom document builders that can:
    • Download the dataset from the web and convert it to the JSONL format.
    • Iterate through the dataset and extract each document.
  2. Define custom modifiers to clean and unify the text data.
  3. Filter the dataset using predefined, as well as user-defined, heuristics.
  4. Deduplicate the dataset and remove identical records.
  5. Redact all personally identifiable information (PII) from the dataset.
  6. Output the results in the JSONL format.

The execution of this curation pipeline should take less than 5 minutes on consumer-grade hardware, and the curated dataset should contain about 21,500 records. To access the complete code for this tutorial, visit NVIDIA/NeMo-Curator on GitHub.

Prerequisites

Before starting, you must install the NeMo Curator framework. Follow the instructions in the project’s NeMo Curator GitHub README file to install the framework. Then run the following commands from the terminal to verify the installation and to install the additional dependencies needed to follow along.

$ python -c "import nemo_curator; print(nemo_curator);"
$ pip3 install requests

Defining custom document builders

To support working with arbitrary datasets, NeMo Curator provides a set of document builders that abstract away the representation of the underlying dataset, including:

  • DocumentDownloader: an abstract class for downloading remote data to disk.
  • DocumentIterator: an abstract class for reading dataset raw records from the disk.
  • DocumentExtractor: an abstract class for extracting text records, as well as any relevant metadata from the records on the disk.

Several implementations of these classes, for working with datasets such as Common Crawl, Wikipedia, and arXiv, are available on the NVIDIA/NeMo-Curator GitHub repo. The following sections show how to implement each of these abstract classes to work with the TinyStories dataset.

Downloading the TinyStories dataset

First, implement the DocumentDownloader class, which takes the URL of the dataset’s validation split and downloads it using the requests library. 

import os

import requests
from nemo_curator.download.doc_builder import DocumentDownloader

class TinyStoriesDownloader(DocumentDownloader):
    def __init__(self, download_dir: str):
        super().__init__()

        if not os.path.isdir(download_dir):
            os.makedirs(download_dir)

        self._download_dir = download_dir
        print("Download directory: ", self._download_dir)

    def download(self, url: str) -> str:
        filename = os.path.basename(url)
        output_file = os.path.join(self._download_dir, filename)

        if os.path.exists(output_file):
            print(f"File '{output_file}' already exists, skipping download.")
            return output_file

        print(f"Downloading TinyStories dataset from '{url}'...")
        response = requests.get(url)

        with open(output_file, "wb") as file:
            file.write(response.content)

        return output_file

Next, download the actual dataset using the following code:

# Download the TinyStories dataset to the specified directory.
# TINY_STORIES_URL points to the dataset's validation split on Hugging Face.
downloader = TinyStoriesDownloader("/path/to/download/")
tinystories_fp = downloader.download(TINY_STORIES_URL)
# Convert the downloaded plain-text file to JSONL files in jsonl_dir, using the
# write_jsonl helper defined later in this tutorial.
write_jsonl(tinystories_fp, jsonl_dir)

The dataset will download as a plain text file. To parse this dataset, implement the DocumentIterator and DocumentExtractor classes. This will enable you to convert it to the JSONL format (one of the formats that NeMo Curator supports). 

Iterating and extracting text from the dataset

In the downloaded file, each record (or story) spans several lines, and records are separated by the <|endoftext|> token. The DocumentIterator class defines an iterate function that takes the path of the file to iterate over and yields each record in that file, in the form of the record’s raw text and (optionally) any relevant metadata. Although adding metadata to each record is not mandatory, some data processing algorithms (such as deduplication) rely on such data to uniquely identify each document and correctly perform their intended function.
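
For reference, the raw file is laid out roughly as follows, with each story’s lines followed by the separator token (the story text shown here is abridged and purely illustrative, not actual dataset content):

Once upon a time, there was a little girl who ...
She smiled, and they all went home happy.
<|endoftext|>
One day, a boy found a shiny stone in the garden ...
<|endoftext|>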

Next, implement the iterator for the TinyStories dataset. Given that each story can span several lines, define the iterator function such that it would keep reading (and storing) each line in the file, until it reaches the separator token. 

Once a separator is reached, concatenate all the lines seen so far, tack on some metadata to the record, and yield the result. To ensure records are uniquely identifiable, use the dataset’s filename, as well as an internal counter to create the unique id and (optionally) filename metadata included with each record:

import os

from nemo_curator.download.doc_builder import DocumentIterator

class TinyStoriesIterator(DocumentIterator):
    SEPARATOR_TOKEN = "<|endoftext|>"

    def __init__(self):
        super().__init__()
        self._counter = -1

    def iterate(self, file_path):
        self._counter = -1
        file_name = os.path.basename(file_path)

        with open(file_path, "r") as file:
            example = []

            def split_meta(example):
                if example:
                    self._counter += 1
                    content = " ".join(example)
                    meta = {
                        "filename": file_name,
                        "id": f"{file_name}-{self._counter}",
                    }

                    return meta, content

            for line in file:
                if line.strip() == TinyStoriesIterator.SEPARATOR_TOKEN:
                    if example:
                        yield split_meta(example)
                        example = []
                else:
                    example.append(line.strip())

            if example:
                yield split_meta(example)

The last remaining document builder to implement is the DocumentExtractor class, which simply returns the text for each record. Note that you may optionally associate some metadata for the extracted text, but the usage of this metadata is beyond the scope of this tutorial.

from typing import Tuple

from nemo_curator.download.doc_builder import DocumentExtractor

class TinyStoriesExtractor(DocumentExtractor):
    def extract(self, content: str) -> Tuple[dict, str]:
        # No metadata for the text, just the content.
        return {}, content

Writing the dataset to the JSONL format

NeMo Curator provides helpers that can load datasets from the disk in JSONL, Parquet, or Pickle formats. Given the popularity of the JSONL format, this section demonstrates the conversion of the raw text dataset to this format using the iterator and extractor classes previously implemented.

To convert the dataset to JSONL, simply point the TinyStoriesIterator instance to the downloaded plain text file, iterate through each record, and extract entries using the TinyStoriesExtractor instance. Create a JSON object from each record (story) and write it to a single line in an output file. This procedure is straightforward:

import os
import json

def write_jsonl(input_filename: str, output_dir: str, dump_every_n: int = 10000):
    basename = os.path.basename(input_filename)
    # Make sure the output directory exists before writing any files.
    os.makedirs(output_dir, exist_ok=True)
    iterator = TinyStoriesIterator()
    extractor = TinyStoriesExtractor()
    to_dump = []
    dump_ctr = 0

    def dump_to_file(to_dump, dump_ctr):
        """Helper function to facilitate dumping to file."""
        output_filename = f"{basename}-{dump_ctr}.jsonl"
        with open(os.path.join(output_dir, output_filename), "w") as output_file:
            output_file.writelines(to_dump)
        # Empty out the list and increment the counter.
        return [], dump_ctr + 1

    for item in iterator.iterate(input_filename):
        record_meta, content = item
        extracted = extractor.extract(content)

        if extracted is None:
            continue

        text_meta, text = extracted

        if text is None:
            continue

        line = {
            "text": text,
            **text_meta,
            **record_meta,
        }
        json_out = json.dumps(line, ensure_ascii=False)
        to_dump.append(json_out + "\n")

        # Should we dump what we have so far?
        if len(to_dump) == dump_every_n:
            to_dump, dump_ctr = dump_to_file(to_dump, dump_ctr)

    # Dump the remaining records.
    if to_dump:
        dump_to_file(to_dump, dump_ctr)

Note that by default, this function creates one JSONL file for every 10,000 records. While entirely optional, this is to ensure that each output file remains small enough for easy manual inspection using a text editor, without consuming too much memory.

Also note that the content of each story is written into the text field of each JSON object. Many data curation operations throughout NeMo Curator need to know which field inside each record contains the text data for that record. If not explicitly specified, these operations assume the existence of a text field in the dataset. As such, it is often good practice to always populate the text field for each record with the text data of interest.
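
For illustration, a single line in an output JSONL file looks roughly like the following (the story text is abridged, and the file name and counter value are examples only):

{"text": "Once upon a time, there was a little girl who ...", "filename": "TinyStories-valid.txt", "id": "TinyStories-valid.txt-42"}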

Loading the dataset using the document builders

In NeMo Curator, datasets are represented as objects of type DocumentDataset. This provides helpers to load the datasets from disk in various formats. Having created the dataset in the JSONL format, you can use the following code to load it and start working with it:

from nemo_curator.datasets import DocumentDataset
# define `files` to be a list of all the JSONL files to load
dataset = DocumentDataset.read_json(files, add_filename=True)
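
One minimal way to build the files list referenced above is with the Python standard library. The following is a sketch, where jsonl_dir is assumed to be the output directory that was passed to write_jsonl:

import glob
import os

# Collect every JSONL file produced by write_jsonl.
files = sorted(glob.glob(os.path.join(jsonl_dir, "*.jsonl")))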

You now have everything needed to define a custom dataset curation pipeline and prepare your data for training (or validation) use cases.

Text cleaning and unification

A fundamental operation in data curation pipelines involving text data is text unification and cleaning, as text scraped from online sources may contain inconsistencies or unicode issues. To modify documents, NeMo Curator provides a DocumentModifier interface, which defines how a given text from each document should be modified. The actual modification is done through the Modify helper, which takes a DocumentDataset object along with a DocumentModifier object and applies the modifier to the dataset.

The TinyStories dataset has inconsistent quotation marks: some are curly, while others are straight. Such inconsistencies can produce poor-quality tokens and may cause problems for models trained on this data.

To resolve these, create a DocumentModifier that unifies all single- and double-quotation marks in the documents by replacing all the curly quotation marks with their straight variants:

from nemo_curator.modifiers import DocumentModifier

class QuotationUnifier(DocumentModifier):
    def modify_document(self, text: str) -> str:
        text = text.replace("‘", "'").replace("’", "'")
        text = text.replace("“", '"').replace("”", '"')
        return text

NeMo Curator provides various DocumentModifier implementations out of the box. One such modifier is UnicodeReformatter, which uses ftfy to resolve all unicode issues in the dataset. Next, chain these modifiers together and clean the downloaded dataset. The chaining operation is done through the Sequential class, which takes a list of operations that are to be sequentially performed and applies them to a given DocumentDataset instance:

from nemo_curator import Sequential
from nemo_curator.modules.modify import Modify
from nemo_curator.modifiers.unicode_reformatter import UnicodeReformatter

def clean_and_unify(dataset: DocumentDataset) -> DocumentDataset:
    cleaners = Sequential(
        [
            # Unify all the quotation marks
            Modify(QuotationUnifier()),
            # Unify all unicode
            Modify(UnicodeReformatter()),
        ]
    )
    return cleaners(dataset)

Dataset filtering

Another important step in the dataset curation process is data filtering, where some documents that do not fit certain criteria are discarded. For instance, you might want to discard documents that are too short, too long, or incomplete. At the time of writing, NeMo Curator provides 24 heuristics for natural languages, as well as eight heuristics for coding languages. 

NeMo Curator provides a DocumentFilter interface, which defines a way to score documents based on various criteria, along with a ScoreFilter helper to filter the documents. The ScoreFilter helper takes a DocumentDataset along with a DocumentFilter and determines whether each document in the dataset passes the filtering criteria.

Create a simple DocumentFilter that determines whether a story ends with an end of sentence character. The goal is to discard all stories that do not end with an end of sentence character:

from nemo_curator.filters import DocumentFilter

class IncompleteStoryFilter(DocumentFilter):
    def __init__(self):
        super().__init__()
        self._story_terminators = {".", "!", "?", '"', "”"}

    def score_document(self, text: str) -> bool:
        ret = text.strip()[-1] in self._story_terminators
        return ret

    def keep_document(self, score) -> bool:
        return score

The main functionality is implemented in the score_document and keep_document methods, where False (that is, don’t keep this document) is returned if the document does not end with an end-of-sentence character.

To apply this filter to the dataset, pass an instance of IncompleteStoryFilter to a ScoreFilter object. NeMo Curator provides many DocumentFilter implementations out of the box. These filters can be chained together through the Sequential class. The following code shows how to apply various filters to the dataset:

from nemo_curator import ScoreFilter
from nemo_curator.filters import RepeatingTopNGramsFilter, WordCountFilter

def filter_dataset(dataset: DocumentDataset) -> DocumentDataset:
    filters = Sequential(
        [
            ScoreFilter(
                WordCountFilter(min_words=80),
                text_field="text",
                score_field="word_count",
            ),
            ScoreFilter(IncompleteStoryFilter(), text_field="text"),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
                text_field="text",
            ),
        ]
    )
    return filters(dataset)

This code filters out all short (less than 80 words) or incomplete stories, along with any other stories that have certain ratios of repeating n-grams. Note the usage of text_field="text", which tells the ScoreFilter to pass the contents of the dataset’s text column to each filtering criterion.
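
Because score_field="word_count" is specified for the first filter, the computed word counts are also written back into the dataset, so they can be inspected after filtering. The following is a minimal sketch that reuses the filter_dataset function above; the exact handling of score columns may vary across NeMo Curator versions:

filtered_dataset = filter_dataset(dataset)
# The word_count column added by ScoreFilter can be summarized with ordinary
# Dask DataFrame operations.
print(filtered_dataset.df["word_count"].describe().compute())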

Deduplication

When working with large amounts of text data, there may be records that are identical (or near-identical) to each other. Training on such data may incur additional compute and storage overhead. NeMo Curator provides functionality to find and discard such duplicates. For simplicity, focus on finding exact duplicate records in the dataset. This can be accomplished using the ExactDuplicates class, as shown below. 

This module will automatically leverage existing CUDA devices and the GPU-accelerated implementations from the RAPIDS cuDF library to identify duplicate documents, resulting in much faster processing times. This is because the deduplication stage involves calculating a hash for every document, which is compute-intensive. Each document can be hashed independently, which makes this workload ideal to run in parallel on the GPU.

from nemo_curator.modules import ExactDuplicates

def dedupe(dataset: DocumentDataset) -> DocumentDataset:
    deduplicator = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
    # Find the duplicates
    duplicates = deduplicator(dataset)
    docs_to_remove = duplicates.df.map_partitions(
        lambda x: x[x._hashes.duplicated(keep="first")]
    )
    # Remove the duplicates using their IDs.
    duplicate_ids = list(docs_to_remove.compute().id)
    dataset_df = dataset.df
    deduped = dataset_df[~dataset_df.id.isin(duplicate_ids)]
    return DocumentDataset(deduped)

This specifies that each record’s unique identifier and content are in the id and text columns, respectively. Recall that a unique identifier was assigned to each document during the download and extraction phase. This enables the deduplicator to uniquely identify documents from one another. The deduplicator object returns a set of IDs that it has determined to be duplicates. Simply remove these documents from the dataset.
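
As a quick sanity check, you can compare record counts before and after deduplication. This is a minimal sketch; calling len on the underlying Dask DataFrame triggers a computation:

deduped_dataset = dedupe(dataset)
print(f"Records before deduplication: {len(dataset.df)}")
print(f"Records after deduplication: {len(deduped_dataset.df)}")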

PII redaction

The last processing step discussed in this tutorial is the redaction of personally identifiable information (PII). NeMo Curator facilitates the detection and removal of PII using the PiiModifier class, which is an implementation of the DocumentModifier class previously discussed. This modifier leverages the Presidio framework and enables you to specify which PII to detect, what action to take for each detection, and process the data in batches to accelerate the operation.

The stories in the TinyStories dataset contain many instances of first names. This example intends to detect all such names and replace them with an anonymized token. This can be accomplished using a few lines of code:

from nemo_curator.modifiers.pii_modifier import PiiModifier

def redact_pii(dataset: DocumentDataset) -> DocumentDataset:
    redactor = Modify(
        PiiModifier(
            supported_entities=["PERSON"],
            anonymize_action="replace",
            device="cpu",
        ),
    )
    return redactor(dataset)

The operation takes the entire dataset and returns the modified dataset.
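
For instance, with the replace action, a sentence such as "One day, Lily met Ben in the park." would typically come back with the names swapped for an anonymized token, along the lines of "One day, <PERSON> met <PERSON> in the park." The exact replacement depends on the underlying Presidio configuration.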

Putting the curation pipeline together

Having implemented each step of the curation pipeline, it’s time to put everything together and sequentially apply each operation on the dataset. You can use the Sequential class to chain curation operations together:

curation_steps = Sequential(
    [
        clean_and_unify,
        filter_dataset,
        dedupe,
        redact_pii,
    ]
)
dataset = curation_steps(dataset)
print("Executing the pipeline...")
dataset = dataset.persist()
dataset.to_json("/output/path", write_to_filename=True)

Under the hood, NeMo Curator uses Dask to work with the dataset in a distributed manner. Since Dask operations are lazy-evaluated, it’s necessary to call the .persist function to instruct Dask to apply the operations. Once processing finishes, you can write the dataset to disk in the JSONL format by calling the .to_json function and providing an output path.

Next steps

NeMo Curator supports many advanced data processing and filtering techniques not covered in this tutorial, such as fuzzy deduplication, task identification and decontamination, domain classification, and much more. Check out the collection of data curation examples on GitHub to learn more.

You can also request access to the NVIDIA NeMo Curator microservice, which provides the easiest path for enterprises to get started with data curation from anywhere. It offers streamlined performance and scalability to shorten the time to market.
