Data Preparation Stage Workflow Analysis
Code repository: rasbt/LLMs-from-scratch
This section explains the complete workflow of the data preparation stage:
raw text → token → token ID → vector
The following breaks down the core content of each stage, step by step.
1. raw text → token Stage
Description
In this stage, raw text is split into tokens (words or symbols) using regular expressions. A unique vocabulary list is then constructed by removing duplicates with set() and sorting with sorted().
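A minimal sketch of this step, using a hypothetical sample_text (the repository's notebook may use a slightly different regular expression):
import re

sample_text = "Hello, world. Is this-- a test?"  # hypothetical sample text

# Split on punctuation, double dashes, and whitespace, keeping the delimiters as tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', sample_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Deduplicate with set() and sort with sorted() to obtain the vocabulary word list
all_words = sorted(set(preprocessed))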
2. token → token ID Stage
Description
A vocabulary is generated using enumerate() combined with a dictionary comprehension:
vocab = {token: integer for integer, token in enumerate(all_words)}
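With this mapping in place, encoding and decoding become dictionary lookups. A minimal sketch, assuming the preprocessed token list and vocab from above:
# token → token ID
token_ids = [vocab[token] for token in preprocessed]

# token ID → token (inverse mapping, used for decoding)
int_to_token = {integer: token for token, integer in vocab.items()}
decoded_tokens = [int_to_token[i] for i in token_ids]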
Notes
When manually building a vocabulary, special tokens are typically added, such as:
- <|unk|>: Represents unknown words not in the vocabulary.
- <|endoftext|>: Marks the end of a text sequence.
These tokens prevent errors when the model encounters out-of-vocabulary (OOV) tokens.
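A minimal sketch of adding these special tokens to a manually built vocabulary (reusing the hypothetical preprocessed list from above; "Bazinga" stands in for any out-of-vocabulary word):
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

# Unknown words fall back to the <|unk|> ID instead of raising a KeyError
unk_id = vocab["<|unk|>"]
token_ids = [vocab.get(token, unk_id) for token in ["Hello", "Bazinga"]]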
However, tokenization algorithms like BPE (Byte Pair Encoding) eliminate the need for <|unk|>: BPE breaks unrecognized tokens into smaller, more common subword units that can always be mapped to the vocabulary.
2.1 BPE Introduction
Description
BPE is a subword-level tokenization algorithm. Its core idea is to break unrecognized tokens into smaller, more frequent sub-tokens, so that every token can be mapped to the vocabulary. In this project, the GPT-2 BPE tokenizer is used as follows:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
Here, gpt2 refers to the tokenizer used by GPT-2, which is built on the BPE algorithm.
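A minimal sketch of how the BPE tokenizer handles an out-of-vocabulary word (the sample string is hypothetical):
text = "Hello, do you like tea? <|endoftext|> someunknownPlace"

# Unknown words are split into known subword units instead of being mapped to <|unk|>
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

# Decoding reassembles the subword units back into the original text
print(tokenizer.decode(ids))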
2.2 Building Dataset and DataLoader
Description
- Dataset: Defines how individual samples are structured (chunks of token IDs as inputs, paired with targets shifted by one token).
- DataLoader: Draws batches from the Dataset (controls batch size, shuffling, drop_last, parallel workers, etc.).
Example code:
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window over the token IDs; targets are the inputs shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader
2.2.1 Sliding Window
Description
The sliding window technique splits long sequence data into multiple smaller segments. For example, if a text contains 1000 tokens and the model supports a maximum input length of 200 tokens, you can set:
max_length = 200
stride = 200
This divides the text into 5 non-overlapping segments, each treated as a training sample. Note that at this stage the DataLoader only prepares and retrieves the data; nothing is fed into an LLM yet.
To retain contextual information between segments, you can use a stride smaller than max_length to create overlapping windows, as in the sketch below.
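A minimal sketch of how the stride controls the overlap between windows, using hypothetical token IDs and values:
token_ids = list(range(10))  # stand-in for real token IDs
max_length, stride = 4, 2    # stride < max_length → overlapping windows

for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one token
    print(input_chunk, "->", target_chunk)
# Consecutive windows share max_length - stride = 2 tokens of context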
3. token ID → Vector Stage
3.1 Token Embedding
Description
An embedding layer maps each token ID to a dense vector:
import torch

vocab_size = 50257  # vocabulary size of the GPT-2 BPE tokenizer
output_dim = 256    # embedding dimension (illustrative value)

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,  # raw_text: the full training text (loaded elsewhere)
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)           # inputs shape: [8, 4]
token_embeddings = embedding_layer(inputs)  # shape: [8, 4, output_dim]
3.2 Position Embedding
Description
If a sequence contains multiple identical tokens, their token embeddings will be identical. However, in natural language, the same word in different positions often carries different meanings. To address this, position embeddings are introduced: each token embedding is combined with a position-specific vector to produce the final input vector. The resulting input embedding has the same dimensionality as the token embedding.
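A minimal sketch, reusing token_embeddings, max_length, and output_dim from the snippet above and assuming a learned absolute position embedding in the GPT-2 style:
context_length = max_length  # number of positions the model sees at once

# One learnable vector per position, with the same dimensionality as the token embeddings
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape: [4, output_dim]

# Broadcasting adds the same position vectors to every sequence in the batch
input_embeddings = token_embeddings + pos_embeddings  # shape: [8, 4, output_dim]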
4. Summary Flowchart
Description
Raw text
↓ Tokenization + Deduplication + Indexing
Tokens → Vocabulary → Token IDs
↓ Pass to embedding layer
Token Embedding + Position Embedding
↓
Final Input Embedding