Data Preparation Stage Workflow Analysis
Code repository: rasbt/LLMs-from-scratch
This section explains the complete workflow of the data preparation stage:
raw text → token → token ID → vector
The following breaks down the core content of each stage, step by step.
1. raw text → token Stage
Description
In this stage, raw text is split into tokens (words or symbols) using regular expressions. A unique vocabulary list is then constructed by removing duplicates with set() and sorting with sorted().
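A minimal sketch of this step, using a hypothetical sample_text (the repository's notebook may use a slightly different regular expression):
import re

sample_text = "Hello, world. Is this-- a test?"  # hypothetical sample text

# Split on punctuation, double dashes, and whitespace, keeping the delimiters as tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', sample_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Deduplicate with set() and sort with sorted() to obtain the vocabulary word list
all_words = sorted(set(preprocessed))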
2. token → token ID Stage
Description
A vocabulary is generated using enumerate() combined with a dictionary comprehension:
vocab = {token: integer for integer, token in enumerate(all_words)}
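With this mapping in place, encoding and decoding become dictionary lookups. A minimal sketch, assuming the preprocessed token list and vocab from above:
# token → token ID
token_ids = [vocab[token] for token in preprocessed]

# token ID → token (inverse mapping, used for decoding)
int_to_token = {integer: token for token, integer in vocab.items()}
decoded_tokens = [int_to_token[i] for i in token_ids]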
Notes
When manually building a vocabulary, special tokens are typically added, such as:
- <|unk|>: Represents unknown words not in the vocabulary.
- <|endoftext|>: Marks the end of a text sequence.
These tokens prevent errors when the model encounters out-of-vocabulary (OOV) tokens.
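A minimal sketch of adding these special tokens to a manually built vocabulary (reusing the hypothetical preprocessed list from above; "Bazinga" stands in for any out-of-vocabulary word):
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

# Unknown words fall back to the <|unk|> ID instead of raising a KeyError
unk_id = vocab["<|unk|>"]
token_ids = [vocab.get(token, unk_id) for token in ["Hello", "Bazinga"]]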
However, tokenization algorithms like BPE (Byte Pair Encoding) eliminate the need for <|unk|>: BPE breaks unrecognized tokens into smaller, more common subword units that can always be mapped to the vocabulary.
2.1 BPE Introduction
Description
BPE is a subword-level tokenization algorithm. Its core idea is to break unrecognized tokens into smaller, more frequent sub-tokens, so that every token can be mapped to the vocabulary. In this project, the GPT-2 BPE tokenizer is used as follows:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
Here, gpt2 refers to the tokenizer used by GPT-2, which is built on the BPE algorithm.
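A minimal sketch of how the BPE tokenizer handles an out-of-vocabulary word (the sample string is hypothetical):
text = "Hello, do you like tea? <|endoftext|> someunknownPlace"

# Unknown words are split into known subword units instead of being mapped to <|unk|>
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

# Decoding reassembles the subword units back into the original text
print(tokenizer.decode(ids))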
2.2 Building Dataset and DataLoader
Description
- Dataset: Defines how individual samples are structured (chunks of token IDs as inputs, paired with targets shifted by one token).
- DataLoader: Draws batches from the Dataset (controls batch size, shuffling, drop_last, parallel workers, etc.).
Example code:
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window over the token IDs; targets are the inputs shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader
2.2.1 Sliding Window
Description
The sliding window technique splits long sequence data into multiple smaller segments. For example, if a text contains 1000 tokens and the model supports a maximum input length of 200 tokens, you can set:
max_length = 200
stride = 200
This divides the text into 5 non-overlapping segments, each treated as a training sample. Note that at this stage the DataLoader only prepares and retrieves the data; nothing is fed into an LLM yet.
To retain contextual information between segments, you can use a stride smaller than max_length to create overlapping windows, as in the sketch below.
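A minimal sketch of how the stride controls the overlap between windows, using hypothetical token IDs and values:
token_ids = list(range(10))  # stand-in for real token IDs
max_length, stride = 4, 2    # stride < max_length → overlapping windows

for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one token
    print(input_chunk, "->", target_chunk)
# Consecutive windows share max_length - stride = 2 tokens of context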
3. token ID → Vector Stage
3.1 Token Embedding
Description
An embedding layer maps each token ID to a dense vector:
import torch

vocab_size = 50257  # vocabulary size of the GPT-2 BPE tokenizer
output_dim = 256    # embedding dimension (illustrative value)

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,  # raw_text: the full training text (loaded elsewhere)
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)           # inputs shape: [8, 4]
token_embeddings = embedding_layer(inputs)  # shape: [8, 4, output_dim]
3.2 Position Embedding
Description
If a sequence contains multiple identical tokens, their token embeddings will be identical. However, in natural language, the same word in different positions often carries different meanings. To address this, position embeddings are introduced: each token embedding is combined with a position-specific vector to produce the final input vector. The resulting input embedding has the same dimensionality as the token embedding.
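A minimal sketch, reusing token_embeddings, max_length, and output_dim from the snippet above and assuming a learned absolute position embedding in the GPT-2 style:
context_length = max_length  # number of positions the model sees at once

# One learnable vector per position, with the same dimensionality as the token embeddings
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape: [4, output_dim]

# Broadcasting adds the same position vectors to every sequence in the batch
input_embeddings = token_embeddings + pos_embeddings  # shape: [8, 4, output_dim]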
4. Summary Flowchart
Description
Raw text
↓ Tokenization + Deduplication + Indexing
Tokens → Vocabulary → Token IDs
↓ Pass to embedding layer
Token Embedding + Position Embedding
↓
Final Input Embedding