Training Llama From Scratch



Training Llama from scratch is resource-intensive but rewarding. With a well-prepared training dataset and properly configured training parameters, the training loop can produce a language model solid enough to be applied to many NLP tasks. The keys to success are careful preprocessing, parameter tuning, and optimization during training.

Unlike many other GPT-style models, Llama is openly available, but training it from scratch still demands substantial compute, thorough preparation, and careful setup. This chapter walks through training Llama from scratch, covering everything from preparing your training dataset to configuring the training parameters and running the training itself.

Llama aims to support a wide range of NLP applications, including text generation, translation, and summarization. Training a large language model from scratch involves three critical steps −

  • Preparing the training dataset
  • Configuring appropriate training parameters
  • Running the training loop with proper optimization in place

We will walk through each step with code snippets and an explanation of what the output means.

Preparing Your Training Dataset

The first and most important step in training any LLM is feeding it a high-quality, diverse, and extensive dataset. Llama needs an extremely large amount of text data to capture the richness of human language.

Collecting Data

Training Llama requires a massive dataset with diverse text samples from a variety of domains. Commonly used datasets for training LLMs include Common Crawl, Wikipedia, BooksCorpus, and OpenWebText.

Example: Download a Dataset

import requests
import os

# Create a directory for datasets
os.makedirs("datasets", exist_ok=True)

# URL to the dataset (placeholder; replace with an actual dataset URL)
url = "https://example.com/openwebtext.zip"
output = "datasets/openwebtext.zip"

# Download the dataset
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
with open(output, "wb") as file:
    file.write(response.content)
print(f"Dataset downloaded and saved at {output}")

Output

Dataset downloaded and saved at datasets/openwebtext.zip
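
Alternatively, many of these corpora are published on the Hugging Face Hub and can be loaded with the datasets library. The sketch below assumes a hosted OpenWebText mirror; the dataset id Skylion007/openwebtext is an example and may change or be unavailable.

from datasets import load_dataset

# Stream a small slice of an OpenWebText-style corpus from the Hugging Face Hub
# (dataset id is an assumption; substitute whichever corpus you are using)
dataset = load_dataset("Skylion007/openwebtext", split="train", streaming=True)
for i, example in enumerate(dataset):
    print(example["text"][:100])  # Preview the first 100 characters of each document
    if i == 2:
        break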

Once you have downloaded your dataset, you will need to preprocess the text data before training. Preprocessing typically involves tokenization, lower-casing, removing special characters, and structuring the data into a consistent format.
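
For illustration, a minimal cleaning pass might look like the sketch below; clean_text is a hypothetical helper, and the exact rules depend on your corpus and tokenizer.

import re

def clean_text(text):
    # Lower-case the text
    text = text.lower()
    # Replace characters outside a basic set of letters, digits, and punctuation
    text = re.sub(r"[^a-z0-9\s\.,;:!?'\"-]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Hello,   WORLD! ©2024"))   # hello, world! 2024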

Example: Preprocessing a Dataset

from transformers import LlamaTokenizer

# Load the pre-trained tokenizer
# token = "your_hf_token"   # Hugging Face access token with Llama access
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

# Load raw text
with open('/content/raw_data.txt', 'r') as file:
    raw_text = file.read()

# Tokenize the text
tokens = tokenizer.encode(raw_text, add_special_tokens=True)

# Save tokens to a file
with open('/tokenized_text.txt', 'w') as token_file:
    token_file.write(str(tokens))
    
print(f"Text tokenized and saved as tokens.")

Output

Text tokenized and saved as tokens.
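
As an optional sanity check, you can decode a few token ids back into text to confirm the tokenization round-trips; this short sketch reuses the tokenizer and tokens defined above.

# Decode a small slice of the token ids back into text
preview = tokenizer.decode(tokens[:20])
print(f"First 20 tokens decode to: {preview!r}")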

Setting Model Training Parameters

Next, we set up the training parameters. These parameters determine how the model learns from the dataset and therefore have a direct influence on its performance.

Main Training Parameters

  • Batch Size − The number of samples processed before the model's weights are updated.
  • Learning Rate − Sets how much to update model parameters based on the loss gradient.
  • Epochs − The number of complete passes the model makes over the whole dataset.
  • Optimizer − The algorithm used to minimize the loss function by updating the model's weights.

We use AdamW as the optimizer together with a warm-up learning-rate scheduler to train Llama.

Example: Training Parameters Configuration

import torch
from torch.optim import AdamW
from transformers import LlamaForCausalLM, get_linear_schedule_with_warmup
# token = "your_hf_token"   # Hugging Face access token with Llama access

# Load the model
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)

model = model.to("cuda") if torch.cuda.is_available() else model.to("cpu")
# Training parameters
epochs = 3
batch_size = 8
learning_rate = 5e-5
warmup_steps = 200

# Set the optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
# num_training_steps should be the total number of optimizer steps
# (epochs * batches per epoch); it is corrected after the DataLoader is built below.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=epochs)
print("Training parameters set.")

Output

Training parameters set.

DataLoader for Batches

Training needs data in batches. This can be done pretty easily with PyTorch's DataLoader.

from torch.utils.data import DataLoader, Dataset

seq_len = 128  # Length of each training sequence, in tokens

# Custom dataset class: splits the token stream into fixed-length sequences
class TextDataset(Dataset):
    def __init__(self, tokenized_text):
        self.data = tokenized_text
    def __len__(self):
        return len(self.data) // seq_len
    def __getitem__(self, idx):
        return self.data[idx * seq_len : (idx + 1) * seq_len]

with open("/tokenized_text.txt", 'r') as f:
    tokens_str = f.read()
tokens = eval(tokens_str)  # Turn the saved string back into a list of token ids

# DataLoader definition; the identity collate_fn keeps each batch as a list of sequences
train_data = TextDataset(tokens)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True,
                          collate_fn=lambda batch: batch)

print(f"DataLoader created with batch size {batch_size}.")

Output

DataLoader created with batch size 8.
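
Because scheduler.step() is called once per batch in the training loop below, num_training_steps should equal the total number of optimizer steps. Now that train_loader exists, one way (a small sketch) is to recreate the scheduler with the correct total.

# Recreate the scheduler now that the number of batches per epoch is known
total_steps = epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
print(f"Scheduler configured for {total_steps} total training steps.")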

With the training parameters and the data loading procedure in place, it is time to move on to the actual training stage.

Training the Model

All of that preparation comes together in the training loop. Training amounts to feeding the model data in batches and updating its parameters based on the loss.

Running the Training Loop

The loop below iterates over the DataLoader epoch by epoch: each batch is padded to a common length, passed through the model, and the resulting loss is used to update the model's weights.

import tqdm

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.train()  # Put the model into training mode
    total_loss = 0
    for batch in tqdm.tqdm(train_loader):
        # Convert each sequence in the batch to a tensor on the target device
        batch = [torch.tensor(seq, device=device) for seq in batch]

        # Pad all sequences in the batch to the same length
        max_len = max(len(seq) for seq in batch)
        padded_batch = torch.zeros((len(batch), max_len), dtype=torch.long, device=device)
        for i, seq in enumerate(batch):
            padded_batch[i, :len(seq)] = seq

        # Forward pass: the inputs also serve as the labels for causal LM training
        outputs = model(padded_batch, labels=padded_batch)
        loss = outputs.loss

        # Backward pass
        optimizer.zero_grad()  # Reset gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update model parameters
        scheduler.step()       # Update learning rate

        total_loss += loss.item()  # Accumulate loss

    print(f"Epoch {epoch + 1} completed. Loss: {total_loss:.4f}")

Output

Epoch 1 completed. Loss: 424.4011
Epoch 2 completed. Loss: 343.4245
Epoch 3 completed. Loss: 328.7054
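
The value printed after each epoch is the sum of the per-batch losses, so the steady drop across epochs indicates that the model is learning. To make the number easier to interpret, you can average the last epoch's accumulated loss over its batches and convert it to perplexity; a brief sketch:

import math

# Average the last epoch's accumulated loss over the number of batches
avg_loss = total_loss / len(train_loader)
perplexity = math.exp(avg_loss)  # Perplexity = exp(average cross-entropy loss)
print(f"Average loss: {avg_loss:.4f}, perplexity: {perplexity:.2f}")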

Saving the Model

Once training is complete, save the model; otherwise you would have to retrain it every time you want to use it.

# Save the trained model
model.save_pretrained('trained_Llama_model')
print("Model saved successfully.")

Output

Model saved successfully.
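
It is also convenient to save the tokenizer alongside the model so both can be reloaded from the same directory later; this optional extra step reuses the tokenizer loaded earlier.

# Save the tokenizer next to the model weights
tokenizer.save_pretrained('trained_Llama_model')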

We have now trained the model and saved it. We can use it to predict new words and generate text; we will look at this in detail in the upcoming chapters.
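
As a quick preview, here is a minimal sketch of reloading the saved model and generating text, assuming the tokenizer was saved to the same directory as shown above; the prompt string is just an example.

from transformers import LlamaForCausalLM, LlamaTokenizer

# Reload the trained model and tokenizer from disk
model = LlamaForCausalLM.from_pretrained('trained_Llama_model')
tokenizer = LlamaTokenizer.from_pretrained('trained_Llama_model')

# Encode a prompt and generate a continuation
inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))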
