 
Training Llama From Scratch
Training Llama from scratch is resource-intensive but rewarding. With a well-prepared training dataset and carefully chosen training parameters, the training loop can produce a language model that is useful for many NLP tasks. The keys to success are thorough preprocessing, parameter tuning, and optimization during training.
Unlike many other GPT-style models, Llama is openly available. Training it from scratch requires substantial compute resources and thorough preparation. This chapter walks through the process, from preparing the training dataset to configuring the training parameters and running the training itself.
Llama is designed to support a wide range of NLP applications, including text generation, translation, and summarization. Training a large language model from scratch involves three critical steps −
- Preparing the training dataset
- Choosing appropriate training parameters
- Managing the training procedure and applying the right optimization
Each step is covered in turn, with code snippets and an explanation of what the output means.
Preparing Your Training Dataset
The first and most important step in training any LLM is assembling a high-quality, diverse, and extensive dataset. Llama needs an extremely large amount of text data to capture the richness of human language.
Collecting Data
Training Llama requires a massive dataset with diverse text samples from a variety of domains. Commonly used datasets for training LLMs include Common Crawl, Wikipedia, BooksCorpus, and OpenWebText.
Example: Download a Dataset
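import os
import requests
# Create a directory for datasets
os.makedirs("datasets", exist_ok=True)
# URL to dataset (placeholder; replace with the real download location)
url = "https://example.com/openwebtext.zip"
output = "datasets/openwebtext.zip"
# Stream the download so a large archive is never held fully in memory
with requests.get(url, stream=True, timeout=60) as response:
    response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
    with open(output, "wb") as file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            file.write(chunk)
print(f"Dataset downloaded and saved at {output}")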
Output
Dataset downloaded and saved at datasets/openwebtext.zip
After downloading the dataset, you need to preprocess the text before training. Preprocessing typically involves tokenization, lower-casing, removing special characters, and arranging the data into the structure the model expects.
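The cleaning steps mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a fixed recipe: the function name `clean_text`, the `lowercase` switch, and the specific rules (Unicode normalization, dropping control characters, collapsing whitespace) are assumptions chosen for the example. Lower-casing is left optional because it discards case information that the tokenizer could otherwise use.

```python
import re
import unicodedata

def clean_text(text: str, lowercase: bool = True) -> str:
    """Normalize raw text before tokenization (illustrative rules only)."""
    # Normalize Unicode so visually identical characters share one form.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and remove lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)

print(clean_text("Hello,\tWORLD!\x00\n\n  Llama   rocks.  "))
```

The cleaned string can then be passed to the tokenizer in place of the raw text.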
Example: Preprocessing a Dataset
import json
from transformers import LlamaTokenizer
# Hugging Face access token (placeholder; the meta-llama repositories are gated)
token = "your_token"
# Load pre-trained tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)
# Load raw text
with open('/content/raw_data.txt', 'r', encoding='utf-8') as file:
    raw_text = file.read()
# Tokenize the text into a list of integer token IDs
tokens = tokenizer.encode(raw_text, add_special_tokens=True)
# Save the token IDs as a JSON list
with open('/tokenized_text.txt', 'w') as token_file:
    json.dump(tokens, token_file)

print("Text tokenized and saved as tokens.")
Output
Text tokenized and saved as tokens.
Setting Model Training Parameters
Next, we set up the training parameters. These parameters control how the model learns from the dataset, so they directly influence its performance.
Main Training Parameters
- Batch Size − The number of samples processed before the model's weights are updated.
- Learning Rate − Controls how much the model parameters change in response to the loss gradient.
- Epochs − The number of complete passes the model makes over the whole dataset.
- Optimizer − The algorithm that minimizes the loss function by adjusting the weights.
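Batch size and epochs together determine how many weight updates a training run performs, which is also the number a learning rate scheduler needs as its total step count. The sketch below shows that arithmetic; the helper name `count_optimizer_steps` and the sample figures are hypothetical, and it assumes one update per batch with the final partial batch kept.

```python
import math

def count_optimizer_steps(num_sequences: int, batch_size: int, epochs: int) -> int:
    """Return the total number of weight updates for a training run."""
    # One update per batch; a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset of 10,000 training sequences.
print(count_optimizer_steps(num_sequences=10_000, batch_size=8, epochs=3))  # 3750
```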
A common choice for training Llama is the AdamW optimizer combined with a learning rate scheduler that includes a warm-up phase.
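To make the warm-up idea concrete, here is a small sketch of a linear warm-up followed by a linear decay to zero, which is the shape a linear warm-up scheduler produces. The function name `linear_warmup_decay` and the step counts are assumptions for illustration; the function returns a multiplier that is applied to the base learning rate.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the learning rate multiplier at a given optimizer step."""
    if step < warmup_steps:
        # Warm-up: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 down to 0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

base_lr = 5e-5
# Assumed run length of 3000 steps with 200 warm-up steps.
for step in (0, 100, 200, 1600, 3000):
    print(step, base_lr * linear_warmup_decay(step, 200, 3000))
```

The multiplier only reaches zero at the end of training if `total_steps` matches the real number of optimizer steps, so that value should come from the dataset size, batch size, and epoch count.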
Example: Training Parameters Configuration
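import torch
from torch.optim import AdamW
from transformers import LlamaForCausalLM, get_linear_schedule_with_warmup
# Hugging Face access token (placeholder)
token = "your_token"
# Load the model. Note: from_pretrained loads existing weights; to start from
# random weights, construct LlamaForCausalLM from a LlamaConfig instead.
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', token=token)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Training parameters
epochs = 3
batch_size = 8
learning_rate = 5e-5
warmup_steps = 200
# Total optimizer steps = epochs * batches per epoch. The batch count is not known
# until the DataLoader exists, so 1000 batches per epoch is an assumed placeholder.
total_steps = epochs * 1000
# Set the optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
print("Training parameters set.")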
Output
Training parameters set.
Dataloader for Batches
Training needs data in batches. This can be done pretty easily with PyTorch's DataLoader.
import json
import torch
from torch.utils.data import DataLoader, Dataset
seq_len = 128  # Tokens per training sequence (assumed value; adjust to your memory budget)
# Custom dataset class that splits the token stream into fixed-length sequences
class TextDataset(Dataset):
    def __init__(self, tokenized_text, seq_len):
        self.data = tokenized_text
        self.seq_len = seq_len
    def __len__(self):
        return len(self.data) // self.seq_len
    def __getitem__(self, idx):
        start = idx * self.seq_len
        return torch.tensor(self.data[start : start + self.seq_len], dtype=torch.long)
with open("/tokenized_text.txt", 'r') as f:
    tokens = json.load(f)  # Parse the saved list of token IDs without using eval
# DataLoader definition
train_data = TextDataset(tokens, seq_len)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
print(f"DataLoader created with batch size {batch_size}.")
Output
DataLoader created with batch size 8.
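The batching idea can be seen without PyTorch: the token stream is first cut into fixed-length sequences, and the sequences are then grouped into batches. The following is a minimal pure-Python sketch; the helper names `chunk_tokens` and `iter_batches` and the toy numbers are hypothetical, and shuffling is omitted for clarity.

```python
from typing import Iterator, List

def chunk_tokens(tokens: List[int], seq_len: int) -> List[List[int]]:
    """Split a flat token list into full sequences of seq_len tokens."""
    num_sequences = len(tokens) // seq_len  # Leftover tokens are dropped.
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(num_sequences)]

def iter_batches(sequences: List[List[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive groups of up to batch_size sequences."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]

toy_tokens = list(range(10))
sequences = chunk_tokens(toy_tokens, seq_len=3)
print(sequences)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(list(iter_batches(sequences, batch_size=2)))
```

Sequence length and batch size are separate settings: the first fixes how many tokens the model sees per example, the second how many examples contribute to each weight update.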
With the training parameters and data loading in place, we can move on to the actual training stage.
Training the Model
All of the preparation comes together in the training loop. Training amounts to feeding the dataset to the model in batches and updating the model's parameters based on the loss.
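The same batch-then-update pattern can be shown on a toy problem that runs in an instant. The sketch below fits a single weight `w` so that `w * x` matches `y`, using mini-batch gradient descent on squared error; the function name `train_toy_model`, the data, and the hyperparameters are all made up for illustration and have nothing to do with Llama's real scale.

```python
def train_toy_model(data, epochs=50, batch_size=2, lr=0.05):
    """Fit y = w * x with mini-batch gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error (w * x - y)^2 with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            # Parameter update: move against the gradient.
            w -= lr * grad
    return w

# Toy data generated from y = 2 * x, so the fitted weight should approach 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = train_toy_model(data)
print(round(w, 3))
```

A real training loop follows the same rhythm, with the gradient computed by backpropagation and the update applied by the optimizer.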
Running the Training Loop
In each iteration, the loop passes one batch through the model, computes the loss, backpropagates the gradients, and then steps the optimizer and the learning rate scheduler.
from tqdm import tqdm
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.train()  # Enable training mode.
    total_loss = 0.0
    for batch in tqdm(train_loader):
        # If samples were plain lists, default collation yields one tensor per
        # token position; stack them into a (batch, sequence) tensor.
        if isinstance(batch, (list, tuple)):
            batch = torch.stack(batch, dim=1)
        input_ids = batch.to(device)
        # Forward pass: the model shifts the labels internally for next-token prediction.
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        # Backward pass
        optimizer.zero_grad()  # Reset gradients.
        loss.backward()  # Calculate gradients.
        optimizer.step()  # Update model parameters.
        scheduler.step()  # Update learning rate.
        total_loss += loss.item()  # Accumulate loss.
    print(f"Epoch {epoch + 1} completed. Loss: {total_loss:.4f}")
Output
Epoch 1 completed. Loss: 424.4011
Epoch 2 completed. Loss: 343.4245
Epoch 3 completed. Loss: 328.7054
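The reported figure is the sum of the per-batch losses, so it depends on how many batches an epoch contains; what matters is that it falls from epoch to epoch. Dividing by the number of batches gives a mean loss, and its exponential is a rough perplexity. The sketch below shows the conversion; the helper name `summarize_epoch` is hypothetical, and the batch count of 100 is an assumption, since the real count is not stated here.

```python
import math

def summarize_epoch(total_loss: float, num_batches: int):
    """Convert a summed epoch loss into a mean loss and a perplexity."""
    mean_loss = total_loss / num_batches
    # Perplexity is the exponential of the mean cross-entropy loss.
    return mean_loss, math.exp(mean_loss)

# num_batches=100 is an assumed value used only for illustration.
mean_loss, perplexity = summarize_epoch(total_loss=424.4011, num_batches=100)
print(f"mean loss: {mean_loss:.4f}, perplexity: {perplexity:.2f}")
```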
Saving the Model
Once training is done, save the model; otherwise, you would have to retrain it every time you want to use it.
# Save the trained model
model.save_pretrained('trained_Llama_model')
print("Model saved successfully.")
Output
Model saved successfully.
We have now trained the model and saved it. The saved model can be used to predict the next tokens (characters or words) of a text. We will look at this in detail in upcoming chapters.