Data Preparation For Llama



Good data preparation is essential for training any high-performance language model, Llama included. Data preparation covers gathering and cleaning the data, formatting it for Llama, and applying various preprocessing tools. Libraries such as NLTK, spaCy, and Hugging Face tokenizers work together to make the data ready for use in Llama's training pipeline. Once you understand these preprocessing stages, you can significantly improve the performance of a Llama model.

Data preparation is one of the most critical stages in building a machine learning model, especially when working with large language models. This chapter discusses how to prepare data for use with Llama and covers the following topics:

  • Data Collection and Cleaning
  • Formatting Data for Llama
  • Tools Used During Data Pre-processing

Together, these steps ensure that the data is well cleaned and appropriately structured for use in the Llama training pipeline.

Data Collection and Cleaning

Data Collection

Training models like Llama requires a large volume of high-quality, diverse data. The primary sources of textual training data are books, articles, blog posts, social media content, forums, and other publicly available text, often gathered by scraping.

Scraping text data from a website with Python

import requests
from bs4 import BeautifulSoup
# URL to fetch data from
url = 'https://www.tutorialspoint.com/Llama/index.htm'
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')
# Now, extract text data
text_data = soup.get_text()
# Now, save data to the file
with open('raw_data.txt', 'w', encoding='utf-8') as file:
    file.write(text_data)

Output

When you run the script, it saves the scraped text to a file named raw_data.txt. That raw text is cleaned in the next step.

Data Cleaning

Raw data is full of noise, such as HTML tags, special characters, and irrelevant content, so it has to be cleaned before it can be fed to Llama. Data cleaning may include:

  • Removing HTML tags
  • Removing special characters
  • Normalizing case
  • Tokenization
  • Stopword removal

Example: Pre-processing Text Data Using Python

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Load raw data
with open('raw_data.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

# Clean HTML tags
clean_data = re.sub(r'<.*?>', '', text_data)

# Clean special characters
clean_data = re.sub(r'[^A-Za-z0-9\s]', '', clean_data)

# Split text into tokens
tokens = word_tokenize(clean_data)

stop_words = set(stopwords.words('english'))

# Filter out stop words from tokens
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

# Save cleaned data
with open('cleaned_data.txt', 'w', encoding='utf-8') as file:
    file.write(' '.join(filtered_tokens))

print("Data cleaned and saved to cleaned_data.txt")

Output

Data cleaned and saved to cleaned_data.txt

The cleaned data is saved to cleaned_data.txt. The file now contains tokenized, cleaned text and is ready for further formatting and preprocessing for Llama.

Formatting Data for Llama

Llama expects its training input in a predefined structure. The data should be tokenized, and it can also be converted to formats such as JSON or CSV, depending on the training setup it will be used with.

Text Tokenization

Text tokenization is the process of dividing sentences into smaller units (typically words or subwords) so that Llama can process them. You can use pre-built tokenizers, such as the ones Hugging Face provides.

from transformers import LlamaTokenizer

token = "your_token"  # replace with your Hugging Face access token

# Sample sentence
text = "Llama is an innovative language model."

# Load the Llama tokenizer (the meta-llama repository is gated, so an access token is required)
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

# Tokenize the sample sentence
encoded_input = tokenizer(text)

print("Original Text:", text)
print("Tokenized Output:", encoded_input)

Output

Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 
   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
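
To verify what these input IDs represent, you can map them back to text. Below is a minimal sketch that decodes the output of the example above, assuming the tokenizer and encoded_input objects from that example are still in scope.

# Decode the token IDs back into readable text
decoded_text = tokenizer.decode(encoded_input['input_ids'], skip_special_tokens=True)
print("Decoded Text:", decoded_text)

# Inspect the individual subword tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'])
print("Tokens:", tokens)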

Converting Data to JSON Format

JSON is a useful format for Llama training data because it stores text in a structured, machine-readable way.

import json

# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}

# Save data as JSON
with open('formatted_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file, indent=4)

print("Data formatted and saved to formatted_data.json")

Output

Data formatted and saved to formatted_data.json

The program writes a file called formatted_data.json containing the text data in JSON format.
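
For larger datasets, a common practice is to store one JSON record per line (the JSON Lines format), so files can be streamed record by record. Below is a minimal sketch that assumes the cleaned text in cleaned_data.txt from earlier in this chapter; the output file name formatted_data.jsonl is just a placeholder.

import json

# Read the cleaned text, one record per non-empty line
with open('cleaned_data.txt', 'r', encoding='utf-8') as file:
    lines = [line.strip() for line in file if line.strip()]

# Write one JSON object per line (JSON Lines format)
with open('formatted_data.jsonl', 'w', encoding='utf-8') as json_file:
    for i, line in enumerate(lines, start=1):
        record = {"id": str(i), "text": line}
        json_file.write(json.dumps(record) + '\n')

print("Data saved to formatted_data.jsonl")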

Tools for Data Preprocessing

Several tools help with cleaning, tokenizing, and formatting data for Llama. The most common ones are Python libraries and text-processing frameworks. Here are some of the tools widely used in Llama data preparation.

1. NLTK (Natural Language Toolkit)

NLTK is one of the best-known libraries for natural language processing. It supports cleaning, tokenizing, and stemming text data.

Example: Stopword Removal using NLTK

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

# Test Data
text = "This is a simple sentence with stopwords."
 
# Tokenization
words = nltk.word_tokenize(text)

# Stopwords
stop_words = set(stopwords.words('english'))

filtered_text = [w for w in words if w.lower() not in stop_words]  # Filter out the stop words
print("Original Text:", text)
print("Filtered Text:", filtered_text)

Output

Original Text: This is a simple sentence with stopwords.
Filtered Text: ['simple', 'sentence', 'stopwords', '.']
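
The description above also mentions stemming, which reduces words to their root form. Continuing from the stopword example, here is a short sketch using NLTK's PorterStemmer (one of several stemmers NLTK provides):

from nltk.stem import PorterStemmer

# Reduce each filtered token to its stem
stemmer = PorterStemmer()
stemmed_text = [stemmer.stem(w) for w in filtered_text]

print("Stemmed Text:", stemmed_text)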

2. spaCy

spaCy is another high-level library designed for data preprocessing. It is fast, efficient, and built for real-world NLP applications.

Example: Using spaCy for Tokenization

import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample sentence
text = "Llama is an innovative language model."

# Process the text
doc = nlp(text)

# Tokenize
tokens = [token.text for token in doc]

print("Tokens:", tokens)

Output

Tokens: ['Llama', 'is', 'an', 'innovative', 'language', 'model', '.']
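
Beyond tokenization, spaCy can also split raw text into sentences and lemmatize tokens, both of which are often useful when cleaning scraped data. A brief sketch reusing the same en_core_web_sm model (the sample text is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Llama is an innovative language model. It was released by Meta AI."
doc = nlp(text)

# Split the text into sentences
sentences = [sent.text for sent in doc.sents]
print("Sentences:", sentences)

# Lemmatize each token (e.g., "was" becomes "be")
lemmas = [token.lemma_ for token in doc]
print("Lemmas:", lemmas)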

3. Hugging Face Tokenizers

Hugging Face provides high-performance tokenizers that are widely used when training language models, including Llama.

Example: Using Hugging Face Tokenizer

from transformers import AutoTokenizer
token = "your_token"
# Sample sentence
text = "Llama is an innovative language model."

# Load the Llama tokenizer via AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)

# Tokenize the sample sentence
encoded_input = tokenizer(text)
print("Original Text:", text)
print("Tokenized Output:", encoded_input)

Output

Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 
   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
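
When preparing a whole dataset rather than a single sentence, the same tokenizer can process a batch of texts with padding and truncation so that all sequences have the same length. A minimal sketch continuing from the example above; the max_length of 32 and the sample sentences are illustrative choices:

# Llama's tokenizer has no padding token by default, so reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token

batch = [
    "Llama is an innovative language model.",
    "Data preparation is a critical step in training."
]

# Tokenize the batch with padding and truncation
encoded_batch = tokenizer(
    batch,
    padding=True,       # pad shorter sequences to the longest in the batch
    truncation=True,    # cut off sequences longer than max_length
    max_length=32
)

print("Batch Input IDs:", encoded_batch['input_ids'])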

4. Pandas for Data Formatting

Pandas is useful when you are working with structured data. You can use it to format your data as CSV or JSON before passing it to Llama.

import pandas as pd

# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}

# Create a one-row DataFrame (wrap the dict in a list and give the row index 0)
df = pd.DataFrame([data], index=[0])

# Save DataFrame to CSV
df.to_csv('formatted_data.csv', index=False)

print("Data saved to formatted_data.csv")

Output

Data saved to formatted_data.csv

The formatted text data will be found in the CSV file formatted_data.csv.
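
The same approach scales to multiple records. Below is a short sketch that builds a DataFrame from several cleaned texts and saves it as both CSV and JSON; the records and file names here are just placeholders:

import pandas as pd

# Several records, for example collected from different cleaned documents
records = [
    {"id": "1", "text": "Llama is a powerful language model for AI research."},
    {"id": "2", "text": "Data preparation improves model quality."},
]

df = pd.DataFrame(records)

# Save in both formats; use whichever your training pipeline expects
df.to_csv('dataset.csv', index=False)
df.to_json('dataset.json', orient='records', indent=2)

print(df)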
