
Data Preparation For Llama
Good data preparation is essential for training any high-performance language model such as Llama. It involves gathering and cleaning the data, formatting it for Llama, and applying various preprocessing tools. Libraries such as NLTK, spaCy, and Hugging Face tokenizers all help make the data ready for Llama's training pipeline. Understanding these preprocessing stages will help you improve the performance of your Llama model.
Data preparation is one of the most critical stages in building a machine learning model, especially when dealing with large language models. This chapter discusses how to prepare data for use with Llama and covers the following topics.
- Data Collection and Cleaning
- Formatting Data for Llama
- Tools for Data Preprocessing
Together, these steps ensure that the data is well cleaned and appropriately structured for use in Llama's training pipeline.
Data Collection and Cleaning
Data Collection
Training models like Llama requires large volumes of high-quality, diverse data. The primary sources of textual training data are books, articles, blog posts, social media content, forums, and other publicly available text.
Example: Scraping Text Data from a Website with Python
```python
import requests
from bs4 import BeautifulSoup

# URL to fetch data from
url = 'https://www.tutorialspoint.com/Llama/index.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the text data
text_data = soup.get_text()

# Save the data to a file
with open('raw_data.txt', 'w', encoding='utf-8') as file:
    file.write(text_data)
```
Output
When you run the script, it saves the scraped text to a file named raw_data.txt. This raw text is then cleaned in the next step.
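Note that calling get_text() on the whole page also captures navigation menus and other boilerplate. As a refinement, the sketch below (an assumption, not part of the original script) keeps only paragraph text; the URL is the same illustrative one as above.
```python
import requests
from bs4 import BeautifulSoup

# Same illustrative URL as above
url = 'https://www.tutorialspoint.com/Llama/index.htm'
response = requests.get(url)
response.raise_for_status()  # fail fast on HTTP errors

soup = BeautifulSoup(response.text, 'html.parser')

# Keep only paragraph text, skipping menus, scripts, and footers
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
text_data = '\n'.join(paragraphs)

with open('raw_data.txt', 'w', encoding='utf-8') as file:
    file.write(text_data)
```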
Data Cleaning
Raw data is full of noise, such as HTML tags, special characters, and irrelevant content, so it has to be cleaned before it can be fed to Llama. Data cleaning may include:
- Removing HTML tags
- Removing special characters
- Normalizing case
- Tokenization
- Stopword removal
Example: Preprocessing Text Data Using Python
```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Load the raw data
with open('raw_data.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

# Remove HTML tags
clean_data = re.sub(r'<.*?>', '', text_data)

# Remove special characters
clean_data = re.sub(r'[^A-Za-z0-9\s]', '', clean_data)

# Split the text into tokens
tokens = word_tokenize(clean_data)

# Filter out stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

# Save the cleaned data
with open('cleaned_data.txt', 'w', encoding='utf-8') as file:
    file.write(' '.join(filtered_tokens))

print("Data cleaned and saved to cleaned_data.txt")
```
Output
Data cleaned and saved to cleaned_data.txt
The cleaned data is saved to cleaned_data.txt. The file now contains tokenized, cleaned text ready for further formatting and preprocessing for Llama.
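The cleaning list above also mentions case normalization, which the example leaves out for brevity. Below is a minimal sketch of that optional step; the normalized_data.txt filename is an assumption for illustration.
```python
# Optional extra step: case normalization (filenames assumed for illustration)
with open('cleaned_data.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Lowercase everything and collapse repeated whitespace
normalized = ' '.join(text.lower().split())

with open('normalized_data.txt', 'w', encoding='utf-8') as file:
    file.write(normalized)
```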
Formatting Data for Llama
Llama expects its training data in a structured form. The text should be tokenized, and it can also be converted to formats such as JSON or CSV, depending on the training pipeline it will be used with.
Text Tokenization
Text tokenization is the act of dividing sentences into smaller units (typically words or subwords) so that Llama can process them. You can use pre-built libraries such as Hugging Face's tokenizers.
```python
from transformers import LlamaTokenizer

# Hugging Face access token (replace with your own)
token = "your_token"

# Sample sentence
text = "Llama is an innovative language model."

# Load the Llama tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

# Tokenize
encoded_input = tokenizer(text)
print("Original Text:", text)
print("Tokenized Output:", encoded_input)
```
Output
Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
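If you want to see the subword strings behind the IDs, the tokenizer's convert_ids_to_tokens and decode methods make the mapping visible. The snippet below continues from the example above.
```python
# Continues from the example above: inspect the subword pieces behind the IDs
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'])
print("Subword Tokens:", tokens)

# Decode the IDs back to text to confirm the round trip
decoded = tokenizer.decode(encoded_input['input_ids'], skip_special_tokens=True)
print("Decoded Text:", decoded)
```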
Converting Data to JSON Format
JSON is a useful format for Llama training data because it stores text alongside metadata in a structured, machine-readable way.
```python
import json

# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}

# Save the data as JSON
with open('formatted_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file, indent=4)

print("Data formatted and saved to formatted_data.json")
```
Output
Data formatted and saved to formatted_data.json
The program creates a file called formatted_data.json containing the text data in JSON format.
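For more than one record, training pipelines commonly use JSON Lines (one JSON object per line), which is easy to stream. The records and filename below are illustrative assumptions, not part of the original example.
```python
import json

# Illustrative records; a real corpus would contain many more
records = [
    {"id": "1", "text": "Llama is a powerful language model for AI research."},
    {"id": "2", "text": "Data preparation is a key step in training Llama."},
]

# JSON Lines: one JSON object per line
with open('formatted_data.jsonl', 'w', encoding='utf-8') as jsonl_file:
    for record in records:
        jsonl_file.write(json.dumps(record) + '\n')

print("Data formatted and saved to formatted_data.jsonl")
```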
Tools for Data Preprocessing
A range of tools is available for cleaning, tokenizing, and formatting data for Llama. The most common ones are Python libraries and text-processing frameworks. Here is a list of some of the most widely used tools in Llama data preparation.
1. NLTK (Natural Language Toolkit)
NLTK is one of the best-known libraries for natural language processing. It supports cleaning, tokenization, and stemming of text data.
Example: Stopword Removal using NLTK
```python
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# Test data
text = "This is a simple sentence with stopwords."

# Tokenization
words = nltk.word_tokenize(text)

# Filter out stop words
stop_words = set(stopwords.words('english'))
filtered_text = [w for w in words if w.lower() not in stop_words]

print("Original Text:", text)
print("Filtered Text:", filtered_text)
```
Output
Original Text: This is a simple sentence with stopwords.
Filtered Text: ['simple', 'sentence', 'stopwords', '.']
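NLTK's stemming support, mentioned above, reduces words to their root form. Here is a minimal sketch using the Porter stemmer; the sample sentence is an illustrative assumption.
```python
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Sample sentence (illustrative)
text = "The researchers are training models and evaluating results."

# Reduce each token to its stem with the Porter algorithm
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in word_tokenize(text)]
print("Stemmed Tokens:", stems)
```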
2. spaCy
spaCy is another high-level library for data preprocessing. It is fast, efficient, and built for real-world NLP applications.
Example: Using spaCy for Tokenization
```python
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample sentence
text = "Llama is an innovative language model."

# Process the text
doc = nlp(text)

# Tokenize
tokens = [token.text for token in doc]
print("Tokens:", tokens)
```
Output
Tokens: ['Llama', 'is', 'an', 'innovative', 'language', 'model', '.']
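spaCy also annotates each token with attributes such as is_stop, is_punct, and lemma_, so stopword removal and lemmatization can happen in a single pass, as in this sketch:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Llama is an innovative language model.")

# Drop stop words and punctuation, keep the lemma of each remaining token
filtered = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
print("Filtered Lemmas:", filtered)
```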
3. Hugging Face Tokenizers
Hugging Face provides high-performance tokenizers that are widely used when training language models, including Llama.
Example: Using Hugging Face Tokenizer
```python
from transformers import AutoTokenizer

# Hugging Face access token (replace with your own)
token = "your_token"

# Sample sentence
text = "Llama is an innovative language model."

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)

# Tokenize
encoded_input = tokenizer(text)
print("Original Text:", text)
print("Tokenized Output:", encoded_input)
```
Output
Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
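For training, you usually tokenize batches of sentences rather than one at a time. Continuing from the example above, here is a sketch using the tokenizer's standard padding and truncation arguments; max_length=32 is an arbitrary illustrative value, and return_tensors="pt" assumes PyTorch is installed.
```python
# Continues from the example above: tokenize a batch of sentences
batch = [
    "Llama is an innovative language model.",
    "Data preparation improves model quality.",
]

encoded_batch = tokenizer(
    batch,
    padding=True,        # pad shorter sequences to the longest in the batch
    truncation=True,     # cut sequences longer than max_length
    max_length=32,       # illustrative value, not a recommendation
    return_tensors="pt", # PyTorch tensors; requires torch to be installed
)
print("Batch input_ids shape:", encoded_batch['input_ids'].shape)
```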
4. Pandas for Data Formatting
Pandas is useful when you are working with structured data. You can format your data as CSV or JSON with Pandas before passing it to Llama.
```python
import pandas as pd

# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}

# Create a DataFrame from a single record
df = pd.DataFrame([data])

# Save the DataFrame to CSV
df.to_csv('formatted_data.csv', index=False)

print("Data saved to formatted_data.csv")
```
Output
Data saved to formatted_data.csv
The formatted text data will be found in the CSV file formatted_data.csv.
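Pandas can also convert between formats directly. For example, the sketch below reads the CSV back and exports it as JSON Lines; the output filename is an assumption for illustration.
```python
import pandas as pd

# Read the CSV back and export it as JSON Lines for a training pipeline
df = pd.read_csv('formatted_data.csv')
df.to_json('formatted_data.jsonl', orient='records', lines=True)

print("Data converted to formatted_data.jsonl")
```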