Getting Started with Llama



Llama stands for Large Language Model Meta AI. Developed by Meta AI, it is a transformer-based architecture designed to handle complex problems in natural language processing. Llama generates remarkably human-like text and supports a wide range of language tasks, including text generation, translation, and summarization.

Llama stands out by delivering strong performance on considerably smaller datasets than peers such as GPT-3 require. This efficiency makes it accessible to a much wider set of users while remaining scalable.

Overview of Llama Architecture

The Transformer model serves as the backbone architecture of Llama. The Transformer was first introduced by Vaswani et al. in the paper "Attention Is All You Need". Llama itself is an autoregressive model, meaning it generates text one token at a time, predicting the next word in the sequence from everything that has appeared up to that point.

Important Features of the Llama Architecture are as follows −

  • Efficient Training − Llama can be trained efficiently on much smaller datasets, which makes it particularly useful for research and applications where compute power or data availability is limited.
  • Auto-regressive Structure − It generates tokens one by one, so the output stays coherent because each new token is conditioned on all the tokens produced so far; a minimal decoding loop illustrating this is sketched after this list.
  • Multi-head Self-Attention − The attention mechanism assigns different weights to the words in a sentence according to their importance, so the model captures both local and global context within the input.
  • Stacked Transformer Layers − Llama stacks many transformer blocks, each consisting of a self-attention mechanism followed by a feed-forward neural network.
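
To make the auto-regressive structure concrete, here is a minimal greedy decoding loop. It is only a sketch: greedy_generate is a hypothetical helper, and model and tokenizer are assumed to be any Hugging Face causal language model, such as the Llama 2 model loaded later in this chapter.

import torch

# Auto-regressive decoding sketch: the model repeatedly predicts the next
# token from everything generated so far and appends it to the sequence.
def greedy_generate(model, tokenizer, prompt, max_new_tokens=20):
   input_ids = tokenizer.encode(prompt, return_tensors="pt")
   for _ in range(max_new_tokens):
      with torch.no_grad():
         logits = model(input_ids).logits           # (1, seq_len, vocab_size)
      next_token = logits[:, -1, :].argmax(dim=-1)  # most likely next token
      input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
   return tokenizer.decode(input_ids[0], skip_special_tokens=True)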

Why Llama?

Llama strikes a good balance between model capacity and computational efficiency. It can generate long, coherent text and handle tasks ranging from question answering and summarization to language translation, all with relatively modest resources. Because Llama models are smaller and cheaper to run than some other large-scale language models, such as GPT-3, this kind of work becomes accessible to more people.

Llama Variants

Llama was released in several variants, each trained with a different number of parameters −

  • Llama-7B = 7 billion parameters
  • Llama-13B = 13 billion parameters
  • Llama-30B = 30 billion parameters
  • Llama-65B = 65 billion parameters

This lets users pick the variant that best matches their hardware and the requirements of their specific task; a rough way to estimate the memory each variant needs is sketched below.
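
As a back-of-the-envelope estimate, weights stored in 16-bit precision take about 2 bytes per parameter; actual requirements are higher once activations and framework overhead are included. The numbers printed below are approximations, not official figures.

# Approximate weight memory per variant at ~2 bytes per parameter (fp16/bf16)
variants = {"Llama-7B": 7e9, "Llama-13B": 13e9, "Llama-30B": 30e9, "Llama-65B": 65e9}
for name, params in variants.items():
   print(f"{name}: about {params * 2 / 1e9:.0f} GB of weights in 16-bit precision")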

Understanding Components of the Model

The functionality of Llama is built on a few crucial components. Let's discuss each of them and how they work together to shape the overall performance of the model.

Embedding Layer

The embedding layer of Llama maps input tokens to high-dimensional vectors, capturing semantic relationships between words. The intuition behind this mapping is that, in a continuous vector space, semantically similar tokens end up close to each other.

The embedding layer also prepares the input for the transformer layers that follow by reshaping the tokens into the dimensions those layers expect.

import torch
import torch.nn as nn
# Embedding layer
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
# Tokenized input (for example: "The future is bright")
input_tokens = torch.LongTensor([2, 45, 103, 567])
# Output embedding
embedding_output = embedding(input_tokens)
print(embedding_output)

Output

tensor([[-0.4185, -0.5514, -0.8762,  ...,  0.7456,  0.2396,  2.4756],
        [ 0.7882,  0.8366,  0.1050,  ...,  0.2018, -0.2126,  0.7039],
        [ 0.3088, -0.3697,  0.1556,  ..., -0.9751, -0.0777, -1.3352],
        [ 0.7220, -0.7661,  0.2614,  ...,  1.2152,  1.6356,  0.6806]],
       grad_fn=<EmbeddingBackward0>)

This embedding representation also allows the model to capture the complex ways in which tokens are related to one another.
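
As an illustration, you can compare two token embeddings with cosine similarity; with a trained embedding layer, related tokens tend to score closer to 1. The token ids below are arbitrary, and the untrained embedding from the example above will give an essentially random value.

import torch.nn.functional as F

# Compare the embeddings of two arbitrary token ids from the layer above
vec_a = embedding(torch.LongTensor([45]))
vec_b = embedding(torch.LongTensor([103]))
print(F.cosine_similarity(vec_a, vec_b))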

Self-Attention Mechanism

Self-attention is the key innovation of transformer models: it lets Llama attend to different parts of a sentence and understand how each word relates to the others. Llama uses multi-head attention, which splits the attention mechanism into multiple heads so the model can explore different parts of the input sequence in parallel.

Query, key, and value matrices are created from the input, and the model uses them to decide how much weight, or attention, each word receives relative to the others.

import torch
import torch.nn.functional as F

# Sample query, key, value tensors
queries = torch.rand(1, 4, 16)  # (batch_size, seq_length, embedding_dim)
keys = torch.rand(1, 4, 16)
values = torch.rand(1, 4, 16)

# Compute scaled dot-product attention
scores = torch.bmm(queries, keys.transpose(1, 2)) / (16 ** 0.5)
attention_weights = F.softmax(scores, dim=-1)

# apply attention weights to values
output = torch.bmm(attention_weights, values)
print(output)

Output

tensor([[[0.4782, 0.5340, 0.4079, 0.4829, 0.4172, 0.5398, 0.3584, 0.6369,
          0.5429, 0.7614, 0.5928, 0.5989, 0.6796, 0.7634, 0.6868, 0.5903],
         [0.4651, 0.5553, 0.4406, 0.4909, 0.3724, 0.5828, 0.3781, 0.6293,
          0.5463, 0.7658, 0.5828, 0.5964, 0.6699, 0.7652, 0.6770, 0.5583],
         [0.4675, 0.5414, 0.4212, 0.4895, 0.3983, 0.5619, 0.3676, 0.6234,
          0.5400, 0.7646, 0.5865, 0.5936, 0.6742, 0.7704, 0.6792, 0.5767],
         [0.4722, 0.5550, 0.4352, 0.4829, 0.3769, 0.5802, 0.3673, 0.6354,
          0.5525, 0.7641, 0.5722, 0.6045, 0.6644, 0.7693, 0.6745, 0.5674]]])

This attention mechanism makes it possible for the model to "attend" to different parts of the sequence, thereby allowing it to learn long-distance dependencies between words in the sentence.

Multi-Head Attention

Multi-head attention is an extension of self-attention in which multiple attention heads are applied in parallel. Each head focuses on a different part of the input, so the various dependencies in the data are more likely to be captured.

The attention output is then processed by a feed-forward network applied to each position separately.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
   def __init__(self, dim_model, num_heads):
      super(MultiHeadAttention, self).__init__()
      self.num_heads = num_heads
      self.dim_head = dim_model // num_heads

      self.query = nn.Linear(dim_model, dim_model)
      self.key = nn.Linear(dim_model, dim_model)
      self.value = nn.Linear(dim_model, dim_model)
      self.out = nn.Linear(dim_model, dim_model)

   def forward(self, x):
      B, N, C = x.shape
      queries = self.query(x).reshape(B, N, self.num_heads, self.dim_head).transpose(1, 2)
      keys = self.key(x).reshape(B, N, self.num_heads, self.dim_head).transpose(1, 2)
      values = self.value(x).reshape(B, N, self.num_heads, self.dim_head).transpose(1, 2)
      # Scaled dot-product attention scores
      scores = torch.matmul(queries, keys.transpose(-2, -1)) / (self.dim_head ** 0.5)
      attention_weights = F.softmax(scores, dim=-1)
      out = torch.matmul(attention_weights, values).transpose(1, 2).reshape(B, N, C)
      return self.out(out)

# Build the multi-head attention layer and apply it to a random input
attention_layer = MultiHeadAttention(128, 8)
output = attention_layer(torch.rand(1, 10, 128))  # (batch_size, seq_length, embedding_dim)
print(output)

Output

tensor([[[-0.1015, -0.1076,  0.2237,  ...,  0.1794, -0.3297,  0.1177],
         [-0.1028, -0.1068,  0.2219,  ...,  0.1798, -0.3307,  0.1175],
         [-0.1018, -0.1070,  0.2228,  ...,  0.1793, -0.3294,  0.1183],
         ...,
         [-0.1021, -0.1075,  0.2245,  ...,  0.1803, -0.3312,  0.1171],
         [-0.1041, -0.1070,  0.2232,  ...,  0.1817, -0.3308,  0.1184],
         [-0.1027, -0.1087,  0.2223,  ...,  0.1801, -0.3295,  0.1179]]],
       grad_fn=<ViewBackward0>)

Feed-Forward Network

The feed-forward network is perhaps the simplest yet fundamentally important building block of a transformer block. As the name suggests, it applies a non-linear transformation to the input sequence, allowing the model to learn more complex patterns.

Every attention layer in Llama is followed by a feed-forward network that performs this transformation.

class FeedForward(nn.Module):
   def __init__(self, dim_model, dim_ff):
      super(FeedForward, self).__init__()
      self.fc1 = nn.Linear(dim_model, dim_ff)
      self.fc2 = nn.Linear(dim_ff, dim_model)
      self.relu = nn.ReLU()
   def forward(self, x):
      return self.fc2(self.relu(self.fc1(x)))

# define and use the feed-forward network
ffn = FeedForward(128, 512)
ffn_output = ffn(torch.rand(1, 10, 128))  # (batch_size, seq_length, embedding_dim)
print(ffn_output)

Output

tensor([[[ 0.0222, -0.1035, -0.1494,  ...,  0.0891,  0.2920, -0.1607],
         [ 0.0313, -0.2393, -0.2456,  ...,  0.0704,  0.1300, -0.1176],
         [-0.0838, -0.0756, -0.1824,  ...,  0.2570,  0.0700, -0.1471],
         ...,
         [ 0.0146, -0.0733, -0.0649,  ...,  0.0465,  0.2674, -0.1506],
         [-0.0152, -0.0657, -0.0991,  ...,  0.2389,  0.2404, -0.1785],
         [ 0.0095, -0.1162, -0.0693,  ...,  0.0919,  0.1621, -0.1421]]],
       grad_fn=<ViewBackward0>)

Steps to Create a Token for Using the Llama Models

Before accessing the Llama models, you need to create an access token on Hugging Face. We are using a Llama 2 model here because it is lightweight, but you can choose any model. Follow the steps below to get started.

Step 1: Sign up for a Hugging Face Account (if you haven't done so)

  • On the Hugging Face home page, click on Sign Up.
  • If you don't already have an account, create one now.

Step 2: Fill out the Request Form to Access the Llama Models

To download and use the Llama models, you need to fill out a request form. To do this −

  • Select your model (here we use Llama 2 for the sake of simplicity and its light weight) and click Next on the form.
  • Accept the Llama 2 terms and conditions and click on Accept and continue.
  • You are all set.

Step 3: Obtain Access Token

  • Go to your Hugging Face account.
  • Click your profile photo in the top right corner and go to Settings.
  • Navigate to Access Tokens.
  • Click on Create new token.
    • Give it a name, for example "Llama Access Token".
    • Set the user permissions; the scope must include at least read access to use gated models.
    • Click Create token.
  • Copy the token; you will use it in the next step.

Step 4: Authentication in Your Script with the Token

Once you have your Hugging Face token, you must authenticate with this token in your Python script.

First of all, if you haven't done so already, install the required packages −

!pip install transformers huggingface_hub torch

Import the login method from Hugging Face Hub and login using your token −

from huggingface_hub import login
# Replace <your_token> with your Hugging Face token
login(token="<your_token>")
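
If you prefer not to hard-code the token in your script, one option (purely a suggestion, not required by the library) is to export it as an environment variable and read it at runtime. HF_TOKEN is just a name chosen for this sketch −

import os
from huggingface_hub import login

# Read the token from an environment variable instead of hard-coding it.
# "HF_TOKEN" is an arbitrary variable name used for this example.
hf_token = os.environ.get("HF_TOKEN")
login(token=hf_token)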

Or, if you do not wish to log in interactively, you can directly pass your token in the code at the time you load your model.

Step 5: Update the Code to Load the Model Using the Token

Loading the gated model using your token.

The token can be passed directly to the from_pretrained() method.

from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
 
token = "<your_token>"  # paste your Hugging Face access token here
# Log in with your token
login(token=token)
 
# Loading tokenizer and model from gated repository and using auth token
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)

Step 6: Run the Code

With your token in place, either logged in or passed to the model-loading functions, your script should now be able to access the gated repository and generate text with the Llama model.

Running Your First Llama Script

With the access token created and authentication set up, it is time to run your very first Llama script. You can use a pre-trained Llama model for text generation; here we use Llama-2-7b-hf, one of the Llama 2 models.

from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer and the model (token is the access token created earlier)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)
# Encode the input text and generate a continuation
input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=50, num_return_sequences=1)

# Decode and print output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output

The future of AI is a subject of great interest, and it is not surprising
that many people are interested in the subject. It is a very interesting
topic, and it is a subject that is likely to be discussed for many years to come

Generate Text − The above script generates a text sequence that demonstrates how Llama interprets context and produces coherent writing.
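
If you want more varied output, one option is to enable sampling in generate(). The do_sample, temperature, top_p, and max_new_tokens arguments are standard parameters of the Transformers generate() API; the values below are only illustrative starting points.

# Sampling-based generation, reusing the model and inputs from the script above
outputs = model.generate(
   input_ids,
   max_new_tokens=50,
   do_sample=True,
   temperature=0.7,
   top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))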

Summing Up

Llama is impressive due to its transformer-based architecture, multi-head attention, and autoregressive generation capabilities. The balance it strikes between computational efficiency and model performance makes it applicable to a broad range of natural language processing tasks. Familiarity with Llama's most important components and architecture gives you the foundation to experiment with text generation, translation, summarization, and much more.
