Unlocking Llama 3: Your Ultimate Guide to Mastering Llama 3!

Vishnu Sivan · Published in The Pythoneers · Apr 29, 2024 · 10 min read

The world of artificial intelligence has been rapidly evolving in recent years, largely due to the emergence of Large Language Models (LLMs). These advanced systems have progressed from basic text processors to sophisticated models capable of understanding and generating human-like text, marking significant advancements in their capabilities and applications. At the forefront of this evolution is Meta’s latest offering, Llama 3, a model family that promises to redefine the boundaries of accessibility and performance for open models.

Just last week, Meta unveiled the Llama 3 8B and 70B models, showcasing a remarkable leap in capabilities, including enhanced reasoning, and setting new benchmarks for models of their size. Llama 3 stands as the most capable openly available LLM to date, and its release marks a significant milestone in the field of artificial intelligence.

In this article, we will delve into Llama 3 and provide a comprehensive guide on how to effectively leverage its power. We will also explore the potential of Llama 3 and shed light on how it can revolutionize various industries.

Getting Started

Table of contents

  • What is Llama 3
  • Llama 2 vs Llama 3
  • Llama 3 vs other models
  • Llama 3 Safety features
  • Experimenting with Llama 3

What is Llama 3

Meta Llama 3 is the latest in Meta’s line of language models, with versions containing 8 billion and 70 billion parameters. It’s designed to excel in various applications, from everyday conversations to complex reasoning tasks, surpassing previous models in performance. Llama 3 is freely accessible, encouraging innovation in AI development and beyond.

AI at Meta on X: “Introducing Meta Llama 3: the most capable openly available LLM to date. Today we’re releasing 8B & 70B models that deliver on new capabilities such as improved reasoning and set a new state-of-the-art for models of their sizes. Today’s release includes the first two Llama 3… https://t.co/Q80lVTeS7m”

Key Features:

  • Grouped Query Attention (GQA) is integrated across both the 8 billion and 70 billion parameter models to enhance inference efficiency for focused and effective processing.
  • Outperforms its predecessors and competitors across various benchmarks, excelling in tasks such as MMLU and HumanEval.
  • Llama 3 maintains a decoder-only transformer architecture with significant improvements, including a tokenizer supporting 128,000 tokens for better language encoding efficiency.
  • Trained on a dataset of over 15 trillion tokens, seven times larger than Llama 2’s dataset, incorporating diverse linguistic representation and non-English data from over 30 languages.
  • An enhanced post-training phase combines supervised fine-tuning, rejection sampling, and policy optimization to improve model quality and decision-making capabilities.
  • Detailed scaling laws optimize data mix and computational resources, ensuring robust performance across diverse applications while tripling the training process’s efficiency compared to Llama 2.

Llama 2 vs Llama 3

Llama 3 builds upon the previous Llama 2 model, retaining the core decoder-only transformer architecture while introducing several key improvements. The tokenizer now supports a vocabulary of 128,000 tokens, enabling more efficient encoding of language and enhanced performance. Furthermore, Llama 3 integrates Grouped Query Attention (GQA), which improves inference efficiency across both parameter sizes. The model is also trained on sequences of 8,192 tokens, using a mask to ensure attention does not cross document boundaries, for more focused and effective processing.
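To see the expanded vocabulary in practice, the short sketch below loads the Llama 3 tokenizer from the Hugging Face Hub and prints its vocabulary size. This is a minimal example and assumes you have been granted access to the gated meta-llama repository and have a Hugging Face token configured in your environment.

from transformers import AutoTokenizer

# Assumes access to the gated meta-llama repo and an HF token in the environment
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(len(tokenizer))                              # vocabulary size, roughly 128k tokens
print(tokenizer("Hello, Llama 3!")["input_ids"])   # token ids for a sample string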

Llama 3 vs other models

Llama 3, developed by Meta, has set new standards in generative AI, outshining both its predecessors and competitors across a range of benchmarks. It has particularly excelled in tests such as MMLU, which evaluates knowledge across diverse areas, and HumanEval, which focuses on coding skills. Additionally, Llama 3 has surpassed other high-parameter models like Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Sonnet, especially in complex reasoning and comprehension tasks.

Meta’s Llama 3 demonstrates exceptional performance across various benchmarks and applications, notably excelling in tasks related to reasoning, coding, and creative writing. Its ability to produce diverse and accurate responses distinguishes it from other models, ensuring improved user experiences and productivity.

Image credits: Meta Llama 3

Llama 3 Safety features

Llama 3 introduces new safety and trust tools such as Llama Guard 2, CyberSecEval 2, and Code Shield, which help filter out insecure code suggestions at inference time. It was developed alongside torchtune, a PyTorch-based library facilitating efficient authoring, fine-tuning, and testing of large language models (LLMs), integrating with platforms like Hugging Face and Weights & Biases.

Image credits: Introducing Meta Llama 3: The most capable openly available LLM to date

Responsible deployment is ensured through systematic testing, including “red-teaming” efforts to assess safety and robustness against misuse, particularly in cybersecurity. Llama Guard 2 follows industry standards from MLCommons, while CyberSecEval 2 enhances security measures. The development of Llama 3 emphasizes an open approach to unite the AI community and address potential risks, with Meta’s Responsible Use Guide (RUG) outlining best practices and cloud providers offering content moderation tools.

Experimenting with Llama 3

Experimenting with Llama 3 on your local machine has never been easier, thanks to a range of tools tapping into its openly available weights. Hugging Face has led the charge: Llama 3 is now supported in the Transformers library, with models available on the Hugging Face Hub. Whether you prefer full-precision models or the efficiency of 4-bit quantized ones, installation and execution are straightforward.

Here, we’ll explore two distinct methods tailored to different user preferences and technical levels.

Method 1: Using Google Colab and HuggingFace

Let’s dive in with a hands-on demonstration of running Llama 3 on the Colab free tier.

Step 1: Enabling Llama 3 access

Llama 3 is a gated model, requiring users to request access.

Follow these steps to enable model access.

  • Log in to your Hugging Face account, or register a new account if you don’t already have one.
  • Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B to request access.
  • Provide your details (first name, last name, date of birth, country, and affiliation) and accept the license agreement. Once your request is approved, you will have access to the Llama 3 model.

Step 2: Hugging Face Access token generation

To access the model, you will also need a Hugging Face access token. You can generate one by going to Settings, then Access Tokens in the left sidebar, and clicking the “New token” button.

Step 3: Creating your first script with Llama 3 using HuggingFace

  • Open Google Colaboratory and click Sign in to log in to your Colab account, or create a new account if you don’t have one.
  • Change the runtime to T4 GPU via Runtime → Change runtime type → T4 GPU → Save.
  • To use Llama 3, you must provide your Hugging Face access token. Select Secrets (🔑) in the left pane and add your HF_TOKEN key (see the sketch after this list).
  • Create a new Colab notebook by clicking the + New notebook button.
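If you prefer to authenticate explicitly in code, the following minimal sketch reads the HF_TOKEN secret added above and logs in to the Hugging Face Hub. It assumes the secret is named HF_TOKEN and that notebook access is enabled for it.

from google.colab import userdata
from huggingface_hub import login

# Read the HF_TOKEN secret from the Colab Secrets (🔑) pane
hf_token = userdata.get("HF_TOKEN")

# Authenticate so gated model downloads work in this session
login(token=hf_token)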

Step 4: Installing dependencies

Install the transformers, accelerate, and bitsandbytes libraries using the following commands.

!pip install -U "transformers==4.40.0" --upgrade
!pip install accelerate bitsandbytes
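
Optionally, you can confirm that the notebook actually sees the T4 GPU before loading the model. This is a minimal check using PyTorch, which ships with Colab by default.

import torch

# Verify that the GPU runtime is active before downloading the model
print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # e.g. "Tesla T4"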

Step 5: Downloading and installing the model

Download the Llama 3 model (here, a 4-bit quantized community build of the 8B Instruct variant) and set up the text-generation pipeline.

import transformers
import torch

# 4-bit quantized build of the Llama 3 8B Instruct model
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

# Text-generation pipeline loading the model in half precision with 4-bit quantization
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)

Step 6: Send queries

Send queries to the model for inference.

messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": "Hey how are you doing today?"},
]

# Build the prompt using Llama 3's chat template
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop generation at either the EOS token or Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Print only the newly generated text, stripping the prompt prefix
print(outputs[0]["generated_text"][len(prompt):])

You will get output similar to the following:

I'm doing great, thanks for asking! I'm a helpful assistant, so I'm always ready to assist you with any questions or tasks you may have. How about you? How's your day going so far?

Building a chatbot using Llama 3

In this section, we’ll create a chatbot using Llama 3 with gradio.

  • Install the gradio package
!pip install gradio
  • Create a new cell in your notebook and add the following code to it.
import gradio as gr

# Running chat history in the Llama 3 message format
messages = []

def add_text(history, text):
    # Append the user's message to the chat history and clear the textbox
    global messages
    history = history + [[text, ""]]
    messages = messages + [{"role": "user", "content": text}]
    return history, ""

def generate(history):
    global messages
    # Build the prompt from the accumulated messages using the chat template
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Stop at either the EOS token or Llama 3's end-of-turn token
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]

    # Remember the assistant's reply so the next turn has full context
    messages = messages + [{"role": "assistant", "content": response_msg}]

    # Stream the reply into the chat window character by character
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:

    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )

    # On submit: add the user message, clear the textbox, then generate a reply
    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot,
    )

demo.queue()
demo.launch(debug=True)
  • Run the cell. You will get a Gradio interface in the notebook, or you can use the provided link to open it in a new tab. The output will look like the example below.

Method 2: Using Ollama

If you’re looking for an alternative that runs large language models (LLMs) locally without relying on cloud services, Ollama is a great choice. Ollama is open-source software designed for running LLMs locally, putting the control directly in your hands.

To get started with Ollama, all you need to do is download the software. With Ollama, you can enjoy the benefits of LLMs while maintaining data privacy and control over your computational resources.

  • Go to Ollama official site.
  • Click on Download to download the software.
  • Double-click the installer and click Install to install it on your machine.
  • Once installed, use the following command to start a local server with the specified model.
ollama run llama3:instruct

You can also pass llama3, llama3:70b, or llama3:70b-instruct as arguments to run the other Llama 3 variants. Ensure that you have a working internet connection, otherwise you might get an "Error dial tcp: lookup no such host" error while pulling the model.

Run the model

  • Once the model is downloaded, you can begin querying. Enter your prompt directly in the terminal, or use the API to interact with the model.

Querying the model using curl

Ollama exposes an endpoint (/api/generate) on port 11434 for use with curl. You can use the following format to query it.

curl http://localhost:11434/api/generate -d "{ \"model\": \"llama3:instruct\", \"prompt\": \"How many colors are there in a rainbow?\", \"stream\": false }"
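
The same endpoint can also be called from Python. The sketch below uses the requests library and assumes the Ollama server is running locally on the default port 11434.

import requests

# Query the local Ollama server's generate endpoint
payload = {
    "model": "llama3:instruct",
    "prompt": "How many colors are there in a rainbow?",
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])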

Querying the model using Postman

  • Open Postman.
  • Select POST as the request method.
  • In the URL input field, provide the endpoint: localhost:11434/api/generate.
  • Choose JSON as the request body format.
  • Provide the request body content as follows. Replace "prompt": "value" with the appropriate content.
{
    "model": "llama3:instruct",
    "prompt": "How many colors are there in a rainbow?",
    "stream": false
}
  • Click on the Send button to make the request to the Ollama endpoint.
  • You will get the result as follows.
{
    "model": "llama3:instruct",
    "created_at": "2024-04-29T17:34:38.0223636Z",
    "response": "There are 7 colors that we commonly see in a rainbow, which are:\n\n1. Red\n2. Orange\n3. Yellow\n4. Green\n5. Blue\n6. Indigo\n7. Violet\n\nThese colors are also sometimes remembered using the acronym ROY G BIV, with each letter standing for the name of a color. Would you like to know more about rainbows or is there something else I can help you with?",
    "done": true,
    "context": [
        128006,
        ...
        128009
    ],
    "total_duration": 17278010500,
    "load_duration": 5247897400,
    "prompt_eval_count": 19,
    "prompt_eval_duration": 1196966000,
    "eval_count": 92,
    "eval_duration": 10829807000
}
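
The duration fields in the response are reported in nanoseconds, so you can derive a rough generation speed from eval_count and eval_duration. For the sample response above this works out to roughly 8.5 tokens per second; the short calculation below uses those sample values.

# Sample values taken from the response above
eval_count = 92                    # tokens generated
eval_duration_ns = 10829807000     # generation time in nanoseconds

tokens_per_second = eval_count / (eval_duration_ns / 1e9)
print(round(tokens_per_second, 1))  # ~8.5 tokens per second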

If you enjoyed this article, please click on the clap button 👏 and share to help others find it!

The full source code for this tutorial can be found here,
