Build Your Personal RAG Chatbot: Chat Freely with Your Data Powered by HuggingFace, LlamaIndex and Open LLMs!

Sabeerali
15 min read · Jan 9, 2024


Whether you’re a curious researcher, a diligent student, or a busy professional, the world of open-source tools offers endless possibilities. Today, let’s embark on a journey together, a journey to build your very own RAG Chatbot.

Imagine having a personal assistant that can effortlessly chat with your data, providing answers based on the information you provide. The best part? It won’t cost you a penny. So, without further ado, let’s dive in and explore how you can create a simple and powerful RAG Chatbot using the free and user-friendly LlamaIndex and Open LLMs.

Get ready to revolutionize your data interactions in a way that’s simple, accessible, and tailored to your needs!

Prerequisites

Before we dive in, there are a few key concepts we need to understand. If you're already comfortable with LlamaIndex, Hugging Face, and Open LLMs, feel free to skip ahead.

1. LlamaIndex🦙

LlamaIndex is a versatile and user-friendly data framework crafted to seamlessly link custom data sources with large language models. Serving as a crucial bridge between your unique datasets and powerful language models, LlamaIndex simplifies the process of incorporating your data into conversations with these intelligent machines.

Whether your information resides in APIs, databases, or PDFs, LlamaIndex provides a straightforward solution, ensuring a hassle-free integration that lets your data interact effortlessly with these sophisticated models.

LlamaIndex serves multiple purposes in the realm of working with data and large language models:

  • Data Ingestion: LlamaIndex facilitates the seamless transfer of data from its original source into the system, streamlining the process of ‘ingesting’ information.
  • Data Structuring: It plays a crucial role in organizing and ‘structuring’ the data in a manner that is easily comprehensible for language models, ensuring effective communication between your data and these advanced systems.
  • Data Retrieval: LlamaIndex excels in the task of ‘retrieval,’ enabling the efficient location and fetching of specific pieces of data when needed, ensuring quick and accurate access to relevant information.
  • Integration: Simplifying the ‘integration’ process, LlamaIndex makes it effortless to blend your data with various application frameworks, promoting a harmonious collaboration between your datasets and diverse platforms.

Creating a Retrieval Augmented Generation (RAG) conversational agent with LlamaIndex is simpler than it sounds. Here’s a straightforward breakdown:

🛠️ Tools Needed:

  1. LlamaIndex: This tool helps you manage your data and connects with Large Language Models (LLMs).
  2. FAISS: It’s used for similarity search, making it easier to find relevant information.
  3. LLM: The Large Language Model that generates responses using the retrieved information.
  4. Unstructured.IO: This tool is handy for processing unstructured data.

🏗️ Steps to Build Your RAG Chatbot:

  1. Fetch the Data: Grab the information you need. It could be from APIs, databases, or any other source.
  2. Parse the Data: Organize the data into a format that makes sense for your chatbot.
  3. Build Chat and Query Models: Create the models that will handle the conversation and respond to queries.
  4. Leverage Semantic Search: Use tools like FAISS to make your search for information smarter and more efficient.

2. Hugging Face 🤗

Hugging Face is a vibrant community and infrastructure for AI enthusiasts, researchers, and developers. Think of it as the GitHub of machine learning, where everyone can share, collaborate, and build on top of each other's work. Here's a breakdown of its key aspects:

1. Community Platform:

  • Open-Source Focus: Hugging Face champions open-source AI models and datasets, democratizing access for everyone.
  • Sharing Hub: Explore a vast library of pre-trained AI models, datasets, and tools, all contributed by the community.
  • Collaboration Space: Discuss, learn, and solve problems together through forums, tutorials, and workshops.

2. Tools and Technologies:

  • Transformers Library: The crown jewel, offering an easy-to-use Python library for working with state-of-the-art NLP models like BERT, GPT-2, and Llama.
  • API & Tools: Build, train, fine-tune, and deploy AI models with powerful tools for tasks like text classification, summarization, and dialogue generation.
  • Model Inference: Deploy your models in production environments with flexible options for cloud or on-premises hosting.

While Hugging Face started with NLP, it’s expanding to other domains like computer vision and audio processing. The “Transformers” library is being adapted for these broader applications, opening new avenues for creative exploration.

3. Open LLMs 📖

Large language models (LLMs) are a type of artificial intelligence (AI) that are trained on massive amounts of text data to generate human-quality text, translate languages, write different kinds of creative content, and answer your questions in an informative way. Open source LLMs are those that are freely available for anyone to use, modify, and distribute, as opposed to proprietary LLMs that are owned and controlled by specific companies.

Key characteristics of Open LLMs:

  • Accessibility: Open source LLMs can be used by anyone without licensing restrictions or expensive fees.
  • Transparency: Their code and training data are often publicly available, allowing for greater understanding, scrutiny, and improvement by the community.
  • Customizability: Users can fine-tune LLMs for specific tasks or domains by training them on additional data relevant to their needs.
  • Collaboration: Open-source development fosters collective innovation and knowledge sharing, leading to faster progress and wider adoption.

Some of the most popular ones include LLaMA, BLOOM, Mistral, SOLAR, BERT, T5, UL2, Cerebras-GPT, Open Assistant (Pythia family), Pythia, Dolly, DLite, RWKV, GPT-J-6B, GPT-NeoX-20B, StableLM-Alpha, FastChat-T5, h2oGPT, MPT-7B, MPT-30B, RedPajama-INCITE, OpenLLaMA, and Falcon.

4. Google Colaboratory 💻

Google Colab, short for Google Colaboratory, is a cloud-based service from Google that makes writing and executing Python code a breeze in a collaborative setting. Think of it as a Jupyter Notebook environment, but entirely in the cloud.

Key Features:

  • Zero Configuration: No need for any setup. You can start writing and running Python code right in your browser.
  • Free Access to GPUs and TPUs: Google Colab offers free access to powerful hardware accelerators like GPUs and TPUs, which are crucial for swiftly and efficiently training machine learning models.
  • Easy Sharing and Collaboration: Your Colab notebooks are saved in your Google Drive, making it simple to share them with others. Collaborators can comment or even edit the notebooks seamlessly.
  • Integration with Google Drive: The integration with Google Drive ensures a smooth transition and easy access to your Colab notebooks.
  • Interactive Environment: Unlike static web pages, Colab provides an interactive environment where you can both write and execute code directly in the notebook.

What is Retrieval-Augmented Generation?

(Figure: Retrieval-Augmented Generation workflow)

Retrieval-Augmented Generation (RAG) is a technique employed to enhance the performance of a large language model (LLM). It achieves this by consulting an authoritative knowledge base beyond its training data sources before generating a response.

Large Language Models (LLMs) undergo training on extensive datasets, utilizing billions of parameters to produce original content for tasks such as answering questions, language translation, and sentence completion. Despite their capabilities, LLMs can sometimes provide unpredictable or outdated information. This is where RAG comes into play.

RAG expands upon the robust functionalities of LLMs by incorporating specific domains or an organization’s internal knowledge base, all without requiring model retraining. It offers a cost-effective solution to enhance LLM output, ensuring its relevance, accuracy, and usefulness across various contexts.

As an AI framework, RAG retrieves factual information from an external knowledge base to anchor large language models (LLMs) in the most precise and current data, providing users with insight into the generative process of LLMs. This approach guarantees that the model has access to the latest, reliable facts, and users can verify the sources, establishing trust in its claims.

Essentially, RAG is a technique that bolsters the accuracy and reliability of generative AI models by incorporating facts from external sources. It addresses a gap in how LLMs operate by grounding the model in external knowledge, complementing the LLM’s internal representation of information.

The advantages of RAG include cost-effective implementation, increased control over generated text output, and a better understanding of how the LLM constructs responses. By grounding an LLM in a set of external, verifiable facts, the model has fewer opportunities to inadvertently reveal sensitive data or generate inaccurate information. RAG also diminishes the necessity for users to continuously train the model on new data and update its parameters as circumstances evolve.

🚀 Enough Theory Let’s Practice!

We use Google Colab as our integrated development environment (IDE) and take advantage of the T4 GPU included in its free tier.

1. Open Google Colab and Change its runtime to T4 GPU

Here are the instructions to change the Google Colab runtime to T4 GPU:

  1. Click on the “Runtime” option in the top menu.
  2. Select “Change runtime type”.
  3. Change the “Hardware accelerator” to “T4 GPU”.
  4. Click “Save”.
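
Before installing anything, it is worth confirming that the GPU is actually active. The quick check below is optional and not part of the original walkthrough; it only assumes that PyTorch is preinstalled on Colab (which it is by default).

import torch

# Should print True and "Tesla T4" on a correctly configured Colab runtime
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))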

2. Install Necessary Libraries

We have to install the required libraries for creating a RAG conversational agent, including transformers, llama-index, accelerate, pypdf, einops, and bitsandbytes.

!pip install -U transformers llama-index accelerate pypdf einops bitsandbytes
  • Transformers: The Transformers library is a popular open-source library for natural language processing (NLP) tasks in Python, developed by Hugging Face. The library provides APIs and tools to easily download and train state-of-the-art pretrained models.
  • LlamaIndex: LlamaIndex is a data framework designed to connect custom data sources to large language models (LLMs). It acts as a bridge between your custom data and LLM.
  • Accelerate: The Accelerate library is a tool developed by Hugging Face that allows you to run the same PyTorch code across any distributed configuration by adding just a few lines of code. In short, it makes training and inference at scale simple, efficient, and adaptable.
  • PyPDF: PyPDF is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF can retrieve text and metadata from PDFs as well.
  • Einops: Einops, short for Einstein-Inspired Notation for operations, is an open-source Python library that provides flexible and powerful tensor operations for readable and reliable code. It supports various frameworks including numpy, pytorch, tensorflow, jax, and others.
  • BitsandBytes: bitsandbytes is a Python library that serves as a lightweight wrapper around CUDA custom functions. It is particularly used for 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions that reduce GPU memory usage.

3. Import modules

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
  • VectorStoreIndex: Used to create an index over documents for semantic similarity via vector search. This index is particularly useful when your workflow involves comparing texts for semantic similarity.
  • SimpleDirectoryReader: Used to read data from a directory or a list of files. It selects the best file reader based on the file extensions.
  • ServiceContext: A container of commonly used resources, such as the LLM, the embedding model, and chunking settings, that LlamaIndex uses during indexing and querying. Bundling these options into a single object keeps them consolidated instead of being passed separately to every method.
  • HuggingFaceLLM: Used to import models and tokenizers from HuggingFace directly.

These classes are used in the context of working with Large Language Models (LLMs), particularly when you need to create semantic search indexes over documents, read data from directories, pass contextual information for a service, and work with Hugging Face’s LLMs. They are part of the llama_index library, which provides tools for connecting private, customized data sources to your LLMs.

4. Load the Documents

Begin by creating a folder called 'data' in the Files panel of your Colab session (the folder icon in the left sidebar), and then populate the 'data' directory with PDF or text documents.

documents = SimpleDirectoryReader("./data").load_data()

We are using the SimpleDirectoryReader class to read data from a directory.

SimpleDirectoryReader("./data"): This creates an instance of the SimpleDirectoryReader class. The string ./data is a path to the directory that this instance will read from. The . means the current directory, so it’s looking for a folder named data in the current directory.

.load_data(): This method is called on the instance of SimpleDirectoryReader. It reads all the data from the specified directory and returns it.

documents = : Then assign the data returned by load_data() to the variable documents.

So, in short, this line of code is reading all the data from the data folder in the current directory and storing it in the documents variable.
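
As an optional sanity check, you can inspect what was loaded before building the index. The snippet below assumes the standard llama_index Document interface, where each loaded page or file exposes a text attribute.

print(f"Loaded {len(documents)} document object(s)")

# Peek at the beginning of the first document to confirm the text was extracted
print(documents[0].text[:300])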

5. Create a Prompt Template

from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|>#
Mistral Research is an expert in the field of research
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

from llama_index.prompts import PromptTemplate: This line imports the PromptTemplate class from the prompts module of the llama_index library. PromptTemplate is used to create templates for prompts that are used in the model.

system_prompt = """<|SYSTEM|># Mistral Research is an expert in the field of research """: This line defines a string variable named system_prompt. The string contains a system prompt that provides some context for the model. Feel free to change it to your needs.

query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>"): This line creates an instance of the PromptTemplate class. The string passed to PromptTemplate is a template for wrapping user queries and assistant responses. {query_str} is a placeholder that will be replaced with the actual query string.

The purpose of this code is to set up the prompts that will be used when interacting with the model. The prompts provide context for the model and help guide its responses.
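
If you want to see exactly what the wrapped prompt looks like, PromptTemplate provides a format method for filling in its placeholders. The question below is only an example:

# Fill the {query_str} placeholder with a sample question
print(query_wrapper_prompt.format(query_str="What is self-attention?"))
# Prints: <|USER|>What is self-attention?<|ASSISTANT|>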

6. Initialize the LLM from HuggingFace

We have chosen to use the Mistral 7B v0.1 as our Large Language Model (LLM) because it is compact and has demonstrated strong performance.

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    model_name="mistralai/Mistral-7B-v0.1",
    device_map="auto",
    tokenizer_kwargs={"max_length": 4096},
    # the quantization options below require a CUDA GPU and the bitsandbytes
    # library; they greatly reduce memory usage
    model_kwargs={
        "torch_dtype": torch.float16,
        "llm_int8_enable_fp32_cpu_offload": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
        "bnb_4bit_compute_dtype": torch.bfloat16,
        "load_in_4bit": True,
    },
)

The above Python code initializes a Large Language Model (LLM) from Hugging Face.

import torch: This line imports the PyTorch library, which is a popular open-source machine learning library.

llm = HuggingFaceLLM(...): This line is creating an instance of the HuggingFaceLLM class, which represents a Large Language Model (LLM) from Hugging Face.

context_window=4096: This sets the maximum number of tokens that the model considers when generating a response.

max_new_tokens=256: This sets the maximum number of new tokens that the model can generate in a single response.

generate_kwargs={"temperature": 0, "do_sample": False}: These are additional arguments for the model's generate function. temperature controls the randomness of the model's output; a lower temperature results in less random output. do_sample determines whether the model samples from its output distribution; with do_sample=False the model decodes greedily, so its responses are deterministic and the temperature setting is effectively ignored.

system_prompt=system_prompt, query_wrapper_prompt = query_wrapper_prompt: These are prompts that provide context for the model and help guide its responses.

tokenizer_name="mistralai/Mistral-7B-v0.1", model_name="mistralai/Mistral-7B-v0.1": These specify the name of the tokenizer and the model to use. In this case, both are set to "mistralai/Mistral-7B-v0.1".

device_map="auto": This sets the device map to “auto”, which means the model will automatically use the best available device (CPU or GPU).

tokenizer_kwargs={"max_length": 4096}: This sets the maximum length for the tokenizer.

model_kwargs={"torch_dtype": torch.float16}: This sets the data type for the model’s tensors to torch.float16. This can help reduce memory usage if you’re using a CUDA-enabled GPU.

  • "llm_int8_enable_fp32_cpu_offload": True: This flag is used for advanced use cases. If you want to split your model into different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag.
  • "bnb_4bit_quant_type": 'nf4': This sets the quantization data type in the bnb.nn.Linear4Bit layers.
  • "bnb_4bit_use_double_quant": True: When set to True, this flag enables double quantization, which can further enhance the efficiency of 4-bit quantization.
  • "bnb_4bit_compute_dtype": torch.bfloat16: This defines the data type to use for computations when the model weights are quantized. By default, the compute dtype is set to float32.
  • "load_in_4bit": True: This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from bitsandbytes. This can help reduce memory usage by approximately fourfold.

The quantization options inside model_kwargs require a CUDA-enabled GPU and the bitsandbytes library; they are what allow a 7-billion-parameter model to fit within the memory of Colab's free T4 GPU. If you are running on hardware without CUDA support, remove them, but expect far higher memory requirements.

In summary, this code is setting up a Large Language Model from Hugging Face with specific parameters and options. It’s ready to generate responses based on the provided prompts and the model’s training data.
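
Before wiring the model into an index, you can optionally confirm that it loads and generates text on its own. HuggingFaceLLM exposes a complete method for one-off generation; the prompt here is just an illustration:

# A single completion to verify that the model downloaded and runs on the GPU
test_response = llm.complete("Briefly explain what a transformer model is.")
print(test_response.text)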

7. Initialize the Service Context

In the context of LlamaIndex, ServiceContext is a utility container that bundles commonly used resources during the indexing and querying stages of a LlamaIndex pipeline or application.

Indexing Stage: This is the first step in the process where the chatbot creates a searchable knowledge base. The data collected from various sources are optimized into vector indices. This essentially means that indexing converts data from external sources into embeddings that store its semantic meaning, thereby facilitating the smooth searching of data. In other words, the indexing stage involves efficiently indexing private data into a vector index.

Querying Stage: Once the indexing stage is complete, the chatbot moves to the querying stage. This is where the chatbot interacts with the indexed data to find relevant information based on the user’s query. The chatbot performs a similarity search on the vector index to retrieve the most relevant data.

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model='local'
)

ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model='local'): This line creates an instance of the ServiceContext class with certain default settings. The from_defaults method is used to create a ServiceContext object with default values. If an argument is specified, then the argument value provided for that parameter is used. If an argument is not specified, then the default value is used.

chunk_size=1024: This sets the chunk size to 1024. Documents are split into chunks of up to 1024 tokens before they are embedded and indexed, and these chunks are the units retrieved at query time.

llm=llm: This sets the Large Language Model (LLM) that the service will use. In this case, it’s the llm object that was defined earlier in the code.

embed_model='local': This sets the embedding model to 'local', which tells LlamaIndex to run a local Hugging Face embedding model instead of calling a remote API. The embedding model converts text into numerical vectors, which are what the index compares during similarity search.
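
If you would rather not pass service_context into every call, the version of llama_index used here also lets you register it globally. This optional step is equivalent to passing it explicitly:

from llama_index import set_global_service_context

# All subsequent index and query calls will pick up this context automatically
set_global_service_context(service_context)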

8. Initialize the Vector Store Index

Vector Store is a type of index that stores data as vector embeddings. These vector embeddings are numerical representations of the data that capture their semantic meaning. This allows for efficient similarity searches, where the most similar items to a given query are retrieved.

LlamaIndex offers the flexibility to store these vector embeddings either locally or in a purpose-built vector database like Milvus. When queried, LlamaIndex finds the top_k most similar nodes in the vector store and returns them to the response synthesizer.

In summary, a Vector Store in LlamaIndex is a powerful tool that enables efficient storage and retrieval of data in the form of vector embeddings, facilitating semantic similarity searches.

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

The provided Python code is using the VectorStoreIndex.from_documents() function from the LlamaIndex library to create a vector store index from a list of documents.

VectorStoreIndex.from_documents(documents, service_context = service_context): This line is calling the from_documents() method of the VectorStoreIndex class. This method takes a list of documents and a ServiceContext as arguments.

documents: This is a list of documents that you want to index. Each document in this list will be converted into a vector embedding and added to the index.

service_context=service_context: This is passing the ServiceContext that was defined earlier in the code. The ServiceContext contains settings and resources that are used during the indexing process.
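
Building the index re-embeds every document, which takes time on each new Colab session. As an optional extra, you can persist the index to disk and reload it later; the ./storage path below is only an example:

from llama_index import StorageContext, load_index_from_storage

# Save the vector index (including its embeddings) to disk
index.storage_context.persist(persist_dir="./storage")

# Later, reload it without re-embedding the documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)

Keep in mind that files written to the Colab filesystem disappear when the session ends, so save the storage folder to Google Drive if you want to reuse it.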

9. Initialize the Query Engine

QueryEngine is a high-level component that combines a Retriever and a ResponseSynthesizer into a pipeline.

Retriever: This component fetches nodes (or data points) based on the query string.

ResponseSynthesizer: This component takes the nodes retrieved by the Retriever and sends them to the Large Language Model (LLM) to generate a response.

query_engine = index.as_query_engine(streaming=True)

index.as_query_engine(streaming=True): This line is calling the as_query_engine() method on the index object. This method transforms the index into a QueryEngine.

streaming=True: This argument enables streaming mode. When streaming mode is enabled, the QueryEngine returns a StreamingResponse object instead of a regular Response object. This allows you to process the response as it’s being generated, which can be useful for large responses that take a long time to generate.

query_engine = : This is assigning the QueryEngine returned by as_query_engine() to the variable query_engine.
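
as_query_engine also accepts other useful arguments. For example, similarity_top_k controls how many chunks the retriever fetches per query; the value 3 below is just an illustrative choice, and omitting streaming=True gives you a regular, non-streaming response:

# A non-streaming engine that retrieves the 3 most similar chunks for each query
query_engine_topk = index.as_query_engine(similarity_top_k=3)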

10. Validate the Response

Let’s evaluate our model’s output. I’ve placed the research paper titled “Attention is All You Need” in the ‘data’ folder. Now, let’s pose a query to the model.

response_stream = query_engine.query("explain about cross attention?")
response_stream.print_response_stream()

The response streams into the output cell token by token, drawing its explanation of cross-attention from the content of the paper.
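
Because RAG grounds the answer in retrieved chunks, you can also check which parts of the PDF the response was based on. This is a minimal sketch assuming the llama_index streaming response interface, where the retrieved nodes are exposed as source_nodes:

# Show where each retrieved chunk came from and how similar it was to the query
for src in response_stream.source_nodes:
    print(src.score, src.node.metadata.get("file_name"), src.node.get_text()[:120])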

In conclusion, the article provides a guide on how to build a personal RAG Chatbot using LlamaIndex and Open LLMs. This tool can revolutionize your data interactions in a way that’s simple, accessible, and tailored to your needs.

See you later, Thanks for reading 😃.

Written by Sabeerali

Passionate about Machine Learning, Front-end and Back-end Development
