from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file
# We have loaded the environment vars using a .env file and have assigned os.environ["ACTIVELOOP_TOKEN"]
1 Introduction
Activeloop Deep Lake stores embeddings and their corresponding metadata for LLM applications. It supports hybrid searches over these embeddings and their attributes for efficient data retrieval. It also integrates with LangChain and its agents, which simplifies developing and deploying applications.
2 Deep Lake vs Other Vector Stores
Deep Lake provides several advantages over the typical vector store:
- It’s multimodal, which means that it can be used to store items of diverse modalities, such as texts, images, audio, and video, along with their vector representations.
- It’s serverless, which means that we can create and manage cloud datasets without creating and managing a database instance. This removes setup work and helps new projects get started quickly.
- Lastly, a data loader can easily be created from the data stored in a Deep Lake dataset, which is convenient for fine-tuning machine learning models with common frameworks like PyTorch and TensorFlow.
To use Deep Lake, you first have to register on the Activeloop website and create your API token. Here are the steps:
- Sign up for an account on Activeloop’s platform. You can sign up at Activeloop’s website. After specifying your username, click on the “Sign up” button. You should now see your homepage.
- You should now see a “Create API token” button at the top of your homepage. Click on it, and you’ll get redirected to the “API tokens” page. This is where you can generate, manage, and revoke your API keys for accessing Deep Lake.
- Click on the “Create API token” button. Then, you should see a popup asking for a token name and an expiration date. By default, the token expiration date is set so that the token expires after one day from its creation, but you can set it further in the future if you want to keep using the same token for the whole duration of the course. Once you’ve set the token name and its expiration date, click on the “Create API token” button.
- You should now see a green banner saying that the token has been successfully generated, along with your new API token, on the “API tokens” page. To copy your token to your clipboard, click on the square icon on its right.
Now that you have your API token, you can store it in the ACTIVELOOP_TOKEN environment variable so that the Deep Lake libraries can retrieve it automatically whenever needed.
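As a minimal sketch of that setup, the snippet below checks that the token is present before any Deep Lake call is made, so a missing variable fails early with a readable message. The helper name `require_env` is hypothetical and not part of Deep Lake or LangChain; it only reads `os.environ`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value


# Example (hypothetical usage): call this once after load_dotenv(...)
# token = require_env("ACTIVELOOP_TOKEN")
```

Calling the helper right after the `.env` file is loaded makes a missing or empty token visible before the first dataset operation.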
Let’s demonstrate how it can be used.
3 Import Libs & Setup
4 Basic Deep Lake Demo
Let’s demonstrate how to use the Deep Lake vector store. We will use LangChain together with an OpenAI GPT-3.5 model as our LLM stack. We will set up a simple vector store with some birthdays, create an LLM-based agent, and then ask a question about one of the birthdays, which will require the agent to look up the details in Deep Lake.
Let’s first set up the Deep Lake vector store and LLM.
# instantiate the LLM and embeddings models
llm = OpenAI(model="text-davinci-003", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# Create Deep Lake dataset
# Use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "pranath"
my_activeloop_dataset_name = "langchain_course_from_zero_to_hero"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://pranath/langchain_course_from_zero_to_hero', tensors=['embedding', 'id', 'metadata', 'text'])
tensor     htype      shape      dtype    compression
---------  ---------  ---------  -------  -----------
embedding  embedding  (2, 1536)  float32  None
id         text       (2, 1)     str      None
metadata   json       (2, 1)     str      None
text       text       (2, 1)     str      None
['d9f49eb8-354b-11ee-9eb0-acde48001122',
'd9f4a034-354b-11ee-9eb0-acde48001122']
Now, let’s create a LangChain RetrievalQA chain:
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)
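To make the `chain_type="stuff"` setting more concrete, here is a minimal sketch of the idea behind it: all retrieved documents are placed ("stuffed") into a single prompt together with the question. The template wording and the function name `build_stuff_prompt` are assumptions made for illustration; LangChain's actual prompt text differs.

```python
from typing import List

# Hypothetical prompt template, for illustration only.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(documents: List[str], question: str) -> str:
    """Concatenate every retrieved document into one prompt for the LLM."""
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
]
prompt = build_stuff_prompt(retrieved, "When was Napoleone born?")
print(prompt)
```

This approach is simple and works well when the retrieved documents are short, as they are here; with many or long documents the combined prompt can exceed the model's context window.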
Next, let’s create an agent that uses the RetrievalQA chain as a tool:
tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
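The role of the tool's `name`, `func`, and `description` can be sketched without any LLM: the descriptions tell the model what each tool is for, and the name the model emits is used to look up and call the matching function. `SimpleTool`, `describe_tools`, `run_tool`, and `fake_retrieval_qa` below are hypothetical stand-ins written for this illustration, not LangChain classes or functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimpleTool:
    """Hypothetical stand-in for an agent tool: a name, a callable, a description."""
    name: str
    func: Callable[[str], str]
    description: str


def describe_tools(tools: List[SimpleTool]) -> str:
    """List tool names and descriptions, as they could be presented to the LLM."""
    return "\n".join(f"{tool.name}: {tool.description}" for tool in tools)


def run_tool(tools: List[SimpleTool], name: str, tool_input: str) -> str:
    """Look up the requested tool by name and call it with the given input."""
    for tool in tools:
        if tool.name == name:
            return tool.func(tool_input)
    raise KeyError(f"Unknown tool: {name}")


def fake_retrieval_qa(question: str) -> str:
    # Canned answer standing in for retrieval_qa.run
    return "Napoleon Bonaparte was born on 15 August 1769."


demo_tools = [
    SimpleTool(
        name="Retrieval QA System",
        func=fake_retrieval_qa,
        description="Useful for answering questions.",
    ),
]
print(describe_tools(demo_tools))
print(run_tool(demo_tools, "Retrieval QA System", "When was Napoleone born?"))
```

Because the model selects tools from their descriptions alone, a more specific description than "Useful for answering questions." may help once several tools are registered.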
Finally, we can use the agent to ask a question:
response = agent.run("When was Napoleone born?")
print(response)
> Entering new chain...
I need to find out when Napoleone was born.
Action: Retrieval QA System
Action Input: When was Napoleone born?
Observation: Napoleon Bonaparte was born on 15 August 1769.
Thought: I now know the final answer.
Final Answer: Napoleon Bonaparte was born on 15 August 1769.
> Finished chain.
Napoleon Bonaparte was born on 15 August 1769.
Here, the agent used the “Retrieval QA System” tool with the query “When was Napoleone born?” which is then run on our new Deep Lake dataset, returning the most similar document (i.e., the document containing the date of birth of Napoleon). This document is eventually used to generate the final output.
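The "most similar document" step can be illustrated with a toy example. The sketch below replaces the real embedding model with simple word counts and ranks documents by cosine similarity; it is an assumption-laden stand-in meant to show the principle, not how Deep Lake or the OpenAI embeddings work internally.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: List[str]) -> str:
    """Return the document whose vector is closest to the query vector."""
    query_vector = embed(query)
    return max(documents, key=lambda doc: cosine_similarity(query_vector, embed(doc)))


documents = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
]
print(most_similar("When was Napoleon born?", documents))
# prints: Napoleon Bonaparte was born in 15 August 1769
```

Word counts only match identical tokens, so this toy version would not connect "Napoleone" with "Napoleon"; learned embeddings are generally far more tolerant of such variations.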
Note that the agent also made use of the ReAct framework for LLM prompt structuring.
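The ReAct structure is visible in the verbose log above: the model alternates between thoughts, actions, and observations until it reaches a final answer. As a rough sketch, the trace can be split into labelled steps with a small parser. The function `parse_react_trace` and its output format are hypothetical and written for this illustration; this is not LangChain's own output parser.

```python
import re
from typing import Dict, List

TRACE = """I need to find out when Napoleone was born.
Action: Retrieval QA System
Action Input: When was Napoleone born?
Observation: Napoleon Bonaparte was born on 15 August 1769.
Thought: I now know the final answer.
Final Answer: Napoleon Bonaparte was born on 15 August 1769."""

LABELS = ("Action Input", "Action", "Observation", "Thought", "Final Answer")
LINE_PATTERN = re.compile(r"^(%s):\s*(.*)$" % "|".join(LABELS))


def parse_react_trace(trace: str) -> List[Dict[str, str]]:
    """Split a ReAct-style trace into a list of {kind, text} steps."""
    steps = []
    for line in trace.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_PATTERN.match(line)
        if match:
            steps.append({"kind": match.group(1), "text": match.group(2)})
        else:
            # Unlabelled lines (such as the first line of the trace) are thoughts
            steps.append({"kind": "Thought", "text": line})
    return steps


for step in parse_react_trace(TRACE):
    print(f"{step['kind']:>12} | {step['text']}")
```

Reading the trace this way makes the loop explicit: a thought leads to an action with an input, the tool returns an observation, and the next thought decides whether to act again or answer.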
This example shows how to use Deep Lake as a vector database and how to build an agent that uses a RetrievalQA chain as a tool to answer queries based on the provided content.
5 Adding More Data and Reloading Deep Lake
Let’s now look at a case where an existing vector store is reloaded and more data is added to it.
We first reload the existing vector store from Deep Lake at its dataset path. After that, we load new text data and split it into chunks. Finally, we add these chunks to the existing dataset, which creates and stores an embedding for each new text segment:
# load the existing Deep Lake dataset and specify the embedding function
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# create new documents
texts = [
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Deep Lake Dataset in hub://pranath/langchain_course_from_zero_to_hero already exists, loading from the storage
Dataset(path='hub://pranath/langchain_course_from_zero_to_hero', tensors=['embedding', 'id', 'metadata', 'text'])
tensor     htype      shape      dtype    compression
---------  ---------  ---------  -------  -----------
embedding  embedding  (4, 1536)  float32  None
id         text       (4, 1)     str      None
metadata   json       (4, 1)     str      None
text       text       (4, 1)     str      None
['b7931762-354d-11ee-9eb0-acde48001122',
'b79318e8-354d-11ee-9eb0-acde48001122']
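The dataset grew from two rows to four because each new sentence is far shorter than `chunk_size=1000`, so each one becomes exactly one chunk. The sketch below shows the basic idea of fixed-size chunking with optional overlap. It is a deliberately simplified assumption, not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and spaces; the function name `split_text` is hypothetical.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Naive character-based chunking with optional overlap between chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


# A short sentence fits in a single chunk
print(split_text("Lady Gaga was born in 28 March 1986", chunk_size=1000))
# Longer text is cut into pieces; the overlap repeats the tail of each chunk
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Overlap is useful for longer documents because a sentence cut at a chunk boundary still appears whole in at least one chunk; for single-sentence inputs like the birthdays here, `chunk_overlap=0` has no effect.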
Then, we replicate our prior agent and pose a query that can only be addressed by the most recent documents added.
# instantiate the wrapper class for GPT3
llm = OpenAI(model="text-davinci-003", temperature=0)

# create a retriever from the db
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever()
)

# instantiate a tool that uses the retriever
tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

# create an agent that uses the tool
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
Let’s now test our agent with a new question.
response = agent.run("When was Michael Jordan born?")
print(response)
> Entering new chain...
I need to find out when Michael Jordan was born.
Action: Retrieval QA System
Action Input: When was Michael Jordan born?
Observation: Michael Jordan was born on 17 February 1963.
Thought: I now know the final answer.
Final Answer: Michael Jordan was born on 17 February 1963.
> Finished chain.
Michael Jordan was born on 17 February 1963.
6 Acknowledgements
I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.