import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
1 Introduction
In this article we look at how to convert documents into embeddings and store them in vector stores, an important step in making content available to Large Language Models.
2 Vectorstores and Embeddings
Once our document has been divided into manageable, semantically meaningful chunks, we need to index those chunks so we can quickly retrieve them when we need to answer questions about this corpus of information. We’ll use embeddings and vector stores to accomplish this. Let’s find out what they are.
First off, these are crucial for creating chatbots over your data. Second, we’ll delve a little deeper and discuss edge cases where this general approach can really fall short.
Recall the overall workflow for retrieval augmented generation (RAG):
3 Load Libs & Setup
A few documents will be loaded at this point. After the documents have loaded, chunks are made using the recursive character text splitter; we end up with more than 200 distinct chunks. Embeddings for these chunks will be produced using OpenAI.
We discussed Document Loading and Splitting in a previous article.
from langchain.document_loaders import PyPDFLoader

# Load PDFs
# Duplicate documents on purpose - messy data
loaders = [
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
len(splits)
209
4 Embeddings
What exactly are embeddings? An embedding is a numerical representation (a vector) created from a piece of text. Texts with similar content will have similar vectors in this numerical space, so by comparing those vectors we can identify text passages that are alike. In the example below, the two sentences about pets are quite similar to each other, but much less similar to a sentence about the weather.
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
import numpy as np
np.dot(embedding1, embedding2)
0.9631853877103518
np.dot(embedding1, embedding3)
0.7709997651294672
np.dot(embedding2, embedding3)
0.7596334120325523
Recalling the entire end-to-end workflow: we begin with documents, divide them into smaller chunks, create embeddings for those chunks, and then store everything in a vector store. A vector store is a database in which you can quickly look up similar vectors later on. This is helpful when we are looking for documents relevant to the question at hand: we embed the question, compare that embedding to every vector in the store, and pick the ones most similar to it.

Then, after selecting the n chunks that are most similar, we submit the query and those chunks to an LLM to receive an answer. We’ll talk more about all of that later; for the time being, it’s time to focus on vector stores and embeddings themselves.
Here we can see that the first two embeddings have a relatively high score of 0.96. Comparing the first embedding to the third gives a substantially lower score of 0.77, and comparing the second to the third gives roughly the same value, 0.76.
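The retrieval step described above can be sketched without any framework at all. The snippet below is a minimal illustration, using made-up 3-dimensional vectors standing in for real embeddings (which have on the order of 1,500 dimensions), of how a vector store ranks stored chunks by dot-product similarity against a query embedding:

```python
import numpy as np

# Toy "embeddings" for four stored chunks (hypothetical values for illustration)
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],   # chunk about dogs
    [0.8, 0.2, 0.1],   # chunk about canines
    [0.1, 0.9, 0.2],   # chunk about weather
    [0.2, 0.8, 0.1],   # chunk about climate
])
query_vector = np.array([0.85, 0.15, 0.05])  # embedding of "i like dogs"

# Score every stored vector against the query, then take the top k
scores = chunk_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]
print(top_k.tolist())  # indices of the two most similar chunks: [0, 1]
```

The two pet-related chunks win, just as the dot products of the real sentence embeddings above would suggest; a vector store does essentially this, with an index that avoids scoring every vector on large collections.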
5 Vectorstores
It’s time to build embeddings for every PDF chunk of our example documents and then keep them all together in a vector store.

We’ll utilise Chroma as our vector store for this, so let’s import it. LangChain has integrations with a large number of vector stores—more than 30 in total. We select Chroma because it is lightweight and in-memory, making it simple to set up and operate. When you need to persist huge volumes of data, or persist it in a cloud storage location, other vector stores provide hosted solutions.

So, let’s create a variable named persist_directory, pointing at docs/chroma, that we will utilise later on. Additionally, let’s check that nothing is already present there; leftover material can throw things off, and we don’t want that to happen. To make sure the directory is empty, we run rm -rf docs/chroma. Now let’s build the vector store by calling Chroma.from_documents, passing in the splits we built earlier along with the embedding.
# ! pip install chromadb
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'

!rm -rf ./docs/chroma  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
print(vectordb._collection.count())
209
The embedding here is the OpenAI embedding model. We can persist the database to disk by supplying the persist_directory keyword argument, which is specific to Chroma. After performing this, we can see that the collection count is 209, exactly the same as the number of splits we had previously.
6 Semantic Similarity Search
Let’s come up with a query we can use to test this data: is there an email address we can contact if we need assistance with the course, the readings, or anything else of that nature? We answer it with the similarity search method, passing k=3 to specify the number of documents we want returned. If we run it and check the length of the result, we can see it is three, as specified.
question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question,k=3)
len(docs)
3
docs[0].page_content
"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions. \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly. \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup. \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form study groups \nand sort of have a group of people to study with and have a group of your fellow students \nto talk over these concepts with. You can also post on the class news group if you want to \nuse that to try to form a study group. \nBut some of the problems sets in this cla ss are reasonably difficult. People that have \ntaken the class before may tell you they were very difficult. And just I bet it would be \nmore fun for you, and you'd probably have a be tter learning experience if you form a"
vectordb.persist()
Looking at the first document’s text reveals that it refers to the email address cs229-qa@cs.stanford.edu, an account read by all the TAs, to which we can send inquiries.
After that, let’s make sure to execute vectordb.persist() to save the vector database so we can utilise it later. This has covered the fundamentals of semantic search and demonstrated that embeddings alone can yield good results. But the approach isn’t flawless, and in the next section we’ll discuss a few edge cases and demonstrate how it can go wrong.
7 Failure modes
This seems great, and basic similarity search will get you 80% of the way there very easily, but there are some failure modes that can creep in. Here are some edge cases that can arise.
Let’s try a different query: what did they say about MATLAB? Let’s run this with k=5 and see what happens. Looking at the results, the first two are actually identical. This is because, as you may recall, we purposely loaded a duplicate PDF. This is problematic since we will later pass both of these chunks to the language model, and the same information appears twice. The second chunk adds no real value, and it would be much better if the language model could instead receive a different, more distinct piece of data.
question = "what did they say about matlab?"

docs = vectordb.similarity_search(question,k=5)
Notice that we’re getting duplicate chunks (because of the duplicate MachineLearning-Lecture01.pdf in the index). Semantic search fetches all similar documents, but does not enforce diversity: docs[0] and docs[1] are identical.
docs[0]
Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn\'t. \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms. \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of this class, it will work for just about \neverything. \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8})
docs[1]
Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn\'t. \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms. \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of this class, it will work for just about \neverything. \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8})
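One simple mitigation, sketched below in plain Python with hypothetical chunks (this is not the LangChain API, just the underlying idea), is to drop exact-duplicate chunks before indexing them, keeping the first occurrence of each distinct page content:

```python
# Hypothetical chunks; two share identical text, as with our duplicate PDF
chunks = [
    {"page_content": "MATLAB is easy to learn.", "source": "Lecture01.pdf"},
    {"page_content": "MATLAB is easy to learn.", "source": "Lecture01.pdf"},
    {"page_content": "Octave is a free alternative.", "source": "Lecture01.pdf"},
]

def deduplicate(chunks):
    """Keep only the first occurrence of each distinct page_content."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk["page_content"] not in seen:
            seen.add(chunk["page_content"])
            unique.append(chunk)
    return unique

print(len(deduplicate(chunks)))  # → 2
```

Deduplicating at index time only catches exact copies; enforcing diversity among near-duplicates at query time is a separate technique, which we’ll return to later.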
Another type of failure mode is also conceivable. Our new query is: what was said about regression in the third lecture? Intuitively, we would anticipate that all of the returned documents come from the third lecture.

The metadata recording which lecture each chunk was taken from allows us to verify this. So let’s iterate through each result and print its metadata. We can see that the results are actually a mix of chunks from the first, second, and third lectures.
question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(question,k=5)
for doc in docs:
print(doc.metadata)
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8}
print(docs[4].page_content)
into his office and he said, "Oh, professo r, professor, thank you so much for your
machine learning class. I learned so much from it. There's this stuff that I learned in your
class, and I now use every day. And it's help ed me make lots of money, and here's a
picture of my big house."
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this
machine learning stuff was actually useful. So what was it that you learned? Was it
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard,
and we'll actually have a short MATLAB tutori al in one of the discussion sections for
those of you that don't know it.
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion
sections will be taught by the TAs, and atte ndance at discussion sections is optional,
although they'll also be recorded and televi sed. And we'll use the discussion sections
mainly for two things. For the next two or th ree weeks, we'll use the discussion sections
to go over the prerequisites to this class or if some of you haven't seen probability or
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a
refresher for those of you that want one.
The phrase “third lecture” and the fact that we only want documents from the third lecture are both pieces of structured information, but we’re only using embeddings to perform a semantic lookup, which embeds the entire sentence and is probably more focused on regression. As a result, we are receiving results that are presumably quite relevant to regression; if we look at the fifth document, the one from the first lecture, we can see that regression is in fact mentioned there.

The semantic embedding is picking up on regression, but since “the third lecture” is a piece of structured information that isn’t really represented in the embedding, the search isn’t picking up on the fact that it should only return documents from the third lecture.
This is a case where we might actually want to do some kind of pre-filtering, for example restricting the search to embeddings from the third lecture’s document only. This is possible using richer metadata and indexes over that metadata, which I will look at in the next article.
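As a rough illustration of the idea (plain Python over hypothetical scored results, not the actual vector-store API), pre-filtering simply restricts the candidate set by metadata before similarity ranking is applied:

```python
# Hypothetical search candidates: (similarity score, metadata) pairs
candidates = [
    (0.92, {"source": "docs/MachineLearning-Lecture03.pdf", "page": 0}),
    (0.90, {"source": "docs/MachineLearning-Lecture02.pdf", "page": 0}),
    (0.88, {"source": "docs/MachineLearning-Lecture03.pdf", "page": 14}),
    (0.85, {"source": "docs/MachineLearning-Lecture01.pdf", "page": 8}),
]

def prefiltered_top_k(candidates, source, k):
    """Keep only chunks from the requested source, then rank by score."""
    filtered = [c for c in candidates if c[1]["source"] == source]
    return sorted(filtered, key=lambda c: c[0], reverse=True)[:k]

results = prefiltered_top_k(candidates, "docs/MachineLearning-Lecture03.pdf", k=5)
print([meta["page"] for score, meta in results])  # → [0, 14]
```

With the filter applied, only pages from the third lecture survive, regardless of how semantically similar the other lectures’ chunks are to the query.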
8 Acknowledgements
I’d like to express my thanks to the wonderful LangChain: Chat with Your Data course by DeepLearning.AI and LangChain, which I completed, and to acknowledge the use of some images and other materials from the course in this article.