Improving the Semantic Search Tool
2023 has been an exciting year of experimenting with ChatGPT. When ChatGPT first saw the light of day, it impressed many people, but people soon realised that it tends to hallucinate, confidently producing false information.
Hence came the paradigm of Retrieval Augmented Generation (RAG). The core idea is to run a semantic search, then pass the top k results to ChatGPT as context for answering the user’s query. This is very similar to the Question Answering approach used in traditional Natural Language Processing. There has been substantial work on this, but a baseline pipeline can be built with a vector database and a few API calls. Nothing too fancy.
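As a rough sketch of that baseline (the helper names and the tiny in-memory list of chunks are my own placeholders, and the model names are illustrative), the loop is: embed the chunks, embed the query, take the top k by cosine similarity, and pass those chunks to ChatGPT as context.

# Baseline RAG sketch: embed chunks, retrieve the top k by cosine similarity,
# then answer with those chunks as context. Assumes the openai>=1.0 Python client;
# model names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        scored.append((float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c))), chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

def answer(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(top_k_chunks(query, chunks))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the context. If the answer is not there, say 'I don't know'."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content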
However, RAG comes with a significant issue. Providing context helps to ground the LLM’s response in the “Generation” step, but the quality of that response still depends on the accuracy of the “Retrieval” step. If retrieval is not done well, the LLM might give yet another wrong answer. Even with explicit instructions to say “I don’t know” when no answers are found, I have found this to be ineffective at times.
I have a hypothesis on why some retrieval tasks work better than others. When the answer shares words with the question, as in a simple lookup within a wall of text, their embeddings may sit close enough together in the vector space. However, when the answer requires some form of deduction, there is no reason for the question and the answer to lie near each other in that space. But more about that later.
One of the methods I have tested is HyDE, where the user’s query is first sent to the LLM to generate a hypothetical answer, and that answer is then used for the semantic search. As long as the generated answer lies close to the content in the vector store, the top k results will be fairly accurate. This circumvents the issue of questions and answers sitting in different parts of the vector space and has improved my semantic search tools significantly.
Despite its promise, HyDE hinges entirely on the LLM producing a correct answer. Fortunately, I had built a similarity threshold into the semantic search, which means that if the LLM produces an incorrect answer, the search may return no content at all rather than misleading content.
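Here is a minimal sketch of HyDE with such a threshold, reusing the embed() helper and OpenAI client from the earlier sketch. The 0.8 cut-off is an arbitrary value for illustration, not the one I actually use.

# HyDE sketch: ask the LLM for a hypothetical answer, then search with that answer's embedding.
# A similarity threshold drops weak matches so the tool returns nothing rather than noise.
def hyde_search(query: str, chunks: list[str], k: int = 3, threshold: float = 0.8) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Write a short passage that answers: {query}"}],
    )
    hypothetical_answer = resp.choices[0].message.content

    q = embed(hypothetical_answer)  # search with the answer, not the question
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        score = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        if score >= threshold:  # below the threshold, the chunk is discarded
            scored.append((score, chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]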
So what if I drew ideas from Frequently Asked Questions, where a student asks a question and similar questions are shown, along with their associated answers? Could I then do a semantic search over similar questions instead, and return the relevant content?
To understand the complexities behind semantic search, consider one problem people have faced: chunk size, the amount of text embedded in each vector. Too small a chunk means too little context. Too large a chunk means the LLM will struggle to find the right answer, and it certainly costs more. A large chunk also encodes more ideas, which can pull its embedding towards a different region of the vector space.
Here are two examples to illustrate this point (assuming you know or still remember your JC chemistry). The answer to the question “Why are alkanes unreactive?” is around two lines, while “How do I draw the SN1 mechanism?” might take up two pages in the notes. It should be clear that different questions require different chunk sizes to be effective.
Other than separating knowledge by chunk size, one suggestion would be to build a knowledge graph and use relevant sections of that graph (and its subgraphs) as input to the LLM. The issue then is how much time is required to create the knowledge graph, and whether it is static or fluid. Compared to a simple algorithm that breaks the corpus into fixed-size chunks, using a knowledge graph is certainly more complicated.
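For comparison, the simple fixed-size approach really is only a few lines. The chunk size and overlap below are arbitrary illustrative values.

# Naive fixed-size chunking with a small overlap so ideas are less likely to be cut in half.
def fixed_size_chunks(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks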
But other than segmenting knowledge into graphs, lecture notes already contain structured data: headers, sections and subsections. These notes have been crafted by experts and implicitly contain rules about how the content can be grouped. Thus, I hypothesise that extracting content by sections (with their headers) might allow me to encode the knowledge in a more meaningful manner.
So here is the workflow that I have taken to do this with a set of Organic Chemistry lecture notes:
- Parse a document into sections and extract their content
- Get an LLM to think of questions which the section’s content can answer
- Get the embeddings of the questions, and store the question and section pairs in a vector database
- In the semantic search tool, when the user asks a question, the top k similar questions are returned (based on similarity search)
- The relevant section to answer the question is returned to the user
Parse a document into sections and extract their content
With tools like Microsoft Word, which has structured XML data behind it, splitting the content into sections shouldn’t be much of a challenge.
Here’s the catch. Unlike on a website, where h1, h2 and h3 tags are used, many people who create notes in Microsoft Word do not segment their documents properly with heading, subheading and paragraph styles. There is simply no rule that they have to, unlike web developers. It doesn’t help that, for aesthetic reasons, many people end up using random tables or text boxes as section headers. In the sample below, the sections are created using tables, while the subsections are in a different format.
Hence, despite trying many options (such as extracting all the text and segmenting by section numbers), the algorithmic approach did not work. The problem is that there are also numbers within the chapter content, and naively treating numbers as section and subsection markers just does not cut it.
Luckily, AI models such as Nougat (developed by Meta) helped to analyse my file and extract the contents of each segment as Markdown. Although a couple of segments were missing, the result was good enough to use.
Analysis of the example above showed that “##” indicates a section while “###” indicates a subsection.
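With that convention, splitting the Nougat output into sections becomes a small string-handling exercise. Here is a sketch; the dictionary fields are my own naming.

# Split Nougat's Markdown output into sections and subsections using the "##" / "###" headings.
def parse_sections(markdown_text: str) -> list[dict]:
    sections = []
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## ") or line.startswith("### "):
            if current is not None:
                sections.append(current)
            level = "subsection" if line.startswith("### ") else "section"
            current = {"level": level, "header": line.lstrip("# ").strip(), "content": ""}
        elif current is not None:
            current["content"] += line + "\n"
    if current is not None:
        sections.append(current)
    return sections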
Get Large Language Models to think of questions which the section’s content can answer
Once the content is segmented into sections and subsections, the content of each (sub)section is copied into ChatGPT to generate questions. After some experiments, this was the prompt I went with.
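I will not reproduce the exact prompt here, but the shape of the call looks roughly like the sketch below, reusing the OpenAI client from earlier. The prompt wording is purely illustrative.

# Ask the model for questions that a (sub)section can answer.
# The prompt wording here is illustrative; it is not the exact prompt I used.
def generate_questions(section_header: str, section_content: str) -> str:
    prompt = (
        "Here is a section of Organic Chemistry lecture notes.\n"
        f"Header: {section_header}\n"
        f"Content: {section_content}\n\n"
        "List questions that this content answers, together with their answers."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # later refined to return JSON (see the next step)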
Get the embeddings and store the question and section pairs in a vector database
All the questions are then collated. Extracting the questions one by one is a painful process, so I edited the prompt such that it outputs a JSON object. This was the output.
[
{
"question": "What are Arenes also known as?",
"answer": "Arenes are also referred to as aromatic hydrocarbons."
},
{
"question": "What is the structural unit of Arenes?",
"answer": "The structural unit of Arenes is the benzene ring."
}
]
With this output, I can now generate embeddings for all of the questions. Once that is done, I store them in a vector database like Pinecone.
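A sketch of that step, assuming the 2023-era pinecone-client interface (pinecone.init, Index.upsert); the index name, keys and metadata fields are placeholders, and embed() comes from the earlier sketch. For very long sections you would store a section id in the metadata and keep the content elsewhere, since vector metadata has size limits.

# Embed each generated question and upsert it with its source section as metadata.
# Assumes pinecone-client's classic init()/Index interface; names and keys are placeholders.
import pinecone

pinecone.init(api_key="YOUR_PINECONE_KEY", environment="YOUR_ENV")
index = pinecone.Index("lecture-notes-questions")

def index_questions(qa_pairs: list[dict], section_header: str, section_content: str) -> None:
    vectors = []
    for i, pair in enumerate(qa_pairs):
        vectors.append((
            f"{section_header}-{i}",              # vector id
            embed(pair["question"]).tolist(),     # embedding of the question only
            {                                     # the section travels along as metadata
                "question": pair["question"],
                "section_header": section_header,
                "section_content": section_content,
            },
        ))
    index.upsert(vectors=vectors)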
In the semantic search tool, when the user asks a question, the top k similar questions are returned (based on similarity search)
Once the vector database is created, the next few steps are rather straightforward.
When the user sends in a query, the query’s embedding is generated, and the top k nearest neighbours (by cosine similarity) are returned.
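Concretely, reusing the Pinecone index and embed() helper from the previous step:

# Embed the user's query and retrieve the k most similar stored questions.
def search_similar_questions(user_query: str, k: int = 5):
    query_vector = embed(user_query).tolist()
    result = index.query(vector=query_vector, top_k=k, include_metadata=True)
    return result["matches"]  # each match carries its similarity score and the stored metadata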
The relevant section to answer the question is returned to the user
Since each section is stored with its relevant questions as key-value pairs, it is easy to find the section’s content using the questions tagged to it.
The relevant section’s content is then returned to the user as the output. The output can also be passed to an LLM as context.
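Putting the last two steps together, a hypothetical end-to-end helper might look like this: deduplicate the sections that the matched questions point to, then hand them to the LLM as context. The threshold again mirrors the earlier “return nothing rather than noise” behaviour.

# Map the matched questions back to their sections and use those sections as context.
def answer_with_sections(user_query: str, k: int = 5, threshold: float = 0.8) -> str:
    sections = []
    for match in search_similar_questions(user_query, k):
        if match["score"] < threshold:
            continue                              # drop weak matches entirely
        content = match["metadata"]["section_content"]
        if content not in sections:               # several matched questions may share one section
            sections.append(content)
    if not sections:
        return "I don't know."
    context = "\n\n".join(sections)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided lecture-note sections."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
        ],
    )
    return resp.choices[0].message.content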
And here it is, another approach to improve semantic search.
There are many other methods that I will be experimenting with next. Here is a sneak preview of one of the ideas.
Finetuning an Embeddings Model
Since the embedding model is another reason why the retrieval step might fail, it might be useful to fine-tune a better one.
Thus, I can use the question-answer pairs generated above to fine-tune a sentence-transformer embedding model (through contrastive learning) to improve semantic search.
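A sketch using the sentence-transformers library, with MultipleNegativesRankingLoss as one common contrastive objective; the base model, batch size and number of epochs are placeholders rather than tuned values.

# Fine-tune a sentence-transformer on (question, answer) pairs with a contrastive objective,
# so that questions end up closer to the content that answers them.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

qa_pairs = [
    {"question": "What are Arenes also known as?",
     "answer": "Arenes are also referred to as aromatic hydrocarbons."},
    # ... the rest of the generated question-answer pairs
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder base model
train_examples = [InputExample(texts=[p["question"], p["answer"]]) for p in qa_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives, a common contrastive setup

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("finetuned-qa-embeddings")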
Finally, despite the usefulness of semantic search, good search is not simply about returning the top-k responses.
For instance, if a student asks “Please tell me all reactions involving alkenes”, a top-k search returns at most k results. It cannot capture all the reactions unless there happens to be one particular section listing every reaction involving alkenes.
Thus, I believe that the best form of retrieval is one where the content is well curated (both structurally and functionally) and retrieval takes on the form of a hybrid approach.
Search is not an easy problem, and powerful RAG implementations require the retrieval process to be as accurate as possible. I look forward to exploring more of this topic!