
Tools to go from prototype to production

The Quick-start Guide Isn’t Enough

“Retrieval augmented generation is the process of supplementing a user’s input to a large language model (LLM) like ChatGPT with additional information that you (the system) have retrieved from somewhere else. The LLM can then use that information to augment the response that it generates.” — Cory Zue

LLMs are an amazing invention, but they are prone to one key issue: they make stuff up. RAG makes LLMs far more useful by giving them factual context to use while answering queries.

Using the quick-start guide to a framework like LangChain or LlamaIndex, anyone can build a simple RAG system, like a chatbot for your docs, with about five lines of code.

But the bot built with those five lines of code isn’t going to work very well. RAG is easy to prototype but very hard to productionize, i.e. to get to a point where users would be happy with it. A basic tutorial might get RAG working at 80%, but bridging the next 20% often takes some serious experimentation. Best practices are yet to be ironed out and can vary depending on use case. Still, figuring out the best practices is well worth our time, because RAG is probably the single most effective way to use LLMs.

This post will survey strategies for improving the quality of RAG systems. It’s tailored for those building with RAG who want to bridge the gap between basic setups and production-level performance. For the purposes of this post, improving means increasing the proportion of queries for which the system: 1. finds the proper context and 2. generates an appropriate response. I will assume the reader already has an understanding of how RAG works. If not, I’d suggest reading this article by Cory Zue for a good introduction. I will also assume some basic familiarity with the common frameworks used to build these tools: LangChain and LlamaIndex. However, the ideas discussed here are framework-agnostic.

I won’t dive into the details of exactly how to implement each strategy I cover, but rather I will try to give an idea of when and why it might be useful. Given how fast the space is moving, it is impossible to provide an exhaustive, or perfectly up-to-date, list of best practices. Instead, I aim to outline some things you might consider and try when attempting to improve your retrieval augmented generation application.

10 Ways to Improve the Performance of Retrieval Augmented Generation

1. Clean your data.

RAG connects the capabilities of an LLM to your data. If your data is confusing, in substance or layout, then your system will suffer. If you’re using data with conflicting or redundant information, your retrieval will struggle to find the right context. And when it does, the generation step performed by the LLM may be suboptimal. Say you’re building a chatbot for your startup’s help docs and you find it is not working well. The first thing you should take a look at is the data you are feeding into the system. Are topics broken out logically? Are topics covered in one place or many separate places? If you, as a human, can’t easily tell which document you would need to look at to answer common queries, your retrieval system won’t be able to either.

This process can be as simple as manually combining documents on the same topic, but you can take it further. One of the more creative approaches I’ve seen is to use the LLM to create summaries of all the documents provided as context. The retrieval step can then first run a search over these summaries, and dive into the details only when necessary. Some frameworks even have this as a built-in abstraction.
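To make the summary-first idea concrete, here is a minimal sketch rather than any framework's actual abstraction. It assumes each document already has a summary (for example, one written by an LLM ahead of time), and it uses a toy word-overlap score in place of a real search. The names `overlap` and `summary_first_search` are hypothetical.

```python
def overlap(query, text):
    # Toy relevance score: number of distinct query words that appear in the text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split()))


def summary_first_search(query, documents, score=overlap):
    """Pick the best document by its summary, then the best chunk inside it.

    documents: list of dicts with 'summary' (str) and 'chunks' (list of str).
    The summaries are assumed to exist already; this sketch does not create them.
    """
    best_doc = max(documents, key=lambda d: score(query, d["summary"]))
    return max(best_doc["chunks"], key=lambda c: score(query, c))
```

In a real system the `score` function would be an embedding or keyword search, but the two-step shape (summaries first, details second) stays the same.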

2. Explore different index types.

The index is the core pillar of LlamaIndex and LangChain. It is the object that holds your retrieval system. The standard approach to RAG involves embeddings and similarity search: chunk up the context data, embed everything, and when a query comes in, find similar pieces from the context. This works very well, but isn’t the best approach for every use case. Will queries relate to specific items, such as products in an e-commerce store? You may want to explore keyword-based search. It doesn’t have to be one or the other; many applications use a hybrid. For example, you may use a keyword-based index for queries relating to a specific product, but rely on embeddings for general customer support.
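A hybrid can be as simple as blending two scores. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `alpha` weight and the function names are assumptions, not part of LlamaIndex or LangChain.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and strip basic punctuation so "warranty." matches "warranty".
    return [t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")]


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query, doc):
    # Fraction of query terms that appear verbatim in the document.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def hybrid_search(query, docs, alpha=0.5, top_k=2):
    """Blend similarity and keyword scores; alpha=1.0 means similarity only."""
    q_vec = embed(query)
    scored = [
        (alpha * cosine(q_vec, embed(d)) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]
```

With a real embedding model the two scores diverge much more than they do in this toy version, which is exactly when tuning `alpha` per use case becomes worthwhile.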

3. Experiment with your chunking approach.

Chunking up the context data is a core part of building a RAG system. Frameworks abstract away the chunking process and allow you to get away without thinking about it. But you should think about it. Chunk size matters, and you should explore what works best for your application. In general, smaller chunks often improve retrieval but may cause generation to suffer from a lack of surrounding context. There are a lot of ways you can approach chunking. The one thing that doesn’t work is approaching it blindly. This post from Pinecone lays out some strategies to consider. I approached this by running an experiment: I had a test set of questions, and I ran the full set once each with a small, medium, and large chunk size. For my application, small turned out to be best.
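An experiment like this could be sketched as follows. This is not the setup used above: the word-based chunker, the word-overlap retriever, and the names `chunk_text`, `best_chunk`, and `compare_chunk_sizes` are all assumptions made to keep the example self-contained.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not a pure repeat of the previous one.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def best_chunk(query, chunks):
    # Toy retriever: the chunk sharing the most words with the query.
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))


def compare_chunk_sizes(text, test_set, sizes):
    """test_set: list of (question, substring expected in the retrieved chunk).

    Returns {chunk_size: fraction of questions whose top chunk contains the answer}.
    """
    results = {}
    for size in sizes:
        chunks = chunk_text(text, size)
        hits = sum(expected in best_chunk(question, chunks) for question, expected in test_set)
        results[size] = hits / len(test_set)
    return results
```

Note that this only measures retrieval. To capture the trade-off with generation quality, you would also need to score the final answers for each chunk size.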

4. Play around with your base prompt.

One example of a base prompt used in LlamaIndex is:

‘Context information is below. Given the context information and not prior knowledge, answer the query.’

You can overwrite this and experiment with other options. You can even hack the RAG such that you do allow the LLM to rely on its own knowledge if it can’t find a good answer in the context. You may also adjust the prompt to help steer the types of queries it accepts, for example, instructing it to respond a certain way for subjective questions. At a minimum, it’s helpful to overwrite the prompt such that the LLM has context on what job it’s doing. For example:

‘You are a customer support agent. You are designed to be as helpful as possible while providing only factual information. You should be friendly, but not overly chatty. Context information is below. Given the context information and not prior knowledge, answer the query.’
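If you want to experiment with prompts outside a framework, a small template function is enough. The sketch below reuses the wording quoted above; the layout (separator lines, the "Query:" and "Answer:" labels) and the names `DEFAULT_TEMPLATE` and `build_prompt` are assumptions for illustration, not LlamaIndex's exact template.

```python
DEFAULT_TEMPLATE = (
    "{role}Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query}\n"
    "Answer: "
)


def build_prompt(context_chunks, query, role=""):
    """Fill the template with retrieved chunks, the user query, and an optional role."""
    role_prefix = role.strip() + " " if role.strip() else ""
    return DEFAULT_TEMPLATE.format(
        role=role_prefix,
        context="\n\n".join(context_chunks),
        query=query,
    )
```

Keeping the role text as a separate argument makes it easy to try several variants against the same test questions.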

5. Try meta-data filtering.

A very effective strategy for improving retrieval is to add meta-data to your chunks, and then use it to help process results. Date is a common meta-data tag to add because it allows you to filter by recency. Imagine you are building an app that allows users to query their email history. It’s likely that more recent emails will be more relevant. But we don’t know that they’ll be the most similar, from an embedding standpoint, to the user’s query. This brings up a general concept to keep in mind when building RAG: similar ≠ relevant. You can append the date of each email to its meta-data and then prioritize the most recent context during retrieval. LlamaIndex has a built-in class of Node Post-Processors that help with exactly this.
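For the email example, a recency-aware post-processing step might look like the sketch below. It is not LlamaIndex's Node Post-Processor API; the result format, the exponential decay, and the `half_life_days` and `weight` parameters are assumptions chosen for illustration.

```python
from datetime import date


def filter_since(results, cutoff):
    # Hard filter: keep only results dated on or after the cutoff.
    return [r for r in results if r["date"] >= cutoff]


def recency_rerank(results, today, half_life_days=30.0, weight=0.5):
    """Reorder results by a blend of similarity and recency.

    results: list of dicts with 'text', 'similarity' (0..1), and 'date' (datetime.date).
    The recency score halves every `half_life_days`; `weight` sets how much it counts.
    """
    def score(r):
        age_days = max((today - r["date"]).days, 0)
        recency = 0.5 ** (age_days / half_life_days)
        return (1 - weight) * r["similarity"] + weight * recency

    return sorted(results, key=score, reverse=True)
```

A hard filter suits questions like "emails from last week", while the blended score suits cases where recency is a preference rather than a requirement.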

6. Use query routing.

It’s often useful to have more than one index. You then route queries to the appropriate index when they come in. For example, you may have one index that handles summarization questions, another that handles pointed questions, and another that works well for date-sensitive questions. If you try to optimize one index for all of these behaviors, you’ll end up compromising how well it does at all of them. Instead, you can route the query to the proper index. Another use case would be to direct some queries to a keyword-based index as discussed in section 2.

Once you have constructed your indexes, you just have to define in text what each should be used for. Then at query time, the LLM will choose the appropriate option. Both LlamaIndex and LangChain have tools for this.
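Stripped of any framework, the routing step looks roughly like this. The sketch assumes `choose` is any callable that sends a prompt to an LLM and returns its text reply; `build_router_prompt` and `route` are hypothetical names, and the prompt wording is only one possibility.

```python
def build_router_prompt(query, descriptions):
    # List each index with its plain-text description so the LLM can pick one.
    options = "\n".join(f"- {name}: {desc}" for name, desc in descriptions.items())
    return (
        "Choose the single best index for the query below. "
        "Reply with the index name only.\n"
        f"{options}\nQuery: {query}\nIndex: "
    )


def route(query, indexes, choose, default):
    """Send the query to the index the LLM picks.

    indexes: dict of name -> (description, query_fn).
    choose: callable(prompt) -> str, assumed to wrap an LLM call.
    default: index name used when the reply is not a known index.
    """
    descriptions = {name: desc for name, (desc, _) in indexes.items()}
    choice = choose(build_router_prompt(query, descriptions)).strip()
    if choice not in indexes:
        choice = default
    _, query_fn = indexes[choice]
    return choice, query_fn(query)
```

The fallback to `default` matters in practice, since an LLM will occasionally reply with something that is not one of the listed names.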

7. Look into reranking.

Reranking is one solution to the issue of discrepancy between similarity and relevance. With reranking, your retrieval system gets the top nodes for context as usual. It then re-ranks them based on relevance. Cohere’s Rerank model is commonly used for this. This strategy is one I see experts recommend often. No matter the use case, if you’re building with RAG, you should experiment with reranking and see if it improves your system. Both LangChain and LlamaIndex have abstractions that make it easy to set up.
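The two-stage shape of reranking can be sketched in a few lines. Here `similarity` and `relevance` are assumed to be callables you supply (for instance, an embedding score and a call to a reranking model); the function name and the `fetch_k` and `top_n` parameters are hypothetical.

```python
def retrieve_then_rerank(query, docs, similarity, relevance, fetch_k=10, top_n=3):
    """Stage 1: take the fetch_k most similar docs. Stage 2: reorder them by relevance.

    similarity, relevance: callables (query, doc) -> float, higher is better.
    The relevance scorer is usually slower, so it only sees the stage-1 candidates.
    """
    candidates = sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: relevance(query, d), reverse=True)[:top_n]
```

Fetching more candidates than you finally keep (`fetch_k` larger than `top_n`) is what gives the reranker room to promote a relevant but less similar chunk.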

8. Consider query transformations.

You already alter your user’s query by placing it within your base prompt. It can make sense to alter it even further. Here are a few examples:

Rephrasing: if your system doesn’t find relevant context for the query, you can have the LLM rephrase the query and try again. Two questions that seem the same to humans don’t always look that similar in embedding space.

HyDE: HyDE (Hypothetical Document Embeddings) is a strategy which takes a query, generates a hypothetical response, and then uses both for embedding lookup. Researchers have found this can dramatically improve performance.

Sub-queries: LLMs tend to work better when they break down complex queries. You can build this into your RAG system such that a query is decomposed into multiple questions.

LlamaIndex has docs covering these types of query transformations.
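The three transformations above can each be sketched as a thin wrapper around two callables. This is an illustration under assumptions, not LlamaIndex's implementation: `llm(prompt)` is assumed to return text, `search(text)` is assumed to return a list of `(score, chunk)` pairs with the best first, and the prompt wording and `min_score` threshold are made up for the example.

```python
def search_with_rephrase(query, llm, search, min_score=0.5, max_attempts=3):
    """Retry with an LLM-rephrased query while the top result scores too low."""
    attempt, results = query, []
    for i in range(max_attempts):
        results = search(attempt)
        if results and results[0][0] >= min_score:
            break
        if i < max_attempts - 1:
            attempt = llm(f"Rephrase this question using different wording: {attempt}")
    return attempt, results


def hyde_search(query, llm, search):
    """Search with the query plus a hypothetical answer generated by the LLM."""
    hypothetical = llm(f"Write a short passage that answers the question: {query}")
    # Concatenating the two texts is one simple way to use both for lookup.
    return search(f"{query}\n{hypothetical}")


def search_sub_queries(query, llm, search):
    """Ask the LLM for sub-questions (one per line) and search each separately."""
    reply = llm(f"Break this question into simpler sub-questions, one per line: {query}")
    sub_queries = [line.strip() for line in reply.splitlines() if line.strip()]
    return {sq: search(sq) for sq in sub_queries}
```

Each wrapper adds at least one extra LLM call per query, so it is worth measuring whether the retrieval gain justifies the added latency and cost.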

9. Fine-tune your embedding model.

Embedding-based similarity is the standard retrieval mechanism for RAG. Your data is broken up and embedded inside the index. When a query comes in, it is also embedded for comparison against the embeddings in the index. But what is doing the embedding? Usually, a pre-trained model such as OpenAI’s text-embedding-ada-002.

The issue is, the pre-trained model’s concept of what is similar in embedding space may not align very well with what is similar in your context. Imagine you are working with legal documents. You would like your embedding to base its judgement of similarity more on your domain specific terms like “intellectual property” or “breach of contract” and less on general terms like “hereby” and “agreement.”

You can fine-tune your embedding model to resolve this issue. Doing so can boost your retrieval metrics by 5–10%. This requires a bit more effort, but can make a significant difference in your retrieval performance. The process is easier than you might think, as LlamaIndex can help you generate a training set. For more information, you can check out this post by Jerry Liu on how LlamaIndex approaches fine-tuning embeddings, or this post which walks through the process of fine-tuning.
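To check whether a fine-tuned model actually moves your retrieval metrics, you can run the same test set against both models. The sketch below computes two common metrics, hit rate and mean reciprocal rank. It assumes `retrieve(query)` returns a ranked list of document ids; the function name and the test-set format are hypothetical.

```python
def evaluate_retrieval(test_set, retrieve, k=5):
    """Compute hit rate and mean reciprocal rank (MRR) over the top k results.

    test_set: list of (query, expected_doc_id).
    retrieve: callable(query) -> list of doc ids, best first.
    """
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits, reciprocal_ranks = 0, 0.0
    for query, expected in test_set:
        ranked = list(retrieve(query))[:k]
        if expected in ranked:
            hits += 1
            reciprocal_ranks += 1.0 / (ranked.index(expected) + 1)
    n = len(test_set)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}
```

Running this once with the pre-trained model and once with the fine-tuned one, on questions that were not used for training, gives a like-for-like comparison.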

10. Start using LLM dev tools.

You’re likely already using LlamaIndex or LangChain to build your system. Both frameworks have helpful debugging tools which allow you to define callbacks, see what context is used, what document your retrieval comes from, and more.

If you’re finding that the tools built into these frameworks are lacking, there is a growing ecosystem of tools which can help you dive into the inner workings of your RAG system. Arize AI has an in-notebook tool that allows you to explore which context is being retrieved and why. Rivet is a tool which provides a visual interface for helping you build complex agents. It was just open-sourced by the legal technology company Ironclad. New tools are constantly being released and it’s worth experimenting to see which are helpful in your workflow.


Building with RAG can be frustrating because it’s so easy to get working and so hard to get working well. I hope the strategies above can provide some inspiration for how you might bridge the gap. No one of these ideas works all the time and the process is one of experimentation, trial, and error. In this post, I didn’t dive into evaluation, i.e. how you can measure the performance of your system. Evaluation is more of an art than a science at the moment, but it’s important to set up some type of system that you can consistently check in on. This is the only way to tell if the changes you are implementing make a difference. I wrote about how to evaluate RAG systems previously. For more information, you can explore LlamaIndex Evals, LangChain Evals, and a really promising new framework called RAGAS.

10 Ways to Improve the Performance of Retrieval Augmented Generation Systems was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
