Vector search and Retrieval Augmented Generation

This is an overview of vector search and Retrieval Augmented Generation (RAG) in the context of text data, illustrated with worked code examples, and some closing thoughts on their usage in practice.

It is primarily to clarify my understanding and for my future reference, and is likely to be updated with further information over time. It may be of interest to others, but those familiar with vector search and RAG can skip the terminology and examples sections.

Terminology

Vectors

Machine learning requires data to be represented as numbers for processing. A vector is simply a list of numbers, the length of which is referred to as its dimension.

With a large amount of input data, it is generally more useful to work with low dimension vectors. To create these for text, it is normally a two step process:

  1. Tokenization.
  2. Embedding.

Tokenization

Tokenization turns the text into a vector where each number represents a word, a part of a word, or some other token.

The number of possible word parts is the “vocabulary size”.
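
As a rough illustration of the idea (this is not how BERT's WordPiece tokenizer works), the sketch below maps whitespace-separated words to integer IDs. The vocabulary and the `toy_tokenize` helper are made up for this example, and unknown words fall back to an `[UNK]` ID.

```python
# Toy word-level tokenizer. The vocabulary is a made-up example,
# real tokenizers have tens of thousands of entries and split words into parts.
vocab = {"[UNK]": 0, "dogs": 1, "are": 2, "popular": 3, "pets": 4}


def toy_tokenize(text):
    """Map each lowercased, whitespace-separated word to its vocabulary ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


vocab_size = len(vocab)
token_ids = toy_tokenize("Dogs are popular pets")
unknown_ids = toy_tokenize("Cats are popular")  # "cats" is not in the vocabulary
print(token_ids, unknown_ids, vocab_size)
```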

Embedding

Embedding in effect compresses tokenized text into a lower dimension vector, while preserving the features of the data with which you want to work.

Different embeddings can serve quite different purposes, e.g. search or summarization, so it is vital to use the correct embeddings for your purpose.

Embeddings are usually for words or sentences, i.e. shorter extracts of text. Longer pieces of text often have to be broken into “chunks” for embedding.
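
A minimal sketch of chunking, assuming a simple sliding window with a small overlap between neighbouring chunks. The `chunk_tokens` helper and its parameters are hypothetical, and whitespace-separated words stand in for real tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap is one common way to reduce the chance of a relevant passage being cut in half at a chunk boundary.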

Embeddings are created with a language model, which is a type of neural network for Natural Language Processing (NLP). A language model will typically have a number of layers, with the last layers adapted for the specific purpose. The number of learned weights and biases in the neural network (the values attached to the connections between its “nodes”) is referred to as the number of “parameters”.

Vector search (aka semantic search) is a way of finding similar vectors using vector similarity measures such as cosine similarity.
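
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A small pure-Python sketch (the `cosine_similarity` helper is written for this example, libraries provide optimised equivalents):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # close to -1
print(same_direction, orthogonal, opposite)
```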

This only works if, firstly, the vectors / embeddings have been created with a model designed for vector search, and secondly, all vectors have been created with the same model. In a search context, this means an embedding for the search query has to be obtained using the same model as the embeddings for the content being searched (aka the search corpus).

Vectors / embeddings for the search corpus are often stored in chunks (if the input size is e.g. greater than 256 tokens) in some form of vector database.

Vector search uses Approximate Nearest Neighbour (ANN) algorithms such as Hierarchical Navigable Small World (HNSW) to operate at speed and scale.
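
To make clear what is being approximated, here is a sketch of the exact, exhaustive alternative: compare the query against every stored vector and keep the best matches. The 3-dimensional vectors and the `exact_top_k` helper are made up for illustration. An index such as HNSW aims to return much the same answer without visiting every vector.

```python
import math


def normalize(vector):
    """Scale a vector to unit length, so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def exact_top_k(query, corpus, k):
    """Return the k (name, score) pairs most similar to the query, by brute force."""
    q = normalize(query)
    scored = []
    for name, vector in corpus.items():
        score = sum(x * y for x, y in zip(q, normalize(vector)))
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical "embeddings"; real ones have hundreds of dimensions.
corpus = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
results = exact_top_k([1.0, 0.0, 0.0], corpus, k=2)
print(results)
```

The cost of this loop grows linearly with the size of the corpus, which is why approximate indexes are used at scale.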

Transformers

Transformers are a special type of language model which are particularly good at preserving context in text. They do this with a self-attention mechanism which weights the importance of each part of the input data differently. Transformers were introduced with the Attention Is All You Need paper in 2017. The BERT (Bidirectional Encoder Representations from Transformers) model was released in 2018, and generative models include GPT-2 (Generative Pre-trained Transformer 2) released in 2019, GPT-3 in 2020, GPT-3.5 in 2022, and GPT-4 in 2023.

Given each token is connected to each other token, computation time scales quadratically with the number of tokens, effectively placing a limit on context length.
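
The quadratic scaling can be seen in a stripped-down sketch of self-attention. This is a simplification: the queries, keys and values are the input vectors themselves, whereas a real transformer applies learned projections and uses multiple heads. The counter shows that every token is scored against every other token.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs, comparisons = [], 0
    for query in vectors:
        scores = []
        for key in vectors:
            scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(dim))
            comparisons += 1
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs, comparisons


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
outputs, comparisons = self_attention(tokens)
print(comparisons)
```

Doubling the number of tokens quadruples the number of comparisons.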

Large Language Models (LLMs)

An LLM has a lot more parameters than other language models like sentence transformers, e.g. upwards of 7 billion parameters. They are also trained on very large amounts of data [1]. LLMs are typically trained primarily for text generation.

In simplified terms, they take input text / tokens, i.e. a prompt, and predict the words / tokens most likely to appear next based on all the training data they have seen. The prompt can also contain additional context to assist with the generation, allowing LLMs to be used for a wide variety of different purposes, e.g. question answering or preparing code snippets. The prompt can have other effects on the output too, including the “tone” and confidence levels of the response. The amount of context that can be provided is the “context window”, and mechanisms for extending the length of the context window are an active topic of research.
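
As a very rough analogy for next-token prediction, the toy below counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. An LLM is a neural network conditioned on the whole prompt, not a lookup table of word pairs, so treat this purely as an illustration of "predict what is most likely to come next".

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None


corpus = "the dog chased the cat and the dog barked"
model = train_bigrams(corpus)
next_word = predict_next(model, "the")
print(next_word)
```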

Defining good prompts for LLMs is crucial to getting the type of output desired, while minimising some of the issues. Prompts that work well for one LLM may not work well for another. This has led to the creation of a new role for “prompt engineers”.

The same prompt can also generate different responses due to an element of randomness configured with a value called the “temperature”. A lower value will produce more likely and therefore potentially higher quality results, while a higher value will sometimes produce less likely and possibly more unusual results.
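
A sketch of how temperature is commonly applied, assuming the usual approach of dividing the model's raw scores (logits) by the temperature before converting them to probabilities. The logits here are made up. Lower temperatures concentrate probability on the most likely token, higher temperatures spread it out.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
warm = softmax_with_temperature(logits, temperature=2.0)
for name, probs in (("0.5", cool), ("1.0", neutral), ("2.0", warm)):
    print(name, [round(p, 3) for p in probs])
```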

LLMs also suffer various drawbacks such as a tendency to fabricate information in responses, known as “hallucination”.

Note that LLMs are not normally used to create embeddings for vector search. This is because their focus on text generation doesn’t typically produce embeddings which are as useful for vector similarity measures, or more specifically LLMs are usually decoder-only transformers rather than bidirectional encoder transformers such as BERT.

Note also that some LLMs are “multimodal”, i.e. can work with forms of data other than text, e.g. images, although the focus of this post is text data.

Fine tuning

Training an LLM from scratch, or retraining an entire LLM, is computationally infeasible for most. However, it is possible to “fine tune” an existing LLM to adapt it to a more specific purpose. Generally, fine tuning relates more to following instructions than to adding new “domain knowledge”. As such, in some ways, fine tuning can be seen as a complement to prompt engineering.

There are different approaches to fine tuning, but broadly speaking they involve unfreezing the last layer or two and recalculating all the weight changes for those layers. To reduce the cost of fine tuning, there are approaches such as Low-Rank Adaptation (LoRA) which does not compute the full set of weight changes but rather a decomposed (lower rank) representation of them.
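
To give a feel for the saving from LoRA, the arithmetic below compares the number of values in a full weight update for one layer with the number in a low-rank decomposition of that update. The layer size and rank are illustrative assumptions, not taken from any particular model.

```python
def full_update_params(d_out, d_in):
    """A full update has one value per entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """LoRA learns B (d_out x rank) and A (rank x d_in); the update is B @ A."""
    return d_out * rank + rank * d_in


d_out, d_in, rank = 4096, 4096, 8  # illustrative sizes only
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
reduction = full / lora
print(full, lora, reduction)
```

With these numbers the low-rank form has 256 times fewer trainable values for that layer.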

Self-hosted LLM

LLMs such as those used by ChatGPT have not had their model weights publicly released, so it is only possible to use them via an API to the cloud hosted instance. Furthermore, even if the model weights had been released, many would be prohibitively expensive to self-host. For example, the largest GPT-3 model has 175 billion parameters [2].

Sending data to the cloud can be undesirable, e.g. if the data contains potentially sensitive information, and also potentially costly given there is a charge per API call.

Some LLMs such as Llama 2 have had their weights publicly released, and have smaller versions available, which can make self-hosting possible.

Quantization

An LLM with 7 billion, 13 billion, or even 70 billion parameters, which stores each parameter as a 16-bit floating point number, will need a lot of memory and storage to operate. Converting each weight to a lower precision number reduces memory and storage requirements, e.g. converting from fp16 to int8 halves memory requirements, and often does not significantly reduce effectiveness. Quantizing can also help improve inference performance on CPUs, although not on GPUs optimised for fp16 calculations.
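
The memory arithmetic, plus a sketch of one simple quantization scheme, is below. The scheme shown is "absmax" int8 quantization, chosen for brevity; it is an assumption for illustration and not the method used by any particular pre-quantized model file. The sample weights are made up, and the memory figure covers weights only.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)
int8_gb = weight_memory_gb(7e9, 8)


def quantize_int8(weights):
    """Absmax quantization: scale so the largest magnitude maps to 127."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floating point weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(fp16_gb, int8_gb, quantized, max_error)
```

The rounding error per weight is at most half the scale, which is part of why effectiveness often holds up.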

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation is a new information retrieval technique which takes an input query (typically in the form of a question), searches for similar text to the query, and uses an LLM to generate a summary of the result based on the similar text it has found. This allows it to provide references, and in theory reduces (but does not eliminate) the chance of fabricated information being returned.

The search is typically performed via a vector search, rather than keyword search.

A simple LLM prompt for RAG could be:

Answer the question based on the context below. 
[context]: {context} 
[question]: {question}
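
A sketch of how that prompt might be filled in, assuming the retrieval step has already returned a list of text chunks. The `build_rag_prompt` helper and the example chunks are hypothetical, and the resulting string would then be passed to whichever LLM is in use.

```python
PROMPT_TEMPLATE = (
    "Answer the question based on the context below.\n"
    "[context]: {context}\n"
    "[question]: {question}"
)


def build_rag_prompt(question, retrieved_chunks):
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = "\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for the results of a vector search.
chunks = ["Ben Nevis is in Scotland.", "Snowdon is in Wales."]
prompt = build_rag_prompt("Where is Ben Nevis?", chunks)
print(prompt)
```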

More sophisticated approaches are also possible. For example, it can be possible to break down complex questions into subquestions with a Sub Question Query Engine.

Code examples

The examples below use the following sentences:

sentences = ['vacation', 'holiday', 'vacations', 'red', 'green', 'dogs are popular pets']

Tokenization

Using the BERT Tokenizer in the Transformers library:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

To see how words are split into word parts:

[tokenizer.tokenize(sentence) for sentence in sentences]

To see how words are represented as numbers, and how those numbers decode back to text (noting the special tokens added to mark the start and end):

tokenized_sentences = tokenizer(sentences)
tokenized_sentences['input_ids']
[tokenizer.decode(tokenized_sentence) for tokenized_sentence in tokenized_sentences['input_ids']]

Note that if you are using embeddings, the model will normally tokenize the text for you (to ensure a supported tokenizer is used).

Embeddings

To create embeddings with the ‘sentence-transformers/all-MiniLM-L6-v2’ model for vector search:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
[embedding for embedding in embeddings]

Note that the output vectors all have the same dimension, in this example 384, whether the input is very short or relatively long. This makes it easier to compare vector similarity. Other models may output vectors of a different dimension.

Note also that there is a limit on input sequence length, in this case 256 tokens.

Vector search

In theory, two vectors representing similar text should occupy a similar area of the vector space. It is difficult to visualise a 384 dimension vector (if you could usefully reduce to 2 or 3 dimensions you could plot on a chart to visualise proximity), but by viewing the cosine similarity values for vector pairs you can see similarity:

from itertools import combinations
from sentence_transformers import util
# Print the cosine similarity for every unique pair of sentences
for (i, source), (j, target) in combinations(enumerate(sentences), 2):
    print(source, target, util.cos_sim(embeddings[i], embeddings[j]).item())

Many online examples show the ‘sentence-transformers/all-MiniLM-L6-v2’ model being used for vector search.

However, I found that it produced some surprisingly poor results in many real-world cases. For example, “How high is Ben Nevis?” gives a similarity score of 0.3176 to text about mountains containing the words “Ben Nevis” and its height, but a higher score of 0.4072 to some text about someone called Benjamin talking about someone down a well. Similarly, “Can you summarize Immanuel Kant’s biography in two sentences?” gives a similarity score of 0.5178 to text containing “Immanuel Kant” and some details of his life, but a higher score of 0.5766 to just the word “Biography”:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
question1 = "How high is Ben Nevis?"
answers1 = ["The three peaks in this context are the three highest peaks in Great Britain: Scafell Pike, England, 978m; Snowdon (Yr Wyddfa in Welsh), Wales, 1085m; Ben Nevis (Bheinn Nibheis in Scottish Gaelic), Scotland, 1345m", "Imagine being all that way down in the dark. Hope they thought to haul him up again at the end opined Benjamin, pleasantly."]
util.cos_sim(model.encode(question1), model.encode(answers1[0]))
util.cos_sim(model.encode(question1), model.encode(answers1[1]))

question2 = "Can you summarize Immanuel Kant's biography in two sentences?"
answers2 = ["Biography", "Immanuel Kant, born in 1724, was one of the most influential philosophers of the Enlightenment. Although Kant is best known today as a philosopher, his early work focused on physics. He correctly deduced a number of complicated physical phenomena, including the orbital mechanics of the earth and moon, the effects of the earth\u2019s rotation on weather patterns, and how the solar system was formed."]
util.cos_sim(model.encode(question2), model.encode(answers2[0]))
util.cos_sim(model.encode(question2), model.encode(answers2[1]))

The Massive Text Embedding Benchmark Leaderboard has a list of alternatives to try, e.g. BAAI/bge-small-en-v1.5.

Hugging Face login

Create an account on Hugging Face, then login, and create an access token as per

huggingface-cli login

This saves a token to ~/.cache/huggingface/token. Alternatively, the token can be supplied from an environment variable via huggingface-cli login --token $HUGGINGFACE_TOKEN, or the login can be run from Python:

from huggingface_hub import login
login()

Llama access

Fill in the details at and and wait for an email confirmation from both.

Download the Llama models

Download the original Llama 2 models:

cd ~/models
git clone
cd llama/

Enter the URL from the email after filling the form at , then select 7B,7B-chat.

Download the Hugging Face version of the original Llama 2 models:

cd ~/models
mkdir llama-hf
huggingface-cli login
python ~/projects/serve/examples/large_models/Huggingface_accelerate/ --model_path model --model_name meta-llama/Llama-2-7b-chat-hf

Using a pre-quantized model

To use the 7 billion parameter Llama 2 model designed for chat, using a popular pre-quantized 3-bit version:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF", model_file="llama-2-7b-chat.Q3_K_S.gguf", model_type="llama")
print(llm("AI is going to"))

Model servers and TorchServe

Running a self-hosted LLM from the command line is fine for demos and personal use. If making it public facing, simply wrapping a Flask or FastAPI layer around a self-hosted LLM may also work for limited use.

However, for use in production with more than one concurrent user, it is good to use a model serving framework with additional features such as scaling, monitoring, throttling and so on.

There are a number of model servers available, but in this case I’m looking at TorchServe.

Running the TorchServe example llama2 chat app

As per :

cd ~/models
git clone
cd serve/examples/LLM/llama2/chat_app
source ~/models/llama-hf/model/models--meta-llama--Llama-2-7b-chat-hf/snapshots/af6df14e494ef16d69ec55e9a016e900a2dde1c8/
streamlit run

Note that quantises the model, so takes some time (approx 5 mins), and needs only be run once.

On http://localhost:8501/ Start Server and Register Llama2, then in another terminal:

cd ~/models/serve/examples/LLM/llama2/chat_app
streamlit run

Which opens in http://localhost:8502. The front end is based on

Streamlit is great for demos, but can’t really be integrated into other existing applications (without reverse proxy and iframe type of solutions) given it has a dependency on its own web server.

Putting Llama inside TorchServe

I found the TorchServe documentation required a bit of trial and error to follow, and each iteration of building and testing a TorchServe archive took quite a while. So I started out with the most basic handler which loads the model and provides dummy output, just in order to test the model archiving and serving, test the stages of a basic handler file, and provide debug statements to confirm values of manifest, properties and handling of input data:

import logging
import os

import torch
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class ModelHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self._context = None
        self.initialized = False
        self.model = None
        self.device = None

    def initialize(self, context):
        # Log the manifest and system properties to confirm what TorchServe passes in
        self.manifest = context.manifest
        logger.info(f'manifest: {self.manifest}')
        properties = context.system_properties
        logger.info(f'properties: {properties}')
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        # Check the serialized model file is present in the archive
        model_dir = properties.get("model_dir")
        serialized_file = self.manifest['model']['serializedFile']
        model_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_path):
            raise RuntimeError("Missing the model file")
        # Model loading is deliberately left out of this dummy handler
        #self.model = torch.jit.load(model_path)
        self.initialized = True

    def preprocess(self, data):
        # Take the input data and make it inference ready
        preprocessed_data = data[0].get("data")
        if preprocessed_data is None:
            preprocessed_data = data[0].get("body")
        logger.info(f'preprocessed_data: {preprocessed_data}')
        return preprocessed_data

    def handle(self, data, context):
        model_input = self.preprocess(data)
        # Return fixed dummy output rather than running inference
        #model_output = self.inference(model_input)
        model_output = [[0.005593413952738047, 0.07203678041696548, -0.029577888548374176]]
        return model_output

Once that was successfully building, I was then able to update the handler to actually use the model and return the encoded data.

Building a 2 stage Dockerfile with a pre-quantized Llama2 model served by TorchServe

# Builder image
FROM pytorch/torchserve AS builder

WORKDIR /usr/app

RUN pip install huggingface_hub

RUN huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q3_K_S.gguf config.json --local-dir . --local-dir-use-symlinks False

ADD model-config.yaml .
#ADD .

RUN mkdir model_store
RUN torch-model-archiver --model-name llamacpp --version 1.0 --serialized-file llama-2-7b-chat.Q3_K_S.gguf --handler --extra-files config.json --config-file model-config.yaml --export-path model_store

# Production image
FROM pytorch/torchserve

RUN pip install llama-cpp-python

COPY --from=builder /usr/app/model_store model_store

CMD ["torchserve", "--start", "--model-store", "model_store", "--models", "llama2=llamacpp.mar", "--ncs"]

Note that pytorch/torchserve doesn’t currently support arm64 as per so if deploying to arm64 servers one temporary workaround is to replace FROM pytorch/torchserve with:

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip openjdk-17-jdk git
RUN git clone && cd /serve ; python3 ./ts_scripts/ ; cd /
RUN pip install torchserve torch-model-archiver torch-workflow-archiver
RUN ln -s /usr/bin/python3 /usr/bin/python

Usage of vector search, RAG and LLMs in search in practice

Vector search

Vector search has the following benefits:

  • It should be better for synonyms, i.e. different words with similar meanings, assuming those different words have similar vectors, e.g. a search for “holiday” should return a document which doesn’t contain the “holiday” keyword but does contain the “vacation” keyword, given “holiday” and “vacation” have similar embeddings. Keyword search can configure synonyms, but these would often have to be set up manually.
  • It should be better for homonyms, i.e. words that are spelled the same but have different meanings in different contexts, e.g. “bank” in “river bank” and “bank account”.

It has the following drawbacks:

  • Vector search currently has various limitations which could adversely affect results, e.g. requiring longer text to be split into chunks could miss results split across two chunks. There are various workarounds, and proper solutions may emerge in time.
  • Vector search works best with longer input text, ideally with queries of a similar length to the chunks of text being searched. For a single search term, traditional keyword search may perform better.
  • Debugging a poor result can be very difficult, and fixing bugs even more difficult, given embeddings and language models are something of a black box. In contrast, a keyword search will be able to explain exactly how a score was obtained, and expose all the levers required for tuning that score.

So vector search is not a “silver bullet” for search. I would be inclined to think the ideal solution would combine the best of both in some way, e.g. using keyword search to quickly return a broad set of results and re-rank with vector search, or using LLMs to fill gaps in keyword search e.g. by auto-generating synonym lists. There are various approaches for such “hybrid search” models, but no best practice has emerged as yet.
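
One of the hybrid options mentioned above, keyword search to pick candidates followed by a vector re-rank, can be sketched as below. Everything here is hypothetical: the documents, their 2-dimensional "embeddings", and the `hybrid_search` helper. It is one possible arrangement rather than a recommended design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hybrid_search(query_terms, query_vector, documents, top_k):
    """Keyword matching selects candidates, vector similarity orders them."""
    candidates = [
        doc for doc in documents
        if any(term in doc["text"].lower().split() for term in query_terms)
    ]
    candidates.sort(key=lambda doc: cosine(query_vector, doc["vector"]), reverse=True)
    return [doc["id"] for doc in candidates[:top_k]]


# Made-up documents with made-up embeddings.
documents = [
    {"id": 1, "text": "cheap holiday deals", "vector": [0.9, 0.1]},
    {"id": 2, "text": "holiday pay rules", "vector": [0.2, 0.9]},
    {"id": 3, "text": "vacation packages", "vector": [0.95, 0.05]},
]
ranked = hybrid_search(["holiday"], [1.0, 0.0], documents, top_k=2)
print(ranked)
```

Note that document 3 never reaches the re-rank step because it lacks the keyword, so this arrangement inherits the synonym weakness of keyword search unless the keyword stage is broadened.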

Retrieval Augmented Generation

Retrieval Augmented Generation can look great in demos, but is difficult to get working consistently well in practice. This is because it can be quite brittle, e.g. very dependent on prompt engineering, and also difficult to debug, e.g. due to the lack of explainability and the inherent randomness of responses. This may be acceptable in some contexts, but not in others. Even the risk of fabricated information may outweigh any rewards of RAG in some contexts, e.g. if investment decisions might be based on information from search then you want demonstrably zero chance of your search showing false information.

There are many other ideas for using LLMs within search which I would like to explore further, e.g. using LLM prompts to build knowledge graphs.

  1. For example, according to, GPT-3 is 60% Common Crawl, 22% WebText2 (the contents of which aren’t publicly known, but likely to include Reddit comments), 16% books (again the actual book list isn’t publicly known), and 3% Wikipedia. There is not currently any public information available about GPT-4’s training data.

  2. There is not currently any public information on GPT-4 model sizes, but they are almost certainly significantly larger than GPT-3.