Build a semantic search engine in Python

Semantic search is a hot topic these days - companies are raising millions of dollars to build infrastructure and tools for it. Perhaps because of that, most semantic search tutorials I see assume you need a stack of tools like vector databases and LangChain. This couldn’t be further from the truth - for most use cases, a few lines of Python and a couple of small libraries are all you need.

In this post, I’m going to introduce a semantic search approach that’s worked in production at Dataquest and Endless Academy. It’s simple enough to run on your laptop in just a few lines of code, and I’ll also talk about how to scale it to millions of embeddings. Let’s get started.

Semantic search is about finding passages that have a similar meaning to your query. With full-text search, you’re matching the exact words in your query. With semantic search, you’re going a layer deeper and matching on what the text means.

To do this, we need to represent each passage as a vector. We use a neural network to convert each passage of text into an array of numbers. This array is called an embedding, and it represents what the network thinks the passage means. We can then compare embeddings to find similar passages:

Network embeddings

The network converts each passage into a vector of numbers. Because of how the network was trained, two passages with similar meanings should end up close together in vector space. We can measure how close two vectors are using cosine similarity, and thus figure out how similar the passages are. Cosine similarity ranges from -1 to 1, with higher values meaning more similar; for text embeddings like these, scores usually fall between 0 and 1:

Cosine similarity
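To make this concrete, here’s a tiny sketch of cosine similarity computed by hand in PyTorch; the two vectors are made up purely for illustration:

import torch

# Two made-up vectors standing in for embeddings
a = torch.tensor([0.2, 0.7, 0.1])
b = torch.tensor([0.3, 0.6, 0.2])

# Cosine similarity: dot product divided by the product of the vector lengths
similarity = torch.dot(a, b) / (a.norm() * b.norm())
print(similarity.item())  # ~0.97, so the vectors point in nearly the same direction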

When you’re doing semantic search, it’s common to follow this process:

  • Take a corpus of passages
  • Convert each passage into an embedding
  • Convert the queries into embeddings
  • Find the most similar embeddings to the query

Semantic search in Python

We can implement this process in Python. We’ll use the sentence-transformers package for our embedding neural network; it includes several embedding models that were trained specifically to find similar passages.

Downloading and cleaning the data

First, we’ll download the full text of Moby Dick from Project Gutenberg. We’ll also do a little bit of data cleaning:

  • Split on the first chapter title, so we remove the preface, table of contents, etc.
  • Split into chunks on runs of consecutive newlines (\n or \r)

The specific data cleaning you do will vary based on the dataset.

import requests
import re

# Download the book
book = requests.get("https://www.gutenberg.org/files/2701/2701-0.txt").text

# Keep only the text after the first chapter heading (the title appears twice -
# once in the table of contents and once at the chapter itself - so take the last piece)
book = book.split("CHAPTER 1. Loomings.")[-1]

# Split into passages on runs of three or more newline characters
passages = re.split(r'[\n\r]{3,}', book)

We’ll need to do a little more data cleaning to get good results:

  • Strip out leading/trailing whitespace, and remove chunks that are too short
  • Use the ftfy library to fix any unicode issues

from ftfy import fix_text

# Clean up the passages
passages = [fix_text(chunk.strip()) for chunk in passages if len(chunk.strip()) > 100]
passages[:2]
["Call me Ishmael. Some years ago—never mind how long precisely—having\nlittle or no money in my purse, and nothing particular to interest me\non shore, I thought I would sail about a little and see the watery part\nof the world. It is a way I have of driving off the spleen and\nregulating the circulation. Whenever I find myself growing grim about\nthe mouth; whenever it is a damp, drizzly November in my soul; whenever\nI find myself involuntarily pausing before coffin warehouses, and\nbringing up the rear of every funeral I meet; and especially whenever\nmy hypos get such an upper hand of me, that it requires a strong moral\nprinciple to prevent me from deliberately stepping into the street, and\nmethodically knocking people's hats off—then, I account it high time to\nget to sea as soon as I can. This is my substitute for pistol and ball.\nWith a philosophical flourish Cato throws himself upon his sword; I\nquietly take to the ship. There is nothing surprising in this. If they\nbut knew it, almost all men in their degree, some time or other,\ncherish very nearly the same feelings towards the ocean with me.",
 'There now is your insular city of the Manhattoes, belted round by\nwharves as Indian isles by coral reefs—commerce surrounds it with her\nsurf. Right and left, the streets take you waterward. Its extreme\ndowntown is the battery, where that noble mole is washed by waves, and\ncooled by breezes, which a few hours previous were out of sight of\nland. Look at the crowds of water-gazers there.']

We now have a list of strings that should tokenize nicely. We can initialize a sentence transformer model to get embeddings for each passage.

We’ll use the all-MiniLM-L6-v2 model, which is a small model that works well for semantic search:

  • It’s fast and accurate, with a good balance of speed and performance even on CPU. You can see benchmarks here.
  • Compared to cloud embeddings, it has a lower dimensionality, which makes the embeddings easier to store and look up. It’s also cheaper, and more accurate than many cloud embedding models.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # Use device='cuda' to use GPU

If you use the GPU when loading the model with device='cuda', it can embed more than 20x faster.

Once we load the model, we can embed the passages. It should take about 30 seconds on CPU.

embeddings = model.encode(passages, convert_to_tensor=True)
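If your corpus is larger, encode takes a couple of optional arguments worth knowing about; the values below are examples rather than tuned recommendations:

# Larger batches are usually faster on GPU (at the cost of memory),
# and a progress bar helps with long-running jobs
embeddings = model.encode(
    passages,
    convert_to_tensor=True,
    batch_size=64,
    show_progress_bar=True,
)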

Now we can query against the embeddings. Comparing query embeddings to passage embeddings with util.cos_sim returns a PyTorch tensor of shape (Q, E), where Q is the number of queries and E is the number of passage embeddings we’re searching. Each entry tells us how similar one query is to one passage. Note that this is a brute-force method - every query is compared to every embedding. That’s fine for small datasets, but it won’t scale well.

Also be careful with the scoring function (in this case, cosine similarity) - not all sentence transformers models work well with all scoring functions. It depends on how the model was trained, and whether its output is normalized.

from sentence_transformers import util
import torch

query_text = [
    "A really big whale",
    "Adrift alone in the ocean",
]

# Encode the query text
query = model.encode(query_text, convert_to_tensor=True)

# Find the similar passages
cos_scores = util.cos_sim(query, embeddings)
cos_scores.shape
torch.Size([2, 2014])
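As an aside on scoring functions: when a model produces normalized embeddings, a plain dot product ranks passages the same way cosine similarity does, and sentence-transformers provides util.dot_score for that. A quick sketch using the tensors we just computed:

# Dot-product scoring; for normalized embeddings this gives the same ranking as
# cosine similarity, but the two can differ for models that don't normalize their output
dot_scores = util.dot_score(query, embeddings)
dot_scores.shape  # torch.Size([2, 2014]), same shape as cos_scores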

We now select the top k results for each query, and get the original passages:

# Find the most similar passage indices.  Increase k to get more results.
top_results = torch.topk(cos_scores, k=1, dim=-1)
indices = top_results.indices.tolist()

# Display the results

def display_results(indices):
    for i, result in enumerate(indices):
        print(f"Query: {query_text[i]}")
        for idx in result:
            print(f"Passage: {passages[idx]}")
        print()

display_results(indices)
Query: A really big whale
Passage: BOOK I. (_Folio_) CHAPTER IV. (_Hump Back_).—This whale is often seen
on the northern American coast. He has been frequently captured there,
and towed into harbor. He has a great pack on him like a peddler; or
you might call him the Elephant and Castle whale. At any rate, the
popular name for him does not sufficiently distinguish him, since the
sperm whale also has a hump though a smaller one. His oil is not very
valuable. He has baleen. He is the most gamesome and light-hearted of
all the whales, making more gay foam and white water generally than any
other of them.

Query: Adrift alone in the ocean
Passage: Now, in calm weather, to swim in the open ocean is as easy to the
practised swimmer as to ride in a spring-carriage ashore. But the awful
lonesomeness is intolerable. The intense concentration of self in the
middle of such a heartless immensity, my God! who can tell it? Mark,
how when sailors in a dead calm bathe in the open sea—mark how closely
they hug their ship and only coast along her sides.

As you can see, the results are good, but not perfect. You can improve accuracy by:

  • Using a more powerful embedding model (see the benchmarks)
  • Doing better data cleaning
  • Filtering the passages before doing each query. For example, you can use pgvector to do a combination of searching on structured fields and embeddings.

Scaling up

As you scale up, the simple approach I’ve outlined won’t work as well anymore. The embeddings only live in memory, so you’ll need to re-embed your passages every time the process restarts, and every search is still brute force.

For better performance, you’ll want to consider both persisting embeddings in some way, and using a more efficient search method. Here are some strategies:

  1. Persist the embedding tensors to disk, and reload them each time you need them. This will still do a brute force search, but it will save you the embedding time.
  2. Use a file-based index like faiss or lancedb. Faiss has a variety of search algorithms you can pick from. It’s a good option, even for millions of embeddings.
  3. Persist the embeddings using a vector database like pgvector. Pgvector is a good option because many people use Postgres already, and you can query on structured fields and embedding fields together. There are other options, like Weaviate and Pinecone, as well. This is the most complex option, but also the most performant and flexible. You can see some benchmarks here.

I’ll show examples of strategies 1 and 2.

Saving and loading embeddings

Saving and loading embeddings is straightforward, since they’re just PyTorch tensors:

# Save embeddings
torch.save(embeddings, "embeddings.pt")

# Load embeddings
new_embeddings = torch.load("embeddings.pt")

# Verify that the query results are the same
torch.allclose(util.cos_sim(query, new_embeddings), util.cos_sim(query, embeddings))
True
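One caveat: the saved tensor only contains vectors, so the indices a search returns are only useful if you also keep the passages around. A minimal way to handle that (the filename here is arbitrary) is to store the passage list alongside the embeddings:

import json

# Save the passages so search result indices can be mapped back to text later
with open("passages.json", "w") as f:
    json.dump(passages, f)

# Reload them in a new session
with open("passages.json") as f:
    passages = json.load(f)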

Using faiss to index and store embeddings

Faiss is a popular package that implements vector search algorithms. I’ll use the faiss-cpu package, but you can also install the GPU version, faiss-gpu, for better performance.

We’ll use an HNSW index, which stays efficient even for millions of passages. It has a few parameters (M, efSearch, and efConstruction) that affect how the search graph is built and queried; I’ll show how to adjust them after the example.

import faiss

# Initialize an HNSW index. The second argument is M, the number of neighbors per graph node.
# Note: IndexHNSWFlat uses L2 distance by default; for normalized embeddings,
# L2 ranking matches cosine similarity ranking.
index = faiss.IndexHNSWFlat(embeddings.shape[1], 64)

# Add our embeddings (faiss expects float32 numpy arrays)
index.add(embeddings.cpu().numpy())

# Query the index (convert the PyTorch query tensor to a numpy array first)
distances, indices = index.search(query.cpu().numpy(), k=1)

# Display the results
display_results(indices)

# Persist the index, can load with faiss.read_index("index.faiss")
faiss.write_index(index, "index.faiss")
Query: A really big whale
Passage: BOOK I. (_Folio_) CHAPTER IV. (_Hump Back_).—This whale is often seen
on the northern American coast. He has been frequently captured there,
and towed into harbor. He has a great pack on him like a peddler; or
you might call him the Elephant and Castle whale. At any rate, the
popular name for him does not sufficiently distinguish him, since the
sperm whale also has a hump though a smaller one. His oil is not very
valuable. He has baleen. He is the most gamesome and light-hearted of
all the whales, making more gay foam and white water generally than any
other of them.

Query: Adrift alone in the ocean
Passage: Now, in calm weather, to swim in the open ocean is as easy to the
practised swimmer as to ride in a spring-carriage ashore. But the awful
lonesomeness is intolerable. The intense concentration of self in the
middle of such a heartless immensity, my God! who can tell it? Mark,
how when sailors in a dead calm bathe in the open sea—mark how closely
they hug their ship and only coast along her sides.
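If you want to experiment with the HNSW parameters mentioned earlier, faiss exposes them on the index. Roughly speaking, efConstruction trades index build time for graph quality, and efSearch trades query latency for recall; the values below are starting points, not recommendations:

# Build-time parameter: higher values build a better graph, more slowly.
# Set this before calling index.add().
index.hnsw.efConstruction = 200

# Query-time parameter: higher values explore more of the graph,
# improving recall at the cost of latency.
index.hnsw.efSearch = 128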

Production notes

Here are some notes that may help you as you run semantic search in production:

  • Use a GPU if you can. It will run much faster.
  • Keep the system as simple as possible. The fewer moving parts, the easier the system will be to debug and observe.
  • Data cleaning and chunking matters a lot more than you think it will. The search is only as good as the tokens fed into the embedding model.
  • Most vector search is based on a variant of k-nearest-neighbors. If you have millions of embeddings, this can quickly start giving you poor quality search results. Segmenting your data using structured fields before searching can help with this. For example, you can search for passages that are in the same book as the query, and then do a semantic search on those passages. pgvector makes this easy; there’s also a small pure-Python sketch of the idea after this list.
  • Try using a different embedding model to improve accuracy. You’ll need to figure out the right speed/accuracy tradeoff for your use case. In most cases, local embedding models outperform cloud embeddings.
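Here’s a minimal pure-Python sketch of that segmentation idea. It assumes a hypothetical corpus where each passage carries a "book" metadata field - the structured filter is the point, not the schema - and it re-embeds on every query purely for brevity (in practice you’d embed once and cache, or use pgvector):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Hypothetical corpus: each record pairs a passage with structured metadata
corpus = [
    {"book": "Moby Dick", "text": "Call me Ishmael. Some years ago..."},
    {"book": "Moby Dick", "text": "There now is your insular city of the Manhattoes..."},
    {"book": "Walden", "text": "I went to the woods because I wished to live deliberately..."},
]

def search(query, book, k=1):
    # Structured filter first: keep only passages from the requested book
    subset = [record["text"] for record in corpus if record["book"] == book]
    # Then run semantic search over the much smaller subset
    subset_embeddings = model.encode(subset, convert_to_tensor=True)
    query_embedding = model.encode([query], convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, subset_embeddings)
    top = scores.topk(k=min(k, len(subset)), dim=-1).indices[0].tolist()
    return [subset[i] for i in top]

search("a lonely swimmer", book="Moby Dick")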

Wrap-up

I hope this was a useful introduction to local semantic search. It’s a powerful technology that seems almost magical at first. When you combine semantic search results with LLM output, you can get accurate, personalized chat experiences. Please let me know if you have any questions or comments!


Vikas Paruchuri

I’m Vikas, a developer based in Oakland. I'm a lifelong learner currently diving deep into AI. I've started an education company, won ML competitions, and served as a US diplomat.