Avnish Yadav

Building Long-Term Memory: Setting Up Vector Databases for Semantic Search in LangChain
2026-02-22


8 min read · Technical Tutorials · AI Engineering · LangChain · AI Development · RAG · Vector Database · ChromaDB · Pinecone · Python · Semantic Search

A comprehensive technical guide on setting up vector stores like ChromaDB and Pinecone within the LangChain ecosystem. Covers text splitting, embedding generation, and retrieval strategies.

If you have ever tried to paste a 50-page PDF into ChatGPT, you've hit the context window wall. While models like GPT-4 Turbo keep expanding their windows, stuffing massive datasets into a prompt is rarely efficient, and almost never cost-effective.

To build truly intelligent agents, we don't need larger context windows; we need better retrieval. We need to give our LLMs long-term memory.

This is where Vector Databases and Semantic Search come in. Instead of relying on keyword matching (which fails when users use synonyms), semantic search understands the intent and meaning behind a query.
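To make "meaning" concrete, here is a toy sketch of cosine similarity over hand-written vectors. The numbers are invented for illustration; in practice an embedding model produces them:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 4-dimensional "embeddings" -- real models emit hundreds of dimensions.
vec_refund  = [0.90, 0.10, 0.00, 0.20]  # "How do I get my money back?"
vec_return  = [0.85, 0.15, 0.05, 0.25]  # "What is the refund policy?"
vec_weather = [0.00, 0.10, 0.95, 0.10]  # "Will it rain tomorrow?"

# The first pair shares no keywords, yet scores high because the meaning is close.
print(cosine_similarity(vec_refund, vec_return))   # ~0.99
print(cosine_similarity(vec_refund, vec_weather))  # ~0.03
```

This is exactly what keyword matching cannot do: "money back" and "refund policy" share no words, but their vectors point in nearly the same direction.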

In this guide, I’m going to walk you through how I set up vector stores using LangChain. We will move from raw text to a fully functional semantic search engine that can power a RAG (Retrieval-Augmented Generation) pipeline.

The Architecture of Semantic Search

Before writing code, we need to understand the data flow. When we build a retrieval system in LangChain, we are essentially building a pipeline that transforms unstructured data into mathematical vectors.

The process looks like this:

  1. Load: Bring in data (PDFs, Markdown, Text).
  2. Split: Chunk the data into manageable pieces.
  3. Embed: Convert those chunks into vector representations (arrays of floating-point numbers) using an Embedding Model.
  4. Store: Save these vectors in a Vector Database (Vector Store).
  5. Retrieve: When a user asks a question, we embed the question and find the vectors that are mathematically closest to it.

Prerequisites

We will be using LangChain, OpenAI (for embeddings), and ChromaDB (a local vector store ideal for prototyping). Later, I will touch on Pinecone for production environments.

pip install langchain langchain-openai langchain-chroma chromadb pypdf

Make sure you have your OPENAI_API_KEY set in your environment variables.

Step 1: Document Loading and Splitting

Garbage in, garbage out. If you feed your vector store massive, unbroken blocks of text, your retrieval accuracy will plummet. The LLM needs specific context, not a whole chapter.

I almost exclusively use the RecursiveCharacterTextSplitter. It respects the structure of natural language (paragraphs, newlines) better than simple character splitting.

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the Data
loader = PyPDFLoader("./technical_manual.pdf")
docs = loader.load()

# 2. Split the Data
# chunk_size: Number of characters per chunk
# chunk_overlap: Characters to overlap between chunks (crucial for maintaining context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200,
    add_start_index=True
)

all_splits = text_splitter.split_documents(docs)

print(f"Split into {len(all_splits)} chunks.")

Builder Tip: The chunk_overlap is vital. Without it, you might cut a sentence in half at the boundary of a chunk, destroying the semantic meaning necessary for the embedding model to work.
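You can see the effect by chunking a string by hand. This is a simplified character-window sketch, not the recursive splitter itself (which is smarter about paragraph boundaries), but the overlap mechanics are the same idea:

```python
def chunk(text, chunk_size, chunk_overlap):
    # Slide a window of chunk_size characters; each step backs up by
    # chunk_overlap so consecutive chunks share their boundary text.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Rate limits are configured in settings.yaml. The default is 100 requests per minute."

no_overlap   = chunk(text, chunk_size=40, chunk_overlap=0)
with_overlap = chunk(text, chunk_size=40, chunk_overlap=15)

# Without overlap, "settings.yaml." is sliced across two chunks, so no single
# chunk carries the complete fact; with overlap, the second chunk repeats the
# boundary text and keeps it intact.
print(no_overlap)
print(with_overlap)
```

A chunk that only contains half of "settings.yaml" embeds to a vector that matches neither "settings" questions nor "yaml" questions well, which is why overlap pays for its storage cost.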

Step 2: Selecting the Embedding Model

This is the engine of your semantic search. The embedding model converts text into a vector space. Closely related concepts will be positioned near each other in this space.

For most use cases, OpenAI's text-embedding-3-small or text-embedding-3-large are the industry standards for performance vs. cost. However, if you are running locally for privacy, HuggingFace embeddings are a solid alternative.

from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

Step 3: Creating the Vector Store (ChromaDB)

Now we combine our splits and our embedding model to populate the database. I recommend starting with Chroma because it runs locally as a library—no API keys or cloud setup required.

from langchain_chroma import Chroma

# Create the vector store and persist it locally
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=embedding_model,
    persist_directory="./chroma_db"
)

print("Vector store created and saved.")

At this point, you actually have a searchable database. LangChain has handled the complexity of looping through documents, calling the OpenAI API for embeddings, and storing the results.

Step 4: Semantic Search and Retrieval

This is where the magic happens. We don't query this database with SQL; we query it with natural language.

The standard method is Similarity Search (usually using Cosine Similarity). It looks for vectors that point in the same direction as your query vector.

query = "How do I configure the API rate limits?"

# k=4 means return the top 4 most relevant chunks
relevant_docs = vectorstore.similarity_search(query, k=4)

for i, doc in enumerate(relevant_docs):
    print(f"--- Result {i+1} ---")
    print(doc.page_content)
    print("\n")

The Problem with Basic Similarity

Basic similarity search has a flaw: Redundancy. If your document has three very similar paragraphs about API rate limits, the search will return all three. This wastes the LLM's context window with duplicate information.

To fix this, I use MMR (Maximal Marginal Relevance). MMR selects the "best" match first, and then looks for other matches that are relevant but diverse from the first one.

# search_type="mmr" re-ranks for diversity:
# fetch_k candidates are retrieved first, then k diverse ones are selected
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)

results = retriever.invoke("How do I configure the API rate limits?")
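To demystify what the retriever is doing, here is a simplified, self-contained sketch of the MMR selection loop on toy 2-D vectors. The greedy structure and the lambda_mult relevance/diversity trade-off mirror the algorithm; the real implementation lives inside LangChain, and the lambda value here is chosen to make the diversity effect visible:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult):
    # Greedy loop: each round, pick the candidate with the best trade-off
    # between relevance to the query and dissimilarity to what's already picked.
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [
    [0.95, 0.05],  # near-duplicate of the next document
    [0.96, 0.04],  # most relevant to the query
    [0.60, 0.40],  # less relevant, but diverse
]

# Plain top-2 similarity would return documents 1 and 0 (the near-duplicates);
# MMR returns 1 and 2, trading a little relevance for diversity.
print(mmr(query, docs, k=2, lambda_mult=0.3))  # [1, 2]
```

The first pick is always the best raw match (nothing is selected yet, so the redundancy term is zero); every later pick is penalized for resembling what is already in the result set.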

Production Considerations: Moving to Pinecone

Chroma is great for local development, but when you are deploying a micro-SaaS or a heavy-duty agent, you need a cloud-native solution. Pinecone is often my go-to for production because of its serverless architecture and low latency.

Switching from Chroma to Pinecone in LangChain is trivial thanks to the abstraction layer:

from langchain_pinecone import PineconeVectorStore

# Assumes PINECONE_API_KEY is set in the environment, and that the index
# already exists with a dimension matching the embedding model
index_name = "my-production-index"

vectorstore = PineconeVectorStore.from_documents(
    documents=all_splits,
    embedding=embedding_model,
    index_name=index_name
)

Connecting to an LLM (The RAG Pipeline)

A vector database works best when it feeds an LLM. Here is how you connect the dots to create a system that answers questions based only on your data.

from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4-turbo")

# 1. Define the system prompt
system_prompt = (
    "You are an expert technical assistant. "
    "Use the following context to answer the user's question. "
    "If you don't know the answer, say you don't know."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# 2. Create the chain that combines documents into the prompt
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# 3. Create the retrieval chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# 4. Run it
response = rag_chain.invoke({"input": "How do I configure the API rate limits?"})
print(response["answer"])

Conclusion

Setting up a vector database is the first step in moving from basic prompt engineering to building intelligent, data-aware systems. By using LangChain, we abstract away the linear algebra of similarity search and focus on the architecture of our application.

Remember: The quality of your retrieval depends heavily on your chunking strategy and your metadata. In future posts, I will dive deeper into Self-Querying Retrievers, where the LLM actually structures the database query for you.
