How AI Agents Remember Things: The Role of Vector Stores in LLM Memory
Learn how vector databases like FAISS and Pinecone give AI agents a sense of memory.

When you talk to an AI assistant, it can feel like it remembers what you said before.
But large language models (LLMs) don’t actually have memory on their own. They don’t remember conversations unless that information is given to them again.
So, how do they seem to recall things?
The answer lies in something called a vector store.
What Is a Vector Store?
A vector store is a special type of database. Instead of storing text or numbers like a regular database, it stores vectors.
A vector is a list of numbers that represents the meaning of a piece of text. You get these vectors using a process called embedding.
The model takes a sentence and turns it into a high-dimensional point in space. In that space, similar meanings are close together.

For example, if I embed “I love sushi,” it might be close to “Sushi is my favourite food” in vector space. These embeddings help an AI agent find related thoughts even if the exact words differ.
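You can see this in a few lines of Python. The sketch below uses the sentence-transformers library and the same all-MiniLM-L6-v2 model as the examples later in this article; the exact scores will vary, but the similar pair should score noticeably higher.
from sentence_transformers import SentenceTransformer, util
# Load a small pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two sentences with similar meaning and one unrelated sentence
embeddings = model.encode([
    "I love sushi",
    "Sushi is my favourite food",
    "The meeting starts at 3 pm",
])
# Cosine similarity: closer to 1 means closer in meaning
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar pair, higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated pair, lower score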
How Embeddings Work
Let’s say a user tells an assistant:
“I live in Austin, Texas.”
The model turns this sentence into a vector:
[0.23, -0.41, 0.77, ..., 0.08]
This vector doesn’t mean much to us, but to the AI, it’s a way to capture the sentence’s meaning. That vector gets stored in a vector database, along with some extra info — maybe a timestamp or a note that it came from this user.
Later, if the user says:
“Book a flight to my hometown.”
The model turns this new sentence into a new vector. It then searches the vector database to find the most similar stored vectors.
The closest match might be “I live in Austin, Texas.” Now the AI knows what the user probably meant by “my hometown.”
This ability to look up related past inputs based on meaning, not just matching keywords, is what gives LLMs a form of memory.
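Conceptually, each stored memory pairs the vector with a bit of metadata, like the timestamp and user note mentioned above. Here is a minimal sketch of what one record could look like; the field names are purely illustrative and not tied to any particular database.
from datetime import datetime, timezone
from sentence_transformers import SentenceTransformer
# Embed the sentence we want to remember
model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("I live in Austin, Texas")
# One "memory" record: the vector plus metadata about where it came from
memory_record = {
    "id": "mem-001",  # illustrative ID scheme
    "vector": vector.tolist(),
    "metadata": {
        "user_id": "user-123",  # hypothetical user identifier
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source": "chat",
    },
}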
Why Vector Stores Are Crucial for Memory
LLMs process language using a context window. That’s the amount of text they can “see” at once.
For GPT-4-turbo, the window can handle up to 128,000 tokens, which sounds huge — but even that gets filled fast. You can’t keep the whole conversation there forever.
Instead, you use a vector store as long-term memory. You embed and save useful info.
Then, when needed, you query the vector store, retrieve the top relevant pieces, and feed them back into the LLM. This way, the model remembers just enough to act smart — without holding everything in its short-term memory.
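In rough Python, that retrieval step looks something like the sketch below. The retrieve_top_k function is a placeholder for whatever vector-store query you use (FAISS and Pinecone examples follow), and the prompt format is just illustrative.
def build_prompt(user_message, retrieve_top_k):
    # Fetch only the few stored memories closest in meaning to the new message
    memories = retrieve_top_k(user_message, k=3)
    # Put just those memories into the prompt, keeping the context window small
    memory_lines = "\n".join(f"- {m}" for m in memories)
    return (
        "Here is what you know about the user:\n"
        f"{memory_lines}\n\n"
        f"User message: {user_message}"
    )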
Popular Vector Stores
There are several popular vector databases in use. Each one has its strengths.
FAISS is an open-source library developed by Meta. It’s fast and works well for local or on-premise applications.
FAISS is great if you want full control and don’t need cloud hosting. It supports millions of vectors and provides tools for indexing and searching with high performance.
Here’s how you can use FAISS:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Load a pre-trained sentence transformer model that converts sentences to numerical vectors (embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Define the input sentence we want to store in memory
sentence = "User lives in Austin, Texas"
# Convert the sentence into a dense vector (embedding)
embedding = model.encode(sentence)
# Get the dimensionality of the embedding vector (needed to create the FAISS index)
dimension = embedding.shape[0]
# Create a FAISS index for L2 (Euclidean) similarity search using the embedding dimension
index = faiss.IndexFlatL2(dimension)
# Add the sentence embedding to the FAISS index (this is our "memory")
index.add(np.array([embedding]))
# Encode a new query sentence that we want to match against the stored memory
query = model.encode("Where is the user from?")
# Search the FAISS index for the top-1 most similar vector to the query
D, I = index.search(np.array([query]), k=1)
# Print the index of the most relevant memory (in this case, only one item in the index)
print("Most relevant memory index:", I[0][0])
This code uses a pre-trained model to turn a sentence like “User lives in Austin, Texas” into an embedding.
It stores this embedding in a FAISS index. When you ask a question like “Where is the user from?”, the code converts that question into another embedding and searches the index to find the stored sentence that’s most similar in meaning.
Finally, it prints the position (index) of the most relevant sentence in the memory.
FAISS is efficient, but it’s not hosted. That means you need to manage your own infrastructure.
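Managing the infrastructure yourself also means handling persistence yourself. FAISS can write an index to disk and load it back, so the agent's memory survives a restart. Continuing the index from the snippet above:
# Save the index to disk so the memory survives restarts
faiss.write_index(index, "memory.index")
# Later, load it back instead of rebuilding it from scratch
index = faiss.read_index("memory.index")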
Pinecone is a cloud-native vector database. It’s managed for you, which makes it great for production systems.
You don’t need to worry about scaling or maintaining servers. Pinecone handles billions of vectors and offers filtering, metadata support, and fast queries. It integrates well with tools like LangChain and OpenAI.
Here’s how a basic Pinecone setup works (this example uses the older pinecone-client API; newer versions of the SDK replace pinecone.init with a Pinecone client class):
import pinecone
from sentence_transformers import SentenceTransformer
# Initialize Pinecone with your API key and environment
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
# Connect to an existing Pinecone index named "memory-store"
index = pinecone.Index("memory-store")
# Load a pre-trained sentence transformer model to convert text into embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
# Convert a fact/sentence into a numerical embedding (vector)
embedding = model.encode("User prefers vegetarian food")
# Store (upsert) the embedding into Pinecone with a unique ID
index.upsert([("user-pref-001", embedding.tolist())])
# Encode the query sentence into an embedding
query = model.encode("What kind of food does the user like?")
# Search Pinecone for the stored embedding most similar to the query
results = index.query(vector=query.tolist(), top_k=1)
# Print the ID of the top matching memory
print("Top match ID:", results['matches'][0]['id'])
Pinecone is ideal if you want scalability and ease of use without managing hardware.
Other popular vector stores include Chroma, Weaviate, Milvus, and Qdrant.
Each of these has its place depending on whether you need speed, scale, simplicity, or special features.
Making AI Seem Smart with Retrieval-Augmented Generation
This whole system of embedding user inputs, storing them in a vector database, retrieving the most relevant ones later, and feeding them back into the model’s prompt is called retrieval-augmented generation (RAG).
The AI still doesn’t have a brain, but it can act like it does. You choose what to remember, when to recall it, and how to feed it back into the conversation.
If the AI helps a user track project updates, you can store each project detail as a vector. When the user later asks, “What’s the status of the design phase?” you search your memory database, pull the most relevant notes, and let the LLM stitch them into a helpful answer.
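Put together, a minimal RAG loop looks like the sketch below. It reuses the FAISS index and embedding model from the earlier example, with stored_texts holding the original sentences in the order they were added to the index; the call_llm function is a hypothetical stand-in for whichever LLM API you use.
import numpy as np
def answer_with_memory(question, model, index, stored_texts, call_llm, top_k=3):
    # 1. Embed the user's question
    query_vector = model.encode(question)
    # 2. Retrieve the most similar stored notes from the FAISS index
    k = min(top_k, len(stored_texts))
    distances, indices = index.search(np.array([query_vector]), k)
    retrieved = [stored_texts[i] for i in indices[0]]
    # 3. Feed the retrieved notes plus the question to the LLM
    prompt = (
        "Use these notes to answer the question.\n"
        "Notes:\n" + "\n".join(f"- {note}" for note in retrieved) +
        f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # hypothetical LLM call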
The Limits of Vector-Based Memory
Simulated memory has limits. It retrieves based on similarity, not true understanding.
Sometimes the most relevant match isn’t actually useful. Sometimes the vector space misses nuance. And it doesn’t evolve.
There’s no way for the model to update or refine its understanding unless you explicitly store new embeddings.
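Refining a memory usually just means overwriting it. In Pinecone, for instance, upserting a vector under an existing ID replaces the old one. Continuing the Pinecone example above:
# Overwrite the earlier preference by upserting a new embedding under the same ID
new_embedding = model.encode("User now prefers vegan food")
index.upsert([("user-pref-001", new_embedding.tolist())])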
Also, not everything should be remembered. Vector stores raise privacy and ethical questions.
Who decides what gets saved? How long is it kept? Can a user delete it?
Conclusion
Vector stores give AI agents a way to fake memory — and they do it well. By embedding text into vectors and using tools like FAISS or Pinecone, we give models the power to recall what matters. It’s not real memory. But it makes AI systems feel more personal, more helpful, and more human.
As these tools grow more advanced, so does the illusion. But behind every smart AI is a simple system of vectors and similarity. If you can master that, you can build assistants that remember, learn, and improve with time.