LLMs have no persistent memory: each request starts from scratch. Semantic Kernel's memory subsystem bridges this gap by storing and retrieving information using vector embeddings. You can save facts, documents, or conversation snippets into a vector store, then recall the most relevant entries when constructing a prompt. This section covers the embedding pipeline, volatile and persistent stores, similarity search, and patterns for injecting recalled memories into your prompts.
1. How Semantic Memory Works
Semantic memory converts text into dense vector representations (embeddings), stores them in a searchable index, and retrieves the closest matches to a query at runtime. The process has three stages: save (embed and store), search (embed the query and find nearest neighbors), and recall (inject results into the prompt).
Unlike keyword search, semantic memory matches by meaning. A query for "machine learning optimization" will find documents about "gradient descent and loss functions" even if the exact words do not overlap. This makes it ideal for building RAG (Retrieval-Augmented Generation) systems within SK.
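"Closeness" of meaning is measured as the cosine similarity between embedding vectors. A minimal pure-Python sketch of that comparison, using made-up three-dimensional toy vectors in place of real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
query = [0.9, 0.1, 0.0]
doc_relevant = [0.8, 0.2, 0.1]    # points roughly the same way as the query
doc_unrelated = [0.0, 0.1, 0.9]   # points in a different direction
```

A vector store does exactly this comparison, just at scale: it indexes millions of vectors so the nearest neighbors can be found without scanning every entry.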
2. Configuring Embedding Services
Before you can store memories, the kernel needs an embedding service. SK supports OpenAI, Azure OpenAI, and Hugging Face embedding models.
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding

kernel = sk.Kernel()

# Register an embedding service
kernel.add_service(
    OpenAITextEmbedding(
        service_id="embeddings",
        ai_model_id="text-embedding-3-small",
        api_key="sk-...",
    )
)

# The embedding service converts text to vectors
# text-embedding-3-small produces 1536-dimensional vectors
The choice of embedding model affects both quality and cost. Larger models (like text-embedding-3-large with 3072 dimensions) capture more nuance but cost more per request and require more storage. For most applications, text-embedding-3-small offers an excellent balance.
Embeddings are model-specific. If you switch from text-embedding-ada-002 to text-embedding-3-small, you must re-embed all stored documents. Mixing embeddings from different models in the same collection produces meaningless search results because the vector spaces are incompatible.
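One cheap safeguard is to record which embedding model each collection was built with and refuse writes that do not match. The registry class and dimension table below are illustrative, not part of SK:

```python
# Published dimensionalities of the OpenAI embedding models
EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

class CollectionRegistry:
    """Tracks which embedding model each collection was built with."""

    def __init__(self):
        self._models: dict[str, str] = {}

    def check(self, collection: str, model: str) -> None:
        """Raise if the collection was created with a different model."""
        existing = self._models.setdefault(collection, model)
        if existing != model:
            raise ValueError(
                f"Collection '{collection}' was embedded with {existing}; "
                f"re-embed it before switching to {model}."
            )
```

Note that the check compares model names, not dimensions: text-embedding-ada-002 and text-embedding-3-small both produce 1536-dimensional vectors, so a dimension check alone would silently miss that incompatible switch.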
3. Volatile Memory Store
The volatile memory store keeps everything in RAM. It is perfect for prototyping, testing, and applications where the knowledge base is small and rebuilt on each startup.
from semantic_kernel.memory import SemanticTextMemory, VolatileMemoryStore

# Create a volatile (in-memory) store
memory_store = VolatileMemoryStore()

memory = SemanticTextMemory(
    storage=memory_store,
    embeddings_generator=kernel.get_service("embeddings"),
)

# Save some facts into a collection
collection = "company_policies"

await memory.save_information(
    collection=collection,
    id="policy_001",
    text="Employees receive 20 days of paid vacation per year.",
)
await memory.save_information(
    collection=collection,
    id="policy_002",
    text="Remote work is allowed up to 3 days per week with manager approval.",
)
await memory.save_information(
    collection=collection,
    id="policy_003",
    text="The company matches 401k contributions up to 6% of salary.",
)
Each saved item has an id (used for deduplication and updates), text (the content that gets embedded and stored), and a collection name (for organizing memories by domain).
4. Searching Semantic Memory
Retrieval uses cosine similarity to find the stored items most relevant to a query. You specify the collection to search and the number of results to return.
# Search for relevant policies
results = await memory.search(
    collection="company_policies",
    query="How many vacation days do I get?",
    limit=2,
    min_relevance_score=0.7,
)

for result in results:
    print(f"[{result.relevance:.3f}] {result.text}")

# [0.923] Employees receive 20 days of paid vacation per year.
# [0.741] Remote work is allowed up to 3 days per week...
The min_relevance_score parameter filters out low-quality matches. Scores range from 0.0 (unrelated) to 1.0 (identical). A threshold of 0.7 is a reasonable starting point; tune it based on your domain and embedding model.
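The two parameters compose: limit caps how many results come back, and min_relevance_score drops weak matches first. The equivalent post-filtering logic, sketched over plain (score, text) pairs to make the behavior concrete:

```python
def filter_results(scored: list[tuple[float, str]], limit: int,
                   min_relevance_score: float) -> list[tuple[float, str]]:
    """Keep the top-`limit` items at or above the relevance threshold."""
    kept = [(s, t) for s, t in scored if s >= min_relevance_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:limit]

candidates = [(0.923, "vacation policy"), (0.741, "remote work"), (0.402, "parking rules")]
top = filter_results(candidates, limit=2, min_relevance_score=0.7)
# top == [(0.923, "vacation policy"), (0.741, "remote work")]
```

Raising the threshold trades recall for precision: with a too-strict threshold, a query can legitimately return zero results, so downstream code should handle an empty list.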
5. Persistent Vector Stores
For production systems, you need persistence. SK supports multiple vector database backends including Azure AI Search, Qdrant, Pinecone, Chroma, and Weaviate.
# Using Qdrant as a persistent vector store
from semantic_kernel.connectors.memory.qdrant import QdrantMemoryStore

# Connect to a Qdrant instance
memory_store = QdrantMemoryStore(
    url="http://localhost:6333",
    vector_size=1536,  # Must match your embedding model
)

# Or use Azure AI Search
from semantic_kernel.connectors.memory.azure_cognitive_search import (
    AzureCognitiveSearchMemoryStore,
)

memory_store = AzureCognitiveSearchMemoryStore(
    endpoint="https://my-search.search.windows.net",
    admin_key="...",
)

# The SemanticTextMemory API is identical regardless of backend
memory = SemanticTextMemory(
    storage=memory_store,
    embeddings_generator=kernel.get_service("embeddings"),
)
Start development with VolatileMemoryStore and switch to a persistent backend when you move to staging. Because the SemanticTextMemory API is the same for all backends, the only code change is the store constructor. This is a good example of the strategy pattern in action.
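One way to exploit that interchangeability is a small factory that picks the store from configuration, so development and staging differ only in an environment variable. The sketch below injects the store constructors as parameters to keep the selection logic self-contained; in real code you would pass VolatileMemoryStore and QdrantMemoryStore, and the MEMORY_BACKEND variable name is an assumption, not an SK convention:

```python
def choose_store(env: dict, volatile_factory, qdrant_factory):
    """Pick a memory-store factory based on a MEMORY_BACKEND setting.

    Defaults to the volatile store when no backend is configured.
    """
    backend = env.get("MEMORY_BACKEND", "volatile")
    if backend == "qdrant":
        # vector_size must match the embedding model (1536 for text-embedding-3-small)
        return qdrant_factory(url=env["QDRANT_URL"], vector_size=1536)
    return volatile_factory()
```

Because SemanticTextMemory only sees the store interface, nothing else in the application changes when the environment flips the backend.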
6. Injecting Memories into Prompts
The real power of semantic memory comes from combining retrieved context with prompt templates. The pattern is straightforward: search memory, format the results, and include them as a template variable.
from semantic_kernel.functions import KernelArguments

async def answer_with_memory(kernel, memory, question: str) -> str:
    """Answer a question using retrieved memory as context."""
    # Step 1: Retrieve relevant memories
    results = await memory.search(
        collection="company_policies",
        query=question,
        limit=3,
    )

    # Step 2: Format context, keeping only sufficiently relevant results
    context = "\n".join(
        f"- {r.text}" for r in results if r.relevance > 0.7
    )

    # Step 3: Build and invoke the prompt
    prompt = """Use the following company policy excerpts to answer the question.
If the answer is not in the excerpts, say "I don't have that information."

Relevant policies:
{{$context}}

Question: {{$question}}
Answer:"""

    result = await kernel.invoke_prompt(
        prompt=prompt,
        arguments=KernelArguments(context=context, question=question),
    )
    return str(result)

answer = await answer_with_memory(
    kernel, memory, "Can I work from home on Fridays?"
)
print(answer)
This is the RAG pattern implemented entirely within Semantic Kernel: the LLM generates an answer grounded in your retrieved documents rather than relying solely on its training data.
7. Memory as a Plugin
SK provides a built-in TextMemoryPlugin that exposes memory operations as kernel functions. This allows planners and automatic function calling to use memory retrieval as part of their orchestration.
from semantic_kernel.core_plugins import TextMemoryPlugin

# Register memory as a plugin
kernel.add_plugin(
    TextMemoryPlugin(memory),
    plugin_name="Memory",
)

# Now the planner can call Memory.recall and Memory.save
# as part of its generated plan.
# For example, a user asking "Remember that my favorite color is blue"
# would trigger Memory.save, and "What's my favorite color?"
# would trigger Memory.recall.
Giving a planner write access to memory (the save function) means the LLM can store arbitrary information. In multi-tenant applications, this can lead to data leakage between users. Always scope memory collections per user or tenant, and consider restricting the planner to read-only memory access in production.
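A simple way to enforce that scoping is to derive the collection name from the tenant ID, so one tenant's queries can never reach another tenant's memories. A hypothetical helper (the naming scheme is an assumption, not an SK convention):

```python
import re

def tenant_collection(tenant_id: str, domain: str) -> str:
    """Build a per-tenant collection name like 'tenant_acme__company_policies'.

    Characters outside [a-zA-Z0-9_-] are replaced so the name stays valid
    across vector-store backends with stricter naming rules.
    """
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", tenant_id)
    return f"tenant_{safe}__{domain}"

# Every memory call then goes through the scoped name, e.g.:
# await memory.search(collection=tenant_collection("acme", "company_policies"), ...)
```

Routing every save and search through a helper like this makes cross-tenant access a code-review-visible mistake rather than a silent data leak.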
8. Batch Ingestion and Document Processing
Real-world applications need to ingest large document sets. The following pattern processes a directory of text files into semantic memory, with chunking to handle documents that exceed the embedding model's token limit.
from pathlib import Path

async def ingest_documents(
    memory: SemanticTextMemory,
    docs_dir: str,
    collection: str,
    chunk_size: int = 500,
    overlap: int = 50,
):
    """Ingest text files into semantic memory with chunking."""
    for path in Path(docs_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        words = text.split()

        # Split into overlapping chunks
        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i : i + chunk_size])
            chunk_id = f"{path.stem}_chunk_{i}"
            await memory.save_information(
                collection=collection,
                id=chunk_id,
                text=chunk,
                description=f"Chunk from {path.name}",
            )

    print(f"Ingested documents from {docs_dir} into '{collection}'")

await ingest_documents(memory, "./docs/policies", "company_policies")
Chunking strategy significantly affects retrieval quality. Smaller chunks (200 to 500 words) are more precise but may lose context. Larger chunks preserve context but may dilute relevance. Overlap ensures that information spanning a chunk boundary is captured in at least one chunk.
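The overlap guarantee is easiest to see in isolation. A standalone version of the chunking step used above, with small numbers so the boundaries are visible:

```python
def chunk_words(words: list[str], chunk_size: int = 500,
                overlap: int = 50) -> list[list[str]]:
    """Split into chunks of `chunk_size` words; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [words[i : i + chunk_size] for i in range(0, len(words), step)]

words = [str(n) for n in range(12)]
chunks = chunk_words(words, chunk_size=5, overlap=2)
# chunks: [0..4], [3..7], [6..10], [9..11] -- the last 2 words of each chunk
# reappear at the start of the next, so no boundary-spanning fact is lost
```

The guard against overlap >= chunk_size matters: with a non-positive step the range would either loop forever or raise, so failing fast with a clear message is the safer behavior.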