What is RAG?

Retrieval-Augmented Generation (RAG) gives your agents access to specific documents and data that aren’t part of their pre-training. Instead of relying solely on the language model’s general knowledge, RAG agents retrieve relevant information from your documents and use it to generate informed, grounded responses.

How RAG works

1. Retrieve relevant chunks: When an agent receives a query, the system searches your knowledge base for document chunks semantically similar to the query.
2. Augment the prompt: Retrieved chunks are injected into the agent’s prompt as additional context.
3. Generate informed response: The language model generates a response based on both its general knowledge AND the specific retrieved content.
The key insight: LLMs are excellent at reasoning and language generation, but they don’t know YOUR data. RAG bridges this gap by retrieving your data at query time and providing it as context.
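
In code, this retrieve-augment-generate loop looks roughly like the following sketch. It uses the public OpenAI and qdrant-client SDKs for illustration only; the collection name, payload field, and model names are assumptions, not MagOneAI internals.

# Illustrative retrieve-augment-generate loop (not MagOneAI's internal code).
# Assumes an existing Qdrant collection "hr_policies" whose vectors were created
# with the same embedding model used here, and a "text" field in each payload.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")  # placeholder URL

question = "What is our policy on remote work for new employees?"

# 1. Retrieve: embed the query and find semantically similar chunks
query_vector = openai_client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding
hits = qdrant.search(collection_name="hr_policies", query_vector=query_vector, limit=5)

# 2. Augment: inject the retrieved chunks into the prompt as context
context = "\n\n".join(hit.payload["text"] for hit in hits)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# 3. Generate: the LLM answers grounded in the retrieved content
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)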

Why use RAG?

Knowledge cutoffs

LLMs have training cutoffs and don’t know information published after that date. RAG gives agents access to current information.

Internal data

Your organization’s policies, procedures, and documentation aren’t in the LLM’s training data. RAG makes this information accessible.

Grounded responses

RAG grounds agent outputs in specific source documents, reducing hallucinations and enabling citation of sources.

Dynamic knowledge

Update your knowledge base and agents immediately have access to new information — no model retraining required.

When to use RAG

RAG is ideal for:
  • HR policy assistants — Answer employee questions based on specific policy documents
  • Compliance reviewers — Verify that proposals comply with internal guidelines and regulatory requirements
  • Technical documentation Q&A — Help users find information in product documentation
  • Customer support — Answer questions based on knowledge bases and help articles
  • Contract analysis — Review contracts against your organization’s standard terms and conditions
  • Research assistants — Query large document collections to find relevant information
If your agent needs to know information specific to your organization, domain, or use case, you need RAG.

Qdrant vector database

MagOneAI uses Qdrant as its vector database for semantic search and retrieval. Qdrant is purpose-built for similarity search over high-dimensional vectors.

How Qdrant powers RAG

1. Document ingestion: When you upload documents to a knowledge base, they’re processed and split into chunks.
2. Embedding generation: Each chunk is converted to a high-dimensional vector using an embedding model.
3. Vector storage: Vectors are stored in Qdrant with metadata (source document, chunk position, etc.).
4. Query-time retrieval: When an agent queries the knowledge base, the query is embedded and Qdrant performs a similarity search.
5. Ranking and return: Qdrant returns the top-k most similar chunks based on vector distance.
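
MagOneAI performs all of these steps for you, but as a rough illustration of what ingestion involves, here is a sketch using the public OpenAI and qdrant-client SDKs; the collection name, sample chunks, and payload fields are assumptions.

# Illustrative ingestion sketch (MagOneAI does this automatically for you).
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")  # placeholder URL

# 1. Document ingestion: assume the document has already been split into chunks
chunks = [
    "Employees may work remotely after completing 90 days of employment...",
    "Remote work requests must be approved by the employee's manager...",
]

# Create a collection sized for text-embedding-3-small (1536 dimensions)
qdrant.create_collection(
    collection_name="hr_policies",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

points = []
for i, chunk in enumerate(chunks):
    # 2. Embedding generation: one vector per chunk
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding
    # 3. Vector storage: keep the vector together with source metadata in the payload
    points.append(PointStruct(
        id=i,
        vector=vector,
        payload={"text": chunk, "source": "Remote Work Policy 2024.pdf", "chunk_position": i},
    ))

qdrant.upsert(collection_name="hr_policies", points=points)
# Steps 4-5 (query-time retrieval and ranking) are shown in the earlier sketch.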

Why Qdrant?

  • Performance — Fast similarity search even over millions of vectors
  • Scalability — Handles large knowledge bases with consistent query latency
  • Filtering — Combine vector similarity with metadata filtering (e.g., “search only in contract documents”)
  • Hybrid search — Blend semantic similarity with keyword matching for better retrieval quality
MagOneAI manages the Qdrant infrastructure for you. You simply upload documents and configure knowledge bases — the vector database operations happen automatically.

Creating and managing knowledge bases

Knowledge bases are collections of documents that agents can query. Each knowledge base has its own Qdrant collection and can be attached to multiple agents.
1. Create a knowledge base in your project: Navigate to the Knowledge Bases section, click “Create Knowledge Base”, and configure:
  • Name — Descriptive name like “HR Policies” or “Product Documentation”
  • Description — What documents this knowledge base contains
  • Embedding model — Select the embedding model (default: text-embedding-3-small)
  • Chunk size — How large each document chunk should be (default: 512 tokens)
  • Chunk overlap — How much chunks should overlap (default: 50 tokens)
2. Upload documents: Use drag-and-drop or file selection to add files to the knowledge base. Supported formats:
  • PDF (.pdf)
  • Microsoft Word (.docx, .doc)
  • Plain text (.txt)
  • Markdown (.md)
  • CSV (.csv)
  • Rich Text Format (.rtf)
You can upload multiple files simultaneously. Each file is processed asynchronously.
3. Documents are automatically chunked and embedded: MagOneAI processes your documents in the background:
  • Text is extracted from each document
  • Content is split into chunks based on your configured chunk size
  • Each chunk is embedded using the selected embedding model
  • Vectors are stored in Qdrant with source metadata
You can monitor processing status in the knowledge base detail view.
4. Attach the knowledge base to an agent: In the agent configuration, add the knowledge base under “Knowledge Bases”. You can attach multiple knowledge bases to a single agent; the agent will search across all attached knowledge bases when retrieving context.
5. The agent now retrieves relevant context when answering questions: When the agent executes in a workflow, it automatically queries attached knowledge bases based on the input, retrieves relevant chunks, and generates responses grounded in your documents.

Managing knowledge bases

Upload new documents at any time. They’re automatically processed and become immediately available for retrieval.
RAG agents are ideal for HR policy assistants, compliance reviewers, technical documentation Q&A, and any use case where the agent needs to answer from your specific documents.

How RAG works in agent execution

When a RAG agent executes within a workflow, the retrieval and generation process follows a precise sequence:

Detailed execution flow

1. Agent receives query/input: The agent receives input from the workflow — typically a question or task that requires knowledge base consultation. Example input:
{
  "question": "What is our policy on remote work for new employees?",
  "context": "employee_onboarding"
}
2. Query is embedded: The input question is converted to a vector using the same embedding model used for the knowledge base. This ensures queries and documents exist in the same vector space for accurate similarity search.
3. Similarity search against knowledge base vectors: Qdrant performs a similarity search to find the document chunks most relevant to the query.
  • Default: retrieves top-5 chunks
  • Configurable: you can adjust the number of chunks (k) based on context window size and retrieval quality needs
  • Metadata filtering: optionally filter by document type, date, or custom metadata
4. Top-k relevant chunks are retrieved: The most similar chunks are returned with:
  • Chunk text — The actual document content
  • Source metadata — Which document the chunk came from
  • Similarity score — How relevant the chunk is (0-1)
  • Position metadata — Where in the document this chunk appeared
5. Chunks are injected into the agent’s prompt as context: Retrieved chunks are formatted and added to the agent’s system prompt (see the assembly sketch after this flow):
You are an HR policy assistant. Answer the user's question based on the
following policy documents:

---
SOURCE: Remote Work Policy 2024.pdf

[Chunk text about remote work eligibility...]

---
SOURCE: Employee Handbook.pdf

[Chunk text about onboarding procedures...]

---

User question: What is our policy on remote work for new employees?

Answer based on the provided policy documents. If the answer isn't in
the documents, say so clearly.
6. Agent generates response grounded in retrieved documents: The LLM generates a response using:
  • Its general language understanding and reasoning capabilities
  • The specific content from retrieved chunks
  • The agent’s persona and instructions
The response is grounded in your documents rather than the model’s general training data.
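
For intuition, the formatting in step 5 can be pictured as a small function like the sketch below; the payload field names (source, text) are assumptions about how chunk metadata might be exposed, not a MagOneAI API.

# Illustrative prompt assembly for step 5 (payload field names are assumptions).

def build_grounded_prompt(question: str, hits: list) -> str:
    """Format retrieved chunks with SOURCE headers and append the user question."""
    sections = []
    for hit in hits:  # hits are Qdrant ScoredPoint results
        sections.append(f"---\nSOURCE: {hit.payload['source']}\n\n{hit.payload['text']}\n")
    context = "\n".join(sections)
    return (
        "You are an HR policy assistant. Answer the user's question based on the\n"
        "following policy documents:\n\n"
        f"{context}---\n\n"
        f"User question: {question}\n\n"
        "Answer based on the provided policy documents. If the answer isn't in\n"
        "the documents, say so clearly."
    )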

Retrieval configuration

You can configure retrieval behavior per agent:
agent:
  name: "HR Policy Assistant"
  knowledge_bases:
    - hr_policies
    - employee_handbook
  retrieval_config:
    top_k: 5  # Number of chunks to retrieve
    min_similarity: 0.7  # Minimum similarity score threshold
    reranking: true  # Use reranking model for better ordering
    metadata_filters:
      document_type: "policy"  # Only retrieve from policy docs
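
As a rough illustration of how these settings map onto a vector search, the sketch below translates top_k, min_similarity, and metadata_filters into a qdrant-client call; the collection name and placeholder vector are assumptions, and MagOneAI applies this configuration for you.

# Illustrative mapping of retrieval_config onto a Qdrant search call.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")  # placeholder URL
query_vector = [0.0] * 1536                         # placeholder for the embedded user question

hits = qdrant.search(
    collection_name="hr_policies",      # searched once per attached knowledge base
    query_vector=query_vector,
    limit=5,                            # top_k: 5
    score_threshold=0.7,                # min_similarity: 0.7
    query_filter=Filter(                # metadata_filters: document_type: "policy"
        must=[FieldCondition(key="document_type", match=MatchValue(value="policy"))]
    ),
)
# With reranking: true, the returned hits would then be reordered by a reranking model.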

Citation and source tracking

RAG agents can cite their sources, providing transparency and auditability:
{
  "answer": "New employees are eligible for remote work after completing 90 days of employment and receiving manager approval.",
  "sources": [
    {
      "document": "Remote Work Policy 2024.pdf",
      "chunk_id": "chunk_42",
      "similarity": 0.89,
      "excerpt": "...employees must complete the initial 90-day probationary period..."
    }
  ],
  "confidence": 0.92
}
This structured output allows downstream workflow nodes to access both the answer AND the supporting evidence.
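
A downstream node written in code might model this output with a small data structure like the following sketch; the classes and helper are purely illustrative, not a MagOneAI SDK type.

# Illustrative types for consuming a cited RAG answer downstream (not an SDK type).
from dataclasses import dataclass

@dataclass
class Source:
    document: str      # e.g. "Remote Work Policy 2024.pdf"
    chunk_id: str      # identifier of the retrieved chunk
    similarity: float  # similarity score in [0, 1]
    excerpt: str       # supporting text from the chunk

@dataclass
class CitedAnswer:
    answer: str
    sources: list[Source]
    confidence: float

def has_strong_support(result: CitedAnswer, min_similarity: float = 0.8) -> bool:
    """Example check a downstream node might run before acting on the answer."""
    return any(s.similarity >= min_similarity for s in result.sources)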

Chunking and embedding

The quality of your RAG system depends heavily on how documents are chunked and embedded.

Document chunking strategies

Chunking is the process of splitting documents into smaller segments for embedding and retrieval.
Fixed-size chunking

Split documents into chunks of a fixed token length. Configuration:
chunk_size: 512  # tokens per chunk
chunk_overlap: 50  # tokens of overlap between chunks
Advantages:
  • Predictable chunk sizes fit LLM context windows
  • Simple and fast
  • Works well for uniform documents
Disadvantages:
  • May split logical sections (paragraphs, lists) arbitrarily
  • Doesn’t respect document structure
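
A minimal sketch of fixed-size chunking with overlap, using the tiktoken tokenizer as an example; the encoding name and helper are assumptions and not necessarily how MagOneAI chunks documents.

# Illustrative fixed-size chunker with overlap (not MagOneAI's actual implementation).
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` tokens, overlapping by `chunk_overlap`."""
    enc = tiktoken.get_encoding("cl100k_base")  # example encoding
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap           # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):   # last window already reached the end
            break
    return chunks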

Chunk overlap

Overlap ensures important information at chunk boundaries isn’t lost.

Without overlap:
Chunk 1: [...employee must complete 90 days]
Chunk 2: [of employment before remote work eligibility...]
The connection between “90 days” and “remote work eligibility” is split across chunks.

With overlap:
Chunk 1: [...employee must complete 90 days of employment before]
Chunk 2: [complete 90 days of employment before remote work eligibility...]
Now both chunks contain the complete concept.

Recommended overlap: 10-20% of chunk size (e.g., 50-100 tokens for 512-token chunks).

Embedding models

Embedding models convert text to high-dimensional vectors. MagOneAI supports several options:
Model | Dimensions | Strengths | Use Cases
text-embedding-3-small | 1536 | Fast, cost-effective, good general performance | Default for most use cases
text-embedding-3-large | 3072 | Higher quality embeddings, better semantic understanding | Complex domain-specific knowledge bases
text-embedding-ada-002 | 1536 | Proven performance, widely used | Backward compatibility
Model selection considerations:
  • Performance vs. cost — Larger models produce better embeddings but cost more per token
  • Domain specificity — Domain-specific models (legal, medical) may outperform general models for specialized content
  • Consistency — Use the same embedding model for all documents in a knowledge base
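
As a quick sanity check, you can confirm a model’s output dimensions (which must match the knowledge base’s vector size) with a direct OpenAI SDK call like this illustrative snippet.

# Quick check of embedding dimensions using the OpenAI SDK (illustrative only).
from openai import OpenAI

client = OpenAI()
for model in ("text-embedding-3-small", "text-embedding-3-large"):
    vector = client.embeddings.create(model=model, input="remote work policy").data[0].embedding
    print(model, len(vector))  # 1536 and 3072 respectively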

How chunk size affects retrieval quality

Chunk size is a critical parameter that impacts both retrieval quality and LLM reasoning:
Small chunks (e.g., 256 tokens)

Advantages:
  • Precise retrieval of specific facts
  • Less noise in retrieved context
  • More chunks fit in LLM context window
Disadvantages:
  • May lack surrounding context needed for understanding
  • More chunks required to answer complex questions
  • Higher retrieval overhead
Best for: FAQ-style Q&A, fact lookup, specific information retrieval
Medium chunks (e.g., 512 tokens)

Advantages:
  • Balance between precision and context
  • Chunks are semantically self-contained
  • Good for most use cases
Disadvantages:
  • May retrieve some irrelevant content along with relevant information
  • Fewer chunks fit in context window
Best for: General knowledge base Q&A, policy consultation, documentation search. Recommended default: 512 tokens with 50-token overlap.

Large chunks (e.g., 1024 tokens)

Advantages:
  • Retrieves broad context around relevant information
  • Good for understanding complex relationships
  • Fewer retrieval calls needed
Disadvantages:
  • More noise in retrieved context
  • Fewer chunks fit in LLM context window
  • May retrieve content that’s partially relevant
Best for: Summarization tasks, complex analytical questions, research use cases

Best practices

Keep documents focused and well-structured

Good document structure:
  • Clear headings and sections
  • Logical information hierarchy
  • Consistent formatting
  • One topic per document or section
Poor document structure:
  • Wall-of-text with no sections
  • Mixed topics in single document
  • Inconsistent formatting
  • Unclear boundaries between concepts
Well-structured documents chunk better and retrieve more accurately.

Use descriptive file names

File names appear in source citations and help agents understand context.

Good file names:
  • remote_work_policy_2024.pdf
  • employee_onboarding_checklist.pdf
  • gdpr_compliance_guidelines.pdf
Poor file names:
  • document_final_v3.pdf
  • policy.pdf
  • untitled.pdf

Regularly update knowledge bases when source documents change

RAG systems reflect the documents in the knowledge base. When source documents change:
1. Identify updated documents: Track which source documents have been revised.
2. Upload new versions: Replace old documents with updated versions.
3. Verify processing: Ensure documents are re-chunked and re-embedded.
4. Test retrieval: Query the knowledge base to verify updated information is retrieved.
Stale knowledge bases lead to agents providing outdated information.

Test retrieval quality before production deployment

Before deploying RAG agents to production:
1. Create a test query set: Build a set of representative questions your agents will receive.
2. Evaluate retrieval: For each test query, examine the retrieved chunks:
  • Are the most relevant chunks retrieved?
  • Is irrelevant content being retrieved?
  • Are there gaps in coverage?
3. Measure metrics: Track retrieval metrics (a computation sketch follows this list):
  • Precision — What percentage of retrieved chunks are relevant?
  • Recall — What percentage of relevant chunks are retrieved?
  • MRR (Mean Reciprocal Rank) — How highly ranked is the first relevant chunk?
4. Iterate on configuration: Adjust chunk size, overlap, top-k, and embedding models based on metrics.
5. Test end-to-end agent performance: Evaluate not just retrieval, but agent answer quality using the retrieved context.

Optimize chunk size for your use case

No single chunk size is optimal for all use cases. Run experiments:
  1. Create knowledge bases with different chunk sizes (256, 512, 1024 tokens)
  2. Test agent performance on representative queries
  3. Measure answer quality and relevance
  4. Select the configuration that performs best for YOUR data and use case

Use metadata filtering for multi-domain knowledge bases

If your knowledge base contains different document types, use metadata filtering:
retrieval_config:
  metadata_filters:
    document_type: "technical_documentation"
    version: "latest"
    department: "engineering"
This prevents agents from retrieving irrelevant documents even if they’re semantically similar.

Monitor retrieval performance in production

Track retrieval quality metrics in production:
  • Average similarity scores — Declining scores may indicate knowledge base staleness
  • Retrieval latency — Monitor query performance as knowledge base grows
  • Zero-result queries — Track queries that retrieve no relevant chunks
  • Agent confidence scores — Low confidence may indicate poor retrieval quality
Use this telemetry to continuously improve your knowledge bases.
RAG is not a replacement for fine-tuning or prompt engineering. It’s a complementary technique that gives agents access to specific information. You still need well-crafted personas, clear instructions, and appropriate guardrails for production-quality agents.

Advanced RAG techniques

Once you’ve mastered basic RAG, consider these advanced techniques.

Hybrid search

Combine semantic similarity (vector search) with keyword matching (BM25):
  • Vector search finds semantically similar content
  • Keyword search finds exact term matches
  • Results are merged and re-ranked
Hybrid search improves retrieval when users include specific technical terms or entity names.
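
One common way to merge the two result lists is reciprocal rank fusion. The sketch below shows the generic technique over ranked chunk IDs; it is not necessarily how MagOneAI blends hybrid results.

# Reciprocal rank fusion: merge vector-search and keyword-search rankings (illustrative).

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked lists of chunk IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["chunk_42", "chunk_7", "chunk_13"]   # from semantic search
keyword_results = ["chunk_7", "chunk_99", "chunk_42"]  # from BM25 keyword search
print(reciprocal_rank_fusion([vector_results, keyword_results]))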

Reranking

After initial retrieval, use a reranking model to reorder chunks:
1. Initial retrieval: Qdrant returns the top-20 chunks based on vector similarity.
2. Reranking: A specialized reranking model scores each chunk’s relevance to the query.
3. Selection: The top-5 reranked chunks are used as agent context.
Reranking improves precision by using a more sophisticated relevance model.
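
A sketch of the reranking step using a cross-encoder from the sentence-transformers library; the model name is a commonly used example, not necessarily the reranker MagOneAI uses.

# Illustrative reranking with a cross-encoder (model name is an example choice).
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]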

Query expansion

Expand user queries before retrieval:
  • Generate variations of the query
  • Add synonyms and related terms
  • Retrieve using multiple query formulations
  • Combine and deduplicate results
Query expansion improves recall when users phrase questions differently than source documents.
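
A rough sketch of the expansion and merge steps, using an LLM to generate paraphrases; the chat model is an example choice, and the retrieval call itself is left out.

# Illustrative query expansion and result merging.
from openai import OpenAI

client = OpenAI()

def expand_query(question: str, n_variants: int = 3) -> list[str]:
    """Ask an LLM for paraphrases; retrieval then runs once per variant."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user",
                   "content": f"Rewrite this question {n_variants} different ways, one per line:\n{question}"}],
    )
    lines = [l.strip() for l in response.choices[0].message.content.splitlines() if l.strip()]
    return [question] + lines[:n_variants]

def merge_results(result_lists: list[list[str]]) -> list[str]:
    """Combine chunk IDs retrieved for each query variant and deduplicate, keeping order."""
    seen: dict[str, None] = {}
    for results in result_lists:
        for chunk_id in results:
            seen.setdefault(chunk_id, None)
    return list(seen)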

Contextual compression

Compress retrieved chunks to include only relevant sentences:
1. Retrieve chunks: Get the top-k chunks from Qdrant.
2. Extract relevant sentences: Use a compression model to identify which sentences in each chunk are most relevant to the query.
3. Compressed context: Provide only the relevant sentences to the agent, reducing noise and fitting more context in the window.
Compression increases the effective context window by removing irrelevant content.
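
One simple way to approximate compression is to score each sentence in a retrieved chunk against the query and keep only the closest ones. A minimal sketch using the OpenAI embeddings API; the naive sentence split and threshold are assumptions.

# Illustrative contextual compression via sentence-level similarity (simplified).
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def compress_chunk(query: str, chunk: str, threshold: float = 0.4) -> str:
    """Keep only the sentences in the chunk that are most similar to the query."""
    query_vec = embed(query)
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]  # naive sentence split
    kept = [s for s in sentences if cosine(embed(s), query_vec) >= threshold]
    return ". ".join(kept)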

Troubleshooting RAG issues

Relevant information isn’t being retrieved

Possible causes:
  • Documents aren’t in the knowledge base
  • Chunk size is too small or too large
  • Query and documents use different terminology
  • Embedding model doesn’t capture domain semantics
Solutions:
  • Verify documents uploaded and processed
  • Experiment with different chunk sizes
  • Use query expansion or synonyms
  • Try a different embedding model
Retrieved chunks are irrelevant or noisy

Possible causes:
  • Knowledge base contains too much diverse content
  • Chunk size is too large
  • No metadata filtering
Solutions:
  • Split knowledge bases by domain
  • Reduce chunk size
  • Add metadata filters to retrieval config
  • Use reranking to improve precision
Agent answers aren’t grounded in the documents

Possible causes:
  • Relevant chunks not retrieved (retrieval problem)
  • Relevant chunks retrieved but not used by LLM (generation problem)
  • Persona doesn’t emphasize grounding in documents
Solutions:
  • Test retrieval separately from generation
  • Update persona to emphasize: “Base your answer ONLY on the provided documents”
  • Add guardrails that require source citations
  • Increase top-k to provide more context
Retrieval is slow

Possible causes:
  • Knowledge base is very large
  • Qdrant cluster under-provisioned
  • Embedding model is slow
Solutions:
  • Monitor Qdrant performance metrics
  • Consider Qdrant scaling options
  • Use a faster embedding model
  • Implement caching for common queries

Next steps