
What is RAG?

Retrieval-Augmented Generation (RAG) gives your agents access to specific documents and data that aren’t part of their pre-training. Instead of relying solely on the language model’s general knowledge, RAG agents retrieve relevant information from your documents and use it to generate informed, grounded responses.

How RAG works

1. Retrieve relevant chunks

When an agent receives a query, the system searches your knowledge base for document chunks semantically similar to the query.

2. Augment the prompt

Retrieved chunks are injected into the agent’s prompt as additional context.

3. Generate informed response

The language model generates a response based on both its general knowledge AND the specific retrieved content.
The key insight: LLMs are excellent at reasoning and language generation, but they don’t know YOUR data. RAG bridges this gap by retrieving your data at query time and providing it as context.
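The three steps above can be sketched as a minimal retrieve-augment-generate loop. This toy Python sketch uses keyword overlap in place of real embeddings; `retrieve` and `augment` are illustrative names, not MagOneAI APIs:

```python
# Toy RAG loop: word overlap stands in for real embedding search.
def retrieve(query, chunks, top_k=2):
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def augment(query, retrieved):
    # Inject the retrieved chunks into the prompt as grounding context.
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Remote work requires completing the 90-day probationary period.",
    "Expense reports are due by the 5th of each month.",
]
prompt = augment("What is the remote work policy?",
                 retrieve("remote work policy", chunks))
```

The final `prompt` would then be sent to the LLM, which answers from the injected context rather than its training data alone.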

Why use RAG?

Knowledge cutoffs

LLMs have training cutoffs and don’t know information published after that date. RAG gives agents access to current information.

Internal data

Your organization’s policies, procedures, and documentation aren’t in the LLM’s training data. RAG makes this information accessible.

Grounded responses

RAG grounds agent outputs in specific source documents, reducing hallucinations and enabling citation of sources.

Dynamic knowledge

Update your knowledge base and agents immediately have access to new information — no model retraining required.

When to use RAG

RAG is ideal for:
  • HR policy assistants — Answer employee questions based on specific policy documents
  • Compliance reviewers — Verify that proposals comply with internal guidelines and regulatory requirements
  • Technical documentation Q&A — Help users find information in product documentation
  • Customer support — Answer questions based on knowledge bases and help articles
  • Contract analysis — Review contracts against your organization’s standard terms and conditions
  • Research assistants — Query large document collections to find relevant information
If your agent needs to know information specific to your organization, domain, or use case, you need RAG.

RAG pipeline architecture

MagOneAI implements a production-grade RAG pipeline with hybrid search, reranking, and advanced retrieval techniques.

Retrieval flow

1. Query embedding

The input query is converted to both a dense vector (semantic) and a sparse BM25 vector (keyword) using the configured embedding model.

2. Hybrid search

Qdrant performs parallel dense + BM25 sparse search with Reciprocal Rank Fusion (RRF) to merge results. This retrieves 50 initial candidates.

3. HyDE expansion (optional)

If HyDE is enabled, a hypothetical answer passage is generated and used as an additional search query. Results from both the original query and the HyDE passage are merged and deduplicated.

4. Reranking

A cross-encoder reranking model (BGE-reranker-v2-m3) scores each candidate’s relevance to the original query. Results are reordered by relevance with labels: HIGH, MEDIUM, or LOW.

5. Parent expansion (Small2Big)

If Small2Big chunking was used, child chunks are expanded to their parent chunks for richer context. Results are deduplicated by parent ID.

6. Context injection

Top-k results are formatted with source metadata and injected into the agent’s prompt.

Qdrant vector database

MagOneAI uses Qdrant as its vector database for semantic search and retrieval.
  • Hybrid search — Dense vector similarity + BM25 keyword matching with RRF fusion
  • Performance — Fast similarity search even over millions of vectors
  • Filtering — Combine vector similarity with metadata filtering (e.g., filter by kb_id)
  • Scalability — Handles large knowledge bases with consistent query latency
MagOneAI manages the Qdrant infrastructure for you. You simply upload documents and configure knowledge bases — the vector database operations happen automatically.

Creating and managing knowledge bases

Knowledge bases are collections of documents that agents can query. Each knowledge base has its own vector collection and can be attached to multiple agents.
1. Create a knowledge base in your project

Navigate to the Knowledge Bases section and click “Create Knowledge Base”:
  • Name — Descriptive name like “HR Policies” or “Product Documentation”
  • Description — What documents this knowledge base contains
  • Chunk size — Words per chunk (default: 400 words)
  • Chunk overlap — Overlap between chunks (default: 40 words)
  • Chunking strategy — Standard or Small2Big (see below)
  • Contextual chunking — Enable LLM-enriched chunk context (see below)
2. Upload documents

Upload documents to the knowledge base using drag-and-drop or file selection. Supported formats:
  • PDF (.pdf)
  • Microsoft Word (.docx, .doc)
  • Plain text (.txt)
  • Markdown (.md)
  • CSV (.csv)
  • Rich Text Format (.rtf)
You can upload multiple files simultaneously. Each file is processed asynchronously.
3. Documents are automatically chunked and embedded

MagOneAI processes your documents in the background:
  • Text is extracted from each document with section-level parsing
  • Content is split into chunks with section-aware splitting and title prepend
  • If contextual chunking is enabled, each chunk is enriched with LLM-generated situating context
  • Each chunk is embedded using both dense and sparse models for hybrid search
  • Vectors are stored in Qdrant with source metadata (filename, section, page number)
You can monitor processing status in the knowledge base detail view.
4. Attach the knowledge base to an agent

In the agent configuration, add the knowledge base under “Knowledge Bases”. You can attach multiple knowledge bases to a single agent. The agent will search across all attached knowledge bases when retrieving context.
5. The agent now retrieves relevant context when answering questions

When the agent executes in a workflow, it automatically queries attached knowledge bases based on the input, retrieves relevant chunks, and generates responses grounded in your documents.

Managing knowledge bases

Upload new documents at any time. They’re automatically processed and become immediately available for retrieval.

Chunking strategies

The quality of your RAG system depends heavily on how documents are chunked. MagOneAI supports two chunking strategies.

Standard chunking

Section-aware recursive splitting with title prepend for better embedding quality. How it works:
  1. Documents are parsed into sections (headings, paragraphs)
  2. Each section is recursively split using progressively finer separators: paragraph breaks (\n\n), then line breaks (\n), then sentence boundaries (.)
  3. Chunks receive a title/section prefix prepended to the embedding text (e.g., filename > Section Title: chunk text)
  4. Overlapping windows ensure information at chunk boundaries isn’t lost
Configuration:
  • Chunk size — Max words per chunk (default: 400)
  • Chunk overlap — Words of overlap between chunks (default: 40)
Best for: Most use cases. Works well with structured documents that have clear headings and sections.
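As a rough illustration of the title prepend, the text that gets embedded for each chunk might be assembled like this (a hypothetical helper, not the actual implementation):

```python
def embedding_text(filename, section_title, chunk_text):
    # Prepend "filename > Section Title: " so the embedding carries
    # document-level context alongside the chunk's own words.
    prefix = f"{filename} > {section_title}" if section_title else filename
    return f"{prefix}: {chunk_text}"

text = embedding_text(
    "remote_work_policy_2024.pdf",
    "Eligibility",
    "Employees must complete 90 days before applying.",
)
```

This is also why descriptive file names and clear section headings matter: both end up inside the embedded text.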

Small2Big chunking

A parent-child chunking strategy that retrieves on small chunks but expands to larger parent chunks for context. How it works:
  1. Documents are first split into parent chunks (default: 400 words, no overlap)
  2. Each parent chunk is sub-divided into smaller child chunks (default: 200 words)
  3. Child chunks store a reference to their parent (parent_id and parent_text)
  4. At retrieval time, search matches on precise small chunks, then expands to the full parent chunk for richer context
  5. Results are deduplicated by parent ID — only the highest-scoring child per parent is kept
Best for: When you need precise retrieval (matching on specific phrases) but want to provide broader context to the agent.
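The retrieval-time expansion and parent deduplication can be sketched in a few lines, using illustrative dictionaries in place of the real chunk records:

```python
def small2big_expand(hits, top_k=3):
    # hits: (score, child) pairs where each child carries parent_id and
    # parent_text. Keep only the highest-scoring child per parent, then
    # return the parent texts ranked by that best child score.
    best = {}
    for score, child in hits:
        pid = child["parent_id"]
        if pid not in best or score > best[pid][0]:
            best[pid] = (score, child["parent_text"])
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked[:top_k]]

hits = [
    (0.9, {"parent_id": "p1", "parent_text": "Full section on remote work."}),
    (0.7, {"parent_id": "p1", "parent_text": "Full section on remote work."}),
    (0.6, {"parent_id": "p2", "parent_text": "Full section on onboarding."}),
]
expanded = small2big_expand(hits)
```

Note that the two hits on parent `p1` collapse to one result: matching happens on precise child chunks, but the agent sees each parent only once.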

Contextual chunking

An optional LLM-enriched step that generates situating context for each chunk before embedding. This dramatically improves retrieval quality. How it works:
  1. A document summary is generated using an LLM (2-3 paragraphs covering document type, main topics, key entities)
  2. For each chunk, an LLM generates 1-2 sentences of situating context that describes how the chunk relates to the overall document
  3. The situating context is prepended to the chunk text before embedding
Configuration:
contextual_chunking:
  enabled: true
  llm_config_id: "your-llm-config"  # Required - LLM used for context generation
  summary_prompt: "..."  # Customizable document summary prompt
  context_prompt: "..."  # Customizable chunk context prompt
The context prompt receives: the document summary, the previous chunk, the current chunk, and the next chunk — giving the LLM full context to write accurate situating context. Best for: Knowledge bases where retrieval precision is critical. Adds processing time and cost during ingestion, but significantly improves retrieval quality.
Contextual chunking adds LLM cost during document ingestion (one call per chunk + one summary call per document). This is a one-time cost — retrieval performance is not affected.
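The prompt wiring might look roughly like this sketch, with a stub standing in for the configured LLM (function names and prompt wording are illustrative, not the internal prompts):

```python
def situate_chunk(llm, summary, prev_chunk, chunk, next_chunk):
    # Assemble the context prompt from the document summary and the
    # neighboring chunks, then prepend the generated situating context.
    prompt = (
        f"Document summary:\n{summary}\n\n"
        f"Previous chunk:\n{prev_chunk}\n\n"
        f"Current chunk:\n{chunk}\n\n"
        f"Next chunk:\n{next_chunk}\n\n"
        "Write 1-2 sentences situating the current chunk in the document."
    )
    context = llm(prompt)
    return f"{context}\n\n{chunk}"

# Stub LLM for illustration; a real run would call the configured model.
def stub_llm(prompt):
    return "This chunk covers remote work eligibility rules."

enriched = situate_chunk(stub_llm, "HR policy doc.", "", "90 days required.", "")
```

The enriched text (situating context plus original chunk) is what gets embedded, which is why retrieval quality improves.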

Chunk overlap

Overlap ensures important information at chunk boundaries isn’t lost: Without overlap:
Chunk 1: [...employee must complete 90 days]
Chunk 2: [of employment before remote work eligibility...]
The connection between “90 days” and “remote work eligibility” is split across chunks. With overlap:
Chunk 1: [...employee must complete 90 days of employment before]
Chunk 2: [complete 90 days of employment before remote work eligibility...]
Now both chunks contain the complete concept. Recommended overlap: 10% of chunk size (e.g., 40 words for 400-word chunks)
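A sliding-window splitter with overlap can be sketched as follows (word-based and simplified; the real splitter is also section-aware):

```python
def split_with_overlap(text, chunk_size=400, overlap=40):
    # Slide a chunk_size-word window forward by (chunk_size - overlap)
    # words each step, so consecutive chunks share `overlap` words.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Small numbers for illustration: 10 words, 6-word chunks, 2-word overlap.
chunks = split_with_overlap(" ".join(str(i) for i in range(10)),
                            chunk_size=6, overlap=2)
```

With the defaults (400-word chunks, 40-word overlap) the window advances 360 words per step, matching the 10% recommendation.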

KB retrieval modes

MagOneAI supports two retrieval modes that control how agents interact with knowledge bases.

Auto mode (default)

In auto mode (kb_retrieval_mode: "auto"), the system performs a single retrieval when the agent starts executing:
  1. The agent’s input is used as the search query
  2. All attached knowledge bases are searched in parallel
  3. Retrieved chunks are injected into the agent’s system prompt as static context
  4. The agent generates its response using this context
Best for: Simple Q&A, straightforward document lookup, and cases where the input query is a good search query.

Agentic mode

In agentic mode (kb_retrieval_mode: "agentic"), the agent can iteratively search knowledge bases during its reasoning:
  1. A __kb.search tool is added to the agent’s available tools
  2. The agent decides when and what to search based on its reasoning
  3. The agent can make multiple search calls with different queries
  4. Each search returns formatted results with source metadata and relevance scores
  5. The agent synthesizes information across multiple searches
Configuration on the agent:
capabilities:
  kb_retrieval_mode: "agentic"
  max_kb_searches: 10        # Max search calls per execution (1-50)
  hyde_enabled: true          # Enable HyDE query expansion
  hyde_llm_config_id: "..."   # LLM for HyDE passage generation
How agentic search works: The agent receives a KB search tool with this schema:
  • query (required) — The search query. The agent crafts specific queries based on its reasoning.
  • top_k (optional) — Number of results (1-20, default 5)
  • kb_id (optional) — Target a specific knowledge base by ID (only shown when multiple KBs are attached)
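For illustration, the tool schema described above might render to something like this hypothetical JSON-schema dictionary (the exact internal rendering is not documented here):

```python
# Hypothetical JSON-schema rendering of the __kb.search tool parameters.
kb_search_tool = {
    "name": "__kb.search",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "The search query."},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20,
                      "default": 5,
                      "description": "Number of results to return."},
            "kb_id": {"type": "string",
                      "description": "Target a specific knowledge base."},
        },
        "required": ["query"],
    },
}
```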
Results are returned with source metadata and relevance labels:
=== SOURCE: Remote Work Policy 2024.pdf > Section 3.2 [relevance: HIGH] ===
[Page 4] Employees must complete the initial 90-day probationary period before...
=== END ===
Best for: Complex queries that require multiple searches, research tasks, and cases where the initial input isn’t a good search query on its own.
Agentic RAG is more powerful but uses more LLM tokens (each search is a tool call in the agent loop). Use auto mode for simple lookups and agentic mode for complex research tasks.

HyDE (Hypothetical Document Embeddings)

HyDE is an advanced query expansion technique that improves retrieval by generating a hypothetical answer before searching.

How HyDE works

1. Generate hypothetical passage

Given the user’s query, an LLM generates a short passage (2-3 sentences) that would directly answer the question — as if quoting from a reference document.

2. Embed the hypothetical passage

The generated passage is embedded using both dense and sparse models, just like a real query.

3. Search with both queries

The system searches with both the original query embedding AND the hypothetical passage embedding, retrieving candidates from both.

4. Merge and deduplicate

Results from both searches are merged and deduplicated. This expanded candidate set is then reranked.

Why HyDE helps

User queries often use different vocabulary than source documents. For example:
  • User asks: “Can I work from home?”
  • Document says: “Remote work eligibility requires completion of the probationary period”
The hypothetical passage bridges this vocabulary gap by generating text that’s likely to use similar language to the source documents.
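The merge-and-deduplicate step can be sketched like this, with stubs standing in for the LLM and the search backend (all names are illustrative):

```python
def hyde_search(query, llm, search, top_k=5):
    # Generate a hypothetical answer passage, search with both the original
    # query and the passage, then merge by chunk id keeping the best score.
    passage = llm(f"Write 2-3 sentences that would answer: {query}")
    merged = {}
    for hit in search(query) + search(passage):
        cid, score = hit["id"], hit["score"]
        if cid not in merged or score > merged[cid]["score"]:
            merged[cid] = hit
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)[:top_k]

# Stubs for illustration only.
def stub_llm(prompt):
    return "Remote work eligibility requires completing probation."

def stub_search(text):
    # Pretend chunk c1 matches formal policy language much better.
    return [{"id": "c1", "score": 0.8 if "probation" in text else 0.3}]

results = hyde_search("Can I work from home?", stub_llm, stub_search)
```

Here the casual query alone scores 0.3 against the policy chunk, but the formally worded HyDE passage scores 0.8, and the merge keeps the higher score.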

Enabling HyDE

HyDE is configured per agent in the capabilities section:
capabilities:
  hyde_enabled: true
  hyde_llm_config_id: "your-llm-config"  # LLM used for passage generation
HyDE works with both auto and agentic retrieval modes. In agentic mode, HyDE passages are generated automatically for each __kb.search tool call.
HyDE adds one LLM call per search query. Use a fast, cost-effective model for HyDE passage generation — the passage doesn’t need to be perfect, just directionally helpful.

Hybrid search and reranking

MagOneAI uses a multi-stage retrieval pipeline for high-quality results. Every search query produces both:
  • Dense vector — Captures semantic meaning (what the text means)
  • Sparse BM25 vector — Captures keyword relevance (what words appear)
Qdrant runs both searches in parallel and merges results using Reciprocal Rank Fusion (RRF). This combines the strengths of both approaches:
  • Semantic search finds conceptually similar content even with different vocabulary
  • Keyword search finds exact term matches (names, acronyms, codes)
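RRF itself is only a few lines: each document scores the sum of 1/(k + rank) over the rankings it appears in. This sketch uses the conventional constant k = 60; MagOneAI’s exact parameters may differ:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # ranked by semantic similarity
sparse = ["b", "d", "a"]  # ranked by BM25 keyword relevance
fused = rrf_fuse([dense, sparse])
```

Document "b" wins because it ranks highly in both lists, even though neither list puts it first overall.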

Reranking

After hybrid search returns ~50 candidates, a cross-encoder reranking model rescores each result:
  • Model: BGE-reranker-v2-m3
  • Input: (query, candidate_text) pairs
  • Output: Relevance scores with labels
    • HIGH — score > 0.5 (highly relevant)
    • MEDIUM — -1.0 < score ≤ 0.5 (moderately relevant)
    • LOW — score ≤ -1.0 (marginally relevant)
Top-k results after reranking are returned to the agent.
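The thresholds above map to a simple labeling function:

```python
def relevance_label(score):
    # Map a reranker score to the HIGH/MEDIUM/LOW labels shown in results.
    if score > 0.5:
        return "HIGH"
    if score > -1.0:
        return "MEDIUM"
    return "LOW"
```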
Reranking is optional and can be enabled/disabled via configuration. When disabled, results are ordered by hybrid search score only.

How RAG works in agent execution

When a RAG agent executes within a workflow, the retrieval and generation process follows a precise sequence:

Detailed execution flow

1. Agent receives query/input

The agent receives input from the workflow — typically a question or task that requires knowledge base consultation. Example input:
{
  "question": "What is our policy on remote work for new employees?",
  "context": "employee_onboarding"
}
2. Query is embedded (dual encoding)

The input is converted to both a dense embedding vector and a sparse BM25 vector using the configured embedding model. This dual encoding enables hybrid search — combining semantic and keyword matching.
3. Hybrid search against knowledge base vectors

Qdrant performs parallel dense + sparse search with RRF fusion:
  • Dense prefetch retrieves 150 candidates by semantic similarity
  • BM25 prefetch retrieves 150 candidates by keyword relevance
  • RRF fusion merges and deduplicates to top 50 candidates
  • Results are filtered by knowledge base ID
4. Reranking and parent expansion

Candidates are reranked by a cross-encoder model, then Small2Big parent expansion is applied if applicable. Final top-k results are returned with:
  • Chunk text — The actual document content (or parent text if expanded)
  • Source metadata — Filename, section heading, page number
  • Relevance score — Reranker score with HIGH/MEDIUM/LOW label
5. Context injection into agent prompt

Retrieved chunks are formatted and added to the agent’s context:
=== SOURCE: Remote Work Policy 2024.pdf > Section 3.2 [relevance: HIGH] ===
[Page 4] Employees must complete the initial 90-day probationary period
before becoming eligible for remote work arrangements...
=== END ===

=== SOURCE: Employee Handbook.pdf > Onboarding [relevance: MEDIUM] ===
[Page 12] New employees are assigned a buddy during their first 90 days...
=== END ===
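A chunk-formatting helper matching the layout above might look like this (illustrative, not the internal formatter):

```python
def format_chunk(source, section, page, label, text):
    # Render one retrieved chunk in the === SOURCE ... === block format.
    header = f"=== SOURCE: {source} > {section} [relevance: {label}] ==="
    return f"{header}\n[Page {page}] {text}\n=== END ==="

block = format_chunk(
    "Remote Work Policy 2024.pdf", "Section 3.2", 4, "HIGH",
    "Employees must complete the initial 90-day probationary period...",
)
```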
6. Agent generates response grounded in retrieved documents

The LLM generates a response using:
  • Its general language understanding and reasoning capabilities
  • The specific content from retrieved chunks
  • The agent’s persona and instructions
The response is grounded in your documents rather than the model’s general training data.

Best practices

Keep documents focused and well-structured

Good document structure:
  • Clear headings and sections
  • Logical information hierarchy
  • Consistent formatting
  • One topic per document or section
Well-structured documents chunk better and retrieve more accurately. Section headings are used in the chunk title prefix, improving embedding quality.

Use descriptive file names

File names appear in source citations and are prepended to chunk embeddings: Good file names:
  • remote_work_policy_2024.pdf
  • employee_onboarding_checklist.pdf
  • gdpr_compliance_guidelines.pdf
Poor file names:
  • document_final_v3.pdf
  • policy.pdf
  • untitled.pdf

Choose the right retrieval mode

Scenario                                        | Recommended Mode
Simple Q&A with direct questions                | Auto
Complex research requiring multiple searches    | Agentic
Agents that need to explore a topic iteratively | Agentic
High-volume, cost-sensitive workloads           | Auto
Multi-KB searches with targeted queries         | Agentic

Enable HyDE for vocabulary mismatch

If your users ask questions using different terminology than your source documents, enable HyDE. It’s especially helpful for:
  • Technical documentation with domain-specific jargon
  • Policy documents with formal language
  • Multi-language knowledge bases

Use contextual chunking for high-stakes use cases

Contextual chunking significantly improves retrieval quality at the cost of higher ingestion time and LLM usage. Enable it for:
  • Compliance and regulatory documents
  • Legal contracts and policies
  • Medical or financial documents where precision matters

Test retrieval quality before production deployment

Before deploying RAG agents to production:
1. Create a test query set

Build a set of representative questions your agents will receive.
2. Evaluate retrieval

For each test query, examine the retrieved chunks:
  • Are the most relevant chunks retrieved?
  • Is irrelevant content being retrieved?
  • Are there gaps in coverage?
3. Iterate on configuration

Adjust chunk size, chunking strategy, retrieval mode, and HyDE settings based on the results.
4. Test end-to-end agent performance

Evaluate not just retrieval, but agent answer quality using the retrieved context.
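One simple metric for step 2 is recall@k: the fraction of test queries whose known-relevant chunk appears in the top-k results. A sketch with a stub retriever (names and stub behavior are illustrative):

```python
def recall_at_k(test_set, retrieve, k=5):
    # test_set: (query, expected_chunk_id) pairs.
    hits = 0
    for query, expected_id in test_set:
        retrieved_ids = [hit["id"] for hit in retrieve(query)[:k]]
        if expected_id in retrieved_ids:
            hits += 1
    return hits / len(test_set)

# Stub retriever for illustration; a real test would call your KB search.
def stub_retrieve(query):
    return [{"id": "remote" if "remote" in query else "other"}]

score = recall_at_k([("remote work?", "remote"), ("pto policy?", "pto")],
                    stub_retrieve, k=1)
```

Tracking this number while you adjust chunk size, strategy, and HyDE settings turns step 3 into a measurable iteration loop.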

Troubleshooting RAG issues

Agent can’t find relevant information

Possible causes:
  • Documents aren’t in the knowledge base
  • Chunk size is too small or too large
  • Query and documents use different terminology
  • Contextual chunking not enabled for complex documents
Solutions:
  • Verify documents uploaded and processed
  • Experiment with different chunk sizes (200-600 words)
  • Enable HyDE to bridge vocabulary gaps
  • Enable contextual chunking for better chunk embeddings
  • Try agentic mode so the agent can craft better search queries
Retrieval returns irrelevant content

Possible causes:
  • Knowledge base contains too much diverse content
  • Chunk size is too large
  • Reranking not enabled
Solutions:
  • Split knowledge bases by domain
  • Reduce chunk size
  • Enable reranking to improve precision
  • Use Small2Big chunking for precise matching with broader context
Answers aren’t grounded in the documents

Possible causes:
  • Relevant chunks not retrieved (retrieval problem)
  • Relevant chunks retrieved but not used by LLM (generation problem)
  • Persona doesn’t emphasize grounding in documents
Solutions:
  • Test retrieval separately from generation
  • Update persona to emphasize: “Base your answer ONLY on the provided documents”
  • Switch to agentic mode so the agent actively searches for information
  • Increase top-k to provide more context
Retrieval is slow

Possible causes:
  • Knowledge base is very large
  • HyDE adding latency (extra LLM call per search)
  • Too many knowledge bases attached to agent
Solutions:
  • Monitor Qdrant performance metrics
  • Use a faster LLM for HyDE passage generation
  • Reduce the number of attached knowledge bases
  • Consider splitting large KBs into smaller, focused ones

Next steps

Personas and prompts

Craft prompts that effectively use retrieved context

Building workflows

Integrate RAG agents into Temporal workflows

Agent node

Configure agents with RAG in your workflows