What is RAG?
Retrieval-Augmented Generation (RAG) gives your agents access to specific documents and data that aren’t part of their pre-training. Instead of relying solely on the language model’s general knowledge, RAG agents retrieve relevant information from your documents and use it to generate informed, grounded responses.

How RAG works
Retrieve relevant chunks
Why use RAG?
Knowledge cutoffs
Internal data
Grounded responses
Dynamic knowledge
When to use RAG
RAG is ideal for:
- HR policy assistants — Answer employee questions based on specific policy documents
- Compliance reviewers — Verify that proposals comply with internal guidelines and regulatory requirements
- Technical documentation Q&A — Help users find information in product documentation
- Customer support — Answer questions based on knowledge bases and help articles
- Contract analysis — Review contracts against your organization’s standard terms and conditions
- Research assistants — Query large document collections to find relevant information
RAG pipeline architecture
MagOneAI implements a production-grade RAG pipeline with hybrid search, reranking, and advanced retrieval techniques.

Retrieval flow
Query embedding
Hybrid search
HyDE expansion (optional)
Reranking
Parent expansion (Small2Big)
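The five stages above can be sketched as a pipeline of composable functions. This is a minimal illustration with stubbed-out services, not MagOneAI's actual implementation; every function name here is hypothetical.

```python
from typing import Dict, List

def embed_query(query: str) -> List[float]:
    # Stub: the real pipeline calls a dense embedding model here.
    return [float(len(word)) for word in query.split()]

def hybrid_search(vector: List[float], query: str, top_n: int = 50) -> List[Dict]:
    # Stub: dense + BM25 prefetch with RRF fusion would run against Qdrant here.
    return [{"id": i, "text": f"chunk {i}", "score": 1.0 / (i + 1)} for i in range(top_n)]

def rerank(query: str, candidates: List[Dict], top_k: int) -> List[Dict]:
    # Stub: a cross-encoder rescoring (query, text) pairs would replace this sort.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]

def expand_parents(results: List[Dict]) -> List[Dict]:
    # Stub: Small2Big expansion would swap each child chunk for its parent text.
    return results

def retrieve(query: str, top_k: int = 5) -> List[Dict]:
    vector = embed_query(query)                # 1. query embedding (HyDE can add a second vector)
    candidates = hybrid_search(vector, query)  # 2-3. hybrid search (+ optional HyDE expansion)
    ranked = rerank(query, candidates, top_k)  # 4. reranking
    return expand_parents(ranked)              # 5. parent expansion
```

Each stub is a placeholder for a real service call; the value of the sketch is the ordering of the stages, which matches the flow above.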
Qdrant vector database
MagOneAI uses Qdrant as its vector database for semantic search and retrieval.
- Hybrid search — Dense vector similarity + BM25 keyword matching with RRF fusion
- Performance — Fast similarity search even over millions of vectors
- Filtering — Combine vector similarity with metadata filtering (e.g., filter by `kb_id`)
- Scalability — Handles large knowledge bases with consistent query latency
Creating and managing knowledge bases
Knowledge bases are collections of documents that agents can query. Each knowledge base has its own vector collection and can be attached to multiple agents.

Create a knowledge base in your project
- Name — Descriptive name like “HR Policies” or “Product Documentation”
- Description — What documents this knowledge base contains
- Chunk size — Words per chunk (default: 400 words)
- Chunk overlap — Overlap between chunks (default: 40 words)
- Chunking strategy — Standard or Small2Big (see below)
- Contextual chunking — Enable LLM-enriched chunk context (see below)
Upload documents
- PDF (.pdf)
- Microsoft Word (.docx, .doc)
- Plain text (.txt)
- Markdown (.md)
- CSV (.csv)
- Rich Text Format (.rtf)
Documents are automatically chunked and embedded
- Text is extracted from each document with section-level parsing
- Content is split into chunks with section-aware splitting and title prepend
- If contextual chunking is enabled, each chunk is enriched with LLM-generated situating context
- Each chunk is embedded using both dense and sparse models for hybrid search
- Vectors are stored in Qdrant with source metadata (filename, section, page number)
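The ingestion steps above can be sketched in a few lines. This is a simplified illustration, not the actual pipeline: the `ingest` and `chunk_words` names are hypothetical, and the real system also produces sparse BM25 vectors and section-level metadata.

```python
from typing import Callable, Dict, List

def chunk_words(words: List[str], size: int = 400, overlap: int = 40) -> List[str]:
    """Split a word list into overlapping chunks (defaults match the KB settings)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

def ingest(text: str, filename: str, kb_id: str,
           embed: Callable[[str], List[float]]) -> List[Dict]:
    """Turn one document into embedded points with source metadata."""
    points = []
    for index, chunk in enumerate(chunk_words(text.split())):
        points.append({
            "vector": embed(chunk),  # the real pipeline also stores a sparse vector
            "payload": {"kb_id": kb_id, "filename": filename,
                        "chunk_index": index, "text": chunk},
        })
    return points
```

The payload fields mirror the source metadata listed above, which is what enables `kb_id` filtering and source citations at query time.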
Attach the knowledge base to an agent
Managing knowledge bases
- Add documents
- Update documents
- Delete documents
- Test retrieval
- View statistics
Chunking strategies
The quality of your RAG system depends heavily on how documents are chunked. MagOneAI supports two chunking strategies.

Standard chunking
Section-aware recursive splitting with title prepend for better embedding quality.

How it works:
- Documents are parsed into sections (headings, paragraphs)
- Each section is recursively split using progressively finer separators: `\n\n` → `\n` → `.`
- Chunks receive a title/section prefix prepended to the embedding text (e.g., `filename > Section Title: chunk text`)
- Overlapping windows ensure information at chunk boundaries isn’t lost
- Chunk size — Max words per chunk (default: 400)
- Chunk overlap — Words of overlap between chunks (default: 40)
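Recursive splitting with a title prefix can be sketched as follows. This is a hedged illustration of the technique, not MagOneAI's implementation; the function names are invented for the example.

```python
def recursive_split(text, max_words=400, separators=("\n\n", "\n", ". ")):
    """Split text into pieces of at most max_words, trying coarse separators first."""
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # No separators left: fall back to a hard split by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    sep, rest = separators[0], separators[1:]
    parts, current, chunks = text.split(sep), [], []
    for part in parts:
        candidate = current + [part]
        if len(sep.join(candidate).split()) > max_words and current:
            # Current group is full: emit it (recursing with finer separators if needed).
            chunks.extend(recursive_split(sep.join(current), max_words, rest))
            current = [part]
        else:
            current = candidate
    if current:
        chunks.extend(recursive_split(sep.join(current), max_words, rest))
    return chunks

def with_title(filename, section, chunk):
    # Prepend "filename > Section Title: " to the text that gets embedded.
    return f"{filename} > {section}: {chunk}"
```

The title prefix is applied only to the embedded text, so retrieval benefits from the section context without it cluttering the content shown to the agent.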
Small2Big chunking
A parent-child chunking strategy that retrieves on small chunks but expands to larger parent chunks for context.

How it works:
- Documents are first split into parent chunks (default: 400 words, no overlap)
- Each parent chunk is sub-divided into smaller child chunks (default: 200 words)
- Child chunks store a reference to their parent (`parent_id` and `parent_text`)
- At retrieval time, search matches on precise small chunks, then expands to the full parent chunk for richer context
- Results are deduplicated by parent ID — only the highest-scoring child per parent is kept
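The parent-child structure and the dedup-by-parent step can be sketched like this. A minimal illustration with hypothetical names, assuming the default sizes above:

```python
def small2big_chunks(words, parent_size=400, child_size=200):
    """Build parent chunks and child chunks that reference their parent."""
    children = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_text = " ".join(parent_words)
        parent_id = p_start // parent_size
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "text": " ".join(parent_words[c_start:c_start + child_size]),
                "parent_id": parent_id,
                "parent_text": parent_text,
            })
    return children

def expand_and_dedupe(hits):
    """Keep only the highest-scoring child per parent, returning the parent text."""
    best = {}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        best.setdefault(hit["parent_id"], {"text": hit["parent_text"],
                                           "score": hit["score"]})
    return list(best.values())
```

Only child chunks are embedded and searched; the stored `parent_text` is what gets returned, which is the "search small, read big" trade-off.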
Contextual chunking
An optional LLM-enriched step that generates situating context for each chunk before embedding. This dramatically improves retrieval quality.

How it works:
- A document summary is generated using an LLM (2-3 paragraphs covering document type, main topics, key entities)
- For each chunk, an LLM generates 1-2 sentences of situating context that describes how the chunk relates to the overall document
- The situating context is prepended to the chunk text before embedding
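The enrichment step can be sketched as two small functions. The prompts and names here are illustrative, not MagOneAI's actual prompts; the LLM is injected as a callable so the sketch stays self-contained.

```python
from typing import Callable

def summarize_document(text: str, llm: Callable[[str], str]) -> str:
    """Produce a document summary (the real step asks for 2-3 paragraphs)."""
    return llm(f"Summarize this document, covering its type, topics, and key entities:\n{text}")

def situate_chunk(chunk: str, doc_summary: str, llm: Callable[[str], str]) -> str:
    """Prepend 1-2 sentences of LLM-generated situating context to a chunk."""
    context = llm(
        f"Document summary:\n{doc_summary}\n\nChunk:\n{chunk}\n\n"
        "In 1-2 sentences, describe how this chunk relates to the overall document."
    )
    return f"{context}\n\n{chunk}"  # this enriched text is what gets embedded
```

Because the summary is generated once per document and reused for every chunk, the cost scales with chunk count only through the short situating calls.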
Chunk overlap
Overlap ensures important information at chunk boundaries isn’t lost: without overlap, a sentence split across two chunks may not be fully retrievable from either.

KB retrieval modes
MagOneAI supports two retrieval modes that control how agents interact with knowledge bases.

Auto mode (default)
In auto mode (kb_retrieval_mode: "auto"), the system performs a single retrieval when the agent starts executing:
- The agent’s input is used as the search query
- All attached knowledge bases are searched in parallel
- Retrieved chunks are injected into the agent’s system prompt as static context
- The agent generates its response using this context
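The auto-mode steps above amount to one retrieval followed by prompt assembly, roughly like this. A hedged sketch with hypothetical names; the real prompt template is not shown in this doc.

```python
from typing import Callable, Dict, List

def build_system_prompt(persona: str, chunks: List[Dict]) -> str:
    """Inject retrieved chunks into the system prompt as static context."""
    context = "\n\n".join(
        f"[{c['filename']}, {c['section']}]\n{c['text']}" for c in chunks
    )
    return f"{persona}\n\nAnswer using the following documents:\n\n{context}"

def run_auto_mode(agent_input: str, persona: str,
                  search_all_kbs: Callable[[str], List[Dict]]) -> str:
    # One retrieval at agent start: the agent's input is the search query.
    chunks = search_all_kbs(agent_input)
    return build_system_prompt(persona, chunks)
```

Because retrieval happens once before generation, the agent cannot refine its query mid-answer; that is what agentic mode adds.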
Agentic mode
In agentic mode (kb_retrieval_mode: "agentic"), the agent can iteratively search knowledge bases during its reasoning:
- A `__kb.search` tool is added to the agent’s available tools
- The agent decides when and what to search based on its reasoning
- The agent can make multiple search calls with different queries
- Each search returns formatted results with source metadata and relevance scores
- The agent synthesizes information across multiple searches
The tool accepts these parameters:
- `query` (required) — The search query. The agent crafts specific queries based on its reasoning.
- `top_k` (optional) — Number of results (1-20, default 5)
- `kb_id` (optional) — Target a specific knowledge base by ID (only shown when multiple KBs are attached)
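These parameters map naturally onto a JSON-Schema-style tool definition. The exact schema MagOneAI registers isn't shown in this doc, so treat this as an illustrative shape, not the real definition:

```python
# Hypothetical tool definition mirroring the documented parameters.
KB_SEARCH_TOOL = {
    "name": "__kb.search",
    "description": "Search the attached knowledge bases and return relevant "
                   "chunks with source metadata and relevance scores.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "The search query, crafted by the agent."},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20,
                      "default": 5, "description": "Number of results."},
            "kb_id": {"type": "string",
                      "description": "Target a specific knowledge base by ID."},
        },
        "required": ["query"],
    },
}
```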
HyDE (Hypothetical Document Embeddings)
HyDE is an advanced query expansion technique that improves retrieval by generating a hypothetical answer before searching.

How HyDE works
Generate hypothetical passage
Embed the hypothetical passage
Search with both queries
Why HyDE helps
User queries often use different vocabulary than source documents. For example:
- User asks: “Can I work from home?”
- Document says: “Remote work eligibility requires completion of the probationary period”
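The vocabulary-bridging idea can be sketched as follows: the hypothetical passage is written in document-like language, so its embedding lands closer to the real answer than the raw question does. This is an illustration of the technique with injected stubs, not MagOneAI's implementation.

```python
from typing import Callable, Dict, List

def hyde_search(query: str,
                llm: Callable[[str], str],
                embed: Callable[[str], List[float]],
                search: Callable[[List[float]], List[Dict]]) -> List[Dict]:
    # 1. Generate a hypothetical passage that answers the query.
    passage = llm(f"Write a short passage that would answer: {query}")
    # 2. Search with both the original query embedding and the passage embedding.
    merged: Dict[object, Dict] = {}
    for vector in (embed(query), embed(passage)):
        for hit in search(vector):
            if hit["id"] not in merged or hit["score"] > merged[hit["id"]]["score"]:
                merged[hit["id"]] = hit
    # 3. Return the union, best score per document first.
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)
```

The extra LLM call is the latency cost noted in the troubleshooting section below.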
Enabling HyDE
HyDE is configured per agent in the capabilities section. Once enabled, the expansion is applied to every `__kb.search` tool call.
Hybrid search and reranking
MagOneAI uses a multi-stage retrieval pipeline for high-quality results.

Hybrid search
Every search query produces both:
- Dense vector — Captures semantic meaning (what the text means)
- Sparse BM25 vector — Captures keyword relevance (what words appear)
- Semantic search finds conceptually similar content even with different vocabulary
- Keyword search finds exact term matches (names, acronyms, codes)
Reranking
After hybrid search returns ~50 candidates, a cross-encoder reranking model rescores each result:
- Model: BGE-reranker-v2-m3
- Input: (query, candidate_text) pairs
- Output: Relevance scores with labels
- HIGH — Score > 0.5 (highly relevant)
- MEDIUM — Score > -1.0 (moderately relevant)
- LOW — Score ≤ -1.0 (marginally relevant)
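The score-to-label mapping above is simple enough to state directly in code; here is a faithful sketch (the function name is invented for the example):

```python
def relevance_label(score: float) -> str:
    """Map a cross-encoder score to the HIGH/MEDIUM/LOW labels above."""
    if score > 0.5:
        return "HIGH"
    if score > -1.0:
        return "MEDIUM"
    return "LOW"
```

Note the scores are raw cross-encoder outputs, not probabilities, which is why negative values can still be moderately relevant.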
How RAG works in agent execution
When a RAG agent executes within a workflow, the retrieval and generation process follows a precise sequence.

Detailed execution flow
Agent receives query/input
Query is embedded (dual encoding)
Hybrid search against knowledge base vectors
- Dense prefetch retrieves 150 candidates by semantic similarity
- BM25 prefetch retrieves 150 candidates by keyword relevance
- RRF fusion merges and deduplicates to top 50 candidates
- Results are filtered by knowledge base ID
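The RRF fusion step above can be sketched in a few lines. Reciprocal Rank Fusion scores each candidate by summing 1/(k + rank) over every list it appears in; k = 60 is the conventional constant, an assumption here since the doc doesn't state it.

```python
from typing import Dict, List

def rrf_fuse(dense_ids: List[int], bm25_ids: List[int],
             k: int = 60, top_n: int = 50) -> List[int]:
    """Merge two ranked candidate lists with Reciprocal Rank Fusion."""
    scores: Dict[int, float] = {}
    for ranked in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents ranked high in both lists win; duplicates collapse to one entry.
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_n]
```

With the prefetch sizes above, this merges two 150-candidate lists down to 50 deduplicated candidates for the reranker.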
Reranking and parent expansion
Each final result includes:
- Chunk text — The actual document content (or parent text if expanded)
- Source metadata — Filename, section heading, page number
- Relevance score — Reranker score with HIGH/MEDIUM/LOW label
Context injection into agent prompt
Agent generates response grounded in retrieved documents
The agent combines:
- Its general language understanding and reasoning capabilities
- The specific content from retrieved chunks
- The agent’s persona and instructions
Best practices
Keep documents focused and well-structured
Good document structure:
- Clear headings and sections
- Logical information hierarchy
- Consistent formatting
- One topic per document or section
Use descriptive file names
File names appear in source citations and are prepended to chunk embeddings.

Good file names:
- remote_work_policy_2024.pdf
- employee_onboarding_checklist.pdf
- gdpr_compliance_guidelines.pdf

Poor file names:
- document_final_v3.pdf
- policy.pdf
- untitled.pdf
Choose the right retrieval mode
| Scenario | Recommended Mode |
|---|---|
| Simple Q&A with direct questions | Auto |
| Complex research requiring multiple searches | Agentic |
| Agents that need to explore a topic iteratively | Agentic |
| High-volume, cost-sensitive workloads | Auto |
| Multi-KB searches with targeted queries | Agentic |
Enable HyDE for vocabulary mismatch
If your users ask questions using different terminology than your source documents, enable HyDE. It’s especially helpful for:
- Technical documentation with domain-specific jargon
- Policy documents with formal language
- Multi-language knowledge bases
Use contextual chunking for high-stakes use cases
Contextual chunking significantly improves retrieval quality at the cost of higher ingestion time and LLM usage. Enable it for:
- Compliance and regulatory documents
- Legal contracts and policies
- Medical or financial documents where precision matters
Test retrieval quality before production deployment
Before deploying RAG agents to production:

Evaluate retrieval
- Are the most relevant chunks retrieved?
- Is irrelevant content being retrieved?
- Are there gaps in coverage?
Iterate on configuration
Troubleshooting RAG issues
Agent doesn’t retrieve relevant information
Possible causes:
- Documents aren’t in the knowledge base
- Chunk size is too small or too large
- Query and documents use different terminology
- Contextual chunking not enabled for complex documents
Solutions:
- Verify that documents were uploaded and processed
- Experiment with different chunk sizes (200-600 words)
- Enable HyDE to bridge vocabulary gaps
- Enable contextual chunking for better chunk embeddings
- Try agentic mode so the agent can craft better search queries
Agent retrieves irrelevant chunks
Possible causes:
- Knowledge base contains too much diverse content
- Chunk size is too large
- Reranking not enabled
Solutions:
- Split knowledge bases by domain
- Reduce chunk size
- Enable reranking to improve precision
- Use Small2Big chunking for precise matching with broader context
Agent hallucinates despite having access to correct information
Possible causes:
- Relevant chunks not retrieved (retrieval problem)
- Relevant chunks retrieved but not used by LLM (generation problem)
- Persona doesn’t emphasize grounding in documents
Solutions:
- Test retrieval separately from generation
- Update persona to emphasize: “Base your answer ONLY on the provided documents”
- Switch to agentic mode so the agent actively searches for information
- Increase top-k to provide more context
Retrieval is too slow
Possible causes:
- Knowledge base is very large
- HyDE adding latency (extra LLM call per search)
- Too many knowledge bases attached to agent
Solutions:
- Monitor Qdrant performance metrics
- Use a faster LLM for HyDE passage generation
- Reduce the number of attached knowledge bases
- Consider splitting large KBs into smaller, focused ones