What is RAG?
Retrieval-Augmented Generation (RAG) gives your agents access to specific documents and data that aren’t part of their pre-training. Instead of relying solely on the language model’s general knowledge, RAG agents retrieve relevant information from your documents and use it to generate informed, grounded responses.
How RAG works
Retrieve relevant chunks
Why use RAG?
Knowledge cutoffs
Internal data
Grounded responses
Dynamic knowledge
When to use RAG
RAG is ideal for:
- HR policy assistants — Answer employee questions based on specific policy documents
- Compliance reviewers — Verify that proposals comply with internal guidelines and regulatory requirements
- Technical documentation Q&A — Help users find information in product documentation
- Customer support — Answer questions based on knowledge bases and help articles
- Contract analysis — Review contracts against your organization’s standard terms and conditions
- Research assistants — Query large document collections to find relevant information
Qdrant vector database
MagOneAI uses Qdrant as its vector database for semantic search and retrieval. Qdrant is purpose-built for similarity search over high-dimensional vectors.
How Qdrant powers RAG
Document ingestion
Query-time retrieval
Why Qdrant?
- Performance — Fast similarity search even over millions of vectors
- Scalability — Handles large knowledge bases with consistent query latency
- Filtering — Combine vector similarity with metadata filtering (e.g., “search only in contract documents”)
- Hybrid search — Blend semantic similarity with keyword matching for better retrieval quality
Creating and managing knowledge bases
Knowledge bases are collections of documents that agents can query. Each knowledge base has its own Qdrant collection and can be attached to multiple agents.
Create a knowledge base in your project
- Name — Descriptive name like “HR Policies” or “Product Documentation”
- Description — What documents this knowledge base contains
- Embedding model — Select the embedding model (default: text-embedding-3-small)
- Chunk size — How large each document chunk should be (default: 512 tokens)
- Chunk overlap — How much chunks should overlap (default: 50 tokens)
Upload documents
- PDF (.pdf)
- Microsoft Word (.docx, .doc)
- Plain text (.txt)
- Markdown (.md)
- CSV (.csv)
- Rich Text Format (.rtf)
Documents are automatically chunked and embedded
- Text is extracted from each document
- Content is split into chunks based on your configured chunk size
- Each chunk is embedded using the selected embedding model
- Vectors are stored in Qdrant with source metadata (see the sketch below)
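MagOneAI performs these ingestion steps for you automatically on upload. For readers who want to see what the pipeline boils down to, here is a rough sketch using the qdrant-client and OpenAI Python libraries; the collection name, file name, and word-based chunker are illustrative assumptions, not the platform's internals.

```python
# Rough sketch of the ingestion steps above; MagOneAI does this for you.
# Assumes a local Qdrant instance and an OpenAI API key. The collection
# name, file name, and word-based chunking are illustrative only.
from openai import OpenAI
from qdrant_client import QdrantClient, models

CHUNK_SIZE, OVERLAP = 512, 50  # approximated in words here, not tokens

def chunk_text(text: str) -> list[str]:
    """Naive fixed-size chunking with overlap."""
    words = text.split()
    step = CHUNK_SIZE - OVERLAP
    return [" ".join(words[i:i + CHUNK_SIZE]) for i in range(0, len(words), step)]

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

# One Qdrant collection per knowledge base; 1536 dims matches text-embedding-3-small
qdrant.create_collection(
    collection_name="hr_policies",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)

text = open("remote_work_policy_2024.txt").read()
chunks = chunk_text(text)

# Embed every chunk with the knowledge base's embedding model
embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunks
)

# Store vectors together with source metadata used later for citations
qdrant.upsert(
    collection_name="hr_policies",
    points=[
        models.PointStruct(
            id=i,
            vector=item.embedding,
            payload={"source": "remote_work_policy_2024.txt", "chunk_index": i, "text": chunk},
        )
        for i, (item, chunk) in enumerate(zip(embeddings.data, chunks))
    ],
)
```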
Attach the knowledge base to an agent
Managing knowledge bases
- Add documents
- Update documents
- Delete documents
- Test retrieval
- View statistics
How RAG works in agent execution
When a RAG agent executes within a workflow, the retrieval and generation process follows a precise sequence:
Detailed execution flow
Agent receives query/input
Query is embedded
Similarity search against knowledge base vectors
- Default: retrieves top-5 chunks
- Configurable: you can adjust the number of chunks (k) based on context window size and retrieval quality needs
- Metadata filtering: optionally filter by document type, date, or custom metadata
Top-k relevant chunks are retrieved
- Chunk text — The actual document content
- Source metadata — Which document the chunk came from
- Similarity score — How relevant the chunk is (0-1)
- Position metadata — Where in the document this chunk appeared
Chunks are injected into the agent's prompt as context
Agent generates response grounded in retrieved documents, combining (see the sketch after this list):
- Its general language understanding and reasoning capabilities
- The specific content from retrieved chunks
- The agent’s persona and instructions
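A minimal end-to-end sketch of this query-time flow, again using a recent qdrant-client (which provides query_points) and the OpenAI SDK; the collection and model names follow the ingestion sketch above and are assumptions, not MagOneAI internals.

```python
# Minimal sketch of the query-time flow described above, reusing the
# "hr_policies" collection from the ingestion sketch.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

question = "How many days per week can I work remotely?"

# 1. Embed the query with the same model used for the documents
query_vector = openai_client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2. Similarity search: retrieve the top-5 chunks with their scores and metadata
hits = qdrant.query_points(
    collection_name="hr_policies", query=query_vector, limit=5
).points

# 3. Inject the retrieved chunks into the prompt as context
context = "\n\n".join(
    f"[{hit.payload['source']}] {hit.payload['text']}" for hit in hits
)

# 4. Generate a response grounded in the retrieved documents
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided documents."},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```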
Retrieval configuration
You can configure retrieval behavior per agent.
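The exact settings exposed by MagOneAI aren't reproduced here; as a purely hypothetical illustration, a per-agent retrieval configuration typically covers knobs like these:

```python
# Hypothetical per-agent retrieval settings; field names are illustrative,
# not MagOneAI's actual configuration schema.
retrieval_config = {
    "knowledge_base": "HR Policies",
    "top_k": 5,                    # number of chunks to retrieve (default: 5)
    "score_threshold": 0.35,       # drop chunks below this similarity score
    "filters": {"document_type": "policy"},  # optional metadata filter
}
```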
Citation and source tracking
RAG agents can cite their sources, providing transparency and auditability.
Chunking and embedding
The quality of your RAG system depends heavily on how documents are chunked and embedded.
Document chunking strategies
Chunking is the process of splitting documents into smaller segments for embedding and retrieval.
- Fixed-size chunking
- Semantic chunking
- Hierarchical chunking
Fixed-size chunking pros:
- Predictable chunk sizes fit LLM context windows
- Simple and fast
- Works well for uniform documents
Fixed-size chunking cons:
- May split logical sections (paragraphs, lists) arbitrarily
- Doesn’t respect document structure
Chunk overlap
Overlap ensures important information at chunk boundaries isn’t lost: content near the end of one chunk is repeated at the start of the next.
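A small sketch of token-based chunking with overlap, using the tiktoken tokenizer (the one used by the text-embedding-3 models); the sizes and file name are assumptions.

```python
# Fixed-size, token-based chunking with overlap. Because of the 50-token
# overlap, text near the end of one chunk is repeated at the start of the
# next, so a sentence split at the boundary survives intact in one of them.
import tiktoken

def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

chunks = chunk_tokens(open("remote_work_policy_2024.txt").read())
```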
Embedding models
Embedding models convert text to high-dimensional vectors. MagOneAI supports several options:
| Model | Dimensions | Strengths | Use Cases |
|---|---|---|---|
| text-embedding-3-small | 1536 | Fast, cost-effective, good general performance | Default for most use cases |
| text-embedding-3-large | 3072 | Higher quality embeddings, better semantic understanding | Complex domain-specific knowledge bases |
| text-embedding-ada-002 | 1536 | Proven performance, widely used | Backward compatibility |
- Performance vs. cost — Larger models produce better embeddings but cost more per token
- Domain specificity — Domain-specific models (legal, medical) may outperform general models for specialized content
- Consistency — Use the same embedding model for all documents in a knowledge base (see the check below)
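As a quick illustration of the Dimensions column above, and of why models cannot be mixed within one knowledge base, the snippet below embeds the same sentence with both OpenAI models and prints the vector sizes; the sample text is made up.

```python
# Embeddings from different models live in different vector spaces and even
# have different lengths, so one knowledge base must stick to one model.
from openai import OpenAI

client = OpenAI()
text = "Employees may work remotely up to three days per week."

for model in ("text-embedding-3-small", "text-embedding-3-large"):
    vector = client.embeddings.create(model=model, input=text).data[0].embedding
    print(model, len(vector))  # 1536 and 3072 dimensions respectively
```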
How chunk size affects retrieval quality
Chunk size is a critical parameter that impacts both retrieval quality and LLM reasoning.
Small chunks (128-256 tokens)
Pros:
- Precise retrieval of specific facts
- Less noise in retrieved context
- More chunks fit in LLM context window
Cons:
- May lack surrounding context needed for understanding
- More chunks required to answer complex questions
- Higher retrieval overhead
Medium chunks (512-1024 tokens)
Pros:
- Balance between precision and context
- Chunks are semantically self-contained
- Good for most use cases
Cons:
- May retrieve some irrelevant content along with relevant information
- Fewer chunks fit in context window
Large chunks (1024-2048 tokens)
Pros:
- Retrieves broad context around relevant information
- Good for understanding complex relationships
- Fewer retrieval calls needed
Cons:
- More noise in retrieved context
- Fewer chunks fit in LLM context window
- May retrieve content that’s partially relevant
Best practices
Keep documents focused and well-structured
Good document structure:
- Clear headings and sections
- Logical information hierarchy
- Consistent formatting
- One topic per document or section
Poor document structure:
- Wall-of-text with no sections
- Mixed topics in single document
- Inconsistent formatting
- Unclear boundaries between concepts
Use descriptive file names
File names appear in source citations and help agents understand context.
Good file names:
- remote_work_policy_2024.pdf
- employee_onboarding_checklist.pdf
- gdpr_compliance_guidelines.pdf
Poor file names:
- document_final_v3.pdf
- policy.pdf
- untitled.pdf
Regularly update knowledge bases when source documents change
RAG systems reflect the documents in the knowledge base, so when source documents change, update or re-upload them to keep agent answers current.
Test retrieval quality before production deployment
Before deploying RAG agents to production:
Evaluate retrieval
- Are the most relevant chunks retrieved?
- Is irrelevant content being retrieved?
- Are there gaps in coverage?
Measure metrics
- Precision — What percentage of retrieved chunks are relevant?
- Recall — What percentage of relevant chunks are retrieved?
- MRR (Mean Reciprocal Rank) — How highly ranked is the first relevant chunk? (see the sketch below)
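A small sketch of how these metrics can be computed for a single test query, given a hand-labelled set of relevant chunk ids (the ids here are made up); MRR is the average of this per-query reciprocal rank over all test queries.

```python
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str]) -> dict[str, float]:
    """Precision, recall, and reciprocal rank for one query."""
    hits = [cid for cid in retrieved_ids if cid in relevant_ids]
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    reciprocal_rank = 0.0
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            reciprocal_rank = 1.0 / rank  # rank of the first relevant chunk
            break
    return {"precision": precision, "recall": recall, "reciprocal_rank": reciprocal_rank}

print(retrieval_metrics(retrieved_ids=["c7", "c2", "c9"], relevant_ids={"c2", "c5"}))
# {'precision': 0.33..., 'recall': 0.5, 'reciprocal_rank': 0.5}
```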
Optimize chunk size for your use case
No single chunk size is optimal for all use cases. Run experiments:
- Create knowledge bases with different chunk sizes (256, 512, 1024 tokens)
- Test agent performance on representative queries
- Measure answer quality and relevance
- Select the configuration that performs best for YOUR data and use case
Use metadata filtering for multi-domain knowledge bases
If your knowledge base contains different document types, use metadata filtering to restrict retrieval to the relevant subset of documents.
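For example, with the qdrant-client a filter can be combined with the vector query; this assumes a document_type field was stored in each chunk's payload, which the earlier ingestion sketch did not do by default.

```python
# Combine vector similarity with a metadata filter, assuming chunks carry a
# "document_type" field in their Qdrant payload.
from openai import OpenAI
from qdrant_client import QdrantClient, models

query_vector = OpenAI().embeddings.create(
    model="text-embedding-3-small", input="termination clause"
).data[0].embedding

hits = QdrantClient(url="http://localhost:6333").query_points(
    collection_name="hr_policies",
    query=query_vector,
    limit=5,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="document_type", match=models.MatchValue(value="contract"))]
    ),
).points
```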
Monitor retrieval performance in production
Track retrieval quality metrics in production:
- Average similarity scores — Declining scores may indicate knowledge base staleness
- Retrieval latency — Monitor query performance as knowledge base grows
- Zero-result queries — Track queries that retrieve no relevant chunks
- Agent confidence scores — Low confidence may indicate poor retrieval quality (see the sketch below)
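A sketch of turning logged retrievals into these metrics; the log format is an assumption, and in practice you would feed such numbers into whatever monitoring stack you already use.

```python
# Aggregate logged retrievals into simple health metrics. Each log entry is
# assumed to record the query latency and the similarity scores returned.
from statistics import mean

retrieval_logs = [
    {"latency_ms": 42, "scores": [0.81, 0.74, 0.69]},
    {"latency_ms": 55, "scores": []},  # a zero-result query
    {"latency_ms": 38, "scores": [0.62, 0.58]},
]

all_scores = [s for log in retrieval_logs for s in log["scores"]]
print("average similarity:", round(mean(all_scores), 3))
print("average latency (ms):", round(mean(log["latency_ms"] for log in retrieval_logs), 1))
print("zero-result rate:", sum(not log["scores"] for log in retrieval_logs) / len(retrieval_logs))
```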
Advanced RAG techniques
Once you’ve mastered basic RAG, consider these advanced techniques:
Hybrid search
Combine semantic similarity (vector search) with keyword matching (BM25):
- Vector search finds semantically similar content
- Keyword search finds exact term matches
- Results are merged and re-ranked (see the sketch below)
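One common way to merge the two result lists is reciprocal rank fusion (RRF); the sketch below assumes you already have ranked chunk ids from a vector search and from a BM25 search.

```python
# Reciprocal rank fusion: chunks ranked highly by either search get a high
# combined score; chunks found by both rise to the top.
def reciprocal_rank_fusion(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(reciprocal_rank_fusion(["c3", "c1", "c8"], ["c1", "c5", "c3"]))
# ['c1', 'c3', 'c5', 'c8']
```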
Reranking
After initial retrieval, use a reranking model to reorder chunks by relevance before they are passed to the agent.
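A sketch of reranking with a cross-encoder from the sentence-transformers library; the model name is a common public choice, not one MagOneAI prescribes, and the chunks here are toy examples.

```python
# Rerank retrieved chunks with a cross-encoder: it scores each (query, chunk)
# pair jointly and usually orders results more precisely than raw similarity.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How many days per week can I work remotely?"
chunks = [
    "The office is closed on public holidays.",
    "Employees may work remotely up to three days per week.",
]

scores = reranker.predict([(query, chunk) for chunk in chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)]
```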
Query expansion
Expand user queries before retrieval:
- Generate variations of the query
- Add synonyms and related terms
- Retrieve using multiple query formulations
- Combine and deduplicate results (see the sketch below)
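A sketch of the idea; expand_query() here is a simple stand-in for an LLM call or synonym list, and the retrieval results are faked so the deduplication logic stays self-contained.

```python
# Query expansion: retrieve with several phrasings and deduplicate the results.
def expand_query(query: str) -> list[str]:
    """Stand-in for an LLM call or synonym expansion."""
    return [
        query,
        query.replace("remote work", "working from home"),
        f"{query} rules",
    ]

# Toy retrieval results keyed by query, standing in for real vector searches
fake_results = {
    "What is the remote work policy?": ["c1", "c2"],
    "What is the working from home policy?": ["c2", "c3"],
    "What is the remote work policy? rules": ["c1", "c4"],
}

seen, merged = set(), []
for variant in expand_query("What is the remote work policy?"):
    for chunk_id in fake_results.get(variant, []):
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)

print(merged)  # ['c1', 'c2', 'c3', 'c4']: duplicates removed across variants
```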
Contextual compression
Compress retrieved chunks to include only relevant sentences: extract the sentences that relate to the query and discard the rest before injecting the context into the prompt.
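A minimal sketch of the idea, using keyword overlap as a stand-in for an embedding-based relevance check on each sentence; the chunk and query are made up.

```python
# Contextual compression: keep only the sentences in a retrieved chunk that
# look relevant to the query, then inject the shorter text into the prompt.
import re

def compress_chunk(chunk: str, query: str, min_overlap: int = 2) -> str:
    query_terms = set(re.findall(r"\w+", query.lower()))
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", chunk):
        overlap = len(query_terms & set(re.findall(r"\w+", sentence.lower())))
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)

chunk = (
    "Employees may work remotely up to three days per week. "
    "The cafeteria serves lunch from noon. "
    "Remote days must be approved by your manager."
)
print(compress_chunk(chunk, "How many remote work days are allowed per week?"))
# Keeps the first and last sentences; the cafeteria sentence is dropped.
```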
Troubleshooting RAG issues
Agent doesn't retrieve relevant information
Possible causes:
- Documents aren’t in the knowledge base
- Chunk size is too small or too large
- Query and documents use different terminology
- Embedding model doesn’t capture domain semantics
Fixes:
- Verify documents uploaded and processed
- Experiment with different chunk sizes
- Use query expansion or synonyms
- Try a different embedding model
Agent retrieves irrelevant chunks
Possible causes:
- Knowledge base contains too much diverse content
- Chunk size is too large
- No metadata filtering
Fixes:
- Split knowledge bases by domain
- Reduce chunk size
- Add metadata filters to retrieval config
- Use reranking to improve precision
Agent hallucinates despite having access to correct information
Possible causes:
- Relevant chunks not retrieved (retrieval problem)
- Relevant chunks retrieved but not used by LLM (generation problem)
- Persona doesn’t emphasize grounding in documents
Fixes:
- Test retrieval separately from generation
- Update persona to emphasize: “Base your answer ONLY on the provided documents”
- Add guardrails that require source citations
- Increase top-k to provide more context
Retrieval is too slow
Possible causes:
- Knowledge base is very large
- Qdrant cluster under-provisioned
- Embedding model is slow
Fixes:
- Monitor Qdrant performance metrics
- Consider Qdrant scaling options
- Use a faster embedding model
- Implement caching for common queries