Model configuration hierarchy
MagOneAI provides fine-grained control over which models are available and who can use them. Configuration flows through four levels, each building on the previous.

Platform level — SuperAdmin configuration
SuperAdmin configures available LLM providers and models in the Admin Portal. This is the global catalog of models that could potentially be used anywhere on the platform.

At this level, you:
- Add LLM providers (OpenAI, Anthropic, Google, private models)
- Store API keys securely in HashiCorp Vault
- Configure provider-specific settings (endpoints, default parameters)
- Monitor provider health and availability
Organization level — access control
Organization admins control which providers and models their organization can access. Not every org needs access to every model.

At this level, you:
- Enable/disable specific providers for your organization
- Set which models from each provider are available
- Configure rate limits and quotas per organization
- Monitor model usage and costs for your org
Project level — further restrictions
Project settings can further restrict which models are available within a specific project. This is useful when you want different model access for different use cases.

At this level, you:
- Select subset of org-available models for this project
- Set project-specific rate limits
- Configure default models for new agents in this project
Agent level — specific model selection
Each agent uses a specific model selected from the project’s available models. This is where you choose the right model for each task.

At this level, you:
- Select the exact model this agent will use
- Configure model parameters (temperature, max tokens, etc.)
- Override defaults for specific agent needs
This layered approach provides three main benefits:
- Security: Only authorized organizations can use specific models
- Cost control: Organizations can’t accidentally use expensive models they didn’t approve
- Flexibility: Different projects and agents can use different models based on their needs
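
To make the narrowing concrete, here is a minimal sketch of how the four levels compose; the function and field names are illustrative, not MagOneAI's actual API:

```python
# Illustrative sketch of the four-level hierarchy -- names are hypothetical,
# not MagOneAI's actual API.

PLATFORM_CATALOG = {"gpt-4o", "gpt-4o-mini", "claude-opus-4-6", "claude-haiku"}

def effective_models(org_enabled: set[str], project_allowed: set[str] | None = None) -> set[str]:
    """Models an agent may select: platform catalog ∩ org ∩ project."""
    models = PLATFORM_CATALOG & org_enabled      # an org can only enable cataloged models
    if project_allowed is not None:              # a project may restrict further
        models &= project_allowed
    return models

available = effective_models(
    org_enabled={"gpt-4o", "gpt-4o-mini", "claude-haiku"},
    project_allowed={"gpt-4o-mini", "claude-haiku"},
)
assert "gpt-4o" not in available  # enabled for the org, but excluded by the project
```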
Provider management in Admin Portal
SuperAdmins manage all LLM providers through the Admin Portal. This is the central configuration point for the entire platform.

Adding a provider
1. Select provider type
Choose from:
- OpenAI (cloud)
- Anthropic (cloud)
- Google (cloud)
- OpenAI Compatible (for private models)
- Custom (for proprietary integrations)
2. Configure connection
Provide:
- Provider name (for identification)
- API endpoint URL (for cloud providers, this is pre-filled)
- API key or authentication credentials
- Optional: default parameters, timeouts, retry settings
3. Store credentials in Vault
API keys are automatically stored in HashiCorp Vault. You’ll see a reference like `vault:openai/api_key` instead of the actual key.

4. Test connection
Click Test Connection to verify MagOneAI can reach the provider and authenticate successfully.
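
Behind the portal form, a provider entry boils down to a small configuration record. A hypothetical example for an OpenAI-compatible private model (the field names and endpoint are illustrative, not MagOneAI's schema):

```python
# Hypothetical provider record -- field names and endpoint are illustrative.
provider = {
    "name": "internal-llama",                        # for identification in the portal
    "type": "openai_compatible",                     # private model behind an OpenAI-style API
    "endpoint": "https://llm.internal.example.com/v1",
    "api_key": "vault:internal-llama/api_key",       # Vault reference, never the raw key
    "defaults": {"temperature": 0.3, "timeout_s": 60, "max_retries": 3},
}
```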
Rate limits and quotas per organization
Control model usage to prevent runaway costs and ensure fair resource allocation.

Rate limits:
- Requests per minute per organization
- Tokens per minute per organization
- Concurrent requests per organization

Quotas:
- Total tokens per day/month
- Total requests per day/month
- Total cost per month (in USD)
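
Per-organization request limits are commonly enforced with a token bucket. A minimal sketch of the idea (not MagOneAI's actual implementation):

```python
import time

class TokenBucket:
    """Allows `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. 60 requests per minute for one organization, bursting up to 10:
org_limiter = TokenBucket(rate=60 / 60, capacity=10)
```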
Usage monitoring
Track model usage across the platform:
- Provider dashboard: See aggregate usage for each provider
- Organization usage: Break down usage by organization
- Cost tracking: Monitor spending per provider, per org, per project
- Model popularity: Identify which models are most/least used
- Error rates: Track failed requests by provider and reason
Model selection per agent
Each agent in MagOneAI uses a specific LLM. Choosing the right model for each task is crucial for balancing cost, speed, and capability.

How to select a model
When creating or editing an agent:
- Choose from available models: The dropdown shows models available to your project
- Consider the task requirements:
  - Complex reasoning → large, capable models (GPT-4o, Claude Opus 4.6)
  - Simple classification → small, fast models (GPT-4o-mini, Claude Haiku)
  - Vision tasks → multimodal models (GPT-4o, Gemini 2.5 Pro)
- Configure model parameters:
  - Temperature: Creativity vs consistency (0.0 - 1.0)
  - Max tokens: Maximum response length
  - Top-p: Alternative sampling method
  - Frequency/presence penalty: Control repetition
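
Taken together, an agent's model settings amount to a handful of values. A hypothetical configuration (the key names are illustrative, not MagOneAI's schema):

```python
# Hypothetical agent model configuration -- key names are illustrative.
agent_model_config = {
    "model": "gpt-4o-mini",     # must be one of the project's available models
    "temperature": 0.2,         # low: consistent output for classification
    "max_tokens": 300,          # short answers are cheaper and faster
    "top_p": 1.0,               # leave nucleus sampling off when steering with temperature
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
```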
Choosing the right model for the task
Different tasks require different model capabilities.

Fast models for simple tasks
Use: GPT-4o-mini, Claude Haiku, Gemini Flash
For:
- Classification (spam detection, sentiment analysis)
- Routing (which agent handles this request?)
- Simple Q&A
- Data extraction from structured formats
Reasoning models for complex tasks
Use: Claude Opus 4.6, o1, GPT-4o
For:
- Multi-step reasoning
- Complex analysis and synthesis
- Strategic decision-making
- Nuanced understanding of context
Vision models for documents
Use: GPT-4o, Gemini 2.5 Pro, Qwen3-VL
For:
- Document OCR and analysis
- Image understanding
- Visual question answering
- Chart/diagram interpretation
Mix models in workflows
Use: Different models for different agents in the same workflow
For:
- Routing agent (fast model) → Analysis agent (reasoning model)
- Vision agent (multimodal) → Summary agent (text model)
- Classifier (cheap model) → multiple specialized agents
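
The first of these patterns can be sketched as two calls with different models: a cheap classifier decides, and the expensive model runs only when needed. An illustrative sketch using the OpenAI Python SDK (model names and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handle(request: str) -> str:
    # Step 1: a cheap, fast model routes the request.
    route = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: SIMPLE or COMPLEX."},
            {"role": "user", "content": request},
        ],
    ).choices[0].message.content.strip().upper()

    # Step 2: escalate to the expensive model only when routing demands it.
    model = "gpt-4o" if route == "COMPLEX" else "gpt-4o-mini"
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": request}],
    )
    return answer.choices[0].message.content
```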
Model parameter tuning
Temperature: Creativity vs consistency
Temperature controls randomness in responses.
- 0.0 - 0.3: Deterministic, consistent — use for classification, extraction, structured output
- 0.4 - 0.7: Balanced — default for most tasks
- 0.8 - 1.0: Creative, varied — use for content generation, brainstorming
Max tokens: Response length control
Max tokens sets the maximum response length.
- Short responses (100-500 tokens): classifications, summaries
- Medium responses (500-2000 tokens): explanations, analyses
- Long responses (2000-4000+ tokens): comprehensive reports, documents
Top-p: Alternative to temperature
Top-p (nucleus sampling) is an alternative to temperature for controlling randomness.
- 0.1 - 0.5: Conservative, high-probability tokens only
- 0.5 - 0.9: Balanced
- 0.9 - 1.0: Diverse, includes lower-probability tokens
Frequency and presence penalties
Frequency penalty: Reduces likelihood of repeating tokens based on how often they’ve appeared.

Presence penalty: Reduces likelihood of repeating tokens that have appeared at all.

Both range from -2.0 to 2.0. Positive values discourage repetition, negative values encourage it.

Use case: Set frequency penalty to 0.5-1.0 for content generation to avoid repetitive phrasing.
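
In an OpenAI-style chat completion call, all four knobs are plain request parameters. An illustrative example, with values chosen for deterministic data extraction:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Extract the invoice number from: Invoice #INV-2041, due 2025-03-01.",
    }],
    temperature=0.0,        # deterministic: same input gives (near) the same output
    max_tokens=50,          # extraction needs only a short answer
    top_p=1.0,              # leave nucleus sampling off when steering with temperature
    frequency_penalty=0.0,  # repetition control is irrelevant for short extractions
    presence_penalty=0.0,
)
print(response.choices[0].message.content)
```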
Function calling support
For agents to use tools, the underlying model must support function calling (also called tool use).

What is function calling?
Function calling allows models to:
- Receive a list of available tools with their parameter schemas
- Decide during reasoning when to call a tool
- Generate a tool call with appropriate parameters
- Receive the tool’s response and continue reasoning
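
With the OpenAI Python SDK, that loop looks roughly like this (the tool, its schema, and the stand-in result are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Describe the tool and its parameter schema (an illustrative example).
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order A-1234?"}]

# 2-3. The model decides whether to call the tool and generates the arguments.
reply = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = reply.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# 4. Execute the tool, return its result, and let the model continue reasoning.
result = {"order_id": args["order_id"], "status": "shipped"}  # stand-in for a real lookup
messages += [
    reply.choices[0].message,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
]
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```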
Models with function calling support
Full support:
- OpenAI: GPT-4, GPT-4o, GPT-4o-mini, o1, o3-mini
- Anthropic: All Claude models (function calling via “tool use”)
- Google: Gemini 2.0 Flash, Gemini 2.5 Pro

No or limited support:
- Base LLaMA models (unless fine-tuned)
- Some older or smaller open-source models
How MagOneAI converts tool definitions
When you attach tools to an agent, MagOneAI converts MCP tool definitions to the provider’s function calling format.
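
The original example was trimmed here, so the sketch below shows the general shape of such a conversion: an MCP tool carries a name, a description, and a JSON Schema inputSchema, and an OpenAI-style function definition reuses those pieces under different keys (MagOneAI's exact mapping may differ):

```python
# Illustrative MCP tool definition (name/description/inputSchema is the MCP shape).
mcp_tool = {
    "name": "search_documents",
    "description": "Search the knowledge base for relevant documents.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["query"],
    },
}

def to_openai_tool(tool: dict) -> dict:
    """Map an MCP tool definition onto OpenAI's function calling format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["inputSchema"],  # both sides use JSON Schema
        },
    }
```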
Agent types and function calling
MagOneAI agents have different function calling requirements depending on their configuration:
- Basic agents: No tools, no function calling needed
- Router agents: No tools, no function calling needed
- Tool agents: Require function calling — agent decides when to use tools
- Full agents: Require function calling — agent uses tools during complex reasoning
Cost optimization strategies
Model selection has significant cost implications. Use these strategies to minimize spending while maintaining quality.

1. Tier your agents by model size
- Tier 1 (entry points): GPT-4o-mini or Claude Haiku for routing and classification
- Tier 2 (specialists): GPT-4o or Claude Sonnet for most tasks
- Tier 3 (experts): Claude Opus 4.6 or o1 only for complex reasoning
2. Use caching where available
Some providers, such as Anthropic, support prompt caching. Enable caching for:
- System prompts (same across many requests)
- Large context documents (RAG results, knowledge base content)
- Tool definitions (same for all calls with this agent)
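
With the Anthropic API, caching is opted into per content block via cache_control. An illustrative example (the model name is a placeholder):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

LONG_SYSTEM_PROMPT = "..."  # imagine several thousand tokens of stable instructions

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # identical across many requests
            "cache_control": {"type": "ephemeral"},  # cache the prompt up to this block
        },
    ],
    messages=[{"role": "user", "content": "Summarize today's tickets."}],
)
print(response.content[0].text)
```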
3. Compress prompts and context
- Remove unnecessary whitespace and verbose instructions
- Summarize long documents before including in context
- Use structured formats (JSON, tables) instead of verbose prose
- Prune irrelevant context from multi-turn conversations
4. Monitor and alert on usage
Set up alerts for:
- Daily spend exceeding a threshold
- Specific agents using expensive models excessively
- Failed requests (you’re paying for them but getting no value)
5. Use private models for high-volume tasks
If you have consistent, high-volume workloads, deploying private open-source models can be significantly cheaper than cloud APIs:
- High upfront cost (GPU infrastructure)
- Near-zero marginal cost per request
- Breakeven typically at 10M+ tokens per month
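
A back-of-the-envelope calculation makes the breakeven concrete; every number below is an illustrative assumption, not a quote:

```python
# Back-of-the-envelope breakeven -- all prices are illustrative assumptions.
cloud_price_per_1m = 15.00    # USD per 1M tokens (assumed premium-model blended rate)
gpu_cost_per_month = 300.00   # USD per month (assumed amortized single-GPU instance + ops)

breakeven_tokens = gpu_cost_per_month / cloud_price_per_1m * 1_000_000
print(f"Breakeven at ~{breakeven_tokens / 1e6:.0f}M tokens/month")  # ~20M under these assumptions
```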