What constitutes a private model deployment
A private model deployment means running LLMs on your own infrastructure instead of using cloud APIs. This gives you:
- Complete data sovereignty — no data sent to third-party APIs
- Full control — over model versions, fine-tuning, and access
- Cost predictability — fixed infrastructure cost, near-zero marginal cost per request
- Compliance — easier to meet regulatory requirements for sensitive data
- Customization — fine-tune models on proprietary data
Private deployments are typically chosen for:
- Regulated industries (healthcare, finance, government)
- Processing sensitive data (PII, trade secrets, classified information)
- High-volume workloads (where per-token pricing becomes expensive)
- Custom domain requirements (fine-tuned models)
OpenAI-compatible API requirement
MagOneAI connects to private models via the OpenAI-compatible API format. This is an industry-standard HTTP API that many serving frameworks implement.
What “OpenAI-compatible” means
Your model server must implement the `/v1/chat/completions` endpoint with a request/response format matching OpenAI’s API:
Request format:
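A representative request and response, shown as a curl call; the host, API key, and model name below are placeholders for your own deployment:

```bash
# Chat-completion request in the OpenAI-compatible format. The endpoint host,
# API key, and model identifier are placeholders.
curl http://gpu-server.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize the attached contract clause."}
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'

# A compatible server returns the standard response shape, roughly:
# {
#   "id": "chatcmpl-...",
#   "object": "chat.completion",
#   "choices": [
#     {"index": 0,
#      "message": {"role": "assistant", "content": "..."},
#      "finish_reason": "stop"}
#   ],
#   "usage": {"prompt_tokens": 42, "completion_tokens": 31, "total_tokens": 73}
# }
```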
Supported serving frameworks
Several open-source frameworks implement the OpenAI-compatible API for serving LLMs. Choose based on your performance, ease-of-use, and feature requirements.
vLLM
High-performance inference with PagedAttention
- Best for: Production deployments requiring maximum throughput
- Performance: State-of-the-art serving speed
- Features: Dynamic batching, optimized attention, quantization support
- GPU: Nvidia (CUDA), AMD (ROCm)
- Setup complexity: Medium
Ollama
Easy local model management
- Best for: Development, testing, and small-scale production
- Performance: Good for single-user workloads
- Features: Simple CLI, automatic model downloads, built-in model library
- GPU: Nvidia, Apple Silicon (Metal), CPU
- Setup complexity: Very easy
LM Studio
Desktop application with GUI
- Best for: Experimentation, local development, non-technical users
- Performance: Good for single-user workloads
- Features: User-friendly UI, model browser, one-click downloads
- GPU: Nvidia, Apple Silicon, CPU
- Setup complexity: Very easy
Text Generation Inference (TGI)
Hugging Face’s production serving solution
- Best for: Production deployments, Hugging Face ecosystem users
- Performance: Excellent, optimized for various hardware
- Features: Tensor parallelism, quantization, streaming
- GPU: Nvidia (CUDA)
- Setup complexity: Medium
Custom deployments
Any server implementing the OpenAI-compatible API format works with MagOneAI:
- Custom inference servers built in-house
- Cloud-provider managed endpoints (AWS SageMaker, Azure ML, GCP Vertex AI)
- Specialized serving solutions for specific models or hardware
- Multi-model routing servers
Configuration
Connect MagOneAI to your private model deployment.
Deploy your model using a supported framework
Choose a serving framework (vLLM, Ollama, TGI, etc.) and deploy your model. Ensure the server is running and accessible from the MagOneAI platform.
Example vLLM deployment:
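A minimal sketch, assuming a host with two GPUs and the Llama 3.1 70B Instruct checkpoint; adjust the model name, parallelism, and port for your hardware (older vLLM releases use `python -m vllm.entrypoints.openai.api_server` instead of `vllm serve`):

```bash
# Install vLLM and start its OpenAI-compatible server (sketch; flags shown are
# the commonly used ones, not an exhaustive production configuration).
pip install vllm

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000 \
  --api-key "$VLLM_API_KEY"   # optional: require a bearer token from clients

# The server now answers at http://<host>:8000/v1/chat/completions
```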
Note the endpoint URL
Your model server’s URL, including the `/v1` path if required by the framework. Examples:
- vLLM: http://gpu-server.internal:8000/v1
- Ollama: http://localhost:11434/v1
- TGI: http://tgi-server.internal:8080/v1
Add provider in Admin Portal
Navigate to Admin Portal → Configuration → LLM Providers → Add Provider. Select OpenAI Compatible as the provider type.
Enter endpoint URL and credentials
Provide:
- Provider name: Friendly name (e.g., “Internal LLaMA 3.1”)
- API endpoint: Your server URL (e.g., http://gpu-server:8000/v1)
- API key: If your server requires authentication (optional)
- Model name: The model identifier your server expects
Test connection
Click Test Connection. MagOneAI will send a test request to verify:
- Server is reachable
- API format is correct
- Authentication works (if required)
- Model responds correctly
Assign to organizations
Choose which MagOneAI organizations can use this private model. Private models can be restricted to specific orgs for cost allocation or access control.
Vision model support
Some open-source models support vision (multimodal) capabilities. You can serve these via vLLM or TGI and use them in MagOneAI workflows.
Vision-capable open-source models
Popular open-source vision models:
- Qwen3-VL — Strong vision and language understanding, excellent for document analysis
- LLaVA — Open vision-language model family
- Moondream — Lightweight vision model for resource-constrained deployments
- CogVLM — Chinese and English vision-language model
- InternVL — Versatile vision-language model
Serving vision models with vLLM
vLLM supports multimodal models. Serve a vision model:
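A minimal sketch, reusing the Qwen/Qwen2-VL-7B-Instruct checkpoint referenced later on this page; the context-length cap is illustrative:

```bash
# Serve a vision-language model with vLLM's OpenAI-compatible server.
# Size --max-model-len to your GPU memory.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --port 8000 \
  --max-model-len 8192
```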
Using vision models in workflows
Once deployed and configured, use vision models the same way as cloud vision models:
- Upload images via workflow inputs
- Agent processes images and text together
- Model returns analysis, extracted text, or answers
For example, an identity verification (KYC) workflow might:
- Upload a driver’s license or passport image
- Qwen3-VL extracts name, date of birth, ID number, expiration date
- No cloud APIs — all processing on your infrastructure
- Compliant with data residency requirements
Private vision models enable document analysis without cloud APIs. Process identity documents, financial statements, medical records, and classified information without sending images to third parties.
Performance considerations
Private model deployments require careful capacity planning and optimization.
GPU requirements by model size
Approximate GPU memory requirements (FP16 precision, roughly 2 bytes per parameter plus overhead for the KV cache):
| Model Size | Single GPU | Distributed |
|---|---|---|
| 7B params | 16 GB (RTX 4090, A10) | Not needed |
| 13B params | 26 GB (A100 40GB) | Optional |
| 30B params | 60 GB (A100 80GB) | Recommended (2x A40) |
| 70B params | 140 GB | Required (2-4x A100) |
| 175B params (GPT-3 scale) | 350 GB | Required (8x A100) |
Quantization for reduced memory usage
Quantization reduces model precision to lower memory requirements with minimal quality loss:
GPTQ (4-bit/8-bit quantization)
- Memory reduction: 4-bit reduces to ~25% of original size, 8-bit to ~50%
- Quality: Minimal degradation for most tasks
- Performance: Slightly slower than FP16, but fits on smaller GPUs
- Use case: Run 70B models on 2x A40 instead of 4x A100
- Example: TheBloke/Llama-3-70B-Instruct-GPTQ (4-bit quantized)
AWQ (Activation-Aware Weight Quantization)
- Memory reduction: Similar to GPTQ (4-bit)
- Quality: Often better than GPTQ, especially for reasoning
- Performance: Faster inference than GPTQ on compatible hardware
- Use case: Best quality-size tradeoff for 4-bit quantization
- Example: casperhansen/llama-3-70b-instruct-awq
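For example, the AWQ checkpoint above can be served with vLLM on two mid-range GPUs; a sketch, assuming a recent vLLM release:

```bash
# Serve a 4-bit AWQ quantized 70B model (sketch; GPU count and port are illustrative).
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --port 8000
```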
GGUF (llama.cpp format)
- Memory reduction: 2-bit to 8-bit options
- Quality: Varies by quantization level
- Performance: Optimized for CPU inference
- Use case: Run models on CPU or Apple Silicon without GPUs
- Example: Used by Ollama for efficient local inference
Batch inference for throughput optimization
Batching processes multiple requests simultaneously to maximize GPU utilization:
- Dynamic batching: vLLM and TGI automatically batch incoming requests
- Continuous batching: Start generating for one request while others are still arriving
- PagedAttention: vLLM’s technique for efficient memory usage during batching
Load balancing across multiple GPU instances
For high-volume production workloads, deploy multiple model replicas and load balance across them (see the sketch after this list). This gives you:
- Higher throughput (handle more concurrent requests)
- Redundancy (failover if one instance crashes)
- Rolling updates (update one instance at a time)
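A minimal sketch of round-robin load balancing with nginx in front of two vLLM replicas; the hostnames, ports, and file path are illustrative:

```bash
# Write an nginx upstream config that balances across two model replicas,
# then reload nginx. Long generations need a generous read timeout.
cat <<'EOF' | sudo tee /etc/nginx/conf.d/llm-upstream.conf
upstream llm_backend {
    server gpu-node-1.internal:8000;
    server gpu-node-2.internal:8000;
}
server {
    listen 8000;
    location / {
        proxy_pass http://llm_backend;
        proxy_read_timeout 300s;
    }
}
EOF
sudo nginx -s reload
```

Point the MagOneAI provider endpoint at the load balancer, not at an individual replica.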
Latency vs quality tradeoffs
Optimize for either speed or quality based on your use case.
For low latency (faster responses):
- Use smaller models (7B or 13B)
- Lower max tokens
- Quantized models (4-bit)
- Higher temperature can occasionally shorten responses, but the effect is unreliable
For high quality (better responses):
- Use larger models (70B+)
- Higher max tokens (allow longer responses)
- FP16 precision (no quantization)
- Lower temperature for consistency
Cost analysis: Private vs cloud
Understand when private deployment is more cost-effective than cloud APIs.
Fixed costs of private deployment
Infrastructure costs:
- GPU servers: $5,000-$50,000 per GPU (owned hardware)
- Networking and storage
- Monitoring and management tools
Personnel costs:
- DevOps for deployment and maintenance
- ML engineers for model selection and optimization
- On-call support for production issues
One-time costs:
- Initial setup and testing
- Model fine-tuning (if needed)
- Integration and tooling
Variable costs of cloud APIs
Per-token pricing:
- GPT-4o: $5-15 per million tokens
- Claude Opus: $15-75 per million tokens
- Gemini Pro: $1.25-5 per million tokens
Example monthly costs (GPT-4o, blended input/output pricing):
- 100M tokens/month on GPT-4o: ~$1,000/month
- 1B tokens/month on GPT-4o: ~$10,000/month
- 10B tokens/month on GPT-4o: ~$100,000/month
Breakeven analysis
Cloud GPU costs (approximate):
- 1x A100 (80GB): $2,160/month (24/7)
- 2x A100 (for 70B model): $4,320/month
- 4x A100 (for fast 70B or 175B model): $8,640/month
Example breakeven for a 70B model on 2x A100, assuming:
- Fixed cost: $4,320/month
- Variable cost (cloud): $10/million tokens
At $10 per million tokens, the breakeven point is $4,320 / $10 ≈ 432M tokens per month:
| Monthly tokens | Cloud cost (GPT-4o) | Private cost (2x A100) | Winner |
|---|---|---|---|
| 100M | $1,000 | $4,320 | Cloud |
| 500M | $5,000 | $4,320 | Private |
| 1B | $10,000 | $4,320 | Private |
| 10B | $100,000 | $4,320 | Private |
Other factors to consider
Beyond cost:
- Data sovereignty: Private is the only option for some regulated data
- Latency: Private can be faster (no internet round-trip)
- Customization: Fine-tuning only practical with private models
- Reliability: You control uptime (but also responsible for maintenance)
Example deployments
Small-scale development: Ollama on workstation
For development and testing, run Ollama locally:
- Endpoint: http://localhost:11434/v1
- Model: llama3.1:70b
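A sketch of the corresponding commands; the model tag matches the configuration above, though a 70B model needs a well-equipped workstation (smaller tags such as llama3.1:8b are easier to run):

```bash
# Download the model and make sure the Ollama server is running; it exposes
# an OpenAI-compatible API at http://localhost:11434/v1.
ollama pull llama3.1:70b
ollama serve    # skip if Ollama already runs as a background service

# Quick sanity check against the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/models
```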
Medium-scale production: vLLM on single GPU server
For production workloads on a single GPU server:
- Endpoint: http://gpu-server.internal:8000/v1
- Model: meta-llama/Llama-3.1-70B-Instruct
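One way to run this is the upstream vllm/vllm-openai container image; a sketch, assuming two GPUs on the host and a Hugging Face token for the gated Llama weights:

```bash
# Run vLLM's OpenAI-compatible server as a container (sketch; the volume mount
# caches model weights between restarts).
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2
```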
Large-scale production: TGI with Kubernetes
For enterprise-scale deployments with auto-scaling:
- Endpoint: http://llama-service.production.svc.cluster.local/v1
- Model: meta-llama/Llama-3.1-70B-Instruct
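A Kubernetes Deployment would typically wrap the TGI container shown below behind a Service named llama-service in the production namespace; this is a sketch of the container invocation only, with illustrative shard count and token handling:

```bash
# Text Generation Inference container (sketch). In Kubernetes this command line
# becomes the pod's container args; a Service then exposes it as llama-service.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 2
```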
Vision model deployment: Qwen3-VL for KYB
For document processing with vision models:
- Endpoint: http://vision-server.internal:8000/v1
- Model: Qwen/Qwen2-VL-7B-Instruct
- Capabilities: Vision enabled
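For reference, the multimodal request shape the endpoint must accept looks like this (OpenAI-style content parts; the base64 image data is truncated and the prompt is illustrative):

```bash
# Vision request with an inline base64 image (sketch).
curl http://vision-server.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Extract the name, date of birth, and ID number."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
      ]
    }],
    "max_tokens": 300
  }'
```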
Security and access control
Network isolation
Deploy private models in isolated network segments:
- VPC/VLAN isolation: Separate network for GPU servers
- Firewall rules: Only the MagOneAI platform can access model endpoints (see the sketch after this list)
- No internet access: Models don’t need outbound internet (after initial download)
- Private DNS: Use internal DNS names, not public endpoints
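A minimal sketch of such a firewall rule, using ufw and an illustrative platform subnet; adapt to your own firewall tooling and network layout:

```bash
# Allow only the MagOneAI platform subnet to reach the model port (sketch).
# Keep your own management access (e.g., SSH) allowed before tightening defaults.
sudo ufw allow OpenSSH
sudo ufw default deny incoming
sudo ufw allow from 10.20.0.0/24 to any port 8000 proto tcp
sudo ufw enable
```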
Authentication
Secure model endpoints with authentication:
- API keys: Simple shared secret
- Mutual TLS: Certificate-based authentication
- JWT tokens: For more complex access control
- Network-level auth: VPN or private network access only
Audit logging
Log all requests to private models:
- Who: Which user or service made the request
- What: Model name, input summary (not full prompt for privacy)
- When: Timestamp
- Result: Success/failure, token count
Model access control
Control which organizations and projects can use each private model:
- Some models only for specific organizations (cost allocation)
- Experimental models only for development projects
- Production-grade models for production projects
Troubleshooting
Model server not reachable
Symptoms: “Connection refused” or “Host unreachable” errors.
Solutions:
- Verify the server is running: `curl http://server:8000/v1/models`
- Check firewall rules allow the MagOneAI platform to connect
- Verify the endpoint URL is correct (include `/v1` if required)
- Test from the MagOneAI server: run `curl` from the platform host to the model server
- Check DNS resolution if using hostnames
Out of memory errors
Symptoms: Model crashes with OOM, or refuses to load.
Solutions:
- Check whether the model is too large for the available GPU memory
- Use quantized version (GPTQ, AWQ) to reduce memory usage
- Increase tensor parallelism to spread model across more GPUs
- Reduce `--max-num-seqs` or `--max-model-len` to use less memory for batching
- Upgrade to GPUs with more memory (A100 80GB vs 40GB)
Slow inference speed
Symptoms: Requests take much longer than expected.
Solutions:
- Check GPU utilization with `nvidia-smi`: is the GPU actually being used?
- Reduce max tokens to limit generation length
- Use quantized models for faster inference
- Enable batching if processing multiple requests
- Check for CPU bottlenecks (tokenization, data loading)
- Use tensor parallelism for larger models
Model returns errors or nonsense
Symptoms: Model responses are gibberish or error messages.
Solutions:
- Verify model loaded correctly (check server logs)
- Ensure prompt format matches model’s expected format
- Check if model requires special chat templates
- Verify model is compatible with serving framework version
- Try with a simpler test prompt to isolate the issue
Inconsistent response quality
Symptoms: Some requests get good responses, others are poor.
Solutions:
- Check temperature setting (lower for consistency)
- Verify system prompt is included in all requests
- Check if quantization is affecting quality (try FP16)
- Review prompts for clarity and specificity
- Consider using a larger model for better reasoning