
What constitutes a private model deployment

A private model deployment means running LLMs on your own infrastructure instead of using cloud APIs. This gives you:
  • Complete data sovereignty — no data sent to third-party APIs
  • Full control — over model versions, fine-tuning, and access
  • Cost predictability — fixed infrastructure cost, near-zero marginal cost per request
  • Compliance — easier to meet regulatory requirements for sensitive data
  • Customization — fine-tune models on proprietary data
Private deployments are ideal for:
  • Regulated industries (healthcare, finance, government)
  • Processing sensitive data (PII, trade secrets, classified information)
  • High-volume workloads (where per-token pricing becomes expensive)
  • Custom domain requirements (fine-tuned models)

OpenAI-compatible API requirement

MagOneAI connects to private models via the OpenAI-compatible API format. This is an industry-standard HTTP API that many serving frameworks implement.

What “OpenAI-compatible” means

Your model server must implement the /v1/chat/completions endpoint with request and response formats matching OpenAI's API.

Request format:
POST /v1/chat/completions
{
  "model": "your-model-name",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 500
}
Response format:
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "your-model-name",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}
As long as your server implements this format, MagOneAI can connect to it — no custom adapters needed.
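For reference, here is a minimal sketch of calling such an endpoint from Python with the official openai client. The base URL, model name, and API key are placeholders for your own deployment; some self-hosted servers ignore the key entirely.
# Minimal sketch: call an OpenAI-compatible endpoint with the openai Python client.
# Replace base_url and model with your own deployment's values.
from openai import OpenAI

client = OpenAI(
    base_url="http://gpu-server.internal:8000/v1",  # your private endpoint
    api_key="not-needed",  # supply a real key only if your server enforces authentication
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)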

Supported serving frameworks

Several open-source frameworks implement the OpenAI-compatible API for serving LLMs. Choose based on your performance, ease-of-use, and feature requirements.

vLLM

High-performance inference with PagedAttention
  • Best for: Production deployments requiring maximum throughput
  • Performance: State-of-the-art serving speed
  • Features: Dynamic batching, optimized attention, quantization support
  • GPU: Nvidia (CUDA), AMD (ROCm)
  • Setup complexity: Medium
vllm.ai

Ollama

Easy local model management
  • Best for: Development, testing, and small-scale production
  • Performance: Good for single-user workloads
  • Features: Simple CLI, automatic model downloads, built-in model library
  • GPU: Nvidia, Apple Silicon (Metal), CPU
  • Setup complexity: Very easy
ollama.com

LM Studio

Desktop application with GUI
  • Best for: Experimentation, local development, non-technical users
  • Performance: Good for single-user workloads
  • Features: User-friendly UI, model browser, one-click downloads
  • GPU: Nvidia, Apple Silicon, CPU
  • Setup complexity: Very easy
lmstudio.ai

Text Generation Inference (TGI)

Hugging Face’s production serving solution
  • Best for: Production deployments, Hugging Face ecosystem users
  • Performance: Excellent, optimized for various hardware
  • Features: Tensor parallelism, quantization, streaming
  • GPU: Nvidia (CUDA)
  • Setup complexity: Medium
huggingface.co/docs/text-generation-inference

Custom deployments

Any server implementing the OpenAI-compatible API format works with MagOneAI:
  • Custom inference servers built in-house
  • Cloud-provider managed endpoints (AWS SageMaker, Azure ML, GCP Vertex AI)
  • Specialized serving solutions for specific models or hardware
  • Multi-model routing servers
If it speaks OpenAI’s API format, MagOneAI can use it.

Configuration

Connect MagOneAI to your private model deployment.
1. Deploy your model using a supported framework

Choose a serving framework (vLLM, Ollama, TGI, etc.) and deploy your model. Ensure the server is running and reachable from the MagOneAI platform.

Example vLLM deployment:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4
2. Note the endpoint URL

Note your model server's URL, including the /v1 path if the framework requires it. You can verify the endpoint responds before adding it to MagOneAI (see the check after the examples below).

Examples:
  • vLLM: http://gpu-server.internal:8000/v1
  • Ollama: http://localhost:11434/v1
  • TGI: http://tgi-server.internal:8080/v1
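To confirm the endpoint is live, you can list the models the server exposes. A quick sketch with the openai Python client (swap in your own URL; the key is a placeholder):
# List the models served at your private endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://gpu-server.internal:8000/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)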
3. Add provider in Admin Portal

Navigate to Admin Portal → Configuration → LLM Providers → Add Provider. Select OpenAI Compatible as the provider type.
4. Enter endpoint URL and credentials

Provide:
  • Provider name: Friendly name (e.g., “Internal LLaMA 3.1”)
  • API endpoint: Your server URL (e.g., http://gpu-server:8000/v1)
  • API key: If your server requires authentication (optional)
  • Model name: The model identifier your server expects
If your server requires an API key, it’s stored in HashiCorp Vault.
5. Test connection

Click Test Connection. MagOneAI will send a test request to verify:
  • Server is reachable
  • API format is correct
  • Authentication works (if required)
  • Model responds correctly
6. Assign to organizations

Choose which MagOneAI organizations can use this private model. Private models can be restricted to specific orgs for cost allocation or access control.
7. Agents can now select the model

The private model appears in agent configuration dropdowns alongside cloud models. Select it just like any other model.

Vision model support

Some open-source models support vision (multimodal) capabilities. You can serve these via vLLM or TGI and use them in MagOneAI workflows.

Vision-capable open-source models

Popular open-source vision models:
  • Qwen3-VL — Strong vision and language understanding, excellent for document analysis
  • LLaVA — Open vision-language model family
  • Moondream — Lightweight vision model for resource-constrained deployments
  • CogVLM — Chinese and English vision-language model
  • InternVL — Versatile vision-language model

Serving vision models with vLLM

vLLM supports multimodal models. Serve a vision model:
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 4096

Using vision models in workflows

Once deployed and configured, use vision models the same way as cloud vision models:
  1. Upload images via workflow inputs
  2. Agent processes images and text together
  3. Model returns analysis, extracted text, or answers
Example use case: KYB (Know Your Business) workflow using Qwen3-VL for identity document processing:
  • Upload driver’s license or passport image
  • Qwen3-VL extracts name, date of birth, ID number, expiration date
  • No cloud APIs — all processing on your infrastructure
  • Compliant with data residency requirements
Private vision models enable document analysis without cloud APIs. Process identity documents, financial statements, medical records, and classified information without sending images to third parties.
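For illustration, a request to a privately hosted vision model over the OpenAI-compatible API might look like the sketch below. The endpoint, model name, file name, and prompt are placeholders.
# Sketch: send an identity-document image to a private vision model.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://vision-server.internal:8000/v1", api_key="not-needed")

# Encode the image as a base64 data URL, as expected by the OpenAI-compatible vision format
with open("drivers_license.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Extract the name, date of birth, ID number, and expiration date."},
        ],
    }],
)

print(response.choices[0].message.content)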

Performance considerations

Private model deployments require careful capacity planning and optimization.

GPU requirements by model size

Approximate GPU memory requirements (FP16 precision):
| Model size | Single GPU | Distributed |
|---|---|---|
| 7B params | 16 GB (RTX 4090, A10) | Not needed |
| 13B params | 26 GB (A100 40GB) | Optional |
| 30B params | 60 GB (A100 80GB) | Recommended (2x A40) |
| 70B params | 140 GB | Required (2-4x A100) |
| 175B params (GPT-3 scale) | 350 GB | Required (8x A100) |
Note: These are minimum requirements for inference. Batch inference and caching require additional memory.
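The figures follow a simple rule of thumb; the sketch below reproduces the arithmetic for weights only (KV cache, activations, and batching add more on top, as noted above).
# FP16 stores 2 bytes per parameter, i.e. roughly 2 GB per billion parameters (weights only).
def fp16_weight_memory_gb(params_billions: float) -> float:
    return params_billions * 2

for size in (7, 13, 30, 70, 175):
    print(f"{size}B params: ~{fp16_weight_memory_gb(size):.0f} GB of weights")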

Quantization for reduced memory usage

Quantization reduces model precision to lower memory requirements with minimal quality loss:
GPTQ
  • Memory reduction: 4-bit reduces memory to ~25% of the FP16 original, 8-bit to ~50%
  • Quality: Minimal degradation for most tasks
  • Performance: Slightly slower than FP16, but fits on smaller GPUs
  • Use case: Run 70B models on 2x A40 instead of 4x A100
  • Example: TheBloke/Llama-3-70B-Instruct-GPTQ (4-bit quantized)

AWQ
  • Memory reduction: Similar to GPTQ (4-bit)
  • Quality: Often better than GPTQ, especially for reasoning
  • Performance: Faster inference than GPTQ on compatible hardware
  • Use case: Best quality-size tradeoff for 4-bit quantization
  • Example: casperhansen/llama-3-70b-instruct-awq

GGUF
  • Memory reduction: 2-bit to 8-bit options
  • Quality: Varies by quantization level
  • Performance: Optimized for CPU inference
  • Use case: Run models on CPU or Apple Silicon without GPUs
  • Example: Used by Ollama for efficient local inference
Quantized models work with vLLM, TGI, and Ollama. Just specify the quantized model variant when loading.

Batch inference for throughput optimization

Batching processes multiple requests simultaneously to maximize GPU utilization:
  • Dynamic batching: vLLM and TGI automatically batch incoming requests
  • Continuous batching: Start generating for one request while others are still arriving
  • PagedAttention: vLLM’s technique for efficient memory usage during batching
Impact: 5-10x higher throughput compared to single-request processing. Configuration:
# --max-num-seqs: maximum number of sequences batched together
# --max-num-batched-tokens: total number of tokens allowed in a batch
vllm serve model-name \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192

Load balancing across multiple GPU instances

For high-volume production workloads, deploy multiple model replicas and load balance across them:
        ┌───────────────┐
        │ Load Balancer │
        └───────┬───────┘
                │
        ┌───────┴───────┐
        │               │
   ┌────▼────┐     ┌────▼────┐
   │ GPU #1  │     │ GPU #2  │
   │  vLLM   │     │  vLLM   │
   └─────────┘     └─────────┘
Use Nginx, HAProxy, or cloud load balancers to distribute requests. Benefits:
  • Higher throughput (handle more concurrent requests)
  • Redundancy (failover if one instance crashes)
  • Rolling updates (update one instance at a time)
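Nginx, HAProxy, or a cloud load balancer is the right tool for this. Purely to illustrate the round-robin idea, here is a client-side sketch across two replicas; the URLs and model name are placeholders, and this is not a substitute for a real load balancer.
# Illustrative only: rotate requests across two vLLM replicas in round-robin order.
import itertools
from openai import OpenAI

replicas = itertools.cycle([
    OpenAI(base_url="http://gpu-1.internal:8000/v1", api_key="not-needed"),
    OpenAI(base_url="http://gpu-2.internal:8000/v1", api_key="not-needed"),
])

def chat(prompt: str) -> str:
    client = next(replicas)  # pick the next replica
    return client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content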

Latency vs quality tradeoffs

Optimize for either speed or quality based on your use case.

For low latency (faster responses):
  • Use smaller models (7B or 13B)
  • Lower max tokens
  • Quantized models (4-bit)
  • Higher temperature can sometimes reduce thinking time
For highest quality:
  • Use larger models (70B+)
  • Higher max tokens (allow longer responses)
  • FP16 precision (no quantization)
  • Lower temperature for consistency
Hybrid approach: Use a small model for classification/routing, large model for complex tasks.
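A sketch of that hybrid routing, assuming a small and a large model served on separate endpoints (the routing prompt, model names, and URLs are placeholders):
# Route simple requests to a small model and complex ones to a large model.
from openai import OpenAI

small = OpenAI(base_url="http://gpu-server.internal:8001/v1", api_key="not-needed")
large = OpenAI(base_url="http://gpu-server.internal:8000/v1", api_key="not-needed")

def answer(question: str) -> str:
    # Cheap triage call on the small model
    triage = small.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": f"Reply with exactly COMPLEX or SIMPLE. Is this request complex?\n\n{question}"}],
        max_tokens=5,
        temperature=0,
    ).choices[0].message.content or ""

    # Complex requests go to the 70B model; everything else stays on the small one
    client, model = (large, "llama-3.1-70b-instruct") if "COMPLEX" in triage.upper() else (small, "llama-3.1-8b-instruct")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content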

Cost analysis: Private vs cloud

Understand when private deployment is more cost-effective than cloud APIs.

Fixed costs of private deployment

Infrastructure costs:
  • GPU servers: $1-5/hour per GPU (cloud), or $5,000-$50,000 per GPU (owned hardware)
  • Networking and storage
  • Monitoring and management tools
Personnel costs:
  • DevOps for deployment and maintenance
  • ML engineers for model selection and optimization
  • On-call support for production issues
One-time costs:
  • Initial setup and testing
  • Model fine-tuning (if needed)
  • Integration and tooling

Variable costs of cloud APIs

Per-token pricing:
  • GPT-4o: $5-15 per million tokens
  • Claude Opus: $15-75 per million tokens
  • Gemini Pro: $1.25-5 per million tokens
High-volume example:
  • 100M tokens/month on GPT-4o: ~$1,000/month
  • 1B tokens/month on GPT-4o: ~$10,000/month
  • 10B tokens/month on GPT-4o: ~$100,000/month

Breakeven analysis

Cloud GPU costs (approximate):
  • 1x A100 (80GB): $3/hour = $2,160/month (24/7)
  • 2x A100 (for a 70B model): $6/hour = $4,320/month
  • 4x A100 (for fast 70B or 175B models): $12/hour = $8,640/month
Breakeven calculation: For GPT-4o equivalent quality (70B model, 2x A100):
  • Fixed cost: $4,320/month
  • Variable cost (cloud): $10/million tokens
Breakeven: ~430 million tokens/month. If you process more than ~430M tokens/month, private deployment is cheaper.

Example scenarios:
| Monthly tokens | Cloud cost (GPT-4o) | Private cost (2x A100) | Winner |
|---|---|---|---|
| 100M | $1,000 | $4,320 | Cloud |
| 500M | $5,000 | $4,320 | Private |
| 1B | $10,000 | $4,320 | Private |
| 10B | $100,000 | $4,320 | Private |
Conclusion: For high-volume workloads (500M+ tokens/month), private deployment offers massive savings.
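The breakeven arithmetic is easy to reproduce; the sketch below uses the illustrative figures from this section, not vendor quotes.
# Breakeven: fixed GPU cost vs per-token cloud pricing (illustrative numbers from this section).
GPU_COST_PER_MONTH = 4_320         # 2x A100 at ~$6/hour, running 24/7
CLOUD_COST_PER_M_TOKENS = 10.0     # ~$10 per million tokens, GPT-4o-class

breakeven = GPU_COST_PER_MONTH / CLOUD_COST_PER_M_TOKENS
print(f"Breakeven: ~{breakeven:.0f}M tokens/month")  # ~432M

for monthly_m_tokens in (100, 500, 1_000, 10_000):
    cloud = monthly_m_tokens * CLOUD_COST_PER_M_TOKENS
    winner = "Private" if cloud > GPU_COST_PER_MONTH else "Cloud"
    print(f"{monthly_m_tokens}M tokens/month: cloud ${cloud:,.0f} vs private ${GPU_COST_PER_MONTH:,} -> {winner}")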

Other factors to consider

Beyond cost:
  • Data sovereignty: Private deployment is the only option for some regulated data
  • Latency: Private can be faster (no internet round-trip)
  • Customization: Fine-tuning only practical with private models
  • Reliability: You control uptime (but also responsible for maintenance)

Example deployments

Small-scale development: Ollama on workstation

For development and testing, run Ollama locally:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve with OpenAI-compatible API
ollama serve
Configure in MagOneAI:
  • Endpoint: http://localhost:11434/v1
  • Model: llama3.1:70b
Use case: Develop workflows locally before deploying to production with cloud or private production models.

Medium-scale production: vLLM on single GPU server

For production workloads on a single GPU server:
# Install vLLM
pip install vllm

# Serve a 70B model on 2 GPUs with tensor parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128 \
  --trust-remote-code
Configure in MagOneAI:
  • Endpoint: http://gpu-server.internal:8000/v1
  • Model: meta-llama/Llama-3.1-70B-Instruct
Use case: Production deployment for medium-volume workloads (10-100 requests/minute).

Large-scale production: TGI with Kubernetes

For enterprise-scale deployments with auto-scaling:
# Kubernetes deployment for TGI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-inference   # must match the pod labels below and the Service selector
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:latest
        args:
          - --model-id=meta-llama/Llama-3.1-70B-Instruct
          - --num-shard=2
          - --port=8080   # match the Service targetPort below
        ports:
          - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 2
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  selector:
    app: llama-inference
  ports:
    - port: 80
      targetPort: 8080
Configure in MagOneAI:
  • Endpoint: http://llama-service.production.svc.cluster.local/v1
  • Model: meta-llama/Llama-3.1-70B-Instruct
Use case: Enterprise production with high availability, auto-scaling, and load balancing.

Vision model deployment: Qwen3-VL for KYB

For document processing with vision models:
# Serve Qwen3-VL with vLLM
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code
Configure in MagOneAI:
  • Endpoint: http://vision-server.internal:8000/v1
  • Model: Qwen/Qwen2-VL-7B-Instruct
  • Capabilities: Vision enabled
Use case: Know Your Business (KYB) workflow that processes identity documents without sending images to cloud APIs. Compliant with data residency requirements.

Security and access control

Network isolation

Deploy private models in isolated network segments:
  • VPC/VLAN isolation: Separate network for GPU servers
  • Firewall rules: Only MagOneAI platform can access model endpoints
  • No internet access: Models don’t need outbound internet (after initial download)
  • Private DNS: Use internal DNS names, not public endpoints

Authentication

Secure model endpoints with authentication:
  • API keys: Simple shared secret
  • Mutual TLS: Certificate-based authentication
  • JWT tokens: For more complex access control
  • Network-level auth: VPN or private network access only
Store authentication credentials in HashiCorp Vault and reference them by path in the configuration.

Audit logging

Log all requests to private models:
  • Who: Which user or service made the request
  • What: Model name, input summary (not full prompt for privacy)
  • When: Timestamp
  • Result: Success/failure, token count
Audit logs help with compliance, debugging, and cost allocation.
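As an illustration, an audit record covering those fields might be emitted like this (field names and values are hypothetical, not a MagOneAI log format):
# Emit one structured audit record per private-model request.
import json, logging, time

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("model_audit")

def log_request(user: str, model: str, prompt: str, success: bool, total_tokens: int) -> None:
    audit.info(json.dumps({
        "who": user,
        "what": {"model": model, "input_summary": prompt[:80]},  # summary only, not the full prompt
        "when": int(time.time()),
        "result": {"success": success, "total_tokens": total_tokens},
    }))

log_request("kyb-workflow", "llama-3.1-70b-instruct", "Summarize the attached contract ...", True, 312)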

Model access control

Control which organizations and projects can use each private model:
  • Some models only for specific organizations (cost allocation)
  • Experimental models only for development projects
  • Production-grade models for production projects
Configure access control when adding the provider in Admin Portal.
Private models give you ultimate AI sovereignty. Process sensitive documents, regulated industry data, and classified information — all without leaving your infrastructure.

Troubleshooting

Symptoms: “Connection refused” or “Host unreachable” errors.
Solutions:
  • Verify server is running: curl http://server:8000/v1/models
  • Check firewall rules allow MagOneAI platform to connect
  • Verify endpoint URL is correct (include /v1 if required)
  • Test from MagOneAI server: curl from platform host to model server
  • Check DNS resolution if using hostnames
Symptoms: Model crashes with OOM, or refuses to load.
Solutions:
  • Model too large for available GPU memory
  • Use quantized version (GPTQ, AWQ) to reduce memory usage
  • Increase tensor parallelism to spread model across more GPUs
  • Reduce --max-num-seqs or --max-model-len to use less memory for batching
  • Upgrade to GPUs with more memory (A100 80GB vs 40GB)
Symptoms: Requests take much longer than expected.
Solutions:
  • Check GPU utilization: is GPU actually being used? (nvidia-smi)
  • Reduce max tokens to limit generation length
  • Use quantized models for faster inference
  • Enable batching if processing multiple requests
  • Check for CPU bottlenecks (tokenization, data loading)
  • Use tensor parallelism for larger models
Symptoms: Model responses are gibberish or error messages.
Solutions:
  • Verify model loaded correctly (check server logs)
  • Ensure prompt format matches model’s expected format
  • Check if model requires special chat templates
  • Verify model is compatible with serving framework version
  • Try with a simpler test prompt to isolate the issue
Symptoms: Some requests get good responses, others are poor.
Solutions:
  • Check temperature setting (lower for consistency)
  • Verify system prompt is included in all requests
  • Check if quantization is affecting quality (try FP16)
  • Review prompts for clarity and specificity
  • Consider using a larger model for better reasoning

Next steps