What constitutes a private model deployment
A private model deployment means running LLMs on your own infrastructure instead of using cloud APIs. This gives you:
- Complete data sovereignty — no data sent to third-party APIs
- Full control — over model versions, fine-tuning, and access
- Cost predictability — fixed infrastructure cost, near-zero marginal cost per request
- Compliance — easier to meet regulatory requirements for sensitive data
- Customization — fine-tune models on proprietary data
Private deployments are typically chosen for:
- Regulated industries (healthcare, finance, government)
- Processing sensitive data (PII, trade secrets, classified information)
- High-volume workloads (where per-token pricing becomes expensive)
- Custom domain requirements (fine-tuned models)
OpenAI-compatible API requirement
MagOneAI connects to private models via the OpenAI-compatible API format. This is an industry-standard HTTP API that many serving frameworks implement.
What “OpenAI-compatible” means
Your model server must implement the `/v1/chat/completions` endpoint with a request/response format matching OpenAI’s API:
Request format:
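A representative request and response, shown as a curl call; the host, API key, and model name below are placeholders for your own deployment:

```bash
# Chat-completion request in the OpenAI-compatible format. The endpoint host,
# API key, and model identifier are placeholders.
curl http://gpu-server.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize the attached contract clause."}
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'

# A compatible server returns the standard response shape, roughly:
# {
#   "id": "chatcmpl-...",
#   "object": "chat.completion",
#   "choices": [
#     {"index": 0,
#      "message": {"role": "assistant", "content": "..."},
#      "finish_reason": "stop"}
#   ],
#   "usage": {"prompt_tokens": 42, "completion_tokens": 31, "total_tokens": 73}
# }
```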
Supported serving frameworks
Several open-source frameworks implement the OpenAI-compatible API for serving LLMs. Choose based on your performance, ease-of-use, and feature requirements.
vLLM
High-performance inference with PagedAttention
- Best for: Production deployments requiring maximum throughput
- Performance: State-of-the-art serving speed
- Features: Dynamic batching, optimized attention, quantization support
- GPU: Nvidia (CUDA), AMD (ROCm)
- Setup complexity: Medium
Ollama
Easy local model management
- Best for: Development, testing, and small-scale production
- Performance: Good for single-user workloads
- Features: Simple CLI, automatic model downloads, built-in model library
- GPU: Nvidia, Apple Silicon (Metal), CPU
- Setup complexity: Very easy
LM Studio
Desktop application with GUI
- Best for: Experimentation, local development, non-technical users
- Performance: Good for single-user workloads
- Features: User-friendly UI, model browser, one-click downloads
- GPU: Nvidia, Apple Silicon, CPU
- Setup complexity: Very easy
Text Generation Inference (TGI)
Hugging Face’s production serving solution
- Best for: Production deployments, Hugging Face ecosystem users
- Performance: Excellent, optimized for various hardware
- Features: Tensor parallelism, quantization, streaming
- GPU: Nvidia (CUDA)
- Setup complexity: Medium
Custom deployments
Any server implementing the OpenAI-compatible API format works with MagOneAI:
- Custom inference servers built in-house
- Cloud-provider managed endpoints (AWS SageMaker, Azure ML, GCP Vertex AI)
- Specialized serving solutions for specific models or hardware
- Multi-model routing servers
Configuration
Connect MagOneAI to your private model deployment.
Deploy your model using a supported framework
Choose a serving framework (vLLM, Ollama, TGI, etc.) and deploy your model. Ensure the server is running and accessible from the MagOneAI platform.
Example vLLM deployment:
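A minimal sketch, assuming a host with two GPUs and the Llama 3.1 70B Instruct checkpoint; adjust the model name, parallelism, and port for your hardware (older vLLM releases use `python -m vllm.entrypoints.openai.api_server` instead of `vllm serve`):

```bash
# Install vLLM and start its OpenAI-compatible server (sketch; flags shown are
# the commonly used ones, not an exhaustive production configuration).
pip install vllm

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000 \
  --api-key "$VLLM_API_KEY"   # optional: require a bearer token from clients

# The server now answers at http://<host>:8000/v1/chat/completions
```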
Note the endpoint URL
Your model server’s URL, including the `/v1` path if required by the framework. Examples:
- vLLM: http://gpu-server.internal:8000/v1
- Ollama: http://localhost:11434/v1
- TGI: http://tgi-server.internal:8080/v1
Add provider in Admin Portal
Navigate to Admin Portal → Configuration → LLM Providers → Add Provider. Select OpenAI Compatible as the provider type.
Enter endpoint URL and credentials
Provide:
- Provider name: Friendly name (e.g., “Internal LLaMA 3.1”)
- API endpoint: Your server URL (e.g., http://gpu-server:8000/v1)
- API key: If your server requires authentication (optional)
- Model name: The model identifier your server expects
Test connection
Click Test Connection. MagOneAI will send a test request to verify:
- Server is reachable
- API format is correct
- Authentication works (if required)
- Model responds correctly
Assign to organizations
Choose which MagOneAI organizations can use this private model. Private models can be restricted to specific orgs for cost allocation or access control.
Vision model support
Some open-source models support vision (multimodal) capabilities. You can serve these via vLLM or TGI and use them in MagOneAI workflows.
Vision-capable open-source models
Popular open-source vision models:
- Qwen3-VL — Strong vision and language understanding, excellent for document analysis
- LLaVA — Open vision-language model family
- Moondream — Lightweight vision model for resource-constrained deployments
- CogVLM — Chinese and English vision-language model
- InternVL — Versatile vision-language model
Serving vision models with vLLM
vLLM supports multimodal models. Serve a vision model:
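A minimal sketch, reusing the Qwen/Qwen2-VL-7B-Instruct checkpoint referenced later on this page; the context-length cap is illustrative:

```bash
# Serve a vision-language model with vLLM's OpenAI-compatible server.
# Size --max-model-len to your GPU memory.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --port 8000 \
  --max-model-len 8192
```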
Using vision models in workflows
Once deployed and configured, use vision models the same way as cloud vision models:
- Upload images via workflow inputs
- Agent processes images and text together
- Model returns analysis, extracted text, or answers
For example, an identity verification (KYC) workflow might:
- Upload a driver’s license or passport image
- Qwen3-VL extracts name, date of birth, ID number, expiration date
- No cloud APIs — all processing on your infrastructure
- Compliant with data residency requirements
Private vision models enable document analysis without cloud APIs. Process identity documents, financial statements, medical records, and classified information without sending images to third parties.
Performance considerations
Private model deployments require careful capacity planning and optimization.
GPU requirements by model size
Approximate GPU memory requirements (FP16 precision, roughly 2 bytes per parameter plus overhead for the KV cache):
| Model Size | Single GPU | Distributed |
|---|---|---|
| 7B params | 16 GB (RTX 4090, A10) | Not needed |
| 13B params | 26 GB (A100 40GB) | Optional |
| 30B params | 60 GB (A100 80GB) | Recommended (2x A40) |
| 70B params | 140 GB | Required (2-4x A100) |
| 175B params (GPT-3 scale) | 350 GB | Required (8x A100) |
Quantization for reduced memory usage
Quantization reduces model precision to lower memory requirements with minimal quality loss:
GPTQ (4-bit/8-bit quantization)
- Memory reduction: 4-bit reduces to ~25% of original size, 8-bit to ~50%
- Quality: Minimal degradation for most tasks
- Performance: Slightly slower than FP16, but fits on smaller GPUs
- Use case: Run 70B models on 2x A40 instead of 4x A100
- Example: TheBloke/Llama-3-70B-Instruct-GPTQ (4-bit quantized)
AWQ (Activation-Aware Weight Quantization)
- Memory reduction: Similar to GPTQ (4-bit)
- Quality: Often better than GPTQ, especially for reasoning
- Performance: Faster inference than GPTQ on compatible hardware
- Use case: Best quality-size tradeoff for 4-bit quantization
- Example: casperhansen/llama-3-70b-instruct-awq
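For example, the AWQ checkpoint above can be served with vLLM on two mid-range GPUs; a sketch, assuming a recent vLLM release:

```bash
# Serve a 4-bit AWQ quantized 70B model (sketch; GPU count and port are illustrative).
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --port 8000
```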
GGUF (llama.cpp format)
- Memory reduction: 2-bit to 8-bit options
- Quality: Varies by quantization level
- Performance: Optimized for CPU inference
- Use case: Run models on CPU or Apple Silicon without GPUs
- Example: Used by Ollama for efficient local inference
Batch inference for throughput optimization
Batching processes multiple requests simultaneously to maximize GPU utilization:
- Dynamic batching: vLLM and TGI automatically batch incoming requests
- Continuous batching: Start generating for one request while others are still arriving
- PagedAttention: vLLM’s technique for efficient memory usage during batching
Load balancing across multiple GPU instances
For high-volume production workloads, deploy multiple model replicas and load balance across them (see the sketch after this list). This gives you:
- Higher throughput (handle more concurrent requests)
- Redundancy (failover if one instance crashes)
- Rolling updates (update one instance at a time)
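A minimal sketch of round-robin load balancing with nginx in front of two vLLM replicas; the hostnames, ports, and file path are illustrative:

```bash
# Write an nginx upstream config that balances across two model replicas,
# then reload nginx. Long generations need a generous read timeout.
cat <<'EOF' | sudo tee /etc/nginx/conf.d/llm-upstream.conf
upstream llm_backend {
    server gpu-node-1.internal:8000;
    server gpu-node-2.internal:8000;
}
server {
    listen 8000;
    location / {
        proxy_pass http://llm_backend;
        proxy_read_timeout 300s;
    }
}
EOF
sudo nginx -s reload
```

Point the MagOneAI provider endpoint at the load balancer, not at an individual replica.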
Latency vs quality tradeoffs
Optimize for either speed or quality based on your use case.
For low latency (faster responses):
- Use smaller models (7B or 13B)
- Lower max tokens
- Quantized models (4-bit)
- Higher temperature can occasionally shorten responses, but the effect is unreliable
For high quality (better responses):
- Use larger models (70B+)
- Higher max tokens (allow longer responses)
- FP16 precision (no quantization)
- Lower temperature for consistency
Cost analysis: Private vs cloud
Understand when private deployment is more cost-effective than cloud APIs.
Fixed costs of private deployment
Infrastructure costs:
- GPU servers: $5,000-$50,000 per GPU (owned hardware)
- Networking and storage
- Monitoring and management tools
Personnel costs:
- DevOps for deployment and maintenance
- ML engineers for model selection and optimization
- On-call support for production issues
One-time costs:
- Initial setup and testing
- Model fine-tuning (if needed)
- Integration and tooling
Variable costs of cloud APIs
Per-token pricing:
- GPT-4o: $5-15 per million tokens
- Claude Opus: $15-75 per million tokens
- Gemini Pro: $1.25-5 per million tokens
Example monthly costs (GPT-4o, blended input/output pricing):
- 100M tokens/month on GPT-4o: ~$1,000/month
- 1B tokens/month on GPT-4o: ~$10,000/month
- 10B tokens/month on GPT-4o: ~$100,000/month
Breakeven analysis
Cloud GPU costs (approximate):
- 1x A100 (80GB): $2,160/month (24/7)
- 2x A100 (for 70B model): $4,320/month
- 4x A100 (for fast 70B or 175B model): $8,640/month
Example breakeven for a 70B model on 2x A100, assuming:
- Fixed cost: $4,320/month
- Variable cost (cloud): $10/million tokens
At $10 per million tokens, the breakeven point is $4,320 / $10 ≈ 432M tokens per month:
| Monthly tokens | Cloud cost (GPT-4o) | Private cost (2x A100) | Winner |
|---|---|---|---|
| 100M | $1,000 | $4,320 | Cloud |
| 500M | $5,000 | $4,320 | Private |
| 1B | $10,000 | $4,320 | Private |
| 10B | $100,000 | $4,320 | Private |
Other factors to consider
Beyond cost:
- Data sovereignty: Private is the only option for some regulated data
- Latency: Private can be faster (no internet round-trip)
- Customization: Fine-tuning only practical with private models
- Reliability: You control uptime (but also responsible for maintenance)
Example deployments
Small-scale development: Ollama on workstation
For development and testing, run Ollama locally:
- Endpoint: http://localhost:11434/v1
- Model: llama3.1:70b
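A sketch of the corresponding commands; the model tag matches the configuration above, though a 70B model needs a well-equipped workstation (smaller tags such as llama3.1:8b are easier to run):

```bash
# Download the model and make sure the Ollama server is running; it exposes
# an OpenAI-compatible API at http://localhost:11434/v1.
ollama pull llama3.1:70b
ollama serve    # skip if Ollama already runs as a background service

# Quick sanity check against the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/models
```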
Medium-scale production: vLLM on single GPU server
For production workloads on a single GPU server:
- Endpoint: http://gpu-server.internal:8000/v1
- Model: meta-llama/Llama-3.1-70B-Instruct
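One way to run this is the upstream vllm/vllm-openai container image; a sketch, assuming two GPUs on the host and a Hugging Face token for the gated Llama weights:

```bash
# Run vLLM's OpenAI-compatible server as a container (sketch; the volume mount
# caches model weights between restarts).
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2
```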
Large-scale production: TGI with Kubernetes
For enterprise-scale deployments with auto-scaling:
- Endpoint: http://llama-service.production.svc.cluster.local/v1
- Model: meta-llama/Llama-3.1-70B-Instruct
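A Kubernetes Deployment would typically wrap the TGI container shown below behind a Service named llama-service in the production namespace; this is a sketch of the container invocation only, with illustrative shard count and token handling:

```bash
# Text Generation Inference container (sketch). In Kubernetes this command line
# becomes the pod's container args; a Service then exposes it as llama-service.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 2
```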
Vision model deployment: Qwen3-VL for KYB
For document processing with vision models:
- Endpoint: http://vision-server.internal:8000/v1
- Model: Qwen/Qwen2-VL-7B-Instruct
- Capabilities: Vision enabled
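For reference, the multimodal request shape the endpoint must accept looks like this (OpenAI-style content parts; the base64 image data is truncated and the prompt is illustrative):

```bash
# Vision request with an inline base64 image (sketch).
curl http://vision-server.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Extract the name, date of birth, and ID number."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
      ]
    }],
    "max_tokens": 300
  }'
```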
Security and access control
Network isolation
Deploy private models in isolated network segments:
- VPC/VLAN isolation: Separate network for GPU servers
- Firewall rules: Only the MagOneAI platform can access model endpoints (see the sketch after this list)
- No internet access: Models don’t need outbound internet (after initial download)
- Private DNS: Use internal DNS names, not public endpoints
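A minimal sketch of such a firewall rule, using ufw and an illustrative platform subnet; adapt to your own firewall tooling and network layout:

```bash
# Allow only the MagOneAI platform subnet to reach the model port (sketch).
# Keep your own management access (e.g., SSH) allowed before tightening defaults.
sudo ufw allow OpenSSH
sudo ufw default deny incoming
sudo ufw allow from 10.20.0.0/24 to any port 8000 proto tcp
sudo ufw enable
```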
Authentication
Secure model endpoints with authentication:
- API keys: Simple shared secret
- Mutual TLS: Certificate-based authentication
- JWT tokens: For more complex access control
- Network-level auth: VPN or private network access only
Audit logging
Log all requests to private models:
- Who: Which user or service made the request
- What: Model name, input summary (not full prompt for privacy)
- When: Timestamp
- Result: Success/failure, token count
Model access control
Control which organizations and projects can use each private model:
- Some models only for specific organizations (cost allocation)
- Experimental models only for development projects
- Production-grade models for production projects
Troubleshooting
Model server not reachable
Symptoms: “Connection refused” or “Host unreachable” errors.
Solutions:
- Verify the server is running: `curl http://server:8000/v1/models`
- Check firewall rules allow the MagOneAI platform to connect
- Verify the endpoint URL is correct (include `/v1` if required)
- Test from the MagOneAI server: run `curl` from the platform host to the model server
- Check DNS resolution if using hostnames
Out of memory errors
Symptoms: Model crashes with OOM, or refuses to load.
Solutions:
- Check whether the model is too large for the available GPU memory
- Use quantized version (GPTQ, AWQ) to reduce memory usage
- Increase tensor parallelism to spread model across more GPUs
- Reduce `--max-num-seqs` or `--max-model-len` to use less memory for batching
- Upgrade to GPUs with more memory (A100 80GB vs 40GB)
Slow inference speed
Symptoms: Requests take much longer than expected.
Solutions:
- Check GPU utilization with `nvidia-smi`: is the GPU actually being used?
- Reduce max tokens to limit generation length
- Use quantized models for faster inference
- Enable batching if processing multiple requests
- Check for CPU bottlenecks (tokenization, data loading)
- Use tensor parallelism for larger models
Model returns errors or nonsense
Symptoms: Model responses are gibberish or error messages.
Solutions:
- Verify model loaded correctly (check server logs)
- Ensure prompt format matches model’s expected format
- Check if model requires special chat templates
- Verify model is compatible with serving framework version
- Try with a simpler test prompt to isolate the issue
Inconsistent response quality
Symptoms: Some requests get good responses, others are poor.
Solutions:
- Check temperature setting (lower for consistency)
- Verify system prompt is included in all requests
- Check if quantization is affecting quality (try FP16)
- Review prompts for clarity and specificity
- Consider using a larger model for better reasoning