The Gap Between Vision and Version 1
In my last post, I wrote about architecting an AI orchestrator with grand visions of auto-scaling controllers, queue-based request routing, and multi-provider GPU orchestration. The architecture looked beautiful in ASCII diagrams.
Then I sat down to actually build it.
This is the story of what happened when theory met practice - the decisions we made, the complexity we cut, and why our actual POC looks nothing like the original architecture (and that's totally fine).
TL;DR - What Changed?
Original Vision:
K8s Cluster → API Gateway → Request Queue → Auto-Scaler Controller
→ SSH Tunnel Pod → vast.ai GPU → vLLM
Actual POC:
Client → Trigger API (K8s) → vast.ai API → GPU Instance
Client → Direct HTTPS → GPU Instance vLLM ✅
Key Simplifications:
- ❌ No request queue (Redis/RabbitMQ)
- ❌ No auto-scaler controller monitoring queue depth
- ❌ No SSH tunnel complexity
- ✅ Simple trigger API with API key auth
- ✅ Direct client access to GPU instances
- ✅ Hetzner S3 for model storage (specific, not generic)
- ✅ Prometheus remote write (detailed implementation)
Let's dive into why we made these decisions.
Decision 1: Direct Access vs. Proxy Pattern
Original Plan: SSH Tunnel from K8s
The initial architecture had this whole SSH tunnel setup:
# What I thought I needed
apiVersion: v1
kind: Pod
metadata:
  name: vastai-tunnel
spec:
  containers:
    - name: ssh-tunnel
      command: ["/bin/sh", "-c"]
      args:
        - ssh -L 0.0.0.0:8000:localhost:8000 root@vastai-host -N
The reasoning was solid:
- All traffic routes through K8s (easier monitoring)
- Centralized access control
- Clients never see vast.ai IPs directly
What We Chose: Direct Client → GPU Access
# Actual POC flow
curl https://123.45.67.89:8000/v1/completions \
-H "Authorization: Bearer ${VASTAI_TOKEN}" \
-d '{"model": "llama-3.1-8b", "prompt": "..."}'
Why we simplified:
- SSH tunnels add latency - Every request goes K8s → SSH → GPU instead of Client → GPU
- Operational complexity - Managing SSH keys, tunnel health checks, reconnection logic
- POC goal clarity - We're validating GPU provisioning, not building a production proxy
- vast.ai already has HTTPS - The instances support direct HTTPS access with bearer tokens
The hard truth: We were solving problems we don't have yet. For a POC validating "$0.35/hr GPU instances can run vLLM inference," the SSH tunnel is architectural gold-plating.
When we'll add it back: Production migration to RunPod, when we need:
- Centralized rate limiting
- Request logging/audit trails
- Client IP hiding
- Load balancing across multiple GPUs
Decision 2: Trigger API vs. Queue-Based Auto-Scaling
Original Plan: Queue Depth Monitoring
The vision was beautiful:
# Imagined auto-scaler logic
while True:
    queue_depth = redis.llen('inference_queue')
    if queue_depth > 10 and active_gpus == 0:
        provision_gpu_instance()
    elif queue_depth == 0 and gpu_idle_time > 900:  # 15min
        terminate_gpu_instance()
    time.sleep(30)
What this requires:
- Redis/RabbitMQ for request queue
- Background worker monitoring queue depth
- State management for active instances
- Request routing logic (which GPU gets which request?)
- Graceful shutdown handling (in-flight requests)
What We Built: On-Demand Trigger API
# FastAPI endpoint - actual POC
@app.post("/provision")
async def provision_gpu(request: ProvisionRequest):
    # Search vast.ai for available instance
    offers = vastai.search_offers(
        gpu_name="RTX 4090",
        max_price=0.50
    )
    # Create instance
    instance = vastai.create_instance(
        offer_id=offers[0].id,
        image="vllm/vllm-openai:latest"
    )
    # Wait for SSH, bootstrap vLLM
    await bootstrap_instance(instance.id)
    return {
        "endpoint": f"https://{instance.ip}:8000",
        "id": instance.id
    }
Why we simplified:
- POC usage pattern - We're manually testing, not handling production traffic
- Complexity explosion - Queue-based scaling is 5x the code for 0x the POC value
- Vast.ai reality - Instance startup takes 60-120 seconds anyway, queue won't help cold starts
- Iteration speed - We can test provisioning in 2 minutes vs. 2 weeks building queue infrastructure
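The ProvisionRequest body used in the /provision endpoint above isn't spelled out anywhere else in this post; as a working assumption, it only needs the fields our curl examples send plus a price cap. A minimal pydantic sketch (field set and defaults are assumptions, not a settled contract):

```python
from pydantic import BaseModel, Field

class ProvisionRequest(BaseModel):
    """Request body for POST /provision (field set is a working assumption)."""
    model: str = Field(..., description="Model name as stored in Hetzner S3, e.g. llama-3.1-8b")
    gpu_type: str = Field("rtx4090", description="GPU filter passed to the vast.ai search")
    max_price: float = Field(0.50, description="Maximum $/hr we are willing to pay")
```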
The key insight: For a POC, "on-demand" can literally mean "I call an API when I want a GPU." We don't need to auto-scale from zero usage yet.
When we'll add queues: When we have:
- Multiple concurrent users
- Unpredictable request patterns
- Need to batch requests for cost efficiency
- SLA requirements for response time
Right now? We have none of that.
Decision 3: Hetzner S3 for Model Storage
Original Plan: "Model Registry & Cache"
The original architecture had a vague box labeled "Model Registry & Cache" with bullet points:
- Model weights storage ✓
- Version management ✓
Cool. How do we actually build that?
What We Specified: Hetzner Object Storage (S3)
# Actual implementation
# 1. Upload model to Hetzner S3 (one-time setup)
aws s3 cp models/llama-3.1-8b-instruct/ \
s3://gpu-inference-models/llama-3.1-8b-instruct/ \
--endpoint-url=https://fsn1.your-objectstorage.com \
--recursive
# 2. Bootstrap script on GPU instance downloads model
aws s3 cp s3://gpu-inference-models/llama-3.1-8b-instruct \
/models/llama-3.1-8b-instruct \
--endpoint-url=$HETZNER_S3_ENDPOINT \
--recursive
# 3. Start vLLM with downloaded model
python -m vllm.entrypoints.openai.api_server \
--model /models/llama-3.1-8b-instruct \
--port 8000
Why Hetzner S3 specifically:
- Regional proximity - Hetzner's Falkenstein datacenter → vast.ai Europe instances = faster downloads
- Cost - €0.0049/GB storage + €0.01/GB egress (first 1TB free)
- S3-compatible API - Works with standard AWS CLI/boto3 (see the sketch after this list)
- Already using Hetzner - Same provider, same authentication model
- Terraform integration - We can provision buckets alongside K8s resources
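Because the API is S3-compatible, the bootstrap download can also be done from Python with plain boto3 instead of shelling out to the AWS CLI. A minimal sketch, assuming the bucket and endpoint from the commands above and credentials in the standard AWS environment variables:

```python
import os
import boto3

# Hetzner Object Storage speaks the S3 protocol; only the endpoint differs from AWS.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["HETZNER_S3_ENDPOINT"],  # e.g. https://fsn1.your-objectstorage.com
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

bucket = "gpu-inference-models"
prefix = "llama-3.1-8b-instruct/"
dest = "/models/llama-3.1-8b-instruct"

# Mirror every object under the model prefix into the local model directory.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip "directory" placeholder objects
            continue
        local_path = os.path.join(dest, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)
```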
Cost comparison for 8B model (~16GB):
| Provider | Storage | Egress (10x downloads/mo) | Total |
|---|---|---|---|
| Hetzner S3 | €0.08/mo | €1.60 | €1.68/mo |
| AWS S3 eu-central-1 | $0.18/mo | $9.00 | $9.18/mo |
| Cloudflare R2 | $0.15/mo | $0.00 | $0.15/mo |
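For transparency, the Hetzner row is just the rates quoted above applied to a 16GB model; a quick back-of-the-envelope check (it charges egress at the full rate, ignoring the free allowance mentioned earlier):

```python
model_size_gb = 16          # llama-3.1-8b weights, roughly
downloads_per_month = 10    # one download per instance bootstrap

storage_eur = model_size_gb * 0.0049                      # ~€0.08/mo
egress_eur = model_size_gb * downloads_per_month * 0.01   # €1.60 (ignores the 1TB free allowance)
print(f"Hetzner S3: €{storage_eur:.2f} + €{egress_eur:.2f} = €{storage_eur + egress_eur:.2f}/mo")
# -> Hetzner S3: €0.08 + €1.60 = €1.68/mo
```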
Why not Cloudflare R2 (cheapest)? For a POC, staying in the Hetzner ecosystem is worth €1.53/mo to avoid managing another provider's credentials and networking.
Tradeoff we accepted: Vendor concentration. We're all-in on Hetzner for now (K8s + S3). That's fine for POC, we can migrate later if needed.
Decision 4: Full Automation from Day 1
Original Plan: Manual POC → Automation Later
The original roadmap was:
Phase 1: Manually provision vast.ai (web UI), SSH in, run commands
Phase 2: Build automation
What We're Actually Doing: Automate Everything Immediately
# Instance manager - from POC day 1
async def provision_and_bootstrap(model_name: str):
    """Fully automated GPU provisioning"""
    # 1. Find best instance
    instance = await find_best_vastai_instance(
        gpu_type="rtx4090",
        max_price=0.50
    )
    # 2. Create instance
    created = await vastai_api.create_instance(instance.id)
    # 3. Wait for SSH (with timeout)
    await wait_for_ssh(created.ip, timeout=300)
    # 4. Run bootstrap script via SSH
    await ssh_execute(created.ip, f"""
        # Download model from Hetzner S3
        aws s3 cp s3://models/{model_name} /models/{model_name} \
            --endpoint-url={S3_ENDPOINT} --recursive
        # Start vLLM
        docker run -d --gpus all -p 8000:8000 \
            vllm/vllm-openai:latest \
            --model /models/{model_name}
        # Start monitoring agents
        prometheus-node-exporter --web.listen-address=:9100 &
        nvidia-gpu-exporter --web.listen-address=:9101 &
    """)
    # 5. Health check vLLM endpoint
    await wait_for_healthy(f"https://{created.ip}:8000/health")
    # 6. Save to state store
    await state_store.save({
        "id": created.id,
        "endpoint": f"https://{created.ip}:8000",
        "cost_per_hour": instance.price,
        "created_at": datetime.utcnow()
    })
    return created
Why we changed our minds:
- Manual testing is slow - Each test requires 10+ manual steps × 10-20 tests = hours wasted
- Reproducibility - Automated bootstrapping ensures every instance is configured identically
- Metrics from day 1 - If we're automating anyway, we get monitoring for free
- POC ≠ Throwaway code - This orchestrator is the production foundation, just simplified
The key realization: Building the automation takes about a day upfront and saves 30+ minutes per test. Over the 10-20 test runs we're planning, it pays for itself.
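The wait_for_ssh and wait_for_healthy helpers referenced in the code above are just polling loops. A minimal sketch of what we mean (the timeout behaviour and the aiohttp usage are our assumptions, not a settled implementation):

```python
import asyncio
import aiohttp

async def wait_for_ssh(host: str, port: int = 22, timeout: int = 300) -> None:
    """Poll until the instance accepts TCP connections on the SSH port."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        try:
            _, writer = await asyncio.wait_for(asyncio.open_connection(host, port), timeout=5)
            writer.close()
            return
        except (OSError, asyncio.TimeoutError):
            if asyncio.get_running_loop().time() > deadline:
                raise TimeoutError(f"SSH on {host}:{port} not reachable after {timeout}s")
            await asyncio.sleep(5)

async def wait_for_healthy(url: str, timeout: int = 600) -> None:
    """Poll vLLM's /health endpoint until it returns 200 (self-signed certs tolerated)."""
    deadline = asyncio.get_running_loop().time() + timeout
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(url, ssl=False) as resp:
                    if resp.status == 200:
                        return
            except aiohttp.ClientError:
                pass
            if asyncio.get_running_loop().time() > deadline:
                raise TimeoutError(f"{url} not healthy after {timeout}s")
            await asyncio.sleep(5)
```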
What we're NOT automating (yet):
- Auto-scaling based on metrics (requires queue)
- Multi-region failover (single region for POC)
- Cost optimization algorithms (use first available instance)
- A/B testing across GPU types
Decision 5: Prometheus Remote Write (Not Just "Monitoring")
Original Plan: "Add Monitoring Later"
The first architecture doc had:
- ✅ Deploy vLLM
- ✅ Test inference
- 🔜 Add monitoring
What We Designed: Monitoring as a First-Class Component
# Prometheus config on GPU instance
remote_write:
  - url: "http://prometheus.gpu-system.svc.cluster.local:9090/api/v1/write"
    basic_auth:
      username: "remote-write"
      password_file: /etc/prometheus/password
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000'] # vLLM metrics endpoint
    scrape_interval: 15s
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100'] # node_exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['localhost:9101'] # nvidia_gpu_exporter
Grafana dashboard queries:
# GPU utilization by instance
avg by (instance_id) (nvidia_gpu_utilization_percent)
# Cost per hour (custom metric)
sum(vast_instance_cost_per_hour) by (instance_id)
# Inference latency p95
histogram_quantile(0.95,
rate(vllm_request_duration_seconds_bucket[5m])
)
# Idle time (for auto-teardown logic)
time() - vast_instance_last_request_timestamp_seconds
Why we prioritized this:
- Cost tracking - Without metrics, we're flying blind on actual costs
- Performance validation - POC means measuring if vLLM is actually fast enough
- Idle detection - Auto-shutdown requires knowing when instance is idle
- Debugging - When things break (they will), we need visibility
The insight: Monitoring isn't a "later" feature, it's how we know if the POC succeeded.
Our POC success criteria are literally metric-based:
- ✅ Provisioning time <3 minutes (measured by vast_instance_provisioning_duration_seconds)
- ✅ Inference p95 latency <2s (measured by vllm_request_duration_seconds)
- ✅ Cost <$0.50/hour (measured by vast_instance_cost_per_hour)
Without metrics, we can't validate the POC. So monitoring isn't optional.
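One practical detail: vast_instance_cost_per_hour and vast_instance_last_request_timestamp_seconds in the queries above aren't emitted by any off-the-shelf exporter; the orchestrator has to publish them itself. A minimal sketch with the prometheus_client library (the port, label set, and helper names are assumptions):

```python
from prometheus_client import Gauge, start_http_server

# Custom gauges the orchestrator exposes for the PromQL queries above.
cost_per_hour = Gauge(
    "vast_instance_cost_per_hour",
    "Hourly price of an active vast.ai instance in USD",
    ["instance_id"],
)
last_request_ts = Gauge(
    "vast_instance_last_request_timestamp_seconds",
    "Unix timestamp of the last inference request routed to the instance",
    ["instance_id"],
)

def record_provisioned(instance_id: str, price: float) -> None:
    cost_per_hour.labels(instance_id=instance_id).set(price)

def record_request(instance_id: str) -> None:
    last_request_ts.labels(instance_id=instance_id).set_to_current_time()

if __name__ == "__main__":
    start_http_server(9102)  # scraped by the same Prometheus that receives the remote writes
```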
What We Learned: The POC Philosophy Shift
Before Planning: "Let's Validate the Idea Quickly"
Thinking:
- Manually provision vast.ai instance
- SSH in, run vLLM manually
- Test a few requests
- Check if it works
- Estimated time: 2-3 hours
Problem: This validates "can vLLM run on vast.ai?" but not "can we orchestrate vast.ai instances?"
After Planning: "POC = Production Foundation, Simplified"
Thinking:
- Build the orchestrator (trigger API, instance manager)
- Automate provisioning and bootstrapping
- Integrate monitoring from day 1
- Test the system, not just the GPU
- Estimated time: 1-2 weeks
Value: This validates the architecture, not just the technology.
What we're building:
┌─────────────────────────────────────────────┐
│ Production Architecture │
│ ┌─────────────────────────────────────┐ │
│ │ Queue-Based Auto-Scaler │ │ ← Add later
│ │ Multi-Provider Failover │ │ ← Add later
│ │ Advanced Cost Optimization │ │ ← Add later
│ └─────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ POC Foundation (Week 1-5) │ │
│ │ ✓ Trigger API │ │ ← Building now
│ │ ✓ Instance Manager │ │ ← Building now
│ │ ✓ Automated Bootstrapping │ │ ← Building now
│ │ ✓ Prometheus Integration │ │ ← Building now
│ │ ✓ Hetzner S3 Model Storage │ │ ← Building now
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
The shift: POC isn't a throwaway prototype. It's the foundation we'll iteratively add complexity to.
Hard Truths We Accepted
1. Direct Client Access is "Insecure"
Reality: Clients hit vast.ai IPs directly, no centralized gateway.
Why it's fine for POC:
- We're the only users (no multi-tenancy)
- vast.ai provides HTTPS + bearer token auth
- We can add API gateway later when we have >1 user
When it becomes a problem: Production with external users needing rate limiting, audit logs, or IP whitelisting.
2. Vast.ai Instance Failures Will Happen
Reality: Marketplace provider, instance reliability varies.
Why it's fine for POC:
- Manual retry is acceptable (we're testing, not serving production traffic)
- Failures teach us what error handling to build
- Instance selection algorithm will evolve based on real failure data
What we're tracking:
# Metrics to inform production failover logic
vast_instance_provision_failures_total
vast_instance_ssh_timeout_seconds
vast_instance_bootstrap_failure_reason
3. We're Ignoring Multi-Region/Multi-Provider
Reality: Single provider (vast.ai), single region (wherever we get an instance).
Why it's fine for POC:
- Latency doesn't matter for testing
- Multi-provider adds 3x complexity
- We'll migrate to RunPod for production anyway
When we'll address it: RunPod migration (Phase 2), when we need:
- Sub-3 second cold starts (RunPod serverless)
- SLA guarantees (vast.ai has none)
- SOC 2 compliance (production requirement)
4. No Request Batching/Optimization
Reality: One request = one API call to vLLM. No batching, no connection pooling, no smart queuing.
Why it's fine for POC:
- vLLM handles batching internally (continuous batching)
- Low request volume (manual testing)
- Premature optimization is the root of all evil
When we'll optimize: When metrics show it's needed (e.g., request rate >10/sec, queue depth regularly >5).
The 5-Week Implementation Plan (Reality Check)
Week 1: Manual Validation
# Goal: Prove vast.ai + vLLM + S3 works at all
# 1. Create S3 bucket, upload test model
terraform apply -target=hetzner_s3_bucket.models
# 2. Manually provision vast.ai instance (web UI)
# 3. SSH in, download model from S3
# 4. Run vLLM manually
# 5. Test inference with curl
# Success metric: One successful inference request
Expected failures:
- S3 authentication errors (wrong credentials format)
- Model download timeout (large file, slow network)
- vLLM OOM (picked wrong GPU size)
Week 2: Orchestrator Core
# Goal: API-triggered provisioning (no manual clicks)
# Build FastAPI trigger API
@app.post("/provision")
async def provision_gpu(request: ProvisionRequest):
    instance = await create_vastai_instance(request.gpu_type)
    return {"id": instance.id, "status": "provisioning"}
# Deploy to K8s
kubectl apply -f k8s/gpu-orchestrator.yaml
# Test
curl -X POST http://gpu-orchestrator/provision \
-H "X-API-Key: ${API_KEY}" \
-d '{"model": "llama-3.1-8b", "gpu_type": "rtx4090"}'
Expected failures:
- vast.ai API rate limiting
- Instance not found (all RTX 4090s taken)
- Timeout waiting for SSH (instance provisioning slow)
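The curl call above sends an X-API-Key header; the check behind it is a few lines of FastAPI dependency code. A sketch under the assumption that the key is mounted from a K8s Secret into an environment variable:

```python
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

def require_api_key(provided: str = Security(api_key_header)) -> None:
    expected = os.environ["ORCHESTRATOR_API_KEY"]  # mounted from a K8s Secret (name is an assumption)
    if not secrets.compare_digest(provided, expected):
        raise HTTPException(status_code=403, detail="Invalid API key")

@app.post("/provision", dependencies=[Depends(require_api_key)])
async def provision_gpu(request: dict):  # ProvisionRequest in the real orchestrator
    ...
```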
Week 3: Bootstrap Automation
# Goal: Instance auto-configures itself (no manual SSH)
# Bootstrap script (runs on instance startup)
#!/bin/bash
set -euo pipefail
# Download model
aws s3 cp s3://models/llama-3.1-8b /models/llama-3.1-8b \
--endpoint-url=$S3_ENDPOINT --recursive
# Start vLLM
docker run -d --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/llama-3.1-8b
# Health check
until curl -f http://localhost:8000/health; do sleep 5; done
Expected failures:
- S3 credentials not passed correctly to instance
- Docker not installed on base image
- vLLM fails to load model (wrong format, corrupted download)
Week 4: Monitoring Integration
# Goal: Metrics flow to Hetzner Prometheus
# Deploy Prometheus with remote write enabled
# Configure GPU instances to scrape and push
# Build Grafana dashboard
# Queries we'll actually run
- GPU utilization over time
- Cost per inference request
- Instance idle time (for auto-shutdown logic)
Expected failures:
- Remote write authentication issues
- Firewall blocking Prometheus traffic
- Missing GPU metrics (nvidia_gpu_exporter not installed)
Week 5: Idle Timeout & Cost Tracking
# Goal: Instance auto-terminates when idle
async def monitor_idle_instances():
    while True:
        instances = await state_store.get_all()
        for instance in instances:
            idle_time = time.time() - instance.last_request
            if idle_time > IDLE_TIMEOUT_SECONDS:
                await vastai_api.destroy_instance(instance.id)
                await state_store.delete(instance.id)
        await asyncio.sleep(60)
Expected failures:
- Race condition (terminate instance during active request) - see the mitigation sketch below
- Cost calculation errors (timezone issues)
- Metrics show idle but instance still processing
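For the race condition in particular, one cheap mitigation is to re-read the idle timestamp and apply a short grace period right before destroying the instance, so a request that slipped in between checks keeps it alive. A sketch that reuses the names from the monitor loop above (the grace period, and a single-instance state_store.get(), are assumptions):

```python
import time

TERMINATION_GRACE_SECONDS = 120  # assumption: long enough for an in-flight request to finish

async def safe_terminate(instance) -> bool:
    """Destroy an instance only if it is still idle after a second look."""
    # First check, same as the monitor loop above.
    if time.time() - instance.last_request < IDLE_TIMEOUT_SECONDS:
        return False

    # Re-read state right before acting: a request may have arrived since the first check.
    fresh = await state_store.get(instance.id)  # assumes the state store can fetch one instance
    if time.time() - fresh.last_request < IDLE_TIMEOUT_SECONDS + TERMINATION_GRACE_SECONDS:
        return False

    await vastai_api.destroy_instance(instance.id)
    await state_store.delete(instance.id)
    return True
```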
Comparing Costs: Theory vs Reality
Original Cost Estimate (from first blog post)
| Scenario | Monthly Cost (RTX 4090 @ $0.34/hr) |
|---|---|
| 24/7 | $248.20 |
| On-demand (20h/mo) | $6.80 |
Assumptions:
- Perfect utilization (no idle time)
- Instant provisioning/teardown (no waste)
- No failed instances (no retry costs)
POC Reality Budget
# What we'll actually measure
# 1. Development/testing costs
provision_testing = 20 # 20 test runs
avg_instance_lifetime = 0.5 # 30min per test
cost_per_test = 0.35 * avg_instance_lifetime # $0.175 at $0.35/hr
total_dev_cost = provision_testing * cost_per_test # $3.50
# 2. Idle waste (instance ready but not processing)
model_load_time = 60 # seconds
avg_requests_per_session = 5
request_duration = 10 # seconds
session_duration = model_load_time + (avg_requests_per_session * request_duration) # 110 seconds
idle_percentage = model_load_time / session_duration # ~55% idle!
# 3. Failed provisioning attempts
failure_rate = 0.1 # 10% of provisions fail (SSH timeout, etc.)
retry_cost = failure_rate * cost_per_test # $0.0175 per attempt
# Actual expected cost
total_poc_cost = (
    total_dev_cost +                      # $3.50
    (idle_percentage * total_dev_cost) +  # $1.91 (idle waste)
    (provision_testing * retry_cost)      # $0.35 (retries)
)  # ~$5.76
print(f"POC budget: ${total_poc_cost:.2f}")
# vs original estimate: ~$0.50 (manual POC)
The insight: Our automated POC will cost roughly 10x more than the original "manual validation" plan (~$5.76 vs. ~$0.50), but it gives us 100x more value (a reusable orchestration system vs. a one-off test).
What Didn't Change (The Core Principles)
Despite all the simplifications, these stayed constant:
1. vLLM for Inference ✅
Why we stuck with it:
- Best throughput for LLM serving (continuous batching, PagedAttention)
- OpenAI-compatible API (easy migration from OpenAI to self-hosted; see the sketch after this list)
- Active development, supports latest models
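Because vLLM speaks the OpenAI API, pointing an existing client at our self-hosted endpoint is mostly a matter of changing base_url. A sketch using the official openai Python package (endpoint, token, and model name are placeholders matching the examples in this post):

```python
from openai import OpenAI

# Point the standard OpenAI client at the vLLM instance instead of api.openai.com.
client = OpenAI(
    base_url="https://123.45.67.89:8000/v1",  # the direct endpoint returned by /provision
    api_key="VASTAI_TOKEN",                   # placeholder: bearer token for the instance
)

completion = client.completions.create(
    model="llama-3.1-8b",   # must match the model vLLM was started with
    prompt="Explain continuous batching in one sentence.",
    max_tokens=64,
)
print(completion.choices[0].text)
```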
Alternatives we considered:
- Ollama (easier setup, lower throughput)
- TGI/Text Generation Inference (HuggingFace, good but less performant)
- Custom FastAPI + transformers (too much work)
Decision: vLLM is the right choice for production, so POC should use it too.
2. Cost Optimization as Primary Goal ✅
Original thesis: On-demand GPU saves 90%+ vs always-on.
Still true: Our POC validates this by measuring:
- Actual cost per inference request
- Idle time percentage
- Provisioning overhead cost
Even with POC inefficiencies (manual testing, frequent teardown/setup), we expect to validate the cost model.
3. Security Consciousness ✅
Original: Start with SOC 2/GDPR-compliant providers (RunPod, DataCrunch).
POC compromise: Use vast.ai (marketplace) but:
- No sensitive data in POC (only test prompts)
- Document security limitations
- Plan migration to RunPod for production
What we didn't compromise on: transport security and credential handling. No unencrypted connections, no credentials stored insecurely; API keys live in K8s Secrets, and S3 credentials are never logged.
4. Observability from Day 1 ✅
Original: Prometheus + Grafana integration.
POC: Same, but with specific implementation (remote write, custom metrics).
Why this matters: You can't optimize what you don't measure. Cost tracking, performance metrics, idle detection—all require metrics.
The Meta-Lesson: POC Design is About Tradeoffs
What Makes a Good POC?
Bad POC:
- "Quick and dirty" manual tests
- No reusable artifacts
- Validates technology, not architecture
- Throwaway code
Good POC:
- Automated foundation for production
- Validates architectural decisions
- Includes monitoring/observability
- Iteratively adds complexity
Our POC:
- ✅ Reusable orchestrator (FastAPI + instance manager)
- ✅ Automated bootstrapping (SSH scripts, S3 integration)
- ✅ Monitoring integration (Prometheus remote write)
- ✅ Cost tracking (custom metrics)
- ❌ No queue-based auto-scaling (add in Week 6+)
- ❌ No multi-provider failover (add in production migration)
- ❌ No advanced cost optimization (add after measuring real usage)
The Tradeoff Matrix
| Feature | POC Value | Implementation Cost | Decision |
|---|---|---|---|
| Trigger API | High (core orchestration) | Low (1 day) | ✅ Include |
| Instance manager | High (automation) | Medium (2-3 days) | ✅ Include |
| S3 model storage | High (reproducibility) | Low (1 day) | ✅ Include |
| Prometheus metrics | High (validation criteria) | Medium (2 days) | ✅ Include |
| Queue-based scaling | Low (manual testing) | High (5+ days) | ❌ Defer |
| SSH tunnel proxy | Low (direct access works) | Medium (3 days) | ❌ Defer |
| Multi-provider | Low (vast.ai sufficient) | Very high (10+ days) | ❌ Defer |
The principle: Include features that are either:
- Required for validation (metrics, automation)
- Cheap to build now, expensive later (S3 integration, bootstrapping)
Defer features that are:
- Not needed for POC (multi-tenancy, SLA guarantees)
- Easy to add later (queue, advanced scaling)
What We'll Write About Next
Upcoming Blog Posts
Week 2-3: "First GPU Provisioning"
- What worked immediately
- What failed spectacularly
- Actual vast.ai API gotchas
- Bootstrap script debugging war stories
Week 4-5: "POC Results: Metrics Don't Lie"
- Real cost per inference request
- Actual provisioning time distribution (p50, p95, p99)
- GPU utilization patterns
- Idle time waste analysis
Week 6+: "From POC to Production: RunPod Migration"
- Why we're switching providers
- Architecture changes for serverless endpoints
- Cost comparison (vast.ai vs RunPod actual)
- What we kept from the POC
The Code (What We're Actually Building)
All architecture docs and implementation code are open source:
Repository structure:
hetzner-cloud-minimal-kubernetes-cluster/
├── gpu-poc-draft.md                  # Full architecture spec
├── terraform/
│   └── kubernetes/
│       └── gpu-orchestrator/
│           ├── main.tf               # K8s deployment
│           ├── orchestrator/         # FastAPI app
│           │   ├── api.py            # Trigger endpoints
│           │   ├── manager.py        # Instance lifecycle
│           │   └── vastai.py         # API client wrapper
│           └── monitoring/
│               └── dashboards/       # Grafana JSON
└── newsletter-blog/
    └── content/blog/
        ├── ai-orchestrator-architecture-gpu-poc/   # Original vision
        └── gpu-poc-reality-check/                  # This post
Follow along:
- GitHub: fractiunate-ai/hetzner-cloud-minimal-kubernetes-cluster
- Progress updates: Weekly blog posts
- Implementation PRs: Tagged with gpu-poc
Conclusion: Architecture is a Journey
The difference between the original architecture diagram and our actual POC isn't a failure—it's learning.
What we learned:
- Start simple, add complexity later - Queue-based scaling can wait
- Automate what matters - Manual tests waste time, automation is reusable
- Measure everything - Metrics tell us what to optimize
- POC ≠ Throwaway - Build the foundation, not a prototype
- Tradeoffs are explicit - Every deferred feature is documented with "when we'll add it"
The original vision is still valid - we just found a better path to get there.
Instead of:
Plan everything → Build everything → Test everything
We're doing:
Plan foundation → Build foundation → Test →
Measure → Learn → Add complexity → Repeat
Week 1 starts Monday. Time to turn architecture diagrams into running code.
Wish us luck. We'll document every success and failure along the way.
Resources:
- GPU POC Architecture Spec - Full technical design
- Original Architecture Blog - Where we started
- Project Repository - Follow the code
Next post: "Week 1 Results: First GPU Provisioning" - What happens when theory meets vast.ai's reality.
The build begins.