The Gap Between Vision and Version 1
In my last post, I wrote about architecting an AI orchestrator with grand visions of auto-scaling controllers, queue-based request routing, and multi-provider GPU orchestration. The architecture looked beautiful in ASCII diagrams.
Then I sat down to actually build it.
This is the story of what happened when theory met practice - the decisions we made, the complexity we cut, and why our actual POC looks nothing like the original architecture (and that's totally fine).
TL;DR - What Changed?
Original Vision:
K8s Cluster → API Gateway → Request Queue → Auto-Scaler Controller
→ SSH Tunnel Pod → vast.ai GPU → vLLM
Actual POC:
Client → Trigger API (K8s) → vast.ai API → GPU Instance
Client → Direct HTTPS → GPU Instance vLLM ✅
Key Simplifications:
- ❌ No request queue (Redis/RabbitMQ)
- ❌ No auto-scaler controller monitoring queue depth
- ❌ No SSH tunnel complexity
- ✅ Simple trigger API with API key auth
- ✅ Direct client access to GPU instances
- ✅ Hetzner S3 for model storage (specific, not generic)
- ✅ Prometheus remote write (detailed implementation)
Let's dive into why we made these decisions.
Decision 1: Direct Access vs. Proxy Pattern
Original Plan: SSH Tunnel from K8s
The initial architecture had this whole SSH tunnel setup:
# What I thought I needed
apiVersion: v1
kind: Pod
metadata:
  name: vastai-tunnel
spec:
  containers:
    - name: ssh-tunnel
      command: ["/bin/sh", "-c"]
      args:
        - ssh -L 0.0.0.0:8000:localhost:8000 root@vastai-host -N
The reasoning was solid:
- All traffic routes through K8s (easier monitoring)
- Centralized access control
- Clients never see vast.ai IPs directly
What We Chose: Direct Client → GPU Access
# Actual POC flow
curl https://123.45.67.89:8000/v1/completions \
-H "Authorization: Bearer ${VASTAI_TOKEN}" \
-d '{"model": "llama-3.1-8b", "prompt": "..."}'
Why we simplified:
- SSH tunnels add latency - Every request goes K8s → SSH → GPU instead of Client → GPU
- Operational complexity - Managing SSH keys, tunnel health checks, reconnection logic
- POC goal clarity - We're validating GPU provisioning, not building a production proxy
- vast.ai already has HTTPS - The instances support direct HTTPS access with bearer tokens
The hard truth: We were solving problems we don't have yet. For a POC validating "$0.35/hr GPU instances can run vLLM inference," the SSH tunnel is architectural gold-plating.
When we'll add it back: Production migration to RunPod, when we need:
- Centralized rate limiting
- Request logging/audit trails
- Client IP hiding
- Load balancing across multiple GPUs
Decision 2: Trigger API vs. Queue-Based Auto-Scaling
Original Plan: Queue Depth Monitoring
The vision was beautiful:
# Imagined auto-scaler logic
while True:
    queue_depth = redis.llen('inference_queue')
    if queue_depth > 10 and active_gpus == 0:
        provision_gpu_instance()
    elif queue_depth == 0 and gpu_idle_time > 900:  # 15min
        terminate_gpu_instance()
    time.sleep(30)
What this requires:
- Redis/RabbitMQ for request queue
- Background worker monitoring queue depth
- State management for active instances
- Request routing logic (which GPU gets which request?)
- Graceful shutdown handling (in-flight requests)
What We Built: On-Demand Trigger API
# FastAPI endpoint - actual POC
@app.post("/provision")
async def provision_gpu(request: ProvisionRequest):
    # Search vast.ai for available instance
    offers = vastai.search_offers(
        gpu_name="RTX 4090",
        max_price=0.50
    )
    # Create instance
    instance = vastai.create_instance(
        offer_id=offers[0].id,
        image="vllm/vllm-openai:latest"
    )
    # Wait for SSH, bootstrap vLLM
    await bootstrap_instance(instance.id)
    return {
        "endpoint": f"https://{instance.ip}:8000",
        "id": instance.id
    }
Why we simplified:
- POC usage pattern - We're manually testing, not handling production traffic
- Complexity explosion - Queue-based scaling is 5x the code for 0x the POC value
- Vast.ai reality - Instance startup takes 60-120 seconds anyway, queue won't help cold starts
- Iteration speed - We can test provisioning in 2 minutes vs. 2 weeks building queue infrastructure
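The ProvisionRequest body used in the /provision endpoint above isn't spelled out anywhere else in this post; as a working assumption, it only needs the fields our curl examples send plus a price cap. A minimal pydantic sketch (field set and defaults are assumptions, not a settled contract):

```python
from pydantic import BaseModel, Field

class ProvisionRequest(BaseModel):
    """Request body for POST /provision (field set is a working assumption)."""
    model: str = Field(..., description="Model name as stored in Hetzner S3, e.g. llama-3.1-8b")
    gpu_type: str = Field("rtx4090", description="GPU filter passed to the vast.ai search")
    max_price: float = Field(0.50, description="Maximum $/hr we are willing to pay")
```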
The key insight: For a POC, "on-demand" can literally mean "I call an API when I want a GPU." We don't need to auto-scale from zero usage yet.
When we'll add queues: When we have:
- Multiple concurrent users
- Unpredictable request patterns
- Need to batch requests for cost efficiency
- SLA requirements for response time
Right now? We have none of that.
Decision 3: Hetzner S3 for Model Storage
Original Plan: "Model Registry & Cache"
The original architecture had a vague box labeled "Model Registry & Cache" with bullet points:
- Model weights storage ✓
- Version management ✓
Cool. How do we actually build that?
What We Specified: Hetzner Object Storage (S3)
# Actual implementation
# 1. Upload model to Hetzner S3 (one-time setup)
aws s3 cp models/llama-3.1-8b-instruct/ \
s3://gpu-inference-models/llama-3.1-8b-instruct/ \
--endpoint-url=https://fsn1.your-objectstorage.com \
--recursive
# 2. Bootstrap script on GPU instance downloads model
aws s3 cp s3://gpu-inference-models/llama-3.1-8b-instruct \
/models/llama-3.1-8b-instruct \
--endpoint-url=$HETZNER_S3_ENDPOINT \
--recursive
# 3. Start vLLM with downloaded model
python -m vllm.entrypoints.openai.api_server \
--model /models/llama-3.1-8b-instruct \
--port 8000
Why Hetzner S3 specifically:
- Regional proximity - Hetzner's Falkenstein datacenter → vast.ai Europe instances = faster downloads
- Cost - €0.0049/GB storage + €0.01/GB egress (first 1TB free)
- S3-compatible API - Works with standard AWS CLI/boto3 (see the sketch after this list)
- Already using Hetzner - Same provider, same authentication model
- Terraform integration - We can provision buckets alongside K8s resources
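Because the API is S3-compatible, the bootstrap download can also be done from Python with plain boto3 instead of shelling out to the AWS CLI. A minimal sketch, assuming the bucket and endpoint from the commands above and credentials in the standard AWS environment variables:

```python
import os
import boto3

# Hetzner Object Storage speaks the S3 protocol; only the endpoint differs from AWS.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["HETZNER_S3_ENDPOINT"],  # e.g. https://fsn1.your-objectstorage.com
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

bucket = "gpu-inference-models"
prefix = "llama-3.1-8b-instruct/"
dest = "/models/llama-3.1-8b-instruct"

# Mirror every object under the model prefix into the local model directory.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip "directory" placeholder objects
            continue
        local_path = os.path.join(dest, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)
```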
Cost comparison for 8B model (~16GB):
| Provider | Storage | Egress (10x downloads/mo) | Total |
|---|---|---|---|
| Hetzner S3 | €0.08/mo | €1.60 | €1.68/mo |
| AWS S3 eu-central-1 | $0.18/mo | $9.00 | $9.18/mo |
| Cloudflare R2 | $0.15/mo | $0.00 | $0.15/mo |
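For transparency, the Hetzner row is just the rates quoted above applied to a 16GB model; a quick back-of-the-envelope check (it charges egress at the full rate, ignoring the free allowance mentioned earlier):

```python
model_size_gb = 16          # llama-3.1-8b weights, roughly
downloads_per_month = 10    # one download per instance bootstrap

storage_eur = model_size_gb * 0.0049                      # ~€0.08/mo
egress_eur = model_size_gb * downloads_per_month * 0.01   # €1.60 (ignores the 1TB free allowance)
print(f"Hetzner S3: €{storage_eur:.2f} + €{egress_eur:.2f} = €{storage_eur + egress_eur:.2f}/mo")
# -> Hetzner S3: €0.08 + €1.60 = €1.68/mo
```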
Why not Cloudflare R2 (cheapest)? For a POC, staying in the Hetzner ecosystem is worth €1.53/mo to avoid managing another provider's credentials and networking.
Tradeoff we accepted: Vendor concentration. We're all-in on Hetzner for now (K8s + S3). That's fine for POC, we can migrate later if needed.
Decision 4: Full Automation from Day 1
Original Plan: Manual POC → Automation Later
The original roadmap was:
Phase 1: Manually provision vast.ai (web UI), SSH in, run commands
Phase 2: Build automation
What We're Actually Doing: Automate Everything Immediately
# Instance manager - from POC day 1
async def provision_and_bootstrap(model_name: str):
    """Fully automated GPU provisioning"""
    # 1. Find best instance
    instance = await find_best_vastai_instance(
        gpu_type="rtx4090",
        max_price=0.50
    )
    # 2. Create instance
    created = await vastai_api.create_instance(instance.id)
    # 3. Wait for SSH (with timeout)
    await wait_for_ssh(created.ip, timeout=300)
    # 4. Run bootstrap script via SSH
    await ssh_execute(created.ip, f"""
        # Download model from Hetzner S3
        aws s3 cp s3://models/{model_name} /models/{model_name} \
            --endpoint-url={S3_ENDPOINT} --recursive
        # Start vLLM
        docker run -d --gpus all -p 8000:8000 \
            vllm/vllm-openai:latest \
            --model /models/{model_name}
        # Start monitoring agents
        prometheus-node-exporter --web.listen-address=:9100 &
        nvidia-gpu-exporter --web.listen-address=:9101 &
    """)
    # 5. Health check vLLM endpoint
    await wait_for_healthy(f"https://{created.ip}:8000/health")
    # 6. Save to state store
    await state_store.save({
        "id": created.id,
        "endpoint": f"https://{created.ip}:8000",
        "cost_per_hour": instance.price,
        "created_at": datetime.utcnow()
    })
    return created
Why we changed our minds:
- Manual testing is slow - Each test requires 10+ manual steps × 10-20 tests = hours wasted
- Reproducibility - Automated bootstrapping ensures every instance is configured identically
- Metrics from day 1 - If we're automating anyway, we get monitoring for free
- POC ≠ Throwaway code - This orchestrator is the production foundation, just simplified
The key realization: Building the automation takes about a day upfront and saves 30+ minutes per test. Over the 10-20 test runs we're planning, it pays for itself.
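The wait_for_ssh and wait_for_healthy helpers referenced in the code above are just polling loops. A minimal sketch of what we mean (the timeout behaviour and the aiohttp usage are our assumptions, not a settled implementation):

```python
import asyncio
import aiohttp

async def wait_for_ssh(host: str, port: int = 22, timeout: int = 300) -> None:
    """Poll until the instance accepts TCP connections on the SSH port."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        try:
            _, writer = await asyncio.wait_for(asyncio.open_connection(host, port), timeout=5)
            writer.close()
            return
        except (OSError, asyncio.TimeoutError):
            if asyncio.get_running_loop().time() > deadline:
                raise TimeoutError(f"SSH on {host}:{port} not reachable after {timeout}s")
            await asyncio.sleep(5)

async def wait_for_healthy(url: str, timeout: int = 600) -> None:
    """Poll vLLM's /health endpoint until it returns 200 (self-signed certs tolerated)."""
    deadline = asyncio.get_running_loop().time() + timeout
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(url, ssl=False) as resp:
                    if resp.status == 200:
                        return
            except aiohttp.ClientError:
                pass
            if asyncio.get_running_loop().time() > deadline:
                raise TimeoutError(f"{url} not healthy after {timeout}s")
            await asyncio.sleep(5)
```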
What we're NOT automating (yet):
- Auto-scaling based on metrics (requires queue)
- Multi-region failover (single region for POC)
- Cost optimization algorithms (use first available instance)
- A/B testing across GPU types
Decision 5: Prometheus Remote Write (Not Just "Monitoring")
Original Plan: "Add Monitoring Later"
The first architecture doc had:
- ✅ Deploy vLLM
- ✅ Test inference
- 🔜 Add monitoring
What We Designed: Monitoring as a First-Class Component
# Prometheus config on GPU instance
remote_write:
  - url: "http://prometheus.gpu-system.svc.cluster.local:9090/api/v1/write"
    basic_auth:
      username: "remote-write"
      password_file: /etc/prometheus/password
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000'] # vLLM metrics endpoint
    scrape_interval: 15s
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100'] # node_exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['localhost:9101'] # nvidia_gpu_exporter
Grafana dashboard queries:
# GPU utilization by instance
avg by (instance_id) (nvidia_gpu_utilization_percent)
# Cost per hour (custom metric)
sum(vast_instance_cost_per_hour) by (instance_id)
# Inference latency p95
histogram_quantile(0.95,
rate(vllm_request_duration_seconds_bucket[5m])
)
# Idle time (for auto-teardown logic)
time() - vast_instance_last_request_timestamp_seconds
Why we prioritized this:
- Cost tracking - Without metrics, we're flying blind on actual costs
- Performance validation - POC means measuring if vLLM is actually fast enough
- Idle detection - Auto-shutdown requires knowing when instance is idle
- Debugging - When things break (they will), we need visibility
The insight: Monitoring isn't a "later" feature, it's how we know if the POC succeeded.
Our POC success criteria are literally metric-based:
- ✅ Provisioning time <3 minutes (measured by vast_instance_provisioning_duration_seconds)
- ✅ Inference p95 latency <2s (measured by vllm_request_duration_seconds)
- ✅ Cost <$0.50/hour (measured by vast_instance_cost_per_hour)
Without metrics, we can't validate the POC. So monitoring isn't optional.
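One practical detail: vast_instance_cost_per_hour and vast_instance_last_request_timestamp_seconds in the queries above aren't emitted by any off-the-shelf exporter; the orchestrator has to publish them itself. A minimal sketch with the prometheus_client library (the port, label set, and helper names are assumptions):

```python
from prometheus_client import Gauge, start_http_server

# Custom gauges the orchestrator exposes for the PromQL queries above.
cost_per_hour = Gauge(
    "vast_instance_cost_per_hour",
    "Hourly price of an active vast.ai instance in USD",
    ["instance_id"],
)
last_request_ts = Gauge(
    "vast_instance_last_request_timestamp_seconds",
    "Unix timestamp of the last inference request routed to the instance",
    ["instance_id"],
)

def record_provisioned(instance_id: str, price: float) -> None:
    cost_per_hour.labels(instance_id=instance_id).set(price)

def record_request(instance_id: str) -> None:
    last_request_ts.labels(instance_id=instance_id).set_to_current_time()

if __name__ == "__main__":
    start_http_server(9102)  # scraped by the same Prometheus that receives the remote writes
```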
What We Learned: The POC Philosophy Shift
Before Planning: "Let's Validate the Idea Quickly"
Thinking:
- Manually provision vast.ai instance
- SSH in, run vLLM manually
- Test a few requests
- Check if it works
- Estimated time: 2-3 hours
Problem: This validates "can vLLM run on vast.ai?" but not "can we orchestrate vast.ai instances?"
After Planning: "POC = Production Foundation, Simplified"
Thinking:
- Build the orchestrator (trigger API, instance manager)
- Automate provisioning and bootstrapping
- Integrate monitoring from day 1
- Test the system, not just the GPU
- Estimated time: 1-2 weeks
Value: This validates the architecture, not just the technology.
What we're building:
┌─────────────────────────────────────────────┐
│ Production Architecture │
│ ┌─────────────────────────────────────┐ │
│ │ Queue-Based Auto-Scaler │ │ ← Add later
│ │ Multi-Provider Failover │ │ ← Add later
│ │ Advanced Cost Optimization │ │ ← Add later
│ └─────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ POC Foundation (Week 1-5) │ │
│ │ ✓ Trigger API │ │ ← Building now
│ │ ✓ Instance Manager │ │ ← Building now
│ │ ✓ Automated Bootstrapping │ │ ← Building now
│ │ ✓ Prometheus Integration │ │ ← Building now
│ │ ✓ Hetzner S3 Model Storage │ │ ← Building now
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
The shift: POC isn't a throwaway prototype. It's the foundation we'll iteratively add complexity to.
Hard Truths We Accepted
1. Direct Client Access is "Insecure"
Reality: Clients hit vast.ai IPs directly, no centralized gateway.
Why it's fine for POC:
- We're the only users (no multi-tenancy)
- vast.ai provides HTTPS + bearer token auth
- We can add API gateway later when we have >1 user
When it becomes a problem: Production with external users needing rate limiting, audit logs, or IP whitelisting.
2. Vast.ai Instance Failures Will Happen
Reality: Marketplace provider, instance reliability varies.
Why it's fine for POC:
- Manual retry is acceptable (we're testing, not serving production traffic)
- Failures teach us what error handling to build
- Instance selection algorithm will evolve based on real failure data
What we're tracking:
# Metrics to inform production failover logic
vast_instance_provision_failures_total
vast_instance_ssh_timeout_seconds
vast_instance_bootstrap_failure_reason
3. We're Ignoring Multi-Region/Multi-Provider
Reality: Single provider (vast.ai), single region (wherever we get an instance).
Why it's fine for POC:
- Latency doesn't matter for testing
- Multi-provider adds 3x complexity
- We'll migrate to RunPod for production anyway
When we'll address it: RunPod migration (Phase 2), when we need:
- Sub-3 second cold starts (RunPod serverless)
- SLA guarantees (vast.ai has none)
- SOC 2 compliance (production requirement)
4. No Request Batching/Optimization
Reality: One request = one API call to vLLM. No batching, no connection pooling, no smart queuing.
Why it's fine for POC:
- vLLM handles batching internally (continuous batching)
- Low request volume (manual testing)
- Premature optimization is the root of all evil
When we'll optimize: When metrics show it's needed (e.g., request rate >10/sec, queue depth regularly >5).
The 5-Week Implementation Plan (Reality Check)
Week 1: Manual Validation
# Goal: Prove vast.ai + vLLM + S3 works at all
# 1. Create S3 bucket, upload test model
terraform apply -target=hetzner_s3_bucket.models
# 2. Manually provision vast.ai instance (web UI)
# 3. SSH in, download model from S3
# 4. Run vLLM manually
# 5. Test inference with curl
# Success metric: One successful inference request
Expected failures:
- S3 authentication errors (wrong credentials format)
- Model download timeout (large file, slow network)
- vLLM OOM (picked wrong GPU size)
Week 2: Orchestrator Core
# Goal: API-triggered provisioning (no manual clicks)
# Build FastAPI trigger API
@app.post("/provision")
async def provision_gpu(request: ProvisionRequest):
    instance = await create_vastai_instance(request.gpu_type)
    return {"id": instance.id, "status": "provisioning"}
# Deploy to K8s
kubectl apply -f k8s/gpu-orchestrator.yaml
# Test
curl -X POST http://gpu-orchestrator/provision \
-H "X-API-Key: ${API_KEY}" \
-d '{"model": "llama-3.1-8b", "gpu_type": "rtx4090"}'
Expected failures:
- vast.ai API rate limiting
- Instance not found (all RTX 4090s taken)
- Timeout waiting for SSH (instance provisioning slow)
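The curl call above sends an X-API-Key header; the check behind it is a few lines of FastAPI dependency code. A sketch under the assumption that the key is mounted from a K8s Secret into an environment variable:

```python
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

def require_api_key(provided: str = Security(api_key_header)) -> None:
    expected = os.environ["ORCHESTRATOR_API_KEY"]  # mounted from a K8s Secret (name is an assumption)
    if not secrets.compare_digest(provided, expected):
        raise HTTPException(status_code=403, detail="Invalid API key")

@app.post("/provision", dependencies=[Depends(require_api_key)])
async def provision_gpu(request: dict):  # ProvisionRequest in the real orchestrator
    ...
```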
Week 3: Bootstrap Automation
# Goal: Instance auto-configures itself (no manual SSH)
# Bootstrap script (runs on instance startup)
#!/bin/bash
set -euo pipefail
# Download model
aws s3 cp s3://models/llama-3.1-8b /models/llama-3.1-8b \
--endpoint-url=$S3_ENDPOINT --recursive
# Start vLLM
docker run -d --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/llama-3.1-8b
# Health check
until curl -f http://localhost:8000/health; do sleep 5; done
Expected failures:
- S3 credentials not passed correctly to instance
- Docker not installed on base image
- vLLM fails to load model (wrong format, corrupted download)
Week 4: Monitoring Integration
# Goal: Metrics flow to Hetzner Prometheus
# Deploy Prometheus with remote write enabled
# Configure GPU instances to scrape and push
# Build Grafana dashboard
# Queries we'll actually run
- GPU utilization over time
- Cost per inference request
- Instance idle time (for auto-shutdown logic)
Expected failures:
- Remote write authentication issues
- Firewall blocking Prometheus traffic
- Missing GPU metrics (nvidia_gpu_exporter not installed)
Week 5: Idle Timeout & Cost Tracking
# Goal: Instance auto-terminates when idle
async def monitor_idle_instances():
    while True:
        instances = await state_store.get_all()
        for instance in instances:
            idle_time = time.time() - instance.last_request
            if idle_time > IDLE_TIMEOUT_SECONDS:
                await vastai_api.destroy_instance(instance.id)
                await state_store.delete(instance.id)
        await asyncio.sleep(60)
Expected failures:
- Race condition (terminate instance during active request) - see the mitigation sketch below
- Cost calculation errors (timezone issues)
- Metrics show idle but instance still processing
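For the race condition in particular, one cheap mitigation is to re-read the idle timestamp and apply a short grace period right before destroying the instance, so a request that slipped in between checks keeps it alive. A sketch that reuses the names from the monitor loop above (the grace period, and a single-instance state_store.get(), are assumptions):

```python
import time

TERMINATION_GRACE_SECONDS = 120  # assumption: long enough for an in-flight request to finish

async def safe_terminate(instance) -> bool:
    """Destroy an instance only if it is still idle after a second look."""
    # First check, same as the monitor loop above.
    if time.time() - instance.last_request < IDLE_TIMEOUT_SECONDS:
        return False

    # Re-read state right before acting: a request may have arrived since the first check.
    fresh = await state_store.get(instance.id)  # assumes the state store can fetch one instance
    if time.time() - fresh.last_request < IDLE_TIMEOUT_SECONDS + TERMINATION_GRACE_SECONDS:
        return False

    await vastai_api.destroy_instance(instance.id)
    await state_store.delete(instance.id)
    return True
```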
Comparing Costs: Theory vs Reality
Original Cost Estimate (from first blog post)
| Scenario | Monthly Cost (RTX 4090 @ $0.34/hr) |
|---|---|
| 24/7 | $248.20 |
| On-demand (20h/mo) | $6.80 |
Assumptions:
- Perfect utilization (no idle time)
- Instant provisioning/teardown (no waste)
- No failed instances (no retry costs)
POC Reality Budget
# What we'll actually measure
# 1. Development/testing costs
provision_testing = 20 # 20 test runs
avg_instance_lifetime = 0.5 # 30min per test
cost_per_test = 0.35 * avg_instance_lifetime # $0.175 at $0.35/hr
total_dev_cost = provision_testing * cost_per_test # $3.50
# 2. Idle waste (instance ready but not processing)
model_load_time = 60 # seconds
avg_requests_per_session = 5
request_duration = 10 # seconds
session_duration = model_load_time + (avg_requests_per_session * request_duration) # 110 seconds
idle_percentage = model_load_time / session_duration # ~55% idle!
# 3. Failed provisioning attempts
failure_rate = 0.1 # 10% of provisions fail (SSH timeout, etc.)
retry_cost = failure_rate * cost_per_test # $0.0175 per attempt
# Actual expected cost
total_poc_cost = (
    total_dev_cost +                      # $3.50
    (idle_percentage * total_dev_cost) +  # $1.91 (idle waste)
    (provision_testing * retry_cost)      # $0.35 (retries)
)  # ~$5.76
print(f"POC budget: ${total_poc_cost:.2f}")
# vs original estimate: ~$0.50 (manual POC)
The insight: Our automated POC will cost roughly 10x more than the original "manual validation" plan (~$5.76 vs. ~$0.50), but it gives us 100x more value (a reusable orchestration system vs. a one-off test).
What Didn't Change (The Core Principles)
Despite all the simplifications, these stayed constant:
1. vLLM for Inference ✅
Why we stuck with it:
- Best throughput for LLM serving (continuous batching, PagedAttention)
- OpenAI-compatible API (easy migration from OpenAI to self-hosted; see the sketch after this list)
- Active development, supports latest models
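Because vLLM speaks the OpenAI API, pointing an existing client at our self-hosted endpoint is mostly a matter of changing base_url. A sketch using the official openai Python package (endpoint, token, and model name are placeholders matching the examples in this post):

```python
from openai import OpenAI

# Point the standard OpenAI client at the vLLM instance instead of api.openai.com.
client = OpenAI(
    base_url="https://123.45.67.89:8000/v1",  # the direct endpoint returned by /provision
    api_key="VASTAI_TOKEN",                   # placeholder: bearer token for the instance
)

completion = client.completions.create(
    model="llama-3.1-8b",   # must match the model vLLM was started with
    prompt="Explain continuous batching in one sentence.",
    max_tokens=64,
)
print(completion.choices[0].text)
```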
Alternatives we considered:
- Ollama (easier setup, lower throughput)
- TGI/Text Generation Inference (HuggingFace, good but less performant)
- Custom FastAPI + transformers (too much work)
Decision: vLLM is the right choice for production, so POC should use it too.
2. Cost Optimization as Primary Goal ✅
Original thesis: On-demand GPU saves 90%+ vs always-on.
Still true: Our POC validates this by measuring:
- Actual cost per inference request
- Idle time percentage
- Provisioning overhead cost
Even with POC inefficiencies (manual testing, frequent teardown/setup), we expect to validate the cost model.
3. Security Consciousness ✅
Original: Start with SOC 2/GDPR-compliant providers (RunPod, DataCrunch).
POC compromise: Use vast.ai (marketplace) but:
- No sensitive data in POC (only test prompts)
- Document security limitations
- Plan migration to RunPod for production
What we didn't compromise on: transport security and credential handling. No unencrypted connections, no credentials stored insecurely; API keys live in K8s Secrets, and S3 credentials are never logged.
4. Observability from Day 1 ✅
Original: Prometheus + Grafana integration.
POC: Same, but with specific implementation (remote write, custom metrics).
Why this matters: You can't optimize what you don't measure. Cost tracking, performance metrics, idle detection—all require metrics.
The Meta-Lesson: POC Design is About Tradeoffs
What Makes a Good POC?
Bad POC:
- "Quick and dirty" manual tests
- No reusable artifacts
- Validates technology, not architecture
- Throwaway code
Good POC:
- Automated foundation for production
- Validates architectural decisions
- Includes monitoring/observability
- Iteratively adds complexity
Our POC:
- ✅ Reusable orchestrator (FastAPI + instance manager)
- ✅ Automated bootstrapping (SSH scripts, S3 integration)
- ✅ Monitoring integration (Prometheus remote write)
- ✅ Cost tracking (custom metrics)
- ❌ No queue-based auto-scaling (add in Week 6+)
- ❌ No multi-provider failover (add in production migration)
- ❌ No advanced cost optimization (add after measuring real usage)
The Tradeoff Matrix
| Feature | POC Value | Implementation Cost | Decision |
|---|---|---|---|
| Trigger API | High (core orchestration) | Low (1 day) | ✅ Include |
| Instance manager | High (automation) | Medium (2-3 days) | ✅ Include |
| S3 model storage | High (reproducibility) | Low (1 day) | ✅ Include |
| Prometheus metrics | High (validation criteria) | Medium (2 days) | ✅ Include |
| Queue-based scaling | Low (manual testing) | High (5+ days) | ❌ Defer |
| SSH tunnel proxy | Low (direct access works) | Medium (3 days) | ❌ Defer |
| Multi-provider | Low (vast.ai sufficient) | Very high (10+ days) | ❌ Defer |
The principle: Include features that are either:
- Required for validation (metrics, automation)
- Cheap to build now, expensive later (S3 integration, bootstrapping)
Defer features that are:
- Not needed for POC (multi-tenancy, SLA guarantees)
- Easy to add later (queue, advanced scaling)
What We'll Write About Next
Upcoming Blog Posts
Week 2-3: "First GPU Provisioning"
- What worked immediately
- What failed spectacularly
- Actual vast.ai API gotchas
- Bootstrap script debugging war stories
Week 4-5: "POC Results: Metrics Don't Lie"
- Real cost per inference request
- Actual provisioning time distribution (p50, p95, p99)
- GPU utilization patterns
- Idle time waste analysis
Week 6+: "From POC to Production: RunPod Migration"
- Why we're switching providers
- Architecture changes for serverless endpoints
- Cost comparison (vast.ai vs RunPod actual)
- What we kept from the POC
The Code (What We're Actually Building)
All architecture docs and implementation code are open source:
Repository structure:
hetzner-cloud-minimal-kubernetes-cluster/
├── gpu-poc-draft.md                  # Full architecture spec
├── terraform/
│   └── kubernetes/
│       └── gpu-orchestrator/
│           ├── main.tf               # K8s deployment
│           ├── orchestrator/         # FastAPI app
│           │   ├── api.py            # Trigger endpoints
│           │   ├── manager.py        # Instance lifecycle
│           │   └── vastai.py         # API client wrapper
│           └── monitoring/
│               └── dashboards/       # Grafana JSON
└── newsletter-blog/
    └── content/blog/
        ├── ai-orchestrator-architecture-gpu-poc/   # Original vision
        └── gpu-poc-reality-check/                  # This post
Follow along:
- GitHub: fractiunate-ai/hetzner-cloud-minimal-kubernetes-cluster
- Progress updates: Weekly blog posts
- Implementation PRs: Tagged with gpu-poc
Conclusion: Architecture is a Journey
The difference between the original architecture diagram and our actual POC isn't a failure—it's learning.
What we learned:
- Start simple, add complexity later - Queue-based scaling can wait
- Automate what matters - Manual tests waste time, automation is reusable
- Measure everything - Metrics tell us what to optimize
- POC ≠ Throwaway - Build the foundation, not a prototype
- Tradeoffs are explicit - Every deferred feature is documented with "when we'll add it"
The original vision is still valid - we just found a better path to get there.
Instead of:
Plan everything → Build everything → Test everything
We're doing:
Plan foundation → Build foundation → Test →
Measure → Learn → Add complexity → Repeat
Week 1 starts Monday. Time to turn architecture diagrams into running code.
Wish us luck. We'll document every success and failure along the way.
Resources:
- GPU POC Architecture Spec - Full technical design
- Original Architecture Blog - Where we started
- Project Repository - Follow the code
Next post: "Week 1 Results: First GPU Provisioning" - What happens when theory meets vast.ai's reality.
The build begins.