vLLM Template

Image: vllm/vllm-openai:latest | Min VRAM: 24 GB | Port: 8000

High-throughput LLM inference serving with OpenAI-compatible API.

What’s Included

  • vLLM inference engine
  • OpenAI-compatible API server
  • PagedAttention for efficient KV-cache memory management
  • Continuous batching

Launch

curl -X POST https://api.pulserun.dev/v1/instances \
  -H "Authorization: Bearer pr_live_xxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{"gpu": "a100_80gb", "template": "vllm"}'
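The same launch call can be made from Python. This is a minimal stdlib sketch: the endpoint, header, and body fields mirror the curl above, while the `build_launch_request` helper name and the placeholder key are ours, not part of the pulserun API.

```python
import json
import urllib.request

def build_launch_request(api_key: str, gpu: str = "a100_80gb",
                         template: str = "vllm") -> urllib.request.Request:
    """Build the POST /v1/instances request mirroring the curl example."""
    return urllib.request.Request(
        "https://api.pulserun.dev/v1/instances",
        data=json.dumps({"gpu": gpu, "template": template}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",  # replace with your real key
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_launch_request("pr_live_xxxxxxxxxxxxx")
# urllib.request.urlopen(req) would send it; not executed here.
```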

Usage

curl http://<instance-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 100}'
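Because the server speaks the OpenAI completions format, the same request is easy to issue from Python. A stdlib sketch of the curl above; `<instance-ip>` remains a placeholder you must substitute, and the helper name is ours:

```python
import json

def completion_body(prompt: str, max_tokens: int = 100) -> bytes:
    """JSON body for POST /v1/completions, matching the curl example."""
    return json.dumps({
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()

payload = completion_body("Hello")
# Send with: urllib.request.Request("http://<instance-ip>:8000/v1/completions",
#     data=payload, headers={"Content-Type": "application/json"})
```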

Recommended GPUs

  • A100 80GB — 70B-parameter models
  • H100 — maximum throughput
  • RTX 4090 — 7B–13B models