vLLM Template
Image: vllm/vllm-openai:latest
Min VRAM: 24 GB | Port: 8000
High-throughput LLM inference serving with an OpenAI-compatible API.
What’s Included
- vLLM inference engine
- OpenAI-compatible API server
- PagedAttention for efficient KV-cache memory management
- Continuous batching
Launch
curl -X POST https://api.pulserun.dev/v1/instances \
-H "Authorization: Bearer pr_live_xxxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{"gpu": "a100_80gb", "template": "vllm"}'
Usage
curl http://<instance-ip>:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 100}'
Recommended GPUs
- A100 80GB — 70B parameter models
- H100 — Maximum throughput
- RTX 4090 — 7B-13B models
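The Usage call above can also be made from Python. Below is a minimal stdlib-only sketch against the instance's OpenAI-compatible /v1/completions endpoint; the INSTANCE_IP placeholder and the helper names (build_completion_request, complete) are illustrative, not part of the template.

```python
# Sketch: call the vLLM OpenAI-compatible completions endpoint with the
# Python standard library only. Replace INSTANCE_IP with your instance's IP.
import json
import urllib.request

BASE_URL = "http://INSTANCE_IP:8000"  # placeholder, not a real host


def build_completion_request(model: str, prompt: str,
                             max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request with the same JSON payload as the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, max_tokens: int = 100) -> str:
    """Send the request and return the first completion's text."""
    req = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style completions responses carry the text in choices[0].text.
    return body["choices"][0]["text"]
```

For example, complete("meta-llama/Llama-2-7b-hf", "Hello") mirrors the curl request shown under Usage. Because the server speaks the OpenAI API, the official openai client library can also be pointed at http://INSTANCE_IP:8000/v1 instead.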