vLLM Template

Image: vllm/vllm-openai:latest | Min VRAM: 24 GB | Port: 8000

High-throughput LLM inference serving with OpenAI-compatible API.

What’s Included

  • vLLM inference engine
  • OpenAI-compatible API server
  • PagedAttention for efficient KV-cache memory management
  • Continuous batching

Launch

curl -X POST https://api.pulserun.dev/v1/instances \
  -H "Authorization: Bearer pr_live_xxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{"gpu": "a100_80gb", "template": "vllm"}'
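The same launch call can be made from Python. This is a minimal stdlib sketch: the endpoint, header, and body fields mirror the curl above, while the `build_launch_request` helper name and the placeholder key are ours, not part of the pulserun API.

```python
import json
import urllib.request

def build_launch_request(api_key: str, gpu: str = "a100_80gb",
                         template: str = "vllm") -> urllib.request.Request:
    """Build the POST /v1/instances request mirroring the curl example."""
    return urllib.request.Request(
        "https://api.pulserun.dev/v1/instances",
        data=json.dumps({"gpu": gpu, "template": template}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",  # replace with your real key
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_launch_request("pr_live_xxxxxxxxxxxxx")
# urllib.request.urlopen(req) would send it; not executed here.
```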

Usage

curl http://<instance-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 100}'
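Because the server speaks the OpenAI completions format, the same request is easy to issue from Python. A stdlib sketch of the curl above; `<instance-ip>` remains a placeholder you must substitute, and the helper name is ours:

```python
import json

def completion_body(prompt: str, max_tokens: int = 100) -> bytes:
    """JSON body for POST /v1/completions, matching the curl example."""
    return json.dumps({
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()

payload = completion_body("Hello")
# Send with: urllib.request.Request("http://<instance-ip>:8000/v1/completions",
#     data=payload, headers={"Content-Type": "application/json"})
```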

Recommended GPUs

  • A100 80GB — 70B-parameter models
  • H100 — maximum throughput
  • RTX 4090 — 7B–13B models