Run a vLLM Server on HF Jobs in One Command
You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.

You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second. Once it's up, you can query it from your laptop, a notebook, or anywhere else.
It's the quickest way to stand up a model for tests, evals, or batch generation. (If you're after a managed, production-ready service instead, that's what Inference Endpoints are for — more on when to pick which at the end.)
hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor , and expose vLLM's port with --expose :
--expose 8000 routes the container's port through HF's public jobs proxy (see the Serve Models guide for the full reference). The command prints the URL your server is reachable at:
6a381ca1953ed90bfb947332 is your job ID. Keep track of it, we'll need it. We'll use as a placeholder for it in the rest of the post.
Give it a couple of minutes to download weights and boot. When the logs show Application startup complete , you're live.
vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl:
which returns the usual OpenAI-style JSON, with choices[0].message.content holding "Hello! How can I assist you today? 😊" .
Or, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:
Quick health check before you start: curl https://--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)" should list the model.
🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job's namespace . A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That's fine for private use, but treat the URL accordingly: don't share it expecting it to be open, and don't paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead. Or see HF Jobs or Inference Endpoints? below.
Jobs are billed per second, so stop the server when you're done:
The --timeout you set is a safety net (it'll auto-stop), but cancelling explicitly is cheaper. An a10g-large runs at $1.50/hour — check hf jobs hardware for the full price list and pick the smallest flavor that fits your model.
Source: Hugging Face