
Run a vLLM Server on HF Jobs in One Command
You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.
AINOVAT
Source
60 articles from this source

You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.

HuggingFace Transformers has become the foundation of the open-source AI ecosystem, and the recent Transformers v5 release strengthened it with first-class support for Mixture-of-Experts (MoE) models, now the dominant a…

🚀 First open far-field ASR benchmark: community-driven evaluation across 14 simulated rooms, validated against real-world measurements: https://huggingface.co/spaces/treble-technologies/ffasr 📉 The gap is real and it is…

TL;DR — Building an agent is mostly plumbing: tools, state, guardrails, scaling from one agent to many.

(This is a guest post by Developer Relations Engineer Thomas Steiner from the Chrome team at Google.) Transformers.js provides Web developers with a simple way to use the power of transformers in their Web apps through…

huggingface_hub is the Python client at the base of the Hugging Face ecosystem.

Evaluate PP-OCRv6 online, then integrate lightweight, production-ready OCR with PaddlePaddle, Transformers, or ONNX Runtime backend.

*Free as in beer, excluding the cost of electricity, and assuming you already own the hardware June 2026 will go down as the moment that people realized closed models can be taken away.

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information.

Benchmarking transformers revisions across different metrics This is a human-made, agent-focused blogpost.

If you want to fine-tune an open model on your own data, you are probably interested in so-called parameter-efficient fine-tuning, in short PEFT .

🧠 Models: https://huggingface.co/collections/allenai/molmomotion | 📄 Tech Report: https://allenai.org/papers/molmomotion | 📊 Data: https://huggingface.co/datasets/allenai/molmo-motion-1m | 💻 Code: https://github.com/all…

A walkthrough of the LeRobot integration in Strands Robots - one agent loop, from a Hub dataset to a physical robot, with sim-to-real datasets in the same on-disk format and policies you swap with a string.

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks.

If you build with agents today, you probably know three protocols.

💻 Code: https://github.com/allenai/olmo-eval While you're building an LLM, you evaluate it over and over across many interventions.

In the first part of this series "Profiling in PyTorch" , we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces.

Over half of the world's population speaks more than one language.

Today, we are releasing North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters with powerful agentic coding capabilities, available on Hugging Face under the Apache 2.0 license.

An agent built a 3D Paris gallery from two Hugging Face Spaces.

If you have a GitHub repository and you have GitHub Actions enabled, you probably use GitHub-hosted runners for CI.

For the Hugging Face Build Small Hackathon , I wanted to build something practical, local, and useful beyond a demo.

OpenEnv, a tool for creating agentic execution environments, gains backing from major AI players, including Meta-PyTorch, Hugging Face, and Nvidia.

The last two years have seen NVIDIA's content safety stack grow from a focused English text classifier into a family of specialized models—each extending coverage to new modalities, languages, and inference modes.

NVIDIA releases Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales in real-time, with punctuation and capitalization built-in.

Voice agent failures are often highly domain-specific.

A new approach to large-scale LLM development uses task-seeded synthetic Q&A generation to improve model performance on difficult reasoning and knowledge tasks.

The Hugging Face Hub's official command-line entrypoint, hf, has been rebuilt to optimize its use for both human users and coding agents.

A new study reveals that Direct Preference Optimization (DPO) can significantly reduce text degeneration in specialized structured OCR models, achieving an average reduction of 59.4% across five model families.

The Reachy Mini conversation app can now use tools hosted in public Hugging Face Spaces, called over MCP.

The Holo3.1 family of computer-use models improves robustness across environments, agent frameworks, and deployment targets, with a focus on local inference and seamless integration.

JetBrains unveils Mellum2, an open Mixture-of-Experts model optimized for low-latency text-and-code workloads.

The integration of agent logic with Large Language Models (LLMs) is crucial for scalable enterprise AI adoption, enabling more efficient and cost-effective AI execution.

NVIDIA has released Cosmos 3, an open omni-model for physical AI reasoning and action, now available on Hugging Face.

Mastering PyTorch profiling is crucial for optimizing models; this guide helps beginners understand and utilize torch.profiler.

The first benchmark for agentic enterprise IT tasks, ITBench-AA, reveals that leading AI models score below 50% on Site Reliability Engineering tasks.

After building your Reachy Mini, you'll install the conversation app and start talking to it.

TL;DR , because you have models to train and we respect that: Async RL just got a lot cheaper.

When a field evolves quickly, its vocabulary often evolves faster than its shared understanding.

Large language models (LLMs) have become the default interface for code generation, math problem solving, summarization, document understanding, and many other developer workflows.

A recent study reveals that specialization, not scale, can be a more decisive factor in AI model performance, challenging the common assumption that larger models are always superior.

Allen Institute for AI releases OlmoEarth v1.1, a family of models that cuts compute costs by up to 3x while maintaining performance on remote sensing tasks.

Six new state-of-the-art Sentence Transformers CrossEncoder rerankers are released, built on top of the Ettin ModernBERT encoders.

NVIDIA's large-scale world model, Cosmos Predict 2.5, can generate physically plausible videos conditioned on text, images, or video clips, and can be fine-tuned with LoRA or DoRA for specific domains like robot manipulation.

PaddleOCR 3.5 brings OCR and document parsing tasks closer to the Hugging Face ecosystem.

A new benchmark for comparing full AI agent systems, not just the models inside them, has been launched to evaluate their quality and cost across diverse tasks.

IBM releases two new Apache 2.0 multilingual embedding models, Granite Embedding Multilingual R2, with 97M and 311M parameters, offering improved retrieval quality and 32K context support.

Separating CPU and GPU workloads can lead to a massive performance boost for inference by eliminating idle gaps and maximizing GPU utilization.

AWS provides a robust infrastructure for foundation model training and inference, leveraging open-source software, high-performance computing, and scalable storage.

Researchers release EMO, a new mixture-of-experts model pretrained end-to-end to allow modular structure to emerge directly from data, enabling selective expert use without sacrificing performance.

A complete walkthrough of LoRA fine-tuning Qwen3-1.7B on MedMCQA using AMD MI300X, built for the AMD Developer Hackathon on lablab.ai.

PipelineRL's vLLM inference engine upgrade from V0 to V1 required fixing backend behavior to match training dynamics.
The cost of evaluating AI models has skyrocketed, making it a new bottleneck in the field, with some evaluations costing tens of thousands of dollars.
A detailed look at the data engineering, pre-training, supervised fine-tuning, and reinforcement learning behind the Granite 4.1 LLMs.
DeepInfra is now a supported Inference Provider on the Hugging Face Hub, expanding serverless inference capabilities.
NVIDIA unveils Nemotron 3 Nano Omni, a cutting-edge AI model capable of processing and understanding long-context multimodal data, including documents, audio, and video.
OpenAI's open-source Privacy Filter enables developers to build scalable web apps that detect personally identifiable information (PII) in text.
DeepSeek releases V4 with a 1M-token context window, competitive benchmark numbers, and innovative architecture for efficient large context length support.

A step-by-step guide on integrating Transformers.js into a Chrome extension, leveraging Gemma 4 E2B for enhanced web navigation.

QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.