PixelRAG outperforms text parsers in accuracy and cuts AI agent token costs
PixelRAG skips text parsing, rendering web pages as screenshots to improve retrieval accuracy and cut AI agent token costs by 10x.

Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals — and according to new research, it's responsible for the majority of wrong answers. A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper this week introducing PixelRAG, a system that skips that conversion entirely.
Instead of parsing pages into text, PixelRAG renders them as screenshots, indexes those images and feeds retrieved tiles directly to a vision-language model reader. Tested across 30 million screenshot tiles covering all of Wikipedia, it outperforms text-based RAG across six benchmarks, improving accuracy by up to 18.1% over text-based baselines. Parsers are the wrong place to look for fixes, according to the research team.
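To make the idea of "screenshot tiles" concrete, here is a minimal sketch of how a tall page screenshot could be cut into overlapping tiles for indexing. The article does not describe PixelRAG's actual tiling scheme, so the function name `tile_boxes`, the 1,024-pixel tile height and the 128-pixel overlap are all illustrative assumptions.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, the box convention used by common imaging libraries.
Box = Tuple[int, int, int, int]


def tile_boxes(page_width: int, page_height: int,
               tile_height: int = 1024, overlap: int = 128) -> List[Box]:
    """Split a full-page screenshot into vertically stacked, overlapping tiles.

    Hypothetical helper: the tile size and overlap are assumptions,
    not values taken from the PixelRAG paper.
    """
    if tile_height <= 0 or not 0 <= overlap < tile_height:
        raise ValueError("need tile_height > 0 and 0 <= overlap < tile_height")

    boxes: List[Box] = []
    step = tile_height - overlap  # how far each new tile moves down the page
    top = 0
    while top < page_height:
        bottom = min(top + tile_height, page_height)
        boxes.append((0, top, page_width, bottom))
        if bottom == page_height:  # the last tile reached the end of the page
            break
        top += step
    return boxes


if __name__ == "__main__":
    for box in tile_boxes(1280, 2500):
        print(box)
```

Each box could then be passed to an image library's crop call (for example Pillow's `Image.crop`) to produce the tile that gets embedded and indexed. The overlap is there so that a table row or sentence falling on a tile boundary still appears whole in at least one tile.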
"Improving parsers is an endless process because every website requires special handling," Yichuan Wang, lead author and UC Berkeley doctoral student, told VentureBeat. "Our goal was to explore whether recent advances in VLMs make it possible to bypass that entire problem and build a retrieval system that works across websites without site-specific engineering."

HTML parsers destroy the retrieval signals that enterprise RAG depends on
The researchers' goal was to develop a clean end-to-end architecture. "Modern web RAG pipelines often involve rendering, parsing, cleaning, chunking, and many other handcrafted stages," Wang said. "Every stage introduces potential cascade errors and abstractions that move us further away from the original webpage. We were interested in whether we could eliminate most of that complexity and operate directly on the rendered page."
Wang also noted that parsing inevitably loses information. Images, visual hierarchy, typography, emphasis (e.g., bold text), tables, and layout are either discarded or converted into imperfect textual approximations.
"No matter how good a parser becomes, some information is fundamentally lost during the conversion," he said.
The research identifies three ways text-based RAG loses the answer before it reaches the reader. All three were measured on SimpleQA, a standard benchmark of 1,000 factual Wikipedia questions:
Parser loss (36.6% of failures). HTML-to-text conversion destroys structured content so completely that no text chunk in the corpus contains the answer.
Rank loss (55.2% of failures). The answer exists in the corpus but gets outranked by keyword-dense infoboxes that land at rank 1 for 75.9% of queries, pushing answer-bearing paragraphs to rank 20 or lower.
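The distinction between the two failure modes described above can be sketched in a few lines of code. This is not the researchers' measurement method: the function `classify_retrieval_failure`, the case-insensitive substring match used to decide whether a chunk "contains the answer," and the `top_k` cutoff are assumptions made for illustration.

```python
from typing import List


def classify_retrieval_failure(ranked_chunks: List[str], answer: str,
                               top_k: int = 5) -> str:
    """Label one query's outcome, given every corpus chunk in ranked order.

    Hypothetical helper. Assumptions: a chunk "contains the answer" if the
    answer string appears in it (case-insensitive), and the reader only sees
    the first `top_k` chunks.
    """
    needle = answer.lower()
    # 1-based ranks of every chunk that contains the answer string.
    answer_ranks = [rank for rank, chunk in enumerate(ranked_chunks, start=1)
                    if needle in chunk.lower()]

    if not answer_ranks:
        # No text chunk holds the answer at all: it was lost during parsing.
        return "parser_loss"
    if answer_ranks[0] > top_k:
        # The answer survived parsing but was outranked by other chunks.
        return "rank_loss"
    return "retrieved"


if __name__ == "__main__":
    chunks = ["Infobox: Born 1901, Died 1976"] * 20 + [
        "She received the award in 1932."]
    print(classify_retrieval_failure(chunks, "in 1932"))
```

In the usage example, twenty keyword-dense chunks push the answer-bearing paragraph to rank 21, so the query is labeled as a rank loss rather than a parser loss.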
Source: VentureBeat