OpenAI Releases LifeSciBench, a 750-Task Benchmark for Evaluating AI in Life-Science Research
OpenAI's LifeSciBench evaluates AI models on real life-science research tasks with expert-written rubrics.

Most biology benchmarks ask narrow, fact-based questions with clean answers. Scientists, however, weigh imperfect evidence and make decisions. OpenAI released LifeSciBench to address this gap.
LifeSciBench contains 750 expert-authored tasks spanning seven workflows and seven biological domains. Each task pairs a prompt with a grading rubric and, where needed, supporting artifacts. The workflows include evidence handling and analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.
The seven domains range from genomics and medicinal chemistry to clinical and translational science. Tasks are written as a scientist would brief a colleague and are free-response, not multiple-choice. Around 79% require multiple reasoning or decision-making steps, averaging four steps each.
A cohort of 173 expert scientists, each holding a Ph.D. and having biotechnology or pharmaceutical experience, wrote the tasks. Accepted tasks went through an average of six automated review cycles and at least two expert reviews.
The benchmark contains 1,062 attached artifacts in total, and about 53% of tasks require at least one. Artifact types include sequences, figures, tables, PDFs, and chemical structures. A separate cohort of 453 reviewers, 97% of whom hold doctorates, validated task quality. Overall agreement among these reviewers exceeded 96% on relevance, reasoning, grounding, and usefulness.
Rubrics are the core mechanic: the benchmark contains 19,020 criteria, or roughly 25 per task. Each criterion rewards one concrete property, such as a specific fact, a reasoning step, or a numeric answer within tolerance.
Grading runs against the rubric, not a single reference string. Two metrics summarize performance: normalized rubric score and task pass rate. The pass threshold is strict by design.
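As a rough illustration of how rubric-based grading of this kind can be computed, here is a minimal Python sketch. It is not OpenAI's grader: the `Criterion` fields, the point weights, the 5% relative tolerance, and the `PASS_THRESHOLD` value of 0.8 are all assumptions made for the example, since the article does not state the actual weighting scheme or pass threshold.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line. Field names and weights are hypothetical."""
    description: str
    points: float  # assumed weight of this criterion
    met: bool      # whether the grader judged the response to satisfy it


def within_tolerance(answer: float, target: float, rel_tol: float = 0.05) -> bool:
    """Check a numeric answer against a target; the 5% tolerance is assumed."""
    return abs(answer - target) <= rel_tol * abs(target)


def normalized_rubric_score(criteria: list[Criterion]) -> float:
    """Points earned divided by points available, giving a value in [0, 1]."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


# Hypothetical threshold: the article only says the real one is strict.
PASS_THRESHOLD = 0.8


def task_pass_rate(task_scores: list[float], threshold: float = PASS_THRESHOLD) -> float:
    """Fraction of tasks whose normalized score reaches the threshold."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(score >= threshold for score in task_scores) / len(task_scores)


# Toy rubric for one task (invented data, not from the benchmark).
rubric = [
    Criterion("names the correct target gene", points=2, met=True),
    Criterion("explains the confounder in the assay", points=1, met=False),
    Criterion("reports IC50 within tolerance", points=1,
              met=within_tolerance(answer=10.3, target=10.0)),
]
score = normalized_rubric_score(rubric)
pass_rate = task_pass_rate([score, 0.9, 0.4, 0.8])
```

With a strict threshold, a model can collect most of the rubric points on a task and still fail it, which is one way a respectable normalized score can sit alongside a modest pass rate.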
OpenAI evaluated five models in a single-turn setting. GPT-Rosalind, OpenAI's domain-specialized model, led overall: it had the highest per-task mean on 386 of 750 tasks and reached an overall pass rate of 36.1%, compared with 25.7% for GPT-5.5. Pass rates stayed modest across every model.
Rankings are not the whole story, as Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can hide task-specific strengths. Models were strongest on structured judgment, with GPT-Rosalind reaching a 0.712 mean score on Translation.
Scientific Communication scored 0.718, but that category is small, so it should be read cautiously. Two workflows stayed hard: "Design, Optimization, and Prediction" and "Analysis." Artifact use was a clear bottleneck, with GPT-Rosalind's pass rate dropping from 45.1% on text-only tasks to 28.1% on artifact tasks.
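To make the gap between aggregate rankings and per-task leads concrete, here is a small Python sketch that tallies how often each model is the sole top scorer on a task. The model names and scores are invented toy data, and treating "uniquely led" as "strictly highest per-task mean with no ties" is an assumption about how such counts are defined.

```python
from statistics import mean


def unique_leader_counts(scores: dict[str, list[float]]) -> dict[str, int]:
    """Count tasks where one model alone has the highest score.

    `scores` maps a model name to its per-task mean scores, with every
    list in the same task order. Tied tasks are credited to no one.
    """
    models = list(scores)
    n_tasks = len(scores[models[0]])
    if any(len(scores[m]) != n_tasks for m in models):
        raise ValueError("all models need scores for the same tasks")

    counts = {m: 0 for m in models}
    for t in range(n_tasks):
        best = max(scores[m][t] for m in models)
        leaders = [m for m in models if scores[m][t] == best]
        if len(leaders) == 1:
            counts[leaders[0]] += 1
    return counts


# Invented toy data: model_a wins big on two tasks, model_b wins
# narrowly on three.
toy_scores = {
    "model_a": [0.95, 0.90, 0.40, 0.40, 0.40],
    "model_b": [0.30, 0.30, 0.50, 0.50, 0.50],
}
leads = unique_leader_counts(toy_scores)
aggregate = {m: mean(s) for m, s in toy_scores.items()}
```

In this toy case `model_a` has the higher aggregate mean while `model_b` leads on more tasks, the same kind of divergence that makes a single leaderboard number an incomplete guide to which model suits a given task.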
Source: MarkTechPost