BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark for Deep-Research Agents

Zijian Chen1*, Xueguang Ma1*, Shengyao Zhuang2,5*, Ping Nie3, Kai Zou3
Andrew Liu1, Joshua Green1, Kshama Patel1, Ruoxi Meng1, Mingyi Su1
Sahel Sharifymoghaddam1, Yanxi Li1, Haoran Hong1, Xinyu Shi1, Xuye Liu1
Nandan Thakur1, Crystina Zhang1, Luyu Gao4, Wenhu Chen1, Jimmy Lin1
1University of Waterloo, 2CSIRO, 3Independent, 4Carnegie Mellon University, 5The University of Queensland
*Equal Contribution.  Correspondence: x93ma@uwaterloo.ca

Abstract

Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success on complex queries that require iterative search planning and reasoning over search results. However, evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs, which have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; and (2) transparency: the lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, current evaluations may compare a complete deep research system at a given point in time, but they do not support well-controlled experiments that provide insights into the capability of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further improves its accuracy to 70.1% with fewer search calls. This benchmark enables comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.

Teaser

Accuracy vs. number of search calls for Deep-Research agents with different retrievers.
The figure shows that Deep-Research agents mostly improve final accuracy at the cost of more search calls, whereas better retrieval systems not only improve overall accuracy but also reduce the number of search calls. That is, better retrievers lead to both efficiency and effectiveness. For reference, GPT-5 achieves 59.9% accuracy when evaluated with the Google Search API.

Dataset Construction

BrowseComp-Plus contains 830 queries sourced from BrowseComp, each of which could take a human more than 2 hours to answer using a search engine. We carefully construct a corpus of ~100K web documents for these queries, designed to meet three criteria:

  1. Comprehensive Coverage: The corpus provides complete evidence to support the entire reasoning chain required to answer each question.
  2. Clear Differentiation of Effectiveness: The corpus contains sufficiently distracting hard negatives to maintain difficulty, so that it can distinguish the effectiveness of strong Deep-Research agents.
  3. Practical Size: At roughly 100K documents, the corpus is large enough to yield reliable research insights while remaining computationally manageable.

For each query, we collect the evidence documents in a two-stage process: (1) OpenAI's o3 retrieves candidate evidence documents from the web using the ground-truth question–answer pairs; (2) Human annotators verify the candidates and add missing documents to ensure the corpus contains all evidence needed to fully answer each query.

Positive collection pipeline: o3 searches for initial candidate evidence documents, which human annotators then verify and augment.

In addition to evidence documents, annotators also label the documents that semantically contain the final answer, designated as gold documents. These labels are later used to perform retriever-only evaluation.

For example, a query might ask for the number of publications by an author, with the ground-truth answer being "7". A gold document could be the author's personal webpage listing their publications; while it may not contain the string "7" explicitly, it semantically contains the answer.

For the negative collection, we take each query from BrowseComp and prompt GPT-4o to decompose it into simpler, self-contained sub-queries. For each sub-query, we use a Google Search API provider to search the web and scrape the results as hard negatives.
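
For concreteness, here is a minimal sketch of this mining step, assuming the `openai` Python SDK for the GPT-4o call and a hypothetical `search_web` helper standing in for the Google Search API provider; the prompt wording and data fields are illustrative, not the exact pipeline used to build the benchmark.

```python
# Illustrative sketch of the hard-negative mining step (not the exact pipeline code).
from openai import OpenAI

client = OpenAI()

def search_web(query: str, k: int = 10) -> list[dict]:
    """Placeholder for the Google Search API provider; plug in a real search client here."""
    raise NotImplementedError

def decompose_query(query: str) -> list[str]:
    """Prompt GPT-4o to split a BrowseComp query into simpler, self-contained sub-queries."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Decompose the question into short, self-contained web search queries, one per line."},
            {"role": "user", "content": query},
        ],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

def mine_hard_negatives(query: str) -> list[dict]:
    """Search the web for each sub-query and keep the scraped pages as hard negatives."""
    negatives = []
    for sub_query in decompose_query(query):
        for hit in search_web(sub_query):
            negatives.append({"url": hit["url"], "text": hit["text"], "source_query": sub_query})
    return negatives
```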

Negative collection pipeline: GPT-4o decomposes each query into sub-queries, which are issued to Google Search to collect hard negatives.

Results

End-to-end agent results on BrowseComp-Plus across LLMs and retrievers, reporting Accuracy, Recall, Search Calls, and Calibration Error.

We evaluate popular Deep-Research agents paired with different retrievers on the following metrics:

  1. Accuracy: The percentage of queries answered correctly, judged by gpt-4.1.
  2. Recall: Fraction of evidence documents retrieved in at least one search call by the agent, relative to all evidence documents.
  3. Search Calls: Number of search calls issued by the agent.
  4. Calibration Error: Following BrowseComp, we prompt agents to estimate the confidence of their answers; calibration error measures the gap between a model's predicted confidence and actual accuracy.
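
As a rough illustration of how the Recall and Calibration Error metrics above can be computed, here is a sketch assuming per-query sets of retrieved and labeled evidence document IDs, and a standard binned (ECE-style) formulation of calibration error; the official evaluation scripts may differ in detail.

```python
def recall(retrieved_ids: set[str], evidence_ids: set[str]) -> float:
    """Fraction of labeled evidence documents retrieved in at least one search call."""
    if not evidence_ids:
        return 0.0
    return len(retrieved_ids & evidence_ids) / len(evidence_ids)

def calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Binned gap between self-reported confidence and actual accuracy (ECE-style sketch)."""
    total, error = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        accuracy = sum(correct[i] for i in bucket) / len(bucket)
        error += len(bucket) / total * abs(avg_conf - accuracy)
    return error
```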

Effect of Retrievers

Stronger retrievers (e.g., Qwen3-Embedding-8B) consistently improve end-to-end accuracy of Deep-Research agents. They also reduce the number of search calls, likely because higher-quality initial retrievals reduce the need for follow-up searches; further, fewer search calls translate to fewer output tokens. That is, better retrievers deliver both efficiency and effectiveness gains.

Search Calls vs. Accuracy

In general, more search calls correlate with higher accuracy. Closed-source agents tend to make substantially more search calls than open-source ones; for instance, OpenAI's GPT-5 and o3 average more than 20 search calls per query, while Qwen3-32B and Search-R1-32B make fewer than two, despite being explicitly prompted to use the search tool. This gap in the ability to interleave extensive search with reasoning likely contributes to the gap in end-to-end accuracy between closed- and open-source agents.
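
To make the interleaving of reasoning and search concrete, below is a minimal sketch of a tool-calling agent loop over the fixed corpus. The `search_corpus` tool, its schema, the model name, and the turn limit are illustrative assumptions, not the benchmark's official harness.

```python
import json
from openai import OpenAI

client = OpenAI()

def search_corpus(query: str, k: int = 5) -> list[dict]:
    """Hypothetical local retriever over the fixed corpus (BM25, Qwen3-Embedding, etc.)."""
    raise NotImplementedError

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_corpus",
        "description": "Search the fixed BrowseComp-Plus corpus and return the top-k documents.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "k": {"type": "integer"}},
            "required": ["query"],
        },
    },
}]

def run_agent(question: str, model: str = "gpt-5", max_turns: int = 50) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        message = response.choices[0].message
        messages.append(message)
        if not message.tool_calls:          # no further searches requested: final answer
            return message.content or ""
        for call in message.tool_calls:     # execute each requested search and return results
            args = json.loads(call.function.arguments)
            docs = search_corpus(args["query"], k=args.get("k", 5))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(docs)})
    return ""  # turn budget exhausted without a final answer
```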

Impact of Reasoning Effort

We analyze how the reasoning effort of LLMs influences answer quality and retrieval behavior. To isolate this effect, we focus on OpenAI's gpt-oss family, which offers three reasoning modes: low, medium, and high. These modes differ in how much computation and deliberation the model applies before producing an answer, with higher modes generally involving longer intermediate reasoning. Across all model sizes and retrievers, increasing the reasoning effort consistently improves accuracy.
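
As a sketch of how such a sweep might be run, the snippet below assumes an OpenAI-compatible endpoint serving a gpt-oss model that accepts the `reasoning_effort` parameter; the endpoint URL and model name are placeholders, and some serving stacks set the effort through the system prompt instead.

```python
from openai import OpenAI

# Placeholder endpoint and model name for a locally served gpt-oss model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "..."  # a BrowseComp-Plus query
for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        reasoning_effort=effort,  # assumed to be supported by the serving stack
        messages=[{"role": "user", "content": question}],
    )
    print(effort, response.choices[0].message.content)
```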

Retriever-only Evaluation

Retriever-only effectiveness, reporting Recall@5, Recall@100, Recall@1000, and nDCG@10.

We also evaluate the effectiveness of different retrievers alone, measuring each retriever's recall@k and nDCG@k scores against the labeled evidence and gold documents.
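
For reference, a sketch of these two metrics with binary relevance against the labeled evidence or gold documents is given below; the official evaluation may rely on standard IR tooling instead.

```python
import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of labeled relevant documents appearing in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """nDCG@k with binary relevance: DCG over the top-k list, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0
```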

  • Compared to BM25, Qwen3-Embedding-8B and ReasonIR-8B achieve substantially higher recall and nDCG for both evidence document retrieval and gold document retrieval.
  • We observe a model-size scaling law within the Qwen3 embedding family: larger models consistently perform better, with Qwen3-Embedding-8B surpassing ReasonIR-8B at the 8B scale.
  • However, even the best retriever, Qwen3-Embedding-8B, achieves only 20.3 nDCG@10, leaving substantial headroom for improvement.

Citation

Coming soon...