AgentIR: Reasoning-Aware Retrieval
for Deep Research Agents

Zijian Chen1, Xueguang Ma1, Shengyao Zhuang2, Jimmy Lin1, Akari Asai3, Victor Zhong1
1University of Waterloo, 2University of Queensland, 3Carnegie Mellon University
Correspondence: s42chen@uwaterloo.ca

Abstract

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25.

Performance plot on BrowseComp-Plus

End-to-end Deep Research accuracy on BrowseComp-Plus for three agents paired with three retrievers. AgentIR-4B substantially outperforms conventional retrievers, in both effectiveness (improved accuracy) and efficiency (decreased search calls).

Reasoning-Aware Retrieval

Existing approaches treat a Deep Research agent's query identically to a standalone human search: the retriever operates on the query alone and ignores the agent's explicit reasoning traces. To address this, we propose Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the reasoning trace alongside the query, learning a retriever that leverages the rich intent and contextual information expressed in agent reasoning.
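Concretely, joint embedding amounts to formatting the reasoning trace and the query into a single retriever input before encoding. A minimal sketch, where the exact template is our illustrative assumption (the paper's formatting may differ):

```python
def format_reasoning_query(reasoning: str, query: str) -> str:
    """Join the agent's reasoning trace with its search query into one
    retriever input. The "Reasoning:/Query:" template is illustrative,
    not necessarily the exact format used by AgentIR-4B."""
    return f"Reasoning: {reasoning.strip()}\nQuery: {query.strip()}"

# The ambiguous query from the figure, disambiguated by its reasoning trace.
text = format_reasoning_query(
    "We need a composer who made euphoric music in a backroom studio "
    "in the early 2010s.",
    "backroom studio early 2010s euphoric",
)
```

The combined string is then embedded by the retriever in place of the bare query, so the encoder sees the agent's intent alongside its keywords.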

Teaser for Reasoning-Aware Retrieval

Reasoning-Aware Retrieval (AgentIR-4B) vs. conventional retrieval (Qwen3-Embedding-4B) for a task from BrowseComp-Plus.

The reasoning trace enhances retrieval in three ways:

  • Task Intent: The query "backroom studio early 2010s euphoric" is inherently ambiguous. However, the reasoning trace clarifies its intent: "finding a composer who composed euphoric music in the early 2010s in a backroom studio". Without this reasoning, a conventional retriever misinterprets the query as a search for video game studios. Analogous to human-written instructions in task-aware retrieval, the reasoning trace here functions as an implicit agent-written instruction.
  • Reflection on Prior Results: Unique to Deep Research is its multi-turn nature, which requires incorporating prior results. In this example, the overall task is to find an artist who won a specific award X and composed a specific song Y. As reflected in the reasoning trace, previous searches have already identified award X as "Grammy," drastically narrowing the search space for the target artist.
  • Hypothetical Search Targets: Beyond incorporating known information from past results, the agent uses its parametric knowledge to infer likely targets for future searches. The agent hypothesizes that a country "that's been an EU member since 1995" is likely "Sweden/Finland/Austria", and that the music subgenre with a "euphoric finale" is likely "progressive house." Both hypotheses are correct, further narrowing the search space.

Importantly, unlike task-aware retrieval, which requires explicit human instructions, or prior query-expansion methods, which require an additional costly LLM call solely to expand the query, the Deep Research agent generates its reasoning trace entirely "for free".

DR-Synth: Constructing Training Data for Deep Research Queries

The task of reasoning-aware retrieval differs substantially from the objectives used to pre-train existing retrievers. To perform reasoning-aware retrieval effectively, we therefore train the retriever on reasoning-query pairs through contrastive learning.
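The contrastive objective pushes the embedding of each reasoning-query input toward its positive document and away from hard negatives. A plain-float InfoNCE sketch under our own assumptions (a real trainer computes this over batched embedding similarities; the paper's exact loss and temperature are not specified here):

```python
import math

def info_nce_loss(sim_pos: float, sim_negs: list[float], tau: float = 0.05) -> float:
    """InfoNCE loss for one (reasoning+query, document) pair: the positive
    similarity is rewarded relative to hard-negative similarities.
    tau is an assumed temperature, not a value from the paper."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max to stabilise the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log softmax probability of the positive
```

The loss shrinks as the positive document's similarity grows past the negatives', which is exactly the gradient signal that teaches the encoder to exploit the reasoning trace.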

However, this poses a new challenge: existing datasets contain no training data for the agent sub-queries that arise during Deep Research. Traditional training datasets pair global questions Q with their corresponding relevance judgments. In Deep Research, by contrast, the agent observes the global Q and iteratively issues sub-queries; the retriever must handle these sub-queries, not the global Q.

To bridge this gap, we introduce DR-Synth, a pipeline that synthesizes sub-query level training data from traditional QA datasets.

Generating sub-queries

We run a Deep Research agent on each question Q. As the agent searches, it produces a sequence of reasoning traces and search queries. Each search step therefore yields a training example consisting of the reasoning trace and the issued query.
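Extracting these examples amounts to walking the agent's trajectory and keeping one (reasoning, query) pair per search step. A sketch over a hypothetical minimal step schema (real agent logs carry more fields, such as tool results):

```python
def extract_training_pairs(trajectory: list[dict]) -> list[tuple[str, str]]:
    """Collect one (reasoning trace, issued query) training pair per
    search step in an agent trajectory. The dict schema is a
    hypothetical simplification of a real agent log."""
    pairs = []
    for step in trajectory:
        if step.get("action") == "search":
            pairs.append((step["reasoning"], step["query"]))
    return pairs

# A two-step toy trajectory: one search step, one final-answer step.
trajectory = [
    {"action": "search",
     "reasoning": "Award X may be a Grammy; check composers.",
     "query": "Grammy award winning composer euphoric"},
    {"action": "answer", "reasoning": "Confident in the answer."},
]
pairs = extract_training_pairs(trajectory)  # only the search step yields a pair
```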

Generating Supervision

For every sub-query, we gather candidate documents using a standard retriever, followed by an oracle reranking procedure. The top-ranked document becomes the positive, and the bottom-ranked documents become hard negatives.
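The selection step after reranking can be sketched as follows. Candidate retrieval and the oracle scorer itself are external; this only shows how positives and hard negatives are picked from the scored candidates (the number of negatives kept is our assumption):

```python
def build_supervision(docs: list[str], scores: list[float], n_neg: int = 3):
    """Given oracle-reranker scores for candidate documents, take the
    top-scored document as the positive and the lowest-scored ones as
    hard negatives. Because all candidates came from a real retriever,
    even the bottom-ranked ones are topically close, hence "hard"."""
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    positive = ranked[0][0]
    negatives = [doc for doc, _ in ranked[-n_neg:]]
    return positive, negatives

# Four candidates with oracle scores; keep two hard negatives.
pos, negs = build_supervision(["d1", "d2", "d3", "d4"],
                              [0.1, 0.9, 0.5, 0.2], n_neg=2)
```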

Oracle Reranking Procedure

Visualization of the oracle reranking procedure used to generate sub-query level supervision.

Training a reasoning-aware retriever on DR-Synth generated data yields AgentIR-4B.

Results

We evaluate retrievers in end-to-end Deep Research on BrowseComp-Plus, paired with three open-weight agents: Tongyi-DeepResearch (Tongyi-DR), gpt-oss-120B, and GLM-4.7.

AgentIR-4B substantially outperforms all prior baselines. With Tongyi-DR, it achieves 66.27% accuracy, a 17.60-point absolute improvement over the Qwen3-Embed-4B backbone. For perspective, this gain is comparable to the improvement from BM25 to Qwen3-Embed-4B (14.69 points). AgentIR-4B also outperforms Qwen3-Embed-8B, a model twice its size, by about 15 points of absolute accuracy, and surpasses prior reasoning-intensive retrievers such as ReasonIR-8B, as well as query expansion approaches like Reason-Rewriter + Reason-Embed-8B.

Beyond accuracy, AgentIR-4B achieves notable efficiency gains: the number of search calls decreases from 32.92 with BM25 to 25.91. Moreover, AgentIR-4B outperforms Qwen3-Embed-4B + LLM Rerank, a highly computationally expensive reranking method, by approximately 10 points of absolute accuracy, despite performing no reranking.

Importantly, AgentIR-4B’s improvements on BrowseComp-Plus are zero-shot, and the gains generalize consistently across all three agents.

Results table on BrowseComp-Plus (columns: LLM, Retriever, Accuracy, Recall, Search Calls)

End-to-end agent accuracy on BrowseComp-Plus across LLMs and retrievers

Additional details and ablations can be found in the paper.

Citation

If you find this work useful, please cite:

@article{chen2026AgentIR,
      title={AgentIR: Reasoning-Aware Retrieval for Deep Research Agents}, 
      author={Zijian Chen and Xueguang Ma and Shengyao Zhuang and Jimmy Lin and Akari Asai and Victor Zhong},
      year={2026},
      journal={arXiv preprint arXiv:2603.04384}
}