Learning to Retrieve from Agent Trajectories

Yuqi Zhou*1 Sunhao Dai*†1 Changle Qu1 Liang Pang2 Jun Xu✉1 Ji-Rong Wen1

1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Institute of Computing Technology, Chinese Academy of Sciences

* Equal contribution. † Project Leader. ✉ Corresponding author.

yuqizhou@ruc.edu.cn, sunhaodai@ruc.edu.cn, junxu@ruc.edu.cn

Accepted at SIGIR 2026

Abstract

Information retrieval systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM)-powered search agents, retrieval is increasingly consumed by agents rather than humans, embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions are fundamentally mismatched with the way agents issue queries and consume results.

We introduce LRAT, learning to retrieve from agent trajectories, a new training paradigm in which retriever supervision is derived from multi-step agent interactions rather than human interaction logs. In our instantiation, Tongyi-DeepResearch-30B is deployed on 10K InfoSeekQA queries with four retrievers, yielding 26,482 agent trajectories and 91,713 training pairs. Experiments on in-domain and out-of-domain deep research benchmarks show consistent gains in evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and retriever backbones.

Overview figure illustrating the shift from human-centric retrieval training to agent-trajectory supervision.
LRAT addresses the mismatch between retrieval systems trained for human search and the retrieval needs of multi-step search agents.

Main Results

LRAT consistently improves task success, evidence recall, and step efficiency across both specialized search agents and generalist agentic foundation models.

SR = task success rate; Recall = evidence recall; Avg. Steps = average number of agent steps (lower is better). InfoSeek-Eval is the in-domain (ID) benchmark; BrowseComp-Plus is out-of-domain (OOD).

| Agent Backbone | Retriever | SR (ID) | Avg. Steps (ID) | SR (OOD) | Recall (OOD) | Avg. Steps (OOD) |
|---|---|---|---|---|---|---|
| I. Task-Optimized Search Agents |  |  |  |  |  |  |
| AgentCPM-Explore (4B) | Qwen3-Emb | 40.3 | 38.0 | 13.5 | 23.2 | 40.7 |
|  | + LRAT (Ours) | 55.7 | 34.4 | 15.8 | 32.0 | 40.4 |
|  | E5-Large | 47.3 | 38.9 | 15.9 | 26.5 | 40.7 |
|  | + LRAT (Ours) | 49.7 | 35.5 | 15.9 | 32.1 | 40.1 |
| WebExplore (8B) | Qwen3-Emb | 52.0 | 24.1 | 21.0 | 47.7 | 40.7 |
|  | + LRAT (Ours) | 68.7 | 19.0 | 27.2 | 55.9 | 38.7 |
|  | E5-Large | 60.0 | 23.8 | 25.4 | 50.4 | 40.1 |
|  | + LRAT (Ours) | 63.3 | 20.2 | 29.0 | 56.1 | 39.1 |
| Tongyi-DeepResearch (30B) | Qwen3-Emb | 52.7 | 26.7 | 17.8 | 49.2 | 42.9 |
|  | + LRAT (Ours) | 68.0 | 20.7 | 23.7 | 60.7 | 41.0 |
|  | E5-Large | 56.7 | 25.1 | 20.7 | 54.8 | 42.4 |
|  | + LRAT (Ours) | 68.0 | 21.5 | 23.9 | 61.8 | 41.4 |
| II. Generalist Agentic Foundation Models |  |  |  |  |  |  |
| GPT-OSS (120B) | Qwen3-Emb | 40.0 | 34.9 | 9.0 | 43.7 | 45.4 |
|  | + LRAT (Ours) | 47.0 | 30.5 | 12.1 | 56.4 | 45.2 |
|  | E5-Large | 41.7 | 33.9 | 10.8 | 50.1 | 44.8 |
|  | + LRAT (Ours) | 50.7 | 29.7 | 13.1 | 56.0 | 44.6 |
| MiniMax-M2.1 (229B) | Qwen3-Emb | 58.7 | 21.4 | 38.2 | 57.2 | 30.8 |
|  | + LRAT (Ours) | 78.3 | 14.7 | 48.3 | 69.2 | 28.3 |
|  | E5-Large | 64.0 | 18.9 | 46.4 | 64.9 | 29.1 |
|  | + LRAT (Ours) | 75.0 | 14.8 | 48.7 | 69.7 | 28.9 |
| GLM-4.7 (358B) | Qwen3-Emb | 67.7 | 27.5 | 43.9 | 66.6 | 45.5 |
|  | + LRAT (Ours) | 82.0 | 18.5 | 54.6 | 77.8 | 44.6 |
|  | E5-Large | 73.7 | 24.2 | 46.4 | 68.7 | 44.6 |
|  | + LRAT (Ours) | 81.7 | 19.5 | 50.6 | 76.3 | 44.8 |

Method Overview

Method figure for LRAT training from agent trajectories.
LRAT mines supervision from search-browse transitions and post-browse reasoning traces, then trains a dense retriever with reasoning-aware weighted contrastive learning.

Naive relevance mining

Browsed documents serve as coarse positives, while unbrowsed retrieved documents provide clean trajectory-aware negatives.
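As a concrete illustration, the sketch below mines training pairs from a single trajectory. It assumes each step is a dict with the illustrative fields action, query, retrieved, and browsed; these names are ours, not the released dataset's schema.

```python
# Naive relevance mining over one agent trajectory (minimal sketch).
# Field names ("action", "query", "retrieved", "browsed") are illustrative.

def mine_pairs(trajectory):
    """Yield (query, positives, negatives) triples from one trajectory."""
    pairs = []
    for step in trajectory:
        if step["action"] != "search":
            continue
        retrieved = step["retrieved"]       # doc ids returned for this query
        browsed = set(step["browsed"])      # doc ids the agent later opened
        positives = [d for d in retrieved if d in browsed]
        # Retrieved-but-never-opened documents act as trajectory-aware negatives.
        negatives = [d for d in retrieved if d not in browsed]
        if positives and negatives:
            pairs.append((step["query"], positives, negatives))
    return pairs
```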

Reasoning-aware filtering

Post-browse reasoning traces are used to remove browsed-but-not-useful documents from the positive set.
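One way to realize this filter is sketched below. The usefulness test (lexical overlap between a document and the reasoning the agent emits right after browsing it) is a stand-in heuristic of ours, not the paper's exact criterion.

```python
# Reasoning-aware filtering (minimal sketch with an assumed overlap heuristic).

def is_useful(doc_text, post_browse_reasoning, min_overlap=5):
    """Keep a browsed document only if the subsequent reasoning draws on it."""
    doc_tokens = set(doc_text.lower().split())
    reasoning_tokens = set(post_browse_reasoning.lower().split())
    return len(doc_tokens & reasoning_tokens) >= min_overlap

def filter_positives(positives, reasoning_by_doc):
    """positives: list of {"id": ..., "text": ...} dicts;
    reasoning_by_doc: doc id -> reasoning text emitted after browsing it."""
    return [d for d in positives
            if is_useful(d["text"], reasoning_by_doc.get(d["id"], ""))]
```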

Relevance intensity estimation

Reasoning length is converted into a soft utility weight that captures how strongly a document contributes to progress.
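A minimal sketch of this conversion, assuming a log-scaled normalization of post-browse reasoning token counts; the exact length-to-weight transform is our assumption, the paper only ties longer reasoning to stronger contribution.

```python
import math

def intensity_weight(reasoning_len, max_len):
    """Map post-browse reasoning length to a soft weight in (0, 1]."""
    return math.log1p(reasoning_len) / math.log1p(max(max_len, 1))

lengths = [12, 87, 240]   # reasoning tokens emitted after each browse
weights = [intensity_weight(n, max(lengths)) for n in lengths]
# -> approximately [0.47, 0.82, 1.0]: longer reasoning, stronger positive
```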

Weighted retriever training

Final optimization uses weighted contrastive learning to align the retriever with agent-style search behavior.
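A standard weighted InfoNCE objective, sketched in PyTorch below, shows one way to plug the soft utility weights into contrastive training; the paper's exact formulation (temperature, negative pooling) may differ.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(q_emb, pos_emb, neg_emb, weights, tau=0.05):
    """q_emb: (B, d) queries; pos_emb: (B, d) filtered positives;
    neg_emb: (B, K, d) trajectory-aware negatives;
    weights: (B,) soft utility weights from intensity estimation."""
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    pos_sim = (q * pos).sum(dim=-1, keepdim=True) / tau     # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, neg) / tau      # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1)          # positive at index 0
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example).mean()                   # utility-weighted loss
```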

Trajectory Analysis

Trajectory analysis figure from the LRAT paper.
Browsing is necessary for success, unbrowsed documents provide reliable negatives, and post-browse reasoning length tracks document utility.

Scalability and Robustness

Scaling and top-k robustness figure from the LRAT paper.
LRAT benefits from larger trajectory pools and remains stable under different retrieval budgets.

Released Resources

LRAT-Qwen3-Embedding-0.6B

Trajectory-trained dense retriever based on Qwen3-Embedding-0.6B.

Model Card
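If the released checkpoint follows the usual sentence-transformers layout, it can be loaded as below; the hub id is hypothetical, so check the model card for the exact path.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LRAT/LRAT-Qwen3-Embedding-0.6B")  # hypothetical id
query_emb = model.encode(["agent-issued search query"])
doc_embs = model.encode(["candidate document text", "another document"])
scores = query_emb @ doc_embs.T   # dense dot-product relevance scores
```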

LRAT-multilingual-e5-large

Trajectory-trained dense retriever based on multilingual-e5-large-instruct.

Model Card

LRAT-Train

Training dataset built from deep research agent trajectories for retriever supervision.

Dataset Card
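Assuming the dataset is hosted in the standard Hugging Face format, it could be loaded as follows; the dataset id is hypothetical, and the record fields should be taken from the dataset card.

```python
from datasets import load_dataset

train = load_dataset("LRAT/LRAT-Train", split="train")  # hypothetical id
print(train[0])  # inspect one mined training record (fields per the dataset card)
```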

Data Flywheel

Data flywheel simulation figure from the LRAT paper.
Iterative agent-retriever interaction can form a sustainable data flywheel for continual retriever improvement.

Citation

@inproceedings{zhou2026lrat,
  title={Learning to Retrieve from Agent Trajectories},
  author={Zhou, Yuqi and Dai, Sunhao and Qu, Changle and Pang, Liang and Xu, Jun and Wen, Ji-Rong},
  booktitle={Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2026}
}