Learning to Retrieve from Agent Trajectories

Yuqi Zhou*1 Sunhao Dai*†1 Changle Qu1 Liang Pang2 Jun Xu✉1 Ji-Rong Wen1

1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Institute of Computing Technology, Chinese Academy of Sciences

* Equal contribution. † Project Leader. ✉ Corresponding author.

yuqizhou@ruc.edu.cn, sunhaodai@ruc.edu.cn, junxu@ruc.edu.cn

Accepted at SIGIR 2026

Abstract

Information retrieval systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM)-powered search agents, retrieval is increasingly consumed by agents rather than humans, embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions are fundamentally mismatched with the way agents issue queries and consume results.

We introduce LRAT, learning to retrieve from agent trajectories, a new training paradigm in which retriever supervision is derived from multi-step agent interactions rather than human interaction logs. In our instantiation, Tongyi-DeepResearch-30B is deployed on 10K InfoSeekQA queries with four retrievers, yielding 26,482 agent trajectories and 91,713 training pairs. Experiments on in-domain and out-of-domain deep research benchmarks show consistent gains in evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and retriever backbones.

Overview figure illustrating the shift from human-centric retrieval training to agent-trajectory supervision.
LRAT addresses the mismatch between retrieval systems trained for human search and the retrieval needs of multi-step search agents.

Main Results

LRAT consistently improves task success, evidence recall, and step efficiency across both specialized search agents and generalist agentic foundation models.

SR = task success rate; Recall = evidence recall; Avg. Steps = average number of agent steps (lower is better). InfoSeek-Eval is the in-domain (ID) benchmark; BrowseComp-Plus is out-of-domain (OOD).

| Agent Backbone | Retriever | SR (ID) | Avg. Steps (ID) | SR (OOD) | Recall (OOD) | Avg. Steps (OOD) |
|---|---|---|---|---|---|---|
| I. Task-Optimized Search Agents |  |  |  |  |  |  |
| AgentCPM-Explore (4B) | Qwen3-Emb | 40.3 | 38.0 | 13.5 | 23.2 | 40.7 |
|  | + LRAT (Ours) | 55.7 | 34.4 | 15.8 | 32.0 | 40.4 |
|  | E5-Large | 47.3 | 38.9 | 15.9 | 26.5 | 40.7 |
|  | + LRAT (Ours) | 49.7 | 35.5 | 15.9 | 32.1 | 40.1 |
| WebExplore (8B) | Qwen3-Emb | 52.0 | 24.1 | 21.0 | 47.7 | 40.7 |
|  | + LRAT (Ours) | 68.7 | 19.0 | 27.2 | 55.9 | 38.7 |
|  | E5-Large | 60.0 | 23.8 | 25.4 | 50.4 | 40.1 |
|  | + LRAT (Ours) | 63.3 | 20.2 | 29.0 | 56.1 | 39.1 |
| Tongyi-DeepResearch (30B) | Qwen3-Emb | 52.7 | 26.7 | 17.8 | 49.2 | 42.9 |
|  | + LRAT (Ours) | 68.0 | 20.7 | 23.7 | 60.7 | 41.0 |
|  | E5-Large | 56.7 | 25.1 | 20.7 | 54.8 | 42.4 |
|  | + LRAT (Ours) | 68.0 | 21.5 | 23.9 | 61.8 | 41.4 |
| II. Generalist Agentic Foundation Models |  |  |  |  |  |  |
| GPT-OSS (120B) | Qwen3-Emb | 40.0 | 34.9 | 9.0 | 43.7 | 45.4 |
|  | + LRAT (Ours) | 47.0 | 30.5 | 12.1 | 56.4 | 45.2 |
|  | E5-Large | 41.7 | 33.9 | 10.8 | 50.1 | 44.8 |
|  | + LRAT (Ours) | 50.7 | 29.7 | 13.1 | 56.0 | 44.6 |
| MiniMax-M2.1 (229B) | Qwen3-Emb | 58.7 | 21.4 | 38.2 | 57.2 | 30.8 |
|  | + LRAT (Ours) | 78.3 | 14.7 | 48.3 | 69.2 | 28.3 |
|  | E5-Large | 64.0 | 18.9 | 46.4 | 64.9 | 29.1 |
|  | + LRAT (Ours) | 75.0 | 14.8 | 48.7 | 69.7 | 28.9 |
| GLM-4.7 (358B) | Qwen3-Emb | 67.7 | 27.5 | 43.9 | 66.6 | 45.5 |
|  | + LRAT (Ours) | 82.0 | 18.5 | 54.6 | 77.8 | 44.6 |
|  | E5-Large | 73.7 | 24.2 | 46.4 | 68.7 | 44.6 |
|  | + LRAT (Ours) | 81.7 | 19.5 | 50.6 | 76.3 | 44.8 |

Method Overview

Method figure for LRAT training from agent trajectories.
LRAT mines supervision from search-browse transitions and post-browse reasoning traces, then trains a dense retriever with reasoning-aware weighted contrastive learning.

Naive relevance mining

Browsed documents serve as coarse positives, while unbrowsed retrieved documents provide clean trajectory-aware negatives.
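As a concrete illustration, the sketch below mines training pairs from a single trajectory. It assumes each step is a dict with the illustrative fields action, query, retrieved, and browsed; these names are ours, not the released dataset's schema.

```python
# Naive relevance mining over one agent trajectory (minimal sketch).
# Field names ("action", "query", "retrieved", "browsed") are illustrative.

def mine_pairs(trajectory):
    """Yield (query, positives, negatives) triples from one trajectory."""
    pairs = []
    for step in trajectory:
        if step["action"] != "search":
            continue
        retrieved = step["retrieved"]       # doc ids returned for this query
        browsed = set(step["browsed"])      # doc ids the agent later opened
        positives = [d for d in retrieved if d in browsed]
        # Retrieved-but-never-opened documents act as trajectory-aware negatives.
        negatives = [d for d in retrieved if d not in browsed]
        if positives and negatives:
            pairs.append((step["query"], positives, negatives))
    return pairs
```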

Reasoning-aware filtering

Post-browse reasoning traces are used to remove browsed-but-not-useful documents from the positive set.
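One way to realize this filter is sketched below. The usefulness test (lexical overlap between a document and the reasoning the agent emits right after browsing it) is a stand-in heuristic of ours, not the paper's exact criterion.

```python
# Reasoning-aware filtering (minimal sketch with an assumed overlap heuristic).

def is_useful(doc_text, post_browse_reasoning, min_overlap=5):
    """Keep a browsed document only if the subsequent reasoning draws on it."""
    doc_tokens = set(doc_text.lower().split())
    reasoning_tokens = set(post_browse_reasoning.lower().split())
    return len(doc_tokens & reasoning_tokens) >= min_overlap

def filter_positives(positives, reasoning_by_doc):
    """positives: list of {"id": ..., "text": ...} dicts;
    reasoning_by_doc: doc id -> reasoning text emitted after browsing it."""
    return [d for d in positives
            if is_useful(d["text"], reasoning_by_doc.get(d["id"], ""))]
```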

Relevance intensity estimation

Reasoning length is converted into a soft utility weight that captures how strongly a document contributes to progress.
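A minimal sketch of this conversion, assuming a log-scaled normalization of post-browse reasoning token counts; the exact length-to-weight transform is our assumption, the paper only ties longer reasoning to stronger contribution.

```python
import math

def intensity_weight(reasoning_len, max_len):
    """Map post-browse reasoning length to a soft weight in (0, 1]."""
    return math.log1p(reasoning_len) / math.log1p(max(max_len, 1))

lengths = [12, 87, 240]   # reasoning tokens emitted after each browse
weights = [intensity_weight(n, max(lengths)) for n in lengths]
# -> approximately [0.47, 0.82, 1.0]: longer reasoning, stronger positive
```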

Weighted retriever training

Final optimization uses weighted contrastive learning to align the retriever with agent-style search behavior.
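A standard weighted InfoNCE objective, sketched in PyTorch below, shows one way to plug the soft utility weights into contrastive training; the paper's exact formulation (temperature, negative pooling) may differ.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(q_emb, pos_emb, neg_emb, weights, tau=0.05):
    """q_emb: (B, d) queries; pos_emb: (B, d) filtered positives;
    neg_emb: (B, K, d) trajectory-aware negatives;
    weights: (B,) soft utility weights from intensity estimation."""
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    pos_sim = (q * pos).sum(dim=-1, keepdim=True) / tau     # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, neg) / tau      # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1)          # positive at index 0
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example).mean()                   # utility-weighted loss
```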

Trajectory Analysis

Trajectory analysis figure from the LRAT paper.
Browsing is necessary for success, unbrowsed documents provide reliable negatives, and post-browse reasoning length tracks document utility.

Scalability and Robustness

Scaling and top-k robustness figure from the LRAT paper.
LRAT benefits from larger trajectory pools and remains stable under different retrieval budgets.

Released Resources

LRAT-Qwen3-Embedding-0.6B

Trajectory-trained dense retriever based on Qwen3-Embedding-0.6B.

Model Card
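If the released checkpoint follows the usual sentence-transformers layout, it can be loaded as below; the hub id is hypothetical, so check the model card for the exact path.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LRAT/LRAT-Qwen3-Embedding-0.6B")  # hypothetical id
query_emb = model.encode(["agent-issued search query"])
doc_embs = model.encode(["candidate document text", "another document"])
scores = query_emb @ doc_embs.T   # dense dot-product relevance scores
```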

LRAT-multilingual-e5-large

Trajectory-trained dense retriever based on multilingual-e5-large-instruct.

Model Card

LRAT-Train

Training dataset built from deep research agent trajectories for retriever supervision.

Dataset Card
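Assuming the dataset is hosted in the standard Hugging Face format, it could be loaded as follows; the dataset id is hypothetical, and the record fields should be taken from the dataset card.

```python
from datasets import load_dataset

train = load_dataset("LRAT/LRAT-Train", split="train")  # hypothetical id
print(train[0])  # inspect one mined training record (fields per the dataset card)
```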

Data Flywheel

Data flywheel simulation figure from the LRAT paper.
Iterative agent-retriever interaction can form a sustainable data flywheel for continual retriever improvement.

Citation

@inproceedings{zhou2026lrat,
  title={Learning to Retrieve from Agent Trajectories},
  author={Zhou, Yuqi and Dai, Sunhao and Qu, Changle and Pang, Liang and Xu, Jun and Wen, Ji-Rong},
  booktitle={Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2026}
}