SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning

source code

Abstract

Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario in which agents must sequentially execute multi-task trajectory navigation guided by complex, long-horizon natural language instructions. Current vision-and-language navigation models exhibit significant performance degradation on such instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a novel navigation model built on a hierarchical planning framework. SeqWalker features: (1) a High-Level Planner that dynamically decomposes global instructions into contextually relevant sub-instructions based on the agent's current visual observations, thus reducing cognitive load; and (2) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the effectiveness and superiority of SeqWalker.


Illustration of the proposed Sequential-Horizon Vision-and-Language Navigation (SH-VLN) task and our SeqWalker model. Unlike traditional VLN, SH-VLN requires agents to perform sequential multi-task trajectory navigation, while users tend to provide complex long instructions, posing greater challenges. SeqWalker adopts a Hierarchical Planning strategy: the High-Level Planner selects sub-tasks and instruction phrases based on the agent's observations, and the Low-Level Planner produces actions for robust navigation using the proposed Exploration and Verification strategy.
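The hierarchical loop described above can be sketched as follows. This is a toy illustration only, assuming a callback-style interface; the function and parameter names (`navigate`, `explore`, `verify`, `max_retries`) are our own placeholders, not the paper's actual API:

```python
# Toy sketch of SeqWalker's hierarchical planning loop.
# High-level: attend to one sub-instruction at a time.
# Low-level: propose a trajectory segment (Exploration), then check it
# against the instruction's logic before committing (Verification).
from typing import Callable, List

def navigate(sub_instructions: List[str],
             explore: Callable[[str], str],
             verify: Callable[[str, str], bool],
             max_retries: int = 3) -> List[str]:
    completed = []
    for sub in sub_instructions:        # high-level planner: current sub-task
        for _ in range(max_retries):    # low-level: Exploration-Verification
            segment = explore(sub)      # propose a trajectory segment
            if verify(sub, segment):    # does the segment match the sub-task?
                completed.append(sub)   # commit and move on
                break
        else:
            break                       # abort if a sub-task cannot be verified
    return completed

# Toy usage: exploration "succeeds" when the segment echoes the instruction.
subs = ["exit the bedroom", "turn left down the hall", "stop at the sofa"]
done = navigate(subs,
                explore=lambda s: s.upper(),
                verify=lambda s, seg: seg.lower() == s)
```

A real agent would replace `explore` with a policy conditioned on visual observations and `verify` with a learned trajectory-instruction alignment check; the point here is only the two-level control flow.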


(a): Statistics of the transformed datasets.


(b): Navigation word cloud.

To evaluate the proposed SH-VLN task, we extend the IVLN IR2R-CE dataset and propose a new benchmark. To construct sequential-horizon trajectories, we select and connect episode pairs from the IR2R-CE dataset whose end and start points align. The corresponding instructions are concatenated and refined with an LLM to ensure semantic coherence. Beyond generating sequential-horizon trajectories, we also enrich the instructions, since the originals lack sufficient discriminative granularity to distinguish semantically similar subtasks in multi-trajectory navigation.
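The trajectory-chaining step above can be sketched as a simple spatial join over episode endpoints. This is a minimal illustration assuming a flat list of episode dicts; the field names (`scene`, `start`, `end`) and the alignment threshold are our assumptions, not the dataset's actual schema:

```python
# Hedged sketch: chain IR2R-CE episode pairs whose end point aligns with
# another episode's start point (same scene, within a distance threshold).
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

def chain_episodes(episodes: List[Dict],
                   threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where episode i ends where episode j starts."""
    pairs = []
    for i, first in enumerate(episodes):
        for j, second in enumerate(episodes):
            if i == j or first["scene"] != second["scene"]:
                continue  # only chain episodes within the same scene
            if math.dist(first["end"], second["start"]) <= threshold:
                pairs.append((i, j))
    return pairs

# Toy usage with three episodes; only (0, 1) aligns within the same scene.
eps = [
    {"scene": "s1", "start": (0.0, 0.0, 0.0), "end": (5.0, 0.0, 0.0)},
    {"scene": "s1", "start": (5.1, 0.0, 0.0), "end": (9.0, 2.0, 0.0)},
    {"scene": "s2", "start": (5.0, 0.0, 0.0), "end": (1.0, 1.0, 0.0)},
]
pairs = chain_episodes(eps)
```

The matched pairs would then have their instructions concatenated and rewritten by an LLM, as described above.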


(a): Examples of constructed sequential-horizon long instructions.


(b): Examples of enriched long instructions.


(c): Different segmentation styles.


(a): Test results for SeqWalker compared to SOTA methods on the SH-IR2R-CE dataset. TL is in meters, and OS, nDTW, SR, SPL, CPsubT, and t-nDTW are reported as percentages. Results are presented as mean ± standard deviation.


(b): Test results for SeqWalker compared to SOTA methods on the IR2R-CE dataset. TL and NE are in meters, and OS, nDTW, SR, SPL, and t-nDTW are reported as percentages. Results are presented as mean ± standard deviation.

Navigation Demonstrations

GIF 1
GIF 2
GIF 3