SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning

source code

Abstract

Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario in which agents must sequentially execute multi-task trajectory navigation guided by complex, long-horizon natural language instructions. Current vision-and-language navigation models exhibit significant performance degradation on such instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a novel navigation model built on a hierarchical planning framework. SeqWalker features: (1) a High-Level Planner that dynamically decomposes global instructions into contextually relevant sub-instructions based on the agent's current visual observations, thus reducing cognitive load; and (2) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the effectiveness and superiority of SeqWalker.


Illustration of the proposed Sequential-Horizon Vision-and-Language Navigation (SH-VLN) task and our SeqWalker model. Unlike traditional VLN, SH-VLN requires agents to perform sequential multi-task trajectory navigation, while users tend to provide complex long instructions, posing greater challenges. SeqWalker adopts a Hierarchical Planning strategy: the High-Level Planner selects sub-tasks and instruction phrases based on the agent's observations, and the Low-Level Planner produces actions for robust navigation using the proposed Exploration and Verification strategy.
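The hierarchical loop described above can be sketched as follows. This is a toy illustration only, assuming a callback-style interface; the function and parameter names (`navigate`, `explore`, `verify`, `max_retries`) are our own placeholders, not the paper's actual API:

```python
# Toy sketch of SeqWalker's hierarchical planning loop.
# High-level: attend to one sub-instruction at a time.
# Low-level: propose a trajectory segment (Exploration), then check it
# against the instruction's logic before committing (Verification).
from typing import Callable, List

def navigate(sub_instructions: List[str],
             explore: Callable[[str], str],
             verify: Callable[[str, str], bool],
             max_retries: int = 3) -> List[str]:
    completed = []
    for sub in sub_instructions:        # high-level planner: current sub-task
        for _ in range(max_retries):    # low-level: Exploration-Verification
            segment = explore(sub)      # propose a trajectory segment
            if verify(sub, segment):    # does the segment match the sub-task?
                completed.append(sub)   # commit and move on
                break
        else:
            break                       # abort if a sub-task cannot be verified
    return completed

# Toy usage: exploration "succeeds" when the segment echoes the instruction.
subs = ["exit the bedroom", "turn left down the hall", "stop at the sofa"]
done = navigate(subs,
                explore=lambda s: s.upper(),
                verify=lambda s, seg: seg.lower() == s)
```

A real agent would replace `explore` with a policy conditioned on visual observations and `verify` with a learned trajectory-instruction alignment check; the point here is only the two-level control flow.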


(a): Statistics of the transformed datasets.


(b): Navigation word cloud.

To evaluate the proposed SH-VLN task, we extend the IVLN IR2R-CE dataset and propose a new benchmark. To construct sequential-horizon trajectories, we select and connect episode pairs from the IR2R-CE dataset whose end and start points align. The corresponding instructions are concatenated and refined with an LLM to ensure semantic coherence. Beyond generating sequential-horizon trajectories, we also enrich the instructions, since the originals lack sufficient discriminative granularity to distinguish semantically similar subtasks in multi-trajectory navigation.
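The trajectory-chaining step above can be sketched as a simple spatial join over episode endpoints. This is a minimal illustration assuming a flat list of episode dicts; the field names (`scene`, `start`, `end`) and the alignment threshold are our assumptions, not the dataset's actual schema:

```python
# Hedged sketch: chain IR2R-CE episode pairs whose end point aligns with
# another episode's start point (same scene, within a distance threshold).
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

def chain_episodes(episodes: List[Dict],
                   threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where episode i ends where episode j starts."""
    pairs = []
    for i, first in enumerate(episodes):
        for j, second in enumerate(episodes):
            if i == j or first["scene"] != second["scene"]:
                continue  # only chain episodes within the same scene
            if math.dist(first["end"], second["start"]) <= threshold:
                pairs.append((i, j))
    return pairs

# Toy usage with three episodes; only (0, 1) aligns within the same scene.
eps = [
    {"scene": "s1", "start": (0.0, 0.0, 0.0), "end": (5.0, 0.0, 0.0)},
    {"scene": "s1", "start": (5.1, 0.0, 0.0), "end": (9.0, 2.0, 0.0)},
    {"scene": "s2", "start": (5.0, 0.0, 0.0), "end": (1.0, 1.0, 0.0)},
]
pairs = chain_episodes(eps)
```

The matched pairs would then have their instructions concatenated and rewritten by an LLM, as described above.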


(a): Examples of constructed sequential-horizon long instructions.


(b): Examples of enriched long instructions.


(c): Different segmentation styles.


(a): Test results for SeqWalker compared to SOTA methods on the SH-IR2R-CE dataset. TL is in meters, and OS, nDTW, SR, SPL, CPsubT, and t-nDTW are reported as percentages. Results are presented as mean ± standard deviation.


(b): Test results for SeqWalker compared to SOTA methods on the IR2R-CE dataset. TL and NE are in meters, and OS, nDTW, SR, SPL, and t-nDTW are reported as percentages. Results are presented as mean ± standard deviation.

Navigation Demonstrations

GIF 1
GIF 2
GIF 3