VLTNet:

Zero-shot Object Navigation using Vision-Language Models with Tree of Thoughts Reasoning

CongCong Wen1*, Yisiyuan Huang2*, Yanjia Huang2, Wenyu Han2,
Shuaihang Yuan1, Yu-Shen Liu3 and Yi Fang1

right-image

Object navigation is crucial for robots, but traditional methods require substantial training data and cannot generalize to unknown environments. Zero-shot object naviga- tion (ZSON) aims to address this challenge, enabling robots to interact with unfamiliar objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose VLTNet for L-ZSON, a novel Vision Language model with a Tree-of-thought NETwork. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these mod- ules, tree-of-thought (ToT) reasoning and exploration serve as core components, innovatively using the ToT reasoning frame- work for frontier selection in robot exploration. Compared to traditional Large Language Models (LLMs), ToT reason- ing involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making. Experimental results on benchmarks such as PASTURE and RoboTHOR demonstrate the outstanding performance of our model in LZSON, particularly in scenarios involving complex natural language as goal instructions.




Approach

pipeline-image

VLTNet coordinates four main modules to tackle the L-ZSON tasks:

Vision Language Model Understanding Module

left-image


Based on RGB observations,the VLM understanding Module will utilize pre-trained Visual-Grounding Models to detect objects seen in the current frame. Our VLNet Model applied GLIP to identify each object and draw out corresponding bounding boxes.




right-image

Semantic Mapping Module

The Semantic Mapping module reconstructs a 2D navigation map based on the depth input, and robot pose. Based on the semantic understanding of objects and rooms obtained by the VLM Understanding Module, the mapping module locates identified objects in the navigation map to generate a semantic navigation map.






Tree of Thoughts Exploration Module

Tree of Thoughts Exploration Modules incorporates Frontier-Based Exploration as the basis for navigation strategies. By analyzing current semantic information provided by the navigation map, VLTNet innovatively utilizes LLM such as GPT-3.5 to identify unexplored areas that are likely proximate to the target object. Noting the potential inaccuracies and multiple candidate frontiers inferred by the language model, we integrate the tree-of-thoughts mechanism. ToT employs a structured tree-based approach to decision-making, allowing for organized and systematic exploration, which enhances the model's ability to make informed choices in complex environments. Therefore, our method aids in refining and pinpointing the most promising frontier points, effectively bridging the vast insights of the language model with precision in frontier selection, thus enabling more informed and context-aware robotic exploration.

intro-image

Goal identification Module

The Goal Identification module determines whether the current object navigated to matches the target object. Our definition of the target object extends beyond simple categories, like "alarm clock." It encompasses more intricate spatial or appearance-based descriptions such as: "Alarm clock on a dresser near a desk lamp, bed" or "Small, metallic alarm clock." Thus, a basic algorithm that merely checks if the current object is an "alarm clock" is insufficient. To make a more informed judgment about whether the scene's context aligns with the target's description, we employ GPT-3.5. By analyzing the textual descriptions of the target and the currently detected object, this model offers a profound semantic understanding. It takes into account context and semantic associations to accurately determine the congruence between them.

intro-image

Trajectory visualization:

Here we give a visualized comparison between VLTNet and Exploration with Soft Common Sense (ESC) model on different benchmarks in PASTURE Dataset, as a comparison between Tree-of-Thoughts (ToT) Exploration and Probability Soft Logic (PSL) Exploration.

VLTNet

ESC

We can notice that ESC model tends to explore nearby frontiers first because ESC model employs PSL during commonsense reasoning, which prefers closer frontiers. For our VLTNet that employs ToT reasoning, since LLM models doesn't take frontier distances into consideration, VLTNet robots would sometimes explore further areas that it deemed to have high likelyhood of finding target objects.

ESC model was not designed for L-ZSON tasks and therefore overly complex prompts can disrupt its performance. Both ESC model and VLTNet succeeded in both Uncommon Object Goals and Object Goals with Appearance Attributes. However, in the Object Goal with Spatial Attributes, while ESC failed under the disruption of complex spatial description, our VLTNet model was able to successfully locate the target.