IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition

University of Maryland, College Park

Overview of the proposed IMRL approach for food acquisition. Given the last \(k\) steps of eye-in-hand RGB observations, the system segments the desired food using SAM, extracts features with an encoder (e.g., ResNet-50), and processes them through visual and physical representation modules to learn a joint representation \(z_{vp}\). A temporal representation module produces \(z_u\) to capture dynamics, while the geometric representation module provides bowl fullness \(l\) and optimal scooping points \((x^*, y^*)\). All these representations are integrated into a multi-dimensional representation \(z\), which, combined with robot proprioception, is used to generate robot actions through a control module.
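To make the data flow concrete, below is a minimal PyTorch sketch of the integration step described above. It is a sketch under stated assumptions, not the authors' released implementation: the SAM segmentation and ResNet-50 encoder are stubbed with random per-frame features, fusion is plain concatenation followed by an MLP, and all names and dimensions (e.g., `IMRLPolicy`, `z_dim`) are illustrative.

```python
# Minimal sketch of the IMRL representation-integration pipeline described above.
# Module names, dimensions, and the fusion scheme (concatenation + MLP) are
# illustrative assumptions; SAM and the ResNet-50 encoder are stubbed out.
import torch
import torch.nn as nn


class IMRLPolicy(nn.Module):
    def __init__(self, feat_dim=2048, z_dim=128, proprio_dim=7, action_dim=7):
        super().__init__()
        # Visual + physical modules map per-frame encoder features to a joint z_vp.
        self.visual_physical = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        # Temporal module summarizes the last k frames into z_u (dynamics).
        self.temporal = nn.GRU(input_size=z_dim, hidden_size=z_dim, batch_first=True)
        # Geometric cues: bowl fullness l (scalar) and scooping point (x*, y*).
        geom_dim = 3
        # Control head fuses the integrated representation z with proprioception.
        self.control = nn.Sequential(
            nn.Linear(z_dim + z_dim + geom_dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_feats, fullness, scoop_xy, proprio):
        # frame_feats: (B, k, feat_dim) features of the segmented food.
        z_vp = self.visual_physical(frame_feats)         # (B, k, z_dim)
        _, h = self.temporal(z_vp)                       # h: (1, B, z_dim)
        z_u = h.squeeze(0)                               # (B, z_dim)
        geom = torch.cat([fullness, scoop_xy], dim=-1)   # (B, 3): l and (x*, y*)
        z = torch.cat([z_vp[:, -1], z_u, geom], dim=-1)  # integrated representation
        return self.control(torch.cat([z, proprio], dim=-1))


# Usage with random tensors standing in for real SAM + ResNet-50 outputs:
policy = IMRLPolicy()
feats = torch.randn(2, 4, 2048)   # last k=4 frames of encoder features
l = torch.rand(2, 1)              # bowl fullness
xy = torch.rand(2, 2)             # optimal scooping point (x*, y*)
q = torch.randn(2, 7)             # robot proprioception
action = policy(feats, l, xy, q)  # (2, 7) robot actions
```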

Abstract

Robotic assistive feeding holds significant promise for improving the quality of life of individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food present unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods learn such policies with IL or reinforcement learning (RL) on top of off-the-shelf image encoders such as ResNet-50; however, these representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models the temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves up to a 35% improvement in success rate over the best-performing baseline.

Motivation

Comparison of standard BC and our approach for food acquisition. Standard BC (top) processes robot observations through an off-the-shelf encoder (e.g., ResNet-50), generating actions that are neither robust nor generalizable. In contrast, our proposed approach IMRL (bottom) uses visual, physical, temporal, and geometric representation learning to develop richer and more informative representations, enhancing the robustness and generalizability of BC for food acquisition. A minimal reference implementation of the shared BC objective is sketched below.
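Both pipelines in the figure optimize the same behavior-cloning objective and differ only in the representation fed to the policy. As a concrete reference, here is a minimal BC gradient step; the generic encoder-plus-MLP policy, batch shapes, and hyperparameters are illustrative stand-ins, not the paper's configuration.

```python
# Minimal behavior-cloning (BC) step: regress expert actions from observations.
# The policy here stands in for either pipeline in the figure -- an off-the-shelf
# encoder (standard BC) or IMRL's integrated representation z.
import torch
import torch.nn as nn

obs_dim, action_dim = 2048, 7
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# One gradient step on a batch of expert demonstrations (random stand-ins here).
obs = torch.randn(32, obs_dim)               # encoded robot observations
expert_actions = torch.randn(32, action_dim) # demonstrated actions
loss = nn.functional.mse_loss(policy(obs), expert_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```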

IMRL (Seen)
Real UR3 robot experimental testing of IMRL on seen scenarios, including a white circular bowl containing granular cereals, semi-solid jello, and liquid water.
IMRL (Unseen)
Real UR3 robot experimental testing of IMRL on unseen scenarios, including large and small blue circular bowls and a transparent square bowl, containing rice, black beans, yellow beans, red beans, and milk.
Baseline Comparison
IMRL (Seen)
IMRL (Unseen)
Baseline
IMRL successfully generalizes to unseen scenarios, while the baseline fails.

BibTeX


@article{liu2024imrl,
  title={IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition},
  author={Liu, Rui and Mahammad, Zahiruddin and Bhaskar, Amisha and Tokekar, Pratap},
  journal={arXiv preprint arXiv:2409.12092},
  year={2024}
}