World Models and Human–Object Interaction (HOI)

Author: ChatGPT
The papers below sit at the intersection of world models and Human–Object Interaction (HOI), with an emphasis on work that builds structured, object-centric representations from video or uses world-model-based learning to plan object-rich interactions.

🧠 1. FOCUS: Object‑Centric World Models for Robotic Manipulation (Jul 2023)

FOCUS is a model-based RL agent that builds a structured world model by encoding each object into its own latent vector. It steers exploration toward object interactions, which improves exploration and sample efficiency in sparse-reward manipulation tasks. It is evaluated in simulated environments such as ManiSkill2 and Robosuite and on real Franka robot hardware. (arXiv, Frontiers)
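The benefit of keeping one latent vector per object can be illustrated with a toy sketch. It assumes a world model has already produced per-object latents (the dictionaries in the usage lines are hand-written stand-ins) and scores each object by how far its latent moved between two steps. That displacement bonus is a simplification chosen for illustration, not FOCUS's actual exploration objective, and all names here are hypothetical.

```python
import math
from typing import Dict, List

Latent = List[float]


def latent_change(prev: Latent, curr: Latent) -> float:
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def interaction_bonus(
    prev_state: Dict[str, Latent], curr_state: Dict[str, Latent]
) -> Dict[str, float]:
    """Per-object bonus: how much each object's latent changed.

    Because the state is factored by object, the agent can tell which
    object its last action affected, not only that something changed.
    """
    return {
        name: latent_change(prev_state[name], curr_state[name])
        for name in curr_state
    }


# Hypothetical latents before and after one action.
before = {"cube": [0.0, 0.0], "drawer": [1.0, 1.0]}
after = {"cube": [3.0, 4.0], "drawer": [1.0, 1.0]}
bonus = interaction_bonus(before, after)
```

With a single entangled latent, the same change would show up as one number with no indication of which object was responsible.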

🔧 2. Structured World Models from Human Videos (RSS’23)

Known as SWIM, this approach pre-trains a world model on human video data. Its affordance-based, human-centric structured action space lets robots learn diverse manipulation skills from roughly 30 minutes of real-robot experience. Because the action space is tied to affordances rather than to a particular robot, what the model learns from human video carries over beyond a single embodiment. (Medium)

🎛️ 3. Structured World Models from Human Videos (Paper: Structured World Models from Human Videos)

This entry refers to the same paper as item 2. Its core idea is to use human video to learn an affordance-grounded world model that encodes object interactions, which supports goal-based planning and policy execution even with limited robot experience.

🖐️ 4. Human‑Object Interaction with Vision‑Language Model Guided Relative Movement Dynamics (RMD‑HOI) — Mar 2025

Introduces a framework in which a vision-language model translates free-form instructions into Relative Movement Dynamics (RMD), which then guide language-conditioned reinforcement learning. The framework supports long-horizon, multi-round HOI planning, including with dynamic and articulated objects, and couples semantic instruction, perception, and motion planning. (arXiv)

🌍 5. OpenHOI: Open‑World HOI Synthesis with Multimodal LLM — May 2025

OpenHOI brings together affordance grounding, language decomposition, and an affordance-driven diffusion model with physics-based refinement. It enables generation of long-horizon hand-object interactions from language commands over novel objects. This is essentially world-model-informed HOI synthesis grounded in affordance and physics. (arXiv)

🔄 6. Vision-Based Manipulation from Single Human Video (ORION)

ORION learns manipulation policies from a single RGB-D human demonstration using Open-world Object Graphs (OOGs), which are structured, object- and hand-centric representations. It constructs manipulation plans that generalize across spatial layouts, backgrounds, and unseen object instances. (arXiv)
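To make the idea of an object- and hand-centric graph concrete, here is a minimal sketch of such a structure. It is not ORION's actual OOG schema: the class, its fields, and the contact-diff helper are hypothetical, and the positions are made-up values.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple

Position = Tuple[float, float, float]


@dataclass
class ObjectGraph:
    """Toy scene graph for one keyframe: nodes are the hand and objects,
    edges are undirected contact relations."""

    positions: Dict[str, Position]
    contacts: Set[FrozenSet[str]] = field(default_factory=set)

    def add_contact(self, a: str, b: str) -> None:
        self.contacts.add(frozenset((a, b)))

    def in_contact(self, a: str, b: str) -> bool:
        return frozenset((a, b)) in self.contacts


def contact_changes(before: ObjectGraph, after: ObjectGraph) -> Set[FrozenSet[str]]:
    """Contacts that appear or disappear between two keyframes."""
    return before.contacts ^ after.contacts


# Keyframe 0: mug rests on the table. Keyframe 1: the hand grasps the mug.
g0 = ObjectGraph({"hand": (0.0, 0.0, 0.5), "mug": (0.2, 0.0, 0.0), "table": (0.0, 0.0, 0.0)})
g0.add_contact("mug", "table")
g1 = ObjectGraph({"hand": (0.2, 0.0, 0.1), "mug": (0.2, 0.0, 0.0), "table": (0.0, 0.0, 0.0)})
g1.add_contact("mug", "table")
g1.add_contact("hand", "mug")
changed = contact_changes(g0, g1)
```

Describing a demonstration as a sequence of contact changes, rather than raw pixels, is one reason such a representation can ignore backgrounds and absolute layouts.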

📚 7. World Model Foundations

Ha & Schmidhuber (2018): a VAE for perception, an MDN-RNN for dynamics, and a small linear controller that maps the latent and hidden state to actions.
LeCun (2022): world models as neural “mental simulation” for commonsense reasoning, often incorporated in embodied agents. (Wikipedia)
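The perception, dynamics, and control split attributed to Ha & Schmidhuber can be sketched as a control loop. This is a toy under stated assumptions: hand-written arithmetic stands in for the trained VAE and recurrent model, the memory step only updates a hidden state and does not predict the next latent, and every function name and constant is hypothetical.

```python
from typing import List

Vector = List[float]


def encode(observation: Vector) -> Vector:
    """V (perception): compress an observation into a latent z.
    Stand-in for a VAE encoder: average adjacent pairs of values."""
    return [
        (observation[i] + observation[i + 1]) / 2.0
        for i in range(0, len(observation) - 1, 2)
    ]


def step_memory(hidden: Vector, z: Vector, action: float) -> Vector:
    """M (dynamics): update the recurrent state from (h, z, a).
    Stand-in for a trained RNN cell."""
    return [0.5 * h + 0.5 * zi + 0.1 * action for h, zi in zip(hidden, z)]


def act(z: Vector, hidden: Vector) -> float:
    """C (control): a linear map from [z, h] to a scalar action."""
    return sum(z) - sum(hidden)


def rollout(observations: List[Vector]) -> List[float]:
    """Run the V -> C -> M loop over a sequence of observations."""
    hidden = [0.0] * (len(observations[0]) // 2)
    actions = []
    for obs in observations:
        z = encode(obs)                       # perceive
        a = act(z, hidden)                    # decide from latent + memory
        hidden = step_memory(hidden, z, a)    # update memory with the action
        actions.append(a)
    return actions


actions = rollout([[1.0, 3.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]])
```

The design point is that the controller never sees raw observations: it acts only on the compressed latent and the memory state, which is what keeps it small.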

📊 Summary Table

| Paper / Model | Domain | World-Model Structure | HOI Aspect |
|---|---|---|---|
| FOCUS | RL / robotics | Object-centric latent dynamics | Focused exploration, object manipulation |
| SWIM (Structured WM) | Pre-training RL | Affordance action world model | From human videos → robot affordance plans |
| RMD-HOI | HOI / RL | Language-guided dynamics model | Vision-language → sequential HOI planning |
| OpenHOI | Multimodal HOI | Affordance + diffusion + world model | Open-world HOI synthesis with physics |
| ORION | Imitation from video | Object-graph world plan extraction | Single-demo generalizable HOI policies |

💡 Why These Matter

Object-centric representations in world models (like FOCUS, SWIM, ORION) enable models to capture and reason about interactions more efficiently and generalize better.
Affordance-guided structures bridge perception and action, enabling tasks to be grounded even from limited data.
Language-guided dynamics planning (RMD‑HOI, OpenHOI) allows long-horizon sequential HOI planning from natural instructions.
Taken together, these methods target zero- or few-shot generalization to new objects, instructions, or environments.
