RAIN | Region-Aware Interaction Networks

Why RAIN

The task is the same. The scene is not.

Post-trained vision-language-action (VLA) policies often optimize task semantics, visual grounding, and control as one coupled policy. That can look strong on the training distribution, but it encourages memorized execution patterns that break when the object moves, a subtask is requested alone, or the scene layout changes.

RAIN makes the target region the interface. External reasoning decomposes the instruction and selects the target; segmentation converts that decision into masks; the action network only learns how to interact with the specified region.

Figure: RAIN overview and task generalization. RAIN separates task interpretation from region-conditioned execution and evaluates the separation under object grounding, task comprehension, and layout robustness shifts.

Contributions

01

Region-aware action model

Executes an action type such as grasp, release, or open on explicit target masks instead of mapping a full instruction directly to actions.

02

Target-adaptive multi-view control

TarLN injects masks into each camera view, and the Cross-view Transformer Encoder aligns third-person, top-down, and wrist evidence around the target.

03

Reference-based Retargeting Strategy

RRS synthesizes smooth target-directed trajectories toward unseen configurations by joining arbitrary approach states to successful reference interactions.

04

LIBERO-EX and real-world OOD evaluation

The paper evaluates three 25-task LIBERO-EX splits and two real-robot task families under matched camera, action-space, and success-criterion settings.

Architecture

A compact control stack around explicit regions.

At timestep t, RAIN observes RGB views, a 7D robot state, an action type, and target masks. It predicts an action chunk A_t and auxiliary 3D plan waypoints P_t. A Progress Head estimates when the current subtask is done.

Figure: Overall RAIN architecture, comprising the Target-adaptive Cross-view Encoder, PlanDiT, Progress Head, and Reference-based Retargeting Strategy.

The action model is deliberately narrow: it does not consume the full task instruction. It receives multi-view RGB, robot state, action type, and target masks, then predicts a 16-step 7D action chunk together with 8 auxiliary 3D plan waypoints. RAIN-L uses frozen DINOv2-L/14 and CLIP ViT-L/14 backbones; RAIN-S uses frozen DINOv2-S/14 and CLIP ViT-B/32.

PlanDiT is a 12-layer Transformer trained with flow matching. At rollout, it uses 4 diffusion steps and executes the first 8 actions from each predicted chunk. The detailed objectives are left folded below so the main page keeps the method readable.
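The rollout schedule described above (4 sampling steps, then executing the first 8 of 16 predicted actions) can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes plain Euler integration of the learned flow from noise to a clean chunk, and `toy_velocity` is a hypothetical stand-in for the trained PlanDiT field that simply points at a known target.

```python
import numpy as np

CHUNK_LEN, ACTION_DIM = 16, 7   # chunk shape stated on the page
NUM_STEPS, EXEC_LEN = 4, 8      # sampling steps and executed prefix stated on the page

def toy_velocity(x, tau, target):
    """Hypothetical stand-in for the trained PlanDiT field.

    On the linear path x_tau = tau * x1 + (1 - tau) * eps, the velocity that
    reaches a known target x1 from the current x is (x1 - x) / (1 - tau).
    """
    return (target - x) / (1.0 - tau)

def sample_chunk(velocity_fn, rng, num_steps=NUM_STEPS):
    """Integrate from noise (tau = 0) to a clean chunk (tau = 1) with Euler steps."""
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
target = np.zeros((CHUNK_LEN, ACTION_DIM))
chunk = sample_chunk(lambda x, tau: toy_velocity(x, tau, target), rng)
executed = chunk[:EXEC_LEN]  # only the first 8 actions are sent to the robot
```

Executing half of each chunk before re-planning means every control window is refreshed with new observations while still benefiting from a longer predicted horizon.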

Training follows a two-stage procedure: first the visual encoder stack, PlanDiT, and action prediction modules are trained together; then the rest of the network is frozen and the Progress Head is trained for subtask transitions.

Mathematical details: TCE, PlanDiT, Progress Head, and RRS derivations

Target-conditioned multi-view features

Masks enter the visual stream through pixel-wise modulation before cross-view attention, so each camera contributes target-aware evidence.

\[ \bar{\mathbf{f}}^{v}_{t} = \mathrm{TarLN}(\mathbf{f}^{v}_{t},\mathbf{m}^{v}_{t}) = \gamma^{v}(\mathbf{m}^{v}_{t}) \odot \mathrm{LN}(\mathbf{f}^{v}_{t}) + \beta^{v}(\mathbf{m}^{v}_{t}),\quad v\in\{T,W\} \]
\[ \tilde{\mathbf{f}}^{T}_{t} = \mathrm{FFN}(\mathrm{SelfAttn}(\mathrm{CrossAttn}( \bar{\mathbf{f}}^{T}_{t},\bar{\mathbf{f}}^{W}_{t}))) \]
\[ \tilde{\mathbf{f}}^{W}_{t} = \mathrm{FFN}(\mathrm{SelfAttn}(\mathrm{CrossAttn}( \bar{\mathbf{f}}^{W}_{t},\bar{\mathbf{f}}^{T}_{t}))),\quad \phi_t=[\tilde{\mathbf{f}}^{T}_{t};\tilde{\mathbf{f}}^{W}_{t}] \]
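A minimal NumPy sketch of the TarLN modulation for a single view may help make the first equation concrete. The page does not specify how \(\gamma\) and \(\beta\) are parameterized, so the linear-in-mask form below (with hypothetical weights `w_gamma` and `w_beta`) is an assumption chosen only to keep the example short.

```python
import numpy as np

def layer_norm(f, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = f.mean(axis=-1, keepdims=True)
    var = f.var(axis=-1, keepdims=True)
    return (f - mean) / np.sqrt(var + eps)

def tar_ln(f, m, w_gamma, w_beta):
    """TarLN(f, m) = gamma(m) * LN(f) + beta(m), applied per token.

    f: (N, D) patch features for one view; m: (N,) mask value per patch.
    gamma and beta are ASSUMED linear in the mask value for illustration.
    """
    gamma = 1.0 + m[:, None] * w_gamma[None, :]   # (N, D) scale
    beta = m[:, None] * w_beta[None, :]           # (N, D) shift
    return gamma * layer_norm(f) + beta

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8))
m = np.array([0.0, 0.0, 1.0, 1.0])   # last two patches lie on the target
out = tar_ln(f, m, w_gamma=np.full(8, 0.5), w_beta=np.full(8, 0.1))
```

With this parameterization, patches outside the mask pass through as ordinary layer-normalized features, while patches on the target are rescaled and shifted before cross-view attention.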

PlanDiT flow objective

PlanDiT forms tokens \(E_t=[e_c;e_q;e_p;e_a]\) from action type, proprioception, plan tokens, and action tokens. The target sample is \(\mathbf{x}_t=[\mathbf{P}_t;\mathbf{A}_t]\). Flow matching learns the vector field from noisy plan-action tokens to the clean target, while an asymmetric attention mask lets action tokens attend to plan tokens without letting noisy action tokens distort the plan.

\[ \mathbf{x}^{\tau}_{t}=\tau\mathbf{x}_{t}+(1-\tau)\epsilon,\quad \epsilon\sim\mathcal{N}(0,I) \]
\[ \mathcal{L}_{\mathrm{PlanDiT}} = \lambda_{\mathrm{fm}}\mathbb{E} \left[ \left\|V_{\theta}(\mathbf{x}^{\tau}_{t},\phi_t,\mathbf{q}_t,c_t) -(\mathbf{x}_{t}-\epsilon)\right\|_2^2 \right] + \lambda_{\mathrm{cons}}\mathcal{L}_{\mathrm{cons}} \]
\[ \mathcal{L}_{\mathrm{cons}} = \sum_{k=1}^{K-1} \mathrm{SmoothL1}( \hat{\mathbf{p}}_{t_k,N}-\hat{\mathbf{p}}_{t_K,N} ) \]
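The flow-matching term can be illustrated by building one training pair from the interpolation above. The sketch below covers only the regression target \(\mathbf{x}_t-\epsilon\); the consistency term is omitted, uniform sampling of \(\tau\) is an assumption, and the `(24, 7)` token layout is purely illustrative rather than the actual plan-action token shape.

```python
import numpy as np

def flow_matching_pair(x_clean, rng):
    """Build one training example: noisy input, time, and velocity target."""
    eps = rng.standard_normal(x_clean.shape)
    tau = rng.uniform()                        # ASSUMED uniform time sampling
    x_tau = tau * x_clean + (1.0 - tau) * eps  # linear path from noise to data
    v_target = x_clean - eps                   # derivative of x_tau w.r.t. tau
    return x_tau, tau, v_target

def fm_loss(v_pred, v_target):
    """Squared L2 norm of the velocity error, averaged over tokens."""
    return np.mean(np.sum((v_pred - v_target) ** 2, axis=-1))

rng = np.random.default_rng(0)
x_clean = rng.standard_normal((24, 7))   # illustrative token layout only
x_tau, tau, v_target = flow_matching_pair(x_clean, rng)
```

Because the path is linear in \(\tau\), following the target velocity for the remaining time \(1-\tau\) lands exactly on the clean sample, which is what makes few-step sampling plausible at rollout.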

Progress and transition scores

Completion is supervised directly, but distance and alignment provide geometric signals for deciding when the active subtask is actually done.

\[ S_{\mathrm{dist}} = \exp(-\alpha\|\mathbf{p}^{ee}_{t}-\mathbf{p}_{g}\|_2) \]
\[ S_{\mathrm{align}} = \frac{1}{2} \left( 1+ \frac{\mathrm{tr}((\mathbf{R}^{ee}_{t})^\top\mathbf{R}_{g})-1}{2} \right) \]
\[ \mathcal{L}_{\mathrm{head}} = \lambda_{\mathrm{dist}}\mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{align}}\mathcal{L}_{\mathrm{align}} + \lambda_{\mathrm{comp}}\mathcal{L}_{\mathrm{comp}} \]
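The two geometric scores are straightforward to compute from end-effector and goal poses. In the sketch below, \(\alpha=1\) is an assumed default (the page does not give its value) and `rot_z` is a helper introduced only for the example. Since \((\mathrm{tr}(\mathbf{R}^\top\mathbf{R}_g)-1)/2\) is the cosine of the relative rotation angle, \(S_{\mathrm{align}}\) maps that cosine from \([-1,1]\) to \([0,1]\).

```python
import numpy as np

def dist_score(p_ee, p_goal, alpha=1.0):
    """S_dist = exp(-alpha * ||p_ee - p_goal||); equals 1 at the goal position."""
    return float(np.exp(-alpha * np.linalg.norm(p_ee - p_goal)))

def align_score(R_ee, R_goal):
    """S_align = (1 + cos(theta)) / 2, theta being the relative rotation angle."""
    cos_theta = (np.trace(R_ee.T @ R_goal) - 1.0) / 2.0
    return float(0.5 * (1.0 + cos_theta))

def rot_z(theta):
    """Rotation about the z-axis, used only to build example orientations."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

identity = np.eye(3)
aligned = align_score(identity, identity)            # perfectly aligned
quarter_turn = align_score(identity, rot_z(np.pi / 2))
opposite = align_score(identity, rot_z(np.pi))       # 180 degrees apart
```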

The completion label comes from annotated subtask boundaries. At inference time, RAIN advances to the next subtask when the predicted completion score \(\hat{y}_t\) crosses the completion threshold, so the action model can execute multi-step tasks without absorbing the whole language plan.
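The threshold-based switching described above can be sketched as a small state machine. The threshold value of 0.5, the `(action type, target)` tuples, and the score sequence are all illustrative assumptions; the page does not state the actual threshold.

```python
def run_subtasks(subtasks, scores, threshold=0.5):
    """Walk through per-step completion scores and record the active subtask.

    threshold=0.5 is an ASSUMED value used only for this illustration.
    """
    active, trace = 0, []
    for y_hat in scores:
        if active >= len(subtasks):
            break                      # every subtask has completed
        trace.append(subtasks[active])
        if y_hat >= threshold:
            active += 1                # switch takes effect on the next step
    return trace, active == len(subtasks)

subtasks = [("grasp", "bowl"), ("release", "plate")]
scores = [0.1, 0.3, 0.8, 0.2, 0.9]     # hypothetical Progress Head outputs
trace, done = run_subtasks(subtasks, scores)
```

In this example the policy stays on the grasp subtask for three steps, switches once the score reaches 0.8, and finishes the release subtask on the final step.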

RRS junction and Hermite path

The junction score prefers a tangent-compatible, nearby, early merge point. The Hermite segment satisfies the endpoints by construction, then the remaining reference trajectory is reused.

\[ S_k = (\mathbf{v}_{k}^{\top}\mathbf{t}_{k}) \left(1-\frac{d_k}{d_{\max}}\right) \left(1-\frac{k}{M}\right) \]
\[ C(u)= \psi_{00}(u)\mathbf{x}_0 + \psi_{10}(u)\mathbf{D}_0 + \psi_{01}(u)\mathbf{r}_{k^*} + \psi_{11}(u)\mathbf{D}_1 \]
\[ \mathbf{P} = [C(u_0),\ldots,C(u_{N_{\mathrm{app}}}), \mathbf{r}_{k^*+1},\ldots,\mathbf{r}_{M-1}] \]

Reason

VLM decomposition

In real-world rollout, a VLM decomposes the instruction and grounds a target phrase at subtask boundaries.

Segment

SAM 2 masks

The grounded target becomes seed masks that SAM 2 propagates through the control loop.

Encode

TCE + TarLN

\(\mathrm{TarLN}(\mathbf{f},\mathbf{m})=\gamma(\mathbf{m})\odot\mathrm{LN}(\mathbf{f})+\beta(\mathbf{m})\) injects the target into each view.

Generate

PlanDiT

Flow matching predicts 8 plan waypoints and a 16-step 7D action chunk.

Transition

Progress Head

Completion, distance, and alignment scores drive autonomous subtask switching.

Retarget

RRS

Trajectory synthesis expands the target configurations seen during training.

Figure: RRS trajectory retargeting. RRS selects a tangent-compatible junction on a reference trajectory, connects to it with a cubic Hermite spline, then preserves the reference contact segment.

RRS does not arbitrarily mix demonstrations. It samples future subtasks from the same episode, filters candidates by hand state and action type, rejects invalid masks or near-duplicate targets, then connects the current end-effector state to a tangent-compatible junction on a successful reference trajectory. After the junction, the original reference actions are reused to preserve contact-rich interaction dynamics.

RRS mathematical deep dive: tangent junction scoring and Hermite spline constraints

Reference path and candidate junctions

Given a successful reference trajectory \(\mathcal{R}=\{\mathbf{r}_0,\ldots,\mathbf{r}_{M-1}\}\) and a new approach state \(\mathbf{x}_0\in\mathbb{R}^3\), RRS evaluates interior reference points \(k\in\{1,\ldots,M-2\}\) as possible merge points.

\[ \mathbf{t}_k = \frac{\mathbf{r}_{k+1}-\mathbf{r}_{k-1}} {\|\mathbf{r}_{k+1}-\mathbf{r}_{k-1}\|_2}, \quad \mathbf{v}_k = \frac{\mathbf{r}_k-\mathbf{x}_0} {\|\mathbf{r}_k-\mathbf{x}_0\|_2}, \quad d_k=\|\mathbf{r}_k-\mathbf{x}_0\|_2 \]

Why the score has three factors

The first term rewards entering the reference motion along its tangent, the second avoids excessive approach length, and the third prefers an early merge so more of the demonstrated successful interaction is preserved.

\[ S_k = (\mathbf{v}_k^\top \mathbf{t}_k) \left(1-\frac{d_k}{d_{\max}}\right) \left(1-\frac{k}{M}\right), \quad k^*=\arg\max_k S_k \]
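The junction score can be evaluated directly from the definitions of \(\mathbf{t}_k\), \(\mathbf{v}_k\), and \(d_k\). The sketch below is a minimal version under stated assumptions: `d_max` is passed in by the caller because the page does not say how it is chosen, and the straight-line reference path is a toy example.

```python
import numpy as np

def select_junction(ref, x0, d_max):
    """Score interior reference indices 1..M-2 and return (k_star, scores).

    Assumes x0 does not coincide with any reference point (d_k > 0).
    """
    M = len(ref)
    scores = {}
    for k in range(1, M - 1):
        t_k = ref[k + 1] - ref[k - 1]
        t_k = t_k / np.linalg.norm(t_k)          # central-difference tangent
        offset = ref[k] - x0
        d_k = np.linalg.norm(offset)
        v_k = offset / d_k                       # unit approach direction
        scores[k] = float(v_k @ t_k) * (1.0 - d_k / d_max) * (1.0 - k / M)
    k_star = max(scores, key=scores.get)
    return k_star, scores

# Toy reference: five points along the x-axis; the approach starts behind it.
ref = np.stack([np.arange(5.0), np.zeros(5), np.zeros(5)], axis=1)
x0 = np.array([-1.0, 0.0, 0.0])
k_star, scores = select_junction(ref, x0, d_max=10.0)
```

Here every candidate is perfectly tangent-aligned, so the distance and earliness factors decide the outcome and the earliest interior point wins with a score of 0.8 × 0.8 = 0.64.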

Hermite connection

After selecting \(k^*\), RRS connects the current state to the reference junction with a cubic Hermite curve. The endpoint tangent \(\mathbf{D}_1=L\mathbf{t}_{k^*}\) makes the synthesized approach enter the reference trajectory in the same direction as the recorded interaction.

\[ \mathbf{D}_0=\mathbf{r}_{k^*}-\mathbf{x}_0,\quad \mathbf{D}_1=L\mathbf{t}_{k^*},\quad L=\|\mathbf{r}_{k^*}-\mathbf{x}_0\|_2 \]
\[ C(u)= \psi_{00}(u)\mathbf{x}_0+ \psi_{10}(u)\mathbf{D}_0+ \psi_{01}(u)\mathbf{r}_{k^*}+ \psi_{11}(u)\mathbf{D}_1,\quad u\in[0,1] \]
\[ \psi_{00}=2u^3-3u^2+1,\quad \psi_{10}=u^3-2u^2+u,\quad \psi_{01}=-2u^3+3u^2,\quad \psi_{11}=u^3-u^2 \]

Endpoint proof sketch

The basis functions satisfy the Hermite boundary conditions. Therefore the generated approach begins at the current state, ends at the selected reference point, and its derivative at the junction is aligned with the reference tangent.

\[ C(0)=\mathbf{x}_0,\quad C(1)=\mathbf{r}_{k^*},\quad C'(0)=\mathbf{D}_0,\quad C'(1)=\mathbf{D}_1=L\mathbf{t}_{k^*} \]
\[ \mathbf{P}= [C(u_0),\ldots,C(u_{N_{\mathrm{app}}}), \mathbf{r}_{k^*+1},\ldots,\mathbf{r}_{M-1}] \]
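The endpoint conditions can be checked numerically by evaluating the Hermite segment and appending the reference tail. This sketch assumes uniformly spaced samples \(u_i\) (the page does not specify the spacing) and takes the junction index \(k^*\) as given; the function names are hypothetical.

```python
import numpy as np

def hermite_approach(x0, r_k, t_k, n_app):
    """Sample C(u) at n_app + 1 uniformly spaced values of u in [0, 1]."""
    L = np.linalg.norm(r_k - x0)
    D0, D1 = r_k - x0, L * t_k
    u = np.linspace(0.0, 1.0, n_app + 1)[:, None]   # column for broadcasting
    psi00 = 2 * u**3 - 3 * u**2 + 1
    psi10 = u**3 - 2 * u**2 + u
    psi01 = -2 * u**3 + 3 * u**2
    psi11 = u**3 - u**2
    return psi00 * x0 + psi10 * D0 + psi01 * r_k + psi11 * D1

def retarget(ref, x0, k_star, n_app=10):
    """Hermite approach to ref[k_star], followed by the reference tail."""
    t_k = ref[k_star + 1] - ref[k_star - 1]
    t_k = t_k / np.linalg.norm(t_k)
    approach = hermite_approach(x0, ref[k_star], t_k, n_app)
    return np.vstack([approach, ref[k_star + 1:]])

ref = np.stack([np.arange(5.0), np.zeros(5), np.zeros(5)], axis=1)
x0 = np.array([0.0, -1.0, 0.0])          # approach state off the reference line
path = retarget(ref, x0, k_star=1)
```

The resulting path starts at `x0`, reaches `ref[1]` at the end of the approach segment, and then reuses `ref[2:]` unchanged.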

This is the mathematical reason RRS is not just a straight-line shortcut: it creates a smooth tangent-compatible approach, then reuses the contact-rich part of the demonstrated reference trajectory.

Video demo

Retargeting turns a reference interaction into a new rollout.

This RRS demo shows how RAIN joins an approach trajectory to a successful reference interaction, preserving the contact segment while adapting the rollout to a new target configuration.

Reference-based Retargeting Strategy demo: approach synthesis, junction selection, and target-directed execution in one rollout.

Benchmarks

The gap appears when the task meaning is preserved but the context changes.

RAIN is trained on 40 original LIBERO tasks and evaluated on LIBERO-EX: 75 OOD tasks across object grounding, layout robustness, and task comprehension, with 10 episodes per task.

In simulation, a single policy is trained on the 40 tasks from LIBERO Spatial, Object, Goal, and Long under the same joint-suite protocol used for baselines. LIBERO-EX then evaluates three 25-task stress splits with 10 episodes per task.

The splits are designed to separate task understanding from demonstration replay: object grounding includes unseen objects and object switching, layout robustness changes scene arrangements, and task comprehension asks for subtask extraction or composition. In the paper, all compared baselines stay below 10% success on the task-comprehension split.

This matters because several baselines keep high in-distribution LIBERO scores while collapsing under LIBERO-EX. The paper highlights task comprehension as the sharpest failure mode: policies must either execute a constituent subtask of a learned long-horizon task or compose learned single-step skills into a new long-horizon instruction.

RAIN-L beats the strongest baseline, \(\pi_{0.5}\), by 25.7 points on LIBERO-EX average and by 11.6 / 14.8 points on object grounding and layout robustness. On task comprehension, where the best baseline is Dita at 9.6, RAIN-L reaches 53.2. Even RAIN-S reaches 52.7 EX average, still ahead of every compared VLA baseline.

Object grounding: 76.0

+11.6 over best baseline

Layout robustness: 72.0

+14.8 over best baseline

Task comprehension: 53.2

+43.6 over best baseline

Simulation success rates (%). LIBERO-EX is the OOD benchmark; LIBERO Avg. is the original suite average.
Method	EX Object	EX Layout	EX Task	EX Avg.	LIBERO Avg.
OpenVLA-OFT	1.6	0.0	0.0	0.5	97.1
π₀	38.8	37.2	6.0	27.3	94.2
π_0.5	64.4	57.2	2.4	41.3	96.9
GR00T N1.6	35.2	31.2	2.8	23.0	95.8
Action-Sketcher	47.6	27.6	3.2	26.1	96.9
Cosmos Policy	40.8	48.4	5.6	31.6	98.5
Dita	29.6	26.0	9.6	21.7	78.5
RAIN-L	76.0	72.0	53.2	67.0	91.0
RAIN-S	53.6	54.0	50.4	52.7	86.9
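The margins quoted for RAIN-L can be re-derived from the table rows. The short check below copies the relevant values and recomputes the differences. It also compares the reported EX averages with an unweighted mean of the three splits; equal weighting is an assumption, since the page does not state how the average is formed.

```python
# (EX Object, EX Layout, EX Task, EX Avg.) copied from the table above.
rows = {
    "pi_0.5": (64.4, 57.2, 2.4, 41.3),
    "Dita": (29.6, 26.0, 9.6, 21.7),
    "RAIN-L": (76.0, 72.0, 53.2, 67.0),
    "RAIN-S": (53.6, 54.0, 50.4, 52.7),
}

rain_l, pi05, dita = rows["RAIN-L"], rows["pi_0.5"], rows["Dita"]
gains = {
    "object": round(rain_l[0] - pi05[0], 1),
    "layout": round(rain_l[1] - pi05[1], 1),
    "task": round(rain_l[2] - dita[2], 1),      # Dita is the best baseline here
    "average": round(rain_l[3] - pi05[3], 1),
}

# Gap between the reported average and an unweighted mean of the three splits.
mean_gap = {name: abs(sum(r[:3]) / 3 - r[3]) for name, r in rows.items()}
```

The recomputed gains match the quoted 11.6, 14.8, 43.6, and 25.7 points, and the unweighted split means agree with the reported averages to within 0.1 point.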

TCE 91.0 → 82.1

Removing cross-view exchange is the largest LIBERO component drop.

PlanDiT 91.0 → 89.7

Removing plan tokens hurts action prediction.

Progress Head 89.7 → 91.0

Distance and alignment supervision improve completion prediction.

RRS 44.3 → 67.0

Retargeting gives the largest LIBERO-EX gain.

LIBERO-EX

75 stress tests for task generalization.

LIBERO-EX is derived from LIBERO but does not simply perturb the original tasks. The object split checks whether the policy follows the referred object rather than replaying a scene-specific trajectory: unseen-object tasks keep the learned task structure but swap in objects absent from the corresponding training scene, while object-switching tasks move multiple objects and force instruction-relevant selection.

The layout split rearranges objects and adds distractors, changing the geometry around a learned interaction. The task split changes task scope: some tasks request only a constituent subtask from a learned long-horizon demonstration, while others compose familiar skills into longer instructions not seen during post-training.

Figure: Qualitative LIBERO-EX rollouts with region masks across the three evaluation axes.

Real robot

The same region-conditioned interface transfers to OMY-F3M.

RAIN is deployed on a ROBOTIS OMY-F3M arm with a D405 wrist camera and two D435 third-person/top-down cameras. The policy uses a 7D joint-absolute action interface and the robot client runs at 30 Hz.

A task registry provides the active target phrase. Cosmos-Reason2-2B grounds the phrase once at subtask boundaries; SAM 2 tracks the region during control; RAIN produces action chunks and progress scores.

The real-world setup uses a 6-DoF ROBOTIS OMY-F3M arm with a 1-DoF two-finger gripper. The robot observes D405 wrist and D435 third-person/top-down RGB streams, and all compared policies use the same 7D joint-absolute action interface at 30 Hz.

Deployment separates sparse grounding from closed-loop control. A ROS rollout client streams camera frames and joint states to the model server, while Cosmos-Reason2-2B grounds the active registry phrase into a 2D box at subtask boundaries. SAM 2 propagates the resulting masks, and RAIN uses those masks to produce action chunks and progress scores.
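The split between sparse grounding and closed-loop control can be sketched as a loop in which grounding runs once per subtask while tracking and the policy run on every frame. Everything below is hypothetical scaffolding: `ground`, `track`, and `policy` are stub callables standing in for the VLM, SAM 2, and RAIN, and the threshold and frame format are assumptions made so the example runs without a robot or model weights.

```python
def rollout(subtasks, frames, ground, track, policy, threshold=0.5):
    """Ground once per subtask, then track and act on every frame."""
    calls = {"ground": 0, "track": 0}
    active, box = 0, None
    for frame in frames:
        if active >= len(subtasks):
            break
        action_type, phrase = subtasks[active]
        if box is None:                      # sparse: only at a subtask boundary
            box = ground(frame, phrase)
            calls["ground"] += 1
        mask = track(frame, box)             # dense: every control step
        calls["track"] += 1
        _chunk, progress = policy(frame, mask, action_type)
        if progress >= threshold:
            active, box = active + 1, None   # next subtask needs re-grounding
    return calls, active

# Hypothetical stubs; real components would wrap the VLM, SAM 2, and RAIN.
def ground(frame, phrase):
    return (0, 0, 10, 10)                    # placeholder 2D box

def track(frame, box):
    return box                               # placeholder "mask"

def policy(frame, mask, action_type):
    return [0.0] * 7, frame["progress"]      # dummy action, scripted progress

frames = [{"progress": p} for p in (0.1, 0.2, 0.9, 0.3, 0.95)]
subtasks = [("grasp", "potato"), ("release", "plate")]
calls, finished = rollout(subtasks, frames, ground, track, policy)
```

In this toy run the grounding stub is called twice (once per subtask) while the tracker is called on all five frames, which mirrors the intended cost profile: the expensive reasoning step stays out of the control loop.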

Demonstrations are recorded with a ROS-based system and stored in LeRobot format with synchronized wrist, third-person, and top-down RGB observations, joint states, actions, and end-effector poses at 30 Hz. The baselines are fine-tuned on the same OMY-F3M trajectories and receive the same three camera views, language instruction, and 7D proprioceptive state; RAIN additionally receives online SAM 2 masks for the third-person and top-down views.

Figure: OMY-F3M real-world setup with wrist, third-person, and top-down cameras.

Figure: Real-world generalization results, showing OOD success rates and a qualitative Lemon-to-Plate rollout.

Figure: Real-world RAIN rollout trajectory with target-region grounding overlays across camera views.

Real-world success rates (%) under matched robot, camera, action-space, and success-criterion settings.
Method	Potato ID	Potato OOD Avg.	Tape ID Avg.	Tape OOD Avg.
GR00T N1.6	90	43.3	66.7	25.0
X-VLA	70	23.3	33.3	15.0
MolmoAct2	70	56.7	60.0	55.0
RAIN	90	86.7	66.7	65.0

Assets

Paper assets and implementation details.

PDF: Full paper
Figure: Architecture PDF
Figure: RRS PDF
Video: RRS rollout demo
Figure: Real-world trajectory PNG
Figure: Real-world results PDF
Supplement: Potato-to-Plate demos
Supplement: Tape-to-Clay demos
Code: Implementation notes

Limitation: RAIN depends on reliable target-region prompts. Severe occlusion, poor camera placement, or interaction geometry far beyond demonstrations and RRS trajectories can still break the policy.