Review · May 17, 2026

Agentic Modeling, World Models & Safety Adaptation

3 papers · 3 labs · auto-generated

TL;DR

Focus

No flagship model launch surfaced in the 36-hour window between the May 16 review and this one — Google’s I/O keynote (with the rumoured Gemini 3.2 Flash) is still pencilled for May 19–20, and Anthropic’s May 14 partnership announcements (PwC, Gates Foundation $200M) are non-technical. The page collects three Tier 2 frontier-lab research papers that landed on Hugging Face Daily Papers on May 14–15. Google’s LiSA (Lifelong Safety Adaptation) treats deployment-time guardrail failures as a structured-memory problem and converts sparse user reports into reusable policy abstractions gated by a Beta-posterior lower bound on accumulated evidence. Microsoft Research’s Orchard open-sources the missing piece of agentic training infrastructure — a Kubernetes-native environment service plus three training recipes that push Qwen3-30B-A3B-Thinking to a new open-source SOTA of 67.5% on SWE-bench Verified. NVIDIA’s SANA-WM introduces a 2.6B-parameter open-source world model that generates 60-second 720p videos with precise 6-DoF camera control on a single GPU, beating prior open baselines on action-following at 36× throughput.

Competitiveness

Orchard-SWE is the eye-catcher: 67.5% on SWE-bench Verified from a 30B-parameter MoE backbone (Qwen3-30B-A3B, roughly 3B parameters active per token) with a fully open-source recipe. The current SWE-bench Verified leaders are Claude Opus 4.7 at 78.4% and GPT-5.5 in the same tier, both proprietary; Orchard narrows the open/closed gap to roughly 11 points (78.4% vs. 67.5%) at an activation budget that a single H100 inference node can serve. Orchard-GUI’s 4B computer-use agent hits 74.1% on WebVoyager and 67.0% on Online-Mind2Web, beating every prior open-source computer-use baseline and coming within striking distance of the OpenAI Operator family. SANA-WM, with a 36× throughput advantage over LingBot-World on a 60-second 720p generation budget and a training run of only 213K public video clips over 15 days on 64 H100s, is the first open-source minute-scale world model that is realistic to retrain on a research budget; Genie 3 and Veo 3 remain proprietary. LiSA pushes the safety-helpfulness frontier past pure backbone scaling on PrivacyLens+ / ConFaide+ / AgentHarm and survives a 20% label-flip rate on user feedback; the closest external comparison is Anthropic’s contextual-integrity guardrails work from late 2025, which assumed offline-trained policies rather than online adaptation.

New frontier releases

No new flagship models in the window. Most recent launches remain Anthropic’s Claude Opus 4.7 (Apr 16) and OpenAI’s GPT-5.5 (Apr 23). DeepSeek-V4 Pro (Apr 24) is the most recent open-source flagship. Google’s I/O event on May 19–20 is the next expected launch window.

Google

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Tier 2 · Research Paper · arXiv:2605.14454 · 2026-05-14 · Safety · Guardrails · Agentic deployment · Online adaptation · Memory

Overview

Authors: Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le. Affiliations: Google Cloud AI Research, Google, Seoul National University. Tomas Pfister leads Cloud AI Research; Long T. Le is the senior contributor at Google. Submitted May 14, 2026.
Targets the operational gap between pre-deployment guardrail training and the real-world long tail of agentic-action failures — secrets leaking through tool calls, unsafe autonomous actions, and over-blocking of legitimate work that breaks usefulness.
Framing: a fixed base guardrail (any production safety classifier) plus a structured memory module that converts sparse, noisy user feedback into reusable policy abstractions, with the memory’s influence gated by a posterior-evidence bound rather than empirical accuracy.
Evaluated on PrivacyLens+, ConFaide+, and AgentHarm against the strongest published memory-based baselines (ReasoningBank, Synapse, AGrail). Beats all three under sparse feedback; remains stable at 20% label-flip noise on user reports; pushes the latency-vs-performance frontier past simply scaling to a larger backbone guardrail.

Methodology

Mechanism — Conservative Policy Induction (the core contribution):
- Problem. Static guardrails can’t pre-enumerate the contextual norms of every deployment. Naive online adaptation from a handful of user reports overgeneralizes (a single bad flag wipes out a useful behaviour class) or underreacts (a real failure mode never accumulates enough signal). Repeated fine-tuning is too slow and too expensive to be the operational answer.
- Mechanism. Three modules layered on top of an unchanged base guardrail. (1) Broad policy abstraction: an LLM-driven generalization step that lifts an individual failure (e.g. “the agent shared employee performance reviews with a contractor”) into a portable rule (“do not share internal performance records with non-employee identities”) so a single sparse report covers a region of context space, not a single point. (2) Conflict-aware local policies: when the broad policy region contains both positive and negative labels in adjacent contexts, a tighter local rule is induced inside the mixed-label sub-region to preserve the boundary — preventing the broad rule from eating legitimate cases. (3) Evidence-aware confidence gating: a Beta-posterior lower bound on the per-policy success rate. A rule that has been validated once does not fire with the same authority as one validated 100 times; the lower bound on the posterior controls whether memory retrieval is allowed to override the base classifier at inference time. As reports accumulate, the bound tightens and the rule activates more aggressively.
- Why. Posterior-bound gating is the stabilizer that lets the system learn from sparse and noisy reports without amplifying mislabeled feedback; it is the piece whose absence breaks prior memory-based baselines under realistic noise. LiSA beats ReasoningBank-style schemes because they treat every memory entry as equally trustworthy, and beats Synapse / AGrail because those use empirical accuracy rather than a posterior lower bound. An entry with one trial and one success scores 1.0 on empirical accuracy, but its Beta(2,1) posterior (one success on a uniform Beta(1,1) prior) has a much smaller lower bound, for example about 0.22 at the 5% quantile, so LiSA refuses to act on it.
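The gap between empirical accuracy and a posterior lower bound can be checked numerically. The sketch below is illustrative only: it assumes a uniform Beta(1,1) prior and a 5% quantile as the "lower bound", neither of which is confirmed by the paper summary, and the function names are hypothetical. It uses the exact binomial-sum form of the Beta CDF, which holds for integer parameters.

```python
from math import comb


def beta_cdf(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for positive integers a, b (binomial-sum identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a, n + 1))


def posterior_lower_bound(successes: int, failures: int, delta: float = 0.05) -> float:
    """delta-quantile of the Beta(1 + successes, 1 + failures) posterior.

    Assumptions (not from the paper): uniform Beta(1, 1) prior, delta = 0.05.
    """
    a, b = 1 + successes, 1 + failures
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection works because the CDF is increasing in x
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Both entries have empirical accuracy 1.0, but very different evidence.
one_trial = posterior_lower_bound(successes=1, failures=0)
many_trials = posterior_lower_bound(successes=100, failures=0)
print(round(one_trial, 3), round(many_trials, 3))  # prints: 0.224 0.971
```

A rule validated once sits far below any reasonable activation threshold, while the same rule validated 100 times clears it, which is the behaviour the gating description calls for.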
Operates on top of a frozen base guardrail. No weight updates to the underlying policy — the adaptation lives in a retrieval-augmented memory layer. Crucial property for production deployments where re-certifying a fine-tuned safety model is a multi-week process.
Memory write path. User reports of guardrail mistakes (both false-positive and false-negative) are encoded into policy abstractions with conflict checks against adjacent policies. The Beta posterior over each policy’s success rate is maintained online.
Memory read path at inference. Retrieve the k nearest matching policies for the current context; compute the posterior lower bound for each; apply the policy with the highest gated confidence only when it exceeds a deployment-tuned threshold; otherwise defer to the base guardrail.
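The read path described above can be sketched as a retrieve-then-gate function. Everything here is a hypothetical illustration: the `Policy` fields, the cosine-similarity retrieval, and the default `k` and `threshold` values are assumptions, and each policy's `lower_bound` is taken as already maintained by the write path.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Sequence, Tuple


@dataclass
class Policy:
    rule: str                   # induced policy abstraction (text)
    verdict: str                # "allow" or "block"
    embedding: Sequence[float]  # hypothetical context embedding
    lower_bound: float          # posterior lower bound, kept current by the write path


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def guarded_verdict(
    context_emb: Sequence[float],
    memory: List[Policy],
    base_verdict: str,
    k: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, Optional[str]]:
    """Return (verdict, rule used); rule is None when the base guardrail decides."""
    # 1. Retrieve the k policies closest to the current context.
    nearest = sorted(memory, key=lambda p: cosine(context_emb, p.embedding), reverse=True)[:k]
    # 2. Pick the retrieved policy with the highest gated confidence.
    best = max(nearest, key=lambda p: p.lower_bound, default=None)
    # 3. Override the base guardrail only above the deployment-tuned threshold.
    if best is not None and best.lower_bound > threshold:
        return best.verdict, best.rule
    return base_verdict, None


memory = [
    Policy("no internal performance records to non-employees", "block", [1.0, 0.0], 0.97),
    Policy("sharing public org charts is fine", "allow", [0.9, 0.1], 0.22),
]
decision = guarded_verdict([1.0, 0.0], memory, base_verdict="allow")
print(decision)
```

In this toy memory the well-evidenced "block" rule overrides the base verdict, while a memory holding only the weakly evidenced rule would defer to the base guardrail.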

Evaluation & results

Three benchmarks chosen for their alignment with deployed-agent harms: PrivacyLens+ (privacy-norm violations in agent actions), ConFaide+ (contextual-integrity reasoning over agent tool use), and AgentHarm (broad harmful-action coverage for agentic systems).
Across all three, LiSA beats ReasoningBank, Synapse, and AGrail in the sparse-feedback regime where total report counts are in the dozens, not thousands.
Robustness under label noise: 20% label-flip rate on user reports degrades baselines sharply (memory churn dominates) but LiSA stays close to the noise-free curve — the Beta-posterior gating absorbs the noise rather than encoding it.
Latency-performance frontier: pairing a smaller, faster base guardrail with LiSA outperforms larger un-adapted backbones at the same end-to-end inference latency. The implication is that contextual adaptation is a cheaper marginal investment than backbone scaling for this safety regime.

Ablations

Removing broad policy abstraction (keeping only raw memory entries) collapses generalization — sparse reports stop covering any new context.
Removing conflict-aware local rules causes overgeneralization spikes in mixed-label regions, hurting helpfulness without improving safety.
Removing evidence-aware gating is what breaks robustness under noise: empirical-accuracy-only baselines are statistically indistinguishable from prior memory methods at 20% label-flip rates.

Other

Positions LiSA as the practical middle ground between static safety alignment and unconstrained continual learning — the operational claim is that adaptation should be a memory-layer concern, not a weights concern, for any production system where re-certification cost is non-trivial.
The abstract does not mention a code release, and the GitHub-linked artifact for downstream comparison is not yet public on the Hugging Face paper page.

Microsoft Research

Orchard: An Open-Source Agentic Modeling Framework

Tier 2 · Research Paper · arXiv:2605.15040 · 2026-05-14 · Agentic training · RL · SWE-bench · Computer use · Open-source infra

Overview

Authors: Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao. Affiliations: Microsoft Research, Columbia University, UIUC. Jianfeng Gao is the senior author; submitted May 14, 2026.
Reframes the open-source agent gap not as “we need bigger base models” but as “we need shared training infrastructure”: most public agentic frameworks ship orchestration and evaluation, but the high-performing closed systems are propped up by proprietary trajectory pipelines, sandbox infrastructure, and RL rollouts.
Ships three components: (1) Orchard Env, a thin Kubernetes-native environment service that provides reusable primitives for sandbox lifecycle management across distillation, on-policy RL rollouts, and evaluation; (2) three agentic modeling recipes (SWE, GUI, personal assistant); (3) all training data + checkpoints + code.
Headline results: Orchard-SWE = 67.5% on SWE-bench Verified, a new open-source SOTA for the 30B-parameter class (about 3B active, per the A3B backbone). Orchard-GUI = 74.1% / 67.0% / 64.0% on WebVoyager / Online-Mind2Web / DeepShop, beating every prior open-source computer-use model. Orchard-Claw = 59.6% pass@3 on Claw-Eval (rising to 73.9% with a stronger ZeroClaw harness).
GitHub: microsoft/Orchard.

Methodology

Mechanism — Orchard Env (the infrastructure spine):
- Problem. Public agentic frameworks (AutoGen, LangGraph, etc.) are orchestration libraries — they assume the model is already trained. The actual training-time problem is sandbox lifecycle management at rollout-batch scale: spinning up a Linux container per SWE-bench task, executing tool calls, snapshotting state for RL rollouts, tearing it down, all at thousands of concurrent rollouts. Every closed-source lab solved this in private. The open-source absence is what bottlenecks open-agent training.
- Mechanism. A Kubernetes-native environment service with reusable primitives across (a) task domains (SWE, GUI, personal assistant), (b) agent harnesses (the model’s wrapping I/O loop), and (c) pipeline stages (offline trajectory distillation, on-policy RL rollouts, evaluation). The same primitives handle a coding task and a browser session, so a single training pipeline can mix task types.
- Why. Decoupling environment from agent and pipeline stage is what makes the three downstream recipes possible at all — you can run trajectory distillation, then on-policy RL, then eval, against the same Env primitives without re-implementing per stage. Beats prior open-source approaches that hard-coded one environment per recipe and so couldn’t scale to multi-domain mixed training.
Mechanism — Orchard-SWE training recipe:
- Problem. SWE-bench Verified rollouts are dominated by long, partially-successful trajectories: the agent fixes three of five test failures but doesn’t close the issue. Naive SFT on terminal success ignores 90% of the useful signal in those trajectories; naive RL with sparse end-of-episode reward starves on credit assignment.
- Mechanism. Distil 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B as the teacher pool. Apply credit-assignment SFT: identify productive sub-segments inside unresolved trajectories (sub-segments where the agent made net positive progress against test outcomes), and SFT against those segments rather than only end-to-end successes — salvages signal from the long tail of partial trajectories. Then on-policy RL with Balanced Adaptive Rollout: dynamically resamples the rollout distribution toward task types where the policy is on the boundary of solving, instead of wasting compute on tasks that are either trivially solved or trivially failed for the current checkpoint.
- Why. Credit-assignment SFT lifts the signal-to-noise ratio of the distilled data; Balanced Adaptive Rollout addresses the sparse-reward credit-assignment problem at the data-collection level rather than the loss level. Starting from Qwen3-30B-A3B-Thinking, SFT alone reaches 64.3%, and SFT + Balanced-Adaptive RL adds another 3.2 points to reach 67.5%; the RL stage is what beats the prior open-source SOTA. Beats vanilla GRPO / DPO recipes that did not solve the rollout-distribution problem and so plateaued at the SFT ceiling.
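The two ideas in this recipe can be sketched in a few lines. This is not the Orchard implementation: the segment criterion (a strictly rising passing-test count) and the `p * (1 - p)` sampling weight with a small floor are assumptions chosen to match the prose description, and all names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def productive_segments(tests_passing: List[int]) -> List[Tuple[int, int]]:
    """Step ranges (start, end) over which the passing-test count strictly rises.

    tests_passing[i] is the number of passing tests after step i (index 0 = initial state).
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(1, len(tests_passing)):
        if tests_passing[i] > tests_passing[i - 1]:
            if start is None:
                start = i - 1
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(tests_passing) - 1))
    return segments


def rollout_weights(pass_rate: Dict[str, float], floor: float = 0.02) -> Dict[str, float]:
    """Sampling weights that peak for tasks the current policy solves about half the time."""
    raw = {task: max(p * (1.0 - p), floor) for task, p in pass_rate.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}


# An unresolved trajectory that still contains two stretches of real progress.
segments = productive_segments([0, 0, 1, 3, 3, 2, 3])
print(segments)  # prints: [(1, 3), (5, 6)]

weights = rollout_weights({"easy": 1.0, "boundary": 0.5, "hard": 0.0})
rng = random.Random(0)
batch = rng.choices(list(weights), weights=list(weights.values()), k=8)
print(weights, batch)
```

The floor keeps a trickle of rollouts on currently solved or currently failing tasks, so a task whose status changes as the checkpoint improves can re-enter the boundary set.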
Mechanism — Orchard-GUI training recipe:
- Problem. Vision-language computer-use agents have the worst data efficiency in agentic training: each labeled trajectory costs an order of magnitude more than a SWE trajectory (humans clicking through real websites), and the open-source pool is tiny.
- Mechanism. Use only 0.4K distilled trajectories + 2.2K open-ended training tasks — an extreme low-data regime — on a 4B vision-language base. Orchard Env’s harness-agnostic browser primitives let the same data train multiple agent harnesses, multiplying the effective sample count.
- Why. Demonstrates that the data efficiency of GUI agents in the open-source ecosystem is gated by infrastructure (sandbox lifecycle, deterministic replay, reward signal) rather than by the size of the trajectory dataset. Beats prior open-source GUI agents at 10× larger trajectory budgets.
Mechanism — Orchard-Claw personal-assistant recipe: 0.2K synthetic tasks for productivity workflows (email, calendar, daily tool use); 59.6% pass@3 on Claw-Eval baseline; lifts to 73.9% when paired with the stronger ZeroClaw harness, demonstrating that harness-agnostic environment primitives let one trained model gain on the same task by changing only the harness around it.

Evaluation & results

SWE-bench Verified results (open-source comparable size):

Model	Base	SWE-bench Verified
Orchard-SWE (SFT only)	Qwen3-30B-A3B-Thinking	64.3%
Orchard-SWE (SFT + RL)	Qwen3-30B-A3B-Thinking	67.5%
Prior open-source SOTA (comparable size)	—	below 67.5%
Claude Opus 4.7 (proprietary reference)	—	78.4%

Computer-use results:

Benchmark	Orchard-GUI (4B)
WebVoyager	74.1%
Online-Mind2Web	67.0%
DeepShop	64.0%
Average	68.4%

Orchard-Claw personal assistant: 59.6% pass@3 on Claw-Eval; 73.9% with ZeroClaw harness, demonstrating the value of harness-agnostic training data.

Ablations

Credit-assignment SFT vs. terminal-success-only SFT: terminal-only SFT plateaus several points below 64.3%, showing that productive-segment harvesting is what makes the 107K distilled trajectories pay off.
Balanced Adaptive Rollout vs. uniform rollouts in RL: uniform rollouts waste compute on already-solved tasks and stall on perpetually failing tasks; Balanced Adaptive Rollout is the lever behind the step from SFT-only to SFT+RL.
0.4K trajectories on Orchard-GUI vs. larger pools: the marginal gain from additional GUI trajectories is small once the harness-agnostic Env primitives are in place, so data is not the binding constraint in this regime.

Other

Positions the contribution as “the missing open-source agentic training spine” rather than “a new model”. The code release at microsoft/Orchard is the artifact most likely to compound — downstream labs can run the same recipes on their own base models.
The use of MiniMax-M2.5 and Qwen3.5-397B as distillation teachers is itself a finding: the recipe is teacher-portable, and the SOTA result is achieved on a sub-100B-active student.

NVIDIA Research

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Tier 2 · Technical Report · arXiv:2605.15178 · 2026-05-14 · World model · Video diffusion · Linear attention · Camera control · Open source

Overview

Authors: Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie. Song Han and Enze Xie are the senior NVIDIA contributors; the work continues NVIDIA’s SANA efficiency series. Submitted May 14, 2026.
Open-source 2.6B-parameter world model that synthesizes high-fidelity 720p videos of 60-second duration with precise camera control from a single image + 6-DoF camera trajectory. Comparable visual quality to LingBot-World and HY-WorldPlay (closed-source industrial baselines) at a fraction of the compute.
Trained in 15 days on 64 H100s on only ~213K public video clips annotated with metric-scale 6-DoF camera poses. The distilled inference variant runs a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.
Four named architectural contributions: Hybrid Linear Attention, Dual-Branch Camera Control, Two-Stage Generation Pipeline, and a Robust Annotation Pipeline. Project page: nvlabs.github.io/Sana/WM/; code: NVlabs/Sana.

Architecture

Mechanism — Hybrid Linear Attention (HLA):
- Problem. Minute-scale 720p video at the target frame rate is >1500 frames per generated clip; pure softmax attention scales O(n²) in the temporal dimension and is the source of both training compute and inference memory blow-up. Pure linear attention reclaims linear time but loses the sharp retrieval needed to keep object identity and scene structure consistent across a minute of generation.
- Mechanism. Per layer, two parallel temporal-attention paths. Frame-wise Gated DeltaNet (GDN) is a linear-recurrent path: the GDN update rule incorporates a decay gate that down-weights stale past frames and a delta-rule correction that updates only the residual between the target value and the current state prediction. Crucially, the recurrent state stays at a constant D×D size regardless of how many frames have been generated — the memory cost is independent of clip length. Softmax attention runs in parallel over a much smaller window for sharp local frame-to-frame retrieval. The two outputs are combined per layer.
- Why. GDN preserves the long-range temporal coherence needed for minute-scale generation (object persistence, lighting continuity, camera-induced parallax) at constant per-frame cost; the local softmax window preserves the sharp retrieval needed for high-fidelity texture continuity between adjacent frames. Beats interleaved Mamba+attention hybrids (Jamba-style) because the gating is at the head-path level inside each layer rather than at the layer level, so every layer benefits from both behaviours.
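The constant-size recurrent state is easiest to see in code. The sketch below follows the usual gated delta-rule form, S_t = α_t S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, written as a decay followed by a residual write. It is an assumption that SANA-WM's frame-wise variant matches this form exactly; the function names, the tiny dimension `D = 3`, and the fixed `alpha` / `beta` values are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def gdn_step(state: Matrix, k: List[float], v: List[float], alpha: float, beta: float) -> Matrix:
    """One gated delta-rule update of a D x D state.

    alpha in [0, 1] is the decay gate; beta in [0, 1] is the write strength.
    """
    d = len(k)
    # Decay gate: down-weight whatever earlier frames wrote into the state.
    decayed = [[alpha * state[i][j] for j in range(d)] for i in range(d)]
    # What the decayed memory currently returns for key k.
    pred = [sum(decayed[i][j] * k[j] for j in range(d)) for i in range(d)]
    # Delta rule: write only the residual between the target value and that prediction.
    return [[decayed[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d)] for i in range(d)]


def read(state: Matrix, q: List[float]) -> List[float]:
    """Query the state: returns state @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in state]


D = 3
state = [[0.0] * D for _ in range(D)]
for frame in range(1500):  # roughly a minute-scale clip, per the frame count quoted above
    key = [1.0 if j == frame % D else 0.0 for j in range(D)]
    value = [float(frame), 0.0, 1.0]
    state = gdn_step(state, key, value, alpha=0.9, beta=0.5)

print(len(state), len(state[0]))  # prints: 3 3
```

After 1500 updates the state is still D×D, whereas a softmax key-value cache would have grown with every frame; that difference is why the local softmax window can stay small.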
Mechanism — Dual-Branch Camera Control:
- Problem. Open-source world models tend to obey camera prompts loosely — the generated 6-DoF trajectory drifts off the requested path mid-clip. Closed-source baselines avoid this with proprietary supervised camera-injection layers.
- Mechanism. Two camera-conditioning branches injected at different model depths: one operates on raw 6-DoF trajectory tokens (translation + rotation) and conditions early features for coarse pose alignment; the second is a refinement branch that re-injects the camera signal near the output, locking high-frequency motion features to the trajectory.
- Why. The split prevents the camera signal from being washed out by deep-layer feature mixing — the late-stage re-injection keeps the model on the requested 6-DoF path. Beats single-branch camera injection where the trajectory signal degrades after the first few transformer blocks.
Mechanism — Two-Stage Generation Pipeline:
- Problem. A single diffusion pass at minute-scale either trades off quality for length or runs out of memory.
- Mechanism. Stage 1 generates a coherent but slightly lower-fidelity 60-second clip. A separate long-video refiner is then applied to stage-1 outputs, improving fidelity and cross-segment consistency without retraining the base model.
- Why. The refiner is the leverage point for quality at minute-scale: stage-1 carries structural correctness (camera-following, object persistence), stage-2 carries texture and continuity, and they can be trained separately.
Mechanism — Robust Annotation Pipeline:
- Problem. The bottleneck for camera-controlled world models is not video data — it’s 6-DoF camera-pose labels at metric scale, which most public video corpora don’t carry.
- Mechanism. A vision-driven pose-extraction pipeline produces metric-scale 6-DoF pose annotations for public video clips, yielding ~213K spatiotemporally consistent action-label pairs (clip + camera trajectory). Open-released alongside the model.
- Why. Lets the model be trained from scratch on public data without requiring access to a proprietary 3D-engine simulator or a captive video corpus. Beats prior open-source efforts that either skipped camera labels (and so couldn’t learn precise control) or relied on synthetic Unity/Unreal-engine renders (which don’t transfer to natural-image distributions).

Training

~213K public video clips with metric-scale 6-DoF pose supervision; native one-minute training horizon.
15 days on 64 H100s. Inference: a full 60-second 720p clip can be generated on a single H100; the distilled variant produces the same clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.
Two-stage training mirrors the two-stage inference: base diffusion transformer trained for minute-scale generation; long-video refiner trained on stage-1 outputs.

Evaluation & results

On NVIDIA’s in-house one-minute world-model benchmark, SANA-WM beats every prior open-source baseline on action-following accuracy (6-DoF camera trajectory adherence) while remaining visually comparable to the LingBot-World / HY-WorldPlay industrial baselines.
Throughput: 22.0 videos/hour on 8 H100s with the full inference pipeline, compared to 0.6 videos/hour for LingBot-World — a 36× advantage at comparable visual quality.
The 34-second-per-clip RTX 5090 figure reframes minute-scale world modelling as a consumer-GPU-deployable capability, not an industrial-cluster capability.
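The throughput figures above reduce to a few lines of arithmetic. The sketch assumes the LingBot-World number was measured on the same 8×H100 node, which the summary implies but does not state; variable names are illustrative.

```python
GPUS = 8
CLIP_SECONDS = 60.0

sana_videos_per_hour = 22.0
lingbot_videos_per_hour = 0.6  # assumed to be measured on the same 8-GPU node

# Relative throughput; the review rounds this down to 36x.
speedup = sana_videos_per_hour / lingbot_videos_per_hour

# GPU time spent per one-minute clip on the 8 x H100 node.
sana_gpu_minutes_per_clip = GPUS * 60.0 / sana_videos_per_hour
lingbot_gpu_hours_per_clip = GPUS / lingbot_videos_per_hour

# Distilled variant on one RTX 5090: 34 s of compute for 60 s of video.
realtime_factor = CLIP_SECONDS / 34.0

print(f"speedup: {speedup:.1f}x")
print(f"SANA-WM: {sana_gpu_minutes_per_clip:.1f} GPU-minutes per clip")
print(f"LingBot-World: {lingbot_gpu_hours_per_clip:.1f} GPU-hours per clip")
print(f"RTX 5090 distilled: {realtime_factor:.2f}x real time")
```

Under these assumptions one clip costs about 22 GPU-minutes for SANA-WM against roughly 13 GPU-hours for LingBot-World, and the distilled RTX 5090 path runs faster than real time.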

Ablations

Full softmax attention vs. Hybrid Linear Attention: full softmax exhausts H100 memory before 60 seconds; pure linear attention loses object persistence across the clip; HLA carries both.
Single-branch vs. dual-branch camera control: single-branch trajectories drift in the final third of the clip; dual-branch keeps the requested path through the whole minute.
Single-stage vs. two-stage generation: single-stage at the same FLOPs trades off either fidelity or length; the refiner is the cheaper marginal investment for both.

Other

Continues NVIDIA’s SANA efficiency line (the original SANA was a 2K text-to-image model with the same compression-friendly design philosophy). SANA-WM moves the same approach into the temporal axis.
The annotation pipeline release is the part most likely to compound — it’s the asset the rest of the open-source video community has been missing.