Multimodal Foundations & Context Optimization
TL;DR
Focus
No flagship model launch surfaced inside the 36-hour window (Google’s I/O 2026 keynote and the rumoured Gemini 3.2 Flash drop are pencilled for May 19, just outside this batch). The page covers three Tier 2 frontier-lab research papers that landed on Hugging Face Daily Papers between May 13–14, plus one Tier 3 OpenAI engineering note from May 14. Qwen-Image-VAE-2.0 from Alibaba pushes high-compression image VAEs to f32 with text-legible reconstructions and an attention-free, asymmetric encoder-decoder. MMProLong from ByteDance Seed is a 5B-token long-context continued-pre-training recipe that takes Qwen2.5-VL-7B from 32K to 128K context (and generalizes to 512K) using long-document VQA as the dominant data type. Context Training with Active Information Seeking from Google DeepMind equips a context optimizer with Wikipedia and browser tools and pairs them with beam-search training so external grounding actually helps instead of poisoning the context. OpenAI’s Codex app note ships a phone-as-remote-control architecture for Codex sessions running on a Mac, laptop, or remote SSH host.
Competitiveness
Qwen-Image-VAE-2.0’s f16c128 variant attains SSIM 0.9706 / PSNR 30.45 dB on OmniDoc-TokenBench, beating the best f8 baseline (FLUX.1-dev at 0.9364 / 26.24 dB) at 2× higher per-side spatial compression. It is the first f16 autoencoder to surpass f8 VAEs on a text-fidelity metric (NED 0.9617 vs. FLUX.1-dev’s 0.9546). The same VAE family already powers Qwen-Image-2.0, which sits #1 on AI Arena T2I (covered in the May 13 review). MMProLong’s 5B-token budget compares favourably with the multi-trillion-token continued-pre-training mixes used to long-context-extend GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, and Qwen3-VL: a +7.1% long-document VQA delta for a tiny fraction of the compute. DeepMind’s BeamSearch-IS lifts Gemini-2.5-Flash above Gemini-2.5-Pro on FLORES+ low-resource translation (34.51 vs. 30.37) and to within 0.0004 of Gemini-2.5-Pro on HealthBench, all without any weight updates and on a 128-sample optimization budget; HLE accuracy goes from a 6.53 baseline to 8.63 with active information seeking. OpenAI’s Codex mobile relay is at feature parity with Anthropic’s February 2026 Remote Control for Claude Code; Windows host support is still missing, with macOS only at launch.
New frontier releases
No new flagship models in the window. Most recent launches remain Anthropic’s Claude Opus 4.7 and OpenAI’s GPT-5.5 (covered in the April 10 wave review). Google’s I/O keynote on May 19–20 is the next expected launch event.
Alibaba (Qwen)
Qwen-Image-VAE-2.0 Technical Report
Overview
- Qwen Team, 30 contributors (equal contributors: Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu; corresponding: Chenfei Wu). The VAE variant already integrated into Qwen-Image-2.0 (covered in the May 13 review) is an intermediate derivative of the framework formalized here.
- Ships a four-model suite at high compression: f16c64, f16c128, f32c128, f32c192. The c dimension is the latent channel count; the larger channel budget is the lever that buys back the spatial information lost to f16/f32 downsampling.
- Frames the contribution as resolving the “tripartite trade-off” between compression ratio, reconstruction fidelity, and diffusability (how easily the latent space can be modelled by a downstream diffusion transformer). Prior high-compression VAEs improve one or two of the three at the cost of the third.
- Introduces OmniDoc-TokenBench, a ~3K-image text-rich reconstruction benchmark scored with OCR-based Normalized Edit Distance (NED) rather than pixel metrics. Released as a HuggingFace dataset at alibabagroup/OmniDoc-TokenBench.
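The fNcM naming used for the four variants can be turned into a rough overall compression figure with a little arithmetic. The sketch below assumes RGB input (3 channels) and reads fN as N× downsampling per side; the helper names `latent_shape` and `compression_ratio` are introduced here for illustration and do not come from the report.

```python
def latent_shape(height, width, f, c):
    """Latent tensor shape (channels, h, w) for a VAE that downsamples f times per side."""
    if height % f or width % f:
        raise ValueError("image sides must be divisible by f")
    return (c, height // f, width // f)


def compression_ratio(f, c, in_channels=3):
    """Input values per latent value: (in_channels * f * f) / c."""
    return in_channels * f * f / c


# f8c16 is the FLUX.1-dev setting; the other four are the Qwen-Image-VAE-2.0 suite.
variants = {"f8c16": (8, 16), "f16c64": (16, 64), "f16c128": (16, 128),
            "f32c128": (32, 128), "f32c192": (32, 192)}

for name, (f, c) in variants.items():
    shape = latent_shape(2048, 2048, f, c)
    print(f"{name}: latent {shape}, {compression_ratio(f, c):.0f}x fewer values")
```

Under these assumptions f16c64 stores as many values as an f8c16 latent (12× fewer than the pixels) while using a quarter of the spatial positions, which is the quantity that drives sequence length in a downstream DiT.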
Architecture
- Mechanism — Global Skip Connection (GSC) for high-compression detail recovery:
- Problem. Encoders for f16/f32 VAEs throw away high-frequency detail in the early downsampling stages, and standard Local Skip Connections (LSC) inside each ResBlock only patch local detail — the original pixel signal never reaches the deep latent. Reconstructions blur, character strokes merge.
- Mechanism. GSC adds a direct residual path that bypasses the entire downsampling stack. Pixel input is folded into the channel dimension via a space-to-channel operation, reshaped, channel-averaged, then summed into the deep feature map alongside the deepest ResBlock’s output. The fold is non-parametric — no learned downsampler in the shortcut — so high-frequency signal is preserved verbatim. Ablation runs three configurations on f16c64 from scratch: No Skip Connection (NSC), Local Skip Connection (LSC), Global Skip Connection (GSC); GSC strictly dominates the other two on both reconstruction loss and PSNR throughout training.
- Why. The deep latent gains a residual channel of unmodified pixel-level signal; convergence on character strokes and fine textures accelerates noticeably. Beats LSC because the latter never delivers untouched pixel context past the first downsample.
- Attention-free backbone. Self-attention scales O(N²) in pixel count and O(N²) in activation memory; the authors observed no performance degradation when attention modules were removed. The entire encoder/decoder is pure convolutional ResBlocks plus the GSC shortcut, which is the source of the high throughput at 2K resolutions.
- Encoder/decoder asymmetry. Lightweight encoder (76–78M params, hidden dim 96) feeds a heavyweight decoder (248–250M params, hidden dim 144) for all four variants. The encoder is the part the downstream DiT calls during training; cheap encoding cuts DiT training latency.
- Configuration grid (f, channels, layer count, residual type, encoder/decoder params):
| Model | f | C | nlayer | Residual | Enc / Dec params |
| --- | --- | --- | --- | --- | --- |
| Qwen-Image-VAE-2.0-f16c64 | 16 | 64 | 5 | GSC | 76M / 248M |
| Qwen-Image-VAE-2.0-f16c128 | 16 | 128 | 5 | GSC | 76M / 248M |
| Qwen-Image-VAE-2.0-f32c128 | 32 | 128 | 6 | GSC | 77M / 250M |
| Qwen-Image-VAE-2.0-f32c192 | 32 | 192 | 6 | GSC | 78M / 250M |
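The Global Skip Connection described above (space-to-channel fold, channel averaging, sum into the deep feature map) can be sketched in a few lines of NumPy. This is a minimal illustration, not the report's implementation: the function names, the 96-channel deep feature map, and the random stand-in for the deepest ResBlock output are all assumptions made here.

```python
import numpy as np


def space_to_channel(x, f):
    """Fold each f x f pixel patch into channels: (C, H, W) -> (C*f*f, H/f, W/f)."""
    c, h, w = x.shape
    x = x.reshape(c, h // f, f, w // f, f)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, f, f, H/f, W/f)
    return x.reshape(c * f * f, h // f, w // f)


def global_skip(x, f, out_channels):
    """Non-parametric shortcut: fold pixels, then average channel groups down to out_channels."""
    folded = space_to_channel(x, f)
    n, h, w = folded.shape
    if n % out_channels:
        raise ValueError("folded channels must be divisible by out_channels")
    return folded.reshape(out_channels, n // out_channels, h, w).mean(axis=1)


rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64))
deep = rng.standard_normal((96, 4, 4))  # hypothetical stand-in for the deepest ResBlock output
fused = deep + global_skip(image, f=16, out_channels=96)
print(fused.shape)
```

Because the shortcut has no learned weights, every pixel contributes to the deep feature map with a fixed coefficient, which is one way to read the claim that the high-frequency signal reaches the latent unmodified.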
Training
- Loss: $L = L_{\text{recon}} + \lambda_{\text{lpips}} L_{\text{lpips}} + \lambda_{\text{align}} L_{\text{align}}$. Pixel L1 + LPIPS perceptual + a semantic alignment loss. KL regularization and GAN loss are both removed.
- Mechanism — Semantic Alignment Loss for diffusability:
- Problem. Expanded latent dimension fixes the information bottleneck of high compression but leaves the latent distribution unstructured; the downstream DiT then converges slowly because nothing forces the latent geometry to be diffusion-friendly.
- Mechanism. Extract intermediate-layer features from a frozen DINOv2-L on the input image (DINOv2 chosen over DINOv3 / MAE / PE-Spatial after ablation; middle layer chosen over the final layer because the spatial map is smoother and easier to align with). Project the VAE latent through a learnable linear layer into the DINOv2 feature dimension. Two losses on the projected latent: (1) Marginal Cosine Similarity per spatial position, with margin $m_{\text{cos}}$, aligning the direction of the latent to the DINOv2 feature; (2) Marginal Distance Matrix Similarity, with margin $m_{\text{dist}}$, preserving the relative spatial layout between positions. Both losses use ReLU with the margin, so they only fire when alignment is worse than the threshold.
- Why. Aligning to a self-supervised vision feature gives the latent a generation-friendly semantic structure without forcing a Gaussian prior (which is what KL would do); KL is removed precisely because forcing Gaussianity competes with semantic alignment. Beats VAVAE-style alignment that uses the final encoder layer — final-layer features are more task-specific and noisier as a spatial supervision signal.
- Staged alignment schedule. Strict margins (small $m_{\text{cos}}$, $m_{\text{dist}}$) at the start of training force the latent into the DINOv2 manifold quickly; the margins are gradually loosened so the network can then optimize for pixel-level reconstruction without sacrificing the geometry. Net effect: the latent is generation-friendly from day one and high-fidelity by the end of training.
- Data engineering. Billions of images filtered for clarity and blur. OCR filter prioritizes character-dense samples. Curated document corpus across academic papers, slides, posters, complex web pages. A synthetic rendering pipeline overlays English (alphabetic) and Chinese (logographic) text on real-image backgrounds at character sizes from 5 to 20 pixels — the multi-granularity supervision is what keeps text legible at f32.
- Curriculum. Resolution ramps from low to 2K. Text data is added progressively (general → real-world text-rich → synthetic). Alignment margins relax across stages.
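The two margin-gated alignment terms and the loosening schedule can be illustrated as follows. The digest does not give the exact functional form, so this sketch assumes a cosine gap of `1 - cos` for the per-position term and cosine-similarity matrices for the layout term. The linear `margin_at` schedule and all function names are hypothetical, and the learnable projection is omitted (the latent `z` is taken as already projected).

```python
import numpy as np


def _unit(v):
    """L2-normalize feature vectors along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)


def marginal_cosine_loss(z, feat, m_cos):
    """Per-position cosine gap, ignored once it falls within the margin."""
    cos = np.sum(_unit(z) * _unit(feat), axis=-1)  # shape (N,)
    return np.maximum(0.0, (1.0 - cos) - m_cos).mean()


def marginal_distance_matrix_loss(z, feat, m_dist):
    """Mismatch between pairwise-similarity matrices, ignored within the margin."""
    sim_z = _unit(z) @ _unit(z).T  # shape (N, N)
    sim_f = _unit(feat) @ _unit(feat).T
    return np.maximum(0.0, np.abs(sim_z - sim_f) - m_dist).mean()


def margin_at(step, total_steps, start, end):
    """Linear loosening of a margin from start to end (hypothetical schedule)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 32))            # 16 positions of frozen-encoder features
z = feat + 0.1 * rng.standard_normal((16, 32))  # projected latent, nearly aligned
for step in (0, 500, 1000):
    m = margin_at(step, 1000, start=0.0, end=0.5)
    print(step, marginal_cosine_loss(z, feat, m), marginal_distance_matrix_loss(z, feat, m))
```

With this form a small margin is the strict setting: the ReLU keeps firing until the latent is almost perfectly aligned, and widening the margin later frees capacity for the reconstruction terms.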
Evaluation & Results
- General reconstruction on ImageNet 256 / FFHQ 1K. Headline numbers: Qwen-Image-VAE-2.0-f16c128 hits PSNR 35.90 / SSIM 0.9519 on ImageNet, better than every f8 VAE evaluated (best prior: FLUX.1-dev at 32.84 / 0.9155). f32c192 performs comparably to f8 VAEs despite 4× the per-side spatial compression.
- OmniDoc-TokenBench, the new text-rich benchmark (256×256, ~3K images, OCR NED + standard metrics):

| Model | Setting | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | NED↑ |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-dev | f8c16 | 0.9364 | 26.24 | 0.0246 | 0.55 | 0.9546 |
| HunyuanVideo | f8c16 | 0.9227 | 25.26 | 0.0434 | 2.03 | 0.9266 |
| HunyuanImage-3.0 | f16c32 | 0.8672 | 22.66 | 0.0650 | 3.49 | 0.7753 |
| Cosmos-0.1-CI16x16 | f16c16 | 0.5460 | 15.55 | 0.1349 | 7.78 | 0.1547 |
| FLUX.2-dev | f16c128 | 0.9544 | 27.72 | 0.0216 | 0.73 | 0.9535 |
| Qwen-Image-VAE-2.0-f16c64 | f16c64 | 0.9279 | 26.00 | 0.0382 | 1.94 | 0.9244 |
| Qwen-Image-VAE-2.0-f16c128 | f16c128 | 0.9706 | 30.45 | 0.0167 | 0.79 | 0.9617 |
| LTX-Video | f32c128 | 0.8055 | 20.92 | 0.1190 | 17.10 | 0.5651 |
| HunyuanImage-2.1 | f32c64 | 0.7805 | 19.85 | 0.0957 | 5.19 | 0.4895 |
| Qwen-Image-VAE-2.0-f32c128 | f32c128 | 0.8442 | 22.13 | 0.0642 | 3.36 | 0.7065 |
| Qwen-Image-VAE-2.0-f32c192 | f32c192 | 0.8908 | 23.84 | 0.0497 | 1.98 | 0.8555 |

- NED is the metric to watch on text-rich data: pixel metrics under-weight character legibility. A single-character error like “orange” → “orango” barely moves PSNR (<0.5 dB) but drops NED by 16.7%, which is roughly the difference between “readable” and “not”. f16c128 is the first f16 autoencoder to beat all f8 baselines on this metric.
- Diffusability: downstream SiT training on ImageNet 256 at 80 epochs, no CFG, scored with Inception Score (higher is better) and gFID (lower is better). f16c128 reaches IS 92.42 / gFID 10.29 and f16c64 hits IS 102.76 / gFID 9.52, both decisively better than HunyuanVideo-1.5 (f16c32, gFID 19.08), Wan2.2 (f16c48, gFID 15.65), and Stepvideo-T2V (f16c64, gFID 33.53). f32 variants reach gFID 15.05 (f32c128), beating every other f32 VAE on the leaderboard.
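The “orange” → “orango” figure quoted for NED can be reproduced with a plain Levenshtein distance. The sketch assumes the score is one minus the edit distance normalized by the longer string, which matches the higher-is-better direction in the table; the benchmark's exact normalization and OCR pipeline may differ, and `ned_score` is a name chosen here.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ned_score(reference, hypothesis):
    """1 - normalized edit distance, so 1.0 means a perfect OCR match."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(reference, hypothesis) / longest


score = ned_score("orange", "orango")
print(f"NED score: {score:.4f}, drop: {1 - score:.1%}")  # NED score: 0.8333, drop: 16.7%
```

One substituted character out of six costs a sixth of the score, whereas the same error changes only a handful of pixels and so barely registers in PSNR.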
Ablations
- Skip-connection comparison (NSC vs. LSC vs. GSC on f16c64 from scratch): GSC strictly dominates on convergence speed and PSNR, justifying the architectural default across all four variants.
- Semantic encoder choice: DINOv2 (selected) vs. DINOv3 vs. MAE vs. PE-Spatial. DINOv2 wins on downstream DiT convergence.
- Alignment layer choice: middle layer of DINOv2 beats final layer; combining multiple layers degrades performance because cross-layer feature noise corrupts the alignment signal.
- Removing KL and GAN losses: the simplified L1 + LPIPS + alignment objective converges faster and more stably than KL-regularized GAN-trained variants at the same compute. Net gain: a more flexible latent space, better diffusability, fewer training instabilities.
Availability
- OmniDoc-TokenBench released under Apache-2.0 on HF Datasets and on GitHub at alibaba/OmniDoc-TokenBench.
- VAE weights not yet on HuggingFace as of the submission; the in-product variant already ships inside Qwen-Image-2.0.
ByteDance (Seed)
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Overview
- ByteDance Seed authors with HKUST collaborators (Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song). Submitted to HF Daily Papers May 14; 73 upvotes (currently the most upvoted paper of the day from a frontier lab).
- Empirical recipe paper, not a new architecture: how to do continued pre-training on a 7B VLM to extend its context window. Built on Qwen2.5-VL-7B as the base; the extended model is named MMProLong.
- Headline: 32K → 128K context with only a 5B-token training budget, +7.1% on long-document VQA, and out-of-distribution generalization to 256K and 512K contexts without any further training.
Methodology
- Mechanism — long-context continued pre-training driven by long-document VQA:
- Problem. Long-context continued pre-training for VLMs is conventionally done with OCR-transcription data (image → raw text) at the target context length. This is data-inefficient: most of the budget goes into teaching the model to transcribe rather than to retrieve and reason over evidence at long positions, and short-context capability decays from over-specialization to long sequences.
- Mechanism. Replace OCR transcription with long-document VQA, i.e. instruction-formatted (question, long-multimodal-document, answer) triples, as the dominant data type. The data mixture is constructed with three concrete choices: (1) the sequence-length distribution is balanced across short, medium, and long, not heaped at the 128K target length; (2) the task mixture is retrieval-heavy with only modest reasoning data added for diversity; (3) short-context data is not mixed in, because instruction-formatted long VQA already preserves short-context behaviour. 5B tokens total. The base model is Qwen2.5-VL-7B; the output is named MMProLong.
- Why. Long-document VQA forces the model to learn where in a long input the answer lives, a retrieval skill that generalizes beyond the training context length. Balanced length distribution prevents over-fitting to 128K and is the source of out-of-window generalization to 256K/512K. Retrieval-heavy mixture targets what the ablations identify as the bottleneck. Skipping short-data mixing avoids the standard regression on short-context benchmarks because instruction-formatted long data already covers short reasoning patterns. Beats prior OCR-transcription pre-training recipes (the LongVILA family) at 5B tokens, a ~10× compute reduction for matched long-VQA gains.
- Three actionable findings from the ablation grid (each becomes a data-mixture knob):
- (i) Balanced > target-length-focused. Concentrating training data at 128K does worse than a balanced spread across lengths, because long-context ability is fundamentally a retrieval skill that must transfer across positions.
- (ii) Retrieval > reasoning as the bottleneck. Retrieval-heavy mixtures with modest reasoning data outperform reasoning-heavy mixtures with retrieval as garnish.
- (iii) Pure long-VQA preserves short context. Instruction-formatted long data does not need to be diluted with short-context data; in fact diluting it hurts long-context gains and helps short-context only marginally.
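The three mixture choices can be pictured as knobs on a token-budget planner. The sketch below is illustrative only: the digest gives the 5B-token total, but the bucket lengths, the equal split across buckets, and the 80/20 retrieval-to-reasoning ratio are hypothetical values chosen here, as are the names `MixtureConfig` and `plan`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixtureConfig:
    total_tokens: int
    length_buckets: dict  # bucket name -> representative sequence length in tokens
    task_weights: dict    # task name -> share of each bucket's budget


def plan(cfg):
    """Split the token budget evenly across length buckets, then by task weight."""
    if abs(sum(cfg.task_weights.values()) - 1.0) > 1e-9:
        raise ValueError("task weights must sum to 1")
    per_bucket = cfg.total_tokens // len(cfg.length_buckets)
    rows = []
    for bucket, seq_len in cfg.length_buckets.items():
        for task, weight in cfg.task_weights.items():
            tokens = int(per_bucket * weight)
            rows.append((bucket, task, tokens, tokens // seq_len))
    return rows


cfg = MixtureConfig(
    total_tokens=5_000_000_000,
    length_buckets={"short": 16_384, "medium": 65_536, "long": 131_072},  # hypothetical
    task_weights={"retrieval_vqa": 0.8, "reasoning_vqa": 0.2},            # hypothetical
)
for bucket, task, tokens, n_seqs in plan(cfg):
    print(f"{bucket:>6} {task:<14} {tokens:>13,} tokens ~{n_seqs:,} sequences")
```

A target-length-focused baseline would correspond to collapsing `length_buckets` to the single 128K entry, which is the configuration finding (i) reports as worse.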
Evaluation & Results
- Long-document VQA: MMProLong improves by +7.1% over Qwen2.5-VL-7B at 128K context using the 5B-token continued pre-training.
- Out-of-distribution context generalization: maintains strong performance at 256K and 512K positions without additional training; that is, the model trained on a balanced mixture up to 128K extrapolates 4× past its training horizon.
- Cross-task generalization from a single training recipe to: webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding — all evaluated without task-specific supervised fine-tuning. The retrieval skill is the unifying axis.
- Short-context capability largely preserved despite skipping short-data mixing, confirming finding (iii) above.
Ablations
- Long-document VQA vs. OCR transcription: the head-to-head ablation that motivates the entire recipe. VQA wins decisively on long-document VQA evaluation; OCR-only training underperforms despite its larger raw token throughput.
- Sequence-length distribution sweep: target-length-focused (heap at 128K), balanced, and short-skewed. Balanced wins both at 128K and at out-of-window 256K/512K.
- Retrieval-heavy vs. reasoning-heavy mixtures: matched-budget comparison shows retrieval-heavy strictly above reasoning-heavy on long-context evaluations, with marginal effect on short-context.
- Short-data mixing on/off: mixing short instruction data into the long-context budget hurts long-context gains; absence does not hurt short-context evaluation because instruction-formatted long VQA already exercises short-range reasoning.
Other
- Practical implication for downstream practitioners: a 5B-token instruction-formatted long-VQA mixture is sufficient to 4× the deployed context window of a 7B VLM with measurable downstream lift — well within the budget of a single-node continued-pre-training run.
Google DeepMind
Context Training with Active Information Seeking
Overview
- Zeyu Huang (Edinburgh, intern), Adhiguna Kuncoro (Google DeepMind, corresponding), Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc’Aurelio Ranzato — DeepMind primary affiliation on the paper.
- Studies frozen-weight adaptation: instead of fine-tuning, an LLM-based optimizer agent constructs and edits a structured context (working memory) that is later read by an executor model. Existing context-training methods (ProTeGi, TextGrad, DSPy, OPRO) are closed-loop — they refactor only the model’s parametric knowledge.
- The contribution is two-part: (1) equip the optimizer with real external tools, (2) replace the sequential update rule with a beam-search-style update that maintains a candidate pool, because naively adding tools to sequential training degrades performance.
Methodology
- Context Management Tool. Context is a structured database of discrete resource items (not a monolithic prompt). Each resource carries a unique ID, a summary, raw content, and metadata (source, length, keywords, gemini-embedding-001 embedding). The optimizer agent manipulates the database through write ops (init, add, delete, update) and read ops (preview, retrieve-by-ID, search-by-keyword/embedding/sub-agent). Surgical updates rather than full re-prompting.
- Information Seeking Tools. Two: WikipediaSearchTool, built on the Python wikipedia package for declarative knowledge gaps, and BrowserUseTool, built on the browser-use library for parsing dynamic web pages, code snippets, recent reports, and documentation not indexed in Wikipedia.
- Mechanism (BeamSearch-IS as the unit that makes external information useful):
- Problem. Sequential training (one context held at each step, updated greedily on each new batch) has two failure modes once you let the optimizer call the web: (1) Context Pollution — low-quality web content gets written into the context and the optimizer cannot recover, because there is no backtracking mechanism; (2) Local Optima — greedy commitment to early sub-optimal trajectories. Empirically this is severe: naive Seq-IS on FLORES+ drops from Seq’s 31.13 average to 29.68.
- Mechanism. Maintain a pool of k candidate contexts at each step. At every step, sample several context updates in parallel from each candidate (each update may invoke WikipediaSearchTool or BrowserUseTool to ground its proposal in retrieved evidence). Score every resulting context on a held-out validation set. Prune back down to the top k. Critically, the current best context is always included in the candidate pool as a “Do Nothing” option, so if every new exploration is noisy or unhelpful, the optimizer simply keeps the previous state. Total budget (number of optimizer calls) is held constant against Seq baselines for fair comparison.
- Why. The candidate pool breaks the greediness assumption of sequential training: bad updates get discarded by the next-step validation pruning rather than persisting into the context. The “Do Nothing” option formalizes a backtracking primitive at the optimizer level; without it, any single bad web fetch can poison the entire run. Beats prior context-optimization methods (BoN, Seq, BeamSearch without tools) because the tools genuinely add new information that closed-loop methods cannot synthesize from parametric memory.
- Backbone: Gemini-2.5-Flash for all main experiments. Constrained low-resource regime: 128 training samples and 64 validation samples on most benchmarks (real-world deployment proxy).
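The candidate-pool update with a “Do Nothing” option can be sketched as a small generic loop. This is an illustration under stated assumptions rather than the paper's implementation. `propose` stands in for an optimizer call that may consult a search or browser tool, and `score` stands in for evaluation on the held-out validation set. The toy facts are invented for the demo, and the loop keeps the whole current pool eligible (which includes the current best) rather than only the single best context.

```python
import random


def beam_search_is(init_context, propose, score, beam_width=2,
                   samples_per_candidate=3, steps=4, seed=0):
    """Keep the top-k contexts by validation score; current contexts always stay eligible."""
    rng = random.Random(seed)
    pool = [init_context]
    for _ in range(steps):
        candidates = list(pool)  # "Do Nothing": existing contexts compete with new ones
        for ctx in pool:
            for _ in range(samples_per_candidate):
                candidates.append(propose(ctx, rng))
        unique = list(dict.fromkeys(candidates))  # de-duplicate, keep first-seen order
        unique.sort(key=score, reverse=True)      # stable sort: ties favour older contexts
        pool = unique[:beam_width]
    return pool[0]


# Toy stand-ins (hypothetical): a context is a tuple of facts, some retrieved facts are noise.
USEFUL = {"glossary", "grammar_notes", "parallel_examples"}
RETRIEVABLE = sorted(USEFUL) + ["spam_page", "outdated_forum_post"]


def propose(ctx, rng):
    """Pretend to retrieve one fact and append it if it is new."""
    fact = rng.choice(RETRIEVABLE)
    return ctx if fact in ctx else ctx + (fact,)


def score(ctx):
    """Pretend validation score: useful facts help, noisy ones hurt."""
    return sum(1 if fact in USEFUL else -1 for fact in ctx)


best = beam_search_is((), propose, score)
print(best, score(best))
```

Because the previous pool always competes in the pruning step, the best score can never decrease from one step to the next. Greedy sequential updating gives no such guarantee.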
Evaluation & Results
- FLORES+ low-resource machine translation (5 languages Google Translate does not directly support: Buginese, Magahi, Kikuyu, Chokwe, Southwestern Dinka). ChrF++ scores are tabulated below. On average, BeamSearch-IS is +2.57 over the best non-IS baseline (BoN) and +4.14 over Gemini-2.5-Pro, so the optimized context lets Flash beat Pro on this task.

| Method | bug | mag | kik | cjk | dik | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash (base) | 28.83 | 44.86 | 34.43 | 17.83 | 5.62 | 26.31 |
| Gemini-2.5-Pro (base) | 31.42 | 42.42 | 35.89 | 22.89 | 19.21 | 30.37 |
| BoN (n=8) | 32.84 | 47.83 | 37.17 | 23.72 | 18.12 | 31.94 |
| Seq (OPRO-style) | 32.53 | 46.01 | 36.34 | 24.49 | 18.25 | 31.13 |
| BeamSearch | 32.68 | 46.55 | 36.46 | 22.47 | 18.41 | 31.31 |
| Seq-IS (naive tools) | 31.15 | 45.58 | 37.53 | 18.16 | 15.96 | 29.68 |
| BeamSearch-IS (ours) | 33.74 | 50.52 | 39.73 | 26.25 | 22.46 | 34.51 |

- HealthBench (rubric-based physician-written multi-turn evaluation). BeamSearch-IS reaches overall score 0.5026, within 0.0004 of Gemini-2.5-Pro’s 0.5030; Seq-IS drops to 0.4484, below the tool-free Seq at 0.4629, which is the Context Pollution effect again. On Emergency Referrals (a sub-theme), BeamSearch-IS beats Pro; Pro retains the lead on Response Depth.
- Complex reasoning: LiveCodeBench (pass@1 / pass@8) and HLE (avg@8 accuracy). Non-tool methods barely move HLE (~6.5%); BeamSearch-IS lifts HLE average to 8.63 (+2.10 over base) and LCB Hard pass@1 to 33.9 (+3.9 over base).

| Method | LCB Medium | LCB Hard | LCB Overall | HLE Bio | HLE CS | HLE Phys | HLE Math | HLE Hum | HLE Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Flash (base) | 71.5 / 83.8 | 30.0 / 49.6 | 49.4 / 65.6 | 7.10 | 6.46 | 5.00 | 8.08 | 6.01 | 6.53 |
| BoN | 72.5 / 83.6 | 28.7 / 50.3 | 49.2 / 65.9 | 6.25 | 4.21 | 4.06 | 10.13 | 6.01 | 6.13 |
| Seq | 68.9 / 85.6 | 31.5 / 51.6 | 49.0 / 67.5 | 7.24 | 2.81 | 6.88 | 8.08 | 6.01 | 6.20 |
| BeamSearch | 69.9 / 84.2 | 31.2 / 52.0 | 49.3 / 67.1 | 7.39 | 3.90 | 5.94 | 8.97 | 6.33 | 6.51 |
| Seq-IS | 71.6 / 86.2 | 29.6 / 52.1 | 49.3 / 68.1 | 8.81 | 6.74 | 3.22 | 6.54 | 6.96 | 5.38 |
| BeamSearch-IS | 73.5 / 89.0 | 33.9 / 57.2 | 52.5 / 70.2 | 8.81 | 8.30 | 7.67 | 11.15 | 7.23 | 8.63 |
Ablations
- Beam-search vs. sequential, with and without tools: tools without beam-search hurt on most domains (Context Pollution); tools with beam-search win on all four (translation, healthcare, code, multidisciplinary reasoning).
- Data efficiency: the 128-train / 64-validation budget is sufficient; gains saturate quickly, suggesting most of the lift comes from a small number of well-grounded context items rather than from large supervised exposure.
- Hyperparameter robustness: gains hold across beam width, exploration temperature, and pool size in the reported sweep.
- Cross-model context transfer: contexts optimized on Gemini-2.5-Flash generalize to other backbones with little degradation — implies the optimized context encodes task-relevant facts and instructions, not Gemini-Flash-specific prompting tricks.
Other
- The framework is described as orthogonal to the surrounding agent harness — the change is localized to the context-optimization stage, so it can be slotted into existing context-engineering pipelines (DSPy, TextGrad, ProTeGi) without redesigning the executor side.
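For readers wiring this into an existing pipeline, the Context Management Tool from the Methodology section can be pictured as a small keyed store that is edited item by item and flattened into text only when the executor reads it. The sketch below is a hypothetical reconstruction from the description above (IDs, summaries, raw content, metadata, write and read ops). The class and method names are invented here, and embedding-based and sub-agent search are left out.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Resource:
    id: int
    summary: str
    content: str
    metadata: dict = field(default_factory=dict)  # e.g. source, length, keywords


class ContextStore:
    """Context as a database of discrete items that can be edited one at a time."""

    def __init__(self):
        self._items = {}
        self._ids = count(1)

    # Write ops
    def add(self, summary, content, **metadata):
        rid = next(self._ids)
        metadata.setdefault("length", len(content))
        self._items[rid] = Resource(rid, summary, content, metadata)
        return rid

    def update(self, rid, **changes):
        for name, value in changes.items():
            setattr(self._items[rid], name, value)

    def delete(self, rid):
        del self._items[rid]

    # Read ops
    def preview(self):
        return [(r.id, r.summary) for r in self._items.values()]

    def retrieve(self, rid):
        return self._items[rid]

    def search_keyword(self, keyword):
        kw = keyword.lower()
        return [r.id for r in self._items.values()
                if kw in (k.lower() for k in r.metadata.get("keywords", []))]

    def render(self):
        """Flatten the store into the text an executor model would read."""
        return "\n\n".join(f"[{r.id}] {r.summary}\n{r.content}"
                           for r in self._items.values())


store = ContextStore()
a = store.add("Style guide", "placeholder text A", source="manual", keywords=["style"])
b = store.add("Scraped page", "placeholder text B", source="browser", keywords=["noise"])
store.delete(b)  # a surgical edit: drop one item without rewriting the rest
store.update(a, summary="Style guide (v2)")
print(store.preview())
```

Keeping items addressable by ID is what lets an optimizer remove a single polluted entry instead of regenerating the whole prompt.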
OpenAI
Introducing the Codex app
- Codex is now embedded in the ChatGPT mobile app on iOS and Android, in preview, on every plan including Free and Go in supported regions.
- The architecture is phone-as-remote-control, not phone-as-execution-host. Codex runs on the connected Mac (macOS only for now; Windows host support is flagged as coming later). The ChatGPT mobile app connects to that environment through a secure relay layer and streams live activity back to the phone screen.
- Data residency: files, credentials, permissions, plugins, skills, and the entire local setup stay on the executing machine. The phone never receives or stores sensitive material; only updates flow back — screenshots, terminal output, diffs, test results, approval prompts.
- From the phone you can: work across all your threads, review outputs, approve commands, change models, start new sessions.
- Remote SSH graduates from preview to GA at the same time: Codex can connect directly to managed enterprise environments and remote developer infrastructure (devboxes, jump hosts).
- HIPAA-compliant Codex simultaneously launched inside ChatGPT Enterprise — hospitals and healthcare engineering teams can run Codex over PHI in eligible workspaces.
- Pricing posture: Pro ($200/month) gets unlimited Codex access; preview is otherwise gated to the rate limits of the user’s plan.
- Mechanism note on the relay: Codex’s persistent thread/session state lives with the host; the phone client subscribes to event streams (diff, terminal, approval) and posts back commands (approve/reject, model switch). The pattern is structurally identical to Anthropic’s February 2026 Remote Control for Claude Code — both teams converged on the host-owns-state / phone-owns-presence split rather than syncing state across devices.
- Adoption baseline: Codex is reported at >4M weekly active users at launch — the mobile relay is positioned as a usage-expansion move (work-in-progress visibility during commutes / meetings) rather than a new capability surface.
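The host-owns-state / phone-owns-presence split described in the relay note can be modelled in a few lines. Everything here is hypothetical: the class names, event kinds, and message shapes are invented for illustration and are not OpenAI's API, and a direct in-process call stands in for the secure relay, whose transport and authentication are not described in the note.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Owns all session state; only events leave the machine."""
    events: list = field(default_factory=list)   # append-only (seq, kind, payload)
    pending: dict = field(default_factory=dict)  # approval id -> command awaiting a decision
    executed: list = field(default_factory=list)

    def emit(self, kind, payload):
        self.events.append((len(self.events), kind, payload))

    def request_approval(self, approval_id, command):
        self.pending[approval_id] = command
        self.emit("approval", {"id": approval_id, "command": command})

    def handle(self, message):
        """Apply a decision posted by the phone."""
        command = self.pending.pop(message["id"])
        if message["action"] == "approve":
            self.executed.append(command)
            self.emit("terminal", f"ran: {command}")
        else:
            self.emit("terminal", f"rejected: {command}")


@dataclass
class Phone:
    """Holds only a cursor into the event stream, no files or credentials."""
    cursor: int = 0

    def poll(self, host):
        new = host.events[self.cursor:]
        self.cursor = len(host.events)
        return new


host, phone = Host(), Phone()
host.emit("diff", "+ added retry to fetch()")
host.request_approval("a1", "pytest -q")
for seq, kind, payload in phone.poll(host):
    if kind == "approval":
        host.handle({"id": payload["id"], "action": "approve"})
print(phone.poll(host))
```

Since the phone keeps nothing but a cursor, a lost or swapped device can resume by replaying the host's event log, which is the practical upside of not syncing state across devices.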