high
2605.15188v1
Agent World Replay Eval
Build a bounded FutureSim-style replay harness for Hermes/OpenClaw agent runs using chronological local evidence instead of open-ended live browsing.
Hermes hook
Hermes cron/session outputs become replay episodes with Brier-style prediction and adaptation scoring.
OpenClaw hook
OpenClaw can consume the same episode JSON as regression cases for autonomous research and coding agents.
source pulse
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we...
FutureSim: Replaying World Events to Evaluate Adaptive Agents
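A minimal sketch of the Brier-style scoring named in the Hermes hook above, assuming each replay episode yields a chronological list of binary-outcome forecasts. The `Forecast` record, its field names, and the half-split `adaptation_gain` measure are illustrative assumptions, not part of FutureSim or of any existing Hermes/OpenClaw format.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    # Hypothetical record: one probability forecast made at a replay step.
    step: int           # chronological position within the episode
    probability: float  # predicted probability that the event occurs
    outcome: int        # 1 if the event occurred, 0 otherwise


def brier_score(forecasts):
    """Mean squared error between forecast probabilities and binary outcomes."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((f.probability - f.outcome) ** 2 for f in forecasts) / len(forecasts)


def adaptation_gain(forecasts):
    """Brier score of the earlier half minus that of the later half.

    Assumed adaptation measure: a positive value means forecasts improved
    as more evidence arrived during the replay.
    """
    ordered = sorted(forecasts, key=lambda f: f.step)
    mid = len(ordered) // 2
    if mid == 0:
        raise ValueError("need at least two forecasts")
    return brier_score(ordered[:mid]) - brier_score(ordered[mid:])
```

A lower Brier score is better, so an agent that moves from 0.5 to 0.9 on an event that does occur would show a positive adaptation gain under this sketch.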
high
2605.15199v1
Entity Memory Consistency Board
Adapt EntityBench's entity-consistency idea to durable Hermes/OpenClaw memory: named entities must stay consistent across crons, wiki pages, and reports.
Hermes hook
Nightly checks compare memory/wiki/cron claims for stable entity IDs, aliases, and source provenance.
OpenClaw hook
OpenClaw gets a machine-readable entity consistency board for local vs external name collisions and stale claims.
source pulse
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences....
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
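The nightly check described in the hooks above could start from something like the following sketch, which flags aliases that resolve to more than one entity ID across sources. The `(source, entity_id, alias)` claim tuple and the lowercase/strip normalization are assumptions made for illustration; the real memory, wiki, and cron formats are not specified here.

```python
from collections import defaultdict


def find_alias_collisions(claims):
    """Return aliases that map to more than one entity ID.

    claims: iterable of (source, entity_id, alias) tuples (hypothetical shape).
    Result: {normalized_alias: sorted list of conflicting entity IDs}.
    """
    ids_by_alias = defaultdict(set)
    for _source, entity_id, alias in claims:
        # Assumed normalization: case-insensitive, surrounding whitespace ignored.
        ids_by_alias[alias.strip().lower()].add(entity_id)
    return {
        alias: sorted(ids)
        for alias, ids in ids_by_alias.items()
        if len(ids) > 1
    }
```

The returned mapping is plain JSON-serializable data, which is one way to make the board machine-readable.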
medium
2605.15187v1
Programmatic Asset Generation Harness
Borrow Articraft's code+tests loop for local artifact generation: HTML, diagrams, dashboards, and future simulation assets should ship with validators.
Hermes hook
Hermes visual artifact builds get structured validation manifests beside HTML/PDF/PNG outputs.
OpenClaw hook
OpenClaw coding tasks can target the same harness format for artifact generation and tests.
source pulse
A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate...
Articraft: An Agentic System for Scalable Articulated 3D Asset Generation
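One possible shape for the structured validation manifest mentioned above is sketched below. The manifest keys and the `checks` mapping are assumptions for illustration, not an existing Hermes or Articraft format; the validators themselves are supplied by the caller.

```python
import hashlib
import json


def build_manifest(artifact_name, content, checks):
    """Run validators over artifact bytes and return a JSON-serializable manifest.

    checks: mapping of check name -> callable(bytes) -> bool (hypothetical interface).
    """
    results = {name: bool(check(content)) for name, check in checks.items()}
    return {
        "artifact": artifact_name,
        "sha256": hashlib.sha256(content).hexdigest(),  # ties the manifest to exact bytes
        "size_bytes": len(content),
        "checks": results,
        "passed": all(results.values()),
    }


if __name__ == "__main__":
    html = b"<html><head><title>Report</title></head><body></body></html>"
    manifest = build_manifest(
        "report.html",
        html,
        {
            "non_empty": lambda b: len(b) > 0,
            "has_title": lambda b: b"<title>" in b,
        },
    )
    print(json.dumps(manifest, indent=2))
```

Storing the content hash beside the check results lets a later run detect that an artifact changed after it was validated.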
medium
2605.15198v1
Visual Reasoning Tool/Token Lab
Use ATLAS as a design prompt for measuring when agents should call tools versus compress reasoning into internal state.
Hermes hook
Hermes can log tool-call vs no-tool decisions on visual/browser tasks and grade outcome quality.
OpenClaw hook
OpenClaw can reuse the labelled cases to tune approval, browser, and tool-routing policies.
source pulse
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during...
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
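The tool-call versus no-tool logging described above can be summarized with a sketch like this one, assuming each logged task records whether a tool was called and a graded quality score. The record keys `used_tool` and `quality` are hypothetical names, and the 0 to 1 quality scale is an assumption.

```python
from collections import defaultdict


def summarize_decisions(records):
    """Average graded quality for tool-call and no-tool decisions.

    records: iterable of dicts with 'used_tool' (bool) and 'quality' (float),
    a hypothetical log shape. Returns {'tool': mean, 'no_tool': mean},
    omitting a bucket that has no records.
    """
    buckets = defaultdict(list)
    for record in records:
        key = "tool" if record["used_tool"] else "no_tool"
        buckets[key].append(record["quality"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```

Comparing the two means per task type is a crude first signal for routing policy; it does not control for task difficulty.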
watch
2605.15183v1
Mechanistic Similarity Regression Check
Track model/component equivalence methods as future regression-test ideas for local model/tool changes.
Hermes hook
Hermes model/provider drift checks can record behavioural and configuration equivalence first; tensor methods remain on the research watchlist.
OpenClaw hook
OpenClaw can attach this to model-change review notes, but not to live deployment gates yet.
source pulse
Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical...
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
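A sketch of the behavioural and configuration equivalence checks mentioned in the Hermes hook, under the assumption that each model or provider can be wrapped as a callable from prompt to output and that configurations are flat dictionaries. This is exact-match comparison only; it is not the tensor similarity method from the paper.

```python
def agreement_rate(model_a, model_b, prompts):
    """Fraction of prompts on which two callables return identical outputs."""
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = sum(1 for prompt in prompts if model_a(prompt) == model_b(prompt))
    return matches / len(prompts)


def config_diff(old, new):
    """Keys whose values differ between two flat config dicts.

    Returns {key: (old_value, new_value)}; a key missing on one side shows as None.
    """
    keys = set(old) | set(new)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }
```

Both results are small enough to paste into a model-change review note.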
watch
2605.15185v1
Visual World-Model Geometry Audit
Keep a watchlist for quantitative geometry/consistency checks before trusting generated video/3D outputs as evidence.
Hermes hook
Generated visual artifacts can carry an explicit illustrative-vs-evidentiary confidence flag.
OpenClaw hook
OpenClaw media/research outputs can attach geometry-consistency caveats to generated scenes.
source pulse
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation...
Quantitative Video World Model Evaluation for Geometric-Consistency
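The illustrative-vs-evidentiary flag in the Hermes hook could be modelled roughly as below. The `VisualArtifact` fields and the classification rule (generated output without a geometry check is illustrative only) are assumptions for illustration, not an established policy.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceLevel(Enum):
    ILLUSTRATIVE = "illustrative"
    EVIDENTIARY = "evidentiary"


@dataclass
class VisualArtifact:
    # Hypothetical metadata carried beside a generated or captured visual.
    path: str
    generated: bool
    geometry_checked: bool = False
    caveats: list = field(default_factory=list)


def classify(artifact):
    """Assumed rule: generated visuals without a geometry check are illustrative."""
    if artifact.generated and not artifact.geometry_checked:
        caveat = "geometry consistency not verified"
        if caveat not in artifact.caveats:
            artifact.caveats.append(caveat)
        return EvidenceLevel.ILLUSTRATIVE
    return EvidenceLevel.EVIDENTIARY
```

The caveat list gives OpenClaw outputs a place to attach the geometry-consistency notes described above.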
watch
2605.15186v1
Visual World-Model Geometry Audit
Keep a watchlist for quantitative geometry/consistency checks before trusting generated video/3D outputs as evidence.
Hermes hook
Generated visual artifacts can carry an explicit illustrative-vs-evidentiary confidence flag.
OpenClaw hook
OpenClaw media/research outputs can attach geometry-consistency caveats to generated scenes.
source pulse
High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their...
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
watch
2605.15196v1
Visual World-Model Geometry Audit
Keep a watchlist for quantitative geometry/consistency checks before trusting generated video/3D outputs as evidence.
Hermes hook
Generated visual artifacts can carry an explicit illustrative-vs-evidentiary confidence flag.
OpenClaw hook
OpenClaw media/research outputs can attach geometry-consistency caveats to generated scenes.
source pulse
Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders...
RefDecoder: Enhancing Visual Generation with Conditional Video Decoding