The paper argues that an AI that truly understands and simulates the real world must be consistent in three ways at once: across different senses (cross-modal), across 3D space (spatial), and across time (temporal).
SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.
The paper shows that Test-Time Training (TTT) with key–value (KV) binding is not really memorizing like a notebook; it is acting like a learned linear attention layer.
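To see why that resemblance holds, here is a minimal derivation under simplifying assumptions of our own (a linear fast-weight model, squared reconstruction loss, one gradient step per token, and near-orthonormal keys); the notation is ours, not the paper's.

```latex
% TTT as key-value regression: one gradient step per token on a linear fast-weight model W.
\ell_t(W) = \tfrac{1}{2}\,\lVert W k_t - v_t \rVert^2,
\qquad
W_t = W_{t-1} - \eta \nabla_W \ell_t(W_{t-1})
    = W_{t-1} - \eta\,(W_{t-1} k_t - v_t)\,k_t^{\top}.

% With W_0 = 0 and the W_{t-1} k_t k_t^{\top} correction term dropped
% (e.g., keys approximately orthonormal), the updates simply accumulate:
W_t \approx \eta \sum_{s \le t} v_s k_s^{\top},
\qquad
W_t\, q_t \approx \eta \sum_{s \le t} v_s\,(k_s^{\top} q_t),

% which is exactly unnormalized linear attention over the past key-value pairs.
```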
Searching through videos, images, and long documents is powerful, but it gets very expensive when every tiny piece (each token, patch, or frame) is stored as its own vector.
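To make the cost concrete, here is a back-of-envelope sketch with illustrative numbers of our own choosing, not figures from the paper, comparing an index that stores one embedding per piece against one that stores a single embedding per document.

```python
# Back-of-envelope illustration: per-piece vs. per-document embedding storage.
def index_size_bytes(num_docs: int, pieces_per_doc: int, dim: int, bytes_per_float: int = 2) -> int:
    """Size of a float16 embedding index in bytes."""
    return num_docs * pieces_per_doc * dim * bytes_per_float

num_docs = 1_000_000
dim = 128

per_piece = index_size_bytes(num_docs, pieces_per_doc=1_000, dim=dim)  # one vector per token/patch/frame
per_doc = index_size_bytes(num_docs, pieces_per_doc=1, dim=dim)        # one vector per document

print(f"one vector per piece: {per_piece / 1e9:,.0f} GB")  # ~256 GB
print(f"one vector per doc:   {per_doc / 1e9:,.1f} GB")    # ~0.3 GB, roughly 1000x smaller
```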
This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
The paper introduces CHAIN, a hands-on 3D playground that tests if AI can not only see objects but also plan and act under real physics.
Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.
PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.
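As a flavor of what "Python tools to inspect images" can look like, here is a minimal sketch of a crop-and-zoom tool an agent might call between reasoning steps; the function, file names, and flow are our own illustration, not PyVision-RL's actual toolset or API.

```python
# Hypothetical inspection tool: crop a region of interest and enlarge it
# so fine details become visible to the model on the next reasoning step.
from PIL import Image

def crop_and_zoom(image_path: str, box: tuple[int, int, int, int], scale: int = 4) -> Image.Image:
    """Crop a region (left, upper, right, lower in pixels) and scale it up."""
    img = Image.open(image_path)
    region = img.crop(box)
    return region.resize((region.width * scale, region.height * scale))

# Hypothetical agent step: the model proposes a region to look at more closely,
# the tool returns the zoomed crop, and the enlarged view feeds the next round of reasoning.
zoomed = crop_and_zoom("frame_0012.png", box=(120, 80, 280, 200))
zoomed.save("frame_0012_zoom.png")
```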
Vision-Language-Action (VLA) robots are powerful but too big and slow for many real-world devices.
Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.
This paper builds a gigantic library of video puzzles (VBVR) so AI can practice not just making pretty videos, but actually thinking through what happens over time.
ManCAR helps recommendation systems think step by step but keeps their thoughts on realistic paths using a map of how items connect.
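As one way to picture "keeping thoughts on realistic paths", here is a minimal sketch in which each proposed reasoning step is filtered against an item graph; the data structures and names are hypothetical illustrations of the general idea, not ManCAR's implementation.

```python
# Illustrative item graph: the "map of how items connect"
# (e.g., co-purchase or shared-attribute edges).
from collections import defaultdict

item_graph = defaultdict(set)
item_graph["running_shoes"] |= {"running_socks", "fitness_tracker"}
item_graph["running_socks"] |= {"running_shoes", "insoles"}
item_graph["fitness_tracker"] |= {"running_shoes", "heart_rate_strap"}

def constrained_next_items(current_item: str, proposed: list[str]) -> list[str]:
    """Keep only reasoning steps that follow real edges in the item graph,
    so the step-by-step chain stays on paths that actually exist between items."""
    neighbors = item_graph[current_item]
    return [item for item in proposed if item in neighbors]

# The model might propose a jump like "running_shoes -> yoga_mat";
# the graph filter rejects it and keeps only connected items.
print(constrained_next_items("running_shoes", ["yoga_mat", "running_socks"]))
# -> ['running_socks']
```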