Sora and V-JEPA Have Not Learned The Complete Real World Model -- A Philosophical Analysis of Video AIs Through the Theory of Productive Imagination

Zhang, Jianqiu

Abstract:Sora from Open AI has shown exceptional performance, yet it faces scrutiny over whether its technological prowess equates to an authentic comprehension of reality. Critics contend that it lacks a foundational grasp of the world, a deficiency V-JEPA from Meta aims to amend with its joint embedding approach. This debate is vital for steering the future direction of Artificial General Intelligence(AGI). We enrich this debate by developing a theory of productive imagination that generates a coherent world model based on Kantian philosophy. We identify three indispensable components of the coherent world model capable of genuine world understanding: representations of isolated objects, an a priori law of change across space and time, and Kantian categories. Our analysis reveals that Sora is limited because of its oversight of the a priori law of change and Kantian categories, flaws that are not rectifiable through scaling up the training. V-JEPA learns the context-dependent aspect of the a priori law of change. Yet it fails to fully comprehend Kantian categories and incorporate experience, leading us to conclude that neither system currently achieves a comprehensive world understanding. Nevertheless, each system has developed components essential to advancing an integrated AI productive imagination-understanding engine. Finally, we propose an innovative training framework for an AI productive imagination-understanding engine, centered around a joint embedding system designed to transform disordered perceptual input into a structured, coherent world model. Our philosophical analysis pinpoints critical challenges within contemporary video AI technologies and a pathway toward achieving an AI system capable of genuine world understanding, such that it can be applied for reasoning and planning in the future.

Comments:	30 pages, 9 figures
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:	00-68
ACM classes:	K.m
Cite as:	arXiv:2407.10311 [cs.AI]
	(or arXiv:2407.10311v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2407.10311

Computer Science > Artificial Intelligence

Title:Sora and V-JEPA Have Not Learned The Complete Real World Model -- A Philosophical Analysis of Video AIs Through the Theory of Productive Imagination

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators