In her recent manifesto “From Words to Worlds: Spatial Intelligence is AI’s Next Frontier,” Dr Fei-Fei Li, Sequoia Professor of Computer Science at Stanford University, Co-Director of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), and currently Chief Executive Officer of World Labs, argues for a new trajectory for modern AI. She contends that the next leap in artificial intelligence lies not in linguistic eloquence but in spatial intelligence: machines that can perceive, reason about, and act within three-dimensional worlds. The argument is grounded in her pioneering work on computer vision and on datasets such as ImageNet, which shaped the last AI revolution. Now Li proposes a new one: moving beyond word-trained models to world-trained systems.
Yet, as the field divides between text-driven large language models and spatially grounded world models, one question endures: which path will thrive—and who will lead it? What is already clear is that leadership, not expenditure, will decide the outcome.
Words: The Interface That Already Works
Language models have become the cognitive interface of our digital lives. They draft contracts, debug software, analyse data, and increasingly blend vision and sound. Their advantage is accessibility: they sit atop vast linguistic knowledge, connecting billions of users to structured reasoning. But despite their brilliance, they remain disembodied. Even the most advanced models struggle with physical reasoning, causality, and the intuitive understanding of space and motion. Words may shape thought, but they do not ground perception.
Worlds: The Engine We Still Need
Li’s argument is that human-like understanding requires embodiment. True intelligence must sense, imagine, and interact. Her concept of world models calls for systems that are:
Generative — able to create geometrically and physically consistent 3D worlds.
Multimodal — integrating text, vision, depth, and action.
Interactive — predicting the next state or behaviour based on actions and goals.
Her start-up, World Labs—which has raised approximately $230 million to date at a post-money valuation of over $1 billion—is building this vision through early systems such as Marble, a 3D creative tool for storytellers and designers, and RTFM, a real-time generative model with spatial memory. These prototypes demonstrate the emergence of grounded intelligence—AI that not only narrates the world but also inhabits it.
The LeCun Moment — and Google’s Character.ai Gambit
While Li is building worlds, Yann LeCun, Meta’s Chief AI Scientist and a Turing Award laureate, is reportedly preparing to leave the social-media giant to found his own world-model venture. LeCun has long argued that large language models are a “dead end” for achieving human-level intelligence, emphasising perception and self-supervised learning instead. His rumoured departure underscores a philosophical realignment: from words that predict to worlds that understand.
This leadership dynamic is not unique to Meta. Google’s recent arrangement involving Character.ai and the return of Noam Shazeer—a co-author of the Transformer paper—illustrates the same principle from the other direction. Rather than a straightforward takeover, it was a talent-and-technology repatriation: a large, strategic deal to re-align a pivotal architect with Google’s core model roadmap while licensing critical conversational IP. The message is unmistakable: in an era when models are expensive and data centres are vast, the scarcest asset is still people who know what to build and how to build it.
None of this is an obituary for language-first approaches. It simply affirms that the future of AI will not be dictated by who trains the largest model, but by who defines what intelligence means—and can mobilise institutions to pursue it.
Leadership Over Capital
Money can buy compute, data, and marketing—but not insight. The next phase of AI will be determined by the leaders who can balance science, ethics, and product execution. Enduring institutions, whether start-ups or incumbents, share three traits:
Scientific judgement – the discernment to pursue problems that yield general capability rather than publicity.
Product realism – building tools that professionals can depend on, not just admire in a demo.
Cultural trust – retaining brilliant people through candour, curiosity, and purpose.
Budgets move numbers, but leaders move frontiers.
What Will Decide the Winner
Although the divide between words and worlds dominates discourse, convergence is inevitable. The deciding factors will be structural rather than ideological.
a.) Credible Evaluation
Language models benefited from quantifiable metrics such as perplexity, which measures how well a model predicts the next token. World models need equivalent benchmarks: tests for physical consistency, geometric continuity, and cause–effect coherence over long timescales. The first research ecosystem to establish transparent, peer-reviewed evaluations will command credibility.
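To make the idea of a physical-consistency test concrete, here is a minimal sketch in Python. It is an illustration only, not a benchmark proposed by Li or anyone else: the function names `energy_drift` and `simulate_drop` are hypothetical, and the scenario (a single object in free fall, with no air resistance) is an assumption chosen for simplicity. The check asks whether a predicted rollout conserves mechanical energy.

```python
from typing import List, Sequence, Tuple

G = 9.81  # gravitational acceleration in m/s^2


def energy_drift(heights: Sequence[float], velocities: Sequence[float]) -> float:
    """Largest relative deviation of mechanical energy (per unit mass) from its initial value."""
    energies = [G * h + 0.5 * v * v for h, v in zip(heights, velocities)]
    e0 = energies[0]
    return max(abs(e - e0) for e in energies) / abs(e0)


def simulate_drop(h0: float, dt: float, steps: int) -> Tuple[List[float], List[float]]:
    """Exact free-fall states, standing in here for a model's predicted rollout."""
    times = [i * dt for i in range(steps)]
    heights = [h0 - 0.5 * G * t * t for t in times]
    velocities = [-G * t for t in times]
    return heights, velocities


# A physically correct rollout: energy stays constant up to rounding error.
heights, velocities = simulate_drop(h0=10.0, dt=0.1, steps=10)
consistent = energy_drift(heights, velocities)

# A flawed rollout: the object hovers at 10 m while still gaining speed.
broken = energy_drift([10.0] * 10, velocities)
```

A real benchmark would need many such invariants, across many scenes and much longer horizons, but the principle is the same: score the rollout against a law the world is known to obey, not against how plausible it looks.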
b.) Deployment Economics
It is one thing to demonstrate spatial reasoning on high-end GPUs; it is quite another to deploy it affordably at the edge—in headsets, robots, or classrooms. Latency, efficiency, and cost per inference will define viability. The victors will design architectures that balance cloud power with local performance.
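The arithmetic behind cost per inference is simple enough to sketch. The Python below is a back-of-envelope illustration under stated assumptions: the function name `cost_per_inference` is hypothetical, every figure is invented for the example, and the hardware is assumed to be billed by the hour and fully utilised.

```python
def cost_per_inference(hourly_cost_usd: float, latency_s: float, concurrency: int = 1) -> float:
    """Cost of one inference on hardware billed by the hour, assuming full utilisation."""
    if latency_s <= 0 or concurrency < 1:
        raise ValueError("latency must be positive and concurrency at least 1")
    inferences_per_hour = 3600.0 / latency_s * concurrency
    return hourly_cost_usd / inferences_per_hour


# Hypothetical cloud GPU: expensive per hour, but fast and able to batch requests.
cloud = cost_per_inference(hourly_cost_usd=4.0, latency_s=0.5, concurrency=8)

# Hypothetical edge device: slow and unbatched, but cheap once amortised.
edge = cost_per_inference(hourly_cost_usd=0.02, latency_s=2.0)
```

With these invented figures the edge device is cheaper per inference but four times slower per request, which is exactly the kind of trade-off between cloud power and local performance that an architecture has to resolve.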
c.) Workflow Integration
For creators, world models must integrate seamlessly into existing design and simulation tools. For robotics, they must train across simulated and real environments with predictable safety margins. For knowledge workers, they must blend reasoning with real-world context. In every domain, integration will outlast inspiration.
d.) Talent Alignment
Money may lure talent, but mission keeps it. Scientists, engineers, and artists align with causes that combine intellectual freedom and moral purpose. Institutions that maintain academic openness, protect exploratory research, and link it to human-centred outcomes will attract and retain the best minds.
Near-Term Proving Grounds
a.) Creator Tools
AI that allows directors, architects, and designers to generate entire 3D environments from sketches will redefine creative practice. The true test will be long-term usability: whether professionals can rely on these systems in daily production, not merely for experimental projects.
b.) Simulation and Robotics
Spatially intelligent models will narrow the sim-to-real gap, allowing robots to practise complex manipulations in virtual environments before executing them safely in the physical world.
c.) Education and Training
Immersive learning, in which students can walk through atoms, ecosystems, or historical reconstructions, will transform comprehension and retention. The key will be balancing immersion with pedagogy, analytics, and accessibility.
d.) Hybrid Knowledge Work
Language-first systems enhanced with world-awareness—and world models capable of linguistic reasoning—will merge into hybrid assistants. Language will remain the interface; spatial intelligence will become the engine.
The Fusion Thesis
The next decade will not crown a single victor. Language will remain humanity’s most efficient way to communicate thought; spatial intelligence will remain the foundation of understanding and action. The true frontier lies in their fusion—AI that can both describe and do. This convergence, not competition, will define maturity in artificial intelligence.
The Bottom Line
It remains uncertain whether words or worlds will dominate the future of artificial intelligence. That uncertainty is not a weakness but a sign of vitality. What is certain, however, is that the field’s fate will not hinge on burn rate or compute scale. It will depend on leaders—scientists and builders with courage, humility, and imagination.
Fei-Fei Li’s vision for spatial intelligence, Yann LeCun’s push toward embodied cognition, and Google’s Character.ai–Noam Shazeer realignment all point to the same truth: intelligence must be both expressive and grounded, and it is leadership that orchestrates the synthesis. Billions may build the infrastructure, but only inspired leaders will give it direction.
The future of AI will be won not by whoever spends the most, but by those who can see both the words and the world, and who build bridges strong enough for all of humanity to cross.