Father of Convolutional Neural Networks, Yann LeCun: I am no longer interested in LLMs; these four major challenges will define the next step for AI.
This article is based on a public conversation between Yann LeCun, chief AI scientist at Meta and Turing Award winner, and Bill Dally, chief scientist at NVIDIA. LeCun believes the large language model (LLM) craze is nearing its end, and that the next AI breakthroughs will come from understanding the physical world, reasoning and planning, and open-source models. (Synopsis: OpenAI releases o3 and o4-mini, its strongest reasoning models: they can think with images, select tools automatically, and deliver breakthroughs in math and coding performance.) (Background: OpenAI is quietly building its own community platform, taking aim at Musk's X.)

As the AI wave sweeps the world, most attention is still fixed on large language models (LLMs). Against that backdrop, Yann LeCun, known as the father of convolutional neural networks and now chief AI scientist at Meta, recently made a surprising statement: his interest in LLMs has waned. In an in-depth conversation with NVIDIA Chief Scientist Bill Dally last month, LeCun laid out his views on the future direction of AI, arguing that understanding the physical world, persistent memory, reasoning and planning capabilities, and the open-source ecosystem are the keys to the next wave of the AI revolution.

Say goodbye to the LLM myth: Why does AI need to understand the world better?

LeCun admits that despite the exciting developments in AI over the past year, he sees LLMs as having largely become a matter of industry product teams improving at the margin: larger datasets, more compute, and synthetic data generated to train models. He does not regard these as the most forward-looking research directions. Instead, he focuses on four more fundamental challenges:

1. Understanding the physical world: getting machines to grasp the real laws of the environment we live in.
2. Persistent memory: enabling AI to accumulate and apply experience the way humans do.
3. The ability to reason: LeCun believes the way current LLMs reason is too simplistic and that a more fundamental approach is needed.
4. The ability to plan: enabling AI to predict the consequences of actions and make plans accordingly.

LeCun emphasizes that human babies learn basic models of the physical world within months of birth, such as the difference between a water bottle tipping over and sliding. This intuitive understanding of how the world works is fundamental to our interaction with reality, and it is far harder to acquire than language. He believes that for AI to truly understand and act in the real world, the required architecture will be completely different from today's mainstream LLMs.

He further explains that the core of an LLM is predicting the next "symbol" (token). Symbols can be anything: in an autonomous-driving model, for example, symbols coming in from sensors eventually produce the symbols that drive the car, which is, to some extent, reasoning about the physical world (such as judging where it is safe to drive). But this discrete, symbol-based approach has its limits. LeCun points out that a typical LLM vocabulary contains about 100,000 symbols, and the model outputs a probability distribution over all of them. That approach is hard to apply to high-dimensional, continuous real-world data such as video.
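To make concrete what "predicting the next symbol" from a distribution over roughly 100,000 tokens looks like, here is a minimal, illustrative PyTorch sketch. The layer sizes and the GRU backbone are stand-in assumptions, not any particular production LLM.

```python
import torch
import torch.nn as nn

# Minimal sketch of the next-token objective LeCun describes: the model maps a
# context of discrete symbols to a probability distribution over a vocabulary
# of roughly 100,000 tokens, and training maximizes the probability of the
# observed next token. Sizes and layer choices are illustrative only.

VOCAB_SIZE = 100_000   # "a typical LLM vocabulary contains about 100,000 symbols"
EMBED_DIM = 512

class NextTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        # Stand-in for a transformer stack; any sequence model works for the sketch.
        self.backbone = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens):                 # tokens: (batch, seq_len) int64
        h, _ = self.backbone(self.embed(tokens))
        return self.head(h[:, -1, :])          # logits over the whole vocabulary

model = NextTokenModel()
context = torch.randint(0, VOCAB_SIZE, (2, 16))     # fake token ids
next_token = torch.randint(0, VOCAB_SIZE, (2,))     # observed next symbol
logits = model(context)
loss = nn.functional.cross_entropy(logits, next_token)
probs = logits.softmax(dim=-1)                      # distribution over all ~100k symbols
```

Sampling repeatedly from that discrete distribution is what text generation amounts to; LeCun's point is that this recipe does not carry over to continuous, high-dimensional inputs like video.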
"All attempts to get the system to understand the world or model the world by predicting pixel-level detail in the film have basically failed." LeCun mentions that experience over the past 20 years has shown that even techniques for learning image representation, such as autoencoders, by reconstructing damaged or transformed images, are not as effective as the "federated embedding" architecture he advocates (Joint Embedding). The latter does not attempt to reconstruct at the pixel level, but learns the abstract representation (representation) the image or film and makes predictions in that abstract space. For example, if you take a video of a room, then stop and ask the system to predict the next picture, the system may be able to predict who is sitting in the room, but it can't accurately predict what everyone will look like because the details are unpredictable. If you force the model to predict these pixel-level details, you will waste a lot of resources on tasks that cannot be achieved. "Attempts at self-supervised learning through predictive video will not work, only at the representation level." This means that the architecture of a model that truly understands the world may not be generative. The World Model and JAPA: The Path to True Reasoning So, what would a model that could understand the physical world, have a lasting memory, and do reasoning programming look like if it weren't for LLM? LeCun believes the answer lies in the "world model" (World Models). The world model, he explains, is our inner simulator of how the world works, allowing us to manipulate ideas in our minds and predict the consequences of our actions. This is the core mechanism of human planning and reasoning, and we do not think in symbolic space. He came up with the concept of Embedding Predictive Architecture, (Joint called "Joint Embedding Predictive Architecture," JAPA). This architecture works by feeding a piece of movie or image into the encoder to get a representation, then feeding subsequent movies or images into another encoder, and then trying to make predictions in the "representation space" rather than in the original input space (such as pixels or symbols). While a "fill-in-the-blank" training method can be used, the operation takes place in an abstract latent space (latent space). The difficulty with this approach is that if not designed properly, the system can "crash", that is, ignore the input, and produce only a constant and uninformative representation. LeCun says it wasn't until five or six years ago that technology was available to effectively prevent this. He and his colleagues have published several papers in recent years on the preliminary results of the JAPA World Model. The goal of JAPA is to build a predictor: when the system observes a video, it forms an understanding of the current state of the world; Then, it needs to be able to predict "what the next state of the world will be if I take an imaginary action." With such a predictor, AI can plan a series of actions to reach a specific goal. LeCun firmly believes that this is the right way to achieve true reasoning and planning, far better than some of today's so-called "surrogate reasoning systems." These systems typically generate a large number of symbolic sequences and then use another neural network to pick the best sequence, an approach that LeCun describes as "randomly writing a program and then testing which one works," which is extremely inefficient and unreliable. 
LeCun also disputes the claims of some AI researchers that artificial general intelligence (AGI), or what he prefers to call advanced machine intelligence (AMI, Advanced Machine Intelligence), is just around the corner. The idea that human-level intelligence can be achieved simply by scaling up LLMs and generating massive sequences of symbols is, in his words, "nonsense." Although he expects that within the next three to five years researchers will be able to build systems with abstract world models and use them for reasoning and planning at a small scale, and that something approaching human level may arrive in about a decade, he stresses that AI researchers have repeatedly announced that a revolution is imminent and have repeatedly proved over-optimistic. "This wave is wrong as well." He believes that the claim that AI has reached doctoral level in a specific field or...