More than half a year has passed, and ChatGPT’s ranking is almost at the bottom.
Source: Titanium Media
Author: Sanyan Technology
Yesterday, the author happened to come across a picture.
It is curious: at the beginning of this year, it was ChatGPT's popularity that got other companies talking about large models in the first place.
Only a little over half a year later, has GPT already "hit bottom"?
So the author decided to look at where GPT actually ranks.
Different test times, different test teams: GPT-4 ranks eleventh
Judging from the information shown in the picture, the ranking comes from the C-Eval leaderboard.
C-Eval is a comprehensive Chinese-language examination evaluation suite for large models, jointly built by Tsinghua University, Shanghai Jiao Tong University, and the University of Edinburgh.
The suite reportedly covers four major directions (humanities, social science, STEM, and other specialties), spanning 52 subjects that range across fields such as calculus and linear algebra. It contains 13,948 Chinese knowledge and reasoning questions in total, divided into four difficulty levels: middle school, high school, college, and professional.
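To make the suite's format concrete, here is a minimal sketch of how such multiple-choice items might be represented and scored per subject. The subject names, answers, and predictions are invented for illustration, and the simple macro-average shown is only an assumption about how an overall score could be computed, not the leaderboard's official method.

```python
from collections import defaultdict

# Toy items in the C-Eval style: each real item carries a question, four
# options (A-D), and a gold answer letter, grouped by subject.
# Everything below is invented for illustration.
items = [
    {"subject": "advanced_mathematics", "answer": "B"},
    {"subject": "advanced_mathematics", "answer": "D"},
    {"subject": "modern_history", "answer": "A"},
]

# Hypothetical model predictions, one answer letter per item, in the same order.
predictions = ["B", "C", "A"]

correct = defaultdict(int)
total = defaultdict(int)
for item, pred in zip(items, predictions):
    total[item["subject"]] += 1
    correct[item["subject"]] += int(pred == item["answer"])

# Per-subject accuracy, then a simple macro-average across subjects.
per_subject = {s: correct[s] / total[s] for s in total}
overall = sum(per_subject.values()) / len(per_subject)
print(per_subject, round(overall, 3))
```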
So the author checked the latest C-Eval leaderboard.
The latest C-Eval ranking is consistent with the one shown in the picture above: among the top eleven large models, GPT-4 ranks last.
C-Eval notes that in its tests, many models were found to perform better under the zero-shot setting after instruction fine-tuning. Many of the models tested have both zero-shot and few-shot results, and the leaderboard reports whichever setting yields the better overall average score.
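The "better setting wins" rule described above can be illustrated with a small calculation. This sketch assumes one average accuracy per category for each setting; all numbers are invented.

```python
# Toy per-category accuracies for the same model under two prompting settings.
# All numbers are invented for illustration.
zero_shot = {"stem": 0.52, "social_science": 0.61, "humanities": 0.58, "other": 0.55}
few_shot = {"stem": 0.50, "social_science": 0.63, "humanities": 0.57, "other": 0.54}

def average(scores: dict) -> float:
    """Simple unweighted mean over categories."""
    return sum(scores.values()) / len(scores)

# Report whichever setting yields the better overall average,
# as the leaderboard is described as doing.
best_setting, best_scores = max(
    [("zero-shot", zero_shot), ("few-shot", few_shot)],
    key=lambda pair: average(pair[1]),
)
print(best_setting, round(average(best_scores), 4))
```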
C-Eval also indicates that an asterisk (*) next to a model's name means its results were tested by the C-Eval team itself, while other results were obtained through user submissions.
In addition, the author noticed that the dates on which these models submitted their test results vary widely.
GPT-4's results were submitted on May 15, whereas first-ranked Yuntianshu submitted on August 31, second-ranked Galaxy on August 23, and third-ranked YaYi on September 4.
Moreover, among the top 16 large models, only GPT-4 carries an asterisk next to its name, meaning it alone was tested by the C-Eval team.
So the author then checked the complete C-Eval list.
The latest C-Eval list ranks a total of 66 large models.
Among these models, OpenAI's GPT-4 ranks eleventh and ChatGPT thirty-sixth, while Tsinghua-affiliated Zhipu AI's ChatGLM-6B ranks sixtieth and Fudan's MOSS ranks sixty-fourth.
Although these rankings do show the rapid momentum of domestic large models, the author believes that because the models were not tested by the same team at the same time, the results are not enough to prove conclusively which models are stronger or weaker.
It is like a class of students who each take their exams at different times and answer different papers: how can their individual scores be used to compare them?
What do the large model developers say? Many claim to have surpassed ChatGPT in Chinese and other capabilities
Recently, the large model scene has been quite lively.
The large model products of eight companies, including Baidu and ByteDance, have completed filing under the "Interim Measures for the Management of Generative Artificial Intelligence Services" and can now officially launch services to the public. Other companies have also successively released their own large model products.
So how do the developers of these large models describe their products?
On July 7, at the 2023 World Artificial Intelligence Conference forum on "Opportunities and Risks for the General Artificial Intelligence Industry in the Era of Large Models," Qiu Xipeng, a professor at Fudan University's School of Computer Science and Technology and head of the MOSS system, said that Fudan's conversational large language model MOSS has been iterating continuously since its release in February this year: "The latest MOSS can already surpass ChatGPT in Chinese-language capability."
At the end of July, NetEase Youdao launched a large translation model. NetEase Youdao CEO Zhou Feng publicly stated that in internal tests of Chinese-English translation, it had surpassed ChatGPT's translation capability and exceeded the level of Google Translate.
In late August, at the 2023 Yabuli Forum Summer Summit, Liu Qingfeng, founder and chairman of iFlytek, said in a speech: "The code generation and completion capabilities of the iFlytek Spark model have surpassed ChatGPT, and its other capabilities are catching up quickly. The logic, algorithms, methodology, and data preparation behind the current code capability are all in place; all that is needed now is time and computing power."
SenseTime stated in a recent press release that its new model InternLM-123B completed training in August this year, with its parameter count increased to 123 billion. Across 51 well-known evaluation sets worldwide, totaling 300,000 questions, its overall test score ranks second in the world, surpassing GPT-3.5-turbo and Meta's newly released LLaMA 2-70B, among other models.
According to SenseTime, InternLM-123B ranked first in 12 major evaluations. Among them, it scored 57.8 on the comprehensive evaluation set AGIEval, surpassing GPT-4 to rank first; it scored 88.5 on the knowledge evaluation CommonsenseQA, ranking first; and it topped all five reading comprehension evaluations.
In addition, it ranked first in all five reasoning evaluations.
Earlier this month, Zuoyebang officially released its self-developed Galaxy model.
Zuoyebang said the Galaxy model has posted results on two authoritative large language model benchmarks, C-Eval and CMMLU. According to the data, the Zuoyebang Galaxy model ranks first on the C-Eval list with an average score of 73.7; on the CMMLU list, it also ranks first in both the five-shot and zero-shot evaluations, with average scores of 74.03 and 73.85 respectively, making it the first education-focused large model to top both of these authoritative lists by average score.
Yesterday, Baichuan Intelligence announced the official open-sourcing of the fine-tuned Baichuan2-7B, Baichuan2-13B, Baichuan2-13B-Chat, and their 4-bit quantized versions.
Wang Xiaochuan, founder and CEO of Baichuan Intelligence, said that in Chinese, the actual performance of the fine-tuned Chat model in question-answering and summarization scenarios has already exceeded that of closed-source models such as GPT-3.5.
Today, at the 2023 Tencent Global Digital Ecology Conference, Tencent officially released its Hunyuan model. Jiang Jie, vice president of Tencent Group, said that the Tencent Hunyuan large model's Chinese-language capability has surpassed GPT-3.5.
Beyond the developers' own accounts, some media outlets and research teams have also been evaluating these large models.
In early August, the team of Shen Yang, a professor and doctoral supervisor at Tsinghua University's School of Journalism and Communication, released a "Comprehensive Performance Evaluation Report of Large Language Models." The report shows that Baidu's Wenxin Yiyan leads domestically in its comprehensive score across 20 indicators in three major dimensions and outperforms ChatGPT; it ranks high in Chinese semantic understanding, and some of its Chinese capabilities are better than GPT-4's.
In mid-August, media reported that on August 11 Xiaomi's large model MiLM-6B appeared on the C-Eval and CMMLU evaluation lists. As of now, MiLM-6B ranks 10th on the overall C-Eval list, 1st among models of the same parameter scale, and 1st among Chinese large models on CMMLU.
On August 12, Tianjin University released a "Large Model Evaluation Report." The report shows that the comprehensive performance of GPT-4 and Baidu's Wenxin Yiyan is significantly ahead of other models, with scores close enough to be at the same level. Wenxin Yiyan has surpassed ChatGPT on most Chinese tasks and is gradually narrowing the gap with GPT-4.
In late August, media reported that KwaiYii, a large language model developed by Kuaishou, had entered internal testing. On the latest CMMLU Chinese ranking, its 13B version, KwaiYii-13B, ranked first under both the five-shot and zero-shot settings; it is strong in humanities and China-specific topics, with an average score above 61.
As can be seen above, although these large models claim to top a given ranking or to surpass ChatGPT in certain respects, most of them do well only in specific areas.
Moreover, some models report comprehensive scores exceeding those of GPT-3.5 or GPT-4, but the GPT results date from May. Who can guarantee that GPT has not improved over the past three months?
OpenAI’s situation
According to a report by UBS in February, just two months after ChatGPT was launched, its monthly active users at the end of January 2023 had exceeded 100 million, making it the fastest growing consumer application in history.
But ChatGPT's development has not been entirely smooth.
In July this year, many GPT-4 users complained that GPT-4's reasoning performance had declined compared with before.
Some users pointed out issues on Twitter and on the OpenAI developer forum, citing weaker logic, more wrong answers, an inability to keep track of information provided, difficulty following instructions, forgetting to include parentheses in basic software code, remembering only the most recent prompts, and so on.
In August, another report said that OpenAI may be in a potential financial crisis and could go bankrupt by the end of 2024.
The report stated that it costs OpenAI approximately US$700,000 per day just to run its artificial intelligence service ChatGPT. The company is currently trying to become profitable with GPT-3.5 and GPT-4, but has yet to generate enough revenue to break even.
However, OpenAI may also have a new turning point.
Recently, OpenAI announced that it will hold its first developer conference in November.
Although OpenAI has stated it will not release GPT-5 there, it said that hundreds of developers from around the world will work with the OpenAI team to preview "new tools" in advance and exchange ideas.
This may mean that ChatGPT has made new progress.
According to The Paper, a person familiar with the matter revealed on August 30 that OpenAI is expected to generate more than $1 billion in revenue over the next 12 months by selling AI software and the computing power that drives it.
Today, another media report stated that Morgan Stanley will launch a generative artificial intelligence chatbot jointly developed with OpenAI later this month.
People who deal with Morgan Stanley's bankers tend to be wealthy or well-connected. If this upcoming generative AI chatbot proves genuinely useful to Morgan Stanley's clients, it could be a major boon for OpenAI.
The arrival of the artificial intelligence era is unstoppable. As for who ultimately comes out ahead, self-assessment is not enough; users have to do the scoring. We also believe that domestic large models can and will catch up with ChatGPT, both in specific capabilities and in overall capability.