Open ChatGPT "dangerous speech" with one click: AI chat robot has a "big bug" that cannot be fixed at present
With the rise of large-model technology, AI chatbots have become common tools for social entertainment, customer service, and educational assistance.
However, unsafe AI chatbots can be used to spread false information, manipulate public opinion, and even help hackers steal users' personal data. The emergence of generative AI tools built for cybercrime, such as WormGPT and FraudGPT, has raised concerns about the security of AI applications.
Last week, Google, Microsoft, OpenAI, and Anthropic launched a new industry body, the Frontier Model Forum, to promote the safe and responsible development of frontier AI systems: advancing AI safety research, identifying best practices and standards, and facilitating information sharing among policymakers and industry.
Recently, researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI disclosed a "major bug" affecting AI chatbots such as ChatGPT: adversarial prompts can bypass the safeguards set by developers and manipulate AI chatbots into generating dangerous speech.
None of the currently popular AI chatbots or models, such as OpenAI's ChatGPT, Google's Bard, Anthropic's Claude 2, and Meta's LLaMA-2, is spared.
Specifically, the researchers discovered an adversarial suffix that can be appended to queries sent to large language models (LLMs) to make them generate dangerous speech. Instead of letting the model refuse to answer such dangerous questions, the suffix maximizes the probability that the model produces an affirmative answer.
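As a rough illustration of that objective only, and not the researchers' method or code, the minimal sketch below checks how strongly a causal language model favors an affirmative opening over a refusal once a suffix is appended to a prompt. It assumes a Hugging Face causal LM; the model name, prompt, suffix, and target strings are all placeholders, and the actual suffix search used in the study is deliberately omitted.

```python
# Conceptual sketch (placeholder model and strings): compare the log-probability
# of an affirmative continuation vs. a refusal given prompt + appended suffix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the study targeted much larger aligned models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def target_log_prob(prompt: str, suffix: str, target: str) -> float:
    """Log-probability of `target` given `prompt + suffix`.

    An adversarial-suffix search would try to choose `suffix` so that this
    value is as high as possible for an affirmative target, instead of a
    refusal. Tokenizing context and target separately is a simplification.
    """
    context_ids = tokenizer(prompt + suffix, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i + 1; sum log-probs over the target span.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_positions = range(context_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(
        log_probs[pos, input_ids[0, pos + 1]].item() for pos in target_positions
    )

# Benign illustration with a harmless prompt and a stand-in "suffix".
prompt = "Write a short poem about the ocean. "
suffix = "describing. + similarly"  # stand-in for an optimized adversarial suffix
print(target_log_prob(prompt, suffix, "Sure, here is a short poem"))
print(target_log_prob(prompt, suffix, "I cannot help with that"))
```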
For example, when asked "how to steal someone's identity", the AI chatbot gave very different output before and after turning on "Add adversarial suffix".
In addition, AI chatbots will also be induced to write inappropriate remarks such as "how to build an atomic bomb", "how to post dangerous social articles", "how to steal money from charities".
In response, Zico Kolter, an associate professor at Carnegie Mellon University who participated in the study, said, "As far as we know, there is currently no way to fix this problem. We don't know how to make them safe."
The researchers had warned OpenAI, Google and Anthropic of the flaw before publishing these results. Each company has introduced blocking measures to prevent the exploits described in the research paper from working, but they haven't figured out how to stop adversarial attacks more generally.
Hannah Wong, a spokesperson for OpenAI, said: "We are constantly working to improve the robustness of our models against adversarial attacks, including methods for identifying patterns of unusual activity, ongoing red-team testing to simulate potential threats, and approaches to fixing model weaknesses revealed by newly discovered adversarial attacks."
Google spokesperson Elijah Lawal shared a statement explaining the steps the company took to test the model and find its weaknesses. "While this is a common problem with LLMs, we have important safeguards in place in Bard that we are continually improving."
Michael Sellitto, Anthropic's interim director of policy and societal impact, said: "Making models more resistant to prompt injection and other adversarial 'jailbreak' measures is an active area of research. We are trying to make the base model more 'harmless' by hardening its defenses. At the same time, we are also exploring additional layers of defense."
**Academics have also issued warnings about this problem and offered some suggestions.**
Armando Solar-Lezama, a professor at MIT's School of Computing, said it makes sense that adversarial attacks exist in language models because they affect many machine learning models. However, it is surprising that an attack developed against a generic open source model can be so effective on multiple different proprietary systems.
The problem, Solar-Lezama argues, may be that all LLMs are trained on similar corpora of text data, much of which comes from the same websites, and the amount of data available in the world is limited.
"Any important decision should not be made entirely by the language model alone. In a sense, it's just common sense." He emphasized the moderate use of AI technology, especially when it involves important decisions or potential risks. In some scenarios, human participation and supervision** are still required to better avoid potential problems and misuse.
Arvind Narayanan, a professor of computer science at Princeton University, said: "It is no longer possible to keep AI from falling into the hands of malicious operators." While efforts should be made to make models more secure, he argues, we should also recognize that preventing all abuse is unlikely. A better strategy, therefore, is to strengthen supervision and fight abuse while developing AI technology.
Whether one reacts with worry or disdain, in developing and applying AI technology we must focus not only on innovation and performance but also keep safety and ethics in mind at all times.
Only with moderate use, human participation, and oversight can we better avoid potential problems and abuse, and allow AI technology to bring more benefits to human society.