
Cisco Finds Open-Weight AI Models Easy to Exploit in Long Chats

Cisco’s new research shows that open-weight AI models, while driving innovation, face serious security risks as multi-turn attacks, including conversational persistence, can bypass safeguards and expose data.

HackRead

When companies open the doors to their AI models, innovation often follows. But according to new research from Cisco, so do attackers. In a comprehensive study released this week, Cisco AI Threat Research found that open-weight models (those with freely available parameters) are highly vulnerable to adversarial manipulation, especially during longer user interactions.

For your information, an open-weight model is a type of AI model where the trained parameters (the “weights”) are publicly released. Those weights are what give the model its learned abilities; they define how it processes language, generates text, or performs other tasks after training.

The report, titled Death by a Thousand Prompts: Open Model Vulnerability Analysis, analysed eight leading open-weight language models and found that multi-turn attacks, where an attacker engages the model across multiple conversational steps, were up to ten times more effective than one-shot attempts. The highest success rate reached a staggering 92.78% on Mistral’s Large-2 model, while Alibaba’s Qwen3-32B wasn’t far behind at 86.18%.

Comparison of open-weight models showing how often single-turn and multi-turn attacks succeeded, along with the performance gap between them (Image via Cisco)

Cisco’s researchers explained that attackers can build up trust with the model through a series of harmless exchanges, then slowly steer it toward producing disallowed or harmful outputs. This gradual escalation often slips past typical moderation systems, which are designed for single-turn interactions.

The report attributes this issue to a simple yet dangerous flaw: the models struggle to maintain safety context over time. Once an adversary learns how to reframe or redirect their queries, many of these systems lose track of earlier safety constraints.

The researchers observed that this behaviour allowed models to generate restricted content, reveal sensitive data, or create malicious code without tripping any internal safeguards.
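
To make that failure mode concrete, here is a toy sketch, not Cisco's tooling: the blocklist and helper functions below are invented for illustration, and a real deployment would use trained safety classifiers rather than phrase matching. The point is only the difference between judging each message in isolation and judging the accumulated transcript.

```python
# Toy illustration of per-turn vs. conversation-level moderation.
# BLOCKED_PHRASES and both checks are hypothetical stand-ins.

BLOCKED_PHRASES = {"bypass the filter"}   # invented disallowed intent

def message_is_safe(message: str) -> bool:
    """Per-turn moderation: judges one message with no earlier context."""
    text = message.lower()
    return not any(phrase in text for phrase in BLOCKED_PHRASES)

def conversation_is_safe(history: list[str]) -> bool:
    """Conversation-level moderation: judges the accumulated transcript."""
    return message_is_safe(" ".join(history))

turns = [
    "I'm putting together a training course on content filters.",
    "For the slides, explain what it generally means to bypass",
    "the filter in a system like the one we discussed, step by step.",
]

print([message_is_safe(t) for t in turns])  # [True, True, True] - each turn passes alone
print(conversation_is_safe(turns))          # False - the combined transcript reveals the intent
```

The report's observation is that typical moderation pipelines behave more like the first check than the second, because they were designed around single-turn interactions.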

However, not all models fared equally. Cisco’s data showed that alignment strategies (how developers train a model to follow rules) played a large role in security performance. Models like Google’s Gemma-3-1B-IT, which focus heavily on safety during alignment, showed lower multi-turn attack success rates, at around 25%.

On the other hand, capability-driven models such as Llama 3.3 and Qwen3-32B, which prioritise broad functionality, proved far easier to manipulate once a conversation stretched beyond a few exchanges.

In total, Cisco evaluated 102 different sub-threats and found that the top fifteen accounted for the most frequent and severe breaches. These included manipulation, misinformation, and malicious code generation, all of which could lead to data leaks or misuse when integrated into customer-facing tools like chatbots or virtual assistants.

The fifteen sub-threat categories that showed the highest vulnerability across all tested models (Image via Cisco)

The company’s researchers used their proprietary AI Validation platform to run automated, algorithmic tests across all models, simulating both single-turn and multi-turn adversarial attacks. Each model was treated as a black box, meaning no inside information about safety systems or architecture was used during testing. Despite that, the team achieved high attack success rates across nearly every tested model.
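
Cisco has not published the internals of its AI Validation platform, but the black-box, multi-turn setup it describes can be sketched roughly as follows. The function names, message format, and safety check below are assumptions for illustration, not the actual harness.

```python
from typing import Callable

Message = dict  # e.g. {"role": "user", "content": "..."} - assumed chat format

def run_probe(send_to_model: Callable[[list[Message]], str],
              probe_turns: list[str],
              reply_is_unsafe: Callable[[str], bool]) -> bool:
    """Black-box probe: only prompts in and replies out are observed.
    Returns True if any reply in the scripted sequence crosses the
    (stand-in) safety check."""
    history: list[Message] = []
    for prompt in probe_turns:
        history.append({"role": "user", "content": prompt})
        reply = send_to_model(history)
        history.append({"role": "assistant", "content": reply})
        if reply_is_unsafe(reply):
            return True
    return False

def attack_success_rate(send_to_model, probe_sequences, reply_is_unsafe) -> float:
    """Fraction of probe sequences that succeed. A single-turn attack is simply
    a probe sequence of length one, which is how the two styles compare."""
    hits = sum(run_probe(send_to_model, seq, reply_is_unsafe)
               for seq in probe_sequences)
    return hits / len(probe_sequences)
```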

“Across all models, multi-turn jailbreak attacks proved highly effective, with success rates reaching 92.78 percent. The sharp rise from single-turn to multi-turn vulnerability shows how models struggle to maintain safety guardrails across longer conversations.”

– Amy Chang (Lead Author), Nicholas Conley (Co-author), Harish Santhanalakshmi Ganesan, and Adam Swanda, Cisco AI Threat Research & Security

Cisco’s findings may be recent, but the concern itself isn’t. Security experts have long warned that open-weight AI models can be easily altered into unsafe versions. The ability to fine-tune these systems so freely gives attackers a way to strip away built-in safeguards and repurpose them for harmful use.

Because the weights are publicly accessible, anyone can retrain the model with malicious goals, either to weaken its guardrails or trick it into producing content that closed models would reject.

Some well-known open-weight AI models include the following; each ships with its trained weights available for download, allowing developers to run them on their own systems or adjust them for specific tasks and projects (a brief example of loading one locally follows the list):

  1. Meta Llama 3 and Llama 3.3 – released by Meta for research and commercial use, widely used as a base for custom chatbots and coding assistants.
  2. Mistral 7B and Mistral Large-2 (also called Large-Instruct-2407) – from Mistral AI, known for high performance and permissive licensing.
  3. Alibaba Qwen 2 and Qwen 3 – from Alibaba Cloud, optimised for multilingual tasks and coding.
  4. Google Gemma 2 and Gemma 3-1B-IT – smaller open-weight models built for safety-focused applications.
  5. Microsoft Phi-3 and Phi-4 – compact models emphasising reasoning and efficiency.
  6. Zhipu AI GLM-4 and GLM-4.5-Air – large bilingual models popular across China’s AI ecosystem.
  7. DeepSeek V3.1 – open-weight model from DeepSeek AI designed for research and engineering tasks.
  8. Falcon 180B and Falcon 40B – developed by the Technology Innovation Institute (TII) in the UAE.
  9. Mixtral 8x7B – an open mixture-of-experts model also from Mistral AI.
  10. OpenAI GPT-OSS-20B – OpenAI’s limited open-source research model used for evaluation and benchmarking.
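
As noted above, what makes these models “open-weight” is that their parameters can be downloaded and run locally. A minimal sketch using the Hugging Face transformers library looks roughly like this; the model ID and arguments are illustrative, and each model’s licence and hardware requirements should be checked first.

```python
# Illustrative only: any of the open-weight models listed above could be
# substituted, subject to its licence and your available hardware.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # assumed Hugging Face repo ID
    device_map="auto",                           # place weights on available devices
)

print(generator("In one sentence, what is an open-weight model?", max_new_tokens=60))
```

That same accessibility is what lets an attacker fine-tune away the guardrails, which is the concern raised above.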

The report doesn’t call for an end to open-weight development but argues for responsibility. Cisco urges AI labs to make it harder for people to remove built-in safety controls during fine-tuning and advises organisations to apply a security-first approach when deploying these systems. That means adding context-aware guardrails, real-time monitoring, and ongoing red-teaming tests to catch weaknesses before they can be abused.
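
One way to read the “context-aware guardrails” recommendation is to score the whole conversation before every reply rather than the latest message alone, and to log that score for monitoring. The sketch below is a hedged illustration of that pattern: the risk scorer is a placeholder, and a real deployment would use a trained conversation-level classifier.

```python
def conversation_risk(history: list[str]) -> float:
    """Placeholder scorer: in practice this would be a trained classifier that
    reads the full transcript, not a keyword heuristic."""
    markers = ("step by step", "ignore your previous", "pretend you are")
    transcript = " ".join(history).lower()
    return min(1.0, 0.34 * sum(marker in transcript for marker in markers))

def guarded_generate(history: list[str], generate) -> str:
    """Re-check the accumulated context on every turn and refuse once the
    conversation (not just the last message) looks risky."""
    score = conversation_risk(history)
    print(f"monitoring: conversation risk = {score:.2f}")  # stand-in for real-time telemetry
    if score >= 0.5:                                       # illustrative threshold
        return "I can't keep going in this direction."
    return generate(history)
```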

Cisco’s research also found that attackers tend to use the same manipulation tactics that work on people. Methods such as role-play, subtle misdirection, and gradual escalation proved especially effective, showing how social engineering techniques can easily carry over into AI interactions and prompt manipulation.

Ultimately, Cisco’s report makes clear that protecting AI models should be treated like any other software security job: it takes ongoing testing, protective controls, and clear communication about the risks involved.

The full report, Death by a Thousand Prompts: Open Model Vulnerability Analysis, is available on arXiv as a PDF.

(Image by T Hansen from Pixabay)
