Can artificial intelligence systems be hacked?

Yes. AI systems are vulnerable to several attack categories: prompt injection (manipulating model behavior through crafted inputs), training data poisoning (corrupting the data used to train models), model extraction (stealing model weights through API queries), adversarial examples (inputs designed to cause misclassification) and supply chain attacks (compromising model dependencies). The OWASP LLM Top 10 documents these risks in detail.

What is the biggest security risk of deploying AI in production?

The biggest risk is excessive agency: giving AI systems the ability to take actions (send emails, modify databases, execute code) without adequate human oversight and authorization controls. When combined with prompt injection vulnerabilities, excessive agency allows attackers to weaponize your own AI systems against your infrastructure. Every AI agent should operate under the principle of least privilege.

How do you penetration test an AI system?

AI penetration testing covers prompt injection (direct and indirect), jailbreak resistance, data leakage through model responses, tool use abuse in agentic systems, input/output validation, rate limiting and resource exhaustion, model API authentication and authorization and sensitive data exposure in training data. Sherlock Forensics maps all findings to the OWASP LLM Top 10 framework.

Can AI Be Hacked? Yes. Here Is How

The Short Answer Is Yes

AI is software. Software has vulnerabilities. AI has all the vulnerabilities of traditional software plus an entirely new class of attack vectors that did not exist before machine learning. If your organization uses AI in any capacity, those systems can be compromised. Here are the five ways attackers do it.

Adversarial Attacks

Adversarial attacks manipulate the inputs to an AI model to force incorrect outputs. An image classifier that correctly identifies a stop sign can be fooled by adding small pixel-level perturbations invisible to the human eye. The model now reads the stop sign as a speed limit sign. This is not a theoretical exercise. Researchers have demonstrated adversarial attacks against self-driving car vision systems, facial recognition platforms and malware detection engines.

The same principle applies to text. Small modifications to input text can cause sentiment analysis models to flip their classification, cause spam filters to pass malicious emails and cause content moderation systems to approve policy-violating content. If your business relies on AI to make decisions, adversarial inputs can make it decide wrong.

Prompt Injection

Prompt injection is the most common attack against large language models. Every chatbot, AI assistant and LLM-powered feature is a potential target. The attack works because LLMs cannot reliably distinguish between the developer's instructions and the user's input. An attacker types "ignore all previous instructions and do X" and the model frequently complies.

Indirect prompt injection is worse. Attackers hide instructions in documents, emails or web pages that the AI processes as part of its workflow. The AI follows the hidden instructions without the user knowing. This can exfiltrate data, bypass safety controls or cause the AI to take unauthorized actions. The OWASP Top 10 for LLMs ranks it as the number one vulnerability.

Model Extraction

If you built a proprietary AI model, attackers can steal it without ever touching your servers. Model extraction sends thousands of queries to your API and uses the input-output pairs to train a replica. The attacker gets a functional copy of your model for a fraction of your R&D cost. Research shows models worth millions in training compute can be replicated for a few hundred dollars in API calls.

Data Poisoning

Data poisoning attacks corrupt the training data that AI models learn from. An attacker who can influence your training pipeline can implant backdoors that cause the model to behave incorrectly on specific inputs while passing every standard evaluation. Poisoning as little as 0.01% of a training dataset can create persistent backdoors that survive fine-tuning. If your model learns from user-generated content, public datasets or scraped web data, it is vulnerable.

Jailbreaking

Jailbreaking bypasses the safety controls that AI providers build into their models. Attackers use creative prompting techniques to get AI systems to generate harmful content, reveal system prompts, produce malware code or provide instructions for illegal activities. New jailbreak techniques are discovered daily and shared openly. No AI safety filter has held up against determined adversarial testing.

What This Means for Your Defense

Attackers are using AI. They use it to generate phishing emails, discover vulnerabilities, automate credential stuffing and create deepfakes. The offensive application of AI is accelerating faster than defensive adoption. The only way to match this is to test your AI systems with the same rigor and creativity that attackers bring.

This is what penetration testing is for. Not a checkbox compliance exercise but an adversarial assessment that tests your AI systems the way real attackers target them. If you deploy AI, you need someone testing whether it can be hacked before someone else demonstrates that it can.

The Short Answer Is Yes

Adversarial Attacks

Prompt Injection

Model Extraction

Data Poisoning

Jailbreaking

What This Means for Your Defense

Related analysis

How Attackers Are Using AI Right Now

Your Employees Are Using AI. That Is a Security Problem.

Claude Mythos Just Changed the Game