
ModelArmor Use Case: Adversarial Attacks on Deep Learning Models

Adversarial attacks exploit vulnerabilities in AI systems by subtly altering input data to mislead the model into incorrect predictions or decisions. These perturbations are often imperceptible to humans but can significantly degrade the system's performance.

Types of Adversarial Attacks

  1. By Model Access:
    • White-box Attacks: The attacker has complete knowledge of the model, including its architecture and training data (see the FGSM sketch after this list).
    • Black-box Attacks: The attacker has no internal knowledge of the model and must probe its responses to craft adversarial inputs.
  2. By Target Objective:
    • Non-targeted Attacks: Push input to any incorrect class.
    • Targeted Attacks: Force input into a specific class.
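
As a concrete illustration of the white-box, non-targeted setting, the sketch below implements the Fast Gradient Sign Method (FGSM), a standard evasion technique. It assumes a PyTorch classifier with inputs scaled to [0, 1]; the function name fgsm_attack and the epsilon value are illustrative and not part of ModelArmor.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.03):
    """Craft a non-targeted adversarial example from a clean input x."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), label)
    loss.backward()
    # Step in the direction that increases the loss, then keep pixels in [0, 1].
    perturbed = x_adv + epsilon * x_adv.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```

Because the attacker can read the model's gradients directly, a single small step is often enough to flip the prediction while leaving the input visually unchanged.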

Attack Phases

  1. Training Phase Attacks:
    • Data Poisoning: Injects malicious data into the training set, altering model behavior.
    • Backdoor Attacks: Embeds triggers in training data that activate specific responses during inference.
  2. Inference Phase Attacks:
    • Model Evasion: Gradually perturbs input to skew predictions (e.g., targeted misclassification).
    • Membership Inference: Exploits model outputs to determine whether specific records were part of the training data, potentially exposing sensitive information such as credit card numbers (see the sketch after this list).
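
A minimal sketch of the membership-inference idea, assuming the attacker can only observe the model's output probabilities; the threshold value and function name are illustrative. Overfit models tend to be unusually confident on records they were trained on, which is the signal exploited here.

```python
import numpy as np

def likely_training_member(probabilities, threshold=0.95):
    """Flag a record as a probable training-set member when the model is
    unusually confident about it, a common symptom of overfitting."""
    return float(np.max(probabilities)) >= threshold

# Example: a suspiciously confident prediction on a candidate record.
print(likely_training_member([0.97, 0.02, 0.01]))  # True
```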

Observations on Model Robustness

Highly accurate models often exhibit reduced robustness against adversarial perturbations, creating a tradeoff between accuracy and security. For instance, Chen et al. found that better-performing models tend to be more sensitive to adversarial inputs.

Figure: Adversarial Model Performance

Defense Strategies

  1. Pre-analysis: Test models for prompt injection vulnerabilities using techniques like fuzzing.
  2. Input Sanitization (see the sketch after this list):
    • Validation: Enforce strict input rules (e.g., character and data type checks).
    • Filtering: Strip malicious scripts or fragments.
    • Encoding: Convert special characters to safe representations.
  3. Secure Practices for Model Deployment:
    • Restrict model permissions.
    • Regularly update libraries to patch vulnerabilities.
    • Detect injection attempts with specialized tooling.
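
A minimal sketch of the input-sanitization steps above (validation, filtering, encoding), assuming a plain-text interface to the model; the length limit, allowed character set, and blocked fragments are illustrative placeholders, not ModelArmor defaults.

```python
import html
import re

MAX_LEN = 2000
ALLOWED_CHARS = re.compile(r"^[\w\s.,!?'()-]*$")   # permitted characters only
BLOCKED_FRAGMENTS = ("ignore previous instructions", "<script", "system prompt")

def sanitize_input(text: str) -> str:
    """Validate, filter, and encode untrusted input before it reaches the model."""
    # Validation: enforce length and character-type rules.
    if len(text) > MAX_LEN or not ALLOWED_CHARS.match(text):
        raise ValueError("input failed validation")
    # Filtering: reject known malicious fragments (case-insensitive check).
    lowered = text.lower()
    if any(fragment in lowered for fragment in BLOCKED_FRAGMENTS):
        raise ValueError("potential injection attempt detected")
    # Encoding: convert special characters to safe representations.
    return html.escape(text)
```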


Case Study: Pickle Injection Vulnerability

Python's pickle module allows serialization and deserialization but lacks security checks. Attackers can exploit this to execute arbitrary code using crafted payloads. The module’s inherent insecurity makes it risky to use with untrusted inputs.
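The snippet below illustrates the risk, assuming the attacker controls the serialized bytes: pickle invokes an object's __reduce__ method during deserialization, so a crafted object can run arbitrary commands the moment it is loaded. The class name and shell command are illustrative.

```python
import os
import pickle

class MaliciousPayload:
    # pickle calls __reduce__ to learn how to rebuild the object; returning
    # a callable and its arguments makes that callable run on deserialization.
    def __reduce__(self):
        return (os.system, ("echo pickle payload executed",))

data = pickle.dumps(MaliciousPayload())
pickle.loads(data)  # runs the shell command: arbitrary code execution
```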

Mitigation:

  • Avoid using pickle with untrusted sources.
  • Prefer safer serialization formats such as json or protobuf (sketch below).
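
A minimal sketch of the safer alternative using the standard json module; the record contents are illustrative. Unlike pickle, loading JSON yields only basic data types and has no code-execution semantics.

```python
import json

record = {"user_id": 42, "scores": [0.91, 0.88]}

# Serialize to plain text; nothing executes when the data is read back.
serialized = json.dumps(record)
restored = json.loads(serialized)
assert restored == record
```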
