GUARDING MULTIMODAL ARTIFICIAL INTELLIGENCE SYSTEMS FROM MALICIOUS PROMPT ATTACKS
Assignee
Microsoft Technology Licensing, LLC
Inventors
Reshmi GHOSH, Vitor Rocha De CARVALHO, Robert SIM, Emily LAWTON, Jack Wilson STOKES, Lukas WUTSCHITZ, Ahmed Mohamed Gamal SALEM, Xuefeng DU
Abstract
A data processing system implements obtaining a plurality of unlabeled user prompts including an unknown mixture of malicious prompts and benign prompts; analyzing each unlabeled user prompt using a multimodal vision language model to obtain embeddings representing each unlabeled user prompt; analyzing the embeddings to determine a representation of each unlabeled user prompt of the plurality of unlabeled user prompts in a latent space; determining a first region of the latent space associated with benign user prompts and a second region of the latent space associated with malicious user prompts; generating labeled training data by labeling each unlabeled user prompt of the plurality of unlabeled user prompts with an indication of whether each unlabeled user prompt is a benign user prompt falling within the first region or a malicious user prompt falling within the second region; and training a prompt classifier using the labeled training data.
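The pipeline in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: the embeddings are hand-written 2-D toy vectors standing in for the output of a multimodal vision language model, the two latent-space regions are found with a simple two-centroid clustering, the larger cluster is assumed to be the benign region, and the "prompt classifier" is a nearest-centroid rule. The function names (`two_regions`, `label_prompts`, `train_classifier`) are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def two_regions(embeddings, iters=20):
    """Split embeddings into two clusters (assumed stand-in for region finding)."""
    # Deterministic initialization: the first point and the point farthest from it.
    c0 = embeddings[0]
    c1 = max(embeddings, key=lambda p: math.dist(p, c0))
    centroids = [c0, c1]
    assign = []
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        assign = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in embeddings
        ]
        # Move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, a in zip(embeddings, assign) if a == k]
            if members:
                centroids[k] = mean(members)
    return assign

def label_prompts(embeddings):
    """Label each embedding; ASSUMPTION: the larger cluster is the benign region."""
    assign = two_regions(embeddings)
    benign_cluster = 0 if assign.count(0) >= assign.count(1) else 1
    return ["benign" if a == benign_cluster else "malicious" for a in assign]

def train_classifier(embeddings, labels):
    """Fit a nearest-centroid classifier on the generated labels."""
    centroids = {
        name: mean([e for e, l in zip(embeddings, labels) if l == name])
        for name in set(labels)
    }
    def classify(embedding):
        return min(centroids, key=lambda name: math.dist(embedding, centroids[name]))
    return classify

# Toy embeddings (hypothetical); a real system would obtain these from the model.
unlabeled = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (5.0, 5.1), (4.9, 5.2)]
labels = label_prompts(unlabeled)
classify = train_classifier(unlabeled, labels)
```

Calling `classify` on a new embedding returns the label of the nearer region. The majority-is-benign rule is only one possible way to decide which region is which; the abstract does not state how that association is made.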
CPC Classifications
Filing Date
2024-12-19
Application No.
18988604