Sensible Universal Adversarial Triggers

Dec 27, 2021

Positive Word Cloud

Proposed a novel technique that combines part-of-speech filtering with a perplexity-based loss function to generate sensible triggers that are closer to natural phrases.
For sentiment analysis on the SST dataset, the method produced sensible triggers that reduce accuracy to as low as 4% when flipping positive predictions to negative and 12% when flipping negative predictions to positive.
To build robust models, performed adversarial training with the generated triggers, which increased the accuracy of the model from 12% to 48%.
Illustrated that generating sensible triggers can make adversarial attacks difficult to detect, and facilitated robust model development through relevant defenses.
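
As a rough illustration of how part-of-speech filtering and a perplexity-based penalty might fit together, here is a minimal Python sketch. It is not the project's actual implementation: the tag whitelist, the `weight` value, the function names, and the toy inputs are all hypothetical, and the attack loss and language-model log-probabilities are assumed to come from a sentiment classifier and a language model that are not shown.

```python
import math

# Hypothetical whitelist of POS tags allowed in a trigger (an assumption,
# not taken from the project).
ALLOWED_POS = {"ADJ", "ADV", "NOUN", "VERB"}


def filter_by_pos(candidates, pos_of, allowed=ALLOWED_POS):
    """Keep only candidate tokens whose POS tag is in the allowed set."""
    return [tok for tok in candidates if pos_of.get(tok) in allowed]


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability) of the trigger tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def trigger_loss(attack_loss, token_log_probs, weight=0.1):
    """Attack loss plus a weighted log-perplexity penalty.

    attack_loss: assumed to be the classifier's loss toward the attacker's
        target label (lower means a more successful attack).
    token_log_probs: assumed natural-log probabilities of the trigger
        tokens under some language model.
    Lower values favour triggers that both fool the classifier and read
    like natural text.
    """
    return attack_loss + weight * math.log(perplexity(token_log_probs))


# Toy usage with made-up tags and probabilities.
pos_of = {"terrible": "ADJ", "zzxq": "X", "movie": "NOUN", "the": "DET"}
kept = filter_by_pos(["terrible", "zzxq", "movie", "the"], pos_of)

fluent = [math.log(0.5), math.log(0.5)]     # likely tokens
garbled = [math.log(0.01), math.log(0.01)]  # unlikely tokens
print(kept)
print(trigger_loss(1.0, fluent) < trigger_loss(1.0, garbled))
```

With the same attack loss, the candidate whose tokens the language model finds more likely gets the lower combined loss, which is the intuition behind steering the search toward natural-sounding triggers.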
