- Proposed a novel technique that combines part-of-speech filtering with a perplexity-based loss function to generate sensible triggers that are closer to natural phrases.
- For sentiment analysis on the SST dataset, the method produced sensible triggers that reduce accuracy to as low as 4% and 12% when flipping positive predictions to negative and vice versa, respectively.
- To build robust models, performed adversarial training with the generated triggers, which increased the model's accuracy from 12% to 48%.
- Illustrated that adversarial attacks can be made difficult to detect by generating sensible triggers, with the aim of facilitating robust model development through relevant defenses.
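
The first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the tag dictionary, the allowed tag set, the weighting term `weight`, and the function names `pos_filter`, `perplexity`, and `combined_loss` are all hypothetical, and the way the two loss terms are combined (attack loss plus weighted log-perplexity) is an assumption made for illustration only.

```python
import math

# Hypothetical toy part-of-speech tags; a real system would use a tagger.
POS_TAGS = {
    "the": "DET",
    "terrible": "ADJ",
    "tapping": "VERB",
    "movie": "NOUN",
}

# Assumed set of tags allowed in a trigger (illustrative choice only).
ALLOWED_POS = {"ADJ", "NOUN"}


def pos_filter(candidates, pos_tags, allowed):
    """Keep only candidate tokens whose POS tag is in the allowed set."""
    return [tok for tok in candidates if pos_tags.get(tok) in allowed]


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def combined_loss(attack_loss, token_log_probs, weight=0.1):
    """Assumed objective: attack loss plus a weighted log-perplexity penalty.

    A lower value means the trigger is both more effective (low attack loss)
    and more natural under the language model (low perplexity).
    """
    return attack_loss + weight * math.log(perplexity(token_log_probs))


if __name__ == "__main__":
    candidates = ["the", "terrible", "tapping", "movie"]
    print(pos_filter(candidates, POS_TAGS, ALLOWED_POS))

    # Language-model log-probabilities for a 4-token trigger (made-up values).
    log_probs = [math.log(0.5)] * 4
    print(combined_loss(1.0, log_probs, weight=0.5))
```

In this sketch, the filter narrows the candidate vocabulary before the search, and the perplexity term penalizes token sequences that a language model finds unlikely. In practice the log-probabilities would come from a pretrained language model and the attack loss from the target classifier.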