System and method for robust natural language classification under character encoding
Assignee
Barracuda Networks, Inc.
Inventors
Christopher L. Sawtelle
Abstract
A new approach is proposed to support robust natural language classification under character encoding. For a target language character, a plurality of images representing a plurality of characters under various language encoding schemes are accepted and utilized to create a distribution of text similarity probabilities over those characters that are likely to be substituted for the target language character to trick a human user. The distribution of text similarity probabilities is then applied to a true text corpus comprising a set of real texts to generate a synthetic text corpus in which a set of characters has been swapped with one or more of the plurality of characters according to the distribution. The synthetic text corpus is then utilized to train one or more natural language processing (NLP) models, which are in turn utilized to correctly classify and recognize an incoming electronic message that contains a character swap attack.
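The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the tiny 3x3 bitmaps, the pixel-agreement similarity measure, the `threshold` and `swap_rate` parameters, and all function names are assumptions introduced here for illustration only. A real system would presumably rasterize each code point with a font and use a stronger image similarity measure.

```python
import random

# Hypothetical 3x3 bitmaps standing in for rendered glyph images.
# Assumption: a real system would render each code point with a font.
GLYPHS = {
    "o": "111101111",       # Latin small letter o (the target character)
    "\u043e": "111101111",  # Cyrillic small letter o, visually identical here
    "0": "111101110",       # digit zero, differs by one pixel in this toy data
    "x": "101010101",       # clearly different shape
}


def pixel_similarity(a, b):
    """Fraction of pixels on which two equal-sized bitmaps agree."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


def swap_distribution(target, glyphs, threshold=0.8):
    """Turn similarities of look-alike glyphs into swap probabilities.

    Glyphs below the (assumed) threshold are treated as unlikely to
    fool a human reader and are excluded from the distribution.
    """
    scores = {
        ch: pixel_similarity(glyphs[target], bitmap)
        for ch, bitmap in glyphs.items()
        if ch != target
    }
    kept = {ch: s for ch, s in scores.items() if s >= threshold}
    if not kept:
        return {}
    total = sum(kept.values())
    return {ch: s / total for ch, s in kept.items()}


def synthesize(texts, target, dist, swap_rate=0.5, seed=0):
    """Build a synthetic corpus by swapping the target character.

    Each occurrence of the target is replaced with probability
    swap_rate; the replacement is sampled from dist.
    """
    if not dist:
        return list(texts)
    rng = random.Random(seed)
    chars = list(dist)
    weights = list(dist.values())
    synthetic = []
    for text in texts:
        synthetic.append("".join(
            rng.choices(chars, weights)[0]
            if ch == target and rng.random() < swap_rate
            else ch
            for ch in text
        ))
    return synthetic


if __name__ == "__main__":
    dist = swap_distribution("o", GLYPHS)
    corpus = ["reset your password now", "invoice from accounting"]
    for line in synthesize(corpus, "o", dist, swap_rate=1.0):
        print(line)
```

The synthetic texts, paired with the labels of the real texts they were derived from, would then be added to the training data so that a classifier sees swapped characters during training. That training step is omitted here because the abstract does not specify a model.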
CPC Classifications
Filing Date
2023-09-22
Application No.
18371878
Claims
17