bincfg.normalization.java package

Submodules

bincfg.normalization.java.java_normalizers module

class bincfg.normalization.java.java_normalizers.JavaBaseNormalizer(*args, **kwargs)

Bases: BaseNormalizer

A base class for a normalization method.

Performs an ‘unnormalized’ normalization, removing what is likely extraneous information, and providing a base class for other normalization methods to inherit from.

Parameters:
  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping token type strings to functions that handle those tokens. These override any token handlers used by default (i.e., all of the self.handle_* functions). Each function should take one argument (the current normalizer state) and return either the next string token to add to the current line, or None to add nothing. This is useful for handling new token types that are not built in.

  • token_sep (Optional[str]) – the string used to separate tokens in returned instruction lines. Only used if tokenization_level is 'instruction'. If None, a default value is used (' ' for unnormalized output using BaseNormalizer(), '_' for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • 'op': tokenized at the opcode/operand level. Inserts an 'INSTRUCTION_START' token at the beginning of each instruction line

    • 'inst'/'instruction': tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • 'auto': pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
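
As a rough illustration of what anonymize_tokens describes, the sketch below takes a 4-byte SHAKE-128 digest of a token with Python's hashlib. The helper name anonymize_token, the UTF-8 encoding, and the hex output are assumptions made for this example; bincfg may encode or store the hash differently.

```python
import hashlib


def anonymize_token(token: str, num_bytes: int = 4) -> str:
    """Return the hex form of a num_bytes-byte SHAKE-128 digest of token."""
    # Assumption: tokens are UTF-8 encoded before hashing and the digest is
    # rendered as hex. This is an illustration, not bincfg's own code.
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(num_bytes)


tokens = ["invokevirtual", "iload_1", "invokevirtual"]
hashed = [anonymize_token(t) for t in tokens]
```

Equal tokens map to equal hashes, so token identity is preserved while the original text is hidden. A 4-byte digest renders as 8 hex characters.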

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']

The default tokenization level used for this normalizer

opcode_function_call(state)

Handles function call opcodes, defaults to doing nothing

opcode_jump(state)

Handles jump opcodes, defaults to doing nothing

renormalizable = True

Whether or not this normalization method can be renormalized later by other normalization methods

save(path)

token_sep = None

The separator string used for this normalizer

Will default to ' ' for BaseNormalizer, and '_' for all other normalizers.
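
The interaction between token_sep and the tokenization levels described in the parameters can be sketched as follows. This is a minimal stand-alone illustration, not the library's implementation; the helper name format_line and its defaults are hypothetical.

```python
def format_line(tokens, tokenization_level="inst", token_sep="_"):
    """Format one instruction's tokens at the requested level (hypothetical helper)."""
    if tokenization_level == "op":
        # Opcode/operand level: tokens stay separate, and a marker token
        # is placed at the start of the instruction line
        return ["INSTRUCTION_START", *tokens]
    if tokenization_level in ("inst", "instruction"):
        # Instruction level: the whole line becomes one token, joined by token_sep
        return [token_sep.join(tokens)]
    raise ValueError(f"unknown tokenization level: {tokenization_level!r}")


op_level = format_line(["bipush", "10"], "op")
inst_level = format_line(["bipush", "10"], "inst")
spaced = format_line(["bipush", "10"], "instruction", token_sep=" ")
```

Here op_level keeps three separate tokens, while inst_level and spaced each hold a single joined token. This is why token_sep only matters at the instruction level.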

tokenizer = None

The tokenizer used for this normalizer

class bincfg.normalization.java.java_normalizers.JavaReplaceImmediateNormalizer(*args, **kwargs)

Bases: JavaBaseNormalizer

Replaces all immediate values over some threshold with the immediate token

Parameters:
  • imm_threshold (Optional[int]) – all immediate values whose absolute value is greater than this threshold will be replaced with the immediate value token. If None or < 0, then all immediates will be replaced no matter the size

  • include_negative (bool) – if True, then a negative sign will be added to the front of all replaced immediate tokens that are negative

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping token type strings to functions that handle those tokens. These override any token handlers used by default (i.e., all of the self.handle_* functions). Each function should take one argument (the current normalizer state) and return either the next string token to add to the current line, or None to add nothing. This is useful for handling new token types that are not built in.

  • token_sep (Optional[str]) – the string used to separate tokens in returned instruction lines. Only used if tokenization_level is 'instruction'. If None, a default value is used (' ' for unnormalized output using BaseNormalizer(), '_' for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • 'op': tokenized at the opcode/operand level. Inserts an 'INSTRUCTION_START' token at the beginning of each instruction line

    • 'inst'/'instruction': tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • 'auto': pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
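
The imm_threshold and include_negative rules above can be read as the following decision logic. This is a sketch under stated assumptions: the function name replace_immediate, the placeholder string "immval", and the default argument values are all hypothetical and may not match what the library emits.

```python
from typing import Optional

IMM_TOKEN = "immval"  # hypothetical placeholder for the immediate value token


def replace_immediate(value: int,
                      imm_threshold: Optional[int] = None,
                      include_negative: bool = False) -> str:
    """Return the token to emit for one immediate value."""
    # None or a negative threshold means every immediate is replaced
    replace_all = imm_threshold is None or imm_threshold < 0
    if not replace_all and abs(value) <= imm_threshold:
        # At or below the threshold: keep the literal value
        return str(value)
    if include_negative and value < 0:
        # Keep the sign in front of the replacement token
        return "-" + IMM_TOKEN
    return IMM_TOKEN


kept = replace_immediate(5, imm_threshold=10)
replaced = replace_immediate(300, imm_threshold=10)
signed = replace_immediate(-300, imm_threshold=10, include_negative=True)
```

Note that the comparison is on the absolute value, so -300 exceeds a threshold of 10 just as 300 does. A value equal to the threshold is kept, since only values greater than it are replaced.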

save(path)

bincfg.normalization.java.java_tokenizer module

bincfg.normalization.java.java_tokenizer.JAVA_DEFAULT_TOKENS = [('prefix', 'wide(?=[, \\t.]+|[|\\n]|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|[^>])*>|[^>])*>|[^>])*>|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|[^>])*>|[^>])*>|[^>])*>|$)')]

Default list of (token_type, regex) tuples to match against

class bincfg.normalization.java.java_tokenizer.JavaBaseTokenizer(*args, **kwargs)

Bases: BaseTokenizer

A default class to tokenize Java bytecode line input

The tokenizer will tokenize essentially anything, so long as it fits known tokens.

Known Tokens:

  • All of the default tokens from BaseTokenizer

  • Instruction prefix: the ‘wide’ prefix

  • Opcode: any substring that starts with a letter and continues with letters, digits, or underscores

Anything that does not fit one of the above tokens will be considered a ‘token mismatch’

Parameters:
  • tokens (Optional[List[Tuple[str, str]]]) – the tokens to use. Should be a list of 2-tuples. Each tuple is a pair of (name, regex) where name is the string name of the token, and regex is a regular expression to find that token. These tuples should be ordered in the preferred order to search for tokens. If None, then this will default to self.DEFAULT_TOKENS (which should be set when defining the class)

  • token_handlers (Optional[Dict[str, Callable[[Dict[str, Any]], Union[None, str]]]]) – optional dictionary mapping token type strings to functions to handle those token types when tokenizing. This is intended to be used when you wish to add entirely new token types not present in bincfg.normalization.base_tokenizer.Tokens. If you wish to change the behavior of handling an already-present token type, just override that token handler function. These will override the default token handlers.

  • insert_special_tokens (bool) – by default, some special tokens will be inserted at the front of tokens (see the ‘special tokens’ listed above). If you wish to stop this from happening, you can set insert_special_tokens to False

  • case_sensitive (bool) – If True, then it is assumed that all regular expressions will exactly match case. If False, then it is assumed that all regular expressions only handle lowercase strings, and all incoming instructions will be converted to lowercase
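
To make the ordered (name, regex) matching and the case_sensitive behavior concrete, here is a small stand-alone scanner. It is only a sketch: the patterns are heavily simplified versions of the documented defaults, the "spacing" and "immediate" entries are hypothetical stand-ins for the BaseTokenizer defaults, and sketch_tokenize is not part of bincfg.

```python
import re

# Simplified stand-ins for the documented (token_type, regex) pairs. The real
# lookaheads are much longer; "spacing" and "immediate" are hypothetical
# extras standing in for the BaseTokenizer defaults.
SKETCH_TOKENS = [
    ("spacing", re.compile(r"[, \t]+")),
    ("immediate", re.compile(r"-?[0-9]+")),
    ("prefix", re.compile(r"wide(?=[, \t.]+|\n|$)")),
    ("opcode", re.compile(r"[a-z][a-z0-9_]*(?=[, \t.]+|\n|$)")),
]


def sketch_tokenize(line, case_sensitive=False):
    """Return (token_type, text) pairs, trying the patterns in list order."""
    if not case_sensitive:
        # The patterns only handle lowercase, so lowercase the input first
        line = line.lower()
    found, pos = [], 0
    while pos < len(line):
        for name, pattern in SKETCH_TOKENS:
            match = pattern.match(line, pos)
            if match:
                if name != "spacing":
                    found.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position
            raise ValueError(f"token mismatch at {pos}: {line[pos:]!r}")
    return found


wide_load = sketch_tokenize("wide iload 300")
not_a_prefix = sketch_tokenize("widen")
```

Because 'prefix' is listed before 'opcode', the word "wide" is claimed as a prefix. The lookahead stops "widen" from being split, so it falls through to the opcode pattern instead.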

ARCHITECTURE = ['java', 'java_bytecode']

The architecture this tokenizer works on

DEFAULT_TOKENS = [('prefix', 'wide(?=[, \\t.]+|[|\\n]|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|[^>])*>|[^>])*>|[^>])*>|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|<(?:(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\\\'[^\\\'\\\\]*(?:\\\\.[^\\\'\\\\]*)*\\\')|#str#|[^>])*>|[^>])*>|[^>])*>|$)')]

save(path)

Module contents