bincfg.normalization package

This subpackage provides classes to tokenize and normalize assembly lines, as well as the ability to easily create new tokenization/normalization methods.

This library currently supports the following architectures:

  • x86/x86_64

  • java

And disassembler output from the following binary analysis tools:

Tokenizer classes convert assembly instructions into lists of individual tokens for later processing. Normalizer classes take those tokens and normalize them to create the final string tokens for later use in models. This normalization process is useful to prevent overfitting and Out of Vocabulary (OOV) problems in machine learning models.

An example of using a default X86BaseNormalizer on some x86_64 assembly:

from bincfg.normalization import X86BaseNormalizer

asm_lines = [
   '0x00402cdd: add    rsp, 0x08',
   '0x00402cf0: push   qword ds:[rip + 0x0000000000252312<absolute=0x0000000000655008>]',
   'CALL   0x0000000000403360'
]
normalizer = X86BaseNormalizer()

for line in asm_lines:
   print(normalizer.normalize(line))

Which would give the output:

>>> add rsp 8
>>> push qword [ rip + 2433810 ]
>>> call 4207456

The BaseNormalizer classes by default do some simple cleaning while keeping all of the information needed for the assembly line itself. For example: removing the instruction’s own memory address if present, converting all immediate values to decimal, removing extra whitespace/commas, etc.

This process is split into two main parts: tokenization, and normalization.

Tokenization

Normalization

Normalizer classes will normalize incoming strings. They do this by first tokenizing the strings (using either a user-defined or default tokenizer), then normalizing that stream of (token_name, token_string) tuples into strings.

Normalization has two possible Tokenization Levels for the incoming strings:

  • ‘op’: opcode/operand level tokenization. Each individual opcode/operand gets normalized into its own token

  • ‘instruction’: instruction level tokenization. Each instruction line gets normalized into a single token, with all opcodes/operands in that instruction joined together, separated by some separator string (defaults to ‘ ‘ for BaseNormalizer, and ‘_’ for all other normalizers)
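The difference between the two levels can be pictured with a small standalone sketch. It does not use bincfg: stringify() is a hypothetical helper that only mimics how already-normalized tokens might be emitted at each level, and it leaves out the 'INSTRUCTION_START' marker that the 'op' level is documented to insert.

```python
from typing import List


def stringify(tokens: List[str], level: str = "instruction", sep: str = "_") -> List[str]:
    """Hypothetical helper: emit one instruction's tokens at the given level."""
    if level == "op":
        # Every opcode/operand stays its own token
        return list(tokens)
    if level == "instruction":
        # The whole instruction collapses into a single joined token
        return [sep.join(tokens)]
    raise ValueError(f"unknown tokenization level: {level!r}")


tokens = ["add", "rsp", "8"]
op_level = stringify(tokens, level="op")
inst_level = stringify(tokens, level="instruction")
base_style = stringify(tokens, level="instruction", sep=" ")

print(op_level)    # ['add', 'rsp', '8']
print(inst_level)  # ['add_rsp_8']
print(base_style)  # ['add rsp 8']
```

The last call uses a single space as the separator, which matches the BaseNormalizer default described above.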

This library has a few built-in normalization methods based on literature:

This module also provides a normalize_cfg_data() function to normalize CFG data.

Custom Normalizers

Creating custom normalizers is quite simple. In fact, several of the built-in normalization techniques take only a few lines of code:

class X86InnerEyeNormalizer(X86BaseNormalizer):
   DEFAULT_TOKENIZATION_LEVEL = TokenizationLevel.INSTRUCTION
   handle_immediate = return_immstr(include_negative=True)
   handle_memory_size = ignore
   handle_function_call = replace_function_call_immediate(FUNCTION_CALL_STR)

Custom normalizers should inherit from BaseNormalizer, and override parent methods to alter functionality. Most methods do exactly as they say, “handling” the tokens in their names:

  • handle_opcode()

  • handle_memory_size()

  • handle_register()

  • handle_immediate()

  • handle_memory_expression()

  • handle_rose_info()

  • handle_ignored()

  • handle_mismatch()

There are some handlers that have slightly different functionality:

  • handle_newline(): this gets called after each full string has been parsed, or a new line character was found, indicating the end of a single assembly instruction. The full instruction will then be parsed, modified if necessary, specific opcodes handled, and converted into the final string (or list of strings if using ‘op’ tokenization level).

  • handle_instruction(): this gets called by handle_newline(). It will parse the full instruction, checking for any specific opcodes that need to be handled. This method does not do any other cleaning/converting of the instruction.

Specific opcodes can be handled differently after the full line has been parsed. The register_opcode_handler() function allows you to pass in a string regular expression to identify the opcodes to handle, and a function to handle those opcodes. There are also a few built-in opcode handler functions:

  • handle_jump(): handles jump instructions

  • handle_call(): handles call instructions

  • ‘nop’ instructions: all ‘nop’ instructions will have everything stripped from them except the ‘nop’ opcode itself, since there is often a large amount of useless/extraneous information alongside those filler instructions
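The regex-based dispatch described above can be pictured with a standalone sketch. Everything in it is hypothetical and independent of bincfg: OpcodeRegistry stands in for a normalizer's handler table, handlers receive (line, idx) instead of a normalizer state, token types are plain strings, and "FUNC" is an invented placeholder. It only shows patterns being tried in registration order and a handler returning either the next index to check or None.

```python
import re
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (token_type, token_string)
Handler = Callable[[List[Token], int], Optional[int]]


class OpcodeRegistry:
    """Hypothetical stand-in for a normalizer's opcode-handler table."""

    def __init__(self) -> None:
        self._handlers: list = []

    def register(self, op_regex: str, func: Handler) -> None:
        # Patterns are kept in the order they were registered
        self._handlers.append((re.compile(op_regex), func))

    def apply(self, line: List[Token]) -> List[Token]:
        idx = 0
        while idx < len(line):
            next_idx = None
            token_type, token = line[idx]
            if token_type == "opcode":
                for pattern, func in self._handlers:
                    if pattern.fullmatch(token):
                        next_idx = func(line, idx)
                        break
            # None means "continue with the token right after the opcode"
            idx = idx + 1 if next_idx is None else max(next_idx, idx + 1)
        return line


def strip_nop(line: List[Token], idx: int) -> Optional[int]:
    # Drop everything after the 'nop' opcode itself
    del line[idx + 1:]
    return None


def mask_call_target(line: List[Token], idx: int) -> Optional[int]:
    # Replace an immediate call target with a placeholder string
    if idx + 1 < len(line) and line[idx + 1][0] == "immediate":
        line[idx + 1] = ("immediate", "FUNC")
    return idx + 2  # handled everything up to, but not including, this index


registry = OpcodeRegistry()
registry.register(r"nop", strip_nop)
registry.register(r"call", mask_call_target)

nop_line = registry.apply([("opcode", "nop"), ("memory_size", "word"), ("register", "ax")])
call_line = registry.apply([("opcode", "call"), ("immediate", "4207456")])
print(nop_line)   # [('opcode', 'nop')]
print(call_line)  # [('opcode', 'call'), ('immediate', 'FUNC')]
```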

Finally, one can add in behavior for brand new token types using the handle_unknown_token() method, which will have passed to it the token_name and token_string whenever an unknown token_name is found. This way, you need not create an entirely new Normalizer class, and can still use BaseNormalizer as a parent, if you wish to add in new token types to parse.

For info on method signatures/expected return values, see their documentation below.

As shown above, you need only set the handler to the desired function to change behavior. This can be done either when building the class definition, or during the __init__ call.

There are multiple utility functions defined under bincfg.normalization.norm_utils that can be used to set the handlers above to different common behaviors without having to implement those functions yourself.

One may also set the DEFAULT_TOKENIZATION_LEVEL attribute on the class definition/instances to change what the default tokenization level behavior will be.

Subpackages

Submodules

bincfg.normalization.base_normalizer module

Classes for normalizing assembly instructions.

class bincfg.normalization.base_normalizer.BaseNormalizer(*args, **kwargs)[source]

Bases: object

A base class for a normalization method.

This should be subclassed once for each new instruction set to create a base normalizer for that instruction set that performs a default ‘unnormalized’ normalization

There are three types of functions that are intended to be overridden when needed:

  1. Token handlers: these functions will start with ‘handle’ and are used to handle either single tokens, or small groups of similar tokens (EG: memory expressions). They should accept both self and ‘state’ as inputs (see bincfg.normalization.base_normalizer.NormalizerState) and can return either a token which will be added to the end of the current line, or None to not add any token post-calling.

  2. Opcode handlers: these functions will start with ‘opcode’ and are used to handle specific opcodes (not the ‘opcode’ token in general, only specific ones like ‘call’ or ‘jump’ opcodes). They should accept both self and ‘state’ as inputs (See bincfg.normalization.base_normalizer.NormalizerState) and can return either the integer index of the next token that should be checked (IE: “we have handled all tokens up to but not including this index”), or None to indicate the previously mentioned index is just one after the opcode. These operate directly on the state’s current ‘.line’ attribute. These are expected to be called only after the entire current line has finished being parsed and normalized. New opcode handlers can be added with self.register_opcode_handler()

  3. Administrative functions: these functions perform different administrative operations before, during, or after normalizing the individual tokens. Some examples include:

    • ‘finalize_instruction’: used as a post-processing function once an instruction has finished being normalized to perform extra processing to the line, apply opcode handlers, stringify the line, update the normalizer state

    • ‘hash_token’: hashes a fully processed string token (if self.anonymize_tokens=True)

    • ‘stringify_line’: takes the current line of token tuples and converts into strings based on self.tokenization_level

Disassembler Information:

Extra information from the disassembler can be inserted into the lines within angle brackets “<>” (see BaseTokenizer() for info on how this can be tokenized). This disassembler info will be treated as a single token, and passed to the self.handle_disassembler_info function. By default, the normalizer will check for the following, in order:

  1. Valid JSON. If the data inside of the angle brackets is valid JSON, then it will be parsed into a JSON object. This JSON object will be inserted into the state.disinfo_json attribute in the normalizer state. There are a few special cases for this JSON data that have special effects by default:

    • If this object is an integer, we will attempt to insert it into a previous immediate value like in #2 below

    • If this is a string, we will always insert it as a string literal like in #3 below

    • If this is a dictionary, there are a few special keys that one can use:

      • ‘immediate’: value should be an integer. We will attempt to insert value into a previous immediate value like in #2 below

      • ‘insert’: this value will be inserted into the string. If it is already a string, it is left as-is. If not a string, then we call repr() on it to convert it into a string. Insertion actions depend on whether or not the key ‘insert_type’ is present.

        If not present, this value will first be tokenized/normalized by this normalizer and that value + token type will be inserted. Should that fail, then the value will be inserted as a string literal WITHOUT processing it as a string literal token (and, it won’t have quotes on it).

        If the ‘insert_type’ key is present, then it can be one of two values:

        • String token_type: the value will be handled as if it is of this token type, no matter what the value actually is, then it will be inserted (assuming that token handler did not return None)

        • False (the JSON object, not the string): the value will be immediately inserted as a string literal WITHOUT processing it as a string literal token (and, it won’t have quotes on it)

      • ‘insert_type’: Determines the token type for an ‘insert’ key value. Ignored if the ‘insert’ key is not present. See the ‘insert’ key for more info

  2. Otherwise, if the disassembler info token starts with an immediate value within the angle brackets, and there is an immediate value token immediately preceding them (ignoring spacing tokens), this will replace said immediate value token with the immediate value found within the disassembler info. The inserted value will first be handled by the appropriate handler for Token.IMMEDIATE token types. EG: “add rax 0xffff <-1>” -> “add rax -1”

  3. Otherwise, if the disassembler info token starts with a string literal, this will insert that string literal right where it appears (and, that string literal will be handled with self.handle_string_literal). The inserted value will first be handled by the appropriate handler for Token.STRING_LITERAL token types.

  4. Finally, if it doesn’t match anything above, then it will fail silently and be ignored. If you wish to raise an error when this happens instead, you can pass raise_unk_di=True when calling .normalize()

The disassembler tokens themselves are always ignored by default.

NOTE: escapes will be treated normally within all strings. EG: ‘\n’ will be considered the newline character, but ‘\\n’ will escape the escape and produce the literal string ‘\n’ (a backslash followed by ‘n’).

NOTE: immediates and string literals must match those found in bincfg.normalization.norm_utils (RE_IMMEDIATE and RE_STRING_LITERAL). The disassembler info does not take into account the regex’s used to parse immediates and string literals for the specific normalizer.
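The order of checks above can be sketched as a small standalone classifier. This is not the library's code: classify_disinfo() is a hypothetical function, IMM_RE and STR_RE are simplified stand-ins for the RE_IMMEDIATE and RE_STRING_LITERAL patterns in norm_utils, and the ValueError is only a placeholder for whatever raise_unk_di=True raises in bincfg. It only decides which rule applies and does not modify any line.

```python
import json
import re
from typing import Any, Tuple

# Simplified stand-ins (assumptions), not the library's actual patterns
IMM_RE = re.compile(r"-?(?:0x[0-9a-fA-F]+|[0-9]+)")
STR_RE = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")


def classify_disinfo(info: str, raise_unk_di: bool = False) -> Tuple[str, Any]:
    """Hypothetical helper: report which of the four rules a '<...>' token hits."""
    body = info.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1].strip()

    # 1. Valid JSON takes priority
    try:
        return ("json", json.loads(body))
    except json.JSONDecodeError:
        pass

    # 2. Otherwise, an immediate value at the start of the info
    match = IMM_RE.match(body)
    if match:
        text = match.group()
        return ("immediate", int(text, 16) if "0x" in text else int(text, 10))

    # 3. Otherwise, a string literal at the start of the info
    match = STR_RE.match(body)
    if match:
        return ("string_literal", match.group())

    # 4. Nothing matched: ignore silently unless asked to raise
    if raise_unk_di:
        raise ValueError(f"unrecognized disassembler info: {info!r}")
    return ("ignored", None)


print(classify_disinfo("<-1>"))                           # ('json', -1)
print(classify_disinfo("<0x10 extra>"))                   # ('immediate', 16)
print(classify_disinfo("<'hello' world>"))                # ('string_literal', "'hello'")
print(classify_disinfo("<absolute=0x0000000000655008>"))  # ('ignored', None)
```

The last call mirrors the `<absolute=...>` annotation in the x86_64 example at the top of this page, which does not appear in the normalized output.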

Parameters:
  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']

The default tokenization level used for this normalizer

add_line_to_sentence(state)[source]

Stringifies the current line, then adds it to the normalized lines and clears state.line

finalize_instruction(state)[source]

Handles an entire instruction once reaching a new line

If overridden, should at the very least:

  • call all the registered opcode handlers for each known opcode token (while updating token_type/token/token_idx)

By default, each opcode handler is expected to take in the current state, and return either the integer index of the next token that should be checked (IE: “we have handled all tokens up to but not including this index”), or None to indicate the previously mentioned index is just one after the opcode

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_all_symbols(state)[source]

Handles symbols (‘+’, ‘[’, ‘]’, ‘*’, ‘:’). Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_branch_prediction(state)[source]

Handles a branch prediction. Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_disassembler_info(state)[source]

Handles disassembler information

See BaseNormalizer() for more info on how disassembler info is parsed.

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_immediate(state)[source]

Handles an immediate value. Defaults to converting into decimal

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_instruction_address(state)[source]

Handles an instruction address. Defaults to ignoring these tokens

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_instruction_prefix(state)[source]

Handles an instruction prefix. Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_memory_size(state)[source]

Handles a memory size. Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_mismatch(state)[source]

What to do when the normalization method finds a token mismatch (in case they were ignored in the tokenizer)

Defaults to raising a TokenMismatchError()

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

Raises:

TokenMismatchError – always

handle_newline(state)[source]

Handles a newline token. Defaults to ignoring the token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_opcode(state)[source]

Handles an opcode. Defaults to returning the original token

NOTE: This should only be used to determine how all opcode strings are handled. For how to handle specific opcodes to give them different behaviors, see register_opcode_handler()

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_register(state)[source]

Handles a register. Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_spacing(state)[source]

Handles spacing. Defaults to ignoring these tokens

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_string_literal(state)[source]

Handles string literals. Defaults to returning the original token as a double-quoted string

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_unknown_token(state)[source]

Handles an unknown token. Defaults to raising an UnknownTokenError

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

Raises:

UnknownTokenError – always

hash_token(token)[source]

Hashes tokens during anonymization

By default, converts each individual token into its 4-byte shake_128 hash

Parameters:

token (str) – the string token to hash

Returns:

the 4-byte shake_128 hash of the given token

Return type:

str
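A 4-byte shake_128 digest can be computed with the standard library as in the sketch below. The UTF-8 encoding and the hexadecimal output format are assumptions. The documentation above only states that the result is a str derived from the 4-byte hash, so the library's exact output may differ.

```python
import hashlib


def hash_token_sketch(token: str) -> str:
    # Assumption: UTF-8 input encoding and hex-encoded output
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(4)


hashed = hash_token_sketch("add_rsp_8")
print(len(hashed))  # 8 hex characters encode the 4 digest bytes
```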

classmethod load(path)

normalize(*strings, cfg=None, block=None, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]

Normalizes the given iterable of strings.

Parameters:
  • strings (str) – arbitrary number of strings to normalize

  • cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • newline_tup (Tuple[str, str], optional) – the tuple to insert between each passed string, or None to not insert anything. Defaults to self.tokenizer.DEFAULT_NEWLINE_TUPLE

  • match_instruction_address (bool, optional) –

    if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any intervening Tokens.SPACING) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:

    • ”0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

    • ” 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]

    • ”0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]

  • kwargs (Any) – extra kwargs to pass along to tokenization method, and to store in normalizer state

Returns:

a list of normalized string instruction lines

Return type:

List[str]

register_opcode_handler(op_regex, func_or_str_name)[source]

Registers an opcode handler for this normalizer

Adds the given op_regex as an opcode to handle during self._handle_instruction() along with the given function to call with token/cfg arguments. op_regex can be either a compiled regular expression, or a string which will be compiled into one. func_or_str_name can either be a callable, or a string. If it’s a string, then that attribute will be looked up on this normalizer dynamically to find the function to use.

Notes for registering opcode handlers:

  1. passing instance method functions converts them to strings automatically

  2. lambdas and inner functions (not defined at global scope) cannot be pickled

  3. opcodes will be matched in the order they were passed in

Parameters:
  • op_regex (Union[str, Pattern]) – a string or compiled regex

  • func_or_str_name (Union[Callable, str]) – the function to call with token/cfg arguments when an opcode matches op_regex, or a string name of a callable attribute of this normalizer to be looked up dynamically

renormalizable = False

Whether or not this normalization method can be renormalized later by other normalization methods

save(path)

stringify_line(state)[source]

Converts the current line into a list of final normalized string tokens and returns that list

Also normalizes the case, converting all tokens (except those in strings) to lowercase

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

Returns:

a list of tokens to add to state.normalized_lines

Return type:

List[str]

token_sep = None

The separator string used for this normalizer

Will default to ‘ ‘

tokenization_level = ['auto', 'automatic', 'default']

The tokenization level to use for this normalizer

tokenize(*strings, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]

Tokenizes the given strings using this normalizer’s tokenizer

See the docs for BaseTokenizer() for more info on how tokenization works, how to create subclasses, etc.

Parameters:
  • strings (str) – arbitrary number of strings to tokenize.

  • newline_tup (Optional[Tuple[str, str]]) – the tuple to insert between each passed string, or None to not insert anything. Defaults to self.__class__.DEFAULT_NEWLINE_TUPLE.

  • match_instruction_address (bool, optional) –

    if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any intervening Tokens.SPACING) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:

    • ”0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

    • ” 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]

    • ”0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]

  • kwargs (Any) – extra kwargs to store in the tokenizer state, for use in child classes

Returns:

list of (token_type, token) tuples

Return type:

List[Tuple[str, str]]
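The instruction-address matching described for match_instruction_address can be illustrated with a standalone sketch that reproduces the three documented examples. It is not the library's implementation: match_instruction_address_sketch() is a hypothetical function, it uses lowercase strings in place of the Tokens enum, and it handles only the start of a single line.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token_type, token_string)


def match_instruction_address_sketch(tokens: List[Token]) -> List[Token]:
    out: List[Token] = []
    i = 0
    # Leading spacing is kept as-is and ignored for matching purposes
    while i < len(tokens) and tokens[i][0] == "spacing":
        out.append(tokens[i])
        i += 1
    if i < len(tokens) and tokens[i][0] == "immediate":
        text = tokens[i][1]
        j = i + 1
        between = ""
        # Collect any spacing between the immediate and a possible colon
        while j < len(tokens) and tokens[j][0] == "spacing":
            between += tokens[j][1]
            j += 1
        if j < len(tokens) and tokens[j][0] == "colon":
            # Fold the spacing and the colon into the address token
            text += between + tokens[j][1]
            i = j + 1
        else:
            i += 1
        out.append(("instruction_address", text))
    out.extend(tokens[i:])
    return out


with_colon = match_instruction_address_sketch(
    [("immediate", "0x1234"), ("colon", ":"), ("spacing", " "), ("opcode", "add")]
)
print(with_colon[0])  # ('instruction_address', '0x1234:')
```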

tokenizer = None

The tokenizer used for this normalizer

class bincfg.normalization.base_normalizer.MetaNorm(name, bases, dct)[source]

Bases: type

A metaclass for BaseNormalizer.

The Problem:

If you change instance functions within the __init__ method (EG: see the SAFE _handle_immediate() function being changed in __init__), then ‘self’ will not automatically be passed to those functions.

NOTE: this is specifically useful when the effect of a normalization method depends on parameters sent to the instance, not inherent to the class

NOTE: this is not the case for any functions that are set during class initialization (EG: outside of the __init__() block)

So, any functions changed within __init__ methods must be altered to also pass ‘self’. I ~could~ force the users to have to call a ‘__post_init__()’ function or something, but can we count on them (IE: myself) to always do that?…

The Solution:

This metaclass inserts extra code before and after any normalizer’s __init__ method is called. That code keeps track of all instance functions before initialization, and checks to see if any of them change after initialization. This means someone re-set a function within __init__ (IE: self._handle_immediate = …). When this happens, ‘self’ will not automatically be passed when that function is called. These functions are then wrapped to also automatically pass ‘self’.

NOTE: to determine if a function changes, we just check equality between previous and new functions using getattr(self, func_name). I don’t know why basic ‘==’ works but ‘is’ and checking id’s do not, but I’m not going to question it…

NOTE: We also have to keep track of the instance functions as an instance variable in case a parent class needs their function updated, or if a child class also changes a parent class’s function in init

NOTE: this will mean you cannot call all of that class’s methods and expect them to always be the same as calling instance methods if you change functions in __init__
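The underlying Python behavior can be reproduced in a few lines. The sketch below is independent of bincfg: the class and function names are invented, and rebinding with types.MethodType is just one way to pass ‘self’ again, not necessarily how MetaNorm wraps functions.

```python
import types


def scale_immediate(self, value):
    # A handler written like a method: it expects 'self' as its first argument
    return value * self.factor


class PlainNormalizer:
    def __init__(self, factor):
        self.factor = factor
        # Stored in the instance __dict__, so Python does NOT bind 'self'
        self.handle_immediate = scale_immediate


class WrappedNormalizer(PlainNormalizer):
    def __init__(self, factor):
        super().__init__(factor)
        # Rebind explicitly so 'self' is passed on every call
        self.handle_immediate = types.MethodType(self.handle_immediate, self)


error_message = None
try:
    PlainNormalizer(2).handle_immediate(21)
except TypeError as err:
    # 21 was taken as 'self', so 'value' is reported as missing
    error_message = str(err)

result = WrappedNormalizer(2).handle_immediate(21)
print(result)  # 42
```

Functions assigned in the class body do not have this problem, because attribute lookup on the class binds them to the instance automatically.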

class bincfg.normalization.base_normalizer.NormalizerState(**kwargs)[source]

Bases: object

A class that contains information during a normalizer’s normalization process

block = None

the CFGBasicBlock that this token belongs to, or None if not using

Type:

Optional[bincfg.CFGBasicBlock]

cfg = None

the CFG that this token’s basic block belongs to, or None if not using

Type:

Optional[bincfg.CFG]

copy()[source]

Returns a copy of this state, but doesn’t copy cfg or block

copy_set(**kwargs)[source]

Copies this state, then updates all the given parameters

disinfo_json = None

the parsed json from a disinfo object

Type:

Optional[JSONObject]

handlers = None

dictionary of current token handler functions

Type:

Dict[str, Callable[[NormalizerState], Union[str, None]]]

kwargs = None

dictionary of extra kwargs for use in tokenization, or child classes

Type:

Dict

line = None

list of all TokenTuple’s in this current line. TokenTuple = (token_type [from bincfg.normalization.base_tokenizer.Tokens enum], new_token_string, original_token_string)

Type:

List[Tuple[str, str, str]]

match_instruction_address = None

whether or not we are matching instruction addresses at the beginning of assembly lines. This is very likely always True

Type:

bool

memory_start = None

the index of the start of the current memory expression, or None if we are not in a memory expression currently

Type:

Optional[int]

newline_tup = None

the newline tuple being used (token_type [probably Tokens.NEWLINE], token_string), or None if not using

Type:

Optional[Tuple[str, str]]

normalized_lines = None

list of all currently normalized lines/tokens (depending on self.tokenization_level)

Type:

List[str]

orig_token = None

The current string token being normalized

Type:

str

raw_strings = None

list of all of the raw strings passed to the current .normalize() call

Type:

List[str]

set(**kwargs)[source]

Sets the given kwargs on this object’s attribute dictionary

token = None

The current processed version of token if it has already been partially or fully normalized, or None if not

Type:

str

token_idx = None

The index of the current token in ‘line’

Type:

int

property token_tuple

Returns (token_type, token, orig_token)

token_type = None

The token type of the current token, see bincfg.normalization.base_tokenizer.Tokens

Type:

str

bincfg.normalization.base_tokenizer module

Class for tokenizing assembly lines, as well as other tokenization constants

class bincfg.normalization.base_tokenizer.Architectures(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Known (but not necessarily supported) architectures

JAVA = ['java', 'java_bytecode']

X86 = ['x86', 'i686', 'x86_64']

class bincfg.normalization.base_tokenizer.BaseTokenizer(*args, **kwargs)[source]

Bases: object

A default class to tokenize instructions

Should be subclassed once for each instruction set, providing the tokens being used.

Many functions may be overridden to change tokenization behavior. These functions all start with the name token_… and take as input a single state dictionary and return either a string for the next token to append to the current line being tokenized, or None to not add anything to the line. The state dictionary contains the following:

  • ‘tokenizer’ (BaseTokenizer): this tokenizer

  • ‘kwargs’ (Dict[str, Any]): dictionary of extra kwargs passed to the initial call to the tokenize function

  • ‘all_strings’ (List[str]): list of input strings (args) passed to the initial call to the tokenize function

  • ‘token_handlers’ (Dict[str, Callable[]]): dictionary mapping token types to the function that handles that token

  • ‘sentence’ (List[Tuple[str, str]]): list of processed token tuples to return, each a tuple of (token_name, token)

  • ‘newline_tup’ (Union[None, Tuple[str, str]]): token tuple to add at the end of each line to indicate a new line

  • ‘match_instruction_address’ (bool): whether or not we are matching instruction addresses

  • ‘split_imm’ (bool): whether or not we are currently handling an immediate token that was split

  • ‘line’ (List[Tuple[str, str]]): the current line of tokens we are working on

  • ‘string’ (str): the current string being tokenized

  • ‘token_type’ (str): the type of the ‘token’, should be from bincfg.normalization.base_tokenizer.Tokens

  • ‘token’ (str): the currently matched token string

  • ‘match’ (re.Match): the re match object that matched this token

Some extra functions are available for overriding including:

  • handle_line(): called at the end of each line being tokenized (an individual string passed to the tokenizer)

  • handle_sentence(): called at the end of each sentence being tokenized (aggregation of all lines passed to the tokenizer)

Each instruction set architecture (ISA) should have its own Tokenizer class that inherits from BaseTokenizer. The tokenization process uses python’s re module to perform tokenization, converting strings into streams of (token_name, token_string) tuples. For more information on how to use regex to create tokenizers, see: https://docs.python.org/3/library/re.html#writing-a-tokenizer

TOKENIZATION PROCESS

  1. Clean the incoming instruction strings using the passed clean_instruction_func

  2. Iterate through the strings finding all tokens

    1. Each token is sent to its corresponding token handler function

    2. At the end of each ‘line’ (EG: end of a passed string, reaching Tokens.NEWLINE token, etc.), that line is handled with the handle_line() function

    3. All tokens are added to the same return ‘sentence’, even if multiple strings in strings were passed

  3. After all strings have been tokenized and lines handled, the final return ‘sentence’ is sent to handle_sentence()
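The process above can be sketched with a toy stand-in class. This is an illustration under simplifying assumptions, not the real BaseTokenizer: MiniTokenizer, its two token types, and clean_instruction() are hypothetical, lines end only at the end of each passed string, and no newline tuple is inserted between lines.

```python
import re


class MiniTokenizer:
    """Toy stand-in that mirrors the steps above (not the real BaseTokenizer)."""

    # Hypothetical (name, regex) pairs; each name needs a token_<name> handler
    TOKENS = [("spacing", r"[ ,\t]+"), ("word", r"[^ ,\t]+")]

    def __init__(self):
        self._regex = re.compile(
            "|".join(f"(?P<{name}>{regex})" for name, regex in self.TOKENS)
        )

    def clean_instruction(self, string):
        return string.strip()

    def token_spacing(self, state):
        return state["token"]

    def token_word(self, state):
        return state["token"]

    def handle_line(self, state):
        return state["line"]

    def handle_sentence(self, state):
        return state["sentence"]

    def tokenize(self, *strings):
        state = {"tokenizer": self, "all_strings": strings, "sentence": []}
        for string in strings:
            # Step 1: clean the incoming string
            state["string"] = self.clean_instruction(string)
            state["line"] = []
            # Step 2: find all tokens and send each one to its handler
            for match in self._regex.finditer(state["string"]):
                state.update(token_type=match.lastgroup, token=match.group(), match=match)
                handler = getattr(self, "token_" + state["token_type"])
                new_token = handler(state)
                if new_token is not None:  # None means "append nothing"
                    state["line"].append((state["token_type"], new_token))
            # End of a passed string: handle the line, add it to the sentence
            state["sentence"].extend(self.handle_line(state))
        # Step 3: handle the complete sentence
        return self.handle_sentence(state)


class LowercaseNoSpacing(MiniTokenizer):
    """Overrides two handlers: drop spacing tokens, lowercase everything else."""

    def token_spacing(self, state):
        return None

    def token_word(self, state):
        return state["token"].lower()
```

Calling LowercaseNoSpacing().tokenize("ADD rax, rbx", "push rax") returns one flat list of tuples covering both strings, which is the "same return sentence" behavior described in step 2.3.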

SPECIAL TOKENS

There are some ‘special tokens’ that are assumed to exist for all ISAs, as they are a part of the tokenization process itself. These tokens are inserted at the beginning of the passed tokens parameter (IE: they are the first tokens searched for), except for the ‘mismatch’ token, which is inserted at the end. They are inserted in the following order:

  1. String literals (Tokens.STRING_LITERAL) - matches strings which can start/end with matching single or double quotes, can escape inner quotes with \' or \", and can escape the escape character with \\. Any extra escape characters (not directly before a ' or " or \) will be left as-is.

  2. Disassembler information (Tokens.DISASSEMBLER_INFO) - matches disassembler information of the form “<…>”. This info must be within open/close angle brackets. It is also possible to nest angle brackets within the disassembler info up to a maximum current depth of 3. IE: we can match the following:

    • “<no angle brackets inside>” - depth of 1

    • “<angle <brackets> depth <2>>” - depth of 2

    • “<level <3 angle <bracket>> depth>” - depth of 3

    We also do not check that every open has a matching close, just that every close has a matching open. So, the following could still be matched:

    • “<lots of <<<<<<< things>”

    However, missing or unmatched ending angle brackets will fail, as well as very deep nesting:

    • “<” : no matching ‘>’ only for the first occurrence of ‘<’

    • “<data>>” : no matching ‘<’ for both of the ‘>’ brackets

    • “<super<deep<nested<…<thing>>…>>” : too large nesting depth

    String literals are checked first within the disassembler info so that any end brackets ‘>’ within the strings won’t affect the parsing of the disassembler info.

    This limit on nesting depth exists because python’s re engine cannot handle recursive matching of nested brackets, and I can’t think of any way to implement it entirely within regular expressions (which is needed in order to continue using the python re tokenization method). I don’t see any reason why a larger depth would be needed: a depth of 3 already handles more than what I would expect as output from disassemblers, and users inserting information themselves can place it within the brackets using a different delimiter and parse it by overriding token_disassembler_info() and handle_disassembler_info() in the Tokenizer and Normalizer classes respectively. If a larger depth is needed, one can manually alter the _DIS_INFO_MAX_REC_DEPTH variable at the top of this file. This increases the valid nesting depth at the cost of slower regular expression matching for disassembler info.

  3. Instruction start token “#start_instr#” (Tokens.INSTRUCTION_START) - used to determine when instructions start/stop when using an op-level tokenization scheme. When tokenizing, we need to know when a new instruction is started to decide if an immediate value found should be considered an instruction address or just a plain immediate. New instructions occur whenever we reach a newline token, an instruction start token, or the start of a new string passed in the args of the tokenize() method. This instruction start token is removed when found, and won’t appear during normalization.

  4. Split immediate token “#split_imm#” (Tokens.SPLIT_IMMEDIATE) - used to designate a split immediate value. This is useful for reducing the number of unique tokens present while keeping full immediate information. When using split immediates during normalization, immediate values with more digits than some threshold will be split into multiple immediate tokens and placed one after the other, prepended with this “#split_imm#” token. To keep that output renormalizable, the tokenizer, when finding one of these split immediate tokens, will concatenate all of the following immediate tokens until reaching some non-immediate (and non-spacing) token to rebuild the original immediate token. This split immediate token is removed when found, and won’t appear during normalization.

  5. Plus sign (Tokens.PLUS_SIGN) - ‘+’

  6. Times sign (Tokens.TIMES_SIGN) - ‘*’

  7. Open bracket (Tokens.OPEN_BRACKET) - ‘[’

  8. Close bracket (Tokens.CLOSE_BRACKET) - ‘]’

  9. Colon (Tokens.COLON) - ‘:’

  10. Spacing (Tokens.SPACING) - One or more space ‘ ‘, comma ‘,’, or tab ‘\t’ characters in a row

  11. Newline (Tokens.NEWLINE) - Either the newline character ‘\n’ or a pipe character ‘|’

  12. Immediate values (Tokens.IMMEDIATE) - any integer immediate value in hex, decimal, octal, or binary. Hex values must start with ‘0x’, octal with ‘0o’, and binary with ‘0b’

  13. Mismatch token (Tokens.MISMATCH) - matches any character. Inserted at the very end of tokens and is used to designate the start of an unknown token or character so that it can be handled (by default, an error is raised)
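The bounded nesting depth described for Tokens.DISASSEMBLER_INFO follows from how nesting has to be emulated in a non-recursive regex: the pattern is expanded once per allowed level. The sketch below shows that technique only. build_nested_angle_regex() is a hypothetical helper, and the pattern it builds is stricter than the behavior described above: it requires every open bracket to have a matching close and does not treat string literals specially.

```python
import re


def build_nested_angle_regex(max_depth):
    """Build a regex matching '<...>' with at most max_depth levels of nesting."""
    # Depth 1: no angle brackets allowed inside
    pattern = r"<[^<>]*>"
    # Each extra level allows the previous pattern to appear inside the brackets
    for _ in range(max_depth - 1):
        pattern = r"<(?:[^<>]|" + pattern + r")*>"
    return pattern


DEPTH_3 = re.compile(build_nested_angle_regex(3))

matches_depth_2 = DEPTH_3.fullmatch("<angle <brackets> depth <2>>") is not None
matches_depth_4 = DEPTH_3.fullmatch("<a<b<c<d>>>>") is not None
```

The pattern grows with every extra level, which is consistent with the note above that raising the maximum depth trades matching speed for deeper nesting.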

If you wish to keep some of the above tokens, but overwrite others, you can set that token’s regex in the passed tokens parameter, and that will overwrite these special tokens. You may also set it to None to not insert it at all.
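One way to picture the override rule is as a merge of two (name, regex) lists. The sketch below is an assumption about how such a merge could work, not the library's code: merge_tokens(), the SPECIAL_FRONT and SPECIAL_BACK lists, and their regexes are hypothetical, and keeping an overridden token at the special token's position is a choice made for this example.

```python
# Hypothetical subset of the special tokens (names from Tokens, regexes assumed)
SPECIAL_FRONT = [("plus_sign", r"\+"), ("times_sign", r"\*"), ("colon", r":")]
SPECIAL_BACK = [("mismatch", r".")]


def merge_tokens(user_tokens, insert_special_tokens=True):
    """Combine user (name, regex) pairs with the special tokens."""
    if not insert_special_tokens:
        return [(name, regex) for name, regex in user_tokens if regex is not None]

    user = dict(user_tokens)
    special_names = {name for name, _ in SPECIAL_FRONT + SPECIAL_BACK}

    def resolve(specials):
        resolved = []
        for name, regex in specials:
            if name in user:
                regex = user[name]  # user-supplied regex overrides the default
            if regex is not None:  # None means "do not insert this token"
                resolved.append((name, regex))
        return resolved

    # Non-special user tokens keep their order, between the front and back specials
    middle = [
        (name, regex)
        for name, regex in user_tokens
        if name not in special_names and regex is not None
    ]
    return resolve(SPECIAL_FRONT) + middle + resolve(SPECIAL_BACK)


merged = merge_tokens([("opcode", r"[a-z]+"), ("colon", None), ("plus_sign", r"[+]")])
```

Here the colon token is dropped, the plus sign uses the user's regex, and the mismatch token still comes last.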

INSTRUCTION ADDRESSES

If match_instruction_address=True when tokenizing, the tokenizer will attempt to match instruction addresses at the beginning of each line. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended to that Tokens.INSTRUCTION_ADDRESS token, removing any Tokens.SPACING in between them. For example, using the x86 tokenization scheme:

  • “0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

  • “ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

  • “0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
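The three examples above can be reproduced with a small post-processing pass over one line of token tuples. This is a sketch of the described rule, not the tokenizer's own code: mark_instruction_address() is a hypothetical helper, and the string constants mirror the values listed for Tokens.

```python
IMMEDIATE = "immediate"
SPACING = "spacing"
COLON = "colon"
INSTRUCTION_ADDRESS = "inst_addr"


def mark_instruction_address(line):
    """Convert a leading immediate (plus an optional colon) into an address token."""
    out = []
    i = 0
    # Leading spacing is ignored when looking for the address, but kept in the output
    while i < len(line) and line[i][0] == SPACING:
        out.append(line[i])
        i += 1
    if i < len(line) and line[i][0] == IMMEDIATE:
        addr = line[i][1]
        j = i + 1
        # Skip spacing between the immediate and a possible colon
        while j < len(line) and line[j][0] == SPACING:
            j += 1
        if j < len(line) and line[j][0] == COLON:
            addr += line[j][1]  # append the colon, dropping the skipped spacing
            i = j + 1
        else:
            i += 1  # no colon: spacing after the address is left untouched
        out.append((INSTRUCTION_ADDRESS, addr))
    out.extend(line[i:])
    return out


line = [
    (SPACING, " "), (IMMEDIATE, "0x1234"), (SPACING, " "), (COLON, ":"),
    (SPACING, " "), ("opcode", "add"),
]
result = mark_instruction_address(line)
```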

Parameters:
  • tokens (Optional[List[Tuple[str, str]]]) – the tokens to use. Should be a list of 2-tuples. Each tuple is a pair of (name, regex) where name is the string name of the token, and regex is a regular expression to find that token. These tuples should be ordered in the preferred order to search for tokens. If None, then this will default to self.DEFAULT_TOKENS (which should be set when defining the class)

  • token_handlers (Optional[Dict[str, Callable[[Dict[str, Any]], Union[None, str]]]]) – optional dictionary mapping token type strings to functions to handle those token types when tokenizing. This is intended to be used when you wish to add entirely new token types not present in bincfg.normalization.base_tokenizer.Tokens. If you wish to change the behavior of handling an already-present token type, just override that token handler function. These will override the default token handlers.

  • insert_special_tokens (bool) – by default, some special tokens will be inserted at the front of tokens (see the ‘special tokens’ listed above). If you wish to stop this from happening, you can set insert_special_tokens to False

  • case_sensitive (bool) – If True, then regular expressions will be matched exactly as they appear. If False, then the re.IGNORECASE flag will be passed when compiling the regular expressions

ARCHITECTURE = None

The architecture this tokenizer works on

DEFAULT_NEWLINE_TUPLE = ('newline', '\n')

The default (token_type, token) tuple to use for newlines

handle_line(state)[source]

Handles a single line (one string passed to the tokenizer)

Each line may itself contain newline tokens, but no newline_tup tuples will have been inserted yet.

Subclasses may override this function for more behavior, but it defaults to just returning the passed line.

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

list of (token_type, token) tuples for this line

Return type:

List[Tuple[str, str]]

handle_sentence(state)[source]

Handles an entire sentence (aggregation of all strings passed to one call of this tokenizer)

Between each line, a newline_tup will have already been inserted (if one is being used)

Subclasses may override this function for more behavior, but it defaults to just returning the passed sentence

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

the final list of tokens

Return type:

List[Tuple[str, str]]

classmethod load(path)

save(path)

token_all_symbols(state)[source]

Handles all symbol tokens (‘+’, ‘*’, ‘[’, ‘]’, ‘:’)

This can be overridden by subclasses for more functionality, but defaults to just returning the original token, except for colons ‘:’, for which we check if the previous non-spacing token was an immediate value. If so, and match_instruction_address is True, then we append any in-between spacing and the colon to that immediate and replace its type with Tokens.INSTRUCTION_ADDRESS.

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_branch_prediction(state)[source]

Handles any branch_prediction tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_disassembler_info(state)[source]

Handles any disassembler information tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_immediate(state)[source]

Handles any immediate tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_instruction_address(state)[source]

Handles any instruction address tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_instruction_prefix(state)[source]

Handles any instruction_prefix tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_memory_size(state)[source]

Handles any memory_size tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_mismatch(state)[source]

What to do when there is a token mismatch in a string

This can be overridden by subclasses for more functionality, but defaults to raising a TokenMismatchError with info on the mismatch

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Raises:

TokenMismatchError – by default

token_newline(state)[source]

Handles any newline tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_opcode(state)[source]

Handles any opcode tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_register(state)[source]

Handles any register tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_spacing(state)[source]

Handles any spacing tokens

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_string_literal(state)[source]

Handles any string literals

This can be overridden by subclasses for more functionality, but defaults to just returning the original token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Returns:

either a string token for the next token to append to line, or None to not append anything

Return type:

Union[str, None]

token_unknown(state)[source]

What to do when there is a token type that we don’t know how to handle

This can be overridden by subclasses for more functionality, but defaults to raising an UnknownTokenError with info on the unknown token

Parameters:

state (Dict) – dictionary of current state. See BaseTokenizer() for more info

Raises:

UnknownTokenError – by default

tokenize(*strings, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]

Tokenizes the input

Subclasses should override any self.token_* methods they wish to inject behavior into. Each of those functions takes a ‘state’ dictionary as input and should return either a string token to append to the current line, or None to not append anything.

See the docs for BaseTokenizer() for more info on how tokenization works, how to create subclasses, etc.

Parameters:
  • strings (str) – arbitrary number of strings to tokenize.

  • newline_tup (Optional[Tuple[str, str]]) – the tuple to insert in between each passed string, or None to not insert anything. Defaults to self.__class__.DEFAULT_NEWLINE_TUPLE.

  • match_instruction_address (bool, optional) –

    if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended to that Tokens.INSTRUCTION_ADDRESS token, removing any Tokens.SPACING in between them. For example, using the x86 tokenization scheme:

    • “0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

    • “ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

    • “0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]

  • kwargs (Any) – extra kwargs to store in the tokenizer state, for use in child classes

Returns:

list of (token_type, token) tuples

Return type:

List[Tuple[str, str]]
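The newline_tup=<object object> default in the signature is the usual sentinel-object pattern: it lets the function tell "argument not passed" (use the default tuple) apart from an explicit None (insert nothing). A minimal sketch of that pattern follows. _NOT_GIVEN and join_lines() are hypothetical names, and the default tuple is copied from DEFAULT_NEWLINE_TUPLE above.

```python
_NOT_GIVEN = object()  # unique sentinel, distinct from None
DEFAULT_NEWLINE_TUPLE = ("newline", "\n")


def join_lines(lines, newline_tup=_NOT_GIVEN):
    """Flatten lists of token tuples, inserting newline_tup between lines."""
    if newline_tup is _NOT_GIVEN:
        newline_tup = DEFAULT_NEWLINE_TUPLE  # argument omitted: use the default
    sentence = []
    for index, line in enumerate(lines):
        if index > 0 and newline_tup is not None:  # explicit None: insert nothing
            sentence.append(newline_tup)
        sentence.extend(line)
    return sentence


lines = [[("opcode", "nop")], [("opcode", "ret")]]
with_default = join_lines(lines)
without_newlines = join_lines(lines, newline_tup=None)
```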

exception bincfg.normalization.base_tokenizer.TokenMismatchError[source]

Bases: Exception

class bincfg.normalization.base_tokenizer.TokenizationLevel(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Different levels to perform tokenization

AUTO = ['auto', 'automatic', 'default']
INSTRUCTION = ['inst', 'instruction', 'line', 'instructions', 'lines']
OPCODE = ['op', 'opcode', 'operand', 'opcodes', 'operands']

class bincfg.normalization.base_tokenizer.Tokens[source]

Bases: object

BRANCH_PREDICTION = 'branch_prediction'
CLOSE_BRACKET = 'close_bracket'
COLON = 'colon'
DISASSEMBLER_INFO = 'disassembler_info'
IMMEDIATE = 'immediate'
INSTRUCTION_ADDRESS = 'inst_addr'
INSTRUCTION_PREFIX = 'prefix'
INSTRUCTION_START = 'inst_start'
MEMORY_EXPRESSION = 'memory_expression'
MEMORY_SIZE = 'memory_size'
MISMATCH = 'mismatch'
NEWLINE = 'newline'
OPCODE = 'opcode'
OPEN_BRACKET = 'open_bracket'
PLUS_SIGN = 'plus_sign'
REGISTER = 'register'
SEGMENT_ADDRESS = 'segment_address'
SPACING = 'spacing'
SPLIT_IMMEDIATE = 'split_imm'
STRING_LITERAL = 'string_literal'
TIMES_SIGN = 'times_sign'

exception bincfg.normalization.base_tokenizer.UnknownTokenError[source]

Bases: Exception

bincfg.normalization.base_tokenizer.get_architecture(arch: str | Architectures) → Architectures[source]

Returns the architecture

Parameters:

arch (Union[str, Architectures])

bincfg.normalization.base_tokenizer.parse_tokenization_level(tokenization_level, auto_tl)[source]

Returns the bincfg.TokenizationLevel enum based on the given tokenization_level.

Parameters:
  • tokenization_level (Union[bincfg.TokenizationLevel, str]) – either a string tokenization level, or a member of the bincfg.TokenizationLevel enum

  • auto_tl (bincfg.TokenizationLevel) – the default tokenization level to use if we get an ‘auto’ tokenization level

Returns:

a member of the bincfg.TokenizationLevel enum

Return type:

bincfg.TokenizationLevel
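Since each TokenizationLevel member stores a list of accepted aliases, parsing a string amounts to finding the member whose list contains it and then substituting the fallback for ‘auto’. The sketch below shows that lookup. Level and parse_level() are hypothetical stand-ins that copy the alias lists above, and the case-insensitive comparison and the error types are assumptions.

```python
from enum import Enum


class Level(Enum):
    """Stand-in for TokenizationLevel, using the alias lists shown above."""
    AUTO = ['auto', 'automatic', 'default']
    INSTRUCTION = ['inst', 'instruction', 'line', 'instructions', 'lines']
    OPCODE = ['op', 'opcode', 'operand', 'opcodes', 'operands']


def parse_level(level, auto_level):
    """Resolve a string or Level into a concrete Level, replacing AUTO with auto_level."""
    if isinstance(level, Level):
        member = level
    elif isinstance(level, str):
        name = level.strip().lower()  # assumption: aliases are case-insensitive
        matches = [m for m in Level if name in m.value]
        if not matches:
            raise ValueError(f"Unknown tokenization level: {level!r}")
        member = matches[0]
    else:
        raise TypeError(f"Cannot parse tokenization level of type {type(level).__name__}")
    return auto_level if member is Level.AUTO else member


resolved = parse_level("lines", auto_level=Level.OPCODE)
```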

bincfg.normalization.multi_normalizer module

Class that can use multiple normalization methods

class bincfg.normalization.multi_normalizer.MultiNormalizer(*normalizers)[source]

Bases: object

A normalizer that can work with multiple sub-normalizers based on architecture

This does not inherit from BaseNormalizer, and thus you cannot modify or call most normalizer functions from this normalizer itself. It essentially just acts as a wrapper around multiple different normalizers.

Parameters:

normalizers (BaseNormalizer) – One or more normalizers to use together. May only use one per architecture.

normalize(*strings, cfg=None, block=None, newline_tup=<object object>, match_instruction_address=True)[source]

Normalizes the given iterable of strings.

Parameters:
  • strings (str) – arbitrary number of strings to normalize

  • cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • newline_tup (Tuple[str, str], optional) – the tuple to insert in between each passed string, or None to not insert anything. Defaults to self.tokenizer.DEFAULT_NEWLINE_TUPLE

  • match_instruction_address (bool, optional) –

    if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any Tokens.SPACING in between) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:

    • “0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]

    • “ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]

    • “0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]

  • kwargs (Any) – extra kwargs to pass along to tokenization method, and to store in normalizer state

Returns:

a list of normalized string instruction lines

Return type:

List[str]

bincfg.normalization.norm_funcs module

bincfg.normalization.norm_funcs.identity(self, state)[source]

Returns the original token

bincfg.normalization.norm_funcs.ignore(self, state)[source]

Ignores information (if using for rose info, then it will also ignore negatives)

bincfg.normalization.norm_funcs.replace_function_call_immediate(*args)[source]

Builds a function that replaces function call immediate values with the given replacement string

This will return a function to be called as a part of a normalizer. This only takes one argument: the replacement string. If no arguments are passed, then the replacement string will default to ‘func’

NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result

Parameters:

args – args for this function. Ideally either empty to use default function call string, or a string to replace all function calls with.

Returns:

either a function that will handle function calls (if this function was called correctly), or a handled function call

Return type:

Union[Callable[…, None], None]
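The higher-order behavior described in the NOTE above can be sketched as follows. This is an illustration of the calling convention only: replace_call_sketch() is a hypothetical name, the inner function returns the replacement string instead of editing normalizer state, and the (self, state) arguments are placeholders.

```python
def replace_call_sketch(*args):
    """Return a handler, or act as the handler itself if called with (self, state)."""

    def _repl_func(self, state, repl="func"):
        # A real handler would use `state`; this sketch only returns the replacement
        return repl

    if len(args) > 1:
        # Multiple args: assume we were used directly as the handler
        return _repl_func(*args)

    # Zero or one arg: build a handler bound to the chosen replacement string
    repl = args[0] if args else "func"
    return lambda self, state: _repl_func(self, state, repl)


default_handler = replace_call_sketch()
custom_handler = replace_call_sketch("CALL")
direct_result = replace_call_sketch(None, {})  # forgot the extra call
```

The same convention is described for the other replace_* builders in this module, so a handler can be registered either as replace_x or as replace_x() with the same default result.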

bincfg.normalization.norm_funcs.replace_immediate(*args, include_negative=False)[source]

Builds a function that replaces immediate values with the IMMEDIATE_VALUE_STR.

This will return a function to be called as a part of a normalizer. This function takes no arguments and only 1 keyword argument: whether or not to include a negative sign ‘-’ in front of the immediate string when the input is negative.

NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result

Parameters:
  • args – args for this function. Ideally empty

  • include_negative (bool, optional) – if True, will include a negative sign in front of the returned immediate string when the input is negative. Defaults to False.

Returns:

either a function that will handle immediate strings (if this function was called correctly), or a handled immediate string

Return type:

Union[Callable[…, str], str]

bincfg.normalization.norm_funcs.replace_jump_destination(self, state)[source]

Replaces the jump destination immediate with ‘jmpdst’ iff the jump destination is an immediate value, not a segment address

Parameters:
  • idx (int) – the index in line of the ‘jump’ opcode

  • line (List[TokenTuple]) – a list of (token_type, token) tuples. the current assembly line

Returns:

integer index in line of last handled token

Return type:

int

bincfg.normalization.norm_funcs.replace_memory_expression(*args)[source]

Builds a function that replaces memory expressions with the given replacement string

This will return a function to be called as a part of a normalizer. This only takes one argument: the replacement string. If no arguments are passed, then the replacement string will default to ‘memexpr’

NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result

Parameters:

args – args for this function. Ideally either empty to use default memory expression string, or a string to replace all memory expressions with.

Returns:

either a function that will handle memory expressions (if this function was called correctly), or a handled memory expression

Return type:

Union[Callable[…, None], None]

bincfg.normalization.norm_funcs.replace_string_literal(*args, replace_previous_immediate=False)[source]

Builds a function that replaces string literal values with the string ‘str’

This will return a function to be called as a part of a normalizer. This function takes no arguments and only 1 keyword argument: whether to replace the previous immediate, or keep it and add in a ‘str’ string

NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result

Parameters:
  • args – args for this function. Ideally empty

  • replace_previous_immediate (bool) – if True, then any previous immediate value will be replaced with the ‘str’ string, otherwise the ‘str’ string will just be added

Returns:

either a function that will handle string literals (if this function was called correctly), or a handled string literal

Return type:

Union[Callable[…, str], str]

bincfg.normalization.norm_funcs.return_dispmem(self, state)[source]

Replaces memory addressing displacement values with the string ‘dispmem’

bincfg.normalization.norm_funcs.return_token(self, state)[source]

Returns the original token

bincfg.normalization.norm_funcs.special_function_call(self, state, ret_only_call_type=False)[source]

Handles special function calls

Special external functions have their name kept. Recursive calls are replaced with ‘self’, other internal function calls are replaced with ‘internfunc’, other external function calls are replaced with ‘externfunc’. If a block has multiple function calls out, then it will be replaced with ‘multifunc’.

NOTE: This can all only happen if cfg and block information is passed. If it is not passed, then all function calls will be replaced with ‘func’

Parameters:
  • idx (int) – the index in line of the ‘call’ opcode

  • line (List[TokenTuple]) – a list of (token_type, token) tuples. the current assembly line

  • special_functions (Set[str]) – a set of string special function names.

  • cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.

  • ret_only_call_type (bool) – if True, will return only the call type being used as a string. This is only for testing purposes and should likely not be used in normalization as this function already can handle the normalizing. This will return a string if it is not a special function call (for the appropriate function call type), or a tuple with one element for a special function call (the name of the special function).

Returns:

integer index in line of last handled token

Return type:

int

bincfg.normalization.norm_funcs.threshold_immediate(threshold=5000, include_negative=False, imm_str='#immval#')[source]

Builds a function that replaces immediate values with immval iff abs(immediate) > some threshold

Parameters:
  • threshold (int) – the threshold to use. Any immediates whose absolute values are larger than this threshold will be replaced with the imm_str

  • include_negative (bool) – if True, then any immediate that are too large and get replaced will have a negative sign added to the front of the replacement string if the immediates were negative

  • imm_str (str) – the string to replace immediate values with

Returns:

either a function that will handle thresholded immediate strings (if this function was called correctly), or a handled thresholded immediate string

Return type:

Union[Callable[…, str], str]
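The thresholding rule can be sketched as a small function factory. This is a simplified illustration, not the library's handler: make_threshold_immediate() is a hypothetical name, the returned function takes a plain token string rather than (self, state), and returning small immediates unchanged is an assumption.

```python
def make_threshold_immediate(threshold=5000, include_negative=False, imm_str="#immval#"):
    """Build a function replacing immediates whose absolute value exceeds threshold."""

    def _threshold(token):
        value = int(token, 0)  # base 0 accepts 0x, 0o, 0b prefixes and plain decimal
        if abs(value) <= threshold:
            return token  # assumption: small immediates are kept as-is
        prefix = "-" if include_negative and value < 0 else ""
        return prefix + imm_str

    return _threshold


keep_sign = make_threshold_immediate(include_negative=True)
small = keep_sign("8")
large_negative = keep_sign("-0x2000")
```

Replacing only large values keeps small constants such as stack offsets distinct while collapsing addresses and other large immediates into a single token.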

bincfg.normalization.norm_utils module

An assortment of helper/utility functions for tokenization/normalization.

bincfg.normalization.norm_utils.get_normalizer(normalizer)[source]

Returns the normalizer being used.

Parameters:

normalizer (Union[str, Normalizer, type]) – either a Normalizer object (IE: has a callable ‘normalize’ function), or a string name of a built-in normalizer to use, or a type of a normalizer to instantiate with no args/kwargs passed. Accepted strings include: ‘innereye’, ‘deepbindiff’, ‘safe’, ‘deepsemantic’, ‘unnormalized’, ‘compressed_stats’, ‘hpc_data’

Raises:
  • ValueError – for unknown string name of normalizer

  • TypeError – if normalizer was not a string or Normalizer object

Returns:

a Normalizer object

Return type:

Normalizer

bincfg.normalization.norm_utils.imm_to_int(token, on_err=<object object>)[source]

Convert the given value to integer

If token is an integer, returns token. Otherwise, converts a string token to an integer, accounting for hexadecimal, decimal, octal, and binary values

Parameters:
  • token (Union[str, int]) – the immediate token to convert to integer

  • on_err (Optional[Any]) – if passed, then this value will be returned if there is an error while trying to parse the immediate value. Otherwise the error will just be raised like normal

Returns:

integer value of given token

Return type:

int
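A conversion along these lines can be sketched with explicit prefix handling. This is not the library's implementation: imm_to_int_sketch() and the _RAISE sentinel are hypothetical, and the sign handling and the exception types caught are assumptions.

```python
_RAISE = object()  # sentinel meaning "no on_err value was passed"


def imm_to_int_sketch(token, on_err=_RAISE):
    """Convert a hex/decimal/octal/binary immediate string (or int) to an int."""
    if isinstance(token, int):
        return token
    try:
        text = token.strip().lower()
        sign = -1 if text.startswith("-") else 1
        if text[:1] in ("+", "-"):
            text = text[1:]
        if text.startswith("0x"):
            return sign * int(text[2:], 16)
        if text.startswith("0o"):
            return sign * int(text[2:], 8)
        if text.startswith("0b"):
            return sign * int(text[2:], 2)
        return sign * int(text, 10)
    except (ValueError, AttributeError):
        if on_err is _RAISE:
            raise  # no fallback given: re-raise the parsing error
        return on_err


value = imm_to_int_sketch("0x0000000000252312")
fallback = imm_to_int_sketch("rax", on_err=None)
```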

bincfg.normalization.norm_utils.parse_disinfo_json(string)[source]

Attempts to parse a JSON object inside of disassembler info tokens

Assumes the DISINFO_START and DISINFO_END have already been removed from the string.

Parameters:

string (str) – the string to attempt to parse into json

Returns:

returns the resulting JSON object, or None if the string could not be parsed as JSON

Return type:

Union[None, JSONObject]
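The "parse or return None" contract can be sketched with the standard json module. parse_json_or_none() is a hypothetical name, and the real function may accept or reject inputs differently. As in the description above, the sketch assumes the surrounding disassembler info delimiters have already been stripped.

```python
import json


def parse_json_or_none(string):
    """Return the parsed JSON value, or None if the string is not valid JSON."""
    try:
        return json.loads(string)
    except (json.JSONDecodeError, TypeError):
        return None


parsed = parse_json_or_none('{"absolute": 6639624}')
not_json = parse_json_or_none("absolute=0x0000000000655008")
```

One caveat of this contract: the JSON literal null also parses to None, so a caller cannot distinguish it from a parse failure.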

bincfg.normalization.norm_utils.scan_for_token(token_list, type=None, token=None, stop_on_type=None, stop_on_token=None, ignore_type=None, ignore_token=None, stop_unmatched=False, match_re=False, ignore_re_case=True, start=0, increment=1, wrap=True, max_matches=1, ret_list=False, ret='index', on_no_match=None)[source]

Scans the given token list looking for a specific token(s) or token type(s)

Will return None if no match is found.

Detects tokens in the order:

  1. ‘ignore’ tokens

  2. ‘stop’ tokens

  3. accepted tokens (from type or token parameters)

So, if one passes multiple parameters that conflict with one another, the above ordering is what takes precedence.
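The detection order can be sketched with a much smaller scanner that supports only single type strings. scan_first() and its parameter names are hypothetical, and the real function's regex matching, wrapping, and return options are left out.

```python
def scan_first(token_list, want_type=None, stop_on_type=None, ignore_type=None):
    """Return the index of the first accepted token, or None if there is none."""
    for index, (token_type, _token, *_rest) in enumerate(token_list):
        # 1. 'ignore' is checked first, so it overrides 'stop' and 'accept'
        if ignore_type is not None and token_type == ignore_type:
            continue
        # 2. 'stop' is checked next and ends the search
        if stop_on_type is not None and token_type == stop_on_type:
            return None
        # 3. finally, check whether the token is one we are looking for
        if want_type is None or token_type == want_type:
            return index
    return None


tokens = [("spacing", " "), ("immediate", "8"), ("register", "rax")]
stopped = scan_first(tokens, want_type="register", stop_on_type="immediate")
ignored = scan_first(
    tokens, want_type="register", stop_on_type="immediate", ignore_type="immediate"
)
```

In the second call the immediate token is both a stop type and an ignore type. Because ignore is checked first, the scan skips it and goes on to find the register.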

Parameters:
  • token_list (List[Tuple[str, str, ...]]) – the list of tokens. Each element should be a tuple of (token_type, token, …). The first element is the type of the token, second is the string token, and anything else is ignored. This means this function can work with either the 2-tuple token lists from Tokenizer() objects as well as the 3-tuple token lists from Normalizer() objects.

  • type (Optional[Union[str, Iterable[str]]]) – the type or types of tokens to return. Can be a string to only return one type of token, or an iterable of strings to return the first token found that has any of those types. If token is not None, then the returned token must also match that argument. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • token (Optional[Union[str, Iterable[str]]]) – the token to return. Can be a string to only return one matching token, or an iterable of strings to return the first token found that matches any of those tokens. If type is not None, then the returned token must also match that type. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • stop_on_type (Optional[Union[str, Iterable[str]]]) – if a token of this type is found, then we immediately stop searching and return whatever we currently have. Can be a string to only stop at one type of token, or an iterable of strings to stop at the first token found that has any of those types. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • stop_on_token (Optional[Union[str, Iterable[str]]]) – if this token is found, then we immediately stop searching and return whatever we currently have. Can be a string to only stop at one token, or an iterable of strings to stop at the first token found that matches any of these. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • ignore_type (Optional[Union[str, Iterable[str]]]) – ignores token types. Can be a string to only ignore one token type, or an iterable of strings to ignore any token types that match any of these. These tokens will not be added to return lists or considered tokens to keep. Since these are checked before ‘stop’ token types, this will override the stopping on any tokens also matched with stop_on_type. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • ignore_token (Optional[Union[str, Iterable[str]]]) – ignores tokens. Can be a string to only ignore one token, or an iterable of strings to ignore any tokens that match any of these. These tokens will not be added to return lists or considered tokens to keep. Since these are checked before ‘stop’ tokens, this will override the stopping on any tokens also matched with stop_on_token. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match

  • stop_unmatched (bool) – if True, will stop on the first unmatched token. IE: a token that was not ignored, was not already stopped on, and was not considered a token to keep

  • match_re (bool) – if True, will assume any match values in type or token are to be considered regular expressions to fullmatch()

  • ignore_re_case (bool) – if True, will pass re.IGNORECASE as a flag when making the regular expressions

  • start (int) – the index to start at within token_list

  • increment (int) – the increment to use when searching for tokens. Set to a negative number to move backwards through the list. NOTE: if returning multiple values, they will be returned in the order they appear in the input list, regardless of the increment value

  • wrap (bool) – if True, then the initial start index will be wrapped to the length of the token_list. If False, then an initial start index that is out of bounds of the token_list will immediately stop.

  • max_matches (Union[int, None]) – the number of matches to find. If 1, then values will be returned as normal. If >1, then this will search through the list finding up to max_matches matching tokens and return their ret values as a list in the order that they were found. If None, then all matches found will be returned. NOTE: if max_matches != 1, then the return value will always either be None if no matches were found, or a list (even if only one match was found)

  • ret_list (bool) – if True, will always return a list, even if only a single return value was present

  • ret (Union[str, Iterable[str]]) –

    what value(s) to return. Can be a single string to return a single value, or an iterable of strings to return multiple values as a tuple in the order they were passed. Valid strings:

    • ’index’: return the index in token_list of the matched token

    • ’type’: return the token type of the matched token

    • ’token’: return the string token that was matched

    • ’all’: return all of the above. If in a passed list, ignores all other values in the list. Will return values in the order above.

  • on_no_match (Optional[Any]) – value to return if there were no matches found. Defaults to None

Returns:

on_no_match (None by default) if no match is found, or one of the return types designated by the ret argument, or a tuple of multiple return values if the user passed multiple values in ret, or a list of one of the previous if collecting matches for multiple tokens. NOTE: if returning multiple values, they will be returned in the order they appear in the input list, regardless of the increment value

Return type:

Union[None, int, str, Tuple, List]
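The negative-lookahead trick mentioned in the parameter notes can be checked directly with Python's `re` module. The sketch below mirrors the documented `match_re` and `ignore_re_case` behaviour (full match, case-insensitive) using only standard `re` calls; the type name `register` is a made-up example.

```python
import re

# "Anything that does not start with 'register'", case-insensitive.
not_register_prefix = re.compile(r'(?!register).*', re.IGNORECASE)

# "Anything except exactly 'register'": anchor the lookahead with '$'.
not_register_exact = re.compile(r'(?!register$).*', re.IGNORECASE)

for candidate in ("register", "REGISTER", "registers", "immediate"):
    print(
        candidate,
        bool(not_register_prefix.fullmatch(candidate)),
        bool(not_register_exact.fullmatch(candidate)),
    )
```

Note the difference between the two patterns: without the `$`, the lookahead rejects every string that merely begins with the excluded text, so `registers` is rejected by the first pattern but accepted by the second.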

bincfg.normalization.normalize module

Provides function(s) to perform normalization techniques on CFG’s

bincfg.normalization.normalize.normalize_cfg_data(cfg_data: CFGInputDataType | bincfg.CFG | bincfg.MemCFG | bincfg.CFGDataset | bincfg.MemCFGDataset | Iterable, normalizer: str | NormalizerType, inplace: bool = False, using_tokens: dict[str, int] | AtomicTokenDict | None = None, force_renormalize: bool = False, convert_to_mem: bool = False, conv_keep_mem_addrs: bool = True, unpack_cfgs: bool = False, progress: bool = False) → bincfg.CFG | bincfg.MemCFG | bincfg.CFGDataset | bincfg.MemCFGDataset | list | tuple

Normalizes some cfg data.

Parameters:
  • cfg_data (Union[CFGInputDataType, CFG, MemCFG, CFGDataset, MemCFGDataset, Iterable]) – some cfg data. Can be either: str, CFG, MemCFG, CFGDataset, MemCFGDataset, or iterable of previously mentioned types. Will return the same type as that passed, unless that particular input was a string, in which case a CFG will be returned.

  • normalizer (Union[str, Normalizer]) – the normalizer to use. Can be either a Normalizer class with a .normalize() method, or a string to use a built-in normalizer. See bincfg.normalization.get_normalizer() for acceptable strings.

  • inplace (bool) – if True, will modify data in-place instead of creating new objects. Defaults to False. NOTE: if inplace=False, and the incoming data has already been normalized with the passed normalizer, then the original cfg will be returned, NOT a copy.

  • using_tokens (Optional[Union[dict[str, int], AtomicTokenDict]]) – only used for MemCFG’s. If not None, then a dictionary mapping string tokens to integer token values that will be used as any MemCFG’s tokens. Defaults to None.

  • force_renormalize (bool) – by default, this method will only normalize cfg’s whose .normalizer is not equal to the passed normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.

  • convert_to_mem (bool) – if True, will convert all CFG’s and CFGDatasets to their memory-efficient versions after normalizing. Defaults to False.

  • conv_keep_mem_addrs (bool) – if True, will pass keep_memory_addresses=True when converting CFG’s into MemCFG’s

  • unpack_cfgs (bool) – by default, this method will return the same types that were passed to be normalized. However if unpack_cfgs=True, then instead, a list of all cfgs unpacked (EG: unpacked from lists, and pulled out of datasets) will be returned. Defaults to False. NOTE: if only a single CFG/MemCFG was passed, a list will still be returned of only that single element.

  • progress (bool) – if True, will show a progressbar for normalizations of multiple cfg’s. Defaults to False.

Returns:

the normalized data

Return type:

Union[CFG, MemCFG, CFGDataset, MemCFGDataset, List, Tuple]
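The interplay between inplace and force_renormalize described above can be sketched with a toy stand-in. Nothing here is the library's code: `FakeCFG`, `normalize_one`, and `lower_normalizer` are hypothetical, and a plain function plays the role of a normalizer.

```python
from dataclasses import dataclass


@dataclass
class FakeCFG:
    """Hypothetical stand-in for a CFG: some lines plus the normalizer used."""
    lines: list
    normalizer: object = None


def lower_normalizer(line):
    """Toy 'normalizer' that just lowercases a line."""
    return line.lower()


def normalize_one(cfg, normalizer, inplace=False, force_renormalize=False):
    # Already normalized with this normalizer: hand back the original
    # object (NOT a copy), unless the caller forces a renormalization.
    if cfg.normalizer == normalizer and not force_renormalize:
        return cfg
    # Otherwise work on the object itself (inplace) or on a fresh copy.
    target = cfg if inplace else FakeCFG(list(cfg.lines), cfg.normalizer)
    target.lines = [normalizer(line) for line in target.lines]
    target.normalizer = normalizer
    return target


raw = FakeCFG(["ADD RSP, 8"])
first = normalize_one(raw, lower_normalizer)      # new object, raw untouched
second = normalize_one(first, lower_normalizer)   # same object returned
print(first is raw, second is first)              # False True
```

The practical consequence of the documented NOTE is visible in `second`: with inplace=False, a caller cannot assume the return value is always a fresh object, so mutating it may also mutate the input.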

Module contents