bincfg.normalization.x86 package

Submodules

bincfg.normalization.x86.x86_norm_funcs module

bincfg.normalization.x86.x86_norm_funcs.x86_clean_nop(state)[source]

Cleans any line with the opcode ‘nop’ to only contain the opcode

Parameters:
  • idx (int) – the index in line of the ‘nop’ opcode

  • line (List[TokenTuple]) – a list of (token_type, token) tuples representing the current assembly line

  • args – unused

  • kwargs – unused

Returns:

integer index in line of last handled token

Return type:

int

bincfg.normalization.x86.x86_norm_funcs.x86_memsize_value(self, state)[source]

Replaces memory size pointers with ‘memsize’ followed by the value of that memsize in bytes

Parameters:

token (str) – the current string token

Returns:

normalized memory size string

Return type:

str
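As a rough illustration of the memory-size replacement described above, here is a minimal standalone sketch. It does not use bincfg: the helper name memsize_value, the lookup table, and the exact output format ('memsize' immediately followed by the byte count) are assumptions made for the example.

```python
import re

# Byte widths for a small, illustrative subset of x86 memory size keywords.
MEMSIZE_BYTES = {
    "byte": 1, "word": 2, "dword": 4, "qword": 8, "tbyte": 10,
    "xword": 16, "xmmword": 16, "ymmword": 32, "zmmword": 64,
}


def memsize_value(token: str) -> str:
    """Hypothetical helper: 'qword ptr' -> 'memsize8'."""
    # Drop an optional trailing 'ptr' together with the spacing before it
    name = re.sub(r"\s*ptr$", "", token.strip().lower())
    return f"memsize{MEMSIZE_BYTES[name]}"
```

Under these assumptions, memsize_value('qword ptr') and memsize_value('qword') both give 'memsize8'.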

bincfg.normalization.x86.x86_norm_funcs.x86_replace_general_register(self, state)[source]

Replaces general registers with a default string and their size, keeping special registers the same (while removing their numbers)

Parameters:

token (str) – the current string token

Returns:

normalized name of register

Return type:

str
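The register replacement can be pictured with the following standalone sketch. It is not the library's implementation: the regular expressions cover only a subset of general registers, and the function name replace_general_register is hypothetical. General registers become 'reg' plus their size in bytes, and every other register keeps its name with any number stripped.

```python
import re

# Size in bytes -> pattern for an illustrative subset of general registers.
_GENERAL_REGISTERS = {
    8: r"r(?:[abcd]x|[sd]i|[0-9]+)",
    4: r"e(?:[abcd]x|[sd]i)|r[0-9]+d",
    2: r"[abcd]x|[sd]i|r[0-9]+w",
    1: r"[abcd][lh]|[sd]il|r[0-9]+b",
}


def replace_general_register(token: str) -> str:
    """Hypothetical helper: 'rax' -> 'reg8', 'st5' -> 'st', 'rip' -> 'rip'."""
    tok = token.lower()
    for size, pattern in _GENERAL_REGISTERS.items():
        if re.fullmatch(pattern, tok):
            return f"reg{size}"
    # Special registers keep their name; number information is removed
    return re.sub(r"\(?[0-9]+\)?", "", tok)
```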

bincfg.normalization.x86.x86_normalizers module

A bunch of builtin normalization methods based on literature.

NOTE: some of these are slightly modified from their original papers either for code purposes, or because we are using decompiled binaries instead of compiled assembly and thus lose out on some information (EG: symbol information for jump instructions)

class bincfg.normalization.x86.x86_normalizers.X86BaseNormalizer(*args, **kwargs)[source]

Bases: BaseNormalizer

Base class for x86 normalizers.

Performs an ‘unnormalized’ normalization, removing what is likely extraneous information, and providing a base class for other x86 normalization methods to inherit from.

Parameters:
  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
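For reference, a 4-byte shake_128 hash can be computed with Python's standard hashlib module. The sketch below is only an illustration of that hash: the helper name anonymize_token, the UTF-8 encoding, and the hex-string output are assumptions, since the source does not say how the library encodes the result.

```python
import hashlib


def anonymize_token(token: str) -> str:
    """Hypothetical helper: hash a token down to 4 bytes (8 hex characters)."""
    # shake_128 is an extendable-output function, so the digest length
    # (in bytes) is passed explicitly
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(4)
```

The same token always maps to the same 8-character string, so anonymized token streams stay comparable with each other while hiding the original text.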

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']

The default tokenization level used for this normalizer
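The difference between the two tokenization levels can be sketched without bincfg as follows. The alias lists mirror the DEFAULT_TOKENIZATION_LEVEL values documented for these normalizers; the function emit_tokens, its signature, and the literal 'INSTRUCTION_START' string are assumptions made for the example.

```python
from typing import List

INSTRUCTION_ALIASES = ["inst", "instruction", "line", "instructions", "lines"]
OP_ALIASES = ["op", "opcode", "operand", "opcodes", "operands"]


def emit_tokens(lines: List[List[str]], level: str, token_sep: str = "_") -> List[str]:
    """Hypothetical helper: flatten normalized lines at the requested level."""
    level = level.lower()
    out: List[str] = []
    if level in INSTRUCTION_ALIASES:
        # One token per instruction: join everything with token_sep
        for line in lines:
            out.append(token_sep.join(line))
    elif level in OP_ALIASES:
        # One token per opcode/operand, with a marker before each instruction
        for line in lines:
            out.append("INSTRUCTION_START")
            out.extend(line)
    else:
        raise ValueError(f"unknown tokenization level: {level!r}")
    return out
```

For the lines [['mov', 'reg8', 'reg8'], ['ret']], the instruction level yields ['mov_reg8_reg8', 'ret'], while the op level yields six tokens including two 'INSTRUCTION_START' markers.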

handle_all_symbols(state)[source]

Handles all symbols. We use this to keep track of when memory expressions start/end

handle_memory_base(state)[source]

Handles the ‘base’ section of a memory addressing. Defaults to returning it processed as a register

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_memory_displacement(state)[source]

Handles the ‘displacement’ section of a memory addressing. Defaults to returning it processed as an immediate

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_memory_expression(state)[source]

Handles memory expressions. Splits values up into ‘base’, ‘index’, ‘scale’, and ‘displacement’

Values:

  • ‘base’ (B): a register holding the starting base location. Handled by self.handle_memory_base()

  • ‘index’ (I): a register that is added to a base register. Handled by self.handle_memory_index()

  • ‘scale’ (S): an immediate value that is multiplied onto ‘index’, should only be 1, 2, 4, or 8. Handled by self.handle_memory_scale()

  • ‘displacement’ (D): an immediate value that acts as a displacement to the memory address (or in some cases, the literal memory address itself). Handled by self.handle_memory_displacement()

Acceptable x86 memory addressing modes:

  • [D]

  • [B]

  • [B + I]

  • [B + D]

  • [B + I + D]

  • [B + I*S]

  • [I*S + D]

  • [B + I*S + D]

NOTE: You can have other formats that don’t fit known addressing modes, but the values might not be handled properly. Specifically, the first register found will be considered the ‘base’, and all subsequent are considered ‘index’, etc.

The respective handle_memory_*() methods will be called with the same parameters as this function, but ‘memory_start’ will instead be the starting index of that token, and ‘token’ will be the original token value (IE: before being handled)

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
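To make the base/index/scale/displacement split concrete, here is a minimal standalone sketch that works on a raw string instead of a NormalizerState. It is not the library's parser: the function name split_memory_expression is hypothetical, scaled terms are assumed to be written as register*scale, and the first unscaled register is treated as the base as described in the NOTE above.

```python
import re
from typing import Dict, Optional

_IMMEDIATE = re.compile(r"-?(?:0x[0-9a-f]+|[0-9]+)")


def split_memory_expression(expr: str) -> Dict[str, Optional[str]]:
    """Hypothetical helper: '[ebx + esi*4 + 0x10]' -> its B, I, S, D parts."""
    parts: Dict[str, Optional[str]] = {
        "base": None, "index": None, "scale": None, "displacement": None,
    }
    inner = expr.strip().strip("[]").replace(" ", "").lower()
    # Turn 'a-b' into 'a+-b' so that splitting on '+' keeps the sign
    for term in inner.replace("-", "+-").split("+"):
        if not term:
            continue
        if "*" in term:
            # Scaled term: the register is the index, the number is the scale
            parts["index"], parts["scale"] = term.split("*", 1)
        elif _IMMEDIATE.fullmatch(term):
            parts["displacement"] = term
        elif parts["base"] is None:
            parts["base"] = term
        else:
            parts["index"] = term
    return parts
```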

handle_memory_index(state)[source]

Handles the ‘index’ section of a memory addressing. Defaults to returning it processed as a register

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_memory_scale(state)[source]

Handles the ‘scale’ section of a memory addressing. Defaults to returning it processed as an immediate

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_memory_size(state)[source]

Handles a memory size. Removes any ‘{spacing}ptr’ where {spacing} is any amount of spacing

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

handle_segment_address(state)[source]

Handles a segment address. Defaults to returning the original token

Should return either the token to add to the current line, or None to not add any token

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

opcode_function_call(state)[source]

Handles function call opcodes, defaults to doing nothing

opcode_jump(state)[source]

Handles jump opcodes, defaults to doing nothing

renormalizable = True

Whether or not this normalization method can be renormalized later by other normalization methods

save(path)
token_sep = None

The separator string used for this normalizer. Will default to ‘ ‘

tokenizer = None

The tokenizer used for this normalizer

class bincfg.normalization.x86.x86_normalizers.X86CompressedStatsNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A normalizer I created for use in CFG.get_compressed_stats()

Rules:

  • Immediates are replaced with immediate string (including negative)

  • function calls are classified as self, internal, or external function calls, with no special functions

  • jump destinations are ‘jmpdst’

  • registers are handled the same as deepsem/deepbindiff

  • memory pointers/memory expressions are handled the same as in deepsemantic

  • Tokenized at the instruction-level

  • segment addresses are ignored

  • branch predictions are ignored

Parameters:
  • special_functions (Optional[Set[str]]) – a set of special function names. All external functions whose name (ignoring any ‘@plt’ at the end) is in this set will have their name kept; otherwise they will be replaced with ‘externfunc’. If None, will attempt to load the default special function names from get_special_function_names(). If you do not wish to use any special function names, then pass an empty set.

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
handle_branch_prediction(state)
handle_immediate(state)
handle_memory_size(state)
handle_register(state)
handle_segment_address(state)
handle_string_literal(*, replace_previous_immediate=False)
opcode_function_call(state, ret_only_call_type=False)
opcode_jump(state)
renormalizable = False
save(path)
class bincfg.normalization.x86.x86_normalizers.X86DeepBinDiffNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A normalizer based on the Deep Bin Diff method

From the DeepBinDiff paper: https://www.ndss-symposium.org/wp-content/uploads/2020/02/24311-paper.pdf

Rules:

  • Constant values are ignored and replaced with ‘immval’

  • General registers are renamed based on length; special ones are left as-is, with number information removed.

    EG: st5 -> st, rax -> reg8, r14d -> reg4, rip -> rip, zmm13 -> zmm

  • Memory expressions are replaced with ‘memexpr’

  • Can’t really tell what’s supposed to be done with function calls, will just assume they should be ‘call immval’

  • Jump destinations are ‘immval’

  • Strings are left as-is (Kinda bad, but they are doing binary diffing and not binary similarity, so I’ll let it slide)

  • Doesn’t say anything about segment addresses, so they are ignored

  • Doesn’t say anything about branch predictions, so they are ignored

  • Tokens are at the op-level

Parameters:
  • replace_strings (bool) – if True, then strings will be replaced with a ‘str’ token. Default is False which is the default deepbindiff behavior in the paper and leaves full strings as individual tokens

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['op', 'opcode', 'operand', 'opcodes', 'operands']
handle_branch_prediction(state)
handle_immediate(*, include_negative=False)
handle_memory_expression(state)
handle_memory_size(state)
handle_register(state)
handle_segment_address(state)
opcode_function_call(state)
renormalizable = False
save(path)
class bincfg.normalization.x86.x86_normalizers.X86DeepSemanticNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A normalizer based on the Deepsemantic method

from the DeepSemantic paper: https://arxiv.org/abs/2106.05478

Rules:

  • Immediates can fall into multiple categories:
    1. Function calls:

      • libc function name(): “libc[name]” (instead, we use “{name}”)

      • recursive call: ‘self’

      • function within the binary: ‘innerfunc’

      • function outside the binary: ‘externfunc’

      • NOTE: they do not take into account call tables that could theoretically call both inner and extern functions. So, when this rather rare event occurs, it is given the token ‘multifunc’

    2. Jump (branching) family: “jmpdst”

    3. Reference: (NOTE: This might not be done for all disassembly output, such as ROSE, since they don’t always have this information readily available)

      • String literal: ‘str’

      • Statically allocated variable: “dispbss”

      • Data (data other than a string): “dispdata”

    4. Default (all other immediate values): “immval”

  • Registers can fall into multiple categories:
    1. Stack/Base/Instruction pointer: Keep track of type and size [e|r]*[b|s|i]p[l]* -> [s|b|i]p[1|2|4|8]

    2. Special purpose (IE: flags): Keep track of type cr[0-15], dr[0-15], st([0-7]), [c|d|e|f|g|s]s -> reg[cr|dr|st], reg[c|d|e|f|s]s

    3. AVX registers: Keep track of type [x|y|z]*mm[0-7|0-31] -> reg[x|y|z]*mm

    4. General purpose registers: Keep track of size [e|r]*[a|b|c|d|si|di][x|l|h]*, r[8-15][b|w|d]* -> reg[1|2|4|8]

  • Pointers can fall into multiple categories:
    1. Direct, small: keep track of size byte,word,dword,qword,ptr -> memptr[1|2|4|8]

    2. Direct, large: keep track of size tbyte,xword,[x|y|z]mmword -> memptr[10|16|32|64]

    3. Indirect, string: [base+index*scale+displacement] -> [base+index*scale+dispstr] (NOTE: we use ‘str’ instead of dispstr)

    4. Indirect, not string: [base+index*scale+displacement] -> [base+index*scale+disp]

      NOTE: for our purposes, we don’t necessarily always have base, index, scale, and displacement present, and they may appear in a different (but deterministic) order. It shouldn’t really change anything for any models, just how the tokens are formatted

      NOTE: it looks like the ‘scale’ values are their original immediate values and not replaced with the ‘immval’ string, so that is taken into account as well

  • Tokenized at instruction-level

  • Doesn’t say anything about segment addresses, so they are left as-is

  • Doesn’t say anything about branch predictions, so they are ignored
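The function call categories in the rules above can be sketched as a small standalone classifier. This is an illustration only: the function name classify_call and its arguments are hypothetical, the caller is assumed to already know whether the callee is external, and the rare ‘multifunc’ case is left out.

```python
from typing import Set


def classify_call(callee: str, caller: str, is_external: bool,
                  special_functions: Set[str]) -> str:
    """Hypothetical helper: map a call target to its normalized token."""
    # Ignore a trailing '@plt' when comparing names
    name = callee[:-4] if callee.endswith("@plt") else callee
    if is_external:
        # Special external functions keep their name
        return name if name in special_functions else "externfunc"
    if name == caller:
        return "self"
    return "innerfunc"
```

Passing an empty set for special_functions collapses every external call to ‘externfunc’, which matches the documented behavior of the special_functions parameter.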

Parameters:
  • special_functions (Optional[Set[str]]) – a set of special function names. All external functions whose name (ignoring any ‘@plt’ at the end) is in this set will have their name kept; otherwise they will be replaced with ‘externfunc’. If None, will attempt to load the default special function names from get_special_function_names(). If you do not wish to use any special function names, then pass an empty set.

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
handle_branch_prediction(state)
handle_immediate(*, include_negative=False)
handle_memory_scale(state)
handle_memory_size(state)
handle_register(state)
handle_string_literal(*, replace_previous_immediate=False)
opcode_function_call(state, ret_only_call_type=False)
opcode_jump(state)
renormalizable = False
save(path)
class bincfg.normalization.x86.x86_normalizers.X86HPCDataNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A special normalizer meant for use in HPC compile jobs

This normalizer is made to reduce the total number of new tokens as much as possible while still being able to fully reproduce original BaseNormalizer output, and while trying to minimize the number of tokens per assembly line as much as possible.

Since immediate values make up the vast majority of ‘unique’ tokens (or are the root cause of there being so many) in the BaseNormalizer, this is all that is changed. Specifically:

  • immediate values get split into multiple tokens. EG: ‘123456789’ -> ‘1234’, ‘5678’, ‘9’ if using num_digits of 4

  • negatives stay connected to tokens. EG: ‘-54321’ -> ‘-543’, ‘21’ if using num_digits of 4

  • Before some split immediate values, a ‘split immediate’ token is inserted for later tokenization to know that the following immediate values should all be concatenated together. This is only inserted when default behavior would produce the wrong values (EG: whenever a token is split, whenever a non-split token has a split token before it, etc.)

NOTE: this should only be used with the ‘opcode’ tokenization level as it provides no benefit otherwise
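The splitting of immediate values can be illustrated with a short standalone sketch. The function name split_immediate is hypothetical, and the sketch covers only the fixed-width chunking (with the minus sign counted as a digit); it does not model when the ‘split immediate’ marker token is inserted.

```python
from typing import List


def split_immediate(token: str, num_digits: int = 4) -> List[str]:
    """Hypothetical helper: cut an immediate into chunks of num_digits characters."""
    # A leading '-' is part of the first chunk and counts toward its length
    return [token[i:i + num_digits] for i in range(0, len(token), num_digits)]
```

Because the chunks are plain slices, joining them back together reproduces the original immediate exactly, which is what makes the splitting lossless.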

Parameters:
  • num_digits (int) – the number of digits to use before splitting. This will include the minus sign as a digit.

  • replace_strings (bool) – if True, then strings will be replaced with a ‘str’ token. Default is False which is the default deepbindiff behavior in the paper and leaves full strings as individual tokens

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['op', 'opcode', 'operand', 'opcodes', 'operands']
finalize_instruction(state)[source]

Handles a single instruction. Calls super()’s handle_instruction, then performs the immediate splitting

Parameters:

state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState

property renormalizable

We are only losslessly renormalizable if self._replace_strings == False

save(path)
class bincfg.normalization.x86.x86_normalizers.X86InnerEyeNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A normalizer based on the Innereye method

From the InnerEye paper: https://arxiv.org/pdf/1808.04706.pdf

Rules:

  • Constant values are ignored and replaced with ‘immval’ or ‘-immval’ for negative values

  • Function names are ignored and replaced with ‘func’

  • Strings are ‘str’

  • Jump destinations are ‘immval’

  • Registers are left as-is

  • Doesn’t say anything about memory sizes, so they are ignored

  • Doesn’t say anything about segment addresses, so they are ignored

  • Doesn’t say anything about branch predictions, so they are ignored

  • Tokens are at the instruction-level

Parameters:
  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
handle_branch_prediction(state)
handle_immediate(state)
handle_memory_size(state)
handle_segment_address(state)
handle_string_literal(*, replace_previous_immediate=False)
opcode_function_call(state)
renormalizable = False
save(path)
class bincfg.normalization.x86.x86_normalizers.X86SafeNormalizer(*args, **kwargs)[source]

Bases: X86BaseNormalizer

A normalizer based on the SAFE method

From the SAFE paper: https://github.com/gadiluna/SAFE

Rules:

  • All base memory addresses (IE: memory addresses that are constant values) are replaced with ‘dispmem’. NOTE: they only specify that this is used for base memory addresses. I’m not sure if they mean any displacement value or only those that stand alone with no registers. It doesn’t help that their implementation seems to have some bugs here (see below). So, I assume it is to be used for any displacement value, hence the string they are replaced with

  • All immediate values greater than some threshold (safe_threshold parameter, they use 5000 in the paper) are replaced with ‘immval’. Any immediate values smaller than said threshold (including those that are targets of call/jump instructions) are left alone

  • Memory sizes are ignored

  • Doesn’t say anything about registers, so they are left as-is

  • Doesn’t say anything about segment addresses, so they are left as-is

  • Doesn’t say anything about branch predictions, so they are ignored

  • Strings would be ignored, just using the immediate values associated with their memory address

  • Tokens are at the instruction-level

NOTE: the code from the SAFE paper has at least one bug that I’ve found in how it handles memory expressions. (The bug is in their Radare2 analyzer code, which is the only analyzer code I could find in their repo, even though they say they also use Angr.) They do not consider some of the possible memory addressing methods that are allowed in x86_64 binaries. For example:

Original Disassembly             Their Result             Probably Intended Result
lea esi, [esi + ecx*2]           X_lea_esi,_[esi*2+0]     X_lea_esi,_[esi+ecx*2+0]
lea ebx, [ebx + esi*4 + 0x10]    X_lea_ebx,_[ebx*4+16]    X_lea_ebx,_[ebx+esi*4+16]
lea edi, [eax*4 + 0x419f40]      X_lea_edi,_[MEM]         X_lea_edi,_[eax*4+MEM]

They also seem to have problems with immediate/displacement values in those memory addresses:

Original Disassembly               Their Result                    Probably Intended Result
add byte [ebp + 0x4d8de455], cl    X_add_[ebp*1+1301144661],_cl    X_add_[ebp*1+MEM],_cl

So, I take a benefit-of-the-doubt approach to this normalizer and assume the authors did not intend for this to happen. Instructions are normalized taking into account these possible memory addressing methods. Any displacement values that are larger than the safe_threshold are converted into the dispmem token.
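The threshold rule can be sketched in a few lines of standalone Python. The helper names safe_immediate and safe_displacement are hypothetical, and the sketch returns small values unchanged in their original spelling, which is an assumption (the library may reformat them, for example from hex to decimal).

```python
def safe_immediate(token: str, threshold: int = 5000) -> str:
    """Hypothetical helper: keep small immediates, replace large ones."""
    value = int(token, 0)  # base 0 accepts '0x..' hex as well as decimal
    return token if abs(value) <= threshold else "immval"


def safe_displacement(token: str, threshold: int = 5000) -> str:
    """Hypothetical helper: same rule, but for memory displacement values."""
    value = int(token, 0)
    return token if abs(value) <= threshold else "dispmem"
```

With the paper's threshold of 5000, a displacement such as 0x10 is kept while 0x419f40 becomes ‘dispmem’.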

Parameters:
  • imm_threshold (int) – immediate values whose absolute value is <= imm_threshold will be left alone, those above it will be replaced with the string ‘immval’

  • tokenizer (Tokenizer) – the tokenizer to use

  • token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.

  • token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)

  • tokenization_level (Optional[Union[TokenizationLevel, str]]) –

    the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:

    • ’op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line

    • ’inst’/’instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token

    • ’auto’: pick the default value for this normalization technique

  • anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.

DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
handle_branch_prediction(state)
handle_memory_size(state)
handle_string_literal(state)
renormalizable = False
save(path)

bincfg.normalization.x86.x86_tokenizer module

class bincfg.normalization.x86.x86_tokenizer.X86BaseTokenizer(*args, **kwargs)[source]

Bases: BaseTokenizer

A default class to tokenize x86 assembly line input

This class matches the following tokens:

  • All of the default special tokens from parent (see BaseTokenizer())

  • Instruction prefix tokens (EG: ‘lock’, ‘repe’, etc.)

  • Memory sizes (EG: ‘qword’, ‘byte ptr’, ‘xmmword’, ‘v2float’, etc.)

  • Registers (see bincfg.normalization.x86.x86_tokenizer.X86_REGISTER_SIZES for the list of them)

  • Prepended or appended instruction prefixes and branch predictions. You can prepend or append either of those to opcodes while separating with either ‘,’, ‘_’, or ‘.’. EG: “lock_str”, “jnz,pt”. This will only apply to known prefixes and branch prediction strings; if it is unknown, it is considered a part of a larger opcode (EG: “vcmpneq_oqss”).

    Known prefixes: [‘lock’, ‘rep’, ‘repe’, ‘repz’, ‘repne’, ‘repnz’]

    Known branch predictions: [‘pt’, ‘pn’]

This will perform the following transformations to the incoming token stream:

  • Instruction prefixes and branch prediction tokens may be reordered to keep ordering consistent. NOTE: if you do not wish to have this behavior, pass ‘reorder_tokens=False’ as a kwarg to the .tokenize() call.

    Any instruction prefixes alone on their own line will be moved to the next line, under the assumption that the next line holds the opcode they affect (no check is done to ensure the first token in the subsequent line is actually an opcode, however). Then, for each opcode, we get all surrounding instruction prefix and branch prediction tokens (ignoring spacing, grabbing all tokens until reaching a non-instruction prefix and non-branch prediction token). These are reordered such that all instruction prefixes come first, then branch predictions, and finally the opcode token. Within each group, tokens appear in the order that they were found in the token list. A single space ‘ ‘ separates them.
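The reordering step can be pictured as a stable sort over the tokens that surround an opcode. The sketch below is standalone and simplified: the function name reorder_opcode_group is hypothetical, the group of (token_type, token) tuples is assumed to have been collected already, and moving lone prefixes to the next line is not modeled.

```python
from typing import List, Tuple

# Desired order: instruction prefixes, then branch predictions, then the opcode
_RANK = {"prefix": 0, "branch_prediction": 1, "opcode": 2}


def reorder_opcode_group(group: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Hypothetical helper: reorder the tokens surrounding a single opcode."""
    # sorted() is stable, so tokens of the same type keep their original order
    return sorted(group, key=lambda tok: _RANK[tok[0]])
```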

Parameters:
  • tokens (Optional[List[Tuple[str, str]]]) – the tokens to use. Should be a list of 2-tuples. Each tuple is a pair of (name, regex) where name is the string name of the token, and regex is a regular expression to find that token. These tuples should be ordered in the preferred order to search for tokens. If None, then this will default to self.DEFAULT_TOKENS, which is bincfg.normalization.x86.x86_tokenizer.X86_DEFAULT_TOKENS

  • token_handlers (Optional[Dict[str, Callable[[Dict[str, Any]], Union[None, str]]]]) – optional dictionary mapping token type strings to functions to handle those token types when tokenizing. This is intended to be used when you wish to add entirely new token types not present in bincfg.normalization.base_tokenizer.Tokens. If you wish to change the behavior of handling an already-present token type, just override that token handler function. These will override the default token handlers.

  • insert_special_tokens (bool) – by default, some special tokens will be inserted at the front of tokens (see the ‘special tokens’ listed above). If you wish to stop this from happening, you can set insert_special_tokens to False

  • case_sensitive (bool) – If True, then regular expressions will be matched exactly as they appear. If False, then the re.IGNORECASE flag will be passed when compiling the regular expressions
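To show why the order of the (name, regex) tuples matters and what case_sensitive changes, here is a simplified standalone tokenizer. It is a sketch, not the library's implementation: the patterns are much smaller than X86_DEFAULT_TOKENS, and the ‘immediate’ and ‘spacing’ token names, the tokenize function, and its signature are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Ordered (name, regex) pairs: earlier entries win when several could match.
TOKENS = [
    ("prefix", r"(?:lock|rep(?:ne|nz|e|z)?)\b"),
    ("register", r"(?:[re]?[abcd]x|r[0-9]+[dwb]?)\b"),
    ("immediate", r"-?(?:0x[0-9a-f]+|[0-9]+)\b"),
    ("opcode", r"[a-z][a-z0-9_]*\b"),
    ("spacing", r"[, \t]+"),
]


def tokenize(line: str, tokens=TOKENS, case_sensitive: bool = False) -> List[Tuple[str, str]]:
    """Hypothetical helper: split a line into (token_type, token) tuples."""
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = [(name, re.compile(regex, flags)) for name, regex in tokens]
    out, pos = [], 0
    while pos < len(line):
        for name, regex in compiled:
            match = regex.match(line, pos)
            if match and match.end() > pos:
                if name != "spacing":  # spacing is consumed but not reported
                    out.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}: {line[pos:]!r}")
    return out
```

Here ‘rax’ would also satisfy the opcode pattern, but it is reported as a register because that tuple comes first in the list.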

ARCHITECTURE = ['x86', 'i686', 'x86_64']

The architecture this tokenizer works on

DEFAULT_TOKENS = [('prefix', '(?:lock|rep(?:ne|nz|e|z)?)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('memory_size', '(?:v[0-9]+)?(?:byte|[dqt]?word|t?float|l?double|[xyz]mmword)(?:[, \\t.]+ptr)?(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('register', '(?:[xyz]?mm[0-9]+|st\\(?[0-9]*\\)?|(?:[sb]p|[ds]i)l|[re]?(?:flags|ip|[bs]p|[sd]i|[abcd]x)|[cd]r[0-9]+|r[0-9]+[dwb]?|[abcd][lh]|[cst]w|fp_(?:[id]p|[cd]s|opc)|[cdefgs]s|(?:[gil]d)?tr|msw|mxcsr)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('branch_prediction', 'p[tn](?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)')]
save(path)
bincfg.normalization.x86.x86_tokenizer.X86_DEFAULT_TOKENS = [('prefix', '(?:lock|rep(?:ne|nz|e|z)?)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('memory_size', '(?:v[0-9]+)?(?:byte|[dqt]?word|t?float|l?double|[xyz]mmword)(?:[, \\t.]+ptr)?(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('register', '(?:[xyz]?mm[0-9]+|st\\(?[0-9]*\\)?|(?:[sb]p|[ds]i)l|[re]?(?:flags|ip|[bs]p|[sd]i|[abcd]x)|[cd]r[0-9]+|r[0-9]+[dwb]?|[abcd][lh]|[cst]w|fp_(?:[id]p|[cd]s|opc)|[cdefgs]s|(?:[gil]d)?tr|msw|mxcsr)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('branch_prediction', 'p[tn](?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)')]

Default list of (token_type, regex) token tuples to match to

Module contents