bincfg.normalization.x86 package
Submodules
bincfg.normalization.x86.x86_norm_funcs module
- bincfg.normalization.x86.x86_norm_funcs.x86_clean_nop(state)[source]
Cleans any line with the opcode ‘nop’ to only contain the opcode
- Parameters:
idx (int) – the index in line of the ‘nop’ opcode
line (List[TokenTuple]) – a list of (token_type, token) tuples; the current assembly line
args – unused
kwargs – unused
- Returns:
integer index in line of last handled token
- Return type:
int
- bincfg.normalization.x86.x86_norm_funcs.x86_memsize_value(self, state)[source]
Replaces memory size pointers with ‘memsize’ followed by the value of that memsize in bytes
- Parameters:
token (str) – the current string token
- Returns:
normalized memory size string
- Return type:
str
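As a rough illustration of the memory-size replacement described above, the sketch below maps a size keyword to its byte count. It does not call bincfg: the function name `memsize_value`, the table `MEMORY_SIZE_BYTES`, and the exact output spelling (`memsize` immediately followed by the number) are assumptions made for this example.

```python
import re

# Byte widths of common x86 memory size keywords (illustrative subset)
MEMORY_SIZE_BYTES = {
    "byte": 1, "word": 2, "dword": 4, "qword": 8,
    "tbyte": 10, "xmmword": 16, "ymmword": 32, "zmmword": 64,
}


def memsize_value(token):
    """Return 'memsize<bytes>' for a memory size token such as 'qword ptr'."""
    # Drop an optional trailing 'ptr' and any separator in front of it
    keyword = re.sub(r"[\s,.]*ptr$", "", token.strip().lower())
    # The output format is an assumption, not the library's confirmed format
    return "memsize" + str(MEMORY_SIZE_BYTES[keyword])
```

Unknown keywords raise a KeyError in this sketch; the library may handle them differently.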
- bincfg.normalization.x86.x86_norm_funcs.x86_replace_general_register(self, state)[source]
Replaces general registers with a default string and their size, keeping special registers the same (while removing their numbers)
- Parameters:
token (str) – the current string token
- Returns:
normalized name of register
- Return type:
str
bincfg.normalization.x86.x86_normalizers module
A bunch of builtin normalization methods based on literature.
NOTE: some of these are slightly modified from their original papers either for code purposes, or because we are using decompiled binaries instead of compiled assembly and thus lose out on some information (EG: symbol information for jump instructions)
- class bincfg.normalization.x86.x86_normalizers.X86BaseNormalizer(*args, **kwargs)[source]
Bases: BaseNormalizer
Base class for x86 normalizers.
Performs an ‘unnormalized’ normalization, removing what is likely extraneous information, and providing a base class for other x86 normalization methods to inherit from.
- Parameters:
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
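The anonymize_tokens option refers to a 4-byte shake_128 hash. A minimal sketch of such a hash using Python's standard hashlib is shown here; the helper name `anonymize_token`, the UTF-8 encoding, and the hex string output are assumptions, and the library may encode or represent the digest differently.

```python
import hashlib


def anonymize_token(token):
    """Return the 4-byte SHAKE-128 digest of a token as 8 hex characters."""
    # hexdigest() takes the digest length in bytes; 4 bytes -> 8 hex chars
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(4)
```

The same token always maps to the same digest, so anonymized token streams remain comparable with each other even though the original strings are hidden.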
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
The default tokenization level used for this normalizer
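To make the difference between the ‘op’ and ‘inst’ tokenization levels concrete, the sketch below flattens already-normalized instruction lines both ways. It is not bincfg code: `format_lines` is a hypothetical helper, and the marker token is spelled ‘INSTRUCTION_START’ only because that is how the parameter description above writes it.

```python
def format_lines(lines, level, token_sep="_"):
    """Flatten normalized instruction lines into a flat list of tokens.

    lines: list of instruction lines, each a list of string tokens
    level: 'op' (opcode/operand level) or 'inst'/'instruction'
    """
    if level == "op":
        out = []
        for line in lines:
            # Mark where each instruction begins, then emit its tokens one by one
            out.append("INSTRUCTION_START")
            out.extend(line)
        return out
    if level in ("inst", "instruction"):
        # One token per instruction: join that line's tokens with token_sep
        return [token_sep.join(line) for line in lines]
    raise ValueError(f"unknown tokenization level: {level!r}")
```

For a single line `["mov", "reg8", "immval"]`, the ‘op’ level yields four tokens (the marker plus the three originals), while the ‘inst’ level yields the single token `mov_reg8_immval`.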
- handle_all_symbols(state)[source]
Handles all symbols. We use this to keep track of when memory expressions start/end
- handle_memory_base(state)[source]
Handles the ‘base’ section of a memory addressing. Defaults to returning it processed as a register
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- handle_memory_displacement(state)[source]
Handles the ‘displacement’ section of a memory addressing. Defaults to returning it processed as an immediate
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- handle_memory_expression(state)[source]
Handles memory expressions. Splits values up into ‘base’, ‘index’, ‘scale’, and ‘displacement’
Values:
‘base’ (B): a register holding the starting base location. Handled by self.handle_memory_base()
‘index’ (I): a register that is added to a base register. Handled by self.handle_memory_index()
‘scale’ (S): an immediate value that is multiplied onto ‘index’, should only be 1, 2, 4, or 8. Handled by self.handle_memory_scale()
‘displacement’ (D): an immediate value that acts as a displacement to the memory address (or in some cases, the literal memory address itself). Handled by self.handle_memory_displacement()
Acceptable x86 memory addressing modes:
[D]
[B]
[B + I]
[B + D]
[B + I + D]
[B + I*S]
[I*S + D]
[B + I*S + D]
NOTE: You can have other formats that don’t fit known addressing modes, but the values might not be handled properly. Specifically, the first register found will be considered the ‘base’, and all subsequent are considered ‘index’, etc.
The respective handle_memory_*() methods will be called with the same parameters as this function, but ‘memory_start’ will instead be the starting index of that token, and ‘token’ will be the original token value (IE: before being handled)
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
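The base/index/scale/displacement split described for handle_memory_expression() can be sketched on plain strings as follows. This is an illustration under simplifying assumptions, not the library's implementation: `split_memory_expression` is a hypothetical helper, it expects scaled terms written as register*scale, and it treats any term starting with a letter as a register.

```python
import re


def split_memory_expression(expr):
    """Split an expression like '[ebx + esi*4 + 0x10]' into B, I, S and D."""
    inner = re.sub(r"\s+", "", expr.strip().strip("[]"))
    # Rewrite 'a-b' as 'a+-b' so that every term is separated by '+'
    terms = [t for t in inner.replace("-", "+-").split("+") if t]
    parts = {"base": None, "index": None, "scale": None, "displacement": None}
    for term in terms:
        if "*" in term:
            # A scaled register is always the index
            reg, scale = term.split("*")
            parts["index"], parts["scale"] = reg, int(scale, 0)
        elif re.match(r"[a-z]", term, re.IGNORECASE):
            # First plain register is the base, any later one is the index
            key = "base" if parts["base"] is None else "index"
            parts[key] = term
        else:
            # Anything else is an immediate displacement (decimal or 0x-hex)
            parts["displacement"] = int(term, 0)
    return parts
```

For example, `[ebx + esi*4 + 0x10]` gives base `ebx`, index `esi`, scale 4, and displacement 16, while `[eax*4 + 0x419f40]` has no base at all.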
- handle_memory_index(state)[source]
Handles the ‘index’ section of a memory addressing. Defaults to returning it processed as a register
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- handle_memory_scale(state)[source]
Handles the ‘scale’ section of a memory addressing. Defaults to returning it processed as an immediate
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- handle_memory_size(state)[source]
Handles a memory size. Removes any ‘{spacing}ptr’ where {spacing} is any amount of spacing
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- handle_segment_address(state)[source]
Handles a segment address. Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- renormalizable = True
Whether or not this normalization method can be renormalized later by other normalization methods
- save(path)
- token_sep = None
The separator string used for this normalizer. Will default to ‘ ’
- tokenizer = None
The tokenizer used for this normalizer
- class bincfg.normalization.x86.x86_normalizers.X86CompressedStatsNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A normalizer I created for use in CFG.get_compressed_stats()
Rules:
Immediates are replaced with immediate string (including negative)
function calls are either self vs. intern vs. extern func, no special functions
jump destinations are ‘jmpdst’
registers are handled the same as deepsem/deepbindiff
memory pointers/memory expressions are handled the same as in deepsemantic
Tokenized at the instruction-level
segment addresses are ignored
branch predictions are ignored
- Parameters:
special_functions (Optional[Set[str]]) – a set of special function names. All external functions whose name (ignoring any ‘@plt’ at the end) is in this set will have their name kept, otherwise they will be replaced with ‘externfunc’. If None, will attempt to load the default special function names from get_special_function_names(). If you do not wish to use any special function names, then pass an empty set.
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
- handle_branch_prediction(state)
- handle_immediate(state)
- handle_memory_size(state)
- handle_register(state)
- handle_segment_address(state)
- handle_string_literal(*, replace_previous_immediate=False)
- opcode_function_call(state, ret_only_call_type=False)
- opcode_jump(state)
- renormalizable = False
- save(path)
- class bincfg.normalization.x86.x86_normalizers.X86DeepBinDiffNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A normalizer based on the DeepBinDiff method
From the DeepBinDiff paper: https://www.ndss-symposium.org/wp-content/uploads/2020/02/24311-paper.pdf
Rules:
Constant values are ignored and replaced with ‘immval’
General registers are renamed based on length, special ones are left as-is with number information removed (EG: st5 -> st, rax -> reg8, r14d -> reg4, rip -> rip, zmm13 -> zmm)
Memory expressions are replaced with ‘memexpr’
Can’t really tell what’s supposed to be done with function calls, will just assume they should be ‘call immval’
Jump destinations are ‘immval’
Strings are left as-is (Kinda bad, but they are doing binary diffing and not binary similarity, so I’ll let it slide)
Doesn’t say anything about segment addresses, so they are ignored
Doesn’t say anything about branch predictions, so they are ignored
Tokens are at the op-level
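The register rule above (general registers renamed by size, special registers kept with their numbers removed) might be sketched as below. The helper `rename_register` and its regular expressions are assumptions written for this example; in particular, which registers count as "general" here (the a/b/c/d, si/di, and r8 to r15 families) is a guess and may not match the library's own table.

```python
import re

# Byte size for each suffix of the r8..r15 register family
_R_SUFFIX_SIZE = {"b": 1, "w": 2, "d": 4, "": 8}


def rename_register(reg):
    """Map a general register to 'reg<size>'; strip numbers from the rest."""
    reg = reg.lower()
    # r8..r15 with an optional b/w/d size suffix
    match = re.fullmatch(r"r(?:[89]|1[0-5])([bwd]?)", reg)
    if match:
        return "reg" + str(_R_SUFFIX_SIZE[match.group(1)])
    # Legacy general registers, grouped by width
    if re.fullmatch(r"[abcd][lh]|(?:si|di)l", reg):
        return "reg1"
    if re.fullmatch(r"[abcd]x|si|di", reg):
        return "reg2"
    if re.fullmatch(r"e(?:[abcd]x|si|di)", reg):
        return "reg4"
    if re.fullmatch(r"r(?:[abcd]x|si|di)", reg):
        return "reg8"
    # Everything else is treated as special: keep the name, drop any number
    return re.sub(r"[\d()]", "", reg)
```

This reproduces the examples given in the rule: st5 becomes st, rax becomes reg8, r14d becomes reg4, rip stays rip, and zmm13 becomes zmm.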
- Parameters:
replace_strings (bool) – if True, then strings will be replaced with a ‘str’ token. Default is False which is the default deepbindiff behavior in the paper and leaves full strings as individual tokens
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['op', 'opcode', 'operand', 'opcodes', 'operands']
- handle_branch_prediction(state)
- handle_immediate(*, include_negative=False)
- handle_memory_expression(state)
- handle_memory_size(state)
- handle_register(state)
- handle_segment_address(state)
- opcode_function_call(state)
- renormalizable = False
- save(path)
- class bincfg.normalization.x86.x86_normalizers.X86DeepSemanticNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A normalizer based on the DeepSemantic method
From the DeepSemantic paper: https://arxiv.org/abs/2106.05478
Rules:
- Immediates can fall into multiple categories:
Function calls:
libc function name(): “libc[name]” (instead, we use “{name}”)
recursive call: ‘self’
function within the binary: ‘innerfunc’
function outside the binary: ‘externfunc’
NOTE: they do not take into account call tables that could theoretically call both inner and extern functions. So, when this rather rare event occurs, it is given the token ‘multifunc’
Jump (branching) family: “jmpdst”
Reference: (NOTE: This might not be done for all disassembly output, such as ROSE, since they don’t always have this information readily available)
String literal: ‘str’
Statically allocated variable: “dispbss”
Data (data other than a string): “dispdata”
Default (all other immediate values): “immval”
- Registers can fall into multiple categories:
Stack/Base/Instruction pointer: Keep track of type and size [e|r]*[b|s|i]p[l]* -> [s|b|i]p[1|2|4|8]
Special purpose (IE: flags): Keep track of type cr[0-15], dr[0-15], st([0-7]), [c|d|e|f|g|s]s -> reg[cr|dr|st], reg[c|d|e|f|s]s
AVX registers: Keep track of type [x|y|z]*mm[0-7|0-31] -> reg[x|y|z]*mm
General purpose registers: Keep track of size [e|r]*[a|b|c|d|si|di][x|l|h]*, r[8-15][b|w|d]* -> reg[1|2|4|8]
- Pointers can fall into multiple categories:
Direct, small: keep track of size byte,word,dword,qword,ptr -> memptr[1|2|4|8]
Direct, large: keep track of size tbyte,xword,[x|y|z]mmword -> memptr[10|16|32|64]
Indirect, string: [base+index*scale+displacement] -> [base+index*scale+dispstr] (NOTE: we use ‘str’ instead of dispstr)
Indirect, not string: [base+index*scale+displacement] -> [base+index*scale+disp]
NOTE: for our purposes, we don’t necessarily always have base, index, scale, and displacement present, and they may appear in a different (but deterministic) order. It shouldn’t really change anything for any models, just how the tokens are formatted
NOTE: it looks like the ‘scale’ values are their original immediate values and not replaced with the ‘immval’ string, so that is taken into account as well
Tokenized at instruction-level
Doesn’t say anything about segment addresses, so they are left as-is
Doesn’t say anything about branch predictions, so they are ignored
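The function-call categories in the rules above can be illustrated with a small classifier. This is a sketch only: `classify_call` and all of its parameters are hypothetical, it works on plain function names rather than on normalizer state, and it does not cover the ‘multifunc’ call-table case.

```python
def classify_call(target, current_func, internal_funcs, special_functions):
    """Return the token used for a call to `target` made from `current_func`."""
    # Ignore a trailing '@plt' when comparing names
    name = target[:-4] if target.endswith("@plt") else target
    if name == current_func:
        return "self"          # recursive call
    if name in internal_funcs:
        return "innerfunc"     # function within the binary
    if name in special_functions:
        return name            # special external functions keep their name
    return "externfunc"        # any other function outside the binary
```

Passing an empty set for `special_functions` makes every external call collapse to ‘externfunc’, which mirrors the behavior described for the special_functions parameter.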
- Parameters:
special_functions (Optional[Set[str]]) – a set of special function names. All external functions whose name (ignoring any ‘@plt’ at the end) is in this set will have their name kept, otherwise they will be replaced with ‘externfunc’. If None, will attempt to load the default special function names from get_special_function_names(). If you do not wish to use any special function names, then pass an empty set.
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
- handle_branch_prediction(state)
- handle_immediate(*, include_negative=False)
- handle_memory_scale(state)
- handle_memory_size(state)
- handle_register(state)
- handle_string_literal(*, replace_previous_immediate=False)
- opcode_function_call(state, ret_only_call_type=False)
- opcode_jump(state)
- renormalizable = False
- save(path)
- class bincfg.normalization.x86.x86_normalizers.X86HPCDataNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A special normalizer meant for use in HPC compile jobs
This normalizer is made to reduce the total number of new tokens as much as possible while still being able to fully reproduce original BaseNormalizer output, and while trying to minimize the number of tokens per assembly line as much as possible.
Since immediate values make up the vast majority of ‘unique’ tokens (or are the root cause of there being so many) in the BaseNormalizer, this is all that is changed. Specifically:
immediate values get split into multiple tokens. EG: ‘123456789’ -> ‘1234’, ‘5678’, ‘9’ if using num_digits of 4
negatives stay connected to tokens. EG: ‘-54321’ -> ‘-543’, ‘21’ if using num_digits of 4
Before some split immediate values, a ‘split immediate’ token is inserted for later tokenization to know that the following immediate values should all be concatenated together. This is only inserted when default behavior would produce the wrong values (EG: whenever a token is split, whenever a non-split token has a split token before it, etc.)
NOTE: this should only be used with the ‘opcode’ tokenization level as it provides no benefit otherwise
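The digit splitting described above amounts to fixed-width chunking of the immediate's string form, with the minus sign counted as a digit. A minimal sketch follows; `split_immediate` is a hypothetical helper, and it deliberately leaves out the ‘split immediate’ marker token that the normalizer inserts.

```python
def split_immediate(value, num_digits=4):
    """Split an immediate's string form into chunks of at most num_digits characters."""
    if num_digits < 1:
        raise ValueError("num_digits must be positive")
    # Slice left to right; a leading '-' simply occupies one slot of the first chunk
    return [value[i:i + num_digits] for i in range(0, len(value), num_digits)]
```

With num_digits of 4 this reproduces both examples from the description: ‘123456789’ splits into ‘1234’, ‘5678’, ‘9’, and ‘-54321’ splits into ‘-543’, ‘21’.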
- Parameters:
num_digits (int) – the number of digits to use before splitting. This will include the minus sign as a digit.
replace_strings (bool) – if True, then strings will be replaced with a ‘str’ token. Default is False which is the default deepbindiff behavior in the paper and leaves full strings as individual tokens
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['op', 'opcode', 'operand', 'opcodes', 'operands']
- finalize_instruction(state)[source]
Handles a single instruction. Calls super()’s handle_instruction, then performs the immediate splitting
- Parameters:
state (NormalizerState) – dictionary of current state information. See bincfg.normalization.base_normalizer.NormalizerState
- property renormalizable
We are only losslessly renormalizable if self._replace_strings == False
- save(path)
- class bincfg.normalization.x86.x86_normalizers.X86InnerEyeNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A normalizer based on the InnerEye method
From the InnerEye paper: https://arxiv.org/pdf/1808.04706.pdf
Rules:
Constant values are ignored and replaced with ‘immval’ or ‘-immval’ for negative values
Function names are ignored and replaced with ‘func’
Strings are ‘str’
Jump destinations are ‘immval’
Registers are left as-is
Doesn’t say anything about memory sizes, so they are ignored
Doesn’t say anything about segment addresses, so they are ignored
Doesn’t say anything about branch predictions, so they are ignored
Tokens are at the instruction-level
- Parameters:
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
- handle_branch_prediction(state)
- handle_immediate(state)
- handle_memory_size(state)
- handle_segment_address(state)
- handle_string_literal(*, replace_previous_immediate=False)
- opcode_function_call(state)
- renormalizable = False
- save(path)
- class bincfg.normalization.x86.x86_normalizers.X86SafeNormalizer(*args, **kwargs)[source]
Bases: X86BaseNormalizer
A normalizer based on the SAFE method
From the SAFE paper: https://github.com/gadiluna/SAFE
Rules:
All base memory addresses (IE: memory addresses that are constant values) are replaced with ‘dispmem’ NOTE: they only specify that this is used for base memory addresses. I’m not sure if they mean any displacement values or only those that are alone with no other registers and whatnot. It doesn’t help that their implementation seems to have some bugs here (see below). So, I assume it is to be used for any displacement values, hence the string they are replaced with
All immediate values greater than some threshold (safe_threshold parameter, they use 5000 in the paper) are replaced with ‘immval’. Any immediate values smaller than said threshold (including those that are targets of call/jump instructions) are left alone
Memory sizes are ignored
Doesn’t say anything about registers, so they are left as-is
Doesn’t say anything about segment addresses, so they are left as-is
Doesn’t say anything about branch predictions, so they are ignored
Strings would be ignored, just using the immediate values associated with their memory address
Tokens are at the instruction-level
NOTE: the code from the SAFE paper has at least one bug that I’ve found, in how it handles memory expressions (in their Radare2 analyzer code, which is the only analyzer code I could find in their repo, even though they say they also use Angr). They do not consider some of the possible memory addressing methods that are allowed in x86_64 binaries. For example:
Original Disassembly             Their Result             Probably Intended Result
lea esi, [esi + ecx*2]           X_lea_esi,_[esi*2+0]     X_lea_esi,_[esi+ecx*2+0]
lea ebx, [ebx + esi*4 + 0x10]    X_lea_ebx,_[ebx*4+16]    X_lea_ebx,_[ebx+esi*4+16]
lea edi, [eax*4 + 0x419f40]      X_lea_edi,_[MEM]         X_lea_edi,_[eax*4+MEM]
They also seem to have problems with immediate/displacement values in those memory addresses:
Original Disassembly               Their Result                    Probably Intended Result
add byte [ebp + 0x4d8de455], cl    X_add_[ebp*1+1301144661],_cl    X_add_[ebp*1+MEM],_cl
So, I take a benefit-of-the-doubt approach to this normalizer and assume the authors did not intend for this to happen. Instructions are normalized taking into account these possible memory addressing methods. Any displacement values that are larger than the safe_threshold are converted into the dispmem token.
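The threshold rule used by this normalizer can be sketched as a single comparison on the absolute value. The helper `normalize_immediate` is hypothetical and operates on plain integers; the default of 5000 is the value the description attributes to the paper, and the replacement strings ‘immval’ and ‘dispmem’ are the ones named in the rules above.

```python
def normalize_immediate(value, threshold=5000, replacement="immval"):
    """Keep small immediates as-is; replace large ones with a placeholder token."""
    # Values whose absolute value is <= threshold are left alone
    if abs(value) <= threshold:
        return str(value)
    return replacement


def normalize_displacement(value, threshold=5000):
    """Same rule for memory displacements, using the 'dispmem' token."""
    return normalize_immediate(value, threshold, replacement="dispmem")
```

Applied to the displacement 0x4d8de455 from the example above, `normalize_displacement` returns ‘dispmem’, while a small displacement such as 0x10 is kept as ‘16’.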
- Parameters:
imm_threshold (int) – immediate values whose absolute value is <= imm_threshold will be left alone, those above it will be replaced with the string ‘immval’
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (‘ ’ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert an ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
- handle_branch_prediction(state)
- handle_memory_size(state)
- handle_string_literal(state)
- renormalizable = False
- save(path)
bincfg.normalization.x86.x86_tokenizer module
- class bincfg.normalization.x86.x86_tokenizer.X86BaseTokenizer(*args, **kwargs)[source]
Bases: BaseTokenizer
A default class to tokenize x86 assembly line input
This class matches the following tokens:
All of the default special tokens from parent (see BaseTokenizer())
Instruction prefix tokens (EG: ‘lock’, ‘repe’, etc.)
Memory sizes (EG: ‘qword’, ‘byte ptr’, ‘xmmword’, ‘v2float’, etc.)
Registers (see bincfg.normalization.x86.x86_tokenizer.X86_REGISTER_SIZES for the list of them)
Prepended or appended instruction prefixes and branch predictions. You can prepend or append either of those to opcodes while separating with either ‘,’, ‘_’, or ‘.’. EG: “lock_str”, “jnz,pt”. This will only apply to known prefixes and branch prediction strings; if it is unknown, it is considered a part of a larger opcode (EG: “vcmpneq_oqss”).
Known prefixes: [‘lock’, ‘rep’, ‘repe’, ‘repz’, ‘repne’, ‘repnz’]
Known branch predictions: [‘pt’, ‘pn’]
This will perform the following transformations to the incoming token stream:
Instruction prefixes and branch prediction tokens may be reordered to keep ordering consistent. NOTE: if you do not wish to have this behavior, pass ‘reorder_tokens=False’ as a kwarg to the .tokenize() call.
Any instruction prefixes alone on their own line will be moved to the next line, under the assumption that it contains the opcode they are affecting (no check is done to ensure the first token in the subsequent line is actually an opcode, however). Then, for each opcode, we get all surrounding instruction prefix and branch prediction tokens (ignoring spacing, grabbing all tokens until reaching a token that is neither an instruction prefix nor a branch prediction). These are reordered such that all instruction prefixes come first, then branch predictions after those, and finally the opcode token. Within each group, they will appear in the order that they were found in the token list. There will only be a single space ‘ ’ acting as spacing between them.
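The reordering rule (prefixes first, then branch predictions, then the opcode, each group in the order found) can be sketched on a plain list of token strings. `reorder_opcode_group` is a hypothetical helper; the real tokenizer works on (token_type, token) tuples and also handles spacing and prefixes that sit alone on a line, which this example ignores.

```python
PREFIXES = {"lock", "rep", "repe", "repz", "repne", "repnz"}
BRANCH_PREDICTIONS = {"pt", "pn"}


def reorder_opcode_group(tokens):
    """Order one opcode's surrounding tokens as prefixes, predictions, opcode."""
    # Each list comprehension preserves the order in which tokens were found
    prefixes = [t for t in tokens if t in PREFIXES]
    predictions = [t for t in tokens if t in BRANCH_PREDICTIONS]
    opcodes = [t for t in tokens
               if t not in PREFIXES and t not in BRANCH_PREDICTIONS]
    return prefixes + predictions + opcodes
```

For instance, the tokens of “jnz,pt” come out as pt then jnz, and an opcode with a trailing prefix such as str followed by lock comes out as lock then str.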
- Parameters:
tokens (Optional[List[Tuple[str, str]]]) – the tokens to use. Should be a list of 2-tuples. Each tuple is a pair of (name, regex) where name is the string name of the token, and regex is a regular expression to find that token. These tuples should be ordered in the preferred order to search for tokens. If None, then this will default to self.DEFAULT_TOKENS, which is bincfg.normalization.x86.x86_tokenizer.X86_DEFAULT_TOKENS
token_handlers (Optional[Dict[str, Callable[[Dict[str, Any]], Union[None, str]]]]) – optional dictionary mapping token type strings to functions to handle those token types when tokenizing. This is intended to be used when you wish to add entirely new token types not present in bincfg.normalization.base_tokenizer.Tokens. If you wish to change the behavior of handling an already-present token type, just override that token handler function. These will override the default token handlers.
insert_special_tokens (bool) – by default, some special tokens will be inserted at the front of tokens (see the ‘special tokens’ listed above). If you wish to stop this from happening, you can set insert_special_tokens to False
case_sensitive (bool) – If True, then regular expressions will be matched exactly as they appear. If False, then the re.IGNORECASE flag will be passed when compiling the regular expressions
- ARCHITECTURE = ['x86', 'i686', 'x86_64']
The architecture this tokenizer works on
- DEFAULT_TOKENS = [('prefix', '(?:lock|rep(?:ne|nz|e|z)?)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('memory_size', '(?:v[0-9]+)?(?:byte|[dqt]?word|t?float|l?double|[xyz]mmword)(?:[, \\t.]+ptr)?(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('register', '(?:[xyz]?mm[0-9]+|st\\(?[0-9]*\\)?|(?:[sb]p|[ds]i)l|[re]?(?:flags|ip|[bs]p|[sd]i|[abcd]x)|[cd]r[0-9]+|r[0-9]+[dwb]?|[abcd][lh]|[cst]w|fp_(?:[id]p|[cd]s|opc)|[cdefgs]s|(?:[gil]d)?tr|msw|mxcsr)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('branch_prediction', 'p[tn](?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)')]
- save(path)
- bincfg.normalization.x86.x86_tokenizer.X86_DEFAULT_TOKENS = [('prefix', '(?:lock|rep(?:ne|nz|e|z)?)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('memory_size', '(?:v[0-9]+)?(?:byte|[dqt]?word|t?float|l?double|[xyz]mmword)(?:[, \\t.]+ptr)?(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('register', '(?:[xyz]?mm[0-9]+|st\\(?[0-9]*\\)?|(?:[sb]p|[ds]i)l|[re]?(?:flags|ip|[bs]p|[sd]i|[abcd]x)|[cd]r[0-9]+|r[0-9]+[dwb]?|[abcd][lh]|[cst]w|fp_(?:[id]p|[cd]s|opc)|[cdefgs]s|(?:[gil]d)?tr|msw|mxcsr)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('branch_prediction', 'p[tn](?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)'), ('opcode', '(?:[a-z][a-z0-9_]*)(?=[, \\t.]+|[|\\n]|[\\[\\]+*<>:]|$)')]
Default list of (token_type, regex) token tuples to match to
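To show how an ordered list of (token_type, regex) pairs like the one above can classify text, the sketch below copies the five patterns into raw strings and tries them in order with Python's re module. It is not the tokenizer's actual matching loop: `first_token` is a hypothetical helper, case-insensitive matching is assumed, and the default special tokens inherited from BaseTokenizer (immediates, spacing, and so on) are left out.

```python
import re

# Shared lookahead that ends every pattern in the default token list
_END = r"(?=[, \t.]+|[|\n]|[\[\]+*<>:]|$)"

TOKENS = [
    ("prefix", r"(?:lock|rep(?:ne|nz|e|z)?)" + _END),
    ("memory_size", r"(?:v[0-9]+)?(?:byte|[dqt]?word|t?float|l?double|[xyz]mmword)"
                    r"(?:[, \t.]+ptr)?" + _END),
    ("register", r"(?:[xyz]?mm[0-9]+|st\(?[0-9]*\)?|(?:[sb]p|[ds]i)l"
                 r"|[re]?(?:flags|ip|[bs]p|[sd]i|[abcd]x)|[cd]r[0-9]+|r[0-9]+[dwb]?"
                 r"|[abcd][lh]|[cst]w|fp_(?:[id]p|[cd]s|opc)|[cdefgs]s"
                 r"|(?:[gil]d)?tr|msw|mxcsr)" + _END),
    ("branch_prediction", r"p[tn]" + _END),
    ("opcode", r"(?:[a-z][a-z0-9_]*)" + _END),
]


def first_token(text):
    """Return (token_type, matched_text) for the first pattern matching at the start."""
    for name, pattern in TOKENS:
        match = re.match(pattern, text, re.IGNORECASE)
        if match:
            return name, match.group()
    return None
```

Because the patterns are tried in order, a string such as ‘lock’ is reported as a prefix even though the more general opcode pattern would also match it.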