bincfg.normalization package
This subpackage provides classes to tokenize and normalize assembly lines, as well as the ability to easily create new tokenization/normalization methods.
This library currently supports the following architectures:
x86/x86_64
java
And disassembler output from the following binary analysis tools:
Tokenizer classes convert assembly instructions into lists of individual tokens for later processing. Normalizer
classes take those tokens and normalize them to create the final string tokens for later use in models. This normalization
process is useful to prevent overfitting and Out of Vocabulary (OOV) problems in machine learning models.
An example of using a default X86BaseNormalizer on some x86_64 assembly:
from bincfg.normalization import X86BaseNormalizer
asm_lines = [
    '0x00402cdd: add rsp, 0x08',
    '0x00402cf0: push qword ds:[rip + 0x0000000000252312<absolute=0x0000000000655008>]',
    'CALL 0x0000000000403360'
]
normalizer = X86BaseNormalizer()
for line in asm_lines:
    print(normalizer.normalize(line))
Which would give the output:
>>> add rsp 8
>>> push qword [ rip + 2433810 ]
>>> call 4207456
The BaseNormalizer classes by default do some simple cleaning while keeping all of the necessary information for the
assembly line itself. For example: removing memory addresses of the instruction itself if it exists, converting all values
to decimal, removing extra whitespace/commas, etc.
This process is split into two main parts: tokenization, and normalization.
Tokenization
Normalization
Normalizer classes will normalize incoming strings. They do this by first tokenizing the strings (using either a
user-defined or default tokenizer), then normalizing that stream of (token_name, token_string) tuples into strings.
Normalization has two possible Tokenization Levels for the incoming strings:
‘op’: opcode/operand level tokenization. Each individual opcode/operand gets normalized into its own token
‘instruction’: instruction level tokenization. Each instruction line gets normalized into a single token, with all opcodes/operands in that instruction joined together, separated by some separator string (defaults to ‘ ‘ for BaseNormalizer, and ‘_’ for all other normalizers)
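The difference between the two tokenization levels can be illustrated with a small standalone sketch. It does not import bincfg: the stringify function and the token lists are hypothetical names chosen for illustration, and the library's own behavior may differ in detail (for example, the parameter documentation mentions an INSTRUCTION_START marker at the ‘op’ level, which this sketch leaves out).

```python
from typing import List


def stringify(lines: List[List[str]], level: str, sep: str = "_") -> List[str]:
    """Turn per-instruction token lists into final string tokens."""
    if level == "op":
        # One output token per opcode/operand, flattened across instructions
        return [tok for line in lines for tok in line]
    if level == "instruction":
        # One output token per instruction, joined by the separator string
        return [sep.join(line) for line in lines]
    raise ValueError(f"unknown tokenization level: {level!r}")


lines = [["add", "rsp", "8"], ["call", "4207456"]]
op_tokens = stringify(lines, "op")
inst_tokens = stringify(lines, "instruction")
print(op_tokens)    # ['add', 'rsp', '8', 'call', '4207456']
print(inst_tokens)  # ['add_rsp_8', 'call_4207456']
```

Instruction-level tokens give a model one vocabulary entry per whole instruction, while op-level tokens keep the vocabulary small at the cost of longer sequences.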
This library has a few built-in normalization methods based on literature:
InnerEye: https://arxiv.org/pdf/1808.04706.pdf
Deep Bin Diff: https://www.ndss-symposium.org/wp-content/uploads/2020/02/24311-paper.pdf
Deep Semantic: https://arxiv.org/abs/2106.05478
This module also provides a normalize_cfg_data() function to normalize CFG data.
Custom Normalizers
Creating custom normalizers is quite simple. In fact, several of the built-in normalization techniques take only a few lines of code:
class X86InnerEyeNormalizer(X86BaseNormalizer):
    DEFAULT_TOKENIZATION_LEVEL = TokenizationLevel.INSTRUCTION
    handle_immediate = return_immstr(include_negative=True)
    handle_memory_size = ignore
    handle_function_call = replace_function_call_immediate(FUNCTION_CALL_STR)
Custom normalizers should inherit from BaseNormalizer, and override parent methods to alter functionality. Most
methods do exactly as they say, “handling” the tokens in their names:
handle_opcode()
handle_memory_size()
handle_register()
handle_immediate()
handle_memory_expression()
handle_rose_info()
handle_ignored()
handle_mismatch()
There are some handlers that have slightly different functionality:
handle_newline(): this gets called after each full string has been parsed, or a new line character was found, indicating the end of a single assembly instruction. The full instruction will then be parsed, modified if necessary, specific opcodes handled, and converted into the final string (or list of strings if using ‘op’ tokenization level).
handle_instruction(): this gets called by handle_newline(). It will parse the full instruction, checking for any specific opcodes that need to be handled. This method does not do any other cleaning/converting of the instruction.
Specific opcodes can be handled differently after the full line has been parsed. The register_opcode_handler() function allows you to pass in a string regular expression to identify the opcodes to handle, and a function to handle those opcodes. There are also a few built-in opcode handler functions:
handle_jump(): handles jump instructions
handle_call(): handles call instructions
‘nop’ instructions: all ‘nop’ instructions will have everything stripped from them except the ‘nop’ opcode itself, since there is often a large amount of useless/extraneous information alongside those filler instructions
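The idea of matching opcodes against registered regular expressions can be sketched without bincfg. In the sketch below, OpcodeRegistry, strip_nop, mask_jump_target and the ‘JMP_DST’ placeholder are all hypothetical names for illustration; the handlers here receive a plain list of string tokens rather than the library's normalizer state, so this is only an approximation of the mechanism.

```python
import re
from typing import Callable, List, Pattern, Tuple

Handler = Callable[[List[str]], None]


class OpcodeRegistry:
    """Hypothetical stand-in for a normalizer's opcode handler table."""

    def __init__(self) -> None:
        self._handlers: List[Tuple[Pattern[str], Handler]] = []

    def register(self, op_regex: str, func: Handler) -> None:
        # Handlers are kept in registration order; the first match wins
        self._handlers.append((re.compile(op_regex), func))

    def apply(self, line: List[str]) -> List[str]:
        # line[0] is assumed to be the opcode of an already-parsed instruction
        for regex, func in self._handlers:
            if line and regex.fullmatch(line[0]):
                func(line)
                break
        return line


def strip_nop(line: List[str]) -> None:
    # Keep only the opcode itself, dropping the filler operands
    del line[1:]


def mask_jump_target(line: List[str]) -> None:
    # Replace the jump operands with a fixed placeholder token
    line[1:] = ["JMP_DST"]


registry = OpcodeRegistry()
registry.register(r"nop\w*", strip_nop)
registry.register(r"j[a-z]+", mask_jump_target)

print(registry.apply(["nop", "word", "[", "rax", "]"]))  # ['nop']
print(registry.apply(["jne", "4207456"]))                # ['jne', 'JMP_DST']
print(registry.apply(["add", "rax", "1"]))               # ['add', 'rax', '1']
```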
Finally, one can add in behavior for brand new token types using the handle_unknown_token() method, which will have
passed to it the token_name and token_string whenever an unknown token_name is found. This way, you need not create
an entirely new Normalizer class, and can still use BaseNormalizer as a parent, if you wish to add in new token
types to parse.
For info on method signatures/expected return values, see their documentation below.
As shown above, you need only set the handler to the desired function to change behavior. This can be done either when building the class definition, or during the __init__ call.
There are multiple utility functions defined under bincfg.normalization.norm_utils that can be used to set the handlers above to different common behaviors without having to implement those functions yourself.
One may also set the DEFAULT_TOKENIZATION_LEVEL attribute on the class definition/instances to change what the default tokenization level behavior will be.
Subpackages
- bincfg.normalization.java package
- bincfg.normalization.x86 package
- Submodules
- bincfg.normalization.x86.x86_norm_funcs module
- bincfg.normalization.x86.x86_normalizers module
X86BaseNormalizer: DEFAULT_TOKENIZATION_LEVEL, handle_all_symbols(), handle_memory_base(), handle_memory_displacement(), handle_memory_expression(), handle_memory_index(), handle_memory_scale(), handle_memory_size(), handle_segment_address(), opcode_function_call(), opcode_jump(), renormalizable, save(), token_sep, tokenizer
X86CompressedStatsNormalizer: DEFAULT_TOKENIZATION_LEVEL, handle_branch_prediction(), handle_immediate(), handle_memory_size(), handle_register(), handle_segment_address(), handle_string_literal(), opcode_function_call(), opcode_jump(), renormalizable, save()
X86DeepBinDiffNormalizer: DEFAULT_TOKENIZATION_LEVEL, handle_branch_prediction(), handle_immediate(), handle_memory_expression(), handle_memory_size(), handle_register(), handle_segment_address(), opcode_function_call(), renormalizable, save()
X86DeepSemanticNormalizer: DEFAULT_TOKENIZATION_LEVEL, handle_branch_prediction(), handle_immediate(), handle_memory_scale(), handle_memory_size(), handle_register(), handle_string_literal(), opcode_function_call(), opcode_jump(), renormalizable, save()
X86HPCDataNormalizer
X86InnerEyeNormalizer: DEFAULT_TOKENIZATION_LEVEL, handle_branch_prediction(), handle_immediate(), handle_memory_size(), handle_segment_address(), handle_string_literal(), opcode_function_call(), renormalizable, save()
X86SafeNormalizer
- bincfg.normalization.x86.x86_tokenizer module
- Module contents
Submodules
bincfg.normalization.base_normalizer module
Classes for normalizing assembly instructions.
- class bincfg.normalization.base_normalizer.BaseNormalizer(*args, **kwargs)[source]
Bases: object
A base class for a normalization method.
This should be subclassed once for each new instruction set to create a base normalizer for that instruction set that performs a default ‘unnormalized’ normalization.
There are three types of functions that are intended to be overridden when needed:
Token handlers: these functions will start with ‘handle’ and are used to handle either single tokens, or small groups of similar tokens (EG: memory expressions). They should accept both self and ‘state’ as inputs (see bincfg.normalization.base_normalizer.NormalizerState) and can return either a token which will be added to the end of the current line, or None to not add any token post-calling.
Opcode handlers: these functions will start with ‘opcode’ and are used to handle specific opcodes (not the ‘opcode’ token in general, only specific ones like ‘call’ or ‘jump’ opcodes). They should accept both self and ‘state’ as inputs (see bincfg.normalization.base_normalizer.NormalizerState) and can return either the integer index of the next token that should be checked (IE: “we have handled all tokens up to but not including this index”), or None to indicate the previously mentioned index is just one after the opcode. These operate directly on the state’s current ‘.line’ attribute. These are expected to be called only after the entire current line has finished being parsed and normalized. New opcode handlers can be added with self.register_opcode_handler()
Administrative functions: these functions perform different administrative operations before, during, or after normalizing the individual tokens. Some examples include:
‘finalize_instruction’: used as a post-processing function once an instruction has finished being normalized to perform extra processing to the line, apply opcode handlers, stringify the line, update the normalizer state
‘hash_token’: hashes a fully processed string token (if self.anonymize_tokens=True)
‘stringify_line’: takes the current line of token tuples and converts into strings based on self.tokenization_level
Disassembler Information:
Extra information from the disassembler can be inserted into the lines within angle brackets “<>” (see BaseTokenizer() for info on how this can be tokenized). This disassembler info will be treated as a single token, and passed to the self.handle_disassembler_info function. By default, the normalizer will check for the following, in order:
1. Valid JSON. If the data inside of the angle brackets is valid JSON, then it will be parsed into a JSON object. This JSON object will be inserted into the state.disinfo_json attribute in the normalizer state. There are a few special cases for this JSON data that have special effects by default:
If this object is an integer, we will attempt to insert it into a previous immediate value like in #2 below
If this is a string, we will always insert it as a string literal like in #3 below
If this is a dictionary, there are a few special keys that one can use:
‘immediate’: value should be an integer. We will attempt to insert value into a previous immediate value like in #2 below
‘insert’: this value will be inserted into the string. If it is already a string, it is left as-is. If not a string, then we call repr() on it to convert it into a string. Insertion actions depend on whether or not the key ‘insert_type’ is present.
If not present, this value will first be tokenized/normalized by this normalizer and that value + token type will be inserted. Should that fail, then the value will be inserted as a string literal WITHOUT processing it as a string literal token (and, it won’t have quotes on it).
If the ‘insert_type’ key is present, then it can be one of two values:
String token_type: the value will be handled as if it is of this token type, no matter what the value actually is, then it will be inserted (assuming that token handler did not return None)
False (the JSON object, not the string): the value will be immediately inserted as a string literal WITHOUT processing it as a string literal token (and, it won’t have quotes on it)
‘insert_type’: Determines the token type for an ‘insert’ key value. Ignored if the ‘insert’ key is not present. See the ‘insert’ key for more info
2. Otherwise, if the disassembler info token starts with an immediate value within the angle brackets, and there is an immediate value token immediately preceding them (ignoring spacing tokens), this will replace said immediate value token with the immediate value found within the disassembler info. The inserted value will first be handled by the appropriate handler for Token.IMMEDIATE token types. EG: “add rax 0xffff <-1>” -> “add rax -1”
3. Otherwise, if the disassembler info token starts with a string literal, this will insert that string literal right where it appears (and, that string literal will be handled with self.handle_string_literal). The inserted value will first be handled by the appropriate handler for Token.STRING_LITERAL token types.
4. Finally, if it doesn’t match anything above, then it will fail silently and be ignored. If you wish to raise an error when this happens instead, you can pass raise_unk_di=True when calling .normalize()
The disassembler tokens themselves are always ignored by default.
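A simplified, standalone sketch of the JSON integer and string cases described above may help. It is not the library's implementation: apply_disinfo, RE_DISINFO and RE_IMM are hypothetical names, the input is assumed to be an already-split list of string tokens, and the dictionary keys (‘immediate’, ‘insert’, ‘insert_type’) are left out entirely.

```python
import json
import re
from typing import List

RE_DISINFO = re.compile(r"<([^<>]*)>")              # a whole '<...>' token
RE_IMM = re.compile(r"-?(?:0x[0-9a-fA-F]+|\d+)")    # a simple immediate value


def apply_disinfo(tokens: List[str]) -> List[str]:
    """Resolve '<...>' tokens in a list of string tokens."""
    out: List[str] = []
    for tok in tokens:
        m = RE_DISINFO.fullmatch(tok)
        if m is None:
            out.append(tok)
            continue
        try:
            value = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # not valid JSON: silently ignored in this sketch
        if isinstance(value, bool):
            continue  # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, int) and out and RE_IMM.fullmatch(out[-1]):
            out[-1] = str(value)           # replace the preceding immediate
        elif isinstance(value, str):
            out.append(json.dumps(value))  # insert as a double-quoted literal
    return out


print(apply_disinfo(["add", "rax", "0xffff", "<-1>"]))  # ['add', 'rax', '-1']
print(apply_disinfo(["push", '<"hello">']))             # ['push', '"hello"']
print(apply_disinfo(["mov", "rax", "<not json>"]))      # ['mov', 'rax']
```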
NOTE: escapes will be treated normally within all strings. EG: ‘\n’ will be considered the newline character, but ‘\\n’ will escape the escape and produce the string ‘\n’.
NOTE: immediates and string literals must match those found in bincfg.normalization.norm_utils (RE_IMMEDIATE and RE_STRING_LITERAL). The disassembler info does not take into account the regexes used to parse immediates and string literals for the specific normalizer.
- Parameters:
tokenizer (Tokenizer) – the tokenizer to use
token_handlers (Optional[Dict[str, Callable[[NormalizerState], Union[None, str]]]]) – optional dictionary mapping string token types to functions to handle those tokens. These will override any token handlers that are used by default (IE: all of the self.handle_* functions). Functions should take one arg (the current normalizer state) as input and return either the next string token to add to the current line, or None to not add anything. This is useful for adding more methods to handle new token types that are not builtin.
token_sep (Optional[str]) – the string to use to separate each token in returned instruction lines. Only used if tokenization_level is ‘instruction’. If None, then a default value will be used (’ ‘ for unnormalized using BaseNormalizer(), ‘_’ for everything else)
tokenization_level (Optional[Union[TokenizationLevel, str]]) – the tokenization level to use for return values. Can be a string, or a TokenizationLevel type. Strings can be:
‘op’: tokenized at the opcode/operand level. Will insert a ‘INSTRUCTION_START’ token at the beginning of each instruction line
‘inst’/‘instruction’: tokenized at the instruction level. All tokens in each instruction line are joined together using token_sep to construct the final token
‘auto’: pick the default value for this normalization technique
anonymize_tokens (bool) – if True, then tokens will be anonymized by taking their 4-byte shake_128 hash. Why does this exist? Bureaucracy.
- DEFAULT_TOKENIZATION_LEVEL = ['inst', 'instruction', 'line', 'instructions', 'lines']
The default tokenization level used for this normalizer
- add_line_to_sentence(state)[source]
Stringifies the current line, then adds it to the normalized lines and clears state.line
- finalize_instruction(state)[source]
Handles an entire instruction once reaching a new line
If overridden, should at the very least:
call all the registered opcode handlers for each known opcode token (while updating token_type/token/token_idx)
By default, each opcode handler is expected to take in the current state, and return either the integer index of the next token that should be checked (IE: “we have handled all tokens up to but not including this index”), or None to indicate the previously mentioned index is just one after the opcode
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_all_symbols(state)[source]
Handles symbols (‘+’, ‘[’, ‘]’, ‘*’, ‘:’). Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_branch_prediction(state)[source]
Handles a branch prediction. Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_disassembler_info(state)[source]
Handles disassembler information
See BaseNormalizer() for more info on how disassembler info is parsed.
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_immediate(state)[source]
Handles an immediate value. Defaults to converting into decimal
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_instruction_address(state)[source]
Handles an instruction address. Defaults to ignoring these tokens
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_instruction_prefix(state)[source]
Handles an instruction prefix. Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_memory_size(state)[source]
Handles a memory size. Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_mismatch(state)[source]
What to do when the normalization method finds a token mismatch (in case they were ignored in the tokenizer)
Defaults to raising a TokenMismatchError()
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- Raises:
TokenMismatchError – always
- handle_newline(state)[source]
Handles a newline token. Defaults to ignoring the token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_opcode(state)[source]
Handles an opcode. Defaults to returning the original token
NOTE: This should only be used to determine how all opcode strings are handled. For how to handle specific opcodes to give them different behaviors, see register_opcode_handler()
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_register(state)[source]
Handles a register. Defaults to returning the original token
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_spacing(state)[source]
Handles spacing. Defaults to ignoring these tokens
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_string_literal(state)[source]
Handles string literals. Defaults to returning the original token as a double-quoted string
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- handle_unknown_token(state)[source]
Handles an unknown token. Defaults to raising an UnknownTokenError
Should return either the token to add to the current line, or None to not add any token
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- Raises:
UnknownTokenError – always
- hash_token(token)[source]
Hashes tokens during anonymization
By default, converts each individual token into its 4-byte shake_128 hash
- Parameters:
token (str) – the string token to hash
- Returns:
the 4-byte shake_128 hash of the given token
- Return type:
str
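For reference, a 4-byte shake_128 digest can be computed with Python's standard hashlib module as sketched below. This is an illustration rather than the library's code: the UTF-8 encoding of the token and the hexadecimal string form of the result are assumptions, since the documentation above only states that a 4-byte shake_128 hash is returned as a str.

```python
import hashlib


def hash_token(token: str) -> str:
    # shake_128 is an extendable-output function, so the digest length
    # (here 4 bytes) is chosen by the caller
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(4)


hashed = hash_token("add_rsp_8")
print(len(hashed))  # 8, since 4 bytes are written as 8 hex characters
```

With only 4 bytes of output, collisions between distinct tokens become likely once a vocabulary grows into the tens of thousands, so such hashes hide token text but should not be treated as unique identifiers.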
- classmethod load(path)
- normalize(*strings, cfg=None, block=None, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]
Normalizes the given iterable of strings.
- Parameters:
strings (str) – arbitrary number of strings to normalize
cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
newline_tup (Tuple[str, str], optional) – the tuple to insert in between each passed string, or None to not insert anything. Defaults to self.tokenizer.DEFAULT_NEWLINE_TUPLE
match_instruction_address (bool, optional) – if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any in-between Tokens.SPACING) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:
“0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
“ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]
“0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
kwargs (Any) – extra kwargs to pass along to tokenization method, and to store in normalizer state
- Returns:
a list of normalized string instruction lines
- Return type:
List[str]
- register_opcode_handler(op_regex, func_or_str_name)[source]
Registers an opcode handler for this normalizer
Adds the given op_regex as an opcode to handle during self._handle_instruction() along with the given function to call with token/cfg arguments. op_regex can be either a compiled regex expression, or a string which will be compiled into a regex expression. func_or_str_name can either be a callable, or a string. If it’s a string, then that attribute will be looked up on this normalizer dynamically to find the function to use.
Notes for registering opcode handlers:
passing instance method functions converts them to strings automatically
lambdas or inner functions (not at global scope) cannot be pickled
opcodes will be matched in the order they were passed in
- Parameters:
op_regex (Union[str, Pattern]) – a string or compiled regex
func_or_str_name (Union[Callable, str]) – the function to call with token/cfg arguments when an opcode matches op_regex, or a string name of a callable attribute of this normalizer to be looked up dynamically
- renormalizable = False
Whether or not this normalization method can be renormalized later by other normalization methods
- save(path)
- stringify_line(state)[source]
Converts the current line into a list of final normalized string tokens and returns that list
Also normalizes the case, converting all tokens (except those in strings) to lowercase
- Parameters:
state (NormalizerState) – dictionary of current state information. See
bincfg.normalization.base_normalizer.NormalizerState
- Returns:
a list of tokens to add to state.normalized_lines
- Return type:
List[str]
- token_sep = None
The separator string used for this normalizer
Will default to ‘ ‘
- tokenization_level = ['auto', 'automatic', 'default']
The tokenization level to use for this normalizer
- tokenize(*strings, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]
Tokenizes the given strings using this normalizer’s tokenizer
See the docs for BaseTokenizer() for more info on how tokenization works, how to create subclasses, etc.
- Parameters:
strings (str) – arbitrary number of strings to tokenize.
newline_tup (Optional[Tuple[str, str]]) – the tuple to insert in between each passed string, or None to not insert anything. Defaults to self.__class__.DEFAULT_NEWLINE_TUPLE.
match_instruction_address (bool, optional) – if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any in-between Tokens.SPACING) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:
“0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
“ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]
“0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
kwargs (Any) – extra kwargs to store in the tokenizer state, for use in child classes
- Returns:
list of (token_type, token) tuples
- Return type:
List[Tuple[str, str]]
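The tokenize-then-relabel behavior described for match_instruction_address can be approximated with Python's re module, following the named-group pattern from the standard library's tokenizer example. Everything below is a hypothetical sketch: the lowercase token type names, TOKEN_SPEC and this tokenize function are illustrative and do not mirror the library's Tokens enum or its x86 token patterns.

```python
import re
from typing import List, Tuple

# Order matters: earlier alternatives win when several could match
TOKEN_SPEC = [
    ("immediate", r"0x[0-9a-fA-F]+|\d+"),
    ("colon", r":"),
    ("spacing", r"[ \t,]+"),
    ("symbol", r"[\[\]+*]"),
    ("word", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("mismatch", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(line: str, match_instruction_address: bool = True) -> List[Tuple[str, str]]:
    tokens = [(m.lastgroup, m.group()) for m in MASTER.finditer(line)]
    if not match_instruction_address:
        return tokens
    # Locate the first non-spacing token; only a leading immediate qualifies
    first = next((i for i, (t, _) in enumerate(tokens) if t != "spacing"), None)
    if first is None or tokens[first][0] != "immediate":
        return tokens
    end = first + 1
    # Absorb an optional colon, and any spacing before it, into the address
    j = end
    while j < len(tokens) and tokens[j][0] == "spacing":
        j += 1
    if j < len(tokens) and tokens[j][0] == "colon":
        end = j + 1
    address = "".join(tok for _, tok in tokens[first:end])
    return tokens[:first] + [("instruction_address", address)] + tokens[end:]


print(tokenize("0x1234: add rax rax")[0])     # ('instruction_address', '0x1234:')
print(tokenize(" 0x1234 : add rax rax")[:2])  # [('spacing', ' '), ('instruction_address', '0x1234 :')]
print(tokenize("0x1234 add rax rax")[0])      # ('instruction_address', '0x1234')
```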
- tokenizer = None
The tokenizer used for this normalizer
- class bincfg.normalization.base_normalizer.MetaNorm(name, bases, dct)[source]
Bases: type
A metaclass for BaseNormalizer.
- The Problem:
If you change instance functions within the __init__ method (EG: see the SAFE _handle_immediate() function being changed in __init__), then ‘self’ will not automatically be passed to those functions.
NOTE: this is specifically useful when the effect of a normalization method depends on parameters sent to the instance, not inherent to the class
NOTE: this is not the case for any functions that are set during class initialization (EG: outside of the __init__() block)
So, any functions changed within __init__ methods must be altered to also pass ‘self’. I ~could~ force the users to have to call a ‘__post_init__()’ function or something, but can we count on them (IE: myself) to always do that?…
- The Solution:
This metaclass inserts extra code before and after any normalizer’s __init__ method is called. That code keeps track of all instance functions before initialization, and checks to see if any of them change after initialization. This means someone re-set a function within __init__ (IE: self._handle_immediate = …). When this happens, ‘self’ will not automatically be passed when that function is called. These functions are then wrapped to also automatically pass ‘self’.
NOTE: to determine if a function changes, we just check equality between previous and new functions using getattr(self, func_name). I don’t know why basic ‘==’ works but ‘is’ and checking ids do not, but I’m not going to question it…
NOTE: We also have to keep track of the instance functions as an instance variable in case a parent class needs their function updated, or if a child class also changes a parent class’s function in init
NOTE: this will mean you cannot call all of that class’s methods and expect them to always be the same as calling instance methods if you change functions in __init__
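The underlying Python behavior can be reproduced in a few lines. The sketch below is a simplified illustration and not MetaNorm itself: AutoBind, Plain, Wrapped and scale_immediate are hypothetical names, and instead of comparing functions before and after __init__ it simply binds every plain function found on the instance whose name starts with ‘handle_’.

```python
import types


class AutoBind(type):
    """Hypothetical metaclass that binds functions stored on an instance in __init__."""

    def __call__(cls, *args, **kwargs):
        obj = super().__call__(*args, **kwargs)  # runs __new__ and __init__
        for name, value in list(vars(obj).items()):
            # Plain functions in the instance dict are not bound automatically
            if name.startswith("handle_") and isinstance(value, types.FunctionType):
                setattr(obj, name, types.MethodType(value, obj))
        return obj


def scale_immediate(self, value):
    # Needs 'self' because the result depends on an instance parameter
    return value * self.factor


class Plain:
    def __init__(self, factor):
        self.factor = factor
        self.handle_immediate = scale_immediate  # stored on the instance


class Wrapped(Plain, metaclass=AutoBind):
    pass


try:
    Plain(2).handle_immediate(21)
    plain_failed = False
except TypeError:
    plain_failed = True  # 'self' was not passed, so 'value' is missing

result = Wrapped(2).handle_immediate(21)
print(plain_failed, result)  # True 42
```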
- class bincfg.normalization.base_normalizer.NormalizerState(**kwargs)[source]
Bases: object
A class that contains information during a normalizer’s normalization process
- block = None
the CFGBasicBlock that this token belongs to, or None if not using
- Type:
Optional[bincfg.CFGBasicBlock]
- cfg = None
the CFG that this token’s basic block belongs to, or None if not using
- Type:
Optional[bincfg.CFG]
- disinfo_json = None
the parsed json from a disinfo object
- Type:
Optional[JSONObject]
- handlers = None
dictionary of current token handler functions
- Type:
Dict[str, Callable[[NormalizerState], Union[str, None]]]
- kwargs = None
dictionary of extra kwargs for use in tokenization, or child classes
- Type:
Dict
- line = None
list of all TokenTuple’s in this current line. TokenTuple = (token_type [from bincfg.normalization.base_tokenizer.Tokens enum], new_token_string, original_token_string)
- Type:
List[Tuple[str, str, str]]
- match_instruction_address = None
whether or not we are matching instruction addresses at the beginning of assembly lines. This is very likely always True
- Type:
bool
- memory_start = None
the index of the start of the current memory expression, or None if we are not in a memory expression currently
- Type:
Optional[int]
- newline_tup = None
the newline tuple being used (token_type [probably Tokens.NEWLINE], token_string), or None if not using
- Type:
Optional[Tuple[str, str]]
- normalized_lines = None
list of all currently normalized lines/tokens (depending on self.tokenization_level)
- Type:
List[str]
- orig_token = None
The current string token being normalized
- Type:
str
- raw_strings = None
list of all of the raw strings passed to the current .normalize() call
- Type:
List[str]
- token = None
The current processed version of token if it has already been partially or fully normalized, or None if not
- Type:
str
- token_idx = None
The index of the current token in ‘line’
- Type:
int
- property token_tuple
Returns (token_type, token, orig_token)
- token_type = None
The token type of the current token, see bincfg.normalization.base_tokenizer.Tokens
- Type:
str
bincfg.normalization.base_tokenizer module
Class for tokenizing assembly lines, as well as other tokenization constants
- class bincfg.normalization.base_tokenizer.Architectures(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
Known (but not necessarily supported) architectures
- JAVA = ['java', 'java_bytecode']
- X86 = ['x86', 'i686', 'x86_64']
- class bincfg.normalization.base_tokenizer.BaseTokenizer(*args, **kwargs)[source]
Bases: object
A default class to tokenize instructions
Should be subclassed once for each instruction set, providing the tokens being used.
Many functions may be overridden to change tokenization behavior. These functions all start with the name token_… and take as input a single state dictionary and return either a string for the next token to append to the current line being tokenized, or None to not add anything to the line. The state dictionary contains the following:
‘tokenizer’ (BaseTokenizer): this tokenizer
‘kwargs’ (Dict[str, Any]): dictionary of extra kwargs passed to the initial call to the tokenize function
‘all_strings’ (List[str]): list of input strings (args) passed to the initial call to the tokenize function
‘token_handlers’ (Dict[str, Callable[]]): dictionary mapping token types to the function that handles that token
‘sentence’ (List[Tuple[str, str]]): list of processed token tuples to return, each a tuple of (token_name, token)
‘newline_tup’ (Union[None, Tuple[str, str]]): token tuple to add at the end of each line to indicate a new line
‘match_instruction_address’ (bool): whether or not we are matching instruction addresses
‘split_imm’ (bool): whether or not we are currently handling an immediate token that was split
‘line’ (List[Tuple[str, str]]): the current line of tokens we are working on
‘string’ (str): the current string being tokenized
‘token_type’ (str): the type of the ‘token’, should be from bincfg.normalization.base_tokenizer.Tokens
‘token’ (str): the currently matched token string
‘match’ (re.Match): the re match object that matched this token
Some extra functions are available for overriding including:
handle_line(): called at the end of each line being tokenized (an individual string passed to the tokenizer)
handle_sentence(): called at the end of each sentence being tokenized (aggregation of all lines passed to the tokenizer)
Each instruction set architecture (ISA) should have its own Tokenizer class that inherits from BaseTokenizer. The tokenization process uses python’s re module to perform tokenization, converting strings into streams of (token_name, token_string) tuples. For more information on how to use regex to create tokenizers, see: https://docs.python.org/3/library/re.html#writing-a-tokenizer
TOKENIZATION PROCESS
Clean the incoming instruction strings using the passed clean_instruction_func
Iterate through the strings finding all tokens
Each token is sent to its corresponding token handler function
At the end of each ‘line’ (EG: end of a passed string, reaching Tokens.NEWLINE token, etc.), that line is handled with the handle_line() function
All tokens are added to the same return ‘sentence’, even if multiple strings in strings were passed
After all strings have been tokenized and lines handled, the final return ‘sentence’ is sent to handle_sentence()
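The steps above can be sketched with the pattern from the Python re documentation. This is a minimal illustration and not bincfg's implementation: the token names, the regexes, and the toy_tokenize function are all made up for the example, and it skips the cleaning step and the handle_line()/handle_sentence() hooks.

```python
import re

# Hypothetical (name, regex) pairs, ordered by search priority. The catch-all
# 'mismatch' pattern goes last so it only fires when nothing else matches.
TOY_TOKENS = [
    ("immediate", r"0x[0-9a-fA-F]+|[0-9]+"),
    ("register", r"r[a-z]{2}"),
    ("opcode", r"[a-z]+"),
    ("spacing", r"[ ,\t]+"),
    ("newline", r"[\n|]"),
    ("mismatch", r"."),
]
TOY_RE = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOY_TOKENS))


def toy_tokenize(*strings, newline_tup=("newline", "\n")):
    """Return one 'sentence' of (token_type, token) tuples for all strings."""
    sentence = []
    for i, string in enumerate(strings):
        line = []
        for match in TOY_RE.finditer(string.strip()):
            token_type, token = match.lastgroup, match.group()
            if token_type == "mismatch":
                raise ValueError(f"unknown token {token!r} at index {match.start()}")
            line.append((token_type, token))
        # Every passed string is one 'line'; separate lines with newline_tup
        if i > 0 and newline_tup is not None:
            sentence.append(newline_tup)
        sentence.extend(line)
    return sentence


tokens = toy_tokenize("add rax, 0x08", "push rbx")
```

All lines end up in the same returned list, so callers that need per-line structure have to rely on the inserted newline tuple.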
SPECIAL TOKENS
There are some ‘special tokens’ that are assumed to exist for all ISA’s as they are a part of the tokenization process itself. These tokens will be inserted into the passed tokens parameter at the beginning of the list (IE: they are the first tokens searched for), except for the ‘mismatch’ token which is inserted at the end, and are inserted in the following order:
String literals (Tokens.STRING_LITERAL) - matches strings which can start/end with matching single or double quotes, and can escape inner quotes with \' or \", and can escape the escape character with \\. Any extra escape characters (those not preceding a ‘, “, or \) will be left as-is.
Disassembler information (Tokens.DISASSEMBLER_INFO) - matches disassembler information of the form “<…>”. This info must be within open/close angle brackets. It is also possible to nest angle brackets within the disassembler info up to a maximum current depth of 3. IE: we can match the following:
“<no angle brackets inside>” - depth of 1
“<angle <brackets> depth <2>>” - depth of 2
“<level <3 angle <bracket>> depth>” - depth of 3
We also do not check that every open has a matching close, just that every close has a matching open. So, the following could still be matched:
“<lots of <<<<<<< things>”
However, missing or unmatched ending angle brackets will fail, as well as very deep nesting:
“<” : no matching ‘>’ only for the first occurrence of ‘<’
“<data>>” : no matching ‘<’ for both of the ‘>’ brackets
“<super<deep<nested<…<thing>>…>>” : too large nesting depth
String literals are checked first within the disassembler info so that any end brackets ‘>’ within the strings won’t affect the parsing of the disassembler info.
This limit on nesting depth exists because python’s re engine cannot recursively match nested brackets, and I can’t think of any way to implement it entirely within regular expressions (which is needed in order to continue using the python re tokenization method). I don’t see any reason why a larger depth would be needed: a depth of 3 already handles more than what I would expect as output from disassemblers, and users inserting information themselves could simply place it within the brackets using a different delimiter and parse it themselves by overriding things like token_disassembler_info() and handle_disassembler_info() in the Tokenizer and Normalizer classes respectively. If a larger depth is needed, one can manually alter the _DIS_INFO_MAX_REC_DEPTH variable at the top of this file. This increases the valid nesting depth at the cost of slower regular expression matching for disassembler info.
Instruction start token “#start_instr#” (Tokens.INSTRUCTION_START) - used to determine when instructions start/stop when using an op-level tokenization scheme. When tokenizing, we need to know when a new instruction is started to decide if an immediate value found should be considered an instruction address or just a plain immediate. New instructions occur whenever we reach a newline token, an instruction start token, or the start of a new string passed in the args of the tokenize() method. This instruction start token is removed when found, and won’t appear during normalization.
Split immediate token “#split_imm#” (Tokens.SPLIT_IMMEDIATE) - used to designate a split immediate value. This is useful for reducing the number of unique tokens present while keeping full immediate information. When using split immediates during normalization, immediate values with more digits than some threshold will be split into multiple immediate tokens and placed one after the other, prepended with this “#split_imm#” token. To keep that output renormalizable, the tokenizer, when finding one of these split immediate tokens, will concatenate all of the following immediate tokens until reaching some non-immediate (and non-spacing) token to rebuild the original immediate token. This split immediate token is removed when found, and won’t appear during normalization.
Plus sign (Tokens.PLUS_SIGN) - ‘+’
Times sign (Tokens.TIMES_SIGN) - ‘*’
Open bracket (Tokens.OPEN_BRACKET) - ‘[’
Close bracket (Tokens.CLOSE_BRACKET) - ‘]’
Colon (Tokens.COLON) - ‘:’
Spacing (Tokens.SPACING) - One or more space ‘ ‘, comma ‘,’, or tab ‘\t’ characters in a row
Newline (Tokens.NEWLINE) - Either the newline character ‘\n’ or a pipe character ‘|’
Immediate values (Tokens.IMMEDIATE) - any integer immediate value in hex, decimal, octal, or binary. Hex values must start with ‘0x’, octal with ‘0o’, and binary with ‘0b’
Mismatch token (Tokens.MISMATCH) - matches any character. Inserted at the very end of tokens and is used to designate the start of an unknown token or character so that it can be handled (by default, an error is raised)
If you wish to keep some of the above tokens, but overwrite others, you can set that token’s regex in the passed tokens parameter, and that will overwrite these special tokens. You may also set it to None to not insert it at all.
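The depth limit on disassembler info follows from how a bounded-depth pattern has to be built without recursion: each extra level wraps the previous pattern once more. The sketch below illustrates that idea only. The function name nested_angle_regex is hypothetical, it is not how bincfg builds its pattern, and it is stricter than the behaviour described above (it rejects unmatched ‘<’ characters and ignores string literals).

```python
import re


def nested_angle_regex(max_depth):
    """Regex source matching '<...>' with brackets nested at most max_depth deep."""
    pattern = r"<[^<>]*>"  # depth 1: no inner brackets at all
    for _ in range(max_depth - 1):
        # Each extra level allows plain characters or one shallower group,
        # so the pattern text grows with every level of depth
        pattern = r"<(?:[^<>]|" + pattern + r")*>"
    return pattern


DEPTH_3 = re.compile(nested_angle_regex(3))
```

Because the pattern is rebuilt around itself for every level, a larger depth gives a longer regular expression, which is the matching cost mentioned above.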
INSTRUCTION ADDRESSES
If match_instruction_address=True when tokenizing, the tokenizer will attempt to match instruction addresses at the beginning of each line. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended to that Tokens.INSTRUCTION_ADDRESS token, removing any Tokens.SPACING inbetween them. For example, using the x86 tokenization scheme:
“0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
“ 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
“0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
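A minimal sketch of the promotion rule, written as a standalone pass over one already-tokenized line. The function promote_instruction_address is hypothetical (bincfg performs this inside its token handlers); the token type strings are the values listed in the Tokens class, and the spacing before the colon is dropped as in the examples above.

```python
def promote_instruction_address(line):
    """line: list of (token_type, token) tuples for one line; returns a new list."""
    out = list(line)
    # Index of the first non-spacing token, if any
    first = next((i for i, (t, _) in enumerate(out) if t != "spacing"), None)
    if first is None or out[first][0] != "immediate":
        return out
    addr = out[first][1]
    # Look past any spacing for a colon directly after the address
    j = first + 1
    while j < len(out) and out[j][0] == "spacing":
        j += 1
    if j < len(out) and out[j][0] == "colon":
        # Merge address and colon into one token, dropping the spacing between
        out[first:j + 1] = [("inst_addr", addr + out[j][1])]
    else:
        out[first] = ("inst_addr", addr)
    return out
```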
- Parameters:
tokens (Optional[List[Tuple[str, str]]]) – the tokens to use. Should be a list of 2-tuples. Each tuple is a pair of (name, regex) where name is the string name of the token, and regex is a regular expression to find that token. These tuples should be ordered in the preferred order to search for tokens. If None, then this will default to self.DEFAULT_TOKENS (which should be set when defining the class)
token_handlers (Optional[Dict[str, Callable[[Dict[str, Any]], Union[None, str]]]]) – optional dictionary mapping token type strings to functions to handle those token types when tokenizing. This is intended to be used when you wish to add entirely new token types not present in bincfg.normalization.base_tokenizer.Tokens. If you wish to change the behavior of handling an already-present token type, just override that token handler function. These will override the default token handlers.
insert_special_tokens (bool) – by default, some special tokens will be inserted at the front of tokens (see the ‘special tokens’ listed above). If you wish to stop this from happening, you can set insert_special_tokens to False
case_sensitive (bool) – If True, then regular expressions will be matched exactly as they appear. If False, then the re.IGNORECASE flag will be passed when compiling the regular expressions
- ARCHITECTURE = None
The architecture this tokenizer works on
- DEFAULT_NEWLINE_TUPLE = ('newline', '\n')
The default (token_type, token) tuple to use for newlines
- handle_line(state)[source]
Handles a single line (one string passed to the tokenizer)
Each line could contain newlines and whatnot, but no newline_tup’s will have been inserted.
Subclasses may override this function for more behavior, but it defaults to just returning the passed line.
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
list of (token_type, token) tuples for this line
- Return type:
List[Tuple[str, str]]
- handle_sentence(state)[source]
Handles an entire sentence (aggregation of all strings passed to one call of this tokenizer)
In between each line, a newline_tup will have already been inserted (if using)
Subclasses may override this function for more behavior, but it defaults to just returning the passed sentence
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
the final list of tokens
- Return type:
List[Tuple[str, str]]
- classmethod load(path)
- save(path)
- token_all_symbols(state)[source]
Handles all symbol tokens (‘+’, ‘*’, ‘[’, ‘]’, ‘:’)
This can be overridden by subclasses for more functionality, but defaults to just returning the original token, except for colons ‘:’, for which we check if the previous non-spacing token was an immediate value. If so, and match_instruction_address is True, then we append any in-between spacing and the colon to that immediate and replace its type with Tokens.INSTRUCTION_ADDRESS.
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_branch_prediction(state)[source]
Handles any branch_prediction tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_disassembler_info(state)[source]
Handles any disassembler information tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_immediate(state)[source]
Handles any immediate tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_instruction_address(state)[source]
Handles any instruction address tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_instruction_prefix(state)[source]
Handles any instruction_prefix tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_memory_size(state)[source]
Handles any memory_size tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_mismatch(state)[source]
What to do when there is a token mismatch in a string
This can be overridden by subclasses for more functionality, but defaults to raising a TokenMismatchError with info on the mismatch
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Raises:
TokenMismatchError – by default
- token_newline(state)[source]
Handles any newline tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_opcode(state)[source]
Handles any opcode tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_register(state)[source]
Handles any register tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_spacing(state)[source]
Handles any spacing tokens
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_string_literal(state)[source]
Handles any string literals
This can be overridden by subclasses for more functionality, but defaults to just returning the original token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Returns:
either a string token for the next token to append to line, or None to not append anything
- Return type:
Union[str, None]
- token_unknown(state)[source]
What to do when there is a token type that we don’t know how to handle
This can be overridden by subclasses for more functionality, but defaults to raising an UnknownTokenError with info on the unknown token
- Parameters:
state (Dict) – dictionary of current state. See BaseTokenizer() for more info
- Raises:
UnknownTokenError – by default
- tokenize(*strings, newline_tup=<object object>, match_instruction_address=True, **kwargs)[source]
Tokenizes the input
Subclasses should override any self.token_* methods they wish to inject behavior into. Each one of those functions takes in a ‘state’ dictionary as input and should return either a new string token or None to use the old token.
See the docs for BaseTokenizer() for more info on how tokenization works, how to create subclasses, etc.
- Parameters:
strings (str) – arbitrary number of strings to tokenize.
newline_tup (Optional[Tuple[str, str]]) – the tuple to insert inbetween each passed string, or None to not insert anything. Defaults to self.__class__.DEFAULT_NEWLINE_TUPLE.
match_instruction_address (bool, optional) –
if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended to that Tokens.INSTRUCTION_ADDRESS token, removing any Tokens.SPACING inbetween them. For example, using the x86 tokenization scheme:
”0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
” 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
”0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
kwargs (Any) – extra kwargs to store in the tokenizer state, for use in child classes
- Returns:
list of (token_type, token) tuples
- Return type:
List[Tuple[str, str]]
- class bincfg.normalization.base_tokenizer.TokenizationLevel(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Different levels to perform tokenization
- AUTO = ['auto', 'automatic', 'default']
- INSTRUCTION = ['inst', 'instruction', 'line', 'instructions', 'lines']
- OPCODE = ['op', 'opcode', 'operand', 'opcodes', 'operands']
- class bincfg.normalization.base_tokenizer.Tokens[source]
Bases:
object
- BRANCH_PREDICTION = 'branch_prediction'
- CLOSE_BRACKET = 'close_bracket'
- COLON = 'colon'
- DISASSEMBLER_INFO = 'disassembler_info'
- IMMEDIATE = 'immediate'
- INSTRUCTION_ADDRESS = 'inst_addr'
- INSTRUCTION_PREFIX = 'prefix'
- INSTRUCTION_START = 'inst_start'
- MEMORY_EXPRESSION = 'memory_expression'
- MEMORY_SIZE = 'memory_size'
- MISMATCH = 'mismatch'
- NEWLINE = 'newline'
- OPCODE = 'opcode'
- OPEN_BRACKET = 'open_bracket'
- PLUS_SIGN = 'plus_sign'
- REGISTER = 'register'
- SEGMENT_ADDRESS = 'segment_address'
- SPACING = 'spacing'
- SPLIT_IMMEDIATE = 'split_imm'
- STRING_LITERAL = 'string_literal'
- TIMES_SIGN = 'times_sign'
- bincfg.normalization.base_tokenizer.get_architecture(arch: str | Architectures) → Architectures[source]
Returns the architecture
- Parameters:
arch (Union[str, Architectures])
- bincfg.normalization.base_tokenizer.parse_tokenization_level(tokenization_level, auto_tl)[source]
Returns the bincfg.TokenizationLevel enum based on the given tokenization_level.
- Parameters:
tokenization_level (Union[bincfg.TokenizationLevel, str]) – either a string tokenization level, or a member of the bincfg.TokenizationLevel enum
auto_tl (bincfg.TokenizationLevel) – the default tokenization level to use if we get an ‘auto’ tokenization level
- Returns:
a member of the bincfg.TokenizationLevel enum
- Return type:
bincfg.TokenizationLevel
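A rough sketch of how alias-based level parsing with an ‘auto’ fallback could work. The alias lists are copied from the TokenizationLevel enum above; the ToyTokenizationLevel class, the toy_parse_tokenization_level function, the case-insensitive matching, and the exception types are assumptions made for the illustration.

```python
from enum import Enum


class ToyTokenizationLevel(Enum):
    # Each member's value is its list of accepted string aliases
    OPCODE = ["op", "opcode", "operand", "opcodes", "operands"]
    INSTRUCTION = ["inst", "instruction", "line", "instructions", "lines"]
    AUTO = ["auto", "automatic", "default"]


def toy_parse_tokenization_level(tokenization_level, auto_tl):
    """Resolve a string or enum member, replacing AUTO with auto_tl."""
    if isinstance(tokenization_level, str):
        name = tokenization_level.lower()  # assumption: aliases ignore case
        for level in ToyTokenizationLevel:
            if name in level.value:
                tokenization_level = level
                break
        else:
            raise ValueError(f"unknown tokenization level: {tokenization_level!r}")
    elif not isinstance(tokenization_level, ToyTokenizationLevel):
        raise TypeError(f"cannot parse level of type {type(tokenization_level).__name__}")
    if tokenization_level is ToyTokenizationLevel.AUTO:
        return auto_tl
    return tokenization_level
```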
bincfg.normalization.multi_normalizer module
Class that can use multiple normalization methods
- class bincfg.normalization.multi_normalizer.MultiNormalizer(*normalizers)[source]
Bases:
object
A normalizer that can work with multiple sub-normalizers based on architecture
This does not inherit from BaseNormalizer, and thus you cannot modify or call most normalizer functions from this normalizer itself. It essentially just acts as a wrapper around multiple different normalizers.
- Parameters:
normalizers (BaseNormalizer) – One or more normalizers to use together. May only use one per architecture.
- normalize(*strings, cfg=None, block=None, newline_tup=<object object>, match_instruction_address=True)[source]
Normalizes the given iterable of strings.
- Parameters:
strings (str) – arbitrary number of strings to normalize
cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
newline_tup (Tuple[str, str], optional) – the tuple to insert in between each passed string, or None to not insert anything. Defaults to self.tokenizer.DEFAULT_NEWLINE_TUPLE
match_instruction_address (bool, optional) –
if True, will match instruction addresses. If there is an immediate value at the start of a line (IE: start of a string in strings, or immediately after a Tokens.NEWLINE or Tokens.INSTRUCTION_START [ignoring any Tokens.SPACING]), then that token will be converted into a Tokens.INSTRUCTION_ADDRESS token. If there is a Tokens.COLON immediately after that token (again, ignoring any Tokens.SPACING), then that first Tokens.COLON match will be appended (along with any inbetween Tokens.SPACING) to that Tokens.INSTRUCTION_ADDRESS token. For example, using the x86 tokenization scheme:
”0x1234: add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234:’), …]
” 0x1234 : add rax rax” -> [(Tokens.SPACING, ‘ ‘), (Tokens.INSTRUCTION_ADDRESS, ‘0x1234 :’), …]
”0x1234 add rax rax” -> [(Tokens.INSTRUCTION_ADDRESS, ‘0x1234’), …]
kwargs (Any) – extra kwargs to pass along to tokenization method, and to store in normalizer state
- Returns:
a list of normalized string instruction lines
- Return type:
List[str]
bincfg.normalization.norm_funcs module
- bincfg.normalization.norm_funcs.ignore(self, state)[source]
Ignores information (if using for rose info, then it will also ignore negatives)
- bincfg.normalization.norm_funcs.replace_function_call_immediate(*args)[source]
Builds a function that replaces function call immediate values with the given replacement string
This will return a function to be called as a part of a normalizer. This only takes one argument: the replacement string. If no arguments are passed, then the replacement string will default to ‘func’
NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result
- Parameters:
args – args for this function. Ideally either empty to use default function call string, or a string to replace all function calls with.
- Returns:
either a function that will handle function calls (if this function was called correctly), or a handled function call
- Return type:
Union[Callable[…, None], None]
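The NOTE above describes a higher-order function that also tolerates being used directly as a handler. A minimal sketch of that pattern follows. The name toy_replace_function_call and the dictionary-based state are hypothetical stand-ins; the real handlers receive the normalizer and its state object.

```python
def toy_replace_function_call(*args):
    """Build a handler, or act as the handler itself if called with (self, state)."""
    # Exactly one argument is taken to be the replacement string
    replacement = args[0] if len(args) == 1 else "func"

    def _repl_func(normalizer, state):
        # Hypothetical state layout: overwrite the current token in place
        state["token"] = replacement

    if len(args) > 1:
        # Called as handler(normalizer, state): do the default replacement now
        return _repl_func(*args)
    return _repl_func
```

With this shape, both `toy_replace_function_call()` and the bare name `toy_replace_function_call` can be registered as a handler, at the cost of a slightly surprising return type.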
- bincfg.normalization.norm_funcs.replace_immediate(*args, include_negative=False)[source]
Builds a function that replaces immediate values with the IMMEDIATE_VALUE_STR.
This will return a function to be called as a part of a normalizer. This function takes no arguments and only 1 keyword argument: whether or not to include a negative sign ‘-’ in front of the immediate string when the input is negative.
NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result
- Parameters:
args – args for this function. Ideally empty
include_negative (bool, optional) – if True, will include a negative sign in front of the returned immediate string when the input is negative. Defaults to False.
- Returns:
either a function that will handle immediate strings (if this function was called correctly), or a handled immediate string
- Return type:
Union[Callable[…, str], str]
- bincfg.normalization.norm_funcs.replace_jump_destination(self, state)[source]
Replaces the jump destination immediate with ‘jmpdst’ iff the jump destination is an immediate value, not a segment address
- Parameters:
idx (int) – the index in line of the ‘jump’ opcode
line (List[TokenTuple]) – a list of (token_type, token) tuples. The current assembly line
- Returns:
integer index in line of last handled token
- Return type:
int
- bincfg.normalization.norm_funcs.replace_memory_expression(*args)[source]
Builds a function that replaces memory expressions with the given replacement string
This will return a function to be called as a part of a normalizer. This only takes one argument: the replacement string. If no arguments are passed, then the replacement string will default to ‘memexpr’
NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result
- Parameters:
args – args for this function. Ideally either empty to use default memory expression string, or a string to replace all memory expressions with.
- Returns:
either a function that will handle memory expressions (if this function was called correctly), or a handled memory expression
- Return type:
Union[Callable[…, None], None]
- bincfg.normalization.norm_funcs.replace_string_literal(*args, replace_previous_immediate=False)[source]
Builds a function that replaces string literal values with the string ‘str’
This will return a function to be called as a part of a normalizer. This function takes no arguments and only 1 keyword argument: whether to replace the previous immediate, or keep it and add in a ‘str’ string
NOTE: This is meant to be a higher-order function. But, just in case the user forgets that (or is too lazy to add in two extra characters to call this function), if you pass multiple args then it will be assumed this is being called as if it is the _repl_func() function below and will simply return the default result
- Parameters:
args – args for this function. Ideally empty
replace_previous_immediate (bool) – if True, then any previous immediate value will be replaced with the ‘str’ string, otherwise the ‘str’ string will just be added
- Returns:
either a function that will handle immediate strings (if this function was called correctly), or a handled immediate string
- Return type:
Union[Callable[…, str], str]
- bincfg.normalization.norm_funcs.return_dispmem(self, state)[source]
Replaces memory addressing displacement values with the string ‘dispmem’
- bincfg.normalization.norm_funcs.special_function_call(self, state, ret_only_call_type=False)[source]
Handles special function calls
Special external functions have their name kept. Recursive calls are replaced with ‘self’, other internal function calls are replaced with ‘internfunc’, other external function calls are replaced with ‘externfunc’. If a block has multiple function calls out, then it will be replaced with ‘multifunc’.
NOTE: This can all only happen if cfg and block information is passed. If it is not passed, then all function calls will be replaced with ‘func’
- Parameters:
idx (int) – the index in line of the ‘call’ opcode
line (List[TokenTuple]) – a list of (token_type, token) tuples. The current assembly line
special_functions (Set[str]) – a set of string special function names.
cfg (Union[CFG, MemCFG], optional) – either a CFG or MemCFG object that these lines occur in. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
block (Union[CFGBasicBlock, int], optional) – either a CFGBasicBlock or integer block_idx in a MemCFG object. Used for determining function calls to self, internal functions, and external functions. If not passed, then these will not be used. Defaults to None.
ret_only_call_type (bool) – if True, will return only the call type being used as a string. This is only for testing purposes and should likely not be used in normalization as this function already can handle the normalizing. This will return a string if it is not a special function call (for the appropriate function call type), or a tuple with one element for a special function call (the name of the special function).
- Returns:
integer index in line of last handled token
- Return type:
int
- bincfg.normalization.norm_funcs.threshold_immediate(threshold=5000, include_negative=False, imm_str='#immval#')[source]
Builds a function that replaces immediate values with immval iff abs(immediate) > some threshold
- Parameters:
threshold (int) – the threshold to use. Any immediates whose absolute values are larger than this threshold will be replaced with the imm_str
include_negative (bool) – if True, then any immediate that are too large and get replaced will have a negative sign added to the front of the replacement string if the immediates were negative
imm_str (str) – the string to replace immediate values with
- Returns:
either a function that will handle thresholded immediate strings (if this function was called correctly), or a handled thresholded immediate string
- Return type:
Union[Callable[…, str], str]
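The thresholding rule can be sketched as a closure over the three parameters. This toy version works on plain integers rather than normalizer state, and it assumes that values at or below the threshold are returned unchanged as decimal strings; toy_threshold_immediate is a hypothetical name.

```python
def toy_threshold_immediate(threshold=5000, include_negative=False, imm_str="#immval#"):
    """Return a function mapping an integer immediate to its normalized string."""

    def _threshold(value):
        if abs(value) <= threshold:
            # Small immediates are kept as-is (assumption for this sketch)
            return str(value)
        # Large immediates collapse to imm_str, optionally keeping the sign
        sign = "-" if include_negative and value < 0 else ""
        return sign + imm_str

    return _threshold
```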
bincfg.normalization.norm_utils module
An assortment of helper/utility functions for tokenization/normalization.
- bincfg.normalization.norm_utils.get_normalizer(normalizer)[source]
Returns the normalizer being used.
- Parameters:
normalizer (Union[str, Normalizer, type]) – either a Normalizer object (IE: has a callable ‘normalize’ function), or a string name of a built-in normalizer to use, or a type of a normalizer to instantiate with no args/kwargs passed. Accepted strings include: ‘innereye’, ‘deepbindiff’, ‘safe’, ‘deepsemantic’, ‘unnormalized’, ‘compressed_stats’, ‘hpc_data’
- Raises:
ValueError – for unknown string name of normalizer
TypeError – if normalizer was not a string or Normalizer object
- Returns:
a Normalizer object
- Return type:
Normalizer
- bincfg.normalization.norm_utils.imm_to_int(token, on_err=<object object>)[source]
Convert the given value to integer
If token is an integer, returns token. Otherwise, converts a string token to an integer, accounting for hexadecimal, decimal, octal, and binary values
- Parameters:
token (Union[str, int]) – the immediate token to convert to integer
on_err (Optional[Any]) – if passed, then this value will be returned if there is an error while trying to parse the immediate value. Otherwise the error will just be raised like normal
- Returns:
integer value of given token
- Return type:
int
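A small sketch of this kind of conversion, using the ‘0x’, ‘0o’, and ‘0b’ prefixes named in the special tokens list and a sentinel default so that None remains a usable on_err value. The function toy_imm_to_int and its handling of a leading minus sign are assumptions and may differ from the library's behaviour.

```python
_RAISE = object()  # sentinel: lets callers pass on_err=None explicitly


def toy_imm_to_int(token, on_err=_RAISE):
    """Convert an int or an immediate string (hex/octal/binary/decimal) to int."""
    if isinstance(token, int):
        return token
    try:
        text = token.strip().lower()
        sign = 1
        if text.startswith("-"):
            sign, text = -1, text[1:]
        for prefix, base in (("0x", 16), ("0o", 8), ("0b", 2)):
            if text.startswith(prefix):
                return sign * int(text[2:], base)
        # No prefix: treat as decimal, so leading zeros such as '08' are fine
        return sign * int(text, 10)
    except (ValueError, AttributeError):
        if on_err is _RAISE:
            raise
        return on_err
```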
- bincfg.normalization.norm_utils.parse_disinfo_json(string)[source]
Attempts to parse a JSON object inside of disassembler info tokens
Assumes the DISINFO_START and DISINFO_END have already been removed from the string.
- Parameters:
string (str) – the string to attempt to parse into json
- Returns:
returns the resulting JSON object, or None if the string could not be parsed as JSON
- Return type:
Union[None, JSONObject]
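The "return None instead of raising" contract can be sketched with the standard json module. This is an assumption about the approach, not the library's implementation; parse_json_or_none and the sample payload are hypothetical, and the string is assumed to already have its start/end markers removed.

```python
import json


def parse_json_or_none(string):
    """Return the parsed JSON value, or None if the string is not valid JSON."""
    try:
        return json.loads(string)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON; TypeError: input was not str/bytes
        return None


print(parse_json_or_none('{"key": [1, 2, 3]}'))  # {'key': [1, 2, 3]}
print(parse_json_or_none("{not json"))           # None
```

One caveat of this contract: a string containing the JSON literal null also parses to None, so callers cannot tell it apart from a parse failure.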
- bincfg.normalization.norm_utils.scan_for_token(token_list, type=None, token=None, stop_on_type=None, stop_on_token=None, ignore_type=None, ignore_token=None, stop_unmatched=False, match_re=False, ignore_re_case=True, start=0, increment=1, wrap=True, max_matches=1, ret_list=False, ret='index', on_no_match=None)[source]
Scans the given token list looking for a specific token(s) or token type(s)
Will return None if no match is found.
Detects tokens in the order:
‘ignore’ tokens
‘stop’ tokens
accepted tokens (from type or token parameters)
So, if one passes multiple parameters that conflict with one another, the above ordering is what takes precedence.
- Parameters:
token_list (List[Tuple[str, str, ...]]) – the list of tokens. Each element should be a tuple of (token_type, token, …). The first element is the type of the token, the second is the string token, and anything else is ignored. This means this function can work with both the 2-tuple token lists from Tokenizer() objects and the 3-tuple token lists from Normalizer() objects.
type (Optional[Union[str, Iterable[str]]]) – the type or types of tokens to return. Can be a string to only return one type of token, or an iterable of strings to return the first token found that has any of those types. If token is not None, then the returned token must also match that argument. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
token (Optional[Union[str, Iterable[str]]]) – the token to return. Can be a string to only return one matching token, or an iterable of strings to return the first token found that matches any of those tokens. If type is not None, then the returned token must also match that type. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
stop_on_type (Optional[Union[str, Iterable[str]]]) – if a token of this type is found, then we immediately stop searching and return whatever we currently have. Can be a string to only stop at one type of token, or an iterable of strings to stop at the first token found that has any of those types. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
stop_on_token (Optional[Union[str, Iterable[str]]]) – if this token is found, then we immediately stop searching and return whatever we currently have. Can be a string to only stop at one token, or an iterable of strings to stop at the first token found that matches any of these. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
ignore_type (Optional[Union[str, Iterable[str]]]) – ignores token types. Can be a string to only ignore one token type, or an iterable of strings to ignore any token types that match any of these. These tokens will not be added to return lists or considered tokens to keep. Since these are checked before ‘stop’ token types, this will override the stopping on any tokens also matched with stop_on_type. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
ignore_token (Optional[Union[str, Iterable[str]]]) – ignores tokens. Can be a string to only ignore one token, or an iterable of strings to ignore any tokens that match any of these. These tokens will not be added to return lists or considered tokens to keep. Since these are checked before ‘stop’ token types, this will override the stopping on any tokens also matched with stop_on_token. NOTE: you can match “not X” by using python re’s negative lookahead: r’(?![X]).*’, where ‘[X]’ is the thing to not match
stop_unmatched (bool) – if True, will stop on the first unmatched token. IE: a token that was not ignored, was not already stopped on, and was not considered a token to keep
match_re (bool) – if True, will assume any match values in type or token are to be considered regular expressions to fullmatch()
ignore_re_case (bool) – if True, will pass re.IGNORECASE as a flag when making the regular expressions
start (int) – the index to start at within token_list
increment (int) – the increment to use when searching for tokens. Set to a negative number to move backwards through the list. NOTE: if returning multiple values, they will be returned in the order they appear in the input list, regardless of the increment value
wrap (bool) – if True, then the initial start index will be wrapped to the length of the token_list. If False, then an initial start index that is out of bounds of the token_list will immediately stop.
max_matches (Union[int, None]) – the number of matches to find. If 1, then values will be returned as normal. If >1, then this will search through the list finding up to max_matches matching tokens and return their ret values as a list in the order that they were found. If None, then all matches found will be returned. NOTE: if max_matches != 1, then the return value will always either be None if no matches were found, or a list (even if only one match was found)
ret_list (bool) – if True, will always return a list, even if only a single return value was present
ret (Union[str, Iterable[str]]) –
what value(s) to return. Can be a single string to return a single value, or an iterable of strings to return multiple values as a tuple in the order they were passed. Valid strings:
’index’: return the index in token_list of the matched token
’type’: return the token type of the matched token
’token’: return the string token that was matched
’all’: return all of the above. If in a passed list, ignores all other values in the list. Will return values in the order above.
on_no_match (Optional[Any]) – value to return if there were no matches found. Defaults to None
- Returns:
- None if no match is found, or one of the return types designated by ret argument,
or a tuple of multiple return values if user passed multiple values in ret, or a list of one of the previous if collecting matches for multiple tokens. NOTE: if returning multiple values, they will be returned in the order they appear in the input list, regardless of the increment value
- Return type:
Union[None, int, str, Tuple, List]
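The ignore, then stop, then accept ordering and the negative-lookahead trick can be illustrated with a heavily simplified, standalone scanner. This is a sketch under stated assumptions, not the library's scan_for_token: first_match is hypothetical, each argument is a single regular expression matched against the token type only, and the token type names ('opcode', 'spacing', 'register', 'immediate') are made up for the example.

```python
import re


def first_match(token_list, want_type=None, stop_on_type=None, ignore_type=None):
    """Return the index of the first token whose type fullmatches want_type, else None."""

    def hits(pattern, value):
        return pattern is not None and re.fullmatch(pattern, value, re.IGNORECASE) is not None

    for index, (token_type, *_rest) in enumerate(token_list):
        if hits(ignore_type, token_type):    # 1. 'ignore' is checked first
            continue
        if hits(stop_on_type, token_type):   # 2. then 'stop'
            return None
        if hits(want_type, token_type):      # 3. then accepted tokens
            return index
    return None


tokens = [("opcode", "add"), ("spacing", " "), ("register", "rsp"), ("immediate", "8")]

print(first_match(tokens, want_type="immediate"))                            # 3
print(first_match(tokens, want_type="immediate", stop_on_type="register"))   # None
# Ignoring 'register' overrides stopping on it, because ignore is checked first
print(first_match(tokens, want_type="immediate", stop_on_type="register",
                  ignore_type="register"))                                   # 3
# Negative lookahead: match any type that does not start with 'opcode'
print(first_match(tokens, want_type=r"(?!opcode).*"))                        # 1
```

Note that with fullmatch, a pattern like r'(?!opcode).*' rejects every string that begins with 'opcode', not just the exact string 'opcode'.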
bincfg.normalization.normalize module
Provides function(s) to perform normalization techniques on CFG’s
- bincfg.normalization.normalize.normalize_cfg_data(cfg_data: CFGInputDataType | bincfg.CFG | bincfg.MemCFG | bincfg.CFGDataset | bincfg.MemCFGDataset | Iterable, normalizer: str | NormalizerType, inplace: bool = False, using_tokens: dict[str, int] | AtomicTokenDict | None = None, force_renormalize: bool = False, convert_to_mem: bool = False, conv_keep_mem_addrs: bool = True, unpack_cfgs: bool = False, progress: bool = False) bincfg.CFG | bincfg.MemCFG | bincfg.CFGDataset | bincfg.MemCFGDataset | list | tuple[source]
Normalizes some cfg data.
- Parameters:
cfg_data (Union[CFGInputDataType, CFG, MemCFG, CFGDataset, MemCFGDataset, Iterable]) – some cfg data. Can be either: str, CFG, MemCFG, CFGDataset, MemCFGDataset, or iterable of previously mentioned types. Will return the same type as that passed, unless that particular input was a string, in which case a CFG will be returned.
normalizer (Union[str, Normalizer]) – the normalizer to use. Can be either a Normalizer class with a .normalize() method, or a string to use a built-in normalizer. See bincfg.normalization.get_normalizer() for acceptable strings.
inplace (bool) – if True, will modify data in-place instead of creating new objects. Defaults to False. NOTE: if inplace=False, and the incoming data has already been normalized with the passed normalizer, then the original cfg will be returned, NOT a copy.
using_tokens (Optional[Union[dict[str, int], AtomicTokenDict]]) – only used for MemCFG’s. If not None, then a dictionary mapping string tokens to integer token values that will be used as any MemCFG’s tokens. Defaults to None.
force_renormalize (bool) – by default, this method will only normalize cfg’s whose .normalizer != to the passed normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.
convert_to_mem (bool) – if True, will convert all CFG’s and CFGDatasets to their memory-efficient versions after normalizing. Defaults to False.
conv_keep_mem_addrs (bool) – if True, will pass keep_memory_addresses=True when converting CFG’s into MemCFG’s
unpack_cfgs (bool) – by default, this method will return the same types that were passed to be normalized. However if unpack_cfgs=True, then instead, a list of all cfgs unpacked (EG: unpacked from lists, and pulled out of datasets) will be returned. Defaults to False. NOTE: if only a single CFG/MemCFG was passed, a list will still be returned of only that single element.
progress (bool) – if True, will show a progressbar for normalizations of multiple cfg’s. Defaults to False.
- Returns:
the normalized data
- Return type:
Union[CFG, MemCFG, CFGDataset, MemCFGDataset, List, Tuple]