bincfg.cfg package
Submodules
bincfg.cfg.cfg module
- class bincfg.cfg.cfg.CFG(data: CFGInputDataType = None, normalizer: str | NormalizerType | None = None, metadata: dict | None = None, using_tokens: TokenDictType | None = None)[source]
Bases: object
A Control Flow Graph (CFG) representation of a binary
- Parameters:
data (Optional[Union[str, TextIO, Sequence[str], SmdaReport]]) –
the data to use to make this CFG. Data type will be inferred based on the data passed:
string: either a string with newline characters, which will be split on all newlines and treated as a known disassembler format, or a string with no newline characters, which will be treated as a filename.
Sequence of string: will be treated as already-read-in disassembler file split on newlines
open file object: will be read in using .readlines, then treated as disassembler input
SmdaReport: output from smda disassembly
normalizer (Optional[Union[str, NormalizerType]]) – the normalizer to use to force-renormalize the incoming CFG, or None to not normalize
metadata (Optional[dict]) –
a dictionary of metadata to add to this CFG
NOTE: passed dictionary will be shallow copied
using_tokens (Optional[Union[dict[str, int], AtomicTokenDict]]) – optional token dictionary to use when initializing and normalizing. Only used if normalizer is not None
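The inference rules for data described above can be sketched roughly as follows. This is only an illustration of the documented behavior, not bincfg's actual code: classify_cfg_input is a hypothetical helper, and the labels it returns are made up for the example.

```python
import io
from collections.abc import Sequence


def classify_cfg_input(data):
    """Hypothetical helper mirroring the documented inference rules for `data`."""
    if data is None:
        return "empty"
    if isinstance(data, str):
        # Strings with newlines are disassembler text; otherwise a filename
        return "disassembler_text" if "\n" in data else "filename"
    if hasattr(data, "readlines"):
        # Open file objects are read in with .readlines()
        return "open_file"
    if isinstance(data, Sequence) and all(isinstance(x, str) for x in data):
        # Already-read-in disassembler output, one string per line
        return "lines"
    # Anything else (e.g. an SmdaReport) would need a dedicated code path
    return "other"


print(classify_cfg_input(io.StringIO("line one\nline two\n")))
```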
- add_function(*functions: CFGFunction, override: bool = False) None[source]
Adds the given function(s) to this cfg. This should only be done once the given function(s) have been fully initialized
This will do some housekeeping things such as:
setting the parent_cfg and parent_function attributes of functions and blocks respectively
adding missing edges to their associated edges_out and edges_in
converting edges from (None/address, None/address, edge_type) tuples into CFGEdge() objects
adding from_block and to_block in new edges if missing
functions with no address will have their address be that of the smallest addressed block in their blocks, if present
- Parameters:
functions (CFGFunction) – arbitrary number of CFGFunction objects to add
override (bool) – if False, an error will be raised if a function or basic block contains an address that already exists in this CFG. If True, then that error will not be raised and those functions/basic blocks will be overridden (this behavior is unsupported). Defaults to False.
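Two of the housekeeping steps (defaulting a missing function address to its smallest block address, and setting parent links) can be sketched as below. Block, Function, and this standalone add_function are hypothetical stand-ins written for the example, and the -1 "no address" convention is taken from the CFGFunction documentation; the real method does more than this.

```python
class Block:
    """Hypothetical stand-in for CFGBasicBlock."""

    def __init__(self, address):
        self.address = address
        self.parent_function = None


class Function:
    """Hypothetical stand-in for CFGFunction (-1 means no address yet)."""

    def __init__(self, blocks, address=-1):
        self.blocks = list(blocks)
        self.address = address


def add_function(functions_dict, func, override=False):
    # A function with no address takes the smallest block address, if any
    if func.address == -1 and func.blocks:
        func.address = min(block.address for block in func.blocks)
    # Refuse duplicate addresses unless the caller explicitly overrides
    if func.address in functions_dict and not override:
        raise ValueError("function address %#x already exists" % func.address)
    # Point every block back at its parent function
    for block in func.blocks:
        block.parent_function = func
    functions_dict[func.address] = func
```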
- property architecture: Architectures
Returns the architecture being used. Currently a WIP
Checks for an ‘arch’ or ‘architecture’ key in the metadata and returns it if it is known. Can currently return: ‘java’, ‘x86’
- property asm_counts: Mapping[str, int]
A collections.Counter() of all unique assembly lines and their counts in this cfg
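As a rough illustration of what such a counter holds, the sketch below aggregates lines over plain lists of strings. The sample blocks are invented for the example; in bincfg the lines would come from the CFG's basic blocks.

```python
from collections import Counter

# Hypothetical blocks, each a list of assembly line strings
blocks = [
    ["push rbp", "mov rbp, rsp"],
    ["mov rbp, rsp", "ret"],
]

# Count every assembly line across all blocks
asm_counts = Counter(line for block in blocks for line in block)
print(asm_counts.most_common(1))
```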
- property blocks: list[CFGBasicBlock]
A list of basic blocks in this CFG (in order of memory address)
- blocks_dict: dict[int, CFGBasicBlock]
Dictionary mapping integer basic block addresses to their CFGBasicBlock objects
- classmethod from_networkx(graph: networkx.MultiDiGraph, cfg: CFG | None = None) CFG[source]
Converts a networkx graph to a CFG
Expects the graph to have the exact same structure as is shown in CFG().to_networkx()
- Parameters:
graph (networkx.MultiDiGraph) – the networkx graph
cfg (Optional[CFG]) – can be None to create/return a new CFG object, or an already created and empty CFG() object to put data into that one
- property functions: list[CFGFunction]
A list of functions in this CFG (in order of memory address)
- functions_dict: dict[int, CFGFunction]
Dictionary mapping integer function addresses to their CFGFunction objects
- get_block(address: int | str | Addressable, raise_err: bool = True) CFGBasicBlock | None[source]
Returns the basic block in this CFG with the given address
- Parameters:
address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)
raise_err (bool) – if True, will raise an error if the basic block with the given memory address was not found, otherwise will return None
- Raises:
ValueError – if the basic block with the given address could not be found
- Returns:
the basic block with the given address
- Return type:
Union[CFGBasicBlock, None]
- get_block_containing_address(address: int | str | Addressable, raise_err: bool = True) CFGBasicBlock | None[source]
Returns the basic block in this CFG that contains the given address at the start of one of its instructions
This will lazily compute an instruction lookup dictionary mapping addresses to the blocks that contain them
NOTE: this will only return a block if the address is either equal to the block’s address, or if it is exactly equal to one of the addresses for an assembly instruction in a block’s .asm_memory_addresses list
- Parameters:
address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)
raise_err (bool) – if True, will raise an error if no basic block containing the given memory address was found, otherwise will return None
- Raises:
ValueError – if the basic block containing the given address could not be found
- Returns:
the basic block that contains the given address
- Return type:
Union[CFGBasicBlock, None]
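The lazy lookup described for get_block_containing_address can be sketched as below. BlockIndex is a hypothetical class written for this example; it stores block addresses rather than block objects, and only shows the build-on-first-use idea together with the raise_err behavior.

```python
class BlockIndex:
    """Hypothetical stand-in showing a lazily built address -> block lookup."""

    def __init__(self, blocks):
        # blocks: dict mapping block address -> list of instruction addresses
        self.blocks = blocks
        self._lookup = None  # built on first use

    def get_block_containing_address(self, address, raise_err=True):
        if self._lookup is None:
            self._lookup = {}
            for block_addr, insn_addrs in self.blocks.items():
                # The block's own address always maps to the block
                self._lookup[block_addr] = block_addr
                for insn_addr in insn_addrs:
                    self._lookup[insn_addr] = block_addr
        if address in self._lookup:
            return self._lookup[address]
        if raise_err:
            raise ValueError("no block contains address %#x" % address)
        return None
```

Note that an address falling in the middle of an instruction is not found, which matches the NOTE above about exact instruction start addresses.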
- get_cfg_build_code() str[source]
Returns python code that will build the given cfg. Used for testing.
This will return the plain code itself to build, with no initial tabs.
- Parameters:
cfg (CFG) – the cfg
- Returns:
string of python code to build the cfg
- Return type:
str
- get_function(address: int | str | Addressable, raise_err: bool = True) CFGFunction | None[source]
Returns the function in this CFG with the given address
- Parameters:
address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)
raise_err (bool) – if True, will raise an error if the function with the given memory address was not found, otherwise will return None
- Raises:
ValueError – if the function with the given address could not be found
- Returns:
the function with the given address, or None if that function does not exist
- Return type:
Union[CFGFunction, None]
- get_function_by_name(name: str, raise_err: bool = True) CFGFunction | None[source]
Returns the function in this CFG with the given name
NOTE: if the name of the function is None, then the expected string name to this method would be: “__UNNAMED_FUNC_%d” % func.address
- Parameters:
name (str) – the name of the function to get
raise_err (bool) – if True, will raise an error if the function with the given name was not found, otherwise will return None
- Raises:
ValueError – if the function with the given name could not be found
- Returns:
the function with the given name, or None if that function does not exist
- Return type:
Union[CFGFunction, None]
- insert_library(cfg: CFG, function_mapping: dict[str, int], offset: int | None = None)[source]
WIP. Inserts the cfg of a shared library into this cfg
This will modify the memory addresses of cfg (adding an appropriate offset), then add all of the functions and basic blocks from cfg into this cfg. Finally, external functions in this cfg that have implemented functions in the function_mapping will have normal edges added.
NOTE: this assumes that no other libraries will be added later that depend on this one that is currently being added (otherwise, the external function edges might not be added properly). Make sure you add them in the correct order!
- Parameters:
cfg (CFG) – the cfg of the library to insert. It will be copied
function_mapping (Dict[str, int]) – dictionary mapping known exported function names to their addresses within cfg. While we can sometimes determine these mappings from function names in the new cfg, that is not always the case (EG: stripping function names from binaries, or compilers/linkers emitting aliases for the functions in cfg), hence why this parameter exists. If you don’t wish to add in new normal edges, or if you wish to add them in manually, you can pass an empty dictionary
offset (Optional[int]) – if None, then the library will be inserted in the first available memory location. Otherwise this can be an integer memory address to insert the cfg at (this will raise an error if it can’t fit there)
- metadata: dict
Dictionary of metadata associated with this CFG
- normalize(normalizer: str | NormalizerType, using_tokens: dict[str, int] | AtomicTokenDict | None = None, inplace: bool = True, force_renormalize: bool = False) CFG[source]
Normalizes this cfg.
- Parameters:
normalizer (Union[str, NormalizerType]) – the normalizer to use. Can be a Normalizer object, or a string of a built-in normalizer to use
using_tokens (Optional[TokenDictType]) – token dictionary to use when normalizing, or None to normalize from scratch
inplace (bool) – whether or not to normalize inplace
force_renormalize (bool) – by default, this method will normalize this cfg only if the passed normalizer is != self.normalizer. However if force_renormalize=True, then this cfg will be renormalized even if it has been previously normalized with the same normalizer
- Returns:
this CFG, normalized
- Return type:
CFG
- normalizer: NormalizerType | None
The normalizer used to normalize assembly lines in this CFG, or None if they have not been normalized
- property num_asm_lines: int
The number of asm lines across all blocks in this cfg
- property num_blocks: int
The number of basic blocks in this cfg
- property num_edges: int
The number of edges in this cfg
- property num_functions: int
The number of functions in this cfg
- set_tokens(tokens: dict[str, int] | AtomicTokenDict) CFG[source]
Sets this CFG’s tokens to the given tokens, and returns self
- to_adjacency_matrix(type: str = 'np', sparse: bool = False) np.ndarray | torch.Tensor[source]
Returns an adjacency matrix representation of this cfg’s graph connections
Currently is slow because I just convert to a MemCFG, then call that object’s to_adjacency_matrix(). I should probably speed this up at some point…
Connections will be directed and have values:
0: No edge
1: Normal edge
2: Function call edge
See MemCFG.to_adjacency_matrix() for more details
- Parameters:
type (str, optional) –
the type of matrix to return. Defaults to ‘np’. Can be:
’np’/’numpy’ for a numpy ndarray (dtype: np.int32)
’torch’/’pytorch’ for a pytorch tensor (type: LongTensor)
sparse (bool, optional) –
whether or not the return value should be a sparse matrix. Defaults to False. Has different behaviors based on type:
- numpy array: returns a 2-tuple of sparse COO representation (indices, values).
NOTE: if you want sparse CSR format, you already have it with self.graph_c and self.graph_r
- pytorch tensor: returns a pytorch sparse COO tensor.
NOTE: not using sparse CSR format for now since it seems to have less documentation and support.
- Returns:
an adjacency matrix representation of this CFG
- Return type:
Union[np.ndarray, torch.Tensor]
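The edge value encoding and the dense versus COO layouts can be illustrated with plain Python lists. This is a sketch only: the real method returns numpy arrays or pytorch tensors, the to_adjacency helper and its (from, to, value) input format are hypothetical, and the example assumes at most one edge per ordered pair of blocks since the text does not say how overlapping edges are encoded.

```python
NO_EDGE, NORMAL, FUNCTION_CALL = 0, 1, 2


def to_adjacency(num_blocks, edges, sparse=False):
    """edges: iterable of (from_index, to_index, edge_value) triples."""
    if sparse:
        # COO form: a pair (indices, values), with indices as [rows, cols]
        ordered = sorted(edges)
        rows = [r for r, _, _ in ordered]
        cols = [c for _, c, _ in ordered]
        vals = [v for _, _, v in ordered]
        return ([rows, cols], vals)
    # Dense form: a square matrix initialized to "no edge"
    matrix = [[NO_EDGE] * num_blocks for _ in range(num_blocks)]
    for r, c, v in edges:
        matrix[r][c] = v
    return matrix


example_edges = [(0, 1, NORMAL), (1, 2, FUNCTION_CALL), (2, 0, NORMAL)]
dense = to_adjacency(3, example_edges)
indices, values = to_adjacency(3, example_edges, sparse=True)
```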
- to_networkx() networkx.MultiDiGraph[source]
Converts this CFG to a networkx MultiDiGraph() object
Requires that networkx be installed.
Creates a new MultiDiGraph() and adds as attributes to that graph:
‘normalizer’: string name of normalizer, or None if it had none
‘metadata’: a dictionary of metadata
‘functions’: a dictionary mapping integer function addresses to named tuples containing its data with the structure (‘name’: Union[str, None], ‘is_extern_function’: bool, ‘blocks’: Tuple[int, …], ‘metadata’: dict).
The ‘name’ element (first element) is a string name of the function, or None if it doesn’t have a name
The ‘is_extern_function’ element (second element) is True if this function is an extern function, False otherwise. An extern function is one that is located in an external library intended to be found at runtime, and that doesn’t have its code here in the CFG, only a small function meant to jump to the external function when loaded at runtime
The ‘blocks’ element (third element) is an arbitrary-length tuple of integers, each integer being the memory address (equivalently, the block_id) of a basic block that is a part of that function. Each basic block is only part of a single function, and each function should have at least one basic block
The ‘metadata’ element (fourth element) is a dictionary of metadata associated with that function. May be empty.
NOTE: we use a multidigraph because edges are directed (in order of control flow), and it is theoretically possible (and occurs in some data) to have a node that calls another node, then has a normal edge back out to it. This has occurred in some libc setup code
Then, each basic block will be added to the graph as nodes. Their id in the graph will be their integer address. Each block will have the following attributes:
‘asm_lines’ (Tuple[str]): tuple of string assembly lines
‘asm_memory_addresses’ (Tuple[int]): tuple of integer assembly line memory addresses, one for each line in order. If these addresses are not present, this will be an empty tuple
‘metadata’ (dict): dictionary (possibly empty) of metadata associated with this basic block
Finally, all edges will be added (directed based on control flow direction), and with the attributes:
‘edge_type’ (str): the edge type, will be ‘normal’ for normal edges and ‘function_call’ for function call edges
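The layout described above can be pictured with plain Python containers, without networkx. The sketch below is illustrative only: FunctionInfo is a hypothetical name for the named tuple, and the addresses and assembly lines are invented sample data.

```python
from collections import namedtuple

# Hypothetical name for the named tuple; the four fields follow the text above
FunctionInfo = namedtuple(
    "FunctionInfo", ["name", "is_extern_function", "blocks", "metadata"]
)

# Graph-level attributes
graph_attrs = {
    "normalizer": None,
    "metadata": {"architecture": "x86"},
    "functions": {
        0x1000: FunctionInfo("main", False, (0x1000, 0x1010), {}),
        0x2000: FunctionInfo(None, True, (0x2000,), {}),
    },
}

# Nodes keyed by integer block address, each with the three documented attributes
nodes = {
    0x1000: {"asm_lines": ("push rbp", "call 0x2000"),
             "asm_memory_addresses": (0x1000, 0x1001), "metadata": {}},
    0x1010: {"asm_lines": ("ret",), "asm_memory_addresses": (), "metadata": {}},
    0x2000: {"asm_lines": ("jmp [rip+0x10]",),
             "asm_memory_addresses": (0x2000,), "metadata": {}},
}

# Directed edges as (from_address, to_address, attributes)
edges = [
    (0x1000, 0x2000, {"edge_type": "function_call"}),
    (0x1000, 0x1010, {"edge_type": "normal"}),
]

# Each block should belong to exactly one function
owners = {}
for func_addr, info in graph_attrs["functions"].items():
    for block_addr in info.blocks:
        owners.setdefault(block_addr, []).append(func_addr)
```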
- bincfg.cfg.cfg.auto_detect_assembly_language(cfg: CFG) None[source]
Attempts to detect the assembly language used in the given CFG, setting its ‘architecture’ key in the metadata if successful
Will attempt to find known substrings in any block that indicate a specific language. Assumes the full CFG is all the same language
- Parameters:
cfg (CFG) – the cfg to detect language on
bincfg.cfg.cfg_basic_block module
- class bincfg.cfg.cfg_basic_block.CFGBasicBlock(parent_function: CFGFunction | None = None, address: int | str | Addressable | None = None, edges_in: Iterable[CFGEdge] | None = None, edges_out: Iterable[CFGEdge] | None = None, asm_lines: Iterable[str] | None = None, asm_memory_addresses: Iterable[int | str | Addressable] | None = None, metadata: dict | None = None)[source]
Bases: object
A single basic block in a CFG.
Can be initialized empty, or with attributes. Assumes its memory address is always unique within a cfg.
NOTE: these objects should not be pickled/copy.deepcopy()-ed by themselves, only as a part of a cfg
- Parameters:
parent_function (Optional[CFGFunction]) – the CFGFunction this basic block belongs to
address (Optional[Union[int, str, Addressable]]) – the memory address of this CFGBasicBlock. Should be unique to the CFG that contains it. If None, but asm_memory_addresses is passed, this will be set to the first value in asm_memory_addresses
edges_in (Optional[Iterable[CFGEdge]]) – an iterable of incoming CFGEdge objects
edges_out (Optional[Iterable[CFGEdge]]) – an iterable of outgoing CFGEdge objects
asm_lines (Optional[Iterable[str]]) – an iterable of string assembly lines present at this basic block
asm_memory_addresses (Optional[Iterable[Union[str, int, Addressable]]]) – an iterable of string or integer memory addresses, one for each assembly line (will be converted into integer memory addresses). If this was passed, but address was not, then address will be set to the first value in asm_memory_addresses
metadata (Optional[Dict]) – optional dictionary of metadata to associate with this basic block
- address: int
The integer memory address of this basic block. Will be -1 if not set yet
- property asm_counts: Mapping[str, int]
A collections.Counter of all unique assembly lines/tokens and their counts in this basic block
- asm_lines: list[str]
List of string assembly lines at this basic block
- asm_memory_addresses: list[int]
List of integer memory addresses for all assembly lines at this basic block. Will be empty list if not set yet
- calls(address: int | str | Addressable)[source]
Checks if this block calls the given address
IE: checks if this block has an outgoing function_call edge to the given address
- Parameters:
address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)
- Returns:
True if this block calls the given address, False otherwise
- Return type:
bool
- get_sorted_edges(edge_types: str | EdgeType | Iterable[str | EdgeType] | None = None, direction: Literal['out', 'in'] | Iterable[Literal['out', 'in']] | None = None, as_sets: bool = False) Tuple[list[CFGEdge], ...] | Tuple[set[CFGEdge], ...][source]
Returns a tuple of sorted lists of edges (sorted by address of the “other” block) of each type/direction in this block
Will return edge lists ordered first by edge type (their order of appearance in the cfg_edge.EdgeType enum), then by direction (‘in’, then ‘out’). If edge_types is passed, however, only those edge types will be returned, and the edge lists will be returned in the order of the edge types in edge_types, then by direction (‘in’, then ‘out’).
For example, with edge_types=None and direction=None, this would return the 4-tuple of: (normal_edges_in, normal_edges_out, function_call_edges_in, function_call_edges_out) Where each element is a list of CFGEdge objects.
- Parameters:
edge_types (Optional[Union[str, EdgeType, Iterable[Union[str, EdgeType]]]]) – either an edge type or an iterable of edge types. Only edges with one of these types will be returned. If not None, then the edge lists will be returned sorted based on the order of the edge types listed here, then by direction
direction (Optional[Union[Literal["out", "in"], Iterable[Literal["out", "in"]]]) – the direction to get. Can be the strings ‘in’ or ‘out’, or None to get both
as_sets (bool) – if True, then this will return unordered sets of edges instead of sorted lists. This may save a ~tiny~ bit of time in the long run, but will hinder deterministic behavior of this method.
- Returns:
a tuple of lists/sets of CFGEdge’s
- Return type:
Union[Tuple[list[CFGEdge], ...], Tuple[set[CFGEdge], ...]]
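The ordering rules for get_sorted_edges (edge type first, then direction, with each list sorted by the address of the "other" block) can be sketched as follows. This standalone function and its (from_addr, to_addr, EdgeType) triples are hypothetical simplifications; the real method works on CFGEdge objects and also supports as_sets and iterables of directions.

```python
from enum import Enum


class EdgeType(Enum):  # mirrors the documented enum values
    NORMAL = 1
    FUNCTION_CALL = 2


def get_sorted_edges(block_addr, edges, edge_types=None, direction=None):
    """edges: list of (from_addr, to_addr, EdgeType) triples."""
    types = list(EdgeType) if edge_types is None else list(edge_types)
    directions = ("in", "out") if direction is None else (direction,)
    result = []
    for edge_type in types:
        for d in directions:
            if d == "in":
                group = [e for e in edges if e[1] == block_addr and e[2] is edge_type]
                group.sort(key=lambda e: e[0])  # "other" block is the source
            else:
                group = [e for e in edges if e[0] == block_addr and e[2] is edge_type]
                group.sort(key=lambda e: e[1])  # "other" block is the target
            result.append(group)
    return tuple(result)


sample_edges = [
    (5, 9, EdgeType.NORMAL),
    (5, 7, EdgeType.NORMAL),
    (3, 5, EdgeType.NORMAL),
    (5, 20, EdgeType.FUNCTION_CALL),
]
normal_in, normal_out, call_in, call_out = get_sorted_edges(5, sample_edges)
```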
- has_edge(address: int | str | Addressable, edge_types: str | EdgeType | Iterable[str | EdgeType] | None = None, direction: Literal['in', 'out'] | None = None) bool[source]
Checks if this block has an edge from/to the given address
- Parameters:
address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction).
edge_types (Optional[Union[str, EdgeType, Iterable[Union[str, EdgeType]]]]) – either an edge type or an iterable of edge types. Only edges with one of these types will be considered. If None, then all edge types will be considered
direction (Optional[Literal['in', 'out']]) – the direction to check (strings ‘in’ or ‘out’), or None to check both
- Returns:
True if this block has an edge from/to the given address, False otherwise
- Return type:
bool
- property is_function_call: bool
True if this block is a function call, False otherwise
Checks if this block has one or more outgoing function call edges
- property is_function_entry: bool
True if this block is a function entry block, False otherwise
Specifically, returns True if this block’s address matches its parent function’s address. If this block has no parent, False is returned.
- property is_function_jump: bool
True if this block is a function jump, False otherwise
Checks if this block has a ‘jump’ instruction to a basic block in a different function. Specifically, checks if this block has an outgoing EdgeType.NORMAL edge to a basic block whose parent_function has an address different from this basic block’s parent_function’s address.
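A minimal sketch of the is_function_jump check, using plain tuples and a dictionary in place of CFGBasicBlock and CFGEdge objects. The is_function_jump function and the parent_of mapping below are hypothetical and exist only for this illustration.

```python
def is_function_jump(block, edges, parent_of):
    """block: block address; edges: (from, to, type) triples;
    parent_of: maps a block address to its parent function's address."""
    return any(
        src == block and kind == "normal" and parent_of[dst] != parent_of[block]
        for src, dst, kind in edges
    )


# Two blocks in the function at 0x10, one block in the function at 0x40
parent_of = {0x10: 0x10, 0x18: 0x10, 0x40: 0x40}
jump_edges = [
    (0x10, 0x18, "normal"),         # stays inside the same function
    (0x10, 0x40, "function_call"),  # a call, not a jump
    (0x18, 0x40, "normal"),         # normal edge into another function
]
```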
- property is_multi_function_call: bool
True if this block is a multi-function call, False otherwise
IE: this block has two or more function call edges out
- metadata: dict
Dictionary of extra metadata to associate with this basic block
- property num_asm_lines: int
The number of assembly lines/tokens in this basic block
- property num_edges: int
The number of edges out in this basic block
- property num_edges_in: int
The number of incoming edges in this basic block
- property num_edges_out: int
The number of outgoing edges in this basic block
- parent_function: CFGFunction | None
The parent function of this basic block. Will be None if not set yet
bincfg.cfg.cfg_dataset module
- class bincfg.cfg.cfg_dataset.CFGDataset(cfg_data=None, normalizer=None, load_path=None, max_files=None, allow_multiple_norms=False, progress=False, metadata=None, num_workers=1, **add_data_kwargs)[source]
Bases: object
A dataset of CFG’s.
- Parameters:
cfg_data (Optional[Union[CFG, CFGDataset, Iterable]]) – a CFG, CFGDataset or iterable of CFG’s or CFGDataset’s to add to this dataset, or None to initialize this CFGDataset empty
normalizer (Optional[Union[str, Normalizer]]) – if not None, then a normalizer to use. Will normalize all incoming CFG’s if they do not already have the same normalization (will attempt to renormalize incoming CFG’s if they already have a normalization). Can be a Normalizer object or string.
load_path (str) – if not None, loads all files in this directory that end with ‘.txt’ or ‘.dot’. Will raise an error if there are no files. Files that end with ‘.txt’ or ‘.dot’ but cannot be parsed are ignored.
max_files (Optional[int]) – stops after loading this many files. If None, then there is no max
allow_multiple_norms (bool) – by default, CFGDataset will only allow unnormalized cfg’s when normalizer=None (if normalizer is not None, then any normalized cfg added will be renormalized). Setting allow_multiple_norms to True will allow this CFGDataset to store cfg data with any normalization method (assuming normalizer=None)
progress (bool) – if True, will show a progressbar when loading cfg’s from load_path
metadata (Optional[Dict]) – a dictionary of metadata to attach to this CFGDataset
NOTE: passed dictionary will be shallow copied
num_workers (int) – if > 1, then the loading of data using the load_path parameter will be split over this many processes
add_data_kwargs (Any) – extra kwargs to pass to add_data while adding cfgs
- add_data(*cfg_data, inplace=True, force_renormalize=False, progress=False)[source]
Adds data to this dataset
- Parameters:
cfg_data (Union[CFG, CFGDataset, Iterable]) – arbitrary amount of CFG/CFGDataset’s, or iterables of them, to add to this dataset
inplace (bool, optional) – whether or not to normalize the incoming cfg_data inplace. Defaults to True.
force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != to this dataset’s normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.
progress (bool, optional) – if True, will show a progressbar when adding multiple cfgs. Defaults to False.
- Raises:
TypeError – when attempting to add something that is not a CFG, CFGDataset, or iterables of them
ValueError – when attempting to use multiple different normalizers and self.allow_multiple_norms=False
- property asm_counts
A collections.Counter() of all unique assembly lines and their counts across all cfg’s in this dataset
- cfgs = None
The list of all cfgs in this dataset
- metadata = None
A dictionary of metadata associated with this CFGDataset
- normalize(normalizer=None, inplace=True, force_renormalize=False, progress=False)[source]
Normalize this CFGDataset.
- Parameters:
normalizer (Union[str, Normalizer]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.
inplace (bool, optional) – by default, normalizes this dataset inplace (IE: without copying objects). Can set to False to return a copy. Defaults to True.
force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != to the passed normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.
progress (bool, optional) – if True, will show a progressbar while normalizing. Defaults to False.
- Returns:
this dataset normalized
- Return type:
CFGDataset
- normalizer = None
The normalizer used in this dataset, or None if there is no normalizer
- property num_asm_lines
Return total number of assembly lines across all cfg’s
- property num_blocks
Return total number of blocks across all cfg’s
- property num_cfgs
Return the number of cfgs in this dataset
- property num_edges
Return total number of edges across all cfg’s
- property num_functions
Return total number of functions across all cfg’s
bincfg.cfg.cfg_edge module
Classes/Methods involving edges in a CFG object
- class bincfg.cfg.cfg_edge.CFGEdge(from_block: CFGBasicBlock, to_block: CFGBasicBlock, edge_type: EdgeType | str)[source]
Bases: object
A single immutable edge in a CFG object
- Parameters:
from_block (CFGBasicBlock) – ‘from’ CFGBasicBlock object
to_block (CFGBasicBlock) – ‘to’ CFGBasicBlock object
edge_type (Union[EdgeType, str]) – the edge type. Can be either an EdgeType object, or a string. String values include:
‘normal’: an EdgeType.NORMAL edge
‘function_call’: an EdgeType.FUNCTION_CALL edge
- from_block: CFGBasicBlock
The from block of this directed edge
- property is_branch: bool
True if this edge is one of a branching instruction, False otherwise
Specifically, returns True if this edge’s from_block has exactly two outgoing edges, both of which are ‘normal’ edges. Sometimes, it is possible for blocks to have more than two ‘normal’ edges out (IE: jump tables), and those are NOT considered branches and this method would return False
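The is_branch rule can be sketched with plain (from, to, type) triples. The is_branch function below is a hypothetical standalone version written for illustration; the real property inspects the edge's from_block.

```python
def is_branch(edge, all_edges):
    """True if the edge's source block has exactly two outgoing edges,
    both of them 'normal'."""
    outgoing = [e for e in all_edges if e[0] == edge[0]]
    return len(outgoing) == 2 and all(e[2] == "normal" for e in outgoing)


branch_edges = [
    (1, 2, "normal"), (1, 3, "normal"),                    # two-way branch
    (4, 5, "normal"), (4, 6, "normal"), (4, 7, "normal"),  # jump table
    (8, 9, "normal"), (8, 20, "function_call"),            # call plus fallthrough
]
```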
- property is_function_call_edge: bool
True if this is a ‘function_call’ edge type, False otherwise
- property is_normal_edge: bool
True if this is a ‘normal’ edge type, False otherwise
- to_block: CFGBasicBlock
The to block of this directed edge
- class bincfg.cfg.cfg_edge.EdgeType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
Enum for different edge types for CFGBasicBlock objects.
- FUNCTION_CALL = 2
an edge going from a basic block to another basic block in another function (or the same function).
The outgoing edge should always connect to a function entry block (IE: that block’s .is_function_entry would be True).
- NORMAL = 1
a normal edge as a result of some branching/jumping instruction, or plain continuation to a next block
(IE: an edge of control flow that does not involve calling a function)
- bincfg.cfg.cfg_edge.get_edge_type(edge_type: EdgeType | str) EdgeType[source]
Returns the edge type (a member of the EdgeType enum)
- Parameters:
edge_type (Union[EdgeType, str]) – can be either an EdgeType object, or a string. String values include:
‘normal’: an EdgeType.NORMAL edge
‘function_call’: an EdgeType.FUNCTION_CALL edge
- Raises:
ValueError – for an unknown EdgeType string
TypeError – for a bad edge_type type
- Returns:
the given edge_type as a member of the EdgeType enum
- Return type:
EdgeType
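The conversion and error behavior documented for get_edge_type can be sketched as below. This is a self-contained reimplementation for illustration, not the library's source; in particular, the exact-match string lookup is an assumption, since the text does not say whether other spellings are accepted.

```python
from enum import Enum


class EdgeType(Enum):  # values as documented above
    NORMAL = 1
    FUNCTION_CALL = 2


# Assumed string spellings, taken from the parameter description
_EDGE_TYPE_STRINGS = {
    "normal": EdgeType.NORMAL,
    "function_call": EdgeType.FUNCTION_CALL,
}


def get_edge_type(edge_type):
    if isinstance(edge_type, EdgeType):
        return edge_type
    if isinstance(edge_type, str):
        if edge_type not in _EDGE_TYPE_STRINGS:
            raise ValueError("unknown EdgeType string: %r" % edge_type)
        return _EDGE_TYPE_STRINGS[edge_type]
    raise TypeError("edge_type must be an EdgeType or str, got %s"
                    % type(edge_type).__name__)
```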
bincfg.cfg.cfg_function module
- class bincfg.cfg.cfg_function.CFGFunction(parent_cfg: CFG | None = None, address: int | str | Addressable | None = None, name: str | None = None, blocks: Iterable[CFGBasicBlock] | None = None, is_extern_function: bool = False, metadata: dict | None = None)[source]
Bases: object
A single function in a CFG
Can be initialized empty, or by passing kwarg values.
NOTE: these objects should not be pickled/copy.deepcopy()-ed by themselves, only as a part of a cfg
- Parameters:
parent_cfg (Optional[bincfg.CFG]) – the parent CFG object to which this CFGFunction belongs
address (Optional[AddressLike]) – the memory address of this function. If not present, then the address will be set to -1
name (Optional[str]) – the string name of this function. If not present, or if the name passed is the empty string, this function is given a default name ‘__UNNAMMED_FUNC_X’ where ‘X’ is the memory address
blocks (Optional[Iterable[CFGBasicBlock]]) – if None, will be initialized to an empty list, otherwise an iterable of CFGBasicBlock objects that are within this function
is_extern_function (bool) – if True, then this function is an external function (a dynamically loaded function)
metadata (Optional[dict]) – optional dictionary of metadata to associate with this function
- address: int
the integer memory address of this function. Will be -1 if not initialized yet
- property asm_counts: Mapping[str, int]
A collections.Counter of all unique assembly lines and their counts in this function
- blocks: list[CFGBasicBlock]
list of all basic blocks in this function
- property called_by: list[CFGBasicBlock]
A list of CFGBasicBlock’s that call this function
Specifically, the list of all CFGBasicBlock objects in this function’s .parent_cfg CFG object that call this function. If this CFGFunction has no parent, then the empty list will be returned.
NOTE: this is computed dynamically each call (as CFG objects are mutable), so it may be useful to compute it once per function and save it if needed
- property function_entry_block: CFGBasicBlock
The CFGBasicBlock that is the function entry block
Specifically, returns the first CFGBasicBlock found that has the same address as this function (there ~should~ only be one as each basic block ~should~ have a unique memory address)
- property is_extern_function: bool
True if this function is an external function, False otherwise
- property is_intern_function: bool
True if this function is an internal function, False otherwise
- property is_recursive: bool
True if this function calls itself at some point
Specifically, if at least one CFGBasicBlock in this CFGFunction’s .blocks list has an edges_out function call address that is equal to this CFGFunction’s address
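The called_by and is_recursive checks can be sketched over a flat list of call edges. Both functions below are hypothetical standalone versions for illustration: they use (caller_block_address, callee_function_address) pairs instead of CFGBasicBlock and CFGEdge objects.

```python
def called_by(func_addr, call_edges):
    """Addresses of all blocks with a function call edge to func_addr."""
    return sorted({caller for caller, callee in call_edges if callee == func_addr})


def is_recursive(func_addr, func_blocks, call_edges):
    """True if any block inside the function calls the function itself."""
    return any(
        caller in func_blocks and callee == func_addr
        for caller, callee in call_edges
    )


# The function at 0x40 owns blocks 0x40 and 0x44; block 0x44 calls it again
sample_calls = [(0x10, 0x40), (0x44, 0x40), (0x18, 0x80)]
```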
- property is_root_function: bool
True if this function is not called by any other functions, False otherwise
- metadata: dict
Dictionary of metadata associated with this function
- name: str
the string name of this function. Will be given a default name based on its memory address if not present
- property num_asm_lines: int
The total number of assembly lines across all blocks in this function
- property num_blocks: int
The number of basic blocks in this function
- bincfg.cfg.cfg_function.CFGFunctionPickledState
The pickled state of a function
alias of Tuple[int, str, Tuple[CFGBasicBlock, …], bool]
bincfg.cfg.cfg_parsers module
Functions to parse cfg inputs into CFG objects.
- bincfg.cfg.cfg_parsers.get_asm_from_node_label(label)[source]
Converts a node’s label into a list of assembly lines at that basic block.
- Parameters:
label (str) – the unparsed string label
- Returns:
tuple of 2 lists: (asm_lines, asm_memory_addresses)
- Return type:
Tuple[List[str], List[int]]
- bincfg.cfg.cfg_parsers.parse_cfg_data(cfg, data)[source]
Parses the incoming cfg data. Infers type of data
- Parameters:
cfg (CFG) – the cfg to parse into
data (Union[str, Sequence[str], TextIO, pd.DataFrame]) –
the data to parse, can be:
string: either string with newline characters that will be split on all newlines and treated as either a text or graphviz rose input, or a string with no newline characters that will be treated as a filename. Filenames will be opened as ghidra parquet files if they end with either ‘.pq’ or ‘.parquet’, and text/graphviz rose input otherwise
Sequence of string: will be treated as already-read-in text/graphviz rose input
open file object: will be read in using .readlines, then treated as text/graphviz rose input
pandas dataframe: will be parsed as ghidra parquet file
- Raises:
ValueError – bad str filename, or an unknown file start string
TypeError – bad data input type
CFGParseError – if there is an error during CFG parsing (but data type was inferred correctly)
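The filename rule from the data description (parquet extensions go to the ghidra parser, everything else is treated as text/graphviz rose input) can be sketched as below. parser_for_filename is a hypothetical helper and the returned labels are invented for the example; how the library tells text from graphviz input afterwards is not shown here.

```python
def parser_for_filename(filename):
    """Hypothetical helper: pick a parser family from the file extension."""
    if filename.endswith((".pq", ".parquet")):
        return "ghidra_parquet"
    # Everything else is read as text/graphviz rose input
    return "rose_text_or_graphviz"
```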
- bincfg.cfg.cfg_parsers.parse_rose_gv(cfg, lines)[source]
Reads input as a graphviz file
- Parameters:
cfg (CFG) – an empty/loading CFG() object
lines (str, Iterable[str], TextIO) – the data to parse. Can be a string (which will be split on newlines to get each individual line), a list of string (each element will be considered one line), or an open file to call .readlines() on
- Raises:
CFGParseError – when the file cannot be parsed correctly
- bincfg.cfg.cfg_parsers.parse_rose_txt(cfg, lines)[source]
Reads input as a .txt file
- Parameters:
cfg (CFG) – an empty/loading CFG() object
lines (str, Iterable[str], TextIO) – the data to parse. Can be a string (which will be split on newlines to get each individual line), a list of string (each element will be considered one line), or an open file to call .readlines() on
- Raises:
CFGParseError – when file does not fit expected format
bincfg.cfg.mem_cfg module
- class bincfg.cfg.mem_cfg.MemCFG(cfg: CFG, normalizer: str | NormalizerType | None = None, keep_memory_addresses: bool = False, inplace: bool = False, using_tokens: dict | AtomicTokenDict | None = None, force_renormalize: bool = False)[source]
Bases: object
A CFG that is more memory/speed efficient.
Keeps only the bare minimum information needed from a CFG. Stores edge connections in a CSR-like format.
- Parameters:
cfg (CFG) – a CFG object. Can be normalized or un-normalized. If un-normalized, then it will be normalized using the normalizer parameter.
normalizer (Optional[Union[str, Normalizer]]) – the normalizer to use to normalize the incoming CFG (or None if it is already normalized). If the incoming CFG object has already been normalized, and normalizer is not None, then this will attempt to normalize the CFG again with this normalizer
keep_memory_addresses (bool) – if True, then memory addresses will also be kept. Otherwise they will be removed to save space
inplace (bool) – if True and cfg needs to be normalized, it will be normalized inplace
using_tokens (Union[Dict[str, int], AtomicTokenDict]) – if not None, then a dictionary mapping token strings to integer values. Any tokens in cfg but not in using_tokens will be added. Can also be an AtomicTokenDict for atomic updates to tokens
force_renormalize (bool) – by default, this method will only normalize cfg’s whose .normalizer != to the passed normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer.
- class BlockInfoBitMask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
An Enum for block info bit masks
Each value is a tuple of the bit mask for that boolean, and a function to call with the block that returns a boolean True if that bit should be set, False otherwise. If True, then that bit will be ‘1’ in that block’s block_flags int.
- IS_FUNCTION_CALL: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (1, <function MemCFG.BlockInfoBitMask.<lambda>>)
Bit set if this block is a function call. See
is_function_call()
- IS_FUNCTION_ENTRY: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (2, <function MemCFG.BlockInfoBitMask.<lambda>>)
Bit set if this block is a function entry. See
is_function_entry()
- IS_FUNCTION_JUMP: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (8, <function MemCFG.BlockInfoBitMask.<lambda>>)
Bit set if this block is a function jump, i.e. this block has a jump instruction that resolves to a basic block in a separate function. See
is_function_jump()
- IS_IN_EXTERN_FUNCTION: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (4, <function MemCFG.BlockInfoBitMask.<lambda>>)
Bit set if this block is within an external function. See
is_extern_function()
- IS_MULTI_FUNCTION_CALL: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (16, <function MemCFG.BlockInfoBitMask.<lambda>>)
Bit set if this block is a multi-function call, i.e. this block has either two or more function call edges out, or one function call and two or more normal edges out. See
is_multi_function_call()
Currently this bit is not set in _block_flags_int(), but instead during MemCFG initialization in order to save time (so get_sorted_edges() does not have to be computed multiple times)
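As a rough illustration of how the mask values listed above combine into a single block_flags integer, the sketch below decodes one such integer. BlockFlag and decode_block_flags are hypothetical stand-ins that only mirror the documented mask values (1, 2, 4, 8, 16); they are not part of bincfg.

```python
from enum import IntFlag


class BlockFlag(IntFlag):
    """Hypothetical stand-in mirroring the documented bit mask values."""
    IS_FUNCTION_CALL = 1
    IS_FUNCTION_ENTRY = 2
    IS_IN_EXTERN_FUNCTION = 4
    IS_FUNCTION_JUMP = 8
    IS_MULTI_FUNCTION_CALL = 16


def decode_block_flags(flags: int) -> dict:
    """Map each flag name to True/False depending on whether its bit is set."""
    return {flag.name: bool(flags & flag) for flag in BlockFlag}


# 0b10011 sets the function-call, function-entry and multi-function-call bits
decoded = decode_block_flags(0b10011)
for name, is_set in decoded.items():
    print(name, is_set)
```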
- property architecture: Architectures
Returns the architecture being used. Currently a WIP
Checks for an ‘arch’ or ‘architecture’ key in the metadata and returns it if it is known. Can currently return: ‘java’, ‘x86’
- asm_lines: ndarray
Assembly line information
A contiguous 1-d numpy array of shape (num_asm_lines,) of integer assembly line tokens. Dtype is the smallest unsigned dtype needed to store the largest token value in this MemCFG.
To get the assembly lines for some block index block_idx, you must get the assembly line indices from block_asm_idx, and use those to slice the assembly lines:
>>> block_idx = 7
>>> memcfg.asm_lines[memcfg.block_asm_idx[block_idx]:memcfg.block_asm_idx[block_idx + 1]]
Also see
get_block_asm_lines()
- asm_memory_addresses: None | ndarray
Memory addresses for all of the assembly lines
Only saved if keep_memory_addresses=True when constructing the
MemCFG. This will be a 1-d signed integer numpy array, where a value of -1 means the memory address for that corresponding line was not present in the basic block
- block_asm_idx: ndarray
Indices in asm_lines that correspond to the assembly lines for each basic block in this MemCFG
A 1-d numpy array of shape (num_blocks + 1,). Dtype is the smallest unsigned dtype needed to store the value num_asm_lines. Assembly tokens for a block at index i would have a start index of block_asm_idx[i] and an end index of block_asm_idx[i + 1] in asm_lines.
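The offset layout described here can be reproduced with plain Python lists. The sketch below is an assumption-laden illustration, not bincfg code: it builds a flat token list and a (num_blocks + 1)-long offset list from hypothetical per-block token lists, then slices a block back out the same way the attribute description does.

```python
from itertools import accumulate

# Hypothetical per-block token lists; block 2 is intentionally empty
blocks = [[5, 9], [3], [], [7, 7, 2]]

# Flat token list, analogous to asm_lines
asm_lines = [token for block in blocks for token in block]

# Offsets with one extra trailing entry, analogous to block_asm_idx
block_asm_idx = [0, *accumulate(len(block) for block in blocks)]


def block_tokens(block_idx: int) -> list:
    """Slice the flat token list using consecutive offsets."""
    start, end = block_asm_idx[block_idx], block_asm_idx[block_idx + 1]
    return asm_lines[start:end]


print(block_asm_idx)
print(block_tokens(3))
```

An empty block simply has two equal consecutive offsets, so no special case is needed.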
- block_asm_mem_addr_idx: ndarray | None
Indices in asm_memory_addresses that correspond to the assembly line memory addresses for basic blocks
A 1-d numpy array of shape (num_blocks + 1,). Dtype is the smallest unsigned dtype needed to store the number of assembly line memory addresses. Memory addresses for a block at index i would have a start index of block_asm_mem_addr_idx[i] and an end index of block_asm_mem_addr_idx[i + 1] in asm_memory_addresses. Only saved if keep_memory_addresses=True when constructing the MemCFG.
- block_flags: ndarray
Integer of bit flags for each basic block
A 1-d numpy array of shape (num_blocks,) where each element is an integer of bit flags. See BlockInfoBitMask for more info. Dtype is the smallest unsigned dtype with enough bits to store all flags in BlockInfoBitMask
Also see
get_block_flags()
- block_func_idx: ndarray
Integer ids for the function that each basic block belongs to
A 1-d numpy array of shape (num_blocks,) where each element is a function id for the block at that index. The id can be found in function_name_to_idx. Dtype is the smallest unsigned dtype needed to store the value num_functions
Also see get_block_function_idx() and get_block_function_name()
- block_memory_addresses: ndarray | None
Integer memory addresses of basic blocks.
Only saved if keep_memory_addresses=True when constructing the
MemCFG. This will be a 1-d unsigned integer numpy array containing the memory addresses
- block_metadata: list[int | dict]
Metadata for blocks
A list of run length compressed metadata at the basic block level. We only compress metadata dictionaries that are empty. Elements are in the same order as the block indices in block_asm_idx. Elements are either dictionaries (for the metadata of that current block), or integers indicating we should skip that many blocks as they all have no metadata.
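A minimal sketch of how such a run-length compressed list could be read, assuming exactly the layout described above (dictionaries for blocks with metadata, integers for runs of blocks without any). expand_metadata and metadata_at are hypothetical helper names, not bincfg functions.

```python
def expand_metadata(compressed: list) -> list:
    """Expand integer runs into that many empty dictionaries."""
    expanded = []
    for item in compressed:
        if isinstance(item, int):
            expanded.extend({} for _ in range(item))
        else:
            expanded.append(item)
    return expanded


def metadata_at(compressed: list, index: int) -> dict:
    """Look up one element without expanding the whole list."""
    position = 0
    for item in compressed:
        if isinstance(item, int):
            if index < position + item:
                return {}  # index falls inside a run of empty metadata
            position += item
        else:
            if index == position:
                return item
            position += 1
    raise IndexError(index)


# Two empty blocks, one block with metadata, three empty blocks, one more with metadata
compressed = [2, {"name": "entry"}, 3, {"loop": True}]
expanded = expand_metadata(compressed)
print(len(expanded), metadata_at(compressed, 2))
```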
- drop_tokens() Self[source]
Sets the tokens in this normalizer to None. Make sure you only do this if tokens are saved elsewhere! Returns self
- function_metadata: list[int | dict]
Metadata for functions
A list of run length compressed metadata at the function level. We only compress metadata dictionaries that are empty. Elements are in the same order as the function indices in block_func_idx. Elements are either dictionaries (for the metadata of that current function), or integers indicating we should skip that many functions as they all have no metadata.
- function_name_to_idx: dict[str, int]
Dictionary mapping string function names to their integer ids used in this
MemCFG
- get_block_asm_lines(block_idx: int) ndarray[source]
Get the asm lines associated with this block index
- Parameters:
block_idx (int) – integer block index
- Returns:
a 1-d numpy array of unsigned integer assembly tokens
- Return type:
np.ndarray
- get_block_asm_memory_addresses(block_idx: int) ndarray[source]
Get the asm memory addresses associated with this block index
Values are -1 if the memory address did not exist in that block
- Parameters:
block_idx (int) – integer block index
- Returns:
a 1-d numpy array of signed integer memory addresses
- Return type:
np.ndarray
- get_block_edges_out(block_idx: int, ret_edge_types: bool = False) ndarray | Tuple[ndarray, ndarray][source]
Get numpy array of block indices for all edges out associated with the given block index
- Parameters:
block_idx (int) – integer block index
ret_edge_types (bool) –
if True, will also return a numpy array (1-d, dtype np.uint8) containing the edge type values for each edge with values:
1: normal edge
2: function call edge
- Returns:
either a 1-d numpy array of unsigned integer block indices for all edges out associated with the given block index, or if ret_edge_types=True, then a tuple of (block_edge_inds, edge_types) where the edge_types is a 1-d numpy array of uint8 edge types with the same shape as block_edge_inds that designates the types of the edges. Edge types will be the values of those in the EdgeType enum.
- Return type:
Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
- get_block_flags(block_idx: int) Tuple[bool, bool, bool, bool, bool, bool][source]
Get all block flags for the given block index
- Parameters:
block_idx (int) – integer block index
- Returns:
(is_block_function_call, is_block_function_entry, is_block_extern_function, is_block_function_jump, is_block_multi_function_call)
- Return type:
Tuple[bool, bool, bool, bool, bool, bool]
- get_block_function_idx(block_idx: int) int[source]
Get the function index for the given block index
- Parameters:
block_idx (int) – integer block index
- Returns:
the integer function index for the given block index
- Return type:
int
- get_block_function_name(block_idx: int) str[source]
Get the function name for the given block index
Functions without names will start with ‘__unnamed_func__’
- Parameters:
block_idx (int) – integer block index
- Returns:
the function name for the given block index
- Return type:
str
- get_block_info(block_idx)[source]
Returns all the info associated with the given block index as a dictionary
- Parameters:
block_idx (int) – integer block index
- Returns:
the block info dictionary with keys/values:
’asm_lines’ (np.ndarray): 1-d numpy array of unsigned integer assembly line tokens in this block
’asm_memory_addresses’ (np.ndarray): 1-d numpy array of signed integer memory addresses for the assembly lines in this block. Values will be -1 if the memory addresses do not exist
’edges_out’ (np.ndarray): 1-d numpy array of unsigned integer block indices for all of the edges out from this block
’edge_types’ (np.ndarray): 1-d numpy array of uint8 values for the edge types associated with all of the edges out. These are the values of objects in the EdgeType enum. Currently: EdgeType.NORMAL == 1, EdgeType.FUNCTION_CALL == 2
’function_index’ (int): the integer function index of the function this block resides in
’is_function_call’ (bool): true if this block is a function call block (has at least one outgoing function call edge)
’is_function_entry’ (bool): true if this block is a function entry block (has the same memory address as its parent function)
’is_extern_function’ (bool): true if this block is within an external function (parent_function.is_extern_function is True)
’is_function_jump’ (bool): true if this block is a function jump block (has a ‘normal’ edge to a block that is within another function)
’is_multi_function_call’ (bool): true if this block is a multi-function call block (has 2 or more outgoing function call edges. IE: a call table)
’metadata’ (dict): dictionary of metadata associated with this block
- Return type:
dict
- get_block_memory_address(block_idx: int) int[source]
Returns the memory address for the given block, if present, -1 if not present
- Parameters:
block_idx (int) – integer block index
- Returns:
the memory address
- Return type:
int
- get_block_metadata(block_idx: int | None) dict | list[dict][source]
Returns the metadata associated with that block index
- Parameters:
block_idx (Union[int, None]) – the integer block index of the metadata to get, or None to get the full list of metadata
- Returns:
dictionary of metadata associated with the given block index
- Return type:
Union[dict, list[dict]]
- get_coo_indices() ndarray[source]
Returns the COO indices for this MemCFG
Returns a 2-d numpy array of shape (num_edges, 2) of dtype np.int32. Each row is an edge, column 0 is the ‘row’ indexer, and column 1 is the ‘column’ indexer. E.g.:
original = np.array([
    [0, 1],
    [1, 1]
])
coo_indices = np.array([
    [0, 1],
    [1, 0],
    [1, 1]
])
NOTE: this returns as type np.int32 since pytorch can be finicky about what dtypes it wants
NOTE: pytorch sparse_coo_tensor’s indices are the transpose of the array this method returns
- Returns:
a 2-d numpy array of shape (num_edges, 2) of dtype np.int32 containing COO indices
- Return type:
np.ndarray
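The relationship between a dense matrix and its COO indices in the example for get_coo_indices() can be checked with a dependency-free sketch. Plain nested lists stand in for numpy arrays here, and coo_indices is a hypothetical helper rather than the library method.

```python
def coo_indices(dense: list) -> list:
    """Return one [row, column] pair per non-zero entry, in row-major order."""
    return [
        [row_idx, col_idx]
        for row_idx, row in enumerate(dense)
        for col_idx, value in enumerate(row)
        if value != 0
    ]


original = [[0, 1], [1, 1]]
indices = coo_indices(original)

# Transposed layout of shape (2, num_edges), the orientation the note above
# says pytorch's sparse_coo_tensor expects
transposed = [list(column) for column in zip(*indices)]

print(indices)
print(transposed)
```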
- get_edge_values() ndarray[source]
Returns the edge type values
Returns a 1-d numpy array of length self.num_edges and dtype np.int32 containing an integer type for each edge depending on if it is a normal or function call edge. Edges are directed and have values from EdgeType enum. Values:
1: ‘normal’ edges
2: ‘function call’ edges
NOTE: this returns as type np.int32 since pytorch can be finicky about what dtypes it wants
- Returns:
a 1-d numpy array of length self.num_edges and dtype np.int32 containing integer edge types
- Return type:
np.ndarray
- get_function_block_inds(func_idx: int) list[int][source]
Returns all of the block indices that are within the given function
- Parameters:
func_idx (int) – the integer function index
- Returns:
list of integer block indices that are within the given function
- Return type:
list[int]
- get_function_metadata(func_idx: int | None) dict | list[dict][source]
Returns the metadata associated with that function index
- Parameters:
func_idx (Union[int, None]) – the integer function index of the metadata to get, or None to get the full list of metadata
- Returns:
dictionary of metadata associated with the given function index
- Return type:
Union[dict, list[dict]]
- graph_c: ndarray
Array containing all of the outgoing edges for each block in order
1-D numpy array of shape (num_edges,). Dtype will be the smallest unsigned dtype required to store the value num_blocks + 1. Each element is a block index to which that edge connects. Edges will be in the order they appear in each block’s edges_out attribute, for each block in order of their block_idx.
Also see
get_edges_out()
NOTE: this also inherently contains information on which types of edges they are. If the block is NOT a function call (stored as a bit flag in the block_flags array), then all edges for that block are normal edges. If it IS a function call, then there are 3 cases:
it has one outgoing edge: that edge is always a function call
it has two outgoing edges, one function call, one normal: the first edge is the function call edge, the second is a normal edge
it has >2 outgoing edges, or 2 function call edges: the edges will be listed first by function call edges, then by normal edges, with a separator in between. The separator will have the max unsigned int value for graph_c’s dtype. This is why we use the dtype that can store num_blocks + 1, since we need this extra value just in case. What exactly it means for a basic block to have >2 outgoing edges while being a function call is left up to the user. Possibly due to call operators with non-explicit operands (e.g. register memory locations)?
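The three cases above can be sketched as a small decoder over one block's slice of graph_c. This is an interpretation of the description, not bincfg's code: split_edges is a hypothetical helper, the separator is assumed to appear inside the block's own slice, and 255 is used as the separator on the assumption of a uint8 graph_c.

```python
NORMAL, FUNCTION_CALL = 1, 2  # edge type values documented for EdgeType


def split_edges(edge_slice: list, is_function_call: bool, separator: int):
    """Return (targets, edge_types) for one block's outgoing edges."""
    if not is_function_call:
        # Blocks without the function-call flag only have normal edges
        return list(edge_slice), [NORMAL] * len(edge_slice)
    if separator in edge_slice:
        # Case 3: call edges, then the separator, then normal edges
        cut = edge_slice.index(separator)
        calls, normals = edge_slice[:cut], edge_slice[cut + 1:]
    else:
        # Cases 1 and 2: the first edge is the call, an optional second is normal
        calls, normals = edge_slice[:1], edge_slice[1:]
    targets = list(calls) + list(normals)
    edge_types = [FUNCTION_CALL] * len(calls) + [NORMAL] * len(normals)
    return targets, edge_types


SEPARATOR = 255  # max value of an assumed uint8 graph_c
print(split_edges([4, 6, SEPARATOR, 5], True, SEPARATOR))
```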
- graph_r: ndarray
Array containing information on the number of outgoing edges for each block
1-D numpy array of shape (num_blocks + 1,). Dtype will be the smallest unsigned dtype required to store the value num_edges. This array is a cumulative sum of the number of edges for each basic block. One could get all of the outgoing edges for a block using:
>>> start_idx = memcfg.graph_r[block_idx]
>>> end_idx = memcfg.graph_r[block_idx + 1]
>>> edges = memcfg.graph_c[start_idx:end_idx]
Also see
get_edges_out()
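To show how the two CSR-like arrays relate to explicit (source, target) edge pairs, here is a minimal sketch using plain lists. It assumes graph_c contains no separator values, and csr_to_coo plus the sample data are hypothetical.

```python
def csr_to_coo(graph_r: list, graph_c: list) -> list:
    """Recover (source_block, target_block) pairs from CSR-style arrays."""
    pairs = []
    for block_idx in range(len(graph_r) - 1):
        start, end = graph_r[block_idx], graph_r[block_idx + 1]
        for target in graph_c[start:end]:
            pairs.append((block_idx, target))
    return pairs


# Four blocks: block 0 -> 1, 2; block 1 -> 2; block 2 has no edges; block 3 -> 0
graph_r = [0, 2, 3, 3, 4]
graph_c = [1, 2, 2, 0]

edges = csr_to_coo(graph_r, graph_c)
print(edges)
```

The source block of each edge is never stored explicitly; it is implied by which pair of consecutive graph_r offsets the edge falls between.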
- property inv_tokens: dict[int, str]
Returns the inverse of self.tokens: a dictionary mapping token integers to their original strings
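A short sketch of what inverting a token dictionary looks like and how the result can turn integer tokens back into strings. The token strings and values are made up for illustration, and the inversion shown is the generic dictionary idiom, not necessarily how the property is implemented.

```python
# Hypothetical token dictionary: token string -> integer id
tokens = {"mov": 0, "add": 1, "ret": 2}

# Inverse mapping: integer id -> token string
inv_tokens = {token_id: token for token, token_id in tokens.items()}

# Decode a block's integer tokens back into their original strings
block = [0, 1, 1, 2]
decoded_lines = [inv_tokens[token_id] for token_id in block]
print(decoded_lines)
```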
- is_block_extern_function(block_idx: int) bool[source]
True if this block is in an external function, False otherwise
- is_block_function_call(block_idx: int) bool[source]
True if this block is a function call, False otherwise
- is_block_function_entry(block_idx: int) bool[source]
True if this block is a function entry, False otherwise
- is_block_function_jump(block_idx: int) bool[source]
True if this block is a function jump, False otherwise
- is_block_multi_function_call(block_idx: int) bool[source]
True if this block is a multi-function call, False otherwise
- metadata: dict
Dictionary of metadata associated with this MemCFG
- normalize(normalizer: str | NormalizerType | None = None, using_tokens: dict | AtomicTokenDict = None, inplace: bool = True, force_renormalize: bool = False) MemCFG[source]
Normalizes this memcfg (in-place by default).
- Parameters:
normalizer (Optional[Union[str, NormalizerType]]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.
using_tokens (Union[dict, AtomicTokenDict]) – tokens to use when normalizing
inplace (bool) – whether or not to normalize inplace. Defaults to True.
force_renormalize (bool) – by default, this method will only normalize this cfg if the passed normalizer is != self.normalizer. However if force_renormalize=True, then this will be renormalized even if it has been previously normalized with the same normalizer. Defaults to False.
- Returns:
this MemCFG normalized
- Return type:
MemCFG
- normalizer: NormalizerType
The normalizer used to normalize input before converting to MemCFG
Can be shared with a MemCFGDataset object if this MemCFG is a part of one
- property num_asm_lines: int
The number of assembly lines in this
MemCFG
- property num_blocks: int
The number of blocks in this
MemCFG
- property num_edges: int
The number of edges in this
MemCFG
- property num_functions: int
The number of functions in this
MemCFG
- set_tokens(tokens: dict | AtomicTokenDict) Self[source]
Sets this MemCFG’s tokens to the given tokens, and returns self
- to_adjacency_matrix(type: Literal['np', 'numpy', 'torch'] = 'np', sparse: bool = False) ndarray | Tuple[ndarray, ndarray][source]
Returns an adjacency matrix representation of this memcfg’s graph connections
Edges are directed and have values from EdgeType enum. Values:
1: ‘normal’ edges
2: ‘function call’ edges
- Parameters:
type (Literal["np", "numpy", "torch"]) –
the type of matrix to return. Defaults to ‘np’. Can be:
’np’/’numpy’ for a numpy ndarray (dtype: np.int32)
’torch’/’pytorch’ for a pytorch tensor (type: LongTensor)
sparse (bool) –
whether or not the return value should be a sparse matrix. Defaults to False. Has different behaviors based on type:
- numpy array: returns a 2-tuple of sparse COO representation (indices, values).
NOTE: the indices are the transpose of those from get_coo_indices()
NOTE: if you want sparse CSR format, you already have it with self.graph_c and self.graph_r
- pytorch tensor: returns a pytorch sparse COO tensor.
NOTE: not using sparse CSR format for now since it seems to have less documentation and support.
- Returns:
an adjacency matrix representation of this MemCFG
- Return type:
Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
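To make the sparse (indices, values) form concrete, the sketch below turns such a pair back into a dense matrix using plain lists. It assumes indices has the transposed layout of shape (2, num_edges) mentioned in the note, and that the first row holds source blocks and the second row target blocks; coo_to_dense and the sample data are hypothetical.

```python
NORMAL, FUNCTION_CALL = 1, 2  # edge type values documented for EdgeType


def coo_to_dense(indices: list, values: list, num_blocks: int) -> list:
    """Build a dense num_blocks x num_blocks matrix from COO data."""
    rows, cols = indices
    dense = [[0] * num_blocks for _ in range(num_blocks)]
    for row, col, value in zip(rows, cols, values):
        dense[row][col] = value
    return dense


# Edges: 0 -> 1 (normal), 0 -> 2 (function call), 1 -> 2 (normal)
indices = [[0, 0, 1], [1, 2, 2]]
values = [NORMAL, FUNCTION_CALL, NORMAL]

dense = coo_to_dense(indices, values, num_blocks=3)
for row in dense:
    print(row)
```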
- to_cfg() CFG[source]
Converts this MemCFG back into a CFG
NOTE: if keep_memory_addresses=False when constructing this MemCFG, then memory addresses will not be present and basic blocks will be given a memory address that is just their index in the block list
- tokens: dict[str, int]
Dictionary mapping token strings to integer values used in this MemCFG
Can be shared with a MemCFGDataset object if this MemCFG is a part of one.
Can also be an AtomicTokenDict object for atomic token updates
bincfg.cfg.mem_cfg_dataset module
- class bincfg.cfg.mem_cfg_dataset.MemCFGDataset(cfg_data=None, using_tokens=None, normalizer=None, metadata=None, **add_data_kwargs)[source]
Bases: object
A CFGDataset that is more memory efficient
- Parameters:
cfg_data (Optional[Union[str, CFG, CFGDataset, MemCFG, MemCFGDataset, Iterable]]) – the data to use. Can be None for an empty dataset, or a string (for input to CFG), CFG, CFGDataset, MemCFG, MemCFGDataset, or iterable of those values to add that data to this dataset
using_tokens (Optional[Union[Dict[str, int], AtomicTokenDict]]) – if passed, will initialize the token dictionary to this dictionary of tokens (will be copied). Can be an AtomicTokenDict to use an atomic file token dictionary
normalizer (Optional[Union[str, Normalizer]]) – the normalizer to use, or None to default to the normalizer of the first added CFG/MemCFG
metadata (Optional[Dict]) – a dictionary of metadata to attach to this MemCFGDataset NOTE: passed dictionary will be shallow copied
add_data_kwargs (Any) – kwargs to pass to self.add_data() when adding the passed cfg_data
- add_data(*cfg_data, inplace=True, force_renormalize=False, progress=False)[source]
Adds data to this dataset
- Parameters:
cfg_data (Union[str, CFG, MemCFG, CFGDataset, MemCFGDataset, Iterable]) – arbitrary amount of str (CFG input)/CFG/MemCFG/CFGDataset/MemCFGDataset’s, or iterables of them, to add to this dataset
inplace (bool, optional) – whether or not to normalize the incoming cfg_data inplace. Defaults to True.
force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != to this dataset’s normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.
mp (bool, optional) – if True, will use multiprocessing to normalize cfgs. Defaults to False.
progress (bool, optional) – if True, will show a progressbar when adding multiple cfgs. Defaults to False.
- Raises:
TypeError – if something other than a cfg/dataset is passed in cfg_data
- cfgs = None
The list of all memcfgs in this dataset
- metadata = None
A dictionary of metadata associated with this
MemCFGDataset
- normalize(normalizer=None, inplace=True, force_renormalize=False, progress=False)[source]
Normalize this MemCFGDataset.
- Parameters:
normalizer (Union[str, Normalizer]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.
inplace (bool, optional) – by default, normalizes this dataset inplace (i.e. without copying objects). Can set to False to return a copy. Defaults to True.
force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != to the passed normalizer. However if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.
progress (bool, optional) – if True, will show a progressbar while normalizing. Defaults to False.
- Returns:
this dataset normalized
- Return type:
MemCFGDataset
- normalizer = None
The normalizer used in this dataset, or None if there is no normalizer
- property num_asm_lines
- property num_blocks
- property num_cfgs
- property num_edges
- property num_functions
- remove_cfg(cfg_or_idx)[source]
Removes the given MemCFG (or index of MemCFG if cfg_or_idx is an integer) from this MemCFGDataset
- Parameters:
cfg_or_idx (Union[MemCFG, int]) – cfg or index to remove
- save(path, freeze_tokens=True)[source]
Saves this MemCFGDataset to path
- Parameters:
path (str) – the filepath to save to
freeze_tokens (bool) – whether or not to ‘freeze’ the tokens in this MemCFGDataset. ‘freezing’ the tokens just means that, if an AtomicTokenDict is the current token dictionary for this MemCFGDataset, then its current data will be saved in the pickle file as a normal dict. This is useful for loading this data later so that the loading does not depend on being able to access the files for the AtomicTokenDict. Default: True. If the token dictionary is already a dict, then this has no effect
- using_tokens = None
A dictionary mapping string tokens to their integer values
Can be an AtomicTokenDict for atomic updates to tokens