bincfg.cfg package

Submodules

bincfg.cfg.cfg module

class bincfg.cfg.cfg.CFG(data: CFGInputDataType = None, normalizer: str | NormalizerType | None = None, metadata: dict | None = None, using_tokens: TokenDictType | None = None)[source]

Bases: object

A Control Flow Graph (CFG) representation of a binary

Parameters:
  • data (Optional[Union[str, TextIO, Sequence[str], SmdaReport]]) –

    the data to use to make this CFG. Data type will be inferred based on the data passed:

    • string: either a string with newline characters, which will be split on all newlines and treated as a known disassembler format, or a string with no newline characters, which will be treated as a filename.

    • Sequence of string: will be treated as already-read-in disassembler file split on newlines

    • open file object: will be read in using .readlines, then treated as disassembler input

    • SmdaReport: output from smda disassembly

  • normalizer (Optional[Union[str, NormalizerType]]) – the normalizer to use to force-renormalize the incoming CFG, or None to not normalize

  • metadata (Optional[dict]) –

    a dictionary of metadata to add to this CFG

    NOTE: passed dictionary will be shallow copied

  • using_tokens (Optional[Union[dict[str, int], AtomicTokenDict]]) – optional token dictionary to use when initializing and normalizing. Only used if normalizer is not None
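The inference rules for data can be sketched in plain Python. The helper below, classify_cfg_input, is a hypothetical name and not part of bincfg; it only mirrors the string, sequence, and file-object rules listed above and leaves out SmdaReport handling.

```python
import io
from typing import Sequence, TextIO, Union


def classify_cfg_input(data: Union[str, Sequence[str], TextIO]) -> str:
    """Hypothetical helper: report how an input would be interpreted."""
    if isinstance(data, str):
        # A string with newlines is disassembler text, otherwise a filename
        return "disassembler text" if "\n" in data else "filename"
    if hasattr(data, "readlines"):
        # Open file objects are read with .readlines() first
        return "open file"
    if isinstance(data, (list, tuple)) and all(isinstance(s, str) for s in data):
        # Already split on newlines by the caller
        return "pre-split lines"
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


kinds = [
    classify_cfg_input("binary.txt"),
    classify_cfg_input("line one\nline two"),
    classify_cfg_input(["line one", "line two"]),
    classify_cfg_input(io.StringIO("line one\nline two")),
]
```

The order of the checks matters in this sketch: the plain-string case is tested first because a str is itself a sequence of strings.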

add_function(*functions: CFGFunction, override: bool = False) None[source]

Adds the given function(s) to this cfg. This should only be done once the given function(s) have been fully initialized

This will do some housekeeping things such as:

  • setting the parent_cfg and parent_function attributes of functions and blocks respectively

  • adding missing edges to their associated edges_out and edges_in

  • converting edges from (None/address, None/address, edge_type) tuples into CFGEdge() objects

  • adding from_block and to_block in new edges if missing

  • functions with no address will have their address be that of the smallest addressed block in their blocks, if present

Parameters:
  • functions (CFGFunction) – arbitrary number of CFGFunction objects to add

  • override (bool) – if False, an error will be raised if a function or basic block contains an address that already exists in this CFG. If True, then that error will not be raised and those functions/basic blocks will be overridden (which is unsupported behavior). Defaults to False.
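Two of the housekeeping steps above (setting parent references, and defaulting a function's address to its lowest block address) can be illustrated with stand-in classes. SketchBlock, SketchFunction, and adopt_function are hypothetical names for this sketch and are not the bincfg implementation; the -1 sentinel follows the "Will be -1 if not initialized yet" convention documented for CFGFunction.address.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchBlock:
    """Stand-in for a basic block; not the bincfg class."""
    address: int
    parent_function: Optional["SketchFunction"] = None


@dataclass
class SketchFunction:
    """Stand-in for a function; address -1 means 'not set yet'."""
    blocks: List[SketchBlock] = field(default_factory=list)
    address: int = -1


def adopt_function(func: SketchFunction) -> SketchFunction:
    # Point every block back at its owning function
    for block in func.blocks:
        block.parent_function = func
    # A function without an address takes its lowest block address
    if func.address == -1 and func.blocks:
        func.address = min(b.address for b in func.blocks)
    return func


func = adopt_function(SketchFunction(blocks=[SketchBlock(0x40), SketchBlock(0x10)]))
```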

property architecture: Architectures

Returns the architecture being used. Currently a WIP

Checks for an ‘arch’ or ‘architecture’ key in the metadata and returns it if it is known. Can currently return: ‘java’, ‘x86’

property asm_counts: Mapping[str, int]

A collections.Counter() of all unique assembly lines and their counts in this cfg

property blocks: list[CFGBasicBlock]

A list of basic blocks in this CFG (in order of memory address)

blocks_dict: dict[int, CFGBasicBlock]

Dictionary mapping integer basic block addresses to their CFGBasicBlock objects

copy() CFG[source]

property edges: list[CFGEdge]

A list of all outgoing CFGEdge’s in this CFG

classmethod from_networkx(graph: networkx.MultiDiGraph, cfg: CFG | None = None) CFG[source]

Converts a networkx graph to a CFG

Expects the graph to have the exact same structure as is shown in CFG().to_networkx()

Parameters:
  • graph (networkx.MultiDiGraph) – the networkx graph

  • cfg (Optional[CFG]) – can be None to create/return a new CFG object, or an already created and empty CFG() object to put data into that one

property functions: list[CFGFunction]

A list of functions in this CFG (in order of memory address)

functions_dict: dict[int, CFGFunction]

Dictionary mapping integer function addresses to their CFGFunction objects

get_block(address: int | str | Addressable, raise_err: bool = True) CFGBasicBlock | None[source]

Returns the basic block in this CFG with the given address

Parameters:
  • address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)

  • raise_err (bool) – if True, will raise an error if the basic block with the given memory address was not found, otherwise will return None

Raises:

ValueError – if the basic block with the given address could not be found

Returns:

the basic block with the given address

Return type:

Union[CFGBasicBlock, None]

get_block_containing_address(address: int | str | Addressable, raise_err: bool = True) CFGBasicBlock | None[source]

Returns the basic block in this CFG that contains the given address at the start of one of its instructions

This will lazily compute an instruction lookup dictionary mapping addresses to the blocks that contain them

NOTE: this will only return a block if the address is either equal to the block’s address, or if it is exactly equal to one of the addresses for an assembly instruction in a block’s .asm_memory_addresses list

Parameters:
  • address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)

  • raise_err (bool) – if True, will raise an error if no basic block containing the given memory address was found, otherwise will return None

Raises:

ValueError – if the basic block containing the given address could not be found

Returns:

the basic block that contains the given address

Return type:

Union[CFGBasicBlock, None]
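The lazily computed lookup described for get_block_containing_address() might look roughly like the following. This is a minimal sketch with stand-in classes (SketchBlock and SketchCFG are hypothetical), and it accepts integer addresses only, whereas the real method also accepts strings and addressable objects.

```python
from typing import Dict, List, Optional


class SketchBlock:
    """Stand-in block with a start address and per-instruction addresses."""

    def __init__(self, address: int, asm_memory_addresses: List[int]):
        self.address = address
        self.asm_memory_addresses = asm_memory_addresses


class SketchCFG:
    """Stand-in CFG showing a lazily built address lookup."""

    def __init__(self, blocks: List[SketchBlock]):
        self.blocks = blocks
        self._lookup: Optional[Dict[int, SketchBlock]] = None

    def get_block_containing_address(self, address: int, raise_err: bool = True):
        if self._lookup is None:
            # Built on first use: block start plus every instruction start
            self._lookup = {}
            for block in self.blocks:
                self._lookup[block.address] = block
                for addr in block.asm_memory_addresses:
                    self._lookup[addr] = block
        block = self._lookup.get(address)
        if block is None and raise_err:
            raise ValueError(f"No block contains address {address:#x}")
        return block


cfg = SketchCFG([SketchBlock(0x10, [0x10, 0x13, 0x18]), SketchBlock(0x20, [0x20, 0x22])])
hit = cfg.get_block_containing_address(0x13)
# 0x14 falls inside an instruction, not at its start, so there is no match
miss = cfg.get_block_containing_address(0x14, raise_err=False)
```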

get_cfg_build_code() str[source]

Returns python code that will build the given cfg. Used for testing.

This will return the plain code itself to build, with no initial tabs.

Parameters:

cfg (CFG) – the cfg

Returns:

string of python code to build the cfg

Return type:

str

get_function(address: int | str | Addressable, raise_err: bool = True) CFGFunction | None[source]

Returns the function in this CFG with the given address

Parameters:
  • address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)

  • raise_err (bool) – if True, will raise an error if the function with the given memory address was not found, otherwise will return None

Raises:

ValueError – if the function with the given address could not be found

Returns:

the function with the given address, or None if that function does not exist

Return type:

Union[CFGFunction, None]

get_function_by_name(name: str, raise_err: bool = True) CFGFunction | None[source]

Returns the function in this CFG with the given name

NOTE: if the name of the function is None, then the expected string name to this method would be: “__UNNAMED_FUNC_%d” % func.address

Parameters:
  • name (str) – the name of the function to get

  • raise_err (bool) – if True, will raise an error if the function with the given name was not found, otherwise will return None

Raises:

ValueError – if the function with the given name could not be found

Returns:

the function with the given name, or None if that function does not exist

Return type:

Union[CFGFunction, None]

insert_library(cfg: CFG, function_mapping: dict[str, int], offset: int | None = None)[source]

WIP. Inserts the cfg of a shared library into this cfg

This will modify the memory addresses of cfg (adding an appropriate offset), then add all of the functions and basic blocks from cfg into this cfg. Finally, external functions in this cfg that have implemented functions in the function_mapping will have normal edges added.

NOTE: this assumes that no other libraries will be added later that depend on this one that is currently being added (otherwise, the external function edges might not be added properly). Make sure you add them in the correct order!

Parameters:
  • cfg (CFG) – the cfg of the library to insert. It will be copied

  • function_mapping (Dict[str, int]) – dictionary mapping known exported function names to their addresses within cfg. While we can sometimes determine these mappings from function names in the new cfg, that is not always the case (EG: stripping function names from binaries, or compilers/linkers emitting aliases for the functions in cfg), hence why this parameter exists. If you don’t wish to add in new normal edges, or if you wish to add them in manually, you can pass an empty dictionary

  • offset (Optional[int]) – if None, then the library will be inserted in the first available memory location. Otherwise this can be an integer memory address to insert the cfg at (this will raise an error if it can’t fit there)

metadata: dict

Dictionary of metadata associated with this CFG

normalize(normalizer: str | NormalizerType, using_tokens: dict[str, int] | AtomicTokenDict | None = None, inplace: bool = True, force_renormalize: bool = False) CFG[source]

Normalizes this cfg.

Parameters:
  • normalizer (Union[str, NormalizerType]) – the normalizer to use. Can be a Normalizer object, or a string of a built-in normalizer to use

  • using_tokens (Optional[TokenDictType]) – token dictionary to use when normalizing, or None to normalize from scratch

  • inplace (bool) – whether or not to normalize inplace

  • force_renormalize (bool) – by default, this method will normalize this cfg only if the passed normalizer != self.normalizer. However, if force_renormalize=True, this cfg will be renormalized even if it was previously normalized with the same normalizer

Returns:

this CFG normalized

Return type:

CFG

normalizer: NormalizerType | None

The normalizer used to normalize assembly lines in this CFG, or None if they have not been normalized

property num_asm_lines: int

The number of asm lines across all blocks in this cfg

property num_blocks: int

The number of basic blocks in this cfg

property num_edges: int

The number of edges in this cfg

property num_functions: int

The number of functions in this cfg

set_tokens(tokens: dict[str, int] | AtomicTokenDict) CFG[source]

Sets this CFG’s tokens to the given tokens, and returns self

to_adjacency_matrix(type: str = 'np', sparse: bool = False) np.ndarray | torch.Tensor[source]

Returns an adjacency matrix representation of this cfg’s graph connections

Currently is slow because I just convert to a MemCFG, then call that object’s to_adjacency_matrix(). I should probably speed this up at some point…

Connections will be directed and have values:

  • 0: No edge

  • 1: Normal edge

  • 2: Function call edge

See MemCFG.to_adjacency_matrix() for more details

Parameters:
  • type (str, optional) –

    the type of matrix to return. Defaults to ‘np’. Can be:

    • ’np’/’numpy’ for a numpy ndarray (dtype: np.int32)

    • ’torch’/’pytorch’ for a pytorch tensor (type: LongTensor)

  • sparse (bool, optional) –

    whether or not the return value should be a sparse matrix. Defaults to False. Has different behaviors based on type:

    • numpy array: returns a 2-tuple of sparse COO representation (indices, values).

      NOTE: if you want sparse CSR format, you already have it with self.graph_c and self.graph_r

    • pytorch tensor: returns a pytorch sparse COO tensor.

      NOTE: sparse CSR format is not used for now since it seems to be less well documented and supported.

Returns:

an adjacency matrix representation of this CFG

Return type:

Union[np.ndarray, torch.Tensor]
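To make the edge values and the sparse COO layout concrete, here is a dependency-free sketch that uses plain lists instead of numpy or torch. The edge list is invented, and ordering rows and columns by sorted block address is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, List, Tuple

NORMAL, FUNCTION_CALL = 1, 2  # edge values from the list above; 0 means no edge

# Hypothetical edges: (from_address, to_address, value)
edges = [(0x10, 0x20, NORMAL), (0x20, 0x10, NORMAL), (0x20, 0x30, FUNCTION_CALL)]

# Assumption of this sketch: indices follow sorted block addresses
addresses = sorted({a for src, dst, _ in edges for a in (src, dst)})
index: Dict[int, int] = {addr: i for i, addr in enumerate(addresses)}

# Dense form: dense[i][j] holds the value of the edge from block i to block j
dense: List[List[int]] = [[0] * len(addresses) for _ in addresses]
for src, dst, value in edges:
    dense[index[src]][index[dst]] = value

# Sparse COO form: (row, column) index pairs and their non-zero values
coo_indices: List[Tuple[int, int]] = [
    (i, j) for i, row in enumerate(dense) for j, v in enumerate(row) if v
]
coo_values: List[int] = [dense[i][j] for i, j in coo_indices]
```

The matrix is directed, so dense[i][j] and dense[j][i] are independent entries.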

to_networkx() networkx.MultiDiGraph[source]

Converts this CFG to a networkx MultiDiGraph() object

Requires that networkx be installed.

Creates a new MultiDiGraph() and adds as attributes to that graph:

  • ‘normalizer’: string name of normalizer, or None if it had none

  • ‘metadata’: a dictionary of metadata

  • ‘functions’: a dictionary mapping integer function addresses to named tuples containing its data with the structure (‘name’: Union[str, None], ‘is_extern_function’: bool, ‘blocks’: Tuple[int, …], ‘metadata’: dict).

    • The ‘name’ element (first element) is a string name of the function, or None if it doesn’t have a name

    • The ‘is_extern_function’ element (second element) is True if this function is an extern function, False otherwise. An extern function is one that is located in an external library intended to be found at runtime, and that doesn’t have its code here in the CFG, only a small function meant to jump to the external function when loaded at runtime

    • The ‘blocks’ element (third element) is an arbitrary-length tuple of integers, each integer being the memory address (equivalently, the block_id) of a basic block that is a part of that function. Each basic block is only part of a single function, and each function should have at least one basic block

    • The ‘metadata’ element (fourth element) is a dictionary of metadata associated with that function. May be empty.

NOTE: we use a multidigraph because edges are directed (in order of control flow), and it is theoretically possible (and occurs in some data) to have a node that calls another node, then has a normal edge back out to it. This has occurred in some libc setup code

Then, each basic block will be added to the graph as nodes. Their id in the graph will be their integer address. Each block will have the following attributes:

  • ‘asm_lines’ (Tuple[str]): tuple of string assembly lines

  • ‘asm_memory_addresses’ (Tuple[int]): tuple of integer assembly line memory addresses, one for each line in order. If these addresses are not present, this will be an empty tuple

  • ‘metadata’ (dict): dictionary (possibly empty) of metadata associated with this basic block

Finally, all edges will be added (directed based on control flow direction), and with the attributes:

  • ‘edge_type’ (str): the edge type, will be ‘normal’ for normal edges and ‘function_call’ for function call edges
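The layout described above can be pictured with plain Python containers. This sketch deliberately avoids importing networkx; FunctionRecord is a hypothetical type name (only the field order comes from the text), and the addresses and assembly lines are invented.

```python
from collections import namedtuple

# Hypothetical type name; only the field order comes from the description above
FunctionRecord = namedtuple(
    "FunctionRecord", ["name", "is_extern_function", "blocks", "metadata"]
)

# Graph-level attributes
graph_attrs = {
    "normalizer": None,
    "metadata": {},
    "functions": {
        0x1000: FunctionRecord("main", False, (0x1000, 0x1010), {}),
        0x2000: FunctionRecord(None, True, (0x2000,), {}),
    },
}

# Node attributes, keyed by each basic block's integer address
nodes = {
    0x1000: {
        "asm_lines": ("push rbp", "call 0x2000"),
        "asm_memory_addresses": (0x1000, 0x1001),
        "metadata": {},
    },
    0x1010: {"asm_lines": ("ret",), "asm_memory_addresses": (0x1010,), "metadata": {}},
    0x2000: {"asm_lines": (), "asm_memory_addresses": (), "metadata": {}},
}

# Directed edges, each carrying a single 'edge_type' attribute
edges = [
    (0x1000, 0x2000, {"edge_type": "function_call"}),
    (0x1000, 0x1010, {"edge_type": "normal"}),
]

# Every block listed under a function should also exist as a node
all_function_blocks = {b for f in graph_attrs["functions"].values() for b in f.blocks}
```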

update_metadata(other: dict) CFG[source]

Updates this CFG’s metadata dictionary with the given dictionary, and returns self

exception bincfg.cfg.cfg.InvalidInsertionMemoryAddressError[source]

Bases: Exception

bincfg.cfg.cfg.auto_detect_assembly_language(cfg: CFG) None[source]

Attempts to detect the assembly language used in the given CFG, setting its ‘architecture’ key in the metadata if successful

Will attempt to find known substrings in any block that indicate a specific language. Assumes the full CFG is all the same language

Parameters:

cfg (CFG) – the cfg to detect language on

bincfg.cfg.cfg_basic_block module

class bincfg.cfg.cfg_basic_block.CFGBasicBlock(parent_function: CFGFunction | None = None, address: int | str | Addressable | None = None, edges_in: Iterable[CFGEdge] | None = None, edges_out: Iterable[CFGEdge] | None = None, asm_lines: Iterable[str] | None = None, asm_memory_addresses: Iterable[int | str | Addressable] | None = None, metadata: dict | None = None)[source]

Bases: object

A single basic block in a CFG.

Can be initialized empty, or with attributes. Assumes its memory address is always unique within a cfg.

NOTE: these objects should not be pickled/copy.deepcopy()-ed by themselves, only as a part of a cfg

Parameters:
  • parent_function (Optional[CFGFunction]) – the CFGFunction this basic block belongs to

  • address (Optional[Union[int, str, Addressable]]) – the memory address of this CFGBasicBlock. Should be unique to the CFG that contains it. If None, but asm_memory_addresses is passed, this will be set to the first value in asm_memory_addresses

  • edges_in (Optional[Iterable[CFGEdge]]) – an iterable of incoming CFGEdge objects

  • edges_out (Optional[Iterable[CFGEdge]]) – an iterable of outgoing CFGEdge objects

  • asm_lines (Optional[Iterable[str]]) – an iterable of string assembly lines present at this basic block

  • asm_memory_addresses (Optional[Iterable[Union[str, int, Addressable]]]) – an iterable of string or integer memory addresses, one for each assembly line (will be converted into integer memory addresses). If this was passed, but address was not, then address will be set to the first value in asm_memory_addresses

  • metadata (Optional[Dict]) – optional dictionary of metadata to associate with this basic block

address: int

The integer memory address of this basic block. Will be -1 if not set yet

property all_edges: set[CFGEdge]

Returns a set of all edges in this basic block

property asm_counts: Mapping[str, int]

A collections.Counter of all unique assembly lines/tokens and their counts in this basic block

asm_lines: list[str]

List of string assembly lines at this basic block

asm_memory_addresses: list[int]

List of integer memory addresses for all assembly lines at this basic block. Will be empty list if not set yet

calls(address: int | str | Addressable)[source]

Checks if this block calls the given address

IE: checks if this block has an outgoing function_call edge to the given address

Parameters:

address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction)

Returns:

True if this block calls the given address, False otherwise

Return type:

bool

edges_in: set[CFGEdge]

The set of incoming CFGEdge’s to this basic block

edges_out: set[CFGEdge]

The set of outgoing CFGEdge’s from this basic block

get_sorted_edges(edge_types: str | EdgeType | Iterable[str | EdgeType] | None = None, direction: Literal['out', 'in'] | Iterable[Literal['out', 'in']] | None = None, as_sets: bool = False) Tuple[list[CFGEdge], ...] | Tuple[set[CFGEdge], ...][source]

Returns a tuple of sorted lists of edges (sorted by address of the “other” block) of each type/direction in this block

Will return edge lists ordered first by edge type (their order of appearance in the cfg_edge.EdgeType enum), then by direction (‘in’, then ‘out’). If edge_types is passed, only those edge types will be returned, and the edge lists will be ordered by the order of the edge types in edge_types, then by direction (‘in’, then ‘out’).

For example, with edge_types=None and direction=None, this would return the 4-tuple (normal_edges_in, normal_edges_out, function_call_edges_in, function_call_edges_out), where each element is a list of CFGEdge objects.

Parameters:
  • edge_types (Optional[Union[str, EdgeType, Iterable[Union[str, EdgeType]]]]) – either an edge type or an iterable of edge types. Only edges with one of these types will be returned. If not None, then the edge lists will be returned sorted based on the order of the edge types listed here, then by direction

  • direction (Optional[Union[Literal["out", "in"], Iterable[Literal["out", "in"]]]]) – the direction to get. Can be the strings ‘in’ or ‘out’, or None to get both

  • as_sets (bool) – if True, then this will return unordered sets of edges instead of sorted lists. This may save a ~tiny~ bit of time in the long run, but will hinder deterministic behavior of this method.

Returns:

a tuple of lists/sets of CFGEdge’s

Return type:

Union[Tuple[List[CFGEdge], …], Tuple[Set[CFGEdge], …]]
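The default ordering (edge type first, then ‘in’ before ‘out’, each list sorted by the address of the “other” block) can be sketched with stand-in types. SketchEdgeType, SketchEdge, and sorted_edge_groups are hypothetical names, and edges here store plain integer addresses rather than block objects.

```python
from enum import Enum
from typing import List, NamedTuple, Tuple


class SketchEdgeType(Enum):
    NORMAL = 1
    FUNCTION_CALL = 2


class SketchEdge(NamedTuple):
    from_address: int
    to_address: int
    edge_type: SketchEdgeType


def sorted_edge_groups(
    block_address: int, edges: List[SketchEdge]
) -> Tuple[List[SketchEdge], ...]:
    groups = []
    for edge_type in SketchEdgeType:  # enum definition order
        typed = [e for e in edges if e.edge_type is edge_type]
        # Incoming edges: the "other" block is the source
        incoming = [e for e in typed if e.to_address == block_address]
        groups.append(sorted(incoming, key=lambda e: e.from_address))
        # Outgoing edges: the "other" block is the target
        outgoing = [e for e in typed if e.from_address == block_address]
        groups.append(sorted(outgoing, key=lambda e: e.to_address))
    return tuple(groups)


edges = [
    SketchEdge(0x30, 0x20, SketchEdgeType.NORMAL),
    SketchEdge(0x10, 0x20, SketchEdgeType.NORMAL),
    SketchEdge(0x20, 0x40, SketchEdgeType.FUNCTION_CALL),
]
normal_in, normal_out, call_in, call_out = sorted_edge_groups(0x20, edges)
```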

has_edge(address: int | str | Addressable, edge_types: str | EdgeType | Iterable[str | EdgeType] | None = None, direction: Literal['in', 'out'] | None = None) bool[source]

Checks if this block has an edge from/to the given address

Parameters:
  • address (AddressLike) – a string/integer memory address, or an addressable object (EG: CFGBasicBlock/CFGFunction).

  • edge_types (Optional[Union[str, EdgeType, Iterable[Union[str, EdgeType]]]]) – either an edge type or an iterable of edge types. Only edges with one of these types will be considered. If None, then all edge types will be considered

  • direction (Optional[Literal['in', 'out']]) – the direction to check (strings ‘in’ or ‘out’), or None to check both

Returns:

True if this block has an edge from/to the given address, False otherwise

Return type:

bool

property is_function_call: bool

True if this block is a function call, False otherwise

Checks if this block has one or more outgoing function call edges

property is_function_entry: bool

True if this block is a function entry block, False otherwise

Specifically, returns True if this block’s address matches its parent function’s address. If this block has no parent, False is returned.

property is_function_jump: bool

True if this block is a function jump, False otherwise

Checks if this block has a ‘jump’ instruction to a basic block in a different function. Specifically, checks if this block has an outgoing EdgeType.NORMAL edge to a basic block whose parent_function has an address different than this basic block’s parent_function’s address.

property is_multi_function_call: bool

True if this block is a multi-function call, False otherwise

IE: this block has two or more outgoing function call edges

metadata: dict

Dictionary of extra metadata to associate with this basic block

property num_asm_lines: int

The number of assembly lines/tokens in this basic block

property num_edges: int

The number of edges out in this basic block

property num_edges_in: int

The number of incoming edges in this basic block

property num_edges_out: int

The number of outgoing edges in this basic block

parent_function: CFGFunction | None

The parent function of this basic block. Will be None if not set yet

remove_edge(edge: CFGEdge) None[source]

Removes the given edge from this block’s edges (both incoming and outgoing)

Parameters:

edge (CFGEdge) – the CFGEdge to remove

Raises:

ValueError – if the edge doesn’t exist in the incoming/outgoing edges

bincfg.cfg.cfg_basic_block.CFGBasicBlockPickledState

The pickled state of a CFGBasicBlock

alias of Tuple[int, Tuple[Tuple[int, int, EdgeType], …], Tuple[Tuple[int, int, EdgeType], …], list[str], list[int], dict]

bincfg.cfg.cfg_dataset module

class bincfg.cfg.cfg_dataset.CFGDataset(cfg_data=None, normalizer=None, load_path=None, max_files=None, allow_multiple_norms=False, progress=False, metadata=None, num_workers=1, **add_data_kwargs)[source]

Bases: object

A dataset of CFG’s.

Parameters:
  • cfg_data (Optional[Union[CFG, CFGDataset, Iterable]]) – a CFG, CFGDataset or iterable of CFG’s or CFGDataset’s to add to this dataset, or None to initialize this CFGDataset empty

  • normalizer (Optional[Union[str, Normalizer]]) – if not None, then a normalizer to use. Will normalize all incoming CFG’s if they do not already have the same normalization (will attempt to renormalize incoming CFG’s if they already have a normalization). Can be a Normalizer object or string.

  • load_path (str) – if not None, loads all files in this directory that end with ‘.txt’ or ‘.dot’. Will raise an error if there are no files. Files that end with ‘.txt’ or ‘.dot’ but cannot be parsed will be ignored.

  • max_files (Optional[int]) – stops after loading this many files. If None, then there is no max

  • allow_multiple_norms (bool) – by default, CFGDataset will only allow unnormalized cfg’s when normalizer=None (if normalizer is not None, then any normalized cfg added will be renormalized). Setting allow_multiple_norms to True will allow this CFGDataset to store cfg data with any normalization method (assuming normalizer=None)

  • progress (bool) – if True, will show a progressbar when loading cfg’s from load_path

  • metadata (Optional[Dict]) – a dictionary of metadata to attach to this CFGDataset. NOTE: passed dictionary will be shallow copied

  • num_workers (int) – if > 1, then the loading of data using the load_path parameter will be split over this many processes

  • add_data_kwargs (Any) – extra kwargs to pass to add_data while adding cfgs

add_data(*cfg_data, inplace=True, force_renormalize=False, progress=False)[source]

Adds data to this dataset

Parameters:
  • cfg_data (Union[CFG, CFGDataset, Iterable]) – arbitrary amount of CFG/CFGDataset’s, or iterables of them, to add to this dataset

  • inplace (bool, optional) – whether or not to normalize the incoming cfg_data inplace. Defaults to True.

  • force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != this dataset’s normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.

  • progress (bool, optional) – if True, will show a progressbar when adding multiple cfgs. Defaults to False.

Raises:
  • TypeError – when attempting to add something that is not a CFG, CFGDataset, or iterables of them

  • ValueError – when attempting to use multiple different normalizers and self.allow_multiple_norms=False
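One reading of how normalizer, allow_multiple_norms, and force_renormalize interact when a CFG is added is sketched below. The helper incoming_cfg_action is hypothetical, normalizers are represented as plain strings, and the branching is an interpretation of the parameter descriptions above, not the library's actual code.

```python
from typing import Optional


def incoming_cfg_action(
    dataset_normalizer: Optional[str],
    cfg_normalizer: Optional[str],
    allow_multiple_norms: bool = False,
    force_renormalize: bool = False,
) -> str:
    """Hypothetical decision helper; returns 'keep' or 'normalize'."""
    if dataset_normalizer is not None:
        # Dataset has a normalizer: bring every incoming CFG in line with it
        if cfg_normalizer != dataset_normalizer or force_renormalize:
            return "normalize"
        return "keep"
    # No dataset normalizer: only unnormalized CFGs are accepted by default
    if cfg_normalizer is None or allow_multiple_norms:
        return "keep"
    raise ValueError("normalized CFG added to a dataset with normalizer=None")


actions = [
    incoming_cfg_action("norm_a", None),
    incoming_cfg_action("norm_a", "norm_a"),
    incoming_cfg_action("norm_a", "norm_a", force_renormalize=True),
    incoming_cfg_action(None, "norm_a", allow_multiple_norms=True),
]
```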

property asm_counts

A collections.Counter() of all unique assembly lines and their counts across all cfg’s in this dataset

cfgs = None

The list of all cfgs in this dataset

dumps()[source]

Returns this object pickled with pickle.dumps()

classmethod load(path)[source]

Loads this CFGDataset from path

metadata = None

A dictionary of metadata associated with this CFGDataset

normalize(normalizer=None, inplace=True, force_renormalize=False, progress=False)[source]

Normalize this CFGDataset.

Parameters:
  • normalizer (Union[str, Normalizer]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.

  • inplace (bool, optional) – by default, normalizes this dataset inplace (IE: without copying objects). Can set to False to return a copy. Defaults to True.

  • force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer != the passed normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.

  • progress (bool, optional) – if True, will show a progressbar while normalizing. Defaults to False.

Returns:

this dataset normalized

Return type:

CFGDataset

normalizer = None

The normalizer used in this dataset, or None if there is no normalizer

property num_asm_lines

Return total number of assembly lines across all cfg’s

property num_blocks

Return total number of blocks across all cfg’s

property num_cfgs

Return the number of cfgs in this dataset

property num_edges

Return total number of edges across all cfg’s

property num_functions

Return total number of functions across all cfg’s

save(path)[source]

Saves this CFGDataset to path

bincfg.cfg.cfg_edge module

Classes/Methods involving edges in a CFG object

class bincfg.cfg.cfg_edge.CFGEdge(from_block: CFGBasicBlock, to_block: CFGBasicBlock, edge_type: EdgeType | str)[source]

Bases: object

A single immutable edge in a CFG object

Parameters:
  • from_block (CFGBasicBlock) – ‘from’ CFGBasicBlock object

  • to_block (CFGBasicBlock) – ‘to’ CFGBasicBlock object

  • edge_type (Union[EdgeType, str]) –

    the edge type. Can be either an EdgeType object or a string. String values include:

    • ’normal’: an EdgeType.NORMAL edge

    • ’function_call’: an EdgeType.FUNCTION_CALL edge

edge_type: EdgeType

The type of this edge

from_block: CFGBasicBlock

The from block of this directed edge

property is_branch: bool

True if this edge is one of a branching instruction, False otherwise

Specifically, returns True if this edge’s from_block has exactly two outgoing edges, both of which are ‘normal’ edges. Sometimes, it is possible for blocks to have more than two ‘normal’ edges out (IE: jump tables), and those are NOT considered branches and this method would return False
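The branch rule can be restated as a small predicate. This is a minimal sketch that takes the edge types of the from_block's outgoing edges as plain strings; the function name and signature are for illustration only and differ from the CFGEdge property.

```python
from typing import List


def is_branch(outgoing_edge_types: List[str]) -> bool:
    """True when the source block has exactly two outgoing edges, both 'normal'."""
    return len(outgoing_edge_types) == 2 and all(
        t == "normal" for t in outgoing_edge_types
    )


conditional_jump = is_branch(["normal", "normal"])
# Jump tables have more than two normal edges and are not branches
jump_table = is_branch(["normal", "normal", "normal"])
# A call plus a fall-through edge is not a branch either
call_then_continue = is_branch(["normal", "function_call"])
```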

property is_function_call_edge: bool

True if this is a ‘function_call’ edge type, False otherwise

property is_normal_edge: bool

True if this is a ‘normal’ edge type, False otherwise

to_block: CFGBasicBlock

The to block of this directed edge

class bincfg.cfg.cfg_edge.EdgeType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enum for different edge types for CFGBasicBlock objects.

FUNCTION_CALL = 2

an edge going from a basic block to another basic block in another function (or the same function).

The outgoing edge should always connect to a function entry block (IE: that block’s .is_function_entry would be True).

NORMAL = 1

a normal edge as a result of some branching/jumping instruction, or plain continuation to a next block

(IE: an edge of control flow that does not involve calling a function)

bincfg.cfg.cfg_edge.get_edge_type(edge_type: EdgeType | str) EdgeType[source]

Returns the edge type (a member of the EdgeType enum)

Parameters:

edge_type (Union[EdgeType, str]) – can be either an EdgeType object or a string. String values include:

  • ‘normal’: an EdgeType.NORMAL edge

  • ‘function_call’: an EdgeType.FUNCTION_CALL edge

Raises:
  • ValueError – for an unknown EdgeType string

  • TypeError – for a bad edge_type type

Returns:

the given edge_type as a member of the EdgeType enum

Return type:

EdgeType
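The conversion and its two error cases can be sketched with a stand-in enum. SketchEdgeType and to_edge_type are hypothetical names; the sketch matches the two documented strings exactly, and whether the real function accepts other spellings or letter cases is not stated here.

```python
from enum import Enum
from typing import Union


class SketchEdgeType(Enum):
    NORMAL = 1
    FUNCTION_CALL = 2


# Only the two string values named in the documentation above
_BY_NAME = {
    "normal": SketchEdgeType.NORMAL,
    "function_call": SketchEdgeType.FUNCTION_CALL,
}


def to_edge_type(edge_type: Union[SketchEdgeType, str]) -> SketchEdgeType:
    if isinstance(edge_type, SketchEdgeType):
        return edge_type  # already an enum member
    if isinstance(edge_type, str):
        if edge_type not in _BY_NAME:
            raise ValueError(f"Unknown edge type string: {edge_type!r}")
        return _BY_NAME[edge_type]
    raise TypeError(
        f"edge_type must be an enum member or str, got {type(edge_type).__name__}"
    )


normal = to_edge_type("normal")
call = to_edge_type(SketchEdgeType.FUNCTION_CALL)
```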

bincfg.cfg.cfg_function module

class bincfg.cfg.cfg_function.CFGFunction(parent_cfg: CFG | None = None, address: int | str | Addressable | None = None, name: str | None = None, blocks: Iterable[CFGBasicBlock] | None = None, is_extern_function: bool = False, metadata: dict | None = None)[source]

Bases: object

A single function in a CFG

Can be initialized empty, or by passing kwarg values.

NOTE: these objects should not be pickled/copy.deepcopy()-ed by themselves, only as a part of a cfg

Parameters:
  • parent_cfg (Optional[bincfg.CFG]) – the parent CFG object to which this CFGFunction belongs

  • address (Optional[AddressLike]) – the memory address of this function. If not present, then the address will be set to -1

  • name (Optional[str]) – the string name of this function. If not present, or if the name passed is the empty string, this function is given a default name ‘__UNNAMMED_FUNC_X’ where ‘X’ is the memory address

  • blocks (Optional[Iterable[CFGBasicBlock]]) – if None, will be initialized to an empty list, otherwise an iterable of CFGBasicBlock objects that are within this function

  • is_extern_function (bool) – if True, then this function is an external function (a dynamically loaded function)

  • metadata (Optional[dict]) – optional dictionary of metadata to associate with this function

address: int

the integer memory address of this function. Will be -1 if not initialized yet

property asm_counts: Mapping[str, int]

A collections.Counter of all unique assembly lines and their counts in this function

blocks: list[CFGBasicBlock]

list of all basic blocks in this function

property called_by: list[CFGBasicBlock]

A list of CFGBasicBlock’s that call this function

Specifically, the list of all CFGBasicBlock objects in this function’s .parent_cfg CFG object that call this function. If this CFGFunction has no parent, then the empty list will be returned.

NOTE: this is computed dynamically each call (as CFG objects are mutable), so it may be useful to compute it once per function and save it if needed

property function_entry_block: CFGBasicBlock

The CFGBasicBlock that is the function entry block

Specifically, returns the first CFGBasicBlock found that has the same address as this function (there ~should~ only be one as each basic block ~should~ have a unique memory address)

property is_extern_function: bool

True if this function is an external function, False otherwise

property is_intern_function: bool

True if this function is an internal function, False otherwise

property is_recursive: bool

True if this function calls itself at some point

Specifically, if at least one CFGBasicBlock in this CFGFunction.blocks list has an edges_out function call address that is equal to this CFGFunction’s address
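The recursion check amounts to scanning every block's outgoing edges for a function call that targets the function's own address. The sketch below represents each block as a list of hypothetical (target_address, edge_type) pairs instead of CFGEdge objects.

```python
from typing import Iterable, List, Tuple

# Hypothetical outgoing edge representation: (target_address, edge_type)
Edge = Tuple[int, str]


def is_recursive(function_address: int, blocks_edges_out: Iterable[List[Edge]]) -> bool:
    return any(
        edge_type == "function_call" and target == function_address
        for edges_out in blocks_edges_out
        for target, edge_type in edges_out
    )


# Second block calls back into the function's own entry address
self_calling = is_recursive(0x100, [[(0x110, "normal")], [(0x100, "function_call")]])
# A normal edge back to the entry is a loop, not a recursive call
loop_only = is_recursive(0x100, [[(0x100, "normal")]])
```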

property is_root_function: bool

True if this function is not called by any other functions, False otherwise

metadata: dict

Dictionary of metadata associated with this function

name: str

the string name of this function. Will be given a default name based on its memory address if not present

property num_asm_lines: int

The total number of assembly lines across all blocks in this function

property num_blocks: int

The number of basic blocks in this function

parent_cfg: CFG | None

the parent CFG object to which this CFGFunction belongs, or None if it hasn’t been initialized yet

bincfg.cfg.cfg_function.CFGFunctionPickledState

The pickled state of a function

alias of Tuple[int, str, Tuple[CFGBasicBlock, …], bool]

bincfg.cfg.cfg_parsers module

Functions to parse cfg inputs into CFG objects.

exception bincfg.cfg.cfg_parsers.CFGParseError[source]

Bases: Exception

bincfg.cfg.cfg_parsers.get_asm_from_node_label(label)[source]

Converts a node’s label into a list of assembly lines at that basic block.

Parameters:

label (str) – the unparsed string label

Returns:

tuple of 2 lists: (asm_lines, asm_memory_addresses)

Return type:

Tuple[List[str], List[int]]

bincfg.cfg.cfg_parsers.parse_cfg_data(cfg, data)[source]

Parses the incoming cfg data. Infers type of data

Parameters:
  • cfg (CFG) – the cfg to parse into

  • data (Union[str, Sequence[str], TextIO, pd.DataFrame]) –

    the data to parse, can be:

    • string: either string with newline characters that will be split on all newlines and treated as either a text or graphviz rose input, or a string with no newline characters that will be treated as a filename. Filenames will be opened as ghidra parquet files if they end with either ‘.pq’ or ‘.parquet’, and text/graphviz rose input otherwise

    • Sequence of string: will be treated as already-read-in text/graphviz rose input

    • open file object: will be read in using .readlines, then treated as text/graphviz rose input

    • pandas dataframe: will be parsed as ghidra parquet file

Raises:
  • ValueError – bad str filename, or an unknown file start string

  • TypeError – bad data input type

  • CFGParseError – if there is an error during CFG parsing (but data type was inferred correctly)
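The type inference described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `classify_cfg_input` and the labels it returns are hypothetical names, and the pandas dataframe branch is left out so the sketch needs only the standard library.

```python
PARQUET_SUFFIXES = (".pq", ".parquet")


def classify_cfg_input(data):
    """Return a label describing how `data` would be treated (hypothetical helper)."""
    if isinstance(data, str):
        if "\n" in data:
            # A string containing newlines is already file content
            return "text/graphviz lines"
        if data.endswith(PARQUET_SUFFIXES):
            # A newline-free string is a filename; the suffix selects the parser
            return "ghidra parquet file"
        return "text/graphviz file"
    if hasattr(data, "readlines"):
        # Open file objects are read with .readlines()
        return "text/graphviz lines"
    if isinstance(data, (list, tuple)) and all(isinstance(x, str) for x in data):
        # A sequence of strings is treated as already-read-in lines
        return "text/graphviz lines"
    raise TypeError(f"bad data input type: {type(data).__name__}")
```

Checking for a newline first is what lets a single `str` argument serve as either a filename or as file content.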

bincfg.cfg.cfg_parsers.parse_rose_gv(cfg, lines)[source]

Reads input as a graphviz file

Parameters:
  • cfg (CFG) – an empty/loading CFG() object

  • lines (str, Iterable[str], TextIO) – the data to parse. Can be a string (which will be split on newlines to get each individual line), a list of string (each element will be considered one line), or an open file to call .readlines() on

Raises:

CFGParseError – when the file cannot be parsed correctly

bincfg.cfg.cfg_parsers.parse_rose_txt(cfg, lines)[source]

Reads input as a .txt file

Parameters:
  • cfg (CFG) – an empty/loading CFG() object

  • lines (str, Iterable[str], TextIO) – the data to parse. Can be a string (which will be split on newlines to get each individual line), a list of string (each element will be considered one line), or an open file to call .readlines() on

Raises:

CFGParseError – when file does not fit expected format
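Both rose parsers accept the same three shapes for `lines`. A minimal sketch of that normalization step, assuming a hypothetical helper named `to_lines` (the real parsers may strip or keep line endings differently):

```python
import io


def to_lines(lines):
    """Normalize a str, an open file, or an iterable of str into a list of lines."""
    if isinstance(lines, str):
        # A single string is split on newlines
        return lines.split("\n")
    if hasattr(lines, "readlines"):
        # An open file is read with .readlines(); trailing newlines are dropped here
        return [line.rstrip("\n") for line in lines.readlines()]
    # Anything else is assumed to be an iterable with one line per element
    return list(lines)


example = to_lines(io.StringIO("first\nsecond\n"))
```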

bincfg.cfg.mem_cfg module

class bincfg.cfg.mem_cfg.MemCFG(cfg: CFG, normalizer: str | NormalizerType | None = None, keep_memory_addresses: bool = False, inplace: bool = False, using_tokens: dict | AtomicTokenDict | None = None, force_renormalize: bool = False)[source]

Bases: object

A CFG that is more memory/speed efficient.

Keeps only the bare minimum information needed from a CFG. Stores edge connections in a CSR-like format.

Parameters:
  • cfg (CFG) – a CFG object. Can be normalized or un-normalized. If un-normalized, then it will be normalized using the normalizer parameter.

  • normalizer (Optional[Union[str, Normalizer]]) – the normalizer to use to normalize the incoming CFG (or None if it is already normalized). If the incoming CFG object has already been normalized, and normalizer is not None, then this will attempt to normalize the CFG again with this normalizer

  • keep_memory_addresses (bool) – if True, then memory addresses will also be kept. Otherwise they will be removed to save space

  • inplace (bool) – if True and cfg needs to be normalized, it will be normalized inplace

  • using_tokens (Union[Dict[str, int], AtomicTokenDict]) – if not None, then a dictionary mapping token strings to integer values. Any tokens in cfg but not in using_tokens will be added. Can also be an AtomicTokenDict for atomic updates to tokens

  • force_renormalize (bool) – by default, this method will only normalize cfg’s whose .normalizer differs from the passed normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer.

class BlockInfoBitMask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An Enum for block info bit masks

Each value is a tuple of the bit mask for that boolean, and a function to call with the block that returns a boolean True if that bit should be set, False otherwise. If True, then that bit will be ‘1’ in that block’s block_flags int.

IS_FUNCTION_CALL: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (1, <function MemCFG.BlockInfoBitMask.<lambda>>)

Bit set if this block is a function call. See is_function_call()

IS_FUNCTION_ENTRY: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (2, <function MemCFG.BlockInfoBitMask.<lambda>>)

Bit set if this block is a function entry. See is_function_entry()

IS_FUNCTION_JUMP: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (8, <function MemCFG.BlockInfoBitMask.<lambda>>)

Bit set if this block is a function jump. IE: this block has a jump instruction that resolves to a basic block in a separate function. See is_function_jump()

IS_IN_EXTERN_FUNCTION: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (4, <function MemCFG.BlockInfoBitMask.<lambda>>)

Bit set if this block is within an external function. See is_extern_function()

IS_MULTI_FUNCTION_CALL: Tuple[int, Callable[[CFGBasicBlock], Literal[0, 1, True, False]]] = (16, <function MemCFG.BlockInfoBitMask.<lambda>>)

Bit set if this block is a multi-function call. IE: this block has either two or more function call edges out, or one function call and two or more normal edges out. See is_multi_function_call()

Currently not setting the block here in _block_flags_int(), but instead in MemCFG initialization in order to save time (we don’t have to compute get_sorted_edges() multiple times)

property architecture: Architectures

Returns the architecture being used. Currently a WIP

Checks for an ‘arch’ or ‘architecture’ key in the metadata and returns it if it is known. Can currently return: ‘java’, ‘x86’

asm_lines: ndarray

Assembly line information

A contiguous 1-d numpy array of shape (num_asm_lines,) of integer assembly line tokens. Dtype is the smallest unsigned dtype needed to store the largest token value in this MemCFG

To get the assembly lines for some block index block_idx, you must get the assembly line indices from block_asm_idx, and use those to slice the assembly lines:

>>> block_idx = 7
>>> memcfg.asm_lines[memcfg.block_asm_idx[block_idx]:memcfg.block_asm_idx[block_idx + 1]]

Also see get_block_asm_lines()

asm_memory_addresses: None | ndarray

Memory addresses for all of the assembly lines

Only saved if keep_memory_addresses=True when constructing the MemCFG. This will be a 1-d signed integer numpy array, where a value of -1 means the memory address for that corresponding line was not present in the basic block

block_asm_idx: ndarray

Indices in asm_lines that correspond to the assembly lines for each basic block in this MemCFG

A 1-d numpy array of shape (num_blocks + 1,). Dtype is the smallest unsigned dtype needed to store the value num_asm_lines. Assembly tokens for a block at index i would have a start index of block_asm_idx[i] and an end index of block_asm_idx[i + 1] in asm_lines.
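The start/end indexing described above works like the row pointer of a CSR matrix. A small sketch with plain Python lists standing in for the numpy arrays (the values and the helper `block_tokens` are made up for illustration):

```python
# Token stream for 4 blocks, concatenated in block order
asm_lines = [5, 2, 9, 9, 1, 4]

# num_blocks + 1 offsets: block i owns asm_lines[block_asm_idx[i]:block_asm_idx[i + 1]]
block_asm_idx = [0, 2, 2, 5, 6]


def block_tokens(block_idx):
    """Return the assembly tokens that belong to one block."""
    start = block_asm_idx[block_idx]
    end = block_asm_idx[block_idx + 1]
    return asm_lines[start:end]


per_block = [block_tokens(i) for i in range(len(block_asm_idx) - 1)]
```

The extra trailing offset means the last block needs no special case, and an empty block simply has equal start and end offsets.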

block_asm_mem_addr_idx: ndarray | None

Indices in block_memory_addresses that correspond to the assembly line memory addresses for basic blocks

A 1-d numpy array of shape (num_blocks + 1,). Dtype is the smallest unsigned dtype needed to store the number of assembly line memory addresses. Memory addresses for a block at index i would have a start index of block_asm_mem_addr_idx[i] and an end index of block_asm_mem_addr_idx[i + 1] in block_memory_addresses. Only saved if keep_memory_addresses=True when constructing the MemCFG.

block_flags: ndarray

Integer of bit flags for each basic block

A 1-d numpy array of shape (num_blocks,) where each element is an integer of bit flags. See BlockInfoBitMask for more info. Dtype is the smallest unsigned dtype with enough bits to store all flags in BlockInfoBitMask

Also see get_block_flags()
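As a rough illustration of how one block_flags integer can be unpacked, the sketch below mirrors the bit values listed under BlockInfoBitMask (1, 2, 4, 8, 16) in a standalone `IntFlag`. The class `BlockFlag` and the function `decode_flags` are hypothetical and are not part of bincfg.

```python
from enum import IntFlag


class BlockFlag(IntFlag):
    # Bit values copied from the BlockInfoBitMask entries documented above
    IS_FUNCTION_CALL = 1
    IS_FUNCTION_ENTRY = 2
    IS_IN_EXTERN_FUNCTION = 4
    IS_FUNCTION_JUMP = 8
    IS_MULTI_FUNCTION_CALL = 16


def decode_flags(flags_int):
    """Map each flag name to True/False for a single block's flag integer."""
    return {flag.name: bool(flags_int & flag) for flag in BlockFlag}


# 0b01011 == 1 + 2 + 8: a function call, a function entry and a function jump
decoded = decode_flags(0b01011)
```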

block_func_idx: ndarray

Integer ids for the function that each basic block belongs to

A 1-d numpy array of shape (num_blocks,) where each element is a function id for the block at that index. The id can be found in function_name_to_idx. Dtype is the smallest unsigned dtype needed to store the value num_functions

Also see get_block_function_idx() and get_block_function_name()
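Several arrays in this class use "the smallest unsigned dtype needed to store" some maximum value. One way that choice could be made is sketched below; `smallest_unsigned_dtype` is a hypothetical helper that returns the dtype name as a string so the example runs without numpy.

```python
def smallest_unsigned_dtype(max_value):
    """Return the name of the narrowest unsigned integer dtype that can hold max_value."""
    if max_value < 0:
        raise ValueError("unsigned dtypes cannot store negative values")
    for bits in (8, 16, 32, 64):
        # An unsigned integer with `bits` bits stores values in [0, 2**bits - 1]
        if max_value < 2 ** bits:
            return f"uint{bits}"
    raise ValueError(f"{max_value} does not fit in 64 bits")


dtype_for_300_functions = smallest_unsigned_dtype(300)
```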

block_memory_addresses: ndarray | None

Integer memory addresses of basic blocks.

Only saved if keep_memory_addresses=True when constructing the MemCFG. This will be a 1-d unsigned integer numpy array containing the memory addresses

block_metadata: list[int | dict]

Metadata for blocks

A list of run length compressed metadata at the basic block level. We only compress metadata dictionaries that are empty. Elements are in the same order as the block indices in block_asm_idx. Elements are either dictionaries (for the metadata of that current block), or integers indicating we should skip that many blocks as they all have no metadata.
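The run-length scheme described above can be sketched as a pair of functions. `compress_metadata` and `expand_metadata` are hypothetical names used only for this illustration; the library's own encoding may differ in detail.

```python
def compress_metadata(metadata):
    """Collapse runs of empty dicts into integer skip counts."""
    compressed = []
    for md in metadata:
        if md:
            # Non-empty metadata is stored as-is
            compressed.append(md)
        elif compressed and isinstance(compressed[-1], int):
            # Extend the current run of empty entries
            compressed[-1] += 1
        else:
            # Start a new run of empty entries
            compressed.append(1)
    return compressed


def expand_metadata(compressed):
    """Invert compress_metadata: one dict per block, in block order."""
    expanded = []
    for item in compressed:
        if isinstance(item, int):
            expanded.extend({} for _ in range(item))
        else:
            expanded.append(item)
    return expanded


packed = compress_metadata([{}, {}, {"note": "entry"}, {}, {}, {}])
```

Here `packed` is `[2, {"note": "entry"}, 3]`, so a graph in which most blocks carry no metadata costs only a handful of list elements.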

drop_tokens() Self[source]

Sets the tokens in this normalizer to None. Make sure you only do this if tokens are saved elsewhere! Returns self

dumps() str[source]

Returns this object pickled with pickle.dumps()

function_metadata: list[int | dict]

Metadata for functions

A list of run length compressed metadata at the function level. We only compress metadata dictionaries that are empty. Elements are in the same order as the function indices in block_func_idx. Elements are either dictionaries (for the metadata of that current function), or integers indicating we should skip that many functions as they all have no metadata.

function_name_to_idx: dict[str, int]

Dictionary mapping string function names to their integer ids used in this MemCFG

get_block_asm_lines(block_idx: int) ndarray[source]

Get the asm lines associated with this block index

Parameters:

block_idx (int) – integer block index

Returns:

a 1-d numpy array of unsigned integer assembly tokens

Return type:

np.ndarray

get_block_asm_memory_addresses(block_idx: int) ndarray[source]

Get the asm memory addresses associated with this block index

Values are -1 if the memory address did not exist in that block

Parameters:

block_idx (int) – integer block index

Returns:

a 1-d numpy array of signed integer memory addresses

Return type:

np.ndarray

get_block_edges_out(block_idx: int, ret_edge_types: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Get numpy array of block indices for all edges out associated with the given block index

Parameters:
  • block_idx (int) – integer block index

  • ret_edge_types (bool) –

    if True, will also return a numpy array (1-d, dtype np.uint8) containing the edge type values for each edge with values:

    • 1: normal edge

    • 2: function call edge

Returns:

either a 1-d numpy array of unsigned integer block indices for all edges out associated with the given block index, or if ret_edge_types=True, then a tuple of (block_edge_inds, edge_types) where the edge_types is a 1-d numpy array of uint8 edge types with the same shape as block_edge_inds that designates the types of the edges. Edge types will be the values of those in the EdgeType enum.

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

get_block_flags(block_idx: int) Tuple[bool, bool, bool, bool, bool, bool][source]

Get all block flags for the given block index

Parameters:

block_idx (int) – integer block index

Returns:

(is_block_function_call, is_block_function_entry, is_block_extern_function, is_block_function_jump, is_block_multi_function_call)

Return type:

Tuple[bool, bool, bool, bool, bool, bool]

get_block_function_idx(block_idx: int) int[source]

Get the function index for the given block index

Parameters:

block_idx (int) – integer block index

Returns:

the integer function index for the given block index

Return type:

int

get_block_function_name(block_idx: int) str[source]

Get the function name for the given block index

Functions without names will start with ‘__unnamed_func__’

Parameters:

block_idx (int) – integer block index

Returns:

the function name for the given block index

Return type:

str

get_block_info(block_idx)[source]

Returns all the info associated with the given block index as a dictionary

Parameters:

block_idx (int) – integer block index

Returns:

the block info dictionary with keys/values:

  • ’asm_lines’ (np.ndarray): 1-d numpy array of unsigned integer assembly line tokens in this block

  • ’asm_memory_addresses’ (np.ndarray): 1-d numpy array of signed integer memory addresses for the assembly lines in this block. Values will be -1 if the memory addresses do not exist

  • ’edges_out’ (np.ndarray): 1-d numpy array of unsigned integer block indices for all of the edges out from this block

  • ’edge_types’ (np.ndarray): 1-d numpy array of uint8 values for the edge types associated with all of the edges out. These are the values of objects in the EdgeType enum. Currently: EdgeType.NORMAL == 1, EdgeType.FUNCTION_CALL == 2

  • ’function_index’ (int): the integer function index of the function this block resides in

  • ’is_function_call’ (bool): true if this block is a function call block (has at least one outgoing function call edge)

  • ’is_function_entry’ (bool): true if this block is a function entry block (has the same memory address as its parent function)

  • ’is_extern_function’ (bool): true if this block is within an external function (parent_function.is_extern_function is True)

  • ’is_function_jump’ (bool): true if this block is a function jump block (has a ‘normal’ edge to a block that is within another function)

  • ’is_multi_function_call’ (bool): true if this block is a multi-function call block (has 2 or more outgoing function call edges. IE: a call table)

  • ’metadata’ (dict): dictionary of metadata associated with this block

Return type:

dict

get_block_memory_address(block_idx: int) int[source]

Returns the memory address for the given block, if present, -1 if not present

Parameters:

block_idx (int) – integer block index

Returns:

the memory address

Return type:

int

get_block_metadata(block_idx: int | None) dict | list[dict][source]

Returns the metadata associated with that block index

Parameters:

block_idx (Union[int, None]) – the integer block index of the metadata to get, or None to get the full list of metadata

Returns:

dictionary of metadata associated with the given block index

Return type:

Union[dict, list[dict]]

get_coo_indices() ndarray[source]

Returns the COO indices for this MemCFG

Returns a 2-d numpy array of shape (num_edges, 2) of dtype np.int32. Each row is an edge, column 0 is the ‘row’ indexer, and column 1 is the ‘column’ indexer. EG:

original = np.array([
    [0, 1],
    [1, 1]
])

coo_indices = np.array([
    [0, 1],
    [1, 0],
    [1, 1]
])

NOTE: this returns as type np.int32 since pytorch can be finicky about what dtypes it wants

NOTE: pytorch sparse_coo_tensor’s indices are the transpose of the array this method returns

Returns:

a 2-d numpy array of shape (num_edges, 2) of dtype np.int32 containing COO indices

Return type:

np.ndarray
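To make the layout concrete, the sketch below derives the COO indices from the small dense matrix used in the example above and then transposes them into the (2, num_edges) layout that the note says pytorch expects. Plain lists stand in for numpy arrays, and `dense_to_coo` is a hypothetical helper.

```python
def dense_to_coo(matrix):
    """Return [row, column] pairs for every non-zero entry, in row-major order."""
    return [
        [r, c]
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    ]


original = [
    [0, 1],
    [1, 1],
]

coo_indices = dense_to_coo(original)

# Transposed layout: one list of row indices, one list of column indices
transposed = [list(axis) for axis in zip(*coo_indices)]
```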

get_edge_values() ndarray[source]

Returns the edge type values

Returns a 1-d numpy array of length self.num_edges and dtype np.int32 containing an integer type for each edge depending on if it is a normal or function call edge. Edges are directed and have values from EdgeType enum. Values:

  • 1: ‘normal’ edges

  • 2: ‘function call’ edges

NOTE: this returns as type np.int32 since pytorch can be finicky about what dtypes it wants

Returns:

a 1-d numpy array of length self.num_edges and dtype np.int32 containing integer edge types

Return type:

np.ndarray

get_function_block_inds(func_idx: int) list[int][source]

Returns all of the block indices that are within the given function

Parameters:

func_idx (int) – the integer function index

Returns:

list of integer block indices that are within the given function

Return type:

list[int]

get_function_metadata(func_idx: int | None) dict | list[dict][source]

Returns the metadata associated with that function index

Parameters:

func_idx (Union[int, None]) – the integer function index of the metadata to get, or None to get the full list of metadata

Returns:

dictionary of metadata associated with the given function index

Return type:

Union[dict, list[dict]]

graph_c: ndarray

Array containing all of the outgoing edges for each block in order

1-D numpy array of shape (num_edges,). Dtype will be the smallest unsigned dtype required to store the value num_blocks + 1. Each element is a block index to which that edge connects. Edges will be in the order they appear in each block’s edges_out attribute, for each block in order of their block_idx.

Also see get_block_edges_out()

NOTE: this also inherently contains information on which types of edges they are. If the block is NOT a function call (stored as a bit flag in the block_flags array), then all edges for that block are normal edges. If it IS a function call, then there are 3 cases:

  1. it has one outgoing edge: that edge is always a function call

  2. it has two outgoing edges, one function call, one normal: the first edge is the function call edge, the second is a normal edge

  3. it has >2 outgoing edges, or 2 function call edges: the edges will be listed first by function call edges, then by normal edges, with a separator in between. The separator will have the max unsigned int value for graph_c’s dtype. This is why we use the dtype that can store num_blocks + 1, since we need this extra value just in case. Whatever exactly it means for a basic block to have >2 outgoing edges while being a function call is left up to the user. Possibly due to call operators with non-explicit operands (eg: register memory locations)?
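The three cases above can be turned into a small decoder. The sketch below is only an interpretation of this description, not the library's code: `decode_block_edges` is a hypothetical name, plain lists stand in for a slice of graph_c, and the separator is passed in explicitly (for a uint8 graph_c it would be 255).

```python
NORMAL, FUNCTION_CALL = 1, 2


def decode_block_edges(edges, is_function_call, separator):
    """Return (target_block_indices, edge_types) for one block's slice of graph_c."""
    edges = list(edges)
    if not is_function_call:
        # Blocks without the function call flag only have normal edges
        return edges, [NORMAL] * len(edges)
    if separator in edges:
        # Case 3: function call edges, then the separator, then normal edges
        split = edges.index(separator)
        calls, normals = edges[:split], edges[split + 1:]
        return calls + normals, [FUNCTION_CALL] * len(calls) + [NORMAL] * len(normals)
    if len(edges) == 1:
        # Case 1: a single outgoing edge is always a function call
        return edges, [FUNCTION_CALL]
    if len(edges) == 2:
        # Case 2: the function call edge comes first, the normal edge second
        return edges, [FUNCTION_CALL, NORMAL]
    raise ValueError("function call block with >2 edges but no separator")


SEP = 255  # max value of uint8, assumed dtype for this example
targets, types = decode_block_edges([7, 8, SEP, 3], True, SEP)
```

Encoding the edge types this way avoids storing a second array of per-edge types alongside graph_c.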

graph_r: ndarray

Array containing information on the number of outgoing edges for each block

1-D numpy array of shape (num_blocks + 1,). Dtype will be the smallest unsigned dtype required to store the value num_edges. This array is a cumulative sum of the number of edges for each basic block. One could get all of the outgoing edges for a block using:

>>> start_idx = memcfg.graph_r[block_idx]
>>> end_idx = memcfg.graph_r[block_idx + 1]
>>> edges = memcfg.graph_c[start_idx:end_idx]

Also see get_block_edges_out()
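For intuition, the sketch below builds a graph_c/graph_r pair from per-block edge lists and then reads one block back with the slicing shown above. It uses plain lists and ignores the separator values that graph_c may contain for multi-function-call blocks; the names `edges_out` and `block_edges` are made up for the example.

```python
from itertools import accumulate

# Outgoing edge targets for 4 blocks, in block order
edges_out = [[1, 2], [], [0], [3, 1, 2]]

# graph_c: all edge targets concatenated in block order
graph_c = [target for targets in edges_out for target in targets]

# graph_r: a leading 0 followed by the running total of edges per block
graph_r = [0, *accumulate(len(targets) for targets in edges_out)]


def block_edges(block_idx):
    """Return the outgoing edge targets of one block."""
    start_idx = graph_r[block_idx]
    end_idx = graph_r[block_idx + 1]
    return graph_c[start_idx:end_idx]
```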

property inv_tokens: dict[int, str]

Returns the inverse of self.tokens: a dictionary mapping token integers to their original strings
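Inverting a token dictionary is a one-line comprehension. The sketch below uses a made-up `tokens` mapping and shows how the inverse can turn integer tokens back into strings; it assumes every token string maps to a distinct integer.

```python
# Hypothetical token dictionary: token string -> integer value
tokens = {"mov reg reg": 0, "add reg imm": 1, "ret": 2}

# Inverse mapping: integer value -> token string
inv_tokens = {value: token for token, value in tokens.items()}

# Decode a block's integer tokens back into their strings
decoded = [inv_tokens[t] for t in [0, 1, 1, 2]]
```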

is_block_extern_function(block_idx: int) bool[source]

True if this block is in an external function, False otherwise

is_block_function_call(block_idx: int) bool[source]

True if this block is a function call, False otherwise

is_block_function_entry(block_idx: int) bool[source]

True if this block is a function entry, False otherwise

is_block_function_jump(block_idx: int) bool[source]

True if this block is a function jump, False otherwise

is_block_multi_function_call(block_idx: int) bool[source]

True if this block is a multi-function call, False otherwise

classmethod load(path: str) MemCFG[source]

Loads a MemCFG from the given path

metadata: dict

Dictionary of metadata associated with this MemCFG

normalize(normalizer: str | NormalizerType | None = None, using_tokens: dict | AtomicTokenDict = None, inplace: bool = True, force_renormalize: bool = False) MemCFG[source]

Normalizes this MemCFG (in-place by default).

Parameters:
  • normalizer (Optional[Union[str, NormalizerType]]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.

  • using_tokens (Union[dict, AtomicTokenDict]) – tokens to use when normalizing

  • inplace (bool) – whether or not to normalize inplace. Defaults to True.

  • force_renormalize (bool) – by default, this method will only normalize this cfg if the passed normalizer is != self.normalizer. However if force_renormalize=True, then this will be renormalized even if it has been previously normalized with the same normalizer. Defaults to False.

Returns:

this MemCFG normalized

Return type:

MemCFG

normalizer: NormalizerType

The normalizer used to normalize input before converting to MemCFG

Can be shared with a MemCFGDataset object if this MemCFG is a part of one

property num_asm_lines: int

The number of assembly lines in this MemCFG

property num_blocks: int

The number of blocks in this MemCFG

property num_edges: int

The number of edges in this MemCFG

property num_functions: int

The number of functions in this MemCFG

save(path: str) None[source]

Saves this MemCFG to the given path

set_tokens(tokens: dict | AtomicTokenDict) Self[source]

Sets this MemCFG’s tokens to the given tokens, and returns self

to_adjacency_matrix(type: Literal['np', 'numpy', 'torch'] = 'np', sparse: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Returns an adjacency matrix representation of this memcfg’s graph connections

Edges are directed and have values from EdgeType enum. Values:

  • 1: ‘normal’ edges

  • 2: ‘function call’ edges

Parameters:
  • type (Literal["np", "numpy", "torch"]) –

    the type of matrix to return. Defaults to ‘np’. Can be:

    • ’np’/’numpy’ for a numpy ndarray (dtype: np.int32)

    • ’torch’/’pytorch’ for a pytorch tensor (type: LongTensor)

  • sparse (bool) –

    whether or not the return value should be a sparse matrix. Defaults to False. Has different behaviors based on type:

    • numpy array: returns a 2-tuple of sparse COO representation (indices, values).

      NOTE: the indices are the transpose of those from get_coo_indices()

      NOTE: if you want sparse CSR format, you already have it with self.graph_c and self.graph_r

    • pytorch tensor: returns a pytorch sparse COO tensor.

      NOTE: not using sparse CSR format for now since it seems to have less documentation/supportedness.

Returns:

an adjacency matrix representation of this MemCFG

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
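The relationship between the sparse (indices, values) form and the dense form can be sketched as below. Nested lists stand in for numpy arrays, `coo_to_dense` is a hypothetical helper, and the indices are given as one [source, target] pair per edge, i.e. the get_coo_indices() layout rather than its transpose.

```python
def coo_to_dense(num_blocks, coo_indices, edge_values):
    """Build a num_blocks x num_blocks matrix with edge type values at each edge."""
    matrix = [[0] * num_blocks for _ in range(num_blocks)]
    for (src, dst), value in zip(coo_indices, edge_values):
        # Edges are directed: row is the source block, column is the target block
        matrix[src][dst] = value
    return matrix


# Three blocks: 0 -> 1 (normal), 1 -> 2 (function call), 2 -> 0 (normal)
dense = coo_to_dense(3, [[0, 1], [1, 2], [2, 0]], [1, 2, 1])
```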

to_cfg() CFG[source]

Converts this MemCFG back into a CFG

NOTE: if keep_memory_addresses=False when constructing this MemCFG, then memory addresses will not be present and basic blocks will be given a memory address that is just their index in the block list

tokens: dict[str, int]

Dictionary mapping token strings to integer values used in this MemCFG

Can be shared with a MemCFGDataset object if this MemCFG is a part of one.

Can also be an AtomicTokenDict object for atomic token updates

update_metadata(other: dict) Self[source]

Updates this MemCFG’s metadata dictionary with the given dictionary, and returns self

bincfg.cfg.mem_cfg.assert_valid_idx(idx: int, max_val: int, objects_str: str) None[source]

Asserts that the idx passed is >= 0 and < max_val. objects_str is the type of object for error message (IE: ‘blocks’, ‘functions’, etc.)

bincfg.cfg.mem_cfg.default_max(iterable, default=None)[source]

Returns the max value in iterable, or default value if len(iterable) == 0

bincfg.cfg.mem_cfg_dataset module

class bincfg.cfg.mem_cfg_dataset.MemCFGDataset(cfg_data=None, using_tokens=None, normalizer=None, metadata=None, **add_data_kwargs)[source]

Bases: object

A CFGDataset that is more memory efficient

Parameters:
  • cfg_data (Optional[Union[str, CFG, CFGDataset, MemCFG, MemCFGDataset, Iterable]]) – the data to use. Can be None for an empty dataset, or a string (for input to CFG), CFG, CFGDataset, MemCFG, MemCFGDataset, or iterable of those values to add that data to this dataset

  • using_tokens (Optional[Union[Dict[str, int], AtomicTokenDict]]) – if passed, will initialize the token dictionary to this dictionary of tokens (will be copied). Can be an AtomicTokenDict to use an atomic file token dictionary

  • normalizer (Optional[Union[str, Normalizer]]) – the normalizer to use, or None to default to the normalizer of the first added CFG/MemCFG

  • metadata (Optional[Dict]) – a dictionary of metadata to attach to this MemCFGDataset

    NOTE: passed dictionary will be shallow copied

  • add_data_kwargs (Any) – kwargs to pass to self.add_data() when adding the passed cfg_data

add_data(*cfg_data, inplace=True, force_renormalize=False, progress=False)[source]

Adds data to this dataset

Parameters:
  • cfg_data (Union[str, CFG, MemCFG, CFGDataset, MemCFGDataset, Iterable]) – arbitrary amount of str (CFG input)/CFG/MemCFG/CFGDataset/MemCFGDataset’s, or iterables of them, to add to this dataset

  • inplace (bool, optional) – whether or not to normalize the incoming cfg_data inplace. Defaults to True.

  • force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer differs from this dataset’s normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.

  • mp (bool, optional) – if True, will use multiprocessing to normalize cfgs. Defaults to False.

  • progress (bool, optional) – if True, will show a progressbar when adding multiple cfgs. Defaults to False.

Raises:

TypeError – if something other than a cfg/dataset is passed in cfg_data

cfgs = None

The list of all memcfgs in this dataset

dumps()[source]

Returns this object pickled with pickle.dumps()

classmethod load(path)[source]

Loads this MemCFGDataset from path

metadata = None

A dictionary of metadata associated with this MemCFGDataset

normalize(normalizer=None, inplace=True, force_renormalize=False, progress=False)[source]

Normalize this MemCFGDataset.

Parameters:
  • normalizer (Union[str, Normalizer]) – the normalizer to use. Can be a Normalizer object, or a string, or None to use the default BaseNormalizer(). Defaults to None.

  • inplace (bool, optional) – by default, normalizes this dataset inplace (IE: without copying objects). Can set to False to return a copy. Defaults to True.

  • force_renormalize (bool, optional) – by default, this method will only normalize cfg’s whose .normalizer differs from the passed normalizer. However, if force_renormalize=True, then all cfg’s will be renormalized even if they have been previously normalized with the same normalizer. Defaults to False.

  • progress (bool, optional) – if True, will show a progressbar while normalizing. Defaults to False.

Returns:

this dataset normalized

Return type:

MemCFGDataset

normalizer = None

The normalizer used in this dataset, or None if there is no normalizer

property num_asm_lines

property num_blocks

property num_cfgs

property num_edges

property num_functions

remove_cfg(cfg_or_idx)[source]

Removes the given MemCFG (or index of MemCFG if cfg_or_idx is an integer) from this MemCFGDataset

Parameters:

cfg_or_idx (Union[MemCFG, int]) – cfg or index to remove

save(path, freeze_tokens=True)[source]

Saves this MemCFGDataset to path

Parameters:
  • path (str) – the filepath to save to

  • freeze_tokens (bool) – whether or not to ‘freeze’ the tokens in this MemCFGDataset. ‘freezing’ the tokens just means that, if an AtomicTokenDict is the current token dictionary for this MemCFGDataset, then its current data will be saved in the pickle file as a normal dict. This is useful for loading this data later so that the loading does not depend on being able to access the files for the AtomicTokenDict. Default: True. If the token dictionary is already a dict, then this has no effect

using_tokens = None

A dictionary mapping string tokens to their integer values

Can be an AtomicTokenDict for atomic updates to tokens

Module contents