bincfg.utils package

Various utility functions and objects.

AtomicTokenDict

When doing multithreaded processing with BinCFG, it is useful to be able to make atomic, synchronized updates to the tokens being used for normalization (that way, all MemCFGs use the same shared tokens). The AtomicTokenDict allows for atomic updates to a shared token dictionary, requiring only a shared filesystem to work. Using the atomicwrites pip package, it ensures that only one process at a time can update the pickle file containing the token dictionary.

There are a couple possible downsides depending on how you use it:

  1. Many simultaneous updates can be really slow. It may help to precompute most of the common tokens before starting a large multithreaded/HPC run to get over this initial hurdle.

  2. Crashed or interrupted code can cause deadlocks if it stops executing while the AtomicTokenDict is updating. If this occurs, deleting the lockfile (‘.[filename].lock’ where ‘[filename]’ is the name of the pickle file) fixes it.
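The lockfile cleanup described in item 2 can be scripted. The following is a minimal sketch, not part of BinCFG: the helper names default_lock_path and remove_stale_lock are hypothetical, and the sketch assumes the lock file sits in the same directory as the pickle file. Only run something like this when you are sure no process is still updating the dictionary.

```python
from pathlib import Path


def default_lock_path(pickle_path):
    """Assumed default lock location: '.<filename>.lock' beside the pickle file."""
    path = Path(pickle_path)
    return path.with_name(f".{path.name}.lock")


def remove_stale_lock(pickle_path):
    """Delete a leftover lock file. Returns True if a file was removed."""
    lock_path = default_lock_path(pickle_path)
    if lock_path.exists():
        lock_path.unlink()
        return True
    return False
```

If a custom lockpath was passed when the dictionary was created, that path should be removed instead of the assumed default.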

Submodules

bincfg.utils.atomic_token_dict module

Atomically update tokens

exception bincfg.utils.atomic_token_dict.AquireLockError(attempts, lock_path)[source]

Bases: Exception

class bincfg.utils.atomic_token_dict.AtomicData(init_data, filepath=None, lockpath=None, max_read_attempts=None, delete_file=False)[source]

Bases: object

A class that allows for atomic reading/updating of the given data to a pickle file

Parameters:
  • init_data (Any) – Data to initialize the atomic file with. If the atomic file already exists, then that data will be loaded

  • filepath (Optional[str]) – An optional filepath to store the dictionary, otherwise will be stored at ‘./atomic_dict.pkl’

  • lockpath (Optional[str]) – An optional filepath for the lock file to use to atomically update the dictionary, otherwise will be stored at ‘./.[filepath].lock’ where [filepath] is the given filepath parameter

  • max_read_attempts (Optional[int]) – An optional integer specifying the maximum number of attempts to atomically read this dictionary before giving up and raising an error. Set to None to attempt indefinitely. Defaults to None

  • delete_file (bool) – If True, then the file and lockfile will be deleted on initialization to start from scratch

aquire_lock()[source]

Acquires the lock needed to update data

NOTE: this will prevent any and all updates to the atomic file until self.release_lock() is called. Make sure you call it quickly or other processes may hang!

NOTE: if the lock has already been acquired, nothing will happen

NOTE: it can be dangerous to attempt to acquire locks yourself, as any errors raised must be handled nicely and self.release_lock() must be called, otherwise other processes may hang

atomic_read(default=<object object>)[source]

Atomically reads the data from file, updating self.data

Parameters:

default (Optional[Any]) – If this is passed and the file does not already exist, then this data will be saved to file and set to self.data

atomic_update(update_func, *update_args, **update_kwargs)[source]

Atomically updates the data

Will first acquire a lock on the data, read it in, then call update_func(file_data, current_data, *update_args, **update_kwargs), where file_data is the data from the current atomic file and current_data is the data currently held in memory, then write the returned data back to file and finally release the lock.

NOTE: this will prevent any and all updates to the atomic file until update_func has completed

NOTE: any errors within the update_func will be handled properly and will likely not mess up the atomic file

Parameters:
  • update_func (Callable) – function that takes in: the data currently saved in file, the current data, then the passed args and kwargs, and returns the updated data to write back to file

  • update_args (Any) – args to pass to update_func, after the current data saved in file

  • update_kwargs (Any) – kwargs to pass to update_func

Returns:

the updated data

Return type:

Any
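The lock, read, update, write, release cycle described for atomic_update can be sketched with the standard library alone. This is an illustration under stated assumptions, not the AtomicData implementation (which relies on the atomicwrites package): the function name sketch_atomic_update is hypothetical, the lock is an exclusively created file, and the write goes through a temporary file that is swapped in with os.replace.

```python
import os
import pickle
import tempfile
import time


def sketch_atomic_update(filepath, lockpath, curr_data, update_func, *args, **kwargs):
    """Lock, read, update, write, release. A stand-in, not the library's code."""
    # Acquire: O_EXCL makes creation fail if another process holds the lock
    while True:
        try:
            os.close(os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            break
        except FileExistsError:
            time.sleep(0.01)
    try:
        # Read the data currently saved in file (or start from curr_data)
        if os.path.exists(filepath):
            with open(filepath, "rb") as f:
                file_data = pickle.load(f)
        else:
            file_data = curr_data
        new_data = update_func(file_data, curr_data, *args, **kwargs)
        # Write to a temp file in the same directory, then swap it in
        directory = os.path.dirname(os.path.abspath(filepath))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(new_data, f)
        os.replace(tmp_path, filepath)
        return new_data
    finally:
        # Release the lock even if update_func raised
        os.remove(lockpath)
```

Releasing the lock in a finally block is what keeps an error inside update_func from leaving other processes waiting on a lock that never clears.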

atomic_write()[source]

Atomically writes the data at self.data to the pickle file

delete_file(force=False)[source]

Atomically deletes the file being used

release_lock()[source]

Releases the lock. Assumes it has already been acquired, otherwise an error will be raised

class bincfg.utils.atomic_token_dict.AtomicTokenDict(init_data=None, filepath=None, lockpath=None, max_read_attempts=None, delete_file=False)[source]

Bases: object

Acts like a normal token dictionary, but allows for atomic operations

Parameters:
  • init_data (Optional[Dict[str, int]]) – Data to initialize the atomic token dict with. If the atomic file already exists, then that data will be loaded

  • filepath (Optional[str]) – An optional filepath to store the dictionary, otherwise will be stored at ‘./atomic_dict.pkl’

  • lockpath (Optional[str]) – An optional filepath for the lock file to use to atomically update the dictionary, otherwise will be stored at ‘./.[filepath].lock’ where [filepath] is the given filepath parameter

  • max_read_attempts (Optional[int]) – An optional integer specifying the maximum number of attempts to atomically read this dictionary before giving up and raising an error. Set to None to attempt indefinitely. Defaults to None

  • delete_file (bool) – If True, then the file and lockfile will be deleted on initialization to start from scratch

addtokens(*tokens)[source]

Adds the given tokens to this dictionary, ignoring any that already exist

Parameters:

tokens (str) – arbitrary number of string tokens to add to this token dict

property data

Returns the token dictionary

delete_file()[source]

Deletes the atomic token dictionary file

property filepath

Return the filepath being used to store the atomic data

get(key, default=None)[source]

property inverse

Return a new dict containing an inverse mapping of this current dictionary

items()[source]

keys()[source]

property lock_path

Return the lock path being used to store the atomic data

setdefault(key, default=None)[source]

If the key exists, return the value. Otherwise set the key to the given default (or len(self) if default=None)
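The "next free integer" behavior of setdefault (and, presumably, of addtokens) can be illustrated on a plain dict. This is a sketch only: add_tokens is a hypothetical helper, and whether addtokens assigns ids exactly this way is an assumption.

```python
def add_tokens(token_dict, *tokens):
    """Give each unseen token the next free integer id; keep existing ids."""
    for token in tokens:
        # len() is evaluated before the insertion, so ids run 0, 1, 2, ...
        token_dict.setdefault(token, len(token_dict))
    return token_dict


tokens = add_tokens({}, "mov", "add", "mov")
print(tokens)  # {'mov': 0, 'add': 1}
```

Using the current length as the new id keeps ids dense as long as tokens are never removed.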

update(tokens)[source]

Updates this dictionary with the given tokens

Parameters:

tokens (Union[Dict[str, int], AtomicTokenDict]) – dictionary mapping token strings to their integer values. Any tokens in the dictionary that are not in this dictionary will be added, and any tokens that already exist and have the same value will be ignored. If there are any tokens that already exist, but have a different value, then an error will be raised
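The merge rule described for update can be sketched on plain dicts. The helper merge_tokens is hypothetical, and raising ValueError is an assumption, since the documentation only says that "an error" is raised on a conflict.

```python
def merge_tokens(current, incoming):
    """Merge incoming into current, refusing tokens that map to a different id."""
    # Check for conflicts first so that current is untouched on failure
    for token, value in incoming.items():
        if token in current and current[token] != value:
            raise ValueError(
                f"token {token!r} maps to {current[token]}, not {value}"
            )
    for token, value in incoming.items():
        current.setdefault(token, value)
    return current
```

Checking every token before writing any of them means a failed merge leaves the original dictionary unchanged.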

values()[source]

bincfg.utils.cfg_utils module

Utilities for CFG/MemCFG objects and their datasets

bincfg.utils.cfg_utils.check_for_normalizer(dataset, cfg_data)[source]

Checks the incoming data for a normalizer to set to be dataset’s normalizer

Assumes this dataset does not yet have a normalizer. Searches the incoming cfg_data for a cfg/dataset that has a normalizer, and sets it to be this dataset’s normalizer. If this method finds no normalizer, or multiple unique normalizers, then an error will be raised.

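Parameters:
  • dataset – the dataset, which does not yet have a normalizer, whose normalizer will be set

  • cfg_data – the incoming cfg/dataset data to search for a normalizer

Raises: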

ValueError – when there are multiple conflicting normalizers, or if no normalizer could be found

bincfg.utils.cfg_utils.get_address(obj: int | str | Addressable) → int[source]

Gets the integer address from the given object

Parameters:

obj (Union[str, int, Addressable]) – a string, int, or object with a string/int .address attribute (should always be positive)

Raises:
  • TypeError – obj is an unknown type

  • ValueError – given address is negative

Returns:

the integer address

Return type:

int
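A rough sketch of the conversion that get_address describes is shown below. The name sketch_get_address is hypothetical, and parsing strings with int(text, 0) (which accepts both "42" and "0x2a") is an assumption; the documentation does not say which string formats the real function accepts.

```python
def sketch_get_address(obj):
    """Return a non-negative integer address from an int, str, or .address holder."""
    # Unwrap objects that expose an .address attribute
    if not isinstance(obj, (int, str)):
        if not hasattr(obj, "address"):
            raise TypeError(f"cannot get an address from {type(obj).__name__}")
        obj = obj.address
    if isinstance(obj, str):
        address = int(obj, 0)  # assumption: base is inferred from the prefix
    elif isinstance(obj, int):
        address = obj
    else:
        raise TypeError(f"unsupported address type {type(obj).__name__}")
    if address < 0:
        raise ValueError(f"address must be non-negative, got {address}")
    return address
```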

bincfg.utils.cfg_utils.get_special_function_names()[source]

Returns the current global special function names

bincfg.utils.cfg_utils.update_atomic_tokens(file_tokens, curr_data, update_tokens)[source]

Updates atomic tokens. Only meant to be passed to AtomicData.atomic_update as the function to use

bincfg.utils.cfg_utils.update_memcfg_tokens(cfg_data, tokens)[source]

Adds all new tokens to tokens, and updates all tokens in cfg_data to their respective values in tokens

Tokens in cfg_data will be modified, as will the .asm_lines attribute of each memcfg. Assumes that cfg_data has tokens that conflict with those in tokens and thus needs modification. Both cfg_data and tokens will be modified in-place.

Parameters:
  • cfg_data (Union[MemCFG, MemCFGDataset]) – the memcfg/memcfgdataset to have its tokens changed

  • tokens (Union[Dict[str, int], AtomicData]) – the dictionary of tokens to update with the new tokens in cfg_data. Can be an AtomicData object for atomic updating of tokens

bincfg.utils.misc_utils module

Miscellaneous utility functions

exception bincfg.utils.misc_utils.EqualityCheckingError[source]

Bases: Exception

Error raised whenever there is an unexpected problem attempting to check equality between two objects

exception bincfg.utils.misc_utils.EqualityError(a, b, message=None)[source]

Bases: Exception

Error raised whenever an equal() check returns false and raise_err=True

class bincfg.utils.misc_utils.ParameterSaver(name, bases, dct)[source]

Bases: type

A metaclass used to add in parameter saving to the initialization function

This allows you to wrap __init__ of a class without having to worry about blocking IDEs from seeing its args/kwargs, and will apply the parameter saving to all child classes as well. Will default to insert_functions=True

bincfg.utils.misc_utils.arg_array_split(length, sections, return_index=None, dtype=<class 'numpy.uint32'>)[source]

Like np.array_split(), but returns the indices that one would split at

This will always return sections sections, even if sections > length (in which case, any empty sections will come at the end). If sections does not perfectly divide length, then any extras will be front-loaded, one per split array as needed.

NOTE: this code was modified from the numpy array_split() source

Parameters:
  • length (int) – the length of the sequence to split

  • sections (int) – the number of sections to split into

  • return_index (Optional[int]) – if not None, then an int to determine which tuple of (start, end) indices to return (IE: if you were splitting an array into 10 sections, and passed return_index=3, this would return the tuple of (start, end) indices for the 4th split array (since we start indexing at 0))

  • dtype (np.dtype) – the numpy dtype to use for the returned array

Returns:

a numpy array of length sections + 1 where the split array at index i would use the start/end indices [returned_array[i]:returned_array[i+1]], unless return_index is not None, in which case a 2-tuple of the (start_idx, end_idx) will be returned

Return type:

Union[np.ndarray, Tuple[int, int]]
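The index computation can be sketched in pure Python, following the same divmod logic that np.array_split uses. The name split_bounds is hypothetical, and this sketch returns a plain list rather than a numpy array of the requested dtype.

```python
from itertools import accumulate


def split_bounds(length, sections, return_index=None):
    """Boundaries such that section i covers bounds[i]:bounds[i + 1]."""
    each, extras = divmod(length, sections)
    # The first `extras` sections get one additional element (front-loaded)
    sizes = [each + 1] * extras + [each] * (sections - extras)
    bounds = [0] + list(accumulate(sizes))
    if return_index is not None:
        return bounds[return_index], bounds[return_index + 1]
    return bounds


print(split_bounds(10, 3))                  # [0, 4, 7, 10]
print(split_bounds(10, 3, return_index=1))  # (4, 7)
print(split_bounds(2, 4))                   # [0, 1, 2, 2, 2]
```

The last call shows the sections > length case: the two trailing sections are empty because their start and end boundaries coincide.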

bincfg.utils.misc_utils.eq_obj(a, b, selector=None, strict_types=<object object>, unordered=<object object>, raise_err=<object object>)[source]

Determines whether a == b, generalizing for more objects and capabilities than default __eq__() method. Equal() is an equivalence relation, and thus:

  1. equal(a, a) is always True (reflexivity)

  2. equal(a, b) implies equal(b, a) (symmetry)

  3. equal(a, b) and equal(b, c) implies equal(a, c) (transitivity)

NOTE: This method is not meant to be very fast. I will apply as many optimizations as feasibly possible that I can think of, but there will be various inefficient conversions of types to check equality.

NOTE: kwargs passed to the initial equal() function call will be passed to all subcalls, including those done in other objects using their built-in __eq__ function. Any objects can override those kwargs for any later subcalls (but not those above/adjacent).

NOTE: The selector kwarg is only used once, then consumed for any later subcalls

Parameters:
  • a (Any) – object to check equality

  • b (Any) – object to check equality

  • selector (Optional[str]) – if not None, then a string that determines the ‘selector’ to use on both objects for determining equality. It should start with either a letter (case-sensitive), underscore ‘_’, dot ‘.’ or bracket ‘[’. This string will essentially be appended to each object to get some attribute to determine equality of instead of the objects themselves. For example, if you have two lists, but only want to check if their element at index ‘2’ are equal, you could pass selector=’[2]’. This is useful for debugging purposes as the error messages on unequal objects will be far more informative. Defaults to None. NOTE: if you pass a selector string that starts with an alphabetical character, it will be assumed to be an attribute, and this will check equality on a.SELECTOR and b.SELECTOR

  • strict_types (bool) – if True, then the types of both objects must exactly match. Otherwise objects which are equal but of different types will be considered equal. Defaults to False.

  • unordered (bool) – if True, then all known sequential objects (list, tuple, numpy array, etc.) will be considered equal even if elements are in a different order (eg: a multiset equality). Otherwise, sequential objects are expected to have their subelements appear in the same order. If the passed objects are not sequential, then this has no effect. Defaults to False.

  • raise_err (bool) – if True, then an EqualityError will be raised whenever a and b are unequal, along with an informative stack trace as to why they were determined to be unequal. Defaults to False.

Raises:

EqualityError – if a and b are unequal and raise_err=True

Returns:

True if the two objects are equal, False otherwise

Return type:

bool
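The multiset comparison that unordered=True describes can be sketched as follows. This is not the eq_obj implementation: unordered_equal is a hypothetical helper that relies on plain == between elements and works even when the elements are unhashable.

```python
def unordered_equal(a, b):
    """True if a and b hold the same elements with the same multiplicities."""
    if len(a) != len(b):
        return False
    remaining = list(b)
    for item in a:
        for i, other in enumerate(remaining):
            if item == other:
                del remaining[i]  # each element of b can be matched only once
                break
        else:
            return False  # no unmatched partner left for this item
    return True


print(unordered_equal([1, [2], 3], [3, 1, [2]]))  # True
print(unordered_equal([1, 1, 2], [1, 2, 2]))      # False
```

The quadratic scan is slow but avoids hashing, which fits the note above that this kind of check is not meant to be fast.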

bincfg.utils.misc_utils.eq_obj_err(obj1, obj2)[source]

Same as eq_obj, but always raises an error

bincfg.utils.misc_utils.get_module(package, raise_err=True, err_message='')[source]

Checks that the given package is installed, returning it, and raising an error if not

Parameters:
  • package (str) – string name of the package

  • raise_err (bool, optional) – by default, this will raise an error if attempting to load the module and it doesn’t exist. If False, then None will be returned instead if it doesn’t exist. Defaults to True.

  • err_message (str) – an error message to add on to any import errors raised

Raises:

ImportError – if the package cannot be found, and raise_err=True

Returns:

the package

Return type:

Union[ModuleType, None]
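An optional-import helper of this kind can be sketched with importlib. The name sketch_get_module and the exact wording of the error message are assumptions, not BinCFG's code.

```python
import importlib


def sketch_get_module(package, raise_err=True, err_message=""):
    """Import and return a package, or None if it is missing and raise_err is False."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        if raise_err:
            raise ImportError(
                f"Could not import {package!r}. {err_message}".strip()
            ) from exc
        return None
```

A typical use is guarding an optional dependency: call it with raise_err=False and fall back to a simpler code path when None comes back.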

bincfg.utils.misc_utils.get_smallest_np_dtype(val, signed=False)[source]

Returns the smallest numpy integer dtype needed to store the given max value.

Parameters:
  • val (int) – the largest magnitude (furthest from 0) integer value that we need to be able to store

  • signed (bool, optional) – if True, then use signed ints. Defaults to False.

Raises:

ValueError – if a bad value was passed, or if the value was too large to store in a known integer size

Returns:

the smallest integer dtype needed to store the given max value

Return type:

np.dtype
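The range check behind a "smallest dtype" lookup can be sketched without numpy. The helper smallest_int_dtype_name is hypothetical and returns a dtype name such as 'uint16' (a string that np.dtype() accepts) rather than an np.dtype object; how the real function treats edge cases is not documented here.

```python
def smallest_int_dtype_name(val, signed=False):
    """Name of the narrowest 8/16/32/64-bit integer type that can hold val."""
    if not signed and val < 0:
        raise ValueError("negative values require signed=True")
    for bits in (8, 16, 32, 64):
        if signed:
            low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        else:
            low, high = 0, 2 ** bits - 1
        if low <= val <= high:
            return f"{'int' if signed else 'uint'}{bits}"
    raise ValueError(f"{val} does not fit in a 64-bit integer")


print(smallest_int_dtype_name(255))               # uint8
print(smallest_int_dtype_name(256))               # uint16
print(smallest_int_dtype_name(128, signed=True))  # int16
```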

bincfg.utils.misc_utils.hash_obj(obj, return_int=False)[source]

Hashes the given object

Parameters:
  • obj (Any) – the object to hash

  • return_int (bool, optional) – by default this method returns a hex string, but setting return_int=True will return an integer instead. Defaults to False.

Returns:

hash of the given object

Return type:

Union[str, int]

bincfg.utils.misc_utils.isinstance_with_iterables(obj, types, recursive=False, ret_list=False)[source]

Checks that obj is one of the given types, allowing for iterables of these types

Parameters:
  • obj (Any) – the obj to test type

  • types (Union[type, Tuple[type, ...]]) – either a type, or tuple of types that obj can be

  • recursive (bool, optional) – by default, this method will only allow iterables to contain objects of a type in types. If recursive=True, then this will accept arbitrary-depth iterables of types in types. Defaults to False.

  • ret_list (bool, optional) – if True, will return a single list of all elements (or None if the isinstance check fails). Defaults to False.

Returns:

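True or False depending on whether the check passes, unless ret_list=True, in which case a single list of all elements is returned (or None if the check fails)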

Return type:

Union[List[Any], bool, None]
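The difference between the default and recursive=True can be sketched as below. The helper check_types is hypothetical and leaves out the ret_list option; treating strings and bytes as non-iterables is an assumption made here to avoid recursing into single characters.

```python
def check_types(obj, types, recursive=False):
    """True if obj is one of types, or an iterable (optionally nested) of them."""
    if isinstance(obj, types):
        return True
    if isinstance(obj, (str, bytes)):
        return False  # assumption: do not iterate over characters
    try:
        items = list(obj)
    except TypeError:
        return False  # not iterable at all
    if recursive:
        return all(check_types(item, types, True) for item in items)
    return all(isinstance(item, types) for item in items)


print(check_types([1, 2], int))                    # True
print(check_types([1, [2]], int))                  # False
print(check_types([1, [2]], int, recursive=True))  # True
```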

bincfg.utils.misc_utils.parameter_saver(func=None, naming=None, not_naming=None, ignore=None, not_ignore=None, insert_functions=False, copy=True)[source]

A function that can wrap object methods to save calls to those methods

Should only be used on __init__, or some other function which is only called once in that object’s lifecycle.

Can be used both like:

@parameter_saver
def __init__(self, *args, **kwargs):
    ...

or like:

@parameter_saver()
def __init__(self, *args, **kwargs):
    ...

Subsequent calls to wrapped functions will not have their parameters saved.

Adds two new attributes: ‘__savedparams__’ and ‘__paramspec_name__’:

  • ‘__savedparams__’: a dictionary that has keys being the function names that this wrapper was applied to (EG: ‘__init__’), and values being a subdictionary with keys/values:

    • ‘args’ (OrderedDict[str, Any]): args that were passed on function call, in order with their argument names

    • ‘kwargs’ (OrderedDict[str, Any]): kwargs that were passed on function call, in order. NOTE: any extra args that would spill over into kwargs will be saved here

    • ‘naming’ (Set[str]): set of strings for parameters that will be used when calling paramspec_name()

    • ‘ignore’ (Set[str]): set of strings for parameters to ignore altogether

Parameters:
  • func (Callable) – the function to wrap, or None if we should return a function that will later wrap another function

  • naming (Optional[Iterable[str]]) – iterable of strings for which parameters should be used for naming. Only the parameters with these names will be used when generating a name with paramspec_name() or obj.__paramspec_name__, and they will be used in the order that they appear here. Default (None) is to use all parameters in the order that they appear in the method signature. Mutually exclusive with not_naming

  • not_naming (Optional[Iterable[str]]) – iterable of strings for which parameters should NOT be used for naming. All other parameters will be used. Mutually exclusive with naming

  • ignore (Optional[Iterable[str]]) – iterable of strings for which parameters should be ignored. These parameters do not appear when calling paramspec_name() and will not be saved. Default (None) is to not ignore any parameters. Mutually exclusive with not_ignore. NOTE: only keyword arguments can be ignored

  • not_ignore (Optional[Iterable[str]]) – iterable of strings for which parameters should NOT be ignored. All other parameters will be used. Mutually exclusive with ignore. NOTE: only keyword arguments can be ignored

  • insert_functions (bool) –

    if True, then extra functions will be added to the object. This will add:

    • .save(path: str) function - pickles the object and saves it to the given path

    • .load(path: str) function - Adds this function at the class level. Attempts to load and return a pickled object from the given path, checking to make sure it is the correct type

    • __setstate__(state) function - re-initializes this object with the given state information. This will attempt to initialize the new object with __init__ and using the args/kwargs present in __savedparams__[‘__init__’] if present, then will fill in the rest of the __dict__ attributes as normal

  • copy (Union[bool, str]) – if True, will attempt to copy parameters by checking if they have a .copy() method and calling it if so to produce the object that is saved, that way any updates to objects during/after initialization will not affect the saved parameters. If False, then the original object will be used. Can also be the string ‘deep’ to perform a deep copy of each object.
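The core idea of saving call parameters on first use can be sketched with functools and inspect. This is a simplified stand-in, not parameter_saver itself: save_call_params is a hypothetical decorator, it stores one flat dict per function instead of the ‘args’/‘kwargs’/‘naming’/‘ignore’ sub-dictionary described above, and it does no copying.

```python
import functools
import inspect


def save_call_params(func):
    """Record the parameters of the first call to func in obj.__savedparams__."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        bound = signature.bind(self, *args, **kwargs)
        bound.apply_defaults()
        # Drop the instance itself (the first parameter, usually 'self')
        params = dict(list(bound.arguments.items())[1:])
        saved = self.__dict__.setdefault("__savedparams__", {})
        # setdefault keeps the first call only; later calls are not recorded
        saved.setdefault(func.__name__, params)
        return func(self, *args, **kwargs)

    return wrapper


class Model:
    @save_call_params
    def __init__(self, width, depth=2):
        self.width, self.depth = width, depth


print(Model(4).__savedparams__)  # {'__init__': {'width': 4, 'depth': 2}}
```

functools.wraps keeps the original name and docstring on the wrapper, which is part of what lets tools still see the wrapped __init__.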

bincfg.utils.misc_utils.paramspec_name(obj, file_ext=None, savedparam_funcname=None, valid_filename=None)[source]

Returns a string name for the given object based on saved paramspec info

Requires that the @parameter_saver function decorator was used on at least one function on the given object and was called at least once.

Parameters:
  • obj (Any) – the object to get the string name from

  • file_ext (Optional[str]) – optional file extension to add to the end of the returned string. A period ‘.’ will be inserted between the paramspec name and the file_ext if it is not already present at the beginning of file_ext

  • savedparam_funcname (Optional[str]) – the name of the function to use to generate the paramspec name. If None, then it will default first to ‘__init__’ if it exists, then to the first saved paramspec attached to the object (in order of when the functions were called). Otherwise, should be a string name of the function to use

  • valid_filename (Optional[bool]) – if True, then the returned string will be modified so that it works as a valid filename. If False, then no such transformation will be applied. Otherwise if None, then this will be True if file_ext is not None and False otherwise.

bincfg.utils.misc_utils.paramspec_set_class_funcs(ret_cls)[source]

Sets class functions for paramspec things on the given class

bincfg.utils.misc_utils.progressbar(iterable, *args, progress=True, **kwargs)

Allows one to call progressbar(iterable, progress) to determine use of progressbar automatically.

Checks to see if we are in a python notebook or not to determine which progressbar we should use. Copied from: https://stackoverflow.com/questions/15411967/how-can-i-check-if-code-is-executed-in-the-ipython-notebook

bincfg.utils.misc_utils.scatter_nd_numpy(target, indices, values)[source]

Sets the values at indices to values in numpy array target

Shamelessly stolen from: https://stackoverflow.com/questions/46065873/how-to-do-scatter-and-gather-operations-in-numpy

Parameters:
  • target (np.ndarray) – the target ndarray to modify

  • indices (np.ndarray) – n-d array (same ndim as target) of the indices to set values to

  • values (np.ndarray) – 1-d array of the values to set

Returns:

the resultant array, modified inplace

Return type:

np.ndarray

bincfg.utils.misc_utils.split_by_metadata_key(metadata, set_splits, split_key, rng=None, subgroupings=None, final_sublist_size=1, eps=1e-08)[source]

Splits data based on arbitrary keys in its metadata. Allows for subgroupings as well

NOTE: This requires that all of the values for split_key in all metadata dictionaries (as well as those for any subgroupings being used) are hashable types.

NOTE: make sure you include an ‘INDEX’ key in all of the metadata values if the order they appear in the metadata is not the order they should be interpreted to have in file. IE: if your ‘INDEX’ column in file does not match up with the index of datapoints within the file

Parameters:
  • metadata (List[Dict]) – metadata for the data being split. A list of metadata dictionaries from all elements that could be loaded by the dataloader. If this has an ‘INDEX’ column, then that will be used to determine the ‘indices’ that are returned by this method. Otherwise, the indices will just be the order of datapoints as they appear. Assumes that if the ‘INDEX’ column is present in the first element, it will be present in all, and vice-versa

  • set_splits (Dict[Any, float]) – Dict mapping dataset name to float percent of the total dataset that should be allocated to that dataset name. If an OrderedDict, then data will be assigned with priority to earlier datasets in the case of too few ‘unique’ datapoints (by split_key), or uneven class sizes. Otherwise, order is arbitrary.

  • split_key (Optional[Any]) – the metadata key to use to split data by. If None, will split just by the number of datapoints in metadata

  • rng (Optional[Union[int, RNG]]) – integer random state, or numpy RNG object to use for rng, or None to not randomly select elements and instead grab them in the order that they appear in metadata. This will gather elements first in order of the unique keys that appear, then in order of individual metadata elements.

  • subgroupings (Optional[Iterable[Any]]) –

    If None, then this will split normally by metadata key. Otherwise, this can be string/int or a list of subelements which will act as a key or keys in the metadata to subgroup data by. Each key will be grouped in order to apply ‘subgroupings’ to the data. For example, if you were to split by the ‘problem_uid’ key, then subgroup by the ‘submission_id’ key, this would return a list of lists of indices as the value for each set_split. The first list would be at the ‘problem_uid’ level where all indices with the same problem_uid would appear in the same outer list. Each sublist would contain all indices with the same ‘submission_id’ key value from those grouped into the outer ‘problem_uid’-level list. Multiple subgrouping keys may be used at the same time to create deeper nested groupings. You may subgroup by the same key as the splitting key, which would ensure that, when loading data, all examples with the same value for its splitting key would be prioritized to load together.

    NOTE: the current loading RNG implementation will randomly select subelements from each level of list, deeper and deeper until reaching the final layer, at which point all values within that final list will be taken together. This means that if you were to, say, split by ‘problem_uid’ and subgroup by both ‘problem_uid’ and ‘submission_id’ in order, you would lose out on the prioritization of loading values with the same ‘problem_uid’ all together. To help with this, you may use the final_sublist_size argument, which will make the final sublists contain that many ‘unique’ indices. In the above example, it would ensure that there are final_sublist_size unique submission_id’s within each final sublist, and that sublist would contain all indices with 1. a ‘problem_uid’ that is within that outer sublist and 2. a ‘submission_id’ that is within that inner sublist. This way, one could ensure the loading of multiple examples from the same problem_uid on each selection, and make sure that all compilations of the same submission_id are loaded at the same time as well.

  • final_sublist_size (int) – the max size of the final sublist, in terms of number of ‘unique’ elements. See the note above in subgroupings for more info. Only used if subgroupings is not None

  • eps (float) – small epsilon value to pass to split_list_by_sizes() using set_splits, see that func for more info

Returns:

dictionary mapping each key in set_splits to its list of SplitIndElement objects. Each SplitIndElement can either be an integer index, or a list of SplitIndElement. This allows for nested groupings of elements to choose when loading data.

Return type:

Dict[Any, List[SplitIndElement]]
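The nested index structure is easier to see on a small example. The sketch below covers only the grouping stage (collecting indices by split_key, then optionally by one subgrouping key); it does not assign groups to the datasets named in set_splits. The helper group_indices is hypothetical.

```python
def group_indices(metadata, split_key, subgroup_key=None):
    """Group datapoint indices by split_key, optionally nesting by subgroup_key."""
    grouped = {}
    for position, meta in enumerate(metadata):
        index = meta.get("INDEX", position)  # fall back to order of appearance
        inner = grouped.setdefault(meta[split_key], {})
        sub_value = meta[subgroup_key] if subgroup_key is not None else None
        inner.setdefault(sub_value, []).append(index)
    if subgroup_key is None:
        return {key: inner[None] for key, inner in grouped.items()}
    return {key: list(inner.values()) for key, inner in grouped.items()}


metadata = [
    {"problem_uid": "p1", "submission_id": "s1"},
    {"problem_uid": "p1", "submission_id": "s2"},
    {"problem_uid": "p2", "submission_id": "s3"},
    {"problem_uid": "p1", "submission_id": "s1"},
]
print(group_indices(metadata, "problem_uid"))
# {'p1': [0, 1, 3], 'p2': [2]}
print(group_indices(metadata, "problem_uid", "submission_id"))
# {'p1': [[0, 3], [1]], 'p2': [[2]]}
```

In the second call, each inner list holds the indices that share a submission_id, which mirrors the list-of-lists shape described for subgroupings.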

bincfg.utils.misc_utils.split_list_by_sizes(l, sizes, eps=1e-08)[source]

Splits the given list into len(sizes) different lists in order based on sizes

Elements will be inserted into returned lists in order, prioritizing first having at least one element per list, then biasing any remaining elements into earlier lists.

Parameters:
  • l (Iterable[Any]) – the list of elements to split

  • sizes (Union[Iterable[float], Iterable[int]]) – the different sizes to apply. Can either be an iterable of floats in which case each element is a percent of the total data to keep and all elements should be >=0 and <=1 and all elements should sum to 1. Or, can be an iterable of integers in which case all elements should be >=0 and <= len(l) and all elements should sum to len(l)

  • eps (float) – the epsilon value used to determine if sum(sizes) (when sizes is a float) is equal to 1

Returns:

a list of all sublists

Return type:

List[List[Any]]
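The integer-sizes case and the eps check on float sizes can be sketched as follows. Both helpers are hypothetical, and the sketch deliberately leaves out the conversion from float fractions to element counts, since the exact rounding and prioritization rules are not spelled out here.

```python
def split_by_int_sizes(items, sizes):
    """Split items, in order, into consecutive sublists of the given lengths."""
    items = list(items)
    if sum(sizes) != len(items):
        raise ValueError("integer sizes must sum to len(items)")
    sublists, start = [], 0
    for size in sizes:
        sublists.append(items[start:start + size])
        start += size
    return sublists


def fractions_sum_to_one(sizes, eps=1e-08):
    """True if float sizes sum to 1 within eps (tolerates rounding error)."""
    return abs(sum(sizes) - 1.0) <= eps


print(split_by_int_sizes("abcdef", [3, 2, 1]))  # [['a', 'b', 'c'], ['d', 'e'], ['f']]
print(fractions_sum_to_one([0.7, 0.2, 0.1]))    # True
```

The eps tolerance matters because float fractions such as 0.7 + 0.2 + 0.1 do not always sum to exactly 1.0.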

bincfg.utils.misc_utils.timeout_wrapper(timeout=3, timeout_ret_val=None)[source]

Wraps a function to allow for timing-out after the specified time. If the function has not completed after timeout seconds, then the function will be terminated.

bincfg.utils.type_utils module

bincfg.utils.type_utils.AddressLike

Objects that can be converted into a memory address, or that have a .address attribute which can be

alias of int | str | Addressable

class bincfg.utils.type_utils.Addressable(*args, **kwargs)[source]

Bases: Protocol

Object that has a .address attribute which can be converted into a memory address

address: int | str

class bincfg.utils.type_utils.NormalizerType(*args, **kwargs)[source]

Bases: Protocol

Object that has a valid .normalize() function

normalize(*strings: str, cfg: CFG | None, block: CFGBasicBlock | None, newline_tup: None | Tuple[str, str] | object, match_instruction_address: bool, **kwargs: Any) → list[str][source]

bincfg.utils.type_utils.PlainAddress

Types that can be converted into an address by themselves, without having to look at any attributes

alias of int | str
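The structural typing behind Addressable, PlainAddress and AddressLike can be sketched with typing.Protocol. The definitions below are illustrative re-creations, not the ones in bincfg.utils.type_utils; in particular, the @runtime_checkable decorator and the Block class are assumptions added so that the isinstance check can be demonstrated.

```python
from typing import Protocol, Union, runtime_checkable

PlainAddress = Union[int, str]


@runtime_checkable
class Addressable(Protocol):
    """Anything with an .address attribute holding an int or str."""

    address: PlainAddress


AddressLike = Union[int, str, Addressable]


class Block:
    """Hypothetical class; it matches the protocol without inheriting from it."""

    def __init__(self, address):
        self.address = address


print(isinstance(Block(0x10), Addressable))  # True
print(isinstance(object(), Addressable))     # False
```

The runtime check only verifies that an address attribute exists, not that its value is an int or str.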

Module contents