bincfg.utils package
Various utility functions and objects.
AtomicTokenDict
When doing multithreaded processing with BinCFG, it is useful to be able to make atomic, synchronized updates
to the tokens currently being used when normalizing (that way, all MemCFGs use the same shared tokens).
The AtomicTokenDict allows for atomic updates to a shared token dictionary, requiring only a shared filesystem
to work. Using the atomicwrites pip package, it ensures that only one process at a time can update the pickle
file containing the token dictionary.
There are a couple of possible downsides depending on how you use it:
If many updates happen at the same time, they can be very slow, since each one must wait for the lock. It may help to precompute many of the common tokens before doing a large multithreaded/HPC run to get over this initial hurdle
Crashing/interrupted code can cause deadlocks if it stops execution while the AtomicTokenDict is updating. If this occurs, you can delete the lockfile (‘.[filename].lock’ where ‘[filename]’ is the name of the pickle file), which resolves the deadlock
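The locking behaviour described above can be illustrated with a minimal sketch. This is not BinCFG's implementation (which relies on the atomicwrites package); it is a standard-library approximation in which `locked_update`, `add_token`, and the file names are hypothetical. It also shows why a crash between acquiring and releasing the lock leaves a stale lockfile behind.

```python
import os
import pickle
import tempfile
import time


def locked_update(path, lock_path, update_func, default=None, retry_delay=0.01):
    """Read-modify-write a pickle file while holding an exclusive lock file."""
    # O_CREAT | O_EXCL fails if the lock file already exists, so only one
    # process at a time can create it. Everyone else sleeps and retries.
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(retry_delay)
    try:
        os.close(fd)
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = pickle.load(f)
        else:
            data = default
        data = update_func(data)
        # Write to a temporary file in the same directory, then rename it over
        # the target so readers never observe a half-written pickle.
        directory = os.path.dirname(os.path.abspath(path))
        tmp_fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(tmp_fd, "wb") as f:
            pickle.dump(data, f)
        os.replace(tmp_path, path)
        return data
    finally:
        # Always release the lock, even if update_func raised. A process that
        # is killed before reaching this line leaves the lock file behind.
        os.remove(lock_path)


def add_token(tokens, token="mov"):
    """Hypothetical update: add one token at the next free index."""
    tokens = dict(tokens)
    tokens.setdefault(token, len(tokens))
    return tokens


workdir = tempfile.mkdtemp()
tokens_path = os.path.join(workdir, "tokens.pkl")
lock_path = os.path.join(workdir, ".tokens.pkl.lock")
result = locked_update(tokens_path, lock_path, add_token, default={})
```

Because every writer holds the lock for the whole read-modify-write cycle, many simultaneous updates are processed one at a time, which is the slowdown noted above.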
Submodules
bincfg.utils.atomic_token_dict module
Atomically update tokens
- exception bincfg.utils.atomic_token_dict.AquireLockError(attempts, lock_path)[source]
Bases:
Exception
- class bincfg.utils.atomic_token_dict.AtomicData(init_data, filepath=None, lockpath=None, max_read_attempts=None, delete_file=False)[source]
Bases:
object
A class that allows for atomic reading/updating of the given data to a pickle file
- Parameters:
init_data (Any) – Data to initialize the atomic file with. If the atomic file already exists, then that data will be loaded
filepath (Optional[str]) – An optional filepath to store the dictionary, otherwise will be stored at ‘./atomic_dict.pkl’
lockpath (Optional[str]) – An optional filepath for the lock file to use to atomically update the dictionary, otherwise will be stored at ‘./.[filepath].lock’ where [filepath] is the given filepath parameter
max_read_attempts (Optional[int]) – An optional integer specifying the maximum number of attempts to atomically read this dictionary before giving up and raising an error. Set to None to attempt indefinitely. Defaults to None
delete_file (bool) – If True, then the file and lockfile will be deleted on initialization to start from scratch
- aquire_lock()[source]
Acquires the lock needed to update data
NOTE: this will prevent any and all updates to the atomic file until self.release_lock() is called. Make sure you call it quickly or other processes may hang!
NOTE: if the lock has already been acquired, nothing will happen
NOTE: it can be dangerous to attempt to acquire locks yourself, as any errors raised must be handled nicely and self.release_lock() must be called, otherwise other processes may hang
- atomic_read(default=<object object>)[source]
Atomically reads the data from file, updating self.data
- Parameters:
default (Optional[Any]) – If this is passed and the file does not already exist, then this data will be saved to file and set to self.data
- atomic_update(update_func, *update_args, **update_kwargs)[source]
Atomically updates the data
Will first acquire a lock on the data, read it in, then call update_func(file_data, curr_data, *update_args, **update_kwargs), where file_data is the data from the current atomic file and curr_data is the current data, then write the returned data back to file and finally release the lock.
NOTE: this will prevent any and all updates to the atomic file until update_func has completed
NOTE: any errors raised within update_func will be handled properly and should not corrupt the atomic file
- Parameters:
update_func (Callable) – function that takes in: the data currently saved in file, the current data, then the passed args and kwargs, and returns the updated data to write back to file
update_args (Any) – args to pass to update_func, after the data saved in file and the current data
update_kwargs (Any) – kwargs to pass to update_func
- Returns:
the updated data
- Return type:
Any
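The calling convention for update_func can be illustrated without touching the filesystem. In the sketch below, `apply_update` is a hypothetical stand-in for what happens between acquiring and releasing the lock, and `add_missing_tokens` is an illustrative update function; neither is part of BinCFG.

```python
def add_missing_tokens(file_data, curr_data, new_tokens):
    """Example update function: returns the data to write back to file."""
    merged = dict(file_data)
    # Tokens known in memory or newly requested are appended if the file
    # does not already have them.
    for token in list(curr_data) + list(new_tokens):
        if token not in merged:
            merged[token] = len(merged)
    return merged


def apply_update(file_data, curr_data, update_func, *update_args, **update_kwargs):
    """Hypothetical stand-in for the locked section of an atomic update."""
    return update_func(file_data, curr_data, *update_args, **update_kwargs)


file_data = {"mov": 0, "add": 1}  # what another process already wrote
curr_data = {"mov": 0}            # this process's older view
updated = apply_update(file_data, curr_data, add_missing_tokens, ["jmp", "add"])
```

Here the file's contents take precedence, so a token added by another process keeps its value and only genuinely new tokens receive new indices.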
- class bincfg.utils.atomic_token_dict.AtomicTokenDict(init_data=None, filepath=None, lockpath=None, max_read_attempts=None, delete_file=False)[source]
Bases:
object
Acts like a normal token dictionary, but allows for atomic operations
- Parameters:
init_data (Optional[Dict[str, int]]) – Data to initialize the atomic token dict with. If the atomic file already exists, then that data will be loaded
filepath (Optional[str]) – An optional filepath to store the dictionary, otherwise will be stored at ‘./atomic_dict.pkl’
lockpath (Optional[str]) – An optional filepath for the lock file to use to atomically update the dictionary, otherwise will be stored at ‘./.[filepath].lock’ where [filepath] is the given filepath parameter
max_read_attempts (Optional[int]) – An optional integer specifying the maximum number of attempts to atomically read this dictionary before giving up and raising an error. Set to None to attempt indefinitely. Defaults to None
delete_file (bool) – If True, then the file and lockfile will be deleted on initialization to start from scratch
- addtokens(*tokens)[source]
Adds the given tokens to this dictionary, ignoring any that already exist
- Parameters:
tokens (str) – arbitrary number of string tokens to add to this token dict
- property data
Returns the token dictionary
- property filepath
Return the filepath being used to store the atomic data
- property inverse
Return a new dict containing an inverse mapping of this current dictionary
- property lock_path
Return the lock path being used to store the atomic data
- setdefault(key, default=None)[source]
If the key exists, return the value. Otherwise set the key to the given default (or len(self) if default=None)
- update(tokens)[source]
Updates this dictionary with the given tokens
- Parameters:
tokens (Union[Dict[str, int], AtomicTokenDict]) – dictionary mapping token strings to their integer values. Any tokens in the dictionary that are not in this dictionary will be added, and any tokens that already exist and have the same value will be ignored. If there are any tokens that already exist, but have a different value, then an error will be raised
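The merge rules described for addtokens() and update() can be sketched on plain dictionaries. The function names below are hypothetical, the sketch ignores locking entirely, and two details are assumptions: that new tokens receive the next free index (mirroring the len(self) default of setdefault()), and that a conflict raises ValueError (the documentation only says an error is raised).

```python
def add_tokens(existing, *tokens):
    """Return a copy of `existing` with unseen tokens added at the next free index."""
    merged = dict(existing)
    for token in tokens:
        # Assumption: a new token maps to the current size of the dictionary.
        merged.setdefault(token, len(merged))
    return merged


def merge_tokens(existing, incoming):
    """Return a copy of `existing` updated with `incoming`, rejecting conflicts."""
    merged = dict(existing)
    for token, value in incoming.items():
        if token in merged and merged[token] != value:
            # Assumption: the exception type is illustrative only.
            raise ValueError(
                f"token {token!r} already maps to {merged[token]}, not {value}"
            )
        merged[token] = value
    return merged


tokens = add_tokens({"mov": 0}, "add", "mov", "jmp")
tokens = merge_tokens(tokens, {"add": 1, "ret": 3})
```

Rejecting conflicting values rather than overwriting them keeps every process's existing token ids valid.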
bincfg.utils.cfg_utils module
Utilities for CFG/MemCFG objects and their datasets
- bincfg.utils.cfg_utils.check_for_normalizer(dataset, cfg_data)[source]
Checks the incoming data for a normalizer to set as the dataset’s normalizer
Assumes this dataset does not yet have a normalizer. Searches the incoming cfg_data for a cfg/dataset that has a normalizer, and sets it to be this dataset’s normalizer. If this method finds no normalizer, or multiple unique normalizers, then an error will be raised.
- Parameters:
dataset (Union[CFGDataset, MemCFGDataset]) – a CFGDataset or MemCFGDataset without a normalizer
cfg_data (Iterable[Union[str, CFG, MemCFG, CFGDataset, MemCFGDataset]]) – an iterable of str/CFG/MemCFG/CFGDataset/MemCFGDataset objects
- Raises:
ValueError – when there are multiple conflicting normalizers, or if no normalizer could be found
- bincfg.utils.cfg_utils.get_address(obj: int | str | Addressable) → int[source]
Gets the integer address from the given object
- Parameters:
obj (Union[str, int, Addressable]) – a string, int, or object with a string/int .address attribute (should always be positive)
- Raises:
TypeError – obj is an unknown type
ValueError – given address is negative
- Returns:
the integer address
- Return type:
int
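A minimal sketch of the conversion described above, written as a standalone function so it does not depend on BinCFG. The name `to_address` and the `Block` class are hypothetical, and how strings are parsed is an assumption: the sketch uses base 0 so that both decimal and 0x-prefixed strings are accepted, which may differ from what get_address does.

```python
def to_address(obj):
    """Convert an int, a numeric string, or an object with .address to an int."""
    if not isinstance(obj, (int, str)):
        if not hasattr(obj, "address"):
            raise TypeError(f"cannot get an address from {type(obj).__name__}")
        obj = obj.address
    if not isinstance(obj, (int, str)):
        raise TypeError(f"unsupported address type {type(obj).__name__}")
    # Assumption: base 0 lets int() accept both "4096" and "0x1000".
    address = int(obj, 0) if isinstance(obj, str) else obj
    if address < 0:
        raise ValueError(f"address must not be negative, got {address}")
    return address


class Block:
    """Hypothetical object exposing a string .address attribute."""
    address = "0x400"


addresses = [to_address(4096), to_address("0x1000"), to_address(Block())]
```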
- bincfg.utils.cfg_utils.get_special_function_names()[source]
Returns the current global special function names
- bincfg.utils.cfg_utils.update_atomic_tokens(file_tokens, curr_data, update_tokens)[source]
Updates atomic tokens. Only meant to be passed to AtomicData.atomic_update as the function to use
- bincfg.utils.cfg_utils.update_memcfg_tokens(cfg_data, tokens)[source]
Adds all new tokens to tokens, and updates all tokens in cfg_data to their respective values in tokens
Tokens in cfg_data will be modified, as will the .asm_lines attribute of each memcfg. Assumes that the tokens in cfg_data conflict with those in tokens and thus need modification. Both cfg_data and tokens will be modified in-place.
- Parameters:
cfg_data (Union[MemCFG, MemCFGDataset]) – the memcfg/memcfgdataset to have its tokens changed
tokens (Union[Dict[str, int], AtomicData]) – the dictionary of tokens to update with the new tokens in cfg_data. Can be an AtomicData object for atomic updating of tokens
bincfg.utils.misc_utils module
Miscellaneous utility functions
- exception bincfg.utils.misc_utils.EqualityCheckingError[source]
Bases:
Exception
Error raised whenever there is an unexpected problem attempting to check equality between two objects
- exception bincfg.utils.misc_utils.EqualityError(a, b, message=None)[source]
Bases:
Exception
Error raised whenever an equal() check returns false and raise_err=True
- class bincfg.utils.misc_utils.ParameterSaver(name, bases, dct)[source]
Bases:
type
A metaclass used to add in parameter saving to the initialization function
This allows you to wrap __init__ of a class without having to worry about blocking IDEs from seeing its args/kwargs, and will apply the parameter saving to all child classes as well. Will default to insert_functions=True
- bincfg.utils.misc_utils.arg_array_split(length, sections, return_index=None, dtype=<class 'numpy.uint32'>)[source]
Like np.array_split(), but returns the indices that one would split at
This will always return sections sections, even if sections > length (in which case, any empty sections will come at the end). If sections does not perfectly divide length, then any extras will be front-loaded, one per split array as needed.
NOTE: this code was modified from the numpy array_split() source
- Parameters:
length (int) – the length of the sequence to split
sections (int) – the number of sections to split into
return_index (Optional[int]) – if not None, then an int to determine which tuple of (start, end) indices to return (IE: if you were splitting an array into 10 sections, and passed return_index=3, this would return the tuple of (start, end) indices for the 4th split array (since we start indexing at 0))
dtype (np.dtype) – the numpy dtype to use for the returned array
- Returns:
a numpy array of length sections + 1 where the split array at index i would use the start/end indices [returned_array[i]:returned_array[i+1]], unless return_index is not None, in which case a 2-tuple of the (start_idx, end_idx) will be returned
- Return type:
Union[np.ndarray, Tuple[int, int]]
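The front-loading rule can be sketched in pure Python. This is an illustration of the described behaviour, not the library's code: `split_indices` is a hypothetical name, and it returns a plain list rather than a numpy array with a dtype.

```python
def split_indices(length, sections, return_index=None):
    """Return the sections + 1 boundaries of an array_split-style partition."""
    base, extras = divmod(length, sections)
    boundaries = [0]
    for i in range(sections):
        # The first `extras` sections each take one extra element.
        boundaries.append(boundaries[-1] + base + (1 if i < extras else 0))
    if return_index is not None:
        return boundaries[return_index], boundaries[return_index + 1]
    return boundaries


even_split = split_indices(10, 3)
second_section = split_indices(10, 3, return_index=1)
more_sections_than_items = split_indices(2, 4)
```

With 10 elements and 3 sections the sizes are 4, 3, 3, so the boundaries are [0, 4, 7, 10]. With 2 elements and 4 sections the last two sections are empty, giving [0, 1, 2, 2, 2].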
- bincfg.utils.misc_utils.eq_obj(a, b, selector=None, strict_types=<object object>, unordered=<object object>, raise_err=<object object>)[source]
Determines whether a == b, generalizing for more objects and capabilities than default __eq__() method. Equal() is an equivalence relation, and thus:
equal(a, a) is always True (reflexivity)
equal(a, b) implies equal(b, a) (symmetry)
equal(a, b) and equal(b, c) implies equal(a, c) (transitivity)
NOTE: This method is not meant to be very fast. I will apply as many optimizations as I feasibly can, but there will be various inefficient conversions of types to check equality.
NOTE: kwargs passed to the initial equal() function call will be passed to all subcalls, including those done in other objects using their built-in __eq__ function. Any objects can override those kwargs for any later subcalls (but not those above/adjacent).
NOTE: The selector kwarg is only used once, then consumed for any later subcalls
- Parameters:
a (Any) – object to check equality
b (Any) – object to check equality
selector (Optional[str]) – if not None, then a string that determines the ‘selector’ to use on both objects for determining equality. It should start with either a letter (case-sensitive), underscore ‘_’, dot ‘.’ or bracket ‘[’. This string will essentially be appended to each object to select the attribute or element whose equality is checked, instead of the objects themselves. For example, if you have two lists, but only want to check if their elements at index 2 are equal, you could pass selector=‘[2]’. This is useful for debugging purposes as the error messages on unequal objects will be far more informative. Defaults to None. NOTE: if you pass a selector string that starts with an alphabetical character, it will be assumed to be an attribute, and this will check equality on a.SELECTOR and b.SELECTOR
strict_types (bool) – if True, then the types of both objects must exactly match. Otherwise objects which are equal but of different types will be considered equal. Defaults to False.
unordered (bool) – if True, then all known sequential objects (list, tuple, numpy array, etc.) will be considered equal even if elements are in a different order (eg: a multiset equality). Otherwise, sequential objects are expected to have their subelements appear in the same order. If the passed objects are not sequential, then this has no effect. Defaults to False.
raise_err (bool) – if True, then an EqualityError will be raised whenever a and b are unequal, along with an informative stack trace as to why they were determined to be unequal. Defaults to False.
- Raises:
EqualityError – if the two objects are not equal, and raise_err=True
EqualityCheckingError – if there was an error raised during equality checking
- Returns:
True if the two objects are equal, False otherwise
- Return type:
bool
- bincfg.utils.misc_utils.get_module(package, raise_err=True, err_message='')[source]
Checks that the given package is installed, returning it, and raising an error if not
- Parameters:
package (str) – string name of the package
raise_err (bool, optional) – by default, this will raise an error if attempting to load the module and it doesn’t exist. If False, then None will be returned instead if it doesn’t exist. Defaults to True.
err_message (str) – an error message to add on to any import errors raised
- Raises:
ImportError – if the package cannot be found, and raise_err=True
- Returns:
the package
- Return type:
Union[ModuleType, None]
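An optional-import helper of this kind can be sketched with importlib. The function name `try_import` and the wording of the error message are hypothetical; only the raise_err/err_message behaviour follows the description above.

```python
import importlib


def try_import(package, raise_err=True, err_message=""):
    """Import and return `package`, or return None if missing and raise_err is False."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        if not raise_err:
            return None
        message = f"Could not import {package!r}. {err_message}".strip()
        raise ImportError(message) from exc


json_module = try_import("json")
missing = try_import("surely_not_an_installed_package_xyz", raise_err=False)
```

This pattern lets optional dependencies be checked at the point of use, with an error message that can tell the user what to install.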
- bincfg.utils.misc_utils.get_smallest_np_dtype(val, signed=False)[source]
Returns the smallest numpy integer dtype needed to store the given max value.
- Parameters:
val (int) – the largest magnitude (furthest from 0) integer value that we need to be able to store
signed (bool, optional) – if True, then use signed ints. Defaults to False.
- Raises:
ValueError – if a bad value was passed, or if the value was too large to store in a known integer size
- Returns:
the smallest integer dtype needed to store the given max value
- Return type:
np.dtype
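The selection logic can be sketched without numpy by returning dtype names, which can be passed to np.dtype() if numpy is available. The function name is hypothetical, and two choices are assumptions: that only 8- to 64-bit types are considered, and that a negative value with signed=False counts as a bad value.

```python
def smallest_int_dtype_name(val, signed=False):
    """Return the name of the narrowest integer type that can hold `val`."""
    if not signed and val < 0:
        # Assumption: negative values are rejected for unsigned types.
        raise ValueError("cannot store a negative value in an unsigned integer")
    for bits in (8, 16, 32, 64):
        if signed:
            low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        else:
            low, high = 0, 2 ** bits - 1
        if low <= val <= high:
            return f"{'int' if signed else 'uint'}{bits}"
    raise ValueError(f"{val} does not fit in any 8- to 64-bit integer type")


names = [
    smallest_int_dtype_name(255),
    smallest_int_dtype_name(256),
    smallest_int_dtype_name(127, signed=True),
    smallest_int_dtype_name(128, signed=True),
]
```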
- bincfg.utils.misc_utils.hash_obj(obj, return_int=False)[source]
Hashes the given object
- Parameters:
obj (Any) – the object to hash
return_int (bool, optional) – by default this method returns a hex string, but setting return_int=True will return an integer instead. Defaults to False.
- Returns:
hash of the given object
- Return type:
Union[str, int]
- bincfg.utils.misc_utils.isinstance_with_iterables(obj, types, recursive=False, ret_list=False)[source]
Checks that obj is one of the given types, allowing for iterables of these types
- Parameters:
obj (Any) – the obj to test type
types (Union[type, Tuple[type, ...]]) – either a type, or tuple of types that obj can be
recursive (bool, optional) – by default, this method will only allow iterables to contain objects of a type in types. If recursive=True, then this will accept arbitrary-depth iterables of types in types. Defaults to False.
ret_list (bool, optional) – if True, will return a single list of all elements (or None if the isinstance check fails). Defaults to False.
- Returns:
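True or False depending on whether the check passes, or, if ret_list=True, a single list of all elements (None if the check fails)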
- Return type:
Union[List[Any], bool, None]
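One way to read the recursive and ret_list options is sketched below. `check_types` is a hypothetical name, and the handling of strings (never unpacked into characters) is an assumption that may not match the library.

```python
def check_types(obj, types, recursive=False, ret_list=False):
    """Check that obj is of `types`, or an iterable (optionally nested) of them."""

    def flatten(item, depth):
        if isinstance(item, types):
            return [item]
        # Assumption: strings/bytes that fail the check are not unpacked.
        if isinstance(item, (str, bytes)):
            return None
        # Without recursive=True, only one level of iterable is allowed.
        if depth > 0 and not recursive:
            return None
        try:
            children = list(item)
        except TypeError:
            return None
        flat = []
        for child in children:
            sub = flatten(child, depth + 1)
            if sub is None:
                return None
            flat.extend(sub)
        return flat

    result = flatten(obj, 0)
    return result if ret_list else result is not None


flat_ok = check_types([1, 2, 3], int)
nested_rejected = check_types([[1], [2]], int)
nested_flat = check_types([1, [2, [3]]], int, recursive=True, ret_list=True)
```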
- bincfg.utils.misc_utils.parameter_saver(func=None, naming=None, not_naming=None, ignore=None, not_ignore=None, insert_functions=False, copy=True)[source]
A function that can wrap object methods to save calls to those methods
Should only be used on __init__, or some other function which is only called once in that object’s lifecycle.
Can be used both like:
    @parameter_saver
    def __init__(self, *args, **kwargs): ...
or like:
    @parameter_saver()
    def __init__(self, *args, **kwargs): ...
Subsequent calls to wrapped functions will not have their parameters saved.
Adds two new attributes: ‘__savedparams__’ and ‘__paramspec_name__’:
‘__savedparams__’: a dictionary that has keys being the function names that this wrapper was applied to (EG: ‘__init__’), and values being a subdictionary with keys/values:
‘args’ (OrderedDict[str, Any]): args that were passed on function call, in order with their argument names
‘kwargs’ (OrderedDict[str, Any]): kwargs that were passed on function call, in order. NOTE: any extra args that would spill over into kwargs will be saved here
‘naming’ (Set[str]): set of strings for parameters that will be used when calling paramspec_name()
‘ignore’ (Set[str]): set of strings for parameters to ignore altogether
- Parameters:
func (Callable) – the function to wrap, or None if we should return a function that will later wrap another function
naming (Optional[Iterable[str]]) – iterable of strings for which parameters should be used for naming. Only the parameters with these names will be used when generating a name with paramspec_name() or obj.__paramspec_name__, and they will be used in the order that they appear here. Default (None) is to use all parameters in the order that they appear in the method signature. Mutually exclusive with not_naming
not_naming (Optional[Iterable[str]]) – iterable of strings for which parameters should NOT be used for naming. All other parameters will be used. Mutually exclusive with naming
ignore (Optional[Iterable[str]]) – iterable of strings for which parameters should be ignored. These parameters do not appear when calling paramspec_name() and will not be saved. Default (None) is to not ignore any parameters. Mutually exclusive with not_ignore NOTE: only keyword arguments can be ignored
not_ignore (Optional[Iterable[str]]) – iterable of strings for which parameters should NOT be ignored. All other parameters will be used. Mutually exclusive with ignore NOTE: only keyword arguments can be ignored
insert_functions (bool) –
if True, then extra functions will be added to the object. This will add:
.save(path: str) function - pickles the object and saves it to the given path
.load(path: str) function - Adds this function at the class level. Attempts to load and return a pickled object from the given path, checking to make sure it is the correct type
__setstate__(state) function - re-initializes this object with the given state information. This will attempt to initialize the new object with __init__ and using the args/kwargs present in __savedparams__[‘__init__’] if present, then will fill in the rest of the __dict__ attributes as normal
copy (Union[bool, str]) – if True, will attempt to copy parameters by checking if they have a .copy() method and calling it if so to produce the object that is saved, that way any updates to objects during/after initialization will not affect the saved parameters. If False, then the original object will be used. Can also be the string ‘deep’ to perform a deep copy of each object.
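The core idea of saving the parameters of a first call can be sketched with inspect.signature. This is a much-reduced illustration, not the library's decorator: `save_params`, the `saved_params` attribute, and the `Normalizer` class are hypothetical, and the naming/ignore/copy/insert_functions options are omitted.

```python
import functools
import inspect


def save_params(func):
    """Record the arguments of the first call to `func` on the instance."""
    signature = inspect.signature(func)

    @functools.wraps(func)  # keeps the wrapped signature visible to tooling
    def wrapper(self, *args, **kwargs):
        saved = self.__dict__.setdefault("saved_params", {})
        # Only the first call is recorded; later calls leave it untouched.
        if func.__name__ not in saved:
            bound = signature.bind(self, *args, **kwargs)
            # Drop the first bound argument (the instance itself).
            saved[func.__name__] = dict(list(bound.arguments.items())[1:])
        return func(self, *args, **kwargs)

    return wrapper


class Normalizer:
    """Hypothetical class whose constructor arguments should be remembered."""

    @save_params
    def __init__(self, tokenizer="x86", level="instruction"):
        self.tokenizer = tokenizer
        self.level = level


norm = Normalizer("arm", level="opcode")
```

After construction, norm.saved_params maps '__init__' to the arguments that were passed, by parameter name, which is the kind of record a paramspec-style name could be built from.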
- bincfg.utils.misc_utils.paramspec_name(obj, file_ext=None, savedparam_funcname=None, valid_filename=None)[source]
Returns a string name for the given object based on saved paramspec info
Requires that the @parameter_saver function decorator was used on at least one function on the given object and was called at least once.
- Parameters:
obj (Any) – the object to get the string name from
file_ext (Optional[str]) – optional file extension to add to the end of the returned string. A period ‘.’ will be inserted between the paramspec name and the file_ext if it is not already present at the beginning of file_ext
savedparam_funcname (Optional[str]) – the name of the function to use to generate the paramspec name. If None, then it will default first to ‘__init__’ if it exists, then to the first saved paramspec attached to the object (in order of when the functions were called). Otherwise, should be a string name of the function to use
valid_filename (Optional[bool]) – if True, then the returned string will be modified so that it works as a valid filename. If False, then no such transformation will be applied. Otherwise if None, then this will be True if file_ext is not None and False otherwise.
- bincfg.utils.misc_utils.paramspec_set_class_funcs(ret_cls)[source]
Sets class functions for paramspec things on the given class
- bincfg.utils.misc_utils.progressbar(iterable, *args, progress=True, **kwargs)
Allows one to call progressbar(iterable, progress) to determine use of progressbar automatically.
Checks to see if we are in a python notebook or not to determine which progressbar we should use. Copied from: https://stackoverflow.com/questions/15411967/how-can-i-check-if-code-is-executed-in-the-ipython-notebook
- bincfg.utils.misc_utils.scatter_nd_numpy(target, indices, values)[source]
Sets the values at indices to values in numpy array target
Shamelessly stolen from: https://stackoverflow.com/questions/46065873/how-to-do-scatter-and-gather-operations-in-numpy
- Parameters:
target (np.ndarray) – the target ndarray to modify
indices (np.ndarray) – n-d array (same ndim as target) of the indices to set values to
values (np.ndarray) – 1-d array of the values to set
- Returns:
the resultant array, modified inplace
- Return type:
np.ndarray
- bincfg.utils.misc_utils.split_by_metadata_key(metadata, set_splits, split_key, rng=None, subgroupings=None, final_sublist_size=1, eps=1e-08)[source]
Splits data based on arbitrary keys in its metadata. Allows for subgroupings as well
NOTE: This requires that all of the values for split_key in all metadata dictionaries (as well as those for any subgroupings being used) are hashable types.
NOTE: make sure you include an ‘INDEX’ key in all of the metadata values if the order they appear in the metadata is not the order they should be interpreted to have in file. IE: if your ‘INDEX’ column in file does not match up with the index of datapoints within the file
- Parameters:
metadata (List[Dict]) – metadata for the data being split. A list of metadata dictionaries from all elements that could be loaded by the dataloader. If this has an ‘INDEX’ column, then that will be used to determine the ‘indices’ that are returned by this method. Otherwise, the indices will just be the order of datapoints as they appear. Assumes that if the ‘INDEX’ column is present in the first element, it will be present in all, and vice-versa
set_splits (Dict[Any, float]) – Dict mapping dataset name to float percent of the total dataset that should be allocated to that dataset name. If an OrderedDict, then data will be assigned with priority to earlier datasets in the case of too few ‘unique’ datapoints (by split_key), or uneven class sizes. Otherwise, order is arbitrary.
split_key (Optional[Any]) – the metadata key to use to split data by. If None, will split just by the number of datapoints in metadata
rng (Optional[Union[int, RNG]]) – integer random state, or numpy RNG object to use for rng, or None to not randomly select elements and instead grab them in the order that they appear in metadata. This will gather elements first in order of the unique keys that appear, then in order of individual metadata elements.
subgroupings (Optional[Iterable[Any]]) –
If None, then this will split normally by metadata key. Otherwise, this can be string/int or a list of subelements which will act as a key or keys in the metadata to subgroup data by. Each key will be grouped in order to apply ‘subgroupings’ to the data. For example, if you were to split by the ‘problem_uid’ key, then subgroup by the ‘submission_id’ key, this would return a list of lists of indices as the value for each set_split. The first list would be at the ‘problem_uid’ level where all indices with the same problem_uid would appear in the same outer list. Each sublist would contain all indices with the same ‘submission_id’ key value from those grouped into the outer ‘problem_uid’-level list. Multiple subgrouping keys may be used at the same time to create deeper nested groupings. You may subgroup by the same key as the splitting key, which would ensure that, when loading data, all examples with the same value for its splitting key would be prioritized to load together.
NOTE: the current loading RNG implementation will randomly select subelements from each level of list, deeper and deeper, until reaching the final layer, at which time all values within that final list will be taken together. This means that if you were to, say, split by ‘problem_uid’ and subgroup by both ‘problem_uid’ and ‘submission_id’ in order, you would lose out on the prioritization of loading values with the same ‘problem_uid’ all together. To help with this, you may use the final_sublist_size argument, which will make the final sublists contain that many ‘unique’ indices. In the above example, it would ensure that there are final_sublist_size unique submission_id’s within each final sublist, and that sublist would contain all indices with 1. a ‘problem_uid’ that is within that outer sublist and 2. a ‘submission_id’ that is within that inner sublist. This way, one could ensure the loading of multiple examples from the same problem_uid each selection, and make sure that all compilations of the same submission_id are loaded at the same time as well.
final_sublist_size (int) – the max size of the final sublist, in terms of number of ‘unique’ elements. See the note above in subgroupings for more info. Only used if subgroupings is not None
eps (float) – small epsilon value to pass to split_list_by_sizes() using set_splits, see that func for more info
- Returns:
dictionary mapping each key in set_splits to its list of SplitIndElement objects. Each SplitIndElement can either be an integer index, or a list of SplitIndElement. This allows for nested groupings of elements to choose when loading data.
- Return type:
Dict[Any, List[SplitIndElement]]
- bincfg.utils.misc_utils.split_list_by_sizes(l, sizes, eps=1e-08)[source]
Splits the given list into len(sizes) different lists in order based on sizes
Elements will be inserted into returned lists in order, prioritizing first having at least one element per list, then biasing any remaining elements into earlier lists.
- Parameters:
l (Iterable[Any]) – the list of elements to split
sizes (Union[Iterable[float], Iterable[int]]) – the different sizes to apply. Can either be an iterable of floats in which case each element is a percent of the total data to keep and all elements should be >=0 and <=1 and all elements should sum to 1. Or, can be an iterable of integers in which case all elements should be >=0 and <= len(l) and all elements should sum to len(l)
eps (float) – the epsilon value used to determine if sum(sizes) (when sizes is a float) is equal to 1
- Returns:
a list of all sublists
- Return type:
List[List[Any]]
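The allocation rule can be sketched as follows. This is one possible reading of the description (one element per list first, then a proportional share rounded down, then leftovers to the earliest lists) and may not match the library's exact rounding; `split_by_sizes` is a hypothetical name.

```python
def split_by_sizes(items, sizes, eps=1e-8):
    """Split `items` into len(sizes) consecutive sublists."""
    items, sizes = list(items), list(sizes)
    n, k = len(items), len(sizes)
    if all(isinstance(s, int) for s in sizes):
        if sum(sizes) != n:
            raise ValueError("integer sizes must sum to len(items)")
        counts = sizes
    else:
        if abs(sum(sizes) - 1.0) > eps:
            raise ValueError("float sizes must sum to 1")
        # Give every list one element first (while elements last) ...
        counts = [1 if i < n else 0 for i in range(k)]
        remaining = n - sum(counts)
        # ... then share out the rest proportionally, rounding down ...
        counts = [c + int(s * remaining) for c, s in zip(counts, sizes)]
        # ... and hand any leftovers to the earliest lists.
        for i in range(n - sum(counts)):
            counts[i % k] += 1
    out, start = [], 0
    for count in counts:
        out.append(items[start:start + count])
        start += count
    return out


by_fraction = split_by_sizes(range(5), [0.8, 0.1, 0.1])
by_count = split_by_sizes("abcd", [1, 3])
```

With five items and fractions 0.8/0.1/0.1, each list first receives one item; of the two remaining, one goes to the first list by proportion and the leftover also goes to the first list, giving sizes 3, 1, 1.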
bincfg.utils.type_utils module
- bincfg.utils.type_utils.AddressLike
Objects that can be converted into a memory address, or that have a .address attribute which can be converted into a memory address
alias of int | str | Addressable
- class bincfg.utils.type_utils.Addressable(*args, **kwargs)[source]
Bases:
Protocol
Object that has a .address attribute which can be converted into a memory address
- address: int | str
- class bincfg.utils.type_utils.NormalizerType(*args, **kwargs)[source]
Bases:
Protocol
Object that has a valid .normalize() function
- normalize(*strings: str, cfg: CFG | None, block: CFGBasicBlock | None, newline_tup: None | Tuple[str, str] | object, match_instruction_address: bool, **kwargs: Any) → list[str][source]
- bincfg.utils.type_utils.PlainAddress
Types that can be converted into an address by themselves, without having to look at any attributes
alias of int | str