datasetops.dataset

Module defining operations that may be applied to transform the data of a single dataset. The transforms are available both as free functions and as methods defined on the dataset objects:

>>> ds_a = ds.shuffle(seed=0)
>>> ds_b = shuffle(ds, seed=0)
>>> ds_a.idx == ds_b.idx
True

Module Contents

datasetops.dataset._warn_no_args(skip=0)
datasetops.dataset._raise_no_args(skip=0)
datasetops.dataset._key_index(item_names: ItemNames, key: Key) → int
datasetops.dataset._split_bulk_itemwise(l: Union[Optional[Callable], Sequence[Optional[Callable]]]) → Tuple[Optional[Callable], Sequence[Optional[Callable]]]
datasetops.dataset._combine_conditions(item_names: ItemNames, shape: Sequence[Shape], predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate) → DataPredicate
datasetops.dataset._optional_argument_indexed_transform(shape: Union[Shape, Sequence[Shape]], ds_transform: Callable[[Any], 'Dataset'], transform_fn: DatasetTransformFnCreator, args: Sequence[Optional[Sequence[Any]]])
datasetops.dataset._keywise(item_names: Dict[str, int], l: Sequence, d: Dict)
datasetops.dataset._itemwise(item_names: Dict[str, int], l: Sequence, d: Dict)
datasetops.dataset._keyarg2list(item_names, key_or_keys: Union[Key, Sequence[Key]], arg: Sequence[Any]) → List[Optional[Sequence[Any]]]
datasetops.dataset._DEFAULT_SHAPE
datasetops.dataset._ROOT_OPERATIONS = ['cache', 'stream', 'load']
datasetops.dataset._MAYBE_CACHEABLE_OPERATIONS = ['sample', 'shuffle', 'split']
datasetops.dataset._CACHEABLE_OPERATIONS = ['filter', 'split_filter', 'take', 'reorder', 'repeat', 'image', 'image_resize', 'numpy', 'reshape', 'categorical', 'one_hot', 'standardize', 'center', 'minmax', 'maxabs', 'copy', 'transform']
class datasetops.dataset.Dataset(downstream_getter: Union[ItemGetter, 'Dataset'], operation_name: str, name: str = None, ids: Ids = None, item_transform_fn: ItemTransformFn = lambda x: ..., item_names: Dict[str, int] = None, operation_parameters: Dict = {}, stats: List[Optional[scaler.ElemStats]] = [])

Bases: datasetops.abstract.AbstractDataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

__len__(self)
__getitem__(self, i: int)
cached(self, path: str = None, keep_loaded_items: bool = False, display_progress: bool = False)
item_stats(self, item_key: Key, axis=None)

Compute the statistics (mean, std, min, max) for an item element

Arguments:
item_key {Key} – index or string identifier of the element on which the stats should be computed

Keyword Arguments:

axis {int} – the axis on which to compute statistics (default: {None})

Raises:

TypeError: if statistics cannot be computed on the element type

Returns:

scaler.ElemStats – Named tuple with (mean, std, min, max, axis)
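
A usage sketch (the element name "data" is illustrative and assumes names were set with named()):

>>> stats = ds.item_stats("data")
>>> stats.mean, stats.std  # fields of the scaler.ElemStats named tuple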

property shape(self)

Get the shape of a dataset item.

Returns:

Sequence[int] – Item shapes

counts(self, *itemkeys: Key)

Compute the counts of each unique item in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:
itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item-parts must match for items to be considered equal

Returns:
List[Tuple[Any, int]] – List of tuples, each containing a unique value and its number of occurrences

unique(self, *itemkeys: Key)

Compute a list of unique values in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {str} – The item keys to be checked for uniqueness

Returns:

List[Any] – List of the unique items
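
A sketch of both lookups (the "label" element name and the values in the comments are illustrative):

>>> ds.unique("label")    # e.g. ['cat', 'dog']
>>> ds.counts("label")    # e.g. [('cat', 500), ('dog', 500)]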

sample(self, num: int, seed: int = None)

Sample data randomly from the dataset.

Arguments:
num {int} – Number of samples. If the number of samples is larger than the dataset size, some items may be sampled multiple times

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Sampled dataset
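
A usage sketch:

>>> ds_small = ds.sample(num=100, seed=42)
>>> len(ds_small)
100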

filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Filter a dataset using a predicate function.

Keyword Arguments:
predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item (or item element) and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise to the item. Element-wise predicates can also be passed by keyword, if the item elements have been named.

kwpredicates {DataPredicate} – Predicates passed by keyword

Returns:

[Dataset] – A filtered Dataset
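
A sketch of both calling styles (items are assumed to be (data, label) pairs with names set via named("data", "label")):

>>> ds_f = ds.filter(lambda item: item[1] is not None)  # predicate on the whole item
>>> ds_f = ds.filter(label=lambda lbl: lbl == 'cat')    # keyword predicate on a named element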

split_filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Split a dataset using a predicate function.

Keyword Arguments:
predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise. Element-wise predicates can also be passed by keyword, if the item elements have been named.

Returns:

[Dataset, Dataset] – Two datasets: one with the items that passed the predicate and one with the items that did not
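
A sketch (the predicate is illustrative):

>>> ds_pass, ds_fail = ds.split_filter(lambda item: item[1] == 'cat')
>>> len(ds_pass) + len(ds_fail) == len(ds)
True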

shuffle(self, seed: int = None)

Shuffle the items in a dataset.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Dataset with shuffled items

split(self, fractions: List[float], seed: int = None)

Split dataset into multiple datasets, determined by the fractions given.

A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions do not add up to one, the last fraction in the list receives the remaining data.

Arguments:

fractions {List[float]} – a list or tuple of floats in the interval ]0, 1[. One of the entries may be the -1 wildcard.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

List[Dataset] – Datasets with the number of samples corresponding to the fractions given
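
A sketch using the wildcard to absorb the remainder:

>>> train, val, test = ds.split([0.7, 0.15, -1], seed=0)  # -1 receives the remaining 15%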

take(self, num: int)

Take the first elements of a dataset.

Arguments:

num {int} – number of elements to take

Returns:

Dataset – A dataset with only the first num elements

repeat(self, times=1, mode='itemwise')

Repeat the dataset elements.

Keyword Arguments:

times {int} – Number of times an element is repeated (default: {1})

mode {str} – Repeat 'itemwise' (i.e. [1,1,2,2,3,3]) or as a 'whole' (i.e. [1,2,3,1,2,3]) (default: {'itemwise'})

Returns:

[Dataset] – Dataset with repeated items
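
A sketch of the two modes, assuming a three-item dataset [1, 2, 3]:

>>> ds.repeat(times=2, mode='itemwise')  # yields 1, 1, 2, 2, 3, 3
>>> ds.repeat(times=2, mode='whole')     # yields 1, 2, 3, 1, 2, 3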

reorder(self, *keys: Key)

Reorder the elements of the dataset items (similar to numpy.transpose).

Arguments:
keys {Union[int, str]} – positional item index or key (if item names were previously set) of each element, given in the desired new order

Returns:

[Dataset] – Dataset with items whose elements have been reordered
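
A sketch (the names assume a prior named("data", "label") call):

>>> ds_r = ds.reorder(1, 0)             # swap the two elements by index
>>> ds_r = ds.reorder('label', 'data')  # equivalently, by name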

named(self, first: Union[str, Sequence[str]], *rest: str)

Set the names associated with the elements of an item.

Arguments:

first {Union[str, Sequence[str]]} – The new item name(s)

Returns:

[Dataset] – A Dataset whose item elements can be accessed by name
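
A usage sketch:

>>> ds = ds.named('data', 'label')
>>> ds.names
['data', 'label']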

property names(self)

Get the names of the elements in an item.

Returns:

List[str] – A list of element names

transform(self, fns: Optional[Union[ItemTransformFn, Sequence[Union[ItemTransformFn, DatasetTransformFn]]]] = None, **kwfns: DatasetTransformFn)

Transform the items of a dataset according to some function (passed as argument).

Arguments:
If a single function taking one input is given, e.g. transform(lambda x: x), it will be applied to the whole item.

If a list of functions is given, e.g. transform([image(), one_hot()]), they will be applied to the elements of the item at the corresponding positions.

If a key is used, e.g. transform(data=lambda x: -x), the function is applied to the item element associated with that key.

Raises:

ValueError: If more functions are passed than there are elements in an item.

KeyError: If a key doesn't match

Returns:

[Dataset] – Dataset whose items are transformed
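
A sketch of the three calling styles described above (the element name 'data' is illustrative):

>>> ds_t = ds.transform(lambda x: x)            # applied to the whole item
>>> ds_t = ds.transform([image(), one_hot(2)])  # applied element-wise by position
>>> ds_t = ds.transform(data=lambda x: -x)      # applied to the element named 'data'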

categorical(self, key: Key, mapping_fn: Callable[[Any], int] = None)

Transform elements into categorical labels (int).

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

Returns:
[Dataset] – Dataset with items that have been transformed to categorical labels

one_hot(self, key: Key, encoding_size: int = None, mapping_fn: Callable[[Any], int] = None, dtype='bool')

Transform elements into a categorical one-hot encoding.

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be automatically inferred at an O(N) runtime cost (default: {None})

mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

dtype {str} – Numpy datatype for the one-hot encoded data (default: {'bool'})

Returns:
[Dataset] – Dataset with items that have been transformed to one-hot encoded labels
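
A sketch, assuming a string-valued element named 'label':

>>> ds_c = ds.categorical('label')               # e.g. 'cat'/'dog' -> 0/1
>>> ds_h = ds.one_hot('label', encoding_size=2)  # e.g. 'cat' -> [True, False]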

image(self, *positional_flags: Any)

Transforms item elements that are either numpy arrays or path strings into a PIL.Image.Image.

Arguments:

positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all elements that can be converted will be converted.

Returns:

[Dataset] – Dataset with PIL.Image.Image elements

numpy(self, *positional_flags: Any)

Transforms elements into numpy.ndarray.

Arguments:

positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all elements that can be converted will be converted.

Returns:

[Dataset] – Dataset with np.ndarray elements
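
A sketch covering both conversions (the positional flags are illustrative):

>>> ds_img = ds.image()            # convert every convertible element to a PIL.Image.Image
>>> ds_np = ds.numpy(True, False)  # convert only the first element to np.ndarray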

zip(self, *datasets)
cartesian_product(self, *datasets)
concat(self, *datasets)
reshape(self, *new_shapes: Optional[Shape], **kwshapes: Optional[Shape])
image_resize(self, *new_sizes: Optional[Shape], **kwsizes: Optional[Shape])
standardize(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Standardize features by removing the mean and scaling to unit variance

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the standardization should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

center(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Centers features by removing the mean

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the centering should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

minmax(self, key_or_keys: Union[Key, Sequence[Key]], axis=0, feature_range=(0, 1))

Transform features by scaling each feature to a given range.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the min-max scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

Dataset – Transformed dataset

maxabs(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Scale each feature by its maximum absolute value.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the Max Abs scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset
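
A sketch of the scaling family, assuming a numeric element named 'data':

>>> ds_s = ds.standardize('data')                   # zero mean, unit variance
>>> ds_c = ds.center('data')                        # subtract the mean
>>> ds_m = ds.minmax('data', feature_range=(0, 1))  # scale into [0, 1]
>>> ds_a = ds.maxabs('data')                        # divide by the max absolute value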

close(self)
to_tensorflow(self)
to_pytorch(self)
class datasetops.dataset.StreamDataset(stream: IO, identifier: str, keep_loaded_items: bool = False)

Bases: datasetops.dataset.Dataset

property allow_random_access(self)
__skip_header(self)
__read_once(self)
__reset(self, clear_loaded_items: bool = False)
__read_item(self)
__getitem__(self, i: int)
close(self)
datasetops.dataset._make_dataset_element_transforming(make_fn: Callable[[AbstractDataset, Optional[int]], Callable], check: Callable = None, maintain_stats=False, operation_name='transform', operation_parameters={}) → DatasetTransformFn
datasetops.dataset._dataset_element_transforming(fn: Callable, check: Callable = None, maintain_stats=False, operation_name='transform', operation_parameters={}) → DatasetTransformFn

Applies the function to dataset item elements.

datasetops.dataset._check_shape_compatibility(shape: Shape)
datasetops.dataset.convert2img(elem: Union[Image.Image, str, Path, np.ndarray]) → Image.Image
datasetops.dataset._check_image_compatibility(elem)
datasetops.dataset._check_numpy_compatibility(allow_scalars=False)
datasetops.dataset.allow_unique(max_num_duplicates=1) → Callable[[Any], bool]

Predicate used for filtering/sampling a dataset classwise.

Keyword Arguments:
max_num_duplicates {int} – max number of samples to take that share the same value (default: {1})

Returns:

Callable[[Any], bool] – Predicate function
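
A sketch of class-wise sampling via filter (the 'label' element name and the cap are illustrative):

>>> ds_capped = ds.filter(label=allow_unique(max_num_duplicates=100))  # at most 100 items per label value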

datasetops.dataset._custom(elem_transform_fn: Callable[[Any], Any], elem_check_fn: Callable[[Any], None] = None) → DatasetTransformFn

Create a user defined transform.

Arguments:
elem_transform_fn {Callable[[Any], Any]} – A user defined function, which takes the element as its only argument

Keyword Arguments:

elem_check_fn {Callable[[Any], None]} – A function that raises an Exception if the element is incompatible (default: {None})

Returns:

DatasetTransformFn – A function to be passed to Dataset.transform()

datasetops.dataset.reshape(new_shape: Shape) → DatasetTransformFn
datasetops.dataset.categorical(mapping_fn: Callable[[Any], int] = None) → DatasetTransformFn

Transform data into a categorical int label.

Arguments:
mapping_fn {Callable[[Any], int]} – A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.dataset.categorical_template(ds: Dataset, key: Key) → Callable[[Any], int]

Creates a template mapping function to be used with one_hot.

Arguments:

ds {Dataset} – Dataset from which to create a template for the one_hot coding

key {Key} – Dataset key (name or item index) on which the one_hot coding is made

Returns:

{Callable[[Any],int]} – mapping_fn for one_hot

datasetops.dataset.one_hot(encoding_size: int, mapping_fn: Callable[[Any], int] = None, dtype='bool') → DatasetTransformFn

Transform data into a one-hot encoded label.

Arguments:

encoding_size {int} – The size of the encoding

mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()
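
A sketch composing the free-function transforms through Dataset.transform (the item layout is illustrative):

>>> ds_t = ds.transform([image_resize((28, 28)), one_hot(encoding_size=10)])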

datasetops.dataset.numpy() → DatasetTransformFn
datasetops.dataset.image() → DatasetTransformFn
datasetops.dataset.image_resize(new_size: Shape, resample=Image.NEAREST) → DatasetTransformFn
datasetops.dataset.standardize(axis=0) → DatasetTransformFn

Standardize features by removing the mean and scaling to unit variance

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.center(axis=0) → DatasetTransformFn

Center features by removing the mean

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.minmax(axis=0, feature_range=(0, 1)) → DatasetTransformFn

Transform features by scaling each feature to a given range.

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.maxabs(axis=0) → DatasetTransformFn

Scale each feature by its maximum absolute value.

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.zipped(*datasets: AbstractDataset)
datasetops.dataset.cartesian_product(*datasets: AbstractDataset)
datasetops.dataset.concat(*datasets: AbstractDataset)
datasetops.dataset._tf_compute_type(item: Any)
datasetops.dataset._tf_compute_shape(item: Any)
datasetops.dataset._tf_item_conversion(item: Any)
datasetops.dataset.to_tensorflow(dataset: Dataset)
datasetops.dataset.to_pytorch(dataset: Dataset)
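
A sketch of the framework bridges (assumes tensorflow and torch are installed; the returned objects are assumed to plug into each framework's data-loading API):

>>> tf_ds = to_tensorflow(ds)    # for use with the tf.data API
>>> torch_ds = to_pytorch(ds)    # for use with torch.utils.data.DataLoader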