datasetops.dataset

Module defining operations that may be applied to transform the data of a single dataset. The transforms are available both as free functions and as methods defined on the dataset objects:

>>> ds_a = ds.shuffle(seed=0)
>>> ds_b = shuffle(ds, seed=0)
>>> ds_a.idx == ds_b.idx
True

Module Contents

datasetops.dataset._warn_no_args(skip=0)
datasetops.dataset._raise_no_args(skip=0)
datasetops.dataset._key_index(item_names: ItemNames, key: Key) → int
datasetops.dataset._split_bulk_itemwise(l: Union[Optional[Callable], Sequence[Optional[Callable]]]) → Tuple[Optional[Callable], Sequence[Optional[Callable]]]
datasetops.dataset._combine_conditions(item_names: ItemNames, shape: Sequence[Shape], predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate) → DataPredicate
datasetops.dataset._optional_argument_indexed_transform(shape: Union[Shape, Sequence[Shape]], ds_transform: Callable[[Any], 'Dataset'], transform_fn: DatasetTransformFnCreator, args: Sequence[Optional[Sequence[Any]]])
datasetops.dataset._keywise(item_names: Dict[str, int], l: Sequence, d: Dict)
datasetops.dataset._itemwise(item_names: Dict[str, int], l: Sequence, d: Dict)
datasetops.dataset._keyarg2list(item_names, key_or_keys: Union[Key, Sequence[Key]], arg: Sequence[Any]) → List[Optional[Sequence[Any]]]
datasetops.dataset._DEFAULT_SHAPE
datasetops.dataset._ROOT_OPERATIONS = ['cache', 'stream', 'load']
datasetops.dataset._MAYBE_CACHEABLE_OPERATIONS = ['sample', 'shuffle', 'split']
datasetops.dataset._CACHEABLE_OPERATIONS = ['filter', 'split_filter', 'take', 'reorder', 'repeat', 'image', 'image_resize', 'numpy', 'reshape', 'categorical', 'one_hot', 'standardize', 'center', 'minmax', 'maxabs', 'copy', 'transform']
class datasetops.dataset.Dataset(downstream_getter: Union[ItemGetter, 'Dataset'], operation_name: str, name: str = None, ids: Ids = None, item_transform_fn: ItemTransformFn = lambda x: ..., item_names: Dict[str, int] = None, operation_parameters: Dict = {}, stats: List[Optional[scaler.ElemStats]] = [])

Bases: datasetops.abstract.AbstractDataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

__len__(self)
__getitem__(self, i: int)
cached(self, path: str = None, keep_loaded_items: bool = False, display_progress: bool = False)
item_stats(self, item_key: Key, axis=None)

Compute the statistics (mean, std, min, max) for an item element

Arguments:
item_key {Key} – index or string identifier of the element on which the stats should be computed

Keyword Arguments:

axis {int} – the axis on which to compute statistics (default: {None})

Raises:

TypeError: if statistics cannot be computed on the element type

Returns:

scaler.ElemStats – Named tuple with (mean, std, min, max, axis)
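
A usage sketch (the element name "data" is illustrative and assumes names were set with named()):

>>> stats = ds.item_stats("data")
>>> stats.mean, stats.std  # fields of the scaler.ElemStats named tuple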

property shape(self)

Get the shape of a dataset item.

Returns:

Sequence[int] – Item shapes

counts(self, *itemkeys: Key)

Compute the counts of each unique item in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:
itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item-parts must match for items to be considered equal

Returns:
List[Tuple[Any, int]] – List of tuples, each containing a unique value and its number of occurrences

unique(self, *itemkeys: Key)

Compute a list of unique values in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {str} – The item keys to be checked for uniqueness

Returns:

List[Any] – List of the unique items
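
A sketch of both lookups (the "label" element name and the values in the comments are illustrative):

>>> ds.unique("label")    # e.g. ['cat', 'dog']
>>> ds.counts("label")    # e.g. [('cat', 500), ('dog', 500)]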

sample(self, num: int, seed: int = None)

Sample data randomly from the dataset.

Arguments:
num {int} – Number of samples. If the number of samples is larger than the dataset size, some items may be sampled multiple times

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Sampled dataset
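
A usage sketch:

>>> ds_small = ds.sample(num=100, seed=42)
>>> len(ds_small)
100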

filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Filter a dataset using a predicate function.

Keyword Arguments:
predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item (or item element) and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise to the item. Element-wise predicates can also be passed by keyword, if the item elements have been named.

kwpredicates {DataPredicate} – Predicates passed by keyword

Returns:

[Dataset] – A filtered Dataset
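
A sketch of both calling styles (items are assumed to be (data, label) pairs with names set via named("data", "label")):

>>> ds_f = ds.filter(lambda item: item[1] is not None)  # predicate on the whole item
>>> ds_f = ds.filter(label=lambda lbl: lbl == 'cat')    # keyword predicate on a named element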

split_filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Split a dataset using a predicate function.

Keyword Arguments:
predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise. Element-wise predicates can also be passed by keyword, if the item elements have been named.

Returns:

[Dataset, Dataset] – Two datasets: one with the items that passed the predicate and one with the items that did not
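
A sketch (the predicate is illustrative):

>>> ds_pass, ds_fail = ds.split_filter(lambda item: item[1] == 'cat')
>>> len(ds_pass) + len(ds_fail) == len(ds)
True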

shuffle(self, seed: int = None)

Shuffle the items in a dataset.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Dataset with shuffled items

split(self, fractions: List[float], seed: int = None)

Split dataset into multiple datasets, determined by the fractions given.

A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions do not add up to one, the last fraction in the list receives the remaining data.

Arguments:

fractions {List[float]} – a list or tuple of floats in the interval ]0, 1[. One of the entries may be the -1 wildcard.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

List[Dataset] – Datasets with the number of samples corresponding to the fractions given
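
A sketch using the wildcard to absorb the remainder:

>>> train, val, test = ds.split([0.7, 0.15, -1], seed=0)  # -1 receives the remaining 15%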

take(self, num: int)

Take the first elements of a dataset.

Arguments:

num {int} – number of elements to take

Returns:

Dataset – A dataset with only the first num elements

repeat(self, times=1, mode='itemwise')

Repeat the dataset elements.

Keyword Arguments:

times {int} – Number of times an element is repeated (default: {1})

mode {str} – Repeat 'itemwise' (i.e. [1,1,2,2,3,3]) or as a 'whole' (i.e. [1,2,3,1,2,3]) (default: {'itemwise'})

Returns:

[Dataset] – Dataset with repeated items
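
A sketch of the two modes, assuming a three-item dataset [1, 2, 3]:

>>> ds.repeat(times=2, mode='itemwise')  # yields 1, 1, 2, 2, 3, 3
>>> ds.repeat(times=2, mode='whole')     # yields 1, 2, 3, 1, 2, 3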

reorder(self, *keys: Key)

Reorder the elements of the dataset items (similar to numpy.transpose).

Arguments:
keys {Union[int, str]} – positional item index or key (if item names were previously set) of each element, given in the desired new order

Returns:

[Dataset] – Dataset with items whose elements have been reordered
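
A sketch (the names assume a prior named("data", "label") call):

>>> ds_r = ds.reorder(1, 0)             # swap the two elements by index
>>> ds_r = ds.reorder('label', 'data')  # equivalently, by name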

named(self, first: Union[str, Sequence[str]], *rest: str)

Set the names associated with the elements of an item.

Arguments:

first {Union[str, Sequence[str]]} – The new item name(s)

Returns:

[Dataset] – A Dataset whose item elements can be accessed by name
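
A usage sketch:

>>> ds = ds.named('data', 'label')
>>> ds.names
['data', 'label']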

property names(self)

Get the names of the elements in an item.

Returns:

List[str] – A list of element names

transform(self, fns: Optional[Union[ItemTransformFn, Sequence[Union[ItemTransformFn, DatasetTransformFn]]]] = None, **kwfns: DatasetTransformFn)

Transform the items of a dataset according to some function (passed as argument).

Arguments:
If a single function taking one input is given, e.g. transform(lambda x: x), it will be applied to the whole item.

If a list of functions is given, e.g. transform([image(), one_hot()]), they will be applied to the elements of the item at the corresponding positions.

If a key is used, e.g. transform(data=lambda x: -x), the function is applied to the item element associated with that key.

Raises:

ValueError: If more functions are passed than there are elements in an item.

KeyError: If a key doesn't match

Returns:

[Dataset] – Dataset whose items are transformed
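
A sketch of the three calling styles described above (the element name 'data' is illustrative):

>>> ds_t = ds.transform(lambda x: x)            # applied to the whole item
>>> ds_t = ds.transform([image(), one_hot(2)])  # applied element-wise by position
>>> ds_t = ds.transform(data=lambda x: -x)      # applied to the element named 'data'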

categorical(self, key: Key, mapping_fn: Callable[[Any], int] = None)

Transform elements into categorical labels (int).

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

Returns:
[Dataset] – Dataset with items that have been transformed to categorical labels

one_hot(self, key: Key, encoding_size: int = None, mapping_fn: Callable[[Any], int] = None, dtype='bool')

Transform elements into a categorical one-hot encoding.

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be automatically inferred at an O(N) runtime cost (default: {None})

mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

dtype {str} – Numpy datatype for the one-hot encoded data (default: {'bool'})

Returns:
[Dataset] – Dataset with items that have been transformed to one-hot encoded labels
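
A sketch, assuming a string-valued element named 'label':

>>> ds_c = ds.categorical('label')               # e.g. 'cat'/'dog' -> 0/1
>>> ds_h = ds.one_hot('label', encoding_size=2)  # e.g. 'cat' -> [True, False]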

image(self, *positional_flags: Any)

Transforms item elements that are either numpy arrays or path strings into a PIL.Image.Image.

Arguments:

positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all elements that can be converted will be converted.

Returns:

[Dataset] – Dataset with PIL.Image.Image elements

numpy(self, *positional_flags: Any)

Transforms elements into numpy.ndarray.

Arguments:

positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all elements that can be converted will be converted.

Returns:

[Dataset] – Dataset with np.ndarray elements
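
A sketch covering both conversions (the positional flags are illustrative):

>>> ds_img = ds.image()            # convert every convertible element to a PIL.Image.Image
>>> ds_np = ds.numpy(True, False)  # convert only the first element to np.ndarray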

zip(self, *datasets)
cartesian_product(self, *datasets)
concat(self, *datasets)
reshape(self, *new_shapes: Optional[Shape], **kwshapes: Optional[Shape])
image_resize(self, *new_sizes: Optional[Shape], **kwsizes: Optional[Shape])
standardize(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Standardize features by removing the mean and scaling to unit variance

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the standardization should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

center(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Centers features by removing the mean

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the centering should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

minmax(self, key_or_keys: Union[Key, Sequence[Key]], axis=0, feature_range=(0, 1))

Transform features by scaling each feature to a given range.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the min-max scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

Dataset – Transformed dataset

maxabs(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Scale each feature by its maximum absolute value.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the Max Abs scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset
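
A sketch of the scaling family, assuming a numeric element named 'data':

>>> ds_s = ds.standardize('data')                   # zero mean, unit variance
>>> ds_c = ds.center('data')                        # subtract the mean
>>> ds_m = ds.minmax('data', feature_range=(0, 1))  # scale into [0, 1]
>>> ds_a = ds.maxabs('data')                        # divide by the max absolute value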

close(self)
to_tensorflow(self)
to_pytorch(self)
class datasetops.dataset.StreamDataset(stream: IO, identifier: str, keep_loaded_items: bool = False)

Bases: datasetops.dataset.Dataset

property allow_random_access(self)
__skip_header(self)
__read_once(self)
__reset(self, clear_loaded_items: bool = False)
__read_item(self)
__getitem__(self, i: int)
close(self)
datasetops.dataset._make_dataset_element_transforming(make_fn: Callable[[AbstractDataset, Optional[int]], Callable], check: Callable = None, maintain_stats=False, operation_name='transform', operation_parameters={}) → DatasetTransformFn
datasetops.dataset._dataset_element_transforming(fn: Callable, check: Callable = None, maintain_stats=False, operation_name='transform', operation_parameters={}) → DatasetTransformFn

Applies the function to dataset item elements.

datasetops.dataset._check_shape_compatibility(shape: Shape)
datasetops.dataset.convert2img(elem: Union[Image.Image, str, Path, np.ndarray]) → Image.Image
datasetops.dataset._check_image_compatibility(elem)
datasetops.dataset._check_numpy_compatibility(allow_scalars=False)
datasetops.dataset.allow_unique(max_num_duplicates=1) → Callable[[Any], bool]

Predicate used for filtering/sampling a dataset classwise.

Keyword Arguments:
max_num_duplicates {int} – max number of samples to take that share the same value (default: {1})

Returns:

Callable[[Any], bool] – Predicate function
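
A sketch of class-wise sampling via filter (the 'label' element name and the cap are illustrative):

>>> ds_capped = ds.filter(label=allow_unique(max_num_duplicates=100))  # at most 100 items per label value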

datasetops.dataset._custom(elem_transform_fn: Callable[[Any], Any], elem_check_fn: Callable[[Any], None] = None) → DatasetTransformFn

Create a user defined transform.

Arguments:
elem_transform_fn {Callable[[Any], Any]} – A user defined function, which takes the element as its only argument

Keyword Arguments:

elem_check_fn {Callable[[Any], None]} – A function that raises an Exception if the element is incompatible (default: {None})

Returns:

DatasetTransformFn – A function to be passed to Dataset.transform()

datasetops.dataset.reshape(new_shape: Shape) → DatasetTransformFn
datasetops.dataset.categorical(mapping_fn: Callable[[Any], int] = None) → DatasetTransformFn

Transform data into a categorical int label.

Arguments:
mapping_fn {Callable[[Any], int]} – A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.dataset.categorical_template(ds: Dataset, key: Key) → Callable[[Any], int]

Creates a template mapping function to be used with one_hot.

Arguments:

ds {Dataset} – Dataset from which to create a template for the one_hot coding

key {Key} – Dataset key (name or item index) on which the one_hot coding is made

Returns:

{Callable[[Any],int]} – mapping_fn for one_hot

datasetops.dataset.one_hot(encoding_size: int, mapping_fn: Callable[[Any], int] = None, dtype='bool') → DatasetTransformFn

Transform data into a one-hot encoded label.

Arguments:

encoding_size {int} – The size of the encoding

mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()
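
A sketch composing the free-function transforms through Dataset.transform (the item layout is illustrative):

>>> ds_t = ds.transform([image_resize((28, 28)), one_hot(encoding_size=10)])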

datasetops.dataset.numpy() → DatasetTransformFn
datasetops.dataset.image() → DatasetTransformFn
datasetops.dataset.image_resize(new_size: Shape, resample=Image.NEAREST) → DatasetTransformFn
datasetops.dataset.standardize(axis=0) → DatasetTransformFn

Standardize features by removing the mean and scaling to unit variance

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.center(axis=0) → DatasetTransformFn

Center features by removing the mean

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.minmax(axis=0, feature_range=(0, 1)) → DatasetTransformFn

Transform features by scaling each feature to a given range.

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.maxabs(axis=0) → DatasetTransformFn

Scale each feature by its maximum absolute value.

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

DatasetTransformFn – Function to be passed to Datasets.transform

datasetops.dataset.zipped(*datasets: AbstractDataset)
datasetops.dataset.cartesian_product(*datasets: AbstractDataset)
datasetops.dataset.concat(*datasets: AbstractDataset)
datasetops.dataset._tf_compute_type(item: Any)
datasetops.dataset._tf_compute_shape(item: Any)
datasetops.dataset._tf_item_conversion(item: Any)
datasetops.dataset.to_tensorflow(dataset: Dataset)
datasetops.dataset.to_pytorch(dataset: Dataset)
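
A sketch of the framework bridges (assumes tensorflow and torch are installed; the returned objects are assumed to plug into each framework's data-loading API):

>>> tf_ds = to_tensorflow(ds)    # for use with the tf.data API
>>> torch_ds = to_pytorch(ds)    # for use with torch.utils.data.DataLoader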