datasetops

Dataset Ops is a library that enables the loading and processing of datasets stored in various formats. It does so by providing:

1. Loaders for various storage formats
2. Transformations which may be chained to transform the data into the desired form
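
As a rough sketch of how these pieces chain together (the folder path and the item names "image" and "label" below are placeholders, not part of the library):

```python
import datasetops as do

# Load (image_path, class_label) items from a class-structured folder
# (see from_folder_class_data below); "path/to/data" is a placeholder.
ds = do.from_folder_class_data("path/to/data")

# Chain transformations: name the item elements, decode the image paths,
# one-hot encode the labels, shuffle, and split into train/test sets.
train, test = (
    ds.named("image", "label")
      .image(True, False)   # convert only the first element to a PIL image
      .one_hot("label")
      .shuffle(seed=42)
      .split([0.7, -1], seed=42)
)
```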

Finding The Documentation

Documentation is available online at:

https://datasetops.readthedocs.io/en/latest/

Package Contents

class datasetops.Dataset(downstream_getter: Union[ItemGetter, ‘Dataset’], operation_name: str, name: str = None, ids: Ids = None, item_transform_fn: ItemTransformFn = lambda x: ..., item_names: Dict[str, int] = None, operation_parameters: Dict = {}, stats: List[Optional[scaler.ElemStats]] = [])

Bases: datasetops.abstract.AbstractDataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

__len__(self)
__getitem__(self, i: int)
cached(self, path: str = None, keep_loaded_items: bool = False, display_progress: bool = False)
item_stats(self, item_key: Key, axis=None)

Compute the statistics (mean, std, min, max) for an item element

Arguments:

item_key {Key} – index or string identifier of the element on which the stats should be computed

Keyword Arguments:

axis {[type]} – the axis on which to compute statistics (default: {None})

Raises:

TypeError: if statistics cannot be computed on the element type

Returns:

scaler.ElemStats – Named tuple with (mean, std, min, max, axis)
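
For instance, assuming ds is an already loaded Dataset with a numeric element named "data" (both the variable and the name are illustrative):

```python
# Per-column statistics of the "data" element; the result is the
# (mean, std, min, max, axis) named tuple described above.
stats = ds.item_stats("data", axis=0)
print(stats.mean, stats.std, stats.min, stats.max)
```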

property shape(self)

Get the shape of a dataset item.

Returns:

Sequence[int] – Item shapes

counts(self, *itemkeys: Key)

Compute the counts of each unique item in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item-parts must match for them to be considered equal

Returns:

List[Tuple[Any,int]] – List of tuples, each containing the unique value and its number of occurrences

unique(self, *itemkeys: Key)

Compute a list of unique values in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {str} – The item keys to be checked for uniqueness

Returns:

List[Any] – List of the unique items
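
A small sketch of counts and unique, assuming ds has an element named "label":

```python
label_counts = ds.counts("label")   # e.g. [("class1", 2), ("class2", 1)]
labels = ds.unique("label")         # e.g. ["class1", "class2"]
```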

sample(self, num: int, seed: int = None)

Sample data randomly from the dataset.

Arguments:

num {int} – Number of samples. If the number of samples is larger than the dataset size, some samples may be sampled multiple times

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Sampled dataset
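
For example, with ds again being an already loaded Dataset:

```python
# Draw 100 random items; fixing the seed makes the draw reproducible.
subset = ds.sample(100, seed=42)
```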

filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Filter a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item names have been set.

kwpredicates {DataPredicate} – Predicates passed by keyword

Returns:

[Dataset] – A filtered Dataset
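
A sketch of both predicate styles, assuming items are (image, label) pairs and the "label" name has been set with named:

```python
# A single positional predicate receives the whole item...
only_class1 = ds.filter(lambda item: item[1] == "class1")
# ...while keyword predicates are applied element-wise to named elements.
only_class1 = ds.filter(label=lambda lbl: lbl == "class1")
```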

split_filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Split a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item names have been set.

Returns:

[Dataset] – Two datasets: one containing the items that passed the predicate and one containing those that did not
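
For example, assuming a named "label" element:

```python
# Partition into items that satisfy the predicate and items that do not.
matches, rest = ds.split_filter(label=lambda lbl: lbl == "class1")
```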

shuffle(self, seed: int = None)

Shuffle the items in a dataset.

Keyword Arguments:

seed {[int]} – Random seed (default: {None})

Returns:

[Dataset] – Dataset with shuffled items

split(self, fractions: List[float], seed: int = None)

Split dataset into multiple datasets, determined by the fractions given.

A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions don't add up, the last fraction in the list receives the remaining data.

Arguments:

fractions {List[float]} – a list or tuple of floats in the interval ]0,1[. One of the items may be a -1 wildcard.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

List[Dataset] – Datasets with the number of samples corresponding to the fractions given
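
For example:

```python
# 70% train, 20% validation, and whatever remains (the -1 wildcard) as test.
train, val, test = ds.split([0.7, 0.2, -1], seed=42)
```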

take(self, num: int)

Take the first elements of a dataset.

Arguments:

num {int} – number of elements to take

Returns:

Dataset – A dataset with only the first num elements

repeat(self, times=1, mode='itemwise')

Repeat the dataset elements.

Keyword Arguments:

times {int} – Number of times an element is repeated (default: {1})

mode {str} – Repeat 'itemwise' (i.e. [1,1,2,2,3,3]) or as a 'whole' (i.e. [1,2,3,1,2,3]) (default: {'itemwise'})

Returns:

[Dataset] – Dataset with repeated items
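
A sketch of the two modes; the calls only illustrate the mode argument, and the resulting orderings follow the patterns shown above:

```python
repeated_itemwise = ds.repeat(times=2, mode="itemwise")  # elements repeated in place, e.g. [1, 1, 2, 2, ...]
repeated_whole = ds.repeat(times=2, mode="whole")        # whole pass repeated, e.g. [1, 2, 3, 1, 2, 3, ...]
```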

reorder(self, *keys: Key)

Reorder items in the dataset (similar to numpy.transpose).

Arguments:

new_inds {Union[int,str]} – positional item index or key (if item names were previously set) of the item

Returns:

[Dataset] – Dataset with items whose elements have been reordered

named(self, first: Union[str, Sequence[str]], *rest: str)

Set the names associated with the elements of an item.

Arguments:

first {Union[str, Sequence[str]]} – The new item name(s)

Returns:

[Dataset] – A Dataset whose item elements can be accessed by name
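
For instance:

```python
# Name the elements of an item, then swap their order.
named_ds = ds.named("image", "label")
flipped = named_ds.reorder("label", "image")   # items become (label, image)
```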

property names(self)

Get the names of the elements in an item.

Returns:

List[str] – A list of element names

transform(self, fns: Optional[Union[ItemTransformFn, Sequence[Union[ItemTransformFn, DatasetTransformFn]]]] = None, **kwfns: DatasetTransformFn)

Transform the items of a dataset according to some function (passed as argument).

Arguments:

If a single function taking one input is given, e.g. transform(lambda x: x), it will be applied to the whole item.

If a list of functions is given, e.g. transform([image(), one_hot()]), they will be applied to the elements of the item at the corresponding positions.

If a key is used, e.g. transform(data=lambda x: -x), the item element associated with the key is transformed.

Raises:

ValueError: If more functions are passed than there are elements in an item.

KeyError: If a key doesn't match

Returns:

[Dataset] – Dataset whose items are transformed
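
A sketch of the three call styles (datasetops is assumed imported as do, and the "data" name and encoding size are illustrative):

```python
import datasetops as do

# Whole-item transform: a single function receives the entire item.
ds1 = ds.transform(lambda item: item)

# Positional transforms: one transform per element position.
ds2 = ds.transform([do.image(), do.one_hot(encoding_size=2)])

# Keyword transform: applied only to the element named "data".
ds3 = ds.transform(data=lambda x: -x)
```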

categorical(self, key: Key, mapping_fn: Callable[[Any], int] = None)

Transform elements into categorical labels (int).

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
mapping_fn {Callable[[Any], int]} – User defined mapping function

(default: {None})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels

one_hot(self, key: Key, encoding_size: int = None, mapping_fn: Callable[[Any], int] = None, dtype='bool')

Transform elements into a categorical one-hot encoding.

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:

encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be automatically inferred with an O(N) runtime cost (default: {None})

mapping_fn {Callable[[Any], int]} –

User defined mapping function (default: {None})

dtype {str} –

Numpy datatype for the one-hot encoded data (default: {‘bool’})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels
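
For example, on a dataset with a named "label" element:

```python
ds_int = ds.categorical("label")   # labels become integers
ds_hot = ds.one_hot("label")       # labels become one-hot vectors (size inferred)
```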

image(self, *positional_flags: Any)

Transforms item elements that are either numpy arrays or path strings into a PIL.Image.Image.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with PIL.Image.Image elements

numpy(self, *positional_flags: Any)

Transforms elements into numpy.ndarray.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with np.ndarray elements
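
For instance, on (image_path, label) items:

```python
ds_img = ds.image(True, False)      # convert only the first element to a PIL image
ds_arr = ds_img.numpy(True, False)  # and back into a numpy array
```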

zip(self, *datasets)
cartesian_product(self, *datasets)
concat(self, *datasets)
reshape(self, *new_shapes: Optional[Shape], **kwshapes: Optional[Shape])
image_resize(self, *new_sizes: Optional[Shape], **kwsizes: Optional[Shape])
standardize(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Standardize features by removing the mean and scaling to unit variance

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the standardization should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

center(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Centers features by removing the mean

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the centering should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

minmax(self, key_or_keys: Union[Key, Sequence[Key]], axis=0, feature_range=(0, 1))

Transform features by scaling each feature to a given range.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the min-max scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

Dataset – Transformed dataset

maxabs(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Scale each feature by its maximum absolute value.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} –

The keys on which the Max Abs scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset
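
A sketch of the four scaling operations on a numeric element assumed to be named "data":

```python
standardized = ds.standardize("data", axis=0)                # zero mean, unit variance
centered = ds.center("data", axis=0)                         # zero mean
scaled_01 = ds.minmax("data", axis=0, feature_range=(0, 1))  # scaled to [0, 1]
scaled_abs = ds.maxabs("data", axis=0)                       # divided by max absolute value
```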

close(self)
to_tensorflow(self)
to_pytorch(self)
datasetops.allow_unique(max_num_duplicates=1) → Callable[[Any], bool]

Predicate used for filtering/sampling a dataset classwise.

Keyword Arguments:
max_num_duplicates {int} –

max number of samples to take that share the same value (default: {1})

Returns:

Callable[[Any], bool] – Predicate function
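
One way to combine it with filter for class-wise sub-sampling (the "label" name is illustrative):

```python
import datasetops as do

# Keep at most two samples per distinct label value.
balanced = ds.filter(label=do.allow_unique(2))
```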

datasetops.reshape(new_shape: Shape) → DatasetTransformFn
datasetops.categorical(mapping_fn: Callable[[Any], int] = None) → DatasetTransformFn

Transform data into a categorical int label.

Arguments:
mapping_fn {Callable[[Any], int]} –

A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.one_hot(encoding_size: int, mapping_fn: Callable[[Any], int] = None, dtype='bool') → DatasetTransformFn

Transform data into a one-hot encoded label.

Arguments:

encoding_size {int} – The size of the encoding

mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.categorical_template(ds: Dataset, key: Key) → Callable[[Any], int]

Creates a template mapping function to be used with one_hot.

Arguments:

ds {Dataset} – Dataset from which to create a template for the one_hot coding

key {Key} – Dataset key (name or item index) on which the one_hot coding is made

Returns:

{Callable[[Any],int]} – mapping_fn for one_hot
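
A sketch of how the template pairs with one_hot and transform (the "label" name is illustrative, and the encoding size is derived from the unique label values):

```python
import datasetops as do

mapping_fn = do.categorical_template(ds, "label")
encode = do.one_hot(encoding_size=len(ds.unique("label")), mapping_fn=mapping_fn)
ds_hot = ds.transform(label=encode)
```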

datasetops.numpy() → DatasetTransformFn
datasetops.image() → DatasetTransformFn
datasetops.image_resize(new_size: Shape, resample=Image.NEAREST) → DatasetTransformFn
datasetops.zipped(*datasets: AbstractDataset)
datasetops.cartesian_product(*datasets: AbstractDataset)
datasetops.concat(*datasets: AbstractDataset)
datasetops.to_tensorflow(dataset: Dataset)
datasetops.to_pytorch(dataset: Dataset)
class datasetops.Loader(getdata: Callable[[Any], Any], identifier: Optional[str] = None, name: str = None)

Bases: datasetops.dataset.Dataset

append(self, identifier: Data)
extend(self, ids: Union[List[Data], np.ndarray])
datasetops.from_pytorch(pytorch_dataset, identifier: Optional[str] = None)

Create a dataset from a PyTorch dataset

Arguments:

pytorch_dataset {torch.utils.data.Dataset} – A PyTorch dataset to load from

identifier {Optional[str]} – unique identifier

Returns:

[Dataset] – A datasetops.Dataset
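
A minimal sketch using a synthetic torch dataset:

```python
import torch
from torch.utils.data import TensorDataset
import datasetops as do

# Any torch.utils.data.Dataset will do; this one holds random features and labels.
pytorch_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
ds = do.from_pytorch(pytorch_dataset)
```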

datasetops.from_folder_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

folder
├── sample1.jpg
└── sample2.jpg

Arguments:

path {AnyPath} – path to folder

Returns:
Dataset – A dataset of data paths, e.g. ('folder/sample1.jpg')

datasetops.from_folder_class_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

data
├── class1
│   ├── sample1.jpg
│   └── sample2.jpg
└── class2
    └── sample3.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
Dataset – A labelled dataset of data paths and corresponding class labels,

e.g. (‘nested_folder/class1/sample1.jpg’, ‘class1’)
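
For example, where "data" refers to a folder laid out as sketched above:

```python
import datasetops as do

ds = do.from_folder_class_data("data")
path, label = ds[0]   # e.g. ("data/class1/sample1.jpg", "class1")
```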

datasetops.from_folder_group_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

data
├── group1
│   ├── sample1.jpg
│   └── sample2.jpg
└── group2
    ├── sample1.jpg
    └── sample2.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
Dataset – A dataset of paths to the objects of each group zipped together with corresponding names, e.g. ('nested_folder/group1/sample1.jpg', 'nested_folder/group2/sample1.txt')

datasetops.from_folder_dataset_class_data(path: AnyPath) → List[Dataset]

Load data from a folder with the data structure:

data
├── dataset1
│   ├── class1
│   │   ├── sample1.jpg
│   │   └── sample2.jpg
│   └── class2
│       └── sample3.jpg
└── dataset2
    └── sample3.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
List[Dataset] – A list of labelled datasets, each with data paths and corresponding class labels,

e.g. (‘nested_folder/class1/sample1.jpg’, ‘class1’)

datasetops.from_folder_dataset_group_data(path: AnyPath) → List[Dataset]

Load data from a folder with the data structure:

nested_folder
├── dataset1
│   ├── group1
│   │   ├── sample1.jpg
│   │   └── sample2.jpg
│   └── group2
│       ├── sample1.txt
│       └── sample2.txt
└── dataset2
    └── ...

Arguments:

path {AnyPath} – path to nested folder

Returns:
List[Dataset] – A list of datasets, each with data composed from different types,

e.g. (‘nested_folder/group1/sample1.jpg’, ‘nested_folder/group2/sample1.txt’)

datasetops.from_mat_single_mult_data(path: AnyPath) → List[Dataset]

Load data from a .mat file containing multiple datasets.

E.g. a .mat file with keys [‘X_src’, ‘Y_src’, ‘X_tgt’, ‘Y_tgt’]

Arguments:

path {AnyPath} – path to .mat file

Returns:

List[Dataset] – A list of datasets, where a dataset is created for each suffix, e.g. one dataset with data from the keys ('X_src', 'Y_src') and one from ('X_tgt', 'Y_tgt')