datasetops

Submodules

Package Contents

Classes

Dataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

Loader

Contains information on how to access the raw data, and performs sampling and splitting related operations.

Functions

allow_unique(...)

Predicate used for filtering/sampling a dataset classwise.

custom(→ datasetops.types.DatasetTransformFn)

Create a user defined transform.

reshape(→ datasetops.types.DatasetTransformFn)

categorical(→ datasetops.types.DatasetTransformFn)

Transform data into a categorical int label.

one_hot(→ datasetops.types.DatasetTransformFn)

Transform data into a one-hot encoded label.

categorical_template(...)

Creates a template mapping function to be used with one_hot.

numpy(→ datasetops.types.DatasetTransformFn)

image(→ datasetops.types.DatasetTransformFn)

image_resize(→ datasetops.types.DatasetTransformFn)

zipped(*datasets)

cartesian_product(*datasets)

concat(*datasets)

to_tensorflow(dataset)

to_pytorch(dataset)

from_pytorch(pytorch_dataset)

Create a dataset from a PyTorch dataset.

from_folder_data(→ datasetops.dataset.Dataset)

Load data from a folder with the data structure:

from_folder_class_data(→ datasetops.dataset.Dataset)

Load data from a folder with the data structure:

from_folder_dataset_class_data(...)

Load data from a folder with the data structure:

from_mat_single_mult_data(...)

Load data from .mat file consisting of multiple data.

class datasetops.Dataset(downstream_getter: datasetops.types.Union[datasetops.abstract.ItemGetter, Dataset], name: str = None, ids: datasetops.types.Ids = None, item_transform_fn: datasetops.types.ItemTransformFn = lambda x: ..., item_names: datasetops.types.Dict[str, int] = None)

Bases: datasetops.abstract.AbstractDataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

property shape: datasetops.types.Sequence[int]

Get the shape of a dataset item.

Returns:

Sequence[int] – Item shapes

property names: datasetops.types.List[str]

Get the names of the elements in an item.

Returns:

List[str] – A list of element names

__len__()

Return the total number of elements in the dataset.

__getitem__(i: int) datasetops.types.Tuple

Returns the element at the specified index.

Parameters

i {int} – the index from which to read the sample

counts(*itemkeys: datasetops.types.Key) datasetops.types.List[datasetops.types.Tuple[datasetops.types.Any, int]]

Compute the counts of each unique item in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item-parts must match for them to be considered equal

Returns:

List[Tuple[Any,int]] – List of tuples, each containing the unique value and its number of occurrences
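The counting semantics can be sketched in plain Python (an illustrative stand-in, not the library’s implementation; Dataset.counts additionally restricts the comparison to the given item keys):

```python
from collections import Counter

def counts(values):
    # Count occurrences of each unique value, returning
    # (value, count) tuples in first-seen order.
    return list(Counter(values).items())

labels = ["cat", "dog", "cat", "bird", "cat"]
counts(labels)  # [('cat', 3), ('dog', 1), ('bird', 1)]
```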

unique(*itemkeys: datasetops.types.Key) datasetops.types.List[datasetops.types.Any]

Compute a list of unique values in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {str} – The item keys to be checked for uniqueness

Returns:

List[Any] – List of the unique items

sample(num: int, seed: int = None)

Sample data randomly from the dataset.

Arguments:

num {int} – Number of samples. If the number of samples is larger than the dataset size, some samples may be sampled multiple times

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Sampled dataset
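A minimal sketch of these sampling semantics, using plain Python lists (the with-replacement fallback for an oversized num is an assumption based on the note above, not the library’s source):

```python
import random

def sample(data, num, seed=None):
    # Draw `num` items; sample with replacement only when `num`
    # exceeds the dataset size, otherwise without replacement.
    rng = random.Random(seed)
    if num > len(data):
        return [rng.choice(data) for _ in range(num)]
    return rng.sample(data, num)

subset = sample(list(range(10)), num=3, seed=42)  # reproducible for a fixed seed
```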

filter(predicates: datasetops.types.Optional[datasetops.types.Union[datasetops.types.DataPredicate, datasetops.types.Sequence[datasetops.types.Optional[datasetops.types.DataPredicate]]]] = None, **kwpredicates: datasetops.types.DataPredicate)

Filter a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise. Named predicates can also be passed as keyword arguments if item names have been set.

kwpredicates {DataPredicate} – predicates passed by keyword, each applied to the correspondingly named item element

Returns:

[Dataset] – A filtered Dataset
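The element-wise predicate behaviour can be illustrated with a small stand-alone sketch (list-based, not the library itself; treating None as “always passes” is an assumption):

```python
def filter_items(items, predicates):
    # Keep an item only if every non-None predicate accepts
    # the element at its position.
    def keep(item):
        return all(
            pred(elem)
            for pred, elem in zip(predicates, item)
            if pred is not None
        )
    return [item for item in items if keep(item)]

data = [(1, "a"), (2, "b"), (3, "a")]
filter_items(data, [None, lambda s: s == "a"])  # [(1, 'a'), (3, 'a')]
```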

split_filter(predicates: datasetops.types.Optional[datasetops.types.Union[datasetops.types.DataPredicate, datasetops.types.Sequence[datasetops.types.Optional[datasetops.types.DataPredicate]]]] = None, **kwpredicates: datasetops.types.DataPredicate)

Split a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied element-wise. Named predicates can also be passed as keyword arguments if item names have been set.

Returns:

[Dataset] – Two datasets, one that passed the predicate and one that didn’t

shuffle(seed: int = None)

Shuffle the items in a dataset.

Keyword Arguments:

seed {[int]} – Random seed (default: {None})

Returns:

[Dataset] – Dataset with shuffled items

split(fractions: datasetops.types.List[float], seed: int = None)

Split dataset into multiple datasets, determined by the fractions given.

A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions don’t add up, the last fraction in the list receives the remaining data.

Arguments:

fractions {List[float]} – a list or tuple of floats in the interval ]0,1[. One of the items may be a -1 wildcard.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

List[Dataset] – Datasets with the number of samples corresponding to the fractions given
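How the fractions and the -1 wildcard might resolve into subset sizes can be sketched as follows (an illustrative helper, not the library’s code; the rounding behaviour is an assumption):

```python
def split_sizes(n, fractions):
    # Resolve fractions into integer sizes; a single -1 wildcard
    # absorbs whatever items remain after the other fractions.
    sizes = [None if f == -1 else int(n * f) for f in fractions]
    used = sum(s for s in sizes if s is not None)
    return [n - used if s is None else s for s in sizes]

split_sizes(100, [0.7, -1])  # [70, 30]
```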

take(num: int)

Take the first elements of a dataset.

Arguments:

num {int} – number of elements to take

Returns:

Dataset – A dataset with only the first num elements

repeat(times=1, mode='itemwise')

Repeat the dataset elements.

Keyword Arguments:

times {int} – Number of times an element is repeated (default: {1})

mode {str} – Repeat ‘itemwise’ (i.e. [1,1,2,2,3,3]) or as a ‘whole’ (i.e. [1,2,3,1,2,3]) (default: {‘itemwise’})

Returns:

[Dataset] – Dataset with repeated items
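The two repeat modes described above can be sketched with plain lists (illustrative only, not the library’s implementation):

```python
def repeat(items, times=1, mode="itemwise"):
    if mode == "itemwise":  # each element duplicated in place: [1, 1, 2, 2, 3, 3]
        return [x for x in items for _ in range(times)]
    if mode == "whole":     # the whole sequence duplicated: [1, 2, 3, 1, 2, 3]
        return list(items) * times
    raise ValueError(f"unknown mode: {mode!r}")

repeat([1, 2, 3], times=2, mode="itemwise")  # [1, 1, 2, 2, 3, 3]
```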

reorder(*keys: datasetops.types.Key)

Reorder items in the dataset (similar to numpy.transpose).

Arguments:

keys {Union[int, str]} – positional item indexes or keys (if item names were previously set) defining the new element order

Returns:

[Dataset] – Dataset with items whose elements have been reordered

named(first: datasetops.types.Union[str, datasetops.types.Sequence[str]], *rest: str)

Set the names associated with the elements of an item.

Arguments:

first {Union[str, Sequence[str]]} – The new item name(s)

Returns:

[Dataset] – A Dataset whose item elements can be accessed by name

transform(fns: datasetops.types.Optional[datasetops.types.Union[datasetops.types.ItemTransformFn, datasetops.types.Sequence[datasetops.types.Union[datasetops.types.ItemTransformFn, datasetops.types.DatasetTransformFn]]]] = None, **kwfns: datasetops.types.DatasetTransformFn)

Transform the items of a dataset according to some function (passed as argument).

Arguments:

If a single function taking one input is given, e.g. transform(lambda x: x), it is applied to the whole item. If a list of functions is given, e.g. transform([image(), one_hot()]), they are applied to the elements of the item at the corresponding positions. If a key is used, e.g. transform(data=lambda x: -x), the element associated with the key is transformed.

Raises:

ValueError: If more functions are passed than there are elements in an item.

KeyError: If a key doesn’t match

Returns:

[Dataset] – Dataset whose items are transformed
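The positional, element-wise application can be sketched like this (a plain-Python stand-in for the list-of-functions case; treating None as a no-op is an assumption):

```python
def transform(items, fns):
    # Apply fns[i] to element i of every item; None leaves
    # the element unchanged.
    return [
        tuple(elem if fn is None else fn(elem) for fn, elem in zip(fns, item))
        for item in items
    ]

data = [(1, "a"), (2, "b")]
transform(data, [lambda x: -x, None])  # [(-1, 'a'), (-2, 'b')]
```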

categorical(key: datasetops.types.Key, mapping_fn: datasetops.types.Callable[[datasetops.types.Any], int] = None)

Transform elements into categorical labels (int).

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:

mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels

one_hot(key: datasetops.types.Key, encoding_size: int = None, mapping_fn: datasetops.types.Callable[[datasetops.types.Any], int] = None, dtype='bool')

Transform elements into a categorical one-hot encoding.

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:

encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be automatically inferred (with an O(N) runtime cost) (default: {None})

mapping_fn {Callable[[Any], int]} – User defined mapping function (default: {None})

dtype {str} – Numpy datatype for the one-hot encoded data (default: {‘bool’})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels
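The underlying encoding is standard one-hot; a minimal sketch with plain lists (the library presumably returns numpy arrays with the requested dtype, so this is only the idea):

```python
def one_hot_vector(index, encoding_size):
    # Build a vector of zeros with a single 1 at `index`.
    vec = [0] * encoding_size
    vec[index] = 1
    return vec

one_hot_vector(2, encoding_size=4)  # [0, 0, 1, 0]
```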

image(*positional_flags: datasetops.types.Any)

Transforms item elements that are either numpy arrays or path strings into a PIL.Image.Image.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with PIL.Image.Image elements

numpy(*positional_flags: datasetops.types.Any)

Transforms elements into numpy.ndarray.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with np.ndarray elements

zip(*datasets)
cartesian_product(*datasets)
concat(*datasets)
reshape(*new_shapes: datasetops.types.Optional[datasetops.types.Shape], **kwshapes: datasetops.types.Optional[datasetops.types.Shape])
image_resize(*new_sizes: datasetops.types.Optional[datasetops.types.Shape], **kwsizes: datasetops.types.Optional[datasetops.types.Shape])
to_tensorflow()
to_pytorch()
datasetops.allow_unique(max_num_duplicates=1) datasetops.types.Callable[[datasetops.types.Any], bool]

Predicate used for filtering/sampling a dataset classwise.

Keyword Arguments:

max_num_duplicates {int} – max number of samples to take that share the same value (default: {1})

Returns:

Callable[[Any], bool] – Predicate function
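The description suggests a stateful predicate; a sketch of how it might work (an assumption, not the library’s source):

```python
from collections import defaultdict

def allow_unique(max_num_duplicates=1):
    # Return a predicate that accepts a value only until it has
    # been seen `max_num_duplicates` times.
    seen = defaultdict(int)

    def predicate(value):
        seen[value] += 1
        return seen[value] <= max_num_duplicates

    return predicate

pred = allow_unique(max_num_duplicates=1)
[pred(v) for v in ["a", "a", "b"]]  # [True, False, True]
```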

datasetops.custom(elem_transform_fn: datasetops.types.Callable[[datasetops.types.Any], datasetops.types.Any], elem_check_fn: datasetops.types.Callable[[datasetops.types.Any], None] = None) datasetops.types.DatasetTransformFn

Create a user defined transform.

Arguments:

elem_transform_fn {Callable[[Any], Any]} – A user defined function, which takes the element as its only argument

Keyword Arguments:

elem_check_fn {Callable[[Any], None]} – A function that raises an Exception if the element is incompatible (default: {None})

Returns:

DatasetTransformFn – A function to be passed to Dataset.transform()

datasetops.reshape(new_shape: datasetops.types.Shape) datasetops.types.DatasetTransformFn
datasetops.categorical(mapping_fn: datasetops.types.Callable[[datasetops.types.Any], int] = None) datasetops.types.DatasetTransformFn

Transform data into a categorical int label.

Arguments:

mapping_fn {Callable[[Any], int]} – A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.one_hot(encoding_size: int, mapping_fn: datasetops.types.Callable[[datasetops.types.Any], int] = None, dtype='bool') datasetops.types.DatasetTransformFn

Transform data into a one-hot encoded label.

Arguments:

encoding_size {int} – The size of the encoding

mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.categorical_template(ds: Dataset, key: datasetops.types.Key) datasetops.types.Callable[[datasetops.types.Any], int]

Creates a template mapping function to be used with one_hot.

Arguments:

ds {Dataset} – Dataset from which to create a template for one-hot coding

key {Key} – Dataset key (name or item index) on which the one-hot coding is made

Returns:

{Callable[[Any],int]} – mapping_fn for one_hot
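The mapping it produces can be sketched from first principles (illustrative only; built here from a plain list of values rather than a Dataset and key):

```python
def categorical_template(values):
    # Assign each unique value an integer in first-seen order
    # and return a mapping function usable as mapping_fn.
    mapping = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return lambda v: mapping[v]

fn = categorical_template(["cat", "dog", "cat"])
(fn("cat"), fn("dog"))  # (0, 1)
```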

datasetops.numpy() datasetops.types.DatasetTransformFn
datasetops.image() datasetops.types.DatasetTransformFn
datasetops.image_resize(new_size: datasetops.types.Shape, resample=Image.NEAREST) datasetops.types.DatasetTransformFn
datasetops.zipped(*datasets: datasetops.abstract.AbstractDataset)
datasetops.cartesian_product(*datasets: datasetops.abstract.AbstractDataset)
datasetops.concat(*datasets: datasetops.abstract.AbstractDataset)
datasetops.to_tensorflow(dataset: Dataset)
datasetops.to_pytorch(dataset: Dataset)
class datasetops.Loader(getdata: datasetops.types.Callable[[datasetops.types.Any], datasetops.types.Any], name: str = None)

Bases: datasetops.dataset.Dataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

append(identifier: datasetops.types.Data)
extend(ids: datasetops.types.Union[datasetops.types.List[datasetops.types.Data], numpy.ndarray])
datasetops.from_pytorch(pytorch_dataset)

Create a dataset from a PyTorch dataset.

Arguments:

pytorch_dataset {torch.utils.data.Dataset} – A PyTorch dataset to load from

Returns:

[Dataset] – A datasetops.Dataset

datasetops.from_folder_data(path: datasetops.types.AnyPath) datasetops.dataset.Dataset

Load data from a folder with the data structure:

folder
|- sample1.jpg
|- sample2.jpg

Arguments:

path {AnyPath} – path to folder

Returns:
Dataset – A dataset of data paths, e.g. (‘folder/sample1.jpg’)

datasetops.from_folder_class_data(path: datasetops.types.AnyPath) datasetops.dataset.Dataset

Load data from a folder with the data structure:

nested_folder
|- class1
   |- sample1.jpg
   |- sample2.jpg
|- class2
   |- sample3.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
Dataset – A labelled dataset of data paths and corresponding class labels,

e.g. (‘nested_folder/class1/sample1.jpg’, ‘class1’)
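The documented behaviour amounts to pairing each file path with the name of its parent folder; a stand-alone sketch (pathlib-based, not the library’s loader):

```python
from pathlib import Path

def folder_class_pairs(root):
    # Pair every file one level below `root` with the name of
    # the class folder that contains it.
    return sorted(
        (str(path), path.parent.name)
        for path in Path(root).glob("*/*")
        if path.is_file()
    )
```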

datasetops.from_folder_dataset_class_data(path: datasetops.types.AnyPath) datasetops.types.List[datasetops.dataset.Dataset]

Load data from a folder with the data structure:

nested_folder
|- dataset1
   |- class1
      |- sample1.jpg
      |- sample2.jpg
   |- class2
      |- sample3.jpg
|- dataset2
   |- …

Arguments:

path {AnyPath} – path to nested folder

Returns:
List[Dataset] – A list of labelled datasets, each with data paths and corresponding class labels, e.g. (‘nested_folder/dataset1/class1/sample1.jpg’, ‘class1’)

datasetops.from_mat_single_mult_data(path: datasetops.types.AnyPath) datasetops.types.List[datasetops.dataset.Dataset]

Load data from .mat file consisting of multiple data.

E.g. a .mat file with keys [‘X_src’, ‘Y_src’, ‘X_tgt’, ‘Y_tgt’]

Arguments:

path {AnyPath} – path to .mat file

Returns:
List[Dataset] – A list of datasets, one created for each suffix, e.g. one dataset with data from the keys (‘X_src’, ‘Y_src’) and another from (‘X_tgt’, ‘Y_tgt’)