datasetops

Dataset Ops is a library that enables the loading and processing of datasets stored in various formats. It does so by providing:

1. Loaders for various storage formats
2. Transformations which may be chained to transform the data into the desired form
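
As a rough sketch of how these pieces chain together (the folder path and the item names "image" and "label" below are placeholders, not part of the library):

```python
import datasetops as do

# Load (image_path, class_label) items from a class-structured folder
# (see from_folder_class_data below); "path/to/data" is a placeholder.
ds = do.from_folder_class_data("path/to/data")

# Chain transformations: name the item elements, decode the image paths,
# one-hot encode the labels, shuffle, and split into train/test sets.
train, test = (
    ds.named("image", "label")
      .image(True, False)   # convert only the first element to a PIL image
      .one_hot("label")
      .shuffle(seed=42)
      .split([0.7, -1], seed=42)
)
```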

Finding The Documentation

Documentation is available online at:

https://datasetops.readthedocs.io/en/latest/

Package Contents

class datasetops.Dataset(downstream_getter: Union[ItemGetter, ‘Dataset’], operation_name: str, name: str = None, ids: Ids = None, item_transform_fn: ItemTransformFn = lambda x: ..., item_names: Dict[str, int] = None, operation_parameters: Dict = {}, stats: List[Optional[scaler.ElemStats]] = [])

Bases: datasetops.abstract.AbstractDataset

Contains information on how to access the raw data, and performs sampling and splitting related operations.

__len__(self)
__getitem__(self, i: int)
cached(self, path: str = None, keep_loaded_items: bool = False, display_progress: bool = False)
item_stats(self, item_key: Key, axis=None)

Compute the statistics (mean, std, min, max) for an item element

Arguments:

item_key {Key} – index or string identifier of the element on which the stats should be computed

Keyword Arguments:

axis {[type]} – the axis on which to compute statistics (default: {None})

Raises:

TypeError: if statistics cannot be computed on the element type

Returns:

scaler.ElemStats – Named tuple with (mean, std, min, max, axis)
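
For instance, assuming ds is an already loaded Dataset with a numeric element named "data" (both the variable and the name are illustrative):

```python
# Per-column statistics of the "data" element; the result is the
# (mean, std, min, max, axis) named tuple described above.
stats = ds.item_stats("data", axis=0)
print(stats.mean, stats.std, stats.min, stats.max)
```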

property shape(self)

Get the shape of a dataset item.

Returns:

Sequence[int] – Item shapes

counts(self, *itemkeys: Key)

Compute the counts of each unique item in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item-parts must match for them to be considered equal

Returns:

List[Tuple[Any,int]] – List of tuples, each containing the unique value and its number of occurrences

unique(self, *itemkeys: Key)

Compute a list of unique values in the dataset.

Warning: this operation may be expensive for large datasets

Arguments:

itemkeys {str} – The item keys to be checked for uniqueness

Returns:

List[Any] – List of the unique items
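
A small sketch of counts and unique, assuming ds has an element named "label":

```python
label_counts = ds.counts("label")   # e.g. [("class1", 2), ("class2", 1)]
labels = ds.unique("label")         # e.g. ["class1", "class2"]
```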

sample(self, num: int, seed: int = None)

Sample data randomly from the dataset.

Arguments:

num {int} – Number of samples. If the number of samples is larger than the dataset size, some samples may be sampled multiple times

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

[Dataset] – Sampled dataset
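
For example, with ds again being an already loaded Dataset:

```python
# Draw 100 random items; fixing the seed makes the draw reproducible.
subset = ds.sample(100, seed=42)
```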

filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Filter a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item names have been set.

kwpredicates {DataPredicate} – Predicates passed by keyword

Returns:

[Dataset] – A filtered Dataset
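
A sketch of both predicate styles, assuming items are (image, label) pairs and the "label" name has been set with named:

```python
# A single positional predicate receives the whole item...
only_class1 = ds.filter(lambda item: item[1] == "class1")
# ...while keyword predicates are applied element-wise to named elements.
only_class1 = ds.filter(label=lambda lbl: lbl == "class1")
```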

split_filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)

Split a dataset using a predicate function.

Keyword Arguments:

predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a single dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item names have been set.

Returns:

[Dataset] – Two datasets: one containing the items that passed the predicate and one containing those that did not
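
For example, assuming a named "label" element:

```python
# Partition into items that satisfy the predicate and items that do not.
matches, rest = ds.split_filter(label=lambda lbl: lbl == "class1")
```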

shuffle(self, seed: int = None)

Shuffle the items in a dataset.

Keyword Arguments:

seed {[int]} – Random seed (default: {None})

Returns:

[Dataset] – Dataset with shuffled items

split(self, fractions: List[float], seed: int = None)

Split dataset into multiple datasets, determined by the fractions given.

A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions don't add up, the last fraction in the list receives the remaining data.

Arguments:

fractions {List[float]} – a list or tuple of floats in the interval ]0,1[. One of the items may be a -1 wildcard.

Keyword Arguments:

seed {int} – Random seed (default: {None})

Returns:

List[Dataset] – Datasets with the number of samples corresponding to the fractions given
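
For example:

```python
# 70% train, 20% validation, and whatever remains (the -1 wildcard) as test.
train, val, test = ds.split([0.7, 0.2, -1], seed=42)
```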

take(self, num: int)

Take the first elements of a dataset.

Arguments:

num {int} – number of elements to take

Returns:

Dataset – A dataset with only the first num elements

repeat(self, times=1, mode='itemwise')

Repeat the dataset elements.

Keyword Arguments:

times {int} – Number of times an element is repeated (default: {1})

mode {str} – Repeat 'itemwise' (i.e. [1,1,2,2,3,3]) or as a 'whole' (i.e. [1,2,3,1,2,3]) (default: {'itemwise'})

Returns:

[Dataset] – Dataset with repeated items
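
A sketch of the two modes; the calls only illustrate the mode argument, and the resulting orderings follow the patterns shown above:

```python
repeated_itemwise = ds.repeat(times=2, mode="itemwise")  # elements repeated in place, e.g. [1, 1, 2, 2, ...]
repeated_whole = ds.repeat(times=2, mode="whole")        # whole pass repeated, e.g. [1, 2, 3, 1, 2, 3, ...]
```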

reorder(self, *keys: Key)

Reorder items in the dataset (similar to numpy.transpose).

Arguments:

new_inds {Union[int,str]} – positional item index or key (if item names were previously set) of the item

Returns:

[Dataset] – Dataset with items whose elements have been reordered

named(self, first: Union[str, Sequence[str]], *rest: str)

Set the names associated with the elements of an item.

Arguments:

first {Union[str, Sequence[str]]} – The new item name(s)

Returns:

[Dataset] – A Dataset whose item elements can be accessed by name
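
For instance:

```python
# Name the elements of an item, then swap their order.
named_ds = ds.named("image", "label")
flipped = named_ds.reorder("label", "image")   # items become (label, image)
```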

property names(self)

Get the names of the elements in an item.

Returns:

List[str] – A list of element names

transform(self, fns: Optional[Union[ItemTransformFn, Sequence[Union[ItemTransformFn, DatasetTransformFn]]]] = None, **kwfns: DatasetTransformFn)

Transform the items of a dataset according to some function (passed as argument).

Arguments:

If a single function taking one input is given, e.g. transform(lambda x: x), it will be applied to the whole item.

If a list of functions is given, e.g. transform([image(), one_hot()]), they will be applied to the elements of the item at the corresponding positions.

If a key is used, e.g. transform(data=lambda x: -x), the item element associated with the key is transformed.

Raises:

ValueError: If more functions are passed than there are elements in an item.

KeyError: If a key doesn't match

Returns:

[Dataset] – Dataset whose items are transformed
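
A sketch of the three call styles (datasetops is assumed imported as do, and the "data" name and encoding size are illustrative):

```python
import datasetops as do

# Whole-item transform: a single function receives the entire item.
ds1 = ds.transform(lambda item: item)

# Positional transforms: one transform per element position.
ds2 = ds.transform([do.image(), do.one_hot(encoding_size=2)])

# Keyword transform: applied only to the element named "data".
ds3 = ds.transform(data=lambda x: -x)
```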

categorical(self, key: Key, mapping_fn: Callable[[Any], int] = None)

Transform elements into categorical labels (int).

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:
mapping_fn {Callable[[Any], int]} – User defined mapping function

(default: {None})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels

one_hot(self, key: Key, encoding_size: int = None, mapping_fn: Callable[[Any], int] = None, dtype='bool')

Transform elements into a categorical one-hot encoding.

Arguments:

key {Key} – Index or name of the element to be transformed

Keyword Arguments:

encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be automatically inferred with an O(N) runtime cost (default: {None})

mapping_fn {Callable[[Any], int]} –

User defined mapping function (default: {None})

dtype {str} –

Numpy datatype for the one-hot encoded data (default: {‘bool’})

Returns:

[Dataset] – Dataset with items that have been transformed to categorical labels
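
For example, on a dataset with a named "label" element:

```python
ds_int = ds.categorical("label")   # labels become integers
ds_hot = ds.one_hot("label")       # labels become one-hot vectors (size inferred)
```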

image(self, *positional_flags: Any)

Transforms item elements that are either numpy arrays or path strings into a PIL.Image.Image.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with PIL.Image.Image elements

numpy(self, *positional_flags: Any)

Transforms elements into numpy.ndarray.

Arguments:

positional flags, e.g. (True, False) denoting which element should be converted. If no flags are supplied, all data that can be converted will be converted.

Returns:

[Dataset] – Dataset with np.ndarray elements
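
For instance, on (image_path, label) items:

```python
ds_img = ds.image(True, False)      # convert only the first element to a PIL image
ds_arr = ds_img.numpy(True, False)  # and back into a numpy array
```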

zip(self, *datasets)
cartesian_product(self, *datasets)
concat(self, *datasets)
reshape(self, *new_shapes: Optional[Shape], **kwshapes: Optional[Shape])
image_resize(self, *new_sizes: Optional[Shape], **kwsizes: Optional[Shape])
standardize(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Standardize features by removing the mean and scaling to unit variance

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the standardization should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

center(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Centers features by removing the mean

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the centering should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset

minmax(self, key_or_keys: Union[Key, Sequence[Key]], axis=0, feature_range=(0, 1))

Transform features by scaling each feature to a given range.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the min-max scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to (default: {(0, 1)})

Returns:

Dataset – Transformed dataset

maxabs(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)

Scale each feature by its maximum absolute value.

Arguments:
key_or_keys {Union[Key, Sequence[Key]]} –

The keys on which the Max Abs scaling should be performed

Keyword Arguments:

axis {int} – Axis on which to accumulate statistics (default: {0})

Returns:

Dataset – Transformed dataset
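
A sketch of the four scaling operations on a numeric element assumed to be named "data":

```python
standardized = ds.standardize("data", axis=0)                # zero mean, unit variance
centered = ds.center("data", axis=0)                         # zero mean
scaled_01 = ds.minmax("data", axis=0, feature_range=(0, 1))  # scaled to [0, 1]
scaled_abs = ds.maxabs("data", axis=0)                       # divided by max absolute value
```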

close(self)
to_tensorflow(self)
to_pytorch(self)
datasetops.allow_unique(max_num_duplicates=1) → Callable[[Any], bool]

Predicate used for filtering/sampling a dataset classwise.

Keyword Arguments:
max_num_duplicates {int} –

max number of samples to take that share the same value (default: {1})

Returns:

Callable[[Any], bool] – Predicate function
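
One way to combine it with filter for class-wise sub-sampling (the "label" name is illustrative):

```python
import datasetops as do

# Keep at most two samples per distinct label value.
balanced = ds.filter(label=do.allow_unique(2))
```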

datasetops.reshape(new_shape: Shape) → DatasetTransformFn
datasetops.categorical(mapping_fn: Callable[[Any], int] = None) → DatasetTransformFn

Transform data into a categorical int label.

Arguments:
mapping_fn {Callable[[Any], int]} –

A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.one_hot(encoding_size: int, mapping_fn: Callable[[Any], int] = None, dtype='bool') → DatasetTransformFn

Transform data into a one-hot encoded label.

Arguments:

encoding_size {int} – The size of the encoding

mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.

Returns:

DatasetTransformFn – A function to be passed to the Dataset.transform()

datasetops.categorical_template(ds: Dataset, key: Key) → Callable[[Any], int]

Creates a template mapping function to be used with one_hot.

Arguments:

ds {Dataset} – Dataset from which to create a template for the one_hot coding

key {Key} – Dataset key (name or item index) on which the one_hot coding is made

Returns:

{Callable[[Any],int]} – mapping_fn for one_hot
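
A sketch of how the template pairs with one_hot and transform (the "label" name is illustrative, and the encoding size is derived from the unique label values):

```python
import datasetops as do

mapping_fn = do.categorical_template(ds, "label")
encode = do.one_hot(encoding_size=len(ds.unique("label")), mapping_fn=mapping_fn)
ds_hot = ds.transform(label=encode)
```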

datasetops.numpy() → DatasetTransformFn
datasetops.image() → DatasetTransformFn
datasetops.image_resize(new_size: Shape, resample=Image.NEAREST) → DatasetTransformFn
datasetops.zipped(*datasets: AbstractDataset)
datasetops.cartesian_product(*datasets: AbstractDataset)
datasetops.concat(*datasets: AbstractDataset)
datasetops.to_tensorflow(dataset: Dataset)
datasetops.to_pytorch(dataset: Dataset)
class datasetops.Loader(getdata: Callable[[Any], Any], identifier: Optional[str] = None, name: str = None)

Bases: datasetops.dataset.Dataset

append(self, identifier: Data)
extend(self, ids: Union[List[Data], np.ndarray])
datasetops.from_pytorch(pytorch_dataset, identifier: Optional[str] = None)

Create a dataset from a PyTorch dataset

Arguments:

pytorch_dataset {torch.utils.data.Dataset} – A PyTorch dataset to load from

identifier {Optional[str]} – unique identifier

Returns:

[Dataset] – A datasetops.Dataset
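
A minimal sketch using a synthetic torch dataset:

```python
import torch
from torch.utils.data import TensorDataset
import datasetops as do

# Any torch.utils.data.Dataset will do; this one holds random features and labels.
pytorch_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
ds = do.from_pytorch(pytorch_dataset)
```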

datasetops.from_folder_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

folder
├── sample1.jpg
└── sample2.jpg

Arguments:

path {AnyPath} – path to folder

Returns:
Dataset – A dataset of data paths, e.g. ('folder/sample1.jpg')

datasetops.from_folder_class_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

data
├── class1
│   ├── sample1.jpg
│   └── sample2.jpg
└── class2
    └── sample3.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
Dataset – A labelled dataset of data paths and corresponding class labels,

e.g. (‘nested_folder/class1/sample1.jpg’, ‘class1’)
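
For example, where "data" refers to a folder laid out as sketched above:

```python
import datasetops as do

ds = do.from_folder_class_data("data")
path, label = ds[0]   # e.g. ("data/class1/sample1.jpg", "class1")
```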

datasetops.from_folder_group_data(path: AnyPath) → Dataset

Load data from a folder with the data structure:

data
├── group1
│   ├── sample1.jpg
│   └── sample2.jpg
└── group2
    ├── sample1.jpg
    └── sample2.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
Dataset – A dataset of paths to the objects of each group zipped together with corresponding names, e.g. ('nested_folder/group1/sample1.jpg', 'nested_folder/group2/sample1.txt')

datasetops.from_folder_dataset_class_data(path: AnyPath) → List[Dataset]

Load data from a folder with the data structure:

data
├── dataset1
│   ├── class1
│   │   ├── sample1.jpg
│   │   └── sample2.jpg
│   └── class2
│       └── sample3.jpg
└── dataset2
    └── sample3.jpg

Arguments:

path {AnyPath} – path to nested folder

Returns:
List[Dataset] – A list of labelled datasets, each with data paths and corresponding class labels,

e.g. (‘nested_folder/class1/sample1.jpg’, ‘class1’)

datasetops.from_folder_dataset_group_data(path: AnyPath) → List[Dataset]

Load data from a folder with the data structure:

nested_folder
├── dataset1
│   ├── group1
│   │   ├── sample1.jpg
│   │   └── sample2.jpg
│   └── group2
│       ├── sample1.txt
│       └── sample2.txt
└── dataset2
    └── ...

Arguments:

path {AnyPath} – path to nested folder

Returns:
List[Dataset] – A list of datasets, each with data composed from different types,

e.g. (‘nested_folder/group1/sample1.jpg’, ‘nested_folder/group2/sample1.txt’)

datasetops.from_mat_single_mult_data(path: AnyPath) → List[Dataset]

Load data from a .mat file containing multiple datasets.

E.g. a .mat file with keys [‘X_src’, ‘Y_src’, ‘X_tgt’, ‘Y_tgt’]

Arguments:

path {AnyPath} – path to .mat file

Returns:

List[Dataset] – A list of datasets, where a dataset is created for each suffix, e.g. one dataset with data from the keys ('X_src', 'Y_src') and one from ('X_tgt', 'Y_tgt')