datasetops
Dataset Ops is a library that enables the loading and processing of datasets stored in various formats. It does so by providing:
1. Loaders for various storage formats
2. Transformations which may be chained to transform the data into the desired form
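A sketch of the intended workflow, combining a loader with a few chained transformations (the folder layout, element names, and split fractions are illustrative; everything used here is documented under Package Contents below):

    import datasetops as do

    # Load a dataset in which each item is a (path, class-name) pair
    ds = do.from_folder_class_data("path/to/data")

    # Chain transformations until the data has the desired form
    train, test = (
        ds.named("data", "label")  # address item elements by name
        .image(True, False)        # read each path into a PIL.Image.Image
        .one_hot("label")          # one-hot encode the class label
        .shuffle(seed=42)
        .split([0.7, -1])          # 70% train; the wildcard takes the rest
    )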
Finding The Documentation
Documentation is available online at:
Submodules
Package Contents
class datasetops.Dataset(downstream_getter: Union[ItemGetter, 'Dataset'], operation_name: str, name: str = None, ids: Ids = None, item_transform_fn: ItemTransformFn = lambda x: ..., item_names: Dict[str, int] = None, operation_parameters: Dict = {}, stats: List[Optional[scaler.ElemStats]] = [])
Bases: datasetops.abstract.AbstractDataset
Contains information on how to access the raw data, and performs sampling and splitting related operations.
__len__(self)

__getitem__(self, i: int)

cached(self, path: str = None, keep_loaded_items: bool = False, display_progress: bool = False)
item_stats(self, item_key: Key, axis=None)
Compute the statistics (mean, std, min, max) for an item element.
- Arguments:
  item_key {Key} – index or string identifier for the element on which the stats should be computed
- Keyword Arguments:
  axis {int} – the axis on which to compute statistics (default: {None})
- Raises:
  TypeError: if statistics cannot be computed on the element type
- Returns:
  scaler.ElemStats – Named tuple with (mean, std, min, max, axis)
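For instance, assuming a dataset whose element "data" holds numeric arrays and has been named accordingly:

    stats = ds.item_stats("data", axis=0)  # or by index: ds.item_stats(0)
    print(stats.mean, stats.std, stats.min, stats.max)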
property shape(self)
Get the shape of a dataset item.
- Returns:
  Sequence[int] – Item shapes
counts(self, *itemkeys: Key)
Compute the counts of each unique item in the dataset.
Warning: this operation may be expensive for large datasets.
- Arguments:
  itemkeys {Union[str, int]} – The item keys (str) or indexes (int) to be checked for uniqueness. If no key is given, all item elements must match for two items to be considered equal.
- Returns:
  List[Tuple[Any, int]] – List of tuples, each containing a unique value and its number of occurrences
unique(self, *itemkeys: Key)
Compute a list of unique values in the dataset.
Warning: this operation may be expensive for large datasets.
- Arguments:
  itemkeys {str} – The item keys to be checked for uniqueness
- Returns:
  List[Any] – List of the unique items
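A small sketch of both inspection methods, assuming the class label is the item element at index 1 (the values shown are illustrative):

    label_counts = ds.counts(1)  # e.g. [('class1', 2), ('class2', 1)]
    labels = ds.unique(1)        # e.g. ['class1', 'class2']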
sample(self, num: int, seed: int = None)
Sample data randomly from the dataset.
- Arguments:
  num {int} – Number of samples. If the number of samples is larger than the dataset size, some items may be sampled multiple times.
- Keyword Arguments:
  seed {int} – Random seed (default: {None})
- Returns:
  [Dataset] – Sampled dataset
filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)
Filter a dataset using a predicate function. See the sketch after split_filter below.
- Keyword Arguments:
  predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item_names have been set.
  kwpredicates {DataPredicate} – Predicates passed by keyword
- Returns:
  [Dataset] – A filtered Dataset
split_filter(self, predicates: Optional[Union[DataPredicate, Sequence[Optional[DataPredicate]]]] = None, **kwpredicates: DataPredicate)
Split a dataset using a predicate function.
- Keyword Arguments:
  predicates {Union[DataPredicate, Sequence[Optional[DataPredicate]]]} – either a single function or a list of functions, each taking a dataset item and returning a bool. If a single function is passed, it is applied to the whole item; if a list is passed, the functions are applied itemwise. Element-wise predicates can also be passed by keyword, if item_names have been set.
- Returns:
  [Dataset] – Two datasets: one with the items that passed the predicate and one with those that didn't
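A sketch of both filtering styles; the element name "label" assumes named() was called beforehand:

    # A single predicate receives the whole item tuple
    subset = ds.filter(lambda item: item[1] != "class2")

    # A keyword predicate is applied only to the named element
    subset = ds.filter(label=lambda lbl: lbl == "class1")

    # split_filter keeps both sides of the predicate
    matching, rest = ds.split_filter(label=lambda lbl: lbl == "class1")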
shuffle(self, seed: int = None)
Shuffle the items in a dataset.
- Keyword Arguments:
  seed {int} – Random seed (default: {None})
- Returns:
  [Dataset] – Dataset with shuffled items
split(self, fractions: List[float], seed: int = None)
Split a dataset into multiple datasets, with sizes determined by the fractions given.
A wildcard (-1) may be given at a single position, to fill in the rest. If the fractions don't add up, the last fraction in the list receives the remaining data.
- Arguments:
  fractions {List[float]} – a list or tuple of floats in the interval ]0,1[. One of the items may be a -1 wildcard.
- Keyword Arguments:
  seed {int} – Random seed (default: {None})
- Returns:
  List[Dataset] – Datasets with the number of samples corresponding to the fractions given
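For example, a reproducible three-way split where the wildcard absorbs the remaining 15%:

    train, val, test = ds.split([0.7, 0.15, -1], seed=42)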
take(self, num: int)
Take the first elements of a dataset.
- Arguments:
  num {int} – number of elements to take
- Returns:
  Dataset – A dataset with only the first num elements
repeat(self, times=1, mode='itemwise')
Repeat the dataset elements.
- Keyword Arguments:
  times {int} – Number of times an element is repeated (default: {1})
  mode {str} – Repeat 'itemwise' (i.e. [1,1,2,2,3,3]) or as a 'whole' (i.e. [1,2,3,1,2,3]) (default: {'itemwise'})
- Returns:
  [Dataset] – Dataset with repeated elements
reorder(self, *keys: Key)
Reorder the elements of the dataset items (similar to numpy.transpose).
- Arguments:
  keys {Union[int, str]} – the positional index or key (if item names were previously set) of each element, in the desired order
- Returns:
  [Dataset] – Dataset with items whose elements have been reordered
named(self, first: Union[str, Sequence[str]], *rest: str)
Set the names associated with the elements of an item.
- Arguments:
  first {Union[str, Sequence[str]]} – The new item name(s)
- Returns:
  [Dataset] – A Dataset whose item elements can be accessed by name
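Naming and reordering are typically combined; for items of the form (data, label):

    ds = ds.named("data", "label")    # elements become addressable by name
    ds = ds.reorder("label", "data")  # items become (label, data)
    # equivalently, by positional index: ds.reorder(1, 0)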
property names(self)
Get the names of the elements in an item.
- Returns:
  List[str] – A list of element names
transform(self, fns: Optional[Union[ItemTransformFn, Sequence[Union[ItemTransformFn, DatasetTransformFn]]]] = None, **kwfns: DatasetTransformFn)
Transform the items of a dataset according to some function (passed as argument).
- Arguments:
  - If a single function taking one input is given, e.g. transform(lambda x: x), it will be applied to the whole item.
  - If a list of functions is given, e.g. transform([image(), one_hot()]), they will be applied to the elements of the item at the corresponding positions.
  - If a key is used, e.g. transform(data=lambda x: -x), the element associated with the key is transformed.
- Raises:
  ValueError: If more functions are passed than there are elements in an item.
  KeyError: If a key doesn't match
- Returns:
  [Dataset] – Dataset whose items are transformed
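A sketch of the list and keyword forms together, using the module-level transform builders documented below (the encoding size and image size are illustrative):

    # Element-wise: decode the image path, one-hot encode the label
    ds = ds.transform([do.image(), do.one_hot(2)])

    # By keyword, when the elements have been named
    ds = ds.transform(data=do.image_resize((28, 28)))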
categorical(self, key: Key, mapping_fn: Callable[[Any], int] = None)
Transform elements into categorical labels (int).
- Arguments:
  key {Key} – Index or name of the element to be transformed
- Keyword Arguments:
  mapping_fn {Callable[[Any], int]} – User-defined mapping function (default: {None})
- Returns:
  [Dataset] – Dataset with items that have been transformed to categorical labels
one_hot(self, key: Key, encoding_size: int = None, mapping_fn: Callable[[Any], int] = None, dtype='bool')
Transform elements into a categorical one-hot encoding.
- Arguments:
  key {Key} – Index or name of the element to be transformed
- Keyword Arguments:
  encoding_size {int} – The number of positions in the one-hot vector. If the size is not provided, it will be inferred automatically, at an O(N) runtime cost (default: {None})
  mapping_fn {Callable[[Any], int]} – User-defined mapping function (default: {None})
  dtype {str} – Numpy datatype for the one-hot encoded data (default: {'bool'})
- Returns:
  [Dataset] – Dataset with items that have been transformed to one-hot encoded labels
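For a dataset with a named "label" element, the two encodings look as follows (the encoding size is illustrative):

    ds_int = ds.categorical("label")               # e.g. 'class1' -> 0, 'class2' -> 1
    ds_vec = ds.one_hot("label", encoding_size=2)  # e.g. 'class1' -> [True, False]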
image(self, *positional_flags: Any)
Transform item elements that are either numpy arrays or path strings into a PIL.Image.Image.
- Arguments:
  positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all data that can be converted will be converted.
- Returns:
  [Dataset] – Dataset with PIL.Image.Image elements
numpy(self, *positional_flags: Any)
Transform item elements into numpy.ndarray.
- Arguments:
  positional_flags – flags, e.g. (True, False), denoting which elements should be converted. If no flags are supplied, all data that can be converted will be converted.
- Returns:
  [Dataset] – Dataset with np.ndarray elements
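The positional flags select which elements to convert; without flags, every convertible element is converted:

    ds = ds.image(True, False)  # convert only the first element to a PIL image
    ds = ds.numpy()             # convert all convertible elements to np.ndarray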
zip(self, *datasets)

cartesian_product(self, *datasets)

concat(self, *datasets)

reshape(self, *new_shapes: Optional[Shape], **kwshapes: Optional[Shape])

image_resize(self, *new_sizes: Optional[Shape], **kwsizes: Optional[Shape])
standardize(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)
Standardize features by removing the mean and scaling to unit variance.
- Arguments:
  key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the standardization should be performed
- Keyword Arguments:
  axis {int} – Axis on which to accumulate statistics (default: {0})
- Returns:
  Dataset – Transformed dataset
center(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)
Center features by removing the mean.
- Arguments:
  key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the centering should be performed
- Keyword Arguments:
  axis {int} – Axis on which to accumulate statistics (default: {0})
- Returns:
  Dataset – Transformed dataset
minmax(self, key_or_keys: Union[Key, Sequence[Key]], axis=0, feature_range=(0, 1))
Transform features by scaling each feature to a given range.
- Arguments:
  key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the min-max scaling should be performed
- Keyword Arguments:
  axis {int} – Axis on which to accumulate statistics (default: {0})
  feature_range {Tuple[int, int]} – Minimum and maximum bound to scale to
- Returns:
  Dataset – Transformed dataset
maxabs(self, key_or_keys: Union[Key, Sequence[Key]], axis=0)
Scale each feature by its maximum absolute value.
- Arguments:
  key_or_keys {Union[Key, Sequence[Key]]} – The keys on which the Max Abs scaling should be performed
- Keyword Arguments:
  axis {int} – Axis on which to accumulate statistics (default: {0})
- Returns:
  Dataset – Transformed dataset
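The four scalers share the same call shape; assuming a numeric element named "data":

    ds.standardize("data")                   # zero mean, unit variance
    ds.center("data")                        # zero mean only
    ds.minmax("data", feature_range=(0, 1))  # rescale into [0, 1]
    ds.maxabs("data")                        # divide by the max absolute value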
close(self)

to_tensorflow(self)

to_pytorch(self)
datasetops.allow_unique(max_num_duplicates=1) → Callable[[Any], bool]
Predicate used for filtering/sampling a dataset classwise.
- Keyword Arguments:
  max_num_duplicates {int} – max number of samples to take that share the same value (default: {1})
- Returns:
  Callable[[Any], bool] – Predicate function
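It pairs naturally with filter, e.g. to keep at most two samples per distinct label value (the element name is illustrative):

    balanced = ds.filter(label=do.allow_unique(max_num_duplicates=2))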
datasetops.reshape(new_shape: Shape) → DatasetTransformFn
datasetops.categorical(mapping_fn: Callable[[Any], int] = None) → DatasetTransformFn
Transform data into a categorical int label.
- Arguments:
  mapping_fn {Callable[[Any], int]} – A function transforming the input data to the integer label. If not specified, labels are automatically inferred from the data.
- Returns:
  DatasetTransformFn – A function to be passed to Dataset.transform()
datasetops.one_hot(encoding_size: int, mapping_fn: Callable[[Any], int] = None, dtype='bool') → DatasetTransformFn
Transform data into a one-hot encoded label.
- Arguments:
  encoding_size {int} – The size of the encoding
  mapping_fn {Callable[[Any], int]} – A function transforming the input data to an integer label. If not specified, labels are automatically inferred from the data.
- Returns:
  DatasetTransformFn – A function to be passed to Dataset.transform()
datasetops.categorical_template(ds: Dataset, key: Key) → Callable[[Any], int]
Create a template mapping function to be used with one_hot.
- Arguments:
  ds {Dataset} – Dataset from which to create a template for the one_hot coding
  key {Key} – Dataset key (name or item index) on which the one_hot coding is made
- Returns:
  {Callable[[Any], int]} – mapping_fn for one_hot
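A sketch of the intended use, building a reusable mapping for one_hot (the key and encoding size are illustrative):

    mapping_fn = do.categorical_template(ds, "label")
    ds = ds.transform(label=do.one_hot(encoding_size=2, mapping_fn=mapping_fn))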
datasetops.numpy() → DatasetTransformFn

datasetops.image() → DatasetTransformFn

datasetops.image_resize(new_size: Shape, resample=Image.NEAREST) → DatasetTransformFn
datasetops.zipped(*datasets: AbstractDataset)

datasetops.cartesian_product(*datasets: AbstractDataset)

datasetops.concat(*datasets: AbstractDataset)

datasetops.to_tensorflow(dataset: Dataset)

datasetops.to_pytorch(dataset: Dataset)
class datasetops.Loader(getdata: Callable[[Any], Any], identifier: Optional[str] = None, name: str = None)
Bases: datasetops.dataset.Dataset

append(self, identifier: Data)

extend(self, ids: Union[List[Data], np.ndarray])
datasetops.from_pytorch(pytorch_dataset, identifier: Optional[str] = None)
Create a dataset from a Pytorch dataset.
- Arguments:
  pytorch_dataset {torch.utils.data.Dataset} – A Pytorch dataset to load from
  identifier {Optional[str]} – unique identifier
- Returns:
  [Dataset] – A datasetops.Dataset
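For example, wrapping a torchvision dataset (the choice of dataset is illustrative):

    from torchvision.datasets import MNIST

    mnist = MNIST("./data", train=True, download=True)
    ds = do.from_pytorch(mnist, identifier="mnist-train")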
datasetops.from_folder_data(path: AnyPath) → Dataset
Load data from a folder with the data structure:

    folder
    ├── sample1.jpg
    ├── sample2.jpg

- Arguments:
  path {AnyPath} – path to folder
- Returns:
  Dataset – A dataset of data paths, e.g. ('nested_folder/class1/sample1.jpg')
datasetops.from_folder_class_data(path: AnyPath) → Dataset
Load data from a folder with the data structure:

    data
    ├── class1
    │   ├── sample1.jpg
    │   └── sample2.jpg
    └── class2
        └── sample3.jpg

- Arguments:
  path {AnyPath} – path to nested folder
- Returns:
  Dataset – A labelled dataset of data paths and corresponding class labels, e.g. ('nested_folder/class1/sample1.jpg', 'class1')
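Given the structure above, a minimal load-and-inspect sketch:

    ds = do.from_folder_class_data("data")
    print(ds.unique(1))  # e.g. ['class1', 'class2']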
datasetops.from_folder_group_data(path: AnyPath) → Dataset
Load data from a folder with the data structure:

    data
    ├── group1
    │   ├── sample1.jpg
    │   └── sample2.jpg
    └── group2
        ├── sample1.jpg
        └── sample2.jpg

- Arguments:
  path {AnyPath} – path to nested folder
- Returns:
  Dataset – A dataset of paths to the objects of each group, zipped together by their corresponding names, e.g. ('nested_folder/group1/sample1.jpg', 'nested_folder/group2/sample1.txt')
datasetops.from_folder_dataset_class_data(path: AnyPath) → List[Dataset]
Load data from a folder with the data structure:

    data
    ├── dataset1
    │   ├── class1
    │   │   ├── sample1.jpg
    │   │   └── sample2.jpg
    │   └── class2
    │       └── sample3.jpg
    └── dataset2
        └── sample3.jpg

- Arguments:
  path {AnyPath} – path to nested folder
- Returns:
  List[Dataset] – A list of labelled datasets, each with data paths and corresponding class labels, e.g. ('nested_folder/class1/sample1.jpg', 'class1')
datasetops.from_folder_dataset_group_data(path: AnyPath) → List[Dataset]
Load data from a folder in which each subfolder contains a dataset arranged in the group data structure (see from_folder_group_data).
- Arguments:
  path {AnyPath} – path to nested folder
- Returns:
  List[Dataset] – A list of datasets, each with data composed from different types, e.g. ('nested_folder/group1/sample1.jpg', 'nested_folder/group2/sample1.txt')
datasetops.from_mat_single_mult_data(path: AnyPath) → List[Dataset]
Load data from a .mat file consisting of multiple data fields.
E.g. a .mat file with keys ['X_src', 'Y_src', 'X_tgt', 'Y_tgt']
- Arguments:
  path {AnyPath} – path to .mat file
- Returns:
  List[Dataset] – A list of datasets, where a dataset is created for each key suffix, e.g. one dataset with data from the keys ('X_src', 'Y_src') and one from ('X_tgt', 'Y_tgt')