tensorplay.utils.data
Classes
class BatchSampler [source]
BatchSampler(sampler, batch_size, drop_last)
Bases: Sampler
Wraps another sampler to yield a mini-batch of indices.
Args
- sampler (Sampler): Base sampler.
- batch_size (int): Size of mini-batch.
- drop_last (bool): If True, the sampler will drop the last batch if its size would be less than batch_size.
Example
list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Methods
__init__(self, sampler, batch_size, drop_last) [source]
Initialize self. See help(type(self)) for accurate signature.
class ConcatDataset [source]
ConcatDataset(datasets)
Bases: Dataset
Dataset as a concatenation of multiple datasets.
This class is useful for assembling different existing datasets.
Arguments
- datasets (sequence): List of datasets to be concatenated.
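Example
A minimal sketch, assuming ConcatDataset accepts any object with __getitem__ and __len__; plain Python lists stand in for datasets here.
from tensorplay.utils.data import ConcatDataset

ds = ConcatDataset([[0, 1, 2], [10, 11]])
print(len(ds))  # 5 -- the cumulative length of both datasets
print(ds[3])    # 10 -- index 3 falls into the second dataset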
Methods
__init__(self, datasets) [source]
Initialize self. See help(type(self)) for accurate signature.
cum_sum(sequence) [source]
Computes the cumulative sum of a list of numbers.
class DataLoader [source]
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=2, persistent_workers=False, device=None, worker_debug=False)
Methods
__init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=2, persistent_workers=False, device=None, worker_debug=False) [source]
Initialize self. See help(type(self)) for accurate signature.
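Example
A usage sketch combining the documented dataset, batch_size, shuffle, and num_workers parameters. tensorplay.randn and tensorplay.arange are assumed torch-like helpers, not documented on this page.
import tensorplay
from tensorplay.utils.data import DataLoader, TensorDataset

features = tensorplay.randn(100, 8)  # assumed torch-like constructor
labels = tensorplay.arange(100)      # assumed torch-like constructor
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True, num_workers=2)
for batch_features, batch_labels in loader:
    pass  # each iteration yields a mini-batch collated by default_collate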
class Dataset [source]
Dataset()
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many tensorplay.utils.data.Sampler implementations and the default options of tensorplay.utils.data.DataLoader.
.. note:: tensorplay.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
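Example
A minimal map-style subclass sketch (the class name is illustrative): __getitem__ fetches the sample for a key and __len__ reports the dataset size.
from tensorplay.utils.data import Dataset

class SquaresDataset(Dataset):
    """Illustrative dataset: sample i is the pair (i, i**2)."""
    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index, index ** 2

    def __len__(self):
        return self.n

ds = SquaresDataset(4)
print([ds[i] for i in range(len(ds))])  # [(0, 0), (1, 1), (2, 4), (3, 9)]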
class IterableDataset [source]
IterableDataset()
Bases: Dataset
An iterable Dataset.
All datasets that represent an iterable of data samples should subclass it. This form of dataset is particularly useful when data come from a stream.
All subclasses should overwrite __iter__, which would return an iterator of samples in this dataset.
When a subclass is used with tensorplay.utils.data.DataLoader, each item in the dataset will be yielded from the DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. get_worker_info, when called in a worker process, returns information about the worker. It can be used in either the dataset's __iter__ method or the DataLoader's worker_init_fn option to modify each copy's behavior.
Example 1: splitting workload across all workers in __iter__
import math
import tensorplay.utils.data
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = tensorplay.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:  # in a worker process, split the workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2)))
[3, 4, 5, 6]

# With even more workers
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=20)))
[3, 4, 5, 6]

Example 2: splitting workload across all workers using worker_init_fn
class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]

def worker_init_fn(worker_id):
    worker_info = tensorplay.utils.data.get_worker_info()
    dataset = worker_info.dataset  # the dataset copy in this worker process
    overall_start = dataset.start
    overall_end = dataset.end
    # configure each copy to process a disjoint slice of the range
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    worker_id = worker_info.id
    dataset.start = overall_start + worker_id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2, worker_init_fn=worker_init_fn)))
[3, 4, 5, 6]

class RandomSampler [source]
RandomSampler(data_source, replacement=False, num_samples=None, generator=None)
Bases: Sampler
Samples elements randomly. If without replacement, samples from a shuffled dataset. If with replacement, the user can specify num_samples to draw.
Arguments
- data_source (Dataset): dataset to sample from
- replacement (bool): samples are drawn on-demand with replacement if True, default=False
- num_samples (int): number of samples to draw, default=len(dataset). This argument is supposed to be specified only when replacement is True.
- generator (Generator): Generator used in sampling.
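Example
A usage sketch; the outputs shown are illustrative, since draws are random.
from tensorplay.utils.data import RandomSampler

# without replacement: a permutation of all indices
list(RandomSampler(range(5)))
[2, 0, 4, 1, 3]
# with replacement: num_samples draws, duplicates possible
list(RandomSampler(range(5), replacement=True, num_samples=8))
[1, 1, 4, 0, 2, 4, 3, 1]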
Methods
__init__(self, data_source, replacement=False, num_samples=None, generator=None) [source]
Initialize self. See help(type(self)) for accurate signature.
class Sampler [source]
Sampler(data_source: Optional[Sized])
Bases: Generic
Base class for all Samplers.
Every Sampler subclass has to provide an __iter__ method, providing a way to iterate over indices of dataset elements, and a __len__ method that returns the length of the returned iterator.
.. note:: The __len__ method isn't strictly required by tensorplay.utils.data.DataLoader, but is expected in any calculation involving the length of a tensorplay.utils.data.DataLoader.
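Example
A minimal custom Sampler sketch implementing the contract above (illustrative: it yields indices in reverse order).
from typing import Sized
from tensorplay.utils.data import Sampler

class ReverseSampler(Sampler):
    """Illustrative sampler: yields indices len-1, ..., 1, 0."""
    def __init__(self, data_source: Sized):
        super().__init__(data_source)
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

print(list(ReverseSampler(range(4))))  # [3, 2, 1, 0]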
Methods
__init__(self, data_source: Optional[Sized]) [source]
Initialize self. See help(type(self)) for accurate signature.
class SequentialSampler [source]
SequentialSampler(data_source)
Bases: Sampler
Samples elements sequentially, always in the same order.
Arguments
- data_source (Dataset): dataset to sample from
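Example
Mirroring the BatchSampler example above, iterating the sampler directly yields indices in order.
from tensorplay.utils.data import SequentialSampler

list(SequentialSampler(range(5)))
[0, 1, 2, 3, 4]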
Methods
__init__(self, data_source) [source]
Initialize self. See help(type(self)) for accurate signature.
class Subset [source]
Subset(dataset, indices)
Bases: Dataset
Subset of a dataset at specified indices.
Arguments
- dataset (Dataset): The whole Dataset
- indices (sequence): Indices in the whole set selected for subset
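Example
A minimal sketch, again with a plain list standing in for a map-style dataset.
from tensorplay.utils.data import Subset

ds = Subset(['a', 'b', 'c', 'd'], [0, 2])
print(len(ds))  # 2
print(ds[1])    # 'c' -- subset index 1 maps to whole-set index 2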
Methods
__init__(self, dataset, indices) [source]
Initialize self. See help(type(self)) for accurate signature.
class TensorDataset [source]
TensorDataset(*tensors)
Bases: Dataset
Dataset wrapping tensors.
Each sample will be retrieved by indexing tensors along the first dimension.
Arguments
- *tensors (Tensor): tensors that have the same size in the first dimension.
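Example
A sketch assuming a torch-like tensorplay.tensor constructor (an assumption; it is not documented on this page).
import tensorplay
from tensorplay.utils.data import TensorDataset

features = tensorplay.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # assumed constructor
labels = tensorplay.tensor([0, 1, 0])
ds = TensorDataset(features, labels)
print(len(ds))  # 3 -- the shared first dimension
print(ds[1])    # sample 1: the second row of each tensor, e.g. (tensor([3., 4.]), tensor(1))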
Methods
__init__(self, *tensors) [source]
Initialize self. See help(type(self)) for accurate signature.
Functions
default_collate() [source]
default_collate(batch)
Puts each data field into a tensor with outer dimension batch size.
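Example
A sketch of the documented behavior: each field is collated into a batch-first structure (tensorplay.tensor is an assumed torch-like constructor; comments describe illustrative results).
import tensorplay
from tensorplay.utils.data import default_collate

batch = [(tensorplay.tensor([1.0, 2.0]), 0),   # sample 0: (features, label)
         (tensorplay.tensor([3.0, 4.0]), 1)]   # sample 1
features, labels = default_collate(batch)
# features: a 2x2 tensor stacking the per-sample feature tensors
# labels:   a length-2 tensor collated from the Python ints [0, 1]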
get_worker_info() [source]
get_worker_info()
When called in a worker process, returns information about the current worker (including its id, num_workers, and the dataset copy, as shown in the IterableDataset examples above); returns None in the main process.