tensorplay.utils.data
Classes
class BatchSampler [source]
BatchSampler(sampler, batch_size, drop_last)
Bases: Sampler
Wraps another sampler to yield a mini-batch of indices.
Args
- sampler (Sampler): Base sampler.
- batch_size (int): Size of mini-batch.
- drop_last (bool): If True, the sampler will drop the last batch if its size would be less than batch_size.
Example
list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Methods
__init__(self, sampler, batch_size, drop_last) [source]
Initialize self. See help(type(self)) for accurate signature.
class ConcatDataset [source]
ConcatDataset(datasets)
Bases: Dataset
Dataset as a concatenation of multiple datasets.
This class is useful for assembling different existing datasets.
Arguments
- datasets (sequence): List of datasets to be concatenated.
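Example
A minimal sketch, assuming ConcatDataset accepts any object with __getitem__ and __len__; plain Python lists stand in for datasets here.
from tensorplay.utils.data import ConcatDataset

ds = ConcatDataset([[0, 1, 2], [10, 11]])
print(len(ds))  # 5 -- the cumulative length of both datasets
print(ds[3])    # 10 -- index 3 falls into the second dataset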
Methods
__init__(self, datasets) [source]
Initialize self. See help(type(self)) for accurate signature.
cum_sum(sequence) [source]
Computes the cumulative sum of a list of numbers.
class DataLoader [source]
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=2, persistent_workers=False, device=None, worker_debug=False)
Methods
__init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=2, persistent_workers=False, device=None, worker_debug=False) [source]
Initialize self. See help(type(self)) for accurate signature.
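Example
A usage sketch combining the documented dataset, batch_size, shuffle, and num_workers parameters. tensorplay.randn and tensorplay.arange are assumed torch-like helpers, not documented on this page.
import tensorplay
from tensorplay.utils.data import DataLoader, TensorDataset

features = tensorplay.randn(100, 8)  # assumed torch-like constructor
labels = tensorplay.arange(100)      # assumed torch-like constructor
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True, num_workers=2)
for batch_features, batch_labels in loader:
    pass  # each iteration yields a mini-batch collated by default_collate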
class Dataset [source]
Dataset()
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many tensorplay.utils.data.Sampler implementations and the default options of tensorplay.utils.data.DataLoader.
.. note:: tensorplay.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
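Example
A minimal map-style subclass sketch (the class name is illustrative): __getitem__ fetches the sample for a key and __len__ reports the dataset size.
from tensorplay.utils.data import Dataset

class SquaresDataset(Dataset):
    """Illustrative dataset: sample i is the pair (i, i**2)."""
    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index, index ** 2

    def __len__(self):
        return self.n

ds = SquaresDataset(4)
print([ds[i] for i in range(len(ds))])  # [(0, 0), (1, 1), (2, 4), (3, 9)]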
class IterableDataset [source]
IterableDataset()
Bases: Dataset
An iterable Dataset.
All datasets that represent an iterable of data samples should subclass it. This form of dataset is particularly useful when data come from a stream.
All subclasses should overwrite __iter__, which would return an iterator of samples in this dataset.
When a subclass is used with tensorplay.utils.data.DataLoader, each item in the dataset will be yielded from the DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. get_worker_info, when called in a worker process, returns information about the worker. It can be used in either the dataset's __iter__ method or the DataLoader's worker_init_fn option to modify each copy's behavior.
Example 1: splitting workload across all workers in __iter__
import math
import tensorplay.utils.data
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = tensorplay.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:  # in a worker process, split the workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2)))
[3, 4, 5, 6]

# With even more workers
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=20)))
[3, 4, 5, 6]

Example 2: splitting workload across all workers using worker_init_fn
class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]

def worker_init_fn(worker_id):
    worker_info = tensorplay.utils.data.get_worker_info()
    dataset = worker_info.dataset  # the dataset copy in this worker process
    overall_start = dataset.start
    overall_end = dataset.end
    # configure each copy to process a disjoint slice of the range
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    worker_id = worker_info.id
    dataset.start = overall_start + worker_id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2, worker_init_fn=worker_init_fn)))
[3, 4, 5, 6]

class RandomSampler [source]
RandomSampler(data_source, replacement=False, num_samples=None, generator=None)
Bases: Sampler
Samples elements randomly. If without replacement, samples from a shuffled dataset. If with replacement, the user can specify num_samples to draw.
Arguments
- data_source (Dataset): dataset to sample from
- replacement (bool): samples are drawn on-demand with replacement if True, default=False
- num_samples (int): number of samples to draw, default=len(dataset). This argument is supposed to be specified only when replacement is True.
- generator (Generator): Generator used in sampling.
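Example
A usage sketch; the outputs shown are illustrative, since draws are random.
from tensorplay.utils.data import RandomSampler

# without replacement: a permutation of all indices
list(RandomSampler(range(5)))
[2, 0, 4, 1, 3]
# with replacement: num_samples draws, duplicates possible
list(RandomSampler(range(5), replacement=True, num_samples=8))
[1, 1, 4, 0, 2, 4, 3, 1]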
Methods
__init__(self, data_source, replacement=False, num_samples=None, generator=None) [source]
Initialize self. See help(type(self)) for accurate signature.
class Sampler [source]
Sampler(data_source: Optional[Sized])
Bases: Generic
Base class for all Samplers.
Every Sampler subclass has to provide an __iter__ method, providing a way to iterate over indices of dataset elements, and a __len__ method that returns the length of the returned iterator.
.. note:: The __len__ method isn't strictly required by tensorplay.utils.data.DataLoader, but is expected in any calculation involving the length of a tensorplay.utils.data.DataLoader.
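Example
A minimal custom Sampler sketch implementing the contract above (illustrative: it yields indices in reverse order).
from typing import Sized
from tensorplay.utils.data import Sampler

class ReverseSampler(Sampler):
    """Illustrative sampler: yields indices len-1, ..., 1, 0."""
    def __init__(self, data_source: Sized):
        super().__init__(data_source)
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

print(list(ReverseSampler(range(4))))  # [3, 2, 1, 0]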
Methods
__init__(self, data_source: Optional[Sized]) [source]
Initialize self. See help(type(self)) for accurate signature.
class SequentialSampler [source]
SequentialSampler(data_source)
Bases: Sampler
Samples elements sequentially, always in the same order.
Arguments
- data_source (Dataset): dataset to sample from
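Example
Mirroring the BatchSampler example above, iterating the sampler directly yields indices in order.
from tensorplay.utils.data import SequentialSampler

list(SequentialSampler(range(5)))
[0, 1, 2, 3, 4]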
Methods
__init__(self, data_source) [source]
Initialize self. See help(type(self)) for accurate signature.
class Subset [source]
Subset(dataset, indices)
Bases: Dataset
Subset of a dataset at specified indices.
Arguments
- dataset (Dataset): The whole Dataset
- indices (sequence): Indices in the whole set selected for subset
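Example
A minimal sketch, again with a plain list standing in for a map-style dataset.
from tensorplay.utils.data import Subset

ds = Subset(['a', 'b', 'c', 'd'], [0, 2])
print(len(ds))  # 2
print(ds[1])    # 'c' -- subset index 1 maps to whole-set index 2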
Methods
__init__(self, dataset, indices) [source]
Initialize self. See help(type(self)) for accurate signature.
class TensorDataset [source]
TensorDataset(*tensors)
Bases: Dataset
Dataset wrapping tensors.
Each sample will be retrieved by indexing tensors along the first dimension.
Arguments
- *tensors (Tensor): tensors that have the same size in the first dimension.
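Example
A sketch assuming a torch-like tensorplay.tensor constructor (an assumption; it is not documented on this page).
import tensorplay
from tensorplay.utils.data import TensorDataset

features = tensorplay.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # assumed constructor
labels = tensorplay.tensor([0, 1, 0])
ds = TensorDataset(features, labels)
print(len(ds))  # 3 -- the shared first dimension
print(ds[1])    # sample 1: the second row of each tensor, e.g. (tensor([3., 4.]), tensor(1))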
Methods
__init__(self, *tensors) [source]
Initialize self. See help(type(self)) for accurate signature.
Functions
default_collate() [source]
default_collate(batch)
Puts each data field into a tensor with outer dimension batch size.
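Example
A sketch of the documented behavior: each field is collated into a batch-first structure (tensorplay.tensor is an assumed torch-like constructor; comments describe illustrative results).
import tensorplay
from tensorplay.utils.data import default_collate

batch = [(tensorplay.tensor([1.0, 2.0]), 0),   # sample 0: (features, label)
         (tensorplay.tensor([3.0, 4.0]), 1)]   # sample 1
features, labels = default_collate(batch)
# features: a 2x2 tensor stacking the per-sample feature tensors
# labels:   a length-2 tensor collated from the Python ints [0, 1]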
get_worker_info() [source]
get_worker_info()
When called in a worker process, returns information about the current worker (including its id, num_workers, and the dataset copy, as shown in the IterableDataset examples above); returns None in the main process.