tensorplay.utils.data.dataset
Classes
class ConcatDataset [source]
ConcatDataset(datasets)
Bases: Dataset
Dataset as a concatenation of multiple datasets.
This class is useful to assemble different existing datasets.
Arguments
- datasets (sequence): List of datasets to be concatenated
Methods
__init__(self, datasets) [source]
Initialize self. See help(type(self)) for accurate signature.
cum_sum(sequence) [source]
Computes the cumulative sum of a list of numbers.
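For instance, two existing datasets can be assembled into one. A minimal sketch follows; RangeDataset is a hypothetical map-style dataset defined only for illustration, and only the ConcatDataset(datasets) signature comes from this page:

```python
from tensorplay.utils.data import ConcatDataset, Dataset

class RangeDataset(Dataset):
    """Hypothetical map-style dataset over a range of integers."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __getitem__(self, index):
        return self.start + index

    def __len__(self):
        return self.end - self.start

ds = ConcatDataset([RangeDataset(0, 3), RangeDataset(10, 12)])
print(len(ds))  # 5: the lengths of the parts summed
print(ds[3])    # 10: indices run across the concatenation boundary
```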
class Dataset [source]
Dataset()
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many tensorplay.utils.data.Sampler implementations and the default options of tensorplay.utils.data.DataLoader.
Note: tensorplay.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
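As a minimal sketch of the map-style pattern (the class and data here are hypothetical; only the __getitem__/__len__ contract comes from this page):

```python
from tensorplay.utils.data import Dataset

class ListDataset(Dataset):
    """Hypothetical map-style dataset wrapping an in-memory list."""
    def __init__(self, samples):
        self.samples = samples

    def __getitem__(self, index):
        # Fetch the data sample for the given key.
        return self.samples[index]

    def __len__(self):
        # Expected by many Sampler implementations and the
        # default DataLoader options.
        return len(self.samples)

ds = ListDataset(["a", "b", "c"])
print(len(ds), ds[1])  # 3 b
```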
class IterableDataset [source]
IterableDataset()
Bases: Dataset
An iterable Dataset.
All datasets that represent an iterable of data samples should subclass it. Such datasets are particularly useful when data come from a stream.
All subclasses should overwrite __iter__, which should return an iterator over the samples in this dataset.
When a subclass is used with tensorplay.utils.data.DataLoader, each item in the dataset will be yielded from the tensorplay.utils.data.DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. get_worker_info, when called in a worker process, returns information about the worker. It can be used in either the dataset's __iter__ method or the tensorplay.utils.data.DataLoader's worker_init_fn option to modify each copy's behavior.
Example 1: splitting workload across all workers in __iter__
```python
import math
import tensorplay
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = tensorplay.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:  # in a worker process
            # split workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
# [3, 4, 5, 6]

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2)))
# [3, 4, 5, 6]

# With even more workers
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=20)))
# [3, 4, 5, 6]
```

Example 2: splitting workload across all workers using worker_init_fn
```python
import math
import tensorplay
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
# [3, 4, 5, 6]

def worker_init_fn(worker_id):
    worker_info = tensorplay.utils.data.get_worker_info()
    dataset = worker_info.dataset  # the dataset copy in this worker process
    overall_start = dataset.start
    overall_end = dataset.end
    # configure each copy to process a disjoint slice of the range
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    worker_id = worker_info.id
    dataset.start = overall_start + worker_id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)

# Multi-process loading with two worker processes
# Worker 0 fetched [3, 4]. Worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2, worker_init_fn=worker_init_fn)))
# [3, 4, 5, 6]
```

class Subset [source]
Subset(dataset, indices)
Bases: Dataset
Subset of a dataset at specified indices.
Arguments
- dataset (Dataset): The whole Dataset
- indices (sequence): Indices in the whole set selected for subset
Methods
__init__(self, dataset, indices) [source]
Initialize self. See help(type(self)) for accurate signature.
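A minimal usage sketch. A built-in range stands in for the wrapped Dataset here, on the assumption that only __getitem__ and __len__ are needed; position i of the subset is assumed to map to indices[i] of the whole set:

```python
from tensorplay.utils.data import Subset

# range supports __getitem__ and __len__, so it stands in
# for a map-style Dataset in this sketch.
whole = range(10, 20)
sub = Subset(whole, indices=[0, 5, 9])

print(len(sub))  # 3
print(sub[1])    # 15: position 1 of the subset maps to index 5 of the whole set
```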
class TensorDataset [source]
TensorDataset(*tensors)
Bases: Dataset
Dataset wrapping tensors.
Each sample will be retrieved by indexing tensors along the first dimension.
Arguments
- *tensors (Tensor): tensors that have the same size in the first dimension
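A minimal usage sketch (tensorplay.tensor is assumed here as the tensor constructor; tensor creation itself is not documented on this page):

```python
import tensorplay
from tensorplay.utils.data import TensorDataset

# Both tensors have size 3 in the first dimension.
features = tensorplay.tensor([[1, 2], [3, 4], [5, 6]])  # assumed constructor
labels = tensorplay.tensor([0, 1, 0])

ds = TensorDataset(features, labels)
print(len(ds))  # 3
print(ds[1])    # one sample per index: (features[1], labels[1])
```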
