

tensorplay.utils.data.dataset

Classes

class ConcatDataset [source]

```python
ConcatDataset(datasets)
```

Bases: Dataset

Dataset as a concatenation of multiple datasets.

This class is useful to assemble different existing datasets.

Arguments

  • datasets (sequence): List of datasets to be concatenated
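
A minimal usage sketch. `RangeDataset` is a hypothetical helper defined here only for illustration; `ConcatDataset` itself and the `tensorplay.utils.data` import path follow the references on this page:

```python
from tensorplay.utils.data import ConcatDataset, Dataset

class RangeDataset(Dataset):
    """Toy map-style dataset yielding the integers in [start, end)."""
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, idx):
        return self.start + idx

# Concatenate two datasets; indices run through the first, then the second.
combined = ConcatDataset([RangeDataset(0, 3), RangeDataset(10, 12)])
print(len(combined))                                 # 5
print([combined[i] for i in range(len(combined))])   # [0, 1, 2, 10, 11]
```
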
Methods

__init__(self, datasets) [source]

Initialize self. See help(type(self)) for accurate signature.


cum_sum(sequence) [source]

Computes the cumulative sum of a list of numbers.
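
Presumably these running totals serve as offsets that let ConcatDataset map a global index to the right constituent dataset. A pure-Python sketch of the expected behavior (not the actual implementation):

```python
from itertools import accumulate

def cum_sum(sequence):
    # Running totals: cum_sum([3, 5, 2]) -> [3, 8, 10]
    return list(accumulate(sequence))

print(cum_sum([3, 5, 2]))  # [3, 8, 10]
```
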


class Dataset [source]

```python
Dataset()
```

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses may also optionally overwrite __len__, which many tensorplay.utils.data.Sampler implementations and the default options of tensorplay.utils.data.DataLoader expect to return the size of the dataset.

Note: tensorplay.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
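
For illustration, a minimal map-style subclass (a sketch; `SquaresDataset` is hypothetical):

```python
from tensorplay.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy map-style dataset: integer keys 0..n-1 map to their squares."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Reported size, used by samplers and the default DataLoader options.
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx * idx

ds = SquaresDataset(4)
print(len(ds))                            # 4
print([ds[i] for i in range(len(ds))])    # [0, 1, 4, 9]
```
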

class IterableDataset [source]

```python
IterableDataset()
```

Bases: Dataset

An iterable Dataset.

All datasets that represent an iterable of data samples should subclass it. This form of dataset is particularly useful when data come from a stream.

All subclasses should overwrite __iter__, which would return an iterator of samples in this dataset.

When a subclass is used with tensorplay.utils.data.DataLoader, each item in the dataset will be yielded from the tensorplay.utils.data.DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desirable to configure each copy independently to avoid duplicate data being returned from the workers. get_worker_info, when called in a worker process, returns information about the worker. It can be used in either the dataset's __iter__ method or the tensorplay.utils.data.DataLoader's worker_init_fn option to modify each copy's behavior.

Example 1: splitting workload across all workers in __iter__

```python
import math
import tensorplay.utils.data
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = tensorplay.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:  # in a worker process, split the workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

# Should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
# [3, 4, 5, 6]

# Multi-process loading with two worker processes.
# Worker 0 fetched [3, 4]; worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2)))
# [3, 4, 5, 6]

# With even more workers
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=20)))
# [3, 4, 5, 6]
```

Example 2: splitting workload across all workers using worker_init_fn

```python
import math
import tensorplay.utils.data
from tensorplay.utils.data import IterableDataset

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# Should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=0)))
# [3, 4, 5, 6]

def worker_init_fn(worker_id):
    worker_info = tensorplay.utils.data.get_worker_info()
    dataset = worker_info.dataset  # the dataset copy in this worker process
    overall_start = dataset.start
    overall_end = dataset.end
    # configure each copy to process a disjoint slice of the overall range
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    worker_id = worker_info.id
    dataset.start = overall_start + worker_id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)

# Multi-process loading with two worker processes.
# Worker 0 fetched [3, 4]; worker 1 fetched [5, 6].
print(list(tensorplay.utils.data.DataLoader(ds, num_workers=2, worker_init_fn=worker_init_fn)))
# [3, 4, 5, 6]
```

class Subset [source]

```python
Subset(dataset, indices)
```

Bases: Dataset

Subset of a dataset at specified indices.

Arguments

  • dataset (Dataset): The whole Dataset
  • indices (sequence): Indices in the whole set selected for subset
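
A minimal usage sketch. `LettersDataset` is hypothetical, defined only so there is something to subset; indexing a Subset yields the underlying samples at the selected indices:

```python
from tensorplay.utils.data import Dataset, Subset

class LettersDataset(Dataset):
    """Toy dataset over a fixed list of samples (for illustration only)."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

ds = LettersDataset(["a", "b", "c", "d", "e"])
sub = Subset(ds, [0, 2, 4])               # keep every other sample
print(len(sub))                           # 3
print([sub[i] for i in range(len(sub))])  # ['a', 'c', 'e']
```
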
Methods

__init__(self, dataset, indices) [source]

Initialize self. See help(type(self)) for accurate signature.


class TensorDataset [source]

```python
TensorDataset(*tensors)
```

Bases: Dataset

Dataset wrapping tensors.

Each sample will be retrieved by indexing tensors along the first dimension.

Arguments

  • *tensors (Tensor): tensors that have the same size in the first dimension
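
A minimal usage sketch. The randn and arange constructors are assumed here by analogy with common tensor libraries; only TensorDataset itself is documented on this page:

```python
import tensorplay
from tensorplay.utils.data import TensorDataset

features = tensorplay.randn(4, 3)   # 4 samples, 3 features each (assumed constructor)
labels = tensorplay.arange(4)       # one label per sample (assumed constructor)

ds = TensorDataset(features, labels)
x0, y0 = ds[0]   # indexing returns the i-th row of every wrapped tensor
print(len(ds))   # 4
```
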
Methods

__init__(self, *tensors) [source]

Initialize self. See help(type(self)) for accurate signature.


Released under the Apache 2.0 License.
