
tensorplay.amp

Classes

class GradScaler [source]

python
GradScaler(device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None'

An instance scaler of GradScaler helps perform the steps of gradient scaling conveniently.

  • scaler.scale(loss) multiplies a given loss by scaler's current scale factor.
  • scaler.step(optimizer) safely unscales gradients and calls optimizer.step().
  • scaler.update() updates scaler's scale factor.

Example

python
# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales gradients of the optimizer's params.
        # If gradients don't contain infs/NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

See the Automatic Mixed Precision examples for usage (along with autocasting) in more complex cases like gradient clipping, gradient accumulation, gradient penalty, and multiple losses/optimizers.
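
For reference, a minimal sketch of the typical combined pattern, with autocast wrapping the forward pass and loss computation (assuming model, optimizer, data, epochs, and loss_fn are defined as in the example above):

python
scaler = GradScaler(device="cuda")

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass and loss computation in mixed precision.
        with tensorplay.autocast(device_type="cuda"):
            output = model(input)
            loss = loss_fn(output, target)

        # Backward pass and optimizer step proceed as in the example above,
        # outside the autocast region.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()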

scaler dynamically estimates the scale factor each iteration. To minimize gradient underflow, a large scale factor should be used. However, float16 values can "overflow" (become inf or NaN) if the scale factor is too large. Therefore, the optimal scale factor is the largest factor that can be used without incurring inf or NaN gradient values. scaler approximates the optimal scale factor over time by checking the gradients for infs and NaNs during every scaler.step(optimizer) (or optional separate scaler.unscale_(optimizer), see unscale_).

  • If infs/NaNs are found, scaler.step(optimizer) skips the underlying optimizer.step() (so the params themselves remain uncorrupted) and update() multiplies the scale by backoff_factor.

  • If no infs/NaNs are found, scaler.step(optimizer) runs the underlying optimizer.step() as usual. If growth_interval unskipped iterations occur consecutively, update() multiplies the scale by growth_factor.

The scale factor often causes infs/NaNs to appear in gradients for the first few iterations as its value calibrates. scaler.step will skip the underlying optimizer.step() for these iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations).
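Because update() lowers the scale only after a skipped step, one way to observe this calibration is to compare the scale before and after update(). A sketch for occasional debugging (note that get_scale incurs a CPU-GPU sync, see get_scale below):

python
scale_before = scaler.get_scale()
scaler.step(optimizer)
scaler.update()
# If the scale decreased, this iteration's optimizer.step() was skipped.
step_was_skipped = scaler.get_scale() < scale_before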

Args

  • device (str, optional, default="cuda"): Device type to use. Possible values are: 'cuda' and 'cpu'. The type is the same as the type attribute of a torch.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
  • init_scale (float, optional, default=2.**16): Initial scale factor.
  • growth_factor (float, optional, default=2.0): Factor by which the scale is multiplied during update if no inf/NaN gradients occur for growth_interval consecutive iterations.
  • backoff_factor (float, optional, default=0.5): Factor by which the scale is multiplied during update if inf/NaN gradients occur in an iteration.
  • growth_interval (int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by growth_factor.
  • enabled (bool, optional): If False, disables gradient scaling. step simply invokes the underlying optimizer.step(), and other methods become no-ops (see the sketch after this list).
  • Default: True
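
Because a disabled scaler degrades to plain pass-through calls, one training loop can serve both mixed- and full-precision runs by keying enabled off a flag. A minimal sketch, where use_amp is a hypothetical configuration flag and model, optimizer, data, and loss_fn are assumed to be defined as in the example above:

python
use_amp = True  # hypothetical flag, e.g. parsed from the command line

scaler = GradScaler(device="cuda", enabled=use_amp)

for input, target in data:
    optimizer.zero_grad()
    with tensorplay.autocast(device_type="cuda", enabled=use_amp):
        output = model(input)
        loss = loss_fn(output, target)
    # With enabled=False, scale() returns the loss unmodified and
    # step() simply calls optimizer.step().
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
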
Methods

__init__(self, device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None' [source]

Initialize self. See help(type(self)) for accurate signature.


get_backoff_factor(self) -> 'float' [source]

Return a Python float containing the scale backoff factor.


get_growth_factor(self) -> 'float' [source]

Return a Python float containing the scale growth factor.


get_growth_interval(self) -> 'int' [source]

Return a Python int containing the growth interval.


get_scale(self) -> 'float' [source]

Return a Python float containing the current scale, or 1.0 if scaling is disabled.

DANGER

get_scale incurs a CPU-GPU sync.


is_enabled(self) -> 'bool' [source]

Return a bool indicating whether this instance is enabled.


load_state_dict(self, state_dict: 'dict[str, Any]') -> 'None' [source]

Load the scaler state.

If this instance is disabled, load_state_dict is a no-op.

Args

  • state_dict (dict): Scaler state. Should be an object returned from a call to state_dict.


scale(self, outputs: 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]') -> 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]' [source]

Multiplies ('scales') a tensor or list of tensors by the scale factor.

Returns scaled outputs. If this instance of GradScaler is not enabled, outputs are returned unmodified.

Args

  • outputs (Tensor or iterable of Tensors): Outputs to scale.
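
As an example of scaling more than one loss, each loss can be scaled before its own backward pass (a sketch; loss0 and loss1 are assumed to have been computed earlier in the iteration and to share parts of the same graph):

python
# retain_graph keeps the shared graph alive for the second backward pass.
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()
scaler.step(optimizer)
scaler.update()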

set_backoff_factor(self, new_factor: 'float') -> 'None' [source]

Set a new scale backoff factor.

Args

  • new_factor (float): Value to use as the new scale backoff factor.

set_growth_factor(self, new_factor: 'float') -> 'None' [source]

Set a new scale growth factor.

Args

  • new_factor (float): Value to use as the new scale growth factor.

set_growth_interval(self, new_interval: 'int') -> 'None' [source]

Set a new growth interval.

Args

  • new_interval (int): Value to use as the new growth interval.
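
Together with the getters above, these setters allow the scaler's schedule to be adjusted after construction, for example to grow the scale more cautiously later in training (a sketch):

python
scaler.set_growth_factor(1.5)     # grow the scale less aggressively
scaler.set_backoff_factor(0.25)   # back off harder after an overflow
scaler.set_growth_interval(4000)  # wait longer between growth steps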

state_dict(self) -> 'dict[str, Any]' [source]

Return the state of the scaler as a dict.

It contains five entries:

  • "scale" - a Python float containing the current scale
  • "growth_factor" - a Python float containing the current growth factor
  • "backoff_factor" - a Python float containing the current backoff factor
  • "growth_interval" - a Python int containing the current growth interval
  • "_growth_tracker" - a Python int containing the number of recent consecutive unskipped steps.

If this instance is not enabled, returns an empty dict.

INFO

If you wish to checkpoint the scaler's state after a particular iteration, state_dict should be called after update.
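
A minimal checkpointing sketch that saves and restores the scaler alongside the model and optimizer (tensorplay.save and tensorplay.load are assumed to mirror the usual serialization helpers):

python
# Save: state_dict() is called after update() at the end of an iteration.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}
tensorplay.save(checkpoint, "checkpoint.pt")  # assumed serialization helper

# Resume: restore the scaler state before continuing training.
checkpoint = tensorplay.load("checkpoint.pt")  # assumed serialization helper
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])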


step(self, optimizer: 'tensorplay.optim.Optimizer', *args: 'Any', **kwargs: 'Any') -> 'Optional[float]' [source]

Invoke unscale_(optimizer) followed by a parameter update, if gradients do not contain infs/NaNs.

step carries out the following two operations:

  1. Internally invokes unscale_(optimizer) (unless unscale_ was explicitly called for optimizer earlier in the iteration). As part of the unscale_, gradients are checked for infs/NaNs.
  2. If no inf/NaN gradients are found, invokes optimizer.step() using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.

*args and **kwargs are forwarded to optimizer.step().

Returns the return value of optimizer.step(*args, **kwargs).

Args

  • optimizer (tensorplay.optim.Optimizer): Optimizer that applies the gradients.
  • args: Any arguments.
  • kwargs: Any keyword arguments.

DANGER

Closure use is not currently supported.


unscale_(self, optimizer: 'tensorplay.optim.Optimizer') -> 'None' [source]

Divides ("unscales") the optimizer's gradient tensors by the scale factor.

unscale_ is optional, serving cases where you need to modify or inspect gradients between the backward pass(es) and step. If unscale_ is not called explicitly, gradients will be unscaled automatically during step.

Simple example, using unscale_ to enable clipping of unscaled gradients

python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
tensorplay.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optimizer)
scaler.update()

Args

  • optimizer (tensorplay.optim.Optimizer): Optimizer that owns the gradients to be unscaled.

INFO

unscale_ does not incur a CPU-GPU sync.

DANGER

unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.

DANGER

unscale_ may unscale sparse gradients out of place, replacing the .grad attribute.


update(self, new_scale: 'Optional[Union[float, tensorplay.Tensor]]' = None) -> 'None' [source]

Update the scale factor.

If any optimizer steps were skipped the scale is multiplied by backoff_factor to reduce it. If growth_interval unskipped iterations occurred consecutively, the scale is multiplied by growth_factor to increase it.

Passing new_scale sets the scale value manually. (new_scale is not used directly; it is used to fill GradScaler's internal scale tensor, so if new_scale is a tensor, later in-place changes to that tensor will not further affect the scale GradScaler uses internally.)

Args

  • new_scale (float or tensorplay.Tensor, optional, default=None): New scale factor.

DANGER

update should only be called at the end of the iteration, after scaler.step(optimizer) has been invoked for all optimizers used this iteration.

DANGER

For performance reasons, we do not check the scale factor value to avoid synchronizations, so the scale factor is not guaranteed to be above 1. If the scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something is likely wrong. For example, bf16-pretrained models are often incompatible with AMP/fp16 due to differing dynamic ranges.


class autocast [source]

python
autocast(device_type: str, dtype: Optional[tensorplay.DType] = None, enabled: bool = True, cache_enabled: Optional[bool] = None)

Instances of autocast serve as context managers or decorators that allow regions of your script to run in mixed precision.

In these regions, ops run in an op-specific dtype chosen by autocast to improve performance while maintaining accuracy. See the Autocast Op Reference for details.

When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.

autocast should wrap only the forward pass(es) of your network, including the loss computation(s). Backward passes under autocast are not recommended. Backward ops run in the same type that autocast used for corresponding forward ops.

Example for CUDA Devices

python
# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with tensorplay.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()

See the Automatic Mixed Precision examples for usage (along with gradient scaling) in more complex scenarios (e.g., gradient penalty, multiple models/losses, custom autograd functions).

autocast can also be used as a decorator, e.g., on the forward method of your model

python
class AutocastModel(nn.Module):
    ...

    @tensorplay.autocast(device_type="cuda")
    def forward(self, input):
        ...
Floating-point Tensors produced in an autocast-enabled region may be float16. After returning to an autocast-disabled region, using them with floating-point Tensors of different dtypes may cause type mismatch errors. If so, cast the Tensor(s) produced in the autocast region back to float32 (or another dtype if desired). If a Tensor from the autocast region is already float32, the cast is a no-op and incurs no additional overhead.

CUDA Example

python
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = tensorplay.rand((8, 8), device="cuda")
b_float32 = tensorplay.rand((8, 8), device="cuda")
c_float32 = tensorplay.rand((8, 8), device="cuda")
d_float32 = tensorplay.rand((8, 8), device="cuda")

with tensorplay.autocast(device_type="cuda"):
    # tensorplay.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = tensorplay.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = tensorplay.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = tensorplay.mm(d_float32, f_float16.float())

CPU Training Example

python
# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with tensorplay.autocast(device_type="cpu", dtype=tensorplay.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()

CPU Inference Example

python
# Creates model in default precision
model = Net().eval()

with tensorplay.autocast(device_type="cpu", dtype=tensorplay.bfloat16):
    for input in data:
        # Runs the forward pass with autocasting.
        output = model(input)

CPU Inference Example with Jit Trace

python
class TestModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, num_classes)
    def forward(self, x):
        return self.fc1(x)

input_size = 2
num_classes = 2
model = TestModel(input_size, num_classes).eval()

# For now, we suggest disabling the JIT Autocast Pass;
# see the issue: https://github.com/pytensorplay/pytensorplay/issues/75956
tensorplay._C._jit_set_autocast_mode(False)

with tensorplay.cpu.amp.autocast(cache_enabled=False):
    model = tensorplay.jit.trace(model, tensorplay.randn(1, input_size))
model = tensorplay.jit.freeze(model)
# Run the model
for _ in range(3):
    model(tensorplay.randn(1, input_size))

Type mismatch errors in an autocast-enabled region are a bug; if this is what you observe, please file an issue.

autocast(enabled=False) subregions can be nested in autocast-enabled regions. Locally disabling autocast can be useful, for example, if you want to force a subregion to run in a particular dtype. Disabling autocast gives you explicit control over the execution type. In the subregion, inputs from the surrounding region should be cast to dtype before use:

python
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = tensorplay.rand((8, 8), device="cuda")
b_float32 = tensorplay.rand((8, 8), device="cuda")
c_float32 = tensorplay.rand((8, 8), device="cuda")
d_float32 = tensorplay.rand((8, 8), device="cuda")

with tensorplay.autocast(device_type="cuda"):
    e_float16 = tensorplay.mm(a_float32, b_float32)
    with tensorplay.autocast(device_type="cuda", enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = tensorplay.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # tensorplay.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = tensorplay.mm(d_float32, f_float32)

The autocast state is thread-local. If you want it enabled in a new thread, the context manager or decorator must be invoked in that thread. This affects tensorplay.nn.DataParallel and tensorplay.nn.parallel.DistributedDataParallel when used with more than one GPU per process (see Working with Multiple GPUs<amp-multigpu>).
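
Because the autocast state does not propagate to the threads that tensorplay.nn.DataParallel spawns, one workaround (a sketch) is to apply autocast inside the model's forward so each worker thread enables it locally:

python
class MyModel(nn.Module):
    @tensorplay.autocast(device_type="cuda")
    def forward(self, input):
        # Autocast is enabled in whichever thread runs this forward.
        ...

model = MyModel().cuda()
dp_model = nn.DataParallel(model)

# The outputs of dp_model may be float16; compute the loss under autocast too.
with tensorplay.autocast(device_type="cuda"):
    output = dp_model(input)
    loss = loss_fn(output, target)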

Args

  • device_type (str, required): Device type to use. Possible values are: 'cuda', 'cpu', 'mtia', 'maia', 'xpu', and 'hpu'. The type is the same as the type attribute of a tensorplay.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
  • enabled (bool, optional): Whether autocasting should be enabled in the region.
  • Default: True
  • dtype (tensorplay.dtype, optional): Data type for ops run in autocast. If dtype is None, it uses the default value given by tensorplay.get_autocast_dtype (tensorplay.float16 for CUDA and tensorplay.bfloat16 for CPU).
  • Default: None
  • cache_enabled (bool, optional): Whether the weight cache inside autocast should be enabled.
  • Default: True
Methods

__init__(self, device_type: str, dtype: Optional[tensorplay.DType] = None, enabled: bool = True, cache_enabled: Optional[bool] = None) [source]

Initialize self. See help(type(self)) for accurate signature.


Functions

custom_bwd() [source]

python
custom_bwd(bwd=None, *, device_type: str)

Create a helper decorator for backward methods of custom autograd functions.

Autograd functions are subclasses of tensorplay.autograd.Function. Ensures that backward executes with the same autocast state as forward. See the example page for more detail.

Args

  • device_type (str): Device type to use. 'cuda', 'cpu', 'mtia', 'maia', 'xpu' and so on. The type is the same as the type attribute of a tensorplay.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.

custom_fwd() [source]

python
custom_fwd(fwd=None, *, device_type: str, cast_inputs: Optional[tensorplay.DType] = None)

Create a helper decorator for forward methods of custom autograd functions.

Autograd functions are subclasses of tensorplay.autograd.Function. See the example page for more detail.

Args

  • device_type (str): Device type to use. 'cuda', 'cpu', 'mtia', 'maia', 'xpu' and so on. The type is the same as the type attribute of a tensorplay.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
  • cast_inputs (tensorplay.dtype or None, optional, default=None): If not None, when forward runs in an autocast-enabled region, casts incoming floating-point Tensors to the target dtype (non-floating-point Tensors are not affected), then executes forward with autocast disabled. If None, forward's internal ops execute with the current autocast state.

INFO

If the decorated forward is called outside an autocast-enabled region, custom_fwd is a no-op and cast_inputs has no effect.
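
A sketch of a custom autograd function decorated with both helpers: custom_fwd casts inputs to float32 when called under autocast, and custom_bwd reruns backward with the autocast state of the corresponding forward (assuming the usual tensorplay.autograd.Function API):

python
class MyMM(tensorplay.autograd.Function):
    @staticmethod
    @custom_fwd(device_type="cuda", cast_inputs=tensorplay.float32)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd(device_type="cuda")
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

# MyMM.apply can be called inside or outside an autocast-enabled region;
# inside one, its floating-point inputs are cast to float32 before forward runs.
mm_func = MyMM.apply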

is_autocast_available() [source]

python
is_autocast_available(device_type: str) -> bool

Return a bool indicating if autocast is available on device_type.

Args

  • device_type (str): Device type to use. Possible values are: 'cuda', 'cpu', 'mtia', 'maia', 'xpu', and so on. The type is the same as the type attribute of a tensorplay.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
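
For example, a device type can be checked before constructing the context manager (a sketch):

python
device_type = "cuda" if is_autocast_available("cuda") else "cpu"
with tensorplay.autocast(device_type=device_type):
    output = model(input)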
