tensorplay.amp.grad_scaler

Classes

class GradScaler [source]

python
GradScaler(device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None'

An instance scaler of GradScaler helps perform the steps of gradient scaling conveniently.

  • scaler.scale(loss) multiplies a given loss by scaler's current scale factor.
  • scaler.step(optimizer) safely unscales gradients and calls optimizer.step().
  • scaler.update() updates scaler's scale factor.

Example

python
# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales gradients of the optimizer's params.
        # If gradients don't contain infs/NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

See the Automatic Mixed Precision examples for usage (along with autocasting) in more complex cases like gradient clipping, gradient accumulation, gradient penalty, and multiple losses/optimizers.
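For instance, gradient accumulation composes with the scaler as sketched below (model, optimizer, data, loss_fn, and accumulation_steps are placeholder objects, as in the example above):

python
accumulation_steps = 4
scaler = GradScaler()

for i, (input, target) in enumerate(data):
    output = model(input)
    # Divide the loss so the accumulated gradient matches one full effective batch.
    loss = loss_fn(output, target) / accumulation_steps

    # Accumulate scaled gradients across micro-batches.
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        # Unscale, check for infs/NaNs, and step only once per effective batch.
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()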

scaler dynamically estimates the scale factor each iteration. To minimize gradient underflow, a large scale factor should be used. However, float16 values can "overflow" (become inf or NaN) if the scale factor is too large. Therefore, the optimal scale factor is the largest factor that can be used without incurring inf or NaN gradient values. scaler approximates the optimal scale factor over time by checking the gradients for infs and NaNs during every scaler.step(optimizer) (or optional separate scaler.unscale_(optimizer), see unscale_).

  • If infs/NaNs are found, scaler.step(optimizer) skips the underlying optimizer.step() (so the params themselves remain uncorrupted) and update() multiplies the scale by backoff_factor.

  • If no infs/NaNs are found, scaler.step(optimizer) runs the underlying optimizer.step() as usual. If growth_interval unskipped iterations occur consecutively, update() multiplies the scale by growth_factor.

The scale factor often causes infs/NaNs to appear in gradients for the first few iterations as its value calibrates. scaler.step will skip the underlying optimizer.step() for these iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations).
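The growth/backoff behavior described above can be sketched in plain Python (an illustration of the rule only, not the library's implementation):

python
def next_scale(scale, found_inf, growth_tracker,
               growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    # Returns the (new_scale, new_growth_tracker) pair for one update() call.
    if found_inf:
        # inf/NaN gradients were seen this iteration: back off and reset the streak.
        return scale * backoff_factor, 0
    growth_tracker += 1
    if growth_tracker == growth_interval:
        # growth_interval consecutive clean iterations: grow the scale.
        return scale * growth_factor, 0
    return scale, growth_tracker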

Args

  • device (str, optional, default="cpu"): Device type to use. Possible values are: 'cuda' and 'cpu'. The type is the same as the type attribute of a tensorplay.device, so you may obtain the device type of a tensor using Tensor.device.type.
  • init_scale (float, optional, default=2.**16): Initial scale factor.
  • growth_factor (float, optional, default=2.0): Factor by which the scale is multiplied during update if no inf/NaN gradients occur for growth_interval consecutive iterations.
  • backoff_factor (float, optional, default=0.5): Factor by which the scale is multiplied during update if inf/NaN gradients occur in an iteration.
  • growth_interval (int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by growth_factor.
  • enabled (bool, optional, default=True): If False, disables gradient scaling. step simply invokes the underlying optimizer.step(), and other methods become no-ops; see the sketch below.
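For example, the enabled flag makes it easy to keep one code path for mixed- and full-precision runs (values below are arbitrary):

python
use_amp = True  # set False to run the same training loop in full precision

# When enabled=False, scale()/unscale_()/update() are no-ops and
# step() simply calls optimizer.step().
scaler = GradScaler(enabled=use_amp)

# Custom hyperparameters (illustrative values only).
conservative_scaler = GradScaler(init_scale=2.**10, growth_interval=1000)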
Methods

__init__(self, device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None' [source]

Initialize self. See help(type(self)) for accurate signature.


get_backoff_factor(self) -> 'float' [source]

Return a Python float containing the scale backoff factor.


get_growth_factor(self) -> 'float' [source]

Return a Python float containing the scale growth factor.


get_growth_interval(self) -> 'int' [source]

Return a Python int containing the growth interval.


get_scale(self) -> 'float' [source]

Return a Python float containing the current scale, or 1.0 if scaling is disabled.

DANGER

get_scale incurs a CPU-GPU sync.
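One sync-incurring use of get_scale is detecting whether the last step was skipped, by checking whether update() backed the scale off (a hedged sketch; it assumes backoff_factor < 1 and growth_factor > 1):

python
scale_before = scaler.get_scale()   # CPU-GPU sync
scaler.step(optimizer)
scaler.update()
# The scale only shrinks when inf/NaN gradients caused optimizer.step() to be skipped.
step_was_skipped = scaler.get_scale() < scale_before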


is_enabled(self) -> 'bool' [source]

Return a bool indicating whether this instance is enabled.


load_state_dict(self, state_dict: 'dict[str, Any]') -> 'None' [source]

Load the scaler state.

If this instance is disabled, load_state_dict is a no-op.

Args

state_dict(dict): scaler state. Should be an object returned from a call to state_dict.


scale(self, outputs: 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]') -> 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]' [source]

Multiplies ("scales") a tensor or iterable of tensors by the scale factor.

Returns scaled outputs. If this instance of GradScaler is not enabled, outputs are returned unmodified.

Args

  • outputs (Tensor or iterable of Tensors): Outputs to scale.
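With multiple losses, each loss can be scaled before its own backward pass (a hedged sketch; retain_graph is only needed if the backward passes share part of the graph):

python
# Two losses, two backward passes; gradients accumulate into the same .grad fields.
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()
scaler.step(optimizer)
scaler.update()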

set_backoff_factor(self, new_factor: 'float') -> 'None' [source]

Set a new scale backoff factor.

Args

  • new_factor (float): Value to use as the new scale backoff factor.

set_growth_factor(self, new_factor: 'float') -> 'None' [source]

Set a new scale growth factor.

Args

  • new_factor (float): Value to use as the new scale growth factor.

set_growth_interval(self, new_interval: 'int') -> 'None' [source]

Set a new growth interval.

Args

  • new_interval (int): Value to use as the new growth interval.

state_dict(self) -> 'dict[str, Any]' [source]

Return the state of the scaler as a dict.

It contains five entries:

  • "scale" - a Python float containing the current scale
  • "growth_factor" - a Python float containing the current growth factor
  • "backoff_factor" - a Python float containing the current backoff factor
  • "growth_interval" - a Python int containing the current growth interval
  • "_growth_tracker" - a Python int containing the number of recent consecutive unskipped steps.

If this instance is not enabled, returns an empty dict.

INFO

If you wish to checkpoint the scaler's state after a particular iteration, state_dict should be called after update.
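A hedged checkpointing sketch (model and optimizer are assumed placeholder objects; how the dict is written to disk is up to the surrounding script):

python
# Save: state_dict() is called after update(), per the note above.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}

# Restore: construct a fresh GradScaler, then load the saved state into it.
scaler = GradScaler()
scaler.load_state_dict(checkpoint["scaler"])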


step(self, optimizer: 'tensorplay.optim.Optimizer', *args: 'Any', **kwargs: 'Any') -> 'Optional[float]' [source]

Invoke unscale_(optimizer) followed by a parameter update, if the gradients do not contain infs/NaNs.

step carries out the following two operations:

  1. Internally invokes unscale_(optimizer) (unless unscale_ was explicitly called for optimizer earlier in the iteration). As part of the unscale_, gradients are checked for infs/NaNs.
  2. If no inf/NaN gradients are found, invokes optimizer.step() using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.

*args and **kwargs are forwarded to optimizer.step().

Returns the return value of optimizer.step(*args, **kwargs).

Args

  • optimizer (tensorplay.optim.Optimizer): Optimizer that applies the gradients.
  • args: Any arguments.
  • kwargs: Any keyword arguments.

DANGER

Closure use is not currently supported.
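With multiple optimizers, step is called once per optimizer and update once per iteration (a hedged sketch):

python
scaler.scale(loss).backward()

# Each optimizer's gradients are unscaled and checked for infs/NaNs independently.
scaler.step(optimizer0)
scaler.step(optimizer1)

# A single update() at the end of the iteration, after all step() calls.
scaler.update()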


unscale_(self, optimizer: 'tensorplay.optim.Optimizer') -> 'None' [source]

Divides ("unscales") the optimizer's gradient tensors by the scale factor.

unscale_ is optional, serving cases where you need to modify or inspect gradients between the backward pass(es) and step. If unscale_ is not called explicitly, gradients will be unscaled automatically during step.

Simple example, using unscale_ to enable clipping of unscaled gradients:

python
scaler.scale(loss).backward()
# Unscale in place so the gradients can be clipped at their true magnitude.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# step() notices unscale_ was already called and does not unscale again.
scaler.step(optimizer)
scaler.update()

Args

  • optimizer (tensorplay.optim.Optimizer): Optimizer that owns the gradients to be unscaled.

INFO

unscale_ does not incur a CPU-GPU sync.

DANGER

unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.

DANGER

unscale_ may unscale sparse gradients out of place, replacing the .grad attribute.


update(self, new_scale: 'Optional[Union[float, tensorplay.Tensor]]' = None) -> 'None' [source]

Update the scale factor.

If any optimizer steps were skipped the scale is multiplied by backoff_factor to reduce it. If growth_interval unskipped iterations occurred consecutively, the scale is multiplied by growth_factor to increase it.

Passing new_scale sets the new scale value manually. (new_scale is not used directly; it is used to fill GradScaler's internal scale tensor, so if new_scale was a tensor, later in-place changes to that tensor will not further affect the scale GradScaler uses internally.)

Args

  • new_scale (float or tensorplay.Tensor, optional, default=None): New scale factor.
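If you do need to set the scale by hand, pass new_scale explicitly (the value below is arbitrary):

python
# Override the dynamically estimated scale with a fixed value for this update.
scaler.update(new_scale=2.**12)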

DANGER

update should only be called at the end of the iteration, after scaler.step(optimizer) has been invoked for all optimizers used this iteration.

DANGER

To avoid synchronizations, the scale factor value is not checked, so it is not guaranteed to stay above 1. If the scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something is likely wrong; for example, bf16-pretrained models are often incompatible with AMP/fp16 because of their differing dynamic ranges.


class OptState [source]

python
OptState(*values)

Bases: Enum
