tensorplay.amp.grad_scaler
Classes
class GradScaler [source]
GradScaler(device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None'
An instance scaler of GradScaler helps perform the steps of gradient scaling conveniently.
- scaler.scale(loss) multiplies a given loss by scaler's current scale factor.
- scaler.step(optimizer) safely unscales gradients and calls optimizer.step().
- scaler.update() updates scaler's scale factor.
Example
# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales gradients of the optimizer's params.
        # If gradients don't contain infs/NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

See the Automatic Mixed Precision examples for usage (along with autocasting) in more complex cases like gradient clipping, gradient accumulation, gradient penalty, and multiple losses/optimizers.
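Since those more complex cases are typically combined with autocasting, the sketch below pairs the scaler with an autocast context. It is only an illustration: tensorplay.amp.autocast and its device_type argument are assumed to mirror the PyTorch context manager of the same name and are not documented on this page; model, optimizer, data, and loss_fn are placeholders.

# Minimal sketch. The autocast import and its signature are assumptions,
# not confirmed by this page.
from tensorplay.amp import GradScaler, autocast

scaler = GradScaler(device="cuda")

for input, target in data:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision.
    with autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)
    # Backward pass and optimizer step go through the scaler as above.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()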
scaler dynamically estimates the scale factor each iteration. To minimize gradient underflow, a large scale factor should be used. However, float16 values can "overflow" (become inf or NaN) if the scale factor is too large. Therefore, the optimal scale factor is the largest factor that can be used without incurring inf or NaN gradient values. scaler approximates the optimal scale factor over time by checking the gradients for infs and NaNs during every scaler.step(optimizer) (or during the optional separate scaler.unscale_(optimizer); see unscale_).
- If infs/NaNs are found, scaler.step(optimizer) skips the underlying optimizer.step() (so the params themselves remain uncorrupted) and update() multiplies the scale by backoff_factor.
- If no infs/NaNs are found, scaler.step(optimizer) runs the underlying optimizer.step() as usual. If growth_interval unskipped iterations occur consecutively, update() multiplies the scale by growth_factor.
The scale factor often causes infs/NaNs to appear in gradients for the first few iterations as its value calibrates. scaler.step will skip the underlying optimizer.step() for these iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations).
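One way to observe this calibration is to log the scale occasionally during training; a minimal sketch (model, optimizer, data, and loss_fn are placeholders, and the logging cadence is arbitrary):

scaler = GradScaler()

for i, (input, target) in enumerate(data):
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if infs/NaNs are found
    scaler.update()          # backs off or grows the scale accordingly
    # Log sparingly: get_scale() incurs a CPU-GPU sync (see get_scale below).
    if i % 500 == 0:
        print(f"iteration {i}: scale = {scaler.get_scale()}")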
Args
- device (str, optional, default="cpu"): Device type to use. Possible values are: 'cuda' and 'cpu'. The type is the same as the type attribute of a tensorplay.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
- init_scale (float, optional, default=2.**16): Initial scale factor.
- growth_factor (float, optional, default=2.0): Factor by which the scale is multiplied during update if no inf/NaN gradients occur for growth_interval consecutive iterations.
- backoff_factor (float, optional, default=0.5): Factor by which the scale is multiplied during update if inf/NaN gradients occur in an iteration.
- growth_interval (int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by growth_factor.
- enabled (bool, optional, default=True): If False, disables gradient scaling. step simply invokes the underlying optimizer.step(), and other methods become no-ops.
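For example, a scaler that starts from a smaller scale and grows it less eagerly could be constructed as follows; the specific values are illustrative only:

scaler = GradScaler(
    device="cuda",         # signature default is "cpu"
    init_scale=2.0**10,    # start lower than the default 2.**16
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=1000,  # grow only after 1000 consecutive clean iterations
    enabled=True,
)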
Methods
__init__(self, device: 'str' = 'cpu', init_scale: 'float' = 65536.0, growth_factor: 'float' = 2.0, backoff_factor: 'float' = 0.5, growth_interval: 'int' = 2000, enabled: 'bool' = True) -> 'None' [source]
Initialize self. See help(type(self)) for accurate signature.
get_backoff_factor(self) -> 'float' [source]
Return a Python float containing the scale backoff factor.
get_growth_factor(self) -> 'float' [source]
Return a Python float containing the scale growth factor.
get_growth_interval(self) -> 'int' [source]
Return a Python int containing the growth interval.
get_scale(self) -> 'float' [source]
Return a Python float containing the current scale, or 1.0 if scaling is disabled.
DANGER
get_scale incurs a CPU-GPU sync.
is_enabled(self) -> 'bool' [source]
Return a bool indicating whether this instance is enabled.
load_state_dict(self, state_dict: 'dict[str, Any]') -> 'None' [source]
Load the scaler state.
If this instance is disabled, load_state_dict is a no-op.
Args
- state_dict (dict): Scaler state. Should be an object returned from a call to state_dict.
scale(self, outputs: 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]') -> 'Union[tensorplay.Tensor, Iterable[tensorplay.Tensor]]' [source]
Multiplies ('scales') a tensor or list of tensors by the scale factor.
Returns scaled outputs. If this instance of GradScaler is not enabled, outputs are returned unmodified.
Args
- outputs (Tensor or iterable of Tensors): Outputs to scale.
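A minimal sketch of the two accepted input forms; loss0 and loss1 are placeholder loss tensors from earlier in the iteration, and the iterable form is assumed to return its elements in the same order:

scaler = GradScaler()

# Single tensor (the common case): scale a combined loss, then backprop.
scaler.scale(loss0 + loss1).backward()

# Alternatively, an iterable of tensors: each element comes back multiplied
# by the same scale factor.
scaled0, scaled1 = scaler.scale([loss0, loss1])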
set_backoff_factor(self, new_factor: 'float') -> 'None' [source]
Set a new scale backoff factor.
Args
- new_factor (float): Value to use as the new scale backoff factor.
set_growth_factor(self, new_factor: 'float') -> 'None' [source]
Set a new scale growth factor.
Args
- new_factor (float): Value to use as the new scale growth factor.
set_growth_interval(self, new_interval: 'int') -> 'None' [source]
Set a new growth interval.
Args
- new_interval (int): Value to use as the new growth interval.
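The getters and setters above can be combined to adjust how the scaler recalibrates at runtime; a minimal sketch with illustrative values:

scaler = GradScaler()

# Back off more aggressively and grow the scale less often.
scaler.set_backoff_factor(scaler.get_backoff_factor() / 2)    # 0.5 -> 0.25
scaler.set_growth_interval(2 * scaler.get_growth_interval())  # 2000 -> 4000

print(scaler.get_backoff_factor(), scaler.get_growth_factor(), scaler.get_growth_interval())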
state_dict(self) -> 'dict[str, Any]' [source]
Return the state of the scaler as a dict.
It contains five entries:
- "scale" - a Python float containing the current scale
- "growth_factor" - a Python float containing the current growth factor
- "backoff_factor" - a Python float containing the current backoff factor
- "growth_interval" - a Python int containing the current growth interval
- "_growth_tracker" - a Python int containing the number of recent consecutive unskipped steps
If this instance is not enabled, returns an empty dict.
INFO
If you wish to checkpoint the scaler's state after a particular iteration, state_dict should be called after update.
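A minimal checkpointing sketch following that note; tensorplay.save and tensorplay.load are assumed here to behave like their torch counterparts, and model and optimizer are placeholders:

import tensorplay

# Saving: state_dict() is called after update() for the iteration being checkpointed.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}
tensorplay.save(checkpoint, "checkpoint.tp")   # assumed torch.save-style API

# Restoring:
checkpoint = tensorplay.load("checkpoint.tp")  # assumed torch.load-style API
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])   # no-op if this scaler is disabled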
step(self, optimizer: 'tensorplay.optim.Optimizer', *args: 'Any', **kwargs: 'Any') -> 'Optional[float]' [source]
Invoke unscale_(optimizer) followed by a parameter update, if gradients do not contain infs/NaNs.
step carries out the following two operations:
- Internally invokes unscale_(optimizer) (unless unscale_ was explicitly called for optimizer earlier in the iteration). As part of the unscale_, gradients are checked for infs/NaNs.
- If no inf/NaN gradients are found, invokes optimizer.step() using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.
*args and **kwargs are forwarded to optimizer.step().
Returns the return value of optimizer.step(*args, **kwargs).
Args
- optimizer (tensorplay.optim.Optimizer): Optimizer that applies the gradients.
- args: Any arguments.
- kwargs: Any keyword arguments.
DANGER
Closure use is not currently supported.
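When an iteration involves several optimizers, step is called once per optimizer and update once at the end (see the note under update below); a minimal sketch with placeholder names:

scaler = GradScaler()

scaler.scale(loss).backward()

# One step call per optimizer; each one checks its own gradients for infs/NaNs.
scaler.step(optimizer0)
scaler.step(optimizer1)

# A single update after all optimizers have stepped this iteration.
scaler.update()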
unscale_(self, optimizer: 'tensorplay.optim.Optimizer') -> 'None' [source]
Divides ("unscales") the optimizer's gradient tensors by the scale factor.
unscale_ is optional, serving cases where you need to modify or inspect gradients between the backward pass(es) and step. If unscale_ is not called explicitly, gradients will be unscaled automatically during step.
Simple example, using unscale_ to enable clipping of unscaled gradients:
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
tensorplay.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optimizer)
scaler.update()

Args
- optimizer (tensorplay.optim.Optimizer): Optimizer that owns the gradients to be unscaled.
INFO
unscale_ does not incur a CPU-GPU sync.
DANGER
unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.
DANGER
unscale_ may unscale sparse gradients out of place, replacing the .grad attribute.
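Beyond clipping, unscale_ can be used purely to inspect the true (unscaled) gradient values before stepping; a minimal sketch, assuming gradient tensors expose a .norm() method, with model, optimizer, and loss as placeholders:

scaler.scale(loss).backward()

# After unscale_, .grad holds true gradient values, so statistics computed
# here are not distorted by the scale factor.
scaler.unscale_(optimizer)
grad_norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]

# step() notices unscale_ was already called and does not unscale again.
scaler.step(optimizer)
scaler.update()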
update(self, new_scale: 'Optional[Union[float, tensorplay.Tensor]]' = None) -> 'None' [source]
Update the scale factor.
If any optimizer steps were skipped the scale is multiplied by backoff_factor to reduce it. If growth_interval unskipped iterations occurred consecutively, the scale is multiplied by growth_factor to increase it.
Passing new_scale sets the new scale value manually. (new_scale is not used directly; it is used to fill GradScaler's internal scale tensor, so if new_scale was a tensor, later in-place changes to that tensor will not further affect the scale GradScaler uses internally.)
Args
- new_scale (float or tensorplay.Tensor, optional, default=None): New scale factor.
DANGER
update should only be called at the end of the iteration, after scaler.step(optimizer) has been invoked for all optimizers used this iteration.
DANGER
For performance reasons, the scale factor value is not checked (doing so would require a synchronization), so it is not guaranteed to be above 1. If the scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something is likely wrong. For example, bf16-pretrained models are often incompatible with AMP/fp16 due to differing dynamic ranges.
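If recalibration goes wrong (for example, the scale collapses as described above), new_scale offers a manual escape hatch; a minimal sketch (remember that get_scale incurs a CPU-GPU sync, so such checks are best done sparingly):

if scaler.get_scale() < 1.0:
    # Reset the internal scale tensor to a known-good value.
    scaler.update(new_scale=2.0**16)
else:
    scaler.update()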
class OptState [source]
OptState(*values)
Bases: Enum
