Adam

Note

Adam = momentum + RMSprop
In other words, the update direction follows a moving average of the gradients, while the step size is scaled by a moving average of the squared gradients.

Adam keeps a moving average of both the gradients and the squared gradients:

\[\mathbf{v}_{t} = \beta_{1}\mathbf{v}_{t-1} + (1 - \beta_{1})\mathbf{g}_{t}\]
\[\mathbf{s}_{t} = \beta_{2}\mathbf{s}_{t-1} + (1 - \beta_{2})\mathbf{g}_{t}^{2}\]

Bias correction (since \(\mathbf{v}_{0} = \mathbf{s}_{0} = \mathbf{0}\), the early averages are biased toward zero):

\[\hat{\mathbf{v}}_{t} = \frac{\mathbf{v}_{t}}{1 - \beta_{1}^{t}}\]
\[\hat{\mathbf{s}}_{t} = \frac{\mathbf{s}_{t}}{1 - \beta_{2}^{t}}\]

Finally, update using both states:

\[\mathbf{x}_{t} = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\hat{\mathbf{s}}_{t}} + \epsilon}\odot\hat{\mathbf{v}}_{t}\]
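
The three steps above can be written out directly. The sketch below is a minimal illustration, not any library's implementation; the function name `adam_step` and the default hyperparameters are assumptions chosen to mirror the formulas.

import torch

def adam_step(x, g, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    # Moving averages of the gradient and the squared gradient
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    # Bias correction
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Element-wise parameter update
    x = x - lr * v_hat / (torch.sqrt(s_hat) + eps)
    return x, v, s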
import torch
from torch import nn

net = nn.Sequential(nn.Linear(784, 10))
# Adam in PyTorch; betas=(beta1, beta2)
optimizer = torch.optim.Adam(net.parameters(), lr=0.01, betas=(0.9, 0.999))
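
A minimal training step with this optimizer could look like the following; the random batch and the cross-entropy loss are illustrative assumptions, continuing the `net` and `optimizer` defined above.

X = torch.randn(32, 784)          # dummy batch of 32 flattened inputs (assumed shape)
y = torch.randint(0, 10, (32,))   # dummy class labels
loss = nn.CrossEntropyLoss()(net(X), y)

optimizer.zero_grad()   # clear old gradients
loss.backward()         # compute new gradients
optimizer.step()        # Adam update; v and s are maintained internally per parameter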