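Linear Neural Networks

Linear Regression

Basic Elements of Linear Regression

Linear Model

Under the linearity assumption, the relationship between the features and the target can be expressed as an affine transformation:

\hat{y} = w_{1} x_{1} + \dots + w_{d} x_{d} + b .

Collecting the features into a vector $x \in R^{d}$ and the weights into a vector $w \in R^{d}$, the model can be written as a dot product:

\hat{y} = w^{⊤} x + b .

Stacking the n examples of the dataset into a matrix $X \in R^{n \times d}$, the predictions $\hat{y} \in R^{n}$ can be written as a matrix-vector product:

\hat{y} = X w + b .

To reduce the prediction error we need to find better model parameters, which requires (1) a measure of model quality and (2) a procedure for updating the model to improve that quality.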

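Loss Function

A loss function quantifies the gap between the actual and the predicted value of the target. Taking the squared error as an example:

l^{(i)} (w, b) = \frac{1}{2} {({\hat{y}}^{(i)} - y^{(i)})}^{2}

where ${\hat{y}}^{(i)}$ is the prediction for example i and $y^{(i)}$ is its true value. Model quality is measured by the mean loss over the n training examples: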

L (w, b) = \frac{1}{n} \sum_{i = 1}^{n} l^{(i)} (w, b) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{2} {(w^{⊤} x^{(i)} + b - y^{(i)})}^{2}
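As a small illustration of the two formulas above, the following sketch evaluates the vectorized prediction and the mean squared loss with NumPy. The toy values of `X`, `y`, `w`, and `b` are made up for this example and do not come from the notes.

```python
import numpy as np

# Hypothetical toy data: n = 3 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
y = np.array([4.0, 1.5, 2.0])
w = np.array([1.0, 1.0])
b = 0.5

y_hat = X @ w + b                      # predictions: y_hat = Xw + b, shape (n,)
per_example = 0.5 * (y_hat - y) ** 2   # squared loss l^(i)(w, b) for each example
L = per_example.mean()                 # mean loss L(w, b)
```

The scalar `b` is broadcast over all n predictions, which is what the notation $X w + b$ implies.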

Training seeks parameters $(w^{*}, b^{*})$ that minimize the total loss over all training examples:

w^{*}, b^{*} = \underset{w, b}{\operatorname{argmin}} \; L (w, b)

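Analytic Solution

Linear regression has a closed-form solution. First fold the bias $b$ into the parameter vector $w$ by appending a column of ones to the design matrix $X$. The problem then becomes minimizing $∥ y - X w ∥^{2}$. Setting the derivative of the loss with respect to $w$ to zero gives the analytic solution, provided $X^{⊤} X$ is invertible: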

w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y

^[1]
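The closed-form solution can be tried out numerically. The sketch below is only an illustration under assumed settings: the synthetic data, the seed, and the names `X_aug` and `w_star` are choices made for this example. It solves the linear system $X^{⊤} X w = X^{⊤} y$ rather than forming the inverse explicitly, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
true_w = np.array([2.0, -3.4])
true_b = 4.2

# Synthetic data: y = Xw + b + small Gaussian noise.
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)

# Fold the bias into the weights by appending a column of ones to X.
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve X^T X w = X^T y, which is equivalent to w = (X^T X)^{-1} X^T y.
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# np.linalg.lstsq minimizes ||y - Xw||^2 directly and should agree.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
```

The last entry of `w_star` plays the role of the bias, since it multiplies the column of ones.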

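Stochastic Gradient Descent

The 3B1B video on gradient descent in deep learning gives a good intuition for the procedure. The algorithm has two steps: (1) initialize the model parameters, for example at random; (2) repeatedly sample a random minibatch from the dataset and update the parameters in the direction of the negative gradient. For the squared loss and an affine transformation, the update can be written explicitly as: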

\begin{aligned} w & \leftarrow w - \frac{η}{| B |} \sum_{i \in B} \partial_{w} l^{(i)} (w, b) = w - \frac{η}{| B |} \sum_{i \in B} x^{(i)} (w^{⊤} x^{(i)} + b - y^{(i)}), \\ b & \leftarrow b - \frac{η}{| B |} \sum_{i \in B} \partial_{b} l^{(i)} (w, b) = b - \frac{η}{| B |} \sum_{i \in B} (w^{⊤} x^{(i)} + b - y^{(i)}) . \end{aligned}

where $| B |$ is the number of examples in each minibatch, i.e. the batch size, and $η$ is the learning rate.
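The update rule above can be written almost line for line in NumPy. This is a minimal sketch under assumed settings (synthetic data, zero initialization, learning rate 0.03, batch size 10, three epochs), not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)

w, b = np.zeros(d), 0.0        # step (1): initialize the parameters
eta, batch_size = 0.03, 10     # learning rate and |B|

for epoch in range(3):
    order = rng.permutation(n)                  # visit the examples in random order
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]   # indices of the minibatch B
        err = X[idx] @ w + b - y[idx]           # w^T x^(i) + b - y^(i) for i in B
        w -= eta / len(idx) * (X[idx].T @ err)  # step (2): update w along the negative gradient
        b -= eta / len(idx) * err.sum()         # step (2): update b along the negative gradient
```

After a few passes over the data, `w` and `b` should be close to the values used to generate it.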

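Generalization

The goal is to find parameters that achieve a low loss on data we have not seen before.

Normal Distribution and Squared Loss

The probability density of the normal distribution is:

p (x) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{1}{2 σ^{2}} (x - μ)^{2})

It can be visualized in Python: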

Python

import math
import numpy as np
import matplotlib.pyplot as plt

def normal(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)

# Generate the independent variable x with numpy
x = np.arange(-7, 7, 0.01)

# (mean, standard deviation) pairs
params = [(0, 1), (0, 2), (3, 1)]

# 1. Set the figure size with plt.figure(); slightly larger than (4.5, 2.5),
#    which may make the legend overlap the curves
plt.figure(figsize=(6, 4))

# 2. Plot one curve per loop iteration and tag it with label=,
#    which makes building the legend straightforward
for mu, sigma in params:
    y = normal(x, mu, sigma)
    plt.plot(x, y, label=f'mean {mu}, std {sigma}')

# 3. Axis labels and the legend are set separately
plt.xlabel('x')
plt.ylabel('p(x)')
plt.legend()

# 4. Show the figure
plt.show()

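The book's authors assume that real-world observations $y$ contain Gaussian noise:

y = w^{⊤} x + b + ϵ, \quad ϵ \sim N (0, σ^{2})

Hence $ϵ = y - w^{⊤} x - b$ follows the normal distribution $N (0, σ^{2})$, and for a given $x$ the probability density of observing $y$ is:

P (y ∣ x) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{1}{2 σ^{2}} (y - w^{⊤} x - b)^{2})

Because the n examples are independent of one another, the likelihood of the whole dataset is:

P (y ∣ X) = \prod_{i = 1}^{n} P (y^{(i)} ∣ x^{(i)})

Since $\log$ is monotonic, maximizing $P$ is equivalent to maximizing $\log P$, which is in turn equivalent to minimizing $- \log P$:

- \log P (y ∣ X) = \sum_{i = 1}^{n} \left( \underbrace{\frac{1}{2} \log (2 π σ^{2})}_{\text{term 1: constant}} + \underbrace{\frac{1}{2 σ^{2}} {(y^{(i)} - w^{⊤} x^{(i)} - b)}^{2}}_{\text{term 2: same form as MSE}} \right)

Therefore, under the assumption that the noise is Gaussian, minimizing the mean squared error is equivalent to maximum likelihood estimation of the linear model.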
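The equivalence can be checked numerically on a toy problem. In the sketch below the scalar data, the noise level `sigma`, and the candidate parameters are all made up for illustration. It verifies that the negative log-likelihood equals a constant plus a positive multiple of the mean squared loss $L (w, b)$, so both rank any set of candidate parameters identically.

```python
import math

def neg_log_likelihood(w, b, xs, ys, sigma):
    """-log P(y | X) for scalar inputs under Gaussian noise with std sigma."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (y - w * x - b) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))

def mean_squared_loss(w, b, xs, ys):
    """L(w, b) = (1/n) * sum_i 1/2 * (w * x_i + b - y_i)^2."""
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
sigma = 0.5
n = len(xs)

# NLL = n * const + (n / sigma^2) * L, so both share the same minimizer.
candidates = [(2.0, 1.0), (1.5, 1.5), (1.0, 2.0)]
nll_vals = [neg_log_likelihood(w, b, xs, ys, sigma) for w, b in candidates]
loss_vals = [mean_squared_loss(w, b, xs, ys) for w, b in candidates]
```

Because $σ$ only enters through the constant and the positive scale factor, its value does not affect which parameters are optimal.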

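From Linear Regression to Deep Networks

Fully-connected layer (also called a dense layer)

Every input is connected to every output.

A basic from-scratch implementation of linear regression is shown below: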

Python

import random  
import torch  
from d2l import torch as d2l  
  
def synthetic_data(w, b, num_examples): #@save
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w))) # feature matrix X drawn from the standard normal distribution (mean 0, variance 1)
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape) # add noise to the labels
    return X, y.reshape((-1, 1)) # reshape into a column vector
  
true_w = torch.tensor([2,-3.4])  
true_b = 4.2  
features, labels = synthetic_data(true_w, true_b, 1000)  
print('features:',features[0],'\nlabel:',labels[0])  
  
d2l.set_figsize()  
d2l.plt.scatter(features[:,(1)].detach().numpy(),labels.detach().numpy(),1)  
  
def data_iter(batch_size, features,labels):  
    num_examples= len(features)  
    indices= list(range(num_examples))  
    # The examples are read in random order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices] # generator: yields one minibatch at a time
  
batch_size = 10  
for X, y in data_iter(batch_size, features, labels):  
    print(X, '\n', y)  
    break  
  
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True) # randomly initialize the weights
b = torch.zeros(1, requires_grad=True) # initialize the bias to 0

def linreg(X, w, b): #@save
    """The linear regression model."""
    return torch.matmul(X, w) + b # forward pass

def squared_loss(y_hat, y): #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2 # per-example loss

def sgd(params, lr, batch_size): #@save
    """Minibatch stochastic gradient descent (the optimizer)."""
    with torch.no_grad(): # disable gradient tracking during the update
        for param in params:
            param -= lr * param.grad / batch_size # average the gradient over the batch
            param.grad.zero_() # reset the gradient
  
lr = 0.03  
num_epochs = 3  
net = linreg  
loss = squared_loss  
  
for epoch in range(num_epochs):  
    for X, y in data_iter(batch_size, features, labels):  
        l = loss(net(X, w, b), y) # minibatch loss for X and y
        # l has shape (batch_size, 1) rather than being a scalar, so its elements
        # are summed before computing the gradient with respect to [w, b]
        l.sum().backward()
        sgd([w, b], lr, batch_size) # update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

print(f'estimation error of w: {true_w - w.reshape(true_w.shape)}')
print(f'estimation error of b: {true_b - b}')

Concise Implementation of Linear Regression

Python

import numpy as np  
import torch  
from torch.utils import data  
from d2l import torch as d2l  
from torch import nn  
  
true_w = torch.tensor([2,-3.4])  
true_b = 4.2  
features, labels = d2l.synthetic_data(true_w, true_b, 1000)  
  
def load_array(data_arrays, batch_size, is_train=True): #@save
    """Construct a PyTorch data iterator; is_train controls whether the data is shuffled every epoch."""
    dataset = data.TensorDataset(*data_arrays)  
    return data.DataLoader(dataset, batch_size, shuffle=is_train)  
  
batch_size = 10  
data_iter = load_array((features, labels), batch_size)  
  
next(iter(data_iter))  
# print(next(iter(data_iter)))  
  
# Define the model
net = nn.Sequential(nn.Linear(2, 1))
# Initialize the parameters
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

# Define the loss function
loss = nn.MSELoss()
# Define the optimization algorithm
trainer = torch.optim.SGD(net.parameters(), lr=0.03)

# Training
num_epochs = 3  
for epoch in range(num_epochs):  
    for X, y in data_iter:  
        l = loss(net(X) ,y)  
        trainer.zero_grad()  
        l.backward()  
        trainer.step()  
    l = loss(net(features), labels)  
    print(f'epoch {epoch + 1}, loss {l:f}')  
  
w = net[0].weight.data
print('estimation error of w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('estimation error of b:', true_b - b)

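Softmax Regression

Classification

"Hard" classes: which class an example belongs to.

"Soft" classes: the probability of belonging to each class.

Even when we only care about hard classes, we still use a model that outputs soft classes.

Classification Problem

A simple way to represent categorical data:

One-hot encoding

A one-hot encoding is a vector with as many components as there are classes. The component corresponding to the example's class is set to 1 and all other components are set to 0.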
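A minimal sketch of this encoding in plain Python follows. The helper `one_hot` and the three class names are hypothetical and chosen only for illustration.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if j == label else 0.0 for j in range(num_classes)]

classes = ["cat", "chicken", "dog"]   # hypothetical class names
y = one_hot(classes.index("chicken"), len(classes))
```

Here `y` has three components, one per class, and only the component for "chicken" is nonzero.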

Softmax Operation

\hat{y} = \mathrm{softmax} (o) \quad \text{where} \quad {\hat{y}}_{j} = \frac{\exp (o_{j})}{\sum_{k} \exp (o_{k})}

Its vectorized form is:

\begin{aligned} O & = X W + b, \\ \hat{Y} & = \mathrm{softmax} (O) \end{aligned}
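The vectorized expression can be sketched in NumPy as follows. The shapes and values of `X`, `W`, and `b` are assumptions made for this example, and this direct version exponentiates the logits as they are, so it is only suitable for moderately sized values.

```python
import numpy as np

def softmax(O):
    """Row-wise softmax of a matrix of logits O with shape (n, q)."""
    exp_O = np.exp(O)
    return exp_O / exp_O.sum(axis=1, keepdims=True)

# Hypothetical minibatch: n = 2 examples, d = 2 features, q = 3 classes.
X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
W = np.array([[0.1, 0.2, 0.3],
              [0.0, -0.1, 0.4]])
b = np.zeros(3)

O = X @ W + b        # logits, shape (n, q)
Y_hat = softmax(O)   # each row is a probability distribution over the q classes
```

Softmax preserves the ordering of the logits, so the most likely class of each row can be read off either `O` or `Y_hat`.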

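Loss Function

Log-Likelihood

Comparing the estimate with the actual value for each example with index i gives:

P (Y ∣ X) = \prod_{i = 1}^{n} P (y^{(i)} ∣ x^{(i)})

By maximum likelihood estimation we maximize $P (Y ∣ X)$, which is equivalent to minimizing the negative log-likelihood:

- \log P (Y ∣ X) = \sum_{i = 1}^{n} - \log P (y^{(i)} ∣ x^{(i)}) = \sum_{i = 1}^{n} l (y^{(i)}, {\hat{y}}^{(i)})

where the loss function is:

l (y, \hat{y}) = - \sum_{j = 1}^{q} y_{j} \log {\hat{y}}_{j}

This function is also known as the cross-entropy loss.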
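For a one-hot label the sum collapses to a single term, the negative log of the probability assigned to the true class. The sketch below illustrates this with made-up values for `y` and `y_hat`.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j), skipping terms where y_j = 0."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

y = [0.0, 1.0, 0.0]        # one-hot label: the true class is index 1
y_hat = [0.1, 0.7, 0.2]    # hypothetical predicted probabilities
loss = cross_entropy(y, y_hat)   # equals -log(0.7)
```

Skipping the terms with $y_{j} = 0$ avoids evaluating the logarithm of probabilities that do not contribute to the loss.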

Softmax and Its Derivative

Substituting the softmax operation into the loss function and using the definition of softmax gives:

\begin{aligned} l (y, \hat{y}) & = - \sum_{j = 1}^{q} y_{j} \log \frac{\exp (o_{j})}{\sum_{k = 1}^{q} \exp (o_{k})} \\ & = \sum_{j = 1}^{q} y_{j} \log \sum_{k = 1}^{q} \exp (o_{k}) - \sum_{j = 1}^{q} y_{j} o_{j} \\ & = \log \sum_{k = 1}^{q} \exp (o_{k}) - \sum_{j = 1}^{q} y_{j} o_{j} . \end{aligned}

Note: the last step holds because the components of a one-hot vector sum to 1, i.e. $\sum_{j = 1}^{q} y_{j} = 1$. Differentiating with respect to the unnormalized prediction $o_{j}$ (chain rule) gives:

\partial_{o_{j}} l (y, \hat{y}) = \frac{\exp (o_{j})}{\sum_{k = 1}^{q} \exp (o_{k})} - y_{j} = \mathrm{softmax} (o)_{j} - y_{j} .
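The derivative can be sanity-checked with central finite differences. The logits `o`, the label `y`, and the step size `eps` below are arbitrary choices for this illustration.

```python
import math

def softmax(o):
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy of softmax(o) against a one-hot label y."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, softmax(o)) if yj > 0)

o = [1.0, -0.5, 2.0]   # hypothetical logits
y = [0.0, 0.0, 1.0]    # one-hot label

# Analytic gradient from the formula above: softmax(o)_j - y_j.
analytic = [p - yj for p, yj in zip(softmax(o), y)]

# Numerical gradient: perturb one logit at a time.
eps = 1e-6
numeric = []
for j in range(len(o)):
    plus = o[:j] + [o[j] + eps] + o[j + 1:]
    minus = o[:j] + [o[j] - eps] + o[j + 1:]
    numeric.append((loss(plus, y) - loss(minus, y)) / (2 * eps))
```

The two lists should agree up to finite-difference error, and the entries of `analytic` sum to zero because both the softmax output and the one-hot label sum to 1.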

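Image Classification Dataset

Following the book, these notes use the Fashion-MNIST dataset.

Computing the softmax function directly is prone to overflow. To address this, $\max (o_{k})$ is subtracted from every $o_{j}$ before exponentiation, which leaves the result unchanged:

\begin{aligned} {\hat{y}}_{j} & = \frac{\exp (o_{j} - \max (o_{k})) \exp (\max (o_{k}))}{\sum_{k} \exp (o_{k} - \max (o_{k})) \exp (\max (o_{k}))} \\ & = \frac{\exp (o_{j} - \max (o_{k}))}{\sum_{k} \exp (o_{k} - \max (o_{k}))} . \end{aligned}

In this expression $o_{j} - \max (o_{k})$ may be a large negative number, which can cause underflow. However, the cross-entropy loss eventually takes the logarithm of these values, so $\log (\exp (\cdot))$ cancels out:

\begin{aligned} \log ({\hat{y}}_{j}) & = \log (\frac{\exp (o_{j} - \max (o_{k}))}{\sum_{k} \exp (o_{k} - \max (o_{k}))}) \\ & = \log (\exp (o_{j} - \max (o_{k}))) - \log (\sum_{k} \exp (o_{k} - \max (o_{k}))) \\ & = o_{j} - \max (o_{k}) - \log (\sum_{k} \exp (o_{k} - \max (o_{k}))) . \end{aligned}

This approach is also known as the LogSumExp trick. PyTorch's nn.CrossEntropyLoss takes the unnormalized logits and combines the softmax and the logarithm internally: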

Python

from torch import nn
loss = nn.CrossEntropyLoss(reduction='none')  # expects raw logits; returns one loss value per example
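To see the shift by $\max (o_{k})$ at work outside of PyTorch, the following plain-Python sketch computes the log of the softmax using the last line of the derivation above. The function name `log_softmax` and the extreme logits are chosen only for illustration.

```python
import math

def log_softmax(o):
    """Numerically stable log(softmax(o)) using the shift by max(o)."""
    m = max(o)
    log_sum = math.log(sum(math.exp(v - m) for v in o))
    return [v - m - log_sum for v in o]

# math.exp(1000.0) would raise OverflowError, but the shifted version is safe.
o = [1000.0, 1001.0, 1002.0]
log_p = log_softmax(o)

# Shifting all logits by a constant does not change the result.
log_p_small = log_softmax([0.0, 1.0, 2.0])
```

After the shift the largest exponent is 0, so the sum inside the logarithm is at least 1 and the logarithm is always well defined.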

Implementation of Softmax

Python

import torch  
from torch import nn  
from d2l import torch as d2l  
import matplotlib  
matplotlib.use('TkAgg')  # or 'Qt5Agg', depending on which backend is installed
import matplotlib.pyplot as plt  
# from IPython import display  
  
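# Data preparation
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size) # load the Fashion-MNIST dataset

# PyTorch does not implicitly reshape its inputs, so a flatten layer is placed
# before the linear layer to reshape the network input; nn.Linear is a fully-connected layer
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

# Initialize the weights
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01) # mean 0, standard deviation 0.01
net.apply(init_weights)

# Accumulator
class Accumulator: #@save
    """Accumulate sums over n variables."""
    def __init__(self, n):
        self.data = [0.0] * n
    def add(self, *args): # add each argument to the corresponding slot
        self.data = [a + float(b) for a, b in zip(self.data, args)]
    def reset(self): # reset all slots to zero
        self.data = [0.0] * len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

# Plot the training curves live (script version; the Jupyter Notebook calls are kept as comments in add())

class Animator:
    """Plot the training curves live when running as a script."""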
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,  
                 ylim=None, xscale='linear', yscale='linear',  
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,  
                 figsize=(3.5, 2.5)):  
        if legend is None:  
            legend = []  
        self.legend = legend  
  
        plt.ion()  # turn on interactive mode; this is essential
        self.fig, self.axes = plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]

        # Save the axis configuration so it can be reapplied on every redraw
        self.config = dict(xlabel=xlabel, ylabel=ylabel,  
                           xlim=xlim, ylim=ylim,  
                           xscale=xscale, yscale=yscale)  
        self.X, self.Y, self.fmts = None, None, fmts  
  
    def _config_axes(self):  
        ax = self.axes[0]  
        ax.set_xlabel(self.config['xlabel'])  
        ax.set_ylabel(self.config['ylabel'])  
        ax.set_xscale(self.config['xscale'])  
        ax.set_yscale(self.config['yscale'])  
        if self.config['xlim']:  
            ax.set_xlim(self.config['xlim'])  
        if self.config['ylim']:  
            ax.set_ylim(self.config['ylim'])  
        if self.legend:  
            ax.legend(self.legend)  
        ax.grid(True)  
  
    def add(self, x, y):  
        if not hasattr(y, "__len__"):  
            y = [y]  
        n = len(y)  
        if not hasattr(x, "__len__"):  
            x = [x] * n  
        if not self.X:  
            self.X = [[] for _ in range(n)]  
        if not self.Y:  
            self.Y = [[] for _ in range(n)]  
        for i, (a, b) in enumerate(zip(x, y)):  
            if a is not None and b is not None:  
                self.X[i].append(a)  
                self.Y[i].append(b)  
  
        self.axes[0].cla()  
        for x_, y_, fmt in zip(self.X, self.Y, self.fmts):  
            self.axes[0].plot(x_, y_, fmt)  
        self._config_axes()  
  
        self.fig.canvas.draw()  
        self.fig.canvas.flush_events()  
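        plt.pause(0.01)  # let the event loop run briefly so the window refreshes
        # The commented-out lines below are the Jupyter Notebook version
        # display.display(self.fig)
        # display.clear_output(wait=True)
    def show(self):
        """Call after training to keep the window open."""
        plt.ioff()
        plt.show()


# Classification accuracy
def accuracy(y_hat, y): #@save
    """Compute the number of correct predictions."""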
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:  
        y_hat = y_hat.argmax(axis=1)  
    cmp = y_hat.type(y.dtype) == y  
    return float(cmp.type(y.dtype).sum())  
  
# Train for one epoch
def train_epoch_ch3(net, train_iter, loss, updater): #@save
    """Train the model for one epoch (defined in Chapter 3)."""
    # Set the model to training mode
    if isinstance(net, torch.nn.Module):
        net.train()
    # Sum of training loss, sum of training accuracy, number of examples
    metric = Accumulator(3)
    for X, y in train_iter:
        # Compute gradients and update the parameters
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            # Using PyTorch's built-in optimizer and loss function
            updater.zero_grad()
            l.mean().backward()
            updater.step()
        else:
            # Using a custom optimizer and loss function
            l.sum().backward()
            updater(X.shape[0])
        # metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
        metric.add(l.sum().item(), accuracy(y_hat, y), y.numel())
    # Return the training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]

# Evaluate on the test set
def evaluate_accuracy(net, data_iter): #@save
    """Compute the accuracy of the model on the given dataset."""
    if isinstance(net, torch.nn.Module):
        net.eval() # set the model to evaluation mode
    metric = Accumulator(2) # number of correct predictions, total number of predictions
    with torch.no_grad():
        for X, y in data_iter:
            metric.add(accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]

# Full training loop
def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater): #@save
    """Train the model (defined in Chapter 3)."""
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],  
                        legend=['train loss', 'train acc', 'test acc'])  
    for epoch in range(num_epochs):  
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)  
        test_acc = evaluate_accuracy(net, test_iter)  
        animator.add(epoch + 1, train_metrics + (test_acc,))  
    animator.show()  
    train_loss, train_acc = train_metrics  
    assert train_loss < 0.5, train_loss  
    assert train_acc <= 1 and train_acc > 0.7, train_acc  
    assert test_acc <= 1 and test_acc > 0.7, test_acc  
  
if __name__ == '__main__': # avoids multiprocessing problems with data-loading workers on Windows
    # Define the loss function
    loss = nn.CrossEntropyLoss(reduction='none')
    # Optimization algorithm
    trainer = torch.optim.SGD(net.parameters(), lr=0.1)
    # Training
    num_epochs = 10
    train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)


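Tips: The Normal Equation

We use the residual sum of squares (RSS) as the loss function $L (w)$:

\begin{aligned} L (w) & = ∥ y - X w ∥^{2} = (y - X w)^{⊤} (y - X w) \\ & = y^{⊤} y - 2 w^{⊤} X^{⊤} y + w^{⊤} X^{⊤} X w \end{aligned}

Using the matrix derivative identities $\frac{\partial (a^{⊤} w)}{\partial w} = a$ and $\frac{\partial (w^{⊤} A w)}{\partial w} = 2 A w$ (for a symmetric matrix $A$), we differentiate $L (w)$ with respect to $w$ and set the result to zero:

\frac{\partial L}{\partial w} = - 2 X^{⊤} y + 2 X^{⊤} X w = - 2 X^{⊤} (y - X w) = 0

Dividing out the constant factor -2 gives:

X^{⊤} y - X^{⊤} X w = 0

Rearranging:

X^{⊤} X w = X^{⊤} y

Assuming $X^{⊤} X$ is full rank and therefore invertible, left-multiplying both sides by $(X^{⊤} X)^{- 1}$ gives $w = (X^{⊤} X)^{- 1} X^{⊤} y$, the analytic solution of the normal equation.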
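The gradient formula used in this derivation can be checked numerically. The sketch below uses random data with an arbitrary seed and small shapes chosen for illustration. It compares $- 2 X^{⊤} (y - X w)$ with central finite differences of the RSS and confirms that the gradient vanishes at the solution of the normal equation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # hypothetical design matrix
y = rng.normal(size=5)
w = rng.normal(size=3)

def rss(w):
    """Residual sum of squares ||y - Xw||^2."""
    r = y - X @ w
    return r @ r

def grad(w):
    """Analytic gradient of the RSS: -2 X^T (y - Xw)."""
    return -2 * X.T @ (y - X @ w)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (rss(w + eps * e) - rss(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

# The gradient should vanish at the solution of X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

Since the RSS is quadratic in $w$, the central difference is exact up to rounding error, which makes this a convenient check.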

[1] This result is known as the normal equation.
