Convolutional and recurrent operations both process one local neighborhood at a time. This paper proposes non-local operations to address the fact that blocks far apart from each other may still be related.

# 1. Introduction

• sequential data (e.g., in speech, language) is typically handled with recurrent operations, which likewise aim at long-range dependency modeling
• image data is commonly handled with deep stacks of convolutional operations

Modeling long-range dependencies by repeatedly stacking such local operations has several drawbacks:

1. It is computationally inefficient
2. It is difficult to optimize
3. Propagation of non-local feature information is inflexible (multi-hop dependency modeling, where messages must travel between distant positions, becomes difficult)

Figure 1. A spacetime non-local operation in our network trained for video classification in Kinetics. A position $\textbf x_i$’s response is computed by the weighted average of the features of all positions $\textbf x_j$ (only the highest weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.

# 3. Non-local Neural Networks

## 3.1. Formulation

$$\textbf y_i = \frac{1}{\mathcal C(\textbf x)} \sum_{\forall j} f(\textbf x_i, \textbf x_j) g(\textbf x_j). \tag{1}$$

• $i$: index of an output position (in space, time, or spacetime)
• $j$: index that enumerates all possible positions
• $\textbf x$: the input signal (image, sequence, video; often their features)
• $\textbf y$: the output signal, the same size as $\textbf x$
• $f$: computes the affinity between $i$ and all $j$
• $g$: computes a representation of the input signal at position $j$
• $\mathcal C$: the normalizer

Note that the non-local op is not the same as a fully-connected (fc) layer: Eq. (1) computes responses from relationships between different positions, whereas an fc layer uses learned weights; the non-local op also supports inputs of variable size and keeps the output the same size as the input.
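The general operation in Eq. (1) can be sketched in NumPy. The function names and the flattened `(N, C)` position layout below are my own illustrative choices, not from the paper:

```python
import numpy as np

def non_local(x, f, g, normalizer):
    """Generic non-local operation, Eq. (1):
    y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j).

    x: (N, C) array of N positions (space, time, or spacetime, flattened).
    f: pairwise affinity, maps (N, C), (N, C) -> (N, N).
    g: unary representation, maps (N, C) -> (N, C').
    normalizer: maps the (N, N) affinity matrix to a (N, 1) per-row C(x).
    """
    F = f(x, x)               # (N, N) affinities between all position pairs
    G = g(x)                  # (N, C') representation at every position j
    C = normalizer(F)         # (N, 1) normalizer for each output position i
    return (F @ G) / C        # (N, C') weighted average over all positions j
```

Plugging in `f(a, b) = exp(a @ b.T)`, `g` as the identity, and a row-sum normalizer recovers the Gaussian instantiation of Section 3.2.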

## 3.2. Instantiations

### Gaussian

$$f(\textbf x_i, \textbf x_j) = e^{\textbf x_i^T \textbf x_j}. \tag{2}$$

• $\textbf x_i^T \textbf x_j$: dot-product similarity (Euclidean distance would also work, but the dot product is easier to implement)
• $\mathcal C(\textbf x) = \sum_{\forall j} f(\textbf x_i, \textbf x_j)$
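With the Gaussian $f$ and its row-sum normalizer, Eq. (1) reduces to a softmax over $j$ followed by a weighted average. A minimal NumPy sketch, taking $g$ as the identity for brevity (the function name is my own):

```python
import numpy as np

def gaussian_nonlocal(x):
    """Gaussian instantiation, Eqs. (1)-(2): with f = exp(x_i^T x_j) and
    C(x) = sum_j f, the normalized weights form a softmax along dimension j."""
    logits = x @ x.T                               # (N, N) dot-product similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)              # softmax over j; rows sum to 1
    return w @ x                                   # g taken as the identity here
```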

### Embedded Gaussian

$$f(\textbf x_i, \textbf x_j) = e^{\theta(\textbf x_i)^T \phi(\textbf x_j)}. \tag{3}$$

• $\theta(\textbf x_i) = W_\theta \textbf x_i$: an embedding
• $\phi(\textbf x_j) = W_\phi \textbf x_j$: an embedding
• $\mathcal C(\textbf x) = \sum_{\forall j} f(\textbf x_i, \textbf x_j)$
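The embedded Gaussian version with this normalizer is the self-attention form: $\textbf y = \text{softmax}(\theta(\textbf x)\,\phi(\textbf x)^T)\, g(\textbf x)$. A NumPy sketch; taking $g(\textbf x_j) = W_g \textbf x_j$ as a linear embedding follows the block design of Section 3.3, and the weight shapes are my own illustrative choices:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def embedded_gaussian_nonlocal(x, W_theta, W_phi, W_g):
    """Embedded Gaussian instantiation, Eq. (3), in self-attention form.
    x: (N, C); W_theta, W_phi, W_g: (d, C) linear embeddings."""
    theta = x @ W_theta.T           # (N, d) embeddings of the output positions i
    phi = x @ W_phi.T               # (N, d) embeddings of the positions j
    gx = x @ W_g.T                  # (N, d) representation g at every position j
    return softmax(theta @ phi.T) @ gx   # (N, d)
```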

### Dot product

$$f(\textbf x_i, \textbf x_j) = \theta(\textbf x_i)^T \phi(\textbf x_j). \tag{4}$$

• $\theta(\textbf x_i) = W_\theta \textbf x_i$: an embedding
• $\phi(\textbf x_j) = W_\phi \textbf x_j$: an embedding
• $\mathcal C(\textbf x) = N$ ($N$: #positions in $\textbf x$)

### Concatenation

$$f(\textbf x_i, \textbf x_j) = \text{ReLU} (\textbf w_f^T[\theta(\textbf x_i), \phi(\textbf x_j)]). \tag{5}$$

• $[\cdot, \cdot]$: concatenation
• $\textbf w_f$: a weight vector that projects the concatenated vector to a scalar
• $\mathcal C(\textbf x) = N$ ($N$: #positions in $\textbf x$)
• here $f$ additionally applies a ReLU
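Because $\textbf w_f^T [\theta(\textbf x_i), \phi(\textbf x_j)]$ splits into the two halves of $\textbf w_f$ applied to $\theta(\textbf x_i)$ and $\phi(\textbf x_j)$ separately, all $N \times N$ pairwise scores can be formed by broadcasting instead of materializing every concatenated pair. A NumPy sketch (names are illustrative):

```python
import numpy as np

def concat_affinity(x, W_theta, W_phi, w_f):
    """Concatenation instantiation, Eq. (5):
    f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)]), normalized by C(x) = N.
    x: (N, C); W_theta, W_phi: (d, C); w_f: (2*d,)."""
    d = W_theta.shape[0]
    theta = x @ W_theta.T           # (N, d)
    phi = x @ W_phi.T               # (N, d)
    N = x.shape[0]
    # w_f^T [theta_i, phi_j] = w_f[:d].theta_i + w_f[d:].phi_j -> (N, N) by broadcast
    s = (theta @ w_f[:d])[:, None] + (phi @ w_f[d:])[None, :]
    return np.maximum(s, 0.0) / N   # ReLU, then divide by C(x) = N
```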

## 3.3. Non-local Block

$$\textbf z_i = W_z \textbf y_i + \textbf x_i, \tag{6}$$

• the "$+\,\textbf x_i$" term is a residual connection, which lets a non-local block be inserted into any pre-trained model without breaking its initial behavior

Figure 2. A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., $T \times H \times W \times 1024$ for $1024$ channels (proper reshaping is performed when noted). “$\otimes$” denotes matrix multiplication, and “$\oplus$” denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote $1 \times 1 \times 1$ convolutions. Here we show the embedded Gaussian version, with a bottleneck of $512$ channels. The vanilla Gaussian version can be done by removing $\theta$ and $\phi$, and the dot-product version can be done by replacing softmax with scaling by $1 / N$.
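Putting Eqs. (3) and (6) together, an embedded-Gaussian non-local block over flattened positions can be sketched as follows (NumPy, single example, no batch; the paper notes that making the block an identity mapping at insertion is achieved via zero initialization of the last layer):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def non_local_block(x, W_theta, W_phi, W_g, W_z):
    """Embedded-Gaussian non-local block, Eq. (6): z_i = W_z y_i + x_i.
    x: (N, C) with N = T*H*W flattened positions and C channels.
    W_theta, W_phi, W_g: (C_b, C) bottleneck projections (e.g. C_b = C // 2).
    W_z: (C, C_b) projects back to C channels; zero-initializing it makes
    the whole block an identity mapping at the start of training."""
    theta = x @ W_theta.T                  # (N, C_b)
    phi = x @ W_phi.T                      # (N, C_b)
    g = x @ W_g.T                          # (N, C_b)
    y = softmax(theta @ phi.T) @ g         # (N, C_b), embedded-Gaussian Eq. (1)
    return y @ W_z.T + x                   # (N, C), Eq. (6) with residual x_i
```

Replacing the softmax with scaling by $1/N$ gives the dot-product version, matching the note in the Figure 2 caption.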

# 5. Experiments on Video Classification

Figure 4. Curves of the training procedure on Kinetics for the ResNet-50 C2D baseline (blue) vs. non-local C2D with 5 blocks (red). We show the top-1 training error (dash) and validation error (solid). The validation error is computed in the same way as the training error (so it is 1-clip testing with the same random jittering at training time); the ﬁnal results are in Table 2c (R50, 5-block).

# 6. Extension: Experiments on COCO

Table 5. Adding 1 non-local block to Mask R-CNN for COCO object detection and instance segmentation. The backbone is ResNet-50/101 or ResNeXt-152, both with FPN.


2018-10-27 08:26 -0400