[Paper-Reading] Collaborative Spatiotemporal Feature Learning for Video Action Recognition

Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu, Hikvision Research Institute

Chen Xiaoyuan, 2019/11/7

1. Background

To extract spatiotemporal features, existing deep neural networks either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). The latter approach, C3D and its variants, works well in many models, but it demands far more computing resources.

2. Motivation

img >original<

This is a visualization of three views of a video. The top-left one is the normal view, which is readily understandable to us. However, if we treat the time and height dimensions as "spatial" ones (and width as "temporal"), we get the top-right image. It looks strange, but it still contains "spatial" patterns such as edges and color blobs, so our models trained on static images may be useful here. That is the core idea of CoST (Collaborative SpatioTemporal).

3. Methods (including framework)

Let \(x\) be the input data of size \(T\times H\times W\times C\), then we have the three views of this video:

$$ x_{hw}=x\otimes w_{1\times3\times3} $$

$$ x_{tw}=x\otimes w_{3\times1\times3} $$

$$ x_{th}=x\otimes w_{3\times3\times1} $$

where \(\otimes\) denotes 3D convolution, and the \(3\times3\) weight matrix \(w\) is shared by the three views (after a simple reshaping into \(1\times3\times3\), \(3\times1\times3\) and \(3\times3\times1\) kernels, respectively). The three views are then aggregated by weighted summation:

$$ y=\left[a_{hw},a_{tw},a_{th}\right] \left[ \begin{matrix} x_{hw} \\ x_{tw} \\ x_{th} \end{matrix} \right] $$
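To make the shared-weight construction concrete, here is a minimal single-channel NumPy sketch (the authors have not released code, so this is an illustration, not their implementation): the same \(3\times3\) weight matrix is reshaped into \(1\times3\times3\), \(3\times1\times3\) and \(3\times3\times1\) kernels, applied with "same" zero padding, and the three views are combined by the coefficients.

```python
import numpy as np

def conv3d_same(x, w):
    """Naive 3D cross-correlation with zero padding that preserves shape.

    x: (T, H, W) single-channel volume; w: 3D kernel with odd sizes.
    """
    kt, kh, kw = w.shape
    xp = np.pad(x, ((kt // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2))
    out = np.empty_like(x, dtype=float)
    T, H, W = x.shape
    for t in range(T):
        for h in range(H):
            for v in range(W):
                out[t, h, v] = np.sum(xp[t:t + kt, h:h + kh, v:v + kw] * w)
    return out

def cost_layer(x, w2d, a):
    """CoST sketch: one shared 3x3 weight matrix, three views, weighted sum."""
    x_hw = conv3d_same(x, w2d.reshape(1, 3, 3))  # spatial view
    x_tw = conv3d_same(x, w2d.reshape(3, 1, 3))  # temporal view (T-W)
    x_th = conv3d_same(x, w2d.reshape(3, 3, 1))  # temporal view (T-H)
    return a[0] * x_hw + a[1] * x_tw + a[2] * x_th
```

As a sanity check, a kernel whose only nonzero entry is the center behaves as the identity in every view, so coefficients summing to one return the input unchanged.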

There are two different ways of generating the coefficients \(a_{hw}\), \(a_{tw}\) and \(a_{th}\); accordingly, the two variants are called CoST(a) and CoST(b), respectively:

img >original<
img >original<
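One common recipe for such data-dependent coefficients, sketched below under my own assumptions (the paper's exact attention design may differ), is to globally average-pool each view, pass the pooled features through a small fully connected layer, and normalize with a softmax so the three coefficients sum to one:

```python
import numpy as np

def view_coefficients(x_hw, x_tw, x_th, W_fc, b_fc):
    """Data-dependent view coefficients: GAP -> linear -> softmax.

    W_fc (a hypothetical (3, 3) weight matrix) and b_fc (length-3 bias)
    are illustrative parameters, not the paper's exact parameterization.
    """
    pooled = np.array([x_hw.mean(), x_tw.mean(), x_th.mean()])
    logits = W_fc @ pooled + b_fc
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

The softmax guarantees non-negative coefficients that sum to one, which also makes them directly interpretable as the relative contribution of each view.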

4. Experiments (data corpus, evaluation metrics, compared methods)

Datasets: Moments in Time; Kinetics.

Evaluation metrics: Top-1 Acc., Top-5 Acc.

Compared methods: C2D, C3D, I3D, R(2+1)D, S3D-G, etc.

Backbones: ResNet-50, ResNet-101.

img >original<

These results are all on Kinetics, and all of them are single-model results (RGB modality only).

5. Pros & Cons

Pros:

  1. More accurate while having fewer parameters (compared with C3D).
  2. We can quantitatively interpret the contributions of spatial (H-W) and temporal (T-W, T-H) information over entire videos.

Cons:

  1. The T-W and T-H views are not intuitive at all.

(Also, since the authors have not published the source code yet, we would have to implement it by ourselves, sadly.)
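The parameter saving claimed above can be made concrete. Ignoring biases and the small number of coefficient parameters, a C3D layer uses a full \(3\times3\times3\) kernel (27 weights) per input/output channel pair, while CoST shares a single \(3\times3\) matrix (9 weights) across the three views:

```python
def c3d_params(c_in, c_out, k=3):
    """Weights in a C3D layer: one full k x k x k kernel per channel pair."""
    return c_in * c_out * k ** 3

def cost_params(c_in, c_out, k=3):
    """Weights in a CoST layer: one shared k x k matrix per channel pair."""
    return c_in * c_out * k ** 2

print(c3d_params(64, 64))   # 110592
print(cost_params(64, 64))  # 36864, i.e. a 3x reduction
```

With \(k=3\), CoST needs exactly one third of the C3D weights, which matches the intuition that the three reshaped views reuse a single matrix.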

6. Comments (e.g., improvements)

  1. Processing the T-W and T-H views separately may be redundant to some extent. (Though I have no good idea for dealing with it: if we down-sample and fuse them before processing, then T-W and T-H may no longer be able to share the same weights with H-W.)
  2. The core idea of the paper is somewhat like that of Inception-v3/v4: dividing a "big" operation into a chain of "small" operations while preserving similar capacity. It might be a good way for us to design new models, channels or operations.
7. References

  1. D. Tran, et al. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  2. J. Carreira, et al. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  3. D. Tran, et al. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.