Application of multimodal algorithms in video understanding

1. Overview

Most current video classification algorithms understand a video as a whole and assign video-level labels, so the granularity is relatively coarse. Relatively few works focus on fine-grained understanding of temporal segments, or analyze videos from a multimodal perspective. This article shares a solution that improves video understanding accuracy with a multimodal network and achieves significant gains on the YouTube-8M dataset.

2. Related Work

In video classification, NeXtVLAD [1] has proven to be an efficient and fast method. Inspired by ResNeXt, the authors decompose the high-dimensional video feature vector into a group of lower-dimensional vectors. The network significantly reduces the parameter count of the earlier NetVLAD while still achieving remarkable performance in feature aggregation and large-scale video classification.
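As a rough illustration of this decomposition idea, here is a minimal NumPy sketch of group-based VLAD aggregation in the spirit of NeXtVLAD. It omits the batch normalization and attention gating of the actual paper, and all sizes and parameter names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad(frames, W_expand, W_assign, centers, groups):
    """Simplified group-based VLAD aggregation (no BN, no gating).

    frames:   (M, N)       frame-level features
    W_expand: (N, lam*N)   expansion matrix (lam = expansion factor)
    W_assign: (lam*N, G*K) soft-assignment weights
    centers:  (K, D)       cluster centers, D = lam*N // groups
    """
    M = frames.shape[0]
    K, D = centers.shape
    x = frames @ W_expand                              # expand features
    a = softmax((x @ W_assign).reshape(M, groups, K))  # per-group soft cluster assignment
    xg = x.reshape(M, groups, D)                       # split expanded feature into groups
    # accumulate residuals against each cluster center, summed over frames and groups
    vlad = np.einsum('mgk,mgd->kd', a, xg) - a.sum(axis=(0, 1))[:, None] * centers
    return vlad.reshape(-1)                            # (K*D,) video-level descriptor

M, N, lam, G, K = 10, 16, 2, 4, 8
D = lam * N // G
feats = rng.normal(size=(M, N))
desc = nextvlad(feats, rng.normal(size=(N, lam * N)),
                rng.normal(size=(lam * N, G * K)), rng.normal(size=(K, D)), G)
print(desc.shape)  # (64,)
```

The parameter saving comes from the reshape into groups: the assignment and aggregation operate on D-dimensional chunks instead of the full expanded vector.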

RNNs [2] have been shown to model sequential data well, so researchers often use them to capture the temporal information in videos that CNNs struggle with. GRU [3] is an important RNN variant that mitigates the vanishing-gradient problem, and Attention-GRU [4] adds an attention mechanism that helps distinguish the impact of different features on the current prediction.

To combine the spatial and temporal features of video tasks, two-stream CNNs [5], 3D-CNN [6], SlowFast [7], and ViViT [8] were later proposed. Although these models achieve good performance on video understanding tasks, there is still room for improvement: many methods target only a single modality, or process the entire video without outputting fine-grained labels.

3. Technical solution

3.1 Overall network structure

This technical solution is designed to fully learn the semantic features of multimodal videos (text, audio, image), while overcoming two problems of the YouTube-8M dataset: extreme class imbalance and only partially annotated (semi-supervised) labels.

As shown in Figure 1, the entire network consists of a mixed multimodal network (mix-Multimodal Network) and a graph convolutional network (GCN [9]). The mix-Multimodal Network is made up of three differentiated multimodal classification networks; the differentiated parameters are listed in Table 1.

Figure 1. Overall network structure





|                    | Bert: Layers | NeXtVLAD: Cluster Size | NeXtVLAD: Reduction |
|--------------------|--------------|------------------------|---------------------|
| Multimodal Net (1) | 12           | 136                    | 16                  |
| Multimodal Net (2) | 12           | 112                    | 16                  |
| Multimodal Net (3) | 6            | 112                    | 8                   |

Table 1. Parameters of the three differentiated Multimodal Nets


3.2 Multimodal Networks

As shown in Figure 2, the multimodal network handles three modalities (text, video, and audio). Each modality goes through three stages: basic semantic understanding, temporal feature understanding, and modality fusion. The semantic understanding models for video and audio are EfficientNet [10] and VGGish, respectively, with NeXtVLAD as the temporal feature model for both; for text, Bert [11] handles the understanding.

For multimodal feature fusion we use SENet [12]. SENet's preprocessing forcibly compresses and aligns the feature lengths of each modality, which causes information loss. To overcome this, we use a multi-group SENet structure; experiments show that a multi-group SENet has stronger learning ability than a single SENet.
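As a sketch of this fusion step, the following NumPy snippet implements a basic squeeze-and-excitation gate over a fused feature vector and averages several independently parameterized gates. The sizes, the averaging choice, and all names are illustrative assumptions, not the exact network:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, W1, W2):
    """One SE-style gate: bottleneck excitation, then channel-wise re-weighting.
    x: (C,) concatenated modality features; W1: (C, C//r); W2: (C//r, C)."""
    s = sigmoid(relu(x @ W1) @ W2)   # per-channel attention weights in (0, 1)
    return x * s

def multi_group_se(x, params):
    """Average several independently parameterized SE gates; the idea is that
    multiple groups can retain information a single squeeze would discard."""
    return np.mean([se_gate(x, W1, W2) for W1, W2 in params], axis=0)

C, r, n_groups = 24, 4, 3    # illustrative sizes, not the paper's
fused = rng.normal(size=C)   # concatenated video + audio + text features
params = [(rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)))
          for _ in range(n_groups)]
out = multi_group_se(fused, params)
print(out.shape)  # (24,)
```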

Figure 2. Multimodal network structure


3.3 Graph Convolution

While the coarse-grained labels of YouTube-8M are fully annotated, the fine-grained labels cover only part of the data. GCN is therefore introduced to perform the semi-supervised classification task. Its basic idea is to update node representations by propagating information between nodes. For multi-label video classification, label dependency is important information.

In our task, each label is a node in the graph, and an edge between two nodes represents their relationship [13][14]. We can therefore train a matrix that represents the relationships between all nodes.
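A sketch of how such a matrix can be estimated from annotations, assuming (as in ML-GCN-style methods [9]) directed conditional co-occurrence probabilities binarized with a threshold λ; the function name and toy data are illustrative:

```python
import numpy as np

def label_correlation(labels, lam=0.4):
    """Estimate a directed label-correlation matrix from multi-hot annotations.

    labels: (num_videos, num_labels) binary matrix.
    A[i, j] = P(label j | label i) = co-occurrence(i, j) / count(i),
    then binarized with threshold lam.
    """
    counts = labels.sum(axis=0)          # occurrences of each label
    cooc = labels.T @ labels             # pairwise co-occurrence counts
    with np.errstate(divide='ignore', invalid='ignore'):
        A = np.where(counts[:, None] > 0, cooc / counts[:, None], 0.0)
    return (A >= lam).astype(float)

# toy example: BMW almost always implies Car, but not the reverse
#                   BMW Car
videos = np.array([[1, 1],
                   [1, 1],
                   [0, 1],
                   [0, 1],
                   [0, 1],
                   [0, 1]])
A = label_correlation(videos, lam=0.4)
print(A)  # A[0, 1] = 1 (BMW -> Car edge), A[1, 0] = 0 (no Car -> BMW edge)
```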

Take Figure 3, a simplified label-correlation graph extracted from our dataset, as an example. The edge BMW → Car means that when the BMW label appears, the Car label is likely to appear as well, but not necessarily vice versa. The Car label is highly correlated with all the other labels, while pairs of labels without an arrow between them have no relationship.

Figure 3. Schematic diagram of label correlation

The GCN implementation is shown in Figure 4. The GCN module consists of two stacked GCN layers (GCN(1) and GCN(2)), which learn over the label correlation graph to map the label representations into a set of inter-dependent classifiers. A is the input correlation matrix, initialized from the conditional-probability values described above; W1 and W2 are the weight matrices trained in the network; and the output W is the classifier weight matrix learned by the GCN.

Figure 4. GCN network structure
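A minimal NumPy sketch of such a two-layer GCN, assuming the standard formulation of [9] (normalized adjacency with self-loops, ReLU between layers); all dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of the correlation matrix."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_classifiers(A, E, W1, W2):
    """Two stacked GCN layers map label embeddings E to classifier weights.
    A: (C, C) label correlation matrix; E: (C, d) label embeddings."""
    A_hat = normalize_adj(A + np.eye(A.shape[0]))  # add self-loops, normalize
    H1 = np.maximum(A_hat @ E @ W1, 0.0)           # GCN(1) with ReLU
    W_cls = A_hat @ H1 @ W2                        # GCN(2): (C, D) classifiers
    return W_cls

C, d, h, D = 5, 8, 16, 32    # labels, embed dim, hidden dim, video feature dim
A = (rng.random((C, C)) > 0.5).astype(float)
E = rng.normal(size=(C, d))
W_cls = gcn_classifiers(A, E, rng.normal(size=(d, h)), rng.normal(size=(h, D)))

video_feat = rng.normal(size=D)
scores = W_cls @ video_feat  # per-label logits for one video
print(scores.shape)  # (5,)
```

Because every classifier row is produced by propagation over the label graph, correlated labels end up with correlated classifiers.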


3.4 Label Reweighting

The YouTube-8M video classification task is multi-label, but the annotation data marks only one of a video's possibly many applicable labels as 1 and leaves the rest at 0. In other words, a video clip may deserve additional labels that are nevertheless annotated as 0. This is also a weak-supervision problem.

To address this, we give larger weights to annotated classes and smaller weights to unannotated classes when computing the loss [15]. This weighted cross-entropy helps the model learn better from the incomplete dataset.
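A hedged sketch of such a weighted cross-entropy. We assume here that m weights the annotated (positive) classes and n the unannotated zeros, matching the n/m hyperparameters reported in Section 4.2.4; the exact weighting scheme is not spelled out in the text:

```python
import numpy as np

def reweighted_bce(logits, labels, n=0.1, m=2.5):
    """Label-reweighted binary cross entropy (sketch).

    logits, labels: (num_classes,) arrays; labels are 0/1.
    Assumption: annotated positives get weight m, unannotated zeros weight n,
    so the model trusts the confirmed labels more than the missing ones.
    """
    p = 1.0 / (1.0 + np.exp(-logits))    # sigmoid probabilities
    eps = 1e-12
    per_class = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    weights = np.where(labels > 0, m, n)
    return float((weights * per_class).mean())

labels = np.array([1.0, 0.0, 0.0, 0.0])  # only one class annotated as positive
logits = np.array([2.0, -1.0, 0.5, -3.0])
loss = reweighted_bce(logits, labels)
print(loss > 0)  # True
```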


3.5 Feature Enhancement

To avoid overfitting during training, we inject randomly generated Gaussian noise into the elements of the input feature vector.

As shown in Figure 6, noise is added to the input feature vector through a mask vector that randomly selects 50% of the dimensions and sets them to 1. The Gaussian noise is drawn independently, but from the same distribution, for different input vectors.

Figure 6. Adding Gaussian noise
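A minimal sketch of this augmentation; the 50% mask ratio follows the description above, while the noise standard deviation is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(3)

def add_masked_gaussian_noise(x, noise_std=0.1, mask_ratio=0.5):
    """Inject i.i.d. Gaussian noise into a random subset of feature dimensions.
    mask entry 1 = perturb that dimension; 0 = leave it untouched."""
    mask = (rng.random(x.shape) < mask_ratio).astype(x.dtype)
    noise = rng.normal(scale=noise_std, size=x.shape)
    return x + mask * noise

feat = rng.normal(size=1024)      # e.g. a frame-level feature vector
noisy = add_masked_gaussian_noise(feat)
print(noisy.shape)  # (1024,)
```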

At the same time, to prevent the multimodal model from learning the features of only one modality, i.e., overfitting on that modality, we also mask modality features while ensuring that at least one modality remains in the input, as shown in Figure 7. In this way, every modality can be fully learned.

Figure 7. Modal Mask
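A sketch of this modality mask, assuming a per-modality drop probability (the value is illustrative) and a guard that always keeps at least one modality:

```python
import numpy as np

rng = np.random.default_rng(4)

def mask_modalities(modal_feats, drop_prob=0.3):
    """Randomly zero out whole modality feature vectors, but never all of them,
    so the fused input always contains at least one modality."""
    keep = rng.random(len(modal_feats)) >= drop_prob
    if not keep.any():                              # guard: keep one at random
        keep[rng.integers(len(modal_feats))] = True
    return [f if k else np.zeros_like(f) for f, k in zip(modal_feats, keep)]

video, audio, text = (rng.normal(size=128) for _ in range(3))
masked = mask_modalities([video, audio, text])
kept = sum(np.any(f != 0) for f in masked)
print(kept >= 1)  # True: at least one modality survives
```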

 

4. Experiment

4.1 Evaluation Metrics

All experiments below are evaluated with MAP@K (mean average precision at K), the metric reported in the result tables.

4.2 Experimental Results

4.2.1 Multimodality

To verify the benefit of each modality, we conducted an ablation experiment; the results are shown in Table 2. With a single modality, Video gives the highest accuracy, Audio the lowest, and Text comes close to Video. With two modalities, Video + Text brings a significant improvement, while additionally adding Audio yields only a limited gain.


| Video | Audio | Text | MAP@K |
|-------|-------|------|-------|
|   ✓   |       |      | 69.2  |
|       |   ✓   |      | 38.1  |
|       |       |  ✓   | 65.8  |
|   ✓   |   ✓   |      | 71.3  |
|   ✓   |       |  ✓   | 73.9  |
|       |   ✓   |  ✓   | 70.5  |
|   ✓   |   ✓   |  ✓   | 74.6  |

Table 2. Multimodal ablation experiments

4.2.2 Graph Convolution

To verify the benefit of GCN, we conducted comparative experiments with two correlation thresholds λ, 0.2 and 0.4. As shown in Table 3, compared with the original model (org), the classifiers generated by the GCN improve performance, especially when λ = 0.4.


| Model           | MAP@K |
|-----------------|-------|
| org             | 74.0  |
| + GCN (λ = 0.2) | 74.7  |
| + GCN (λ = 0.4) | 74.9  |

Table 3. Graph convolution experiments


4.2.3 Differentiated Multimodal Networks

To verify the effect of parallel multimodal networks and of differentiation, we designed five groups of experiments: the first uses a single multimodal network; the second, third, and fourth use 2, 3, and 4 identical parallel multimodal networks; and the fifth uses 3 differentiated parallel multimodal networks.

The results show that parallel networks improve accuracy, but the gain diminishes once 4 networks are connected in parallel, so blindly increasing the number of parallel networks brings no benefit. The results also show that differentiated network structures fit the data more effectively.


| Model                      | MAP@K |
|----------------------------|-------|
| One Multimodal Net         | 78.2  |
| Two Multimodal Nets        | 78.6  |
| Three Multimodal Nets      | 78.9  |
| Four Multimodal Nets       | 78.7  |
| Three diff Multimodal Nets | 79.2  |

Table 4. Differentiated multimodal network experiments



4.2.4 Label Reweighting

Label reweighting has two hyperparameters (n and m). Experiments show that accuracy is best with n = 0.1 and m = 2.5.


| Model                     | MAP@K |
|---------------------------|-------|
| org                       | 77.8  |
| + ReWeight (n=0.1, m=2.0) | 78.2  |
| + ReWeight (n=0.1, m=2.5) | 78.3  |
| + ReWeight (n=0.1, m=3.0) | 78.1  |

Table 5. Label reweighting experiment


4.2.5 Feature Enhancement

Feature enhancement is a form of data augmentation. Experiments show that adding Gaussian noise and masking some modalities improves the model's generalization ability. The Gaussian-noise method in particular is simple, transfers well, and is easy to apply in other networks.


| Model                         | MAP@K |
|-------------------------------|-------|
| org                           | 81.2  |
| + Gaussian noise              | 81.7  |
| + Gaussian noise + modal mask | 82.1  |

Table 6. Feature enhancement experiments

5. Summary

Experiments show that all of the above methods bring improvements to varying degrees, especially the multimodal fusion and the graph convolution.

We hope to explore more label dependencies in the future. Since GCN networks have been shown to be useful in this task, we think it is worthwhile to run more experiments combining GCNs with other state-of-the-art video classification networks.

References

[1]. Rongcheng Lin, Jing Xiao, Jianping Fan: NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification. In: ECCV Workshops (2018)

[2]. Jeffrey L. Elman: Finding Structure in Time. Cognitive Science, 14(2):179–211 (1990)

[3]. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio: Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv (2014)

[4]. Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. In: NIPS, pp. 577–585 (2015)

[5]. Karen Simonyan, Andrew Zisserman: Two-Stream Convolutional Networks for Action Recognition in Videos. In: NIPS (2014)

[6]. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri: Learning Spatiotemporal Features with 3D Convolutional Networks. In: ICCV (2015)

[7]. Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He: SlowFast Networks for Video Recognition. In: ICCV (2019)

[8]. Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid: ViViT: A Video Vision Transformer. In: ICCV (2021)

[9]. Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo: Multi-Label Image Recognition with Graph Convolutional Networks. In: CVPR (2019)

[10]. Mingxing Tan, Quoc V. Le: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: ICML, PMLR 97:6105–6114 (2019)

[11]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL (2019)

[12]. Jie Hu, Li Shen, Gang Sun: Squeeze-and-Excitation Networks. In: CVPR (2018)

[13]. Zhilu Zhang, Mert Sabuncu: Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In: NeurIPS, pp. 8778–8788 (2018)

[14]. Rafael B. Pereira, Alexandre Plastino, Bianca Zadrozny, et al.: Correlation Analysis of Performance Measures for Multi-Label Classification. Information Processing & Management, 54(3):359–369 (2018)

[15]. Sankaran Panchapagesan, Ming Sun, Aparna Khare, et al.: Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting. In: Interspeech, pp. 760–764 (2016)
