Application of multimodal algorithms in video understanding

1. Overview

Most current video classification algorithms understand a video as a whole and assign video-level labels, so the granularity is relatively coarse. Relatively few works focus on fine-grained understanding of temporal segments, or analyze videos from a multimodal perspective. This article shares a solution that improves video understanding accuracy with a multimodal network and achieves significant gains on the YouTube-8M dataset.

2. Related Work

In video classification, NeXtVLAD [1] has proven to be an efficient and fast method. Inspired by ResNeXt, the authors decompose the high-dimensional video feature vector into a group of lower-dimensional vectors. The network significantly reduces the parameter count of the earlier NetVLAD while still achieving remarkable performance in feature aggregation and large-scale video classification.
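As a rough illustration of this decomposition idea, here is a minimal NumPy sketch of group-based VLAD aggregation in the spirit of NeXtVLAD. It omits the batch normalization and attention gating of the actual paper, and all sizes and parameter names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad(frames, W_expand, W_assign, centers, groups):
    """Simplified group-based VLAD aggregation (no BN, no gating).

    frames:   (M, N)       frame-level features
    W_expand: (N, lam*N)   expansion matrix (lam = expansion factor)
    W_assign: (lam*N, G*K) soft-assignment weights
    centers:  (K, D)       cluster centers, D = lam*N // groups
    """
    M = frames.shape[0]
    K, D = centers.shape
    x = frames @ W_expand                              # expand features
    a = softmax((x @ W_assign).reshape(M, groups, K))  # per-group soft cluster assignment
    xg = x.reshape(M, groups, D)                       # split expanded feature into groups
    # accumulate residuals against each cluster center, summed over frames and groups
    vlad = np.einsum('mgk,mgd->kd', a, xg) - a.sum(axis=(0, 1))[:, None] * centers
    return vlad.reshape(-1)                            # (K*D,) video-level descriptor

M, N, lam, G, K = 10, 16, 2, 4, 8
D = lam * N // G
feats = rng.normal(size=(M, N))
desc = nextvlad(feats, rng.normal(size=(N, lam * N)),
                rng.normal(size=(lam * N, G * K)), rng.normal(size=(K, D)), G)
print(desc.shape)  # (64,)
```

The parameter saving comes from the reshape into groups: the assignment and aggregation operate on D-dimensional chunks instead of the full expanded vector.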

RNNs [2] have been shown to model sequential data well, so researchers often use them to capture the temporal information in videos that CNNs struggle with. GRU [3] is an important RNN variant that mitigates the vanishing-gradient problem, and Attention-GRU [4] adds an attention mechanism that helps distinguish the impact of different features on the current prediction.

To combine the spatial and temporal features of video tasks, two-stream CNNs [5], 3D-CNN [6], SlowFast [7], and ViViT [8] were later proposed. Although these models achieve good performance on video understanding tasks, there is still room for improvement: many methods target only a single modality, or process the entire video without outputting fine-grained labels.

3. Technical solution

3.1 Overall network structure

This technical solution is designed to fully learn the semantic features of multimodal videos (text, audio, image), while overcoming two problems of the YouTube-8M dataset: extreme class imbalance and only partially annotated (semi-supervised) labels.

As shown in Figure 1, the entire network consists of a mixed multimodal network (mix-Multimodal Network) and a graph convolutional network (GCN [9]). The mix-Multimodal Network is made up of three differentiated multimodal classification networks; the differentiated parameters are listed in Table 1.

Figure 1. Overall network structure





|                    | Bert: Layers | NeXtVLAD: Cluster Size | NeXtVLAD: Reduction |
|--------------------|--------------|------------------------|---------------------|
| Multimodal Net (1) | 12           | 136                    | 16                  |
| Multimodal Net (2) | 12           | 112                    | 16                  |
| Multimodal Net (3) | 6            | 112                    | 8                   |

Table 1. Parameters of the three differentiated Multimodal Nets


3.2 Multimodal Networks

As shown in Figure 2, the multimodal network handles three modalities (text, video, and audio). Each modality goes through three stages: basic semantic understanding, temporal feature understanding, and modality fusion. The semantic understanding models for video and audio are EfficientNet [10] and VGGish, respectively, with NeXtVLAD as the temporal feature model for both; for text, Bert [11] handles the understanding.

For multimodal feature fusion we use SENet [12]. SENet's preprocessing forcibly compresses and aligns the feature lengths of each modality, which causes information loss. To overcome this, we use a multi-group SENet structure; experiments show that a multi-group SENet has stronger learning ability than a single SENet.
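As a sketch of this fusion step, the following NumPy snippet implements a basic squeeze-and-excitation gate over a fused feature vector and averages several independently parameterized gates. The sizes, the averaging choice, and all names are illustrative assumptions, not the exact network:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, W1, W2):
    """One SE-style gate: bottleneck excitation, then channel-wise re-weighting.
    x: (C,) concatenated modality features; W1: (C, C//r); W2: (C//r, C)."""
    s = sigmoid(relu(x @ W1) @ W2)   # per-channel attention weights in (0, 1)
    return x * s

def multi_group_se(x, params):
    """Average several independently parameterized SE gates; the idea is that
    multiple groups can retain information a single squeeze would discard."""
    return np.mean([se_gate(x, W1, W2) for W1, W2 in params], axis=0)

C, r, n_groups = 24, 4, 3    # illustrative sizes, not the paper's
fused = rng.normal(size=C)   # concatenated video + audio + text features
params = [(rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)))
          for _ in range(n_groups)]
out = multi_group_se(fused, params)
print(out.shape)  # (24,)
```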

Figure 2. Multimodal network structure


3.3 Graph Convolution

While the coarse-grained labels of YouTube-8M are fully annotated, the fine-grained labels cover only part of the data. GCN is therefore introduced to perform the semi-supervised classification task. Its basic idea is to update node representations by propagating information between nodes. For multi-label video classification, label dependency is important information.

In our task, each label is a node in the graph, and an edge between two nodes represents their relationship [13][14]. We can therefore train a matrix that represents the relationships between all nodes.
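A sketch of how such a matrix can be estimated from annotations, assuming (as in ML-GCN-style methods [9]) directed conditional co-occurrence probabilities binarized with a threshold λ; the function name and toy data are illustrative:

```python
import numpy as np

def label_correlation(labels, lam=0.4):
    """Estimate a directed label-correlation matrix from multi-hot annotations.

    labels: (num_videos, num_labels) binary matrix.
    A[i, j] = P(label j | label i) = co-occurrence(i, j) / count(i),
    then binarized with threshold lam.
    """
    counts = labels.sum(axis=0)          # occurrences of each label
    cooc = labels.T @ labels             # pairwise co-occurrence counts
    with np.errstate(divide='ignore', invalid='ignore'):
        A = np.where(counts[:, None] > 0, cooc / counts[:, None], 0.0)
    return (A >= lam).astype(float)

# toy example: BMW almost always implies Car, but not the reverse
#                   BMW Car
videos = np.array([[1, 1],
                   [1, 1],
                   [0, 1],
                   [0, 1],
                   [0, 1],
                   [0, 1]])
A = label_correlation(videos, lam=0.4)
print(A)  # A[0, 1] = 1 (BMW -> Car edge), A[1, 0] = 0 (no Car -> BMW edge)
```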

Take Figure 3, a simplified label-correlation graph extracted from our dataset, as an example. The edge BMW → Car means that when the BMW label appears, the Car label is likely to appear as well, but not necessarily vice versa. The Car label is highly correlated with all the other labels, while pairs of labels without an arrow between them have no relationship.

Figure 3. Schematic diagram of label correlation

The GCN implementation is shown in Figure 4. The GCN module consists of two stacked GCN layers (GCN(1) and GCN(2)), which learn over the label correlation graph to map the label representations into a set of inter-dependent classifiers. A is the input correlation matrix, initialized from the conditional-probability values described above; W1 and W2 are the weight matrices trained in the network; and the output W is the classifier weight matrix learned by the GCN.

Figure 4. GCN network structure
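A minimal NumPy sketch of such a two-layer GCN, assuming the standard formulation of [9] (normalized adjacency with self-loops, ReLU between layers); all dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of the correlation matrix."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_classifiers(A, E, W1, W2):
    """Two stacked GCN layers map label embeddings E to classifier weights.
    A: (C, C) label correlation matrix; E: (C, d) label embeddings."""
    A_hat = normalize_adj(A + np.eye(A.shape[0]))  # add self-loops, normalize
    H1 = np.maximum(A_hat @ E @ W1, 0.0)           # GCN(1) with ReLU
    W_cls = A_hat @ H1 @ W2                        # GCN(2): (C, D) classifiers
    return W_cls

C, d, h, D = 5, 8, 16, 32    # labels, embed dim, hidden dim, video feature dim
A = (rng.random((C, C)) > 0.5).astype(float)
E = rng.normal(size=(C, d))
W_cls = gcn_classifiers(A, E, rng.normal(size=(d, h)), rng.normal(size=(h, D)))

video_feat = rng.normal(size=D)
scores = W_cls @ video_feat  # per-label logits for one video
print(scores.shape)  # (5,)
```

Because every classifier row is produced by propagation over the label graph, correlated labels end up with correlated classifiers.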


3.4 Label Reweighting

The YouTube-8M video classification task is multi-label, but the annotation data marks only one of a video's possibly many applicable labels as 1 and leaves the rest at 0. In other words, a video clip may deserve additional labels that are nevertheless annotated as 0. This is also a weak-supervision problem.

To address this, we give larger weights to annotated classes and smaller weights to unannotated classes when computing the loss [15]. This weighted cross-entropy helps the model learn better from the incomplete dataset.
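A hedged sketch of such a weighted cross-entropy. We assume here that m weights the annotated (positive) classes and n the unannotated zeros, matching the n/m hyperparameters reported in Section 4.2.4; the exact weighting scheme is not spelled out in the text:

```python
import numpy as np

def reweighted_bce(logits, labels, n=0.1, m=2.5):
    """Label-reweighted binary cross entropy (sketch).

    logits, labels: (num_classes,) arrays; labels are 0/1.
    Assumption: annotated positives get weight m, unannotated zeros weight n,
    so the model trusts the confirmed labels more than the missing ones.
    """
    p = 1.0 / (1.0 + np.exp(-logits))    # sigmoid probabilities
    eps = 1e-12
    per_class = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    weights = np.where(labels > 0, m, n)
    return float((weights * per_class).mean())

labels = np.array([1.0, 0.0, 0.0, 0.0])  # only one class annotated as positive
logits = np.array([2.0, -1.0, 0.5, -3.0])
loss = reweighted_bce(logits, labels)
print(loss > 0)  # True
```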


3.5 Feature Enhancement

To avoid overfitting during training, we inject randomly generated Gaussian noise into the elements of the input feature vector.

As shown in Figure 6, noise is added to the input feature vector through a mask vector that randomly selects 50% of the dimensions and sets them to 1. The Gaussian noise is drawn independently, but from the same distribution, for different input vectors.

Figure 6. Adding Gaussian noise
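A minimal sketch of this augmentation; the 50% mask ratio follows the description above, while the noise standard deviation is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(3)

def add_masked_gaussian_noise(x, noise_std=0.1, mask_ratio=0.5):
    """Inject i.i.d. Gaussian noise into a random subset of feature dimensions.
    mask entry 1 = perturb that dimension; 0 = leave it untouched."""
    mask = (rng.random(x.shape) < mask_ratio).astype(x.dtype)
    noise = rng.normal(scale=noise_std, size=x.shape)
    return x + mask * noise

feat = rng.normal(size=1024)      # e.g. a frame-level feature vector
noisy = add_masked_gaussian_noise(feat)
print(noisy.shape)  # (1024,)
```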

At the same time, to prevent the multimodal model from learning the features of only one modality, i.e., overfitting on that modality, we also mask modality features while ensuring that at least one modality remains in the input, as shown in Figure 7. In this way, every modality can be fully learned.

Figure 7. Modal Mask
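A sketch of this modality mask, assuming a per-modality drop probability (the value is illustrative) and a guard that always keeps at least one modality:

```python
import numpy as np

rng = np.random.default_rng(4)

def mask_modalities(modal_feats, drop_prob=0.3):
    """Randomly zero out whole modality feature vectors, but never all of them,
    so the fused input always contains at least one modality."""
    keep = rng.random(len(modal_feats)) >= drop_prob
    if not keep.any():                              # guard: keep one at random
        keep[rng.integers(len(modal_feats))] = True
    return [f if k else np.zeros_like(f) for f, k in zip(modal_feats, keep)]

video, audio, text = (rng.normal(size=128) for _ in range(3))
masked = mask_modalities([video, audio, text])
kept = sum(np.any(f != 0) for f in masked)
print(kept >= 1)  # True: at least one modality survives
```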

 

4. Experiment

4.1 Evaluation Metrics

All experiments below are evaluated with MAP@K (mean average precision at K), the metric reported in the result tables.

4.2 Experimental Results

4.2.1 Multimodality

To verify the benefit of each modality, we conducted an ablation experiment; the results are shown in Table 2. With a single modality, Video gives the highest accuracy, Audio the lowest, and Text comes close to Video. With two modalities, Video + Text brings a significant improvement, while additionally adding Audio yields only a limited gain.


| Video | Audio | Text | MAP@K |
|-------|-------|------|-------|
|   ✓   |       |      | 69.2  |
|       |   ✓   |      | 38.1  |
|       |       |  ✓   | 65.8  |
|   ✓   |   ✓   |      | 71.3  |
|   ✓   |       |  ✓   | 73.9  |
|       |   ✓   |  ✓   | 70.5  |
|   ✓   |   ✓   |  ✓   | 74.6  |

Table 2. Multimodal ablation experiments

4.2.2 Graph Convolution

To verify the benefit of GCN, we conducted comparative experiments with two correlation thresholds λ, 0.2 and 0.4. As shown in Table 3, compared with the original model (org), the classifiers generated by the GCN improve performance, especially when λ = 0.4.


| Model           | MAP@K |
|-----------------|-------|
| org             | 74.0  |
| + GCN (λ = 0.2) | 74.7  |
| + GCN (λ = 0.4) | 74.9  |

Table 3. Graph convolution experiments


4.2.3 Differentiated Multimodal Networks

To verify the effect of parallel multimodal networks and of differentiation, we designed five groups of experiments: the first uses a single multimodal network; the second, third, and fourth use 2, 3, and 4 identical parallel multimodal networks; and the fifth uses 3 differentiated parallel multimodal networks.

The results show that parallel networks improve accuracy, but the gain diminishes once 4 networks are connected in parallel, so blindly increasing the number of parallel networks brings no benefit. The results also show that differentiated network structures fit the data more effectively.


| Model                      | MAP@K |
|----------------------------|-------|
| One Multimodal Net         | 78.2  |
| Two Multimodal Nets        | 78.6  |
| Three Multimodal Nets      | 78.9  |
| Four Multimodal Nets       | 78.7  |
| Three diff Multimodal Nets | 79.2  |

Table 4. Differentiated multimodal network experiments



4.2.4 Label Reweighting

Label reweighting has two hyperparameters (n and m). Experiments show that accuracy is best with n = 0.1 and m = 2.5.


| Model                     | MAP@K |
|---------------------------|-------|
| org                       | 77.8  |
| + ReWeight (n=0.1, m=2.0) | 78.2  |
| + ReWeight (n=0.1, m=2.5) | 78.3  |
| + ReWeight (n=0.1, m=3.0) | 78.1  |

Table 5. Label reweighting experiment


4.2.5 Feature Enhancement

Feature enhancement is a form of data augmentation. Experiments show that adding Gaussian noise and masking some modalities improves the model's generalization ability. The Gaussian-noise method in particular is simple, transfers well, and is easy to apply in other networks.


| Model                         | MAP@K |
|-------------------------------|-------|
| org                           | 81.2  |
| + Gaussian noise              | 81.7  |
| + Gaussian noise + modal mask | 82.1  |

Table 6. Feature enhancement experiments

5. Summary

Experiments show that all of the above methods bring improvements to varying degrees, especially the multimodal fusion and the graph convolution.

We hope to explore more label dependencies in the future. Since GCN networks have been shown to be useful in this task, we think it is worthwhile to run more experiments combining GCNs with other state-of-the-art video classification networks.

References

[1]. Rongcheng Lin, Jing Xiao, Jianping Fan: NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification. In: ECCV Workshops (2018)

[2]. Jeffrey L. Elman: Finding Structure in Time. Cognitive Science, 14(2):179–211 (1990)

[3]. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio: Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv (2014)

[4]. Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. In: NIPS, pp. 577–585 (2015)

[5]. Karen Simonyan, Andrew Zisserman: Two-Stream Convolutional Networks for Action Recognition in Videos. In: NIPS (2014)

[6]. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri: Learning Spatiotemporal Features with 3D Convolutional Networks. In: ICCV (2015)

[7]. Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He: SlowFast Networks for Video Recognition. In: ICCV (2019)

[8]. Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid: ViViT: A Video Vision Transformer. In: ICCV (2021)

[9]. Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo: Multi-Label Image Recognition with Graph Convolutional Networks. In: CVPR (2019)

[10]. Mingxing Tan, Quoc V. Le: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: ICML, PMLR 97:6105–6114 (2019)

[11]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL (2019)

[12]. Jie Hu, Li Shen, Gang Sun: Squeeze-and-Excitation Networks. In: CVPR (2018)

[13]. Zhilu Zhang, Mert Sabuncu: Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In: NeurIPS, pp. 8778–8788 (2018)

[14]. Rafael B. Pereira, Alexandre Plastino, Bianca Zadrozny, et al.: Correlation Analysis of Performance Measures for Multi-Label Classification. Information Processing & Management, 54(3):359–369 (2018)

[15]. Sankaran Panchapagesan, Ming Sun, Aparna Khare, et al.: Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting. In: Interspeech, pp. 760–764 (2016)
