1. Overview

At present, video classification algorithms mainly focus on understanding a video as a whole and assigning video-level labels, so the granularity is relatively coarse. Far fewer works focus on fine-grained understanding of temporal segments or analyze videos from a multimodal perspective. This article shares a solution that uses a multimodal network to improve the accuracy of video understanding and achieves significant improvements on the YouTube-8M dataset.

2. Related Work

In video classification, NeXtVLAD [1] has been shown to be an efficient and fast method. Inspired by ResNeXt, its authors decompose the high-dimensional video feature vector into a set of low-dimensional vectors. The network significantly reduces the parameter count of the earlier NetVLAD network while still achieving remarkable performance in feature aggregation and large-scale video classification. RNNs [2] have been shown to perform well in modeling sequential data, and researchers often use them to model temporal information in videos that is difficult for CNNs to capture. The GRU [3] is an important RNN component that helps avoid the vanishing gradient problem, and Attention-GRU [4] adds an attention mechanism that helps distinguish the impact of different features on the current prediction. To combine the spatial and temporal characteristics of video tasks, two-stream CNNs [5], 3D-CNNs [6], SlowFast [7], and ViViT [8] were later proposed. Although these models achieve good performance on video understanding tasks, there is still room for improvement: many methods target only a single modality, or process the entire video without outputting fine-grained labels.

3. Technical Solution

3.1 Overall Network Structure

This solution is designed to fully learn the semantic features of multimodal videos (text, audio, image) while overcoming the extreme class imbalance and the semi-supervised nature of the YouTube-8M dataset. As shown in Figure 1, the overall network consists of a mixed multimodal network (Mix-Multimodal Network) and a graph convolutional network (GCN [9]). The Mix-Multimodal Network consists of three differentiated multimodal classification networks; their specific differentiated parameters are listed in Table 1.

Figure 1. Overall network structure

Table 1. Parameters of three differentiated Multimodal Nets
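The article does not spell out how the outputs of the three differentiated branches are combined, so the following is only a rough sketch: three differently parameterized branches run in parallel and their predictions are averaged. The class names, the averaging step, and all dimensions are assumptions for illustration, not details taken from Figure 1 or Table 1.

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    """Placeholder for one multimodal classification branch (see Section 3.2).
    `hidden_dim` stands in for the differentiated parameters of Table 1."""
    def __init__(self, feat_dim, num_classes, hidden_dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.backbone(x)

class MixMultimodalNet(nn.Module):
    """Three differentiated branches; averaging their predictions is an
    assumption -- the article only states that the branches are mixed."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.branches = nn.ModuleList([
            MultimodalNet(feat_dim, num_classes, hidden_dim=h)
            for h in (1024, 1536, 2048)   # illustrative values only
        ])

    def forward(self, x):
        logits = torch.stack([b(x) for b in self.branches], dim=0)
        return logits.mean(dim=0)
```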
3.2 Multimodal Networks

As shown in Figure 2, the multimodal network handles three modalities (text, video, and audio). Each modality goes through three stages: basic semantic understanding, temporal feature understanding, and modality fusion. The semantic understanding models for video and audio are EfficientNet [10] and VGGish, respectively, and their temporal feature understanding model is NeXtVLAD; for text, the temporal feature understanding model is BERT [11]. For multimodal feature fusion we use SENet [12]. The pre-processing required by SENet forcibly compresses and aligns the feature lengths of each modality, which leads to information loss. To overcome this problem we use a multi-group SENet structure (a minimal sketch is given after Section 3.3). Experiments show that the multi-group SENet has stronger learning ability than a single SENet.

Figure 2. Multimodal network structure

3.3 Graph Convolution

While the coarse-grained labels of YouTube-8M are fully annotated, the fine-grained labels cover only part of the data, so a GCN is introduced to perform the semi-supervised classification task. The basic idea is to update node representations by propagating information between nodes. For multi-label video classification, label dependency is an important source of information: each label is a node in the graph, and an edge between two nodes represents their relationship [13][14], so we can train a matrix that represents the relationships between all labels. Take Figure 3, a simplified label correlation graph extracted from our dataset, as an example. The edge Label BMW --> Label Car means that when the BMW label appears, the Car label is likely to appear as well, but not necessarily vice versa. The Car label is highly correlated with all the other labels, while node pairs without an arrow have no relationship with each other.

Figure 3. Schematic diagram of label correlation

The GCN implementation is shown in Figure 4. The GCN module consists of two stacked layers, GCN(1) and GCN(2), which learn over a label correlation graph to map the label representations into a set of interdependent classifiers. A is the input correlation matrix, initialized from the label co-occurrence statistics; W(1) and W(2) are the weight matrices trained in the network; W is the classifier weight learned by the GCN.

Figure 4. GCN network structure
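Returning to the fusion step of Section 3.2, below is a minimal sketch of the multi-group SENet idea: instead of squeezing all modality features through a single SE block, the concatenated features are processed by several SE groups in parallel and their outputs are concatenated. The group count, dimensions, and the way groups are formed are illustrative assumptions; the article does not spell out these details.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation re-weighting over a feature vector [12]."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, dim)
        return x * self.excite(x)         # channel-wise re-weighting

class MultiGroupSEFusion(nn.Module):
    """Fuse concatenated text/video/audio features with several SE groups.
    Each group has its own projection and SE block, so no single forced
    compression of all modalities is needed; outputs are concatenated.
    num_groups and group_dim are illustrative assumptions."""
    def __init__(self, text_dim, video_dim, audio_dim, num_groups=4, group_dim=256):
        super().__init__()
        in_dim = text_dim + video_dim + audio_dim
        self.groups = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, group_dim), nn.ReLU(), SEBlock(group_dim))
            for _ in range(num_groups)
        ])

    def forward(self, text_feat, video_feat, audio_feat):
        x = torch.cat([text_feat, video_feat, audio_feat], dim=-1)
        return torch.cat([g(x) for g in self.groups], dim=-1)
```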
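For the graph convolution of Section 3.3, the sketch below follows the ML-GCN formulation of the cited work [9]: two stacked GCN layers propagate label embeddings over a correlation matrix built from label co-occurrence statistics and binarized with a threshold λ (the λ that appears in Section 4.2.2), and the output is used as classifier weights applied to the video feature. The label-embedding source, dimensions, and the dot-product classifier are the standard choices of [9] rather than details confirmed by this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_correlation_matrix(cooccur, label_count, lam=0.4):
    """A_ij ~ P(label_j | label_i), binarized with threshold lam (ML-GCN style),
    then row-normalized with self-loops. cooccur: (C, C), label_count: (C,)."""
    prob = cooccur / label_count.clamp(min=1).unsqueeze(1)
    adj = (prob >= lam).float() + torch.eye(cooccur.size(0))
    return adj / adj.sum(dim=1, keepdim=True)

class GCNClassifier(nn.Module):
    """Two stacked GCN layers (GCN(1), GCN(2)) that map label embeddings
    into interdependent classifier weights W, applied to the video feature."""
    def __init__(self, label_emb, adj, feat_dim, hidden_dim=512):
        super().__init__()
        self.register_buffer("label_emb", label_emb)   # (C, emb_dim)
        self.register_buffer("adj", adj)               # (C, C) correlation matrix A
        self.w1 = nn.Linear(label_emb.size(1), hidden_dim, bias=False)  # W(1)
        self.w2 = nn.Linear(hidden_dim, feat_dim, bias=False)           # W(2)

    def forward(self, video_feat):                     # video_feat: (B, feat_dim)
        h = F.leaky_relu(self.adj @ self.w1(self.label_emb))   # GCN layer 1
        w = self.adj @ self.w2(h)                       # GCN layer 2 -> (C, feat_dim)
        return video_feat @ w.t()                       # (B, C) logits
```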
3.4 Label Reweighting

The YouTube-8M video classification task is a multi-label task, but the annotation data marks only one of the applicable labels as 1 and leaves the rest at 0. In other words, a video segment may in fact carry other labels that are nevertheless recorded as 0. This is also a weak-supervision problem. To address it, we give larger weights to annotated classes and smaller weights to unannotated classes when computing the loss [15]. This weighted cross-entropy approach helps the model learn better from an incompletely labeled dataset (a minimal sketch is given after Section 3.5).

3.5 Feature Enhancement

To avoid overfitting during training, we inject randomly generated Gaussian noise into the input feature vector. As shown in Figure 6, a mask vector randomly selects 50% of the dimensions and sets them to 1, and the Gaussian noise is added to the input feature vector through this mask. The noise is drawn independently, but from the same distribution, for different input vectors.

Figure 6. Adding Gaussian noise

At the same time, to prevent the multimodal model from learning the features of only one modality, i.e., overfitting on that modality, we also mask whole modality features while ensuring that at least one modality remains in the input, as shown in Figure 7. In this way, every modality can be fully learned.

Figure 7. Modality mask
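As a sketch of the label reweighting in Section 3.4: a weighted binary cross-entropy in which annotated (positive) classes receive a large weight and unannotated classes a small one. Mapping the hyperparameters n and m from Section 4.2.4 to the unannotated and annotated weights, respectively, is an assumption based on their magnitudes (n = 0.1, m = 2.5).

```python
import torch
import torch.nn.functional as F

def reweighted_bce(logits, labels, n=0.1, m=2.5):
    """Weighted binary cross-entropy for incompletely labeled data.

    labels == 1 marks the annotated class; labels == 0 may simply be
    unannotated, so those terms are down-weighted by n while annotated
    terms are scaled by m (assumed reading of the n/m hyperparameters
    reported in Section 4.2.4).
    """
    per_class = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    weights = torch.where(labels > 0.5,
                          torch.full_like(labels, m),
                          torch.full_like(labels, n))
    return (weights * per_class).mean()

# Usage: loss = reweighted_bce(model(x), y) with y a multi-hot (B, C) tensor.
```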
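And for the feature enhancement of Section 3.5, a minimal sketch of the two augmentations: Gaussian noise injected into a random 50% of the feature dimensions, and a modality mask that zeroes out whole modalities while always keeping at least one. The noise scale and the per-modality drop probability are illustrative assumptions.

```python
import random
import torch

def add_gaussian_noise(x, keep_ratio=0.5, sigma=0.1):
    """Inject Gaussian noise into a random 50% of the dimensions of x (B, D).
    sigma is an illustrative assumption; the mask is redrawn on every call."""
    mask = (torch.rand_like(x) < keep_ratio).float()
    return x + mask * torch.randn_like(x) * sigma

def modality_mask(feats, drop_prob=0.3):
    """Randomly zero out whole modality features, but always keep at least one.
    `feats` is a list of (B, D_i) tensors, one per modality; drop_prob is illustrative."""
    keep = [random.random() > drop_prob for _ in feats]
    if not any(keep):                              # guarantee at least one modality survives
        keep[random.randrange(len(feats))] = True
    return [f if k else torch.zeros_like(f) for f, k in zip(feats, keep)]
```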
4. Experiments

4.1 Evaluation Metrics

4.2 Experimental Results

4.2.1 Multimodality

To verify the benefit of each modality, we conducted an ablation experiment; the results are shown in Table 2. With a single modality as the feature, Video gives the highest accuracy, Audio the lowest, and Text is close to Video. Among the dual-modality settings, Video + Text brings a significant improvement, while the further gain from adding Audio is limited.
Table 2. Multimodal ablation experiments

4.2.2 Graph Convolution

To verify the benefit of the GCN, we also ran comparative experiments with two values of the threshold λ: 0.2 and 0.4. As shown in Table 3, compared with the original model (org), the classifier generated by the GCN improves performance, especially when λ = 0.4.
Table 3. Graph convolution experiments

4.2.3 Differentiated Multimodal Networks

To verify the effect of parallel multimodal networks and of differentiating them, we designed five groups of experiments. The first group is a single multimodal network; the second, third, and fourth groups are 2, 3, and 4 parallel multimodal networks; the fifth group is 3 differentiated parallel multimodal networks. The results show that parallel networks improve accuracy, but the gain diminishes once 4 networks are connected in parallel, so blindly increasing the number of parallel networks does not bring further benefit. The results also show that differentiated network structures fit the data more effectively.
Table 4. Differentiated multimodal network experiments

4.2.4 Label Reweighting

Label reweighting involves two hyperparameters, n and m. Experiments show that accuracy improves substantially when n = 0.1 and m = 2.5.
Table 5. Label reweighting experiment

4.2.5 Feature Enhancement

Feature enhancement is a form of data augmentation. Experiments show that adding Gaussian noise and masking certain modalities improves the generalization ability of the model. The Gaussian-noise approach is simple, transfers well, and is easy to apply to other networks.
Table 6. Feature enhancement experiments

5. Summary

Experiments show that the methods above all bring improvements to varying degrees, especially the multimodal fusion and the graph convolution. We hope to explore more label dependencies in the future. The GCN has also been shown to be useful for this task, and we believe it is worthwhile to run more experiments combining GCNs with other state-of-the-art video classification networks.

References

[1] Rongcheng Lin, Jing Xiao, Jianping Fan. NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification. In: ECCV Workshops, 2018.
[2] Jeffrey L. Elman. Finding Structure in Time. Cognitive Science, 14(2):179-211, 1990.
[3] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, 2014.
[4] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. Attention-Based Models for Speech Recognition. In: NIPS, pages 577-585, 2015.
[5] Karen Simonyan, Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In: NIPS, 2014.
[6] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In: ICCV, 2015.
[7] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He. SlowFast Networks for Video Recognition. In: CVPR, 2019.
[8] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. ViViT: A Video Vision Transformer. In: ICCV, 2021.
[9] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo. Multi-Label Image Recognition with Graph Convolutional Networks. In: CVPR, 2019.
[10] Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: ICML, PMLR 97:6105-6114, 2019.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL, 2019.
[12] Jie Hu, Li Shen, Gang Sun. Squeeze-and-Excitation Networks. In: CVPR, 2018.
[13] Z. Zhang, M. Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In: NeurIPS, 2018.
[14] R. B. Pereira, A. Plastino, B. Zadrozny, et al. Correlation Analysis of Performance Measures for Multi-Label Classification. Information Processing & Management, 2018.
[15] S. Panchapagesan, M. Sun, A. Khare, et al. Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting. 2016, pages 760-764.