Natural Language Understanding and Generation
2022, 36(11): 156-168.
Dense video captioning automatically generates a sequence of sentences corresponding to video content, involving problems in both computer vision and natural language processing. To capture audio information as well as the temporal structure and semantic relationships between events, this paper proposes a dense video captioning method based on multi-modal features. First, a Timeception layer is used as the basic module in the action proposal generation stage to better adapt to the varied time spans of action segments. Second, audio features are used to enhance both the proposal generation and description generation stages. Finally, a temporal-semantic relation module models the temporal structure and semantic relationships between events to further improve the accuracy of the generated descriptions. In addition, this paper constructs a dataset named SDVC to explore the effectiveness of the method in a real-world learning scenario. Experimental results on the ActivityNet Captions and SDVC datasets show that the AUC of action proposal generation increases by 0.8% and 6.7%, respectively; when the generated proposals are then used for description generation, BLEU_3 and BLEU_4 on the SDVC dataset increase by 2.3% and 2.2%, respectively.
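To make the multi-scale idea behind the Timeception-based proposal stage concrete, the sketch below smooths one frame-feature sequence with several temporal window sizes, so that both short and long action segments leave a signature in the resulting branches. This is a minimal pure-Python illustration of the general principle, not the paper's implementation; the function names and window sizes are assumptions.

```python
# Illustrative sketch of multi-scale temporal feature extraction, the idea a
# Timeception-style layer builds on: process the same sequence at several
# temporal scales so actions of different durations are captured.
# Window sizes (3, 5, 7) and names are hypothetical, not from the paper.

def temporal_avg(seq, k):
    """Moving average over time with window k (edge-padded, same length)."""
    half = k // 2
    padded = [seq[0]] * half + list(seq) + [seq[-1]] * half
    return [sum(padded[t:t + k]) / k for t in range(len(seq))]

def multi_scale_features(seq, scales=(3, 5, 7)):
    """One smoothed branch of the sequence per temporal scale."""
    return [temporal_avg(seq, k) for k in scales]

# Toy 1-D frame features: a short "action" in the middle of the clip.
feats = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
branches = multi_scale_features(feats)
for k, b in zip((3, 5, 7), branches):
    print(k, [round(x, 2) for x in b])
```

A real layer would replace the fixed averaging with learned temporal convolutions per scale and concatenate the branches channel-wise, but the scale structure is the same.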