Efficient Video-Text Learning with Iterative Co-tokenization


Video is a ubiquitous source of media content that touches on many aspects of people's day-to-day lives. Increasingly, real-world video applications, such as video captioning, video content analysis, and video question-answering (VideoQA), rely on models that can connect video content with text or natural language. VideoQA is particularly challenging, however, because it requires grasping both semantic information, such as objects in a scene, and temporal information, e.g., how things move and interact, both of which must be interpreted in the context of a natural-language question that carries a specific intent. In addition, because videos have many frames, processing all of them to learn spatio-temporal information can be computationally expensive. Nonetheless, understanding all this information enables models to answer complex questions. For example, in the video below, a question about the second ingredient poured into the bowl requires identifying objects (the ingredients), actions (pouring), and temporal ordering (second).

An example input question for the VideoQA task, “What is the second ingredient poured into the bowl?”, which requires a deeper understanding of both the visual and text inputs. The video is an example from the 50 Salads dataset, used under the Creative Commons license.

To address this, in “Video Question Answering with Iterative Video-Text Co-Tokenization”, we introduce a new approach to video-text learning called iterative co-tokenization, which efficiently fuses spatial, temporal and language information for VideoQA. This approach is multi-stream: it processes videos at different scales, each with its own independent backbone model, to produce video representations that capture different features, e.g., those of high spatial resolution or long temporal duration. The model then applies the co-tokenization module to learn efficient representations by fusing the video streams with the text. This model is highly efficient, using only 67 giga-FLOPs (GFLOPs), which is at least 50% fewer than previous approaches, while giving better performance than alternative state-of-the-art models.

Video-Text Iterative Co-tokenization
The main goal of the model is to produce features from both videos and text (i.e., the user question), jointly allowing their corresponding inputs to interact. A second goal is to do so efficiently, which is especially important for videos since they contain tens to hundreds of frames as input.

The model learns to tokenize the joint video-language inputs into a smaller set of tokens that jointly and efficiently represent both modalities. When tokenizing, we use both modalities to produce a joint compact representation, which is fed to a transformer layer to produce the next-level representation. A challenge here, which is also typical in cross-modal learning, is that a video frame often does not correspond directly to the associated text. We address this by adding two learnable linear layers that unify the visual and text feature dimensions before tokenization. In this way, both video and text condition how the video tokens are learned.

Moreover, a single tokenization step does not allow for further interaction between the two modalities. To enable it, we use this new feature representation to interact with the video input features and produce another set of tokenized features, which are then fed into the next transformer layer. This iterative process creates new features, or tokens, that represent a continual refinement of the joint representation from both modalities. At the last step, the features are input to a decoder that generates the text output.
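The loop described above can be illustrated with a small, self-contained sketch. This is not the model from the paper: the pooling of the conditioning features, the additive fusion, the `w_score` parameterization, and all sizes (`n`, `m`, `d`, `k`, `steps`) are assumptions chosen for illustration, the weights are random rather than learned, and the transformer layer between rounds is left out. It only shows the shape of the idea: project both modalities to a shared width, pick a small set of tokens from the video conditioned on the text, then repeat with the new tokens as additional conditioning.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a 1-D sequence."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Stand-in for a learned weight or feature matrix (random, untrained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def co_tokenize(video, context, w_score):
    """Reduce N video features to K tokens, conditioned on `context`.

    video:   N x D projected video features
    context: rows (text and/or previous tokens) of width D
    w_score: D x K scoring weights (hypothetical parameterization)
    """
    # Pool the conditioning rows into one D-dimensional vector.
    pooled = [sum(col) / len(context) for col in zip(*context)]
    # Add it to every video position so the scores depend on both modalities.
    fused = [[v + c for v, c in zip(row, pooled)] for row in video]
    scores = matmul(fused, w_score)                    # N x K
    # One attention map per token: softmax over the N video positions.
    weights = [softmax(col) for col in zip(*scores)]   # K x N
    return matmul(weights, video), weights             # K x D tokens, K x N maps


def iterative_co_tokenization(n=12, m=5, dv=6, dt=4, d=8, k=3, steps=2, seed=0):
    rng = random.Random(seed)
    video_raw = rand_matrix(n, dv, rng)   # toy video features
    text_raw = rand_matrix(m, dt, rng)    # toy question features
    # Two linear layers map both modalities to a shared width d.
    video = matmul(video_raw, rand_matrix(dv, d, rng))
    text = matmul(text_raw, rand_matrix(dt, d, rng))

    context = text
    for _ in range(steps):
        tokens, weights = co_tokenize(video, context, rand_matrix(d, k, rng))
        # A transformer layer would refine `tokens` here; omitted in this sketch.
        # The next round is conditioned on the new tokens plus the text.
        context = tokens + text
    return tokens, weights


tokens, weights = iterative_co_tokenization()
print(len(tokens), len(tokens[0]))  # prints "3 8": K tokens, each of width D
```

Whatever the number of video positions, everything downstream of the tokenizer sees only `k` tokens, which is where the compute savings in this kind of design would come from.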

As is customary for VideoQA, we pre-train the model before fine-tuning it on the individual VideoQA datasets. In this work, instead of pre-training on a large VideoQA dataset, we use the HowTo100M dataset, whose videos are automatically annotated with text based on speech recognition. This weaker pre-training data still enables our model to learn video-text features.

Visualization of the video-text iterative co-tokenization approach. Multi-stream video inputs, which are versions of the same video input (e.g., a high-resolution, low frame-rate video and a low-resolution, high frame-rate video), are efficiently fused together with the text input to produce a text-based answer from the decoder. Instead of processing the inputs directly, the video-text iterative co-tokenization model learns a reduced number of useful tokens from the fused video-language inputs. This process is done iteratively, allowing the current feature tokenization to affect the selection of tokens at the next iteration, thus refining the selection.

Efficient Video Question-Answering
We apply the video-language iterative co-tokenization algorithm to three main VideoQA benchmarks, MSRVTT-QA, MSVD-QA and IVQA, and demonstrate that this approach achieves better results than other state-of-the-art models while having a modest size. Furthermore, iterative co-tokenization learning yields significant compute savings for video-text learning tasks. The method uses only 67 giga-FLOPs (GFLOPs), which is roughly one sixth of the 360 GFLOPs needed when using the popular 3D-ResNet video model jointly with text, and is more than twice as efficient as the X3D model. It achieves this while producing highly accurate results, outperforming state-of-the-art methods.
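As a quick check on the compute comparison, the short snippet below works through the arithmetic using only the two figures quoted in the text (67 and 360 GFLOPs). The helper name `compare` is ours, and the X3D value is not stated in the text, so the snippet only derives the lower bound implied by "more than twice as efficient".

```python
# Compute figures quoted in the text, in GFLOPs.
OURS_GFLOPS = 67.0
RESNET3D_GFLOPS = 360.0


def compare(ours, baseline):
    """Return (reduction factor, fractional saving) of `ours` versus `baseline`."""
    return baseline / ours, 1.0 - ours / baseline


factor, saving = compare(OURS_GFLOPS, RESNET3D_GFLOPS)
print(f"{factor:.2f}x fewer FLOPs, {saving:.0%} saved")  # prints "5.37x fewer FLOPs, 81% saved"

# "More than twice as efficient as X3D" implies X3D needs more than this many GFLOPs.
# This is a derived bound, not a figure reported in the text.
x3d_lower_bound = 2 * OURS_GFLOPS
print(f"implied X3D cost: more than {x3d_lower_bound:.0f} GFLOPs")
```

The reduction works out to about 5.4 times, which the text rounds to one sixth.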

Comparison of our iterative co-tokenization approach to previous methods such as MERLOT and VQA-T, as well as baselines using a single ResNet-3D or X3D-XL.

Multi-stream Video Inputs
For VideoQA, or any of a number of other tasks that involve video inputs, we find that multi-stream input is important for more accurately answering questions about both spatial and temporal relationships. Our approach uses three video streams at different resolutions and frame-rates: a low-resolution, high frame-rate input video stream (with 32 frames-per-second and spatial resolution 64×64, which we denote as 32x64x64); a high-resolution, low frame-rate video (8x224x224); and one in-between (16x112x112). Despite the apparently larger volume of information to process with three streams, we obtain very efficient models thanks to the iterative co-tokenization approach. At the same time, these additional streams allow extraction of the most pertinent information. For example, as shown in the figure below, questions related to a specific activity in time will produce higher activations in the lower-resolution but high frame-rate video input, whereas questions related to the general activity can be answered from the high-resolution input with very few frames. Another benefit of this algorithm is that the tokenization changes depending on the question asked.
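To get a feel for how much raw input the three streams represent, the sketch below counts the spatio-temporal positions in each one. It assumes that the first number in each shape is the count of frames in a clip and that the other two are height and width in pixels; it ignores color channels and any downsampling done by the backbones, so the numbers describe the raw inputs only, not the features the tokenizer actually sees.

```python
# (frames, height, width) for each stream, as denoted in the text.
STREAMS = {
    "low-res, high frame-rate": (32, 64, 64),
    "in-between": (16, 112, 112),
    "high-res, low frame-rate": (8, 224, 224),
}


def positions(shape):
    """Number of spatio-temporal positions in a clip of the given shape."""
    frames, height, width = shape
    return frames * height * width


sizes = {name: positions(shape) for name, shape in STREAMS.items()}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name:26s} {size:>7d} positions ({size / total:.0%} of total)")
print(f"{'all streams':26s} {total:>7d} positions")
```

Under these assumptions, each step from the high frame-rate stream to the high-resolution stream halves the frame count and quadruples the per-frame area, so the raw size doubles from one stream to the next. Reducing all of this to a small, fixed number of tokens is what keeps the three-stream model affordable.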

Visualization of the attention maps learned per layer during the video-text co-tokenization. The attention maps differ depending on the question asked for the same video. For example, if the question is related to the general activity (e.g., surfing in the figure above), then the attention maps of the higher-resolution, low frame-rate inputs are more active and seem to consider more global information. If instead the question is more specific, e.g., asking about what happens after an event, the feature maps are more localized and tend to be active in the high frame-rate video input. Furthermore, we see that the low-resolution, high frame-rate video inputs provide more information related to activities in the video.

We present a new approach to video-language learning that focuses on joint learning across video-text modalities. We address the important and challenging task of video question-answering. Our approach is both highly efficient and accurate, outperforming current state-of-the-art models while requiring less compute. It results in modest model sizes and can gain further improvements with larger models and data. We hope this work provokes more research in vision-language learning to enable more seamless interaction with vision-based media.

This work was conducted by AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael Ryoo and Anelia Angelova. We thank our collaborators in this research, Soravit Changpinyo for valuable comments and suggestions, and Claire Cui for suggestions and support. We also thank Tom Small for visualizations.

