Rewriting Image Captions for Visual Question Answering Data Creation


Visual Question Answering (VQA) is a useful machine learning (ML) task that requires a model to answer a visual question about an image. What makes it challenging is its multi-task and open-ended nature; it involves solving multiple technical research questions in computer vision and natural language understanding simultaneously. Yet progress on this task would enable a wide range of applications, from assisting the blind and visually impaired, to communicating with robots, to enhancing the user’s visual experience with external knowledge.

Effective and robust VQA systems cannot exist without high-quality, semantically and stylistically diverse large-scale training data of image-question-answer triplets. But creating such data is time-consuming and onerous. Perhaps unsurprisingly, the VQA community has focused more on sophisticated model development than on scalable data creation.

In “All You May Need for VQA are Image Captions,” published at NAACL 2022, we explore VQA data generation by proposing “Visual Question Generation with Question Answering Validation” (VQ2A), a pipeline that works by rewriting a declarative caption into multiple interrogative question-answer pairs. More specifically, we leverage two existing assets, (i) large-scale image-text data and (ii) large-capacity neural text-to-text models, to achieve automatic VQA data generation. As the field has progressed, the research community has been making these assets larger and stronger in isolation (for general purposes such as learning text-only or image-text representations); together, they can achieve more, and we adapt them for VQA data creation. We find that our approach can generate question-answer pairs with high precision and that this data can successfully be used to train VQA models and improve their performance.

The VQ2A approach enables VQA data generation at scale from image captions by rewriting each caption into multiple question-answer pairs.

VQ2A Overview
The first step of the VQ2A approach is to apply heuristics based on named entity recognition, part-of-speech tagging and manually defined rules to generate answer candidates from the image caption. These candidates are small pieces of information that may be relevant subjects about which to ask questions. We also add two default answers to this list, “yes” and “no”, which allow us to generate Boolean questions.
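
As a rough illustration of this first step, the sketch below extracts candidates from a caption. It is not the heuristics used in VQ2A: a small hand-written stopword list stands in for named entity recognition and part-of-speech tagging, and the names `STOPWORDS`, `DEFAULT_ANSWERS` and `extract_candidate_answers` are hypothetical.

```python
import re

# Assumption: a tiny stopword list stands in for real NER / part-of-speech
# tagging, which would pick out entities, nouns, numbers, and so on.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "at", "with",
             "and", "is", "are", "to", "by", "for"}

# The two default answers that make Boolean questions possible.
DEFAULT_ANSWERS = ["yes", "no"]


def extract_candidate_answers(caption):
    """Return content-word candidates from a caption, plus "yes" and "no"."""
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    candidates = []
    for token in tokens:
        # Keep each non-stopword once, in order of first appearance.
        if token not in STOPWORDS and token not in candidates:
            candidates.append(token)
    return candidates + DEFAULT_ANSWERS


caption = "Two dogs playing with a ball on the beach"
print(extract_candidate_answers(caption))
# ['two', 'dogs', 'playing', 'ball', 'beach', 'yes', 'no']
```

A real implementation would also produce multi-word candidates (for example, named entities or noun phrases), which this toy version does not attempt.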

Then, we use a T5 model that was fine-tuned to generate questions for each candidate, resulting in [question, candidate answer] pairs. We then filter for the highest-quality pairs using another T5 model (fine-tuned to answer questions) by asking it to answer the question based on the caption. That is, we compare the candidate answer to the output of this model, and if the two answers are similar enough, we consider the question high quality and keep it. Otherwise, we filter it out.
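
The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: token-level F1 is used as the similarity measure and 0.5 as the threshold, neither of which is taken from this post, and a lookup table (`STUB_ANSWERS`, `stub_answer`) stands in for the fine-tuned T5 question answering model. All function names are hypothetical.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two short answers (assumed similarity measure)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        # Two empty answers match; one empty answer does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_pairs(pairs, caption, answer_fn, threshold=0.5):
    """Keep (question, candidate) pairs whose candidate agrees with the QA model."""
    kept = []
    for question, candidate in pairs:
        predicted = answer_fn(question, caption)
        if token_f1(predicted, candidate) >= threshold:
            kept.append((question, candidate))
    return kept


# Stand-in for the question answering model: canned answers for the demo.
STUB_ANSWERS = {
    "What are the dogs playing with?": "a ball",
    "Where are the dogs?": "on the beach",
}


def stub_answer(question, caption):
    return STUB_ANSWERS[question]


caption = "Two dogs playing with a ball on the beach"
pairs = [
    ("What are the dogs playing with?", "ball"),  # consistent, kept
    ("Where are the dogs?", "ball"),              # inconsistent, dropped
]
print(filter_pairs(pairs, caption, stub_answer))
# [('What are the dogs playing with?', 'ball')]
```

A soft measure such as token-level F1 tolerates small surface differences ("a ball" vs. "ball") that exact string matching would reject; the threshold controls how strict the filter is.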

The idea of using both question answering and question generation models to check each other for round-trip consistency has been explored previously in other contexts. For instance, Q2 uses this idea to evaluate factual consistency in knowledge-grounded dialogues. In the end, the VQ2A approach, as illustrated below, can generate a large number of [image, question, answer] triplets that are high-quality enough to be used as VQA training data.

VQ2A consists of three main steps: (i) candidate answer extraction, (ii) question generation, (iii) question answering and answer validation.
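
To show how the three steps fit together, here is a hedged end-to-end sketch that turns one captioned image into [image, question, answer] triplets. The models are passed in as plain callables and replaced by canned stubs in the demo; `VQATriplet`, `caption_to_triplets` and the exact-match check are illustrative assumptions, not the interface or validation rule of the actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VQATriplet:
    image_id: str
    question: str
    answer: str


def caption_to_triplets(
    image_id: str,
    caption: str,
    extract: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    answer_question: Callable[[str, str], str],
    is_match: Callable[[str, str], bool],
) -> List[VQATriplet]:
    """Run the three steps for one caption and return the validated triplets."""
    triplets = []
    for candidate in extract(caption):                   # (i) candidate answers
        question = generate_question(caption, candidate)  # (ii) question generation
        predicted = answer_question(question, caption)    # (iii) question answering
        if is_match(predicted, candidate):                # (iii) answer validation
            triplets.append(VQATriplet(image_id, question, candidate))
    return triplets


# Canned stubs standing in for the heuristics and the two fine-tuned models.
QUESTIONS = {"ball": "What are the dogs playing with?",
             "beach": "What is the weather like?"}
ANSWERS = {"What are the dogs playing with?": "ball",
           "What is the weather like?": "sunny"}

result = caption_to_triplets(
    image_id="img-001",
    caption="Two dogs playing with a ball on the beach",
    extract=lambda caption: ["ball", "beach"],
    generate_question=lambda caption, candidate: QUESTIONS[candidate],
    answer_question=lambda question, caption: ANSWERS[question],
    # Assumption: exact match after normalization; a softer measure also works.
    is_match=lambda a, b: a.strip().lower() == b.strip().lower(),
)
print(result)
```

Note that the question answering stub only sees the caption, never the image: the image is attached to the triplet at the end, which is what lets the whole process run on text-only models.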

Two examples of our generated VQA data are shown below, one based on human-written COCO Captions (COCO) and the other on automatically collected Conceptual Captions (CC3M), which we call VQ2A-COCO and VQ2A-CC3M, respectively. We highlight the variety of question types and styles, which are critical for VQA. Overall, the cleaner the captions (i.e., the more closely related they are to their paired image), the more accurate the generated triplets. Based on 800 samples each, 87.3% of VQ2A-COCO and 66.0% of VQ2A-CC3M are found by human raters to be valid, suggesting that our approach can generate question-answer pairs with high precision.

Generated question-answer pairs based on COCO Captions (top) and Conceptual Captions (bottom). Grey highlighting denotes questions that do not appear in VQAv2, while green highlighting denotes those that do, indicating that our approach is capable of generating novel questions that an existing VQA dataset does not have.

Finally, we evaluate our generated data by using it to train VQA models (highlights shown below). We observe that our automatically generated VQA data is competitive with manually annotated target VQA data. First, our VQA models achieve high performance on target benchmarks “out-of-the-box”, when trained only on our generated data (light blue and light purple vs. yellow). Once fine-tuned on target data, our VQA models outperform target-only training slightly on large-scale benchmarks like VQAv2 and GQA, but significantly on the small, knowledge-seeking OK-VQA (dark blue/purple vs. light blue/purple).

VQA accuracy on popular benchmark datasets.

All we may need for VQA are image captions! This work demonstrates that it is possible to automatically generate high-quality VQA data at scale, serving as an essential building block for VQA and vision-and-language models in general (e.g., ALIGN, CoCa). We hope that our work inspires other work on data-centric VQA.

We thank Roee Aharoni, Idan Szpektor, and Radu Soricut for their feedback on this blogpost. We also thank our co-authors: Xi Chen, Nan Ding, Idan Szpektor, and Radu Soricut. We acknowledge contributions from Or Honovich, Hagai Taitelbaum, Roee Aharoni, Sebastian Goodman, Piyush Sharma, Nassim Oufattole, Gal Elidan, Sasha Goldshtein, and Avinatan Hassidim. Finally, we thank the authors of Q2, whose pipeline strongly influences this work.

