A New Language Interface for Object Detection


Object detection is a long-standing computer vision task that attempts to recognize and localize all objects of interest in an image. The complexity arises when trying to identify or localize all object instances while also avoiding duplication. Existing approaches, like Faster R-CNN and DETR, are carefully designed and highly customized in the choice of architecture and loss function. This specialization of existing systems has created two major barriers: (1) it adds complexity in tuning and training the different parts of the system (e.g., region proposal network, graph matching with GIoU loss, etc.), and (2) it can reduce the ability of a model to generalize, necessitating a redesign of the model for application to other tasks.

In “Pix2Seq: A Language Modeling Framework for Object Detection”, published at ICLR 2022, we present a simple and generic method that tackles object detection from a completely different perspective. Unlike existing approaches that are task-specific, we cast object detection as a language modeling task conditioned on the observed pixel inputs. We demonstrate that Pix2Seq achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms, and its performance can be further improved by pre-training the model on a larger object detection dataset. To encourage further research in this direction, we are also excited to release to the broader research community Pix2Seq’s code and pre-trained models along with an interactive demo.

Pix2Seq Overview

Our approach is based on the intuition that if a neural network knows where and what the objects in an image are, one could simply teach it how to read them out. By learning to “describe” objects, the model can learn to ground the descriptions on pixel observations, leading to useful object representations. Given an image, the Pix2Seq model outputs a sequence of object descriptions, where each object is described using five discrete tokens: the coordinates of the bounding box’s corners [ymin, xmin, ymax, xmax] and a class label.

Pix2Seq framework for object detection. The neural network perceives an image and generates a sequence of tokens for each object, which correspond to bounding boxes and class labels.

With Pix2Seq, we propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens (similar to captions), and leverage an encoder-decoder architecture to perceive pixel inputs and generate the sequence of object descriptions. The training objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens.
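The objective can be written compactly as a per-token negative log-likelihood. Here is a minimal sketch of that quantity (the function name and the plain-float interface are illustrative, not the released implementation, which computes this as a softmax cross-entropy over batches):

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target token sequence: the sum of
    -log p(token | image, preceding tokens) over all target tokens.
    Training minimizes this, i.e., maximizes the sequence likelihood."""
    return -sum(math.log(p) for p in token_probs)
```

A perfectly confident model (probability 1 on every target token) has a loss of zero; lower-probability tokens contribute a larger penalty.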

Sequence Construction from Object Descriptions

In commonly used object detection datasets, images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq, a single object, defined by a bounding box and class label, is represented as [ymin, xmin, ymax, xmax, class]. However, typical language models are designed to process discrete tokens (or integers) and are unable to comprehend continuous numbers. So, instead of representing image coordinates as continuous numbers, we normalize the coordinates between 0 and 1 and quantize them into one of a few hundred or thousand discrete bins. The coordinates are then converted into discrete tokens, as are the object descriptions, similar to image captions, which in turn can then be interpreted by the language model. The quantization process is achieved by multiplying the normalized coordinate (e.g., ymin) by the number of bins minus one, and rounding it to the nearest integer (the detailed process can be found in our paper).
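The quantization step described above can be sketched in a few lines of Python (function names are ours, for illustration):

```python
def quantize(coord, num_bins):
    """Map a coordinate normalized to [0, 1] onto a discrete bin index
    in {0, ..., num_bins - 1}: multiply by (num_bins - 1) and round."""
    return int(round(coord * (num_bins - 1)))

def dequantize(token, num_bins):
    """Recover an approximate normalized coordinate from a bin index."""
    return token / (num_bins - 1)
```

The round trip loses at most half a bin of precision, which is why a few hundred bins already suffice for pixel-level accuracy on typical image sizes.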

Quantization of the coordinates of the bounding boxes with different numbers of bins on a 480 × 640 image. With a small number of bins/tokens, such as 500 bins (∼1 pixel/bin), it achieves high precision even for small objects.

After quantization, the object annotations provided with each training image are ordered into a sequence of discrete tokens (shown below). Since the order of the objects does not matter for the detection task per se, we randomize the order of objects each time an image is shown during training. We also append an End of Sequence (EOS) token at the end, as different images often have different numbers of objects, and hence different sequence lengths.
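Putting the pieces together, sequence construction for one training image might look like the following sketch (the vocabulary layout, with class ids offset past the coordinate bins, and the EOS value are assumptions for illustration and differ from the released code):

```python
import random

NUM_BINS = 1000   # coordinate tokens occupy ids 0..NUM_BINS-1 (assumed layout)
EOS_TOKEN = -1    # placeholder end-of-sequence id

def build_sequence(objects, num_bins=NUM_BINS, eos_token=EOS_TOKEN):
    """Serialize a list of (ymin, xmin, ymax, xmax, class_id) objects,
    with coordinates normalized to [0, 1], into one flat token sequence.
    Object order is randomized on every call, per the training recipe."""
    objs = list(objects)
    random.shuffle(objs)
    seq = []
    for ymin, xmin, ymax, xmax, cls in objs:
        for coord in (ymin, xmin, ymax, xmax):
            seq.append(int(round(coord * (num_bins - 1))))  # quantize
        seq.append(num_bins + cls)  # class ids offset past coordinate bins
    seq.append(eos_token)
    return seq
```

Each object contributes exactly five tokens, so a sequence for an image with N objects has 5N + 1 tokens including EOS.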

The bounding boxes and class labels for objects detected in the image on the left are represented in the sequences shown on the right. A random object ordering strategy is used in our work, but other approaches to ordering could also be used.

The Model Architecture, Objective Function, and Inference

We treat the sequences that we constructed from object descriptions as a “dialect” and address the problem via a powerful and general language model with an image encoder and an autoregressive language decoder. Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and the preceding tokens, with a maximum likelihood loss. At inference time, we sample tokens from model likelihood. The sampled sequence ends when the EOS token is generated. Once the sequence is generated, we split it into chunks of 5 tokens to extract and de-quantize the object descriptions (i.e., obtain the predicted bounding boxes and class labels). It is worth noting that both the architecture and loss function are task-agnostic in that they don’t assume prior knowledge about object detection (e.g., bounding boxes). We describe how we can incorporate task-specific prior knowledge with a sequence augmentation technique in our paper.
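The inference-time decoding step, i.e., truncating at EOS, chunking into groups of five, and de-quantizing, can be sketched as follows (the token layout mirrors the illustrative serialization above and is an assumption, not the released code):

```python
def decode_sequence(tokens, num_bins=1000, eos_token=-1):
    """Turn a sampled token stream back into objects: drop everything
    from the EOS token onward, then read 5 tokens per object and
    de-quantize the first four into normalized box coordinates.
    Class ids are assumed to be offset past the coordinate bins."""
    if eos_token in tokens:
        tokens = tokens[:tokens.index(eos_token)]
    objects = []
    for i in range(0, len(tokens) - 4, 5):
        ymin, xmin, ymax, xmax = (t / (num_bins - 1) for t in tokens[i:i + 4])
        cls = tokens[i + 4] - num_bins
        objects.append((ymin, xmin, ymax, xmax, cls))
    return objects
```

Any trailing partial chunk of fewer than five tokens is simply dropped, since it cannot form a complete object description.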


Results

Despite its simplicity, Pix2Seq achieves impressive empirical performance on benchmark datasets. Specifically, we compare our method with well-established baselines, Faster R-CNN and DETR, on the widely used COCO dataset and demonstrate that it achieves competitive average precision (AP) results.

Pix2Seq achieves competitive AP results compared to existing systems that require specialization during model design, while being significantly simpler. The best performing Pix2Seq model achieved an AP score of 45.

Since our approach incorporates minimal inductive bias or prior knowledge of the object detection task into the model design, we further explore how pre-training the model on a larger object detection dataset can impact its performance. Our results indicate that this training strategy (along with using bigger models) can further improve performance.

The average precision of the Pix2Seq model with pre-training followed by fine-tuning. The best performing Pix2Seq model without pre-training achieved an AP score of 45. When the model is pre-trained, we see an 11% improvement, with an AP score of 50.

Pix2Seq can detect objects in densely populated and complex scenes, such as those shown below.

Example complex and densely populated scenes labeled by a trained Pix2Seq model. Try it out here.

Conclusion and Future Work

With Pix2Seq, we cast object detection as a language modeling task conditioned on pixel inputs, for which the model architecture and loss function are generic and have not been engineered specifically for the detection task. One can, therefore, readily extend this framework to different domains or applications where the output of the system can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering), or incorporate it into a perceptual system supporting general intelligence, for which it provides a language interface to a wide range of vision and language tasks. We also hope that the release of our Pix2Seq code, pre-trained models, and interactive demo will inspire further research in this direction.


Acknowledgements

This post reflects the combined work with our co-authors: Saurabh Saxena, Lala Li, Geoffrey Hinton. We would also like to thank Tom Small for the visualization of the Pix2Seq illustration figure.
