CoCa: Contrastive Captioners are Image-Text Foundation Models

Abstract

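This paper presents the Contrastive Captioner (CoCa), a minimalist design that pretrains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming the capabilities of contrastive approaches such as CLIP and generative approaches such as SimVLM.
Unlike a standard encoder-decoder transformer, in which every decoder layer attends to the encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder to produce multimodal image-text representations.
The authors apply a contrastive loss between the unimodal image and text embeddings, and a captioning loss on the multimodal decoder outputs.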
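
The decoder split described in the abstract can be illustrated with a small layout sketch. This is not the authors' implementation: `decoder_layer_plan` and its dictionary keys are hypothetical names, and the sketch assumes an even number of decoder layers so that "first half" is unambiguous.

```python
def decoder_layer_plan(n_layers):
    """Describe which layers of a CoCa-style text decoder cross-attend to the image encoder.

    Hypothetical helper for illustration only; it builds no real network.
    """
    if n_layers < 2 or n_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of decoder layers (>= 2)")

    n_unimodal = n_layers // 2  # first half: text-only layers
    plan = []
    for idx in range(n_layers):
        has_cross_attention = idx >= n_unimodal
        plan.append({
            "layer": idx,
            "self_attention": True,                  # every layer self-attends over text
            "cross_attention": has_cross_attention,  # only the second half sees image features
            "role": "multimodal" if has_cross_attention else "unimodal",
        })
    return plan


# Usage: print the layout of a toy 4-layer decoder.
for layer in decoder_layer_plan(4):
    print(layer)
```

In this layout, the output of the last unimodal layer is where a text embedding for the contrastive loss would be read, while the output of the last multimodal layer would feed the captioning loss.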

Approach

Natural Language Supervision

Single-Encoder Classification

The classic single-encoder approach pretrains a visual encoder through image classification on a large annotated image dataset, where the vocabulary of the annotation texts is fixed. These image annotations are usually mapped into discrete class vectors so that the encoder can be trained with the cross-entropy loss below.

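$$\mathcal{L}_{\text{Cls}} = -p(y) \log q_\theta(x)$$

$p(y)$ : one-hot, multi-hot or smoothed label distribution from ground truth label $y$.
$q_\theta(x)$ : class distribution predicted by the image encoder with parameters $\theta$ for image $x$.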
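
As a rough illustration of this classification objective, the sketch below computes the cross-entropy between a (possibly smoothed) label distribution $p(y)$ and the softmax of a vector of class logits. The function names, the smoothing scheme, and the toy logits are assumptions made for the example and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def smoothed_one_hot(label, n_classes, eps=0.0):
    """Label distribution p(y): one-hot when eps == 0, uniformly smoothed otherwise."""
    off_value = eps / n_classes
    return [(1.0 - eps) + off_value if k == label else off_value
            for k in range(n_classes)]


def classification_loss(logits, target_dist):
    """Cross-entropy: -sum_k p(y)_k * log q(x)_k, with q = softmax(logits)."""
    log_q = log_softmax(logits)
    return -sum(p * lq for p, lq in zip(target_dist, log_q))


# Usage: toy logits for a 4-class problem (values are made up).
toy_logits = [2.0, 0.5, -1.0, 0.1]
print(classification_loss(toy_logits, smoothed_one_hot(0, 4)))           # one-hot target
print(classification_loss(toy_logits, smoothed_one_hot(0, 4, eps=0.1)))  # smoothed target
```

A multi-hot target would use the same `classification_loss` with a distribution that spreads mass over several classes.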

The learned image encoder is then used as a generic visual representation extractor for downstream tasks.

Dual-Encoder Contrastive Learning

Compared with single-encoder classification, the dual-encoder approach exploits noisy web-scale text descriptions and uses a text tower to encode free-form text. The two encoders are optimized jointly by contrasting each paired text against the other texts in the sampled batch.

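$$\mathcal{L}_{\text{Con}} = -\frac{1}{N}\left(\sum_{i=1}^{N} \log \frac{\exp(x_i^\top y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^\top y_j / \sigma)} + \sum_{i=1}^{N} \log \frac{\exp(y_i^\top x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^\top x_j / \sigma)}\right)$$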

$x_i, y_j$ : Normalized embeddings of the image in the $i$-th pair and the text in the $j$-th pair.
$N$ : Batch size.
$\sigma$ : Temperature to scale logits.
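
The symmetric contrastive objective over a batch of $N$ image-text pairs can be sketched in plain Python as follows. This is a minimal illustration, not the paper's code: `contrastive_loss` and its helpers are hypothetical names, the default `temperature=0.07` is an arbitrary placeholder for $\sigma$, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - log_z for z in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss for one batch.

    temperature plays the role of sigma; 0.07 is a placeholder, not a value from the source.
    """
    n = len(image_embs)
    if n == 0 or n != len(text_embs):
        raise ValueError("need the same non-zero number of image and text embeddings")

    x = [l2_normalize(v) for v in image_embs]
    y = [l2_normalize(v) for v in text_embs]

    # logits[i][j] = x_i . y_j / sigma
    logits = [[dot(x[i], y[j]) / temperature for j in range(n)] for i in range(n)]

    # Image-to-text: row i is contrasted over all texts, the positive is column i.
    i2t = sum(log_softmax(logits[i])[i] for i in range(n))
    # Text-to-image: column i is contrasted over all images, the positive is row i.
    t2i = sum(log_softmax([logits[j][i] for j in range(n)])[i] for i in range(n))

    return -(i2t + t2i) / n


# Usage: two toy pairs with 2-dimensional embeddings (values are made up).
images = [[1.0, 0.1], [0.2, 1.0]]
texts = [[0.9, 0.0], [0.0, 1.1]]
print(contrastive_loss(images, texts))
```

Matched pairs sit on the diagonal of `logits`, so the loss falls as each image embedding moves closer to its own text and further from the other texts in the batch.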