
Learning to Prompt for Vision-Language Models

2022 · 2326 citations
Kaiyang Zhou, Jingkang Yang, Chen Change Loy

Context Optimization (CoOp) is proposed: a simple approach for adapting CLIP-like vision-language models to downstream image recognition, which learns a prompt's context words as continuous vectors while keeping the pre-trained model fixed, and achieves strong domain generalization compared with the zero-shot model using hand-crafted prompts.

Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning the wording, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while the entire set of pre-trained parameters is kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and that it gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
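The unified-context variant described in the abstract can be sketched in a few lines: a single set of learnable context vectors is prepended to each frozen class-name embedding, the resulting prompt is passed through the (frozen) text encoder, and classification is cosine similarity against the image feature. The sketch below uses NumPy with a mean-pooling stand-in for CLIP's text encoder; all sizes and the `encode_prompt` function are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 learnable context tokens, 512-d embeddings, 3 classes.
n_ctx, dim, n_classes = 4, 512, 3

# Unified context: one shared set of vectors [V]_1 ... [V]_M, prepended to
# every class-name embedding. These are the ONLY parameters CoOp trains.
ctx = rng.normal(size=(n_ctx, dim))
class_embeds = rng.normal(size=(n_classes, dim))  # frozen class-name embeddings

def encode_prompt(ctx, class_embed):
    """Placeholder for CLIP's frozen text encoder: mean-pool the
    [context tokens + class token] sequence, then L2-normalize."""
    tokens = np.vstack([ctx, class_embed[None, :]])
    feat = tokens.mean(axis=0)
    return feat / np.linalg.norm(feat)

# One prompt feature per class; these act as the classification weights.
text_feats = np.stack([encode_prompt(ctx, c) for c in class_embeds])

# Zero-shot-style classification: cosine similarity with the image feature.
image_feat = rng.normal(size=dim)
image_feat /= np.linalg.norm(image_feat)
logits = text_feats @ image_feat
pred = int(np.argmax(logits))
```

In training, gradients from a cross-entropy loss on `logits` would flow back only into `ctx`; the class-specific-context variant would instead allocate a separate `ctx` per class.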