ABSTRACT

Recently, generalist vision-language models (VLMs) have proven highly versatile across a wide range of tasks, making them indispensable in computer vision and natural language processing. Their broad pre-training enables robust performance across many domains. When tailoring these models to specific downstream tasks such as remote sensing (RS) image captioning, the traditional approach has been fine-tuning. Fine-tuning, however, is often computationally expensive and may compromise the model's inherent generalization capabilities by over-specializing it on limited domain-specific datasets. This paper investigates Retrieval-Augmented Generation (RAG) as an alternative strategy for adapting generalist VLMs to RS captioning without requiring fine-tuning. We introduce RAGCap, a retrieval-augmented framework that leverages similarity-based retrieval to select relevant image-caption pairs from the training dataset. These examples are then combined with the target image within a carefully designed prompt structure, guiding the generalist VLM to generate RS captions stylistically consistent with the training dataset. While our implementation uses SigLIP for retrieval and Qwen2VL as the base VLM, the proposed framework is general and readily applicable to other models. Extensive evaluations on four RS benchmark datasets reveal that RAGCap achieves competitive performance compared to traditional fine-tuning approaches. Our findings suggest that RAG methods such as RAGCap offer a scalable, practical alternative to fine-tuning for domain adaptation in RS image captioning. Code will be available at: https://github.com/BigData-KSU/RAGCap.
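To make the retrieve-then-prompt idea concrete, the sketch below illustrates one possible realization of such a pipeline with Hugging Face transformers: SigLIP embeds the training images, the top-k most similar image-caption pairs are retrieved by cosine similarity, and the retrieved captions are placed in the prompt given to Qwen2-VL together with the target image. It is a minimal, illustrative sketch only; the specific checkpoints, the toy training pairs, the value of k, and the prompt wording are assumptions and not the exact configuration described in the paper.

```python
# Minimal sketch of a retrieval-augmented captioning pipeline (illustrative only).
# Checkpoints, top-k, prompt text, and the toy training pairs are assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor, Qwen2VLForConditionalGeneration

# --- 1. Retrieval index: SigLIP image embeddings of the training set ---------
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def embed_images(paths):
    """Return L2-normalized SigLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = siglip_proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = siglip.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

# Toy training pairs (placeholders for an RS captioning training split).
train_pairs = [
    ("train_001.jpg", "An airport with two runways and several parked aircraft."),
    ("train_002.jpg", "Dense residential buildings beside a winding river."),
]
index = embed_images([p for p, _ in train_pairs])            # (N, D)

# --- 2. Retrieve the top-k most similar image-caption pairs ------------------
query_path = "query.jpg"
query_emb = embed_images([query_path])                       # (1, D)
scores = query_emb @ index.T                                 # cosine similarity
top_k = scores.squeeze(0).topk(k=min(2, len(train_pairs))).indices.tolist()
examples = [train_pairs[i][1] for i in top_k]

# --- 3. Prompt the generalist VLM (Qwen2-VL) with the retrieved examples -----
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
vlm_proc = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

prompt = ("Here are example captions of similar remote sensing images:\n"
          + "\n".join(f"- {c}" for c in examples)
          + "\nDescribe the given image in the same style.")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": prompt},
]}]
chat = vlm_proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = vlm_proc(text=[chat], images=[Image.open(query_path).convert("RGB")],
                  return_tensors="pt").to(vlm.device)
out_ids = vlm.generate(**inputs, max_new_tokens=64)
caption = vlm_proc.batch_decode(out_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(caption)
```

Under these assumptions, the base VLM is never updated: domain adaptation comes entirely from the retrieved in-context examples, so swapping in a different retriever or VLM only requires changing the checkpoint names.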