Object detection is among the most important and challenging problems in computer vision, and advances in deep learning architectures have substantially improved its effectiveness. Object detectors are broadly classified into single-stage detectors (SSDs), which predict object locations and classes in a single pass over the image, and two-stage detectors (TSDs), which employ a more complex architecture that first generates and then refines selective region proposals. Object detection is an essential component of visual question answering (VQA), in which algorithms must answer text-based questions about images; although many detection algorithms are available, real-time deployment remains difficult. This paper explores object detection approaches for VQA systems through a comparative analysis of YOLO variants, evaluating the single-stage YOLOv8s and YOLOv8n architectures against two-stage approaches in localizing objects with bounding boxes. Experiments were conducted on the MS-COCO dataset and on two versions of PASCAL-VOC (2007 and 2012), and the paper outlines the data preprocessing steps required to convert these source datasets to the YOLO annotation format. The experimental evaluations on the PASCAL-VOC dataset indicate that, among two-stage detectors, the Faster R-CNN model with a pretrained ResNet101 backbone achieves the highest mAP of 78.80, whereas among single-stage detectors the YOLOv8s model with the CSPDarkNet53 backbone achieves the highest mAP of 89.63. Because YOLO processes an image in a single forward pass, it is well suited to real-time object detection in VQA pipelines. The study concludes that the single-stage YOLOv8s model efficiently identifies objects and generates bounding boxes with higher accuracy than the other models evaluated.
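The abstract mentions preprocessing that converts the PASCAL-VOC annotations to the YOLO label format; the paper body describes the exact pipeline, but a minimal sketch of the standard conversion (VOC XML corner coordinates to normalized center/width/height text files) might look as follows. The class list ordering, file layout, and function name here are illustrative assumptions rather than the paper's code.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed class ordering; the paper's actual class index mapping is not shown here.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def voc_xml_to_yolo_txt(xml_path: str, out_dir: str) -> None:
    """Convert one PASCAL-VOC XML annotation into a YOLO-format .txt label file."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)

    lines = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        if name not in VOC_CLASSES:
            continue  # skip anything outside the 20 VOC categories
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class_id x_center y_center width height, normalized to [0, 1]
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{VOC_CLASSES.index(name)} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")

    out_path = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out_path.write_text("\n".join(lines))
```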
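The claim that YOLO lets a VQA system obtain all bounding boxes in a single forward pass can be illustrated with the public Ultralytics API. This is a generic usage sketch, not the experimental setup used in the study; the weight file, image path, and confidence threshold are assumptions.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8s checkpoint (standard Ultralytics release name).
model = YOLO("yolov8s.pt")

# One forward pass returns every detected object in the image,
# which a downstream VQA module can then reason over.
results = model("street_scene.jpg", conf=0.25)  # path and threshold are illustrative

for box in results[0].boxes:
    class_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(f"{class_name}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), conf={float(box.conf):.2f}")
```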