YOLOX is a representative one-stage object detection algorithm whose backbone is built from residual modules composed of convolutional neural networks. Unlike CNNs, the transformer uses a multi-head attention mechanism to represent global information efficiently and has performed well in computer vision, so it is worth combining the transformer with the original YOLOX. In this paper, the CSP residual modules in the original network are replaced with transformer encoder modules to exploit the potential of the self-attention mechanism for feature representation. At the same time, a Pyramid Split Attention module is introduced to optimize the original FPN module in the neck of the network. In addition, to achieve better performance, weighted boxes fusion replaces the non-maximum suppression post-processing of the original algorithm. Experiments show that these improvements yield clear gains on both a public dataset and a transmission line power equipment dataset.
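To illustrate the post-processing change, the following is a minimal single-model sketch of weighted boxes fusion (WBF): instead of discarding overlapping detections as non-maximum suppression does, boxes whose IoU with an existing cluster exceeds a threshold are merged by a confidence-weighted average of their coordinates. The function name, the IoU threshold of 0.55, and the use of the mean score as the fused confidence are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Fuse overlapping boxes by confidence-weighted averaging
    rather than suppressing them as NMS does (illustrative sketch)."""
    order = np.argsort(scores)[::-1]      # visit boxes by descending score
    clusters = []                          # per-cluster list of (box, score)
    fused = []                             # current (fused_box, fused_score) per cluster
    for i in order:
        b, s = np.asarray(boxes[i], dtype=float), float(scores[i])
        for k, (fb, _) in enumerate(fused):
            if iou(b, fb) > iou_thr:
                clusters[k].append((b, s))
                # Recompute the confidence-weighted mean of the coordinates
                # and use the mean score as the cluster confidence.
                w = np.array([c[1] for c in clusters[k]])
                bs = np.stack([c[0] for c in clusters[k]])
                fused[k] = ((w[:, None] * bs).sum(axis=0) / w.sum(), w.mean())
                break
        else:
            clusters.append([(b, s)])
            fused.append((b, s))
    return [f[0] for f in fused], [f[1] for f in fused]
```

For example, two heavily overlapping boxes with scores 0.9 and 0.6 are fused into one box whose coordinates lie between them, while a distant box is kept as a separate detection.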