The attention maps from each patch to all other patches show that, for both DeiT-S and TNT-S, more patches become related as the layers go deeper, because information has been exchanged more thoroughly between patches by the deeper layers.
Attention between Patches. In Figure 1, we plot the attention maps from each patch to all other patches. For both DeiT-S and TNT-S, more patches become related as the layers go deeper. This is because information has been exchanged more thoroughly between patches by the deeper layers. As for the difference between DeiT and TNT, the attention of TNT focuses on the meaningful patches in Block-12, while DeiT still attends to the tree, which is unrelated to the pandas.
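As a rough illustration of what a patch-to-patch attention map is, the following minimal sketch computes single-head softmax attention from every patch to all patches and reshapes one row into an image-shaped map. It is not the procedure used to produce Figure 1. The function names `patch_attention` and `related_patch_count`, the 14x14 patch grid, the head dimension of 64, the random queries and keys, and the "above uniform weight" threshold for calling a patch "related" are all assumptions made for illustration, and the class token is ignored.

```python
import numpy as np

def patch_attention(q, k):
    """Softmax attention from each patch (rows) to all patches (columns).

    q, k: arrays of shape (num_patches, dim) for a single attention head.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row-wise maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def related_patch_count(attn, threshold=None):
    """Count, per query patch, the patches whose weight exceeds a threshold.

    The default threshold is the uniform weight 1/N (an assumed heuristic).
    """
    n = attn.shape[-1]
    if threshold is None:
        threshold = 1.0 / n
    return (attn > threshold).sum(axis=-1)

# Hypothetical sizes: a 14x14 grid of patches and a head dimension of 64.
grid, dim = 14, 64
num_patches = grid * grid

# Random stand-ins for the queries and keys of one block (not real model data).
rng = np.random.default_rng(0)
q = rng.standard_normal((num_patches, dim))
k = rng.standard_normal((num_patches, dim))

attn = patch_attention(q, k)             # shape (196, 196)
attn_grid = attn[0].reshape(grid, grid)  # attention of patch 0 over the image grid
counts = related_patch_count(attn)       # "related" patches per query patch
```

Applying the same count to the attention matrices of successive blocks would give one crude way to quantify the qualitative observation that more patches are related in deeper layers; the threshold choice affects the numbers and should be treated as a tunable assumption.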