Different from ViT that typically yields low resolution outputs and 
incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high 
output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the 
computations of large feature maps.