Architecture of the deep convolutional neural network for localizing joint positions. The network consists of an encoder that extracts information at multiple scales and a decoder that decodes the abstracted information from the resulting feature maps. Both parts are built from multiple convolution, max-pooling, and up-convolution layers. The encoder and the decoder are connected through concatenation layers.
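
To make the data flow in this kind of architecture concrete, the following is a minimal sketch that traces feature-map shapes through an encoder-decoder network with concatenation connections. It is not the network described here: the function name `trace_shapes`, the depth, the channel widths, the input size, the size-preserving convolutions, and the one-output-channel-per-joint layout are all assumptions chosen for illustration.

```python
# Illustrative shape trace only. Depth, channel widths, input size, and the
# output layout (one channel per joint) are assumptions, not values taken
# from the network in the figure.

def trace_shapes(height, width, in_channels=1, base_channels=16,
                 depth=3, num_joints=4):
    """Return a list of (layer_name, (channels, height, width)) tuples."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    shapes = [("input", (in_channels, height, width))]
    skips = []  # encoder feature maps kept for the concatenation layers
    channels = base_channels

    # Encoder: convolution (size-preserving padding assumed), then 2x2
    # max-pooling, which halves the spatial resolution at each scale.
    for level in range(depth):
        shapes.append((f"enc{level}_conv", (channels, height, width)))
        skips.append((channels, height, width))
        height, width = height // 2, width // 2
        shapes.append((f"enc{level}_pool", (channels, height, width)))
        channels *= 2

    # Convolution at the coarsest scale.
    shapes.append(("bottleneck_conv", (channels, height, width)))

    # Decoder: up-convolution doubles the resolution and halves the channel
    # count; the result is concatenated with the encoder feature map of the
    # same scale and passed through another convolution.
    for level in reversed(range(depth)):
        channels //= 2
        height, width = height * 2, width * 2
        shapes.append((f"dec{level}_upconv", (channels, height, width)))
        skip_channels = skips[level][0]
        shapes.append((f"dec{level}_concat",
                       (channels + skip_channels, height, width)))
        shapes.append((f"dec{level}_conv", (channels, height, width)))

    # Assumed output head: one map per joint at the input resolution.
    shapes.append(("output", (num_joints, height, width)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes(64, 64):
        print(f"{name:16s} {shape}")
```

Under these assumptions, each concatenation layer doubles the channel count relative to the up-convolution output, which is how fine-scale encoder information reaches the decoder at the matching resolution.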