Multiscale Feature Fusion Network for Monocular Complex Hand
Pose Estimation
Zhi Zhan, Guang Luo
Hand pose estimation from a single RGB image suffers from low accuracy owing to pose complexity, the local self-similarity of finger features, and occlusion. To address this problem, a multiscale feature fusion network (MS-FF) for monocular hand pose estimation is proposed. The network exploits the information in different channels to enhance features important to the hand pose, and it simultaneously extracts features from feature maps of different resolutions to capture as much detailed feature information and deep semantic information as possible. The feature maps are then merged to obtain the hand pose. The InterHand2.6M dataset and the Rendered Handpose Dataset (RHD) are used to train MS-FF. Compared with other methods that can estimate interacting hand poses from a single RGB image, MS-FF achieves the smallest average hand joint error on RHD, verifying its effectiveness.
Introduction: Hand pose estimation aims to identify and localize
key points of human hands in images, and it has a wide range of
applications in virtual reality (VR) and augmented reality (AR) [1].
Deep learning-based methods have clear advantages over traditional methods in both processing speed and prediction accuracy. However, owing to the complexity and diversity of capture conditions, such as varying hand shapes and occlusion, the robustness of existing hand pose estimation methods remains low.
Hand pose estimation methods can be categorized as either depth-based [2-5,15] or RGB-based [6-14,16]. Most methods rely on depth images. For example, Chen et al. [2] extracted effective joint features using an initially estimated hand pose as guiding information, then fused the joint features of the same finger, and finally regressed the hand pose by fusing the finger features. However, connecting the five fingers and the palm at the same time can cause a loss in accuracy. Zhang et al. [4] made full use of the information between adjacent finger joints to estimate depth coordinates: 2D hand joint estimation and the depth estimates of a subset of the hand joints were used as bootstrap information to obtain the depth coordinates of all the hand joints.
Depth images are often limited by the application context, so RGB images have also been used for hand pose estimation. Simon et al. [6] estimated 2D hand poses from multi-view images and extended them to 3D space; however, this method could not estimate hand poses from a single RGB image. Spurr et al. [7] used RGB images to train an encoder-decoder model that estimates the complete 3D hand pose from different inputs; however, the method did not make full use of the hand structure. Yang et al. [9] learned hand poses and hand images with a disentangled variational autoencoder to achieve both image synthesis and hand pose estimation, but the disentangling process may lose useful information.
Since most datasets contain only single-hand sequences, estimating complex gestures is relatively difficult. For this reason, Moon et al. [16] constructed a dataset containing both single-hand and interacting-hand sequences, and proposed the InterNet model to estimate hand poses from a single RGB image. However, under occlusion, this method cannot estimate complex hand poses well. Moreover, edge information is usually ignored in hand pose estimation, yet in the presence of occlusion this information is especially important for recovering the occluded parts. In addition, because fingertips are small objects, the joints at the fingertips are relatively difficult to recognize. To address these issues, a robust multiscale feature fusion network (MS-FF) is presented in this paper. Its main contributions are as follows:
- MS-FF estimates hand poses from a single RGB image more accurately and copes better with complex application scenarios, handling hard-to-recognize joints and inaccurate pose estimates in occluded scenes;
- Different channels contain different implicit information, and the information most relevant to recognizing the hand pose should be emphasized. A channel conversion module adjusts the weights of the channels to enhance this important information (see the first sketch after this list);
- Fingertips occupy a small portion of an image and are relatively difficult to identify. A global regression module generates feature maps of different resolutions with rich semantic information, to better exploit image edge details and deep semantic information, which is important for estimating finger poses (see the second sketch after this list);
- The global regression module may not accurately locate occluded joints. A local optimization module is designed to exploit the deeper information in the feature maps: it fuses the feature maps of all levels and corrects joints that are not regressed to the correct positions, for better performance in occluded scenes (see the third sketch after this list).
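The paper does not give implementation details for these modules at this point; the following minimal PyTorch sketches illustrate one plausible reading of each. All class names, layer choices, and hyperparameters (e.g. the reduction ratio below) are hypothetical, not the authors' implementation. First, the channel conversion module can be read as squeeze-and-excitation-style channel reweighting:

```python
import torch
import torch.nn as nn

class ChannelConversion(nn.Module):
    # Hypothetical sketch: reweights channels with a squeeze-and-excitation
    # style gate so that channels carrying pose-relevant information are
    # emphasized and less informative channels are suppressed.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.gate = nn.Sequential(           # excitation: per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # scale each channel by its learned importance
```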
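Second, the global regression module, which extracts features from maps of several resolutions, can be sketched as a feature-pyramid-style top-down pathway; the channel widths assume a ResNet-like backbone and are again an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class GlobalRegression(nn.Module):
    # Hypothetical sketch: lateral 1x1 convolutions project backbone maps to
    # a common width, and a top-down pathway adds upsampled deep semantics
    # onto the higher-resolution maps that retain edge detail.
    def __init__(self, in_channels=(256, 512, 1024, 2048), width: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in in_channels)

    def forward(self, feats):
        # feats: backbone maps ordered from high to low resolution
        maps = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(maps) - 2, -1, -1):  # merge deeper maps downward
            maps[i] = maps[i] + F.interpolate(
                maps[i + 1], size=maps[i].shape[-2:], mode="nearest")
        return [s(m) for s, m in zip(self.smooth, maps)]  # multiscale outputs
```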
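Third, the local optimization module, which fuses the feature maps of all levels to correct mis-regressed joints, could look like the following; the 21-joint heatmap head reflects the usual per-hand keypoint convention of RHD and InterHand2.6M, while everything else is assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalOptimization(nn.Module):
    # Hypothetical sketch: upsamples every pyramid level to the finest
    # resolution, fuses them, and predicts one refinement heatmap per joint.
    def __init__(self, width: int = 256, levels: int = 4, num_joints: int = 21):
        super().__init__()
        self.fuse = nn.Conv2d(width * levels, width, 1)
        self.head = nn.Conv2d(width, num_joints, 1)  # per-joint heatmaps

    def forward(self, pyramid):
        target = pyramid[0].shape[-2:]  # finest-resolution spatial size
        up = [F.interpolate(p, size=target, mode="bilinear", align_corners=False)
              for p in pyramid]
        fused = torch.relu(self.fuse(torch.cat(up, dim=1)))
        return self.head(fused)  # refined joint heatmaps
```

In a design of this kind, joint locations would be read off the heatmaps by an argmax or soft-argmax; the actual MS-FF heads may differ.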