Table 1 compares different methods on RHD, where EPE is the average end-point error of the hand joints, and GT H and GT S indicate that the ground-truth handedness and scale of the hand, respectively, are used. Spurr et al. [7] and Yang et al. [9] achieved lower joint errors but required this additional input at test time, whereas our method obtained low errors without any ground-truth information during testing.
Conclusion: We proposed MS-FF for monocular hand pose estimation. To exploit the detailed information of occluded edges and fingertips, the network extracts information at different levels from feature maps of different resolutions, which leads to more accurate pose estimates. A channel conversion module adjusts the weights of the channels, and a global regression module fuses the feature maps of different resolutions so that both the edge details of the image and the deep semantic information are fully used. An optimization procedure then corrects joints that were not regressed to their correct positions. The proposed method achieved higher accuracy and robustness, and the experiments verified the effectiveness of MS-FF.
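For illustration, the following Python (PyTorch) sketch shows one way the channel reweighting and multi-resolution fusion summarised above could be realised. The class names, channel sizes, squeeze-and-excitation-style gating, upsampling strategy, and the 21-joint regression head are assumptions introduced here, not the authors' exact MS-FF implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelConversion(nn.Module):
    """Reweights feature channels (squeeze-and-excitation-style gate; illustrative)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # global average pooling -> per-channel weights
        return x * w[:, :, None, None]    # rescale each channel

class GlobalRegression(nn.Module):
    """Fuses multi-resolution feature maps and regresses 3D joint positions."""
    def __init__(self, in_channels=(32, 64, 128), num_joints=21):
        super().__init__()
        self.num_joints = num_joints
        self.gates = nn.ModuleList(ChannelConversion(c) for c in in_channels)
        self.head = nn.Sequential(
            nn.Conv2d(sum(in_channels), 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(256, num_joints * 3),
        )

    def forward(self, feats):
        # Upsample every map to the finest resolution, reweight channels, concatenate, regress.
        size = feats[0].shape[-2:]
        fused = torch.cat(
            [gate(F.interpolate(f, size=size, mode="bilinear", align_corners=False))
             for gate, f in zip(self.gates, feats)],
            dim=1,
        )
        return self.head(fused).view(-1, self.num_joints, 3)

# Example: three feature maps at decreasing resolutions from a backbone.
feats = [torch.randn(2, 32, 64, 64),
         torch.randn(2, 64, 32, 32),
         torch.randn(2, 128, 16, 16)]
joints = GlobalRegression()(feats)    # shape: (2, 21, 3)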
Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grant 61601213 and by the Special Innovative Projects of General Universities in Guangdong Province under Grant 022WTSCX210.
Zhi Zhan (Guangdong Engineering Polytechnic, China)
Guang Luo (South China Normal University, China)
E-mail: luoguang_arts@163.com