Monocular weakly supervised depth and pose estimation method based on multi-information fusion

DOI: 10.48129/kjs.12929


  • Zhimin Zhang, Dept. of Computer Science and Engineering, Northeastern University, China
  • Jianzhong Qiao, Northeastern University
  • Shukuan Lin, Dept. of Computer Science and Engineering, Northeastern University, China



Current monocular visual odometry methods usually either require a large amount of expensive ground truth data or, lacking such supervision, yield only suboptimal results. This paper presents a weakly supervised monocular depth and camera pose estimation method based on the fusion of video sequences, inertial measurement unit (IMU) data, and "ground truth" labels. First, we propose a label generation model, which uses transfer learning to obtain high-precision depth and 6-degree-of-freedom (DOF) pose data from a very small number of ground truth disparity maps; these outputs serve as the "ground truth" labels for our monocular model. Then, we construct a multi-information fusion network that estimates depth and camera pose from the "ground truth" labels, the video sequence, and the IMU information. Finally, we design a loss function that combines supervised cues, based on the "ground truth" labels, with self-supervised cues. In the testing phase, the network can output high-precision pose and depth estimates separately from a monocular video sequence. The model is evaluated on the KITTI dataset, where it outperforms other mainstream monocular depth and pose estimation methods.
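
The combination of supervised and self-supervised cues in the loss can be illustrated with a minimal sketch. It is not the loss used in the paper: the L1 form of both terms, the weights `w_sup` and `w_self`, and all function and parameter names are assumptions made for illustration only. The supervised term compares predicted depth against the generated labels, and the self-supervised term compares a target frame against a frame reconstructed (warped) from a neighbouring view.

```python
from typing import Sequence


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute (L1) difference between two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def weak_supervision_loss(
    pred_depth: Sequence[float],
    label_depth: Sequence[float],
    target_pixels: Sequence[float],
    warped_pixels: Sequence[float],
    w_sup: float = 1.0,   # hypothetical weight of the supervised cue
    w_self: float = 1.0,  # hypothetical weight of the self-supervised cue
) -> float:
    """Weighted sum of a supervised and a self-supervised term (assumed form)."""
    # Supervised cue: predicted depth vs. the generated "ground truth" labels.
    supervised = mean_abs_error(pred_depth, label_depth)
    # Self-supervised cue: photometric error between the target frame and
    # the frame reconstructed from a neighbouring view.
    self_supervised = mean_abs_error(target_pixels, warped_pixels)
    return w_sup * supervised + w_self * self_supervised


if __name__ == "__main__":
    loss = weak_supervision_loss(
        pred_depth=[1.0, 2.0],
        label_depth=[1.0, 4.0],
        target_pixels=[0.5, 0.5],
        warped_pixels=[0.5, 0.0],
        w_sup=1.0,
        w_self=0.5,
    )
    print(loss)
```

In this sketch the two weights control how strongly the generated labels constrain training relative to the photometric consistency term; the balance actually used by the authors is not stated in the abstract.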