Surface-height- and uncertainty-based depth estimation for Mono3D

doi:10.3969/j.issn.1000-5641.2025.01.006

Abstract

Monocular three-dimensional (3D) object detection is a fundamental but challenging task in autonomous driving and robotic navigation. Directly predicting object depth from a single image is essentially an ill-posed problem. Geometry projection is a powerful depth estimation method that infers an object’s depth from its physical height and its projected height in the image plane. However, errors in the estimated heights are amplified into errors in the resulting depth. In this study, the physical and projected heights of object surface points (rather than the height of the object itself) were estimated to obtain several depth candidates. In addition, the uncertainties in the heights were estimated, and the final object depth was obtained by combining the depth candidates according to these uncertainties. Experiments demonstrated the effectiveness of the depth estimation method, which achieved state-of-the-art (SOTA) results on the monocular 3D object detection task of the KITTI dataset.
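The geometry projection idea in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera, so that depth is z = f * H / h for focal length f (pixels), physical height H (meters), and projected height h (pixels). The abstract does not give the uncertainty model or the fusion rule, so the first-order error propagation and the inverse-variance weighting below are assumptions chosen for illustration, and all function names and numeric values are hypothetical.

```python
import math


def depth_from_heights(focal_px, h3d_m, h2d_px):
    """Pinhole geometry projection: z = f * H / h."""
    return focal_px * h3d_m / h2d_px


def depth_sigma(focal_px, h3d_m, h2d_px, sigma_h3d, sigma_h2d):
    """First-order propagation of height uncertainties to depth.

    Assumes independent errors in H and h, so the relative depth error
    is the root-sum-square of the two relative height errors.
    """
    z = depth_from_heights(focal_px, h3d_m, h2d_px)
    relative_error = math.hypot(sigma_h3d / h3d_m, sigma_h2d / h2d_px)
    return z * relative_error


def fuse_depths(depths, sigmas):
    """Inverse-variance weighted mean of depth candidates (an assumed rule)."""
    weights = [1.0 / (s * s) for s in sigmas]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)


# Hypothetical focal length and surface-point estimates:
# (physical height in m, projected height in px, sigma_H, sigma_h)
FOCAL_PX = 720.0
candidates = [
    (1.50, 36.0, 0.05, 1.0),
    (0.75, 18.5, 0.05, 2.0),
    (1.00, 24.0, 0.10, 0.5),
]

depths = [depth_from_heights(FOCAL_PX, H, h) for H, h, _, _ in candidates]
sigmas = [depth_sigma(FOCAL_PX, H, h, sH, sh) for H, h, sH, sh in candidates]
fused = fuse_depths(depths, sigmas)

print("depth candidates:", depths)
print("depth sigmas:", sigmas)
print("fused depth:", fused)
```

With these hypothetical numbers, the first candidate gives 720 × 1.5 / 36 = 30 m, and a one-pixel error in its projected height (35 px instead of 36 px) shifts the depth to about 30.86 m. This shows how a small height error grows into a much larger depth error, and why less certain candidates are given smaller weights here.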

Key words: monocular 3D object detection (Mono3D), depth estimation, geometry projection, autonomous driving

CLC Number: TP183

Yinshuai JI, Jinhua XU. Surface-height- and uncertainty-based depth estimation for Mono3D [J]. Journal of East China Normal University (Natural Science), 2025, 2025(1): 72-81.

Figures/Tables (5): Fig. 1, Fig. 2, Table 1, Fig. 3, Table 2

References 36

1	ZAMANAKOS G, TSOCHATZIDIS L, AMANATIADIS A, et al. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving [J]. Computers & Graphics, 2021, 99: 153-181.
2	FAN L, PANG Z Q, ZHANG T Y, et al. Embracing single stride 3D object detector with sparse transformer [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022: 8448-8458.
3	SUN P, TAN M X, WANG W Y, et al. SWFormer: Sparse window transformer for 3D object detection in point clouds [C]// Computer Vision – ECCV 2022, ECCV 2022, Lecture Notes in Computer Science, vol 13670. Cham: Springer, 2022: 426-442.
4	SHI G S, LI R F, MA C. PillarNet: Real-time and high-performance pillar-based 3D object detection [C]// Computer Vision – ECCV 2022, ECCV 2022, Lecture Notes in Computer Science, vol 13670. Cham: Springer, 2022: 35-52.
5	CAI Y J, LI B Y, JIAO Z Y, et al. Monocular 3D object detection with decoupled structured polygon estimation and height-guided depth estimation [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10478-10485.
6	SHI X P, YE Q, CHEN X Z, et al. Geometry-based distance decomposition for monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2021: 15172-15181.
7	LU Y, MA X Z, YANG L, et al. Geometry uncertainty projection network for monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2021: 3111-3121.
8	JI Y S, XU J H, SUN S L. A monocular depth estimation method based on object surface point heights and uncertainty: CN116843737A [P]. 2023-10-03.
9	ZHANG Y P, LU J W, ZHOU J. Objects are different: Flexible monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 3289-3298.
10	LI Z L, QU Z, ZHOU Y, et al. Diversity matters: Fully exploiting depth clues for reliable monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 2791-2800.
11	MA X Z, ZHANG Y M, XU D, et al. Delving into localization errors for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 4721-4730.
12	LI P X, ZHAO H C, LIU P F, et al. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving [C]// Computer Vision – ECCV 2020, ECCV 2020, Lecture Notes in Computer Science, vol 12348. Cham: Springer, 2020: 644-660.
13	DING M Y, HUO Y Q, YI H W, et al. Learning depth-guided convolutions for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2020: 11672-11681.
14	CHEN X Z, KUNDU K, ZHANG Z Y, et al. Monocular 3D object detection for autonomous driving [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016: 2147-2156.
15	BRAZIL G, LIU X M. M3D-RPN: Monocular 3D region proposal network for object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 9287-9296.
16	QIN Z Q, LI X. MonoGround: Detecting monocular 3D objects from the ground [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 3793-3802.
17	PENG L, WU X P, YANG Z, et al. DID-M3D: Decoupling instance depth for monocular 3D object detection [C]// Computer Vision – ECCV 2022, ECCV 2022, Lecture Notes in Computer Science, vol 13661. Cham: Springer, 2022: 71-88.
18	SHI S S, WANG X G, LI H S. PointRCNN: 3D object proposal generation and detection from point cloud [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 770-779.
19	ZHOU Y, TUZEL O. VoxelNet: End-to-end learning for point cloud based 3D object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 4490-4499.
20	LANG A H, VORA S, CAESAR H, et al. PointPillars: Fast encoders for object detection from point clouds [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 12697-12705.
21	RODDICK T, KENDALL A, CIPOLLA R. Orthographic feature transform for monocular 3D object detection [EB/OL]. (2018-11-20)[2023-10-08]. https://doi.org/10.48550/arXiv.1811.08188.
22	READING C, HARAKEH A, CHAE J, et al. Categorical depth distribution network for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 8555-8564.
23	WANG Y, CHAO W L, GARG D, et al. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 8445-8453.
24	MA X Z, WANG Z H, LI H J, et al. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 6851-6860.
25	CHONG Z Y, MA X Z, ZHANG H, et al. MonoDistill: Learning spatial features for monocular 3D object detection [EB/OL]. (2022-01-26)[2023-10-08]. https://doi.org/10.48550/arXiv.2201.10830.
26	HU M, WANG S L, LI B, et al. PENet: Towards precise and efficient image guided depth completion [C]// 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 13656-13662.
27	PHUONG M, LAMPERT C H. Towards understanding knowledge distillation [EB/OL]. (2021-05-27)[2023-10-08]. https://doi.org/10.48550/arXiv.2105.13093.
28	ANGER H O. Use of a gamma-ray pinhole camera for in vivo studies [J]. Nature, 1952, 170(4318): 200-201.
29	YU F, WANG D Q, SHELHAMER E, et al. Deep layer aggregation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 2403-2412.
30	MOUSAVIAN A, ANGUELOV D, FLYNN J, et al. 3D bounding box estimation using deep learning and geometry [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017: 7074-7082.
31	GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: The KITTI dataset [J]. The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
32	KENDALL A, GAL Y. What uncertainties do we need in Bayesian deep learning for computer vision? [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2017: 5580-5590.
33	SIMONELLI A, BULO S R, PORZI L, et al. Disentangling monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 1991-1999.
34	WANG L, ZHANG L, ZHU Y, et al. Progressive coordinate transforms for monocular 3D object detection [C]// Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021: 13364-13377.
35	HUANG K C, WU T H, SU H T, et al. MonoDTR: Monocular 3D object detection with depth-aware transformer [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 4012-4021.
36	LIAN Q, LI P L, CHEN X Z. MonoJSG: Joint semantic and geometric cost volume for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 1070-1079.

Method	t/ms	$P_{\mathrm{AP}_{\mathrm{3D}|R40}}$/%			$P_{\mathrm{AP}_{\mathrm{BEV}|R40}}$/%
		Easy	Moderate	Hard	Easy	Moderate	Hard
PCT [34]	–	21.00	13.37	11.31	29.65	19.03	15.92
CaDDN [22]	630	19.17	13.41	11.46	27.94	18.91	17.19
MonoDLE [11]	40	17.23	12.26	10.29	24.79	18.89	16.00
MonoDTR [35]	37	21.99	15.39	12.73	28.59	20.38	17.14
MonoJSG [36]	42	24.69	16.14	13.64	32.59	21.26	18.18
MonoDistill [25]	40	22.97	16.03	13.60	31.87	22.59	19.72
DID-M3D [17]	40	24.40	16.29	13.75	32.95	22.76	19.83
MonoDDE [10]	40	24.93	17.14	15.10	33.58	23.46	20.37
MonoRCNN [6]	70	18.36	12.65	10.03	25.48	18.11	14.10
MonoFlex [9]	30	19.94	13.89	12.07	28.23	19.75	16.89
GUP Net [7]	34	22.26	15.02	13.12	30.29	21.19	18.20
SHUD	40	26.56	17.25	15.11	34.67	23.52	19.21

Experiment	$z_{\mathrm{geo}}$	$\sigma_{h_{\mathrm{2D}}}$	$\sigma_{h_{\mathrm{3D}}}$	$\sigma_{\mathrm{geo}}$	$P_{\mathrm{AP}_{\mathrm{BEV}}}$ ($P_{\mathrm{AP}_{\mathrm{3D}|R40}}$)/%
					Easy	Moderate	Hard
a	√				31.49(22.48)	23.50(17.11)	19.86(14.25)
b	√	√			33.54(24.32)	25.27(18.22)	20.81(14.91)
c	√		√		33.57(24.53)	25.35(18.39)	21.07(15.17)
d	√	√	√		35.08(25.16)	25.96(18.82)	21.45(15.71)
e	√	√	√	√	36.41(26.49)	26.32(19.10)	21.76(16.04)