Monocular 3D Object Detection Based on Surface Height and Uncertainty

doi:10.3969/j.issn.1000-5641.2025.01.006

Abstract:

Monocular three-dimensional (3D) object detection is a fundamental but challenging task in autonomous driving and robotic navigation. Directly predicting object depth from a single image is inherently an ill-posed problem. Geometry projection is a powerful depth estimation method that infers an object's depth from its physical height and its projected height in the image plane. However, errors in the estimated heights are amplified in the resulting depth estimate. In this study, the physical and projected heights of points on the object surface (rather than the height of the object itself) are predicted, which yields a set of depth candidates. In addition, the uncertainties of the estimated heights are predicted, and the depth candidates are combined according to these uncertainties to obtain the final object depth. Experiments demonstrate the effectiveness of this depth estimation method, which achieves state-of-the-art (SOTA) results on the monocular 3D object detection task of the KITTI dataset.
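The geometry projection described in the abstract can be illustrated with a small sketch. It assumes a pinhole camera, where depth is z = f · H / h for focal length f (pixels), physical height H (metres), and projected height h (pixels). The fusion step below uses inverse-variance weighting with first-order error propagation; this is a common stand-in chosen for illustration, not necessarily the combination rule used in the paper. All function names and numeric values are hypothetical.

```python
import math


def geometric_depth(focal_px, height_3d_m, height_2d_px):
    """Pinhole projection: depth z = f * H / h."""
    return focal_px * height_3d_m / height_2d_px


def depth_std(focal_px, height_3d_m, height_2d_px, sigma_3d_m, sigma_2d_px):
    """First-order propagation of height uncertainties into depth.

    Assumption: the errors in H and h are independent.
    """
    z = geometric_depth(focal_px, height_3d_m, height_2d_px)
    dz_dH = focal_px / height_2d_px   # partial derivative of z w.r.t. H
    dz_dh = -z / height_2d_px         # partial derivative of z w.r.t. h
    return math.sqrt((dz_dH * sigma_3d_m) ** 2 + (dz_dh * sigma_2d_px) ** 2)


def fuse_depths(depths, stds):
    """Inverse-variance weighted mean of depth candidates (assumed rule)."""
    weights = [1.0 / (s * s) for s in stds]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)


# Hypothetical surface points: (H in m, h in px, sigma_H in m, sigma_h in px)
FOCAL_PX = 720.0
surface_points = [
    (1.5, 36.0, 0.05, 1.0),
    (1.2, 30.0, 0.05, 1.0),
    (0.8, 18.0, 0.05, 2.0),
]

depths = [geometric_depth(FOCAL_PX, H, h) for H, h, _, _ in surface_points]
stds = [depth_std(FOCAL_PX, H, h, sH, sh) for H, h, sH, sh in surface_points]
fused = fuse_depths(depths, stds)

# Error amplification: the same 1 px error in h costs more depth at long range.
near_error = geometric_depth(FOCAL_PX, 1.5, 35.0) - geometric_depth(FOCAL_PX, 1.5, 36.0)
far_error = geometric_depth(FOCAL_PX, 1.5, 17.0) - geometric_depth(FOCAL_PX, 1.5, 18.0)

print(depths, stds, fused, near_error, far_error)
```

Because dz/dh = -z/h, a fixed pixel error in the projected height produces a depth error that grows with distance, which is why the third candidate (small h, larger sigma_h) receives the lowest weight in the fused estimate.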

Key words: monocular 3D object detection (Mono3D), depth estimation, geometry projection, autonomous driving

CLC number:

TP183

Yinshuai JI, Jinhua XU. Surface-height- and uncertainty-based depth estimation for Mono3D[J]. Journal of East China Normal University (Natural Science), 2025, 2025(1): 72-81.

Figures and tables (5): Figure 1, Figure 2, Table 1, Figure 3, Table 2

References (36)

1	ZAMANAKOS G, TSOCHATZIDIS L, AMANATIADIS A, et al. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving [J]. Computers & Graphics, 2021, 99: 153-181.
2	FAN L, PANG Z Q, ZHANG T Y, et al. Embracing single stride 3D object detector with sparse transformer [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 8448-8458.
3	SUN P, TAN M X, WANG W Y, et al. SWFormer: Sparse window transformer for 3D object detection in point clouds [C]// Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol 13670. Cham: Springer, 2022: 426-442.
4	SHI G S, LI R F, MA C. PillarNet: Real-time and high-performance pillar-based 3D object detection [C]// Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol 13670. Cham: Springer, 2022: 35-52.
5	CAI Y J, LI B Y, JIAO Z Y, et al. Monocular 3D object detection with decoupled structured polygon estimation and height-guided depth estimation [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10478-10485.
6	SHI X P, YE Q, CHEN X Z, et al. Geometry-based distance decomposition for monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2021: 15172-15181.
7	LU Y, MA X Z, YANG L, et al. Geometry uncertainty projection network for monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2021: 3111-3121.
8	吉银帅, 续晋华, 孙仕亮. 一种基于目标表面点高度和不确定性的单目深度估计方法: CN116843737A [P]. 2023-10-03.
9	ZHANG Y P, LU J W, ZHOU J. Objects are different: Flexible monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 3289-3298.
10	LI Z L, QU Z, ZHOU Y, et al. Diversity matters: Fully exploiting depth clues for reliable monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 2791-2800.
11	MA X Z, ZHANG Y M, XU D, et al. Delving into localization errors for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 4721-4730.
12	LI P X, ZHAO H C, LIU P F, et al. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving [C]// Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol 12348. Cham: Springer, 2020: 644-660.
13	DING M Y, HUO Y Q, YI H W, et al. Learning depth-guided convolutions for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2020: 11672-11681.
14	CHEN X Z, KUNDU K, ZHANG Z Y, et al. Monocular 3D object detection for autonomous driving [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016: 2147-2156.
15	BRAZIL G, LIU X M. M3D-RPN: Monocular 3D region proposal network for object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 9287-9296.
16	QIN Z Q, LI X. MonoGround: Detecting monocular 3D objects from the ground [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 3793-3802.
17	PENG L, WU X P, YANG Z, et al. DID-M3D: Decoupling instance depth for monocular 3D object detection [C]// Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol 13661. Cham: Springer, 2022: 71-88.
18	SHI S S, WANG X G, LI H S. PointRCNN: 3D object proposal generation and detection from point cloud [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 770-779.
19	ZHOU Y, TUZEL O. VoxelNet: End-to-end learning for point cloud based 3D object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 4490-4499.
20	LANG A H, VORA S, CAESAR H, et al. PointPillars: Fast encoders for object detection from point clouds [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 12697-12705.
21	RODDICK T, KENDALL A, CIPOLLA R. Orthographic feature transform for monocular 3D object detection [EB/OL]. (2018-11-20)[2023-10-08]. https://doi.org/10.48550/arXiv.1811.08188.
22	READING C, HARAKEH A, CHAE J, et al. Categorical depth distribution network for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 8555-8564.
23	WANG Y, CHAO W L, GARG D, et al. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 8445-8453.
24	MA X Z, WANG Z H, LI H J, et al. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 6851-6860.
25	CHONG Z Y, MA X Z, ZHANG H, et al. MonoDistill: Learning spatial features for monocular 3D object detection [EB/OL]. (2022-01-26)[2023-10-08]. https://doi.org/10.48550/arXiv.2201.10830.
26	HU M, WANG S L, LI B, et al. PENet: Towards precise and efficient image guided depth completion [C]// 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 13656-13662.
27	PHUONG M, LAMPERT C H. Towards understanding knowledge distillation [EB/OL]. (2021-05-27)[2023-10-08]. https://doi.org/10.48550/arXiv.2105.13093.
28	ANGER H O. Use of a gamma-ray pinhole camera for in vivo studies [J]. Nature, 1952, 170(4318): 200-201.
29	YU F, WANG D Q, SHELHAMER E, et al. Deep layer aggregation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 2403-2412.
30	MOUSAVIAN A, ANGUELOV D, FLYNN J, et al. 3D bounding box estimation using deep learning and geometry [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017: 7074-7082.
31	GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: The KITTI dataset [J]. The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
32	KENDALL A, GAL Y. What uncertainties do we need in Bayesian deep learning for computer vision? [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2017: 5580-5590.
33	SIMONELLI A, BULO S R, PORZI L, et al. Disentangling monocular 3D object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2019: 1991-1999.
34	WANG L, ZHANG L, ZHU Y, et al. Progressive coordinate transforms for monocular 3D object detection [C]// Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021: 13364-13377.
35	HUANG K C, WU T H, SU H T, et al. MonoDTR: Monocular 3D object detection with depth-aware transformer [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 4012-4021.
36	LIAN Q, LI P L, CHEN X Z. MonoJSG: Joint semantic and geometric cost volume for monocular 3D object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022: 1070-1079.

Table 1
Method	t/ms	$P_{\mathrm{AP}_{\mathrm{3D}|R40}}$/%			$P_{\mathrm{AP}_{\mathrm{BEV}|R40}}$/%		
		Easy	Moderate	Hard	Easy	Moderate	Hard
PCT [34]	–	21.00	13.37	11.31	29.65	19.03	15.92
CaDDN [22]	630	19.17	13.41	11.46	27.94	18.91	17.19
MonoDLE [11]	40	17.23	12.26	10.29	24.79	18.89	16.00
MonoDTR [35]	37	21.99	15.39	12.73	28.59	20.38	17.14
MonoJSG [36]	42	24.69	16.14	13.64	32.59	21.26	18.18
MonoDistill [25]	40	22.97	16.03	13.60	31.87	22.59	19.72
DID-M3D [17]	40	24.40	16.29	13.75	32.95	22.76	19.83
MonoDDE [10]	40	24.93	17.14	15.10	33.58	23.46	20.37
MonoRCNN [6]	70	18.36	12.65	10.03	25.48	18.11	14.10
MonoFlex [9]	30	19.94	13.89	12.07	28.23	19.75	16.89
GUP Net [7]	34	22.26	15.02	13.12	30.29	21.19	18.20
SHUD	40	26.56	17.25	15.11	34.67	23.52	19.21

Table 2
Experiment	$z_{\mathrm{geo}}$	$\sigma_{h_{\mathrm{2D}}}$	$\sigma_{h_{\mathrm{3D}}}$	$\sigma_{\mathrm{geo}}$	$P_{\mathrm{AP}_{\mathrm{BEV}}}$ ($P_{\mathrm{AP}_{\mathrm{3D}|R40}}$)/%		
					Easy	Moderate	Hard
a	√				31.49(22.48)	23.50(17.11)	19.86(14.25)
b	√	√			33.54(24.32)	25.27(18.22)	20.81(14.91)
c	√		√		33.57(24.53)	25.35(18.39)	21.07(15.17)
d	√	√	√		35.08(25.16)	25.96(18.82)	21.45(15.71)
e	√	√	√	√	36.41(26.49)	26.32(19.10)	21.76(16.04)
