Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

1Technical University of Munich, Munich, Germany    2University of Technology Sydney, Sydney, Australia    3Tongji University, Shanghai, China    4Munich Center for Machine Learning (MCML)

*Indicates corresponding author

[Figure: crop_ProblemV2]

This work leverages the complementary information from events and frames for object detection. (Left) In each image pair, the left image is from frames and the right is from events. Note that event cameras excel at high-speed and high-dynamic-range sensing but struggle to capture static scenes and small, distant targets compared to RGB cameras. (Right) We compare against three methods, FPN-Fusion [48], RENet [58], and EFNet [46].

Abstract

In frame-based vision, object detection suffers substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, offering a potential solution to these problems. However, effectively fusing the two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. Its core is a coarse-to-fine fusion module, the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part bridges information between the two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conducted extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method surpasses the state-of-the-art by an impressive margin of 8.0% on the DSEC dataset. Moreover, our method exhibits significantly better robustness (69.5% versus 38.7%) when 15 different corruption types are applied to the frame images. The code is available at https://github.com/HuCaoFighting/FRN.

Framework

[Figure: crop_Architecture]

The overall architecture of our hierarchical feature refinement network. It comprises a dual-stream backbone, CAFR modules, an FPN, and a detection head. The backbone consists of two branches: an event-based ResNet-50 (bottom) and a frame-based ResNet-50 (top). CAFR refines features at each hierarchical scale, and the refined multi-scale features are forwarded to the FPN and detection head for accurate detection predictions. The FPN and detection head structures are adapted from RetinaNet.
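To make the TAFR idea concrete, the sketch below illustrates channel-level mean and variance alignment between two feature maps, in the spirit of the refinement step described above. This is a hypothetical NumPy illustration, not the authors' implementation; the function name `align_channel_stats` and the toy shapes are our own assumptions.

```python
import numpy as np

def align_channel_stats(src, ref, eps=1e-5):
    """Align the per-channel mean and variance of `src` to those of `ref`.

    src, ref: feature maps of shape (C, H, W).
    Hypothetical sketch of the channel-level statistic alignment
    described for the TAFR part; not the paper's actual code.
    """
    # per-channel first- and second-order statistics
    src_mu = src.mean(axis=(1, 2), keepdims=True)
    src_sigma = src.std(axis=(1, 2), keepdims=True)
    ref_mu = ref.mean(axis=(1, 2), keepdims=True)
    ref_sigma = ref.std(axis=(1, 2), keepdims=True)
    # normalize `src`, then rescale/shift onto `ref`'s statistics
    return (src - src_mu) / (src_sigma + eps) * ref_sigma + ref_mu

# toy usage: bring event-branch features onto the frame-branch statistics
rng = np.random.default_rng(0)
event_feat = rng.normal(5.0, 3.0, size=(2, 8, 8))
frame_feat = rng.normal(0.0, 1.0, size=(2, 8, 8))
aligned = align_channel_stats(event_feat, frame_feat)
```

In the actual CAFR module this alignment is learned and applied in both directions after the BCI interaction; the sketch only shows the statistic-matching operation itself.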

Results

[Figures: qualitative detection results]



We visualize detection results selected from the DSEC dataset. The results demonstrate that the proposed method produces satisfactory detections in a variety of challenging scenarios.

BibTeX

@InProceedings{CAFR,
  author    = {Hu Cao and Zehua Zhang and Yan Xia and Xinyi Li and Jiahao Xia and Guang Chen and Alois Knoll},
  title     = {Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection},
  booktitle = {ECCV},
  year      = {2024}
}

Acknowledgements

Website adapted from the following template.