Technique Report of CVPR 2024 PBDL Challenges

Ying Fu    Yu Li    Shaodi You    Boxin Shi    Linwei Chen    Yunhao Zou    Zichun Wang    Yichen Li    Yuze Han    Yingkai Zhang    Jianan Wang    Qinglin Liu    Wei Yu    Xiaoqian Lv    Jianing Li    Shengping Zhang    Xiangyang Ji    Yuanpei Chen    Yuhan Zhang    Weihang Peng    Liwen Zhang    Zhe Xu    Dingyong Gou    Cong Li    Senyan Xu    Yunkang Zhang    Siyuan Jiang    Xiaoqiang Lu    Licheng Jiao    Fang Liu    Xu Liu    Lingling Li    Wenping Ma    Shuyuan Yang    Haiyang Xie    Jian Zhao    Shihua Huang    Peng Cheng    Xi Shen    Zheng Wang    Shuai An    Caizhi Zhu    Xuelong Li    Tao Zhang    Liang Li    Yu Liu    Chenggang Yan    Gengchen Zhang    Linyan Jiang    Bingyi Song    Zhuoyu An    Haibo Lei    Qing Luo    Jie Song    Yuan Liu    Qihang Li    Haoyuan Zhang    Lingfeng Wang    Wei Chen    Aling Luo    Cheng Li    Jun Cao    Shu Chen    Zifei Dou    Xinyu Liu    Jing Zhang    Kexin Zhang    Yuting Yang    Xuejian Gou    Qinliang Wang    Yang Liu    Shizhan Zhao    Yanzhao Zhang    Libo Yan    Yuwei Guo    Guoxin Li    Qiong Gao    Chenyue Che    Long Sun    Xiang Chen    Hao Li    Jinshan Pan    Chuanlong Xie    Hongming Chen    Mingrui Li    Tianchen Deng    Jingwei Huang    Yufeng Li    Fei Wan    Bingxin Xu    Jian Cheng    Hongzhe Liu    Cheng Xu    Yuxiang Zou    Weiguo Pan    Songyin Dai    Sen Jia    Junpei Zhang    Puhua Chen
Abstract

The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the image formation process to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held as part of the CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.

Ying Fu, Yu Li, Shaodi You and Boxin Shi are the challenge organizers. Ying Fu is with Beijing Institute of Technology, Yu Li is with International Digital Economy Academy, Shaodi You is with University of Amsterdam, Boxin Shi is with Peking University.

1 Introduction

The integration of physics-based vision with deep learning offers a powerful paradigm for addressing complex computer vision problems. Physics-based vision seeks to model and invert physical processes to recover scene properties such as shape [110, 86], reflectance [32, 75], and light distribution [8] from images. Deep learning, on the other hand, excels at learning representations and patterns from large datasets. Combining these approaches allows for the development of models that are not only data-driven but also grounded in physical principles, leading to enhanced performance in various vision tasks such as object recognition [44], scene understanding [18], and image restoration [79, 107].

To explore the potential of this integrated approach, we organized a comprehensive challenge at CVPR 2024, held in conjunction with the Physics-Based Vision Meets Deep Learning (PBDL) workshop. The challenge comprised eight tracks, divided into two main categories: Low-Light Enhancement and Detection, and High Dynamic Range (HDR) Imaging. Each track was designed to address specific challenges in the field and to stimulate innovation in both theoretical and practical aspects. For instance, low-light enhancement aims to improve image visibility in poorly lit environments, which is crucial for applications like autonomous driving and surveillance [99]. HDR imaging, on the other hand, focuses on capturing a wider range of luminance levels to produce more realistic and detailed images, which is essential for photography and cinematography [23]. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches. In the following, we present an overview of the individual tracks.

1.1 Low-Light Enhancement and Detection Challenge

  1. Low-light Object Detection and Instance Segmentation: This track aimed to improve the robustness of object detection and instance segmentation algorithms in low-light conditions. Participants developed methods to handle noise, color distortion, and detail loss, common issues in low-light environments.

  2. Low-light Raw Video Denoising with Realistic Motion: Focusing on enhancing video quality in low-light conditions, this track involved denoising raw video sequences with realistic motion. The goal was to reduce noise while preserving motion integrity.

  3. Low-light sRGB Image Enhancement: This track targeted the enhancement of sRGB images captured in low-light conditions. Participants worked on methods to recover normal-light images from very dim environments, addressing noise, color bias, and over-exposure issues.

  4. Extreme Low-Light Image Denoising: Participants in this track aimed to develop algorithms capable of denoising images captured under extremely low-light conditions, pushing the boundaries of what is achievable in terms of noise reduction and detail preservation.

  5. Low-light Raw Image Enhancement: This track focused on enhancing raw images captured in low-light scenarios. By leveraging the higher bit-depth of raw data, participants aimed to improve the overall image quality significantly.

1.2 High Dynamic Range Imaging Challenge

  1. HDR Reconstruction from a Single Raw Image: This track aimed at reconstructing high dynamic range images from single raw images. The challenge was to avoid potential misalignments common in multi-image fusion techniques while capturing a broad spectrum of intensity levels.

  2. Highspeed HDR Video Reconstruction from Events: Participants developed methods to reconstruct HDR videos from event-based camera data. The goal was to combine the high temporal resolution of event cameras with HDR imaging techniques.

  3. Raw Image Based Over-Exposure Correction: This track focused on correcting over-exposed regions in raw images. Participants aimed to develop techniques to recover details in both over- and under-exposed areas, resulting in visually pleasing and information-rich images.

1.3 Summary of Challenge Outcomes

The challenge attracted numerous teams from around the world, each bringing innovative approaches to tackle these complex problems. This report provides a comprehensive review of the methodologies and results for each track, highlighting the top-performing solutions. The participating teams demonstrated significant advancements in low-light enhancement and HDR imaging, showcasing the potential of combining physics-based vision with deep learning. The top three methods for each track are detailed, offering insights into the state-of-the-art techniques and their practical applications.

Through this challenge, we have not only advanced the field of computer vision but also demonstrated the mutual benefits of integrating physics-based models with deep learning. The results of this challenge pave the way for future research and development in this exciting interdisciplinary area. The following sections will delve into each track individually, presenting the objectives, methodologies, and outcomes in detail.

2 Low-light Object Detection and Instance Segmentation

Performing object detection and instance segmentation [28] under low-light conditions poses several challenges. For example, images captured in low-light environments often suffer from poor quality, including loss of detail, color distortion, and prominent noise. These factors significantly hinder the performance of downstream vision tasks, particularly object detection and instance segmentation.

To address this challenge, the CVPR 2024 PBDL Challenge on Low-light Object Detection and Instance Segmentation aims to assess and enhance the robustness of object detection and instance segmentation algorithms on images captured in low-light environmental conditions.

In the low-light object detection track (Table 1), the top three teams demonstrated exceptional performance. Both GroundTruth and Xocean secured the 1st rank, achieving an average precision (AP) score of 0.76. They displayed remarkable accuracy in detecting objects under low-light conditions, with AP scores of 0.89 and 0.81 at IoU thresholds of 0.50 and 0.75, respectively. UnoWhoiam secured the 3rd rank with an AP score of 0.75, showcasing their strong performance in this challenging task.

For low-light instance segmentation (Table 2), the competition was equally intense. GroundTruth achieved the 1st rank with a mask AP score of 0.62, demonstrating their excellent ability to segment instances accurately in low-light images. UnoWhoiam secured the 2nd rank with a mask AP score of 0.59, while Xocean secured the 3rd rank with a mask AP score of 0.58. Both teams exhibited impressive performance in low-light instance segmentation, further emphasizing the significance of their contributions.

These results highlight the remarkable advancements made by the participating teams in addressing the challenges of low-light object detection and instance segmentation. The top-ranking teams have showcased their expertise and innovation in developing robust algorithms that excel in low-light conditions, paving the way for future advancements in computer vision research.
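The AP50 and AP75 columns in Tables 1 and 2 count a prediction as correct only when its intersection over union (IoU) with a ground-truth object reaches 0.50 or 0.75, respectively. The following is a minimal sketch of that matching test for axis-aligned boxes. The function names `box_iou` and `is_match` and the example boxes are illustrative assumptions, not part of the challenge evaluation code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, iou_thr):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred, gt) >= iou_thr


# Hypothetical boxes: the prediction is shifted 2 pixels to the right.
gt = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 0.0, 12.0, 10.0)
print(box_iou(pred, gt))          # intersection 80, union 120
print(is_match(pred, gt, 0.50))   # counts toward AP50
print(is_match(pred, gt, 0.75))   # too loose for AP75
```

In this example the same prediction is a hit at the 0.50 threshold and a miss at 0.75, which is why AP75 is never higher than AP50 for the same set of predictions.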

Table 1: Leaderboard of the low-light object detection.
Rank Team AP$^{box}$ AP$^{box}_{50}$ AP$^{box}_{75}$
1 GroundTruth 0.76 0.89 0.81
1 Xocean 0.76 0.89 0.81
3 UnoWhoiam 0.75 0.94 0.86
Table 2: Leaderboard of the low-light instance segmentation.
Rank Team AP$^{mask}$ AP$^{mask}_{50}$ AP$^{mask}_{75}$
1 GroundTruth 0.62 0.82 0.65
2 UnoWhoiam 0.59 0.87 0.61
3 Xocean 0.58 0.79 0.61
Figure 1: Example scenes in the LIS dataset. Four image types (long-exposure normal-light and short-exposure low-light images in both RAW and sRGB formats) are captured for each scene.

2.1 Low-light Instance Segmentation Dataset

To systematically investigate the effectiveness of the proposed methods in real-world conditions, a real low-light image dataset for instance segmentation is necessary and fundamental. The challenge utilizes the Low-light Instance Segmentation (LIS) dataset, introduced by [113, 18].

It is collected using a Canon EOS 5D Mark IV camera. Figure 1 shows examples of annotated images from the LIS dataset. The LIS dataset exhibits the following characteristics:

  • Paired samples. The LIS dataset includes images in both sRGB-JPEG (typical camera output) and RAW formats. Each format consists of paired short-exposure low-light and corresponding long-exposure normal-light images. We term these four types of images sRGB-dark, sRGB-normal, RAW-dark, and RAW-normal. To ensure pixel-wise alignment, we mounted the camera on a sturdy tripod and used remote control via a mobile app to avoid vibrations.

  • Diverse scenes. The LIS dataset consists of 2230 image pairs collected in various indoor and outdoor scenes. To increase the diversity of low-light conditions, we used a series of ISO levels (e.g., 800, 1600, 3200, 6400) to capture long-exposure reference images and deliberately decreased the exposure time by various low-light factors (e.g., 10, 20, 30, 40, 50, 100) to capture short-exposure images, simulating very low-light conditions.

  • Instance-level pixel-wise labels. For each image pair, we provide precise instance-level pixel-wise labels annotated by professional annotators. This results in 10,504 labeled instances across eight common object classes: bicycle, car, motorcycle, bus, bottle, chair, dining table, and TV.
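The relation between the long-exposure reference and the short-exposure capture can be illustrated with simple arithmetic. This sketch assumes that "decreasing the exposure time by a low-light factor" means dividing the reference exposure time by that factor. The 1-second reference exposure and the function name `short_exposure_time` are hypothetical.

```python
def short_exposure_time(long_exposure_s, low_light_factor):
    """Exposure time of the low-light shot, assuming a simple division by the factor."""
    if low_light_factor <= 0:
        raise ValueError("low_light_factor must be positive")
    return long_exposure_s / low_light_factor


# Low-light factors listed in the dataset description.
factors = [10, 20, 30, 40, 50, 100]
long_exposure_s = 1.0  # hypothetical reference exposure, in seconds

times = {f: short_exposure_time(long_exposure_s, f) for f in factors}
for factor, t in times.items():
    print(f"factor {factor:>3}: {t:.4f} s")
```

Under this assumption, a factor of 100 leaves the sensor with roughly one hundredth of the light collected for the reference image, which is what makes noise dominate the short-exposure frames.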

The LIS dataset includes images captured in different scenes (indoor and outdoor) and under varying illumination conditions. As shown in Figure 1, object occlusion and densely distributed objects add to the challenges presented by the low-light conditions.

2.2 GroundTruth Team’s Method

2.2.1 Network Architecture

Figure 2: Framework of DINO [128].

Object detection. DINO [128] is adopted as the detector. It uses contrastive denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction in an end-to-end manner, as shown in Figure 2. The FocalNet-Large [114] backbone is used to extract informative features; it introduces focal attention to additionally aggregate summarized visual tokens from far away, capturing coarse-grained and long-range visual dependencies, as shown in Figure 3. To increase the receptive field of each RoI feature, we apply RoI pooling to the feature map of the corresponding level to obtain a global context feature, which is then added to the RoI feature of that level. We also add SyncBN to each box head to make the training process more stable.

Instance segmentation. HTC [13] is adopted as our detector. It learns more discriminative features progressively while integrating complementary features in each stage, as shown in Figure 5. To simplify its use, we directly employ the original masks of objects as semantic maps. The ViT-adapter [20] backbone is used to introduce image-related inductive bias into a plain ViT [26], which allows the plain ViT to achieve performance comparable to vision-specific transformers, as shown in Figure 6. To increase the receptive field of each RoI feature, we apply RoI pooling to the feature map of the corresponding level to obtain a global context feature, which is then added to the RoI feature of that level. We also add SyncBN to each box head to make the training process more stable.
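The global context enhancement described in both paragraphs above can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the report: the feature map has a single channel, average pooling over integer bins stands in for the actual RoI pooling operator, and the global context feature is obtained by pooling a box that covers the whole feature map. The names `roi_avg_pool` and `enhance_with_global_context` are hypothetical.

```python
def roi_avg_pool(feat, box, out_size):
    """Average-pool region box=(x1, y1, x2, y2) of a 2D map into out_size x out_size bins."""
    x1, y1, x2, y2 = box
    pooled = []
    for by in range(out_size):
        ya = y1 + (y2 - y1) * by // out_size
        yb = max(y1 + (y2 - y1) * (by + 1) // out_size, ya + 1)
        row = []
        for bx in range(out_size):
            xa = x1 + (x2 - x1) * bx // out_size
            xb = max(x1 + (x2 - x1) * (bx + 1) // out_size, xa + 1)
            vals = [feat[y][x] for y in range(ya, yb) for x in range(xa, xb)]
            row.append(sum(vals) / len(vals))
        pooled.append(row)
    return pooled


def enhance_with_global_context(feat, rois, out_size=2):
    """Add a pooled whole-map feature to every pooled RoI feature."""
    height, width = len(feat), len(feat[0])
    global_ctx = roi_avg_pool(feat, (0, 0, width, height), out_size)
    enhanced = []
    for box in rois:
        roi_feat = roi_avg_pool(feat, box, out_size)
        enhanced.append([
            [r + g for r, g in zip(roi_row, ctx_row)]
            for roi_row, ctx_row in zip(roi_feat, global_ctx)
        ])
    return enhanced


# Toy 4x4 feature map with values 0..15 and one 2x2 RoI in the top-left corner.
feat = [[y * 4 + x for x in range(4)] for y in range(4)]
enhanced = enhance_with_global_context(feat, rois=[(0, 0, 2, 2)])
print(enhanced)
```

Because the context feature has the same spatial size as each RoI feature, the two can be combined by element-wise addition without any extra parameters.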

Figure 3: Framework of FocalNet [114].
Figure 4: Visual results of our method on the testing set.

2.2.2 Implementation Details

Dataset usage. The challenge uses the Low-light Instance Segmentation (LIS) dataset, introduced by [18], which contains 892 labeled images as the training set and 669 images as the testing set. The LIS dataset comprises paired images collected across various scenes, encompassing both indoor and outdoor environments. We utilize all labeled data for training and do not perform online evaluations during training. After training, we directly use the last checkpoint to predict the testing data.

Training details. During training, we take the model pre-trained on the Objects365 dataset and fine-tuned on the COCO dataset as the pre-trained model. Specifically, our model is trained on 8 NVIDIA Tesla V100-32G GPUs with a total batch size of 8, 900 queries, and 100 proposals. Since the training set is small, we train the detector using the AdamW optimizer with an initial learning rate of 0.0001 and weight decay of 0.0001 to alleviate overfitting. We employ the standard 1× schedule to train the model, with random horizontal flipping (probability 0.5) and random resize-crop-resize as weak augmentation.

Testing details. During testing, simple test-time augmentations, namely horizontal flipping and multi-scale testing, are used, with scales of ×1.0, ×1.125, ×1.25, ×1.375, and ×1.5. NMS is not adopted, and the detector directly outputs 100 box predictions end to end. Specifically, the initial test image size is 1333×800, and horizontal flipping is adopted to boost model performance. After obtaining ten predictions with the different augmentations, we use weighted boxes fusion (WBF) [92] to ensemble them as our final submission, which achieves an AP of 0.76 in the test phase.
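The test-time procedure above involves two steps: mapping boxes predicted on rescaled and flipped images back to the original frame, and fusing overlapping boxes. The sketch below illustrates both for a single class. It is a simplified stand-in for WBF [92], not the team's implementation: the fused confidence is a plain average, the IoU threshold of 0.55 is an assumed value, and all function names and example boxes are hypothetical.

```python
def to_original(box, scale, flipped, scaled_width):
    """Map a box from a rescaled (and possibly flipped) image back to the original frame."""
    x1, y1, x2, y2 = box
    if flipped:
        # Undo the horizontal flip in the rescaled image's coordinates.
        x1, x2 = scaled_width - x2, scaled_width - x1
    return (x1 / scale, y1 / scale, x2 / scale, y2 / scale)


def iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_box(cluster):
    """Confidence-weighted average of the boxes in a cluster, with the mean score."""
    total = sum(score for _, score in cluster)
    box = tuple(sum(b[k] * s for b, s in cluster) / total for k in range(4))
    return box, total / len(cluster)


def fuse(detections, iou_thr=0.55):
    """Greedily cluster boxes by IoU with the running fused box, then average each cluster."""
    clusters, fused = [], []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for i, (fused_box, _) in enumerate(fused):
            if iou(fused_box, box) > iou_thr:
                clusters[i].append((box, score))
                fused[i] = weighted_box(clusters[i])
                break
        else:
            clusters.append([(box, score)])
            fused.append((box, score))
    return fused


# Hypothetical predictions for one object in a 100-pixel-wide image.
plain = ((10.0, 10.0, 50.0, 50.0), 0.9)  # scale x1.0, no flip
flipped = (to_original((72.0, 15.0, 132.0, 75.0), 1.5, True, 150.0), 0.6)  # scale x1.5, flipped
fused = fuse([plain, flipped])
print(fused)
```

Unlike NMS, which keeps one box per cluster and discards the rest, this style of fusion lets every augmented prediction contribute to the final coordinates.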

In addition, we attempted to introduce some advanced low-light image enhancement methods, such as CIDNet [30], GlobalDiff [45], and Retinexformer [9], to enhance the challenge data, and then ran the detection algorithm on the enhanced images. Unfortunately, performance did not improve and in some cases decreased. We argue that, because the challenge dataset does not provide pairs of low-light and normal-light images, these enhancement methods have to be used for cross-domain inference, which corrupts the distributional information in the data and ultimately degrades detection performance.

Some visual results of our method on the testing set are shown in Figure 4.

Figure 5: Framework of HTC [13].
Figure 6: Framework of ViT-adapter [20].

2.3 Xocean Team’s Method

2.3.1 Network Architecture

Object detection. Several prior studies [17, 44, 33, 16] have endeavored to enhance image recognition performance in extreme conditions. Although these methods outperform their respective baselines, we observed that conventional methods achieve comparable effectiveness on the dataset in this challenge while being straightforward to implement. Consequently, we adopt a simplified approach by treating the low-light images from the challenge dataset as conventional RGB images.

As shown in Figure 7, we trained several detectors, including RTMDet [71], YOLOX [38], DINO [128], and Co-DETR [132], on the challenge dataset, and then ensembled the predictions from those models to achieve better results. We employed Weighted Boxes Fusion [91] as our ensemble method.

Figure 7: Overview of the proposed object detection framework. Several detectors, including RTMDet [71], YOLOX [38], DINO [128], and Co-DETR [132], are trained on the challenge dataset, and their predictions are ensembled with Weighted Boxes Fusion [91].

RTMDet [71] is an efficient real-time object detector that surpasses the YOLO series. Apart from adjusting the number of output classes, we made no modifications to RTMDet. RTMDet-x and RTMDet-l models were chosen due to their high mAP on the COCO dataset.

YOLOX [38] is a highly advanced detector that represents a significant improvement upon the YOLO series. Apart from adjusting the number of output classes, we made no modifications to YOLOX. Taking into account both performance and training costs, we opted for YOLOX-l.

DINO [128] is an advanced end-to-end object detector. Apart from adjusting the number of output classes, we made no modifications to DINO. DINO-Swin-L model was chosen due to its high mAP on the COCO dataset.

Co-DETR [132] is a novel training scheme aimed at improving the efficiency and effectiveness of DETR-based detectors. Apart from adjusting the number of output classes, we made no modifications to Co-DETR.

Instance segmentation. We trained a single RTMDet [71] model for instance segmentation without employing any ensemble methods.

2.3.2 Implementation Details

Dataset usage. We solely utilized the challenge dataset for training. Additionally, we attempted to augment our training data by incorporating the COCO dataset which was unprocessed according to [7], preserving annotations with common classes. However, this augmentation did not yield improved results. It is necessary to point out that we still utilized the pretrained weights on the unprocessed [7] COCO dataset to initialize some of the models, aiming to enhance the diversity of our model zoo, which proves advantageous for ensemble methods.

During the initial phase of the challenge, only annotations for the training set were available. Initially, we randomly divided the training set into a proxy training set and a validation set using an 8:2 ratio. Subsequently, we trained the models and optimized the training settings to enhance performance. These settings were then uniformly applied for training on the original complete training dataset, ensuring full utilization of the available data.
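The 8:2 proxy split described above can be sketched as follows. The count of 892 labeled training images is taken from Section 2.2.2; the random seed, the use of integer image IDs, and the function name `split_train_val` are assumptions made for illustration.

```python
import random


def split_train_val(image_ids, val_ratio=0.2, seed=0):
    """Randomly split image IDs into a proxy training set and a proxy validation set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_val = int(round(len(ids) * val_ratio))
    return ids[n_val:], ids[:n_val]


# 892 labeled training images, identified here by hypothetical integer IDs.
train_ids, val_ids = split_train_val(range(892))
print(len(train_ids), len(val_ids))
```

Once the training settings have been chosen on the proxy split, the same settings are reused to train on all 892 images, so the held-out portion is not wasted in the final models.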

Training. To achieve higher performance, we initialized the model weights using pretrained weights from the COCO [63] dataset. However, Co-DETR [132] was an exception, as we found that the pretrained weights obtained by training first on Objects365 [88] and then on COCO [63] performed better than those from COCO [63] alone.

During the validation and test phases, we retained the weights from the last epoch for evaluation on the official validation and test sets.

We utilized the MMDetection framework [14] to conduct all experiments on 4 machines, each equipped with 8 NVIDIA RTX 3090/4090 GPUs.

Due to the extensive nature of our training process, which involved training over 18 models for ensemble, providing detailed training configurations in this paper may not be feasible. We recommend referring to the config files in our code repository for more comprehensive information.

Ensemble. Table 3 briefly describes the type of model and any specific strategies employed. For example, "Dino-Swin-L" signifies the use of the Dino model with the Swin-L backbone, while "Dino-Swin-L with TTA" indicates the same model enhanced by test-time augmentation (TTA). Additionally, the descriptions encompass different versions of the RTMDet and Co-DETR models, which may incorporate varying parameters such as dropout rates or random seeds during the training phase. "obj2coco" indicates that we use pretrained weights obtained by training first on Objects365 [88] and then on COCO [63] to initialize the parameters of the model.

These predictions were then combined using weighted boxes fusion. The weight of each prediction was determined using a grid search on the proxy validation set described in Section 2.3.2.

For more details about the ensemble process, please refer to the configuration files in our code project.
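The weight search described above can be sketched as an exhaustive grid search over candidate weights. In the actual pipeline the scoring function would run weighted boxes fusion with the given weights and compute mAP on the proxy validation set. Here a toy function stands in for it, so `toy_evaluate`, its target values, and the candidate grid are all hypothetical.

```python
from itertools import product


def grid_search_weights(num_models, candidates, evaluate):
    """Try every combination of candidate weights and keep the best-scoring one."""
    best_weights, best_score = None, float("-inf")
    for weights in product(candidates, repeat=num_models):
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def toy_evaluate(weights):
    """Stand-in for 'fuse with these weights, then compute validation mAP'."""
    target = (1, 8, 10)  # hypothetical optimum, for demonstration only
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = grid_search_weights(3, [1, 8, 10], toy_evaluate)
print(best_weights, best_score)
```

An exhaustive grid grows exponentially with the number of models, so for an 18-model ensemble a coarser or grouped search (for example, sharing one weight among models of the same type) would likely be needed; the report does not specify which variant was used.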

Table 3: Ensemble strategy. "Dino-Swin-L" signifies the use of the Dino model with the Swin-L backbone, while "Dino-Swin-L with TTA" indicates the same model enhanced by test-time augmentation (TTA). Additionally, the descriptions encompass different versions of the RTMDet and Co-DETR models, which may incorporate varying parameters such as dropout rates or random seeds during the training phase. "obj2coco" indicates that we use pretrained weights obtained by training first on Objects365 [88] and then on COCO [63] to initialize the parameters of the model.
ID Weight Description
1 1 Dino-Swin-L
2 1 Dino-Swin-L with TTA
3 1 RTMDet-l
4 1 RTMDet-l
5 1 RTMDet-l
6 1 RTMDet-l
7 1 RTMDet-l
8 1 RTMDet-l
9 1 RTMDet-x
10 1 RTMDet-l
11 1 RTMDet-l
12 1 RTMDet-l
13 1 RTMDet-l
14 1 YOLOX-l with TTA
15 8 Co-DETR
16 8 Co-DETR
17 10 Co-DETR-dropout0.6-obj2coco
18 10 Co-DETR-dropout0.3

As shown in Table 4, our ensemble method for the Object Detection track attained a mean Average Precision (mAP) of 0.76. Additionally, our RTMDet model for the Instance Segmentation track achieved an mAP of 0.58.

Table 4: Results of our methods
mAP mAP50
Object Detection 0.76 0.89
Instance Segmentation 0.58 0.79

2.4 UnoWhoiam Team’s Method

2.4.1 Network Architecture

Object detection. We utilized DINO as our foundational network. As shown in Figure 2, DINO is an advanced end-to-end Transformer detector that employs several innovative techniques, including contrastive denoising training, look forward twice, and mixed query selection. These techniques significantly enhance both training efficiency and detection performance. We chose DINO for our competition due to its demonstrated efficiency and robustness in handling complex detection tasks. Its high performance on benchmark datasets makes it an ideal choice for achieving competitive results in the "Low-light Object Detection and Instance Segmentation" competition.

Figure 8: Framework of Mask DINO [59].

Instance segmentation. We utilized Mask DINO [59] as our foundational network. As shown in Figure 8, Mask DINO is a unified Transformer-based framework designed for both object detection and image segmentation. This network is an extension of DINO, which was originally developed for detection, and adapts it to handle segmentation tasks with minimal modifications to key components. Mask DINO stands out due to its superior performance, outperforming previous specialized models and achieving the best results in instance, panoptic, and semantic segmentation tasks among models with fewer than one billion parameters.

One of the critical advantages of Mask DINO is its ability to enable task cooperation, demonstrating that detection and segmentation can mutually enhance each other within query-based models. Additionally, Mask DINO leverages better visual representations pre-trained on large-scale detection datasets to improve semantic and panoptic segmentation. This synergistic approach not only enhances the performance but also provides a robust and versatile framework capable of handling multiple vision tasks effectively. By employing Mask DINO, we aim to leverage these strengths to achieve superior results in the “Low-light Object Detection and Instance Segmentation” competition.

Feature alignment. We integrated the Feature-aligned Pyramid Network (FaPN) [47] to enhance our network for both object detection and instance segmentation. FaPN is a simple yet effective top-down pyramidal architecture designed to generate multi-scale features for dense image prediction. FaPN comprises two key modules: a feature alignment module and a feature selection module. The feature alignment module learns transformation offsets of pixels to contextually align upsampled higher-level features, while the feature selection module emphasizes lower-level features rich in spatial details. Empirical results show that FaPN consistently and substantially improves performance over the original FPN across four dense prediction tasks and three datasets.

We chose FaPN for our competition due to its demonstrated ability to improve multi-scale feature generation. Its integration into our network aims to leverage these strengths, thereby enhancing our model’s accuracy in the competition.

Figure 9: Framework of Disturbance Suppression Learning [18].

2.4.2 Training and Testing Details

Training details. During training, we use a model pre-trained on the Objects365 dataset and fine-tuned on the COCO dataset as our base. Our training setup includes 8 RTX 3090 GPUs, with a total batch size of 8. All other settings are kept the same as in the original paper. We follow the standard 1× training schedule and apply weak data augmentation techniques, including random horizontal flipping with a probability of 0.5 and random resize-crop-resize.

Disturbance suppression learning. When fine-tuning on COCO, we utilize the low-light RAW synthetic pipeline from [18], which consists of two steps, namely unprocessing and noise injection, to obtain synthetic low-light clean/noisy RAW image pairs. We adopt disturbance suppression learning from previous work [18]. Ideally, a robust network should extract similar features whether the input image is corrupted by noise or not. To achieve this, disturbance suppression learning encourages the network to learn disturbance-invariant features during training. This approach is independent of architectural considerations.
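The noise injection step that produces a noisy counterpart of a clean synthetic RAW image can be sketched as below. This is an illustration only: it assumes a signal-dependent Gaussian approximation of shot and read noise, which is a common simplification and not necessarily the noise model used in [18]. The parameter values, the flat-list image representation, and the function name `inject_noise` are hypothetical.

```python
import random


def inject_noise(clean_raw, shot_noise=0.01, read_noise=0.0005, seed=0):
    """Add signal-dependent Gaussian noise to linear RAW intensities in [0, 1].

    Assumed model: variance = shot_noise * intensity + read_noise.
    """
    rng = random.Random(seed)
    noisy = []
    for value in clean_raw:
        sigma = (shot_noise * value + read_noise) ** 0.5
        sample = value + rng.gauss(0.0, sigma)
        noisy.append(min(1.0, max(0.0, sample)))  # clip to the valid range
    return noisy


# Hypothetical clean RAW intensities after unprocessing and darkening.
clean = [0.02, 0.05, 0.10, 0.20]
noisy = inject_noise(clean)
print(noisy)
```

The clean list and its noisy version form the kind of input pair that the losses defined next operate on.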

The total loss for learning is defined as:

$$L(\theta) = L_{\text{IS}}(x;\theta) + \alpha L_{\text{IS}}(x';\theta) + \beta L_{\text{DS}}(x, x';\theta), \tag{1}$$

where $x$ is the clean synthetic RAW image, $x'$ is its noisy version, and $\alpha$ and $\beta$ are the weights of the respective losses. We empirically set $\alpha = 1$ and $\beta = 0.01$.

The loss $L_{\text{IS}}$ is the task loss, e.g., the instance segmentation loss, which consists of classification loss, bounding box regression loss, and segmentation (per-pixel classification) loss. The specific form of $L_{\text{IS}}$ depends on the model; we employ the same loss as the original model. This loss is applied to both the clean image $x$ and the noisy image $x'$ to ensure the model performs consistently regardless of noise.

The loss $L_{\text{DS}}$ is the feature disturbance suppression loss, defined as:

$$L_{\text{DS}}(x, x';\theta) = \sum_{i=1}^{n} \lVert f^{(i)}(x;\theta) - f^{(i)}(x';\theta) \rVert_2^2, \tag{2}$$

where $f^{(i)}(x;\theta)$ represents the $i$-th stage of feature maps of the model. By minimizing the Euclidean distance between the clean features $f^{(i)}(x;\theta)$ and the noisy features $f^{(i)}(x';\theta)$, the disturbance suppression loss encourages the model to learn disturbance-invariant features. This reduces feature disturbance caused by image noise and improves the model's robustness to corrupted low-light images.

Unlike perceptual loss [39], our approach does not require pretraining a teacher model, making our training process simpler and faster. With $L_{\text{IS}}(x;\theta)$ and $L_{\text{IS}}(x';\theta)$, our model can learn discriminative features from both clean and noisy images, maintaining stable accuracy regardless of noise. In contrast, the "student" model in perceptual loss [39] only sees noisy images, which can degrade performance on clean images and limit robustness. Additionally, the domain gap between the feature distributions of the teacher and student models can harm the learning process. By minimizing the distance between clean and noisy features predicted by the same model, we avoid this problem.
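Equations (1) and (2) can be illustrated numerically with a small sketch. Feature maps are represented here as flat lists of floats, one list per stage, and the two task losses are passed in as plain numbers because their form depends on the detector. The function names and all example values are hypothetical; only $\alpha = 1$ and $\beta = 0.01$ come from the text.

```python
def disturbance_suppression_loss(clean_feats, noisy_feats):
    """Eq. (2): sum over stages of the squared L2 distance between clean and noisy features."""
    return sum(
        sum((c - n) ** 2 for c, n in zip(stage_clean, stage_noisy))
        for stage_clean, stage_noisy in zip(clean_feats, noisy_feats)
    )


def total_loss(task_loss_clean, task_loss_noisy, clean_feats, noisy_feats,
               alpha=1.0, beta=0.01):
    """Eq. (1): task loss on clean and noisy inputs plus the weighted feature consistency term."""
    l_ds = disturbance_suppression_loss(clean_feats, noisy_feats)
    return task_loss_clean + alpha * task_loss_noisy + beta * l_ds


# Hypothetical two-stage features of the same network for a clean input x and its noisy version x'.
clean_feats = [[1.0, 2.0], [0.5, 0.5, 0.5]]
noisy_feats = [[1.0, 4.0], [0.5, 1.5, 0.5]]

l_ds = disturbance_suppression_loss(clean_feats, noisy_feats)   # (2 - 4)^2 + (0.5 - 1.5)^2
loss = total_loss(0.8, 1.2, clean_feats, noisy_feats)           # 0.8 + 1 * 1.2 + 0.01 * l_ds
print(l_ds, loss)
```

Both feature lists come from the same set of parameters $\theta$, which is the point of contrast with a teacher-student setup: there is no second network whose feature distribution could drift away from the one being trained.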

Testing details. During testing, we employ simple test-time augmentation: horizontal flipping and multi-scale testing. For multi-scale testing, the shorter side of the image is resized to 400, 500, 600, 700, 800, 900, 1000, 1100, and 1200 pixels. For detection, after obtaining ten predictions with different scale augmentations, we use Weighted Box Fusion (WBF) [92] to ensemble them for our final submission.
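Before predictions from different augmentations can be fused, each set of boxes has to be mapped back to the coordinate frame of the original image. The sketch below illustrates that step only. The helper names are hypothetical, boxes are assumed to be in `[x1, y1, x2, y2]` pixel format, and the fusion itself (WBF) is not shown.

```python
import numpy as np

def shorter_side_scale(height, width, target):
    """Resize factor that brings the shorter image side to `target` pixels."""
    return target / min(height, width)

def restore_boxes(boxes, scale, flipped, resized_width):
    """Map boxes from a resized (and optionally flipped) image back to the original.

    boxes: (N, 4) array of [x1, y1, x2, y2] in the resized image.
    scale: resize factor that was applied to the original image.
    flipped: whether the resized image was horizontally flipped.
    resized_width: width of the resized image in pixels.
    """
    boxes = np.asarray(boxes, dtype=np.float64).copy()
    if flipped:
        # Undo the horizontal flip; the left and right edges swap roles.
        x1 = resized_width - boxes[:, 2]
        x2 = resized_width - boxes[:, 0]
        boxes[:, 0], boxes[:, 2] = x1, x2
    # Undo the resize.
    return boxes / scale

# A 400 x 600 image resized so that its shorter side is 800 pixels.
scale = shorter_side_scale(400, 600, 800)
restored = restore_boxes([[800, 100, 1000, 300]], scale, flipped=True, resized_width=1200)
```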

2.5 Teams and Affiliations

GroundTruth

Title: Technique Report of Team GroundTruth for CVPR 2024 PBDL Challenge Low-light Object Detection and Instance Segmentation

Members: Xiaoqiang Lu ([email protected]), Licheng Jiao, Fang Liu, Xu Liu, Lingling Li, Wenping Ma, Shuyuan Yang

Affiliations: School of Artificial Intelligence, Xidian University

Xocean

Title: Tech Report of Low-light Object Detection and Instance Segmentation Challenge

Members: Haiyang Xie1,7 ([email protected]), Jian Zhao6,7, Shihua Huang2, Peng Cheng3, Xi Shen2, Zheng Wang1, Shuai An5, Caizhi Zhu2, Xuelong Li4

Affiliations: 1School of Computer Science, Wuhan University, 2Intellindust, 3Beijing Forestry University, 4Institute of AI (TeleAI), China Telecom, 5Harbin Institute of Technology, 6School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, 7EVOL Lab, Institute of AI (TeleAI), China Telecom

UnoWhoiam

Title: Technique Report of Team UnoWhoiam for CVPR 2024 PBDL Challenge Low-light Object Detection and Instance Segmentation

Members: Linwei Chen1 ([email protected]), Ying Fu1, Tao Zhang2, Liang Li2, Yu Liu3, Chenggang Yan2

Affiliations: 1Beijing Institute of Technology, 2Lishui Institute of Hangzhou Dianzi University, 3Tsinghua University

3 Low-light raw video denoising with realistic motion

Supervised deep-learning methods have shown their effectiveness on raw video denoising in low light. However, existing training datasets have specific drawbacks, e.g., inaccurate noise modeling in synthetic datasets, simple hand-crafted or fixed motion, and limited-quality ground truth caused by the beam splitter in real captured datasets. These defects significantly degrade network performance on real low-light video sequences, where noise distributions and motion patterns are extremely complex. To address this challenge, the CVPR 2024 PBDL Challenge on low-light raw video denoising with realistic motion aims to improve the recovery quality of realistic videos with complex motion.

Table 5: Leaderboard of low-light raw video denoising with realistic motion.
| Rank | Team | PSNR | SSIM |
| --- | --- | --- | --- |
| 1 | ZichunWang | 45.47 | 0.99 |
| 2 | wql | 39.06 | 0.96 |
| 3 | mmmmmm | 33.64 | 0.88 |

As shown in Table 5, all teams in this track achieved strong denoising performance. The first-place team, ZichunWang, achieved a PSNR of 45.47 and an SSIM of 0.99. The second-place team, wql, achieved a PSNR of 39.06 and an SSIM of 0.96. The third-place team, mmmmmm, achieved a PSNR of 33.64 and an SSIM of 0.88. These results show strong denoising capability on real-world videos and demonstrate the participants' ability to design algorithms for the denoising task, contributing to the future development of video denoising.
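For reference, PSNR is defined as $10\log_{10}(R^2/\mathrm{MSE})$, where $R$ is the data range. The following is a minimal sketch of that definition. The data range of 1.0 is an assumption, since the normalization used by the challenge evaluation is not stated here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; `data_range` is an assumed normalization."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] signal gives an MSE of 0.01, i.e. 20 dB.
ref = np.zeros((4, 4))
est = np.full((4, 4), 0.1)
value = psnr(ref, est)
```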

3.1 Low-light Raw Video Denoising Dataset

In this competition, we first collect 70 high-quality 4K videos from the internet and play them on a DELL U2720QM monitor. We capture the displayed videos with a Sony Alpha 7R IV full-frame mirrorless camera. The size of the Bayer image is 9504×6336. The video clips cover indoor and outdoor scenes, ranging from natural landscapes to extreme sports. This relatively wide range of scenes is also an advantage over previous datasets. Examples of our data are shown in Fig. 10.

  • Realistic scene motion. We collect paired low-light raw videos with realistic motion, which generalize well to complex real-world scenarios.

  • 210 clips. The dataset contains 210 video pairs, and each scene is captured at three noise levels.

  • High-quality ground truth. Previous datasets are all collected in degraded conditions, which may significantly degrade the performance of networks trained on them when applied to real scenes. Our raw low-light video denoising dataset captures realistic motion directly and provides high-quality data.

  • No extra equipment. Our data collection pipeline does not require the extra equipment used for previous datasets.

Figure 10: Several representative examples of low/normal-light images in the LLRVD dataset.

3.2 ZichunWang Team’s Method

In this section, we show the overall architecture of our proposed method, and describe the basic 3D spatial-temporal self-attention block with convolution.

3.2.1 Network Architecture

Overall Pipeline. The encoder-decoder is a classic architecture for low-level image tasks, exemplified by U-net [83]. The main issue in adapting the U-net design to video denoising is how to efficiently use the redundant temporal information. To align temporal features, existing methods often use an auxiliary alignment module based on plain convolution [21, 94], deformable convolution [122], or optical flow [112]. However, a sub-optimal alignment operation may harm performance.

Besides, most existing methods use convolution for multi-frame feature fusion, where the lack of long-range modeling ability may degrade the recovery result. Some methods utilize spatial self-similarity, e.g., [122], but the abundant temporal-spatial self-similarity offered by the extra temporal dimension is not fully exploited.

Self-attention is suited for aggregating self-similarity since it can dynamically allocate a weight to each pixel. To this end, we combine 3D temporal-spatial attention with the hierarchical design of U-net. Nonetheless, the Transformer may be deficient in local feature extraction, which is indispensable for recovering image details. Thus, we combine the locality of convolution with the long-range interaction of self-attention in each Transformer block.

The overall architecture of our network is shown in Fig. 11. We focus on the Raw2Raw video denoising task, where both the input and the output are in the raw domain. The input is of size $T\times H\times W\times 4$, where $T$ is the number of input frames and each frame contains $H\times W\times 4$ pixels in the Bayer pattern. The output frame is of size $H\times W\times 4$. To embed the pixels as tokens, we first apply a $3\times 3$ convolution. After embedding, all tokens pass through $K$ encoders and patch merging layers. Each encoder contains $M$ Shifted Window Transformer blocks. For downsampling, we use a $4\times 4$ convolution and double the feature dimension. Symmetrically, the decoder includes $K$ Transformer blocks and patch expanding layers. The output of the decoder layers is then projected back to image patches. Finally, the extracted multi-frame features are temporally fused to handle misalignment.
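To make the $T\times H\times W\times 4$ input layout concrete, the sketch below packs a Bayer mosaic into four channels. It assumes that each raw frame is stored as a $2H\times 2W$ mosaic and that the four channels follow the row-major order of the $2\times 2$ colour filter cell. Both points, as well as the function names, are assumptions for illustration.

```python
import numpy as np

def pack_bayer(mosaic):
    """Pack a (2H, 2W) Bayer mosaic into an (H, W, 4) array, one channel per CFA site."""
    mosaic = np.asarray(mosaic)
    if mosaic.ndim != 2 or mosaic.shape[0] % 2 or mosaic.shape[1] % 2:
        raise ValueError("expected a 2D mosaic with even height and width")
    sites = [
        mosaic[0::2, 0::2],  # top-left site of each 2x2 cell
        mosaic[0::2, 1::2],  # top-right site
        mosaic[1::2, 0::2],  # bottom-left site
        mosaic[1::2, 1::2],  # bottom-right site
    ]
    return np.stack(sites, axis=-1)

def pack_clip(frames):
    """Stack T packed frames into a (T, H, W, 4) network input."""
    return np.stack([pack_bayer(f) for f in frames], axis=0)

# Three toy 4x4 mosaics become a clip of shape (3, 2, 2, 4).
clip = pack_clip([np.arange(16).reshape(4, 4) + 100 * t for t in range(3)])
```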

Figure 11: Overview of network architecture. Swin Transformer denotes Shifted Window-based Transformer. 3D (S)W-MSA denotes 3D (Shifted)Window-based Multi-head Self-attention. LN denotes Layer Normalization. Convolutional Attention denotes our final fusion block.

3D Swin Transformer Block. Since vanilla self-attention [25] is computationally expensive, directly applying it to video denoising is not affordable due to the extra temporal dimension. Besides, the Transformer [25] has strong long-range modeling ability but neglects local features, which are vital for recovering details. To extract local features with less computational effort, we apply 3D shifted window-based multi-head self-attention (3DSW-MSA) and 3D window-based multi-head self-attention (3DW-MSA) [67], together with depth-wise convolution in the feed-forward layer. In this way, we can effectively extract local features by convolution while fully exploiting the intrinsic temporal-spatial self-similarity through the long-range modeling ability of the Transformer.

Two consecutive 3D shifted window-based Transformer blocks are computed as:

$$\begin{aligned}
\hat{\mathbf{z}}^{l} &= \text{3DW-MSA}\left(\mathrm{LN}\left(\mathbf{z}^{l-1}\right)\right)+\mathbf{z}^{l-1},\\
\mathbf{z}^{l} &= \mathrm{FFN}\left(\mathrm{LN}\left(\hat{\mathbf{z}}^{l}\right)\right)+\hat{\mathbf{z}}^{l},\\
\hat{\mathbf{z}}^{l+1} &= \text{3DSW-MSA}\left(\mathrm{LN}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l},\\
\mathbf{z}^{l+1} &= \mathrm{FFN}\left(\mathrm{LN}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1},
\end{aligned} \tag{3}$$

where $\hat{\mathbf{z}}^{l}$ and $\mathbf{z}^{l}$ denote the output features of the 3D(S)W-MSA module and the feed-forward network (FFN) of the $l$-th block, respectively. A LayerNorm (LN) is applied before each MSA module and before each FFN. Following previous studies, we add the relative position encoding $B\in\mathbb{R}^{T^{2}\times M^{2}\times M^{2}}$ to the 3D attention block. The self-attention is computed as:

$$\operatorname{Attention}(Q,K,V)=\operatorname{SoftMax}\left(QK^{T}/\sqrt{d}+B\right)V, \tag{4}$$

where $Q,K,V\in\mathbb{R}^{TM^{2}\times d}$ are the query, key, and value matrices, $d$ is the dimension of the query and key features, and $TM^{2}$ is the number of tokens per window. The values of $B$ are taken from the 3D bias matrix $\hat{B}\in\mathbb{R}^{(2T-1)\times(2M-1)\times(2M-1)}$, whose axes correspond to the temporal range $[-T+1,T-1]$ and the spatial range $[-M+1,M-1]$.
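The lookup of $B$ from $\hat{B}$ and the attention of Eq. (4) can be illustrated with a single-head NumPy sketch for one window. It is a simplified illustration rather than the team's implementation: the function names are hypothetical, and multi-head splitting, window partitioning, and shifting are omitted.

```python
import numpy as np

def relative_position_index(T, M):
    """For every token pair in a T x M x M window, the flat index into B_hat."""
    grid = np.meshgrid(np.arange(T), np.arange(M), np.arange(M), indexing="ij")
    coords = np.stack(grid).reshape(3, -1)            # (3, N) with N = T * M * M
    rel = coords[:, :, None] - coords[:, None, :]     # (3, N, N) pairwise offsets
    rel[0] += T - 1                                   # temporal offsets now in [0, 2T-2]
    rel[1] += M - 1                                   # spatial offsets now in [0, 2M-2]
    rel[2] += M - 1
    return (rel[0] * (2 * M - 1) + rel[1]) * (2 * M - 1) + rel[2]

def window_attention(Q, K, V, B_hat, T, M):
    """SoftMax(Q K^T / sqrt(d) + B) V for one window and one head."""
    d = Q.shape[-1]
    B = B_hat.reshape(-1)[relative_position_index(T, M)]   # (N, N) bias gathered from B_hat
    logits = Q @ K.T / np.sqrt(d) + B
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy window: T = 2 frames, M = 2 spatial size, d = 4 channels.
T, M, d = 2, 2, 4
N = T * M * M
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
B_hat = rng.standard_normal((2 * T - 1, 2 * M - 1, 2 * M - 1))
out = window_attention(Q, K, V, B_hat, T, M)
```

Because the bias depends only on the relative offset between two tokens, $\hat{B}$ needs $(2T-1)(2M-1)^2$ learnable entries instead of $(TM^2)^2$.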

Temporal Fusion. After exploiting spatial-temporal self-similarity, features from neighbor frames are fused to recover the reference frame. However, it is not appropriate to simply combine these frames, since the complex motion in real videos makes each neighbor frame contribute differently to the central reference frame. Intuitively, the closer the features of a neighbor frame are to those of the reference frame, the more information that neighbor frame can provide for recovery. Therefore, we first extract the features by embedding, then compute the similarity between the features of each neighbor and the reference features in an embedded space:

$$S\left(F_{t+i},F_{t}\right)=\operatorname{Sim}\left(\theta\left(F_{t+i}\right)^{T},\phi\left(F_{t}\right)\right), \tag{5}$$

where $\theta$ and $\phi$ are embedding functions and $\operatorname{Sim}$ denotes the similarity function. Following previous work [102], we adopt the dot product for similarity calculation. $F_{t}$ refers to the reference frame and $F_{t+i}$ refers to the neighbor frames, where $i\in[-T+1,T-1]$. After obtaining the similarity matrix, we adaptively re-weight the features in the temporal dimension,

$$\tilde{F}_{t+i}=F_{t+i}\odot S\left(F_{t+i},F_{t}\right), \tag{6}$$
$$F_{\text{fusion}}=\operatorname{Conv}\left([\tilde{F}_{t-T},\cdots,\tilde{F}_{t},\cdots,\tilde{F}_{t+T}]\right), \tag{7}$$

where $\odot$ and $[\cdot,\cdot,\cdot]$ denote element-wise multiplication and concatenation, respectively. The re-weighted features are concatenated and aggregated by a convolution layer to reconstruct the frame. Finally, a convolutional attention module [109] is used to spatially enhance the feature representation.
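The re-weighting and fusion steps of Eqs. (5) to (7) can be sketched in NumPy as follows. Several details are assumptions made for illustration: the embeddings $\theta$ and $\phi$ and the fusion convolution are modelled as $1\times 1$ convolutions (plain matrices over channels), the dot product is taken per pixel over channels, and a sigmoid squashes the similarity into $(0,1)$. None of these choices is confirmed by the text above.

```python
import numpy as np

def temporal_fusion(feats, ref_index, W_theta, W_phi, W_fuse):
    """Fuse per-frame features into one reference-frame feature map.

    feats: (N, C, H, W) features of N frames.
    W_theta, W_phi: (E, C) matrices acting as 1x1-conv embeddings (assumed form).
    W_fuse: (C_out, N * C) matrix acting as the 1x1 fusion convolution (assumed form).
    """
    n = feats.shape[0]
    ref_emb = np.einsum("ec,chw->ehw", W_phi, feats[ref_index])       # phi(F_t)
    weighted = []
    for i in range(n):
        nbr_emb = np.einsum("ec,chw->ehw", W_theta, feats[i])         # theta(F_{t+i})
        sim = np.sum(nbr_emb * ref_emb, axis=0, keepdims=True)        # per-pixel dot product
        sim = 1.0 / (1.0 + np.exp(-sim))                              # sigmoid (assumption)
        weighted.append(feats[i] * sim)                               # Eq. (6)
    stacked = np.concatenate(weighted, axis=0)                        # concatenation in Eq. (7)
    return np.einsum("ok,khw->ohw", W_fuse, stacked)                  # convolution in Eq. (7)

# Toy example: 3 frames, 2 channels, 4x4 spatial size, reference frame in the middle.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 2, 4, 4))
W_theta = rng.standard_normal((2, 2))
W_phi = rng.standard_normal((2, 2))
W_fuse = rng.standard_normal((2, 6))
fused = temporal_fusion(feats, 1, W_theta, W_phi, W_fuse)
```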

3.3 Wql Team’s Method

Our team, with the username wql on Codalab, achieved a final score of 38.82 on the leaderboard, ranking second. In this report, we present the technical details of our solution. The task of this competition is to denoise and restore low-light raw videos. Considering the low-light characteristics of the data, we divide the task into two subtasks: low-light restoration and denoising. The key to video restoration lies in fully utilizing inter-frame information. After extensive experiments, we chose the Shift-Net model for low-light restoration. To avoid compromising the performance of the model, we converted the original data to RGB format for training. Given that the restored videos still contain a large amount of noise, we then applied the RVRT model for denoising, resulting in high-quality output. Experimental results demonstrate that our strategy is effective, achieving a score of 38.82 on the LLRVD dataset.

Our entire solution is illustrated in Fig. 12. The original low-light video is first converted into an RGB video. Then, it is restored to normal lighting conditions by the Shift-Net [58] model. At this point, the video's lighting level is normal, but a significant amount of noise remains. Subsequently, it is denoised by RVRT [61], yielding the final restored and denoised video.

Figure 12: The overall experimental architecture diagram.

The key to this track is the utilization of inter-frame information. Existing deep learning methods often depend on complex network components such as optical flow estimation, deformable convolutions, and cross-frame self-attention layers, leading to high computational costs. After an extensive literature review, our team chose Shift-Net as the main model. Shift-Net is a simple yet effective video restoration and denoising framework that is reported to surpass existing state-of-the-art methods in accuracy while being much smaller than existing advanced models, with two versions of only 4.1M and 12.3M parameters. The model is based on grouped spatiotemporal displacements, a lightweight and direct technique that implicitly captures inter-frame correspondences for multi-frame aggregation. Grouped spatial displacements provide a broad effective receptive field, and combined with basic 2D convolutions, this simple framework can effectively aggregate inter-frame information. After low-light restoration with Shift-Net, the images still contain a significant amount of noise. Therefore, our team chose the RVRT model to denoise the restored images, aiming for high-quality video restoration.

3.3.1 Network Architecture

Data preprocessing. Our team chose the Shift-Net model for low-light restoration, which expects RGB input by default. To maintain the model's performance, we convert the raw images in the dataset to RGB format for training. The storage format of the raw images is RGBG, and since the human eye is more sensitive to green, for better visualization we extract the RGB three-channel data and multiply R and B by 2, while keeping G unchanged. The processed images can be saved in PNG format for model training. However, because pixel values in low-light sequences are low, saving as PNG may lose information, so the conversion is performed online to mitigate this.
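A minimal sketch of this conversion is given below. It assumes the raw frame has already been packed into four planes in the order R, G, B, G and that the first G plane is the one kept. The function name and these layout details are assumptions rather than the team's code.

```python
import numpy as np

def packed_raw_to_rgb(packed, gain=2.0):
    """Convert an (H, W, 4) packed raw frame (assumed order R, G, B, G) to (H, W, 3) RGB.

    R and B are multiplied by `gain`, while G is kept unchanged.
    """
    packed = np.asarray(packed, dtype=np.float32)
    r, g, b = packed[..., 0], packed[..., 1], packed[..., 2]
    return np.stack([gain * r, g, gain * b], axis=-1)

# Toy frame in which every pixel holds the plane values (10, 20, 30, 40).
packed = np.ones((2, 3, 4), dtype=np.uint16) * np.array([10, 20, 30, 40], dtype=np.uint16)
rgb = packed_raw_to_rgb(packed)
```

Keeping the result in floating point inside the data loader, rather than writing 8-bit PNG files, is one way to avoid the quantization loss mentioned above.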

Shift-Net. Most previous video restoration methods have employed complex architectures such as optical flow, deformable convolutions, and self-attention layers. Shift-Net instead proposes a simple yet effective grouped spatiotemporal displacement block to implicitly establish temporal correspondence.

As shown in Fig. 13, the framework of this model adopts a three-stage design: 1) feature extraction, 2) multi-frame feature fusion with grouped spatiotemporal offsets, 3) final restoration.

Feature extraction. Each frame $I_i$ typically suffers from different types of degradation (such as noise or blur), which affects temporal correspondence modeling. A two-dimensional U-Net-like structure is adopted to mitigate the negative impact of degradation and extract frame-level features.

Multi-frame feature fusion. At this stage, a grouped spatiotemporal displacement block is proposed to move different features from adjacent frames to the reference frame, implicitly establishing temporal correspondence. Keyframe features are fully aggregated with features from neighboring frames to obtain corresponding aggregate features. By employing spatiotemporal displacements in different directions and distances, multiple candidate displacements are provided for frame matching. By stacking multiple grouped spatiotemporal displacement blocks, our framework achieves long-term aggregation.

Final restoration. Finally, similar to the U-Net structure, taking low-quality input frames and corresponding aggregate features as input, the model generates the final result for each frame.

Figure 13: Overview of the Group Shift-Net. It adopts a three-stage design: feature extraction, multi-frame fusion, and final restoration. Grouped spatial-temporal shift blocks are proposed to achieve multi-frame aggregation.

In multi-frame fusion, frame features are aggregated with adjacent features to obtain temporally fused features. We adopt a two-dimensional U-Net structure for multi-frame fusion, maintaining skip connections within the U-Net. We replace the multiple 2D convolutional blocks with stacked Grouped Spatiotemporal Shift (GSTS) blocks to effectively establish temporal correspondence and perform multi-frame fusion. GSTS blocks are not applied at the finest scale to save computational costs. The GSTS block consists of three parts: 1) temporal displacement, 2) spatial shift, 3) lightweight fusion layer, as illustrated in Fig. 14.

Figure 14: The operations of Grouped Spatial-temporal Shift (GSTS). We stack the forward temporal shift (FTS) blocks (Left) and backward temporal shift (BTS) blocks (Right) alternately to achieve bi-directional propagation. Grouped spatial shift provides multiple candidate displacements within large spatial fields and establishes temporal correspondences implicitly.
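The idea of a forward temporal shift followed by a grouped spatial shift can be illustrated with a heavily simplified NumPy sketch. It is not the Shift-Net implementation: the choice of shifting half of the channels, the zero padding, the offsets, and all function names are assumptions, and the fusion layer that would follow is omitted.

```python
import numpy as np

def forward_temporal_shift(feats):
    """feats: (T, C, H, W). Frame t receives half of the channels of frame t-1 (zeros for t = 0)."""
    t, c, h, w = feats.shape
    shifted = np.zeros((t, c // 2, h, w), dtype=feats.dtype)
    shifted[1:] = feats[:-1, : c // 2]
    return shifted

def grouped_spatial_shift(x, offsets):
    """Translate each channel group of x (T, C, H, W) by its own (dy, dx), with zero padding."""
    t, c, h, w = x.shape
    if c % len(offsets):
        raise ValueError("channels must divide evenly into the offset groups")
    g = c // len(offsets)
    out = np.zeros_like(x)
    for k, (dy, dx) in enumerate(offsets):
        if abs(dy) >= h or abs(dx) >= w:
            continue  # the whole group is shifted out of view
        ch = slice(k * g, (k + 1) * g)
        src_y, dst_y = slice(max(0, -dy), h - max(0, dy)), slice(max(0, dy), h - max(0, -dy))
        src_x, dst_x = slice(max(0, -dx), w - max(0, dx)), slice(max(0, dx), w - max(0, -dx))
        out[:, ch, dst_y, dst_x] = x[:, ch, src_y, src_x]
    return out

def shift_and_concat(feats, offsets):
    """Concatenate each frame's features with shifted features from the previous frame."""
    shifted = grouped_spatial_shift(forward_temporal_shift(feats), offsets)
    return np.concatenate([feats, shifted], axis=1)  # a lightweight fusion layer would follow

# Toy clip: 3 frames, 4 channels, 4x4 spatial size, two candidate displacements.
feats = np.arange(3 * 4 * 4 * 4, dtype=np.float64).reshape(3, 4, 4, 4)
out = shift_and_concat(feats, [(0, 0), (1, 0)])
```

Each channel group offers the following convolution a differently displaced copy of the neighbor frame, which is how multiple candidate displacements become available without explicit motion estimation.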

RVRT. RVRT demonstrates excellent performance in the field of video denoising. As shown in Fig. 15, the framework consists of three parts: shallow feature extraction, recurrent feature refinement, and frame reconstruction. Shallow feature extraction utilizes convolutional layers and multiple RSTB blocks from SwinIR to extract features from low-quality (LQ) videos. Subsequently, the recurrent feature refinement module performs temporal modeling, and guided deformable attention is employed for video alignment. Finally, the refined features are passed through multiple RSTB blocks to generate the final features, followed by HQ reconstruction using pixel shuffle.

Figure 15: Overall Framework of RVRT

3.3.2 Implementation Details

Shift-Net. We opted for the standard version of the Shift-Net model for training. Since each video in the training set randomly contains three levels of noise, the data reading strategy during training also involves randomly selecting one noise level. The training parameters include a batch size of 4, a learning rate of 4e-4, and 120,000 iterations. The training was conducted using a single NVIDIA RTX 3090 GPU, lasting for 48 hours, without loading pretrained weights.

RVRT. For denoising training with RVRT, only the labels are retrieved during data loading, with a certain amount of noise added. The training parameters include a batch size of 4, a learning rate of 1e-5, and 40,000 iterations. Training was conducted using a single NVIDIA RTX 3090 GPU, lasting for 7 hours, without loading pretrained weights.

3.4 Mmmmmm Team’s Method

In this competition [36], we first attempted video denoising using raw images, employing models such as RViDeNet [122] and EMVD [72]. However, due to unsatisfactory results, we later switched to the RGB image-oriented model MIRNetv2 [125]. With a specific training strategy, we achieved our final score.

3.4.1 Network Architecture

We employed a multi-scale approach that preserves the original high-resolution features through the network hierarchy, thereby minimizing the loss of precise spatial details. Simultaneously, it encodes multi-scale context by using parallel convolution streams to process features at lower spatial resolutions. The multi-resolution parallel branches operate complementarily to the main high-resolution branch, providing more accurate and contextually enriched feature representations.

One major distinction between MIRNetv2 and other multi-scale image processing methods lies in how contextual information is aggregated. While other methods focus on processing each scale separately, MIRNetv2 progressively exchanges and fuses information from coarse to fine resolution levels. Additionally, unlike methods that use simple concatenation or averaging of features from multi-resolution branches, MIRNetv2 introduces a new selective kernel fusion approach. This approach dynamically selects the useful set of kernels from each branch representation using a self-attention mechanism. Moreover, the proposed fusion block combines features with varying receptive fields while preserving their distinctive complementary characteristics.
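The selective fusion idea can be sketched as follows: the branch features are summed, pooled into a channel descriptor, and turned into per-branch channel weights by a softmax across branches. This is a simplified illustration, not the MIRNetv2 code. The weight matrices, the ReLU activation, and the function name are assumptions.

```python
import numpy as np

def selective_kernel_fusion(branches, W_down, W_up):
    """Fuse same-shaped (C, H, W) branch features with learned per-channel branch weights.

    W_down: (C_r, C) matrix producing a compact descriptor (assumed form).
    W_up: list with one (C, C_r) matrix per branch (assumed form).
    """
    summed = np.sum(branches, axis=0)                     # element-wise sum of all branches
    s = summed.mean(axis=(1, 2))                          # global average pooling -> (C,)
    z = np.maximum(W_down @ s, 0.0)                       # compact descriptor (ReLU assumed)
    logits = np.stack([W @ z for W in W_up])              # (num_branches, C)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)         # softmax across branches
    return sum(a[:, None, None] * b for a, b in zip(attn, branches))

# Toy example: three branches with 4 channels on a 5x5 grid.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((4, 5, 5)) for _ in range(3)]
W_down = rng.standard_normal((2, 4))
W_up = [rng.standard_normal((4, 2)) for _ in range(3)]
fused = selective_kernel_fusion(branches, W_down, W_up)
```

Unlike concatenation or averaging, the weights here depend on the input, so each channel can favour the branch whose receptive field is most useful for the current image.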

The MIRNetv2 network is divided into four modules: Dual-Pixel Defocus Deblurring, Image Denoising, Image Super-Resolution, and Image Enhancement. Among them, we employed the Dual-Pixel Defocus Deblurring module for our task.

Dual-Pixel Defocus Deblurring. Images captured with a wide aperture have a shallow depth of field, meaning that regions outside the depth of field become out of focus. Given an image with defocus blur, the goal of defocus deblurring is to generate a globally sharp image. Existing defocus deblurring methods either directly deblur images or first estimate the defocus disparity map and then use it to guide the deblurring process. Modern cameras are equipped with dual-pixel sensors, where each pixel location has two photodiodes, thereby generating two sub-aperture views. The phase difference between these views is useful for measuring the amount of defocus blur at each scene point. Recently, Abuolaim et al. introduced a dual-pixel deblurring dataset (DPDD) and a new method based on encoder-decoder design. In this paper, our focus is also on directly using dual-pixel data to deblur images. Previous defocus deblurring works have employed encoder-decoder architectures that repeatedly use downsampling operations, resulting in significant loss of important details. In contrast, the architectural design of our method enables the preservation of texture details required for the restored image.

Figure 16: The proposed MIRNet-v2 framework is aimed at learning enriched feature representations for image restoration and enhancement. Based on a recursive residual design, MIRNet-v2 comprises multiple-scale residual blocks (MRBs), with its main branch dedicated to maintaining spatially precise high-resolution representations throughout the entire network, while complementary parallel branches provide better contextual features.

Visualization. Following the ISP process provided on the official competition website, we processed the TIFF images in RGGB order, performed grayscale balancing correction, and added a normalization step before outputting PNG images so that the RGB images appear clearer and brighter. The specific process is illustrated in Fig. 17.

Figure 17: Data Processing Pipeline
Figure 18: GT Direct Visualization
Figure 19: Normalized Images

3.4.2 Implementation Details

The training dataset contains a total of 72 scenes. The noisy data for each scene contains 10 consecutive images with three different noise levels. The validation set consists of scenes 3 and 4, with images at noise levels of 125, 160, and 200. The test set includes scenes 16, 22, 36, and 66, each with ten images at noise levels of 100, 125, 160, 200, 250, and 320. Each image has size (1300, 2700, 1) and is stored as a TIFF image arranged in RGGB order. The black level is 512, and the white level is 15360. The specific parameter settings for the training process are as follows:

Dataset setup. The dataset includes training and validation sets for model training and evaluation. Dual-pixel depth images are used for model training with geometric augmentation. Different training batch sizes (8, 5, 4, 2, 1, 1) and iteration numbers (92000, 64000, 48000, 36000, 36000, 24000) are set to gradually improve training effectiveness. A progressive training strategy is employed, starting from small crops and gradually increasing the crop size (128, 160, 192, 256, 320, 384).

Network architecture setup. The MIRNet_v2 network architecture is used for the task of deblurring dual-pixel depth images. The network has 6 input channels and 3 output channels.

Training setup. The total number of iterations is set to 300,000, and a cosine annealing learning rate scheduler with restarts is used. The Adam optimizer is employed with a learning rate of 2e-4.

Validation setup. Validation is performed every 2000 iterations, and the PSNR validation metric is calculated. Validation images are not saved.
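The progressive schedule can be written as a small lookup from the training iteration to the crop size and batch size. The sketch assumes that the three lists in the dataset setup are paired position by position. The names below are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

CROP_SIZES = (128, 160, 192, 256, 320, 384)
BATCH_SIZES = (8, 5, 4, 2, 1, 1)
STAGE_ITERS = (92000, 64000, 48000, 36000, 36000, 24000)
# Cumulative iteration count at which each stage ends.
STAGE_ENDS = tuple(accumulate(STAGE_ITERS))

def stage_settings(iteration):
    """Return (crop_size, batch_size) for a 0-based training iteration."""
    if not 0 <= iteration < STAGE_ENDS[-1]:
        raise ValueError("iteration outside the schedule")
    stage = bisect_right(STAGE_ENDS, iteration)
    return CROP_SIZES[stage], BATCH_SIZES[stage]

first = stage_settings(0)
last = stage_settings(STAGE_ENDS[-1] - 1)
```

The per-stage iteration counts add up to 300,000, which matches the total number of iterations in the training setup.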

With these training strategies, we trained on the training set for 300,000 iterations and achieved good performance, as shown in Table 6:

Table 6: Comparison between the two models
| Model | Input | Score |
| --- | --- | --- |
| EMVD | Raw | 29.44 |
| MIRNetv2 | RGB | 27.15 |
| MIRNetv2* | Normalized RGB | 32.94 |

From the perspective of input image types, directly training with raw images yields lower scores, as shown in Fig. 16. This could be because the resulting test images have blurry details, unclear textures, and a greenish hue. For RGB inputs, we use the MIRNetv2 model configured for the deblurring task. When the input images are not normalized, the test results tend to be darker with heavier colors. After normalization, the color changes in the resulting test images are smaller, and the details and textures become clearer and more visible.

Figure 20: EMVD
Figure 21: Unnormalized MIRNetv2
Figure 22: Normalized MIRNetv2
Figure 23: Original images in the test set

During the early stages of training, we also utilized the RViDeNet model, undergoing training in three phases: predenoising, pretraining, and finetuning. However, as training required four different noise levels while each scene in our training set only had three noise levels, our strategy was to directly duplicate the highest noise level from the training set. This led to poor generalization of our model during training, resulting in unsatisfactory performance during the validation phase. Therefore, we did not continue using this model during the testing phase. Nevertheless, we still believe that with a sufficient dataset and training time, this model has the potential to achieve better results.

3.5 Teams and Affiliations

ZichunWang

Title: Zichun Wang’s Team Technique Report of CVPR 2024 PBDL Challenge Low-light raw video denoising with realistic motion

Members: Zichun Wang ([email protected]), Ying Fu

Affiliations: Beijing Institute of Technology

wql

Title: Low-light raw video denoising with realistic motion

Members: Qinliang Wang ([email protected]), Xuejian Gou, Yang Liu, Lingling Li, Fang Liu, Wenping Ma

Affiliations: School of Artificial Intelligence, Xidian University

mmmmmm

Title: Low-light raw video denoising with realistic motion

Members: Xinyue Yu ([email protected]), Sen Jia, Junpei Zhang, Licheng Jiao, Xu Liu, Puhua Chen

Affiliations: Intelligent Perception and Image Understanding Lab, Xidian University

4 Low-light sRGB Image Enhancement

Compared with normal-light images, low-light images captured under poor lighting conditions suffer from serious quality degradation due to inevitable environmental or technical constraints, leading to unpleasant visual perception, including detail degradation, color distortion, and severe noise. These phenomena have a significant impact on the performance of high-level downstream visual tasks, such as image classification, object detection, and semantic segmentation [24, 78, 82, 121, 127]. To mitigate the degradation of image quality, low-light image enhancement [70] has become an important topic in the low-level image processing community to effectively improve visual quality and restore image details.
To address this challenge, the CVPR 2024 PBDL Low-light sRGB Image Enhancement Challenge aims to evaluate and improve the visual quality of image enhancement algorithms in the field of low-light image enhancement.
In the low-light sRGB image enhancement track (Table 7), the top three teams demonstrated exceptional performance. The IMAGCX team secured the first place, achieving a PSNR score of 22.70 and an SSIM score of 0.82. The chm team came in second, with a PSNR score of 22.62 and an SSIM score of 0.82. The WanFly team achieved the third place with a PSNR score of 21.82 and an SSIM score of 0.81. Their enhanced images achieved excellent visual quality, showcasing their strong performance in this challenging task.
These results highlight the significant progress made by the participating teams in addressing the challenges of low-light sRGB image enhancement. The top-ranking teams demonstrated their expertise and innovative capabilities in developing image enhancement algorithms that excel in low-light conditions, paving the way for future advancements in computer vision research.

Table 7: Leaderboard of the low-light sRGB image enhancement track
Rank Team PSNR SSIM
1 IMAGCX 22.70 0.82
2 chm 22.62 0.82
3 WanFly 21.82 0.81

4.1 Low-Light sRGB Image Enhancement Dataset

To systematically investigate the effectiveness of the proposed method in real-world conditions, a real low-light image dataset for image enhancement is necessary and fundamental. The challenge utilizes the Paired Normal/Low-light Images (PNLI) dataset, introduced by [33].
It is collected using a Canon EOS 5D Mark IV camera. Fig. 24 shows examples of paired normal/low-light images from the PNLI dataset. The PNLI dataset exhibits the following characteristics:

Figure 24: Several representative examples for low/normal-light images in PNLI dataset, LOL dataset, SYN dataset and EnlightenGAN dataset. Objects and scenes captured in our PNLI dataset are more diverse, abundant and superior.
  • It contains 2,000 image pairs, which is four times the size of the LOL dataset.

  • Unlike the existing real-scene dataset LOL, the PNLI dataset contains no repeated scenes, which makes it more diverse than LOL. (The LOL dataset contains many near-duplicate scenes, as shown in Fig. 24.)

  • All images in PNLI are collected from considerably more real scenes, which contain both indoor and outdoor scenes. In addition, the object categories in images are rich and common.

  • Excellent visual quality and clarity, which might help in learning pixel-level contextual information.

  • The darkness levels of low-light images in PNLI are varied, so the dataset faithfully reflects a range of real situations in which image brightness is lost due to insufficient ambient light or operator error. Therefore, it can effectively verify the stability and robustness of our proposed method.

4.2 IMAGCX Team’s Method

4.2.1 Network Architecture

To solve UHD low-light image enhancement, several recent state-of-the-art methods have been proposed, for example LLFormer [100], UHDFour[57], and MixNet [111]. We first conduct cross-domain generalization analysis on these methods, and we find that MixNet can better generalize to unseen real images. Thus, MixNet [111] is employed as the network backbone for low-light image enhancement.
Fig. 25 shows the overview of the network architecture. It aims to map a UHD low-light input image $x\in\mathbb{R}^{H\times W\times C}$ to its corresponding normal-clear version $y\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ represent height, width, and channel, respectively. To reduce computational complexity, it downsamples the input to 1/4 of the original resolution by PixelUnshuffle. Subsequently, the shallow features go through multiple deep feature mixer blocks. Each feature mixer block mainly consists of a feature modulation network and a feed-forward network. To better capture long-range pixel dependencies in UHD images, the feature modulation network combines spatial and channel dimensions for joint feature modeling. Finally, we use PixelShuffle upsampling to reconstruct the final image.
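The PixelUnshuffle/PixelShuffle pair mentioned above can be illustrated with a minimal NumPy sketch. It is not the team's implementation: the function names, the channel-first `(C, H, W)` layout, and the channel ordering (chosen to mirror the usual deep-learning convention) are assumptions made for illustration.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange (C, H, W) into (C*r*r, H//r, W//r) without discarding pixels."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)          # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x, r):
    """Inverse rearrangement: (C*r*r, H, W) into (C, H*r, W*r)."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)          # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# A factor of 4 reduces each spatial side to 1/4 and multiplies channels by 16.
image = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
features = pixel_unshuffle(image, 4)        # shape (48, 2, 2)
restored = pixel_shuffle(features, 4)       # shape (3, 8, 8), identical to image
```

Because the rearrangement is lossless, the spatial resolution seen by the deep blocks drops while all input information is kept in the channel dimension.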

Figure 25: Overview of the network architecture of the IMAGCX team's method.

4.2.2 Implementation Details

We conduct model training in the PyTorch framework on 8 NVIDIA GeForce RTX 4090 GPUs. Furthermore, we incorporate other public UHD low-light image enhancement datasets (UHD-LL [57] and UHD-LOL [100]) into the network training. Similar to [65], patches of size $2000\times 2000$ are randomly cropped from the image pairs as training samples. The training data is augmented with random rotation and flipping. To optimize the network, we adopt L1 loss as the optimization objective, and we employ the Adam optimizer with a learning rate of $2\times 10^{-4}$. In total, we perform 600k iterations. During the testing phase, we perform full-resolution inference using one NVIDIA GeForce RTX 4090 GPU. Note that we employ a self-ensemble strategy to further improve performance. The code and model are released at code.
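The report does not specify which self-ensemble variant is used. One common form in image restoration is geometric self-ensemble, which averages predictions over the eight flip and rotation variants of the input. The sketch below assumes that variant; `model` is a hypothetical callable that maps an `(H, W, C)` array to an array of the same shape.

```python
import numpy as np

def self_ensemble(model, img):
    """Average model outputs over the 8 flip/rotation variants of an (H, W, C) image."""
    outputs = []
    for k in range(4):                      # rotations by 0, 90, 180, 270 degrees
        for flip in (False, True):
            aug = np.rot90(img, k, axes=(0, 1))
            if flip:
                aug = aug[:, ::-1]          # horizontal flip
            out = model(aug)
            if flip:                        # undo the augmentation in reverse order
                out = out[:, ::-1]
            out = np.rot90(out, -k, axes=(0, 1))
            outputs.append(out)
    return np.mean(outputs, axis=0)

# With an identity "model", the ensemble returns the input unchanged.
img = np.random.default_rng(0).random((6, 4, 3))
result = self_ensemble(lambda t: t, img)
```

The cost is eight forward passes per image, which is a relevant trade-off for full-resolution UHD inference.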

4.3 chm Team’s Method

4.3.1 Network Architecture

Fig. 26 illustrates the overall architecture of our method. Specifically, the input $x$ is first reshaped to a feature tensor via PixelUnshuffle ($4\times\downarrow$) to preserve original information, and then fed to 8 feature extraction modules. Finally, the output feature $y$ is reshaped to the original height and width of input $x$ via PixelShuffle ($4\times\uparrow$). The feature extraction module mainly contains a feature rearrangement block (FRB), a feature enhancement block (FEB), and a feed-forward network (FFN). Here, FRB adopts MLP-based tensor dimensional transformations [131], while FEB employs CNN-based local operators [123]. The overall process can be represented as follows:

$$\begin{split}&F_{1}=\operatorname{Conv}\left[\operatorname{FRB}\left(\operatorname{LN}\left(F_{0}\right)\right);\operatorname{FEB}\left(\operatorname{LN}\left(F_{0}\right)\right)\right]+F_{0},\\ &F_{2}=\operatorname{FFN}\left(\operatorname{LN}\left(F_{1}\right)\right)+F_{1},\end{split}\tag{8}$$

where $F_{0}$ denotes the input features, $F_{1}$ denotes the intermediate features, and $F_{2}$ denotes the output features. LN refers to layer normalization.
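The data flow of Equation (8) can be sketched structurally in NumPy. The internals of FRB, FEB, and FFN are not given here, so they are passed in as hypothetical callables. Treating Conv as a 1×1 convolution that fuses the concatenated branches back to the input width, and using a layer normalization without learnable parameters, are both assumptions of this sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each pixel's feature vector over the channel axis of (C, H, W)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1x1(x, weight):
    """1x1 convolution: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def feature_extraction_module(f0, frb, feb, ffn, fuse_weight):
    """F1 = Conv[FRB(LN(F0)); FEB(LN(F0))] + F0, then F2 = FFN(LN(F1)) + F1."""
    normed = layer_norm(f0)
    branches = np.concatenate([frb(normed), feb(normed)], axis=0)  # [ . ; . ]
    f1 = conv1x1(branches, fuse_weight) + f0                       # first residual
    f2 = ffn(layer_norm(f1)) + f1                                  # second residual
    return f2

# Placeholder branches, only to show shapes: C = 4 channels on a 5x5 feature map.
rng = np.random.default_rng(0)
f0 = rng.normal(size=(4, 5, 5))
fuse = rng.normal(size=(4, 8)) * 0.1        # maps 2C concatenated channels back to C
f2 = feature_extraction_module(f0, np.tanh, np.abs, np.tanh, fuse)
```

Both residual connections keep the output shape equal to the input shape, which is what allows eight such modules to be stacked.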

4.3.2 Implementation Details

To supervise the training process, we employ the L1 loss as the objective function. We conduct model training on 4 NVIDIA Tesla V100 GPUs with 32GB memory. In total, we perform 500 epochs of training. During the training, we adopt the Adam optimizer with a learning rate of $2\times 10^{-4}$. The patch size is set to $768\times 768$ pixels and the batch size is set to 16. To augment the training data, we apply random horizontal and vertical flips. For testing images, we use one NVIDIA GeForce RTX 4090 GPU with 24GB memory. The source code and pre-trained model are available at model.

Figure 26: The network architecture of team SuperGo.

4.4 WanFly Team’s Method

Figure 27: (a) Conditional diffusion; (b) Mean-Reverting SDE diffusion

Diffusion models are increasingly applied to low-light image enhancement because of their exceptional capability to model data distributions. An inherent drawback of diffusion models in image restoration, however, is that starting the reverse process from pure Gaussian noise can lead to artifacts [120, 117]. Therefore, as illustrated in Fig. 27, we adopt the Mean-Reverting Stochastic Differential Equation (SDE) [69] as the base diffusion framework, directly implementing the mapping from low-quality to high-quality images.
The fundamental idea of diffusion models is to gradually corrupt images by injecting noise, and then learn how to progressively remove this noise to reconstruct the original image. U-Net plays a crucial role in this denoising process. It is trained to predict the noise injected at each step, thereby methodically eliminating the noise and restoring the image. The U-Net used in diffusion models typically consists of residual blocks, upsampling and downsampling operations, and attention mechanisms. While the stacking of multiple residual blocks is beneficial for feature extraction, it increases the computational load, and the extensive convolutional operations are not friendly to low pixel values in low-light images.
Our motivation is to reduce multiplication operations in U-Net, protect low pixel values, and lighten the computational load. The simplified U-Net designed in this paper, as illustrated in Fig. 28(a), is constructed only from SimPF blocks, which combine the feature extraction module SimpleGate [15] with Parameter-free attention [115], together with upsampling and downsampling operations. This makes it suitable for processing low-light images and reduces the resource consumption of the diffusion model for faster sampling.
The code is available at https://github.com/MrWan001/SFDiff.

4.4.1 Network Architecture

As shown in Fig. 28(b), we designed the SimPF block with the idea of retaining the necessary convolution and normalisation layers and using less computationally intensive components to reduce multiplication operations across feature maps. We use $1\times 1$ convolutions and $3\times 3$ depth-wise separable convolutions for feature extraction; both convolution types have been applied and proven effective in a variety of image restoration tasks. Specifically, the feature map first undergoes a $1\times 1$ convolution to expand the number of channels while preserving spatial information. Subsequently, a $3\times 3$ depth-wise separable convolution is employed to encode features from spatially adjacent pixel positions, facilitating the learning of local image structures.

Figure 28: (a) U-Net with SimPF block. It is composed of SimPF blocks, upsampling and downsampling operations, along with skip connections; (b) SimPF block. It retains necessary convolution and normalization layers, incorporates SimpleGate and PFAM to minimize multiplication operations, and utilizes Time Embedding to align with diffusion models.

Since the activation function requires multiple multiplication operations, we use SimpleGate to replace complex nonlinear activation functions. SimpleGate can achieve the effect of nonlinear mapping through a single multiplication operation, which is particularly beneficial for preserving information in low pixel values, as complex functions like the cubic operations required in the GELU activation function can be detrimental to such information. The computation of SimpleGate is illustrated in Equation (9):

$$\operatorname{SimpleGate}(\bm{X},\bm{Y})=\bm{X}\odot\bm{Y}\tag{9}$$

$\bm{X}$ and $\bm{Y}$ are the two halves obtained by splitting a feature map with $C$ channels, height $H$, and width $W$ along the channel dimension, each of shape ($\frac{C}{2}$, $H$, $W$). The essence of this multiplication operation is a type of nonlinear mapping that can substitute for an activation function.
After the feature matrix has been weighted by the Parameter-Free Attention Mechanism (PFAM), a $1\times 1$ convolution is used to aggregate pixel-level cross-channel context information. The subsequent two $1\times 1$ convolutions facilitate interaction and combination among features across different channels, creating more complex and effective feature representations. To make the block usable in the diffusion model, we incorporate a time embedding block, which takes the current diffusion time step $t$ as input and encodes $t$ into the feature matrix, enabling the model to perceive noise at different time steps $t$. Overall, the design of the SimPF block minimizes multiplication operations while maintaining robust feature extraction capabilities.
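As a small illustration of Equation (9), the following NumPy sketch splits a channel-first feature map into two halves and multiplies them element-wise. The function name and the `(C, H, W)` layout are assumptions of the sketch, not the team's code.

```python
import numpy as np

def simple_gate(feat):
    """Split (C, H, W) along channels into X and Y of shape (C/2, H, W); return X * Y."""
    c = feat.shape[0]
    assert c % 2 == 0, "channel count must be even"
    x, y = feat[: c // 2], feat[c // 2 :]
    return x * y                              # one element-wise multiplication

feat = np.arange(16, dtype=np.float64).reshape(4, 2, 2)
gated = simple_gate(feat)                     # shape (2, 2, 2)
```

The output has half as many channels as the input, so the preceding convolution usually expands the channel count to compensate.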

4.4.2 Implementation Details

Our method is implemented using the PyTorch framework. The diffusion time step $T$ is set to $100$. A cosine scheduling scheme is utilized for noise scheduling. The optimization is carried out using the LION optimizer. The batch size is set to $6$. The initial learning rate is set to $4\times 10^{-5}$, and the Cosine Annealing strategy is employed for learning rate scheduling. The model is trained on a single NVIDIA GeForce RTX 3090 GPU and converged after 300,000 iterations.
During the training phase, we first tried center-cropping or randomly cropping the training images to 256 × 256, but this did not give good results. We ultimately resized the training images to 256 × 256, which worked well. During testing, the 6720 × 4480 images exceeded the maximum size that the model could handle. We first tried splitting each image into 2240 × 2240 tiles and merging the enhanced tiles, but the results were poor. We ultimately resized each image to 480 × 320, enhanced it with the model, and resized the output back to 6720 × 4480, which gave good results. In addition, because the ground truth was unavailable, we used the LPIPS metric for a preliminary evaluation of the enhancement results. We found that our proposed SimPF block performed better on the three images 1162, 496, and 735, while the model trained with the original NAF block performed better on the other test images. Therefore, we combined the results of the two models to obtain the final submission.

4.5 Teams and Affiliations

IMAGCX

Title: PBDL-Challenge-IMAGCX for the Low-Light-srgb-enhancement track Technical Report

Members: Xiang Chen ([email protected]), Hao Li, Jinshan Pan

Affiliations: Nanjing University of Science and Technology

chm

Title: PBDL Challenge on Low Light SRGB Image Enhancement

Members: Chuanlong Xie1 ([email protected]), Hongming Chen1, Mingrui Li2, Tianchen Deng3, Jingwei Huang4, Yufeng Li1

Affiliations: 1Shenyang Aerospace University, 2Dalian University of Technology, 3Shanghai Jiao Tong University, 4University of Electronic Science and Technology of China

WanFly

Title: WanFly for the Low-Light-srgb-enhancement track

Members: Fei Wan1,2 ([email protected]), Bingxin Xu1,2*, Jian Cheng1,2, Hongzhe Liu1,2, Cheng Xu1,2, Yuxiang Zou1,2, Weiguo Pan1,2, Songyin Dai1,2

Affiliations: 1*Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, 100101, China. 2College of Robotics, Beijing Union University

5 Extremely Low-light Image Denoising

Light is crucial for photography. Nighttime and low-light conditions impose significant challenges due to the limited number of photons and unavoidable noise. The typical response is to increase light capture by, for example, enlarging the aperture, lengthening the exposure time, or using a flash. However, each approach has its drawbacks: a larger aperture results in a shallow depth of field and is not feasible for smartphone cameras; extended exposure times can lead to blurriness from scene changes or camera movement; and flash can cause color distortions and is effective only for objects close to the camera.

A practical solution for low-light imaging is burst photography [74, 41, 66, 62], which aligns and fuses multiple images to increase the signal-to-noise ratio (SNR). However, burst photography is prone to ghosting effects [41, 89] when capturing dynamic scenes involving vehicles, people, etc. An emerging alternative is using neural networks to automatically learn the mapping from a low-light noisy image to its long-exposure counterpart [11]. This deep learning approach typically requires a large amount of labeled training data resembling real-world low-light photographs. Collecting extensive high-quality training samples from various modern camera devices is extremely labor-intensive and expensive.

To bridge the domain gap between synthetic images and real photos, some works have collected paired real data for both evaluation and training [1, 11, 87, 12, 50]. Despite promising results, gathering sufficient real data with true labels to prevent overfitting is very costly and time-consuming. Recent works use paired [56] or single noisy images [54, 135] as training data instead of paired noisy and clean images. However, they do not significantly reduce the labor required to capture a large volume of real-world training data.

Another research direction focuses on enhancing the realism of synthetic training data to avoid the challenges of obtaining real data from cameras. By considering photon arrival statistics ("shot" noise) and sensor readout effects ("read" noise), works like [74, 7] use a signal-dependent heteroscedastic Gaussian model [31] to characterize noise in raw sensor data. Recently, Wang et al. [101] proposed a noise model that accounts for dynamic stripe noise, color channel heterogeneity, and clipping effects to simulate high-sensitivity noise in real low-light color images. Additionally, a flow-based generative model called NoiseFlow [2] was proposed to describe the distribution of real noise using latent variables with a density of one. However, these methods often oversimplify the imaging pipeline of modern sensors, especially the noise sources introduced by camera electronics, which have been extensively studied in the electronic imaging community [53, 43, 40, 5, 27, 29, 48, 49, 6, 96, 22].
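For orientation, the signal-dependent heteroscedastic Gaussian model mentioned above can be sketched as Gaussian noise whose variance grows linearly with the clean signal. The parameter names `shot_gain` and `read_sigma` and their example values are hypothetical; real values come from camera calibration, and the clean signal is assumed to be non-negative.

```python
import random

def noise_variance(signal, shot_gain, read_sigma):
    """Variance = signal-dependent shot term + constant read term."""
    return shot_gain * signal + read_sigma ** 2

def add_noise(signal, shot_gain, read_sigma, rng):
    """Draw one noisy observation of a non-negative clean signal value."""
    std = noise_variance(signal, shot_gain, read_sigma) ** 0.5
    return signal + rng.gauss(0.0, std)

rng = random.Random(0)
clean = 0.5                                   # normalized clean intensity
noisy = [add_noise(clean, 0.01, 0.02, rng) for _ in range(5)]
```

Brighter pixels therefore receive noisier samples in absolute terms, while the read term dominates in the darkest regions, which is the regime that matters for extremely low-light images.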

Therefore, we are honored to collaborate with the CVPR 2024 Workshop to launch the Extremely Low-Light Image Denoising Challenge. The primary objective of this challenge is to use deep learning methods for denoising real extremely low-light images, optimizing denoising performance and model robustness. Participants are tasked with enhancing model robustness and denoising effectiveness by modeling noise in real imaging processes and using synthetic datasets for training. This challenge aims to explore realistic low-light noise models and efficient denoising models for extremely low-light images. We aim to rigorously assess their effectiveness and identify key trends in network design. We welcome participants to push the boundaries of innovation and advance the technology of extremely low-light image denoising.

5.1 Extreme Low-Light Image Denoising challenge

5.1.1 The dataset

To systematically study the generality of the proposed noise formation model, we collect an extreme low-light denoising (ELD) dataset [108] that covers 10 indoor scenes and 4 camera devices from multiple brands (SonyA7S2, NikonD850, CanonEOS70D, CanonEOS700D). We also record bias and flat field frames for each camera to calibrate our noise model. The data capture setup is shown in Fig. 29. For each scene and each camera, a reference image at the base ISO was taken first, followed by noisy images whose exposure time was deliberately decreased by low light factors f to simulate extreme low light conditions. Another reference image was then taken, akin to the first one, to ensure that no accidental error (e.g. drastic illumination change or accidental camera/scene motion) occurred. We choose three ISO levels (800, 1600, 3200) and two low light factors (100, 200) for noisy images to capture our dataset, resulting in 240 (3×2×10×4) raw image pairs in total. The hardest example in our dataset resembles the image captured at a “pseudo” ISO up to 640000 (3200×200).

Figure 29: Capture setup and example images from our dataset.

5.1.2 The evaluation protocols

The quantitative evaluation metrics include the standard Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), which are commonly used to assess image quality. To ensure fairness and accuracy in testing, we use the Codalab platform (https://codalab.lisn.upsaclay.fr/competitions/17787) for result evaluation. Due to the storage limitations of the Codalab platform, we have cropped the original raw images to a size of 1024×1024 and saved the cropped images along with relevant camera parameters in mat files provided to participants. During testing, participants are also required to save the denoised images in mat files for submission. The Codalab platform will then compute the evaluation metrics based on the ground truth. The final competition score is:

$$Score=\log_{k}(SSIM*k^{PSNR})=PSNR+\log_{k}(SSIM),\tag{10}$$

where k=1.2.
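Equation (10) can be evaluated directly, as in the short sketch below (the function name is illustrative). Since SSIM is at most 1, the logarithmic term is never positive, so it acts as a penalty on top of the PSNR.

```python
import math

def challenge_score(psnr, ssim, k=1.2):
    """Score = PSNR + log_k(SSIM); equals log_k(SSIM * k**PSNR)."""
    return psnr + math.log(ssim, k)

# A perfect SSIM of 1.0 leaves the PSNR unchanged; lower SSIM subtracts from it.
perfect = challenge_score(40.0, 1.0)
penalized = challenge_score(40.0, 0.90)
```

Leaderboard values are rounded to two decimals, so recomputing a score from the rounded PSNR and SSIM may differ slightly from the reported score.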

Table 8: Leaderboard of the Extremely Low-light Denoising
Rank User Score PSNR SSIM
1 jly724215288 43.80 43.89 0.99
2 yuxiaoxi 43.07 43.15 0.99

As shown in Tab. 8, in the extremely low-light denoising track, jly724215288 achieved first place with a score of 43.80, a PSNR of 43.89, and an SSIM of 0.99, demonstrating their exceptional denoising capabilities for extremely low-light images. yuxiaoxi secured second place with a score of 43.07, a PSNR of 43.15, and an SSIM of 0.99. Both teams performed excellently in extremely low-light image denoising, further highlighting the significance of their contributions.

These results highlight the remarkable progress made by the participating teams in addressing the challenges of noise in extremely low-light images. The top-ranked teams showcased their expertise and innovation in developing robust algorithms adapted to low-light conditions, paving the way for future advancements in computer vision research.

5.2 jly724215288 Team’s Method

5.2.1 Network Architecture

Datasets processing. The authors note that the specialty of the Bayer pattern raw sensor lies in each individual pixel receiving only one spectral wavelength of light at a time. Given the limited amount of training data available, it becomes imperative to consider the spectral properties of every training pair. There are four channels in the sensor: R, Gr, B, and Gb. The Gr and Gb pixels have slightly different intensity responses even though they both capture the green color wavelength, due to imperfections in the Color Filter Array lens (usually compensated by the Image Signal Processing GbGr balance module). Additionally, the offset in the R and B channels impacts denoising performance.

The authors’ method first applies accurate patch-based registration, even though the images are captured on a tripod.

The registration is performed using phase correlation (Fig. 30), which is robust against strong noise and brightness changes. First, the images are transformed into the frequency domain using the Fast Fourier Transform; then the cross-power spectrum is calculated by element-wise multiplication with the complex conjugate, followed by element-wise normalization:

$$\mathbf{G}_{a}=\mathcal{F}\{I_{a}\},\;\mathbf{G}_{b}=\mathcal{F}\{I_{b}\}$$

$$R=\frac{\mathbf{G}_{a}\circ\mathbf{G}_{b}^{*}}{|\mathbf{G}_{a}\circ\mathbf{G}_{b}^{*}|}$$

They then take the location of the maximum response in the inverse Fast Fourier Transform of $R$ as the image patch offset $(\Delta x,\Delta y)$.

$$r=\mathcal{F}^{-1}\{R\}$$
$$(\Delta x,\Delta y)=\arg\max_{(x,y)}\{r\}$$
Figure 30: Phase correlation. The position of the white point (the peak response) gives the image offset.
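The phase-correlation steps above can be sketched in NumPy as follows. This is a minimal integer-shift version, not the authors' code: the function name, the small `eps` guard against division by zero, and the wrapping of large peak indices to negative shifts are choices made for this illustration.

```python
import numpy as np

def phase_correlation_offset(img_a, img_b, eps=1e-12):
    """Estimate the integer shift (dx, dy) such that img_a is img_b shifted by (dx, dy)."""
    g_a = np.fft.fft2(img_a)
    g_b = np.fft.fft2(img_b)
    cross = g_a * np.conj(g_b)                # element-wise product with the conjugate
    r_spec = cross / (np.abs(cross) + eps)    # element-wise normalization
    r = np.fft.ifft2(r_spec).real             # correlation surface
    dy, dx = np.unravel_index(np.argmax(r), r.shape)
    h, w = r.shape
    # Peak positions past the midpoint correspond to negative (wrapped) shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dx), int(dy)

# Synthetic check: shift a random patch circularly and recover the offset.
rng = np.random.default_rng(0)
patch_b = rng.normal(size=(32, 32))
patch_a = np.roll(patch_b, shift=(3, -5), axis=(0, 1))   # 3 rows down, 5 columns left
offset = phase_correlation_offset(patch_a, patch_b)
```

Normalizing the cross-power spectrum discards magnitude and keeps only phase, which is why the estimate is insensitive to global brightness differences between the two patches.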

Because sensors from different brands differ in their settings and exposure times, and because the final result is sensitive to the expected exposure value (EV), the authors introduce a variable $\lambda$, ranging from 0.1 to 10, to reduce reliance on precise exposure accuracy. The digital gain (ISO) and exposure time in seconds are extracted from EXIF metadata and combined into the exposure value (EV) for ratio estimation.

$$ratio=\lambda\frac{EV_{gt}}{EV_{in}}=\lambda\frac{ISO_{gt}\times TIME_{gt}}{ISO_{in}\times TIME_{in}}$$
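The ratio formula is simple arithmetic on EXIF values, as the sketch below shows. The function and argument names are illustrative, and the example ISO and exposure values are made up for the demonstration rather than taken from the dataset.

```python
def exposure_ratio(iso_gt, time_gt, iso_in, time_in, lam=1.0):
    """ratio = lambda * (ISO_gt * TIME_gt) / (ISO_in * TIME_in)."""
    return lam * (iso_gt * time_gt) / (iso_in * time_in)

# Reference at ISO 100 for 10 s versus a noisy input at ISO 800 for 1/8 s.
base = exposure_ratio(100, 10.0, 800, 0.125)            # nominal ratio
scaled = exposure_ratio(100, 10.0, 800, 0.125, lam=0.5) # same pair, lambda = 0.5
```

Varying `lam` during training exposes the network to inputs that are brighter or darker than the nominal ratio would give, which matches the stated goal of reducing reliance on exact exposure metadata.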

The authors carefully perform data augmentation through random size cropping and rotation while maintaining Bayer pixel alignment. They avoid any scale-like resampling augmentation to preserve the sensor noise properties.

Network Architecture. The authors use a two-stage training strategy for Bayer raw denoising, employing slightly different networks for each stage. In the first stage, they use a U-Net with residual blocks as the denoising network, utilizing the L1 loss function for faster convergence. In the second stage, they add attention blocks after the residual blocks and freeze the weights of the first-stage network, enhancing the denoising capability by minimizing the mean square error (MSE).

Figure 31: U-Net architecture [83].

5.2.2 Implementation Details

The authors first train their model on raw files from various camera brands. Each training image patch pair undergoes extensive augmentation, including cropping, phase alignment with the ground truth, rotation, and scaling by the $\lambda$ ratio.

Following the initial pretraining, the authors fine-tune the model using the specific camera brand intended for testing to improve noise model estimation.

In practice, different sensors exhibit unequal black levels and white points per channel. Another contribution by the authors is the development of a method to normalize raw Bayer data values $I_{n}$ across images taken by different camera brands and exposure settings. When an image is captured, the raw file also records the black levels $L_{b}$ and white points per channel $L_{w}$. After multiplying by the EV ratio (which is approximately 10 times larger than the camera EV change) to achieve normal brightness, the noisy input might be clipped by the sensor’s maximum bit value (usually 14 bits), leading to signal loss. The authors address this problem by normalizing using a large denominator $L_{m}$ (above 16 bits) after subtracting the black levels. Replacing $L_{w}$ with $L_{m}$ helps preserve image detail. Additionally, the variable $\lambda$ in training results in better denoised images while maintaining GPU-capable precision.

$$I_{n}=\frac{I-L_{b}}{L_{m}-L_{b}}$$

The following equation shows the full pipeline: the normalized Bayer raw image is multiplied by the exposure ratio, passed through the network function $f(\cdot)$, and finally de-normalized to the expected denoised bright image $I_{nr}$.

$$I_{nr}=f(I_{n}*ratio)(L_{m}-L_{b})+L_{b}$$
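The normalize, amplify, denoise, and de-normalize chain can be sketched as below. A scalar black level, the value `2**16` for $L_{m}$, and the identity stand-in for the network $f$ are all assumptions made for illustration; the report only states that $L_{m}$ is a large denominator above 16 bits.

```python
import numpy as np

def normalize_raw(raw, black_level, l_max):
    """I_n = (I - L_b) / (L_m - L_b), computed in floating point."""
    return (raw.astype(np.float64) - black_level) / (l_max - black_level)

def denoise_raw(raw, black_level, l_max, ratio, network):
    """I_nr = f(I_n * ratio) * (L_m - L_b) + L_b."""
    amplified = normalize_raw(raw, black_level, l_max) * ratio
    return network(amplified) * (l_max - black_level) + black_level

raw = np.array([[512, 1024], [2048, 612]], dtype=np.uint16)  # toy 14-bit-style values
identity = lambda t: t                                       # stand-in for f
bright = denoise_raw(raw, black_level=512, l_max=2.0 ** 16, ratio=100.0, network=identity)
# bright[1, 0] is 154112.0, far above the 14-bit maximum of 16383, and is not clipped.
```

Working in floating point with a large denominator is what lets amplified values exceed the sensor's white point without saturating before they reach the network.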
Figure 32: The overall network architecture of method

5.3 yuxiaoxi Team’s Method

5.3.1 Network Architecture

The authors adopt the classic single-stage U-shaped architecture with skip-connections, as shown in Fig. 32, to reduce inter-block complexity. The neural networks are constructed by stacking blocks. They start with a plain block containing the most common components: convolution, ReLU, and shortcut. Additionally, they find that vanilla channel attention meets the requirements for computational efficiency and brings global information to the feature map. Given the proven effectiveness of channel attention in image restoration tasks, the authors incorporate channel attention into the plain block.

The authors use convolution with a kernel size of 2 and a stride of 2 for the downsample layer. For the upsample layer, they double the channel width using a pointwise convolution followed by a pixel shuffle module. There are skip connections from the encoder block to the decoder block, and the authors simply add the encoder and decoder features element-wise for feature fusion. The default width and number of blocks are 64 and 36, respectively, and their network architecture consists of a total of 5 layers. The encoder has 4 layers with block quantities of 2, 2, 4, and 8, respectively. The intermediate connection layer has 12 blocks. The decoder also has 4 layers, with each layer containing 2 blocks.
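The block budget and the channel arithmetic of the upsample layer described above can be checked with a few lines of Python. The sketch assumes that the channel width doubles after every downsample layer, which the text does not state explicitly; the helper names are hypothetical.

```python
# Block counts as stated in the text.
encoder_blocks = [2, 2, 4, 8]
middle_blocks = 12
decoder_blocks = [2, 2, 2, 2]
total_blocks = sum(encoder_blocks) + middle_blocks + sum(decoder_blocks)

def channels_per_stage(base_width, num_stages):
    """Channel width per resolution level, ASSUMING it doubles at each downsample."""
    return [base_width * 2**i for i in range(num_stages)]

def upsample_channels(in_channels, scale=2):
    """A pointwise conv doubles the width, then pixel shuffle divides it by scale**2."""
    return (2 * in_channels) // (scale * scale)

widths = channels_per_stage(base_width=64, num_stages=5)
```

Under this assumption, each upsample layer halves the channel width while doubling the spatial resolution, so the element-wise addition with the encoder features of the same level is shape-compatible.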

5.3.2 Training strategy

The authors utilize the $L_1$ loss function as the training objective, similar to most denoising methods. They employ the same data preprocessing and optimization strategy as ELD during pre-training. The raw images with long exposure times in the SID train subset are used for noise synthesis. For data preprocessing, they pack the Bayer images into 4 channels, then crop the long exposure data into patches of size $512 \times 512$ with a non-overlapping step of 256. The models are trained for 300 epochs using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, without applying weight decay. The initial learning rate is set to $10^{-4}$, halved at the 150th epoch, and further reduced to $10^{-5}$ at the 220th epoch.
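As a rough illustration of two steps in this recipe, the sketch below packs a Bayer mosaic into four channels and reproduces the step learning-rate schedule. The plane ordering (which depends on the sensor's CFA pattern) and the 0-indexed epoch boundaries are assumptions, and the function names are hypothetical.

```python
import numpy as np

def pack_bayer(raw):
    """Pack an (H, W) Bayer mosaic into (4, H/2, W/2), one plane per 2x2 position.

    Which colour each plane holds depends on the CFA pattern (e.g. RGGB);
    the ordering used here is only an assumption.
    """
    h, w = raw.shape
    assert h % 2 == 0 and w % 2 == 0, "Bayer mosaics need even dimensions"
    return np.stack([raw[0::2, 0::2], raw[0::2, 1::2],
                     raw[1::2, 0::2], raw[1::2, 1::2]], axis=0)

def step_learning_rate(epoch):
    """1e-4 initially, halved at epoch 150, 1e-5 from epoch 220 (0-indexed epochs assumed)."""
    if epoch < 150:
        return 1e-4
    if epoch < 220:
        return 5e-5
    return 1e-5

mosaic = np.arange(16, dtype=np.float32).reshape(4, 4)
packed = pack_bayer(mosaic)
```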

The inference code and the pre-trained models have been released.

5.4 Teams and Affiliations

jly724215288

Title: Technique Report of Team jly724215288 for CVPR 2024 PBDL Challenge Extremely Low-light Image Denoising

Members: Linyan Jiang ([email protected]), Bingyi Song, Zhuoyu An, Haibo Lei, Qing Luo, Jie Song

Affiliations: Tencent

yuxiaoxi

Title: Technique Report of Team yuxiaoxi for CVPR 2024 PBDL Challenge Extremely Low-light Image Denoising

Members: Yuan Liu ([email protected]), Qihang Li, Haoyuan Zhang, Lingfeng Wang, Wei Chen, Aling Luo

Affiliations: Sanechips Technology Co., LTD

6 Low-light RAW Image Enhancement

Performing image enhancement under low-light conditions poses several challenges, such as degradation of details, color distortion, and severe noise, which significantly affect the quality of images [107, 108, 33, 126]. Meanwhile, compared to the camera's 8-bit sRGB output, RAW data has not been processed by the Image Signal Processor (ISP); thus, it retains its linearity with the scene and more unquantized information [136, 35]. Building on these advantages of RAW data, the CVPR 2024 PBDL Challenge on Low-light RAW Image Enhancement aims to assess and improve the robustness of algorithms on images captured in low-light conditions, addressing the problem of image quality degradation.

In the low-light RAW image enhancement track (Table 9), the top two teams demonstrated exceptional performance. Miers achieved a total score of 30.11, with a PSNR of 31.13 dB and an SSIM of 0.84. ISS achieved a total score of 25.95, with a PSNR of 27.09 dB and an SSIM of 0.82. These results highlight the advances made by the participating teams in addressing the challenges of low-light RAW image enhancement.

Table 9: Leaderboard of the low-light RAW image enhancement.
Rank Team Scores PSNR SSIM
1 Miers 30.11 31.13 0.84
2 ISS 25.95 27.09 0.82
Figure 33: Example scenes in our captured RAW image dataset. The last row shows the reference images, and the rows above it show the low-light images at 4 different ratios.

6.1 Low-light RAW Image Dataset

To systematically investigate the effectiveness of the proposed method in real-world conditions, a real low-light image dataset for enhancement is necessary and fundamental.

We use a Canon EOS 5D Mark IV to capture the data. To capture low/normal-light image pairs, the camera was mounted on a sturdy tripod and controlled remotely via a mobile app. The camera was not touched between the capture of the normal-light and low-light images to avoid vibration. For each pair, we first take the normal-light image and fix the ISO and aperture. The low-light images are then captured by changing the shutter speed (exposure time) to simulate low-light conditions. We capture our dataset both indoors and outdoors to increase the richness of the scenes, which include both natural scenarios and manually built setups. The dataset exhibits the following characteristics:

  • Paired samples. The dataset includes images in RAW format. Each sample consists of a normal-light reference image and four low-light images at different ratios (8, 16, 32, 64).

  • Diverse scenes. The dataset contains 832 image pairs in 208 scenes. Our dataset stands out with its high resolution of $6720 \times 4480$, surpassing the common resolutions (below $1920 \times 1080$) found in other datasets. This higher resolution captures finer details, offering a more comprehensive analysis for low-light enhancement.

The dataset includes images captured in indoor and outdoor scenes under varying lighting conditions as shown in Fig. 33.
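The capture protocol above implies a simple relation between the reference exposure and the low-light exposures. The sketch below assumes that a ratio $r$ means the low-light exposure time is the reference exposure time divided by $r$ (with ISO and aperture fixed); this interpretation and the 0.5 s reference time are assumptions for illustration only.

```python
RATIOS = (8, 16, 32, 64)

def low_light_exposure_times(reference_time_s, ratios=RATIOS):
    """Exposure time per ratio, ASSUMING ratio = reference time / low-light time."""
    return {r: reference_time_s / r for r in ratios}

num_scenes = 208
num_pairs = num_scenes * len(RATIOS)   # one pair per (scene, ratio)
times = low_light_exposure_times(0.5)  # hypothetical 0.5 s reference exposure
```

The pair count obtained this way matches the 832 image pairs reported for the 208 scenes.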

6.2 Miers Team’s Method

6.2.1 Network Architecture

The Miers team proposes a multi-scale, light-weight transformer model for low-light RAW image enhancement. Unlike previous Retinex-based methods [10, 60, 119], which generally decompose the input image into illumination and reflection components, the proposed method adaptively aligns brightness-induced differences by introducing a learnable guidance vector in the self-attention mechanism. The network architecture, called SANet, is shown in Fig. 34. SANet extracts features at different scales sequentially and performs feature fusion through long connections, which effectively reduces computation. At each scale, the proposed method uses concatenated residual blocks and the SABlock as basic modules to obtain non-local views.

The self-attention mechanism in the Transformer structure has proven highly effective in low-level image enhancement, and the proposed SABlock (shown in Fig. 35) builds on it. The SABlock captures global dependency information by building key-value pairs on feature blocks. In the low-light enhancement task, illumination causes large differences in the distribution of input images, which makes it difficult for the model to fit the normally exposed target. The team therefore introduces a learnable adaptive vector in the SABlock to control the gap between the input RAW image and the target, which steers the fitting toward the correct output.

It is also worth noting that downsampling in SANet is implemented using a $4 \times 4$ convolution with stride 2, and the UP Block consists of a $3 \times 3$ depthwise separable convolution and pixel shuffle.
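The effect of these two resampling layers on tensor shapes can be sketched as follows. The padding of 1 for the $4 \times 4$ stride-2 convolution is an assumption (the text does not give it), and `pixel_shuffle` is a NumPy re-implementation of the usual pixel-shuffle layout rather than code from the team.

```python
import numpy as np

def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # index order: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # index order: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# A 128x128 crop is halved by the 4x4 stride-2 convolution (padding 1 assumed).
down = conv_output_size(128, kernel=4, stride=2, padding=1)

# Pixel shuffle trades 4 channels for a 2x larger spatial grid.
features = np.arange(16).reshape(4, 2, 2)
up = pixel_shuffle(features, r=2)
```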

Figure 34: The proposed method.
Figure 35: The structure of the SABlock.

6.2.2 Implementation Details

The code is based on BasicSR [104] and EFNet [93]. During the training stage, only the data provided by the competition is used. First, the input image is cropped to $128 \times 128$, and rotation and flipping are applied as data augmentation. Because image brightness differs greatly across ISOs, the input RAW image is normalized by subtracting the black level, dividing by the difference between the white level and the black level, and then multiplying by the ratio. The ratio is calculated by

$$\mathit{ratio} = \frac{1}{\max\left(\frac{\text{raw\_image} - \text{black\_level}}{\text{white\_level} - \text{black\_level}}\right)}$$

In the training process, the batch size is 4 and the total number of iterations is set to 500,000. The team uses the L1 loss as the training loss and MultiStepLR for learning rate decay. In addition, an exponential moving average (EMA) of the model weights is maintained, and the model with the highest PSNR on the validation set is finally selected for testing.
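The input normalization described in this subsection can be sketched as below. It is a minimal NumPy illustration that assumes at least one pixel lies above the black level; the function name and the example values are hypothetical.

```python
import numpy as np

def normalize_raw(raw_image, black_level, white_level):
    """Subtract the black level, scale to the sensor range, then multiply by the ratio.

    ratio = 1 / max((raw - black) / (white - black)), so the brightest pixel maps to 1.
    Assumes at least one pixel is above the black level.
    """
    scaled = (raw_image.astype(np.float64) - black_level) / (white_level - black_level)
    ratio = 1.0 / scaled.max()
    return scaled * ratio, ratio

# Hypothetical dark 14-bit capture.
raw = np.array([[512, 1000], [2048, 4352]], dtype=np.uint16)
normalized, ratio = normalize_raw(raw, black_level=512, white_level=16383)
```

Because the ratio is derived from the image maximum, inputs captured at very different brightness levels are all stretched to the same $[0, 1]$ range before entering the network.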

6.3 ISS Team’s Method

6.3.1 Network Architecture

The ISS team uses the algorithm proposed in "Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising" [51]. It adapts to the target camera without calibrating noise parameters or repeated training, requires only a small amount of paired data and fine-tuning, eliminates the complicated calibration steps, and achieves good performance.

Figure 36: Overview of network architecture.

As shown in Fig. 36, the whole network adopts the macro architecture of U-Net [83], in which the convolution blocks are replaced with the reparameterized noise removal (RepNR) block [51]. In the pre-training stage, the RepNR block has $k$ Camera-Specific Alignment (CSA) branches [51], each of which is fitted to one class of camera noise. In the fine-tuning stage, the $k$ CSA modules are averaged, which is equivalent to ensembling models over the noise of multiple camera classes. At this point, the RepNR block consists of two branches: the upper $3 \times 3$ convolution is designed to fit the out-of-model noise, while the lower CSA module mainly adjusts the distribution of the input features.
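The claim that averaging the $k$ CSA modules acts like an ensemble can be illustrated with a small NumPy sketch. It assumes that each CSA module is a channel-wise scale and bias, which is not stated in this report; the random parameters and shapes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, channels = 3, 4
scales = rng.normal(1.0, 0.1, size=(k, channels))   # one scale vector per camera branch
biases = rng.normal(0.0, 0.1, size=(k, channels))   # one bias vector per camera branch
features = rng.normal(size=(channels, 8, 8))

def csa(x, scale, bias):
    """Channel-wise affine alignment (ASSUMED form of a CSA module)."""
    return x * scale[:, None, None] + bias[:, None, None]

# Averaging the parameters of the k branches ...
merged = csa(features, scales.mean(axis=0), biases.mean(axis=0))
# ... equals averaging the k branch outputs, because the operation is affine.
ensemble = np.mean([csa(features, scales[i], biases[i]) for i in range(k)], axis=0)
```

The equality holds for the affine alignment itself; it says nothing about how the averaged module interacts with later non-linear layers.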

6.3.2 Implementation Details

By utilizing LED methods [51], we ultimately reduced the number of scenes to just four groups, randomly selecting three distinct images for each scene to pair with their respective ground truth images. These image pairs were rapidly deployed to a new camera using the provided pre-trained weights.

Subsequently, fine-tuning was conducted using a small amount of real data, with the RepNR block replacing the convolutional layers in the UNet architecture. During the fine-tuning process, we initially iterated the CSA from the pre-trained model for 5000 iterations until convergence. The optimizer used was Adam with a learning rate of 0.0001, and the training strategy employed a cosine annealing approach. An additional branch was then fine-tuned for an additional 3000 iterations, with the optimizer and training strategy consistent with the main branch. The loss function chosen was L1 loss.

6.4 Teams and Affiliations

Miers

Title: A Light-weight Aligned Attention for Low-light Raw Enhancement

Members: Cheng Li ([email protected]), Jun Cao, Shu Chen, Zifei Dou

Affiliations: Xiaomi Inc., China

ISS

Title: Low-light Raw Image Enhancement Technical solution

Members: Xinyu Liu, Jing Zhang (jingzhang_work@163.com), Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang

Affiliations: Intelligent Perception and Image Understanding Lab, Xidian University

7 HDR Reconstruction from a Single Raw Image

The dynamic range of real-world scenes frequently exceeds the capture capabilities of standard consumer camera sensors, often resulting in loss of detail in both overly bright and dark areas. In underexposed regions, noise becomes significant and affects the visual quality [11, 122, 108, 133], while in overexposed regions, information is often clipped [46, 35, 3]. To address this, the computational imaging community has extensively explored High Dynamic Range (HDR) imaging, which records a broader spectrum of intensity levels and captures more scene information. Unlike conventional Low Dynamic Range (LDR) images, HDR preserves greater detail in both over- and under-exposed areas. This enhancement not only benefits various vision tasks, such as segmentation [73] and object detection [77, 105], but also produces more visually pleasing images—a goal long pursued by computer vision researchers.

To advance HDR reconstruction research, we are launching a challenge focused on reconstructing HDR images from single Raw images. This approach specifically targets single Raw image HDR reconstruction, avoiding potential misalignments that can occur in multi-image fusion. We utilize a Raw-to-HDR dataset that focuses on HDR reconstruction from a single Raw image, as shown in Fig. 37, which contains pairs of Raw and HDR images. The Raw input is captured under challenging lighting conditions, representing the over- and under-exposed regions of a high dynamic range scene. The corresponding ground truth HDR images in the dataset are produced through bracketed exposures of each scene, subsequently merged using basic HDR fusion algorithms [23].

Figure 37: Representative examples from the Raw-to-HDR dataset SRHDR (Single Raw HDR). For each scene, a Raw image captured under challenging lighting serves as the input image, and the HDR image merged from bracketed exposures is used as ground truth.

7.1 Dataset

We capture and curate a real paired Raw-to-HDR dataset called SRHDR for HDR reconstruction from a single Raw image. The SRHDR dataset extends RawHDR [136] in both quantity and difficulty, and covers a large range of HDR scenarios including modern and ancient buildings, art districts, tourist attractions, street shops and restaurants, abandoned factories, city views, and more. These images are captured at different times of the day, including daytime and nighttime, which further guarantees the diversity of the paired Raw-to-HDR dataset. The data capture process involves several steps. Initially, we carefully select scenes with high dynamic range potential. Then, using a Canon 5D Mark IV camera mounted on a tripod, we employ bracket exposure mode to capture different exposures of the same scene. The Raw images taken in challenging lighting conditions, specifically from -3EV to +3EV, are used as input images. The corresponding ground truth images are created using an HDR merging method, as described by Debevec et al. [23]. The challenge dataset has the following characteristics:

  • High resolution. Our dataset stands out with its high resolution of 6720x4480, surpassing the common resolutions (below 1920x1080) found in other HDR datasets. This higher resolution captures finer details, offering a more comprehensive analysis for HDR reconstruction.

  • High bit-depth ground truth. The SRHDR dataset features ground truth HDR images with a bit-depth of over 20 bits, utilizing a linear HDR format. This high bit-depth ensures a richer and more precise representation of color and light intensities.

  • Real paired samples. Each image pair in the dataset is meticulously captured through multi-exposure fusion. The input comprises actual images shot with a DSLR under challenging lighting conditions. The corresponding ground truth HDR images are generated using a widely accepted HDR merging algorithm, ensuring authenticity and relevance.

  • Raw images as input. The use of unprocessed Raw sensor data as the input format leverages the higher bit-depth and superior intensity tolerance of Raw data, effectively addressing the common issue of insufficient scene information in HDR image processing.
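To make the ground-truth generation more concrete, the sketch below merges bracketed exposures of a linear signal into a single radiance estimate. It is a simplified weighted average in the spirit of exposure merging, not the exact procedure of [23] or of the dataset authors: it assumes linear Raw values normalized to $[0, 1]$, a hypothetical triangle weighting, and noise-free synthetic inputs.

```python
import numpy as np

def hat_weight(x):
    """Triangle weight on [0, 1]: zero at 0 and 1 (clipped values), peak at 0.5."""
    return 1.0 - np.abs(2.0 * x - 1.0)

def merge_exposures(images, exposure_times, eps=1e-8):
    """Weighted average of per-exposure radiance estimates image / t."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for image, t in zip(images, exposure_times):
        w = hat_weight(image)
        num += w * image / t
        den += w
    return num / np.maximum(den, eps)

# Synthetic scene whose radiance spans more than one exposure can hold.
radiance = np.array([0.01, 0.5, 4.0])
times = [0.125, 1.0, 8.0]
images = [np.clip(radiance * t, 0.0, 1.0) for t in times]
hdr = merge_exposures(images, times)
```

Saturated pixels receive zero weight, so the bright region is recovered from the shortest exposure while the dark region relies on the longer ones.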

Table 10: Results and rankings of top-2 competitors.
Rank Team Score PSNR SSIM PSNR-$\mu$ MS-SSIM
1 Alanosu 66.90 34.02 0.94 33.43 0.97
2 USTCX 65.82 32.33 0.95 33.89 0.98

7.2 Alanosu Team’s Method

7.2.1 Network Architecture

Figure 38: The Network Architecture of our proposed MHDRUNet.

We introduce MHDRUNet, a model that exploits the different emphasis of channel information in Raw images for exposure guidance, which is then used for HDR reconstruction from single-frame Raw images. The model framework is shown in Fig. 38. As observed in RawHDR [136], Raw images have higher intensity values in the green channels than in the red and blue channels. Therefore, we split $I_{RGBG}$ into $I_{RB}$ and $I_G$, using the RB channels for exposure estimation to derive an underexposure mask $M_{under}$, and then reconstruct the underexposed areas based on the G channel to get $Y_G$. Similarly, we use the G channel for exposure estimation to obtain an overexposure mask $M_{over}$, and then reconstruct the overexposed regions using the RB channels to get $Y_{RB}$. We combine $Y_G$ and $Y_{RB}$ through a weighted sum. To ensure smoothness in HDR reconstruction across the global range, we utilize the original Raw data for global exposure-guided reconstruction. The exposure reconstruction network is the complete HDRUNet [19], with inputs including the Raw image and a condition image, which by default matches the Raw image.
The exposure estimation mask module consists of a CNN with residual connections. In addition, we propose a Refine Exposure Adjustment method. By analyzing the distribution of values in the input Raw image, we can estimate the areas of overexposure and underexposure. If the area of the underexposure mask is greater than that of the overexposure mask, we consider the image underexposed; conversely, if the overexposure mask is larger, the image is considered overexposed. Based on this, we make appropriate exposure adjustments to the original Raw data, bringing underexposed images to a slightly underexposed state and overexposed images to a slightly overexposed state. These adjusted images are then used as the condition images fed into the HDRUNet network, thereby achieving better reconstruction results.
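The decision rule behind Refine Exposure Adjustment can be sketched as follows. This is only an illustration of the area comparison: simple intensity thresholds stand in for the learned masks, and the thresholds (0.05, 0.95) and gains (2.0, 0.5) are hypothetical values, not those used by the team.

```python
import numpy as np

def classify_exposure(raw, low=0.05, high=0.95):
    """Compare the areas of threshold masks (stand-ins for the learned masks)."""
    under_area = np.mean(raw < low)    # fraction of underexposed pixels
    over_area = np.mean(raw > high)    # fraction of overexposed pixels
    return "under" if under_area > over_area else "over"

def condition_image(raw, under_gain=2.0, over_gain=0.5):
    """Brighten underexposed inputs and darken overexposed ones (hypothetical gains)."""
    gain = under_gain if classify_exposure(raw) == "under" else over_gain
    return np.clip(raw * gain, 0.0, 1.0)

dark = np.full((4, 4), 0.01)
bright = np.full((4, 4), 0.99)
dark_condition = condition_image(dark)
bright_condition = condition_image(bright)
```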

7.2.2 Implementation Details

We use the dataset provided by the HDR Reconstruction from a Single Raw Image challenge. Before training, we pre-process the data by cropping images into $768 \times 768$ patches. During training, the mini-batch size is set to 1, and the Adam optimizer [52] and Kaiming initialization [42] are adopted. The initial learning rate is set to $1 \times 10^{-4}$, and all models are built on the PyTorch framework and trained on an NVIDIA 3090 GPU. Notably, we find the HDR reconstruction task to be more challenging for overexposed images than for underexposed images. Therefore, we propose a training strategy called Random Overexposure Adjustment. Specifically, during training, we randomly apply varying degrees of overexposure adjustment to the input underexposed images to generate pseudo-overexposed images for data augmentation, thereby enhancing the model's robustness. We use the tanh L2 loss [19] and the SSIM loss [106] to achieve better training results. Additionally, we employ a constraint loss $L_{mask}$ [136], which guides the learning of the mask.
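One plausible reading of the tanh L2 loss is an L2 distance computed after passing both images through $\tanh$, which compresses large HDR values. The sketch below follows that reading; the exact formulation used by the team is not given in the report, so this should be taken as an assumption.

```python
import numpy as np

def tanh_l2_loss(pred, target):
    """Mean squared error between tanh-compressed images (assumed formulation)."""
    return float(np.mean((np.tanh(pred) - np.tanh(target)) ** 2))

def l2_loss(pred, target):
    """Plain mean squared error, for comparison."""
    return float(np.mean((pred - target) ** 2))

# The same absolute error in a very bright region vs. a dark region.
highlight = tanh_l2_loss(np.array([8.0]), np.array([10.0]))
shadow = tanh_l2_loss(np.array([0.1]), np.array([0.3]))
```

Under this formulation, an error in a very bright region contributes far less than the same absolute error in a dark region, which keeps extreme highlights from dominating the objective.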

7.3 USTCX Team’s Method

7.3.1 Network Architecture

Overall Pipeline. Inspired by Restormer [124], our network architecture is shown in Fig. 39. It overcomes the limitations of traditional Convolutional Neural Networks (CNNs) by utilizing Transformers’ ability to capture long-range pixel interactions, which is crucial for High Dynamic Range (HDR) image reconstruction. We introduce the two modules of the Transformer block: (a) multi-Dconv head transposed attention (MDTA) and (b) gated-Dconv feed-forward network (GDFN).

Figure 39: The Restormer Framework [124] for HDR Reconstruction.

Multi-Dconv Head Transposed Attention. This module replaces the standard multi-head self-attention mechanism. It operates across feature dimensions instead of spatial dimensions, reducing complexity. It uses $1 \times 1$ convolutions for pixel-wise cross-channel context aggregation and $3 \times 3$ depth-wise convolutions for channel-wise spatial context encoding.
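The core idea of attending across channels rather than across pixels can be sketched in NumPy as below. The sketch covers a single head and omits the $1 \times 1$ and depth-wise convolutions that produce the query, key, and value tensors; the L2 normalization and the `temperature` scaling follow the Restormer design [124], while the shapes and random inputs are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(q, k, v, temperature=1.0):
    """q, k, v: (C, H*W). The attention map is C x C instead of (H*W) x (H*W)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    attn = softmax(q @ k.T * temperature, axis=-1)   # (C, C) channel-to-channel map
    return attn @ v, attn

rng = np.random.default_rng(0)
channels, pixels = 8, 64
q, k, v = (rng.normal(size=(channels, pixels)) for _ in range(3))
out, attn = transposed_attention(q, k, v)
```

Since the attention map has size $C \times C$, its cost grows linearly with the number of pixels, which is what makes the design practical for high-resolution inputs.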

Gated-Dconv Feed-Forward Network. This module includes a gating mechanism and depth-wise convolutions to control information flow and encode local image structures. The gating mechanism is an element-wise product of two linear transformation paths, one activated with GELU non-linearity.

Loss Functions. In our work, we use the Charbonnier loss [103] to optimize our network. This loss function is particularly effective for handling outliers and robust to noise. Its formulation is as follows:

$$\mathcal{L}_{\mathrm{content}} = \sqrt{\left\| \hat{I} - I \right\|_{2} + \epsilon^{2}} \tag{11}$$

where $\hat{I}$ is the predicted HDR Raw image, $I$ is the ground truth, and $\epsilon$ is set to 0.0001 as default.

In addition to the content loss, we leverage frequency domain information to introduce auxiliary loss to our network, which is defined as follows:

$$\mathcal{L}_{\mathrm{frequency}} = \left\| \mathcal{F}(\hat{I}) - \mathcal{F}(I) \right\|_{1} \tag{12}$$

where $\mathcal{F}(\cdot)$ indicates the Fast Fourier Transform (FFT). Finally, the total loss could be defined as:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{content}} + \lambda \mathcal{L}_{\mathrm{frequency}} \tag{13}$$

where $\lambda$ denotes the balanced weight, and we empirically set $\lambda$ to 0.5 as default.
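The combined objective can be sketched in NumPy as follows. The sketch uses the common element-wise form of the Charbonnier penalty and averages over pixels and frequencies instead of summing; both choices are assumptions about implementation details that the report does not specify.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-4):
    """Element-wise Charbonnier penalty sqrt(diff^2 + eps^2), averaged over pixels."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def frequency_loss(pred, target):
    """L1 distance between the 2-D FFTs of prediction and ground truth (mean-reduced)."""
    diff = np.fft.fft2(pred) - np.fft.fft2(target)
    return float(np.mean(np.abs(diff)))

def total_loss(pred, target, lam=0.5):
    """Content loss plus the frequency loss weighted by lambda = 0.5."""
    return charbonnier_loss(pred, target) + lam * frequency_loss(pred, target)

rng = np.random.default_rng(0)
target = rng.random((16, 16))
pred = target + 0.05 * rng.normal(size=(16, 16))
loss = total_loss(pred, target)
```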

7.4 Implementation details

We utilized PyTorch 1.8 within an NVIDIA 3090 GPU environment, equipped with 24GB of memory, to train our model on official datasets with a batch size of 4. The input images were standardized to an $80 \times 80$ resolution. The training spanned approximately 23 hours, with a learning rate that started at $3 \times 10^{-4}$ and was reduced to $1 \times 10^{-7}$ over 75,000 iterations using a Cosine Annealing schedule. This was followed by a second phase with a learning rate of $6 \times 10^{-5}$, also reduced to $1 \times 10^{-7}$ over an additional 60,000 iterations. Notably, no special efficiency optimization strategies were applied during this process.
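The two-phase schedule described above can be written as a small pure-Python function. The sketch uses the standard cosine annealing formula and assumes that the second phase also follows a cosine curve and restarts immediately after iteration 75,000; the function name is hypothetical.

```python
import math

PHASES = [
    # (start_lr, end_lr, iterations)
    (3e-4, 1e-7, 75_000),
    (6e-5, 1e-7, 60_000),
]

def cosine_learning_rate(iteration):
    """Learning rate at a given global iteration under the two-phase cosine schedule."""
    start = 0
    for base_lr, min_lr, length in PHASES:
        if iteration < start + length:
            progress = (iteration - start) / length
            return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
        start += length
    return PHASES[-1][1]   # after the last phase, stay at the final minimum

total_iterations = sum(length for _, _, length in PHASES)
```

Restarting at a lower peak ($6 \times 10^{-5}$) gives the optimizer a second, gentler annealing pass after the first phase has converged.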

7.5 Teams

Alanosu

Title: MHDRUNet

Members: Liwen Zhang, Zhe Xu ([email protected]), Dingyong Gou, Cong Li

Affiliations: ZTE Corporation.

USTCX

Title: Restormer

Members: Senyan Xu ([email protected]), Yunkang Zhang, Siyuan Jiang

Affiliations: University of Science and Technology of China.

8 Highspeed HDR Video Reconstruction from Events

Event cameras, differing from conventional cameras that capture scene intensities at a fixed frame rate, use a unique approach by detecting pixel-wise intensity changes asynchronously. This is triggered whenever a pixel’s intensity change surpasses a certain contrast threshold. Unlike traditional frame-based cameras, event cameras have several advantages: low latency, low power consumption, high temporal resolution, and high dynamic range (HDR). These qualities make them particularly useful for a range of vision tasks, including real-time object tracking [85, 80, 129], high-speed motion estimation [55], optical flow estimation [37], ego motion analysis [118], and so on.

However, the distinct triggering mechanism of event cameras presents a challenge. The event data they capture, which lacks absolute intensity values and is represented as 4-tuples, is incompatible with standard frame-based vision algorithms. This discrepancy necessitates specialized processing pipelines, different from traditional image processing methods. Consequently, there is a growing interest in transforming event data into intensity images to leverage the high-speed and HDR capabilities of event cameras in practical applications [97, 116].
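The triggering mechanism and the 4-tuple representation can be illustrated for a single pixel as follows. The sketch assumes the commonly used model in which an event fires when the change in log intensity exceeds the contrast threshold, and it takes the tuple layout to be $(x, y, t, p)$ with polarity $p \in \{+1, -1\}$; the threshold of 0.3 and the sample values are hypothetical.

```python
import math
from typing import List, NamedTuple

class Event(NamedTuple):
    x: int      # pixel column
    y: int      # pixel row
    t: float    # timestamp
    p: int      # polarity: +1 brighter, -1 darker

def events_for_pixel(x, y, times, intensities, threshold=0.3) -> List[Event]:
    """Emit events whenever the log intensity moves by `threshold` from the reference."""
    events = []
    ref = math.log(intensities[0])
    for t, value in zip(times[1:], intensities[1:]):
        log_i = math.log(value)
        while abs(log_i - ref) >= threshold:
            polarity = 1 if log_i > ref else -1
            ref += polarity * threshold   # the reference level follows each event
            events.append(Event(x, y, t, polarity))
    return events

events = events_for_pixel(3, 5, times=[0.0, 1.0, 2.0], intensities=[1.0, 2.0, 1.1])
```

Only changes are recorded, so the absolute intensity has to be inferred by the reconstruction network rather than read off the stream.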

To this end, we are launching a challenge focused on reconstructing high-speed HDR videos from event streams. We will utilize the high-quality Event-to-HDR dataset, captured by a co-axis system and developed by [134]. This dataset includes aligned pairs of event streams and HDR videos in both spatial and temporal dimensions.

In the challenge evaluation, the following evaluation metrics are used for assessment: peak signal-to-noise ratio (PSNR), tone-mapped PSNR (PSNR-$\mu$), structural similarity (SSIM), and multi-scale SSIM (MS-SSIM). The training dataset consists of 300 paired LDR/HDR images. The input images of the validation and testing sets are provided, while the ground truth (GT) is not available to participants. The final leaderboard of the top-3 participants is shown in Table 11.

Table 11: Results and rankings of methods.
Rank Team Score PSNR SSIM RMSE
1 IVISLAB 16.56 18.52 0.73 0.13
2 Jackzou 16.21 18.50 0.70 0.13
3 apolloUI 16.21 18.39 0.70 0.13

8.1 Dataset

The dataset for this challenge is shown in Fig. 40. This dataset provides real paired Event-to-HDR data for the purpose of high-speed HDR video reconstruction from event streams. The collection process involves an integrated system designed to simultaneously capture high-speed HDR videos and corresponding event streams. This is achieved by utilizing an event camera to record the event streams, alongside two high-speed cameras that capture synchronized Low Dynamic Range (LDR) frames. These LDR frames are later fused to create High Dynamic Range (HDR) frames. The careful alignment of these cameras within the system ensures the accurate synchronization of the high-speed HDR videos with the event streams, offering a robust dataset for the challenge participants. The challenge dataset has the following characteristics:

Figure 40: An overview of the challenge dataset: Event-to-HDR video dataset. Rows show the low-bit, high-bit, and HDR images; columns show Frames 1 to 3 of Scenes 1 to 3.

  • Real high-bit HDR. Unlike existing methods that primarily leverage the HDR feature of event data, our dataset includes real high-bit HDR data. This data is created by fusing two images with different exposures using an HDR fusion strategy. This inclusion is crucial as most current methods do not use real high-bit depth HDR data for training, limiting their ability to generate such HDR formats.

  • Paired Event-to-HDR dataset. While existing datasets often contain only paired testing data created by simulating a virtual camera’s trajectory, this dataset provides real paired training data. This approach overcomes the domain gap that synthetic training data typically has with real-world testing scenarios. This dataset captures genuine paired training data, offering a more realistic and applicable training environment.

  • Highspeed. In alignment with the high-speed nature of event streams, our videos are captured with a high-speed camera at a frame rate of 500fps. This speed significantly exceeds that of APS or any other event-to-HDR dataset, making our dataset uniquely suited for applications requiring high temporal resolution.

8.2 IVISLAB Team’s Method

Figure 41: Architecture of DERNet

To achieve high-speed HDR video reconstruction from events, our team introduces the Dual Event-stream Reconstruction Network (DERNet). As depicted in Figure 41, DERNet uses long-time and short-time event voxels to reconstruct the low-frequency brightness and high-frequency texture of HDR video. Furthermore, DERNet integrates Swin Transformer and Conv-GRU blocks to capture spatial and temporal contexts, thereby enhancing reconstruction accuracy.

8.2.1 Network Architecture

DERNet adopts an encoder-decoder network with a recursive design to process dual-stream event voxels to estimate high-speed HDR videos. Specifically, when reconstructing the $t$-th frame of the HDR video, considering that the long-time event stream around frame $t$ can help reconstruct the low-frequency brightness, DERNet voxelizes the event data from frame $t - T_l$ to frame $t + T_l$ into a $b$-bin event voxel $V_{l,t}$. Simultaneously, considering that the short-time event stream around frame $t$ can help reconstruct high-frequency texture, DERNet voxelizes the event data from frame $t - T_s$ to frame $t + T_s$ into a $b$-bin event voxel $V_{s,t}$. Subsequently, the event voxels $V_{l,t}$ and $V_{s,t}$ are concatenated and input into the network. To fuse the features of the two event voxels, DERNet utilizes convolutional layers to generate fused features from the event voxels. The network then adopts a two-branch encoder. This structure includes a complex branch that extracts high-level semantic information from the fused features, leveraging Swin Transformer [67] blocks to capture spatial context and Conv-GRU blocks to capture temporal context by integrating historical states.
It also includes a simple branch that utilizes convolutional layers to capture detailed information from the fused features. Next, the decoder of DERNet adopts multiple Swin Transformer blocks to fuse and upsample the features extracted by the two-branch encoder, finally using convolutional networks to predict the $t$-th frame HDR image $I_t$.
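The dual-stream voxelization can be sketched in NumPy as below. The sketch works with timestamps instead of frame indices and assigns each event to its nearest temporal bin; both are simplifying assumptions, as are the event layout $(x, y, t, p)$, the function names, and the window sizes in the example.

```python
import numpy as np

def voxelize(events, t_start, t_end, bins, height, width):
    """Accumulate event polarities into a (bins, H, W) grid over [t_start, t_end)."""
    voxel = np.zeros((bins, height, width), dtype=np.float32)
    t = events[:, 2]
    keep = (t >= t_start) & (t < t_end)
    x = events[keep, 0].astype(int)
    y = events[keep, 1].astype(int)
    p = events[keep, 3]
    # Nearest-bin assignment (an assumption; temporal interpolation is also common).
    b = ((t[keep] - t_start) / (t_end - t_start) * bins).astype(int)
    b = np.clip(b, 0, bins - 1)
    np.add.at(voxel, (b, y, x), p)   # handles repeated indices correctly
    return voxel

def dual_voxel(events, t, long_window, short_window, bins, height, width):
    """Concatenate the long-time voxel V_{l,t} and the short-time voxel V_{s,t}."""
    v_long = voxelize(events, t - long_window, t + long_window, bins, height, width)
    v_short = voxelize(events, t - short_window, t + short_window, bins, height, width)
    return np.concatenate([v_long, v_short], axis=0)

# Rows are (x, y, t, p) on a hypothetical 2x2 sensor.
events = np.array([[0, 0, 0.10, 1], [1, 0, 0.48, -1],
                   [1, 1, 0.52, 1], [0, 1, 0.90, 1]])
dual = dual_voxel(events, t=0.5, long_window=0.5, short_window=0.05,
                  bins=4, height=2, width=2)
```

The long window sees all four events, while the short window keeps only the two events closest to the target time, which is what lets the two streams carry different frequency content.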

8.2.2 Implementation Details

To train DERNet, a reconstruction loss $\mathcal{L}_r$ is designed for the estimated HDR image $I_t$ as

$$
\begin{aligned}
\mathcal{L}_r &= \lambda_1 \mathrm{L_1}(I_t, I_t^{gt}) + \lambda_2 \mathrm{L_1}(\mathrm{M}(I_t), \mathrm{M}(I_t^{gt})) \\
&\quad + \lambda_3 \mathrm{L_2}(I_t, I_t^{gt}) + \lambda_4 \mathrm{L_2}(\mathrm{M}(I_t), \mathrm{M}(I_t^{gt}))
\end{aligned}
\tag{14}
$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are coefficients balancing the loss terms, $\mathrm{L_1}(\cdot,\cdot)$ is the absolute loss function, $\mathrm{L_2}(\cdot,\cdot)$ is the mean squared error loss function, $I_t^{gt}$ is the ground truth $t$-th frame HDR image, and $\mathrm{M}(\cdot)$ is the HDR-to-SDR function defined as $\mathrm{M}(x) = \frac{\log(1+5000x)}{\log(5001)}$.
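As a rough illustration of Eq. (14), the loss can be written in a few lines of NumPy. This is a sketch under assumptions: both $\mathrm{L_1}$ and $\mathrm{L_2}$ are taken as means over all pixels (the report does not state the reduction), the default weights are the values listed in the implementation details, and the names `tone_map` and `reconstruction_loss` are chosen for this example.

```python
import numpy as np

def tone_map(x, mu=5000.0):
    """HDR-to-SDR mapping M(x) = log(1 + mu * x) / log(1 + mu)."""
    return np.log1p(mu * x) / np.log1p(mu)

def reconstruction_loss(pred, gt, lambdas=(1.0, 0.1, 500.0, 10.0)):
    """Weighted sum of L1 and L2 terms in the linear and tone-mapped domains."""
    lam1, lam2, lam3, lam4 = lambdas
    l1 = lambda a, b: np.mean(np.abs(a - b))   # assumed mean reduction
    l2 = lambda a, b: np.mean((a - b) ** 2)    # mean squared error
    pred_tm, gt_tm = tone_map(pred), tone_map(gt)
    return (lam1 * l1(pred, gt) + lam2 * l1(pred_tm, gt_tm)
            + lam3 * l2(pred, gt) + lam4 * l2(pred_tm, gt_tm))
```

The tone-mapped terms compress highlights, so errors in dark regions are not dominated by errors in very bright pixels.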

DERNet is implemented using PyTorch. During training, a batch size of 2 is utilized, with a video sequence length of 10 and a data size of $224 \times 224$. An AdamW optimizer [68] is adopted with a learning rate of $4 \times 10^{-5}$ and weight decay of $10^{-6}$ to optimize the network weights for 60 epochs. A cosine annealing scheduler is adopted to decay the learning rate. To prevent overfitting, random flipping, rotation, and cropping are applied to the event voxels for data augmentation. The coefficients are defined as $b=6$, $T_l=16$, $T_s=5$, $\lambda_1=1$, $\lambda_2=0.1$, $\lambda_3=500$, and $\lambda_4=10$.

8.3 Jackzou Team’s Method

Refer to caption
Figure 42: The overview of recurrent convolutional neural network for HDR video reconstruction from events.

8.3.1 Network Architecture

Our method employs a convolutional recurrent neural network designed to reconstruct HDR videos from event streams [134]. As shown in Figure 42, the network processes $T=2N+1$ consecutive event voxel grids $\{\mathbf{E}_{t-N}, \ldots, \mathbf{E}_{t+N}\}$ to generate the HDR frame $\mathbf{H}_t$ at timestamp $t$. The architecture includes several key modules.

Firstly, the shared feature extractor downsamples event frames to a low spatial resolution feature space using strided convolution layers, producing $2N+1$ output feature maps, $\{\mathbf{F}_{t-N}, \ldots, \mathbf{F}_{t+N}\}$. This shared encoding facilitates subsequent alignment by transforming the input data into a consistent feature space.

We then employ a deformable convolution-based alignment module [102], which uses pyramidal deformable convolutions to align features of different event frames with the central frame feature $\mathbf{F}_t$. This approach predicts offsets for the convolution kernels through a pyramidal processing structure, allowing the network to handle larger movements and align features accurately, thereby avoiding the pitfalls of inaccurate optical flow estimation.

The aligned features are combined in the attentive fusion and reconstruction module. Here, the features are stacked and processed by attention mechanisms that independently focus on height, width, and temporal/channel correlations. The fused features are passed through a recurrent residual network and a ConvLSTM [90] module, which help maintain temporal continuity by remembering information from successive sequences.

To enhance temporal consistency, we introduce a novel temporal consistency loss based on the integral relationship between consecutive frames and events, modeled using a pre-trained UNet-like [84] network. This loss ensures smooth transitions between frames, mitigating issues related to temporal discontinuity.

8.3.2 Implementation Details

The training strategy involves a combination of losses to optimize HDR video reconstruction and maintain temporal consistency. Given the reconstructed video sequence $\mathbf{H}_i$ and the corresponding ground truth frames $\hat{\mathbf{H}}_i$, we employ several loss functions.

We use the $l_1$ loss to measure the pixel-wise difference between the reconstructed and ground truth frames:

$$
\mathcal{L}_{l_1} = \sum_{i=1}^{T} \|\mathbf{H}_i - \hat{\mathbf{H}}_i\|. \tag{15}
$$

To enhance perceptual quality, we introduce the Learned Perceptual Image Patch Similarity (LPIPS) loss [130], which focuses on high-level and structural similarity. Additionally, the temporal consistency loss $\mathcal{L}_C$, derived from the pre-trained network, is defined as:

$$
\mathcal{L}_C = \sum_{i=1}^{T} \|\mathbf{E}_t - \mathcal{C}(\mathbf{H}_{t-1}, \mathbf{H}_t)\|_2^2. \tag{16}
$$

This loss ensures that the reconstructed frames maintain smooth transitions.

The overall loss function used to train our model is:

$$
\mathcal{L} = \mathcal{L}_{l_1} + \tau_1 \mathcal{L}_{LPIPS} + \tau_2 \mathcal{L}_C, \tag{17}
$$

where $\tau_1$ and $\tau_2$ are empirically set to $2$ and $0.2$, respectively.

The network is initialized using Kaiming initialization [42], and trained with the Adam optimizer (momentum set to 0.9). The initial learning rate is $10^{-4}$, reduced by a factor of 10 every 50 epochs. We set the batch size to 4, and train the model for 100 epochs. The implementation uses the PyTorch framework, and training is performed on NVIDIA TITAN V GPUs.
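The step schedule described above reduces to a one-line rule. The sketch below assumes zero-indexed epochs and uses a hypothetical helper name `step_lr`; it only illustrates the arithmetic and is not the team's training code.

```python
def step_lr(epoch, base_lr=1e-4, step=50, gamma=0.1):
    """Learning rate after dividing by 10 every `step` epochs (0-indexed)."""
    return base_lr * gamma ** (epoch // step)
```

Under these assumptions, a 100-epoch run uses $10^{-4}$ for epochs 0 to 49 and $10^{-5}$ for epochs 50 to 99.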

In summary, our network architecture and training strategy effectively reconstruct high-quality HDR videos from event data, ensuring both spatial and temporal coherence.

8.4 ApolloUI Team’s Method

8.4.1 Network Architecture

As shown in Figure 43, three types of inputs are introduced at the input end. The first is a voxel representation generated from the raw event camera data, with a time step of 8. This input carries the most primitive detail information of the image. The second is 2D image data generated by E2VID [81], which serves as the LDR reference frame. The third is HDR image data generated by the decoder, representing the HDR data of historical frames, which helps smooth the entire video and ultimately generate the HDR image for the current frame.

Refer to caption
Figure 43: The network architecture.

8.4.2 Implementation Details

Based on the aforementioned network architecture, this method is trained as follows. Firstly, the training data is converted into voxels based on the original timestamps and stored in the npy format. To facilitate temporal training, each npy file is stored with a length of T=16. Subsequently, E2VID is employed to generate reference frames corresponding to each npy file. It is noteworthy that the reference frames generated by E2VID are often affected by the actual data distribution density, leading to occurrences of blank spaces or excessive noise.

Regarding the network itself, data augmentation techniques such as random horizontal flipping and the addition of Gaussian noise with a standard deviation of 0.001 are applied to the input data. Additionally, to better align with the evaluation metrics of this challenge, four types of loss are introduced: a KL divergence loss (to ensure alignment between HDR ground truth images and generated images), an L1 loss, an L2 loss, and an SSIM loss. Due to time constraints, rigorous ablation experiments were not conducted. However, from a holistic analysis of the results, the KL divergence loss yielded relatively favorable gains.
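The augmentation described above can be sketched as follows. This is an illustrative NumPy version under assumptions: the flip is applied jointly to all inputs and to the target so they stay aligned, the Gaussian noise is added only to the event voxel, and the function name `augment` and its arguments are hypothetical.

```python
import numpy as np

def augment(voxel, reference, target, rng, noise_std=0.001, flip_prob=0.5):
    """Random horizontal flip (shared by all arrays) plus Gaussian input noise."""
    if rng.random() < flip_prob:
        # Flip along the width axis; arrays are assumed to be (..., H, W).
        voxel, reference, target = [
            np.flip(a, axis=-1) for a in (voxel, reference, target)
        ]
    # Assumption: noise perturbs the event voxel only, not the target.
    voxel = voxel + rng.normal(0.0, noise_std, size=voxel.shape)
    return voxel, reference, target
```

A caller would pass a seeded generator such as `np.random.default_rng(0)` so that augmentations are reproducible across runs.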

Other training parameters include the Adam optimizer with a learning rate of 0.001, cosine annealing for learning rate scheduling, a batch size of 48, and iterative training conducted using four NVIDIA 4090 GPUs. Training is performed for 1000 epochs, with the overall training and testing dataset split in an 8:2 ratio. To address challenging samples, this method manually removes data with poor distributions (some data lack original event information due to bandwidth congestion, rendering them unsuitable for training). Finally, inference can be performed, followed by truncating and normalizing the image with a maximum value of 65535, resulting in the HDR image of the current moment in the corresponding distribution domain.

8.5 Teams

IVISLAB

Title: Dual Event-stream Reconstruction Network

Members: Qinglin Liu1 ([email protected]), Wei Yu1, Xiaoqian Lv1, Jianing Li2, Shengping Zhang1, Xiangyang Ji3

Affiliations: 1Harbin Institute of Technology, 2Peking University, 3Tsinghua University.

Jackzou

Title: Learning to Reconstruct High Speed and High Dynamic Range Videos from Events

Members: Yunhao Zou ([email protected]), Ying Fu

Affiliations: Beijing Institute of Technology.

apolloUI

Title: Generating High Dynamic Range Image Sequences with Event Cameras Based on Multi-Head Encoding Networks

Members: Yuanpei Chen ([email protected]), Yuhan Zhang, Weihang Peng

Affiliations: Intelligent Science & Technology Academy of CASIC.

9 Overexposure Image Correction

Over-exposure is a prevalent issue in digital camera sensor systems, caused by automatic exposure errors during image processing. This problem particularly arises in dynamic scenes with fluctuating brightness levels, e.g., a car exiting a tunnel or the sudden illumination of a dark environment. Exposure correction aims to correct the brightness errors that occur during the image capture process [3, 76]. However, most CCD or CMOS cameras can only capture a limited illumination range and will produce clipped or over-exposed pixels when sensor elements are saturated due to improper settings or physical constraints in sensors. This largely degrades the essential details in bright areas of photographs as well as the image quality [4]. Therefore, correcting the brightness and texture details of over-exposed images becomes a crucial task for improving the visual aesthetics of captured images and the performance of downstream image processing applications.

On the over-exposure correction track (Table 12), the top three teams showed strong performance. gxj ranked first with a comprehensive score of 21.58, achieving a PSNR of 21.85 and an SSIM of 0.95. CVCV achieved a PSNR of 20.90 and an SSIM of 0.94. LiGoxin achieved a PSNR of 19.45 and an SSIM of 0.92.

These results highlight the advances made by the participating teams in addressing the challenges of over-exposure image correction. The top-ranking teams have developed robust algorithms that perform well under over-exposed conditions, paving the way for future advancements in computer vision research.

Table 12: Leaderboard of the Over-exposure correction.
Rank Team PSNR SSIM Score
1 gxj 21.85 0.95 21.58
2 CVCV 20.90 0.94 20.56
3 LiGoxin 19.45 0.92 18.95

9.1 RAW based Over-Exposure Correction dataset

To propel research in this field forward, it is essential to assess proposed methods in real-world scenarios. Consequently, we utilize the RAW image-based Real-world Paired Over-exposure (RPO) dataset, introduced by Prof. Fu’s team in [34] and captured using a Canon EOS 5D Mark IV camera. The RPO dataset comprises paired images collected across various scenes. Each short-exposure (normal-exposure) image is paired with long-exposure (over-exposure) images at 4 ratios (×3, ×5, ×8, ×10). Some representative examples of the RPO dataset are shown in Fig. 44.

The RPO dataset exhibits the following characteristics:

  • Short Exposure Images (Normal, GT): Captured in each scene using a tripod-mounted camera. The camera was set to automatic mode to find optimal aperture and exposure time settings, then switched to manual mode to lock these settings. Images were taken using a remote mobile app to control the shutter, minimizing lens vibration.

  • Long Exposure Images (Over-exposure, OE): Following the capture of short exposure GT images, only the “exposure time” setting was adjusted using the mobile app to simulate real over-exposure caused by incorrect settings. Four predetermined over-exposure ratios were used (×3, ×5, ×8, ×10). It was ensured that the camera was not touched during both long and short exposure captures to prevent any misalignment due to lens vibration.

Refer to caption
Figure 44: Examples from our Real-world Paired Over-exposure (RPO) [34] dataset, including outdoor (the second row) and indoor (the third row) scenes. For each scene, we capture images at four over-exposure ratios (3, 5, 8, and 10), in both RAW and sRGB formats. The front-most image in the bottom two rows is the properly exposed reference image, and the images behind it are the corresponding over-exposed images. “OE” indicates Over-Exposure.

9.2 Gxj Team’s Method

9.2.1 Network Architecture

Our solution for the whole task is shown in Fig. 45. The original over-exposed image is first preprocessed into RGB format and then passed through the region-aware exposure correction network RECNet [64], which restores it to a normal exposure level. At this point the brightness is normal but the resolution differs from the required one, so the super-resolution model Omni-SR [98] doubles the resolution to produce the final result.

Refer to caption
Figure 45: Framework diagram of the model.

The key to the task is color correction of the over-exposed regions. Because the test file format cannot be used directly with existing models, we first convert the format of the existing dataset. We then choose the region-aware exposure correction network (RECNet), which handles both under- and over-exposure and deals with mixed exposure by adaptively learning and bridging the exposure representations of different regions. Finally, the super-resolution model (Omni-SR) produces the final result, achieving high-quality recovery of over-exposed images.

Data preprocessing

Since the final test stage of this task provides processed files in mat format, and in order to help the model better correct the test images, we converted the existing RAW-format dataset into mat format and then into JPG files, which the model accepts more readily. In general, the training data is transformed so that its image distribution resembles that of the test inputs. Because the original images are too large, we resized all training images to match the size of the test set images.

Exposure correction

Image exposure correction has been studied for a long time. Traditional methods rely mainly on manually designed models, such as histogram equalization and gamma correction. Although existing methods achieve commendable results in exposure correction, many of them rely on complex manual designs or suffer from limitations that ultimately lead to suboptimal results.

After investigating existing models and analyzing the dataset, we selected an exposure correction model based on RECNet [64] for this task. When processing single images with mixed exposures, a network has difficulty converging stably because of the large difference between over- and under-exposed regions, resulting in unbalanced performance across exposures. To this end, the model takes into account the locality of different exposures to reduce the adverse effects of inconsistent optimization. To achieve this, the model adopts a divide-and-conquer strategy and uses a region-aware exposure correction framework consisting of two well-designed modules, arranged in a chain of consecutive RMBs.

The model mainly contains a series of blocks (RMB), each with a Region-aware De-exposure Module (RDM) and a Mixed-scale Restoration Unit (MRU). The RDM maps exposure features Fin to a three-branched exposure-invariant feature Fn, while the MRU integrates the features Fs and Fc by spatial-wise and channel-wise restoration, respectively. The exposure mask predictor (EMP) assists in generating the underexposure feature Fu and the overexposure feature. The model is optimized with Exposure Contrastive Regularization (ECR).

Image super resolution

To match the exposure-corrected results with the image size required by the task, we used the Omni-SR [98] model to apply 2x super-resolution to the resulting images. Specifically, the model proposes an Omni Self-Attention (OSA) block based on the principle of dense interaction, which models pixel interaction along both the spatial and channel dimensions and mines the potential correlation across the omni-axis (i.e., space and channel). Combined with mainstream window partition strategies, OSA can achieve superior performance with a compelling computational budget. Second, a multi-scale interaction scheme is proposed to alleviate suboptimal effective receptive fields (ERFs) in shallow models, promoting local propagation and meso-/global-scale interactions to form full-scale aggregation blocks.

9.2.2 Implementation Details

We conducted experiments using two different models: RECNet and Omni-SR. For RECNet [64], we processed and merged existing datasets, yielding a total of 1200 images, including 1120 images as the training set and 80 images as the validation set. No pre-trained model was loaded, and training was done from scratch. The training parameters were a batch size of 8, a learning rate of 1e-4, and 300000 iterations, using a single NVIDIA RTX 4090 GPU. For Omni-SR [98], the pre-trained model (epoch 885) trained on the DF2K dataset was used to apply 2x super-resolution to the exposure-corrected images. The experimental results are shown in Fig. 46, including the outputs and scores obtained after the original data passes through the full pipeline.

Refer to caption
Figure 46: Different ratio result image examples and scores.

This track is dedicated to the correction of overexposed images. In this report, we detail our team’s data processing methods and the use of models in this task. For the recovered images, the image quality is further improved through the super-resolution model, yielding better results. The experiments indicate that our strategy for this task is reasonable and effective, ultimately achieving a score of 21.58 in the Ratio = 3 track of this task dataset.

9.3 CVCV Team’s Method

9.3.1 Network Architecture

Our training process for the entire task is shown in Fig. 47. The original training dataset is first converted from CR2 to four-channel RGGB PNG. The data is then input into the CGNet [34] model, which is trained with supervision from the three-channel RGB ground truth. The trained model can then correct an overexposed image so that it returns to normal exposure.

Our test process for the entire task is shown in Fig. 48. For the original validation and test data set, it is first converted from mat format to RGGB four-channel PNG. Then the data is input into the CGNet model, the trained weights are loaded for model inference, and the inference results of RGB three-channel are obtained.

Refer to caption
Figure 47: Training process diagram
Refer to caption
Figure 48: Test process diagram

Most existing overexposure correction methods have been developed on sRGB images, which suffer from complex and non-linear degradation introduced by the image signal processing pipeline. Compared to sRGB images, RAW images have a near-linear correlation with scene brightness and carry richer information thanks to their higher bit depth, which leads to superior performance. Traditional digital camera sensors are designed to have a higher response ratio and relative spectral sensitivity for the green channel. Therefore, in RAW images captured by most digital camera systems, the green channel is usually more likely to be overexposed in bright scenes than the red or blue channel. The red and blue channels of RAW images show more appropriate brightness and richer texture details than the green channels. This indicates that the green channel in the RGGB RAW image is more saturated than the red or blue channel and requires stronger correction.

We adopt the Channel-Guidance Network (CGNet), which takes advantage of RAW images for overexposure correction. CGNet estimates correctly exposed sRGB images directly from overexposed RAW images in an end-to-end manner. Specifically, its authors introduce a RAW-based channel guidance branch into the U-Net-based backbone, which utilizes color channel intensity priors of RAW images to achieve superior overexposure correction performance.

Data preprocessing. Our team chose the CGNet model for overexposure image correction, and the model's default input format is four-channel RGGB. In order to maintain the performance of the model, we decided to convert the original images in the dataset into RGGB format for training. The original training images are stored in CR2 format, and the rawpy library is called directly to batch convert the CR2 files to three-channel RGB format. The green channel is then duplicated to obtain the four-channel RGGB format. Finally, the overexposed images of all four ratios (3, 5, 8, 10) were input into CGNet together and divided into a training set and a validation set at a ratio of 8:2.

The original test images are stored in mat format. After reading the mat files, we found that the pixel values lie between 0 and 1 and that the four channels are ordered RGBG. Therefore, the pixels are multiplied by 255 and the four channels are reordered to RGGB to obtain the processed PNG images.
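The two channel conversions described above can be sketched in NumPy. This is a minimal illustration under assumptions: arrays are channel-last, the mat data is already loaded as a float array in [0, 1] with channels ordered R, G, B, G, and the helper names `rgb_to_rggb` and `rgbg_to_rggb_uint8` are hypothetical. File reading (CR2 or mat) and PNG writing are left out.

```python
import numpy as np

def rgb_to_rggb(rgb):
    """Duplicate the green channel: (H, W, 3) RGB -> (H, W, 4) RGGB."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([r, g, g, b], axis=-1)

def rgbg_to_rggb_uint8(rgbg):
    """Reorder (H, W, 4) RGBG in [0, 1] to RGGB and scale to 8-bit."""
    rggb = rgbg[..., [0, 1, 3, 2]]  # swap the last two channels: B,G -> G,B
    return np.clip(np.round(rggb * 255.0), 0, 255).astype(np.uint8)
```

Scaling by 255 quantizes the data to 8 bits, so this path trades some of the RAW bit depth for compatibility with a PNG-based pipeline.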

Refer to caption
Figure 49: Architecture of the Channel-Guidance Network (CGNet [34]) for image over-exposure correction. Given an over-exposed RAW image, they pad it into a four-channel RGGB image, and then feed it into their CGNet. Their CGNet is based on a U-Net backbone. The encoder consists of “Half-Instance Normalization” blocks, while the decoders are Residual blocks. They replace the original skip connection with Cascaded Dilated Residual (CDR) blocks. The red and blue channels are input to a Non-Green Channel Guidance (NGCG) branch for texture detail reconstruction. Their CGNet is pre-trained on their synthetic RAW image-based dataset, and fine-tuned on their Real-world Paired Over-exposure dataset.

CGNet. As shown in Fig. 49, the main branch is based on a basic U-net [83] with four encoder (downsample) and decoder (upsample) stages. Specifically, they first extract the initial features from the four-channel RAW images using a standard 3×3 convolution. In the encoder section, the HIN blocks are utilized to broaden the receptive field and enhance the robustness of features at various scales. During the downsampling operation, they double the number of channels in the feature maps. Moving on to the decoder part, residual blocks are employed to capture high-level features more effectively. For the skip connection, they introduce a novel Cascaded Dilated Residual (CDR) block to extract multi-scale features, which are then merged with the encoder’s features to mitigate the loss of detail information resulting from downsampling. The proposed NGCG branch integrates the prior knowledge of blue and red channels into each scale of the main branch encoder, aiding the main branch in recovering over-exposed areas more effectively.

Non-Green Channel Guidance. Firstly, the pixels pertaining to the red (or blue) channel are extracted from their respective positions within each 2×2 block of a Bayer image. Subsequently, these red and blue channels are input into the NGCG branch, which then generates an initial prediction of the corresponding components in the output sRGB image. The NGCG branch is structured with a Guidance-Enhanced Block (GEB) and four downsampling blocks, serving to guide the five corresponding encoder blocks. Within the GEB, there are two 3×3 convolutional layers, with an Instance Normalization [95] and LeakyReLU following each layer, as well as a 3×3 convolutional operation in between. It is worth noting that depth-wise separable convolution is utilized instead of the traditional 3×3 convolution to efficiently capture local information. The downsampling blocks comprise a maxpooling layer, a modified self-attention structure, and two depth-wise separable convolution layers. This arrangement allows the NGCG branch to aid the primary backbone network in restoring over-exposed RAW images in a multi-scale manner.
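The first step of the NGCG branch, extracting the red and blue samples from each 2×2 Bayer block, can be illustrated with array slicing. The sketch assumes an RGGB mosaic (red at the top-left and blue at the bottom-right of each block) stored as a single-channel array with even height and width; the function name `extract_red_blue` is hypothetical.

```python
import numpy as np

def extract_red_blue(bayer):
    """Return a (2, H/2, W/2) array holding the red and blue planes."""
    red = bayer[0::2, 0::2]   # top-left sample of every 2x2 block
    blue = bayer[1::2, 1::2]  # bottom-right sample of every 2x2 block
    return np.stack([red, blue], axis=0)
```

Each plane has half the spatial resolution of the mosaic, which is one reason the guidance features must be matched to the encoder scales before they are merged.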

Cascaded Dilated Residual Block. In detail, each CDR block incorporates three residual connections that feature dilated convolution and the LeakyReLU activation function. Subsequently, a 1×1 convolutional layer follows, which enables the CDR block to effectively utilize the features extracted from each stage of the encoder and adequately explore local texture information. Additionally, it is worth noting that the dilated convolution employed in this configuration effectively enhances the receptive field of the CDR block, thereby facilitating multi-scale contextual feature extraction.
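To see why dilation enlarges the receptive field, the standard stride-1 formula can be evaluated directly: each layer with kernel size $k$ and dilation $d$ adds $(k-1)\,d$ pixels. The dilation rates used below are placeholders for illustration only, since the report does not list the rates used in the CDR block, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

For example, three plain 3×3 convolutions cover 7 pixels per axis, whereas assumed dilations of 1, 2, and 3 would cover 13 pixels with the same number of weights.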

9.3.2 Implementation Details

We chose the CGNet [34] model for training. Since the training dataset contains overexposed images with four ratios (3, 5, 8, 10), we input the overexposed images with these four ratios into CGNet and divided them into the training set and the validation set according to a ratio of 8:2.

The training parameters include a batch size of 24, a learning rate of 0.0001, 4 downsampling layers, 9 residual layers, and the Adam optimizer. The model was trained for a total of 500 epochs, with the learning rate reduction strategy starting at epoch 100. Training was performed using a single NVIDIA V100 GPU without loading pre-trained weights.

The experimental results are presented in Table 13. After loading the weights obtained from the training, we predicted the processed test set. The pixels of the predicted results were multiplied by a suitable multiplier to improve the score. Through experiments, we found that the highest scores for overexposed images with ratios of 3, 5, 8, and 10 were 20.56, 20.49, 20.54, and 20.03, respectively.

Table 13: Experiment Results
      Ratio       Multiplier       Score
      ratio=3       2.20       20.56
      ratio=5       1.65       20.49
      ratio=8       1.27       20.54
      ratio=10       1.10       20.03
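The multiplier selection described above can be sketched as a small grid search. This is an illustration under assumptions, not the team's procedure: it assumes a held-out set with reference images in [0, 1], uses PSNR as a stand-in for the challenge score, and the names `psnr` and `best_multiplier` are hypothetical.

```python
import numpy as np

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def best_multiplier(pred, gt, candidates):
    """Scale the prediction by each candidate, clip, and keep the best PSNR."""
    scores = {m: psnr(np.clip(pred * m, 0.0, 1.0), gt) for m in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A global multiplier only corrects an overall brightness offset, so it cannot recover texture lost in saturated regions.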

The main task of this competition is to correct overexposed images. This report details our team’s approach to data processing, model training, and prediction. The experiments indicate that converting the data from CR2 or mat format to four-channel RGGB does not degrade performance when training the CGNet model. Additionally, by multiplying the pixels of the predicted results, we improved the quality of the images and achieved higher scores. The experiments suggest that our strategy for this problem is reasonable and effective. In the end, we scored 20.56 on the RPO dataset, placing second.

9.4 LiGoxin Team’s Method

9.4.1 Network Architecture

We use CGNet [34] as our solution. CGNet contains two branches: a main branch based on U-Net and a non-green channel guided (NGCG) branch, as shown in Fig. 50.

The main branch is based on a basic U-Net [83] with four encoder (downsampling) and decoder (upsampling) stages. Specifically, the model first extracts initial features from a four-channel RAW image through a standard 3 × 3 convolution. In the encoder, the model uses a Half Instance Normalization (HIN) block to expand the receptive field and improve the robustness of the features at each scale. At each downsampling operation, the number of channels in the feature map is doubled. In the decoder, the model uses a residual block to better extract high-level features. For the skip connections, the model uses a novel cascaded dilated residual (CDR) block to extract multi-scale features and fuse them with features from the encoder, compensating for the loss of detail caused by downsampling. Each CDR block contains three residual connections with dilated convolutions and LeakyReLU activation functions, followed by a 1 × 1 convolution layer. This allows the CDR block to make good use of features from each stage of the encoder and fully explore local texture information. In addition, the dilated convolution expands the receptive field of the CDR block for multi-scale context feature extraction.
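The channel and resolution bookkeeping of such an encoder can be traced in a few lines. The base width of 32 channels and the 512 × 512 input patch are hypothetical values chosen for illustration; the sketch also assumes that each downsampling halves the spatial size.

```python
def encoder_shapes(base_channels, height, width, num_stages=4):
    """List (channels, height, width) after the stem and after each downsampling.

    Each downsampling halves the spatial size and doubles the channel count.
    """
    shapes = [(base_channels, height, width)]
    for _ in range(num_stages):
        channels, h, w = shapes[-1]
        shapes.append((channels * 2, h // 2, w // 2))
    return shapes


for shape in encoder_shapes(32, 512, 512):
    print(shape)
# (32, 512, 512)
# (64, 256, 256)
# (128, 128, 128)
# (256, 64, 64)
# (512, 32, 32)
```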

For the NGCG branch, the pixels at the red and blue positions are first extracted from each 2 × 2 block of the Bayer image. The red and blue channels are then fed into the NGCG branch to produce an initial estimate of the corresponding elements in the output sRGB image. The NGCG branch consists of a Guidance-Enhanced Block (GEB) and four downsampling blocks, which guide the five corresponding encoder blocks. The GEB contains two 3 × 3 convolutional layers (each followed by instance normalization and LeakyReLU) with 3 × 3 convolution operations between them. It is worth noting that the model uses depthwise separable convolution instead of standard 3 × 3 convolution to fully extract local information. Each downsampling block consists of a max-pooling layer, an improved self-attention structure, and two depthwise separable convolutional layers.
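As a sketch of the first step of the NGCG branch, the function below pulls the red and blue samples out of every 2 × 2 block of a Bayer mosaic. It assumes an RGGB layout (red at the top-left and blue at the bottom-right of each block) and a mosaic stored as a list of rows with even height and width; the function name is hypothetical.

```python
def extract_non_green(mosaic):
    """Return half-resolution red and blue planes from an RGGB Bayer mosaic."""
    red = [row[0::2] for row in mosaic[0::2]]   # even rows, even columns
    blue = [row[1::2] for row in mosaic[1::2]]  # odd rows, odd columns
    return red, blue


# 4x4 toy mosaic; letters mark the color site, digits the 2x2 block index.
mosaic = [
    ["R0", "G0", "R1", "G1"],
    ["G0", "B0", "G1", "B1"],
    ["R2", "G2", "R3", "G3"],
    ["G2", "B2", "G3", "B3"],
]
red, blue = extract_non_green(mosaic)
print(red)   # [['R0', 'R1'], ['R2', 'R3']]
print(blue)  # [['B0', 'B1'], ['B2', 'B3']]
```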

Figure 50: Architecture of CGNet [34] for image over-exposure correction

9.4.2 Implementation Details

The training dataset consists of 300 ground-truth (GT) images and their corresponding overexposed images (ratios = 3, 5, 8, 10). The RAW images and the corresponding sRGB images both have a resolution of 6744 × 4502.

For the validation set, the RAW images are converted into four-channel (RGGB) images, cropped, and saved as .mat files. Unlike the training set, the validation set contains only input files, without ground truth.
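A minimal sketch of the packing and cropping steps follows, assuming an RGGB layout and non-overlapping square tiles. The tile size and function names are hypothetical, and writing the result to a .mat file is omitted.

```python
def pack_rggb(mosaic):
    """Pack an H x W RGGB mosaic into four (H/2) x (W/2) planes: R, G1, G2, B."""
    r = [row[0::2] for row in mosaic[0::2]]
    g1 = [row[1::2] for row in mosaic[0::2]]
    g2 = [row[0::2] for row in mosaic[1::2]]
    b = [row[1::2] for row in mosaic[1::2]]
    return [r, g1, g2, b]


def crop_tiles(planes, tile):
    """Cut channel planes into non-overlapping tile x tile patches (edges dropped)."""
    height, width = len(planes[0]), len(planes[0][0])
    patches = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            patches.append([[row[left:left + tile] for row in plane[top:top + tile]]
                            for plane in planes])
    return patches


# 4x8 toy mosaic whose values encode the position as 10 * row + column.
mosaic = [[10 * y + x for x in range(8)] for y in range(4)]
planes = pack_rggb(mosaic)
patches = crop_tiles(planes, tile=2)
print(len(planes), len(planes[0]), len(planes[0][0]))  # 4 2 4
print(len(patches))                                    # 2
print(patches[0][0])                                   # [[0, 2], [20, 22]]
```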

For the test data, we converted the .mat files into .png files and adjusted the channel order of the images. We processed the overexposed images with a model pre-trained on the SOF dataset [34] and fine-tuned on the RPO dataset. After processing, we adjusted the images for the different ratios and upscaled the corrected images. The experimental results are shown in Fig. 51.

Figure 51: Different ratio examples and scores.

This report details our data processing and model usage for this task. The experiments indicate that our strategy is reasonable and effective; we ultimately achieved a score of 18.95 on the Ratio = 3 track of the task dataset, ranking third.

9.5 Teams and Affiliations

gxj
Title:
1st Place Solution for PBDL2024 Raw Image Based Over-Exposure Correction Challenge
Members: Xuejian Gou ([email protected]), Qinliang Wang, Yang Liu, Fang Liu, Lingling Li, Wenping Ma
Affiliations: School of Artificial Intelligence, Xidian University

CVCV
Title:
Raw Image Based Over-Exposure Correction
Members: Shizhan Zhao ([email protected]), Yanzhao Zhang, Libo Yan, Xiaoqiang Lu, Licheng Jiao, Yuwei Guo
Affiliations: Intelligent Perception and Image Understanding Lab, Xidian University

LiGoxin
Title:
Raw Image Based Over-Exposure Correction Challenge
Members: Guoxin Li ([email protected]), Qiong Gao, Chenyue Che, Long Sun, Xu Liu, Shuyuan Yang
Affiliations: Intelligent Perception and Image Understanding Lab, Xidian University

10 Conclusion

The three-month-long competition attracted over 300 participants, with more than 500 submissions from both industry and academic institutions. This high level of participation underscores the growing interest and investment in the field of computer vision, particularly in the integration of physics-based approaches with deep learning.

Looking forward, we anticipate continued advancements and breakthroughs in this interdisciplinary area. The success of this challenge has set a strong foundation for future research and development, encouraging more collaboration between academia and industry to solve complex vision problems. We are excited to see the future innovations and practical applications that will emerge from these efforts.

References

  • Abdelhamed et al. [2018] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1692–1700, 2018.
  • Abdelhamed et al. [2019] Abdelrahman Abdelhamed, Marcus A Brubaker, and Michael S Brown. Noise flow: Noise modeling with conditional normalizing flows. In Int. Conf. Comput. Vis., pages 3165–3173, 2019.
  • Afifi et al. [2021] Mahmoud Afifi, Konstantinos G Derpanis, Bjorn Ommer, and Michael S Brown. Learning multi-scale photo exposure correction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9157–9167, 2021.
  • Assefa et al. [2014] Mekides Assefa, Tania Poulie, Jonathan Kervec, and Mohamed-Chaker Larabi. Correction of over-exposure using color channel correlations. In IEEE Global Conf. Signal Inform. Process., pages 1078–1082, 2014.
  • Baer [2006] Richard L Baer. A model for dark current characterization and simulation. In Sensors, Cameras, and Systems for Scientific/Industrial Applications VII, pages 37–48, 2006.
  • Boie and Cox [1992] Robert A. Boie and Ingemar J. Cox. An analysis of camera noise. IEEE Trans. Pattern Anal. Mach. Intell., 14(06):671–674, 1992.
  • Brooks et al. [2019] Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T Barron. Unprocessing images for learned raw denoising. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11036–11045, 2019.
  • Broxton et al. [2020] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Trans. Graph., 39(4):86–1, 2020.
  • Cai et al. [2023a] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Int. Conf. Comput. Vis., pages 12504–12513, 2023a.
  • Cai et al. [2023b] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Int. Conf. Comput. Vis., pages 12504–12513, 2023b.
  • Chen et al. [2018] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3291–3300, 2018.
  • Chen et al. [2019a] Chen Chen, Qifeng Chen, Minh N Do, and Vladlen Koltun. Seeing motion in the dark. In Int. Conf. Comput. Vis., pages 3185–3194, 2019a.
  • Chen et al. [2019b] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4974–4983, 2019b.
  • Chen et al. [2019c] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019c.
  • Chen et al. [2022a] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In Eur. Conf. Comput. Vis., pages 17–33. Springer, 2022a.
  • Chen et al. [2022b] Linwei Chen, Ying Fu, Shaodi You, and Hongzhe Liu. Hybrid supervised instance segmentation by learning label noise suppression. Neurocomputing, 496:131–146, 2022b.
  • Chen et al. [2023a] Linwei Chen, Ying Fu, Kaixuan Wei, Dezhi Zheng, and Felix Heide. Instance segmentation in the dark. Int. J. Comput. Vis., 131(8):2198–2218, 2023a.
  • Chen et al. [2023b] Linwei Chen, Ying Fu, Kaixuan Wei, Dezhi Zheng, and Felix Heide. Instance segmentation in the dark. Int. J. Comput. Vis., 131(8):2198–2218, 2023b.
  • Chen et al. [2021] Xiangyu Chen, Yihao Liu, Zhengwen Zhang, Yu Qiao, and Chao Dong. Hdrunet: Single image hdr reconstruction with denoising and dequantization. In IEEE Conf. Comput. Vis. Pattern Recog., pages 354–363, 2021.
  • Chen et al. [2023c] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In Int. Conf. Learn. Represent., 2023c.
  • Claus and Van Gemert [2019] Michele Claus and Jan Van Gemert. Videnn: Deep blind video denoising. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 0–0, 2019.
  • Costantini and Susstrunk [2004] Roberto Costantini and Sabine Susstrunk. Virtual sensor design. In Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications V, pages 408–419, 2004.
  • Debevec and Malik [2008] Paul E Debevec and Jitendra Malik. Recovering high dynamic range radiance maps from photographs. Proc. of ACM SIGGRAPH, pages 1–10, 2008.
  • Ding et al. [2021] Zongyuan Ding, Tao Wang, Quansen Sun, Qiongjie Cui, and Fuhua Chen. A dual-stream framework guided by adaptive gaussian maps for interactive image segmentation. Knowledge-Based Systems, 223:107033, 2021.
  • Dosovitskiy et al. [2021a] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2021a.
  • Dosovitskiy et al. [2021b] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2021b.
  • El Gamal and Eltoukhy [2005] Abbas El Gamal and Helmy Eltoukhy. Cmos image sensors. IEEE Circuits and Devices Magazine, 21(3):6–20, 2005.
  • Fang et al. [2021] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In Int. Conf. Comput. Vis., pages 6910–6919, 2021.
  • Farrell et al. [2008] Joyce Farrell, Michael Okincha, and Manu Parmar. Sensor calibration and simulation. In Digital Photography IV, pages 249–257, 2008.
  • Feng et al. [2024] Yixu Feng, Cheng Zhang, Pei Wang, Peng Wu, Qingsen Yan, and Yanning Zhang. You only need one color space: An efficient network for low-light image enhancement. arXiv preprint arXiv:2402.05809, 2024.
  • Foi et al. [2008] Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian. Practical poissonian-gaussian noise modeling and fitting for single-image raw-data. IEEE Trans. Image Process., 17(10):1737–1754, 2008.
  • Fu et al. [2020] Ying Fu, Yunhao Zou, Liheng Bian, Yuxiang Guo, and Hua Huang. Illumination modulation for reflective and fluorescent separation. Opt. Letters, 45(5):1120–1123, 2020.
  • Fu et al. [2022] Ying Fu, Yang Hong, Linwei Chen, and Shaodi You. Le-gan: Unsupervised low-light image enhancement network using attention module and identity invariant loss. Knowledge-Based Systems, 240:108010, 2022.
  • Fu et al. [2023a] Y. Fu, Y. Hong, Y. Zou, et al. Raw image based over-exposure correction using channel-guidance strategy. IEEE Trans. Circuit Syst. Video Technol., 2023a.
  • Fu et al. [2023b] Ying Fu, Yang Hong, Yunhao Zou, Qiankun Liu, Yiming Zhang, Ning Liu, and Chenggang Yan. Raw image based over-exposure correction using channel-guidance strategy. IEEE Trans. Circuit Syst. Video Technol., 2023b.
  • Fu et al. [2023c] Ying Fu, Zichun Wang, Tao Zhang, and Jun Zhang. Low-light raw video denoising with a high-quality realistic motion dataset. IEEE Trans. Multimedia, 25:8119–8131, 2023c.
  • Gallego et al. [2018] Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3867–3876, 2018.
  • Ge et al. [2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • Gnanasambandam and Chan [2020] Abhiram Gnanasambandam and Stanley H Chan. Image classification in the dark using quanta image sensors. In Eur. Conf. Comput. Vis., pages 484–501, 2020.
  • Gow et al. [2007] Ryan D Gow, David Renshaw, Keith Findlater, Lindsay Grant, Stuart J McLeod, John Hart, and Robert L Nicol. A comprehensive tool for modeling cmos image-sensor-noise performance. IEEE Transactions on Electron Devices, 54(6):1321–1329, 2007.
  • Hasinoff et al. [2016] Samuel W Hasinoff, Dillon Sharlet, Ryan Geiss, Andrew Adams, Jonathan T Barron, Florian Kainz, Jiawen Chen, and Marc Levoy. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph., 35(6):1–12, 2016.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Int. Conf. Comput. Vis., pages 1026–1034, 2015.
  • Healey and Kondepudy [1994] Glenn E Healey and Raghava Kondepudy. Radiometric ccd camera calibration and noise estimation. IEEE Trans. Pattern Anal. Mach. Intell., 16(3):267–276, 1994.
  • Hong et al. [2021] Yang Hong, Kaixuan Wei, Linwei Chen, and Ying Fu. Crafting object detection in very low light. In Brit. Mach. Vis. Conf., page 3, 2021.
  • Hou et al. [2024] Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, and Hui Yuan. Global structure-aware diffusion process for low-light image enhancement. Adv. Neural Inform. Process. Syst., 36, 2024.
  • Huang et al. [2022] Jie Huang, Yajing Liu, Xueyang Fu, Man Zhou, Yang Wang, Feng Zhao, and Zhiwei Xiong. Exposure normalization and compensation for multiple-exposure correction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6043–6052, 2022.
  • Huang et al. [2021] Shihua Huang, Zhichao Lu, Ran Cheng, and Cheng He. Fapn: Feature-aligned pyramid network for dense image prediction. In Int. Conf. Comput. Vis., pages 864–873, 2021.
  • Irie et al. [2008a] Kenji Irie, Alan E Mckinnon, Keith Unsworth, and Ian M Woodhead. A model for measurement of noise in ccd digital-video cameras. Measurement Science and Technology, 19(4):045207, 2008a.
  • Irie et al. [2008b] Kenji Irie, Alan E McKinnon, Keith Unsworth, and Ian M Woodhead. A technique for evaluation of ccd video-camera noise. IEEE Trans. Circuit Syst. Video Technol., 18(2):280–284, 2008b.
  • Jiang and Zheng [2019] Haiyang Jiang and Yinqiang Zheng. Learning to see moving objects in the dark. In Int. Conf. Comput. Vis., pages 7324–7333, 2019.
  • Jin et al. [2023] Xin Jin, Jia-Wen Xiao, Ling-Hao Han, Chunle Guo, Ruixun Zhang, Xialei Liu, and Chongyi Li. Lighting every darkness in two pairs: A calibration-free pipeline for raw denoising. In Int. Conf. Comput. Vis., pages 13275–13284, 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konnik and Welsh [2014] Mikhail Konnik and James Welsh. High-level numerical simulations of noise in ccd and cmos photosensors: review and tutorial. arXiv preprint arXiv:1412.4031, 2014.
  • Krull et al. [2019] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void-learning denoising from single noisy images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2129–2137, 2019.
  • Lee et al. [2014] Jun Haeng Lee, Kyoobin Lee, Hyunsurk Ryu, Paul KJ Park, Chang-Woo Shin, Jooyeon Woo, and Jun-Seok Kim. Real-time motion estimation based on event-based vision sensor. In IEEE Int. Conf. Image Process., pages 204–208. IEEE, 2014.
  • Lehtinen et al. [2018] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. In Int. Conf. Mach. Learn., pages 2965–2974. PMLR, 2018.
  • Li et al. [2023a] Chongyi Li, Chun-Le Guo, Man Zhou, Zhexin Liang, Shangchen Zhou, Ruicheng Feng, and Chen Change Loy. Embedding fourier for ultra-high-definition low-light image enhancement. In Int. Conf. Learn. Represent., 2023a.
  • Li et al. [2023b] Dasong Li, Xiaoyu Shi, Yi Zhang, Ka Chun Cheung, Simon See, Xiaogang Wang, Hongwei Qin, and Hongsheng Li. A simple baseline for video restoration with grouped spatial-temporal shift. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9822–9832, 2023b.
  • Li et al. [2023c] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3041–3050, 2023c.
  • Li et al. [2018] Mading Li, Jiaying Liu, Wenhan Yang, Xiaoyan Sun, and Zongming Guo. Structure-revealing low-light image enhancement via robust retinex model. IEEE Trans. Image Process., 27(6):2828–2841, 2018.
  • Liang et al. [2022] Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc V Gool. Recurrent video restoration transformer with guided deformable attention. Adv. Neural Inform. Process. Syst., 35:378–393, 2022.
  • Liba et al. [2019] Orly Liba, Kiran Murthy, Yun-Ta Tsai, Tim Brooks, Tianfan Xue, Nikhil Karnad, Qiurui He, Jonathan T Barron, Dillon Sharlet, Ryan Geiss, et al. Handheld mobile photography in very low light. ACM Trans. Graph., 38(6):164–1, 2019.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755. Springer, 2014.
  • Liu et al. [2024a] J. Liu, H. Fu, C. Wang, et al. Region-aware exposure consistency network for mixed exposure correction. arXiv preprint arXiv:2402.18217, 2024a.
  • Liu et al. [2024b] Xiaoning Liu, Zongwei Wu, Ao Li, Florin-Alexandru Vasluianu, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, Zhi Jin, et al. Ntire 2024 challenge on low light image enhancement: Methods and results. arXiv preprint arXiv:2404.14248, 2024b.
  • Liu et al. [2014] Ziwei Liu, Lu Yuan, Xiaoou Tang, Matt Uyttendaele, and Jian Sun. Fast burst images denoising. ACM Trans. Graph., 33(6):1–9, 2014.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. Comput. Vis., pages 10012–10022, 2021.
  • Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Int. Conf. Learn. Represent., 2018.
  • Luo et al. [2023] Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. Image restoration with mean-reverting stochastic differential equations. In Int. Conf. Mach. Learn., pages 23045–23066, 2023.
  • Lv et al. [2021] Feifan Lv, Yu Li, and Feng Lu. Attention guided low-light image enhancement with a large scale low-light simulation dataset. Int. J. Comput. Vis., 129(7):2175–2193, 2021.
  • Lyu et al. [2022] Chengqi Lyu, Wenwei Zhang, Haian Huang, Yue Zhou, Yudong Wang, Yanyi Liu, Shilong Zhang, and Kai Chen. Rtmdet: An empirical study of designing real-time object detectors. arXiv preprint arXiv:2212.07784, 2022.
  • Maggioni et al. [2021] Matteo Maggioni, Yibin Huang, Cheng Li, Shuai Xiao, Zhongqian Fu, and Fenglong Song. Efficient multi-stage video denoising with recurrent spatio-temporal fusion. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3466–3475, 2021.
  • Martínez-Domingo et al. [2017] Miguel Ángel Martínez-Domingo, Eva M Valero, Javier Hernández-Andrés, Shoji Tominaga, Takahiko Horiuchi, and Keita Hirai. Image processing pipeline for segmentation and material classification based on multispectral high dynamic range polarimetric images. Opt. Express, 25(24):30073–30090, 2017.
  • Mildenhall et al. [2018] Ben Mildenhall, Jonathan T Barron, Jiawen Chen, Dillon Sharlet, Ren Ng, and Robert Carroll. Burst denoising with kernel prediction networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2502–2510, 2018.
  • Nguyen et al. [2014] Rang MH Nguyen, Dilip K Prasad, and Michael S Brown. Training-based spectral reconstruction from a single rgb image. In Eur. Conf. Comput. Vis., pages 186–201. Springer, 2014.
  • Nsamp et al. [2018] NE Nsamp, Zhongyun Hu, and Qing Wang. Learning exposure correction via consistency modeling. In Brit. Mach. Vis. Conf., pages 1–12, 2018.
  • Onzon et al. [2021] Emmanuel Onzon, Fahim Mannan, and Felix Heide. Neural auto-exposure for high-dynamic range object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7710–7720, 2021.
  • Pérez-Hernández et al. [2020] Francisco Pérez-Hernández, Siham Tabik, Alberto Lamas, Roberto Olmos, Hamido Fujita, and Francisco Herrera. Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowledge-Based Systems, 194:105590, 2020.
  • Pharr et al. [2023] Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementation. MIT Press, 2023.
  • Ramesh et al. [2018] Bharath Ramesh, Shihao Zhang, Zhi Wei Lee, Zhi Gao, Garrick Orchard, and Cheng Xiang. Long-term object tracking with a moving event camera. In Brit. Mach. Vis. Conf., pages 241–252, 2018.
  • Rebecq et al. [2019] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. Events-to-video: Bringing modern computer vision to event cameras. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3857–3866, 2019.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst., 28, 2015.
  • Ronneberger et al. [2015a] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Med. Image Comput. Comput. Assist. Interv., pages 234–241. Springer, 2015a.
  • Ronneberger et al. [2015b] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Med. Image Comput. Comput. Assist. Interv., pages 234–241, 2015b.
  • Saner et al. [2014] Daniel Saner, Oliver Wang, Simon Heinzle, Yael Pritch, Aljoscha Smolic, Alexander Sorkine-Hornung, and Markus H Gross. High-speed object tracking using an asynchronous temporal contrast sensor. In Vision, Modeling, and Visualization, pages 87–94. Citeseer, 2014.
  • Sarlin et al. [2021] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3247–3257, 2021.
  • Schwartz et al. [2018] Eli Schwartz, Raja Giryes, and Alex M Bronstein. Deepisp: Toward learning an end-to-end image processing pipeline. IEEE Trans. Image Process., 28(2):912–923, 2018.
  • Shao et al. [2019] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Int. Conf. Comput. Vis., pages 8430–8439, 2019.
  • Shen et al. [2019] Ziyi Shen, Wenguan Wang, Xiankai Lu, Jianbing Shen, Haibin Ling, Tingfa Xu, and Ling Shao. Human-aware motion deblurring. In Int. Conf. Comput. Vis., pages 5572–5581, 2019.
  • Shi et al. [2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Adv. Neural Inform. Process. Syst., 28, 2015.
  • Solovyev et al. [2021a] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput., 107:104117, 2021a.
  • Solovyev et al. [2021b] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput., 107:104117, 2021b.
  • Sun et al. [2022] Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. In Eur. Conf. Comput. Vis., pages 412–428. Springer, 2022.
  • Tassano et al. [2020] Matias Tassano, Julie Delon, and Thomas Veit. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1354–1363, 2020.
  • Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
  • Wach and Dowski Jr [2004] Hans Wach and Edward R Dowski Jr. Noise modeling for design and simulation of computational imaging systems. In Visual Information Processing XIII, pages 159–170, 2004.
  • Wang et al. [2020] Bishan Wang, Jingwei He, Lei Yu, Gui-Song Xia, and Wen Yang. Event enhanced high-quality image recovery. In Eur. Conf. Comput. Vis., 2020.
  • Wang and et al. [2023] Hang Wang et al. Omni aggregation networks for lightweight image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.
  • Wang et al. [2022a] Hai Wang, Yanyan Chen, Yingfeng Cai, Long Chen, Yicheng Li, Miguel Angel Sotelo, and Zhixiong Li. Sfnet-n: An improved sfnet algorithm for semantic segmentation of low-light autonomous driving road scenes. IEEE Trans. Intell. Transp. Syst., 23(11):21405–21417, 2022a.
  • Wang et al. [2023] Tao Wang, Kaihao Zhang, Tianrun Shen, Wenhan Luo, Bjorn Stenger, and Tong Lu. Ultra-high-definition low-light image enhancement: A benchmark and transformer-based method. In AAAI, pages 2654–2662, 2023.
  • Wang et al. [2019a] Wei Wang, Xin Chen, Cheng Yang, Xiang Li, Xuemei Hu, and Tao Yue. Enhancing low light videos by exploring high sensitivity camera noise. In Int. Conf. Comput. Vis., pages 4111–4119, 2019a.
  • Wang et al. [2019b] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 0–0, 2019b.
  • Wang et al. [2019c] Xintao Wang, Kelvin C.K. Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 1954–1963, 2019c.
  • Wang et al. [2022b] Xintao Wang, Liangbin Xie, Ke Yu, Kelvin CK Chan, Chen Change Loy, and Chao Dong. Basicsr: Open source image and video restoration toolbox. Github, 2022b.
  • Wang et al. [2024] Xinzhe Wang, Kang Ma, Qiankun Liu, Yunhao Zou, and Ying Fu. Multi-object tracking in the dark. In IEEE Conf. Comput. Vis. Pattern Recog., pages 382–392, 2024.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
  • Wei et al. [2020] Kaixuan Wei, Ying Fu, Jiaolong Yang, and Hua Huang. A physics-based noise formation model for extreme low-light raw denoising. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2758–2767, 2020.
  • Wei et al. [2021] Kaixuan Wei, Ying Fu, Yinqiang Zheng, and Jiaolong Yang. Physics-based noise modeling for extreme low-light photography. IEEE Trans. Pattern Anal. Mach. Intell., 44(11):8520–8537, 2021.
  • Woo et al. [2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Eur. Conf. Comput. Vis., pages 3–19, 2018.
  • Wu et al. [2010] Chenglei Wu, Yebin Liu, Qionghai Dai, and Bennett Wilburn. Fusing multiview and photometric stereo for 3d reconstruction under uncalibrated illumination. IEEE Trans. Vis. Comput. Graph., 17(8):1082–1095, 2010.
  • Wu et al. [2024] Chen Wu, Zhuoran Zheng, Xiuyi Jia, and Wenqi Ren. Mixnet: Towards effective and efficient uhd low-light image enhancement. arXiv preprint arXiv:2401.10666, 2024.
  • Xue et al. [2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. Int. J. Comput. Vis., 127:1106–1125, 2019.
  • Yang et al. [2021a] Hong Yang, Wei Kaixuan, Chen Linwei, and Fu Ying. Crafting object detection in very low light. In Brit. Mach. Vis. Conf., pages 1–15, 2021a.
  • Yang et al. [2022] Jianwei Yang, Chunyuan Li, Xiyang Dai, and Jianfeng Gao. Focal modulation networks. Adv. Neural Inform. Process. Syst., 35:4203–4217, 2022.
  • Yang et al. [2021b] Lingxiao Yang, Ru-Yuan Zhang, Lida Li, and Xiaohua Xie. Simam: A simple, parameter-free attention module for convolutional neural networks. In Int. Conf. Mach. Learn., pages 11863–11874. PMLR, 2021b.
  • Yang et al. [2023a] Yixin Yang, Jin Han, Jinxiu Liang, Imari Sato, and Boxin Shi. Learning event guided high dynamic range video reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13924–13934, 2023a.
  • Yang et al. [2023b] Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. Docdiff: Document enhancement via residual diffusion models. In ACM Int. Conf. Multimedia, pages 2795–2806, 2023b.
  • Ye et al. [2020] Chengxi Ye, Anton Mitrokhin, Cornelia Fermüller, James A Yorke, and Yiannis Aloimonos. Unsupervised learning of dense optical flow, depth and egomotion with event-based sensors. pages 5831–5838. IEEE, 2020.
  • Yi et al. [2023] Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model. In Int. Conf. Comput. Vis., pages 12302–12311, 2023.
  • Yin et al. [2023] Yuyang Yin, Dejia Xu, Chuangchuang Tan, Ping Liu, Yao Zhao, and Yunchao Wei. Cle diffusion: Controllable light enhancement diffusion model. In ACM Int. Conf. Multimedia, pages 8145–8156, 2023.
  • Yuan et al. [2019] Jin Yuan, Xingxing Hou, Yaoqiang Xiao, Da Cao, Weili Guan, and Liqiang Nie. Multi-criteria active deep learning for image classification. Knowledge-Based Systems, 172:86–94, 2019.
  • Yue et al. [2020] Huanjing Yue, Cong Cao, Lei Liao, Ronghe Chu, and Jingyu Yang. Supervised raw video denoising with a benchmark dataset on dynamic scenes. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2301–2310, 2020.
  • Zamir et al. [2021] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14821–14831, 2021.
  • Zamir et al. [2022a] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5728–5739, 2022a.
  • Zamir et al. [2022b] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for fast image restoration and enhancement. IEEE Trans. Pattern Anal. Mach. Intell., 2022b.
  • Zhang et al. [2021a] Fan Zhang, Yu Li, Shaodi You, and Ying Fu. Learning temporal consistency for low light video enhancement from single images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4967–4976, 2021a.
  • Zhang et al. [2023a] Fan Zhang, Shaodi You, Yu Li, and Ying Fu. Learning rain location prior for nighttime deraining. In Int. Conf. Comput. Vis., pages 13148–13157, 2023a.
  • Zhang et al. [2023b] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In Int. Conf. Learn. Represent., 2023b.
  • Zhang et al. [2021b] Jiqing Zhang, Xin Yang, Yingkai Fu, Xiaopeng Wei, Baocai Yin, and Bo Dong. Object tracking by jointly exploiting frame and event domain. In Int. Conf. Comput. Vis., pages 13043–13052, 2021b.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
  • Zhuang et al. [2023] Yunliang Zhuang, Zhuoran Zheng, Yuang Zhang, Lei Lyu, Xiuyi Jia, and Chen Lyu. Dimensional transformation mixer for ultra-high-definition industrial camera dehazing. IEEE Trans. Ind. Informat., 2023.
  • Zong et al. [2023] Zhuofan Zong, Guanglu Song, and Yu Liu. Detrs with collaborative hybrid assignments training. In Int. Conf. Comput. Vis., pages 6748–6758, 2023.
  • Zou and Fu [2022] Yunhao Zou and Ying Fu. Estimating fine-grained noise model via contrastive learning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12682–12691, 2022.
  • Zou et al. [2021] Yunhao Zou, Yinqiang Zheng, Tsuyoshi Takatani, and Ying Fu. Learning to reconstruct high speed and high dynamic range videos from events. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2024–2033, 2021.
  • Zou et al. [2023a] Yunhao Zou, Chenggang Yan, and Ying Fu. Iterative denoiser and noise estimator for self-supervised image denoising. In Int. Conf. Comput. Vis., pages 13265–13274, 2023a.
  • Zou et al. [2023b] Yunhao Zou, Chenggang Yan, and Ying Fu. Rawhdr: High dynamic range image reconstruction from a single raw image. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12334–12344, 2023b.