License: CC BY 4.0
arXiv:2312.16071v1 [cs.NE] 26 Dec 2023

Event-based Shape from Polarization with Spiking Neural Networks

Peng Kang$^{1,*}$, Srutarshi Banerjee$^{2}$, Henry Chopp$^{3}$, Aggelos Katsaggelos$^{3}$, and Oliver Cossairt$^{1}$
$^{1}$Department of Computer Science, Northwestern, Evanston, IL, USA
$^{2}$Argonne National Laboratory, Lemont, IL, USA
$^{3}$Department of Electrical and Computer Engineering, Northwestern, Evanston, IL, USA
[email protected]

Recent advances in event-based shape determination from polarization offer a transformative approach that tackles the trade-off between speed and accuracy in capturing surface geometries. In this paper, we investigate event-based shape from polarization using Spiking Neural Networks (SNNs), introducing the Single-Timestep and Multi-Timestep Spiking UNets for effective and efficient surface normal estimation. Specifically, the Single-Timestep model treats event-based shape from polarization as a non-temporal task, updating the membrane potential of each spiking neuron only once, thereby reducing computational and energy demands. In contrast, the Multi-Timestep model exploits temporal dynamics for enhanced data extraction. Extensive evaluations on synthetic and real-world datasets demonstrate that our models match the performance of state-of-the-art Artificial Neural Networks (ANNs) in estimating surface normals, with the added advantage of superior energy efficiency. Our work not only contributes to the advancement of SNNs in event-based sensing but also sets the stage for future explorations in optimizing SNN architectures, integrating multi-modal data, and scaling for applications on neuromorphic hardware.

  • Dec. 2023

1 Introduction

Precise surface normal estimation can provide valuable information about a scene’s geometry and is useful for many computer vision tasks, including 3D Reconstruction [1], Augmented Reality (AR) and Virtual Reality (VR) [2, 3], Material Classification [4], and Robotics Navigation [5]. Depending upon the requirements of the application, surface normal estimation can be carried out using a variety of methods [6, 7, 8, 9, 10, 11]. In this work, we are interested in estimating surface normals from polarization images, a problem known as shape from polarization [12, 13, 14, 15, 16, 17]. Shape from polarization leverages the polarization state of light to infer the shape of objects. When light reflects off surfaces, it becomes partially polarized. The method uses this property to estimate the surface normals of objects, which are then used to reconstruct their 3D shape. Compared to other 3D sensing methods, shape from polarization has many advantages [13, 18], such as its suitability for capturing fine details on a variety of surface materials, including reflective and transparent ones, and its reliance on passive sensing, which eliminates the need for external light sources or emitters. Additionally, shape from polarization can provide high-precision data with relatively low-cost and low-energy equipment, making it an efficient and versatile option for 3D imaging in various applications.

Typically, a polarizing filter is used in conjunction with a camera to capture the polarization images and infer the polarization information. Generally, there are two ways to capture the polarization images and estimate the surface normals from them: Division of Time (DoT) [13, 19, 20] and Division of Focal Plane (DoFP) [16, 17, 21]. The DoT approaches add a rotatable linear polarizer in front of the lens of an ordinary camera. The filter is rotated to different orientations, and a full-resolution polarization image is captured for each orientation at a different time. By analyzing the changes in the polarization state of light across these images, the surface normals of objects can be estimated. The DoT methods use the full resolution of the sensor at the cost of a longer acquisition time. The DoFP methods, on the other hand, place an array of micro-polarizers in front of the camera sensor [21]. This allows the camera to capture polarization information at different orientations in a single shot. Despite the reduced latency, this system is limited by the low resolution of the polarization images, as each pixel only captures polarization at a specific orientation. This can result in lower accuracy compared to the DoT methods.

To bridge the accuracy of DoT with the speed of DoFP, researchers have proposed event-based shape from polarization following the DoT design scheme [22]. Specifically, a polarizer rotates in front of an event camera [23], which creates sinusoidal changes in brightness intensity. Unlike traditional DoT methods, which use standard cameras to capture full-resolution polarization images at fixed rates, event-based shape from polarization employs event cameras to asynchronously measure changes in brightness intensity at each pixel of the full-resolution scene and to trigger events with microsecond resolution whenever the brightness change exceeds a threshold. The proposed event-based method uses the continuous event stream to reconstruct relative intensities at multiple polarizer angles. These reconstructed polarized images are then used to estimate surface normals with physics-based and learning-based methods [22]. Owing to its DoT-driven design and the low latency of event cameras, event-based shape from polarization mitigates the accuracy-speed trade-off of traditional shape from polarization.

Although event-based shape from polarization brings many advantages, the models that process the data from event cameras must still be chosen carefully. With the prevalence of Artificial Neural Networks (ANNs), one recent method [22] employs ANNs to process event data and demonstrates better surface normal estimation performance than physics-based methods. However, ANNs are not well matched to the working mechanism of event cameras and incur high energy consumption. To be more compatible with event cameras while maintaining high energy efficiency, research on Spiking Neural Networks (SNNs) [24] has started to gain momentum. Similar to event cameras, which mimic the human retina’s way of responding to changes in light intensity, SNNs are also bio-inspired and designed to emulate the neural dynamics of the human brain. Unlike ANNs, which employ artificial neurons [25, 26, 27] and perform real-valued computation, SNNs adopt spiking neurons [28, 29, 30] and use binary 0-1 spikes to process information. This difference reduces the dot-product operations of ANNs to less computationally expensive summation operations in SNNs [24]. Owing to this advantage, SNNs are typically energy-efficient and well suited to power-constrained devices. Although SNNs offer higher energy efficiency and considerable effort has been devoted to SNN research, ANNs still deliver better performance and dominate a wide range of learning applications [31].
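The reduction from dot products to summations can be illustrated with a minimal sketch. The function names `mac_sum` and `spike_sum` are hypothetical and are not taken from the paper; the sketch only assumes that the inputs to a neuron are binary 0-1 spikes.

```python
import numpy as np


def mac_sum(weights, inputs):
    """ANN-style weighted sum: one multiply-accumulate per input."""
    return float(np.dot(weights, inputs))


def spike_sum(weights, spikes):
    """SNN-style weighted sum for binary spikes (illustrative assumption):
    accumulate only the weights whose input spiked, with no multiplications."""
    return float(weights[spikes == 1].sum())


weights = np.array([0.5, -1.0, 2.0])   # synaptic weights of one neuron
spikes = np.array([1, 0, 1])           # binary 0-1 inputs from pre-neurons

# Both paths give the same pre-activation; only the operation count differs.
print(mac_sum(weights, spikes), spike_sum(weights, spikes))
```

With sparse spike trains, the second form also skips every silent input, which is one intuition behind the energy argument made above.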

Recently, more research effort has been invested in narrowing the performance gap between ANNs and SNNs, and SNNs have achieved comparable performance in various tasks, including image classification [32], object detection [33], graph prediction [34], and natural language processing [35]. Nevertheless, SNNs have not yet been established for accurate surface normal estimation with competitive performance. This naturally raises a question: can bio-inspired Spiking Neural Networks estimate surface normals from event-based polarization data with high quality at low energy consumption?

In this paper, we investigate the event-based shape from polarization with a spiking approach to answer the above question. Specifically, inspired by the feed-forward UNet [36] for event-based shape from polarization [22], we propose the Single-Timestep Spiking UNet, which treats the event-based shape from polarization as a non-temporal task. This model processes event-based inputs in a feed-forward manner, where each spiking neuron in the model updates its membrane potential only once. Although this approach may not maximize the temporal processing capabilities of SNNs, it significantly reduces the computational and energy requirements. To further exploit the rich temporal information from event-based data and enhance model performance in the task of event-based shape from polarization, we propose the Multi-Timestep Spiking UNet. This model processes inputs in a sequential, timestep-by-timestep fashion, allowing each spiking neuron to utilize its temporal recurrent neuronal dynamics to more effectively extract information from event data. We extensively evaluate the proposed models on the synthetic dataset and the real-world dataset for event-based shape from polarization. The results of these experiments, both quantitatively and qualitatively, indicate that our models are capable of estimating dense surface normals from polarization events with performance comparable to current state-of-the-art ANN models. Additionally, we perform ablation studies to assess the impact of various design components within our models, further validating their effectiveness. Furthermore, our models exhibit superior energy efficiency compared to their ANN counterparts, which highlights their potential for application on neuromorphic hardware and energy-constrained edge devices.

The remainder of this paper is structured as follows: Section 2 provides a comprehensive review of existing literature on shape from polarization and SNNs. In Section 3, we detail our proposed SNN models for event-based shape from polarization, including their structures, training protocols, and implementation details. Section 4 showcases the effectiveness and energy efficiency of our proposed models on different benchmark datasets. The paper concludes with Section 5, where we summarize our findings and outline potential avenues for future research.

2 Related Work

In the following, we will first give an overview of the related work on shape from polarization, including the traditional shape from polarization and event-based shape from polarization. Then, we will give a comprehensive review of SNNs and their applications in 3D scenes.

2.1 Shape from Polarization

Shurcliff proposed shape recovery from polarization information in 1962 [37]. Essentially, when unpolarized light reflects off a surface point, it becomes partially polarized. The observed scene radiance varies as the polarizer angle changes, and this variation encodes a relationship with the surface normal. Therefore, by analyzing this relationship at each surface point through the Fresnel equations [38], shape from polarization methods can measure the azimuthal and zenithal angles at each pixel and recover the per-pixel surface normal at high resolution. Generally, two schemes are used to collect polarization images. One is Division of Time (DoT) [13, 19, 20], which provides full-resolution polarization images but significantly increases the acquisition time; the other is Division of Focal Plane (DoFP) [16, 17, 21], which trades off spatial resolution for low latency. After the polarization images are collected, various physics-based or learning-based methods [39] can be used to estimate the surface normals. However, since a linear polarizer cannot distinguish between polarized light that is rotated by $\pi$ radians, there are two confounding estimates for the azimuth angle at each pixel [16, 40]. To resolve this ambiguity, the estimation methods must be carefully designed to exploit additional constraints, such as geometric cues [41, 42, 43], spectral cues [14, 44, 45], photometric cues [15, 46, 47], or priors learned with deep learning techniques [16, 17].
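The $\pi$ ambiguity can be checked numerically. The sketch below assumes the commonly used sinusoidal model for the intensity observed behind a linear polarizer, $I(\phi_{pol}) = \bar{I}\,(1 + \rho \cos(2(\phi_{pol} - \phi)))$, which is not stated explicitly in this paper; the function name and the parameters `i_mean`, `dop`, and `phase` are illustrative.

```python
import numpy as np


def polarized_intensity(pol_angle, i_mean, dop, phase):
    """Intensity behind a linear polarizer at angle pol_angle (radians).

    Assumed model (not from the paper): mean intensity i_mean, degree of
    polarization dop, and phase angle of the sinusoid phase.
    """
    return i_mean * (1.0 + dop * np.cos(2.0 * (pol_angle - phase)))


pol_angles = np.linspace(0.0, np.pi, 8, endpoint=False)
observed = polarized_intensity(pol_angles, i_mean=1.0, dop=0.3, phase=0.4)
shifted = polarized_intensity(pol_angles, i_mean=1.0, dop=0.3, phase=0.4 + np.pi)

# The two phase hypotheses produce identical measurements at every polarizer
# angle, so the measurements alone cannot separate them.
print(np.allclose(observed, shifted))
```

Because the model depends on the phase only through $\cos(2(\cdot))$, any extra cue used to disambiguate must come from outside the polarization measurements, as the constraints listed above do.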

Recently, with the prevalence of bio-inspired neuromorphic engineering, researchers have begun to shift their focus to high-speed, energy-efficient event cameras and to propose solutions that combine polarization information with event cameras. Specifically, inspired by the polarization vision in the mantis shrimp eye [48], the authors of [49] proposed the PDAVIS polarization event camera. They employed the DoFP scheme to design this camera, fabricating an array of pixelated polarization filters and positioning them atop the sensor of an event camera. While this camera is adept at capturing high dynamic range polarization scenes at high speed, it still suffers from low spatial resolution, an issue inherent in DoFP methods. To bridge the high resolution of DoT with the low latency of DoFP, [22] adopted the DoT scheme and collected polarization events by placing a rotating polarizing filter in front of an event camera. Owing to the high resolution of DoT and the low latency of event cameras, this method enables shape from polarization at both high speed and high spatial resolution. Typically, the captured polarization events are transformed into frame-like event representations [50], which are then processed using ANN models [22] to estimate surface normals. While these learning-based methods outperform traditional physics-based methods, they significantly increase the energy consumption of the overall system, primarily due to the lower energy efficiency of ANNs. By processing event polarization data collected with the promising DoT scheme, this paper aims to address this challenge by conducting event-based shape from polarization using SNNs, presenting a more energy-efficient alternative in this domain.

2.2 Spiking Neural Networks

With the development of ANNs, artificial intelligence models today have demonstrated extraordinary abilities in many tasks, such as computer vision, natural language processing, and robotics. Nevertheless, ANNs only mimic the brain’s architecture in a few aspects, including vast connectivity and structural and functional organizational hierarchy [24]. The brain has more information processing mechanisms, such as neuronal and synaptic functionality [51, 52]. Moreover, ANNs consume much more energy than the human brain. To integrate more brain-like characteristics and make artificial intelligence models more energy-efficient, researchers have proposed SNNs, which can be executed on power-efficient neuromorphic processors like TrueNorth [53] and Loihi [54]. Like ANNs, SNNs are capable of implementing common network architectures, such as convolutional and fully-connected layers, yet they distinguish themselves by utilizing spiking neuron models [30], such as the Leaky Integrate-and-Fire (LIF) model [29] and the Spike Response Model (SRM) [28]. Due to the non-differentiability of these spiking neuron models, training SNNs can be challenging. However, progress has been made through approaches such as converting pre-trained ANNs to SNNs [55, 56] and developing methods to approximate the derivative of the spike function [57, 58]. Thanks to the development of these optimization techniques, several models have been proposed recently to tackle complex tasks in 3D scenes. Notably, StereoSpike [59] and MSS-DepthNet [60] have pioneered the development of deep SNNs for depth estimation, achieving performance on par with state-of-the-art ANN models. Additionally, SpikingNeRF [61] has successfully adapted SNNs for radiance field reconstruction, yielding synthesis quality comparable to ANN baselines while maintaining high energy efficiency. In this paper, our emphasis is on employing SNNs to tackle event-based shape from polarization, aiming to establish a method that is not only effective but also more efficient for event-based surface normal estimation.

3 Methods

In this paper, we focus on building SNNs to estimate surface normals through the use of a polarizer paired with an event camera. In this setup, the polarizer is mounted in front of the event camera and rotates at a constant high speed driven by a motor. This rotation changes the illumination of the incoming light. Event cameras generate an asynchronous event $e_i = (x_i, y_i, t_i, p_i)$ when the illumination variation at a given pixel reaches a given contrast threshold $C$:

$$L(x_i, y_i, t_i) - L(x_i, y_i, t_i - \Delta t_i) = p_i C, \tag{1}$$

where $L \doteq \log(I)$ is the log photocurrent (“brightness”), $p_i \in \{-1, +1\}$ is the sign of the brightness change, and $\Delta t_i$ is the time since the last event at the pixel $(x_i, y_i)$. The surface normal vector can be represented by its azimuth angle $\alpha$ and zenith angle $\theta$ in a spherical coordinate system. The proposed models predict the surface normal $\mathbf{N}$ as a 3-channel tensor $\mathbf{N} = (\sin\theta\cos\alpha, \sin\theta\sin\alpha, \cos\theta)$ from the event stream.
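As a rough illustration of Eq. 1 and of the normal parameterization, the following sketch simulates an idealized single pixel and converts angles to a normal vector. It assumes a noise-free sensor that samples the log brightness at discrete times and starts its reference level at the first sample; `generate_events` and `normal_from_angles` are hypothetical helper names, not part of the paper.

```python
import numpy as np


def generate_events(log_intensity, timestamps, contrast_threshold):
    """Idealized single-pixel event generation following Eq. 1.

    Assumption: the reference level starts at the first sample and moves by
    exactly C for every emitted event. Returns (timestamp, polarity) pairs.
    """
    events = []
    reference = float(log_intensity[0])
    for level, t in zip(log_intensity[1:], timestamps[1:]):
        # Emit as many events as needed until the residual change is below C.
        while abs(level - reference) >= contrast_threshold:
            polarity = 1 if level > reference else -1
            reference += polarity * contrast_threshold
            events.append((t, polarity))
    return events


def normal_from_angles(zenith, azimuth):
    """Surface normal N = (sin(theta)cos(alpha), sin(theta)sin(alpha), cos(theta))."""
    return np.array([
        np.sin(zenith) * np.cos(azimuth),
        np.sin(zenith) * np.sin(azimuth),
        np.cos(zenith),
    ])


events = generate_events([0.0, 0.25, 0.5, 0.25, 0.05], [0, 1, 2, 3, 4], 0.2)
print(events)
print(normal_from_angles(np.pi / 4, np.pi / 3))
```

A real event camera adds noise, refractory periods, and per-pixel threshold variation, none of which are modeled here.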

3.1 Input Event Representation

To ensure a fair comparison between our proposed methods and those utilizing ANNs for event-based shape from polarization, we transform the sparse event stream into frame-like event representations, which serve as the input for our methods. Specifically, similar to [22], we adopt the CVGR-I representation due to its superior performance. The CVGR-I representation combines the Cumulative Voxel Grid Representation (CVGR) with a single polarization image (I) taken at a polarizer angle of 0 degrees. The CVGR is a variation of the voxel grid [50]. Similar to previous works on learning with events [62, 63], the CVGR first encodes the events in a spatio-temporal voxel grid $V$. Specifically, the time domain of the event stream is equally discretized into $B$ temporal bins indexed by integers in the range $[0, B-1]$. Each event $e_i = (x_i, y_i, t_i, p_i)$ distributes its sign value $p_i$ to the two closest spatio-temporal voxels as follows:

$$V(x, y, t) = \sum_{x_i = x,\, y_i = y} p_i \max(0, 1 - |t - t_i^{*}|), \qquad t_i^{*} = \frac{B-1}{\Delta T}(t_i - t_0), \tag{2}$$

where $(x, y, t)$ is a specific location of the spatio-temporal voxel grid $V$, $\Delta T$ is the duration of the event stream, and $t_0$ is the timestamp of the initial event in the event stream. Then, the CVGR calculates the cumulative sum across the bins and multiplies it by the contrast threshold:

$$E(x, y, b) = C \sum_{i=0}^{b} V(x, y, i), \qquad b \in \{0, 1, 2, 3, \ldots, B-1\}. \tag{3}$$

Finally, to enhance surface normal estimation in areas with insufficient event information, a single polarization image captured at a polarizer angle of 0 degrees is incorporated, resulting in $E = I[0] + E$, thereby providing additional context. The resulting event representation $E$ serves as the input to our models. Its dimensions are $B \times H \times W$, where $H$ and $W$ represent the height and width of the event camera resolution, respectively. We present a concrete input example of “cup” in Fig. 1.

Figure 1: The CVGR-I input representation comprises CVGR frames spanning $B$ temporal bins, along with a single polarization image captured at a polarizer angle of 0 degrees. In this example, we set $B = 8$.
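The construction of the CVGR-I input (Eqs. 2 and 3 followed by the addition of the 0-degree image) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `cvgr_i` is a hypothetical name, $\Delta T$ is taken as the span between the first and last event, and $I[0]$ is assumed to be added to every temporal bin by broadcasting.

```python
import numpy as np


def cvgr_i(xs, ys, ts, ps, image0, num_bins, contrast_threshold):
    """Build a (B, H, W) CVGR-I tensor from events and the 0-degree image.

    xs, ys: integer pixel coordinates; ts: timestamps; ps: polarities (+1/-1).
    """
    height, width = image0.shape
    voxel = np.zeros((num_bins, height, width), dtype=np.float64)
    t0 = ts.min()                                  # timestamp of the first event
    duration = max(ts.max() - t0, 1e-9)            # Delta T (assumed event span)
    t_star = (num_bins - 1) * (ts - t0) / duration # Eq. 2: normalized timestamps
    lower = np.floor(t_star).astype(int)
    # Each event contributes to its two closest temporal bins (Eq. 2).
    for offset in (0, 1):
        b = lower + offset
        weight = np.maximum(0.0, 1.0 - np.abs(b - t_star))
        valid = b < num_bins
        np.add.at(voxel, (b[valid], ys[valid], xs[valid]), ps[valid] * weight[valid])
    cvgr = contrast_threshold * np.cumsum(voxel, axis=0)  # Eq. 3
    return cvgr + image0[None, :, :]               # E = I[0] + E (broadcast)


xs = np.array([0, 1, 0])
ys = np.array([0, 1, 0])
ts = np.array([0.0, 1.0, 0.5])
ps = np.array([1, -1, 1])
rep = cvgr_i(xs, ys, ts, ps, np.full((2, 2), 0.1), num_bins=3, contrast_threshold=0.5)
print(rep.shape)
```

`np.add.at` is used instead of plain indexed assignment so that several events landing on the same voxel are accumulated rather than overwritten.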

3.2 Spiking Neuron Models

Spiking neuron models are mathematical descriptions of specific cells in the nervous system. They are the basic building blocks of SNNs. In this paper, we primarily concentrate on using the Integrate-and-Fire (IF) model [29] to develop our proposed SNNs. The IF model is one of the earliest and simplest spiking neuron models. The dynamics of IF neuron $i$ are defined as:

$$u_i(t) = u_i(t-1) + \sum_j w_{ij} x_j(t), \tag{4}$$

where $u_i(t)$ represents the internal membrane potential of neuron $i$ at time $t$, $u_i(t-1)$ is the membrane potential of neuron $i$ at the previous timestep $t-1$, and $\sum_j w_{ij} x_j(t)$ is the weighted summation of the inputs from pre-neurons at the current timestep $t$. When $u_i(t)$ exceeds a certain threshold $u_{th}$, the neuron emits a spike, resets its membrane potential to $u_{reset}$, and then accumulates $u_i(t)$ again in subsequent timesteps.

In addition to the IF model, we also build our proposed models with the Leaky Integrate-and-Fire (LIF) model [29]. Compared to the IF model, the LIF model contains a leaky term to mimic the diffusion of ions through the membrane. The dynamics of LIF neuron $i$ can be expressed as:

$$u_i(t) = \alpha u_i(t-1) + \sum_j w_{ij} x_j(t), \tag{5}$$

where $\alpha$ is a leaky factor that decays the membrane potential over time. Drawing inspiration from previous work [64], we also construct models using the Parametric Leaky Integrate-and-Fire (PLIF) model, which enables automatic learning of the leaky factor. In our experiments, we demonstrate that the IF model can offer better performance as it retains more information by not incorporating the leaky factor, thus striking a balance between high performance and biological plausibility.
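The difference between Eqs. 4 and 5 can be seen in a minimal discrete-time simulation of a single neuron. The sketch assumes that the potential starts at $u_{reset}$ and that a spike is emitted when the potential reaches or exceeds the threshold (matching the $\geq$ comparison used later in Eq. 6); `simulate_neuron` and its default values are illustrative, not taken from the paper.

```python
def simulate_neuron(inputs, u_th=1.0, u_reset=0.0, leak=1.0):
    """Simulate one spiking neuron over a sequence of weighted input sums.

    leak = 1.0 gives the IF update of Eq. 4; 0 < leak < 1 gives the LIF
    update of Eq. 5. Returns the spike train and the post-reset potentials.
    """
    u = u_reset                      # assumed initial membrane potential
    spikes, potentials = [], []
    for current in inputs:
        u = leak * u + current       # integrate the weighted input
        fired = u >= u_th
        spikes.append(int(fired))
        if fired:
            u = u_reset              # hard reset after a spike
        potentials.append(u)
    return spikes, potentials


inputs = [0.5, 0.5, 0.5, 0.5]
if_spikes, _ = simulate_neuron(inputs, leak=1.0)
lif_spikes, _ = simulate_neuron(inputs, leak=0.5)
print(if_spikes, lif_spikes)
```

For this constant input the IF neuron fires on every second step, while the LIF neuron with a leaky factor of 0.5 never reaches the threshold, which gives some intuition for why dropping the leak can retain more of the input information.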

Figure 2: The network structure of the Single-Timestep Spiking UNet. The network is designed according to the UNet architecture in a fully convolutional manner. Specifically, it consists of an event encoding module (gray), an encoder (orange and blue), a decoder (yellow and green), and a final prediction layer (purple). The size of the CVGR-I input representation is $(8 \times 512 \times 512)$. Conv2D($a$, $b$)-IF represents the spiking convolutional layer with $a$ input channels and $b$ output channels. Each max pooling layer downsamples the feature map by a factor of 2. The spatial resolution is doubled after each upsampling layer.
Figure 3: The network structure of the Multi-Timestep Spiking UNet. The network is designed according to the UNet architecture in a fully convolutional manner. Specifically, it consists of an event encoding module (gray), an encoder (orange and blue), a decoder (yellow and green), and a final prediction layer (purple). Unlike the Single-Timestep Spiking UNet, which processes the CVGR-I representation as a whole and updates the membrane potential of its spiking neurons only once, the Multi-Timestep Spiking UNet processes the $B \times H \times W$ CVGR-I representation along its temporal dimension $B$. The settings for Conv2D($a$, $b$)-IF layers, max pooling layers, and upsampling layers are the same as those for the Single-Timestep Spiking UNet.

3.3 SNNs for Event-based Shape from Polarization

In this section, we propose two SNNs that take the CVGR-I event representation as the input and estimate the surface normals $\mathbf{N}$. Both of them can process the information through the spiking neuron models mentioned above. Due to the potential of IF neurons in event-based shape from polarization, we will present the proposed models based on the dynamics of IF neurons.

3.3.1 Single-Timestep Spiking UNet

In this work, we have chosen a UNet [36], a commonly utilized architecture in semantic segmentation, as the backbone for surface normal estimation. Specifically, we propose the Single-Timestep Spiking UNet as shown in Fig. 2. This model is composed of several key components: an event encoding module, an encoder, a decoder, and a final layer dedicated to making surface normal predictions. As a Single-Timestep feed-forward SNN, this model processes the entire $B \times H \times W$ CVGR-I representation as its input and updates the membrane potential of its spiking neurons once per data sample. The event encoding module utilizes two spiking convolutional layers to transform the real-valued $B \times H \times W$ CVGR-I representation to the binary spiking representation with the size of $N_c \times H \times W$. Based on Eq. 4, the membrane potential $u_i$ and output spiking state $o_i$ of IF neuron $i$ in the spiking convolutional layer are decided by:

$$u_i = \mathrm{Conv}(X), \qquad o_i = \begin{cases} 1 & \text{if } u_i \geq u_{th} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\mathrm{Conv}(X)$ is the weighted convolutional summation of the inputs from previous layers, and $t$ in Eq. 4 is ignored since the model only updates once. After spiking feature extraction, there are $N_e$ encoder blocks to encode the spiking representation. Each encoder employs a max pooling layer and multiple spiking convolutional layers to capture surface normal features. The neuronal dynamics of IF neurons in these layers are still controlled by Eq. 6. The encoded features are subsequently decoded using $N_d$ decoder blocks, where $N_d = N_e$. Since transposed convolutions are often associated with the creation of checkerboard artifacts [65], each decoder consists of an upsampling layer followed by multiple spiking convolutional layers, where the IF neurons are governed by Eq. 6. For the upsampling operations, we have two options: nearest neighbor upsampling and bilinear upsampling. Through our experiments, we will show that nearest neighbor upsampling can achieve performance comparable to bilinear upsampling in event-based surface normal estimation while preserving the fully spiking nature of our proposed model. As suggested in the UNet architecture, to address the challenge of information loss during down-sampling and up-sampling, skip connections are utilized between corresponding encoder and decoder blocks at the same hierarchical levels. To preserve the spiking nature and avoid introducing non-binary values, the proposed model utilizes concatenations as skip connections. Lastly, the final prediction layer employs the potential-assisted IF neurons [66, 67] to estimate the surface normals. Unlike traditional IF neurons, which generate spikes according to Eq. 6, the potential-assisted IF neurons are non-spiking neurons that output their membrane potential, given by:

$$\begin{aligned} u_i &= Conv(X), \\ o_i &= u_i, \end{aligned} \tag{7}$$

where $o_i$ denotes the real-valued output of neuron $i$. These potential-assisted dynamics can be extended to both LIF and PLIF neurons, facilitating the construction of a Single-Timestep Spiking UNet using these types of neurons. By producing real-valued membrane potential outputs, potential-assisted neurons retain rich information that enhances surface normal estimation and boosts the expressivity of SNNs, especially for large-scale regression tasks.
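As a minimal sketch of Eq. 6 and Eq. 7, the snippet below applies both neuron types to the same pre-computed inputs. The function names and the example values standing in for $Conv(X)$ are hypothetical and chosen only for illustration; the convolution itself is not modeled.

```python
def if_single_step(conv_out, u_th=1.0):
    """Eq. 6: one membrane update, then threshold to a binary spike."""
    return [1 if u >= u_th else 0 for u in conv_out]


def potential_assisted_single_step(conv_out):
    """Eq. 7: non-spiking readout, the membrane potential is the output."""
    return list(conv_out)


# Hypothetical values of Conv(X) for four neurons (illustration only).
conv_out = [0.2, 1.0, 1.7, -0.4]

spikes = if_single_step(conv_out)
readout = potential_assisted_single_step(conv_out)

print(spikes)   # [0, 1, 1, 0]
print(readout)  # [0.2, 1.0, 1.7, -0.4]
```

The spiking layer collapses every input to 0 or 1, whereas the potential-assisted readout keeps the sign and magnitude that a regression target such as a surface normal needs.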

3.3.2 Multi-Timestep Spiking UNet

To take advantage of the temporal neuronal dynamics of spiking neurons and extract rich temporal information from event-based data, we propose the Multi-Timestep Spiking UNet for event-based shape from polarization. Figure 3 shows its network structure. Like the Single-Timestep Spiking UNet, the Multi-Timestep Spiking UNet consists of an event encoding module, an encoder, a decoder, and a final surface normal prediction layer. However, whereas the Single-Timestep Spiking UNet processes the CVGR-I representation as a whole and updates the membrane potential of its spiking neurons only once per data sample, the Multi-Timestep Spiking UNet processes the $B \times H \times W$ CVGR-I representation of each data sample along its temporal dimension $B$. At each time step, a $1 \times H \times W$ CVGR-I slice is fed into the event encoding module and transformed into a representation of size $N_c \times H \times W$, followed by $N_e$ encoder blocks, $N_d$ decoder blocks, and a final prediction layer. Based on Eq. 4, the membrane potential $u_i(t)$ and output spiking state $o_i(t)$ of IF neuron $i$ in the spiking convolutional layers of the Multi-Timestep Spiking UNet are given by:

$$\begin{aligned} u_i(t) &= u_i(t-1)\,(1 - o_i(t-1)) + Conv(X(t)), \\ o_i(t) &= \begin{cases} 1 & \text{for } u_i(t) \geq u_{th} \\ 0 & \text{otherwise} \end{cases}, \end{aligned} \tag{8}$$

where $Conv(X(t))$ is the weighted convolutional summation of the inputs from previous layers at time step $t$. The final prediction layer continues to use potential-assisted IF neurons, but with temporal dynamics as outlined below:

$$\begin{aligned} u_i(t) &= u_i(t-1) + Conv(X(t)), \\ o_i(t) &= u_i(t), \end{aligned} \tag{9}$$

where the potential-assisted IF neuron $i$ accumulates its membrane potential to maintain the rich temporal information, $o_i(t)$ is the output of neuron $i$ at time step $t$, and we use the outputs at the last time step as the final surface normal predictions.
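The recurrences in Eq. 8 and Eq. 9 can be traced for a single neuron with a few lines of code. This is only a sketch: the input sequence is made up, the function names are hypothetical, and the initial state $u_i = 0$, $o_i = 0$ before the first time step is an assumption.

```python
def if_multi_step(inputs, u_th=1.0):
    """Eq. 8: reset to 0 after a spike, otherwise keep accumulating."""
    u, o, spikes = 0.0, 0, []          # assumed initial state
    for x in inputs:                   # x plays the role of Conv(X(t))
        u = u * (1 - o) + x            # previous spike clears the potential
        o = 1 if u >= u_th else 0
        spikes.append(o)
    return spikes


def potential_assisted_multi_step(inputs):
    """Eq. 9: accumulate without reset; the potential is the output."""
    u, outputs = 0.0, []
    for x in inputs:
        u = u + x
        outputs.append(u)
    return outputs


# Hypothetical per-step inputs for one neuron (illustration only).
inputs = [0.5, 0.5, 0.25, 0.5, 0.5]

spikes = if_multi_step(inputs)
potentials = potential_assisted_multi_step(inputs)

print(spikes)          # [0, 1, 0, 0, 1]
print(potentials[-1])  # 2.25, the value used as the final prediction
```

The spiking neuron fires whenever the accumulated input crosses the threshold and then starts over, while the prediction neuron integrates the whole sequence and is read out once at the last step.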

3.4 Training and Implementation Details

We normalize outputs from spiking neurons into unit-length surface normal vectors $\mathbf{\hat{N}}$ and then apply the cosine similarity loss function:

$$\mathcal{L} = \frac{1}{H \times W} \sum_{i}^{H} \sum_{j}^{W} \left(1 - \left<\mathbf{\hat{N}}_{i,j}, \mathbf{N}_{i,j}\right>\right), \tag{10}$$

where $\left<\cdot,\cdot\right>$ indicates the dot product, $\mathbf{\hat{N}}_{i,j}$ refers to the estimated surface normal at the pixel location $(i,j)$, while $\mathbf{N}_{i,j}$ denotes the ground truth surface normal at the same location. The objective is to minimize this loss, which is achieved when the orientations of $\mathbf{\hat{N}}_{i,j}$ and $\mathbf{N}_{i,j}$ align perfectly.
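A small, dependency-free sketch of Eq. 10 follows. The helper names and the toy $1 \times 2$ "image" are hypothetical; the sketch assumes non-zero predictions and unit-length ground-truth normals.

```python
import math


def normalize(v):
    """Scale a non-zero 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def cosine_loss(pred, gt):
    """Eq. 10 for H x W grids of 3-vectors (gt assumed unit length)."""
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            n_hat = normalize(pred[i][j])
            dot = sum(a * b for a, b in zip(n_hat, gt[i][j]))
            total += 1.0 - dot
    return total / (h * w)


# Toy 1 x 2 example: one pixel aligned with the ground truth,
# one pixel perpendicular to it.
pred = [[[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]]
gt = [[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]

loss = cosine_loss(pred, gt)
print(loss)  # 0.5
```

The aligned pixel contributes 0 and the perpendicular pixel contributes 1, so the mean is 0.5; a prediction pointing opposite to the ground truth would contribute the maximum value of 2.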

To optimize the Single-Timestep Spiking UNet, we utilize the backpropagation method [68] to calculate the weight updates:

$$\Delta w^{l} = \frac{\partial \mathcal{L}}{\partial w^{l}} = \frac{\partial \mathcal{L}}{\partial o^{l}} \frac{\partial o^{l}}{\partial u^{l}} \frac{\partial u^{l}}{\partial w^{l}}, \tag{11}$$

where $w^{l}$ is the weight for layer $l$, $o^{l}$ is the output of spiking neurons in layer $l$, and $u^{l}$ is the membrane potential of spiking neurons in layer $l$. Similarly, to optimize the Multi-Timestep Spiking UNet, we utilize BackPropagation Through Time (BPTT) [69] to calculate the weight updates. In BPTT, the model is unrolled for all discrete time steps, and the weight update is computed as the sum of gradients from each time step as follows:

$$\Delta w^{l} = \sum_{t=0}^{B-1} \frac{\partial \mathcal{L}}{\partial o_{t}^{l}} \frac{\partial o_{t}^{l}}{\partial u_{t}^{l}} \frac{\partial u_{t}^{l}}{\partial w^{l}}, \tag{12}$$

where $w^{l}$ is the weight for layer $l$, $o_{t}^{l}$ is the output of spiking neurons in layer $l$ at time step $t$, and $u_{t}^{l}$ is the membrane potential of spiking neurons in layer $l$ at time step $t$. Because the outputs in Eq. 6 and Eq. 8 are Heaviside step functions, neither $\frac{\partial o^{l}}{\partial u^{l}}$ nor $\frac{\partial o_{t}^{l}}{\partial u_{t}^{l}}$ is usable for training in the spiking convolutional layers: the step function is non-differentiable at the threshold and has zero gradient everywhere else. To overcome this, we use the differentiable ArcTan function $g(x) = \frac{1}{\pi}\arctan(\pi x) + \frac{1}{2}$ as the surrogate function of the Heaviside step function [70]. For the final prediction layer with potential-assisted spiking neurons, which output membrane potential instead of spikes, we have $\frac{\partial o^{l}}{\partial u^{l}} = 1$ and $\frac{\partial o_{t}^{l}}{\partial u_{t}^{l}} = 1$ for these layers' weight updates.
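The surrogate idea can be sketched as follows: the forward pass keeps the hard step, while the backward pass uses the derivative of $g$, which is $g'(x) = \frac{1}{1 + (\pi x)^2}$. Taking $x = u - u_{th}$ as the argument is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def heaviside(x):
    """Forward pass: hard spike decision."""
    return 1.0 if x >= 0 else 0.0


def atan_surrogate(x):
    """g(x) = arctan(pi * x) / pi + 1/2, a smooth stand-in for the step."""
    return math.atan(math.pi * x) / math.pi + 0.5


def atan_surrogate_grad(x):
    """g'(x) = 1 / (1 + (pi * x)^2), used in place of the step's gradient."""
    return 1.0 / (1.0 + (math.pi * x) ** 2)


# Assumed convention: the surrogate is evaluated at x = u - u_th.
u, u_th = 0.8, 1.0
x = u - u_th

forward = heaviside(x)             # no spike, since u < u_th
backward = atan_surrogate_grad(x)  # non-zero gradient near the threshold

# Central finite difference as a sanity check of the analytic derivative.
h = 1e-6
numeric = (atan_surrogate(x + h) - atan_surrogate(x - h)) / (2 * h)
```

The gradient peaks at 1 when the potential sits exactly at the threshold and decays as the potential moves away from it, so neurons close to firing receive the largest weight updates.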

4 Experiments and Results

In this section, we evaluate the effectiveness and efficiency of our proposed SNN models on event-based shape from polarization. We begin by introducing the experimental setup, datasets, baselines, and performance metrics for event-based shape from polarization. Then, extensive experiments on these datasets showcase the capabilities of our models, both in quantitative and qualitative terms, across synthetic and real-world scenarios. Lastly, we analyze the computational costs of our models to highlight their enhanced energy efficiency compared to the counterpart ANN models.

4.1 Experimental Setup

Our models are implemented with SpikingJelly [71], an open-source deep learning framework for SNNs based on PyTorch [72]. For a fair comparison with the counterpart ANN models, we use settings similar to those of the ANN models in [22]. Specifically, we set $B = 8$ for the input event representation. Our models have $N_e = 4$ encoder blocks and $N_d = 4$ decoder blocks, and the event encoding module outputs a binary spiking representation with a channel size of $N_c = 64$. For the spiking-related settings, all spiking neurons in the spiking convolutional layers use a reset value ($u_{reset}$) of 0 and a threshold value ($u_{th}$) of 1. Following [64, 73], normalization is applied after each convolution (Conv) operation for faster convergence. We train our models for 1000 epochs with a batch size of 2 on a Quadro RTX 8000 GPU, using the Adam optimizer [74] with a learning rate of $1 \times 10^{-4}$.
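For reference, the settings listed above can be gathered into a single configuration object. The class and field names below are hypothetical and are not taken from the authors' code; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpikingUNetConfig:
    """Hyperparameters as stated in Section 4.1 (field names are illustrative)."""
    time_bins: int = 8            # B, temporal bins of the event representation
    encoder_blocks: int = 4       # N_e
    decoder_blocks: int = 4       # N_d
    encoding_channels: int = 64   # N_c, output channels of the event encoding module
    u_reset: float = 0.0          # reset value of spiking neurons
    u_th: float = 1.0             # firing threshold
    epochs: int = 1000
    batch_size: int = 2
    learning_rate: float = 1e-4   # Adam


cfg = SpikingUNetConfig()

# Number of membrane updates per sample: once for the Single-Timestep model,
# once per temporal bin for the Multi-Timestep model.
updates_per_sample = {"single": 1, "multi": cfg.time_bins}
print(updates_per_sample)  # {'single': 1, 'multi': 8}
```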

4.2 Datasets

We evaluate our proposed models on two recent large-scale datasets for event-based shape from polarization: the ESfP-Synthetic Dataset and the ESfP-Real Dataset.

The ESfP-Synthetic Dataset was generated using the Mitsuba renderer [75], which created scenes with textured meshes illuminated by a point light source. For each scene, a polarizer lens, positioned in front of the camera, was rotated through angles ranging from 0 to 180 degrees at 15-degree intervals, producing a total of 12 polarization images. From these images, events were simulated using ESIM [76] with a 5% contrast threshold. Each scene in the dataset is therefore accompanied by rendered polarization images, simulated events, and ground-truth surface normals provided by the renderer.

The ESfP-Real Dataset is the first large-scale real-world dataset for event-based shape from polarization. It contains various scenes with different objects, textures, shapes, illuminations, and scene depths. The dataset was collected using a Prophesee Gen 4 event camera [77], a Breakthrough Photography X4 CPL linear polarizer [78], a Lucid Polarisens camera [21], and a laser point projector. Specifically, the polarizer rotated in front of the event camera, which captured the events for each scene in the dataset. The Lucid Polarisens camera was used to collect polarization images of the same scene at 4 polarization angles {0, 45, 90, 135}. The ground-truth surface normals were generated using Event-based Structured Light [79], a technique that integrates the laser point projector with the event camera.

4.3 Baselines and Performance Metrics

We evaluate our models against state-of-the-art physics-based and learning-based methods in the field of shape from polarization. Smith et al. [47] combine physics-based shape from polarization with a photometric image formation model. Their method directly estimates lighting information and calculates the surface height from a single polarization image under unknown illumination. Mahmoud et al. [80] present a physics-based method that performs shape recovery using both polarization and shading information. More recently, Muglikar et al. [22] pioneered event-based shape from polarization, employing both physics-based and learning-based approaches. Their models are notable for directly using event data as inputs. In this paper, we focus on comparing our proposed models with the learning-based model developed by Muglikar et al. We aim to demonstrate that our SNN-based models can match its performance while offering greater energy efficiency.

To evaluate the accuracy of the predicted surface normals, we employ four metrics: Mean Angular Error (MAE), % Angular Error under 11.25 degrees (AE<11.25), % Angular Error under 22.5 degrees (AE<22.5), and % Angular Error under 30 degrees (AE<30). MAE is a commonly used metric that quantifies the angular error of the predicted surface normal, where a lower value indicates better performance [16, 17]. The latter three metrics, collectively referred to as angular accuracy, assess the proportion of pixels with angular errors less than 11.25, 22.5, and 30 degrees, respectively, with higher percentages indicating better accuracy [22].
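The four metrics can be sketched in a few lines. The helper names and the three toy pixels are hypothetical; the accuracy values are returned as fractions in $[0, 1]$, and any masking of invalid pixels is left out of this sketch.

```python
import math


def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between two non-zero 3-vectors."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_gt))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def summarize(errors, thresholds=(11.25, 22.5, 30.0)):
    """Return MAE and the fraction of pixels below each angular threshold."""
    mae = sum(errors) / len(errors)
    accuracy = {t: sum(e < t for e in errors) / len(errors) for t in thresholds}
    return mae, accuracy


# Three toy pixels: exact match, perpendicular, and 45 degrees off.
gt = [0.0, 0.0, 1.0]
preds = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

errors = [angular_error_deg(p, gt) for p in preds]
mae, accuracy = summarize(errors)
```

Here the per-pixel errors are roughly 0, 90, and 45 degrees, so the MAE is about 45 degrees and only one of the three pixels falls under each threshold.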

4.4 Performance on ESfP-Synthetic

Table 1: Shape from polarization performance on the ESfP-Synthetic Dataset in terms of Mean Angular Error (MAE) and the percentage of pixels under specific angular errors (AE$<\cdot$). The “Input” column specifies whether the method utilizes events (E) or polarization images (I). E+I[0] means the CVGR-I representation. “Single” is for the Single-Timestep Spiking UNet. “Multi” is for the Multi-Timestep Spiking UNet. “Bilinear” and “Nearest” represent the bilinear upsampling and nearest neighbor upsampling, respectively. We highlight the top performance in bold, and underline the second-best results.
\begin{tabular}{lllcccc}
\hline
Method & Input & Task & MAE$\downarrow$ & AE$<$11.25$\uparrow$ & AE$<$22.5$\uparrow$ & AE$<$30$\uparrow$ \\
\hline
Mahmoud et al. [80] & I & Physics & 80.923 & 0.034 & 0.065 & 0.085 \\
Smith et al. [47] & I & Physics & 67.684 & 0.010 & 0.047 & 0.106 \\
Muglikar et al. [22] & E & Physics & 58.196 & 0.007 & 0.046 & 0.095 \\
Muglikar et al. [22] & E+I[0] & Learning & 27.953 & 0.263 & 0.527 & 0.655 \\
Single\_Bilinear & E+I[0] & Learning & 36.432 & 0.181 & 0.403 & 0.525 \\
Single\_Nearest & E+I[0] & Learning & 36.824 & 0.141 & 0.370 & 0.491 \\
Multi\_Bilinear & E+I[0] & Learning & 31.296 & 0.200 & 0.438 & 0.578 \\
Multi\_Nearest & E+I[0] & Learning & 31.724 & 0.193 & 0.425 & 0.562 \\
\hline
\end{tabular}
Table 2: Ablation study on various spiking neurons.
\begin{tabular}{lllcccc}
\hline
Method & Input & Task & MAE$\downarrow$ & AE$<$11.25$\uparrow$ & AE$<$22.5$\uparrow$ & AE$<$30$\uparrow$ \\
\hline
Multi\_Nearest\_IF & E+I[0] & Learning & 31.724 & 0.193 & 0.425 & 0.562 \\
Multi\_Nearest\_LIF & E+I[0] & Learning & 35.250 & 0.154 & 0.384 & 0.523 \\
Multi\_Nearest\_PLIF & E+I[0] & Learning & 35.086 & 0.154 & 0.393 & 0.530 \\
\hline
\end{tabular}

We thoroughly evaluate our proposed models on the ESfP-Synthetic Dataset, using both quantitative metrics and qualitative analysis. Specifically, Table 1 presents the performance of both baselines and our methods in surface normal estimation on the ESfP-Synthetic Dataset. In addition, Figure 4 showcases the qualitative results of our models and the ANN counterpart on the ESfP-Synthetic Dataset.

Table 1 shows that our proposed models significantly outperform the physics-based methods. This advantage arises because our models learn from the large-scale dataset and use spiking neurons to extract useful information for event-based shape from polarization. Despite this success, our models do not quite match the overall performance of their ANN counterpart on this dataset, likely due to the limited representation capacity of spiking neurons. However, as Fig. 4 illustrates, our Multi-Timestep Spiking UNets still achieve comparable, and in some cases superior, results in shape recovery across various objects in the test set, compared to the ANN models.

Table 1 also shows that the temporal dynamics inherent in spiking neurons enable the Multi-Timestep Spiking UNets to surpass the Single-Timestep versions in surface normal estimation. Additionally, nearest neighbor upsampling performs comparably to bilinear upsampling while preserving the binary nature of the features and compatibility with SNNs.
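Why nearest neighbor upsampling keeps a spike map binary while bilinear upsampling does not can be seen in a one-dimensional analogue. The sketch below is illustrative only: the function names are hypothetical, and the linear variant uses an align-corners style interpolation chosen for simplicity rather than any particular framework's default.

```python
def upsample_nearest_1d(row, factor=2):
    """Repeat each sample; outputs are copies of the inputs."""
    return [v for v in row for _ in range(factor)]


def upsample_linear_1d(row, factor=2):
    """Linear interpolation between neighbors (1-D analogue of bilinear)."""
    n, m = len(row), len(row) * factor
    out = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1)   # position in input coordinates
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(row[lo] * (1 - frac) + row[hi] * frac)
    return out


spike_row = [0, 1, 0]  # a binary spike map row

nearest = upsample_nearest_1d(spike_row)
linear = upsample_linear_1d(spike_row)

print(nearest)  # [0, 0, 1, 1, 0, 0]
```

Every value in `nearest` is still 0 or 1, whereas `linear` contains fractional values between the spike and its neighbors, which would no longer be valid spike inputs for the following spiking convolutional layers.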

Recognizing the effectiveness of Multi-Timestep Spiking UNets, we undertook an ablation study aimed at identifying the ideal spiking neurons to fully leverage their temporal dynamic capabilities. The results, detailed in Table 2, indicate that IF neurons offer superior performance. This is largely due to their ability to retain more extensive temporal information, as they operate without the influence of a leaky factor.
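The effect of the leak can be illustrated with a single neuron driven by two sub-threshold inputs separated in time. The leaky update used here, $u(t) = \lambda\, u(t-1)(1 - o(t-1)) + x(t)$ with a hypothetical $\lambda = 0.5$, is an assumption made for illustration and may differ from the exact LIF formulation used in the experiments; with $\lambda = 1$ it reduces to the IF dynamics of Eq. 8.

```python
def run_neuron(inputs, leak=1.0, u_th=1.0):
    """Hard-reset neuron; leak=1.0 gives IF, leak<1.0 a simple leaky variant."""
    u, o, spikes = 0.0, 0, []          # assumed initial state
    for x in inputs:
        u = leak * u * (1 - o) + x     # decay (if any), reset after spike, integrate
        o = 1 if u >= u_th else 0
        spikes.append(o)
    return spikes


# Two sub-threshold inputs separated by two silent time steps.
inputs = [0.5, 0.0, 0.0, 0.5]

if_spikes = run_neuron(inputs, leak=1.0)
leaky_spikes = run_neuron(inputs, leak=0.5)

print(if_spikes)     # [0, 0, 0, 1]
print(leaky_spikes)  # [0, 0, 0, 0]
```

The IF neuron remembers the first input until the second one arrives and fires, while the leaky neuron has forgotten most of it by then and stays silent.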

[Figure 4 image grid. MAE overlaid on each predicted normal map, listed for the eight scenes in display order; * marks the green-highlighted best result per scene.]
ANNs [22]: 22.93*, 19.21*, 19.28*, 19.06*, 18.08, 28.12, 31.63, 30.13
Single_Bilinear: 25.68, 25.23, 20.25, 31.81, 30.58, 37.82, 34.71, 46.01
Single_Nearest: 23.08, 23.63, 28.12, 30.67, 48.09, 44.53, 34.17, 45.21
Multi_Bilinear: 25.11, 29.00, 24.90, 20.98, 15.89*, 27.30, 32.36, 23.77*
Multi_Nearest: 26.52, 29.19, 25.44, 25.38, 17.80, 25.27*, 30.18*, 25.30
Figure 4: Qualitative results on the ESfP-Synthetic Dataset. Column (a) shows the scene photographs for context. Column (b) is for the counterpart ANN models. Columns (c-d) are for the Single-Timestep Spiking UNets with bilinear upsampling and nearest neighbor upsampling, respectively. Columns (e-f) are for the Multi-Timestep Spiking UNets with bilinear upsampling and nearest neighbor upsampling, respectively. Column (g) presents the ground truth normals. The MAE for the reconstructions is shown on the top left of each cell in Columns (b-f). For each scene, we highlight the best result using the green colorbox.

4.5 Performance on ESfP-Real

We also compare these methods on the ESfP-Real Dataset. Specifically, we show the quantitative performance in Table 3 and illustrate the qualitative results in Fig. 5. Similar to the results on the ESfP-Synthetic Dataset, our models demonstrate superior performance compared to physics-based methods on the real-world dataset. Moreover, as indicated by Table 3 and Fig. 5, our models not only match the overall performance of the ANN counterpart but also excel in qualitative results across diverse scenes in the test dataset. This enhanced performance on the ESfP-Real Dataset can be attributed to the sparser nature of this real-world dataset [22]. In addition, compared to the ANN counterpart, our model is more compatible with the sparse events and better maintains the sparsity to prevent overfitting on this dataset.

Mirroring the outcomes observed on the ESfP-Synthetic Dataset, results from Table 3 and Fig. 5 also show that the Multi-Timestep Spiking UNet slightly outperforms the Single-Timestep Spiking UNet. Additionally, nearest neighbor upsampling is on par with bilinear upsampling in terms of surface normal estimation performance.

Table 3: Shape from polarization performance on the ESfP-Real Dataset in terms of Mean Angular Error (MAE) and the percentage of pixels under specific angular errors (AE < $\cdot$). The "Input" column specifies whether the method utilizes events (E) or polarization images (I). E+I[0] means the CVGR-I representation. "Single" is for the Single-Timestep Spiking UNet. "Multi" is for the Multi-Timestep Spiking UNet. "Bilinear" and "Nearest" represent the bilinear upsampling and nearest neighbor upsampling, respectively. We highlight the top performance in bold, and underline the second-best results.
| Method | Input | Task | MAE ↓ | AE<11.25 ↑ | AE<22.5 ↑ | AE<30 ↑ |
|---|---|---|---|---|---|---|
| Mahmoud et al. [80] | I | Physics | 56.278 | 0.032 | 0.091 | 0.163 |
| Smith et al. [47] | I | Physics | 72.525 | 0.009 | 0.034 | 0.058 |
| Muglikar et al. [22] | E | Physics | 38.786 | 0.087 | 0.220 | 0.452 |
| Muglikar et al. [22] | E+I[0] | Learning | 26.851 | 0.099 | 0.449 | 0.691 |
| Single_Bilinear | E+I[0] | Learning | 27.134 | 0.109 | 0.458 | 0.685 |
| Single_Nearest | E+I[0] | Learning | 27.391 | 0.106 | 0.450 | 0.684 |
| Multi_Bilinear | E+I[0] | Learning | 26.886 | 0.093 | 0.439 | 0.689 |
| Multi_Nearest | E+I[0] | Learning | 26.781 | 0.089 | 0.450 | 0.688 |
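As a rough illustration of the metrics in Table 3, the following sketch computes the angular error between predicted and ground-truth normals, then the MAE and the fraction of pixels below each threshold. The function names and the toy normals are hypothetical, and the sketch assumes every listed pixel is valid; any masking applied in the actual evaluation is not modeled here.

```python
import math

def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between two 3D normal vectors (need not be unit length)."""
    dot = sum(p * g for p, g in zip(n_pred, n_gt))
    norm = math.sqrt(sum(p * p for p in n_pred)) * math.sqrt(sum(g * g for g in n_gt))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def summarize(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Return (MAE, {threshold: fraction of pixels with error below threshold})."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred, gt)]
    mae = sum(errors) / len(errors)
    fractions = {t: sum(e < t for e in errors) / len(errors) for t in thresholds}
    return mae, fractions

# Toy data (hypothetical): three pixels with errors of 0, 90, and 45 degrees.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
mae, fractions = summarize(pred, gt)
```

The AE columns in Table 3 are reported as fractions in [0, 1], which is the convention this sketch follows.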
[Figure 5 panels: input/context panels, ANN [22] predictions, Single-Timestep (bilinear, nearest) predictions, Multi-Timestep (bilinear, nearest) predictions, and ground-truth normals for eight test scenes. The per-scene MAE values overlaid on the panels are listed below; * marks the result highlighted as best for each scene.]
| Scene | ANNs [22] | Single_Bilinear | Single_Nearest | Multi_Bilinear | Multi_Nearest |
|---|---|---|---|---|---|
| chessboard2 | 28.36 | 27.77 | 27.65 | 30.27 | 26.67* |
| cylinder2 | 24.02 | 24.11 | 24.12 | 23.76 | 23.36* |
| comb1 | 28.09 | 33.17 | 38.20 | 26.79 | 26.65* |
| buddha | 27.00 | 26.84 | 27.46 | 26.72 | 26.19* |
| headphones_case | 31.50 | 28.94 | 27.88 | 27.71 | 27.13* |
| phone | 29.42 | 31.41 | 33.94 | 28.76 | 28.73* |
| raclette | 27.66 | 27.73 | 29.07 | 26.05* | 26.35 |
| steel_1 | 29.49 | 29.57 | 30.03 | 28.60* | 29.34 |
Figure 5: Qualitative results on the ESfP-Real Dataset. Column (a) shows the scene photographs for context. Column (b) is for the counterpart ANN models. Columns (c-d) are for the Single-Timestep Spiking UNets with bilinear upsampling and nearest neighbor upsampling, respectively. Columns (e-f) are for the Multi-Timestep Spiking UNets with bilinear upsampling and nearest neighbor upsampling, respectively. Column (g) presents the ground truth normals. The MAE for the reconstructions is shown on the top left of each cell in Columns (b-f). For each scene, we highlight the best result using the red colorbox.

4.6 Energy Analysis

In earlier sections, we demonstrated that our models, employing nearest neighbor upsampling, can achieve performance comparable to those using bilinear upsampling in event-based shape from polarization. To delve deeper into the advantages of these fully spiking models, we will now estimate the computational cost savings they offer compared to their fully ANN counterpart [22] on the ESfP-Real Dataset. Commonly, the number of synaptic operations serves as a benchmark for assessing the computational energy of SNN models, as referenced in studies like [81] and [82]. Moreover, we can approximate the total energy consumption of a model using principles based on CMOS technology, as outlined in [83].

Unlike ANNs, which consistently perform real-valued matrix-vector multiplication operations regardless of input sparsity, SNNs execute computations based on events, triggered only upon receiving input spikes. Therefore, we first assess the mean spiking rate of each layer $l$ in our proposed models. In particular, the mean spiking rate for layer $l$ in an SNN is calculated as follows:

$$F^{(l)} = \frac{1}{T} \sum_{t \in T} \frac{S^{(l)}_{t}}{K^{(l)}} \qquad (13)$$

where $T$ is the total time length, $S^{(l)}_{t}$ is the number of spikes of layer $l$ at time $t$, and $K^{(l)}$ is the number of neurons of layer $l$. Table 4 shows the mean spiking rates for all layers in our fully spiking models, including the Single-Timestep Spiking UNet and Multi-Timestep Spiking UNet. Note that we do not consider the components without trainable weights, such as max pooling and nearest neighbor upsampling layers. From the table, we can see that the Multi-Timestep Spiking UNet exhibits a higher spiking rate averaged over all its layers than the Single-Timestep Spiking UNet (0.2189 vs. 0.1575). This increased spiking rate aids in preserving more information, thereby enhancing the accuracy of surface normal estimation.
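Eq. (13) can be sketched in a few lines. The helper name and the spike counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the models.

```python
def mean_spiking_rate(spike_counts, num_neurons):
    """Eq. (13): average over timesteps of (spikes at time t) / (neurons in layer)."""
    total_timesteps = len(spike_counts)
    return sum(s / num_neurons for s in spike_counts) / total_timesteps

# Hypothetical layer with K = 1000 neurons observed over T = 4 timesteps.
spikes_per_step = [120, 90, 150, 140]
rate = mean_spiking_rate(spikes_per_step, num_neurons=1000)
```

For a single-timestep model the list holds one entry, so the rate reduces to the fraction of neurons that fired.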

With the mean spiking rates, we can estimate the number of synaptic operations in the SNNs. Given that $M$ is the number of neurons, $C$ is the number of synaptic connections per neuron, and $F$ is the mean spiking rate, the number of synaptic operations at each timestep in layer $l$ is calculated as $M^{(l)} \times C^{(l)} \times F^{(l)}$. Thus, the total number of synaptic operations in an SNN is calculated by:

$$\#OP = \sum_{l} M^{(l)} \times C^{(l)} \times F^{(l)} \times T. \qquad (14)$$

In contrast, the total number of synaptic operations in the ANNs is $\sum_{l} M^{(l)} \times C^{(l)}$. Due to the binary nature of spikes, SNNs perform only an accumulation (AC) per synaptic operation, while ANNs perform multiply-accumulate (MAC) computations since their operations are real-valued. Based on these, we estimate the number of synaptic operations in our proposed models and their ANN counterpart. Table 5 illustrates that, in comparison to ANNs, our models primarily perform AC operations, with only a small number of MAC operations needed to transform the real-valued event inputs into binary spiking representations. Furthermore, the Multi-Timestep Spiking UNet executes more AC operations than the Single-Timestep Spiking UNet due to its higher average spiking rate and the utilization of temporal dynamics across multiple timesteps.
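The operation counts can be illustrated with a minimal sketch of Eq. (14) and its ANN counterpart. The layer sizes, connection counts, spiking rates, and the choice of timesteps below are hypothetical placeholders, not the actual UNet configuration.

```python
def snn_synaptic_ops(layers, timesteps):
    """Eq. (14): sum over layers of M * C * F * T."""
    return sum(m * c * f * timesteps for m, c, f in layers)

def ann_synaptic_ops(layers):
    """ANN counterpart: sum over layers of M * C (every connection is used)."""
    return sum(m * c for m, c, _ in layers)

# Hypothetical layers as (neurons M, connections per neuron C, mean spiking rate F).
layers = [(1000, 9, 0.1), (500, 18, 0.2)]

ops_single = snn_synaptic_ops(layers, timesteps=1)  # one membrane update per neuron
ops_multi = snn_synaptic_ops(layers, timesteps=4)   # assumed T = 4 for illustration
ops_ann = ann_synaptic_ops(layers)
```

With these toy numbers the single-timestep count is well below the ANN count, while the multi-timestep count grows linearly with the number of timesteps.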

In general, an AC operation is considered to be significantly more energy-efficient than a MAC operation. For example, an AC is reported to be $5.1\times$ more energy-efficient than a MAC in the case of 32-bit floating-point numbers (0.9pJ vs. 4.6pJ, 45nm CMOS process) [83]. Based on this principle, we obtain the computational energy benefits of SNNs over ANNs in Table 5. From the table, we can see that the SNN models are $3.14\times$ to $28.80\times$ more energy-efficient than ANNs on the ESfP-Real Dataset.
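The energy figures can be approximately reproduced from the operation counts reported in Table 5 and the per-operation costs quoted above. This is a sketch under the assumption that total energy is simply the MAC count times 4.6pJ plus the AC count times 0.9pJ; the function and variable names are illustrative. Because it starts from the rounded table entries, the resulting ratios may differ from the reported benefits in the last digits.

```python
E_MAC_PJ = 4.6  # energy per 32-bit floating-point MAC, 45nm CMOS [83]
E_AC_PJ = 0.9   # energy per 32-bit floating-point AC, 45nm CMOS [83]

def energy_millijoules(mac_giga_ops, ac_giga_ops):
    """Energy in 1e-3 J: (1e9 ops) * (1e-12 J per op) = 1e-3 J per giga-op-pJ."""
    return mac_giga_ops * E_MAC_PJ + ac_giga_ops * E_AC_PJ

# Operation counts (in units of 1e9) taken from Table 5.
energy_ann = energy_millijoules(161.11, 0.0)
energy_single = energy_millijoules(1.21, 22.36)
energy_multi = energy_millijoules(1.21, 255.35)

benefit_single = energy_ann / energy_single
benefit_multi = energy_ann / energy_multi
```

The computed energies match the Energy row of Table 5 to within rounding, and the ratios land close to the reported benefits.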

These results are consistent with the fact that the sparse spike communication and event-driven computation underlie the efficiency advantage of SNNs and demonstrate the potential of our models on neuromorphic hardware and energy-constrained devices.

Table 4: Mean spiking rates for all layers in the Single-Timestep Spiking UNet and Multi-Timestep Spiking UNet, both utilizing nearest neighbor upsampling and being fully spiking. Layers 1 to 19 correspond to the spiking convolutional layers depicted in Fig. 2 and Fig. 3. Given that the CVGR-I inputs are real-valued, the first layer in both models does not involve spike calculation.
| Layer | Single-Timestep Spiking UNet_Nearest: spiking rate | Single-Timestep Spiking UNet_Nearest: spikes | Multi-Timestep Spiking UNet_Nearest: spiking rate | Multi-Timestep Spiking UNet_Nearest: spikes |
|---|---|---|---|---|
| Layer 1 | 0.3070 | No | 0.3070 | No |
| Layer 2 | 0.0901 | Yes | 0.2484 | Yes |
| Layer 3 | 0.1342 | Yes | 0.2304 | Yes |
| Layer 4 | 0.1057 | Yes | 0.1626 | Yes |
| Layer 5 | 0.1467 | Yes | 0.2482 | Yes |
| Layer 6 | 0.1174 | Yes | 0.1719 | Yes |
| Layer 7 | 0.1485 | Yes | 0.2733 | Yes |
| Layer 8 | 0.1153 | Yes | 0.1870 | Yes |
| Layer 9 | 0.1717 | Yes | 0.3607 | Yes |
| Layer 10 | 0.1691 | Yes | 0.2149 | Yes |
| Layer 11 | 0.1278 | Yes | 0.1991 | Yes |
| Layer 12 | 0.1513 | Yes | 0.2075 | Yes |
| Layer 13 | 0.1175 | Yes | 0.1840 | Yes |
| Layer 14 | 0.1540 | Yes | 0.1923 | Yes |
| Layer 15 | 0.1391 | Yes | 0.1810 | Yes |
| Layer 16 | 0.1937 | Yes | 0.1867 | Yes |
| Layer 17 | 0.1624 | Yes | 0.1881 | Yes |
| Layer 18 | 0.2323 | Yes | 0.2058 | Yes |
| Layer 19 | 0.2080 | Yes | 0.2099 | Yes |
| Average | 0.1575 | - | 0.2189 | - |
Table 5: Energy comparison of our models and their ANN counterpart on the ESfP-Real Dataset. The energy benefit is equal to $Energy_{ANNs}/Energy_{SNNs}$.
| | ANNs [22] | Single_Nearest | Multi_Nearest |
|---|---|---|---|
| Average Spiking Rate | - | 0.1575 | 0.2189 |
| #OP_MAC ($\times 10^{9}$) | 161.11 | 1.21 | 1.21 |
| #OP_AC ($\times 10^{9}$) | 0 | 22.36 | 255.35 |
| Energy ($10^{-3}$ J, 45nm CMOS process) | 741.11 | 25.69 | 235.38 |
| Energy Benefit ($\times$) | 1.0 | 28.80 | 3.14 |

5 Conclusion and Future Work

In this work, we explore the domain of event-based shape from polarization with SNNs. Drawing inspiration from the feed-forward UNet, we introduce the Single-Timestep Spiking UNet, which processes event-based shape from polarization as a non-temporal task, updating the membrane potential of each spiking neuron only once. This method, while not fully leveraging the temporal capabilities of SNNs, significantly cuts down on computational and energy demands. To better harness the rich temporal data in event-based information, we also propose the Multi-Timestep Spiking UNet. This model operates sequentially across multiple timesteps, enabling spiking neurons to employ their temporal recurrent neuronal dynamics for more effective data extraction. Through extensive evaluation on both synthetic and real-world datasets, our models demonstrate their ability to estimate dense surface normals from polarization events, achieving results comparable to those of state-of-the-art ANN models. Moreover, our models present enhanced energy efficiency over their ANN counterparts, underscoring their suitability for neuromorphic hardware and energy-sensitive edge devices. This research not only advances the field of spiking neural networks but also opens up new possibilities for efficient and effective event-based shape recovery in various applications.

Building on this foundation, future work could focus on several promising directions. One key area is the further optimization of SNN architectures to enhance their ability to process complex, dynamic scenes, potentially by integrating more sophisticated temporal dynamics or learning algorithms. Additionally, exploring the integration of our models with other sensory data types, like depth information, could lead to more robust and versatile systems. Moreover, adapting these models for real-time applications in various fields, from autonomous vehicles to augmented reality, presents an exciting challenge. Finally, there is significant potential in further reducing the energy consumption of these networks, making them even more suitable for deployment in low-power, edge computing scenarios. Through these explorations, we can continue to push the boundaries of what’s possible with SNNs in event-based sensing and beyond.


We are grateful to Chenghong Lin for her proofreading and advice on the paper writing.



