DOCTOR: Dynamic On-Chip Remediation Against Temporally-Drifting Thermal Variations Toward Self-Corrected Photonic Tensor Accelerators

Haotian Lu, Sanmitra Banerjee, Jiaqi Gu
Arizona State University

Abstract

Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noise and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators with moderate, static noise, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle these challenging reliability issues, for the first time, we propose a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noise. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast in-situ training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework can guarantee sustained performance under drifting variations with 34% higher accuracy and 2-3 orders-of-magnitude lower overhead compared to state-of-the-art on-chip training methods. Our code is open-sourced at link.

Index Terms:
Photonic computing, optical neural networks, thermal variation, robustness, on-chip calibration.

I Introduction

In recent years, the pursuit of efficient and high-performance solutions for artificial intelligence (AI) workloads has led to the emergence of photonic computing. Leveraging the unique properties of light, analog photonic accelerators stand out for their ability to deliver unparalleled speed and efficiency, presenting a promising avenue for AI applications [1, 2, 3, 4, 5, 6, 7, 8, 9].

However, the deployment of such accelerators encounters robustness challenges that impede their practical application [10, 11, 12, 13]. We consider one of the most sensitive accelerators, based on micro-ring resonators (MRRs), as a case study [14, 15, 9]. Due to the intrinsic temperature sensitivity of the MRR device, shown in Fig. 1(a), a subtle drift in the temperature leads to a slight change in the round-trip phase shift but a large deviation in the represented weight. Such high sensitivity makes pre-deployment optimization ineffective and thus necessitates a real-time on-chip calibration mechanism. Beyond temperature drift, dynamic random noise and crosstalk further undermine the reliability of photonic computing systems. Figure 1(b) shows the significant impacts of variations on accuracy, sometimes leading to malfunction over time when the noise intensities gradually increase.

While previous off-chip noise-aware model training methods [10, 16] have shown efficacy in enhancing the variation tolerance of optical accelerators by injecting noise during training, thus encouraging a smoother solution space, they rely on accurate noise modeling. As a result, they typically show unsatisfying robustness improvement under unknown physical variations and can only handle small, static noise [10, 11, 17, 18]. The performance drop remains unresolved when there exist temporally drifting variations. Recently, there has been a trend to resort to on-chip learning or physical training methods that directly train the optical neural network (ONN) models in situ and thus naturally incorporate real physical noise into the weight training process [19, 20, 21, 18, 22]. However, they require repeatedly performing forward and backward propagation of the entire network on a labeled training dataset to calculate the task-specific gradients for weight fine-tuning, which induces nontrivial training costs that can severely harm the system throughput and efficiency. Moreover, prior methods fail to leverage the nonuniform spatial noise distribution and weight sensitivity, shown in Fig. 1(c), to balance accuracy and efficiency. Hence, a real-time, low-cost, in-situ calibration mechanism that does not run backpropagation on any labeled training set is needed to quickly recover the accuracy and ensure continued reliability in practical deployment.

To tackle these challenges, we present the DOCTOR framework for dynamic on-chip remediation against temporally drifting thermal variations. In this paper, we delve into the detailed modeling of time-variant thermal variations and their impacts on a thermal-sensitive MRR-based photonic accelerator and resolve the variation-induced performance drop by efficient sparse weight calibration, variation-aware tile remapping, and an adaptive remediation controller.

Figure 1: (a) MRR-based photonic accelerator is sensitive to temperature drift. (b) Drifting noises cause severe accuracy drop over time, including phase variation (PV), temperature drift (TD), and thermal crosstalk (CT). "Acc" represents the accuracy. (c) Noise distributions across devices are nonuniform.

The major contributions of this paper are as follows:

  • Thermal Variation Modeling: We give rigorous modeling and sensitivity analysis of the dynamic thermal variations for multi-core photonic accelerators, providing a deeper understanding of the dynamic variations in real-world deployment.

  • Salience-Aware Sparse Calibration: We propose a training-free, data-free in-situ calibration mechanism to selectively mitigate thermal variations and effectively resume computing accuracy at negligible runtime overhead.

  • Variation-Aware Tile Remapping: We leverage the spatial nonuniformity in noise distributions to boost the reliability by optimally remapping workloads onto tensor cores, aware of the weight importance and device noise levels.

  • Our evaluation shows that the DOCTOR framework guarantees sustained deployment performance with 1%-2.5% accuracy drop at negligible runtime overhead (0.1%-5% cycle overhead), outperforming state-of-the-art on-chip training methods [21, 22] by an average of 34% higher accuracy and 2-3 orders-of-magnitude less overhead. Our work makes significant strides toward the real-world deployment of photonic accelerators in dynamic environments.

II Background

II-A Photonic Tensor Accelerators

Various photonic neural network designs have been proposed and demonstrated to encode inputs and weights to the light magnitude/phase and circuit transmission, respectively, and perform ultra-fast matrix multiplication [1, 2, 3, 4, 5]. Typically, the photonic circuits are sensitive to thermal variation as temperature impacts the refractive index of the optical component. Especially for compact microring resonator (MRR)-based photonic accelerators, the weight $w$ is encoded by the differential transmission of the add-drop MRR as $w=g(2a-1);~a=\frac{\alpha^{2}-2r\alpha\cos{\phi}+r^{2}}{1-2r\alpha\cos{\phi}+r^{2}\alpha^{2}}\in(0,1)$, where $\alpha$ is the attenuation factor, $r$ is the coupling factor, $\phi$ is the round-trip phase shift, $a$ is the through-port transmission, and $g$ is the scaling factor. To accommodate multiple channels in a single free-spectral range (FSR) while minimizing spectral crosstalk, i.e., to achieve a large finesse, the MRR is designed to have a mid-to-high quality factor. This design choice makes the MRR highly sensitive to small disturbances, highlighting the urgency in addressing challenges related to thermal variation robustness.
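To make the weight encoding above concrete, the following is a minimal Python sketch of the through-port transmission and the resulting weight. The function names and the device constants (`r=0.98`, `alpha=0.99`, `g=1.0`) are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def mrr_through_transmission(phi, r=0.98, alpha=0.99):
    """Through-port transmission a(phi) of an add-drop MRR (formula above).

    r and alpha are assumed, illustrative device constants.
    """
    c = np.cos(phi)
    num = alpha**2 - 2 * r * alpha * c + r**2
    den = 1 - 2 * r * alpha * c + (r * alpha) ** 2
    return num / den

def mrr_weight(phi, g=1.0, r=0.98, alpha=0.99):
    """Weight w = g * (2a - 1) encoded by the differential transmission."""
    return g * (2 * mrr_through_transmission(phi, r, alpha) - 1)

# Sweep the round-trip phase from on-resonance (0) to anti-resonance (pi).
phis = np.linspace(0.0, np.pi, 101)
weights = mrr_weight(phis)

# Weight change caused by a small phase perturbation near resonance,
# which illustrates the sensitivity discussed in the text.
sensitivity = mrr_weight(0.02 + 1e-3) - mrr_weight(0.02)
```

With a high-finesse ring (both `r` and `alpha` close to 1), most of the weight range is covered by a very narrow phase window around resonance, which is why a small phase error translates into a large weight error.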

Figure 2: Our proposed dynamic remediation DOCTOR can counter the accuracy degradation due to temporally drifting hardware variations.
Figure 3: Architecture settings of a multi-core photonic tensor accelerator. (a) The accelerator can map an $Rk\times Ck$ matrix-vector multiplication at each cycle. Note that we draw $R=C=3$ and $k=4$ as an example for illustration but not the actual architecture setting. (b) The photonic accelerator includes $R$ tiles, each tile including $C$ photonic tensor cores (PTCs). Each PTC is of size $k\times k$. Partial sum accumulation is performed by photocurrent accumulation across $C$ cores within one tile. The same input vector chunks are broadcast to $R$ tiles (vertically) using photonic on-chip interconnects. (c) An ideal add-drop micro-ring resonator has a tunable through-port transmission $a$ and a corresponding drop-port transmission $1-a$. (d) As a case study, each $k\times k$ PTC is assumed to be a multiple-wavelength add-drop MRR weight bank with local buffers and electronic local control units.

II-B Noise-Aware ONN Optimization

Noise-aware optimization for photonic accelerators falls into two categories, i.e., offline optimization and on-chip optimization. Prior offline methods inject noise into ONN model training to obtain a smooth solution space for better noise tolerance [10]. However, such methods require accurate noise modeling and can only handle static, known variations; they cannot adapt to dynamically drifting, unknown variations. On-chip training has recently been demonstrated to offer online adaptability and efficient in-situ accuracy recovery. The pretrained optical NNs are fine-tuned on a target dataset with on-chip gradient calculation [19, 20, 21]. However, these methods rely on accurate gradient calculation and require costly forward and backward propagation on a given training set, which can significantly degrade the edge inference throughput and efficiency and raise data privacy issues.

III Dynamic On-Chip Remediation: DOCTOR

We introduce our efficient on-chip remediation flow DOCTOR, shown in Fig. 2, with rigorous noise modeling, lightweight device calibration, and architectural remapping techniques to guarantee real-time accuracy recovery against drifting variations.

III-A Photonic Accelerator Architecture Settings

Since MRR weight banks are typically considered to be one of the most thermally sensitive designs among different kinds of ONN designs [23, 24, 25], we focus on photonic accelerator architectures based on MRR weight banks as a challenging case study to showcase the effectiveness of our DOCTOR method, shown in Fig. 3. We assume a multi-core accelerator with $R$ tiles and $C$ photonic tensor cores (PTCs) per tile. Each PTC is a $k\times k$ add-drop MRR weight bank. The partial sums are reduced in each tile. Hence, it can finish an $Rk\times Ck$ matrix-vector multiplication (MVM) at each cycle. A large $M\times N$ matrix will be partitioned and simply mapped to this accelerator using $\lceil M/(Rk)\rceil\cdot\lceil N/(Ck)\rceil$ cycles. Alongside each PTC, we assume a dedicated local buffer, an electrical local control unit, and thermal sensors for temperature monitoring, control, and processing.
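As a quick check on the cycle count above, the mapping cost can be sketched in a few lines of Python. The function name `mvm_cycles` and the example sizes are hypothetical and are not the architecture settings used in this paper.

```python
def mvm_cycles(M, N, R, C, k):
    """Cycles needed to map an M x N matrix onto R tiles of C PTCs,
    each PTC being k x k, i.e. ceil(M / (R*k)) * ceil(N / (C*k))."""
    row_chunks = -(-M // (R * k))  # integer ceiling division
    col_chunks = -(-N // (C * k))
    return row_chunks * col_chunks

# Illustrative sizes only: a 100 x 70 matrix on the R = C = 3, k = 4
# example drawn in Fig. 3 needs 9 row chunks times 6 column chunks.
example_cycles = mvm_cycles(100, 70, R=3, C=3, k=4)
```

Matrices whose dimensions are not multiples of $Rk$ or $Ck$ still occupy a full chunk in the last row or column of the partition, which is what the ceiling operations account for.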

III-B Thermal Variation Modeling

We first give a thorough noise modeling and sensitivity analysis for this photonic accelerator, including dynamic phase variations, drifting environmental temperature, and inter-device thermal crosstalk.

III-B1 Device Phase Variation

Figure 4: Random phase variations on MRRs lead to large weight errors. Different devices and cores have distinct noise distributions.

Due to control signal noise and thermal fluctuations, the round-trip phase shift of MRRs exhibits stochastic variations. In Fig. 4, we posit a zero-mean Gaussian phase variation, denoted as $\Delta\phi\sim\mathcal{N}(0,\sigma^{2})$ [26]. This introduces noisy transmission and weight values. Given that different MRR devices are subject to distinct noise distributions due to their unique control sources and local environments, our figure depicts an instance where the upper-left corner of the accelerator experiences more noise, in contrast to the lower-right corner of the chip, which exhibits less noise.

Figure 5: Illustration of temporally drifting phase noise distributions. We control the mean and std of the distribution. At every timestep, it samples a new noise std map from the scheduled distribution and smoothly evolves to a new distribution via a damping factor $\beta_{\sigma}$.

Besides the nonuniform noise distribution, the noise profiles are dynamically drifting over time, as shown in Fig. 5. We simulate such dynamics with two-level sampling. For each PTC, we use the standard deviation (std) $\sigma_{ij}$ of the phase noise distribution to represent the noise intensity for the MRR at coordinate $(i,j)$, while the intensity itself is time-varying. At time step $t$, we sample a candidate noise intensity from a scheduled distribution, i.e., $\sigma_{ij}^{t+1^{\prime}}\sim\mathcal{N}(\mu_{s}(t),\sigma_{s}^{2}(t))$. We apply an exponential moving average $\sigma_{ij}^{t+1}\leftarrow\beta_{\sigma}\sigma_{ij}^{t}+(1-\beta_{\sigma})\sigma_{ij}^{t+1^{\prime}}$ with a damping factor $\beta_{\sigma}$ to smooth out the intensity drifting. $\mu_{s}(t)$ and $\sigma_{s}(t)$ evolve over time according to a certain scheduling function.
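The two-level sampling with exponential moving average described above can be sketched as follows. This is a minimal simulation under stated assumptions: the schedules `mu_s` and `sigma_s` (a linear ramp and a constant here), the clipping of negative samples to zero, and all numeric values are illustrative choices that the text does not specify.

```python
import numpy as np

def drift_noise_std(sigma0, mu_s, sigma_s, beta, steps, rng):
    """Evolve a per-MRR noise-std map with two-level sampling and EMA.

    sigma0  : initial std map (one entry per MRR)
    mu_s    : callable t -> mean of the scheduled distribution (assumed form)
    sigma_s : callable t -> std of the scheduled distribution (assumed form)
    beta    : damping factor beta_sigma in [0, 1]
    Returns an array of shape (steps + 1, *sigma0.shape).
    """
    sigma = np.array(sigma0, dtype=float)
    history = [sigma.copy()]
    for t in range(steps):
        # Level 1: sample a candidate std map from the scheduled distribution.
        candidate = rng.normal(mu_s(t), sigma_s(t), size=sigma.shape)
        candidate = np.clip(candidate, 0.0, None)  # assumption: std is non-negative
        # Smooth the drift with an exponential moving average.
        sigma = beta * sigma + (1.0 - beta) * candidate
        history.append(sigma.copy())
    return np.stack(history)

rng = np.random.default_rng(0)
history = drift_noise_std(
    sigma0=np.full((8, 8), 0.01),
    mu_s=lambda t: 0.01 + 0.001 * t,  # assumed linear ramp of the mean intensity
    sigma_s=lambda t: 0.002,          # assumed constant spread
    beta=0.9,
    steps=20,
    rng=rng,
)
# Level 2: draw one realization of the per-MRR phase noise at the last step.
delta_phi = rng.normal(0.0, history[-1])
```

A damping factor close to 1 makes the std map change slowly between steps, while a value close to 0 lets each newly sampled map dominate.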

The non-uniform, temporally drifting spatial distribution of noise suggests a way to address this challenge: periodic workload-to-PTC remapping, i.e., mapping sensitive workloads to less noisy tensor cores.

III-B2 Environmental Temperature Drift

The environmental temperature fluctuation can impact the reliability of the chip, especially for thermal-sensitive microring resonator-based PTCs [27, 23, 24]. As shown in Fig. 1(b), for MRRs, even 0.05K temperature drift can lead to a relatively large resonant wavelength shift, leading to large errors on the represented weight value. We assume a linear dependence of on-resonance wavelength $\lambda_{c}$ on temperature $T$. Thus, we assume a constant $\partial\lambda/\partial T$. The drift on the round-trip phase shift $\delta\phi$ due to temperature change is derived as,

$$\delta\lambda=\delta T(\partial\lambda/\partial T),~~\delta n_{eff}=\delta\lambda\cdot n_{g}/\lambda,~~\delta\phi=\delta n_{eff}\cdot 2\pi\cdot L/\lambda, \tag{1}$$

where $n_{g}$ is the group index, $\lambda$ is the input wavelength (i.e., ideal on-resonance wavelength at temperature $T_{0}$), $L$ is the perimeter of the MRR, and $n_{eff}$ is the effective refractive index. For the MRR on the $c$-th column of the weight bank, its input wavelength is $\lambda_{c}$, and its corresponding perimeter is $L_{c}$. $L_{c}$ is designed to make the MRR resonate at $\lambda_{c}$ and default temperature $T_{0}$. Hence its round-trip phase shift change is $\delta\phi_{c}=\delta n_{eff}\cdot 2\pi\cdot L_{c}/\lambda_{c}$. Figure 6 illustrates a linear temperature drift scheduling from 300K to 301K and its impacts on phase error and weight error measured as normalized mean-absolute error (NMAE).
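The chain of relations in Eq. (1) can be evaluated numerically as in the sketch below. All device values (1550 nm wavelength, 5 µm ring radius, group index 4.2, and 0.08 nm/K wavelength sensitivity) and the proportional scaling of the perimeter with wavelength are illustrative assumptions, not parameters reported in this paper.

```python
import numpy as np

def phase_drift(delta_T, wavelength, perimeter, n_g, dlambda_dT):
    """Round-trip phase drift caused by a temperature change, per Eq. (1).

    delta_T    : temperature change in K
    wavelength : input (on-resonance) wavelength in m
    perimeter  : MRR perimeter L in m
    n_g        : group index
    dlambda_dT : resonance wavelength sensitivity in m/K (assumed constant)
    """
    d_lambda = delta_T * dlambda_dT                      # resonance wavelength shift
    d_neff = d_lambda * n_g / wavelength                 # effective index change
    return d_neff * 2 * np.pi * perimeter / wavelength   # phase drift in rad

# Assumed, illustrative device values.
lam0 = 1.55e-6               # nominal wavelength (m)
L0 = 2 * np.pi * 5e-6        # perimeter of a 5 um radius ring (m)
lam_c = np.array([1.548e-6, 1.550e-6, 1.552e-6, 1.554e-6])  # per-column wavelengths
L_c = L0 * lam_c / lam0      # assumption: perimeter scaled to resonate at lam_c

# Per-column phase drift for a 0.05 K temperature change.
dphi_c = phase_drift(0.05, lam_c, L_c, n_g=4.2, dlambda_dT=0.08e-9)
```

Because every relation in Eq. (1) is linear in $\delta T$, doubling the temperature drift doubles the phase drift; the nonuniform weight errors come from the nonlinear phase-to-weight mapping of the MRR.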

Figure 6: Illustration of linear uniform temperature drift with a maximum 1K change, which introduces uniform phase errors and nonuniform weight errors across $Rk\times Ck$ MRRs.

III-B3 Thermal Crosstalk

To make a compact MRR-based PTC, the spacings between adjacent rings are usually not large enough to eliminate all thermal crosstalk, which is mainly due to the thermal interference between adjacent photonic devices [28, 29, 25]. Figure 7 illustrates the crosstalk within a weight bank. Given a fixed layout spacing of MRRs, $\gamma_{ij}$ represents the crosstalk coupling coefficient from the $j$-th MRR to the $i$-th MRR. The round-trip phases $\Phi$ for $k\times k$ MRRs will be transformed with a coupling matrix $\Gamma$ [25], i.e., $\Phi_{c}=\Gamma\Phi$, as follows,

$$\begin{aligned}
\begin{pmatrix}\phi^{c}_{1}\\ \phi^{c}_{2}\\ \vdots\\ \phi^{c}_{k^{2}}\end{pmatrix}
&=\begin{pmatrix}\gamma_{11}&\gamma_{12}&\cdots&\gamma_{1k^{2}}\\ \gamma_{21}&\gamma_{22}&\cdots&\gamma_{2k^{2}}\\ \vdots&\vdots&\ddots&\vdots\\ \gamma_{k^{2}1}&\gamma_{k^{2}2}&\cdots&\gamma_{k^{2}k^{2}}\end{pmatrix}
\begin{pmatrix}\phi_{1}\\ \phi_{2}\\ \vdots\\ \phi_{k^{2}}\end{pmatrix},\\
\gamma_{ii}&=1,~~\gamma_{ij}=e^{-k_{1}d_{ij}},\\
d_{ij}&=\sqrt{\big((r_{j}-r_{i})l_{v}\big)^{2}+\big((c_{j}-c_{i})l_{h}\big)^{2}},~~r_{i/j}\in[k],~c_{i/j}\in[k],
\end{aligned} \tag{2}$$

where $\gamma_{ii}=1$ is the self-coupling coefficient, and $\gamma_{ij}=e^{-k_{1}d_{ij}}\in(0,1)$ is the cross-coupling coefficient, determined by the structure-related constant $k_{1}$ and the distance $d_{ij}$ between the $i$-th MRR at coordinate ($r_{i},c_{i}$) and the $j$-th MRR at coordinate ($r_{j},c_{j}$) [30].
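The coupling matrix in Eq. (2) can be built directly from the layout spacings, as in this minimal sketch. The row-major ordering of the $k^{2}$ MRR indices and all numeric values (spacings and `k1`) are assumptions made for illustration.

```python
import numpy as np

def coupling_matrix(k, l_v, l_h, k1):
    """Thermal crosstalk coupling matrix Gamma for a k x k MRR weight bank.

    Assumes MRR index i maps to (row, col) = divmod(i, k), i.e. row-major order.
    l_v, l_h : vertical / horizontal spacing between adjacent MRRs
    k1       : structure-related decay constant (same length unit as spacings)
    """
    rows, cols = np.divmod(np.arange(k * k), k)
    dr = (rows[None, :] - rows[:, None]) * l_v
    dc = (cols[None, :] - cols[:, None]) * l_h
    d = np.sqrt(dr**2 + dc**2)   # pairwise distances d_ij
    return np.exp(-k1 * d)       # diagonal is exp(0) = 1 (self-coupling)

# Illustrative values: spacings in micrometers, k1 in 1/micrometer.
gamma = coupling_matrix(k=4, l_v=20.0, l_h=60.0, k1=0.3)

# Apply crosstalk to a flattened k x k phase map: Phi_c = Gamma @ Phi.
phi = np.linspace(0.01, 0.05, 16)
phi_c = gamma @ phi
```

Since the off-diagonal entries decay exponentially with distance, $\Gamma$ is dominated by its diagonal and nearest-neighbor terms whenever the spacings are large relative to $1/k_{1}$.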

Figure 7: Thermal crosstalk among MRRs within the same weight bank. The MRR spacings are $l_{h}$ and $l_{v}$. The coupling matrix $\Gamma$ is applied to all phase shifts $\Phi$. The crosstalk factor $\gamma$ exponentially decays with a larger device spacing $d$.

III-C Salience-Aware Sparse Calibration

Given a pretrained NN, we map the ideal weights $W^{*}$ to the photonic accelerator. However, the actual weights $\widetilde{W}$ will have a deviation from the ideal ones. To quickly recover the inference accuracy on the fly without training or utilizing any labeled calibration datasets, we introduce salience-aware sparse calibration in Fig. 8, which is formulated as a batched block-wise regression problem,

$$\begin{aligned}
\min_{W}\mathcal{L}_{calib}&=\min_{W}\sum_{ij}\big|\mathcal{L}(\mathbb{E}[\widetilde{W}_{ij}])-\mathcal{L}(W^{*}_{ij})\big|,\\
\widetilde{W}_{ij}&=W_{ij}\big(\Gamma(\Phi+\Delta\Phi+\delta\Phi)\big)
\end{aligned} \tag{3}$$

By optimizing the latent weight $W$, we want to minimize the distance between the expected noisy weights $\mathbb{E}[\widetilde{W}]$ and the ideal weights $W^{*}$ on each $k\times k$ block. The loss degradation of the NN can be approximated by a Taylor expansion around $W^{*}$,

$$\big|\mathcal{L}(\mathbb{E}[\widetilde{W}])-\mathcal{L}(W^{*})\big|\approx\big|\nabla_{W}\mathcal{L}^{T}(\mathbb{E}[\widetilde{W}]-W^{*})+\frac{1}{2}\nabla^{2}_{W}\mathcal{L}(\mathbb{E}[\widetilde{W}]-W^{*})^{2}\big|. \tag{4}$$

As an approximation, we can rewrite the objective $\mathcal{L}_{calib}$ in Eq. (3) as

$$\min_{W}\mathcal{L}_{calib}\approx\min_{W}\sum_{ij}\|\mathbb{E}[\widetilde{W}_{ij}]-W^{*}_{ij}\|_{1}. \tag{5}$$

This fundamentally decouples the calibration procedure from the labeled dataset and task-specific loss function $\mathcal{L}$. By solving this regression problem concurrently on all matrix blocks ($\forall i,j$), we can efficiently resume the task performance. The expected noisy weights $\mathbb{E}[\widetilde{W}]$ can be probed by shining an identity matrix $I\in\mathbb{R}^{k\times k}$ through the MRR weight bank $m$ times and calculating the average,

$$\mathbb{E}[\widetilde{W}]\approx\frac{1}{m}\sum_{i=1}^{m}\widetilde{W}_{i}I=\frac{1}{m}\sum_{i=1}^{m}W\big(\Gamma(\Phi+\Delta\Phi_{i}+\delta\Phi)\big)I. \tag{6}$$
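The identity-probing estimate in Eq. (6) can be simulated as below. This is a sketch under stated assumptions: the hardware is replaced by a software model of the MRR transfer function with illustrative constants (`r=0.98`, `alpha=0.99`, `g=1.0`), and the function names are hypothetical.

```python
import numpy as np

def mrr_weight(phi, g=1.0, r=0.98, alpha=0.99):
    """Weight realized by an add-drop MRR at round-trip phase phi (assumed constants)."""
    c = np.cos(phi)
    a = (alpha**2 - 2 * r * alpha * c + r**2) / (1 - 2 * r * alpha * c + (r * alpha) ** 2)
    return g * (2 * a - 1)

def probe_expected_weight(phi, gamma, sigma, delta_phi, m, rng):
    """Estimate E[W~] for one k x k block by averaging m identity probes.

    phi       : (k, k) programmed round-trip phases
    gamma     : (k*k, k*k) thermal crosstalk coupling matrix
    sigma     : phase-noise std (scalar or (k, k) map)
    delta_phi : deterministic phase drift (scalar or (k, k) map)
    """
    k = phi.shape[0]
    identity = np.eye(k)
    total = np.zeros((k, k))
    for _ in range(m):
        noise = rng.normal(0.0, sigma, size=(k, k))        # fresh noise per probe
        phi_eff = (gamma @ (phi + noise + delta_phi).ravel()).reshape(k, k)
        total += mrr_weight(phi_eff) @ identity            # read out W~ column by column
    return total / m

rng = np.random.default_rng(0)
phi = np.linspace(0.01, 0.05, 16).reshape(4, 4)
estimate = probe_expected_weight(phi, np.eye(16), sigma=1e-3, delta_phi=2e-3, m=16, rng=rng)
```

Averaging over $m$ probes reduces the variance of the estimate from the random phase noise, while the deterministic drift and crosstalk remain in the mean and are what the calibration step has to compensate.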

Compared to on-chip training with backpropagation [22, 21], our calibration method has three major advantages:

(1) High efficiency – Our method does not require costly forward propagation or error feedback. The gradient of the calibration objective w.r.t. the latent weights, $\nabla_{W}\mathcal{L}_{calib}$, can be efficiently approximated using a straight-through estimator. With MAE as the loss, the gradients are simply

$$\nabla_{W}\mathcal{L}_{calib}\approx\nabla_{\mathbb{E}[\widetilde{W}]}\mathcal{L}_{calib}=\texttt{Sign}\Big(\frac{1}{m}\sum_{i=1}^{m}\widetilde{W}_{i}I-W^{*}\Big). \tag{7}$$

(2) Accurate gradients – Unlike on-chip training, where gradient errors will exponentially accumulate through layers, our method only calculates gradients for each block without backpropagation. Hence, our method avoids slow convergence or divergence issues even under large noise.

(3) Task-agnostic & Data-free – We do not use any labeled dataset or task-specific loss function, which avoids dataset storage costs and data privacy issues on edge devices.
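The backpropagation-free update implied by Eq. (7) can be illustrated with a small sketch. It assumes a hypothetical `measure` callable that returns the probed $\mathbb{E}[\widetilde{W}]$ for the currently programmed latent weights, and it models drift as a fixed additive offset. The learning rate and stop threshold reuse values quoted in the evaluation, while everything else is illustrative.

```python
def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)

def calibrate_block(w_target, measure, lr=2e-3, max_iters=200, mae_stop=0.0038):
    """Sign-gradient (MAE) calibration of one weight block, following Eq. (7).

    w_target: ideal weights W* as a list of rows.
    measure:  hypothetical callable mapping latent weights W to probed E[W~].
    Returns (latent weights, final MAE, iterations used).
    """
    rows, cols = len(w_target), len(w_target[0])
    w = [row[:] for row in w_target]  # latent weights start at the ideal values
    for it in range(max_iters + 1):
        w_meas = measure(w)
        err = [[w_meas[r][c] - w_target[r][c] for c in range(cols)]
               for r in range(rows)]
        mae = sum(abs(e) for row in err for e in row) / (rows * cols)
        if mae <= mae_stop or it == max_iters:
            return w, mae, it
        for r in range(rows):
            for c in range(cols):
                w[r][c] -= lr * sign(err[r][c])  # straight-through sign gradient

# Toy drift model (assumption): the chip adds a fixed offset to every weight.
drift = [[0.03, -0.02], [0.01, 0.0]]

def measure(w):
    return [[w[r][c] + drift[r][c] for c in range(2)] for r in range(2)]

w_star = [[0.5, -0.2], [0.1, 0.3]]
w_cal, final_mae, iters = calibrate_block(w_star, measure)
print(f"stopped after {iters} iterations with MAE {final_mae:.4f}")
```

Each block is updated from its own probed error only, so no error signal crosses layer boundaries.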

Figure 8: The proposed sparse calibration flow performs backpropagation-free data-free local regression.

Salience-Aware Sparsity.  To reduce the overhead of calibration, we propose salience-aware sparse calibration. At each iteration, we only calibrate a subset ($\beta\times 100$%) of weight blocks, e.g., $\beta=0.2$ means that only 20% of the blocks are calibrated. Instead of randomly selecting blocks to be calibrated, we propose to prioritize important weights based on salience scores that can be precomputed offline once. The precomputed first-order or second-order gradients are good indicators of weight importance/salience, i.e., how sensitive the task loss function is to the weight perturbation. Therefore, we generate salience scores, e.g., $s=|\nabla_{W}\mathcal{L}(\mathcal{D}_{train})|$ or $s=|\nabla^{2}_{W}\mathcal{L}(\mathcal{D}_{train})|$, of the ideal weights across the training dataset, calculate the average score for each $Rk\times Ck$ weight chunk, and use the normalized salience score as the sampling probability to perform importance sampling (IS) at each calibration iteration. This coarse-grained structured sparsity at the chunk level directly translates to a proportional reduction in calibration cycles. We set a maximum number of calibration iterations and a weight error threshold, and stop calibration as soon as either criterion is met.
The hardware overhead in terms of cycles when performing $T_{calib}$ iterations for each $Rk\times Ck$ weight block is

$$\#\texttt{Cycle}_{calib}\approx\beta T_{calib}mk. \tag{8}$$
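The chunk selection described above can be sketched as salience-proportional sampling. The text does not state whether chunks are drawn with or without replacement, so this sketch assumes sampling without replacement. The salience values and the helper name `sample_chunks` are hypothetical.

```python
import random

def sample_chunks(salience, beta, rng):
    """Pick about beta * N chunk indices with probability proportional to salience.

    Assumption: chunks are drawn without replacement, so one calibration
    iteration never probes the same chunk twice.
    """
    n_pick = max(1, round(beta * len(salience)))
    remaining = list(range(len(salience)))
    weights = [float(s) for s in salience]
    picked = []
    for _ in range(n_pick):
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        picked.append(remaining.pop(pos))
        weights.pop(pos)
    return picked

rng = random.Random(0)
# Hypothetical per-chunk salience scores (mean |gradient| per chunk).
salience = [0.9, 0.05, 0.4, 0.02, 0.7, 0.1, 0.03, 0.6, 0.08, 0.01]
counts = [0] * len(salience)
for _ in range(2000):  # 2000 calibration iterations with beta = 0.2
    for idx in sample_chunks(salience, beta=0.2, rng=rng):
        counts[idx] += 1
print(counts)
```

Unlike a fixed Top-K rule, every chunk keeps a nonzero chance of being calibrated, while high-salience chunks are visited far more often.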

III-D Variation-Aware Tile Remapping

Figure 9: The proposed variation-aware tile remapping method first probes the errors for all matrix-to-tile pairs and then solves a linear assignment problem to find the min-error reordering.

Motivated by the observation of the nonuniform noise distribution across devices and cores in Fig. 1(c), we propose architectural variation-aware tile remapping to remedy the accuracy degradation. We try to find a matrix-to-tile index mapping better than the direct mapping order, as shown in Fig. 9. Given a weight-stationary dataflow, we map an $Rk\times Ck$ weight matrix block onto the accelerator for MVMs and then move to the next matrix block. A direct mapping maps the $R\times C$ subblocks onto the $R\times C$ PTC array following their original order. This is suboptimal because weight blocks have different sensitivities and PTCs show different error levels. It is natural to remap weight blocks onto PTCs to minimize errors.

However, to avoid complicated dataflow, we cannot arbitrarily remap $RC$ weight blocks to $RC$ PTCs, which would lead to nontrivial architectural overhead. First, as imposed by the dataflow in the accelerator topology, the same input is broadcast via photonic waveguides to all cores in a column, e.g., PTC($r$,1) $\forall r\in[R]$. The partial sums from cores within one row (tile) are accumulated via photocurrent summation. Therefore, we can only remap the workloads at the granularity of tiles. Formally, we denote the indices of tiles as $\mathcal{V}=[v_{1},v_{2},\cdots,v_{R}]$ and the indices of weight chunks as $\mathcal{U}=[u_{1},u_{2},\cdots,u_{R}]$, where $W_{u_{r}}\in\mathbb{R}^{k\times Ck}$. Tile remapping is formulated as a linear assignment problem (LAP),

$$\min_{f}\sum_{u\in\mathcal{U}}\epsilon(u,f(u)),~v=f(u)\in\mathcal{V},~f:\mathcal{U}\rightarrow\mathcal{V}. \tag{9}$$

Each entry in the cost matrix $\epsilon\in\mathbb{R}^{R\times R}$ represents the edge weight in the complete bipartite graph, shown in Fig. 9. The edge weight $\epsilon_{ij}$ is an indicator of errors when mapping $W_{i}$ to tile $j$. Similar to the salience scores in the calibration, we also use the first-order Taylor expansion to calculate sensitivity-aware error terms $\epsilon_{ij}=|\nabla_{W_{i}}\mathcal{L}^{T}\cdot(\mathbb{E}[\widetilde{W}_{i}^{j}]-W_{i}^{*})|$. This LAP can be optimally solved in polynomial time. The cycle cost of remapping one $Rk\times Ck$ weight matrix block is

$$\#\texttt{Cycle}_{remap}=\#\texttt{Cycle}_{\epsilon}+\#\texttt{Cycle}_{LAP}\approx Rmk+R^{3}, \tag{10}$$

where the number of probings is $m=1$. To save cost, we solve for the optimal remapping $f$ only periodically and apply it to all subsequent inferences.
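The assignment in Eq. (9) can be sketched for a small tile count. For brevity, this illustration enumerates all $R!$ permutations, which is only practical for small $R$ such as the 4 tiles used in the evaluation. A polynomial-time LAP solver (e.g., the Hungarian algorithm) would be the natural choice in practice, consistent with the $R^{3}$ term in Eq. (10). The cost values are made up.

```python
from itertools import permutations

def first_order_cost(grad, w_meas, w_ideal):
    """eps_ij = |grad_i^T (E[W~_i^j] - W_i*)|, with all arguments as flat lists."""
    return abs(sum(g * (wm - wi) for g, wm, wi in zip(grad, w_meas, w_ideal)))

def remap_tiles(cost):
    """Solve the R x R assignment by enumeration (illustrative, small R only).

    cost[i][j] is the error indicator for running weight chunk i on tile j.
    Returns (assignment, total), where assignment[i] is the tile for chunk i.
    """
    r = len(cost)
    best = min(permutations(range(r)),
               key=lambda p: sum(cost[i][p[i]] for i in range(r)))
    return list(best), sum(cost[i][best[i]] for i in range(r))

# Hypothetical 4 x 4 cost matrix: the direct mapping (the diagonal) is poor
# for chunks 0 and 1.
cost = [
    [0.9, 0.1, 0.5, 0.6],
    [0.2, 0.8, 0.7, 0.4],
    [0.6, 0.5, 0.1, 0.9],
    [0.7, 0.6, 0.8, 0.2],
]
direct_total = sum(cost[i][i] for i in range(4))
assignment, remap_total = remap_tiles(cost)
print(assignment, round(direct_total, 3), round(remap_total, 3))
```

In this toy case the remapping swaps chunks 0 and 1 across tiles and leaves the others in place.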

III-E Adaptive Remediation Controller

To dynamically determine when to trigger our remediation flow, we introduce an adaptive controller that periodically monitors cheap but informative statistics, i.e., the temperature of each PTC, and determines whether to trigger remediation based on a threshold, e.g., when the average chip temperature drift since the last remediation is above 0.01 K, i.e., $\frac{1}{RC}\sum T_{r,c}-T_{prev}>0.01$. If not, we perform a more expensive probing of the normalized MAE (NMAE), $\|\widetilde{W}-W^{*}\|_{1}/\|W^{*}\|_{1}$, with a slight cycle overhead. If the NMAE is above 5%, we trigger remediation. To avoid overly frequent remediation that keeps interrupting the online inference stream, we set a cooling time $\tau$ for our remediation procedure to control the maximum acceptable overhead. This can be reconfigured by users based on their preference between accuracy and inference throughput. For example, with 10K total inferences, if each remediation is as expensive as 10 inferences, a cooling time of 200 inferences leads to a maximum overhead of $\frac{10K\times 10}{200\times 10K}=5\%$.
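The controller logic above can be summarized in a short sketch. It follows the text literally: a cheap temperature check first, then an NMAE probe only if the temperature check does not fire, both gated by the cooling time. The function names and the `nmae_probe` callable are hypothetical, and the thresholds are the example values from the text.

```python
def should_remediate(temps, t_prev, nmae_probe, since_last,
                     cooling=200, temp_thresh=0.01, nmae_thresh=0.05):
    """Decide whether to trigger remediation at the current inference.

    temps:      per-PTC temperatures in K (flat list of R * C values)
    t_prev:     average temperature recorded at the last remediation
    nmae_probe: hypothetical callable returning ||W~ - W*||_1 / ||W*||_1;
                it is invoked only when the cheap temperature check is silent
    since_last: number of inferences since the last remediation
    """
    if since_last < cooling:  # still inside the cooling window
        return False
    drift = sum(temps) / len(temps) - t_prev  # signed drift, as written in the text
    if drift > temp_thresh:
        return True
    return nmae_probe() > nmae_thresh

def max_overhead(remediation_cost, cooling):
    """Worst-case overhead ratio when remediation fires once per cooling window.

    remediation_cost is expressed in units of inferences.
    """
    return remediation_cost / cooling

print(should_remediate([300.02] * 16, 300.0, lambda: 0.0, since_last=250))
print(f"{max_overhead(10, 200):.0%}")
```

The overhead bound does not depend on the total number of inferences, which cancels in the ratio given in the text.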

IV Evaluation Results

IV-A Simulation Setup

Dataset and Models.  We evaluate our method on a three-layer CNN (C64K3-C64K3-C64K3-Pool5-FC10) on Fashion-MNIST [31], VGG-8 [32] on CIFAR-10 [33], and ResNet-18 [34] on CIFAR-100 [33] for image classification.

Training Settings.  We pre-train all models for 100 epochs with an Adam optimizer with a 2E-3 learning rate, a cosine decay scheduler, 1E-4 weight decay, and data augmentation (random crop and flip). BatchNorm layers are all frozen after pretraining.

Architecture Settings.  As a challenging case study, the hardware platform in this work is assumed to be a multi-tile, multi-core photonic tensor accelerator based on a thermally sensitive MRR weight bank, shown in Fig. 3. Note that our method is not specific to MRR weight banks but can generalize to all universal optical matrix-vector multiplication units. We assume that the photonic accelerator has 4 tiles, and each tile has 4 cores. Each core is an $8\times 8$ add-drop MRR weight bank that can perform one $8\times 8$ matrix-vector multiplication per cycle. A detailed architecture description is given in Section III-A. The MRR device modeling is based on Section II-A. The device/circuit variation modeling is based on Section III-B.

Benchmark Settings.  To cover different thermal variation scenarios, we create several synthetic noise configurations as benchmarks in Table I.

TABLE I: Benchmark settings for different noise scenarios

| Scenario | Description |
|---|---|
| PV.1 | Low Noise & Distribution: Edge-to-Corner |
| PV.2 | High Noise & Distribution: Edge-to-Corner |
| TD.1 | Temp Drift: Linear Increase & Uniform |
| TD.2 | Temp Drift: Cosine Fluctuation & Uniform |
| TD.3 | Temp Drift: Linear Increase & Corner Hotspot |
| TD.4 | Temp Drift: Cosine Fluctuation & Corner Hotspot |
| CT | Thermal Crosstalk among all MRRs within each core |

(1) Phase Variation (PV). To simulate real-world chip noise scenarios, two cases are created: "Low Noise" and "High Noise," corresponding to low and high standard deviation (std.) in phase shift, modeled as $\Delta\phi\sim\mathcal{N}(0,\sigma^{2})$. As explained in Section III-B, the noise std. $\sigma_{ij}^{t+1'}$ for each MRR device is time-variant and sampled at each time step from a noise level map $\mathcal{N}(\mu_{s}(t),\sigma_{s}^{2}(t))$. The noise level distribution gradually changes its high-noise region from the chip's left edge to the top-left corner ("Edge-to-Corner") to capture dynamic noise profile drifting. Specific noise level functions are chosen for the Low and High Noise cases:

  • Low Noise: $\mu_{s}(t)=0.0025t,\ \sigma_{s}(t)=0.004t+0.002$

  • High Noise: $\mu_{s}(t)=0.01t,\ \sigma_{s}(t)=0.005t+0.005$

The damping factor $\beta_{\sigma}$ is set to 0.9. The phase error intensities adopt the typical values in the literature [10, 26].

(2) Temperature Drift (TD): Two cases are designed to represent different types of temperature change with $t_{max}=20{,}000$:

  • Linear: $T(t)=300K+t/t_{max}$

  • Cosine: $T(t)=300.25K-0.25K\cos(10t/t_{max})$

The temperature drifts adopt typical values in the literature [23, 24]. We also consider two representative spatial distributions of temperature.

  • Uniform: All PTCs have the same temperature drift following the Linear/Cosine scheduling.

  • Corner Hotspot: The upper-left corner of the chip experiences a high temperature drift, exponentially decreasing with distance, i.e., $T(t)\leftarrow e^{-\sqrt{r^{2}+c^{2}}}(T(t)-T(0))+T(0)$, which represents a local hotspot scenario during the execution of the accelerator.
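The Linear and Cosine schedules and the corner-hotspot scaling above can be written directly as functions. In this sketch, the indexing convention (tile row `r` and core column `c` counted from 0 at the upper-left corner) is an assumption, since the text does not state where the indices start.

```python
import math

T0 = 300.0      # initial temperature in K
T_MAX = 20_000  # number of time steps in the benchmark

def temp_linear(t):
    """Linear drift: rises by 1 K over the full run."""
    return T0 + t / T_MAX

def temp_cosine(t):
    """Cosine fluctuation between 300 K and 300.5 K."""
    return 300.25 - 0.25 * math.cos(10 * t / T_MAX)

def corner_hotspot(temp_t, r, c, temp_0=T0):
    """Scale the drift of PTC (r, c) by exp(-distance to the upper-left corner).

    Assumption: (r, c) = (0, 0) is the upper-left PTC.
    """
    return math.exp(-math.sqrt(r * r + c * c)) * (temp_t - temp_0) + temp_0

# Per-PTC temperatures of a 4 x 4 array at the end of the linear schedule.
grid = [[corner_hotspot(temp_linear(T_MAX), r, c) for c in range(4)]
        for r in range(4)]
for row in grid:
    print(" ".join(f"{v:.3f}" for v in row))
```

Under this convention the corner PTC sees the full 1 K drift, while the opposite corner sees only about 0.014 K.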

(3) Thermal Crosstalk (CT): We use a default MRR spacing of $l_{v}=200\,\mu m$ and $l_{h}=60\,\mu m$ and a crosstalk coefficient $k_{1}=0.1$, and model crosstalk only among MRRs within the same PTC [28, 25, 30].

Evaluation Metrics.  We mainly evaluate different methods on the above benchmarks in terms of inference accuracy and the cycle overhead consumed by remediation. All matrix multiplication operations in the forward propagation of NN inference are mapped to our photonic accelerator and counted as the basic inference cycle count. If an on-chip remediation method is applied, any additional computation is converted to an equivalent cycle overhead. For DOCTOR, the cycle overhead of calibration and remapping for each $Rk\times Ck$ matrix block is counted as Eq. (8) and Eq. (10), respectively. The overhead is summed over all matrix blocks in the ONN. Cycle overhead is reported as an efficiency metric for each remediation method.
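As a worked example of this accounting, the sketch below evaluates Eq. (8) and Eq. (10) for a single layer. The way a weight matrix is tiled into $Rk\times Ck$ blocks (ceiling division in both dimensions) and the 64 x 576 layer shape are assumptions chosen for illustration. They are not taken from the paper's simulator.

```python
import math

def calib_cycles(beta, t_calib, m, k):
    """Eq. (8): calibration cycles per Rk x Ck block."""
    return beta * t_calib * m * k

def remap_cycles(r, m, k):
    """Eq. (10): error probing (R * m * k) plus LAP solving (R^3) per block."""
    return r * m * k + r ** 3

def num_blocks(rows, cols, r, c, k):
    """Assumed tiling: a rows x cols matrix is cut into Rk x Ck blocks."""
    return math.ceil(rows / (r * k)) * math.ceil(cols / (c * k))

# 4 tiles x 4 cores of 8 x 8 weight banks; beta = 0.2, T_calib = 20, m = 1.
R, C, K = 4, 4, 8
blocks = num_blocks(64, 576, R, C, K)  # hypothetical 64 x 576 layer
per_block = calib_cycles(0.2, 20, 1, K) + remap_cycles(R, 1, K)
total = blocks * per_block
print(blocks, per_block, total)
```

With these settings a block costs 32 calibration cycles and 96 remapping cycles, so the hypothetical layer's 36 blocks add up to 4608 cycles for one full remediation pass.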

IV-B Ablation Study

IV-B1 Calibration: Sparsity and Salience Scores

Figure 10: Evaluate salience, sampling methods, and sparsity levels when calibrating VGG-8 CIFAR10. IS means importance sampling.
Figure 11: Visualization of in-situ calibration to quickly resume accuracy at low overhead (only interrupt 5 inferences) under time-variant thermal noises on VGG8 CIFAR10. We adopt MAE calibration loss, learning rate of 2e-3, sparsity of 1, and averaging times $m$=1.

To determine the best hyperparameters in sparse calibration, we first evaluate how many matrix probings $m$ are needed for Eq. (6). In Table II, we found that $m=1$ can efficiently estimate $\mathbb{E}[\widetilde{W}]$ and calibrate the circuits with low cost and high accuracy.

TABLE II: Comparison of VGG8 accuracy and calibration cycles on CIFAR-10 with different weight probing times $m$ in our proposed sparse calibration. The stop MAE threshold is 0.0038.

| $m$ | 1 | 2 | 3 | 4 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|---|
| cycles | 3.07M | 1.53M | 2.12M | 2.82M | 3.53M | 6.75M | 10.12M | 13.50M |
| MAE loss | 0.00388 | 0.00379 | 0.00382 | 0.00367 | 0.00361 | 0.00366 | 0.00360 | 0.00360 |
| acc | 90.67% | 90.05% | 89.17% | 89.36% | 89.33% | 88.56% | 88.25% | 88.32% |

Figure 10 compares different salience scores, sampling methods, and sparsity levels. First-order Taylor expansion with importance sampling achieves the best balance between cycle count and accuracy. Top-K is not satisfactory, as it is fixed to blocks with large ideal gradients. With sparsity $\beta=0.2$, the accuracy can be quickly restored to above 90% with negligible cycle overhead (equivalent to one single-image inference).

Figure 11 visualizes the effectiveness of our salience-aware sparsity calibration. With thermal variation, the accuracy drops from 90.94% to 57%. Calibration can quickly resume accuracy above 90% with only 10 iterations. The overhead is very cheap, equivalent to interrupting merely 5 single-batch inferences. With 0.2 sparsity, the overhead can be further reduced to only one inference.

IV-B2 Remapping: Error Estimation and Interval

Figure 12: Using the first-order error as the sensitivity-aware $\Sigma$ and performing remapping every 4000 inferences lead to 5.4% higher accuracy with only 0.17% total cycle overhead.

Figure 12 evaluates different methods for estimating the error $\epsilon_{ij}$, including mean absolute errors (MAE) and first-order and second-order Taylor expansions, as indicated in Eq. (4). As the noise distribution evolves, the ideal pre-trained model with direct mapping suffers from a large accuracy drop. In contrast, our variation-aware remapping can significantly reduce the numerical errors and thus boost inference accuracy by 5-10%. The first-order Taylor expansion of the loss function shows clear advantages over the naive matrix MAE scores and is cheaper than the second-order expansion with the same accuracy benefit. The solved remapping can be reused for following inferences until the next remapping is triggered. We scan over different remapping intervals from once per 1K inferences up to once per 10K inferences. We found that 2K-4K inference intervals are enough to guarantee maximum accuracy benefits (+5.4%) and negligible cycle overhead (0.17%).

IV-C Main Results: Compare with Prior Work

TABLE III: Compare the inference accuracy and cycle overhead (Ovhd) among 4 methods on 8 different variation settings and 3 datasets/models. Our remediation method only induces a small overhead of 0.14%-5% of the original inference cycles, which is 2-3 orders-of-magnitude less costly than on-chip training (BS3).

FMNIST-CNN3 (pre-trained acc: 92.78%, inference cycles: 5.81E8)

| Noise configs | BS1 acc | BS1 ovhd | BS2 [10] acc | BS2 [10] ovhd | BS3 [21, 22] acc | BS3 [21, 22] ovhd | DOCTOR acc | DOCTOR ovhd |
|---|---|---|---|---|---|---|---|---|
| CT+PV.1+TD.1 | 50.41 | 0 | 45.56 | 0 | 36.72 | 3.5E9 | 92.42 | 8.4E5 |
| CT+PV.1+TD.2 | 42.85 | 0 | 37.08 | 0 | 77.74 | 3.5E9 | 92.59 | 7.9E5 |
| CT+PV.1+TD.3 | 78.44 | 0 | 81.49 | 0 | 80.50 | 3.5E9 | 92.20 | 7.6E5 |
| CT+PV.1+TD.4 | 74.11 | 0 | 75.09 | 0 | 90.37 | 3.5E9 | 92.19 | 7.5E5 |
| CT+PV.2+TD.1 | 49.43 | 0 | 45.39 | 0 | 27.49 | 3.5E9 | 91.42 | 8.4E5 |
| CT+PV.2+TD.2 | 41.81 | 0 | 37.00 | 0 | 74.64 | 3.5E9 | 91.07 | 8.2E5 |
| CT+PV.2+TD.3 | 77.94 | 0 | 80.55 | 0 | 78.14 | 3.5E9 | 91.12 | 8.4E5 |
| CT+PV.2+TD.4 | 73.34 | 0 | 74.20 | 0 | 88.54 | 3.5E9 | 91.11 | 8.4E5 |
| Avg. Acc / Ovhd Ratio | 61.04 | 0.00% | 59.55 | 0.00% | 69.27 | 600% | 91.77 | 0.14% |

CIFAR10 VGG8 (pre-trained acc: 90.94%, inference cycles: 6.66E8)

| Noise configs | BS1 acc | BS1 ovhd | BS2 [10] acc | BS2 [10] ovhd | BS3 [21, 22] acc | BS3 [21, 22] ovhd | DOCTOR acc | DOCTOR ovhd |
|---|---|---|---|---|---|---|---|---|
| CT+PV.1+TD.1 | 32.38 | 0 | 29.93 | 0 | 19.23 | 3.3E9 | 90.53 | 2.6E7 |
| CT+PV.1+TD.2 | 28.54 | 0 | 26.86 | 0 | 53.71 | 3.3E9 | 90.24 | 2.4E7 |
| CT+PV.1+TD.3 | 71.87 | 0 | 67.12 | 0 | 73.11 | 3.3E9 | 90.25 | 2.4E7 |
| CT+PV.1+TD.4 | 65.82 | 0 | 60.89 | 0 | 86.47 | 3.3E9 | 90.27 | 2.6E7 |
| CT+PV.2+TD.1 | 32.52 | 0 | 30.00 | 0 | 19.17 | 3.3E9 | 90.25 | 2.6E7 |
| CT+PV.2+TD.2 | 28.98 | 0 | 27.02 | 0 | 51.96 | 3.3E9 | 89.62 | 2.4E7 |
| CT+PV.2+TD.3 | 72.03 | 0 | 67.26 | 0 | 71.68 | 3.3E9 | 89.85 | 2.4E7 |
| CT+PV.2+TD.4 | 65.80 | 0 | 60.87 | 0 | 85.83 | 3.3E9 | 89.75 | 2.3E7 |
| Avg. Acc / Ovhd Ratio | 49.74 | 0.00% | 46.24 | 0.00% | 57.65 | 500% | 90.10 | 3.65% |

CIFAR100 ResNet18 (pre-trained acc: 73.57%, inference cycles: 5.43E9)

| Noise configs | BS1 acc | BS1 ovhd | BS2 [10] acc | BS2 [10] ovhd | BS3 [21, 22] acc | BS3 [21, 22] ovhd | DOCTOR acc | DOCTOR ovhd |
|---|---|---|---|---|---|---|---|---|
| CT+PV.1+TD.1 | 5.23 | 0 | 5.32 | 0 | 6.20 | 2.7E10 | 72.15 | 2.9E8 |
| CT+PV.1+TD.2 | 7.25 | 0 | 7.66 | 0 | 24.03 | 2.7E10 | 71.33 | 2.8E8 |
| CT+PV.1+TD.3 | 21.69 | 0 | 23.06 | 0 | 39.00 | 2.7E10 | 70.20 | 2.7E8 |
| CT+PV.1+TD.4 | 18.16 | 0 | 18.82 | 0 | 46.56 | 2.7E10 | 70.01 | 2.6E8 |
| CT+PV.2+TD.1 | 5.14 | 0 | 5.36 | 0 | 4.96 | 2.7E10 | 72.28 | 2.9E8 |
| CT+PV.2+TD.2 | 7.33 | 0 | 7.65 | 0 | 24.90 | 2.7E10 | 71.68 | 2.8E8 |
| CT+PV.2+TD.3 | 21.97 | 0 | 23.61 | 0 | 1.03 | 2.7E10 | 70.47 | 2.7E8 |
| CT+PV.2+TD.4 | 18.43 | 0 | 19.21 | 0 | 46.27 | 2.7E10 | 70.40 | 2.6E8 |
| Avg. Acc / Ovhd Ratio | 13.15 | 0.00% | 13.84 | 0.00% | 24.12 | 500% | 71.07 | 5.08% |

In Table III, we compare DOCTOR with three baselines: (1) BS1: deploy pre-trained models; (2) BS2 [10]: noise-aware training with (2%) weight error injected during pretraining; and (3) BS3 [21, 22]: on-chip training for 1 epoch on a calibration dataset (10% of training set). The parameters for our DOCTOR framework are: (1) Calibration: probing times $m=1$, salience score $s=|\nabla_{W}\mathcal{L}|$, calibration sparsity $\beta=0.2$, max calibration iteration $T_{calib}=20$ (50 for ResNet18); (2) Remapping: first-order Taylor expansion as $\epsilon$; (3) Controller: remediation cooling time $\tau$: 200 inferences (50 for ResNet18); trigger remapping $\rightarrow$ calibration.

Across all benchmarks, BS1 suffers from severe accuracy degradation (30%-60%). Noise-aware training [10], though it enhances the smoothness of the solution space of the model, shows limited effect on accuracy improvement, as the noise distribution used in pre-training is significantly different from the physical variations. Note that our DOCTOR is orthogonal to noise-aware training, allowing simultaneous application for a solution space that is both locally smooth and adaptive to the drifting noise distribution. On-chip training [21, 22] can boost the accuracy on small networks, e.g., 8% on CNN-FMNIST and VGG8-CIFAR10. However, it performs poorly on deep ResNet18 and is not stable on certain benchmarks, e.g., CT+PV.2+TD.1. The fundamental reason is that the gradient estimation error exponentially accumulates with backpropagation, which can lead to poor training performance and even divergence. Also, though it is only trained for 1 epoch on 10% of the training set, it consumes 5-6$\times$ more runtime (cycles) in training than inference itself, which makes it impractical to deploy on throughput-restricted edge platforms. In contrast, our proposed DOCTOR method can stably maintain high accuracy even with rapidly drifting noise distributions and temperature, with only a 1-2.5% accuracy drop at the cost of merely 0.1%-5.1% cycle overhead. On average, DOCTOR is 34% more accurate and 2-3 orders-of-magnitude more efficient than on-chip training. We visualize our DOCTOR flow in Fig. 13 to show that the temperature drift is detected and the accuracy is rapidly restored with only 3.6% cycle overhead.
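The headline comparison can be cross-checked from the "Avg." rows of Table III. The short calculation below uses only numbers reported in that table. Averaging the three per-benchmark means with equal weight is an assumption about how the 34% figure was obtained.

```python
from statistics import mean

# "Avg. Acc" rows of Table III (FMNIST-CNN3, CIFAR10-VGG8, CIFAR100-ResNet18).
doctor_acc = [91.77, 90.10, 71.07]
bs3_acc = [69.27, 57.65, 24.12]
acc_gap = mean(doctor_acc) - mean(bs3_acc)

# "Ovhd Ratio" rows of Table III, in percent of the inference cycles.
bs3_ovhd = [600.0, 500.0, 500.0]
doctor_ovhd = [0.14, 3.65, 5.08]
ovhd_ratio = [b / d for b, d in zip(bs3_ovhd, doctor_ovhd)]

print(f"accuracy gap: {acc_gap:.2f} points")
print("overhead ratio (BS3 / DOCTOR):", [round(x) for x in ovhd_ratio])
```

Under this equal-weight assumption the gap is about 34 accuracy points, and the overhead ratios range from roughly 98x on ResNet18 to roughly 4300x on CNN3.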

Figure 13: Visualize DOCTOR for dynamic accuracy recovery.
Figure 14: Impact of device spacings on crosstalk-induced accuracy drop and calibration effectiveness. We adopt 200 calibration iterations for sufficient convergence.

IV-D Discussion

Device Spacing and Crosstalk.  We evaluate how MRR spacing impacts the crosstalk and thus the maximum accuracy in Fig. 14. By default, we assume a 200 $\mu m$ vertical spacing $l_{v}$ and a 60 $\mu m$ horizontal spacing $l_{h}$ in the MRR array, and we scale both directions by the same factor. Below a scaling factor of 0.4, the crosstalk severely impacts the representable weight space. Thus, even after calibration, the accuracy cannot be recovered. Within 0.4 to 1.1 scaling, the accuracy drop can be fully countered by our calibration, while above 1.1, there is almost no drop from crosstalk.

Trade off Efficiency and Accuracy.  The cooling time $\tau$ in our adaptive remediation controller trades off cycle overhead and accuracy. In Table IV, we sweep the cooling interval from 200 to 1000 inferences. If accuracy is prioritized, we can adopt $\tau=200$ with only 3.7% cycle overhead. If inference throughput is prioritized, e.g., only <1% overhead is accepted, we can set $\tau=800$ with a 2.7% accuracy drop.

TABLE IV: Cycle overhead and test accuracy on VGG8-CIFAR10 with different remediation cooling time $\tau$.

| $\tau$ | 200 | 300 | 400 | 500 | 800 | 1000 |
|---|---|---|---|---|---|---|
| #cycles (overhead) | 6.90E8 (+3.7%) | 6.78E8 (+1.9%) | 6.78E8 (+1.9%) | 6.74E8 (+1.3%) | 6.72E8 (+1.0%) | 6.71E8 (+0.8%) |
| acc | 90.24 | 89.97 | 89.97 | 89.15 | 88.28 | 86.55 |

V Conclusion

In this work, we present the first on-chip remediation approach that dynamically monitors photonic accelerator temperature drift and ensures continued reliability with minimal overhead through training-free, data-free calibration and architectural remapping. Our method outperforms SoTA on-chip training with 34% higher accuracy and 2-3 orders-of-magnitude lower cost. Our lightweight, effective in-situ remediation method enables self-corrected photonic neural accelerators with unprecedented reliability in real-world, dynamic deployment scenarios.

References

  • [1] Y. Shen, N. C. Harris, S. Skirlo et al., “Deep learning with coherent nanophotonic circuits,” Nature Photonics, 2017.
  • [2] Q. Cheng, J. Kwon, M. Glick, M. Bahadori, L. P. Carloni, and K. Bergman, “Silicon Photonics Codesign for Deep Learning,” Proceedings of the IEEE, 2020.
  • [3] B. J. Shastri, A. N. Tait et al., “Photonics for Artificial Intelligence and Neuromorphic Computing,” Nature Photonics, 2021.
  • [4] C. Feng, J. Gu, H. Zhu, Z. Ying, Z. Zhao et al., “A compact butterfly-style silicon photonic–electronic neural chip for hardware-efficient deep learning,” ACS Photonics, vol. 9, no. 12, pp. 3906–3916, 2022.
  • [5] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, “Holylight: A nanophotonic accelerator for deep learning in data centers,” in Proc. DATE, 2019.
  • [6] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. T. Chen, and D. Z. Pan, “M3ICRO: Machine learning-enabled compact photonic tensor core based on programmable multi-operand multimode interference,” APL Machine Learning, vol. 2, no. 1, p. 016106, Mar. 2024.
  • [7] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature, 2021.
  • [8] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. L. Gallo, X. Fu, A. Lukashchuk, A. Raja, J. Liu, D. Wright, A. Sebastian, T. Kippenberg, W. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature, 2021.
  • [9] C. Huang, S. Bilodeau, T. Ferreira de Lima et al., “Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits,” APL Photonics, vol. 5, no. 4, p. 040803, 2020.
  • [10] J. Gu, Z. Zhao, C. Feng, H. Zhu, R. T. Chen, and D. Z. Pan, “ROQ: A noise-aware quantization scheme towards robust optical neural networks with low-bit controls,” in Proc. DATE, 2020.
  • [11] Z. Zhao, J. Gu, Z. Ying et al., “Design technology for scalable and robust photonic integrated circuits,” in Proc. ICCAD, 2019.
  • [12] Y. Zhu, G. L. Zhang, B. Li et al., “Countering Variations and Thermal Effects for Accurate Optical Neural Networks,” in Proc. ICCAD, 2020.
  • [13] A. Mirza, F. Sunny et al., “Silicon photonic microring resonators: A comprehensive design-space exploration and optimization under fabrication-process variations,” IEEE TCAD, vol. 41, no. 10, pp. 3359–3372, 2022.
  • [14] A. N. Tait, T. F. de Lima, E. Zhou et al., “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep., 2017.
  • [15] D. Liu, Z. Zhao, Z. Wang, Z. Ying, R. T. Chen, and D. Z. Pan, “Operon: Optical-electrical power-efficient route synthesis for on-chip signals,” in Proc. DAC, 2018.
  • [16] J. Gu, C. Feng, Z. Zhao, Z. Ying, M. Liu, R. T. Chen, and D. Z. Pan, “SqueezeLight: Towards Scalable Optical Neural Networks with Multi-Operand Ring Resonators,” in Proc. DATE, 2021.
  • [17] M. Kirtas, N. Passalis, G. Mourgias-Alexandris, G. Dabos, N. Pleros, and A. Tefas, “Robust architecture-agnostic and noise resilient training of photonic deep learning models,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.
  • [18] L. G. Wright, T. Onodera et al., “Deep physical neural networks trained with backpropagation,” Nature, vol. 601, no. 7894, pp. 549–555, Jan. 2022.
  • [19] J. Gu, Z. Zhao, C. Feng, W. Li, R. T. Chen, and D. Z. Pan, “FLOPS: Efficient On-Chip Learning for Optical Neural Networks Through Stochastic Zeroth-Order Optimization,” in Proc. DAC, 2020.
  • [20] J. Gu, C. Feng, Z. Zhao, Z. Ying, R. T. Chen, and D. Z. Pan, “Efficient on-chip learning for optical neural networks through power-aware sparse zeroth-order optimization,” in Proc. AAAI, 2021.
  • [21] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. T. Chen, and D. Z. Pan, “L2ight: Enabling On-Chip Learning for Optical Neural Networks via Efficient in-situ Subspace Optimization,” in Proc. NeurIPS, 2021.
  • [22] S. Pai, Z. Sun, T. W. Hughes et al., “Experimentally realized in situ backpropagation for deep learning in photonic neural networks,” Science, vol. 380, no. 6643, pp. 398–404, Apr. 2023.
  • [23] Y. Ye, J. Xu, X. Wu, W. Zhang, X. Wang, M. Nikdast, Z. Wang, and W. Liu, “System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip,” IEEE Trans. VLSI Syst., vol. 21, no. 2, pp. 292–305, Feb. 2013.
  • [24] K. Padmaraju and K. Bergman, “Resolving the thermal challenges for silicon microring resonator devices,” Nanophotonics, vol. 3, no. 4-5, pp. 269–281, Aug. 2014.
  • [25] M. Milanizadeh, D. Aguiar, A. Melloni, and F. Morichetti, “Canceling Thermal Cross-Talk Effects in Photonic Integrated Circuits,” J. Lightwave Technol., vol. 37, no. 4, pp. 1325–1332, Feb. 2019.
  • [26] M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi, and M. R. DeWeese, “Design of optical neural networks with component imprecisions,” Opt. Express, vol. 27, no. 10, p. 14009, May 2019.
  • [27] F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, “Crosslight: A cross-layer optimized silicon photonic neural network accelerator,” in Proc. DAC, 2021.
  • [28] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski, and S. Shekhar, “Crosstalk in SOI microring resonator-based filters,” Journal of Lightwave Technology, vol. 34, no. 12, pp. 2886–2896, 2016.
  • [29] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, “Silicon microring resonators,” Laser & Photon. Rev., vol. 6, no. 1, pp. 47–73, Jan. 2012.
  • [30] A. Cem, D. Sanchez-Jacome, D. Pérez-López, and F. Da Ros, “Thermal crosstalk modeling and compensation for programmable photonic processors,” in IEEE Photonics Conference, 2023.
  • [31] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms,” arXiv, 2017.
  • [32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
  • [33] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
Haotian Lu is currently a research intern in the ScopeX group, School of Electrical, Computer and Energy Engineering at Arizona State University, advised by Prof. Jiaqi Gu. His research interests mainly include hardware-algorithm co-design, electronic design automation, efficient hardware accelerators, and hardware security.
Sanmitra Banerjee is a Senior Design-for-X (DFX) Methodology Engineer at NVIDIA Corporation, Santa Clara, CA, and an Adjunct Faculty member at Arizona State University. He received the B.Tech. degree from Indian Institute of Technology, Kharagpur, in 2018, and the M.S. and Ph.D. degrees from Duke University, Durham, NC, in 2021 and 2022, respectively. His research interests include machine-learning-based DFX techniques, and fault modeling and optimization of emerging AI accelerators under process variations and manufacturing defects.
Jiaqi Gu (S’19 - M’23) received the B.E. degree in Microelectronic Science and Engineering from Fudan University, Shanghai, China in 2018, and the Ph.D. degree from the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA in 2023. He is currently an Assistant Professor in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ, USA. His current research interests include emerging hardware design for efficient computing (photonics, post-CMOS electronics, quantum), hardware-algorithm co-design, AI/ML algorithms, and electronic-photonic design automation. He has received the Best Paper Award at IEEE TCAD 2021, the Best Paper Award at ASP-DAC 2020, the Best Paper Finalist at DAC 2020, the Best Poster Award at NSF Workshop on Machine Learning Hardware (2020), the ACM/SIGDA Student Research Competition First Place (2020), and the ACM Student Research Competition Grand Finals First Place (2021).