DOCTOR: Dynamic On-Chip Remediation Against Temporally-Drifting Thermal Variations Toward Self-Corrected Photonic Tensor Accelerators

Haotian Lu, Sanmitra Banerjee, Jiaqi Gu,
Arizona State University
[email protected]

Abstract

Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noise and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators with moderate, static noise, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle these challenging reliability issues, for the first time, we propose a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noise. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast, in-situ, training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework can guarantee sustained performance under drifting variations with 34% higher accuracy and 2-3 orders-of-magnitude lower overhead compared to state-of-the-art on-chip training methods. Our code is open-sourced at link.

Index Terms:

Photonic computing, optical neural networks, thermal variation, robustness, on-chip calibration.

I Introduction

In recent years, the pursuit of efficient and high-performance solutions for artificial intelligence (AI) workloads has led to the emergence of photonic computing. Leveraging the unique properties of light, analog photonic accelerators stand out for their ability to deliver unparalleled speed and efficiency, presenting a promising avenue for AI applications [1, 2, 3, 4, 5, 6, 7, 8, 9].

However, the deployment of such accelerators encounters robustness challenges that impede their practical application [10, 11, 12, 13]. We consider one of the most sensitive accelerators based on micro-ring resonators (MRRs) as a case study [14, 15, 9]. Due to the intrinsic temperature sensitivity of the MRR device, shown in Fig. 1(a), a subtle drift in the temperature will lead to a slight change of the round-trip phase shift but a large deviation on the represented weight. Such a high sensitivity makes pre-deployment optimization ineffective and thus necessitates a real-time calibration mechanism on chip. Besides temperature drift, various dynamic random noise and crosstalk cast even more shadows on the reliability of photonic computing systems. Figure 1(b) shows the significant impacts of variations on accuracy, sometimes leading to malfunction over time when the noise intensities gradually increase.

While previous off-chip noise-aware model training methods [10, 16] have shown efficacy in enhancing the variation tolerance of optical accelerators by injecting noise during training to encourage a smoother solution space, they rely on accurate noise modeling. As a result, they typically show unsatisfactory robustness improvement under unknown physical variations and can only handle small, static noise [10, 11, 17, 18]. The performance drop remains unresolved in the presence of temporally drifting variations. Recently, there has been a trend toward on-chip learning or physical training methods that directly train optical neural network (ONN) models in situ and naturally incorporate real physical noise into the weight training process [19, 20, 21, 18, 22]. However, they require repeatedly performing forward and backward propagation of the entire network on a labeled training dataset to calculate the task-specific gradients for weight fine-tuning, which induces nontrivial training costs that can severely harm the system throughput and efficiency. Moreover, prior methods fail to leverage the nonuniform spatial noise distribution and weight sensitivity, shown in Fig. 1(c), to balance accuracy and efficiency. Hence, a real-time, low-cost, in-situ calibration mechanism that runs no backpropagation on any labeled training set is needed to quickly recover the accuracy and ensure continued reliability in practical deployment.

To tackle these challenges, we present a DOCTOR framework for dynamic on-chip remediation against temporally drifting thermal variations. In this paper, we delve into the detailed modeling of time-variant thermal variations and their impacts on a thermal-sensitive MRR-based photonic accelerator and resolve the variation-induced performance drop by efficient sparse weight calibration, variation-aware tile remapping, and an adaptive remediation controller.

The major contributions of this paper are as follows:

• Thermal Variation Modeling: We give rigorous modeling and sensitivity analysis of the dynamic thermal variations for multi-core photonic accelerators, providing a deeper understanding of the dynamic variations in real-world deployment.

• Salience-Aware Sparse Calibration: We propose a training-free, data-free in-situ calibration mechanism to selectively mitigate thermal variations and effectively resume computing accuracy at negligible runtime overhead.

• Variation-Aware Tile Remapping: We leverage the spatial nonuniformity in noise distributions to boost the reliability by optimally remapping workloads onto tensor cores, aware of the weight importance and device noise levels.

• We show that our DOCTOR framework guarantees sustained deployment performance with 1%-2.5% accuracy drop at negligible runtime overhead (0.1%-5% cycle overhead), outperforming state-of-the-art on-chip training methods [21, 22] by an average of 34% higher accuracy and 2-3 orders-of-magnitude less overhead. Our work makes significant strides toward the real-world deployment of photonic accelerators in dynamic environments.

II Background

II-A Photonic Tensor Accelerators

Various photonic neural network designs have been proposed and demonstrated to encode inputs and weights to the light magnitude/phase and circuit transmission, respectively, and perform ultra-fast matrix multiplication [1, 2, 3, 4, 5]. Typically, the photonic circuits are sensitive to thermal variation as temperature impacts the refractive index of the optical component. Especially for compact microring resonator (MRR)-based photonic accelerators, the weight $w$ is encoded by the differential transmission of the add-drop MRR as $w=g(2a-1);~a=\frac{\alpha^{2}-2r\alpha\cos{\phi}+r^{2}}{1-2r\alpha\cos{\phi}+r^{2}\alpha^{2}}\in(0,1)$, where $\alpha$ is the attenuation factor, $r$ is the coupling factor, $\phi$ is the round-trip phase shift, $a$ is the through-port transmission, and $g$ is the scaling factor. To accommodate multiple channels in a single free-spectral range (FSR) while minimizing spectral crosstalk, i.e., large finesse, the MRR is designed to have a mid-to-high quality factor. This design choice makes the MRR highly sensitive to small disturbances, highlighting the urgency in addressing challenges related to thermal variation robustness.
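To make the sensitivity concrete, the following minimal sketch evaluates the weight encoding above for a small phase perturbation. The values `alpha=0.98`, `r=0.99`, and `g=1.0` are illustrative assumptions, not parameters reported in this paper.

```python
import math

def mrr_transmission(phi, alpha=0.98, r=0.99):
    """Through-port transmission a(phi), following the expression in the text.

    alpha (attenuation) and r (coupling) are hypothetical example values.
    """
    num = alpha**2 - 2 * r * alpha * math.cos(phi) + r**2
    den = 1 - 2 * r * alpha * math.cos(phi) + (r * alpha) ** 2
    return num / den

def mrr_weight(phi, g=1.0, alpha=0.98, r=0.99):
    """Weight w = g * (2a - 1) encoded by the differential transmission."""
    return g * (2 * mrr_transmission(phi, alpha, r) - 1)

if __name__ == "__main__":
    # Compare the weight on resonance with the weight after a 0.02 rad drift.
    for phi in (0.0, 0.02, math.pi):
        print(f"phi={phi:.3f} rad -> w={mrr_weight(phi):+.4f}")
```

Under these assumed parameters, a phase drift of a few hundredths of a radian near resonance moves the weight by a large fraction of its full range, which is the behavior the paragraph above describes.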

II-B Noise-Aware ONN Optimization

Noise-aware optimization for photonic accelerators includes two categories, i.e., offline optimization and on-chip optimization. Prior offline methods inject noise into ONN model training to obtain smooth solution space for better noise tolerance [10]. However, such a method requires accurate noise modeling and can only handle static, known variations, which cannot be adapted to dynamically drifting, unknown variations. On-chip training, as a trend, has been demonstrated for online adaptability and efficiency in in-situ accuracy recovery. The pretrained optical NNs are fine-tuned on a target dataset with on-chip gradient calculation [19, 20, 21]. However, they rely on accurate gradient calculation and require costly forward and backward propagation on a given training set, which can significantly degrade the edge inference throughput and efficiency and bear data privacy issues.

III Dynamic On-Chip Remediation: DOCTOR

We introduce our efficient on-chip remediation flow DOCTOR, shown in Fig. 2, with rigorous noise modeling, lightweight device calibration, and architectural remapping techniques to guarantee real-time accuracy recovery against drifting variations.

III-A Photonic Accelerator Architecture Settings

Since MRR weight banks are typically considered to be one of the most thermally sensitive designs among different kinds of ONN designs [23, 24, 25], we focus on photonic accelerator architectures based on MRR weight banks as a challenging case study to showcase the effectiveness of our DOCTOR method, shown in Fig. 3. We assume a multi-core accelerator with $R$ tiles and $C$ photonic tensor cores (PTCs) per tile. Each PTC is a $k\times k$ add-drop MRR weight bank. The partial sums are reduced in each tile. Hence, the accelerator can finish an $Rk\times Ck$ matrix-vector multiplication (MVM) at each cycle. A large $M\times N$ matrix will be partitioned and simply mapped to this accelerator using $\lceil M/(Rk)\rceil\cdot\lceil N/(Ck)\rceil$ cycles. Alongside each PTC, we assume a dedicated local buffer, an electrical local control unit, and thermal sensors for temperature monitoring, control, and processing.
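As a quick illustration of the partitioning rule above, the sketch below counts the cycles needed to map an $M\times N$ matrix onto the $R\times C$ array of $k\times k$ PTCs. The function name `mvm_cycles` and the example matrix shape are hypothetical.

```python
def mvm_cycles(M, N, R, C, k):
    """Cycles to map an M x N matrix: ceil(M / (R*k)) * ceil(N / (C*k))."""
    rows_per_cycle = R * k
    cols_per_cycle = C * k
    # Integer ceiling division avoids floating-point rounding.
    row_chunks = -(-M // rows_per_cycle)
    col_chunks = -(-N // cols_per_cycle)
    return row_chunks * col_chunks

if __name__ == "__main__":
    # Example: a 64 x 576 matrix on 4 tiles x 4 cores of 8 x 8 MRRs.
    print(mvm_cycles(64, 576, R=4, C=4, k=8))
```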

III-B Thermal Variation Modeling

We first give a thorough noise modeling and sensitivity analysis for this photonic accelerator, including dynamic phase variations, drifting environmental temperature, and inter-device thermal crosstalk.

III-B1 Device Phase Variation

Due to control signal noise and thermal fluctuations, the round-trip phase shift of MRRs exhibits stochastic variations. In Fig. 4, we posit a zero-mean Gaussian phase variation, denoted as $\Delta\phi\sim\mathcal{N}(0,\sigma^{2})$ [26]. This introduces noisy transmission and weight values. Given that different MRR devices are subject to distinct noise distributions due to their unique control sources and local environments, our figure depicts an instance where the upper-left corner of the accelerator experiences more noise, in contrast to the lower-right corner of the chip, which exhibits less noise.

Besides the nonuniform noise distribution, the noise profiles drift dynamically over time, as shown in Fig. 5. We simulate such dynamics with two-level sampling. For each PTC, we use the standard deviation (std) $\sigma_{ij}$ of the phase noise distribution to represent the noise intensity for the MRR at coordinate $ij$, while the intensity itself is time-varying. At time step $t$, we sample the noise intensity from a distribution, i.e., $\sigma_{ij}^{t+1^{\prime}}\sim\mathcal{N}(\mu_{s}(t),\sigma_{s}^{2}(t))$. We apply an exponential moving average $\sigma_{ij}^{t+1}\leftarrow\beta_{\sigma}\sigma_{ij}^{t}+(1-\beta_{\sigma})\sigma_{ij}^{t+1^{\prime}}$ with a damping factor $\beta_{\sigma}$ to smooth out the intensity drift. $\mu_{s}(t)$ and $\sigma_{s}(t)$ evolve over time according to a scheduling function.
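The two-level sampling can be sketched as follows. This is a minimal illustration rather than the authors' simulator: the helper names are hypothetical, and taking the absolute value of the sampled intensity (so that the std stays non-negative) is an assumption not stated in the text.

```python
import random

def drift_noise_intensity(sigma, t, mu_s, sigma_s, beta=0.9, rng=random):
    """Level 1: sample a new per-MRR intensity and smooth it with an EMA.

    sigma   : 2D list of current noise std values (one per MRR)
    mu_s    : callable t -> mean of the intensity distribution
    sigma_s : callable t -> std of the intensity distribution
    beta    : damping factor beta_sigma
    """
    updated = []
    for row in sigma:
        new_row = []
        for s in row:
            # abs() keeps the std non-negative (an assumption of this sketch).
            sampled = abs(rng.gauss(mu_s(t), sigma_s(t)))
            new_row.append(beta * s + (1 - beta) * sampled)
        updated.append(new_row)
    return updated

def sample_phase_noise(sigma, rng=random):
    """Level 2: draw zero-mean Gaussian phase errors with the current std."""
    return [[rng.gauss(0.0, s) for s in row] for row in sigma]

if __name__ == "__main__":
    rng = random.Random(0)
    sigma = [[0.0] * 8 for _ in range(8)]
    for t in range(5):
        sigma = drift_noise_intensity(
            sigma, t, lambda t: 0.0025 * t, lambda t: 0.004 * t + 0.002, rng=rng)
    print(sample_phase_noise(sigma, rng)[0][:3])
```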

The non-uniform, temporally drifting spatial distribution of noise provides valuable insights into addressing this challenge through periodic workload-to-PTC remapping, which implies intuitively mapping sensitive workloads to less noisy tensor cores.

III-B2 Environmental Temperature Drift

The environmental temperature fluctuation can impact the reliability of the chip, especially for thermal-sensitive microring resonator-based PTCs [27, 23, 24]. As shown in Fig. 1(b), for MRRs, even 0.05K temperature drift can lead to a relatively large resonant wavelength shift, leading to large errors on the represented weight value. We assume a linear dependence of on-resonance wavelength $\lambda_{c}$ on temperature $T$ . Thus, we assume a constant $\partial\lambda/\partial T$ . The drift on the round-trip phase shift $\delta\phi$ due to temperature change is derived as,

$\displaystyle\delta\lambda=\delta T(\partial\lambda/\partial T),~~\delta n_{eff}=\delta\lambda\cdot n_{g}/\lambda,~~\delta\phi=\delta n_{eff}\cdot 2\pi\cdot L/\lambda,$	(1)

where $n_{g}$ is the group index, $\lambda$ is the input wavelength (i.e., ideal on-resonance wavelength at temperature $T_{0}$ ), $L$ is the perimeter of the MRR, $n_{eff}$ is the effective refractive index. For the MRR on the $c$ -th column of the weight bank, its input wavelength is $\lambda_{c}$ , and its corresponding perimeter is $L_{c}$ . $L_{c}$ is designed to make the MRR resonate at $\lambda_{c}$ and default temperature $T_{0}$ . Hence its round-trip phase shift change is $\delta\phi_{c}=\delta n_{eff}\cdot 2\pi\cdot L_{c}/\lambda_{c}$ . Figure 6 illustrates a linear temperature drift scheduling from 300K to 301K and its impacts on phase error and weight error measured as normalized mean-absolute error (NMAE).
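The chain of relations in Eq. (1) can be evaluated directly. In the sketch below, the wavelength, ring radius, group index, and $\partial\lambda/\partial T$ are illustrative values chosen for the example and are not taken from this paper.

```python
import math

def phase_drift(delta_T, wavelength, perimeter, n_g, dlambda_dT):
    """Round-trip phase drift from a temperature change, following Eq. (1).

    All lengths are in meters; dlambda_dT is in meters per kelvin.
    """
    delta_lambda = delta_T * dlambda_dT             # resonance shift
    delta_n_eff = delta_lambda * n_g / wavelength   # effective index change
    return delta_n_eff * 2 * math.pi * perimeter / wavelength

if __name__ == "__main__":
    # Hypothetical device: 1550 nm input, 10 um ring radius, n_g = 4.2,
    # and a resonance shift of 0.08 nm/K.
    wavelength = 1550e-9
    perimeter = 2 * math.pi * 10e-6
    for dT in (0.05, 0.5, 1.0):
        dphi = phase_drift(dT, wavelength, perimeter, n_g=4.2, dlambda_dT=0.08e-9)
        print(f"dT={dT:.2f} K -> dphi={dphi:.5f} rad")
```

Because every step in Eq. (1) is linear, the phase drift scales linearly with the temperature change under the constant $\partial\lambda/\partial T$ assumption.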

III-B3 Thermal Crosstalk

To make a compact MRR-based PTC, the spacings between adjacent rings are usually not far enough to eliminate all thermal crosstalk, which is mainly due to the thermal interference between adjacent photonic devices [28, 29, 25]. Figure 7 illustrates the crosstalk within a weight bank. Given a fixed layout spacing of MRRs, $\gamma_{ij}$ represents the crosstalk coupling coefficient from the $j$-th MRR to the $i$-th MRR. The round-trip phases $\Phi$ for $k\times k$ MRRs will be transformed with a coupling matrix $\Gamma$ [25], i.e., $\Phi_{c}=\Gamma\Phi$, as follows,

$\displaystyle\begin{pmatrix}\phi^{c}_{1}\\ \phi^{c}_{2}\\ \vdots\\ \phi^{c}_{k^{2}}\end{pmatrix}=\begin{pmatrix}\gamma_{11}&\gamma_{12}&\cdots&\gamma_{1k^{2}}\\ \gamma_{21}&\gamma_{22}&\cdots&\gamma_{2k^{2}}\\ \vdots&\vdots&\ddots&\vdots\\ \gamma_{k^{2}1}&\gamma_{k^{2}2}&\cdots&\gamma_{k^{2}k^{2}}\end{pmatrix}\begin{pmatrix}\phi_{1}\\ \phi_{2}\\ \vdots\\ \phi_{k^{2}}\end{pmatrix},$	(2)
$\displaystyle\gamma_{ii}=1,~~\gamma_{ij}=e^{-k_{1}d_{ij}},$
$\displaystyle d_{ij}=\sqrt{\big((r_{j}-r_{i})l_{v}\big)^{2}+\big((c_{j}-c_{i})l_{h}\big)^{2}},~~r_{i/j}\in[k],~c_{i/j}\in[k],$

where $\gamma_{ii}=1$ is the self-coupling coefficient, and $\gamma_{ij}=e^{-k_{1}d_{ij}}\in(0,1)$ is the cross-coupling coefficient, determined by the structure-related constant $k_{1}$ and the distance $d_{ij}$ between the $i$ -th MRR at coordinate ( $r_{i},c_{i}$ ) and $j$ -th MRR at coordinate ( $r_{j},c_{j}$ )[30].
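A minimal sketch of how the coupling matrix in Eq. (2) could be built and applied is given below. The row-major ordering of MRR indices and the interpretation of $k_{1}$ in units of inverse micrometers are assumptions of this example.

```python
import math

def coupling_matrix(k, l_v, l_h, k1):
    """Build the k^2 x k^2 crosstalk matrix Gamma from Eq. (2).

    MRR index i maps to (row, col) = divmod(i, k); this row-major ordering
    is an assumption of the sketch.
    """
    n = k * k
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r_i, c_i = divmod(i, k)
        for j in range(n):
            r_j, c_j = divmod(j, k)
            d_ij = math.hypot((r_j - r_i) * l_v, (c_j - c_i) * l_h)
            gamma[i][j] = 1.0 if i == j else math.exp(-k1 * d_ij)
    return gamma

def apply_crosstalk(gamma, phi):
    """Phi_c = Gamma * Phi for a flattened phase vector phi."""
    return [sum(g * p for g, p in zip(row, phi)) for row in gamma]

if __name__ == "__main__":
    # Spacings in micrometers; k1 assumed to be in 1/um.
    gamma = coupling_matrix(k=2, l_v=200.0, l_h=60.0, k1=0.1)
    print(apply_crosstalk(gamma, [0.1, 0.2, 0.3, 0.4]))
```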

III-C Salience-Aware Sparse Calibration

Given a pretrained NN, we map the ideal weights $W^{*}$ to the photonic accelerator. However, the actual weights $\widetilde{W}$ will have a deviation from the ideal ones. To quickly recover the inference accuracy on the fly without training or utilizing any labeled calibration datasets, we introduce salience-aware sparse calibration in Fig. 8, which is formulated as a batched block-wise regression problem,

$\displaystyle\min_{W}\mathcal{L}_{calib}=\min_{W}\sum_{ij}\big\|\mathcal{L}(\mathbb{E}[\widetilde{W}_{ij}])-\mathcal{L}(W^{*}_{ij})\big\|,~~\widetilde{W}_{ij}=W_{ij}\big(\Gamma(\Phi+\Delta\Phi+\delta\Phi)\big).$	(3)

By optimizing the latent weight $W$ , we want to minimize the distance between the expected noisy weights $\mathbb{E}[\widetilde{W}]$ and the ideal weights $W^{*}$ on each $k\times k$ block. The loss degradation of the NN can be approximated by Taylor expansion

$\displaystyle|\mathcal{L}(\mathbb{E}[\widetilde{W}])-\mathcal{L}(W^{*})|\approx\big|\nabla_{W}\mathcal{L}^{T}(\mathbb{E}[\widetilde{W}]-W^{*})+\frac{1}{2}\nabla^{2}_{W}\mathcal{L}(\mathbb{E}[\widetilde{W}]-W^{*})^{2}\big|.$	(4)

As an approximation, we can rewrite the objective $\mathcal{L}_{calib}$ in Eq. (3)

$\displaystyle\min_{W}\mathcal{L}_{calib}\approx\min_{W}\sum_{ij}\|\mathbb{E}[\widetilde{W}_{ij}]-W^{*}_{ij}\|_{1}.$	(5)

This fundamentally decouples the calibration procedure from the labeled dataset and task-specific loss function $\mathcal{L}$ . By solving this regression problem concurrently on all matrix blocks ( $\forall i,j$ ), we can efficiently resume the task performance. The expected noisy weights $\mathbb{E}[\widetilde{W}]$ can be probed by shining an identity matrix $I\in\mathbb{R}^{k\times k}$ through the MRR weight bank $m$ times and calculating the average,

$\displaystyle\mathbb{E}[\widetilde{W}]\approx\frac{1}{m}\sum_{i=1}^{m}\widetilde{W}_{i}I=\frac{1}{m}\sum_{i=1}^{m}W\big(\Gamma(\Phi+\Delta\Phi_{i}+\delta\Phi)\big)I.$	(6)
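The probing step in Eq. (6) can be illustrated with a toy simulation. Here `noisy_bank_output` is a hypothetical stand-in for one optical pass through the weight bank; it adds Gaussian noise directly to the weights, which simplifies the phase-domain noise model used in this paper.

```python
import random

def noisy_bank_output(W, X, sigma, rng):
    """Hypothetical single pass through a noisy weight bank: (W + noise) @ X."""
    k = len(W)
    W_noisy = [[W[i][j] + rng.gauss(0.0, sigma) for j in range(k)] for i in range(k)]
    return [[sum(W_noisy[i][p] * X[p][j] for p in range(k)) for j in range(len(X[0]))]
            for i in range(k)]

def probe_expected_weights(W, m, sigma, rng):
    """Estimate E[W~] by shining the identity matrix m times and averaging."""
    k = len(W)
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    acc = [[0.0] * k for _ in range(k)]
    for _ in range(m):
        out = noisy_bank_output(W, identity, sigma, rng)
        for i in range(k):
            for j in range(k):
                acc[i][j] += out[i][j] / m
    return acc

if __name__ == "__main__":
    W = [[0.5, -0.3], [0.2, 0.8]]
    print(probe_expected_weights(W, m=4, sigma=0.05, rng=random.Random(0)))
```

Since the input is the identity, each pass reads out the realized weight matrix itself, so averaging over $m$ passes directly estimates $\mathbb{E}[\widetilde{W}]$.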

Compared to on-chip training with backpropagation [22, 21], our calibration method has three major advantages:

(1) High efficiency – Our method does not require costly forward propagation or error feedback. The gradient of the calibration objective w.r.t. the latent weights $\nabla_{W}\mathcal{L}_{calib}$ can be efficiently approximated using a straight-through estimator. With MAE as the loss, the gradients are simply

$\displaystyle\nabla_{W}\mathcal{L}_{calib}\approx\nabla_{\mathbb{E}[\widetilde{W}]}\mathcal{L}_{calib}=\texttt{Sign}\Big(\frac{1}{m}\sum_{i=1}^{m}\widetilde{W}_{i}I-W^{*}\Big).$	(7)

(2) Accurate gradients – Unlike on-chip training, where gradient errors will exponentially accumulate through layers, our method only calculates gradients for each block without backpropagation. Hence, our method avoids slow convergence or divergence issues even under large noise.

(3) Task-agnostic & Data-free – We do not use any labeled dataset or task-specific loss function, which avoids dataset storage costs and data privacy issues on edge devices.
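The block-wise update implied by Eq. (7) can be sketched as a sign-gradient loop. This is an illustrative sketch only: the fixed step size `lr`, the stopping rule, and the `probe` callback (which returns the measured weights for a given latent weight block) are assumptions, since the optimizer details are not specified here.

```python
def sign(x):
    """Sign function returning -1, 0, or +1."""
    return (x > 0) - (x < 0)

def mae(A, B):
    """Mean absolute error between two equally sized 2D lists."""
    n = len(A) * len(A[0])
    return sum(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / n

def calibrate_block(W_star, probe, lr=0.01, max_iters=100, tol=0.005):
    """Adjust latent weights W until the probed weights match W_star.

    probe(W) -> measured weights (e.g., an averaged identity probing).
    The gradient is Sign(measured - W_star); no labels or backpropagation.
    """
    W = [row[:] for row in W_star]  # start from the ideal weights
    for _ in range(max_iters):
        W_meas = probe(W)
        if mae(W_meas, W_star) <= tol:
            break
        W = [[w - lr * sign(wm - ws) for w, wm, ws in zip(rw, rm, rs)]
             for rw, rm, rs in zip(W, W_meas, W_star)]
    return W

if __name__ == "__main__":
    # Toy drift: every programmed weight reads back 0.1 too high.
    drifted = lambda W: [[w + 0.1 for w in row] for row in W]
    W_star = [[0.5, -0.3], [0.2, 0.8]]
    W = calibrate_block(W_star, drifted)
    print(mae(drifted(W), W_star))
```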

Salience-Aware Sparsity. To reduce the overhead of calibration, we propose salience-aware sparse calibration. At each iteration, we only calibrate a subset ($\beta\times 100$%) of weight blocks, e.g., $\beta=0.2$ means that only 20% of the blocks are calibrated (referred to as 0.2 sparsity). Instead of randomly selecting blocks to be calibrated, we propose to prioritize important weights based on salience scores that can be precomputed offline once. The precomputed first-order or second-order gradients are good indicators of weight importance/salience, i.e., how sensitive the task loss function is to the weight perturbation. Therefore, we generate salience scores, e.g., $s=|\nabla_{W}\mathcal{L}(\mathcal{D}_{train})|$ or $s=|\nabla^{2}_{W}\mathcal{L}(\mathcal{D}_{train})|$, of ideal weights across the training dataset, calculate the average scores for each $Rk\times Ck$ weight chunk, and use the normalized salience score as the sampling probability to perform importance sampling (IS) at each calibration iteration. This coarse-grained structured sparsity at the chunk level can directly translate to a proportional reduction in calibration cycles. We set a maximum number of calibration iterations and a weight error threshold to adaptively stop calibration, whichever stop criterion is met first. The hardware overhead in terms of cycles when performing $T_{calib}$ iterations for each $Rk\times Ck$ weight block is,

$\displaystyle\#\texttt{Cycle}_{calib}\approx\beta T_{calib}mk.$	(8)
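The importance-sampling step and the cycle estimate in Eq. (8) can be sketched as follows. Sampling without replacement and rounding $\beta\cdot n$ to the nearest integer are assumptions of this example, and the salience values are made up for illustration.

```python
import random

def sample_chunks(salience, beta, rng):
    """Pick about beta * n distinct chunks, with probability proportional to salience.

    Sampling without replacement is an assumption of this sketch.
    """
    n = len(salience)
    n_pick = max(1, round(beta * n))
    remaining = list(range(n))
    weights = [salience[i] for i in remaining]
    picked = []
    for _ in range(n_pick):
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        picked.append(remaining.pop(pos))
        weights.pop(pos)
    return sorted(picked)

def calib_cycles(beta, T_calib, m, k):
    """Cycle overhead per Rk x Ck weight block, following Eq. (8)."""
    return beta * T_calib * m * k

if __name__ == "__main__":
    rng = random.Random(0)
    salience = [0.1, 0.4, 0.2, 0.3, 0.05, 0.6, 0.15, 0.25, 0.35, 0.5]
    print(sample_chunks(salience, beta=0.2, rng=rng))
    print(calib_cycles(beta=0.2, T_calib=20, m=1, k=8))
```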

III-D Variation-Aware Tile Remapping

Motivated by the important observation of the nonuniform noise distribution across devices and cores in Fig. 1(c), we propose architectural variation-aware tile remapping to remedy the accuracy degradation. We try to find a matrix-to-tile index mapping better than the direct mapping ordering, shown in Fig. 9. Given a weight-stationary dataflow, we map a $Rk\times Ck$ weight matrix block onto the accelerator for MVMs and move to the next matrix block. A direct mapping will map $R\times C$ subblocks onto the $R\times C$ PTC arrays following their original order. This is suboptimal because weight blocks have different sensitivities, and PTCs show different error levels. It is natural to remap weight blocks onto PTCs to minimize errors.

However, to avoid complicated dataflow, we cannot arbitrarily remap $RC$ weight blocks to $RC$ PTCs, which will lead to nontrivial architectural overhead. First, as imposed by the dataflow in the accelerator topology, the same input will be broadcast via photonic waveguides to all cores in a column, e.g., PTC( $r$ ,1) $\forall r\in[R]$ . The partial sum from cores within one row (tile) will be accumulated via photocurrent summation. Therefore, we can only remap the workloads in the granularity of tiles. Formally, we denote the indices of tiles as $\mathcal{V}=[v_{1},v_{2},\cdots,v_{R}]$ and the indices of weight chunks as $\mathcal{U}=[u_{1},u_{2},\cdots,u_{R}]$ , where $W_{u_{r}}\in\mathbb{R}^{k\times Ck}$ . Tile remapping is formulated as a linear assignment problem (LAP),

$\displaystyle\min_{f}\sum_{u\in\mathcal{U}}\epsilon(u,f(u)),~v=f(u)\in\mathcal{V},~f:\mathcal{U}\rightarrow\mathcal{V}.$	(9)

Each entry in the cost matrix $\epsilon\in\mathbb{R}^{R\times R}$ represents the edge weight in the complete bipartite graph, shown in Fig. 9. The edge weight $\epsilon_{ij}$ is an indicator of errors when mapping $W_{i}$ to tile $j$. Similar to the salience scores in the calibration, we also use the first-order Taylor expansion to calculate sensitivity-aware error terms $\epsilon_{ij}=|\nabla_{W_{i}}\mathcal{L}^{T}\cdot(\mathbb{E}[\widetilde{W}_{i}^{j}]-W_{i}^{*})|$. This LAP can be optimally solved in polynomial time. The cycle cost of remapping one $Rk\times Ck$ weight matrix block is,

$\displaystyle\#\texttt{Cycle}_{remap}=\#\texttt{Cycle}_{\epsilon}+\#\texttt{Cycle}_{LAP}\approx Rmk+R^{3},$	(10)

where the number of probes is $m=1$. To save cost, we solve for the optimal remapping $f$ only periodically and apply it to all following inferences.
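For small $R$, the assignment in Eq. (9) can be illustrated with an exhaustive search. This sketch swaps in brute-force enumeration for readability; a polynomial-time solver such as the Hungarian algorithm would be used in practice, matching the $R^{3}$ term in Eq. (10). The helper names and the example cost matrix are hypothetical.

```python
from itertools import permutations

def error_term(grad, W_meas, W_star):
    """epsilon_ij = |grad^T (E[W~] - W*)| on flattened chunk vectors."""
    return abs(sum(g * (wm - ws) for g, wm, ws in zip(grad, W_meas, W_star)))

def remap_tiles(cost):
    """Find the chunk-to-tile assignment with minimum total cost.

    cost[u][v] is the error of mapping weight chunk u onto tile v.
    Brute force over all R! assignments; only meant for small R.
    """
    R = len(cost)
    best = min(permutations(range(R)),
               key=lambda p: sum(cost[u][p[u]] for u in range(R)))
    return list(best), sum(cost[u][best[u]] for u in range(R))

def remap_cycles(R, m, k):
    """Cycle cost of one remapping, following Eq. (10)."""
    return R * m * k + R ** 3

if __name__ == "__main__":
    cost = [[4, 1, 3],
            [2, 0, 5],
            [3, 2, 2]]
    print(remap_tiles(cost))
    print(remap_cycles(R=4, m=1, k=8))
```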

III-E Adaptive Remediation Controller

To dynamically determine when to trigger our remediation flow, we introduce an adaptive controller that periodically monitors cheap but informative statistics, i.e., temperature per PTC, and determines whether to trigger remediation based on a threshold, e.g., when the average chip temperature drift since the last remediation is above 0.01K, i.e., $\frac{1}{RC}\sum T_{r,c}-T_{prev}>0.01$. If not, we perform a more expensive probing of the normalized MAE (NMAE) $\|\widetilde{W}-W^{*}\|_{1}/\|W^{*}\|_{1}$, with a slight cycle overhead. If the NMAE is above 5%, we trigger remediation. To avoid overly frequent remediation that keeps interrupting the online inference stream, we set a cooling time $\tau$ for our remediation procedure to control the maximum acceptable overhead. This can be reconfigured by users based on their preference between accuracy and inference throughput. For example, with 10K total inferences, if each remediation is as expensive as 10 inferences, a cooling time of 200 inferences will lead to a max overhead of $\frac{10K\times 10}{200\times 10K}=5\%$.
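The controller logic described above can be sketched as a small decision function. The function names, the lazy `nmae_probe` callback, and the way the cooling time is tracked (`since_last`, counted in inferences) are assumptions of this example; the thresholds follow the values quoted in the text.

```python
def nmae(W_meas, W_star):
    """Normalized MAE: ||W~ - W*||_1 / ||W*||_1 over 2D lists."""
    num = sum(abs(a - b) for ra, rb in zip(W_meas, W_star) for a, b in zip(ra, rb))
    den = sum(abs(b) for rb in W_star for b in rb)
    return num / den

def should_remediate(temps, T_prev, nmae_probe, since_last,
                     cooldown=200, temp_thresh=0.01, nmae_thresh=0.05):
    """Decide whether to trigger remediation.

    temps      : per-PTC temperature readings (flattened list)
    T_prev     : average temperature at the last remediation
    nmae_probe : callable returning the NMAE; only called when needed
    since_last : inferences executed since the last remediation
    """
    if since_last < cooldown:
        return False  # still within the cooling time tau
    avg_T = sum(temps) / len(temps)
    if avg_T - T_prev > temp_thresh:
        return True   # cheap check: temperature drift
    return nmae_probe() > nmae_thresh  # more expensive check: weight error

def max_overhead(remediation_cost, cooldown):
    """Upper bound on overhead, both arguments measured in inferences."""
    return remediation_cost / cooldown

if __name__ == "__main__":
    print(should_remediate([300.02] * 16, 300.0, lambda: 0.0, since_last=500))
    print(max_overhead(remediation_cost=10, cooldown=200))
```

Passing the NMAE probe as a callable keeps the more expensive measurement off the common path, so it only runs when the temperature check does not already trigger remediation.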

IV Evaluation Results

IV-A Simulation Setup

Dataset and Models. We evaluate our method on a three-layer CNN (C64K3-C64K3-C64K3-Pool5-FC10) on Fashion-MNIST[31], VGG-8[32] on CIFAR-10[33], and ResNet-18[34] on CIFAR-100[33] for image classification.

Training Settings. We pre-train all models for 100 epochs with an Adam optimizer with a 2E-3 learning rate, a cosine decay scheduler, 1E-4 weight decay, and data augmentation (random crop and flip). BatchNorm layers are all frozen after pretraining.

Architecture Settings. As a challenging case study, the hardware platform in this work is assumed to be a multi-tile, multi-core photonic tensor accelerator based on a thermally sensitive MRR weight bank shown in Fig. 3. Note that our method is not specific to MRR weight banks but can generalize to all universal optical matrix-vector multiplication units. We assume that the photonic accelerator has 4 tiles, and each tile has 4 cores. Each core is an 8 $\times$ 8 add-drop MRR weight bank that can perform 8 $\times$ 8 matrix-vector multiplication per core per cycle. Detailed architecture description is in Section III-A. The MRR device modeling is based on Section II-A. The device/circuit variation modeling is based on Section III-B.

Benchmark Settings. To cover different thermal variation scenarios, we create several synthetic noise configurations as benchmarks in Table I.

TABLE I: Benchmark settings for different noise scenarios

Scenario	Description
PV.1	Low Noise & Distribution: Edge-to-Corner
PV.2	High Noise & Distribution: Edge-to-Corner
TD.1	Temp Drift: Linear Increase & Uniform
TD.2	Temp Drift: Cosine Fluctuation & Uniform
TD.3	Temp Drift: Linear Increase & Corner Hotspot
TD.4	Temp Drift: Cosine Fluctuation & Corner Hotspot
CT	Thermal Crosstalk among all MRRs within each core.

(1) Phase Variation (PV). To simulate real-world chip noise scenarios, two cases are created: "Low Noise" and "High Noise," corresponding to low and high standard deviation (std.) in phase shift, modeled as $\Delta\phi\sim\mathcal{N}(0,\sigma^{2})$. As explained in Section III-B, the noise std. $\sigma_{ij}^{t+1^{\prime}}$ for each MRR device is time-variant and sampled at each time step from a noise level map $\mathcal{N}(\mu_{s}(t),\sigma_{s}^{2}(t))$. The noise level distribution gradually changes its high-noise region from the chip’s left edge to the top-left corner ("Edge-to-Corner") to capture dynamic noise profile drifting. Specific noise level functions are chosen for the Low and High Noise cases:

• Low Noise: $\mu_{s}(t)=0.0025t,~\sigma_{s}(t)=0.004t+0.002$

• High Noise: $\mu_{s}(t)=0.01t,~\sigma_{s}(t)=0.005t+0.005$

The damping factor $\beta_{\sigma}$ is set to 0.9. The phase error intensities adopt the typical values in the literature [10, 26].

(2) Temperature Drift (TD): Two cases are designed to represent different types of temperature change with $t_{max}=20,000$ :

• Linear: $T(t)=300K+t/t_{max}$

• Cosine: $T(t)=300.25K-0.25K\cos(10t/t_{max})$

The temperature drifts adopt typical values in the literature [23, 24]. We also consider two representative spatial distributions of temperature.

• Uniform: All PTCs have the same temperature drift following the Linear/Cosine scheduling.

• Corner Hotspot: The upper-left corner of the chip experiences a high temperature drift, exponentially decreasing with distance, i.e., $T(t)\leftarrow e^{-\sqrt{r^{2}+c^{2}}}(T(t)-T(0))+T(0)$, which represents a local hotspot scenario during the execution of the accelerator.
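The temperature schedules and spatial distributions above can be written down directly. In this sketch, `r` and `c` are taken to be zero-based PTC coordinates measured from the upper-left corner, which is an assumption about the indexing convention.

```python
import math

T_MAX = 20000  # t_max from the benchmark settings

def linear_temp(t):
    """Linear schedule: T(t) = 300 K + t / t_max."""
    return 300.0 + t / T_MAX

def cosine_temp(t):
    """Cosine schedule: T(t) = 300.25 K - 0.25 K * cos(10 t / t_max)."""
    return 300.25 - 0.25 * math.cos(10 * t / T_MAX)

def hotspot_temp(schedule, t, r, c):
    """Corner hotspot: drift decays exponentially with distance from (0, 0)."""
    T0 = schedule(0)
    return math.exp(-math.hypot(r, c)) * (schedule(t) - T0) + T0

if __name__ == "__main__":
    for r, c in ((0, 0), (1, 1), (3, 3)):
        print((r, c), round(hotspot_temp(linear_temp, T_MAX, r, c), 4))
```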

(3) Thermal Crosstalk (CT): We use a default MRR spacing $l_{v}=200\mu m$, $l_{h}=60\mu m$, crosstalk coefficient $k_{1}=0.1$, and model crosstalk only among MRRs within the same PTC [28, 25, 30].

Evaluation Metrics. We mainly evaluate different methods on the above benchmarks in terms of inference accuracy and cycle overhead consumed by remediation. All matrix multiplication operations in the forward propagation of NN inference are mapped to our photonic accelerators and counted as the basic inference cycle count. If an on-chip remediation method is applied, any additional computation will be converted to equivalent cycle overhead. For DOCTOR, the cycle overhead of calibration and remapping for each $Rk\times Ck$ matrix block is counted as Eq. (8) and Eq. (10), respectively. The overhead is summed over all matrix blocks in the ONN. Cycle overhead is reported as an efficiency metric for each remediation method.

IV-B Ablation Study

IV-B1 Calibration: Sparsity and Salience Scores

To determine the best hyperparameters in sparse calibration, we first evaluate how many matrix probings $m$ are needed for Eq. (6). In Table II, we found that $m=1$ can efficiently estimate $\mathbb{E}[\widetilde{W}]$ and calibrate the circuits with low cost and high accuracy.

TABLE II: Comparison of VGG8 accuracy and calibration cycles on CIFAR-10 with different weight probing times $m$ in our proposed sparse calibration. The stop MAE threshold is 0.0038.

$m$	1	2	3	4	5	10	15	20
cycles	3.07M	1.53M	2.12M	2.82M	3.53M	6.75M	10.12M	13.50M
MAE loss	0.00388	0.00379	0.00382	0.00367	0.00361	0.00366	0.00360	0.00360
acc	90.67%	90.05%	89.17%	89.36%	89.33%	88.56%	88.25%	88.32%

Figure 10 compares different salience scores, sampling methods, and sparsity. First-order Taylor expansion with importance sampling achieves the best balance between cycle counts and accuracy. Top-K is unsatisfactory as it is fixed to blocks with large ideal gradients. With sparsity $\beta=0.2$, the accuracy can be quickly resumed above 90% with negligible cycle overhead (equivalent to one single-image inference).

Figure 11 visualizes the effectiveness of our salience-aware sparse calibration. With thermal variation, the accuracy drops from 90.94% to 57%. Calibration can quickly resume accuracy above 90% with only 10 iterations. The overhead is small, equivalent to interrupting merely 5 single-batch inferences. With 0.2 sparsity, the overhead can be further reduced to only one inference.

IV-B2 Remapping: Error Estimation and Interval

Figure 12 evaluates different methods for error $\epsilon_{ij}$ estimation, including mean absolute errors (MAE), first-order and second-order Taylor expansions, as indicated in Eq. (4). As noise distribution evolves, the ideal pre-trained model with direct mapping suffers from a large accuracy drop. In contrast, our variation-aware remapping can significantly reduce the numerical errors and thus boost inference accuracy by 5-10%. The first-order Taylor expansion of the loss function shows clear advantages over the naive matrix MAE scores and is cheaper than second-order expansion with the same accuracy benefit. The solved remapping can be reused for following inferences until the next remapping is triggered. We scan over different remapping intervals from once per 1K inferences up to once per 10K inferences. We found that 2K-4K inference intervals are enough to guarantee maximum accuracy benefits (+5.4%) and negligible cycle overhead (0.17%).

IV-C Main Results: Compare with Prior Work

TABLE III: Compare the inference accuracy and cycle overhead (Ovhd) among 4 methods on 8 different variation settings and 3 datasets/models. Our remediation method only induces a small overhead of 0.14%-5% of the original inference cycles, which is 2-3 orders-of-magnitude less costly than on-chip training (BS3).

FMNIST-CNN3 (pre-trained acc: 92.78%, inference cycles: 5.81E8)

Noise configs	BS1 acc	BS1 ovhd	BS2 [10] acc	BS2 [10] ovhd	BS3 [21, 22] acc	BS3 [21, 22] ovhd	DOCTOR acc	DOCTOR ovhd
CT+PV.1+TD.1	50.41	-	45.56	-	36.72	3.5E9	92.42	8.4E5
CT+PV.1+TD.2	42.85	-	37.08	-	77.74	3.5E9	92.59	7.9E5
CT+PV.1+TD.3	78.44	-	81.49	-	80.50	3.5E9	92.20	7.6E5
CT+PV.1+TD.4	74.11	-	75.09	-	90.37	3.5E9	92.19	7.5E5
CT+PV.2+TD.1	49.43	-	45.39	-	27.49	3.5E9	91.42	8.4E5
CT+PV.2+TD.2	41.81	-	37.00	-	74.64	3.5E9	91.07	8.2E5
CT+PV.2+TD.3	77.94	-	80.55	-	78.14	3.5E9	91.12	8.4E5
CT+PV.2+TD.4	73.34	-	74.20	-	88.54	3.5E9	91.11	8.4E5
Avg. Acc / Ovhd Ratio	61.04	0.00%	59.55	0.00%	69.27	600%	91.77	0.14%

CIFAR10 VGG8 (pre-trained acc: 90.94%, inference cycles: 6.66E8)

Noise configs	BS1 acc	BS1 ovhd	BS2 [10] acc	BS2 [10] ovhd	BS3 [21, 22] acc	BS3 [21, 22] ovhd	DOCTOR acc	DOCTOR ovhd
CT+PV.1+TD.1	32.38	-	29.93	-	19.23	3.3E9	90.53	2.6E7
CT+PV.1+TD.2	28.54	-	26.86	-	53.71	3.3E9	90.24	2.4E7
CT+PV.1+TD.3	71.87	-	67.12	-	73.11	3.3E9	90.25	2.4E7
CT+PV.1+TD.4	65.82	-	60.89	-	86.47	3.3E9	90.27	2.6E7
CT+PV.2+TD.1	32.52	-	30.00	-	19.17	3.3E9	90.25	2.6E7
CT+PV.2+TD.2	28.98	-	27.02	-	51.96	3.3E9	89.62	2.4E7
CT+PV.2+TD.3	72.03	-	67.26	-	71.68	3.3E9	89.85	2.4E7
CT+PV.2+TD.4	65.80	-	60.87	-	85.83	3.3E9	89.75	2.3E7
Avg. Acc / Ovhd Ratio	49.74	0.00%	46.24	0.00%	57.65	500%	90.10	3.65%

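CIFAR100 ResNet18 (pre-trained acc: 73.57%, inference cycles: 5.43E9)

| Noise config | BS1 acc (%) | BS1 ovhd (cycles) | BS2 [10] acc (%) | BS2 [10] ovhd (cycles) | BS3 [21, 22] acc (%) | BS3 [21, 22] ovhd (cycles) | DOCTOR acc (%) | DOCTOR ovhd (cycles) |
|---|---|---|---|---|---|---|---|---|
| CT+PV.1+TD.1 | 5.23 | - | 5.32 | - | 6.20 | 2.7E10 | 72.15 | 2.9E8 |
| CT+PV.1+TD.2 | 7.25 | - | 7.66 | - | 24.03 | 2.7E10 | 71.33 | 2.8E8 |
| CT+PV.1+TD.3 | 21.69 | - | 23.06 | - | 39.00 | 2.7E10 | 70.20 | 2.7E8 |
| CT+PV.1+TD.4 | 18.16 | - | 18.82 | - | 46.56 | 2.7E10 | 70.01 | 2.6E8 |
| CT+PV.2+TD.1 | 5.14 | - | 5.36 | - | 4.96 | 2.7E10 | 72.28 | 2.9E8 |
| CT+PV.2+TD.2 | 7.33 | - | 7.65 | - | 24.90 | 2.7E10 | 71.68 | 2.8E8 |
| CT+PV.2+TD.3 | 21.97 | - | 23.61 | - | 1.03 | 2.7E10 | 70.47 | 2.7E8 |
| CT+PV.2+TD.4 | 18.43 | - | 19.21 | - | 46.27 | 2.7E10 | 70.40 | 2.6E8 |
| Avg. Acc / Ovhd Ratio | 13.15 | 0.00% | 13.84 | 0.00% | 24.12 | 500% | 71.07 | 5.08% |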

In Table III, we compare DOCTOR with three baselines: (1) BS1: directly deploying pre-trained models; (2) BS2 [10]: noise-aware training with 2% weight error injected during pre-training; and (3) BS3 [21, 22]: on-chip training for 1 epoch on a calibration dataset (10% of the training set). The parameters of our DOCTOR framework are: (1) Calibration: probing times $m=1$, salience score $s=|\nabla_{W}\mathcal{L}|$, calibration sparsity $\beta=0.2$, maximum calibration iterations $T_{calib}=20$ (50 for ResNet18); (2) Remapping: first-order Taylor expansion as $\epsilon$; (3) Controller: remediation cooling time $\tau=200$ inferences (50 for ResNet18), with each trigger running remapping followed by calibration.
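As a rough illustration of how a salience score and a sparsity level can interact, the sketch below selects which weights to calibrate by ranking $|\nabla_{W}\mathcal{L}|$. It assumes that $\beta$ denotes the fraction of weights selected for calibration, which may differ from the definition used in the method section; the function name `select_for_calibration` and the gradient values are hypothetical.

```python
import math


def select_for_calibration(grads, beta=0.2):
    """Return the indices of the ceil(beta * N) weights with the largest |grad|.

    Assumption: beta is the fraction of weights that get calibrated.
    """
    k = max(1, math.ceil(beta * len(grads)))
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return sorted(ranked[:k])


# Hypothetical loss gradients for 10 weights.
grads = [0.02, -0.90, 0.10, 0.00, 0.45, -0.03, 0.07, -0.60, 0.01, 0.05]
selected = select_for_calibration(grads, beta=0.2)
print(selected)  # [1, 7]: the two weights with the largest gradient magnitude
```

Under this reading, only the most loss-sensitive 20% of weights are touched in each calibration pass, which is one way a calibration step could stay cheap relative to full retraining.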

Across all benchmarks, BS1 suffers from severe accuracy degradation (30%-60%). Noise-aware training [10], though it smooths the solution space of the model, yields limited accuracy improvement because the noise distribution used in pre-training differs significantly from the physical variations. Note that DOCTOR is orthogonal to noise-aware training, so the two can be applied together to obtain a solution space that is both locally smooth and adaptive to the drifting noise distribution. On-chip training [21, 22] can boost the accuracy on small networks, e.g., by 8% on CNN-FMNIST and VGG8-CIFAR10. However, it performs poorly on the deeper ResNet18 and is unstable on certain benchmarks, e.g., CT+PV.2+TD.1. The fundamental reason is that the gradient estimation error accumulates exponentially during backpropagation, which can lead to poor training performance and even divergence. Moreover, although it trains for only 1 epoch on 10% of the training set, it adds 5-6 $\times$ the original inference cycles as training overhead, which makes it impractical to deploy on throughput-restricted edge platforms. In contrast, our proposed DOCTOR method stably maintains high accuracy even with rapidly drifting noise distributions and temperature, with an average accuracy drop of roughly 1-2.5% at the cost of merely 0.1%-5.1% cycle overhead. On average, DOCTOR is 34% more accurate and 2-3 orders-of-magnitude more efficient than on-chip training. We visualize the DOCTOR flow in Fig. 13 to show that the temperature drift is detected and the accuracy is rapidly restored with only 3.6% cycle overhead.
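The headline figures can be checked directly against the average rows of Table III. The short calculation below uses only the averaged accuracies and overhead ratios reported there; the variable names are illustrative.

```python
# Average rows of Table III:
# (BS3 acc %, BS3 overhead ratio %, DOCTOR acc %, DOCTOR overhead ratio %)
avg_rows = {
    "FMNIST-CNN3": (69.27, 600.0, 91.77, 0.14),
    "CIFAR10 VGG8": (57.65, 500.0, 90.10, 3.65),
    "CIFAR100 ResNet18": (24.12, 500.0, 71.07, 5.08),
}

# Accuracy gain of DOCTOR over on-chip training (BS3), in percentage points.
acc_gains = [doctor_acc - bs3_acc for bs3_acc, _, doctor_acc, _ in avg_rows.values()]
mean_gain = sum(acc_gains) / len(acc_gains)

# How many times larger the BS3 cycle overhead is than DOCTOR's.
ovhd_ratios = {
    name: bs3_ovhd / doctor_ovhd
    for name, (_, bs3_ovhd, _, doctor_ovhd) in avg_rows.items()
}

print(f"mean accuracy gain over BS3: {mean_gain:.1f} points")
for name, ratio in ovhd_ratios.items():
    print(f"{name}: BS3 overhead is about {ratio:.0f}x DOCTOR's")
```

The mean gain comes out at about 34 points, and the overhead ratios range from roughly 100x (ResNet18) to over 4000x (CNN3), consistent with the stated 2-3 orders of magnitude.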

IV-D Discussion

Device Spacing and Crosstalk. We evaluate how MRR spacing impacts the crosstalk and thus the maximum accuracy in Fig. 14. By default, we assume 200 $\mu m$ vertical spacing $l_{v}$ and 60 $\mu m$ horizontal spacing $l_{h}$ in the MRR array, and we scale the spacing in both directions. Below a scaling factor of 0.4, the crosstalk severely restricts the representable weight space, so the accuracy cannot be recovered even after calibration. For scaling factors between 0.4 and 1.1, the accuracy drop can be fully countered by our calibration, while above 1.1 there is almost no accuracy drop from crosstalk.

Trade-off Between Efficiency and Accuracy. The cooling time $\tau$ in our adaptive remediation controller trades off cycle overhead against accuracy. In Table IV, we sweep the cooling time from 200 to 1000 inferences. If accuracy is prioritized, we can adopt $\tau=200$ with only 3.7% cycle overhead. If inference throughput is prioritized, e.g., when only <1% overhead is acceptable, we can set $\tau=800$ with a 2.7% accuracy drop.

TABLE IV: Cycle overhead and test accuracy on VGG8-CIFAR10 with different remediation cooling time

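| $\tau$ (inferences) | 200 | 300 | 400 | 500 | 800 | 1000 |
|---|---|---|---|---|---|---|
| #cycles | 6.90E8 | 6.78E8 | 6.78E8 | 6.74E8 | 6.72E8 | 6.71E8 |
| overhead | +3.7% | +1.9% | +1.9% | +1.3% | +1.0% | +0.8% |

acc (%), listed in order of increasing $\tau$ (five values are given for the six settings): 90.24, 89.97, 89.15, 88.28, 86.55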
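The effect of the cooling time on how often remediation can run is easy to see in a small simulation. The sketch below is a simplified stand-in for the adaptive remediation controller, not the paper's implementation: it assumes the cooling time acts as a minimum number of inferences between two remediations, and it models the worst case in which drift is detected on every inference. The class `CoolingController` and helper `count_remediations` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CoolingController:
    """Allows a remediation only if at least `tau` inferences have passed."""

    tau: int
    since_last: int = field(init=False)

    def __post_init__(self):
        # Let the very first detected drift trigger a remediation.
        self.since_last = self.tau

    def step(self, drift_detected: bool) -> bool:
        """Call once per inference; True means run remapping, then calibration."""
        self.since_last += 1
        if drift_detected and self.since_last >= self.tau:
            self.since_last = 0
            return True
        return False


def count_remediations(tau: int, n_inferences: int) -> int:
    """Worst case: drift is flagged on every inference."""
    ctrl = CoolingController(tau)
    return sum(ctrl.step(drift_detected=True) for _ in range(n_inferences))


for tau in (200, 500, 800, 1000):
    print(tau, count_remediations(tau, n_inferences=1000))
```

A larger $\tau$ caps the number of remediations in a fixed window, which lowers cycle overhead but leaves the chip uncorrected for longer between calibrations.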

V Conclusion

In this work, we present the first on-chip remediation approach that dynamically monitors photonic accelerator temperature drift and ensures continued reliability with minimal overhead through training-free, data-free calibration and architectural remapping. Our method outperforms state-of-the-art on-chip training with 34% higher accuracy and 2-3 orders-of-magnitude lower cost. Our lightweight, effective in-situ remediation method enables self-corrected photonic neural accelerators with unprecedented reliability in real-world, dynamic deployment scenarios.

References

[1] Y. Shen, N. C. Harris, S. Skirlo et al., “Deep learning with coherent nanophotonic circuits,” Nature Photonics, 2017.
[2] Q. Cheng, J. Kwon, M. Glick, M. Bahadori, L. P. Carloni, and K. Bergman, “Silicon Photonics Codesign for Deep Learning,” Proceedings of the IEEE, 2020.
[3] B. J. Shastri, A. N. Tait et al., “Photonics for Artificial Intelligence and Neuromorphic Computing,” Nature Photonics, 2021.
[4] C. Feng, J. Gu, H. Zhu, Z. Ying, Z. Zhao et al., “A compact butterfly-style silicon photonic–electronic neural chip for hardware-efficient deep learning,” ACS Photonics, vol. 9, no. 12, pp. 3906–3916, 2022.
[5] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, “Holylight: A nanophotonic accelerator for deep learning in data centers,” in Proc. DATE, 2019.
[6] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. T. Chen, and D. Z. Pan, “M3ICRO: Machine learning-enabled compact photonic tensor core based on programmable multi-operand multimode interference,” APL Machine Learning, vol. 2, no. 1, p. 016106, Mar. 2024.
[7] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature, 2021.
[8] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. L. Gallo, X. Fu, A. Lukashchuk, A. Raja, J. Liu, D. Wright, A. Sebastian, T. Kippenberg, W. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature, 2021.
[9] C. Huang, S. Bilodeau, T. Ferreira de Lima et al., “Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits,” APL Photonics, vol. 5, no. 4, p. 040803, 2020.
[10] J. Gu, Z. Zhao, C. Feng, H. Zhu, R. T. Chen, and D. Z. Pan, “ROQ: A noise-aware quantization scheme towards robust optical neural networks with low-bit controls,” in Proc. DATE, 2020.
[11] Z. Zhao, J. Gu, Z. Ying et al., “Design technology for scalable and robust photonic integrated circuits,” in Proc. ICCAD, 2019.
[12] Y. Zhu, G. L. Zhang, B. Li et al., “Countering Variations and Thermal Effects for Accurate Optical Neural Networks,” in Proc. ICCAD, 2020.
[13] A. Mirza, F. Sunny et al., “Silicon photonic microring resonators: A comprehensive design-space exploration and optimization under fabrication-process variations,” IEEE TCAD, vol. 41, no. 10, pp. 3359–3372, 2022.
[14] A. N. Tait, T. F. de Lima, E. Zhou et al., “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep., 2017.
[15] D. Liu, Z. Zhao, Z. Wang, Z. Ying, R. T. Chen, and D. Z. Pan, “Operon: Optical-electrical power-efficient route synthesis for on-chip signals,” in Proc. DAC, 2018.
[16] J. Gu, C. Feng, Z. Zhao, Z. Ying, M. Liu, R. T. Chen, and D. Z. Pan, “SqueezeLight: Towards Scalable Optical Neural Networks with Multi-Operand Ring Resonators,” in Proc. DATE, 2021.
[17] M. Kirtas, N. Passalis, G. Mourgias-Alexandris, G. Dabos, N. Pleros, and A. Tefas, “Robust architecture-agnostic and noise resilient training of photonic deep learning models,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.
[18] L. G. Wright, T. Onodera et al., “Deep physical neural networks trained with backpropagation,” Nature, vol. 601, no. 7894, pp. 549–555, Jan. 2022.
[19] J. Gu, Z. Zhao, C. Feng, W. Li, R. T. Chen, and D. Z. Pan, “FLOPS: Efficient On-Chip Learning for Optical Neural Networks Through Stochastic Zeroth-Order Optimization,” in Proc. DAC, 2020.
[20] J. Gu, C. Feng, Z. Zhao, Z. Ying, R. T. Chen, and D. Z. Pan, “Efficient on-chip learning for optical neural networks through power-aware sparse zeroth-order optimization,” in Proc. AAAI, 2021.
[21] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. T. Chen, and D. Z. Pan, “L2ight: Enabling On-Chip Learning for Optical Neural Networks via Efficient in-situ Subspace Optimization,” in Proc. NeurIPS, 2021.
[22] S. Pai, Z. Sun, T. W. Hughes et al., “Experimentally realized in situ backpropagation for deep learning in photonic neural networks,” Science, vol. 380, no. 6643, pp. 398–404, Apr. 2023.
[23] Y. Ye, J. Xu, X. Wu, W. Zhang, X. Wang, M. Nikdast, Z. Wang, and W. Liu, “System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip,” IEEE Trans. VLSI Syst., vol. 21, no. 2, pp. 292–305, Feb. 2013.
[24] K. Padmaraju and K. Bergman, “Resolving the thermal challenges for silicon microring resonator devices,” Nanophotonics, vol. 3, no. 4-5, pp. 269–281, Aug. 2014.
[25] M. Milanizadeh, D. Aguiar, A. Melloni, and F. Morichetti, “Canceling Thermal Cross-Talk Effects in Photonic Integrated Circuits,” J. Lightwave Technol., vol. 37, no. 4, pp. 1325–1332, Feb. 2019.
[26] M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi, and M. R. DeWeese, “Design of optical neural networks with component imprecisions,” Opt. Express, vol. 27, no. 10, p. 14009, May 2019.
[27] F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, “Crosslight: A cross-layer optimized silicon photonic neural network accelerator,” in Proc. DAC, 2021.
[28] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski, and S. Shekhar, “Crosstalk in SOI microring resonator-based filters,” J. Lightwave Technol., vol. 34, no. 12, pp. 2886–2896, 2016.
[29] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, “Silicon microring resonators,” Laser & Photon. Rev., vol. 6, no. 1, pp. 47–73, Jan. 2012.
[30] A. Cem, D. Sanchez-Jacome, D. Pérez-López, and F. Da Ros, “Thermal crosstalk modeling and compensation for programmable photonic processors,” in IEEE Photonics Conference, 2023.
[31] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms,” arXiv, 2017.
[32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
[33] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.

Haotian Lu is currently a research intern in the ScopeX group, School of Electrical, Computer and Energy Engineering at Arizona State University, advised by Prof. Jiaqi Gu. His research interests mainly include hardware-algorithm co-design, electronic design automation, efficient hardware accelerators, and hardware security.

Sanmitra Banerjee is a Senior Design-for-X (DFX) Methodology Engineer at NVIDIA Corporation, Santa Clara, CA, and an Adjunct Faculty at Arizona State University. He received the B.Tech. degree from Indian Institute of Technology, Kharagpur, in 2018, and the M.S. and Ph.D. degrees from Duke University, Durham, NC, in 2021 and 2022, respectively. His research interests include machine learning based DFX techniques, and fault modeling and optimization of emerging AI accelerators under process variations and manufacturing defects.

Jiaqi Gu (S’19 - M’23) received the B.E. degree in Microelectronic Science and Engineering from Fudan University, Shanghai, China in 2018, and the Ph.D. degree from the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA in 2023. He is currently an Assistant Professor in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ, USA. His current research interests include emerging hardware design for efficient computing (photonics, post-CMOS electronics, quantum), hardware-algorithm co-design, AI/ML algorithms, and electronic-photonic design automation. He has received the Best Paper Award at IEEE TCAD 2021, the Best Paper Award at ASP-DAC 2020, the Best Paper Finalist at DAC 2020, the Best Poster Award at NSF Workshop on Machine Learning Hardware (2020), the ACM/SIGDA Student Research Competition First Place (2020), and the ACM Student Research Competition Grand Finals First Place (2021).