ExtremeCast: Boosting Extreme Value Prediction for Global Weather Forecast

Wanghan Xu^{1 2} Kang Chen^{1 3} Tao Han^{1 4} Hao Chen¹ Wanli Ouyang¹ Lei Bai¹

Abstract

Data-driven weather forecast based on machine learning (ML) has experienced rapid development and demonstrated superior performance in the global medium-range forecast compared to traditional physics-based dynamical models. However, most of these ML models struggle with accurately predicting extreme weather, which is related to training loss and the uncertainty of weather systems. Through mathematical analysis, we prove that the use of symmetric losses, such as the Mean Squared Error (MSE), leads to biased predictions and underestimation of extreme values. To address this issue, we introduce Exloss, a novel loss function that performs asymmetric optimization and highlights extreme values to obtain accurate extreme weather forecast. Beyond the evolution in training loss, we introduce a training-free extreme value enhancement module named ExBooster, which captures the uncertainty in prediction outcomes by employing multiple random samples, thereby increasing the hit rate of low-probability extreme events. Combined with an advanced global weather forecast model, extensive experiments show that our solution can achieve state-of-the-art performance in extreme weather prediction, while maintaining the overall forecast accuracy comparable to the top medium-range forecast models. Code and the model checkpoints are available at https://github.com/black-yt/ExtremeCast .

1 Introduction

Weather forecast plays a crucial role in society and impacts various industries including energy (Meenal et al. 2022), agriculture (Fathi et al. 2022), and transportation (Wang et al. 2020a). In recent years, the field of Numerical Weather Prediction has witnessed the rapid development of many data-driven forecast models based on machine learning (ML) (Kurth et al. 2023; Bi et al. 2023; Chen et al. 2023a), which are usually trained on extensive meteorological data spanning decades, achieving remarkable accuracy and fast inference.

Despite the advancements in ML forecast models, there remains a significant challenge when it comes to the extreme weather forecast. More specifically, these models tend to underestimate extreme values and produce smooth outputs (Gong et al. 2024), especially when the prediction lead time extends, which greatly hinders the prediction of natural disasters like typhoons and heatwaves, as shown in Figure 1. Some previous works (Lopez-Gomez et al. 2023; Ni 2023) have highlighted the notable influence of training loss on extreme value prediction, and empirically improve extreme values by adjusting the training loss. However, these work generally lack theoretical analysis and targeted methodologies. In addition, modifying the loss alone cannot capture the inherent uncertainty in the atmosphere, i.e., identical initial conditions may lead to multiple potential outcomes, and certain extreme events with low probabilities only appear in specific outcomes. The methodology for capturing extreme values in uncertain systems remains unexplored.

Refer to caption — Figure 1: MSE loss vs. Exloss. Through theoretical analysis, we found that MSE loss will give underestimated extreme value predictions. An asymmetric loss function, Exloss, is designed to address this bias by balancing the data distribution.

In this paper, we present a theoretical explanation from extreme value theory (EVT) (Smith 1990) to illustrate why models utilizing Mean Squared Error (MSE) loss perform poorly in extreme weather prediction. Through mathematical analysis, we found that MSE loss is effective only for normally distributed data, in the case of asymmetric extreme value distributions, it emphasizes minimizing the overall average error rather than forecasting extreme values accurately, leading to an underestimation bias. Building on this analysis, we propose an asymmetric loss function called Exloss, which incorporates a quantitative scaling function that adjusts the distribution of asymmetric extreme values towards a more balanced distribution, correcting the model’s underestimate tendency, as illustrated in Figure 1.

Moreover, due to the inherent uncertainty of the atmospheric system, a single prediction outcome may not encompass all extreme events. Inspired by ensemble algorithms (Gneiting and Raftery 2005), we introduce ExBooster, which generates multiple potential prediction outcomes via random sampling and subsequently employs the rank histograms algorithm (Hamill 2001) to amalgamate all prediction results. This module serves as a plug-and-play component that does not necessitate training but significantly enhances the incorporation of extreme values in predictions.

Building upon an advanced medium-range global weather forecast model, i.e., FengWu (Chen et al. 2023a), and a cascaded diffusion model (Ho, Jain, and Abbeel 2020), we come up with ExtremeCast, a robust and powerful global weather forecast model that works on the $0.25^{\circ}$ high-resolution. Trained with large-scale global atmosphere reanalysis dataset, i.e., ERA5 (Hersbach et al. 2020), ExtremeCast achieves state-of-the-art (SOTA) performance on extreme value metrics compared with both the top-performing data-driven global weather forecast models and the physics dynamic model, while maintaining the competitive overall accuracy. We summarize the contributions of this paper as follows:

•

From the perspective of extreme value theory, we propose a theoretical explanation for why MSE-based model is difficult to predict extreme values and prove that MSE leads to underestimation of extreme values.
•

We design a novel loss function named Exloss based on an asymmetric design, which balances data distribution to eliminate the bias of the above underestimated forecasts and provide accurate extreme weather forecasts.
•

We introduce ExBooster, a training-free module that models atmospheric uncertainty through random samplings, effectively boosting the forecast hit rate of extreme values.
•

Large-scale experiments for global weather forecast on the $0.25^{\circ}$ resolution show that our method significantly improve the forecasting ability of extreme weather, which is supported by obtaining better performance on extreme value metrics (e.g., RQE and SEDI) while maintaining the overall accuracy metrics (e.g., RMSE).

2 Related Work

2.1 Medium-range Weather Forecast

Medium-range weather forecast provides forecasts of weather conditions including temperature, wind speed, humidity, and more, for the next few days to two weeks. Traditional medium-range forecast models are mainly physics-based dynamic models such as the Integrated Forecasting System (IFS) from European Centre for Medium-Range Weather Forecasts (ECMWF) (MODEL 2003), whose development is primarily limited by the computational cost of numerical models (Ben-Bouallegue et al. 2023).

In recent years, data-driven models based on machine learning show significant potential for weather forecast (de Burgh-Day and Leeuwenburg 2023). FourCastNet (Kurth et al. 2023) utilizes Adaptive Fourier Neural Operator networks (Guibas et al. 2021) for prediction and performs a two-step finetuning to improve the accuracy of autoregressive multi-step forecast, where the model’s output is fed back as input for predicting the next step. This was followed by the Pangu-Weather (Bi et al. 2023), which uses 3D Swin-Transformer (Liu et al. 2021) and proposes hierarchical temporal aggregation aimed at reducing iterations in autoregressive forecast through the integration of multiple models. Subsequently, GraphCast (Lam et al. 2023) uses graph neural networks and performs a 12-step autoregressive finetuning. FengWu (Chen et al. 2023a) treats the prediction problem as a multi-task optimization and introduces a novel finetune strategy named replay buffer. FuXi (Chen et al. 2023b) reduces the cumulative error of the multi-step forecast by cascading three U-Transformer models (Petit et al. 2021) optimized at different lead times.

Although many ML models outperform ECMWF-IFS on the root mean square error (RMSE) metric, most of them suffer from the prediction smoothing issue and lag behind ECMWF-IFS on the extreme value metrics.

2.2 Extreme Weather Forecast

The application of ML models in predicting extreme weather remains relatively unexplored. Most existing models for extreme weather prediction are limited to regional scale or low resolution ( $>0.25^{\circ}$ ) rather than global scale and high resolution ( $0.25^{\circ}$ ). Zhao et al. (Zhao, Lu, and Kou 2003) use wavelet transform (Farge 1992) to predict the occurrence of extreme weather, but cannot predict their actual values. Porto et al. (Porto et al. 2022) combine multiple models to address the challenge of learning diverse extreme weather patterns, which introduces several times the additional training cost. Annau et al. (Annau, Cannon, and Monahan 2023) propose to separate high and low-frequency in the data through convolution and enhance the learning of high-frequency to improve extreme value prediction. However, the learning of high-frequency is notably more challenging than low-frequency, which has not been effectively solved. Morozov et al. (Morozov et al. 2023) apply bias correction on extreme values through quantile regression, whose main limitation is that it only performs regression on $7$ quantiles, which may not sufficiently capture the continuous distribution of the data.

Some work improves extreme values by adjusting loss. Lopez-Gomez et al. (Lopez-Gomez et al. 2023) increase the learning weight of extreme values by designing an exponential-based loss function. However, this method may cause exponential explosion in practical applications (Wang et al. 2019). Ni et al. (Ni 2023) identify that using GAN loss (Creswell et al. 2018) can enhance the performance of the model in predicting extreme values, which is relatively difficult to optimize (Berard et al. 2019).

For the global scale and high resolution extreme weather forecast, FuXi-Extreme (Zhong et al. 2023) uses a diffusion model to improve prediction accuracy of extreme weather. Building on this previous work, we use a cascaded diffusion model as a baseline component of our model and design Exloss and ExBooster to achieve SOTA performance.

3 Method

3.1 Preliminary

Formally, we denote the weather state at time period $i$ as a tensor $\mathcal{X}^{i}\in\mathbb{R}^{C\times H\times W}$ , where $C$ represents the number of atmospheric variables, $H$ and $W$ are the height and width. In this paper, we consider $C=69$ atmospheric variables, which will be introduced in detail in Table 1. To align the experimental setup with ML models like Pangu-Weather (Bi et al. 2023) and GraphCast (Lam et al. 2023), we adopt a latitude and longitude resolution of $0.25^{\circ}$ , which means the global atmospheric data is projected into a two-dimensional plane that $H=721$ and $W=1440$ horizontally.

The model learns to map from $\mathcal{X}^{i}$ to $\mathcal{X}^{i+1}$ with a time interval of 6 hours. Formally, it can be written as:

F_{\theta}(\mathcal{X}^{i})=P(\mathcal{X}^{i+1}|\mathcal{X}^{i})

(1)

where $\theta$ denotes the parameters of the model. By iterating the above process $n$ times, i.e. autoregressive prediction, the model can forecast any future time step $\mathcal{X}^{i+n}(n\in\mathbb{Z})$ .

3.2 Model Framework Overview

Our model consists of two trainable modules and one training-free module. They are, respectively, a deterministic model $M_{d}$ , a probabilistic generation model $M_{g}$ , and the ExBooster module. Specifically, $M_{d}$ learns the mapping from $\mathcal{X}^{i}$ to $\mathcal{X}^{i+1}$ , followed by $M_{g}$ , a conditional diffusion model that that refines the predictions of $M_{d}$ with additional details. Ultimately, ExBooster simulates atmospheric uncertainty through random sampling from the output of $M_{g}$ and amalgamating these samples to derive the final prediction.

In addition to one-step ( $\mathcal{X}^{i}\to\mathcal{X}^{i+1}$ ) training, we also incorporate a finetuning of autoregressive regime to improve model’s performance in multi-step prediction. The training process of our model can be divided into four stages:

•

Stage 1: One-Step pretraining. Use MSE loss to train $M_{d}$ learning mapping from $\mathcal{X}^{i}$ to $\mathcal{X}^{i+1}$ .
•

Stage 2: Muti-Step finetuning. Use the proposed Exloss to finetune $M_{d}$ learning autoregressive forecasting. This stage improves not only the accuracy of the model’s multi-step predictions, but also the ability to predict extreme weather due to the use of Exloss.
•

Stage 3: Diffusion training. Use Exloss to train $M_{g}$ to refine extreme values while freezing $M_{d}$ ’s parameters.
•

Stage 4: Inference. ExBooster that does not require training is added for complete weather forecast, enhancing the forecast hit rate of extreme events.

In the following, we will start by analyzing the inherent limitations of MSE, followed by describing the details of the proposed Exloss and the ExBooster.

3.3 Why MSE Fails to Predict Extreme

We abbreviate Equation 1 as $F_{\theta}(X)=P(Y|X)$ , where $X$ is the input of the model, and $Y$ is the target. According to the previous work (Aravkin et al. 2012; Kendall, Gal, and Cipolla 2018), when the data is normally distributed $Y\sim\mathcal{N}\left(\mu,\sigma^{2}\right)$ , using MSE loss to predict $Y$ is equivalent to using maximum likelihood estimation (Jiang and Zhang 2009) to predict the mean $\mu$ of $Y$ given $X$ , as shown in Equation 2:

$\begin{aligned} \underset{\theta}{argmin}\ &MSE(Y,\widetilde{Y}),\mathrm{where% }\ \widetilde{Y}\equiv F_{\theta}(X)\\ \Leftrightarrow\underset{\theta}{argmin}\ &-log\left(P\left(Y|\mu,\sigma^{2}% \right)\right),\mathrm{where}\ \mu\equiv F_{\theta}(X)\\ =\underset{\theta}{argmin}\ &-log\left(\frac{1}{\sigma\sqrt{2\pi}}exp\left(-% \frac{(\mu-Y)^{2}}{2\sigma^{2}}\right)\right)\\ =\underset{\theta}{argmin}\ &\frac{(\mu-Y)^{2}}{2\sigma^{2}}+log\left(\sigma% \sqrt{2\pi}\right)\\ \end{aligned}$

(2)

However, this type of optimization has a significant negative impact on extreme values. Consider the introduction of a new variable, denoted as $Y_{M}=max\{Y_{1},Y_{2},...,Y_{n}\}$ , where $Y_{i}(i=1,2..,n)$ represents different samples. In the context of weather forecasting, assuming that $Y_{1},Y_{2},...,Y_{n}$ correspond to the temperatures in different regions, then $Y_{M}$ represents the highest temperature in the world. According to the extreme value theory, $Y_{M}$ obeys the extreme value distribution (more specifically, maximum value distribution), that is, $Y_{M}\sim G(\widetilde{\mu},\widetilde{\sigma})$ (Kotz and Nadarajah 2000), whose probability density function is:

$f(Y_{M})=\frac{1}{\widetilde{\sigma}}exp\left(-\frac{Y_{M}-\widetilde{\mu}}{% \widetilde{\sigma}}-exp\left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}% \right)\right)$

(3)

where $\widetilde{\mu}$ and $\widetilde{\sigma}$ are its position and scale parameter respectively. It is worth mentioning that, unlike the normal distribution, the extreme value distribution is asymmetric. When employing maximum likelihood estimation for predicting the extreme value $Y_{M}$ , we can derive the optimization objective through a similar procedure:

$\begin{aligned} \underset{\theta}{argmin}\ &-log\left(P\left(Y_{M}|\widetilde{% \mu},\widetilde{\sigma}\right)\right)\\ =\underset{\theta}{argmin}\ &\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}+% exp\left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(% \widetilde{\sigma}\right)\\ where\ \widetilde{\mu}&\equiv max\left\{F_{\theta}(X_{1}),F_{\theta}(X_{2}),..% .,F_{\theta}(X_{n})\right\}\end{aligned}$

(4)

Define the optimization function obtained by maximum likelihood estimation of this extreme distribution as $obj_{M}(\cdot)$ :

$obj_{M}(\widetilde{\mu})=\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}+exp% \left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(% \widetilde{\sigma}\right)$

(5)

It can be proved (in Appendix A) that the optimal solution of Equation 4, that is, the minimum of $obj_{M}(\cdot)$ is obtained if and only if $\widetilde{\mu}=Y_{M}$ , which meets the optimization goal. But due to the asymmetry of the data distribution, $obj_{M}(\cdot)$ is also asymmetric, as depicted in the upper left corner of Figure 1, which ultimately results in biased estimates.

$obj_{M}(\cdot)$ can be viewed as a loss function that considers the data distribution. Under this assumption, the area enclosed by $obj_{M}(\cdot)$ and the x-coordinate axis can be regarded as the expectation of the total loss.

As illustrated in Figure 1, the shading on the left and the shading on the right represent the expectation of the total loss when model provides different estimates (underestimation and overestimation, respectively). Assuming the highest temperature in the world at a certain time is $Y_{M}=40^{\circ}C$ , predicting $\widetilde{\mu}=39.5^{\circ}C$ will get a lower loss than predicting $\widetilde{\mu}=40.5^{\circ}C$ based on probability expectation, which encourages the model to predict a smaller $Y_{M}$ , that is $\widetilde{\mu}=39.5^{\circ}C$ , rather than $\widetilde{\mu}=40.5^{\circ}C$ . This finally results in underestimating the degree of extremes. The same situation occurs when predicting minimum values, such as the global minimum temperature. For a comprehensive analysis of $obj_{M}(\cdot)$ , please refer to Appendix A.

In practice, the highest temperature predicted by the MSE-based model is always smaller than the true value (Ben-Bouallegue et al. 2023), which supports the above proof.

3.4 Exloss

Section 3.3 proves from EVT that using MSE loss will underestimate the degree of extremes. To address this issue, a simple and straightforward idea is to make the originally asymmetric $obj_{M}(\cdot)$ symmetrical through scaling.

It is cumbersome and unnecessary to transform $obj_{M}(\cdot)$ into a completely symmetric function. In actual operation, we use linear scaling to make the optimization objective function $obj_{M}(\cdot)$ symmetrical within a certain range $\epsilon$ , where $\epsilon$ is a hyperparameter that depends on both the data distribution and the accuracy of the model’s predictions. In simple terms, when the model’s predictions are highly accurate, the prediction error is smaller, allowing for the selection of a smaller value for $\epsilon$ . For detailed guidelines on selecting an appropriate value for $\epsilon$ , please refer to Appendix B. Mathematically, the process of achieving symmetry through scaling can be expressed as follows:

$\begin{aligned} &\widetilde{\mu}_{s}=\left\{\begin{matrix}\frac{\widetilde{% \sigma}}{s_{1}}(\widetilde{\mu}-Y_{M})+Y_{M},\ \widetilde{\mu}\leq Y_{M}\\ \frac{\widetilde{\sigma}}{s_{2}}(\widetilde{\mu}-Y_{M})+Y_{M},\ \widetilde{\mu% }>Y_{M}\\ \end{matrix}\right.\\ &s.t.\int_{\widetilde{\mu}_{s}=Y_{M}-\epsilon}^{Y_{M}}obj_{M}\left(\widetilde{% \mu}_{s}\right)\ \mathrm{d}\widetilde{\mu}_{s}=\int_{\widetilde{\mu}_{s}=Y_{M}% }^{Y_{M}+\epsilon}obj_{M}(\widetilde{\mu}_{s})\ \mathrm{d}\widetilde{\mu}_{s}% \\ \end{aligned}$

(6)

The purpose of the scaling factors $s_{1}$ and $s_{2}$ is to equalize the areas of the two shaded regions in Figure 1, which ensures that the expectations of the total loss are the same for underestimation and overestimation. Subsequently, $s_{1}$ and $s_{2}$ are combined into a function $S(\hat{\mathcal{Y}},\mathcal{Y})$ . By applying scaling function to the loss, we propose Exloss:

$\begin{aligned} Exloss\left(\hat{\mathcal{Y}},\mathcal{Y}\right)&=\left\|S(% \hat{\mathcal{Y}},\mathcal{Y})\odot\hat{\mathcal{Y}}-S(\hat{\mathcal{Y}},% \mathcal{Y})\odot\mathcal{Y}\right\|^{2}\\ &=\left\|S(\hat{\mathcal{Y}},\mathcal{Y})\odot\left(\hat{\mathcal{Y}}-\mathcal% {Y}\right)\right\|^{2}\end{aligned}$

(7)

where $\hat{\mathcal{Y}}$ and $\mathcal{Y}$ represent the model prediction and the target respectively, $S(\hat{\mathcal{Y}},\mathcal{Y})\in\mathbb{R}^{C\times H\times W}$ is asymmetrical scaling function. For the detailed form of $S(\hat{\mathcal{Y}},\mathcal{Y})$ , please refer to Appendix B.

In fact, by calculating $S(\hat{\mathcal{Y}},\mathcal{Y})$ , it can be observed that Exloss increases the loss weight when the model underestimates extreme value. This behavior is highly rational, as it adaptively increases the penalty for underestimations.

3.5 ExBooster

The atmospheric system has uncertainty (Thompson 1957), which means that for a given initial condition, there are multiple possible predictions. Some low-probability extreme events only exist in some prediction results. In such scenarios, multiple sampling of predictions aids in capturing a broader spectrum of potential extreme events.

In this work, we propose the ExBooster, drawing insights from ensemble algorithms (Gneiting and Raftery 2005). The process begins by introducing noise to the prediction, generating $m$ ensemble results. Due to the addition of noise, these different prediction results form a larger range of values than the original single prediction, which means that they contain more extreme events. Then, we use the ranking histogram method (Hamill 2001) to integrate these m prediction results. Specifically, all pixels in these samples are first sorted and grouped by the size of $m$ , and then the original value is replaced with the median of the corresponding part according to the ranking of each element in the initial prediction.

As depicted in Figure 3, ExBooster adaptively expands the range of pixel values without altering the overall distribution, thereby producing more extreme predictions. In practice, we choose $m=50$ , that is, randomly sample $50$ predictions, which is enough to mimic the uncertainty of atmospheric dynamics (Price et al. 2023). For detailed implementation, please refer to Appendix C.

4 Experiment

4.1 Experimental Setup

Dataset.

The training process utilizes the 5th generation of ECMWF reanalysis (ERA5) data (Hersbach et al. 2020). We use the data from 1979 to 2017 as the training set, from 2018 to 2021 as the validation set, and from 2022 to 2022 as the test set. Our model employs 69 atmospheric variables, including four surface variables and five upper-level variables with 13 pressure levels, as shown in Table 1.

Name	Description	Levels
u10	X-direction wind at 10m height	Single
v10	Y-direction wind at 10m height	Single
t2m	Temperature at 2m height	Single
msl	Mean sea level pressure	Single
z	Geopotential	13
q	Absolute humidity	13
u	X-direction wind	13
v	Y-direction wind	13
t	Temperature	13

Table 1: Atmospheric Variables. The levels are 50,100,150, 200,250,300,400,500,600,700,850,925,1000hPa.

Network Structure.

$M_{d}$ has the same network structure as (Chen et al. 2023a), which uses Swin-Transformer (Vaswani et al. 2017) as backbone, and designs the encoder and decoder through down-sampling and up-sampling. $M_{g}$ uses a mainstream diffusion implementation which use a U-Transformer (Petit et al. 2021) as backbone, and realizes conditional generation by concatenating condition to the model input. See Appendix D for hyperparameters.

Baseline Models.

We choose five ML models, namely Pangu (Bi et al. 2023), GraphCast (Lam et al. 2023), FengWu (Chen et al. 2023a), FuXi (Chen et al. 2023b), FuXi-Extreme (Zhong et al. 2023) and a physics-based dynamic model ECMWF-IFS (MODEL 2003), as the baseline models for comparative experiments. For detailed introductions of these models, please refer to the Related Work.

Metrics.

We use three metrics to reflect the extreme weather forecast performance and the overall accuracy.

•

RQE. Relative Quantile Error (Kurth et al. 2023). It is defined as the relative error of quantiles (90%-99.99%) between prediction and target. A positive RQE value indicates that extreme values are overestimated, while a negative value indicates underestimation.
•

SEDI. Symmetric Extremal Dependency Index (Ferro and Stephenson 2011). By classifying each pixel into extreme weather or normal weather, this metric calculates the hit rate of the classification. By selecting thresholds, SEDI for different degrees of extreme weather can be calculated, such as SEDI 90th, SEDI 95th, etc (Zhong et al. 2023; Han et al. 2024). A higher SEDI value closer to 1 indicates better prediction accuracy for extreme values.
•

RMSE. Weighted Root Mean Square Error (Rasp et al. 2020). It reflects the overall forecast error. A smaller RMSE value indicates a higher forecast accuracy.

Both RQE and SEDI serve as metrics to evaluate the accuracy of extreme value predictions, but their focus is different. RQE reflects more on the global scale, while SEDI reflects more on the regional (pixel-level) scale.

				RQE (- underestimate, + overestimate)				SEDI (the closer to 1 the better)
	Exloss	Diffusion	ExBooster	t2m	u10	q500	u500	t2m@90th	[email protected]	ws10@90th	[email protected]
➀	×	×	×	-0.0023	-0.1201	-0.1130	-0.0899	0.7059	0.5637	0.5623	0.4295
\hdashline➁	✓	×	×	-0.0012	-0.0792	-0.0330	-0.0538	0.7589	0.6222	0.6423	0.5156
\hdashline[1pt/4pt] ➂	×	✓	✓	+0.0005	-0.0296	-0.0407	-0.0299	0.7743	0.6947	0.6793	0.5778
➃	✓	×	✓	-0.0007	-0.0374	+0.0063	-0.0173	0.7643	0.6405	0.6791	0.5773
➄	✓	✓	×	+0.0002	-0.0496	-0.0291	-0.0459	0.7764	0.6927	0.6627	0.5517
\hdashline➅	✓	✓	✓	+0.0008	-0.0120	+0.0092	-0.0125	0.7796	0.7106	0.6930	0.6056

Table 2: Ablation Experiment. All results are predictions on the fifth day and use ERA5 as the target.

4.2 Global Extreme Weather Forecast Capability

We use the RQE to evaluate the models’ ability to predict extreme values on the global scale. As depicted in Figure 4, the RQE values of most ML models exhibit a tendency to become progressively more negative as the lead time increases, indicating an increasing underestimation of extreme values. On the contrary, the dynamic model has a relatively better RQE, that is, an RQE value closer to 0, and does not decrease as the lead time increases.

For our model, overall, the RQE does not decrease as the lead time increases, which is due to Exloss’s asymmetric optimization, avoiding the underestimation of extreme values. Especially for q500 (the channel of absolute humidity at the 500hPa pressure layer), u500, and u10, our model predicts with a RQE close to 0 consistently, surpassing both ML models and the dynamic model, showing the accurate and stable extreme value prediction capabilities.

Our model slightly overestimates (around $10^{-3}$ ) the extreme values in certain atmospheric variables like z500, which is very rare for ML models but is occasionally observed in dynamic models as well. Appropriate overestimation can provide a more adequate upper bound for the prediction of extreme events, which is beneficial for some applications like disaster early warning (Li et al. 2021).

4.3 Regional Extreme Weather Forecast Capability

Different from RQE, SEDI focus more on extreme values in local areas, because each pixel uses its quantile in the time dimension as a threshold that varies from month to month. For instance, a temperature of $-10^{\circ}C$ may not be considered as an extreme event for high-latitude areas during winter, but it could be extreme for mid-latitude areas in summer.

As illustrated in Figure 5, for surface temperature, our model’s prediction accuracies at 90th, 95th, 98th extreme degrees are significantly higher than that of other models. Regarding the most extreme scenarios at the 99.5th quantile, our model demonstrates relatively better performance when the lead time is less than 24 hours, and for lead times exceeding 24 hours, our model has similar performance to FuXi-Extreme and ECMWF-IFS.

For surface wind speed, our model significantly outperforms other models across all extreme levels. It is worth noting that the SEDI of our model’s first step prediction (lead time=6h) is remarkably high, predominantly exceeding 0.95. This is particularly challenging for wind speed which exhibit rapid and diverse changes, highlighting the excellent performance of our model in forecasting extreme wind speeds.

4.4 Case Analysis

To better showcase the performance of our model, we select two natural disasters, typhoon and heatwave, for visual analysis. Specifically, we choose Nanmadol typhoon (Wu et al. 2023) at 6:00 on September 17, 2022, and the heatwave in the southeastern coastal area of China at 12:00 on August 14, 2022 (Jiang et al. 2023), for three-day forecasting. Figure 7 illustrates the results, with the two rows showing surface wind speed and surface temperature, respectively.

Regarding typhoons, the wind speeds of our prediction are the most extreme and closest to the real wind speeds. This aligns with the findings observed in the far-leading SEDI of wind speed, showing the exceptional performance of our model in predicting extreme wind speeds. This capability is of significant importance for estimating the intensity of typhoon landings and assessing the potential damage to coastal areas.

In the case of the heatwave, both of the predictions of our model and FuXi-Extreme are more extreme than other models. Although fuxi-extreme’s predictions are more extreme than ours, our model gets a prediction ( $37.01^{\circ}C$ ) that is closer to the true value ( $37.25^{\circ}C$ ) for the regional maximum temperature prediction. For more visualizations, please refer to Appendix E.

4.5 Ablation Experiment

We assess the individual contributions of each module by conducting ablation experiments where we remove Exloss, and ExBooster. Additionally, as an effective implementation component, diffusion model is also considered in the ablation.

Through Table 2, we find that all three modules are beneficial in enhancing the prediction of extreme values. Specifically, we discover that the asymmetric optimization of Exloss plays a crucial role in enhancing the forecasting capabilities of extreme weather events. This is evident from the comparison between row ➀ and row ➁, where the use of Exloss leads to significant improvements in both RQE and SEDI metrics. Furthermore, comparing row ➂ and row ➅ (best version), the lack of Exloss results in a significant decrease in all metrics.

Moreover, Diffusion and ExBooster have great improvements in the prediction of extreme temperatures and extreme wind speeds respectively, which highlights the favorable decoupling properties of different modules in our model.

4.6 Normal Weather Forecast Capability

Although the focus of this paper is extreme weather prediction, the overall forecast accuracy of our model is still competitive. As shown in Figure 6, while our model demonstrates outstanding performance in predicting extreme wind speeds, it also achieves a high accuracy in overall wind speed prediction, as indicated by RMSE u10. Additionally, the RMSE metrics of our model on other variables are also comparable to top ML models and surpass ECMWF-IFS.

5 Conclusions

Weather forecast models based on machine learning often face challenges in accurately predicting extreme weather. In this paper, we provide an explanation based on extreme value theory, for the inherent limitations of MSE loss in forecasting extreme values. To address this issue, a novel loss function, Exloss, is introduced, which rectifies deviations in extreme value prediction through asymmetric optimization.

Moreover, we propose a training-free module named ExBooster that models the uncertainty of atmosphere through random sampling and integrates multiple forecast results to improve the forecast hit rate of extreme events. Experimental results show that our model, ExtremeCast, achieves SOTA performance in extreme value metrics such as RQE and SEDI, while maintaining the competitive overall accuracy.

The limitation of our model lies in the increased time cost resulting from the inclusion of the cascaded pipeline, and the sensitivity of ExBooster affected by the scale of noise.

References

Annau, Cannon, and Monahan (2023) Annau, N. J.; Cannon, A. J.; and Monahan, A. H. 2023. Algorithmic hallucinations of near-surface winds: Statistical downscaling with generative adversarial networks to convection-permitting scales. Artificial Intelligence for the Earth Systems, 2(4): e230015.
Aravkin et al. (2012) Aravkin, A.; Burke, J. V.; Chiuso, A.; and Pillonetto, G. 2012. On the estimation of hyperparameters for empirical bayes estimators: Maximum marginal likelihood vs minimum MSE. IFAC Proceedings Volumes, 45(16): 125–130.
Bellprat and Doblas-Reyes (2016) Bellprat, O.; and Doblas-Reyes, F. 2016. Attribution of extreme weather and climate events overestimated by unreliable climate simulations. Geophysical Research Letters, 43(5): 2158–2164.
Ben-Bouallegue et al. (2023) Ben-Bouallegue, Z.; Clare, M. C.; Magnusson, L.; Gascon, E.; Maier-Gerber, M.; Janousek, M.; Rodwell, M.; Pinault, F.; Dramsch, J. S.; Lang, S. T.; et al. 2023. The rise of data-driven weather forecasting. arXiv preprint arXiv:2307.10128.
Berard et al. (2019) Berard, H.; Gidel, G.; Almahairi, A.; Vincent, P.; and Lacoste-Julien, S. 2019. A closer look at the optimization landscapes of generative adversarial networks. arXiv preprint arXiv:1906.04848.
Bi et al. (2023) Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; and Tian, Q. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970): 533–538.
Cai and Hames (2010) Cai, Y.; and Hames, D. 2010. Minimum sample size determination for generalized extreme value distribution. Communications in Statistics—Simulation and Computation®, 40(1): 87–98.
Chen et al. (2023a) Chen, K.; Han, T.; Gong, J.; Bai, L.; Ling, F.; Luo, J.-J.; Chen, X.; Ma, L.; Zhang, T.; Su, R.; et al. 2023a. FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead. arXiv preprint arXiv:2304.02948.
Chen et al. (2023b) Chen, L.; Zhong, X.; Zhang, F.; Cheng, Y.; Xu, Y.; Qi, Y.; and Li, H. 2023b. FuXi: A cascade machine learning forecasting system for 15-day global weather forecast. arXiv preprint arXiv:2306.12873.
Creswell et al. (2018) Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; and Bharath, A. A. 2018. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1): 53–65.
de Burgh-Day and Leeuwenburg (2023) de Burgh-Day, C. O.; and Leeuwenburg, T. 2023. Machine learning for numerical weather and climate modelling: a review. Geoscientific Model Development, 16(22): 6433–6477.
Fang et al. (2021) Fang, W.; Xue, Q.; Shen, L.; and Sheng, V. S. 2021. Survey on the application of deep learning in extreme weather prediction. Atmosphere, 12(6): 661.
Farge (1992) Farge, M. 1992. Wavelet transforms and their applications to turbulence. Annual review of fluid mechanics, 24(1): 395–458.
Fathi et al. (2022) Fathi, M.; Haghi Kashani, M.; Jameii, S. M.; and Mahdipour, E. 2022. Big data analytics in weather forecasting: A systematic review. Archives of Computational Methods in Engineering, 29(2): 1247–1275.
Ferro and Stephenson (2011) Ferro, C. A.; and Stephenson, D. B. 2011. Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather and Forecasting, 26(5): 699–713.
Francis and Vavrus (2012) Francis, J. A.; and Vavrus, S. J. 2012. Evidence linking Arctic amplification to extreme weather in mid-latitudes. Geophysical research letters, 39(6).
Gneiting and Raftery (2005) Gneiting, T.; and Raftery, A. E. 2005. Weather forecasting with ensemble methods. Science, 310(5746): 248–249.
Gong et al. (2024) Gong, J.; Bai, L.; Ye, P.; Xu, W.; Liu, N.; Dai, J.; Yang, X.; and Ouyang, W. 2024. CasCast: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling. arXiv preprint arXiv:2402.04290.
Gray (1977) Gray, W. M. 1977. Tropical cyclone genesis in the western North Pacific. Journal of the Meteorological Society of Japan. Ser. II, 55(5): 465–482.
Guibas et al. (2021) Guibas, J.; Mardani, M.; Li, Z.; Tao, A.; Anandkumar, A.; and Catanzaro, B. 2021. Adaptive fourier neural operators: Efficient token mixers for transformers. arXiv preprint arXiv:2111.13587.
Hamill (2001) Hamill, T. M. 2001. Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review, 129(3): 550–560.
Han et al. (2024) Han, T.; Guo, S.; Xu, W.; Bai, L.; et al. 2024. CRA5: Extreme Compression of ERA5 for Portable Global Climate and Weather Research via an Efficient Variational Transformer. arXiv preprint arXiv:2405.03376.
Hernández-Orallo, Flach, and Ferri Ramírez (2012) Hernández-Orallo, J.; Flach, P.; and Ferri Ramírez, C. 2012. A unified view of performance metrics: Translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13: 2813–2869.
Hersbach et al. (2020) Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. 2020. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730): 1999–2049.
Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
Jiang et al. (2023) Jiang, J.; Liu, Y.; Mao, J.; and Wu, G. 2023. Extreme heatwave over Eastern China in summer 2022: the role of three oceans and local soil moisture feedback. Environmental Research Letters, 18(4): 044025.
Jiang and Zhang (2009) Jiang, W.; and Zhang, C.-H. 2009. General maximum likelihood empirical Bayes estimation of normal means. projecteuclid.
Kendall, Gal, and Cipolla (2018) Kendall, A.; Gal, Y.; and Cipolla, R. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7482–7491.
Kotz and Nadarajah (2000) Kotz, S.; and Nadarajah, S. 2000. Extreme value distributions: theory and applications. world scientific.
Kurth et al. (2023) Kurth, T.; Subramanian, S.; Harrington, P.; Pathak, J.; Mardani, M.; Hall, D.; Miele, A.; Kashinath, K.; and Anandkumar, A. 2023. Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, 1–11.
Lam et al. (2023) Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W.; et al. 2023. Learning skillful medium-range global weather forecasting. Science, eadi2336.
Li et al. (2021) Li, X.; Rankin, C.; Gangrade, S.; Zhao, G.; Lander, K.; Voisin, N.; Shao, M.; Morales-Hernández, M.; Kao, S.-C.; and Gao, H. 2021. Evaluating precipitation, streamflow, and inundation forecasting skills during extreme weather events: A case study for an urban watershed. Journal of Hydrology, 603: 127126.
Liu et al. (2021) Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022.
Lopez-Gomez et al. (2023) Lopez-Gomez, I.; McGovern, A.; Agrawal, S.; and Hickey, J. 2023. Global extreme heat forecasting using neural weather models. Artificial Intelligence for the Earth Systems, 2(1): e220035.
Meenal et al. (2022) Meenal, R.; Binu, D.; Ramya, K.; Michael, P. A.; Vinoth Kumar, K.; Rajasekaran, E.; and Sangeetha, B. 2022. Weather forecasting for renewable energy system: A review. Archives of Computational Methods in Engineering, 29(5): 2875–2891.
MODEL (2003) MODEL, E. W. 2003. IFS DOCUMENTATION. ecmwf.
Morozov et al. (2023) Morozov, V.; Galliamov, A.; Lukashevich, A.; Kurdukova, A.; and Maximov, Y. 2023. CMIP X-MOS: Improving Climate Models with Extreme Model Output Statistics. arXiv preprint arXiv:2311.03370.
Ni (2023) Ni, Z. 2023. Kunyu: A High-Performing Global Weather Model Beyond Regression Losses. arXiv preprint arXiv:2312.08264.
Petit et al. (2021) Petit, O.; Thome, N.; Rambour, C.; Themyr, L.; Collins, T.; and Soler, L. 2021. U-net transformer: Self and cross attention for medical image segmentation. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12, 267–276. Springer.
Porto et al. (2022) Porto, F.; Ferro, M.; Ogasawara, E.; Moeda, T.; de Barros, C. D. T.; Silva, A. C.; Zorrilla, R.; Pereira, R. S.; Castro, R. N.; Silva, J. V.; et al. 2022. Machine learning approaches to extreme weather events forecast in urban areas: Challenges and initial results. Supercomputing Frontiers and Innovations, 9(1): 49–73.
Price et al. (2023) Price, I.; Sanchez-Gonzalez, A.; Alet, F.; Ewalds, T.; El-Kadi, A.; Stott, J.; Mohamed, S.; Battaglia, P.; Lam, R.; and Willson, M. 2023. GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796.
Rasp et al. (2020) Rasp, S.; Dueben, P. D.; Scher, S.; Weyn, J. A.; Mouatadid, S.; and Thuerey, N. 2020. WeatherBench: a benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11): e2020MS002203.
Richard et al. (2013) Richard, Y.; Rouault, M.; Pohl, B.; Crétat, J.; Duclot, I.; Taboulot, S.; Reason, C.; Macron, C.; and Buiron, D. 2013. Temperature changes in the mid-and high-latitudes of the Southern Hemisphere. International Journal of Climatology, 33(8): 1948–1963.
Smith (1990) Smith, R. L. 1990. Extreme value theory. Handbook of applicable mathematics, 7(437-471): 18.
Thompson (1957) Thompson, P. D. 1957. Uncertainty of initial state as a factor in the predictability of large scale atmospheric flow patterns. Tellus, 9(3): 275–295.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang et al. (2020a) Wang, H.-W.; Peng, Z.-R.; Wang, D.; Meng, Y.; Wu, T.; Sun, W.; and Lu, Q.-C. 2020a. Evaluation and prediction of transportation resilience under extreme weather events: A diffusion graph convolutional approach. Transportation research part C: emerging technologies, 115: 102619.
Wang et al. (2020b) Wang, Q.; Ma, Y.; Zhao, K.; and Tian, Y. 2020b. A comprehensive survey of loss functions in machine learning. Annals of Data Science, 1–26.
Wang et al. (2019) Wang, S.; Li, Y.; Liang, X.; Quan, D.; Yang, B.; Wei, S.; and Jiao, L. 2019. Better and faster: Exponential loss for image patch matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4812–4821.
Wei et al. (2018) Wei, C.-C.; Peng, P.-C.; Tsai, C.-H.; and Huang, C.-L. 2018. Regional forecasting of wind speeds during typhoon landfall in Taiwan: A case study of westward-moving typhoons. Atmosphere, 9(4): 141.
Wellander, Lundén, and Backstrom (2001) Wellander, N.; Lundén, O.; and Backstrom, M. 2001. The maximum value distribution in a reverberation chamber. In 2001 IEEE EMC International Symposium. Symposium Record. International Symposium on Electromagnetic Compatibility (Cat. No. 01CH37161), volume 2, 751–756. IEEE.
Wu et al. (2023) Wu, Z.; Zhang, Y.; Zhang, L.; and Zheng, H. 2023. Interaction of Cloud Dynamics and Microphysics During the Rapid Intensification of Super-Typhoon Nanmadol (2022) Based on Multi-Satellite Observations. Geophysical Research Letters, 50(15): e2023GL104541.
Yang et al. (2023) Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; and Yang, M.-H. 2023. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4): 1–39.
Yang and Tsai (2019) Yang, T.-H.; and Tsai, C.-C. 2019. Using numerical weather model outputs to forecast wind gusts during typhoons. Journal of Wind Engineering and Industrial Aerodynamics, 188: 247–259.
Zhao, Lu, and Kou (2003) Zhao, J.; Lu, C.-T.; and Kou, Y. 2003. Detecting region outliers in meteorological data. In Proceedings of the 11th ACM international symposium on Advances in geographic information systems, 49–55.
Zhong et al. (2023) Zhong, X.; Chen, L.; Liu, J.; Lin, C.; Qi, Y.; and Li, H. 2023. FuXi-Extreme: Improving extreme rainfall and wind forecasts with diffusion model. arXiv preprint arXiv:2310.19822.

Appendix A Analysis of the Optimization Objective

First, we examine the optimization function for the scenario where the model predicts the maximum value $Y_{M}=\max\{Y_{1},Y_{2},...,Y_{n}\}$ . In this case, $Y_{M}$ follows a maximum distribution, characterized by its probability density function (Wellander, Lundén, and Backstrom 2001):

f(Y_{M})=\frac{1}{\widetilde{\sigma}}exp\left(-\frac{Y_{M}-\widetilde{\mu}}{% \widetilde{\sigma}}-exp\left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}% \right)\right)

(8)

where $\widetilde{\mu}$ and $\widetilde{\sigma}$ are its position and scale parameter respectively. Unlike the normal distribution, maximum distribution is an asymmetric skewed distribution. We can draw a histogram of the maximum distribution through the following algorithm, so as to observe the characteristics of the maximum distribution more directly.

Algorithm 1 Maximum Distribution Sample

M=[\ ]\ \#\

List of maximum distribution sampling

for

i=1

10000

N=[\ ]\ \#\

List of normal distribution sampling

for

j=1

10000

x\longleftarrow\mathcal{N}(0,1)

N

.append(

x

)

end for

x_{M}\longleftarrow

Max(

N

)

M

.append(

x_{M}

)

end for

Draw a histogram of

M

and

N

The asymmetry of the distribution of maximum values can be clearly seen from Figure 8. This asymmetry is also the focus of our discussion below.

When making predictions for the extreme value $Y_{M}$ , the optimization objective can be formulated as follows:

$\displaystyle\underset{\theta}{argmin}$	$\displaystyle-log\left(P\left(Y_{M}\|\widetilde{\mu},\widetilde{\sigma}\right)\right)$	(9)
$\displaystyle=\underset{\theta}{argmin}$	$\displaystyle\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}+exp\left(-\frac{% Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(\widetilde{\sigma}\right)$
$\displaystyle where\ \widetilde{\mu}$	$\displaystyle\equiv max\left\{F_{\theta}(X_{1}),F_{\theta}(X_{2}),...,F_{% \theta}(X_{n})\right\}$

Denote the optimization objective as $obj_{M}(\cdot)$ :

obj_{M}(\widetilde{\mu})=\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}+exp% \left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(% \widetilde{\sigma}\right)

(10)

The derivative of $obj_{M}(\cdot)$ with respect to $\widetilde{\mu}$ is:

obj_{M}^{\prime}(\widetilde{\mu})=-\frac{1}{\widetilde{\sigma}}+\frac{1}{% \widetilde{\sigma}}exp\left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)

(11)

Its second derivative is:

obj_{M}^{\prime\prime}(\widetilde{\mu})=\frac{1}{\widetilde{\sigma}^{2}}exp% \left(-\frac{Y_{M}-\widetilde{\mu}}{\widetilde{\sigma}}\right)

(12)

Its second derivative $obj_{M}^{\prime\prime}(\widetilde{\mu})$ is positive, indicating that the derivative, $obj_{M}^{\prime}(\widetilde{\mu})$ , monotonically increases. As $obj_{M}^{\prime}(Y_{M})=0$ , we can conclude that:

\left\{\begin{matrix}obj_{M}^{\prime}(\widetilde{\mu})<0,\widetilde{\mu}<Y_{M}% \\ obj_{M}^{\prime}(\widetilde{\mu})=0,\widetilde{\mu}=Y_{M}\\ obj_{M}^{\prime}(\widetilde{\mu})>0,\widetilde{\mu}>Y_{M}\end{matrix}\right.

(13)

So $obj_{M}(\widetilde{\mu})$ first decreases and then increases, with its minimum value being:

min(obj_{M}(\widetilde{\mu}))=obj_{M}(\widetilde{\mu})|_{\widetilde{\mu}=Y_{M}% }=1+log\left(\widetilde{\sigma}\right)

(14)

Next, we examine the situation where the model predicts the minimum value $Y_{m}=\min\{Y_{1},Y_{2},...,Y_{n}\}$ . In this case, $Y_{m}$ follows a minimum distribution, whose probability density function is (Cai and Hames 2010):

f(Y_{m})=\frac{1}{\widetilde{\sigma}}exp\left(\frac{Y_{m}-\widetilde{\mu}}{% \widetilde{\sigma}}-exp\left(\frac{Y_{m}-\widetilde{\mu}}{\widetilde{\sigma}}% \right)\right)

(15)

where $\widetilde{\mu}$ and $\widetilde{\sigma}$ are its position and scale parameter respectively. When making predictions for the extreme value $Y_{m}$ , the optimization objective is:

$\displaystyle\underset{\theta}{argmin}$	$\displaystyle-log\left(P\left(Y_{m}\|\widetilde{\mu},\widetilde{\sigma}\right)\right)$	(16)
$\displaystyle=\underset{\theta}{argmin}$	$\displaystyle-\frac{Y_{m}-\widetilde{\mu}}{\widetilde{\sigma}}+exp\left(\frac{% Y_{m}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(\widetilde{\sigma}\right)$
$\displaystyle where\ \widetilde{\mu}$	$\displaystyle\equiv min\left\{F_{\theta}(X_{1}),F_{\theta}(X_{2}),...,F_{% \theta}(X_{n})\right\}$

Denote the optimization objective as $obj_{m}(\cdot)$

obj_{m}(\widetilde{\mu})=-\frac{Y_{m}-\widetilde{\mu}}{\widetilde{\sigma}}+exp% \left(\frac{Y_{m}-\widetilde{\mu}}{\widetilde{\sigma}}\right)+log\left(% \widetilde{\sigma}\right)

(17)

Similar to the analysis process of $obj_{M}(\cdot)$ , we can prove that $obj_{m}(\cdot)$ first decreases and then increases. The minimum value of $obj_{m}(\cdot)$ is:

min(obj_{m}(\widetilde{\mu}))=obj_{m}(\widetilde{\mu})|_{\widetilde{\mu}=Y_{m}% }=1+log\left(\widetilde{\sigma}\right)

(18)

By visualizing the shapes of $obj_{M}(\cdot)$ and $obj_{m}(\cdot)$ , we gain a clearer insight into their respective characteristics. It can be seen from Figure 9 that both $obj_{m}(\cdot)$ and $obj_{M}(\cdot)$ are asymmetric.

Appendix B Scaling Function

$obj_{M}(\cdot)$ and $obj_{m}(\cdot)$ can actually be regarded as loss functions that take into account the distribution characteristics of the original data (Wang et al. 2020b). Under this assumption, the area enclosed by the function and the x-coordinate axis can be regarded as the expectation of the total loss of this part of the sample (Hernández-Orallo, Flach, and Ferri Ramírez 2012).

We hope that the areas of $obj_{M}(\cdot)$ near $Y_{M}$ are equal, thereby ensuring that the expectations of the total loss when underestimating extreme values and overestimating extreme values are equal. The same goes for $obj_{m}(\cdot)$ . This goal is accomplished through scaling. Formally, for $obj_{M}(\cdot)$ , it can be written:

		$\displaystyle\widetilde{\mu}_{s}=\left\{\begin{matrix}\frac{\widetilde{\sigma}% }{s_{1}}(\widetilde{\mu}-Y_{M})+Y_{M},\ \widetilde{\mu}\leq Y_{M}\\ \frac{\widetilde{\sigma}}{s_{2}}(\widetilde{\mu}-Y_{M})+Y_{M},\ \widetilde{\mu% }>Y_{M}\\ \end{matrix}\right.$		(19)
		$\displaystyle s.t.\int_{\widetilde{\mu}_{s}=Y_{M}-\epsilon}^{Y_{M}}obj_{M}% \left(\widetilde{\mu}_{s}\right)\ \mathrm{d}\widetilde{\mu}_{s}=\int_{% \widetilde{\mu}_{s}=Y_{M}}^{Y_{M}+\epsilon}obj_{M}(\widetilde{\mu}_{s})\ % \mathrm{d}\widetilde{\mu}_{s}$		(19)

To simplify the calculation, we subtract the minimum value of $obj_{M}(\cdot)$ , which is $1+\log(\widetilde{\sigma})$ , from $obj_{M}(\cdot)$ . This adjustment ensures that the lowest value of $obj_{M}\cdot$ becomes 0, as shown in Figure 10. We can then compute the area around the lowest point by performing integration:

The area integral of $R_{1}$ is as follows:

	$\displaystyle\int_{\widetilde{\mu}_{s}=Y_{M}-\frac{\widetilde{\sigma}}{s_{1}}% \epsilon}^{Y_{M}}obj_{M}\left(\widetilde{\mu}_{s}\right)\ \mathrm{d}\widetilde% {\mu}_{s},\ \widetilde{\mu}_{s}=\frac{\widetilde{\sigma}}{s_{1}}(\widetilde{% \mu}-Y_{M})+Y_{M}$	(20)
$\displaystyle=$	$\displaystyle\int_{\widetilde{\mu}=Y_{M}-\epsilon}^{Y_{M}}\left(\frac{Y_{M}-% \left(\frac{\widetilde{\sigma}}{s_{1}}(\widetilde{\mu}-Y_{M})+Y_{M}\right)}{% \widetilde{\sigma}}+exp\left(-\frac{Y_{M}-\left(\frac{\widetilde{\sigma}}{s_{1% }}(\widetilde{\mu}-Y_{M})+Y_{M}\right)}{\widetilde{\sigma}}\right)-1\right)\ % \mathrm{d}\left(\frac{\widetilde{\sigma}}{s_{1}}(\widetilde{\mu}-Y_{M})+Y_{M}\right)$
$\displaystyle=$	$\displaystyle\frac{\widetilde{\sigma}}{s_{1}}\int_{\widetilde{\mu}=Y_{M}-% \epsilon}^{Y_{M}}\left(\frac{Y_{M}-\widetilde{\mu}}{s_{1}}+exp\left(-\frac{Y_{% M}-\widetilde{\mu}}{s_{1}}\right)-1\right)\ \mathrm{d}\widetilde{\mu}$
$\displaystyle=$	$\displaystyle\left.\frac{\widetilde{\sigma}}{s_{1}}\left(-\frac{s_{1}}{2}\left% (\frac{Y_{M}-\widetilde{\mu}}{s_{1}}\right)^{2}+s_{1}\cdot exp\left(-\frac{Y_{% M}-\widetilde{\mu}}{s_{1}}\right)-\widetilde{\mu}\right)\right\|^{Y_{M}}_{Y_{M}% -\epsilon}$
$\displaystyle=$	$\displaystyle\widetilde{\sigma}\left(\frac{\epsilon^{2}}{2s_{1}^{2}}-\frac{% \epsilon}{s_{1}}-exp\left(-\frac{\epsilon}{s_{1}}\right)+1\right)$

The area integral of $R_{2}$ is as follows:

	$\displaystyle\int_{\widetilde{\mu}_{s}=Y_{M}}^{Y_{M}+\frac{\widetilde{\sigma}}% {s_{2}}\epsilon}obj_{M}\left(\widetilde{\mu}_{s}\right)\ \mathrm{d}\widetilde{% \mu}_{s},\ \widetilde{\mu}_{s}=\frac{\widetilde{\sigma}}{s_{2}}(\widetilde{\mu% }-Y_{M})+Y_{M}$	(21)
$\displaystyle=$	$\displaystyle\int_{\widetilde{\mu}=Y_{M}}^{Y_{M}+\epsilon}\left(\frac{Y_{M}-% \left(\frac{\widetilde{\sigma}}{s_{2}}(\widetilde{\mu}-Y_{M})+Y_{M}\right)}{% \widetilde{\sigma}}+exp\left(-\frac{Y_{M}-\left(\frac{\widetilde{\sigma}}{s_{2% }}(\widetilde{\mu}-Y_{M})+Y_{M}\right)}{\widetilde{\sigma}}\right)-1\right)\ % \mathrm{d}\left(\frac{\widetilde{\sigma}}{s_{2}}(\widetilde{\mu}-Y_{M})+Y_{M}\right)$
$\displaystyle=$	$\displaystyle\frac{\widetilde{\sigma}}{s_{2}}\int_{\widetilde{\mu}=Y_{M}}^{Y_{% M}+\epsilon}\left(\frac{Y_{M}-\widetilde{\mu}}{s_{2}}+exp\left(-\frac{Y_{M}-% \widetilde{\mu}}{s_{2}}\right)-1\right)\ \mathrm{d}\widetilde{\mu}$
$\displaystyle=$	$\displaystyle\left.\frac{\widetilde{\sigma}}{s_{2}}\left(-\frac{s_{2}}{2}\left% (\frac{Y_{M}-\widetilde{\mu}}{s_{2}}\right)^{2}+s_{2}\cdot exp\left(-\frac{Y_{% M}-\widetilde{\mu}}{s_{2}}\right)-\widetilde{\mu}\right)\right\|^{Y_{M}+% \epsilon}_{Y_{M}}$
$\displaystyle=$	$\displaystyle\widetilde{\sigma}\left(-\frac{\epsilon^{2}}{2s_{2}^{2}}-\frac{% \epsilon}{s_{2}}+exp\left(\frac{\epsilon}{s_{2}}\right)-1\right)$

In order to achieve equality between the areas of $R_{1}$ and $R_{2}$ , we obtain the following equation:

		$\displaystyle\widetilde{\sigma}\left(\frac{\epsilon^{2}}{2s_{1}^{2}}-\frac{% \epsilon}{s_{1}}-exp\left(-\frac{\epsilon}{s_{1}}\right)+1\right)=\widetilde{% \sigma}\left(-\frac{\epsilon^{2}}{2s_{2}^{2}}-\frac{\epsilon}{s_{2}}+exp\left(% \frac{\epsilon}{s_{2}}\right)-1\right)$		(22)
		$\displaystyle\frac{\epsilon^{2}}{2s_{1}^{2}}+\frac{\epsilon^{2}}{2s_{2}^{2}}-% \frac{\epsilon}{s_{1}}+\frac{\epsilon}{s_{2}}-exp\left(-\frac{\epsilon}{s_{1}}% \right)-exp\left(\frac{\epsilon}{s_{2}}\right)+2=0$		(22)

The parameter $\epsilon$ is associated with the prediction accuracy. The higher the prediction accuracy is, the smaller the error is. The closer the predicted $\widetilde{\mu}$ will be to the real target $Y_{M}$ , the smaller the area we need to consider, that is, the smaller $\epsilon$ will be. On the contrary, when the prediction accuracy is very low, the $\widetilde{\mu}$ predicted by the model is relatively far away from $Y_{M}$ , so the $\epsilon$ we need to consider must be appropriately increased.

In the actual training, we select $\epsilon=0.1\widetilde{\sigma}$ as our choice, because of the observation that the relative error of the model’s predictions for extreme values is often below 10% (Bellprat and Doblas-Reyes 2016) (Fang et al. 2021). Therefore it is sufficient to consider an error margin of 10 $\%$ . On the other hand, if the error range considered is too small, the scaling effect will not be obvious. When we set $\epsilon=0.1\widetilde{\sigma}$ , the solution to Equation 22 remains non-unique. Since our primary interest lies in the relative relationship between $s_{1}$ and $s_{2}$ , it is reasonable to fix one of them first. We fix $s_{2}$ to $\widetilde{\sigma}$ , and calculate that $s_{1}=0.9\widetilde{\sigma}$ approximately.

Thus we can obtain the specific form of the scaling function, depicted below:

\widetilde{\mu}_{s}=\left\{\begin{matrix}\begin{aligned} &\frac{10}{9}(% \widetilde{\mu}-Y_{M})+Y_{M},\ \widetilde{\mu}\leq Y_{M}\\ &\widetilde{\mu},\ \widetilde{\mu}>Y_{M}\\ \end{aligned}\end{matrix}\right.

(23)

Subtract $Y_{M}$ from both sides to get:

	$\displaystyle\widetilde{\mu}_{s}-Y_{M}$	$\displaystyle=\left\{\begin{matrix}\frac{10}{9}(\widetilde{\mu}-Y_{M}),\ % \widetilde{\mu}\leq Y_{M}\\ \widetilde{\mu}-Y_{M},\ \widetilde{\mu}>Y_{M}\\ \end{matrix}\right.$		(24)
	$\displaystyle(\widetilde{\mu}_{s}-Y_{M})^{2}$	$\displaystyle=\left\{\begin{matrix}\frac{100}{81}(\widetilde{\mu}-Y_{M})^{2},% \ \widetilde{\mu}\leq Y_{M}\\ (\widetilde{\mu}-Y_{M})^{2},\ \widetilde{\mu}>Y_{M}\\ \end{matrix}\right.$		(24)

Similarly, for the prediction of minimum values $Y_{m}$ , we can also apply a similar scaling function:

(\widetilde{\mu}_{s}-Y_{m})^{2}=\left\{\begin{matrix}\begin{aligned} &\frac{10% 0}{81}(\widetilde{\mu}-Y_{m})^{2},\ \widetilde{\mu}\geq Y_{m}\\ &(\widetilde{\mu}-Y_{m})^{2},\ \widetilde{\mu}<Y_{m}\\ \end{aligned}\end{matrix}\right.

(25)

Finally, merge the two scaling functions into $S(\hat{Y},Y)$ :

S(\hat{Y},Y)=\left\{\begin{matrix}\begin{aligned} &\frac{100}{81},\ (\hat{Y}% \geq Y\ and\ Y<Y_{10th})\ or\ (\hat{Y}\leq Y\ and\ Y>Y_{90th})\\ &1,\ else\\ \end{aligned}\end{matrix}\right.

(26)

where $\hat{Y}$ is the prediction and $Y$ is the target. $Y_{10th}$ and $Y_{90th}$ correspond to the 10th and 90th percentiles in the sample, respectively, and they can serve as thresholds for extreme values (such as extremely low or high temperatures). The expression $(\hat{Y}\geq Y\ and\ Y<Y_{10th})\ or\ (\hat{Y}\leq Y\ and\ Y>Y_{90th})$ refers to the situation where the true value meets the threshold of extreme values, but the model gives an underestimated prediction. By applying the scaling function $S(\cdot)$ to the loss function, we obtain Exloss. It is noticeable that the model increases the loss when underestimating extremes, thereby imposing a higher penalty.

Exloss\left(\hat{\mathcal{Y}},\mathcal{Y}\right)=\left\|S(\hat{\mathcal{Y}},% \mathcal{Y})\odot\left(\hat{\mathcal{Y}}-\mathcal{Y}\right)\right\|^{2}

(27)

Appendix C ExBooster

The python code of ExBooster (PyTorch implementation) is implemented as follows:

{python}

def SortIndex(input): ”’ Get the sorting ranking corresponding to each element in tensor. Args: input (torch.Tensor): The input tensor. Returns: torch.Tensor: Tensor of the same size as input. The ranking of each element in the tensor, from smallest to largest. Example: input=torch.tensor([1.2, 1.5, 0.8, 0.9]) return torch.tensor([2, 3, 0, 1]) ”’ sorted_indices = torch.argsort(input, dim=-1) ranks = torch.argsort(sorted_indices, dim=-1) return ranks

{python}

def ExBooster(pred, sampling_nums=50, noise_scale=0.1): ”’ Apply ExBooster to the predictions. Args: pred (torch.Tensor): Tensor of size [B, C, H, W]. The input predictions. sampling_nums (int): Number of samplings (default: 50). noise_scale (float or torch.Tensor): Scaling factor for the noise. It can be a real number or a tensor of size [B, C, H, W] (default: 0.1). Returns: torch.Tensor: Tensor of size [B, C, H, W]. The extreme boosted predictions. ”’ B, C, H, W = pred.shape

scale = noise_scale * torch.ones_like(pred) # [B, C, H, W]

# Get index idx = SortIndex(pred.flatten(2,3)) # [B, C, H*W]

# Sample pred = pred.unsqueeze(2) # [B, C, 1, H, W] scale = scale.unsqueeze(2) # [B, C, 1, H, W] disturbance = torch.randn(B, C, sampling_nums, H, W, device=pred.device) * scale ens = pred + disturbance # [B, C, sampling_nums, H, W]

# Sort ensembles sorted_ens, _= torch.sort(ens.flatten(2,4)) # [B, C, sampling_nums*H*W] sorted_ens = sorted_ens.reshape(B, C, H*W, sampling_nums) # [B, C, H*W, sampling_nums]

# Partition and median k = int(0.5 * sampling_nums) # sampling_nums / 2 sorted_ens_mid, _= torch.kthvalue(sorted_ens, k, -1) # [B, C, H*W]

# Restore by index ens_from_idx = torch.gather(sorted_ens_mid, dim=-1, index=idx) # [B, C, H*W] out = ens_from_idx.reshape(B, C, H, W) # [B, C, H, W]

return out

Appendix D Training Details

Our model is trained for 2 weeks using 32 A100 GPUs, and the batch size is 32. The number of learnable parameters of $M_{d}$ are 570M, the learning rate of stage 1 is 2.5e-4, and the learning rate of stage 2 is 5e-6. The number of learnable parameters of $M_{g}$ are 140M, and the learning rate is 5e-4. The complete hyperparameter settings are shown in the table below.

Hyperparameter	Value
Batch size	32
Learning rate of stage 1	2.5e-4
Learning rate of stage 2	5e-6
Learning rate of stage 3	5e-4
Learning rate schedule	Cosine
Optimizer	AdamW
Patch size	8x8
Embedding dimension	1152
Attention type	Swin window
Attention heads	6
Window size	6x12
Dropout	0
MLP ratio	4
Activation function	GLUE
Data size	[B,69,721,1440]

Table 3: Hyperparameter.

The process (including training and testing) of our model can be divided into four distinct stages:

•

Stage 1: One-Step pretraining. Use MSE loss to train $M_{d}$ learning mapping from $\mathcal{X}^{i}$ to $\mathcal{X}^{i+1}$ as existing data-driven global weather forecasting models.
•

Stage 2: Muti-Step finetuning. Use the proposed Exloss to finetune $M_{d}$ learning autoregressive forecasting. This stage improves not only the accuracy of the model’s multi-step predictions, but also the ability to predict extreme weather due to the use of Exloss.
•

Stage 3: Diffusion training. Use Exloss to train $M_{g}$ learning adding details to the output of $M_{d}$ while freezing the parameters of $M_{d}$ .
•

Stage 4: Inference. In this stage, the ExBooster module that does not require training will be added for complete weather forecast.

We use Exloss in Stage 2 and Stage 3 of model training.

In Stage 2, the model will perform multi-step autoregressive fineutne on $M_{d}$ through the replay buffer (Chen et al. 2023a). We employ Exloss as the loss function in this stage to help the model achieve accurate multi-step predictions for extreme weather events.

In Stage 3, We freeze the parameters of $M_{d}$ and train the diffusion model $M_{g}$ to generate the land-surface variables (u10, v10, t2m) with more details. In order to further improve the expression of diffusion for extreme values, we apply Exloss in diffusion to directly optimize extreme values and avoid losing details as prediction time increases.

Loss(\hat{\varepsilon},\varepsilon)=\left\|\hat{\varepsilon}-\varepsilon\right% \|^{2}+Exloss(\hat{\varepsilon},\varepsilon)

(28)

where $\hat{\varepsilon}$ and $\varepsilon$ represent the predicted Gaussian noise (Yang et al. 2023) and the actual added noise in diffusion.

Finally, concatenate the not-land-surface variables (msl, z, q, u, v, t) of $M_{d}$ ’s output with the land-surface variables of $M_{g}$ ’s output to obtain more detailed forecasts.

Appendix E Module Processing Visualization

As described in Section D, our model comprises two trainable modules, namely the deterministic model (denoted as $M_{d}$ ) and the probabilistic generation model (denoted as $M_{g}$ ), along with the training-free ExBooster module. The outputs of these modules are denoted as $P_{d}$ , $P_{g}$ , and $P_{e}$ respectively. By visualizing the difference between $P_{g}$ and $P_{d}$ , we can observe the impact of the diffusion model, while visualizing the difference between $P_{e}$ and $P_{g}$ allows us to examine the effect of the ExBooster module.

Two key observations can be derived from Figure 12. Firstly, both the diffusion and ExBooster modules exert similar influences on the final output. This indicates that our distinct modules show consistency and do not adversely affect one another. Secondly, both the diffusion and ExBooster modules result in increased temperatures in mid-latitudes and decreased temperatures in high-latitudes (Richard et al. 2013) (Francis and Vavrus 2012). Collectively, this contributes to a certain level of extreme weather events at the global scale.

As depicted in Figure 12, both the diffusion and ExBooster modules exhibit an amplifying effect on wind speed, consequently leading to an increase in extreme wind speed. Furthermore, in the vicinity of cyclones (Gray 1977) exhibiting higher wind speeds, we observe a pronounced intensification of the enhancement effect facilitated by the diffusion and ExBooster modules. This notable enhancement is particularly advantageous in the accurate prediction of typhoons and their associated wind speeds (Wei et al. 2018) (Yang and Tsai 2019).