Impact Statement
Estimating conditional probabilities and rate of returns of extreme heat waves is important for climate change risk assessments. The most impactful heat waves are also long lasting. Numerical weather and climate models are expensive to run and may have biases, while observational datasets are too short to observe many of the potential extreme events and sample them accordingly. We present a relatively inexpensive yet powerful datadriven tool for sampling and estimating probabilities of extreme prolonged heat waves and compare it with existing datadriven statistical approaches such as convolutional neural networks and extreme value theory estimations.
1. Introduction
It is expected that heat waves will be among the most impactful effects of climate change (Russo et al., Reference Russo, Sillmann and Fischer2015; Seneviratne et al., Reference Seneviratne, Zhang, Adnan, Badi, Dereczynski, Di Luca, Ghosh, Iskandar, Kossin, Lewis, Otto, Pinto, Satoh, VicenteSerrano, Wehner and Zhou2021) in the 21st century. While, generally, climate scientists have warned that global warming will increase the rate of heat waves, in some regions such as Europe, the trends have accelerated beyond what was expected (Christidis et al., Reference Christidis, McCarthy and Stott2020). Extreme heat waves, such as the Western European heat wave in 2003 (GarcíaHerrera et al., Reference GarcíaHerrera, Díaz, Trigo, Luterbacher and Fischer2010) and the Russian heat wave in 2010 (Barriopedro et al., Reference Barriopedro, Fischer, Luterbacher, Trigo and GarcíaHerrera2011; Chen et al., Reference Chen, Kuang, Wang, Chen, Han and Fan2022) had consequences on agriculture and posed health hazards. Such events have also affected other regions including Asia, for example, the Korean Peninsula (Choi et al., Reference Choi, Ho, Jung, Chang and Ha2021) in the summers of 2013, 2016, and 2018 (Min et al., Reference Min, Kim, Lee, Sparrow, Li, Lott and Stott2020). Thus, understanding the potential drivers, forecasting, and performing longterm projections of heat waves is necessary. These tasks, also relevant for climate change attribution studies (National Academies of Sciences Engineering and Medicine, 2016), are difficult since they require computation of small probabilities and thus massive ensemble simulations with expensive models or drawing conclusions from the relatively short observational record.
The possibility of using atmospheric circulation analogs for predicting weather dates back to the works of Bjerknes (Reference Bjerknes1921) and Lorenz (Reference Lorenz1969). The idea is quite intuitive: one expects weather patterns to repeat given similar initial conditions. However, estimating the amount of data (a catalog of analogs) actually necessary for shortrange predictions revealed the superiority of physicsbased numerical weather prediction (NWP) models (van den Dool, Reference van den Dool2007; Balaji, Reference Balaji2021). On the other hand, for predicting certain types of largescale patterns, analog forecasting may require less resources and can be competitive with or superior to NWP on time scales longer than Lyapunov time, such as subseasonaltoseasonal scales (Van den Dool et al., Reference Van den Dool, Huang and Fan2003; Cohen et al., Reference Cohen, Coumou, Hwang, Mackey, Orenstein, Totz and Tziperman2019) and especially longrange predictions. The examples include prediction of ENSO (Ding et al., Reference Ding, Newman, Alexander and Wittenberg2019; Wang et al., Reference Wang, Slawinska and Giannakis2020) and Madden–Julian oscillation (Krouma et al., Reference Krouma, Silini and Yiou2023).
Analog forecasting involves constructing potential trajectories that the system could have explored from a given state. These methods, which assume a Markov property (Yiou, Reference Yiou2014), belong to the family of methods referred to as stochastic weather generators (SWGs). SWGs can be used as climate model emulators due to their ability to generate long sequences that can infer statistical properties of spatiotemporal dynamics and correlation structures of the system without running expensive general circulation models (GCMs) (Ailliot et al., Reference Ailliot, Allard, Monbet and Naveau2015). Other applications include downscaling (Wilks, Reference Wilks1992; Rajagopalan and Lall, Reference Rajagopalan and Lall1999) and data assimilation (Lguensat et al., Reference Lguensat, Tandeo, Ailliot, Pulido and Fablet2017).
In the last decade, there has been a rapid advancement of other types of datadriven forecasting/emulators in earth system models based on deep learning (Reichstein et al., Reference Reichstein, CampsValls, Stevens, Jung, Denzler, Carvalhais and Prabhat2019). For example, in a seminal work (Ham et al., Reference Ham, Kim and Luo2019), convolutional neural network (CNN) has achieved a positive skill for El Niño–Southern Oscillation (ENSO) prediction compared to NWP. These success stories are sometimes accompanied by similar skills displayed by the analog method (Ding et al., Reference Ding, Newman, Alexander and Wittenberg2019; Wang et al., Reference Wang, Slawinska and Giannakis2020). Recently, deep learning has been used to predict extreme heat waves (Chattopadhyay et al., Reference Chattopadhyay, Nabizadeh and Hassanzadeh2020; JacquesDumas et al., Reference JacquesDumas, Ragone, Borgnat, Abry and Bouchet2022; LopezGomez et al., Reference LopezGomez, McGovern, Agrawal and Hickey2022) targeting categorical scores related to hit ratesFootnote ^{1}. In order to provide probabilistic prediction, the following studies were performed using random forests (van Straaten et al., Reference van Straaten, Whan, Coumou, van den Hurk and Schmeits2022) and neural networks (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a). Probabilistic forecasting in the context of uncertainty quantification plays an important role in the current development of machine learningdriven techniques applied to, for instance, postprocessing of weather forecasts (Grönquist et al., Reference Grönquist, Yao, BenNun, Dryden, Dueben, Li and Hoefler2021; Schulz and Lerch, Reference Schulz and Lerch2022).
Geopotential height anomalies and soil moisture were chosen as the inputs in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) based on physical understanding of the precursors to heat waves. Persistent positive geopotential height anomalies are known drivers, since they favor clear skies and produce subsidence. It has been argued that soil moisture has memory of previous landatmospheric conditions (Koster and Suarez, Reference Koster and Suarez2001; Seneviratne et al., Reference Seneviratne, Koster, Guo, Dirmeyer, Kowalczyk, Lawrence, Liu, Mocko, Lu, Oleson and Verseghy2006). In fact, soil moisture was identified (Seneviratne et al., Reference Seneviratne, Nicholls, Easterling, Goodess, Kanae, Kossin, Luo, Marengo, McInnes, Rahimi, Reichstein, Sorteberg, Vera, Zhang, Rusticucci, Semenov, Alexander, Allen, Benito, Cavazos, Clague, Conway, DellaMarta, Gerber, Gong, Goswami, Hemer, Huggel, van den Hurk, Kharin, Kitoh, Klein Tank, Li, Mason, McGuire, van Oldenborgh, Orlowsky, Smith, Thiaw, Velegrakis, Yiou, Zhang, Zhou, Zwiers, Field, Barros, Stocker and Dahe2012) among the important drivers for European heat waves. In contrast, in Northern European regions, soil moisture may play a lesser role in preconditioning (Felsche et al., Reference Felsche, Böhnisch and Ludwig2023). This association between soil moisture deficits and heat wave occurrence has also been indicated elsewhere (Shukla and Mintz, Reference Shukla and Mintz1982; Rowntree and Bolton, Reference Rowntree and Bolton1983; D’Andrea et al., Reference D’Andrea, Provenzale, Vautard and De NobletDecoudré2006; Fischer et al., Reference Fischer, Seneviratne, Vidale, Lüthi and Schär2007; Vautard et al., Reference Vautard, Yiou, D’Andrea, de Noblet, Viovy, Cassou, Polcher, Ciais, Kageyama and Fan2007; Lorenz et al., Reference Lorenz, Jaeger and Seneviratne2010; Seneviratne et al., Reference Seneviratne, Corti, Davin, Hirschi, Jaeger, Lehner, Orlowsky and Teuling2010; Hirschi et al., Reference Hirschi, Seneviratne, Alexandrov, Boberg, Boroneant, Christensen, Formayer, Orlowsky and Stepanek2011; Stéfanon et al., Reference Stéfanon, D’Andrea and Drobinski2012; Miralles et al., Reference Miralles, Teuling, van Heerwaarden and VilàGuerau de Arellano2014, Reference Miralles, Gentine, Seneviratne and Teuling2019; Schubert et al., Reference Schubert, Wang, Koster, Suarez and Groisman2014; Zhou et al., Reference Zhou, Williams, Berg, Cook, Zhang, Hagemann, Lorenz, Seneviratne and Gentine2019; Vargas Zeppetello and Battisti, Reference Vargas Zeppetello and Battisti2020; Benson and Dirmeyer, Reference Benson and Dirmeyer2021; Zeppetello et al., Reference Zeppetello, Battisti and Baker2022), contributing to the understanding that in dry regimes heat waves could be amplified due to the impacts of evapotranspiration. In contrast, the SWG, designed for heat waves, did not take this crucial input and thus was not able to reproduce the corresponding feedback (Jézéquel et al., Reference Jézéquel, Yiou and Radanovics2018). However, it is important to better understand the contributions of such drivers to improve projected climate change impacts on heat wave probabilities (Field et al., Reference Field, Barros, Stocker and Dahe2012; Berg et al., Reference Berg, Lintner, Findell, Seneviratne, van den Hurk, Ducharne, Chéruy, Hagemann, Lawrence, Malyshev, Meier and Gentine2015; Horton et al., Reference Horton, Mankin, Lesk, Coffel and Raymond2016). For instance, using circulation analogs in a preindustrial simulation, Horowitz et al. (Reference Horowitz, McKinnon and Simpson2022) showed that circulation patterns explain around 80% of the temperature anomalies in the United States, while negative soil moisture anomalies added a significant positive contribution, especially midcontinent.
Recently, the analog method was successfully adapted for predicting chaotic transitions in a lowdimensional system (Lucente et al., Reference Lucente, Herbert and Bouchet2022a). In fact, the analog method is generally expected to perform better when the number of relevant degrees of freedom is not too high. This is natural given that it belongs to the family of $ k $ nearest neighbors algorithms, which suffer from the curse of dimensionality (Beyer et al., Reference Beyer, Goldstein, Ramakrishnan, Shaft, Beeri and Buneman1999). Consequently, for our problem of heat wave prediction, we will consider linear and nonlinear dimensionality reduction techniques and evaluate the performance of SWG in real versus latent space. This approach is partially motivated by the emergence of generative modeling for climate and weather applications, for example, studies combining deep learning architectures with extreme value theory (EVT) for generating extremes (Bhatia et al., Reference Bhatia, Jain and Hooi2021; Boulaguiem et al., Reference Boulaguiem, Zscheischler, Vignotto, van der Wiel and Engelke2022) and realistic climate situations (Besombes et al., Reference Besombes, Pannekoucke, Lapeyre, Sanderson and Thual2021).
Beyond finitehorizon probabilistic prediction, datadriven methods can be used to create emulators capable of extracting risks for “blackswan” events, events that are so far removed from the climatology that they have not been observed yet but their impact may be devastating. Such risk assessments are often carried out using EVT based on reanalysis, with long runs of highfidelity models (methods such as ensemble boosting (Gessner et al., Reference Gessner, Fischer, Beyerle and Knutti2021) or rare event algorithms (Ragone et al., Reference Ragone, Wouters and Bouchet2018; Ragone and Bouchet, Reference Ragone and Bouchet2021)). More recently, other statistical approaches such as Markov state models (Finkel et al., Reference Finkel, Gerber, Abbot and Weare2023) have been used to estimate the rates of extreme sudden stratospheric warming events based on ensemble hindcasts of European Center of Medium Range Weather Forecasting.
The first goal of this study is to compare the performance of two datadriven probabilistic forecasting approaches for extreme heat waves in two European regions: France and Scandinavia. We aim to explore the use of SWG, which relies on the method of analog Markov chain, optimize it for our task, and compare or combine it with more modern deep learning approaches such as CNN. We will investigate ways to accelerate the computation of analogs using dimensionality reduction techniques, such as training a variational autoencoder (VAE) to project the state of the system to a smalldimensional latent space. Our second goal will consist of generating synthetic time series with the help of the SWG trained on a short model run and comparing statistics of underresolved extreme events (in this short run) to a long control run. We will work with PlaSim data, which is an intermediate complexity GCM. The choice of the right metrics is important since many of the current datadriven forecasts are trained based on mean square error, which is not suited for the evaluation of extremes. However, extremes actually represent the most immediate societal risks (Watson, Reference Watson2022) and are a subject of this manuscript. The choice of GCM is motivated by the ability of PlaSim to generate inexpensive long runs and the recent study of Miloshevich et al. (Reference Miloshevich, RoubyPoizat, Ragone and Bouchet2023b), who showed that the composite statistics of the largescale 500 hPa geopotential height fields, conditioned on heat waves, revealed similar patterns for PlaSim as for higher fidelity models such as CESM and the ERA5 reanalysis.
The paper is organized as follows. Section 2.1 describes the details of our dataset generated by PlaSim simulation. Section 2.2 defines the notation and heat wave events. Section 2.3 delineates the goal of the probabilistic inference, the scoring function that determines the goodness of the predictions, the training and validation protocol. Section 2.4 reviews the CNN introduced in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a). Section 2.5 introduces the analog Markov chain, that is, SWG, and the corresponding steps involving coarsegraining, definition of metric, how to make probabilistic prediction with SWG, and how to construct return time plots. Section 3 spells out the results consisting of Section 3.1 that covers probabilistic forecasting and Section 3.2 discusses extending return time plots and teleconnections. Finally, Section 4 concludes the findings of this study.
2. Methods and data
2.1. PlaSim model data
The data have been generated from 80 batchesFootnote ^{2} that are independent 100yearlong stationary simulations of PlaSim (Fraedrich et al., Reference Fraedrich, Jansen, Kirk, Luksch and Lunkeit2005, Reference Fraedrich, Kirk and Lunkeit1998). PlaSim is an intermediate complexity GCM, suitable for methodological developments, such as the one performed here. It can be run to generate long simulations at a lower computational cost than the new generation of CMIP6 ensemble.
PlaSim consists of an atmospheric component coupled to ice, ocean, and land modules. The atmospheric component solves the primitive equations of fluid dynamics adapted to the geometry of the Earth in the spectral space. Unresolved processes such as boundary heat fluxes, convection, clouds, and so forth are parameterized. The interactions between the soil and the atmosphere are crucial for this study. They are governed by the bucket model of Manabe (Reference Manabe1969), which controls the soil water content by replenishing it from precipitation and snowmelt and depleting by the surface evaporation. The soil moisture capacity is prescribed based on geographical distribution.
The model was run with conditions corresponding to the Earth climate in 1990s with fixed greenhouse gas concentrations and boundary conditions, which include incoming solar radiation, sea surface temperatures, and sea ice cover that are cyclically repeated each year. The parameters are chosen to reproduce the climate of the 1990s. We also include the daily cycle in this work (like in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a)). The model is run at T42 resolution, which corresponds to cells that are 2.8 by 2.8 degrees and results in 64 by 128 resolution over the whole globe. The vertical resolution is 10 layers. The fields are sampled every 3 hours and daily averages are taken.
In this paper, we work with different subsets of the full 8000yearlong dataset defined in Table 1.
2.2. Physical fields and geographical domains
The inputs to our various data pipelines will consist of fields $ \mathrm{\mathcal{F}}\left(\mathbf{r},t\right) $ such as

• $ \mathcal{T} $ : 2meter temperature

• $ \mathcal{Z} $ : 500 hPa geopotential height

• $ \mathcal{S} $ : soil moisture
at time $ t $ , and discrete points $ \mathbf{r}\in \unicode{x1D53B} $ , where $ \unicode{x1D53B} $ represents the domain of interest. We will consider three domains:

• $ {\unicode{x1D53B}}_{NAE} $ : North Atlantic, Europe (24 by 48 cells);

• $ {\unicode{x1D53B}}_{France} $ : the area of France masked over the land;

• $ {\unicode{x1D53B}}_{Scandinavia} $ : the area of Scandinavia masked over the land.
The areas of France and Scandinavia that will correspond to the geographical areas of heat waves are defined in Figure 1. Datadriven methods that will be introduced below will accept typically some stacked version of the fields that will be represented by the letter $ \mathcal{X} $
where each component corresponds to a specific field.
2.3. Definition of heat wave events
As in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a), we are concerned with the prediction of $ T=15 $ daylong heat waves. Thus, our events of interest are related to the time and space integrated anomaly of 2meter temperature ( $ \mathcal{T} $ ), where we use symbol $ \unicode{x1D53C} $ to imply average over the years conditioned to the calendar day:
which depending on the threshold $ A(t)\ge \alpha $ defines a class of heat waves. Symbol $ \unicode{x1D53B} $ corresponds to the domain of interest described in Section 2.2. $ \left\unicode{x1D53B}\right $ is the surface area of $ \unicode{x1D53B} $ and equation (2) contains an area weighted average. $ T $ is the chosen duration (in days) of the heat waves. For probabilistic prediction, we set $ T=15 $ days, while for return times we relax this requirement to study a family of return time plots.
The paper will involve two slightly different definitions of heat waves based on equation (2) depending on the task that will be performed. In the next Section 2.4.1, we will introduce probabilistic prediction, for which heat waves will be defined as exceeding a certain fixed threshold $ a $ set fixed for all heat waves, which allows many heat waves per summer. Concurrently, in Section 2.6.6, we will work with a blockmaximum definition, which permits only single heat wave per summer, so each year gets unique value $ a $ . Moreover, a range of $ T $ values will be considered.
2.4. Probabilistic prediction and validation
2.4.1. Committor function
Our goal is a probabilistic forecast, a committor function (Lucente et al., Reference Lucente, Herbert and Bouchet2022a) in the nomenclature of rare event simulations. Based on equation (10), we define the label $ y=y(t) $ for each state as a function of the corresponding time $ t $ as $ y(t)\in \left\{0,1\right\} $ , so that $ y=1 $ iff $ A>a $ , where $ a $ is 95th percentile of $ A $ . Our objective is to find the probability $ p $ of $ y=1 $ at time $ t $ given the state $ \mathcal{X} $ which we observe at an earlier time $ t\tau $ (see Section 2.6.3):
Parameter $ \tau $ is referred to as lead time or sometimes lag time. We stress that for nonzero $ \tau $ it is the initial observed state $ \mathcal{X} $ that is shifted in time, rather than the labels $ y $ , thus the events of interest (both $ y=1 $ or $ y=0 $ ) are kept the same, allowing controlled comparisons among different values of $ \tau $ . Moreover, the season of interest for which we intend to compute the committor functions and compare them to the ground truth should always involve the same days: from June 1 to August 16, because the last potential heat wave may last $ T=15 $ days and, this way, end in August 30 (the last day of summer in PlaSim).
2.4.2. Normalized logarithmic skill score
The quality of the prediction will be measured with a normalized logarithmic score (NLS) (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) due its attractive properties for rare events (Benedetti, Reference Benedetti2010) and the appropriateness for probabilistic predictions (Wilks, Reference Wilks2019). In the following, we will briefly introduce the NLS. Since the dataset is composed of $ N $ states $ {\left\{{\mathcal{X}}_n\right\}}_{1\ge n\ge N} $ , each of them labeled by $ {y}_n\in \left\{0,1\right\} $ , it can be equivalently represented by $ N $ independent pairs $ \left({\mathcal{X}}_n,{y}_n\right) $ . The aim of the prediction task is to provide an estimate $ {\hat{p}}_y $ of the unknown probability $ {p}_{\overline{y}}(X)=\mathrm{\mathbb{P}}\left(y=\overline{y}\hskip0.4em \hskip0.1em \mathcal{X}=X\right) $ . It has been argued (Benedetti, Reference Benedetti2010; Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) that among the possible scores to assess the quality of a probabilistic prediction, the most suitable is the logarithmic score
It should be noted that the logarithmic score coincides with the empirical cross entropy widely used in machine learning. The score $ {S}_N $ is positive and negatively oriented (the smaller the better). The value of $ {S}_N $ is not indicative in itself, but is useful for comparing the quality of two different predictions. Thus, the NLS is defined as an affine transformation of the logarithmic score, which allows us to compare the datadriven forecast with the climatological one. To be more precise, let $ \overline{p} $ be the climatological frequency of the extremes ( $ \overline{p}=\mathrm{\mathbb{P}}\left(y=1\right) $ ) and $ {S}_N^{(c)} $ the logarithmic score for this prediction, that is
Then, the NLS score is defined as
Thus, NLS is positively oriented (the larger the better) and bounded above by 1. Since we identify $ y=1 $ with 95th percentile of $ A(t) $ this implies that $ \overline{p}=0.05 $ . In other words, we always compare the results to the baselineFootnote ^{3}, which would result in NLS equal to zero. Any predictions with smaller NLS are completely useless from the probabilistic perspective. Finally, we stress that other scores devised for rare events, such as MCC (Matthews, Reference Matthews1975), most notably used in the study (JacquesDumas et al., Reference JacquesDumas, Ragone, Borgnat, Abry and Bouchet2022) and others, are useful for categorical prediction but do not necessarily translate into high NLS scores when training neural networks, especially if early stoppingFootnote ^{4} is used, since the optimal epoch may vary depending on the chosen score.
2.4.3. Training/validation protocol for committor function
We split the dataset into training (TS) and validation (VS). The ML algorithm of choice is trained on TS and various hyperparameters are optimized on VS depending on the target metric (see Section 2.4.2). Different methods can be compared on VS, which will be performed in this paper. The splits will be based on Table 1 and correspond to the selection of the first number of years, for example, D100 implies choosing the first 100 years of the simulation.
To have a better idea about the spread of the skill (equation (6)), we will rely on fivefold crossvalidation. This means that the TS/VS split is performed five times, while sliding the VS window consequently through the full data interval, chosen for the particular study (D100, D500, etc.). Each TS/VS split we refer to as “fold,” that is, there are five folds (numbered 0–4). For instance, for D100, fold number 3, VS starts in year 60 and ends in year 79, while TS is as usual the complement of that set. Notice that years are also numbered from 0. In order to ensure that each VS has the same number of extreme events, we perform custom stratification (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a). This means that we shuffle the years of the simulation between different folds so that each interval receives the same (or almost the same) number of extreme events. The procedure does not modify the mean benchmarks but tends to decrease the error bars, which strongly depend on the number of extreme events within the VS.
2.5. Convolutional neural network
The CNN we take for this study has been proposed by Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a). Since the precise inputs are slightly different in this paper (as well as a whole new region of Scandinavia that is considered in this paper for the first time), the training had to be repeated, but the hyperparameters were left unchanged.
In this manuscript, the CNN accepts an input that is 24 by 48 gridpoints over three fields (see Eq. (1)). We perform masking (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) of the temperature and soil moisture, that is, values outside of the respective heat wave area (France or Scandinavia depending on which heat waves we study, see Figure 1) are set to zero to avoid some spurious correlations. In this paper, we also study the heat waves occurring in Scandinavia, and for consistency perform masking over the corresponding area in that case. The inputs are then normalized so that each gridcell of each field has mean 0 and standard deviation 1.
The input is fed through three convolutional and two maxpooling layers; the intermediate output is flattened and passed through a dense layer, which is connected to the final output. The final output consists of two neurons and a softmax activation, which enforces the outputs within $ \left(0,1\right) $ interval consistent with the domain of probabilities. The architecture is trained with Adam optimizer to minimize crossentropy as is appropriate for classification task and early stopping is applied, which stops the training when the average foldwise NLS reaches the maximum.
2.6. Analog Markov chain
2.6.1. SWG algorithm
The principle of the SWG as described in Yiou (Reference Yiou2014) is based on the analog Markov chain method. The idea is to use the existing observations or model output that catalog the dynamical evolution of the system, the dynamical history, and generate synthetic series.
Algorithm 1: The algorithm generates a single synthetic trajectory by using the analogs of training set (see Table 1). Sets of $ K $ nearest analogues of each sample $ {X}_n $ during dynamical evolution are computed based on the metric given in equation (13). The states whose analogs we compute are constrained by the condition that their calendar days must start on May 15 and end on August 30 (there are 30 days per month in PlaSim). This results in $ {N}_t=105\cdot {N}_{\mathrm{years}} $ . There is additional condition on the upper bound of the calendar day of the analogs of these days which is August 27 preventing SWG from accessing days outside of the summer season, since the time step is $ {\tau}_m=3 $ days. The analogs are computed using parallelized kDTree search implemented in SciPy packages of python once per parameter $ {\alpha}_0 $ (see equation (13) and discussion in Section 2.6.3) and each fold.
The method consists of constructing a catalog of analogs, the states that have similar circulation patterns and other thermodynamic characteristics (in our case temperature and soil moisture). Each state in the catalog comes from the realization of the dynamics (in principle, it could come not only from simulations but also reanalysis). To construct synthetic time series, one starts from any of these states and draws randomly (with uniform probability) one analog from a predefined set of nearest neighbors $ n $ . The similarity between the states is assessed via Euclidean metric in the feature space (Karlsson and Yakowitz, Reference Karlsson and Yakowitz1987; Davis, Reference Davis2002; Yiou, Reference Yiou2014), as defined below (Section 2.6.3). The following step is to lookup in the dynamical history, that is, how that analog evolved dynamically after time $ {\tau}_m $ and use the corresponding state as the next member of the synthetic time series (see Figure 2). Next, the process is repeated. The details of how the algorithm is implemented will become clear in the following sections but they are schematically represented in Figure 2 and encapsulated procedurally in Algorithm 1. The procedure is quite general although it is still possible to modify it in many aspects. For instance, the analogs are not always drawn from the uniform probability (Lall and Sharma, Reference Lall and Sharma1996; Rajagopalan and Lall, Reference Rajagopalan and Lall1999; Buishand and Brandsma, Reference Buishand and Brandsma2001; Yiou, Reference Yiou2014), the catalog of analogs is often constrained by a moving window or the analogs from the same year may be excluded from possible candidates (Yiou, Reference Yiou2014). The common rationale behind our choices is to prioritize methods that demand less resources and use as much information as possible.
2.6.2. Time step and coarsegraining
The first question that arises is what is the appropriate choice for the time step $ {\tau}_m $ of SWG (see schematics in Figure 2). It is natural to use synoptic time scale of cyclones on which the dynamics partially decorrelates, which corresponds to $ {\tau}_m=3 $ days, as demonstrated by inspecting autocovariance Miloshevich et al. (Reference Miloshevich, RoubyPoizat, Ragone and Bouchet2023b). This means that we are constructing a Markov chain where we are selecting a time step consistent with ignoring synoptic timescale correlations. One may expect that the results do not depend too much on $ {\tau}_m $ as long as it does not exceed the total correlation time (see Appendix B for further details). The coarse graining will be applied as follows, $ \tilde{\mathrm{\mathcal{F}}}\left(\mathbf{r},{t}_0\right) $ will be
The tilde will be suppressed without the loss of generality since the coarsegraining will be applied throughout, except in the input to the CNN. For simplicity, we perform coarsegraining in accordance with this time step $ {\tau}_c={\tau}_m $ .
Note that we will also work with scalars obtained as area integrals $ {\mathrm{\mathcal{F}}}_{\unicode{x1D53B}} $ over the area of France $ \unicode{x1D53B}=F $ and Scandinavia $ \unicode{x1D53B}=S $ defined as
This allows us to rewrite the definition of heat waves (equation (2)) in a new, more concise notation
If we take coarsegraining into account and set $ T=n{\tau}_c=15 $ days, we can express equation (2) as discretization
2.6.3. State space and metric
Next, we address how to combine fields of different nature (unit) in SWG. The fields can be stacked in various ways, for instance, tuples can be constructed:
This approach adapted for SWG consisting of integrals over $ \unicode{x1D53B} $ is to be contrasted with the input received by CNN, whereby $ \mathcal{S} $ and $ \mathcal{T} $ are masked as described in Section 2.5 outside areas corresponding to the heat waves, while SWG receives only integrated temperature and soil moisture (over the same heat wave areas).
In the case of SWG, each field will be normalized to global fieldwise standard deviationFootnote ^{5} (with the global fieldwise mean subtracted) as opposed to cellwise like in the case of CNN
The overbar will be omitted in what follows and it will be implied instead.
The Euclidean distance between two points $ {\mathcal{X}}_1 $ and $ {\mathcal{X}}_2 $ is defined as
where $ \mathit{\dim}\left({\mathcal{X}}^i\right) $ counts the number of grid points concerned (1 for scalars such as $ \mathcal{S} $ and $ \mathcal{T} $ and 1152 for $ \mathcal{Z} $ ). When $ {\alpha}_1={\alpha}_2=0 $ , we get definition consistent with (Yiou, Reference Yiou2014), except that we assign uniform probabilities for a set of nearest analogs.
Of a particular interest is the fact that geopotential contains global information on a shorter time scale, while soil moisture contains local information on a longer time scale (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a). In the zerothorder approximation, one expects that for the efficient algorithm, the fields would need to be further rescaled by their dimensionality as is done in equation (13). Thus, we choose parameters $ {\alpha}_i=1 $ as a first guess. To optimize the performance of SWG for the conditional prediction problem (equation (3)), we perform grid search: for each $ \tau $ we compute committor $ p $ and measure its skill (equation (6)) as a function of number of nearest neighbors $ n $ retained and the coupling coefficient for geopotential $ {\alpha}_0 $ , while the other two coefficients are set to one: $ {\alpha}_{1,2}=1 $ .
Other choices for distance function have been explored in the literature, such as Mahalanobis metric (Stephenson, Reference Stephenson1997; Yates et al., Reference Yates, Gangopadhyay, Rajagopalan and Strzepek2003), which is equivalent to performing ZCA “whitening,” that is, a procedure that is a rotation of a PCA components plus normalizing the covariance matrix. We do not explicitly compute Mahalanobis distance in this work, although we do actually compute the Euclidean distance after performing the dimensionality reduction via PCA as will be described below.
2.6.4. Dimensionality reduction
When dealing with highdimensional fields, one often encounters the problem of “curse of dimensionality.” In these cases, it is known that Euclidean distance is poorly indicative regarding the similarity of two data points. It could therefore be useful to project the dynamics onto a reduced space. Furthermore, in this way, the construction of a catalog is much more efficient from a computational point of view. Here, we decided to adopt two different dimensionality reduction techniques, namely a VAE and principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis. In a nutshell, PCA is a linear transformation of the original data, where the samples are projected onto the eigenvectors of the empirical correlation matrix (see Appendix A). Usually, the projections are made only on the eigenvectors corresponding to the largest eigenvalues, as they are the ones that contain most of the variability of the data. However, this variability may not be entirely relevant to the prediction of heat waves, so this criterion is not necessarily useful. The VAE instead is a nonlinear probabilistic projections onto a latent space (usually with Gaussian measure). It consists of probabilistic decoder $ {p}_{\theta}\left(xz\right) $ and probabilistic (stochastic) encoder $ {q}_{\phi}\left(zx\right) $ and the training is performed with the aim of maximize the probability of the data $ p(x) $ (Doersch, Reference Doersch2016) (more details can be found in Appendix A). Here, we adopt a deep fully CNN (FCNN) as an encoder followed by an upsampling FCNN decoder. Once the data have been projected into the latent space through one of the two methods, the SWG can be built exactly as explained above, simply replacing the fields in highdimensional space with lowdimensional ones.
2.6.5. SWG committor
Due to the necessity of transitioning between TS and S (see Figure 2), we require two transition matrices $ {\mathrm{\mathcal{M}}}_{\mathrm{tr}} $ and $ {\mathrm{\mathcal{M}}}_{\mathrm{va}} $ . The former is a straightforward application of what was discussed above, while the latter allows the trajectory starting in the VS to find the appropriate analog in TS. During this procedure (when $ \tau =0 $ ), we retain the mean value of temperature over $ {\tau}_c $ days that was already known in the VS (see Section 2.6.2), and reuse it when computing the synthetic label $ A(t) $ , equation (10). All the remaining entries in $ A(t) $ are based on the temperatures corresponding to the subsequent states visited in the TS.
When $ {\tau}_m=3 $ and $ T=15 $ and $ \tau =0 $ days, we need to evolve the trajectory only four times, thus each state $ s $ can have at most $ {K}^4 $ different values of $ A $ . Therefore, if we denote by $ {N}_a(s) $ the number of trajectories such that $ A $ is greater than $ a $ , the committor function of the state $ s $ will be $ p(s)=\frac{N_a(s)}{K^4} $ . Because we are also interested in larger lead times such as $ \tau =15 $ days we actually need to start the trajectories earlier, which leads to exponentially growing computational trees: with total nine steps of Markov chain necessary to be applied starting from each day in VS. To reduce the computational burden, we use Monte Carlo sampling with 10,000 trajectories at each day from VS rather than exploring the full tree. We do not observe improvements in the skill after refining with more trajectories.
Due to large number of trajectories, we had to parallelize our python code with a Numba package. SWG requires computation of a large matrix of nearest neighbors, for which we use kDTree from scikitlearn package and we run it in parallel on 16 CPU cores of Intel Xeon Gold 6142. Assuming there was no dimensionality reduction performed prior to this operation, the feature space resulting from $ \mathcal{Z} $ field is 1152 dimensional (number of cells) and we run a grid search to find optimal $ {\alpha}_0 $ coefficient in Eq. (2) for each of the five folds. The procedure takes approximately 18 hours when dimensionality reduction (see Appendix A) is not applied.
Finally, because of coarsegraining (see Section 2.6.2) and the definition of labels (equation (8)), special care has to be made when comparing quality of the prediction of (3) with respect to (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a), where $ \mathcal{X} $ was daily. Indeed, the coarsegrained field $ \mathcal{X} $ retains information until $ {\tau}_c $ days later due to forward coarsegraining (equation (7)). Thus, for a fair comparison between different coarsegrained committors, one must consistently shift the lead time $ \tau $ of $ {\tau}_{c_1}{\tau}_{c_2} $ to ensure that fields $ {\mathcal{X}}_1 $ and $ {\mathcal{X}}_2 $ have no information about the future state of the system. In this paper, we consider coarsegraining time of 3 days.
2.6.6. Return time plots
The return time plots are produced using the method described in details in Lestang et al. (Reference Lestang, Ragone, Bréhier, Herbert and Bouchet2018). Here, we will mostly summarize the algorithm. The idea is to estimate the return time of blockmaximum (summer) $ A(t) $ surpassing a particularly high threshold value $ a $ . Thus, the heat wave definition differs slightly with respect to what is used in the committor definition (Section 2.4.1), that is, we do not restrict $ a $ to the 95th percentile, but instead identify the largest value of $ A(t) $ each summer which will correspond to $ a $ of that year. The interval (duration of the block) is defined so that heat wave may start in June 1 and end in August 30, so the interval depends on the duration of the event $ T $ , which for this application may take values other than just 15 days. If $ T=30 $ days, for instance, the last heat wave may start no later than August 1. Next, the yearly thresholds $ a $ ’s are sorted in the descending order from largest to smallest: $ {\left\{{a}_m\right\}}_{1\leqslant m\leqslant M} $ , where $ M $ is the total number of years. With this definition, the most extreme return time, is identical to the length of the dataset (in years) under consideration (Table 1) and the thresholds are ordered $ {a}_1\ge {a}_2\ge \dots \ge {a}_M $ . The simplest approach is to just to associate the rank of the thresholds $ a $ to the inverse return time.
This definition is intuitively reasonable for large $ a $ but runs into problems at the other end of the $ a $ axis.
A more precise estimator uses the assumption of Poisson process $ P(t)=\lambda \exp \left(\lambda t\right) $ approximation (which largely affects the small return times), modified block maximum estimator in Lestang et al. (Reference Lestang, Ragone, Bréhier, Herbert and Bouchet2018) which reads in our notation as
where $ M $ is the length of the dataset in years and $ \operatorname{rank}(a) $ is the order in which the particular year appears in the ordered list of years (by threshold).
2.7. Generalized extreme value fit
We chose the scikitextremes package of Correoso (Reference Correoso2019) to perform the generalized extreme value (GEV) fits. The maximal yearly summer $ A(t) $ is calculated and then fit using maximum likelihood method estimator. Because the initial guess for the parameters is important and can have an impact on the shape parameter of the GEV distribution, linear moments of the distribution are used to obtain those start values. The GEV function is as follows
where $ c $ corresponds to the shape parameter. To estimate the confidence intervals, the delta method is employed. Some years in the simulation tend to be cooler and even with negative temperature anomaly $ {A}_{\mathrm{max}}(T)<0 $ . In fact, there is a handful of such $ {A}_{\mathrm{max}}(T)<0 $ years in the TS of D100. To better estimate the GEV parameters, we remove such years.
3. Results
3.1. Probabilistic forecasting
The first goal is to compare SWG and CNN as tools for probabilistic heat wave forecasting. This is done in a framework depicted schematically in Figure 3.
3.1.1. Comparisons with CNN
Here, we plot the NLS (equation (6)) comparing the predictions of CNN (orange curve) versus SWG (blue curve) described in Sections 2.5 and 2.6 for France (top panels) and Scandinavia (bottom panels), respectively, in Figure 4. The comparison is made on D500 (see Table 1) but similar results are obtained considering shorter dataset (see Appendix C). The left panels are plotted as a function of lead time $ \tau $ and just like in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a), we see a downward trend in NLS due to chaotic nature of the atmosphere (the predictability diminishes with time). As was described in Section 2.6.3, parameters $ {\alpha}_0 $ and the number of nearest neighbors of SWG were tuned based on crossvalidation which allows us to provide error bars (shading around the curves). We find that using 10 nearest neighbors is almost always optimal (right panels), whereas the optimal values of $ {\alpha}_0 $ (middle panels) tend to be on the order of 50–100. Later, we shall see that $ {\alpha}_0 $ also may depend on the preprocessing the data (such as dimensionality reduction).
We are following the identical procedures of data processing for both SWG and CNN (Section 2.4.3). However, because of the coarsegraining in SWG, the comparisons are only appropriate, if we shift orange curve by $ {\tau}_c1=2 $ days ahead (increasing $ \tau $ ). The latter adjustment is explained in Section 2.6.5. Figure 4 shows that in France at $ \tau =2 $ days NLS reaches approximately $ 0.40\pm 0.04 $ for CNN and an estimation of $ 0.30\pm 0.02 $ for SWG. In Scandinavia (bottom panel), we have similar results: approximately $ 0.37\pm 0.03 $ for CNN versus only $ 0.25\pm 0.02 $ for SWG. Large lead times $ \tau $ result in lower predictive skill due to chaotic nature of the atmosphere. The behavior of the curve leads to the conclusion that CNN predicts heat waves better up to 10 days before the event, at which point the CNN and SWG skills are not statistically distinguishable. In case of France, the regime of large $ \tau $ is mostly dominated by predictability due to soil moisture (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a), whereas in Scandinavia, this longterm information is absent; thus, NLS converges to nearzero values at large $ \tau $ . It is generally known (Felsche et al., Reference Felsche, Böhnisch and Ludwig2023), that it is heat waves in Southern Europe that are preceded by the negative anomaly of soil moisture, rather than Northern Europe, which is consistent with what we see here. The reasoning we provide is that in France, the soil may actually be in a regime of strong lack of humidity in summer, which significantly amplifies the soil–atmosphere feedbacks.
Looking at the optimality of the parameter $ {\alpha}_0 $ , which controls the importance of geopotential in the metric (equation (13)), we see that for France, it is of order $ {\alpha}_0\sim 50 $ , an order of magnitude higher than what we expected $ {\alpha}_0=1 $ based on weighting by the number of relevant grid points. In Scandinavia, the optimal value is not too different, of order $ {\alpha}_0\sim 100 $ , but the dependence is weaker. The reason for this could also be the fact that soil moisture does not contribute much to heat wave predictability in that region, thus limiting its usefulness in the metric (equation (13)).
The benchmarks indicate that CNN outperforms SWG independent of the area of interest (France or Scandinavia) or even dataset length (Figure A3).
3.1.2. Dimensionality reduction
We now move on to the question of projecting the underlying space to latent dimension as described in Section 2.6.4. Three approaches are considered, one which does not involve any dimensionality reduction, one where a deep FCNN is used as an encoder followed by an upsampling FCNN decoder and one, where instead of autoencoder we use SciPy package PCA projectionsFootnote ^{6}. We plot the results in Figures 5 and 6 for France and Scandinavia, respectively. There, simple SWG trained without dimensionality reduction is represented via a blue curve, autoencoder (latent dimension $ p=16 $ ) via orange and the other curves represent PCA for different latent dimensions $ p $ . For France, we see no statistically significant differences even in the drastic case where the latent dimension consists of the two largest principal components onlyFootnote ^{7}. This is probably due to the strong predictability power of soil moisture field (as clearly shown in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a)). Indeed, the results for Scandinavia (which is less affected by soil moisture) shown in Figure 6 display a degradation when $ p<4 $ while for $ p\ge 4 $ all methods provide essentially the same score. Interestingly, the highest score (although within the error bars) is obtained for $ p=8 $ , suggesting that an optimal dimension may exist.
The computation of analogs (equation (13)) is rather difficult in high dimensions and with datasets larger than years becomes unfeasible, because we also have to perform tuning of $ {\alpha}_0 $ and the number of nearest neighbors. This is much more efficient if the dimension of the space has been successfully reduced, with a method such as autoencoder. However, as can be seen from Figures 5 and 6, the benefits from using complicated multilayer architecture are limited in this case, given the fact that the green curve, corresponding to PCA is almost the same (within the error bars) as the one obtained with the autoencoder. One notable difference between the cases with dimensional reduction and without is that the value for $ {\alpha}_0 $ that is optimal $ {\alpha}_0=5 $ is an order of magnitude smaller than the one we obtain in Figure 4, corresponding to SWG learned in real space.
These conclusions should be taken in the context of the predictability of temperature using geopotential, which have rather linear Gaussian statistics. It could be that autoencoders are more useful for more complex fields such as precipitation.
3.2. Sampling extremes
From now on, our goal will be to use SWG and evaluate the quality of the statistics for the extremes it can produce. In contrast to Section 3.1, we will consider heat wave extremes as large as one in 7000 years events.
3.2.1. Computing return times
We aim to estimate extreme return times from shorter sequences and compare them to the control run D8000 (Table 1). To this end, we use SWG trained on D100 (Table 1) and initialized with $ {\alpha}_0=1 $ and 10 nearest neighbors, parameters that we do not optimize in contrast to Section 3.1. The generated synthetic sequences are plotted for France in Figure 7a and for Scandinavia in Figure 7b. The curves were obtained using the method described in Section 2.6.6, so that the analogs are initialized in June 1 of each year (using the simulation data) and then advanced according to Algorithm 1. The trajectory lasts up to synthetic August 30.
The main conclusion from Figure 7a is that the analog method is rather wellsuited for this task: not only do synthetic trajectories match return times of the reference real trajectory (80 years of analogs) but they also provide correct estimates of the returns of a much longer trajectory (8000 years) at a fairly low computational footprint, compared to running such a trajectory. For instance, for $ T=15 $ day heat waves, the most extreme event with return time of 7200 years has anomaly 7.07K which is well within the error bars of SWG estimate $ 7.02\pm 0.46K $ . We can see that the generalization happens consistently, except for the extremes of 6day heat waves, where we have sampling issue since $ {\tau}_m=3 $ days. The procedure is repeated for Scandinavia in Figure 7b. The results are generally consistent, although we observe more deviations from the prediction of SWG. For example, heat waves of length $ T=15 $ days and return times of order 2000–3000 years as well as the longest return times for $ T=30 $ days tend to be systematically underestimated. Note, however, that these events are rare even in the control run and therefore their thresholds have large uncertainties. Excluding these extreme events, the predictions are almost always in agreement with the return times calculated on the long run and when they are not, the relative error is smaller than 15%.
Since for this type of risk assessments EVT is often employed, we compare our approach to the GEV function fits obtained using package developed by Correoso (Reference Correoso2019) from the 79 extreme $ T=14 $ day heat waves in the 80 years of D100 (our training data) in Figure 8. For details on why only 79 heat waves are chosen, see Section 2.7.
Generally speaking, GEV fit performs worse than SWG; in France, it underestimates the severity of extremes, while in Scandinavia, it produces confidence intervals that are very large. While the GEV fit is also consistent with the extremes, we observe in the control run it provides much looser confidence intervals. On the other hand, SWG tends to shadow more closely the extremes of the long control simulation. This could be ascribed to the hypothesis that there is an upper bound for temperature extremes (e.g., Zhang and Boos (Reference Zhang and Boos2023)). Parameters of the GEV fit yield uncertainties (especially the shape parameter) which can be very large when the available data are limited (here, 80 years). This leads to wide uncertainties for return periods that are larger than the training length, although in case of Scandinavia the return levels of the long control simulations do fall within the confidence intervals of the GEV fits, albeit not on the GEV fit. On the other hand, the SWG simulations are close to intrinsic properties of temperature, driven by the predictors we use. This explains why the SWG simulations follow the long control run, with a relatively narrow confidence interval, which does not increase with return periods.
We do not control for $ {\alpha}_1 $ and $ {\alpha}_2 $ to give us insight into which variable matters more in this aspect (i.e., temperature or soil moisture); however, since soil moisture does not have impact on heat waves on Scandinavia, one could reason that it is temperature in that case. In other words, to produce reliable return time plots the Euclidean metric should not only take circulation patterns into account but also the temperature and soil moisture in some cases. A side criticism of our approach is that we have chosen value $ {\alpha}_0=1 $ , which seems arbitrary in view of the discussion in Section 3.1.1 on the optimality of $ {\alpha}_0\sim 50 $ for performing the probabilistic forecast. However, it should be noted that having changed the task to be performed, there is no longer a guarantee that $ {\alpha}_0\sim 50 $ is still an optimal value (and it is not, as shown in Appendix D).
3.2.2. Teleconnection patterns for extremes
We use synthetic trajectories generated by SWG trained on only 80 years (from D100, see Table 1) to estimate extreme teleconnection patterns that were never observed in that run. Note that we do not use autoencoder to generate new circulation patterns. Instead, SWG generates new sequences from the already existing ones which results in new values of $ A(t) $ (equation (2)) since it is a running mean over temperature series. This is precisely the recipe we have followed in Section 3.2.1. To validate the teleconnection patterns thus obtained, we compare them to the control run (D8000). The results are plotted in Figure 9.
First, we identify the 10 most extreme heat waves in D100 and plot the composite on their first3 days ( $ \tau =\left\{\mathrm{0,1,2}\right\} $ days) since the onset in Figure 9a Footnote ^{8}. The threshold for this event happens to be approximately 4K. The pattern consists of anticyclonic (in red) and cyclonic (blue) anomalies (the word “anomalies” will be suppressed in what follows for brevity).
Next, we compare this lack of data regime (essentially 10 events) to the composite map that can be computed using a 7200 years long control run in Figure 9c using the same 4K threshold and plotting the first 3 days ( $ \tau =\left\{\mathrm{0,1,2}\right\} $ days) of the composites. In contrast to the previous situation, this leaves us with many more events satisfying the constraint. The picture changes somewhat, most notably, the cyclone and anticyclone become more pronounced, while the other patterns maintain relationship consistent with a Rossby wave pattern. This is in line with the understanding of teleconnection patterns expected in European heat waves. This pattern is also consistent with wave number 3 teleconnection for heat waves in France obtained in Ragone and Bouchet (Reference Ragone and Bouchet2021) and Miloshevich et al. (Reference Miloshevich, RoubyPoizat, Ragone and Bouchet2023b) computed for 1000yearlong sequence.
Finally, we compare both figures with a very long synthetic SWG run in Figure 9e. The resulting teleconnection pattern is again similar to the one in Figure 9e. Overall, the advantage offered by SWG in this case is rather small and mostly has to do with the shape of the anticyclone over Northern Europe which in the short D100 composites appears more narrow.
We repeat the exact same steps but for the heat waves in Scandinavia, starting from the 10 most extreme events in D100 in Figure 9b to the composites resulting from synthetic SWG run in Figure 9f. In this case, SWG performs better in that it is able to capture the magnitude of the anticyclone over Scandinavia more accurately (compared to Figure 9d) than D100 composite, although the other features are hard to distinguish.
4. Discussion
We have systematically compared performance of SWG and CNN (Miloshevich et al., Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) tasked to predict the occurrence of prolonged heat waves over two European areas using simulation data from intermediate complexity climate model PlaSim. We have also studied the ability to sample extreme return times and composite maps via the method of SWG.
In addition to the study of heat waves in France considered in Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a), we also apply our methodology to heat waves in Scandinavia. This area has different hydrology which impacts longterm predictability. In order to achieve better predictive ability with SWG, we took the version developed earlier (Yiou, Reference Yiou2014), which previously used only analogs of circulation, predictors we refer to as global, and upgraded it with additional predictors such as temperature and soil moisture integrated over the area of interest (where heat waves occur), which we refer to as local. This was done because of the ample evidence of importance of slow drivers such as soil moisture (Zeppetello et al., Reference Zeppetello, Battisti and Baker2022). Consequently, we had to use appropriate weights in the definition of Euclidean distance to control the importance of global versus local predictors. The benchmarks of CNN versus SWG were obtained based on NLS which is particularly well suited for studying rare events (Benedetti, Reference Benedetti2010). The conclusion is that CNN outperforms SWG in probabilistic forecasting, although this is more evident for particularly large (hundreds of years long) Training Sets (TSs), which would not be available in the observational record. This is in line with the statement of Miloshevich et al. (Reference Miloshevich, Cozian, Abry, Borgnat and Bouchet2023a) on the convergence of such methods in deep learning that requires massive datasets. These conclusions could be interesting for extreme event prediction but are also relevant for future developments of geneological rare event algorithms that aim to resample heat waves based on the forecasts provided by either CNN or SWG (in the approach similar to Lucente et al. (Reference Lucente, Rolland, Herbert and Bouchet2022b)).
We have studied the performance of SWG in relation to the estimation of large return times and extreme teleconnections. This is achieved by generating synthetic trajectories of extensive length based on the initial training data of a short 80year PlaSim run. We have shown that outofthebox (without hyperparameter optimization) implementation of SWG, which takes the three relevant fields as input for the Euclidean metric, was capable of reproducing the return times of the much longer control run (7200 years) for a range of heat wave durations: 15, 30, and 90 days ones. These SWG calculations shadow more closely the control run (with smaller variance) than the EVT estimates, which are typically used to address similar questions. However, the version of SWG whose weights have been optimized for a different task of conditional (intermediate range to subseasonal) probabilistic prediction tends to underestimate the return times slightly. Nevertheless, its synthetic teleconnection patterns have qualitatively accurate features. This is an important result in view of works such as (Yiou et al., Reference Yiou, Cadiou, Faranda, Jézéquel, Malhomme, Miloshevich, Noyelle, Pons, Robin and Vrac2023) that rely on SWG to estimate the risks of the extremes.
In the hopes of improving the performance of SWG, we have considered two dimensionality reduction techniques to project the geopotential to the smaller latent space. These techniques involve traditional EOFs and VAEs. We found that while improving the efficiency at which the analogs are computed, the dimensionality reductions did not modify the predictive skill, which leads us to suggest EOF as a superior dimensionality technique for this task. However, it is possible that VAE could still be useful for other types of extremes, such as precipitation, where the application of traditional analog method is more controversial. We leave such studies for the future. When looking for optimal dimensionality of the latent space, we found that a few 500 hPa geopotential EOF components coupled with local temperature and soil moisture are sufficient to obtain optimal forecasting skill for SWG, which, however, as stated earlier, falls short of CNN prediction, which takes the full fields as in input.
Another possibility of extending this work would be comparing SWG to other modern tools for learning propagators, for instance, simple UNet architectures or a corresponding Bayesian neural network (Thuerey et al., Reference Thuerey, Weißenow, Prantl and Hu2020, Reference Thuerey, Holl, Mueller, Schnell, Trost and Um2021), which provides uncertainty of the prediction thus a kind of propagator. Many other types of architectures are possible, however, given overall simplicity of SWG when training climate model emulators with small datasets SWG should be treated as an indispensable baseline method to justify the use of complex architectures. Fortunately, we have supported this paper with all the necessary code and documentation for straightforward implementation of SWG that can be applied to other projects.
As stated above, another interesting potential application for SWGs and other statistical methods such as deep learning is rare event algorithms (Lucente et al., Reference Lucente, Herbert and Bouchet2022a). Rare event algorithms have been used in the past to sample heat waves (Ragone et al., Reference Ragone, Wouters and Bouchet2018; Ragone and Bouchet, Reference Ragone and Bouchet2021) and could potentially allow to sample other extreme events (Webber et al., Reference Webber, Plotkin, O’Neill, Abbot and Weare2019) with expensive highfidelity models at a cheaper cost. Datadriven approaches could be used to improve such rare event simulations. This is because knowing the probability of an event gives a prior information to the rare event algorithm (Chraibi et al., Reference Chraibi, Dutfoy, Galtier and Garnier2021). A recent work (JacquesDumas et al., Reference JacquesDumas, van Westen, Bouchet and Dijkstra2023) compares SWG and CNN in computing importance functions for two simple Atlantic meridional overturning circulation models. We believe that providing similar benchmarks is important; thus, in this manuscript, we concentrate on the comparisons between deep learning and analog forecasting.
Finally, the results here were obtained based on dataset generated by intermediate complexity climate model, PlaSim with the ocean that was driven using stationary climatology and with relatively coarse resolution. As other modes of longterm predictability come from the ocean the next study should investigate the same questions based on higher fidelity model with better parameterizations. In particular, the question of transfer learning should be addressed, that is, how to properly generalize outofsample by pretraining on a simpler models and fine tuning on more complex datasets, such as CESM or reanalysis.
Acknowledgments
The authors acknowledge CBP IT test platform (ENS de Lyon, France) for ML facilities and GPU devices, operating the SIDUS solution (Quemener, Reference Quemener2014). This work was granted access to the HPC resources of CINES under the DARI allocations A0050110575, A0070110575, A0090110575, and A0110110575 made by GENCI. The authors would like to acknowledge the help of Alessandro Lovo in maintaining the GitHub page. The authors would also like to acknowledge the help of the referees in streamlining and improving the quality of this article.
Author contribution
Conceptualization: D.L., F.B., P.Y., G.M.; Data curation: D.L., G.M.; Formal analysis: D.L., G.M.; Investigation: D.L., P.Y., G.M.; Methodology: D.L., F.B., G.M.; Software: D.L., G.M.; Writing – original draft: D.L., F.B., P.Y., G.M.; Writing – review & editing: D.L., F.B., P.Y., G.M.; Funding acquisition: F.B., P.Y.; Project administration: F.B., P.Y., G.M.; Supervision: F.B., P.Y., G.M.; Validation: G.M.; Visualization: G.M.
Competing interest
The authors report no competing interests.
Data availability statement
Data for this study are available from https://zenodo.org/records/10102506.
Funding statement
This work was supported by the ANR grant SAMPRACE, project ANR20CE01000801 and the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101003469 (XAIDA). This work has received funding through the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR16IDEX0005.
Code availability statement
The coding resources for this work, such as the Python and Jupyter notebook files, are available on a GitHub page https://github.com/georgemilosh/ClimateLearning and are part of a larger project at LSCE/IPSL ENS de Lyon with multiple collaborators working on rare event algorithms.
Abbreviations
 CESM

Community Earth System Model
 CNN

Convolutional Neural Network
 EOF

Empirical Orthogonal Function
 EVT

Extreme Value Theory
 GCM

General Circulation Model
 GEV

Generalized Extreme Value
 MCC

Matthew’s Correlation Coefficient
 NLS

Normalized Logarithmic Score
 NWP

Numerical Weather Prediction
 PCA

Principal Component Analysis
 PlaSim

Planet Simulator
 S2S

SubseasonaltoSeasonal
 SWG

Stochastic Weather Generator
 TS

Training Set
 VAE

Variational Autoencoder
 VS

Validation Set
 CESM

Community Earth System Model
 CNN

Convolutional Neural Network
 EOF

Empirical Orthogonal Function
 EVT

Extreme Value Theory
 GCM

General Circulation Model
 GEV

Generalized Extreme Value
 MCC

Matthew’s Correlation Coefficient
 NLS

Normalized Logarithmic Score
 NWP

Numerical Weather Prediction
 PCA

Principal Component Analysis
 PlaSim

Planet Simulator
 S2S

SubseasonaltoSeasonal
 SWG

Stochastic Weather Generator
 TS

Training Set
 VAE

Variational Autoencoder
 VS

Validation Set
Appendix
A. Principal component analysis and variational autoencoder
In PlaSim, we encounter highdimensional fields, compared to what is done in toy models such as Lorenz 63. Generally speaking, one has to deal with problems of “curse of dimensionality” in such cases and the need to somehow project the dynamics $ x\to z $ . To be more specific, let $ \mathcal{D}\in {\mathrm{\mathbb{R}}}^{n,d} $ be a dataset consisting of $ n $ samples $ x\in {\mathrm{\mathbb{R}}}^d $ and let $ z=f(x)\in {\mathrm{\mathbb{R}}}^p $ a lower dimensional projection of the data ( $ p\ll d $ ). In what follows we will give two brief descriptions of two widely used projection techniques involving the use of both linear and nonlinear functions $ f $ , namely principal component analysis (PCA) and variational autoencoder (VAE). PCA (also known as empirical orthogonal functions) consists of a linear transformation of the data points $ x $ , which aims at preserving as much variance as possible of the original dataset. To do so, $ x $ is projected onto the lowdimensional space spanned by the first $ p $ eigenvectors of the correlation matrix $ \Sigma ={\mathcal{D}}^T\mathcal{D} $ Footnote ^{9}. The dimensionality $ p $ of the latent space is determined by the percentage of data variability that is desired to be preserved by the transformation. Here, however, dimensionality reduction is used as a preprocessing step in order to optimize a prediction task. From this perspective, not all information regarding the variability of observations is necessarily relevant to the forecasting task.
The second projection technique we want to discuss consists of a nonlinear dimensionality reduction via VAE (Kingma and Welling, Reference Kingma and Welling2013). The idea is to project fields such as geopotential $ x $ on the latent lowdimensional space $ z $ , from where realistic looking states can be sampled using, for instance, a Gaussian measure. VAE consists of probabilistic decoder $ {p}_{\theta}\left(xz\right) $ and probabilistic (stochastic) encoder $ {q}_{\phi}\left(zx\right) $ . The goal is to maximize the probability of the data $ p(x) $ in the training set (TS) (Doersch, Reference Doersch2016)
where traditionally the prior $ p(z)=\mathcal{N}\left(0,I\right) $ . The choice for the likelihood given by the decoder is often chosen to be Gaussian $ p\left(xz\right)=\mathcal{N}\left(x{f}_{\theta }(z),{\sigma}^2\hskip0.1em I\right) $ or Bernoulli (which is particularly suitable when the underlying data are binary (LoaizaGanem and Cunningham, Reference LoaizaGanem and Cunningham2019) and $ {f}_{\theta }(z) $ can be some neural network with weights $ \theta $ . Formally, posterior $ {p}_{\theta}\left(zx\right) $ could be written using Bayes’s rule but in most practical cases this leads to intractable expression and methods such as importance sampling may lead to the variance in the estimate that can be very high if the proposal distribution is poor so instead evidence lowest bound is used (Mehta et al., Reference Mehta, Bukov, Wang, Day, Richardson, Fisher and Schwab2019). This leads to approximate posterior which is also parameterized using neural networks $ {q}_{\phi}\left(zx\right)=\mathcal{N}\left(z{\mu}_{\phi }(x),{\Sigma}_{\phi }(x)\right) $ . For details on how this is achieved, such as “reparametrization trick” (see, e.g., Rezende et al., Reference Rezende, Mohamed, Wierstra, Xing and Jebara2014). Other variants of VAE have been developed such as conditional VAE (Sohn et al., Reference Sohn, Lee, Yan, Cortes, Lawrence, Lee, Sugiyama and Garnett2015) but since we are looking for a space on which distance can be computed to allow simulating unconditional trajectories we believe the latent space should be also unconditional. VAEs are not the only probabilistic generative models, and indeed generative adversarial networks (GANs) are often preferred choice since VAEs often produce smoothened images. On the other hand, GANs may suffer from technical difficulties such as mode collapse and are generally more difficult to train.
Once the training is performed (on the TS), we may project the fields in both training and validation set as $ \mathcal{Z}\left(\mathbf{r},t\right)\to {\mathcal{V}}_{\mathrm{tr}}\hskip0.1em \overline{\mathcal{Z}}\left(\mathbf{r},t\right) $ to obtain latent variables $ \mathbf{z}(t)\hskip0.4em := \hskip0.4em \left\{{z}_m(t),m\in 1,\dots, M\right\}\sim \mathcal{N}\left(0,1\right) $ that are stacked along with the area integrals
For this aim, we use eightlayer encoder–decoder type network with residual connections within encoder and decoder, respectively, which allows projecting $ \mathcal{Z} $ to the 16dimensional latent space. In this case, the input corresponding to $ \mathcal{Z} $ field contribution in Eq. (13) is provided from the latent space. The VAE is trained only on the TS. The training takes approximately 50 epochs for each fold on the corresponding TS using TU102 (RTX 2080 Ti Rev. A) GPU which results in total training time of approximately 1 hour. Such large degree of freedom reduction naturally leads to significant gains (two orders of magnitude) in subsequent kDTree computation speed. The forecast skill of this VAE SWG, however, remains the same compared to regular SWG. The original and VAE reconstructed geopotential anomalies are plotted for few samples weather situations in Figure A1.
B. Daily steps
Throughout this paper, we have used $ {\tau}_m={\tau}_c=3 $ day step for our Markov chain and coarse graining. While the motivation for this was synoptic decorrelation time, one naturally wonders how optimal is that choice for predicting extremes. Moreover, the CNN was designed to learn from daily fields, which meant we had to shift the NLS curves by 2 days for proper comparisons. Here, we will show what happens when $ {\tau}_m={\tau}_c=1 $ day. Given that the temperature autocorrelation time, one would expect subsequent days to be strongly correlated, which would violate Markov’s condition. Nevertheless, comparing the blue and orange curves in Figure A2, we hardly see any difference.
C. Committors for 100 years
Results of Figure 4 show that CNN outperforms SWG when using D500 (see Table 1), in other words, 400 years of training data. Climate models allow working with large datasets; however, in practice, we are often limited by the paucity of the observational record. Thus, we provide benchmarks for shorter dataset D100, thus 80 years of training, the same that is used in Section 3.2. The results are plotted in Figure A3, which shows that, while CNN still gives better results on average, the spread of the NLS tends to be quite large. The trends for $ {\alpha}_0 $ and number of neighbors remain the same.
D. Return times for different values of the hyperparameter
Here, we attach Figure A4 that is equivalent to Figure 7a in all but the value of $ {\alpha}_0=50 $ . This parameter was chosen optimal for this area in Section 3.1. As we can see, it is not optimal for generating synthetic return times (underestimation). This underestimation is related to the reduction of variance problem. We do not include the case of Scandinavia, but the results look very similar.