Quantifying stratospheric biases and identifying their potential sources in subseasonal forecast systems

. The stratosphere can be a source of predictability for surface weather on timescales of several weeks to months. However, the potential predictive skill gained from stratospheric variability can be limited by biases in the representation of


Introduction
The Earth's stratosphere is home to several dynamical phenomena that are coupled with the tropospheric circulation.This coupling between the two layers can go in both directions; tropospheric variability drives variability in the stratosphere, but downward coupling from stratospheric variability can subsequently impact weather in the troposphere across the globe (Domeisen and Butler, 2020).As a result, the stratosphere is recognized as a source of predictability for the troposphere on subseasonalto-seasonal (S2S) timescales (Butler et al., 2019a;Domeisen et al., 2020b).However, model simulations and forecasts often struggle to adequately capture such stratosphere-troposphere coupling processes.Model biases in both the troposphere and stratosphere can impact these coupling processes, with potential deleterious effects for S2S predictability.The goal of this study is to provide a systematic identification of stratospheric biases in a wide range of S2S forecast systems.
In the wintertime extratropical stratosphere, variability in the upward flux of planetary-scale Rossby waves drives variability in the westerly circulation of the stratospheric polar vortices, affecting their mid-winter strength and the timing of their seasonal breakdowns in spring (Andrews et al., 1987;Garfinkel et al., 2010).During winter and spring, anomalous behavior of the polar vortices can exert an influence on the underlying troposphere.This kind of "downward coupling" is especially apparent for extreme polar vortex events, including sudden stratospheric warming (SSW) events (Gerber et al., 2012;Baldwin et al., 2021), which are characterized by massive disruptions to the polar vortex that decelerate and reverse its westerly winds; strong vortex events, which are characterized by anomalous strengthening of the vortex (Limpasuvan et al., 2005;Tripathi et al., 2015); and final warmings, which denote the breakdown of the polar vortex in spring or early summer until the subsequent autumn (Black et al., 2006).The observed surface response following such polar vortex events generally resembles the so-called Northern and Southern Annular Modes (NAM/SAM; Thompson and Wallace, 2000;Baldwin and Dunkerton, 2001), or the North Atlantic Oscillation (NAO; Ambaum and Hoskins, 2002;Charlton-Perez et al., 2018;Domeisen, 2019), which captures the tropospheric response specifically over the North Atlantic, where northern hemisphere (NH) stratospheric influence tends to be the strongest (Butler et al., 2017;Dai and Hitchcock, 2021).The surface temperature and precipitation responses associated with these large-scale tropospheric circulation patterns have been shown to contribute to extreme events such as cold air outbreaks and precipitation extremes (Domeisen and Butler, 2020).
In the tropical stratosphere, the absorption of vertically propagating tropical waves from the troposphere below drive the alternating phases of easterly and westerly winds of the Quasi-Biennial Oscillation (QBO; Baldwin et al., 2001).In turn, the QBO has the ability to modulate the tropospheric jet streams, tropical convection, and other phenomena such as the Madden-Julian Oscillation (e.g., Gray et al., 2018;Kim et al., 2020;Anstey et al., 2021, and references therein).These tropospheric teleconnections of the QBO are known to affect the NAO, as well as surface temperatures and precipitation near East Asia and in the tropics on both seasonal-mean and subseasonal timescales (Anstey et al., 2021;Gray et al., 2018;Haynes et al., 2021;Elsbury et al., 2021;Park et al., 2022).The QBO also has an apparent influence on the strength of the polar vortex, particularly in the Northern Hemisphere (NH), whereby easterly or westerly winds in the tropical lower stratosphere tend to lead to a weaker or stronger polar vortex, respectively (e.g., Holton and Tan, 1980;Calvo et al., 2007;Garfinkel et al., 2012;Anstey and Shepherd, 2014;Gray et al., 2018;Rao et al., 2020a).This is known as the "Holton-Tan" effect; there are several mechanisms that help explain such QBO-polar vortex coupling, but generally they are tied to the QBO winds affecting the propagation and dissipation of waves in the polar stratosphere.
Since the stratosphere and its variations generally exhibit a higher persistence and predictability than the troposphere (Domeisen et al., 2020a;Son et al., 2020), their downward influence can lead to opportunities for long-range prediction of surface weather on S2S timescales (Butler et al., 2019a).For example, all of the extreme polar vortex events mentioned above have shown potential for improved S2S surface prediction: SSWs (Sigmond et al., 2013), strong vortex events (Tripathi et al., 2015;Domeisen et al., 2020b), and final warmings (Butler et al., 2019b;Byrne et al., 2019).Tropical stratospheric variability associated with the QBO has also shown the potential to improve surface prediction on these time-scales (Marshall and Scaife, 2009;Garfinkel et al., 2018;Martin et al., 2021b).However, surface prediction skill is not always improved based on the conditions in the stratosphere.In fact, several regions exhibit poorer predictability after extreme polar vortex events as compared to periods without (Domeisen et al., 2020b).There are likely several reasons for these shortcomings in dynamical predictions related to the inherent chaotic nature of dynamical coupling within the atmosphere.For example, not all SSWs appear to have a significant downward impact (Karpechko et al., 2017), likely due to tropospheric internal variability (Domeisen et al., 2020c; Afargan-Gerstman and Domeisen, 2020) and the duration and strength of anomalies in the lower stratosphere following the SSW (Maycock and Hitchcock, 2015;Karpechko et al., 2017;Charlton-Perez et al., 2018;Domeisen, 2019;White et al., 2020).
Model biases and related shortcomings in simulating the stratosphere can affect dynamical predictions of stratospheric variability and stratosphere-troposphere coupling (Domeisen et al., 2020b;Charlton-Perez et al., 2013).Indeed, tropospheric biases identified independently of the stratosphere can sometimes be traced back to stratospheric anomalies in subseasonal forecasts; in the ECMWF extended-range prediction system, the persistence of the NAO was found to be constrained too strongly by the state of the polar vortex at initialization (Kolstad et al., 2020).Since stratospheric variability and extreme events are primarily governed by wave-mean flow interactions, biases in wave driving due to either resolved or parameterized processes can lead to biases in the mean state of the stratosphere, which can further alter its response to subsequent forcing (McLandress et al., 2012;Richter et al., 2014) and limit stratospheric predictability (Portal et al., 2021).More fundamentally, a model's vertical resolution in the stratosphere can influence its ability to realistically represent the stratosphere.For example, climate models with limited resolution in the stratosphere tend to dramatically underestimate stratospheric variability (Charlton-Perez et al., 2013;Shaw et al., 2014;Richter et al., 2020;Rao et al., 2020a).Similarly, subseasonal forecast systems with severely limited model resolution in the stratosphere exhibit poor skill in predicting extreme polar vortex events beyond 1 week, whereas systems with higher model lids and higher vertical resolution in the stratosphere can predict these events up to 2-3 weeks in advance (Domeisen et al., 2020a).Subseasonal forecast systems that poorly represent tropospheric stationary waves also tend to have low vertical resolution in the stratosphere, connected to larger stratospheric biases (Schwartz et al., 2021).On seasonal timescales, surface prediction skill can also in part be traced back to model properties such as vertical resolution in the stratosphere (Butler et al., 2016;Portal et al., 2021).
It is expected that identifying and properly addressing the sources of stratospheric biases would not only improve prediction in the stratosphere, but also help improve the surface prediction skill associated with the aforementioned stratospheric phenomena.For example, some recent studies have investigated the impact of explicitly correcting stratospheric biases in model simulations, and found improvements to SSW statistics (Tyrrell et al., 2021) and the representation of the Holton-Tan effect (Karpechko et al., 2021).In other similar studies, teleconnections to the polar vortex related to Siberian snow cover (Tyrrell et al., 2020) and the El Niño-Southern Oscillation (ENSO; Tyrrell and Karpechko, 2021) were shown to be sensitive to model stratospheric biases.In the case of the Siberian snow cover, explicitly correcting the stratospheric biases had modest impacts on the magnitude and duration of downward coupling between the polar vortex and the AO/NAO.However, in the case of ENSO teleconnections, correcting stratospheric biases had no detectable impact on the NAO, possibly because the NAO response to the strong ENSO events in the experiments was dominated by the tropospheric ENSO pathway to the North Atlantic (Jiménez-Esteve and Domeisen, 2018).
Given the role the stratosphere can play for skillfully predicting surface weather at subseasonal to seasonal timescales, it is crucial to investigate and diagnose the stratosphere and stratosphere-troposphere coupling biases present in such models.
Identifying and comparing these biases will serve as a data point for improving subseasonal forecast systems, and ultimately prediction skill, both in the stratosphere and for extended-range surface weather.Herein we perform a comprehensive investigation and intercomparison of stratosphere-related biases affecting subseasonal prediction systems.The present work is the result of a volunteer collaborative effort of the World Climate Research Programme (WCRP) Stratosphere-troposphere Processes and their Role in Climate (SPARC) Stratospheric Network for the Assessment of Predictability (SNAP) activity, which is also the stratosphere sub-project of the WCRP/World Weather Research Programme S2S Prediction Project.In Section 2 we describe the datasets and methods we use to identify biases.We describe our results in Section 3, and discuss and summarize our findings in Section 4.

Subseasonal-to-Seasonal (S2S) Hindcast and Reanalysis Datasets
We primarily use ensemble hindcast data from the S2S Prediction Project Database (Vitart et al., 2017).Where possible, we also include results from other ensemble forecast systems that do not provide data to the S2S database; these include the National Oceanic and Atmospheric Administration's Global Ensemble Forecast System version 12 (NOAA GEFSv12; Hamill et al., 2021;Guan et al., 2021), the Geophysical Fluid Dynamics Laboratory Seamless System for Prediction and EArth System Research (GFDL-SPEAR; Delworth et al., 2020) Table 1 lists the different S2S forecast systems, and includes relevant information about their hindcast and model configurations.Because the hindcast periods available for different systems vary substantially, unless noted otherwise, we limit our analyses to the 1999-2010 period, which is common to nearly all systems.The one exception is GEFSv12, which has hindcasts that span the period from 2000-2019, for which we simply use the 2000-2010 period.In order to balance including as many prediction systems as possible while also being inclusive of contributions from co-authors with differing capabilities to access and store the large datasets involved, we designated the S2S database systems that provide at least 35-day forecasts as our "core systems', while others were considered optional.Thus, the number of S2S systems shown for a given analysis may vary, but will always include the seven core systems from the S2S database.
Since we consider CMA to be a core system (as it performs forecasts beyond 35 days), we use the older low-top version known as the "BCC-CPS-S2Sv1" because the more recent "BCC-CPS-S2Sv2" hindcast period only covers 2004-2018.In contrast, ECCC data is considered to be optional, so we sometimes mix the use of :::: some :::::::: analyses ::: use ::::: either : the high-and :: or : low-top versions; in such cases, we explicitly describe these as "ECCC-hi" or "ECCC-lo", respectively (see Section 2.2 for the specific definitions of high versus low top).
For the analyses herein, we use basic meteorological fields, including zonal and meridional wind components, temperatures, and geopotential heights.These fields from the S2S database are provided once daily at instantaneous 00 UT verification times on a 1.5 • × 1.5 • latitude/longitude grid, with 10 pressure levels between 1000 and 10 hPa (with about 3 levels in the stratosphere, including 100, 50, and 10 hPa).GEFSv12 fields are provided 6-hourly on a 0.5 • × 0.5 • grid, with 25 pressure levels between 1000 and 1 hPa (6 between 100 and 10 hPa).GFDL-SPEAR fields are also 6-hourly on a 0.5 • × 0.5 • grid, with 33 pressure levels between 1000 and 1 hPa.For the CESM2-CAM and CESM2-WACCM data, we make use of the available "zonal mean" collection of variables that are provided as daily means on pressure levels closest to the model levels; we interpolate this data to a set of 32 standard pressure levels between 1000 and 10 hPa (with 6 between 100 and 10 hPa); these fields have 192 latitudes (∼0.9424 • resolution) from the finite volume grid.
The subseasonal hindcast datasets used are all initialized with different atmospheric analyses.Therefore, to ensure that comparisons and biases are all determined with respect to a consistent dataset, we compare hindcast fields to those from the ERA-Interim (ERA-I) reanalysis (Dee et al., 2011).Different modern reanalysis products generally agree very well amongst one another for the time periods (post-1999) and levels (10 hPa and below) we consider (Long et al., 2017;Gerber and Martineau, 2018;Fujiwara, M. et al., 2021), and thus our results are unlikely to be sensitive to the choice of reanalysis.

Methods
Throughout the paper we distinguish between forecast systems with low-and high-top atmospheric models.We define the systems having model tops at or above 0.1 hPa with several levels above 1 hPa as high-top; any others that do not meet these criteria are specified as low-top.In total, we consider 8 : 9 ::::::: systems ::::: with high-top ::::: models : and 6 :::::: systems ::::: with low-top models (see Table 1), though : ; :::::::: however, the precise number of models included in a given analysis varies.In our figures, we highlight low-top models with asterisks and/or dashed lines.
For a given diagnostic, raw biases among the forecast systems are computed by taking the difference between the ensemblemean hindcasts and ERA-I.We generally composite these biases according to lead time and/or season in order to determine the systematic differences of the hindcast predictions from reanalysis.For some applications we derive lead-time dependent climatologies for the different forecast systems, which we use to determine forecast anomalies and to apply bias-correction.
For systems in the S2S database, these climatologies are found by averaging all ensemble-mean hindcasts for a given day of year at a specific lead time.For systems that provide a fixed set of hindcast initializations that do not uniformly cover the same days of year in the hindcasts (such as the GEFSv12 and the CESM2 systems), we create the lead-dependent climatologies according to the method outlined in Pegion et al. (2019); briefly, this method involves averaging all hindcasts for a given day of year and lead time (which is generally less than the total number of years in the hindcast archive), and then applying a rolling 31-day average with centered triangular weights to the "raw" and noisy lead-dependent climatology.Hindcast anomalies are then determined by subtracting these climatologies for a given day of year and lead time from the raw forecast quantities.In some cases we apply a linear bias-correction to raw quantities in order to remove any climatological drift that may exist.This is done by removing the difference between each system's lead time-dependent climatology and the ERA-I climatology from the predicted quantity in question: , where Q BC is the bias corrected In sections 3.3 we investigate whether there are biases in relation to extreme stratospheric events, including SSWs and strong vortex events in the NH.SSWs are defined using ERA-I data, based on reversals in the 10 hPa 60 • N zonal mean zonal winds, 175 consistent with Charlton and Polvani (2007) and Butler et al. (2017).Similarly to SSWs, strong vortex events are defined using ERA-I data when the daily mean 10 hPa 60 • N zonal mean zonal winds exceed 41.2 m/s for at least two consecutive days, corresponding to approximately the 80th percentile of the November-March 1980-2012 zonal winds, consistent with the definition used by Tripathi et al. (2015).These definitions have also been employed in prior investigations of the stratosphere in S2S systems (Domeisen et al., 2020a, b).The central dates of these events are considered to be the first day on which these 180 zonal wind thresholds are met.

Global and Zonal Mean Biases
From the perspective of model evaluation, examining mean state biases in the stratosphere is useful since the processes that govern the distribution of zonal mean temperatures and winds in the stratosphere are quite well understood.Global-and annualmean temperatures in the stratosphere should be very close to radiative equilibrium (e.g., Olaguer et al., 1992).In contrast, seasonal and meridional variations in zonal mean stratospheric temperatures and winds arise from dynamic influences such as wave-mean flow interaction.Meridional circulations driven by the dissipation of waves drive stratospheric temperatures away from radiative equilibrium, which affects the zonal-mean zonal winds through thermal-wind balance (e.g., Andrews et al., 1987;Shepherd, 2000).Model biases in global annual-mean stratospheric temperatures, and/or seasonally-varying biases in zonal mean temperatures and winds therefore provide information about the likely origin of model errors, either pointing to model components that affect radiative processes, or those that affect middle atmosphere dynamical processes such as parameterized gravity wave drag.
We first consider the annual, global mean temperature biases among the S2S forecast systems.Figure 1 shows these biases for 10, 50, and 100 hPa for each of the models at a lead time of 4 weeks (days 22-28).These biases generally develop and increase in magnitude monotonically with lead time, with many of these biases being present at earlier lead times of 1-2 weeks (not shown).The magnitude of biases tend to be largest in the middle stratosphere at 10 hPa, with 6 of the models shown exceeding absolute biases over 2 K. Positive temperature biases tend to be most common across the models and levels, with the exception of ECMWF and CESM2-WACCM, which primarily have negative temperature biases.
Figure 1 also highlights some "familial" relationships .In particular, the "NOAA-family" of forecast systems (including GEFSv12, GFDL-SPEAR, and NCEP) all show global mean warm biases throughout the lower and middle stratosphere.The biases in the KMA and UKMO systems are very similar, likely owing to the fact they both use the same GloSea-5 model (see Table 1).There are large differences between the biases apparent in CESM2-CAM and CESM2-WACCM, which may be due to a combination of factors: aside from the differences in model tops, CESM2-WACCM includes fully interactive tropospheric and stratospheric chemistry (Gettelman et al., 2019;Richter et al., 2021), and the atmospheric models are initialized with different reanalyses ("CFSv2" for CESM2-CAM, and "MERRA-2" for CESM2-WACCM).
The apparent easterly bias in the tropical stratospheric zonal winds across the high-and low-top systems represents a general bias among the models in representing the QBO.We examine these biases in the next section in more detail.

Biases Related to the Tropical Stratosphere and QBO
The circulation of the tropical stratosphere is dominated by the QBO, and hence the zonal mean biases in the tropics shown above must be partially related to each forecast system's ability to maintain the QBO from initial conditions.To the extent that S2S models can represent the QBO and its teleconnections accurately, the QBO could lead to more reliable predictions on subseasonal and seasonal timescales in the troposphere (Garfinkel et al., 2018;Merryfield et al., 2020).Therefore, we now consider whether this potential is realized by the S2S forecast systems.Unlike elsewhere in the paper, here our QBO analyses make use of the full hindcast periods available to each model to maximize the number of QBO cycles available when determining biases (see Table 1).
The QBO has been shown to influence tropical convection on subseasonal timescales, and one of the leading mechanisms for this effect is related to the QBO's mean meridional circulation, which leads to temperature (and buoyancy frequency) anomalies in the tropical tropopause layer that subsequently affect high clouds and convection (Gray et al., 2018).These relationships also underpin the QBO's observed relationship with the Madden-Julian Oscillation (MJO; Yoo and Son, 2016;Son et al., 2017;Lee and Klingaman, 2018;Martin et al., 2021a), which can modulate MJO forecast skill (Kim et al., 2019;Lim et al., 2019).
The QBO is known to have an important teleconnection to the boreal winter polar vortex known as the "Holton-Tan effect" (e.g., Holton and Tan, 1980;Baldwin et al., 2001;Calvo et al., 2007;Garfinkel et al., 2012;Anstey and Shepherd, 2014), in which weaker or stronger polar vortex winds preferentially occur with EQBO or WQBO winds in the lower stratosphere, respectively.Figure 4 shows the composite anomalies in 10 hPa polar vortex winds, and 100 hPa polar cap geopotential heights (Z100) composited based on the 50 hPa QBO phase at initialization.Since the Holton-Tan effect develops in early winter and is most pronounced in mid-winter, we limit our focus :::::  anomaly grows stronger out to around week 3 (see, e. g., NCEP, UKMO, and ECMWF), but all the anomalies decay thereafter.

Northern Hemisphere Polar Vortex Variability
Variability in the NH polar winter stratosphere primarily arises from extreme dynamical polar vortex events, including midwinter SSWs and strong vortex events.The occurrence of these events are generally associated with extremes in upward wave driving on :: the ::::::: upward ::::: wave ::::: fluxes ::: that ::::::: disturb the polar vortex, with SSWs/strong vortex events being preceded by extended periods of above/below-normal wave driving, respectively (e.g., Polvani and Waugh, 2004).A typical metric for upward planetary wave driving ::: flux is the meridional eddy heat flux, v ′ T ′ (with v being the meridional wind, T the temperature, and the primes denoting deviations from the zonal mean).In models, such wave driving should be resolved ::: well :::::::::: represented, dependent upon tropospheric variability and proper simulation of vertical wave propagation.There can also be significant variability among polar vortex events in terms of different characteristics, such as their timing, magnitude, persistence, and even polar vortex geometry (e.g., Karpechko et al., 2017).Importantly, the occurrence of such polar vortex events can lead to long-term coupling with the troposphere ::: that :::: lasts ::: for ::::: weeks :: to ::::::: months; in forecast models, the occurrence of these events can improve tropospheric predictability by providing "forecast windows of opportunity" (e.g., Butler et al., 2019a, and references therein).
This means the NCEP system is more accurate for week 4 SSW risk forecasts initialized in December and January, but then overestimates SSW probabilities in February and March.The NCEP system's higher prediction of easterly winds is possibly related to its significant weak vortex bias (see Supplement Figure 14 ::: S13), which may be linked to its overestimation of eddy heat fluxes from the Scandinavian/Siberian region :::: heat ::::: fluxes : (Figure 5d).
Strong vortex events (right column of Figure 6) exhibit an entirely different seasonal cycle, with most events occurring between December and January (primarily due to the threshold-based definition and the climatological maximum strength of the vortex occurring in these months).Most models tend to underestimate the frequency of strong polar vortex winds for November and December initializations, particularly at weeks 3 and 4.There are some notable exceptions, including the CNR-ISAC, CNRM, GEFSv12, and CESM2-CAM systems, which all have substantial strong polar vortex biases in their wintertime zonal winds (Supplemental Figures 2, 5, 6, and 10 : 9).For January initializations, most models instead overestimate the frequency of strong vortex winds beyond week 1 (Fig. 6f), which is also consistent with the general strong vortex biases evident in the model composites of Figure 2. In addition to the probability of the events from raw forecasts, the black horizontal lines in Figure 6 indicate the probabilities 395 estimated from bias-corrected forecasts (see Section 2.2 for details of the mean bias-correction process).The probability of both SSWs and strong vortex events in the bias-corrected hindcasts initialised in November, December and January is generally either close to, or smaller than, the observed probabilities from ERA-I across all forecast systems.In most cases, this corresponds to an improvement over the raw forecasts.Especially for the prediction of SSWs for forecasts initialized in January (Fig. 6e), the mean bias correction clearly improves the estimates over those from the raw data, particularly for lead times of 3-4 weeks.However, the bias correction for late winter/early spring predictions (initializations in February and March, weeks 3 and 4) does not necessarily bring the easterly wind probabilities closer to observations.In some cases the bias correction increases the probability of events, even for systems whose un-corrected probabilities already closely match reanalysis.This may mean that model zonal-mean zonal wind biases in late winter and early spring tend to not dynamically alter the probability of zonal wind reversals at times when final warmings may be expected to occur.The exceptions here are the systems with the most severe biases, such as BoM and CNR-ISAC.Nevertheless, and especially for early winter, the magnitude of zonal wind biases clearly changes the probability of fixed threshold events in most of the S2S systems.This supports the utility of bias correction for stratospheric S2S forecasts, albeit with some limitations.Furthermore, such bias correction has to be applied and interpreted with care, since the non-linear dynamics in the models evolve according to their own potentially biased mean states, and therefore, little can be said about potential tropospheric responses to such bias-corrected stratospheric forecasts.
There also exist biases in the magnitude of predicted events, even at relatively short lead times.In Figure 6 we identified the probability of polar vortex events occurring in the subseasonal forecasts at different lead times; in Figure 7 we instead focus on observed polar vortex events in the ERA-I record, and assess the forecasted wind changes at verification times surrounding the observed events.Figure 7 shows the distributions of simulated wind changes associated with SSWs and strong vortex events (defined as in Section 2.2) in the 1999-2010 reanalysis record.The deceleration or acceleration associated with these events is measured by computing the change in the hindcast/reanalysis zonal-mean zonal wind at 10 hPa and 60 • N, at ±5 days around the ERA-I event onset dates : ; ::: for ::: the :::::::: hindcasts, ::::: these ::: are :::: first :::::::: computed :::::::::: individually ::: for :::: each ::::::::: ensemble ::::::: member ::::: before ::::: being :::::::::: composited.Almost all systems underestimate the wind changes for both SSW and strong vortex events at all lead times, yielding dominantly positive/negative mean errors for SSWs/strong vortex events, respectively.However, the prediction of the observed magnitude of events clearly improves with decreasing lead time, as expected.At weeks 3 and 4 lead times, the predicted wind change distributions among the models :::::: systems : are generally close to zero and exhibit small spread.This indicates that the prediction systems are generally not forecasting extreme vortex events at verification times around the target event dates; instead they :::: these ::::::: models predict climatological zonal-mean zonal wind values or only weak wind tendencies of the same sign as the events.At shorter lead times within the typical deterministic predictability limit of roughly two weeksfor the stratosphere ::: This :: is ::: not : a :::::::::: shortcoming ::: of :: the ::::::::: prediction ::::::: systems, ::: but ::::: rather :: to :: be :::::::: expected ::::: given ::: that ::: the :::::: typical ::::::::::: predictability :::: limit ::: for :::: these :::::: vortex :::::: events : is ::::: about :::: two :::::: weeks.::::: Even ::::: within :: a :::: lead :::: time :: of :::: two ::::: weeks, some systems still underestimate the magnitude and spread of the observed wind changes; many of these are the low-top models such as BoM, CESM2-CAM, and CMA.For SSWs, the ECMWF and NCEP systems have the smallest errors of around 5 m/s; for strong vortex events, the CNRM, CESM2-CAM, and GEFSv12 systems consistently have the lowest mean errors within 10 m/s from 1-3 weeks, but these systems also have substantial positive zonal wind biases.Figure 7c,d also highlights the disparity in the magnitude of wind changes between SSWs and strong vortex events; median errors for SSWs are on the order of 15 -20 m/s with outliers up samples, especially for longer lead times, due to the different hindcast lengths available for each system (see Table 1); in other words, some hindcasts do not fully cover the 10-day periods surrounding the observed SSWs and strong vortex events, and are excluded.
to 60 m/s, whereas the range is much smaller for strong vortex events.This reflects the large and sudden deceleration of winds that occur during SSWs that (in absolute terms) is much larger than the acceleration of winds for strong events.
Finally, we examine whether there are biases among the subseasonal models in :::::::: forecasting : the geometry of the polar vortex for forecasts : at ::::: times : surrounding observed SSW events.Such characteristics are important since biases in the shape or ::: The 435 ::::: shape ::: and : location of the vortex would affect ::::: polar ::::: vortex ::: are ::::::::: ultimately ::::::: affected ::: by the vertical wave propagation that may ultimately influence :::::: activity :::: that :::::::: influences : the occurrence and/or magnitude of SSWs.We examine these characteristics ::::: vortex :::::::: geometry using elliptical diagnostics, which provide quantities such as the vortex centroid latitude and aspect ratio (e.g., Waugh, 1997;Seviour et al., 2013) that can be used to quantify the displacement or stretch of the vortex during SSWs.We perform these calculations using the hindcast 10 hPa geopotential height ::::: heights, assuming that the 30 km contour is representative 440 of the vortex edge.Figure 8 shows the ensemble-mean predictions of the centroid latitude and aspect ratio diagnostics as a function of lead time and initialization with respect to composites of the central dates of displacement SSWs and split SSWs , overestimates :::: them.

Southern Hemisphere Polar Vortex Variability
In the Southern Hemisphere (SH), stratospheric polar vortex variability is mainly associated with interannual variability in the timing of the springtime polar vortex break-down (i.e., the final warming) sometime between November and January.Prior to the final warming, the SH polar vortex undergoes a downward shift in its location relative to its midwinter position (Mechoso et al., 1985).The downward shift of the polar vortex has been linked to a poleward shift of the SH eddy-driven jet in austral spring, while the timing of the polar vortex breakdown has been linked to the equatorward shift of the eddy-driven jet between November and January (Hio and Yoden, 2005;Byrne et al., 2017).
Following on from the SH eddy heat flux biases, Figure 10 shows the interannual standard deviation of SH polar cap (60 • S-500 90 • S) geopotential height at 50 hPa for each of the S2S systems based on initialization dates that are closest to the first of The overestimation of SH variability shown by NCEP from week 4 and beyond in October and November initializations is consistent with this system having the most negative heat flux biases in Figure 9. Overall, because of the limited number of years in the comparison, most of these differences in variability with respect to reanalysis are not significant (not shown).There are also biases in the timing of the seasonal breakdown of the SH polar vortex.Figure 11 shows histograms of the polar vortex breakdown dates across different initialization dates between August and November.Here we simply define the breakdown date as the first day of easterlies without subsequent return to westerlies at 10 hPa and 60 • S. For many of the models, forecasts initialized in early spring produce easterly winds towards the end of the forecast, but at times that are much too early for the breakdown.This is a somewhat surprising result given that such early polar vortex wind reversals are rare in observations and expected to be rare in model simulations (e.g., Jucker et al., 2021).It is unclear, however, whether this behavior represents an early breakdown bias among some of the forecast systems, or whether these events represent SSWs for which the polar vortex would eventually recover if the forecasts continued in time.Regardless, these models produce early events that are generally not consistent with the observational record.Breakdown dates that fall within the ERA-I range are generally only produced for initializations on or after the first of October, or once the end of the forecasts include the second half of December.An exception is the BoM model, which produces breakdown dates that consistently fall close to the end of its forecasts, resulting in a late breakdown bias for initializations after October.Notably, free-running coupled climate models have a bias towards too-late breakdowns of the SH polar vortex (Butchart et al., 2011;Rao and Garfinkel, 2021b); this suggests that information contained in the October initializations likely help to constrain the S2S models and improve final warming estimates.Although stratospheric ozone is prescribed to climatological values in many S2S forecast systems, the strength of the initialized polar vortex winds in October likely contains information about relevant chemistry-climate feedbacks with stratospheric ozone that are well-correlated to the timing of the breakdown date (Butler and Domeisen, 2021).

Discussion and Conclusions
We have performed a comprehensive intercomparison of stratospheric biases in subseasonal forecast systems, with a core focus on systems that contribute to the S2S database (Vitart et al., 2017).Our results show: -Forecast systems with low-top atmospheric models generally have the largest biases across the diagnostics examined for zonal mean winds and temperatures, the QBO, meridional eddy heat fluxes, and the stratospheric polar vortices.
-Global-and annual-mean warm biases in the stratosphere tend to be most common across the different S2S forecast systems, though this can vary for different regions of the stratosphere in some (e.g., lower versus middle stratosphere).
-Too strong/cold wintertime polar vortices, and too cold extratropical UTLS regions are common features across most of the systems in the zonal mean temperature and zonal wind biases.
-In the SH, most systems generally underestimate the late-spring variability of the Antarctic polar vortex, particularly for initializations in October and November.However, many systems also simulate reversals in the 10 hPa 60 • S zonal mean zonal winds for initializations in August and September, at times of the year when SH final warmings have not occurred in reanalysis.
These biases likely arise due to a combination of factors.While the physical processes that govern the evolution of the stratosphere are relatively well understood, they are generally not fully resolved within atmospheric models, and are instead dependent upon model configurations (e.g., the height of model lids and vertical resolution), and simplified/parameterized processes (e.g., gravity wave drag, and the representation of ozone).The resulting biases can affect both the mean state and variability of the stratosphere, and have potential consequences for subsequent coupling with the troposphere.
Several of the systems with the largest global-and annual-mean warm biases in the stratosphere are those in the "NOAA family", including the high-top GEFSv12 and NCEP CFSv2, and the low-top GFDL-SPEAR; the others include the low-top BoM and CESM2-CAM.The only systems with global-and annual-mean cold biases are the ECMWF and CESM2-WACCM systems.It is unclear whether the warm-biases in the NOAA systems are related to a common cause.While the GEFSv12 and NCEP CFSv2 models use similar physics packages, including ozone physics parameterizations and radiation packages, they do use different dynamical cores (Guan et al., 2021;Saha et al., 2006Saha et al., , 2014)); similarly, GEFSv12 and the GFDL-SPEAR share the same FV3-based dynamical core, but SPEAR uses different physics packages and uses prescribed monthly ozone time series (Zhao et al., 2018;Delworth et al., 2020).The fact that the global and annual-mean stratosphere should be in radiative equilibrium poses a strong constraint that biases are likely to be radiative in nature, but model dynamics related to horizontal and vertical resolution can also play a role, particularly at high resolutions (see, e.g., Polichtchouk et al., 2019).We note that the global mean cold biases in the ECMWF system are likely dependent on the specific model cycles, and that more recent versions likely have reduced cold biases following implementation of quintic vertical interpolation in the ECMWF model's semi-Lagrangian numerics (Polichtchouk et al., 2019).
The cold biases in wintertime polar cap temperatures (corresponding to stronger polar stratospheric winds), and cold biases in the extratropical upper troposphere/lower stratosphere are longstanding biases that are similar to what has been documented in other weather and climate models (Charlton-Perez et al., 2013;Bland et al., 2021).The stratospheric polar cap temperature biases generally point to a dynamical cause related to :::::::: dynamical ::::::::: influences :::::: related :: to :::::::: planetary :::: wave :::: drag :::: and parameterized gravity wave drag, given that :::: since : wave-mean flow interactions and the ensuing residual circulations are responsible for driving local zonal mean temperatures away from radiative equilibrium.The cold extratropical UTLS biases, on the other hand, are likely to be radiatively driven, related to excessive leakage of water vapor into the lower stratosphere (e.g., Bland et al., 2021).
Both of these issues are dependent upon vertical resolution, which likely explains why the biases in the low-top systems (which have fewer levels in the stratosphere and coarser resolution in the UTLS) are generally more severe than those in the high-top systems.Correcting such biases :::::::: Reducing :::: such ::::: biases ::::::: through :::::: model ::::::::::: improvements : is likely to have some impact on forecast skill in both the stratosphere and troposphere, since the latitudinal dependence of the temperature biases affect the distribution of winds (i.e., the tropospheric jets and stratospheric polar vortices).For instance, artificially bias-correcting the extratropical UTLS humidity biases in the ECMWF system ::::::: forecast ::::: model : was shown to remove the UTLS cold biases, and moderately improve the skill of forecasts over Europe (Hogan et al., 2017).
The gradual decay of QBO anomalies with lead time towards each forecast system's own climatology is consistent with possible issues related to parameterized gravity wave drag.In models, the QBO has been shown to be most sensitive to parameterized non-orographic gravity wave drag (NOGWD) (e.g.Bushell et al., 2020), but other factors such as model vertical diffusion and resolution can also play a role in properly representing and maintaining the QBO, especially in the lower stratosphere (Garfinkel et al., 2021;Polichtchouk et al., 2021).However, documenting the specific model configurations and gravity wave parameterizations among the different forecast systems examined herein is beyond the scope of this study.Our results also showed that many of the S2S systems show a "Holton-Tan" response to the QBO wind phase in polar vortex winds and polar cap geopotential heights consistent with observations, but only for the first 2-4 weeks of the forecasts, after which the polar vortex anomalies decay.
These heat fluxes were generally more realistic in the high-top systems; the low-top systems showed considerable negative biases in heat fluxes at week 4 over the Northern Pacific and Scandinavia/Siberia.This suggests that, in combination with their limited representations of the stratosphere, these low-top systems have difficulties simulating realistic Rossby wave activity in the troposphere and/or their propagation and interaction with the mean flow (see, e.g., Schwartz et al., 2021).Thus, other biases documented for the low-top systems (such as those related to polar vortex winds and variability) are likely tied, to some extent, to these deficiencies.We documented similar behavior in the low-top composite of SH eddy heat fluxes, but the biases for the low-top systems were less robust, and were particularly affected by the BoM system having large positive heat flux biases (indicating less upward wave driving in the SH).
Our results show that many of the mean state biases become more sizable with increasing lead time (as expected), especially around weeks 3 to 4. Predictions that in some way rely on stratosphere-troposphere coupling processes are unlikely to be meaningfully affected by the small biases present at shorter lead times of 1-2 weeks.For example, the failure of a model to predict a SSW beyond 2 weeks is unlikely to be related to the model's tendency for its initial EQBO anomaly to decay by 2-3 m/s in the first few weeks (Figure 3).However, many of the ::::: Many :: of ::: the S2S systems considered herein ::: also : make forecasts well beyond 4 weeks; on these extended timescales, such biases are likely to be more important.Furthermore, while :::: have ::::: more ::::::::: substantial :::::: impacts ::: on :::::::::::::::::::: stratosphere-troposphere :::::::: coupling.::::: While a fully unbiased model will not be possible to achieve, it is still desirable for models to (1) minimize mean state stratospheric biases so that the stratosphere represents a more accurate upper boundary condition for the troposphere, and (2) have similar variability (in a statistical sense) to observations in the stratosphere so that ensemble spread properly accounts for potential outcomes.For example, forecasts from a model with a strong NH polar vortex bias could simply be post-processed with bias correction to improve the prediction for a SSW (Figure 6); however, if the polar stratospheric winds stay westerly in the actual model simulation, then that would represent a different dynamical regime for stratosphere-troposphere coupling compared to the model actually simulating a transition to stratospheric easterlies (since easterly winds effectively shut off vertical propagation of Rossby waves).
This study primarily focuses on biases among S2S models within the stratosphere.We have shown that large biases are present throughout the stratosphere, linked to a range of stratospheric phenomena in the tropics and the extratropics of both hemispheres.To our knowledge, this is the first systematic assessment of such biases in the stratosphere and of processes affecting the stratosphere in a multi-model study of sub-seasonal to seasonal prediction systems.In a follow-up companion study as part of the same SNAP effort, we more closely examine how biases in the stratosphere, such as those identified herein, are linked to coupling with the troposphere and its predictability.
, the National Center for Atmospheric Research Community Earth System Model version 2 (CESM2) with version 6 of the Community Atmosphere Model as its atmospheric component (NCAR CESM2-CAM6, hereafter CESM2-CAM or CESM2-C), and CESM2 with the version 6 Whole Atmosphere Community Climate Model as its atmospheric component (CESM2-WACCM6, hereafter CESM2-WACCM or CESM2-W; Richter et al., 2021).
so as not to unfairly weight the high-top composite; as the supplemental Figures 13 and 15 show, these systems have nearly identical biases.The patterns of biases in temperatures and zonal winds are consistent across the high-and low-top composites with signatures of (1) global mean warm biases in the stratosphere (consistent with Figure1), (2) cold extratropical UTLS biases in both hemispheres, (3) easterly wind biases in the tropical stratosphere, and (4) too strong/cold stratospheric polar vortices in the winter hemispheres.It is clear, however, that the biases ::: and ::::: MAE : in the low-top models ::::::: systems are generally much larger in magnitude despite having similar ::::: spatial : patterns.

Figure 5 .
Figure 5. (a) The December-February climatology of ERA-I eddy heat fluxes, v ′ T ′ , at 100 hPa.(b) The high-top composite of :::::: week-4 eddy heat flux model biases with respect to ERA-I from November-January initializationsat a lead time of 4 weeks.(c) is as in (b), but composited for the low-top models.In panel a, the line and colour-filled contours match the colorbar spacing of 10 ms −1 K; in panels b and c, the line contours match the colorbar contour intervals of 5 ms −1 K, but colours are only shown where the biases are statistically significant at the 95% level from a two-tailed Student's t-test.(d) Time series of the difference between the S2S hindcasts and ERA-I for v ′ T ′ :: the :::::: wave-1 :::: eddy

Figure 6 .
Figure 6.Probability of boreal (left column) sudden stratospheric warmings and (right column) strong vortex events, shown individually for composites of initializations within each month (rows) and weekly lead time (horizontal axis, alternating grey/white background shading within panels).For each model, the solid bars show the raw estimates, while the bold black horizontal lines indicate the probability determined after bias-correction.The coloured circles indicate the probabilities computed using ERA-I subsampled to match the same dates from each individual set of model hindcasts.

Figure 7 .
Figure 7. Distributions of the composited wind changes forecasted by each ensemble for forecast verification times surrounding (left column) boreal sudden stratospheric warmings and (right column) strong vortex events, composited over all such events in the 1999-2010 hindcast records.This corresponds to 11 SSW events and 13 strong vortex events across the boreal winters from 1999/2000 to 2009/2010.Units are m/s over the 10-day period centered on the observed event central dates within the forecasts or reanalysis.Within each section of each panel, data are shown as violin plots covering the 15-85% range of wind changes or mean errors, with outliers outside this range indicated with individual coloured circles, and black horizontal lines indicating the median value.In panels (a) and (b) the quantities shown are raw values, while those in panels (c) and (d) are shown as deviations from ERA-I.Note that individual models may contain a slightly different number of

Figure 8 .
Figure 8. Predictions of boreal stratospheric vortex (a-i) centroid latitude (j-r) aspect ratio, computed at 10 hPa from hindcast geopotential height.Values shown are an average of all ensemble members over all (a-e, f-i) displacement and (j-n, o-r) split SSWs in the 1999-2020 record.Data are plotted against (horizontal axis) SSW-relative day number and (vertical axis) model lead time.For each panel, the number of SSWs ('SSWs') and number of ensemble members ('n') is indicated at top left.The colorbar transitions to pink occur at thresholds of 66 • N for centroid latitude, and 2.4 for aspect ratio.

Figure 9 .
Figure 9. (a) The September-November climatology of ERA-I eddy heat fluxes, v ′ T ′ , at 100 hPa in the Southern Hemisphere.(b) The hightop composite of eddy heat flux model biases with respect to ERA-I from August-October initializations at a lead time of 4 weeks.(c) is as in (b), but composited for the low-top models.In panel a, the line and colour-filled contours match the colorbar spacing of 10 ms −1 K; in panels b and c, the line contours match the colorbar contour intervals of 5 ms −1 K, but colours are only shown where the biases are statistically significant at the 95% level from a two-tailed Student's t-test.(d) Time series of the difference between the S2S hindcasts and ERA-I for the combined wave-1 and 2 planetary wave heat flux over 45 • -75 • S, based on August-October initializations.Note that the low-top composite in panel (c) includes results from the CNR-ISAC system for a total of 3 low-top systems in the composite, which is not shown in panel (d) (only 2 low-top systems).

Figure 10 .
Figure 10.Interannual standard deviation of the southern hemisphere hindcast 50 hPa polar cap (60 • -90 • S) geopotential height for each model (coloured lines) in comparison to ERA-I (black lines).The hindcasts closest to the first of each month are chosen for each model.The verification time range in each row spans a maximum of 8 weeks.

Table 1 .
Details of the subseasonal-to-seasonal forecast systems used herein.Raw is the raw quantity, Qhc is the hindcast climatology, Qobs is the observed/reanalysis climatology, t is the forecast initialization date, l is the lead time, and t doy is the day of year of the initialization date t.