Comment on wcd-2021-15, Anonymous Referee #2. Referee comment on "Resampling of ENSO teleconnections: accounting for cold-season evolution reduces uncertainty in the North Atlantic"

The manuscript by M. P. King, Camille Li, and Stefan Sobolowski focuses on quantifying the uncertainty of the winter ENSO teleconnection to the North Atlantic sector. The main message of the study is that accounting for the winter evolution of this teleconnection, and thus separating the sea level pressure (SLP) response into early and late winter, reduces its uncertainty. In the manuscript, the uncertainty of the teleconnection is measured using different bootstrapping methodologies and displayed using Taylor diagrams, which show the pattern correlation and the amplitude of the teleconnection. The authors do not attempt to better understand the mechanism underlying the intraseasonal variability of the teleconnection, but focus purely on statistical aspects of the SLP response to ENSO. The paper briefly touches on the asymmetry between La Niña (LN) and El Niño (EN) and on the nonlinearity associated with the location of the maximum SSTs in the tropical Pacific, i.e., Eastern Pacific (EP) and Central Pacific (CP) ENSO events, and finds that considering EP and CP events separately leads to larger confidence intervals. The paper finishes with a detailed discussion of confidence intervals and t-tests, and concludes that for the present application a simple t-test is as good as a more sophisticated bootstrapping technique, which agrees with the common practice in climate science of using t-tests to assess the statistical significance of seasonal teleconnections.

The role of tropical-extratropical teleconnections in the predictability of Northern Hemisphere climate, and how prediction systems can be improved by including these processes correctly, was recognized early on. Attention to this aspect has intensified in recent years (e.g., Hardiman et al. 2020; Scaife et al. 2014, 2016). However, establishing the teleconnection patterns and amplitudes as well as the physical mechanisms involved remains a challenge owing to large atmospheric internal variability in the extratropics. R1.3
The issue of uncertainty in the ENSO teleconnection to the Northern Hemisphere during wintertime (DJF) was raised by Deser et al. (2017) (DES2017 hereafter). They used bootstrapping (random sampling with replacement) to show that the sea level pressure (SLP) composites associated with ENSO vary considerably in both pattern and amplitude, due primarily to sampling fluctuations arising from internal (non-ENSO forced) atmospheric variability rather than ENSO variability, as the ENSO SST composites (average of selected events) are relatively robust and hence have very narrow confidence intervals (see Fig. 5 in DES2017). R2.6 The North Atlantic circulation response exhibits a large range of amplitudes, different levels of statistical significance, and even anomalies of opposite signs in some locations. In contrast, the North Pacific teleconnection has lower uncertainty in amplitude than the North Atlantic, with a high confidence in sign. R2.7 Through the use of large multimodel ensembles, DES2017 also examined model biases in both the forced response to ENSO and the internal atmospheric variability. A key message from their paper is that even with nearly 100 years of observations, the observed R2.8 response to ENSO over the North Atlantic sector in boreal winter is very uncertain.
The late-fall to early-winter (Nov-Dec) ENSO teleconnection in the North Atlantic is notably different spatially from its late-winter (Jan-Feb) counterpart (e.g., Moron and Gouirand 2003). The Nov-Dec teleconnection has a dipolar pattern similar to the NAO but with its northern and southern centers shifted southward, resembling R2.9 the East Atlantic pattern, while the Jan-Feb teleconnection projects onto the NAO pattern (King et al. 2018a; Mezzina et al. 2020) R2.10. Consistently, Molteni et al. (2020) examined the PDFs of the NAO index obtained through resampling a reanalysis record and found a change from positive to negative (negative to positive) NAO index associated with the warm (cold) ENSO phase during the cold season (although the use of a single index loses some accuracy in the spatial patterns).
The fall-to-winter evolution has implications for the underlying teleconnection mechanism and associated impacts. Recently, King et al. (2018a) highlighted these intra-seasonal differences, and suggested that the Nov-Dec ENSO teleconnection to the North Atlantic involves a tropospheric pathway, unlike the Jan-Feb teleconnection which involves both tropospheric and stratospheric pathways (also see Ayarzagüena et al. 2018; Domeisen et al. 2019; Hardiman et al. 2019; Jiménez-Esteve and Domeisen 2018) R2.11, R2.12. A statistically significant relationship between ENSO and western European temperatures in Nov-Dec was also presented by King et al. (2018a).
The mechanisms behind and model performance in representing the varying ENSO teleconnection during the extended cold season have been explored. Ayarzagüena et al. (2018) suggest that Rossby wave propagation in the troposphere excited by atmospheric diabatic heating (a result of latent heating related to precipitation) in the western tropical Atlantic is responsible for the Nov-Dec teleconnection. The late fall response may also be modulated by SST forcing from the tropical west Pacific (Bladé et al., 2008) or atmospheric diabatic heating over the Indian Ocean, which is itself amplified by ENSO (Joshi et al., 2021). A number of studies (Ayarzagüena et al., 2018; Joshi et al., 2021; King et al., 2018a; Molteni et al., 2020) report that models are generally able to simulate the varying teleconnection to the North Atlantic from November through February in initialized hindcasts, while "free-running" coupled-model experiments have less success, producing an NAO-like pattern similar to the observed late-winter teleconnection through all these months with at best a weak signature of the transition.
In this study, we revisit the question of uncertainty in the ENSO teleconnection to the North Atlantic. The major methodological differences compared to DES2017 are that we consider a varying ENSO teleconnection through the cold season, as well as a larger North Atlantic domain that includes the mid-latitudes, where an important part of the teleconnection signal resides. As in DES2017, our main technique for estimating uncertainty is bootstrap resampling (see Hesterberg (2015) and chapters 10, 11 in Efron and Hastie (2016) for good general introductions). This technique is widely employed in climate research for estimating uncertainties in teleconnections (e.g., Michel et al. 2020), estimating prediction skill in large ensembles (e.g., Stockdale et al. 2015), and investigating the effects of sampling variability (e.g., Cash and coauthors 2017). Bootstrapping artificially increases the number of samples available for analysis. It is simple to implement and it does not make any assumption about the underlying distribution. However, it should not be regarded as a substitute for producing larger model ensembles or making longer observations, because certain properties of the bootstrap distributions (such as the location parameter) are still dependent on the original samples (Hesterberg, 2015).
The three subsections in Results correspond to the steps of our investigation: (a) analyze the sampling variability problem of SLP composites associated with ENSO for Jan-Feb and Nov-Dec means separately in addition to DJF means, using a larger North Atlantic domain that includes the midlatitude region in contrast to the polar area used by DES2017 R2.13; (b) investigate asymmetry in El Niño and La Niña-related SLP composites; and (c) investigate several types of confidence intervals that are available via bootstrapping in addition to the ordinary t intervals. In designing the bootstrap tests with these factors in mind, we aim to present a more nuanced perspective of the ENSO teleconnection to the North Atlantic with reduced uncertainty compared to earlier studies which considered the standard DJF winter season view. R2.14

2 Data and methods

Data
To be consistent with DES2017, we use the same datasets. The SLP data are from the NOAA-CIRES Twentieth Century Reanalysis V2c (Compo and Coauthors, 2011). The sea surface temperature (SST) used to construct the Nino3.4 index is from HadISST1.1 (Rayner and Coauthors, 2003). DES2017 verified that their results are robust across different datasets; our experience has been similar when dealing with SST, SLP, and geopotential heights in monthly to seasonal means for tropical-extratropical teleconnections R2.15 (e.g., King et al. 2018a, b).

SLP composites and bootstrapping
El Niño and La Niña events are identified according to when the magnitude of the Nino3.4 index is greater than or equal to one standard deviation. Table 1 lists the selected ENSO events for calculating the DJF, Nov-Dec and Jan-Feb SLP composites, with further explanation given at the beginning of Sect. 3.1. R2.16 For each category, we have a set of SLP anomaly fields denoted $E_o$ or $L_o$, which are approximations of the populations associated with El Niño and La Niña, respectively. The observed (i.e., 'original' to differentiate it from 'bootstrap') SLP composite $C_o$ for ENSO is then calculated as $C_o = \overline{E_o} - \overline{L_o}$, where the overbar denotes the mean of the set. In constructing a bootstrap composite $C^*$, we draw randomly with replacement from the sets $E_o$ or $L_o$ to form $E^*$ or $L^*$, respectively; and $C^* = \overline{E^*} - \overline{L^*}$ is then calculated. The sample size of each bootstrap $E^*$ and $L^*$ composite is the same as that of the observed $E_o$ and $L_o$, respectively, meaning that the bootstrap composites contain the same number of El Niño or La Niña events as the observations (except when investigating the effect of sample size in Sect. 3.1). R2.17 Multiple bootstrap composites are generated in this way to form a set $C^*$. As in DES2017, we typically create 2000 bootstrap composites (i.e., there are 2000 members in $C^*$). Larger numbers of bootstrap samples have been tested for some cases, and the differences are too small to affect the findings.
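The resampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each SLP anomaly field is represented as a flat list of grid-point values, and the function names (`mean_field`, `composite`, `bootstrap_composites`) and the toy numbers are hypothetical. In practice an array library would normally replace the plain lists.

```python
import random


def mean_field(fields):
    """Grid-point-wise mean of a list of equally sized fields (flat lists)."""
    n = len(fields)
    return [sum(values) / n for values in zip(*fields)]


def composite(el_nino, la_nina):
    """Composite C = mean(E) - mean(L), evaluated at every grid point."""
    e_bar = mean_field(el_nino)
    l_bar = mean_field(la_nina)
    return [e - l for e, l in zip(e_bar, l_bar)]


def bootstrap_composites(el_nino, la_nina, n_boot=2000, seed=0):
    """Draw events with replacement, keeping the original sample sizes."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_boot):
        e_star = rng.choices(el_nino, k=len(el_nino))
        l_star = rng.choices(la_nina, k=len(la_nina))
        results.append(composite(e_star, l_star))
    return results


# Toy data (hypothetical): three El Nino and two La Nina fields, two grid points.
el_nino = [[-4.0, 2.0], [-6.0, 1.0], [-5.0, 3.0]]
la_nina = [[3.0, -1.0], [5.0, -2.0]]

c_obs = composite(el_nino, la_nina)
c_star = bootstrap_composites(el_nino, la_nina, n_boot=200)
```

Fixing the seed makes the set of bootstrap composites reproducible, which is convenient when comparing runs with different numbers of resamples.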

Confidence intervals and significance testing
The ordinary t confidence intervals are written conventionally as $\mu = \bar{x} \pm t \cdot \mathrm{SE}$ (Bulmer, 1979), where $\mu$ is the population mean, $\bar{x}$ a sample mean calculated from $n$ elements, and $\mathrm{SE} = s/\sqrt{n}$ the standard error estimated from the sample standard deviation $s$. In the two-tailed t-test with the null hypothesis $\mu = 0$, we check whether the lower bound satisfies $|\bar{x}| - t \cdot \mathrm{SE} > 0$, in which case we can reject the null hypothesis. The t value is usually obtained from a table (e.g., p. 265 in Spiegel and Liu 1999). For example, for a 5% significance level, we look up $t_{0.025}$ for a two-tailed test, that is, a 5% probability split evenly between two tails, which is one reason that a normal (unskewed) distribution is required for the t test. Rewriting the t test using the notation introduced in the previous paragraph for the observed composite gives $|C_o| - t \cdot \mathrm{SE}_o > 0$, which is checked gridpoint-wise to see where the null hypothesis can be rejected, and hence where the signal can be considered statistically significant. The method described here is widely used in climate research (and also here in Sects. 3.1 and 3.2 to check statistical significance). It is useful to describe this general approach because in Sect. 3.3 we discuss three additional types of confidence interval that bootstrapping allows us to calculate.
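As a minimal sketch of the ordinary t interval and the corresponding significance check at a single grid point, the following Python example applies the one-sample formulas quoted above. The function names and the toy anomaly values are hypothetical; the critical value 2.262 is the tabulated $t_{0.025}$ for 9 degrees of freedom and is passed in explicitly, mirroring the table lookup described in the text.

```python
import math
import statistics


def standard_error(sample):
    """SE = s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def t_interval(sample, t_crit):
    """Ordinary t interval: sample mean +/- t * SE."""
    x_bar = statistics.mean(sample)
    half_width = t_crit * standard_error(sample)
    return x_bar - half_width, x_bar + half_width


def is_significant(sample, t_crit):
    """Two-tailed test of mu = 0: reject when |mean| - t * SE > 0."""
    x_bar = statistics.mean(sample)
    return abs(x_bar) - t_crit * standard_error(sample) > 0


T_CRIT = 2.262  # t_0.025 for n - 1 = 9 degrees of freedom (from a t table)

# Toy anomalies (hPa) at two grid points, ten events each (hypothetical values).
point_a = [-4.2, -3.1, -5.0, -3.8, -4.5, -2.9, -4.1, -3.6, -4.8, -4.0]
point_b = [1.5, -2.0, 0.5, -1.0, 2.5, -3.0, 0.0, 1.0, -0.5, 1.0]

low_a, high_a = t_interval(point_a, T_CRIT)
significant_a = is_significant(point_a, T_CRIT)
significant_b = is_significant(point_b, T_CRIT)
```

In a gridded analysis the same check would simply be repeated at every grid point, which is what "gridpoint-wise" refers to above.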

Taylor diagram
The number of bootstrap composites in $C^*$ is rather large, and the composites themselves exhibit varying spatial patterns and amplitudes. An effective way to summarize this information is through the Taylor diagram (Taylor, 2001). A Taylor diagram is plotted in polar coordinates $(r, \alpha)$, where we define the radial distance $r = \lVert C^* \rVert_2 / \lVert C_o \rVert_2$ of each plotted point from the origin as the ratio of the spatial amplitude (Euclidean norm) of a bootstrap composite to that of the observed composite, R2.18 and the cosine of the angle $\alpha$ from the positive x-axis direction, $\cos(\alpha) = C^* \cdot C_o / (\lVert C^* \rVert_2 \lVert C_o \rVert_2)$, as the spatial correlation between the bootstrap and observed composites. R2.18 Both the spatial amplitude and spatial correlation are area-weighted (by the cosine of latitude, $\cos(\phi_j)$). The symbol $\sum_{i,j} \equiv \sum_i \sum_j$ represents the summation over grid points in a chosen spatial domain. R2.18 For the North Pacific analysis, we select the domain 150°E-120°W, 30°-60°N; for the North Atlantic, the domain 50°W-10°E, 20°-60°N is used for Nov-Dec, and 60°W-0°E, 30°-70°N for DJF and Jan-Feb. The exact choices of the boundaries might be somewhat subjective but are made to enclose the important anomalies in the observed composites within the geographical area of interest (i.e., North Pacific, North Atlantic). R1.4 In practice, we compute $r$ and $\cos(\alpha)$ using the GrADS functions atot and scorr. The general interpretation of the Taylor diagram is that the radial distance of a point indicates how strong the anomalies in a bootstrap composite are compared to the observed composite, and the proximity to the x-axis indicates how closely the pattern of a bootstrap composite resembles the observed composite.
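The two Taylor-diagram coordinates can be illustrated with a short Python sketch. It assumes the uncentered form of the inline expressions above, with each squared term or product weighted by the cosine of latitude; whether the GrADS functions used by the authors treat area means or weights in exactly this way is not stated in the text, so the details here are an assumption. The function names and the toy grid are hypothetical.

```python
import math


def area_weights(lats_deg, n_lon):
    """cos(latitude) weight for each grid point, latitude-major ordering."""
    return [math.cos(math.radians(lat)) for lat in lats_deg for _ in range(n_lon)]


def taylor_coordinates(c_star, c_obs, weights):
    """Return (r, cos_alpha) for one bootstrap composite.

    r         : ratio of area-weighted Euclidean norms, |C*| / |Co|
    cos_alpha : area-weighted (uncentered) pattern correlation
    """
    dot = sum(w * a * b for w, a, b in zip(weights, c_star, c_obs))
    norm_star = math.sqrt(sum(w * a * a for w, a in zip(weights, c_star)))
    norm_obs = math.sqrt(sum(w * b * b for w, b in zip(weights, c_obs)))
    return norm_star / norm_obs, dot / (norm_star * norm_obs)


# Toy 2 x 2 grid (hypothetical): latitudes 30N and 60N, two longitudes each.
weights = area_weights([30.0, 60.0], n_lon=2)
c_obs = [-4.0, -2.0, 3.0, 1.0]

doubled = [2.0 * v for v in c_obs]   # same pattern, twice the amplitude
flipped = [-v for v in c_obs]        # same amplitude, opposite sign

r_doubled, cos_doubled = taylor_coordinates(doubled, c_obs, weights)
r_flipped, cos_flipped = taylor_coordinates(flipped, c_obs, weights)
```

A composite with the same pattern but doubled amplitude lands on the x-axis at r = 2, while a sign-reversed composite lands at r = 1 on the negative x-axis, which matches the interpretation of radius and angle given above.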

Uncertainties of ENSO teleconnections in Nov/Dec and Jan/Feb
First, the observed SLP composites ($C_o$ described in Sect. 2) for ENSO are calculated for the period 1920-2013, the same period used in DES2017. In addition to the standard DJF composite, we compute composites for Nov-Dec (ND) and Jan-Feb (JF). The events included in the various seasonal composite calculations in this study are listed in Table 1. Also similar to the DES2017 approach, the Nino3.4 indices for NDJ, ON and DJ are used to identify SLP anomalies for DJF, ND, and JF, respectively. One reason for using this one-month lag is to account for a delay in the influence of tropical SST anomalies on the extratropics through both the tropospheric and stratospheric pathways. It turns out that exactly the same years are selected for NDJ and DJ, and they are also mostly the same events as for ON. R2.19 For Nino3.4 in NDJ and DJ, our method selects  Table 1.
The value of separating the cold season into late fall (ND) and winter (JF) has been addressed by previous studies investigating statistical and dynamical properties of ENSO teleconnections to the North Atlantic (Ayarzagüena et al., 2018; King et al., 2018a, b; Moron and Gouirand, 2003; Toniazzo and Scaife, 2006), and is also demonstrated in Fig. 1 for the data set used in this study. Shown in Fig. A1 are composites of the four months individually (also see Moron and Gouirand 2003 in particular), which indicate the differences qualitatively and confirm that it is reasonable to perform analyses on ND and JF means. R1.1, R2.2 The DJF SLP composite as well as the spatial pattern of significant anomalies (at the 5% significance level) are very similar to those presented by DES2017 (Fig. 1a). There is a strengthening of the Aleutian Low with amplitude up to about 8 hPa; in the North Atlantic the anomalies project onto the negative NAO, with about 4 hPa weakening of the southern center (Azores High) and up to 3 hPa weakening of the northern center (Icelandic Low, but with a very small area of statistical significance). Panel c shows the composite for JF. The main difference compared to DJF is that the anomaly in the northern center in the Atlantic sector strengthens to 4 hPa with a larger area of statistical significance. The composite for ND (panel b) instead shows a meridional gradient in the North Atlantic opposite to that for JF, with centers located south of those for JF, resembling the East Atlantic pattern. R2.20 Next, bootstrapping is carried out to produce resampled ENSO composites as described in Sect. 2. The corresponding Taylor diagrams for the North Atlantic sector are shown in the top row of Fig. 2. The large black dot on the x-axis at coordinates ($r = 1$, $\alpha = \arccos 1$) indicates the observed composite $C_o$, and the small dots represent the 2000 bootstrap composites in $C^*$.
The blue and red lines indicate the 5th and 95th percentiles of the amplitude ratios or spatial correlations: we refer to these as the bootstrap percentiles bracketing the bootstrap confidence interval, which contains 90% of the bootstrap composites (see also caption). The second and third rows in Fig. 2 show the spatial pattern of bootstrap composites close to the confidence interval bounds in both the amplitude ratio and spatial correlation (i.e., the point closest to the intersection of the blue arc and straight blue line, or of the red arc and straight red line).

According to the Taylor diagrams, separating the cold season into ND and JF reduces the uncertainty in the ENSO teleconnection to the North Atlantic. Specifically, the cloud of bootstrap composites becomes denser, contracting towards the $r = 1$ semi-circle and the x-axis. In terms of the amplitude ratios, the bootstrap confidence intervals for DJF, ND and JF are (0.57, 1.59), (0.69, 1.56) and (0.58, 1.57) respectively, meaning that the upper bounds are about 2.8, 2.3, and 2.7 times the lower ones. In terms of spatial correlations, the improvements are clear for both ND and JF over the traditional DJF winter season definition. Specifically, the bootstrap confidence intervals are narrower and the 5th percentiles are larger, sitting close to 0.8 (straight blue lines in panels b, c) compared to 0.69 for the DJF bootstrap distribution. As a visual guide for the comparison, the thick gray arc indicating the confidence interval for DJF in panel a is repeated in panels b and c. Even for DJF, our analysis, which includes a larger mid-latitude North Atlantic domain than in DES2017, indicates some certainty in the spatial pattern because the confidence interval spans only positive spatial correlations (0.69 to 0.99 given in Fig. 2a). If we instead construct a DJF Taylor diagram for a "polar" North Atlantic area (poleward of 60°N) similar to DES2017, the uncertainty is larger in both the spatial correlations and amplitude ratios (Fig. 3a). The 5th to 95th percentile interval contains bootstrap composites at the lower limit that have limited areas of statistical significance (Fig. 3b); the midlatitude region, which exhibits larger patches of statistically significant SLP anomalies (first column in Fig. 2), is not included and hence cannot contribute to narrowing the uncertainty. R2.21
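The upper-to-lower ratios quoted above follow directly from the interval bounds, and the 5th and 95th percentiles themselves can be read off a list of bootstrap values. The Python sketch below shows both steps; it is only an illustration, and the helper name `percentile_interval` and the use of the standard library's default (exclusive) quantile convention are assumptions, since the text does not say which percentile convention was used.

```python
import statistics


def percentile_interval(values):
    """Return the (5th, 95th) percentiles of a list of bootstrap values."""
    cuts = statistics.quantiles(values, n=20)  # cut points every 5%
    return cuts[0], cuts[-1]


# Amplitude-ratio intervals quoted in the text for the North Atlantic.
intervals = {"DJF": (0.57, 1.59), "ND": (0.69, 1.56), "JF": (0.58, 1.57)}

# Ratio of the upper bound to the lower bound, rounded to one decimal.
bound_ratios = {
    season: round(upper / lower, 1) for season, (lower, upper) in intervals.items()
}

# Sanity check of the percentile helper on the integers 1..99.
check_interval = percentile_interval(list(range(1, 100)))
```

The rounded ratios reproduce the factors of about 2.8, 2.3 and 2.7 stated for DJF, ND and JF.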

The spatial patterns of the bootstrap percentiles in Fig. 2 are consistent with the t test results shown in gray shading (Fig. 1), indicating robust signals where the confidence interval is bounded by significant anomalies of the same sign. For DJF, the southern center has statistically significant negative anomalies at both the 5th and 95th bootstrap percentiles, while the northern center has a very small area of statistical significance at the lower 5th percentile (comparing panels d, g). For ND and JF, both the southern and northern centers are bounded by bootstrap composites showing the same (expected) sign. To summarize, these results show that separating the DJF winter into JF and ND, as well as including the mid-latitude North Atlantic in the analysis, R2.22 reduces the uncertainty in the pattern of the ENSO teleconnection (as indicated by the narrower intervals for spatial correlations in panels b, c), with a slight improvement in the certainty of the amplitude for ND only. This is examined further in Sects. 3.2 and 3.3.
DES2017 used data from 1920-2013 to coincide with the availability of the model data they also analyzed. We have checked the longer period of 1870-2014 available in the reanalysis data (ENSO events selected are listed in Table 1). The result is shown in Fig. A2, with the top row being the observed composites, and the bottom row the corresponding Taylor diagrams. The main difference compared to Figs. 1, 2 is that for the JF analysis, the bootstrap confidence intervals are narrower for both amplitude ratio and spatial correlation, with the 5th percentile for the latter increasing from 0.77 to 0.84. Otherwise, using the longer reanalysis period with more available events for the resampling does not change the results enough to affect the conclusions.
The effect of sample size (number of ENSO events) on the robustness and uncertainty range of ENSO signals is often estimated using sub-sampling techniques, especially in large ensembles of modelling experiments (e.g., Michel et al. 2020; Weinberger et al. 2019). Although the total number of available observed events is greatly limited compared to a large model ensemble, we can still examine the effect of sample size on the uncertainty by sub-sampling with replacement. We perform this test using sample sizes of 5, 10, 15, etc. for the cases in Fig. A2 (see Fig. A3). As expected, the teleconnection becomes more robust with sample size: the uncertainty range for amplitude decreases and the spatial correlations increase as
While the focus of this study is the North Atlantic, we briefly touch on ENSO's effects in the North Pacific. The analyses in Fig. 2 are repeated for the North Pacific, and shown in Fig. 4. In general, the North Pacific is less affected by sampling variability in both the amplitude ratios and spatial correlations. The teleconnection pattern is consistent across all three seasonal means, unlike the North Atlantic, where ND is different. The clouds of bootstrap composites are tighter and closer to the x-axis, with less uncertainty for DJF than JF or ND. For JF, the bootstrap confidence interval for the amplitude ratios is (0.75, 1.31), where the upper bound is 1.75 times the lower one (close to the factor of 2 reported by DES2017). The uncertainty in the amplitude can be important for climate impacts assessment (Deser et al. 2018; Michel et al. 2020).
Another resampling technique is known as the permutation test. We have performed this test with the aim of demonstrating how it quantifies the uncertainty of the systematic difference between North Atlantic SLP anomalies associated with El Niño and La Niña. This is essentially equivalent to the above assessments, which examine the hypothesis that El Niño minus La Niña composites are significantly different from zero. The result is reported in Appendix A. R2.23
In the above, the uncertainties of the composites are investigated, in essence, by questioning the sampling variability of SLP anomalies associated with El Niño and La Niña. In addition, we can also ask: Is there any systematic difference between El Niño and La Niña SLP anomalies such that calculating a linear composite makes sense? Here, we make a quick detour to demonstrate another resampling technique called the permutation test. The null hypothesis is that the SLP anomalies under El Niño and La Niña events originate from the same population. First, the El Niño and La Niña years are put into the same pool.
For example, for NDJ, the 26 El Niño and 22 La Niña years (see Table 1) are pooled together. Second, we randomly draw 26 years from the pool and reassign (also called relabel) them as El Niño, and the remaining 22 years as La Niña. The composite $C^*$ is then calculated, and the steps are repeated to obtain 2000 composites as before. If the reassigned bootstrap composites
an SLP composite at least as large as the observed one if El Niño and La Niña related SLP anomalies were from the same population. These p values are small enough (following common practice) to reject the null hypothesis. Note also that none of the bootstrap confidence intervals for the amplitude ratios, indicated with the blue and red semi-circles, crosses over the gray semi-circles (therefore radii for red semi-circles < 1) representing the observed amplitudes. The permutation tests carried out here suggest that the SLP anomalies associated with El Niño and La Niña in the North Atlantic can be distinguished from each other with a high confidence. R2.23 //Moved to Appendix A//
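The pooling and relabeling steps of the permutation test can be sketched as follows. This is a minimal Python illustration rather than the authors' code: fields are flat lists of grid-point values, the amplitude of a composite is taken to be its unweighted Euclidean norm (an assumption made for brevity), and the function names and toy data are hypothetical. The p value is estimated as the fraction of relabeled composites whose amplitude is at least as large as the observed one.

```python
import math
import random


def mean_field(fields):
    """Grid-point-wise mean of a list of equally sized fields (flat lists)."""
    n = len(fields)
    return [sum(values) / n for values in zip(*fields)]


def composite(el_nino, la_nina):
    """Composite C = mean(E) - mean(L) at every grid point."""
    return [e - l for e, l in zip(mean_field(el_nino), mean_field(la_nina))]


def amplitude(field):
    """Unweighted Euclidean norm of a field (assumed amplitude measure)."""
    return math.sqrt(sum(v * v for v in field))


def permutation_p_value(el_nino, la_nina, n_perm=2000, seed=0):
    """Fraction of relabeled composites at least as strong as the observed one."""
    rng = random.Random(seed)
    observed = amplitude(composite(el_nino, la_nina))
    pool = el_nino + la_nina          # put all years into the same pool
    n_e = len(el_nino)                # keep the original group sizes
    count = 0
    for _ in range(n_perm):
        shuffled = rng.sample(pool, k=len(pool))  # random relabeling
        relabeled = composite(shuffled[:n_e], shuffled[n_e:])
        if amplitude(relabeled) >= observed:
            count += 1
    return count / n_perm


# Toy data (hypothetical): two clearly separated groups of six fields each.
el_nino = [[-4.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(6)]
la_nina = [[4.0 - 0.1 * i, -2.0 + 0.1 * i] for i in range(6)]

p_value = permutation_p_value(el_nino, la_nina)
```

Because the two toy groups are well separated, almost every relabeling mixes them and weakens the composite, so the estimated p value is small and the null hypothesis of a common population would be rejected.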

Uncertainties in El Niño and La Niña teleconnections
The aim of this subsection is to examine the uncertainties for SLP anomalies associated with El Niño and La Niña separately (based on Niño3.4 as above). Additional analyses also allow us to inspect whether there is any asymmetry in the anomaly patterns associated with these teleconnections. (our approach does not test for statistical significance in the nonlinearity of the
Furthermore, as also reported by DES2017 for DJF, we find that the SLP anomalies during El Niño and La Niña events in the ND and JF seasons do not indicate any asymmetry in terms of the sign of the anomalies over the domains used in the Taylor diagrams R2.25; this is true for both the North Atlantic and North Pacific sectors. However, the spatial extents of significant teleconnections in the North Atlantic are different. During El Niño for ND and La Niña for JF, the SLP anomalies in the North Atlantic have larger areas of statistical significance (panels a, d compared with b, c) R2.25, as well as narrower bootstrap confidence intervals in the spatial correlations (panels e, h compared with f, g). Note also the smaller scatter of the bootstrap composites and their proximity to the observed composite on the x-axis in panels e, h. Without further study and more observations, it is not possible to provide a physical explanation for the differences in the uncertainties between these composites (but see Hardiman et al. 2019).
We performed additional analyses to investigate whether the asymmetry seen in the composite anomalies in the first row of Fig. 5 is statistically significant. This is done by assessing the composites for the asymmetrical portion of the ENSO teleconnection (El Niño+La Niña) against the null hypothesis that they are statistically indistinguishable from zero. The results are shown in Fig. 6. Firstly, in the top row, it is noted that there are no statistically significant values in the North Pacific and North Atlantic areas of interest in this study. In other words, we cannot reject the null hypothesis at the 5% significance level based on t tests, meaning that we find no significant asymmetry in these regions. There are some locations at lower latitudes over the extratropical Pacific and Atlantic oceans which exhibit statistically significant nonlinearity. Secondly, the second row of Fig. 6 shows the Taylor diagrams for the asymmetrical portion of the teleconnection over the North Atlantic. The confidence intervals for the spatial correlations of the bootstrap composites and the observed composites have lower limits at -0.09 and -0.22 (blue lines), which are weak and even negative, indicating again that nonlinearity is not detectable, at least according to these domain-wise metrics. Note that these analyses are performed using observed SLP anomalies identified with Niño3.4 events as in other parts of this study. We do not investigate nonlinearity related to other factors such as ENSO SST types (except briefly below) or amplitudes, other atmospheric variables and regions, as this is a challenging question in its own right, and is addressed by other studies (e.g., Garfinkel et al. 2013; Jiménez-Esteve and Domeisen 2020; Trascasa-Castro et al. 2019; Weinberger et al. 2019).
Consistent with our results, these studies also generally agree that elucidation of nonlinearities using the limited sample of events in the reanalysis record is difficult, and therefore this research requires larger samples of model data. R2.1, R2.25, R2.26
Many previous studies (e.g., Feng et al. 2017; Frauen et al. 2014; Garfinkel et al. 2013; Toniazzo and Scaife 2006; Zhang et al. 2018) investigated asymmetry in ENSO teleconnections arising from central Pacific (CP) or eastern Pacific (EP) events, or due to varying ENSO strength. Zhang et al. (2018) show that CP-El Niño and CP-La Niña events are symmetrical (see Fig. A5e, g), while CP-La Niña and EP-La Niña (Fig. A5g, h) teleconnections in the Euro-Atlantic area are not the same either spatially or in sign (the CP- and EP-ENSO events selected are listed in Table 2) R1.5. They postulate that the asymmetry can be due to the fact that as the EP-La Niña develops, the eastern tropical Pacific becomes progressively colder, thus reducing the chances of overcoming the convection threshold to influence the overlying atmosphere and trigger teleconnections. They note, however, that they are unable to provide an explanation for the strong anomaly in the North Atlantic-Europe sector during EP-La Niña (Fig. A5h), which is not symmetrical with El Niño's. A brief check of the ND composites associated with these previously identified CP and EP events R1.5 shows very little in terms of robust teleconnection signals (Fig. A5a-d), especially over the North Atlantic. This is likely due to a combination of weaker ENSO SST anomalies in ND and smaller areas over which SST anomalies occur during CP or EP events compared to all events pooled together. Studies such as those cited in the previous paragraph do suggest, however, that for JF, mixing CP- and EP-ENSO (as the Nino3.4 index does) may result in cancellation of signals in the North Atlantic, thus affecting the statistical significance and increasing the uncertainty of the signals.
Therefore, asymmetry is perhaps a more important factor for JF than for ND.
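The asymmetry check described earlier in this subsection rests on splitting the response into a linear part (El Niño minus La Niña) and an asymmetrical part (El Niño plus La Niña). A minimal Python sketch of that decomposition is given below; the function names and toy fields are hypothetical, and the significance test of the asymmetrical part against zero is not reproduced here.

```python
def mean_field(fields):
    """Grid-point-wise mean of a list of equally sized fields (flat lists)."""
    n = len(fields)
    return [sum(values) / n for values in zip(*fields)]


def linear_and_asymmetric(el_nino, la_nina):
    """Return (mean(E) - mean(L), mean(E) + mean(L)) at every grid point."""
    e_bar = mean_field(el_nino)
    l_bar = mean_field(la_nina)
    linear = [e - l for e, l in zip(e_bar, l_bar)]
    asymmetric = [e + l for e, l in zip(e_bar, l_bar)]
    return linear, asymmetric


# Toy data (hypothetical): two events per phase, two grid points.
el_nino = [[-4.0, 2.0], [-6.0, 4.0]]
la_nina = [[5.0, -3.0], [3.0, -1.0]]

linear_part, asymmetric_part = linear_and_asymmetric(el_nino, la_nina)

# A perfectly mirrored La Nina response leaves no asymmetrical part.
mirrored = [[-v for v in field] for field in el_nino]
_, zero_part = linear_and_asymmetric(el_nino, mirrored)
```

If the La Niña response were an exact mirror image of the El Niño response, the sum would vanish everywhere, which is why a sum that is statistically indistinguishable from zero is read as no detectable asymmetry.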

Confidence intervals and t-test
Confidence intervals are important for quantifying uncertainty. Different types of confidence intervals have different accuracies depending on the properties of the data; it is therefore informative to consider more than one type (Hesterberg, 2015). The most common (without bootstrapping) is the "ordinary t interval" (ordt), which is directly related to the t test for statistical significance (see Sect. 2.3 and Table 3
Further types of confidence intervals which are not commonly considered in the climate research literature can be calculated from bootstrapping. We describe two additional ones, tBoot and bootT (Table 3 and Hesterberg 2015). The tBoot interval is also quite straightforward: the standard error (SE in $C_o \pm t_{\alpha/2} \cdot \mathrm{SE}$) is calculated as the standard deviation of the bootstrap composites. This comes from the definition of standard error being the standard deviation of the sampling distribution, and using the bootstrap distribution as a substitute for the sampling distribution. For our SLP composites, these tBoot intervals (not shown) are virtually identical to the perc intervals. The bootT interval is less immediately obvious but important because it has higher accuracy (Hesterberg 2015) and, like perc, it allows for asymmetrical intervals about the mean. Instead of using a table, the t values themselves are calculated from the bootstrap samples (of which we have 2000) and then the 2.5th and 97.5th percentiles of the t values are obtained. To do this, from each sample we calculate a value of $t^* = (C^* - \overline{C^*})/\mathrm{SE}^*$, where $C^*$ is a bootstrap composite, $\overline{C^*}$ indicates the mean of all the bootstrap composites, and $\mathrm{SE}^* = s^*/\sqrt{n}$ is the standard error for this bootstrap sample. This is performed for all samples to obtain 2000 t values. The percentiles of the $t^*$ values are then used to determine the bootT confidence intervals (see equations in Table 3), and these are shown in the third column of Figs. 7, A6,
Comparing the ordt, perc and bootT confidence intervals, we note a few interesting aspects in Figs. A6, A7. R2.4 Firstly, ordt intervals are consistent (and must be by definition) with the absence or presence of statistically significant areas shown in Fig. 5. For example, Fig. 7a, d R2.4 show that ordt in the North Pacific center and the southern center of the North Atlantic are bounded by negative values at the 95% confidence level, and the signals in these regions are indeed largely significant (Fig. 5). In contrast, the confidence interval for the northern center of the North Atlantic crosses "0" (includes negative and positive values), thus implying that the sign and amplitude of the signal are uncertain, consistent with the absence of statistical significance in Fig. 5c over this area. Secondly, all estimates of the confidence interval agree very well with each other, implying that the ordt intervals are good enough for our purposes (although we would not know this a priori). Inspecting the different types of confidence intervals for these cases shows, however, that the perc and bootT intervals yield slightly larger areas of statistical significance in the northern center of the North Atlantic teleconnections than ordt, suggesting that the bootstrap composites exhibit a small amount of skewness here.
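To make the four interval types concrete, the Python sketch below computes them for the mean of a single sample at one grid point. It is a simplified illustration under several assumptions: Table 3 is not reproduced in this text, so the perc and bootT formulas follow the general forms in Hesterberg (2015); the $t^*$ values are centered on the mean of the bootstrap means, as described above; a one-sample mean replaces the two-sample composite; and the function names, toy values and tabulated critical value (2.262 for 9 degrees of freedom) are hypothetical inputs.

```python
import math
import random
import statistics


def percentile_pair(values):
    """Return the 2.5th and 97.5th percentiles of a list of numbers."""
    cuts = statistics.quantiles(values, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


def confidence_intervals(sample, t_crit, n_boot=2000, seed=0):
    """95% intervals for the mean: ordt, tBoot, perc and bootT."""
    rng = random.Random(seed)
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)

    boot_means, boot_ses = [], []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)
        s_star = statistics.stdev(resample)
        if s_star == 0.0:
            continue  # degenerate resample: t* would be undefined, so skip it
        boot_means.append(statistics.mean(resample))
        boot_ses.append(s_star / math.sqrt(n))

    centre = statistics.mean(boot_means)
    t_star = [(m - centre) / s for m, s in zip(boot_means, boot_ses)]
    q_low, q_high = percentile_pair(t_star)
    se_boot = statistics.stdev(boot_means)  # SE from the bootstrap distribution

    return {
        "ordt": (x_bar - t_crit * se, x_bar + t_crit * se),
        "tBoot": (x_bar - t_crit * se_boot, x_bar + t_crit * se_boot),
        "perc": percentile_pair(boot_means),
        "bootT": (x_bar - q_high * se, x_bar - q_low * se),
    }


# Toy anomalies (hPa) for ten events at one grid point (hypothetical values).
sample = [-4.2, -3.1, -5.0, -3.8, -4.5, -2.9, -4.1, -3.6, -4.8, -4.0]
intervals = confidence_intervals(sample, t_crit=2.262)
```

For a roughly symmetric sample such as this one, the four intervals are expected to be close to each other; differences between ordt and the perc or bootT intervals become informative mainly when the bootstrap distribution is skewed.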

Examining different types of confidence intervals, three of which are obtained from the bootstrap test, provides further support for the results shown in previous subsections. In particular, despite uncertainty in the amplitudes, we have reasonably high certainty in the signs of the teleconnection anomalies in the main centers of action shown by the ordinary t test in Fig. 1.
The bootstrap confidence intervals may also provide new information for regions where uncertainty in the ENSO response is not normally distributed, such as the northern center of the North Atlantic during late winter.

Concluding remarks
This study clarifies the uncertainty in the ENSO teleconnection in the North Atlantic by considering the early (Nov-Dec) and late (Jan-Feb) parts of the cold season separately, as well as using a geographical area that better covers both anomaly centers in the North Atlantic. The motivation for separating the seasons in this way follows logically from previous studies that find different teleconnection signals and mechanisms at work during ND compared to JF. Various confidence intervals were used to assess uncertainty, including the widely used Student's t test as well as several types of intervals calculated from the bootstrapping analysis (Sect. 3.3). These produce nearly the same results in most cases, indicating that the conventional Student's t test (Fig. 1) and the equivalent ordt confidence intervals (first columns of Figs. A6, A7) are generally reliable for assessing the statistical significance of ENSO teleconnection patterns.
The key results of this study are encapsulated in Figures 2 and 5. Based on our analyses of the observational record, we find that there is confidence in the spatial R2.27 pattern of SLP anomalies associated with ENSO in the North Atlantic as it changes from November through February. The uncertainty is lower than suggested by traditional DJF winter analyses, which average over opposite-signed anomalies in early and late winter. In agreement with DES2017, the analyses performed here indicate that uncertainty in the amplitudes of the teleconnection signals is indeed high, even when the cold-season transition is accounted for, with a 95th-to-5th percentile ratio reaching a maximum factor of 2.8 in the North Atlantic (compared to a factor of 2.0 in the North Pacific). Overall, we argue that our analyses and conclusions permit an understanding of the ENSO teleconnection signals in the North Atlantic with reduced uncertainty compared to earlier studies, which considered only the standard DJF winter season view. R2.14, R2.28

After many decades of research, interest in ENSO teleconnections in the North Atlantic remains high and new results continue to appear (e.g., recent review by Domeisen et al. 2019). A number of recent studies have contributed to the continuous reproduce the ENSO teleconnection to the North Atlantic through the cold season, while initialized hindcasts perform better.
Other studies have focused on the effects of the ENSO teleconnection on surface climate in Europe (precipitation, temperature, drought indices; e.g., Brönnimann et al. 2007; King et al. 2020; van Oldenborgh and Burgers 2005), for which model performance is less well documented than for atmospheric circulation anomalies (but see, e.g., King et al. 2020; Volpi et al. 2020).
Further research on surface climate impacts in models and prediction skills should consider the varying nature of the ENSO teleconnection through the cold season (discussion in King et al. 2018a).

0.008, and 11/2000 ≈ 0.005 (the numerators of the fractions are the numbers of red dots in the panels, and the denominators are the total number of bootstrap composites) R2.24 for DJF, ND, and JF, respectively. Another way to describe these p values is that each one is the probability of obtaining an SLP composite at least as large as the observed one if El Niño and La Niña related SLP anomalies were from the same population. These p values are small enough (following the common practice of using a threshold of, e.g., p = 0.05) to reject the null hypothesis. Note also that none of the bootstrap confidence intervals for the amplitude ratios, indicated with the blue and red semi-circles, crosses over the gray semi-circles representing the observed amplitudes (therefore radii for red semi-circles < 1). The permutation tests carried out here suggest that the

with red (blue) contours indicating positive (negative) anomalies. Gray shading indicates the 5% significance level for a two-tailed t test. All analyses in this paper use data from HadSST and NOAA-CIRES 20CR V2c.

percentile maps. These panels correspond to the first column in Fig. 2. Features in the Taylor diagram are as described in Fig. 2. See Sect. 3.1 for more details.

(Table 1). These panels correspond to the second row of Fig. A2 except for different sample sizes. Panels j, h, i indicate the estimated minimum sample sizes required for robust signals for spatial patterns of the composites. See Sect. 3.1 for more details.

Figure A4. Taylor diagrams for permutation tests corresponding to the same events in Fig. A2 (1870-2014). The thick gray arc marks the 5th-to-95th percentile range in the corresponding panel in Fig. A2. Other features in the Taylor diagrams are as described in Fig. 2. See Appendix A for more details. R1.2
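The permutation p values quoted above are fractions: the number of resampled composites at least as large as the observed one, divided by the total number of resamples. The sketch below illustrates that counting under stated assumptions rather than reproducing the authors' procedure. It uses a scalar index in place of SLP maps, takes the absolute difference between the El Niño and La Niña means as the test statistic, and builds the null distribution by shuffling the pooled events. The name `permutation_p_value` and its arguments are hypothetical.

```python
import numpy as np


def permutation_p_value(en, ln, n_perm=2000, seed=0):
    """Fraction of label-shuffled differences at least as large as the observed one.

    en, ln: hypothetical 1-D samples of a scalar anomaly index for
    El Niño and La Niña events (an assumption; the paper works with SLP maps).
    """
    rng = np.random.default_rng(seed)
    en = np.asarray(en, dtype=float)
    ln = np.asarray(ln, dtype=float)
    observed = abs(en.mean() - ln.mean())

    # Under the null hypothesis both groups come from the same population,
    # so the event labels can be shuffled freely.
    pooled = np.concatenate([en, ln])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:en.size].mean() - shuffled[en.size:].mean())
        if diff >= observed:
            count += 1  # analogous to one "red dot" in the panels
    return count / n_perm
```

With 2000 permutations, a count of 11 corresponds to p = 11/2000 ≈ 0.005, matching the form of the fractions reported in the text.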

| Long name | Short name | General equation/descriptions |
| --- | --- | --- |
| Ordinary t interval | ordt | $\mu = C_o \pm t_{\alpha/2} \cdot \mathrm{SE}_o$, where $\mathrm{SE}_o$ is estimated from original sample |
| Bootstrap percentile (or confidence) interval | perc | Intervals are obtained from bootstrap composites directly |
| t interval with bootstrap SE | tBoot | SE is estimated as standard deviation of bootstrap composites |
| Bootstrap t interval | bootT | $\mu = C_o - \mathrm{bootT}_{1-\alpha/2} \cdot \mathrm{SE}_o$, $\mu = C_o - \mathrm{bootT}_{\alpha/2} \cdot \mathrm{SE}_o$, where $\mathrm{SE}_o$ is estimated from original sample, and bootT values are calculated from all bootstrap samples |