Industry standard and econometric standard: the search for a powerful approach to evaluate VaR models

Under the Basel III and Basel IV accords, risk model validation remains based on the VaR measure. In industry practice, VaR backtesting procedures rely on two likelihood ratio tests, which academic research has criticized for their unsatisfactory power. This paper aims to show the differences between VaR model evaluation based on the standard likelihood ratio approach and backtesting by means of other econometric methods applicable to the binary VaR failure process. The author decomposed the model evaluation into two stages: in testing the unconditional coverage, the likelihood ratio was replaced with a normal statistic, and in the next stage, in order to verify the conditional coverage, the Ljung-Box statistic was employed. The study experimentally confirmed the superiority of the proposed procedures over the industry standards. The main contribution, however, is the empirical study designed to demonstrate the practical differences in risk analysis attributable to the choice of the backtesting method. Using data on leading stock market indexes, from various periods, the author showed that the practical conclusions from backtesting diverge markedly depending on the test choice. The proposed, more powerful tests, contrary to the standard procedures, made it possible to distinguish distinct models of index behaviour connected with the experience of financial crises.


INTRODUCTION
Value-at-Risk (VaR) owes its popularity as a risk measure to both business practice and international supervisory rules. In the context of business routines, its constantly widening range of applications stems from practical advantages, such as its straightforward interpretation and applicability to complex portfolios. Following the industry practice, the international system of risk measurement standards was based on VaR in the 1990s (Basel 1996), shortly after the original inception of this measure by J. P. Morgan (1994). Although the reform of the supervisory rules, undertaken in 2012 by the Basel Committee on Banking Supervision (Basel 2016, 2017), has involved a movement from VaR to the ES (Expected Shortfall) measure, the procedures of risk model evaluation remain based on VaR. This necessitates a discussion on VaR testing rules and gives an incentive to investigate the statistical properties of relevant methods.
The VaR testing framework is based on a binary variable indicating VaR violations. Under the correct risk model this variable is required to follow the iid Bernoulli process. The iid Bernoulli property is commonly split into the postulates of the unconditional and conditional coverage property. The first postulate refers to the overall VaR failure rate and means that the proportion of violations should match the assumed VaR tolerance level, while the conditional coverage property requires the independence of violations. The extensive toolkit for verifying these two postulates, separately or jointly, involves testing the parameters of the Bernoulli process (Kupiec 1995), using the transition probabilities of the binary Markov chain (Christoffersen 1998), regressing VaR failures on their lagged values (Engle and Manganelli 2004), checking the unpredictability of the durations between VaR failures (Christoffersen and Pelletier 2004, Candelon et al. 2011) or using the spectral theory (Berkowitz et al. 2011, Gordy and. The one-level VaR backtesting procedures were extended into checking the fit of the density function (Berkowitz 2001), the truncated density function (Crnkovic and Drachman 1997) or multi-level VaR testing (Hurlin and Tokpavi 2007, Colletaz et al. 2013, Kratz et al. 2018). Among these propositions, two tests, formulated within the likelihood ratio (LR) framework, have won wide recognition in the industry. These are the Kupiec test (Kupiec 1995), which checks the unconditional failure rate, and Christoffersen's Markov test (Christoffersen 1998), aimed at capturing the serial dependence in failures. Developed specifically for the purposes of risk management, these tests offer the advantage of a convenient, straightforward application to real-life processes. These popular approaches, however, have been repeatedly criticized with respect to their statistical properties (Lopez 1999, Christoffersen and Pelletier 2004, Berkowitz et al. 2011, Pajhede 2017).
In view of practical aspects, like the straightforward implementation and computational efficiency, the study explored the possibilities of backtesting VaR through standard econometric methods, applicable to the Bernoulli sequence and independence testing. Building on the results of Malecka (2018), the author refrained from using methods developed specifically for the purposes of risk management. Exploiting the properties of the Bernoulli distribution, the binomial distribution and its convergence to the normal one, the study applied a normal statistic to checking the unconditional coverage property. To enhance the power properties in the conditional coverage testing, the author employed the Ljung-Box statistic (Ljung and Box 1978). This used the fact that the Ljung-Box test has the power against linear alternatives of any order, which corresponds to dependencies in the GARCH processes, with a slow decay of correlation.
The aim of the paper was to evaluate the capacity of the above-mentioned, well-established econometric methods as risk management tools, in relation to the popular VaR-dedicated tests. For this purpose the Monte Carlo technique was employed, and an empirical investigation of the methods was performed. The Monte Carlo study was designed to reflect the typical VaR failure setting. To achieve this, the study used two types of experiments. In the first type, correlated VaR violations were generated by employing the GARCH-class models with a specification that enables explicit control of the volatility clustering. It was therefore possible to study the power of the tests as a function of a controlled parameter of the return distribution. However, explicit control over the parameter that represents volatility clustering limits the range of applicable data generating processes. This type of experiment was therefore followed by the second type, where the priority was to closely reflect real-life financial processes. The second set of experiments stepped away from the exact control over the volatility clustering and, instead, used the ARMA and N-GARCH-based data generating processes with the Student t-distribution and parameters based on empirical data. In accordance with the Basel framework, the study provided the results for VaR coverage levels of 1% and 2.5%, in this way extending the earlier research on backtesting VaR through classical econometric methods, which treated only 5% VaR (Malecka 2018). The author experimentally showed that the proposed approach outperforms the standard LR tests both in terms of accuracy, understood as the test size, and in terms of the ability to detect incorrect risk models, understood as the test power. The main contribution is, however, a broad empirical study, which exploits and illustrates the results of the Monte Carlo simulations.
The research was designed to show the differences in risk analysis that result from the choice of a backtesting procedure. To this end, three leading stock market indexes were utilised, to which the standard LR and the proposed methods were subsequently applied. To provide a relevant scenario for assessing the capacity of risk management tools, all the backtesting procedures were implemented to evaluate twelve mainstream market risk models in various periods, under diverse volatility conditions. The results demonstrate that the proposed tests, compared to the popular LR approach, provide a more insightful view of market behaviour. They make it possible to choose models suitable for predicting risk under various volatility regimes and thus characterize the market specificity. Contrary to the standard procedures, they distinguish two distinct patterns of market behaviour, connected with the experience of financial crises.
The paper proceeds as follows. Section 2 sets the notation and provides the details of the compared tests. Section 3 gives the comparative evaluation of their properties through the Monte Carlo experiments, based on the GARCH processes. Section 4 empirically illustrates the differences in the outcomes of real data analysis, resulting from the choice of testing procedure.

TESTING UNCONDITIONAL AND CONDITIONAL VaR COVERAGE
The VaR model evaluation framework is based on a binary process indicating VaR failures. Assuming that $R_t$ is the random return from a portfolio, with the continuous distribution function $F_{R_t}$, and VaR is its $p$-quantile, $VaR_t(p) = F_{R_t}^{-1}(p)$, the failure process is defined as:
$$I_t = \begin{cases} 1, & R_t < VaR_t(p), \\ 0, & R_t \ge VaR_t(p). \end{cases}$$
The quantile order $p$ is referred to as the VaR tolerance level. Under the correct VaR model, the $I_t$ process is required to be the iid Bernoulli process with the parameter $p$, i.e.
$$I_t \sim \text{iid Bernoulli}(p).$$
The iid Bernoulli condition may be decomposed into the postulate of unconditional coverage, referring to the unconditional probability of failure p, and the postulate of conditional coverage, requiring independence of failures.
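As an illustration of the failure process defined above, the following minimal Python sketch builds the binary series $I_t$ from returns and VaR forecasts. The function name `var_failures` and the numerical values are hypothetical and serve only as an example. VaR is treated here as a quantile of the return distribution (a negative number), and the strict inequality is an assumption that is immaterial under a continuous return distribution.

```python
def var_failures(returns, var_forecasts):
    """Return the binary failure series: 1 if the return falls below VaR, else 0."""
    if len(returns) != len(var_forecasts):
        raise ValueError("returns and var_forecasts must have equal length")
    return [1 if r < v else 0 for r, v in zip(returns, var_forecasts)]


# Hypothetical daily returns and a constant VaR forecast of -2% (illustrative values).
returns = [0.004, -0.025, 0.011, -0.003, -0.031, 0.002]
var_forecasts = [-0.02] * len(returns)

hits = var_failures(returns, var_forecasts)
failure_rate = sum(hits) / len(hits)  # to be compared with the tolerance level p
```

Under a correct model the observed failure rate should be close to $p$ (unconditional coverage) and the ones in `hits` should not cluster in time (conditional coverage).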
The industry standard to test the unconditional coverage property is the Kupiec test (Kupiec 1995), which assumes the identical, independent Bernoulli distribution of $I_t$ with the parameter $\pi_1$ and tests the null $H_0: \pi_1 = p$. The parameter $\pi_1$ is estimated by $\hat{\pi}_1 = T_1/T$, where $T_1$ is the number of violations and $T$ is the number of observations. The $H_0$ restriction is checked through the likelihood ratio statistic:
$$LR_{UC} = -2 \ln \frac{p^{T_1}(1-p)^{T-T_1}}{\hat{\pi}_1^{T_1}(1-\hat{\pi}_1)^{T-T_1}}.$$
Under the iid Bernoulli assumption, the number of violations has the binomial distribution, $T_1 \sim B(T, \pi_1)$, which, provided that the $H_0$ restriction is satisfied, changes into $T_1 \sim B(T, p)$. If the number of observations is large, by the Central Limit Theorem the binomial distribution converges to the normal one. Exploiting this fact, the unconditional coverage VaR test may also be conducted with the use of the continuous normal distribution. The test statistic $Z$ takes the form:
$$Z = \frac{T_1 - Tp}{\sqrt{Tp(1-p)}}$$
and, under the null, has the asymptotic standard normal distribution $N(0,1)$.
The unconditional coverage tests, relying on the iid assumption, consider only the overall rate of failures. The complete VaR backtesting procedure, as formulated in the conditional coverage postulate, also requires checking the independence of failures. The standard approach to verifying the conditional coverage property is the Markov test (Christoffersen 1998), which embeds the failure process within the binary first-order Markov chain. The test is formulated in terms of single-step transition probabilities. The independence condition implies that the transition probabilities $\pi_{01}$ and $\pi_{11}$ are equal, where $\pi_{ij}$ denotes the probability of the transition of $I_t$ from state $i$ to state $j$. The null $H_0: \pi_{01} = \pi_{11}$ is tested through the likelihood ratio statistic of the form:
$$LR_{CC} = -2 \ln \frac{\hat{\pi}^{T_{01}+T_{11}}(1-\hat{\pi})^{T_{00}+T_{10}}}{\hat{\pi}_{01}^{T_{01}}(1-\hat{\pi}_{01})^{T_{00}}\,\hat{\pi}_{11}^{T_{11}}(1-\hat{\pi}_{11})^{T_{10}}},$$
where $T_{ij}$ is the number of transitions from state $i$ to state $j$, $\hat{\pi}_{01} = T_{01}/(T_{00}+T_{01})$, $\hat{\pi}_{11} = T_{11}/(T_{10}+T_{11})$ and $\hat{\pi} = (T_{01}+T_{11})/(T_{00}+T_{01}+T_{10}+T_{11})$.
Relying on the single-step transition probabilities, the Markov test has only the potential to detect first-order dependencies. This deficiency may be made up for by employing the well-known econometric Ljung-Box test, which has the power against linear alternatives of any order.
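The four procedures compared in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: all function names are hypothetical, the Markov statistic is the likelihood ratio for the null of equal transition probabilities referred to the $\chi^2(1)$ distribution, the Ljung-Box statistic is the textbook form $T(T+2)\sum_{h=1}^{H}\hat{\rho}_h^2/(T-h)$ referred to $\chi^2(H)$, and the $Z$ test is assumed to be two-sided. Only the standard library is used, so the chi-square tail probability is computed from its closed form for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:  # even df: finite sum of Poisson-type terms
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # odd df: complementary error function plus a finite sum
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(x / 2)) + total


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_lr(hits, p):
    """Kupiec unconditional coverage LR statistic and its chi2(1) p-value."""
    T, T1 = len(hits), sum(hits)
    pi1 = T1 / T
    log_l0 = _xlogy(T1, p) + _xlogy(T - T1, 1 - p)
    log_l1 = _xlogy(T1, pi1) + _xlogy(T - T1, 1 - pi1)
    stat = -2 * (log_l0 - log_l1)
    return stat, chi2_sf(stat, 1)


def z_test(hits, p):
    """Normal statistic for the failure count; two-sided p-value (assumption)."""
    T, T1 = len(hits), sum(hits)
    z = (T1 - T * p) / math.sqrt(T * p * (1 - p))
    return z, math.erfc(abs(z) / math.sqrt(2))


def markov_lr(hits):
    """LR statistic for H0: pi01 = pi11 in a first-order Markov chain."""
    n = [[0, 0], [0, 0]]
    for prev, cur in zip(hits[:-1], hits[1:]):
        n[prev][cur] += 1
    (n00, n01), (n10, n11) = n
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    log_l0 = _xlogy(n01 + n11, pi) + _xlogy(n00 + n10, 1 - pi)
    log_l1 = (_xlogy(n01, pi01) + _xlogy(n00, 1 - pi01)
              + _xlogy(n11, pi11) + _xlogy(n10, 1 - pi11))
    stat = -2 * (log_l0 - log_l1)
    return stat, chi2_sf(stat, 1)


def ljung_box(hits, H=5):
    """Ljung-Box statistic of the failure series up to lag H, chi2(H) p-value."""
    T = len(hits)
    mean = sum(hits) / T
    dev = [h - mean for h in hits]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("constant series: autocorrelations are undefined")
    stat = 0.0
    for h in range(1, H + 1):
        rho_h = sum(dev[t] * dev[t - h] for t in range(h, T)) / denom
        stat += rho_h ** 2 / (T - h)
    stat *= T * (T + 2)
    return stat, chi2_sf(stat, H)


# Hypothetical failure series with 25 failures in 1000 days (2.5% rate):
spread = [1 if t % 40 == 0 else 0 for t in range(1000)]        # evenly spread
clustered = [1 if 100 <= t < 125 else 0 for t in range(1000)]  # one cluster
```

Both hypothetical series have exactly the assumed failure rate, so the unconditional coverage statistics are (numerically) zero for each of them, while the clustered series is strongly rejected by both `markov_lr` and `ljung_box`.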

STATISTICAL PROPERTIES OF VaR BACKTESTING PROCEDURES
As a preview to the empirical analysis, a Monte Carlo study was used to assess the statistical properties of the examined tests in the context of the VaR model evaluation. The comparative study included the size and power properties, estimated as the proportion of rejections under the null and under the alternative, respectively. The size evaluation included significance levels 0.01, 0.05 and 0.1. For the power comparison, the study reported rejection frequencies at the 0.05 level. The estimates of the statistical properties were computed over 10,000 Monte Carlo trials for sample sizes 100, 250, 500, 750, and 1000.
With reference to the unconditional coverage property, the author compared the normal $Z$ statistic, applied to a VaR failure series, to testing VaR models through the Kupiec $LR_{UC}$ test. With reference to the conditional coverage property, the author evaluated the properties of the Ljung-Box $LB_H$ statistic in relation to the properties of the Markov-chain-based $LR_{CC}$ procedure, setting the autocorrelation order $H = 5$, which corresponds to one week of daily observations.
The size study investigates test accuracy, understood as the compliance between the observed rejection frequency and the nominal significance level (Tables 1 and 2). Since the size assessment examines the test performance under the null, it requires data from the iid binary process with the correct failure probability. This was done through generating iid Bernoulli samples with the parameter $\pi_1$ equal to the chosen VaR tolerance level.
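The logic of the size experiment can be illustrated with a short Python sketch for the normal $Z$ test at the 0.05 level. It is an illustration only: the function names, the seed and the reduced number of trials (2,000 rather than the 10,000 used in the study) are assumptions made to keep the example fast, and a two-sided rejection rule is assumed.

```python
import math
import random


def z_rejects(hits, p, crit=1.959964):
    """Two-sided normal test of the failure count at the 5% level (assumption)."""
    T, T1 = len(hits), sum(hits)
    z = (T1 - T * p) / math.sqrt(T * p * (1 - p))
    return abs(z) > crit


def rejection_rate(pi1, p, T, trials, seed=0):
    """Share of simulated iid Bernoulli(pi1) samples in which H0: pi1 = p is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        hits = [1 if rng.random() < pi1 else 0 for _ in range(T)]
        if z_rejects(hits, p):
            rejections += 1
    return rejections / trials


# Size: data generated under the null (pi1 equal to the VaR tolerance level).
size = rejection_rate(pi1=0.025, p=0.025, T=1000, trials=2000)
```

Calling the same function with `pi1` different from `p` gives an estimate of the power against a false unconditional coverage.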
While the size results in the group of the unconditional coverage tests show minor differences between the compared methods, the discrepancies observed between the conditional coverage tests are much larger. Both unconditional tests, $Z$ and $LR_{UC}$, seem relatively accurate; however, any differences that appear suggest the superiority of the $Z$ test over the standard Kupiec $LR_{UC}$ approach. The $Z$ rejection frequencies seem accurate for all significance levels and both VaR coverage levels. They are also rather stable over the sample sizes, though the choice of a low-level VaR, like the ones considered, should clearly go with samples larger than 100 observations. The $LR_{UC}$ rejection frequencies, in turn, show that the $LR_{UC}$ distribution tends to diverge markedly from the theoretical likelihood ratio distribution. This is especially visible for 1% VaR and small sample sizes: for 1% VaR the convergence of the $LR_{UC}$ rejection frequencies to the nominal significance levels seems to start only from 750 observations. The differences between the compared conditional coverage tests, $LB_5$ and $LR_{CC}$, are more pronounced. The results indicate that $LB_5$ outperforms the conventional $LR_{CC}$ procedure. Especially for 2.5% VaR, the rejection frequencies obtained for this test are clearly closer to the nominal significance level, and they show signs of convergence to the desired levels with increasing sample size. In contrast, the $LR_{CC}$ test tends to be oversized for large samples, with rejection frequencies exceeding the nominal test size more than twice. For 1% coverage the empirical rejection frequencies of both tests do not correspond to the assumed significance levels: the tests tend to under-reject ($LR_{CC}$) or over-reject ($LB_5$) correct risk models. Therefore, the results suggest that in order to ensure an accurate test level, it is advisable to perform the conditional coverage testing for 2.5% VaR.
Moreover, to reduce the type I error it is advisable to replace the Markov-chain-based LR statistic with the Ljung-Box statistic.
The power study, aimed at evaluating the tests' ability to detect incorrect risk models, involved violation of the iid Bernoulli assumption. Relevant simulations were conducted in two stages, where the experiments subsequently violated the unconditional and the conditional coverage property. The false unconditional coverage was implemented through generating random Bernoulli numbers with the parameter $\pi_1$ set to values different from the VaR tolerance of 1% or 2.5%. As this type of experiment is dedicated to checking the unconditional coverage property, it was called the uc experiment. In the second stage the underlying processes violated the conditional coverage property. This was implemented through two types of experiments, called the cc experiments. Both cc experiments were aimed at generating serially dependent VaR failures, but this was done in two different ways. In the first type of the cc experiment, the focus was on controlling the scale of violation of the conditional coverage property, hence this experiment type is referred to as theoretically-oriented. The second type of the cc experiment strived to be as close as possible to the real market conditions, so this experiment type is viewed as practically-oriented.
In the first cc experiment type, the focus is on the scale of violating the conditional coverage property, which, in this case, is the same as the distance from the null. The author wanted to control this distance and treat it as the experiment parameter, and then observe how the test power changes when manipulating this parameter. In such experiments one needs a way to measure how much the conditional coverage property is violated. To achieve this, the author resorted to the basic GARCH-normal model, where the scale of violating the conditional coverage property can be judged from the volatility clustering, which, in turn, can be measured by the autocorrelation of the squared returns. The dependence of failures is achieved by using a constant VaR level, based on the unconditional distribution of the returns. In this variant of the experiment, the author chose the following specification of the GARCH model:
$$R_t = \sigma_t Z_t, \qquad \sigma_t^2 = \omega + \alpha R_{t-1}^2 + \beta \sigma_{t-1}^2, \qquad Z_t \sim \text{iid } N(0,1). \qquad (6)$$
In accordance with the idea behind this type of experiment, specification (6) allows for the analytical calculation of autocorrelations of the squared returns, making it possible to study the power of the test as a function of a controlled parameter of the return distribution. Under (6), the first-order autocorrelation of the squared returns $\rho$ is given by
$$\rho = \frac{\alpha(1 - \alpha\beta - \beta^2)}{1 - 2\alpha\beta - \beta^2}$$
and the autocorrelations decline exponentially, with the decay factor $\alpha + \beta$. However, if the fourth moment of the returns is not finite, the autocorrelations are time-varying. To prevent this, the model needs to satisfy the condition $(\alpha + \beta)^2 + 2\alpha^2 < 1$. This is ensured by fixing parameters $\omega$ and $\beta$ at levels 0.01 and 0.6, respectively, and setting $\rho$ to 0.1, 0.3 and 0.5 in subsequent variants of the experiment. The $\alpha$ parameter is set to the value that ensures the desired level of $\rho$. Thus one obtains a simulation experiment that makes it possible to explicitly control the volatility clustering.
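The calibration step described above can be illustrated as follows. The sketch assumes the standard GARCH(1,1) expression for the first-order autocorrelation of squared returns, $\rho = \alpha(1-\alpha\beta-\beta^2)/(1-2\alpha\beta-\beta^2)$, and the finite fourth moment condition for normal innovations, $(\alpha+\beta)^2 + 2\alpha^2 < 1$. Solving the first expression for $\alpha$ gives the quadratic $\beta\alpha^2 - (1-\beta^2+2\rho\beta)\alpha + \rho(1-\beta^2) = 0$, of which the smaller root is taken. The function names are hypothetical.

```python
import math


def rho1(alpha, beta):
    """First-order autocorrelation of squared GARCH(1,1) returns."""
    return alpha * (1 - alpha * beta - beta ** 2) / (1 - 2 * alpha * beta - beta ** 2)


def alpha_for_rho(rho, beta):
    """Smaller root of beta*a^2 - (1 - beta^2 + 2*rho*beta)*a + rho*(1 - beta^2) = 0."""
    b = 1 - beta ** 2 + 2 * rho * beta
    c = rho * (1 - beta ** 2)
    disc = b ** 2 - 4 * beta * c
    if disc < 0:
        raise ValueError("no real alpha gives this rho for the chosen beta")
    return (b - math.sqrt(disc)) / (2 * beta)


def fourth_moment_finite(alpha, beta):
    """Check (alpha + beta)^2 + 2*alpha^2 < 1 (normal innovations)."""
    return (alpha + beta) ** 2 + 2 * alpha ** 2 < 1


# beta = 0.6 and the three autocorrelation levels used in the experiments
calibrated = {rho: alpha_for_rho(rho, beta=0.6) for rho in (0.1, 0.3, 0.5)}
for rho, alpha in calibrated.items():
    print(rho, round(alpha, 4), fourth_moment_finite(alpha, 0.6))
```

Under these assumptions, all three values of $\rho$ yield an $\alpha$ that satisfies the moment condition for $\beta = 0.6$, with $\rho = 0.5$ lying close to the boundary.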
In the second type of the cc experiments, the study aimed at closely mimicking real-life conditions. For this reason, more complex GARCH specifications were chosen, which represent various possible features of financial data. The focus was on checking the test performance in specific conditions like non-linearity, non-normality, getting close to non-stationarity ($\alpha + \beta$ close to one), lack of the volatility clustering or the presence of the serial correlation in the mean equation instead of the variance equation. In the choice of the specific models matching the real data, the author followed previous studies by Berkowitz et al. (2011) and Du (2016), and used an N-GARCH specification for the models numbered from 1 to 4 and an AR(2) return specification for the fifth model in this set of experiments. Generating data from these processes required dispensing with the explicit control of the scale of the volatility clustering. Thus, the power estimates from these experiments cannot be compared to each other and cannot be assessed in relation to the distance from the null. The interpretation of their results can only rely on the fact that any of these representations, combined with the constant VaR corresponding to the unconditional return distribution, produces clusters of VaR failures.
The power results show that the test ability to detect incorrect VaR models differs considerably with respect to the VaR coverage (Tables 3 to 5). Testing based on 2.5% VaR seems possible even for samples of 250 observations, while inference based on 1% VaR appears feasible only for large samples. Recommendable sample sizes, for such a low coverage level, start with 750 observations.
As in the size study, larger differences in the test quality are connected with testing the conditional coverage property rather than the unconditional coverage property. The relative assessment of the unconditional coverage tests, $Z$ and $LR_{UC}$ (Table 3), indicates that these tests are comparable in terms of their power. Any observed differences, however, point to the normal $Z$ statistic as more powerful than the standard Kupiec $LR_{UC}$ approach.
The results from testing the conditional coverage by $LB_5$ and $LR_{CC}$ show remarkable differences in the test quality. The $LB_5$ test clearly outperforms the $LR_{CC}$ procedure in all cc experiments of type one (GARCH-normal-based, with the volatility clustering controlled by $\rho$), with rejection frequencies often doubling those of $LR_{CC}$ (Table 4). In these experiments, the $LB_5$ advantage is most visible at a short distance from the null. For example, in the 0.1 correlation experiment, the $LB_5$ rejection frequencies tend to double or even triple (depending on the VaR level) those of $LR_{CC}$. Further from the null, the $LB_5$ outperformance is most marked for small samples.
The above conclusions from the first set of the cc experiments were confirmed by the experiments of the second type (AR and N-GARCH-Student-t-based). These experiments were designed with the aim of closely reflecting real time series, at the cost of not having any parameter to control the volatility clustering. Therefore, the power results cannot be compared among the models, and there is no a priori knowledge of what power to expect for the specific data generating processes. What can be compared, however, is the rejection frequencies obtained for the standard $LR_{CC}$ and the proposed $LB_5$ (Table 5). In the vast majority of cases, the rejection frequencies of $LB_5$ exceed those of $LR_{CC}$. This regularity can be observed without any exception for the 2.5% VaR level, which includes both the ARMA model and all N-GARCH Student-t data generating processes. For 1% VaR the only exceptions occur for the shortest series of 100 or 250 observations. The irregularities for 1% VaR and the shortest series can be explained by the small number of the observed VaR failures. For example, in the case of the 1% VaR and 100 observations, the expected number of VaR failures is 1. Such a low number of failures hinders any statistical inference. Thus for 1% VaR, finding patterns connected with statistical methods requires longer series. Starting from 500 observations, as before, the $LB_5$ test systematically outperforms the $LR_{CC}$ procedure. Summing up the results from both experiment variants, the $LB_5$ test appears more effective than the standard approach.

BACKTESTING EMPIRICAL VaR FORECASTS
The empirical study, based on FTSE100, NIKKEI225 and S&P500 data, illustrates how the statistical properties of the examined tests translate into practical conclusions from the risk analysis. To this end, twelve leading time series models, used to forecast daily VaR, were evaluated subsequently by all the tests. The range of the models covered both parametric and non-parametric methods. Within the parametric framework, the basic constant variance models were followed by conditional variance models with various error term specifications. The study employed the normal distribution, the Student-t distribution as well as the Peaks over Threshold (POT) method (McNeil and Frey 2000), which, through the Extreme Value Theory, uses the Generalized Pareto Distribution (Balkema and de Haan 1974, Pickands 1975). The conditional variance was modelled through the GARCH-class processes (Bollerslev 1986). This ensures representation of the volatility clustering phenomenon. To allow also for an asymmetric volatility response, relating to upward and downward market trends, the asymmetric GJR-GARCH models were used (Glosten et al. 1993). Within nonparametric methods the author employed the historical simulation model and the filtered historical simulation technique (Barone-Adesi et al. 1998), with filtering based on GARCH or GJR-GARCH model residuals.
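To make the simplest of the listed models concrete, the sketch below shows one common way of producing historical simulation VaR forecasts: the forecast for day $t$ is an empirical $p$-quantile of the preceding returns. It is an illustration under assumptions and does not reproduce the author's settings: the function names, the rolling window length and the quantile convention (the $\lceil pn \rceil$-th smallest observation) are all hypothetical choices.

```python
import math


def empirical_quantile(sample, p):
    """Lower empirical p-quantile: the ceil(p*n)-th smallest observation."""
    ordered = sorted(sample)
    # small tolerance guards against floating-point noise in p * n
    k = max(1, math.ceil(p * len(ordered) - 1e-12))
    return ordered[k - 1]


def historical_simulation_var(returns, p, window):
    """One-step-ahead VaR forecasts from a rolling window of past returns."""
    return [empirical_quantile(returns[t - window:t], p)
            for t in range(window, len(returns))]


# Hypothetical returns; a 5-day window and p = 0.2 are used only for readability.
returns = [-0.03, 0.01, -0.01, 0.02, -0.02, 0.00, 0.015]
forecasts = historical_simulation_var(returns, p=0.2, window=5)
```

Each forecast uses only returns observed before the forecast day, so the resulting series can be compared with the realised returns to obtain the failure series used in backtesting.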
The choice of the above models matched the aims of this study in the sense of evaluating a range of models, characterized by various levels of complexity and flexibility. As the study did not focus on finding the best fit to the time series, but on assessing the VaR tests, the author mainly needed models that differ in their predictive ability. These models were applied as tools to generate a series of VaR forecasts, subsequently used to conduct the tests. The results regarding the quality of the models were treated in the study as complementary, while the key conclusions were based on the consistency among the tests, or the differences they showed in evaluating the VaR forecasts. For these reasons, the range of the time series models started with the most naive ones (like the homoscedastic or the historical simulation models) and ended with the specifications regarded as flexible and showing high predictive ability (like the GJR-GARCH models with the t-distribution or the GJR-GARCH models combined with the distribution based on the Extreme Value Theory).
The FTSE100, NIKKEI225 and S&P500 data were chosen to represent financial returns because these indexes are commonly used in similar research and show typical features of the financial market. This choice made it possible to use the results of previous research and to compare the conclusions. One of the extensive studies including these indexes was carried out by Angelidis et al. (2004). Based on the period 1987-2002, with 484 models for each index and two VaR levels, they showed that the GARCH models are unquestionable leaders in predicting VaR, but the right model choice strongly depends on the market specificity. The only attainable general conclusion, not depending on the particular market, was that the asymmetric models perform better than the others. That study was based on the standard Kupiec and Christoffersen procedure, so it is extended here by including more powerful tests. Another similar study, based on the FTSE100 data from the 1997-2011 period, recommended the use of the GARCH-POT models that originated from the Extreme Value Theory (Totić et al. 2011). That study focused, however, on testing only the unconditional coverage property. On the other hand, a recent study, which included the FTSE100, NIKKEI225 and S&P500 data from 1990 till 2016, concentrated on testing the conditional coverage property (Patton et al. 2019). Although it placed greater focus on ES as a measure of risk than on VaR, it showed the better predictive ability of the nonparametric GARCH models (represented in the present study by the FHS method) than that of the parametric ones. It did not, however, consider the combination of the GARCH models with the POT method, as is done here, but placed emphasis on the GAS (Generalized Autoregressive Score) approach. Importantly, it referred to Christoffersen's early VaR test, supporting the need to replace it with other testing methods.
As in the majority of similar studies, the author's empirical analysis was based on the daily close-to-close log returns. In order to mimic the real-life decision-making process, where a standard sample of daily data covers a yearly period or its multiple, it was decided to perform the study on 4-year samples. As a result, there were around 1000 observations in each sample, which corresponds to the largest sample size examined in the simulation study. On the one hand this matches the risk management practice, and on the other, it provides a relatively wide range of data in one sample. Such a sample length is also in line with other similar studies. For example, Angelidis et al. (2004), who put great emphasis on the sample choice, showed that the best VaR predictions (for 1% VaR and GARCH models with normal or Student-t innovations) are attainable from samples of 1000 observations. Another element of the business practice is to repeat the testing with a fixed frequency, such as weekly, monthly or yearly. The practical choice of the frequency depends on the potential impact of the risk exposure on the company operations. As an effect of the sample choice and the frequency choice, the real-life samples usually overlap and any changes in the market conditions can be observed from a series of subsequent samples. To follow this practice, the testing was repeated for several samples of the same 4-year length, but with the testing window moved in such a way that the neighbouring samples do not overlap. This made it possible to cover a wider time range and obtain more diversified samples, deemed relevant for the purposes of checking the capacity of risk management tools. An important element of the sample choice was to rely on predefined time intervals instead of using any statistical techniques of dividing the data into specific subperiods. This was important for two reasons.
First, it served to assess the tests' ability to find incorrect risk forecasts based on pre-established periods. Therefore some fixed data and a collection of risk models were needed, with a varied potential of matching these data. If a change in the volatility regime occurs, it is expected that the tests find the unsuitability of the applied risk models. Second, the aim was to follow the real-life procedures where the risk model testing is carried out periodically, usually based on calendar periods, without any a priori knowledge of shifts between the market regimes. Thus the author decided to use the samples: 2008-2011, 2012-2015, 2016-2019. Such a subsample choice has the additional advantage of the oldest subsample going back as far as the subprime mortgage crisis of 2008, which gave the possibility to empirically evaluate the test performance in the extreme conditions experienced in the recent past. The standard choice of the intuitive periods corresponding to the calendar years resulted in limiting the time series to the end of 2019. However, in order to fully use the various market situations experienced recently, the study was extended to the middle of 2020, hence including one more sample of the same length as all the others. It partly overlaps with the 2016-2019 sample but differs from it substantially, as it covers the outbreak of the COVID-19 pandemic. The latter sample goes from the middle of 2016 to the middle of 2020. In this way the author obtained four samples, which seem highly diverse, as shown by the descriptive statistics (Table 6).
Table 6. Descriptive statistics of S&P500, FTSE100 and NIKKEI225 daily returns
A common observation for all the indexes is that the first sample, 2008-2011, including the subprime mortgage crisis and its spillover effects, clearly stands out. It represents the crisis-driven behaviour, which manifests itself in high volatility and excess kurtosis.
The market crash resulted also in low means, extremely low minimums and relatively high maximums. For FTSE100 and S&P500, such behaviour is also highly evident in the last sample, mid-2016 to mid-2020, including the outbreak of the COVID-19 pandemic. For these two indexes the volatility, skewness and kurtosis went down in the 2012-2015 and 2016-2019 samples, which therefore were initially regarded as representative for the usual market conditions. NIKKEI225 differs from the above observations in the sense that the COVID-19 sample does not stand out so clearly, and the preceding periods also show signs of the high volatility regime. This can be observed also from the NIKKEI225 time series plots, which show large volatility clusters not just in the neighbourhood of the subprime mortgage or the COVID-19 crises (Figure 1). The backtesting exercise performed for the three indexes was aimed at illustrating the differences in the conclusions from the risk analysis, attributable to the test choice. The p-values obtained for all examined stock market indexes (Tables 7 to 9) show that these differences are minor when testing the unconditional coverage property through $LR_{UC}$ and $Z$. On the other hand, testing conditional coverage by means of the $LR_{CC}$ and $LB_5$ statistics reveals that the test choice markedly influences the outcomes. This conclusion is in line with the results of the simulation study.
Backtesting a risk model with respect to the unconditional coverage ($LR_{UC}$ and $Z$ tests) shows whether the overall number of VaR violations produced by the examined model corresponds to the assumed VaR tolerance level. The general picture from backtesting risk models by $Z$ and $LR_{UC}$ is that both unconditional coverage tests provide comparable conclusions. There are, however, a few cases when $Z$ rejects the models that $LR_{UC}$ allows, which suggests that the normal $Z$ statistic proved more powerful at detecting incorrect models. Such cases happened for all three indexes, most often, however, for NIKKEI225. A vivid example is the 2012-2015 NIKKEI225 sample, where $Z$ rejects nearly all parametric specifications, whereas $LR_{UC}$ admits most of them. In general, however, both tests point out similar models as acceptable for predicting risk for all three indexes. In particular, both tests classify the majority of parametric models, apart from those belonging to the POT class, as incorrect in the high volatility regime. This is especially evident in the 2008-2011 subprime mortgage crisis samples and, to a lesser extent, in the mid-2016 to mid-2020 COVID-19 samples. The admitted models in these highly volatile periods are based either on the nonparametric historical simulation method (HS or FHS class) or the POT method originating from the Extreme Value Theory. This shows that, when turbulence occurs, the popular distributional assumptions of normality or Student-t innovations tend to produce an excessive number of VaR violations. In calmer samples, the correct overall failure rate is attainable by most of the methods. Yet, judging by the p-values, the FHS or POT models seem to perform best in terms of the unconditional coverage.
The conditional coverage tests (LR CC and LB 5) complement the overall VaR failure rate check by examining the dependence of failures over time. The procedures dedicated to this property are often deemed critical to financial stability, as they can help prevent catastrophic losses from occurring in series. In light of the results of the conditional coverage tests, the choice among the constant-variance, conditional-variance and asymmetric conditional-variance specifications turns out to be more important for forecasting risk than the distributional choice. First, the use of GARCH-class models is indicated as crucial for preventing VaR failure dependence over time. Second, in some cases, the asymmetric GJR-GARCH models are strongly preferred. Most importantly in view of the study's goals, however, testing the conditional coverage property reveals substantial differences in the conclusions that are attributable to the chosen testing method.
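To make the contrast between the two conditional coverage checks concrete, the sketch below computes a Ljung-Box statistic on the violation series and a first-order Markov likelihood ratio from the transition counts. It is a simplified illustration under stated assumptions, not the study's implementation: the hit series is demeaned by its sample mean, the Ljung-Box statistic is referred to a chi-square distribution with as many degrees of freedom as lags (5 here), and the Markov statistic is built from transitions only and referred to chi-square with 2 degrees of freedom. All function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Chi-square survival function for integer df >= 1 (closed form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    total = 0.0
    term = math.sqrt(2.0 * x / math.pi)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def ljung_box_hits(hits, lags=5):
    """Ljung-Box statistic on the 0/1 violation series; returns (stat, p-value)."""
    T = len(hits)
    mean = sum(hits) / T
    d = [h - mean for h in hits]
    denom = sum(v * v for v in d)
    if denom == 0:
        raise ValueError("violation series is constant")
    stat = 0.0
    for k in range(1, lags + 1):
        rho_k = sum(d[t] * d[t - k] for t in range(k, T)) / denom
        stat += rho_k * rho_k / (T - k)
    stat *= T * (T + 2)
    return stat, chi2_sf(stat, lags)


def _xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return count * math.log(prob) if count > 0 else 0.0


def lr_cc_markov(hits, p):
    """First-order Markov conditional coverage LR; returns (stat, p-value)."""
    n = [[0, 0], [0, 0]]  # n[i][j]: transitions from state i to state j
    for prev, cur in zip(hits[:-1], hits[1:]):
        n[prev][cur] += 1
    from0, from1 = n[0][0] + n[0][1], n[1][0] + n[1][1]
    pi01 = n[0][1] / from0 if from0 else 0.0
    pi11 = n[1][1] / from1 if from1 else 0.0
    ll_null = _xlogy(n[0][0] + n[1][0], 1.0 - p) + _xlogy(n[0][1] + n[1][1], p)
    ll_alt = (_xlogy(n[0][0], 1.0 - pi01) + _xlogy(n[0][1], pi01)
              + _xlogy(n[1][0], 1.0 - pi11) + _xlogy(n[1][1], pi11))
    stat = max(-2.0 * (ll_null - ll_alt), 0.0)
    return stat, chi2_sf(stat, 2)
```

The Markov statistic only looks at consecutive days, whereas the Ljung-Box statistic pools autocorrelations over several lags, which is one plausible reason why the two tests can disagree on the same violation series.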
For the FTSE100 index (Table 7) the LB 5 test generally classifies GARCH-class models as admissible. In the case of this index, the LB 5 test does not distinguish between the standard GARCH models and the GJR-GARCH models with volatility asymmetry. This shows that the potential differences in market behaviour between upward and downward trends do not affect the FTSE100 risk forecasts. This result is stable across the samples, which indicates that although the model parameters may change, the volatility regime does not affect the general patterns of investors' behaviour. Another conclusion is the preference for the parametric models over the FHS ones, which is visible in two out of four samples. Compared to these outcomes, based on the LB 5 procedure, the standard Markov LR CC test results seem less conclusive. In some cases the Markov test even fails to reject the most naive, constant-variance models (the POT model in 2012-2015 and the Student-t model in 2016-2019).
The LB 5 p-values from testing the NIKKEI225 conditional coverage (Table 8) show a different pattern of market behaviour in comparison to the FTSE100 index. Contrary to FTSE100, the risk model choice for this index seems to be driven by volatility regimes. The LB 5 test shows that in the 2008-2011 market crash sample, only the narrow class of the GJR-GARCH models is capable of producing accurate risk forecasts. Thus only these models have the potential to prevent large losses from clustering in time during extremely volatile periods. In other samples, which do not include such extraordinary price movements, the more general class of the GARCH models turns out to be sufficient for predicting risk. Since the volatility asymmetry component of the GJR models appears crucial only in times of crisis, it seems that the investors' behaviour is influenced by the volatility regime. The high crisis-driven volatility appears to stimulate more violent reactions to falling prices. This asymmetric, volatility regime-specific effect is strong enough to affect the suitability of risk forecasting methods.
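The role of the asymmetry component can be illustrated with the standard first-order GJR-GARCH variance recursion, in which a negative shock adds an extra term to next-day variance. The sketch below assumes the GJR-GARCH(1,1) form; whether the study used exactly this order is an assumption, and the function names and any parameter values passed to them are hypothetical.

```python
def gjr_garch_next_variance(eps_prev, sigma2_prev, omega, alpha, gamma, beta):
    """One step of the GJR-GARCH(1,1) recursion (assumed model order).

    sigma2_t = omega + (alpha + gamma * 1[eps_{t-1} < 0]) * eps_{t-1}^2
               + beta * sigma2_{t-1}
    """
    # The asymmetry term is switched on only after a negative shock.
    leverage = gamma if eps_prev < 0 else 0.0
    return omega + (alpha + leverage) * eps_prev ** 2 + beta * sigma2_prev


def gjr_garch_path(eps, sigma2_0, omega, alpha, gamma, beta):
    """Conditional variances for a series of shocks, starting from sigma2_0."""
    sigma2 = [sigma2_0]
    for e in eps[:-1]:
        sigma2.append(
            gjr_garch_next_variance(e, sigma2[-1], omega, alpha, gamma, beta)
        )
    return sigma2
```

With gamma set to zero the recursion reduces to the standard GARCH(1,1) model, so the two model classes compared in the text differ only in how they respond to falling prices.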
As in the case of the FTSE100 index, backtesting the NIKKEI225 risk forecasts through the standard Markov LR CC procedure provides a different picture than backtesting through LB 5. In most cases the LR CC test is unable to point out any specific class of models. For the 2008-2011 market crash sample it fails to reject the basic GARCH-class models, classified as incorrect by the LB 5 statistic. For the two subsequent samples it admits most of the models, failing to single out any approach recommendable for predicting risk. In particular, for 2016-2019 all the models are admitted.
The results from testing the conditional coverage for S&P500 by LB 5 are in line with those for NIKKEI225 (Table 9). In standard situations, as indicated by the 2012-2015 and 2016-2019 samples, the GARCH risk forecasts are sufficient to fulfil the requirement of proper conditional coverage. However, to ensure that VaR violations are not serially correlated during the crises, the volatility asymmetry component needs to be taken into account. Thus the GJR-GARCH-class models are advisable for the 2008-2011 subprime mortgage crisis sample and the mid-2016 to mid-2020 COVID-19 sample. An additional observation for S&P500, which is in line with the FTSE100 results, is the preference for the parametric specifications over the historical simulation-based methods. This is clear from all samples apart from the calmest one, 2012-2015.
As previously, the S&P500 results are test-specific. The risk analysis based on the Markov LR CC statistic gives different conclusions. Under all volatility regimes it admits models from various classes, failing to characterize the market specificity.
The outcomes of the conditional coverage tests for all the indexes, combined with the initial results from testing the unconditional coverage property, indicate that the GJR-GARCH-POT model performs best overall in terms of forecasting daily risk for stock prices. It seems the most flexible, as it is classified as accurate in light of both properties, for all indexes and under all volatility regimes. This most general result is in line with the results of other similar studies that assessed VaR predictability for periods including major stock crashes (e.g. Angelidis et al. 2004, Totić et al. 2011). With regard to the study's goals, an important fact is that this conclusion can be deduced only from the LB 5 results; it is not attainable with the Markov LR CC test.
With regard to the market specificity, the combined results from the LB 5 test for the three examined indexes allow for distinguishing two distinct patterns of investors' behaviour connected with financial crises. While in the London market, as judged by the FTSE100 index, ways of predicting risk seem insensitive to the volatility regimes, the other markets appear to be strongly affected by the crises. According to the NIKKEI225 and S&P500 results, in crisis conditions the general class of GARCH models needs to be narrowed to the models with volatility asymmetry. Otherwise, the violations of VaR tend to cluster in time, which may result in large losses occurring in quick succession. Since the volatility asymmetry is essential for periods with extraordinary price movements, the financial crisis in the New York and Tokyo markets seems to affect investors' behaviour in such a way that it stimulates violent reactions to downward price movements. A crucially important fact is that these conclusions about market specificity strongly depend on the backtesting method. The results demonstrate that, contrary to the LB 5 autocorrelation test, the Markov LR CC test commonly used in the industry fails to explain the individual nature of the markets.

SUMMARY AND CONCLUSIONS
The study dealt with the methods of evaluating risk forecasts. The author referred to the Basel framework, which recommends testing risk models based on the VaR measure, and inquired into the statistical properties of the VaR tests. In order to enhance their accuracy and efficiency, the study replaced standard risk-management-dedicated tests with other econometric methods applicable to the binary VaR failure process, and decomposed VaR model evaluation into testing the unconditional and conditional coverage properties. With respect to the verification of the unconditional coverage property, the study utilized the convergence of the binomial distribution to the normal one, while for the conditional coverage, it employed the Ljung-Box χ² statistic. Thus, it was proposed to use well-established econometric methods instead of the methods developed specifically for the purposes of risk management.
In accordance with the Basel rules, the author examined the test properties at two low significance levels, and the simulations confirmed the superiority of the proposed procedures over the industry standards in terms of power. The results of the simulations were used in the empirical study, which demonstrated the advantages of the proposed approach. The study was designed with a view to showing the differences in the risk analysis attributable to the choice of the backtesting method. To provide a relevant setting for evaluating risk management tools, the study used data on three leading stock market indexes, various volatility regimes and twelve mainstream risk models to generate VaR forecasts. The application of the proposed methods to the VaR failure series provided evidence that more powerful tests, in comparison to the standard risk management procedures, give a more insightful view of the market behaviour.
Contrary to the standard approach, the proposed procedures allowed for distinguishing models that best suit risk management in various market conditions. This, in turn, made it possible to define market-specific patterns of investors' behaviour connected with changing volatility conditions. Under usual conditions, the general class of GARCH models was pointed out as sufficient in terms of predicting risk. However, the inclusion of the financial crises in the sample implied the need for more specific methods. In the market crash periods only the narrow class of the nonparametric filtered historical simulation models or the POT models prevented risk underestimation. The requirement that the VaR failures should not cluster in time further narrowed the range of acceptable models to those that combine the POT method with the GARCH volatility specification. Moreover, for the New York and Tokyo markets, judging by their leading indexes, the volatility asymmetry component was vital in the sense that it prevented clustering of extraordinarily large losses. This indicated the GJR-GARCH-POT model as the most flexible, in the sense of being suitable for predicting risk in the widest variety of market conditions. A comparison of these results to the outcomes of the standard VaR testing procedure demonstrated that the proposed methods were more effective in detecting incorrect risk models. Two general implications follow from this comparison: first, the proposed tests better characterize the specificity of the market behaviour and second, more importantly, they have better potential to secure the stability of the institutions operating in the financial markets. Thus, the results motivated the author to recommend these methods for institutional risk management systems as a replacement for the usual LR-based procedures. Moreover, these conclusions may also be used as guidance by the supervisory bodies in creating recommendations for risk managers.
The conclusions, formulated in the most general form, suggest the superiority of the well-established econometric methods over the standard risk management tools. In more detail, however, the main improvements were achieved by replacing Christoffersen's Markov-chain framework with testing based on the LB statistic. The author treated this not only as an indication of the recommendable VaR testing approach, but also as guidance for further developments. Following the idea behind the LB statistic, one of the directions for future research may be to search for more advanced ways of using the autocorrelation function. Their potential may lie, among other things, in employing spectral theory, which allows for utilizing the same information as included in autocorrelations, but modified by means of the Fourier transform. Such a transform, by using the same information in a different way, may improve the power properties. The use of spectral theory opens a wide range of possibilities connected with new test statistics. Such an approach was proposed in the VaR testing context by Berkowitz et al. (2011), but has been studied so far only in a very limited scope, based on two chosen statistics. Another proposal for utilizing the autocorrelation function in testing VaR is the recent modification of the LB statistic by Miettinen et al. (2020). Unlike the basic LB test, this modification takes into account the presence of volatility clustering. In this modification, the asymptotic variance of the test statistic is derived assuming only the symmetry and finite fourth moments of the time series. When the time series exhibits volatility clustering, it introduces a multiplicative factor that helps to achieve the correct size of the test. Both this proposal and the one based on spectral theory require extensive simulations and empirical verification, and thus are left for further research.
Another natural extension to this study is to verify the author's propositions with the use of multivariate GARCH processes. Such processes, just like the univariate ones, allow VaR to be predicted. Indeed, one of the key practical advantages of VaR as a risk measure is the straightforward way it can be computed for portfolios of assets or portfolios of indexes. As a consequence, testing multivariate VaR models proceeds in a way analogous to testing univariate ones. Up to now, several studies have been conducted to test the accuracy of multivariate models like VECH, BEKK, CCC-GARCH, DCC-GARCH and asymmetric DCC-GARCH in the context of forecasting VaR (Morimoto and Kawasaki 2008, Caporin and McAleer 2014, Santos, Nogales and Ruiz 2013). These studies, however, were based on the standard risk management tests. Validating the multivariate models by means of the methods found relevant in the univariate case (and possibly other, improved tests based on autocorrelations) is viewed as an interesting subject for future studies.