How much do we see? On the explainability of partial dependence plots for credit risk scoring

Risk prediction models in credit scoring have to fulfil regulatory requirements, one of which is the interpretability of the model. Unfortunately, many popular modern machine learning algorithms result in models that do not satisfy this business need, although research activity in the field of explainable machine learning has strongly increased in recent years. Partial dependence plots are one of the most popular methods for model-agnostic interpretation of a feature's effect on the model outcome, but in practice they are usually applied without answering the question of how much can actually be seen in such plots. This paper therefore presents a methodology to analyse to what extent arbitrary machine learning models are explainable by partial dependence plots. The proposed framework provides both a visualisation and a measure to quantify the explainability of a model on an understandable scale. A corrected version of the German credit data, one of the most popular data sets of this application domain, is used to demonstrate the proposed methodology.


Introduction
During the last few years several frameworks for automated machine learning (autoML, Hutter et al., 2018) have been proposed. One such framework is provided by the R package mlr3 (Lang et al., 2019). It allows the definition of a chain of modelling steps, from data preprocessing operations such as dimensionality reduction and imputation up to the final model evaluation using different strategies such as cross-validation, bootstrap and holdout sets. All model specification choices can be defined as so-called hyperparameters, and algorithms are provided to optimise these hyperparameters with regard to a predefined performance measure. As a consequence of the free availability of tools such as mlr3, the use of complex machine learning algorithms has also become easier for companies with comparatively little experience in this field. The resulting models are able to detect complex nonlinear multivariate dependencies without the need for the analyst to explicitly specify the functional form of the dependence. For this reason, such models are often called black box models.
In the application context of credit risk scoring, white box logistic regression models (Crook et al., 2007; Szepannek, 2022) are traditionally used in business practice. Nonetheless, numerous benchmark studies have shown that properly parametrised modern machine learning algorithms, such as random forests and gradient boosting, are often of superior predictive accuracy compared to the aforementioned traditional scorecard models (for an overview cf. Louzada et al., 2014). A comprehensive benchmark study which evaluates several algorithms on a set of domain-specific data sets on a meta-level can be found in Baesens et al. (2002) and has been updated by Lessmann et al. (2015). The specific situation of unbalanced classes was addressed by Vincotti and Hand (2002) and Brown and Mues (2012), and investigated together with a systematic hyperparameter tuning for several classes of machine learning algorithms in a comprehensive benchmark study (Bischl et al., 2014). In Crook et al. (2007) and Szepannek (2017), the current challenges are discussed in a broader context, e.g. reject inference (Banasik and Crook, 2007), the Basel 2 accord (Basel Committee on Banking Supervision, BCBS, 2005), and profit scoring (Verbraken et al., 2014).
In order to prevent the concomitant lack of model understanding, the BCBS established a number of requirements on transparency from the perspective of regulation. The "selection of certain risk drivers and rating criteria should be based not only on statistical analysis, but the relevant business experts should be consulted on the business rationale and risk contribution of the risk drivers under consideration" (European Banking Authority, 2017). This underlines the need for an appropriate methodology to understand what the models have learned and to explain them.
According to Szepannek and Aschenbruck (2020), there can be different requirements for the explanation of a model depending on the context. Several authors recently applied methods of interpretable machine learning to credit scoring (Biecek et al., 2021; Bussmann et al., 2020; Dastile and Celik, 2021; Demajo et al., 2020; Torrent et al., 2020). In Bücker et al. (2021), the different requirements are linked to the corresponding methodology within a unified framework for Transparency, Auditability and eXplainability for Credit Scoring (TAX4CS). According to this, the methods can be divided into global explainability on the model level, such as variable importance (Breiman, 2001), partial dependence (PD, Friedman, 2001), or accumulated local effects (ALE, Apley, 2016), and local explainability on the level of individual predictions, such as Shapley additive explanations (SHAP, Strumbelj and Kononenko, 2014), breakdown plots (Staniak and Biecek, 2018), or local interpretable model-agnostic explanations (LIME, Ribeiro et al., 2016). Many of them can be accessed via the DALEX framework (Biecek, 2018).
This paper concentrates on partial dependence, a popular and well-known method for a model-agnostic assessment of a feature's effect on the model outcome. Despite its popularity and its frequent use in practice, partial dependence analysis is usually applied without addressing the question of how much can actually be seen in the resulting plots. For this purpose, a methodology is presented to analyse to what extent arbitrary machine learning models are explainable by partial dependence plots. The proposed framework provides both a visualisation and a measure to quantify the explainability of a model on an understandable scale. Molnar et al. (2020a) pointed out that the superior performance of complex machine learning models results from their ability to detect high-order dependencies and nonlinearities. Such dependencies are difficult for analysts to understand, although it has to be noted that the trade-off between predictive accuracy and interpretability does not necessarily exist for every data set (Rudin, 2019). As a potential solution, these authors propose criteria that help to quantify the interpretability of a model. Model selection can thus consist of a multi-objective optimisation of both predictive performance and interpretability. The approach followed in this paper differs from this in the sense that it assumes an existing model (which may be the one with the largest predictive accuracy). The question addressed afterwards is "How much can we see in the interpretation given by the partial dependence plots for a given model?"
In Section 2, partial dependence is reviewed. Based on this, a measure is presented that quantifies how far it explains a given model. In the case study, the methodology is applied to the real-world context of credit scoring using the South German credit data (Groemping, 2017; Szepannek and Luebke, 2021). In Section 3, an algorithm is presented that can be used to identify a subset of variables which best serves to explain the model. Finally, the research results are summarised in Section 4.

Partial dependence
Referring back to Friedman (2001), partial dependence plots (PDP) are a popular tool to understand the effect of one or several features on the output of a predictive model. One of their advantages is that they can be used for different kinds of predictive models $f(x)$: the set of predictor variables $x = (x_s, x_c)$ is split into two disjoint subsets and the partial dependence function for a subset $x_s$ is given by:
$$PD_s(x_s) = E_{X_c}\left[f(x_s, X_c)\right] = \int f(x_s, x_c)\, dP(x_c).$$
This means that a partial dependence function computes the expected prediction given that $X_s$ takes the values $x_s$. For a data set with $n$ observations, it is estimated by:
$$\widehat{PD}_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} f(x_s, x_{ic}),$$
where $x_{ic}$ are the values that observation $i$ takes in $X_c$. Note that for $X_s = X$ the partial dependence function corresponds to the model itself, and for $s = \emptyset$, or in other words $X_c = X$, the partial dependence function ends up in:
$$PD_\emptyset = E_X\left[f(X)\right] = \int f(x)\, dP(x),$$
which is a constant that can be estimated by
$$\widehat{PD}_\emptyset = \frac{1}{n} \sum_{i=1}^{n} f(x_i).$$
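The estimator described above, which averages the model's predictions over the observed values of the complementary features, can be illustrated by a minimal sketch. It is not the implementation used in this paper: `partial_dependence`, `toy_model` and `toy_data` are hypothetical names, and the model is an arbitrary function chosen only to make the numbers easy to follow.

```python
from statistics import mean

def partial_dependence(model, data, s, x_s):
    """Estimate PD_s(x_s) as the mean of f(x_s, x_ic) over all observations i.

    model: callable taking a list of feature values, returning a prediction
    data:  list of observations (each a list of feature values)
    s:     indices of the features that form X_s
    x_s:   values plugged in for these features (same order as s)
    """
    preds = []
    for row in data:
        x = list(row)                # keep the observed x_ic of this row
        for j, v in zip(s, x_s):     # overwrite the X_s part with x_s
            x[j] = v
        preds.append(model(x))
    return mean(preds)

# Hypothetical toy model with an interaction between features 0 and 1.
def toy_model(x):
    return 0.2 * x[0] + 0.1 * x[0] * x[1]

toy_data = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]

pd_at_2 = partial_dependence(toy_model, toy_data, s=[0], x_s=[2.0])
pd_empty = partial_dependence(toy_model, toy_data, s=[], x_s=[])
pd_full = partial_dependence(toy_model, toy_data, s=[0, 1], x_s=[2.0, 1.0])
```

In this toy setting `pd_empty` is simply the mean prediction, and `pd_full` coincides with the model's prediction at the point (2, 1), which mirrors the two boundary cases mentioned in the text.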

Application to the South German credit data
The South German credit data is publicly available at the UCI ML benchmark repository (Dua and Graff, 2017) and has been made available by Groemping (2019; see also Szepannek and Luebke, 2021). It has 1000 observations and 21 variables: 7 numeric predictors, 13 categorical predictors and a binary target variable. The event to be predicted is the default of a loan. The overall prior default rate on the data is 0.3. For the purpose of this paper a random forest model was trained on the South German credit data using default parameters according to Liaw and Wiener (2002), which turned out to be a good choice for this purpose (Szepannek, 2017). Usually, the data are split into training, validation and test sets in order to ensure a proper model selection and validation. As these aspects are beyond the scope of this paper and the interest lies rather in the interpretability of the resulting model, no additional splits of the data were undertaken and the forest was trained on the entire data.
Figure 1 illustrates the partial dependence curves for the variables duration (numeric) and status account (categorical). This allows for a visual analysis of the effect of each variable on the predicted default probability, and it can easily be seen that the risk (i.e. the default probability) increases for longer maturity times, although from roughly four years (45 months) onwards the risk remains constantly high. Analogously, it can be seen from the right plot that the risk decreases with a larger amount of money in the account. Nonetheless, when the predictions for the training data are added to the graph, it has to be noted that the PDP only partly explains the model's predictions, which cover a much broader range than the PDP indicates. This is to be expected, as partial dependence is obtained by averaging. Consequently, relying on the PD alone can be misleading.

Explainability
In the following step, a measure is derived to quantify the degree of explanation given by a partial dependence function for a model. A perfect explanation will have the same values for the partial dependence function and the predictions on the data. In this case, all points in a scatterplot of predictions vs. explanation (PX-plot) will lie on the diagonal. Such a plot is shown in Figure 2, where, compared to Figure 1, the x-axis has changed. This allows for a graphical analysis of the explainability: the more representative a PDP is for a model, the closer the points are to the diagonal. From this plot, it can be seen that the PDP covers a much smaller range of values than the model's predictions. Note that the x-axis of the right plot for the categorical variable status account takes only four distinct values, one for each category of the variable. In addition, the range of its partial dependence values is broader than that for the duration variable, and in particular for this variable there are only few observations with low values of the PDP ≤ 0.25 and large predictions > 0.75.
In order to quantify the confidence in an explanation given by a partial dependence plot, one can measure the differences between the partial dependence function $PD_s(X)$ and the model's predictions. A natural way of doing this is obtained by computing the expected squared difference (ESD):
$$ESD(PD_s) = E_X\left[\left(f(X) - PD_s(X)\right)^2\right] = \int \left(f(x) - PD_s(x)\right)^2 dP(x).$$
Note that in contrast to common error functions, the ESD does not measure the difference between predictions and observations, but instead between the partial dependence function $PD_s(X)$ and the model's predictions $f(X)$. For an easier interpretation, $ESD(PD_s)$ can be benchmarked against
$$ESD(PD_\emptyset) = \int \left(f(x) - PD_\emptyset\right)^2 dP(x).$$
The comparison of both $ESD(PD_s)$ and $ESD(PD_\emptyset)$ can be used to quantify the explainability ϒ of model $f(X)$ by a partial dependence function $PD_s$ via the ratio:
$$\Upsilon(PD_s) = 1 - \frac{ESD(PD_s)}{ESD(PD_\emptyset)}. \tag{6}$$
Note that ϒ in (6) is somewhat similar to the common R² as used in linear regression: ϒ close to 1 states that a model is well represented by a PDP, and the smaller it is, the less of the model's predictions is explained by the PDP. Real data plug-in estimates for $ESD(PD_s)$ and $ESD(PD_\emptyset)$ are obtained using $\widehat{PD}_s(x_s)$ and $\widehat{PD}_\emptyset$ as described above.
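The plug-in estimate of ϒ can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code of the accompanying R package: the helper names `pd_hat`, `px_points` and `explainability` are hypothetical, and the additive toy model on a balanced grid is chosen so that the result can be checked by hand.

```python
from statistics import mean

def pd_hat(model, data, s, row):
    """Estimated PD_s at the X_s values of `row`, averaging over all observed x_jc."""
    preds = []
    for other in data:
        x = list(other)
        for j in s:
            x[j] = row[j]
        preds.append(model(x))
    return mean(preds)

def px_points(model, data, s):
    """Pairs (PD_s at x_is, prediction f(x_i)): the points of a PX-plot."""
    return [(pd_hat(model, data, s, row), model(row)) for row in data]

def explainability(model, data, s):
    """Plug-in estimate of Upsilon = 1 - ESD(PD_s) / ESD(PD_empty)."""
    pts = px_points(model, data, s)
    pd_0 = mean(f for _, f in pts)                  # constant PD of the empty set
    esd_s = mean((f - pd) ** 2 for pd, f in pts)
    esd_0 = mean((f - pd_0) ** 2 for _, f in pts)
    return 1.0 - esd_s / esd_0

# Hypothetical additive toy model on a balanced 2 x 2 grid.
def toy_model(x):
    return x[0] + x[1]

toy_data = [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]]

ups_0 = explainability(toy_model, toy_data, [0])       # feature 0 only
ups_1 = explainability(toy_model, toy_data, [1])       # feature 1 only
ups_all = explainability(toy_model, toy_data, [0, 1])  # X_s = X
ups_none = explainability(toy_model, toy_data, [])     # s is empty
```

In this example feature 1 has four times the variance of feature 0, so its PDP accounts for 80% of the squared deviations of the predictions from their mean and that of feature 0 for 20%. Note that the double loop over the data makes the cost quadratic in the number of observations.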

Application to the South German credit data
Table 1 (column $\hat{\Upsilon}$) shows the explainability of the random forest model on the South German credit data for all variables. Among all the numeric variables, duration has the highest explainability, but at only $\hat{\Upsilon}$ = 0.077 it is far from 1 and thus reflects the visual impression gained from Figures 1 and 2.
Columns $\hat{\Upsilon}_k$ of the table describe the explainability for an increasing number of variables $k$ in the subset $X_s$ (cf. Section 3). It can be seen that for two subsets $X_s \subset X_{s^*}$, it is $\Upsilon(PD_s) \le \Upsilon(PD_{s^*})$ with $\Upsilon(PD_s) = 1$ for $X_s = X$. The PX-plot in Figure 3 illustrates the fit of $PD_s$ from Table 1 with $\dim(X_s) = 9$ and $\hat{\Upsilon}$ = 0.8 (which obviously cannot be visualised as a PDP anymore). Compared to Figure 2, the PDP covers a broader range and the points are closer to the diagonal.
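The two boundary values of ϒ follow directly from the definitions of the partial dependence function and the ESD. The short derivation below is added for illustration only and assumes that the model is not constant, so that $ESD(PD_\emptyset) > 0$.

```latex
% Case X_s = X: the partial dependence function is the model itself.
PD_s(x) = f(x)
  \;\Rightarrow\; ESD(PD_s) = E_X\left[\left(f(X) - f(X)\right)^2\right] = 0
  \;\Rightarrow\; \Upsilon(PD_s) = 1 - \frac{0}{ESD(PD_\emptyset)} = 1.

% Case s = \emptyset: the partial dependence function is the constant PD_\emptyset.
PD_s(x) = PD_\emptyset
  \;\Rightarrow\; ESD(PD_s) = ESD(PD_\emptyset)
  \;\Rightarrow\; \Upsilon(PD_s) = 1 - 1 = 0.
```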

Connection to the existing methodology
Note that the proposed measure of explainability ϒ reflects the difference between the PDP and the model's prediction, which is an important but not the only aspect of interest with regard to explainability. A well-known limitation of partial dependence curves is that they might be misleading in the case of correlated predictor variables (Hooker and Mentch, 2019). In Friedman and Popescu (2008), an $H^2$ statistic is proposed that can be used to identify the existence of interactions between predictor variables. For correlated predictor variables, accumulated local effect plots (ALE, Apley, 2016) have been shown to be more appropriate than partial dependence plots. ALE plots are beyond the scope of this paper, but the extension of ϒ for ALE plots may be a subject for future research.
A popular visual tool to analyse hidden variability behind a partial dependence curve is given by individual conditional expectation (ICE) curves (Goldstein et al., 2015), where instead of averaging over all the observations, a separate curve is drawn for each observation $x_i$:
$$ICE_i(x_s) = f(x_s, x_{ic}).$$
The resulting plot of the ICE curves makes it possible to understand the heterogeneity of the PD as a function of $x_s$ (cf. Figure 4 (left) for the variable duration). In particular, ICE plots can be used for a visual analysis of whether the individual curves show the same trend. Yet, to the best of the authors' knowledge, this can only be analysed visually, and no objective measure has been proposed in order to quantify it. In contrast, explainability ϒ condenses the observed variation hidden behind a partial dependence function into one single and interpretable value that is close to 1 (for small variation) and close to 0 (for strong variation) by integrating over the distribution $P(X_s)$. Another issue of PDPs is their extrapolation to areas where little or no training data is available (Hooker and Mentch, 2019; Molnar et al., 2020b). Note that explainability as a global measure reflects the distribution of the training data w.r.t. the predictor variables, i.e. a large value of ϒ does not prevent the misinterpretation of extrapolations of the model outside the range of the training data.
Note that for the individual curves in an ICE plot, the values of $x_s$ are varied regardless of how likely they are to occur conditional on $x_{ic}$, which might be misleading. ϒ, in contrast, takes into account the joint distribution of the variables from $X_s$ and $X_c$, as from each curve only the observed points $x_{is}$ (dots in the graph) are used.
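The relation between ICE curves and the PD curve can be made concrete with a small sketch: each ICE curve varies one feature for a single observation, and the pointwise mean of all curves gives the estimated PD curve. The function names, the toy model and the grid are hypothetical and serve only as an illustration.

```python
from statistics import mean

def ice_curves(model, data, j, grid):
    """One curve per observation i: f(v, x_ic) for every grid value v of feature j."""
    curves = []
    for row in data:
        curve = []
        for v in grid:
            x = list(row)
            x[j] = v             # vary feature j, keep the observed x_ic
            curve.append(model(x))
        curves.append(curve)
    return curves

def pd_curve(curves):
    """Pointwise mean of the ICE curves, i.e. the estimated PD curve."""
    return [mean(col) for col in zip(*curves)]

# Hypothetical toy model: the slope in feature 0 depends on feature 1.
def toy_model(x):
    return 0.2 * x[0] + 0.1 * x[0] * x[1]

toy_data = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
grid = [0.0, 1.0, 2.0]

curves = ice_curves(toy_model, toy_data, j=0, grid=grid)
pd = pd_curve(curves)
slopes = [c[-1] - c[0] for c in curves]   # heterogeneity hidden behind the PD curve
```

Here all curves share the same increasing trend, but their slopes differ because of the interaction, which is exactly the kind of variation that the averaged PD curve hides.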

Computational considerations
For the common implementations of partial dependence plots, e.g. those in Greenwell (2017), Biecek (2018) and Molnar et al. (2018), the aim is the visualisation of the PD curve, and it is sufficient to restrict the computation of $\widehat{PD}_s$ to a grid or a subset of the data. In contrast, $\hat{\Upsilon}$ accounts for the distribution $P(X_s)$, and thus requires computation of the partial dependence $\widehat{PD}_s(x)$ for all observations.
Computation of $\widehat{PD}_s(x)$ for all observations requires predictions for all combinations of the two variable subsets $x_s$ and $x_c$ of the data; therefore the calculation of $\hat{\Upsilon}$ for given data is $O(n^2)$ in the number of observations $n$ with regard to both computation time and memory usage. In order to circumvent this issue arising with large sample sizes, an alternative consists in its computation on a random subsample of the $x_{is}$, $i = 1, ..., n$. Note that a similar approach was proposed to reduce the computation cost for Shapley additive explanations (Strumbelj and Kononenko, 2014), where random subsets of variables are used in order to avoid enumerating all possible permutations of variable subsets. Naturally, this comes at the price of a larger variance of the estimate. Figure 4 (right) illustrates both the reduction in (average) computation time (dashed line) and the increasing variability of the estimates (box plots) for 50 random samples of the $x_s$ using an Intel Xeon CPU E3-1505M v5 (2.8 GHz, 8 core) with 32 GB RAM.
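The subsampling idea can be sketched as follows. This is only one possible reading of the approach: the function name `explainability_subsample`, its `seed` argument and the toy data are hypothetical, and it is an assumption of this sketch that the benchmark $ESD(PD_\emptyset)$ is still computed on all observations (which is cheap), while the partial dependence is evaluated only at `m` randomly drawn observations.

```python
import random
from statistics import mean

def pd_hat(model, data, s, row):
    """Estimated PD_s at the X_s values of `row`, averaging over all observed x_jc."""
    preds = []
    for other in data:
        x = list(other)
        for j in s:
            x[j] = row[j]
        preds.append(model(x))
    return mean(preds)

def explainability_subsample(model, data, s, m, seed=0):
    """Estimate Upsilon using the PD at only m randomly chosen observations.

    Cost is O(m * n) predictions instead of O(n^2).
    Assumption: ESD(PD_empty) is computed on the full data.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, m)                  # random subsample of the x_is
    f_all = [model(row) for row in data]
    pd_0 = mean(f_all)
    esd_0 = mean((f - pd_0) ** 2 for f in f_all)
    esd_s = mean((model(row) - pd_hat(model, data, s, row)) ** 2 for row in sample)
    return 1.0 - esd_s / esd_0

# Hypothetical additive toy model on a 10 x 10 grid (n = 100).
def toy_model(x):
    return x[0] + 0.5 * x[1]

toy_data = [[i / 10, j / 10] for i in range(10) for j in range(10)]

ups_full = explainability_subsample(toy_model, toy_data, [0], m=len(toy_data))
ups_sub = explainability_subsample(toy_model, toy_data, [0], m=20, seed=1)
```

With `m` equal to the number of observations the full estimate is recovered, whereas smaller values of `m` reduce the number of predictions and let the estimate vary with the random draw.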

ϒ-based variable selection
According to Table 1 (column $\hat{\Upsilon}$), ϒ can be used to compare different variables with regard to their ability to explain a model (using a PDP). Consequently, a forward variable selection can be carried out to maximise the explainability of a model with as few variables as possible (cf. Algorithm 1). Note that, as opposed to traditional variable selection or variable importance, the variables here are not selected with regard to the model's performance but rather with regard to the degree of explanation that they provide for an existing model.
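A greedy forward selection of this kind can be sketched as below. It is not a transcription of Algorithm 1: the function names, the optional `target` stopping threshold and the toy model are assumptions made for illustration. In each step, the variable whose addition to the current subset yields the largest estimated ϒ is selected.

```python
from itertools import product
from statistics import mean

def explainability(model, data, s):
    """Plug-in estimate of Upsilon for the feature subset with indices s."""
    f = [model(row) for row in data]
    pd_0 = mean(f)
    pd_s = []
    for row in data:
        preds = []
        for other in data:
            x = list(other)
            for j in s:
                x[j] = row[j]
            preds.append(model(x))
        pd_s.append(mean(preds))
    esd_s = mean((a - b) ** 2 for a, b in zip(f, pd_s))
    esd_0 = mean((a - pd_0) ** 2 for a in f)
    return 1.0 - esd_s / esd_0

def forward_selection(model, data, target=None):
    """Greedily add the variable that maximises Upsilon of the enlarged subset.

    Returns the selection order and the Upsilon value after each step.
    `target` (hypothetical option): stop as soon as Upsilon reaches this value.
    """
    remaining = list(range(len(data[0])))
    selected, path = [], []
    while remaining:
        scores = {j: explainability(model, data, selected + [j]) for j in remaining}
        best = max(remaining, key=lambda j: scores[j])
        selected.append(best)
        remaining.remove(best)
        path.append(scores[best])
        if target is not None and scores[best] >= target:
            break
    return selected, path

# Hypothetical toy model: feature 1 matters most, feature 2 not at all.
def toy_model(x):
    return x[0] + 2.0 * x[1] + 0.0 * x[2]

toy_data = [list(t) for t in product([0.0, 1.0], repeat=3)]

order, upsilon_path = forward_selection(toy_model, toy_data)
short_order, short_path = forward_selection(toy_model, toy_data, target=0.75)
```

Each step evaluates ϒ once per remaining variable, so the full search needs a number of ϒ evaluations that is quadratic in the number of variables; the subsampling discussed in the computational considerations can be used to keep each evaluation cheap.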

Application to the South German credit data
Table 1 (column $\hat{\Upsilon}_k$) provides an example of variable selection based on ϒ to maximise explainability (the step number is indicated in column $k$): a PDP of only two variables already provides an explainability of 0.304, and for $\dim(X_s)$ = 5 (9, 12) an explainability $\hat{\Upsilon}_{\dim(X_s)}$ = 0.5 (0.8, 0.9) is obtained. Figure 5 shows a trellis visualisation (Cleveland, 1993) of a two-dimensional PDP (as implemented in e.g. Greenwell, 2017) for the two variables with the highest explainability, status account and duration. It reveals the same trend of increasing risk with longer maturity times for all status levels of the account, but an observable interaction exists for existing accounts with a low or negative balance, where the increase in risk is stronger. Although in general partial dependence functions are not restricted with regard to the dimension of $X_s$, their visualisation is limited to one or two dimensions. For more than two variables one can create scatterplot matrices (Cleveland, 1993), but this still does not allow the visualisation of higher-order interactions between variables. This should be kept in mind when partial dependence plots are used to explain black box machine learning models. For the random forest model on the South German credit data, the most explainable two-dimensional PDP from Figure 5 only explains 30% of the variation of the model's predictions.

Conclusion
In recent years, several failures of AI applications have occurred. As a result, regulatory requirements for business applications of machine learning and the ongoing hype around the methodology for explainable AI (XAI) have emerged, but a unified framework for assessing to what extent the explanations given by XAI can be misleading is still missing. Hence, a methodology was presented that allows an analysis of the degree to which predictive black box machine learning models can be explained by partial dependence plots. The framework provides both a graphical analysis of the mismatch between the PD curve and the predictions by the model in terms of PX-plots, and a measure (ϒ) to quantify the explainability of a model by a PDP on an interpretable scale. An algorithm was presented to maximise explainability with a low-dimensional PDP.
The proposed methodology was applied in this study to the publicly available South German credit data using a random forest model. It appears that a reasonable and well-interpretable partial dependence curve, as observed for the variable duration, can still deviate noticeably from the predictions of the model, which has to be taken into account when explaining it. The proposed measure of explainability ϒ can help to support business decisions by validating the model's interpretability. A PDP of the two most explainable variables, i.e. status account and duration, is more appropriate in order to understand how the model behaves.
In general, the explainability of the model improves when an increasing number of variables is taken into account, but PDPs of more than two dimensions can no longer be visualised, and thus an analyst will not be able to understand any high-order dependencies that affect the model's predictions. An R package with implementations of the described methodology is available on GitHub at https://github.com/g-rho/xPDPy.
Note that the proposed measure of explainability ϒ only reflects the difference between the partial dependence curve and the predictions by the model under investigation, which is an important but not the only aspect of interest with regard to explainability. For example, individual conditional expectation (ICE) plots allow for a visual analysis of whether the individual curves for all the observations show the same trend. Yet, currently there is no objective measure to quantify this, which could be a subject of future research.
There is also a need for ongoing research to develop methodology to understand high-order interactions, e.g. based on the ideas presented in Britton (2019), and Gosiewska and Biecek (2019). Psychology suggests a number of dimensions that can be simultaneously assessed by humans, which seems to be somewhere around seven (Miller, 1956), while naturally there might be differences depending on the experience of the analyst. However, it is questionable to what degree humans will ever be able to understand nonlinear high-order interactions. For this reason, the proposed measure can be considered as a tool to quantify the degree of explainability of a black box machine learning model. Molnar et al. (2020a) suggested an approach to simultaneously optimise a trade-off between predictive accuracy and interpretability. In contrast, other authors argue for using interpretable models instead (Rudin, 2019), which may come at the cost of predictive power (but not always, cf. e.g. Buecker et al., 2021). To conclude, the benefits of more complex but uninterpretable models over interpretable ones should be carefully analysed during model selection.
An important challenge consists in the development of fair scoring models (Kusner and Loftus, 2020; Szepannek and Luebke, 2021) and future research on this topic will be based on causal inference (cf. e.g. Luebke et al., 2020 for some examples). According to the results of Zhao and Hastie (2019), partial dependence curves can be used for this purpose. This makes the suggested measure of explainability also an important concept on the road towards developing fair scores.

Fig. 1. Partial dependence plot for the variables duration (left, black line) and status account (right, bars), as well as the predictions on the training data (grey dots).

Fig. 4. ICE curves for the variable duration (left) and simulation results for the computation time based on subsamples of different size, as well as the resulting distribution of $\hat{\Upsilon}$ for the variable duration (right). Source: authors' own.