A PROPOSAL OF A DATA-DRIVEN METHOD FOR DETERMINING THE WEIGHTS OF COMPOSITE INDICATORS

Kinga Kądziołka

The paper proposes a simulation method for determining the weights of the components of taxonomic measures. The method takes into account the degree of similarity of the final ranking to other rankings, as well as other properties, e.g. the clustering ability of the measure. The analyses were performed on publicly available data published by the General Statistic Office, concerning selected characteristics of the labour market in Poland at the level of subregions. The results obtained with the proposed method depend on the initial set of weight vectors. Because the method does not provide an invariant solution for a given data set, the stability of the rankings obtained with it was assessed. The orderings of objects obtained in consecutive repetitions of the procedure were highly consistent.


Introduction
A synthetic variable (also known as a taxonomic measure or composite indicator) is the result of an appropriate transformation of a group of diagnostic variables (Wydymus, 1984, p. 188). Taxonomic measures allow for the comparison and linear ordering of objects described by many different characteristics. There is no single universal method for constructing taxonomic measures. Some of these methods were presented by, among others, Kukuła and Luty (2018). In this paper the standardised sum method was used, due to its simplicity. The analysed taxonomic measures took the form:

$$TM_i = \sum_{j=1}^{m} w_j z_{ij},$$

where $w_j$ is the weight of the $j$-th diagnostic variable, $\sum_{j=1}^{m} w_j = 1$, $w_j > 0$, $j = 1, \ldots, m$, and $z_{ij}$ is the value of the $j$-th variable (in the form of a stimulant and after normalization) for the $i$-th object, $i = 1, \ldots, n$.
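The weighted-sum form of the measure can be illustrated with a short sketch. It is not the author's R code: the function names `taxonomic_measure` and `ranking`, the data and the equal weights are all hypothetical, and the input is assumed to be already converted to stimulants and normalized.

```python
def taxonomic_measure(z, w):
    """Weighted sum TM_i = sum_j w_j * z_ij for every object i (row of z)."""
    if any(wj <= 0 for wj in w) or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return [sum(wj * zij for wj, zij in zip(w, row)) for row in z]


def ranking(tm):
    """Positions of objects: 1 = highest TM value (ties broken by input order)."""
    order = sorted(range(len(tm)), key=lambda i: -tm[i])
    pos = [0] * len(tm)
    for p, i in enumerate(order, start=1):
        pos[i] = p
    return pos


# Hypothetical normalized data: 3 objects described by 4 stimulant variables.
z = [[0.9, 0.8, 0.7, 0.6],
     [0.1, 0.2, 0.3, 0.4],
     [0.5, 0.5, 0.5, 0.5]]
w = [0.25, 0.25, 0.25, 0.25]  # equal weights, chosen only for illustration

tm = taxonomic_measure(z, w)
print(ranking(tm))  # positions of the three objects
```

Any other weight vector that is positive and sums to 1 can be passed in place of `w`, which is how the simulated weights discussed later would enter the calculation.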
The weights of the diagnostic variables reflect their relative importance. The methods for determining the weights can be categorized into three groups: expert-based weighting, equal weighting and statistics-based weighting. Gan et al. analysed the literature to establish which methods are most commonly used for weighting and aggregating diagnostic variables. According to their study, equal weighting was the most often adopted method (Gan et al., 2017, p. 492). The existing literature offers many quantitative methods for determining the weights of composite indicators, such as principal component analysis, factor analysis, multiple linear regression and mathematical programming (Becker, Saisana, Paruolo, and Vandecasteele, 2017; Greco, Ishizaka, Tasiou, and Torrisi, 2019; Zhou, Ang, and Poh, 2007).
In this paper, a simulation method for determining the weights is proposed. The idea of the method is to create a ranking of objects that is similar to the rankings obtained with the other analysed taxonomic measures. Spearman's correlation coefficient was used to assess the similarity of rankings. The proposed method consists of four steps:
1. Randomly generate k vectors of weights and determine k taxonomic measures based on the generated vectors of weights.
2. For the values of each constructed taxonomic measure, determine the mean and the semi-standard deviation of Spearman's correlation coefficients with the values of the other analysed taxonomic measures. The author used a semi-standard deviation that incorporates only the negative deviations from the mean value, because deviations above the mean are a positive phenomenon: the higher the value of Spearman's correlation coefficient, the more similar the rankings.
3. Determine a subset (denoted as D) of the constructed measures such that for each taxonomic measure belonging to this subset there is no other taxonomic measure (among the initial set of k measures) with a higher mean of Spearman's correlation coefficients and a lower or equal semi-standard deviation, or with the same mean of Spearman's correlation coefficients and a lower semi-standard deviation.
4. Select the final taxonomic measure from the set D based on the adopted criterion. Five criteria for selecting the final taxonomic measure are compared.
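Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the author's R implementation. Three things are assumptions: weights are drawn as normalized uniform random numbers, the semi-standard deviation divides by the full number of coefficients, and all function names are hypothetical.

```python
import random
from statistics import mean


def random_weights(m, rng):
    """Draw m positive weights summing to 1 (normalized uniform draws; an assumption)."""
    raw = [rng.random() + 1e-12 for _ in range(m)]
    total = sum(raw)
    return [r / total for r in raw]


def ranks(values):
    """Average ranks (1 = smallest value); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def semi_std(xs):
    """Semi-standard deviation using only deviations below the mean.
    Dividing by the full sample size is an assumption of this sketch."""
    mu = mean(xs)
    return (sum(min(x - mu, 0.0) ** 2 for x in xs) / len(xs)) ** 0.5


def similarity_stats(z, k, seed=0):
    """Step 1: k random weight vectors and measures. Step 2: (mean, semi-std) per measure."""
    rng = random.Random(seed)
    m = len(z[0])
    weights = [random_weights(m, rng) for _ in range(k)]
    tms = [[sum(wj * zij for wj, zij in zip(w, row)) for row in z] for w in weights]
    stats = []
    for r in range(k):
        rho = [spearman(tms[r], tms[s]) for s in range(k) if s != r]
        stats.append((mean(rho), semi_std(rho)))
    return weights, tms, stats


# Hypothetical normalized data: 4 objects, 4 variables.
z = [[0.1, 0.9, 0.3, 0.5],
     [0.8, 0.2, 0.6, 0.4],
     [0.5, 0.5, 0.1, 0.9],
     [0.3, 0.7, 0.9, 0.2]]
weights, tms, stats = similarity_stats(z, k=5, seed=1)
for mu, sd in stats:
    print(round(mu, 3), round(sd, 3))
```

The cost grows with the square of k, since every measure is correlated with every other one.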

The proposed method is presented using the example of a multidimensional comparative analysis of labour market data at the subregional level. The analysed data are publicly available on the website of the General Statistic Office (GUS). All calculations were conducted in R.

Characteristics of the analysed data
In this study the taxonomic measure was constructed in order to assess the situation on the labour market in Poland at the subregional level in 2018 (Table 3). The following four diagnostic variables were chosen to construct the taxonomic measure:
• registered unemployment rate ($x_1$),
• people registered as unemployed for longer than 1 year (% of all unemployed; the so-called long-term unemployment rate) ($x_2$),
• share of unemployed persons aged 18-24 in the total number of people of this age ($x_3$),
• share of unemployed persons with at most lower secondary education in the total number of unemployed ($x_4$).
The diagnostic variables were chosen arbitrarily. Their choice was motivated, among other things, by data availability. Table 1 presents examples of the sets of variables used by other authors for the multidimensional assessment of the situation on the labour market in Poland.

Table 1. Examples of sets of variables used by other authors for the multidimensional assessment of the situation on the labour market in Poland
Author(s) | Variables
[author not given] | Unemployment rate, long-term unemployment rate, share of unemployed persons over 55 years in the total number of unemployed, share of unemployed persons aged 18-24 in the total number of unemployed, share of unemployed persons with higher education in the total number of unemployed, share of persons with disabilities in the total number of unemployed, people registered as unemployed per one job offer.
M. Gawrycka, A. Szymczak (2013, p. 77) | Labour productivity, employment rate, unemployment rate, tax burdens, investment expenditure for research and development, labour force participation, lifelong learning of adults, gross enrolment rate.
E. Sojka (2013, p. 35; 2014) | Share of unemployed persons aged 18-24 in the total number of unemployed, persons without internship or with internship not exceeding 1 year in the total number of unemployed, long-term unemployment rate, people registered as unemployed per one job offer, share of unemployed persons with higher education in the total number of unemployed, share of people working in the private sector in the total number of working people, share of people working in services in the total number of working people, gross earnings in relation to the regional average (Silesia region = 100).
[author not given] | The monthly average gross salary, newly registered national economy entities per 10 thousand of working age population, business investment expenditure per 1000 of working age population, unemployment rate.
A. Tatarczak, O. Boichuk (2018, p. 375) | Share of unemployed persons aged 15-24 in the total number of unemployed, share of unemployed persons without internship in the total number of unemployed, share of unemployed persons with higher education in the total number of unemployed, job vacancy rate, the monthly average gross salary in relation to the national average.
[author not given] | People registered as unemployed for longer than 1 year (% of all unemployed), average monthly number of people registered as unemployed per one job offer, unemployment rate, average monthly gross earnings in relation to the national average, newly registered entities per 10 thousand of working age population, business investment expenditure per one working age person, national economy entities per one thousand working age citizens, employment rate.
In this paper, in addition to the registered unemployment rate, the variables concerning long-term unemployment and unemployment among young people and people with low education were used, as these are particularly dangerous phenomena on the labour market, which may lead to an increase in the crime rate (Kądziołka, 2015, p. 72).
In the analysed case, all diagnostic variables were destimulants. They were converted into stimulants according to the formula: [formula not recoverable from the source].

Application of the proposed method
In the analysed case, 1000 vectors of weights $(w_1^{(r)}, \ldots, w_4^{(r)})$, $r = 1, \ldots, 1000$, were randomly generated. From these, one representative is chosen, with which the final taxonomic measure is constructed and the linear ordering of subregions is performed. First, $k = 1000$ taxonomic measures associated with the generated weights are constructed:

$$TM_i^{(r)} = \sum_{j=1}^{m} w_j^{(r)} z_{ij}, \quad i = 1, \ldots, n; \; r = 1, \ldots, k; \; m = 4.$$

From the set of these taxonomic measures, one is chosen as the final solution of the linear ordering of objects (here: subregions). Figure 1 presents the scatterplot of the semi-standard deviation and the mean of Spearman's correlation coefficients for the constructed taxonomic measures.
Based on the values of the taxonomic measures, the author created rankings of objects. The subregions were ordered from the best to the worst according to the values of the taxonomic measures. Figure 2 shows the positions of individual subregions in the rankings obtained for the analysed 1000 taxonomic measures. Position 1 is the object with the highest value of the taxonomic measure (the best subregion) and position 73 is the object with the lowest value of the measure (the worst subregion). For the majority of subregions, the positions differed considerably between the particular rankings.
Next, the author created a subset of taxonomic measures containing those for which there exists neither a measure with a higher mean of Spearman's correlation coefficients and a lower or equal semi-standard deviation of Spearman's correlation coefficients, nor a measure with the same mean of Spearman's correlation coefficients and a lower semi-standard deviation. In this case there were 13 such taxonomic measures (see Figure 3). The procedure resembles the determination of an efficient frontier of investment portfolios. The labels in Figure 3 contain the identification number (Id) of the appropriate taxonomic measure. Figure 4 shows the structure of the weights of the taxonomic measures belonging to the reduced set, and Figure 5
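The construction of the reduced set D can be sketched as a dominance filter over pairs of mean and semi-standard deviation. The function name `non_dominated` and the five example pairs are hypothetical; the dominance rule itself follows the definition given in the text.

```python
def non_dominated(stats):
    """Return indices of measures that belong to D.

    stats is a list of (mean_rho, semi_std) pairs. A measure is dropped when
    another one has a higher mean with a lower or equal semi-standard deviation,
    or the same mean with a strictly lower semi-standard deviation.
    """
    keep = []
    for i, (mi, si) in enumerate(stats):
        dominated = any(
            (mj > mi and sj <= si) or (mj == mi and sj < si)
            for j, (mj, sj) in enumerate(stats)
            if j != i
        )
        if not dominated:
            keep.append(i)
    return keep


# Hypothetical (mean, semi-standard deviation) pairs for five measures.
stats = [(0.90, 0.05), (0.95, 0.05), (0.95, 0.04), (0.97, 0.06), (0.80, 0.01)]
print(non_dominated(stats))  # → [2, 3, 4]
```

Measures 0 and 1 are dropped because measure 1 dominates measure 0 and measure 2 dominates measure 1. The remaining three trade a higher mean against a higher semi-standard deviation, so none dominates another.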

The choice of the final representative
The results obtained with the five methods for selecting the final measure were compared. In the first case, the measure characterized by the minimal value of the semi-standard deviation of Spearman's correlation coefficients was selected.
In the second case, the measure characterized by the maximal mean of Spearman's correlation coefficients was selected.
In the third case, the measure characterized by the maximal value of Sokołowski's discrimination coefficient was selected. Sokołowski's coefficient is determined according to the formula: [formula not recoverable from the source]. The higher the value of Sokołowski's coefficient, the higher the ability of the taxonomic measure to create clusters of similar objects (Roszkowska and Lasakevic, 2014, p. 46).
In the fourth case, the measure was selected for which the sum of the distances to the other measures (in two-dimensional space, see Figure 3) was minimal.
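The third and fourth criteria can be sketched as follows, under two clearly labelled assumptions. First, the formula for Sokołowski's coefficient is not reproduced in the text, so the sketch assumes a form often quoted in the linear-ordering literature: G = 1 - Σ min{(q_i - q_{i+1})/R, 1/(n-1)}, with the values q sorted in non-increasing order and R their range. This form should be checked against the original source before use. Second, the distance in the two-dimensional space is assumed to be Euclidean on the raw coordinates. Function names are hypothetical.

```python
from math import hypot


def sokolowski_g(tm):
    """Assumed form: G = 1 - sum_i min((q_i - q_{i+1}) / R, 1 / (n - 1)),
    where q is tm sorted in non-increasing order and R is the range of q."""
    n = len(tm)
    q = sorted(tm, reverse=True)
    if n < 2 or q[0] == q[-1]:
        raise ValueError("need at least two objects with differing values")
    value_range = q[0] - q[-1]
    return 1.0 - sum(
        min((q[i] - q[i + 1]) / value_range, 1.0 / (n - 1)) for i in range(n - 1)
    )


def min_total_distance(points):
    """Index of the point with the smallest summed Euclidean distance to all others
    (Euclidean distance is an assumption of this sketch)."""
    totals = [
        sum(hypot(px - qx, py - qy) for qx, qy in points) for px, py in points
    ]
    return min(range(len(points)), key=totals.__getitem__)


# Evenly spaced values show no clustering; one large gap gives a higher G.
print(sokolowski_g([0.0, 1.0, 2.0, 3.0]))
print(sokolowski_g([0.0, 0.0, 0.0, 3.0]))

# Hypothetical (semi-standard deviation, mean) coordinates of three measures in D.
print(min_total_distance([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

Because the two axes in Figure 3 have different scales, rescaling the coordinates before computing distances could change which measure is selected; the text does not say whether any rescaling was applied.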
In the last case, the measure for which the mean value of the indicators of similarity of the weight structures was maximal was selected. The indicator of the similarity of two structures was determined according to the following formula: [formula not recoverable from the source], where: i, j are the numbers (Id) of objects, k is the number (Id) of the component of the structure, $p_{ik}$ is the share of the k-th component in the structure of the i-th object, and $p_{jk}$ is the share of the k-th component in the structure of the j-th object. The higher the value of this indicator, the more similar the structures of the objects. When the value of the indicator equals 1, the structures are identical (Sobczyk, 2010, p. 181).
Table 2 presents the results of the selection of the final measure according to the various criteria. The results differ, as different selection methods may lead to different outcomes. Table 3 presents the rankings of the subregions according to the measures selected using methods 1 to 5 (Table 2). The rankings of the subregions in the first ten positions are identical. In each of the five final rankings, Szczecin was the best subregion, while the Włocławski subregion was the worst. Table 4 shows the values of Spearman's correlation coefficients for the five analysed taxonomic measures (TMs). There was a high consistency in the linear orderings of the subregions according to the values of the analysed measures.
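The fifth criterion can be sketched under the assumption that the structure-similarity indicator is the sum of component-wise minima of the shares, Σ_k min(p_ik, p_jk). This form equals 1 for identical structures, as the text requires, but the formula itself is not reproduced here, so it should be treated as an assumption. Function names and the three weight vectors are hypothetical.

```python
def structure_similarity(p, q):
    """Assumed indicator: sum over components of min(p_k, q_k).
    For share vectors that each sum to 1 the result lies in [0, 1]."""
    return sum(min(a, b) for a, b in zip(p, q))


def most_typical_weights(weight_vectors):
    """Index of the weight vector with the highest mean similarity to the others."""
    k = len(weight_vectors)

    def mean_similarity(i):
        return sum(
            structure_similarity(weight_vectors[i], weight_vectors[j])
            for j in range(k)
            if j != i
        ) / (k - 1)

    return max(range(k), key=mean_similarity)


# Hypothetical weight structures of three measures from the reduced set D.
d_weights = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.30, 0.30, 0.20, 0.20],
]
print(most_typical_weights(d_weights))  # → 2
```

Here the pairwise similarities are 0.8, 0.9 and 0.9, so the third vector has the highest mean similarity (0.9) and would be selected.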

Stability of results
The weights obtained with the proposed method depend on the initial set of k weight vectors. To assess the stability of the obtained rankings, the procedure was repeated 100 times and the results were compared. The final criterion in the fourth step of the method was the maximal mean of Spearman's correlation coefficients. Figure 6 presents the range of the subregions' positions obtained in the 100 final rankings. The positions obtained according to the values of TM 591 were used as a benchmark (Tables 2 and 3). There was a high consistency in the linear ordering of the subregions. Next, the matrix containing Spearman's correlation coefficients for the values of the obtained taxonomic measures was determined; it contained 100 rows and 100 columns. The minimal value of these coefficients was 0.9967. This result confirms the high consistency of the linear orderings of the subregions.
Next, the author analysed the stability of the weight structures of the 100 final representatives. Figure 7 shows the structures of the weights of the measures. The matrix containing the indicators of the structures' similarity was determined; it contained 100 rows and 100 columns. The minimal value of these indicators was 0.8165. This result confirms the high similarity of the weight structures.
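The stability check amounts to taking the minimum over all pairs of final results. The sketch below works on ranking positions rather than measure values, which gives the same Spearman coefficient when there are no ties. The function names and the three short rankings are hypothetical and stand in for the 100 final rankings.

```python
from itertools import combinations


def spearman_from_positions(a, b):
    """Spearman's coefficient for two rankings without ties (positions 1..n),
    using rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def min_pairwise(items, similarity):
    """Smallest similarity over all unordered pairs of items."""
    return min(similarity(a, b) for a, b in combinations(items, 2))


# Three hypothetical final rankings of five objects.
final_rankings = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 4],
    [2, 1, 3, 4, 5],
]
print(min_pairwise(final_rankings, spearman_from_positions))
```

The same `min_pairwise` helper could be applied to the weight vectors of the final representatives with any structure-similarity function passed as `similarity`.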

Conclusion
The existing literature offers many different methods for determining the weights of the components of taxonomic measures. Different methods can lead to different rankings. The paper proposed a simulation method for determining the weights of the diagnostic variables. The proposed method makes it possible to find a local solution (a vector of weights and the taxonomic measure depending on the initial set of weights). The accuracy of the obtained solution depends on the number of the initial weights vectors that are randomly generated. The higher the number of these initial weights vectors, the more accurate the obtained solution.