The Matlab source codes for regenerating the results of the following paper are available for download. The article and its source codes can be cited as: Amin Zollanvari, Alex Pappachen James, Reza Sameni, "A Theoretical Analysis of the Peaking Phenomenon in Classification". (The article is currently under consideration at Elsevier Pattern Recognition Letters, November 2017.)

A Theoretical Analysis of the Peaking Phenomenon in Classification
Amin Zollanvari, Alex Pappachen James, Reza Sameni
A. Zollanvari and A. P. James are with the Department of Electrical and Computer Engineering, Nazarbayev University, Kazakhstan (emails: amin.zollanvari@nu.edu.kz; apj@ieee.org); R. Sameni is with the School of Electrical and Computer Engineering, Shiraz University, Iran. (email: rsameni@shirazu.ac.ir)
I. Introduction
In a classification setting, dimensionality reduction is generally motivated by the peaking phenomenon, which, as McLachlan describes it [1, p. 391]: “For training samples of finite size, the performance of a given discriminant rule in a frequentist framework does not keep on improving as the number p of feature variables is increased. Rather, its overall unconditional error rate will stop decreasing and start to increase as p is increased beyond a certain threshold.”
Commonly, the point after which adding more features (variables) deteriorates the performance of the classifier is referred to as the optimal number of features [2, 3]. In pattern recognition, the peaking phenomenon was initially observed in Hughes’ work on characterizing the expected performance of a discrete classification rule as a function of sample size and dimensionality [4]. As pointed out in [2, 5], the performance of a Bayesian classifier cannot worsen as long as the old set of features is recoverable from the new set. For example, the peaking phenomenon in Hughes’ work is due to estimating the cell probabilities from a finite number of observations and plugging them into the Bayes rule, which leads to a suboptimal procedure [6, 7, 5]. These suboptimal procedures are what McLachlan refers to as “rules in a frequentist framework” (see the quotation above).
These studies and many other comments about the peaking phenomenon (see [1, p. 15]; [8, p. 561]; and [3]), or even the use of a phrase such as “curse of dimensionality” when this phenomenon is actually meant (e.g., see [9, p. 25]; [10, p. 101]; and [11, p. 9]), all point in one direction: in a non-Bayesian framework and above a certain dimensionality, the performance of the classifier starts deteriorating instead of improving steadily [12].
Analytical investigations of this phenomenon are rare, mainly owing to the paucity of a proper mathematical framework. Even when an analytical study exists, the underlying framework is unsuited to studying the phenomenon. For example, consider the work of Jain and Waller, who analytically studied the phenomenon in connection with linear discriminant analysis (LDA) in a multivariate Gaussian model of binary classification. Let δ_{p} denote the Mahalanobis distance between two p-dimensional Gaussian distributions 𝒩(μ_{p,i}, Σ_{p}), i = 0, 1; to wit,

δ_{p}^{2} = (μ_{p,0} − μ_{p,1})^{T} Σ_{p}^{−1} (μ_{p,0} − μ_{p,1}).  (1)
Jain and Waller [2] employed Bowker and Sitgreaves’ expressions for the expected error of LDA [13, 14] to characterize the minimal increase δ_{p+1} − δ_{p} that justifies adding another feature to avoid peaking, where n is the sample size in each class. Nevertheless, this type of analysis has two limitations. First, the analysis only considers adding one feature at a time. What if the cumulative discriminatory effect of adding many features overrides the detrimental effect of the additional features? Second, the mathematical groundwork underlying Bowker and Sitgreaves’ expressions is the classical large-sample fixed-dimension asymptotics (n → ∞, p fixed). In general, the result of such asymptotic analysis is valid in the finite-sample regime only when the sample size is much larger than the number of variables [15]. As a result, one may not use this type of analysis to study the effect of having many features, comparable to or far exceeding the sample size.
In this letter, we employ the double asymptotic machinery grounded in the work of pioneers such as Raudys and Serdobolskii in pattern recognition [16, 17, 18] and Wigner in physics [19]. We extend the framework to a multiple asymptotic setting that allows us to analytically study the effect of adding many features on the expected error of a simple classifier.
II. A Prototypical Example of Simple Classifiers
Consider a set of n_{T} = n_{0} + n_{1} independent and identically distributed training samples in ℝ^{p}, where x_{1}, x_{2}, …, x_{n0} come from population Π_{0} and x_{n0+1}, x_{n0+2}, …, x_{nT} come from population Π_{1}. Population Π_{i} is assumed to follow a multivariate Gaussian distribution 𝒩(μ_{i}, Σ), for i = 0, 1. With this assumption, the Bayes plug-in rule with a known covariance matrix is obtained via the linear discriminant,
W(x) = ( x − (x̄_{0} + x̄_{1})/2 )^{T} Σ^{−1} (x̄_{0} − x̄_{1}),  (2)
where x̄_{0} = (1/n_{0}) ∑_{i=1}^{n0} x_{i} and x̄_{1} = (1/n_{1}) ∑_{i=n0+1}^{nT} x_{i} are the sample means of the two classes. Following, for example, [20, 21], we assume the covariance matrix Σ is known. This assumption is made not only to simplify the analysis but, more importantly, to avoid singularity of the sample covariance matrix when p > n_{T}, a setting that allows us to study the effect of incorporating many variables in the classifier. The designed LDA classifier, ψ(x), is then given by
ψ(x) = 0 if W(x) > 0, and ψ(x) = 1 otherwise.  (3)
When Σ in (2) is replaced by the identity matrix, this classifier reduces to the Euclidean-distance classifier [22]. For i = 0, 1, the actual error of ψ(x) over class Π_{i} is given by
ε_{i} = Φ( (−1)^{i+1} (μ_{i} − (x̄_{0} + x̄_{1})/2)^{T} Σ^{−1} (x̄_{0} − x̄_{1}) / √((x̄_{0} − x̄_{1})^{T} Σ^{−1} (x̄_{0} − x̄_{1})) ),  (4)
where Φ(.) denotes the cumulative distribution function of a standard normal random variable. Then the (overall) error ε is the probability of misclassification across both classes; to wit, ε = α_{0}ε_{0} + α_{1}ε_{1} where α_{i} is the prior probability of class i.
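For concreteness, the construction in (2)–(4) can be carried out numerically when the true parameters are known. The following Python sketch (NumPy assumed; all parameter values and variable names are illustrative, not taken from the paper) builds the discriminant from simulated sample means and evaluates the exact conditional errors.

```python
import numpy as np
from statistics import NormalDist

Phi = NormalDist().cdf
rng = np.random.default_rng(0)

p, n = 10, 30                          # dimension and per-class sample size (illustrative)
mu0, mu1 = np.zeros(p), 0.3 * np.ones(p)
Sigma = np.eye(p)                      # known covariance, as assumed in the text
Sinv = np.linalg.inv(Sigma)

# Training samples from the two Gaussian populations and their sample means
X0 = rng.multivariate_normal(mu0, Sigma, n)
X1 = rng.multivariate_normal(mu1, Sigma, n)
xbar0, xbar1 = X0.mean(axis=0), X1.mean(axis=0)

d = Sinv @ (xbar0 - xbar1)             # discriminant direction
m = (xbar0 + xbar1) / 2                # midpoint of the sample means

def W(x):
    """Linear discriminant (2): assign x to class 0 when W(x) > 0."""
    return (x - m) @ d

# Exact conditional errors (4), available because the true means are known
s = np.sqrt((xbar0 - xbar1) @ d)
eps0 = Phi(float(-(mu0 - m) @ d / s))
eps1 = Phi(float((mu1 - m) @ d / s))
eps = 0.5 * eps0 + 0.5 * eps1          # overall error with equal priors
```

Because the true distributional parameters are in hand, `eps0` and `eps1` are exact errors of the designed classifier, not test-set estimates; this is the same device used in the experiments below.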
III. Double and Multiple Asymptotic Conditions
Hereafter, and for simplicity, we assume n_{0} = n_{1} = n. Consider the following sequence of Gaussian discrimination problems relative to the classifier (3):
Ξ_{p} = (𝒩(μ_{p,0}, Σ_{p}), 𝒩(μ_{p,1}, Σ_{p}), n), p = 1, 2, …,  (5)
where the subindex p denotes the dependency of parameters on p and Ξ_{p} is restricted by the following conditions:

Condition A: p → ∞, n → ∞, p/n → J, 0 < J < ∞;
Condition B: the limit of δ_{p}^{2} exists and is finite.

Condition B assures the existence of limits of relevant statistics [23, 24]. Under the usual ‘double asymptotic’ conditions A and B (abbreviated d.a.), we have (see Theorem 1 and its proof in [23]),
lim_{d.a.} E[ε_{p}] = plim_{p→∞} ε_{p} = Φ( − δ^{2} / (2√(δ^{2} + 2J)) ),  (6)
where
δ^{2} = lim_{p→∞} δ_{p}^{2},  J = lim_{p→∞} p/n,  (7)
and plim_{p→∞}(⋅) denotes convergence in probability under conditions A and B, with the limits defined in (7). To compare the expected error of ψ(x) defined for observations of dimension p_{1} with that of the classifier defined in the higher dimensional space p_{1} + p_{2}, we generalize the double asymptotic conditions A and B and the sequence (5) to a multiple asymptotic setting. To simplify notation, let
p_{j,k} ≜ ∑_{i=j}^{k} p_{i},  (8)
where p_{j,j} = p_{j}, and j is an integer, 0 < j ≤ k. To formalize the setting, assume the following sequence of Gaussian discrimination problems:
Ξ_{p_{j,k}} = (𝒩(μ_{0}, Σ), 𝒩(μ_{1}, Σ), n),  (9)
where all parameters depend on p_{j,k} (omitted to simplify the notation), and Ξ_{p_{j,k}} is restricted by the following conditions:

Condition A: for each j, p_{j} → ∞, n → ∞, p_{j}/n → J_{j}, 0 < J_{j} < ∞;
Condition B: the limit of δ_{p_{j,k}}^{2} exists and is finite,  (10)

where δ_{p_{j,k}}^{2} is obtained from (1) by replacing p with p_{j,k}. ■
Intuitively, the studied problems consist of n observations from each class over k feature sets of sizes p_{j} (0 < j ≤ k). The multiple asymptotic assumptions imply that the feature-sample space scales up at constant ratios J_{j} as n → ∞. This can be envisaged as a stretching of the feature-sample space in both directions (features and samples) at constant ratios.
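Before turning to the multi-hump analysis, it is instructive to evaluate the double asymptotic limit numerically. The helper below is a minimal Python sketch; it assumes the limiting expected error takes the closed form Φ(−δ²/(2√(δ² + 2J))) for this known-covariance setting, and the parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def limiting_error(delta2, J):
    """Assumed double asymptotic expected error: Phi(-delta^2 / (2*sqrt(delta^2 + 2J)))."""
    return Phi(-delta2 / (2.0 * sqrt(delta2 + 2.0 * J)))

# For a fixed limiting Mahalanobis distance, the error degrades as J = lim p/n grows;
# J = 0 recovers the Bayes error Phi(-delta/2).
errs = [limiting_error(1.0, J) for J in (0.0, 0.5, 1.0, 2.0)]
```

The monotone dependence on J is the analytic counterpart of the penalty paid for estimating the class means in ever-higher dimension.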
IV. The Possibility of a Multihump Phenomenon
From condition B in (10), it is straightforward to see that, defining

Δ_{pj,pk} ≜ δ_{p_{1,k}}^{2} − δ_{p_{1,j}}^{2}, 1 ≤ j ≤ k,  (11)

we have: (i) Δ_{pj,pk} ≥ 0; and (ii) the limit of Δ_{pj,pk} exists and is finite.
Proof: Let ε_{i,p_{1,j}}^{B} denote the error of the Bayes (optimal) classifier for p_{1,j} variables over class i = 0, 1. Replacing the sample means in (4) by the actual means, we have ε_{i,p_{1,j}}^{B} = Φ(−δ_{p_{1,j}}/2). Consider another set of p_{1,k} variables from which the previous set of p_{1,j} variables is recoverable; that is to say, the smaller set of variables is a subset of the larger one [5]. Theorem 2 in [5] yields ε_{i,p_{1,k}}^{B} ≤ ε_{i,p_{1,j}}^{B} and, therefore, δ_{p_{1,j}} ≤ δ_{p_{1,k}}. Using this result in (11) leads to (i). The proof of (ii) is straightforward from condition B and (11). ■
Theorem 1: Consider the sequence of Gaussian problems defined by (9). For a positive integer k we have
lim E[ε_{p_{1,k}}] = Φ( − (δ_{p1}^{2} + Δ_{p1,pk}) / (2√(δ_{p1}^{2} + Δ_{p1,pk} + 2(J_{1} + ⋯ + J_{k}))) ),  (12)
where Δ_{p1,pk} is the limiting value obtained from (ii).
Proof: From (11), by setting j = 1, we have δ_{p_{1,k}}^{2} = δ_{p1}^{2} + Δ_{p1,pk}. Therefore, (12) is obtained from (6) by replacing δ_{p}^{2}, p → ∞, and J by δ_{p_{1,k}}^{2}, p_{1,k} → ∞, and J_{1} + ⋯ + J_{k}, respectively. ■
Hereafter, and for ease of notation, we only consider the ordinary convergence of the expected error and leave the convergence in probability of the true error implicit. Using Theorem 1, we seek the condition under which E[ε_{p1}] ⋚ E[ε_{p1+p2}]. In this regard, we have the following theorem.
Theorem 2: Consider the sequence of Gaussian problems defined by (9). Then
lim E[ε_{n,p1}] ⋚ lim E[ε_{n,p1+p2}] ⟺ (δ_{p1}^{2} + Δ_{p1,p2}) / √(δ_{p1}^{2} + Δ_{p1,p2} + 2(J_{1} + J_{2})) ⋚ δ_{p1}^{2} / √(δ_{p1}^{2} + 2J_{1}).  (13)
Proof: From (6), by using Theorem 1 for p_{1} + p_{2} and the monotonicity of Φ(⋅), inequality (13) is obtained. ■
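The break-even value of J_{2} implied by (13) can also be found numerically, without a closed form. The Python sketch below (illustrative values; it again assumes the limiting error form Φ(−δ²/(2√(δ² + 2J)))) bisects for the J_{2} at which adding the second feature set leaves the limiting error unchanged.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def limit_err(delta2, J):
    """Assumed double asymptotic expected error of classifier (3)."""
    return Phi(-delta2 / (2.0 * sqrt(delta2 + 2.0 * J)))

def break_even_J2(delta2, J1, Delta, lo=0.0, hi=1e6, tol=1e-10):
    """Bisect for the J2 at which the error with (delta2 + Delta, J1 + J2)
    equals the error with (delta2, J1); below it, adding the features helps."""
    base = limit_err(delta2, J1)
    f = lambda J2: limit_err(delta2 + Delta, J1 + J2) - base
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:     # still better than the baseline: threshold lies above
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

J2_star = break_even_J2(delta2=1.0, J1=0.5, Delta=0.4)
```

Bisection applies because, with Δ fixed, the limiting error is strictly increasing in J_{2}: below `J2_star` the additional feature set decreases the limiting error, and above it the error increases.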
Solving (13) with respect to J_{2} and Δ_{p1,p2} leads to the following corollaries:

Corollary 1:
lim E[ε_{n,p1}] ⋚ lim E[ε_{n,p1+p2}] ⟺ J_{2} ⋛ g(δ_{p1}^{2}, J_{1}, Δ_{p1,p2}),  (14)
where
g(δ^{2}, J, Δ) ≜ Δ( δ^{2}(δ^{2} + Δ) + 2J(2δ^{2} + Δ) ) / (2δ^{4}).  (15)

Corollary 2:
lim E[ε_{n,p1}] ⋚ lim E[ε_{n,p1+p2}] ⟺ Δ_{p1,p2} ⋚ h(δ_{p1}^{2}, J_{1}, J_{2}),  (16)
where
h(δ^{2}, J, J′) ≜ ( −δ^{2}(δ^{2} + 4J) + √( δ^{4}(δ^{2} + 4J)^{2} + 8J′δ^{4}(δ^{2} + 2J) ) ) / (2(δ^{2} + 2J)).  (17)
In general, we have the following theorem, which shows the possibility of a multihump phenomenon in the expected error of the classifier (in a multiple asymptotic sense) by adding more features with the limiting ratio of J_{i+1} and the contribution Δ_{pi,pi+1} to the Mahalanobis distance δ_{p1,i}^{2}.
Theorem 3: Consider the sequence of Gaussian problems defined by (9) and let k be a positive integer. Then, depending on the underlying parameters of the sequence of Gaussian problems, (Δ_{pi,pi+1}, δ_{p1,i}^{2}, J_{1} + ⋯ + J_{i}, J_{i+1}), any sequence of the inequalities constructed by choosing either < or ≥ is possible in
lim E[ε_{n,p_{1,1}}] ≶ lim E[ε_{n,p_{1,2}}] ≶ ⋯ ≶ lim E[ε_{n,p_{1,k}}].  (18)
Proof: Following steps similar to the proof of (13), for 1 ≤ i ≤ k − 1 and for arbitrary δ_{p1,i}^{2} > 0 and Δ_{pi,pi+1} > 0, yields
lim E[ε_{n,p_{1,i}}] ⋚ lim E[ε_{n,p_{1,i+1}}] ⟺ Δ_{pi,pi+1} ⋚ h(δ_{p1,i}^{2}, J_{1} + ⋯ + J_{i}, J_{i+1}),  (19)
where h(⋅,⋅,⋅) is defined in (17). Similar results can be obtained by considering g(δ_{p1,i}^{2}, J_{1} + ⋯ + J_{i}, Δ_{pi,pi+1}) defined in (15). In that case we have,
lim E[ε_{n,p_{1,i}}] ⋚ lim E[ε_{n,p_{1,i+1}}] ⟺ J_{i+1} ⋛ g(δ_{p1,i}^{2}, J_{1} + ⋯ + J_{i}, Δ_{pi,pi+1}).  (20) ■
Theorem 2 and its corollaries have important consequences. Consider equation (14). The function g(δ_{p1}^{2}, J_{1}, Δ_{p1,p2}) depends on: (1) δ_{p1}^{2}, the limiting Mahalanobis distance of infinitely many features added at the increasing rate J_{1}; and (2) Δ_{p1,p2}, the increase in the limiting Mahalanobis distance obtained by adding a second infinite set of features with the limiting ratio J_{2}. Therefore, one may interpret g(δ_{p1}^{2}, J_{1}, Δ_{p1,p2}) as the asymptotic relative cumulative effectiveness of adding the second infinite set of features with respect to the first. According to (14), if the asymptotic ratio of the number of additional features to the sample size (i.e., J_{2}) is larger (smaller) than the relative effectiveness of this additional infinite set of features, then the expected true error induced by the second set of features increases (decreases). A consequence of Corollary 1 is therefore that the peaking phenomenon, as it is classically stated, does not necessarily exist in an asymptotic sense: from (18) we observe that adding (unboundedly) various sets of features can lead to an arbitrary oscillating sequence of expected errors.
V. Implications in the Finite-Sample Regime
In order to apply and interpret the asymptotic expressions of the previous section in a finite-sample regime, we need to replace J_{m} and Δ_{pj,pk} with their finite-sample equivalents [25, 26, 23]; that is to say, with p_{m}/n and Δ_{pj,pk} ≜ δ_{p_{1,k}}^{2} − δ_{p_{1,j}}^{2} (cf. (11)), respectively. Consequently, Corollary 1 suggests the following finite-sample approximation to quantify the number of additional features, p_{2}, that can be added to obtain a larger or smaller expected true error with respect to the previous p_{1} features:
E[ε_{n,p1}] ⋚ E[ε_{n,p1+p2}] ⟺ p_{2} ⋛ n g(δ_{p1}^{2}, p_{1}/n, Δ_{p1,p2}),  (21)
where g(⋅,⋅,⋅) is defined in (15). Here, δ_{p1}^{2} and Δ_{p1,p2} are, respectively, the Mahalanobis distance between the class conditional densities defined over the first p_{1} features and the increase in the Mahalanobis distance obtained by adding p_{2} features to them. Note that for a fixed p_{1}, increasing the quantities on the right side of inequality (21) (i.e., n and Δ_{p1,p2}) leads to a smaller expected error, whereas p_{2} is the only factor that can result in a larger expected error. Therefore, (21) implies that as long as the number of additional features is greater (smaller) than their cumulative efficacy in decreasing the error rate, the expected error increases (decreases).
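As a numeric illustration of this reading of (21), one can count how many additional features can be afforded before the expected error starts to rise. The sketch below assumes the finite-sample approximation Φ(−δ²/(2√(δ² + 2p/n))) for the expected error; all parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def approx_err(delta2, p, n):
    """Assumed finite-sample approximation of the expected error."""
    return Phi(-delta2 / (2.0 * sqrt(delta2 + 2.0 * p / n)))

def break_even_p2(delta2, Delta, p1, n):
    """Largest p2 for which adding p2 features (raising the squared
    Mahalanobis distance by Delta) still lowers the expected error."""
    base = approx_err(delta2, p1, n)
    p2 = 0
    while approx_err(delta2 + Delta, p1 + p2 + 1, n) < base:
        p2 += 1
    return p2

# With delta_{p1}^2 = 0.9, an increment of 0.3, p1 = 100, n = 100:
p2_star = break_even_p2(delta2=0.9, Delta=0.3, p1=100, n=100)
# adding up to p2_star features decreases the error; beyond that it increases
```

The integer threshold found this way coincides with the real-valued threshold n·g(δ_{p1}², p_{1}/n, Δ_{p1,p2}) of (21), rounded down.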
Special Case 1: If n g(δ_{p1}^{2}, p_{1}/n, Δ_{p1,p2}) on the right side of (21) is less than 1, then the inequality with “<” is not feasible and the expected error rate always increases by adding p_{2} features. ■
Special Case 2: If Δ_{p1,p2} = 0 (i.e., no contribution to the Mahalanobis distance by adding p_{2} features), the expected true error keeps increasing by adding any number of features; i.e., E[ε_{n,p1}] < E[ε_{n,p1+p2}]. ■
Nevertheless, n g(δ_{p1}^{2}, p_{1}/n, Δ_{p1,p2}) can in general be quite large and, therefore, the expected error rate may decrease or increase depending on p_{2}. We note that the feasibility of the analogous inequality was not a concern in the asymptotic sense (i.e., the feasibility of (14)), because g(δ_{p1}^{2}, J_{1}, Δ_{p1,p2}) is a fixed positive value and there is always a positive J_{2} satisfying the inequality there (a similar argument holds for Theorem 3).
Special Case 3: Let Δ_{p1,p2} > 0 and Δ_{p1+p2,p3} > 0. Then, similarly to Theorem 3, and by taking t ≜ n g(δ_{p1}^{2}, p_{1}/n, Δ_{p1,p2}) and s ≜ n g(δ_{p1}^{2} + Δ_{p1,p2}, (p_{1} + p_{2})/n, Δ_{p1+p2,p3}), we have (assuming p_{3} < s is feasible):

E[ε_{n,p1}] < E[ε_{n,p1+p2}] and E[ε_{n,p1+p2}] > E[ε_{n,p1+p2+p3}], for p_{2} > t and p_{3} < s,
which shows that in a finite-sample setting, the expected error can increase first and decrease afterwards. Therefore, the peaking phenomenon, i.e., the existence of a certain point after which the error rate keeps increasing, does not necessarily hold. In practice, the increase or decrease of the expected error depends on the nature of the problem (on δ_{p1}^{2}, Δ_{p1,p2}, Δ_{p1+p2,p3}, the numbers of features p_{1}, p_{2}, and p_{3}, and the classifier). ■
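A small numeric sketch of Special Case 3, under the same assumed finite-sample approximation Φ(−δ²/(2√(δ² + 2p/n))) and with illustrative parameter values, exhibits exactly this rise-then-fall behavior:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def approx_err(delta2, p, n):
    """Assumed finite-sample approximation of the expected error."""
    return Phi(-delta2 / (2.0 * sqrt(delta2 + 2.0 * p / n)))

# Illustrative scenario: a large set of weak features followed by a small strong set
n, p1, p2, p3 = 100, 50, 200, 30
d1, D12, D23 = 1.0, 0.1, 2.0          # delta_{p1}^2 and the two increments

e1 = approx_err(d1, p1, n)
e2 = approx_err(d1 + D12, p1 + p2, n)             # many weak features: error goes up
e3 = approx_err(d1 + D12 + D23, p1 + p2 + p3, n)  # few strong features: error comes down
# e1 < e2 and e3 < e2: an increase followed by a decrease, contra classical peaking
```

Such a parameter choice directly realizes one admissible inequality pattern of (18) in finite samples.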
As described in [3], “usually the best features are added first, and less-useful features are added later”; this is the strategy we follow in both experiments presented in this section.
Experiment 1: In the first experiment, we examine the finite-sample accuracy of expressions (6) and (21), the former and the latter being respectively the foundation and a consequence of (12). Let a_{(p)} denote a column vector of size p with identical elements a. Suppose the two populations follow 1000-dimensional multivariate Gaussian distributions with μ_{0} = 0_{(1000)}, μ_{1} = [0.1_{(100)}^{T}, 0.05_{(900)}^{T}]^{T}, and a common identity covariance matrix. Note that, as (6) suggests, the influence of distributional parameters such as the means and covariance matrix on the expected error of classifier (3) is summarized in the Mahalanobis distance defined in (1).
We examine a sequence of classifiers to determine the expected error as a function of dimensionality. For a fixed p, we generate a set of n = 50, 100, 200 training observations of dimension p from each class, construct the classifier using (2) and (3), and find the exact error rate from (4). Note that using (4) to determine the exact error rate is possible because we have the actual distributional parameters. For each p, we repeat this procedure 100 times to obtain a distribution of the error rate.
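At reduced scale, the Monte Carlo procedure just described can be sketched in Python as follows (the paper's Matlab codes are the reference implementation; here the sample means are drawn directly from their 𝒩(μ, Σ/n) distribution as a shortcut, and the set of dimensions is illustrative):

```python
import numpy as np
from statistics import NormalDist

Phi = NormalDist().cdf

def exact_error(mu0, mu1, xbar0, xbar1):
    """Exact overall error (4) of classifier (3), identity covariance, equal priors."""
    d = xbar0 - xbar1
    s = np.sqrt(d @ d)
    m = (xbar0 + xbar1) / 2
    e0 = Phi(float(-(mu0 - m) @ d / s))
    e1 = Phi(float((mu1 - m) @ d / s))
    return 0.5 * (e0 + e1)

rng = np.random.default_rng(1)
n, reps = 50, 100
mu0_full = np.zeros(1000)
mu1_full = np.concatenate([0.1 * np.ones(100), 0.05 * np.ones(900)])

mean_err = {}
for p in (50, 100, 200):
    mu0, mu1 = mu0_full[:p], mu1_full[:p]
    errs = []
    for _ in range(reps):
        # sample mean of n iid N(mu, I) observations is N(mu, I/n)
        xbar0 = mu0 + rng.standard_normal(p) / np.sqrt(n)
        xbar1 = mu1 + rng.standard_normal(p) / np.sqrt(n)
        errs.append(exact_error(mu0, mu1, xbar0, xbar1))
    mean_err[p] = float(np.mean(errs))
```

Averaging the exact per-repetition errors gives the Monte Carlo estimate of the expected error that is compared against the analytical approximation in Fig. 1.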
Fig. 1(a), (c), (e) plot the expected error rate of the classifier as a function of p using the aforementioned Monte Carlo procedure, as well as the finite-sample approximation obtained from (6) by replacing J and δ^{2} with p/n and δ_{p}^{2}, respectively. The results show substantial agreement between the Monte Carlo estimates and the analytical approximations. Fig. 1(b), (d), (f) present the magnitude of the function n g(δ_{p1}^{2}, p_{1}/n, Δ_{p1,p2}) together with p_{2}. Here, we take p_{1} = 100 and seek the number of features p_{2} to add to p_{1} to obtain an identical expected error rate.
Fig. 1. Left column: comparison of the expected error and its standard deviation bar as a function of p, using the double asymptotic approximation and Monte Carlo estimates, for n = 50 (a); n = 100 (c); and n = 200 (e). The expected error rate has a notch at p_{1} = 100; however, by adding p_{2} more variables, an expected error identical to that at p_{1} is obtained. Going beyond p = p_{1} + p_{2} results in smaller expected errors. The analytical and empirical results show substantial agreement. Right column: the magnitude of the function n g(δ_{100}^{2}, 100/n, Δ_{100,p2}) together with p_{2}, for n = 50 (b); n = 100 (d); and n = 200 (f). Note that at the point where the p_{2} line crosses the n g(δ_{100}^{2}, 100/n, Δ_{100,p2}) curve (plots in the right column), E[ε_{n,p1}] ≈ E[ε_{n,p1+p2}] (from the empirical results in the left column).
Experiment 2: Having established the accuracy of the finite-sample approximations obtained from the asymptotic results, in this experiment we use these analytic expressions to study the various shapes the expected error can take as a function of p. Suppose that the class conditional densities follow multivariate Gaussian distributions with a Mahalanobis distance determined by the three models depicted in Fig. 2(a). To obtain these models, one may consider two q-dimensional Gaussian distributions, q = 50,000, with the following parameters and take the first p dimensions: Σ = I_{q}, μ_{0} = 0_{(q)}, and,

where (a,b)_{r} denotes a column vector containing r equally spaced numbers between a and b. Fig. 2(b) plots the expected error of the classifier (3) as a function of dimensionality. As we observe, the expected error curve can have a single valley (the classical notion of the peaking phenomenon), a multihump behavior, or a monotonically decreasing trend, depending on the underlying distributional parameters.
Fig. 2. (a) Mahalanobis distance as a function of dimensionality with the distributional parameters defined in Experiment 2; (b) expected error as a function of dimensionality. Solid, dashed, and dash-dotted lines correspond to cases I, II, and III of Experiment 2, respectively.
VI. Conclusion
In this letter, we conducted a multiple asymptotic analysis in connection with a simple classifier to analytically study the so-called peaking phenomenon. Our analytical (and empirical) results provide a different picture of the phenomenon: an initial increase (decrease) in the expected error of a classification rule might stop and start to decrease (increase), depending on the cumulative discriminatory effect of the additional predictors.
[1] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York, 2004.
[2] A. Jain and W. Waller, “On the optimal number of features in the classification of multivariate Gaussian data,” Pattern Recogn., vol. 10, pp. 365–374, 1978.
[3] S. J. Raudys and A. K. Jain, “Small sample size effects in statistical pattern recognition: Recommendations for practitioners,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 252–264, 1991.
[4] G. F. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Trans. Information Theory, vol. 14, no. 1, pp. 55–63, 1968.
[5] W. Waller and A. Jain, “On the monotonicity of the performance of Bayesian classifiers,” IEEE Trans. Information Theory, vol. 24, pp. 392–394, 1978.
[6] K. Abend and T. J. Harley, Jr., “Comments on ‘The mean accuracy of statistical pattern recognizers’,” IEEE Trans. Information Theory, vol. 14, no. 1, pp. 420–423, 1969.
[7] J. M. Van Campenhout, “On the peaking of the Hughes mean recognition accuracy: the resolution of an apparent paradox,” IEEE Trans. Systems, Man and Cybernetics, vol. 8, pp. 390–395, 1978.
[8] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.
[9] U. M. Braga-Neto and E. R. Dougherty, Error Estimation for Pattern Recognition, Wiley-IEEE Press, New Jersey, 2015.
[10] G. Niu, Data-Driven Technology for Engineering Systems Health Management, Science Press/Springer, Beijing, 2017.
[11] N. Zheng and J. Xue, Statistical Learning and Pattern Analysis for Image and Video Processing, Springer, New York, 2009.
[12] B. Chandrasekaran and A. K. Jain, “Quantization complexity and independent measurements,” IEEE Trans. on Computers, vol. 23, pp. 102–106, 1974.
[13] R. Sitgreaves, “Some results on the distribution of the W-classification statistic,” in Studies in Item Analysis and Prediction, H. Solomon, Ed., pp. 241–251. Stanford University Press, 1961.
[14] A. H. Bowker and R. Sitgreaves, “An asymptotic expansion for the distribution function of the W-classification statistic,” in Studies in Item Analysis and Prediction, H. Solomon, Ed., pp. 292–310. Stanford University Press, 1961.
[15] B. Efron, “Bayesian, frequentists, and scientists,” Journal of the American Statistical Association, vol. 100, pp. 1–5, 2005.
[16] S. Raudys, “On determining training sample size of a linear classifier,” Computer Systems, vol. 28, pp. 79–87, 1967, in Russian.
[17] S. Raudys and D. M. Young, “Results in statistical discriminant analysis: A review of the former soviet union literature,” Journal of Multivariate Analysis, vol. 89, pp. 1–35, 2004.
[18] V. I. Serdobolskii, “On minimum error probability in discriminant analysis,” Soviet Math. Dokl., vol. 27, no. 3, pp. 720–725, 1983.
[19] E. P. Wigner, “On the distribution of the roots of certain symmetric matrices,” Ann. Math., vol. 67, no. 2, pp. 325–327, 1958.
[20] M.A. Moran, “On the expectation of errors of allocation associated with a linear discriminant function,” Biometrika, vol. 62, no. 1, pp. 141–148, 1975.
[21] M. J. Sorum, “Estimating the expected probability of misclassification for a rule based on the linear discriminant function: Univariate normal case,” Technometrics, vol. 15, pp. 329–339, 1973.
[22] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design, Springer-Verlag, London, 2001.
[23] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, “Analytic study of performance of error estimators for linear discriminant analysis,” IEEE Trans. Sig. Proc., vol. 59, no. 9, pp. 4238–4255, 2011.
[24] V. I. Serdobolskii, Multivariate Statistical Analysis: A High-Dimensional Approach, Kluwer Academic Publishers, 2000.
[25] M. Zhang, F. Rubio, D. P. Palomar, and X. Mestre, “Finite-sample linear filter optimization in wireless communications and financial systems,” IEEE Trans. Sig. Proc., vol. 61, no. 20, pp. 5014–5025, 2013.
[26] F. Rubio, X. Mestre, and D. P. Palomar, “Performance analysis and optimal selection of large minimum variance portfolios under estimation risk,” IEEE J. Sel. Topics Signal Process, vol. 6, no. 4, pp. 337–350, 2012.