Please click here to download the Matlab source codes for regenerating the results of the following paper. The article and its source codes can be cited as follows: Amin Zollanvari, Alex Pappachen James, Reza Sameni A Theoretical Analysis of the Peaking Phenomenon in Classification".
(The article is currently under consideration at Elsevier Pattern Recognition Letters, November 2017)
A Theoretical Analysis of the Peaking Phenomenon in Classification
Amin Zollanvari, Alex Pappachen James, Reza Sameni
A. Zollanvari and A. P. James are with the Department of Electrical and Computer Engineering, Nazarbayev University, Kazakhstan (emails: email@example.com; firstname.lastname@example.org); R. Sameni is with the School of Electrical and Computer Engineering, Shiraz University, Iran. (email: email@example.com)
In a classification setting, dimensionality reduction is generally attested to by the peaking phenomenon, which, as Mclachlan described, states that [1, p. 391]: “For training samples of finite size, the performance of a given discriminant rule in a frequentist framework does not keep on improving as the number p of feature variables is increased. Rather, its overall unconditional error rate will stop decreasing and start to increase as p is increased beyond a certain threshold”
Commonly, the certain point after which adding more features (variables) deteriorates the performance of the classifier is referred to as the optimal number of features [2, 3]. In pattern recognition, the peaking phenomenon was initially observed in Hughes’ work on characterizing the expected performance of a discrete classification rule as a function of sample size and dimensionality . As pointed out in [2, 5], the performance of a Bayesian classifier cannot worsen as long as the old set of features is recoverable from the new set. For example, the peaking phenomenon in Hughes’ work is due to the estimate of cell probabilities from finite number of observations and plugging them in the Bayes rule, which leads to a suboptimal procedure [6, 7, 5]. These suboptimal procedures are what McLachlan is referring to as “rules in a frequentist framework” (see the comment above).
These studies and many other comments about the peaking phenomenon (see [1, p. 15]; [8, p. 561]; and ) or even the use of a phrase such as “curse of dimensionality” when this phenomenon is actually meant (e.g., see [9, p. 25]; [10, p. 101]; and [11, p. 9]) all point in one direction: in a non-Bayesian framework and above a certain dimensionality, the performance of the classifier starts deteriorating instead of improving steadily .
There is rarely an analytical investigation of this phenomenon. The lack of an analytical study is mainly attributed to the paucity of a proper mathematical framework. Even when an analytical study exists, the underlying framework is unsuited to studying the phenomenon. For example, consider the work of Jain and Waller, who analytically studied the phenomenon in connection with linear discriminant analysis (LDA) in a multivariate Gaussian model of binary classification. Let δp denote the Mahalanobis distance between two p-dimensional Gaussian distributions (μ0,i,Σp), i = 0,1; to wit,
Jain and Waller  employed the Bowker and Sitgreaves’ expressions of expected error of LDA [13, 14] to show that the minimal increase in δp that justifies adding another feature to avoid peaking is approximately δp+1 -δp ≈, where n is the sample size in each class. Nevertheless, this type of analysis has two limitations. First, the analysis only looks into adding one feature at a time. What if the cumulative discriminatory effect of adding many features overrides the detrimental effect of additional features? Second, the underlying mathematical groundwork to obtain Bowker and Sitgreaves’ expressions is based on the classical large-sample fixed-dimension asymptotics (n →∞, p fixed). In general, the result of such asymptotic analysis is valid in the finite-sample regime when the sample size is much larger than the number of variables . As a result, one may not use this type of analysis to study the effect of having many features comparable to or far exceeding the sample size.
In this letter, we employ the double asymptotic machinery grounded in the work of pioneers such as Raudys and Serdobolskii in pattern recognition [16, 17, 18] and Wigner in physics . We extend the framework to a multiple asymptotic setting that allows us to analytically study the effect of adding many features on the expected error of a simple classifier.
II. A Prototypical Example of Simple Classifiers
Consider a set of nT = n0 + n1 independent and identically-distributed training samples in ℝp, where x1,x2,…,xn0 come from population Π0 and xn0+1,xn0+2, ,xnT come from population Π1. Population Πi is assumed to follow a multivariate Gaussian distribution (μi,Σ), for i = 0,1. With this assumption, the Bayes plug-in rule with a known covariance matrix is obtained by the linear discriminant,
where x0 = ∑ i=1n0xi and x1 = ∑ i=n0+1nTxi are the sample means for each class. Following for example [20, 21], we are assuming the covariance matrix Σ is known. This assumption is made not only to simplify the analysis, but more importantly, to avoid singularity of sample covariance matrix when p > nT — a setting that allows us to study the effect of incorporating many variables in the classifier. The designed LDA classifier, ψ(x), is then given by
When Σ in (2) is replaced by the identity matrix, this classifier reduces to the Euclidean-distance classifier . For i = 0,1, the actual error of ψ(x) is given by
where Φ(.) denotes the cumulative distribution function of a standard normal random variable. Then the (overall) error ε is the probability of misclassification across both classes; to wit, ε = α0ε0 + α1ε1 where αi is the prior probability of class i.
III. Double and Multiple Asymptotic Conditions
Hereafter, and for simplicity, we assume n0 = n1 = n. Consider the following sequence of Gaussian discrimination problems relative to the classifier (3):
where subindex p denotes dependency of parameters on p and Ξp is restricted by the following conditions:
Condition B assures the existence of limits of relevant statistics [23, 24]. Under the usual ‘double asymptotic’ conditions A and B (abbreviation by d.a.), we have (see Theorem 1 and its proof in ),
where plimp→∞(⋅) denotes convergence in probability under a similar set of conditions as in (7). To compare the expected error of ψ(x) defined for observations of dimension p1 with the classifier defined in a higher dimensional space p1 + p2, we generalize the double asymptotic conditions A, B, and (5) to a multiple asymptotic space. To simplify notations, let
where pj,j pj, and j is an integer, 0 < j ≤ k. To formalize the setting, assume the following sequence of Gaussian discrimination problems:
where all parameters depend on pj,k (omitted to simplify the notation), and Ξpj,k is restricted by the following conditions:
where δpj,k2 is obtained from (1) by replacing p with pj,k. ■
Intuitively, the studied problems consist of n observations from k feature sets of size pj (0 < j ≤ k). The multiple asymptotic assumptions imply that the feature-sample space scale up at constant ratios of Jj, as n →∞. This can be envisaged as a stretching of the feature-sample space in both directions (features and samples) at constant ratios.
IV. The Possibility of a Multi-hump Phenomenon
From condition B in (10), it is straightforward to see that,
Proof: Let εi,p1,jB denote the error of the Bayes (optimal) classifier for p1,j variables over class i = 0,1. Replacing the sample means in (4) by the actual means, we have εi,p1,jB = Φ(). Consider another set of p1,k variables from which the previous set of p1,j variables is recoverable; that is to say, the smaller set of variables is a subset of the larger one . Theorem 2 in  yields εi,p1,kB ≤ εi,p1,jB and, therefore, δp1,j ≤ δp1,k. Using this result in (11) leads to (i). The proof of (ii) is straightforward from condition B and (11). ■
Theorem 1: Consider the sequence of Gaussian problems defined by (9). For a positive integer k we have
where Δp1,pk is obtained from (ii).
Proof: From (11) and by setting j = 1 we have δp1,k2 = δp12 + Δp1,pk. Therefore, (12) is obtained from (6) by replacing δp2, p →∞, and J, by δp1,k2, p1,k →∞, and J1 + + Jk, respectively. ■
Hereafter, and for ease of notations, we only consider the ordinary convergence of expected error and leave convergence in probability of true error implicit. Using Theorem 2, we seek the condition where E[εp1] ⋚ E[εp1+p2]. In this regard, we have the following theorem.
Theorem 2: Consider the sequence of Gaussian problems defined by (9). Then
Solving (13) with respect to J2 and Δp1,p2 leads to the following corollaries:
In general, we have the following theorem, which shows the possibility of a multi-hump phenomenon in the expected error of the classifier (in a multiple asymptotic sense) by adding more features with the limiting ratio of Ji+1 and the contribution Δpi,pi+1 to the Mahalanobis distance δp1,i2.
Theorem 3: Consider the sequence of Gaussian problems defined by (9) and let k be a positive integer. Then, depending on the underlying parameters of the sequence of Gaussian problems, (Δpi,pi+1,δp1,i2,J1 + + Ji,Ji+1), any sequence of the inequalities constructed by choosing either < or ≥ is possible in
Proof: Following similarly to the proof of (13) for 1 ≤ j ≤ k - 1, and for an arbitrary δp1,i2 > 0 and Δpi,pi+1 > 0 yields
where h(⋅,⋅,⋅) is defined in (17). Similar results can be seen by considering g(δp1,i2,J1 + + Ji,Δpi,pi+1) defined in (15). In that case we have,
Theorems 2 (and its corollaries) have important consequences. Consider equation (14). The function g(δp12,J1,Δp1,p2) depends on: (1) δp12, which is the limiting Mahalanobis distance of infinitely many features with an increasing rate of J1; and (2) Δp1,p2, which is the increase in the limiting Mahalanobis distance by adding a second infinite set of features with the limiting ratio J2. Therefore, one may interpret g(δp12,Δp1,p2,J1) as the asymptotic relative cumulative effectiveness of adding the second infinite set of features with respect to the first. Therefore, according to (14), if the asymptotic ratio of the number of additional features to the sample size (i.e., J2) is larger (smaller) than the relative effectiveness of this additional infinite set of features, then the expected true error induced by the second set of features increases (decreases). Therefore, a consequence of Corrollary 1 is that the peaking phenomenon, as it is classically stated, does not necessarily exists in an asymptotic sense—from (18) we observe that adding (unboundedly) various sets of features can lead to an arbitrary oscillating sequence of expected error.
V. Implications in Finite-Sample Regime
In order to apply and interpret the asymptotic expressions found in the previous section in a finite-sample regime, we need to replace Jm and Δpj,pk with their finite-sample equivalents [25, 26, 23]; that is to say, with and Δpj,pkδp1,k2 - δp1,j2 (cf. (11)), respectively. Consequently, Corollary 1 suggests the following finite-sample approximation to quantify the number of additional features, p2, that can be added to have a larger or smaller expected true error with respect to the previously p1 features:
where g(⋅,⋅,⋅) is defined in (15). Here, δp12 and Δp1,p2 are the Mahalanobis distance between class conditional densities defined over p1 features and the increase in the Mahalanobis distance by adding p2 features to the first p1 features, respectively. Note that for a fixed p1, increasing quantities on the right side of inequality (21) (i.e., n and Δp1,p2) leads to a smaller expected error. On the other hand, p2 is the only factor that results in a larger expected error. Therefore, (21) implies that as long as the number of additional features is greater (smaller) than their cumulative efficacy in decreasing the error rate, the expected error increases (decreases).
Special Case 1: If ngδp12,,Δp1,p2 on the right side of (21) is less than 1, then the inequality with “<” is not feasible and the expected error rate always increases by adding p2 features. ■
Special Case 2: If Δp1,p2 = 0 (i.e., no contribution to the Mahalanobis distance by adding p2 features), the expected true error keeps increasing by adding any number of features, i.e. E[ϵn,p1] < E[ϵn,p1+p2].■
Nevertheless, ngδp12,,Δp1,p2 can be in general quite large and, therefore, the expected error rate may decrease or increase depending on p2. We note that the feasibility of the similar inequality was not a concern in the asymptotic sense (i.e., feasibility of (14)), because g(δp12,J1,Δp1,p2) is a fixed positive value and there is always a positive J2 that satisfies the inequality there (a similar argument holds for Theorem 3).
Special Case 3: Let Δp1,p2 > 0 and Δp1+p2,p3 > 0. Then similar to Theorem 3, and by taking tngδ p12,,Δp1,p2 and sngδp12 + Δp1,p2, + ,Δp1+p2,p3, we have (assuming p3 < s is feasible):
which shows that in a finite-sample setting, the expected error can increase first and decrease afterwards. Therefore, the peaking phenomenon, i.e., the existence of a certain point after which the error rate keeps increasing, does not necessary hold. In practice, the increase/decrease of the expected error depends on the nature of the problem (on δp12, Δp1,p2, Δp1+p2,p3, the number of features p1, p2, and p3, and the classifier). ■
VI. Numerical Experiments
As described in , “usually the best features are added first, and less-useful features are added later”—this is the strategy we follow in both experiments presented in this section.
Experiment 1: In the first experiment, we examine the finite-sample accuracy of expressions (6) and (21), the former and the latter being respectively the foundation and the consequence of (12). Let a(p) denote a column vector of size p with identical elements a. Suppose the two populations follow 1000 dimensional multivariate Gaussian distributions with μ0 = 0(1000), μ1 = [0.1(100)T,0.05(900)T]T and a common identity covariance matrix. Note that as (6) suggests, the influence of distributional parameters such as means and covariance matrix on the expected error of classifier (3) is summarized in the Mahalanobis distance defined in (1).
We examine a sequence of classifiers to determine the expected error as a function of dimensionality. For a fixed p, we generate a set of n = 50,100,200 training observations of p dimension from each class, construct the classifier using (2) and (3), and find the exact error rate from (4). Note that using (4) to determine the exact error rate is possible because we have the actual distributional parameters. For each p, we repeat this procedure 100 times to achieve a distribution of error rate.
Fig. 1(a),(c),(e) plot the expected error rate of the classifier as a function of p using the aforementioned Monte Carlo procedure as well as the finite-sample approximation obtained from (6) by replacing J and δp2 with p∕n and δp2, respectively. The results show substantial agreement between Monte Carlo estimates and the analytical approximations. Fig. 1(b),(d),(f) present the magnitude of function ngδp12,,Δp1,p2 and p2. Here, we take p1 = 100 and we seek the number of features p2 to add to p1 to obtain an identical expected error rate.
Fig. 1. Left column: comparison of the expected error and its standard deviation bar as a function of p, using the double asymptotic approximation with Monte Carlo estimates for n = 50 (a); n = 100 (b); and n = 200 (c). The expected error rate has a notch at p1 = 100; however, by adding p2 more variables, an identical expected error to that of p1 is obtained. Going beyond p = p1 + p2 results in smaller expected errors; The analytical and empirical results show substantial agreement. Right column: the magnitude of function ngδ1002,100∕n,Δ100,p2 and p2 for n = 50 (b); n = 100 (d); and n = 200 (e). Note that at the point where the p2 line crosses ngδ1002,100∕n,Δ100,p2 curves (plots in the right column), then E[ϵn,p1] ≈E[ϵn,p1+p2] (from empirical results in the left column).
Experiment 2: Having laid down the accuracy of finite-sample approximation obtained from asymptotic results, in this experiment we use these analytic expressions to study various forms of expected error that would appear as a function of p. Suppose that the class conditional densities follow multivariate Gaussian distributions with a Mahalanobis distance determined from the three models depicted in Fig. 2(a). In order to obtain these models, one may consider two q = 50,000 dimensional Gaussian distributions with the following parameters and take the first p dimensions: Σ = Iq, μ0 = 0(q), and,
where (a,b)r denotes a column vector containing r equally spaced numbers between a and b. Fig. 2 (b) plots the expected error of the classifier (3) as a function of dimensionality. As we observe, the expected error curve can have a single valley (the classical notion of the peaking phenomenon), a multi-hump behavior, or a monotonic decreasing trend depending on the underlying distributional parameters.
Fig. 2. (a) Mahalanobis distance as a function of dimensionality with distributional parameters defined in Experiment 2; (b) Expected error as a function of dimensionality. Solid, dashed, and dash-dotted lines correspond to case I, II, and III presented in Experiment 2, respectively.
In this letter, we conducted a multiple asymptotic analysis in connection with a simple classifier to analytically study the so-called peaking phenomenon. Our analytical (and empirical) results provide a different picture of the phenomenon. That is to say an initial increase (decrease) in the expected error of a classification rule might stop and start to decrease (increase) depending on the cumulative discriminatory effect of additional predictors.
 S. J. Raudys and A. K. Jain, “Small sample size effects in statistical pattern recognition: Recommendations for practitioners,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 252–264, 1991.
 A.H. Bowker and R. Sitgreaves, “An asymptotic expansion for the distribution function of the w-classification statistic,” in Studies in Item Analysis and Prediction, H. Solomon, Ed., pp. 292–310. Stanford University Press, 1961.
 A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, “Analytic study of performance of error estimators for linear discriminant analysis,” IEEE Trans. Sig. Proc., vol. 59, no. 9, pp. 4238–4255, 2011.
 M. Zhang, F. Rubio, D. P. Palomar, and X. Mestre, “Finite-sample linear filter optimization in wireless communications and financial systems,” IEEE Trans. Sig. Proc., vol. 61, no. 20, pp. 5014–5025, 2013.
 F. Rubio, X. Mestre, and D. P. Palomar, “Performance analysis and optimal selection of large minimum variance portfolios under estimation risk,” IEEE J. Sel. Topics Signal Process, vol. 6, no. 4, pp. 337–350, 2012.