A direct method for L1-norm codeviation

A.J. Allinger
Design Service Corporation


Abstract:   The statistic σ1 = (1/N) ∑ max(min(ξ, η), min(-ξ, -η)) is proposed as a robust analog of the usual covariance, leading to L1 analogs of correlation and statistical distance. The L1 statistics are shown to have the same relation to their L2 counterparts as fuzzy logic does to probability theory. Distance functions based on correlation are shown to be the sum of a metric part and a non-metric part, which may be interpreted as the disagreement and the agreement, respectively, of two objects. Applications in data clustering are discussed.



Background

A fundamental task in statistics is to seek relations between variables as a means of explaining the world. For this purpose, the Pearson correlation has long been universally applied in the search for a relation between two variables. The result is a number r between -1 and 1, which is positive when an increase in one variable tends to be accompanied by an increase in the other, and negative when an increase in one tends to be accompanied by a decrease in the other.

A closely related task is to quantify the dissimilarity between entities rather than variables. In a 1936 article, P.C. Mahalanobis introduced a generalized statistical distance. Given a sample of measurements of p variables for each of n objects, each pair of objects is characterized by a nondimensional distance. The Mahalanobis distance has often been used as a measure of dissimilarity between objects in a sample. Compared to the ordinary Euclidean distance, the Mahalanobis distance gives superior performance on problems in data clustering and classification [5]. The building block of both of these statistics is the covariance,

s_xy = (1/N) ∑ ξη = (1/N) ∑ (x - μ)(y - ν)

where μ and ν are the means of x and y, respectively, and ξ = x - μ and η = y - ν denote the centered variables.
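
For concreteness, the covariance as written above can be computed directly. The following minimal Python/NumPy sketch is purely illustrative (the function name is not part of the method):

    import numpy as np

    def covariance(x, y):
        # s_xy = (1/N) * sum((x - mu) * (y - nu))
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        xi = x - x.mean()    # xi  = x - mu
        eta = y - y.mean()   # eta = y - nu
        return np.mean(xi * eta)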

A weakness of the standard covariance, and of L2 statistics in general, is sensitivity to outliers. A few extreme objects have a disproportionate effect on the sample mean and variance. In contrast, the L1 statistics, the median and the mean absolute deviation, are comparatively robust against these effects. For example, increasing the largest observation by an arbitrary factor f has no effect at all on the median. The effect of an extreme measurement on the mean absolute deviation is proportional to its difference from the average, but its effect on the variance is proportional to the square of that difference. If the outliers are spurious, the estimate of the standard deviation may be severely corrupted.

Previous work on developing a robust generalized distance has focused on choosing a subset of reliable observations [4]. Since the number of possible non-empty subsets of n observations is 2ⁿ - 1, there is exponential growth in the number of possibilities to consider, and a consequently high computational cost. Perhaps more importantly, banishing an observation from the data set necessarily discards possibly valid information, and silently weighting the data according to the analyst's prior assumptions about what sort of distribution it should follow sacrifices scientific objectivity.

An L1 variant of covariance is thus desirable for at least two reasons: it achieves a more robust estimate without willfully discarding information, and it extends the correspondence between L1 and L2 statistics.

Argument from the standard covariance

The Euclidean distance, given by the Pythagorean theorem
d2 = √[∑(x-y)²]

has its counterpart in the Manhattan distance
d1 = ∑|x-y|

Likewise, the standard deviation
σ2 = √[(1/N) ∑(x-μ)²]

has its L1 counterpart in the mean absolute deviation:
σ1 = (1/N) ∑|x-μ|

L1 analogs of covariance and correlation will now be introduced. Consider the familiar covariance
(1/N) ∑ξη = (1/N) ∑[(x-μ)(y-ν)]

The expression ξη is positive when ξ and η have the same sign, and negative when they have different signs. Employing the identities
(a+b)² = a² + 2ab + b²
(a-b)² = a² - 2ab + b²
∴ 4ab = (a+b)² - (a-b)²

factor:
ξη = ¼[(ξ+η)² - (ξ-η)²]

Notice that the covariance is the difference of an agreement and a disagreement. Extending the analogy of mean absolute deviation to standard deviation, define:
σ1 = (1/2N) ∑(|ξ+η| - |ξ-η|)

as a new measure of the relationship between x and y. The factor of 1/2 ensures that for x = y the expression reduces to the mean absolute deviation. Thus, the mean absolute deviation is a special case of the L1 codeviation, just as the L2 variance is a special case of the L2 covariance.
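
A direct computation of the L1 codeviation follows immediately from the definition. The sketch below (Python/NumPy, with illustrative names) also checks the reduction to the mean absolute deviation when x = y:

    import numpy as np

    def codeviation_l1(x, y):
        # sigma_1 = (1/(2N)) * sum(|xi + eta| - |xi - eta|)
        xi = np.asarray(x, dtype=float) - np.mean(x)
        eta = np.asarray(y, dtype=float) - np.mean(y)
        return np.mean(np.abs(xi + eta) - np.abs(xi - eta)) / 2.0

    x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
    mad = np.mean(np.abs(x - x.mean()))
    assert np.isclose(codeviation_l1(x, x), mad)   # reduces to the mean absolute deviation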

Correlation and Generalized Distance

The correlation coefficient is
r2 = ∑ ξη / √[(∑ξ²)(∑η²)]

By analogy, define a Manhattan correlation
r1 = ∑ (|ξ+η| - |ξ-η|) / ∑(|ξ| + |η|)
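
In the same illustrative notation, the Manhattan correlation may be computed as below. Because |ξ+η| - |ξ-η| never exceeds |ξ| + |η| in magnitude, r1 is bounded by [-1, 1], just like r2:

    import numpy as np

    def correlation_l1(x, y):
        # r_1 = sum(|xi + eta| - |xi - eta|) / sum(|xi| + |eta|)
        xi = np.asarray(x, dtype=float) - np.mean(x)
        eta = np.asarray(y, dtype=float) - np.mean(y)
        num = np.sum(np.abs(xi + eta) - np.abs(xi - eta))
        den = np.sum(np.abs(xi) + np.abs(eta))
        return num / den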

For p variables, there is a p×p covariance matrix S. The squared Mahalanobis distance, neglecting a factor of det(S), is
d² = ∑ᵢⱼ ξᵢ Pᵢⱼ ξⱼ

where P = S⁻¹. Writing the quantity in summation notation makes explicit the interaction between the variables. The ξᵢξⱼ interaction is essential, but a true L1 measure would not contain multiplication. An expression for a true generalized L1 statistical distance has not been obtained. However, it is possible to use the L1 codeviation matrix in place of the standard covariance matrix. Such an L1½ metric has led to improved performance on the benchmark Iris and Wine problems from the UCI archive [1].
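
The substitution described above can be sketched as follows: build the p × p matrix of pairwise L1 codeviations, invert it, and use it in the Mahalanobis-style quadratic form. This is a sketch under simplifying assumptions, not a reference implementation; in particular, the codeviation matrix is not guaranteed to be positive definite, so a pseudo-inverse or regularization may be needed in practice.

    import numpy as np

    def codeviation_matrix(X):
        # p x p matrix of pairwise L1 codeviations for an n x p data matrix X
        Xi = X - X.mean(axis=0)                    # columns centered on their means
        n, p = Xi.shape
        S1 = np.empty((p, p))
        for i in range(p):
            for j in range(p):
                S1[i, j] = np.mean(np.abs(Xi[:, i] + Xi[:, j])
                                   - np.abs(Xi[:, i] - Xi[:, j])) / 2.0
        return S1

    def l1half_distance(u, v, P):
        # Mahalanobis-style quadratic form with P = inverse of the codeviation matrix
        d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
        return float(np.sqrt(d @ P @ d))

    # Usage sketch:
    # P = np.linalg.inv(codeviation_matrix(X))
    # l1half_distance(X[0], X[1], P)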

Argument from fuzzy logic

In the context of logical variables, which take values in [0, 1], a symmetric distance function between x and y is given by their mutual exclusion:
x ⊻ y = (x ∧ ~y) ∨ (~x ∧ y)

and the logical distance is thus:
d = x(1-y) + (1-x)y   = x + y - 2xy

In their work on fuzzy logic [2], G. Klir and B. Yuan observed that when the variables are Boolean {0, 1}, logical AND is equivalent to the minimum, and logical OR is equivalent to the maximum. They introduced a fuzzy distance
d = max(min(x, 1-y), min(1-x, y))

To obtain the Manhattan codeviation, begin with an expression for logical equivalence:
x ⇔ y = (x ∧ y) ∨ (~x ∧ ~y)
= max(min(x, y), min(1-x, 1-y))

Since this article is mostly concerned with scalar data, now switch from logical variables to scalar variables. The min and max operations carry over unchanged. Negation, however, must reflect values across 0 instead of ½, so ~x becomes -x rather than 1-x, resulting in:
x ⇔ y = max(min(x, y), min(-x, -y))



Equivalence of these approaches

Compare the expression derived from covariance with the expression derived from fuzzy logic; they are identical.
½(|ξ+η| - |ξ-η|) ≡ max(min(ξ, η), min(-ξ, -η))

This demonstrates a remarkable relationship between fuzzy logic and L1 statistics. Two distinct approaches have led to the same formula. The first approach was an L1 modification of L2 statistics; the second was a fuzzy-logic modification of the calculus of probabilities. It may be concluded that fuzzy logic is related to probability theory in the same way that L1 statistics are related to L2 statistics.

For the purposes of computation, the min/max formula is to be preferred, since it is not susceptible to subtractive cancellation.
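
A quick numerical check of the identity on random data (a sanity check only, in illustrative Python):

    import numpy as np

    rng = np.random.default_rng(0)
    xi, eta = rng.normal(size=(2, 100000))

    lhs = (np.abs(xi + eta) - np.abs(xi - eta)) / 2.0
    rhs = np.maximum(np.minimum(xi, eta), np.minimum(-xi, -eta))
    assert np.allclose(lhs, rhs)   # the two derivations give the same quantity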

Improper Metric Decompositions

The expression for L1 codeviation yields insights into the metric and non-metric properties of distance functions. Seek a decomposition d = m + n of distance into a metric part and a non-metric part. Then for triples of objects u, v, w drawn from some data set, the triangle inequality holds for the metric part of distance
m(u, w) ≤ m(u, v) + m(v, w)

but may or may not be violated in the non-metric part. Such a decomposition always exists, but is not unique [6].

Two important distance functions are the cosine distance and the correlation distance. The difference between them is that the correlation distance operates on the deviations from the average, while the cosine distance operates on the original data. The cosine distance is often used to compare documents represented by a list of words and the frequency with which each word occurs (the "bag of words" format). The correlation distance has been used to cluster gene expression data [3].

Suppose that the data vectors have been normalized by dividing each by
√(∑x²)

for the L2 case, or by
∑|x|

for the L1 case. The correlation is
r2 = (x · y) / √(∑x²∑y²)

The correlation distance is preferably taken to be arccos(r), which is a metric distance. Elsewhere, the correlation distance has been defined as [7, 8]:
d = (1-r) / 2

L2-normalizing the data makes √(∑x² ∑y²) equal to one. Substituting ¼[(x+y)² - (x-y)²] for xy yields
d2 = [∑(x-y)² / 8] + [½ - ∑(x+y)² / 8]

which is a metric decomposition, as verified by numerical simulations testing for triangle inequality violations. For the L1 case,
r1 = ∑(|x+y| - |x-y|) / ∑(|x| + |y|)

L1 normalizing the data makes the denominator equal two, thus:
d1 = [∑|x-y| / 4] + [½ - ∑|x+y| / 4]

which is a metric decomposition. The disagreement term (x-y) is in the metric part of distance, and the agreement term (x+y) is in the non-metric part of distance, as would be expected by intuition.
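
The L1 decomposition can be checked numerically in the same spirit as the simulations mentioned above. The sketch below L1-normalizes random vectors, verifies that m + n reproduces (1 - r1)/2, and confirms that the metric part m (a scaled Manhattan distance) satisfies the triangle inequality; the setup is illustrative only:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 8))
    X /= np.sum(np.abs(X), axis=1, keepdims=True)     # L1-normalize each vector

    def m_part(x, y):                                 # metric part: sum|x - y| / 4
        return np.sum(np.abs(x - y)) / 4.0

    def n_part(x, y):                                 # non-metric part: 1/2 - sum|x + y| / 4
        return 0.5 - np.sum(np.abs(x + y)) / 4.0

    def r1(x, y):
        return np.sum(np.abs(x + y) - np.abs(x - y)) / np.sum(np.abs(x) + np.abs(y))

    for u in X:
        for v in X:
            assert np.isclose(m_part(u, v) + n_part(u, v), (1.0 - r1(u, v)) / 2.0)
            for w in X:
                assert m_part(u, w) <= m_part(u, v) + m_part(v, w) + 1e-12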

Contra-categoricals in clustering

An example of inherently non-metric data occurs when trying to synthesize a set of observations. Some number n of events have occurred, which have been observed by some number q of sensors. Each sensor tries to give a single report of each event, but there may be mix-ups. A sensor may miss an event, or it may double-count an event.

Other things being equal, if two observations were made by the same sensor, they probably refer to different events. Thus, it is informative to include in each observation an integer in [1, q] identifying the sensor that recorded it. This ID data has the unusual property that the distance between two IDs is greatest when they are identical. Let such data be called contra-categorical. It is inherently non-metric.

The Minkowski distance function:
d = (∑|x-y|^L)^(1/L)

is usually metric. A very simple decomposition into d = m + n is available when the coordinates are classified by type: contra-categorical and binary asymmetric data are non-metric, while scalar, categorical, ordinal, and cyclic data are metric. Specifically, letting xₘ and yₘ denote the metric components of vectors x and y,
m = (∑|xₘ-yₘ|^L)^(1/L)
n = d - m
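
As an illustration of this split, the sketch below assumes a simple convention for the contra-categorical coordinate: a sensor-ID coordinate contributes a dissimilarity of 1 when the IDs are identical and 0 when they differ (the reverse of an ordinary categorical match). That convention, and all names, are assumptions made for the example, not part of the method:

    import numpy as np

    def minkowski_decomposition(x, y, is_metric, L=1.0):
        # Per-coordinate dissimilarities: |x_k - y_k| for metric coordinates;
        # for contra-categorical coordinates, 1 when the IDs MATCH, else 0.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        per_coord = np.where(is_metric, np.abs(x - y), (x == y).astype(float))
        d = np.sum(per_coord ** L) ** (1.0 / L)               # total distance
        m = np.sum(per_coord[is_metric] ** L) ** (1.0 / L)    # metric part only
        return d, m, d - m                                    # n = d - m

    # Two observations of (value, sensor ID); the same sensor yields a larger distance
    is_metric = np.array([True, False])
    print(minkowski_decomposition([0.3, 2], [0.5, 2], is_metric))   # same sensor
    print(minkowski_decomposition([0.3, 2], [0.5, 7], is_metric))   # different sensors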

A set of objects described by a matrix of non-metric distances cannot be embedded into Euclidean space. It can be embedded into a pseudo-Euclidean space, in which some components are imaginary numbers, so that the greater the difference in these components, the closer together two objects are. This is precisely the behavior of contra-categorical data. The pseudo-Euclidean representation is not as esoteric as it may seem at first. Rather, it is the natural representation of perfectly ordinary objects.

Discussion

The L1 codeviation is insensitive to extreme values. This makes it robust against outliers, but the insensitivity is not always an advantage. The following expressions are equal:
½(|x+y|-|x-y|)
max(min(x,y), min(-x,-y))
sign(xy) · min(|x|, |y|)

The last form makes the behavior explicit. If the signs of x and y are the same, the result is positive; if the signs differ, the result is negative. The magnitude is the lesser of |x| and |y|.

For example, these data sets have the same σ1:
{(1,1), (2,2), (3,3)} and
{(1,1), (2,2), (3,99)}.

This suggests that standardizing the data may be beneficial.
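
The example above is easy to verify with the sign·min form (illustrative Python):

    import numpy as np

    def codeviation_l1(x, y):
        xi = np.asarray(x, dtype=float) - np.mean(x)
        eta = np.asarray(y, dtype=float) - np.mean(y)
        return np.mean(np.sign(xi * eta) * np.minimum(np.abs(xi), np.abs(eta)))

    print(codeviation_l1([1, 2, 3], [1, 2, 3]))    # 0.666...
    print(codeviation_l1([1, 2, 3], [1, 2, 99]))   # 0.666...  (the extreme value is invisible)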

The L1½ generalized distance, as mentioned above, can be used as a drop-in replacement for the conventional Mahalanobis distance. This has led to improvements in clustering accuracy: in numerical experiments on the Iris data, the error rate was reduced from 4% to 2% by using the L1½ distance. It remains to be seen whether this improvement holds generally.

A true L1 generalized distance has not been obtained. Its discovery, or a disproof of its existence, remains a goal for future research. Another direction that has not been explored is to define sample statistics for a general Minkowski metric. The Minkowski average μ would be the quantity that minimizes the Minkowski variance σ_L,
σ_L = (1/N) ∑|x - μ|^L

No progress has been made towards this goal.
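
Although no closed-form expression is available, the defining minimization can at least be carried out numerically; the sketch below, using SciPy's bounded scalar minimizer, is only an illustration of the definition (L = 1 recovers a median, L = 2 the mean):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def minkowski_average(x, L):
        # mu minimizing (1/N) * sum(|x - mu|**L)
        x = np.asarray(x, dtype=float)
        objective = lambda mu: np.mean(np.abs(x - mu) ** L)
        result = minimize_scalar(objective, bounds=(x.min(), x.max()), method='bounded')
        return result.x

    x = [1.0, 2.0, 3.0, 50.0]
    print(minkowski_average(x, 2))   # close to the mean, 14.0
    print(minkowski_average(x, 1))   # close to a median, between 2 and 3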

Improper metric decompositions are valuable aids to building binary tree data structures. If the distance function is metric, the triangle inequality may be used to set bounds on which partitions must be searched to find the nearest neighbors of a query point. If a decomposition of the distance function is available, heuristics for searching the tree can be derived that are faster than exhaustive search and more accurate than ignoring the metric violations. Decompositions of important non-metric distance functions, such as the edit distance, would be valuable. No general method for obtaining such decompositions is known at present.

Conclusion

The L1-norm codeviation yields several robust measures of immediate applicability in statistics and pattern recognition. It has a surprisingly close relationship to fuzzy logic. It also provides insight into what causes non-metric distance functions to violate the triangle inequality.

References

  1. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
  2. G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, 1995. cited in: [7]
  3. M. Eisen, P. Spellman, P. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences, USA, v. 95, pp. 14863-14868, 1998. cited in: [7]
  4. Peter J. Rousseeuw and Katrien Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Revised version, 15 December 1998.
  5. A. Bar-Hillel, T. Hertz, H. Shental, and D. Weinshall, "Learning distance functions using equivalence relation," International Conference on Machine Learning, 2003.
  6. Julian Laub, Non-Metric Pairwise Proximity Data. Technical University of Berlin, 10 December 2004.
  7. Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Third Edition, Academic Press/Elsevier, 2006.
  8. Michiel de Hoon, Seiya Imoto, and Satoru Miyano, Users' manual for The C Clustering Library for cDNA Microarray Data, University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010. (Here the correlation distance is defined as 1 - r.)