# A direct method for L1-norm codeviation

A.J. Allinger
Design Service Corporation
May 2014, revised December 2018

Abstract:   The statistic σ1 = ∑ max(min(x, y), min(-x, -y)) is proposed as an analog to the usual covariance. The L1 statistic is shown to have the same relation to its L2 counterparts as fuzzy logic does to probability theory. Distance functions based on correlation are shown to be a sum of metric and non-metric parts, that may be interpreted as the sum of disagreement and agreement, respectively, of two objects. Applications in data clustering are discussed.

## Background

A fundamental task in statistics is to seek relations between variables as a means of explaining the world. For this purpose, the Pearson correlation has been long universally applied in search of a relation between two variables. The result is a number r between -1 and 1, which is positive when an increase in one tends to be accompanied by an increase in the other, and negative when an increase in one tends to be accompanied by a decrease in the other.

A closely related task is to quantify the dissimilarity between entities rather than variables. In a 1936 article, P.C. Mahalanobis introduced a generalized statistical distance. Given a sample of measurements of p variables for each of n objects, each pair of objects is characterized by a nondimensional distance. The Mahalanobis distance has often been used as a measure of dissimilarity between objects in a sample. The building block of both of these statistics is the covariance,

sxy = (1/N) ∑ xy = (1/N) ∑ (x - μ)(y - ν)

where μ and ν are the means of x and y, respectively.

A weakness of the standard covariance (and L2 statistics in general) is their sensitivity to outliers. A few extreme objects have an disproportionate effect on the sample mean and variance. In contrast, the L1 statistics median and mean absolute deviation, are comparatively robust against these effects. For example, increasing the largest observation by a arbitrary factor of f has no effect at all on the median. The effect of an extreme measurement on the mean absolute deviation is proportional to its difference from the average, but its effect on the variance is proportional to the square of the difference. If the outliers are spurious, the estimate of the standard deviation may be severely corrupted.

Previous work in developing a robust generalized distance has focused on choosing a subset of reliable observations.5 Since the number of possible non-empty subsets of n observations is 2n-1, there is exponential growth in the the number of possibilities to consider, and a consequently high computational cost. Perhaps more importantly, choosing to banish an observation from the data set necessarily discards possibly valid information. Silently weighing the analyst's prior assumptions about what sort of distribution the data should have sacrifices scientific objectivity.

An L1 variant of covariance is thus desirable for at least two reasons: it achieves a more robust estimate without willfully discarding information, and it extends a correspondence with L2 statistics.

## Argument from the standard covariance

The Euclidean distance, given by the Pythagorean theorem
d2 = √∑(x-y)²

has its counterpart in the Manhattan distance
d1 = ∑|x-y|

Likewise, the standard deviation
σ2 = √[(1/N) ∑(x-μ)²]

has its L1 counterpart in the mean absolute deviation:
σ1 = (1/N) ∑|x-μ|

L1 analogs of covariance and correlation will now be introduced. Consider the familiar covariance
(1/N) ∑xy = (1/N) ∑[(x-μ)(y-ν)]

The expression xy is positive when x and y have the same sign, and negative when they have different signs. Employing the identities
(a+b)² = a² + 2ab + b²
(a-b)² = a² - 2ab + b²
∴ 4ab = (a+b)² - (a-b)²

factor:
xy = ¼[(x+y)² - (x-y)²]

Notice that the covariance is the difference of an agreement and a disagreement. Extending the analogy of mean absolute deviation to standard deviation, define:
σ1 = (1/2N) ∑(|x+y| - |x-y|)

as a new measure of the relationship between x and y. The factor 1/2 ensures that for x = y, the expression has the nice property that it reduces to mean absolute deviation. Thus, the mean absolute deviation is a special case of the L1 codeviation, just as the L2 variance is a special case of the L2 covariance.

### Correlation and Generalized Distance

The correlation coefficient is
r2 = ∑ xy / √[(∑x²)(∑y²)]

By analogy, define a Manhattan correlation
r1 = ∑ (|x+y| - |x-y|) / ∑(|x| + |y|)

For p variables, there is a p×p covariance matrix S. The squared Mahalanobis distance, neglecting a factor of det(S), is
d2 = xi(S-1)ijxj

An expression for generalized L1 statistical distance is
d1 = ∑|(S-1)ijxj|

## Argument from fuzzy logic

In their work on fuzzy logic2, G. Klir and B. Yuan observed that when the variables are the Boolean {0,1}, then logical AND is equivalent to the minimum, and logical OR is equivalent to the maximum. Theodoridis and Koutroumbas proposed a fuzzy distance7
d = max(min(x, 1-y), min(1-x, y))

To obtain the Manhattan codeviation, begin with an expression for logical equivalence:
x ⇔ y = (x ∧ y) ∨ (¬x ∧ ¬y)
= max(min(x, y), min(1-x, 1-y))

Since this article is mostly concerned with scalar data, switch from logical variables to scalar variables. In doing so, the effects of the min and max operations are unchanged. For negation, values must be reflected across 0 instead of ½, thus ¬x becomes -x rather than 1-x, resulting in:
x ⇔ y = max(min(x, y), min(-x, -y))

### Equivalence of these approaches

Compare the expression derived from covariance with the expression derived from fuzzy logic; they are identical.
½(|x+y| - |x-y|) ≡ max(min(x, y), min(-x, -y))

This demonstrates a remarkable relationship between fuzzy logic and L1 statistics. Two distinct approaches have led to the same formula. The first approach was an L1 modification of L2 statistics. The second approach was a fuzzy logic modification of calculus of probabilities. It may be concluded that fuzzy logic is related to probability theory in the same way that L1 statistics are related to L2 statistics.

For the purposes of computation, the min/max formula is to be preferred, since it is not susceptible to subtractive cancellation.

## Improper Metric Decompositions

The expression for L1 codeviation yields insights into the metric and non-metric properties of distance functions. Seek a decomposition d = m + n of distance into a metric part and a non-metric part. Then for triples of objects u, v, w drawn from some data set, the triangle inequality holds for the metric part of distance
mw ≤ mu + mv

but may or may not be violated in the non-metric part. Such a decomposition always exists, but is not unique.6

Two important distance functions are the cosine distance and the correlation distance. The difference between them is that the correlation distance operates on the difference from the average, and the cosine distance operates on the original data. The cosine distance often is used to compare documents represented by a list of words and the frequency with which each word occurs (the "bag of words" format). The correlation distance has been used to cluster gene expression data.4

Suppose that the data vectors have been normalized, by dividing each by
√(∑x²)

for the L2 case, or by
∑|x|

for the L1 case. The correlation is
r2 = (x · y) / √(∑x²∑y²)

The correlation distance is preferably arccos(r), which is a metric distance. Elsewhere, the correlation distance has been defined as:7, 8
d = (1-r) / 2

L2 normalizing the data makes √∑x²∑y² equal one. Substituting ¼[(x+y)² - (x-y)²] for xy yields
d2 = [∑(x-y)² / 8] + [½ - ∑(x+y)² / 8]

which is a metric decomposition, as verified by numerical simulations testing for triangle inequality violations. For the L1 case,
r1 = ∑(|x+y| - |x-y|) / ∑(|x| + |y|)

L1 normalizing the data makes the denominator equal two, thus:
d1 = [∑|x-y| / 4] + [½ - ∑|x+y| / 4]

which is a metric decomposition. The disagreement term (x-y) is in the metric part of distance, and the agreement term (x+y) is in the non-metric part of distance, as would be expected by intuition.

## Contra-categoricals in clustering

An example of inherently non-metric data occurs when trying to synthesize a set of observations. Some number n of events have occurred, which have been observed by some number q of sensors. Each sensor tries to give a single report of each event, but there may be mix-ups. A sensor may miss an event, or it may double-count an event.

Other things being equal, if two observations were made by the same sensor, they probably refer to different events. Thus, it is informative to include in the observation an integer [1,q] to identify the sensor that recorded it. This ID data has the unusual property that the distance is greatest between two ID's when they are identical. Let such data be called contra-categorical. It is inherently non-metric.

The Minkowski distance function:
d = (∑(x-y)L)(1/L)

is usually metric. A very simple decomposition is available into d = m + n. Contra-categorical and binary asymmetric data are non-metric. Scalar, categorical, ordinal, or cyclic data are metric. Specifically, letting xm and ym denote the metric components of vectors x and y,
m = (∑(xm-ym)L)(1/L)
n = d - m

A set of objects described by a matrix of non-metric distances cannot be embedded into Euclidean space. It can be embedded into a pseudo-Euclidean space, in which some components are imaginary numbers, so that the greater the difference in these components, the closer together two objects are. This is precisely the behavior of contra-categorical data. The pseudo-Euclidean representation is not esoteric as it may seem at first. Rather, it is the natural representation of perfectly ordinary objects.

## Classification Experiment

The L1 generalized distance can be used as a replacement for the conventional Mahalanobis distance. Traditional linear discriminant analysis (LDA) was compared to the L1 variant in classifying 19 data sets from the UCI repository1 and one other. The results are disappointing, showing that the L1 variant is usually quite similar to traditional LDA and generally inferior. The table below result lists the results for classification accuracy, and the difference. The 95% confidence interval was computed by a paired Student's t test, according to the procedure described by Mitchell.3 10-fold cross-validation was used, except as noted.

 Data set # variables # objects # classes L2 L1 dif interval note iris 4 150 3 0.980 0.980 0.000 0.000 * in 5-fold cross-validation balance-scale 4 563 3 0.851 0.812 0.039 0.143 banknote_authentication 4 1372 2 0.976 0.966 0.010 0.000 wilt 5 4839 2 0.943 0.939 0.003 0.000 Wholesale customers 6 440 2 0.848 0.900 -0.052 0.036 * ignoring the nominal variable "region" seed 7 210 3 0.962 0.957 0.005 0.049 * in 5-fold cross-validation E. coli 7 336 8 0.881 0.858 0.024 0.026 pima-indians-diabetes 8 768 2 0.778 0.743 0.035 0.031 yeast 8 1484 10 0.587 0.577 0.010 0.018 fertility 9 100 2 0.830 0.800 0.030 0.075 * in 3-fold cross-validation glass 9 214 6 0.631 0.594 0.037 0.073 * in 5-fold cross-validation, omitting 1 empty class breast-cancer-wisconsin 9 783 2 0.960 0.940 0.020 0.013 * 16 missing data deleted hockey 10 796 2 0.732 0.737 −0.005 0.086 * new data set wine 13 178 3 0.989 0.989 0.000 0.000 * in 5-fold cross-validation leaf 14 340 30 0.812 0.756 0.056 0.045 * 30 non-empty classes. Not to be confused with 'One-hundred species plant leaves' or 'folio' data sets parkinsons 22 195 2 0.872 0.662 0.210 0.199 * in 5-fold cross-validation ionosphere 34 351 2 0.877 0.795 0.083 0.050 Landsat satellite 36 6435 6 0.838 0.804 0.034 0.007 musk 166 6598 2 0.945 0.811 0.134 0.000 isolet 617 7797 26 0.944 0.446 0.499 0.156 average 0.862 0.803 0.059

## Discussion

The L1 codeviation seems to be chiefly of theoretical interest. It is insensitive to extreme values. This makes it robust against outliers, but it is not always an advantage. These measures are equal:
½(|x+y|-|x-y|)
max(min(x,y), min(-x,-y))
sign(xy) · min(|x|, |y|)

The last form makes the behavior explicit. If the signs of x and y are the same, the result is positive, and if the signs of x and y are different, the result is negative. The magnitude is the lesser of x and y.

For example, these data sets have the same σ1
{(1,1), (2,2), (3,3)} and
{(1,1), (2,2), (3,99)}.

Which suggests that standardizing the data may be beneficial.

Improper metric decompositions are valuable aids to building binary tree data structures. If the distance function is metric, the triangle inequality may be used to set bounds for which partitions must be searched to find the nearest neighbors of a query point. If a decomposition of the distance function is available, heuristics for searching the tree can be derived that are faster than exhaustive search, and more accurate than ignoring the metric violations. Decompositions for important non-metric distance functions such as the Edit Distance would be important developments. No general method for obtaining such decompositions is known at the present.

## Conclusion

The L1-norm codeviation is a robust statistic, but is not epecially useful. It has a surprising close relationship to fuzzy logic. It also provides insight into what causes non-metric distance functions to violate the triangle inequality.

## References

1. http://archive.ics.uci.edu/ml/datasets.html
2. G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, 1995.
3. T. M. Mitchell, Machine Learning, New Delhi: McGraw Hill Education, 2013. 
4. M. Eisen, P. Spellman, P. Brown, D. Botstein, "Cluster Analysis and display of genome-wide expression data," Proceedings of National Academy of Science, USA v.95 pp.14863-14868, 1998. cited in: 
5. Peter J. Rousseeuw and Katrien Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Revised version, 15 December 1998.
6. Julian Laub, Non-Metric Pairwise Proximity Data. Technical University of Berlin, 10 December 2004.
7. Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Third Edition, Academic Press/Elsevier, 2006.
8. Here the correlation distance is defined as 1-r.
Michiel de Hoon, Seiya Imoto, and Satoru Miyano. Users' manual for: The C Clustering Library for cDNA Microarray Data University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010.