**Abstract:**
The statistic
σ_{1} = (1/N) ∑ max(min(ξ, η), min(-ξ, -η))
is proposed as a robust analog to the usual covariance, leading
to L1 analogs of correlation and statistical distance. The L1 statistics are
shown to have the same relation to their L2 counterparts as fuzzy logic does
to probability theory. Distance functions based on correlation are shown to be
a sum of metric and non-metric parts, which may be interpreted as the sum
of disagreement and agreement, respectively, of two objects. Applications in
data clustering are discussed.

A fundamental task in statistics is to seek relations between
variables as a means of explaining the world. For this purpose, the
Pearson correlation has long been universally applied in search of a
relation between two variables. The result is a number *r* between
-1 and 1, which is positive when an increase in one tends to be
accompanied by an increase in the other, and negative when an increase
in one tends to be accompanied by a decrease in the other.

A closely related task is to quantify the dissimilarity between
entities rather than variables. In a 1936 article, P.C. Mahalanobis
introduced a generalized statistical distance. Given a sample of
measurements of *p* variables for each of *n* objects,
each pair of objects is characterized by a nondimensional distance. The
Mahalanobis distance has often been used as a measure of dissimilarity
between objects in a sample. Compared to the ordinary Euclidean
distance, the Mahalanobis distance gives superior performance on
problems in data clustering and classification.^{5} The
building block of both of these statistics is the covariance,

s_{xy} = (1/N) ∑ ξη = (1/N) ∑ (x - μ)(y - ν)

where ξ = x - μ and η = y - ν are the deviations from the means μ and ν of x and y, respectively.

A weakness of the standard covariance (and of L2 statistics in general)
is its sensitivity to outliers. A few extreme observations have a
disproportionate effect on the sample mean and variance. In contrast,
the L1 statistics, the median and the mean absolute deviation, are
comparatively robust against these effects. For example, increasing the
largest observation by an arbitrary factor *f* has no effect at all on
the median. The effect of an extreme measurement on the mean absolute
deviation is proportional to its difference from the average, but its
effect on the variance is proportional to the *square* of the
difference. If the outliers are spurious, the estimate of the standard
deviation may be severely corrupted.

Previous work in developing a robust generalized distance has focused
on choosing a subset of reliable observations.^{4} Since the
number of possible non-empty subsets of *n* observations is
2^{n}-1, there is exponential growth in the number of
possibilities to consider, and a consequently high computational cost.
Perhaps more importantly, banishing an observation from the data set
discards possibly valid information, and silently weighting the data
according to the analyst's prior assumptions about what distribution it
should follow sacrifices scientific objectivity.

An L1 variant of covariance is thus desirable for at least two
reasons: it achieves a more robust estimate without willfully discarding
information, and it extends a correspondence with L2 statistics.

The Euclidean distance, given by the Pythagorean theorem

d_{2} = √∑(x-y)²

has its counterpart in the Manhattan distance

d_{1} = ∑|x-y|

Likewise, the standard deviation

σ_{2} = √[(1/N) ∑(x-μ)²]

has its L1 counterpart in the mean absolute deviation:

σ_{1} = (1/N) ∑|x-μ|
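
As a concrete illustration, the following Python sketch computes these paired L1/L2 statistics and shows the robustness discussed above: corrupting a single observation inflates the standard deviation far more than the mean absolute deviation. The function names and test values are arbitrary.

```python
import numpy as np

def euclidean(x, y):
    """L2 (Euclidean) distance d_2."""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """L1 (Manhattan) distance d_1."""
    return np.sum(np.abs(x - y))

def std_dev(x):
    """L2 scale: standard deviation about the mean."""
    return np.sqrt(np.mean((x - np.mean(x)) ** 2))

def mean_abs_dev(x):
    """L1 scale: mean absolute deviation about the mean."""
    return np.mean(np.abs(x - np.mean(x)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.0, 4.0, 3.0, 6.0])
print(euclidean(x, y), manhattan(x, y))

x_bad = x.copy()
x_bad[-1] *= 100.0   # corrupt the largest observation by a factor of 100

# The standard deviation is inflated far more than the mean absolute
# deviation, because the outlier enters the former through its square.
print(std_dev(x), std_dev(x_bad))
print(mean_abs_dev(x), mean_abs_dev(x_bad))
```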

L1 analogs of covariance and correlation will now be introduced.
Consider the familiar covariance

(1/N) ∑ξη = (1/N) ∑[(x-μ)(y-ν)]

The expression ξη is positive when ξ and η have the same
sign, and negative when they have different signs. Employing the
identities

(a+b)² = a² + 2ab + b²

(a-b)² = a² - 2ab + b²

∴ 4ab = (a+b)² - (a-b)²

one may write:

ξη = ¼[(ξ+η)² - (ξ-η)²]

Notice that the covariance is the difference of an agreement term and a
disagreement term. Extending the analogy between the mean absolute
deviation and the standard deviation, define:

σ_{1} = (1/2N) ∑(|ξ+η| - |ξ-η|)

as a new measure of the relationship between x and y. The factor of ½
ensures that for x = y the expression reduces to the mean absolute
deviation. Thus, the mean absolute deviation is
a special case of the L1 codeviation, just as the L2 variance is a special
case of the L2 covariance.
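
A minimal sketch of the L1 codeviation as just defined, with the deviations ξ and η taken from the sample means; the final assertion checks the reduction to the mean absolute deviation when x = y. The function names are illustrative only.

```python
import numpy as np

def codeviation_l1(x, y):
    """(1/2N) * sum(|xi + eta| - |xi - eta|), with xi and eta the
    deviations of x and y from their respective means."""
    xi = x - np.mean(x)
    eta = y - np.mean(y)
    return np.mean(np.abs(xi + eta) - np.abs(xi - eta)) / 2.0

def mean_abs_dev(x):
    return np.mean(np.abs(x - np.mean(x)))

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
# For x = y, the L1 codeviation reduces to the mean absolute deviation.
assert np.isclose(codeviation_l1(x, x), mean_abs_dev(x))
```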

The familiar Pearson correlation is

r_{2} = ∑ξη / √(∑ξ²∑η²)

By analogy, define a Manhattan correlation

r_{1} = ∑(|ξ+η| - |ξ-η|) / ∑(|ξ| + |η|)
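
The sketch below places the two coefficients side by side. The denominator ∑(|ξ| + |η|) used for r_1 here follows the L1-normalized expression given later in the text; it should be read as a plausible normalization rather than a definitive formula.

```python
import numpy as np

def pearson_r(x, y):
    xi, eta = x - np.mean(x), y - np.mean(y)
    return np.sum(xi * eta) / np.sqrt(np.sum(xi ** 2) * np.sum(eta ** 2))

def manhattan_r(x, y):
    # Assumed normalization: sum(|xi| + |eta|) in the denominator,
    # which gives r_1 = 1 when x = y.
    xi, eta = x - np.mean(x), y - np.mean(y)
    num = np.sum(np.abs(xi + eta) - np.abs(xi - eta))
    return num / np.sum(np.abs(xi) + np.abs(eta))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.1])
print(pearson_r(x, y), manhattan_r(x, y))   # both close to +1
```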

For p variables, there is a p×p covariance matrix S. The squared
Mahalanobis distance, neglecting a factor of det(S), is

d_{2}² = ∑_{i,j} ξ_{i}P_{ij}ξ_{j}

where P = S^{-1}. Writing the quantity in summation notation
makes explicit the interaction between the variables. The
ξ_{i}ξ_{j} interaction is essential, but a true L1
measure would not contain multiplication. An expression for true
generalized L1 statistical distance has not been obtained. However, it
is possible to use the L1 codeviation matrix in place of the standard
covariance matrix. Such an L1½ metric has led to improved
performance on benchmark Iris and Wine problems from the UCI archive.^{1}
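
The following sketch shows one way the codeviation matrix could stand in for the covariance matrix in the generalized distance. The object pairing, the matrix inverse, and the synthetic data are illustrative assumptions; this is not the implementation behind the cited benchmark results.

```python
import numpy as np

def codeviation_matrix(X):
    """p x p matrix of pairwise L1 codeviations of the columns of X (n x p)."""
    D = X - X.mean(axis=0)              # deviations from the column means
    p = D.shape[1]
    S1 = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            S1[i, j] = np.mean(np.abs(D[:, i] + D[:, j])
                               - np.abs(D[:, i] - D[:, j])) / 2.0
    return S1

def generalized_distance(a, b, P):
    """Mahalanobis-style distance between objects a and b, with P the
    inverse of a covariance-like matrix (assumed positive definite)."""
    d = a - b
    return np.sqrt(d @ P @ d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
P_l1 = np.linalg.inv(codeviation_matrix(X))      # the L1½ variant
P_l2 = np.linalg.inv(np.cov(X, rowvar=False))    # conventional Mahalanobis
print(generalized_distance(X[0], X[1], P_l1))
print(generalized_distance(X[0], X[1], P_l2))
```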

The connection to fuzzy logic arises from the calculus of
probabilities. Consider the exclusive disjunction of two events,

x ⊻ y = (x ∧ ~y) ∨ (~x ∧ y)

For independent events with probabilities x and y, the logical distance
is the probability that exactly one of them occurs:

d = x(1-y) + (1-x)y = x + y - 2xy

In their work on fuzzy logic, Klir and Yuan^{2} take the conjunction to be min, the disjunction to be max, and the negation of x to be 1-x, so the logical distance becomes

d = max(min(x, 1-y), min(1-x, y))

To obtain the Manhattan codeviation, begin with an expression for logical equivalence:

x ⇔ y = (x ∧ y) ∨ (~x ∧ ~y)

= max(min(x, y), min(1-x, 1-y))

Since this article is mostly concerned with scalar data, switch from
logical variables to scalar variables. In doing so, the effects of the
min and max operations are unchanged. For negation, values must be
reflected across 0 instead of ½, thus ~x becomes -x rather
than 1-x, resulting in:

x ⇔ y = max(min(x, y), min(-x, -y))

This is precisely the summand of the L1 codeviation, by virtue of the identity

½(|ξ+η| - |ξ-η|) ≡ max(min(ξ, η), min(-ξ, -η))

This demonstrates a remarkable relationship between fuzzy logic and L1 statistics. Two distinct approaches have led to the same formula. The first was an L1 modification of L2 statistics; the second was a fuzzy logic modification of the calculus of probabilities. It may be concluded that fuzzy logic is related to probability theory in the same way that L1 statistics are related to L2 statistics.

For the purposes of computation, the min/max formula is to be preferred,
since it is not susceptible to subtractive cancellation.
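
A quick numerical check of the identity, with arbitrary test data; the min/max form involves no subtraction of nearly equal quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=10_000)
eta = rng.normal(size=10_000)

abs_form = 0.5 * (np.abs(xi + eta) - np.abs(xi - eta))
minmax_form = np.maximum(np.minimum(xi, eta), np.minimum(-xi, -eta))

# The two forms agree to floating-point precision.
assert np.allclose(abs_form, minmax_form)
```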

The expression for L1 codeviation yields insights into the metric and
non-metric properties of distance functions. Seek a decomposition d = m
+ n of distance into a metric part and a non-metric part. Then for
triples of objects u, v, w drawn from some data set, the triangle
inequality holds for the metric part of distance

m(u,w) ≤ m(u,v) + m(v,w)

but may or may not be violated in the non-metric part. Such a decomposition
always exists, but is not unique.^{6}

Two important distance functions are the cosine distance and the
correlation distance. The difference between them is that the
correlation distance operates on the difference from the average, and
the cosine distance operates on the original data. The cosine distance
is often used to compare documents represented by a list of words and
the frequency with which each word occurs (the "bag of words" format).
The correlation distance has been used to cluster gene expression
data.^{3}

Suppose that the data vectors have been normalized, by dividing each by

√(∑x²)

for the L2 case, or by

∑|x|

for the L1 case. The correlation is

r_{2} = (x · y) / √(∑x²∑y²)

The correlation distance is preferably taken to be arccos(r), which is a
metric distance. Elsewhere, the correlation distance has been defined
as:^{7, 8}

d = (1-r) / 2

L2 normalizing the data makes √∑x²∑y² equal
one. Substituting ¼[(x+y)² - (x-y)²] for xy yields

d_{2} = [∑(x-y)² / 8] + [½ - ∑(x+y)² / 8]

which is a metric decomposition, as verified by numerical simulations testing
for triangle inequality violations. For the L1 case,

r_{1} = ∑(|x+y| - |x-y|) / ∑(|x| + |y|)

L1 normalizing the data makes the denominator equal two, thus:

d_{1} = [∑|x-y| / 4] + [½ - ∑|x+y| / 4]

which is a metric decomposition. The disagreement term
(x-y) is in the metric part of distance, and the agreement term (x+y) is
in the non-metric part of distance, as would be expected by intuition.
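
To make the decomposition concrete, the sketch below computes the metric part m and the non-metric part n of d_1 for L1-normalized vectors, confirms that m + n equals (1-r_1)/2, and spot-checks the triangle inequality on m over random triples. It mirrors the kind of numerical simulation referred to above rather than reproducing it.

```python
import numpy as np

def l1_normalize(x):
    return x / np.sum(np.abs(x))

def metric_part(x, y):          # the "disagreement" term
    return np.sum(np.abs(x - y)) / 4.0

def nonmetric_part(x, y):       # the "agreement" term
    return 0.5 - np.sum(np.abs(x + y)) / 4.0

rng = np.random.default_rng(2)
for _ in range(1000):
    u = l1_normalize(rng.normal(size=5))
    v = l1_normalize(rng.normal(size=5))
    w = l1_normalize(rng.normal(size=5))

    # d_1 = m + n reproduces the correlation distance (1 - r_1)/2,
    # where r_1 has denominator 2 for L1-normalized data.
    r1 = np.sum(np.abs(u + v) - np.abs(u - v)) / 2.0
    assert np.isclose(metric_part(u, v) + nonmetric_part(u, v), (1.0 - r1) / 2.0)

    # The metric part obeys the triangle inequality.
    assert metric_part(u, w) <= metric_part(u, v) + metric_part(v, w) + 1e-12
```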

An example of inherently non-metric data occurs when trying to
synthesize a set of observations. Some number *n* of events have
occurred, which have been observed by some number *q* of sensors.
Each sensor tries to give a single report of each event, but there may
be mix-ups. A sensor may miss an event, or it may double-count an
event.

Other things being equal, if two observations were made by the same
sensor, they probably refer to different events. Thus, it is
informative to include in the observation an integer in [1, q] to
identify the sensor that recorded it. This ID data has the unusual
property that the distance between two IDs is greatest when they are identical.
Let such data be called **contra-categorical**. It is
inherently non-metric.

The Minkowski distance function:

d = (∑|x-y|^{L})^{(1/L)}

is usually metric. A very simple decomposition is available into d = m
+ n. Contra-categorical and binary asymmetric data are non-metric.
Scalar, categorical, ordinal, or cyclic data are metric. Specifically,
letting x_{m} and y_{m} denote the metric components of
vectors x and y,

m = (∑|x_{m}-y_{m}|^{L})^{(1/L)}

n = d - m
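
Assuming the per-component differences between two objects have already been formed (with the non-metric components scored by whatever convention applies to them), the split into m and n is mechanical, as sketched below; the boolean mask marking which components are metric is a hypothetical input.

```python
import numpy as np

def minkowski_decomposition(diffs, metric_mask, L=2):
    """Split the Minkowski distance into a metric part m (from the
    metric components only) and a non-metric remainder n = d - m."""
    d = np.sum(np.abs(diffs) ** L) ** (1.0 / L)
    m = np.sum(np.abs(diffs[metric_mask]) ** L) ** (1.0 / L)
    return m, d - m

diffs = np.array([0.5, -1.2, 2.0, 0.3])             # component differences
metric_mask = np.array([True, True, False, True])   # third component non-metric
m, n = minkowski_decomposition(diffs, metric_mask)
print(m, n)
```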

A set of objects described by a matrix of non-metric distances cannot
be embedded into Euclidean space. It can be embedded into a
pseudo-Euclidean space, in which some components are imaginary numbers,
so that the greater the difference in these components, the closer
together two objects are. This is precisely the behavior of
contra-categorical data. The pseudo-Euclidean representation is not as
esoteric as it may seem at first. Rather, it is the natural
representation of perfectly ordinary objects.

The L1 codeviation is insensitive to extreme values. This makes it
robust against outliers, but this insensitivity is not always an
advantage. The following expressions are equal:

½(|x+y|-|x-y|)

max(min(x,y), min(-x,-y))

sign(xy) · min(|x|, |y|)

The last form makes the behavior explicit. If the signs of x
and y are the same, the result is positive, and if the signs of x and y
are different, the result is negative. The magnitude is the lesser of
|x| and |y|.

For example, these data sets have the same σ_{1}:

{(1,1), (2,2), (3,3)} and

{(1,1), (2,2), (3,99)}.

This suggests that standardizing the data may be beneficial.
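
The insensitivity in this example is easy to verify with the sign·min form; the deviations are taken from the sample means, and both data sets yield the same value because the wild observation never supplies the smaller magnitude.

```python
import numpy as np

def codeviation_l1(x, y):
    xi, eta = x - np.mean(x), y - np.mean(y)
    # sign(xi * eta) * min(|xi|, |eta|), averaged over the sample
    return np.mean(np.sign(xi * eta) * np.minimum(np.abs(xi), np.abs(eta)))

x = np.array([1.0, 2.0, 3.0])
print(codeviation_l1(x, np.array([1.0, 2.0, 3.0])))    # 0.666...
print(codeviation_l1(x, np.array([1.0, 2.0, 99.0])))   # 0.666... as well
```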

The L1½ generalized distance, as mentioned above, can be used as a drop-in replacement for the conventional Mahalanobis distance. This has led to improvements in clustering accuracy. In numerical experiments on the Iris data, the error rate was reduced from 4% to 2% by using the L1½ distance. It remains to be seen if this improvement holds generally.

A true L1 generalized distance has not been obtained. Its discovery,
or a disproof of existence, remains as a goal for future research.
Another direction which has not been explored is to define sample
statistics for a general Minkowski metric. The Minkowski average μ
would be the quantity that minimizes the Minkowski variance σ

σ^{L} = (1/N) ∑|x - μ|^{L}

No progress has been made towards this goal.
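
One elementary way to experiment with this idea is a brute-force scan over candidate values of μ for a given exponent L, as sketched below; the grid resolution and test data are arbitrary. For L = 2 the minimizer is (approximately) the mean, and for L = 1 it is (approximately) the median, consistent with the L2/L1 correspondence above.

```python
import numpy as np

def minkowski_average(x, L):
    """The value mu minimizing (1/N) * sum(|x - mu|^L), found by grid
    search; for L >= 1 the minimizer lies between min(x) and max(x)."""
    grid = np.linspace(x.min(), x.max(), 100_001)
    cost = np.mean(np.abs(x[:, None] - grid[None, :]) ** L, axis=0)
    return grid[np.argmin(cost)]

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(minkowski_average(x, 2), np.mean(x))     # both about 22.0
print(minkowski_average(x, 1), np.median(x))   # both about 3.0
```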

Metric decompositions of improper (non-metric) distance functions are
valuable aids when building binary tree data structures. If the distance
function is metric, the triangle
inequality may be used to set bounds for which partitions must be
searched to find the nearest neighbors of a query point. If a
decomposition of the distance function is available, heuristics for
searching the tree can be derived that are faster than exhaustive
search, and more accurate than ignoring the metric violations.
Decompositions of important non-metric distance functions, such as the
edit distance, would be valuable developments. No general method for
obtaining such decompositions is known at present.

**References:**

1. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
2. G. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, 1995. Cited in [7].
3. M. Eisen, P. Spellman, P. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression data," Proceedings of the National Academy of Sciences, USA, v. 95, pp. 14863-14868, 1998. Cited in [7].
4. P. J. Rousseeuw and K. Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," revised version, 15 December 1998.
5. A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions Using Equivalence Relations," International Conference on Machine Learning, 2003.
6. J. Laub, Non-Metric Pairwise Proximity Data, Technical University of Berlin, 10 December 2004.
7. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Third Edition, Academic Press/Elsevier, 2006.
8. M. de Hoon, S. Imoto, and S. Miyano, Users' manual for The C Clustering Library for cDNA Microarray Data, University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010. (Here the correlation distance is defined as 1-r.)