The ‘Copula’

This article sprang the term ‘Gaussian copula’ on me. Some quick Googling led me to some formula-laden Wikipedia pages that gave a good description of notions, some of which I already had and some of which were new to me. I had not known what they were called. The Wikipedia article is short and notation-ridden, and I aspire here to give a less formal introduction.

The first necessary notion is the probability integral transform, which is useful in its own right, so I explore it first. Suppose you have the heights of the students of some high school. You can compute several statistics from this information, but we propose here instead to sort the heights into a list and assign indexes to them from smallest height to largest. These indexes are numbers from 1 to the population, which is the number of students in the school. We divide each index by the population and get numbers between 0 and 1. We now have a tabulated function, f, from the numbers between 0 and 1 to the heights of the students. f(½) is the median height of the students. f(4/5) is, to within one student, the height of the shortest student in the tallest quintile of heights. It is important to realize that f is not a linear function of its argument, but it is monotonic: x ≤ y → f(x) ≤ f(y). It may help to visualize the student body lined up on a stage by height, each student occupying the same horizontal space, with a large horizontal ruler running from 0 at the left end to 1 at the right end. This image constitutes a graph of f. Sometimes the inverse of f is useful, and since f is monotonic the inverse is well defined, apart from ties between students of exactly the same height: 0 ≤ f⁻¹(h) ≤ 1. There are about (population)·f⁻¹(5 ft) students whose height is less than five feet. In the statistician's vocabulary f is the quantile function and f⁻¹ is the cumulative distribution function.
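Here is a minimal sketch of the tabulated f and its inverse. The names make_table, f and f_inverse and the ten sample heights (in inches) are my own inventions for illustration, and rounding the argument up to the next index is just one reasonable convention.

```python
import math
from bisect import bisect_left

def make_table(heights):
    """Sort the heights; the student with index i (1-based) sits at argument i/n."""
    return sorted(heights)

def f(table, u):
    """Tabulated f: the height found at position u (0 < u <= 1) of the line-up."""
    n = len(table)
    # Round u*n up to the next index; the tiny slack guards against
    # floating-point noise when u*n should be a whole number.
    i = max(1, math.ceil(u * n - 1e-9))
    return table[i - 1]

def f_inverse(table, h):
    """Fraction of the students who are strictly shorter than h."""
    return bisect_left(table, h) / len(table)

# Made-up heights in inches, purely for illustration.
table = make_table([62, 70, 58, 66, 64, 68, 60, 72, 65, 67])
median_height = f(table, 0.5)                         # 65 with this convention
shorter_than_5ft = len(table) * f_inverse(table, 60)  # 1.0: only the 58-inch student
```

With an even population the true median lies between two students; this convention picks the lower of the two.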

The Copula

Suppose we have also collected the weights of the students. We could sort the students by weight, but they would generally appear in a different order than in the ordering by height. We could form a function g from the numbers from 0 to 1 to the weights where, for instance, g(½) is the median weight. Each student then carries two numbers between 0 and 1: f⁻¹ of his height and g⁻¹ of his weight.

The copula might be called a cumulative distribution in a box. Each component of the random vector has been replaced by another that runs uniformly from 0 to 1. Each individual stat, which had been distributed over some natural range, has been replaced by a normalized stat ranging from 0 to 1. If we visualize the original population, distributed in n-space by their individual stats, as redistributed in this n-box, the distribution in the box is the copula.
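The move into the box can be sketched in a few lines. The function names and the four sample students below are hypothetical, and ties are broken by original order, which is only one of several possible conventions.

```python
def to_unit_interval(values):
    """Replace each value by its rank (1..n) divided by the population n."""
    n = len(values)
    # Indexes of the students, ordered from smallest value to largest.
    # sorted() is stable, so tied students keep their original order.
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / n
    return u

def box_points(heights, weights):
    """One point per student in the unit square: (height rank/n, weight rank/n)."""
    return list(zip(to_unit_interval(heights), to_unit_interval(weights)))

# Four made-up students: heights in inches, weights in pounds.
heights = [62, 70, 58, 66]
weights = [120, 150, 130, 160]
points = box_points(heights, weights)
# points == [(0.5, 0.25), (1.0, 0.75), (0.25, 0.5), (0.75, 1.0)]
```

Each coordinate taken by itself is spread evenly over the interval; whatever pattern the points make in the square is the copula of this small population.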

If heights and weights were independently distributed, which means that knowing one gives no clue about the other, then this distribution in the box is uniform, or flat, and conversely.

The Gaussian Copula

What if our distribution is Gaussian, indeed multivariate Gaussian? The probability density function in this case is proportional to e^(−Q(x)), where Q is some positive definite quadratic form and x is the vector from the mean. The copula is always relative to some coordinate system, and if Q in that coordinate system is diagonal then the copula density is constant throughout the hypercube. This is the case if the components of the vector random variable are uncorrelated.
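A small experiment makes the two-dimensional case concrete. This is only a sketch under my own assumptions: the correlation parameter rho, the function names and the sample size are all illustrative. It draws correlated standard normal pairs, pushes each component through the normal cumulative distribution to land in the unit square, and counts the share of points in the lower-left quarter of the box.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_sample(rho, n, seed=0):
    """Draw n points in the unit square from a bivariate Gaussian copula."""
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal cumulative distribution
    points = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Mix in an independent draw so that corr(z1, z2) is rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        points.append((phi(z1), phi(z2)))
    return points

def lower_left_fraction(points):
    """Share of the points that fall in the lower-left quarter of the box."""
    return sum(1 for u, v in points if u < 0.5 and v < 0.5) / len(points)

flat = lower_left_fraction(gaussian_copula_sample(0.0, 20000))
lumpy = lower_left_fraction(gaussian_copula_sample(0.8, 20000))
# flat should come out near 1/4, as a uniform box requires;
# lumpy should be noticeably larger, since the mass piles up along the diagonal.
```

When rho is 0 the quadratic form is diagonal and the box fills evenly; any other rho concentrates the points toward one diagonal of the square.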

The Dark Side