Tuesday 15 May 2012

Calculating Correlation Using LINQ

Recently I looked over my old A level maths notes, something that I was looking at was statistics and correlation.  As a LINQ test, I wrote a simple algorithm that calculates the Spearman's Rank Correlation Coefficient.

The Spearmans Rank Correlation Coefficient is one of the several correlation calculations used to calculate the 'strength' of a relationship between 2 sets of data, this calculated value is in the range -1..1. For more information on correlations look here.

It works by giving each value in each set of data a rank between 1 and at most n (in descending order), if 2 or more values are eqaul then the rank is averaged between these values, so for instance a set of 7 values and the respective ranks:

x 72 112 46 97 46 46 52
rx 3 1 6 2 6 6 4

As you can see the value of 46 is given the rank 6, because (5+6+7)/3 = 6, now that I have described how sets of values are ranked, lets look at the equation:

Spearman's Rank Correlation Coefficient = 1 - ((6 * sum of rank differences^2) / (n(n^2 - 1))
Where d = the difference in rank between the 2 sets of data, n = the number of values in each set (7)

So using the example above as our first set if we now include a second set of data (must also be 7 values):

y 20 2 7 11 4 12 7
ry 1 7 4.5 3 6 2 4.5

As before the value 7 shares 4th and 5th (4+5/2). Now we can sum the squares of the differences between ranks:
= 22 + -62 + 1.52 + -12 + 02 + 42 + -0.52
= 4 + 36 + 2.25 + 1 + 0 + 16 + 0.25
= 59.5

This gives a final result of: ρ = -0.0625 (a very slight negative correlation).   Now the write a program which can calculate this for LINQ, firstly we need to order the values in descending order, numbered from 1 to n, the best way to do this is by using fluent syntax, as this supports the selecting of a value along with its index (something that isn't possible with query syntax).  We then can group the values if the numbers are the same:
double[] rankings = (from n in values
   join grp in groupedValues on n equals grp.Key
   select grp.Average(g => g.IndexedRank)).ToArray();
With the rankings we can combine them and the calculate the sum of the differences by using aggregation:
int n = rankings.Length;

// Note that I have expanded the n^2(n - 1) to use less code
double rho = 1 - ((6 * sigmaDiff) / (Math.Pow(n, 3) - n));
Putting it all together, using our 2 sets as a test harness:
using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    public static double SpearmansCoeff(this IEnumerable<int> current, IEnumerable<int> other)
    {
        if (current.Count() != other.Count())
            throw new ArgumentException("Both collections of data must contain an equal number of elements");

        double[] ranksX = GetRanking(current);
        double[] ranksY = GetRanking(other);

        var diffPair = ranksX.Zip(ranksY, (x, y) => new { x, y });
        double sigmaDiff = diffPair.Sum(s => Math.Pow(s.x - s.y, 2));
        int n = current.Count();

        // Spearman's Coefficient of Correlation
        // ρ = 1 - ((6 * sum of rank differences^2) / (n(n^2 - 1))
        double rho = 1 - ((6 * sigmaDiff) / (Math.Pow(n, 3) - n));

        return rho;
    }

    static double[] GetRanking(IEnumerable<int> values)
    {
        var groupedValues = values.OrderByDescending(n => n)
                                  .Select((val, i) => new { Value = val, IndexedRank = i + 1 })
                                  .GroupBy(i => i.Value);

        double[] rankings = (from n in values
                             join grp in groupedValues on n equals grp.Key
                             select grp.Average(g => g.IndexedRank)).ToArray();

        return rankings;
    }
}

class Program
{
    static void Main(string[] args)
    {
        int[] setX = { 72, 112, 46, 97, 46, 46, 52 };
        int[] setY = { 20, 2, 7, 11, 4, 12, 7 };

        double rho = setX.SpearmansCoeff(setY);

        Console.WriteLine("Spearman's Rank Correlation Coefficient: {0}", rho);
    }
}
Spearman's Rank Correlation Coefficient: -0.0625

1 comment:

mr292 said...

Ryan,

Very impressive. I am looking for someone to do a LINQ project for me. It involves writing several basic and some advanced risk functions to analyze return data. Would you be interested?

You can contact me at mansoor.razzaq@gmail.com