I was flicking through an old notebook that I used for uni notes the other day, and I stumbled across my overly-simplistic guide to Principal Component Analysis: PCA for Morons. It’s a really cool data-mining technique, so I figured it would be worth fleshing out to be slightly more detailed than a few short bullet points!
What is Principal Component Analysis and Why is it Useful?
Data mining is the process of discovering patterns within large volumes of data. This data typically contains redundant information, associations that would be difficult to spot due to the high number of variables, and lots of noise, making it virtually impossible to detect these patterns without the use of statistical/artificially intelligent processing systems.
Essentially, PCA aims to transform a data set by creating a new, smaller set made up of combinations of the original data, known as Principal Components. This shrinks the search space, making predictions and analysis easier, and the principal components are ordered so that the first indicates the most interesting patterns identified. PCA does lead to a loss of information, though it is typically minimal, and the dataset can be approximately reconstructed afterwards, so it can be used as a lossy compression method too.
The Basic Gist of PCA Dimension Reduction
The awesome chart to the left (yes, that is done with the airbrush tool in MS Paint) attempts to show how this reduction can be possible: The left hand side plots the variables v1 and v2, which as you can see have no correlation – if you know v1, it would be very difficult to predict v2. The right hand side, however, shows two variables (v1 and v3) which show a strong correlation – as v1 grows, so does v3, making predictions much easier. Because of this strong relationship we can transform the data, combining the data from multiple axes down to one single axis.
This comparison is performed for every pair of dimensions within the data, and all related variables are then reduced. The process isn’t limited to reducing just two dimensions – it can reduce any number, as long as they are all sufficiently correlated.
In this post I’ll perform PCA on a data set made up of 15 three-dimensional items. This isn’t real data – I made it up for the sake of the example – and it is only three-dimensional for speed and clarity of concepts, but there is no limit to the dimensionality of data when applying PCA to a real dataset.
Before performing PCA the data needs to be mean adjusted, so that the mean of the whole data set is 0. This is achieved by simply calculating the mean of each dimension and subtracting it from each item.
[Table: Raw Data | Mean Adjusted Data]
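The mean adjustment step is only a line or two in practice. Here’s a sketch in Python with NumPy – note the 15×3 array below is a stand-in I’ve made up, not the actual values from the post:

```python
import numpy as np

# A made-up 15x3 dataset standing in for the one in the post
# (each row is one item, each column one dimension).
data = np.array([
    [2.5, 2.4, 1.2], [0.5, 0.7, 0.3], [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9], [3.1, 3.0, 1.6], [2.3, 2.7, 1.2],
    [2.0, 1.6, 1.0], [1.0, 1.1, 0.5], [1.5, 1.6, 0.8],
    [1.1, 0.9, 0.6], [2.6, 2.8, 1.3], [0.8, 0.6, 0.4],
    [1.7, 1.9, 0.9], [2.9, 3.2, 1.4], [1.4, 1.3, 0.7],
])

# Mean-adjust: subtract each dimension's mean so every column averages to 0.
mean_adjusted = data - data.mean(axis=0)
```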
As I said before, PCA compares all pairs of dimensions in the data in order to find correlations within the data, and this is achieved by measuring the covariance – the variance of one dimension with respect to another. Variance is a measure of the spread of one-dimensional data, and is calculated by summing the squared differences between each data point and the mean of the data set, then dividing by the number of data points minus 1 (it is also the standard deviation squared):
Covariance is very similar, but is calculated using two variables (v1 and v2) instead of just the one (v):
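Written out, the variance of a single dimension v, and the covariance between two dimensions v1 and v2, over n data points are:

```latex
\mathrm{var}(v) = \frac{\sum_{i=1}^{n} (v_i - \bar{v})^2}{n - 1}

\mathrm{cov}(v_1, v_2) = \frac{\sum_{i=1}^{n} (v_{1,i} - \bar{v}_1)(v_{2,i} - \bar{v}_2)}{n - 1}
```

Note that var(v) is just cov(v, v) – the covariance of a dimension with itself.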
To compare all pairs of dimensions within the data, we can construct a covariance matrix, of the form:
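For three-dimensional data the covariance matrix is 3×3, with the variances on the diagonal and cov(vi, vj) everywhere else. A minimal sketch in Python/NumPy – the data here is a random stand-in rather than the post’s dataset:

```python
import numpy as np

# Random stand-in for the mean-adjusted data from the previous step.
rng = np.random.default_rng(0)
mean_adjusted = rng.normal(size=(15, 3))
mean_adjusted -= mean_adjusted.mean(axis=0)

# Manual covariance matrix: C[i, j] = cov(dimension i, dimension j),
# dividing by n - 1 as in the formulas above.
n = mean_adjusted.shape[0]
cov_manual = mean_adjusted.T @ mean_adjusted / (n - 1)

# NumPy's built-in agrees (rowvar=False means columns are variables).
cov_np = np.cov(mean_adjusted, rowvar=False)
```

The matrix is always symmetric, since cov(v1, v2) = cov(v2, v1).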
Eigenvalues and Eigenvectors
Full disclosure: this bit is magic. Or at least if there is some mathematical reasoning behind it, I don’t know it.
With our covariance matrix, we can get some useful numbers out of it, known as eigenvalues, and some useful vectors, known as (surprise, surprise) eigenvectors. These can only be found for square matrices, and for an n x n matrix there will be n eigenvalue/eigenvector pairs (each eigenvector having n components: x1, x2 … xn). I shan’t go into too much detail about what they are / how they are calculated, because they are a bit of a mystery to me, but most maths packages should be able to calculate them for you. The only point you may need to bear in mind is that PCA requires unit vectors – that is, the vectors should be normalised so that they are all of length 1. Fortunately, most maths packages will return unit vectors, but it’s worth checking if you are unsure.
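In Python/NumPy, for example, the calculation might look like this – the covariance matrix below is an arbitrary symmetric stand-in, not the one from the post:

```python
import numpy as np

# An arbitrary symmetric covariance matrix for illustration.
cov = np.array([[8.0, 3.0, 2.0],
                [3.0, 2.0, 1.0],
                [2.0, 1.0, 1.5]])

# eigh is meant for symmetric matrices: it returns real eigenvalues
# (in ascending order) and unit-length eigenvectors as the columns of vecs.
vals, vecs = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (the principal component) comes first.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
```

Because `eigh` normalises its output, each column of `vecs` is already a unit vector.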
For the dataset above, the eigenvalues are: 8.62, 0.35 and 0.05; and the eigenvectors are: [0.45,-0.50,-0.73], [0.42,-0.61,0.67] and [0.79,0.61,0.65].
The eigenvectors characterise the data, and the corresponding eigenvalue indicates just how representative of the data the vector is. If you order the eigenvectors by eigenvalue and plot on a scatter graph of the data then you can see that the first vector (the principal component) should pass through the centre of the points like a line of best fit, with each corresponding vector having less significance than the one before it.
Apologies, I realise that the above plot is hardly clear, but it shows each of the 15 points plotted with the original axes (v1, v2 and v3 – in blue) and the eigenvectors (v1′, v2′ and v3′ – in purple). If there were more data points – and I had access to a better charting library – then it’d be much more apparent, but for now you’ll just have to trust that it works!
The Good Bit: Transforming the Original Data
The final step of PCA is to reduce the dimensionality of the data, based on the eigenvalues calculated, by selecting only the top t eigenvalue/eigenvector pairs. As the sample data has a clear pattern, the eigenvalues clearly show one vector is far more representative of the data than the others, but in real datasets the margins can be far tighter. In those cases, you can either select the threshold manually, or use some thresholding algorithm to determine the cut-off point. Calculating thresholds is a really big topic in AI, with many different approaches, but one simple technique I like (and I think works well in situations like this) is to calculate the standard deviation of all of the eigenvalues, and subtract it from the first eigenvalue.
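Using the eigenvalues quoted earlier, that heuristic can be sketched as follows – the values come from the post, but the code itself is my illustration rather than anything the original used:

```python
import numpy as np

# Eigenvalues from the example, in descending order.
eigenvalues = np.array([8.62, 0.35, 0.05])

# Keep every component whose eigenvalue is at least the largest
# eigenvalue minus the standard deviation of all the eigenvalues.
threshold = eigenvalues[0] - eigenvalues.std()
keep = eigenvalues >= threshold
t = int(keep.sum())  # number of components to retain
```

For this data the threshold lands around 4.65, so only the first component survives – matching the intuition that one vector dominates.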
When you have selected the eigenvectors that you will use, you must construct a feature vector, which is essentially just a matrix whose columns are the chosen eigenvectors (v1, v2 … etc), which is then transposed. The final data is then simply the feature vector multiplied by the mean adjusted data.
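Putting the last two steps together, here is a hedged sketch of the transformation – again with a random stand-in dataset, keeping the top two components:

```python
import numpy as np

# Random stand-in for the mean-adjusted data.
rng = np.random.default_rng(1)
mean_adjusted = rng.normal(size=(15, 3))
mean_adjusted -= mean_adjusted.mean(axis=0)

# Covariance matrix and its eigen decomposition, largest first.
cov = np.cov(mean_adjusted, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

t = 2  # keep the top two components

# Feature vector: chosen eigenvectors as columns, then transposed.
feature_vector = vecs[:, :t].T

# Final data: feature vector times the (transposed) mean-adjusted data,
# transposed back so each row is one transformed item.
transformed = (feature_vector @ mean_adjusted.T).T  # shape (15, t)

# Approximate (lossy) reconstruction back in the original space.
reconstructed = transformed @ feature_vector  # shape (15, 3)
```

The reconstruction step is what makes PCA usable as lossy compression: you store 15×2 numbers instead of 15×3 and recover something close to the original.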
Here is the plot of the new data transformed using two eigenvectors, resulting in two-dimensional data. Plotting the data in this way clearly shows that there is a strong pattern in it – and although the pattern was visible in the 3d plot here, it would be impossible to visualise, for instance, 20-dimensional data directly. PCA also helps to remove redundant dimensions which contain very little information. As you have seen, performing PCA is a reasonably simple affair, but it is a powerful tool when trying to simplify a large and complicated dataset.