To simplify, let us first consider the problem of representing the high-dimensional data set on a single line, using the method described here.
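Specifically, we would like to assign a single number, or ‘‘score’’, to each column of the $m \times n$ data matrix $X = [x_1, \ldots, x_n]$. We choose a direction $u \in \mathbf{R}^m$ and a scalar $v$. This corresponds to the affine ‘‘scoring’’ function $f : \mathbf{R}^m \to \mathbf{R}$, which, to a generic column $x$ of the data matrix, assigns the value
$$ f(x) = u^T x + v. $$
We thus obtain a vector of values $f = (f_1, \ldots, f_n)$, with $f_j = u^T x_j + v$, $j = 1, \ldots, n$. It is often useful to center these values around zero. This can be done by choosing $v$ such that
$$ 0 = \sum_{j=1}^n (u^T x_j + v) = u^T \Big( \sum_{j=1}^n x_j \Big) + n v, $$
that is: $v = -u^T \hat{x}$, where
$$ \hat{x} := \frac{1}{n} \sum_{j=1}^n x_j \in \mathbf{R}^m $$
is the vector of sample averages across the columns of the matrix (that is, data points). The vector $\hat{x}$ can be interpreted as the ‘‘average response’’ across experiments.
The values of our scoring function can now be expressed as
$$ f(x) = u^T (x - \hat{x}). $$
In order to compare the relative merits of different directions, we can assume, without loss of generality, that the vector $u$ is normalized (so that $\|u\|_2 = 1$).
It is convenient to work with the ‘‘centered’’ data matrix, which is
$$ X_{\text{cent}} = [x_1 - \hat{x}, \ldots, x_n - \hat{x}] = X - \hat{x} \mathbf{1}^T, $$
where $\mathbf{1}$ is the vector of ones in $\mathbf{R}^n$.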
In Matlab, we can compute the centered data matrix as follows.
>> xhat = mean(X,2);            % average of each row, an m-by-1 vector
>> [m,n] = size(X);
>> Xcent = X-xhat*ones(1,n);    % subtract the average from every column
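As a side note, the explicit product with `ones(1,n)` is only one way to write the centering step. The sketch below compares it with two equivalent forms on a small made-up matrix `X` (a hypothetical example, not the Senate data). It assumes `bsxfun` is available, and that the last form runs on a version supporting implicit expansion (MATLAB R2016b or later, or a recent Octave).

```matlab
% Hypothetical 3-by-4 data matrix (columns are data points); not the Senate data.
X = [1 2 3 4; 0 1 0 1; 5 5 5 5];
[m, n] = size(X);
xhat = mean(X, 2);                    % m-by-1 vector of row averages

Xcent1 = X - xhat*ones(1, n);         % form used in the text
Xcent2 = bsxfun(@minus, X, xhat);     % explicit broadcasting
Xcent3 = X - xhat;                    % implicit expansion (MATLAB R2016b+ or Octave)

disp(norm(Xcent1 - Xcent2, 'fro'));   % difference between forms 1 and 2
disp(norm(Xcent1 - Xcent3, 'fro'));   % difference between forms 1 and 3
disp(norm(sum(Xcent1, 2)));           % norm of the row sums of the centered matrix
```

All three displayed values should be zero up to rounding error. The last one confirms that the columns of the centered matrix sum to the zero vector.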
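We can compute the (row) vector of scores for a direction $u$ using the simple matrix-vector product:
$$ f = u^T X_{\text{cent}}. $$
We can check that the average of the above row vector is zero:
$$ f \mathbf{1} = u^T X_{\text{cent}} \mathbf{1} = u^T (X \mathbf{1} - n \hat{x}) = 0, $$
since $X \mathbf{1} = n \hat{x}$ by definition of the average $\hat{x}$.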
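The score computation can be sketched in a few lines. The example below is only an illustration under stated assumptions: the matrix `X` is a made-up stand-in of random +1/-1 entries (not the actual Senate data), and the direction `u` is drawn at random and then normalized.

```matlab
% Hypothetical stand-in for the data: m = 5 "bills", n = 8 "Senators".
m = 5; n = 8;
X = sign(randn(m, n));          % made-up +1/-1 entries; columns are data points

xhat = mean(X, 2);              % average column, an m-by-1 vector
Xcent = X - xhat*ones(1, n);    % centered data matrix

u = randn(m, 1);                % random direction
u = u / norm(u);                % normalize so that norm(u) equals 1

f = u' * Xcent;                 % 1-by-n row vector of centered scores
disp(mean(f));                  % average score, zero up to rounding error
```

Because the centering is built into `Xcent`, the average of `f` vanishes for any choice of `u`, not just this random one.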
Scores obtained with random direction: This image shows the values of the projections of the Senators' votes (that is, with the average across Senators removed) on a (normalized) ‘‘random bill’’ direction. This projection shows no obvious structure. Note that the range of the data is much smaller than that obtained with the average bill shown above.
We can also try to project the data on a plane, which involves assigning two scores to each data point.
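This corresponds to the affine ‘‘scoring’’ map $f : \mathbf{R}^m \to \mathbf{R}^2$, which, to a generic column $x$ of the data matrix, assigns the two-dimensional value
$$ f(x) = \begin{pmatrix} u_1^T x + v_1 \\ u_2^T x + v_2 \end{pmatrix} = U^T x + v, $$
where $u_1, u_2 \in \mathbf{R}^m$ are two vectors, and $v_1, v_2$ two scalars, while $U = [u_1, u_2] \in \mathbf{R}^{m \times 2}$, $v = (v_1, v_2) \in \mathbf{R}^2$.
The affine map $f$ allows us to generate $n$ two-dimensional data points (instead of $m$-dimensional) $f_j = U^T x_j + v$, $j = 1, \ldots, n$. As before, we can require that the $f_j$'s be centered:
$$ 0 = \sum_{j=1}^n f_j = \sum_{j=1}^n (U^T x_j + v), $$
by choosing the vector $v$ to be such that $v = -U^T \hat{x}$, where $\hat{x} \in \mathbf{R}^m$ is the ‘‘average response’’ defined above. Our (centered) scoring map takes the form
$$ f(x) = U^T (x - \hat{x}). $$
We can encapsulate the scores in the $2 \times n$ matrix $F = [f_1, \ldots, f_n]$. The latter can be expressed as the matrix-matrix product
$$ F = U^T X_{\text{cent}}, $$
with $X_{\text{cent}}$ the centered data matrix defined above.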
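The two-dimensional case can be sketched the same way, by stacking two directions as the columns of a matrix `U`. As before, `X` is a made-up stand-in (not the Senate data) and the directions are random. Normalizing each column of `U` is an assumption carried over from the one-dimensional case, and the two directions are not made orthogonal.

```matlab
% Hypothetical stand-in for the data: m = 5 "bills", n = 8 "Senators".
m = 5; n = 8;
X = sign(randn(m, n));               % made-up +1/-1 entries; columns are data points

xhat = mean(X, 2);                   % average column, an m-by-1 vector
Xcent = X - xhat*ones(1, n);         % centered data matrix

U = randn(m, 2);                     % two random directions u1, u2 as columns
U(:,1) = U(:,1) / norm(U(:,1));      % normalize u1
U(:,2) = U(:,2) / norm(U(:,2));      % normalize u2

F = U' * Xcent;                      % 2-by-n matrix; column j is the score of point j
disp(mean(F, 2)');                   % both row averages, zero up to rounding error
% plot(F(1,:), F(2,:), 'o');         % uncomment to view the projected points
```

Each row of `F` is a one-dimensional score vector for the corresponding direction, so the two-dimensional picture is simply two such projections plotted against each other.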
Clearly, depending on which plane we choose to project on, we get very different pictures. Some planes seem to be more ‘‘informative’’ than others. We return to this issue here.
Two-dimensional projection of the Senate voting matrix: This particular projection does not seem to be very informative. Notice, in particular, the scale of the vertical axis. The data is all but crushed to a line, and even on the horizontal axis, the data does not show much variation.
Two-dimensional projection of the Senate voting matrix: This particular projection seems to cluster the Senators along party lines, and is therefore more informative. We explain how to choose such a direction here.