Projecting data on a line
Projecting data on a plane
Consider a data set of $n$ points $x_j$, $j = 1, \ldots, n$, in $\mathbf{R}^m$. Each point can represent a chemical experiment under specific conditions; the response of a specific gene to a number of different drugs; the votes of a particular citizen on an array of issues; atmospheric readings (temperature, pressure, humidity, etc.) at a specific location; past prices of a single asset; etc.
We can represent this data set as a matrix $X = [x_1, \ldots, x_n] \in \mathbf{R}^{m \times n}$, where each $x_j$ is an $m$-vector. Simply plotting the raw matrix is often not very informative.
We can try to visualize the data set by projecting each data point (each row or column of the matrix) on (say) a 1D-, 2D- or 3D-space. Each “view” corresponds to a particular projection, that is, a particular one-, two- or three-dimensional subspace on which we choose to project the data. The visualization problem consists of choosing an appropriate projection.
There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus on the basics of that problem.
To simplify, let us first consider the problem of representing the high-dimensional data set on a single line, using the method described here.
Specifically, we would like to assign a single number, or “score”, to each column of the matrix. We choose a direction $u \in \mathbf{R}^m$, and a scalar $v$. This corresponds to the affine “scoring” function $f : \mathbf{R}^m \to \mathbf{R}$, which, to a generic column $x$ of the data matrix, assigns the value
$$f(x) = u^T x + v.$$
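We thus obtain a vector of values $f \in \mathbf{R}^n$, with $f_j = u^T x_j + v$, $j = 1, \ldots, n$. It is often useful to center these values around zero. This can be done by choosing $v$ such that
$$0 = \sum_{j=1}^n (u^T x_j + v) = u^T \Big( \sum_{j=1}^n x_j \Big) + n v,$$
that is: $v = -u^T \hat{x}$, where
$$\hat{x} := \frac{1}{n} \sum_{j=1}^n x_j \in \mathbf{R}^m$$
is the vector of sample averages across the columns of the matrix (that is, data points). The vector $\hat{x}$ can be interpreted as the “average response” across experiments.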
The values of our scoring function can now be expressed as
$$f_j = u^T (x_j - \hat{x}), \quad j = 1, \ldots, n.$$
In order to be able to compare the relative merits of different directions, we can assume, without loss of generality, that the vector $u$ is normalized (so that $\|u\|_2 = 1$).
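The centering step above can be illustrated with a minimal Matlab sketch. The toy matrix `X` and the direction `u` below are made-up values chosen only for illustration; they do not come from any data set discussed here.

```matlab
% Hypothetical toy data: m = 2 features, n = 3 data points (one per column).
X = [1 2 6; 0 4 2];
[m,n] = size(X);

% An arbitrary direction, scaled to unit Euclidean norm.
u = [3; 4];
u = u/norm(u);            % u = [0.6; 0.8]

% Sample average of the columns, and the offset that centers the scores.
xhat = mean(X,2);         % xhat = [3; 2]
v = -u'*xhat;             % v = -3.4

% Affine scores f_j = u'*x_j + v, one per data point (row vector).
f = u'*X + v;             % approximately [-2.8 1.0 1.8]
disp(mean(f))             % zero, up to rounding error
```

With this choice of `v`, the scores sum to zero whatever the direction `u` is; normalizing `u` only fixes the scale on which scores for different directions are compared.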
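It is convenient to work with the “centered” data matrix, which is
$$X_{\rm cent} = [x_1 - \hat{x}, \ldots, x_n - \hat{x}] = X - \hat{x} \mathbf{1}_n^T,$$
where $\mathbf{1}_n$ is the vector of ones in $\mathbf{R}^n$.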
In Matlab, we can compute the centered data matrix as follows.
>> xhat = mean(X,2); % average of the columns of X
>> [m,n] = size(X);
>> Xcent = X-xhat*ones(1,n); % subtract the average from every column
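We can compute the (row) vector of scores using the simple matrix-vector product:
$$f = u^T X_{\rm cent}.$$
We can check that the average of the above row vector is zero:
$$\frac{1}{n} f \mathbf{1}_n = \frac{1}{n} u^T (X - \hat{x} \mathbf{1}_n^T) \mathbf{1}_n = \frac{1}{n} u^T (X \mathbf{1}_n - n \hat{x}) = 0,$$
since $X \mathbf{1}_n = n \hat{x}$, where $\mathbf{1}_n$ denotes the vector of ones in $\mathbf{R}^n$.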
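As a quick numerical check of the product form of the scores, the following Matlab sketch compares it with a column-by-column evaluation. The matrix `X` and direction `u` are hypothetical toy values, and the loop is there only for comparison; in practice the single product would be used.

```matlab
% Hypothetical toy data: columns are data points.
X = [1 2 6; 0 4 2];
[m,n] = size(X);

% Centered data matrix: subtract the column average from every column.
xhat = mean(X,2);
Xcent = X - xhat*ones(1,n);

% An arbitrary unit-norm direction.
u = [1; 1]/sqrt(2);

% All scores at once, as a 1-by-n row vector.
f = u'*Xcent;

% Same scores, one data point at a time, for comparison.
g = zeros(1,n);
for j = 1:n
    g(j) = u'*(X(:,j) - xhat);
end

disp(norm(f - g))   % zero, up to rounding error
disp(sum(f))        % zero, up to rounding error
```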
Example: Senator scores on average bill.
We can also try to project the data on a plane, which involves assigning two scores to each data point.
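This corresponds to the affine “scoring” map $f : \mathbf{R}^m \to \mathbf{R}^2$, which, to a generic column $x$ of the data matrix, assigns the two-dimensional value
$$f(x) = \begin{pmatrix} u_1^T x + v_1 \\ u_2^T x + v_2 \end{pmatrix} = U^T x + v,$$
where $u_1, u_2 \in \mathbf{R}^m$ are two vectors, and $v_1, v_2$ two scalars, while $U = [u_1, u_2] \in \mathbf{R}^{m \times 2}$, $v = (v_1, v_2) \in \mathbf{R}^2$.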
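The affine map $f$ allows us to generate $n$ two-dimensional data points (instead of $m$-dimensional) $f_j = U^T x_j + v$, $j = 1, \ldots, n$. As before, we can require that the $f_j$'s be centered:
$$\sum_{j=1}^n f_j = 0,$$
by choosing the vector $v$ to be such that $v = -U^T \hat{x}$, where $\hat{x} \in \mathbf{R}^m$ is the “average response” defined above. Our (centered) scoring map takes the form
$$f(x) = U^T (x - \hat{x}).$$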
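We can encapsulate the scores in the $2 \times n$ matrix $F = [f_1, \ldots, f_n]$. The latter can be expressed as the matrix-matrix product
$$F = U^T X_{\rm cent} = U^T (X - \hat{x} \mathbf{1}_n^T),$$
with $X_{\rm cent}$ the centered data matrix defined above.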
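The two-dimensional case can be sketched in Matlab in the same way. The data matrix `X` and the two directions stored in `U` are hypothetical values picked for illustration; nothing here says how good directions should be chosen.

```matlab
% Hypothetical toy data: m = 3 features, n = 4 data points (one per column).
X = [1 2 6 3; 0 4 2 2; 5 1 3 3];
[m,n] = size(X);

% Centered data matrix.
xhat = mean(X,2);                 % xhat = [3; 2; 3]
Xcent = X - xhat*ones(1,n);

% Two arbitrary unit-norm directions, stored as the columns of U.
u1 = [1; 1; 0]/sqrt(2);
u2 = [0; 0; 1];
U = [u1, u2];

% 2-by-n matrix of scores: column j is the planar image of data point j.
F = U'*Xcent;

% Same result from the affine map f(x) = U'*x + v with v = -U'*xhat.
v = -U'*xhat;
G = U'*X + v*ones(1,n);

disp(norm(F - G))                 % zero, up to rounding error
disp(F*ones(n,1))                 % both row sums are zero, up to rounding
```

Each column of `F` gives the coordinates of one data point in the plane, so a scatter plot such as `plot(F(1,:), F(2,:), 'o')` would display the resulting view.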
Example: Visualizing Senate voting on a plane.