Preprocess the TFIDF Matrices

Normalize doc vector lengths

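This step can be sketched as follows. The frame name `TFIDF` and the toy values are assumptions for illustration; the idea is simply to divide each document row by its L2 norm so all document vectors have unit length.

```python
import numpy as np
import pandas as pd

# Hypothetical toy TFIDF matrix: rows are documents, columns are terms.
TFIDF = pd.DataFrame(
    [[1.0, 2.0, 2.0],
     [0.0, 3.0, 4.0]],
    columns=['apple', 'banana', 'cherry'])

# Divide each row by its L2 norm so every document vector has unit length.
TFIDF = TFIDF.apply(lambda row: row / np.linalg.norm(row), axis=1)
```

After this, the dot product of any two document rows is their cosine similarity.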
Normalize term vector variance

We do not normalize variance here, although we normally would, such as with data containing divergent units of measure. This is because doing so would exaggerate the importance of rare words (see Ng, 2008: 6m40s–8m00s).

Center the word vectors

Note that we are taking the column-wise means -- the means for the term vectors. We don't strictly need to do this, but it is typical for PCA. NOTE: Some argue that centering alters the cosine angles.
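
A minimal sketch of the centering step, using a hypothetical toy `TFIDF` frame. Subtracting a DataFrame's column means (what `TFIDF.mean()` returns) centers each term vector at zero:

```python
import pandas as pd

# Hypothetical toy TFIDF frame: rows are documents, columns are terms.
TFIDF = pd.DataFrame([[0.1, 0.5],
                      [0.3, 0.1],
                      [0.2, 0.3]], columns=['apple', 'banana'])

# TFIDF.mean() is column-wise by default, so this subtracts each
# term's mean from its column, centering the term vectors at zero.
CENTERED = TFIDF - TFIDF.mean()
```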

Compute Covariance Matrix

We could compute this directly, but here we use the built-in Pandas method.
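
A sketch of both routes, on a hypothetical toy frame: the built-in `DataFrame.cov()` call, and the direct computation it replaces (center, then `X'X / (n - 1)`):

```python
import pandas as pd

# Hypothetical toy TFIDF frame: rows are documents, columns are terms.
TFIDF = pd.DataFrame([[0.1, 0.5],
                      [0.3, 0.1],
                      [0.2, 0.3]], columns=['apple', 'banana'])

# The built-in Pandas method (it centers internally and divides by n - 1).
COV = TFIDF.cov()

# The equivalent direct computation on centered data.
X = TFIDF - TFIDF.mean()
direct = X.T.dot(X) / (len(X) - 1)
```

The two results agree, and the resulting term-by-term matrix is symmetric.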

Decompose the Matrix

There are at least three options to choose from. We go with SciPy's Hermitian eigendecomposition method eigh(), since our covariance matrix is symmetric.
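
A minimal sketch of the eigh() call, on a hypothetical 2x2 covariance matrix. Note that `scipy.linalg.eigh` returns eigenvalues in ascending order, which is why a sort is needed later:

```python
import numpy as np
import pandas as pd
from scipy.linalg import eigh

# Hypothetical 2x2 term covariance matrix (symmetric by construction).
TERMS = ['apple', 'banana']
COV = pd.DataFrame([[0.01, -0.02],
                    [-0.02, 0.04]], index=TERMS, columns=TERMS)

# eigh() exploits symmetry; eigenvalues come back in ascending order.
eig_vals, eig_vecs = eigh(COV.values)

# Sanity check: the decomposition reconstructs the matrix, V diag(l) V'.
recon = eig_vecs @ np.diag(eig_vals) @ eig_vecs.T
```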

Convert eigen data to dataframes
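
A sketch of the wrapping step. The names `EIG_VAL` and `EIG_VEC`, the column label `eig_val`, and the input arrays are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical eigendecomposition output for a 2-term covariance matrix.
TERMS = ['apple', 'banana']
eig_vals = np.array([0.0, 0.05])
eig_vecs = np.array([[0.8944, -0.4472],
                     [0.4472, 0.8944]])

# Wrap the raw NumPy arrays in labeled dataframes.
EIG_VAL = pd.DataFrame(eig_vals, columns=['eig_val'])
EIG_VEC = pd.DataFrame(eig_vecs, index=TERMS)  # one row per term
```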

Select Principal Components

Next, we associate each eigenvalue with its corresponding column in the eigenvector matrix. This is why we transpose the EIG_VEC dataframe.

Combine eigenvalues and eigenvectors

Next, we sort in descending order and pick the top K (=10).
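
The transpose-join-sort sequence can be sketched as below. The toy data, the `EIG_PAIRS` name, and `K = 2` (rather than the notebook's 10, since the toy matrix has only 3 components) are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical toy setup: a 3-term covariance matrix and its decomposition.
TERMS = ['apple', 'banana', 'cherry']
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X, rowvar=False))

EIG_VAL = pd.DataFrame(eig_vals, columns=['eig_val'])
EIG_VEC = pd.DataFrame(eig_vecs, index=TERMS)

# After the transpose, each ROW of EIG_VEC.T corresponds to one
# eigenvalue, so the two frames align index-for-index and can be joined.
EIG_PAIRS = EIG_VEC.T.join(EIG_VAL)

# Sort by eigenvalue, descending, and keep the top K components.
K = 2  # the notebook uses K = 10
TOP = EIG_PAIRS.sort_values('eig_val', ascending=False).head(K)
```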

Compute and Show Explained Variance

We might have used this value to sort our components.
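
Explained variance is each eigenvalue's share of the total. A minimal sketch, with hypothetical eigenvalues and an assumed `exp_var` column name:

```python
import pandas as pd

# Hypothetical eigenvalues, already sorted in descending order.
EIG_VAL = pd.DataFrame({'eig_val': [4.0, 3.0, 2.0, 1.0]})

# Each component's share of the total variance; the shares sum to 1.
EIG_VAL['exp_var'] = EIG_VAL['eig_val'] / EIG_VAL['eig_val'].sum()
```

Sorting by this ratio gives the same order as sorting by the raw eigenvalues, since the denominator is constant.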

Pick Top K (10) Components

We pick these based on explained variance.

Show Loadings

Loadings show the contribution of each term to the component. We'll just look at the top 10 words for the first two components in the Book version.
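
A sketch of the inspection step. The `LOADINGS` frame, its terms, and its values are hypothetical; the point is ranking terms by absolute loading within each component (top 2 here, rather than the top 10 shown in the Book version):

```python
import pandas as pd

# Hypothetical loadings frame: rows are terms, columns are components.
LOADINGS = pd.DataFrame(
    {'PC0': [0.8, -0.1, 0.5, -0.3],
     'PC1': [0.1, 0.9, -0.2, 0.4]},
    index=['whale', 'ship', 'sea', 'captain'])

# For each of the first two components, rank terms by the magnitude of
# their contribution (sign only indicates direction along the axis).
for pc in ['PC0', 'PC1']:
    top = LOADINGS[pc].abs().sort_values(ascending=False).head(2)
    print(pc, list(top.index))
```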