Tags: matlab, machine-learning, data-analysis

How to get the final features?


The original data is Y; the size of Y is L×n (n is the number of features; L is the number of observations). B is the covariance matrix of the original data Y, and A contains the eigenvectors of B as columns. I write A = (e1, e2, ..., en), where each ei is an eigenvector. The matrix Aq holds the first q eigenvectors, and I let ai be the row vectors of Aq: Aq = (e1, e2, ..., eq) = (a1, a2, ..., an)'. I want to apply the k-means algorithm to Aq to cluster the row vectors ai into k clusters or more (note: I do not want to apply k-means to the eigenvectors ei). For each cluster, only the vector closest to the cluster's center is retained, and the feature corresponding to this vector is finally selected as an informative feature.
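
In code, the pipeline I have in mind looks like this (a minimal sketch using built-in functions; q and k are assumed chosen in advance, and kmeans comes from the Statistics Toolbox):

B = cov(Y);                          % n-by-n covariance of the features
[A, Lam] = eig(B);                   % columns of A are the eigenvectors
[~, order] = sort(diag(Lam), 'descend');
Aq = A(:, order(1:q));               % first q eigenvectors; row ai = feature i
[cl, C] = kmeans(Aq, k);             % cluster the n row vectors into k clusters
sel = zeros(k, 1);
for i = 1:k
    members = find(cl == i);
    dists = sum(bsxfun(@minus, Aq(members,:), C(i,:)).^2, 2);
    [~, j] = min(dists);
    sel(i) = members(j);             % feature whose row is closest to centroid i
end
Yfinal = Y(:, sel);                  % the k selected (informative) features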

My questions are:

1) What is the difference between applying the k-means algorithm to Aq to cluster the row vectors ai into k clusters, and applying the k-means algorithm to Aq to cluster the eigenvectors ei into k clusters?
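
In other words, the two options differ only in which direction of Aq gets clustered:

kmeans(Aq, k)    % clusters the n row vectors a1, ..., an (one per feature)
kmeans(Aq.', k)  % would cluster the q column vectors e1, ..., eq (the eigenvectors)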

2) The closest vectors I get come from this command: closest_vectors = Aq(min_idxs, :); the size of closest_vectors is k×q (double). How do I get the final informative features? The final informative features have to be obtained from the original data Y.

Thanks!

I found two functions for PCA and PFA:

function [e, m, lambda, sqsigma] = cvPca(X, M)
% cvPca: PCA of X, where X is D (features) x N (samples).
[D, N] = size(X);

if ~exist('M', 'var') || isempty(M) || M == 0
    M = D;
end
M = min(M, min(D, N-1));

%% mean subtraction
m = mean(X, 2);           % mean of every row (per-feature mean)
X = X - repmat(m, 1, N);

%% singular value decomposition. X = U*S*V.' or X.' = V*S*U.'
[U, S, V] = svd(X, 'econ');
e = U(:, 1:M);            % the M leading principal axes (columns)

if nargout > 2
    s = diag(S);
    s = s(1:min(D, N-1));
    lambda = s.^2 / N;    % biased (1/N) estimator of the variances
end

%% sqsigma: used to model the distribution of errors by a univariate Gaussian
if nargout > 3
    d = cvPcaDist(X, e, m);  % use of a validation set would be better
    N = size(d, 2);
    sqsigma = sum(d) / N;    % or (N-1) for the unbiased estimate
end
end
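
A quick usage sketch for cvPca (my own example, not part of the original code), with X arranged as D features by N samples:

X = randn(5, 100);             % 5 features, 100 samples
[e, m, lambda] = cvPca(X, 3);  % keep the top 3 principal axes
% e is 5-by-3 with orthonormal columns, m is the 5-by-1 per-feature mean,
% and lambda holds the (biased) variances of all min(D, N-1) components
proj = e.' * (X - repmat(m, 1, size(X, 2)));  % 3-by-100 projected data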

%/////////////////////////////////////////////////////////////////////////////

function [IDX, Me] = cvPfa(X, p, q)
% cvPfa: principal feature analysis of X, where X is D (features) x N (samples).
% Returns IDX, a logical D-by-1 mask of the p selected features, and Me, the mean.
[D, N] = size(X);
if ~exist('p', 'var') || isempty(p) || p == 0
    p = D;
end
p = min(p, min(D, N-1));
if ~exist('q', 'var') || isempty(q)
    q = p - 1;
end

%% PCA step
[U, Me, Lambda] = cvPca(X, q);

%% cluster the row vectors of U (D rows of length q), not the columns
[Cl, Mu] = kmeans(U, p, 'emptyaction', 'singleton', 'distance', 'sqEuclidean');

%% in each cluster, keep the row (feature) nearest to the centroid
IDX = false(D, 1);
for i = 1:p
    Cli = find(Cl == i);
    d = cvEucdist(Mu(i,:).', U(Cli,:).');
    [~, argmin] = min(d);
    IDX(Cli(argmin)) = true;
end
end
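
Note that cvPfa calls cvEucdist, which is not listed here. A minimal stand-in consistent with how it is used above (columns as vectors; squared distances, which does not change the argmin) could be the following (my guess at its contract, not the original implementation):

function d = cvEucdist(X, Y)
% Pairwise squared Euclidean distances between the columns of X and Y:
% d(i, j) = ||X(:, i) - Y(:, j)||^2
XX = sum(X.^2, 1).';  % squared column norms of X
YY = sum(Y.^2, 1);    % squared column norms of Y
d = bsxfun(@plus, XX, YY) - 2 * (X.' * Y);
end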

Solution

  • Summarizing Olologin's comments: it doesn't make sense to cluster the eigenvectors of the covariance matrix (that is, the columns of the U matrix of the SVD). The eigenvectors in this case are all mutually orthogonal, so if you tried to cluster them, you would get only one member per cluster, and that cluster's centroid would be defined by the eigenvector itself.

    Now, what you're really after is selecting out the features in your data matrix that describe your data in terms of discriminatory analysis.

    The functions you have provided compute the SVD, pluck out the k principal components of your data, and determine which of the features are the most prominent to select. By default, the number of features selected is equal to k, but you can override this if you want. Let's just stick with the default.

    The cvPfa function performs this feature selection for you, but be warned that the data matrix in the function is organized so that each row is a feature and each column is a sample. The output is a logical vector that tells you which features are the strongest to select from your data.

    Simply put, you just do this:

    k = 10; %// Example
    IDX = cvPfa(Y.', k);
    Ynew = Y(:,IDX);
    

    This code will choose the 10 most prominent features in your data matrix and pluck out the 10 that are most representative, or most discriminative, of your data. You can then use the output for whatever application you're targeting.
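
    To sanity-check the whole pipeline on synthetic data (a sketch; it assumes cvPca, cvPfa, and a cvEucdist implementation are on the path):

    Y = randn(200, 30);   %// 200 observations, 30 features
    k = 10;
    IDX = cvPfa(Y.', k);  %// logical 30-by-1 mask with k entries set to true
    Ynew = Y(:, IDX);     %// 200-by-10: only the selected features remain
    disp(find(IDX).');    %// indices of the retained features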