I am trying to use scikit-learn's factor analysis on some financial data to find betas to use in a model. FA has parameters called n_components and tol, and I am having some trouble wrapping my head around how they influence the outcome. I have read the docs and done research but have had trouble finding any relevant information. I am new to machine learning and am not a stats wizard. Could someone explain how these influence the outcome of the algorithm?
From sklearn.decomposition.FactorAnalysis:

n_components : int | None
    Dimensionality of latent space, the number of components of X that are obtained after transform. If None, n_components is set to the number of features.

tol : float
    Stopping tolerance for EM algorithm.
I am assuming that your financial data is a matrix of shape (n_samples, n_features). Factor analysis uses an expectation-maximization (EM) optimizer to find the Gaussian latent-variable model that best fits your data; tol is the stopping tolerance for that EM loop, i.e. the fit stops iterating once the improvement between iterations drops below tol. In simple terms, n_components is the dimensionality of the latent Gaussian distribution: the number of underlying factors used to explain your data.
Data that can be modelled with a Gaussian distribution sometimes has negligible variance in one dimension. Think of an ellipsoid that is squashed along its depth until it resembles an ellipse. If the raw data were that ellipsoid, you would want n_components = 2, so that you model your data with the least complicated model that still captures it.
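You can see this squashed-ellipsoid case directly with synthetic data (the scales below are made up for illustration): three features, one of which carries almost no variance, modelled well by just two factors.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# A 3-D Gaussian "ellipsoid" squashed flat along its third axis
rng = np.random.RandomState(0)
X = rng.randn(500, 3) * np.array([5.0, 2.0, 0.01])

# Two latent components suffice: the third axis is essentially noise
fa = FactorAnalysis(n_components=2).fit(X)

# noise_variance_ is the per-feature variance left unexplained by the
# two factors; it is tiny for the squashed third feature
print(fa.noise_variance_)
```

If you instead set n_components=3 here, the extra factor would only be modelling noise, which is the kind of unnecessary complexity the question of choosing n_components is about.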