I am trying to use scikit-learn's factor analysis on some financial data to find betas to use in a model. FA has parameters called n_components and tol, and I am having some trouble wrapping my head around how they influence the outcome. I have read the docs and done research but have had trouble finding any relevant information. I am new to machine learning and am not a stats wizard. Could someone explain how these influence the outcome of the algorithm?
From sklearn.decomposition.FactorAnalysis:

n_components : int | None
    Dimensionality of latent space, the number of components of X that are obtained after transform. If None, n_components is set to the number of features.

tol : float
    Stopping tolerance for EM algorithm.
I am assuming that your financial data is a matrix of shape (n_samples, n_features). Factor analysis uses an expectation-maximization (EM) optimizer to find the Gaussian latent-variable model that best fits your data; tol is the stopping tolerance for that EM loop, i.e. the fit stops iterating once the improvement between iterations drops below tol. In simple terms, n_components is the dimensionality of the latent Gaussian distribution: the number of underlying factors used to explain your data.
Data that can be modelled with a Gaussian distribution sometimes has negligible variance in one dimension. Think of an ellipsoid that is squashed along its depth until it resembles an ellipse. If the raw data were that ellipsoid, you would want n_components = 2, so that you model your data with the least complicated model that still captures it.
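You can see this squashed-ellipsoid case directly with synthetic data (the scales below are made up for illustration): three features, one of which carries almost no variance, modelled well by just two factors.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# A 3-D Gaussian "ellipsoid" squashed flat along its third axis
rng = np.random.RandomState(0)
X = rng.randn(500, 3) * np.array([5.0, 2.0, 0.01])

# Two latent components suffice: the third axis is essentially noise
fa = FactorAnalysis(n_components=2).fit(X)

# noise_variance_ is the per-feature variance left unexplained by the
# two factors; it is tiny for the squashed third feature
print(fa.noise_variance_)
```

If you instead set n_components=3 here, the extra factor would only be modelling noise, which is the kind of unnecessary complexity the question of choosing n_components is about.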