Tags: python, matlab, pca, analysis

PCA on large dataset


I have a large dataset consisting of 6 input variables (temperatures, pressures, flow rates, etc.) that give outputs such as yield, purity and conversion. There are approximately 47,600 instances, all in an Excel spreadsheet. I have applied both an artificial neural network and a random forest to this data (in Python) and obtained prediction plots and accuracy metrics. The random forest model also reports input variable importances. I would now like to perform a PCA on this data, firstly to compare against the random forest results, and secondly to learn more about how my input variables interact with each other to produce the outputs. I've watched a few YouTube videos and tutorials to get my head around PCA, but the data they use is quite different from mine.

Below is a snippet of my data; the first 6 columns are inputs and the last 3 are outputs. [screenshot of the spreadsheet omitted]

How can I analyse this data using PCA? I have managed to plot it in Python, but the plot is very busy and doesn't convey much information.
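For context, a minimal version of what I tried looks like this (a sketch with scikit-learn; the random array is just a stand-in for my 6 input columns):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(47600, 6))  # stand-in for the 6 input columns

    # Standardise first: temperatures, pressures and flow rates are on
    # very different scales, and PCA is scale-sensitive.
    X_std = StandardScaler().fit_transform(X)

    pca = PCA()
    scores = pca.fit_transform(X_std)
    print(pca.explained_variance_ratio_)  # variance captured per component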

Any help or tips are welcome! Perhaps a different analysis tool? I don't mind using Python or MATLAB.

Thank you :)


Solution

  • I suggest using the KarhunenLoeveSVDAlgorithm in OpenTURNS. It provides implementations of randomized SVD algorithms. The constraint is that the number of singular values to compute has to be set beforehand.

    In order to enable the algorithm, we must set the KarhunenLoeveSVDAlgorithm-UseRandomSVD key in the ResourceMap. The KarhunenLoeveSVDAlgorithm-RandomSVDMaximumRank key then sets the number of singular values to compute (by default, it is equal to 1000).

    Two implementations are provided:

    • Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
    • Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky and Mark Tygert. An algorithm for the principal component analysis of large data sets.

    These algorithms can be chosen with the KarhunenLoeveSVDAlgorithm-RandomSVDVariant key.
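For example, the relevant ResourceMap keys can be set like this (a sketch; I query the variant key rather than set it, since the accepted variant names should be checked against the ResourceMap defaults of your OpenTURNS version):

    import openturns as ot

    # Enable the randomized SVD and cap the number of singular values.
    ot.ResourceMap.SetAsBool('KarhunenLoeveSVDAlgorithm-UseRandomSVD', True)
    ot.ResourceMap.SetAsUnsignedInteger(
        'KarhunenLoeveSVDAlgorithm-RandomSVDMaximumRank', 50)

    # The default variant can be queried before overriding it.
    print(ot.ResourceMap.GetAsString('KarhunenLoeveSVDAlgorithm-RandomSVDVariant'))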

    In the following example, I simulate a large process sample from a Gaussian process with an AbsoluteExponential covariance model.

    import openturns as ot

    # 2-D regular grid on [-1, 1] x [-1, 1]
    mesh = ot.IntervalMesher([10] * 2).build(ot.Interval([-1.0] * 2, [1.0] * 2))
    s = 0.01  # truncation threshold for the decomposition
    model = ot.AbsoluteExponential([1.0] * 2)
    sampleSize = 100000
    sample = ot.GaussianProcess(model, mesh).getSample(sampleSize)
    

    Then the random SVD algorithm is used:

    # Enable the randomized SVD, then run the decomposition
    ot.ResourceMap_SetAsBool('KarhunenLoeveSVDAlgorithm-UseRandomSVD', True)
    algorithm = ot.KarhunenLoeveSVDAlgorithm(sample, s)
    algorithm.run()
    result = algorithm.getResult()
    

    The result object contains the Karhunen-Loève decomposition of the process. On a regular grid with equal weights, this is equivalent to a PCA.
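A sketch of how the result can then be inspected (with a smaller sample size so it runs quickly; getEigenvalues and project are KarhunenLoeveResult accessors, worth double-checking against the OpenTURNS API documentation for your version):

    import openturns as ot

    # Rebuild a small version of the example above.
    mesh = ot.IntervalMesher([10] * 2).build(ot.Interval([-1.0] * 2, [1.0] * 2))
    model = ot.AbsoluteExponential([1.0] * 2)
    sample = ot.GaussianProcess(model, mesh).getSample(100)

    ot.ResourceMap.SetAsBool('KarhunenLoeveSVDAlgorithm-UseRandomSVD', True)
    algorithm = ot.KarhunenLoeveSVDAlgorithm(sample, 0.01)
    algorithm.run()
    result = algorithm.getResult()

    # Eigenvalues play the role of PCA explained variances; project()
    # returns the coefficients (scores) of each realization.
    print(result.getEigenvalues())
    coefficients = result.project(sample)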