
Incremental PCA


I've never used the incremental PCA that exists in sklearn, and I'm a bit confused about its parameters; I haven't been able to find a good explanation of them.

I see that there is batch_size in the constructor, but also, when using the partial_fit method, you can again pass only a part of your data. I've found the following approach:

from sklearn.decomposition import IncrementalPCA

n = df.shape[0]
chunk_size = 100000
iterations = n // chunk_size

ipca = IncrementalPCA(n_components=40, batch_size=1000)

# feed the data chunk by chunk
for i in range(iterations):
    ipca.partial_fit(df.iloc[i*chunk_size : (i+1)*chunk_size].values)

# leftover rows that don't fill a whole chunk
if n % chunk_size:
    ipca.partial_fit(df.iloc[iterations*chunk_size :].values)

Now, what I don't understand is the following: when using partial_fit, does batch_size play any role at all? And how are the two related?

Moreover, if both matter, how should I change their values to increase precision at the cost of a larger memory footprint (and, the other way around, decrease memory consumption at the price of reduced accuracy)?


Solution

  • The docs say:

    batch_size : int or None, (default=None)

    The number of samples to use for each batch. Only used when calling fit...
    

    This parameter is not used within partial_fit, where the batch size is controlled by the user.

    Bigger batches increase memory consumption, smaller ones decrease it. This is also stated in the docs:

    This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory.

    Apart from some checks and parameter heuristics, the whole fit function boils down to this (a small sketch contrasting fit and a manual partial_fit loop follows after the snippet):

    for batch in gen_batches(n_samples, self.batch_size_):
        self.partial_fit(X[batch], check_input=False)
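
    So with partial_fit, the chunk size you pass in is the batch size; batch_size only matters when you let fit do the chunking. Below is a minimal self-contained sketch (toy data; the array X_toy, its shape, and the chunk size k are made up for illustration, not taken from the question) comparing the two paths. With the same chunk size they process the same sequence of batches, so both memory use and the fitted components should come out the same.

    # Minimal sketch (toy data; X_toy, k and the sizes below are illustrative):
    # fit() with batch_size=k and a manual partial_fit loop over chunks of
    # k rows process the same sequence of batches.
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    rng = np.random.RandomState(0)
    X_toy = rng.rand(1000, 50)
    k = 200

    # Variant 1: fit() chunks the data itself, using batch_size.
    ipca_fit = IncrementalPCA(n_components=10, batch_size=k)
    ipca_fit.fit(X_toy)

    # Variant 2: we chunk the data ourselves; batch_size is irrelevant here.
    ipca_manual = IncrementalPCA(n_components=10)
    for start in range(0, X_toy.shape[0], k):
        ipca_manual.partial_fit(X_toy[start:start + k])

    # Same batches in, so the fitted components should agree (up to sign),
    # and peak memory is governed by k in both variants.
    print(np.allclose(np.abs(ipca_fit.components_),
                      np.abs(ipca_manual.components_)))

    As for the precision/memory trade-off: roughly speaking, larger chunks mean fewer incremental updates that each see more data (closer to an ordinary full-data PCA) at the price of more memory per update, while smaller chunks do the opposite; per the docs quoted above, the same reasoning applies when the data lives in a np.memmap file instead of RAM.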