python dataframe scikit-learn pca

Reduce dataframe rows in Python


I have a dataframe of shape 8000 x 1600, and I want to reduce the number of rows without changing the values. I tried PCA, but the values changed. Example:

    a 10 20 30 40
    b 20 70 40 50
    c 10 00 80 40
    d 20 30 99 50
    e 10 20 30 40
    f 59 30 40 50
    g 10 20 30 40
    h 90 30 40 50
    i 91 20 34 18

into :

    a 10 20 30 40
    c 10 00 80 40
    h 90 30 40 50
    i 91 20 34 18

I think explained_variance_ratio_ could handle this with a for loop. Any help, please?


Solution

  • Unless I'm misunderstanding your problem, I think you're confusing the purpose of PCA (dimensionality reduction) with a simple dataframe manipulation to reduce the number of rows. These are very different things:

    Dimensionality reduction, which you can get via PCA, modifies the values of your dataframe (that is the point) and reduces the number of columns (features), not the number of rows. It is a useful, though not entirely straightforward, method of creating or extracting new features from your data for analysis, visualizing high-dimensional data, etc. Take a look at the Wikipedia pages on PCA and dimensionality reduction, and see if that is indeed what you want. If it is, I suggest you reformulate your question.

    Reducing the number of rows is something completely different, and is very straightforward in pandas. Based on your example, it looks like you want to extract a number of random rows, without modification, from your dataframe. This can be done with df.sample().

    For example, on the data you posted, the following selects 4 random rows:

    >>> df.sample(4)
       0   1   2   3   4
    0  a  10  20  30  40
    2  c  10   0  80  40
    7  h  90  30  40  50
    5  f  59  30  40  50
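
To make the above easier to try out, here is a minimal, self-contained sketch. It rebuilds the nine example rows from the question (the column names `label`, `c1` ... `c4` are made up for illustration, since the question does not name them) and shows two ways of calling `df.sample()`: with a fixed row count and with a fraction of the rows. The `random_state` value and the `sort_index()` call are my own additions, used only to make the draw reproducible and to keep the original row order.

```python
import pandas as pd

# The nine example rows from the question; column names are assumed.
df = pd.DataFrame(
    [
        ["a", 10, 20, 30, 40],
        ["b", 20, 70, 40, 50],
        ["c", 10, 0, 80, 40],
        ["d", 20, 30, 99, 50],
        ["e", 10, 20, 30, 40],
        ["f", 59, 30, 40, 50],
        ["g", 10, 20, 30, 40],
        ["h", 90, 30, 40, 50],
        ["i", 91, 20, 34, 18],
    ],
    columns=["label", "c1", "c2", "c3", "c4"],
)

# Keep exactly 4 random rows; random_state makes the draw reproducible.
subset = df.sample(n=4, random_state=0)

# Or keep a fraction of the rows (about half), sorted back into original order.
half = df.sample(frac=0.5, random_state=0).sort_index()

# Each sampled row is an unmodified copy of the original row with the same index.
unchanged = subset.equals(df.loc[subset.index])

print(subset)
print(unchanged)
```

For the 8000 x 1600 dataframe in the question, the same calls should apply unchanged, for example `df.sample(n=2000)` or `df.sample(frac=0.25)`. Which rows are kept is random, so this only fits if any subset of rows will do.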