python, pandas, list, machine-learning, k-means

Removal of List from Pandas DataFrame


    ProArticle                                          Vector

0   Iran jails blogger 14 years An Iranian weblogg...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1   UK gets official virus alert site A rapid aler...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2   OSullivan could run Worlds Sonia OSullivan ind...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3   Mutant book wins Guardian prize A book evoluti...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4   Microsoft seeking spyware trojan Microsoft inv...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...

The above is the data.head() snippet from a vectorized news article.

type(data.Vector[0]) is list

I need to run KMeans clustering on this vectorized data, but the lists won't let me.

data.Vector.shape is (179,), and each list in it has 8868 elements.

How can I remove the list, or if I can't, then how can I use it to cluster the given data? Perhaps I could get a dataframe in the following way to start, followed by running PCA on it.

Expected output: (image not included)


Solution

  • It seems that you want to create a 2D NumPy array out of a pandas column that contains lists of numbers. In most cases you can treat a pandas column as a list or a 1-dimensional NumPy array. Here, you can use np.vstack to stack the separate lists as rows:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     "ProArticle": ["a", "b", "c", "d"],
    ...     "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]]
    ... })
    >>> vs = np.vstack(df.Vector)
    >>> vs
    array([[0, 0],
           [1, 1],
           [2, 2],
           [3, 3]])
    

    This gives you an array of shape (n_rows, n_features) that you can use directly with sklearn's KMeans:

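    >>> from sklearn.cluster import KMeans
    >>> kmeans = KMeans(n_clusters=2)
    >>> kmeans.fit_predict(vs)  # label numbering is arbitrary and may be swapped between runs
    array([1, 1, 0, 0], dtype=int32)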
    

    If you still want the intermediate result as a pandas DataFrame, you can use apply to turn each list into a pandas Series; according to apply's documentation, this results in a DataFrame:

    >>> df.Vector.apply(pd.Series)
       0  1
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    

    You can then get the same NumPy array through the .values attribute of the resulting DataFrame (or by calling .to_numpy() on it). However, this is far slower than the vstack solution: 1 millisecond versus 25.4 microseconds on my machine.
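
Since the question also mentions running PCA before clustering, here is a minimal end-to-end sketch of how the pieces could fit together. The toy data, the column name Cluster, n_components=2, n_clusters=2 and the random_state values are assumptions made for illustration; they are not taken from the original data, which has 179 rows of 8868 features each.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy stand-in for the question's DataFrame: two well-separated groups of
# 10 rows each, with every row stored as a plain Python list of 20 floats.
rng = np.random.default_rng(0)
low = rng.normal(loc=0.0, scale=0.1, size=(10, 20))
high = rng.normal(loc=5.0, scale=0.1, size=(10, 20))
data = pd.DataFrame({
    "ProArticle": [f"article {i}" for i in range(20)],
    "Vector": [row.tolist() for row in np.vstack([low, high])],
})

# Stack the per-row lists into one (n_samples, n_features) array.
X = np.vstack(data["Vector"].tolist())

# Optional step: reduce dimensionality before clustering.
# n_components=2 is an arbitrary choice for this sketch.
X_reduced = PCA(n_components=2).fit_transform(X)

# Cluster the reduced data and keep each label next to its article.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
data["Cluster"] = kmeans.fit_predict(X_reduced)

print(X.shape, X_reduced.shape)
print(data[["ProArticle", "Cluster"]].head())
```

Writing the labels back into the DataFrame keeps the article text and its cluster on the same row, which makes it easy to inspect what ended up in each cluster. On the real data, the number of components and clusters would need to be chosen for the dataset at hand.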