python pandas list machine-learning k-means

Removal of List from Pandas DataFrame

    ProArticle                                          Vector

0   Iran jails blogger 14 years An Iranian weblogg...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1   UK gets official virus alert site A rapid aler...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2   OSullivan could run Worlds Sonia OSullivan ind...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3   Mutant book wins Guardian prize A book evoluti...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4   Microsoft seeking spyware trojan Microsoft inv...   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...

The above is the data.head() output for a set of vectorized news articles.

type(data.Vector[0]) is list

I need to use KMeans clustering on this Vectorized data, but the lists won't let me.

data.Vector.shape is (179,), and len(data.Vector[0]) is 8868.

How can I remove the list, or if I can't, then how can I use it to cluster the given data? Perhaps I could get a dataframe in the following way to start, followed by running PCA on it.

Expected Output looks like this:

Solution

It seems that you want to create a 2D NumPy array out of a Pandas column that contains lists of numbers. In most cases you can treat a Pandas column as a list or a 1-dimensional NumPy array. Here, you can use np.vstack to stack the separate lists as rows:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "ProArticle": ["a", "b", "c", "d"],
...     "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]]
... })
>>> vs = np.vstack(df.Vector)
>>> vs
array([[0, 0],
       [1, 1],
       [2, 2],
       [3, 3]])

So this results in an array that you can use directly with sklearn's KMeans:

>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit_predict(vs)
array([1, 1, 0, 0], dtype=int32)
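
Applied to data shaped like the question's (179 rows, each a list of 8868 floats), the same idea might look like the sketch below. The random values, the number of clusters, and the number of PCA components are all assumptions made for illustration; the PCA step is optional and only included because the question mentions it.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's data: 179 rows, each holding a
# plain Python list of 8868 floats. The values are made up.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "ProArticle": [f"article {i}" for i in range(179)],
    "Vector": [rng.random(8868).tolist() for _ in range(179)],
})

# Stack the per-row lists into one 2D array of shape (n_rows, n_features).
X = np.vstack(data["Vector"])
print(X.shape)

# Optional: reduce dimensionality first. 50 components is an arbitrary
# choice; it must not exceed min(n_rows, n_features).
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# n_clusters=5 is also arbitrary; random_state makes runs repeatable.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(X_reduced)
print(data["cluster"].value_counts())
```

Writing the labels back into a new column keeps each article next to its cluster assignment, since fit_predict returns one label per row in the original row order.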

If you still want the intermediate result as a Pandas DataFrame, you can use apply to turn each list into a Pandas Series; according to apply's documentation, this results in a DataFrame:

>>> df.Vector.apply(pd.Series)
   0  1
0  0  0
1  1  1
2  2  2
3  3  3

You can then get the same NumPy array by accessing the .values attribute of the resulting DataFrame. However, this is far slower than the vstack solution: about 1 millisecond versus 25.4 microseconds on my machine.
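
To check the equivalence and get a rough timing on your own machine, a sketch along these lines could be used. The third variant, which builds the DataFrame from a list of lists, is not part of the answer above; it is an assumed alternative included only for comparison, and the timings will differ from the figures quoted.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ProArticle": ["a", "b", "c", "d"],
    "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]],
})

# The two approaches from the answer.
via_vstack = np.vstack(df["Vector"])
via_apply = df["Vector"].apply(pd.Series).to_numpy()

# Assumed alternative (not from the answer): build the frame from a list of lists.
via_tolist = pd.DataFrame(df["Vector"].tolist(), index=df.index).to_numpy()

# All three should produce the same 2D array.
assert np.array_equal(via_vstack, via_apply)
assert np.array_equal(via_vstack, via_tolist)

candidates = {
    "np.vstack": lambda: np.vstack(df["Vector"]),
    "apply(pd.Series)": lambda: df["Vector"].apply(pd.Series).to_numpy(),
    "DataFrame(tolist())": lambda: pd.DataFrame(df["Vector"].tolist()).to_numpy(),
}

# Average over repeated calls; absolute numbers depend on the machine.
runs = 200
for label, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{label}: {seconds * 1e6:.1f} microseconds per call")
```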