ProArticle Vector
0 Iran jails blogger 14 years An Iranian weblogg... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1 UK gets official virus alert site A rapid aler... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 OSullivan could run Worlds Sonia OSullivan ind... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3 Mutant book wins Guardian prize A book evoluti... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4 Microsoft seeking spyware trojan Microsoft inv... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
The above is the data.head() output from a DataFrame of vectorized news articles.
type(data.Vector[0])
is list, data.Vector.shape
is (179,), and each list has length 8868.
How can I unpack these lists, or, failing that, how can I use them to cluster the data? Perhaps I could first turn them into a plain DataFrame and then run PCA on it.
What you seem to want is to create a 2D NumPy array out of a pandas column that contains lists of numbers. In most cases you can treat a pandas column as a list or a 1-dimensional NumPy array, so here you can use np.vstack
to stack the separate lists as rows:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
... "ProArticle": ["a", "b", "c", "d"],
... "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]]
... })
>>> vs = np.vstack(df.Vector)
>>> vs
array([[0, 0],
       [1, 1],
       [2, 2],
       [3, 3]])
This results in an array that you can use directly with scikit-learn's KMeans:
>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit_predict(vs)
array([1, 1, 0, 0], dtype=int32)
(The exact label numbering may vary between runs, since KMeans starts from random initializations.)
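Since the question also mentions running PCA, here is a minimal sketch of the full pipeline: stack the column, reduce it with PCA, then cluster. The toy data and the parameter choices (n_components=2, n_clusters=2, random_state=0) are illustrative only; with the real 179×8868 data you would pick n_components well below 179.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy stand-in for the real data; the actual Vector column
# holds 179 lists of length 8868.
df = pd.DataFrame({
    "ProArticle": ["a", "b", "c", "d"],
    "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]],
})

# Stack the list-valued column into a 2D array: one row per article.
vs = np.vstack(df.Vector)

# Reduce dimensionality before clustering (n_components is illustrative).
pca = PCA(n_components=2)
reduced = pca.fit_transform(vs)

# Cluster the reduced vectors; random_state makes the run reproducible.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```

With the 8868-dimensional TF-IDF-style vectors, reducing first also speeds up KMeans considerably, since its cost scales with the number of features.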
If you still want the intermediate result as a pandas DataFrame, you can use apply
to convert each list into a pandas Series; according to the documentation for apply
, this results in a DataFrame:
>>> df.Vector.apply(pd.Series)
   0  1
0  0  0
1  1  1
2  2  2
3  3  3
You can then get the same NumPy array by accessing the .values
attribute of the resulting DataFrame. However, this is far slower than the vstack
solution: about 1 millisecond versus 25.4 microseconds on my machine.
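Timings like these depend on the machine and data size, so here is a small sketch of how the comparison can be reproduced with timeit, again on the toy frame; it also checks that both approaches yield the same array.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ProArticle": ["a", "b", "c", "d"],
    "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]],
})

# Both approaches should produce identical arrays.
via_vstack = np.vstack(df.Vector)
via_apply = df.Vector.apply(pd.Series).values

# Time each conversion over many repetitions.
t_vstack = timeit.timeit(lambda: np.vstack(df.Vector), number=1000)
t_apply = timeit.timeit(lambda: df.Vector.apply(pd.Series), number=1000)
print(f"vstack: {t_vstack:.4f}s  apply: {t_apply:.4f}s")
```

The gap widens further with the real data, since apply constructs a new Series object per row while vstack copies the lists into one contiguous array in a single pass.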