python pandas numpy numpy-ndarray countvectorizer

Numpy - array of arrays recognized as a vector

I ran into a problem with numpy arrays. I used CountVectorizer from sklearn with a wordset and values (from a pandas column) to create an array of arrays that counts words (BoW). When I print the array and its shape, I get this result:

[[array([0, 5, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 ...
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)

How can an array of arrays have a vector shape?

I checked that all rows have the same size.

Here is a way to reproduce my problem:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])

# add labels column
data["label"] = ["blop", "bip", "boup"]

wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
    
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values

print(labels_count_np, labels_count_np.shape)

It should return:

[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)

Can someone explain why numpy behaves this way?

Also, I tried to find a way to concatenate multiple arrays like this:

A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
 array([6, 1, 0, 9])]

concatenate(A,B) =>
[
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]

But I did not find a good way to do it.

Solution

values from a DataFrame, even one with just a single column, is 2d. values from a Series (one column of the frame) is 1d.

If labels_count_np is (2800, 1) shape, you can easily make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That's just basic numpy.

It will still be an object dtype array containing arrays, the elements of the dataframe cells. If those arrays are all the same size then

 np.stack(labels_count_np[:,0])

should create a 2d numeric array.
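
As a minimal sketch of those two steps, the snippet below builds a small (3, 1) object array by hand, drops the extra axis, and stacks the rows. The array stands in for labels_count_np; the row values and the names col, flat, and stacked are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for labels_count_np: a (3, 1) object array
# whose cells each hold a 1d integer array of the same length.
rows = [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 0, 1])]
col = np.empty((3, 1), dtype=object)
for i, row in enumerate(rows):
    col[i, 0] = row

flat = col[:, 0]          # drop the length-1 axis; np.squeeze(col, axis=1) gives the same shape
stacked = np.stack(flat)  # only works because every row has the same length

print(col.shape, col.dtype)    # (3, 1) object
print(flat.shape, flat.dtype)  # (3,) object
print(stacked.shape)           # (3, 4)
```

After stacking, the result is an ordinary numeric array, so slicing and arithmetic act on the counts directly instead of on the array objects.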

Make a frame with array elements:

In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
In [36]: df
Out[36]: 
      x
0  None
1  None
2  None
In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
In [38]: df
Out[38]: 
              x
0  [0, 0, 0, 0]
1  [0, 0, 0, 0]
2  [0, 0, 0, 0]

The 2d array from the frame:

In [39]: df.values
Out[39]: 
array([[array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])]], dtype=object)
In [40]: _.shape
Out[40]: (3, 1)

The 1d array from the Series:

In [41]: df['x'].values
Out[41]: 
array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
      dtype=object)
In [42]: _.shape
Out[42]: (3,)

Joining the Series values into one 2d array:

In [43]: np.stack(df['x'].values)
Out[43]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
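
The same idea appears to cover the concatenation asked about in the question: stack each object array into a 2d numeric array, then join the two side by side. This is a sketch that assumes A and B have the same number of rows; to_object_array is a hypothetical helper used only to rebuild the question's A and B.

```python
import numpy as np

def to_object_array(rows):
    """Hypothetical helper: pack rows into a 1d object array of arrays."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = np.asarray(row)
    return out

# A and B as shown in the question: object arrays holding 1d arrays.
A = to_object_array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
])
B = to_object_array([
    [0, 7, 2, 0],
    [1, 4, 0, 8],
    [6, 1, 0, 9],
])

# Stack each one into a 2d numeric array, then join along the columns.
joined = np.hstack([np.stack(A), np.stack(B)])

print(joined.shape)  # (3, 13)
print(joined)
```

np.concatenate([np.stack(A), np.stack(B)], axis=1) does the same thing.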