Tags: python, pandas, numpy, numpy-ndarray, countvectorizer

NumPy: array of arrays recognized as a vector


I have run into a problem with NumPy arrays. I used CountVectorizer from sklearn with a wordset and values (from a pandas column) to create an array of arrays of word counts (bag of words). When I print the array and its shape, I get this result:

[[array([0, 5, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 ...
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)

An array of arrays with a vector shape?

I checked that all rows have the same size.

Here is a way to reproduce my problem:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])

# add labels column
data["label"] = ["blop", "bip", "boup"]

wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
    
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values

print(labels_count_np, labels_count_np.shape)

It should print:

[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)

Can someone explain why NumPy behaves this way?

Also, I tried to find a way to concatenate multiple arrays like this:

A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
 array([6, 1, 0, 9])]

concatenate(A,B) =>
[
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]

But I have not found a good way to do it.


Solution

  • .values from a DataFrame, even one with a single column, is 2d. .values from a Series (one column of the frame) is 1d.

    If labels_count_np has shape (2800, 1), you can make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That's just basic numpy indexing.

    It will still be an object-dtype array whose elements are arrays (the contents of the dataframe cells). If those arrays are all the same size, then

     np.stack(labels_count_np[:,0])
    

    should create a 2d numeric array.

    Make a frame with array elements:

    In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
    In [36]: df
    Out[36]: 
          x
    0  None
    1  None
    2  None
    In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
    In [38]: df
    Out[38]: 
                  x
    0  [0, 0, 0, 0]
    1  [0, 0, 0, 0]
    2  [0, 0, 0, 0]
    

    The 2d array from the frame:

    In [39]: df.values
    Out[39]: 
    array([[array([0, 0, 0, 0])],
           [array([0, 0, 0, 0])],
           [array([0, 0, 0, 0])]], dtype=object)
    In [40]: _.shape
    Out[40]: (3, 1)
    

    The 1d array from the Series:

    In [41]: df['x'].values
    Out[41]: 
    array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
          dtype=object)
    In [42]: _.shape
    Out[42]: (3,)
    

    Joining the Series values into one 2d array:

    In [43]: np.stack(df['x'].values)
    Out[43]: 
    array([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]])
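
The second part of the question (joining A and B row by row) is not covered above. A minimal sketch, assuming A and B are object-dtype arrays of equal-length rows as shown in the question: stack each one into a 2d numeric array, then join them along the column axis with np.hstack. The frame and the column names "a" and "b" are hypothetical, built only to reproduce the two shapes discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding the question's A and B rows as cell values.
df = pd.DataFrame({
    "a": [np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]),
          np.array([0, 0, 0, 0, 1, 0, 0, 0, 0]),
          np.array([0, 0, 0, 0, 0, 0, 0, 0, 1])],
    "b": [np.array([0, 7, 2, 0]),
          np.array([1, 4, 0, 8]),
          np.array([6, 1, 0, 9])],
})

col = df[["a"]].values          # from a DataFrame: 2d object array, shape (3, 1)
A = np.stack(col[:, 0])         # drop the extra axis, then stack: shape (3, 9)
B = np.stack(df["b"].values)    # from a Series: 1d object array, stacked: shape (3, 4)

joined = np.hstack([A, B])      # join column-wise: shape (3, 13)
print(col.shape, A.shape, B.shape, joined.shape)
print(joined)
```

np.stack only works here because every row within A (and within B) has the same length; np.concatenate([A, B], axis=1) would be an equivalent way to do the final join.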