python pandas sparse-matrix

Convert string representations of sparse vectors into Pandas dataframe


I have a vector vec where each element is a string representation of a sparse vector.
The output I want is a Pandas DataFrame with the following characteristics:

index: vec index
columns: sparse vector indices
values: sparse vector values

The sparse vectors are encoded with the format <feature_index>:<feature_value>, and records are separated by a single space.

Here are a few rows of example data:

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]

And here's my expected output:

          185     186     289       3      70      71       8      83       9
index                                                                        
0         NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1         NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2      1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3         NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN

I have a working solution using from_records and pivot, but it seems clumsy and inefficient:

import pandas as pd

dense = pd.DataFrame()

for i, row in enumerate(vec):
    tups = []
    for entry in row.split(): 
        tups.append(tuple([x for x in entry.split(':')]))

    dense = pd.concat([dense,
                       (pd.DataFrame
                          .from_records(tups, 
                                        index=[i]*len(tups), 
                                        columns=['key','val'])
                          .reset_index()
                          .pivot(index='index', 
                                 columns='key', 
                                 values='val')
                       )
                     ])

Can anyone suggest a cleaner approach, ideally one that makes better use of Pandas functionality?
The actual dataset I'm working with is rather large, so I'd like to take advantage of the performance optimizations in native Pandas, if possible.

Notes:
- The output index doesn't need to be labeled index.
- This doesn't have to be a pure Pandas solution. For example, I looked a bit at some of the sklearn methods for handling sparsity, but none of them quite seemed appropriate for solving this task.
- I'm not sure this matters, but after this operation I merge the resulting DataFrame (call it dense) with another DataFrame (call this one df), using dense and df indices as merge keys. So in this example, vec indices are [0,1,2,3], and the output dense needs to retain those indices.


Solution

  • I think you can use a list comprehension: split each string into key:value pairs, convert each row to a dict, and pass the list of dicts to the DataFrame constructor:

    print ([dict([y.split(':') for y in (x.split())]) for x in vec])
    [{'83': '1.0000', '70': '1.0000', '71': '1.0000'}, 
     {'8': '2.0000', '3': '2.0000', '9': '3.0000'}, 
     {'185': '1.0000', '186': '1.0000', '3': '3.0000'}, 
     {'289': '1.0000', '8': '1.0000', '3': '1.0000'}]
    
    df = pd.DataFrame([dict([y.split(':') for y in (x.split())]) for x in vec])
    print (df)
          185     186     289       3      70      71       8      83       9
    0     NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
    1     NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
    2  1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
    3     NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN
    

    The resulting DataFrame contains NaNs and strings, so casting to a numeric type is necessary:

    print (type(df.loc[0,'70']))
    <class 'str'>
    
    df = df.astype(float)
    print (df)
       185  186  289    3   70   71    8   83    9
    0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  1.0  NaN
    1  NaN  NaN  NaN  2.0  NaN  NaN  2.0  NaN  3.0
    2  1.0  1.0  NaN  3.0  NaN  NaN  NaN  NaN  NaN
    3  NaN  NaN  1.0  1.0  NaN  NaN  1.0  NaN  NaN
    
    print (type(df.loc[0,'70']))
    <class 'numpy.float64'>
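
As a possible variation on the approach above, the values can be converted while parsing, which avoids the separate astype step. The sketch below is only one way to do it: the helper name parse_row is made up for illustration, and converting the keys to int (so the columns sort numerically rather than as strings) is an assumption about what is wanted, not something the question requires.

```python
import pandas as pd

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]


def parse_row(row):
    """Hypothetical helper: turn '3:2.0 8:2.0' into {3: 2.0, 8: 2.0}."""
    pairs = (entry.split(':') for entry in row.split())
    # Assumption: feature indices are integers and values are floats.
    return {int(key): float(val) for key, val in pairs}


# One dict per row; the DataFrame constructor aligns the keys into
# columns and fills the missing entries with NaN.
dense = pd.DataFrame([parse_row(row) for row in vec])

# Integer column labels sort numerically (3, 8, 9, 70, ...).
dense = dense.sort_index(axis=1)
print(dense)
```

With a plain list the result gets the default 0..n-1 index, which matches the vec positions. If vec were a pandas Series with its own index, passing index=vec.index to the constructor should keep those labels for the later merge.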