I have a vector vec where each element is a string representation of a sparse vector. The output I want is a Pandas DataFrame with the following characteristics:
- index: vec index
- columns: sparse vector indices
- values: sparse vector values
The sparse vectors are encoded with the format <feature_index>:<feature_value>, and records are separated by a single space.
Here are a few rows of example data:
vec = ["70:1.0000 71:1.0000 83:1.0000",
"3:2.0000 8:2.0000 9:3.0000",
"3:3.0000 185:1.0000 186:1.0000",
"3:1.0000 8:1.0000 289:1.0000"]
And here's my expected output:
            185     186     289       3      70      71       8      83       9
index
0           NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1           NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2        1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3           NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN
I have a working solution using from_records and pivot, but it seems clumsy and inefficient:
import pandas as pd

dense = pd.DataFrame()

for i, row in enumerate(vec):
    tups = []
    for entry in row.split():
        tups.append(tuple([x for x in entry.split(':')]))
    dense = pd.concat([dense,
                       (pd.DataFrame
                          .from_records(tups,
                                        index=[i]*len(tups),
                                        columns=['key','val'])
                          .reset_index()
                          .pivot(index='index',
                                 columns='key',
                                 values='val')
                       )
                      ])
Can anyone suggest a cleaner approach, ideally one that makes better use of Pandas functionality?
The actual dataset I'm working with is rather large, so I'd like to take advantage of the performance optimizations in native Pandas, if possible.
Notes:
- The output index doesn't need to be labeled index.
- This doesn't have to be a pure Pandas solution. For example, I looked a bit at some of the sklearn methods for handling sparsity, but none of them quite seemed appropriate for solving this task.
- I'm not sure this matters, but after this operation I merge the resulting DataFrame (call it dense) with another DataFrame (call this one df), using dense and df indices as merge keys. So in this example, vec indices are [0,1,2,3], and the output dense needs to retain those indices.
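For reference, the index-based merge described in the last note might look like the sketch below. Both frames here are made-up stand-ins: dense copies two columns from the first two rows of the expected output, and the label column in df is purely hypothetical.

```python
import pandas as pd

# Illustrative stand-in for the parsed result (two rows, two columns only).
dense = pd.DataFrame({"3": [float("nan"), 2.0],
                      "70": [1.0, float("nan")]},
                     index=[0, 1])

# Hypothetical second frame that shares the same index values.
df = pd.DataFrame({"label": ["a", "b"]}, index=[0, 1])

# DataFrame.join aligns on the index by default (left join).
merged = df.join(dense)
print(merged)
```

pd.merge(df, dense, left_index=True, right_index=True) is an equivalent spelling if an inner join is preferred. Either way, the merge only lines up if dense keeps the same index values as vec.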
I think you can use list comprehensions: first split each string into key:value pairs, convert each row's pairs to a dict, and then pass the list of dicts to the DataFrame constructor:
print ([dict([y.split(':') for y in (x.split())]) for x in vec])
[{'83': '1.0000', '70': '1.0000', '71': '1.0000'},
{'8': '2.0000', '3': '2.0000', '9': '3.0000'},
{'185': '1.0000', '186': '1.0000', '3': '3.0000'},
{'289': '1.0000', '8': '1.0000', '3': '1.0000'}]
df = pd.DataFrame([dict([y.split(':') for y in (x.split())]) for x in vec])
print (df)
      185     186     289       3      70      71       8      83       9
0     NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1     NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2  1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN
This gives a DataFrame containing NaNs and strings, so casting to a numeric type is necessary:
print (type(df.loc[0,'70']))
<class 'str'>
df = df.astype(float)
print (df)
   185  186  289    3   70   71    8   83    9
0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  1.0  NaN
1  NaN  NaN  NaN  2.0  NaN  NaN  2.0  NaN  3.0
2  1.0  1.0  NaN  3.0  NaN  NaN  NaN  NaN  NaN
3  NaN  NaN  1.0  1.0  NaN  NaN  1.0  NaN  NaN
print (type(df.loc[0,'70']))
<class 'numpy.float64'>
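The steps above can also be folded into one helper that converts values to float while parsing, so the separate astype call is not needed. This is only a sketch: the name sparse_strings_to_frame is hypothetical, and converting the column labels to int and sorting them numerically are my own choices rather than something the question asks for. The optional index argument is there because the question needs dense to keep the vec indices for a later merge.

```python
import pandas as pd

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]


def sparse_strings_to_frame(rows, index=None):
    """Parse '<feature_index>:<feature_value>' strings into a float DataFrame.

    Hypothetical helper, not part of pandas.
    """
    records = []
    for row in rows:
        record = {}
        for entry in row.split():
            key, value = entry.split(":")
            # Assumption: feature indices are integers, values are floats.
            record[int(key)] = float(value)
        records.append(record)
    # Features missing from a row become NaN; `index` sets the row labels.
    frame = pd.DataFrame(records, index=index)
    # Sort columns numerically (3, 8, 9, 70, ...) to make the order explicit.
    return frame.sort_index(axis=1)


dense = sparse_strings_to_frame(vec)
print(dense)
```

If vec is really a Series or a DataFrame column, passing its index (for example sparse_strings_to_frame(s, index=s.index), with s a hypothetical Series) keeps the row labels aligned for the later merge.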