I need to reshape my df.
This is my input df:
import pandas as pd
import datatable as dt
DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
value=[1, 2, 3, 4, 5, 6, 7, 8])
| name date type value
-- + ----- ---------- ---- -----
0 | name1 2021-01-01 a 1
1 | name1 2021-01-02 b 2
2 | name1 2021-01-03 a 3
3 | name1 2021-01-04 b 4
4 | name2 2021-01-05 b 5
5 | name2 2021-01-06 a 6
6 | name2 2021-01-07 b 7
7 | name2 2021-01-08 a 8
This is the desired output df:
DF_out = dt.Frame(name=['name1', 'name1', 'name2', 'name2'],
date_a=['2021-01-01', '2021-01-03', '2021-01-06', '2021-01-08'],
date_b=['2021-01-02', '2021-01-04', '2021-01-07', None],
value_a=[1, 3, 6, 8],
value_b=[2, 4, 7, None])
| name date_a date_b value_a value_b
-- + ----- ---------- ---------- ------- -------
0 | name1 2021-01-01 2021-01-02 1 2
1 | name1 2021-01-03 2021-01-04 3 4
2 | name2 2021-01-06 2021-01-07 6 7
3 | name2 2021-01-08 NA 8 NA
If necessary the datatable Frames can be converted into a pandas DataFrame:
DF_in = DF_in.to_pandas()
Transformation:
I hope this explanation is understandable.
Thank you in advance
Thank you all very much for your answers. In the meantime I developed a solution that uses only datatable package a uses some workarounds for the current limitations:
code:
import math
import datatable as dt
from datatable import dt, f, by, update, join
DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
value=[1, 2, 3, 4, 5, 6, 7, 8])
def group_id(n):
l = [x for x in range(0, math.floor(n / 2))]
l = sorted(l * 2)
if n % 2 != 0:
try:
l.append(l[-1] + 1)
except IndexError:
l.append(0)
return l
DF_in['id'] = range(DF_in.nrows)
first_row = f.id==dt.min(f.id)
row_eq_b = dt.first(f.type)=="b"
remove_rows = first_row & row_eq_b
DF_in[:, update(remove_rows = ~remove_rows), 'name']
DF_in = DF_in[f[-1]==1, :-1]
group_count = DF_in[:, {"Count": dt.count()}, by('name')][:, 'Count'].to_list()[0]
group_id_column = []
for x in group_count:
group_id_column = group_id_column + group_id(x)
DF_in['group_id'] = dt.Frame(group_id_column)
df1 = DF_in[f.type == 'a', ['name', 'date', 'value', 'group_id']]
df2 = DF_in[f.type == 'b', ['name', 'date', 'value', 'group_id']]
df2.key = ['name', 'group_id']
DF_out = df1[:, :, join(df2)]
DF_out.names = {'date': 'date_a', 'value': 'value_a', 'date.0': 'date_b', 'value.0': 'value_b'}
DF_out[:, ['name', 'date_a', 'date_b', 'value_a', 'value_b']]
| name date_a date_b value_a value_b
-- + ----- ---------- ---------- ------- -------
0 | name1 2021-01-01 2021-01-02 1 2
1 | name1 2021-01-03 2021-01-04 3 4
2 | name2 2021-01-06 2021-01-07 6 7
3 | name2 2021-01-08 NA 8 NA