We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like
This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data
tsv file data looks as below
col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1
Want to project as table like structure
col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|
qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 0 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 0 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 1 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 1 |
col4 and col5 can differ in each IOT file. How to achieve with python, pandas?
Assuming you can rely on "col1" to define the groups, you can use a pivot
after de-duplicating the rows with cumsum
and groupby.cumcount
, and groupby.ffill
:
df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
.assign(index=lambda x: x[0].eq('col1').cumsum(),
n=lambda x: x.groupby(['index', 0]).cumcount())
.pivot(index=['index', 'n'], columns=0, values=1)
.groupby(level='index').ffill()
.reset_index(drop=True).rename_axis(columns=None)
)
Output:
col1 col2 col3 col4 col5
0 qweqweq 345435 01/01/2024 35:08:09 1 0
1 qweqweq 345435 01/01/2024 35:08:09 0 0
2 qweqweq 345435 01/01/2024 35:08:09 1 1
3 qweqweq 345435 01/01/2024 35:08:09 0 1
Reproducible input:
import io
input_file = io.StringIO('''col1\tqweqweq
col2\t345435
col3\t01/01/2024 35:08:09
col4\t1
col5\t0
col4\t0
col5\t0
col4\t1
col5\t1
col4\t0
col5\t1''')
Intermediates:
# before pivot
0 1 index n
0 col1 qweqweq 1 0
1 col2 345435 1 0
2 col3 01/01/2024 35:08:09 1 0
3 col4 1 1 0
4 col5 0 1 0
5 col4 0 1 1
6 col5 0 1 1
7 col4 1 1 2
8 col5 1 1 2
9 col4 0 1 3
10 col5 1 1 3
# before the cleanup-step:
0 col1 col2 col3 col4 col5
index n
1 0 qweqweq 345435 01/01/2024 35:08:09 1 0
1 qweqweq 345435 01/01/2024 35:08:09 0 0
2 qweqweq 345435 01/01/2024 35:08:09 1 1
3 qweqweq 345435 01/01/2024 35:08:09 0 1