Search code examples
pythonpandasdataframepivottranspose

How do I convert a series of repeated data rows into columns of multiple records using Python?


We have data from a device that measures multiple parts, and it outputs multiple measurements for each part into a CSV file. We read the CSV file into a dataframe with a structure such as this:

PartNo  12
Meas1   45
Meas2   23
!END
PartNo  13
Meas1   63
Meas2   73
!END
PartNo  12
Meas1   82
Meas2   84
!END

The "!END" flag indicates where the data from one part ends, and the next part starts. We would like to reshape the data so it looks like:

PartNo  Meas1   Meas2
12      45      23
13      63      73
12      82      84

(Note that a part could appear more than once - so there is no field that is guaranteed to be unique across all records.)

A pivot produces:

0   !END  Meas1  Meas2  PartNo
0    NaN    NaN    NaN    12.0
1    NaN   45.0    NaN     NaN
2    NaN    NaN   23.0     NaN
3    NaN    NaN    NaN     NaN
4    NaN    NaN    NaN    13.0
5    NaN   63.0    NaN     NaN
6    NaN    NaN   73.0     NaN
7    NaN    NaN    NaN     NaN
8    NaN    NaN    NaN    12.0
9    NaN   82.0    NaN     NaN
10   NaN    NaN   84.0     NaN
11   NaN    NaN    NaN     NaN

How do I squeeze these rows down to group by PartNo?
A transpose produces:

       0      1      2     3       4      5      6     7       8      9      10    11
0  PartNo  Meas1  Meas2  !END  PartNo  Meas1  Meas2  !END  PartNo  Meas1  Meas2  !END
1      12     45     23   NaN      13     63     73   NaN      12     82     84   NaN

How could I reset the row every 4th item?

I could create a new index column in the original dataframe, and then iterate through the rows incrementing the index for every row with !END (and then use the index to group the data), but it seems that there ought to be a more elegant shape shifting function to handle this case, or maybe there is an argument to Pivot or Transpose that would handle this. I am a Python beginner. Here is the full code:

import pandas as pd
from io import StringIO

tdata = (
'PartNo,    12\n'
'Meas1, 45\n'
'Meas2, 23\n'
'!END\n'
'PartNo,    13\n'
'Meas1, 63\n'
'Meas2, 73\n'
'!END\n'
'PartNo,    12\n'
'Meas1, 82\n'
'Meas2, 84\n'
'!END\n')
tdf = pd.read_csv(StringIO(tdata), header=None)
print(tdf)
print(tdf.pivot(index=None, columns=0, values=1))
print(tdf.T)

Solution

  • #having dataframe x:
    >>> x = pd.DataFrame([['PartNo',12],['Meas1',45],['Meas2',23],['!END',''],['PartNo',13],['Meas1',63],['Meas2',73],['!END',''],['PartNo',12],['Meas1',82],['Meas2',84],['!END','']])
    >>> x
             0   1
    0   PartNo  12
    1    Meas1  45
    2    Meas2  23
    3     !END    
    4   PartNo  13
    5    Meas1  63
    6    Meas2  73
    7     !END    
    8   PartNo  12
    9    Meas1  82
    10   Meas2  84
    11    !END    
    
    #grouping by first column, and aggregating values to list. First column then contains Series that you want. By converting each list in this series to series, dataframe is created, then you just need to transpose
    >>> df = x.groupby(0).agg(lambda x: list(x))[1].apply(lambda x: pd.Series(x)).transpose()
    >>> df[['PartNo','Meas1','Meas2']]
    0 PartNo Meas1 Meas2
    0     12    45    23
    1     13    63    73
    2     12    82    84