Search code examples
pythonpandasscikit-learnsklearn-pandas

How to load this kind of data in pandas


Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.

DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS

So there are many such log files. Each with the test data. I have the the serial number of devices which failed in field. I want to create a model using these log files. And then use it to predict if the given device has a chance of failing in field given its log file.

Till now as a part of learning, I have worked with data like housing price. Every row was complete. Depending on area, number of rooms etc, it was easy to define a model for expected selling price.

Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:

DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
1/1/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS

Is this the right way to deal with this kind of data?

If so, then does Pandas has any inbuilt support for this?

I will be using scikit-learn to create models.


Solution

  • First convert columns to ordered CategoricalIndex for same order of columns in output, convert DATE column by to_datetime and convert datetimes to dates by Series.dt.date with cumcount for counter, create MultiIndex by set_index, reshape by unstack and sort second level of MultiIndex in columns by sort_index. Last flatten it by list comprehension with reset_index:

    df['DATE'] = pd.to_datetime(df['DATE'])
    dates = df['DATE'].dt.date
    
    df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
    g = df.groupby(dates).cumcount().add(1)
    df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
    df.columns = [f'{a}_{b}' for a, b in df.columns]
    df = df.reset_index(drop=True)
    print (df)
                       DATE_1  TESTSTEP_1  READING_1  LIMIT_1 RESULT_1  \
    0 2019-01-01 07:37:17.432           1         23       10     FAIL   
    
                       DATE_2  TESTSTEP_2  READING_2  LIMIT_2 RESULT_2  
    0 2019-01-01 07:37:23.661           2          3        3     PASS
    

    If need also dates in separate first column:

    df['DATE'] = pd.to_datetime(df['DATE'])
    dates = df['DATE'].dt.date
    
    df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
    g = df.groupby(dates).cumcount().add(1)
    df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
    df.columns = [f'{a}_{b}' for a, b in df.columns]
    df = df.reset_index()
    print (df)
             DAT                  DATE_1  TESTSTEP_1  READING_1  LIMIT_1 RESULT_1  \
    0 2019-01-01 2019-01-01 07:37:17.432           1         23       10     FAIL   
    
                       DATE_2  TESTSTEP_2  READING_2  LIMIT_2 RESULT_2  
    0 2019-01-01 07:37:23.661           2          3        3     PASS