Tags: python, pandas, csv, types, sklearn-pandas

Dataframe's column conversion from type object to int / float using Pandas Python


The Scenario

I have two CSV files, (1) u.Data and (2) prediction_matrix, which I need to read and write into a single DataFrame. Once combined, the data will be clustered based on the int / float values it contains.

The Problem

I have combined the two CSVs into one DataFrame, saved as AllData.csv, but the columns holding the values now have a different type (object), as shown below (with a warning):

sys:1: DtypeWarning: Columns (0,1,2) have mixed types. Specify dtype option on import or set low_memory=False.
UDATA -------------
uid    int64
iid    int64
rat    int64
dtype: object
PRED_MATRIX -------
uid      int64
iid      int64
rat    float64
dtype: object
AllDATA -----------
uid    object
iid    object
rat    object
dtype: object

P.S. I know how to use low_memory=False, but that just suppresses the warning.

The Possible Cause

with open('AllData.csv', 'w') as handle:
    udata_df.to_csv(handle, index=False)
    pred_matrix.to_csv(handle, index=False)

Since I need to write two CSVs into a single DataFrame, a file handle object is used, and that probably turns all the values into its own type. Can anything preserve the data types while applying the same logic?
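
One way to check this suspicion is to reproduce the write in memory and look for rows that cannot be parsed as numbers. The sketch below is only illustrative: the two small frames are hypothetical stand-ins for the real data, and an `io.StringIO` buffer is assumed in place of AllData.csv.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two frames described in the question.
udata_df = pd.DataFrame({"uid": [1, 2], "iid": [8, 9], "rat": [0, 3]})
pred_matrix = pd.DataFrame({"uid": [1, 2], "iid": [8, 9], "rat": [0.5, 3.5]})

# Write both frames through one handle, as in the snippet above.
# Each to_csv call emits its own header line.
buffer = io.StringIO()
udata_df.to_csv(buffer, index=False)
pred_matrix.to_csv(buffer, index=False)
buffer.seek(0)

# low_memory=False only affects how types are inferred while reading;
# the columns still come back non-numeric.
combined = pd.read_csv(buffer, low_memory=False)
print(combined.dtypes)

# Rows whose 'uid' cannot be parsed as a number are the ones that
# force the whole column to a non-numeric dtype.
not_numeric = pd.to_numeric(combined["uid"], errors="coerce").isna()
bad_rows = combined[not_numeric]
print(bad_rows)
```

If the only row reported is a copy of the column names, the handle itself is not changing any types; the second header line is simply being read back as data.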

Solution

  • The problem is that the header of the second DataFrame is written too, so the parameter header=False is needed:

    with open('AllData.csv', 'w') as handle:
        udata_df.to_csv(handle, index=False)
        pred_matrix.to_csv(handle, index=False, header=False)
    

    Another solution is mode='a' to append the second DataFrame:

    f = 'AllData.csv'
    udata_df.to_csv(f, index=False)
    pred_matrix.to_csv(f,header=False, index=False, mode='a')
    

    Or use concat:

    f = 'AllData.csv'
    pd.concat([udata_df, pred_matrix]).to_csv(f, index=False)
    

    Sample:

    import pandas as pd

    udata_df = pd.DataFrame({'uid':[1,2],
                             'iid':[8,9],
                             'rat':[0,3]})
    
    pred_matrix = udata_df * 10
    

    Without header=False, the third row of the result is the second header:

    with open('AllData.csv', 'w') as handle:
        udata_df.to_csv(handle, index=False)
        pred_matrix.to_csv(handle, index=False)
    
    f = 'AllData.csv'
    df = pd.read_csv(f)
    print (df)
       iid  rat  uid
    0    8    0    1
    1    9    3    2
    2  iid  rat  uid
    3   80    0   10
    4   90   30   20
    

    With the parameter header=False it works correctly:

    with open('AllData.csv', 'w') as handle:
        udata_df.to_csv(handle, index=False)
        pred_matrix.to_csv(handle, index=False, header=False)
    
    f = 'AllData.csv'
    df = pd.read_csv(f)
    print (df)
       iid  rat  uid
    0    8    0    1
    1    9    3    2
    2   80    0   10
    3   90   30   20
    

    Append-mode (mode='a') solution:

    f = 'AllData.csv'
    udata_df.to_csv(f, index=False)
    pred_matrix.to_csv(f,header=False, index=False, mode='a')
    df = pd.read_csv(f)
    print (df)
       iid  rat  uid
    0    8    0    1
    1    9    3    2
    2   80    0   10
    3   90   30   20
    

    concat solution:

    f = 'AllData.csv'
    pd.concat([udata_df, pred_matrix]).to_csv(f, index=False)
    df = pd.read_csv(f)
    print (df)
       iid  rat  uid
    0    8    0    1
    1    9    3    2
    2   80    0   10
    3   90   30   20
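
If AllData.csv has already been written with the extra header and cannot easily be regenerated, the object columns can also be repaired after reading. The following is a minimal sketch, not part of the original answer: the CSV text is a hypothetical example, and it assumes the only non-numeric rows are repeated header lines.

```python
import io

import pandas as pd

# Hypothetical contents of an AllData.csv written with two header lines.
raw = "uid,iid,rat\n1,8,0\n2,9,3\nuid,iid,rat\n10,80,0.5\n20,90,30.0\n"
df = pd.read_csv(io.StringIO(raw))

# A repeated header shows up as a row whose 'uid' value is the text 'uid'.
is_header = df["uid"].astype(str).eq("uid")

# Drop those rows, then convert each column back to a numeric dtype.
# pd.to_numeric raises by default if any other non-numeric value remains.
cleaned = df.loc[~is_header].apply(pd.to_numeric).reset_index(drop=True)
print(cleaned.dtypes)
print(cleaned)
```

Here uid and iid should come back as integer columns and rat as a float column. Fixing the write step with header=False, mode='a', or concat remains the cleaner option, since it avoids producing the stray row in the first place.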