Search code examples
pythonsample

python: data was added an extra column after sampled to a small dataset


This is odd to me:

I have a 5 columns csv dataset with the first couple rows as below:

1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481

Code to sample the data:

df = pd.read_csv(
        filename,
        header=0
        )
ddf = df.sample(n=100150)
ddf.to_csv(samplename, sep=',')

Generated sampling file is like:

1,2268318,2520377,pv,1511544070
50533941,564438,1311198,2806049,pv,1511700754
53243335,689136,111697,154040,pv,1512264559
68168213,452960,119903,1151115,pv,1511711183
18695132,942795,2288684,1526856,pv,1511594854

You can see for some odd reason the sampling data contains 6 columns except the first row (first row is correct).

What's wrong with the code? How do I get a randomly generated sample (with adjustable ratio)?

I am using python 3.6 in spyder/pycharm.

Thank you.


Solution

  • I think problem is your first data row is set to columns names.

    So need change header=0 to header=None for default RangeIndex columns names.

    import pandas as pd
    
    temp=u"""1,2268318,2520377,pv,1511544070
    1,2333346,2520771,pv,1511561733
    1,2576651,149192,pv,1511572885
    1,3830808,4181361,pv,1511593493
    1,4365585,2520377,pv,1511596146
    1,4606018,2735466,pv,1511616481"""
    #after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
    df = pd.read_csv(pd.compat.StringIO(temp), header=None)
    
    print (df)
       0        1        2   3           4
    0  1  2268318  2520377  pv  1511544070
    1  1  2333346  2520771  pv  1511561733
    2  1  2576651   149192  pv  1511572885
    3  1  3830808  4181361  pv  1511593493
    4  1  4365585  2520377  pv  1511596146
    5  1  4606018  2735466  pv  1511616481
    

    Extra first column is called index, if need remove it if write DataFrame to csv use parameter index=False and for remove columns names header=None:

    df.to_csv(file, index=False, header=None)
    
    1,2268318,2520377,pv,1511544070
    1,2333346,2520771,pv,1511561733
    1,2576651,149192,pv,1511572885
    1,3830808,4181361,pv,1511593493
    1,4365585,2520377,pv,1511596146
    1,4606018,2735466,pv,1511616481
    

    EDIT:

    I suggest create custom columns names for easier working with data by parameter names (header=None is not necessary) and for remove some column use drop:

    import pandas as pd
    
    temp=u"""1,2268318,2520377,pv,1511544070
    1,2333346,2520771,pv,1511561733
    1,2576651,149192,pv,1511572885
    1,3830808,4181361,pv,1511593493
    1,4365585,2520377,pv,1511596146
    1,4606018,2735466,pv,1511616481"""
    #after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
    df = pd.read_csv(pd.compat.StringIO(temp), names=['col1','col2','col3','col4','col5'])
    print (df)
       col1     col2     col3 col4        col5
    0     1  2268318  2520377   pv  1511544070
    1     1  2333346  2520771   pv  1511561733
    2     1  2576651   149192   pv  1511572885
    3     1  3830808  4181361   pv  1511593493
    4     1  4365585  2520377   pv  1511596146
    5     1  4606018  2735466   pv  1511616481
    
    df1 = df.drop('col4', axis=1)
    print (df1)
       col1     col2     col3        col5
    0     1  2268318  2520377  1511544070
    1     1  2333346  2520771  1511561733
    2     1  2576651   149192  1511572885
    3     1  3830808  4181361  1511593493
    4     1  4365585  2520377  1511596146
    5     1  4606018  2735466  1511616481