This is odd to me:
I have a 5 columns csv dataset with the first couple rows as below:
1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481
Code to sample the data:
df = pd.read_csv(
filename,
header=0
)
ddf = df.sample(n=100150)
ddf.to_csv(samplename, sep=',')
Generated sampling file is like:
1,2268318,2520377,pv,1511544070
50533941,564438,1311198,2806049,pv,1511700754
53243335,689136,111697,154040,pv,1512264559
68168213,452960,119903,1151115,pv,1511711183
18695132,942795,2288684,1526856,pv,1511594854
You can see for some odd reason the sampling data contains 6 columns except the first row (first row is correct).
What's wrong with the code? How do I get a randomly generated sample (with adjustable ratio)?
I am using python 3.6 in spyder/pycharm.
Thank you.
I think problem is your first data row is set to columns names.
So need change header=0
to header=None
for default RangeIndex
columns names.
import pandas as pd
temp=u"""1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), header=None)
print (df)
0 1 2 3 4
0 1 2268318 2520377 pv 1511544070
1 1 2333346 2520771 pv 1511561733
2 1 2576651 149192 pv 1511572885
3 1 3830808 4181361 pv 1511593493
4 1 4365585 2520377 pv 1511596146
5 1 4606018 2735466 pv 1511616481
Extra first column is called index, if need remove it if write DataFrame
to csv
use parameter index=False
and for remove columns names header=None
:
df.to_csv(file, index=False, header=None)
1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481
EDIT:
I suggest create custom columns names for easier working with data by parameter names
(header=None
is not necessary) and for remove some column use drop
:
import pandas as pd
temp=u"""1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), names=['col1','col2','col3','col4','col5'])
print (df)
col1 col2 col3 col4 col5
0 1 2268318 2520377 pv 1511544070
1 1 2333346 2520771 pv 1511561733
2 1 2576651 149192 pv 1511572885
3 1 3830808 4181361 pv 1511593493
4 1 4365585 2520377 pv 1511596146
5 1 4606018 2735466 pv 1511616481
df1 = df.drop('col4', axis=1)
print (df1)
col1 col2 col3 col5
0 1 2268318 2520377 1511544070
1 1 2333346 2520771 1511561733
2 1 2576651 149192 1511572885
3 1 3830808 4181361 1511593493
4 1 4365585 2520377 1511596146
5 1 4606018 2735466 1511616481