I am preprocessing text for classification, and I import my dataset like this:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 2)
dataset
prints on terminal:
lyrics,classification
0 I should have known better with a girl like yo...
1 You can shake an apple off an apple tree\nShak...
2 It's been a hard day's night\nAnd I've been wo...
3 Michelle, ma belle\nThese are words that go to...
However, when I inspect the variable dataset more closely in Spyder, I see that I have only one column instead of the desired two.
Since the lyrics themselves contain commas, a "," delimiter would not work.
How do I correct my dataframe above so that it has:
1) one column for lyrics
2) one column for classification
with the corresponding data in each row?
If your lyrics themselves do not contain commas (they most likely do), then you can use read_csv with delimiter=','.
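For the comma-free case, a minimal sketch might look like the following. The two sample rows are made up for illustration and io.StringIO stands in for the real lyrics.csv:

```python
import io
import pandas as pd

# Made-up sample rows (not from lyrics.csv): no commas inside the lyrics,
# so ',' is an unambiguous column separator.
raw = (
    "lyrics,classification\n"
    "I should have known better with a girl like you,0\n"
    "You can shake an apple off an apple tree,0\n"
)

simple = pd.read_csv(io.StringIO(raw), delimiter=",")
print(simple.columns.tolist())
```

This only works as long as no lyric contains a comma of its own.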
However, if that is not an option, you could use str.rsplit:
# n=1 splits only on the last comma, so commas inside the lyrics are kept
dataset.iloc[:, 0].str.rsplit(',', n=1, expand=True)
df
lyrics,classification
0 I should have known better with a girl like yo...
1 You can shake an...,0
2 It's been a hard day's night...,0
df = df.iloc[:, 0].str.rsplit(',', n=1, expand=True)
df.columns = ['lyrics', 'classification']
df
lyrics classification
0 I should have known better with a girl like yo... 0
1 You can shake an... 0
2 It's been a hard day's night... 0
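Putting the steps together, here is a self-contained sketch of the rsplit approach. The sample rows and labels are invented for illustration, io.StringIO stands in for lyrics.csv, and the final cast to int assumes the labels are integers:

```python
import io
import pandas as pd

# Invented sample rows: the lyrics contain commas, and the label follows
# the last comma on each line.
raw = (
    "lyrics,classification\n"
    "I should have known better, with a girl like you,0\n"
    "You can shake an apple off an apple tree,0\n"
    "It's been a hard day's night, and I've been working,1\n"
)

# Reading with a tab separator leaves each line as one string column.
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Split once, starting from the right, so only the last comma is used.
split = df.iloc[:, 0].str.rsplit(",", n=1, expand=True)
split.columns = ["lyrics", "classification"]

# Assumption: labels are integers; skip this cast if they are not.
split["classification"] = split["classification"].astype(int)
print(split)
```

Passing n as a keyword matters here, since recent pandas versions no longer accept it positionally.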