pandas, csv, dataframe, tokenize

How to tokenize a single column in a CSV file with 2 columns using Pandas DataFrame


I am trying to perform a sentiment analysis using a Bayesian Classifier and I have a CSV file consisting of rows with the following structure:

Column 1: Either 1 or 0 
Column 2: String 

Example: 1 | This is a great movie 

I am using Pandas when reading the CSV file (read_csv).

After the file is read, each row has the following structure:

1;This is a great movie
0;This is a bad movie

I would like to tokenize each string in column 2. However, I have not managed to do this. How do I tackle this problem?


Solution

  • Assuming df looks like the following (if your file has a header, replace the column name 0 with your own column name):

                            0
    0  1;This is a great movie
    1    0;This is a bad movie
    
    # Split each row at the first ";" only, so semicolons inside the text are preserved
    pd.DataFrame(df[0].apply(lambda x: x.split(";", 1)).values.tolist(), columns=['A', 'B'])
       A                      B
    0  1  This is a great movie
    1  0    This is a bad movie
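
The split above separates the label from the text but does not yet tokenize the text. A minimal sketch of one way to do both is shown here, assuming the file has no header row and that plain whitespace tokenization is good enough for the classifier. The in-memory `raw` buffer and the column names `label`, `text`, and `tokens` are placeholders for illustration, not names from the question.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file described in the question.
raw = io.StringIO("1;This is a great movie\n0;This is a bad movie\n")

# sep=";" splits the two fields at read time; the column names are assumptions.
df = pd.read_csv(raw, sep=";", header=None, names=["label", "text"])

# Naive tokenization: lowercase, then split on runs of whitespace.
df["tokens"] = df["text"].str.lower().str.split()

print(df["tokens"].tolist())
# [['this', 'is', 'a', 'great', 'movie'], ['this', 'is', 'a', 'bad', 'movie']]
```

Whitespace splitting leaves punctuation attached to words, so a dedicated tokenizer may be a better fit for real review text; it could be applied to the same column with `df["text"].apply(...)`.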