Search code examples
pythonpandasdataframemachine-learningtext-classification

Create columns based on whether specific substrings exist in column


I have a DataFrame with some 270,000 books and I'd like to get_dummies based on the Book-Title, but only use a select number of words. In the end, I'd like to use 200-300 really distinguishing words, like 'Mystery', 'Murder', 'Classic', and 'Science' and see if they're in the Book-Title for each book. I'd like a column for each one of these words that is filled with 0's if the word isn't found and 1's if it is found.

If I try to get_dummies for the entire column, I end up with hundreds of thousands of different columns and I don't have the RAM to do that kind of processing.

Below is an example of what I'd like. I already have a list of the 200 most-frequent words among the Book Titles, I just don't know how to make columns from this list.

Input:

    ISBN        Book-Title                       Book-Author            Year-Of-Publication     Publisher
0   0195153448  Classical Mythology              Mark P. O. Morford     2002     Oxford University Press
1   0002005018  Clara Callan                     Richard Bruce Wright   2001     HarperFlamingo Canada
2   0060973129  Decision in Normandy             Carlo D Este           1991     HarperPerennial
3   0374157065  Flu: The Story of the Great...   Gina Bari Kolata       1999     Farrar Straus Giroux
4   0393045218  The Mummies of Urumchi           E. J. W. Barber        1999     W. W. Norton & Company

Desired output:

    ISBN        Title                          'World'  'Mythology'  'Mystery'  'Mummies'
0   0195153448  Classical Mythology             0        1            0          0
1   0002005018  Clara Callan                    0        0            0          0
2   0060973129  Decision in Normandy            0        0            0          0
3   0374157065  Flu: The Story of the Great...  0        0            0          0
4   0393045218  The Mummies of Urumchi          0        0            0          1  

Thank you in advance!

Adam


Solution

  • You can apply a function to 'Book-Title' column that iterates over the list of words to check if each word exists in each entry; and convert the output to a DataFrame:

    lst = ['World', 'Mythology', 'Mystery', 'Mummies']
    df[lst] = df['Book-Title'].apply(lambda x: pd.Series([int(w in x) for w in lst]))
    

    Output:

                        Book-Title  World  Mythology  Mystery  Mummies
    0          Classical Mythology      0          1        0        0
    1                 Clara Callan      0          0        0        0
    2         Decision in Normandy      0          0        0        0
    3  Flu: The Story of the Great      0          0        0        0
    4       The Mummies of Urumchi      0          0        0        1