I have a DataFrame with some 270,000 books and I'd like to get_dummies based on the Book-Title
, but only use a select number of words. In the end, I'd like to use 200-300 really distinguishing words, like 'Mystery', 'Murder', 'Classic', and 'Science' and see if they're in the Book-Title
for each book. I'd like a column for each one of these words that is filled with 0's if the word isn't found and 1's if it is found.
If I try to get_dummies for the entire column, I end up with hundreds of thousands of different columns and I don't have the RAM to do that kind of processing.
Below is an example of what I'd like. I already have a list of the 200 most-frequent words among the Book Titles, I just don't know how to make columns from this list.
Input:
ISBN Book-Title Book-Author Year-Of-Publication Publisher
0 0195153448 Classical Mythology Mark P. O. Morford 2002 Oxford University Press
1 0002005018 Clara Callan Richard Bruce Wright 2001 HarperFlamingo Canada
2 0060973129 Decision in Normandy Carlo D Este 1991 HarperPerennial
3 0374157065 Flu: The Story of the Great... Gina Bari Kolata 1999 Farrar Straus Giroux
4 0393045218 The Mummies of Urumchi E. J. W. Barber 1999 W. W. Norton & Company
Desired output:
ISBN Title 'World' 'Mythology' 'Mystery' 'Mummies'
0 0195153448 Classical Mythology 0 1 0 0
1 0002005018 Clara Callan 0 0 0 0
2 0060973129 Decision in Normandy 0 0 0 0
3 0374157065 Flu: The Story of the Great... 0 0 0 0
4 0393045218 The Mummies of Urumchi 0 0 0 1
Thank you in advance!
Adam
You can apply a function to 'Book-Title'
column that iterates over the list of words to check if each word exists in each entry; and convert the output to a DataFrame:
lst = ['World', 'Mythology', 'Mystery', 'Mummies']
df[lst] = df['Book-Title'].apply(lambda x: pd.Series([int(w in x) for w in lst]))
Output:
Book-Title World Mythology Mystery Mummies
0 Classical Mythology 0 1 0 0
1 Clara Callan 0 0 0 0
2 Decision in Normandy 0 0 0 0
3 Flu: The Story of the Great 0 0 0 0
4 The Mummies of Urumchi 0 0 0 1