I'm analyzing the text of Macbeth through the Project Gutenberg website, and I'm trying to create a list of the characters by mention of their names. I know there is a way to do this with nltk but I am trying to avoid that at this point. I'm getting the names by finding all instances of 'Enter' in the text, and then trying to remove all lowercase words. This is the code I have so far:
import requests
macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
macbeth = macbeth.split('.')
character_list = [sentence.split() for sentence in macbeth if 'Enter' in sentence]
for sublist in character_list:
    for string in sublist:
        if string.islower() == True:
            sublist.remove(string)
Here is an extract of the output I get when printing the result:
[['Enter', 'Witches'],
['Enter',
'King,',
'Malcome,',
'Donalbaine,',
'Lenox,',
'attendants,',
'a',
'Captaine'],
['Enter', 'Rosse', 'Angus'],
['Enter', 'three', 'Witches'],
['Enter', 'Macbeth', 'Banquo'],
["Toth'", 'tune', 'words:', 'here?', 'Enter', 'Rosse', 'Angus']
etc.
I'm having a hard time understanding why 'attendants', 'a', 'three', 'tune', etc. are not removed from each sublist. Am I missing something in the code I currently have?
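You are removing items from sublist while the for loop is still iterating over it, so the list changes underneath the loop. In for string in sublist, the loop tracks its position by index; when an item is removed, every later item shifts one position to the left, and the next iteration skips the element that moved into the removed slot. A lowercase word that directly follows a removed word is therefore never checked, which would explain why words like 'attendants,' and 'a' survive.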
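The skipping can be reproduced on a small list, and one common way around it is to build a new list instead of removing items in place. The following is only a sketch: the function names drop_lowercase_buggy and drop_lowercase and the sample list are made up for illustration and are not taken from the play's text.

```python
def drop_lowercase_buggy(words):
    """Mimic the loop from the question: remove items while iterating."""
    words = list(words)  # work on a copy so the caller's list is untouched
    for word in words:
        if word.islower():
            # Removing shifts later items left, so the next item is skipped.
            words.remove(word)
    return words


def drop_lowercase(words):
    """Build a new list that keeps only the words that are not all lowercase."""
    return [word for word in words if not word.islower()]


# Hypothetical stage direction, already split into words.
sample = ['Enter', 'King,', 'with', 'attendants,', 'meeting', 'a', 'Captaine']

print(drop_lowercase_buggy(sample))  # 'attendants,' and 'a' are skipped, so they remain
print(drop_lowercase(sample))        # only 'Enter', 'King,' and 'Captaine' remain
```

Applied to the code in the question, the same idea would be character_list = [[w for w in sublist if not w.islower()] for sublist in character_list]. Iterating over a copy (for string in sublist[:]) should also avoid the skipping, though the list comprehension is usually easier to read.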