I'm trying to find out which names from a list appear in a news text.
I have a big text file (around 100 MB) with many place names, one name per line. Here is part of the file:
Brasiel
Brasier Gap
Brasier Tank
Brasiilia
Brasil
Brasil Colonial
The news texts look like this:
"It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials.
Hanks is not the only celebrity to have tested positive for the virus. British actor Idris Elba also revealed last week he had tested positive."
For instance, in this text the strings Australia and Queensland should be found. I'm using the NLTK library to create ngrams from the news.
Here is my current code:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# reading the place-name file (one name per line)
file = open("top-ord.txt", "r")
values = file.readlines()

news = "It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials."

# ngrams_list is all ngrams (here 1- to 3-grams) from the news, joined back into strings
tokens = word_tokenize(news)
ngrams_list = [" ".join(gram) for n in range(1, 4) for gram in ngrams(tokens, n)]

for item in ngrams_list:
    if item in values:
        print(item)
This is too slow. How can I improve it?
Convert values to a set like so:
value_set = {country for country in values}
That should significantly speed things up, since membership lookup in a set runs in constant time on average, as opposed to linear time for a list.
Also, make sure you strip the trailing newlines when reading the file; readlines() keeps them, so entries like "Brasil\n" would never match your ngrams.
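Putting both points together, here is a minimal sketch of that approach. The file name, the tokenizer, and the 1- to 3-gram range are assumptions carried over from the question, not fixed requirements:

from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Build a set of stripped place names; set membership checks run in O(1) on average.
with open("top-ord.txt", "r") as f:
    value_set = {line.strip() for line in f}

news = "It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials."

tokens = word_tokenize(news)

# Check every 1- to 3-gram (joined back into a string) against the set.
for n in range(1, 4):
    for gram in ngrams(tokens, n):
        candidate = " ".join(gram)
        if candidate in value_set:
            print(candidate)

If some of your place names are longer than three words, raise the upper bound of the range (or compute it from the longest name in the file) so those names can still be matched.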