I am trying to search two lists for common strings. The first list is a file with text, while the second is a list of words, each preceded by its logarithmic probability. To match, a word not only needs to be in both lists, but must also have a certain minimal log probability (for instance, between -2.123456 and 0.000000, that is, from negative 2 up to 0). The tab-separated list can look like:
-0.962890 dog
-1.152454 lol
-2.050454 cat
I got stuck doing something like this:
common = []
for i in list1:
    if i in list2 and re.search("\-[0-1]\.[\d]+", list2):
        common.append(i)
The idea of simply preprocessing the list to remove lines under a certain threshold is valid of course, but since both the word and its probability are on the same line, isn’t a single condition also possible? (Regexps aren’t necessary, but for comparison, solutions both with and without them would be interesting.)
EDIT: own answer to this question below.
Answering my own question after hours of trial and error and reading tips from here and there. It turns out I was thinking in the right direction from the start, but needed to separate word detection from pattern matching and instead combine the latter with the log probability check: first build a temporary list of items with the required log probability, then compare that against the text file.
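import re

common = []
prob = []
loga, rithmus = -9.87, -0.01  # lower and upper log probability bounds
# list2 is the whole tab-separated file read into a single string
for i in re.findall(r"-\d\.\d+\t\S+", list2):  # match "score<TAB>word"
    if loga < float(i.split()[0]) < rithmus:
        prob.append(i.split()[1])  # keep only the word
for i in list1:
    if i in prob:
        common.append(i)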
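
For comparison, here is a minimal sketch of the same idea without regular expressions. It assumes the second list is available as lines of the form "score<TAB>word" and the text as a list of words. The names parse_scores and common_words, the default bounds, and the extra "bird" line are hypothetical, made up for illustration only.

```python
def parse_scores(lines):
    """Map each word to its log probability from 'score<TAB>word' lines."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        score, word = parts
        scores[word] = float(score)
    return scores


def common_words(text_words, scores, low=-2.123456, high=0.0):
    """Return words from text_words whose log probability lies in [low, high]."""
    return [w for w in text_words if w in scores and low <= scores[w] <= high]


# Sample data: the first three lines come from the question,
# the "bird" line is an invented out-of-range entry.
list2_lines = ["-0.962890\tdog", "-1.152454\tlol", "-2.050454\tcat", "-3.500000\tbird"]
list1 = "the dog chased the cat and a bird".split()

scores = parse_scores(list2_lines)
print(common_words(list1, scores))  # ['dog', 'cat']
```

Because the word and its score end up in one dictionary entry, membership and threshold can be tested in a single condition, and whole words are compared rather than substrings.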