python, string, nlp, mwe

Substring search for multiword strings - Python


I want to check a set of sentences and see whether any seed words occur in them. But I want to avoid a plain substring test such as "seed in line", because that would report that the seed word "ring" appears in a doc containing the word "bring".

I also want to check whether multiword expressions (MWEs) such as "words with spaces" appear in the document.

I've tried this, but it is very slow. Is there a faster way of doing this?

seed = ['words with spaces', 'words', 'foo', 'bar', 
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

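docs_seed = []
for d in docs:
  toAdd = False
  for s in seed:
    # Multiword seeds: plain substring test against the whole doc.
    if " " in s:
      if s in d:
        toAdd = True
    # Single-word seeds: compare against whole tokens so 'ring' does not match 'bring'.
    if s in d.split(" "):
      toAdd = True
    # Keep only the first seed that matches this doc.
    if toAdd:
      docs_seed.append((s,d))
      break
print(docs_seed)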

The desired output should be this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]

Solution

  • Consider using a regular expression:

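    import re

    # Escape each seed and join them into one alternation between word boundaries.
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
    pattern.findall(line)  # line is a single document string
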

    \b matches the start or end of a "word" (a sequence of word characters), so the seed "ring" cannot match inside "bring".

    Example:

    >>> for line in docs:
    ...     print(pattern.findall(line))
    ... 
    ['words with spaces', 'bar']
    ['foo', 'bar']
    ['bar', 'bar']
    []
    []
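
    To get the (seed, doc) pairs asked for in the question, one option is to take only the first match per document with pattern.search. The following is a minimal sketch, not a definitive implementation: the function name seeded_docs is hypothetical, and it assumes that "the first seed found when reading the doc left to right" is an acceptable stand-in for the question's "first seed in list order". For the sample data the two agree.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]


def seeded_docs(seed, docs):
    """Hypothetical helper: pair each matching doc with its leftmost seed match."""
    # One compiled alternation for all seeds, anchored at word boundaries.
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
    result = []
    for d in docs:
        m = pattern.search(d)  # leftmost match only; None if no seed occurs
        if m:
            result.append((m.group(0), d))
    return result


docs_seed = seeded_docs(seed, docs)
print(docs_seed)
```

    At a given position the alternation tries seeds in list order, so here the third doc is paired with 'bar'. If the longer expression should win instead, sorting the seeds by length in descending order before joining them would make that doc pair with 'bar bar'.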