
Python: Matching Strings from an Array with Substrings from Texts in another Array


Currently I am crawling a webpage for newspaper articles using Python's BeautifulSoup library. These articles are stored in the object "details".

Then I have a couple of names of various streets that are stored in the object "lines". Now I want to search the articles for the street names that are contained in "lines".

If one of the street names is part of one of the articles, I want to save the name of the street in an array.

If there is no match for an article (the selected article does not contain any of the street names), then there should be an empty element in the array.

So for example, let's assume the object "lines" would consist of ("Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End").

The object "details" consists of 4 articles, of which 2 contain "Abbey Road" and "West Lane" (e.g. as in "Car accident on Abbey Road, three people hurt"). The other 2 articles don't contain any of names from "lines".

Then after matching, the result should be an array like this: ["", "Abbey Road", "", "West Lane"] (one element per article, with an empty element where no street name matched).

I was also told to use vectorization for this, as my original data sample is quite big. However, I'm not familiar with using vectorization for string operations. Has anyone worked with this already?

My code currently looks like this; however, it only returns "-1" as elements of my resulting array:

from bs4 import BeautifulSoup
import requests
import io
import re
import string
import numpy as np


my_list = []
for y in range (0, 2):
    y *= 27
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=' + 'einbruch' + '&start=' + i)



for link in my_list:
  #  print (link)
    r = requests.get(link)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.content, 'html.parser')



with open('a4.txt', encoding='utf8') as f:
        lines = f.readlines()
        lines = [w.replace('\n', '') for w in lines]    


        details = soup.find_all(class_='news-bodycopy')
        for class_element in details:
            details = class_element.get_text()

        sdetails = ''.join(details)
        slines = ''.join(lines)
        i = str.find(sdetails, slines[1 : 38506])
        print(i)                
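(For reference, the reason the snippet above prints -1: `''.join(lines)` glues all street names into one long string, and `str.find` then searches the article text for that entire glued string as a single substring, which never occurs. A minimal illustration with made-up sample data:)

```python
streets = ["Abbey Road", "West Lane"]
article = "Car accident on Abbey Road, three people hurt"

glued = ''.join(streets)          # 'Abbey RoadWest Lane'
print(article.find(glued))        # -1: the glued string is not a substring
print(article.find(streets[0]))   # 16: searching one name at a time works
```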

If someone wants to reproduce my experiment, the website URL is in the code above; the crawling and storing of articles in the object "details" works properly, so the code can just be copied.

The .txt file with my original data for the object "lines" can be accessed in this Dropbox folder: https://www.dropbox.com/s/o0cjk1o2ej8nogq/a4.txt?dl=0

Thanks a lot for any hints on how I can make this work, preferably via vectorization.


Solution

  • You could try something like this:

    from bs4 import BeautifulSoup
    import requests

    my_list = []
    for y in range(0, 2):
        my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=einbruch&start=' + str(y))

    # Collect the article bodies from every page, not just the last one.
    details = []
    for link in my_list:
        r = requests.get(link)
        soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), 'html.parser')
        details.extend(soup.find_all(class_='news-bodycopy'))

    with open('a4.txt', encoding='utf8') as f:
        lines = [line.rstrip('\r\n') for line in f]

    # One result element per article: the first matching street name,
    # or an empty string when no street name occurs in the article.
    result = []
    for article in details:
        text = article.get_text()
        matches = [street for street in lines if street and street in text]
        result.append(matches[0] if matches else '')
    print(result)
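  • Since the question asks for vectorization: the dominant cost above is scanning every article once per street name. A common speedup is to compile all street names into a single regular-expression alternation, so each article is scanned only once. A minimal sketch with made-up sample data standing in for "details" and "lines":

```python
import re

streets = ["Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End"]
articles = [
    "Local news roundup",
    "Car accident on Abbey Road, three people hurt",
    "Weather report for the weekend",
    "New shops opening on West Lane next month",
]

# One alternation pattern for all street names; escape special characters
# and sort longer names first so the longest possible name matches.
pattern = re.compile('|'.join(re.escape(s)
                              for s in sorted(streets, key=len, reverse=True)))

# One regex scan per article; '' marks articles with no match.
result = [(m.group(0) if m else '')
          for m in (pattern.search(text) for text in articles)]
print(result)  # ['', 'Abbey Road', '', 'West Lane']
```

The same precompiled pattern also works with pandas (e.g. `Series.str.extract`) if the articles are held in a Series, which is the usual "vectorized" form of this operation.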