Tags: python, csv, web-scraping, comparison, string-conversion

Python comparing get_text item against a list item


Moving along with my Python project, I've stumbled upon one more frustrating phase.

I've now got a snippet of code that finds the last post date from a forum and keeps it both in a temporary variable (which I wanted to use for checking against each date) and in a global one for further use throughout the script.

The method I'm trying to use is to fetch all last post dates from the forum and compare them against the dates already stored in a .csv file to see whether any new posts were made; if not, just don't scrape / mine the data.

That's the exact part I am struggling with: I cannot compare my mined (get_text) element against an item from the .csv list.

Any ideas would be appreciated. I've tried multiple methods and left the last one below, which still does not work.

Code:

#Preparing csv file to be read through to check if dates match
storedDates = open(os.path.expanduser("PostDates.csv"))
csv_storedDates = csv.reader(storedDates)
dateRow = list(csv_storedDates) #Storing all the dates as a "List" object
listLength = len(dateRow) #Grabbing the csv List length
startingDate = 0 #Variable for looping through each date for each post.

lPostDate = lPostDate2 = ""

#Looping 6 times (as that's how many pages each forum has), collecting the next page link, each thread's title, its link
#.. and last post date (to know how recent it is), then assigning the next page link to the current url and continuing the loop.
while number < 6:
    for postDate in soup.find_all(title=re.compile("^Replies:")):
        tempData = ""
        tempData += (postDate.get_text("\n", strip=True)[0:10] + "\n")
        lPostDate += (postDate.get_text("\n", strip=True)[0:10] + "\n")
        if any(tempData in s for s in dateRow[startingDate]):
            print("Matched a date" + tempData + "to one from database" + dateRow[startingDate])
            startingDate +=1
        else :
            startingDate += 1
            print("Date " + tempData + "was not matched to anything" + str(dateRow[startingDate]))

This is JUST a part of the code, but it is the only bit I am trying to get working at the moment. Assume that PostDates.csv already has information in it. Also, this is what the output looks like:

Date 02-11-2017
was not matched to anything['02-11-2017']
Date 01-10-2017
was not matched to anything['01-10-2017']
Date 02-12-2017
was not matched to anything['02-12-2017']
Date 10-01-2016
was not matched to anything['10-01-2016']
Date 09-30-2016
was not matched to anything['09-30-2016']
Date 08-10-2016
was not matched to anything['08-10-2016']
Date 10-01-2015
was not matched to anything['10-01-2015']
Date 10-01-2015
was not matched to anything['10-01-2015']
Date 08-29-2015
was not matched to anything['08-29-2015']
Date 03-16-2015
was not matched to anything['03-16-2015']
Date 07-16-2014
was not matched to anything['07-16-2014']
Date 07-13-2014
was not matched to anything['07-13-2014']
Date 02-11-2014
was not matched to anything['02-11-2014']
Date 07-02-2013
was not matched to anything['07-02-2013']
Date 06-28-2013
was not matched to anything['06-28-2013']
Date 04-22-2013
was not matched to anything['04-22-2013']
Date 05-28-2012
was not matched to anything['05-28-2012']
Date 05-25-2012
was not matched to anything['05-25-2012']
Date 05-09-2012
was not matched to anything['05-09-2012']
Date 06-10-2010
was not matched to anything['06-10-2010']
Date 01-18-2010
was not matched to anything['01-18-2010']
Date 01-18-2010
was not matched to anything['01-18-2010']
Date 12-29-2009
was not matched to anything['12-29-2009']
Date 06-08-2009
was not matched to anything['06-08-2009']
Date 02-02-2009
was not matched to anything['02-02-2009']
Date 11-24-2008
was not matched to anything['11-24-2008']
Date 09-02-2008
was not matched to anything['09-02-2008']
Date 08-07-2008
was not matched to anything['08-07-2008']
Date 06-05-2008
was not matched to anything['06-05-2008']
Date 05-22-2008
was not matched to anything['05-22-2008']
Date 04-21-2008
was not matched to anything['04-21-2008']
Date 03-29-2008
was not matched to anything['03-29-2008']
1
Date 02-11-2017
was not matched to anything['02-11-2017']
Date 01-10-2017
was not matched to anything['01-10-2017']
Date 11-07-2007
was not matched to anything['11-07-2007']
Date 11-07-2007
was not matched to anything['11-07-2007']
Date 09-19-2007
was not matched to anything['09-19-2007']
Date 09-01-2007
was not matched to anything['09-01-2007']
Date 08-31-2007
was not matched to anything['08-31-2007']
Date 08-31-2007
was not matched to anything['08-31-2007']
Date 08-30-2007
was not matched to anything['08-30-2007']
Date 08-24-2007
was not matched to anything['08-24-2007']
Date 08-19-2007
was not matched to anything['08-19-2007']
Date 08-08-2007
was not matched to anything['08-08-2007']
Date 08-03-2007
was not matched to anything['08-03-2007']
Date 07-29-2007
was not matched to anything['07-29-2007']
Date 07-18-2007
was not matched to anything['07-18-2007']
Date 06-26-2007
was not matched to anything['06-26-2007']
Date 06-26-2007
was not matched to anything['06-26-2007']
Date 01-12-2007
was not matched to anything['01-12-2007']
Date 12-05-2006
was not matched to anything['12-05-2006']
Date 11-16-2006
was not matched to anything['11-16-2006']
Date 11-05-2006
was not matched to anything['11-05-2006']
Date 11-05-2006
was not matched to anything['11-05-2006']
Date 11-03-2006
was not matched to anything['11-03-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-12-2006
was not matched to anything['09-12-2006']
Date 08-17-2006
was not matched to anything['08-17-2006']
Date 08-07-2006
was not matched to anything['08-07-2006']
Date 08-02-2006
was not matched to anything['08-02-2006']
Date 07-16-2006
was not matched to anything['07-16-2006']
Date 07-07-2006
was not matched to anything['07-07-2006']

I did not paste the output after page 2, as it is 6 pages long, so quite a lot of data.

And this is what it looks like once it has been scraped and stored in the .csv file (the dateRow variable):

Date,
02-11-2017
01-10-2017
02-12-2017
10-01-2016
09-30-2016
08-10-2016
10-01-2015
10-01-2015
08-29-2015
03-16-2015
07-16-2014
07-13-2014
02-11-2014
07-02-2013
06-28-2013
04-22-2013
05-28-2012
05-25-2012
05-09-2012
06-10-2010
01-18-2010
01-18-2010
12-29-2009
06-08-2009
02-02-2009
11-24-2008
09-02-2008
08-07-2008
06-05-2008
05-22-2008
04-21-2008
03-29-2008
02-11-2017
01-10-2017
11-07-2007
11-07-2007
09-19-2007
09-01-2007
08-31-2007
08-31-2007

Any advice on how to process it so that it finds the matching dates would be greatly appreciated, thank you!


Solution

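  • Just to sum up our conversation in the comments: you typed any(tempData in s for s in dateRow[startingDate]) and I suspected a type mismatch, so it is worth checking what any() actually receives. any() is defined as follows:

    any(iterable) Return True if any element of the iterable is true. If the iterable is empty, return False. Equivalent to:

    def any(iterable):
        for element in iterable:
            if element:
                return True
        return False
    

    And your code, when taken apart, gives something like this:

    >>> # The parentheses make this a generator expression
    >>> iterable = (tempData in s for s in dateRow[startingDate])
    >>> any(iterable)
    False
    

    But is it really iterable? Let's see:

    >>> type(iterable)
    <class 'generator'>
    

    It is. A generator is an iterable, so any() consumes it just as it would consume the equivalent list:

    >>> type([tempData in s for s in dateRow[startingDate]])
    <class 'list'>
    

    The generator has __iter__ too:

    >>> hasattr(iterable, '__iter__')
    True
    

    So any() is not the problem: it returns False because every tempData in s test is False. tempData is built as get_text(...)[0:10] + "\n", so it ends with a newline (that is why "was not matched to anything" starts on a new line in your output), while the items read from the .csv file do not, and '02-11-2017\n' is never a substring of '02-11-2017'. Judging by the .csv shown, dateRow[0] is also the "Date," header row, and the else branch increments startingDate before printing, so the row that gets printed is not the row that was compared. Compare the stripped text instead, e.g. tempData.strip(), and skip the header row.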
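
A minimal sketch of how the comparison might look once the whitespace is stripped, under a few assumptions: CSV_TEXT and scraped are hypothetical stand-ins for PostDates.csv and the get_text() results, load_stored_dates and find_new_dates are illustrative helper names that do not appear in the question, and a set lookup is swapped in for the row-by-row comparison.

```python
import csv
import io

# Hypothetical stand-in for PostDates.csv, using the layout from the question.
CSV_TEXT = "Date,\n02-11-2017\n01-10-2017\n02-12-2017\n"

# Hypothetical stand-in for the scraped values; the question's code appends "\n".
scraped = ["02-11-2017\n", "01-10-2017\n", "03-01-2017\n"]


def load_stored_dates(handle):
    """Return the set of dates in the first CSV column, skipping the header."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the "Date," header row
    return {row[0].strip() for row in reader if row and row[0].strip()}


def find_new_dates(scraped_dates, stored_dates):
    """Return the scraped dates (whitespace stripped) that are not stored yet."""
    return [d.strip() for d in scraped_dates if d.strip() not in stored_dates]


stored = load_stored_dates(io.StringIO(CSV_TEXT))
new_dates = find_new_dates(scraped, stored)

print(repr(scraped[0]))  # repr() makes the trailing newline visible
print(new_dates)         # dates that would justify scraping again
```

With a real file, io.StringIO(CSV_TEXT) would be replaced by an open file handle. Since several threads can share the same last post date, a date-only lookup cannot tell which thread changed; storing each date together with its thread link would be one way around that.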