Tags: python, numpy, scipy, data-analysis, python-datetime

How to find the correlation between two lists when one list consists of date values?


I'm trying to calculate the correlation between two lists every 30 days using the pearsonr function from scipy.

One list consists of dates (called dateValues), and the other one consists of sales (called saleNumbers). I already extracted the dates using datetime.strptime earlier and if I print out dateValues, I get a range of dates with an arbitrary length.

datetime.datetime(2016, 8, 12, 0, 0), datetime.datetime(2016, 8, 11, 0, 0), datetime.datetime(2016, 8, 10, 0, 0)...etc

And here is the sales list:

saleNumbers = [3567,2348,1234,....etc]

However, when I do

pearsonr(dateValues,saleNumbers)

I get the error

TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'datetime.datetime'

After searching endlessly, I found that one can use datetime.date to do arithmetic between dates.

So I did this:

print(datetime.date(dateValues[0]) - datetime.date(dateValues[29]))

And sure enough that gives me 30 days for the time difference.

So I then tried this:

pearsonr(datetime.date(dateValues[0]) - datetime.date(dateValues[29]),saleNumbers)

But I then get this error

TypeError: len() of unsized object

Any ideas on how I can move forward with this? Also I don't think datetime.date(dateValues[0]) - datetime.date(dateValues[2]) is the correct Pythonic way to handle the dates list when finding the correlation.

PS: This image shows an Excel spreadsheet with what I've already done and am trying to replicate in Python: https://i.sstatic.net/THUoX.jpg


Solution

  • Convert them to numeric values first:

    from datetime import datetime
    from scipy.stats import pearsonr
    arbitrary_date = datetime(1970,1,1)  # any fixed reference date works
    pearsonr([(d - arbitrary_date).total_seconds() for d in dateValues], saleNumbers)
    

    Pearson correlation is unaffected by translation or positive scaling of either axis (positive affine transformations), so the choice of reference date and time unit does not change the result.
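To make the invariance concrete, here is a minimal sketch that converts the same dates in two different ways and then computes one coefficient per 30-entry window. Everything in it is an assumption for illustration: the sample data is made up, `to_number` is a hypothetical helper, the windows are taken as non-overlapping runs of 30 daily entries, and `pearson_r` is a plain-Python stand-in for the first value returned by `scipy.stats.pearsonr`, used only so the sketch runs without SciPy.

```python
from datetime import datetime, timedelta
from math import sqrt


def pearson_r(xs, ys):
    """Plain-Python Pearson r (stand-in for scipy.stats.pearsonr(xs, ys)[0])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def to_number(d, origin, unit_seconds):
    """Hypothetical helper: time elapsed since `origin`, in units of `unit_seconds`."""
    return (d - origin).total_seconds() / unit_seconds


# Made-up sample data: 60 consecutive days, newest first, as in the question.
start = datetime(2016, 8, 12)
dateValues = [start - timedelta(days=i) for i in range(60)]
saleNumbers = [1000 + (37 * i) % 101 + 5 * i for i in range(60)]

# The same dates expressed as seconds since 1970 and as days since 2000.
seconds_1970 = [to_number(d, datetime(1970, 1, 1), 1) for d in dateValues]
days_2000 = [to_number(d, datetime(2000, 1, 1), 86400) for d in dateValues]

# Different origin and unit, same coefficient (up to floating-point rounding).
r_seconds = pearson_r(seconds_1970, saleNumbers)
r_days = pearson_r(days_2000, saleNumbers)

# One coefficient per consecutive, non-overlapping window of 30 entries.
window = 30
r_per_window = [
    pearson_r(days_2000[i:i + window], saleNumbers[i:i + window])
    for i in range(0, len(dateValues) - window + 1, window)
]
```

With SciPy installed, `pearson_r(x, y)` can be swapped for `pearsonr(x, y)[0]`. If the dates are not evenly spaced, slicing by index no longer corresponds to 30 calendar days, and the windows would need to be built by comparing dates instead.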