python linear-regression pearson-correlation

Simple prediction using Pearson correlation and linear regression with Python

I have a data set like this:

    Value   Month       Year 

    103.4   April       2006
    270.6   August      2006
    51.9    December    2006
    156.9   February    2006
    126.9   January     2006
    96.8    July        2006
    183.1   June        2006
    266.6   March       2006
    193.1   May         2006
    524.7   November    2006
    619.9   October     2006
    129     September   2006
    374.1   April       2007
    260.5   August      2007
    119.6   December    2007
    9.9     February    2007
    91.1    January     2007
    106.6   July        2007
    79.9    June        2007
    60.5    March       2007
    432.4   May         2007
    128.8   November    2007
    292.1   October     2007
    129.3   September   2007

Value is the rainfall for one district; let's call it districtA. I have the data set from 2006 to 2014, and I need to predict the rainfall for the next 2 years for districtA. I chose Pearson correlation and linear regression from the sklearn library to predict the data. I'm very confused and I don't know how to set X and Y. I'm new to Python, so any help is valuable. Thank you.

P.S. I found code like this:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

When I print diabetes_X_train, it gives me this:

[[ 0.07786339]
 [-0.03961813]
 [ 0.01103904]
 [-0.04069594]
 [-0.03422907]...]

I am assuming this is the r value obtained from the correlation coefficient. When I print diabetes_y_train, it gives me something like this:

[ 233.   91.  111.  152.  120.  .....]

My problem is: how can I get the r value from the rainfall data and assign it to the x axis?

Solution

This is not the best solution, but it works.

A little explanation: I replaced each month name with its index in a list of months, because the algorithm needs numeric input. I also replaced the space delimiters with ';' delimiters, because different rows had different numbers of spaces, which was inconvenient. Now your data is:

Value;Month;Year
103.4;April;2006
270.6;August;2006
51.9;December;2006

The file with the initial data is 'data.csv'.
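If you would rather not rewrite the delimiters by hand, here is a minimal sketch of reading the original space-separated layout directly. It assumes pandas is available and that no field contains spaces; the `raw` string and the `months` list are illustrative names, with rows copied from the question.

```python
import io

import pandas as pd

# A few rows from the question, with irregular runs of spaces between columns
raw = """Value   Month       Year
103.4   April       2006
270.6   August      2006
51.9    December    2006
129     September   2006
"""

# sep=r"\s+" treats any run of whitespace as a single delimiter
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Replace each month name with its 0-based position in the list
data["Month"] = data["Month"].map(months.index)

print(data)
```

The code that follows keeps the ';'-delimited 'data.csv'.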

import pandas as pd
import sklearn.linear_model as ll

data = pd.read_csv('data.csv', sep=';')
# Guard against stray spaces in the header row (e.g. 'Year ')
data.columns = data.columns.str.strip()

month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Replace each month name with its index in the list (January -> 0)
data['Month'] = data['Month'].map(month.index)

# Features are month index and year; the target is the rainfall value
X = data[['Month', 'Year']]
y = data['Value']

lr = ll.LinearRegression()
lr.fit(X, y)

######### TEST DATA ##########
# Month indexes are 0-based, so 1 is February and 2 is March
X_test = pd.DataFrame([[1, 2008], [2, 2008]], columns=['Month', 'Year'])

y_test = lr.predict(X_test)
print(y_test)

As a result of the test I got these values:

[69.23079837  80.63691725]
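About the r value: Pearson's r is a single number that summarises how linearly two columns move together. It is not something you put on the x axis. The numbers printed from diabetes_X_train are values of one input feature, not r values. A minimal sketch of computing r for the rainfall, assuming NumPy is available and using a hypothetical time index `t` (months since January 2006) as the second column; the 24 values are the rows from the question sorted chronologically:

```python
import numpy as np

# Rainfall from the question, reordered January 2006 .. December 2007
rain = np.array([
    126.9, 156.9, 266.6, 103.4, 193.1, 183.1,
    96.8, 270.6, 129.0, 619.9, 524.7, 51.9,
    91.1, 9.9, 60.5, 374.1, 432.4, 79.9,
    106.6, 260.5, 129.3, 292.1, 128.8, 119.6,
])

# Hypothetical time index: 0 for January 2006, 1 for February 2006, ...
t = np.arange(len(rain))

# Pearson correlation between time and rainfall
r = np.corrcoef(t, rain)[0, 1]

# Fit a straight line rain ~ slope * t + intercept and compute its R^2
slope, intercept = np.polyfit(t, rain, 1)
pred = slope * t + intercept
ss_res = np.sum((rain - pred) ** 2)
ss_tot = np.sum((rain - rain.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("Pearson r:", round(r, 3))
print("R^2 of the fitted line:", round(r_squared, 3))
```

With a single input column, the R^2 of the fitted line equals r squared, so r tells you in advance how much a straight line in that one variable can explain.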