I have a data set like this:
Value Month Year
103.4 April 2006
270.6 August 2006
51.9 December 2006
156.9 February 2006
126.9 January 2006
96.8 July 2006
183.1 June 2006
266.6 March 2006
193.1 May 2006
524.7 November 2006
619.9 October 2006
129 September 2006
374.1 April 2007
260.5 August 2007
119.6 December 2007
9.9 February 2007
91.1 January 2007
106.6 July 2007
79.9 June 2007
60.5 March 2007
432.4 May 2007
128.8 November 2007
292.1 October 2007
129.3 September 2007
Value is the rainfall for one district (one value per month); let's call it districtA. I have the data set from 2006 to 2014 and I need to predict the rainfall for the next 2 years for districtA. I chose Pearson correlation and linear regression from the sklearn library to predict the data. I'm very confused and I don't know how to set X and Y. I'm new to Python, so any help is valuable. Thank you.
P.S. I found some code like this:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
When I print diabetes_X_train, it gives me this:
[[ 0.07786339]
[-0.03961813]
[ 0.01103904]
[-0.04069594]
[-0.03422907]...]
I assume this is the r value obtained from the correlation coefficient. When I print diabetes_y_train, it gives me something like this:
[ 233. 91. 111. 152. 120. .....]
My problem is: how can I get the r value from the rainfall data and assign it to the x axis?
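For reference, the values printed from diabetes_X_train are not r values: `diabetes.data[:, np.newaxis, 2]` selects the third feature column of the diabetes dataset as a 2-D array, with one row per sample. An r value, by contrast, is a single number describing how two columns vary together. The sketch below uses only the 2006 rows listed above and assumes months are encoded as 1 to 12 (that encoding and the variable names are choices made for this sketch, not something from the found code). It computes Pearson's r with NumPy and then shows the shape scikit-learn expects for X and y.

```python
import numpy as np

# Rainfall for 2006 from the table above, reordered January..December.
rainfall_2006 = np.array([126.9, 156.9, 266.6, 103.4, 193.1, 183.1,
                          96.8, 270.6, 129.0, 619.9, 524.7, 51.9])

# Assumption of this sketch: months are encoded as the numbers 1..12.
month_number = np.arange(1, 13)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(month_number, rainfall_2006)[0, 1]
print("Pearson r between month number and rainfall: %.3f" % r)

# For scikit-learn, X holds the feature values themselves (not r):
# a 2-D array with one row per sample and one column per feature.
X = month_number.reshape(-1, 1)
y = rainfall_2006
print(X.shape, y.shape)
```

So r summarises the relationship between a feature and the target, while X is the feature column itself and y is the rainfall.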
This is not the best solution, but it works.
A little explanation: I replaced each month name with its index in a list, because the algorithm needs numeric input. I also replaced the space delimiters with ';' delimiters, because different rows had different numbers of spaces, which was inconvenient. Now your data is:
Value;Month;Year
103.4;April;2006
270.6;August;2006
51.9;December;2006
The file with the initial data is 'data.csv'.
import pandas as pd
import sklearn.linear_model as ll

# Read the ';'-delimited file
data = pd.read_csv('data.csv', sep=';')

# Replace each month name with its index in this list (January = 0, ..., December = 11)
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
data['Month'] = data['Month'].map({m: i for i, m in enumerate(month)})

# Features: month index and year; target: rainfall value
X = data[['Month', 'Year']]
y = data['Value']

lr = ll.LinearRegression()
lr.fit(X, y)

######### TEST DATA ##########
# Month indexes 1 and 2 (February and March under the encoding above) of 2008
X_test = [[1, 2008], [2, 2008]]
X_test = pd.DataFrame(X_test, columns=['Month', 'Year'])
y_test = lr.predict(X_test)
print(y_test)
As a result of the test, I got these values:
[69.23079837 80.63691725]
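The same approach can also be tried without editing the file by hand. The sketch below is only an illustration under a few assumptions: it embeds the 24 rows shown in the question as a string (instead of reading 'data.csv'), lets pandas split on any run of whitespace with `sep=r'\s+'`, and keeps the same 0-based month encoding as the code above. The names `raw`, `month_to_index`, and `X_new` are made up for this example. Because it is fitted on two years only, its predictions will differ from the values printed above.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# The 2006 and 2007 rows from the question, still separated by spaces.
raw = """Value Month Year
103.4 April 2006
270.6 August 2006
51.9 December 2006
156.9 February 2006
126.9 January 2006
96.8 July 2006
183.1 June 2006
266.6 March 2006
193.1 May 2006
524.7 November 2006
619.9 October 2006
129 September 2006
374.1 April 2007
260.5 August 2007
119.6 December 2007
9.9 February 2007
91.1 January 2007
106.6 July 2007
79.9 June 2007
60.5 March 2007
432.4 May 2007
128.8 November 2007
292.1 October 2007
129.3 September 2007
"""

# sep=r'\s+' treats any run of whitespace as one delimiter,
# so rows with different numbers of spaces still parse.
data = pd.read_csv(io.StringIO(raw), sep=r'\s+')

# Same encoding as above: January = 0, ..., December = 11.
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
month_to_index = {name: i for i, name in enumerate(months)}
data['Month'] = data['Month'].map(month_to_index)

X = data[['Month', 'Year']]   # features: month index and year
y = data['Value']             # target: rainfall

model = LinearRegression().fit(X, y)

# Predict every month of 2008 (a year outside the rows used for fitting).
X_new = pd.DataFrame({'Month': list(range(12)), 'Year': [2008] * 12})
pred = model.predict(X_new)
print(pred)
```

With the full 2006 to 2014 file, the same steps apply: read the file, encode the month, fit on the Month and Year columns, and call predict on a frame holding the future months and years.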