python loops csv machine-learning linear-regression

Iterate over every row of a CSV column and predict a value using linear regression


I am using a loop to take the value from every CSV row and run it through linear_regression_model for prediction. The desired output is the predicted value for every row in the CSV, printed one per line, like:

4.500
4.256
3.909
4.565
...
4.433

Here is what I did:

def prediction_loop():
    for index, row in ml_sample.iterrows():
        print(row['column'])
        new_data = OrderedDict(['column', row])
        new_data = pd.Series(new_data).values.reshape(1,-1)
        print(linear_regression_model.predict(new_data))

The actual output I get is:

Traceback (most recent call last):
new_data = OrderedDict(['column', row])
ValueError: too many values to unpack (expected 2)

The CSV has 87 rows and 1 column. How can I fix and optimise the code? Thank you.
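
A note on the traceback: OrderedDict expects an iterable of (key, value) pairs, but `['column', row]` is a flat list, so the string 'column' is itself treated as a pair and cannot be unpacked into two items. In addition, `row` is the whole row, whereas the value is `row['column']`. The sketch below reproduces the error and a possible fix using only the standard library; `value` is a hypothetical stand-in for `row['column']`, and nothing here is taken from the asker's actual data.

```python
from collections import OrderedDict

value = 4.5  # hypothetical stand-in for row['column']

# A flat list is read as a sequence of pairs, so the string 'column'
# is unpacked as if it were a (key, value) pair, which fails.
try:
    OrderedDict(['column', value])
    error = None
except ValueError as exc:
    error = exc

# Wrapping key and value in a tuple gives one well-formed pair.
new_data = OrderedDict([('column', value)])

# One sample with one feature, i.e. the 2D shape that reshape(1, -1) produces.
features = [list(new_data.values())]
print(type(error).__name__, features)
```

If linear_regression_model is a scikit-learn estimator (an assumption, since the question does not say), its predict method accepts a 2D array with one row per sample, so passing all rows at once, for example `ml_sample[['column']]`, would likely avoid the loop altogether.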


Solution

  • If I understand the question correctly, this can be done efficiently without any external modules: a small class that accumulates running sums is enough. The assumption is that the input file contains one numerical value per line, that each value is Y, and that the zero-based line number is X. Try this:

    class Stats:
        """Running sums for a least-squares fit of y = a + b*x."""
        def __init__(self):
            self.n = 0
            self.sx = 0
            self.sy = 0
            self.sxx = 0
            self.syy = 0
            self.sxy = 0
    
        def add(self, x, y):
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y
            self.n += 1
    
        def r(self):  # correlation coefficient
            num = self.n * self.sxy - self.sx * self.sy
            dx = self.n * self.sxx - self.sx * self.sx
            dy = self.n * self.syy - self.sy * self.sy
            return num / (dx * dy) ** 0.5
    
        def b(self):  # slope
            return (self.n * self.sxy - self.sx * self.sy) / (self.n * self.sxx - self.sx * self.sx)
    
        def a(self):  # intercept
            return self.my() - self.b() * self.mx()
    
        def mx(self):  # mean x
            assert self.n > 0
            return self.sx / self.n
    
        def my(self):  # mean y
            assert self.n > 0
            return self.sy / self.n
    
        def y(self, x):  # estimate of y for given x
            return x * self.b() + self.a()
    
    
    stats = Stats()
    
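    with open('lr.txt') as data:
        # x is the zero-based line number; y is the first value on that line
        for i, line in enumerate(data):
            if line.strip():  # skip blank lines, e.g. a trailing newline
                stats.add(i, float(line.split()[0]))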
    
    print(f'r={stats.r():.4f} slope={stats.b():.4f} intercept={stats.a():.4f}')
    
    for x in range(stats.n):
        print(f'Estimate for {x} = {stats.y(x):.2f}')
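
As a quick sanity check of the closed-form slope and intercept formulas used above, the following minimal sketch fits data whose answer is known in advance. The function name `fit_line` and the sample values are hypothetical and chosen only for illustration; as in the answer, x is assumed to be the zero-based position of each value, and at least two values are needed so that the denominator is non-zero.

```python
def fit_line(ys):
    """Least-squares fit of y = a + b*x with x = 0, 1, 2, ... (the line index)."""
    n = len(ys)
    sx = sum(range(n))
    sy = sum(ys)
    sxx = sum(x * x for x in range(n))
    sxy = sum(x * y for x, y in enumerate(ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = sy / n - slope * sx / n
    return slope, intercept


# Hypothetical data generated from y = 1 + 2x, so the fit should recover it.
sample = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sample)

# One estimate per row, three decimals, as in the question's desired output.
predictions = [intercept + slope * x for x in range(len(sample))]
for value in predictions:
    print(f'{value:.3f}')
```

Because the sample lies exactly on a line, the estimates coincide with the inputs; with real data they would differ by the residuals of the fit.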