c# plot charts linear-regression zedgraph

How can I make the regression line more accurate in log scale?


The supplied code draws some data points and the corresponding regression line.

The regression line goes through or is near most of the data points on a linear scale.

[Figure: data points and regression line on a linear scale]

However, on a log scale, the line doesn't look very accurate.

[Figure: the same plot on a log scale]

Also, the lengths of the lines become shorter.

How can I make the regression line look the same in both the linear scale and the log scale?

using System;
using System.Collections.Generic;
using ZedGraph;
using System.Drawing;
using System.IO;
using System.Linq;

class GenerateRegressionLine
{
    static void Main()
    {
        string dataFilePath = @"output.txt";
        Tuple<List<double>, List<double>> givenLine = Fit.ReadDataFromFile(dataFilePath);

        Tuple<List<double>, List<double>> regressionLine = CreateRegressionLine(givenLine.Item1, givenLine.Item2);

        
        ZedGraphControl zgc = new ZedGraphControl();
        zgc.Size = new Size(1200, 800);

        GraphPane myPane = zgc.GraphPane;
        myPane.Title.Text = "Line Plot";
        myPane.XAxis.Title.Text = "X Axis";
        myPane.YAxis.Title.Text = "Y Axis";

        myPane.XAxis.Type = AxisType.Log;
        myPane.YAxis.Type = AxisType.Log;

        PointPairList givenLinePPL = new PointPairList(givenLine.Item1.ToArray(), givenLine.Item2.ToArray());
        LineItem givenLineCurve = myPane.AddCurve("Given Line", givenLinePPL, Color.Green, SymbolType.None);

        PointPairList regressionLinePPL = new PointPairList(regressionLine.Item1.ToArray(), regressionLine.Item2.ToArray());
        LineItem regressionLineCurve = myPane.AddCurve("Regression Line", regressionLinePPL, Color.Red, SymbolType.None);


        zgc.AxisChange();
        zgc.Invalidate();

        string directory = Path.GetDirectoryName(dataFilePath);

        string imagePath = Path.Combine(directory, "DrawRegressionLine.png");

        zgc.GetImage().Save(imagePath, System.Drawing.Imaging.ImageFormat.Png);
    }

    private static Tuple<double, double> CalculateLinearRegressionCoefficients(List<double> xList, List<double> yList)
    {
        if (xList == null || yList == null || xList.Count != yList.Count)
            throw new ArgumentException("Lists must be non-null and have the same number of elements.");

        double xSum = 0, ySum = 0, xySum = 0, x2Sum = 0;
        int count = xList.Count;

        for (int i = 0; i < count; i++)
        {
            double x = xList[i];
            double y = yList[i];
            xSum += x;
            ySum += y;
            xySum += x * y;
            x2Sum += x * x;
        }

        double slope = (count * xySum - xSum * ySum) / (count * x2Sum - xSum * xSum);
        double intercept = (ySum - slope * xSum) / count;

        return Tuple.Create(intercept, slope);
    }

    public static Tuple<List<double>, List<double>> CreateRegressionLine(List<double> xList, List<double> yList)
    {
        // Calculate the regression coefficients
        var coefficients = CalculateLinearRegressionCoefficients(xList, yList);

        List<double> xVals = new List<double>();
        List<double> yVals = new List<double>();

        double intercept = coefficients.Item1;
        double slope = coefficients.Item2;

        double startX = xList[0];
        double endX = xList[xList.Count - 1];

        xVals.Add(startX);
        yVals.Add(intercept + slope * startX);

        xVals.Add(endX);
        yVals.Add(intercept + slope * endX);

        return Tuple.Create(xVals, yVals);
    }
}



Solution

  • The short answer is that you first need a better understanding of how a log-scale plot works. On a related note, I'm upvoting and answering this because I remember having a similar question in my very rookie days, when the world was new. That biases me into believing it's not that bad of a question :)

    There are two reasons that the points on the left appear to be ignored.

    1. Points with lower y-values have the same weight, or y-uncertainty, associated with them as the points on the right. On a linear plot, that means the "wiggle room" the fit has for each point looks uniform. Not so on the log plot: look at the enormous error bars on the left versus the right. All of them have a length of +/-30, chosen for better visibility. In a least-squares fit, it is the relative size of the weights that matters, not their absolute magnitude.
    2. A line in linear space is not necessarily a line in log space. On log-log axes, y = m * x + b stays straight only when the intercept b is zero (it then appears with a slope of exactly 1). With only the y-axis logarithmic, as in my plots, any non-constant line is curved. Your CreateRegressionLine is misleading because it only outputs the first and last points. I've output a denser array below, where you can see that the log plot of the same fit is not linear in log space.

    [Figure: the data with +/-30 error bars and the least-squares fit, on a linear y-axis (left) and a log y-axis (right)]
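
To put a number on the second point, here is a minimal sketch that measures the local slope of y = intercept + slope * x on log-log axes. The helper name `loglog_slopes` and the sample coefficients are made up for illustration. A straight line in log space would have a constant local slope.

```python
import numpy as np

def loglog_slopes(intercept, slope, x):
    """Local slope d(log y)/d(log x) of y = intercept + slope * x (hypothetical helper)."""
    y = intercept + slope * x
    return np.diff(np.log(y)) / np.diff(np.log(x))

# Evenly spaced in log space, strictly positive so the logs are defined.
x = np.geomspace(1.0, 1000.0, 50)

through_origin = loglog_slopes(0.0, 2.5, x)   # y = 2.5 * x
with_offset = loglog_slopes(40.0, 2.5, x)     # y = 40 + 2.5 * x

print("zero intercept: min/max slope", through_origin.min(), through_origin.max())
print("intercept = 40: min/max slope", with_offset.min(), with_offset.max())
```

With a zero intercept the local slope stays at 1 everywhere. With a non-zero intercept it drifts across the range, which is the curvature that shows up on the log plot.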

    So what to do? There are a couple of simple approaches. The specific choice of approach depends on what exactly you want, which is unclear, likely even to you at this point.

    1. Do your linear regression in log space. That will make it an actual line in log space, and the weights will be uniform there. The uncertainties are symmetrical in log space, so the error bars drawn in the plots are only an approximation:

      [Figure: the log-space fit with its error bars, on a linear y-axis (left) and a log y-axis (right)]

    2. When you do a regression, set the weights to be proportional in the log scale. In other words, your 1-sigma uncertainty for the least-squares fit should be much larger for the large y-values and smaller for the small ones, so that it appears nearly uniform on a log scale. A simple, imperfect implementation is to use the reciprocal of y as the weight, which amounts to a 1-sigma uncertainty proportional to y. This will make a line in linear space, but not a nice one, and the curve in log space still won't be a line, though it may be a more pleasing fit. The error bars shown are y (scaled down by a factor of 20 for visibility), since the uncertainties are inversely proportional to the weights, which are just 1 / y.

      [Figure: the 1 / y weighted fit with error bars of y / 20, on a linear y-axis (left) and a log y-axis (right)]

      For this option, you will have to implement weighted least-squares, which is explained here in Wikipedia: https://en.wikipedia.org/wiki/Weighted_least_squares.
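
As a rough sketch of what that weighted fit involves, the closed-form sums in CalculateLinearRegressionCoefficients carry over with a weight on every term. The function name `weighted_linear_fit` and its `sigma` parameter are hypothetical, not part of any library. Each point gets the weight 1 / sigma**2, so sigma = y should correspond to np.polyfit(..., w=1.0 / y), whose w multiplies the unsquared residuals.

```python
import numpy as np

def weighted_linear_fit(x, y, sigma):
    """Weighted least-squares line y = intercept + slope * x.

    Hypothetical helper: sigma is the assumed 1-sigma uncertainty of each y,
    and each point gets the weight 1 / sigma**2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2

    # Same sums as the unweighted formula, with w on every term.
    w_sum = w.sum()
    wx_sum = (w * x).sum()
    wy_sum = (w * y).sum()
    wxy_sum = (w * x * y).sum()
    wx2_sum = (w * x * x).sum()

    slope = (w_sum * wxy_sum - wx_sum * wy_sum) / (w_sum * wx2_sum - wx_sum ** 2)
    intercept = (wy_sum - slope * wx_sum) / w_sum
    return intercept, slope

# Same synthetic data as in the plotting code in this answer.
np.random.seed(0)
x = np.arange(0, 2000, 100)
y = np.random.uniform(1.0, 100.0, x.shape).cumsum()

intercept, slope = weighted_linear_fit(x, y, sigma=y)
print("intercept:", intercept, "slope:", slope)
```

The return order (intercept, slope) mirrors the tuple in the question's code, so the loop-based C# version only needs the extra weight factor in each running sum.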


    I know very little about C#, so I've done all the plotting here in Python. The code is below for completeness. I expect that you'll be able to understand it with relatively little trouble. If something specific bugs you about it, let me know.

    from matplotlib import pyplot as plt
    import numpy as np
    
    np.random.seed(0)
    
    x = np.arange(0, 2000, 100)
    y = np.random.uniform(1.0, 100.0, x.shape).cumsum()
    
    # Fit 1: ordinary least squares in linear space (uniform weights)
    fit1 = np.polyfit(x, y, deg=1)
    r1 = np.polyval(fit1, x)
    
    fig1, ax1 = plt.subplots(1, 2)
    
    ax1[0].set_title('Linear')
    ax1[0].errorbar(x, y, 30, label='data')
    ax1[0].plot(x, r1, label='fit')
    ax1[0].legend()
    
    ax1[1].set_yscale('log')
    ax1[1].set_title('Log')
    ax1[1].errorbar(x, y, 30, label='data')
    ax1[1].plot(x, r1, label='fit')
    ax1[1].legend()
    
    
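    # Fit 2: straight line through (x, log(y)), transformed back with exp
    fit2 = np.polyfit(x, np.log(y), deg=1)
    r2 = np.exp(np.polyval(fit2, x))
    # errorbar expects (lower, upper) offsets from y, so convert the
    # log-space interval log(y) +/- 1 from absolute positions to offsets
    bars = np.stack((y - np.exp(np.log(y) - 1), np.exp(np.log(y) + 1) - y))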
    
    fig2, ax2 = plt.subplots(1, 2)
    
    ax2[0].set_title('Linear')
    ax2[0].errorbar(x, y,  bars, label='data')
    ax2[0].plot(x, r2, label='fit')
    ax2[0].legend()
    
    ax2[1].set_yscale('log')
    ax2[1].set_title('Log')
    ax2[1].errorbar(x, y, bars, label='data')
    ax2[1].plot(x, r2, label='fit')
    ax2[1].legend()
    
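    # Fit 3: weighted least squares; polyfit's w multiplies the unsquared
    # residuals, so w = 1 / y means a 1-sigma uncertainty proportional to y
    fit3 = np.polyfit(x, y, deg=1, w=1.0 / y)
    r3 = np.polyval(fit3, x)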
    
    fig3, ax3 = plt.subplots(1, 2)
    
    ax3[0].set_title('Linear')
    ax3[0].errorbar(x, y, y / 20, label='data')
    ax3[0].plot(x, r3, label='fit')
    ax3[0].legend()
    
    ax3[1].set_yscale('log')
    ax3[1].set_title('Log')
    ax3[1].errorbar(x, y, y / 20, label='data')
    ax3[1].plot(x, r3, label='fit')
    ax3[1].legend()
    
    plt.show()
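
The plots above only put y on a log scale, while the question's chart sets AxisType.Log on both axes. Here is a minimal sketch of option 1 for that case, assuming all x and y values are strictly positive. The function name `fit_power_law` and the sample data are made up for illustration. It fits a line to (log x, log y), which corresponds to y = a * x**k, and samples the result with np.geomspace so that the drawn points are evenly spaced on a log axis.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit log(y) = log(a) + k * log(x) by ordinary least squares.

    Hypothetical helper; requires x > 0 and y > 0 because of the logs.
    Returns (a, k) such that y is approximately a * x**k.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(x <= 0) or np.any(y <= 0):
        raise ValueError("log-log fitting needs strictly positive x and y")
    k, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(log_a), k

# Made-up sample data: an exact power law y = 3 * x**1.5.
x = np.arange(100, 2100, 100)
y = 3.0 * x ** 1.5

a, k = fit_power_law(x, y)

# Points for drawing the fitted line, evenly spaced in log space.
x_line = np.geomspace(x.min(), x.max(), 50)
y_line = a * x_line ** k
print("a:", a, "k:", k)
```

On a log-log chart these points fall on a straight line. On linear axes the same fit generally appears as a curve, so you have to pick the space in which you want the fit to be straight.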