Tags: r, linear-regression, knn, nearest-neighbor, least-squares

How can I find the nearest neighboring line(s) to a desired line in a collection of data?


I need to figure out how to determine the nearest neighbors of an "optimal" line, as illustrated in a simplified figure, linked below.

*Edit 8-20-2018: Since I was unable to find a cookie-cutter solution to my problem in R, I ended up writing a formula in R that calculates the area between the desired line and each of the other lines from the experimental data. It's similar to finding a least-squares regression line, but takes it a step further. The lines closest to the desired curve will have the smallest area.*
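For illustration, the calculation boils down to something like the sketch below (a minimal sketch, assuming all series are sampled on the same time grid; `area_between` and the example data are made up, not my actual code):

    # Hypothetical helper: approximate the area between two curves sampled at
    # the same time points, via the trapezoidal rule on |y1 - y2|.
    area_between <- function(time, y1, y2) {
      d <- abs(y1 - y2)  # pointwise gap between the curves
      sum(diff(time) * (head(d, -1) + tail(d, -1)) / 2)
    }
    
    time      <- 1:9
    optimal   <- 1:9                               # stand-in for the desired line
    candidate <- c(1, 2, 4, 6, 7, 10, 11, 10, 9.5)
    area_between(time, optimal, candidate)         # smaller area = closer line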

The blue, orange, green, and purple lines each represent the best fit to a time series of ~50-100 data points. The desired profile (red dashed line) represents the optimal linear trajectory:

[Example graph]

Is there a reliable way to calculate which line is nearest to the optimal line via k-nearest neighbors? Or will I need to write my own algorithm that finds the curve with the smallest sum of squares?

DESIRED GOAL: In any case, if I were to set k=1, I'd like the algorithm to select the green time series. And if k=2, I'd like it to select both the orange and green lines (and automatically calculate the average of their labeled values).
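Roughly, I imagine the selection working something like the sketch below (all data and the `labels` values here are hypothetical, just to show the behavior I want):

    # Score each fitted line by its sum of squared deviations from the optimal
    # line, then average the labeled values of the k nearest lines.
    optimal <- 1:9
    fits <- list(blue   = c(1, 2, 4, 6, 7, 10, 11, 10, 9.5),
                 green  = c(1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.9),
                 orange = c(1.3, 1.7, 2.3, 2.7, 3.3, 4, 7, 5.7, 6.3))
    labels <- c(blue = 10, green = 20, orange = 30)  # made-up labeled values
    
    sse <- sapply(fits, function(y) sum((y - optimal)^2))
    k <- 2
    nearest <- names(sort(sse))[1:k]  # k = 1 -> green; k = 2 -> green + orange
    mean(labels[nearest])             # average of their labeled values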

I'm not sure if I'd need to use the raw data in aggregate or use the fitted lines for each of the time series.

Ideally, I'd like to use R for this project, but I have just begun learning Python.

Hopefully I've provided enough info to make things understandable.

Thanks for your help!


Solution

  • Well, I've tried something I'd like to share, hoping it's helpful. I created some fake data similar to yours:

    # four fake time series plus a shared time axis
    ts <- data.frame(ts1  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                     ts2  = c(1, 2, 4, 6, 7, 10, 11, 10, 9.5),
                     ts3  = c(1, 1.1, 2, 2.1, 3, 3.1, 4, 4.1, 5),
                     ts4  = c(1.3, 1.7, 2.3, 2.7, 3.3, 4, 7, 5.7, 6.3),
                     time = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
    

    Plotting them:

    library(ggplot2)
    library(reshape2)
    meltdf <- melt(ts,id="time") # ggplot loves long data, so we have to crunch them a bit
    ggplot(meltdf,aes(x=time,y=value,colour=variable,group=variable)) + geom_line()
    

    [Line plot of the four fake time series]

    Close enough!
    Now we have to try to classify. Using knn could be a problem, not because of the algorithm, but because of the results our data can give it. For example, the signature of knn() from the class package is:

    knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
    

    You have to give it some data to train on, and some to classify: how? It seems you can only train your knn with one series, ts1, and then classify the others, and this is not going to give a nice result, as stated here.

    library(class)
    knn_data <- t(ts[, c("ts1", "ts2", "ts3", "ts4")])  # one row per series
    
    train <- knn_data[1, , drop = FALSE]  # only ts1 to train on...
    test  <- knn_data[2:4, ]
    label <- c("t")                       # ...so only one possible class
    
    knn(train = train, test = test, cl = label, k = 1)
    [1] t t t
    Levels: t
    

    Maybe the answers here can help you with this (with a single class in the training data, every test point can only receive that label), and you can start from there.

    However, as you know, knn uses Euclidean distance: we can try to make things simpler by using it directly.

    # a data.frame with the distances between ts1 (the reference) and the other ts
    distances <- data.frame(source = c("ts1", "ts1", "ts1"),
                            target = c("ts2", "ts3", "ts4"),
                            dista  = c(dist(rbind(ts$ts1, ts$ts2)),
                                       dist(rbind(ts$ts1, ts$ts3)),
                                       dist(rbind(ts$ts1, ts$ts4))))
    
    # now we can choose the top 1, as in this case, or the top k
    head(distances[order(distances$dista), ], 1)
    
      source target    dista
    3    ts1    ts4 4.672259
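    If you want the k = 2 behaviour from the question, the same table works: take the k smallest distances and average the labeled value attached to each of those series. A small sketch (the `labels` values below are made up purely for illustration):

    k <- 2
    nearest <- head(distances[order(distances$dista), ], k)
    
    # made-up labeled values for ts2..ts4, just to demonstrate the averaging step
    labels <- c(ts2 = 10, ts3 = 20, ts4 = 30)
    mean(labels[as.character(nearest$target)])  # average label of the k nearest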