Search code examples
pythonpandasnumpynearest-neighbor

In Python get k-nearest neighbor optimal matching without replacement


I want to assign two control plots to each of the project plots. In the first iteration, I want to assign the nearest control plot (with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If this control plot has already been assigned to a project plot, we look for the next nearest control plot.

Once all project plots have been assigned a first control plot, we assign a second control plot to each project plot, following the same criterion: find the control plot with the minimum distance as long as it has not been previously assigned to another project plot.

I have a dataframe which looks like: 
`data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}

df = pd.DataFrame(data)`

Where: control_plot_id represents the identifiers of the control plots dist_project_plot_id1 represents the distance between the control plot and project plot 1 dist_project_plot_id2 represents the distance between the control plot and project plot 2 etc.

I have tried the first search in the following code:

import pandas as pd
df = pd.DataFrame(data)

# Add new columns "PP" and "dist"
df['PP'] = ''
df['dist'] = np.nan

# Get the column names starting with 'project_plot_id'
project_columns = [col for col in df.columns if col.startswith('project_plot_id')]

# Iterate over the project_plot_id columns
for col in project_columns:
    # Sort the dataframe by the current column in ascending order
    df_sorted = df.sort_values(col)

    # Find the k-nearest control plots for the current column
    k = 1  # Set the value of k
    nearest_control_plots = []
    for i in range(k):
        min_value = df_sorted.loc[~df_sorted['control_plot_id'].isin(nearest_control_plots)].head(1)[['control_plot_id', col]]
        nearest_control_plots.append(min_value['control_plot_id'].values[0])
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'PP'] = col
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'dist'] = min_value[col].values[0]

What I am not able to program is that if a control plot has already been selected for a project plot, the code should continue searching for the next nearest control plot, which could be the third, the fourth, or even the last one. Maybe a particular library is doing what I am looking for.

The expected output should have the following columns:

  • control_plot_id: original control plot id
  • PP: project plot for which the control plot has been selected
  • dist: distance between the control plot and the control plot selected

Solution

  • Okay, based on the description this is what you are looking for:

    import pandas as pd
    
    data = {
        'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
        'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
        'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
        'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
    }
    
    df = pd.DataFrame(data)
    cols = [c for c in df.columns if c.startswith("dist")]
    
    project_data = []
    assignments = []
    
    for c in cols:
        for i in range(1, 3):
            row = {}
            search_df = df.loc[~df.control_plot_id.isin(assignments)]
            control_plot = search_df.loc[search_df[c].idxmin()]
            row["project_plot"] = c
            row["control_plot"] = control_plot.control_plot_id
            row["dist"] = control_plot[c]
            project_data.append(row)
            assignments.append(control_plot.control_plot_id)
    
    out_data = pd.DataFrame(project_data)
    
    print(out_data)
    

    This returns:

                project_plot  control_plot     dist
    0  dist_project_plot_id1     1528123.0  1697.68
    1  dist_project_plot_id1     1539206.0  2140.41
    2  dist_project_plot_id2     1507770.0   427.82
    3  dist_project_plot_id2     1526258.0   488.07
    4  dist_project_plot_id3     1504105.0  2008.31
    5  dist_project_plot_id3     2019722.0  5573.02
    

    This does what you ask for:

    I want to assign the nearest control plot (with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If this control plot has already been assigned to a project plot, we look for the next nearest control plot. Once all project plots have been assigned a first control plot, we assign a second control plot to each project plot, following the same criterion: find the control plot with the minimum distance as long as it has not been previously assigned to another project plot.

    However your example output is different from what you explained you wanted, not sure why...

    NOTE after edit: If instead you wanted one row per project then, as with my answer prior to the edit when you clarified your expected output:

    project_data = {}
    assignments = []
    
    for c in cols:
        project_data[c] = {}
        for i in range(1, 3):
            search_df = df.loc[~df.control_plot_id.isin(assignments)]
            control_plot = search_df.loc[search_df[c].idxmin()]
            project_data[c][f"control_plot{i}"] = control_plot.control_plot_id
            project_data[c][f"dist{i}"] = control_plot[c]
            assignments.append(control_plot.control_plot_id)
    
    out_data = pd.DataFrame.from_dict(project_data, orient='index').reset_index()
    out_data.rename(columns={'index': 'project'}, inplace=True)
    

    which would return:

                     project  control_plot1    dist1  control_plot2    dist2
    0  dist_project_plot_id1      1528123.0  1697.68      1539206.0  2140.41
    1  dist_project_plot_id2      1507770.0   427.82      1526258.0   488.07
    2  dist_project_plot_id3      1504105.0  2008.31      2019722.0  5573.02