Search code examples
pythonpandasdataframevectorization

How to increase speed searching through a list inside a df?


I have a data frame with a column of lists of geographic coordinates (grouped by what power transmission line they belong to) and another list of coordinates outside the data frame. I need to find the position of each list coordinate in the data frame if the sample number is DS_3, but this takes way too long since there are almost 6000 coordinates. I bet there is a vectorized way to do this since I've learned that looping through a data frame is not practical, but I don't know much about vector methods.

Here's the data frame:

    id  coordinates                                 voltage   length    voltage x length
0   36  [[-81.443569, 28.470022], [-81.446726, 28.4740...   230 2788.450481 6.413436e+05
1   69  [[-82.402208, 27.907588], [-82.406592, 27.9084...   69  1486.634968 1.025778e+05
2   87  [[-80.38392, 25.748665], [-80.383758, 25.74358...   69  3395.795388 2.343099e+05
3   128 [[-81.423956, 28.410278], [-81.424811, 28.4053...   69  5231.189711 3.609521e+05
4   138 [[-81.843314, 30.572359], [-81.844404, 30.5685...   230 2716.984353 6.249064e+05
... ... ... ... ... ...
3061    68184   [[-81.251491, 28.53718], [-81.250396, 28.53283...   69  19243.512450    1.327802e+06
3062    68189   [[-82.669886, 28.961533], [-82.664782, 28.9615...   230 27463.901761    6.316697e+06
3063    68196   [[-81.157196, 29.000982], [-81.157041, 28.9958...   500 90524.038042    4.526202e+07
3064    68199   [[-80.549594, 28.481094], [-80.551733, 28.4857...   115 7185.881445 8.263764e+05
3065    68211   [[-80.44025, 25.81403]] 115 673.881802  7.749641e+04

Here's the start of the coordinates list:

[[-81.708274, 31.095992], [-81.708763, 31.090911], [-81.709349, 31.085841], [-81.710002, 31.080779], [-81.710627, 31.075713], [-81.711167, 31.070638], [-81.711649, 31.065557], [-81.712316, 31.060497], [-81.713036, 31.055444], [-81.713757, 31.050391], [-81.714478, 31.045338], [-81.715199, 31.040285], [-82.184384, 31.058297], [-82.188367, 31.061488], [-82.192045, 31.065027], [-82.195735, 31.068554], [-82.199426, 31.072079], [-82.20315, 31.075567], [-82.207127, 31.078767], [-82.211101, 31.081969], [-82.215077, 31.08517], [-82.219057, 31.088366], [-82.223033, 31.091567], [-82.227002, 31.094776], [-82.230978, 31.097977], [-82.234959, 31.101171], [-82.238934, 31.104373], [-82.242912, 31.107571], [-82.24689, 31.110769], [-82.250862, 31.113975], [-82.255483, 31.116065], [-82.260227, 31.117947], [-82.26497, 31.119832], [-82.269711, 31.121722], [-82.274457, 31.123602], [-82.279199, 31.125489], [-82.283947, 31.127364], [-82.28869, 31.129249], [-82.293435, 31.131131], [-82.298182, 31.133006], [-82.30292, 31.134905] ...  

This is the code I am currently using; any tips would be greatly appreciated!

for index, row in df.iterrows():
    for i in range(len(coordinates)):
        if coordinates[i] in row['coordinates'] and sample[i]['sample_0'] == 'DS_3':
                    if row['id'] in damaged_lines:
                        break
                    else:
                        damaged_lines.append(row['id']) 
 

Updated code:

for i in range(len(coordinates)):
    for index, row in df.iterrows():
        if coordinates[i] in row['coordinates'] and sample[i]['sample_0'] == 'DS_3':
                    if row['id'] in damaged_lines:
                        break
                    else:
                        damaged_lines.append(row['id']) 
                        break
     
       

Solution

  • Here's an example of how to do this via binary search. This doesn't do your check for 'DS_3', but that would be easy to add. I generate 100 rows with a random number of coordinate pairs. I then extract from that a single list of the coordinates along with the index of the row where they came from. I then extract a random sample from that to lookup, and then do a simple loop through the extracted sample to binary search the list.

    import os
    import numpy
    import bisect
    import random
    import pandas as pd
    
    # Generate 100 random rows.
    
    rows = []
    for _ in range(100):
        n = random.randrange(2,5)
        rows.append( [[random.random(),random.random()] for _ in range(n)] )
    
    # Turn that into a dataframe.
    
    ids = list(range(100))
    df = pd.DataFrame( { 'id': ids, 'coordinates': rows } )
    
    # Now construct a single index.
    
    allvals = []
    for idx,row in df.iterrows():
        for v in row['coordinates']:
            allvals.append( (v, row['id']) )
    
    # Pick an assortment for searching.
    
    coordinates = [i[0] for i in random.choices(allvals,k=15)]
    
    # Sort the index.
    
    allvals.sort()
    
    for v in coordinates:
        i = bisect.bisect_left(allvals, (v,0))
        print(v, "found with index", allvals[i] )
    

    Output:

    [0.11420241950096588, 0.3012171391829337] found with index ([0.11420241950096588, 0.3012171391829337], 26)
    [0.5291298973513743, 0.7171475026345056] found with index ([0.5291298973513743, 0.7171475026345056], 46)
    [0.37665801783339126, 0.9637006925267728] found with index ([0.37665801783339126, 0.9637006925267728], 38)
    [0.470318651764595, 0.645338416806701] found with index ([0.470318651764595, 0.645338416806701], 42)
    [0.1058926573158816, 0.5197311227746729] found with index ([0.1058926573158816, 0.5197311227746729], 44)
    [0.47475506322839045, 0.5079794380642515] found with index ([0.47475506322839045, 0.5079794380642515], 12)
    [0.36731827174930154, 0.27873066502693755] found with index ([0.36731827174930154, 0.27873066502693755], 19)
    [0.16012365898638536, 0.9036887843952794] found with index ([0.16012365898638536, 0.9036887843952794], 39)
    [0.5443093669600313, 0.014991094298460905] found with index ([0.5443093669600313, 0.014991094298460905], 96)
    [0.03875984282281675, 0.13183885631959258] found with index ([0.03875984282281675, 0.13183885631959258], 76)
    [0.48366667699166943, 0.40255967639358115] found with index ([0.48366667699166943, 0.40255967639358115], 63)
    [0.1743226390424386, 0.7617040167975732] found with index ([0.1743226390424386, 0.7617040167975732], 20)
    [0.9862858775797206, 0.5949703957196504] found with index ([0.9862858775797206, 0.5949703957196504], 16)
    [0.9475511012671365, 0.27111055784024374] found with index ([0.9475511012671365, 0.27111055784024374], 34)
    [0.9686244585578948, 0.8454092861137659] found with index ([0.9686244585578948, 0.8454092861137659], 72)