python numpy pycharm nested-loops data-analysis

Nested for loops and data analysis across multiple data files


I have written the following code and am having a few last issues that I could use some help with. Once this is polished up, I figure it could be really useful to people doing data analysis on point proximity in the future.

The purpose of this code is to read two separate data files into NumPy arrays, one point per row. From there, the nested for loop is intended to take point 1 in file1 and compare its angular separation to each point in file2, then take point 2 in file1 and compare it to each point in file2, and so on.

The code worked beautifully for all of my test files, which have only around 100 elements each. I am pretty sure the angular separation in spherical coordinates is written properly, and I have converted the measurements from degrees to radians.

import numpy as np
import math as ma

filename1 = r"C:\Users\Justin\Desktop\file1.data"  # raw string so "\f" is not read as a form feed
data1 = np.genfromtxt(filename1,
                     skip_header=1,
                     usecols=(0, 1))
                     #dtype=[
                            #("x1", "f9"),
                         #("y1", "f9")])
#print "data1", data1

filename2 = r"C:\Users\Justin\Desktop\file2.data"  # raw string so "\f" is not read as a form feed
data2 = np.genfromtxt(filename2,
                      skip_header=1,
                      usecols=(0, 1))
                      #dtype=[
                             #("x2", "f9"),
                             #("y2", "f9")])

#print "data2",data2

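def d(a, b):
    # Angular separation in radians between points a and b.
    # a[0], a[1] and b[0], b[1] are given in degrees.
    lat_a, lat_b = ma.radians(a[1]), ma.radians(b[1])
    dlon = ma.radians(a[0] - b[0])
    return ma.acos(ma.sin(lat_a) * ma.sin(lat_b)
                   + ma.cos(lat_a) * ma.cos(lat_b) * ma.cos(dlon))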

for coor1 in data1:
    for coor2 in data2:
        a = [coor1[0], coor1[1]]
        b = [coor2[0], coor2[1]]
        #print "a", a
        #print "b", b
        sep = d(a, b)  # compute the separation once per pair
        if sep < 0.0174533:  # 0.0174533 rad is about 1 degree
            print 'True', sep
        else:
            print 'False', sep
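
One way to check that the separation formula is written properly is to run it on pairs whose separation is known in advance. The following is a minimal sketch, written for Python 3 and assuming (as the formula above implies) that column 0 is the longitude-like angle and column 1 the latitude-like angle, both in degrees. The name angular_separation and the clamping step are additions for illustration and are not part of the code above.

```python
import math

def angular_separation(a, b):
    """Angular separation in radians between two (lon, lat) points in degrees."""
    lon1, lat1 = math.radians(a[0]), math.radians(a[1])
    lon2, lat2 = math.radians(b[0]), math.radians(b[1])
    cos_sep = (math.sin(lat1) * math.sin(lat2)
               + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    # Rounding can push cos_sep slightly outside [-1, 1] for (nearly)
    # identical points, and math.acos raises ValueError in that case.
    cos_sep = max(-1.0, min(1.0, cos_sep))
    return math.acos(cos_sep)

# Pairs with a known separation
quarter = angular_separation((0.0, 0.0), (90.0, 0.0))    # 90 degrees along the equator
same = angular_separation((10.0, 45.0), (10.0, 45.0))    # identical points
one_deg = angular_separation((0.0, 0.0), (0.0, 1.0))     # 1 degree apart in latitude

print(quarter, math.pi / 2)
print(same)
print(one_deg, math.radians(1.0))
```

For identical points the result may come out as a very small positive number rather than exactly zero, because this form of the formula loses precision at tiny separations.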

Unfortunately, I am now having issues with much larger files (between 10,000 and 500,000 data points each). I have narrowed it down to a few things, but first, here are my problems:

1.) When running, the output window states ! Too much output to process, although plenty of results still come out. (Could this be a PyCharm issue?)

2.) The first line of output from my code is complete nonsense, and it changes on every run without a T/F result. With more testing, this seems specific to the Too much output to process issue.

Here are some potential issues that I can't seem to remedy or just don't understand:

1.) I have not properly defined a = [coor1[0], coor1[1]] and b = [coor2[0], coor2[1]], or am not calling the coordinates properly. But again, this worked perfectly with the other test files.

2.) Since I am running Windows, the .data files get corrupted from their original format, which came from a Mac. I've tried renaming them to .txt and even opening them in Word, but the file just gets totally scrambled. I have been assured that this shouldn't matter, but I am still suspicious, especially since I can't open them without corrupting the data format.

3.) The files are simply too big for my computer or PyCharm/NumPy to handle efficiently, although I doubt this.

4.) Just to cover all bases: The possibility that my code sucks and I need to learn more? First big project here so if that's the case don't hesitate to point out anything that may be helpful.
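
For point 2, one way to look at a file without risking any change to it is to read it in binary mode and count the line-ending styles. This is only a sketch for Python 3: the helper name count_line_endings and the temporary demo file are made up for illustration, and whether line endings have anything to do with the problem above is an assumption, not something established here.

```python
import os
import tempfile

def count_line_endings(path, sample_bytes=1_000_000):
    """Count line-ending styles in the first sample_bytes of a file (read-only)."""
    with open(path, "rb") as f:  # binary mode: nothing is translated or rewritten
        chunk = f.read(sample_bytes)
    crlf = chunk.count(b"\r\n")
    return {
        "CRLF (Windows)": crlf,
        "LF (Unix / macOS)": chunk.count(b"\n") - crlf,
        "CR (classic Mac OS)": chunk.count(b"\r") - crlf,
    }

# Demo on a throwaway file with Windows-style line endings
with tempfile.NamedTemporaryFile("wb", suffix=".data", delete=False) as tmp:
    tmp.write(b"x y\r\n1.0 2.0\r\n3.0 4.0\r\n")
demo_counts = count_line_endings(tmp.name)
os.remove(tmp.name)
print(demo_counts)
```

Pointing count_line_endings at the real .data files would show which style they use while leaving them untouched on disk.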


Solution

  • After a little more research and some advice from colleagues about how best to test my code with different variables, I realized that the code I wrote is actually doing exactly what I want, and (for now at least) everything is working well. All I did was restrict my proximity search to a much smaller field; before, it was returning 10,000 x 10,000 results, printing the distance, the coordinate points, and a T/F statement for each.

    So when the search field is small enough, the code does exactly what I need. I am not sure why I was getting the Too much output to process error, but I will likely post another question on here to clarify that problem. Again, although the above code is perhaps brute force and basic in approach, it is a simple and workable way of analyzing the proximity of points across multiple tables in a for loop.
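
If the larger files need to be handled without shrinking the search field, one alternative to printing every pair is to let NumPy compare many points at once and keep only the matches. The sketch below is written for Python 3 and is not the code from the question: close_pairs, its chunk parameter, and the random demo data are all hypothetical, and it assumes the same layout as above (column 0 longitude-like, column 1 latitude-like, both in degrees).

```python
import numpy as np

def close_pairs(data1, data2, max_sep_rad, chunk=64):
    """Return index arrays (i, j) and separations for pairs closer than max_sep_rad.

    data1, data2: arrays of shape (n, 2) holding (lon, lat) in degrees.
    """
    lon2 = np.radians(data2[:, 0])
    lat2 = np.radians(data2[:, 1])
    sin_lat2, cos_lat2 = np.sin(lat2), np.cos(lat2)
    rows = [np.empty(0, dtype=np.intp)]
    cols = [np.empty(0, dtype=np.intp)]
    seps = [np.empty(0)]
    # Work through data1 in chunks so each temporary array has only
    # chunk x len(data2) entries instead of len(data1) x len(data2).
    for start in range(0, len(data1), chunk):
        block = data1[start:start + chunk]
        lon1 = np.radians(block[:, 0])[:, None]  # column vectors broadcast
        lat1 = np.radians(block[:, 1])[:, None]  # against the rows of data2
        cos_sep = (np.sin(lat1) * sin_lat2
                   + np.cos(lat1) * cos_lat2 * np.cos(lon1 - lon2))
        sep = np.arccos(np.clip(cos_sep, -1.0, 1.0))
        i, j = np.nonzero(sep < max_sep_rad)
        rows.append(i + start)  # shift back to indices into data1
        cols.append(j)
        seps.append(sep[i, j])
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(seps)

# Demo on random points (made-up data, not the files from the question)
rng = np.random.default_rng(0)
demo1 = np.column_stack([rng.uniform(0, 360, 200), rng.uniform(-90, 90, 200)])
demo2 = np.column_stack([rng.uniform(0, 360, 300), rng.uniform(-90, 90, 300)])
i, j, sep = close_pairs(demo1, demo2, np.radians(1.0))
print(len(i), "pairs within one degree out of", len(demo1) * len(demo2))
```

Only the matching pairs are kept, so the amount of output scales with the number of matches rather than with every pair compared. The chunk size trades memory for speed: each temporary array holds roughly chunk x len(data2) floats.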