Search code examples
pythondataframeenumerate

Create internal python loop based on index and groupings for all combinations


I have script which looks at the rows and columns headers belonging to a group (REG_ID) and sums the values. The code runs on a matrix (small subset) as follows:

Outputs

My code runs well for calculating the sum for all the IDs based on rows and columns belonging to each internal group (REG_ID). For example for all row and column IDs which belong to REG_ID 1 are summed so the total flows between region 1 and region 1 (internal flows) is calculated and so on for each region. I wish to extend this code by calculating (summing) the flows between regions for example region 1 to region 2, 3, 4 ,5.... I figure that I need to include another loop within the existing while loop but would really appreciate some help to figure out where it should be and how to construct it. My code which currently runs on the internal flow sum (1-1, 2-2, 3-3 etc) is as follows:

global index
index = 1
x = index
while index < len(idgroups):
    ward_list = idgroups[index] #select list of ward ids for each region from list of lists
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
    ward_listint = map(int, ward_list)
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
    df7 = df6.loc[:, ward_liststr]
    print df7
    regflowsum = df7.values.sum() #sum all values in dataframe
    intflow = [regflowsum]
    print intflow
    dfintflow = pd.DataFrame(intflow)
    dfintflow.reset_index(level=0, inplace=True)
    dfintflow.columns = ["RegID", "regflowsum"]
    dfflows.set_value(index, 'RegID', index)
    dfflows.set_value(index, 'RegID2', index)
    dfflows.set_value(index, 'regflow', regflowsum)
    mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1 #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows,  how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2'])
print new_df #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"]
mergedcsv.to_csv(regionflows, columns = header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]

pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)

The output is as follows:

idgroups dataframe: (see image 1 - second part of image 1)

df7 and intflows for each region Reg_ID:(third part of image 1 - on the right)

ddflows dataframe:(fourth part of image 2)

and the final output is new_df:(fifth part of image 2)

I wish to populate the sums for all possible combinations of flows between the regions not just internal.

I figure I need to add another loop into the while loop. So possibly add an enumerate function like:

while index < len(idgroups):
    #add line(s) to calculate flows between regions
    for index, item in enumerate(idgroups):
        ward_list = idgroups[index]
        print ward_list
        df6 = mergedcsv.loc[ward_list] #select rows with values in the list
        dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
        ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
        ward_listint = map(int, ward_list)
        #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
        df7 = df6.loc[:, ward_liststr]
        print df7
        regflowsum = df7.values.sum() #sum all values in dataframe
        intflow = [regflowsum]
        print intflow
        dfintflow = pd.DataFrame(intflow)
        dfintflow.reset_index(level=0, inplace=True)
        dfintflow.columns = ["RegID", "regflowsum"]
        dfflows.set_value(index, 'RegID', index)
        dfflows.set_value(index, 'RegID2', index)
        dfflows.set_value(index, 'regflow', regflowsum)
        mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
        index += 1 #increment index number

I'm unsure how to integrate the item so struggling to extend the code for all combinations. Any advice appreciated.

Update based on flows function:

    w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL')
Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
Similarity = np.array(Simil)
db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r')
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
ids = np.array((dbf.by_col['LABEL']))
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
commutersint = commuters.astype(int)
comm = commutersint[0]
floor = int(MIN_COM_CT + 100)
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
regions = solution.regions
#print regions
writecsv = r"C:\Temp\AllNI\reg_output.csv"
csv = open(writecsv,'w')
csv.write('"LABEL","REG_ID"\n')
for i in range(len(regions)):
        for lines in regions[i]:
            csv.write('"' + lines + '","' + str(i+1) + '"\n')
csv.close()
flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
regs = r"C:\Temp\AllNI\reg_output.csv"
wardflows = pd.read_csv(flows)
regoutput = pd.read_csv(regs)
merged = pd.merge(wardflows, regoutput)
#duplicate REG_ID column as the index to be used later
merged['REG_ID2'] = merged['REG_ID']
merged.to_csv("C:\Temp\AllNI\merged.csv", index=False)
mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values

reg_id = "REG_ID"
ward_flows = "RegIntFlows"
flds = [reg_id, ward_flows] #create list of fields to be use in search

dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields


#group the dataframe by the REG_ID column
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
print idgroups

idgrp_df = pd.DataFrame(idgroups)

csvcols = mergedcsv.columns

#create a list of column names to pass as an index to select columns
columnlist = list(mergedcsv.columns.values)

mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
mergedcsvgroup.describe()
idList = idgroups[2]
df4 = pd.DataFrame()
df5 = pd.DataFrame()
col_ids = idList #ward id no

regiddf = idgroups.index.get_values()
print regiddf
#total number of region ids
#print regiddf
#create pairlist combinations from region ids
#combinations with replacement allows for repeated items
#pairs = list(itertools.combinations_with_replacement(regiddf, 2))
pairs = list(itertools.product(regiddf, repeat=2))
#print len(pairs)

#create a new dataframe with pairlists and summed data
pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
print pairlist.tail()
header_pairlist = ["origID","destID","flow"]
header_intflow = ["RegID", "RegID2", "regflow"]
dfflows = pd.DataFrame(columns=header_intflow)

print mergedcsv.index
print mergedcsv.dtypes
#mergedcsv = mergedcsv.select_dtypes(include=['int64'])
#print mergedcsv.columns
#mergedcsv.rename(columns = lambda x: int(x), inplace=True)

def flows():
    pass

#def flows(mergedcsv, region_a, region_b):
def flows(mergedcsv, ward_lista, ward_listb):
    """Return the sum of all the cells in the row/column intersections
    of ward_lista and ward_listb."""

    mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
    regionflows = mergedcsv.loc[ward_lista, ward_listb]
    regionflowsum = regionflows.values.sum()


    #grid = [ax, bx, regflowsuma, regflowsumb]
    gridoutput = [ax, bx, regionflowsum]
    print gridoutput

    return regflowsuma
    return regflowsumb

#print mergedcsv.index

#mergedcsv.columns = mergedcsv.columns.str.strip()

for ax, group_a in enumerate(idgroups):
    ward_lista = map(int, group_a)
    print ward_lista


    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_listb = map(int, group_b)
        #print ward_listb

        flow_ab = flows(mergedcsv, ward_lista, ward_listb)
            #flow_ab = flows(mergedcsv, group_a, group_b)

This results in KeyError: 'None of [[189, 197, 198, 201]] are in the [columns]'

I have tried using ward_lista = map(str, group_a) and map(int, group_a) also but list objects not found in dataframe.loc. The columns are mixed datatypes but all the columns containing the labels which should be sliced are of type int64. I have tried many solutions around the datatypes but to no avail. Any suggestions?


Solution

  • I can't speak to the computations you're doing, but it seems like you just want to arrange combinations of groups. The question is whether they are directed or undirected- that is, do you need to compute flows(A,B) and flows(B,A), or just one?

    If just one, you could do this:

    for i,ward_list in enumerate(idgroups):
        for j,ward_list2 in enumerate(idgroups[i:],start=i):
    

    This would iterate over i,j pairs like:

    0,0 0,1 0,2 ... 0,n
    1,1 1,2 ... 1,n
    2,2 ... 2,n
    

    which would serve in the undirected case.

    If you need to compute both flows(A,B) and flows(B,A), then you can simply push your code into a function called flows, and call it with reversed args, as shown. ;-)

    Update

    Let's define a function called flows:

    def flows():
        pass
    

    Now, what are the parameters?

    Well, looking at your code, it gets data from a DataFrame. And you want two different wards, so let's start with those. The result seems to be a sum of the resulting grid.

    def flows(df, ward_a, ward_b):
        """Return the sum of all the cells in the row/column intersections
        of ward_a and ward_b."""
    
        return 0
    

    Now I'm going to copy lines of your code:

        ward_list = idgroups[index]
        print ward_list
        df6 = mergedcsv.loc[ward_list] #select rows with values in the list
        dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
        ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
        ward_listint = map(int, ward_list)
        #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
        df7 = df6.loc[:, ward_liststr]
        print df7
        regflowsum = df7.values.sum() #sum all values in dataframe
        intflow = [regflowsum]
        print intflow
    

    I think this is most of the flow function right here. Let's look.

    1. The ward_list will obviously be either the ward_a or ward_b parameters.

    2. I'm not sure what df6 is, because you sort of recompute it in df7. So that need to be clarified.

    3. regflowsum is our desired output, I think.

    Rewriting this into the function:

    def flows(df, ward_a, ward_b):
        """Return the sum of all the cells in the row/column intersections
        of ward_a and ward_b."""
    
        print "Computing flows from:"
        print "    ", ward_a
        print ""
        print "flows into:"
        print "    ", ward_b
    
        # Filter rows by ward_a, cols by ward_b:
        grid = df.loc[ward_a, ward_b]
    
        print "Grid:"
        print grid
    
        flowsum = grid.values.sum()
    
        print "Flows:", flowsum
    
        return flowsum
    

    Now, I have assumed that the ward_a and ward_b values are already in the correct format. So we'll have to str-ify them or whatever outside the function. Let's do that:

    for ax, group_a in enumerate(idgroups):
        ward_a = map(str, group_a)
    
        for bx, group_b in enumerate(idgroups[ax:], start=ax):
            ward_b = map(str, group_b)
    
            flow_ab = flows(mergedcsv, ward_a, ward_b)
    
            if ax != bx:
                flow_ba = flows(mergedcsv, ward_b, ward_a)
            else:
                flow_ba = flow_ab
    
            # Now what?
    

    At this point, you have two numbers. They will be equal when the wards are the same (internal flow?). At this point your original code stops being helpful because it only deals with internal flows, and not A->B flows, so I don't know what to do. But the values are in the variables, so ...