I have script which looks at the rows and columns headers belonging to a group (REG_ID) and sums the values. The code runs on a matrix (small subset) as follows:
My code runs well for calculating the sum for all the IDs based on rows and columns belonging to each internal group (REG_ID). For example for all row and column IDs which belong to REG_ID 1 are summed so the total flows between region 1 and region 1 (internal flows) is calculated and so on for each region. I wish to extend this code by calculating (summing) the flows between regions for example region 1 to region 2, 3, 4 ,5.... I figure that I need to include another loop within the existing while loop but would really appreciate some help to figure out where it should be and how to construct it. My code which currently runs on the internal flow sum (1-1, 2-2, 3-3 etc) is as follows:
global index
index = 1
x = index
while index < len(idgroups):
ward_list = idgroups[index] #select list of ward ids for each region from list of lists
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
dfintflow = pd.DataFrame(intflow)
dfintflow.reset_index(level=0, inplace=True)
dfintflow.columns = ["RegID", "regflowsum"]
dfflows.set_value(index, 'RegID', index)
dfflows.set_value(index, 'RegID2', index)
dfflows.set_value(index, 'regflow', regflowsum)
mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
index += 1 #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2'])
print new_df #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"]
mergedcsv.to_csv(regionflows, columns = header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]
pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)
The output is as follows:
idgroups dataframe: (see image 1 - second part of image 1)
df7 and intflows for each region Reg_ID:(third part of image 1 - on the right)
ddflows dataframe:(fourth part of image 2)
and the final output is new_df:(fifth part of image 2)
I wish to populate the sums for all possible combinations of flows between the regions not just internal.
I figure I need to add another loop into the while loop. So possibly add an enumerate function like:
while index < len(idgroups):
#add line(s) to calculate flows between regions
for index, item in enumerate(idgroups):
ward_list = idgroups[index]
print ward_list
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
dfintflow = pd.DataFrame(intflow)
dfintflow.reset_index(level=0, inplace=True)
dfintflow.columns = ["RegID", "regflowsum"]
dfflows.set_value(index, 'RegID', index)
dfflows.set_value(index, 'RegID2', index)
dfflows.set_value(index, 'regflow', regflowsum)
mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
index += 1 #increment index number
I'm unsure how to integrate the item so struggling to extend the code for all combinations. Any advice appreciated.
Update based on flows function:
w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL')
Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
Similarity = np.array(Simil)
db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r')
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
ids = np.array((dbf.by_col['LABEL']))
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
commutersint = commuters.astype(int)
comm = commutersint[0]
floor = int(MIN_COM_CT + 100)
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
regions = solution.regions
#print regions
writecsv = r"C:\Temp\AllNI\reg_output.csv"
csv = open(writecsv,'w')
csv.write('"LABEL","REG_ID"\n')
for i in range(len(regions)):
for lines in regions[i]:
csv.write('"' + lines + '","' + str(i+1) + '"\n')
csv.close()
flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
regs = r"C:\Temp\AllNI\reg_output.csv"
wardflows = pd.read_csv(flows)
regoutput = pd.read_csv(regs)
merged = pd.merge(wardflows, regoutput)
#duplicate REG_ID column as the index to be used later
merged['REG_ID2'] = merged['REG_ID']
merged.to_csv("C:\Temp\AllNI\merged.csv", index=False)
mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values
reg_id = "REG_ID"
ward_flows = "RegIntFlows"
flds = [reg_id, ward_flows] #create list of fields to be use in search
dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields
#group the dataframe by the REG_ID column
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
print idgroups
idgrp_df = pd.DataFrame(idgroups)
csvcols = mergedcsv.columns
#create a list of column names to pass as an index to select columns
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
mergedcsvgroup.describe()
idList = idgroups[2]
df4 = pd.DataFrame()
df5 = pd.DataFrame()
col_ids = idList #ward id no
regiddf = idgroups.index.get_values()
print regiddf
#total number of region ids
#print regiddf
#create pairlist combinations from region ids
#combinations with replacement allows for repeated items
#pairs = list(itertools.combinations_with_replacement(regiddf, 2))
pairs = list(itertools.product(regiddf, repeat=2))
#print len(pairs)
#create a new dataframe with pairlists and summed data
pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
print pairlist.tail()
header_pairlist = ["origID","destID","flow"]
header_intflow = ["RegID", "RegID2", "regflow"]
dfflows = pd.DataFrame(columns=header_intflow)
print mergedcsv.index
print mergedcsv.dtypes
#mergedcsv = mergedcsv.select_dtypes(include=['int64'])
#print mergedcsv.columns
#mergedcsv.rename(columns = lambda x: int(x), inplace=True)
def flows():
pass
#def flows(mergedcsv, region_a, region_b):
def flows(mergedcsv, ward_lista, ward_listb):
"""Return the sum of all the cells in the row/column intersections
of ward_lista and ward_listb."""
mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
regionflows = mergedcsv.loc[ward_lista, ward_listb]
regionflowsum = regionflows.values.sum()
#grid = [ax, bx, regflowsuma, regflowsumb]
gridoutput = [ax, bx, regionflowsum]
print gridoutput
return regflowsuma
return regflowsumb
#print mergedcsv.index
#mergedcsv.columns = mergedcsv.columns.str.strip()
for ax, group_a in enumerate(idgroups):
ward_lista = map(int, group_a)
print ward_lista
for bx, group_b in enumerate(idgroups[ax:], start=ax):
ward_listb = map(int, group_b)
#print ward_listb
flow_ab = flows(mergedcsv, ward_lista, ward_listb)
#flow_ab = flows(mergedcsv, group_a, group_b)
This results in KeyError: 'None of [[189, 197, 198, 201]] are in the [columns]'
I have tried using ward_lista = map(str, group_a) and map(int, group_a) also but list objects not found in dataframe.loc. The columns are mixed datatypes but all the columns containing the labels which should be sliced are of type int64. I have tried many solutions around the datatypes but to no avail. Any suggestions?
I can't speak to the computations you're doing, but it seems like you just want to arrange combinations of groups. The question is whether they are directed or undirected- that is, do you need to compute flows(A,B) and flows(B,A), or just one?
If just one, you could do this:
for i,ward_list in enumerate(idgroups):
for j,ward_list2 in enumerate(idgroups[i:],start=i):
This would iterate over i,j pairs like:
0,0 0,1 0,2 ... 0,n
1,1 1,2 ... 1,n
2,2 ... 2,n
which would serve in the undirected case.
If you need to compute both flows(A,B) and flows(B,A), then you can simply push your code into a function called flows
, and call it with reversed args, as shown. ;-)
Update
Let's define a function called flows
:
def flows():
pass
Now, what are the parameters?
Well, looking at your code, it gets data from a DataFrame. And you want two different wards, so let's start with those. The result seems to be a sum of the resulting grid.
def flows(df, ward_a, ward_b):
"""Return the sum of all the cells in the row/column intersections
of ward_a and ward_b."""
return 0
Now I'm going to copy lines of your code:
ward_list = idgroups[index]
print ward_list
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
I think this is most of the flow
function right here. Let's look.
The ward_list
will obviously be either the ward_a
or ward_b
parameters.
I'm not sure what df6
is, because you sort of recompute it in df7
. So that need to be clarified.
regflowsum
is our desired output, I think.
Rewriting this into the function:
def flows(df, ward_a, ward_b):
"""Return the sum of all the cells in the row/column intersections
of ward_a and ward_b."""
print "Computing flows from:"
print " ", ward_a
print ""
print "flows into:"
print " ", ward_b
# Filter rows by ward_a, cols by ward_b:
grid = df.loc[ward_a, ward_b]
print "Grid:"
print grid
flowsum = grid.values.sum()
print "Flows:", flowsum
return flowsum
Now, I have assumed that the ward_a
and ward_b
values are already in the correct format. So we'll have to str
-ify them or whatever outside the function. Let's do that:
for ax, group_a in enumerate(idgroups):
ward_a = map(str, group_a)
for bx, group_b in enumerate(idgroups[ax:], start=ax):
ward_b = map(str, group_b)
flow_ab = flows(mergedcsv, ward_a, ward_b)
if ax != bx:
flow_ba = flows(mergedcsv, ward_b, ward_a)
else:
flow_ba = flow_ab
# Now what?
At this point, you have two numbers. They will be equal when the wards are the same (internal flow?). At this point your original code stops being helpful because it only deals with internal flows, and not A->B flows, so I don't know what to do. But the values are in the variables, so ...