
Analyze data using Python


I have a CSV file in the following format:

30  1964    1   1
30  1962    3   1
30  1965    0   1
31  1959    2   1
31  1965    4   1
33  1958    10  1
33  1960    0   1
34  1959    0   2
34  1966    9   2
34  1958    30  1
34  1960    1   1
34  1961    10  1
34  1967    7   1
34  1960    0   1
35  1964    13  1
35  1963    0   1

The first column denotes the age and the last column denotes the survival status (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years). I have to calculate which age has the highest survival rate. I am new to Python and cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function, but I cannot figure out how to check one column and print the corresponding value from another column. Please help.

I was able to find an answer where I only had to analyze the first column:

import csv
from statistics import mode  # mode() was used below without being imported

a = []  # ages (first column)
b = []  # survival status (last column)

with open('Dataset.csv', newline='') as df:
    csv_df = csv.reader(df)
    for row in csv_df:
        a.append(row[0])
        b.append(row[3])

print('The age that has maximum reported incidents of cancer is ' + mode(a))

Solution

  • I am not entirely sure I understood your logic for determining the age with the maximum survival rate. The following code assumes that the age with the highest number of 1s has the highest survival rate.

    I have done the reading part a little differently, as the data set behaved oddly when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.

    In the following code, we maintain a dictionary, survival_map, and count the frequency of a particular age being associated with a 1.

    import operator
    
    survival_map = {}
    
    with open('Dataset.csv', 'r') as in_f:  # text mode; in Python 3, 'rb' would yield bytes
        for row in in_f:
            row = row.rstrip()  # remove the end-of-line character
            items = row.split(',')  # I converted the tabs to commas; had a problem otherwise
    
            age = int(items[0])
            survival_rate = int(items[3])
    
            if survival_rate == 1:        
                if age in survival_map:
                    survival_map[age] += 1
                else:
                    survival_map[age] = 1
    
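    The counting loop above can also be sketched with collections.Counter, splitting on whitespace so the tab-separated rows do not need to be converted to commas first. This is only an illustration: the function name count_survivors is hypothetical, and the sample rows are a handful copied from the question rather than the full file.

```python
from collections import Counter


def count_survivors(lines):
    """Count, per age, the rows whose last column is 1 (survived 5+ years)."""
    survivors = Counter()
    for line in lines:
        fields = line.split()  # no argument: split on any run of spaces or tabs
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        age, status = int(fields[0]), int(fields[3])
        if status == 1:
            survivors[age] += 1
    return survivors


# A few rows copied from the question. With a real file, an open file
# object can be passed instead of a list, since both yield lines.
sample = [
    "30  1964    1   1",
    "30  1962    3   1",
    "34  1959    0   2",
    "34  1966    9   2",
    "34  1958    30  1",
]
counts = count_survivors(sample)
print(counts.most_common(1))  # [(30, 2)]
```

    Counter.most_common(1) returns the (age, count) pair with the largest count, which gives the same kind of result as taking the first element of a list sorted by count.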

    Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, its items are sorted in descending order by value (the count of 1s), not by key:

    sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
    max_survival = sorted_survival_map[0]
    

    UPDATE:

    For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:

    maximum = max(survival_map, key=survival_map.get)  # key with the largest count
    print(maximum, survival_map[maximum])
    

    For multiple max values:

    max_keys = []
    max_value = 0
    for k,v in survival_map.items():
        if v > max_value:
            max_keys = [k]
            max_value = v
        elif v == max_value:
            max_keys.append(k)
    
    print([(x, max_value) for x in max_keys])
    

    Of course, this could also be written with a comprehension; however, I am proposing the explicit loop for readability. It also makes a single pass over the dictionary instead of several, so it runs in O(n) time, whereas sorting the items takes O(n log n).
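
    The counting approach above treats the age with the most 1s as the answer, which tends to favor ages that simply have more patients. If "survival rate" is instead read as the fraction of patients of a given age who survived, a rough sketch could look like the following. This is a different interpretation from the one assumed in this answer; the function name survival_rates is hypothetical, and the rows are the (age, status) pairs for ages 33 and 34 from the question's sample.

```python
from collections import defaultdict


def survival_rates(rows):
    """Map each age to survivors / patients, given (age, status) pairs."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for age, status in rows:
        totals[age] += 1
        if status == 1:  # 1 = survived 5 years or longer
            survived[age] += 1
    return {age: survived[age] / totals[age] for age in totals}


# (age, status) pairs for ages 33 and 34, taken from the question's sample
rows = [(33, 1), (33, 1),
        (34, 2), (34, 2), (34, 1), (34, 1), (34, 1), (34, 1), (34, 1)]
rates = survival_rates(rows)
best = max(rates.values())
# Collect every age that reaches the best rate, so ties are reported too
print([age for age, rate in rates.items() if rate == best])  # [33]
```

    Here age 34 has the most survivors (5 of 7), but age 33 has the higher rate (2 of 2), so the two interpretations can give different answers on the same data.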