I am fairly new to Python, and although I've read many articles, I am still not sure how to ignore the lines that start with '#'.
I need to:
Make the four columns (col2-col5) in this TSV file into separate lists. (How would I ignore the line with Hawaii, since it has incomplete data, and therefore use 49 data points?)
Then define a function Pearson(X,Y) that takes two lists as parameters and returns the Pearson correlation coefficient. Let X = [x1, x2, ..., xn] and Y = [y1, y2, ..., yn]. The Pearson correlation coefficient between X and Y is given by:
r = (nΣxiyi - ΣxiΣyi) / (√(nΣxi^2 - (Σxi)^2) · √(nΣyi^2 - (Σyi)^2))
listT = [26, 24, 23, 14, 15, 19, 21, 22, 18, 17, 16, 23, 24, 21, 20]
listH = [75, 69, 77, 51, 48, 68, 83, 68, 71, 51, 54, 71, 77, 67, 68]
def Pearson(X,Y):
    # Do something
    return
# Should print: var T and var H: 0.8139
print("var T and var H: %.4f"%(Pearson(listT, listH)))
While defining the function, how would I write out the Σ sign?
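There is no Σ operator in Python; each Σ in the formula corresponds to a call to the built-in sum(). As a minimal sketch (assuming both lists have the same length and that neither list is constant, since a constant list makes the denominator zero), the formula above could be written like this. The intermediate variable names are only illustrative:

```python
import math

def Pearson(X, Y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same length")
    n = len(X)
    sum_x = sum(X)                              # Σxi
    sum_y = sum(Y)                              # Σyi
    sum_xy = sum(x * y for x, y in zip(X, Y))   # Σxiyi
    sum_x2 = sum(x * x for x in X)              # Σxi^2
    sum_y2 = sum(y * y for y in Y)              # Σyi^2
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

listT = [26, 24, 23, 14, 15, 19, 21, 22, 18, 17, 16, 23, 24, 21, 20]
listH = [75, 69, 77, 51, 48, 68, 83, 68, 71, 51, 54, 71, 77, 67, 68]

# Prints: var T and var H: 0.8139
print("var T and var H: %.4f" % Pearson(listT, listH))
```

zip() pairs up xi with yi, so the Σxiyi term needs no index arithmetic.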
#------------------------------------------------------------------------
# Data from the CDC -- http://www.cdc.gov/ -- reports on prevalence of
# smoking, incidence of lung cancer, and deaths attributed to smoking.
#
# Col1: state
# Col2: cases of lung cancer (per 100,000 inhabitants)
# Col3: smoking among adults (%)
# Col4: attempts to quit (%)
# Col5: smoking-related deaths (per 100,000 inhabitants)
#------------------------------------------------------------------------
Alabama 107.1 24.9 47.1 321.1
Alaska 89 24.9 54.2 296.2
Arizona 63.4 18.6 49.4 248.9
Arkansas 105 25.7 45.6 334.1
California 64.4 14.8 51.4 261
Colorado 56.9 20.1 42.4 252.7
Connecticut 81.1 18.1 49 253.8
Delaware 98.8 24.5 48.7 296
District of Columbia 80.2 21 54.2 257.3
Florida 85.5 20.4 44.2 275.5
Georgia 98.3 20.1 54.8 312.3
Hawaii 68.2 NA NA 185.1
Idaho 62.7 17.5 47.2 254.1
Illinois 92 22.2 49.3 278.4
Indiana 102.8 25 47.5 322.2
Iowa 91.8 20.8 42.9 256.7
Kansas 84.8 19.8 43.7 270.8
Kentucky 132.6 27.6 47.6 378.1
Louisiana 108 23.6 51.8 309.1
Maine 99.3 21 55.3 303.8
Maryland 80.1 19.7 51.1 279.5
Massachusetts 83.3 18.5 52.5 258.6
Michigan 90 23.4 55.6 296.3
Minnesota 65 20.7 43.6 225.3
Mississippi 115.4 24.6 48.9 343.2
Missouri 103.9 24.1 43 325
Montana 73.1 20.4 45.4 292.6
Nebraska 82.8 20.3 46.7 251.9
Nevada 82.7 23.2 41.4 370.4
New Hampshire 80.6 21.8 53.2 294.8
New Jersey 78.8 18.9 49.6 253.1
New Mexico 57.6 20.3 45.6 250.8
New York 76.7 20 51.5 259.6
North Carolina 104.1 23.2 49.2 307
North Dakota 71.5 19.9 43.9 233
Ohio 97.4 25.9 41.3 310.6
Oklahoma 102.6 26.1 45.1 321.7
Oregon 77.6 20 46 277.5
Pennsylvania 89.4 22.7 47.1 269.1
Rhode Island 84.6 21.3 53.1 283
South Carolina 99.4 24.5 49.1 303.3
South Dakota 78.8 20.3 46.4 253.8
Tennessee 111.1 26.1 46.6 333.6
Texas 83.2 20.6 46.4 287.4
Utah 33.1 10.5 53.7 144.9
Vermont 90.2 20 55.4 272.2
Virginia 86.7 20.9 44.8 288.7
Washington 76.2 19.2 51.6 279.1
West Virginia 120 26.9 46.2 361.6
Wisconsin 75 22 47.7 258.2
Wyoming 57.8 21.7 48.5 294.2
This is what I have so far:
import csv
import operator
import math
import sys

cases_lung_cancer = []  # blank list for 1st data column (col2)
smoking_adults = []     # blank list for 2nd data column (col3)
attempts_quit = []      # blank list for 3rd data column (col4)
smoking_deaths = []     # blank list for 4th data column (col5)

def Pearson(X, Y):
    pass  # not written yet

with open('cdc_data.tsv', newline='') as csv_f:
    for row in csv.reader(csv_f, delimiter='\t'):
        pass  # not sure how to skip the '#' lines and Hawaii here
Make the four columns (col2-col5) in this TSV file into separate lists, ignoring the line with Hawaii since it has incomplete data, and therefore using 49 data points.
col0 = []
col1 = []
col2 = []
col3 = []
col4 = []
with open('cdc_data.tsv', 'r') as f:  # the with block closes the file when done
    contents = f.read()
lines = contents.split('\n')  # split file into separate lines
for line in lines:
    if line[0:1] == '#':  # filter out comments
        continue
    split_line = line.split('\t')  # split line into separate fields separated by TAB
    if len(split_line) < 5:  # drop any line that doesn't have 5 columns
        continue
    # assign each column to a separate list
    col0.append(split_line[0])
    col1.append(split_line[1])
    col2.append(split_line[2])
    col3.append(split_line[3])
    col4.append(split_line[4])
I'll leave the issue with Hawaii and your #2 problem as an exercise for you to complete.
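If you get stuck on the Hawaii part, here is one possible sketch. It assumes the missing values are always spelled NA, as in the Hawaii row, and it converts the remaining fields to float so they can be used in arithmetic. The function name read_columns and its parameters are made up for this illustration, and the sample rows are copied from the data in the question:

```python
def read_columns(lines, n_columns=5, missing="NA"):
    """Split tab-separated lines into per-column lists, skipping unusable rows."""
    states = []
    columns = [[] for _ in range(n_columns - 1)]  # one list per numeric column
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):      # skip blank lines and comments
            continue
        fields = line.split("\t")
        if len(fields) != n_columns or missing in fields:
            continue                              # skip incomplete rows such as Hawaii
        states.append(fields[0])
        for column, field in zip(columns, fields[1:]):
            column.append(float(field))           # raises ValueError on a malformed number
    return states, columns

sample = [
    "# Col1: state",
    "Alabama\t107.1\t24.9\t47.1\t321.1",
    "Hawaii\t68.2\tNA\tNA\t185.1",
    "Utah\t33.1\t10.5\t53.7\t144.9",
]
states, (cancer, smoking, quitting, deaths) = read_columns(sample)
print(states)   # ['Alabama', 'Utah']
print(smoking)  # [24.9, 10.5]
```

Because a file object yields its lines one at a time, the same function should also accept an open file, for example `with open('cdc_data.tsv') as f: states, cols = read_columns(f)`.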