I am trying to write a function that takes the string response in column "Frequency" and calculates the total number of days per year someone drank alcohol.
The three main values I am trying to take from the string are the numbers that exist in the statement and the word (week, month, year) to calculate the average total number of days someone drank within a year. For example, if someone drank 2-3 times per month the equation would be (2+3/2)*12 = 30 times per year. The data table below shows a sample of the data.
Frequency |
---|
1 day per month |
3 days per week |
1 to 2 days per year |
2 days per week |
1 day per month |
6-11 days per year |
5-6 days a week |
The table I am trying to produce would have the average days per year as seen below:
Frequency per year |
---|
12 |
156 |
1.5 |
104 |
12 |
8.5 |
286 |
So far I have written the code below:
import pandas as pd
AlcData = pd.read_excel('Alcohol_Data.xlsx')
#add new column with unittime value for use in function
AlcData['unittime'] = AlcData.Frequency.str.extract(r'\b(\w+)$',
expand = True)
def calculatetotaldays(row):
for x in range(1,11):
#read in row item as string value
string = AlcData.Frequency
# create list of number values from the string
numbers = [int(i) for i in string.split() if i.isdigit()]
#compute total days if list has length of 1
if len(numbers) == 1:
x = [numbers[j] for j in (0)]
if row[AlcData.unittime] == 'week':
total = x*52
elif row[AlcData.unittime] == 'month':
total = x*12
elif row[AlcData.unittime] == 'year':
total = x
#compute total days if list has length of 2
if len(numbers) == 2:
x, y = [numbers[j] for j in (0, 1)]
if row[AlcData.unittime] == 'week':
total = (((x+y)/2)*52)
elif row[AlcData.unittime] == 'month':
total = (((x+y)/2)*12)
elif row[AlcData.unittime] == 'year':
total = ((x+y)/2)
return total
AlcData['totalperyear'] = AlcData.apply(calculatetotaldays, axis=1)
I am currently getting the error: "'Series' object has no attribute 'split'" while trying to extract numbers across the rows into lists. Does anyone know how to correct this error within the function? More importantly, is this approach (using the length of the list to assign these variables and calculate numbers) is the best way to solve this problem?
I've been struggling with this for a long time so any and all tips on how to compute this information would be extremely helpful.
This is largely a rewrite, but here's a way to go about it with just pandas
:
In [92]: (
...: df['Frequency']
...: .str.split()
...: .str[-1].map({'week': 52, 'month': 12, 'year': 1})
...: .mul(
...: df['Frequency']
...: .str.extract(r'(\d+)\D*(\d+)?')
...: .ffill(axis=1)
...: .astype(int).mean(axis=1)
...: )
...: )
Out[92]:
0 12.0
1 156.0
2 1.5
3 104.0
4 12.0
5 8.5
6 286.0
dtype: float64
We can break it into a calculation of the numerical parts and then a multiplier. You can get the numbers out with a regex similar to what you were doing before:
In [89]: df['Frequency'].str.extract(r'(\d+)\D*(\d+)?')
Out[89]:
0 1
0 1 NaN
1 3 NaN
2 1 2
3 2 NaN
4 1 NaN
5 6 11
6 5 6
and from there, you can use .ffill()
and .mean()
to turn it into one number:
In [90]: df['Frequency'].str.extract(r'(\d+)\D*(\d+)?').ffill(axis=1).astype(int).mean(axis=1)
Out[90]:
0 1.0
1 3.0
2 1.5
3 2.0
4 1.0
5 8.5
6 5.5
dtype: float64
If you know it always ends in year
, month
, or week
, you can just pick that off and use map
for the multiplier:
In [91]: df['Frequency'].str.split().str[-1].map({'week': 52, 'month': 12, 'year': 1})
Out[91]:
0 12
1 52
2 1
3 52
4 12
5 1
6 52
Name: Frequency, dtype: int64
and then you can multiply them as I did above.