I'll give the complete context just in case. I found some solutions, but only ones that use an explicit for i in range loop
or a simple condition, not a condition like the one I need.
I have a DataFrame with the columns post, author, DateTime, day_of_week, and hours.
Now I want to calculate the probability that any author publishes a post on a specific day of the week,
which is number_post_that_week_day / total_post.
This is simple and can be done as follows (probably not the best way, but an acceptable one):
count_by_field = data_set.groupby('day_of_week').count()['post']
total_by_field = data_set.groupby('day_of_week').count()['post'].sum()
temp_prob_by_field = count_by_field / total_by_field
# I need temp_prob_by_field to always have 7 entries (size is 7 here),
# but my sample, in some cases, only has Monday and Saturday.
# With the next lines I will always have 7 records.
for index in range(size):
    if index not in temp_prob_by_field.index:
        temp_prob_by_field.loc[index] = 0
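For reference, the counting and zero-filling steps above could probably also be written without the loop. This is only a minimal sketch on made-up data: it assumes day_of_week holds the integers 0 to 6 with no missing values, and the name prob_by_day is just for illustration.

```python
import pandas as pd

# Hypothetical sample: posts only on days 0 and 5.
data_set = pd.DataFrame({
    "post": ["a", "b", "c", "d"],
    "day_of_week": [0, 5, 5, 0],
})

# Share of posts per day; normalize=True divides each count by the total.
prob_by_day = data_set["day_of_week"].value_counts(normalize=True)

# Reindex to all seven days, giving days without posts a probability of 0.
prob_by_day = prob_by_day.reindex(range(7), fill_value=0.0)
```

Reindexing also sorts the result by day, which the loop version does not guarantee.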
THE PROBLEM
I want to add my probability values to the original data_set as a new column (probs), matched against the day_of_week column. I mean:
if a record has 3 (which means Wednesday) in the day_of_week column, I want the probs column of that record to hold the probability associated with 3.
What I've been trying (without success):
data_set[data_set.loc[ data_set['hours'] in temp_prob_by_field.index, temp_prob_by_field ]]
= temp_prob_by_field.loc[data_set.loc[ data_set['hours'] in temp_prob_by_field.index] # 🤷♂️
I can do this with a for loop, as follows:
for i in range(7):
    data_set.loc[data_set['day_of_week'] == i, 'probs'] = temp_prob_by_field.loc[i]
I'm really new to pandas, and this doesn't seem to me like a good way to solve the problem, but maybe I'm wrong.
As @not_speshai asked, here is a data sample to play with:
import pandas as pd
import numpy as np
np.random.seed(1213)
c = ['post', 'author', 'datetime', 'day_of_week', 'hours']
data = pd.DataFrame(np.random.choice([1,0,3,5], size=(10,5)), columns=c)
data['post'] = 'A post about something'
"""
                     post  author  datetime  day_of_week  hours
0  A post about something       5         5            0      3
1  A post about something       1         1            1      5
2  A post about something       3         1            3      5
3  A post about something       5         3            5      1
4  A post about something       0         5            3      0
5  A post about something       3         3            0      1
6  A post about something       0         5            5      0
7  A post about something       3         3            5      3
8  A post about something       5         1            1      0
9  A post about something       1         0            0      3
"""
I think what you're looking for is pd.merge. Try:
data.merge(temp_prob_by_field, left_on="day_of_week", right_index=True)
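A slightly fuller sketch of the same idea, on made-up data. Two things here are my own assumptions rather than part of the question: renaming the Series to probs so the merged column does not clash with the existing post column, and passing how="left" so rows whose day has no probability are kept instead of dropped. Series.map is shown as an alternative that writes the column directly.

```python
import pandas as pd

# Hypothetical sample data; only day_of_week matters for the lookup.
data = pd.DataFrame({
    "post": ["A post about something"] * 4,
    "day_of_week": [0, 1, 3, 0],
})

# Probability of a post per weekday, computed as in the question.
counts = data.groupby("day_of_week")["post"].count()
temp_prob_by_field = counts / counts.sum()

# Rename the Series so the new column is called "probs";
# how="left" keeps every row of data in its original order.
merged = data.merge(
    temp_prob_by_field.rename("probs"),
    left_on="day_of_week",
    right_index=True,
    how="left",
)

# Alternative: look up each row's weekday in the Series index.
data["probs"] = data["day_of_week"].map(temp_prob_by_field)
```

Both give the same probs values; map avoids creating a second DataFrame, while merge may be more convenient when several columns need to be attached at once.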