python, pandas, geolocation, social-media

Filter Pandas df by count of a column and write data


I have a data set of geo-located social media posts that I am trying to filter by user_id frequency, keeping only users whose frequency is greater than 1 (users who posted two or more times). I want to filter this way so I can further clean the trajectory data I'm creating.

Sample code:

# Import data
import pandas as pd
import numpy as np

data = pd.read_csv('path', delimiter=',', engine='python')
# print(len(data), "rows")
# print(data)

# Create DataFrame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
# print(data.head())

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id, data.loc[data['user_id'] == id].sort_values(by='timestamp')[['latitude', 'longitude']].values.tolist()] for id in uniqueIds]

# Save outputs
outputs = pd.DataFrame(output)
# print(outputs)
outputs.to_csv('path', index=False, header=False)

I tried using df['user_id'].value_counts() to get a count per user_id and then pass >1 in the line output = [[......data['user_id']==id>1]....., but that did not work (a sketch of this attempt follows the sample data below). Is it possible to add the frequency of user_id as an additional condition and extract information for only those users?

Sample data:

user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684
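
For reference, the value_counts() attempt described above can be written with boolean indexing via isin. This is only a sketch against the sample data: the 'path' placeholder is kept from the question, and skipinitialspace=True is assumed in order to strip the spaces after the commas in the header.

import pandas as pd

df = pd.read_csv('path', skipinitialspace=True)

# Count posts per user_id, then keep only the ids that appear more than once
counts = df['user_id'].value_counts()
repeat_users = counts[counts > 1].index
filtered = df[df['user_id'].isin(repeat_users)]
print(filtered)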

Solution

  • Let's try groupby followed by the filter method, which returns only the records in groups that evaluate as True in the filter function. In this case, return True for groups (user_ids) that have more than one record.

    df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
    

    A more efficient statement uses transform with boolean indexing.

    df[df.groupby('user_id')['user_id'].transform('count') > 1]
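
    As a sketch of how this could slot into the pipeline from the question: column names are taken from the sample data, the 'path' placeholders are kept as-is, and skipinitialspace=True is assumed to handle the spaces after the commas.

    import pandas as pd

    df = pd.read_csv('path', skipinitialspace=True)

    # Keep only the users who posted more than once
    df = df[df.groupby('user_id')['user_id'].transform('count') > 1]

    # Ordered (by timestamp) coordinates for each remaining user_id
    output = [
        [uid, grp.sort_values(by='timestamp')[['latitude', 'longitude']].values.tolist()]
        for uid, grp in df.groupby('user_id')
    ]

    outputs = pd.DataFrame(output)
    outputs.to_csv('path', index=False, header=False)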