
Counting unique values in a column using a sub-id


I have a df containing sub-trajectories (segments) of users, with the mode of travel indicated by 0, 1, 2, ..., which looks like this:

df = pd.read_csv('sample.csv')
df
       id     lat          lon   mode
0   5138001 41.144540   -8.562926   0
1   5138001 41.144538   -8.562917   0
2   5138001 41.143689   -8.563012   0
3   5138003 43.131562   -8.601273   1
4   5138003 43.132107   -8.598124   1
5   5145001 37.092095   -8.205070   0
6   5145001 37.092180   -8.204872   0
7   5145015 39.289341   -8.023454   2
8   5145015 39.197432   -8.532761   2
9   5145015 39.198361   -8.375641   2

In the above sample, id is for the segments, but a full trajectory may be covered by different modes (i.e. contain multiple segments). So the first 4 digits of id identify the unique trajectory, and the last 3 digits the unique segment within that trajectory.

I know that I can count the number of unique modes per segment in the df using:

df.groupby('id')['mode'].nunique()

How do I then count the number of unique trajectories 5138, 5145, ...?


Solution

  • Use string indexing with str to get the first 4 characters; if necessary, first convert the values to strings with Series.astype:

    df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
    print (df)
         id  count
    0  5138      2
    1  5145      2
    

    If you need to process the values after the first 4 digits (here, counting unique segment suffixes per trajectory):

    s = df['id'].astype(str)
    df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
    print (df)
         id  count
    0  5138      2
    1  5145      2
    

    Another idea is to use a lambda function:

    df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
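If the ids are guaranteed to be 7-digit integers (as in the sample), a string-free alternative is integer division: dividing by 1000 drops the 3-digit segment suffix and leaves the 4-digit trajectory prefix. This is a sketch under that assumption; it builds the sample data inline rather than reading sample.csv:

```python
import pandas as pd

# Sample data mirroring the question's df
df = pd.DataFrame({
    "id": [5138001, 5138001, 5138001, 5138003, 5138003,
           5145001, 5145001, 5145015, 5145015, 5145015],
    "mode": [0, 0, 0, 1, 1, 0, 0, 2, 2, 2],
})

# Integer division by 1000 removes the 3-digit segment suffix,
# so grouping by df["id"] // 1000 groups by trajectory.
counts = df.groupby(df["id"] // 1000)["mode"].nunique().reset_index(name="count")
print(counts)
#      id  count
# 0  5138      2
# 1  5145      2
```

This avoids the astype(str) round trip, but only works while every id has exactly 3 trailing segment digits; the string slicing above is safer if id lengths vary.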