Search code examples
regexpandasgroup-byextractcontains

Pandas group_by string column which values contained in a separate list


I have a hierarchy-based event stream, where each hierarchy parent node(represented as level0/1) has multiple children (level0(0/1/2) and sub child (level00(0/1/2)). "level" is just a placeholder, each hierarchy level has its own unique name. The only rule is that a parent node hierarchy string is always included in the child's hierarchy string name. Assume that this event stream has 300k and more entries.

| index | hierarchystr          |
| ----- | --------------------- |
| 0     |  level0level00level000|
| 1     |  level0level01        |
| 2     |  level0level02level021|
| 3     |  level0level02level021|
| 4     |  level0level02level020|
| 5     |  level0level02level021|
| 6     |  level1level02level021|
| 7     |  level1level02level021|
| 8     |  level1level02level021|
| 9     |  level2level02level021|

Now I want to do an inclusive group_by by a separate list and the line should be included if the string in the array is included in the string of the hierarchystr column, expected output (beware hstrs is every time in a different order!):

#hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
|index| 0                   | Count |
|-----|---------------------|-------|
|0    |level0               | 6     |
|1    |level1               | 3     |
|2    |level0level01        | 1     |
|3    |level0level02        | 4     |
|4    |level0level02level021| 3     |

I tried the following solutions, but all are slow as hell:

#V1
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
#V2
for hstr in hstrs:
        s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
        s2 = s.count()
        s3 = s2.values[0]
        if s3 > 200:
            list.append(hstr)
 #V3 - fastest, but also slow and not satisfying
containing =[item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values

I also tried versions for all variables at once with pat and extract, but in return the size per group changes in every run, because the list hstrs is every run in a different order.

df.hierarchystr.extract[all](pat="|".join(hstrs))

Is there a regex and method possible that do this task in one step so this is also applicable for huge data frames at an appropriate time - that not depending on the order of the hstrs array?


Solution

  • You can try:

    count = [df['hierarchystr'].str.startswith(hstr).sum() for hstr in hstrs]
    out = pd.DataFrame({'hstr': hstrs, 'count': count})
    print(out)
    
    # Output
                        hstr  count
    0                 level0      6
    1                 level1      3
    2          level0level01      1
    3          level0level02      4
    4  level0level02level021      3