Tags: python, pandas, string-matching, partial-matches

Complex partial string matching in pandas


Given a dataframe with the following structure, where the column of interest is json_path:

                                                                   json_path   Reporting Group                Entity/Grouping
0                                         data.attributes.total.children.[0]  Christian Family                 Abraham Family
1                            data.attributes.total.children.[0].children.[0]  Christian Family                      In Estate
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

How would I filter for the rows whose json_path contains children four times? That is, I want to keep only the rows at index positions 2 and 3:

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

I know how to obtain a partial match. However, the integers in the square brackets will be inconsistent, so my instinct is to count the instances of children (i.e., children appearing 4x) and use that count as the basis for the filter.

Any suggestions or resources on how I can achieve this?


Solution

  • As you said, a naive approach is to count the occurrences of .children and compare the count with 4. This produces a boolean mask that can be used to filter the rows:

    df[df['json_path'].str.count(r'\.children').eq(4)]
    

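    A more robust approach is to check for four consecutive occurrences of .children.[n]. Because str.contains searches anywhere in the string, this also matches paths that are nested more than four levels deep: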

    # (?:...) is a non-capturing group; a capturing group makes pandas emit a UserWarning
    df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]
    

                                                                       json_path   Reporting Group                Entity/Grouping
    2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
    3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
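
To try both approaches end to end, here is a minimal sketch that rebuilds the sample dataframe from the question. The fifth row is a hypothetical addition (it is not in the original data), included only to show where the two masks differ: a path nested five levels deep fails the count check but still passes the "four consecutive" check. The names count_mask, run_mask and exact_mask are illustrative.

```python
import pandas as pd

# Rows 0-3 come from the question. Row 4 is a hypothetical five-level path
# added here only to show how the two approaches differ.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 5,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
        "Hypothetical deeper row",
    ],
})

# Approach 1: '.children' appears exactly four times anywhere in the path.
count_mask = df["json_path"].str.count(r"\.children").eq(4)

# Approach 2: at least four consecutive '.children.[n]' segments.
run_mask = df["json_path"].str.contains(r"(?:\.children\.\[\d+\]){4}")

# Combining both keeps rows with exactly four segments, all consecutive.
exact_mask = count_mask & run_mask

print(df[exact_mask])
```

If deeper paths can occur in the real data and should be excluded, combining the two masks as above is one option; if deeper paths should be included, the str.contains mask alone is enough.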