python pandas string-comparison

Comparing two JSON-format strings in pandas and assigning a label based on matches


I have two columns in my dataframe, namely diff and diff2.

An instance of diff:

{'paths': {'modified': {'/v1/authorization/details/byDate': {'operations': {'modified': {'POST': {'requestBody': {'added': True}}}}}}}, 'endpoints': {'modified': {'{ method: POST, path: /v1/authorization/details/byDate }': {'requestBody': {'added': True}}}}}
{'info': {'version': {'from': '1.0.2', 'to': '1.0.3'}}, 'paths': {'modified': {'/equipment-status': {'operations': {'modified': {'GET': {'parameters': {'modified': {'query': {'pei': {'schema': {'pattern': {'from': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$', 'to': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$'}}}}}}}}}}}}, 'endpoints': {'modified': {'{ method: GET, path: /equipment-status }': {'parameters': {'modified': {'query': {'pei': {'schema': {'pattern': {'from': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$', 'to': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$'}}}}}}}}}, 'externalDocs': {'description': {'from': '3GPP TS 29.511 V15.4.0; 5G System; Equipment Identity Register Services; Stage 3', 'to': '3GPP TS 29.511 V16.0.0; 5G System; Equipment Identity Register Services; Stage 3'}}}

An instance of diff2:

Backward compatibility errors (1):
error at specs/389643.json, in API POST /v1/authorization/details/byDate added required request body [added-required-request-body].
Backward compatibility errors (1):
warning at specs/419378.json, in API GET /equipment-status changed the pattern for the 'query' request parameter 'pei' from '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$' to '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$' [request-parameter-pattern-changed]. This is a warning because it is difficult to automatically analyze if the new pattern is a superset of the previous pattern(e.g. changed from '[0-9]+' to '[0-9]*')

I want a way to check whether the keywords in diff2 (always starting from API) match any of the keywords present in diff, and to assign a label based on that. If all keywords match and there is no unmatched set of words, I want to label the change as Breaking. If there are matching words (from diff2) and also unmatched ones (all the remaining words from diff), I want the label to be Both.

And if diff2 is NaN, the change should be Non-Breaking.

So for the first instance the change would be Breaking, and for the second it would be Both.

The expected output would be something like this:

diff                                                            diff_2                                             Change
{'paths': {'modified': {'/v1/authorization/details/byDate'      ./ API POST /v1/authorization/details/byDate      Breaking    

This is the to_dict() output of my df:

{'diff': {11: "{'openAPI': {'from': '3.0.0', 'to': '3.0.3'}, 'info': {'title': {'from': 'Example.com', 'to': 'LN-Markets API'}, 'description': {'from': 'This is an **example** API to demonstrate features of OpenAPI specification\\n# Introduction\\nThis specification is intended to to be a good starting point for describing your API in \\n[OpenAPI/Swagger format](https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md).\\nIt also demonstrates features of [generator-openapi-repo](https://github.com/Rebilly/generator-openapi-repo) tool and \\n[ReDoc](https://github.com/Rebilly/ReDoc) documentation engine. So beyond the standard OpenAPI syntax we use a few \\n[vendor extensions](https://github.com/Rebilly/ReDoc/blob/master/docs/redoc-vendor-extensions.md).\\n\\n# OpenAPI Specification\\nThe goal of The OpenAPI Specification is to define a standard, language-agnostic interface to REST APIs which\\nallows both humans and computers to discover and understand the capabilities of the service without access to source\\ncode, documentation, or through network traffic inspection. When properly defined via OpenAPI, a consumer can \\nunderstand and interact with the remote service with a minimal amount of implementation logic. 
Similar to what\\ninterfaces have done for lower-level programming, OpenAPI removes the guesswork in calling the service.\\n', 'to': 'Trade derivatives on the **[Lightning Network](https://lightning.network/).**'}, 'termsOfService': {'from': 'https://example.com/terms/', 'to': ''}, 'contact': {'deleted': True}, 'license': {'deleted': True}}, 'paths': {'added': ['/user/jwt', '/user', '/lnurl/a', '/positions', '/user/logout', '/lnurl/w', '/user/history', '/state/node', '/lnurl/w/r', '/login/credentials', '/user/deposit', '/user/withdraw', '/lnurl/a/c', '/login/joule', '/state/api', '/positions/cancel', '/user/withdraw/lnurl'], 'deleted': ['/users/{username}', '/echo']}, 'endpoints': {'added': [{'method': 'POST', 'path': '/user/logout'}, {'method': 'DELETE', 'path': '/positions'}, {'method': 'GET', 'path': '/positions'}, {'method': 'POST', 'path': '/positions'}, {'method': 'PUT', 'path': '/positions'}, {'method': 'GET', 'path': '/lnurl/a'}, {'method': 'GET', 'path': '/lnurl/w/r'}, {'method': 'GET', 'path': '/user/history'}, {'method': 'POST', 'path': '/user/withdraw'}, {'method': 'POST', 'path': '/user/withdraw/lnurl'}, {'method': 'PUT', 'path': '/login/credentials'}, {'method': 'GET', 'path': '/login/credentials'}, {'method': 'POST', 'path': '/login/credentials'}, {'method': 'GET', 'path': '/user'}, {'method': 'PUT', 'path': '/user'}, {'method': 'GET', 'path': '/state/node'}, {'method': 'GET', 'path': '/state/api'}, {'method': 'POST', 'path': '/user/deposit'}, {'method': 'DELETE', 'path': '/user/jwt'}, {'method': 'GET', 'path': '/user/jwt'}, {'method': 'POST', 'path': '/user/jwt'}, {'method': 'POST', 'path': '/login/joule'}, {'method': 'POST', 'path': '/positions/cancel'}, {'method': 'GET', 'path': '/lnurl/a/c'}, {'method': 'GET', 'path': '/lnurl/w'}], 'deleted': [{'method': 'GET', 'path': '/users/{username}'}, {'method': 'PUT', 'path': '/users/{username}'}, {'method': 'POST', 'path': '/echo'}]}, 'security': {'added': ['cookieAuth', 'jwtAuth']}, 'servers': {'added': 
['https://api.lnmarkets.com'], 'deleted': ['http://example.com/api/v1', 'https://example.com/api/v1']}, 'tags': {'added': ['LNURL', 'Login', 'Positions', 'State'], 'deleted': ['Echo'], 'modified': {'User': {'description': {'from': 'Operations about user', 'to': 'Interactions with the website'}}}}, 'externalDocs': {'deleted': True}, 'components': {'schemas': {'deleted': ['Email', 'User']}, 'headers': {'deleted': ['ExpiresAfter']}, 'securitySchemes': {'added': ['cookieAuth', 'jwtAuth'], 'deleted': ['main_auth', 'api_key', 'basic_auth']}}}",
  14: "{'paths': {'modified': {'/user/withdraw/lnurl': {'operations': {'added': ['GET'], 'deleted': ['POST']}}}}, 'endpoints': {'added': [{'method': 'GET', 'path': '/user/withdraw/lnurl'}], 'deleted': [{'method': 'POST', 'path': '/user/withdraw/lnurl'}]}}",
  15: "{'paths': {'modified': {'/user/withdraw/lnurl': {'operations': {'added': ['POST'], 'deleted': ['GET']}}}}, 'endpoints': {'added': [{'method': 'POST', 'path': '/user/withdraw/lnurl'}], 'deleted': [{'method': 'GET', 'path': '/user/withdraw/lnurl'}]}}",
  17: "{'paths': {'added': ['/state'], 'deleted': ['/state/api', '/state/node']}, 'endpoints': {'added': [{'method': 'GET', 'path': '/state'}], 'deleted': [{'method': 'GET', 'path': '/state/api'}, {'method': 'GET', 'path': '/state/node'}]}}",
  22: '{\'paths\': {\'added\': ["/user/history\'", \'/user/update-password\'], \'deleted\': [\'/user/history\'], \'modified\': {\'/user\': {\'operations\': {\'deleted\': [\'PUT\']}}}}, \'endpoints\': {\'added\': [{\'method\': \'GET\', \'path\': "/user/history\'"}, {\'method\': \'PUT\', \'path\': \'/user/update-password\'}], \'deleted\': [{\'method\': \'GET\', \'path\': \'/user/history\'}, {\'method\': \'PUT\', \'path\': \'/user\'}]}}'},
 'diff_2': {11: 'Backward compatibility errors (3):\nerror at original_source=specs/994.json, in API POST /echo api path removed without deprecation [api-path-removed-without-deprecation]. \n\nerror at original_source=specs/994.json, in API GET /users/{username} api path removed without deprecation [api-path-removed-without-deprecation]. \n\nerror at original_source=specs/994.json, in API PUT /users/{username} api path removed without deprecation [api-path-removed-without-deprecation]. \n\n',
  14: 'Backward compatibility errors (1):\nerror at specs/555.json, in API POST /user/withdraw/lnurl api removed without deprecation [api-removed-without-deprecation]. \n\n',
  15: 'Backward compatibility errors (1):\nerror at specs/554.json, in API GET /user/withdraw/lnurl api removed without deprecation [api-removed-without-deprecation]. \n\n',
  17: 'Backward compatibility errors (2):\nerror at original_source=specs/552.json, in API GET /state/api api path removed without deprecation [api-path-removed-without-deprecation]. \n\nerror at original_source=specs/552.json, in API GET /state/node api path removed without deprecation [api-path-removed-without-deprecation]. \n\n',
  22: 'Backward compatibility errors (2):\nerror at original_source=specs/547.json, in API GET /user/history api path removed without deprecation [api-path-removed-without-deprecation]. \n\nerror at specs/547.json, in API PUT /user api removed without deprecation [api-removed-without-deprecation]. \n\n'}}

Any suggestions or ideas on how to do this would be highly appreciated.


Solution

  • I'm not entirely sure I understand what you are trying to do, and since your example is not fully reproducible, here is my own, where:

    • the first row is a "Breaking" change case (all keywords match)
    • the second row illustrates "Both" (some keywords match)
    • and the third is a "Non-Breaking" case (no match):
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "diff": [
                {
                    "paths": {
                        "modified": {
                            "/v1/authorization/details/byDate": {
                                "operations": {
                                    "modified": {"POST": {"requestBody": {"added": True}}}
                                }
                            }
                        }
                    },
                },
            ]
            * 3
            + [pd.NA, 2, "aaa"],
            "diff2": [
                "Backward compatibility errors (1): error at specs/389643.json, in API POST /v1/authorization/details/byDate added",
                "Backward compatibility errors (1): error at specs/390643.json, in API GET /v1/authorization/details/byDate added",
                "Backward compatibility errors (1): error at specs/391643.json, in API PUSH /v2/authorization/details/byDate removed",
                "",
                "",
                "",
            ],
        }
    )
    

    First, define a recursive helper function to get all keys from a nested dictionary:

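    def get_keys_from_dict(d, keys=None):
        # Start a fresh accumulator on the top-level call
        if keys is None:
            keys = []
        # Anything that is not a dict (NA, numbers, strings) has no keys
        if not isinstance(d, dict):
            return None
        for k, v in d.items():
            keys.append(k)
            if isinstance(v, dict):
                # Nested dict: collect its keys into the same list
                get_keys_from_dict(v, keys)
            elif isinstance(v, list):
                # List: descend into each element (non-dict elements are ignored)
                for item in v:
                    get_keys_from_dict(item, keys)
        return keys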
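
To see what this helper actually collects, here is a small sketch on a hypothetical, shortened diff (the sample dict is made up for illustration, and the helper is restated so the snippet runs on its own). Note that only dictionary keys are gathered: strings that occur as values or as list items, such as "PUT" below, are skipped.

```python
def get_keys_from_dict(d, keys=None):
    """Collect the keys of a nested dict, descending into dicts and lists."""
    if keys is None:
        keys = []
    if not isinstance(d, dict):
        return None
    for k, v in d.items():
        keys.append(k)
        if isinstance(v, dict):
            get_keys_from_dict(v, keys)
        elif isinstance(v, list):
            for item in v:
                get_keys_from_dict(item, keys)
    return keys


# Hypothetical, shortened diff in the same shape as the question's data
sample = {
    "paths": {"modified": {"/user": {"operations": {"deleted": ["PUT"]}}}},
    "endpoints": {"deleted": [{"method": "PUT", "path": "/user"}]},
}

keys = get_keys_from_dict(sample)
print(keys)
# ['paths', 'modified', '/user', 'operations', 'deleted',
#  'endpoints', 'deleted', 'method', 'path']
print("PUT" in keys)  # False: "PUT" only ever appears as a value
```

If your diffs list methods and paths under 'added' or 'deleted' (as values rather than keys), you may want to collect those strings as well before comparing.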
    

    Define another helper function to get all keywords in a string that come after the word "API", using str.split:

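    def get_keywords_from_string(string):
        # Missing values (e.g. NaN) and strings without "API" yield no keywords
        if not isinstance(string, str) or "API" not in string:
            return []
        # Keep the non-empty, space-separated tokens that follow the first "API"
        return [item for item in string.split("API")[1].split(" ") if item]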
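
One thing to be aware of: split("API")[1] only keeps the text between the first and the second "API", so when a diff_2 message reports several errors, only the first endpoint is seen. The sketch below shows this on the row 17 message from the question and, as a possible alternative that is my own assumption rather than part of the approach above, pulls every (method, path) pair with a regular expression. The helper is restated so the snippet runs on its own.

```python
import re


def get_keywords_from_string(string):
    """Tokens following the first "API" in the message, split on spaces."""
    if not isinstance(string, str) or "API" not in string:
        return []
    return [item for item in string.split("API")[1].split(" ") if item]


# diff_2 value of row 17 in the question
message = (
    "Backward compatibility errors (2):\n"
    "error at original_source=specs/552.json, in API GET /state/api api path "
    "removed without deprecation [api-path-removed-without-deprecation]. \n\n"
    "error at original_source=specs/552.json, in API GET /state/node api path "
    "removed without deprecation [api-path-removed-without-deprecation]. \n\n"
)

naive = get_keywords_from_string(message)
print(naive[:2])               # ['GET', '/state/api']
print("/state/node" in naive)  # False: the second error is never reached

# Assumed alternative: capture the two tokens after every "API "
pairs = re.findall(r"\bAPI (\S+) (\S+)", message)
print(pairs)  # [('GET', '/state/api'), ('GET', '/state/node')]
```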
    

    And another one to compare two lists of keywords with Python's built-in functions all and any:

    def compare(keywords, other_keywords):
        if not keywords or not other_keywords:
            return ""
        results = [item in keywords for item in other_keywords]
        if all(results):
            return "Breaking"
        if any(results):
            return "Both"
        return "Non-Breaking"
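
As a quick check of the four possible outcomes, here is the same function called on hand-written keyword lists (the lists are illustrative, modelled on the toy dataframe above; the function is restated so the snippet runs on its own):

```python
def compare(keywords, other_keywords):
    """Label a row by how many of other_keywords occur in keywords."""
    if not keywords or not other_keywords:
        return ""
    results = [item in keywords for item in other_keywords]
    if all(results):
        return "Breaking"
    if any(results):
        return "Both"
    return "Non-Breaking"


# Keys as they would be extracted from the toy diff
diff_keys = [
    "paths", "modified", "/v1/authorization/details/byDate",
    "operations", "modified", "POST", "requestBody", "added",
]

print(compare(diff_keys, ["POST", "/v1/authorization/details/byDate", "added"]))
# Breaking: every keyword is found
print(compare(diff_keys, ["GET", "/v1/authorization/details/byDate", "added"]))
# Both: "GET" is missing, the rest is found
print(compare(diff_keys, ["PUSH", "/v2/authorization/details/byDate", "removed"]))
# Non-Breaking: nothing is found
print(repr(compare(diff_keys, [])))
# '': one side is empty, so no label is assigned
```

Note that an empty or missing diff2 gives an empty label here; if you want "Non-Breaking" in that case, as described in the question, that first branch is the place to change.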
    

    Finally, compose those functions and apply them to the dataframe:

    df["Change"] = df.apply(
        lambda x: compare(
            get_keys_from_dict(x["diff"], []),
            get_keywords_from_string(x["diff2"]),
        ),
        axis=1,
    )
    

    Then:

    print(df)
    # Output
    
                                                    diff  ...        Change
    0  {'paths': {'modified': {'/v1/authorization/det...  ...      Breaking
    1  {'paths': {'modified': {'/v1/authorization/det...  ...          Both
    2  {'paths': {'modified': {'/v1/authorization/det...  ...  Non-Breaking
    3                                               <NA>  ...
    4                                                  2  ...
    5                                                aaa  ...
    

    EDIT

    With the real dataframe you provided, in which the diff values are strings, not dictionaries, you don't need the get_keys_from_dict helper function. You can compare the columns directly, in which case item in keywords becomes a substring test against the whole diff string:

    df["Change"] = df.apply(
        lambda x: compare(
            x["diff"],
            get_keywords_from_string(x["diff_2"]),
        ),
        axis=1,
    )
    
    
    print(df)
    # Output
    
                       diff               diff_2 Change
    11  {'openAPI': {'fr...  Backward compati...   Both
    14  {'paths': {'modi...  Backward compati...   Both
    15  {'paths': {'modi...  Backward compati...   Both
    17  {'paths': {'adde...  Backward compati...   Both
    22  {'paths': {'adde...  Backward compati...   Both
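
If the substring test turns out to be too loose, one option (a sketch under my own assumptions, not something the question asked for) is to turn the diff strings back into dictionaries with ast.literal_eval from the standard library and compare structured (method, path) pairs. The example uses the values of row 14 from the question; the names deleted and reported are just illustrative.

```python
import ast
import re

# diff and diff_2 values of row 14 in the question
diff = (
    "{'paths': {'modified': {'/user/withdraw/lnurl': {'operations': "
    "{'added': ['GET'], 'deleted': ['POST']}}}}, "
    "'endpoints': {'added': [{'method': 'GET', 'path': '/user/withdraw/lnurl'}], "
    "'deleted': [{'method': 'POST', 'path': '/user/withdraw/lnurl'}]}}"
)
diff_2 = (
    "Backward compatibility errors (1):\n"
    "error at specs/555.json, in API POST /user/withdraw/lnurl api removed "
    "without deprecation [api-removed-without-deprecation]. \n\n"
)

# Safely evaluate the repr-style string back into a dict
parsed = ast.literal_eval(diff)

# Endpoints the diff lists as deleted, as (method, path) pairs
deleted = {
    (entry["method"], entry["path"])
    for entry in parsed.get("endpoints", {}).get("deleted", [])
}

# Endpoints the compatibility report mentions after each "API "
reported = set(re.findall(r"\bAPI (\S+) (\S+)", diff_2))

print(deleted)          # {('POST', '/user/withdraw/lnurl')}
print(reported)         # {('POST', '/user/withdraw/lnurl')}
print("/user" in diff)  # True, although the diff has no '/user' path
```

The last line shows why the plain substring comparison can over-match: "/user" is found inside "/user/withdraw/lnurl". How the two sets should map to Breaking, Both and Non-Breaking depends on your exact rules, so I have left that part out.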