Search code examples
bashfor-loopawkdataset

How to get a list of numbers out of an awk output in bash


ok so I am currently in the process of making a very basic bash script to search a jsonl file of mario maker levels with their id's, and then converting the found id's into the actual in game id's using a python script given to me (the database and explanation is in this reddit thread, im searching the courses.jsonl file https://www.reddit.com/r/MarioMaker/comments/wlkwp9/easily_searchable_database_of_super_mario_maker_1/)

with my current code, im able to search, seperate just the id's, then convert 1 id to the correct in game id. however, i am not sure how i would be able to convert all the id's found in the search separately. would it just be a for loop, or would i have to do something more complex? here is the code, any help would be appreciated, as most of my bash knowledge is very basic

#!/bin/sh
id=$(grep $1 /file/path/courses.jsonl | awk -F ',' '{print $1}' | awk -F ':' '{print $2}')


python3 /file/path/mmid.py $id

edit: here is the current mmid.py file, its basically the initial one i was given with the small change of just adding the id to an argument on runtime instead of asking for input after running

import struct
import hashlib
import hmac
import sys
idno = int(sys.argv[1])
key = hashlib.md5(b"9f2b4678").digest()
data = struct.pack("<Q", idno)
checksum = hmac.HMAC(key, data, 'md5').digest()
checksum = checksum[3:1:-1].hex().upper()
idstring = str.upper(hex(idno))[2:]
for y in range(8 - len(idstring)):
    idstring = '0' + str(idstring)
code = str(checksum) + '0000' + str(idstring)
print(code)

edit 2: i was asked for an example of the courses.jsonl file, so here is one entry. they all look like this, the important thing im grabbing here is the id, which is the first part

{"id":69558108,"retrieval_date":"2021-06-05T08:04:59.154776Z","url":"https://d2sno3mhmk1ekx.cloudfront.net/10.WUP_AMAJ_datastore/ds/1/data/00069558108-00001?Expires=1625472299&Signature=BqxLU18d~4rV4FeHNN-Sm2dbKqKyLhS5qX2liHPqr3MN9Y0Fgt5GCYCXrseFt-tinj4LVjrtMIWZW4G1ijj4FSHQyXN9OoW3PfyZbSuatl9~aNHLPtCCBfMDm7HmF~Njry1pvFjvmeMTJf4ZlUTrKIJF00gfA~2Mu2K1qjIR4yqSQCyxsxG61J~KG3CQMVJUKnbdGupeHN5-nigDqZ3EbT3-odEmjJsq04wb8K98iNMVHKINg8MZwuWHtEtECGWWqvhAvkdRDj6ZZQxAPW1fwxkLJBY4PSRW28A20T-lcq0CDDeU4yFUzeeE6GqFV0xvwj2IOFrlWVOo3McmE21HqA__&Key-Pair-Id=APKAJUYKVK3BE6ZPNZBQ","stars":1,"course_name":"hello kitty saves xmas by Lario\u2122","creator":{"pid":1745383521,"nnid":"realdealLario","mii_data":"QlBGQwAAAAEAAAAAAAAAAAAAAAAAAQAAAwAAQMHCaSxE5iEQ2lRGSOgTQXuGIgAAiANMAGEAcgBpAG8AIiEAAAAAAAAAACNiKABCAQPHRBgLM0cQDg8RsCoAACIAUkhQAABlAAAAcgBlAGEAawAAAAAAAAAAAO1hAAAAAAAAAAAAAAAAAAAAAAAAAAQ="},"upload_time":"2021-03-30T22:59:39Z","user_plays":2,"clears":1,"total_attempts":5,"failures":4,"world_record":{"best_time_player":{"pid":1747253384,"nnid":"ilovesnowandcold","mii_data":"QlBGQwAAAAEAAAAAAAAAAAAAAAAAAQAAAwAAEGpkmkWgRBCggeC0lA/Q2aSlewAAghVLAGUAaQB0AGgAAAAAAAAAAAAAAAsVAAAsApJuRBgnNEcYhhIOaA8AACnQUUhQSwBlAGkAdABoAAAAAAAAAAAAAAAAAJLUAAAAAAAAAAAAAAAAAAAAAAAAAAE="},"first_complete_player":{"pid":1747253384,"nnid":"ilovesnowandcold","mii_data":"QlBGQwAAAAEAAAAAAAAAAAAAAAAAAQAAAwAAEGpkmkWgRBCggeC0lA/Q2aSlewAAghVLAGUAaQB0AGgAAAAAAAAAAAAAAAsVAAAsApJuRBgnNEcYhhIOaA8AACnQUUhQSwBlAGkAdABoAAAAAAAAAAAAAAAAAJLUAAAAAAAAAAAAAAAAAAAAAAAAAAE="},"best_time_ms":13245,"created_time":"2021-03-30T23:52:40Z","updated_time":"2021-03-30T23:52:40Z"}}

I should clarify, this jsonl file is over 16gb of raw text


Solution

  • Do it entirely within python:

    EDIT: Now with matching course_name

    import hashlib
    import hmac
    import jsonlines
    import re
    import struct
    import sys
    
    course_name = str(sys.argv[1])
    idno = -1
    reader = jsonlines.Reader(sys.stdin)
    for obj in reader:
        if re.search(course_name, obj["course_name"]):
            idno = int(obj["id"])
            break
    if idno > -1:
        key = hashlib.md5(b"9f2b4678").digest()
        data = struct.pack("<Q", idno)
        checksum = hmac.HMAC(key, data, "md5").digest()
        checksum = checksum[3:1:-1].hex().upper()
        idstring = str.upper(hex(idno))[2:]
        for y in range(8 - len(idstring)):
            idstring = "0" + str(idstring)
        code = str(checksum) + "0000" + str(idstring)
        print(code)
    else:
        print("Cannot match course_name: %s\n" % course_name)
    

    Usage example:

    $ ./bin/python3 ./mmid.py kitty < ./courses.jsonl
    7AC9000004255F5C