
Extract keywords from links


I'm trying to extract the first 2 numbers in links like these:

https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ 
https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/
https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/

The output should be like this:

id1 = ['8406758680', '8945879094', '8493093053']
id2 = ['345386743', '849328844', '292494834']

I'm trying to do this using the re module.

Please tell me how to do it.

This is the code snippet I have so far:

def GetUrlClassId(UrlInPut):
    ClassID = ''
    for i in UrlInPut:
        if i.isdigit():
            ClassID+=i
        elif ClassID !='':
            return int(ClassID)
    return ""

def GetUrlInstanceID(UrlInPut):
    InstanceId = ''
    ClassID = 0
    for i in UrlInPut:
        if i.isdigit() and ClassID==1:
            InstanceId+=i
        elif InstanceId !='':
            return int(InstanceId)
        if i == '-':
            ClassID+=1
    return ""

I don't want to use something like this. I would like to use regular expressions.


Solution

  • The regex pattern is /(\d{10})-(\d{9}). The parentheses define the two capture groups of digits, and {n} matches exactly n repetitions of the preceding token (see the re module documentation).

    import re

    # URLs separated by whitespace
    urls = 'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/ https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/'

    urls = urls.split()  # split on whitespace into a list of URLs

    # .groups() returns the two captured ids as a tuple for each URL
    ids = [re.search(r'/(\d{10})-(\d{9})', url).groups() for url in urls]
    # zip(*ids) transposes the pairs: one tuple of first ids, one of second ids
    print(list(zip(*ids)))
    

    Output

    [('8406758680', '8945879094', '8493093053'), ('345386743', '849328844', '292494834')]
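
If you want the two separate lists id1 and id2 from the question, and some links might not contain the ids at all, something like the following sketch may help. It rests on two assumptions that are not in the question: the ids may vary in length (hence \d+ instead of \d{10} and \d{9}), and links without a match should be skipped silently. The names ID_PAIR and extract_ids are only illustrative.

```python
import re

# Assumption: each id is a run of digits of any length; the pair follows a
# slash and is separated by a hyphen.
ID_PAIR = re.compile(r'/(\d+)-(\d+)')

def extract_ids(urls):
    """Return two lists (id1, id2); URLs without a match are skipped."""
    id1, id2 = [], []
    for url in urls:
        match = ID_PAIR.search(url)
        if match is None:
            # re.search returns None when nothing matches, so guard
            # before calling .groups()
            continue
        first, second = match.groups()
        id1.append(first)
        id2.append(second)
    return id1, id2

urls = [
    'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/',
    'https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/',
    'https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/',
]

id1, id2 = extract_ids(urls)
print(id1)
print(id2)
```

The ids stay as strings, as in the desired output; wrap them in int() if you need numbers. If the lengths really are fixed, keeping \d{10} and \d{9} is stricter and rejects malformed links.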