How to properly split string to create dictionary in Python?


I have two strings:

"TOP : Cotton + Embroidered ( 2 Mtr) \nBOTTOM : Cotton + Solid (2 Mtr) \nDUPATTA : Chiffon + Lace Work ( 2 Mtr) \nTYPE : Un Stitched\nCOLOUR : Multi Colour \nCONTAINS : 1 TOP WITH LINING 1 BOTTOM & 1 DUPATTA\nCountry of Origin: India"

and second one is:

"Top Fabric: Cotton Cambric + Top Length: 0-2.00\nBottom Fabric: Cotton Cambric + Bottom Length: 0-2.00\nDupatta Fabric: Nazneen + Dupatta Length: 0-2.00\nLining Fabric: Cotton Cambric\nType: Un Stitched\nPattern: Printed\nMultipack: 3 Top\nCountry of Origin: India"

I need to create a Python dictionary from each of these strings, using the text before each colon as the key.

For example, in the first string the keys would be

TOP,BOTTOM,DUPATTA,TYPE,COLOUR,CONTAINS,COUNTRY OF ORIGIN

and in the second one the keys would be

Top Fabric,Bottom Fabric,Top Length,Bottom Length,Dupatta Fabric,Dupatta Length,Lining Fabric,Type,Pattern,Multipack,Country of Origin

So far I have tried:

keys = ["Top Fabric","Bottom Fabric","Dupatta Fabric","Lining Fabric","Type","Pattern","Multipack","TOP ","BOTTOM ","  DUPATTA ","COLOUR ","CONTAINS ","TYPE ","Country"] 

pattern = re.compile(r'({})\s+'.format(':|'.join(keys)))
newdict = dict(zip(*[(i.strip() for i in (pattern.split(desc.replace("*",""))) if i)]*2))

but it does not work on the first string, and on the second string it does not produce every key and value.


Solution

  • You might use a regex pattern that matches the part before the colon in group 1 and after the colon in group 2.

    Then assert that after group 2, there is either another part starting with a + followed by : or the end of the string.

    Then create a dictionary, stripping the group 1 and group 2 values.

    (?:\s*\+\s*)?([^:]+)\s*:\s*([^:]+)(?=\+[^:+]*:|$)
    

    The pattern matches:

    • (?:\s*\+\s*)? Optionally match a + sign between optional whitespace chars
    • ([^:]+) Capture group 1, match one or more chars other than :
    • \s*:\s* Match a : between optional whitespace chars
    • ([^:]+) Capture group 2, match one or more chars other than :
    • (?=\+[^:+]*:|$) Positive lookahead, assert that what follows is either a + and then a key ending in :, or the end of the string
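To see what the lookahead does, here is a minimal sketch that applies the pattern above to two single lines taken from the question. The variable names `line_with_two_pairs`, `line_with_plus_in_value` and the helper `pairs` are only illustrative. A + splits the line only when it is followed by another key ending in a colon; otherwise it stays part of the value.

```python
import re

# The single-line pattern from the breakdown above.
pattern = r"(?:\s*\+\s*)?([^:]+)\s*:\s*([^:]+)(?=\+[^:+]*:|$)"

def pairs(line):
    # re.findall returns one (group 1, group 2) tuple per match.
    return [(k.strip(), v.strip()) for k, v in re.findall(pattern, line)]

# The + here is followed by "Top Length:", so the line yields two pairs.
line_with_two_pairs = "Top Fabric: Cotton Cambric + Top Length: 0-2.00"

# The + here is not followed by another key, so it remains in the value.
line_with_plus_in_value = "TOP : Cotton + Embroidered ( 2 Mtr) "

print(pairs(line_with_two_pairs))
# [('Top Fabric', 'Cotton Cambric'), ('Top Length', '0-2.00')]
print(pairs(line_with_plus_in_value))
# [('TOP', 'Cotton + Embroidered ( 2 Mtr)')]
```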


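    Example (the pattern used here also excludes newlines from both groups, so that it works line by line with re.MULTILINE)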

    import re
    import pprint
    
    pattern = r"(?:\s*\+\s*)?([^:\r\n]+)\s*:\s*([^:\r\n]+)\s*(?=\+[^:+\n]*:|$)"
    
    s = ("TOP : Cotton + Embroidered ( 2 Mtr) \n"
                "BOTTOM : Cotton + Solid (2 Mtr) \n"
                "DUPATTA : Chiffon + Lace Work ( 2 Mtr) \n"
                "TYPE : Un Stitched\n"
                "COLOUR : Multi Colour \n"
                "CONTAINS : 1 TOP WITH LINING 1 BOTTOM & 1 DUPATTA\n"
                "Country of Origin: India\n\n"
                "Top Fabric: Cotton Cambric + Top Length: 0-2.00\n"
                "Bottom Fabric: Cotton Cambric + Bottom Length: 0-2.00\n"
                "Dupatta Fabric: Nazneen + Dupatta Length: 0-2.00\n"
                "Lining Fabric: Cotton Cambric\n"
                "Type: Un Stitched\n"
                "Pattern: Printed\n"
                "Multipack: 3 Top\n"
                "Country of Origin: India")
    
    dictionary = {}
    for m in re.finditer(pattern, s, re.MULTILINE):
        dictionary[m.group(1).strip()] = m.group(2).strip()
    pprint.pprint(dictionary)
    

    Output

    {'BOTTOM': 'Cotton + Solid (2 Mtr)',
     'Bottom Fabric': 'Cotton Cambric',
     'Bottom Length': '0-2.00',
     'COLOUR': 'Multi Colour',
     'CONTAINS': '1 TOP WITH LINING 1 BOTTOM & 1 DUPATTA',
     'Country of Origin': 'India',
     'DUPATTA': 'Chiffon + Lace Work ( 2 Mtr)',
     'Dupatta Fabric': 'Nazneen',
     'Dupatta Length': '0-2.00',
     'Lining Fabric': 'Cotton Cambric',
     'Multipack': '3 Top',
     'Pattern': 'Printed',
     'TOP': 'Cotton + Embroidered ( 2 Mtr)',
     'TYPE': 'Un Stitched',
     'Top Fabric': 'Cotton Cambric',
     'Top Length': '0-2.00',
     'Type': 'Un Stitched'}
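The example above parses both strings in one pass, so the repeated "Country of Origin" key collapses into a single entry. Since the question asks for a dictionary per string, the following sketch wraps the same pattern in a function and calls it once for each input. The name `parse_description` is hypothetical, not part of the original answer.

```python
import re

# Same multiline pattern as in the example, compiled once.
PATTERN = re.compile(
    r"(?:\s*\+\s*)?([^:\r\n]+)\s*:\s*([^:\r\n]+)\s*(?=\+[^:+\n]*:|$)",
    re.MULTILINE,
)

def parse_description(text):
    """Return a dict of stripped key/value pairs found in text."""
    return {m.group(1).strip(): m.group(2).strip() for m in PATTERN.finditer(text)}

first = ("TOP : Cotton + Embroidered ( 2 Mtr) \n"
         "BOTTOM : Cotton + Solid (2 Mtr) \n"
         "DUPATTA : Chiffon + Lace Work ( 2 Mtr) \n"
         "TYPE : Un Stitched\n"
         "COLOUR : Multi Colour \n"
         "CONTAINS : 1 TOP WITH LINING 1 BOTTOM & 1 DUPATTA\n"
         "Country of Origin: India")

second = ("Top Fabric: Cotton Cambric + Top Length: 0-2.00\n"
          "Bottom Fabric: Cotton Cambric + Bottom Length: 0-2.00\n"
          "Dupatta Fabric: Nazneen + Dupatta Length: 0-2.00\n"
          "Lining Fabric: Cotton Cambric\n"
          "Type: Un Stitched\n"
          "Pattern: Printed\n"
          "Multipack: 3 Top\n"
          "Country of Origin: India")

first_dict = parse_description(first)
second_dict = parse_description(second)

# Dicts preserve insertion order, so keys appear in the order they were matched.
print(list(first_dict))
print(list(second_dict))
```

Note that the keys keep the casing of the input, so the last key of the first string is 'Country of Origin' rather than 'COUNTRY OF ORIGIN'; apply `.upper()` to the key if uniform casing is needed.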