
How to use re module to parse color names in text file?


I need to read the file src/rgb.txt, which contains color names and their RGB values (only part of the file is shown below). Each line contains four fields: red, green, blue, and the color name, separated by some amount of whitespace (tabs or spaces).

I am trying to write a function using Python's regular expressions (use of re is mandatory!) that reads the file and returns a list of strings, where the four fields in each string are separated by a single tab character (\t). The first string in the returned list should be: '255\t250\t250\tsnow'.

Text file:

255 250 250     snow
248 248 255     ghost white
248 248 255     GhostWhite
245 245 245     white smoke
245 245 245     WhiteSmoke
220 220 220     gainsboro
255 250 240     floral white
255 250 240     FloralWhite

My code looks as follows so far:

import re

def red_green_blue(filename='src/rgb.txt'):
    with open('src/rgb.txt', "r") as f:
        for line in f:
            line = f.read().splitlines()
            for i in range(len(line)):
                new_line = re.sub("^\t+|\t+$", "", str(line[i]), flags=re.UNICODE)
                d1 = " ".join(re.split("\t+", str(new_line), flags=re.UNICODE))
                print(d1, type(d1))
        return d1

I would like to know if there is any other way to solve this task using other regular-expression functions, e.g. findall, search, etc.

I would also like to know how to display \t, because in my case the tabs are rendered as whitespace rather than as \t, i.e. I see 169 169 169 DarkGray instead of 169\t169\t169\tDarkGray.


Solution

  • How about this:

    [ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)
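
To see what the pattern captures, here is a minimal sketch that applies it to a single line. The sample line and the comment line are hypothetical, modeled on the excerpt in the question; the name RGB_LINE is only illustrative.

```python
import re

# Same pattern as above: three integer groups, then the rest of the line as the name.
RGB_LINE = re.compile(r"[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)")

sample = "248 248 255\t\tghost white\n"  # hypothetical line, modeled on the excerpt
match = RGB_LINE.match(sample)

# "." never matches a newline by default, so the name group stops before "\n".
fields = match.groups() if match else None
print(fields)

# A line that does not start with digits (e.g. a comment line) yields None.
print(RGB_LINE.match("! some comment line\n"))
```

Note that the pattern uses [ \t]* between the numbers, so it does not strictly require a separator; replacing those with [ \t]+ would be a stricter variant.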
    

    Since you're iterating over the file line by line, there is no need to consider newlines; the pattern only has to handle a single line.

    This also assumes that the first line, which begins with ! $Xorg:, exists in the file and should be skipped. I'm new to Linux, so I don't know what that line is or whether it is a legitimate part of the file.

    import re
    
    
    def parse_re_gen(filename):
        regex = re.compile(r"[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)")
    
        with open(filename) as f:  # "r" = "rt" and already default, no need to specify.
            for line in f:
                try:
                    yield regex.match(line).groups()
                except AttributeError:  # first line " ! $Xorg:~~~ " falls here.
                    pass
    
    
    def wrapped_re(filename="src/rgb.txt"):  # path taken from the question
        for record in parse_re_gen(filename):  # the generator needs the file path
            # print(record)
            print(repr("\t".join(record)))
    
    wrapped_re()
    

    The generator parse_re_gen yields the matched groups as a tuple, one line at a time. Your teacher/professor probably wants something like this. A return placed after the loop, as in your code, would return only the last line.

    ('0', '139', '139', 'DarkCyan')
    ('139', '0', '139', 'dark magenta')
    ('139', '0', '139', 'DarkMagenta')
    ('139', '0', '0', 'dark red')
    ('139', '0', '0', 'DarkRed')
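
If the assignment needs an actual list of strings rather than a generator, one possible sketch uses re.findall over the whole file content, which also touches on the findall part of the question. The pattern here is a stricter variant of the one above, not the original: it assumes one row per line and requires whitespace between fields. The helper name rows_from_text and the sample text are hypothetical.

```python
import re

# Assumption: one row per line. With re.MULTILINE, ^ and $ anchor at line boundaries.
# (.*?) is lazy so that trailing spaces/tabs are left out of the name.
ROW = re.compile(
    r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(.*?)[ \t]*$",
    re.MULTILINE,
)


def rows_from_text(text):
    # findall returns one tuple of groups per matching line; join each with tabs.
    return ["\t".join(groups) for groups in ROW.findall(text)]


def red_green_blue(filename="src/rgb.txt"):
    # Read the whole file once and return the full list (not just the last row).
    with open(filename) as f:
        return rows_from_text(f.read())


# Hypothetical sample, including a non-matching header line.
sample = "! hypothetical header line\n255 250 250\t\tsnow\n248 248 255\t\tghost white\n"
rows = rows_from_text(sample)
print(rows)
```

Lines that do not start with three integers, such as the header, simply produce no match and are skipped.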
    

    wrapped_re iterates through the generator, joins each yielded tuple with a tab as the separator, and uses repr() so that the tabs are printed as the escape sequence \t rather than as whitespace.

    '0\t139\t139\tDarkCyan'
    '139\t0\t139\tdark magenta'
    '139\t0\t139\tDarkMagenta'
    '139\t0\t0\tdark red'
    '139\t0\t0\tDarkRed'
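
On the question of how to display \t: the string already contains real tab characters, and only the way it is printed differs. A small sketch, using the DarkGray row from the question:

```python
row = "\t".join(("169", "169", "169", "DarkGray"))

print(row)        # tabs are rendered as whitespace
print(repr(row))  # tabs are shown as the escape sequence \t, inside quotes

shown = repr(row)

# Printing a list uses repr() for each element, so a returned list
# displays the \t escapes without any extra work.
print([row])
```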
    

    Old alternative approach

    Treating this as an XY problem: why use re in the first place?

    Everything is simpler and faster without the re module.

    def parse_rgb_gen(filename):
    
        with open(filename) as fp:
            for line in fp:
                print(repr(output := "\t".join(line.split())))
                # do something with output
    

    timeit.timeit result:

    without re: 2.365
    with    re: 3.116
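
The setup behind those timings is not shown, so the following is only a sketch of how such a comparison could be run with timeit. The in-memory sample data, the function names, and the repetition count are all assumptions; the numbers it prints will differ from the ones above.

```python
import re
import timeit

# Hypothetical in-memory data so the comparison does not depend on file I/O.
LINES = ["255 250 250\t\tsnow\n", "248 248 255\t\tghost white\n"] * 500
REGEX = re.compile(r"[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)")


def with_re():
    # Match each line and join the captured groups with tabs.
    return ["\t".join(m.groups()) for m in map(REGEX.match, LINES) if m]


def without_re():
    # Plain whitespace split; note this also splits multi-word names.
    return ["\t".join(line.split()) for line in LINES]


t_re = timeit.timeit(with_re, number=20)
t_plain = timeit.timeit(without_re, number=20)
print(f"with re: {t_re:.3f}s, without re: {t_plain:.3f}s")
```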
    

    Part of the output (note that multi-word names are also split into tab-separated words):

    '139\t0\t139\tdark\tmagenta'
    '139\t0\t139\tDarkMagenta'
    '139\t0\t0\tdark\tred'
    '139\t0\t0\tDarkRed'
    '144\t238\t144\tlight\tgreen'
    '144\t238\t144\tLightGreen'
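
If the color name should stay a single field, as the question requires, one possible adjustment is to limit the number of splits. This is a sketch, not part of the original answer; the helper name to_tsv is hypothetical.

```python
def to_tsv(line):
    # Strip first: with a split limit, trailing whitespace (including the
    # newline) would otherwise stay attached to the last field.
    # split(None, 3) splits on runs of whitespace at most three times,
    # so everything after the third number stays together as the name.
    return "\t".join(line.strip().split(None, 3))


joined = to_tsv("139   0 139\t\tdark magenta\n")
print(repr(joined))
```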
    

    It would be preferable to convert this into a generator and consume it in a for loop, for better encapsulation.

    def parse_rgb_gen(filename="source.txt"):
    
        with open(filename) as fp:
            for line in fp:
                yield "\t".join(line.split())
    
    for item in parse_rgb_gen():
        print(repr(item))  # a bare repr(item) would discard its result