I need to read the file src/rgb.txt, which contains the names of colors and their numerical representations in RGB format (only part of the file is shown below). Each line contains four fields: red, green, blue, and color name, separated by some amount of whitespace (tabs or spaces).
I am trying to write a function using Python's regular expressions (use of re is mandatory!) that reads the file and returns a list of strings in which the four fields are separated by a single tab character (\t).
The first string in the returned list should be: '255\t250\t250\tsnow'.
Text file:
255 250 250 snow
248 248 255 ghost white
248 248 255 GhostWhite
245 245 245 white smoke
245 245 245 WhiteSmoke
220 220 220 gainsboro
255 250 240 floral white
255 250 240 FloralWhite
My code looks as follows so far:
import re
def red_green_blue(filename='src/rgb.txt'):
    with open('src/rgb.txt', "r") as f:
        for line in f:
            line = f.read().splitlines()
            for i in range(len(line)):
                new_line = re.sub("^\t+|\t+$", "", str(line[i]), flags=re.UNICODE)
                d1 = " ".join(re.split("\t+", str(new_line), flags=re.UNICODE))
                print(d1, type(d1))
    return d1
I would like to know whether there is any other way to solve this task using other regular-expression functions, e.g. findall, search, etc.
I would also like to know how to display \t, because in my case I see actual tabs rather than \t, i.e. 169 169 169 DarkGray instead of 169\t169\t169\tDarkGray.
How about this:
[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)
Since you're iterating over the file line by line, there is no need to consider newlines; you only have to focus on a single line.
Also, I'm assuming the first line, ! $Xorg:, does exist in the file, so the code skips it. I'm new to Linux, so I don't know what it is or whether it is a legitimate part of the file.
import re

def parse_re_gen(filename="src/rgb.txt"):  # default path taken from the question
    regex = re.compile(r"[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)")
    with open(filename) as f:  # "r" = "rt" and is already the default, no need to specify.
        for line in f:
            try:
                yield regex.match(line).groups()
            except AttributeError:  # first line " ! $Xorg:~~~ " falls here.
                pass

def wrapped_re():
    for record in parse_re_gen():
        # print(record)
        print(repr("\t".join(record)))

wrapped_re()
The generator parse_re_gen yields the matched tuples line by line, which is probably what your teacher/professor wants. A return placed after the loop would instead return only the last line.
('0', '139', '139', 'DarkCyan')
('139', '0', '139', 'dark magenta')
('139', '0', '139', 'DarkMagenta')
('139', '0', '0', 'dark red')
('139', '0', '0', 'DarkRed')
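If the assignment needs an actual list rather than a generator, the matches can be collected as they are found. This is a minimal sketch that reuses the pattern from parse_re_gen; the helper name lines_to_tabbed and the inline sample lines (including the stand-in header line) are made up for illustration and are not taken from the real file.

```python
import re

# Same pattern as in parse_re_gen: three integers, then the rest of the line as the name.
RGB_LINE = re.compile(r"[ \t]*(\d+)[ \t]*(\d+)[ \t]*(\d+)[ \t]*(.*)")

def lines_to_tabbed(lines):
    """Return one tab-joined string per line that matches RGB_LINE."""
    result = []
    for line in lines:
        match = RGB_LINE.match(line)
        if match:  # non-matching lines (e.g. a header) give None and are skipped
            result.append("\t".join(match.groups()))
    return result

# Hypothetical sample input; an open file object can be passed the same way.
sample = ["! stand-in header line\n", "255 250 250\t\tsnow\n", "248 248 255\t\tghost white\n"]
print(lines_to_tabbed(sample))
```

Checking the match object for None is an alternative to catching AttributeError; both skip lines that do not start with three numbers.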
And wrapped_re iterates through the generator, joins each yielded tuple with a tab as the separator, and prints the tabs in their escaped form (\t) by using repr(str).
'0\t139\t139\tDarkCyan'
'139\t0\t139\tdark magenta'
'139\t0\t139\tDarkMagenta'
'139\t0\t0\tdark red'
'139\t0\t0\tDarkRed'
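The difference between the two displays comes down to print(s) versus print(repr(s)); the string itself is the same in both cases. A small illustration with a hand-written value (the variable name s is arbitrary):

```python
s = "\t".join(["169", "169", "169", "DarkGray"])

print(s)        # writes real tab characters to the terminal
print(repr(s))  # writes the escaped form: '169\t169\t169\tDarkGray'

# The string contains three real tab characters either way.
print(s.count("\t"))
```

A list of such strings printed with print(list_of_strings) also shows \t, because printing a list displays the repr of each element.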
Considering this as an XY problem: why use re in the first place?
Everything is much simpler and faster without the re module.
def parse_rgb_gen(filename):
    with open(filename) as fp:
        for line in fp:
            print(repr(output := "\t".join(line.split())))
            # do something with output
timeit.timeit results:
without re: 2.365
with re: 3.116
Part of the output:
'139\t0\t139\tdark\tmagenta'
'139\t0\t139\tDarkMagenta'
'139\t0\t0\tdark\tred'
'139\t0\t0\tDarkRed'
'144\t238\t144\tlight\tgreen'
'144\t238\t144\tLightGreen'
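In the output above, a two-word name such as dark magenta ends up as two tab-separated fields. If the name should stay a single field, one option is to cap the number of splits with the maxsplit argument of str.split. This is only a sketch and was not part of the timing above; the helper name to_four_fields is hypothetical.

```python
def to_four_fields(line):
    """Split on whitespace at most three times so the name keeps its inner spaces."""
    # strip() first, because the last piece of a capped split keeps trailing whitespace
    return "\t".join(line.strip().split(None, 3))

print(repr(to_four_fields("139   0 139\t\tdark magenta\n")))
```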
It would be preferable to convert this to a generator and use it in a for loop for encapsulation.
def parse_rgb_gen(filename="source.txt"):
    with open(filename) as fp:
        for line in fp:
            yield "\t".join(line.split())

for item in parse_rgb_gen():
    print(repr(item))  # repr() alone discards its result; print it to see the \t escapes
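Coming back to the findall part of the question, since re is mandatory there: the whole file can also be matched in one pass with re.MULTILINE. This is a sketch that assumes the file is small enough to read into memory at once; the names ROW and tabbed_from_text and the sample text are invented for the example.

```python
import re

# One row per line: three integers, then a name that may contain inner spaces.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
ROW = re.compile(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(.*\S)[ \t]*$", re.MULTILINE)

def tabbed_from_text(text):
    """Return a list of 'r\\tg\\tb\\tname' strings, one per matching line of text."""
    # findall returns one tuple of groups per match when the pattern has several groups
    return ["\t".join(fields) for fields in ROW.findall(text)]

# Hypothetical sample; for a real file, pass the result of f.read() instead.
text = "! stand-in header line\n255 250 250\t\tsnow\n248 248 255\t\tghost white  \n"
print(tabbed_from_text(text))
```

Lines that do not start with three numbers simply produce no match, so no explicit skipping is needed, and multi-word names stay in a single field.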