Extracting username in IRC logs with regex?


I'm learning programming as best as I can, and I've been starting with Python. I am currently writing an IRC statistics generator (as if there weren't enough of those already), and I am trying to come up with a regex that matches the username (and only the username) in a particular log format. However, the one I have doesn't match anything with re.search.

Here is an example of the log format:

may 01 14:04:54 <FishCream> Wahoo!
may 01 14:05:01 <LpSamuelm> Oh, if only talking was this fun in real life.
jan 01 00:00:00 <Username>  Message goes here.
jan 01 00:00:00 *   Username Action goes here.

Here are the compile statements:

findusername = re.compile("^[a-zA-Z]+\s[0-9]+\s[0-9:]\s<([A-Za-z]+)>")
finduseraction = re.compile("^[a-zA-Z]+\s[0-9]+\s[0-9:]\s\*\s+([A-Za-z]+)\s")

As you can see, I have made two separate statements for finding the username when the user talks and when they use /me commands - making one super-regex for these two is probably possible, but I've got enough headache as it is.

Can anyone help me identify the problem?


Solution

  • Your [0-9:] character class matches only a single character, but the timestamp (e.g. 14:04:54) is 8 characters long; add a quantifier:

    findusername = re.compile(r"^[a-zA-Z]+\s[0-9]+\s[0-9:]{8}\s<([A-Za-z]+)>")
    finduseraction = re.compile(r"^[a-zA-Z]+\s[0-9]+\s[0-9:]{8}\s\*\s+([A-Za-z]+)\s")
    

    This presumes that you match each log line separately; add the re.MULTILINE flag if your log text contains multiple lines at a time, so that ^ matches at the start of every line rather than only at the start of the string.

    A demo using the re.MULTILINE flag with .findall() on your input example:

    >>> import re
    >>> findusername = re.compile(r"^[a-zA-Z]+\s[0-9]+\s[0-9:]{8}\s<([A-Za-z]+)>", re.MULTILINE)
    >>> finduseraction = re.compile(r"^[a-zA-Z]+\s[0-9]+\s[0-9:]{8}\s\*\s+([A-Za-z]+)\s", re.MULTILINE)
    >>> findusername.findall(logs)
    ['FishCream', 'LpSamuelm', 'Username']
    >>> finduseraction.findall(logs)
    ['Username']
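
The question mentions that a single regex covering both message lines and /me action lines is probably possible. Here is a minimal sketch of one way to do that, using an alternation after the shared timestamp prefix. The names LINE_RE and extract_usernames are made up for illustration, and the pattern keeps the original [A-Za-z]+ assumption about nicknames, so nicks containing digits or underscores would not match without widening that class.

```python
import re

# Hypothetical combined pattern: the date/time prefix is shared, and an
# alternation handles "<Nick>" (normal message) or "*   Nick " (/me action).
LINE_RE = re.compile(
    r"^[a-zA-Z]+\s[0-9]+\s[0-9:]{8}\s"          # e.g. "may 01 14:04:54 "
    r"(?:<([A-Za-z]+)>|\*\s+([A-Za-z]+)\s)",    # group 1: message, group 2: action
    re.MULTILINE,
)

# Sample log text copied from the question.
logs = """\
may 01 14:04:54 <FishCream> Wahoo!
may 01 14:05:01 <LpSamuelm> Oh, if only talking was this fun in real life.
jan 01 00:00:00 <Username>  Message goes here.
jan 01 00:00:00 *   Username Action goes here.
"""


def extract_usernames(text):
    """Return the username from every matching log line, in order."""
    names = []
    for match in LINE_RE.finditer(text):
        # Exactly one of the two groups is set for any given match.
        names.append(match.group(1) or match.group(2))
    return names


usernames = extract_usernames(logs)
print(usernames)
# ['FishCream', 'LpSamuelm', 'Username', 'Username']
```

Keeping the two separate patterns from the answer is just as valid; the combined form mainly saves a second pass over the log when you do not need to distinguish messages from actions.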