python regex split delimiter logfiles

Python split delimiter issue


I built a small Python script some months ago that collects some very basic stats from the logs on my honeypots. I've discovered a fault in this script and have failed to find an answer on my own.

The script will read the log files from the attacks. The log files contain five pieces of data on each line.

  • date/time stamp
  • the IP address from which the attack occurred
  • attempted user name
  • attempted password
  • success/failure code


Example:

2014-12-24 13:37:00,1.2.3.4,root,password,0    

The five pieces of data are separated by a ','.
So I've used the ',' as a delimiter to split each line into a list, like this:

['2014-12-24 13:37:00', '1.2.3.4', 'root', 'password', '0']    

from which I can grab the data I need.

The issue, which I'm sure a few of you have already figured out,
occurs when the delimiter is present in the attempted password,
in this case H4ck3r,,h4cker,,2015

The log file ends up looking like this:

2015-01-02 01:44:38,2.3.4.5,root,H4ck3r,,h4cker,,2015,0    

and the resulting list turns into this after it's been split:

['2015-01-02 01:44:38', '2.3.4.5', 'root', 'H4ck3r', '', 'h4cker', '', '2015', '0']    
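The difference between the two cases can be reproduced with a minimal sketch. The variable names are only illustrative, and the two sample lines are copied from the examples above:

```python
# Sample lines copied from the examples in the question.
clean_line = "2014-12-24 13:37:00,1.2.3.4,root,password,0"
messy_line = "2015-01-02 01:44:38,2.3.4.5,root,H4ck3r,,h4cker,,2015,0"

clean_fields = clean_line.split(",")
messy_fields = messy_line.split(",")

# A well-formed line yields exactly five fields.
print(len(clean_fields))  # 5

# Commas inside the password produce extra fields, so a fixed
# index such as fields[4] no longer points at the status code.
print(len(messy_fields))  # 9
```

Checking the field count is a cheap way to detect affected lines, but it does not by itself say which fields belong to the password.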


My first thought for a workaround was to take [0:3] and [-1] as the fixed fields
and accept whatever is left as the password, but that is
neither clean nor accurate, to say the least.
If the attacker uses the delimiter within the user name, I'll be back to square one.
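A rough sketch of that slicing workaround might look like the following. The function name `parse_by_slicing` is hypothetical, and the code assumes every line has at least three leading fields and one trailing field:

```python
def parse_by_slicing(line):
    """Treat the first three fields and the last one as fixed;
    everything in between is rejoined as the password."""
    fields = line.rstrip().split(",")
    timestamp, ip, user = fields[0:3]
    code = fields[-1]
    # Re-insert the commas that split() removed from the password.
    password = ",".join(fields[3:-1])
    return timestamp, ip, user, password, code


good = parse_by_slicing("2015-01-02 01:44:38,2.3.4.5,root,H4ck3r,,h4cker,,2015,0")
print(good[3])  # H4ck3r,,h4cker,,2015

# A comma in the user name is silently misattributed to the password.
bad = parse_by_slicing("2015-01-02 01:44:38,2.3.4.5,ro,ot,secret,0")
print(bad[2], bad[3])  # ro ot,secret
```

The second call shows the limitation mentioned above: with an unquoted comma in the user name, the line is ambiguous and no split-based rule can recover the original fields with certainty.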

Questions.

  • Is there any clean and easy way to resolve this using split?
  • Would regex be the best way to go here?
  • ...Other ways to solve this?

Solution

  • As mgilson already pointed out, you should change the format of your log files (if possible).

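    To parse the existing logs, you can use the regex ^([^,]*),([^,]*),([^,]*),(.*),(\d+)\s*$. This captures the timestamp in group 1, the IP address in group 2, the user name in group 3, the password (which may contain commas) in group 4, and the success/failure code in group 5.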

    regex101 demo.

    >>> import re
    >>> pattern = r'^([^,]*),([^,]*),([^,]*),(.*),(\d+)\s*$'
    >>> string = 'time,ip,user,H4ck3r,,h4cker,,2015 ,1'
    >>> match = re.match(pattern, string)
    >>> print(match.groups())
    ('time', 'ip', 'user', 'H4ck3r,,h4cker,,2015 ', '1')
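For the question about solving this with split alone, one possible sketch limits the number of splits from each end. The function name `parse_with_split` is hypothetical, and the code makes the same assumption as the regex: only the password may contain commas.

```python
def parse_with_split(line):
    """Peel the status code off the right, then split the rest
    into at most four fields so the password keeps its commas."""
    # rsplit with maxsplit=1 separates only the final field.
    head, code = line.rstrip().rsplit(",", 1)
    # split with maxsplit=3 stops after the third comma.
    timestamp, ip, user, password = head.split(",", 3)
    return timestamp, ip, user, password, code


result = parse_with_split("time,ip,user,H4ck3r,,h4cker,,2015 ,1")
print(result)
# ('time', 'ip', 'user', 'H4ck3r,,h4cker,,2015 ', '1')
```

A line with fewer than five fields raises a ValueError during unpacking, which can be caught to skip or log malformed lines. As with the regex, a comma in the user name would still be assigned to the wrong field.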