I built a small Python script some months ago that collects some very basic stats from the logs on my honeypots. I've discovered a fault in this script and have failed to find an answer on my own.
The script will read the log files from the attacks. The log files contain five pieces of data on each line.
Example:
2014-12-24 13:37:00,1.2.3.4,root,password,0
The five pieces of data are separated by a ','.
So I've used the ',' as a delimiter to split each line into a list, like this:
['2014-12-24 13:37:00', '1.2.3.4', 'root', 'password', '0']
from which I can grab the data I need.
The issue, which I'm sure a few of you have already figured out,
occurs when the delimiter is present in the attempted password,
in this case H4ck3r,,h4cker,,2015
The log file ends up looking like this
2015-01-02 01:44:38,2.3.4.5,root,H4ck3r,,h4cker,,2015,0
and the resulting list turns into this after it's been split:
['2015-01-02 01:44:38', '2.3.4.5', 'root', 'H4ck3r', '', 'h4cker', '', '2015', '0']
My first thought for a workaround here was to remove [0:3] and [-1],
then accept whatever is left as the password, but that is
neither clean nor accurate, to say the least.
If the attacker uses the delimiter within the user name, I'll be back to square one.
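For reference, a minimal sketch of that workaround. The name split_log_line is only illustrative, and the sketch assumes commas can appear in the password field and nowhere else:

```python
def split_log_line(line):
    """Split one log line, treating everything between the first three
    fields and the last field as the password (assumption: only the
    password may contain the ',' delimiter)."""
    parts = line.rstrip("\n").split(",")
    if len(parts) < 5:
        raise ValueError("expected at least 5 fields, got %d" % len(parts))
    timestamp, ip, user = parts[:3]
    status = parts[-1]
    # Re-join whatever is left; the commas removed by split() are restored.
    password = ",".join(parts[3:-1])
    return [timestamp, ip, user, password, status]
```

A comma in the user name would shift the fields and end up inside the password, which is exactly the weakness described above.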
Questions.
As mgilson already pointed out, you should change the format of your log files (if possible).
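If you do control how the honeypot writes its logs, one possible format change is to let Python's csv module quote any field that contains the delimiter. This is only a sketch: the helper names format_entry and parse_entry are made up for illustration, and it assumes Python 3 and that the logging code is yours to change.

```python
import csv
import io

def format_entry(timestamp, ip, user, password, status):
    """Return one log line; csv.writer quotes fields that contain ',' or '"'."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([timestamp, ip, user, password, status])
    return buf.getvalue()

def parse_entry(line):
    """Parse one line written by format_entry back into five strings."""
    return next(csv.reader([line]))
```

With this, a password such as H4ck3r,,h4cker,,2015 is written as "H4ck3r,,h4cker,,2015" (in double quotes) and read back as a single field, and the same holds for a user name that contains a comma.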
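To parse the existing logs, you can use the regex ^([^,]*),([^,]*),([^,]*),(.*),(\d+)\s*$. This captures the timestamp in group 1, the IP in group 2, and so on. The first three groups cannot contain a comma, so everything between the third comma and the final ",digits" ends up in group 4, the password. Like any approach based on the existing format, it still assumes the user name contains no comma.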
>>> import re
>>> pattern = r'^([^,]*),([^,]*),([^,]*),(.*),(\d+)\s*$'
>>> string = 'time,ip,user,H4ck3r,,h4cker,,2015 ,1'
>>> match = re.match(pattern, string)
>>> print(match.groups())
('time', 'ip', 'user', 'H4ck3r,,h4cker,,2015 ', '1')
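To run this over a whole log file, the pattern could be wrapped in a small helper along these lines. This is a sketch only: parse_line is a hypothetical name, and returning None for lines that do not match is just one way to handle malformed entries.

```python
import re

# Same pattern as above, compiled once for reuse across many lines.
LINE_RE = re.compile(r'^([^,]*),([^,]*),([^,]*),(.*),(\d+)\s*$')

def parse_line(line):
    """Return (timestamp, ip, user, password, status) or None if the
    line does not have the expected shape."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.groups()
```

Calling parse_line('2015-01-02 01:44:38,2.3.4.5,root,H4ck3r,,h4cker,,2015,0') gives a tuple whose fourth item is the full password 'H4ck3r,,h4cker,,2015'.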