Search code examples
pythonregexpandasvalueerror

Using regular expression to split into columns


I'm in desperate need of help have a data that I would like to split into columns using regex(python), it must use regular expression

Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com  

Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)

May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user abc

Oct 23 18:08:26 syntax logrotate: ALERT exited abnormally with [1]

Jun 14 21:42:52 syntax su(pam_unix)[95367]: session opened for user cbx by (uid=0)

Supposed output

Supposed Output

It's actually from a url, I made it into a pandas dataFrame and tried using re.split but it gives me error

*ValueError: 1 columns passed, passed data had 24 columns*

Hope can I get the output I need?


Solution

  • Use regex as follows

    Data

     df=pd.DataFrame({'Text':['Jan  9 01:04:49 syntax sshd(pam_unix)[21354]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=120-123-141-4.hinet-ip.hinet.com','Jul 10 04:17:11 syntax su(pam_unix)[95367]: session opened for user abc by (uid=0)','May  1 14:06:19 syntax su(pam_unix)[95367]: session closed for user ab']})
    

    regex= ([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+|(?<=\])[:\s+]+|(?<=[x])\s+

    df2=df.Text.str.split('([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+|(?<=\])[:\s+]+|(?<=[x])\s+', n=3, expand=True)
    
    df2.rename(columns=({0:'DROP1',1:'Timestamp', 2:'A1', 3:'DROP', 4:'A2', 5:'DROP2',6:'A3'}),inplace=True)#Rename columns
    
    df2.drop(columns=['DROP2','DROP1','DROP'],inplace=True)#Drop unwanted columns
    

    Basically;

    (?<=\])[:\s+]+ Split by space that comes after ]:

    or -|

    (?<=[x])\s+ Split by space that comes after x

    or- |

    ([A-Za-z]+\s+\d+\s+\d+:\d+:\d+)\s+ Split the timestamp

    Outcome

    enter image description here