Search code examples
rrows

R importing unequal number of columns per row correctly


I have a dataset with over 40,000 rows and an unequal number of columns for each. The problem I am having is that each value at the end of each row belongs in the same column but its ending up divided among many different columns.

So my question is: Is there a simple way of asking that the final value of each row be placed in the same column?

headers are: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14

BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein

current import looks like this:

c1                                   c2        c3       c4     c5      c6        c7    c8  c9
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU  Acr3      GN=acr3  PE=4   SV=1    multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA  AdeI      GN=adeI  PE=4   SV=1    predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA  Multidrug efflux   pump   AbeM    GN=abeM   PE=4 SV=2 multidrug efflux pump

If anything I'd like it to look like this:

c1                                   c2        c3       c4     c5    c6      c7   c8   c9
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU  Acr3      GN=acr3  PE=4   SV=1  NA      NA   NA   multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA  AdeI      GN=adeI  PE=4   SV=1  NA      NA   NA   predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA  Multidrug efflux   pump   AbeM  GN=abeM PE=4 SV=2 multidrug efflux pump

but preferably like this:

c1                                                         c2   c3      c4   c5   c6
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU                        Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA                        AdeI GN=adeI PE=4 SV=1 predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump  AbeM GN=abeM PE=4 SV=2 multidrug efflux pump

It is line 3 in this causing all the problems - I would prefer it if kept the name as one long string and didn't add those 3 extra columns, but as what I really want is the name(c1) and the description (last value) I am willing to deal with it this way if there in no other way.

But I was just thinking that perhaps there is a way of telling R how to divide the string - maybe so that all values after the "GN=*" value are placed into 3 individual columns. I don't know what would be best. Thanks for your help.


Solution

  • That's a pretty messy file format. It would be better if you would separate fields with tabs or something that can be easily found in the line. As a human I had a hard time guessing where the breaks would be so i'm not surprised read.table has trouble to.

    You could parse these strings with a regular expression. Something like

    #lines <- readLines("filename.txt")
    lines <- c("BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance",
    "BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein","BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein")
    
    m <-regexpr("(.*) (\\w+) (GN=\\S+) (PE=\\S+) (SV=\\S+) (.*)", lines, perl=T)
    

    This will report the start position and length of each of the columns. You can extract the values using substring or you can use a helper function like regcapturedmatches. If we take the results and propertly insert tabs between the fields, then we can use read.table() to make a data.frame

    read.table(text=sapply( regcapturedmatches(lines,m), function(x) paste(x, collapse="\t")), sep="\t", as.is=T)
    

    which gives

                                                             V1   V2      V3   V4   V5                           V6
    1                       BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1         multidrug resistance
    2                       BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1            predicted protein
    3 BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein
    

    which seems to match your desired result.