I have a dataset with over 40,000 rows and an unequal number of columns for each. The problem I am having is that each value at the end of each row belongs in the same column but its ending up divided among many different columns.
So my question is: Is there a simple way of asking that the final value of each row be placed in the same column?
headers are: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein
current import looks like this:
c1 c2 c3 c4 c5 c6 c7 c8 c9
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug efflux pump
If anything I'd like it to look like this:
c1 c2 c3 c4 c5 c6 c7 c8 c9
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 NA NA NA multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 NA NA NA predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug efflux pump
but preferably like this:
c1 c2 c3 c4 c5 c6
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug efflux pump
It is line 3 in this causing all the problems - I would prefer it if kept the name as one long string and didn't add those 3 extra columns, but as what I really want is the name(c1) and the description (last value) I am willing to deal with it this way if there in no other way.
But I was just thinking that perhaps there is a way of telling R how to divide the string - maybe so that all values after the "GN=*" value are placed into 3 individual columns. I don't know what would be best. Thanks for your help.
That's a pretty messy file format. It would be better if you would separate fields with tabs or something that can be easily found in the line. As a human I had a hard time guessing where the breaks would be so i'm not surprised read.table
has trouble to.
You could parse these strings with a regular expression. Something like
#lines <- readLines("filename.txt")
lines <- c("BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance",
"BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein","BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein")
m <-regexpr("(.*) (\\w+) (GN=\\S+) (PE=\\S+) (SV=\\S+) (.*)", lines, perl=T)
This will report the start position and length of each of the columns. You can extract the values using substring or you can use a helper function like regcapturedmatches. If we take the results and propertly insert tabs between the fields, then we can use read.table()
to make a data.frame
read.table(text=sapply( regcapturedmatches(lines,m), function(x) paste(x, collapse="\t")), sep="\t", as.is=T)
which gives
V1 V2 V3 V4 V5 V6
1 BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU Acr3 GN=acr3 PE=4 SV=1 multidrug resistance
2 BAC0017|adeI|tr|Q2FD95|Q2FD95_ACIBA AdeI GN=adeI PE=4 SV=1 predicted protein
3 BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA Multidrug efflux pump AbeM GN=abeM PE=4 SV=2 multidrug resistance protein
which seems to match your desired result.