I have a huge file (a square data file of a sequence alignment) and want to put every position into a separate column, but readr::read_delim, for instance, can't take an empty delimiter, and for readr::read_fwf it seems that you need to specify every position? I have more than 35000 positions.
Example input file:
EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....
EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....
EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....
Output:
col1 col2 col3 col4 col5 col6....
E I G M E Y.....
E I G M E Y.....
E I G M E Y.....
readr::read_fwf has a few different ways you can specify the field widths using the col_positions argument. Here's a test file, test.txt:
Hdvsmf
Dfhjds
Dfhjkd
Dfklds
Dkjffd
Dsfjkd
fkldsf
Assuming you know how many fields there are, you can specify a vector of field widths (1 character wide, 6 times, because each line in this test file has six characters):
read_fwf('test.txt', col_positions = fwf_widths(rep(1, 6)))
This is probably easier than specifying start and end positions for each field. You can also provide a character vector of column names to fwf_widths, like:
fwf_widths(rep(1, 6), paste0('col', 1:6))
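With more than 35000 positions you probably don't want to type the count by hand. A minimal sketch, assuming every line has the same length, is to measure the first line and build the widths from that. The temporary file and the names path, n and aln are illustrative only. Forcing character columns is a precaution: readr may otherwise guess a column made up only of T or F as logical.

```r
library(readr)

# Illustrative stand-in for the real alignment file
path <- tempfile(fileext = ".txt")
writeLines(c("Hdvsmf", "Dfhjds", "Dfhjkd"), path)

# Number of positions, taken from the first line (assumes all lines are equally long)
n <- nchar(read_lines(path, n_max = 1))

# One 1-character field per position, named col1..coln, all read as character
aln <- read_fwf(
  path,
  col_positions = fwf_widths(rep(1, n), paste0("col", seq_len(n))),
  col_types = cols(.default = col_character())
)
```

For the real file, replace path with the path to the alignment.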
If you don't know how many fields you have, you can also bring it in as a single column and then use tidyr::separate to extract your columns (the sep argument can take a vector of numeric positions, not just delimiters):
library(readr)
library(tidyr) # provides separate() and re-exports %>%
# a data frame with everything in one column named blah
df1 = read_csv('test.txt', col_names = 'blah')
# nchar() counts the characters in the first line; length() would always return 1 here
field_count = nchar(df1$blah[1]) # assuming the lines are all the same length!
# nb: parentheses for field_count - 1 are super important! you will spend forever debugging this if you miss it
df1 = df1 %>% separate(blah, into = paste0('col', 1:field_count), sep = 1:(field_count - 1))
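If the tidyverse isn't an option, the same split can be sketched in base R. This is an alternative technique, not what the readr or tidyr functions do internally: read the lines, split each one into single characters with strsplit, and bind the pieces into a character matrix. The temporary file and the names path, lines, mat and aln are illustrative, and the sketch assumes all lines are the same length.

```r
# Illustrative stand-in for the real alignment file
path <- tempfile(fileext = ".txt")
writeLines(c("Hdvsmf", "Dfhjds", "Dfhjkd"), path)

lines <- readLines(path)

# Guard the equal-length assumption before binding rows together
stopifnot(length(unique(nchar(lines))) == 1)

# Splitting on "" yields one element per character; rbind stacks the rows
mat <- do.call(rbind, strsplit(lines, ""))
colnames(mat) <- paste0("col", seq_len(ncol(mat)))

# Optional: convert to a data frame if column-wise access is needed
aln <- as.data.frame(mat, stringsAsFactors = FALSE)
```

For a wide alignment it may be worth keeping the matrix form, since every cell is a single character anyway.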