I have a fastq file that has reads with the following format:
@SRR1463109.1 HWI-ST740_1:1:1101:1222:2116/1
AAACTAAAATTTTAAAGCATCTGACTGTACTCATGGTGGGTACACGTGACTAGAAATCTATCACACTAACATGAGGGTCAGCTCCACGCTCTGTGACTTCT
+
HHHHHFHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHCEHHDDDDBFGGGBGHHHHFHHHHHF;EF?FDCD?GGCGGFFGFGHHEGHGGFFGEEDHHG
I need to replace the space after the @xxxx word with an underscore so that it looks like
@SRR1463109.1_HWI-ST740_1:1:1101:1222:2116/1
AAACTAAAATTTTAAAGCATCTGACTGTACTCATGGTGGGTACACGTGACTAGAAATCTATCACACTAACATGAGGGTCAGCTCCACGCTCTGTGACTTCT
+
HHHHHFHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHCEHHDDDDBFGGGBGHHHHFHHHHHF;EF?FDCD?GGCGGFFGFGHHEGHGGFFGEEDHHG
I'm new to awk but so far I've got
awk '{ gsub("^@([a-z]|[A-Z])*", $1"_"$2, $1); $2=""; print }' test.fastq
and the result is
@SRR1463109.1_HWI-ST740_1:1:1101:1222:2116/11463109.1
AAACTAAAATTTTAAAGCATCTGACTGTACTCATGGTGGGTACACGTGACTAGAAATCTATCACACTAACATGAGGGTCAGCTCCACGCTCTGTGACTTCT
+
HHHHHFHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHCEHHDDDDBFGGGBGHHHHFHHHHHF;EF?FDCD?GGCGGFFGFGHHEGHGGFFGEEDHHG
The last part of the line is getting mangled, possibly because of the "/1" that's in the text. How can I fix this?
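Use sed for a simple replacement like this. Note that \+ is a GNU sed extension; with a strictly POSIX sed, write [[:blank:]][[:blank:]]* instead.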
$ sed 's/^\(@[^[:blank:]]*\)[[:blank:]]\+/\1_/' file
@SRR1463109.1_HWI-ST740_1:1:1101:1222:2116/1
AAACTAAAATTTTAAAGCATCTGACTGTACTCATGGTGGGTACACGTGACTAGAAATCTATCACACTAACATGAGGGTCAGCTCCACGCTCTGTGACTTCT
+
HHHHHFHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHCEHHDDDDBFGGGBGHHHHFHHHHHF;EF?FDCD?GGCGGFFGFGHHEGHGGFFGEEDHHG
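As for why the awk attempt mangles the line: the "/1" is not the culprit. The regex ^@([a-z]|[A-Z])* only matches the leading "@SRR" (letters, no digits or dots), so gsub swaps just that prefix for $1"_"$2 and the rest of the original $1, "1463109.1", is left hanging on the end. If you would rather stay with awk, something along these lines should work. It is only a sketch and it assumes strict four-line FASTQ records (no wrapped sequence lines), so that headers are lines 1, 5, 9, and so on; the temporary file and the variable names are just for illustration.

```shell
# Hypothetical sample file holding the read from the question.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@SRR1463109.1 HWI-ST740_1:1:1101:1222:2116/1
AAACTAAAATTTTAAAGCATCTGACTGTACTCATGGTGGGTACACGTGACTAGAAATCTATCACACTAACATGAGGGTCAGCTCCACGCTCTGTGACTTCT
+
HHHHHFHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHCEHHDDDDBFGGGBGHHHHFHHHHHF;EF?FDCD?GGCGGFFGFGHHEGHGGFFGEEDHHG
EOF

# Assumption: every record is exactly four lines, so NR % 4 == 1 selects headers.
# On those lines, replace the first run of spaces/tabs with "_"; print every line.
result=$(awk 'NR % 4 == 1 { sub(/[ \t]+/, "_") } { print }' "$fastq")
printf '%s\n' "$result"

rm -f "$fastq"
```

Using sub on the whole line avoids reassigning $1 and $2, so awk never rebuilds the record and no stray trailing space is introduced.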