So I have a file like this, with each row representing a position in the scaffolds with some positions omitted. (There are actually a lot more rows for each scaffold):
SCF_1 0 1
SCF_1 3 4
SCF_1 9 10
SCF_2 0 1
SCF_2 4 5
SCF_2 12 13
SCF_2 23 24
SCF_2 79 80
SCF_3 2 3
SCF_4 1 2
...
Ultimately, I want to make 100kb-sized windows for each scaffold separately (the last window on each scaffold would be less than 100kb). This is what it should look like:
SCF_1 0 280000
SCF_1 280000 576300
SCF_1 576300 578000
SCF_2 9002 630000
...
The ranges should not appear uniform because some positions are omitted. I was thinking of somehow adding another column with ascending numbers for each scaffold, but I'm a newbie to coding and don't know how. Something like this:
SCF_1 0 1 0
SCF_1 3 4 1
SCF_1 9 10 2
SCF_2 0 1 0
SCF_2 4 5 1
SCF_2 12 13 2
SCF_2 23 24 3
SCF_2 79 80 4
SCF_3 2 3 0
SCF_3 5 6 1
All right, I finished a bash script that will do exactly what you need. Go ahead and save the following as num_count.sh (or whatever you want as long as it's a shell script format) and it should do the trick for you:
#!/bin/bash
#Color declarations
RED='\033[0;31m'
GREEN='\033[0;32m'
LIGHTBLUE='\033[1;34m'
LIGHTGREEN='\033[1;32m'
NC='\033[0m' # No Color
#Ignore the weird spacing. I promise it looks good when it's echoed out to the screen.
echo -e ${LIGHTBLUE}"############################################################"
echo "# Running string counting script. #"
echo "# #"
echo -e "# ${LIGHTGREEN}Syntax: num_count.sh inputFile outputFile${LIGHTBLUE} #"
echo "# #"
echo "# The script will count the number of instances of #"
echo "# the first string and increment the number as it #"
echo "# finds a new one, appending it to the end of each line. #"
echo -e "############################################################"${NC}
numCount=0
oldStr=null
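# Both arguments are required: the input file and the output file.
if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Insufficient arguments. Please correct your parameters and run the script again." >&2
    exit 1
fi
# Truncate (or create) the output file.
: > "$2"
while IFS= read -r line || [ -n "$line" ]; do
    # First whitespace-separated field, i.e. the scaffold name.
    firstStr=$(printf '%s\n' "$line" | awk '{print $1}')
    if [ "$oldStr" = "$firstStr" ]; then
        # Same scaffold as the previous line: bump the counter.
        ((numCount++))
    else
        # New scaffold: remember it and restart the counter at 0.
        oldStr=$firstStr
        numCount=0
    fi
    # Append the counter as a tab-separated extra column.
    printf '%s\t%s\n' "$line" "$numCount" >> "$2"
done < "$1"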
Essentially, you'll need to run the script with the first argument as the file that contains the lines you want counted, and the second argument as the output file. Be careful because the output file will be overwritten with the output data. I hope this helps!
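If the file is large, the same counting can probably be done in a single awk pass, which avoids starting a new process for every line. The following is only a sketch: add_index is a made-up helper name, and it assumes the rows for each scaffold are already grouped together, as in your sample.

```shell
#!/bin/bash
# add_index: hypothetical helper name, not part of num_count.sh.
# Reads rows on stdin and appends a 0-based per-scaffold counter
# as a tab-separated extra column.
# Assumption: rows belonging to one scaffold are contiguous.
add_index() {
    awk '{
        if (NR == 1 || $1 != prev) { n = 0; prev = $1 }  # new scaffold: restart at 0
        else                       { n++ }               # same scaffold: next index
        print $0 "\t" n
    }'
}

# Small demonstration on three sample rows.
printf 'SCF_1 0 1\nSCF_1 3 4\nSCF_2 0 1\n' | add_index
```

To run it on a real file you would use something like add_index < inputFile > outputFile.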
Here's the before and after.
Before:
SCF_1 0 1
SCF_1 3 4
SCF_1 9 10
SCF_2 0 1
SCF_2 4 5
SCF_2 12 13
SCF_2 23 24
SCF_2 79 80
SCF_3 2 3
SCF_4 1 2
After:
SCF_1 0 1 0
SCF_1 3 4 1
SCF_1 9 10 2
SCF_2 0 1 0
SCF_2 4 5 1
SCF_2 12 13 2
SCF_2 23 24 3
SCF_2 79 80 4
SCF_3 2 3 0
SCF_4 1 2 0
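For the windows themselves, the counter column may not even be needed. Below is a rough sketch that groups each scaffold's rows into windows of a fixed number of rows and prints the scaffold, the start of the first row, and the end of the last row in each window. The name make_windows is made up for illustration, and I'm assuming that "100kb" means 100000 rows (one row per position), that the rows are grouped by scaffold, and that a window should span from its first row's start to its last row's end; adjust if you meant something different.

```shell
#!/bin/bash
# make_windows: hypothetical helper name.
# $1 = number of rows per window (e.g. 100000 for 100kb windows,
#      assuming one row per position).
# Reads "scaffold start end" rows on stdin; assumes rows are grouped
# by scaffold and sorted by position.
make_windows() {
    awk -v size="$1" '
        # Print the window collected so far, if it holds any rows.
        function emit() { if (count > 0) print scf, first, last }
        {
            # Start a new window on a new scaffold or when the current one is full.
            if (NR == 1 || $1 != scf || count == size) {
                emit()
                scf = $1; first = $2; count = 0
            }
            last = $3
            count++
        }
        END { emit() }   # the final, possibly shorter, window
    '
}
```

You would call it like make_windows 100000 < inputFile > windows.txt. The last window on each scaffold simply ends up with fewer rows than the others.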