Character substitution in file based on position extracted from another file

I have a list of files each containing six lines. A representative example is shown below.

cat test.fa
>chain A
MIRLGAPQTLVLLTLLVAAVLRCQGQDVQEAGSCVQDGQRYNDKDVWKPEPCRICVCDTGTVLCDDIICEDVKDCLSPEIPFGECCPICPTDLATASGQPGPKGQKGEPGDIKDIVGPKGPPGPQGPAGEQGPRGDRGDKGEKGAPGPRGRDGEPGTPGNPGPPGPPGPPGPPGLGGNFAAQMAGGFDEKAGGAQLGVMQGPMGPMGPRGPPGPAGAPGPQGFQGNPGEPGEPGVSGPMGPRGPPGPPGKPGDDGEAGKPGKAGERGPPGPQGARGFPGTPGLPGVKGHRGYPGLDGAKGEAGAPGVKGESGSPGENGSPGPMGPRGLPGERGRTGPAGAAGARGNDGQPGPAGPPGPVGPAGGPGFPGAPGAKGE
>chain B
MMSFVQKGSWLLLALLHPTIILAQQEAVEGGCSHLGQSYADRDVWKPEPCQICVCDSGSVLCDDIICDDQELDCPNPEIPFGECCAVCPQPPTAPTRPPNGQGPQGPKGDPGPPGIPGRNGDPGIPGQPGSPGSPGPPGICESCPTGPQNYSPQYDSYDVKSGVAVGGLAGYPGPAGPPGPPGPPGTSGHPGSPGSPGYQGPPGEPGQAGPSGPPGPPGAIGPSGPAGKDGESGRPGRPGERGLPGPPGIKGPAGIPGFPGMKGHRGFDGRNGEKGETGAPGLKGENGLPGENGAPGPMGPRGAPGERGRPGLPGAAGARGNDGARGSDGQPGPPGPPGTAGFPGSPGAKGEVGPAGSPGSNGAPGQRGEPGPQGH
>chain C
MLPQIPFLLLVSLNLVHGVFYAERYQMPTGIKGPLPNTKTQFFIPYTIKSKGIAVRGEQGTPGPPGPAGPRGHPGPSGPPGKPGYGSPGLQGEPGLPGPPGPSAVGKPGVPGLPGKPGERGPYGPKGDVGPAGLPGPRGPPGPPGIPGPAGISVPGKPGQQGPTGAPGPRGFPGEKGAPGVPGMNGQKGEMGYGAPGRPGERGLPGPQGPTGPSGPPGVGKRGENGVPGQPGIKGDRGFPGEMGPIGPPGPQGPPGERGPEGIGKPGAAGAPGQPGIPGTKGLPGAPGIAGPPGPPGFGKPGLPGLKGERGPAGLPGGPGAKGEQGPAGLPGKPGLTGPPGNMGPQGPKGIPGSHGLPGPKGETGPAGPAGYPGAK

Reading row-by-row another file called test.list, I would like to substitute character position 140 in chain A of test.fa with "0" if the third column is "K" and character position 142 of chain B with "1" if the fourth column is E. Same for other rows.

cat test.list

A-B 140-142 K E
B-C 140-142 K E
A-B 299-301 K E
B-C 299-301 K E

I cannot figure out how to get a headstart. Really appreciate any help!

Solution

Interpretation of the task

Files are to be processed in pairs where a file with file-extension .list contains information needed to modify a corresponding file with file-extension .fa.

The .fa files contain fasta-formatted sequences for three chains identified A, B, or C. I have assumed every .fa file will contain three sequences identified A, B, and C. (if other identifiers are used, the script will need to be modified to extract identifiers for the chains).

The .list files contain four sets of conditions specifying changes to be made to the identified chains in the corresponding .fa file. Any changes accumulate as each condition is processed. Each condition is formatted as follows:

A-B 140-142 K E

Using the values in the above (one of four per pair of files) condition example, the instructions become:

If position 140 in chain A (of the corresponding .fa file) is currently K

AND If position 142 in chain B (of the corresponding .fa file) is currently E

PROVIDED BOTH CONDITIONS ARE MET change position 140 in chain A to a 0 AND change position 142 in chain B to 1.

If only one (or no) condition is met, make no change. The changes will always be 0 for the first position and 1 for the second position.

The changed chains are then subjected to tests for the next three condition sets pertaining to that sequence file, accumulating changes with each condition set.

Approach

Since many paired files may exist, a bash script can be used to interrogate a directory containing target files to identify pairs of matched files to be processed. The target directory will be passed as an argument to the script.

The bash script will then pass pairs of matched files to an awk script, embedded in the bash script, where the values of the .list file can be read and the corresponding changes made to the .fa files.

Because of the potential for unforeseen errors, the original .fa files are not overwritten. Instead files containing the substituted sequences will be written to a set of new files (in a sub-directory of the target directory).

In order to track the changes and report any unmatched files, the script will also create a log file to record the changes made to each file and note any unmatched files.

awk steps

The bash script passes the .list file to awk before the corresponding .fa sequence file. Thus, when the record number NR (equivalent to accumulated line numbers across both files but beginning with the .list file) is equal to the record number of the current file FNR, the script is processing the first file and an action block can be assigned to perform steps only on the data in the .list file. Using this, the conditions data (in the .list file) can be re-formatted into an array for easier reading later on:

# this block only executes for lines in the first argument file (the .list file) because it has the condition pattern NR==FNR;
NR==FNR{
  split($1,abc,"-"); # splits e.g 'A-B' into an array named 'abc' with 'A' index 1, and 'B' in index 2;
  split($2,nums,"-"); # splits e.g '140-142' into an array named 'nums' with '140' index 1, and '142' in index 2;

conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3; # $3 is e.g. 'K';
conditions[NR,6] = $4; # $4 is e.g. 'E';
conditions[NR,7] = $0; # the whole line;
}

(the comment note examples relate to the example condition A-B 140-142 K E as used above). A 2-d array named conditions is built containing the data needed for the tests later on.

A set of three pairs of short blocks follow. These extract the sequence runs for the three chains A, B, and C, storing them in an associative array indexed with the relevant letter. The first of each pair identifies the record (line) number FNR of the line following the line in which the chain identifier is found, storing it in a variable named aline, bline or cline for the three chains in each .fa file:

/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}

When the line after the line containing the label, for example 'chain A', is encountered (FNR==aLine) the sequence found in that line is recorded in the chains array.

The condition tests and substitutions are made entirely in the awk END block, which executes once after the earlier blocks have been applied to each record (line) of the argument files.

The END block contains a loop to cycle through each condition, making any specified changes before moving to the next condition.

The logic test is performed inside the loop using an if conditional, which specifies that both conditions specified in the current condition line if the block of code is to be executed.

if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) { # execute of both condition met;}

The right hand side of each comparison refers to the condition letter 'K' or 'E' from the earlier example, checking if it is currently present at the position specified in the left hand side of the condition. The current character at that position is extracted from the sequence, stored in chains[conditions[i,1]] (which resolves to chain array elements e.g chains[A] or chains[B]) using awk's substr function to extract the single character present at position condition[i,3].

If both conditions are met, the substitutions are made by joining substrs of the regions flanking the target position either side of the required '0' or '1'. The sequences, altered or not, are then written to file.

Implementing the script

The target files, one .list file for each .fa, file should be placed together in a directory. The directory name will be the argument for the script. The name part of the filenames should match for the matched pairs (e.g sequence1.fa should have a matched conditions file named sequence1.list).

(assuming the script is saved in a file named process_list_fa.sh). The script must be made executable. This can be done on the command line using:

chmod +x process_list_fa.sh

The script is then called from the command line as follows:

./process_list_fa.sh path/to/directory

The path/to/directory argument must represent the directory under which the matched files are stored. (don't use a trailing '/').

The script will execute, reporting progress to the command line. It may halt if, for example, the number of .fa files is different to the number of .list files, and await confirmation to continue.

Once complete, the target directory will now contain a sub directory named output. Inside the output directory will be a collection of files containg the modified sequences. Each will be named as it was originally but pre-pended with modified_. A log.txt file is also created summarising the changes made, and any files skipped. This may be useful for checking the files, especially if large numbers of pairs are processed.

The process_list_fa.sh script

copy the entire following snippet and save in a file named process_list_fa.sh. Make the file executable (chmod +x process_list_fa.sh) and execute it passing the directory containing the sequences as an argument (./process_list_fa.sh path/directory)

#!/bin/bash

# check for target directory passed as argument;
baseDir=$1;
if [ ! -d "$baseDir" ]; then
  echo "target directory $baseDir not found. exiting";
  exit;
fi

outputDir=$baseDir/output;
if [ ! -d "$outputDir" ]; then
  mkdir $outputDir;
fi

logFile=$outputDir/log.txt;

echo "directory: "$baseDir"/" > $logFile;
echo "results will be stored in: "$outputDir >> $logFile;
echo "checking for .fa and .list files inside $baseDir..."
skipCount=0;
skippedFiles=(); # array to store skipped file names in;

for FILE in $baseDir/*.fa; do 
#faCount=(( $faCount +1 ));
((faCount++));
done;

for FILE in $baseDir/*.list; do 
((listCount++));
done;

echo $faCount ".fa files found" >> $logFile;
echo $listCount ".list files found" >> $logFile;

echo $faCount ".fa files found, and " $listCount ".list files found";

if [ $faCount == $listCount ]; then
  echo "beginning replacement procedure..."; 
else 
  echo "WARNING - file numbers not equal";
  echo "if you choose to continue, unmatched files will be skipped";

  while true; do
    read -p "Do you wish to continue? [y/n]" yn
    case $yn in
        [Yy]* ) echo "results follow:" >> $logFile;
                echo "" >> $logFile;
                break;;
        [Nn]* ) echo "exiting... bye";
                echo "process quit by user" >>  $outputDir/log.txt;
                exit;;
        * ) echo "Please type y (continue) or n (exit).";;
    esac
  done;
fi

processCount=1;
ext=".list"

for FILE in $baseDir/*.fa; do 
listFile=${FILE/.fa/$ext};
# does the .list file exist?
if [ ! -f "$listFile" ]; then
  echo "Warning: no matched .list file for $FILE found. Skipping $FILE";
  #echo "" >> $logFile;
  #echo "No matched .list file for $FILE found. skipping $FILE" >> $logFile;
  ((skipCount++));
  skippedFiles+=($FILE); # append to array;
  continue;
fi

echo "file number "$processCount $FILE;
echo "matched list file: "$listFile "found";
echo "processing files "$FILE " and "$listFile;
echo "" >> $logFile;
echo "matched pair $processCount: " $FILE " and "$listFile >> $logFile;
((processCount++));

#### awk script ####

awk '
BEGIN {numConditionSets=0;}

NR==FNR{
numConditionsSets++;

# organise conditions data into an array;
  split($1,abc,"-");
  split($2,nums,"-");

conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3;
conditions[NR,6] = $4;
conditions[NR,7] = $0; # the whole line;


# make path to pre-existing log file;
  if (NR==1) {
    n = split(FILENAME, a, "/");
      for (i=1; i<n; i++) {basename=basename a[i]"/";}
    logFile = basename"output/log.txt";
  } # end 1st line block;
} # end NR==FNR block;


# store chains in chains array;
/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}


END {

# get .fa identified from file path;
    n = split(FILENAME, a, "/");
    fileID = a[n];
    outPath = basename"output/"; # path to save output .fa files;

# loop conditions, mutating sequences along the way;

for (i=1; i<=numConditionsSets; i++) {

  if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) { 

       chains[conditions[i,1]] = substr(chains[conditions[i,1]],1,conditions[i,3]-1)"0"substr(chains[conditions[i,1]],conditions[i,3]+1);
       chains[conditions[i,2]] = substr(chains[conditions[i,2]],1,conditions[i,4]-1)"1"substr(chains[conditions[i,2]],conditions[i,4]+1);
print "condition "i" ("conditions[i,7]"): chain "conditions[i,1]" residue "conditions[i,3]" "conditions[i,5]" to 0, chain "conditions[i,2]" residue "conditions[i,4]" "conditions[i,6]" to 1," >> logFile;

     } else {print "condition "i" ("conditions[i,7]"): not met, no changes made," >> logFile;}
} # next i condition set;


  for (indx in chains) {
    print ">chain "indx"\n"chains[indx] >> outPath"modified_"fileID;
  } 
 

} # end END block;





' $listFile $FILE 


done; # end for-in-do;

#### log unmatched/skipped .fa files ######
echo "" >> $logFile; 
echo "${#skippedFiles[@]} .fa files were unmatched with .list files and skipped:" >> $logFile;

for i in "${skippedFiles[@]}"
do
   echo "$i" >> $logFile;
done

echo "" >> $logFile;
echo "END" >> $logFile;