powershell bioinformatics fasta

Adding file name to lines of characters starting with ">" for multiple .faa files with PowerShell


I have about 100 FASTA files containing protein sequences stored in a single directory. I need to add each file's name to every FASTA header (the lines starting with ">") contained in that file, and then merge all of the files into a single .faa file.

I got the merging part going with the following PowerShell commands:

#Change extensions from .faa to .txt
gci -File | Rename-Item -NewName { $_.Name -replace '\.faa$', '.txt' }  # -replace takes a regex: escape the dot and anchor at the end

#Actual merging
Get-ChildItem $directory -include *.txt -rec | ForEach-Object {gc $_; ""} | out-file .\test.txt

#Change encoding so I can process the file further in R
Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt

After that I just change the extension back to .faa.

Each file stores multiple sequences of proteins. Each header should look like this:

>some_sequence -> >some_sequence file_name

This is my first contact with PowerShell. How can I do this? Best regards!


Solution

  • I assume you're looking for something like the following, which uses a switch statement to process the individual files and modifies their headers:

    Get-ChildItem $directory -Filter *.faa -Recurse | 
      ForEach-Object {
        $file = $_
        switch -Regex -File $file.FullName { # Process the file at hand.
          '^>' { $_ + ' ' + $file.Name  } # header line -> append file name
          default { $_ } # pass through
        }
        ''  # Empty line between the content from the indiv. files.
      } | 
      Set-Content -Encoding utf8 test-utf8.txt
    

    Note:

    • No need to rename the .faa files first.
    • No need for intermediate files with modified headers - all content for the ultimate output file can directly be streamed to a single Set-Content call.
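
To try the approach out without touching real data, here is a minimal, self-contained sketch. The scratch directory, the file names (a.faa, b.faa), and the sequences are made up for illustration. As a variation, it appends the file's base name (without the .faa extension) via $file.BaseName, on the assumption that this may be what "file_name" means in the question; use $file.Name to keep the extension.

```powershell
# Hypothetical demo setup: a scratch directory holding two tiny .faa files.
$demoDir = Join-Path ([System.IO.Path]::GetTempPath()) ('faa-demo-' + [guid]::NewGuid())
$null = New-Item -ItemType Directory -Path $demoDir
Set-Content -Path (Join-Path $demoDir 'a.faa') -Value '>seq1', 'MKV', '>seq2', 'MLA'
Set-Content -Path (Join-Path $demoDir 'b.faa') -Value '>seq3', 'MTT'

# Same switch-based technique as in the answer; files are sorted by name
# so that the order of the merged output is predictable.
$merged = Get-ChildItem -LiteralPath $demoDir -Filter *.faa | Sort-Object Name |
  ForEach-Object {
    $file = $_   # capture the file object; inside switch, $_ is the current line
    switch -Regex -File $file.FullName {
      '^>'    { $_ + ' ' + $file.BaseName }  # header line -> append base name
      default { $_ }                          # sequence line -> pass through
    }
  }

$merged  # display the merged lines

# Clean up the scratch directory.
Remove-Item -LiteralPath $demoDir -Recurse
```

With the sample files above, the headers should come out as ">seq1 a", ">seq2 a", and ">seq3 b", with the sequence lines unchanged. To write the result to a file instead of displaying it, pipe $merged to Set-Content as in the answer.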