linux bioinformatics biopython bioperl blast

How to replace the hit GIs in file1 with the matching full names from file2


I have a list of queries and their hit GIs in one file (file1), and a second file (file2) that contains the complete name of each hit. I want to replace each hit GI in file1 with the complete hit name from file2 that carries the same GI, keeping each hit next to its corresponding query.

file1

 1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
 2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ 
 3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  
 4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ 
 5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_ 

file2

1  >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2  >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3  >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]

desired output:

1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]

Solution

  • The solution, step by step:

    1. Extract only the Hit GIs from file1 (this works whether or not the lines carry a leading number):

      grep -o 'Hit=gi_[0-9]*' file1 | sed 's/^Hit=//' > file1-gi
      
    2. Remove the leading line number (if present) and the ">" from each line of file2:

      sed 's/^[0-9 ]*>//' file2 > file2_1
      
    3. Remove duplicate lines from file2, if any:

      sort -u file2_1 > file2_2
      
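    4. Use awk's system() to grep the full name for each GI, in file1 order. The trailing "_" in the pattern stops a short GI from also matching a longer GI that begins with the same digits:

      awk '{system("grep \"" $1 "_\" file2_2")}' file1-gi > file1-gi-name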
      
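    5. Keep each line of file1 up to and including "Hit=":

      sed 's/Hit=.*/Hit=/' file1 > file1_1
      
    6. Paste the two files together with no delimiter, so that the full name follows "Hit=" directly. This assumes every GI in file1 was found in file2; a missing GI shifts the remaining lines:

      paste -d '\0' file1_1 file1-gi-name > output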