python, google-colaboratory, bioinformatics, fasta

Is there a way to fetch protein sequences from UniProt using Python?


I am looking for a way to retrieve FASTA sequences from UniProt by giving the protein's UniProt ID as input. My goal is to create a Google Colab notebook that builds a FASTA file where I can specify the file name, the directory (in Google Drive) where I want to save it, and a list of UniProt IDs in the format 1xUniProt1, 3xUniProt2. The prefix (e.g. 3x) is the number of times I want that sequence in the FASTA file, with the copies separated by ':'.

I was thinking something like this:

In input:

Name = protein_sequences
Proteins = 2xUniprot1, 3xUniprot2, 1xUniprot3
Directory = FASTA_directory

In output:

Name of file = protein_sequences.fasta

FASTA file:

> protein_sequences   sequenceUniprot1:sequenceUniprot1:sequenceUniprot2:sequenceUniprot2:sequenceUniprot2:sequenceUniprot3

The main problem I have is that I am not sure how to fetch the sequences themselves from UniProt using Python. I don't know what the latest and most efficient way of doing this is.


Solution

  • It looks like UniProt has a REST API, so I would try to fetch the protein info from there: https://www.uniprot.org/help/programmatic_access

    You need to make HTTP calls to this API. For that I recommend the httpx library. Its documentation should guide you through the process if you've never done anything like that.
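As a rough illustration of the approach, here is a minimal sketch of the whole pipeline from the question: parse the `2xUniprot1, 3xUniprot2` input, download each sequence once per entry, and write one FASTA record with the copies joined by ':'. Several things here are assumptions rather than anything stated above: the endpoint pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, all function names (`parse_spec`, `fetch_sequence`, `build_fasta`, `write_fasta`), and the choice to put the header on its own line as the FASTA format expects. The sketch uses the standard library's `urllib.request` so that it runs without extra installs; with httpx, the download step could be swapped for `httpx.get(url).text`.

```python
import re
import urllib.request
from pathlib import Path

# Assumed endpoint: UniProt's REST service returning a single entry as FASTA.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_spec(spec):
    """Turn '2xP69905, 3xP68871' into [('P69905', 2), ('P68871', 3)]."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        match = re.fullmatch(r"(\d+)x(\S+)", item)
        if match is None:
            raise ValueError(f"expected '<count>x<accession>', got {item!r}")
        pairs.append((match.group(2), int(match.group(1))))
    return pairs


def sequence_from_fasta(fasta_text):
    """Drop header lines (starting with '>') and join the sequence lines."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequence(accession):
    """Download one entry from UniProt and return its bare sequence."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return sequence_from_fasta(response.read().decode("utf-8"))


def build_fasta(name, spec, fetch=fetch_sequence):
    """Build the FASTA text: one header line, then the ':'-joined sequences."""
    parts = []
    for accession, count in parse_spec(spec):
        # Fetch each accession once and repeat it 'count' times.
        parts.extend([fetch(accession)] * count)
    return f">{name}\n{':'.join(parts)}\n"


def write_fasta(name, spec, directory, fetch=fetch_sequence):
    """Write <directory>/<name>.fasta and return its path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.fasta"
    out_path.write_text(build_fasta(name, spec, fetch))
    return out_path
```

In Colab, you would first mount Google Drive with `from google.colab import drive` and `drive.mount('/content/drive')`, then call something like `write_fasta("protein_sequences", "2xP69905, 1xP68871", "/content/drive/MyDrive/FASTA_directory")`. The accessions in that call are only placeholders. The `fetch` parameter exists so the download step can be replaced, for example with an httpx-based function or a stub for offline testing.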