Exercise: The protein database UniProt
Exercise written by: Henrik Nielsen - updated by Morten Nielsen and Rasmus Wernersson
In this exercise, we shall extract information from the protein database UniProt. This database is maintained jointly by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), England, and Georgetown University, Washington DC, USA.
UniProt, http://www.uniprot.org/, consists of three parts:
- UniProt Knowledge-base (UniProtKB)
- protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
- homology-reduced database, where similar sequences (having a certain percentage identity) are merged into clusters, each with a representative sequence
- UniProt Archive (UniParc)
- an archive containing all versions of UniProt without annotations
Of these databases, the UniProt Knowledge-base is the most useful, and this is the database we shall be using today. The UniProt Knowledge-base consists of two parts:
- Swiss-Prot: a manually annotated (reviewed) protein database.
- TrEMBL: a computer-annotated supplement to Swiss-Prot that contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.
Simple text mining
First, we will find some UniProt entries using simple text mining. You are supposed to find the entry for human insulin.
- Open the UniProt home-page http://www.uniprot.org/
- Type "human insulin" in the search field at the top of the page. Leave the search menu on "Protein Knowledge-base (UniProtKB)", which is the default.
QUESTION 1:
- How many hits do you find?
- How many hits are from Swiss-Prot? (tip: Click on "Show only reviewed")
- Can you identify the correct hit (i.e. see which one is actually human insulin and not something else)?
In this case, it was easy to find the correct hit, but sometimes it is more difficult. If you do not identify the correct hit immediately, it will often help to narrow down the search, and that is exactly what we ask you to do in the next four questions.
The first step is to search for proteins that actually come from the organism "human" and have a name containing the word "insulin", as opposed to entries that merely contain the words "human" and "insulin" somewhere in the description. This is easily done: below the heading "Results" you find two lines that allow you to restrict the search to specific fields.
- On the line Restrict term "human" to, click on organism.
QUESTION 2: How many hits are now left (still only in Swiss-Prot)?
- On the line Restrict term "insulin" to, click on protein name.
QUESTION 3: How many hits are now left (still only in Swiss-Prot)?
Note that all selections made with the mouse are shown in text format in the Query box at the top of the page. You can edit the search criteria manually in this box to make them broader or narrower.
- Try for instance to exclude proteins that are not insulin, but only insulin-like. You do this by adding the following text in the Query box: "NOT name:insulin-like" and clicking on the Search button.
QUESTION 4: How many hits are now left?
- Try now to exclude proteins that are insulin receptors or described as substrates for insulin receptors.
QUESTION 5:
- How did you do this?
- How many hits are now left?
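The narrowing steps above amount to combining field-restricted terms with Boolean operators. The following is a minimal Python sketch of how such a query string could be assembled. The helper `build_query` is hypothetical, and the field names `name` and `organism` are taken from the query text used in this exercise; they are an assumption and may differ from what the current UniProt site accepts.

```python
def build_query(required, excluded=()):
    """Join field:value terms with AND, then append one NOT clause per exclusion.

    `required` and `excluded` are sequences of (field, value) pairs.
    Field names are assumed from the exercise text, not checked against UniProt.
    """
    def term(field, value):
        # Quote values containing spaces so they stay a single search term.
        if " " in value:
            value = f'"{value}"'
        return f"{field}:{value}"

    query = " AND ".join(term(f, v) for f, v in required)
    for f, v in excluded:
        query += f" NOT {term(f, v)}"
    return query


query = build_query(
    required=[("name", "insulin"), ("organism", "human")],
    excluded=[("name", "insulin-like")],
)
print(query)  # name:insulin AND organism:human NOT name:insulin-like
```

Further exclusions can be passed as additional pairs in `excluded`; the resulting text is what you would paste into the Query box.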
The contents of UniProt
We shall now see what information is contained in a UniProt entry, and what further information is available as links in each entry.
- Click on the accession number for insulin (the blue code in the field Entry). This will take you to the insulin entry in the UniProtKB/Swiss-Prot database. Spend some time getting an overview of the page and the information it contains.
- Scroll down to References. Note that each reference indicates what it has contributed ("Cited for"). You can get to the PubMed literature database at NCBI by clicking on the link "PubMed:" for a reference - try this. The abstract of a publication can be read there (or directly at UniProt using the "Abstract" link), provided the work is an actual published article and not a "direct submission".
QUESTION 6:
- How many references are there (not counting "computationally mapped references")?
- Why do you think insulin is such a highly investigated protein?
Read the General annotation (Comments) and have a look at the Ontologies - especially the first section Keywords. Here, you find the general functional and structural annotation of the protein; in General annotation (Comments) it is in (more or less) free text, while in Ontologies it is expressed in a controlled vocabulary (there is a finite number of possible keywords).
One of the most important types of comments is naturally Function - in Ontologies split into Biological process and Molecular function. Another type of comment is Subcellular location - corresponding to Cellular component under Ontologies.
QUESTION 7:
- Where in the cell / outside the cell do you find insulin?
- Why do you think it is found there? (Hint: consider the function)
Scroll down to Sequence annotation (Features). Here, you find those annotations that are coupled to specific parts of the protein. You can click on the Position(s) field for any feature and see the corresponding amino acids highlighted in the sequence (try it!). Note the following:
- Insulin has both a signal peptide and a propeptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under "Sequence information".
QUESTION 8: How long are the signal peptide and the propeptide, respectively?
- Some variants (mutations) of insulin have been described. In some cases it is known what phenotype (variants of diabetes) is associated with each variant.
- Secondary structure is specified as "Helix" (alpha-helix), "Strand" (part of a beta-pleated sheet) or "Turn", coded by three different colours. Try to see what happens when you hover the mouse (without clicking) over the coloured bars.
QUESTION 9: Which positions are in β-sheet conformation in insulin?
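Feature lengths such as those asked for in QUESTION 8 follow directly from the position ranges, since both the first and the last residue are included. A small Python sketch of that arithmetic is given here; the `Feature` class and the coordinates are hypothetical and are deliberately not the real insulin values, which you should read from the entry yourself.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str
    start: int  # first residue of the feature (1-based)
    end: int    # last residue of the feature (inclusive)

    @property
    def length(self) -> int:
        # Both end points belong to the feature, hence the +1.
        return self.end - self.start + 1


# Hypothetical coordinates, for illustration only.
features = [
    Feature("Signal peptide", 1, 20),
    Feature("Chain", 21, 44),
    Feature("Propeptide", 45, 60),
]

for f in features:
    print(f"{f.kind}: {f.start}-{f.end} ({f.length} residues)")
```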
Other databases linked from Swiss-Prot
Now, scroll down to Cross-references in the Swiss-Prot entry. Here, you can find links to other databases. Under Sequence databases you primarily find links to corresponding entries in the nucleotide databases. If you set the radio button on the left to GenBank, you can click on one of the blue GenBank identifiers and see a GenBank entry for the insulin gene (try it!).
To look at the three-dimensional structure of a protein, you must go to yet another database, the PDB, listed under 3D structure databases. We will be working with 3D structures later in the course, but let's have a quick look today as well. As you can see, the 3D structure of insulin has been determined several times. Select one structure marked X-ray under Method and click on the blue identifier under Entry (just leave the radio button on PDBe). Besides a lot of information on the molecule and the experimental procedure used to solve the structure, the page also contains a nice picture of the insulin molecule.
Under Family and domain databases you find a list of databases that group similar proteins into protein families. These have been collected using various techniques that you will hear about later in the course (multiple alignment). In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try clicking on one of the InterPro links. This will take you to the InterPro page with lots of information about the protein family that insulin belongs to.
The UniProt interface allows you to use most of the fields in the database for searching, not only the fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
- Go back to UniProt's website http://www.uniprot.org/. Click on Advanced Search » to the right of the Query box. This brings up a second query line.
- General annotation field: Now we will find out how many proteins are secreted from the cell (just like insulin). Select Subcellular location in the drop-down menu Field (not in the Search in menu). Next type "secreted" in the box Term and click Add & Search.
QUESTION 10: How many proteins do you find?
- Evidence/Confidence: The proteins we find in this way include proteins that are predicted to be secreted, without having experimental evidence for their secretion. We will now limit the search to experimentally confirmed secreted proteins. Clear the previous search by clicking the Clear-button, then do as before, but change the Confidence menu to Experimental.
QUESTION 11: How many proteins do you find now?
- Combining fields: How many secreted proteins are found in humans? Click on Advanced Search again, leave the menu to the left on AND, select Organism [OS] under Field, type "human" in the box Term, accept the suggestion "Human " and click Add & Search.
QUESTION 12: How many proteins do you find now? (Note again here how you can perform the search by editing the text in the Query box - however to do this you need to know the names for the fields).
Important note about the organism field: when you type some letters, a drop-down list with suggestions will come up. Each has a number in brackets: this is the TaxID, which you can also find in the NCBI Taxonomy Browser. If you search for e.g. human proteins, it is a good idea to include the TaxID; if you omit it and just write "human", you will also find proteins from organisms like Human immunodeficiency virus (try it!). On the other hand, if you search for proteins from a bacterial species, it is better to omit the TaxID, because each strain has its own TaxID, and you probably want all possible strains.
QUESTION 13: (Clear the previous search by clicking the Clear-button). How many proteins are there in UniProt from Bacillus subtilis with the default TaxID ? How many are there from Bacillus subtilis in total (all strains and subspecies)?
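The difference between matching on the organism name and matching on the TaxID can be illustrated with a toy lookup in Python. This is only a sketch of the principle described in the note above, not of how UniProt implements its search. The organism list is a hand-made example, and the TaxIDs are given as found in the NCBI Taxonomy Browser to the best of our knowledge, so please verify them there.

```python
# Toy records: (organism name, NCBI TaxID). Hand-made list for illustration.
ORGANISMS = [
    ("Homo sapiens (Human)", 9606),
    ("Human immunodeficiency virus 1", 11676),
    ("Bacillus subtilis", 1423),
    ("Bacillus subtilis (strain 168)", 224308),
]


def by_name(text):
    """Case-insensitive substring match on the organism name."""
    return [name for name, _ in ORGANISMS if text.lower() in name.lower()]


def by_taxid(taxid):
    """Exact match on the TaxID."""
    return [name for name, tid in ORGANISMS if tid == taxid]


print(by_name("human"))              # also matches the virus
print(by_taxid(9606))                # only Homo sapiens
print(by_name("bacillus subtilis"))  # species and strain
print(by_taxid(1423))                # species-level entry only
```

The name search is broad (useful for collecting all strains of a bacterium), while the TaxID search is exact (useful for excluding viruses that merely have "Human" in their name).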
- Numerical field: Now we will try to answer a completely different question: Which extremely short proteins are present in UniProt? Clear the previous search by clicking the Clear-button. Under Field, select Sequence length. Now two new fields appear where you can define the lower and upper limits for the search. Type "1" and "10" and search.
QUESTION 14: How many proteins of maximum length 10 do you find?
- Extremely short proteins are often mistakes translated directly from a nucleotide sequence with no evidence for the sequences being protein coding. Limit your search to proteins that actually have evidence for their existence at the protein level (found under Field as Protein existence [PE]).
QUESTION 15: How many proteins are now left?
- A large fraction of the proteins identified in this way are fragments. Try to exclude fragments from the search. Set Field to Fragment (yes/no), select no and search.
QUESTION 16: How many proteins are now left?
- And as the final question, how many of these proteins are found in humans? Do as before...
QUESTION 17: How many human non-fragment proteins of maximum length 10 do you find in UniProt?
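The searches in QUESTIONS 14 to 17 apply one extra restriction at a time, so the number of hits can only stay the same or shrink at each step. The Python sketch below mimics this on a handful of invented records. The `Entry` class, the accession codes and all values are hypothetical, and the real counts must come from UniProt itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str               # invented identifiers, not real UniProt entries
    organism: str
    length: int
    protein_level_evidence: bool
    fragment: bool


ENTRIES = [
    Entry("X0001", "Human", 8, True, False),
    Entry("X0002", "Human", 9, True, True),
    Entry("X0003", "Mouse", 10, True, False),
    Entry("X0004", "Human", 7, False, False),
    Entry("X0005", "Human", 110, True, False),
    Entry("X0006", "Yeast", 5, False, True),
]

# The same restrictions as in the exercise, in the same order.
FILTERS = [
    ("length 1-10", lambda e: 1 <= e.length <= 10),
    ("evidence at protein level", lambda e: e.protein_level_evidence),
    ("not a fragment", lambda e: not e.fragment),
    ("human", lambda e: e.organism == "Human"),
]


def apply_filters(entries, filters):
    """Apply each filter in turn and record how many entries remain."""
    remaining = list(entries)
    counts = []
    for label, keep in filters:
        remaining = [e for e in remaining if keep(e)]
        counts.append((label, len(remaining)))
    return remaining, counts


hits, counts = apply_filters(ENTRIES, FILTERS)
for label, n in counts:
    print(f"{label}: {n} left")
```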
- Finally you can save the results of your search. Click on the orange Download... button. You can now save the search results in the format you prefer (try FASTA!).
QUESTION 18: Copy the FASTA sequences to your report.
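Once the results are saved in FASTA format, they are easy to process further. As a minimal sketch, the Python function below reads FASTA text into (header, sequence) pairs and reports each sequence length. The two records are invented for illustration (hypothetical accessions and sequences); replace `EXAMPLE` with the contents of your downloaded file.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented records for illustration only.
EXAMPLE = """\
>sp|X00001|PEP1_HUMAN Hypothetical peptide 1
GLSDGEWQ
>sp|X00002|PEP2_HUMAN Hypothetical peptide 2
MKTAY
IAKQR
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(f"{header.split()[0]}: {len(sequence)} residues")
```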