Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today's lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today's exercise, you shall use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to:
- Identify relationships between proteins with low sequence similarity
- Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein)
- NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
- Blast2logo is a tool for visualization of protein sequence profiles and identification of conserved residues.
When BLAST fails
Say you have a sequence Query and you want to make predictions about its function and structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Trying another approach
- QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)? (Tip: you can see the number by selecting all significant hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)
- QUESTION 3: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical match)?
- QUESTION 4: Do you find any PDB hits among the significant hits? (Tip: look for a PDB identifier in the Accession column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as "1XYZ_A")
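The tip in Question 4 can be turned into a quick check. The following is a minimal sketch, assuming you have copied accessions and E-values from the results table into a list by hand. The `hits` entries and the function name are made-up placeholders for illustration, not real search results.

```python
import re

# Pattern for the PDB accession format described in the tip:
# one digit, three letters or digits, an underscore, and a chain name.
# Chain names are usually a single letter, but longer ones are allowed here.
PDB_ACCESSION = re.compile(r"^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+$")

E_VALUE_THRESHOLD = 0.005  # threshold used in the questions


def significant_pdb_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Return accessions that are significant and look like PDB entries."""
    return [
        accession
        for accession, e_value in hits
        if e_value < threshold and PDB_ACCESSION.match(accession)
    ]


# Hypothetical (accession, E-value) pairs for illustration only.
hits = [
    ("1XYZ_A", 1e-30),
    ("WP_000000001.1", 2e-12),
    ("2ABC_B", 0.8),
    ("XP_000000002.1", 3.0),
]

print(significant_pdb_hits(hits))
```

Only entries that pass both the E-value threshold and the accession pattern are returned, so a PDB entry with a poor E-value is not counted.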
Constructing the PSSM
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the Go button at Run PSI-Blast iteration 2.
- QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)?
- QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
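To make the idea of a PSSM concrete, here is a minimal sketch of how position-specific log-odds scores can be derived from the columns of a multiple alignment. It is a simplification under stated assumptions: uniform background frequencies, a flat pseudocount, and no sequence weighting, whereas PSI-BLAST uses more elaborate schemes. The toy alignment and the function name `build_pssm` are illustrative only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    pssm = []
    for column in zip(*alignment):
        # Count only standard amino acids; gaps and other symbols are skipped.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(frequency / BACKGROUND)
        pssm.append(scores)
    return pssm


# Toy alignment: column 0 is fully conserved, column 1 is variable.
alignment = ["HA", "HG", "HL", "HK"]
pssm = build_pssm(alignment)

print(round(pssm[0]["H"], 2))  # residue in a conserved column: high score
print(round(pssm[1]["A"], 2))  # residue seen once in a variable column: lower
print(round(pssm[0]["A"], 2))  # residue never seen in the column: negative
```

The point of the sketch is that the score of an amino acid depends on the position: the same residue can be rewarded at one column and penalized at another, which a single substitution matrix cannot express.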
Saving and reusing the PSSM
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.
Go to the top of the PSI-BLAST output page and click Download, then click ASN.1 under "PssmWithParameters". Save the file to a place on your computer where you can find it again! You can take a look at this file using jEdit, but it is really not meant to be human-readable.
Then, open a new BLAST window (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select pdb as the database. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. Note: You don't have to paste the query sequence again, it is stored in the PSSM!
- QUESTION 8: Do you find any significant PDB hits now? If yes, how many?
- QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?
- QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (Tip: click on the description to get to the actual alignment between the query sequence and the PDB hit)
- QUESTION 11: What is the function of these proteins?
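The quantities asked for in Question 10 can be reproduced by hand from a single pairwise alignment. The sketch below assumes one alignment segment, with identity taken over the alignment length and coverage taken as the aligned query residues over the query length. Positives are left out because they require a substitution matrix such as BLOSUM62. The example rows and numbers are invented for illustration.

```python
def alignment_stats(aligned_query, aligned_subject, query_start, query_length):
    """Compute percent identity and query coverage for one pairwise alignment.

    aligned_query and aligned_subject are the two alignment rows ('-' for
    gaps); query_start is the 1-based position of the first aligned query
    residue.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("alignment rows must have equal length")

    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    identity = 100.0 * identical / len(aligned_query)

    # Query residues covered = alignment columns that are not gaps in the query.
    query_residues = sum(1 for q in aligned_query if q != "-")
    query_end = query_start + query_residues - 1
    coverage = 100.0 * query_residues / query_length
    return identity, coverage, query_end


# Invented example: a 7-column alignment starting at query position 11
# of a hypothetical 60-residue query.
identity, coverage, query_end = alignment_stats("MKT-LLV", "MRTALIV", 11, 60)
print(f"identity {identity:.1f}%, coverage {coverage:.1f}%, ends at {query_end}")
```

When a hit consists of several alignment segments, the coverage reported on the results page combines them, so this single-segment calculation can give a lower value.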
One more round
Let's try one more iteration of PSI-BLAST:
- Go back to your first BLAST window (the one with the results from the nr database) and press the Go button at Run PSI-Blast iteration 3.
- Save the resulting PSSM file (make sure you give it a different name!).
- Launch a new PSI-BLAST search against pdb using this PSSM (you may have to click on Clear to erase your first PSSM file from the server).
- QUESTION 12: Answer questions 8-10 again for the new search.
Identifying conserved residues
You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead, one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).
- (a): H271
- (b): R287
- (c): E290
- (d): Y334
- (e): F371
- (f): R379
- (g): R400
- (h): Y436
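Before looking up these positions in a logo, it can be worth checking that the numbering matches your copy of the sequence. The following is a small sketch, assuming 1-based numbering from the first residue of the query; the helper names and the toy sequence are hypothetical and stand in for the real Query sequence.

```python
import re

CANDIDATES = ["H271", "R287", "E290", "Y334", "F371", "R379", "R400", "Y436"]


def parse_residue(label):
    """Split a label such as 'H271' into ('H', 271)."""
    match = re.fullmatch(r"([A-Z])(\d+)", label)
    if match is None:
        raise ValueError(f"not a residue label: {label!r}")
    return match.group(1), int(match.group(2))


def mismatches(sequence, labels):
    """Return labels whose amino acid does not match the 1-based position."""
    wrong = []
    for label in labels:
        amino_acid, position = parse_residue(label)
        if position > len(sequence) or sequence[position - 1] != amino_acid:
            wrong.append(label)
    return wrong


# Toy sequence for illustration; replace it with the real Query sequence.
toy_sequence = "MKHTR"
print(mismatches(toy_sequence, ["H3", "R5", "Y2"]))
print([parse_residue(label) for label in CANDIDATES[:2]])
```

An empty list from `mismatches` means every label agrees with the sequence, so the positions can be read directly off the logo.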
You shall use the Blast2logo server to identify which residues are conserved in the Query protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the BLAST database to NR70 and press submit (note that it may take 5-10 minutes before your job is completed). If the job does not complete, you can find the output by following this link: Blast2logo output.
When the job is completed you should see the logo plot on the website. To improve the readability of the logo you can click on Customize visualization using Seq2Logo. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.
- QUESTION 13: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query sequence did the BLAST search cover)?
- QUESTION 14: Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
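The height of a column in a sequence logo reflects how conserved that position is. As a rough illustration, the sketch below computes the classic Shannon information content of an alignment column and ranks positions by it. This is an assumption about the general principle, not a description of what Blast2logo or Seq2Logo compute with their default settings, and the columns are made up.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Shannon information content of one alignment column, in bits."""
    residues = [aa for aa in column if aa != "-"]
    total = len(residues)
    if total == 0:
        return 0.0  # a column with only gaps carries no information
    counts = Counter(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


def rank_positions(columns_by_label):
    """Sort position labels from most to least conserved."""
    return sorted(
        columns_by_label,
        key=lambda label: information_content(columns_by_label[label]),
        reverse=True,
    )


# Made-up alignment columns for three hypothetical positions.
toy_columns = {
    "pos_a": "HHHHHHHH",  # fully conserved
    "pos_b": "RRRRKKKK",  # two residue types
    "pos_c": "ACDEFGHI",  # highly variable
}
print(rank_positions(toy_columns))
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with many different residues falls towards zero, which is why flat regions of a logo indicate little or no conservation signal.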
Homology modelling (optional)
You shall use the CPHmodels server to check whether the four most conserved residues from Question 14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web site and upload the Query sequence. Note that it may take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: CPHmodels output
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.
- QUESTION 15: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in PyMOL. If you do not have PyMOL installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from Question 14 on the structure.
- QUESTION 16: Could the residues form an active site?
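If you prefer typing commands in PyMOL rather than clicking, the residues can be highlighted with a few lines in the PyMOL command line. The sketch below only builds the command strings; the helper name and the residue numbers are placeholders, and it assumes that the residue numbering in the model file matches the numbering of the query sequence, which you should verify.

```python
def pymol_commands(positions, selection_name="candidates"):
    """Build PyMOL commands that highlight the given residue numbers."""
    residue_list = "+".join(str(p) for p in sorted(positions))
    return [
        f"select {selection_name}, resi {residue_list}",
        f"show sticks, {selection_name}",
        f"color red, {selection_name}",
        f"zoom {selection_name}",
    ]


# Placeholder positions; substitute the four residues you chose.
for command in pymol_commands([40, 10, 55, 25]):
    print(command)
```

Paste the printed lines into the PyMOL command line after loading query.pdb. Showing the side chains as sticks makes it easier to judge whether the residues lie close together in space.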
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.