Written by: Morten Nielsen - editing by Rasmus Wernersson
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today's lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today's exercise, you shall use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to:
- Identify relationships between proteins with low sequence similarity
- Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein)
- NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
- Blast2logo is a tool for visualization of protein sequence profiles and identification of conserved residues.
First part. When BLAST fails
Say you have a sequence Query and you want to make predictions about its function and structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?
IMPORTANT! - NCBI recently changed the results format to a newer, more streamlined version. This means that some of the features we are looking for in this exercise will be hidden. When you have obtained your first BLAST results, follow the 4 steps shown in the picture below. Make sure that the option "linkout" is checked. Remember to click Reformat (step 4).
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Now go back to the BLAST web-site. Paste in the query sequence Query. Set the database to nr, select PSI-BLAST (Position-Specific Iterated BLAST), change the Algorithm parameters Max target sequences to 5000 and press Blast.
- QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)?
- QUESTION 3: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical matches)?
- QUESTION 4: Do you find any PDB hits among the significant hits (look for the colored S to the right of the E-value)?
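The counting in the questions above can also be done programmatically once the hit table has been exported. The following is a minimal sketch, not part of the original exercise: the `Hit` record, the subject identifiers, the coordinates, the E-values and the query length of 500 are all made up for illustration. It treats each hit as a single alignment, whereas the Query coverage column on the NCBI page may combine several aligned segments per subject.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.005  # significance threshold used in the questions


@dataclass
class Hit:
    """One row of a hit table (hypothetical structure, for illustration only)."""
    subject_id: str  # made-up identifier
    q_start: int     # first aligned query position (1-based)
    q_end: int       # last aligned query position (inclusive)
    e_value: float


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Return the hits whose E-value is below the cutoff."""
    return [h for h in hits if h.e_value < cutoff]


def query_coverage(hit, query_length):
    """Fraction of the query covered by a single alignment."""
    return (hit.q_end - hit.q_start + 1) / query_length


# Made-up example data; replace with the values from your own search.
QUERY_LENGTH = 500
hits = [
    Hit("subject_A", 1, 500, 1e-120),
    Hit("subject_B", 150, 400, 2e-6),
    Hit("subject_C", 300, 350, 0.8),
]

sig = significant_hits(hits)
for h in sig:
    print(h.subject_id, h.e_value, f"{query_coverage(h, QUERY_LENGTH):.0%}")
```

Applying the same two functions to the hit table after each PSI-BLAST iteration makes it easy to compare how the number of significant hits and their coverage change.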
Now run a second BLAST iteration. Press Go at Run PSI-Blast iteration 2.
- QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)? Just give a rough estimate.
- QUESTION 6: How large a fraction of the query sequence do the significant hits match (do not include the first hit since this is identical to the query)?
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
- QUESTION 8: Do you find any PDB hits among the significant hits (look for the red colored S to the right of the E-value)?
If you did not find a PDB hit among the significant hits, run a third BLAST iteration.
- QUESTION 9: What is the PDB identifier (a 4 letter code followed by a single letter chain name) for the best PDB hit?
- QUESTION 10: What is the sequence similarity (Identity) between the query and this PDB hit? Click on the alignment score (the link in the Max score column) to get to the actual alignment of the query sequence to the PDB hit. If the alignment is not shown, go to Formatting options at the very top of the page, set Alignments to 5000 and press Reformat.
- QUESTION 11: What is the function of this protein?
Identifying conserved residues
You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.
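To make the idea of conservation concrete, here is a minimal sketch that scores each column of a multiple alignment by its Shannon information content (log2(20) minus the column entropy, in bits). This is a simplification and not the method used by PSI-BLAST or Blast2logo, which also apply sequence weighting, pseudocounts and background frequencies. The toy alignment is made up, and with so few sequences the score of variable columns is overestimated.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column; gaps are ignored."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0  # a column with no aligned residues carries no information
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def conservation_profile(alignment):
    """Information content for every column of equal-length aligned sequences."""
    return [column_information(col) for col in zip(*alignment)]


# Made-up toy alignment, for illustration only.
alignment = [
    "HRAEY",
    "HKGEY",
    "HRSEF",
    "HQTEY",
]

profile = conservation_profile(alignment)
for position, bits in enumerate(profile, start=1):
    print(position, f"{bits:.2f}")
```

Columns where every sequence has the same residue reach the maximum of log2(20) bits, which corresponds to a tall stack in a logo plot. Columns with many different residues, or with no aligned sequences at all, score low and appear flat.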
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).
- (a): H271
- (b): R287
- (c): E290
- (d): Y334
- (e): F371
- (f): R379
- (g): R400
- (h): Y436
You shall use the Blast2logo server to identify which residues are conserved in the Query protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the BLAST database to NR70, and press submit (note that it might take 5-10 minutes before your job is completed). If the job does not complete, you can find the output by following this link: Blast2logo output.
When the job is completed you should see the logo plot on the website. To improve the readability of the logo you can click on Customize visualization using Seq2Logo. Doing this transfers you to the Seq2Logo web server. Here, just leave all options at their defaults, and press submit.
- QUESTION 12.1: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query sequence did the BLAST search cover)?
- QUESTION 12.2: Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
You shall use the CPHmodels server to validate whether the structural properties of the four most conserved residues from question Q12 could indeed form an active site. CPHmodels is a program for protein homology modeling. It takes a protein amino acid sequence as input and produces a 3D protein structure as output, based on single-template homology modeling. Go to the CPHmodels web-site and upload the Query sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: CPHmodels output
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.
- QUESTION 13: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q12 on the structure.
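If you prefer a script to clicking in the PyMOL interface, the sketch below generates a short PyMOL command script that highlights a set of residues. It is an illustration only and makes some assumptions: the model file is named query.pdb as in the download link, the residue numbering in the model matches the numbering of the Query sequence, and the selection name `candidates` is arbitrary. All eight candidate residues are used here; pass your own four from Q12 instead.

```python
import re

# The eight candidate residues listed earlier in the exercise.
CANDIDATES = ["H271", "R287", "E290", "Y334", "F371", "R379", "R400", "Y436"]


def residue_number(label):
    """Extract the position from a label such as 'H271' (one letter + number)."""
    match = re.fullmatch(r"([A-Z])(\d+)", label)
    if match is None:
        raise ValueError(f"Unexpected residue label: {label!r}")
    return int(match.group(2))


def pymol_script(labels, model_file="query.pdb", selection_name="candidates"):
    """Build PyMOL commands that show the given residues as red sticks.

    Assumes the residue numbers in the model file match the query numbering.
    """
    numbers = "+".join(str(residue_number(label)) for label in labels)
    return "\n".join([
        f"load {model_file}",
        "hide everything",
        "show cartoon",
        f"select {selection_name}, resi {numbers}",
        f"show sticks, {selection_name}",
        f"color red, {selection_name}",
    ])


script = pymol_script(CANDIDATES)
print(script)
```

The printed commands can be typed into the PyMOL command line one by one, or saved to a text file and run as a script from within PyMOL.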
- QUESTION 14: Could the residues form an active site?
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.