Answers to the BLAST exercise, by Henrik Nielsen. Values for database sizes etc. retrieved March 17, 2014.
Part 2: Assessing the statistical significance of BLAST hits
Report the three sequences in FASTA format:
>seq_01 CGCCCGACCGTGTAGGAGGTCCGGT >seq_02 TTCCAAGTAGGTTATGTTAAAGCCG >seq_03 CGGTTAATTCGCTGTCTATATAAGG
(of course your particular sequences will not be identical to these)
- Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).
- There will typically be several 100% identity hits, e.g.:
GENE ID: 5438907 BC1G_02961 | hypothetical protein [Botryotinia fuckeliana B05.10] Score = 33.7 bits (36), Expect = 6.3 Identities = 18/18 (100%), Gaps = 0/18 (0%) Strand=Plus/Minus Query 1 CGCCCGACCGTGTAGGAG 18 |||||||||||||||||| Sbjct 574 CGCCCGACCGTGTAGGAG 557
- What is the typical length of the hits (the alignment length)?:
- Typically around 17-22 base pairs.
- What is the typical % identity?:
- 90% - 100%
- In what range are the bit-scores ("max score)?:
- typically 30-35 bits.
- What is the range of the E-values?:
- varying from 1 to 50 (occasionally, you might find hits as "good" as 0.1).
- Note: we chose to use an E-value threshold of 50.0. The default is 10.0.
- What is the score for a match/mismatch and Gaps (hint: see search summary)?
- What is the bit-score for the two alignments shown below?
What is the biological significance of the hits you found / is there any biological meaning?:
This makes absolutely NO biological sense(!) The hits are real enough as such, they represent sequences that actually are in the database. But we know that our query sequences are completely random and therefore have no evolutionary relationship with the hits. The only reason we found our hits is that the database is so vast that we for for purely stochastic reasons happen upon sequences that are similar.
The E-values tell us precisely this: As described in the BLAST lecture, the alignment score will follow an extreme value distribution for those sequences that are not related to our query sequences, and the E-value is the expected number of spurious (unrelated) hits with the given alignment score or better, given the database size.
Note: Don't be confused by the difference between alignment score and bit score; bit score is simply the alignment score normalized by a constant factor which gives a result expressible in bits.
Report the sequences in FASTA format:
>seq_01 aymgsplsflshdhncirdakyvvs >seq_02 danshrsqwnrlgmanwfqhwsewv >seq_03 vsltdgpdkhesqhirqdkyfddia
(again, your particular sequences will of course differ from these).
- How big is the database this time? :
- Number of letters (amino acids): 13,453,102,810; Number of sequences: 37,854,583
- What is the typical length of the alignment and do they contain gaps?:
- Typically 15-22. Rarely gaps, but several mismatches.
- What is the range of E-values?:
- Typically 100-1000
- Try to inspect a few of the alignments in details ("+" means similar) - do you find any that look plausible, if we for a moment ignore the length/E-value?
- Yes, maybe. See e.g. the alignment below, it has 77% identities (but it is way too short to be significant, as the E-value tells us).
transcription initiation factor IIB [Picrophilus torridus DSM 9790] Sequence ID: ref|YP_024118.1|Length: 304Number of Matches: 1 Range 1: 58 to 70 Alignment statistics for match #1 Score Expect Method Identities Positives Gaps 26.2 bits(56) 585 Composition-based stats. 10/13(77%) 11/13(84%) 0/13(0%) Query 1 AYMGSPLSFLSHD 13 A GSP+SFLSHD Sbjct 58 ARTGSPMSFLSHD 70
- If we had used the default E-value cutoff of 10 would any hits have been found?:
- No. Note the difference from the nucleotide database searches: if we had run BLASTN with an E-value threshold of 1000, we would have had many pages of hits for each query sequence.
- If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?:
- The risk of getting a false hit (an unrelated sequence with a "decent" E-value) is much larger when working with DNA sequences. Remember than we used 50 as E-value cut-off for BLASTN, while we used 1000 with BLASTP in order to see any hits at all.
Part 3: using BLAST to transfer functional information by finding homologs
some numbers have changed since the databases change
- Do we get any significant hits?
- Yes, there are 11 hits with an E-value of "0.0" (i.e. so small that is is rounded to zero) — and the next hits are also extremely significant. The first hit (S48754.1) furthermore has a query coverage of 100% and an identity of 100% (this is actually the source of our query).
- What kind of genes (function) do we find?
- All the high-quality hits are alkaline serine proteases from the genus Bacillus — except some hits that are whole genome sequences.
Note 1: remember to use the ORF Finder in Virtual Ribosome! Since we are told the sequence is a full-length transcript, we can assume that the START and STOP codons are included and set the ORF finder to "Start codon: Any" (in this case, it would have given the same result to use "Start codon: Strict").
Note 2: you can choose the standard genetic code (Table 1) or alternatively Table 11 (Bacterial and Plant Plastid). The only difference is that Table 11 allows some extra, rarely occurring, start codons.
- Report your translated protein sequence in FASTA format.:
>Unknown_transcript01_rframe2_ORF MKKPLGKIVASTALLISVAFSSSIASAAEEAKEKYLIGFNEQEAVSEFVEQVEANDEVAI LSEEEEVEIELLHEFETIPVLSVELSPEDVDALELDPAISYIEEDAEVTTMAQSVPWGIS RVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGT IAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLEWAGNNGMHVANLSLGSPSP SATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAG LDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNHLKNTAT SLGSTNLYGSGLVNAEAATR
- Do we find any conserved protein domains? (Indicated at the very top of the result page, and during the search).:
- Yes, there is a "Peptidase S8" domain and an "Inhibitor_I9" domain. You can see them by clicking Show Conserved Domains near the top of the BLAST results page.
- Do we find any significant hits? (E-value?):
- Yes, a lot. The first 22 hits have an E-value of 0.0, and hit #100 is still very significant (6e-112) — note that by default, only the top 100 hits are shown!
- Are all the best hits the same category of enzymes?:
- Yes, they are alkaline proteases (except a few that are hypothetical proteins).
- Note that you can click the Accession code for a hit and go directly to the corresponding entry in the database.
- From what you have seen, what is best for identifying intermediate quality hits - DNA or Protein BLAST?:
- Protein BLAST (BLASTP). If you have very high quality hits, they can be identified by both methods, but if the evolutionary distance is larger, BLASTP is clearly better.
STEP 1 - cleaning up the sequence:
- Subquestion: convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.
>CLONE12 AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC GGCGCCGCC
STEP 2 - thinking about the task:
- Subquestion: Give a summary of your considerations.
- Based on the information given: is the sequence protein-coding?
- Yes — we know this because the PCR primers used to clone the sequence target known enzymes. Therefore, it will make sense to try to translate the sequence using Virtual Ribosome.
- If it is, can you trust it will contain both a START and STOP codon?
- No — the PCR primers used to clone the sequence target the middle of the sequence, in other words we must assume that our sequence is a fragment. Therefore, the ORF finder in Virtual Ribosome should be set to Start codon: None.
- Do we know if the sequence is sense or anti-sense?
- No — the PCR process amplifies a stretch of double-stranded DNA. Therefore, we should let Virtual Ribosome search in all 6 reading frames.
STEP 3 - Performing the database search: We want to use BLAST to search the large databases. Let's therefore try the following:
- Translate to protein (using Virtual Ribosome).
Both when doing BLASTN and BLASTP we will use the NR database in order to search as broadly as possible. It would not make sense to use an organism-specific database when we don't know which organism our sequence stems from.
1) BLASTN. When trying BLASTN against NR we don't get any significant results at all. Observe in particular how small the query coverage percentages are.
There is simply nothing in the entire NR database that has enough similarity to our query sequence. A search on the DNA level is only suited for finding very close hits.
2) Translate using Virtual Ribosome with the settings we chose under Step 2 above.
The result from the ORF finder:
VIRTUAL RIBOSOME ---------------- Translation table: Standard SGC0 >CLONE12_rframe1_ORF Reading frame: 1 N G H G T H V A G T V A A V N N N G I G V A G V A G G N G S 5' AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGAGTTGCCGGGGTTGCAGGAGGAAACGGCTCT 90 .......................................................................................... T N S G A R L M S T Q I F N S D G D Y T N S E T L V Y R A I 5' ACCAATAGTGGAGCAAGGTTAATGTCCACACAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT 180 .....................>>>.................................................................. V Y G A D N G A V I S Q N S W G S Q S L T I K E L Q K A A I 5' GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTGACTATTAAGGAGTTGCAGAAAGCTGCGATC 270 .........................................................)))............)))............... D Y F I D Y A G M D E T G E I Q T G P M R G G I F I A A A G 5' GACTATTTCATTGATTATGCAGGAATGGACGAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA 360 ........................>>>..............................>>>.............................. N D N V S T P N M P S A Y E R V L A V A S M G P D F T K A S 5' AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCCTCAATGGGACCAGATTTTACTAAGGCAAGC 450 ........................>>>....................................>>>........................ Y S T F G T W T D I T A P G G D I D K F D L S E Y G V L S T 5' TATAGCACTTTTGGAACATGGACTGATATTACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT 540 ...............................................................)))........................ Y A D N Y Y A Y G E G T S M A C P H V A G A A 5' TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCCGGCGCCGCC 609 .......................................>>>...........................
(Tip: Remember that you can get the sequence in FASTA format via the FASTA link on the result page):
>CLONE12_rframe1_ORF NGHGTHVAGTVAAVNNNGIGVAGVAGGNGSTNSGARLMSTQIFNSDGDYTNSETLVYRAI VYGADNGAVISQNSWGSQSLTIKELQKAAIDYFIDYAGMDETGEIQTGPMRGGIFIAAAG NDNVSTPNMPSAYERVLAVASMGPDFTKASYSTFGTWTDITAPGGDIDKFDLSEYGVLST YADNYYAYGEGTSMACPHVAGAA
The first thing we notice (already while the search is running) is that there is a "Peptidases_S8_S53" domain. This is a very strong indicator of the function.
We get several significant hits. When looking at the top hits and disregarding "hypothetical" and "uncharacterized" proteins, we can see that the rest are almost all serine proteases. Some of them are described as belonging to the of the S8/S53 family.
Let's take a closer look at the best characterised hit:
>ref|WP_006953704.1| peptidase [Prevotella micans] Length=922 Score = 207 bits (526), Expect = 2e-58, Method: Compositional matrix adjust. Identities = 117/211 (55%), Positives = 145/211 (69%), Gaps = 14/211 (7%) Query 2 GHGTHVAGTVAAVNNNGIGVAGVAGGNGSTNSGARLMSTQIFNSDGDYTNSETLVYRAIV 61 GHGTHVAGTVAA NNNG+GVAG+AGG+GSTNSG RL+S QIF + ++E AI Sbjct 279 GHGTHVAGTVAARNNNGLGVAGIAGGDGSTNSGVRLLSCQIFRKSKEEGSAEA----AIK 334 Query 62 YGADNGAVISQNSWGSQSL-TIKELQKA---AIDYFIDYAGMDETGEIQT-GPMRGGIFI 116 Y ADNGAVI+Q SWG S +KEL K+ AIDYFI +AG D G ++ PM+GG+ I Sbjct 335 YAADNGAVIAQCSWGYASKENVKELPKSLKEAIDYFITFAGCDAHGAQRSDSPMKGGVMI 394 Query 117 AAAGNDNVSTPNMPSAYERVLAVASMGPDFTKASYSTFGTWTDITAPGGDIDKFDLSEYG 176 AAGN+N++ P+AYE+V++VAS +F KASYS + W I+APGGD D F L + G Sbjct 395 FAAGNENMNFKEFPAAYEKVISVASTAWNFQKASYSNYADWVSISAPGGDQDAFGL-KAG 453 Query 177 VLSTYADNY----YAYGEGTSMACPHVAGAA 203 VLST Y Y +GTSMACPHV+G A Sbjct 454 VLSTMPKKIASSGYGYMQGTSMACPHVSGIA 484
Note that although it is not a perfect hit (our query sequence not existing in the database) it looks reasonable: the alignment covers the entire query with Indentity of 55% and Similarity (Positives) of 69%.
Taken together with the fact that almost all the best non-hypothetical hits are serine proteases, we have a very strong indication that our mystery sequence, CLONE12, is a peptidase or protease of the S8/S53 superfamily.
Part 4: BLAST'ing Genomes
What information is given about the relationship between this gene and the gene "HTA1"?
They are nearly identical ("one of two nearly identical (see also HTA1) subtypes").
>YBL003C MSGGKGGKAGSAAKASQSRSAKAGLTFPVGRVHRLLRRGNYAQRIGSGAPVYLTAVLEYL AAEILELAGNAARDNKKTRIIPRHLQLAIRNDDELNKLLGNVTIAQGGVLPNIHQNLLPK KSAKTAKASQEL*
- How many high-confidence hits do we get?:
- 3 — HTA1, HTA2 and HTZ1.
- Do the hits make sense, from what you have read about HTA2 at the SGD webpage?:
- Yes; HTA1 and HTA2 are indeed nearly identical (only 2 amino acids differ).
- How many high-confidence hits (with E-value better than 10-10) are found?
- Answer: 30.