

Exercise: BLAST

ABBREVIATED VERSION Full version is here: ExBlastFull

Exercise written by: Rasmus Wernersson


In this exercise we will be using BLAST (Basic Local Alignment Search Tool) for searching sequence databases such as GenBank (DNA data) and UniProt (protein). When using BLAST for sequence searches it is of utmost importance to be able to critically evaluate the statistical significance of the results returned.

The BLAST software package is free to use (Open Source) and can be installed on any local system - it was originally written for UNIX type Operating Systems. The package contains both programs for performing the actual sequence searches against preexisting databases (e.g. "blastn" for DNA databases and "blastp" for protein databases), as well as a tool for creating new databases from scratch (the "formatdb" program).

In this exercise we will be using the Web-interface to BLAST hosted by the NCBI. For our purpose there are several advantages to this approach:

  • We don't have to mess around with a UNIX command prompt.
  • NCBI offers direct access to preformatted BLAST databases of all the data that they host:
    • GenBank (+ derivatives)
    • Full Genome database
    • Protein database (Both from translated GenBank and UniProt)

It should be noted that running BLAST locally (for example at the super-computer cluster at CBS/DTU) offers much more fine-grained control of DATA and workflow (everything can be scripted/automated) than running BLAST through a web-interface.
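As a rough illustration of what "scripted" means here, the sketch below builds a command line for the blastn program from the NCBI BLAST+ suite. It assumes BLAST+ is installed and that a local database is available; the file name random_seqs.fasta, the output name hits.tsv and the database name nt are placeholders chosen for this example, not part of the exercise.

```python
import shutil
import subprocess

def build_blastn_command(query_fasta, database, evalue=50, out_file="hits.tsv"):
    """Return the argument list for a BLAST+ blastn search with tabular output."""
    return [
        "blastn",
        "-task", "blastn",       # classic blastn rather than megablast
        "-query", query_fasta,   # FASTA file with the query sequences
        "-db", database,         # name of a preformatted BLAST database
        "-evalue", str(evalue),  # E-value cut-off
        "-outfmt", "6",          # tab-separated hit table
        "-out", out_file,
    ]

def run_blastn(query_fasta, database, **options):
    """Run the search if blastn is installed; return True if it was run."""
    if shutil.which("blastn") is None:
        return False
    subprocess.run(build_blastn_command(query_fasta, database, **options), check=True)
    return True

if __name__ == "__main__":
    # Placeholder file and database names (assumptions for illustration).
    print(" ".join(build_blastn_command("random_seqs.fasta", "nt")))
```

Because the command is just a list of strings, the same function can be called in a loop over many query files, which is the kind of automation a web interface cannot offer.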



Part 1 - Assessing the statistical significance of BLAST hits

As discussed in the lecture, there will be a risk of getting false positive results (hits to sequences that are not related to our input sequence) by purely stochastic means. In this first part of the exercise we will be investigating this further, by examining what happens when we submit randomly generated sequences to BLAST searches.

Rather than giving out a set of pre-generated DNA/Peptide sequences where you only have my word for their randomness, you'll be generating your own random sequences by throwing dice - four-sided dice ("d4") for DNA sequences and twenty-sided dice ("d20") for protein sequences (see the tables below). If you do this exercise at home, and don't have access to d4/d20 dice, you can use this online dice roller instead: http://www.wizards.com/dnd/dice/dice.htm (it's built for playing Dungeons and Dragons - but can be used for rolling dice for this exercise as well).

Another option to generate random sequences is to use this tool seqGen.
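If you prefer to script it, the sketch below does the same job as the dice: it draws each letter uniformly at random and prints the result in FASTA format. The order in which letters are assigned to die faces is an assumption made for this example (any fixed assignment gives equally random sequences), and the function names are made up for illustration.

```python
import random

DNA_LETTERS = "ACGT"                      # one letter per d4 face (assumed order)
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one per d20 face

def random_sequence(alphabet, length, rng):
    """Draw `length` letters uniformly, like rolling a fair die once per position."""
    return "".join(rng.choice(alphabet) for _ in range(length))

def to_fasta(named_sequences):
    """Format (name, sequence) pairs as FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in named_sequences) + "\n"

if __name__ == "__main__":
    rng = random.Random()  # use random.Random(1) instead for reproducible output
    dna = [(f"seq{i}", random_sequence(DNA_LETTERS, 25, rng)) for i in range(1, 4)]
    print(to_fasta(dna), end="")
```

Swapping DNA_LETTERS for PROTEIN_LETTERS gives random peptides for STEP 2.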

STEP 1 - DNA sequences and BLASTN

Generate 3 DNA sequences of length 25bp using the table below

Examples of 4-sided dice - notice that the yellow and black dice at the right hand side of the picture use a different orientation of the numbers (Yellow = 3; Black = 2)
  • Throw the d4 die 25 times per sequence, and use the table to translate rolls into letters.

1 2 3 4

QUESTION 1: Report the three sequences in FASTA format (give them short UNIQUE names, e.g. "seq1", "seq2", "seq3").

We now need to do a BLASTN search at NCBI.

  • Follow the "nucleotide blast" link from the main BLAST page.
  • In the section "Program Selection" select the option "Somewhat similar sequences (blastn)"
  • Choose "Nucleotide Collection (nr/nt)" as the search database. NR is the "Non Redundant" database, which contains all non-redundant (non-identical) sequences from GenBank and the full genome databases.

VERY IMPORTANT: For this special situation where we BLAST small artificial sequences we need to turn off some of the automatic adjustments NCBI applies when short sequences are detected. Otherwise we'll not be able to see the intended results:

  • Extend the "Algorithm parameters" section (see the screen shot below) in order to gain access to fine-tuning the options.
    1. Deselect the "Automatically adjust parameters for short input sequences" option.
    2. Set the E-value cut-off ("Expect threshold") to 50

Remember to adjust the BLAST settings

  • Paste in your three sequences in FASTA format and start the BLAST search.

Random seqs vs. the NR database

Browsing BLAST results: select which of your query sequences to inspect in the drop-down box near the top of the page
  • Inspect the results.
    • Notice that NCBI supplies a summary table at the very top of the page, and the individual alignment can be seen further down.

When you look at the result it's IMPORTANT to look at all three sections: The graphical overview, the summary table and the actual alignments.


  • How big (in basepairs) is the database we used for the BLAST search?
    • (Expand the "Search summary" section near the top by clicking the small arrow to see this).

QUESTION 2b: Answer the following small questions, and document your findings by pasting in examples of alignments / text snippets from the overview table:

  • Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).
  • What is the typical length of the hits (the alignment length)?
  • What is the typical % identity?
  • In what range are the bit-scores ("max score")?
    • Notice: This is conceptually the same as the "alignment score" we have already met in the pairwise alignment exercise.
  • What is the range of the E-values?


  • What are the scores for a match, a mismatch and gaps (hint: see search summary)?
  • What is the bit-score for the two alignments shown below?
 Alignment 1
 Query  6         GTTTCTGTAAACGTCTGA  23
                  ||||||||||||||||||
 Sbjct  1818      GTTTCTGTAAACGTCTGA  1801
 Alignment 2
                  ||||||||||| | |||||||
 Sbjct  23339642  AGGTTTCTGTAGAGGTCTGAT  23339662
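For checking your hand calculation, here is a minimal sketch of how a raw alignment score relates to a bit-score, using the standard Karlin-Altschul normalisation S' = (lambda * S - ln K) / ln 2. The values of lambda and K depend on the scoring scheme and should be read from the search summary of your own run; the numbers used below are assumptions for illustration only, as are the function names. With these assumed values, the raw score of 36 quoted in the pre-calculated example further down comes out at the 33.7 bits reported there.

```python
import math

def raw_score(match_line, match, mismatch):
    """Raw score of an UNGAPPED alignment from its match line.

    '|' marks a match and a space marks a mismatch; gaps are not handled.
    """
    matches = match_line.count("|")
    mismatches = len(match_line) - matches
    return matches * match + mismatches * mismatch

def bit_score(raw, lam, k):
    """Normalised (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

# Assumed statistical parameters, for illustration only.
# Replace them with the values shown in your own search summary.
LAMBDA, K = 0.625, 0.41

if __name__ == "__main__":
    print(round(bit_score(36, LAMBDA, K), 1))
```

To score one of the alignments above, pass its match line together with the match and mismatch scores from the search summary to raw_score, then feed the result to bit_score.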


  • What is the biological significance of the hits you found / is there any biological meaning?

Random seqs vs. the human genome database

Now let's try to perform a search against a different database. Open a new window/tab for this - you'll need to compare the results in a moment.

This time choose the "Human genomic plus transcripts" database (and remember to set the Algorithm parameters - same as above), and run the BLAST search.


  • Report the same basic statistics as in question 2a/2b.

Concerning database size and E-values

Consider this: All human sequences are also found in the NR database.

  • This means that any hits to human sequences we have picked up in the first query (NR database) will also be found in the search against the human-only database.
  • We are now going to investigate how this affects the statistics.
  1. In case you had any hits against human sequences in the first run, try to identify the identical hits in the Human specific database
    • For some reason NCBI has chosen to use DIFFERENT database IDs in the two BLAST databases, which makes identifying the identical hit a bit difficult.
    • You can still do it by looking at the actual ALIGNMENT of the Human hit in the NR database and find the identical ALIGNMENT in the Human only database.
  2. Alternatively, use the pre-generated examples quoted below.


  • Has the alignment score ("max-score") changed? Would you expect it to?
  • Has the E-value changed? Why/Why not?
  • What is the relationship between database size and E-value for hits with identical alignment score?
    • Hint 1: You can actually calculate this relationship from the data in the example below.
  • In conclusion: if the database size is doubled, what will happen to the E-value?

Pre-calculated example: NR:

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS,environmental samples or phase 0, 1 or 2 HTGS sequences)
           11,076,294 sequences; 30,408,770,429 total letters

                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

ref|XM_001558247.1|  Botryotinia fuckeliana B05.10 hypothetica...  33.7    6.3  
gb|AF440781.1|  Streptomyces cinnamonensis polyether antibioti...  33.7    6.3  
emb|AL365194.15|  Human DNA sequence from clone RP4-549F15 on ...  33.7    6.3  

>emb|AL365194.15| Human DNA sequence from clone RP4-549F15 on chromosome 1 Contains 
a novel gene and part of the CAMTA1 gene for calmodulin 
binding transcription activator 1, complete sequence

 Score = 33.7 bits (36),  Expect = 6.3
 Identities = 23/25 (92%), Gaps = 2/25 (8%)

             ||||||||||||||||||  |||||

Pre-calculated example: Human Genome/Transcriptome:

Database: Human build 37 RNA, GRCh37, and HuRef assemblies
           47,542 sequences; 5,860,289,005 total letters

                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

ref|NT_021937.19|  Homo sapiens chromosome 1 genomic contig, G...  33.7    1.2  
ref|NW_001838523.1|  Homo sapiens chromosome 1 genomic contig,...  33.7    1.2  

>ref|NT_021937.19| Homo sapiens chromosome 1 genomic contig, GRCh37 reference primary assembly

 Features in this part of subject sequence:
   calmodulin-binding transcription activator 1 

 Score = 33.7 bits (36),  Expect = 1.2
 Identities = 23/25 (92%), Gaps = 2/25 (8%)

                ||||||||||||||||||  |||||
Sbjct  3502635  CGCCCGACCGTGTAGGAG--CCGGT  3502613
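One way to work through Hint 1 is to compare two ratios computed from the pre-calculated examples above: the ratio of the database sizes (total letters) and the ratio of the E-values for the hit with the identical score. The numbers below are copied from the two examples; note that the reported E-values are rounded to two significant digits, so the two ratios can only be expected to agree approximately.

```python
# Values copied from the two pre-calculated examples (same hit, same bit-score of 33.7).
nr_letters, nr_evalue = 30_408_770_429, 6.3
human_letters, human_evalue = 5_860_289_005, 1.2

size_ratio = nr_letters / human_letters
evalue_ratio = nr_evalue / human_evalue

print(f"database size ratio (NR / Human): {size_ratio:.2f}")
print(f"E-value ratio (NR / Human):       {evalue_ratio:.2f}")
```

Compare the two printed numbers before answering the last two bullets of the question.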

STEP 2 - Protein sequences and BLASTP

Examples of 20-sided dice

Now it's time to work with a set of protein sequences - generate three sequences of length 25 aa using the table below.

  • Notice 1: The distribution of amino acids will be equal (5% prob) and this is different from true biological sequences - however this is not important for this first part of the exercise.
  • Notice 2: Please recall from the lecture that the way BLASTP selects candidate sequences for full Smith-Waterman alignment is different from BLASTN. (BLASTN - a single short (11 bp +) perfect match hit is needed. BLASTP - a pair of "near match" hits of 3 aa within a 40 aa window is needed).
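The BLASTN seeding step mentioned in Notice 2 can be illustrated with a few lines of Python. This is a simplified sketch, not the actual BLAST implementation: it only finds exact shared words (default length 11) between a query and a single subject sequence, and it leaves out the extension and scoring steps entirely. The function name and the toy sequences are made up for illustration.

```python
def exact_word_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word of `word_size` letters is identical."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits

if __name__ == "__main__":
    query = "GATTACAGGCTTACCGATGCA"
    subject = "TTTT" + query[5:18] + "CCCC"  # 13 shared letters flanked by unrelated ones
    print(exact_word_hits(query, subject))
```

A shared stretch of 13 letters contains three overlapping 11-letter words, so it produces three seed hits on the same diagonal; shorter shared stretches produce none, which is why a random 25 bp query needs at least one perfect 11 bp match to be reported at all.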

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20

QUESTION 6: Report the sequences in FASTA format (once again use short UNIQUE names).

Locate the "Protein BLAST" page at NCBI and choose blastp as the algorithm to use.

Paste in your sequences in FASTA format, and choose the "NR" database (this is the protein version, consisting of translated CDS'es, UniProt etc).

VERY IMPORTANT: We also need to tweak the parameters this time - in the "Algorithm Parameters" section select BLOSUM62 as the alignment matrix to use and set the "Expect threshold" to 1000 (default: 10) - and DISABLE the "Automatically adjust parameters for short input sequences" option as we did in the DNA search a moment ago - otherwise our carefully tweaked parameters will be ignored.

Perform the BLAST search.

Inspect the results:

QUESTION 7a: (Remember to document your answers in the same manner as Q2 and Q4)

  • How big is the database this time?
  • What is the typical length of the alignments and do they contain gaps?
  • What is the range of E-values?
  • Try to inspect a few of the alignments in detail ("+" means similar amino acids) - do you find any that look plausible, if we for a moment ignore the length?
  • If we had used the default E-value cut-off of 10 would any hits have been found?


  • If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - what kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?
    • Remember to take E-values into your consideration.


Part 2 - using BLAST to transfer functional information by finding homologs

Homo-, Ortho- and Paralogs

One of the most common ways to use BLAST as a tool is in the situation where you have a sequence of unknown function, and want to find out which function it has. Since a large amount of sequence data has been gathered over the years, chances are that an evolutionarily related sequence with known function has already been identified. In general such a related sequence is known as a "homolog".

Homo-, Ortho- and Paralogs:

  • A Homolog is a general term that describes a sequence that is related by any evolutionary means.
  • An Ortholog ("Ortho" = True) is a sequence that is "the same gene" in a different organism: The sequences share a single common ancestor sequence, and have now diverged through speciation (e.g. the Alpha-globin gene in Human and Mouse).
  • A Paralog arises due to a gene duplication within a species. For example Alpha- and Beta-globin are each other's paralogs.
Image source: gwLee's blog

Notice that in both cases it's possible to transfer information, for example information about gene family / protein domains. We have already touched upon comparison of (potentially) evolutionarily related sequences in the pairwise alignment exercise. However, this time we do not start out with two sequences we assume are related, but we rather start out with a single sequence ("query sequence") which we will use to search the databases for homologs (we often informally speak of "BLAST hits", when discussing the sequences found).


BLAST example 1

Let's start out with a sequence that will produce some good hits in the database. The sequence below is a full-length transcript (mRNA) from a prokaryote. Let's find out what it is.


BLASTN search

Perform a BLASTN search in the NR/NT database (BLASTN) using default settings.

NOTICE: Make sure once again to set the search program to BLASTN - "Somewhat similar sequences" - and set the database to "NR/NT Nucleotide".

QUESTION 8: (Once again remember to document your findings)

  • Do we get any significant hits?
  • What kind of genes (function) do we find?
  • Do you find any ortholog/paralog genes?
  • Explain why you think it is an ortholog/paralog

BLASTP search

Now let's try to do the same at the protein level:

ABBREVIATED VERSION: The following protein sequence has been prepared by scanning the DNA sequence for the most promising ORF (Open Reading Frame) using the VirtualRibosome web-server:

  • BLAST the sequence (BLASTP) against the NR database.

QUESTION 9: (Document!)

  • Report your translated protein sequence in FASTA format.
  • Do we find any conserved protein-domains? (Indicated at the very top of the result page, and during the search). Identifying known protein domains can provide important clues to the function of an unknown protein.
  • Do we find any significant hits? (E-value?)
  • Are all the best hits the same category of enzymes?
  • How does the distribution of E-values look compared to the DNA search?


BLAST example 2

In the previous section we have been cheating a bit by using a sequence that was already in the database - let's move on to the following sequence instead.

The sequence is a DNA fragment from an unknown non-cultivatable microorganism. It was cloned and sequenced directly from DNA extracted from a soil-sample, and it goes by the poetic name "CLONE12". It was amplified using degenerate PCR primers that target the middle ("core cloning") of the sequence of a group of known enzymes. (I can guarantee this particular sequence is not in the BLAST databases, since I have cloned and sequenced it myself, and it has never been submitted to GenBank).

LOCUS       CLONE12.DNA    609 BP DS-DNA             UPDATED   06/14/98
DEFINITION  UWGCG file capture
SOURCE      -
COMMENT     Non-sequence data from original file:
BASE COUNT      174 A    116 C    162 G    157 T      0 OTHER
ORIGIN      ?
    clone12.dna Length: 609   Jun 13, 1998 - 03:39 PM   Check: 6014 ..
      601 GGCGCCGCC

QUESTION 10 (Long question - read all):

Your task is now to find out what kind of enzyme this sequence is likely to encode, using the methods you have learned.

INSTRUCTIONS: You are free to write the combined answer to this question in a free-style essay-like fashion - just be sure to include the subquestions in your answers. In an exam situation you will need to put all the clues together yourself, reason about the tools/databases to use, and document your findings.

STEP 1 - cleaning up the sequence:

The sequence is (more or less) in GenBank format and the NCBI BLAST server expects the input to be in FASTA format, or to be "raw" unformatted sequence.

  • There are two solutions to this:
    • Copy the sequence into a text-editor and manually create a FASTA file ("search and replace" and/or "rectangular selection" is useful for the reformatting).
      This is the most robust solution: it will always work. (Look at the JEdit exercise for a reminder of how to do this).
    • Hope the creators of the web-server you're using were kind enough to automatically remove non-DNA letters (paste in ONLY the DNA lines) - this turns out to be the case for both NCBI BLAST and VirtualRibosome, but it cannot be universally relied upon.
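The manual clean-up can also be scripted. The sketch below is one possible approach under a simple assumption: every sequence line starts with a position number (as in "601 GGCGCCGCC" above) and every other line does not. It keeps only those numbered lines, strips the numbers and spaces, and wraps the result as FASTA. The function name is made up for illustration.

```python
def numbered_lines_to_fasta(text, name, width=60):
    """Convert GenBank/GCG-style numbered sequence lines to a FASTA record.

    Assumes sequence lines begin with an integer position; all other
    lines (LOCUS, DEFINITION, ORIGIN, ...) are ignored.
    """
    pieces = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            pieces.append("".join(fields[1:]))  # drop the position, join the blocks
    sequence = "".join(pieces).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(wrapped) + "\n"

if __name__ == "__main__":
    record = "ORIGIN      ?\n      601 GGCGCCGCC\n"
    print(numbered_lines_to_fasta(record, "CLONE12"), end="")
```

Always check the length of the result against the length stated in the header (609 bp for CLONE12) to make sure no letters were lost.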

Subquestion: convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.

ABBREVIATED VERSION - here is the DNA sequence in FASTA format:


STEP 2 - thinking about the task:

Consider the following before you start on solving this task:

  • Based on the information given: is the sequence protein-coding?
  • If it is, can you trust it will contain both a START and STOP codon?
  • Do we know if the sequence is sense or anti-sense?

Subquestion: Give a summary of your considerations.
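To explore these considerations hands-on, the sketch below translates a DNA sequence in all six reading frames (three per strand), which is what you need to look at when neither the strand nor the frame is known. It assumes the standard genetic code and only the letters A, C, G and T; any codon containing another letter is translated as "X". The function names are made up for illustration.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the first letter; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCTAA").items():
        print(frame, protein)
```

For an internal fragment of a gene, a frame without any "*" along the whole length is a good candidate for the true reading frame, even if it has no START codon.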

ABBREVIATED VERSION - here is the protein sequence in FASTA format:


STEP 3 - Performing the database search:

Now, use database searching to figure out what the function of the unknown sequence is. Significance: We will set the threshold for significance at an E-value of 1e-10 (remember: the higher the E-value, the worse the significance).


Cover the following in your answer:

  • What type of BLAST do you need to perform to find out the function of the sequence?


Part 3 - BLAST'ing Genomes

So far we have been using BLAST to search in the big broad databases that cover a huge set of sequences from a large range of organisms. In this final part of the exercise we will be doing some more focused searches in smaller databases by targeting specific genomes.

Typically this will be useful if you have a gene of known function from one organism (say a cell-cycle controlling gene from Yeast, Saccharomyces cerevisiae) and want to find the human homolog/ortholog to this gene (genes that control cell division are often involved in cancer).

When you have been performing the BLAST searches, you have probably already noticed that it's possible to search specifically in the Human and Mouse genomes (these databases only contain sequences from Human/Mouse). It's also possible to restrict the output from searches in the large databases (e.g. NR) to specific organisms.

A growing number of organisms have been fully sequenced, and the research teams responsible for a large scale genome project typically put up their own Web resources for accessing the data. For example the Yeast genome is principally hosted in the Saccharomyces Genome Database (SGD - www.yeastgenome.org) - it should be noted that SGD also offers BLAST as a means to search the database.

Genome links

For the purpose of this exercise we will be using the genome resources hosted at the NCBI (with a short digression to SGD):

Genome specific analysis of histones


Let's do a small study of the relationship between the histones found in Yeast and in Human (evolutionary distance: ~1-1.5 billion years).

Look up the HTA2 gene in SGD (use the Quick Search box). Notice that a brief description of the function of the gene and its protein product is displayed (a huge amount of additional information can be found further down the page - much of it Yeast specific).

QUESTION 11: What information is given about the relationship between this gene and the gene "HTA1"?

Browse the page and locate the link to the protein sequence - keep the window open, or save the sequence to a file, we'll need it in a moment.

ABBREVIATED VERSION - here is the protein sequence in FASTA format:

>YBL003C  Chr 2   reverse complement


NCBI Genomes page - organism specific links
  • Now return to the NCBI Genome page.

Notice that an overview of the organisms for which genomes are available is shown in a box to the right (section: "Organism-specific") - for each organism the information available is shown using a single letter code ("B" = BLAST). You can use this to open a BLAST page dedicated to that specific genome (you can search both DNA and proteins translated from the genes).

Before we start looking in the human genome, let's find out if we can locate the HTA2 gene in the NCBI version of the Yeast genome:

  • Go to the BLAST page for Yeast (Click "B").
  • Choose "RefSeq protein" as the database
  • Use the HTA2 protein sequence as query.

QUESTION 12: (Remember to document your answers)

  • How many high-confidence hits do we get?
  • Do the hits make sense, from what you have read about HTA2 at the SGD webpage?

The next step is to search the translated version of the human genome.

  • Go to the dedicated BLAST page for Human (click "B").
  • Choose "RefSeq protein" as the database.

Notice: a larger number of databases are offered compared to Yeast. This is simply due to the fact that the identification of the genes in the human genome is much more troublesome than in Yeast - and therefore a number of alternative interpretations of the genome/proteome are offered. (In Yeast virtually all protein coding genes have been experimentally verified).

  • Choose "BLASTP" as the method - and start your search for HTA2 homologs.


  • How many high-confidence hits are found?
  • These proteins originate from a number of genes - but how many UNIQUE genes?
    • Hint: Some of the proteins are iso-forms that originate from alternative splicing (one gene -> multiple iso-forms).

Low complexity filter

Notice: In the BLAST alignments - some parts of the sequences are marked in grey - these are low-complexity regions, that BLAST by default ignores in the comparison (but NCBI has chosen to show them in lowercase grey for pointing out the regions).

  • As the very last thing today, try to explicitly ENABLE the low-complexity filter, and re-run the search (in a new window/tab):
  • When you get the result page, click "Formatting options" and set MASKING to "X for protein, N for nucleotide".
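To get a feeling for what "low complexity" means, the sketch below masks windows of a sequence whose letter composition is very uneven, measured by Shannon entropy. This is NOT the filter BLAST uses (protein queries are filtered with the SEG program); it is only a simplified stand-in, and the window length of 12 and the entropy threshold of 2.2 bits are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter composition of `window` (0 = a single repeated letter)."""
    total = len(window)
    counts = Counter(window)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every position covered by a low-entropy window with `mask_char`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)

if __name__ == "__main__":
    # Toy sequence: a varied stretch followed by a run of serines.
    print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 12))
```

Runs of a single amino acid or short repeats are masked, while regions with a varied composition are left untouched.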
  • Inspect the alignments.

QUESTION 14: Do we get shorter alignments this time?

Concluding remarks

Today we have been using BLAST to find a number of homologous genes (and protein-products). If we want to go even deeper into the analysis of the homologs, the next logical step would be to build a dataset of the full-length versions of the sequences we have found (not just the part found by the local alignment in BLAST).

A further analysis could consist of a series of pairwise alignments (for finding out what is similar/different between pairs of sequences) or a multiple alignment which could form the basis of establishing the evolutionary relationship between the entire set of sequences.

BLAST can also be used as a way to build a dataset of sequences based on a known "seed" sequence. As we saw in the GenBank exercise, free-text searching in GenBank can be difficult, and if we for instance wanted to build a dataset of variants of the insulin gene, an easy way to get around this would be to BLAST the normal version of insulin against the sequence database of choice, and pick the best matching hits from there.
