Searching the GenBank database
ABBREVIATED VERSION - link to full version: ExGenbank
Exercise written by: Rasmus Wernersson
This exercise has two main goals:
1) Introduction to the types of DNA data contained in the GenBank database (data format, visualization, cross-database links, how biological "features" such as genes are annotated and described as coordinates in the DNA sequence).
2) Practice searching the online version of GenBank hosted at the NCBI. Since the number of sequences in GenBank is HUGE, it is critically important to be able to search and filter the information. Filtering out the unwanted sequences in particular can be a challenge, as we shall see.
Where to find GenBank
The GenBank database is hosted at NCBI (National Center for Biotechnology Information, USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also hosts a number of other biological databases (for example whole-genome databases for human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical "GenBank" database - the main search interface is located here: http://www.ncbi.nlm.nih.gov/Genbank/index.html
Using the "Entrez" database browser
ALL the NCBI databases can be queried through a common search interface named Entrez. On nearly all NCBI web pages a search box can be found in the upper part of the page, allowing easy access for searching the individual databases (or searching across all databases). Click on the following link to open up a new browser window with Entrez, where the focus is pre-set to search in the GenBank database:
(It's NOT necessary to remember this particular convoluted URL - in the future you can just go to the main NCBI webpage and choose "Nucleotide" as the database).
Part 1: Concerning the DATA in GenBank
This part of the exercise is about the types of data hosted in GenBank.
Searching for a specific ID
The typical reasons for searching for a specific ID in GenBank are looking up information from the literature (e.g. a gene found in a study), following up on information from other databases, investigating lists of interesting genes etc. In this part of the exercise we will be working with a set of alpha-globin genes.
- Query information about "AB001981" - after the search, follow the sequence link to inspect the result.
By default the result is shown in the GenBank format.
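The same lookup can also be scripted. The sketch below only builds a URL and does not contact NCBI. It assumes NCBI's E-utilities `efetch` endpoint, which this exercise does not otherwise use, and `efetch_url` is a helper name made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "efetch" endpoint (not part of the exercise itself).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build a URL for one nucleotide record.

    rettype="gb" requests the GenBank flat file, rettype="fasta" the FASTA view.
    """
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,      # e.g. an accession number from a publication
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url("AB001981")
print(url)
```

Pasting the printed URL into a browser should return the record as plain text. For the exercise itself, the web interface remains the intended route.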
QUESTION 1.1: Please notice that the publication from which the DNA sequence originates is cited (and linked via a PubMed ID) within the header. Sometimes multiple publications related to the same gene are listed. This is of great importance, since it makes it possible to trace the source(s) of the DNA sequence and investigate whether the experiments carried out are to be trusted.
a) How many genes are contained in this entry?
b) From which organism does the DNA originate?
c) What kind of information is contained within the HEADER and within the FEATURE block?
This can be of real importance if something seems "wrong" with the sequence (for example if this particular gene exhibits a really strange intron/exon structure compared to other closely related genes, or if it simply doesn't match ANY other known genes of the same family). By investigating the original publication it's possible to double-check the experimental procedure. It may be that the article correctly states the gene to be of type XXX, but when the data was submitted it was accidentally annotated as YYY (it is the original researchers' responsibility to double-check this). There can also be more serious problems with the experiments, ranging from bad/wrong PCR primers to contamination with DNA from a different species during a cloning step.
NEVER FORGET: biological data CAN be wrong.
- Investigate the PubMed link(s):
- Follow the PubMed link from the sequence entry.
- Observe that it is always possible to read the ABSTRACT of the publication in PubMed, even if access to the publication requires a subscription. For most (new) publications there will also be a direct link to the publication itself.
- Return to the sequence entry once again (or perform the search again if you closed the window).
- View the sequence entry in FASTA format (simply click on "FASTA" at the top of the page, below the entry title).
Now the entire GenBank entry is shown in FASTA format.
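For reference, a FASTA record is just a title line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into (title, sequence) pairs. The record in `example` is invented for illustration and is not a real GenBank entry, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may be wrapped over several lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Invented toy record, NOT a real database entry.
example = """>XX000001.1 Hypothetical example record, not a real entry
ATGGCTAACC
GTTAG
"""

records = parse_fasta(example)
for title, sequence in records:
    print(title, len(sequence))
```

Note that only a title and the raw sequence survive in this format, which is worth keeping in mind when answering the questions below.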
QUESTION 1.2: Observe that the name of the sequence is based on the name of the GenBank entry.
a) What happened to the alpha-globin genes? Can they still be found?
b) Which part of the GenBank entry has been converted?
Exploring the genes defined in a GenBank entry
- Go back to the GenBank entry in your browser. Click the first "CDS" element (Alpha-D).
CDS = CoDing Sequence: the PROTEIN CODING part of a gene. Basically: the sequence you get when the CODING exons are concatenated (UTR regions are ignored). A CDS always starts with a START codon and ends with a STOP codon.
Observe the following:
- What happened to the DNA sequence?
- Which interval is listed?
- Which three nucleotides does the sequence now start and end with? Does this make sense?
- Are there any introns present?
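The idea of "concatenating the coding exons" can be sketched in a few lines. The example assumes a location written in the GenBank style `join(start..end,start..end)` with 1-based, inclusive coordinates. The toy sequence is invented, the helper names are hypothetical, and locations on the reverse strand (`complement(...)`) or with partial markers (`<`, `>`) are deliberately not handled.

```python
import re


def parse_join(location):
    """Return (start, end) pairs from e.g. 'join(3..8,15..20)' or '3..8'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(sequence, location):
    """Concatenate the intervals named in `location` (forward strand only)."""
    # GenBank coordinates are 1-based and inclusive;
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return "".join(sequence[start - 1:end] for start, end in parse_join(location))


# Invented toy "gene": flank + exon 1 + intron + exon 2 + flank.
toy = "CC" + "ATGGCC" + "GTAAGT" + "AAATAA" + "GG"

cds = extract_cds(toy, "join(3..8,15..20)")
print(cds)

# Sanity checks matching the definition of a CDS given above.
print(cds.startswith("ATG"), cds[-3:] in {"TAA", "TAG", "TGA"}, len(cds) % 3 == 0)
```

The intron (`GTAAGT`) and the flanking bases drop out, leaving only the two coding intervals joined end to end.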
When looking at the FEATURE table, the first line of text in the definition of each CDS is as follows:
QUESTION 1.4: Based on your observations: What do these numbers mean? How many coding exons does each gene contain?
- View the first CDS (Alpha-D) in FASTA format
QUESTION 1.5: What do the numbers in the sequence title represent?
Part 2: Searching the GenBank
The key issue to keep in mind when searching GenBank is to avoid drowning in huge amounts of irrelevant data. It is therefore of great importance to filter out unwanted information, WITHOUT losing the relevant entries.
Today we will work with searching the TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to sequence based searches (BLAST).
In the first part of the exercise the aim is to locate the human gene for insulin.
- Search for GenBank entries containing the term "insulin"
Simply enter "insulin" in the search box and hit "Go".
Observe the following:
- A large number of entries are found.
- Go through a few pages of results and notice that we are offered data from a diverse set of sources: Experimental work, Patent applications, predicted genes, partial genes etc.
QUESTION 2.1: How many search results were returned?
- Confining the search to specific parts of the annotation:
By default the search term is matched against ALL POSSIBLE fields in the GenBank entries - including almost all text in the HEADER and FEATURE table. It's even possible to pick up entries where the match is to one of the authors' names and not a gene name! (Perhaps not an issue for insulin.) Luckily it is possible to restrict the search to specific pre-indexed fields in the HEADER and FEATURE table ("Search fields"), which makes the search much more focused.
Spend a few moments to investigate the HEADER section of the GenBank entry you have all received as a hand-out (X01831) to get an idea of how the data is related to specific sections (e.g. KEYWORDS and ORGANISM which we will use in a moment).
A schematic overview of the search fields can be found on the NCBI homepage: Search Fields and Qualifiers (you can also find this page by following the "HELP" link in the menu bar, and look for "Search fields").
- Narrow the search to human insulin:
Query: "human[organism] insulin"
Observe that we now only get entries from Human - Homo sapiens (TaxID: 9606). For all major model organisms the English name (rat, mouse, pig) can be used instead of the full binominal Latin name (Rattus norvegicus, Mus musculus, Sus scrofa).
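Field-restricted queries are plain strings, so they are easy to assemble programmatically. The sketch below builds the query above and wraps it in a search URL without sending it. The `esearch` endpoint of NCBI's E-utilities is an assumption that goes beyond this exercise, and `fielded` and `esearch_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not part of the exercise itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field=None):
    """Attach a search-field tag, e.g. fielded('human', 'organism') -> 'human[organism]'."""
    return f"{term}[{field}]" if field else term


def esearch_url(query, db="nuccore"):
    """Build (but do not send) a search URL for the given query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Same query as entered in the search box above.
query = " ".join([fielded("human", "organism"), fielded("insulin")])
print(query)
print(esearch_url(query))
```

Untagged terms such as `insulin` are still matched against all fields, which is why the tagged and untagged forms give very different hit counts.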
QUESTION 2.2: How many hits do we have now? Do they all appear to be insulin genes?
Try inspecting a few of the obvious non-insulin genes, and see if you can find out WHERE the term "insulin" was used. The main issue here is that we find entries where "insulin" is mentioned anywhere in the entry, and sometimes it's unrelated genes like "Insulin-receptor", "Insulin inhibitor" etc.
A good example of a really surprising match is the entry NM_053056. The description of this entry states it to be "Homo sapiens cyclin D1 (CCND1), mRNA". But why does such an entry come up in our search? The culprit in this case turns out to be a reference to a publication with the title "Insulin-like growth factor I triggers nuclear accumulation of cyclin D1 in MCF-7S breast cancer cells" (!).
- The next step is to search for entries where insulin is specifically annotated as a KEYWORD:
Query: "human[organism] insulin[keyword]"
- Observe that we have now reduced the number of sequences to a level where it's practical to inspect them all (even if there's still a fair bit of junk).
- Find the most likely candidate for the full-length insulin gene by browsing the list of search hits and inspecting promising entries.
QUESTION 2.3: How many search results were found? Which entry is the correct Human Insulin gene?
Combining search terms using boolean operators: NOT, AND and OR
Our next task will be to find full length insulin genes from as many different organisms as possible.
- Let's start out with a new clean search for Insulin
The number of hits is not that high (< 100, December 2008) and in principle they could all be inspected by hand. However, another possibility is to add search terms to AVOID in order to bring down the false positive rate.
- A brief inspection of some of the search hits shows that some of them are "Insulin-like" rather than being an actual insulin gene. We can exclude these by using the NOT operator:
Query: "Insulin[keyword] NOT insulin-like"
Observe that the number of hits goes down - but we still have some unwanted entries.
- Let's get rid of the partial genes:
Query: "Insulin[keyword] NOT (insulin-like OR part OR partial)"
Notice the use of parentheses.
Conceptually what we are doing here is to conduct a number of searches that are either COMBINED or SUBTRACTED from each other. The "(insulin-like OR part OR partial)" search term finds all entries where any of the three terms are found. This list is then excluded from the "Insulin[keyword]" by using the NOT operator.
The use of boolean operators can be visualized graphically using Venn diagrams: Venn Diagrams for Boolean Logic.
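The same logic can be written out with sets, where OR corresponds to union and NOT to set difference. The identifiers below are invented placeholders rather than real accession numbers, and the grouping into "hit lists" is purely illustrative.

```python
# Invented placeholder IDs standing in for search hits (NOT real accessions).
keyword_insulin = {"A1", "A2", "A3", "A4", "A5"}   # Insulin[keyword]
mentions_insulin_like = {"A2", "B7"}               # insulin-like
mentions_part = {"A4"}                             # part
mentions_partial = {"A4", "A5", "C9"}              # partial

# (insulin-like OR part OR partial): union of the three hit lists.
unwanted = mentions_insulin_like | mentions_part | mentions_partial

# Insulin[keyword] NOT (...): remove every unwanted hit from the keyword hits.
kept = keyword_insulin - unwanted

print(sorted(kept))  # ['A1', 'A3']
```

Entries such as `B7` and `C9` never matched the keyword search in the first place, so excluding them has no effect on the result.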
A good strategy for narrowing down a GenBank search is to build a list of "kill words"/"filter words" (terms to avoid). More terms can be added to the list as search results are inspected, and it's found out why strange entries appear on the result list.
A word of caution: Be careful not to throw the baby out with the bath water - don't add kill-words that are so broad that they will actually exclude the gene(s) we are looking for.
About the use of AND: The AND operator is implicitly used whenever you enter more than one search term: "human globin" will be interpreted as "human AND globin" and only results where BOTH terms are found will be reported.
- The final part of the exercise is to continue finding terms to exclude on your own. The point is to bring down the number of search results to a level where it's easy to pick the correct ones.
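Keeping the kill-words in a list makes it easy to add or drop terms as you inspect the results. The following is a minimal sketch that only assembles the query string. `build_query` is a hypothetical helper, not an NCBI function.

```python
def build_query(wanted, kill_words):
    """Combine a search term with a list of terms to exclude.

    The parentheses matter: without them NOT would apply to the first
    kill-word only.
    """
    if not kill_words:
        return wanted
    return f"{wanted} NOT ({' OR '.join(kill_words)})"


kill_words = ["insulin-like", "part", "partial"]
query = build_query("Insulin[keyword]", kill_words)
print(query)  # Insulin[keyword] NOT (insulin-like OR part OR partial)
```

Removing a kill-word that turns out to be too broad is then a one-line change, after which the query can be regenerated and pasted into the search box.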
QUESTION 2.4: Notice: There are several possible answers to this question, as it will be a balance between filtering out False Positives (things that are NOT insulin) without filtering out (too many) True Positives (things that are actually insulin).
a) Which search term did you end up using?
b) How many search results do you get now?
"Free exercise"
Now it's time to perform a number of GenBank searches on your own. It's important to think about the search strategy - discuss this within the group.
QUESTION 3: Do at least three of the tasks below and report your findings.
- Find the Rat and Mouse Insulin gene
- Find the alcohol-dehydrogenase gene from as many organisms as possible.
- Find the alpha-globin gene from Capra hircus - (Remember: Alpha-globin is part of hemoglobin).
- Find the alpha-globin gene from all ruminants - (hint: inspect the ORGANISM field in a GenBank entry from an animal you know to be a ruminant, in order to pick up a good search term). If you want to go deeper into the taxonomy, the Tree of Life project has an entry on placental mammals here: http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia.
- Find the actin gene from as many organisms as possible.
Avoid mRNA and entries that are part of whole chromosomes, cosmids etc.
- Find the NORMAL p53 gene from human (Somewhat tricky)
p53 is involved in cancer and therefore a large number of mutated versions of the gene have been investigated. The problem here is that these mutant versions "pollute" the GenBank database when we want to search for the "vanilla" version of the gene.
For starters, have a look at one of the mutated versions: S66666. Notice where the term "p53" is present and use this to devise your search strategy. (Sometimes this gene also goes by the name "TP53").
The tricky part of this assignment is to find the best search fields (and terms) to use, and to avoid eliminating the real (unmutated) version of the gene when you put together your "kill-word" list.
Can you find the mRNA version? The full length gene complete with intron/exon structure?