Wednesday, February 27, 2019

Phylogenetic

Molecular Phylogenetics
An introduction to computational methods and tools for analyzing evolutionary relationships
Karen Dowell
Math 500
Fall 2008

Abstract

Molecular phylogenetics applies a combination of molecular and statistical techniques to infer evolutionary relationships among organisms or genes. This review paper provides a general introduction to phylogenetics and phylogenetic trees, describes some of the most common computational methods used to infer phylogenetic information from molecular data, and provides an overview of some of the many different online tools available for phylogenetic analysis. In addition, several phylogenetic case studies are summarized to illustrate how researchers in different biological disciplines are applying molecular phylogenetics in their work.

Introduction to Molecular Phylogenetics

The similarity of biological functions and molecular mechanisms in living organisms strongly suggests that species descended from a common ancestor. Molecular phylogenetics uses the structure and function of molecules, and how they change over time, to infer these evolutionary relationships. This branch of study emerged in the early 20th century but didn't begin in earnest until the 1960s, with the advent of protein sequencing, PCR, electrophoresis, and other molecular biology techniques. Over the past 30 years, as computers have become more powerful and more generally accessible, and computer algorithms more sophisticated, researchers have been able to tackle the immensely complicated stochastic and probabilistic problems that define evolution at the molecular level more effectively.
Within the past decade, this field has been further reenergized and redefined as whole genome sequencing for complex organisms has become faster and less expensive. As mounds of genomic data become publicly available, molecular phylogenetics is continuing to grow and find new applications. 4, 10, 17, 20, 22

The primary objective of molecular phylogenetic studies is to recover the order of evolutionary events and represent them in evolutionary trees that graphically depict relationships among species or genes over time. This is an extremely complex process, further complicated by the fact that there is no one right way to approach all phylogenetic problems. Phylogenetic data sets can consist of hundreds of different species, each of which may have varying mutation rates and patterns that influence evolutionary change. Consequently, there are numerous different evolutionary models and stochastic methods available. The optimal methods for a phylogenetic analysis depend on the nature of the study and the data used. 5, 19, 20

Molecular Evolution: Beyond Darwin

Evolution is a process by which the traits of a population change from one generation to another. In On the Origin of Species by Means of Natural Selection, Darwin proposed, on the strength of overwhelming evidence from his extensive comparative analysis of living specimens and fossils, that all living organisms descended from a common ancestor. The book's only illustration (see Figure 1) is a tree-like structure that suggests how slow and successive modifications could lead to the extreme variations seen in species today. 11, 27

Figure 1. Evolution Defined Graphically. The sole illustration in Darwin's Origin of Species uses a tree-like structure to describe evolution.
This drawing shows ancestors at the limbs and branches of the tree, more recent ancestors at its twigs, and contemporary organisms at its buds. 34

Darwin's theory of evolution is based on three underlying principles: variation in traits exists among individuals within a population, these variations can be passed from one generation to the next via inheritance, and some forms of inherited traits give individuals a higher chance of survival and reproduction than others. 11

Although Darwin developed his theory of evolution without any knowledge of the molecular basis of life, it has since been determined that evolution is actually a molecular process based on genetic information encoded in DNA, RNA, and proteins. At a molecular level, evolution is driven by the same types of mechanisms Darwin observed at the species level. One molecule undergoes diversification into many variations. One or more of those variants can be selected to be reproduced or amplified throughout a population over many generations. Such variations at the molecular level can be caused by mutations, such as deletions, insertions, inversions, or substitutions at the base level, which in turn affect protein structure and biological function. 11, 22

What is a Phylogeny?

According to modern evolutionary theory, all organisms on earth have descended from a common ancestor, which means that any set of species, extant or extinct, is related. This relationship is called a phylogeny, and is represented by phylogenetic trees, which graphically represent the evolutionary history of the species of interest (see Figure 2). Phylogenetics infers trees from observations about existing organisms using morphological, physiological, and molecular characteristics.

Figure 2. Phylogeny of Mammalia. This phylogenetic tree shows the evolutionary relationships among six orders of mammal species (taxa).
Taxa listed in grey are extinct.

The tree of life represents a phylogeny of all organisms, living and extinct. Other, more specialized species and molecular phylogenies are used to support comparative studies, test biogeographic hypotheses, evaluate the mode and timing of speciation, infer the amino acid sequence of extinct proteins, track the evolution of diseases, and even provide evidence in criminal cases. 19

Understanding Phylogenetic Trees

Before exploring statistical and bioinformatic methods for estimating phylogenetic trees from molecular data, it's important to have a basic familiarity with the terms and elements common to these types of trees. (See Figure 3.)

Figure 3. Basic elements of a phylogenetic tree.

Phylogenetic trees are composed of branches, also known as edges, that connect and terminate at nodes. Branches and nodes can be internal or external (terminal). The terminal nodes at the tips of trees represent operational taxonomic units (OTUs). OTUs correspond to the molecular sequences or taxa (species) from which the tree was inferred. Internal nodes represent the last common ancestor (LCA) of all nodes that arise from that point. Trees can be made of a single gene from many taxa (a species tree) or multi-gene families (gene trees). 1, 10

A tree is considered to be rooted if there is a particular node or outgroup (an external point of reference) from which all OTUs in the tree arise. The root is the oldest point in the tree and the common ancestor of all taxa in the analysis. In the absence of a known outgroup, the root can be placed in the middle of the tree or a rootless tree may be generated. Branches of a tree can be grouped together in different ways. (See Figure 4.)

Figure 4. Groups and associations of taxonomical units in trees.

A monophyletic group consists of an internal LCA node and all OTUs arising from it.
All members within the group are derived from a common ancestor and have inherited a set of unique common traits. A paraphyletic group excludes some of its descendants (for example, all mammals except the Marsupialia taxa). A polyphyletic group can be a collection of distantly related OTUs that are associated by a similar characteristic or phenotype, but are not directly descended from a common ancestor. 1, 17

Trees and Homology

Evolution is shaped by homology, which refers to any similarity due to common ancestry. Similarly, phylogenetic trees are defined by homologous relationships. Paralogs are homologous sequences separated by a gene duplication event. Orthologs are homologous sequences separated by a speciation event (when one species diverges into two). Homologs can be either paralogs or orthologs. 1, 11, 22

Molecular phylogenetic trees are drawn so that branch length corresponds to the amount of evolution (the percent difference in molecular sequences) between nodes. 1, 19

Figure 5. Understanding paralogs and orthologs.

Paralogs are created by gene duplication events. (See Figure 5.) Once a gene has been duplicated, all subsequent species in the phylogeny will inherit both copies of the gene, creating orthologs. Interestingly, evolutionary divergence of different species may result in many variations of a protein, all with similar structures and functions, but with very different amino acid sequences. Phylogenetic studies can trace the origin of such proteins to an ancestral protein family or gene. 1, 22

Figure 6. Mirror Phylogenies. Gene A and Gene A1 are paralogs, whereas all instances of Gene A are orthologs of each other in different canine species.
One way to ensure that paralogs and orthologs are appropriately referenced in a phylogenetic tree, and to guard against misrepresentation due to missing or incomplete taxonomic information, is to generate mirror phylogenies (see Figure 6) in which paralogs serve as each other's outgroup. 1, 4, 19, 22

Estimating Molecular Phylogenetic Trees

Molecular phylogenetic trees are generated from character datasets that provide evolutionary meaning and context. Character data may consist of biomolecular sequence alignments of DNA, RNA, or amino acids; molecular markers, such as single nucleotide polymorphisms (SNPs) or restriction fragment length polymorphisms (RFLPs); morphology data; or information on gene order and content. Evolution is modeled as a process that changes the state of a character, such as the type of nucleotide (A, G, T, or C) at a specific location in a DNA sequence; each character is a function that maps a set of taxa to distinct states. 1, 19 Note that most of the examples in this paper use DNA sequences as character data, but trees can be accurately estimated from many different types of molecular data.

Figure 7. Evolution of a DNA Sequence

Figure 7 illustrates how a molecular sequence might evolve over time as a result of multiple mutations that produce small, but evolutionarily important, changes in a nucleotide sequence. At the protein level, these changes may not initially affect protein structure or function, but over time, they may eventually shape a new purpose for a protein within divergent species. 10, 19, 22 OTUs can be used to build an unrooted phylogenetic tree that clearly depicts a path of evolutionary change.
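To make the idea of character data concrete, the following minimal Python sketch treats each alignment column as a character that maps taxa to states, and computes the percent difference between two aligned sequences, the quantity that branch lengths are said to represent earlier in this paper. The taxon names, the sequences, and the helper names `character` and `p_distance` are hypothetical and chosen only for illustration. Skipping gapped columns is one common convention, not the only one.

```python
# Hypothetical aligned sequences, one per taxon (all the same length).
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "ACCTAGGTTC",
}


def character(aln, site):
    """Return one character: a mapping from each taxon to its state at a site."""
    return {taxon: seq[site] for taxon, seq in aln.items()}


def p_distance(seq_a, seq_b):
    """Proportion of comparable aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Pairwise distances for every pair of taxa, keyed by the pair of names.
names = sorted(alignment)
distances = {
    (a, b): p_distance(alignment[a], alignment[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

A table of pairwise values like `distances` is the kind of input that the distance-matrix methods described later in this paper start from.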
Steps in Phylogenetic Analysis

Although the nature and scope of phylogenetic studies may vary significantly and require different datasets and computational methods, the basic steps in any phylogenetic analysis remain the same: assemble and align a dataset, build (estimate) phylogenetic trees from sequences using computational methods and stochastic models, and statistically test and assess the estimated trees. 4, 19, 20

Assemble and Align Datasets

The first step is to identify a protein or DNA sequence of interest and assemble a dataset consisting of other related sequences. For example, to explore relationships among different members of the Notch family of proteins, one might select DNA sequences for Notch1 through Notch4 in different species, such as human, dog, rat, and mouse, then perform a multiple sequence alignment to identify homologies. 1, 10, 13, 19, 20

There are a number of free, online tools available to simplify and streamline this process. DNA sequences of interest can be retrieved using NCBI BLAST or similar search tools. When evaluating a set of related sequences retrieved in a BLAST search, pay close attention to the score and E-value. A high score indicates that the subject sequence retrieved is closely related to the sequence used to initiate the query. The smaller the E-value, the higher the probability that the homology reflects a true evolutionary relationship, as opposed to sequence similarity due to chance. As a general rule, sequences with E-values less than $10^{-5}$ are treated as homologs of the query sequence. 10

Once sequences are selected and retrieved, a multiple sequence alignment is created. This involves arranging a set of sequences in a matrix to identify regions of homology. Typically, gaps (one or more spaces in the alignment) are introduced in one or more sequences to represent insertions or deletions in the molecular code that may have occurred over time.
Effective multiple sequence alignment hinges on gap analysis: determining where to insert gaps and how large to make them. There are many websites and software programs, such as ClustalW, MSA, MAFFT, and T-Coffee, designed to perform multiple sequence alignment on a given set of molecular data. ClustalW is presently the most mature and most widely used. 1, 10, 19

Building Phylogenetic Trees

To build phylogenetic trees, statistical methods are applied to determine the tree topology and calculate the branch lengths that best describe the phylogenetic relationships of the aligned sequences in a dataset. Many different methods for building trees exist, and no single method performs well for all types of trees and datasets. The most common computational methods applied include distance-matrix methods and discrete data methods, such as maximum parsimony and maximum likelihood. 4, 17, 20 There are several software packages, such as PAUP*, PAML, and PHYLIP, that apply the most popular methods. 4

PAUP* is a commercially available program that implements a wide variety of methods for phylogenetic inference, including maximum likelihood analysis for DNA data using different models. PAUP* also includes a set of exact and heuristic methods for searching for optimal trees.

PAML (Phylogenetic Analysis by Maximum Likelihood) is an open-access set of programs for phylogenetic analysis and evolutionary model comparison. PAML includes many advanced models: DNA- and amino acid-based models, as well as codon-based models that can be used to detect positive selection. Many of the programs in PAML can model heterogeneity of evolutionary rates among sequence sites using gamma distributions, and evolutionary dynamics of different sequence regions (concatenated gene sequences).
PHYLIP is another large suite of open-access programs for phylogenetic inference that estimates trees using numerous methods, including pairwise distance, maximum parsimony, and maximum likelihood. The maximum likelihood programs can handle a few simple stochastic models and have approximate tree-searching capabilities. PHYLIP is generally considered good educational software for novice phylogeneticists.

Distance-Matrix Methods

Distance-matrix methods compute a matrix of pairwise distances between sequences that approximate evolutionary distance. Distance-based methods tend to run in polynomial time and are quite fast in practice. These methods use clustering techniques to compute evolutionary distances, such as the number of nucleotide or amino acid substitutions between sequences, for all pairs of taxa. They then construct phylogenetic trees using algorithms based on functional relationships among distance values.

There are several different distance-matrix methods, including the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA), which uses a sequential clustering algorithm; the Transformed Distance Method, which uses an outgroup as a reference, then applies UPGMA; the Neighbor-Relations Method, which applies the four-point condition to adjust the distance matrix, then applies UPGMA; and the Neighbor-Joining Method, which arranges OTUs in a star, then finds neighbors sequentially to minimize the total length of the tree. 4, 17 The following section on the UPGMA method provides a more detailed example of how distance-matrix methods work.

UPGMA Method

UPGMA produces rooted trees for which the edge lengths can be viewed as times measured by a molecular clock with a constant rate. This method uses a sequential clustering algorithm to identify the two OTUs that are most similar (meaning they have the shortest evolutionary distance and are most similar in sequence) and treat them as a single new composite OTU.
This process is repeated iteratively until only two OTUs remain. The algorithm defines the distance $d_{ij}$ between two clusters $C_i$ and $C_j$ as the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

where $|C_i|$ and $|C_j|$ are the number of sequences in clusters $i$ and $j$.

This sequential clustering process is visually described in Figure 8. In this example, the two most similar sequences are 1 and 2. They are clustered into a new composite parent node (6), and the branch lengths ($t_1$ and $t_2$) are defined as $\tfrac{1}{2} d_{1,2}$. The next step is to search for the closest pair among the remaining sequences and node 6. Pair 4 and 5 are identified and clustered into a new parent node (7), and the branch lengths $t_4$ and $t_5$ are calculated. 4, 17

Figure 8. Sequential clustering of sequences using the UPGMA method. 17

In this iterative process, parent node 8 is created from nodes 7 and 3, and parent node 9 is created by clustering nodes 6 and 8. 4, 17 Thus, all sequences are clustered into a single evolutionary tree. The total time ($t_9$) can be calculated from the distance between clusters 6 and 8:

$$d_{6,8} = \tfrac{1}{6}\,(d_{1,3} + d_{1,4} + d_{1,5} + d_{2,3} + d_{2,4} + d_{2,5})$$

Discrete Data Methods

Discrete data methods examine each column of a multiple sequence alignment dataset separately and search for the tree that best represents all this information. Although distance-based methods tend to be much faster than discrete data methods, they typically yield little information beyond the basic tree structure. Discrete data analyses, on the other hand, are information rich. These methods evaluate candidate trees against each column in the alignment, so it is possible to trace the evolution of specific elements within a given sequence, such as catalytic sites or regulatory regions.
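The UPGMA procedure described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `upgma`, the cluster labels, and the five-OTU distance values are all hypothetical, and the distances were chosen only so that the joins happen in the same order as in the Figure 8 walk-through (1 with 2, then 4 with 5, then 3 with that pair, then the final join). Cluster distances are updated with a size-weighted mean, which is equivalent to averaging over all pairs of member sequences.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the joins in order as (left, right, height), where height is
    half the distance between the two clusters being joined.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of OTUs in it
    d = dict(dist)                       # working copy; grows as clusters form
    merges = []
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Average distance from the merged cluster to every remaining cluster.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((a, b, height))
    return merges


# Hypothetical pairwise distances for five OTUs labelled "1" to "5".
raw = {
    ("1", "2"): 2, ("1", "3"): 9, ("1", "4"): 10, ("1", "5"): 11,
    ("2", "3"): 9, ("2", "4"): 10, ("2", "5"): 11,
    ("3", "4"): 6, ("3", "5"): 6, ("4", "5"): 4,
}
distances = {frozenset(pair): value for pair, value in raw.items()}
merges = upgma(["1", "2", "3", "4", "5"], distances)
```

With these made-up numbers, the final join occurs at half of the six-pair average from the $d_{6,8}$ formula, (9 + 10 + 11 + 9 + 10 + 11) / 6 = 10, giving a root height of 5.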
10, 17, 19, 20

Commonly used discrete data methods include maximum parsimony, which searches for the most parsimonious tree, the one that requires the fewest evolutionary changes to explain the differences observed; maximum likelihood, which requires a probabilistic model for the process of nucleotide substitution; and Bayesian MCMC, which also requires a stochastic model of evolution, but creates a probability distribution on a set of trees or aspects of evolutionary history. 17, 19, 20

Discrete data methods are generally considered to produce the best estimates of evolutionary history. However, these methods can be computationally expensive, and it can take weeks or months to obtain a reasonable level of accuracy for moderate to large datasets with 100 or more OTUs. 19

Maximum Parsimony

Among the most widely used tree-estimation techniques, maximum parsimony applies a set of algorithms to search for the tree that requires the minimum number of evolutionary changes to explain the differences observed among the OTUs in the study. For example, Figure 9 lists four sample sequences from which phylogenetic trees could be inferred using maximum parsimony.

Site    Seq 1   Seq 2   Seq 3   Seq 4
1       A       A       A       A
2       A       G       G       G
3       G       C       A       A
4       A       C       T       G
5       G       G       A       A
6       T       T       T       T
7       G       G       C       C
8       C       C       C       C
9       A       G       A       G

Figure 9. Sample sequences for a maximum parsimony study. 17

Maximum parsimony algorithms identify phylogenetically informative sites, meaning sites that favor some trees over others. Consider the sequences in Figure 9. Site 1 is not informative, because all sequences at that site (row 1 of Figure 9) are A (adenine), and no change in state is required to match any one sequence (1-4) to another. Similarly, Site 2 is not informative because all three trees require one change and there is no reason to favor one tree over another. Site 3 is not informative because all three trees require two changes. (See Figure 10.)

Figure 10. Site 3 trees all require two evolutionary changes.
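The change counts quoted for the sites in Figure 9 can be checked with a short Python sketch. It assumes Fitch's small-parsimony counting rule, which is one standard way to count the minimum number of changes and is not necessarily the procedure behind the figures in this paper. The function names and the tree labels are hypothetical. With four sequences there are exactly three possible unrooted trees, each defined by which sequences are paired.

```python
# Site patterns from Figure 9: the states of sequences 1-4 at sites 1-9.
SITES = ["AAAA", "AGGG", "GCAA", "ACTG", "GGAA", "TTTT", "GGCC", "CCCC", "AGAG"]

# The three unrooted four-taxon trees, as pairs of 0-based sequence indices.
TREES = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}


def fitch_changes(site, tree):
    """Minimum number of state changes a site requires on a four-taxon tree."""
    changes = 0
    node_sets = []
    for left, right in tree:
        a, b = {site[left]}, {site[right]}
        if a & b:
            node_sets.append(a & b)  # the pair agrees: no change needed here
        else:
            node_sets.append(a | b)  # the pair disagrees: one change
            changes += 1
    # One more change if the two internal nodes cannot share a state.
    if not (node_sets[0] & node_sets[1]):
        changes += 1
    return changes


def changes_per_tree(site):
    """Change counts for one site on each of the three trees, in TREES order."""
    return [fitch_changes(site, tree) for tree in TREES.values()]


# A site is informative when the trees do not all require the same count.
informative = [
    number
    for number, site in enumerate(SITES, start=1)
    if len(set(changes_per_tree(site))) > 1
]

# Total changes per tree over all nine sites; the smallest total wins.
totals = {
    name: sum(fitch_changes(site, tree) for site in SITES)
    for name, tree in TREES.items()
}
```

Under these assumptions, sites 5, 7, and 9 are the informative ones, and the tree that pairs sequence 1 with 2 and sequence 3 with 4 needs the fewest changes overall (10, against 11 and 12 for the other two).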
17

Site 4 is not informative because all three trees require three changes. No one tree can be identified as most parsimonious. (See Figure 11.)

Figure 11. Site 4 trees all require three evolutionary changes. 17

Site 5 is informative because one tree requires only one nucleotide change, whereas the other two trees require two changes. In Figure 12, the first tree on the left, which requires only one nucleotide change, is identified as the maximum parsimony tree.

Figure 12. Site 5 trees vary in the number of evolutionary changes required. 17

Maximum Likelihood

The maximum likelihood method requires a probabilistic model of evolution for estimating nucleotide substitution. This method evaluates competing hypotheses (trees and parameters) by selecting those with the highest likelihood, meaning those that render the observed data most plausible. The likelihood of a hypothesis is defined as the probability of the data given that hypothesis. In phylogeny reconstruction, the hypotheses are the evolutionary tree (its topology and branch lengths) and any other parameters of the evolutionary model. 17, 20

The likelihood calculations required for evolutionary trees are far from straightforward and usually require complex computations that must allow for all possible unobserved sequences at the LCA nodes of hypothesized trees. This method specifies the transition probability from one nucleotide state to another in a time interval in each branch. For example, for a one-parameter model with rate of substitution $\lambda$
per site per unit time, the probability that a nucleotide that starts as $i$ is still $i$ at time $t$ is written $P_{ii}(t)$, and the probability that the nucleotide at time $t$ is a different nucleotide $j$ is written $P_{ij}(t)$.

To set up a likelihood function, given $x$ as the ancestral node and $y$ and $z$ as internal nodes, the probability of observing nucleotides $i$, $j$, $k$, $l$ at the tips of the tree is computed as

$$P_{xl}(t_1+t_2+t_3)\,P_{xy}(t_1)\,P_{yk}(t_2+t_3)\,P_{yz}(t_2)\,P_{zi}(t_3)\,P_{zj}(t_3)$$

For the ancestral node (root) $x$, the probability of having nucleotide $l$ in sequence 4 is calculated as $P_{xl}(t_1+t_2+t_3)$.

Because $x$, $y$, and $z$ can each be any one of four nucleotides (A, C, G, T), it is necessary to sum over all possibilities to obtain the probability of observing the configuration of nucleotides $i$, $j$, $k$, $l$ in sequences 1, 2, 3, 4 for a given hypothetical tree (see Figure 13). This likelihood is calculated as

$$h(i,j,k,l) = \sum_{x} g_x\, P_{xl}(t_1+t_2+t_3) \sum_{y} P_{xy}(t_1)\, P_{yk}(t_2+t_3) \sum_{z} P_{yz}(t_2)\, P_{zi}(t_3)\, P_{zj}(t_3)$$

where $g_x$ is the prior probability that the root holds nucleotide $x$. The appropriate likelihood function depends on the hypothetical tree and the evolutionary model used. 17

Figure 13. Different types of model trees for the derivation of the maximum likelihood function. 17

Stochastic Models of Evolution

Evolutionary changes in molecular sequences result from mutations, some of which occur by chance, others by natural selection. Rates of change can also differ among OTUs, depending on several factors ranging from GC content to genome size. To accurately estimate phylogenetic trees, assumptions must be made about the substitution process, and those assumptions must be stated in the form of a stochastic evolutionary model. These probabilistic models are used to rank trees according to the likelihood $P(\text{data} \mid \text{tree})$. From a Bayesian perspective, they rank trees according to the posterior probability $P(\text{tree} \mid \text{data})$. 17, 20

The objective of probabilistic models is to find the likelihood or posterior probability of a particular taxonomic feature, then define and compute $P(x^{\bullet} \mid T, t^{\bullet})$, where $x^{\bullet}$
is $x^j$ for $j = 1, \dots, n$, $T$ is a tree with $n$ leaves with sequence $j$ at leaf $j$, and $t^{\bullet}$ are the tree edge lengths. 17

A few popular stochastic models of evolution include the single-parameter Jukes-Cantor (JC) model, Kimura 2-parameter (K2P), Hasegawa-Kishino-Yano (HKY), and Equal-Input. Some software programs, such as PAUP*, will automatically use a default model for the tree estimation method chosen. The JC model is the easiest one to comprehend, because it assumes that if a site changes its state, it changes with equal probability to each of the other states. This is not very realistic, however, as some sites are known to evolve more rapidly than others, and some sites may be invariable and not allowed to change at all. Determining how best to select the appropriate model is a topic for another paper (or papers), as there is no one model that incorporates all mutation rules and patterns across different species and macromolecules. 4, 17, 20

Hidden Markov Models

Profile hidden Markov models (HMMs) are a form of Bayesian network that provides statistical models of the consensus structure of a sequence family. Gary Churchill at The Jackson Laboratory was the first evolutionary geneticist to propose using profile HMMs to model rates of evolution. Many software packages and web services now apply HMMs to estimate phylogenetic relationships. 8

In the HMM format, each position in the model corresponds to a site in the sequence alignment. For each position, there are a number of possible states, each of which corresponds to a different rate of evolution. In addition, there are transitions between all possible rate-states at adjacent positions. Transition probabilities capture any tendency for patterns of rates to occur in successive sites. 2, 4

Assessing Trees

Tree-estimating algorithms generate one or more optimal trees. This set of possible trees is subjected to a series of statistical tests to evaluate whether one tree is better than another and whether the proposed phylogeny is reasonable.
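As a brief aside before the assessment methods, the likelihood function h(i, j, k, l) from the Maximum Likelihood section can be made concrete under the Jukes-Cantor model described above. The sketch below rests on several assumptions that are not taken from this paper: the rate `lam` is read as the total substitution rate per site per unit time, the root frequencies g_x are all set to 1/4, and the function names and branch times are hypothetical. Under that reading, the standard Jukes-Cantor transition probabilities are 1/4 + 3/4 exp(-4 lam t / 3) for no net change and 1/4 - 1/4 exp(-4 lam t / 3) for a change to one specific other nucleotide.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(start, end, lam, t):
    """Jukes-Cantor probability of seeing `end` after time t, given `start`.

    Assumes lam is the total substitution rate per site per unit time.
    """
    decay = math.exp(-4.0 * lam * t / 3.0)
    if start == end:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def site_likelihood(i, j, k, l, t1, t2, t3, lam):
    """h(i, j, k, l): probability of tip states i, j, k, l in sequences 1-4.

    Sums over every assignment of nucleotides to the root x and the
    internal nodes y and z, with uniform root frequencies g_x = 1/4.
    """
    total = 0.0
    for x, y, z in product(NUCLEOTIDES, repeat=3):
        total += (
            0.25
            * jc_prob(x, l, lam, t1 + t2 + t3)
            * jc_prob(x, y, lam, t1)
            * jc_prob(y, k, lam, t2 + t3)
            * jc_prob(y, z, lam, t2)
            * jc_prob(z, i, lam, t3)
            * jc_prob(z, j, lam, t3)
        )
    return total


# Hypothetical branch times and rate, for illustration only.
same = site_likelihood("A", "A", "A", "A", 0.1, 0.1, 0.1, 1.0)
mixed = site_likelihood("A", "C", "G", "T", 0.1, 0.1, 0.1, 1.0)
```

With short branches, a site where all four sequences agree (`same`) is far more probable than one where all four differ (`mixed`). A full maximum likelihood analysis would multiply such per-site values across the alignment and search over trees and branch lengths, which this sketch does not attempt.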
Common methods for assessing trees include the bootstrap and jackknife resampling methods, and analytical methods, such as parsimony, distance, and likelihood. To illustrate how these methods are used, consider the steps involved in a bootstrap analysis.

Bootstrap Analysis

A bootstrap is a statistical method for assessing trees that takes its name from the fact that it can pull itself up by its bootstraps and generate meaningful statistical distributions from almost nothing. Using bootstrap analysis, distributions that would otherwise be difficult to calculate exactly are estimated by repeated creation and analysis of artificial datasets. In a non-parametric bootstrap, artificial datasets are generated by resampling from the original data. In a parametric bootstrap, data are simulated according to the hypothesis being tested. The objective of any bootstrap analysis is to test whether the whole dataset supports the tree. 1, 4, 17

Figure 14 illustrates the basic steps in any bootstrap analysis. Sample datasets are automatically generated from an original dataset. Trees are then estimated from each sample dataset. The results are compiled and compared to determine a bootstrap consensus tree.

Figure 14. Steps in a phylogenetic tree bootstrap analysis. 1

Phylogenetic Analysis Tools

There are several good online tools and databases that can be used for phylogenetic analysis. These include PANTHER, P-POD, Pfam, TreeFam, and the PhyloFacts structural phylogenomic encyclopedia. Each of these databases uses different algorithms and draws on different sources for sequence information, and therefore the trees estimated by PANTHER, for example, may differ significantly from those generated by P-POD or Pfam.
As with all bioinformatics tools of this type, it is important to test different methods, compare the results, then determine which database works best (according to consensus results, not researcher bias) for studies involving different types of datasets.

In addition to the phylogenetic programs already mentioned in this paper, a comprehensive list of more than 350 software packages, web services, and other resources can be found here: http://evolution.genetics.washington.edu/phylip/software.html.

PANTHER (pantherdb.org)

Protein ANalysis THrough Evolutionary Relationships, known by its acronym PANTHER, is a library of protein families and subfamilies indexed by function. PANTHER version 6.1 contains 5547 protein families. It categorizes proteins by evolutionarily related proteins (families) and related proteins with the same function (subfamilies). 8, 21, 26

PANTHER is composed of both a library and an index. The library is a collection of books, each of which represents a protein family as a collection of multiple sequence alignments, HMMs, and a family phylogenetic tree. Functional divergence within the tree is represented by dividing the parent tree into child trees and HMMs based on shared functions. These subfamilies enable database curators to more accurately capture the functional divergence of protein sequences as inferred from genomic DNA. 25, 26

PANTHER database entries are annotated to molecular function, biological process, and pathway with a proprietary PANTHER/X ontology system, which is supposed to be easier to understand than the more global standard Gene Ontology (GO). Database entries in PANTHER are generated through clustering of the UniProt database using a BLAST-based similarity score.
Trees are automatically generated based on multiple sequence alignments and parameters of the protein family HMMs using the Tree Inferred from Profile Score (TIPS) clustering algorithm. Scientific curators review all family trees, annotate each tree, and determine how best to divide them into subtrees using a tree-attribute viewer that tabulates annotations for sequences in a tree. In addition, trees and subfamilies are manually cross-checked and validated by curators. 25, 26

P-POD (ortholog.princeton.edu)

The Princeton Protein Orthology Database (P-POD) combines results from multiple comparative methods with curated information culled from the literature. Designed to be a resource for experimental biologists seeking evolutionary information on genes of interest, P-POD employs a modular architecture based on the Generic Model Organism Database (GMOD). P-POD can be accessed from its web service or downloaded to run on local computer systems. 12

P-POD accepts FASTA-formatted protein sequences as input, and performs comparative genomic analyses on those sequences using OrthoMCL and Jaccard clustering methods. The P-POD database contains both phylogenetic information and manually curated experimental results. The site also provides many links to sites rich in human disease and gene information. This tool may be especially helpful for bioinformaticists and statisticians developing comparative genomic database tools and resources.

Pfam (pfam.sanger.ac.uk/)

Pfam is a collection of protein families represented by multiple sequence alignments and HMMs. It contains models of protein clans, families, domains, and motifs, and uses HMMs representing conserved functional and structural domains. It is a large, widely used, actively curated, mature database that has been available online since 1995. Pfam can be used to retrieve the domain architectures for a specific protein by conducting a search using a protein sequence against the Pfam library of HMMs.
This database is also helpful for proteome and protein domain architecture analysis. 6, 8, 24

There are two versions of the Pfam database. Pfam-B is generated automatically from ProDom, using PSI-BLAST, an open-access bioinformatics tool available through NCBI for identifying weak, but biologically relevant, sequence similarities. Pfam-A is hand-curated from custom multiple sequence alignments. Pfam protein domain families are clustered with Mkdom2 and aligned with ProDomAlign. ProDom is a comprehensive set of protein domain families automatically generated from the SWISSPROT and TrEMBL sequence databases. Mkdom2 is a ProDom program used to make ProDom family clusters. Protein domain families in ProDom were aligned using an improved parallelized program called ProDomAlign, developed in C++ using OpenMP. ProDomAlign is based on MultAlign, a program well suited for aligning very large sequence families with thousands of associated sequences.

As of early 2008, Pfam matched 72 percent of known protein sequences, and 95 percent of proteins for which there is a known structure. Within the Pfam database, 75 percent of sequences have one match to Pfam-A, and 19 percent to Pfam-B. There are also two versions of Pfam-A and Pfam-B: Pfam-ls handles global alignments, and Pfam-fs is optimized for local alignments.

Interestingly, Pfam entries can be classified as unknown, but that doesn't mean the protein is undocumented. Unknown entries can be proteins for which some information is known, but which have not been fully researched or cannot be adequately annotated. For example, Pfam entry PF01816 is a Leucine-Rich Repeat Variant (LRV), which has a known structure (1LRV) available in the Protein Data Bank (pdb.org).
LRV repeat regions, which are found in many different proteins, are often involved in cell adhesion, DNA repair, and hormone reception, but identification of an LRV within a sequence encoding a protein doesn't specifically reveal the protein's function.

For studies involving a large number of protein searches, it may be more convenient to run Pfam locally on a client machine. The standalone Pfam system requires the HMMER2 software, the Pfam HMM libraries, and a couple of additional files from the Pfam website to be installed on the client machine. (HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis.) Once the initial search is complete, researchers can go to the Pfam website to further analyze a select number of sequences using additional features on the website. [6, 8, 24]

TreeFam (TreeFam.org)

TreeFam is a curated database of phylogenetic trees and orthology predictions for all animal gene families that focuses on gene sets from animals with completely sequenced genomes. Orthologs and paralogs are inferred from the phylogenetic tree of a gene family. Release 4 contains curated trees for 1314 families and automatically generated trees for another 14351 families. [16, 23]

Like Pfam, TreeFam is a two-part database: TreeFam-B contains automatically generated trees, and TreeFam-A consists of manually curated trees. To automatically generate trees, an algorithm selects clusters of genes to create TreeFam-B seeds from core species with high-quality reference genome sequences, first using BLAST to rapidly assemble an initial list of possible matches, then HMMER to expand and filter probable sequence matches for each TreeFam-B seed family. The filtered alignment is fed into a neighbor-joining algorithm, and a tree is constructed based on amino acid mismatch distances.
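
The distance-based step described above can be illustrated with a small sketch. The Python below is not TreeFam's code. It assumes a toy protein alignment, uses the plain proportion of mismatched sites as the distance, and applies the standard neighbor-joining Q criterion to recover a topology only (branch lengths are omitted). All function names and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Pairwise mismatch distances, keyed by unordered pairs of names."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


def neighbor_joining(names, dist):
    """Return an unrooted topology as nested tuples (no branch lengths)."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all other current nodes.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # The last three nodes meet at a single central node.
    return tuple(nodes)


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences, all the same length).
    toy = {"A": "MKTAYIAKQR", "B": "MKTAYIAKQK",
           "C": "MKSAYLGKQR", "D": "MKSAYLGRQR"}
    print(neighbor_joining(list(toy), distance_matrix(toy)))
```

On this toy input the returned topology separates A and B from C and D. Which of the two pairs is joined first is arbitrary, because with four taxa both pairs have the same Q value.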
For TreeFam version 4, the most current release, five new family trees were built for each TreeFam-B seed: two maximum likelihood trees generated using PHYML (one based on the protein alignment, the other on the codon alignment) and three neighbor-joining trees using different distance measurements based on codon alignments. [16, 23]

Scientific curators then manually correct any errors in automatically generated TreeFam-B trees, based on information in the literature. Curated TreeFam-B trees then become seeds for TreeFam-A trees. Clean TreeFam-A trees are built using three integrated algorithms and bootstrapping to find the consensus tree of seven trees: two constrained maximum likelihood trees based on protein and codon alignments, and five unconstrained neighbor-joining trees generated using different distance measurements based on codon alignments. For both TreeFam-B and TreeFam-A families, orthologs and paralogs are inferred only from clean trees using the Duplication/Loss Inference (DLI) algorithm, which requires a species tree (the NCBI taxonomy tree). [16, 23]

PhyloFacts (phylogenomics.berkeley.edu/phylofacts)

PhyloFacts is an online phylogenomic encyclopedia for protein functional and structural classification. It contains more than 57,000 books for protein superfamilies and structural domains. Each book contains heterogeneous data for protein families, including multiple sequence alignments, one or more phylogenetic trees, predicted 3-D protein structures, predicted functional subfamilies, taxonomic distributions, GO annotations, and Pfam domains. HMMs constructed for each family and subfamily allow novel sequences to be classified to different functional classes. [14]

Unlike other databases mentioned in this paper, PhyloFacts seeks to correct and clarify annotation errors associated with computational methods for predicting protein function based on sequence homology. It uses a consensus approach that integrates many different prediction methods and sources of experimental data over an evolutionary tree. By applying evolutionary and structural clustering of proteins, PhyloFacts is able to analyze disparate datasets using multiple methods, identify potential errors in database annotations, and provide a mechanism for improving the accuracy of functional annotation in general. [14]

PhyloFacts can be used to search for protein structure prediction or functional classification for a particular protein sequence. Researchers may also browse through protein family books and multiple sequence alignments, phylogenetic trees, HMMs, and other pertinent information for proteins of interest. This web service also provides many links to literature and other information sources. [14]

Applied Molecular Phylogenetics

Molecular phylogenetic studies have many diverse applications. As the amount of publicly available molecular sequence data grows and methods for modeling evolution become more sophisticated and accessible, more and more biologists are incorporating phylogenetic analyses into their research strategy. Here's a sampling of how molecular phylogenetics might be applied.

Tracing the evolution of man

In one case study, molecular phylogenetic techniques were used to compare and analyze variation in DNA sequences using modern human and Neanderthal mitochondrial DNA (mtDNA). For this study, 206 modern human mtDNAs and parts of two Neanderthal mtDNA sequences derived from skeletal remains were used to generate an initial dataset.
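
Studies like this one typically turn aligned sequences into pairwise genetic distances before building a tree. As a hedged illustration (not the code used in the cited study), the sketch below computes two standard corrections, Jukes-Cantor and Kimura 2-parameter, from counts of transitions and transversions. The function names and toy sequences are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(a, b):
    """Return (sites, transitions, transversions) over unambiguous aligned sites."""
    sites = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, ts, tv


def jukes_cantor(a, b):
    """d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites."""
    sites, ts, tv = count_changes(a, b)
    p = (ts + tv) / sites
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(a, b):
    """d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q).

    P is the proportion of transitions, Q the proportion of transversions.
    """
    sites, ts, tv = count_changes(a, b)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    # Toy sequences: two transitions (A->G) and one transversion (A->C).
    ref = "A" * 20
    alt = "GGC" + "A" * 17
    print(jukes_cantor(ref, alt), kimura_2p(ref, alt))
```

Both functions raise a `ValueError` from `math.log` when the sequences are so divergent that the correction is undefined (for Jukes-Cantor, when p reaches 0.75), which is a reasonable signal that the pair is saturated.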
Genetic distance was first estimated using the Jukes-Cantor single-parameter model. Then the Kimura 2-parameter model was used to distinguish between transition (replacement of one purine with another purine or one pyrimidine with another pyrimidine) and transversion (replacement of a purine with a pyrimidine or vice versa) probabilities. A phylogenetic tree representing primate evolution was generated using pairwise genetic distances between primate hypervariable regions I and II of mtDNA. [3]

Chasing an epidemic: SARS

Using publicly available genomic data, it is possible to reconstruct the progression of the SARS epidemic over time and geographically. To conduct this phylogenetic analysis, researchers used the neighbor-joining method to construct a phylogenetic tree of spike proteins in various coronaviruses and identify the viral host (a Himalayan palm civet). They then obtained 13 SARS genome sequences with documented information on the date and location of the sample. The neighbor-joining method and a distance matrix based on the Jukes-Cantor model were used to generate an epidemic tree, from which it was possible to identify the origin (date and location) of the virus by observing the progression of mutations over time. [3]

Barking up the right tree

Phylogenetics is increasingly incorporated into biological and biomedical research papers. When the canine genome was published, researchers used sequence data to estimate a comprehensive phylogeny of the canid family.

Figure 15. Phylogenetic Tree of the Canid Family. This canid family phylogenetic tree is based on 15 kb of exon and intron sequence. It was constructed using the maximum parsimony method and represents the single most parsimonious tree. A good example of how phylogenies are referenced in the literature, this tree includes bootstrap values and Bayesian posterior probability values listed above and below internodes, respectively. Dashes indicate bootstrap values below 50%. In addition, divergence time in millions of years (Myr) is indicated for three nodes. [18]

Seeing the Forest from the Trees

Molecular phylogenetics is a broad, diverse field with many applications, supported by multiple computational and statistical methods. The sheer volume of genomic data currently available (and rapidly growing) renders molecular phylogenetics a key part of much biological research. Genome-scale studies on gene content, conserved gene order, gene expression, regulatory networks, metabolic pathways, and functional genome annotation can all be enriched by evolutionary studies based on phylogenetic statistical analyses. [19, 25 27]

Molecular phylogenies have fast become an integral part of biological research, pharmaceutical drug design, and bioinformatics techniques for protein structure prediction and multiple sequence alignment. Although not all molecular biologists and bioinformaticians may be familiar with the techniques described in this paper, this is a rapidly growing and expanding field, and there is an ongoing need for novel algorithms to solve complex phylogeny reconstruction problems.

References

1. Baldauf, SL (2003) Phylogeny for the faint of heart: a tutorial. Trends in Genetics, 19(6):345-351.
2. Brown, D, K Sjolander (2006) Functional Classification Using Phylogenomic Inference. PLoS Computational Biology, 2(6):0479-0483.
3. Cristianini, N, M Hahn (2007) Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press: Cambridge.
4. Durbin, R, S Eddy, A Krogh, G Mitchison (1998) Biological Sequence Analysis. Cambridge University Press: Cambridge.
5. Ewens, WJ, R Grant (2005) Statistical Methods in Bioinformatics. Springer Science and Business Media: New York.
6. Finn, RD, J Tate, J Mistry, PC Coggill, SJ Sammut, HR Hotz, G Ceric, K Forslund, SR Eddy, ELL Sonnhammer, A Bateman (2008) The Pfam protein families database. Nucleic Acids Research, 36:D281-288.
7. Gabaldon, T (2008) Large-scale assignment of orthology: back to phylogenetics? Genome Biology, 9:235.1-235.6.
8. Gollery, M (2008) Handbook of Hidden Markov Models in Bioinformatics. CRC Press, Taylor & Francis Group: London.
9. Goodstadt, L, CP Ponting (2006) Phylogenetic Reconstruction of Orthology, Paralogy, and Conserved Synteny for Dog and Human. PLoS Computational Biology, 2(9):1134-1150.
10. Hall, BG (2004) Phylogenetic Trees Made Easy: A How-To Manual, 2nd ed. Sinauer Associates, Inc.: Sunderland, MA.
11. Hartwell, LH, L Hood, ML Goldberg, AE Reynolds, LM Silver, RC Veres (2008) Genetics: From Genes to Genomes, 3rd ed. McGraw-Hill: New York.
12. Heinicke, S, MS Livstone, C Lu, R Oughtred, F Kang, SV Angiuoli, O White, D Botstein, K Dolinski (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE, 8:e766.1-15.
13. Kortschak, RD, R Tamme (2001) Evolutionary analysis of vertebrate Notch genes. Dev Genes Evol, 211:350-354.
14. Krishnamurthy, N, DP Brown, D Kirshner, K Sjolander (2006) PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biology, 7:R83.1-13.
15. Kuzniar, A, RCHJ van Ham, S Pongor, JAM Leunissen (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends in Genetics, 24(11):539-551.
16. Li, H, A Coghlan, J Ruan, LJ Coin, JK Heriche, L Osmotherly, R Li, T Liu, Z Zhang, L Bolund, GKS Wong, W Zheng, P Dehal, J Wang, R Durbin (2006) TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Research, 34:D573-580.
17. Li, WH (1997) Molecular Evolution. Sinauer Associates: Sunderland, MA.
18. Lindblad-Toh, K, CM Wade, TS Mikkelsen, EK Karlsson, DB Jaffe, M Kamal, M Clamp, JL Chang, EJ Kulbokas III, MC Zody, E Mauceli, X Xie, M Breen, RK Wayne, EA Ostrander, CP Ponting, F Galibert, DR Smith, PJ deJong, E Kirkness, P Alvarez, T Biagi, W Brockman, J Butler, C Chin, A Cook, J Cuff, MJ Daly, D DeCaprio, S Gnerre, M Grabherr, M Kellis, M Kleber, C Bardeleben, L Goodstadt, A Heger, C Hitte, L Kim, KP Koepfli, HG Parker, JP Pollinger, SMJ Searle, NB Sutter, R Thomas, C Webber, ES Lander (2005) Genome Sequence, Comparative Analysis and Haplotype Structure of the Domestic Dog. Nature, 438:803-819.
19. Linder, CR, T Warnow (2005) An overview of phylogeny reconstruction. In the Handbook of Computational Molecular Biology, Chapman and Hall/CRC Computer & Information Science.
20. Lio, P, N Goldman (1998) Models of Molecular Evolution and Phylogeny. Genome Research, 8:1233-1244.
21. Mi, H, N Guo, A Kejariwal, PD Thomas (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Research, 35:D247-252.
22. Patthy, L (1999) Protein Evolution. Blackwell Science, Ltd: Malden, MA.
23. Ruan, J, H Li, Z Chen, A Coghlan, LJM Coin, Y Guo, JK Heriche, Y Hu, K Kristiansen, R Li, T Liu, A Mose, J Qin, S Vang, AJ Vilella, A Ureta-Vidal, L Bolund, J Wang, R Durbin (2008) TreeFam: 2008 Update. Nucleic Acids Research, 36:D735-740.
24. Sammut, SJ, RD Finn, A Bateman (2008) Pfam 10 years on: 10000 families and still growing. Briefings in Bioinformatics, 9(3):210-219.
25. Thomas, PD, A Kejariwal, N Guo, H Mi, MJ Campbell, A Muruganujan, B Lazareva-Ulitsky (2006) Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucleic Acids Research, 34:W645-650.
26. Thomas, PD, MJ Campbell, A Kejariwal, H Mi, B Karlak, R Daverman, K Diemer, A Muruganujan, A Narechania. PANTHER: A Library of Protein Families and Subfamilies Indexed by Function. Genome Research, 13:2129-2141.
27. Warnow, T (2004) Computational Methods in Phylogenetics. Computational Systems Biology Conference, Stanford, CA.
28. Whelan, S, P Lio, N Goldman (2001) Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics, 17(5):262-272.

Appendix: Website Resources

Phylogeny Programs. A University of Washington site formerly supported by the National Science Foundation. http://www.evolution.genetics.washington.edu/phylip/software.html
TreeFam: Tree Families Database. http://www.treefam.org
Protein Analysis Through Evolutionary Relationships (PANTHER) Classification System. http://www.pantherdb.org
29. Pfam: Database of Protein Families. http://pfam.sanger.ac.uk
30. Princeton Protein Orthology Database (P-POD). http://ppod.princeton.edu
31. Wikipedia. http://en.wikipedia.org/wiki/Tree_of_life(science)

Cover Page: The cover image is from a phylogeny of canid species that appeared in Lindblad-Toh et al, 2005. [18]
