Visualising a subset of the tree of life

I understand that many curated trees of life already exist, but is there any website that allows one to input a list of organisms and then produces the current best guess at their evolutionary relationships?

The best site I managed to find so far is, which allows you to select specific species from the main tree to plot into a sub-tree, but the tree itself is rather small, so more detailed comparisons cannot be made. performs the requisite task (generating a phylogenetic tree based on the specific organisms provided, using the NCBI taxonomy tables).

For example, the input of tree elements

Trichomonas vaginalis,Trypanosoma brucei,Homo sapiens,Fibroporia radiculosa,Paramecium tetraurelia,Tetrahymena thermophila,Cryptosporidium muris,Cryptosporidium hominis,Blastocystis hominis

generates the tree
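The idea of deriving a tree from taxonomy tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lineages below are abbreviated and typed by hand for a few of the species listed above (they are not downloaded from NCBI), and `build_tree` and `to_newick` are hypothetical helper names, not part of any website's interface.

```python
# Abbreviated, hand-typed lineages (assumption: real NCBI lineages have many more ranks).
LINEAGES = {
    "Homo sapiens": ["Eukaryota", "Opisthokonta"],
    "Fibroporia radiculosa": ["Eukaryota", "Opisthokonta"],
    "Paramecium tetraurelia": ["Eukaryota", "Alveolata", "Ciliophora"],
    "Tetrahymena thermophila": ["Eukaryota", "Alveolata", "Ciliophora"],
    "Cryptosporidium muris": ["Eukaryota", "Alveolata", "Apicomplexa"],
    "Cryptosporidium hominis": ["Eukaryota", "Alveolata", "Apicomplexa"],
}


def build_tree(lineages):
    """Nest each species under the shared prefixes of its lineage."""
    root = {}
    for species, ranks in lineages.items():
        node = root
        for name in ranks + [species]:
            node = node.setdefault(name, {})
    return root


def _children_to_newick(node):
    """Serialise the children of one node, labelling internal nodes."""
    parts = []
    for name, child in sorted(node.items()):
        label = name.replace(" ", "_")
        if child:
            parts.append("(" + _children_to_newick(child) + ")" + label)
        else:
            parts.append(label)
    return ",".join(parts)


def to_newick(tree):
    """Return a Newick string (topology only, no branch lengths)."""
    return "(" + _children_to_newick(tree) + ");"


print(to_newick(build_tree(LINEAGES)))
```

The resulting string can be pasted into any Newick viewer. A tree built this way reflects the taxonomy's nesting only; it carries no branch lengths and cannot resolve relationships the taxonomy leaves unresolved.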

Study of giant viruses shakes up tree of life

A new study of giant viruses supports the idea that viruses are ancient living organisms and not inanimate molecular remnants run amok, as some scientists have argued. The study may reshape the universal family tree, adding a fourth major branch to the three that most scientists agree represent the fundamental domains of life.

The new findings appear in the journal BMC Evolutionary Biology.

The researchers used a relatively new method to peer into the distant past. Rather than comparing genetic sequences, which are unstable and change rapidly over time, they looked for evidence of past events in the three-dimensional, structural domains of proteins. These structural motifs, called folds, are relatively stable molecular fossils that -- like the fossils of human or animal bones -- offer clues to ancient evolutionary events, said University of Illinois crop sciences and Institute for Genomic Biology professor Gustavo Caetano-Anollés, who led the analysis.

"Just like paleontologists, we look at the parts of the system and how they change over time," Caetano-Anollés said. Some protein folds appear only in one group or in a subset of organisms, he said, while others are common to all organisms studied so far.

"We make a very basic assumption that structures that appear more often and in more groups are the most ancient structures," he said.

Most efforts to document the relatedness of all living things have left viruses out of the equation, Caetano-Anollés said.

"We've always been looking at the Last Universal Common Ancestor by comparing cells," he said. "We never added viruses. So we put viruses in the mix to see where these viruses came from."

The researchers conducted a census of all the protein folds occurring in more than 1,000 organisms representing bacteria, viruses, the microbes known as archaea, and all other living things. The researchers included giant viruses because these viruses are large and complex, with genomes that rival -- and in some cases exceed -- the genetic endowments of the simplest bacteria, Caetano-Anollés said.
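The census idea, together with the assumption that widely shared folds are the oldest, can be illustrated with a toy ranking. This is a rough sketch only: the proteome names, fold identifiers and the `rank_folds` helper are invented for illustration, and the study itself built phylogenies from fold data rather than a simple ranking like this one.

```python
from collections import Counter

# Hypothetical toy census: proteome name -> (supergroup, set of fold identifiers).
CENSUS = {
    "bacterium_1": ("Bacteria", {"fold_a", "fold_b"}),
    "archaeon_1": ("Archaea", {"fold_a", "fold_c"}),
    "eukaryote_1": ("Eukarya", {"fold_a", "fold_b", "fold_d"}),
    "giant_virus_1": ("Viruses", {"fold_a", "fold_b"}),
}


def rank_folds(census):
    """Order folds by number of supergroups, then number of proteomes, containing them."""
    proteome_counts = Counter()
    groups = {}
    for group, folds in census.values():
        for fold in folds:
            proteome_counts[fold] += 1
            groups.setdefault(fold, set()).add(group)
    # Folds found in more groups and more proteomes come first (assumed "most ancient").
    return sorted(
        proteome_counts,
        key=lambda f: (-len(groups[f]), -proteome_counts[f], f),
    )


print(rank_folds(CENSUS))
```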

"The giant viruses have incredible machinery that seems to be very similar to the machinery that you have in a cell," he said. "They have complexity and we have to explain why."

Part of that complexity includes enzymes involved in translating the genetic code into proteins, he said. Scientists were startled to find these enzymes in viruses, since viruses lack all other known protein-building machinery and must commandeer host proteins to do the work for them.

In the new study, the researchers mapped evolutionary relationships between the protein endowments of hundreds of organisms and used the information to build a new universal tree of life that included viruses. The resulting tree had four clearly differentiated branches, each representing a distinct "supergroup." The giant viruses formed the fourth branch of the tree, alongside bacteria, archaea and eukarya (plants, animals and all other organisms with nucleated cells).

The researchers discovered that many of the most ancient protein folds -- those found in most cellular organisms -- were also present in the giant viruses. This suggests that these viruses appeared quite early in evolution, near the root of the tree of life, Caetano-Anollés said.

The new analysis adds to the evidence that giant viruses were originally much more complex than they are today and experienced a dramatic reduction in their genomes over time, Caetano-Anollés said. This reduction likely explains their eventual adoption of a parasitic lifestyle, he said. He and his colleagues suggest that giant viruses are more like their original ancestors than smaller viruses with pared down genomes.

The researchers also found that viruses appear to be key "spreaders of information," Caetano-Anollés said.

"The protein structures that other organisms share with viruses have a particular quality, they are (more widely) distributed than other structures," he said. "Each and every one of these structures is an incredible discovery in evolution. And viruses are distributing this novelty," he said.

Most studies of giant viruses are "pointing in the same direction," Caetano-Anollés said. "And this study offers more evidence that viruses are embedded in the fabric of life."

The research team included graduate student Arshan Nasir and Kyung Mo Kim, of the Korea Research Institute of Bioscience and Biotechnology.

A Tree of Life Grows in Texas

Tandy Warnow, David Bruton, Jr. Centennial Professor in Computer Science.

Scientists continue to refine, and sometimes radically alter, our understanding of the “Tree of Life”: the ways in which species are related to one another. Researchers are using the computing power of the Texas Advanced Computing Center (TACC) at The University of Texas at Austin to better understand the origin of species and, ultimately, help fight disease and develop better crops.

Evolutionary history was once inferred from bones, skeletons and other morphological clues. Today DNA is the main source of evidence in the story of how the Earth became such a diverse place.

Phylogenetics is the branch of life science that studies the evolutionary relationships among organisms based on genetic evidence. By aligning the molecular sequences of different species, scientists can see how organisms differ at the genetic level, determine where they diverged and map out branching trees of relationships based on the alignments.
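As a small illustration of how an alignment exposes genetic differences, the sketch below computes the proportion of differing sites (the so-called p-distance) between aligned sequences. The sequences are made up, and real analyses use substitution models rather than raw proportions; `p_distance` is a hypothetical helper name.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0


# Made-up aligned sequences, for illustration only.
ALIGNMENT = {
    "species_1": "ACGTACGT",
    "species_2": "ACGTACGA",
    "species_3": "AC-TTCGA",
}

names = sorted(ALIGNMENT)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(ALIGNMENT[first], ALIGNMENT[second])
        print(f"{first} vs {second}: {d:.3f}")
```

A matrix of such pairwise distances is one possible input to distance-based tree-building methods.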

With the cost of gene sequencing declining, researchers are performing more phylogenetic studies. Even so, the process of lining up tens of thousands of sequences from hundreds or thousands of species is incredibly complicated, even for a computer.

“The most accurate trees are estimated using methods that try to solve hard optimization problems,” said Tandy Warnow, professor of computer science at The University of Texas at Austin and a Guggenheim Fellow.

“While those solutions can be done on small data sets or moderate sized data sets, on large data sets, they can take a very long time — weeks to months to years of computational time. The Texas Advanced Computing Center ends up being essential for those problems.”

TACC, on the J.J. Pickle Research Campus in north Austin, runs some of the biggest and most powerful systems in the world, but even its supercomputers can hardly keep up with the pace of genetic research. Moore’s law, as it is popularly stated, has the performance of computers doubling roughly every two years. However, the ability of gene sequencers to create data has grown at an even faster rate.

“It’s a different kind of challenge,” Warnow said. “It’s not just how we run analyses on big data sets, but how do we access the data in a way that is sensible?”

Divide and Conquer

Warnow is working with postdoctoral fellow Kevin Liu of Rice University and Siavash Mirarab, a Ph.D. student in computer science at The University of Texas at Austin, to create smarter, faster and more accurate algorithms to apply to some of the biggest data sets ever created.

This phylogenetic tree, created by David Hillis, Derrick Zwickl and Robin Gutell, depicts the evolutionary relationships of about 3,000 species throughout the Tree of Life. Less than 1 percent of known species are depicted.

With a $1.5 million grant from the National Science Foundation (through the Assembling the Tree of Life project), the researchers have developed software that allows computers to draw better evolutionary trees more quickly.

It’s called SATé — Simultaneous Alignment and Tree Estimation — and uses a novel divide-and-conquer approach.

“By dividing a really big data set that’s hard to align into small data sets that are closely related, you can get good estimates on each subset and then get an alignment on the full data set,” Warnow explained.

Massive supercomputers, such as Ranger at TACC, align the sequences of each subset and combine the alignments into an alignment on the full set of sequences.

There’s no way to know whether the tree that emerges from these simulations is absolutely accurate. Some trees are obviously wrong — for example, those that show humans and crocodiles on the same branch, separated from chimps — but most are probable.

For that reason, SATé uses a statistical method to provide a maximum likelihood score: a measure by which to assess its accuracy against other answers. SATé repeats the process of alignment and tree-building many times until a tree with the highest likelihood score is reached.
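The control flow described above (divide, align the pieces, merge, estimate a tree, score it, repeat) can be sketched roughly as follows. This is not SATé's actual code. The recursive halving stands in for whatever decomposition the real software uses, the fixed number of rounds is an assumption, and `run_round` is a hypothetical caller-supplied function where the real alignment and likelihood calculation would go.

```python
def split_into_subsets(taxa, max_size):
    """Recursively halve an ordered list of taxa until every piece is small enough."""
    if len(taxa) <= max_size:
        return [list(taxa)]
    mid = len(taxa) // 2
    return (split_into_subsets(taxa[:mid], max_size)
            + split_into_subsets(taxa[mid:], max_size))


def keep_best(taxa, run_round, rounds, max_size):
    """Repeat divide -> align -> estimate tree, keeping the best-scoring tree.

    run_round(subsets) is assumed to align and merge the subsets, estimate a
    tree, and return (tree, likelihood_score, taxa_order_for_next_round).
    """
    best_tree, best_score = None, float("-inf")
    order = list(taxa)
    for _ in range(rounds):
        subsets = split_into_subsets(order, max_size)
        tree, score, order = run_round(subsets)
        if score > best_score:
            best_tree, best_score = tree, score
    return best_tree, best_score


# Stand-in for the real alignment and tree-estimation step: it only replays
# made-up scores so that the loop can be exercised.
_fake_scores = iter([-120.0, -95.5, -101.2])


def fake_round(subsets):
    flat = [name for subset in subsets for name in subset]
    score = next(_fake_scores)
    return f"tree_scored_{score}", score, flat


best_tree, best_score = keep_best(
    ["t1", "t2", "t3", "t4", "t5"], fake_round, rounds=3, max_size=2
)
print(best_tree, best_score)
```

The point of the design is that each expensive alignment runs on a small, closely related subset, while the outer loop only compares scores and keeps the best tree seen so far.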

In software development, the best products are not just the newest, but the ones that are proved to be better than the alternatives. To this end, Warnow and her team have been working as quality assurance and reliability testers, solving hard evolutionary tree problems multiple times, with different methods and parameters, to ensure that SATé produces the highest-quality result.

In work first reported in the journal Science and later explored in the journals PLoS Currents and Systematic Biology, the researchers have shown repeatedly that SATé works as well as the alignment and tree estimation methods that are commonly used, which analyze trees as single units. SATé, however, is either far faster or achieves greater accuracy in the same amount of time.

For the Birds

Warnow and her team also collaborate with evolutionary biologists on projects in which their guidance can lead to new insights.

Since Charles Darwin’s day, scientists have debated the evolutionary history of flightless birds, known as ratites. How did so many similar species get to such far-flung corners of the Earth?

“The theory of continental drift provided a convenient answer,” said Michael Braun, a curator in the department of systematic biology at the Smithsonian Institution. “These birds evolved from a common flightless ancestor and then drifted to their current distributions. For 40 years, this remained the textbook explanation of species dispersal.”

That is until Braun discovered through DNA analysis that an ancient (but still living) family of birds found in South America, the tinamou, was one of the most closely related groups to emus and ostriches. But the tinamou could fly — a finding first reported in 2009.

This fact, combined with the lack of skeletal evidence of flightless birds before the continents broke apart, led to a re-conceptualization of the ratite branch of the avian tree. Ratites were in fact descended from flying birds that traveled to places where flight was no longer an evolutionary advantage and consequently lost their ability to fly.

“It’s hard to recognize the relationships among species using just morphology, but when we can use the molecules and appropriate analytical methods to find the relationships, it helps us understand better how that adaptive evolution has occurred,” Braun said.

Recently, Warnow worked with Braun, using SATé, to reanalyze his controversial findings. Their study confirmed the evolutionary relationship that Braun found.

Emergency Phylogenetics

Better, faster, more accurate phylogenetic methods can have a life or death impact for humans.

The Centers for Disease Control and Prevention uses sequence alignment and evolutionary tree-building tools when a new virus emerges to determine where it might have come from and how it differs from previous viruses.

Plant scientists also use tree-building tools to determine which genes are associated with positive traits such as hardiness and drought tolerance. This knowledge is enabling scientists to breed more productive crops, helping to feed the world.

But none of these problems is easily solved.

“Many research groups are estimating trees containing anywhere from a few thousand to hundreds of thousands of species, towards the eventual goal of estimating a Tree of Life, containing perhaps as many as several million leaves,” Warnow wrote in a recent article in Systematic Biology. “These phylogenetic estimations present enormous computational challenges, and current computational methods are likely to fail to run even on datasets in the low end of this range.”

In other words, small problems may be within reach, but the big ones remain.

“It’s not getting any easier, but it is getting more fun,” Warnow said.

By Aaron Dubrow, originally published on the Texas Advanced Computing Center website.

The Tree of Life Exercise

The image below is an example of what the tree of life exercise will look like once complete. I was able to complete this rough draft in about an hour. The instructions below will describe how you can create your own.

The first step of course is to draw a tree. I’ve included a video below that should help if you feel lost. However, I should note that–at least for your first draft–it might be helpful to keep it rough. You can always go back later and redraw or touch-up your existing drawing for aesthetics. This round is all about getting the information down.

Next, follow the labeling instructions below. If you can only think of one or two things per section at a time, don’t worry about it. The nature of this exercise is that as you complete each step, it unlocks more memories and ideas for other parts. You can skip around and fill things in at any time. The most helpful thing in the beginning is to just write stuff down and see where it takes you. You might be surprised!

The Compost Heap (Optional–But Highly Recommended!)

In your compost heap, write down anything that would normally go in the other sections described below but that you no longer want to be defined by.

These are often sources of trauma, abuse, cultural standards of normality/beauty/etc. or anything else that shapes negative thoughts about yourself in your mind. You can write down places, people, problems, experiences. Whatever you need to.

I blurred mine out above, but you can see it has several items. Generally they all have to do with past trauma and damaging relationships I’m trying to let go of. I’ve found that the idea of a compost heap is an extremely helpful way to think about these things. Especially since many of them are not neatly categorized as “all bad”.

There are in fact quite a few life defining lessons I learned through the things that ended up in my compost heap. And like a compost heap is supposed to do, I will eventually break those things down and re-sow the rich parts back into my life.

You can do the same with yours.

The Roots

Write down where you come from on the roots. This can be your home town, state, country, etc. You could also write down the culture you grew up in, a club or organization that shaped your youth, or a parent/guardian.

The Ground

Write down the things you choose to do on a weekly basis on the ground. These should not be things you are forced to do, but rather things you have chosen to do for yourself.

The Trunk

Write your skills and values on the trunk. I chose to write my values starting at the base of the trunk going up. I then transitioned into listing my skills. For me this felt like a natural progression from roots to values to skills.

The Branches

Write down your hopes, dreams, and wishes on the branches. These can be personal, communal, or general to all of mankind. Think both long and short term. Spread them around the various branches.

The Leaves

Write down the names of those who are significant to you in a positive way. Your friends, family, pets, heroes, etc.

The Fruits

Write down the legacies that have been passed on to you. You can begin by looking at the names you just wrote on leaves and thinking about the impact they’ve had on you and what they’ve given to you over the years. This can be material, such as an inheritance, but most often this will be attributes such as courage, generosity, kindness, etc.

(Tip: if your tree is pretty crowded by this point, perhaps try drawing some baskets of fruit at the base of your tree and label them accordingly there.)

The Flowers & Seeds

Write down the legacies you wish to leave to others on the flowers and seeds.

(Tip: again, you may wish to de-clutter your drawing by visualizing saplings, baskets of flowers, etc. on which to write these items down.)

Conceptual frameworks: Reinterpreting the TOL

Many substantial advances have been made in phylogenetic methods, including the development of sophisticated evolutionary models, tree-building techniques (including faster tools suited to the analysis of genome-wide data sets) and reliability estimates of tree inferences, as well as databases and other computational tools. In this section, we are primarily concerned with the concepts that underpin these methods and their respective results. Specifically, our focus is how the TOL has been reconceptualized in light of the fact that the more molecular data is analysed, the more difficult it is to interpret straightforwardly the evolutionary histories of those molecules. Rather than renouncing the universal tree, many evolutionary biologists have instead elected to restructure their understanding of the TOL in relation to bodies of data and what can be done with them. We will outline a variety of positions that encompass an ever more extensive range of modifications to the basic TOL concept (Figure 1). These stances range from "business as usual" on the basis of finding clear signals of the one true TOL, to a perspective in which local trees are seen as just occasional structures in the "real" web of life. All these positions draw on Darwin's tree metaphor, and they also overlap and feed into one another in various ways, but each commands a distinct conceptual space for itself.

Figure 1. Conceptual frameworks of the TOL in relation to Darwin's tree simile.

1. Trees of genes as trees of species

Trees of gene and protein sequences are typically considered most valuable when they can be justified as representing trees of species. To achieve this representational status, a gene or a set of genes has to meet certain criteria for genealogical markers. The first two related criteria are the most obvious ones: i) a gene has to be (nearly) universal, i.e., represented by readily recognizable orthologs (preferably single-copy) in all cellular life forms; ii) the sequence of the gene in question has to be sufficiently conserved to allow the construction of an unambiguous alignment and an informative tree. The third criterion is more controversial and harder to apply: a gene used for the construction of a reference tree has to be minimally prone to HGT. Genes favoured under these criteria include those for ribosomal RNA, ribosomal proteins, elongation factors, RNA polymerases and several other (nearly) universal, highly conserved genes [28, 29]. A few of these markers are considered to be so evolutionarily "special" that they have become the basis of reference trees for the whole TOL [30, 15]. The problems of the most well known reference trees, the trees of 16S and 18S rRNA genes, have been frequently discussed (e.g., [31, 32]). Nevertheless, for many evolutionary biologists, the concept of a reference tree can still be justified as long as its limitations are understood (e.g., [33]).

However, researchers -- even if they continue to use reference trees -- increasingly recognize that single-gene trees, and even composite multi-gene trees, might obscure more than they reveal. These trees cannot take into account non-bifurcating patterns from major evolutionary events, such as endosymbiosis, co-evolving symbioses, hybridization and any other occurrences of lineage fusion [34–37]. More generally, HGT is now recognized as a major factor of evolution in the prokaryote world. Treating all these non-tree-like processes as problems that obscure the "true" TOL greatly skews and limits the understanding of evolutionary history that is one of the central goals of evolutionary biology -- along with understanding processes and patterns of evolution [38].

The second way of relating gene trees to species trees is to think of the gene trees as contained "within" the species tree. This route is particularly attractive for the systematics of organisms for which there is already a widely accepted phylogenetic placement in the TOL (primarily multicellular eukaryotes), but it has also had appeal for prokaryote phylogenetics. An obvious problem is that the species tree has to be "predetermined" in order to pick and construct the right gene trees (e.g., [39, 40]), and this makes the phylogeny circularly presuppose its conclusion. However, as with the previous conceptual relationship between gene and species trees, discordance between trees for individual genes -- not just in prokaryotes, and not just because of HGT -- has also led to fundamental questions about whether gene trees could be simply understood as tracing a history "within" a known species tree [41–43]. "In considering these issues", wrote phylogeneticist Wayne Maddison,

"one is provoked to consider precisely what is phylogeny. Perhaps it is misleading to view some gene trees as agreeing and other gene trees as disagreeing with the species tree; rather, all of the gene trees are part of the species tree, which can be visualized like a fuzzy statistical distribution, a cloud of gene histories" [41].

Rather than gene trees being contained within species trees or standing in for them, new concepts of the TOL and of evolutionary history in general began to be articulated with the increasing availability of comparative genomic data. Because the species tree is generally perceived as the true aim of phylogeny (or at least used to be until recently), new modelling techniques have been devised and broader treatments of data developed in order to represent the species tree less problematically. Given the abundance of molecular data, a major investment has been made in attempts to reconstruct trees of genomes. In the process, the very concept of "species tree" (and thus TOL) has been revised.

2. Trees of genomes as trees of cells

Under the broad banner of phylogenomics, efforts to reconcile inconsistent data and resolve the branching order of all lineages of life have been developed and championed [44]. Phylogeneticists are obliged to believe that traces of vertical signal can be detected amongst the evolutionary noise (although these very categories imply certain expectations) and so are torn between interpreting such signal as the central truth of the evolutionary history or as an indication of limited genetic relatedness that is not necessarily central to our understanding of evolution. A major outcome of the attempts to understand the relationship between putative signal and noise in genomic data has been the generation of new concepts of the TOL. While there are several methodological routes involved [45, 46], two streams of genome tree construction illustrate this tension due to their substantially different underlying ways of thinking about the TOL.

Core genome approaches are concerned with an evolutionarily stable core of genes that can be taken to represent the organismal lineage, which is seen as the process of binary genome replication and cell division (thus justifying the tenet of bifurcation). In accordance with the previously mentioned criteria for the choice of reference genes, this approach seeks to identify genes that are widely represented in genomes, and most importantly, that produce congruent phylogenetic signals (e.g., [47–52]). A degree of success has been achieved under this conceptual framework (different methods may be employed), with the identification of universal genes that appear to track the same evolutionary story. There are many questions, however, about whether the trees generated for this purpose, particularly concatenated sequence trees, are methodological artefacts [53], and whether such analyses say much about the TOL or simply produce a partially distorted history of several genes.

Perhaps the biggest problem with such an approach is how well the identified cores represent the evolutionary history of the organisms and genomes that contain them. The (nearly) universal gene core of cellular life is extremely small and functionally skewed. One much scrutinized core analysis examined genomes of 191 species from all three domains of life but was able to identify only 31 universal genes, primarily those for ribosomal proteins [54]. Prokaryote genomes typically contain between 1,000 and 4,000 genes, so any tree built on the basis of 31 genes is a highly reduced representation of the intended TOL -- "a tree of 1%" in a famously trenchant criticism [36]. More generally, the fact that all genes in prokaryote genomes are likely to have experienced at least one HGT event in the 3.5 billion year history of cellular genomes means that no pristinely untransferred core exists [55]. The core approach might, therefore, be better interpreted as concerned with a "least transferred" subset of genes. In that case, the core would be a "fuzzy" gene set displaying a particular statistical trend, rather than a precisely defined set, and this is the conceptual space another version of the genome-based TOL inhabits.
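The notion of a universal core, and the arithmetic behind "a tree of 1%", can be illustrated with a short sketch. The genomes and gene identifiers here are invented, and treating the core as a plain set intersection is a simplifying assumption; real analyses must also establish orthology and handle multi-copy genes.

```python
# Hypothetical toy genomes: name -> set of gene family identifiers.
GENOMES = {
    "genome_1": {"gene_a", "gene_b", "gene_c", "gene_x"},
    "genome_2": {"gene_a", "gene_b", "gene_c", "gene_y"},
    "genome_3": {"gene_a", "gene_b", "gene_x", "gene_z"},
}


def universal_core(genomes):
    """Gene families present in every genome (a strict, non-fuzzy core)."""
    gene_sets = list(genomes.values())
    return set.intersection(*gene_sets) if gene_sets else set()


def core_fraction(core_size, genome_size):
    """Share of a genome of the given size that the core represents."""
    return core_size / genome_size


print(sorted(universal_core(GENOMES)))

# 31 universal genes against the 1,000 to 4,000 genes quoted in the text.
for genome_size in (1000, 4000):
    print(f"{genome_size} genes: {core_fraction(31, genome_size):.1%}")
```

For the figures quoted in the text, 31 genes amount to roughly 0.8% to 3.1% of a prokaryote genome, which is where the "1%" characterisation comes from.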

Central trend approaches are built on the quantification of more and less transfer. They combine individual trees of genes in order to foreground vertical tree patterns against the much more complicated backdrop of the "forest" of life [56–60]. Such conceptualizations factor in the pervasiveness of HGT but search for an indicative message of vertical descent from the composite data. This trend, composed of the most universal signal, can usually be picked up only faintly at deep phylogenetic levels, except for the signal of bifurcation between archaea and bacteria [57]. It may not be possible, ultimately, to recover any other details of deep branching and even tree tips may remain in doubt for some lineages [55, 61, 62]. Nevertheless, for some of these supertree constructions, a "modal information" TOL seems to emerge strongly enough to be a "backbone" tree that is merely draped with some fine "cobwebs" of HGT [60].
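One simple way to picture a "central trend" is to count how often each grouping of taxa recurs across many gene trees and keep the groupings found in most of them. The sketch below does this with majority-rule clade counting on invented toy trees. It is a deliberately simplified stand-in, not the supertree or trend-scoring methods of the cited studies.

```python
from collections import Counter


def clades(tree):
    """Non-trivial clades (frozensets of leaf names) in a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the clade containing every leaf is uninformative
    return found


def majority_clades(gene_trees, threshold=0.5):
    """Clades present in more than `threshold` of the gene trees, with their support."""
    counts = Counter(c for tree in gene_trees for c in clades(tree))
    total = len(gene_trees)
    return {c: n / total for c, n in counts.items() if n / total > threshold}


# Invented gene trees: two agree, one conflicts (as transfer might cause).
GENE_TREES = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]

for clade, support in majority_clades(GENE_TREES).items():
    print(sorted(clade), round(support, 2))
```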

While none of these analyses see the central trend as the majority signal in the forest, they do recognize it as an extremely important one. In one case, when using a specially designed "tree-net trend" score, the central tree-like trend amounts to approximately 40% of the total information on prokaryote evolution [58]. But is such a "statistical tree" what is traditionally meant by the TOL? This was certainly not how the TOL was conceived in the first era of molecular phylogeny before the recognition that different genes might have distinct evolutionary histories. The statistical TOL approach also involves the acknowledgement that averaging out the signal from different gene trees may produce artefactual trees while obscuring relevant aspects of evolution [63]. The willingness to make this transition may have more to do with the perceived epistemological function of the TOL (which we explore below), than with commitments to the ontology of the tree (e.g., its "realness").


To construct the AnnoTree database, we re-annotated all 28 941 prokaryotic genomes in the GTDB (Release 03-RS86) using a consistent annotation pipeline. Following gene prediction, we assigned functional annotations [Pfam protein families (10), TIGRFAM protein families (18) and KEGG Orthology (KO) identifiers (28)] to protein sequences using standard confidence score thresholds, resulting in 106 856 093 Pfam, 27 624 080 TIGRFAM, and 67 878 984 KEGG annotations. All taxonomic information, protein sequences, and functional annotations are stored in a back-end MySQL database for rapid retrieval by the front-end AnnoTree application (Figure 1). To enable phylogenetic visualization of all 28 941 prokaryotic genomes, AnnoTree divides the bacterial and archaeal trees of life into distinct views by each major taxonomic level. A user can explore the phylogenetic distribution of a trait anywhere from the phylum to genome level in either taxonomic domain. Additionally, AnnoTree can be used to explore custom trees and datasets (see Data Availability).

Figure 1. Data flow in the AnnoTree application. Raw values and computed features derived from data obtained from the GTDB are stored in a MySQL database that will be updated to match revisions made to the GTDB. Users can access data relevant to their queries in the form of figures and tables that are rendered in their browser. The figures themselves and the data used to generate them can be downloaded in various file formats from the AnnoTree interface.

AnnoTree can be queried in several ways: by Pfam protein family, TIGRFAM protein family, KO term, or taxonomic name/id. Annotation queries can be filtered by their corresponding confidence scores such as E-value and percent alignment. Additionally, species that appear in a BLAST result can be visualized by uploading the BLAST XML2 output file directly. AnnoTree will then generate a ‘painted’ phylogeny using root-to-tip coloring for all lineages containing matches to the query (Figure 2). Visualizations are also accompanied by basic taxonomic information and distribution summary statistics based on GTDB nomenclature (Figure 2). Publication-quality SVG images, Newick formatted phylogenies for any selected subset of the tree, and taxonomic distribution tables of all queries can be downloaded for offline analysis or editing. Confidence scores (E-values) and options for downloading protein sequences for each annotation in a genome or lineage are displayed within a pop-up window when a colored node is selected on the tree.

Figure 2. AnnoTree interface overview. AnnoTree can be queried with any number of KO identifiers, Pfam families, TIGRFAM families, or NCBI taxon identification numbers to display a mapping of those traits on the GTDB tree at any resolution. Lineages containing at least one genome with the query annotation(s) are highlighted in red. A circle chart displays a taxonomic summary of the genomes containing the flagellin gene (KO identifier: K02406) at a chosen taxonomic level. Smaller trees below show the interactive view when different taxonomic levels are selected by the user. When a highlighted node is clicked, a window appears (not shown in figure) displaying basic taxonomic information, zooming options, and annotation confidence scores.

Since all data is precomputed, users can explore the phylogenomic distribution of any combination of gene families within seconds. As an example, the recent metagenomics-driven discovery of comammox bacteria (29, 30) can be reproduced through a simple AnnoTree query by searching for genomes possessing all three key genes that act as a signature for comammox activity: KO terms K00371 (nxrB), K10944 (amoA) and K10535 (hao). Highlighted in the tree are the known comammox species (i.e. organisms within the genus Nitrospira), along with several additional taxa implicated as having potential comammox-like activity (e.g. Crenothrix) (Supplementary Figure S1).
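Conceptually, a query for genomes possessing all three genes is a set intersection over per-gene genome lists. The following is a minimal sketch of that idea only: the mapping `ko_to_genomes`, the helper `genomes_with_all`, and the genome names are hypothetical placeholders, not AnnoTree data or its API.

```python
# Hypothetical KO -> genomes table; genome names are invented placeholders.
ko_to_genomes = {
    "K00371": {"genome_A", "genome_B", "genome_C"},  # nxrB
    "K10944": {"genome_A", "genome_C", "genome_D"},  # amoA
    "K10535": {"genome_A", "genome_C", "genome_E"},  # hao
}


def genomes_with_all(kos, table):
    """Return the genomes annotated with every KO identifier in `kos`."""
    sets = [table.get(ko, set()) for ko in kos]
    # An empty query matches nothing rather than everything.
    return set.intersection(*sets) if sets else set()


candidates = genomes_with_all(["K00371", "K10944", "K10535"], ko_to_genomes)
print(sorted(candidates))
```

Only genomes present in all three sets survive the intersection, which is why a three-gene signature narrows the highlighted lineages so sharply compared with any single-gene query.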

As a second example, the recent discoveries of homologs of important bacterial toxins outside of their respective bacterial lineages can be reproduced and visualized phylogenetically using simple AnnoTree queries. A query with Pfam PF01742 (botulinum neurotoxin protease) reveals a taxonomic distribution outside of Clostridium including the lineages Weissella and Chryseobacterium, consistent with earlier analyses ( 31, 32) ( Supplementary Figure S2 ). Similarly, a search with the diphtheria toxin domains (PF02763 or PF02764) reveals homologs in related genera Streptomyces and Austwickia, again reproducing recent analyses ( 33) almost instantaneously ( Supplementary Figure S3 ). These examples illustrate the use of AnnoTree as a hypothesis-generating tool by revealing distributions of gene families that may be new or unexpected to users.

Lineage-specific gene families

As an initial exploration of the data within AnnoTree, we examined the distributions of all 77 004 395 bacterial Pfam and KO annotations when mapped onto the bacterial GTDB tree of life (Release 02-RS83). Based on the phylogenetic conservation score (τD) (22), 68.1% of KO identifiers and 60.0% of Pfam protein families had significantly non-random phylogenomic distributions (P < 0.05), revealing a greater phylogenetic congruency for KO predictions than Pfam predictions. Next, we analyzed the distributions of Pfam and KO annotations, and used standard binary classification metrics to identify those with strong lineage-specificity (see Methods) (Supplementary Data File S1). Extremely lineage-specific families were identified as those with both very high (≥95%) precision (percentage of genomes in the clade containing a trait) and very high (≥95%) sensitivity (percentage of trait-containing genomes occurring in the clade). Based on these criteria, we identified 358 (3.2%) Pfam protein families and 152 (0.9%) KO identifiers with lineage-specific distributions in Bacteria. We observed a trend in which lineage-specific KO identifiers and Pfam protein families increase in frequency from higher (e.g. phylum) to lower (e.g. species) taxonomic levels (Supplementary Figure S4), consistent with the idea that gene family taxonomic distributions tend to diversify over time and that HGT impacts evolution over short evolutionary timescales (34). Although lineage-specific families are relatively rare at high taxonomic levels, these cases often represent ancient, clade-defining bacterial innovations. Examples include K18955 (WhiB family transcriptional regulator) in the Actinobacteria, PF07542 (ATP12 chaperone) in the Alphaproteobacteria, and numerous photosynthesis-related genes within the Cyanobacteria (class Oxyphotobacteria).
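The two criteria can be sketched directly from the definitions given above. This is an illustrative sketch, not the authors' pipeline: the function names and the toy genome labels are assumptions, and only the 95% thresholds and the definitions of precision and sensitivity come from the text.

```python
def lineage_specificity(clade_genomes, trait_genomes):
    """Return (precision, sensitivity) as defined in the text.

    precision   = fraction of genomes in the clade that contain the trait
    sensitivity = fraction of trait-containing genomes that fall in the clade
    """
    clade = set(clade_genomes)
    trait = set(trait_genomes)
    both = clade & trait
    precision = len(both) / len(clade) if clade else 0.0
    sensitivity = len(both) / len(trait) if trait else 0.0
    return precision, sensitivity


def is_lineage_specific(clade_genomes, trait_genomes, threshold=0.95):
    """True when both metrics reach the threshold (95% in the text)."""
    precision, sensitivity = lineage_specificity(clade_genomes, trait_genomes)
    return precision >= threshold and sensitivity >= threshold


# Toy example: a clade of 20 genomes, 19 of which carry the trait.
clade = {f"g{i}" for i in range(20)}
trait_inside_only = {f"g{i}" for i in range(19)}
trait_with_outsiders = trait_inside_only | {f"x{i}" for i in range(5)}

print(is_lineage_specific(clade, trait_inside_only))
print(is_lineage_specific(clade, trait_with_outsiders))
```

In the toy data the first trait passes (precision 19/20, sensitivity 19/19), while the second fails because five trait-containing genomes lie outside the clade and pull sensitivity down to 19/24.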

Lineage-specific gene families can provide insights into the unique biology of their respective organisms. For example, eight lineage-specific Pfam and KO annotations were detected within the Endozoicomonas subtree, a clade of endosymbiotic bacteria that inhabit numerous marine eukaryotic hosts (35). Consistent with possible utilization of host processes, the lineage-specific genes detected within this clade appear to be of eukaryotic origin and include genes involved in cytoskeletal organization (PF01302), eukaryotic cell–cell signaling (PF00812), apoptosis inhibition (K010343, K010344, K04725, PF07525) and eukaryotic proteolysis (K01378). Given the occurrence of numerous lineage-specific gene families in Endozoicomonas, we asked whether lineage-specific gene families may be overrepresented in certain taxa or branches of the bacterial tree. Indeed, lineage-specific genes were significantly enriched in specific taxonomic groups. Notable examples include 37 Pfam protein families within the Bacillus_A genus, and 19 Pfam protein families within the Actinobacteria that are largely composed of proteins of unknown function. We also observed an overrepresentation of lineage-specific gene families in numerous well-studied pathogens (e.g. Bordetella, Helicobacter, Legionella and Vibrio) (Supplementary Figures S5–S7; Supplementary Data File S1). This is in part due to the presence of lineage-specific virulence factors and toxins, but is also likely influenced by annotation bias towards organisms of biomedical interest (36).

Gene families with patchy distributions

Although 60–68% of functional annotations show a significant phylogenetic signal when mapped onto the tree, more surprising are the remaining 30–40% that show more random phylogenetic distributions, potentially reflecting the widespread horizontal transfer and/or frequent gene gain/loss that is known to occur in bacterial genomes ( 37, 38). To investigate this further, we ranked all Pfam and KEGG annotations according to their phylogenetic patchiness, determined by homoplasy score (total number of gains and losses by parsimony) normalized by gene family size after filtering out traits with family size <50 ( Supplementary Data File S2 , see Materials and Methods ). Next, we grouped KO terms into their higher-level functional categories for visual comparison of broader trends (Figure 3, Supplementary Data File S3 ). Not surprisingly, ‘viral’ (bacteriophage) genes ranked the highest in homoplasy in both Pfam and KEGG annotations, and therefore are the single most phylogenetically scattered class of genes in bacteria. In contrast, gene functions with extremely low homoplasy include sporulation, photosynthesis, and core processes such as transcription, replication and protein synthesis (Figure 3). Highly scattered genes showed significant overrepresentation among specific taxonomic groups such as the genera Pseudomonas_E, Streptomyces, and Mycobacterium ( Supplementary Data Files S4 and S5 ), suggesting that these taxa may be taxonomic ‘hotspots’ of HGT.

Phylogenetic patchiness of annotations inferred using AnnoTree. Phylogenetic patchiness was computed for each KEGG KO identifier and Pfam protein family using the consistency index (CI), a common homoplasy metric representing the inverse of the minimum possible number of state changes (trait gain or loss) given the tree topology. The final phylogenetic patchiness score is equal to -log(CI)/log(family size) where family size is the total number of genomes containing the trait. (A) Density plot showing the distribution of phylogenetic patchiness scores of Pfam protein families and KO identifiers with different visual examples of varying patchiness (red = present; gray = absent). The phylogenetic distribution plots are, from left to right: K10922 (transmembrane regulatory protein ToxS), K18955 (WhiB transcriptional regulator), PF07542 (ATP12 chaperone), PF01848 (Hok/Sok antitoxin system), and K07495 (putative transposase). (B) Mean-sorted box plots containing phylogenetic patchiness scores of KO identifiers in their respective KEGG pathways and KEGG BRITE categories. The mean patchiness score of a set of KO identifiers in a KEGG pathway or KEGG BRITE category is indicated by a black line.
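As a worked illustration of the score defined in this caption, the sketch below takes the parsimony count of gains and losses on the tree as an input and applies -log(CI)/log(family size). It assumes, following the caption's wording, that CI is the inverse of that count; the function name and the example numbers are hypothetical, and the family-size cutoff of 50 is the filter mentioned in the main text.

```python
import math


def patchiness(n_changes, family_size, min_family_size=50):
    """Patchiness score -log(CI) / log(family size), with CI = 1 / n_changes.

    n_changes   : minimum number of trait gains and losses on the tree (>= 1)
    family_size : number of genomes containing the trait
    Returns None for families below the size filter.
    """
    if family_size < min_family_size:
        return None
    if n_changes < 1:
        raise ValueError("a present trait needs at least one state change")
    # -log(1 / n_changes) equals log(n_changes); this form avoids a negative zero.
    return math.log(n_changes) / math.log(family_size)


# Hypothetical families, each present in 100 genomes.
print(patchiness(1, 100))    # a single gain: fully clade-consistent
print(patchiness(10, 100))   # intermediate scatter
print(patchiness(100, 100))  # one change per genome: maximally scattered
```

Under this reading the score is 0 for a trait explained by one gain and approaches 1 as the number of required changes approaches the number of genomes carrying the trait, so dividing by log(family size) keeps large and small families comparable.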

We then examined in more detail the top 100 gene families that showed the most scattered distributions across the bacterial tree. Not surprisingly, this list of gene families is dominated by transposases, CRISPR- and bacteriophage-associated gene families (Supplementary Data File S2). Numerous gene families of unknown function were included among the most patchy gene families, but further examination revealed that most of these genes are likely bacteriophage-derived. The extreme phylogenetic patchiness of bacteriophage and CRISPR genes is not only consistent with their known evolutionary dynamics but could also reflect the ongoing ‘arms race’ between these two opposing biological forces (phage infection versus phage defense). Other biologically relevant members of the 1% most highly scattered KO genes include: K19057-K19059 (merC, merD, and merR of the mer operon) for mercury resistance; K19155 and K19156, components of a toxin-antitoxin system characterized in E. coli; K15943, K15945, and K16411 for polyketide antibiotic biosynthesis; and K19173-K19175 for DNA backbone S-modification (phosphorothioation) (Supplementary Data File S2).

Reductive dehalogenases

As a case study for the hypothesis generation and data mining strengths of AnnoTree, we selected a gene family of significant biological interest that ranked among the top percentile of homoplasy scores: pcpC tetrachloro-p-hydroquinone reductive dehalogenase (K15241) (Supplementary Data File S2). Because reductive dehalogenases (Rdhs) are key enzymes in bioremediation of chlorinated solvents, their diversity and phylogenomic distribution, and that of organohalide respiring organisms, have been extensively characterized (39). Using AnnoTree, we compiled a dataset of Rdh genes and associated taxa using Pfam query PF13486. Our analysis produced a comprehensive dataset of 1,299 putative Rdh genes from 385 genera and 38 phyla (Supplementary Table S1, Figures S8, S9), which not only recapitulates the known diversity of Rdh-associated phyla, but significantly expands it. In comparison, a manually-curated Rdh-specific database contains 264 Rdh genes from only 19 genera and 6 phyla (39), less than 15% of the total diversity identified by AnnoTree (Supplementary Table S1). The AnnoTree-derived dataset includes several newly predicted rdh-encoding taxa discovered from metagenome-assembled genomes (Supplementary Table S2), including the candidate phyla KSB1 (4 of 6 genomes, rdh copy number = 1) and UBP10 (7 of 14 genomes, rdh copy number = 1), as well as Rhodospirillales UBA2165 (rdh copy number = 13) and Acidobacterium UBA2161 (rdh copy number = 8) (Supplementary Figure S9, Table S2). The novel organisms with high rdh copy numbers are potential obligate organohalide respirers and may be valuable for remediation efforts. By revealing both known and potentially novel groups of organohalide respiring bacteria, the Rdh case study highlights the ability of AnnoTree to capture a broad and complete taxonomic diversity of a gene family, with accompanying hypothesis generation around the evolution and ecology of a function of interest.

On Multiple Trees

TreeJuxtaposer: Scalable Tree Comparison using Focus+Context with Guaranteed Visibility. Tamara Munzner, François Guimbretière, Serdar Tasiran, Li Zhang, Yunhong Zhou. "MunzerComparingTrees.pdf"

The Challenge of Visualising Multiple Overlapping Classification Hierarchies. Martin Graham, Jessie B Kennedy and Chris Hand. "UIDIS"

A Comparison of Set-Based and Graph-Based Visualisations of Overlapping Classification Hierarchies. Martin Graham, Jessie B Kennedy, Chris Hand. ACM 2000. "p41-graham.pdf"

Pullan, M.R., Watson, M.F., Kennedy, J.B., Raguenaud, C. & Hyam, R.: The Prometheus Taxonomic Model: a practical approach to representing multiple classifications. - Taxon 49: 55-75. 2000. "Pullan00Taxon.pdf"

Visualising Multiple Overlapping Classification Hierarchies. PhD. Thesis. Martin James Graham. Napier University, December 2001 "GrahamThesisFinal.pdf"

Conclusion: synthesizing tree-reading frameworks

Our review shows that there are some well-elaborated works on tree-reading skills that thus far have not explicitly referred to each other. The two major systems show different approaches: Halverson and Friedrichsen (2013) consider the total spectrum of learners’ progress in handling evolutionary trees, from absolute novices to longtime experts, in a hierarchical structure. Novick and Catley (2016) use a smaller-scale approach, describing task-oriented skills needed for fully understanding tree-reading. Novick and Catley’s task-oriented system seems suitable for easily generating learning assignments, while Halverson and Friedrichsen’s system seems to constitute a good basis for structuring a complete process of learning by starting to dismantle common misconceptions and then improving skills with increasing difficulty in ordered sequence. The skills proposed by other authors substantiate several skills or skill levels in the skill systems.

In general, our literature overview shows that multiple groups have worked on modeling tree-reading skills, and some major advancements have been made. At the same time, however, it has become clear that there has been no attempt to unify and combine the insights already gained. Publications show only few cross-references to works on tree-reading skills by other authors, leading to mainly singular, not explicitly interlinked approaches. Furthermore, research on tree-thinking skills so far has focused on deducing skills or systems from theory, observation, or experience, and there has been no major attempt to empirically verify the proclaimed models.

Based on the works published on tree-thinking skills (Halverson and Friedrichsen 2013; Novick and Catley 2016) and on skills published by other authors (Blacquiere and Hoese 2016; Meir et al. 2007), we wish to present a proposal for a synthetic hierarchical system of tree-reading skills consisting of six skill levels. This system could at this point be seen as an example of how such a synthesis might look, as it is the result of a theoretical approach drawing together the previous works of different authors.

The hierarchical nature of this system largely follows the hierarchy of Halverson and Friedrichsen’s system (2013), although one minor adjustment of the order has been made, as explained below. The structure of the proposed system, along with the allocation of the proposed skill levels to published skill systems, is also explained below, as well as presented in Table 1 in the form of major ideas.

The hierarchy starts at skill level zero (“naïve handling”). Students at this level are not able to analyze a tree correctly, nor do they know the symbolic meaning of the different components of the tree. Their interpretations of a given tree are largely based on one or more misconceptions, and they tend to over-interpret uninformative facets of the tree diagram. This level corresponds to the first three skills of Halverson and Friedrichsen, which are all characterized by fragmented knowledge of evolutionary trees (Halverson and Friedrichsen 2013).

Skill level one (“identifying structures”) represents the ability to identify and interpret the meaning of diagrammatic elements of the representation. This includes knowledge of the meaning of nodes, branches, labels, and the direction of time, but also slightly more elaborate knowledge, like the positions of MRCAs in the tree. This level corresponds with Halverson and Friedrichsen’s level four (“symbolic use of the representation”), where the students have knowledge of the meaning and importance of diagrammatic features but cannot interpret the diagram any further (Halverson and Friedrichsen 2013).

The second skill level (“handling apomorphies”) encompasses the ability to interpret traits labeled in a tree. This includes tasks in both directions, naming all traits that a taxon shows and listing all taxa that show certain traits. This skill can only be utilized if the given tree shows traits or apomorphies by any representational means (e.g., pictorial or textual, along the lines, with reference markings, etc.). The basis for this skill level is the combination of several skills proposed by Novick and Catley (2016), all of which focus on identifying and interpreting labelled apomorphies [(A) “identify characters,” (B) “identify taxa,” (H) “evolutionary sequence,” and (I) “convergent evolution”]. In Halverson and Friedrichsen’s model, handling apomorphies is part of the extensive skill level (6). It was separated into a distinct skill level, as many evolutionary trees do not show apomorphies, so handling apomorphies is not a skill generally needed to understand every tree, but it can greatly improve the handling of a tree if apomorphies are present (Catley et al. 2010; Novick et al. 2010).

The third skill level (“identifying relationships”) describes the core tasks of tree-reading. This skill covers all tasks that answer questions about the relative relationships of different species and the formation of clades in a given tree. Typical questions at this level are “Which group is the closest relative to group X?”, “Is group X more closely related to group Y than to group Z?”, and “Which groups form a clade with groups X, Y, and Z?” This level corresponds to four of the skills of Novick and Catley (2016) [(C) “identify/evaluate clades,” (D) “identify nested clades,” (E) “evolutionary relationship: resolved structure,” and (F) “evolutionary relationship: polytomy”] and to skill level six of Halverson and Friedrichsen. It consists of a set of skills pertaining to evaluating monophyletic groups and relative evolutionary relationships.

The fourth skill level (“comparing trees”) incorporates the ability to mentally rotate branches in a tree, to analyze subtrees, and to decide whether given trees show the same or different relationships. The same applies to comparing different representational styles (e.g., rectangular, circular, and diagonal trees). This level corresponds to two skills identified by Novick and Catley [(K) “rotation” and (J) “subset of the ToL”] and to Halverson and Friedrichsen’s skill level five (“conceptual use of representation”). At this point, we diverged from Halverson and Friedrichsen’s skill hierarchy, as this skill does not refer merely to the knowledge that trees can be rotated around nodes, but to the more complex task of reasoning about relationships with different subsets and the appearance of a tree. Furthermore, analyzing and comparing multiple evolutionary trees requires the formation of multiple complex mental models (Hochpöchler et al. 2013). Comparing two trees requires the learner to process many more graphical elements at the same time than when evaluating the relative relationships of a number of species (Kim et al. 2000). Thus, this skill necessitates the ability to evaluate evolutionary relationships in a very complex and demanding way and has to follow skill level three (“identifying relationships”). The understanding that trees can come in different formats but are informationally equivalent can be found in skill level six of Halverson and Friedrichsen’s system. This is also an aspect of our fourth skill level. Therefore, we deviated from the hierarchy of Halverson and Friedrichsen in this respect.
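The rotation part of this skill has a simple formal counterpart: two trees show the same relationships if they match once the left-to-right order of branches at every node is ignored. The sketch below illustrates that idea under stated assumptions. The nested-tuple encoding, the function names, and the example trees are illustrative choices, not part of any of the cited skill systems, and each taxon is assumed to appear only once per tree.

```python
def canonical(node):
    """Order-independent form of a tree.

    Leaves are strings; internal nodes are tuples of children and become
    frozensets, so rotating branches around a node changes nothing.
    """
    if isinstance(node, str):
        return node
    return frozenset(canonical(child) for child in node)


def same_relationships(tree_a, tree_b):
    """True if the two trees differ only by rotations around nodes."""
    return canonical(tree_a) == canonical(tree_b)


original = ((("birds", "alligators"), "lizards"), ("manatees", "elephants"))
# Same relationships, drawn with every node rotated.
rotated = (("elephants", "manatees"), ("lizards", ("alligators", "birds")))
# Different relationships: lizards now group with birds first.
regrouped = ((("birds", "lizards"), "alligators"), ("manatees", "elephants"))

print(same_relationships(original, rotated))
print(same_relationships(original, regrouped))
```

The first comparison succeeds although the two drawings look quite different, while the second fails although the drawings look similar, which is the contrast learners have to resolve mentally at this skill level.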

The fifth and final level (“arguing and inferring”) aims at going beyond the given information in the representation. It covers the ability to form conclusions and predictions based on the phylogeny, which may extend to taxa or traits not presented. It is based on Halverson and Friedrichsen’s level seven (“expert use of representation”) and represents the ability to interpret evolutionary trees in a deeper way than students are normally able to. Depicted information is used to form inferences and arguments that go beyond the presented information. This includes forming new mental models of composite trees, solving complex phylogenetic problems, and deciding which tree formats are best suited to different means of representation. The resulting skill levels, together with an explanation of the levels and the corresponding skills by other authors, can be seen in Table 3.

Tree Thinking

Abstract diagrams are critically important in most, if not all, science disciplines (Novick, 2006). In biology, hierarchical diagrams are especially common. Since 2004, I have been investigating college and high school students’ understanding of cladograms, the most important tool that contemporary scientists use to reason about evolutionary relationships. Most of this research has been conducted in collaboration with Kefyn Catley, an evolutionary biologist and science educator at Western Carolina University.

A cladogram is a type of hierarchical diagram that depicts hypotheses about nested sets of taxa that are supported by shared, evolutionarily novel characters called synapomorphies. For example, the cladogram shown at the top of the page indicates that one synapomorphy for birds and alligators is that they both possess a gizzard. That is, birds and alligators share a most recent common ancestor (MRCA) that evolved the novel character of possessing a gizzard. A group of taxa consisting of the MRCA and all descendants of that ancestor is called a clade or monophyletic group. Thus, birds and alligators comprise a clade (in the cladogram shown above). Because of the nesting inherent in hierarchical diagrams, birds, alligators, and lizards also comprise a clade. And those three taxa plus mammals (represented by manatees and elephants in the cladogram above) constitute another clade, etc. The synapomorphy supporting the bird/alligator clade distinguishes the MRCA of birds and alligators from the earlier ancestor common to birds, alligators, and lizards. And the synapomorphy supporting the bird/alligator/lizard clade (see UV light) distinguishes the MRCA of those three taxa from the earlier ancestor common to birds, alligators, lizards, and mammals. The latter ancestor evolved the novel character of having an amniotic egg, a critical development in the history of life on Earth that enabled vertebrates possessing this character to complete their life cycles on land.
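The definitions of MRCA and clade can be made concrete with a small sketch. It encodes only the taxa named in this paragraph as nested tuples, which is an illustrative encoding and not a standard file format, and the function names are hypothetical. A set of taxa forms a clade exactly when the smallest subtree containing all of them contains nothing else.

```python
# The part of the cladogram described in the text, as nested tuples.
tree = ((("birds", "alligators"), "lizards"), ("manatees", "elephants"))


def leaves(node):
    """Return the set of taxa at the tips below `node`."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def mrca_clade(node, taxa):
    """Return all taxa descended from the MRCA of `taxa`."""
    taxa = set(taxa)
    if not taxa or not taxa <= leaves(node):
        raise ValueError("taxa must be a non-empty subset of the tree's tips")
    children = () if isinstance(node, str) else node
    for child in children:
        # Descend while a single child still contains every requested taxon.
        if taxa <= leaves(child):
            return mrca_clade(child, taxa)
    return leaves(node)


def is_clade(node, taxa):
    """True if `taxa` is exactly an MRCA plus all of its descendants."""
    return mrca_clade(node, taxa) == set(taxa)


print(is_clade(tree, {"birds", "alligators"}))
print(is_clade(tree, {"alligators", "lizards"}))
print(sorted(mrca_clade(tree, {"alligators", "lizards"})))
```

Alligators and lizards alone do not form a clade, because their MRCA is also the ancestor of birds; the smallest clade containing both is the bird/alligator/lizard clade described above.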

Biologists use the tool of phylogenetics along with its product, the cladogram, to study macroevolution, the subdiscipline of biology that synthesizes events of Earth history and deep time (the well-established theory that Earth is billions of years old) with mechanisms that generate and maintain the biodiversity of our planet. Macroevolutionary processes operate at the level of species and above, resulting in the formation, radiation, and extinction of higher groups of taxa. Macroevolution explains, for example, both the origin and radiation of mammalian taxa. In contrast, microevolution concerns processes that occur at the level of the organism (i.e., genome, individual, and population). Microevolution explains, for example, the appearance of antibiotic-resistant strains of bacteria.

Cladograms are the most important tool used by evolutionary biologists because they document and organize existing knowledge about the properties of species and higher-order taxa. Tree thinking is the ability to understand and reason with evolutionary relationships depicted in cladograms (phylogenetic trees). The power of tree thinking is that the resulting classification scheme (for example, that alligators are more closely related to birds than to lizards because of their shared MRCA) reflects current understanding of the history of life on Earth (i.e., the evolutionary relationships among taxa). Thus, inferences based on this classification scheme are likely to be more informative and to have greater practical value than inferences based on other criteria. For example, inferring which antivenin to use to counteract the bite of a venomous king brown snake based on its close evolutionary relationship to the red-bellied black snake is more likely to lead to a successful outcome (namely, survival!) than is basing the choice of antivenin on the king brown snake’s similar appearance to the western brown snake.

Summary of My Research

Overview. My research on tree thinking falls into three broad categories: (a) influences of diagram design on interpretations of evolutionary relationships, (b) assessing and improving students’ tree-thinking skills, and (c) effects of prior knowledge about taxonomic relationships on tree thinking. The studies of diagram design are based primarily in cognitive and perceptual psychology, with strong implications for education. The instructional studies are rooted in science education while being informed by cognitive psychology. The studies of prior knowledge reflect a more even mix of psychological and educational foundations. All studies are informed by expert knowledge of evolutionary biology. This research has used a variety of different kinds of tasks, including those that require diagram comprehension, translation from one diagram format to another, and inference. Measures of performance include accuracy, types of errors made, written explanations (evidence cited) in support of one’s responses, and patterns of eye movements.

Influences of diagram design on interpretations of evolutionary relationships. Consistent with a large cognitive psychological literature on diagram comprehension, we would expect students’ interpretations of Tree-of-Life diagrams to be influenced by how those diagrams are designed. Thus, one major focus of my research program has been to discover how diagram design affects students’ interpretations of a variety of different types of Tree-of-Life representations.

One exciting project compared students’ ability to extract the hierarchical structure from cladograms depicted in different ways. Cladograms are typically drawn in one of two formats: rectangular trees (left diagram in the figure below) and diagonal ladders (right diagram in the figure below). In an analysis of the cladograms printed in a professional journal, Novick and Catley (2007) found that rectangular trees are by far the preferred format among evolutionary biologists: 83% vs. 17%. In high school and college biology textbooks, however, the diagonal format was found to occur slightly more often than the rectangular format: 59% vs. 41% for high school biology texts and 54% vs. 46% for college texts (Catley & Novick, 2008).

Rectangular tree (left) and diagonal ladder (right) cladogram formats.

In several studies (Novick & Catley, 2007, 2013), we found that students had difficulty understanding and reasoning from the diagonal cladogram format and that this difficulty stems from the Gestalt principle of good continuation, which works to conceal the critical information about hierarchical levels in this format. One implication of these results is that if some method can be found to break good continuation at the appropriate points along the continuous lines, students’ ability to correctly extract the hierarchical structure of diagonal cladograms should improve. Consistent with this prediction, we found that adding a synapomorphy to mark each branching point in diagonal cladograms greatly improved students’ ability to translate those cladograms to the rectangular format (Novick, Catley, & Funk, 2010). In a final study in this line of research, we found that biology students preferentially scan diagonal cladograms from left to right, following their highly practiced directional pattern for reading written text, and that they prefer to scan along the main diagonal line at the base of the cladogram (Novick, Stull, & Catley, 2012). This impairs their ability to uncover the correct pattern of nesting in diagonal cladograms as those cladograms are typically drawn in textbooks and the biology literature (see above figure).

I am excited to report that based on our research, many textbooks for introductory biology, evolution, and zoology classes have changed from depicting cladograms in the diagonal to the rectangular format to improve student comprehension and learning. Introductory biology textbooks alone reach approximately 800,000 students every year.

My current research is examining the importance of another Gestalt grouping principle in influencing students’ interpretations of the evolutionary relationships depicted in cladograms. I have recently come to believe that the fundamental difficulty students need to overcome to acquire expertise in tree thinking is to understand that any specific evolutionary tree is a subset of the complete, unimaginably large Tree of Life. My prior research with Kefyn Catley suggests that students instead reify the particular groupings they see and fail to appreciate that these groupings are largely an artifact of the specific taxa that happen to be included in the particular tree under consideration. This reification of particular groupings occurs, I believe, because of the Gestalt principles of grouping, which are part of the foundation of human perception. I am pursuing this new line of research in collaboration with Linda Fuselier, an evolutionary biologist at the University of Louisville. We are examining the role of the Gestalt principle of connectedness in determining students’ interpretations of the relationships depicted in rectangular format cladograms. By testing students enrolled in biology classes at different levels (e.g., introductory biology for majors and nonmajors vs. more advanced classes), we will be able to discern the extent to which reliance on Gestalt grouping versus most recent common ancestry changes as a function of biological expertise.

Assessing and improving students’ tree-thinking skills. As documented in three recent publications (Novick & Catley, 2016, 2017; Novick, Catley, & Schreiber, 2014), using the knowledge we gained from our extensive research on tree thinking, Kefyn Catley and I set out to create, implement, and test a research-based tree-thinking curriculum and assessment instrument. Our efforts were very successful with students from a wide variety of biology backgrounds, ranging from little or no biology coursework in college to extensive biology coursework consistent with being a senior biology major. Over three connected and iterative studies, we were able to show that direct instruction produced skills that transferred to regular classroom practices and lab settings and appeared to enhance student understanding of macroevolutionary patterns and processes. Some of the instructional materials we developed are available for download here and from the lessons and resources for teachers section of the Understanding Evolution web site maintained by the University of California Museum of Paleontology.

Effects of prior knowledge about taxonomic relationships on tree thinking. A third focus of my research program concerns students’ folkbiological knowledge about taxonomic relationships among living things and the impact of such knowledge on their ability to engage in tree thinking. Students’ folkbiological knowledge often conflicts with well-established scientific taxonomy. For example, although students (even after an introductory biology course for majors) group lizards together with frogs in the folkbiological category of reptiles and amphibians, lizards are in fact more closely related to mammals because those taxa share a MRCA that evolved the novel character of possessing an amniotic egg (see the cladogram at the top of this page).

In one project (Novick & Catley, 2014), I examined how college and high school students responded when their prior knowledge conflicted with the evolutionary information provided in rectangular format cladograms. In two studies, college and high school students received matched pairs of cladograms that depicted an identical pattern of relationships among either familiar or unfamiliar taxa. When the taxa were familiar, the cladograms showed (correct) relationships that conflicted with students’ prior knowledge. For example, one such cladogram showed that mushrooms are more closely related to animals than to plants, contradicting folkbiological taxonomy that mushrooms are plants. Students answered evolutionary relationship questions about both cladograms in each matched pair. For both student groups, accuracy was higher when the cladograms depicted relationships among unfamiliar rather than familiar taxa (i.e., when folkbiological knowledge was not available to contradict the scientific information presented).

An additional study reported in Novick and Catley (2014) examined college students’ willingness to include birds in the reptile category, where they belong, as a function of the strength of the supporting evidence. Even with salient visual evidence in the cladogram supporting this grouping, approximately half the students resisted this classification. On the positive side, students did at least choose a coherent definition of reptiles. For example, when they excluded birds from the category, they also excluded crocodiles, to which birds are most closely related. Evidently, the strength of many students’ prior belief that birds are not reptiles is greater than their prior belief that crocodiles are reptiles.

The difficulty of persuading students of the inaccuracy of their prior knowledge may relate in part to the length of time over which their misconceptions have been reinforced. Brenda Phillips, a former postdoctoral fellow in my laboratory, collected some preliminary data on pre-K through 6th grade children’s and college students’ knowledge about the relationships among sets of three familiar taxa (e.g., camels, elephants, and zebras; beavers, snakes, and frogs). In several respects, the responses of K-1st grade, 4th-6th grade, and college students were remarkably similar. For example, given the set of beavers, snakes, and frogs, most students in all age groups responded, incorrectly, that snakes and frogs are most closely related. See if you can figure out the age group of the student providing each of the following three explanations for this response: (a) “Both live near/in water and are reptile family members”; (b) “They are both not mammals”; (c) “They’re both amphibians and can go underwater and stay underwater, and can both go on land. They both like bugs.” [**Answers are at the bottom of this page.]

Research Support

Much of the research described here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080621 to Vanderbilt University (Laura R. Novick, PI; Kefyn M. Catley, Co-I). The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. My current research is being supported by a small grant from Peabody College of Vanderbilt University.

Instructional Materials Available for Download

As part of the above-mentioned IES grant, Kefyn Catley and I developed a variety of instructional materials for teaching tree thinking to undergraduates. Some of these materials are available for download here, as well as from the lessons and resources for teachers section of the Understanding Evolution web site maintained by the University of California Museum of Paleontology.

** (a) Vanderbilt student, (b) kindergarten or first grade student, (c) 4th-6th grade student.


Munzner and colleagues have demonstrated the advantage of using hierarchical data viewers enhanced with a 3D hyperbolic view over conventional 2D based viewers for efficiency of deciphering tree-based information [18]. While the 3D hyperbolic visualization of phylogenetic trees will not fully supplant 2D viewers, it can serve as an additional module to augment other visualization components. In the future, a phylogenetic tree visualization tool that integrates several visualization components in a similar way to the XML3D tool used by Risden et al. [18] would be desirable. The Walrus viewer and the conversion tool are a step towards this goal.

