Information

Computing Percent Identity between DNA / Amino Acid Sequence

Computing Percent Identity between DNA / Amino Acid Sequence



We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

What are some tools where I can input a pair of DNA sequences (or alternatively a pair of Amino Acid Sequences) and compute a percent similarity identity metric between them?

Is BLAST the right algorithm for this or something else?

The context is that a certain patent protects all sequences at least 90% or more identity to a given sequence. I wanted to test some candidate sequences and verify the percent identity metric.

Below I post a patent-snippet where they actually define their identity metric.


tBLASTn, BLASTx and tBLASTx can be used.

It is clearly illustrated here.

If you have just two sequences you need not do a search against the database. You can simply align them using Smith-Waterman or Needleman-Wunsch algorithms, as David mentioned. However, you still need to translate your DNA to protein before doing the alignment. Again, as pointed out by David, with protein alignments you do usually not use identity. Scoring matrices like PAM or BLOSUM are used to calculate similarity.


If you want to compare two specific sequences that you already have, then BLAST itself is not the program you need. BLAST is a program that uses heuristics for rapid searching of a large database for sequences similar to a query sequence. Each of the highest scoring sequences obtained is 'finished' for presentation by running an implementation of quite different older and slower dynamic-programming algorithm, which is non-heuristic.

There is an option at the BLAST implementation at NCBI for just comparing two sequences, but this seems strange as one may as well just use the dynamic-programming alignment algorithm. Furthermore the output is not what you want as it scores a local alignment, whereas what your patent specifies is a global alignment (see below).

So for pairwise comparison of two sequences one normally uses a program that implements the Smith and Waterman dynamic programming algorithm for local alignment (only the regions of best similarity are compared) or the Needleman and Wunsch algorithm for global alignment (the whole of both sequences are compared). From your patent specification you clearly want a global alignment and a value for percentage identity.

You can find free web implementations of both of them at the EBI website (among other places) where the program 'Needle' performs the global alignment you require. Alternatively, there is an implementation of the Needleman and Wunsch algorithm at NBCI, tucked misleadingly into the BLAST suite.

I do see one theoretical problem with these programs in relation to the patent, although this may not matter in practice. The patent says that “The optimal alignment is the alignment in which the percentage identity is the highest possible”. But these programs try to find the 'best' alignment not on the basis of the highest identity score, but on the basis of the highest similarity score. This is done by using one of a number of emperical scoring matrices that encompass the likelihood of particular amino acid substitutions. For example a match between similar amino acids (e.g. Glu/Asp) may give a score of 4, compared with a perfect match score of 5. Moreover, a perfect match for an amino acid like Trp scores much higher than one for Gly. (More info on this and these terms can be found in the Wikipedia Sequence Alignment entry or in the BLAST Glossary.)

Now you would expect that the 'best' sequence alignment obtained by optimizing for similarity in this way would also give the best identity score. This is quite likely for sequences of 90% identity or higher, but there is no reason it should necessarily be the case. In theory to get the highest identity score you should use a scoring matrix in which all perfect matches scored the same, and all mismatches zero. This is possible using an implementation of Needleman and Wunsch that allows you to use your own comparison matrix, e.g. this online EMBOSS-Explorer implementation. The problem is what score to give the perfect match (the average of match scores?), as this has to fit with the scoring penalties for mismatches. Or perhaps you should drop the mismatch penalty altogether, as you are only interested in the highest identity.

My advice would be to go for the Needleman and Wunsch algorithm, accept the defaults, and check you get the same score with different implementations. If your lawyers want anything more, I can offer a consultancy ;-).


Eigenvalue analysis of amino acid substitution matrices reveals a sharp transition of the mode of sequence conservation in proteins

The pattern of amino acid substitutions and sequence conservation over many structure-based alignments of protein sequences was analyzed as a function of percentage sequence identity. The statistics of the amino acid substitutions were converted into the form of log-odds amino acid substitution matrices to which eigenvalue decomposition was applied. It was found that the most important component of the substitution matrices exhibited a sharp transition at the sequence identity of 30-35%, which coincides with the twilight zone. Above the transition point, the most dominant component is related to the mutability of amino acids and it acts to disfavor any substitutions, whereas below the transition point, the most dominant component is related to the hydrophobicity of amino acids and substitutions between residues of similar hydrophobic character are positively favored. Implications for protein evolution and sequence analysis are discussed.


Lab Procedure

There are several ways to compare organisms. One way is through their morphological characteristics. Other ways include comparisons at the protein or DNA level. This activity will allow you to compare bats to mammals and bats to birds using a protein, beta-hemoglobin.

Find your critters

Begin by going to UniProt database at http://www.uniprot.org/ In the search box in the top toolbar type the phrase hemoglobin beta and click &ldquosearch&rdquo. The computer will retrieve many entries and display them on multiple pages.

&rarr Find two different entries for birds, two bats, and one mammal. (You can also add these terms to the search bar)

You may want to add the name of the animal to the search, for example: &ldquochicken hemoglobin beta&rdquo When you find an animal, check to make sure that it is the hemoglobin beta chain (preferably without a number after it) and not another hemoglobin types such as alpha or gamma. Write down the common name of your choices in your notes. Check the box in front of the animal of your choice then click on &ldquoAdd to Basket&rdquo in the top toolbar. You should see numbers appear in the &ldquoBasket&rdquo located on the upper right side of the page as you add each entry. Continue until you have added all 5 of your animals to &ldquoBasket&rdquo.

Side by Side Comparisons

1. Click on &ldquoBasket&rdquo. Select two animals to compare and click on the box in front of their name. Look for &ldquoalign&rdquo which should appear in the toolbar in the bottom of &ldquobasket&rdquo after making your selection.

2. Click on &ldquoalign&rdquo. This will make a data request to compare the proteins for the two animals you have selected. The program may take a minute or two to run.

3. The protein sequence is located under the &ldquoalignment&rdquo heading in the middle of the page. Using the right-hand scroll bar, find the amino-acid sequence. The amino acids are indicated by single-letter symbols. This shows you the actual alignment of the two sequences.

  • An * (asterisk) indicates positions which have a single, fully conserved residue.
  • A : (colon) indicates conservation between groups of strongly similar properties
  • A . (period) indicates conservation between groups of weakly similar properties
  • A - (dash) indicates there is no amino acid in that location

4. Look farther down on the page to the &ldquoResult Information&rdquo heading and find the &ldquoidentity&rdquo. Record the value in your notes. This value is the percent of amino acids that are similar between the two proteins. If all the amino acids were the same, the percent identity would be 100%. The different amino acids between the organisms are the results of mutations that have accumulated in the DNA over time. More differences indicate that the organisms are less closely related. Few differences indicate that the organisms are more closely related.

5. Repeat the comparison process until you have the &ldquopercent identity&rdquo value for all side by side comparisons between all of your organisms. On the table below, write the % for each alignment, and identify the organisms by common name


What is Identity?

Identity in sequence alignment is the number of characters that match exactly between two different sequences. Hence, gaps do not count when assessing identity. The measurement is considered to be relational to the shorter sequence among the two sequences. It significantly implies that it has the effect where the sequence identity is not transitive. If X=Y and Y=Z, then X is not necessarily equal to Z. This is deduced in terms of the identity distance measure.

Figure 02: Identity in Sequence Alignment

For example, X has a sequence of AAGGCTT, Y has a sequence of AAGGC and Z has a sequence of AAGGCAT. Identity between X and Y is 100% <5 identical nucleotides / min[length(X),length(Y)]>. Identity between Y and Z is also 100%. But identity between X and Z is only 85% <(6 identical nucleotides / 7)>.


Results

Characterizing the dynamic mapping landscape relevant to CpDAA data integration

Our first step to achieve high-fidelity multi-omic data integration was to establish a comprehensive set of test data. For this, we aggregated publicly available cysteine and lysine chemoproteomics datasets (Weerapana et al, 2010 Backus et al, 2016 Hacker et al, 2017 ), resulting in a total of 6,510 CpD cysteines and 9,327 CpD lysines detected in 4,119 unique proteins. These 15,837 CpDAAs are further sub-categorized by the residues labeled by cysteine- or lysine-reactive probes (iodoacetamide alkyne [IAA] or pentynoic acid sulfotetrafluorophenyl ester [STP], respectively) and those residues with additional measures of intrinsic reactivity (categorized as high-, medium-, and low-reactive residues Dataset EV1).

As our overarching objective was to characterize CpDAAs using functional annotations based on different versions of protein, transcript, and DNA sequences (Fig 1A), our next step was to develop a high-fidelity data analysis pipeline for intra- and inter-database mapping. To guide our analyses, we first referenced established methods for such data mapping, including ID mapping (Huang et al, 2008 Meyer, Geske, & Yu, 2016 Xin et al, 2016 ), residue–residue mapping (Martin, 2005 David & Yip, 2008 Dana et al, 2019 ), and residue–codon mapping (Zhou et al, 2015 Li et al, 2016 ) (See Appendix Table S1 for detailed descriptions of each type of mapping).

Figure 1. Landscape of sequence annotation information updates

  1. Schematic representation of mapping chemoproteomic-detected amino acids (CpDAAs) to pathogenicity scores.
  2. Timeline of gene annotation database release dates and project-specific datasets, including Ensembl releases tested for compatibility (Fig 2) to CpDAA coordinates based on canonical UniProtKB protein sequences and the database reference corresponding to the genomic pathogenicity scores (Fig 3).
  3. Average database release cycle length for releases between August 2013 and July 2019. All values are mean ± SD. Total of 25 Ensembl, 13 GENCODE, six CCDS (homo sapien only), and five NCBI releases were counted. UniProtKB value was calculated by taking the average of release cycle lengths reported on the UniProt website.

We suspected that the frequent and unsynchronized update cycles of independent databases (Fig 1B Dataset EV2) might complicate accurate residue-level mapping. Supporting this hypothesis, quantification of the average update cycle for each database across this time period revealed that UniProtKB has the shortest mean update cycle (

6 weeks Fig 1C). In contrast, NCBI is only updated yearly. These different update cycles can create a lag between versions of databases used to create identifier cross-reference (a.k.a. External Reference [xref]) files (Appendix Table S1). For example, ID mapping files provided by Ensembl for UniProtKB proteins may not share identical sequences if not used within the short 4-week window between UniProtKB updates.

To enable further characterization of how database update cycles and mapping strategy impact the fidelity of data integration, we collected a test set of Ensembl releases (Appendix Fig S1 and Dataset EV3). Specific releases were prioritized that (i) represented reference releases based on the GRCh37 or GRCh38 reference genome, (ii) were compatible with the latest Consensus Coding Sequence (CCDS) update for the human genome (release 22), (iii) were used in database for nonsynonymous functional predictions (dbNSFP) v4.0a and CADDv1.4, two resources that integrate functional annotations for all possible nonsynonymous single nucleotide variants (SNV) (Kircher et al, 2014 Liu et al, 2016 Rentzsch et al, 2019 ), and (iv) were associated with a commonly used version of the Ensembl Variant Effect Predictor (VEP) (McLaren et al, 2016 ).

With these prioritized datasets in hand, we next tracked the loss of CpDAA-containing protein IDs during intra-database mapping of UniProtKB releases and inter-database ID mapping to different Ensembl releases. Gratifyingly, only a handful of the original 4,119 protein IDs were lost due to database updates, both for Ensembl (e.g., 37 IDs for v97 release of Ensembl) and for UniProtKB (e.g., 26 IDs for 2012 UniProtKB Appendix Fig S1, Datasets EV1 and EV4). The greatest identifier loss was observed from mapping UniProtKB-based legacy data to the 2018 UniProtKB-SwissProt CCDS cross-referenced curation of the human proteome, with 119 IDs not found in the 2018 dataset. We ascribe this identifier loss to both UniProtKB updates and to the higher level of curation for proteins in the 2018 dataset, which includes only Swiss-Prot canonical protein sequences with a cross-referenced (“xref”) entry term in the CCDS database. Of note, CCDS gene IDs are manually reviewed and linked to UniProtKB-SwissProt. The TREMBL database is comprised of automatically generated protein IDs, which, as a result, comprises a substantially larger set of UniProtKB IDs, when compared to the manually curated SwissProt CCDS subset (Appendix Fig S2). From these analyses, we concluded that using the CCDS UniProtKB release was optimal for integrating functional annotations with chemoproteomics datasets.

Updates to canonical sequences assigned to UniProtKB stable identifiers can lead to intra-database mismapping of CpDAAs

Proteomics datasets, including published CpDAA datasets, are routinely searched against FASTA files containing only canonical UniProtKB proteins (Appendix Table S1), for two main reasons. First, canonical proteins reduce the redundancy and complexity of proteome search databases. Second, these sequences are identified by stable identifiers (also known as the UniProtKB primary accessions) and offer the seeming advantage of remaining constant through database update cycles. However, one particularly confusing aspect of the stable identifier is that the word “stable” in this context does not mean permanent or immutable. Specifically, the associated sequence linked to a stable identifier can change over database releases.

Therefore, we next assessed whether and to what extent updates to the canonical sequences assigned to UniProtKB stable identifiers resulted in mismapping. To confirm the integrity of our CpDAA dataset, we started this process by validating that over 99% of the CpDAA protein IDs and residue positions matched with those found in a 2012 UniProt FASTA file, corresponding to the reference proteome originally used to process the datasets (see Materials and Methods and Dataset EV1). The small fraction of data lost was due to missing stable identifiers and mis-matched CpDAA positions, which likely stems from slight inconsistencies between the original processing pipeline and our current workflow. We then mapped the 6,404 CpD cysteines and 9,213 CpD lysines from 4,084 canonical proteins identified in the 2012 dataset to the 2018 UniProtKB CCDS canonical sequence subset of the human proteome. Mapping to CCDS sequences enabled us to take advantage of the extensive array of tools that facilitate forward and reverse annotation between gene, transcript, and protein sequences and would allow for residue-specific mapping to genomic functional annotations (Dataset EV5) (Zhou et al, 2015 Meyer, Geske, & Yu, 2016 McGarvey et al, 2019 ). Updating to the 2018 release was a requisite step for using these tools, as they overwhelmingly require recent cross-reference files using the newest reference genome GRCh38. For all CpDAA positions, we performed residue–residue mapping—defined as a one-to-one correspondence between amino acids in proteins from different databases or release dates—to match the 2012 canonical UniProtKB sequences with their 2018 counterparts (Dataset EV4). This dataset mapping resulted in the loss of 121 protein IDs, with 108 simply not found in the 2018 reference file and the remaining 13 found to have different canonical sequences, resulting in mismapping or loss of the originally identified CpDAA residues.

The high concordance between these two UniProtKB releases, separated by 6 years, indicates that for the vast majority of UniProtKB updates, differences in release date should not complicate re-mapping legacy proteomics data to more recently released gene, transcript, and protein sequences. However, we were surprised to find that several widely studied proteins, including protein arginine N-methyltransferase 1 (PRMT1 or ANM1, Q99873), serine/threonine protein kinase, (SIK3 Q9Y2K2) (Walkinshaw et al, 2013 ), and tropomyosin alpha-3 chain (TPM3, P06753), had canonical protein sequence differences resulting in all or nearly all CpDAA positions to be missed using the 2018 position index (Dataset EV4). We observed two main reasons for these losses: (i) changes to the canonical sequence associated with the UniProtKB stable ID and (ii) changes to which isoform is assigned as the canonical sequence. While both 2012 and 2018 sequences of PRMT1 are associated with UniProtKB stable ID Q99873, the 2018 sequence contains an additional short N-terminal sequence, not present in the 2012 sequence (Fig 2A). As a result, all 13 PRMT1 CpDAAs failed to map to the 2018 UniProtKB release. In the 2012 release of UniProtKB, the canonical sequence of the peptidyl-prolyl cis-trans isomerase FKBP7 is associated with the versioned (isoform) ID Q9Y680-1, whereas in the 2018 release, the canonical sequence is associated with the versioned (isoform) ID Q9Y680-2, which lacks a short sequence (AAΔ125:162) in the middle of the protein. For FKBP7, this update fortuitously does not result in loss of CpD Lys83, as it is located N-terminal to the deletion. These updates to the protein sequence are, in essence, masked by the stable IDs, which do not flag sequence updates or changes to which isoform sequence is assigned as the canonical. Exemplifying this problem, we identified 45 stable identifiers with non-identical canonical protein sequences in the 2012 and 2018 UniProtKB releases (Dataset EV4).

Figure 2. Challenges with residue-level mapping and UniProtKB canonical protein sequences

  1. Schematic depiction of mapping scenarios from updating chemoproteomic-detected protein sequences using stable or versioned identifiers.
  2. Distribution of number of isoforms per stable UniprotKB ID for 3,953 detected proteins.
  3. Frequency of specific isoform name for 2,487 multi-isoform UniProtKB canonical proteins.
  4. Schematic depiction of glucose-6-phosphate dehydrogenase (G6PD, UniProtKB ID P11413) cross-referencing both identical and non-identical sequences of Ensembl stable IDs from five releases.
  5. Heatmap of protein sequence distance scores for detected UniProtKB and cross-referenced Ensembl proteins from five releases. Each gene name corresponds to one unique stable Ensembl protein ID.

To further understand how the presence or absence of protein isoforms impacts the fidelity of data mapping during intra-database (UniProtKB) mapping, we identified all isoforms associated with CpDAA stable protein IDs. Analysis of this dataset revealed that 58% of protein stable IDs have between 2–5 associated isoform sequences (Fig 2B). Catenin delta-1 protein (CTNND1, O60716) had 32 isoforms, which was the greatest number of isoforms in our dataset (Dataset EV6). Protein isoforms are identified by the “-X” after the UniProtKB ID, where X represents the isoform name. A common assumption of most mapping tools and proteomics databases is that the “-1” sequence is the canonical sequence. However, a key finding from our isoform analysis is that the canonical sequence does not always correspond to the “-1” isoform ID provided by UniProtKB. In fact, for 288 proteins in the UniProtKB 2018 release, the non-“-1” entry corresponds to the canonical isoforms, and for 55 CpDAA-containing proteins in our dataset (

2%), the canonical sequence is not the “-1” isoform (Fig 2C and Dataset EV7). Strikingly, the canonical sequence can even be the “-10” isoform, as is the case for the Ras-associated and pleckstrin homology domains-containing protein (RAPH1, Q70E73). In the context of database mapping, all of these non-“-1” canonical proteins will likely result in mismapping using established tools.

Accurate residue-level inter-database mapping between UniProtKB and Ensembl is dependent on database update cycles

To investigate how sequence versions impact inter-database mapping, we next turned to ID cross-reference files (Dataset EV3) that are released by Ensembl and UniProtKB. Cross-reference files can be used to convert between UniProtKB and Ensembl ID types. Three major challenges arise with ID cross-referencing: (i) when cross-reference stable IDs match, but corresponding sequences are not identical, (ii) multi-mapping, where a UniProtKB ID maps to many Ensembl protein (ENSP), transcript, and gene IDs, and (iii) when the origin, both the time of the releases and the specific database provided cross-reference files used, determines the mapping accuracy of datasets.

Glucose-6-phosphate dehydrogenase (G6PD, P11413) exemplifies how sequence updates associated with a stable ID can lead to mismapping of gene-, transcript-, and protein-level annotations for CpDAAs (Fig 2D). For G6PD, the same UniProtKB ID maps to four unique ENSP IDs with identical sequences (see first row in “Identical”) as well as four different ENSP IDs with non-identical sequences (see second row in “Non-identical”). For G6PD, this significant redundancy is also observed at the gene and transcript level, both for stable and versioned IDs (Fig EV1A Dataset EV8). Overall, genes undergo the highest frequency of sequence re-annotation due to continual refinement of the reference genome. In contrast, protein IDs remain largely fixed across releases (Fig EV1B Dataset EV9).

Figure EV1. Mapping of Ensembl IDs to UniprotKB shows heterogeneity at gene, transcript, and protein levels

  • A. Number of stable and versioned Ensembl gene, transcript, and protein IDs for G6PD across all five Ensembl releases.
  • B. Cumulative sequence re-annotations for Ensembl gene, transcript, and protein IDs since the v85 release.
  • C, D. Average number of Ensembl gene, transcript, and protein IDs for (C) single isoform (n = 1,466) and (D) multi-isoform (n = 2,487) CpDAA UniProt entries. Bar plots represent mean values ± SD for the number of Ensembl IDs per stable UniProtKB ID. Statistical significance was calculated using an unpaired Student's t-test, ****P-value < 0.0001.

To assess how pervasive multi-mapping is across the entire CpDAA dataset, we quantified the mean number of Ensembl IDs per UniProtKB ID. We counted both versioned and stable Ensembl IDs types (gene, transcript, and protein IDs), for all CpD UniProtKB proteins grouped by single (Fig EV1C) or multi-isoform (Fig EV1D Dataset EV10) associated stable IDs. We suspected that database updates for all data types (gene, transcript, and protein) and the presence of UniProtKB isoforms would contribute to the observed multi-mapping of CpD protein IDs in our dataset. Of note, Ensembl versioned IDs indicate changes to the associated sequence rather than the presence of isoforms. For example, for protein tropomyosin alpha-4 chain (TPM4, P67936), during the update from v96 to v97, the stable protein identifier showed version change from “.3” to “.4” (ENSP00000300933.3 to ENSP00000300933.4), which corresponds to a difference of 165 amino acids in the primary sequence caused by the update (Dataset EV11). Not surprisingly, we found that UniProtKB stable identifiers with multiple associated protein isoforms have a higher average of cross-referenced Ensembl ID types per UniProtKB stable identifier, when compared to UniProtKB stable IDs associated with only one protein isoform. In addition, single isoform UniProtKB stable IDs are more likely to cross-reference identical ENSPs, when compared to multi-isoform UniProtKB stable IDs (Appendix Figs S3 and S4).

One last challenge we identified is that the origin of the cross-reference file (whether it was created by UniProtKB or by Ensembl) affected the outcome of our mapping procedures. Across the five Ensembl releases, only 56.9% of all Ensembl-UniProtKB cross-referenced IDs had identical protein sequences (Appendix Fig S3 Dataset EV8). We then used a cross-reference file from UniProtKB that, unlike the Ensembl mapping files, contains mappings with canonical isoform protein identifiers for UniProtKB proteins to Ensembl stable protein IDs, to test whether inclusion of isoform name details improves the accuracy of inter-database ID mapping. This approach allowed for > 99% identical protein sequence cross-references for UniProtKB-ENSP IDs and substantially reduced the burden of identifier multi-mapping (Appendix Fig S4 Dataset EV12). Our study demonstrates that high-fidelity ID cross-referencing requires attention to details regarding database updates, multi-mapping, and identifier types used in cross-reference file sources. We also observed that sequences associated with mapped UniProtKB and Ensembl stable IDs varied significantly in alignment distance depending on the Ensembl version (Fig 2E Appendix Fig S5 Dataset EV11), with temporally close releases showing generally greater sequence similarity.

Assessment of pathogenicity predictions for CpD cysteine and lysine codons, using residue–codon mapping

Our next objective was to apply residue–codon mapping to the prioritization of functional CpDAAs. Cysteines and lysines are both highly conserved, with 97% (Miseta & Csutora, 2000 ) and 80% (Hacker et al, 2017 ) median conservation, respectively. Consequently, sequence motif conservation cannot distinguish between functional and non-functional residues within chemoproteomics datasets. To identify cysteine- and lysine-centric genetic features suitable for pathogenicity prioritization, we tailored our pipeline to reverse-translate CpD cysteine and lysine positions in canonical UniProtKB proteins to genomic coordinates from both major genome assemblies (GRCh37 and GRCh38) and genomic-based functional annotations. For all proteins within our CpDAA dataset, referred to as detected proteins, we also processed undetected equivalent residue types in CpD Cys- and/or CpD Lys-containing proteins (Fig EV2). Cysteines and lysines were required to have valid coordinates in GRCh37 and GRCh38 reference genome assemblies, as some functional genetic variant annotations are only available in one genome assembly (Dataset EV13). Probe-labeled cysteines and lysines represent

15% of all cysteines (6,057 CpD Cys out of 40,107 total Cys) and

6% of all lysines (8,868 CpD Lys out of 149,520 total Lys) found in chemoproteomic-identified proteins (n = 3,840 UniProtKB IDs successfully mapped Fig 3A and B).

Figure EV2. Flowchart of the mapping strategy and data analysis

CpD cysteines and lysines from three publicly available datasets were processed and filtered according to our optimized mapping pipeline. Number of CpD cysteines (red) and CpD lysines (blue) retained following each step shown as bar plots.

Figure 3. Analysis of pathogenic missense at detected vs undetected cysteines and lysines

  • A, B. Aggregate number of detected and undetected cysteines (A) and lysines (B) in 3,840 CpDAA-containing proteins.
  • C. Heatmap of missense score correlations for all possible nonsynonymous SNVs at CpD cysteine (29,541 missense) for eight pathogenicity scores. Overall, Spearman's rank r was between 0.36 and 0.91.
  • D. CpD Lysine (41,850 missense) heatmap for missense score correlations for all possible nonsynonymous SNVs. Spearman's rank r between 0.16 and 0.81.
  • E. Odds of predicted deleterious Cys > Trp (red) missense at detected (n = 6,057) vs undetected (n = 34,049) residues in 3,840 detected proteins. Deleterious missense defined by CADD38, FATHMM, and DANN score thresholds (y axis). CADD38 OR = 0.76, P = 3.40e-22 FATHMM OR = 0.92, P = 0.02 DANN OR = 0.690, P = 6.69e-26. Odds of predicted deleterious Lys > Ile (blue) missense at detected (n = 3,581) vs undetected (n = 63,385) residues in 3,840 detected proteins. CADD38 OR = 1.80, P = 1.03e-53 FATHMM OR = 1.55, P = 3.47e-33 DANN OR = 1.75, P = 9.21e-14. *P < 0.0042 Bonferroni-adjusted (two-tailed Fisher's exact test).
  • F. Odds of ClinVar pathogenic variant overlapping detected (6,057 Cys 8,868 Lys) vs undetected (34,050 Cys 140,652 Lys) residues in 3,840 detected proteins. Cys detected in ClinVar pathogenic site (red, OR = 1.17, P = 0.457) and Lys detected at ClinVar Pathogenic site (light blue, OR = 2.76, P = 1.03e-04). Combined Cys and Lys (dark blue, OR = 2.26, P = 9.99e-07) *P < 0.0167 Bonferroni-adjusted (two-tailed Fisher's exact test).

Data information: In (E and F), 95% confidence intervals (line segments) and odds ratios (squares). In (C and D), color intensity represents two-tailed Spearman's rank-order correlation coefficients between 0 and 1.

Next, genomic coordinates of cysteine and lysine codons from 3,840 detected proteins were annotated by a panel of functional scores (Quang, Chen, & Xie, 2015 Shihab et al, 2015 Ioannidis et al, 2016 Jagadeesh et al, 2016 preprint: Samocha et al, 2017 Sundaram et al, 2018 Rentzsch et al, 2019 ). With the goal of assessing the correlation between individual scores and chemoproteomics identification labels, we selected complementary pan-genome and missense deleteriousness prediction scores (Dataset EV13) based on either GRCh37 or GRCh38 reference genome assemblies for our analysis. For the CADD score, which is available for both assemblies, we observed a trend of slightly higher scores with CADD38 compared to CADD37 (Appendix Fig S6). We calculated the Spearman's correlation of scores for all possible nonsynonymous SNVs overlapping cysteine and lysine codons and saw a higher correlation between the deleteriousness predictions for CpD cysteine substitutions (Fig 3C Dataset EV14) than for CpD lysine substitutions (Fig 3D Dataset EV14). For the subset of scores that provide deleteriousness scores for all possible nonsynonymous variants, we did not observe substantial differences between the correlation of scores for chemoproteomic-detected and -undetected lysines or cysteines (Appendix Fig S7 Dataset EV15).

Pathogenicity thresholds, which are provided by a subset of the scores investigated (e.g., CADD, functional analysis through hidden markov models [fathmm-MKL], and Deleterious Annotation of genetic variants using Neural Networks [DANN]), provide a useful cut-off for assessing whether substitutions at specific amino acids are likely to be deleterious to protein function. Therefore, we next assessed whether substitutions at detected vs undetected cysteines or lysines were more likely to be predicted damaging. We first assessed the amino acid substitutions for cysteine and lysine resulting in the greatest chemical property change, or highest Grantham score (Grantham, 1974 ), Cys > Trp and Lys > Ile. For CADD38 (Kircher et al, 2014 ), fathmm-MKL coding (Shihab et al, 2014 ), and DANN (Quang, Chen, & Xie, 2015 ), substitutions of detected cysteines were less likely to be predicted damaging compared to substitutions of undetected cysteines (Fig 3E, red Dataset EV16). In contrast, substitutions of detected lysines were more likely to be predicted damaging compared to substitutions of undetected lysines (Fig 3E, blue Dataset EV16). This trend for cysteine and lysine predicted deleterious score enrichment extended to all missense types (Fig EV3A Dataset EV16).

Figure EV3. Assessment of missense pathogenicity between detected–undetected and reactivity groups for CPD cysteine and lysine residues

  • A. Enrichment of predicted deleterious missense variants for detected vs undetected cysteine (red) and lysine (blue) missense variants in 3,840 proteins. Missense types in order of increasing Grantham score. Odds ratio (OR) for all possible nonsynonymous SNVs at cysteine codons between 0.56 and 0.76, at lysine codons fall between 1.38 and 1.80. 95% confidence intervals (line segments) and odds ratios (squares), two-tailed Fisher's exact test, *P < p cut-off, and 0.0019 Bonferroni-corrected (0.05/26).
  • B, C. Distribution of mean CADD38 (model for GRCh38) PHRED scores for (B) cysteine (n = 1,401) and (C) lysine (n = 4,363) CpDAAs of low, medium, and high intrinsic reactivities, defined by isoTOP-ABPP ratios, low (R10:1 > 5), medium (2 < R10:1 < 5), high (R10:1 < 2) (Weerapana et al, 2010 Hacker et al, 2017 ). Kruskal–Wallis nonparametric test to examine reactivity group difference, Wilcox test used for pairwise comparisons (BH-adjusted P-values, *P. adj = 0.013, ***P. adj = 2.80e-05, and ****P. adj = 5.30e-08). The boxplot boxes represent the lower and upper quartiles, with the central band as median, notches show the confidence interval based on median ± 1.58*IQR/sqrt(n), and whiskers mark observations that satisfy quartiles ± 1.5*IQR.

We next tested if these trends would extend to clinically validated “pathogenic” and “likely pathogenic” missense mutations, as identified by the ClinVar database (Landrum et al, 2018 ). ClinVar is the gold standard repository of genomic variants associated with monogenic disorders. In total, the filtered ClinVar dataset contained 2,225 disease-associated missense variants that change from a cysteine (1,653 variants) or lysine (572 variants). We found no significant enrichment of disease-associated variants in detected over undetected cysteines (Fig 3F, red Dataset EV17). In contrast, detected lysines showed a significant enrichment for disease-associated variants relative to undetected lysines (Fig 3F, light blue). Combining cysteine and lysine data revealed detected residues in general as more likely to harbor disease-associated mutations relative to equivalent undetected residues in 3,840 detected proteins (Fig 3F, dark blue). Given the challenges associated with accurately diagnosing missense variants, we expect that chemoproteomic detection, particularly for lysine residues, could be used as an additional metric to improve pathogenicity predictions for genetic variants.

Chemoproteomics data combined with pathogenicity scores can help prioritize functional residues

We next assessed correlations between genetic-based pathogenicity score and amino acid reactivity, as assessed by chemoproteomics. We chose CADD as the optimal score to evaluate, as it integrates other nucleotide variant predictors into its model and is available for both reference genome assemblies, GRCh37 and GRCh38. Chemoproteomic reactivity measurements were binned into low, medium, and high reactivity categories, defined as low (R10:1 > 5), medium (2 < R10:1 < 5), high (R10:1 < 2) isoTOP-ABPP ratios, respectively (Weerapana et al, 2010 Hacker et al, 2017 ). These ratios quantify the relative labeling of a residue at different probe concentrations (e.g., 1× vs 10×). A ratio closer to one indicates that labeling is saturated at low probe concentration, which corresponds to a cysteine or lysine with higher intrinsic reactivity.

To adapt CADD scores from the nucleotide level to the amino acid level for CpDAAs, the mean and max CADD score for all possible nonsynonymous SNVs per codon (see Methods) were calculated. For both max (Fig 4A) and mean (Fig EV3B) CADD codon scores, we found that highly reactive cysteines show significantly higher predicted deleteriousness. In contrast, lysine reactivity did not correlate with predicted pathogenicity (Fig 4B and EV3C).

Figure 4. Association between amino acid reactivity and CADD score

  • A, B. Distribution of the max CADD38 PHRED (model for GRCh38) scores for (A) cysteine (n = 1,401) and (B) lysine (n = 4,363) CpDAAs of low, medium, and high intrinsic reactivities, defined by isoTOP-ABPP ratios, low (R10:1 > 5), medium (2 < R10:1 < 5), high (R < 2) (Weerapana et al, 2010 Hacker et al, 2017 ). Kruskal–Wallis nonparametric test to examine reactivity group difference, and Wilcox test used for pairwise comparisons (BH-adjusted P-values, *P. adj = 0.04, **P. adj = 0.0037, and ***P. adj = 0.00013). The boxplot boxes represent the lower and upper quartiles, with the central band as median. Notches show the confidence interval based on median ± 1.58*IQR/sqrt(n), and whiskers mark observations that satisfy quartiles ± 1.5*IQR. Median CADD38 max codon scores with bootstrapped 95% confidence intervals for reactive groups are low CpD Cys 27.3 (26.9, 28.0), medium CpD Cys 28.55 (27.80, 29.05), high CpD Cys 31 (28.8, 32.0), low CpD Lys 29.5 (29.3, 29.6), medium CpD Lys 29.25 (28.85, 29.50), high CpD Lys 29.05 (28.50, 29.55).
  • C. Shows CADD38 max codon scores for nonsynonymous SNVs at residues 220–479 of CASP8 (UniProt ID Q14790). Dashed horizontal line marks deleterious threshold of 25.
  • D. Crystal structure of CASP8 (PDB ID: 3KJN) highlighting C360 and C409. Bound covalent inhibitor B93 in yellow, with distance between Cys409 and the bound inhibitor measured in Angstroms. Protein surface color represents CADD38 max codon scores. Image generated in PyMOL (DeLano, 2002 ).
  • E. Activity of recombinant caspase-8 protein assayed using fluorogenic IETD-AFC substrate. Percentage activity shown relative to wild-type (WT) protein for three replicate experiments, bars and error bars as mean ± SD.

As the legacy cysteine reactivity dataset was relatively small (94 high reactivity cysteines in total), we next sought to verify these striking correlations, using a larger dataset. For this, we subjected lysates from the immortalized human T lymphocyte Jurkat cell line to isoTOP-ABPP reactivity profiling, comparing cysteine labeling with 10 or 100 μM iodoacetamide alkyne probe, as has been described previously (Weerapana et al, 2010 ). In aggregate, we identified 4,291 cysteines across five replicate experiments (

4-fold more cysteines than were assayed by (Weerapana et al, 2010 )), including 322 high, 1,448 medium, and 2,247 low reactivity residues. A strong correlation (Pearson correlation coefficient = 0.5) was observed between values reported in our new dataset and those reported previously (Appendix Fig S9). This rich dataset (Dataset EV18) allowed us to further verify our finding that the codons of highly reactive CpDAAs are enriched for high pathogenicity scores. Gratifyingly, our initial finding was reproduced with this new and larger dataset (Appendix Fig EV3B and C), supporting both the validity of our approach and the robustness of our findings.

As a first case study to explore the utility of integrating genetic-based pathogenicity predictions with CpDAA reactivity measures, we turned to the well-characterized essential enzyme glucose-6-phosphate dehydrogenase (G6PD). Associated with over 160 different genetic mutations, G6PD deficiency is one of the most common genetic enzymopathies (Hwang et al, 2018 ). As G6PD deficiency is associated with both acute and chronic hemolytic anemia (Porter et al, 1964 Miwa & Fujii, 1996 ) (OMIM #300908), and with malaria resistance (Luzzatto, Usanga, & Reddy, 1969 ) (OMIM #611162), identifying functionally important residues in G6PD should inform the diagnosis and treatment of G6PD-associated genetic disorders. To visualize CADD pathogenicity scores along protein sequence length, we plotted the first 300 amino acids in G6PD with lines tracking max CADD GRCh38 scores, including the positions of all 15 residues identified in prior chemoproteomics studies (Fig EV4A). Of particular interest to us were K171 and K205, which are both located proximal to the enzyme active site (Fig EV4B). While K171 and K205 had very different intrinsic reactivities (R10:1 = 1.3 and R10:1 = 9.2, respectively), both showed high max CADD scores (28.8 and 32, respectively Fig EV4A). Consistent with the observed high CADD scores, chemical modification at K205 (e.g., by aspirin) has been found to block G6PD activity (Jeffery, Hobbs, & Jörnvall, 1985 Ai et al, 2016 ) and mutations at K171 have been implicated in anemia (Hirono et al, 1989 Au et al, 2000 ). These prior data, when combined with our analysis of CADD and reactivity measurements support our finding that the propensity of lysines to react with electrophilic probes, but not measured differences in their intrinsic probe reactivity, correlate with predicted pathogenicity (Figs 3E and F, and EV3A and C).

Figure EV4. Functional validation of reactive lysine in G6PD

  1. Shows CADD38 max codon missense scores for residues 1–350 of G6PD (UniProt ID P11413). CpD K205 has the highest score out of all positions in protein. CpDAA positions above CADD38 deleterious threshold (gray dash line) include K47, K89, C158, K171, K205, and C294.
  2. Crystal structure of G6PD (PDB ID: 2BH9) shows K205 and K171 located within the enzyme active site. NADP + cofactor shown in yellow. Surface colored by CADD38 max codon missense scores. Image generated in PyMOL (Smith et al, 2019 ).

We next sought to determine whether the utility of integrating genetic-based pathogenicity predictions with CpDAA reactivity measures could extend to the de novo discovery of functional residues. We turned to the well-characterized enzyme caspase-8, a member of the cysteine-aspartic acid protease (caspase) family and a key initiator of extrinsic apoptosis. Pathogenic mutations in caspase-8 result in autoimmune lymphoproliferative syndrome (ALPS, OMIM# 607271) (Chun et al, 2002 Kanderova et al, 2019 ) and are associated with certain types of cancer. Our chemoproteomic reactivity dataset (Dataset EV18 and Fig EV5A) revealed that caspase-8 harbors two iodoacetamide alkyne-reactive cysteines: the catalytic cysteine (Cys360, R10:1 = 3.8) and a second non-catalytic cysteine (Cys409, R10:1 = 2.9). Consistent with its function as the catalytic nucleophile, the codon of Cys360 has a high mean CADD score (29.3), whereas the codon of Cys409 has a lower CADD score (21.4), indicative that mutations that alter Cys409 should be less damaging to caspase-8 (Fig 4C). Cys409 is located on a flexible loop

11.8 Å from the active site, as revealed by our projection of the max CADD codon scores onto the CASP8 X-ray structure (Fig 4D). As, to our knowledge, the functional impact of Cys409 mutations has not been assessed, we tested whether mutations at Cys409 would impact protein function, as indicated by the elevated measured reactivity, but not the moderate CADD score. Activity assays revealed that mutations at Cys409 do indeed impact protein function, completely blocking proteolytic activity (Fig 4E Dataset EV19). Taken together, these analyses highlight the utility of integration of chemoproteomic measures with pathogenicity predictions to improve stratification of functional and pathogenic residues.

Figure EV5. 2019 Cysteine chemoproteomics data support residue reactivity and deleteriousness score trend

  • A, B. Association between cysteine reactivity labels and CADD38 (model for GRCh38) PHRED scores for cysteines of low (n = 2,247), medium (n = 1,448), and high (n = 322) intrinsic reactivities, defined by isoTOP-ABPP ratios, low (R10:1 > 5), medium (2 < R10:1 < 5), high (R10:1 < 2) (Weerapana et al, 2010 Hacker et al, 2017 ). Either the max CADD score for a missense change was assigned to the codon (BH-adjusted P-values, low vs med ***P. adj = 0.00099, low vs high ***P. adj = 0.00086) (A) or the average of all missense scores at that codon (BH-adjusted P-values, low vs med ***P. adj = 4.0e-04, med vs high *P. adj = 0.023, low vs high ****P. adj = 3.90e-05) (B). Reactivity group differences assessed by Kruskal–Wallis nonparametric test and Wilcox test used for pairwise comparisons.
  • C. Plot of cysteine reactivity ratios for 3,590 out of 4,017 total profiled residues in 2019 isoTOP-ABPP study. Represented are 322 high, 1,448 medium, and 1,820 low threshold cysteines.

The conservation score at a site corresponds to the site's evolutionary rate. The rate of evolution is not constant among amino (nucleic) acid sites: some positions evolve slowly and are commonly referred to as "conserved", while others evolve rapidly and are referred to as "variable". The rate variations correspond to different levels of purifying selection acting on these sites. For example, in proteins, the purifying selection can be the result of geometrical constraints on the folding of the protein into its 3D structure, constraints at amino acid sites involved in enzymatic activity or in ligand binding or, alternatively, at amino acid sites that take part in protein-protein interactions.

In ConSurf, the rate of evolution at each site is calculated using either the empirical Bayesian (Mayrose et al., 2004) or the Maximum Likelihood (Pupko et al., 2002) paradigm. In both of these methods, the stochastic process underlying the sequence evolution and the phylogenetic tree are explicitly taken into account. The Bayesian method was shown to significantly improve the accuracy of conservation scores estimations over the Maximum Likelihood method, in particular when a small number of sequences are used for the calculations (Mayrose et al., 2004). An additional advantage of the Bayesian method is that a confidence interval is assigned to each of the inferred evolutionary conservation score.


Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag

We are currently witnessing a major epidemic caused by the 2019 novel coronavirus (2019-nCoV). The evolution of 2019-nCoV remains elusive. We found 4 insertions in the spike glycoprotein (S) which are unique to the 2019-nCoV and are not present in other coronaviruses. Importantly, amino acid residues in all the 4 inserts have identity or similarity to those in the HIV-1 gp120 or HIV-1 Gag. Interestingly, despite the inserts being discontinuous on the primary amino acid sequence, 3D-modelling of the 2019-nCoV suggests that they converge to constitute the receptor binding site. The finding of 4 unique inserts in the 2019-nCoV, all of which have identity /similarity to amino acid residues in key structural proteins of HIV-1 is unlikely to be fortuitous in nature. This work provides yet unknown insights on 2019-nCoV and sheds light on the evolution and pathogenicity of this virus with important implications for diagnosis of this virus.


Materials and methods

Data access

The raw and processed data have been submitted to the NCBI Gene Expression Omnibus (GEOhttp://www.ncbi.nlm.nih.gov/geo/) under accession number GSE99990. A virtual machine containing a running version of the data processing pipeline is available as a Docker image https://hub.docker.com/r/gui11aume/epi/. The scripts to reproduce the figures are on Github at https://github.com/Lcarey/HIS3InterspeciesEpistasis.

Study design

The His3 gene was selected for three principal reasons, it is short, conditionally essential and has not been known to be involved in protein-protein interactions. Studying 20 220 variants of His3 is impossible, thus, we have chosen an approach to survey the fitness landscape in a manner that would elucidate the area most relevant to His3 evolution while managing the technical limitations of our experimental design. We considered amino acid states found in extant species, focusing on yeast species, which translated into a full combinatorial set of

10 83 unique genotypes. Technically, it is feasible to measure fitness on the order of 100,000 unique genotypes in a single growth experiment. Therefore, we split the His3 gene into 12 independent segments such that the full combinatorial set of extant amino acid states from 21 yeast species in each segment was 10,000 – 100,000 genotypes. We then considered the combinatorial library for each segment in an independent growth experiment, which allowed us to study a tractable section of the sequence space while considering trajectories across a vast part of the space connecting extant species (Fig 1A). We constructed these combinations in 12 plasmid libraries and transformed them into a haploid His3 knockout S. cerevisiae strain. Growth rate (fitness) of yeast carrying different mutations in His3 was measured using serial batch culture in the absence of histidine.

We split the His3 gene sequence into segments in a manner agnostic to the structure of the His3 protein (S1A Fig). For technical reasons, a segment consisted of two variable regions with a constant region between them (S1B and S2A Fig). All growth experiments were performed independently for each segment, with the exception of one experiment on a limited group of genotypes from each segment which was done for the normalization of fitness values across different segments (S4 Fig).

As a control, we measured the rate of growth of S. cerevisiae whose entire His3 gene sequence came from another distant species. We found that the replacement of an entire gene sequence of His3 leads to wild-type rates of growth of S. cerevisiae even when the His3 sequence comes from very distant yeasts, as far as S. pombe (S4 Fig). Therefore, His3 appears to be an independent unit of the fitness landscape and is a good model for the study of fitness landscapes of an isolated gene.

Measuring fitness

Plasmid construction.

The His3 open reading frame of S. cerevisiae was PCR amplified to include the promoter and transcription terminator regions from 622 base pairs (bp) upstream of the open reading frame (ORF) to 237 bp downstream of the ORF, using primers 126 and 127 (see S2 Supporting Information) from the wild-type prototroph strain FY4. The PCR product was cloned into vector pRS416 using Gibson assembly (NEB, E2611S). The His3 orthologues from other species were amplified from genomic DNA using designed primers (S1 Supporting Information) and were cloned into the vector pRS416_his3, replacing the ORF of S. cerevisiae by Gibson assembly (NEB, E2611S). Since the His3 orthologue from A. nidulans contains an intron, the whole open reading frame was initially cloned into the vector, and the intron was later removed by PCR-amplifying the whole plasmid without this sequence, followed by recircularization.

Genomic DNA extraction.

Genomic DNA from fungi (Saccharomyces cerevisiae, Saccharomyces bayanus, Candida glabrata, Saccharomyces castellii, Kluyveromyces lactis, Eremotheciumgossypii, Debaryomyces hansenii, Lodderomycese longosporus, Aspergillus nidulans, Schizosaccharomyces pombe, Candida guilliermondii, Saccharomyces kluyveri, Kluyveromyces waltii) was extracted using MasterPure Yeast DNA Purification Kit according to the manufacturer’s instructions (Epicentre, MPY80200).

Mutant library construction.

Twelve independent mutant libraries, each for different regions of His3 (S2 Supporting Information), were generated based on the results of multiple alignment of 392 His3 orthologues. The alignment was built using the ClustalW alignment feature of the MEGA 6.0 software package [55] and user-corrected.

Mutant libraries were constructed by fusion-PCR, leaving two variable regions separated by a constant region. For each library, two contiguous fragments of His3 were amplified independently, using 1 μg of S. cerevisiae (strain FY4) genomic DNA in separate Phusion polymerase reaction mixes (Thermo Fisher Scientific, F530S) in GC buffer. For each PCR, one of the primers was a degenerate oligonucleotide with a constant part at the 5’ end required for the fusion-PCR the other primer was either 126 or 127. The degenerate primer approach led to the integration of non-extant amino acid sequences due to the redundancy of the genetic code. Consider the amino acid Phe in S. cerevisiae coded by the codon TTT. When incorporating an extant orthologous state Trp (TGG) two independent T -> G nucleotide mutations will be incorporated creating the codons TTG (Leu) and TGT (Cys). If these two amino acids were not found in other species then they would be non-extant and due to the non-random nature of their incorporation in our dataset the non-extant states in this study do not represent a random set of amino acid changes. The cycling conditions for the PCR were 98°C for 30 s 98°C for 20 s, 60°C for 30s and 72°C for 1 min (25 cycles) and 72°C for 5 min. The products were column-purified (QIAGEN, QIAquick PCR purification kit, 28104), eluted in 50 μl and mixed in equimolar proportion. The fusion-PCR was carried out by diluting 10 μl of the mix to 25 μL of standard Phusion polymerase reaction mix in GC buffer. The cycling conditions of the fusion-PCR were 98°C for 30 s 98°C for 30 s, 60°C for 2 min and 72°C for 1 min (25 cycles) and 72°C for 5 min. The product of fusion was purified from agarose gel (Qiagen, MinElute Gel Extraction Kit, 28604) and eluted in 10 μl of water. 10 μl of the product was used as a template for additional 5 cycles of PCR reaction in Phusion polymerase reaction mix (Thermo Fisher Scientific, F530S) in GC buffer, using primers 126 and 127. The cycling conditions were as follows: 98°C for 30 s 98°C for 20 s, 60°C for 30 s and 72°C for 1 min (5 cycles) and 72°C for 5 min. The product was column-purified (QIAGEN, QIAquick PCR purification kit, 28104), and used as an insert for Gibson assembly.

To create a library of His3 mutants, pRS416 plasmid was amplified using primers 128 and 129. The insert was cloned into the vector using Gibson assembly (NEB, E2611S). Ligated products (200–300 ng/μL) were desalted by drop dialysis using 13 mm diameter, Type-VS Millipore membrane (Merck Millipore, VSWP01300). 20 μL ElectroMAX DH10B competent cells (Invitrogen, 18290015) were electroporated with 3 μL ligated products. 0.01% of the electroporated bacteria were plated on ampicillin-containing medium in order to estimate the complexity of the library the remaining culture was grown overnight in 100 ml of liquid medium, and the plasmid was extracted the next day. For each library, the maximum number of protein sequences that can be generated was computed. Libraries were generated until to total complexity reached at least 3 times this value.

Yeast transformation and yeast library generation.

For each segment, yeast strain LBCY47 (his3:KanMXleu2Δ0 met15Δ0 ura3Δ0, derived from BY4741) was transformed with 50 μg of pRS416_His3 mutant library using lithium acetate transformation and plated onto glucose synthetic complete dropout plates lacking uracil. After 40 hours’ growth at 30°C, approximately 0.5 million yeast colonies were scraped off the plates, mixed together and washed 2 times with 100 ml of PBS.

Bulk competition.

4x10 9 cells were inoculated into 500 ml of glucose synthetic complete dropout medium lacking uracil with 200 mg/L of G418, and grown at 30°C at 220 RPM for 6-8 h in order to eliminate clones with low fitness irrespective of histidine biosynthesis. Cells were later pelleted and washed with 50 ml of PBS. Approximately 10 10 cells were inoculated into 1 L of synthetic complete dropout medium lacking histidine, and grown at 30°C at 220 RPM for 168 h with 12 h between bottlenecks:

10 10 cells were transferred into fresh medium

10 8 cells from the culture were kept as sample for the given time point. Bulk competition for each library of mutants was performed in two replicates to account for biological variability.

NGS library preparation.

The relative abundance of yeast mutants was measured in 3 samples: 1) the initial population before selection was applied (t0), 2) the population after 12 h of growth in the selective medium (t1), and 3) the final population after 168 h of growth in the selective medium (t14). In order to extract plasmid DNA, 5x10 9 cells from each sample were incubated in 300 μL of zymolyase buffer (1 M sorbitol, 0.1 M sodium acetate, 60mM EDTA (pH 7.0), 2 mg/ml zymolyase, 1% 2-Mercaptoethanol) at 37°C for 3 h. The plasmid DNA was purified from the obtained spheroplasts using QIAprep Spin Miniprep Kit (QIAGEN, 27104) according to the manufacturer's protocol. The obtained DNA was used as a template in a 25 μL of Q5 DNA polymerase reaction mix (NEB, M0491S), using staggered primers for demultiplexing in the following cycling conditions: 98°C for 30s 98°C for 10s, 60°C for 30s and 72°C for 30s (18 cycles) and 72°C for 2 min. PCR products were purified using Agencourt AM Pure XP beads (Beckman Coulter, A63880), and eluted in 40 μL of TE buffer (pH 8.0). DNA extraction and PCR-amplification were repeated twice for every sample to account for the technical variability.

NGS libraries were prepared from 100 ng of the purified DNA amplicons using Ovation Rapid DR System (Nugen, 0319-32) according to manufacturer's instructions. Each library was visualized on a Bioanalyzer (Agilent Technologies) and quantified by qPCR with a Kapa Library Quantification Kit (Kapa Biosystems, KK4835). Twelve samples were pooled together (accounting for two biological replicates, two technical replicates and three time points) at the final concentration of 4 nM, and sequenced in the same lane. Samples were sequenced as 125-bp paired-end reads on a HiSeq2500 sequencer (Illumina) with v4 sequencing chemistry.

Yeast growth assay.

Mutant strains were grown overnight in complete dropout medium lacking uracil. The cultures were diluted to 0.05 OD 600 nm, and grown for 5 h in the same medium. 6 μL of each culture were transferred into 96-well plates in 125 μL of complete dropout medium lacking histidine. Growth of the strains was monitored by measuring OD 600 nm every 10 min using Tecan Infinite M1000 PRO microplate reader equipped with an integrated Stacker module.

The growth rate of individual curves was measured as the inverse of the time to grow from OD = 0.135 = exp(-2) to OD = 0.368 = exp(-1). If the curve did not reach 0.368, the growth was set to 0. Curves that crossed 0.135 or 0.368 were excluded. The growth rate of a clone was measured as the median of 6 independent growth experiments. We excluded from the analysis clones with discordance between growth in solid and liquid medium, clones that could not be sequenced or that showed evidence of contamination by sequencing, and clones such that the Kullback-Leibler divergence of their read counts compared to all synonymous clones was greater than 0.22. The later criterion ensured that the selected clones were not outliers compared to other variants encoding the same protein.

Growth rates of isolated strains

We isolated 197 strains from all segment libraries of extant amino acid combinations (9-26 strains per segment) and used Sanger sequencing to determine the sequence. For each strain we performed 6 repeats of growth assay and calculated the average growth rate. Fitness values from competition and growth rates are highly correlated (r = 0.82, p = 10 -48 ). Correlation was significant and greater than 0.6 for all segments except segment 9, where all selected genotypes appeared to be neutral (S4 Fig).

Initial data filtering

The individual sequences of the variants were recovered from pair-end reads with the following steps: the constant region between the two variable regions was identified by inexact matching allowing up to 20% errors using the Seeq library version 1.1.2 (https://github.com/ezorita/seeq). The reads are not oriented because the Illumina sequencing adapters were added by ligation, so the constant regions were searched on both reads. Forward and reverse reads were swapped when a match was found on the reverse read. This ensured that all of the sequences are in the same orientation. For multiplexing purposes, the sample identity was encoded in the left and right primers used to PCR-amplify the variants. To demultiplex the reads, we used inexact matching with the candidate primers, allowing up to 20% errors. To merge the reads, the sequence of the reverse reads was reverse complemented and the constant region was searched by inexact matching allowing up to 20% errors. The position of the constant part in each read indicated how they must be stitched together. This approach was faster and less error-prone than using FLASH [56]. In the region of overlap, the consensus sequence was determined by picking the nucleotide with highest quality as indicated in the quality line of the fastq files. If 'N' persisted in the final sequence, the reads were discarded. The PCR primers were trimmed so that all the sequences of the same competition would start and end at the same location.

Reads that did not have the constant region, that could not be oriented or that could not be demultiplexed were discarded. The remaining errors in the reads were corrected by sequence clustering. We used Starcode version 1.0 [57] with default parameters and allowing up to two errors. The corrected reads were translated using the genetic code. Variants encoding the same proteins were not merged they were kept separate for downstream analyses. A running Docker virtual machine with commented scripts to replay the whole the process is available for download at https://hub.docker.com/r/gui11aume/epi/.

DNA sequence variant frequency calculation and data filtering

The total number of reads for 12 segments, 3 time points and 4 replicas are shown in S2 Supporting Information. Genotypes frequencies are defined as the number of reads for a given genotype divided by the total number of reads in that replicate. Mean frequency was calculated over 4 replicas to be used in further analysis. However, to eliminate influence of outliers the median was taken instead of mean if absolute difference between mean and median was greater than the median value. Only genotypes present in both technical replicas of both biological replicas with at least ten reads (summed across all time points) in each of them were kept.

Sequencing error estimate

The length of the segment was designed so that each variable region was read twice by pair-end reads (S2A Fig). This strategy led to a substantial reduction in the sequencing error rate because mismatches between the two reads were corrected to the nucleotide with the higher quality call.

The raw Illumina sequencing error rate was estimated by measuring the frequency of mismatches between forward and reverse reads. The rationale of this estimate is that each mismatched nucleotide must be a sequencing error for at least one of the reads. The variant calling error rate was estimated by collecting groups of reads that differ by only one nucleotide in the constant region (S2A Fig). Since the variable regions are identical, such reads come from the same variant and the different nucleotide in the constant region must originate from a calling error (mutations in S. cerevisiae are negligible because they occur at a much lower rate). The frequency of such reads was used to approximate the per-nucleotide variant calling error rate. The raw Illumina error rate was computed by custom Python scripts and reads differing by one nucleotide were collected using the “sphere” clustering option of Starcode with a maximum distance of 1 (see the Data access section for the code repositories). The results are summarized in S2 Supporting Information.

The design strategy and the low error rate allow us to distinguish variants incorporated in the library from random sequencing errors. Each library contained on the order of 10 5 individual sequence variants, so that each library variant would be found at a frequency several orders of magnitude higher than variants created by sequencing errors. For example, for segment 7, the number of nucleotide variants was 176,879 with 31,815,448 possible single mutants of these variants. The error rate in segment 7 was 0.04% per nucleotide, which translates into 2.4% of reads being miscalled. Thus, the expected frequency of a particular miscalled variant is 0.024 / 31,815,448 ≈ 8x10 -10 , considerably smaller than the expected frequency of real variants 0.976 / 176,879 ≈ 6x10 -6 . The estimated frequencies of library variants for all segments of His3 are reported in S2 Supporting Information.

The final variant calling error rate was smaller than the numbers shown in S2 Supporting Information because errors were further corrected by sequence clustering using Starcode (see Initial data filtering section). In line with the rationale above, low frequency erroneous reads are converted to the closest high frequency variant at a maximum Levenshtein distance of 2.

PCR recombination rate estimate

It is known that PCR can create new genotypes by template switching [58]. To test the magnitude of this effect, we took advantage of the two-block design of the variants and estimated the frequency of recombination between the left and the right variable regions. The structure of reads can be represented like this: AAAAAA-------BBBBBB. In this example, “A” represents the left part of the variable region of the segment, “-“ represents the invariable region and “B” represents the right part of the variable region. Insertions of two or more nucleotides are rare events caused by errors during the library synthesis so, if the same insertion in a left half of the variant is associated to several variants on the right half, it is likely an occurrence of template switching (the same holds for insertions in the right half of the variant). For example, if the region with A*AAA*AA-------BBBBBB, where “*” represents a deletion, is found with several different variants of the B segment then such a situation likely represents template switching. Focusing on the initial time point to avert the effect of selection, we counted a total of 11,454 variants with two or more insertions on either the left side or the right side. Among those, 76 had the same insertion as another variant. This means that > 98.6% of the variants were free of template switching. Extrapolating to the rest of the dataset, this means that the “leakage” of reads between variants is substantially lower than the magnitude of the observed epistasis and we can rule out this artefact as a potential explanation for our results. In either case, all errors in our experimental pipeline, including template switching, are taken into account when we calculate the false discovery rate.

Noise estimation

The major factors causing noise in genotype frequency measurements are sampling errors, PCR amplification errors and genetic drift during the competition. For all of these factors, the amount of error depends on the genotype frequency. Therefore, we estimated measurement errors as the function of genotype frequency.

For a given segment, time point and a pair of biological or technical replicas for each genotype we calculated the mean frequency and the squared difference of frequencies from these two replicas. We sorted genotypes by mean frequency and grouped them such that each bin contains 5000 genotypes. We calculated the average frequency and the average squared difference in each bin. Additionally, squared error for frequency 0 was set equal to , where Ni and Nj are total read numbers in replicas i and j. Finally, by linear interpolation we obtained dependencies of squared differences as a function of frequency, , where i and j are different replicas.

Using squared differences from pairwise comparison of replicas we can estimate variance of mean frequency over four replicas. Let numerate replicas 1, 2, 3, 4 where 1, 2 are technical replicas of the first biological repeat and 3, 4 are the technical replicas of the second biological repeat. Errors coming from the competition (e.g.: genetic drift) are shared for replicas 1, 2 and for replicas 3, 4. Let’s call them and and their variances and , respectively. Technical errors of sampling from the population and from PCR are unique for each replica. Let’s call them and their variances , respectively. All variances are function of frequency and when writing we assume .

In the introduced notations the mean frequency over 4 replicas is: where f* is the true frequency. Applying basic properties of variance, the variance of mean frequency:

To estimate we used squared differences from pairwise comparison of replicas calculated above :

Therefore, the variance of mean frequency f can be found as: Recalling that variance and squared differences are a function of frequency: For each segment and time point we calculated the numerical function σ 2 (f). Then for each genotype having mean frequency fx we estimated its variance as σ 2 (fx)

Merging amino acid genotypes

We merged nucleotide genotypes that corresponded to the same amino acid sequence and summed their frequencies and variances. We filtered out all genotypes x that had any of following patterns:

or or . Fraction of such genotypes were <0.5% for all segments except S9, for which it was 4.5%

For further analysis, this amino acid dataset was used except when specified.

Fitness estimation

Number of cells in a pool with particular genotype x after time interval t increases exponentially where sx is absolute fitness. Frequency of genotype x as well depends exponentially on absolute fitness with an additional multiplicative factor: where N t and N 0 are total cell numbers in a pool at time points 0 and t. Factor reflects the total growth of population, it changes with time but is the same for all genotypes. Therefore, we can rewrite genotype frequency at time t as: where

In the measured dataset for each genotype x we have 3 measurements of frequency and their errors . To estimate genotype fitness, we minimized relative squared errors of exponential fit as function of fitness sx and initial frequency : (1)

This formula contains four parameters common for all genotypes from one segment: . Further we will perform additional shifting and scaling of fitness values (see next section), therefore, without loss of generality we could set and t1 = 1. Ideally, t2/t1 should equal 14 however, we noticed that this ratio does not hold for many segments and fitted k = t2/t1 from data instead of using value 14.

To find specific and k for each segment we selected genotypes with high frequencies at t0 (t0>25∙10 −6 ) that corresponds to

500-1000 reads per technical replicate. Each segment contains 10 3 -10 4 genotypes that meet this criterion. We minimized eq. (1) for selected genotypes trying all possible combinations of ( ) from a grid where with step 0.01 and [1,14] with step 0.1 and choose ( ) which gives minimal (*).

Finally, given ( ) for each segment we found sx for each genotype. Errors for fitness values, , were estimated as standard error of best-fit parameter.

For genotypes with frequencies pattern fit of Eq (1) cannot be obtained. Therefore, we defined upper boundary for their fitness value as , where are total read numbers at time point t1 in i-th replica.

Fitness rescaling

We scaled fitness such that lethal genotypes have fitness 0 and neutral genotypes have fitness 1. We assumed that genotypes with a stop codon or frame shift are lethal. Thus, for each segment we linearly rescaled the fitness distribution so that 95% of genotypes with nonsense mutations have a fitness of 0 and so that the local maximum of the fitness distribution of genotypes with extant amino acids is 1. The scaling around the local maximum led to the shift of fitness values of less than +/- 0.025 in each of the 12 segments compared to the measured wildtype strains and did not affect our results (for scale, we called an amino acid change non-neutral if its effect on fitness was > 0.4). All fitness values that became smaller than 0 were set to 0.

Quality control and comparison of synonymous sequences

We used nucleotide synonymous sequences as an internal control. The error rate for a measurement of fitness of an amino acid sequence depends on the number of synonymous sequences, n, that were used to estimate it. Therefore, we estimated the false discovery rates separately for categories with n = 1. 10 variants. For each amino acid genotype with more than n synonymous variants we merged random combination of n of its nucleotide genotypes and estimated fitness. We then calculated the difference between this fitness and the fitness of the corresponding amino acid sequence. We classified case as “false unfit” if difference was <-0.4 and as “false fit” if difference was >0.4. The fraction of such cases gives us false discovery rates for genotypes having n synonymous variants. To get total false discovery rates for each segment we averaged “false unfit” and “false fit” rates for different n with weights equal to the fractions of genotypes in amino acid dataset which have n synonymous variant (S2 Supporting Information). The high correlation between biological replicas (S2 Supporting Information) confirms high accuracy of our high-throughput experiments, with the exception of segment 9.

Analysis of amino acid replacement effects in different backgrounds

For each amino acid replacement, we calculated its fitness effect in different backgrounds. We estimated the fraction of backgrounds in which a replacement exhibits deleterious, beneficial and neutral effects. To get the fraction of backgrounds with neutral effects we utilized the approach of mixture distribution analysis from Sarkisyan et al. [10]. We assume that neutral replacements have the same distribution as the fitness effects of synonymous replacements and are caused by measurement noise. We then calculated the fraction of backgrounds in which mutations have a neutral effect as the overlap between the distribution of synonymous mutations and distribution of fitness effects across different backgrounds. In remaining backgrounds, the amino acid replacement was called to have a non-neutral effect. Among them we counted those with strong fitness effects, including deleterious mutations when the fitness effect was < -0.4 and beneficial when the fitness effect was > 0.4. We concluded that a particular amino acid replacement exhibits a strong deleterious or beneficial effect in some background if the fraction of such backgrounds exceeded the false discovery rate.

Predicting fitness using deep learning

To predict the unidimensional fitness function based on additive contribution of extant amino acid states we used deep learning, a machine learning technique capable of constructing virtually any function, even with a simple neural network architecture [59] (Fig 4B). To convert amino acid sequences into a binary feature matrix we used one-hot encoding strategy, in which each feature (column in the matrix) indicates the presence or absence of a particular amino acid state. For neural network implementation the TensorFlow library was used (www.tensorflow.org/about/bib).

To optimise the accuracy/overfitting ratio, we tested different combinations of neural network architectures and parameters. As a starting point, we selected a number of complex architectures, which describe our data but are prone to overfitting due to their large number of parameters. We then gradually reduced the number of layers and neurons to reduce the overfitting, while empirically controlling for accuracy.

Our final architecture consists of three layers and 22 neurons in total (Fig 4). Each neuron performs a linear transformation of the input and subsequently applies a non-linear sigmoid activation function to the result. The output of the first layer is a single sigmoid of a linear transformation of the feature vector, i.e. where x is the feature vector, c1 is the vector of coefficients, b1 is the bias and σ(t) = (1+et ) −1 . Looking forward, observe that is a fitness potential of the genotype x (see main text). The second layer decompresses the hidden nonlinear representation into 20 sigmoids, the combination of which is further linearly transformed with the only neuron of the third layer and wrapped into another sigmoid function: In the formula above, c2,i is the coefficient of the i-th neuron in the n-th layer, and b2,i is the bias of the i-th neuron in the second layer (the biases of the only neurons of the first and third layers are b1 and b3, correspondingly).

The key idea of our approach is that the number of neurons in the first layer of the neural network determines the number of linear combinations of mutations (or fitness potentials) used in order to predict the fitness of the variant. In other words, each neuron in the first layer assigns a single unique weight to every amino acid state in the dataset (Fig 4). Thus, the number of neurons in the first layer of the architecture is the dimensionality of epistasis in the model (i.e. one in this case). The fitness potentials are then transformed by a nonlinear phase shift function constructed by the 22 neurons of the neural network.

The simplicity of the architecture minimizes overfitting, which was further prevented by keeping 10% of the data as a test set (in order to see how well the model performs on a fraction of the data it has never seen) and early stopping (training was stopped if the test accuracy did not improve for 10 epochs). The loss function that is being optimised is not convex, which leads to a high probability of getting stuck in different local minima. To ensure reproducibility, each of our models was constructed ten independent times using random train-test splits. The accuracies of the 10 constructed models varied by at most 2%.

Each model was trained for under 100 epochs using mean squared error as the loss function. An adaptive learning rate method proposed by Geoffrey Hinton, RMSProp, was used as the optimiser [60]. This algorithm is a version of a mini-batch stochastic gradient descent, utilising the gradient magnitude of the recent gradients in order to normalise the current ones. All the weights were initialised using Xavier normal initialiser [61].

Paths between pairs of fit genotypes

For analysis in Fig 7D, we first choose two fit “parental” genotypes, one randomly chosen genotype (eg: ABE) and the other parental genotype that is either S. cerevisiae wildtype genotype (inter-segmental) or another random fit genotype in the data (intra-segmental) (eg: abe). The two genotypes in this example are Hamming Distance 3 apart (HD = 3). We next compute all (2 HD -2) intermediate genotypes (eg: AbC, aBc, et cetera) and retain the subset that were experimentally measured. We represent the two parental genotypes and all measured intermediate genotypes as an undirected graph in which each genotype is a vertex. All genotypes one amino acid replacement apart are connected by an unweighted edge. The shortest possible path for a given pair of genotypes is of length HD. We find all shortest paths between the two parental genotypes using a breadth-first search. We next remove all vertices (genotypes) that are unfit, and recompute the number of shortest between the two parental genotypes. For example, in Fig 7A, there are six paths of length three if you take into account all genotypes, but only three paths of length three if you take into account only fit genotypes.

Clustering of unfit genotypes in sequence space

For the analysis in Fig 7E, we first represent the two parental genotypes and all measured intermediate genotypes as an undirected graph in which each genotype is a vertex. All genotypes one amino acid replacement apart are connected by an unweighted edge. We can then compute the degree (number of genotypes of distance one) for each vertex (genotype). We do so randomly drawing from all measured genotypes and using only unfit genotypes or using the same number but randomly chosen genotypes. For the randomly chosen genotypes, the value is the average over 1000 runs.

Quantifying sign epistasis

For each amino acid replacement (eg: C -> S at position 141), we considered only those that exhibit a large fitness effect (abs. difference > 0.4) comprising a set of amino acid replacements with large effects. For each amino acid replacement, we divided the genetic backgrounds into two categories: those in which the replacement caused a > 0.4 increase in fitness, and those backgrounds in which the replacement caused > 0.4 decrease in fitness. A single amino acid replacement can cause a large increase in fitness in some backgrounds and a large decrease in others due to two possible reasons: sign epistasis or experimental error. To differentiate the two cases, we identified secondary amino acid replacements that significantly alter the ratio of large increases to large decreases in fitness (Fisher’s exact test, Bonferroni corrected p-value < 0.05). We only consider a site to be under sign epistasis if there is a second site that alters the frequency of sign epistasis in a statistically significant manner, i.e. more frequently than expected by chance alone.

Ancestral state reconstruction

We reconstructed ancestral amino acid states using maximum likelihood approach implemented in CODEML program of PAML 4 [62].

Structural analysis

Structure prediction.

An initial model was obtained with the I-TASSER server [63]. The list of top 10 PDB structural templates picked up by the I-TASSER included high-quality crystal structures of imidazoleglycerol-phosphate dehydratases from Arabidopsis thaliana and Cryptococcus neoformans. Coordinates of the top-scoring model (C-score = 0.21, estimated TM-score = 0.74±0.11, estimated RMSD = 5.1±3.3Å) and the predicted normalized B-factor [64] were used for further analysis. The value of the model quality metric (TM-score >0.5) indicates a model of correct topology. The proteins structurally close to the final model (RMSD 0.6–1.7Å are PDB IDs 4MU0, 4GQU, 1RHY, 5DNL and 2AE8 from Arabidopsis thaliana, Mycobacterium tuberculosis, Cryptococcus neoformans, Pyrococcus furiosus and Staphylococcus aureus.

We measured the distribution of distances (in angstroms) between pairs of residues that exhibit strong sign epistasis (S2 Supporting Information, ReallyPositivePair == TRUE), and compared it with the distribution of pairwise distances among residues for which we have sufficient data to be certain that a given pair does not exhibit sign epistasis (S2 Supporting Information, ReallyNegativePair == TRUE).

ΔΔG prediction.

Cartesian_ddg application [65] from Rosetta version 2017.08.59291 was used for ΔΔG predictions. Top-scoring I-TASSER model was pre-minimized using the Relax [66] application in dual-space [67] with the flags: -relax:dualspace true -ex1 -ex2 -use_input_sc -flip_HNQ -no_optH false -relax:min_typelbfgs_armijo_nonmonotone -nonideal. The best scoring model from 1000 structures was selected. The effect of up to 4 mutations (54,500 genotypes in total) was assessed in Cartesian space with the Talaris_2014 score function, and the -fa_max_dis 9.0 flag. ΔΔG was estimated as a difference of mean score for 3 independent runs for every mutant and the wild-type score.


DNA analysis for chimpanzees and humans reveals striking differences in genes for smell, metabolism and hearing

Nearly 99 percent alike in genetic makeup, chimpanzees and humans might be even more similar were it not for what researchers call "lifestyle" changes in the 6 million years that separate us from a common ancestor. Specifically, two key differences are how humans and chimps perceive smells and what we eat.

A massive gene-comparison project involving two Cornell University scientists, and reported in the latest issue of the journal Science (Dec. 12, 2003), found these and many other differences in a search for evidence of accelerated evolution and positive selection in the genetic history of humans and chimps.

In the most comprehensive comparison to date of the genetic differences between two primates, the genomic analysts found evidence of positive selection in genes involved in olfaction, or the ability to sense and process information about odors. "Human and chimpanzee sequences are so similar, we were not sure that this kind of analysis would be informative," says evolutionary geneticist Andrew G. Clark, Cornell professor of molecular biology and genetics. "But we found hundreds of genes showing a pattern of sequence change consistent with adaptive evolution occurring in human ancestors." Those genes are involved in the sense of smell, in digestion, in long-bone growth, in hairiness and in hearing. "It is a treasure-trove of ideas to test by more careful comparison of human and chimpanzee development and physiology," Clark says.

The DNA sequencing of the chimpanzee was performed by Celera Genomics, in Rockville, Md., as part of a larger study of human variation headed by company researchers Michele Cargill and Mark Adams.

Celera generated some 18 million DNA sequence "reads," or about two-thirds as many as were required for the first sequencing of the human genome. Statistical modeling and computation was done by Clark and by Rasmus Nielsen, a Cornell assistant professor of biological statistics and computational biology. Some of the analysis, which also compared the mouse genome, used the supercomputer cluster at the Cornell Theory Center. Clark explains, "By lining up the human and chimpanzee gene sequences with those of the mouse, we thought we might be able to find genes that are evolving especially quickly in humans. In a sense, this method asks: What are the genes that make us human? Or rather, what genes were selected by natural selection to result in differences between humans and chimps?" The study started with almost 23,000 genes, but this number fell to 7,645 because of the need to be sure that the right human, chimp and mouse genes were aligned.

According to Clark, all mammals have an extensive repertoire of olfactory receptors, genes that allow specific recognition of the smell of different substances. "The signature of positive selection is very strong in both humans and chimps for tuning the sense of smell, probably because of its importance in finding food and perhaps mates," says Clark. In addition to the great departure in smell perception, differences in amino acid metabolism also seem to affect chimps' and humans' abilities to digest dietary protein and could date back to the time when early humans began eating more meat, Clark speculates. Anthropologists believe that this occurred around 2 million years ago, in concert with a major climate change.

"This study also gives tantalizing clues to an even more complex difference -- the ability to speak and understand language," Clark says. "Perhaps some of the genes that enable humans to understand speech work not only in the brain, but also are involved in hearing." Evidence for this came from a particularly strong sign of selection acting on the gene that codes for an obscure protein in the tectorial membrane of the inner ear. One form of congenital deafness in humans is caused by mutations to this gene, called alpha tectorin.

Mutations in alpha tectorin result in poor frequency response of the ear, making it hard to understand speech. "It's something like replacing the soundboard of a Stradivarius violin with a piece of plywood," Clark notes. The large divergence between humans and chimps in alpha tectorin, he says, could imply that humans needed to tune the protein for specific attributes of their sense of hearing. This leads Clark to wonder whether one of the difficulties in training chimpanzees to understand human speech is that their hearing is not quite up to the task. Although studies of chimpanzee hearing have been done, detailed tests of their transient response have not been carried out.

Clark emphasizes that a study like this cannot prove that the biology of humans and chimps differ because of this or that particular gene. "But it generates many hypotheses that can be tested to yield insight into exactly why only 1 percent in DNA sequence difference makes us such different beasts," he says.

Also collaborating in the study were researchers at Applied Biosystems (Foster City, Calif.), Celera Diagnostics (Alameda, Calif.) and Case Western Reserve University in Cleveland. The Science report is titled, "Inferring non-neutral evolution from human-chimp-mouse orthologous gene trios."


Answers Research Journal

2011 Volume 4

Cutting-edge creation research. Free. Answers Research Journal (ARJ) is a professional, peer-reviewed technical journal for the publication of interdisciplinary scientific and other relevant research from the perspective of the recent Creation and the global Flood within a biblical framework.

  • Science
  • What Is Science?
  • Astronomy
  • Biology
  • Chemistry
  • Environmental Science
  • Fossils
  • Genetics
  • Geology
  • Human Body
  • Mathematics
  • Physics

Submit a Paper

High-quality papers for Answers Research Journal, sponsored by Answers in Genesis, are invited for submission.

  1. Read the Instructions to Authors Manual (PDF).
  2. Email papers, diagrams, tables, etc. to the email address listed in the Manual.

Answers in Genesis is an apologetics ministry, dedicated to helping Christians defend their faith and proclaim the good news of Jesus Christ.


Watch the video: excel - υπολογισμός του ποσοστού που εκφράζει ένα μέρος των εξόδων σε σχέση με το σύνολο (August 2022).