Musser Lab

Reconstructing ancient synaptic architecture with single-cell RNA-seq and deep learning

Abstract

Proteins are the core machinery of animal cells, where they orchestrate distinct functions. However, how protein-protein interactions (PPIs) diversified across cells during animal evolution is poorly understood. Emerging protein language models, which learn the “grammar” of protein sequences and uncover hidden patterns within them, offer strong generalization power, making it possible to tackle challenges such as detecting remote homologs and predicting their interaction patterns. In this project, we focus on reconstructing the interaction profiles of synaptic proteins across different animals to uncover the conserved and divergent constituents of early animal synapses throughout evolution.

To do this, we integrate full proteomes and single-cell transcriptomic data from diverse species into a unified analytical pipeline. First, we use human synaptic gene modules as templates to locate homologous modules across a broad range of species using PROST, a protein language model-based homolog detection tool. Then, to refine our list of potential synaptic protein interaction partners for each cell type, we perform co-expression analysis on published single-cell data, identifying co-expressed synaptic proteins as well as potential novel interacting partners. Next, we develop a novel deep learning model to predict the protein interaction profile of each candidate. Lastly, we experimentally validate the predicted modules. The results of this study will map the evolutionary path of synaptic gene modules and may uncover novel mechanisms and potential new types of synapses.

What inspired this project?

This project is inspired by the longstanding mystery of how the nervous system originated and evolved from simple nerve nets in early metazoans to the complex synaptic structures seen in modern bilaterians. Genome sequencing has revealed that many synaptic genes existed even before functional nervous systems emerged. This raises the possibility that ancient organisms possessed molecular components or functional modules of synapses long before neurons formed. At the same time, advances in protein language models (pLMs), which can learn the structural and functional properties embedded in protein sequences, offer a powerful means to probe these ancestral synaptic elements. By combining the predictive power of fine-tuned pLMs with single-cell transcriptomic data analysis, the project seeks to reconstruct ancient synaptic architectures and uncover how the molecular foundations of neural communication first arose.

How might this project benefit humanity? 

This project could benefit humanity by helping us understand how the nervous system first emerged and evolved, knowledge that can ultimately inform modern neuroscience and medicine. By reconstructing ancient synaptic architectures, we may uncover fundamental design principles that shaped neural communication and remain conserved in humans today. These insights can improve our understanding of neurological disorders rooted in synaptic dysfunction, which may open new avenues for diagnostics or therapeutic strategies. Additionally, the project develops a computational framework and deep learning-based pipeline that are broadly applicable beyond tracing functional modules in synapses. These tools can be used to study the evolutionary history of any gene module, perform comparative analyses across diverse organisms, and uncover how molecular systems diverge and specialize over time.

Who would use this?

A wide range of researchers would benefit from both the results and the methodological framework of this project. Evolutionary biologists and comparative genomics researchers would use the findings to understand how gene modules (synaptic or otherwise) emerged, diversified, and were repurposed across animal lineages. Neuroscientists would draw on the reconstructed synaptic architectures to gain insight into the origins of neural communication and the conserved molecular principles underlying nervous systems. 

Computational biologists and bioinformaticians would apply the deep learning-based pipelines to analyze protein interactions, trace evolutionary trajectories, and conduct cross-species comparisons of gene networks. Moreover, biomedical researchers and drug discovery teams could use the improved modeling of protein interactions to identify disease-relevant pathways and therapeutic targets. Ultimately, anyone studying gene module evolution, protein function, or molecular system diversification across animals could leverage the project’s code repository and datasets.

How does it work?

The pipeline consists of two major components: (1) identification of synaptic proteins across diverse animal species using protein language models, and (2) inference of the evolutionary assembly of synaptic protein complexes by integrating protein co-expression profiles with protein-protein interaction predictions generated by a novel deep learning model.

A major challenge in finding proteins in non-model animals that resemble well-characterized synaptic proteins is that traditional sequence alignment methods often fail at large evolutionary distances. Protein language models, however, capture inherent patterns in protein sequences and can identify similar proteins across large evolutionary distances with greater precision. We will systematically collect complete animal proteomes and convert every protein into a high-dimensional embedding matrix using the ESM language model, following the PRotein Ortholog Search Tool (PROST) workflow. Then, we will calculate pairwise embedding distances between well-characterized synaptic proteins and all proteins in each animal. Candidates will be validated with BLAST and HMM-based sequence searches where possible, and by assessing structural similarity using AlphaFold2.
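The embedding-distance search described above can be illustrated with a minimal sketch. It assumes each protein has already been reduced to a fixed-length embedding vector, and it uses cosine distance purely for illustration; the actual PROST workflow defines its own embedding compression and distance measure, which are not reproduced here. The function names, the toy random vectors, and the `prot_###` identifiers are all hypothetical.

```python
import numpy as np

def cosine_distance_matrix(queries, targets):
    """Return an (n_queries, n_targets) matrix of cosine distances."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return 1.0 - q @ t.T

def top_hits(queries, targets, target_ids, k=3):
    """For each query embedding, list the k closest target proteins."""
    dist = cosine_distance_matrix(queries, targets)
    order = np.argsort(dist, axis=1)[:, :k]  # indices of the k smallest distances
    return [[(target_ids[j], float(dist[i, j])) for j in row]
            for i, row in enumerate(order)]

# Toy stand-in for a proteome: 50 proteins with 16-dimensional embeddings.
# Real ESM embeddings are much larger; these random vectors are an assumption.
rng = np.random.default_rng(0)
proteome = rng.normal(size=(50, 16))
proteome_ids = [f"prot_{i:03d}" for i in range(50)]

# Hypothetical query: a slightly perturbed copy of protein 7, mimicking a
# known synaptic protein whose homolog is present in the proteome.
query = proteome[[7]] + 0.01 * rng.normal(size=(1, 16))

hits = top_hits(query, proteome, proteome_ids, k=3)
print(hits[0])  # closest candidates first
```

In practice the ranked hits would only be candidates; as the text notes, they still need confirmation by sequence searches and structural comparison.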

One precondition for synaptic proteins to interact is that they must be co-expressed in the same cell. We leverage the recent proliferation of published whole-body single-cell expression atlases, together with several new cell atlases generated in our own lab, to assess synaptic protein co-expression in animals with and without nervous systems. Briefly, for each dataset we identify co-expressed synaptic genes using weighted gene co-expression network analysis.
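As a rough illustration of the co-expression step, the sketch below computes the unsigned soft-threshold adjacency at the heart of weighted gene co-expression network analysis, `|cor(x_i, x_j)|^beta`. It is a simplified stand-in rather than the full method: topological overlap and module detection are omitted. The power `beta=6`, the cut-off `0.5`, the gene names, and the simulated expression matrix are illustrative assumptions.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """expr: cells x genes matrix. Return the unsigned adjacency |cor|**beta."""
    cor = np.corrcoef(expr, rowvar=False)  # gene-by-gene Pearson correlation
    adj = np.abs(cor) ** beta
    np.fill_diagonal(adj, 0.0)             # ignore self-connections
    return adj

def coexpressed_partners(adj, gene_ids, gene, threshold=0.5):
    """List genes whose adjacency to `gene` meets the (assumed) threshold."""
    i = gene_ids.index(gene)
    return [g for j, g in enumerate(gene_ids) if adj[i, j] >= threshold]

# Simulated data: gene_A and gene_B share an expression programme,
# gene_C varies independently.
rng = np.random.default_rng(1)
n_cells = 200
shared = rng.normal(size=n_cells)
expr = np.column_stack([
    shared + 0.1 * rng.normal(size=n_cells),  # gene_A
    shared + 0.1 * rng.normal(size=n_cells),  # gene_B
    rng.normal(size=n_cells),                 # gene_C
])
gene_ids = ["gene_A", "gene_B", "gene_C"]

adj = soft_threshold_adjacency(expr, beta=6)
partners = coexpressed_partners(adj, gene_ids, "gene_A")
print(partners)
```

Raising the correlation to a power suppresses weak, noisy correlations while keeping strong ones, which is why the independent gene drops out here.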

Finally, we identify species-specific interactions among synaptic proteins using a deep learning model developed in our lab. This model is built on the ESM2-650M architecture, a widely used protein language model pretrained on approximately 65 million sequences from the UniRef50 database. Leveraging the model’s learned representations of individual proteins, we train it on two kinds of data: (1) public protein-protein interaction datasets with binary labels (positive/negative), including the STRING v12 database, the cross-species dataset from Sledzieski et al., and the Human Gold Standard Dataset from Bernett et al.; and (2) protein complexes with annotated interface residues from the RCSB database. We first train the model on the binary-labeled datasets using a combination of classification and whole-sequence masked language modeling (MLM) objectives. We then fine-tune it on the protein-complex dataset with a selective masking strategy designed to focus the model’s attention on interaction-interface features, which enables more accurate prediction of specific protein-protein interactions.
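One way to picture a selective masking strategy is to mask annotated interface residues with a higher probability than the rest of the sequence, so that the MLM objective is dominated by interface positions. The sketch below is only a guess at the general idea, not the lab's implementation: the function name, the two probabilities, the example sequence, and the interface positions are all hypothetical, and the `<mask>` string is used simply as a stand-in mask token.

```python
import random

MASK_TOKEN = "<mask>"  # stand-in mask token; the real tokenizer may differ

def selective_mask(sequence, interface_positions,
                   p_interface=0.5, p_background=0.05, seed=None):
    """Mask residues, favouring annotated interface positions.

    Returns the token list and the 0-based indices that were masked.
    Both probabilities are illustrative assumptions.
    """
    rng = random.Random(seed)
    interface = set(interface_positions)
    tokens, masked = [], []
    for i, residue in enumerate(sequence):
        p = p_interface if i in interface else p_background
        if rng.random() < p:
            tokens.append(MASK_TOKEN)
            masked.append(i)
        else:
            tokens.append(residue)
    return tokens, masked

# Extreme setting for clarity: always mask the interface, never the background.
tokens, masked = selective_mask("MKTAYIAKQR", [2, 3, 4],
                                p_interface=1.0, p_background=0.0, seed=0)
print(tokens)
print(masked)
```

With intermediate probabilities the model would still see some background masking, which may help it retain the general sequence knowledge learned in the first training stage.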

Building on the previous analyses, we construct protein interaction networks for each synaptic gene module and compare overall interaction profiles across species. We also superimpose protein structures from different organisms onto the existing protein complex. By examining the amino acid composition near interaction interfaces, we infer how sequence changes may alter interaction profiles over evolutionary time. These computational findings are then cross-validated against direct experimental evidence of selected synaptic protein interactions in sponges and jellyfish, with ongoing work in our lab and in the laboratory of our collaborator Thomas Biederer in the Department of Neurology at Yale.
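A simple way to compare interaction profiles between two species is to treat each predicted network as a set of undirected edges between orthologous groups and to measure their overlap. The sketch below uses the Jaccard index for this; it is an illustrative choice, not necessarily the comparison used in the project, and the `OG1`-style ortholog-group labels and both edge lists are made up.

```python
def normalize_edges(edges):
    """Treat interactions as undirected: (a, b) and (b, a) are the same edge."""
    return {frozenset(pair) for pair in edges}

def jaccard(edges_a, edges_b):
    """Jaccard index of two edge sets (1.0 if both networks are empty)."""
    a, b = normalize_edges(edges_a), normalize_edges(edges_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical predicted interactions, with proteins mapped to shared
# ortholog-group labels so that the two species are comparable.
species_x = [("OG1", "OG2"), ("OG2", "OG3"), ("OG3", "OG4")]
species_y = [("OG2", "OG1"), ("OG3", "OG4"), ("OG1", "OG4")]

similarity = jaccard(species_x, species_y)
conserved = normalize_edges(species_x) & normalize_edges(species_y)
only_in_x = normalize_edges(species_x) - normalize_edges(species_y)

print(similarity)
print(sorted(sorted(edge) for edge in conserved))
print(sorted(sorted(edge) for edge in only_in_x))
```

Edges present in one species but not the other are natural starting points for the interface-level sequence comparison described above.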

With these results, we aim to reconstruct ancient synaptic architecture and pinpoint when synaptic functional modules first began to operate together, ultimately giving rise to the earliest synapses in evolutionary history.

Contributors & Acknowledgments

Jacob Musser joined Yale as an Assistant Professor in the Department of Molecular, Cellular, and Developmental Biology. Jacob was a molecular evolutionary biologist at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Originally from northern Michigan (USA), Jacob earned his PhD at Yale University in 2015 studying the molecular evolution of bird feathers. Jacob’s current research focuses on understanding the evolution of animal cell types, particularly the origin of early animal gene regulatory networks, specialized functional machinery such as the synapse, and early modes of intercellular communication. At EMBL he esta

Jacob Musser

Assistant Professor in Molecular, Cellular and Developmental Biology

Kejue is an associate research scientist in the MCDB department at Yale studying cell type and gene module evolution. He earned his PhD in computational biology with minors in applied math and statistics from Iowa State University with a focus on protein coevolution and homolog detection.

Kejue Jia, PhD

Associate Research Scientist

Related Publications

The Musser Lab investigates one of the great mysteries of animal evolution: how animals evolved to make specialized cells and orchestrate complex multicellular behavior. This required a suite of new inventions: genomic mechanisms to generate distinct genetic programs, new machinery to enact specialized cell functions, and novel intercellular signaling to coordinate cells.

To understand this, the lab studies sponges, comb jellies, and other early-branching animal lineages that provide a unique window into early animal life and give insight into the general principles of animal cell architecture.