Musser Lab

Reconstructing ancient synaptic architecture with single-cell RNA-seq and deep learning

Abstract

The functional machinery of cells is assembled through interactions of different proteins. However, how protein-protein interactions (PPIs) diversified across cells in animal evolution is poorly understood. Emerging protein language models that learn the “grammar” of protein sequences to uncover hidden mechanisms offer strong generalization power, making it possible to tackle challenges such as detecting remote homologs and predicting their interaction patterns. In this project, we focus on reconstructing the interaction profiles of synaptic proteins across different animals to uncover the conserved and divergent constituents of early animal synapses throughout evolution.

To do this, we integrate full proteomes and single-cell transcriptomic data from diverse species into a unified analytical pipeline. First, we use human synaptic gene modules as templates to locate homologous modules across a broad range of species using PROST, a protein language model-based homolog detection tool. Then, to refine our list of potential synaptic protein interaction partners for each cell type, we perform co-expression analysis on published single-cell data and identify co-expressed synaptic proteins as well as potential interacting partners. Next, we develop a novel deep learning model to predict the protein interaction profile of each candidate. Lastly, we validate the predicted modules experimentally. The results of this study will map the evolutionary path of synaptic gene modules and may uncover novel mechanisms and potential new types of synapses.

What inspired this project?

Evolution has diversified the machinery of cells by inventing new proteins and new interactions between proteins. One of the more remarkable inventions is the neuronal synapse, composed of hundreds of different proteins with highly choreographed interactions. Studying the origins of these proteins and their interactions in the laboratory is challenging, and we lack tools to computationally predict how the evolutionary assembly of synapses first occurred. Genome sequencing has revealed that many synaptic genes existed even before functional nervous systems emerged. This raises the possibility that ancient organisms possessed molecular components or functional modules of synapses long before true neurons arose. Recent advances in protein language models (pLMs), which can learn the structural and functional properties embedded in protein sequences, offer a powerful means to probe these ancestral synaptic elements. By combining the predictive power of fine-tuned pLMs with single-cell transcriptomic data analysis, the project seeks to reconstruct ancient synaptic architectures and uncover how the molecular foundations of neural communication first arose.

How might this project benefit humanity? 

This project benefits humanity by improving our understanding of how the nervous system first emerged and evolved, knowledge that ultimately informs modern neuroscience and medicine. By reconstructing ancient synaptic architectures, we uncover fundamental design principles that shaped neural communication and remain conserved in humans today. These insights will improve our understanding of neurological disorders rooted in synaptic dysfunction, which could open new avenues for diagnostic or therapeutic strategies.

Additionally, the project develops a computational framework and deep learning-based pipeline that are broadly applicable beyond tracing functional modules in synapses. These tools can be used to study the evolutionary history of any gene module, perform comparative analyses across diverse organisms, and discover how molecular systems diverge and specialize over time.

Who would use this?

A wide range of researchers would benefit from both the results and the methodological framework of this project. Evolutionary biologists and comparative genomics researchers would use the findings to understand how gene modules (synaptic or otherwise) emerged, diversified, and were repurposed across animal lineages. Neuroscientists would draw on the reconstructed synaptic architectures to gain insight into the origins of neural communication and the conserved molecular principles underlying nervous systems. 

Computational biologists and bioinformaticians would apply the deep learning-based pipelines to analyze protein interactions, trace evolutionary trajectories, and conduct cross-species comparisons of gene networks. Moreover, biomedical researchers and drug discovery teams could use the improved modeling of protein interactions to identify disease-relevant pathways and therapeutic targets. Ultimately, anyone studying gene module evolution, protein function, or molecular system diversification across animals could use the project’s code repository and datasets.

How does it work?

The entire pipeline consists of two major components: 1) Identification of synaptic proteins across diverse animal species using protein language models, and 2) Inference of the evolutionary assembly of synaptic protein complexes by integrating protein co-expression profiles with protein-protein interaction predictions generated by our novel deep learning model.

A major challenge for detecting homologous synaptic proteins in non-model animals is that sequence alignment-based methods often fail at large evolutionary distances. Protein language models, by contrast, capture inherent patterns in protein sequences and can identify similar proteins across large evolutionary distances with greater precision. We will systematically collect complete animal proteomes and then convert all proteins into optimized pLM embeddings using the PRotein Ortholog Search Tool (PROST) workflow. Then, we will calculate pairwise embedding distances between well-characterized synaptic proteins and all proteins in each animal. Where possible, candidate homologs will also be validated by sequence searches with BLAST and HMMs, and by assessing structural similarity using AlphaFold3.
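As an illustration of the distance-based search described above, the following minimal sketch ranks the proteins of one proteome by their embedding distance to a single query synaptic protein. It is not the PROST implementation: the cosine distance, the `max_distance` cutoff, the three-dimensional toy vectors, and the protein identifiers are all assumptions chosen only to make the idea concrete.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_candidates(query_emb, proteome_embs, max_distance):
    """Return (protein_id, distance) pairs within max_distance, closest first."""
    hits = []
    for protein_id, emb in proteome_embs.items():
        d = cosine_distance(query_emb, emb)
        if d <= max_distance:
            hits.append((protein_id, d))
    return sorted(hits, key=lambda hit: hit[1])

# Toy data: hypothetical identifiers and made-up 3-dimensional embeddings.
# Real pLM embeddings have hundreds to thousands of dimensions.
query = [1.0, 0.0, 1.0]
proteome = {
    "species_X_prot_A": [0.9, 0.1, 1.1],
    "species_X_prot_B": [-1.0, 0.5, 0.0],
    "species_X_prot_C": [1.0, 1.0, 1.0],
}

hits = rank_candidates(query, proteome, max_distance=0.25)
for protein_id, distance in hits:
    print(f"{protein_id}\t{distance:.4f}")
```

In practice the cutoff would need to be calibrated against known homolog pairs, and the candidates passing it would then go to the BLAST, HMM, and structural checks.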

One precondition for synaptic proteins to interact is that they are co-expressed in the same cell. We leverage the recent proliferation of published whole-body single-cell expression atlases, together with several new cell atlases generated in our own lab, to assess synaptic protein co-expression in animals with and without nervous systems. Briefly, for each dataset we identify co-expressed synaptic genes using weighted gene co-expression network analysis.
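The core step of weighted gene co-expression network analysis can be sketched as follows: pairwise correlations between gene expression profiles are raised to a soft-thresholding power to give a weighted adjacency. This is only a simplified illustration under stated assumptions. The unsigned adjacency |r|^beta, the value `beta=6`, the gene names, and the expression values are placeholders, and the later stages of a full analysis (topological overlap, clustering into modules) are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency |r|**beta for every gene pair."""
    genes = sorted(expr)
    adjacency = {}
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(expr[gene_a], expr[gene_b])
            adjacency[(gene_a, gene_b)] = abs(r) ** beta
    return adjacency

# Toy data: hypothetical genes, mean expression in four cell clusters.
expression = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [2.0, 4.0, 6.0, 8.0],
    "gene_C": [4.0, 1.0, 3.0, 2.0],
}

adjacency = soft_threshold_adjacency(expression)
for pair, weight in adjacency.items():
    print(pair, round(weight, 4))
```

Raising correlations to a power suppresses weak, likely spurious associations while keeping strong ones close to 1, which is why gene_A and gene_B remain tightly connected here and gene_C does not.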

Finally, we identify species-specific interactions among synaptic proteins using a deep learning model developed in our lab. This model is built on the ESM2-650M architecture, a widely used protein language model pretrained on approximately 65 million sequences from the UniRef50 database. Leveraging the model’s learned representations of individual proteins, we train it on two kinds of data: (1) public protein-protein interaction datasets with binary labels (positive/negative), including the STRING v12 database, the cross-species dataset from Sledzieski et al., and the Human Gold Standard Dataset from Bernett et al.; and (2) protein complexes with annotated interface residues from the RCSB database. We first train the model on the binary-labeled datasets using a combination of classification and whole-sequence masked language modeling (MLM) objectives. We then fine-tune it on the protein-complex dataset with a selective masking strategy designed to emphasize interaction-interface features, which enables more accurate prediction of specific protein-protein interactions.
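The exact selective masking scheme is not specified here, so the following is only one plausible reading, sketched under explicit assumptions: residues annotated as interface positions are masked with a higher probability than the rest of the sequence, so that the MLM objective concentrates on interface features. The function name, the probabilities `p_interface` and `p_background`, the `"<mask>"` placeholder string, and the example sequence are all hypothetical.

```python
import random

MASK = "<mask>"  # placeholder mask token, assumed for illustration

def selective_mask(sequence, interface_positions,
                   p_interface=0.5, p_background=0.05, seed=0):
    """Mask residues, favoring annotated interface positions.

    Returns (tokens, labels): tokens is the masked sequence, and labels
    holds the original residue at masked positions and None elsewhere,
    so that a loss would be computed only on masked positions.
    """
    rng = random.Random(seed)
    interface = set(interface_positions)
    tokens, labels = [], []
    for index, residue in enumerate(sequence):
        p = p_interface if index in interface else p_background
        if rng.random() < p:
            tokens.append(MASK)
            labels.append(residue)
        else:
            tokens.append(residue)
            labels.append(None)
    return tokens, labels

# Toy example: a made-up 8-residue sequence with interface at indices 2 and 3.
# With p_interface=1.0 and p_background=0.0 only the interface is masked.
tokens, labels = selective_mask("MKTAYIAK", [2, 3],
                                p_interface=1.0, p_background=0.0)
print(tokens)
print(labels)
```

Keeping a small nonzero `p_background` would retain some whole-sequence MLM signal during fine-tuning; whether that is desirable depends on the training setup.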

Building on the previous analyses, we construct protein interaction networks for each synaptic gene module and compare overall interaction profiles across species. We also superimpose protein structures from different organisms onto the existing protein complex. By examining the amino acid composition near interaction interfaces, we infer how sequence changes may alter interaction profiles over evolutionary time. These computational findings are then cross-validated against direct experimental evidence for selected synaptic protein interactions in sponges and jellyfish, drawing on ongoing work in our lab and in the laboratory of our collaborator Thomas Biederer in the Department of Neurology at Yale.
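One simple way to compare interaction profiles between two species, shown here as a sketch and not as the project's actual method, is to treat each predicted network as a set of undirected edges between orthologous proteins and to report the shared, lost, and gained interactions together with a Jaccard similarity. The protein labels P1 to P5 and both edge lists are invented placeholders, not predictions or results.

```python
def edge_set(pairs):
    """Undirected edge set: (a, b) and (b, a) are the same interaction."""
    return {frozenset(pair) for pair in pairs}

def jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets (1.0 if both are empty)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Placeholder networks, with proteins already mapped to shared ortholog labels.
reference_net = edge_set([("P1", "P2"), ("P2", "P3"), ("P3", "P4")])
other_net = edge_set([("P3", "P2"), ("P3", "P4"), ("P4", "P5")])

conserved = reference_net & other_net   # present in both species
lost = reference_net - other_net        # only in the reference species
gained = other_net - reference_net      # only in the other species
similarity = jaccard(reference_net, other_net)

print("conserved:", sorted(sorted(edge) for edge in conserved))
print("lost:", sorted(sorted(edge) for edge in lost))
print("gained:", sorted(sorted(edge) for edge in gained))
print("Jaccard similarity:", similarity)
```

A comparison like this depends on the ortholog mapping from the homolog detection step, since interactions can only be matched across species once proteins share a common label.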

With these results, we aim to reconstruct ancient synaptic architecture and pinpoint when synaptic functional modules first began to operate together, ultimately giving rise to the earliest synapses in evolutionary history.

Contributors & Acknowledgments

Jacob Musser joined Yale as an Assistant Professor in the Department of Molecular, Cellular, and Developmental Biology. Previously, Jacob was a molecular evolutionary biologist at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Originally from northern Michigan (USA), Jacob earned his PhD at Yale University in 2015 studying the molecular evolution of bird feathers. Jacob’s current research focuses on understanding the evolution of animal cell types, particularly the origin of early animal gene regulatory networks, specialized functional machinery such as the synapse, and early modes of intercellular communication. At EMBL he esta

Jacob Musser

Assistant Professor in Molecular, Cellular and Developmental Biology

Kejue is an associate research scientist in the MCDB department at Yale studying cell type and gene module evolution. He earned his PhD in computational biology with minors in applied math and statistics from Iowa State University with a focus on protein coevolution and homolog detection.

Kejue Jia, PhD

Associate Research Scientist

Related Publications

The Musser Lab investigates one of the great mysteries of animal evolution: how animals evolved to make specialized cells and orchestrate complex multicellular behavior. This required a suite of new inventions: genomic mechanisms to generate distinct genetic programs, new machinery to enact specialized cell functions, and novel intercellular signaling to coordinate cells.

To understand this, the lab studies sponges, comb jellies, and other early-branching animal lineages that provide a unique window into early animal life and give insight into the general principles of animal cell architecture.