How does it work?
The entire pipeline consists of two major components:
1. Identification of synaptic proteins across diverse animal species using protein language models.
2. Inference of the evolutionary assembly of synaptic protein complexes by integrating protein co-expression profiles with protein-protein interaction predictions generated by a novel deep learning model.
A major challenge in detecting proteins in non-model animals that resemble well-characterized synaptic proteins is that traditional sequence alignment methods often fail at large evolutionary distances. Protein language models, however, effectively capture inherent patterns in protein sequences and can identify similar proteins across large evolutionary distances with greater precision. We will systematically collect complete animal proteomes and embed every protein as a high-dimensional vector using the ESM language model, following the PRotein Ortholog Search Tool (PROST) workflow. We will then calculate pairwise embedding distances between well-characterized synaptic proteins and all proteins in each animal. Candidate hits will be validated via sequence searches with BLAST and profile HMMs where possible, and by assessing structural similarity using AlphaFold2.
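The core of this search step can be sketched as a nearest-neighbor lookup in embedding space. The sketch below uses toy NumPy arrays in place of real mean-pooled ESM embeddings; the function names and the Euclidean distance metric are illustrative assumptions, not the exact PROST implementation.

```python
import numpy as np

def embedding_distances(query: np.ndarray, proteome: np.ndarray) -> np.ndarray:
    """Euclidean distance from one query embedding to every proteome embedding."""
    return np.linalg.norm(proteome - query, axis=1)

def nearest_hits(query: np.ndarray, proteome: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k proteome proteins closest to the query in embedding space."""
    d = embedding_distances(query, proteome)
    return np.argsort(d)[:k]

# Toy example: 4 "proteome" embeddings in a 3-dimensional space
# (real ESM embeddings have hundreds of dimensions).
proteome = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0],
                     [5.0, 5.0, 5.0]])
query = np.array([0.9, 0.1, 0.0])
print(nearest_hits(query, proteome, k=2))  # -> [1 0]
```

In the actual pipeline, each row of `proteome` would be the embedding of one protein from a complete animal proteome, and the query would be a well-characterized synaptic protein.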
One precondition for synaptic proteins to interact is that they must be co-expressed in the same cell. We leverage the recent proliferation of published whole-body single-cell expression atlases, together with several new cell atlases generated in our own lab, to assess synaptic protein co-expression in animals with and without nervous systems. Briefly, for each dataset we identify modules of co-expressed synaptic genes using weighted gene co-expression network analysis (WGCNA).
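The WGCNA step rests on a simple idea: a gene-gene correlation matrix raised to a soft-thresholding power, which suppresses weak correlations while preserving strong ones. A minimal NumPy sketch of that adjacency computation, with a made-up expression matrix (real input would be a cells x genes matrix from a single-cell atlas):

```python
import numpy as np

def wgcna_adjacency(expr: np.ndarray, beta: int = 6) -> np.ndarray:
    """WGCNA-style unsigned adjacency: |Pearson correlation|^beta.

    expr: cells x genes expression matrix; beta is the soft-thresholding power.
    """
    corr = np.corrcoef(expr, rowvar=False)  # gene-gene Pearson correlation
    return np.abs(corr) ** beta

# Toy expression matrix: 6 cells x 3 genes; genes 0 and 1 co-vary, gene 2 does not.
expr = np.array([[1.0,  2.0, 5.0],
                 [2.0,  4.1, 1.0],
                 [3.0,  6.2, 3.0],
                 [4.0,  7.9, 2.0],
                 [5.0, 10.1, 4.0],
                 [6.0, 12.0, 1.0]])
adj = wgcna_adjacency(expr, beta=6)
print(adj[0, 1])  # near 1: strongly co-expressed pair
print(adj[0, 2])  # near 0: weakly co-expressed pair
```

The full WGCNA workflow additionally converts this adjacency into a topological overlap matrix and clusters it into modules; this sketch shows only the adjacency step that defines co-expression strength.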
Finally, we identify species-specific interactions among synaptic proteins using a deep learning model developed in our lab. The model is built on the ESM2-650M architecture, a widely used protein language model pretrained on approximately 65 million sequences from the UniRef50 database. Leveraging the model's learned representations of individual proteins, we train it on:
1. Public protein-protein interaction datasets with binary labels (positive/negative), including the STRING v12 database, the cross-species dataset from Sledzieski et al., and the human gold-standard dataset from Bernett et al.
2. Protein complexes with annotated interface residues from the RCSB database.
We first train the model on the binary-labeled datasets using a combination of classification and whole-sequence masked language modeling (MLM) objectives. We then fine-tune it on the protein-complex dataset with a selective masking strategy that focuses the model's attention on interaction-interface features, enabling more accurate prediction of specific protein-protein interactions.
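The selective masking strategy can be illustrated with a small sketch: interface residues are masked with a higher probability than the rest of the sequence, so the MLM objective concentrates learning signal on the interaction interface. The probabilities, token string, and function name below are illustrative assumptions, not the exact training configuration.

```python
import random

def selective_mask(seq: str, interface: set, p_interface: float = 0.5,
                   p_other: float = 0.05, mask_token: str = "<mask>",
                   seed: int = 0) -> list:
    """Mask interface residues with higher probability than the rest,
    biasing MLM fine-tuning toward interaction-interface features."""
    rng = random.Random(seed)
    out = []
    for i, aa in enumerate(seq):
        p = p_interface if i in interface else p_other
        out.append(mask_token if rng.random() < p else aa)
    return out

# Extreme probabilities for a deterministic demo: only interface positions masked.
tokens = selective_mask("MKTAYIAKQR", interface={2, 3, 4},
                        p_interface=1.0, p_other=0.0)
print(tokens)  # -> ['M', 'K', '<mask>', '<mask>', '<mask>', 'I', 'A', 'K', 'Q', 'R']
```

In practice the masked token sequence would be fed to the ESM2-based model, with the MLM loss computed on the masked positions.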
Building on the previous analyses, we construct protein interaction networks for each synaptic gene module and compare overall interaction profiles across species. We also superimpose protein structures from different organisms onto existing protein complex structures. By examining the amino acid composition near interaction interfaces, we infer how sequence changes may have altered interaction profiles over evolutionary time. These computational findings are then cross-validated against direct experimental evidence of selected synaptic protein interactions in sponges and jellyfish, work that is ongoing in our lab and in the laboratory of our collaborator Thomas Biederer in the Department of Neurology at Yale.
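One simple way to compare interaction profiles across species is to treat each module's network as a set of undirected protein pairs and score overlap with a Jaccard index. This is a minimal sketch under that assumption; the protein names are hypothetical placeholders, not claimed interactions.

```python
def edge_set(pairs):
    """Represent an undirected network as a set of unordered protein pairs."""
    return {frozenset(p) for p in pairs}

def jaccard_edges(net_a: set, net_b: set) -> float:
    """Jaccard similarity between two undirected interaction networks."""
    if not net_a and not net_b:
        return 1.0
    return len(net_a & net_b) / len(net_a | net_b)

# Hypothetical per-species edge lists for one synaptic module.
species_a = edge_set([("NLG", "PSD95"), ("PSD95", "SHANK")])
species_b = edge_set([("NLG", "PSD95"), ("SHANK", "HOMER")])
print(jaccard_edges(species_a, species_b))  # 1 shared edge of 3 total -> 0.333...
```

More refined comparisons (e.g., weighting edges by predicted interaction confidence) follow the same set-based logic.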
With these results, we aim to reconstruct ancient synaptic architecture and pinpoint when synaptic functional modules first began to operate together, ultimately giving rise to the earliest synapses in evolutionary history.