Advances in next-generation sequencing have made genomic sequences widely and cheaply available for a large cross-section of organisms. However, a deeper understanding of gene function in these organisms is limited by the lack of functional genomic data. To help address this, we introduce D-SCRIPT, a method for predicting protein-protein interactions (PPIs) from sequence data alone. PPIs are a valuable functional genomic modality in their own right, and the network of PPIs can be analyzed for further insights (e.g., to discover gene modules). D-SCRIPT is a deep learning approach that generalizes well: it can be trained on known PPIs (e.g., in human cells) and then used to make predictions in evolutionarily distant species. We also emphasize an interpretable, structure-based approach, building our model around a representation of the (putative) inter-protein contact map formed when the two proteins bind. To capture this structural intuition with only sequence data, we leverage recent advances in protein language models, which map a protein sequence to an informative embedding.



Advances in single-cell technologies now allow the simultaneous measurement of multiple kinds of information ("modalities") on a single cell. Each measurement provides a different view on the same cell. Schema is a general framework for integrating these heterogeneous modalities in an efficient way and is well-suited to exploratory analysis. We interpret each modality as describing a distance measure between the underlying cells. Schema allows you to calibrate and combine these disparate views into a single distance measure that is in some agreement with each of them.

Mathematically, we set this up as a quadratic program that simultaneously considers all the modalities and optimizes the correlation between their respective sets of pairwise distances. The method has an attractive theoretical underpinning and we have used it to gain new insights from multi-modal single-cell datasets.
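
The distance-correlation idea can be illustrated with a minimal sketch. This is not Schema's quadratic program: it only scores a given per-feature weighting of a primary modality by how well the resulting pairwise distances correlate with those of a secondary modality. The function names, the weights, and the toy data are all assumptions made for illustration.

```python
import itertools
import math
import random

def pairwise_sq_distances(points, weights):
    """Weighted squared Euclidean distance for every unordered pair of points."""
    return [
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q))
        for p, q in itertools.combinations(points, 2)
    ]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def agreement(primary, secondary, weights):
    """Correlation between weighted primary distances and secondary distances."""
    d_primary = pairwise_sq_distances(primary, weights)
    d_secondary = pairwise_sq_distances(secondary, [1.0] * len(secondary[0]))
    return pearson(d_primary, d_secondary)

# Toy data (hypothetical): 40 "cells" with 3 primary features; the secondary
# modality is a noisy copy of the first primary feature only.
rng = random.Random(0)
primary = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
secondary = [[row[0] + rng.gauss(0, 0.01)] for row in primary]

uniform_score = agreement(primary, secondary, [1.0, 1.0, 1.0])
focused_score = agreement(primary, secondary, [1.0, 0.0, 0.0])
print(uniform_score, focused_score)
```

In this toy setting, putting all the weight on the first feature gives distances that agree much better with the secondary modality than uniform weights do. An optimization over weights, such as the quadratic program described above, would search for such a weighting automatically.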



Given two or more protein-protein interaction (PPI) networks, we aim to find the best overall alignment of the networks, taking into account both the network topologies and the sequence similarities between the individual proteins of the networks. This network alignment problem is analogous to the global sequence alignment problem: in both cases, we are interested in the best overall match between the inputs.

We introduce a spectral approach to this problem. Over a series of papers, my collaborators and I have described the IsoRank and IsoRank-N algorithms for finding such alignments. Using these, we are able to predict functional orthologs: cross-species gene correspondences that take into account both sequence and PPI network data. These functional orthologs may provide certain advantages over existing sequence-only orthologs.
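
To give a flavor of the spectral idea, here is a minimal sketch of a neighborhood-based power iteration in the spirit of IsoRank. It is simplified for illustration and is not the released implementation. It assumes that a pair of proteins scores highly when their neighbors also score highly, blended with a sequence-similarity prior. The function name `isorank_scores`, the mixing parameter `alpha`, and the toy networks are hypothetical.

```python
def isorank_scores(adj1, adj2, seq_sim, alpha=0.6, iterations=50):
    """Score every cross-network node pair by power iteration.

    adj1, adj2: dict mapping node -> set of neighbors (undirected, symmetric).
    seq_sim:    dict mapping (node1, node2) -> non-negative sequence similarity.
    alpha:      weight on network topology versus the sequence prior.
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    total_sim = sum(seq_sim.get(p, 0.0) for p in pairs)
    if total_sim <= 0:
        raise ValueError("need at least one positive sequence similarity")
    prior = {p: seq_sim.get(p, 0.0) / total_sim for p in pairs}

    scores = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iterations):
        updated = {}
        for i, j in pairs:
            # Each neighboring pair (u, v) spreads its score evenly over
            # all pairs formed from the neighbors of u and of v.
            topo = sum(
                scores[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            )
            updated[(i, j)] = alpha * topo + (1 - alpha) * prior[(i, j)]
        norm = sum(updated.values())
        scores = {p: s / norm for p, s in updated.items()}
    return scores

# Toy example (hypothetical): two 3-node paths, with sequence evidence
# only for the pair (a, x).
path1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
path2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
scores = isorank_scores(path1, path2, {("a", "x"): 1.0})
best_for_b = max(path2, key=lambda j: scores[("b", j)])
print(best_for_b)
```

Even though the sequence prior mentions only (a, x), the topology term propagates that evidence so that the two middle nodes are matched to each other. An actual alignment would still need a step that extracts a consistent mapping from these pairwise scores.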



We predict protein-protein interactions (PPIs) computationally, given just the sequence data of the two proteins. The goal is to augment existing experimental data, whose coverage remains spotty. We use structure-based approaches to make these predictions and then combine them with functional genomic data using machine learning techniques.



Discovering the structure and dynamics of signaling networks is a key goal of systems biology. Towards this, we propose an approach to combine PPI and RNA-interference data to produce high-confidence hypotheses about the structure of a signaling network. The work, which is ongoing, was first presented at ISMB 2007. In it, we introduce the idea of using a multi-commodity flow framework to set up constraints on the structure of a signaling network, given PPI data and knock-down information from RNAi experiments. The constraints describe an Integer Linear Program (ILP) whose LP relaxation is then solved.



The yeast two-hybrid (Y2H) protocol is one of the two main experimental approaches to discovering PPIs in a high-throughput way, the other being co-immunoprecipitation. The Y2H protocol is susceptible to some systematic biases, the most problematic being that certain proteins can behave "promiscuously" in the assay and be responsible for many false-positive PPI pairs. We describe a Bayesian approach to modeling this systematic error. This approach allows us to combine information across multiple datasets and make more nuanced inferences than existing approaches.



One of the problems with performing gene-perturbation experiments is choosing the right cut-off for the signal-vs-noise threshold in the assay. A threshold that is too high will exclude promising hits; one that is too low will slow down downstream analysis with irrelevant genes. In the context of RNA-interference assays, we started with the intuition that the intended set of hits should share similar functions and hence be well-connected in the PPI network. We designed quantitative measures that express, given the list of all RNAi scores, how changes in the cut-off will impact the connectivity of the chosen set of hits relative to random expectation. This leads to intuitive ways of selecting cut-offs for the experiment.
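
The connectivity intuition can be sketched in a few lines. This is a simplified illustration, not the measures from our work. It assumes that higher RNAi scores indicate stronger hits and that "random" means uniformly sampled gene sets of the same size as the hit set. The function names and the toy network are hypothetical.

```python
import random

def edges_within(nodes, edges):
    """Number of PPI edges with both endpoints in the given node set."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)

def connectivity_vs_random(scores, edges, cutoff, n_random=200, seed=0):
    """Compare PPI edges among hits to the average for random gene sets.

    scores: dict gene -> RNAi score (assumed: higher means stronger hit).
    edges:  list of (gene, gene) PPI pairs.
    Returns (observed_edges, mean_edges_in_random_sets_of_same_size).
    """
    hits = [g for g, s in scores.items() if s >= cutoff]
    observed = edges_within(hits, edges)
    rng = random.Random(seed)
    genes = sorted(scores)
    random_counts = [
        edges_within(rng.sample(genes, len(hits)), edges)
        for _ in range(n_random)
    ]
    return observed, sum(random_counts) / n_random

# Toy data (hypothetical): 20 genes; the 5 high scorers form a clique.
genes = [f"g{k}" for k in range(20)]
scores = {g: (0.9 if k < 5 else 0.1) for k, g in enumerate(genes)}
edges = [(f"g{a}", f"g{b}") for a in range(5) for b in range(a + 1, 5)]
edges += [("g5", "g6"), ("g7", "g8"), ("g9", "g10"), ("g0", "g11"), ("g12", "g13")]

observed, expected = connectivity_vs_random(scores, edges, cutoff=0.5)
print(observed, expected)
```

Repeating this comparison over a range of cut-offs shows where the hit set stops being more connected than random expectation, which is one way to choose a threshold.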



In protein structure prediction, one of the challenges is to efficiently explore the local neighborhood of a conformation. We propose an approach that uses concepts from inverse kinematics (in robotics) to change a small part of a protein's backbone without changing anything else. This operation can be applied arbitrarily many times to explore the local neighborhood. Using this approach, we construct ensemble models of protein structure that better explain X-ray crystallography data than single-conformer models and are more effective than existing ensemble models.



Microarray experiments can quickly get costly, especially if one has to perform a number of them as part of a time-series study. We use the concept of active learning to compute the optimal points along the time-line at which microarray experiments should be run. The intuition is that the sampling should be focused in time-regions where the gene expression curves are least well-characterized.
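
As a rough illustration of this intuition, the sketch below picks the next sampling time with a crude stand-in for uncertainty: the interval between consecutive samples over which expression changes the most, summed across genes. This heuristic is an assumption made for illustration, not the selection criterion from our work, and the function name `next_sampling_time` and the toy data are hypothetical.

```python
def next_sampling_time(times, expression):
    """Suggest the midpoint of the least well-characterized interval.

    times:      sorted list of time points already sampled.
    expression: dict gene -> list of expression values at those time points.
    The interval with the largest total absolute change across genes is
    used as a simple proxy for where the curves are least well known.
    """
    best_gap, best_change = None, -1.0
    for k in range(len(times) - 1):
        change = sum(abs(vals[k + 1] - vals[k]) for vals in expression.values())
        if change > best_change:
            best_gap, best_change = k, change
    return (times[best_gap] + times[best_gap + 1]) / 2

# Toy data (hypothetical): both genes change only between t=4 and t=8.
times = [0, 4, 8, 12]
expression = {"geneA": [1, 1, 5, 5], "geneB": [2, 2, 0, 0]}
suggested = next_sampling_time(times, expression)
print(suggested)
```

A fuller active-learning loop would add the new measurement, update the estimated expression curves, and repeat until the experimental budget is exhausted.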



  • Beckett Sterner, Bonnie Berger and I wrote a paper on using information theoretic ideas to identify and annotate active sites in proteins.
  • Mitul Saha and I wrote a paper on searching for a 3-D protein fragment in a database of protein structures.