Santiago Segarra - Projects

Postdoctoral Research Associate (segarra@mit.edu)
alt text 



I am curious. Ideally, I would like to contribute to the understanding of multiple fields of knowledge, including engineering, economics, sociology, and medicine. My way of achieving this in today's highly specialized research environment is by studying networks, since these are ubiquitous data structures transversal to multiple fields and areas of expertise. Indeed, my whole work has been built around networks and this allowed me to collaborate with people from varying backgrounds from mathematicians and computer scientists to medical doctors and literature professors.

I can categorize the work I have done into the following three classes, with networks being the common denominator.

Signal Processing on Graphs

Graph signal processing (GSP) is an exciting emerging field in which I have been working since my fourth year at UPenn, where I get to combine everything I learned about networks with my knowledge of classical signal processing.

Sometimes networks have intrinsic value and are themselves the object of study. In other occasions, the network defines an underlying notion of proximity, but the object of interest is a signal defined on top of the graph, i.e., data associated with the nodes of the network. This is the matter addressed in the field of graph signal processing, where the notions of, e.g., frequency and linear filtering are extended to signals supported on graphs. A plethora of graph-supported signals exist in different engineering and science fields, with examples ranging from gene expression patterns defined on top of gene networks to the spread of epidemics over a social network. Transversal to the particular application, the general goal of GSP is to contribute to the advancement of the understanding of network data by redesigning traditional tools originally conceived to study signals defined on regular domains (such as time-varying signals) and extend them to analyze signals on the more complex graph domain.

In the past two years, I worked on multiple GSP projects which, for ease of understanding, I further divide into the following four subcategories.

Sampling and Interpolation

Sampling is a cornerstone problem in classical signal processing. The goal of this project is to investigate the sampling and posterior recovery of signals that are defined in the nodes of a graph. The underlying assumption is that such signals are bandlimited, i.e., they admit a sparse representation in a (frequency) domain which is related to the structure of the graph where they reside. Most of the current efforts have been focused on using the value of the signal observed at a subset of nodes to recover the signal in the entire graph. Our proposal is different. We present a new sampling method that accounts for the graph structure, can be run at a single node and only requires access to information of neighboring nodes. Moreover, we also show that the proposed method shares similarities with the classical sampling and interpolation of time-varying signals.

Interestingly, and unlike the case for time-signals, the ideal interpolators for graph signals are not low-pass graph filters, generating a decoupling between the processes of sampling and low-pass interpolation for general graphs. Hence, we also study the problem of reconstructing bandlimited graph signals via low-pass filters. More specifically, we study the problem of inducing a known graph signal using as input a graph signal that is nonzero only for a small subset of seeding nodes, which is then percolated (interpolated) across the graph using a graph filter.

Graph Filter Design and Identification

We investigate how to design graph filters, which are a generalization of classical time-invariant systems, to implement (or at least approximate) a pre-specified linear transformation. This is particularly useful in network setting because graph filters can be implemented locally. Graph filters have been used to implement distributedly specific linear transformations such as fast consensus or projections onto the low-rank space of graph-bandlimited signals. By contrast our goal in this project is to develop a general theory to determine which transformations are implementable as graph filters and, for those which cannot be perfectly implemented, what is the optimal graph-filter approximation achievable.

A second project investigates the problem of blind identification of graph filters. Mathematically, graph filters are linear transformations that can be expressed as matrix polynomials of the so-called graph-shift operator, which encodes the graph structure. The polynomial coefficients determine completely the transformation and are referred to as filter coefficients. Specifically, in this problem we are given a graph signal which is assumed to be the output of a graph filter, and seek to jointly identify the filter coefficients and the input signal that gave rise to the observed output. This is the extension to graphs of the classical problem of blind system identification or blind deconvolution of signals in the time or spatial domains. We tackle the blind graph-filter identification problem via a joint rank and sparsity minimization subject to linear constraints. Among other potential applications, since the dynamics of opinion formation in social networks can be modeled via graph filters, we may adopt blind filter identification to detect the influential actors that instilled an observed status-quo.

Network Topology Inference

Most GSP works assume that the graph-shift operator (hence the graph) is known, and then analyze how the algebraic and spectral characteristics of the graph shift impact the properties of the signals and filters defined on such a graph. Here instead we take the reverse path and investigate how to use information available from graph signals and filters to infer the underlying graph topology. In a nutshell, we follow a two-step approach whereby we first leverage results from GSP theory to identify the graph shift's eigenbasis using the available information, and then rely on these (possibly imperfect and incomplete) spectral templates to recover the graph shift operator itself. This brings a fresh and novel perspective to the prominent problem of inferring a graph's topology from a set of (graph-signal) observations.

Statistical Signal Processing on Graphs

Stationarity is a cornerstone property that facilitates the analysis and processing of random signals in the time domain. Thus, the goal of this project is to generalize the notion of stationary processes to the graph domain. We propose a definition of weak stationarity for random graph signals that takes into account the structure of the graph where the random process takes place, while inheriting many of the meaningful properties of the classical definition in the time domain. Under this definition, notions like the power spectral density (PSD) or results such as the spectral convolution theorem can be generalized to signals supported on graphs. Furthermore, we propose and analyze different methods to estimate the PSD of both nonparametric (periodogram, correlogram, average periodogram, and filter banks) and parametric (AR, MA, and ARMA processes) nature.

Foundations of Network Theory

This class includes (but is not limited to) the work towards my PhD thesis, which I started during my second year at UPenn. The common theme of the following projects is that their objective is to better understand the structure of networks either by reducing their complexity via clustering or by ranking their nodes via stable centrality measures.

Hierarchical Clustering of Asymmetric Networks

There are literally hundreds of methods, techniques, and heuristics that can be applied to the determination of hierarchical and nonhierarchical clusters in finite metric (thus symmetric) spaces. Methods to identify clusters in a network of asymmetric dissimilarities, however, are rarer. This relative rarity is expected because the intuition of clusters as groups of nodes that are closer to each other than to the rest is difficult to generalize when nodes are close in one direction but far apart in the other. Although the theoretical foundations of clustering are not as well developed as its practice, the foundations of clustering in metric spaces have been developed over the past decade. Of particular relevance to our work is the case of hierarchical clustering where, instead of a single partition, we look for a family of partitions indexed by a resolution parameter. In this context, some approaches have successfully followed axiomatic approaches, where desirable properties of clustering methods are encoded as axioms.

The goal of this project is to extend previous works and lay an axiomatic foundation for the more general case of clustering asymmetric (weighted and directed) networks. Our construction is based on the definition of admissible methods as those that abide by the axioms of value – nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them – and transformation – when dissimilarities are reduced the network may become more clustered but not less. Apart from characterizing the set of admissible methods, we investigate alternative axiomatic constructions (giving rise to the concept of quasi-clustering), we propose tractable algorithms based on a min-max dioid algebra, and we analyze further desirable properties such as stability and scale invariance of clustering methods.

Metric Representations of Network Data

Despite the pervasive presence of networks, tools to analyze them and algorithms that exploit network data are not as well developed as tools and algorithms for processing of conventional signals. Although some of this lag can be attributed to different developmental stages, there is also the matter of the complexity of network data. Indeed, consider a problem of proximity search in which we are given a network and an element whose dissimilarity to different nodes of the network can be determined. We are asked to find the element that is least dissimilar. In an arbitrary network, finding the least dissimilar node requires comparison against all nodes and incurs a complexity that is linear in the size of the network. In a metric space, however, the triangle inequality encodes a transitive notion of proximity. If two points are close to each other in a metric space and one of them is close to a third point, then the other one is also close to this third point. This characteristic can be exploited to design efficient search methods using metric trees that effectively reduce the complexity. Likewise, many hard combinatorial problems on graphs are known to be approximable in metric spaces but not approximable in generic networks. In either case, the advantage of the metric space is that the triangle inequality endows it with a structure that an arbitrary network lacks. It is this structure that makes analysis and algorithm design tractable.

If metric spaces are easier to handle than arbitrary networks, a possible route for network analysis is to design projection operators to map arbitrary networks onto metric spaces. This is the problem addressed in this research avenue.

Stable Node Centrality Measures

In any graph or network, the topology determines an influence structure among the nodes or agents. Peripheral nodes have limited impact on the dynamics of the network whereas central nodes have a major effect on the behavior of the whole graph. Node centrality measures are tools designed to identify such important agents. However, node importance is a rather vague concept and can be interpreted in various ways, giving rise to multiple coexisting centrality measures, the most common being degree, closeness, eigenvector, and betweenness centrality. The ability of a centrality measure to be robust to noise in the network data is of practical importance. However, no formal theory had explained the disparate behaviors of the different centrality measures when presented with a noisy network. The goal of this project is to understand why some centrality measures are stable to noise and, based on this, define alternative measures that retain the same centrality notion as the commonly used measures while being robust to noise.

Interdisciplinary Applications of Network Analysis

This third category nucleates some of the more applied and interdisciplinary projects where I collaborated with different experts to apply network theory and graph signal processing to seemingly unrelated fields of knowledge.

Authorship Attribution

The discipline of authorship attribution is concerned with matching a text of unknown or disputed authorship to one of a group of potential candidates. More generally, it can be seen as a way of quantifying literary style or uncovering a stylometric fingerprint. The most traditional application of authorship attribution is literary research, but it has also been applied in forensics, defense intelligence, and plagiarism. Both, the availability of electronic texts and advances in computational power and information processing, have boosted accuracy and interest in computer based authorship attribution methods.

In this project we build stylometric fingerprints by considering the relational structure of function words (such as prepositions, conjunctions, and pronouns) in a written text. We encode these structures as word adjacency networks (WANs) which are asymmetric networks that store information of co-appearance of two function words in the same sentence. With proper normalization, edges of these networks describe the likelihood that a particular function word is encountered in the text given that we encountered another one. In turn, this implies that WANs can be reinterpreted as Markov chains describing transition probabilities between function words. Given this interpretation it is natural to measure the dissimilarity between different texts in terms of the relative entropy between the associated Markov chains. Hence, we use relative entropy to attribute texts achieving (and improving) state-of-the-art accuracy. More interestingly, WANs can be shown to carry information about the genre and time when the text was written. Finally, we use WANs to shed light on some famous literary controversies such as proposed collaborative plays in the Shakespearean canon.

Brain Graphs

Brains are one of the most appropriate domains to implement both traditional network analysis and graph signal processing since brains are themselves network structures (neural regions which are either structurally or functionally connected with each other) and brain activity can be thought of as a signal defined on these neural regions. We have collaborated with the Bioengineering department at UPenn in two brain-related projects. On the first one we use multiple network centrality measures as features to build a classifier in order to differentiate between pathological sources of corticobasal syndrome (Alzheimer's and frontotemporal lobar degeneration) from structural MRI data. In the second one, we implement graph filters to process fMRI signals captured while a series of individuals were learning a motor skill. The filtered signals unveil a relation between the variance of neurological activity and the learning aptitude when given a motor task, which could potentially be used to enhance the human learning process.

Sports Analytics

Throughout the past decade, analytics in sports have evolved to take a more mathematical and advanced outlook. Baseball, being the prime sport of choice for analytics, has significantly evolved with the implementation of Sabermetrics, the search for objective knowledge in the sport. Ever since they were introduced to the concept of ‘Moneyball’, baseball teams on all levels have proceeded to devalue old player ranking statistics such as batting averages, and instead have focused on metrics statistically correlated with wins such as the on-base plus slugging percentage (OPS). This paradigm shift has led to positive results for many teams, showcasing value in analyzing specific metrics within a sport. Thus, there is potential in closing the gap in knowledge between baseball and other sports through the analysis of simple metrics that resemble OPS. One such metric in soccer is the probability of shooting before losing (PSL) possession of the ball which, just like OPS, correlates with an increase in scoring. Our goal is to create a model that can estimate and predict PSL, as well as test the significance and impact it has on the performance of a team. Teamwork, an essential aspect of soccer, cannot be completely characterized using individual player statistics. Thus, in our approach we combine a network model of the team dynamics (represented by a continuous time Markov chain model of a ball possession) with detailed game statistics in order to estimate, predict, and modify the performance of a soccer team.