analysis of event-driven networks

Inferring Influence

Inferring influence in networks from observed communication activity is an important task in contexts such as surveillance, marketing, and cybersecurity. Most existing approaches rely on content or metadata indicating related activity, rendering those approaches ineffective when such information is unavailable, for example due to encrypted communication. In contrast, we present an efficient algorithm to infer influence between entities that relies only on the times of their individual activity, paying particular attention to the computational challenges posed by large, high-volume networks. We provide theoretical bounds on the recall and runtime of our algorithm relative to characteristics of network structure and dynamics, along with experiments to support our theoretical analysis and validate the effectiveness of our approach.


Detecting Correlated Activity

In many real-world networks, interactions between entities are observed at specific moments in continuous time, such as email, SMS messaging, and IP traffic. The majority of methods for analyzing such data first aggregate communication over designated time blocks, resulting in one or more discrete time series, to which existing tools can be applied. However, regardless of how the block lengths are chosen, discretizing time inherently introduces information loss and biases analysis towards patterns occurring at the designated time scale, effects which can be especially pronounced in networks with a high degree of temporal heterogeneity. As a result, there has been increasing interest in using stochastic point processes to model network activity. We present a novel approach based on such models to detect times and sets of entities with temporally correlated recent activity. We develop efficient algorithms and compare our approach to existing and baseline methods through experiments on synthetic and real-world data.
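To convey the point-process view concretely, consider an exponentially decaying activity score over an entity's raw event times. This is our own illustrative sketch, not the model from the work above; the `intensity` function and its `decay` parameter are hypothetical names.

```python
import math

def intensity(event_times, t, decay=1.0):
    """Exponentially decaying count of events up to time t: each past
    event at time s contributes exp(-decay * (t - s)). No discretization
    into time blocks is involved."""
    return sum(math.exp(-decay * (t - s)) for s in event_times if s <= t)

# Entities whose events cluster around the same moments have
# simultaneously elevated intensities at those moments.
a = [1.0, 1.2, 1.4]
print(intensity(a, 2.0) > intensity(a, 4.0))  # True: activity decays away
```

Scores like this can be evaluated at any instant, so correlated bursts are visible at whatever time scale they actually occur.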


Discovering Functional Communities

Community discovery is a natural task that arises in the study of social networks, but finding a mathematical formulation which captures the intuitive notion of community is an active research area. Most of the existing literature frames community discovery as the task of clustering a social network graph so that well-connected vertices are in the same cluster. When the network also serves as a medium for the dissemination of information, however, graph structure alone does not tell the whole story, since the existence of a social link does not imply information transfer.

We suggest an alternative approach to community discovery that identifies groups of social media users who interact with similar content, represented as dense biclusters in a user-content matrix. We present a heuristic algorithm to efficiently search the space of possible co-clusterings for one which maximizes the value of a given metric, along with a new class of co-clustering metrics that are more suitable for this task than existing metrics. We evaluate our approach using synthetic and real-world datasets.
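As a minimal sketch of the dense-bicluster idea, a candidate co-cluster in a binary user-content matrix can be scored by the fraction of its cells that contain an observed interaction. The function name and the density criterion here are our own illustration, not the new metrics proposed above.

```python
def bicluster_density(matrix, user_rows, content_cols):
    """Fraction of cells inside the candidate bicluster (given row and
    column index sets) that contain an observed interaction (a 1)."""
    cells = [matrix[r][c] for r in user_rows for c in content_cols]
    return sum(cells) / len(cells)

# Toy 4-user x 5-item interaction matrix.
M = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
]
print(bicluster_density(M, [0, 1], [0, 1, 2]))  # fully dense block -> 1.0
print(bicluster_density(M, [0, 3], [0, 1, 2]))  # mixed block -> 0.5
```

A heuristic search over co-clusterings would compare scores like these across candidate row/column groupings.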


Cascade Discovery

A cascade is a sequence of events by which a single piece of information travels through a network. Existing work has proposed generative cascade models and analyzed the properties of known cascades, but to our knowledge there has been no attempt to identify or extract cascades from unlabeled data without relying on textual content. We introduce the problem of partitioning the set of network events into the minimal number of cascades. We consider two variants of the problem, one with the additional constraint that each node can participate at most once in any given cascade. We prove that the constrained problem is NP-complete, and present an efficient algorithm for the unconstrained variant.
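To convey the flavor of the partitioning task, here is a hypothetical greedy heuristic of our own: process (time, source, destination) events in order and attach each event to a cascade that has already reached its source, otherwise start a new cascade. This is only an illustration; it is not the efficient algorithm described above and does not guarantee a minimal partition.

```python
def greedy_cascades(events):
    """Greedily partition time-ordered (time, src, dst) events into
    cascades: an event joins the first cascade whose reached-node set
    contains its source; otherwise it seeds a new cascade."""
    cascades = []  # list of event lists
    reached = []   # reached[i]: nodes reached so far by cascade i
    for t, src, dst in sorted(events):
        for i, nodes in enumerate(reached):
            if src in nodes:
                cascades[i].append((t, src, dst))
                nodes.add(dst)
                break
        else:
            cascades.append([(t, src, dst)])
            reached.append({src, dst})
    return cascades

events = [(1, "a", "b"), (2, "b", "c"), (3, "x", "y"), (4, "c", "d")]
print(len(greedy_cascades(events)))  # 2: one cascade a->b->c->d, one x->y
```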


cybersecurity and moving target defense

Sustaining Availability of Critical Services

Much of the existing literature on cyber defense focuses on detection and response: developing better detection mechanisms, combating malware that has already been detected, or efficiently patching known vulnerabilities. However, research has shown that many cyber attacks are able to evade detection for a long time, propagating malware throughout a network for months or years undetected. On the other hand, some viral attacks spread very quickly, infecting an entire network before exploited vulnerabilities can be identified or patches developed. In light of these challenges, traditional approaches to network defense do not provide sufficient protection from cyber attack.

In this work, we explore active defense mechanisms to limit the spread of malware – detected or not – using only routine computer operations, without requiring any specialized software. In particular, we consider the task of dynamically allocating provision of shared services which potentially contain vulnerabilities, with the goal of maximizing the number of nodes still providing service at the time of a coordinated exploit enabled by propagating malware. We propose a continuous-time stochastic model and derive the optimal solution over a class of active defense strategies. We compare the performance of the optimal solution under our theoretical model with other proposed and baseline strategies in a more realistic environment using agent-based simulation.


Platform Migration Defense Strategies

Despite the significant effort directed toward securing important cyber systems, many remain vulnerable to advanced, targeted cyber intrusion. Today, most systems which provide network services employ a fixed software stack that includes an operating system, web servers, databases, and a virtualization layer. This software mix as a whole constitutes the attack surface of the host, and a vulnerability in one or more of these services is a threat to the security of the entire system.

Moving target defense (MTD) aims to increase the security of a system against successful intrusion by increasing an attacker’s uncertainty about the attack surface. Platform migration is a form of MTD that entails changing the virtualized software stack configuration of a system. We consider a scenario in which an attacker gathers information and then selects and launches an attack against a target system which is implementing a platform migration defense (PMD). We use agent-based simulation to evaluate the PMD’s effectiveness depending on the capabilities of the attacker and defender. In particular, we focus on two core characteristics of a PMD: (i) migration rate, the frequency at which the platform is changed, and (ii) platform diversity, the number of platform configurations available, as well as two dimensions of an attacker’s capabilities: (i) reconnaissance skill, the ability to collect accurate information regarding the target system, and (ii) arsenal size, the number of usable exploits at the attacker’s disposal. We perform simulation experiments to evaluate a defender’s ability to protect itself against a spectrum of attackers ranging from “script-kiddies” to state-sponsored actors.

Our results indicate that increased platform diversity results in a lower rate of successful attacks, even in cases where the attacker has near-perfect information regarding the target system, but that this may come at a cost in system functionality. Furthermore, although the strength of an attacker is often measured by their ability to develop or acquire a large arsenal of available exploits, reconnaissance skill may be just as important a determinant of attack success as arsenal size. Our analysis provides insight into the relationship between attacker and defender capabilities, which can help inform the decision-making processes of cyber defenders and lay the groundwork for effective automation of cyber maneuvers.
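The diversity effect can be illustrated with a toy Monte Carlo model of our own devising (a drastic simplification, not the agent-based simulator used above): if migration re-selects a platform uniformly at random from `diversity` options and the attacker holds exploits for `arsenal` distinct platforms, an attack landing after a migration succeeds roughly `arsenal / diversity` of the time.

```python
import random

def attack_success_rate(diversity, arsenal, trials=10000, seed=0):
    """Toy estimate of attack success under uniform random migration.
    The attacker's exploits cover `arsenal` of the `diversity` platforms;
    each trial, the defender runs a uniformly random platform."""
    rng = random.Random(seed)
    platforms = range(diversity)
    exploitable = set(rng.sample(platforms, min(arsenal, diversity)))
    hits = sum(rng.choice(platforms) in exploitable for _ in range(trials))
    return hits / trials

print(attack_success_rate(diversity=8, arsenal=2))  # close to 2/8 = 0.25
```

Doubling diversity at fixed arsenal size roughly halves the estimated success rate in this toy model, consistent with the qualitative finding above.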


study of complex systems

Electrical Flow in the Heart

Electrical communication between cardiomyocytes (heart cells) can be perturbed during arrhythmia, but these perturbations are not captured by conventional electrocardiographic metrics. We develop a theoretical framework to quantify electrical communication using information theory metrics in two-dimensional cell lattice models of cardiac excitation propagation. We use Shannon entropy and mutual information to perform spatial profiling of electrical signals during normal heart rhythm as well as under three models of cardiac arrhythmia: anatomical reentry, spiral reentry, and multiple reentry. We find that information sharing between cells is spatially heterogeneous, and that cardiac arrhythmia significantly impacts information sharing within the heart. In addition, entropy localizes the path of the drifting core of spiral reentry, which could be an optimal target for therapeutic ablation. We conclude that information theory metrics can quantitatively assess electrical communication among cardiomyocytes and may find clinical application in identifying rhythm-specific treatments, a need currently unmet by traditional electrocardiographic techniques.
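Both metrics are standard. A minimal sketch of how they apply to already-discretized signals (plug-in estimates over symbol counts; the discretization of the voltage traces is not shown):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) in bits of a discrete signal."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete signals."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Identical binary signals share all their information: I(X;Y) = H(X).
x = [0, 1, 0, 1, 0, 1, 0, 1]
print(entropy(x))                # 1.0 bit
print(mutual_information(x, x))  # 1.0 bit
```

Computing these quantities per lattice site against a reference signal yields the kind of spatial information-sharing profile described above.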


Food Web Dynamics

In order to develop policies and guidelines that will lead to sustainable agricultural practices, it is important to understand the effects of pesticide use on an ecosystem. Extensive field experiments may be costly or impractical, and it may take several years or more before long-term effects can be observed. We propose a new model that integrates knowledge of pesticide-organism interactions with existing bioenergetic food web models in order to explore the mechanism by which pesticides can upset the balance of an ecosystem through their direct effect on only certain organisms. By running simulations under a variety of scenarios, our approach could be used to efficiently determine safe upper limits of pesticide use in a given environment. We demonstrate our approach through experiments on a model soil food web, and perform sensitivity analysis to gauge the reliability of applying our results to predict the behavior of real-world ecosystems.


collaboration in academia

Collaborative Impact Metrics

Across academic disciplines, it is natural to want to measure the impact of an individual and his or her work. Consequently, many metrics have been proposed, starting with simple counts of papers or citations and becoming progressively more complex. Given the attention such metrics receive, and their potential to influence decisions around hiring and promotions, there has been much effort in designing them to be meaningful. For example, total paper counts give little indication of the quality of the work, whereas aggregate citation counts are distorted by a single highly-cited paper and so do not indicate the researcher's breadth.

In 2005, Hirsch proposed the h-index: the largest integer h such that the author has published at least h papers with at least h citations each. This measure has an intuitive appeal, and is not unduly influenced by a single high-impact paper, nor by a multitude of low-impact publications. Since then, a plethora of variations and alternative indices have been proposed to address perceived shortcomings of the h-index. Most of these measures evaluate an author solely based on his or her individual publication record. However, modern scientific research tends to be highly collaborative in nature. We introduce a measure which aims to capture the impact of a researcher not only on the research corpus, but also on his or her fellow researchers, which we call the Social h-index. We perform a case study over the Computer Science research corpus, and show that the Social h-index is distinct from existing measures and rewards more collaborative research styles.
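Hirsch's definition translates directly into code. A minimal sketch (the Social h-index itself is defined in the work above and is not reproduced here):

```python
def h_index(citations):
    """Largest h such that the author has at least h papers with at
    least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have >= 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: one 25-citation paper cannot raise h
```

The second example shows the robustness property noted above: a single highly-cited paper does not inflate the index.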


Game-Theoretic Collaboration Models

Researchers exhibit a wide range of work habits and behaviors. Some work on many papers simultaneously, while others focus on only a few at a time. Some engage in mentoring relationships, while others choose to collaborate mostly with their peers, and still others prefer to work independently. These behaviors may be motivated by a variety of factors such as institutional needs, academic field, stage in career, funding situation, and affinity for teaching. We pose the question: "If researchers were motivated by X, what would the world of academic research look like?" We analyze collaborative behavior in a large scholarly dataset and arrive at a model of academic collaboration, which we call the h-Reinvestment model. We then formalize a game based on this model, and study the outcome of the game when each researcher tries to optimize a given objective function.


security and privacy in data sharing

Anonymization of Social Networks

Knowledge discovery on social network graphs is valuable in many application domains. When publishing relational databases, data is often anonymized by removing personally identifiable information to protect the privacy of individuals. In social networks, however, knowledge of an individual's local graph structure may be sufficient to re-identify them in the network. This poses an inherent challenge in how to preserve both the utility of the published network data and the privacy of the social network users.

We formally define privacy models and attack models for the problem of anonymizing network data, in particular where an adversary's prior knowledge includes the degrees of the target and its i-hop neighbors. We present two new clustering methods with a minimum cluster size constraint, the bounded t-means clustering and union-split clustering algorithms. We then propose an inter-cluster matching algorithm, strategically adding and removing edges based on nodes' social roles, which satisfies the privacy requirements due to the minimum cluster size property. Experiments demonstrate that the anonymized social networks produced by our algorithm preserve several important graph properties.
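As a simplified illustration of the degree-knowledge threat, consider the standard notion of k-degree anonymity, which is weaker than the i-hop neighborhood model considered above: degree information alone must not narrow any target down to fewer than k candidate nodes.

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """True if every degree value occurring in the graph is shared by at
    least k nodes, so an adversary knowing only a target's degree cannot
    identify the target among fewer than k candidates."""
    return all(count >= k for count in Counter(degrees).values())

print(is_k_degree_anonymous([2, 2, 3, 3, 3], k=2))  # True
print(is_k_degree_anonymous([1, 2, 2, 3, 3], k=2))  # False: degree 1 is unique
```

Clustering with a minimum cluster size plays an analogous role for the richer neighborhood-based adversary: each node is hidden within a group of at least k structurally interchangeable nodes.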


Privacy-Preserving Database-As-A-Service

Outsourced databases provide a solution for data owners who want to delegate the task of answering database queries to third-party service providers. However, distrustful users may desire a means of verifying the integrity of responses to their database queries. Simultaneously, for privacy reasons, the data owner may want to keep the database hidden from service providers. This is particularly relevant for aggregate databases, where an individual's data is sensitive, and results should only be revealed for queries that are aggregate in nature. In such a scenario, using simple signature schemes for verification does not suffice. We present a solution in which service providers can collaboratively compute aggregate queries without gaining knowledge of intermediate results, and users can verify the results of their queries, relying only on trust of the data owner. Our protocols are secure under reasonable cryptographic assumptions, and are robust to collusion among k dishonest service providers.


Anonymization of Recommendation Datasets

Recommender systems are used to predict user preferences for products or services. To incentivize the development of better prediction techniques, owners of recommender systems such as Netflix may make their data available to the public, which raises serious privacy concerns. With only a small amount of knowledge about individuals and their ratings of some items in a recommender system, an adversary may re-identify the users, thus breaching their privacy. Unfortunately, most of the existing privacy models (e.g. k-anonymity) cannot be directly applied to recommendation data.

We represent recommendation data as a bipartite graph, and identify several attacks that can re-identify users and determine their item ratings. To address the problem, we first give formal privacy definitions for recommendation data, and then develop a robust and efficient anonymization algorithm, Predictive Anonymization, to achieve our privacy goals. Our experimental results show that Predictive Anonymization can prevent the attacks with little impact on prediction accuracy.