Graphs have long been used to represent datasets where it is important to explicitly capture and model relationships. My research is focused on designing analytical frameworks that operate on graph datasets -- both static graphs (previous work) and streaming graphs (current focus). My research methodology has three phases: (i) exploratory graph analysis to discover characteristic data patterns, (ii) explaining the discovered patterns by graph modeling, and (iii) designing analytical frameworks using the discovered patterns and graph models.
In the long term, I am interested in designing algorithms for preparing, managing, and processing challenging graph datasets. The challenge could lie in the acquisition of data, for instance, when there are concerns about the availability, privacy, reliability, or scale of the data. It could lie in processing requirements such as sensitivity, robustness, scalability, or explainability. It could also be a computational requirement such as novelty, sustainability, or composability. I am also interested in interdisciplinary work where graph data is suitable for modeling the data of particular application domains.
My Past Research
My Ph.D. research focused on statistical network analysis over real-world bipartite streaming graphs to discover emergence patterns of motif subgraphs, and on designing appropriate frameworks for streaming graph modeling, subgraph counting, and concept drift detection. Before my Ph.D., I performed numerical simulations and spectral analysis on synthetic graphs to learn the relationships between the structure and the dynamics (synchronization) of directed complex graphs, and accordingly introduced graph models enhanced for synchronization. I applied these graph models to study the functioning of neurons in the brain.
My M.Sc. Thesis Research
Synchronization (correlated behaviour of phase oscillators) in large populations of interacting elements is observed in different disciplines such as systems biology, neuroscience, electrochemistry, earth science, computer science, social science, and engineering [28]. In many applications, it is desirable to have networks with high synchronizability. Studies before my M.Sc. had shown that there is a relationship between the topological and spectral properties of a graph and its dynamics. However, most of these studies had not considered the direction of the links, even though most real networks are directed. I studied the intrinsic relationship between graph structures, the spectral characteristics of their associated matrices, and the implications for synchronization in different directed random graphs.
My thesis formulates spectral metrics for the synchronizability of graphs and the steady-state frequency of vertices. It also reveals the impact of loops, as well as the in-degrees and out-degrees of vertices, on the spectral properties of graphs and their synchronization. These findings [25,26] resulted in models for assigning directions to graph edges and helped analyze the synchronization of real-world graphs such as neuronal networks [32,39].
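For context, a standard spectral criterion from the master stability framework (a common example, not necessarily the exact metrics formulated in the thesis) ties synchronizability to the Laplacian spectrum: for Laplacian eigenvalues ordered as $0 = \lambda_1 \le \mathrm{Re}(\lambda_2) \le \dots \le \mathrm{Re}(\lambda_N)$, a network is considered more synchronizable when the eigenratio

    R = \frac{\mathrm{Re}(\lambda_N)}{\mathrm{Re}(\lambda_2)}

is small; for directed graphs the Laplacian is asymmetric, so its eigenvalues may be complex and are compared by their real parts.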
My Ph.D. Research
The graphs of particular interest in my Ph.D. thesis are those that emerge over time as entities and the relationships among them are established, and the corresponding data records (a payload indicating an edge, with a timestamp assigned by the generative source) stream to a processing unit. The main characteristic of these so-called streaming graphs is that they are unbounded: the full graph is never available to the algorithms processing them. An example is user-product interaction graphs in e-commerce applications, where the interactions of users with products (reviews, purchases, visits) are relationships captured as edges. The arrival rate of edges can be high, making real-time processing difficult; for example, Alibaba has reported a processing rate of 470 million events per second during a peak interval [20, Chapter 11]. The streaming graph model captures the characteristics of these real-world datasets.
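For concreteness, a streaming graph record can be sketched as a small data structure (a minimal illustration with hypothetical field names, not a prescribed schema):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SGR:
        """A streaming graph record: an edge payload plus its generation timestamp."""
        src: str        # e.g., a user identifier
        dst: str        # e.g., a product identifier
        timestamp: int  # assigned by the generative source; records may arrive out of order
        payload: tuple = ()  # application-specific attributes (e.g., rating, action type)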
My Ph.D. thesis presents a data-driven algorithm/system design approach for explainable and interpretable streaming graph analytics. The data-driven approach refers to exploratory analysis of streaming graphs to understand and identify the structural and temporal organizing principles of real-world streaming graphs, which then inform the design of effective and efficient processing algorithms. The explainable and interpretable approach is concerned with performing iterative and stateful tasks over streaming graphs such that the operations are explainable and the outputs are interpretable. I focused on bipartite graphs and (2,2)-bicliques, also known as butterflies.
Bipartite Graphs. The streaming graph records (SGRs) in many applications capture interactions that naturally occur in a bipartite mode. Specifically, the continuous interaction of users with items in a variety of contexts, for instance the association of users with products, services, facilities, or categories, generates streams of bipartite graph edges. Moreover, it has been shown that all complex networks have an underlying bipartite structure driving the topological structure of the unipartite version [19,30,12,29]. Also, unipartite and hypergraph edges, as well as non-graph streaming records, can be projected to bipartite graph edges (e.g., a hyperedge {a, b, c} maps to bipartite edges connecting a, b, and c to a new vertex representing the hyperedge). Therefore, bipartite graph edges represent either derived or natural streaming records.
Butterflies. A butterfly, or (2,2)-biclique, is the complete bipartite subgraph between two pairs of distinct vertices, one pair on each side of the bipartition. Similar to triangles in unipartite graphs, butterflies are the simplest and most local form of a cycle in bipartite graphs. Butterflies have been identified as one of the main topological drivers of structural features such as transitivity and degree assortativity, whose understanding is critical for improving the study and modeling of spreading phenomena on social networks with bipartite backbone graphs [30]. Moreover, butterflies are of great importance in measuring properties such as cohesion, network stability, and error tolerance [38]. For instance, cohesion can be measured by counting the number of butterflies adjacent to each vertex, or by the clustering coefficient computed as the fraction of length-three paths adjacent to each vertex that form butterflies. Recently, various butterfly-based data models and analytic algorithms for cohesive subgraph detection in heterogeneous information networks have been proposed [8,13,2,36,6,18].
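To make the definition concrete, butterflies can be counted exhaustively by neighborhood intersection: every pair of same-side vertices with c common neighbors contributes c-choose-2 butterflies. A minimal Python sketch of this standard formula (my own illustration, not the thesis algorithms):

    from itertools import combinations
    from math import comb

    def count_butterflies(edges):
        """Exact butterfly ((2,2)-biclique) count for a bipartite edge list."""
        adj = {}  # left vertex -> set of right-side neighbors
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
        # Each pair of left vertices with c common neighbors forms comb(c, 2) butterflies.
        return sum(comb(len(adj[u1] & adj[u2]), 2) for u1, u2 in combinations(adj, 2))

    # Users a and b both interacting with items x and y form exactly one butterfly.
    print(count_butterflies([("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]))  # -> 1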
In the following, I describe three components of my Ph.D. thesis.
1. Streaming Graph Mining. I studied the emergence patterns of butterflies in bipartite streaming graphs in two phases. The first phase [23, Section 3.2] showed that butterflies are temporal motifs with bursty emergence patterns; due to these patterns, the number of butterflies is significantly and continuously higher than in random (null) graphs. I formulated the quantitative emergence pattern as the butterfly densification power law (BPL), which states that the number of butterflies at time t follows a power law function of the number of edges at time t. Another finding was that the bursty butterfly formation is driven by vertices whose degrees are above the average of unique vertex degrees (hubs) and whose timestamps lie in the first 25% of the ordered set of already seen timestamps (old hubs). The second phase [24, Section 5.1] studied the tendency of butterfly vertices with the same strength (weighted degree) to connect (strength assortativity) and discovered a phenomenon called scale-invariant strength assortativity of streaming butterflies: the co-occurrence of three patterns, namely butterfly densification, strength diversification, and steady strength assortativity. I explained the data-driven semantics of this co-occurrence in the domain of user-item interactions by relating the patterns to three graph theory concepts: burstiness, rich-get-richer, and core-periphery.
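In formula form, the BPL states |B(t)| ~ alpha * |E(t)|^gamma for an exponent gamma and constant alpha; the exponent can be estimated by a least-squares fit in log-log space. A minimal sketch with synthetic numbers (the measurements and names here are illustrative, not data from the thesis):

    import numpy as np

    # Hypothetical edge and butterfly counts observed at successive times.
    edges = np.array([1_000, 5_000, 20_000, 80_000])
    butterflies = np.array([2_100, 33_000, 410_000, 5_200_000])

    # Fit log B = gamma * log E + log alpha (butterfly densification power law).
    gamma, log_alpha = np.polyfit(np.log(edges), np.log(butterflies), 1)
    print(f"BPL exponent gamma ~ {gamma:.2f}, alpha ~ {np.exp(log_alpha):.3g}")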
2. Streaming Graph Modeling. I explained the growth patterns in streaming graphs. As previous studies [4,15,7] describe, graph models providing the micro-mechanics or higher-order generative process of graph structures are generally deemed to explain the patterns observed in real-world graphs. Previous works studied and modeled the generative patterns of static or aggregated temporal graphs, commonly optimized for downstream analytics, or ignored (1) multi-partite/non-stationary data distributions, (2) emergence patterns (not just the existence) of building blocks, and (3) streaming paradigms such as unbounded/time-sensitive updates, evolving streaming rates, and out-of-order/bursty records (e.g., [1,11,35,2,16,3,31,37]). I introduced a streaming growth model, called sGrow [24, Section 6], which includes a set of microscopic mechanisms to explain the discovered patterns. Microscopic mechanisms, also known as `local rules', determine how new edges connect to the rest of the graph (an illustrative sketch of one such rule appears after the list below). sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. It suits the following cases:
Streaming graph benchmarks by generating configurable realistic data streams supported by a reference guide for parameter configuration and stress testing analyses.
Machine learning benchmarks by providing annotated data streams which are synthesized by realistic instance injection and suit both testing and training purposes.
Development of streaming algorithms and models by providing microscopic mechanisms and characteristic patterns.
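As an illustration of what a microscopic mechanism looks like, the sketch below grows a bipartite edge stream with a rich-get-richer bias and bursty timestamps; it is a toy local rule of my own, not the actual sGrow mechanics:

    import random

    def grow_step(stream, p_attach=0.8):
        """Append one edge: reuse an existing endpoint with probability p_attach
        (sampling from the edge list is degree-proportional, i.e., rich-get-richer),
        otherwise introduce a new vertex. Timestamps advance in bursts."""
        users = [u for u, _, _ in stream]
        items = [i for _, i, _ in stream]
        t = stream[-1][2] + random.choice([0, 0, 1, 3])  # repeated timestamps = bursts
        u = random.choice(users) if random.random() < p_attach else f"u{len(stream)}"
        i = random.choice(items) if random.random() < p_attach else f"i{len(stream)}"
        stream.append((u, i, t))

    stream = [("u0", "i0", 0)]
    for _ in range(100):
        grow_step(stream)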
3. Streaming Graph Analytics. I developed two streaming graph frameworks, for butterfly counting and concept drift detection. The results from the previous research component confirm that butterflies are temporal motifs in bipartite streaming graphs; moreover, as noted in various studies (e.g., [27,21,5]), motif counting is a fundamental problem in large-scale network analysis. Therefore, in this part of the thesis, I studied the problem of butterfly counting in bipartite streaming graphs. Butterfly counting benefits many applications where studying cohesion in graph data is of particular interest; examples include investigating the structure of computational graphs or of input graphs to algorithms, as well as dynamic phenomena and analytic tasks over complex real graphs.
It also helps to summarize a streaming graph as a network of butterfly motifs for applications such as concept drift (CD) detection. Butterfly counting is computationally expensive, and known techniques do not scale to large graphs; the problem is even harder in streaming graphs. CD detection, in turn, is important in non-stationary settings such as streaming data, in which a change in a hidden context induces changes in a target concept, and adapting to the changes is critical for generating reliable outputs. Previous works on detecting CD in streaming data (a) focus on drift detection and adaptation integrated within supervised learning systems [10,17,33,9], (b) assume independence of streaming data records [14], (c) analyze drift occurrence using parameters, and/or (d) focus on arrival time points without considering generation timestamps, particularly for sequences of generic attributed graphs [34,22]. These assumptions and design choices are not always realistic. Therefore, I also studied the problem of detecting concept drift in bipartite streaming graphs with no assumptions on the data consumer task(s) or the availability of supervision labels. I introduced sGrapp [23, Section 4], a streaming graph framework for butterfly count approximation, and sGradd, a streaming graph framework for concept drift detection.
sGrapp uses a novel window-based stream processing approach that adapts to the temporal distribution of the stream. It approximates the cumulative number of butterflies by counting the butterflies within each window exactly and approximating the inter-window butterflies using the BPL. sGrapp achieves a mean absolute percentage error of at most 0.05 (0.14) for the cumulative butterfly count in streaming graphs with uniform (non-uniform) temporal distribution, and a processing throughput of 1.5 million data records per second. sGrapp's throughput is 160 times that of baselines, and its estimation error is 0.02 times theirs.
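A simplified sketch of the window-plus-BPL idea (my own simplification, not sGrapp's actual estimator; it reuses count_butterflies from the earlier sketch and assumes a fitted BPL exponent):

    def approx_cumulative_butterflies(windows, alpha, gamma):
        """Exact counts inside each window; butterflies spanning window
        boundaries are approximated via the BPL fit |B| ~ alpha * |E|^gamma."""
        exact, seen_edges = 0, 0
        for w in windows:                   # w is the list of edges in one window
            exact += count_butterflies(w)   # exact within-window count
            seen_edges += len(w)
        bpl_total = alpha * seen_edges ** gamma
        inter_window = max(0.0, bpl_total - exact)  # BPL gap attributed to inter-window butterflies
        return exact + inter_window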
sGradd summarizes the input streaming graph into a network of butterflies, represented as a network of phase oscillators whose phases denote the neighborhoods of butterflies. It then tracks changes in phase synchronization to detect changes in the emergence of butterflies, which signal a change in the generative source of the streaming graph. sGradd supports any analytics (supervised or unsupervised) and incurs low overhead, while its accuracy improves over time. It detects various drift types without supervision, achieves zero false alerts, explains the time and location of detected drifts, adapts to the streaming rate, and requires no input thresholds, prototype parameters, or graph attributes.
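The synchronization-tracking idea can be illustrated with the standard Kuramoto order parameter r = |(1/N) * sum_j e^{i theta_j}|, which equals 1 when all phases coincide and is near 0 when they are incoherent; a sustained shift in r then serves as a drift signal. A minimal sketch of this tracking idea (with hypothetical window and threshold values, not sGradd itself):

    import numpy as np

    def order_parameter(phases):
        """Kuramoto order parameter: 1 = fully synchronized, ~0 = incoherent."""
        return abs(np.mean(np.exp(1j * np.asarray(phases))))

    def drift_alert(r_history, window=10, delta=0.2):
        """Alert when the recent mean of r departs from the long-run baseline."""
        if len(r_history) < 2 * window:
            return False
        recent = np.mean(r_history[-window:])
        baseline = np.mean(r_history[:-window])
        return abs(recent - baseline) > delta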
My Postdoctoral Research
After my Ph.D., I have been working on the data enrichment phase in the broader area of data engineering for analyzing streaming graph data records. My particular interest has been developing lightweight and data-less predictive technologies. I have continued my research on transient concepts in streaming graphs, taking the sGradd framework to the next level by introducing another framework called sGradp. What distinguishes sGradp from sGradd is its ability to predict CDs rather than detect them with a delay, without using the payload of streaming data records. sGradp enjoys on-the-fly, general-purpose data management and lightweight CD checks.
My Future Research
A streaming graph management system (SGMS) generates unreliable outputs when the input data is new, temporal, incomplete, or manipulated by an adversary, or when the processing tasks are not well-engineered with respect to their functioning and execution, and the system does not recognize and manage these issues properly. Therefore, it is important to perform a reliability check on the input data and the processing tasks before any processing. This can be managed in two ways: (1) tracking the origin of issues and augmenting the data stream with explicit descriptive alerts to inform the system for adjustments, or (2) tracking the impact of issues on the performance of system tasks and adjusting to implicitly detected issues. While the second approach benefits human-in-the-loop and task-specialized adaptive systems, the first suits general-purpose systems that run several queries on the same data (for instance, a knowledge graph), since the overhead of the reliability check is not repeated for each query. Moreover, separating the reliability check from the adaptation process enables explainable AI applications such as ``real-time monitoring or control of some automated activity'', ``organizing and personalizing information'', and ``characterizing health, well-being, or a state of humans, economics, or entities'' [40]. Therefore, when data becomes available in the input monitor of an SGMS, a reliability check can enhance performance.
Since the final years of my Ph.D. studies, I have been working on real-time detection and prediction of CD in streaming graphs by detecting a change in the generative source of the data stream. CD examination can also be performed by detecting changes in other types of hidden contexts. Moreover, CD is one cause of unreliable data, since it leads to new and temporal data. Other sources of unreliable data include transmission errors, lack of representative/sufficient data records, and security issues, which lead to incomplete and invalid inputs. Additionally, the processing tasks themselves can be problematic, e.g., when tasks are incorrectly coded, run with inconsistent or inappropriate orchestration, or fail to meet hardware and execution constraints. An interesting long-term objective would be assuring reliable analytics in an SGMS. This splits into two research components: (1) detecting data hazards, and (2) detecting managing and processing hazards. Further examples of the issues detected in each component are listed in the table below.
Given an SGMS with a certain architecture and functionalities, cohesive algorithms and data structures for detecting these reliability issues should be developed.
Data Hazards
This component is concerned with data-related issues that affect the performance of an SGMS. Data can be categorized as the raw data records arriving at the input monitor of the system; the derived data records generated by an internal component of the SGMS as intermediate or final results consumed by the same or other processing components/tasks; or the metadata describing the data records or a system state. The issues can be categorized according to their root causes, such as the generative process, communication channel, or ingestion process, and also according to the affected system component. A sub-problem would be efficiently scanning the system and reporting its architecture when it is not given.
There are many works on detecting invalid, insufficient, incomplete, erroneous, and abnormal data records in a given dataset. Detecting such issues in streaming graphs must be simultaneously effective and efficient, especially because detection happens before any processing and significantly affects the performance of the SGMS. Given these challenges and unexplored areas, detecting data hazards in streaming graphs is a valuable research direction. Detecting each type of data hazard in streaming graphs can be a research milestone for an interested master's student.
Managing and Processing Hazards
This component is concerned with issues related to factors other than data, including the management and processing tasks. The issues can be categorized according to a performance criterion such as accuracy, sensitivity, robustness, scalability, explainability, privacy, or security, and also according to an execution criterion such as the execution context or the computing hardware.
There are many works on protecting certain tasks against information leakage, failures, and execution context switches, and on assuring the reliability of systems with respect to the consumption and status of resources. These works primarily focus on the effectiveness of cures, not the detection of issues. Moreover, SGMSs face strict efficiency requirements due to continuous execution, rapidly evolving inputs, and demanding service-level agreements. Therefore, research on the development of comprehensive checkups for software and system reliability in SGMSs is promising. This research component also requires expertise and familiarity with the internal components of SGMSs, as well as consistency with the data hazard research component. Two interested Ph.D. students could work on this component, in collaboration with the master's students working on the data hazards component.