ENQUIRE PROJECT DETAILS BY GENERAL PUBLIC

Project Details

Funding Scheme :

General Research Fund

Project Number :

11213620

Project Title(English) :

Algorithmic Foundations of Distributed Large-Scale Graph Processing

Project Title(Chinese) :

分布式大型图处理的算法基础

Principal Investigator(English) :

Prof LI, Minming

Principal Investigator(Chinese) :

Department :

Department of Computer Science

Institution :

City University of Hong Kong

E-mail Address :

mli000@cityu.edu.hk

Tel :

27889538

Co - Investigator(s) :

Dr Konrad, Christian

Prof ROBINSON, Peter

Panel :

Engineering

Subject Area :

Computing Science & Information Technology

Exercise Year :

2020 / 21

Fund Approved :

807,204

Project Status :

Completed

Completion Date :

30-6-2024

Project Objectives :

Designing faster distributed algorithms on large-scale graphs for solving problems such as all-pairs-shortest-paths, graph spanners, and diameter verification, as well as local problems such as computing maximal independent sets, maximal matchings, and ruling sets. We also plan to address these problems on dynamically-changing graphs. This includes an experimental evaluation of the resulting algorithms in real-world graph processing frameworks.

Obtaining complexity lower bounds for the message-time trade-off inherent to algorithms for global graph problems. Due to the relationship between the message complexity of CONGEST algorithms and fast k-machine algorithms, we will derive lower bounds in both models for problems such as graph spanners, broadcast (in the CONGEST model), and constructing breadth first search trees.

Developing a complexity theory for large-scale graph processing and investigating the formal relationship between models for massive graphs such as the k-machine model and the MPC model. We will propose time complexity classes in the k-machine model and study the existence of minimal problems with respect to these classes.

Abstract as per original application
(English/Chinese):

Achieving efficient computation on massive graph data sets is a fundamental challenge with numerous applications, ranging from social network analysis and machine learning to protein interaction graphs studied in computational biology. The scale of such massive graphs, which often exceeds the memory and bandwidth capabilities of a single machine has prompted practitioners to develop distributed graph processing frameworks such as Google Pregel, Apache Spark/GraphX and Giraph. The overarching goal of this project is to develop the theoretical foundations of algorithms for efficient distributed computation on massive graphs in computing models that capture the main features of real-world graph processing frameworks. This includes the design of new distributed algorithms that scale to massive graphs for problems such as computing shortest distances between vertices, computing maximal independent sets, and finding sparse subgraphs (i.e. with few edges) that exhibit certain properties. Given that real-world data sets are continuously evolving, we also plan to derive new algorithms that can handle dynamically-changing input graphs. The obtained algorithms will be experimentally evaluated by implementing them in open source graph processing frameworks, which will have a tangible impact on industrial practice. A crucial challenge for performing fast distributed computation on large graphs is to come up with algorithms that do not send too many messages, i.e., are communication-efficient. We will explore new algorithmic techniques to overcome this obstacle and we anticipate that this also provides insights on designing communication- and time-efficient solutions for classic distributed computing models, where each graph vertex represents a computational entity. A complementary goal of the project is to understand how much communication a distributed graph algorithm needs to perform, while still being able to quickly arrive at a correct solution. More specifically, this work will reveal the complexity lower bounds on the trade-offs between time and the required communication in distributed computing models, which is still an open question for many fundamental graph problems. To place our results in a broader context, we will investigate the complexity-theoretic foundations of problems in the large-scale graph setting by developing complexity classes and providing simulations between different computational models for massive data sets.

在海量圖數據集上實現高效計算是眾多應用的一項基本挑戰，範圍從社交網絡分析和機器學習到在計算生物學中研究的蛋白質相互作用圖。這種海量圖形的規模通常超過一台機器的內存和帶寬能力，促使從業人員開發分佈式圖形處理框架，例如Google Pregel，Apache Spark / GraphX和Giraph。此項目的總體目標是為計算模型中的大量圖做出高效的分佈式計算算法開發理論基礎。所謂的計算模型圖形應呈現實際圖形處理框架的主要特徵。我們的總體目標包括設計新的分佈式算法，這些算法可縮放到大量圖以解決問題，例如計算頂點之間的最短距離，計算最大獨立集以及查找表現出某些特性的稀疏子圖（即邊緣很少）。鑑於現實世界中的數據集在不斷發展，我們還計劃推出可以處理動態變化的輸入圖的新算法。我們將利用開放源代碼圖形處理框架執行這些新算法，並進行實驗評估。這將對工業實踐產生切實的影響。在大型圖形上執行快速分佈式計算的關鍵挑戰是提出一種不會發送太多消息的算法, 即通信效率高。我們將探索新的算法技術來克服這一障礙，並且我們預計這還將會為經典的分佈式計算模型 (每個圖形頂點代表一個計算實體) 提供即省時又通信效率高的解決方案。該項目的一個補充目標是了解即快速並準確的分佈式圖形算法所需要執行的通信數。更具體地說，這項工作將揭示時間與分佈式計算模型中所需的通信之間的權衡取捨的複雜性下限，這對於許多基本圖形問題仍然是一個未解決的問題。為了將我們的結果放在更廣闊的背景下，我們將通過開發複雜度類並為海量數據集提供不同計算模型之間的仿真，來研究大規模圖形設置中問題的複雜度理論基礎。

Realisation of objectives:

For the first objective, we have the following outcome. 1.Maximal Matching in k-machine model. We study maximal matching in the k-machine model and give some efficient algorithms. 2.Dynamic maximal matching in the k-machine model. We give algorithms for dynamic maximal matching in the k-machine model with batches of updates, where there are two different adversaries, i.e., oblivious adversary and adaptive adversary and the dynamic updates include edge deletions and edge insertions. 3.Correlation clustering in the MPC model. We give two efficient algorithms for correlation clustering in the MPC model for random graphs created from the SBM model. 4.Perfect Matching in complete bipartite graphs. We give several different algorithms for finding perfect matching in complete bipartite graphs in different distributed models. We also have some experiment, which demonstrates our results. Publications: ESA 2023, ITCS 2024, SIROCCO 2025(short) (will submit to Journal TCS). For objective 2, we have the following outcome. 1. Leader Election. We consider leader election in clique networks. In the synchronous setting with simultaneous wake-up, we prove there is a trade-off between messages and rounds. We also improve the best known upper bounds. In addition, we also give many efficient algorithms under different settings. 2. Dynamic Maximal Matching in the k-machine model. We give lower bounds for dynamic algorithms dealing with batches of updates. For oblivious adversaries and adaptive adversaries, we give two different lower bounds. Publication: PODC 2023 (Invited to Journal Distributed Computing). ITCS 2024. This objective is finished by 90%. We did not give lower bounds for the CONGEST model. The major reason is that we do not find a good/meaningful candidate problem to show a lower bound for CONGEST model. For objective 3, we found that the relationship between k-machine model and the MPC model is not that interesting. Basically, if we make the bandwidth of them to be the same, it is the relationship between the vertex-partition model and the edge-partition model. We can transform one to the other.

Summary of objectives addressed:

	Objectives	Addressed	Percentage achieved
1.	Designing faster distributed algorithms for solving problems on large-scale graphs such as all-pairs-shortest-paths, graph spanners, and diameter verification, as well as local problems such as computing maximal independent sets, maximal matchings, and ruling sets. This includes an experimental evaluation by implementing the resulting algorithms in graph processing frameworks. This includes the following task (as stated in the proposal): Task 2: Fast Algorithms for Global Graph Problems under Dynamic Changes Task 3: Symmetry Breaking Problems on Large Graphs	Yes	100%
2.	Obtaining complexity lower bounds for the message-time trade-off inherent to algorithms for global graph problems. Due to the relationship between the message complexity of CONGEST algorithms and fast k-machine algorithms, we will derive lower bounds in both models for problems such as graph spanners, broadcast (only in the CONGEST model), and constructing breadth first search trees. This includes the following task (as stated in the proposal): Task 1: Message-Time Trade-Offs for Global Graph Problems	Yes	90%
3.	Developing a complexity theory for large-scale graph processing and investigating the formal relationship between models for massive graphs such as the k-machine model and the MPC model. We will propose time complexity classes in the k-machine model and study the existence of minimal problems with respect to these classes. This includes the following task (as stated in the proposal): Task 4: Developing a Complexity Theory for Large-Scale Graph Processing	Yes	80%

Research Outcome

Major findings and research outcome:

Leader election is the fundamental tool in most of distributed systems. Our major finding is the new trade-offs for leader election, which improves the previous results significantly (PODC 2023). We give lower bounds and upper bounds in several settings. Data is inherently dynamic, and how to deal with dynamic data becomes more and more important and challenging. We establish new upper bounds and lower bounds for dynamic problems in the large-scale distributed model (ITCS 2024). Since nowadays, data are stored in a distributed way, our dynamic algorithms will save time and money for updating the systems, database and so on when the data changes. Our lower bound will show the bottleneck of dealing with updates. It is known that clustering plays important roles in data mining. We give near optimal algorithms for the correlation clustering (ESA 2023) which will make many applications using this kind of clustering more efficient than before.

Potential for further development of the research
and the proposed course of action:

Dynamic problems have many open questions in distributed models. One of the most important tasks is to design distributed data structures for maintaining solutions. It is not clear how to design such algorithms, and also it is hard to show the lower bounds. A good start of exploring this direction is to start from some famous and important problems like graph coloring and matching. To make it possible to utilize distributed data structures, it is allowed to increase the power of the model by introducing some auxiliary equipment. To show the lower bounds, the possible tools are from communication complexity and information theory.

Layman's Summary of
Completion Report:

We study different topics including leader election, matching, dynamic problems, clustering. For these problems, we give algorithms under different settings. Most of our results are theoretical, but these theories have real applications. It means that one can improve the performance of information technology in the industry by implementing our results. We also prove that it is impossible to achieve better than some targets, which can save a lot of unnecessary efforts.

Research Output

Peer-reviewed journal publication(s)
arising directly from this research project :
(* denotes the corresponding author)

Recognized international conference(s)
in which paper(s) related to this research
project was/were delivered :

Month/Year/City	Title	Conference Name
Virtual	Being Fast Means Being Chatty: The Local Information Cost of Graph Spanners	ACM-SIAM SODA 2021
Virtual	Can We Break Symmetry with o(m) Communication?	ACM PODC 2021
Orlando	Improved Tradeoffs for Leader Election	PODC 2023
Amsterdam	Massively Parallel Algorithms for the Stochastic Block Model	ESA 2023
Vienna	Dynamic Maximal Matching in Clique Networks	ITCS 2024
Delphi	Perfect Matching with Few Link Activations	SIROCCO 2025

Other impact
(e.g. award of patents or prizes,
collaboration with other research institutions,
technology transfer, etc.):

SCREEN ID: SCRRM00542