Abstract as per original application
Achieving efficient computation on massive graph data sets is a fundamental challenge with numerous applications, ranging from social network analysis and machine learning to the protein interaction graphs studied in computational biology. The scale of such massive graphs, which often exceeds the memory and bandwidth capabilities of a single machine, has prompted practitioners to develop distributed graph processing frameworks such as Google Pregel, Apache Spark/GraphX, and Giraph. The overarching goal of this project is to develop the theoretical foundations of algorithms for efficient distributed computation on massive graphs, in computing models that capture the main features of real-world graph processing frameworks. This includes the design of new distributed algorithms that scale to massive graphs for problems such as computing shortest distances between vertices, computing maximal independent sets, and finding sparse subgraphs (i.e., subgraphs with few edges) that exhibit certain properties. Given that real-world data sets are continuously evolving, we also plan to derive new algorithms that can handle dynamically changing input graphs. The resulting algorithms will be experimentally evaluated by implementing them in open-source graph processing frameworks, which will have a tangible impact on industrial practice.
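To make the vertex-centric computing model concrete, the following is a minimal single-machine sketch of a Pregel-style superstep computation for single-source shortest paths. It is illustrative only: the function name, graph representation, and message-passing loop are our own simplifications, and a real framework would partition vertices across machines and exchange the messages over the network.

```python
# Illustrative sketch of a Pregel-style vertex-centric computation
# (single-machine simulation; not tied to any particular framework's API).
INF = float("inf")

def pregel_sssp(graph, source):
    """Single-source shortest paths in the superstep/message model.

    graph: {vertex: [(neighbor, edge_weight), ...]}
    Returns {vertex: shortest_distance_from_source}.
    """
    dist = {v: INF for v in graph}
    # Superstep 0: only the source is active, with distance 0.
    inbox = {source: [0]}
    while inbox:  # run supersteps until no messages remain in flight
        outbox = {}
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < dist[v]:
                # Improved distance: adopt it and notify all neighbors.
                dist[v] = best
                for u, w in graph[v]:
                    outbox.setdefault(u, []).append(best + w)
        # Vertices that receive no messages stay halted next superstep.
        inbox = outbox
    return dist

example = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [],
}
print(pregel_sssp(example, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

Note that the total number of messages placed in `outbox` across all supersteps is exactly the communication cost that the project's communication-efficiency questions are concerned with.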
A crucial challenge in performing fast distributed computation on large graphs is to design algorithms that do not send too many messages, i.e., that are communication-efficient. We will explore new algorithmic techniques to overcome this obstacle, and we anticipate that they will also provide insights into designing communication- and time-efficient solutions for classic distributed computing models, in which each graph vertex represents a computational entity.
A complementary goal of the project is to understand how much communication a distributed graph algorithm must perform while still arriving quickly at a correct solution. More specifically, this work will establish complexity lower bounds on the trade-off between time and the required communication in distributed computing models, which remains an open question for many fundamental graph problems. To place our results in a broader context, we will investigate the complexity-theoretic foundations of problems in the large-scale graph setting by developing complexity classes and providing simulations between different computational models for massive data sets.