Information-Oriented Random Walks and Pipeline Optimization for Distributed Graph Embedding

Peng Fang, Zhenli Li, Arijit Khan, Siqiang Luo, Fang Wang, Zhan Shi, Dan Feng

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Graph embedding maps graph nodes to low-dimensional vectors and is widely used in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric random walk-based, and pipeline-optimized graph embedding framework, DistGER-Pipe, which scales to embed billion-edge graphs. DistGER-Pipe incrementally computes information-centric random walks to reduce redundant computations for more effective and efficient graph embedding. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. DistGER-Pipe also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Finally, DistGER-Pipe designs pipelined execution that decouples the operators in the sampling and training procedures with inter-round serial and intra-round parallel processing, attaining optimal utilization of computing resources. Experiments on real-world graphs demonstrate that compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, PyTorch-BigGraph, and DistGER, DistGER-Pipe exhibits 3.15×–1053× acceleration, a 45% reduction in cross-machine communication, a >10% effectiveness improvement in downstream tasks, and a 38% enhancement in CPU utilization.
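As a rough illustration of the sampling-and-training flow the abstract describes, the sketch below pairs a plain truncated random-walk sampler with Skip-Gram (center, context) pair extraction. This is a minimal, single-machine toy under stated assumptions, not DistGER-Pipe's information-centric walker, partitioner, or distributed trainer; the function names (`random_walks`, `skipgram_pairs`) and the toy adjacency list are illustrative only.

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=42):
    """Generate truncated uniform random walks over an adjacency list."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:          # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

def skipgram_pairs(walks, window=2):
    """Extract (center, context) training pairs within a sliding window."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs

# Toy graph: a triangle (0-1-2) with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adj)
pairs = skipgram_pairs(walks)
```

In a full pipeline, the resulting pairs would feed a Skip-Gram model with negative sampling; DistGER-Pipe's contribution lies in decoupling and overlapping exactly these sampling and training stages across machines.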
Original language: English
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 37
Issue number: 1
Pages (from-to): 408-422
Number of pages: 14
ISSN: 1041-4347
DOIs
Publication status: Published - Jan 2025

Keywords

  • Computational modeling
  • Distributed graph embedding
  • Information-centric
  • Pipeline execution
  • Pipelines
  • Scalability
  • Servers
  • Synchronization
  • Task analysis
  • Vectors
