
Commit fbd23bb

Add figures to distributed tutorial (#8917)

Authors: rusty1s, jjpietrak, JakubPietrakIntel, kgajdamo, ZhengHongming888

Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: Kinga Gajdamowicz <[email protected]>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1 parent 4c32243, commit fbd23bb

File tree: 6 files changed (+21 / -4 lines changed)

docs/source/_figures/dist_part.png (binary file added, 113 KB)
[Three additional binary figure files added: 59.6 KB, 29.3 KB, and 58.9 KB]

docs/source/conf.py

Lines changed: 2 additions & 0 deletions
@@ -61,6 +61,8 @@
     '_static/thumbnails/explain.png',
     'tutorial/shallow_node_embeddings':
     '_static/thumbnails/shallow_node_embeddings.png',
+    'tutorial/distributed_pyg':
+    '_static/thumbnails/distributed_pyg.png',
     'tutorial/multi_gpu_vanilla':
     '_static/thumbnails/multi_gpu_vanilla.png',
     'tutorial/multi_node_multi_gpu_vanilla':

docs/source/tutorial/distributed_pyg.rst

Lines changed: 19 additions & 4 deletions
@@ -1,6 +1,9 @@
 Distributed Training in PyG
 ===========================
 
+.. figure:: ../_figures/intel_kumo.png
+    :width: 400px
+
 .. note::
     We are thrilled to announce the first **in-house distributed training solution** for :pyg:`PyG` via :class:`torch_geometric.distributed`, available from version 2.5 onwards.
     Developers and researchers can now take full advantage of distributed training on large-scale datasets which cannot be fully loaded in memory of one machine at the same time.
@@ -15,11 +18,11 @@ Key Advantages
 --------------
 
 #. **Balanced graph partitioning** via METIS ensures minimal communication overhead when sampling subgraphs across compute nodes.
-#. Utilizing **DDP for model training in conjunction with RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
+#. Utilizing **DDP for model training** in conjunction with **RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
 #. The implementation via custom :class:`~torch_geometric.data.GraphStore` and :class:`~torch_geometric.data.FeatureStore` APIs provides a flexible and tailored interface for distributing large graph structure information and feature storage.
-#. Distributed neighbor sampling is capable of sampling in both local and remote partitions through RPC communication channels.
-   All advanced functionality of single-node sampling are also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*..
-#. Distributed data loaders offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders..
+#. **Distributed neighbor sampling** is capable of sampling in both local and remote partitions through RPC communication channels.
+   All advanced functionality of single-node sampling are also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*.
+#. **Distributed data loaders** offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders.
 #. Incorporating the Python `asyncio <https://docs.python.org/3/library/asyncio.html>`_ library for asynchronous processing on top of :pytorch:`PyTorch`-based RPCs further enhances the system's responsiveness and overall performance.
 
 Architecture Components
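
To make the GraphStore/FeatureStore point above concrete, the following sketch shows how a single machine could load just its own partition into the local store backends. This is an illustrative example only, not part of this commit: the LocalGraphStore/LocalFeatureStore classes live in torch_geometric.distributed, but the from_partition() calls, their keyword names, and the partition directory shown here are assumptions that may differ between versions.

# Hedged sketch (not part of this commit): each machine loads only its own
# partition into the local GraphStore/FeatureStore backends. The
# `from_partition()` class methods and keyword names are assumptions.
from torch_geometric.distributed import LocalFeatureStore, LocalGraphStore

partition_dir = './partitions/Cora'  # directory written by the Partitioner step
partition_idx = 0                    # the partition owned by this machine

graph_store = LocalGraphStore.from_partition(partition_dir, pid=partition_idx)
feature_store = LocalFeatureStore.from_partition(partition_dir, pid=partition_idx)

# The distributed loaders consume this tuple in place of a regular `Data` object:
data = (feature_store, graph_store)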
@@ -57,6 +60,12 @@ This ensures that the resulting partitions provide maximal local access of neigh
 Through this partitioning approach, every edge receives a distinct assignment, while "halo nodes" (1-hop neighbors that fall into a different partition) are replicated.
 Halo nodes ensure that neighbor sampling for a single node in a single layer stays purely local.
 
+.. figure:: ../_figures/dist_part.png
+    :align: center
+    :width: 100%
+
+    Graph partitioning with halo nodes.
+
 In our distributed training example, we prepared the `partition_graph.py <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/distributed/pyg/partition_graph.py>`_ script to demonstrate how to apply partitioning on a selected subset of both homogeneous and heterogeneous graphs.
 The :class:`~torch_geometric.distributed.Partitioner` can also preserve node features, edge features, and any temporal attributes at the level of nodes and edges.
 Later on, each node in the cluster then owns a single partition of this graph.
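
A minimal sketch of this partitioning step is given below. It mirrors what the linked partition_graph.py script does but is not the script itself; the Partitioner argument names (num_parts, root) are assumptions, and the small Cora dataset merely stands in for a graph that would normally be too large for a single machine.

# Minimal partitioning sketch (illustrative; see partition_graph.py for the
# authoritative version). Argument names are assumptions.
from torch_geometric.datasets import Planetoid
from torch_geometric.distributed import Partitioner

data = Planetoid(root='./data/Cora', name='Cora')[0]

# METIS-based partitioning: each partition, together with its halo nodes and
# (optionally) node/edge features, is written to its own subdirectory.
partitioner = Partitioner(data, num_parts=2, root='./partitions/Cora')
partitioner.generate_partition()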
@@ -174,6 +183,12 @@ A batch of seed nodes follows three main steps before it is made available for t
 #. **Data conversion:** Based on the sampler output and the acquired node (or edge) features, a :pyg:`PyG` :class:`~torch_geometric.data.Data` or :class:`~torch_geometric.data.HeteroData` object is created.
    This object forms a batch used in subsequent computational operations of the model.
 
+.. figure:: ../_figures/dist_sampling.png
+    :align: center
+    :width: 450px
+
+    Local and remote neighbor sampling.
+
 Distributed Data Loading
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
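
Continuing the earlier store-loading sketch, the example below illustrates how a distributed neighbor loader could feed a DDP-wrapped model with regular PyG mini-batches. All DistContext and DistNeighborLoader constructor arguments shown here are assumptions (consult the distributed examples in the repository for the authoritative setup), and process-group/RPC initialization is omitted for brevity.

# Hedged sketch (constructor arguments are assumptions; process-group and RPC
# initialization are omitted for brevity).
import torch
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.distributed import DistContext, DistNeighborLoader
from torch_geometric.nn import GraphSAGE

# `data` is the (feature_store, graph_store) tuple from the earlier sketch;
# `train_idx` holds seed nodes owned by this partition (placeholder values).
train_idx = torch.arange(128)

current_ctx = DistContext(rank=0, global_rank=0, world_size=2,
                          global_world_size=2, group_name='dist-train')

loader = DistNeighborLoader(
    data,                      # (feature_store, graph_store)
    num_neighbors=[15, 10],    # per-layer fan-out, as in NeighborLoader
    input_nodes=train_idx,     # seed nodes for this partition
    batch_size=1024,
    current_ctx=current_ctx,
    master_addr='localhost',   # RPC endpoint for remote sampling/feature fetching
    master_port=11111,
)

model = GraphSAGE(in_channels=1433, hidden_channels=64, num_layers=2,
                  out_channels=7)
model = DistributedDataParallel(model)  # data parallelism across partitions

for batch in loader:           # each batch is a regular `Data` object
    out = model(batch.x, batch.edge_index)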