docs/source/tutorial/distributed_pyg.rst
19 additions & 4 deletions
@@ -1,6 +1,9 @@
 Distributed Training in PyG
 ===========================
 
+.. figure:: ../_figures/intel_kumo.png
+    :width: 400px
+
 .. note::
     We are thrilled to announce the first **in-house distributed training solution** for :pyg:`PyG` via :class:`torch_geometric.distributed`, available from version 2.5 onwards.
     Developers and researchers can now take full advantage of distributed training on large-scale datasets which cannot be fully loaded into the memory of a single machine.
@@ -15,11 +18,11 @@ Key Advantages
 --------------
 
 #. **Balanced graph partitioning** via METIS ensures minimal communication overhead when sampling subgraphs across compute nodes.
-#. Utilizing **DDP for model training in conjunction with RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
+#. Utilizing **DDP for model training** in conjunction with **RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
 #. The implementation via custom :class:`~torch_geometric.data.GraphStore` and :class:`~torch_geometric.data.FeatureStore` APIs provides a flexible and tailored interface for distributing large graph structure information and feature storage.
-#. Distributed neighbor sampling is capable of sampling in both local and remote partitions through RPC communication channels.
-   All advanced functionality of single-node sampling are also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*..
-#. Distributed data loaders offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders..
+#. **Distributed neighbor sampling** is capable of sampling in both local and remote partitions through RPC communication channels.
+   All advanced functionality of single-node sampling is also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*.
+#. **Distributed data loaders** offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders.
 #. Incorporating the Python `asyncio <https://docs.python.org/3/library/asyncio.html>`_ library for asynchronous processing on top of :pytorch:`PyTorch`-based RPCs further enhances the system's responsiveness and overall performance.
 
 Architecture Components
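The DDP-plus-RPC combination in the list above is the part that differs most from single-node training: a gloo-backed process group synchronizes gradients, while a separate RPC layer answers remote sampling and feature-fetching requests. The following is only a minimal, generic sketch of that two-layer pattern using plain :pytorch:`PyTorch` APIs, with hypothetical model dimensions, addresses, and ports; it is not the tutorial's actual setup code, which wires this up through :class:`torch_geometric.distributed`.

.. code-block:: python

    import os

    import torch.distributed as dist
    import torch.distributed.rpc as rpc
    from torch.nn.parallel import DistributedDataParallel
    from torch_geometric.nn import GraphSAGE


    def init_worker(rank: int, world_size: int) -> DistributedDataParallel:
        # Rendezvous address for the DDP process group (hypothetical values):
        os.environ.setdefault('MASTER_ADDR', 'localhost')
        os.environ.setdefault('MASTER_PORT', '12355')

        # Layer 1: gloo-backed process group, used by DDP to average gradients.
        dist.init_process_group('gloo', rank=rank, world_size=world_size)

        # Layer 2: RPC agents that serve remote neighbor sampling and feature
        # fetching requests between partitions; a separate rendezvous port is
        # used here so the two layers do not collide.
        rpc.init_rpc(
            f'worker{rank}', rank=rank, world_size=world_size,
            rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                init_method='tcp://localhost:12356'),
        )

        # Hypothetical model dimensions, chosen only for illustration:
        model = GraphSAGE(in_channels=128, hidden_channels=256, num_layers=2,
                          out_channels=47)
        return DistributedDataParallel(model)

Each worker (one per partition) would call :obj:`init_worker` with its own rank before entering the training loop.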
@@ -57,6 +60,12 @@ This ensures that the resulting partitions provide maximal local access of neigh
 Through this partitioning approach, every edge receives a distinct assignment, while "halo nodes" (1-hop neighbors that fall into a different partition) are replicated.
 Halo nodes ensure that neighbor sampling for a single node in a single layer stays purely local.
 
+.. figure:: ../_figures/dist_part.png
+    :align: center
+    :width: 100%
+
+    Graph partitioning with halo nodes.
+
 In our distributed training example, we prepared the `partition_graph.py <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/distributed/pyg/partition_graph.py>`_ script to demonstrate how to apply partitioning on a selected subset of both homogeneous and heterogeneous graphs.
 The :class:`~torch_geometric.distributed.Partitioner` can also preserve node features, edge features, and any temporal attributes at the level of nodes and edges.
 Later on, each node in the cluster then owns a single partition of this graph.
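As a rough companion to the description above, here is a minimal sketch of running the :class:`~torch_geometric.distributed.Partitioner` directly on a small synthetic graph. The dataset and output directory are placeholders, and the constructor arguments (data object, number of parts, output root) are assumed to follow the `partition_graph.py` script linked above.

.. code-block:: python

    from torch_geometric.datasets import FakeDataset
    from torch_geometric.distributed import Partitioner

    # A small synthetic homogeneous graph keeps the sketch self-contained;
    # any `Data`/`HeteroData` object would be handled in the same way.
    data = FakeDataset(avg_num_nodes=1000)[0]

    # Split the graph into two METIS-based partitions and write them to disk.
    # Node features, edge features, and temporal attributes are stored per
    # partition so that every machine later loads only its own shard.
    partitioner = Partitioner(data, num_parts=2, root='./partitions')
    partitioner.generate_partition()

Each machine in the cluster is then pointed at its own partition under the chosen root directory.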
@@ -174,6 +183,12 @@ A batch of seed nodes follows three main steps before it is made available for t
 #. **Data conversion:** Based on the sampler output and the acquired node (or edge) features, a :pyg:`PyG` :class:`~torch_geometric.data.Data` or :class:`~torch_geometric.data.HeteroData` object is created.
    This object forms a batch used in subsequent computational operations of the model.
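Once such a :class:`~torch_geometric.data.Data` or :class:`~torch_geometric.data.HeteroData` batch has been formed, it can be consumed like any local :pyg:`PyG` mini-batch. Below is a minimal sketch for a node-level task, assuming a distributed neighbor loader :obj:`loader`, a DDP-wrapped :obj:`model`, and an :obj:`optimizer` already exist, and assuming the same seed-nodes-first convention as the local :class:`~torch_geometric.loader.NeighborLoader`.

.. code-block:: python

    import torch.nn.functional as F

    for batch in loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)
        # Seed nodes are placed first in the mini-batch, so only the first
        # `batch.batch_size` rows enter the loss:
        loss = F.cross_entropy(out[:batch.batch_size],
                               batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()

DDP then averages the resulting gradients across the partitions during the backward pass.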