[FSDP][6/N] Check valid param freezing for ModuleWrapPolicy (pytorch#104427)

Andrew Gu · pytorchmergebot · commit a8c52863ddef · 2023-08-02T21:44:44.000Z
This PR adds improved error/warning messaging when auto wrapping with `ModuleWrapPolicy` in the presence of frozen parameters.
- For `use_orig_params=False`, FSDP requires uniform `requires_grad` for each FSDP instance. This PR adds a `ValueError` at wrapping time with a message that mentions the violating module and the frozen/non-frozen parameter names.
- For `use_orig_params=True`, FSDP allows non-uniform `requires_grad` for each FSDP instance. However, it will result in higher-than-expected gradient memory usage. This PR adds a `UserWarning` at wrapping time with a message that mentions the violating module, how much extra gradient memory will be used (in units of numel), and the frozen/non-frozen parameter names.
- There is a possibility that this warning will be spammy/verbose, but my current thinking is that it is okay for now unless users complain.

<details>
<summary> Why DFS via named_children() vs. using named_modules()</summary>

```
LoraModel(
  (embed_tokens): Embedding(100, 32)
  (layers): ModuleList(
    (0-3): 4 x LoraDecoder(
      (attn): LoraAttention(
        (q_proj): Linear(in_features=32, out_features=32, bias=False)
        (lora_A): Linear(in_features=32, out_features=8, bias=False)
        (lora_B): Linear(in_features=8, out_features=32, bias=False)
        (k_proj): Linear(in_features=32, out_features=32, bias=False)
        (v_proj): Linear(in_features=32, out_features=32, bias=False)
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): LoraMLP(
        (proj1): Linear(in_features=32, out_features=128, bias=False)
        (proj2): Linear(in_features=128, out_features=32, bias=False)
      )
      (inp_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (post_attn_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
  )
  (norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
```

Reverse topological order with stack-based DFS via `named_children()`:
```
[
  'embed_tokens',
  'layers.0.attn.q_proj', 'layers.0.attn.lora_A', 'layers.0.attn.lora_B',
  'layers.0.attn.k_proj', 'layers.0.attn.v_proj', 'layers.0.attn.o_proj', 'layers.0.attn',
  'layers.0.mlp.proj1', 'layers.0.mlp.proj2', 'layers.0.mlp',
  'layers.0.inp_layernorm', 'layers.0.post_attn_layernorm', 'layers.0',
  'layers.1.attn.q_proj', 'layers.1.attn.lora_A', 'layers.1.attn.lora_B',
  'layers.1.attn.k_proj', 'layers.1.attn.v_proj', 'layers.1.attn.o_proj', 'layers.1.attn',
  'layers.1.mlp.proj1', 'layers.1.mlp.proj2', 'layers.1.mlp',
  'layers.1.inp_layernorm', 'layers.1.post_attn_layernorm', 'layers.1',
  'layers.2.attn.q_proj', 'layers.2.attn.lora_A', 'layers.2.attn.lora_B',
  'layers.2.attn.k_proj', 'layers.2.attn.v_proj', 'layers.2.attn.o_proj', 'layers.2.attn',
  'layers.2.mlp.proj1', 'layers.2.mlp.proj2', 'layers.2.mlp',
  'layers.2.inp_layernorm', 'layers.2.post_attn_layernorm', 'layers.2',
  'layers.3.attn.q_proj', 'layers.3.attn.lora_A', 'layers.3.attn.lora_B',
  'layers.3.attn.k_proj', 'layers.3.attn.v_proj', 'layers.3.attn.o_proj', 'layers.3.attn',
  'layers.3.mlp.proj1', 'layers.3.mlp.proj2', 'layers.3.mlp',
  'layers.3.inp_layernorm', 'layers.3.post_attn_layernorm', 'layers.3',
  'layers', 'norm', ''
]
```

Reverse topological order with `named_modules()`:
```
[
  'norm',
  'layers.3.post_attn_layernorm', 'layers.3.inp_layernorm',
  'layers.3.mlp.proj2', 'layers.3.mlp.proj1', 'layers.3.mlp',
  'layers.3.attn.o_proj', 'layers.3.attn.v_proj', 'layers.3.attn.k_proj',
  'layers.3.attn.lora_B', 'layers.3.attn.lora_A', 'layers.3.attn.q_proj', 'layers.3.attn',
  'layers.3',
  'layers.2.post_attn_layernorm', 'layers.2.inp_layernorm',
  'layers.2.mlp.proj2', 'layers.2.mlp.proj1', 'layers.2.mlp',
  'layers.2.attn.o_proj', 'layers.2.attn.v_proj', 'layers.2.attn.k_proj',
  'layers.2.attn.lora_B', 'layers.2.attn.lora_A', 'layers.2.attn.q_proj', 'layers.2.attn',
  'layers.2',
  'layers.1.post_attn_layernorm', 'layers.1.inp_layernorm',
  'layers.1.mlp.proj2', 'layers.1.mlp.proj1', 'layers.1.mlp',
  'layers.1.attn.o_proj', 'layers.1.attn.v_proj', 'layers.1.attn.k_proj',
  'layers.1.attn.lora_B', 'layers.1.attn.lora_A', 'layers.1.attn.q_proj', 'layers.1.attn',
  'layers.1',
  'layers.0.post_attn_layernorm', 'layers.0.inp_layernorm',
  'layers.0.mlp.proj2', 'layers.0.mlp.proj1', 'layers.0.mlp',
  'layers.0.attn.o_proj', 'layers.0.attn.v_proj', 'layers.0.attn.k_proj',
  'layers.0.attn.lora_B', 'layers.0.attn.lora_A', 'layers.0.attn.q_proj', 'layers.0.attn',
  'layers.0',
  'layers', 'embed_tokens', ''
]
```

With the stack-based DFS via `named_children()`, reversing the topological order gives us each level in the module tree in the registered order, whereas with `named_modules()`, reversing the topological order gives us each level in reverse. Both are valid orders, but we prefer the former since it allows us to error/warn on the _first-registered_ module that violates the frozen/non-frozen condition.

</details>

Pull Request resolved: pytorch#104427
Approved by: https://github.com/ezyang
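
To illustrate the ordering argument, here is a minimal sketch on the small `M(S1, S2(SS1, SS2), S3)` tree used in the `_get_post_order_named_modules()` docstring. `Node` is a hypothetical stand-in for `nn.Module` (not part of PyTorch), and its `named_modules()` only mimics the pre-order behavior without the deduplication the real method performs. `post_order_names()` follows the same stack-based idea as this PR but is a simplified illustration, not the actual helper.

```python
from typing import Dict, Iterator, List, Tuple


class Node:
    """Hypothetical stand-in for nn.Module; children keep registration order."""

    def __init__(self, **children: "Node") -> None:
        self.children: Dict[str, "Node"] = dict(children)

    def named_children(self) -> Iterator[Tuple[str, "Node"]]:
        yield from self.children.items()

    def named_modules(self, prefix: str = "") -> Iterator[Tuple[str, "Node"]]:
        # Pre-order traversal: parent first, then children in registration order
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = name if prefix == "" else prefix + "." + name
            yield from child.named_modules(child_prefix)


def post_order_names(root: Node) -> List[str]:
    """Reverse of a stack-based DFS over named_children()."""
    stack = [("", root)]
    visited = {root}
    names: List[str] = []
    while stack:
        name, node = stack.pop()
        names.append(name)
        for child_name, child in node.named_children():
            if child not in visited:
                visited.add(child)
                full_name = child_name if name == "" else name + "." + child_name
                stack.append((full_name, child))
    names.reverse()
    return names


root = Node(S1=Node(), S2=Node(SS1=Node(), SS2=Node()), S3=Node())
dfs_order = post_order_names(root)
reversed_named_modules = [name for name, _ in reversed(list(root.named_modules()))]
print(dfs_order)               # siblings appear in registration order
print(reversed_named_modules)  # siblings appear in reverse registration order
```

Both lists place every child before its parent, so either works for validation; only the sibling order differs, which determines which violating module gets reported first.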
diff --git a/test/distributed/fsdp/test_wrap.py b/test/distributed/fsdp/test_wrap.py
@@ -10,6 +10,7 @@
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+from torch.distributed.fsdp._wrap_utils import _validate_frozen_params
 from torch.distributed.fsdp.fully_sharded_data_parallel import (
     BackwardPrefetch,
     CPUOffload,
@@ -57,6 +58,56 @@ def __init__(self):
         self.sync_bn = nn.SyncBatchNorm(10)
 
 
+class LoraModel(nn.Module):
+    """This is a toy LoRA decoder model."""
+
+    def __init__(self):
+        super().__init__()
+        self.embed_tokens = nn.Embedding(100, 32)
+        self.layers = nn.ModuleList([LoraDecoder() for _ in range(4)])
+        self.norm = nn.LayerNorm(32)
+        self.embed_tokens.weight.requires_grad_(False)
+        self.norm.weight.requires_grad_(False)
+        self.norm.bias.requires_grad_(False)
+
+
+class LoraDecoder(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.attn = LoraAttention()
+        self.mlp = LoraMLP()
+        self.inp_layernorm = nn.LayerNorm(32)
+        self.post_attn_layernorm = nn.LayerNorm(32)
+        self.inp_layernorm.weight.requires_grad_(False)
+        self.inp_layernorm.bias.requires_grad_(False)
+        self.post_attn_layernorm.weight.requires_grad_(False)
+        self.post_attn_layernorm.bias.requires_grad_(False)
+
+
+class LoraAttention(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.q_proj = nn.Linear(32, 32, bias=False)
+        self.lora_A = nn.Linear(32, 8, bias=False)
+        self.lora_B = nn.Linear(8, 32, bias=False)
+        self.k_proj = nn.Linear(32, 32, bias=False)
+        self.v_proj = nn.Linear(32, 32, bias=False)
+        self.o_proj = nn.Linear(32, 32, bias=False)
+        self.q_proj.weight.requires_grad_(False)
+        self.k_proj.weight.requires_grad_(False)
+        self.v_proj.weight.requires_grad_(False)
+        self.o_proj.weight.requires_grad_(False)
+
+
+class LoraMLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.proj1 = nn.Linear(32, 128, bias=False)
+        self.proj2 = nn.Linear(128, 32, bias=False)
+        self.proj1.weight.requires_grad_(False)
+        self.proj2.weight.requires_grad_(False)
+
+
 class WrapMethod(Enum):
     FSDP_CTOR = auto()
     # FSDP_CTOR is the supported way forward, but keep WRAP_API in case we miss
@@ -650,6 +701,116 @@ def test_auto_wrap_with_ignored_modules(self, wrap_method: WrapMethod):
         self.assertTrue(isinstance(model.module[2][0], nn.Linear))
         self.assertTrue(isinstance(model.module[2][1], nn.Linear))
 
+    @unittest.skipIf(torch.cuda.device_count() < 2, "Requires at least 2 GPUs")
+    def test_frozen_params(self):
+        """
+        Tests that mixing frozen/non-frozen parameters in an FSDP instance
+        raises for ``use_orig_params=False`` and warns for ``True``.
+        """
+        for use_orig_params in [True, False]:
+            self._test_frozen_params(use_orig_params)
+
+    def _test_frozen_params(self, use_orig_params: bool):
+        model = LoraModel().cuda()
+        policy = ModuleWrapPolicy({LoraAttention, LoraMLP, LoraDecoder})
+        msg = "layers.0.attn has both parameters with requires_grad=True and False. "
+        if use_orig_params:
+            msg += "We do not recommend wrapping such modules"
+            ctx = self.assertWarnsRegex(UserWarning, msg)
+        else:
+            msg += "FSDP does not support wrapping such modules when use_orig_params=False."
+            ctx = self.assertRaisesRegex(ValueError, msg)
+        with ctx:
+            FSDP(
+                model,
+                process_group=self.process_group,
+                auto_wrap_policy=policy,
+                use_orig_params=use_orig_params,
+            )
+
+
+class TestWrapUtils(TestCase):
+    def test_validate_frozen_params(self):
+        """Tests the method ``_validate_frozen_params()``."""
+        for use_orig_params in [True, False]:
+            self._test_validate_frozen_params(use_orig_params)
+
+    def _test_validate_frozen_params(self, use_orig_params: bool):
+        model = LoraModel()
+        # Wrap only LoRA modules
+        modules_to_wrap = {
+            module
+            for module_name, module in model.named_modules()
+            if "lora_A" in module_name or "lora_B" in module_name
+        }
+        _validate_frozen_params(model, modules_to_wrap, set(), use_orig_params)
+        # Additionally wrap attention
+        for module in model.modules():
+            if isinstance(module, LoraAttention):
+                modules_to_wrap.add(module)
+        _validate_frozen_params(model, modules_to_wrap, set(), use_orig_params)
+        # Additionally wrap decoders
+        for module in model.modules():
+            if isinstance(module, LoraDecoder):
+                modules_to_wrap.add(module)
+        _validate_frozen_params(model, modules_to_wrap, set(), use_orig_params)
+        # Do not wrap the LoRA-A modules (meaning mixed frozen/non-frozen)
+        for module_name, module in model.named_modules():
+            if "lora_A" in module_name:
+                modules_to_wrap.remove(module)
+        regex = "layers.0.attn has both parameters with requires_grad=True and False."
+        if use_orig_params:
+            # Wrapping the attention manages all parameters except those from
+            # the LoRA-B module, which is separately wrapped and all nonfrozen
+            lorab_numel = sum(
+                p.numel() for p in model.layers[0].attn.lora_B.parameters()
+            )
+            attn_frozen_param_numel = sum(
+                p.numel()
+                for p in model.layers[0].attn.parameters()
+                if not p.requires_grad
+            )
+            attn_nonfrozen_param_numel = (
+                sum(
+                    p.numel()
+                    for p in model.layers[0].attn.parameters()
+                    if p.requires_grad
+                )
+                - lorab_numel
+            )
+            attn_total_param_numel = (
+                attn_frozen_param_numel + attn_nonfrozen_param_numel
+            )
+            regex += (
+                " We do not recommend wrapping such modules since the "
+                r"gradient memory usage will be higher than expected \("
+                f"{attn_total_param_numel} numel instead of {attn_nonfrozen_param_numel} numel "
+                r"before sharding via reduce-scatter\). "
+            )
+        else:
+            regex += " FSDP does not support wrapping such modules when use_orig_params=False. "
+        regex += "If possible, wrap the frozen parameters with FSDP separately.\n"
+        regex += (
+            "The following parameters have requires_grad=True:\n"
+            r"\['layers.0.attn.lora_A.weight'\]\n"
+            "The following parameters have requires_grad=False:\n"
+            r"\['layers.0.attn.q_proj.weight', 'layers.0.attn.k_proj.weight', "
+            r"'layers.0.attn.v_proj.weight', 'layers.0.attn.o_proj.weight'\]"
+        )
+        if use_orig_params:
+            ctx = self.assertWarnsRegex(UserWarning, regex)
+        else:
+            ctx = self.assertRaisesRegex(ValueError, regex)
+        with ctx:
+            _validate_frozen_params(model, modules_to_wrap, set(), use_orig_params)
+        # Now ignore those LoRA-A modules' parameters
+        ignored_params = set()
+        for module_name, module in model.named_modules():
+            if "lora_A" in module_name:
+                for param in module.parameters():
+                    ignored_params.add(param)
+        _validate_frozen_params(model, modules_to_wrap, ignored_params, use_orig_params)
+
 
 instantiate_parametrized_tests(TestFSDPWrap)
 instantiate_parametrized_tests(TestAutoWrap)
diff --git a/torch/distributed/fsdp/_wrap_utils.py b/torch/distributed/fsdp/_wrap_utils.py
@@ -1,8 +1,9 @@
+import collections
 import functools
 import inspect
 import warnings
 from functools import partial
-from typing import Any, Callable, Dict, Set, Type, Union
+from typing import Any, Callable, Dict, List, Set, Tuple, Type, Union
 
 import torch.nn as nn
 from torch.distributed.fsdp._common_utils import (
@@ -64,6 +65,13 @@ def _auto_wrap(
                 root_module, mixed_precision._module_classes_to_ignore
             )
             _warn_on_overridden_mixed_precision(overridden_module_classes)
+        use_orig_params = fsdp_kwargs.get("use_orig_params", False)
+        _validate_frozen_params(
+            root_module,
+            set(target_module_to_kwargs.keys()),
+            ignored_params,
+            use_orig_params,
+        )
         wrap_fn = _construct_wrap_fn(root_module, target_module_to_kwargs, fsdp_fn)
         _post_order_apply(root_module, wrap_fn)
         return
@@ -121,3 +129,142 @@ def _warn_on_overridden_mixed_precision(
         "These modules will be wrapped as separate FSDP instacnes with mixed "
         "precision disabled."
     )
+
+
+def _validate_frozen_params(
+    root_module: nn.Module,
+    modules_to_wrap: Set[nn.Module],
+    ignored_params: Set[nn.Parameter],
+    use_orig_params: bool,
+):
+    """
+    This checks that, given ``modules_to_wrap``, each module would manage
+    parameters that are uniformly frozen or non-frozen. This uniformity
+    requirement is strict for ``use_orig_params=False`` (hard error) and highly
+    recommended for ``use_orig_params=True`` (user warning).
+    """
+    post_order_named_modules = _get_post_order_named_modules(root_module)
+    visited_modules: Set[nn.Module] = set()
+    for module_name, module in post_order_named_modules:
+        if module in modules_to_wrap:
+            param_to_fqn = _get_managed_param_to_fqn(
+                module, ignored_params, visited_modules, module_name
+            )
+            frozen_param_fqns: List[str] = []
+            frozen_param_numel = 0
+            nonfrozen_param_fqns: List[str] = []
+            nonfrozen_param_numel = 0
+            for param, fqn in param_to_fqn.items():
+                if param.requires_grad:
+                    nonfrozen_param_fqns.append(fqn)
+                    nonfrozen_param_numel += param.numel()
+                else:
+                    frozen_param_fqns.append(fqn)
+                    frozen_param_numel += param.numel()
+            if len(frozen_param_fqns) > 0 and len(nonfrozen_param_fqns) > 0:
+                msg = f"{module_name} has both parameters with requires_grad=True and False."
+                if use_orig_params:
+                    total_param_numel = frozen_param_numel + nonfrozen_param_numel
+                    msg += (
+                        " We do not recommend wrapping such modules since "
+                        "the gradient memory usage will be higher than expected "
+                        f"({total_param_numel} numel instead of {nonfrozen_param_numel} numel "
+                        "before sharding via reduce-scatter). "
+                    )
+                else:
+                    msg += " FSDP does not support wrapping such modules when use_orig_params=False. "
+                msg += "If possible, wrap the frozen parameters with FSDP separately.\n"
+                msg += (
+                    f"The following parameters have requires_grad=True:\n{nonfrozen_param_fqns}\n"
+                    f"The following parameters have requires_grad=False:\n{frozen_param_fqns}"
+                )
+                if use_orig_params:
+                    warnings.warn(msg)
+                else:
+                    raise ValueError(msg)
+
+
+def _get_post_order_named_modules(
+    root_module: nn.Module,
+) -> List[Tuple[str, nn.Module]]:
+    """
+    This returns the named modules following a post-order traversal, which is a
+    valid reverse topological sort. We achieve this using the reverse of a
+    stack-based DFS order instead of reversing ``root_module.named_modules()``
+    since the former gives the modules in registration order at each level in
+    the module tree (as opposed to the reverse), which allows us to error/warn
+    on the first registered module that violates the condition.
+
+    For example, consider the following module structure:
+        M(
+          S1(),
+          S2(
+            SS1(),
+            SS2(),
+          ),
+          S3(),
+        )
+    The reverse DFS order is [S1, SS1, SS2, S2, S3, M], while the reverse
+    ``named_modules()`` order is [S3, SS2, SS1, S2, S1, M].
+    """
+    visited_modules = {root_module}
+    stack = [("", root_module)]
+    # Append and reverse at the end for linear-time algorithm
+    reverse_post_order_named_modules: List[Tuple[str, nn.Module]] = []
+    while stack:
+        module_name, module = stack.pop()
+        reverse_post_order_named_modules.append((module_name, module))
+        for child_module_name, child_module in module.named_children():
+            if child_module is None:  # only for overrides of `named_children()`
+                continue
+            if child_module not in visited_modules:
+                visited_modules.add(child_module)
+                if module_name != "":
+                    child_module_name = module_name + "." + child_module_name
+                stack.append((child_module_name, child_module))
+    post_order_named_modules = list(reversed(reverse_post_order_named_modules))
+    return post_order_named_modules
+
+
+def _get_managed_param_to_fqn(
+    module_to_wrap: nn.Module,
+    ignored_params: Set[nn.Parameter],
+    visited_modules: Set[nn.Module],
+    root_prefix: str,
+) -> Dict[nn.Parameter, str]:
+    """
+    This returns a dict that maps managed parameter to its FQN for the given
+    ``module_to_wrap``. The dict's keys are exactly the parameters that would
+    be managed by the module, where this is achieved by calling this function
+    on the modules to wrap in reverse topological order, destructively updating
+    ``visited_modules``, and not traversing into those modules. The FQNs are
+    prefixed from the root (via ``root_prefix``) to be more informative.
+
+    NOTE: This function is meant to be called pre-wrapping and iteratively in
+    reverse topological order to cover the full module tree. This differs from
+    the ``_get_param_to_fqn()`` function meant to be called post-wrapping and
+    on the full module tree in one shot. Given those differences, we do not try
+    to unify the two.
+    """
+    param_to_fqn: Dict[nn.Parameter, str] = {}
+    # Run BFS (or any tree traversal works)
+    queue = collections.deque([(module_to_wrap, root_prefix)])
+    visited_modules.add(module_to_wrap)
+    while queue:
+        module, prefix = queue.popleft()
+        for param_name, param in module.named_parameters(recurse=False):
+            if param not in ignored_params:
+                fqn = param_name if prefix == "" else prefix + "." + param_name
+                param_to_fqn[param] = fqn
+        for child_module_name, child_module in module.named_children():
+            if child_module is None:  # only for overrides of `named_children()`
+                continue
+            if child_module not in visited_modules:
+                visited_modules.add(child_module)
+                child_prefix = (
+                    child_module_name
+                    if prefix == ""
+                    else prefix + "." + child_module_name
+                )
+                queue.append((child_module, child_prefix))
+    return param_to_fqn
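
Note: the numel figures in the `use_orig_params=True` warning can be worked out by hand for `layers.0.attn` in the `TestWrapUtils` scenario above, where `lora_B` is wrapped separately and `lora_A` is not. The sketch below is a torch-free illustration under that assumption: `ParamInfo` and `mixed_grad_summary()` are hypothetical names, and the numel values are derived from the `nn.Linear` shapes in `LoraAttention` (32 x 32 = 1024 for each frozen projection, 8 x 32 = 256 for `lora_A`).

```python
from typing import List, NamedTuple, Optional, Tuple


class ParamInfo(NamedTuple):
    """Hypothetical record describing one managed parameter."""
    fqn: str
    numel: int
    requires_grad: bool


def mixed_grad_summary(params: List[ParamInfo]) -> Optional[Tuple[int, int]]:
    """Return (total numel, non-frozen numel) if frozen and non-frozen
    parameters are mixed, else None (uniform, so nothing to report)."""
    frozen = [p for p in params if not p.requires_grad]
    nonfrozen = [p for p in params if p.requires_grad]
    if not frozen or not nonfrozen:
        return None
    total_numel = sum(p.numel for p in params)
    nonfrozen_numel = sum(p.numel for p in nonfrozen)
    return total_numel, nonfrozen_numel


# Parameters that layers.0.attn would manage when lora_B is wrapped separately
attn_params = [
    ParamInfo("layers.0.attn.q_proj.weight", 32 * 32, False),
    ParamInfo("layers.0.attn.lora_A.weight", 8 * 32, True),
    ParamInfo("layers.0.attn.k_proj.weight", 32 * 32, False),
    ParamInfo("layers.0.attn.v_proj.weight", 32 * 32, False),
    ParamInfo("layers.0.attn.o_proj.weight", 32 * 32, False),
]
summary = mixed_grad_summary(attn_params)
print(summary)  # (4352, 256)

# A separately wrapped lora_B manages only non-frozen parameters
lora_b_summary = mixed_grad_summary(
    [ParamInfo("layers.0.attn.lora_B.weight", 32 * 8, True)]
)
```

Under these assumptions the warning would report 4352 numel instead of 256 numel, since gradient memory is allocated for the whole flat parameter even though only `lora_A` needs gradients.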