🚀 lightning-ai/lightning - Release Notes

Lightning v2.5.1 (2025-03-19)

# Changes



## PyTorch Lightning

### Changed

- Allow LightningCLI to use a customized argument parser class ([#20596](https://github.com/Lightning-AI/pytorch-lightning/pull/20596))
- Change `wandb` default x-axis to `tensorboard`'s `global_step` when `sync_tensorboard=True` ([#20611](https://github.com/Lightning-AI/pytorch-lightning/pull/20611))
- Added a new `checkpoint_path_prefix` parameter to the MLflow logger, which controls the path where the MLflow artifacts for the model checkpoints are stored ([#20538](https://github.com/Lightning-AI/pytorch-lightning/pull/20538)); a short usage sketch follows the lists below
- Updated the CometML logger to support the recent Comet SDK ([#20275](https://github.com/Lightning-AI/pytorch-lightning/pull/20275))
- bump: testing with latest `torch` 2.6 ([#20509](https://github.com/Lightning-AI/pytorch-lightning/pull/20509))

### Fixed

- Fixed CSVLogger logging hyperparameters at every write, which increased latency ([#20594](https://github.com/Lightning-AI/pytorch-lightning/pull/20594))
- Fixed OverflowError when resuming from checkpoint with an iterable dataset ([#20565](https://github.com/Lightning-AI/pytorch-lightning/issues/20565))
- Fixed swapped `_R_co` and `_P` to prevent type error ([#20508](https://github.com/Lightning-AI/pytorch-lightning/issues/20508))
- Always call `WandbLogger.experiment` first in `_call_setup_hook` to ensure `tensorboard` logs can sync to `wandb` ([#20610](https://github.com/Lightning-AI/pytorch-lightning/pull/20610))
- Fixed TBPTT example ([#20528](https://github.com/Lightning-AI/pytorch-lightning/pull/20528))
- Fixed test compatibility as AdamW became a subclass of Adam ([#20574](https://github.com/Lightning-AI/pytorch-lightning/pull/20574))
- Fixed file extension of model checkpoints uploaded by NeptuneLogger ([#20581](https://github.com/Lightning-AI/pytorch-lightning/pull/20581))
- Reset trainer variable `should_stop` when `fit` is called ([#19177](https://github.com/Lightning-AI/pytorch-lightning/pull/19177))
- Made `WandbLogger` upload models from all `ModelCheckpoint` callbacks, not just one of them ([#20191](https://github.com/Lightning-AI/pytorch-lightning/pull/20191))
- Fixed an error when logging to a deleted MLflow experiment ([#20556](https://github.com/Lightning-AI/pytorch-lightning/pull/20556))
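For reference, here is a minimal sketch of the new MLflow `checkpoint_path_prefix` option. The parameter name comes from the changelog entry above; the experiment name and prefix are placeholders, and `log_model=True` is assumed to be what makes the logger upload checkpoints as artifacts.

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel
from lightning.pytorch.loggers import MLFlowLogger

logger = MLFlowLogger(
    experiment_name="my-experiment",  # placeholder
    log_model=True,                   # upload checkpoints as MLflow artifacts
    # Store checkpoint artifacts under this prefix inside the run's artifact root
    checkpoint_path_prefix="my/checkpoints",
)

trainer = L.Trainer(logger=logger, max_epochs=1, log_every_n_steps=1)
trainer.fit(BoringModel())
```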
## Lightning Fabric
### Changed

- Added logging support for a list of dicts without collapsing to a single key ([#19957](https://github.com/Lightning-AI/pytorch-lightning/issues/19957))
- bump: testing with latest `torch` 2.6 ([#20509](https://github.com/Lightning-AI/pytorch-lightning/pull/20509))

### Removed

- Removed legacy support for `lightning run model`; use `fabric run` instead ([#20588](https://github.com/Lightning-AI/pytorch-lightning/pull/20588))

**Full commit list**: [2.5.0 -> 2.5.1](https://github.com/Lightning-AI/pytorch-lightning/compare/2.5.0...2.5.1)

# Contributors

We thank **all folks** who submitted issues, features, fixes and doc changes. It's the only way we can **collectively** make Lightning :zap: better for everyone, nice job!

In particular, we would like to thank the authors of the pull requests above, in no particular order:

@benglewis, @Borda, @cgebbe, @duydl, @haifeng-jin, @japdubengsub, @justusschock, @lantiga, @mauvilsa, @millskyle, @ringohoffman, @ryan597, @senarvi, @TresYap

Thank you :heart: and we hope you'll keep them coming!

Lightning v2.5 post0 (2024-12-21)

**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.5.0...2.5.0.post0

Lightning v2.5 (2024-12-20)

[Lightning AI](https://lightning.ai) :zap: is excited to announce the release of Lightning 2.5. 

Lightning 2.5 comes with improvements on several fronts, with **zero** API changes. Our users love it stable, we keep it stable :smile:.

Talking about love :heart:, the `lightning`, `pytorch-lightning` and `lightning-fabric` packages are collectively getting more than **10M downloads per month** :open_mouth:, for a total of over **180M downloads** :exploding_head: since the early days. It's incredible to see PyTorch Lightning getting such strong adoption across the industry and the sciences.

Release 2.5 embraces PyTorch 2.5, and it marks some of its more recent directions as officially supported, namely tensor subclass-based APIs like [Distributed Tensors](https://pytorch.org/docs/stable/distributed.tensor.html) and [TorchAO](https://pytorch.org/blog/pytorch-native-architecture-optimization/), in combination with `torch.compile`.

Here are a couple of examples:

**Distributed FP8 transformer with PyTorch Lightning**

Full example [here](https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples/pytorch/fp8_distributed_transformer)

```python
import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
from lightning.pytorch.demos import Transformer, WikiText2
from lightning.pytorch.strategies import ModelParallelStrategy
from torch.distributed._composable.fsdp.fully_shard import fully_shard
from torch.utils.data import DataLoader
from torchao.float8 import Float8LinearConfig, convert_to_float8_training


class LanguageModel(L.LightningModule):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.model = None

    def configure_model(self):
        if self.model is not None:
            return

        with torch.device("meta"):
            model = Transformer(
                vocab_size=self.vocab_size,
                nlayers=16,
                nhid=4096,
                ninp=1024,
                nhead=32,
            )

        float8_config = Float8LinearConfig(
            # pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly  # noqa
            pad_inner_dim=True,
        )

        def module_filter_fn(mod: torch.nn.Module, fqn: str):
            # we skip the decoder because its vocabulary size is typically
            # not divisible by 16 as required by float8
            return fqn != "decoder"

        convert_to_float8_training(model, config=float8_config, module_filter_fn=module_filter_fn)

        for module in model.modules():
            if isinstance(module, (nn.TransformerEncoderLayer, nn.TransformerDecoderLayer)):
                fully_shard(module, mesh=self.device_mesh)

        fully_shard(model, mesh=self.device_mesh)

        self.model = torch.compile(model)

    def training_step(self, batch):
        input, target = batch
        output = self.model(input, target)
        loss = F.nll_loss(output, target.view(-1))
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)


def train():
    L.seed_everything(42)

    dataset = WikiText2()
    train_dataloader = DataLoader(dataset, num_workers=8, batch_size=1)

    model = LanguageModel(vocab_size=dataset.vocab_size)

    mp_strategy = ModelParallelStrategy(
        data_parallel_size=4,
        tensor_parallel_size=1,
    )

    trainer = L.Trainer(strategy=mp_strategy, max_steps=100, precision="bf16-true", accumulate_grad_batches=8)

    trainer.fit(model, train_dataloader)

    trainer.print(torch.cuda.memory_summary())


if __name__ == "__main__":
    torch.set_float32_matmul_precision("high")
    train()
```
**Distributed FP8 transformer with Fabric**

Full example [here](https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples/fabric/fp8_distributed_transformer)

```python
import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
from lightning.fabric.strategies import ModelParallelStrategy
from lightning.pytorch.demos import Transformer, WikiText2
from torch.distributed._composable.fsdp.fully_shard import fully_shard
from torch.distributed.device_mesh import DeviceMesh
from torch.utils.data import DataLoader
from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from tqdm import tqdm


def configure_model(model: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
    float8_config = Float8LinearConfig(
        # pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly  # noqa
        pad_inner_dim=True,
    )

    def module_filter_fn(mod: torch.nn.Module, fqn: str):
        # we skip the decoder because its vocabulary size is typically
        # not divisible by 16 as required by float8
        return fqn != "decoder"

    convert_to_float8_training(model, config=float8_config, module_filter_fn=module_filter_fn)

    for module in model.modules():
        if isinstance(module, (torch.nn.TransformerEncoderLayer, torch.nn.TransformerDecoderLayer)):
            fully_shard(module, mesh=device_mesh)

    fully_shard(model, mesh=device_mesh)

    return torch.compile(model)


def train():
    L.seed_everything(42)

    batch_size = 8
    micro_batch_size = 1
    max_steps = 100

    dataset = WikiText2()
    dataloader = DataLoader(dataset, num_workers=8, batch_size=micro_batch_size)

    with torch.device("meta"):
        model = Transformer(
            vocab_size=dataset.vocab_size,
            nlayers=16,
            nhid=4096,
            ninp=1024,
            nhead=32,
        )

    strategy = ModelParallelStrategy(data_parallel_size=4, tensor_parallel_size=1, parallelize_fn=configure_model)

    fabric = L.Fabric(precision="bf16-true", strategy=strategy)
    fabric.launch()

    model = fabric.setup(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    optimizer = fabric.setup_optimizers(optimizer)
    dataloader = fabric.setup_dataloaders(dataloader)

    iterable = tqdm(enumerate(dataloader), total=len(dataloader)) if fabric.is_global_zero else enumerate(dataloader)

    steps = 0

    for i, batch in iterable:
        input, target = batch

        is_accumulating = i % (batch_size // micro_batch_size) != 0

        with fabric.no_backward_sync(model, enabled=is_accumulating):
            output = model(input, target)
            loss = F.nll_loss(output, target.view(-1))
            fabric.backward(loss)

        if not is_accumulating:
            fabric.clip_gradients(model, optimizer, max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()
            steps += 1

        if fabric.is_global_zero:
            iterable.set_postfix_str(f"train_loss={loss.item():.2f}")

        if steps == max_steps:
            break

    fabric.print(torch.cuda.memory_summary())


if __name__ == "__main__":
    torch.set_float32_matmul_precision("high")
    train()
```
As these examples show, it's now easier than ever to take your PyTorch Lightning module and run it with **FSDP2 and/or tensor parallelism in FP8 precision**, using the `ModelParallelStrategy` we introduced in 2.4.

Also note the use of distributed tensor APIs, TorchAO APIs, and `torch.compile` directly in the `configure_model` hook (or in the parallelize function in Fabric's `ModelParallelStrategy`), as opposed to the `LightningModule` as a whole. The advantage of this approach is that you can **copy-paste the parallelize functions** that come with native PyTorch models directly into `configure_model` and get the same effect, no head-scratching involved :nerd_face:.

Talking about head-scratching, we also made a pass at the PyTorch Lightning internals and **hardened** the parts where we keep track of **progress counters** during training, validation, and testing, as well as learning rate scheduling, in relation to **resuming from checkpoints**. To the best of our knowledge, there are now no edge cases where stopping and resuming from a checkpoint can change the sequence of loops or other internal states. **Fault tolerance for the win** :partying_face:!

Alright! Feel free to take a look at the **full changelog** below.

And of course: the best way to use PyTorch Lightning and Fabric is through [Lightning Studio](https://lightning.ai/) :zap:. Access GPUs, train models, deploy and more with **zero setup**. Focus on data and models - not infrastructure.

# Changes

## PyTorch Lightning
### Added

- Added `step` parameter to `TensorBoardLogger.log_hyperparams` to visualize changes during training ([#20176](https://github.com/Lightning-AI/pytorch-lightning/pull/20176)); see the short sketch after this list
- Added `str` method to datamodule ([#20301](https://github.com/Lightning-AI/pytorch-lightning/pull/20301))
- Added timeout to DeepSpeedStrategy ([#20474](https://github.com/Lightning-AI/pytorch-lightning/pull/20474))
- Added doc for Truncated Back-Propagation Through Time ([#20422](https://github.com/Lightning-AI/pytorch-lightning/pull/20422))
- Added FP8 + FSDP2 + torch.compile examples for PyTorch Lightning ([#20440](https://github.com/Lightning-AI/pytorch-lightning/pull/20440))
- Added profiling to `Trainer.save_checkpoint` ([#20405](https://github.com/Lightning-AI/pytorch-lightning/pull/20405))
- Added `after_instantiate_classes` hook to CLI ([#20401](https://github.com/Lightning-AI/pytorch-lightning/pull/20401))
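A quick, hedged sketch of the new `step` parameter; the `metrics` keyword and the hyperparameter values here are illustrative, not prescribed by the changelog entry:

```python
from lightning.pytorch.loggers import TensorBoardLogger

logger = TensorBoardLogger("logs", name="hparams-over-time")

# Log hyperparameters together with a metric at specific global steps,
# so a change (e.g. a tuned learning rate) shows up along the training timeline.
logger.log_hyperparams({"learning_rate": 1e-3}, metrics={"val_loss": 0.42}, step=100)
logger.log_hyperparams({"learning_rate": 5e-4}, metrics={"val_loss": 0.31}, step=200)
```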
### Changed

- Updated checkpointing documentation to mark `resume_from_checkpoint` as deprecated ([#20477](https://github.com/Lightning-AI/pytorch-lightning/pull/20477))
- Made plugin type checks more flexible ([#20186](https://github.com/Lightning-AI/pytorch-lightning/pull/20186))
- Changed seeding NumPy using `np.random.SeedSequence()` in `pl_worker_init_function()` to robustly seed NumPy-dependent dataloader workers ([#20369](https://github.com/Lightning-AI/pytorch-lightning/pull/20369))
- Allowed callbacks to be restored not just during training ([#20403](https://github.com/Lightning-AI/pytorch-lightning/pull/20403))
- Changed LightningCLI tests to account for future fix in jsonargparse ([#20372](https://github.com/Lightning-AI/pytorch-lightning/pull/20372))
- Bumped PyTorch to version `2.5` ([#20351](https://github.com/Lightning-AI/pytorch-lightning/pull/20351))
- Decoupled checkpoint artifact path from model artifact path ([#20325](https://github.com/Lightning-AI/pytorch-lightning/pull/20325))
- Updated BitsAndBytes version ([#20313](https://github.com/Lightning-AI/pytorch-lightning/pull/20313))
- Changed merging of hparams when logging to ignore parameter names that start with an underscore `_` ([#20221](https://github.com/Lightning-AI/pytorch-lightning/pull/20221))
- Re-enabled passing `BytesIO` as path in `.to_onnx()` ([#20172](https://github.com/Lightning-AI/pytorch-lightning/pull/20172))

### Removed

- Removed `List[int]` as input type for Trainer when `accelerator="cpu"` ([#20399](https://github.com/Lightning-AI/pytorch-lightning/pull/20399))

### Fixed

- Fixed UnboundLocalError when using the predict method with `return_predictions=False` ([#20484](https://github.com/Lightning-AI/pytorch-lightning/pull/20484))
- Fixed use of `convert_module` in FSDP to avoid using more memory than necessary during initialization ([#20323](https://github.com/Lightning-AI/pytorch-lightning/pull/20323))
- Fixed TypeError in `configure_optimizers` when running with `ReduceLROnPlateau` ([#20471](https://github.com/Lightning-AI/pytorch-lightning/pull/20471))
- Fixed return type in `configure_optimizers` example ([#20420](https://github.com/Lightning-AI/pytorch-lightning/pull/20420))
- Fixed incorrect URI prefix stripping in MLFlowLogger ([#20365](https://github.com/Lightning-AI/pytorch-lightning/pull/20365))
- Fixed shuffling behavior when using a custom sampler in data module ([#20327](https://github.com/Lightning-AI/pytorch-lightning/pull/20327))
- Ensured restarting from checkpoints leads to consistent internal counters compared to uninterrupted training ([#20379](https://github.com/Lightning-AI/pytorch-lightning/pull/20379))
- Fixed LightningCLI failing when both module and data module save hyperparameters due to conflicting internal `_class_path` parameter ([#20221](https://github.com/Lightning-AI/pytorch-lightning/pull/20221))
## Lightning Fabric
### Added

- Added `step` parameter to `TensorBoardLogger.log_hyperparams` to visualize changes during training ([#20176](https://github.com/Lightning-AI/pytorch-lightning/pull/20176))
- Added timeout to DeepSpeedStrategy ([#20474](https://github.com/Lightning-AI/pytorch-lightning/pull/20474))
- Added FP8 + FSDP2 + torch.compile examples for Fabric ([#20440](https://github.com/Lightning-AI/pytorch-lightning/pull/20440))
- Added RTX 4080 super to chips dictionary ([#20285](https://github.com/Lightning-AI/pytorch-lightning/pull/20285))
- Added device property to lazy load functionality ([#20183](https://github.com/Lightning-AI/pytorch-lightning/pull/20183))
- Added `ddp_find_unused_parameters_true` alias in Fabric's DDPStrategy ([#20125](https://github.com/Lightning-AI/pytorch-lightning/pull/20125))

### Changed

- Changed seeding NumPy using `np.random.SeedSequence()` in `pl_worker_init_function()` to robustly seed NumPy-dependent dataloader workers ([#20369](https://github.com/Lightning-AI/pytorch-lightning/pull/20369))
- Bumped PyTorch to version `2.5` ([#20351](https://github.com/Lightning-AI/pytorch-lightning/pull/20351))
- Update BitsAndBytes version ([#20313](https://github.com/Lightning-AI/pytorch-lightning/pull/20313))

### Removed

- Nothing to see here :smile:

### Fixed

- Fixed use of `convert_module` in FSDP to avoid using more memory than necessary during initialization ([#20323](https://github.com/Lightning-AI/pytorch-lightning/pull/20323))

**Full commit list**: [2.4.0 -> 2.5.0](https://github.com/Lightning-AI/pytorch-lightning/compare/2.4.0...2.5.0)

# Contributors

We thank **all folks** who submitted issues, features, fixes and doc changes. It's the only way we can **collectively** make Lightning :zap: better for everyone, nice job!

In particular, we would like to thank the authors of the pull requests above, in no particular order:

@ringohoffman @MrWhatZitToYaa @jedyang97 @chualanagit @lantiga @AlessandroW @kazuar @t-vi @01AbhiSingh @WangYue0000 @amorehead @EricCousineau-TRI @mauvilsa @Borda @pete-mcelroy @ali-alshaar7 @GdoongMathew @farhadrgh @tshu-w @LukasSalchow @awindmann @dadwadw233 @qingquansong

Thank you :heart: and we hope you'll keep them coming!

Lightning 2.5 RC (2024-12-12)

No notes available

Lightning v2.4 (2024-08-07)

[Lightning AI](https://lightning.ai) :zap: is excited to announce the release of Lightning 2.4. This is mainly a compatibility upgrade for PyTorch 2.4 and Python 3.12, with a sprinkle of a few features and bug fixes.

**Did you know?** The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you [Lightning Studio](https://lightning.ai/). Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.


# Changes


## PyTorch Lightning

### Added

- Made saving non-distributed checkpoints fully atomic ([#20011](https://github.com/Lightning-AI/pytorch-lightning/pull/20011))
- Added `dump_stats` flag to `AdvancedProfiler` ([#19703](https://github.com/Lightning-AI/pytorch-lightning/issues/19703))
- Added a flag `verbose` to the `seed_everything()` function ([#20108](https://github.com/Lightning-AI/pytorch-lightning/pull/20108)); see the short sketch after this list
- Added support for PyTorch 2.4 ([#20010](https://github.com/Lightning-AI/pytorch-lightning/pull/20010))
- Added support for Python 3.12 ([#20078](https://github.com/Lightning-AI/pytorch-lightning/pull/20078))
- The `TQDMProgressBar` now provides an option to retain prior training epoch bars ([#19578](https://github.com/Lightning-AI/pytorch-lightning/pull/19578))
- Added the count of modules in train and eval mode to the printed `ModelSummary` table ([#20159](https://github.com/Lightning-AI/pytorch-lightning/pull/20159))
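A one-line sketch of the new `verbose` flag (the seed value is arbitrary):

```python
import lightning as L

# Seed everything as usual, but silence the "Seed set to ..." message,
# e.g. to keep logs clean in multi-process runs.
L.seed_everything(42, workers=True, verbose=False)
```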
### Changed

- Triggering KeyboardInterrupt (Ctrl+C) during `.fit()`, `.evaluate()`, `.test()` or `.predict()` now terminates all processes launched by the Trainer and exits the program ([#19976](https://github.com/Lightning-AI/pytorch-lightning/pull/19976))
- Changed the implementation of how seeds are chosen for dataloader workers when using `seed_everything(..., workers=True)` ([#20055](https://github.com/Lightning-AI/pytorch-lightning/pull/20055))
- NumPy is no longer a required dependency ([#20090](https://github.com/Lightning-AI/pytorch-lightning/issues/20090))

### Removed

- Removed support for PyTorch 2.1 ([#20009](https://github.com/Lightning-AI/lightning/pull/20009))
- Removed support for Python 3.8 ([#20071](https://github.com/Lightning-AI/lightning/pull/20071))

### Fixed

- Avoid LightningCLI saving hyperparameters with `class_path` and `init_args` since this would be a breaking change ([#20068](https://github.com/Lightning-AI/pytorch-lightning/pull/20068))
- Fixed an issue that would cause too many printouts of the seed info when using `seed_everything()` ([#20108](https://github.com/Lightning-AI/pytorch-lightning/pull/20108))
- Fixed `_LoggerConnector`'s `_ResultMetric` to move all registered keys to the device of the logged value if needed ([#19814](https://github.com/Lightning-AI/pytorch-lightning/issues/19814))
- Fixed `_optimizer_to_device` logic for special 'step' key in optimizer state causing performance regression ([#20019](https://github.com/Lightning-AI/lightning/pull/20019))
- Fixed parameter counts in `ModelSummary` when model has distributed parameters (DTensor) ([#20163](https://github.com/Lightning-AI/pytorch-lightning/pull/20163))
## Lightning Fabric
### Added

- Made saving non-distributed checkpoints fully atomic ([#20011](https://github.com/Lightning-AI/pytorch-lightning/pull/20011))
- Added a flag `verbose` to the `seed_everything()` function ([#20108](https://github.com/Lightning-AI/pytorch-lightning/pull/20108))
- Added support for PyTorch 2.4 ([#20028](https://github.com/Lightning-AI/pytorch-lightning/pull/20028))
- Added support for Python 3.12 ([#20078](https://github.com/Lightning-AI/pytorch-lightning/pull/20078))

### Changed

- Changed the implementation of how seeds are chosen for dataloader workers when using `seed_everything(..., workers=True)` ([#20055](https://github.com/Lightning-AI/pytorch-lightning/pull/20055))
- NumPy is no longer a required dependency ([#20090](https://github.com/Lightning-AI/pytorch-lightning/issues/20090))

### Removed

- Removed support for PyTorch 2.1 ([#20009](https://github.com/Lightning-AI/lightning/pull/20009))
- Removed support for Python 3.8 ([#20071](https://github.com/Lightning-AI/lightning/pull/20071))

### Fixed

- Fixed an attribute error when loading a checkpoint into a quantized model using the `_lazy_load()` function ([#20121](https://github.com/Lightning-AI/lightning/pull/20121))
- Fixed `_optimizer_to_device` logic for special 'step' key in optimizer state causing performance regression ([#20019](https://github.com/Lightning-AI/lightning/pull/20019))

**Full commit list**: [2.3.0 -> 2.4.0](https://github.com/Lightning-AI/pytorch-lightning/compare/2.3.0...2.4.0)

# Contributors

We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.

### New Contributors

* @SamuelLarkin made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19969
* @liambsmith made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19986
* @EtayLivne made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19915
* @elmuz made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19998
* @swyo made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19982
* @corwinjoy made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/20011
* @omahs made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19979
* @linbo0518 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/20040
* @01AbhiSingh made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/20055
* @K-H-Ismail made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/20099
* @adosar made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/20146
* @jojje made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19578

### Did you know?

Chuck Norris can solve NP-hard problems in polynomial time. In fact, any problem is easy when Chuck Norris solves it.

Patch release v2.3.3 (2024-07-08)

This release removes the code from the main `lightning` package that was reported in [CVE-2024-5980](https://github.com/advisories/GHSA-mr7h-w2qc-ffc2).

Patch release v2.3.2 (2024-07-04)

Includes a minor bugfix that avoids a conflict between the entrypoint command and another package ([#20041](https://github.com/Lightning-AI/pytorch-lightning/pull/20041)).


Patch release v2.3.1 (2024-06-27)

Includes minor bugfixes and stability improvements.


**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.3.0...2.3.1

Lightning v2.3: Tensor Parallelism and 2D Parallelism (2024-06-13)

[Lightning AI](https://lightning.ai) is excited to announce the release of Lightning 2.3 :zap:

**Did you know?** The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you [Lightning Studio](https://lightning.ai/). Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.

This release introduces experimental support for Tensor Parallelism and 2D Parallelism, [PyTorch 2.3](https://pytorch.org/blog/pytorch2-3/) support, and several bugfixes and stability improvements.


- [Highlights](#highlights)
    - [Tensor Parallelism (beta)](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#highlights-tensor-parallel)
    - [2D Parallelism (beta)](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#highlights-2d-parallel)
    - [Training Mode in Model Summary](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#highlights-model-summary)
    - [Special Forward Methods in Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#highlights-forward-methods)
- [Notable Changes](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#bc-changes)
- [Full Changelog](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#changelog)
    - [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#changelog-pytorch)
    - [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#changelog-fabric)
- [Contributors](https://github.com/Lightning-AI/lightning/releases/tag/2.3.0#contributors)



# Highlights


## Tensor Parallelism (beta)

Tensor parallelism (TP) is a technique that splits up the computation of selected layers across GPUs to save memory and speed up distributed models. To enable TP as well as other forms of parallelism, we introduce a `ModelParallelStrategy` for both Lightning Trainer and Fabric. Under the hood, TP is enabled through new experimental PyTorch APIs like [DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md) and [`torch.distributed.tensor.parallel`](https://pytorch.org/docs/stable/distributed.tensor.parallel.html).

### PyTorch Lightning

Enabling TP in a model with PyTorch Lightning requires you to implement the `LightningModule.configure_model()` method where you convert selected layers of a model to parallelized layers. This is an advanced feature, because it requires a deep understanding of the model architecture. Open the [tutorial Studio](https://lightning.ai/lightning-ai/studios/tensor-parallelism-supercharging-large-model-training-with-pytorch-lightning) to learn the basics of Tensor Parallelism.



```python
import lightning as L
from lightning.pytorch.strategies import ModelParallelStrategy
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
from torch.distributed.tensor.parallel import parallelize_module


# 1. Implement the `configure_model()` method in LightningModule
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = FeedForward(8192, 8192)

    def configure_model(self):
        # Lightning will set up a `self.device_mesh` for you
        tp_mesh = self.device_mesh["tensor_parallel"]
        # Use PyTorch's distributed tensor APIs to parallelize the model
        plan = {
            "w1": ColwiseParallel(),
            "w2": RowwiseParallel(),
            "w3": ColwiseParallel(),
        }
        parallelize_module(self.model, tp_mesh, plan)

    def training_step(self, batch):
        ...


# 2. Create the strategy
strategy = ModelParallelStrategy()

# 3. Configure devices and set the strategy in Trainer
trainer = L.Trainer(accelerator="cuda", devices=2, strategy=strategy)
trainer.fit(...)

```

**Full training example (requires at least 2 GPUs).**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
from torch.distributed.tensor.parallel import parallelize_module

import lightning as L
from lightning.pytorch.demos.boring_classes import RandomDataset
from lightning.pytorch.strategies import ModelParallelStrategy


class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = FeedForward(8192, 8192)

    def configure_model(self):
        if self.device_mesh is None:
            return

        # Lightning will set up a `self.device_mesh` for you
        tp_mesh = self.device_mesh["tensor_parallel"]
        # Use PyTorch's distributed tensor APIs to parallelize the model
        plan = {
            "w1": ColwiseParallel(),
            "w2": RowwiseParallel(),
            "w3": ColwiseParallel(),
        }
        parallelize_module(self.model, tp_mesh, plan)

    def training_step(self, batch):
        output = self.model(batch)
        loss = output.sum()
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=3e-3)

    def train_dataloader(self):
        # Trainer configures the sampler automatically for you such that
        # all batches in a tensor-parallel group are identical
        dataset = RandomDataset(8192, 64)
        return torch.utils.data.DataLoader(dataset, batch_size=8, num_workers=2)


strategy = ModelParallelStrategy()
trainer = L.Trainer(
    accelerator="cuda",
    devices=2,
    strategy=strategy,
    max_epochs=1,
)

model = LitModel()
trainer.fit(model)

trainer.print(f"Peak memory usage: {torch.cuda.max_memory_allocated() / 1e9:.02f} GB")
```

### Lightning Fabric

Applying TP in a model with Fabric requires you to implement a special function where you convert selected layers of a model to parallelized layers. This is an advanced feature, because it requires a deep understanding of the model architecture. Open the [tutorial Studio](https://lightning.ai/lightning-ai/studios/tensor-parallelism-supercharging-large-model-training-with-lightning-fabric) to learn the basics of Tensor Parallelism.

```python
import lightning as L
from lightning.fabric.strategies import ModelParallelStrategy
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
from torch.distributed.tensor.parallel import parallelize_module


# 1. Implement the parallelization function for your model
def parallelize_feedforward(model, device_mesh):
    # Lightning will set up a device mesh for you
    tp_mesh = device_mesh["tensor_parallel"]
    # Use PyTorch's distributed tensor APIs to parallelize the model
    plan = {
        "w1": ColwiseParallel(),
        "w2": RowwiseParallel(),
        "w3": ColwiseParallel(),
    }
    parallelize_module(model, tp_mesh, plan)
    return model


# 2. Pass the parallelization function to the strategy
strategy = ModelParallelStrategy(parallelize_fn=parallelize_feedforward)

# 3. Configure devices and set the strategy in Fabric
fabric = L.Fabric(accelerator="cuda", devices=2, strategy=strategy)
fabric.launch()
```
**Full training example (requires at least 2 GPUs).**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
from torch.distributed.tensor.parallel import parallelize_module

import lightning as L
from lightning.pytorch.demos.boring_classes import RandomDataset
from lightning.fabric.strategies import ModelParallelStrategy


class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def parallelize_feedforward(model, device_mesh):
    # Lightning will set up a device mesh for you
    tp_mesh = device_mesh["tensor_parallel"]
    # Use PyTorch's distributed tensor APIs to parallelize the model
    plan = {
        "w1": ColwiseParallel(),
        "w2": RowwiseParallel(),
        "w3": ColwiseParallel(),
    }
    parallelize_module(model, tp_mesh, plan)
    return model


strategy = ModelParallelStrategy(parallelize_fn=parallelize_feedforward)
fabric = L.Fabric(accelerator="cuda", devices=2, strategy=strategy)
fabric.launch()

# Initialize the model
model = FeedForward(8192, 8192)
model = fabric.setup(model)

# Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
optimizer = fabric.setup_optimizers(optimizer)

# Define dataset/dataloader
dataset = RandomDataset(8192, 64)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
dataloader = fabric.setup_dataloaders(dataloader)

# Simplified training loop
for i, batch in enumerate(dataloader):
    output = model(batch)
    loss = output.sum()
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    fabric.print(f"Iteration {i} complete")

fabric.print(f"Peak memory usage: {torch.cuda.max_memory_allocated() / 1e9:.02f} GB")
```

## 2D Parallelism (beta)

Tensor Parallelism by itself can be very effective for efficient inference of very large models. For training, TP is typically combined with other forms of parallelism, such as FSDP, to increase throughput and scalability on large clusters with 100s of GPUs. The new `ModelParallelStrategy` in this release supports the combination of TP + FSDP, which is referred to as 2D parallelism.

For an introduction to this feature, please also refer to the tutorial Studios ([PyTorch Lightning](https://lightning.ai/lightning-ai/studios/tensor-parallelism-supercharging-large-model-training-with-pytorch-lightning), [Lightning Fabric](https://lightning.ai/lightning-ai/studios/tensor-parallelism-supercharging-large-model-training-with-lightning-fabric)). At the moment, the PyTorch team is reimplementing FSDP under the name [FSDP2](https://github.com/pytorch/pytorch/issues/114299) with the aim to make it compose well with other parallelisms such as TP. Therefore, for the experimental 2D parallelism support, you'll need to switch to using FSDP2 with the new `ModelParallelStrategy`. Please refer to our docs ([PyTorch Lightning](https://lightning.ai/docs/pytorch/latest/advanced/model_parallel/tp_fsdp.html), [Lightning Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_parallel/tp_fsdp.html)) and stay tuned for future releases as these APIs mature.

## Training Mode in Model Summary

The model summary table that gets displayed when you run `Trainer.fit()` now contains a new column "Mode" that shows the training mode each layer is in ([#19468](https://github.com/Lightning-AI/lightning/pull/19468)).

```
   | Name                 | Type            | Params | Mode
-----------------------------------------------------------------
0  | model                | Sam             | 93.7 M | train
1  | model.image_encoder  | ImageEncoderViT | 89.7 M | eval
2  | model.prompt_encoder | PromptEncoder   | 6.2 K  | train
3  | model.mask_decoder   | MaskDecoder     | 4.1 M  | train
-----------------------------------------------------------------
93.7 M    Trainable params
0         Non-trainable params
93.7 M    Total params
374.942   Total estimated model params size (MB)
```

A module in PyTorch is always either in `train` (default) or `eval` mode. This improvement should give users more visibility into the state of their model and help debug issues, for example when you need to make sure certain layers of the model are frozen.

## Special Forward Methods in Fabric

Until now, Lightning Fabric warned the user in case the forward pass of the model or a subset of its modules was conducted through methods other than the dedicated `forward` method of the PyTorch module. The reason for this is that PyTorch needs to run special hooks in case of DDP/FSDP and other strategies to function properly, and not running through the real `forward` method would skip these hooks and lead to correctness issues.
In Lightning Fabric 2.3, we added a [feature to explicitly mark alternative forward methods](https://lightning.ai/docs/fabric/latest/api/wrappers.html#using-methods-other-than-forward-for-computation) so that Fabric can add the necessary rerouting behind the scenes:

```python
import lightning as L

fabric = L.Fabric(devices=2, strategy="ddp")
fabric.launch()

model = MyModel()
model = fabric.setup(model)

# OK: Calling the model directly
output = model(input)

# ERROR: Calling another method that calls forward indirectly
prediction = model.generate(input)

# New: Mark special forward methods explicitly before using them
model.mark_forward_method(model.generate)

# OK: Now can use `model.generate()` in DDP/FSDP without issues
prediction = model.generate(input)
```

Find the [full example](https://lightning.ai/docs/fabric/latest/api/wrappers.html#using-methods-other-than-forward-for-computation) and more details in our docs.

# Notable Changes

The 2.0 series of Lightning releases guarantees core API stability: No name changes, argument renaming, hook removals etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes we considered justified because they significantly improve the user experience, improve performance, or fix the correctness of a feature. These changes will likely not impact most users.

### Skipping the training step in DDP

It is no longer allowed to skip `training_step()` by returning `None` in distributed training ([#19918](https://github.com/Lightning-AI/pytorch-lightning/pull/19918)). The following usage was previously possible but would result in unpredictable hangs and timeouts in distributed training:

```python
def training_step(self, batch):
    loss = ...
    if loss.isnan():
        # No longer allowed in multi-GPU!
        # Raises error in Lightning >= 2.3
        return None
    return loss
```

We decided to raise an error if the user attempts to return `None` when running in a multi-GPU setting.

## Miscellaneous Changes

- Dropped support for PyTorch 1.13 ([#19300](https://github.com/Lightning-AI/lightning/pull/19300)). With every new Lightning release, we add official support for the latest PyTorch stable version and drop the oldest version in our support window.
- The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout to avoid long-running tasks being interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448)). Similarly, also in Fabric the `Fabric.rank_zero_first` context manager now uses an infinite barrier ([#19448](https://github.com/Lightning-AI/lightning/pull/19448)).

# CHANGELOG

## PyTorch Lightning
### Added

- The `ModelSummary` and `RichModelSummary` callbacks now display the training mode of each layer in the column "Mode" ([#19468](https://github.com/Lightning-AI/lightning/pull/19468))
- Added `load_from_checkpoint` support for `LightningCLI` when using dependency injection ([#18105](https://github.com/Lightning-AI/lightning/pull/18105))
- Added robust timer duration parsing with an informative error message when parsing fails ([#19513](https://github.com/Lightning-AI/pytorch-lightning/pull/19513))
- Added `on_exception` hook to `LightningDataModule` ([#19601](https://github.com/Lightning-AI/pytorch-lightning/pull/19601)); a small sketch follows this list
- Added support for PyTorch 2.3 ([#19708](https://github.com/Lightning-AI/pytorch-lightning/pull/19708))
- Added `ModelParallelStrategy` to support 2D parallelism ([#19878](https://github.com/Lightning-AI/pytorch-lightning/pull/19878), [#19888](https://github.com/Lightning-AI/pytorch-lightning/pull/19888))
- Added a call to `torch.distributed.destroy_process_group` in atexit handler if process group needs destruction ([#19931](https://github.com/Lightning-AI/pytorch-lightning/pull/19931))
- Added support for configuring hybrid-sharding by passing a tuple for the `FSDPStrategy(device_mesh=...)` argument ([#19504](https://github.com/Lightning-AI/pytorch-lightning/pull/19504))
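A minimal sketch of the new datamodule hook; the signature here is assumed to mirror the existing `on_exception` hooks (a single exception argument):

```python
import lightning as L


class MyDataModule(L.LightningDataModule):
    def on_exception(self, exception: BaseException) -> None:
        # Called when the Trainer hits an exception while this datamodule is in use;
        # a convenient place to close connections or flush partially written files.
        print(f"Cleaning up datamodule state after: {exception!r}")
```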
### Changed

- The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout to avoid long-running tasks being interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448))
- Relaxed the requirement for custom batch samplers to expose `drop_last` for prediction ([#19678](https://github.com/Lightning-AI/pytorch-lightning/pull/19678))
- It is no longer allowed to skip `training_step()` by returning `None` in distributed training ([#19918](https://github.com/Lightning-AI/pytorch-lightning/pull/19918))

### Removed

- Removed the Bagua integration (`Trainer(strategy="bagua")`) ([#19445](https://github.com/Lightning-AI/lightning/pull/19445))
- Removed support for PyTorch 1.13 ([#19706](https://github.com/Lightning-AI/lightning/pull/19706))

### Fixed

- Fixed a matrix shape mismatch issue when running a model loaded from a quantized checkpoint (bitsandbytes) ([#19886](https://github.com/Lightning-AI/lightning/pull/19886))
- Fixed `WandbLogger.log_hyperparameters()` raising an error if hyperparameters are not JSON serializable ([#19769](https://github.com/Lightning-AI/pytorch-lightning/pull/19769))
- Fixed an issue with the LightningCLI not being able to set the `ModelCheckpoint(save_last=...)` argument ([#19808](https://github.com/Lightning-AI/pytorch-lightning/pull/19808))
- Fixed an issue causing ValueError for certain objects, such as TorchMetrics, when dumping hyperparameters to YAML ([#19804](https://github.com/Lightning-AI/pytorch-lightning/pull/19804))
- Fixed resetting `epoch_loop.restarting` to avoid full validation run after `LearningRateFinder` ([#19818](https://github.com/Lightning-AI/pytorch-lightning/issues/19818))
## Lightning Fabric
### Added

- Added sanitization for classes before logging them as hyperparameters ([#19771](https://github.com/Lightning-AI/pytorch-lightning/pull/19771))
- Enabled consolidating distributed checkpoints through `fabric consolidate` in the new CLI ([#19560](https://github.com/Lightning-AI/pytorch-lightning/pull/19560))
- Added the ability to explicitly mark forward methods in Fabric via `_FabricModule.mark_forward_method()` ([#19690](https://github.com/Lightning-AI/pytorch-lightning/pull/19690))
- Added support for PyTorch 2.3 ([#19708](https://github.com/Lightning-AI/pytorch-lightning/pull/19708))
- Added `ModelParallelStrategy` to support 2D parallelism ([#19846](https://github.com/Lightning-AI/pytorch-lightning/pull/19846), [#19852](https://github.com/Lightning-AI/pytorch-lightning/pull/19852), [#19870](https://github.com/Lightning-AI/pytorch-lightning/pull/19870), [#19872](https://github.com/Lightning-AI/pytorch-lightning/pull/19872))
- Added a call to `torch.distributed.destroy_process_group` in atexit handler if process group needs destruction ([#19931](https://github.com/Lightning-AI/pytorch-lightning/pull/19931))
- Added support for configuring hybrid-sharding by passing a tuple for the `FSDPStrategy(device_mesh=...)` argument ([#19504](https://github.com/Lightning-AI/pytorch-lightning/pull/19504))

### Changed

- Renamed `lightning run model` to `fabric run` ([#19442](https://github.com/Lightning-AI/pytorch-lightning/pull/19442), [#19527](https://github.com/Lightning-AI/pytorch-lightning/pull/19527))
- The `Fabric.rank_zero_first` context manager now uses a barrier without timeout to avoid long-running tasks being interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448))
- Fabric now raises an error if you forget to call `fabric.backward()` when it is needed by the strategy or precision selection ([#19447](https://github.com/Lightning-AI/lightning/pull/19447), [#19493](https://github.com/Lightning-AI/lightning/pull/19493))
- `_BackwardSyncControl` can now control what to do when gradient accumulation is disabled ([#19577](https://github.com/Lightning-AI/lightning/pull/19577))

### Removed

- Removed support for PyTorch 1.13 ([#19706](https://github.com/Lightning-AI/lightning/pull/19706))

### Fixed

- Fixed a matrix shape mismatch issue when running a model loaded from a quantized checkpoint (bitsandbytes) ([#19886](https://github.com/Lightning-AI/lightning/pull/19886))

**Full commit list**: [2.2.0 -> 2.3.0](https://github.com/Lightning-AI/lightning/compare/2.2.0...2.3.0)

# Contributors

We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.

### New Contributors

* @cauyxy made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19437
* @mwip made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19518
* @kylebgorman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19513
* @kashif made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19520
* @ash0ts made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19451
* @dimitri-voytan made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19524
* @ankitgola005 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19615
* @invisprints made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19629
* @kvenkman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19465
* @fnhirwa made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19640
* @inyong37 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19677
* @clumsy made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19601
* @judidoko made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19692
* @Lunamos made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19701
* @dominicgkerr made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19727
* @daavoo made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19774
* @Peiffap made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19805
* @IvanYashchuk made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19926
* @ringohoffman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19904
* @afspies made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19847
* @fedebotu made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19822
* @mariovas3 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19808
* @Bhavay-2001 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19947
* @V0XNIHILI made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19771

### Did you know?

Chuck Norris is a big fan and daily user of Lightning Studio.

Patch release v2.2.5 (2024-05-22)

## PyTorch Lightning + Fabric

### Fixed

- Fixed a matrix shape mismatch issue when running a model loaded from a quantized checkpoint (bitsandbytes) ([#19886](https://github.com/Lightning-AI/lightning/pull/19886))


----

**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.4...2.2.5

Patch release v2.2.4 (2024-05-01)

## App

### Fixed

- Fixed HTTPClient retry for flow/work queue ([#19837](https://github.com/Lightning-AI/pytorch-lightning/pull/19837))


## PyTorch

No Changes.

## Fabric

No Changes.


**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.3...2.2.4

Patch release v2.2.3 (2024-04-23)

## PyTorch

### Fixed

- Fixed `WandbLogger.log_hyperparameters()` raising an error if hyperparameters are not JSON serializable ([#19769](https://github.com/Lightning-AI/pytorch-lightning/pull/19769))


## Fabric

No Changes.


**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.2...2.2.3

Patch release v2.2.2 (2024-04-11)

## PyTorch

### Fixed

- Fixed an issue causing a TypeError when using `torch.compile` as a decorator ([#19627](https://github.com/Lightning-AI/pytorch-lightning/pull/19627))
- Fixed a KeyError when saving a FSDP sharded checkpoint and setting `save_weights_only=True` ([#19524](https://github.com/Lightning-AI/pytorch-lightning/pull/19524))

## Fabric

### Fixed


- Fixed an issue causing a TypeError when using `torch.compile` as a decorator ([#19627](https://github.com/Lightning-AI/pytorch-lightning/pull/19627))
- Fixed issue where some model methods couldn't be monkeypatched after being Fabric wrapped ([#19705](https://github.com/Lightning-AI/pytorch-lightning/pull/19705))
- Fixed an issue causing weights to be reset in `Fabric.setup()` when using FSDP ([#19755](https://github.com/Lightning-AI/pytorch-lightning/pull/19755))


**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.1...2.2.2

## Contributors

@ankitgola005 @awaelchli @Borda @carmocca @dmitsf @dvoytan-spark @fnhirwa 


Patch release v2.2.1 (2024-03-04)

## PyTorch

### Fixed

- Fixed an issue with CSVLogger trying to append to file from a previous run when the version is set manually ([#19446](https://github.com/Lightning-AI/lightning/pull/19446))
- Fixed the divisibility check for `Trainer.accumulate_grad_batches` and `Trainer.log_every_n_steps` in ThroughputMonitor ([#19470](https://github.com/Lightning-AI/lightning/pull/19470))
- Fixed support for Remote Stop and Remote Abort with NeptuneLogger ([#19130](https://github.com/Lightning-AI/pytorch-lightning/pull/19130))
- Fixed infinite recursion error in precision plugin graveyard ([#19542](https://github.com/Lightning-AI/pytorch-lightning/pull/19542))


## Fabric

### Fixed

- Fixed an issue with CSVLogger trying to append to file from a previous run when the version is set manually ([#19446](https://github.com/Lightning-AI/lightning/pull/19446))




**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.0post...2.2.1

## Contributors
@Raalsky @awaelchli @carmocca @Borda


_If we forgot someone due to not matching commit email with GitHub account, let us know :]_


Minor release correction (2024-02-12)

**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.2.0...2.2.0.post0

Lightning v2.2 (2024-02-07)

[Lightning AI](https://lightning.ai) is excited to announce the release of Lightning 2.2 :zap:

**Did you know?** The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you [Lightning Studio](https://lightning.ai/). Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.

While our previous release was packed with many big new features, this time around we're rolling out mainly improvements based on feedback from the community. And of course, as the name implies, this release fully supports the latest [PyTorch 2.2](https://pytorch.org/blog/pytorch2-2/) :tada:


- [Highlights](#highlights)
    - [Monitoring Throughput](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-throughput)
    - [Improved Handling of Evaluation Mode](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-eval)
    - [Converting FSDP Checkpoints](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-consolidate-fsdp)
    - [Improvements to Compiling DDP/FSDP in Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-compile)
    - [Saving and Loading DataLoader State](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-dataloader-state)
    - [Non-strict Checkpoint Loading in Trainer](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#highlights-non-strict)
- [Notable Changes](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#bc-changes)
- [Full Changelog](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#changelog)
    - [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#changelog-pytorch)
    - [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#changelog-fabric)
- [Contributors](https://github.com/Lightning-AI/lightning/releases/tag/2.2.0#contributors)



# Highlights


## Monitoring Throughput

Lightning now has built-in utilities to measure throughput metrics such as batches/sec, samples/sec and Model FLOP Utilization (MFU) ([#18848](https://github.com/Lightning-AI/lightning/pull/18848)).

**Trainer:**

For the Trainer, this comes in the form of a `ThroughputMonitor` callback. In order to track samples/sec, you need to provide a function that tells the monitor how to extract the batch dimension from your input. Furthermore, if you want to track MFU, you can provide a sample forward pass and the `ThroughputMonitor` will automatically estimate the utilization based on the hardware you are running on:

```python
import torch

import lightning as L
from lightning.pytorch.callbacks import ThroughputMonitor
from lightning.fabric.utilities.throughput import measure_flops


class MyModel(L.LightningModule):
    def setup(self, stage):
        with torch.device("meta"):
            model = MyModel()

        def sample_forward():
            batch = torch.randn(..., device="meta")
            return model(batch)

        self.flops_per_batch = measure_flops(model, sample_forward, loss_fn=torch.Tensor.sum)


throughput = ThroughputMonitor(
    batch_size_fn=lambda batch: batch.size(0),
    # optional, if your samples have a length (like number of tokens)
    sample_fn=lambda batch: batch.size(1)
)
trainer = L.Trainer(log_every_n_steps=10, callbacks=throughput, logger=...)
model = MyModel()
trainer.fit(model)

```

The results get automatically sent to the logger if one is configured on the Trainer.

**Fabric:**

For Fabric, the `ThroughputMonitor` is a simple utility object on which you call `.update()` and `compute_and_log()` during the training loop:

```python
from time import time

import torch
import lightning as L
from lightning.fabric.utilities import ThroughputMonitor


fabric = L.Fabric(logger=...)
throughput = ThroughputMonitor(fabric)

t0 = time()
for batch_idx, batch in enumerate(train_dataloader):
    do_work()
    torch.cuda.synchronize()  # required or else time() won't be correct
    throughput.update(
        time=(time() - t0), 
        batches=batch_idx, 
        samples=(batch_idx * batch_size)
    )
    if batch_idx % 10 == 0:
        throughput.compute_and_log(step=batch_idx)
```

Check out [our TinyLlama LLM pretraining script](https://github.com/Lightning-AI/lit-gpt/blob/6150d04ff3b199ddefbe55e58d593ecae587b9d9/pretrain/tinyllama.py) for a full example using Fabric's `ThroughputMonitor`. 

The throughput utilities can report:
- batches per second (per process and across processes)
- samples per second (per process and across processes)
- items per second (e.g. tokens) (per process and across processes)
- flops per second (per process and across processes)
- model flops utilization (MFU) (per process)
- total time, total samples, total batches, and total items (per process)



## Improved Handling of Evaluation Mode

When you train a model and have validation enabled, the Trainer automatically calls `.eval()` when transitioning to the validation loop, and `.train()` when validation ends. Until now, this had the unfortunate side effect that any submodules in your LightningModule that were in evaluation mode got reset to train mode. In Lightning 2.2, the Trainer now captures the mode of every submodule before switching to validation, and restores the mode the modules were in when validation ends ([#18951](https://github.com/Lightning-AI/lightning/pull/18951)). This improvement helps users avoid silent correctness bugs and removes boilerplate code for managing frozen layers.


```python
import lightning as L


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.trainable_module = ...
        
        # This will now stay in eval mode
        self.frozen_module = ...
        self.frozen_module.eval()
        
    def training_step(self, batch):
        # Previously, modules were all in train mode
        # Now: Modules are in mode they were set up with
        assert self.trainable_module.training
        assert not self.frozen_module.training
        ...
        
    def validation_step(self, batch):
        # All modules are in eval mode
        ...
    
    
model = LitModel()
trainer = L.Trainer()
trainer.fit(model)
```

If you have overridden any of the `LightningModule.on_{validation,test,predict}_model_{eval,train}` hooks, they will still get called and execute your custom logic, but they are no longer required if you added them to preserve the eval mode of frozen modules.

> [!IMPORTANT]
> In some libraries, for example HuggingFace, models are created in evaluation mode by default (e.g. `HFModel.from_pretrained(...)`). Starting from 2.2, you will have to call `.train()` on these models if you intend to train them.
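A minimal sketch of the new requirement (`HFModel` is just an illustrative stand-in for any library that returns models in eval mode):

```python
import lightning as L


class LitFinetuner(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Models loaded via `from_pretrained`-style APIs often come in eval mode
        self.backbone = HFModel.from_pretrained("some-checkpoint")
        # Since 2.2 the Trainer no longer forces train mode, so switch explicitly
        self.backbone.train()
```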



## Converting FSDP Checkpoints

In the previous release, we introduced distributed checkpointing with FSDP to speed up saving and loading checkpoints for big models. These checkpoints use a special format: they are saved as a folder, with the shards from each GPU in separate files. While such checkpoints can be loaded back with the Lightning Trainer or Fabric very easily, they aren't easy to load or process externally. In Lightning 2.2, we introduced a CLI utility that lets you consolidate the checkpoint folder into a single file that can be loaded in raw PyTorch, for example with `torch.load()` ([#19213](https://github.com/Lightning-AI/lightning/pull/19213)).

Given a saved distributed checkpoint, you can convert it like so:

```bash
# For Trainer checkpoints:
python -m lightning.pytorch.utilities.consolidate_checkpoint path/to/my/checkpoint


# For Fabric checkpoints:
python -m lightning.fabric.utilities.consolidate_checkpoint path/to/my/checkpoint
```
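Once consolidated, the result is a regular single-file checkpoint that plain PyTorch can read. A minimal sketch (the output path below is an assumption; use whatever file the CLI produced for you):

```python
import torch

# Load the consolidated single-file checkpoint with plain PyTorch
checkpoint = torch.load("path/to/my/checkpoint.consolidated", map_location="cpu")
print(checkpoint.keys())
```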

Read more about distributed checkpointing in our documentation: [Trainer](https://lightning.ai/docs/pytorch/2.2.0/common/checkpointing_expert.html#convert-a-distributed-checkpoint), [Fabric](https://lightning.ai/docs/fabric/2.2.0/guide/checkpoint/distributed_checkpoint.html#convert-a-distributed-checkpoint).



## Improvements to Compiling DDP/FSDP in Fabric

PyTorch 2.0+ introduced `torch.compile`, a powerful tool to speed up your models without changing the code.
We have now added [a comprehensive guide on how to use `torch.compile`](https://lightning.ai/docs/fabric/2.2.0/advanced/compile.html) correctly, with tips and tricks to help you troubleshoot common issues. On top of that, `Fabric.setup()` will now reapply `torch.compile` on top of DDP/FSDP if you enable these strategies ([#19280](https://github.com/Lightning-AI/lightning/pull/19280)).

```python
import torch

import lightning as L

# Select a distributed strategy (DDP, FSDP, ...)
fabric = L.Fabric(strategy="ddp", devices=8)

# Compile your model before `.setup()`
model = torch.compile(model)

# Now automatically handles compiling also over DDP/FSDP
model = fabric.setup(model)

# You can opt-out if it is causing trouble
model = fabric.setup(model, _reapply_compile=False)
```

You might see fewer graph breaks, but there won't be any significant speed-ups with this. We introduced it mainly to make Fabric ready for future PyTorch improvements that optimize distributed operations.



## Saving and Loading DataLoader State

If you use a dataloader/iterable that implements the `.state_dict()` and `.load_state_dict()` interface, the Trainer will now automatically save and load their state in the checkpoint ([#19361](https://github.com/Lightning-AI/lightning/pull/19361)).

```python
import lightning as L


class MyDataLoader:
    """A dataloader that implements the 'stateful' interface."""

    def __init__(self):
        self.batches_fetched = 0

    def state_dict(self):
        # Return a dictionary with the state to include in the checkpoint
        return {"batches_fetched": self.batches_fetched}

    def load_state_dict(self, state_dict):
        # Restore the state from the checkpoint
        self.batches_fetched = state_dict["batches_fetched"]


model = ...
dataloader = MyDataLoader()
trainer = L.Trainer()

# Saves checkpoints that include the dataloader state
trainer.fit(model, dataloader)

# When you resume training, the dataloader can now load its state
trainer.fit(model, dataloader, ckpt_path="path/to/my/checkpoint")
```

Note that the standard [PyTorch DataLoader](https://pytorch.org/docs/stable/data.html) does not support this stateful interface. This feature only works on loaders that implement these two methods. A dataloader that supports full fault-tolerance will be included in our upcoming release of Lightning Data - a library to optimize data preprocessing and streaming in the cloud. Stay tuned!


## Non-strict Checkpoint Loading in Trainer

A feature that the community has requested for a long time is non-strict checkpoint loading. By default, a checkpoint in PyTorch is loaded with `strict=True` to ensure all keys in the saved checkpoint match what's in the model's state dict.
However, in some use cases it might make sense to exclude certain weights from the checkpoint. When resuming training, the user would then need to load with `strict=False`, which wasn't configurable until now.

You can now set the attribute `strict_loading=False` on your LightningModule if you want to allow loading partial checkpoints ([#19404](https://github.com/Lightning-AI/lightning/pull/19404)).

```python
import lightning as L

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        
        # This model only trains the decoder, we don't save the encoder
        self.encoder = from_pretrained(...).requires_grad_(False)
        self.decoder = Decoder()
        
        # Set to False because we only care about the decoder
        self.strict_loading = False
    
    def state_dict(self):
        # Don't save the encoder, it is not being trained
        return {k: v for k, v in super().state_dict().items() if "encoder" not in k}

...

trainer = L.Trainer()
model = LitModel()

# Will load weights with `.load_state_dict(strict=model.strict_loading)`
trainer.fit(model, ckpt_path="path/to/checkpoint")
```
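If you load the partial checkpoint outside of the Trainer, for example for inference, `load_from_checkpoint` accepts a `strict` argument as well. A minimal sketch (the checkpoint path is illustrative):

```python
# Standalone loading of the partial checkpoint, e.g. for inference
model = LitModel.load_from_checkpoint("path/to/checkpoint.ckpt", strict=False)
model.eval()
```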

Full documentation [here](https://lightning.ai/docs/pytorch/2.2.0/common/checkpointing_advanced.html#resume-from-a-partial-checkpoint).




# Notable Changes

The 2.0 series of Lightning releases guarantees core API stability: no name changes, argument renaming, hook removals, etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes we made where we felt the change was justified because it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.


## ModelCheckpoint's save-last Feature

In Lightning 2.1, we made the `ModelCheckpoint(..., save_last=True)` feature save a symbolic link to the last saved checkpoint instead of rewriting the checkpoint ([#18748](https://github.com/Lightning-AI/lightning/pull/18748)). This time saver is especially useful for large models that take a while to save. However, many users were confused by the new behavior and wanted it turned off, i.e., saving a copy instead of a symbolic link like before. In Lightning 2.2, we are reverting this decision and making the linking opt-in ([#19191](https://github.com/Lightning-AI/lightning/pull/19191)):

```python
from lightning.pytorch.callbacks import ModelCheckpoint

# In 2.1 saves a symbolic link "last.ckpt" to the last checkpoint saved
# In 2.2 saves "last.ckpt" as a copy of the last checkpoint saved
checkpoint = ModelCheckpoint("./my_checkpoints", save_last=True)

# You can opt-in to save a symlink (if possible)
checkpoint = ModelCheckpoint("./my_checkpoints", save_last="link")
```



## Removed Problematic Default Seeding

The `seed_everything(x)` utility function is useful to set the seed for several libraries like PyTorch, NumPy and Python in a single line of code. However, until now you were allowed to omit passing a seed value, in which case the function picked a seed *randomly*. In certain cases, for example when processes are launched externally (e.g., SLURM, torchelastic, etc.), this default behavior is dangerous because each process will independently choose a random seed. This can affect sampling, randomized validation splits, and other behaviors that rely on each process having the same seed. In 2.2, we removed this behavior and now default to a seed value of 0 ([#18846](https://github.com/Lightning-AI/lightning/pull/18846)):

```python
from lightning.pytorch.utilities import seed_everything

# Set the random seed for PyTorch, NumPy, Python etc.
seed_everything(42)

# Not setting a value now defaults to 0
seed_everything()
```

In the unlikely event that you relied on the previous behavior, you now have to choose the seed randomly yourself:

```python
import random

seed_everything(random.randint(0, 1000000))
```


## Miscellaneous Changes

- Dropped support for PyTorch 1.12 ([#19300](https://github.com/Lightning-AI/lightning/pull/19300))
- The columns in the `metrics.csv` file produced by `CSVLogger` are now sorted alphabetically ([#19159](https://github.com/Lightning-AI/lightning/pull/19159))
- Added support for meta-device initialization and materialization of 4-bit Bitsandbytes layers ([#19150](https://github.com/Lightning-AI/lightning/pull/19150))
- Added `TransformerEnginePrecision(fallback_compute_dtype=)` to control the dtype of operations that don't support fp8 ([#19082](https://github.com/Lightning-AI/lightning/pull/19082))
- We renamed the `TransformerEnginePrecision(dtype=)` argument to `weights_dtype` and made it required ([#19082](https://github.com/Lightning-AI/lightning/pull/19082))
- The `LightningModule.load_from_checkpoint()` function now calls `.configure_model()` on the model if it is overridden, to ensure all layers can be loaded from the checkpoint ([#19036](https://github.com/Lightning-AI/lightning/pull/19036))



# CHANGELOG


## PyTorch Lightning

Added - Added `lightning.pytorch.callbacks.ThroughputMonitor` to track throughput and log it ([#18848](https://github.com/Lightning-AI/lightning/pull/18848)) - The Trainer now restores the training mode set through `.train()` or `.eval()` on a submodule-level when switching from validation to training ([#18951](https://github.com/Lightning-AI/lightning/pull/18951)) - Added support for meta-device initialization and materialization of 4-bit Bitsandbytes layers ([#19150](https://github.com/Lightning-AI/lightning/pull/19150)) - Added `TransformerEnginePrecision(fallback_compute_dtype=)` to control the dtype of operations that don't support fp8 ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - Added the option `ModelCheckpoint(save_last='link')` to create a symbolic link for the 'last.ckpt' file ([#19191](https://github.com/Lightning-AI/lightning/pull/19191)) - Added a utility function and CLI to consolidate FSDP sharded checkpoints into a single file ([#19213](https://github.com/Lightning-AI/lightning/pull/19213)) - The TQDM progress bar now respects the env variable `TQDM_MINITERS` for setting the refresh rate ([#19381](https://github.com/Lightning-AI/lightning/pull/19381)) - Added support for saving and loading stateful training DataLoaders ([#19361](https://github.com/Lightning-AI/lightning/pull/19361)) - Added shortcut name `strategy='deepspeed_stage_1_offload'` to the strategy registry ([#19075](https://github.com/Lightning-AI/lightning/pull/19075)) - Added support for non-strict state-dict loading in Trainer via the new `LightningModule.strict_loading = True | False` attribute ([#19404](https://github.com/Lightning-AI/lightning/pull/19404))
Changed - `seed_everything()` without passing in a seed no longer randomly selects a seed, and now defaults to `0` ([#18846](https://github.com/Lightning-AI/lightning/pull/18846)) - The `LightningModule.on_{validation,test,predict}_model_{eval,train}` now only get called if they are overridden by the user ([#18951](https://github.com/Lightning-AI/lightning/pull/18951)) - The `Trainer.fit()` loop no longer calls `LightningModule.train()` at the start; it now preserves the user's configuration of frozen layers ([#18951](https://github.com/Lightning-AI/lightning/pull/18951)) - The `LightningModule.load_from_checkpoint()` function now calls `.configure_model()` on the model if it is overridden, to ensure all layers can be loaded from the checkpoint ([#19036](https://github.com/Lightning-AI/lightning/pull/19036)) - Restored usage of `step` parameter when logging metrics with `NeptuneLogger` ([#19126](https://github.com/Lightning-AI/pytorch-lightning/pull/19126)) - Changed the `TransformerEnginePrecision(dtype=)` argument to `weights_dtype` and made it required ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - The columns in the `metrics.csv` file produced by `CSVLogger` are now sorted alphabetically ([#19159](https://github.com/Lightning-AI/lightning/pull/19159)) - Reverted back to creating a checkpoint copy when `ModelCheckpoint(save_last=True)` instead of creating a symbolic link ([#19191](https://github.com/Lightning-AI/lightning/pull/19191))
Deprecated - Deprecated all precision plugin classes under `lightning.pytorch.plugins` with the suffix `Plugin` in the name ([#18840](https://github.com/Lightning-AI/lightning/pull/18840))
Removed - Removed support for PyTorch 1.12 ([#19300](https://github.com/Lightning-AI/lightning/pull/19300))
Fixed - Fixed issue where the `precision="transformer-engine"` argument would not replace layers by default ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - Fixed issue where layers created in `LightningModule.setup` or `LightningModule.configure_model` wouldn't get converted when using the Bitsandbytes or TransformerEngine plugins ([#19061](https://github.com/Lightning-AI/lightning/pull/19061)) - Fixed the input validation logic in `FSDPStrategy` to accept a `device_mesh` ([#19392](https://github.com/Lightning-AI/lightning/pull/19392))
## Lightning Fabric
Added - Added `lightning.fabric.utilities.ThroughputMonitor` and `lightning.fabric.utilities.Throughput` to track throughput and log it ([#18848](https://github.com/Lightning-AI/lightning/pull/18848)) - Added `lightning.fabric.utilities.AttributeDict` for convenient dict-attribute access to represent state in script ([#18943](https://github.com/Lightning-AI/lightning/pull/18943)) - Added support for meta-device initialization and materialization of 4-bit Bitsandbytes layers ([#19150](https://github.com/Lightning-AI/lightning/pull/19150)) - Added `TransformerEnginePrecision(fallback_compute_dtype=)` to control the dtype of operations that don't support fp8 ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - Added support for clipping gradients by value with FSDP ([#19236](https://github.com/Lightning-AI/lightning/pull/19236)) - Added a utility function and CLI to consolidate FSDP sharded checkpoints into a single file ([#19213](https://github.com/Lightning-AI/lightning/pull/19213)) - Added support for re-compiling the model inside `Fabric.setup()` over the FSDP/DDP wrappers ([#19280](https://github.com/Lightning-AI/lightning/pull/19280))
Changed - `seed_everything()` without passing in a seed no longer randomly selects a seed, and now defaults to `0` ([#18846](https://github.com/Lightning-AI/lightning/pull/18846)) - Changed the `TransformerEnginePrecision(dtype=)` argument to `weights_dtype` and made it required ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - The columns in the `metrics.csv` file produced by `CSVLogger` are now sorted alphabetically ([#19159](https://github.com/Lightning-AI/lightning/pull/19159))
Removed - Removed support for PyTorch 1.12 ([#19300](https://github.com/Lightning-AI/lightning/pull/19300))
Fixed - Fixed parsing of v100s GPUs in `get_available_flops` ([#18952](https://github.com/Lightning-AI/lightning/pull/18952)) - Fixed issue where the `precision="transformer-engine"` argument would not replace layers by default ([#19082](https://github.com/Lightning-AI/lightning/pull/19082)) - Fixed the input validation logic in `FSDPStrategy` to accept a `device_mesh` ([#19392](https://github.com/Lightning-AI/lightning/pull/19392))

**Full commit list**: [2.1.0 -> 2.2.0](https://github.com/Lightning-AI/lightning/compare/2.1.0...2.2.0)

# Contributors

Everyone who contributed between 2.1 and 2.2, in no particular order:

### Veteran

@nik777 @Raalsky @wouterzwerink @AleksanderWWW @awaelchli @nohalon @ioangatop @Borda @ethanwharris @BoringDonut @mauvilsa @parambharat @tchaton @ryan597 @adamjstewart @rasbt @carmocca

### New

@hiaoxui @VictorPrins @jaswon @AMHermansen @JalinWang @MF-FOOM @unacanal @Jamim @harishb00 @asingh9530 @dipta007 @daturkel @jerrymannil @mjbommar @shenmishajing @paganpasta @lauritsf @andyland @mathematicalmichael

### Did you know?

Chuck Norris is a big fan and daily user of PyTorch Lightning.

Lightning 2.2 Release Candidate (2024-02-01)

This is a preview release for Lightning 2.2.0.

Minor patch release v2.1.4 (2024-02-01)

## Fabric

### Fixed

- Fixed an issue preventing Fabric to run on CPU when the system's CUDA driver is outdated or broken ([#19234](https://github.com/Lightning-AI/lightning/pull/19234))
- Fixed typo in kwarg in SpikeDetection ([#19282](https://github.com/Lightning-AI/lightning/pull/19282))


---

## PyTorch

### Fixed

- Fixed `Trainer` not expanding the `default_root_dir` if it has the `~` (home) prefix ([#19179](https://github.com/Lightning-AI/lightning/pull/19179))
- Fixed warning for Dataloader if `num_workers=1` and CPU count is 1 ([#19224](https://github.com/Lightning-AI/lightning/pull/19224))
- Fixed `WandbLogger.watch()` method annotation to accept `None` for the log parameter ([#19237](https://github.com/Lightning-AI/lightning/pull/19237))
- Fixed an issue preventing the Trainer to run on CPU when the system's CUDA driver is outdated or broken ([#19234](https://github.com/Lightning-AI/lightning/pull/19234))
- Fixed an issue with the ModelCheckpoint callback not saving relative symlinks with `ModelCheckpoint(save_last="link")` ([#19303](https://github.com/Lightning-AI/lightning/pull/19303))
- Fixed issue where the `_restricted_classmethod_impl` would incorrectly raise a TypeError on inspection rather than on call ([#19332](https://github.com/Lightning-AI/lightning/pull/19332))
- Fixed exporting `__version__` in `__init__` ([#19221](https://github.com/Lightning-AI/lightning/pull/19221))


---

**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.1.3...2.1.4

## Contributors

@andyland @asingh9530 @awaelchli @Borda @daturkel @dipta007 @lauritsf @mjbommar @shenmishajing @tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_


Minor patch release v2.1.3 (2023-12-21)

## App

### Changed

- Lightning App: Use the batch get endpoint (#19180)
- Drop starsessions from App's requirements (#18470)
- Optimize loading time for chunks to be there (#19109)

---

## Data

### Added

- Add fault tolerance `StreamingDataset` (#19052)
- Add numpy support for the `StreamingDataset` (#19050)
- Add fault tolerance for the `StreamingDataset` (#19049)
- Add direct s3 support to the `StreamingDataset` (#19044)
- Add disk usage check before downloading files (#19041)

### Changed

- Cleanup chunks right away if the dataset doesn't fit within the cache in `StreamingDataset` (#19168)
- `StreamingDataset` improve deletion strategy (#19118)
- Improve `StreamingDataset` Speed (#19114)
- Remove time in the Data Processor progress bar (#19108)
- Optimize loading time for chunks to be there (#19109)
- Resolve path for `StreamingDataset` (#19094)
- Make input dir in `DataProcessor` required (#18910)
- Remove the `LightningDataset` relying on un-maintained torchdata (#19019)

### Fixed

- Resolve checkpointing for the Streaming Dataset (#19123)
- Resolve Item Loader bugs (#19017)

---

## Fabric

### Fixed

- Avoid moving the model to device if `move_to_device=False` is passed (#19152)
- Fixed broadcast at initialization in `MPIEnvironment` (#19074)

---

## PyTorch

### Changed

- `LightningCLI` no longer allows setting a normal class instance as default. A `lazy_instance` can be used instead (#18822)

### Fixed

- Fixed checks for local file protocol due to fsspec changes in 2023.10.0 (#19023)
- Fixed automatic detection of 'last.ckpt' files to respect the extension when filtering (#17072)
- Fixed an issue where setting `CHECKPOINT_JOIN_CHAR` or `CHECKPOINT_EQUALS_CHAR` would only work on the `ModelCheckpoint` class but not on an instance (#19054)
- Fixed `ModelCheckpoint` not expanding the `dirpath` if it has the `~` (home) prefix (#19058)
- Fixed handling checkpoint dirpath suffix in NeptuneLogger (#18863)
- Fixed an edge case where `ModelCheckpoint` would alternate between versioned and unversioned filename (#19064)
- Fixed broadcast at initialization in `MPIEnvironment` (#19074)
- Fixed the tensor conversion in `self.log` to respect the default dtype (#19046)

---

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.2...2.1.3

## Contributors

@AleksanderWWW, @awaelchli, @borda, @carmocca, @dependabot[bot], @mauvilsa, @MF-FOOM, @tchaton, @yassersouri

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Minor patch release v2.1.2 (2023-11-15)

## App

### Changed

- Forced plugin server to use localhost (#18976)
- Enabled bundling additional files into app source (#18980)
- Limited rate of requests to http queue (#18981)

---

## Fabric

### Fixed

- Fixed precision default from environment (#18928)

---

## PyTorch

### Fixed

- Fixed an issue causing permission errors on Windows when attempting to create a symlink for the "last" checkpoint (#18942)
- Fixed an issue where Metric instances from `torchmetrics` wouldn't get moved to the device when using FSDP (#18954)
- Fixed an issue preventing the user to `Trainer.save_checkpoint()` an FSDP model when `Trainer.test/validate/predict()` ran after `Trainer.fit()` (#18992)

---

## Contributors

@awaelchli, @carmocca, @ethanwharris, @tchaton 

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.1...2.1.2

Minor patch release v2.1.1 (2023-11-06)

## App

### Added

- Added flow `fail()` (#18883)

### Fixed

- Fixed failing Lightning CLI entry point (#18821)

---

## Fabric

### Changed

- Calling a method other than `forward` that invokes submodules is now an error when the model is wrapped (e.g., with DDP) (#18819)

### Fixed

- Fixed false-positive warnings about method calls on the Fabric-wrapped module (#18819)
- Refined the FSDP saving logic and error messaging when the path exists (#18884)
- Fixed layer conversion under `Fabric.init_module()` context manager when using the `BitsandbytesPrecision` plugin (#18914)

---

## PyTorch

### Fixed

- Fixed an issue when replacing an existing `last.ckpt` file with a symlink (#18793)
- Fixed an issue when `BatchSizeFinder` `steps_per_trial` parameter ends up defining how many validation batches to run during the entire training (#18394)
- Fixed an issue saving the `last.ckpt` file when using `ModelCheckpoint` on a remote filesystem, and no logger is used (#18867)
- Refined the FSDP saving logic and error messaging when the path exists (#18884)
- Fixed an issue parsing the version from folders that don't include a version number in `TensorBoardLogger` and `CSVLogger` (#18897)

---

## Contributors

@awaelchli, @borda, @BoringDonut, @carmocca, @hiaoxui, @ioangatop, @nohalon, @rasbt, @tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.0...2.1.1

Lightning 2.1: Train Bigger, Better, Faster (2023-10-12)

[Lightning AI](https://lightning.ai) is excited to announce the release of Lightning 2.1 :zap: It's the culmination of work from 79 contributors who have worked on features, bug-fixes, and documentation for a total of more than 750 commits since v2.0.

The theme of 2.1 is "bigger, better, faster": **Bigger** because training large multi-billion parameter models has gotten even more efficient thanks to FSDP, efficient initialization and sharded checkpointing improvements, **better** because it's easier than ever to scale models without making substantial code changes or installing third-party packages, and **faster** because it leverages the latest hardware features to speed up training in low-bit precision thanks to new precision plugins like bitsandbytes and transformer engine.
And of course, as the name implies, this release fully leverages the latest features in [PyTorch 2.1](https://pytorch.org/blog/pytorch-2-1/) :tada: 


- [Highlights](#highlights)
    - [Improvements To Large-Scale Training With FSDP](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-fsdp)
    - [True Half-Precision](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-half-precision)
    - [Bitsandbytes Quantization](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-bitsandbytes)
    - [Transformer Engine](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-transformer-engine)
    - [Lightning on TPU Goes Brrr](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-tpu)
    - [Granular Control Over Checkpoints in Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-fabric-checkpoints)
- [Backward Incompatible Changes](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes)
    - [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes-pytorch)
    - [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes-fabric)
- [Full Changelog](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog)
    - [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-pytorch)
    - [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-fabric)
    - [Lightning App](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-app)
- [Contributors](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#contributors)



# Highlights


## Improvements To Large-Scale Training With FSDP

The FSDP strategy for training large billion-parameter models gets substantial improvements and new features in Lightning 2.1, both in Trainer and Fabric (in case you didn't know, [Fabric](https://lightning.ai/docs/fabric/stable) is the latest addition to the Lightning family of tools to scale models without boilerplate code).
FSDP is now easier to configure, comes with memory-management and speed improvements, and we have a brand-new end-to-end user guide with best practices ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_parallel/fsdp.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_parallel/fsdp.html)).


### Efficient Saving and Loading of Large Checkpoints

When training large billion-parameter models with FSDP, saving and resuming training, or even just loading model parameters for finetuning can be challenging, as users are often plagued by out-of-memory errors and speed bottlenecks.

In 2.1, we made several improvements. Starting with saving checkpoints, we added support for distributed/sharded checkpoints, enabled through the setting `state_dict_type` in the strategy ([#18364](https://github.com/Lightning-AI/lightning/pull/18364), [#18358](https://github.com/Lightning-AI/lightning/pull/18358)):


**Trainer:**
```python
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Default used by the strategy
strategy = FSDPStrategy(state_dict_type="full")

# Enable saving distributed checkpoints
strategy = FSDPStrategy(state_dict_type="sharded")

trainer = L.Trainer(strategy=strategy, ...)
```

**Fabric:**
```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Saving distributed checkpoints is the default
strategy = FSDPStrategy(state_dict_type="sharded")

# Save consolidated (single file) checkpoints
strategy = FSDPStrategy(state_dict_type="full")

fabric = L.Fabric(strategy=strategy, ...)
```

Distributed checkpoints are the fastest and most memory-efficient way to save the state of very large models.
The distributed checkpoint format also makes it efficient to load these checkpoints back for resuming training in parallel, and it significantly reduces the impact on CPU memory usage. We've also introduced lazy loading for non-distributed checkpoints ([#18150](https://github.com/Lightning-AI/lightning/pull/18150), [#18379](https://github.com/Lightning-AI/lightning/pull/18379)), which greatly reduces the impact on CPU memory usage when loading a consolidated (single-file) checkpoint (e.g. for finetuning). Learn more about these features in our FSDP guides ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_parallel/fsdp.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_parallel/fsdp.html)).
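Resuming from a distributed checkpoint looks the same as resuming from a regular one. A minimal Trainer sketch (the model and paths are illustrative, and note that a sharded checkpoint is a folder rather than a single file):

```python
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

trainer = L.Trainer(strategy=FSDPStrategy(state_dict_type="sharded"), accelerator="cuda", devices=8)
model = MyModel()

# `ckpt_path` points to the checkpoint folder written by the sharded strategy
trainer.fit(model, ckpt_path="lightning_logs/version_0/checkpoints/epoch=0-step=1000.ckpt")
```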


### Fast and Memory-Optimized Initialization

A major challenge that users face when working with large models such as LLMs is dealing with the extreme memory requirements. Even something as simple as instantiating a model becomes non-trivial if the model is so large it won't fit in a single GPU or even a single machine. In Lightning 2.1, we are introducing empty-weights initialization through the `Fabric.init_module()` ([#17462](https://github.com/Lightning-AI/lightning/pull/17462), [#17627](https://github.com/Lightning-AI/lightning/pull/17627)) and `Trainer.init_module()`/`LightningModule.configure_model()` ([#18004](https://github.com/Lightning-AI/lightning/pull/18004), [#18385](https://github.com/Lightning-AI/lightning/pull/18385)) methods:


**Trainer:**
```python
import lightning as L

class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Delay initialization of model to `configure_model()`

    def configure_model(self):
        # Model initialized in correct precision and weights on meta-device
        self.model = ...

    ...

trainer = L.Trainer(strategy="fsdp", ...)
model = MyModel()
trainer.fit(model)
```

**Fabric:**
```python
import lightning as L

fabric = L.Fabric(strategy="fsdp", ...)

# Model initialized in correct precision and weights on meta-device
with fabric.init_module(empty_init=True):
    model = ...
    

# You can also initialize buffers and tensors directly on device and dtype
with fabric.init_tensor():
    model.mask.create()
    model.kv_cache.create()
    x = torch.randn(4, 128)

# Materialization and sharding of model happens inside here
model = fabric.setup(model)
```

Read more about this new feature and its other benefits in our docs ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_init.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_init.html)).


### User-Friendly Configuration

We made it super easy to configure the sharding- and activation-checkpointing policy when you want to auto-wrap particular layers of your model for advanced control ([#18045](https://github.com/Lightning-AI/lightning/pull/18045), [#18084](https://github.com/Lightning-AI/lightning/pull/18084)).

```diff
  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.wrap import ModuleWrapPolicy

- strategy = FSDPStrategy(auto_wrap_policy=ModuleWrapPolicy({MyTransformerBlock}))
+ strategy = FSDPStrategy(auto_wrap_policy={MyTransformerBlock})
  trainer = L.Trainer(strategy=strategy, ...)
```

Furthermore, the sharding strategy can now be conveniently set with a string value ([#18087](https://github.com/Lightning-AI/lightning/pull/18087)):

```diff
  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.fully_sharded_data_parallel import ShardingStrategy

- strategy = FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
+ strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
  trainer = L.Trainer(strategy=strategy, ...)
```

You no longer need to remember the long PyTorch imports! Fabric also supports all of the improvements shown above.
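A minimal sketch of the same simplified configuration in Fabric (`MyTransformerBlock` stands in for your own module class):

```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Pass your block class directly and use a string for the sharding strategy
strategy = FSDPStrategy(auto_wrap_policy={MyTransformerBlock}, sharding_strategy="SHARD_GRAD_OP")
fabric = L.Fabric(strategy=strategy, devices=8)
```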


## True Half-Precision

Lightning now supports true half-precision for training and inference with all built-in strategies ([#18193](https://github.com/Lightning-AI/lightning/pull/18193), [#18217](https://github.com/Lightning-AI/lightning/pull/18217), [#18213](https://github.com/Lightning-AI/lightning/pull/18213), [#18219](https://github.com/Lightning-AI/lightning/pull/18219)). With this setting, the memory required to store the model weights is only half of what is normally needed when running with float32. In addition, you get the same speed benefits as mixed precision training (`precision="16-mixed"`):

```python
import lightning as L

# default
trainer = L.Trainer(precision="32-true")

# train with model weights in `torch.float16`
trainer = L.Trainer(precision="16-true")

# train with model weights in `torch.bfloat16`
# (if hardware supports it)
trainer = L.Trainer(precision="bf16-true")
```

The same settings are also available in Fabric! We recommend trying bfloat16 training (`precision="bf16-true"`), as it is often more numerically stable than regular 16-bit precision (`precision="16-true"`).
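For reference, a minimal sketch of the equivalent Fabric flags:

```python
import lightning as L

# Model weights in `torch.bfloat16` (if hardware supports it)
fabric = L.Fabric(precision="bf16-true")

# Or true float16 weights
fabric = L.Fabric(precision="16-true")
```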


## Bitsandbytes Quantization

With the new [Bitsandbytes precision plugin](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#quantization-via-bitsandbytes) ([#18655](https://github.com/Lightning-AI/lightning/pull/18655)), you can now quantize your model for significant memory savings during training, finetuning, or inference with a selection of several state-of-the-art quantization algorithms (int8, fp4, nf4, and more). For the first time, Trainer and Fabric make [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) easy to use for general models.


**Trainer:**
```python
import lightning as L
from lightning.pytorch.plugins import BitsandbytesPrecisionPlugin

# this will pick out the compute dtype automatically, by default `bfloat16`
precision = BitsandbytesPrecisionPlugin("nf4-dq")
trainer = L.Trainer(plugins=precision)
```

**Fabric:**
```python
import lightning as L
from lightning.fabric.plugins import BitsandbytesPrecision

# this will pick out the compute dtype automatically, by default `bfloat16`
precision = BitsandbytesPrecision("nf4-dq")
fabric = L.Fabric(plugins=precision)
```

[Learn more!](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#quantization-via-bitsandbytes)


## Transformer Engine

The [Transformer Engine by NVIDIA](https://docs.nvidia.com/deeplearning/transformer-engine) is a library for accelerating transformer layers on the new Hopper (H100) generation of GPUs. With the integration in Lightning Trainer and Fabric ([#17597](https://github.com/Lightning-AI/lightning/pull/17597), [#18459](https://github.com/Lightning-AI/lightning/pull/18459)), you have easy access to 8-bit mixed precision for significant speed-ups:

**Trainer:**
```python
import lightning as L

# Select 8-bit mixed precision via TransformerEngine, with model weights in float16
trainer = L.Trainer(precision="transformer-engine-float16")
```

**Fabric:**
```python
import lightning as L

# Select 8-bit mixed precision via TransformerEngine, with model weights in float16
fabric = L.Fabric(precision="transformer-engine-float16")
```

More configuration options are available through the respective plugins in [Trainer](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#float8-mixed-precision-via-nvidia-s-transformerengine) and [Fabric](https://lightning.ai/docs/fabric/latest/fundamentals/precision.html#float8-mixed-precision-via-nvidia-s-transformerengine).
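For instance, a minimal sketch of configuring the Fabric plugin directly instead of using the string shortcut (in 2.1 the weights dtype argument is `dtype`; it was renamed to `weights_dtype` in 2.2, as noted in the changelog above):

```python
import torch
import lightning as L
from lightning.fabric.plugins import TransformerEnginePrecision

# 8-bit mixed precision with model weights kept in float16
precision = TransformerEnginePrecision(dtype=torch.float16)
fabric = L.Fabric(plugins=precision)
```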



## Lightning on TPU Goes Brrr

Lightning 2.1 runs on the latest generation of TPU hardware on Google Cloud! TPU-v4 and TPU-v5 ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) are now fully supported both in Fabric and Trainer and run using the new PjRT runtime by default ([#17352](https://github.com/Lightning-AI/lightning/pull/17352)). PjRT is the runtime used by JAX and has shown an average improvement of 35% on benchmarks.

**Trainer:**
```python
import lightning as L

trainer = L.Trainer(accelerator="tpu", devices=8)
model = MyModel()
trainer.fit(model)  # uses PjRT if available
```

**Fabric:**
```python
import lightning as L


def train(fabric):
    ...

fabric = L.Fabric(accelerator="tpu")
fabric.launch(train)  # uses PjRT if available
```

And what's even more exciting, you can now scale massive multi-billion parameter models on TPUs using FSDP ([#17421](https://github.com/Lightning-AI/lightning/pull/17421)).

```python
import lightning as L
from lightning.fabric.strategies import XLAFSDPStrategy

strategy = XLAFSDPStrategy(
    # Most arguments from the PyTorch native FSDP strategy are also available here!
    auto_wrap_policy={Block},
    activation_checkpointing_policy={Block},
    state_dict_type="full",
    sequential_save=True,
)
    
fabric = L.Fabric(devices=8, strategy=strategy)
fabric.launch(finetune)
```
You can find a full end-to-end finetuning example script in our [Lit-GPT repository](https://github.com/Lightning-AI/lit-gpt/blob/main/xla/finetune/adapter.py). The new XLA-FSDP strategy is experimental and currently only available in Fabric. Support in the Trainer will follow in the future.



## Granular Control Over Checkpoints in Fabric

Several improvements for checkpoint saving and loading have landed in Fabric, enabling more fine-grained control over what is saved/loaded while reducing boilerplate code:

1. There is a new `Fabric.load_raw()` method with which you can load model- or optimizer state-dicts saved externally by a non-Fabric application (e.g., raw PyTorch) ([#18049](https://github.com/Lightning-AI/lightning/pull/18049))

    ```python
    import lightning as L
    
    fabric = L.Fabric()
    model = MyModel()

    # A model weights file saved by your friend who doesn't use Fabric
    fabric.load_raw("path/to/model.pt", model)

    # Equivalent to this:
    # model.load_state_dict(torch.load("path/to/model.pt"))
    ```

2. A new parameter `Fabric.load(..., strict=True|False)` to disable strict loading ([#17645](https://github.com/Lightning-AI/lightning/pull/17645))

    ```python
    import lightning as L
    
    fabric = L.Fabric()
    model = MyModel()
    state = {"model": model}

    # strict loading is the default
    fabric.load("path/to/checkpoint.ckpt", state, strict=True)

    # disable strict loading
    fabric.load("path/to/checkpoint.ckpt", state, strict=False)
    ```

3. A new parameter `Fabric.save(..., filter=...)` that enables you to exclude certain parameters of your model without writing boilerplate code for it ([#17845](https://github.com/Lightning-AI/lightning/pull/17845))


    ```python
    import lightning as L
    
    fabric = L.Fabric()
    model, optimizer = ...

    state = {"model": model, "optimizer": optimizer, "foo": 123}

    # save only the weights that match a pattern
    filter = {"model": lambda k, v: "weight" in k}
    fabric.save("path/to/checkpoint.ckpt", state, filter=filter)
    ```

You can read more about the new options in our [checkpoint guide](https://lightning.ai/docs/fabric/latest/guide/checkpoint.html).



# Backward Incompatible Changes

The release of PyTorch Lightning 2.0 was a big step into a new chapter: it brought a more polished API and removed a lot of legacy code as well as outdated and experimental features, at the cost of a long list of breaking changes that made the upgrade from 1.9 to 2.0 more work than usual. Moving forward, we promised to maintain full backward compatibility of our public core APIs to guarantee a smooth upgrade experience for everyone, and with 2.1 we are happy to deliver on this promise. A few exceptions were made where we felt the change was justified because it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.




## PyTorch Lightning

### TPU/XLA Changes

When selecting device indices via `devices=[i]`, the Trainer now selects the i-th TPU core (0-based, previously it was 1-based) ([#17227](https://github.com/Lightning-AI/lightning/pull/17227))

**Before:**
```python
# Selects the first TPU core (1-based index)
trainer = Trainer(accelerator="tpu", devices=[1])
```

**Now:**
```python
# Selects the second TPU core (0-based index)
trainer = Trainer(accelerator="tpu", devices=[1])
```

### Multi-GPU in Jupyter Notebooks

Due to lack of reliability, Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([#18291](https://github.com/Lightning-AI/lightning/pull/18291))


**Before:**
```python
import lightning as L

# In Jupyter notebooks, this would select all available GPUs (DDP)
trainer = L.Trainer(accelerator="cuda", devices="auto")
```

**Now:**
```python
# In Jupyter notebooks, this now selects only one GPU (the first)
trainer = L.Trainer(accelerator="cuda", devices="auto")

# You can still explicitly select multiple
trainer = L.Trainer(accelerator="cuda", devices=8)
```

### Device Access in Setup Hook

- During `LightningModule.setup()`, the `self.device` now returns the device the module *will be placed on* instead of `cpu` ([#18021](https://github.com/Lightning-AI/lightning/pull/18021))

**Before:**
```python
def setup(self, stage):
    # CPU regardless of the accelerator used
    print(self.device)
```

**Now:**
```python
def setup(self, stage):
    # CPU/CUDA/MPS/XLA depending on accelerator
    print(self.device)
```
    
### Miscellaneous Changes

- `self.log`ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations ([#17334](https://github.com/Lightning-AI/lightning/pull/17334))
- The `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook ([#18358](https://github.com/Lightning-AI/lightning/pull/18358))
- The `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are a no-op now ([#18358](https://github.com/Lightning-AI/lightning/pull/18358))
- Removed experimental support for `torchdistx` due to a lack of project maintenance ([#17995](https://github.com/Lightning-AI/lightning/pull/17995))
- Dropped support for PyTorch 1.11 ([#18691](https://github.com/Lightning-AI/lightning/pull/18691))



## Lightning Fabric

We thank the community for the amazing feedback we got for [Fabric](https://lightning.ai/docs/fabric/stable/) so far - keep it coming. The list of breaking changes is short and won't affect the vast majority of users.

### Sharding Context Manager in Fabric.run()

We removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users who launch this way. Note that `Fabric.run` is a legacy construct from the `LightningLite` days and is not recommended today. Instead, instantiate your large FSDP or DeepSpeed model under the newly added `fabric.init_module` context manager ([#17832](https://github.com/Lightning-AI/lightning/pull/17832)).

**Before:**
```python
import lightning as L

def train(fabric):
    # FSDP's `enable_wrap` context or `deepspeed.zero.Init()`
    # were applied automatically here
    model = LargeModel()
    ...
        
fabric = L.Fabric()
fabric.launch(train)
```

**Now:**
```python
def train(fabric):
    # Use `init_module` explicitly to apply these context managers
    with fabric.init_module():
        model = LargeModel()
    ...
```

### Multi-GPU in Jupyter Notebooks

Due to lack of reliability, Fabric now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([#18291](https://github.com/Lightning-AI/lightning/pull/18291))


**Before:**
```python
import lightning as L

# In Jupyter notebooks, this would select all available GPUs (DDP)
fabric = L.Fabric(accelerator="cuda", devices="auto")
```

**Now:**
```python
# In Jupyter notebooks, this now selects only one GPU (the first)
fabric = L.Fabric(accelerator="cuda", devices="auto")

# You can still explicitly select multiple
fabric = L.Fabric(accelerator="cuda", devices=8)
```




# CHANGELOG


## PyTorch Lightning

Added - Added `metrics_format` attribute to `RichProgressBarTheme` class ([#18373](https://github.com/Lightning-AI/lightning/pull/18373)) - Added `CHECKPOINT_EQUALS_CHAR` attribute to `ModelCheckpoint` class ([#17999](https://github.com/Lightning-AI/lightning/pull/17999)) - Added `**summarize_kwargs` to `ModelSummary` and `RichModelSummary` callbacks ([#16788](https://github.com/Lightning-AI/lightning/pull/16788)) - Added support for the `max_size_cycle|max_size|min_size` iteration modes during evaluation ([#17163](https://github.com/Lightning-AI/lightning/pull/17163)) - Added support for the TPU-v4 architecture ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Added support for XLA's new PJRT runtime ([#17352](https://github.com/Lightning-AI/lightning/pull/17352)) - Check for invalid TPU device inputs ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Added `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices ([#17522](https://github.com/Lightning-AI/lightning/pull/17522)) - Added support for multiple optimizer parameter groups when using the FSDP strategy ([#17309](https://github.com/Lightning-AI/lightning/pull/17309)) - Enabled saving the full model state dict when using the `FSDPStrategy` ([#16558](https://github.com/Lightning-AI/lightning/pull/16558)) - Update `LightningDataModule.from_datasets` to support arbitrary iterables ([#17402](https://github.com/Lightning-AI/lightning/pull/17402)) - Run the DDP wrapper in a CUDA stream ([#17334](https://github.com/Lightning-AI/lightning/pull/17334)) - Added `SaveConfigCallback.save_config` to ease use cases such as saving the config to a logger ([#17475](https://github.com/Lightning-AI/lightning/pull/17475)) - Enabled optional file versioning of model checkpoints ([#17320](https://github.com/Lightning-AI/lightning/pull/17320)) - Added the process group timeout argument `FSDPStrategy(timeout=...)` for the FSDP strategy ([#17274](https://github.com/Lightning-AI/lightning/pull/17274)) - Added `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) ([#18045](https://github.com/Lightning-AI/lightning/pull/18045)) - Added CLI option `--map-to-cpu` to the checkpoint upgrade script to enable converting GPU checkpoints on a CPU-only machine ([#17527](https://github.com/Lightning-AI/lightning/pull/17527)) - Added non-layer param count to the model summary ([#17005](https://github.com/Lightning-AI/lightning/pull/17005)) - Updated `LearningRateMonitor` to log monitored values to `trainer.callback_metrics` ([#17626](https://github.com/Lightning-AI/lightning/pull/17626)) - Added `log_weight_decay` argument to `LearningRateMonitor` callback ([#18439](https://github.com/Lightning-AI/lightning/pull/18439)) - Added `Trainer.print()` to print on local rank zero only ([#17980](https://github.com/Lightning-AI/lightning/pull/17980)) - Added `Trainer.init_module()` context manager to instantiate large models efficiently directly on device, dtype ([#18004](https://github.com/Lightning-AI/lightning/pull/18004)) * Creates the model parameters in the desired dtype (`torch.float32`, `torch.float64`) depending on the 'true' precision choice in `Trainer(precision='32-true'|'64-true')` - Added the `LightningModule.configure_model()` hook to instantiate large models efficiently directly on device, dtype, and with sharding support ([#18004](https://github.com/Lightning-AI/lightning/pull/18004)) * Handles 
initialization for FSDP models before wrapping and the Zero stage 3 initialization for DeepSpeed before sharding - Added support for meta-device initialization with `Trainer.init_module(empty_init=True)` in FSDP ([#18385](https://github.com/Lightning-AI/lightning/pull/18385)) - Added `lightning.pytorch.plugins.PrecisionPlugin.module_init_context()` and `lightning.pytorch.strategies.Strategy.tensor_init_context()` context managers to control model and tensor instantiation ([#18004](https://github.com/Lightning-AI/lightning/pull/18004)) - Automatically call `xla_model.mark_step()` before saving checkpoints with XLA ([#17882](https://github.com/Lightning-AI/lightning/pull/17882)) - Added a callback for spike-detection ([#18014](https://github.com/Lightning-AI/lightning/pull/18014)) - Added the ability to set the `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` ([#18087](https://github.com/Lightning-AI/lightning/pull/18087)) - Improved error messages when attempting to load a DeepSpeed checkpoint at an invalid path ([#17795](https://github.com/Lightning-AI/lightning/pull/17795)) - Allowed accessing rank information in the main process before processes are launched when using the `XLAStrategy` ([#18194](https://github.com/Lightning-AI/lightning/pull/18194)) - Added support for true half-precision training via `Trainer(precision="16-true"|"bf16-true")` ([#18193](https://github.com/Lightning-AI/lightning/pull/18193), [#18217](https://github.com/Lightning-AI/lightning/pull/18217), [#18213](https://github.com/Lightning-AI/lightning/pull/18213), [#18219](https://github.com/Lightning-AI/lightning/pull/18219)) - Added automatic process cleanup to avoid zombie child processes and stalls when exceptions are raised ([#18218](https://github.com/Lightning-AI/lightning/pull/18218)) - Added validation of user input for `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` ([#18292](https://github.com/Lightning-AI/lightning/pull/18292)) - Added support for saving checkpoints with either full state-dict or sharded state dict via `FSDPStrategy(state_dict_type="full"|"sharded")` ([#18364](https://github.com/Lightning-AI/lightning/pull/18364)) - Added support for loading sharded/distributed checkpoints in FSDP ([#18358](https://github.com/Lightning-AI/lightning/pull/18358)) - Made the text delimiter in the rich progress bar configurable ([#18372](https://github.com/Lightning-AI/lightning/pull/18372)) - Improved the error messaging and instructions when handling custom batch samplers in distributed settings ([#18402](https://github.com/Lightning-AI/lightning/pull/18402)) - Added support for mixed 8-bit precision as `Trainer(precision="transformer-engine")` using [Nvidia's Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine) ([#18459](https://github.com/Lightning-AI/lightning/pull/18459)) - Added support for linear layer quantization with `Trainer(plugins=BitsandbytesPrecision())` using [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) ([#18655](https://github.com/Lightning-AI/lightning/pull/18655)) - Added support for passing the process group to the `FSDPStrategy` ([#18583](https://github.com/Lightning-AI/lightning/pull/18583)) - Enabled the default process group configuration for FSDP's hybrid sharding ([#18583](https://github.com/Lightning-AI/lightning/pull/18583)) - Added `lightning.pytorch.utilities.suggested_max_num_workers` to assist with setting a good value in distributed settings 
([#18591](https://github.com/Lightning-AI/lightning/pull/18591)) - Improved the `num_workers` warning to give a more accurate upper limit on the `num_workers` suggestion ([#18591](https://github.com/Lightning-AI/lightning/pull/18591)) - Added `lightning.pytorch.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines ([#18586](https://github.com/Lightning-AI/lightning/pull/18586)) - Added support for returning an object of type `Mapping` from `LightningModule.training_step()` ([#18657](https://github.com/Lightning-AI/lightning/pull/18657)) - Added the hook `LightningModule.on_validation_model_zero_grad()` to allow overriding the behavior of zeroing the gradients before entering the validation loop ([#18710](https://github.com/Lightning-AI/lightning/pull/18710))
Changed - Changed default metric formatting from `round(..., 3)` to `".3f"` format string in `MetricsTextColumn` class ([#18483](https://github.com/Lightning-AI/lightning/pull/18483)) - Removed the limitation to call `self.trainer.model.parameters()` in `LightningModule.configure_optimizers()` ([#17309](https://github.com/Lightning-AI/lightning/pull/17309)) - `Trainer(accelerator="tpu", devices=[i])"` now selects the i-th TPU core (0-based, previously it was 1-based) ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Allow using iterable-style datasets with TPUs ([#17331](https://github.com/Lightning-AI/lightning/pull/17331)) - Increased the minimum XLA requirement to 1.13 ([#17368](https://github.com/Lightning-AI/lightning/pull/17368)) - `self.log`ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations ([#17334](https://github.com/Lightning-AI/lightning/pull/17334)) - Made the run initialization in `WandbLogger` lazy to avoid creating artifacts when the CLI is used ([#17573](https://github.com/Lightning-AI/lightning/pull/17573)) - Simplified redirection of `*_step` methods in strategies by removing the `_LightningModuleWrapperBase` wrapper module ([#17531](https://github.com/Lightning-AI/lightning/pull/17531)) - Support kwargs input for LayerSummary ([#17709](https://github.com/Lightning-AI/lightning/pull/17709)) - Dropped support for `wandb` versions older than 0.12.0 in `WandbLogger` ([#17876](https://github.com/Lightning-AI/lightning/pull/17876)) - During `LightningModule.setup()`, the `self.device` now returns the device the module will be placed on instead of `cpu` ([#18021](https://github.com/Lightning-AI/lightning/pull/18021)) - Increased the minimum supported `wandb` version for `WandbLogger` from 0.12.0 to 0.12.10 ([#18171](https://github.com/Lightning-AI/lightning/pull/18171)) - The input tensors now get cast to the right precision type before transfer to the device ([#18264](https://github.com/Lightning-AI/lightning/pull/18264)) - Improved the formatting of emitted warnings ([#18288](https://github.com/Lightning-AI/lightning/pull/18288)) - Broadcast and reduction of tensors with XLA-based strategies now preserve the input's device ([#18275](https://github.com/Lightning-AI/lightning/pull/18275)) - The `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook ([#18358](https://github.com/Lightning-AI/lightning/pull/18358)) - The `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are a no-op now ([#18358](https://github.com/Lightning-AI/lightning/pull/18358)) - The `Trainer.num_val_batches`, `Trainer.num_test_batches` and `Trainer.num_sanity_val_batches` now return a list of sizes per dataloader instead of a single integer ([#18441](https://github.com/Lightning-AI/lightning/pull/18441)) - The `*_step(dataloader_iter)` flavor now no longer takes the `batch_idx` in the signature ([#18390](https://github.com/Lightning-AI/lightning/pull/18390)) - Calling `next(dataloader_iter)` now returns a triplet `(batch, batch_idx, dataloader_idx)` ([#18390](https://github.com/Lightning-AI/lightning/pull/18390)) - Calling `next(combined_loader)` now returns a triplet `(batch, batch_idx, dataloader_idx)` ([#18390](https://github.com/Lightning-AI/lightning/pull/18390)) - Due to lack of reliability, Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([#18291](https://github.com/Lightning-AI/lightning/pull/18291)) - Made 
the `batch_idx` argument optional in `validation_step`, `test_step` and `predict_step` to maintain consistency with `training_step` ([#18512](https://github.com/Lightning-AI/lightning/pull/18512)) - The `TQDMProgressBar` now consistently shows it/s for the speed even when the iteration time becomes larger than one second ([#18593](https://github.com/Lightning-AI/lightning/pull/18593)) - The `LightningDataModule.load_from_checkpoint` and `LightningModule.load_from_checkpoint` methods now raise an error if they are called on an instance instead of the class ([#18432](https://github.com/Lightning-AI/lightning/pull/18432)) - Enabled launching via `torchrun` in a SLURM environment; the `TorchElasticEnvironment` now gets chosen over the `SLURMEnvironment` if both are detected ([#18618](https://github.com/Lightning-AI/lightning/pull/18618)) - If not set by the user, Lightning will set `OMP_NUM_THREADS` to `num_cpus / num_processes` when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks ([#18677](https://github.com/Lightning-AI/lightning/pull/18677)) - The `ModelCheckpoint` no longer deletes files under the save-top-k mechanism when resuming from a folder that is not the same as the current checkpoint folder ([#18750](https://github.com/Lightning-AI/lightning/pull/18750)) - The `ModelCheckpoint` no longer deletes the file that was passed to `Trainer.fit(ckpt_path=...)` ([#18750](https://github.com/Lightning-AI/lightning/pull/18750)) - Calling `trainer.fit()` twice now raises an error with strategies that spawn subprocesses through `multiprocessing` (ddp_spawn, xla) ([#18776](https://github.com/Lightning-AI/lightning/pull/18776)) - The `ModelCheckpoint` now saves a symbolic link if `save_last=True` and `save_top_k != 0` ([#18748](https://github.com/Lightning-AI/lightning/pull/18748))
Deprecated - Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUPrecisionPlugin` in favor of `XLAPrecisionPlugin` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUBf16PrecisionPlugin` in favor of `XLABf16PrecisionPlugin` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `Strategy.post_training_step` method ([#17531](https://github.com/Lightning-AI/lightning/pull/17531)) - Deprecated the `LightningModule.configure_sharded_model` hook in favor of `LightningModule.configure_model` ([#18004](https://github.com/Lightning-AI/lightning/pull/18004)) - Deprecated the `LightningDoublePrecisionModule` wrapper in favor of calling `Trainer.precision_plugin.convert_input()` ([#18209](https://github.com/Lightning-AI/lightning/pull/18209))
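
As a quick reference, a minimal sketch of the `configure_model` hook that replaces the deprecated `configure_sharded_model`; the model definition is illustrative:

```python
# A sketch only; the layer sizes are illustrative.
import torch
from lightning.pytorch import LightningModule


class BigModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = None

    def configure_model(self):
        # Replaces the deprecated `configure_sharded_model` hook. Runs before
        # the strategy (e.g. FSDP) wraps the module; guard for idempotency.
        if self.model is None:
            self.model = torch.nn.Sequential(
                torch.nn.Linear(1024, 4096),
                torch.nn.ReLU(),
                torch.nn.Linear(4096, 1024),
            )

    def training_step(self, batch, batch_idx):
        return self.model(batch).sum()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```
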
Removed - Removed the `XLAStrategy.is_distributed` property. It is always True ([#17381](https://github.com/Lightning-AI/lightning/pull/17381)) - Removed the `SingleTPUStrategy.is_distributed` property. It is always False ([#17381](https://github.com/Lightning-AI/lightning/pull/17381)) - Removed experimental support for `torchdistx` due to a lack of project maintenance ([#17995](https://github.com/Lightning-AI/lightning/pull/17995)) - Removed support for PyTorch 1.11 ([#18691](https://github.com/Lightning-AI/lightning/pull/18691))
Fixed - Fixed an issue with reusing the same model across multiple trainer stages when using the `DeepSpeedStrategy` ([#17531](https://github.com/Lightning-AI/lightning/pull/17531)) - Fixed the saving and loading of FSDP optimizer states ([#17819](https://github.com/Lightning-AI/lightning/pull/17819)) - Fixed FSDP re-applying activation checkpointing when the user had manually applied it already ([#18006](https://github.com/Lightning-AI/lightning/pull/18006)) - Fixed issue where unexpected exceptions would leave the default torch dtype modified when using true precision settings ([#18500](https://github.com/Lightning-AI/lightning/pull/18500)) - Fixed issue where not including the `batch_idx` argument in the `training_step` would disable gradient accumulation ([#18619](https://github.com/Lightning-AI/lightning/pull/18619)) - Fixed the replacement of callbacks returned in `LightningModule.configure_callbacks` when the callback was a subclass of an existing Trainer callback ([#18508](https://github.com/Lightning-AI/lightning/pull/18508)) - Fixed `Trainer.log_dir` not returning the correct directory for the `CSVLogger` ([#18548](https://github.com/Lightning-AI/lightning/pull/18548)) - Fixed redundant input-type casting in FSDP precision ([#18630](https://github.com/Lightning-AI/lightning/pull/18630)) - Fixed numerical issues when reducing values in low precision with `self.log` ([#18686](https://github.com/Lightning-AI/lightning/pull/18686)) - Fixed an issue that would cause the gradients to be erased if validation happened in the middle of a gradient accumulation phase ([#18710](https://github.com/Lightning-AI/lightning/pull/18710)) - Fixed redundant file writes in `CSVLogger` ([#18567](https://github.com/Lightning-AI/lightning/pull/18567)) - Fixed an issue that could lead to checkpoint files being deleted accidentally when resuming training ([#18750](https://github.com/Lightning-AI/lightning/pull/18750))
## Lightning Fabric
Added - Added support for the TPU-v4 architecture ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Added support for XLA's new PJRT runtime ([#17352](https://github.com/Lightning-AI/lightning/pull/17352)) - Added support for Fully Sharded Data Parallel (FSDP) training with XLA ([#18126](https://github.com/Lightning-AI/lightning/pull/18126), [#18424](https://github.com/Lightning-AI/lightning/pull/18424), [#18430](https://github.com/Lightning-AI/lightning/pull/18430)) - Check for invalid TPU device inputs ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Added `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices ([#17522](https://github.com/Lightning-AI/lightning/pull/17522)) - Added support for joint setup of model and optimizer with FSDP ([#17305](https://github.com/Lightning-AI/lightning/pull/17305)) - Added support for handling multiple parameter groups in optimizers set up with FSDP ([#17305](https://github.com/Lightning-AI/lightning/pull/17305)) - Added support for saving and loading sharded model and optimizer state with `FSDPStrategy` ([#17323](https://github.com/Lightning-AI/lightning/pull/17323)) - Added a warning when calling methods on `_FabricModule` that bypass the strategy-specific wrappers ([#17424](https://github.com/Lightning-AI/lightning/pull/17424)) - Added `Fabric.init_tensor()` context manager to instantiate tensors efficiently directly on device and dtype ([#17488](https://github.com/Lightning-AI/lightning/pull/17488)) - Added `Fabric.init_module()` context manager to instantiate large models efficiently directly on device, dtype, and with sharding support ([#17462](https://github.com/Lightning-AI/lightning/pull/17462)) * Creates the model parameters in the desired dtype (`torch.float32`, `torch.float64`, `torch.float16`, or `torch.bfloat16`) depending on the 'true' precision choice in `Fabric(precision='32-true'|'64-true'|'16-true'|'bf16-true')` * Handles initialization for FSDP models before wrapping and the Zero stage 3 initialization for DeepSpeed before sharding - Added support for empty weight initialization with `Fabric.init_module(empty_init=True)` for checkpoint loading ([#17627](https://github.com/Lightning-AI/lightning/pull/17627)) - Added support for meta-device initialization with `Fabric.init_module(empty_init=True)` in FSDP ([#18122](https://github.com/Lightning-AI/lightning/pull/18122)) - Added `lightning.fabric.plugins.Precision.module_init_context()` and `lightning.fabric.strategies.Strategy.module_init_context()` context managers to control model and tensor instantiation ([#17462](https://github.com/Lightning-AI/lightning/pull/17462)) - `lightning.fabric.strategies.Strategy.tensor_init_context()` context manager to instantiate tensors efficiently directly on device and dtype ([#17607](https://github.com/Lightning-AI/lightning/pull/17607)) - Run the DDP wrapper in a CUDA stream ([#17334](https://github.com/Lightning-AI/lightning/pull/17334)) - Added support for true half-precision as `Fabric(precision="16-true"|"bf16-true")` ([#17287](https://github.com/Lightning-AI/lightning/pull/17287)) - Added support for mixed 8-bit precision as `Fabric(precision="transformer-engine")` using [Nvidia's Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine) ([#17597](https://github.com/Lightning-AI/lightning/pull/17597)) - Added support for linear layer quantization with `Fabric(plugins=BitsandbytesPrecision())` using 
[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) ([#18655](https://github.com/Lightning-AI/lightning/pull/18655)) - Added error messaging for missed `.launch()` when it is required ([#17570](https://github.com/Lightning-AI/lightning/pull/17570)) - Added support for saving checkpoints with either full state-dict or sharded state dict via `FSDPStrategy(state_dict_type="full"|"sharded")` ([#17526](https://github.com/Lightning-AI/lightning/pull/17526)) - Added support for loading a full-state checkpoint file into a sharded model ([#17623](https://github.com/Lightning-AI/lightning/pull/17623)) - Added support for calling hooks on a LightningModule via `Fabric.call` ([#17874](https://github.com/Lightning-AI/lightning/pull/17874)) - Added the parameter `Fabric.load(..., strict=True|False)` to enable non-strict loading of partial checkpoint state ([#17645](https://github.com/Lightning-AI/lightning/pull/17645)) - Added the parameter `Fabric.save(..., filter=...)` to enable saving a partial checkpoint state ([#17845](https://github.com/Lightning-AI/lightning/pull/17845)) - Added support for loading optimizer states from a full-state checkpoint file ([#17747](https://github.com/Lightning-AI/lightning/pull/17747)) - Automatically call `xla_model.mark_step()` before saving checkpoints with XLA ([#17882](https://github.com/Lightning-AI/lightning/pull/17882)) - Automatically call `xla_model.mark_step()` after `optimizer.step()` with XLA ([#17883](https://github.com/Lightning-AI/lightning/pull/17883)) - Added support for all half-precision modes in FSDP precision plugin ([#17807](https://github.com/Lightning-AI/lightning/pull/17807)) - Added `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) ([#18045](https://github.com/Lightning-AI/lightning/pull/18045)) - Added a callback for spike-detection ([#18014](https://github.com/Lightning-AI/lightning/pull/18014)) - Added the ability to set the `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` ([#18087](https://github.com/Lightning-AI/lightning/pull/18087)) - Improved error messages when attempting to load a DeepSpeed checkpoint at an invalid path ([#17795](https://github.com/Lightning-AI/lightning/pull/17795)) - Added `Fabric.load_raw()` for loading raw PyTorch state dict checkpoints for model or optimizer objects ([#18049](https://github.com/Lightning-AI/lightning/pull/18049)) - Allowed accessing rank information in the main process before processes are launched when using the `XLAStrategy` ([#18194](https://github.com/Lightning-AI/lightning/pull/18194)) - Added automatic process cleanup to avoid zombie child processes and stalls when exceptions are raised ([#18218](https://github.com/Lightning-AI/lightning/pull/18218)) - Added validation of user input for `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` ([#18292](https://github.com/Lightning-AI/lightning/pull/18292)) - Improved the error messaging and instructions when handling custom batch samplers in distributed settings ([#18402](https://github.com/Lightning-AI/lightning/pull/18402)) - Added support for saving and loading stateful objects other than modules and optimizers ([#18513](https://github.com/Lightning-AI/lightning/pull/18513)) - Enabled the default process group configuration for FSDP's hybrid sharding ([#18583](https://github.com/Lightning-AI/lightning/pull/18583)) - Added `lightning.fabric.utilities.suggested_max_num_workers` to assist with setting a 
good value in distributed settings ([#18591](https://github.com/Lightning-AI/lightning/pull/18591)) - Added `lightning.fabric.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines ([#18586](https://github.com/Lightning-AI/lightning/pull/18586)) - Removed support for PyTorch 1.11 ([#18691](https://github.com/Lightning-AI/lightning/pull/18691)) - Added support for passing the argument `.load_state_dict(..., assign=True|False)` on Fabric-wrapped modules in PyTorch 2.1 or newer ([#18690](https://github.com/Lightning-AI/lightning/pull/18690))
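
A short sketch combining a few of the additions above (`Fabric.init_module()`, true half precision, and `Fabric.save`/`Fabric.load(strict=False)`); the model, learning rate, and checkpoint path are illustrative and a CUDA GPU is assumed:

```python
# A sketch only; model, learning rate, and checkpoint path are illustrative.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=1, precision="bf16-true")
fabric.launch()

# Parameters are created directly on the target device and in bfloat16.
with fabric.init_module():
    model = torch.nn.Transformer(d_model=256, nhead=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

state = {"model": model, "optimizer": optimizer, "step": 0}
fabric.save("checkpoint.ckpt", state)
# Non-strict loading tolerates a checkpoint that contains only part of the state.
fabric.load("checkpoint.ckpt", state, strict=False)
```
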
Changed - Allow using iterable-style datasets with TPUs ([#17331](https://github.com/Lightning-AI/lightning/pull/17331)) - Increased the minimum XLA requirement to 1.13 ([#17368](https://github.com/Lightning-AI/lightning/pull/17368)) - Fabric argument validation now only raises an error if conflicting settings are set through the CLI ([#17679](https://github.com/Lightning-AI/lightning/pull/17679)) - DataLoader re-instantiation is now only performed when a distributed sampler is required ([#18191](https://github.com/Lightning-AI/lightning/pull/18191)) - Improved the formatting of emitted warnings ([#18288](https://github.com/Lightning-AI/lightning/pull/18288)) - Broadcast and reduction of tensors with XLA-based strategies now preserve the input's device ([#18275](https://github.com/Lightning-AI/lightning/pull/18275)) - Due to lack of reliability, Fabric now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([#18291](https://github.com/Lightning-AI/lightning/pull/18291)) - Enabled launching via `torchrun` in a SLURM environment; the `TorchElasticEnvironment` now gets chosen over the `SLURMEnvironment` if both are detected ([#18618](https://github.com/Lightning-AI/lightning/pull/18618)) - If not set by the user, Lightning will set `OMP_NUM_THREADS` to `num_cpus / num_processes` when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks ([#18677](https://github.com/Lightning-AI/lightning/pull/18677))
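
A sketch of the dataloader-related behavior above, using the newly added `suggested_max_num_workers` helper; the dataset and the 2-processes-per-node assumption are illustrative:

```python
# A sketch only; the dataset and the 2-processes-per-node assumption are illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric
from lightning.fabric.utilities import suggested_max_num_workers

fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
fabric.launch()

dataset = TensorDataset(torch.randn(1024, 16))
# Cap `num_workers` based on how many processes share this node (2 here).
num_workers = suggested_max_num_workers(2)
loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
# Re-instantiated with a distributed sampler only because devices=2 requires one.
loader = fabric.setup_dataloaders(loader)
```
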
Deprecated - Deprecated the `DDPStrategy.is_distributed` property. This strategy is distributed by definition ([#17381](https://github.com/Lightning-AI/lightning/pull/17381)) - Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUPrecision` in favor of `XLAPrecision` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383)) - Deprecated the `TPUBf16Precision` in favor of `XLABf16Precision` ([#17383](https://github.com/Lightning-AI/lightning/pull/17383))
Removed - Removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users. Please instantiate your module under the newly added `fabric.init_module` context manager ([#17832](https://github.com/Lightning-AI/lightning/pull/17832)) - Removed the unsupported `checkpoint_io` argument from the `FSDPStrategy` ([#18192](https://github.com/Lightning-AI/lightning/pull/18192))
Fixed - Fixed issue where running on TPUs would select the wrong device index ([#17227](https://github.com/Lightning-AI/lightning/pull/17227)) - Removed the need to call `.launch()` when using the DP-strategy (`strategy="dp"`) ([#17931](https://github.com/Lightning-AI/lightning/pull/17931)) - Fixed FSDP re-applying activation checkpointing when the user had manually applied it already ([#18006](https://github.com/Lightning-AI/lightning/pull/18006)) - Fixed FSDP re-wrapping the module root when the user had manually wrapped the model ([#18054](https://github.com/Lightning-AI/lightning/pull/18054)) - Fixed issue where unexpected exceptions would leave the default torch dtype modified when using true precision settings ([#18500](https://github.com/Lightning-AI/lightning/pull/18500)) - Fixed redundant input-type casting in FSDP precision ([#18630](https://github.com/Lightning-AI/lightning/pull/18630)) - Fixed an issue with `find_usable_cuda_devices(0)` incorrectly returning a list of devices ([#18722](https://github.com/Lightning-AI/lightning/pull/18722)) - Fixed redundant file writes in `CSVLogger` ([#18567](https://github.com/Lightning-AI/lightning/pull/18567))
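
For reference, a minimal usage sketch of `find_usable_cuda_devices` mentioned in the fix above; it assumes at least two CUDA GPUs are visible:

```python
# A sketch only; assumes at least two CUDA GPUs are visible on the machine.
from lightning.fabric import Fabric
from lightning.fabric.accelerators import find_usable_cuda_devices

devices = find_usable_cuda_devices(2)  # e.g. [0, 1]: indices of two free GPUs
fabric = Fabric(accelerator="cuda", devices=devices)
```
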
## Lightning App
Added - Allow customizing `gradio` components with lightning colors ([#17054](https://github.com/Lightning-AI/lightning/pull/17054))
Changed - Changed `LocalSourceCodeDir` cache_location to not use home in some certain cases ([#17491](https://github.com/Lightning-AI/lightning/pull/17491))
Removed - Remove cluster commands from the CLI ([#18151](https://github.com/Lightning-AI/lightning/pull/18151))
**Full commit list**: https://github.com/Lightning-AI/lightning/compare/2.0.0...2.1.0

# Contributors

### Veteran

@adamjstewart @akreuzer @ethanwharris @dmitsf @lantiga @nicolai86 @pl-ghost @carmocca @awaelchli @justusschock @edenlightning @belerico @lightningforever @nisheethlahoti @tchaton @yurijmikhalevich @mauvilsa @rlizzo @rusmux @yhl48 @Liyang90 @jerome-habana @JustinGoheen @Borda @speediedan @SkafteNicki @dcfidalgo

### New

@saryazdi @parambharat @kshitij12345 @woqidaideshi @colehawkins @md-121 @gkroiz @idc9 @BoringDonut @OmerShubi @ishandutta0098 @ryan597 @leng-yue @alicanb @One-sixth @santurini @SpirinEgor @KogaiIrina @shanmugamr1992 @janeyx99 @asmith26 @dingusagar @AleksanderWWW @strawberrypie @solyaH @kaczmarj @voidful @water-vapor @bkiat1123 @rhiga2 @baskrahmer @felipewhitaker @mukhery @Quasar-Kim @robieta @one-matrix @jere357 @schmidt-ai @schuhschuh @anio @rjarun8 @callumhay @minhlong94 @klieret @giorgioskij @shihaoyin @JonathanRayner @NripeshN @marcimarc1 @bilelomrani1 @NikolasWolke @0x404 @quintenroets @Borodin @amorehead @SebastianGer @ioangatop @Tribhuvan0 @f0k @sameertantry @kwsp @nik777 @matsumotosan

### Did you know?

When Chuck Norris trains a neural network, it not only learns, but it also gains the ability to defend itself from adversarial attacks by roundhouse kicking them into submission.

Feature teaser (2023-10-10)

:rabbit: 

Hotfix for Conda package (2023-09-28)

No notes available

Weekly patch release (2023-09-14)

## App

### Fixed

- Replace LightningClient with import from lightning_cloud (#18544)

---

## Fabric

### Fixed

- Fixed an issue causing the `_FabricOptimizer.state` to remain outdated after loading with `load_state_dict` (#18488)
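
A minimal sketch of the behavior covered by this fix; the checkpoint path and model are illustrative:

```python
# A sketch only; "optimizer.pt" is an illustrative checkpoint path.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu")
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

# With this fix, the wrapped optimizer's `.state` reflects the loaded checkpoint.
optimizer.load_state_dict(torch.load("optimizer.pt"))
```
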

---

## PyTorch

### Fixed

- Fixed an issue that wouldn't prevent the user from setting the `log_model` parameter in `WandbLogger` via the LightningCLI (#18458)
- Fixed the display of `v_num` in the progress bar when running with `Trainer(fast_dev_run=True)` (#18491)
- Fixed `UnboundLocalError` when running with `python -O` (#18496)
- Fixed visual glitch with the TQDM progress bar leaving the validation bar incomplete before switching back to the training display (#18503)
- Fixed false positive warning about logging interval when running with `Trainer(fast_dev_run=True)` (#18550)

---

## Contributors

@awaelchli, @borda, @justusschock, @SebastianGer

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Weekly patch release (2023-08-30)

## App

### Changed

- Change top folder (#18212)
- Remove `_handle_is_headless` calls in app run loop (#18362)

### Fixed

- Refactored path to root to prevent a circular import (#18357)

---

## Fabric

### Changed

- On XLA, avoid setting the global rank before processes have been launched as this will initialize the PJRT computation client in the main process (#16966)

### Fixed

- Fixed model parameters getting shared between processes when running with `strategy="ddp_spawn"` and `accelerator="cpu"`; this has a necessary memory impact, as parameters are replicated for each process now (#18238)
- Removed false positive warning when using `fabric.no_backward_sync` with XLA strategies (#17761); a usage sketch follows this list
- Fixed issue where Fabric would not initialize the global rank, world size, and rank-zero-only rank after initialization and before launch (#16966)
- Fixed FSDP full-precision `param_dtype` training (`16-mixed`, `bf16-mixed` and `32-true` configurations) to avoid FSDP assertion errors with PyTorch < 2.0 (#18278)
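
A usage sketch of `fabric.no_backward_sync` referenced above; the model, data, and accumulation factor are illustrative:

```python
# A sketch only; model, data, and the accumulation factor are illustrative.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
fabric.launch()

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

accumulate = 4
for step in range(100):
    batch = torch.randn(8, 16)
    is_accumulating = (step + 1) % accumulate != 0
    # Skip the gradient all-reduce on accumulation steps.
    with fabric.no_backward_sync(model, enabled=is_accumulating):
        loss = model(batch).sum()
        fabric.backward(loss)
    if not is_accumulating:
        optimizer.step()
        optimizer.zero_grad()
```
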

---

## PyTorch

### Changed

- On XLA, avoid setting the global rank before processes have been launched as this will initialize the PJRT computation client in the main process (#16966)
- Fix inefficiency in rich progress bar (#18369)

### Fixed

- Fixed FSDP full-precision `param_dtype` training (`16-mixed` and `bf16-mixed` configurations) to avoid FSDP assertion errors with PyTorch < 2.0 (#18278)
- Fixed an issue that prevented the use of custom logger classes without an `experiment` property defined (#18093)
- Fixed setting the tracking uri in `MLFlowLogger` for logging artifacts to the MLFlow server (#18395)
- Fixed redundant `iter()` call to dataloader when checking dataloading configuration (#18415)
- Fixed model parameters getting shared between processes when running with `strategy="ddp_spawn"` and `accelerator="cpu"`; this has a necessary memory impact, as parameters are replicated for each process now (#18238)
- Properly manage `fetcher.done` with `dataloader_iter` (#18376)

---

## Contributors

@awaelchli, @Borda, @carmocca, @quintenroets, @rlizzo, @speediedan, @tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Weekly patch release (2023-08-16)

## App

### Changed

- Removed the top-level import `lightning.pdb`; import `lightning.app.pdb` instead (#18177)
- Client retries forever (#18065)

### Fixed

- Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)

---

## Fabric

### Changed

- Disabled the auto-detection of the Kubeflow environment (#18137)

### Fixed

- Fixed issue where DDP subprocesses that used Hydra would set hydra's working directory to current directory (#18145)
- Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)
- Fixed an issue with `Fabric.all_reduce()` not performing an inplace operation for all backends consistently (#18235)
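
A minimal sketch of `Fabric.all_reduce()` as referenced in the fix above; the per-rank value is illustrative:

```python
# A sketch only; the per-rank value is illustrative.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
fabric.launch()

local_loss = torch.tensor(float(fabric.global_rank))
# Averages the tensor across all processes ("mean" is the default reduce op).
avg_loss = fabric.all_reduce(local_loss, reduce_op="mean")
fabric.print(f"averaged: {avg_loss.item()}")
```
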

---

## PyTorch

### Added

- Added `LightningOptimizer.refresh()` to update the `__dict__` in case the optimizer it wraps has changed its internal state (#18280)

### Changed

- Disabled the auto-detection of the Kubeflow environment (#18137)

### Fixed

- Fixed a `Missing folder` exception when using a Google Storage URL as a `default_root_dir` (#18088)
- Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)
- Fixed the gradient unscaling logic if the training step skipped backward (by returning `None`) (#18267)
- Ensure that the closure running inside the optimizer step has gradients enabled, even if the optimizer step has it disabled (#18268)
- Fixed an issue that could cause the `LightningOptimizer` wrapper returned by `LightningModule.optimizers()` to have a different internal state than the optimizer it wraps (#18280)
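
A minimal manual-optimization sketch showing the `LightningOptimizer` wrapper returned by `self.optimizers()`, as referenced in the fix above; the module and hyperparameters are illustrative:

```python
# A sketch only; the module and hyperparameters are illustrative.
import torch
from lightning.pytorch import LightningModule


class ManualOptimModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()  # LightningOptimizer wrapper around the SGD below
        loss = self.layer(batch).sum()
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```
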


---

## Contributors

@0x404, @awaelchli, @bilelomrani1, @borda, @ethanwharris, @nisheethlahoti

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Minor patch release (2023-07-24)

## 2.0.6

### App

- Fixed handling a `None` request in the file orchestration queue (#18111)

---

### Fabric

- Fixed `TensorBoardLogger.log_graph` not unwrapping the `_FabricModule` (#17844)

---

### PyTorch

- Fixed `LightningCLI` not correctly saving `seed_everything` when `run=True` and `seed_everything=True` (#18056)
- Fixed validation of non-PyTorch LR schedulers in manual optimization mode (#18092)
- Fixed an attribute error for `_FaultTolerantMode` when loading an old checkpoint that pickled the enum (#18094)


---

## Contributors

@awaelchli, @lantiga, @mauvilsa, @shihaoyin

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Minor patch release (2023-07-10)

## App

### Added

- plugin: store source app (#17892)
- Added colocation identifier (#16796)
- Added exponential backoff to HTTPQueue put (#18013)
- Content for plugins (#17243)

### Changed

- Save a reference to created tasks, to avoid tasks disappearing (#17946)

---

## Fabric

### Added

- Added validation against misconfigured device selection when using the DeepSpeed strategy (#17952)

### Changed

- Avoid info message when loading 0 entry point callbacks (#17990)

### Fixed

- Fixed the emission of a false-positive warning when calling a method on the Fabric-wrapped module that accepts no arguments (#17875)
- Fixed check for FSDP's flat parameters in all parameter groups (#17914)
- Fixed automatic step tracking in Fabric's CSVLogger (#17942)
- Fixed an issue causing the `torch.set_float32_matmul_precision` info message to show multiple times (#17960)
- Fixed loading model state when `Fabric.load()` is called after `Fabric.setup()` (#17997)
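
A minimal sketch of the `Fabric.setup()`-then-`Fabric.load()` order covered by the last fix above; the checkpoint path and model are illustrative:

```python
# A sketch only; "checkpoint.ckpt" and the model are illustrative.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu")
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = fabric.setup(model, optimizer)  # set up first ...

state = {"model": model, "optimizer": optimizer, "step": 0}
fabric.load("checkpoint.ckpt", state)  # ... then load; the model state is restored correctly
```
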

---

## PyTorch

### Fixed

- Fixed delayed creation of experiment metadata and checkpoint/log dir name when using `WandbLogger` (#17818)
- Fixed incorrect parsing of arguments when augmenting exception messages in DDP (#17948)
- Fixed an issue causing the `torch.set_float32_matmul_precision` info message to show multiple times (#17960)
- Added missing `map_location` argument for the `LightningDataModule.load_from_checkpoint` function (#17950); a usage sketch follows this list
- Fix support for `neptune-client` (#17939)
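
A minimal sketch of the `map_location` argument mentioned above; the datamodule and checkpoint path are illustrative:

```python
# A sketch only; `MyDataModule` and the checkpoint path are illustrative.
import torch
from lightning.pytorch import LightningDataModule


class MyDataModule(LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.save_hyperparameters()


# Remap any stored tensors to CPU while restoring the datamodule from a checkpoint.
dm = MyDataModule.load_from_checkpoint("path/to.ckpt", map_location=torch.device("cpu"))
```
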


---

## Contributors

@anio, @awaelchli, @borda, @ethanwharris, @lantiga, @nicolai86, @rjarun8, @schmidt-ai, @schuhschuh, @wouterzwerink, @yurijmikhalevich

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

Minor patch release (2023-06-22)

## App

### Fixed

- Bumped several dependencies to address security vulnerabilities.

---

## Fabric

### Fixed

- Fixed validation of parameters of `plugins.precision.MixedPrecision` (#17687)
- Fixed an issue with HPU imports leading to performance degradation  (#17788)

---

## PyTorch

### Changed

- Changes to the `NeptuneLogger` (#16761):
  * It now supports neptune-client 0.16.16 and neptune >=1.0, and we have replaced the `log()` method with `append()` and `extend()`.
  * It now accepts a namespace `Handler` as an alternative to `Run` for the `run` argument. This means you can call `NeptuneLogger(run=run["some/namespace"])` to log everything under the `some/namespace/` location of the run; see the sketch below.
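
A minimal sketch of passing a namespace handler to `NeptuneLogger`, as described above; the project name and namespace are illustrative, and `neptune>=1.0` with configured credentials is assumed:

```python
# A sketch only; the project name and namespace are illustrative, and
# NEPTUNE_API_TOKEN is assumed to be set in the environment.
import neptune
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import NeptuneLogger

run = neptune.init_run(project="my-workspace/my-project")
# Log everything under the "training/" namespace of this run.
logger = NeptuneLogger(run=run["training"])
trainer = Trainer(logger=logger, max_epochs=1)
```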

### Fixed

- Fixed validation of parameters of `plugins.precision.MixedPrecisionPlugin` (#17687)
- Fixed deriving default map location in `LightningModule.load_from_checkpoint` when there is an extra state (#17812)


---

## Contributors

@akreuzer, @awaelchli, @borda, @jerome-habana, @kshitij12345

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_