🚀 pytorch/pytorch - Release Notes

PyTorch 2.6.0 Release (2025-01-29)

* Highlights
* Tracked Regressions
* Backwards Incompatible Changes
* Deprecations
* New Features
* Improvements
* Bug fixes
* Performance
* Documentation
* Developers

## **Highlights**

We are excited to announce the release of PyTorch® 2.6 ([release notes](https://github.com/pytorch/pytorch/releases/tag/v2.6.0))! This release features multiple improvements for PT2: `torch.compile` can now be used with Python 3.13; a new performance-related knob, `torch.compiler.set_stance`; and several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.

NOTE: Starting with this release, we will no longer publish packages on Conda; please see [[Announcement] Deprecating PyTorch’s official Anaconda channel](https://github.com/pytorch/pytorch/issues/138506) for details.

For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as the Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are [using the Manylinux 2.28 build platform](https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581). If you build PyTorch extensions with custom C++ or CUDA code, please update these builds to use CXX11_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1; please see [[RFC] PyTorch next wheel build platform: manylinux-2.28](https://github.com/pytorch/pytorch/issues/123649) for the details and discussion.

Also in this release, as an important security improvement, we have changed the default value of the `weights_only` parameter of `torch.load`. This is a backward-compatibility-breaking change; please see [this forum post](https://dev-discuss.pytorch.org/t/bc-breaking-change-torch-load-is-being-flipped-to-use-weights-only-true-by-default-in-the-nightlies-after-137602/2573) for more details.

This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.


| Beta | Prototype |
|---|---|
| torch.compiler.set_stance | Improved PyTorch user experience on Intel GPUs |
| torch.library.triton_op | FlexAttention support on X86 CPU for LLMs |
| torch.compile support for Python 3.13 | Dim.AUTO |
| New packaging APIs for AOTInductor | CUTLASS and CK GEMM/CONV Backends for AOTInductor |
| AOTInductor: minifier | |
| AOTInductor: ABI-compatible mode code generation | |
| FP16 support for X86 CPUs | |
*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).

### BETA FEATURES

#### **[Beta] torch.compiler.set_stance**

This feature enables the user to specify different behaviors (“stances”) that `torch.compile` can take between different invocations of compiled functions. One of the stances, for example, is “eager_on_recompile”, which instructs PyTorch to run code eagerly when a recompile is necessary, reusing cached compiled code when possible. For more information please refer to the [set_stance documentation](https://pytorch.org/docs/2.6/generated/torch.compiler.set_stance.html#torch.compiler.set_stance) and the [Dynamic Compilation Control with torch.compiler.set_stance](https://pytorch.org/tutorials/recipes/torch_compiler_set_stance_tutorial.html) tutorial.
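For illustration, here is a minimal sketch of switching stances around an already-compiled function; the toy function is ours, and the stance names are the ones listed in the set_stance documentation.

```py
import torch

@torch.compile
def f(x):
    return x * x + 1

f(torch.randn(8))  # compiled on the first call

# Temporarily run eagerly without discarding already-compiled artifacts.
# Other stances include "eager_on_recompile" and "fail_on_recompile".
with torch.compiler.set_stance("force_eager"):
    f(torch.randn(8))  # runs in eager mode

f(torch.randn(8))  # the compiled code is reused again
```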
#### **[Beta] torch.library.triton_op**

`torch.library.triton_op` offers a standard way of creating custom operators that are backed by user-defined triton kernels. When users turn user-defined triton kernels into custom operators, `torch.library.triton_op` allows `torch.compile` to peek into the implementation, enabling `torch.compile` to optimize the triton kernel inside it. For more information please refer to the [triton_op documentation](https://pytorch.org/docs/2.6/library.html#torch.library.triton_op) and the [Using User-Defined Triton Kernels with torch.compile](https://pytorch.org/tutorials/recipes/torch_compile_user_defined_triton_kernel_tutorial.html) tutorial.

#### **[Beta] torch.compile support for Python 3.13**

`torch.compile` previously only supported Python up to version 3.12. Users can now optimize models with `torch.compile` in Python 3.13.

#### **[Beta] New packaging APIs for AOTInductor**

A new package format, “[PT2 archive](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?usp=sharing)”, has been introduced. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package. For more details please see the updated [torch.export AOTInductor Tutorial for Python runtime](https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html).

#### **[Beta] AOTInductor: minifier**

If a user encounters an error while using AOTInductor APIs, the AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error. For more information please see the [AOTInductor Minifier documentation](https://pytorch.org/docs/2.6/torch.compiler_aot_inductor_minifier.html).

#### **[Beta] AOTInductor: ABI-compatible mode code generation**

AOTInductor-generated model code has a dependency on PyTorch C++ libraries. As PyTorch evolves quickly, it’s important to make sure previously AOTInductor-compiled models can continue to run on newer PyTorch versions, i.e. that AOTInductor is backward compatible. In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.

#### **[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)**

The Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched [Intel® Xeon® 6 with P-Cores](https://www.intel.com/content/www/us/en/products/details/processors/xeon/xeon6-p-cores.html) support the Float16 datatype with the native accelerator [AMX](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html). Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and it has now been further improved for both eager mode and torch.compile + Inductor mode, making it a Beta-level feature with both functionality and performance verified with a broad scope of workloads.

### PROTOTYPE FEATURES

#### **[Prototype] Improved PyTorch user experience on Intel GPUs**

The PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models, including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, run inference and develop with PyTorch models on [Intel® Core™ Ultra AI PCs](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/ai-pc.html) and [Intel® Arc™ discrete graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) can now directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2.

* Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads in an out-of-the-box fashion, eliminating the complexity of installing and activating Intel GPU development software bundles.
* Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) and [Intel® Arc™ A-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/a-series/overview.html) to the latest GPU hardware, [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html).
* Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.

For more information regarding Intel GPU support, please refer to the [Getting Started Guide](https://pytorch.org/docs/main/notes/get_start_xpu.html).

#### **[Prototype] FlexAttention support on X86 CPU for LLMs**

FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through the TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support broad attention variants (e.g. PagedAttention, which is critical for LLM inference) based on the existing FlexAttention API, and brings optimized performance on x86 CPUs. With this feature, it’s easy to use the FlexAttention API to compose attention solutions on CPU platforms and achieve good performance.
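As a rough sketch of what this looks like in practice, the snippet below compiles `flex_attention` with a user-defined `score_mod` on CPU tensors; the bias function and tensor shapes are illustrative, not part of the release itself.

```py
import torch
from torch.nn.attention.flex_attention import flex_attention

# A toy relative-position score modification.
def rel_bias(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# CPU tensors shaped (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

# Compiling routes FlexAttention through the TorchInductor CPP backend on x86 CPUs.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=rel_bias)
```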
#### **[Prototype] Dim.AUTO**

`Dim.AUTO` allows usage of automatic dynamic shapes with `torch.export`. Users can export with `Dim.AUTO` and “discover” the dynamic behavior of their models, with min/max ranges, relations between dimensions, and static/dynamic behavior being automatically inferred. This is a more user-friendly experience compared to the existing named-Dims approach for specifying dynamic shapes, which requires the user to fully understand the dynamic behavior of their models at export time. `Dim.AUTO` allows users to write generic code that isn’t model-dependent, increasing ease-of-use for exporting with dynamic shapes. Please see the [torch.export tutorial](https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html#constraints-dynamic-shapes) for more information.

#### **[Prototype] CUTLASS and CK GEMM/CONV Backends for AOTInductor**

The CUTLASS and CK backends add kernel choices for GEMM autotuning in Inductor. This is now also available in AOTInductor, which can run in C++ runtime environments. A major improvement to the two backends is faster compile time, achieved by eliminating redundant kernel binary compilations, along with dynamic shapes support.

## **Tracked Regressions**

### torch.device(0) makes CUDA init fail in subprocess

There is a known regression ([#144152](https://github.com/pytorch/pytorch/issues/144152)) where `torch.device(0)` makes CUDA init fail in a subprocess since PyTorch 2.5.0. There was an attempt to fix the regression, but it caused some complications and was reverted. An easy workaround is to use `torch.device('cuda')` or `torch.device('cuda:0')` instead.

### Regression in the compilation of the torch.all operation with out= variant

A regression ([#145220](https://github.com/pytorch/pytorch/issues/145220)) was reported for PyTorch 2.6.0 with compilation of the `out=` variant of the `torch.all` operator. This should be a rare use case; a workaround is to rewrite the model code to avoid the `out=` variant.

## **Backwards Incompatible changes**

### Flip default torch.load to weights_only ([#137602](https://github.com/pytorch/pytorch/pull/137602), [#138225](https://github.com/pytorch/pytorch/pull/138225), [#138866](https://github.com/pytorch/pytorch/pull/138866), [#139221](https://github.com/pytorch/pytorch/pull/139221), [#140304](https://github.com/pytorch/pytorch/pull/140304), [#138936](https://github.com/pytorch/pytorch/pull/138936), [#139541](https://github.com/pytorch/pytorch/pull/139541), [#140738](https://github.com/pytorch/pytorch/pull/140738), [#142153](https://github.com/pytorch/pytorch/pull/142153), [#139433](https://github.com/pytorch/pytorch/pull/139433))

We are closing the loop on the deprecation that started in 2.4 and flipped `torch.load` to use `weights_only=True` by default. When this flag is set, instead of using the usual pickle module, `torch.load` uses a custom unpickler constrained to call only functions and classes needed for loading state dictionaries and basic types. While this change is disruptive for users serializing more than basic types, we expect that the increased security by default is worth the tradeoff. Do note that, even though this default is safer, we still recommend only loading trusted checkpoints and relying on more constrained (and even safer) formats like [safetensors](https://github.com/huggingface/safetensors) for untrusted checkpoints. For full details, please refer to [this dev-discuss post](https://dev-discuss.pytorch.org/t/bc-breaking-change-torch-load-is-being-flipped-to-use-weights-only-true-by-default-in-the-nightlies-after-137602/2573).
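A small before/after sketch of the new default; the file name and checkpoint contents below are illustrative.

```py
import torch

torch.save({"fc.weight": torch.randn(4, 4)}, "checkpoint.pt")

# PyTorch 2.6 default: equivalent to torch.load("checkpoint.pt", weights_only=True).
# Only tensors, state dicts and other basic types pass the restricted unpickler.
state = torch.load("checkpoint.pt")

# Loading arbitrary pickled objects now requires an explicit opt-out, which should
# only be used for checkpoints from trusted sources. Specific classes can instead
# be allowlisted via torch.serialization.add_safe_globals([...]).
legacy = torch.load("checkpoint.pt", weights_only=False)
```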
### Anaconda deprecation in CD

Remove anaconda dependency in Magma builds ([#141024](https://github.com/pytorch/pytorch/pull/141024)) ([#141281](https://github.com/pytorch/pytorch/pull/141281)) ([#140157](https://github.com/pytorch/pytorch/pull/140157)) ([#139888](https://github.com/pytorch/pytorch/pull/139888)) ([#140141](https://github.com/pytorch/pytorch/pull/140141)) ([#139924](https://github.com/pytorch/pytorch/pull/139924)) ([#140158](https://github.com/pytorch/pytorch/pull/140158)) ([#142019](https://github.com/pytorch/pytorch/pull/142019)) ([#142276](https://github.com/pytorch/pytorch/pull/142276)) ([#142277](https://github.com/pytorch/pytorch/pull/142277)) ([#142282](https://github.com/pytorch/pytorch/pull/142282))

PyTorch will stop publishing Anaconda packages that depend on Anaconda’s default packages. We are directing users to use our official wheel packages from download.pytorch.org or PyPI, or to switch to conda-forge (pytorch) packages if they would like to continue to use conda. For more details refer to [this announcement](https://github.com/pytorch/pytorch/issues/138506).

### Added Manylinux 2.28 prototype support and CXX11_ABI=1 for the following binaries: Linux CUDA 12.6, Linux aarch64 CPU, Linux aarch64 GPU CUDA 12.6, ROCm 6.2.4, Linux XPU ([#139894](https://github.com/pytorch/pytorch/pull/139894)) ([#139631](https://github.com/pytorch/pytorch/pull/139631)) ([#139636](https://github.com/pytorch/pytorch/pull/139636)) ([#140743](https://github.com/pytorch/pytorch/pull/140743)) ([#137696](https://github.com/pytorch/pytorch/pull/137696)) ([#141565](https://github.com/pytorch/pytorch/pull/141565)) ([#140681](https://github.com/pytorch/pytorch/pull/140681)) ([#141609](https://github.com/pytorch/pytorch/pull/141609)) ([#141704](https://github.com/pytorch/pytorch/pull/141704)) ([#141423](https://github.com/pytorch/pytorch/pull/141423))

The PyTorch binaries shipped with CUDA 12.6.3 are built with CXX11_ABI=1 and are using the Manylinux 2.28 build platform. If you are building PyTorch extensions with custom C++ or CUDA code, please update these builds to use CXX11_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1; please see [[RFC] PyTorch next wheel build platform: manylinux-2.28](https://github.com/pytorch/pytorch/issues/123649) for the details and discussion.

### ONNX

#### `torch.onnx.export(..., dynamo=True)` now creates ONNX models using IR version 10 ([#141207](https://github.com/pytorch/pytorch/pull/141207))

ONNX ir_version=10 is used to add support for the UINT4 and INT4 data types and to include metadata in GraphProto and NodeProto. Make sure model consumers are able to accept IR version 10 ONNX models. You can read more about IR version 10 at https://github.com/onnx/onnx/releases/tag/v1.16.0.

#### Several obsolete APIs are removed ([#133825, #136279, #137789, #137790](https://github.com/pytorch/pytorch/pull/133825))

Some logging APIs, `torch.onnx.ExportTypes`, and `torch.onnx.export_to_pretty_string` have been removed. Users should remove usage of these APIs.

#### `torch.onnx.ONNXProgram` has been reimplemented and improved ([#136281](https://github.com/pytorch/pytorch/pull/136281))

All ONNX "dynamo" APIs now return the new `ONNXProgram` class. Notable methods include `save()` and `optimize()`. The program can also be called directly on PyTorch tensors, leveraging ONNX Runtime to verify the ONNX graph. Some legacy methods are no longer available.
## **Deprecations**

### Releng

### Removed CUDA 12.1 support in CI/CD ([#141271](https://github.com/pytorch/pytorch/pull/141271)) ([#142177](https://github.com/pytorch/pytorch/pull/142177))

The full release compatibility matrix can be found in [release.md](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix).

### Deprecated `c10d::onCompletionHook` ([#142390](https://github.com/pytorch/pytorch/pull/142390))

* In PT 2.5 and before, users can do:

```py
pg = dist.init_process_group()

def hook(work_info: torch._C._distributed_c10d.WorkInfo):
    # do something
    ...

pg._register_on_completion_hook(hook)
# The hook will be triggered after the collective completes
pg.broadcast([tensor]).wait()
```

* Starting from PT 2.6, when users write the code above, they will get a warning message: “ProcessGroupNCCL OnCompletion hook will be deprecated in favor of Flight Recorder”.

### Inductor

### Deprecate TORCHINDUCTOR_STACK_ALLOCATION ([#139147](https://github.com/pytorch/pytorch/pull/139147))

Instead of setting TORCHINDUCTOR_STACK_ALLOCATION, update your torch.compile call: `torch.compile(options={"aot_inductor.allow_stack_allocation": True})(foo)`.

## **New features**

### Python Frontend
* Introduce a device-agnostic runtime API design ([#132204](https://github.com/pytorch/pytorch/pull/132204))
* Add validation for ambiguous behavior in `Tensor.dim_order()` ([#141632](https://github.com/pytorch/pytorch/pull/141632))
* Add type check for `ord` argument for `torch.linalg.{vector,matrix}_norm()` ([#137463](https://github.com/pytorch/pytorch/pull/137463))
* FlexAttention support for NJT ([#136792](https://github.com/pytorch/pytorch/pull/136792), [#140723](https://github.com/pytorch/pytorch/pull/140723))

### Miscellaneous
* Enable forward AD in `functional.affine_grid` ([#135494](https://github.com/pytorch/pytorch/pull/135494))
* Added SVE support for ARM CPUs ([#119571](https://github.com/pytorch/pytorch/pull/119571))
* User buffer registration via MemPool API ([#133603](https://github.com/pytorch/pytorch/pull/133603))
* Add in_order flag for data loader, allowing out-of-order dataloading ([#141833](https://github.com/pytorch/pytorch/pull/141833))

### Optim
* Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict ([#134107](https://github.com/pytorch/pytorch/pull/134107))
* Support tensor betas in Adam and AdamW ([#134171](https://github.com/pytorch/pytorch/pull/134171))

### Distributed
* c10d
  * Made ProcessGroup initialization non-blocking when `device_id` is given ([#138527](https://github.com/pytorch/pytorch/pull/138527))
  * Allowed sub group to be eagerly inited even if default one is not ([#138665](https://github.com/pytorch/pytorch/pull/138665))
  * Supported `group_dst`/`group_src` in c10d collectives ([#140460](https://github.com/pytorch/pytorch/pull/140460), [#139677](https://github.com/pytorch/pytorch/pull/139677), [#140827](https://github.com/pytorch/pytorch/pull/140827), [#140843](https://github.com/pytorch/pytorch/pull/140843), [#140847](https://github.com/pytorch/pytorch/pull/140847))
  * Enabled Flight Recorder buffer for all users ([#142260](https://github.com/pytorch/pytorch/pull/142260))
  * Registered Intel distributed backend (`XCCL`) in PyTorch distributed package ([#141856](https://github.com/pytorch/pytorch/pull/141856))
* Pipeline
  * Performed shape inference at runtime using user-provided real tensors ([#136912](https://github.com/pytorch/pytorch/pull/136912))
  * Added ZBV schedule ([#142084](https://github.com/pytorch/pytorch/pull/142084))
* FSDP2
  * Moved FSDP2 to public ([#141868](https://github.com/pytorch/pytorch/pull/141868))

### Dynamo
* Add `torch.compiler.set_stance` to dynamically change `torch.compile` behavior without needing to re-apply `torch.compile` ([#137504](https://github.com/pytorch/pytorch/pull/137504))
* Profile guided optimization for `automatic_dynamic` - automatically save and load automatic dynamic decisions to reuse on future runs ([#139001](https://github.com/pytorch/pytorch/pull/139001))
* `skip_guard_eval_unsafe` compiler stance option for power users - skip guard checks when it is known to be safe to do so ([#140251](https://github.com/pytorch/pytorch/pull/140251))

### Releng
* Added support for CUDA 12.6 in CI/CD ([#142335](https://github.com/pytorch/pytorch/pull/142335)) ([#136321](https://github.com/pytorch/pytorch/pull/136321)) ([#138417](https://github.com/pytorch/pytorch/pull/138417)) ([#138563](https://github.com/pytorch/pytorch/pull/138563)) ([#138562](https://github.com/pytorch/pytorch/pull/138562)) ([#139909](https://github.com/pytorch/pytorch/pull/139909)) ([#138899](https://github.com/pytorch/pytorch/pull/138899)) ([#141365](https://github.com/pytorch/pytorch/pull/141365)) ([#141433](https://github.com/pytorch/pytorch/pull/141433)) ([#141805](https://github.com/pytorch/pytorch/pull/141805)) ([#141976](https://github.com/pytorch/pytorch/pull/141976)) ([#139988](https://github.com/pytorch/pytorch/pull/139988)) ([#140143](https://github.com/pytorch/pytorch/pull/140143)) ([#141377](https://github.com/pytorch/pytorch/pull/141377)) ([#142064](https://github.com/pytorch/pytorch/pull/142064))
* Intel GPU enablement in CI/CD. Upgraded XPU support packages to Intel® Deep Learning Essentials 2025.0. Added prototype Linux and Windows binary builds with XPU runtime PyPI package dependencies. ([#138189](https://github.com/pytorch/pytorch/pull/138189)) ([#139050](https://github.com/pytorch/pytorch/pull/139050)) ([#139604](https://github.com/pytorch/pytorch/pull/139604)) ([#139775](https://github.com/pytorch/pytorch/pull/139775)) ([#140373](https://github.com/pytorch/pytorch/pull/140373)) ([#141546](https://github.com/pytorch/pytorch/pull/141546)) ([#141775](https://github.com/pytorch/pytorch/pull/141775)) ([#141135](https://github.com/pytorch/pytorch/pull/141135)) ([#142210](https://github.com/pytorch/pytorch/pull/142210)) ([#135638](https://github.com/pytorch/pytorch/pull/135638)) ([#142298](https://github.com/pytorch/pytorch/pull/142298))
* Added Python 3.13 support in CI/CD and prototype support for Python 3.13t in CD (only Linux and Linux aarch64 torch binaries) ([#136001](https://github.com/pytorch/pytorch/pull/136001)) ([#137396](https://github.com/pytorch/pytorch/pull/137396)) ([#138037](https://github.com/pytorch/pytorch/pull/138037)) ([#138629](https://github.com/pytorch/pytorch/pull/138629)) ([#140137](https://github.com/pytorch/pytorch/pull/140137)) ([#138095](https://github.com/pytorch/pytorch/pull/138095)) ([#141572](https://github.com/pytorch/pytorch/pull/141572)) ([#140733](https://github.com/pytorch/pytorch/pull/140733)) ([#141264](https://github.com/pytorch/pytorch/pull/141264)) ([#142294](https://github.com/pytorch/pytorch/pull/142294)) ([#137142](https://github.com/pytorch/pytorch/pull/137142)) ([#137127](https://github.com/pytorch/pytorch/pull/137127)) ([#139533](https://github.com/pytorch/pytorch/pull/139533))

### ROCM
* Added AMDSMI support for UUID input ([#129741](https://github.com/pytorch/pytorch/pull/129741))
* Added faster HW support for packed bfloat16 and fp16 for MI300 ([#135770](https://github.com/pytorch/pytorch/pull/135770))
* Improved performance of reductions on 1D and 2D tensors ([#137737](https://github.com/pytorch/pytorch/pull/137737))

### XPU
* Add `torch.xpu.mem_get_info` API: introduces a new API to retrieve memory information for XPU devices ([#141230](https://github.com/pytorch/pytorch/pull/141230))
* Add architecture property to XPU device: adds new properties to XPU devices to query architecture details ([#138186](https://github.com/pytorch/pytorch/pull/138186))
* Add `elapsed_time` method for XPU events: introduces a method to measure elapsed time between XPU events ([#140865](https://github.com/pytorch/pytorch/pull/140865))
* Add `torch.xpu.get_arch_list` and `torch.xpu.get_gencode_flags`: introduces new APIs to retrieve architecture lists and code generation flags for XPU ([#137773](https://github.com/pytorch/pytorch/pull/137773))
* Add quantized convolution support for XPU backend ([#133080](https://github.com/pytorch/pytorch/pull/133080))
* Enable XPU device support for LSTMCell operators ([#140246](https://github.com/pytorch/pytorch/pull/140246))

### Profiler
* Hide ProfilerStep Alignment behind Experimental Config ([#137668](https://github.com/pytorch/pytorch/pull/137668))
* Add functionality to call dump function of NCCL profiler plugin ([#137523](https://github.com/pytorch/pytorch/pull/137523))
### Export
* Add `torch.export.export_for_training()` API to perform export that can run training. Note that this replaces the non-documented `capture_pre_autograd_graph` feature ([#135374](https://github.com/pytorch/pytorch/pull/135374), [#135918](https://github.com/pytorch/pytorch/pull/135918), [#135549](https://github.com/pytorch/pytorch/pull/135549), [#143224](https://github.com/pytorch/pytorch/pull/143224))
* New packaging APIs for AOTInductor: `torch._inductor.aoti_compile_and_package`
  * Previously, AOTInductor (through `torch._export.aot_compile`) would return a path to a .so. However, this did not make for a great user experience, as other files are actually used along with the .so, for example .cubin files and serialized extern kernels. So, we introduce a new package format, “[PT2 archive](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit#heading=h.v2y2jgnwc56a)”, which is what we intend to have AOTInductor return. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package (a short usage sketch follows this list).
* [AOTInductor Minifier](https://pytorch.org/docs/main/torch.compiler_aot_inductor_minifier.html). If you encounter an error while using AOTInductor APIs such as `torch._inductor.aoti_compile_and_package` or `torch._inductor.aoti_load_package`, or while running the model loaded by `aoti_load_package` on some inputs, you can use the AOTInductor Minifier to create a minimal nn.Module that reproduces the error. ([#139351](https://github.com/pytorch/pytorch/pull/139351), [#140999](https://github.com/pytorch/pytorch/pull/140999), [#141159](https://github.com/pytorch/pytorch/pull/141159), [#141156](https://github.com/pytorch/pytorch/pull/141156))
* AOTInductor: ABI-compatible mode code generation. In order to guarantee ABI backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide BC guarantees for AOTInductor-compiled models.
* `export.export_for_inference` and `export.exported_program.core_aten_decompositions` API. `export_for_inference` returns a functional, post-dispatch ATen IR. ([#135912](https://github.com/pytorch/pytorch/pull/135912))
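A minimal sketch of the packaging flow using the APIs named above; the toy module and file name are illustrative, and exact keyword arguments may vary slightly between versions.

```py
import torch
from torch.export import export
from torch._inductor import aoti_compile_and_package, aoti_load_package

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + x.cos()

ep = export(M(), (torch.randn(8),))

# Compile ahead of time and bundle the shared library, generated kernels,
# weights and metadata into a single PT2 archive.
pkg_path = aoti_compile_and_package(ep, package_path="m.pt2")

# Later, possibly in another process or environment: load the package and run it.
compiled = aoti_load_package(pkg_path)
print(compiled(torch.randn(8)))
```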
### Inductor
* Move stack allocation related configs in AOTI ([#139093](https://github.com/pytorch/pytorch/pull/139093)). All stack allocation related configs now have an aot_inductor prefix, so `torch.compile(options={"use_minimal_arrayref_interface": True})(foo)` is now `torch.compile(options={"aot_inductor.use_minimal_arrayref_interface": True})(foo)` and `torch.compile(options={"allow_stack_allocation": True})(foo)` is now `torch.compile(options={"aot_inductor.allow_stack_allocation": True})(foo)`.
* Move `torch._utils.is_compiling` to `torch.compiler.is_compiling` ([#127690](https://github.com/pytorch/pytorch/pull/127690)). Rewrite `torch._utils.is_compiling()` as `torch.compiler.is_compiling()`.
* Added option `autotune_num_choices_displayed` to control the number of kernel options displayed ([#138788](https://github.com/pytorch/pytorch/pull/138788))
* Added option `force_pointwise_cat` for concat support through Inductor using pointwise kernels ([#141966](https://github.com/pytorch/pytorch/pull/141966)). This forces concat to be generated as a pointwise op with masked loads.
* New config option `annotate_training` that adds Inductor annotations to NVTX ([#130429](https://github.com/pytorch/pytorch/pull/130429))
* Introduces an option `triton_kernel_default_layout_constraint` to tweak stride settings for user-defined Triton kernels, enhancing customization and flexibility ([#135530](https://github.com/pytorch/pytorch/pull/135530)).
* Users can patch the Inductor config to enable strict custom kernel layout constraints by changing `torch.compile(options={"triton_kernel_default_layout_constraint": "needs_fixed_stride_order"})(foo)` ([#135581](https://github.com/pytorch/pytorch/pull/135581)).
* External callable registration API `register_external_matmul` for Matmul tuning candidates in Inductor ([#130774](https://github.com/pytorch/pytorch/pull/130774)).
* Adds support for Windows Arm64 to enhance platform compatibility ([#133088](https://github.com/pytorch/pytorch/pull/133088)).
* Integrates support for the AMD Triton stream pipeliner in ROCm to enhance performance ([#139881](https://github.com/pytorch/pytorch/pull/139881)).
* Adds support for TRITON_INTERPRET in Inductor ([#140841](https://github.com/pytorch/pytorch/pull/140841)).
* Adds update_constant_buffer pybind support in AOTInductor ([#140755](https://github.com/pytorch/pytorch/pull/140755)).
* Provides an option `package_constants_in_so` to exclude weights from .so files in AOTInductor ([#141997](https://github.com/pytorch/pytorch/pull/141997)).
* Adds `load_constants` to the package API ([#142246](https://github.com/pytorch/pytorch/pull/142246)).
* Enables auto functionalize v2 by default ([#136685](https://github.com/pytorch/pytorch/pull/136685)).
* Adds raise_error_on_ignored_optimization to the aoti config ([#138035](https://github.com/pytorch/pytorch/pull/138035)).
* Adds stats summary (mean/min/max, etc.) for JIT Inductor tensor value printing ([#135887](https://github.com/pytorch/pytorch/pull/135887)).

### ONNX
* Models using `torch.cond` are supported ([#137428](https://github.com/pytorch/pytorch/pull/137428))
  `torch.cond` is the recommended way to introduce control flows that can be converted to an ONNX model.
* Users can provide a `custom_translation_table` to provide custom implementations for converting operators to ONNX ([#135403](https://github.com/pytorch/pytorch/pull/135403))
  This is useful when you need to override an implementation or provide one that is not currently implemented. Refer to the tutorials for a more complete description of the operator registration mechanism.

  ```py
  # Define the translation using ONNX Script
  from onnxscript import opset18 as op

  def sym_not_onnx(input):
      return op.Not(input)

  torch.onnx.export(
      ...,
      dynamo=True,
      custom_translation_table={
          # Then provide it here
          torch.sym_not: sym_not_onnx,
      },
  )
  ```
* `ONNXProgram` has a new `optimize()` method ([#137667](https://github.com/pytorch/pytorch/pull/137667))
  Users can run `optimize()` to flatten nested structures in the ONNX graph, perform constant folding and remove redundancies in the ONNX model. Calling `optimize()` after exporting to ONNX is recommended.

  ```py
  onnx_program = torch.onnx.export(..., dynamo=True)
  onnx_program.optimize()  # Optimizing the graph before saving is recommended
  onnx_program.save(...)
  ```
* Users can now use complex constants in their models and export to ONNX ([#138279](https://github.com/pytorch/pytorch/pull/138279))

## **Improvements**

### Python Frontend
* Add support for fp16 and bf16 to `torch.special.i1` ([#137899](https://github.com/pytorch/pytorch/pull/137899))
* Add option to disable checksum computation in `torch.save` ([#137735](https://github.com/pytorch/pytorch/pull/137735))
* Speed up fp16 tensor printing ([#141927](https://github.com/pytorch/pytorch/pull/141927))
* Add support for fp16 for `torch.adaptive_pool3d` on CPU ([#136091](https://github.com/pytorch/pytorch/pull/136091))
* Add support for fp8* to `torch.masked_select` ([#141928](https://github.com/pytorch/pytorch/pull/141928))
* Add support for complex fp16 to `fill_empty_deterministic_` ([#137488](https://github.com/pytorch/pytorch/pull/137488))
* Remove dependency on numpy for serialization for XLA/open registration devices without numpy ([#137444](https://github.com/pytorch/pytorch/pull/137444), [#137600](https://github.com/pytorch/pytorch/pull/137600))
* Fix `torch.{linalg.}norm` complex half support ([#133661](https://github.com/pytorch/pytorch/pull/133661))

### NN Frontend
* Allow global module hook to accept keyword arguments ([#137403](https://github.com/pytorch/pytorch/pull/137403))
* Add APIs to separate norm calculation and gradient scaling in `nn.utils.clip_grad_norm_` ([#139662](https://github.com/pytorch/pytorch/pull/139662))
* Add Half support for reflection and replication padding on CPU ([#135931](https://github.com/pytorch/pytorch/pull/135931))
* Add `weight` argument to MSELoss, HuberLoss and L1Loss ([#132049](https://github.com/pytorch/pytorch/pull/132049))
* Gaussian NLL loss: scalar variance support ([#138931](https://github.com/pytorch/pytorch/pull/138931))
* Added validation for input types for `torch.nn.Linear` and `torch.nn.Bilinear` ([#135596](https://github.com/pytorch/pytorch/pull/135596))

### Optim
* Improve `ReduceLROnPlateau` and `Optimizer.add_param_group` interaction by auto-updating `min_lrs` ([#137637](https://github.com/pytorch/pytorch/pull/137637))
* Allow `SequentialLR` to include `ChainedScheduler` ([#133450](https://github.com/pytorch/pytorch/pull/133450))

### Composability

##### **Decompositions, FakeTensor and meta tensors**

Operator decompositions, FakeTensors and meta tensors are used to trace out a graph in `torch.compile` and `torch.export`. They received several improvements:
* Several operator decomps received improvements/bugfixes:
  * `aten.split_with_sizes` ([#135728](https://github.com/pytorch/pytorch/pull/135728))
  * `aten.max_unpool2d`/`aten.max_unpool3d` ([#133146](https://github.com/pytorch/pytorch/pull/133146))
  * `aten.dot` ([#138596](https://github.com/pytorch/pytorch/pull/138596))
  * `aten.layer_norm` ([#140557](https://github.com/pytorch/pytorch/pull/140557))
  * `aten.scaled_dot_product_attention` ([#135297](https://github.com/pytorch/pytorch/pull/135297))
  * `aten.matmul` ([#134568](https://github.com/pytorch/pytorch/pull/134568))
  * `aten._embedding_bag` ([#136774](https://github.com/pytorch/pytorch/pull/136774))
  * `aten.native_group_norm`/`aten.native_layer_norm` ([#137079](https://github.com/pytorch/pytorch/pull/137079))
  * `aten.to(..., non_blocking=True)` ([#136513](https://github.com/pytorch/pytorch/pull/136513))
  * `aten.addmm` ([#138520](https://github.com/pytorch/pytorch/pull/138520))
* General fixes:
  * `out=` dtype checks for unary ops ([#140288](https://github.com/pytorch/pytorch/pull/140288))
* New decompositions for a few PyTorch operators:
  * `aten.diagonal_copy` ([#136730](https://github.com/pytorch/pytorch/pull/136730))
* Several meta implementations of operators received improvements/bugfixes:
  * `aten.triangular_solve` ([#140186](https://github.com/pytorch/pytorch/pull/140186))
  * `aten.log_softmax` ([#140289](https://github.com/pytorch/pytorch/pull/140289))
* New meta tensor implementations for a few PyTorch operators:
  * `aten._segment_reduce_backward` ([#137442](https://github.com/pytorch/pytorch/pull/137442))
  * `aten._add_relu` ([#140009](https://github.com/pytorch/pytorch/pull/140009))

**Dynamic shapes**

We made many improvements and bugfixes to dynamic shapes in `torch.compile`:
* Minor error message improvements ([#136671](https://github.com/pytorch/pytorch/pull/136671), [#138310](https://github.com/pytorch/pytorch/pull/138310))
* Make `native_layer_norm_backward` work with unbacked SymInts ([#136798](https://github.com/pytorch/pytorch/pull/136798))
* Make `masked_fill` work with unbacked SymInts ([#137060](https://github.com/pytorch/pytorch/pull/137060))
* Improve tracing speed of `torch.cat` with large numbers of symbolic variables ([#139653](https://github.com/pytorch/pytorch/pull/139653))
* Improve performance of `canonicalize_bool_expr` ([#135621](https://github.com/pytorch/pytorch/pull/135621))
* Improve performance of `sympy_generic_le` ([#135622](https://github.com/pytorch/pytorch/pull/135622))
* Simplify expr before getting implications in `_maybe_evaluate_static` ([#135499](https://github.com/pytorch/pytorch/pull/135499))
* Use a fast expand algorithm ([#135999](https://github.com/pytorch/pytorch/pull/135999), [#136163](https://github.com/pytorch/pytorch/pull/136163))
* Fix calling `Add._from_args` and `Mul._from_args` ([#136143](https://github.com/pytorch/pytorch/pull/136143))
* Dynamic shape logging improvements in tlparse ([#136508](https://github.com/pytorch/pytorch/pull/136508), [#141068](https://github.com/pytorch/pytorch/pull/141068), [#140867](https://github.com/pytorch/pytorch/pull/140867))
* Avoid some quadratic behavior of dynamic shapes involving aliasing + mutation of graph inputs ([#136857](https://github.com/pytorch/pytorch/pull/136857))
* Tensorify compute on Python scalars ([#136674](https://github.com/pytorch/pytorch/pull/136674))
* Delay mul/pow expansion for `_SympyT` to enable more folding ([#138235](https://github.com/pytorch/pytorch/pull/138235))
* Fix bug in unbacked_bindings for `a*u0` ([#138136](https://github.com/pytorch/pytorch/pull/138136))
* Remove parallel_and and parallel_or ([#138135](https://github.com/pytorch/pytorch/pull/138135))
* Explicitly avoid recording when should_record_events is false in record_shapeenv_event ([#138965](https://github.com/pytorch/pytorch/pull/138965))
* Better support for dynamic shapes with tensor subclasses ([#125941](https://github.com/pytorch/pytorch/pull/125941))
* Support symfloats in translation validation ([#139457](https://github.com/pytorch/pytorch/pull/139457))
* Add trunc to z3 validator ([#140886](https://github.com/pytorch/pytorch/pull/140886))
* Refactor ShapeGuardPrinter for future C++ addition ([#140968](https://github.com/pytorch/pytorch/pull/140968))
* Fix another item memo loss location + bool specialization bug ([#139587](https://github.com/pytorch/pytorch/pull/139587))
* Optimize increment summations ([#140822](https://github.com/pytorch/pytorch/pull/140822))
* Only compute new_untracked_symbols and `new_unbacked_bindings` if needed ([#140083](https://github.com/pytorch/pytorch/pull/140083))
* Use `has_free_unbacked_symbols` instead of `bool(free_unbacked_symbols)` ([#140027](https://github.com/pytorch/pytorch/pull/140027))
* Try to simplify FloorDiv axiom implications when needed during evaluations ([#141267](https://github.com/pytorch/pytorch/pull/141267))
* Fix `AttributeError: 'int' object has no attribute 'node'` due to constant prop ([#141250](https://github.com/pytorch/pytorch/pull/141250))
* Update tensorify pass to specialize symfloats we didn't tensorify away ([#139564](https://github.com/pytorch/pytorch/pull/139564))
* Add `TORCHDYNAMO_EXTENDED_ADVICE` ([#137159](https://github.com/pytorch/pytorch/pull/137159)) ([#137196](https://github.com/pytorch/pytorch/pull/137196))
* Do not try to optimize new implications in `get_implications` ([#139738](https://github.com/pytorch/pytorch/pull/139738))

**Custom operators**

We improved the existing `torch.library` APIs and added new ones:
* Add new `torch.library.triton_op` API ([#141880](https://github.com/pytorch/pytorch/pull/141880))
* Fix partitioner behavior on user triton kernels ([#136878](https://github.com/pytorch/pytorch/pull/136878))
* Add links to the new Custom Ops Landing Page ([#137933](https://github.com/pytorch/pytorch/pull/137933), [#139634](https://github.com/pytorch/pytorch/pull/139634))
* Fix `torch.library.register_vmap` to work with nested vmap ([#137306](https://github.com/pytorch/pytorch/pull/137306))
* No-op `torch.library.custom_op` APIs on `torch.deploy` ([#139509](https://github.com/pytorch/pytorch/pull/139509))
* Optimize mutable `torch.library.custom_op` overhead ([#139513](https://github.com/pytorch/pytorch/pull/139513))
* Improve `torch.library.opcheck` and `register_autograd` docs ([#141883](https://github.com/pytorch/pytorch/pull/141883))

### Distributed
* c10d
  * Added FP8 support to NaN checker ([#135891](https://github.com/pytorch/pytorch/pull/135891), [#135961](https://github.com/pytorch/pytorch/pull/135961), [#136115](https://github.com/pytorch/pytorch/pull/136115))
  * Added support for `cuStreamWriteValue32` ([#136488](https://github.com/pytorch/pytorch/pull/136488))
  * Improved the detection robustness in `CudaDMAConnectivityDetector` ([#137530](https://github.com/pytorch/pytorch/pull/137530))
  * Simplified barrier implementation and further decoupled CPU/GPU synchronization ([#137516](https://github.com/pytorch/pytorch/pull/137516))
  * Threw a value error if passing `world_size=0` to `TCPStore` ([#137792](https://github.com/pytorch/pytorch/pull/137792))
  * Retried connection timeout failures in socket ([#138003](https://github.com/pytorch/pytorch/pull/138003))
  * Added an API to get the future result (success or failure) of a collective and customized error handling ([#137799](https://github.com/pytorch/pytorch/pull/137799))
  * Disabled watchdog thread in blockingWait mode ([#138001](https://github.com/pytorch/pytorch/pull/138001))
  * Added default value for `nccl_nonblocking_timeout` ([#138374](https://github.com/pytorch/pytorch/pull/138374))
  * Ensured nccl comm is ready before all accesses ([#138384](https://github.com/pytorch/pytorch/pull/138384))
  * Used a promise to delay watchdog shutdown ([#138828](https://github.com/pytorch/pytorch/pull/138828))
  * Supported optional backend if `device_id` provided ([#140963](https://github.com/pytorch/pytorch/pull/140963))
  * Supported group ranks in `P2POp` and `batch_isend_irecv` ([#141054](https://github.com/pytorch/pytorch/pull/141054))
  * Enabled `CudaEventCache` by default and added multi-device support ([#140975](https://github.com/pytorch/pytorch/pull/140975))
  * Added an API to retrieve the default distributed backend from device ([#140536](https://github.com/pytorch/pytorch/pull/140536))
  * Supported rank, world size, group name/desc overrides for `PyProcessGroup` ([#141529](https://github.com/pytorch/pytorch/pull/141529))
  * Added detection of accelerator type when backend is not specified ([#142216](https://github.com/pytorch/pytorch/pull/142216))
  * Used task submitter TLS in gloo working threads ([#142184](https://github.com/pytorch/pytorch/pull/142184))
  * Added `_reduce_scatter_base` to `c10d::ProcessGroupUCC` ([#138021](https://github.com/pytorch/pytorch/pull/138021))
* DDP
  * Made `DDPOptimizer` work with HOPs ([#138787](https://github.com/pytorch/pytorch/pull/138787))
  * Made DDP quantization hooks backend agnostic ([#138816](https://github.com/pytorch/pytorch/pull/138816))
  * Used device-agnostic runtime API in DDP/FSDP instead of `cuda`-device-specific calls ([#137678](https://github.com/pytorch/pytorch/pull/137678))
* FSDP
  * Updated real device in FSDP `state_dict_utils` ([#134994](https://github.com/pytorch/pytorch/pull/134994))
  * Generalized FSDP common for non-CUDA execution ([#133209](https://github.com/pytorch/pytorch/pull/133209))
* FSDP2
  * Added `_set_unshard_async_op` ([#135523](https://github.com/pytorch/pytorch/pull/135523))
  * Added module, mp policy to `fsdp_pre_all_gather` ([#136129](https://github.com/pytorch/pytorch/pull/136129))
  * Added check for contiguous parameters ([#137000](https://github.com/pytorch/pytorch/pull/137000))
  * Relaxed even sharding requirement for `all-gather` extensions ([#137005](https://github.com/pytorch/pytorch/pull/137005))
  * Used stream and event based on device ([#136843](https://github.com/pytorch/pytorch/pull/136843))
  * Added `shard_placement_fn` arg ([#137496](https://github.com/pytorch/pytorch/pull/137496))
  * Added `set_unshard_in_backward(bool)` ([#137922](https://github.com/pytorch/pytorch/pull/137922))
  * Made module-to-state mapping use weakrefs ([#139650](https://github.com/pytorch/pytorch/pull/139650))
  * Removed CUDA-like device check in FSDP2 ([#139539](https://github.com/pytorch/pytorch/pull/139539))
* DTensor
  * Allowed the user to manual_seed a different seed on the device mesh, and only synced RNG state in WORLD when manual_seed has not been called ([#141223](https://github.com/pytorch/pytorch/pull/141223))
  * Supported `matmul` in inference_mode ([#142197](https://github.com/pytorch/pytorch/pull/142197))
* Pipeline
  * Made `PipelineStage` support meta initialization ([#136243](https://github.com/pytorch/pytorch/pull/136243))
  * Allowed non-0 stages to accept kwargs ([#136416](https://github.com/pytorch/pytorch/pull/136416))
  * Added schedule simulator and chrometrace dump ([#138134](https://github.com/pytorch/pytorch/pull/138134))
  * Supported separate dI / dW and V-schedules ([#131762](https://github.com/pytorch/pytorch/pull/131762))
  * Updated schedules to use I, B actions ([#138886](https://github.com/pytorch/pytorch/pull/138886))
  * Added type checking to _backward functions ([#140019](https://github.com/pytorch/pytorch/pull/140019))
  * Allowed multiple backward grads ([#140981](https://github.com/pytorch/pytorch/pull/140981))
  * Improved schedule csv loading ([#142009](https://github.com/pytorch/pytorch/pull/142009))
* TorchElastic
  * Added TryExcept when decoding healthcheck port ([#136574](https://github.com/pytorch/pytorch/pull/136574))
  * Skipped store barrier and store get in host assign ([#136865](https://github.com/pytorch/pytorch/pull/136865))
* Checkpoint
  * Throw an error when state_dict and saved tensors are different sizes ([#141571](https://github.com/pytorch/pytorch/pull/141571))

### Profiler
* Create Auto-Trace Frontend for Trace ID ([#139310](https://github.com/pytorch/pytorch/pull/139310))
* Add skip_first_wait to profiler.schedule ([#141512](https://github.com/pytorch/pytorch/pull/141512))
* Add CUDA Overhead to Auto-trace ([#142271](https://github.com/pytorch/pytorch/pull/142271))

### Nested Tensor
* Added NJT operator support: `rms_norm()`, `embedding_bag()`, `record_stream()`, `rad2deg()`, `embedding()` backward, activation functions ([#135872](https://github.com/pytorch/pytorch/pull/135872), [#135888](https://github.com/pytorch/pytorch/pull/135888), [#140736](https://github.com/pytorch/pytorch/pull/140736), [#138627](https://github.com/pytorch/pytorch/pull/138627), [#137099](https://github.com/pytorch/pytorch/pull/137099), [#140290](https://github.com/pytorch/pytorch/pull/140290))
* Mixed NJT, dense binary pointwise broadcasting support ([#133021](https://github.com/pytorch/pytorch/pull/133021))
* Allow any single non-batch dim to be ragged for NJT ([#137125](https://github.com/pytorch/pytorch/pull/137125))
* Add bfloat16 support to `torch.bmm(NST, NST)` ([#141380](https://github.com/pytorch/pytorch/pull/141380))
* Add missing fp classification functions for NST ([#139890](https://github.com/pytorch/pytorch/pull/139890))

### Functorch
* Add vmap support for `torch.scatter_reduce` ([#135547](https://github.com/pytorch/pytorch/pull/135547))
* Add vmap support for `native_dropout_backward` ([#140140](https://github.com/pytorch/pytorch/pull/140140))
* Allow optional positional arguments for `torch.func.functional_call` ([#134643](https://github.com/pytorch/pytorch/pull/134643))

### Quantization
* Add uint16 support for observer ([#136238](https://github.com/pytorch/pytorch/pull/136238))
* Change flatten recipe for `X86InductorQuantizer` ([#136298](https://github.com/pytorch/pytorch/pull/136298))
* Update choose_qparams_per_token op to output correct shape for scales and zp ([#136807](https://github.com/pytorch/pytorch/pull/136807))
* Make QAT fused modules torchscriptable ([#136285](https://github.com/pytorch/pytorch/pull/136285))
* Add missing mappings to support `torch.uint16` in quantization and export ([#136547](https://github.com/pytorch/pytorch/pull/136547))
* Default to use training IR ([#137804](https://github.com/pytorch/pytorch/pull/137804))
* Remove redundant method in X86 quantizer ([#139161](https://github.com/pytorch/pytorch/pull/139161))
* Add bfloat16 support for per tensor/channel cpu/cuda fake quantize ops ([#139306](https://github.com/pytorch/pytorch/pull/139306))
* Add `linear_dynamic_fp16` ops for OneDNN ([#140376](https://github.com/pytorch/pytorch/pull/140376))
* Annotate and convert for `linear_dynamic_fp16` for x86 ([#141480](https://github.com/pytorch/pytorch/pull/141480))
### Releng
* Updated cuDNN to 9.5.1.17 for CUDA 12.6 builds, Linux and Windows ([#137978](https://github.com/pytorch/pytorch/pull/137978))
* Upgraded CI/CD to ROCm 6.2.4 ([#141423](https://github.com/pytorch/pytorch/pull/141423))

### Cuda
* Extend `cuda_flip` to unsigned types ([#137781](https://github.com/pytorch/pytorch/pull/137781))
* SDPA Priority Manager accepts ordering ([#140467](https://github.com/pytorch/pytorch/pull/140467))
* cuDNN Attention memory layout handling improvements ([#141147](https://github.com/pytorch/pytorch/pull/141147)) ([#138354](https://github.com/pytorch/pytorch/pull/138354))

### Mps
* Add native im2col ([#135706](https://github.com/pytorch/pytorch/pull/135706))
* Add `upsample_bicubic2d` as a Metal op ([#136123](https://github.com/pytorch/pytorch/pull/136123))
* Add `scatter_reduce.two` ([#141948](https://github.com/pytorch/pytorch/pull/141948))
* Add i0 op ([#137849](https://github.com/pytorch/pytorch/pull/137849))
* Add `torch.special.i1` op ([#140196](https://github.com/pytorch/pytorch/pull/140196))
* Add `unfold_backward` on MPS ([#135411](https://github.com/pytorch/pytorch/pull/135411))
* Add `isposinf` and `isneginf` ([#136689](https://github.com/pytorch/pytorch/pull/136689))
* Add `MetalShaderLibrary::getFunctionNames()` ([#141499](https://github.com/pytorch/pytorch/pull/141499))
* Add `tri[lu]_indices` ([#137648](https://github.com/pytorch/pytorch/pull/137648))
* Fix Gamma for bfloat16 dtypes ([#136981](https://github.com/pytorch/pytorch/pull/136981))
* Extend `fmin`/`fmax`/`copysign` and `nextafter` to bfloat16 ([#136982](https://github.com/pytorch/pytorch/pull/136982))
* Enable bucketization for bfloat16 ([#136983](https://github.com/pytorch/pytorch/pull/136983))
* Fix bfloat16 to complex casts ([#137070](https://github.com/pytorch/pytorch/pull/137070))
* Enable `arange` for bfloat16 ([#136754](https://github.com/pytorch/pytorch/pull/136754))
* Enable `torch.linalg.cross` for bfloat16 ([#136984](https://github.com/pytorch/pytorch/pull/136984))
* Enable Renorm for bfloat16 ([#136985](https://github.com/pytorch/pytorch/pull/136985))
* Enable `nan_to_num` for bfloat16 ([#136986](https://github.com/pytorch/pytorch/pull/136986))
* Add support for bfloat16 autocast ([#139390](https://github.com/pytorch/pytorch/pull/139390))
* Eliminate `c10::value_or_else` ([#138818](https://github.com/pytorch/pytorch/pull/138818))
* Compile kernels into Metallib ([#138636](https://github.com/pytorch/pytorch/pull/138636))
* Write/Invoke Metal shaders from C++ ([#141547](https://github.com/pytorch/pytorch/pull/141547))
* Support `torch.Event` for MPS ([#142468](https://github.com/pytorch/pytorch/pull/142468))
* Add CompileShader method ([#141478](https://github.com/pytorch/pytorch/pull/141478))
* Reintroduce support for convolutions with output_channels > 65536 ([#140726](https://github.com/pytorch/pytorch/pull/140726))

### ROCM
* Improve PyTorch build speed in ROCm environments by downloading AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set ([#136603](https://github.com/pytorch/pytorch/pull/136603))
* Enable gfx110x architecture for hipblaslt ([#137317](https://github.com/pytorch/pytorch/pull/137317))

### XPU
* Improve the device index bound checking mechanism for XPU ([#120768](https://github.com/pytorch/pytorch/pull/120768))
* Use default context on Windows for Intel GPU: improves XPU device handling on Windows by using the default context ([#138049](https://github.com/pytorch/pytorch/pull/138049))
* Add device guard for XPU structured operators in torchgen ([#138802](https://github.com/pytorch/pytorch/pull/138802))
* Generalize device-bias code to align XPU unroll reduction with CUDA ([#142348](https://github.com/pytorch/pytorch/pull/142348))
* Generalize CUDA C++ wrapper for reuse by XPU ([#135312](https://github.com/pytorch/pytorch/pull/135312))

### Miscellaneous
* Add `torch.float8e4m3fn` dtype support to semi-structured sparse ([#136397](https://github.com/pytorch/pytorch/pull/136397))
* Faster BatchSampler ([#137423](https://github.com/pytorch/pytorch/pull/137423))
* Init threadpool with user defined `num_threads` before default ([#136793](https://github.com/pytorch/pytorch/pull/136793), [#137051](https://github.com/pytorch/pytorch/pull/137051))

### Dynamo
* `automatic_dynamic_shapes_mark_as` - adds an option to cause automatic dynamic shapes to trigger unbacked SymInts rather than backed SymInts ([#141415](https://github.com/pytorch/pytorch/pull/141415))
* Propagate detailed source location information of shape guards to guards/recompiles output ([#136917](https://github.com/pytorch/pytorch/pull/136917))
* `torch.compile` support for Python 3.13 ([#139533](https://github.com/pytorch/pytorch/pull/139533))
* Trace through dynamic callables on tensor variables ([#137940](https://github.com/pytorch/pytorch/pull/137940))
* Trace through dataclasses ([#141294](https://github.com/pytorch/pytorch/pull/141294))
* Graph region tracking for deduplication (i.e. common subgraph extraction) ([#141381](https://github.com/pytorch/pytorch/pull/141381))
* Scan higher order op ([#134102](https://github.com/pytorch/pytorch/pull/134102))
* Trace subclasses of namedtuple type ([#140534](https://github.com/pytorch/pytorch/pull/140534))
* Trace dict subclasses ([#143548](https://github.com/pytorch/pytorch/pull/143548))

### Export
* Preserve the call signature for a module when it is called multiple times ([#137999](https://github.com/pytorch/pytorch/pull/137999), [#138669](https://github.com/pytorch/pytorch/pull/138669))
* Let `export` preserve the `node.meta["custom"]` field ([#138266](https://github.com/pytorch/pytorch/pull/138266))
* Add `neg` and `pos` operators to `serde/serialize` ([#138309](https://github.com/pytorch/pytorch/pull/138309), [#143343](https://github.com/pytorch/pytorch/pull/143343))
* Update min_val and max_val to Optional[int] in serialization and allow the schema to express infinity ([#139394](https://github.com/pytorch/pytorch/pull/139394))

### Fx
* Bypass custom `__setattr__` in `Node.__init__` ([#135733](https://github.com/pytorch/pytorch/pull/135733))
* Add new replacement_callback to materialize a replacement just in time ([#135553](https://github.com/pytorch/pytorch/pull/135553))
* Minor optimization in create_arg ([#135821](https://github.com/pytorch/pytorch/pull/135821))
* Replace _snake_case with a regexp ([#135822](https://github.com/pytorch/pytorch/pull/135822))
* Update `_inline_module` util function to work with both args and kwargs ([#136631](https://github.com/pytorch/pytorch/pull/136631))
* Make FX graph always return a tuple in fuse_as_graphmodule ([#139236](https://github.com/pytorch/pytorch/pull/139236))
* Change fx graph `_replace_hook` to a list of Callables ([#142006](https://github.com/pytorch/pytorch/pull/142006))
* Avoid generation of empty merge cpu submodule by splitter v2 ([#140794](https://github.com/pytorch/pytorch/pull/140794))
* Make split_module work with `keep_original_order=True` and a no-op graph ([#141340](https://github.com/pytorch/pytorch/pull/141340))
* Add output_node util function to `fx.Graph` ([#139770](https://github.com/pytorch/pytorch/pull/139770))
* Fix `stride` in TensorMetadata to always be a `Tuple[int, ...]` ([#141106](https://github.com/pytorch/pytorch/pull/141106))
* Enhance `from_node` node meta to track source recursively ([#142066](https://github.com/pytorch/pytorch/pull/142066))
* Support linear/BN fusion and follow the API guideline ([#141585](https://github.com/pytorch/pytorch/pull/141585))
* Enable `fuse_by_partitions` to always return output as tuple ([#142056](https://github.com/pytorch/pytorch/pull/142056))
* Add safer check for isatty in fx/_utils.py ([#140876](https://github.com/pytorch/pytorch/pull/140876))

### Inductor
* Switch GPU codegen to one-pass in AOTI ([#141980](https://github.com/pytorch/pytorch/pull/141980))
* Fix multi-kernel codegen when using one-pass in AOTI ([#142333](https://github.com/pytorch/pytorch/pull/142333))
* Fix an issue when a fallback op does not return a value in AOTI ([#142339](https://github.com/pytorch/pytorch/pull/142339))
* Improve the stride preservation logic of user-visible outputs ([#136732](https://github.com/pytorch/pytorch/pull/136732))
* Add workspace to TritonTemplates ([#138050](https://github.com/pytorch/pytorch/pull/138050))
* Enable C++ wrapper for Intel GPU ([#135318](https://github.com/pytorch/pytorch/pull/135318))
* Flip `custom_op_default_layout_constraint` in Inductor to optimize tensor layout ([#135239](https://github.com/pytorch/pytorch/pull/135239)).
* Enables coordinate descent tuning with max-autotune in Inductor ([#136867](https://github.com/pytorch/pytorch/pull/136867)).
* Adds `relu_nan_to_num` option for handling NaNs in pre-grad passes in AOTInductor ([#138545](https://github.com/pytorch/pytorch/pull/138545)).
* Enables cooperative and persistent reductions in Inductor ([#138533](https://github.com/pytorch/pytorch/pull/138533)).
* Introduces multi-kernel support alongside cooperative reductions in Inductor ([#138893](https://github.com/pytorch/pytorch/pull/138893)).
* Adds new configs `env_name_default` and `env_name_force` for better configuration management ([#138956](https://github.com/pytorch/pytorch/pull/138956)).
* Adjusts loop split optimization heuristic ([#137550](https://github.com/pytorch/pytorch/pull/137550)).
* Enhances numerical precision for fp32 in FlexAttention on ROCm devices using IEEE ([#135702](https://github.com/pytorch/pytorch/pull/135702)).
* Enables SDPA pattern matching in Inductor for CUDA, enhancing optimization capabilities ([#137085](https://github.com/pytorch/pytorch/pull/137085)).
* Updates Inductor's support for Triton AttrsDescriptor ([#137757](https://github.com/pytorch/pytorch/pull/137757)).
* Update C++ runner API to take a const vector ([#139955](https://github.com/pytorch/pytorch/pull/139955)) ### ONNX * Remove type promotion rule for pow ([#139527](https://github.com/pytorch/pytorch/pull/139527)) * Prioritize strict=False export strategy ([#139905](https://github.com/pytorch/pytorch/pull/139905)) * Update TorchTensor implementation to handle fake mode ([#139534](https://github.com/pytorch/pytorch/pull/139534)) * Use TracedONNXFunction op signature to promote inputs to tensors ([#138770](https://github.com/pytorch/pytorch/pull/138770)) * Separate decomp into single step and add to the report ([#140767](https://github.com/pytorch/pytorch/pull/140767)) * Improve the conversion of `from dynamic axes to shapes` ([#140488](https://github.com/pytorch/pytorch/pull/140488)) * Use the torchlib opset number and fix opset import logic ([#141413](https://github.com/pytorch/pytorch/pull/141413)) ## **Bug fixes** ### Python Frontend * Fix `torch.mean(..., out=)` for fp16 and bf16 on CPU ([#135174](https://github.com/pytorch/pytorch/pull/135174)) * Fix serialization for `torch.uint16`, `torch.uint32`, `torch.uint64` ([#137184](https://github.com/pytorch/pytorch/pull/137184)) * Fix Tensor preservation logic to not lose user-defined attributes in some cases ([#137267](https://github.com/pytorch/pytorch/pull/137267)) * Fix memory leak in `torch.utils.module_tracker.ModuleTracker` ([#141960](https://github.com/pytorch/pytorch/pull/141960)) ### NN Frontend * Fix `nn.functional.softshrink` returning 0 on NAN input ([#138421](https://github.com/pytorch/pytorch/pull/138421)) * Fix flex_decode to build offsets off of strides ([#139516](https://github.com/pytorch/pytorch/pull/139516)) ### Autograd Frontend * Fix `torch.nn.EmbeddingBag` when per_sample_weights is differentiable but embedding weights are not ([#142338](https://github.com/pytorch/pytorch/pull/142338)) * Determine autograd engine ready queue based on InputMetadata instead of InputBuffer ([#135633](https://github.com/pytorch/pytorch/pull/135633)) ### Composability * Fixed a correctness issue when `torch.compiling` `torch.scaled_dot_product_attention`, in the case where the scale argument is a dynamic shape ([#141728](https://github.com/pytorch/pytorch/pull/141728)) * Fixed a correctness issue when `torch.compiling` `torch.rrelu`, in the case where it mutates any module buffers ([#136008](https://github.com/pytorch/pytorch/pull/136008)) ### Distributed * c10d * Fixed extra context on device 0 ([#135273](https://github.com/pytorch/pytorch/pull/135273)) * Fixed bugs in non-blocking mode ([#137741](https://github.com/pytorch/pytorch/pull/137741)) * Fixed P2P data corruption in non-blocking mode ([#138860](https://github.com/pytorch/pytorch/pull/138860)) * Made sure not use split for P2P comm creation ([#139013](https://github.com/pytorch/pytorch/pull/139013)) * Used long/short wait for different non-blocking calls ([#142291](https://github.com/pytorch/pytorch/pull/142291)) * Recorded device index for GPU guarding during NCCLComm method calls ([#141270](https://github.com/pytorch/pytorch/pull/141270)) * Fixed the behavior of `destroy_process_group` ([#141510](https://github.com/pytorch/pytorch/pull/141510)) * Reworked NCCLComm destructor to avoid clash with CUDA driver shutdown ([#141511](https://github.com/pytorch/pytorch/pull/141511)) * Removed Option for `ProcessGroup` and Expose backend `Options` to reflect the correct code structure ([#132931](https://github.com/pytorch/pytorch/pull/132931)) 
([#135653](https://github.com/pytorch/pytorch/pull/135653)) * Fixed prefix store segmentation fault ([#136872](https://github.com/pytorch/pytorch/pull/136872)) * Fixed a race condition in one-shot `all-reduce` ([#137257](https://github.com/pytorch/pytorch/pull/137257)) * Enforced contiguity for `all-reduce` ([#137345](https://github.com/pytorch/pytorch/pull/137345)) * Fixed data corruption bug after `CUDAEventCache` is enabled ([#138040](https://github.com/pytorch/pytorch/pull/138040)) * Enforced contiguity for `alltoall` ([#141816](https://github.com/pytorch/pytorch/pull/141816)) * Fixed sequence numbers for coalesced operations ([#135132](https://github.com/pytorch/pytorch/pull/135132)) * Fixed color value for comm split being negative ([#137855](https://github.com/pytorch/pytorch/pull/137855)) * Fixed a logic of using `ncclCommSplit` ([#138781](https://github.com/pytorch/pytorch/pull/138781)) * Caught tensor.numel() == 0 in NaN detector ([#140741](https://github.com/pytorch/pytorch/pull/140741)) * Fixed a breakage in `IntraNodeComm::rendezvous()` ([#141200](https://github.com/pytorch/pytorch/pull/141200)) * Fixed `_are_we_tracing()` in dynamo for functional collectives ([#142075](https://github.com/pytorch/pytorch/pull/142075)) * DeviceMesh * Fixed ``from_group`` when passing a tensor `mesh` ([#137713](https://github.com/pytorch/pytorch/pull/137713)) * DTensor * Fixed 2D DTensor `mm` with mesh_shape (1, n) or (n, 1) ([#139134](https://github.com/pytorch/pytorch/pull/139134)) * Removed the adhoc DTensor RNG tracker `TensorParallelRNGTracker` since it does not match FSDP2+TP ([#141220](https://github.com/pytorch/pytorch/pull/141220)) * DistributedStateDict (DSD) * Initialize lr as a tensor if it is originally a tensor ([#141620](https://github.com/pytorch/pytorch/pull/141620)) * FSDP2 * Fixed 2D mismatched grad placements ([#136237](https://github.com/pytorch/pytorch/pull/136237)) * Fixed ``test_all_gather_extensions_monkey_patch`` ([#136130](https://github.com/pytorch/pytorch/pull/136130)) * Fixed mistargeted backward prefetch ([#137348](https://github.com/pytorch/pytorch/pull/137348)) * Fixed incorrect tensor meta after `.to(dtype)` ([#137593](https://github.com/pytorch/pytorch/pull/137593)) * Gated dynamo import for torch deploy ([#137203](https://github.com/pytorch/pytorch/pull/137203)) * Fixed CUDA sync for bf16 HSDP AR, fp32 params ([#140044](https://github.com/pytorch/pytorch/pull/140044)) * Fixed backward-compatible imports ([#142419](https://github.com/pytorch/pytorch/pull/142419)) * Gated PT2 code for torch deploy ([#142456](https://github.com/pytorch/pytorch/pull/142456)) * Pipeline * Fixed py ref cycle in `stage_backward` ([#136507](https://github.com/pytorch/pytorch/pull/136507)) * Fixed more leaks and check leaks in tests ([#136584](https://github.com/pytorch/pytorch/pull/136584)) * Removed modifications to autograd nodes in Zero Bubble schedule ([#136678](https://github.com/pytorch/pytorch/pull/136678)) * Fixed extra memory usage in Zero Bubble ([#138119](https://github.com/pytorch/pytorch/pull/138119)) * Fixed last backward counting for dI / dW ([#139415](https://github.com/pytorch/pytorch/pull/139415)) * Forward fixed for `_validate_schedule` ([#142211](https://github.com/pytorch/pytorch/pull/142211)) * Allowed schedules to run with single stage ([#138925](https://github.com/pytorch/pytorch/pull/138925)) * Freed memory usage earlier in last stage ([#138504](https://github.com/pytorch/pytorch/pull/138504)) * TorchElastic * Fixed store prefix race in `rendezvous` 
([#136768](https://github.com/pytorch/pytorch/pull/136768)) * Fixed rendezvous error due to `EtcdStore` get method not waiting in some cases ([#137056](https://github.com/pytorch/pytorch/pull/137056)) * Fixed the bug caused by wrong host address in creating `TCPStore` server inside dynamic rendezvous ([#139702](https://github.com/pytorch/pytorch/pull/139702)) * Checkpoint * Fix fsspec transaction failure cleanup in multithreaded environments ([#135541](https://github.com/pytorch/pytorch/pull/135541)) ### Dynamo * Fix tracing of NumPy 2 ops ([#138686](https://github.com/pytorch/pytorch/pull/138686)) * Don’t graph break on inner `torch.compile` ([#135819](https://github.com/pytorch/pytorch/pull/135819)) * Various closure/cell variable/mutation related fixes ([#136891](https://github.com/pytorch/pytorch/pull/136891), [#139339](https://github.com/pytorch/pytorch/pull/139339), [#140155](https://github.com/pytorch/pytorch/pull/140155)) * Stop importing some third party libraries ([#136334](https://github.com/pytorch/pytorch/pull/136334), [#142502](https://github.com/pytorch/pytorch/pull/142502), [#142503](https://github.com/pytorch/pytorch/pull/142503)) ### Nested Tensor Frontend * Fix NJT operator support: `sum()`, `unsqueeze()`, `to()` on non-contiguous NJTs, `where()`, `select()`, `chunk()`, reductions ([#131945](https://github.com/pytorch/pytorch/pull/131945), [#141392](https://github.com/pytorch/pytorch/pull/141392), [#137124](https://github.com/pytorch/pytorch/pull/137124), [#141500](https://github.com/pytorch/pytorch/pull/141500), [#139317](https://github.com/pytorch/pytorch/pull/139317), [#141506](https://github.com/pytorch/pytorch/pull/141506), [#141604](https://github.com/pytorch/pytorch/pull/141604)) * Fix NJT `linear_backward()` memory usage using a more efficient formula ([#141163](https://github.com/pytorch/pytorch/pull/141163)) * Fix NJT serialization ([#137031](https://github.com/pytorch/pytorch/pull/137031)) ### Cuda * Add missing boundary checks to cunn_SoftMaxForward ([#140682](https://github.com/pytorch/pytorch/pull/140682)) * Fix CTC cuda backend out-of-bound access ([#141607](https://github.com/pytorch/pytorch/pull/141607)) * Fixed cuda sanitizer and as_subclass calls ([#138218](https://github.com/pytorch/pytorch/pull/138218)) ### Mps * Allow nan mean reduction in `nll_loss` ([#135434](https://github.com/pytorch/pytorch/pull/135434)) * Fix AvgPool2d for float16 ([#136822](https://github.com/pytorch/pytorch/pull/136822)) * Error checking/bfloat16 support for `torch.normal` ([#136863](https://github.com/pytorch/pytorch/pull/136863)) * Fix reduction ops outputs for empty tensors ([#139446](https://github.com/pytorch/pytorch/pull/139446)) * Restrict MSELoss to floating types ([#139960](https://github.com/pytorch/pytorch/pull/139960)) * Fix conv backward pass for channels last ([#141009](https://github.com/pytorch/pytorch/pull/141009)) * Add autocast rule for SDPA ([#141776](https://github.com/pytorch/pytorch/pull/141776)) * Release MetalShaderLibrary cached resources ([#142053](https://github.com/pytorch/pytorch/pull/142053)) * Fixes SiLU on non-contiguous tensors ([#139006](https://github.com/pytorch/pytorch/pull/139006)) * Fix `channels_last_3d` in `nn.Conv3d` ([#141780](https://github.com/pytorch/pytorch/pull/141780)) * Guard on flash attention SymFloat scale instead of incorrectly casting to float ([#141725](https://github.com/pytorch/pytorch/pull/141725)) * Fix memory leak from unreleased NSProcessInfo ([#142052](https://github.com/pytorch/pytorch/pull/142052)) ### ROCM * 
Fixed out of memory errors on AMD triton backend ([#139883](https://github.com/pytorch/pytorch/pull/139883)) * Correct numerical issues in layer norm backwards kernel ([#140259](https://github.com/pytorch/pytorch/pull/140259)) ### XPU * Resolves an issue with duplicated build environments in XPU Linux CI. ([#141546](https://github.com/pytorch/pytorch/pull/141546)) * Fix XPU support packages version: Corrects the versioning of XPU support packages. ([#138189](https://github.com/pytorch/pytorch/pull/138189)) * Fix `c10::Event` unit test failure on XPU backend ([#141800](https://github.com/pytorch/pytorch/pull/141800)) * Fix mismatched tensor metadata between FakeTensor and XPU concrete tensor in `F.logsigmoid` ([#141333](https://github.com/pytorch/pytorch/pull/141333)) * Fix memory stats error on XPU: Corrects an error in memory statistics for XPU devices. ([#135818](https://github.com/pytorch/pytorch/pull/135818)) * Fix XPU CMake typo: Corrects a typo in XPU CMake configuration. ([#140374](https://github.com/pytorch/pytorch/pull/140374)) * Fix an issue causing endless code regeneration in non-XPU environments. ([#140438](https://github.com/pytorch/pytorch/pull/140438)) * Fix incorrect device check before skipping concat linear in Inductor XPU. ([#140916](https://github.com/pytorch/pytorch/pull/140916)) ### Profiler * Clear Out Dangling AppendOnlyLists after collection ([#137450](https://github.com/pytorch/pytorch/pull/137450)) * Fix `UnicodeDecodeError: 'utf-8' codec can't decode byte` ([#139062](https://github.com/pytorch/pytorch/pull/139062)) * Fix ASAN Overflow Issues ([#140441](https://github.com/pytorch/pytorch/pull/140441)) * Fix devices Parameter Type in benchmark_utilization Function ([#138774](https://github.com/pytorch/pytorch/pull/138774))[ ](https://github.com/pytorch/pytorch/pull/138774) ### Quantization * Pass `ideep:lowp_kind` to` matmul_forward::compute` on cache misses ([#135058](https://github.com/pytorch/pytorch/pull/135058)) * fix re-export custom metadata ([#135282](https://github.com/pytorch/pytorch/pull/135282), [#135634](https://github.com/pytorch/pytorch/pull/135634), [#135720](https://github.com/pytorch/pytorch/pull/135720)) * moving eps in `torchao/quantization/utils.py` to targeted device to avoid device mismatch issue ([#135204](https://github.com/pytorch/pytorch/pull/135204)) * Add type check for `dilation` in `torch.quantized_max_pool3d()` ([#137845](https://github.com/pytorch/pytorch/pull/137845)) * Pass all arguments when quantizing embedding bag from float ([#137697](https://github.com/pytorch/pytorch/pull/137697)) * Fix for split gates enabled quantizable LSTM subclass ([#140818](https://github.com/pytorch/pytorch/pull/140818)) * Fix ReLU fusion when conv/linear has > 1 user for XNNPACK ([#140846](https://github.com/pytorch/pytorch/pull/140846)) * Fix RecursionError when `prepare_pt2e` graph with concat of the same node ([#141651](https://github.com/pytorch/pytorch/pull/141651)) ### Sparse Frontend * Fix memory leak in MaskedTensor when using autograd ([#137890](https://github.com/pytorch/pytorch/pull/137890)) * Fix `bmm(COO, dense)` illegal memory access for some shapes ([#131977](https://github.com/pytorch/pytorch/pull/131977)) * Fix MaskedTensor binary ops for `sparse_csr` layout ([#134335](https://github.com/pytorch/pytorch/pull/134335)) ### Miscellaneous * Fix PyBind 2.10.4 compatibility issue ([#141456](https://github.com/pytorch/pytorch/pull/141456)) * correctly keep track of processed tensors for foreach reductions (norm, max) 
([#140103](https://github.com/pytorch/pytorch/pull/140103)) * Fixes to `torch.package` for 3.13 ([#141409](https://github.com/pytorch/pytorch/pull/141409)) ### Export * Do not deserialize arguments with default values as kwargs ([#136036](https://github.com/pytorch/pytorch/pull/136036)) * Fix `_get_non_persistent_buffers` for duplicate submodules ([#136552](https://github.com/pytorch/pytorch/pull/136552)) * Fix lifted constants order for 0-input graphs ([#136658](https://github.com/pytorch/pytorch/pull/136658)) * Handle attribute assignment detection and registered buffer assignments in `make_fx`([#137240](https://github.com/pytorch/pytorch/pull/137240)) * Fix specialization bug in unflatten + preserve_module_call_signature ([#137363](https://github.com/pytorch/pytorch/pull/137363)) * Fix `export` for constant outputs ([#137547](https://github.com/pytorch/pytorch/pull/137547), [#137993](https://github.com/pytorch/pytorch/pull/137993)) * Fix param and buffer mapping for state_dict when there are state_dict hooks ([#137609](https://github.com/pytorch/pytorch/pull/137609)) * Fix export retracing ([#137733](https://github.com/pytorch/pytorch/pull/137733)) * Fix non-strict retracing with kwargs ([#138927](https://github.com/pytorch/pytorch/pull/138927)) * Fix assigning tensor with requires_grad as constant in export ([#137997](https://github.com/pytorch/pytorch/pull/137997)) * Fix issue with runtime_assertions with in `export_for_training` ([#138292](https://github.com/pytorch/pytorch/pull/138292)) * Fix issue in move pass for copying Parameter ([#138855](https://github.com/pytorch/pytorch/pull/138855)) * Fix unflatten with HOPs ([#138978](https://github.com/pytorch/pytorch/pull/138978)) * Fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule ([#137013](https://github.com/pytorch/pytorch/pull/137013)) * Allow autocast in training ir export ([#137287](https://github.com/pytorch/pytorch/pull/137287)) * Fix unlift to preserve aliased constants ([#137310](https://github.com/pytorch/pytorch/pull/137310)) * Handle` AttrProxy._modules` when module is overwritten as None ([#139957](https://github.com/pytorch/pytorch/pull/139957)) * Fix joint graph metadata ([#136011](https://github.com/pytorch/pytorch/pull/136011)) * Fix mapping issue with `torch.Size` ([#137465](https://github.com/pytorch/pytorch/pull/137465)) * Fix `test_lazy_module_kwargs` ([#137705](https://github.com/pytorch/pytorch/pull/137705)) * Propagate ShapeEnv during lowering ([#138362](https://github.com/pytorch/pytorch/pull/138362)) * Plumb `is_export` flag to `FunctionalTensorMode` in analysis pass ([#138836](https://github.com/pytorch/pytorch/pull/138836)) ### Fx * Add `__init__.py` to shape inference folder. 
([#135461](https://github.com/pytorch/pytorch/pull/135461)) * Handle `sympy.oo` in bitwise_and/or value_ranges ([#141522](https://github.com/pytorch/pytorch/pull/141522)) * Fixes issue with enums in a tuple for dynamo ([#133123](https://github.com/pytorch/pytorch/pull/133123)) * Add output node to split_module subgraphs ([#139275](https://github.com/pytorch/pytorch/pull/139275)) * Fix deep copy of empty graph ([#141660](https://github.com/pytorch/pytorch/pull/141660)) ### Inductor * Fix a bug with not enabling the Python dispatcher in AOTInductor ([#135933](https://github.com/pytorch/pytorch/pull/135933)) * Don't run reshape pattern match on dynamic shape size tensor ([#136100](https://github.com/pytorch/pytorch/pull/136100)) * Make DtypeView work with cpp_wrapper without `abi_compatible` ([#136233](https://github.com/pytorch/pytorch/pull/136233)) * Check size hints to determine indexing dtype in Triton ([#137234](https://github.com/pytorch/pytorch/pull/137234)) * Fix an error in `_dynamo.compiled_autograd.reset()` ([#137889](https://github.com/pytorch/pytorch/pull/137889)) * Fix out-of-bounds array access in `atomic_add_vec` ([#138744](https://github.com/pytorch/pytorch/pull/138744)) * Update zero size computation in `clone_preserve_strides` ([#139224](https://github.com/pytorch/pytorch/pull/139224), [#139458](https://github.com/pytorch/pytorch/pull/139458)) * Fix for gcc10 `torch.compile` compiler error when `march=aarch64+sve` ([#137795](https://github.com/pytorch/pytorch/pull/137795)) * Fix a cubin file path issue ([#139848](https://github.com/pytorch/pytorch/pull/139848)) * Fix caching issue with AOTI packaging ([#140022](https://github.com/pytorch/pytorch/pull/140022)) * Fix a two-pass kernel missmatch in AOTI ([#141041](https://github.com/pytorch/pytorch/pull/141041)) * Fix performance bug by removing `copy_misaligned_inputs` from AOTI ([#142136](https://github.com/pytorch/pytorch/pull/142136)) * Fix mask bug in `torch.cat` kernel ([#140838](https://github.com/pytorch/pytorch/pull/140838)) * Fixed max-autotune in FlexAttention to reset kernel options appropriately ([#138733](https://github.com/pytorch/pytorch/pull/138733)) * Don't set XBLOCK larger than xnumel ([#138730](https://github.com/pytorch/pytorch/pull/138730)) * Fix inductor CPU `masked()` body codegen when result dtype is bool and operator is where ([#138486](https://github.com/pytorch/pytorch/pull/138486)) * Fix typo in `codegen_dynamic_scalar` ([#138760](https://github.com/pytorch/pytorch/pull/138760)) * Fix `ReinterpretView` call in `TMADescriptor` IR ([#138759](https://github.com/pytorch/pytorch/pull/138759)) * Fix free symbol handling in FlexAttention ([#138794](https://github.com/pytorch/pytorch/pull/138794)) * Fix codegen for `tl.constexpr` globals ([#138757](https://github.com/pytorch/pytorch/pull/138757)) * Force strides for efficient attention backward ([#138879](https://github.com/pytorch/pytorch/pull/138879)) * Make AOT inductor treat None args correctly ([#139114](https://github.com/pytorch/pytorch/pull/139114)) * Fix a bug with arg ordering in handling dynamic shapes ([#139777](https://github.com/pytorch/pytorch/pull/139777)) * Fixing missing ck package warning when the backend is disabled ([#139790](https://github.com/pytorch/pytorch/pull/139790)) * Force contiguous layout for implicit fallback ([#140996](https://github.com/pytorch/pytorch/pull/140996)) * Fix another IMA with captured buffers ([#141164](https://github.com/pytorch/pytorch/pull/141164)) * Inductor dtype propagation fixes 
([#141495](https://github.com/pytorch/pytorch/pull/141495)) * Fix broadcast logic for Triton ([#141027](https://github.com/pytorch/pytorch/pull/141027)) ([#141693](https://github.com/pytorch/pytorch/pull/141693)) * Fix grid codegen for configs with empty kwargs ([#141824](https://github.com/pytorch/pytorch/pull/141824)) * Fix issue in CPP GEMM Template Prune Tensor ([#141798](https://github.com/pytorch/pytorch/pull/141798)) * Fix max-autotune bug with captured buffer grads ([#141531](https://github.com/pytorch/pytorch/pull/141531)) * TritonTemplate dtype fixes ([#141991](https://github.com/pytorch/pytorch/pull/141991)) * Fix device error for `NopKernelSchedulerNode` ([#141372](https://github.com/pytorch/pytorch/pull/141372)) * Resolves an issue where `try_solve` fails when both symbols are unknown and their product is zero ([#137919](https://github.com/pytorch/pytorch/pull/137919)). * Resolves an issue where a fallback operation returned `None`, preventing potential errors in AOTI initialization ([#135997](https://github.com/pytorch/pytorch/pull/135997)). * Resolves test failures following the update of pybind11 to version 2.13.6 ([#136280](https://github.com/pytorch/pytorch/pull/136280)). * Corrects the maximum autotuning for single-thread dynamic shapes in Inductor ([#136418](https://github.com/pytorch/pytorch/pull/136418)). * Fixes FMA codegen for Halide backend to ensure correct operation behavior ([#136810](https://github.com/pytorch/pytorch/pull/136810)). * Corrects `max-autotune` behavior when dealing with View nodes in FlexAttention ([#137204](https://github.com/pytorch/pytorch/pull/137204)). * Adjust BlockMask handling when reused from a larger sequence length ([#137255](https://github.com/pytorch/pytorch/pull/137255)). * Corrects `triton_reshape` by properly expanding the Min keyword in code generation ([#137357](https://github.com/pytorch/pytorch/pull/137357)). * Corrects `reduction_hint` behavior for single-element sums ([#137754](https://github.com/pytorch/pytorch/pull/137754)). * Resolves a codecache `write_atomic` issue on Windows ([#138331](https://github.com/pytorch/pytorch/pull/138331)). * Fixes AOTI data type codegen for symbolic integers ([#138106](https://github.com/pytorch/pytorch/pull/138106)). * Resolves an issue where passing `None` arguments to user-defined Triton kernels caused errors ([#138472](https://github.com/pytorch/pytorch/pull/138472)). * Correctly sets keyword arguments when creating Buffers in ROCmTemplate for proper initialization ([#138521](https://github.com/pytorch/pytorch/pull/138521)). 
### Jit * Unbreak vec128_half_neon comparison without FP16 hardware support ([#139558](https://github.com/pytorch/pytorch/pull/139558)) * Isolate the locale for NNC’s IRPrinter ([#136458](https://github.com/pytorch/pytorch/pull/136458)) * Fix misuse of offset param in seek ([#140633](https://github.com/pytorch/pytorch/pull/140633)) ### ONNX * Drop final None values as inputs for nodes in exporter graph ([#135520](https://github.com/pytorch/pytorch/pull/135520)) * Insert contiguous node between transpose and view before calling `run_decompositions` ([#137340](https://github.com/pytorch/pytorch/pull/137340)) * Fix sequence handling in graph building ([#138656](https://github.com/pytorch/pytorch/pull/138656)) * Fix 2GB exporting crash during onnx shape type inference ([#140962](https://github.com/pytorch/pytorch/pull/140962)) * Support from dynamic_shapes to dynamic_axes when `torch.onnx.export(fallback=True)` is triggered ([#139532](https://github.com/pytorch/pytorch/pull/139532)) * Remove special handling of `torchvision.ops` imports in onnx export ([#141569](https://github.com/pytorch/pytorch/pull/141569)) * Fix `tensor.index_fill` when dim=0 #139594 ([#139596](https://github.com/pytorch/pytorch/pull/139596)) ## **Performance** ### Dynamo * Attempt to use previously compiled code when Dynamo cache limit is hit ([#136655](https://github.com/pytorch/pytorch/pull/136655)) * Don’t convert Python frame local C buffer into Python dict until necessary [#140063](https://github.com/pytorch/pytorch/pull/140063) ### Mps * Dispatch to SDP-math-mps for non-contiguous Tensors ([#139791](https://github.com/pytorch/pytorch/pull/139791)) * Avoid creating spurious instances of `FUSED_ADAM_OPS` ([#141090](https://github.com/pytorch/pytorch/pull/141090)) ### ROCM * Improve `torch.sum` performance by increasing max_values_per_thread ([#135397](https://github.com/pytorch/pytorch/pull/135397)) * Turn on fast path for index_put on new ROCm version ([#136136](https://github.com/pytorch/pytorch/pull/136136)) ### Sparse Frontend * Speedup broadcasting of sparse_coo Tensors ([#142364](https://github.com/pytorch/pytorch/pull/142364)) * Speedup addmm(dense, BSR) for some int8 shapes on A100 ([#136088](https://github.com/pytorch/pytorch/pull/136088)) * Fuse scaling with addmm(dense, BSR) for some int8 shapes on A100 ([#136104](https://github.com/pytorch/pytorch/pull/136104)) * Fuse dtype conversion with addmm(dense, BSR) for some int8 shapes on A100 ([#136626](https://github.com/pytorch/pytorch/pull/136626)) ### Miscellaneous * Speed up fp16/bf16 AMP casts on H100+ ([#137053](https://github.com/pytorch/pytorch/pull/137053)) * c10d * Improved efficiency of NaN checker ([#135414](https://github.com/pytorch/pytorch/pull/135414)) * Improves performance by avoiding atomic add operations in `scatter_add` for XPU. ([#137966](https://github.com/pytorch/pytorch/pull/137966)) ### Inductor * Turn on TORCHINDUCTOR_REORDER_FOR_PEAK_MEMORY by default ([#137205](https://github.com/pytorch/pytorch/pull/137205)). If old behavior is desired, add `"reorder_for_peak_memory": False` to options in your `torch.compile` call. * Cache weight tiles in L1D for AMX int8 WoQ GEMM ([#136688](https://github.com/pytorch/pytorch/pull/136688)) * Add and use `borrow_arrayref_tensor_as_tensor` ([#142183](https://github.com/pytorch/pytorch/pull/142183)) * Support for accelerated sorting with x86-simd-sort ([#127936](https://github.com/pytorch/pytorch/pull/127936)) * Enable extended MMA shapes in CUTLASS. 
([#133686](https://github.com/pytorch/pytorch/pull/133686)) * Port ExecuTorch bfdot improvement back to ATen BlasKernel ([#136331](https://github.com/pytorch/pytorch/pull/136331), [#137377](https://github.com/pytorch/pytorch/pull/137377)) * Build `ReducedPrecisionFloatGemvFastPathKernel` & entry points for non-ARM architectures too ([#137917](https://github.com/pytorch/pytorch/pull/137917)) * Hook up `fp16_gemv_trans` to gemv fast path for non-aarch64 architectures ([#138005](https://github.com/pytorch/pytorch/pull/138005)) * Add `Vectorized<c10::BFloat16>` specialization for ARM ([#139090](https://github.com/pytorch/pytorch/pull/139090)) * Build bf16 gemv fast path & entry points for non-ARM architectures too ([#139208](https://github.com/pytorch/pytorch/pull/139208)) * Hook up `bf16_gemv_trans` to x86 bf16 GEMM ([#139220](https://github.com/pytorch/pytorch/pull/139220)) * Don't go through dispatch for *_dot_with_fp32_arith ([#140834](https://github.com/pytorch/pytorch/pull/140834)) * Add efficient isnan for NEON float/half ([#139082](https://github.com/pytorch/pytorch/pull/139082), [#139083](https://github.com/pytorch/pytorch/pull/139083)) * Hook up `fp16_gemv_trans` to x86 fp16 GEMM ([#137918](https://github.com/pytorch/pytorch/pull/137918)) * Support non-zero beta in `fp16_gemv_trans` ([#138275](https://github.com/pytorch/pytorch/pull/138275)) * Port X86_F16 from executorch half to PyTorch half ([#140720](https://github.com/pytorch/pytorch/pull/140720)) * Reserve vector for NT GEMM Matmul ([#141130](https://github.com/pytorch/pytorch/pull/141130)) * Add CK grouped conv2d fwd kernels to ROCm codegen ([#137947](https://github.com/pytorch/pytorch/pull/137947)) * Expand quantization conv-binary(-unary) pattern fusion inside inductor ([#138051](https://github.com/pytorch/pytorch/pull/138051)) * Stop force realizing to prevent recursion errors unless it's much bigger ([#138881](https://github.com/pytorch/pytorch/pull/138881)) * Constant folding for lifted graph ([#135060](https://github.com/pytorch/pytorch/pull/135060)) * Add host-side TMA support to AOTInductor ([#138878](https://github.com/pytorch/pytorch/pull/138878)) * Allow inplacing buffer when other users are inconsequential ([#138383](https://github.com/pytorch/pytorch/pull/138383)) * Don't fuse two nodes if likely to increase peak memory ([#138756](https://github.com/pytorch/pytorch/pull/138756)) * Add oneDNN BRGEMM config for Half cpp gemm template ([#136255](https://github.com/pytorch/pytorch/pull/136255)) * Enable the oneDNN Linear fusion for special case ([#139172](https://github.com/pytorch/pytorch/pull/139172)) * Remove uses of deleted operations ([#139447](https://github.com/pytorch/pytorch/pull/139447)) * Enable scaled mm with bias in gemm max autotune with CK backend ([#140674](https://github.com/pytorch/pytorch/pull/140674)) * Support linear+binary folding for freezing path ([#138807](https://github.com/pytorch/pytorch/pull/138807)) * Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case ([#140258](https://github.com/pytorch/pytorch/pull/140258)) * Improve parallelization by collapsing vectorized loop ([#128812](https://github.com/pytorch/pytorch/pull/128812)) * qconv at XPU backend ([#133080](https://github.com/pytorch/pytorch/pull/133080)) * Don't use constant mask if y numel potentially overflows y grids ([#139751](https://github.com/pytorch/pytorch/pull/139751)) * Add batched gemms into gemm max autotune with CK backend ([#141520](https://github.com/pytorch/pytorch/pull/141520)) * Add lowering to persistent-tma device kernel for `_scaled_mm` ([#142045](https://github.com/pytorch/pytorch/pull/142045)) * Add fusion pass for `linear_dynamic_fp16` with RELU ([#141556](https://github.com/pytorch/pytorch/pull/141556)) * Reverts runtime numeric check in Inductor to reduce compilation time ([#137324](https://github.com/pytorch/pytorch/pull/137324)). * Optimizes ARM64 performance by utilizing 128-bit vectors ([#137426](https://github.com/pytorch/pytorch/pull/137426)). * Adjusts `score_fusion_memory_threshold` application strategy in Inductor ([#138970](https://github.com/pytorch/pytorch/pull/138970)). * Enhances reduction operations with cooperative multi-kernel support in Inductor ([#138893](https://github.com/pytorch/pytorch/pull/138893)). * Disables `sanitize_overflow` in Inductor kernels ([#139502](https://github.com/pytorch/pytorch/pull/139502)). * Implements caching for `get_operation_names` and `get_buffer_names` ([#135446](https://github.com/pytorch/pytorch/pull/135446)). * Reorders scheduler nodes after fusion to reduce peak memory usage ([#134874](https://github.com/pytorch/pytorch/pull/134874)). * Optimize WOQ INT8 weight dequantization in AMX GEMM template ([#136630](https://github.com/pytorch/pytorch/pull/136630)). * Uses scalar for f64 constants in Triton codegen ([#136858](https://github.com/pytorch/pytorch/pull/136858)). * Reduces block sizes for improved performance when using the Triton CPU backend ([#136612](https://github.com/pytorch/pytorch/pull/136612)). * Optimizes CPU copies during autotuning by restricting them to CUDA devices ([#137509](https://github.com/pytorch/pytorch/pull/137509)). * Adds host-side Triton TMA support ([#137950](https://github.com/pytorch/pytorch/pull/137950)). * Optimizes the `can_fuse_vertical()` function ([#135788](https://github.com/pytorch/pytorch/pull/135788)). 
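For the TORCHINDUCTOR_REORDER_FOR_PEAK_MEMORY entry above, a minimal sketch (not from the release notes) of opting back into the old behavior by passing the Inductor option through `torch.compile`; the model here is just a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)

# Peak-memory-aware node reordering is now on by default; pass the Inductor
# option through torch.compile to restore the previous scheduling behavior.
compiled = torch.compile(model, options={"reorder_for_peak_memory": False})
out = compiled(torch.randn(8, 64))
```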
## **Documentation** ### Distributed * c10d * Added some code documents for `TCPStore` and `TCPStoreLibUvBackend` code ([#130496](https://github.com/pytorch/pytorch/pull/130496)) * Added more examples for c10d collectives `gather` and `scatter` ([#130427](https://github.com/pytorch/pytorch/pull/130427)) * Fixed comments in `ProcessGroupGloo` ([#137746](https://github.com/pytorch/pytorch/pull/137746)) * Added more inline comments to `CUDAEventCache` code ([#138079](https://github.com/pytorch/pytorch/pull/138079)) * Added documentations for PG APIs with some cleanups ([#140853](https://github.com/pytorch/pytorch/pull/140853)) * Updated `backend` arg documentation ([#142404](https://github.com/pytorch/pytorch/pull/142404)) * DTensor * Updated DTensor readme to use the new import path ([#138625](https://github.com/pytorch/pytorch/pull/138625)) * FSDP2 * Better error msg for cpu offloading ([#135156](https://github.com/pytorch/pytorch/pull/135156)) * Added current FSDP2 path to old composable FSDP1 warning ([#139759](https://github.com/pytorch/pytorch/pull/139759)) * Pipeline * Added small comments and variable renames ([#138735](https://github.com/pytorch/pytorch/pull/138735)) * TP * Updated link in distributed.tensor.parallel.rst ([#136103](https://github.com/pytorch/pytorch/pull/136103)) * Checkpoints * Add links to tutorial and TorchTitan checkpointing to DCP docs ([#139776](https://github.com/pytorch/pytorch/pull/139776)) ### Inductor * Update the OSS tutorial ([#139956](https://github.com/pytorch/pytorch/pull/139956)) * Add README for `torch._inductor.runtime` ([#141492](https://github.com/pytorch/pytorch/pull/141492)) * Improve OSSProxyExecutor error messages ([#141501](https://github.com/pytorch/pytorch/pull/141501)) * Enhances documentation for the bundled autotune cache to provide clearer guidance ([#138298](https://github.com/pytorch/pytorch/pull/138298)). 
### Mps * Update `MPS_ERROR_RUNTIME_TOO_LOW` message ([#139427](https://github.com/pytorch/pytorch/pull/139427)) * Fixing MPS conv1d error message for output 2**16 ([#134770](https://github.com/pytorch/pytorch/pull/134770)) * Modify missing op message ([#141314](https://github.com/pytorch/pytorch/pull/141314)) * Update error message for supported autocast type ([#139192](https://github.com/pytorch/pytorch/pull/139192)) ### NN Frontend * Fix formula in RMSNorm documentation ([#136727](https://github.com/pytorch/pytorch/pull/136727)) * Remove incorrect bias initialization in RMSNorm documentation ([#139620](https://github.com/pytorch/pytorch/pull/139620)) * Add reference to `pad_packed_sequence` in `pack_padded_sequence` documentation ([#137294](https://github.com/pytorch/pytorch/pull/137294)) * Improve documentation of `register_module_forward_hook` ([#140379](https://github.com/pytorch/pytorch/pull/140379)) * Correct reference link for triplet margin loss ([#142071](https://github.com/pytorch/pytorch/pull/142071)) * Changed 'standard-deviation' to 'variance' in normalization documentation ([#141982](https://github.com/pytorch/pytorch/pull/141982)) * Fix broadcasting error in example in `nn.functional.scaled_dot_product_attention` documentation ([#135427](https://github.com/pytorch/pytorch/pull/135427)) * Point to transformer building blocks tutorial in transformer documentation ([#144425](https://github.com/pytorch/pytorch/pull/144425)) ### Optim * Removes confusing note about closure grad modification ([#137535](https://github.com/pytorch/pytorch/pull/137535)) * Minorly reorder optim kwargs in docs ([#137531](https://github.com/pytorch/pytorch/pull/137531), [#137528](https://github.com/pytorch/pytorch/pull/137528)) * RMSprop docs: add missing input "epsilon" ([#137854](https://github.com/pytorch/pytorch/pull/137854)) * Add missing input "eps" to adam docs ([#135191](https://github.com/pytorch/pytorch/pull/135191)) * Corrected AMSGrad max equation in Adam and AdamW ([#142051](https://github.com/pytorch/pytorch/pull/142051)) * Documentation Update: Fix Missing Whitespace in Optimizer Docs ([#138321](https://github.com/pytorch/pytorch/pull/138321)) ### Python Frontend * Fix return type of `torch.nansum` example. 
([#135435](https://github.com/pytorch/pytorch/pull/135435)) * Fix `torch.cat` doc ([#135698](https://github.com/pytorch/pytorch/pull/135698)) * Fix multiple function parameters docstring ([#136097](https://github.com/pytorch/pytorch/pull/136097), [#140089](https://github.com/pytorch/pytorch/pull/140089)) * Clarify that NaNs are not equal to each other ([#137386](https://github.com/pytorch/pytorch/pull/137386)) * Fix description in `torch.save` docs to show default for pickle_protocol instead of variable name ([#138153](https://github.com/pytorch/pytorch/pull/138153)) * Fix docs for logcumsumexp formula ([#139768](https://github.com/pytorch/pytorch/pull/139768)) * Clarify meaning of rate parameter in Gamma distribution ([#134847](https://github.com/pytorch/pytorch/pull/134847)) * Updated docstrings referring to `torch.expand` to point to `torch.Tensor.expand` ([#140045](https://github.com/pytorch/pytorch/pull/140045)) * Update documentation for `torch.mean()` to note behavior with empty tensors ([#142039](https://github.com/pytorch/pytorch/pull/142039)) * Improve `torch.squeeze` parameter type in docstring ([#137485](https://github.com/pytorch/pytorch/pull/137485)) * Improve `torch.isclose` docstring ([#138459](https://github.com/pytorch/pytorch/pull/138459), [#139724](https://github.com/pytorch/pytorch/pull/139724)) * Clarify `torch.sum` dtype promotion behavior ([#140939](https://github.com/pytorch/pytorch/pull/140939)) * Clarify `torch.arange` floating-point rounding behavior ([#141655](https://github.com/pytorch/pytorch/pull/141655)) * Fix `torch.trapezoid` docstring ([#141459](https://github.com/pytorch/pytorch/pull/141459)) * Clarify when the optional opt-einsum dependency is used ([#137596](https://github.com/pytorch/pytorch/pull/137596)) * Clarify `torch.linalg.vector_norm` input aliasing behavior ([#136921](https://github.com/pytorch/pytorch/pull/136921)) * Fix `torch.linalg.svd` V* shape ([#142037](https://github.com/pytorch/pytorch/pull/142037)) ### Miscellaneous * Small rendering fix to our `torch.compile` FakeTensor documentation ([#138281](https://github.com/pytorch/pytorch/pull/138281)) * Document that load_inline requires having a C++ compiler installed ([#137521](https://github.com/pytorch/pytorch/pull/137521)) * Fix error message in `torch._scaled_mm` ([#140343](https://github.com/pytorch/pytorch/pull/140343)) * Revamp `torch.compile` troubleshooting doc ([#138620](https://github.com/pytorch/pytorch/pull/138620)) * Fix doc for export.export() API ([#135551](https://github.com/pytorch/pytorch/pull/135551)) * Fix the example in fx/interpreter ([#139368](https://github.com/pytorch/pytorch/pull/139368)) * Add new PT2 troubleshooting doc ([#138620](https://github.com/pytorch/pytorch/pull/138620)) * Update "Getting Started with XPU" documentation. 
([#137479](https://github.com/pytorch/pytorch/pull/137479)) ## **Developers** ### Composability * Make `maybe_aliasing_or_mutating` proper tag ([#131990](https://github.com/pytorch/pytorch/pull/131990)) ### Distributed * c10d * Added wait counter for nccl abort ([#136067](https://github.com/pytorch/pytorch/pull/136067)) * Added wait counter for time spent in object to tensor and tensor to object ([#140414](https://github.com/pytorch/pytorch/pull/140414)) * Added trace operations for `TCPStoreLibUvBackend` ([#136320](https://github.com/pytorch/pytorch/pull/136320)) * Cast device index to int before logging ([#135405](https://github.com/pytorch/pytorch/pull/135405)) * Logged `WorkNCCL` exception string to `C10dLogger` ([#137736](https://github.com/pytorch/pytorch/pull/137736)) * Made Formatter avoid throwing exceptions in `socket.cpp` ([#137745](https://github.com/pytorch/pytorch/pull/137745)) * Recorded world size in the log of flight recorder ([#138044](https://github.com/pytorch/pytorch/pull/138044)) * Differentiated timeout errors from nccl errors ([#138240](https://github.com/pytorch/pytorch/pull/138240)) * Added more appropriate socket errors and debug messages ([#130347](https://github.com/pytorch/pytorch/pull/130347)) * Reordered cpp stack dump and FR dump and add log prefix to loggings ([#138368](https://github.com/pytorch/pytorch/pull/138368)) * Reordered GIL checker and c++ stack trace print with comments ([#138734](https://github.com/pytorch/pytorch/pull/138734)) * Enabled watchdog to print call-time traceback when reporting NCCL watchdog timeout ([#139659](https://github.com/pytorch/pytorch/pull/139659)) * Added type information for `FakeProcessGroup` ([#133211](https://github.com/pytorch/pytorch/pull/133211)) * Added a wait counter for dump function ([#140823](https://github.com/pytorch/pytorch/pull/140823)) * Switched all timer logging in c10d to wait_counter ([#141154](https://github.com/pytorch/pytorch/pull/141154)) * Improved Flight Recorder efficacy ([#142178](https://github.com/pytorch/pytorch/pull/142178)) * Changed back `vlog(2)` to `LOG(INFO)` for Flight Recorder ([#142441](https://github.com/pytorch/pytorch/pull/142441)) * Added better profiling title for “NCCL barrier, nccl:all_reduce” to “nccl:all_reduce_barrier” ([#140785](https://github.com/pytorch/pytorch/pull/140785)) * Adopted better error message for flight recorder status ([#142505](https://github.com/pytorch/pytorch/pull/142505)) * Fixed the wrong error msg in `ProcessGroupNCCL` ([#135423](https://github.com/pytorch/pytorch/pull/135423)) * Added some missing spaces in barrier msg ([#137721](https://github.com/pytorch/pytorch/pull/137721)) * Added thread-safety initialization warning ([#139638](https://github.com/pytorch/pytorch/pull/139638)) * Added the log of started work numel ([#139773](https://github.com/pytorch/pytorch/pull/139773)) * Improved messaging of `ProcessGroupNCCL` destructor ([#142297](https://github.com/pytorch/pytorch/pull/142297)) * TorchElastic * Passed `FileTimerRequests.to_json()` to `log_debug_info_for_expired_timers` for a better debugging experience ([#135913](https://github.com/pytorch/pytorch/pull/135913)) ### Export * Prototype `_swap_modules` API that can be used to swap submodules of an exported program ([#136190](https://github.com/pytorch/pytorch/pull/136190), [#139126](https://github.com/pytorch/pytorch/pull/139126)) * Avoid debug name crash for dim hints ([#139104](https://github.com/pytorch/pytorch/pull/139104)) ### Inductor * Remove the non-ABI-compatible mode 
([#138009](https://github.com/pytorch/pytorch/pull/138009), [#138047](https://github.com/pytorch/pytorch/pull/138047)) * Move `use_minimal_arrayref_interface` logic ([#138250](https://github.com/pytorch/pytorch/pull/138250)) * Refactor `ir.Layout` into `ir.OutputSpec` ([#140910](https://github.com/pytorch/pytorch/pull/140910)) * Refactor `dependencies.extract_loop_body_with_args` ([#141404](https://github.com/pytorch/pytorch/pull/141404)) * Modest code motion in compile_fx ([#141574](https://github.com/pytorch/pytorch/pull/141574)) * Move post compile steps into post_compile1/post_compile2 method ([#141656](https://github.com/pytorch/pytorch/pull/141656)) * Inline `FxGraphCache.load` into its sole call site ([#141681](https://github.com/pytorch/pytorch/pull/141681)) * Hoist `set_feature_use` out of conditional, rename some variables ([#141683](https://github.com/pytorch/pytorch/pull/141683)) * Unify cache disable and cache bypass paths ([#141685](https://github.com/pytorch/pytorch/pull/141685)) * Unify `post_compile1` and `CompiledFxGraph` constructor ([#141689](https://github.com/pytorch/pytorch/pull/141689)) * Inline `compile_to_fn` at its only call site ([#141691](https://github.com/pytorch/pytorch/pull/141691)) * move block pointer analysis to a new module ([#141733](https://github.com/pytorch/pytorch/pull/141733)) * Factor `_fx_graph_cache_key` and _time_taken_ns to common base class ([#141878](https://github.com/pytorch/pytorch/pull/141878)) * codecache: pull out some Graph serialization code into common helpers ([#141502](https://github.com/pytorch/pytorch/pull/141502)) * Refactor optional graph module into `CompiledFxGraphConstants` ([#141897](https://github.com/pytorch/pytorch/pull/141897)) * Adds a compiler bisector tool to aid in debugging and development processes within PyTorch ([#131936](https://github.com/pytorch/pytorch/pull/131936)). ### Optim * Add back optim type hints that were lost when `*.pyi` files were removed ([#136185](https://github.com/pytorch/pytorch/pull/136185)) * Ensure SWA boundary conditions w.r.t. 
definition ([#133773](https://github.com/pytorch/pytorch/pull/133773)) ### Quantization * Add unaligned attributes to `q8gemm`/`4x4c2-sse2.c` ([#140188](https://github.com/pytorch/pytorch/pull/140188)) * Adding more support QuantizedPrivateuse1 backends ([#139860](https://github.com/pytorch/pytorch/pull/139860)) * Make `move_exported_model_to_train`/`eval` idempotent ([#142239](https://github.com/pytorch/pytorch/pull/142239)) ### Releng * Deprecate usage of pytorch/builder repository ([#142156](https://github.com/pytorch/pytorch/pull/142156)) ([#142277](https://github.com/pytorch/pytorch/pull/142277)) ([#142282](https://github.com/pytorch/pytorch/pull/142282)) ([#142482](https://github.com/pytorch/pytorch/pull/142482)) ([#138103](https://github.com/pytorch/pytorch/pull/138103)) ([#139815](https://github.com/pytorch/pytorch/pull/139815)) ([#140020](https://github.com/pytorch/pytorch/pull/140020)) ([#142382](https://github.com/pytorch/pytorch/pull/142382)) * Add inductor micro benchmark on x86 metal runner ([#135042](https://github.com/pytorch/pytorch/pull/135042)) ([#136052](https://github.com/pytorch/pytorch/pull/136052)) ([#135780](https://github.com/pytorch/pytorch/pull/135780)) * Migrated PyTorch Dev Infra Runners to Amazon Linux 2023 ([#136540](https://github.com/pytorch/pytorch/pull/136540)) ([#136544](https://github.com/pytorch/pytorch/pull/136544)) * Migrated HUD backend database from Rockset to Clickhouse ([#139296](https://github.com/pytorch/pytorch/pull/139296)) ([#139322](https://github.com/pytorch/pytorch/pull/139322)) ([#137207](https://github.com/pytorch/pytorch/pull/137207)) ([#139922](https://github.com/pytorch/pytorch/pull/139922)) ([#140574](https://github.com/pytorch/pytorch/pull/140574)) * Release engineering tooling, CI fixes and additional CI tests . 
Workflows, Trymerge, Bot Labeler, Mergebot ([#136060](https://github.com/pytorch/pytorch/pull/136060)) ([#140185](https://github.com/pytorch/pytorch/pull/140185)) ([#135582](https://github.com/pytorch/pytorch/pull/135582)) ([#135644](https://github.com/pytorch/pytorch/pull/135644)) ([#136061](https://github.com/pytorch/pytorch/pull/136061)) ([#135342](https://github.com/pytorch/pytorch/pull/135342)) ([#136043](https://github.com/pytorch/pytorch/pull/136043)) ([#134356](https://github.com/pytorch/pytorch/pull/134356)) ([#136208](https://github.com/pytorch/pytorch/pull/136208)) ([#136610](https://github.com/pytorch/pytorch/pull/136610)) ([#136791](https://github.com/pytorch/pytorch/pull/136791)) ([#136239](https://github.com/pytorch/pytorch/pull/136239)) ([#135342](https://github.com/pytorch/pytorch/pull/135342)) ([#136794](https://github.com/pytorch/pytorch/pull/136794)) ([#137104](https://github.com/pytorch/pytorch/pull/137104)) ([#137168](https://github.com/pytorch/pytorch/pull/137168)) ([#137176](https://github.com/pytorch/pytorch/pull/137176)) ([#137170](https://github.com/pytorch/pytorch/pull/137170)) ([#137169](https://github.com/pytorch/pytorch/pull/137169)) ([#135390](https://github.com/pytorch/pytorch/pull/135390)) ([#137614](https://github.com/pytorch/pytorch/pull/137614)) ([#137802](https://github.com/pytorch/pytorch/pull/137802)) ([#137791](https://github.com/pytorch/pytorch/pull/137791)) ([#138178](https://github.com/pytorch/pytorch/pull/138178)) ([#138054](https://github.com/pytorch/pytorch/pull/138054)) ([#138232](https://github.com/pytorch/pytorch/pull/138232)) ([#138263](https://github.com/pytorch/pytorch/pull/138263)) ([#138178](https://github.com/pytorch/pytorch/pull/138178)) ([#138752](https://github.com/pytorch/pytorch/pull/138752)) ([#138204](https://github.com/pytorch/pytorch/pull/138204)) ([#138714](https://github.com/pytorch/pytorch/pull/138714)) ([#138874](https://github.com/pytorch/pytorch/pull/138874)) ### XPU * Remove unnecessary Triton dependencies for XPU wheel builds. ([#143983](https://github.com/pytorch/pytorch/pull/143983)) * Update Docker builds workflow with a new XPU image name. ([#142298](https://github.com/pytorch/pytorch/pull/142298)) * Restore Triton build support for XPU. ([#141775](https://github.com/pytorch/pytorch/pull/141775)) * Update Triton XPU version pinning. ([#135638](https://github.com/pytorch/pytorch/pull/135638)) * Improve exception handling for XPU device initialization. ([#141658](https://github.com/pytorch/pytorch/pull/141658)) * Enhance unit tests for XPU memory allocation. ([#141325](https://github.com/pytorch/pytorch/pull/141325)) * Make XPU libraries publicly accessible for developers. ([#136974](https://github.com/pytorch/pytorch/pull/136974)) * Improve code formatting for XPU oneDNN integration. ([#139721](https://github.com/pytorch/pytorch/pull/139721)) * Make XPU oneDNN headers publicly available for documentation purposes. ([#139177](https://github.com/pytorch/pytorch/pull/139177)) * Ensure XPU compiler version control in CMake for backward compatibility. Users should align their XPU compiler version with supported versions in PyTorch. ([#139258](https://github.com/pytorch/pytorch/pull/139258))

PyTorch 2.5.1: bug fix release (2024-10-29)

This release is meant to fix the following regressions:
- Wheels from PyPI are unusable out of the box on RPM-based Linux distributions: https://github.com/pytorch/pytorch/issues/138324
- PyPI arm64 distribution logs cpuinfo error on import: https://github.com/pytorch/pytorch/issues/138333
- Crash When Using torch.compile with Math scaled_dot_product_attention in AMP Mode: https://github.com/pytorch/pytorch/issues/133974
- [MPS] Internal crash due to the invalid buffer size computation if sliced API is used: https://github.com/pytorch/pytorch/issues/137800 
- Several issues related to CuDNN Attention: https://github.com/pytorch/pytorch/pull/138522

Besides the regression fixes, the release includes several documentation updates.

See release tracker https://github.com/pytorch/pytorch/issues/132400 for additional information.

PyTorch 2.5.0 Release, SDPA CuDNN backend, Flex Attention (2024-10-17)

# PyTorch 2.5 Release Notes

- Highlights  
- Backwards Incompatible Change  
- Deprecations  
- New Features  
- Improvements  
- Bug fixes  
- Performance  
- Documentation  
- Developers  
- Security

## Highlights

We are excited to announce the release of PyTorch® 2.5! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.
As well, please check out our new ecosystem projects releases with [TorchRec](https://github.com/pytorch/torchrec) and [TorchFix](https://github.com/pytorch-labs/torchfix/releases/tag/v0.6.0).

| Beta | Prototype |
|------|-----------|
| CuDNN backend for SDPA | FlexAttention |
| torch.compile regional compilation without recompilations | Compiled Autograd |
| TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder |
| TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template |
| | TorchInductor on Windows |
| | FP16 support on CPU path for both eager mode and TorchInductor CPP backend |
| | Autoload Device Extension |
| | Enhanced Intel GPU support |

*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?gid=949287277#gid=949287277).

### BETA FEATURES
#### [Beta] CuDNN backend for SDPA
The cuDNN "Fused Flash Attention" backend  was landed for `torch.nn.functional.scaled_dot_product_attention`. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
#### [Beta] _torch.compile_ regional compilation without recompilations
Regional compilation without recompilations is enabled via `torch._dynamo.config.inline_inbuilt_nn_modules`, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, it can result in smaller compilation latencies at the cost of 1%-5% performance degradation.

See the [tutorial](https://pytorch.org/tutorials/recipes/regional_compilation.html) for more information.
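A minimal sketch of the regional-compilation pattern, assuming a toy model with repeated identical blocks (all names here are illustrative): compile each block rather than the whole model, so the compiled code is reused across blocks instead of triggering recompilations.

```python
import torch
import torch.nn as nn

class Block(nn.Module):                      # stands in for a transformer layer
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

class Model(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.layers = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = Model(256, 12)
for layer in model.layers:
    layer.compile()                          # compile the repeated region in place

out = model(torch.randn(8, 256))
```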
#### [Beta] TorchInductor CPU backend optimization
This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with the static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode. 

Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.
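A minimal sketch of exercising the Inductor CPU backend; the CPP-wrapper toggle shown is optional and assumes the `torch._inductor.config.cpp_wrapper` knob:

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True            # optional: use the CPP wrapper

model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10)).eval()
x = torch.randn(32, 256)                      # CPU tensors -> Inductor CPU backend

with torch.no_grad():
    y = torch.compile(model)(x)
```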

### PROTOTYPE FEATURES
#### [Prototype] FlexAttention
We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.

For more information and examples, please refer to the [official blog post](https://pytorch.org/blog/flexattention/) and [Attention Gym](https://github.com/pytorch-labs/attention-gym).
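A minimal sketch of a causal mask expressed as a `score_mod`, assuming the prototype `torch.nn.attention.flex_attention` module described above:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, batch, head, q_idx, kv_idx):
    # Keep scores where the query is allowed to attend to the key; mask the rest.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# torch.compile generates the fused FlashAttention-style kernel.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
```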
#### [Prototype] Compiled Autograd
Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.

Please refer to the [tutorial](https://pytorch.org/tutorials/intermediate/compiled_autograd_tutorial.html) for more information.
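A minimal sketch following the tutorial's flag-based setup (the `torch._dynamo.config.compiled_autograd` toggle is assumed here); the backward pass is captured when it executes inside the compiled training step:

```python
import torch
import torch._dynamo
import torch.nn as nn

torch._dynamo.config.compiled_autograd = True    # also capture the backward pass

model = nn.Linear(32, 32)

@torch.compile
def train_step(x):
    loss = model(x).sum()
    loss.backward()                              # traced at backward-execution time
    return loss

train_step(torch.randn(16, 32))
```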
#### [Prototype] Flight Recorder
Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.

For more information please refer to the following [tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
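As a hedged sketch, the recorder is typically switched on through environment variables before the process group is created (the variable names are taken from the Flight Recorder tutorial and should be treated as assumptions here):

```py
import os
import torch.distributed as dist

# Keep a ring buffer of recent collectives and dump it when a watchdog timeout fires.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # number of entries to retain
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"

dist.init_process_group(backend="nccl")
```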
#### [Prototype] Max-autotune Support on CPU with GEMM Template
Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.

For more information please refer to the [tutorial](https://pytorch.org/tutorials/prototype/max_autotune_on_CPU_tutorial.html).
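A minimal CPU sketch (the model and shapes are illustrative):

```py
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).eval()

# "max-autotune" profiles candidate GEMM implementations (ATen-based vs. the C++
# GEMM template) at compile time and picks the fastest for this shape and dtype.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    y = compiled(torch.randn(32, 1024))
```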
#### [Prototype] TorchInductor CPU on Windows
The Inductor CPU backend in torch.compile now works on Windows. Currently, MSVC (cl), Clang (clang-cl), and the Intel compiler (icx-cl) are supported for Inductor on Windows.

See the [tutorial](https://pytorch.org/tutorials/prototype/inductor_windows_cpu.html) for more details.
#### [Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend
Float16 is a commonly used reduced-precision floating point type for improving performance in neural network inference and training. Starting with this release, float16 is supported on the CPU path for both eager mode and the TorchInductor CPP backend.
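A minimal sketch of both paths on CPU (the model and sizes are illustrative):

```py
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.GELU()).eval().to(torch.float16)
x = torch.randn(16, 128, dtype=torch.float16)

with torch.no_grad():
    eager_out = model(x)                    # eager-mode float16 on CPU
    compiled_out = torch.compile(model)(x)  # TorchInductor CPP backend path
```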
#### [Prototype] Autoload Device Extension
PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.

See the [tutorial](https://pytorch.org/tutorials/prototype/python_extension_autoload.html) for more information.
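As a hedged sketch of what an out-of-tree package might declare (the package name `torch_foo` and its `_autoload` hook are hypothetical), the extension registers itself under the `torch.backends` entry-point group in its packaging metadata:

```py
# setup.py of a hypothetical out-of-tree backend package
from setuptools import setup

setup(
    name="torch_foo",
    version="1.0",
    packages=["torch_foo"],
    entry_points={
        # PyTorch discovers and invokes this entry point at `import torch` time,
        # so users no longer need an explicit `import torch_foo`.
        "torch.backends": ["torch_foo = torch_foo:_autoload"],
    },
)
```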
#### [Prototype] Enhanced Intel GPU support
Enhanced Intel GPU support is now available for both the Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics, and Intel® Arc™ Graphics dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. This release also enables initial support for PyTorch on Windows for Intel® Client GPUs.
- Expanded the PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.
- Implemented SYCL* kernels to enhance coverage and execution of ATen operators on Intel GPUs, boosting performance in PyTorch eager mode.
- Enhanced the Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.

These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to [documentation](https://pytorch.org/docs/main/notes/get_start_xpu.html).
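A minimal sketch (assuming an XPU-enabled build and an available Intel GPU):

```py
import torch

if torch.xpu.is_available():
    device = torch.device("xpu")
    model = torch.nn.Linear(256, 256).to(device)
    x = torch.randn(64, 256, device=device)
    # torch.compile routes the graph through Inductor's Intel GPU (Triton) backend.
    y = torch.compile(model)(x)
```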

## Backwards Incompatible changes

### Distributed

- [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
  - We released dispatchable collectives in 2.0. Backend initialization now uses the Backend Options, so the ProcessGroup options are no longer needed.
  - In 2.4 and before, users can do:
  ```py
  # Users can pass in a basic option when creating an instance of ProcessGroup
  base_pg_options = ProcessGroup.Options(backend=str(backend))
  base_pg_options._timeout = timeout

  pg: ProcessGroup = ProcessGroup(
    store, rank, group_size, base_pg_options
  )

  # Users then need to create a backend option to create the comm backend (e.g., ProcessGroupNCCL)
  pg_options = ProcessGroupNCCL.Options()
  backend = ProcessGroupNCCL(
    store, rank, group_size, pg_options
  )
  ```
  - From 2.5 onwards, users no longer need to pass in an option to create an instance of ProcessGroup. Users can still set the default backend for the process group, since existing code may still query it:

  ```py
  # No basic option is passed in when creating an instance of ProcessGroup
  pg: ProcessGroup = ProcessGroup(store, rank, group_size)
  pg._set_default_backend(Backend.backend_type_map[backend])
  # Users then need to create a backend option to create the comm backend (e.g., ProcessGroupNCCL)
  pg_options = ProcessGroupNCCL.Options()
  backend = ProcessGroupNCCL(
    store, rank, group_size, pg_options
  )
  ```

### Export

- Remove `dynamic_dim()` (#134211)  
  - The `dynamic_dim()` method for specifying dynamic shapes in `torch.export()` has been removed. Please refer to the [export tutorial](https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html#constraints-dynamic-shapes) for using `Dims` to specify dynamic shapes.
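A hedged sketch of the replacement API, using `torch.export.Dim` to mark a dynamic batch dimension (the module and shapes are illustrative):

```py
import torch
from torch.export import export, Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")  # symbolic dimension replacing the removed dynamic_dim()
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
```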

### Inductor

- [Torch] Support meta device in checkpoint (#132684)
- Switch to internal benchmarking and update benchmarking path (#132827)

  This change moves from using triton’s benchmarking utils to the internal inductor utils at `torch._inductor.runtime.benchmarking`. To update your benchmarking code:

  ```py
  # before
  from torch._inductor.runtime.runtime_utils import do_bench_gpu
  # ...
  do_bench_gpu(kernel_call, rep=40, fast_flush=True)

  # after
  from torch._inductor.runtime.benchmarking import benchmarker
  # ...
  benchmarker.benchmark_gpu(kernel_call, rep=40, fast_flush=True)
  ```

### mps

- [MPS][BE] Delete MacOS-12.3 specific checks (#133141)

### nn

- Update fused kernels and call _safe_softmax from SDPA (#131863)  
    
  Before this PR, fully masked rows in the `attn_mask` passed to `nn.functional.scaled_dot_product_attention` would yield NaNs in the output; after this PR, fully masked rows yield 0s.

  Example:

  2.4.0

  ```py 
  import torch
  import torch.nn.functional as F

  B, S, D = 1, 1, 128

  q = torch.randn(B, S, D, device='cuda')
  k = torch.randn(B, S, D, device='cuda')
  v = torch.randn(B, S, D, device='cuda')

  # A mask whose only row is fully masked out
  attn_mask = torch.tensor([False], device='cuda')

  F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
  # tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
  #           nan, nan, nan, nan, nan, nan, nan, nan, nan]]], device='cuda:0')
  ```

  2.5.0

  ```py 
  import torch
  import torch.nn.functional as F

  B, S, D = 1, 1, 128

  q = torch.randn(B, S, D, device='cuda')
  k = torch.randn(B, S, D, device='cuda')
  v = torch.randn(B, S, D, device='cuda')

  # The same fully masked row now produces zeros instead of NaNs
  attn_mask = torch.tensor([False], device='cuda')

  F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
  # tensor([[[-0., -0., 0., -0., 0., 0., -0., -0., -0., -0., 0., -0., -0., -0., 0., -0., 0., 0., -0., -0., 0., -0., -0.,
  #           0., 0., -0., 0., -0., -0., 0., -0., -0.]]], device='cuda:0')
  ```

### Optimizer Frontend

- Add support to `GradScaler` for respecting an already set `grad_scale` value (#123429)

### Python Frontend

- No more CPython 3.8 support and removal from binary (#132138)  
  CPython 3.8 is now EOL, and PyTorch 2.4 was the last release to support it.
  See [https://devguide.python.org/versions/](https://devguide.python.org/versions/) for CPython EOL timelines and [https://github.com/pytorch/pytorch/blob/main/RELEASE.md#python](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#python) for PyTorch’s CPython version support policy.

### ONNX
#### Options to `torch.onnx.export` (except for the first three arguments) are now keyword-only (#131501)

Options can now be supplied only as keyword arguments, to allow for future additions and evolution of the `torch.onnx.export` API.

Example:  
Version 2.4  
```python  
torch.onnx.export(model, input, f, True, False)  
```

Version 2.5:  
```python  
torch.onnx.export(model, input, f, export_params=True, verbose=False)  
```

#### Deprecated internal API `torch.onnx._export` has been removed (#133824)

`torch.onnx._export` is an internal API which is not meant for public consumption. Use the public `torch.onnx.export` instead.

Example:  
Version 2.4  
```python  
torch.onnx._export(...)  
```

Version 2.5:  
```python  
torch.onnx.export(...)  
```

#### The `op_level_debug` option from `torch.onnx.ExportOptions` has been removed (#134961)

This option, designed to identify operator discrepancies, proved unreliable and has been removed. Instead, use the `torch.onnx.export(..., report=True, verify=True)` options to validate exported models.

#### The `ONNXProgramSerializer` class has been removed (#135261)

The ONNX model in `torch.onnx.ONNXProgram` is now maintained and serialized by [ONNX IR](https://github.com/microsoft/onnxscript/blob/main/onnxscript/ir/README.md).  
`textproto`, `onnxtext`, and `json` formats are supported by default when calling `ONNXProgram.save()` with a corresponding file extension.

#### The `SymbolicContext` class has been removed (#132184)

The deprecated `torch.onnx.SymbolicContext` class has been removed. (Non-dynamo) custom symbolic functions can no longer take `ctx: torch.onnx.SymbolicContext` as the first argument.

#### Support for caffe2 has been removed (#129021)

- Remove Caffe2 handling from `onnx_unpack_quantized_weights` (#129021)  
- Remove `is_caffe2_aten_fallback` in `torch.onnx.symbolic_helper`

#### Some error classes are removed

`CheckerError` and `InvalidExportOptionsError` are removed. Users can always catch `RuntimeError` to handle torch.onnx export errors.
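For example (a hedged sketch; the model and output path are illustrative):

```py
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
try:
    torch.onnx.export(model, (torch.randn(1, 4),), "model.onnx")
except RuntimeError as e:
    # The removed CheckerError / InvalidExportOptionsError would previously have
    # surfaced here; RuntimeError now covers torch.onnx export failures.
    print(f"ONNX export failed: {e}")
```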

## Deprecations

### Dynamo

- Remove `torch._dynamo.utils.CompileProfiler` (#135133)

### Export

- Deprecate `None` for specifying static dimensions in `dynamic_shapes` (#134877)
The use of `None` at the dimension level for specifying dynamic shapes is now deprecated and raises a user warning; please use `Dim.STATIC` in its place. Specifying `None` for an entire input, or an entire program, is still supported.

### Inductor

- aot_autograd: copy metadata from fw to bw nodes (#126573)
- Deprecate `search_autotune_cache` (#133628)

### Releng

- Deprecate Python 3.8 support from CI/CD (#133621, #133624, #135245)

### ONNX
#### Supplying model keyword arguments to `torch.onnx.export` is deprecated (#131501)

The ability to supply model keyword arguments as a final dictionary is deprecated. Users should use the `kwargs` parameter instead.

Deprecated:
```python
torch.onnx.export(model, (arg1, arg2, {"kwarg1": …}))
```

Future:
```python
torch.onnx.export(model, (arg1, arg2), kwargs={"kwarg1": …})
```

#### `torch.onnx.OperatorExportTypes` is deprecated (#131501)

The ability to supply `operator_export_type` in `torch.onnx.export()` is deprecated. Exported ONNX graphs will always use the ONNX opset domain. The options `ONNX_FALLTHROUGH`, `ONNX_ATEN`, and `ONNX_ATEN_FALLBACK` are no longer supported. The `OperatorExportTypes` class will be removed in a future release.

#### The `training` option in `torch.onnx.export` is deprecated

Set the model's training mode before exporting instead.

Deprecated:
```python
torch.onnx.export(model, inputs, path, training=torch.onnx.TrainingMode.EVAL)
```

Future:
```python
model = model.eval()
torch.onnx.export(model, inputs, path)
```

## New features

### Autograd frontend

- Add selective activation checkpoint support to `torch.utils.checkpoint` (#125795, #129262)

### Distributed
#### Flight Recorder with an analyzer

- Flight Recorder captures diagnostic information as collectives run, currently only for NCCL collectives. The captured information is used to help root-cause issues when jobs get stuck or time out. An available analyzer script runs known heuristics on the collected data and attempts to automatically identify the underlying issue that caused the job to stall. (#110960, #113678, #114615, #114651, #114810, #114817, #115090, #115139, #115176, #115358, #115851, #118044, #118046, #118047, #119249, #119748, #119837, #120063, #120262, #120724, #120975, #122731, #126581, #126726, #128190, #128781, #128948, #129505, #130764, #131268, #133150, #133237, #133933, #133412, #134383, #134528, #134780, #134794)

#### c10d

- Enabled symmetricMemory-based, low contention intra-node `all-gather` and `reduce-scatter` (#130583)

### Dynamo

- Introduce `torch._dynamo.config.enable_compiler_collectives` for syncing compilation across ranks (#130935)

### Export

- `export_for_training [WIP/unstable]` (#129092, #130062, #134677, #135549)
- Automatic dynamic shapes (#133620, #134486, #134702)

### Inductor

- [inductor] Add Triton template for Conv3D (#129518)
- Autoheuristic: add config options for specifying for which optimizations to collect data, and for which optimizations to use learned heuristics (#130245)
- Automatic horizontal fusion for Inductor ComboKernels (#131675)
- Mode to emulate amp numerics (#131595)
- [halide] Add GPU support for the Halide backend: adds the necessary functionality to enable GPU acceleration in PyTorch's Halide backend, improving compatibility with GPU-based computation and enhancing performance for specific workloads. (#127506)
- [halide] Enable bfloat16 support for the Halide backend: introduces support for bfloat16 (bf16) data types in the Halide backend of PyTorch, expanding its capability to handle lower-precision computations and improving performance for models that benefit from mixed-precision training. (#129036)
- [halide] Support scan kernels in the Halide backend: adds support for scan operations in PyTorch's Halide backend, enabling efficient reductions across multiple axes in tensors. (#129035)
- Support adding a new inductor backend using PrivateUse1: enables the registration of a custom backend using the PrivateUse1 device type in PyTorch's inductor, facilitating backend extensions and new device types for specialized hardware. (#129953)
- [halide] Random number generation for the Halide backend: introduces support for random number generation in PyTorch's Halide backend, enabling randomized operations for certain tensor computations that previously lacked support. (#130211)
- [aoti] Add packaging solution: introduces a packaging solution for AOTInductor, allowing AOT-generated files to be packaged into a zipfile and loaded in Python. This feature supports the compilation and loading of precompiled models, enabling a more efficient workflow for distributed models. (#129895)
- Adds support for matrix decompositions when working with tensors that have unbacked sizes. (#128655)
- Adds support for Intel GPUs by splitting reduction operations. (#129120)
- Introduces a benchmark flag for inductor configuration, enhancing test workflows. (#129034)
- Adds support for nested kernels in Triton when using indirect indexing. (#129223)
- Introduces a composable kernel backend for ROCm-enabled devices in PyTorch's inductor. (#125453)
- Adds support for mutating input tensors within CUDAGraph trees. (#129184)
- Enables vectorization for bitwise operations in the C++ backend. (#129733)
- Adds support for quantized linear GEMM templates with FP32 outputs. (#128825)
- Extends GEMM template support to INT8 output with unary post-operation support. (#129048)
- Adds support for binary fusion in GEMM templates for quantized linear operations. (#129103)
- Adds support for AMX micro-GEMM kernel with int8 data type for quantized linear operations. (#129220)
- Extends UserDefinedTritonKernel to support multiple outputs. (#129325)
- Enables support for handling multiple outputs in FlexAttention operations. (#129344)
- Introduces visualization methods for block masks in FlexAttention. (#129950)
- Introduces the initial implementation of the B2B-GEMM pass with accompanying tests. (#129995)
- Adds support for FX graph caching on AMD GPUs. (#130463)
- Adds a GroupedSchedulerNode to handle nodes that need to be scheduled together in FSDP2. (#128568)
- Adds support for flex decoding in FlexAttention's high-order programming (HOP). (#129415)
- Adds partial masking support to FlexAttention. (#130415)
- Adds a DeferredCudaGridLine wrapper for CUDA grid operations. (#129268)
- Adds support for folding conv_bn with mixed data types in the post-grad phase. (#133968)
- Enables OpenMP support in the inductor when using the Intel compiler on Linux. (#134973)
- Enables the use of CUDA graphs even when there are unused CPU inputs. (#134749)
- Adds support for MKLDNN convolution operations in the C++ wrapper for inductor. (#134475)
- Introduces support for generalized linear operations using MKLDNN in the C++ wrapper. (#134783)
- Extends support for quantized convolution operations in MKLDNN via the C++ wrapper. (#134795)
- Adds support for unbacked symbolic integer (symint) divisors in variables and sizes. (#130595)

### nn

- Made `FlexAttention` API public (#130755)
- Add `nn.Modules.set_submodule()` like `get_submodule` (#127714)
- Add `nn.Buffer` like `nn.Parameter` (#125971)

### Optim

- Add an Adafactor impl (forloop and foreach) (#129905, #132336)
- Add support for capturable optimizers on hpu and xpu (#132119)

### Optimizer Frontend

- Disable expandable segments checkpointing internally (#132048)

### Profiler

- [Profiler] Collect observer traces from C++ child threads (#128743)
- [Profiler][XPU] Introduce kineto-based XPU profiler (#130811)
- [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
- [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
- [Memory Snapshot][Viz] Add Allocator Settings Tab (#132518)
- [Profiler] Add kwargs to Kineto Traces (#130373)

### Python Frontend

- Add support for device extension autoloading (#127074)
- Added support for sharing tensors on the meta device between processes (#129520)
- Add `__torch_function__` handler to `Tensor.get_device` cpp (#132567)
- Add `torch.serialization.skip_data` context manager to create a metadata-only checkpoint (#134504)
- Add `torch.serialization.safe_globals` context manager to work with weights_only serialization (#127939)

### Quantization
#### PT2E Numeric Debugger

- Preserve `_numeric_debug_handle` through deepcopy and re-export (#129287)
- Add `numeric_debugger` top level APIs (#130643)
- Update pt2e numeric debugger to use `node.meta["custom"]` field (#134040)
- Fix output node's meta (#131706)

### Releng

- Split Build: create a distribution of PyTorch composed of a C++ component and a mostly Python component, similar to jax and jaxlib (#129088, #127934, #126813, #129011, #129270, #129253, #129269, #129774, #132537, #124995, #134624)
- Intel GPU enablement in CI/CD. Add prototype Linux Manywheel binary builds with better ATen operation coverage and improved torch.compile support (#129730, #129560, #128486, #130742, #130922, #133069, #132854, #129847, #134074, #134204, #134214, #134461, #134464, #134455, #134312, #133151, #124147)
- Add prototype Linux Manywheel Python 3.13 binary builds (#130030, #132984, #133670)

### XPU

- Improve ATen operation coverage and support Intel Client GPUs in addition to Intel Data Center GPUs (#135833)
- Enable Windows support and enable PyTorch wheel build for Windows (#135833, #133151, #134312)
- Enable deterministic support for Intel GPU operations (#127277, #129864)
- Support _GLIBCXX_USE_CXX11_ABI both 0 and 1 mode (#130110)

### Sparse Frontend

- Add pinned memory support to COO/CSR/CSC/BSR/BSC tensors (#129645)
- Add MaskedTensor support to _is_any_true (#128574)
- Add MaskedTensor support to *_like API, example: empty_like, etc. ops (#128637)
- Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)

### ONNX
#### The `dynamo=True` option and new export logic (#132530, #133743, #134304, #134782, #135378, #135399, #135786, #136162, #135134, #134976, #135367, #135418, #135591, #135520)

We introduce the `dynamo=True` option in `torch.onnx.export()`. This is recommended as a replacement for `torch.onnx.dynamo_export` starting in PyTorch 2.5.

Version 2.5:
```python
onnx_program = torch.onnx.export(model, inputs, kwargs=kwargs, dynamo=True)
# Use the external_data option to save weights as external data
onnx_program.save("model.onnx", external_data=True)
# To save without initializers
onnx_program.save("model.onnx", include_initializers=False, keep_initializers_as_inputs=True)
```

`torch.onnx.export(model, args, dynamo=True, report=True, verify=True)` leverages `torch.export` and [ONNX IR](https://github.com/microsoft/onnxscript/blob/main/onnxscript/ir/README.md) to convert captured `ExportedProgram`s to ONNX efficiently and robustly. This new process reduces memory consumption by half compared to `dynamo_export` in 2.4, while preserving rich tensor shape and stack trace information in the ONNX graph. You can leverage the `report=True` option to obtain a conversion report in markdown format to diagnose any conversion issues. Set `verify=True` to verify the ONNX model numerically with ONNX Runtime.

When using `external_data=True` to save model weights as external data to the .onnx file, weights larger than 1 MB are now aligned at 64 KB addresses. This allows runtimes to memory-map weights for better memory efficiency during inference.

> [NOTE]
> The `dynamo=True` option currently supports only ONNX opset 18. Future releases will expand support to newer opsets.

> [NOTE]
> The `dynamo=True` option requires the latest versions of the `onnxscript` and `onnx` packages.
## Improvements ### Autograd frontend - Support GradientEdge as output for `torch.autograd.grad` (#127766) - `torch.autograd.graph.increment_version` accept List[Tensor] (#132652) - Hooks registered via `torch.autograd.graph.Node.register_hook` during Node execution are run (#134728) ### Compostability #### Custom ops: - Improvements to torch.library.custom_op - Supported calling from a multithreaded context (#128547) - Improved overhead when autograd is unnecessary (#127976) - Made it compatible with `from __future__ import annotations` (#128809) - Supported string default values in schema (#129179) - Added better suggestions for unsupported types (#129417) - Added `mutates_args=”unknown”` for if you didn’t know the alias information of your custom op (#129614) - Added ability to add default values for device types (#129792) - Added ability to construct factory functions (#129978) - Added ability to temporarily disable kernels (#130190, #130406) - [docs] Redirect custom ops landing page to the correct place (#129177) - Prevented Dynamo from peeking into torch.library.{custom_op, register_kernel} (#133125) - Improve aliasing error message (#134688) - Miscellaneous - `register_flop_formula` now actually does something for custom ops (#131777) - Improved torch.library.opcheck docs. (#134692) #### Dynamic shapes: - User facing features offering more control over dynamism - [dynamic] config to disable duck sizing (#129804) - Trigger dynamism on stride changes (#130232) - Add mark_unbacked (#128638) - Unbacked SymInt support - Fix set_unbacked_bindings when list of Tensors is returned (#133585) - Compute and do renamings even when ignoring fresh unbacked symbols (#134407) - Also preserve unbacked SymInts when partitioning as backward inputs (#128338) - Improve unbacked reasoning involving has internal overlap (#128332) - Make are_strides_like_channels_last size oblivious (#129677) - Correctly put mark_unbacked symbols in shape_env_to_source_to_symbol_cache (#129869) - Don't attempt to compute hints for unbacked expressions (#132060) - Export related bug fixes - Fix ConstraintViolationError exception string when exprs are int (#129271) - suggested fix for data-dependent error (#125378) - carry cond in data-dependent error (#131932) - add src map to data-dependent errors (#132393) - check unsupported sympy functions for runtime asserts (#132457) - fix silly error when printing diff (#133345) - remove dead code for suggesting legacy dynamic shapes fixes (#133700) - Improved symbolic reasoning, including performance improvements - Remove some implications from the static_eval pattern matcher (#128500) - Replace sympy Min/Max with reimplementations (#133319) - Fixed dynamic shape inference (#128807) - remove redundant upper bound check at runtime (#133627) - Remove dead expect_rational (#135105) - Stop updating hints (#129893) - Don't constrain range on the replacement for a symbol (#129907) - Make sym_node log more useful (#130436) - When translation validation is enabled, assert that hint is consistent (#130478) - Add trace_shape_events artifact tracing for ShapeEnv events (#130473) - FakeTensor cache SymInt support (#127596) - Dynamic shapes improvements to specific operators - [PT2] Resolve PT2 compatility issue in slice and diff (#133740) - Use integer divison in arange length calculation when start/end/step are integral (#134296) - Remove unnecessary expect_true in split_with_sizes (#133439) #### Decompositions, FakeTensor and meta tensors Operator decompositions, FakeTensors and meta tensors 
are used to trace out a graph in `torch.compile` and `torch.export`. They received several improvements: ##### Decompositions: - Fixes to existing decompositions: - rot90 (#129097) - aten.slice_scatter (#123744) - bucketize (#133652) - torch.istft (#135234) - torch.exp (#129154) - aten._to_copy (#130381) - aten.masked_fill_ (#127871) - const_pad_nd (#132679) - New operator decompositions: - aten.channel_shuffle (#118775) - aten.nll_loss2d (#133534) - aten.reflection_pad{1,2,3}d_backward (#130299) - aten._unsafe_index_put (#133365) ##### Meta tensors: - Fixes to existing meta tensor op implementation - aten._scaled_mm (#129521) - _convert_weight_to_int4pack (#130707) - New meta tensor op impls - _fused_adamw_ (#133728) - poisson (#134103) ##### Misc fixes: - Infer prim tags from equivalent aten ones (#130367) - Fix dim order calculation for channels_last in decomps (#131366) - Allow cross-device copies for cpu scalars in refs (#135140) - Update fake tensor error checks for bool tensor subtraction (#128492) ### Cpp frontend - Add `padding_side` to `pad_sequence` with `"left"` and `"right"` options (`"right"` as default) (#131884) - Add out variants to avg_pool1d and adaptive_avg_pool1d (#135051) ### Cuda - [BE] Improve CUDA UpSample error message (#131252) - Reduce number of guards introduced by check_cudnn_tensor_shapes when cudnn version is higher enough (#132384) - [CUDA]: Add frexp CUDA bfloat16 support (#133313) - Allow torch.cuda.memory.mem_get_info to take a device str argument with an unspecified device index. (#132616) - Change index_put on GPU to accept FP8 inputs (#128758) ### Distributed #### Activation Checkpointing (AC) - Added `kwargs` to composable AC API to enable full capabilities (#128516) - Made `ActivationWrapper` an abstract class (#129808) #### c10d - Applied `somaxconn` and enabled `TCP_NODELAY` to `TCPStoreLibUvBackend` (#128739) - Improved connect and retry logic for `TCPStore` (#129261) - Added pings to verify network connectivity on connect for `TCPStore` (#129985) - Made `new_group` eager when used with `comm_split` (#129284) - Exposed running handlers from Python for control plane (#130149) - Added new control plane handler (#129712) - Added a new Pytorch API `split_group` to create a process group (#130507) - Add `bfloat16` support for NAN check (#131131) - Enabled custom work registration from python in the Functional Collective (#130354) - Not removed collective ops in dce since they have side-effect in the Functional Collective (#131023) - Used float tensor for `ProcessGroupNCCL` barrier `all-reduce` (#132701) - Set a shorter heartbeat detect timeout to avoid race with `ProcessGroupNCCL` watchdog timeout (#133028) - Made it not call `ncclCommAbort` if comm is not initialized (#133630) - Made it not broadcast uniqueId during a split (#133962) - Reconciled barrier and NaN checker (#134707) - Releases gil lock during eager init (#134779) - Improved logic to infer device for barrier (#134617) #### DeviceMesh - Added supports for non-continuous slicing (#132310) #### Dtensor - Moved `DTensor` to public namespace (#134203) - Made `slice_backward` to use op strategy (#130287) - Improved `from_local` API with run_check (#130289) - Added a few dunder methods to pointwise ops (#130754) - Added naive support for `nn.init.orthogonal_` (#132104) - Added support for custom op registration (#131108) - Added more foreach ops to supported sharding prop list (#132066) - Added naive replicate strategy for more diagonal ops (#132201) - Rewrote redistribute algorithm for 
multi-dim mesh (#131210) - Added missing all to public modules (#133305) - Made DTensor sharding propagation for `scaled_dot_product_efficient_attention` and `scaled_dot_product_flash_attention` more conservatively cached (#134146) - Extended implicit replication to replicate `DTensor` for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551) - Added gradient scaler for `DTensor` (#132816) #### DistributedStateDict (DSD) - Kept 'exp_avg' as `DTensor` after `torch.distributed.checkpoint.state_dict.set_optimizer_state_dict` (#128004) - Correctly handled shared parameters for optimizer `state_dict` (#128685) #### FullyShardedDataParallel (FSDP) - Integrated device agnostic APIs in FSDP library (#134337) - Made `clip_grad_norm_` norm compute order deterministic (#134673) - Casted input args with `dataclass(frozen=True)` (#135067) - Avoided GPU syncs by reusing Pre-allocated Zero Tensor (#128069) #### fully_shard (FSDP2) - Included module FQN in `FSDPParamGroup` `record_functions` (#128624) - Added APIs for explicit fwd/bwd prefetching (#128884) - Added `set_post_optim_event` (#128975) - Ran reduce-scatter copy-in in default stream (#129721) - Relaxed `contract` to allow `Sequence[nn.Module]` (#127773) (#130947) - Allowed `List[nn.Module]` as arg (#130949) - Preserved `fsdp.set_ op` through lowering (#130786) - Added `set_reduce_scatter_divide_factor` (#129286) - Added hpu device to `_get_remote_device_str` (#132120) - Added repr to `FSDPParamGroup` and `FSDPParam` (#132350) - Added missing event wait (for future) in FSDP2 (#132568) - Let `fsdp.set_` convey to functionalization that it mutates storage (#132322) - Enabled autoselect default device in FSDP construction. (#127609) - Reset `FSDPParam.sharded_param` in `lazy_init` (#132954) - Enabled HSDP + TP in FSDP2 (#133335) - Added eager fast-path for fp32->bf16 param cast (#133369) - Kept DTensor params for `replicate` and `fully_shard` (#133059) #### TorchElastic - Shared `TCPStore` by default when using `c10d_rendezvous_backend` (#128096) - Used `wait` instead of `get` for store barrier (#130148) - Added missing rank tracing support in the barrier inside TorchElastic (#132818) - Made torch elastic not have to realize `TCPStore` backend type and rely on c10d to decide which backend to use (#134882) - No signal handling when off the main thread (#135088) - Supported `local_addr` across all rendezvous implementations (#135262) - Created processes in parallel in `mp.start_processes` for `forkserver` (#134629) #### TensorParallel(TP) - Improve `SequenceParallel` and its documentation (#131346) #### Pipelining - Supported separate `dw_runner` for PipelineStage (#128983) - Supported arbitrary stage ordering on ranks (#128976) - Supported W action for schedules (#129233) - Added to/from CSV format and improved repr (#129264) - Implemented flexible PP schedule (#129597) - Reordered `_Action` from `F1_1` to `1F1` (#129786) - Added forward only schedule (#132177) - Added schedule `unshard/reshard` pass (#129810) - Added schedule `send/recv` pass (#130378) - Added `zb1p` schedule (#130210) - Added `get_schedule_class` util (#132768) - Added pytorch-native input/weight grad split (#132691) - Added ZeroBubble schedule (#133467) - Unblocked zero bubble composability with DP (#134052) ### Dynamo - More tracing support: (non-exhaustive list) - - - Weakref objects (#128533) - - Some set methods (e.g. 
`discard` (#133317), `intersection` (#130672), `remove` (#132943)) - Support for proxying frozen dataclasses (#134846) - `inspect.signature.bind` (#132330) and `inspect.signature.Parameter` attribute access (#134636) - `random.Random` objects (#133725) - User-defined method descriptors (#130159) - `set` on `KeysView` (#131389) - `dict` conversion of objects derived from `MutableMapping` (#131367), constrained subclasses of dict and OrderedDict (#132558) - `__contains__` on `__dict__` of user defined classes (#131378) - `abc.MutableMapping.get` (#132363) - `frozenset` (#134563) - reading attributes from pybind objects (#134630) - classes with custom `__new__` (#132977), `object.__new__` (#133746) - `torch.cuda.device` (#133385) - `id(Parameter)` (#130100) - `str` on UserDefinedObjectVariable (#130506) - `out`-variant custom ops that return None (#129078) - `autograd.Function` `mark_non_differentiable` (#134087), `ctx.set_materialize_grads` (#133978) - Better support for HuggingFace `ModelOutput` class (#127780) - `mark_static` can now be used on `nn.Module`s to make int attributes static (instead of automatic dynamic) (#134713) - Recursively skip frames when Dynamo cache limit is hit (#135144) - Context hints for backend compilers (#132860) - Add queue_callback() support (#126366) - Suppress guards generated by empty_strided in ir_node_to_tensor (#130431) ### Export - Effect token support for TorchBind calls (#128397) - Pretty printing for unflattener (#128617) - Decomposition support for `export_for_training` (#128077, #129249, #134801) - Add CSE and SymInt compute optimization to export graphs (#130380) - Check node schema for side-effectful ops when DCE-ing (#130552) - Experimental joint-graph export API (#128847) - Dynamic shapes serialization (#134718) - Suggested fixes for data-dependent errors in non-strict mode (#125378) - Kill forced specializations, and prefer runtime asserts for dynamic shapes (#130775, #132698) - Fully support extension operators in de/serialization (#130851) - Fully handle preserved_ops with `run_decompositions()` (#130970, #131075) - Add `torch.amp.autocast_mode` as a higher-order-op subgraph (#132677) - Make `ExportedProgram.validate()` public-facing (#132777) - Allow string outputs for ExportedPrograms (#132808) - Better user error messages for dynamic_shapes mismatches (#132982) - Add a graph & node-level `“custom”` metadata field (#131912) - Add a `getitem` deduplication pass (#133618, #134830) - Make `move_to_device_pass` public-facing (#134263) - Make `while_loop` higher-order-op public-facing (#128562) - `TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1` flag for line-by-line debugging in non-strict mode (#134298) - Support `aten.full.default` and `aten.full_like.default` (#130639) - Handle python list append, list add, `aten.to.dtype` + `mutation_op` pattern for TSConverter (#132529) - Quantized ops to standard ops pass. 
(#133026) - Add `InterpreterModule` to trace_rules (#132949) - Add `tracing_mode` support for TorchBind (#129586) - Inline True branch for torch.cond when predicate is a Python constant (#130493) - Support Unbacked SymBool inputs for torch.cond (#133589) - Support `set_grad_enabled` higher-order-op in dynamo to enable re-tracing (#134281) ### ForEach Frontend - Increase parity for dtype support for `_foreach_sigmoid` and `_foreach_lgamma` (#134253, #134344) ### Fx - Use to_dtype node and const folding in convert fp32 to fp16 fx pass (#127829) - Add decomposition_table as an arg to `get_isolated_graphmodule` (#130886) - Support `meta["val"]` that is a dict, for triton kernels and for the partitioner (#132466) - Allow SymInt input for torch.fx reinplace pass (#133178) - Add `max_acc_splits` (#133041, #133724) - Update source matcher to use torch_fn (#133642) - Set maximum warning count during `fx.Graph.lint` (#135069) - Graph Printing: - Change colored logging to only be turned on if printing to interactive terminal (#128874) - Print float with full precision, don't truncate (#130027) - Save DOT file of graph instead of SVG for GraphTranformObserver (#128634) - Add normalize_args constructor argument to FxGraphDrawer (#130348) - Show size/strides for all tensors in `python_code(verbose=True)` (#132192) - Fix py codegen to delete values that don't have any users (#131028) - Propagate sparsity in fx graph (#131920) ### Inductor - Inductor windows support improvements (#134772, #133921, #131767, #131980, #132025, #132326, #132533, #132491, #132630, #132848, #132841, #132973, #133184, #134033, #134229, #134358, #134348, #134397, #134402, #134400, #134401, #134419, #134420, #134394, #134424, #134426, #134427, #134221, #132387, #132394, #132571, #134365) - Autoheuristic improvements (#130304, #133608, #131610, #132685, #131615, #131714, #131710) - ROCm Support: enable dynamic shapes for CK backend (#133285) - ppc64le: VSX Support for Inductor (#132746) - Add an option to exclude CPU overheads using `do_bench_using_profiling` in `TORCHINDUCTOR_PROFILE` (#133523) - Add thread blocking config `TORCHINDUCTOR_CPP_GEMM_THREAD_FACTORS` (#132730) - FlexAttention improvements (#133019, #132015, #133159, #134065, #134351, #133836, #133664, #132157, #131559, #132547, #131404, #131552, #130904, #134055) - Make config.autotune_remote_cache be a three-way option (#132285) - Add config option to force higher-dimensional tiling (#132937) - Add torch.save() for individual intermediate tensor (#133871) - Add inductor config: masked_vec (#134566) - Add dynamic shapes for combo_kenel/foreach_kernel (#134477) - Add Inductor config for default stride behavior (#135238) - Optionally allow padding on non-GPU devices (#135280) - Make conv template work with dynamic stride/padding (#132938) - Adds a print_readable function to the unflattener for better debug output, improving readability of the module structure. (#128617) - Adds logging for static input indices in CUDAGraphs to aid debugging. (#132726) - Improves dispatch logic for vectorized instruction sets in CPU code. (#128320) - Adds a shape property to intermediate representation nodes for better shape inference. (#127818) - Supports comprehensive padding for tensor operations, enhancing flexibility in padding schemes. (#128555) - Reduces superfluous mask handling in Triton code generation. (#128518) - Improves fusion logs by making it easier to attribute nodes to the aten graph, aiding in debugging and performance tuning. 
(#127159) - Ensures mixed_mm operations are only enabled when casting from a lower bitwidth type to a higher one, avoiding unnecessary type conversions. (#128899) - Enables multiple workers to function correctly if the method being used is subprocess. (#129002) - Adopts future annotations for better type hinting and maintainability in the inductor's scheduler and IR. (#128892) - Moves auto-tuning for Triton kernels into a separate block for better performance isolation. (#129057) - Improves convolution dilation by adding a size_hint to optimize memory allocation. (#129631) - Switches the Halide code cache to the new cpp_builder framework for better compilation support. (#129441) - Refactors GEMM template to pass weight data type explicitly. (#129221) - Introduces helper functions to convert score_mod into block_mask in FlexAttention. (#129909) - Updates HalideCodeCache to use the new cpp_builder framework for improved functionality. (#130146) - Limits functions in foreach operations when dependent on multiple subkernels to improve consistency. (#130046) - Introduces improved methods for generating runtime checks for symbolic dimensions. (#130220) - Adds a check to verify if the FX graph returned by aot_compile is a tuple. (#129824) - Adds support for passing a module map to the Triton make_ir API for better modularity. (#134774) - Removes VecChecker and implements a fallback for non-supported vector operations with a scalar loop. (#134569) - Generates a reindexer for each epilogue_node in C++ inductor operations. (#134984) - Adds a padding_value for boundary-checked loads in FlexAttention, improving load operations in corner cases. (#134573) - Improves the way argument names are used as keys in constant and signature dictionaries for better consistency. (#135170) - Removes "spawn" as an option for parallel compilation to avoid issues with certain platforms. (#130746) - Introduces a heuristic to determine whether padding should be applied before matrix multiplication operations. (#128643) - Introduces inductor-specific counters within the FX graph cache for better debugging and performance tracking. (#130635) - Adds a counter for num_matches_for_scatter on constant tensors to the cached metrics. (#130843) - The issue Lift inductor lowerings for jagged <-> padded dense kernels introduces lowerings for conversions between jagged and padded dense tensors, supporting operations like _jagged_to_padded_dense_forward and _padded_dense_to_jagged_forward. This improves PyTorch's ability to handle irregular tensor structures efficiently, particularly in tasks like natural language processing where padding is common.(#125968) - The issue Add BackendFeature gating introduces a mechanism to gate backend features in PyTorch's inductor based on the availability of specific backend capabilities. This ensures that operations requiring specialized backend support are only enabled when those features are present, improving compatibility and stability. (#128266) - The issue Support additional lifted constants supplied to const folding enhances AOTInductor's constant folding by allowing additional lifted constants from graphs to be included. This update improves compatibility when using lifted graphs in constant folding processes, particularly with the split_const_gm function. 
(#130743) - The issue Compute q_num_blocks from kv_num_blocks if q_num_blocks is not passed in updates the logic in flex_attention to compute the q_num_blocks from kv_num_blocks when q_num_blocks is not explicitly provided, ensuring consistent behavior across different input configurations. (#130809) - The issue Remove static param counting if inlining NN modules eliminates the need for static parameter counting when inlining neural network modules in the inductor backend. This simplifies the logic for handling parameters and improves performance by skipping unnecessary counting. (#130503) - The issue Use get_reduction_combine_fn for reduction ops updates the Halide backend by introducing a function that determines how reductions are combined during computation, improving the clarity and efficiency of reduction operations in the backend. (#130212) - The issue Use 0D scalar inputs/outputs allows the Halide backend to handle 0-dimensional scalar inputs and outputs, fixing issues related to scalar operations in the Halide integration with PyTorch. (#130129) - The issue Change the schema of QLinear Binary modifies the schema to better support the corresponding GEMM templates, making it easier to handle autotuning by reordering tensor inputs. (#129049) - The issue Dimension-based indexing for the halide-backend changes the indexing in the generated Halide code to be dimension-based rather than 1D-based, which resolves multiple bugs and performance issues related to indexing in PyTorch's Halide backend. (#129026) - The issue Use _unsafe_masked_index in masked_scatter decomposition updates the masked_scatter operation by using the _unsafe_masked_index function in its decomposition, improving performance and reliability.(#123667) - The issue Only autotune at compile time when enabled via config ensures that autotuning in AOTInductor only occurs when explicitly enabled via configuration, preventing unintended autotuning during execution unless specified. (#129413) - The issue Adjust efficient_conv_bn_eval_graph for inlining modifies the inlining of the efficient_conv_bn_eval_graph function to improve execution during evaluation in PyTorch, particularly for built-in neural network modules. (#128878) - The issue Make config.fx_graph_remote_cache be a three-value switch introduces a configuration option allowing fx_graph_remote_cache to be set as True, False, or None, where None disables it for OSS (open-source software) but enables internal configurations for JK. 
(#128628) - Enables GraphTransformObserver for inductor backend.(#127962) ### mps - [MPS] Add Metal implementation of exp op (#128421) - [MPS] Add lu_factor in MPS (#99269) - [MPS] Add support for autocast in MPS (#99272) - [MPS] Add support for autocast in MPS (#99272) - [MPS] Enable MPS mm from macOS >= 14.4 (#133494) - [MPS] Add support for autocast in MPS (#99272) - [MPS] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965) - [MPS] Fast math env var (#129007) - [MPS] Check and error message for no support for conv with output_channels > 2^16 (#129484) - [MPS] Add tensor_lr overloads to fused adam & adamw (#129451) - [MPS] Add SDPA implentation (#131362) - [MPS] Add native implementation for shift ops (#131813) - [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393) ### nn - Made nn.Module state_dict load_state_dict pre-hook and state_dict post-hook public (#131690) - Add deterministic support in nn.functional.interpolate for XPU (#129864) - Add batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964) - Support `nn.Module.mtia()` (#131499) ### Optim - Move fused optimizers’ param's device check to later in `_init_group` to allow better error checking in more cases (#131153) - Update typing and 1-element check for tensor lr support on all optimizers (#131065) ### Optimizer Frontend - Fix fake_tensor w/ non-view tensor (#132050) - [BE][optim] Make pyright recognize exported symbols (#135043) ### Profiler - [Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964) - [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072) ### Python Frontend - Improve error message for weights_only torch.load (#129705) - Add new blocklist for weights_only load to prevent some modules from being allowlisted (#131259) - Make PyTorch argparser understand python’s “complex” type (#129580) - Allow to register “fallback” to torch.library (#131707) - Move module_tracker to general logging system when reporting confused hierarchy (#134467) - Make `torch.serialization.set_default_mmap_options` usable as a context manager (#134371) ### Quantization #### PT2E quantization - Support `set_module_name_qconfig` in X86InductorQuantizer (#126044) - Enable PT2E symmetric dynamic quantization in metadata porting (#124615) - Fix add annotation with constant (#132092) - Fix Maxpool2d share quantization params (#132704) - Fix getattr for quantizing constants (#132705) - Use returned model from Quantizer.transform_for_annotation in prepare_pt2e (#132893) #### Observers - Fix edge cases for HistogramObserver (#129387) - Corner-case fix for upscale_histogram in the new HistogramObserver (#130316) #### Export IR Migration - Fix batch norm pattern match in quantization (#134157) - Fix getitem not exist (#134259) - Fix tests for quantized_decomposed.quantize_per_tensor decomposition (#134525) #### Others - Enable torch.empty for float8 dtypes on cpu in deterministic mode (#128744) - Fixing equalize with three things and improving functionality (#124632) - Fix the warning for cat operators with same qparams (#133999) ### Releng - Update NCCL to 2.21.5 (#124014) - Update cuSPARSELt to v0.6.2 (#134022) - Update CI/CD to CUDA 12.4.1 (#132202, #125944) - [ROCm] upgrade CI/CD to 6.2, triton-rocm improvements. (#132875, #133238, #128525, #131637, #129361, #129480, #128873, #127947, #129094) - Migrate conda, manywheel, libtorch Docker images build from pytorch/builder to pytorch/pytorch. 
(#129022, #132410, #133699, #133709) - Migrate nightly CD builds to ephemeral runners (#134469, #134380, #134367, #134463, #134473) #### Infrastructure - Remove usages of Rockset as storage in favor of AWS (#129503, #129594, #129544, #130153, #130168) - Migrated pytorch/pytorch’s CI to run on hardware owned by the Linux Foundation (#129246, #129462, #129746, #133225, #133232, #133320, #133124, #133457, #134231, #134796, #134800, #129612, #131325, #131188, #128969, #128985, #129679, #129977, #131955, #132870, #131472, #134128) - Upgraded the CI Linux runners to use the Amazon Linux 2023 AMI (#128619, #132918, #131792, #134355, #131246, #131514, #131485, #131677, #131963, #131771, #133036, #133352, #133469, #133355, #133641, #134116, #131821, #131250) ### XPU #### Intel GPU Backend for Inductor - Generalize GPU type for Intel GPU and CUDA for Inductor Triton backend (#132740) - Initialize AOTInductor support for Intel GPU by supporting store SPIR-V binary file output from Intel Triton. (#130849) - Handle device_put op in constant folding. (#130824) - Support reduction split. (#129120) - Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738) #### Intel GPU ATen Operation - Allow XPU device in copy, cdist, index_put_impl (#130088) - Customized XPU behavior in indexing, group norm (#134453) - Enable codegen for Intel GPU (#130082) - Enable Dispatch Stub support for Intel GPU (#130019) - Enhance the stability of the complex divide code (#134647) - Add support for XPU accumulate type (#128579) #### Intel GPU Runtime and Generalization - Device guard codegen for XPU (#133980) - Refactor caching device allocator utils (#130923) - Refactor cached tensor more generic (#129359) - Refine the logic of device construction when only device index is given (#129119) - XPUHooksInterface inherits from AcceleratorHooksInterface (#129463) - Add XPU memory-related APIs (#129919) - Add xpu to getAccelerator (#129205) ### Nested-Tensor Frontend - Backward support for unbind() with NJT (#128032) - Backward support for cat() with NJT (#132076) - Backward support for chunk() on non-batch, non-jagged dimensions for NJTs(#132193) - Support linear backward for NJT with dim > 3 (#129393) - Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175) - Support sum operator along the jagged dimension for NJTs (#130425) - Support mean operator along the jagged dimension for NJTs (#131132) - Support permute() for NJT (#135336) - Support dtype conversions for NJT (#134164) - Support for copy_() when shape is identical for NJTs (#132193) - Implement 2D version of masked_select() for NJTs (#133889) ### cuDNN - [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890) - cuDNN now supports convolutions with spatial dimensions that require 64-bit indexing. Previously this would fallback to a native im2col or vol2col style implementation that was very memory inefficient and lead to OOMs. 
### Sparse Frontend - Add Half for sparse.mm reduce on CPU (#133672) - Add partial support for COO/CSR/CSC/BSR/BSC representation in traced graph (#132690, #134037, #133371, #129983) ### ONNX - Lazy-import `onnxruntime` (#134662) - Add upsample trilinear to skip decomp (#128259) - Add onnx::Gelu support for version 20 (#128773) ### ROCm - Add warpSize to torch.cuda.get_device_properties (#128449) ## Bug fixes ### Autograd frontend - Fix case where saved tensors are incorrectly shared version counter (#128545) - Fix thread safety of `torch.autograd.graph.register_multi_grad_hook` (#132055) - Fix PyObject reference counting of `torch.autograd.graph.saved_tensors_hooks` (#131700) - Update `torch.autograd.graph.saved_tensors_hooks` to not detach when unpacking tensors that do not require grad (#127959) - Fix device propagation for `torch.utils.checkpoint` (#128671) ### Compostability - Fix interaction between `torch.conj()` tensor subclasses and scalar tensors (#131482) - Better complex number support for aliasing prim operators (#132699) - AOTDispatcher is a component in the torch.compile stack responsible for capturing a graph of normalized/functionalized ATen IR as well as capture the backward. A few bugfixes this release: - Properly bump version counter on input mutations in inference graphs (#131665) - Ensure that graph input mutations from the backward remain in the backward graph after partitioning (#129130) - Dynamic Shapes - Do not generate -1* in SymPy expressions when canonicalising (#128411) - Add FloatPow in the symbolic shape guard closure (#129857) - Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418) - Fix symbolic nested int printing (#131916) ### Cuda - [CUDA][Pooling] Fix 64-bit indexing in `avg_pool_2d` backward attempt 2 (#129818) - fixes (#124582, #128483) - Add threadfence to 2-stage reduction for correct writes visibility (#128455) - [cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494) - expose host_emptyCache to python, fix a bug in freeing cudaHostRegist… (#134919) ### Distributed #### Distributed checkpoint - Update _all_gather_keys utils function to derive world size based on input process group. (#135045) - Fix non-tensor objects not being loaded correctly during checkpoint loads. (#129398) - Fix meta tensors not being loaded during checkpoint loads. 
(#133256) #### c10d - Fixed `commSplit` bug by having every rank being called even though it is no-color (#128459) - Made sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908) - Added flag to control which rank should perform NaN check (#134345) - Set correct device to CUDA guards (#134357) - Fixed p2p group commsplit (#128803) - Fixed corrupt log due to uint_8 printing as char (#130184) - Fixed an issue where `ENABLE_INTRA_NODE_COMM=1` + multiple process groups leads to failure (#130269) - Fixed an issue where input check fails when running all-reduce on sub groups (#130492) - Fixed some issues in two-shot `all-reduce` (#131244) - Fixed `split_group` usage when there is a single rank (#131824) - Fixed remote address in the `TCPStore` (#131773) (#131913) - Fixed a getenv segfault due to a race in getting nccl version (#133744) - Changed collective to take in a list of tensors so it work fully for all collectives (#135049) #### CPU profiler for distributed - Fixed input/output dimension overflow (#134360) #### DSD - Disabled 2D state_dict temporarily before the feature is fully ready (#129519) #### DeviceMesh - Fixed replicate with `DeviceMesh` initialization (#133024) #### DTensor - Fixed `foreach_norm` when ord is 2 (#130753) - Fixed `_MaskPartial` when multiple embeddings coexist (#131264) - Fixed the bug where Sharding Prop cache was wrongly shared among multi-threaded ProcessGroup in tests (#134509) #### FSDP2 - Fixed `unshard` without lazy init (#129241) #### TensorParallel(TP) - Fixed `loss_parallel` with BF16 logits (#130550) #### RPC - Fixed Distributed EventList usage (#132448) ### Dynamo - Do not run default saved tensor hooks during tracing (#123196) - Make `.data` mutations invisible to autograd (#131403) - Skip frame if `TorchDispatchMode` is enabled (#131828) - Ensure `nn_module_stack` is the same when inlining inbuilt `nn.Module`s (#128295) - Compiled autograd bug fixes: - use same graph node names as AOTDispatcher (#133148) - match eager behavior for `post_acc_grad_hook`s (#134205) and `ctx.saved_variables` (#134286) - error instead of deadlock on reentrant autograd (#134530) - match eager behavior for inplace detached activations (#134186) - `OptimizedModule.training` flag now mirrors the wrapped module. 
(#131546) - Fixes to exception handling (#131795, #131801, #132425) - Fix indexing and slicing of ranges (#128567) - Create list slices as new local objects (#132912) - Handle infinite `map`/`zip` and return `map`/`zip` instead of a tuple (#135074) - Guard correctly on tensor subclasses (#130780) - Register all entrypoint backends on first attempt to `list_backends` (#132546) - Cache attr_proxy for nn_module attribute to fix guard check failure (#130280) ### Export - Fix unflattener for unused inputs/outputs + `preserve_module_call_signature` (#128260) - Preserve `.requires_grad` on FakeTensors in graph metadata (#128656) - Handle deduplicated SymInt compute nodes when unflattening (#129153) - Handle mutated FakeScriptObject state when reused in `aot_export` (#128844) - Fix FakeMode mismatch for joint-graph export API (#129421) - Don’t DCE side-effectful custom ops (#129680, #130970) - Fix FakeMode detection for 0-input graphs (#129928, #131995) - Allow kwargs for training IR (#130553) - Fix constants & non-persistent buffers for training IR (#130864) - Fix training IR for 0-input graphs (#130990, #133031) - Improve output node metadata preservation for strict mode (#131706) - Preserve autograd state when lowering to inference IR (#131988) - Preserve `source_fn_stack` for training IR (#132033) - Preserve `.requires_grad` for unflattened parameters (#134353) - Preserve `aten::to` for training IR (#134622) - Error out when exporting ScriptModule (#135302) - Fix warning for ComposeImplicitAutograd custom ops in pre-dispatch (#130623) - fix `node.users` when inlining higher-order-ops (#133144) - Improve logging for TSConverter (#132082) - Fix serialization of OpOverload with SymInt outputs (#132126) - Construct empty graph when there's no tensor computation (#129541) - Fix inline_inbuilt_nn_modules + export (#133731) ### ForEach Frontend - Perform reciprocal optimization with foreach_div (#128433) ### Fx - Fix index issues in torch.fx.interpreter (#129527) - Recursively apply options to print_readable (#130268) - Implement deepcopy for Proxy (#133706, #133470) - Fix `linearize(grad(...))` call by moving DCE (#133364) - Respect find_all setting in sequential mode of minimizer (#134339) ### Jit - Validate that node TK_ASSIGN have field initialized (#127878) - Fixes for LLVM 18 and ASAN (#130661, #133623, #134572) ### Linalg Frontend - Fix input checks for `linalg.lstsq` (#130612) ### mps - [MPS] Make erfinv compilable for bfloat16 (#128375) - [MPS] Fix Clamp correctness with type promotion (#130226) - [MPS] Fix `torch.[all|any]` for 5+D tensors (#130542) - [MPS] Store philox counter as part of the RNG state (#130662) - [MPS] Min and max NaN propagation fix in MPS backend (#130445) - [MPS] Correct nonzero warning and fix the test (#132127) - [MPS] Fix SDP training (#134719) - [MPS] Fix bachnorm_2d for channels last (#134618) - [MPS] Fix NaNs in triu op (#128575) - [MPS] Parameterize group_size in int4_mm test, fix int4mm for group_size > 128 (#129628) - [MPS] Fix crash when running PyTorch with Metal API validation turned on (#130377) - [MPS] LSTM backward kernel workaround on MacOS 14.4+ to fix correctness (#130038) - [MPS] Fix masked_fill_ in non_contiguous cases (#131957) - [MPS] Fix relu for 0-element input case (#133191) - [MPS] Add workaround for nonzero with large/complex inputs (#126188) - [MPS] Enable batch matmul for sizes > 2**32 when tensor can be split along batch axis (#133430) ### nn - Use newer `toAccumulateType` signature in `Normalization.cpp` (#134540) ### Optim - Fix 
accidental change of step signature which affected GradScaler interaction (#129933) - Remove an introduced Host & Device Sync In LR Scheduler (#133663) - Fall back to slower foreach_add in optim.step() when is_compiling() to avoid untracked tensor during graph tracing (#130909) ### Optimizer Frontend - [3/N] Enable clang-tidy on torch/csrc/inductor (#132101) ### Profiler - [Profiler] Fix profiler_kineto Clang errors (#128464) - [Profiler] Add Rank to NCCL Debug Info (#129528) - [Profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554) - [Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733) - [Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242) - [Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180) - [Memory Snapshot] Stop duplicating annotations to all device_traces (#130315) - [Profiler] Allow record_function kwargs to be non-string values (#134893) - [Profiler] Fix CPU Annotation Overlapping with Python Events (#129599) ### Python Frontend - Fix Storage.filename to not track the filename when storage was mmap-ed with MAP_PRIVATE (#128725) - Fix allowlisting of builtins for weights_only unpickler (#129244) - Fix div with rounding_mode="floor" when division overflows (#129536) - Fix warning when pickle.load torch.Storage (#130246) - Fix dtype mismatch in lobpcg eigen solver (#132762) - Prevent an unnecessary device -> host copy for CuPy arrays when not explicitly setting a device in torch.as_tensor. (#132595) - Fix type promotion for torch.`ldexp` (#133519) - Fix 0-dim tensor of complex or bool type for torch.aminmax. (#128404) - Fix torch.prod vectorized path for bool (#128009) ### Releng - Fix MacOS double-loading of OpenMP runtime (#129473) - Fix exposing statically linked libstdc++ CXX11 ABI symbols in Linux binaries (#137209) - Release engineering tooling and CI fixes. 
Workflows, Trymerge, Bot Labeler, Mergebot (#128840, #129924, #129500, #128924, #129291, #128842, #129720, #129987, #130570, #132681, #133143, #133350, #133372, #133861, #131475, #134047, #134711, #134785, #133869) ### XPU - Fix xpu nightly wheel test failure (#130742) - Fix xpu nightly wheel test env (#134395) - Fix test_sgd_weight_decay_xpu accuracy error (#134744) - Fix windows xpu build issue (#133845) - Fix tensor print behavior for XPU (#130523) - Fix overriding default CMAKE_CXX_FLAGS on Windows (#135093) - Remove duplicate XPU switch case in DispatchStub (#132480) - Keep zero check be compatible with different sympy versions (#130729) - Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823 (#132479) - Fix patch for old llvm package error for triton xpu (#134204) - Fix tensor print behavior for XPU (#130523) - Fix windows xpu build issue (#133845) - Check compilation status before query cudnn version in conv (#135332) ### Nested-Tensor Frontend - Default to input tensor device for as_nested_tensor(t) (#130050) - Fix SDPA backward for the special case of an NJT with ragged second batch dim and constant length (#128349) ### cuDNN - [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031) - Minor bugfix for cuDNN SDPA ### ONNX - Fix onnx conversion `scaled_dot_product_attention` (#133314) - Fix `scaled_dot_product_attention` with float scale (#135594) - Add assertion nodes to ignoring list (#135591) ### ROCm - [ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650) - [ROCm] Return correct AMDSMI socket_power metric (#130331) - [ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753) ## Performance ### Cuda - [pytorch][cuda] Generate kernels for 5x5 filters on depth wise convolution backward (#129609) - [fp8 rowwise] Retune the tile heuristics to increase perf (#134781) ### Distributed #### CPU profiler for distributed - Added API for Dynamic Activity Toggling for CPU profiler (#133353) ### Dynamo #### Compile time improvements - Manually implement `nn.Module.__getattr__` (#129315) - Manually implement `nn.Module._call_impl` (#129285) - Manually trace `torch.nn.Module.parameters` (#129583) - Optimize guard for small tuples (#130400) - Use dict tags to skip guards on immutable dict `getitem`s (#130654) - Reduce overhead for `PolyfilledFunctionVariable.call_function` (#134842) ### Fx - Speed up fx graph iteration by implementing it in C++ (#128288) - Remove unnecessary `get_implications` calls (#128410) - Implement a fast-path to FakeTensor detach (#131899) - Optimize `Node.__update_args_kwargs` (#135076) - Remove generators in map_aggregate (#135082) - Bypass custom `setattr` in `Node.__init__` (#135079) ### Inductor - [Compile] Add NEON implementation for bf16->fp32 cast (#134297) - Remove dtype check/assert for reduction vectorization and support bool for min/max (#132473) - Support masked vectorization for the tail_loop for INT8 datatype (#131155) - Support vectorization for torch.argmax/min(float/int64_t)-> int64_t in inductor cpp backend (#131016) - Optimize aten.cat calls of a repeated element (#132081) - Fix mm pad regression - more conservative estimation of plannable inputs (#128909) - Void fallback case for custom scan op lowering (#130936) - Move bias add to gemm epilogue (#130675) - Add B2B-GEMM performance tuning (#130778) - Optimize arbitrary N in cpp packed gemm template (#130690) - Improve cache blocking with CPU info in the cpp GEM template 
(#129348) - Improve thread blocking heuristics in cpp gemm (#131024) - Support k slicing for static shapes in cpp gemm (#130821) - Apply compute/comm reordering passes to achieve overlap (#131614) - Support pointwise intermediate nodes in B2B-GEMM (#131685) - Add lowering for _scaled_mm that autotunes between ATen kernels and Triton kernels (#130422) - Save and run post compilation steps within FXGraphCache (#132294) - Add vectorization support for doubles in inductor cpp (#131886) - Support vectorization for torch.any(bool) -> bool (#132472) - Performance, precision, and dependency improvements to B2B-GEMM (#132354) - Support use_libdevice_for_f64 for pointwise ops on XPU, align with CUDA. (#132739) - Intel GPU Support: Support codegen empty_strided_xpu, align with #118255 (#126678) - Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from (#132740) - Support _check_triton_bf16_support on XPU. (#132748) - Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. (#132843) - Support masked vectorization for the tail_loop of the 2d tiles kernel (#130724) - Improve large bs perf with better cache blocking in cpp gemm (#132729) - Add auto-tuning for sparse semi-structured MM operator (#123742) - Support vectorization for atomic add (#131314) - Skip cudagraph if too many distinct sizes (#131387) - Improve sort kernel performance (#131719) - Add unbind_stack_to_cat_pass (#132542) - support masked vectorization for the tail_loop (#126526) - Update unbind_cat_to_view pass to include more complicated cases (#132831) - Extend split_stack_to_cats when split and stack have different dims (#133060) - Add unbind_stack_to_slices pass (#133420) - Improve compile time regression from MemoryDep.normalize (#135070) - Reduce memory alloc overhead by allocating local acc once per thread in cpp gemm (#135277) - Enable dynamic M for k-slicing in cpp gemm (#133447) - Speedup int8-to-float conversion on aarch64 (#132676) - Optimizes cache key calculation by memoizing device data for predictable behavior. (#128366) - Adjusts backward kernel block sizes for the FlexAttention module. (#128853) - Updates template indexing to use broadcasting instead of masks in the FlexAttention module, optimizing performance. (#128938) - Evaluates symbolic expressions during the loading of cached entries, preventing unnecessary computations during writes. (#128997) - Adds support for BF16 matrix multiplication (micro-gemm) using Advanced Matrix eXtension (AMX) instructions available in Intel processors for performance improvements. (#127195) - Optimizes memory use by avoiding the materialization of large sparse matrices during backward passes in conditional execution. (#129043) - Enhances autotuning of FlexAttention by passing fake inputs for block sparse entries. (#129915) - Reduces unnecessary tensor metadata in the AOTAutogradCache to improve performance. (#128583) - Introduces a caching mechanism for precompilation functions to avoid redundant compilation. (#130350) - Updates the loop order to occur post-fusion in inductor. (#126254) - Ensures that icx built-in math libraries are preloaded in the compilation process. (#134870) - Converts addmm to a decomposition operation instead of a fallback for better performance. (#134823) - Adds support for vectorization in the tail loop for dynamic shapes in inductor. (#131745) - Optimizes template_buffer usage by introducing a local accumulator when the buffer has multiple users. 
(#135081) - Maximizes the use of available bits for BF16/FP16 vectorization in CPU operations. (#126502) - Directly sets meta.val for efficient batch fusion during aten operations. (#135078) - Improves performance by tuning the INT8 AMX WoQ micro-kernel for CPU. (#134832) - Adjusts the tiling factor for lower precision data types (e.g., BF16, FP16) in inductor's C++ backend to optimize performance. (#133830) - Enhances cache blocking mechanisms for dynamic matrix multiplications. (#131306) - Addresses issues with loop splitting optimizations in the inductor backend. (#135303) - Adds improved cache-blocking configurations for dynamic shapes in GEMM operations. (#133538) - Optimizes tensor core usage during matrix multiplications in FlexAttention. (#135168) - Refactors memory usage patterns in LoopBody to improve efficiency. (#135286) - Introduces a fast path for extracting read/write information without requiring full tracing, improving performance. (#135377) - Directly uses empty tensors with strides in CUDA graphs during copy operations for efficiency. (#130777) - The issue Fix compile time regression by caching get_gpu_type addresses a significant compile time regression in inductor by caching the results of GPU type queries, reducing the need for repetitive calls to nvidia-smi. This leads to substantial performance improvements, particularly for jobs involving collective operations (#128363) - The issue Introduce heuristic for mixed_mm on A100 adds a heuristic for mixed matrix multiplications, specifically for the A100 GPU, significantly improving performance in certain configurations by tuning the heuristic to outperform existing solutions. (#128232) - The issue Improve codegen for ops.masked in triton enhances code generation for masked operations in the Triton backend, ensuring better performance and correctness for these operations within the PyTorch inductor. (#128054) - The issue Enable shape_padding multiplier adjustment allows for the adjustment of the shape_padding_multiplier value in PT2 to improve query-per-second (QPS) performance based on specific configurations. (#128346) - The issue Emit strided block pointer from ModularIndexing and FloorDiv improves the efficiency of indexing in multi-dimensional tensors by emitting block pointers instead of relying on division and modulo operations. This enhancement optimizes performance, especially for workloads involving strided tensor access patterns. (#127342) - The issue Make tl.atomic_add relaxed changes the atomic addition operation in Triton to use relaxed memory ordering rather than acquire/release synchronization, improving performance without sacrificing correctness where strict memory synchronization is unnecessary. (#129133) - The issue Linear time dead node elimination optimizes the inductor's dead code elimination by ensuring dead nodes are removed in linear time, improving execution efficiency in PyTorch's computation graphs. (#129082) - The issue Add lowering and codegen for aten.sort implements lowering and code generation for aten.sort in PyTorch's inductor, improving the performance of sorting operations by optimizing them for Triton kernels under specific size thresholds. 
(#128458) - The issue Fallback to eager if re-record too many times ensures that if a CUDAGraph re-records a function too frequently (more than cudagraph_max_recording), PyTorch falls back to the eager execution mode, preventing performance degradation from excessive re-recording.(#129349) - The issue Reinplacing should ignore copy_ nodes where the mutated argument is not read updates the reinplacing pass in the inductor backend to skip unnecessary clone() operations when the original tensor's value is not needed. This improves performance by avoiding redundant operations, particularly for custom operations and Triton kernels. (#130866) - The issue Avoid `OpOverloadPacket.__getattr__` calls in inductor lowering resolves inefficiencies caused by excessive calls to `__getattr__` during inductor lowering. By avoiding these calls, the patch prevents unnecessary exceptions that slow down compilation, improving overall performance. (#131348) - The issue Always realize sigmoid for CPU modifies the inductor backend to ensure that sigmoid is realized in CPU operations, similar to how exp is handled. This update improves performance by preventing repeated computation of exp in nested loops during inference tasks, particularly for models like LLaMA2. (#128339) - The issue bf16/fp16 gemm template computed with fp32 reintroduces the GEMM template for mixed-precision computation, allowing bf16 and fp16 values to be computed using fp32 precision in PyTorch's inductor backend. This enhances accuracy and stability for matrix multiplications involving lower precision data types. (#128472) ### mps - [MPS] GGML inspired int8 MM Metal shader (#127646) - [MPS] Fused Adam & AdamW (#127242) - [MPS] Fused SGD optimizer (#129350) ### Profiler - [Profiler] Add TSC Clock Callback to CUPTI (#125036) - [Profiler] Only parse kineto requests and build tree when required (#132713) ### Quantization - Speed up `torch.ops.decomposed.dequantize_per_channel` (#132828) - Speed up `torch.ops.decomposed.quantize_per_channel` (#133029) - Enable optimized dynamic quantization on aarch64 (#126687) ### cuDNN - [cuDNN][SDPA] cherrypick Support attn_bias in cuDNN (#130482) - cuDNN SDPA now supports arbitrary fully materialized arbitrary bias (masks) ### Sparse Frontend - Improve addmm(dense, BSR) performance on specific shapes and GPUs (#132646) ## Documentation ### Autograd frontend - Improve wording of `torch.inference_mode` documentation (#132321, #130307) - Made `torch.autograd.graph.register_multi_grad_hook` return type `RemovableHandle` (#132074) ### Distributed #### TorchElastic - Added docstring for the `torch.distributed.elastic.utils.distributed.get_free_port` function (#128133) - Added docstring to `construct_and_record_rdzv_event()` (#128189) #### c10d - Added notes about same sized tensors to `dist.gather()` (#128676) - Fixed `DDPLoadBalancingPlanner` docstring (#134044) - Cleaned up `distributed/CONTRIBUTING.md` (#128450) - Clarified warning for concurrent PG usage (#131895) - Fixed typo in docs of `all_gather` (#133066) - Add docs for ENV variables `TORCH_NCCL_ASYNC_ERROR_HANDLING` - `TORCH_NCCL_TRACE_CPP_STACK` and `TORCH_NCCL_COORD_CHECK_MILSEC` (#132920) #### DeviceMesh - Updated slicing documentation to include n-D and non-continuous slicing (#132311) #### DTensor - Improved docs and comments in `DTensor` (#132683, #133149, #133306) #### Pipelining - added small logging section to docs (#129368) ### Dynamo - Switch out references from old custom ops landing page to new landing page (#129178) - Point C++ custom 
autograd function tracing error to google doc (#134514) - Suggest to use `pytree` when graph-break on `optree` (#131827) ### Fx - Document the torch.fx.annotate.annotate function (#128337) - Fix typo in `torch/fx/passes/README.md` (#134078) - Add docstring for the torch.fx.operator_schemas.create_type_hint (#128139) - Fix link to dynamo in torch/fx readme (#130233) ### Inductor - Add a 'to' method for moving to and from device for BlockMask (#132087) - Update max-autotune documentation for CPU (#134986) - Fix typos (#128258, #128587) - Improves documentation for block mask creation to better explain its usage and structure. (#130649)x ### jit - Document torch.jit.frontend.get_jit_class_def method (#128391) - Document `torch.jit.frontend.get_default_args` (#128408) - Document c10::AliasAnalysisKind::CONSERVATIVE (#130765) ### Linalg Frontend - Fix a typo in `solve_triangular` and `householder_product` (#129766, #124279) ### mps - [MPS] Add mps environment variable table (#129008) - [MPS] Add mps profiler env vars to docs (#129552) ### nn - Fix small typo in docstring in `nn.ParameterList` (#129193) - Fix the max_norm value in a note for `nn.Embedding` (#129687) - Add note on transposed weight initialisations in `nn.init` (#130122) - Fix an example for broadcasting error of `attn_bias` and `attn_mask` for `nn.functional.scaled_dot_product_attention` (#130209) - Fix example for `convert_conv3d_weight_memory_format` (#131742) - Fix doc string for clip_grad_norm_ to (#133406) - Fix docs for L1Loss and MSELoss (#133501) ### Optim - Improve docstrings for Learning Rate Scheduler (#128679, #130306, #132482) - Document optim hooks on Optimizer base class (#131628) - Add a reference for the LRScheduler class from main torch.optim doc (#133243) - Make optim.swa.util content accessible from the torch.optim doc (#133393) - Improve docstrings for Optimizer (#129086, #135384) ### Optimizer Frontend - [NeuralNetInference] Bring up iOS builds (#131917) ### Profiler - [Profiler] Document the torch.cuda.profiler.profile function (#128216) - [Profiler] Document `torch.cuda.profiler.start` (#128098) ### Python Frontend - Strip inaccurate either float32 or float64 statement from set_default_type (#128192) - Add docstring for the torch.typename function (#128129) - Add documentation for automatic mixed precision for xpu backend (#127276) - Add example for torch.serialization.add_safe_globals (#129396) - Fix rendering of Tensor.module_load doc (#130489) - Fix documentation for tensor.repeat. 
(#131195) - Fix rendering of the unicode characters (#134597) - Fix documentation for `torch.nn.utils.rnn.pack_padded_sequence` (#135417) - Add docstring for the torch.serialization.default_restore_location function (#128132) - Update `torch.nanmean` docstring to mention input dtype requirement (#128155) - Fix reference to `Multinomial` class in torch.multinomial documentation (#131904) - Improve `torch.stack` example code to be reproducible (#133857) ### Releng - Deleted outdated and misleading info from .ci/pytorch/README.md (#131502) ### XPU - Adding a note for Getting Started with PyTorch on Intel GPUs (#127872) - Fix requirements.txt installation failure issue on Windows (#136893) - Add xpu for amp (#127276) - Add xpu to `torch.compile` (#127279) - Update amp example to device-agnostic (#127278) - Add xpu to `torch.tensors` (#127280) - Introduce the concept of Accelerators to PyTorch doc (#129363) ### Sparse Frontend - Define **zero-preserving unary functions** (#130804) ### ONNX - Update fake mode usage in onnx docs (#135512) - Improve comments (#128083, #128171, #128082) ## Developers ### Distributed #### c10d - Exposed `set_thread_name` to Python and set thread names (#128448) - Added better logging for `Socket` and `TCPStore`: (#128673) - Fixed pyi annotation for `ProcessGroupNCCL.Options` (#130957) - Added `dump_traceback` handler for Control Plane (#128904) - Added logs of whenever we sleep in `ProcessGroupNCCL` (#129197) - Introduced a util for detecting DMA connectivity among devices (#129510) - Surfaced better error message on 0 bytes (#130056) - Logged port on error inside `TCPStoreLibUvBackend` (#130797) - Added a warning messages in the comment about cuda hang (#130844) - Changed to LOG error rather than info in device check (#131483) - Add better logging on wait timeout in `TCPStore` (#131808) - Fixed pyi annotation for `ProcessGroupGloo.Options` (#132080) - Added a new API for adding ephemeral timeout for one local rank and the timeout will reset when the first collective finishes (#130905) - Use `pg_id` instead of `pg_name` for logging prefix (#132058) - Added space between PG ID and PG UID (#133497) - Control logging of c++ traces with a flag in `ProcessGroupNCCL` (#133490) - Made NCCL PG error messages more accurate and simpler (#134017, #134036) - Used wait counters instead in `TCPStore` (#135283) - Switched LOG level from ERROR to WARNING for `TCPStore` get failure (#134349) #### DTensor - Included meshes in cross-mesh error msg (#130454) - Added `dtensor` to `TORCH_LOGS` (#129512) #### FDSP - Removed spammy logs in `_runtime_utils.py` (#129967) #### FDSP2 - Added `TORCH_LOGS=+fsdp` to log hooks(pre/post forward/backward) and FQN (_init_fqns) (#128663) #### DSD - Captured reader, writer and planner components in the DCP API logger (#129548) #### TorchElastic - Fixed torchrun log message (#131652) - Fixed stdout / stderr typing in `SubprocessHandler` (#132071) - Added warning when users try to pass a `use_libuv` argument to `create_c10d_store` (#135062) ### Fx - Add bits16 to graph dtype_abbrs (#130339) ### Optim - Add `__all__` to torch.optim to define public interface (#131959) - Remove circular import coming from torch.optim._multi_tensor (#128875) ### Optimizer Frontend - [pytorch][counters] DynamicCounter (#132166) ### Releng - Better Engineering, Ruff lint improvements, better error messages (#130199, #133200, #129374, #129809, #129752, #129753, #131547, #130139) ### XPU - Change conda to miniforge for xpu images (#134455) - Enable python 3.13 for xpu nightly 
build (#133670) - Update xpu cd used driver to rolling version (#133454) - Change xpu nightly build back to ABI=0 (#132854) - Disable xpu kineto build (#133069) - Change xpu ci build runner type to reduce build time (#130922) - Add pytorch xpu wheel build in nightly (#129560) - Add triton xpu wheel build (#129730) - Disable Kineto PTI on Windows only (#134620) - Update Intel Triton to release/2.5.0 (#134074) - Use larger instance for building triton whl (#135201) - Set make triton install pre-built whl by default (#130313) - Add `xpu_cmake_macros.h` to xpu build (#132847) ### ONNX - Remove beartype usage (#130484) ## Security ### Inductor - relax unification checks when size-like symbols can be 0 (#133112) ### Linalg Frontend - Force inconsistent-missing-override for torch targets (#130010) ### Optimizer Frontend - [torch][take2] Implement BFloat16 `__hip_bfloat16` overloads (#132234) ### Quantization - Hipify Pytorch3D (#133343)

PyTorch 2.4.1 Release, bug fix release (2024-09-04)

This release is meant to fix the following issues (regressions / silent correctness):

### Breaking Changes:
- The pytorch/pytorch docker image now installs the PyTorch package through pip and has switched its conda installation from miniconda to [miniforge](https://github.com/conda-forge/miniforge) ([#134274](https://github.com/pytorch/pytorch/pull/134274))

### Windows:
- Fix performance regression on Windows related to MKL static linking ([#130619](https://github.com/pytorch/pytorch/issues/130619)) ([#130697](https://github.com/pytorch/pytorch/pull/130697))
- Fix error during loading on Windows: [WinError 126] The specified module could not be found. ([#131662](https://github.com/pytorch/pytorch/issues/131662)) ([#130697](https://github.com/pytorch/pytorch/pull/130697))

### MPS:
- Fix tensor.clamp producing wrong values ([#130226](https://github.com/pytorch/pytorch/pull/130226)); a minimal sketch of the fixed call follows below
- Fix incorrect results from batch norm with sliced inputs ([#133610](https://github.com/pytorch/pytorch/pull/133610))
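
A minimal sketch of the fixed clamp call, assuming an Apple Silicon machine where the MPS backend is available (the tensor values are illustrative):

```python
import torch

# Minimal sketch: clamp on the MPS backend should now agree with CPU.
if torch.backends.mps.is_available():
    x = torch.arange(6, dtype=torch.float32, device="mps")
    print(x.clamp(min=2.0, max=4.0))        # tensor([2., 2., 2., 3., 4., 4.], device='mps:0')
    print(x.cpu().clamp(min=2.0, max=4.0))  # same values on CPU
```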

### ROCM:
- Fix invalid kernel launch configuration error when calling embedding with a large index ([#130994](https://github.com/pytorch/pytorch/pull/130994))
- Added a check and a warning when attempting to use hipBLASLt on an unsupported architecture ([#128753](https://github.com/pytorch/pytorch/pull/128753))
- Fix image corruption with Memory Efficient Attention when running HuggingFace Diffusers Stable Diffusion 3 pipeline ([#133331](https://github.com/pytorch/pytorch/pull/133331))

### Distributed:
- Fix FutureWarning when using torch.load internally ([#130663](https://github.com/pytorch/pytorch/pull/130663)) 
- Fix FutureWarning when using torch.cuda.amp.autocast internally ([#130660](https://github.com/pytorch/pytorch/pull/130660))
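
The warning comes from the deprecated torch.cuda.amp.autocast entry point; a minimal sketch of the non-deprecated, device-agnostic form, assuming a CUDA device is available (the module and shapes are illustrative):

```python
import torch

# Minimal sketch: torch.amp.autocast("cuda", ...) replaces the deprecated
# torch.cuda.amp.autocast and emits no FutureWarning.
model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")
with torch.amp.autocast("cuda", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # torch.float16
```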

### Torch.compile:
- Fix exception with torch.compile when the onnxruntime-training and deepspeed packages are installed ([#131194](https://github.com/pytorch/pytorch/pull/131194))
- Fix silent incorrectness with torch.library.custom_op with mutable inputs and torch.compile ([#133452](https://github.com/pytorch/pytorch/pull/133452)); a sketch of such an op follows after this list
- Fix SIMD detection on Linux ARM ([#129075](https://github.com/pytorch/pytorch/pull/129075))
- Do not use C++20 features in cpu_inductor code ([#130816](https://github.com/pytorch/pytorch/pull/130816))
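
For reference, a minimal sketch of the kind of op the custom-op fix above concerns: a torch.library.custom_op with a mutable input used under torch.compile (the op name and body are illustrative, not taken from the PR):

```python
import torch
from torch.library import custom_op

# Illustrative custom op that mutates its input in place; mutates_args tells
# torch.compile about the mutation so it is handled correctly.
@custom_op("mylib::inplace_scale", mutates_args=("x",))
def inplace_scale(x: torch.Tensor, scale: float) -> None:
    x.mul_(scale)

@torch.compile
def f(x):
    inplace_scale(x, 2.0)
    return x + 1

print(f(torch.ones(3)))  # tensor([3., 3., 3.])
```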

### Packaging:
- Fix for exposing statically linked libstdc++ CXX11 ABI symbols ([#134494](https://github.com/pytorch/pytorch/pull/134494))
- Fix error while building pytorch from source due to the now-missing QNNPACK module ([#131864](https://github.com/pytorch/pytorch/pull/133177))
- Make PyTorch buildable from source on PowerPC ([#129736](https://github.com/pytorch/pytorch/pull/129736))
- Fix XPU extension building ([#132847](https://github.com/pytorch/pytorch/pull/132847))

### Other:
- Fix warning when using pickle on an nn.Module that contains tensor attributes ([#130246](https://github.com/pytorch/pytorch/pull/130246))
- Fix NaNs returned from MultiheadAttention when need_weights=False ([#130014](https://github.com/pytorch/pytorch/pull/130014))
- Fix nested tensor MHA producing incorrect results ([#130196](https://github.com/pytorch/pytorch/issues/130196))
- Fix error when using torch.utils.flop_counter.FlopCounterMode ([#134467](https://github.com/pytorch/pytorch/pull/134467))
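
For context, a minimal sketch of the FlopCounterMode usage that the last fix concerns (the module and input shapes are illustrative):

```python
import torch
from torch.utils.flop_counter import FlopCounterMode

# Minimal sketch: count FLOPs of a forward pass; display=False suppresses
# the per-module table and get_total_flops() returns the aggregate count.
model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)
with FlopCounterMode(display=False) as flop_counter:
    model(x)
print(flop_counter.get_total_flops())
```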

### Tracked Regressions:

- The experimental remote caching feature for Inductor's autotuner (enabled via TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE) is known to still be broken in this release and is actively being worked on in main. The following error is generated: `redis.exceptions.DataError: Invalid input of type: 'dict'`. Please use nightlies if you need this feature (reported and fixed by PR [#134032](https://github.com/pytorch/pytorch/pull/134032))
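
For clarity, the knob referenced above is an environment variable; the snippet below only illustrates how it is toggled and is not a recommendation to enable it on 2.4.1:

```python
import os

# Illustration only: enabling the experimental remote autotune cache.
# On 2.4.1 this path still fails with redis.exceptions.DataError
# ("Invalid input of type: 'dict'"); leave it unset unless running a
# nightly that includes the fix.
os.environ["TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE"] = "1"
```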

Release tracker [#132400](https://github.com/pytorch/pytorch/issues/132400) contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore (2024-07-24)

# PyTorch 2.4 Release Notes

- Highlights
- Tracked Regressions
- Backward incompatible changes
- Deprecations
- New features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Developers
- Security

## Highlights

We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for `torch.compile`.
AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the
serialization of MKLDNN weights. In addition, a new default TCPStore server backend utilizing `libuv` has been introduced,
which should significantly reduce initialization times for users running large-scale jobs (a minimal sketch of the new default follows below).
Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels
into PyTorch, especially for `torch.compile`.
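
As a rough illustration of the TCPStore change, a minimal sketch of starting a store server, which now uses the libuv backend by default (host and port values are illustrative, and the comments summarize the notes below rather than quoting the API documentation):

```python
from torch.distributed import TCPStore

# Minimal sketch: a single-process TCPStore server. In 2.4 the server
# backend defaults to libuv, which should speed up initialization for
# large-scale jobs; no code change is needed to pick it up.
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
store.set("status", "ready")
print(store.get("status"))  # b'ready'
```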

This release is composed of 3661 commits from 475 contributors since PyTorch 2.3. We want to sincerely thank our
dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we 
improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our 
[Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.


| Beta | Prototype | Performance Improvements |
| --- | --- | --- |
| Python 3.12 support for torch.compile | FSDP2: DTensor-based per-parameter-sharding FSDP | torch.compile optimizations for AWS Graviton (aarch64-linux) processors |
| AOTInductor Freezing for CPU | torch.distributed.pipelining, simplified pipeline parallelism | BF16 symbolic shape optimization in TorchInductor |
| New Higher-level Python Custom Operator API | Intel GPU is available through source build | Performance optimizations for GenAI projects utilizing CPU devices |
| Switching TCPStore's default server backend to libuv | | |
*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing). ## Tracked Regressions ### Subproc exception with torch.compile and onnxruntime-training There is a reported issue (#131070) when using `torch.compile` if `onnxruntime-training` lib is installed. The issue will be fixed (#131194) in v2.4.1. It can be solved locally by setting the environment variable `TORCHINDUCTOR_WORKER_START=fork` before executing the script. ### cu118 wheels will not work with pre-cuda12 drivers It was also reported (#130684) that the new version of triton uses cuda features that are not compatible with pre-cuda12 drivers. In this case, the [workaround](https://github.com/triton-lang/triton/pull/4335#issuecomment-2232505298) is to set `TRITON_PTXAS_PATH` manually as follows (adapt the code according to the local installation path): ```bash TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas python script.py ``` ## Backwards Incompatible Change ### Python frontend #### Default `TreadPool` size to number of physical cores (#125963) Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of physical cores. This should reduce core oversubscribing when running CPU workload and improve performance. Previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value. #### Fix `torch.quasirandom.SobolEngine.draw` default dtype handling (#126781) The default dtype value has been changed from `torch.float32` to the current default dtype as given by `torch.get_default_dtype()` to be consistent with other APIs. #### Forbid subclassing `torch._C._TensorBase` directly (#125558) This is an internal subclass that a user used to be able to create an object that is almost a Tensor in Python and was advertised as such in some tutorials. This is not allowed anymore to improve consistency and all users should subclass torch.Tensor directly. ### Composability #### Non-compositional usages of as_strided + mutation under `torch.compile` will raise an error (#122502) The `torch.compile` flow involves functionalizing any mutations inside the region being compiled. Torch.as_strided is an existing view op that can be used non-compositionally: meaning when you call x.as_strided(...), as_strided will only consider the underlying storage size of x, and ignore its current size/stride/storage_offset when creating a new view. This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally, so we ban them rather than risking silent correctness issues under torch.compile. An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it). ```python @torch.compile def foo(a): e = a.diagonal() # as_strided is being called on an existing view (e), # making it non-compositional. 
mutations to f under torch.compile # are not allowed, as we cannot easily functionalize them safely f = e.as_strided((2,), (1,), 0) f.add_(1.0) return a ``` #### We now verify schemas of custom ops at registration time (#124520) Previously, you could register a custom op through the operator registration APIs, but give it a schema that contained types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular pytorch would result in an error later. As of 2.4, we will raise an error at registration time, when you first register the custom operator. You can get the old behavior by constructing the schema with allow_typevars=true. ``` TORCH_LIBRARY(my_ns, m) { // this now raises an error at registration time: bar/baz are unknown types m.def("my_ns::foo(bar t) -> baz"); // you can get back the old behavior with the below flag m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true)); } ``` ### Autograd frontend #### Delete torch.autograd.function.traceable APIs (#122817) The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute on a torch.autograd.Function class was deprecated in 2.3 and is now being deleted. This API does not do anything and was only meant for internal purposes. The following raised an warning in 2.3, and now errors because the API has been deleted: ```python @torch.autograd.function.traceable class Func(torch.autograd.Function): ... ``` ### Release engineering - Remove caffe2 db and distributed from build system (#125092) ### Optim - Remove `SparseAdam` weird allowance of raw Tensor input (#127081). ### Distributed #### DeviceMesh Update get_group and add get_all_groups (#128097) In 2.3 and before, users can do: ```python mesh_2d = init_device_mesh( "cuda", (2, 2), mesh_dim_names=("dp", "tp") ) mesh_2d.get_group() # This will return all sub-pgs within the mesh assert mesh_2d.get_group()[0] == mesh_2d.get_group(0) assert mesh_2d.get_group()[1] == mesh_2d.get_group(1) ``` But from 2.4 forward, if users call `get_group` without passing in the dim, users will get a `RuntimeError`. Instead, they should use `get_all_groups`: ```python mesh_2d = init_device_mesh( "cuda", (2, 2), mesh_dim_names=("dp", "tp") ) mesh_2d.get_group() # This will throw a RuntimeError assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0) assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1) ``` #### Pipelining Retire torch.distributed.pipeline (#127354) In 2.3 and before, users can do: ```python import torch.distributed.pipeline # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining ``` But from 2.4 forward, if users write the code above, users will get a `ModuleNotFound` error. Instead, they should use `torch.distributed.pipelining`: ```python import torch.distributed.pipeline # -> ModuleNotFoundError import torch.distributed.pipelining ``` ### jit - Fix serialization/deepcopy behavior for tensors that are aliasing but not equal (#126126) ### Fx Complete revamp of float/promotion sympy handling (#126905) ### ONNX - Remove caffe2 contrib and experiments (#125038) ## Deprecations ### Python frontend - User warning when using `torch.load` with default `weights_only=False` value (#129239, #129396, #129509). 
A warning is now raised if the weights_only value is not specified during a call to torch.load, encouraging users to adopt the safest practice when loading weights. - Deprecate device-specific autocast API (#126062) All the autocast APIs are unified under torch.amp and it can be used as a drop-in replacement for torch.{device}.amp APIs (passing a device argument where applicable).. - Export torch.newaxis=None for Python Array API/Numpy consistency (#125026) ### Composability - Deprecate calling FakeTensor.data_ptr in eager-mode. FakeTensors are tensors without a valid data pointer, so in general their data pointer is not safe to access. This makes it easier for `torch.compile` to provide a nice error message when tracing custom ops into a graph that are not written in a PT2-friendly way (because, for example, they try to directly access a tensor’s data pointer from a region of code being traced). More details on integrating custom ops with `torch.compile` can be found [here](https://dev-discuss.pytorch.org/t/guide-getting-c-custom-ops-to-work-with-torch-compile/1737) (#123292) - Dynamic shapes: - SymInt-ify mem-efficient attention forward op signature (#125418) - Don't call item() into torch.scalar_tensor uselessly (#125373) - Fix scalar type for constraint_range to Long (#121752) - Guard oblivious on meta registrations (#122216), vector_norm (#126772), and unbind (#124959) - Make expected stride test in torch._prims_common size oblivious (#122370) - Use torch._check for safety assert in _reshape_view_helper (#125187) - Add a code comment about torch._check_is_size in tensor_split (#125292) - Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301) - Don't short circuit if shape is same (#125188) ### CPP - Refactor autocast C++ APIs to be device-agnostic (#124359) ### Release Engineering - Remove of QNNPACK third-party module (#126941) ### Optim - Deprecate LRScheduler.print_lr (#126105) ### nn - `torch.nn.hardtahn` allowed `min_val` to be greater than max_val (#121627) ### Distributed - Distributed Checkpointing (DCP) Deprecated submodules feature for distributed_state_dict (#127793) In 2.3 and before, users can do: ```python model = AnyModel(device=torch.device("cuda")) model_state_dict = get_model_state_dict(model) set_model_state_dict( model, model_state_dict=new_model_state_dict, options=StateDictOptions(strict=False), ) # Below way of calling API is also legit model_state_dict2 = get_model_state_dict(model, submodules={model.submodule}) set_model_state_dict( model, model_state_dict={model.submodule: new_submodel_state_dict}, options=StateDictOptions(strict=False), ) ``` But from 2.4 forward, if users call `get_model_state_dict` or `set_model_state_dict` with a submodule path or state_dict, users will see a warning about the feature. 
To achieve the same functionality, users can manually filter out the `state_dict` returned from `get_state_dict` API and preprocess the model_state_dict before calling `set_state_dict` API: ```python model = AnyModel(device=torch.device("cuda")) model_state_dict = get_model_state_dict(model) set_model_state_dict( model, model_state_dict=new_model_state_dict, options=StateDictOptions(strict=False), ) # Deprecating warnings thrown for the below way of calling API model_state_dict2 = get_model_state_dict(model, submodules={model.submodule}) set_model_state_dict( model, model_state_dict={model.submodule: new_submodel_state_dict}, options=StateDictOptions(strict=False), ) ``` - FullyShardedDataParallel (FSDP) Deprecate FSDP.state_dict_type and redirect users to distributed_state_dict (#127794) In 2.3 and before, users can do: ```python model = AnyModel(device=torch.device("cuda")) fsdp_model = FSDP(model) # Users can do both ways below get_model_state_dict(model) with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT): fsdp_model.state_dict() ``` But from 2.4 forward, if users call `state_dict` or set `state_dict` with the FSDP.state_dict_type, users will see warnings. And the recommended solution now is to use `get_model_state_dict` and `set_model_state_dict` directly: ```python model = AnyModel(device=torch.device("cuda")) fsdp_model = FSDP(model) get_model_state_dict(model) # Deprecating warnings thrown for the below way of calling API with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT): fsdp_model.state_dict() ``` ### Profiler - Remove FlameGraph usage steps from export_stacks docstring (#123102) The export_stacks API will continue to work as before, however we’ve removed the docstring to use FrameGraph. PyTorch doesn’t own FrameGraph, and cannot guarantee that it functions properly. ### Quantization - Remove deprecated `torch._aminmax` operator (#125995). 
`torch._aminmax` -> `torch.aminmax` instead ### Export - Start deprecation of capture_pre_autograd_graph (#125848, #126403) ### XPU - Refactor autocast C++ APIs to be device-agnostic(#124359) `at::autocast::get_autocast_gpu_dtype()` -> `at::autocast::get_autocast_dtype(at::kCUDA)` `at::autocast::get_autocast_cpu_dtype()` -> `at::autocast::get_autocast_dtype(at::kCPU)` - Refactor autocast Python APIs(#124479) `torch.get_autocast_gpu_dtype()` -> `torch.get_autocast_dtype(“cuda”)`, `torch.set_autocast_gpu_dtype(dtype)` -> `torch.set_autocast_dtype(“cuda”, dtype)`, `torch.is_autocast_enabled() ` -> `torch.is_autocast_enabled(“cuda”)`, `torch.set_autocast_enabled(enabled)` -> `torch.set_autocast_enabled(”cuda”, enabled)`, `torch.get_autocast_cpu_dtype()` -> `torch.get_autocast_dtype(“cpu”)` - Make torch.amp.autocast more generic (#125103) `torch.cuda.amp.autocast(args…) ` -> `torch.amp.autocast(“cuda”,args…)`, `torch.cpu.amp.autocast(args…) ` -> `torch.amp.autocast(“cpu”, args…)`, - Deprecate device-specific GradScaler autocast API(#126527) `torch.cuda.amp.GradScaler(args…) ` -> `torch.amp.GradScaler(“cuda”, args…)`, `torch.cuda.amp.GradScaler(args…) ` -> `torch.amp.GradScaler(“cpu”, args…)`, - Generalize custom_fwd&custom_bwd to be device-agnostic (#126531) `torch.cuda.amp.custom_fwd(args…) ` -> `torch.amp.custom_fwd(args…, device_type=’cuda’)`, ### ONNX - Remove more caffe2 files (#126628) ## New Features ### Python frontend - Add - support for unsigned int sizes for torch.unique (#123643) - torch.OutOfMemoryError to signify out of memory error from any device (#121702) - new device-agnostic API for autocast in torch.amp.* (#124938) - new device-agnostic API for Stream/Event in torch.{Stream,Event} (#125757) - channels last support to max, average and adaptive pooling functions (#116305) - torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331, #124330, #127808) - pickling support for torch.Generator (#126271) - torch.utils.module_tracker to track position within torch.nn.Module hierarchy (#125352) ### Composability - Add - OpOverload.redispatch; use it in new custom ops API (#124089) - mutated_args field to custom_op (#123129) - new Python Custom Operators API - register_autograd to register backward formulas for custom ops (#123110) - torch.library.opcheck (#124496), torch.library.register_autograd (#124071), torch.library.register_kernel (#124299) - Blanket ban kwarg-only Tensors (#124805) - Change register_autograd to reflect ordering of setup_context and backward (#124403) - Ensure torch.library doctests runs under xdoctest (#123282) - Fix torch.library.register_fake's module reporting (#125037) - New Custom Ops Documentation landing page (#127400) - Refresh OpOverloadPacket if a new OpOverload gets added (#126863, #128000) - Rename - impl_abstract to register_fake, part 1/2 (#123937) - register_impl to register_kernel (#124200) - Schema inference now includes default values (#123453) - Stop requiring a pystub for register_fake by default (#124064) - Support TensorList inputs/outputs (#123615) - Update the functionalization error message (#123261) - add ability to provide manual schema (#124180) - fix schema inference for kwarg-only args (#124637) - mutated_args -> mutates_args (#123437) - register_autograd supports non-tensor kwargonly-args (#124806) - set some tags when constructing the op (#124414) - setup_context fills in default values (#124852) - torch.library.register_fake accepts more types (#124066) - use new python custom ops 
API on prims ops (#124665) ### Optim - Enable `torch.compile` support for LRScheduler with Tensor LRs (#123751, #123752, #123753, #127190) ### nn frontend - Add RMSNorm module (#121364) ### linalg - Implement svd_lowrank and pca_lowrank for complex numbers (#125580) - Extend `preferred_backend` on ROCm backend. - Add cuBLASLt `gemm` implementation (#122106) ### Distributed #### c10d - Implemented IntraNodeComm primitives for `allgather_matmul` (#118038) - Add first differentiable collective `all_to_all_single_grad` (#123599) - Add P2P versions of `send/recv_object_list` operations (#124379) - Add a new Collectives API for doing distributed collectives operations in the Elastic store with more performant and debuggable primitives (#126695) #### FullyShardedDataParallel v2 (FSDP2) - FSDP2 is a new fully sharded data parallel implementation that uses DTensor-based dim-0 per-parameter sharding for improved flexibility (e.g. mixed-dtype all-gather, no constraints on requires_grad) without significant cost to performance. See the [document](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md) for more details and a comparison with FSDP1 (#122888, #122907, #123142, #123362, #123491, #123857, #119302, #122908, #123953, #120952, #123988, #124293, #124318, #124319, #120256, #124513, #124955, #125191, #125269, #125394, #126070, #126267, #126305, #126166, #127585, #127776, #127832, #128138, #128117, #128242) #### Pipelining - PyTorch Distributed pipeline parallelism APIs were upstreamed from the [PiPPy project](https://github.com/pytorch/PiPPy) and are available as a prototype release in PyTorch 2.4. The package is under [torch.distributed.pipelining](https://github.com/pytorch/pytorch/tree/main/torch/distributed/pipelining) and consists of two parts: a splitting frontend and a distributed runtime. The splitting frontend takes your model code as-is, splits it up into “model partitions”, and captures the data-flow relationship. The distributed runtime executes the pipeline stages on different devices in parallel, handling things like micro-batch splitting, scheduling, communication, and gradient propagation. For more information please check out the [documentation](https://pytorch.org/docs/main/distributed.pipelining.html) and [tutorial](https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html) (#126322, #124776, #125273, #125729, #125975, #126123, #126419, #126539, #126582, #126732, #126653, #127418, #127084, #127673, #127332, #127946, #128157, #128163, #127796, #128201, #128228, #128240, #128236, #128273, #128279, #128276, #128278, #127066) ### Profiler - Add profiler support for `PrivateUse1` (#124818) ### Dynamo - `torch.compile` is compatible with Python 3.12. - Guarding on nn modules attributes (#125202) - TorchDynamo guards on nn module attributes. This was a frequently raised issue in the past (examples (#111785, #120248, #120958, #117758, #124357, #124717, #124817)). This increases TorchDynamo soundness with minimal perf impact. - Hardened the recently introduced tracing rules infrastructure. This allows `torch.compile` users to easily control TorchDynamo tracing of PyTorch internal code. - Extended `torch.compile` support for RAdam and Adamax optimizer. Compiler optimizers now demonstrate SOTA performance. - Experimental feature - We introduced a new experimental flag torch._dynamo.config.inline_inbuilt_nn_modules to enable `torch.compile` to reuse compiled artifacts on repeated blocks in the models. 
This gives another point in the tradeoff space of compilation time and performance speedup. By moving `torch.compile` from full model to a repeated block (e.g. moving `torch.compile` from full LLM to a repeated Transformer block), we can now achieve faster compilation time with some performance dip compared to full model. We plan to make this flag default to True in the 2.5 release. ### Export - Introduce ShapesCollection, a dynamic shapes builder API (#124898) ### Inductor - Add higher order associative scan operator (#119430) ### jit - Add aten::sort.any op for sorting lists of arbitrary elements (#123982) ### MPS - Conform torch.mps to device module interface (#124676) ### XPU - Inductor Intel GPU backend (#121895) - a new autocast API torch.amp.is_autocast_available(#124938) - attributes to xpu device prop (#121898) - XPU implementation for PyTorch ATen operators (#120891) - generic stream/event on XPU backend (#125751) - gpu trace on XPU (#121795) - Switch to torch.float16 on XPU AMP mode (#127741) ### ONNX - quantized layer norm op to opset 17 (#127640) - symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828) - Support for Some Bitwise Ops in Onnx Exporter (#126229) - Allow ONNX models without parameters (#121904) - Integrate onnxscript optimizer (#123379) ### Vulkan - quantized transposed 2D convolutions (#120151, #122547) - the quantized ReLU operator (#123004) ## Improvements ### Python frontend - bfloat16 support for torch.binary_cross_entropy on CPU (#123823) - MAP_SHARED option for torch.load when mmap=True (#124889) - default value when printing function signature (#127059) - all variants of upsampling functions to be done in high precision in autocast (#121324) ### Composability - FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in torch.compile. There were much coverage improvements this release: - New metas / fake tensor rules: - aten._embedding_bag_dense_backward, aten._embedding_bag_per_sample_weights_backward (#125785), aten.randint.out, aten.rand.out (#122375), aten.unique2 (#124306), aten.histc (#124548), aten.channel_shuffle (#123033), aten._masked_scale (#127389), aten.addcdiv.ScalarList, aten.addcmul.ScalarList (#123486) - New decomps: - Aten.resize_as (#122317), several out= variants of ops with existing decomps (#122979, #115437) ### Autograd frontend - `nn.functional.batch_norm`: add forward AD rule for miopen backend (#125069) - `nn.functional.scaled_dot_product_attention`: add backward rule for cuDNN backend (#122510) ### Release Engineering - Add CI support for aarch64 linux. The CI is triggered when the ciflow/linux-aarch64 label is added. (#120931, #121284, #125255, #121136, #124781, #125599) - Add experimental CUDA pip wheels for ARM architectures supporting the NVIDIA Hopper architecture as nightly binaries and a prototype for the PyTorch 2.4.0 release. 
(#126174, #127514) - Add support for CUDA 12.4 in CI/CD (#121684, #121956, #127825, #125944, #128250) - Add support for numpy 2.0.0rc1 in CI and CD (#123286, #122157) - Enable support for `torch.compile` and triton with Python 3.12 CI/CD (#127547, #123307, #126218) - Intel GPU enablement in CI (#122254, #123920, #125655) - Migrated CI/CD jobs to macOS 14 (#127582, #127853, #125801) - ROCM: upgrade CI/CD to 6.1 (#124811, #118216, #124300, #125646) - CUDNN version 9.1.0.70 for CUDA 11.8, 12.1, 12.4 builds (#123475) - NCCL submodule v2.20.5 (#121635) - submodule oneDNN v3.4.2 (#126137) - Wrapped deprecated function/class with typing_extensions.deprecated (#127689) ### nn frontend - Add `swap_tensors` path to nn parametrizations (#124130) - Relax `use_count` constraints for `swap_tensors` when `AccumulateGrad` node holds a reference (#127313) - Increase numel limit to 2^63 for replicatepad1d (#122199) - Use `int64_t` indexing for `Upsample2d` backwards (#123682) - Remove warning from `LazyModuleMixin` constructor (#123968) ### Optim - Radam and Nadam support the flag for "maximize" (#126765, #127214) - Include scheduler_on_plateau in optim.h (#121722) ### Foreach - Allow foreach ops to run for any backend, not just CPU (#127412) ### cuda - Update CUDA out of memory message with private pool info (#124673) - Add autocast rule for torch.vdot (#125697) - Fix type hint for cuda.get_device_name() and cuda. get_device_capability() (#126743) ### Quantization - X86 Inductor backend - Enable linear and linear-unary post-op gelu quant recipe for `X86InductorQuantizer` (#114853) - Add Quantization recipe filter per operator type for `X86InductorQuantizer` (#122775) - Add Matmul recipe into `X86InductorQuantizer` (#122776) - Improve performance of `qconv` by reducing integration overhead (#123240) - PT2E quantization flow - Add support for conv transpose + bn + {relu} weights fusion in PTQ and QAT (#122046, #123652) - Simplify `fake_quant_per_channel` (#123186) - Support fp8 quantization (#123161) - Propagate get_attr meta through known ops only (#124415) - Fix issue of lowering nn.linear ops with kwargs (#126331) ### Distributed #### c10d - `TORCH_NCCL_HIGH_PRIORITY` option for ProcessGroupNCCL (#122830) - `__repr__` to P2POp class (#126538) - `commCreateFromRanks` to c10d (#127421, #127982) - `dist.get_node_local_rank` helper (#123992) - an option to enable TCPStore libuv backed for c10d rendezvous (#124684) - Captured dtype in Flight Recorder (#126581) - Enable ncclCommDevIdxMap unconditionally (#122049) - Extended the flight recorder dump from timeout to any exception (#123023) - Make TCPStore server use libuv by default (#127957) - Make `get_node_local_rank()` accept fallback_rank (#126737) - Make abort communicators in destroy_process_group call on default and code cleanup (#124334) - Mapped float8 types to uint8 for allgather (#126556) - Optionally avoided rethrowing CUDA Errors in NCCL Watchdog (#126587) - Wrapped TCPStore check in a try/catch (#127030) - `ProcessGroupWrapper` support custom backend (#124447) - ncclComm is not aborted before checking exception (#124466) #### DeviceMesh - Add a private init backend option (#124780) - Initialized mesh tensor with CPU context (#124767) - Add `DeviceMesh.from_group()` (#124787) - Make `_validate_tp_mesh_dim` support 3D (#125763) - Supported N groups in from_group (#126258) - Make sure device mesh can be imported from torch.distributed (#126119) #### Distributed quantization - Used BFloat16 in distributed quantization when supported by NCCL (#125113) 
#### DistributedDataParallel (DDP) - Add a mode to avoid clone() in DDPSink (#122927) #### Distributed Checkpointing (DCP) - Add `type_check` param to copy state dict utils (#127417) - Add strict option to `DefaultPlanner` (#123869) - Always created requests for non-tensor objects (#125334) - Always flattened mapping even if no tensors present (#125335) - Correctly handle `_extra_state` (#125336) - Implement `broadcast_from_rank0` option for model/optim `state_dict` (#125338, #125339) - Introduced async staging extension points (#122965) - Make distributed `state_dict` support `torch.distributed` is not initialized case (#127385) - Make param name consistent with overridden function (#124770) - Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070) - Supported flattening the optimizer `state_dict` when saving and unflattening when loading (#127071) - Unified the API signatures of `set_model_state_dict` and `set_optimizer_state_dict` (#127384) #### DTensor - backward support for `scaled_dot_product_attention` (flash-attention) (#122541) - more foreach ops (#123214) - op support for `view_as_complex` and `view_as_real` (#122569) - op support for memory efficient attention (#122996) - support for `fused_adam` and `fused_adamw` when lr is a tensor (#126750) - ASGD foreach optimizer with associated unit tests (#121942) - the handle of DTensor.device_mesh.device_type in dynamo (#118803) - the support of placement kwargs for DTensor.to_local() in dynamo (#119947) - scatter op with simple replication (#126713) - distributed topk operator (#126711) - Make Partial placement public (#127338, #127420) - ensure expected input spec have correct tensor meta (#122949) - ensure meta tensor random op does not alternate rng state (#125693) - Move early return check into redistribute autograd function (#121653) - Move some modules to private namespace (#127339) - Standardized multi mesh-dim strategy with utils (#126712) - 2D clip_grad_norm_ (#121945) - simple replicate strategy for SVD (#127004) - Turned on foreach implementation for (1) `clip_grad_norm_` for DTensor by default (#126423), (2) optimizer for DTensor by default (#123394) #### FullyShardedDataParallel (FSDP) - device in `pin_memory` argument (#119878) - private _unshard API (#124304) - privateuse1 in FSDP's sharded grad scaler (#126971) - Avoided CPU sync in `clip_grad_norm_` (#122001) - Marked `pre_backward_hook` unserializable (#125464) - Skipped FSDP hooks base on dynamo config (#123021) - Used generic device handle instead of cuda (#121620) #### ShardedTensor - Supported non-contiguous rank validation in sharded tensor (#123230) #### TorchElastic - debug info logging interface for expired timers (#123883) - health check server hook in torch elastic (#122750, #123504) - option for sharing TCPStore created by rendezvous handlers (#125743) - support for binding to TCP in WorkerServer (#127986) - Applied "distributed debug handlers" (#127805) - Cleared timer for already terminated process (#122324) - Skipped expired timer logging for empty expired timers (#125039) #### Tensor Parallel - wildcard support for Tensor Parallel `parallelize_plan` (#122968) - kwargs support to `prepare_module_input` (#124114) ### Profiler #### Profiler `torch.profiler`: - metrics for performance timing and other statistics collection (#123412) - Kineto traces will export ns granularity for finer timestamps (#122425, #123650) - Unified the device (CUDA, XPU, PrivateUse1) in profiler’s post processing (#123247) - Improve profiler post processing by 
iterating frontend function events rather than all function events (#124596) - Report strides in json traces (#125851) - Register COLLECTIVE_COMM profiler activity type when available (#121461) - Support third-party devices emit a range for each autograd operator (#125822) #### Memory Snapshot `torch.cuda.memory._dump_snapshot`: - Improve the description of blocks with missing frames in the Memory Visualizer (#124784) - Add recordAnnotations to capture record_function annotations (#124179) #### Profiler `record_function`: - For with_effects, skip over profiler.record_function_exit (#121829) - support for RecordFunctionFast to take inputs (#123208) - support for kwargs in RecordFunctionFast (#123600) - Collecting autograd sequence numbers on PythonTLSSnapshot dispatch keys for Nested Tensor (#123304) ### Export - a printer to the unflattened module (#124315) - disable_forced_specializations flag (#124949, #126925) - export support for auto_functionalize (#121990, #122177, #122246) - readable placeholder names to ExportedProgram nodes (#123587, #123590, #124765) - set_grad_enabled higher order operator (#123391, #125066, #121736) - stack_trace for non-strict export (#121034) - torch_fn, a more consistent metadata across strict and non-strict export (#122693) - torchbind tracing support (#122619, #123370, #122622, #125490) - Allow static constraints in dynamic_shapes (#121860) - Ignore logging.Logger.* calls during dynamo export (#123402) - Make metadata serialization more strict (#124411) - Populate ShapeEnv's var_to_val during deserialization (#121759) - Prototype TorchScript 2 ExportedProgram Converter (#126920, #127466) - Provide refine function for automatically accepting dynamic shapes suggested fixes (#127436) - Save/load example inputs in the ExportedProgram (#122618) - Suggest constant dim values in dynamic shapes fixes (#125458) - Support map in pre-dispatch functionalization (#121444) - We introduced the concept of “effect tokens”, which is how we allow side-effectful operators in torch.compile/export (#121552, #122357) ### Fx - shape inference tool (#120097) - device_ordinal to Subgraph in splitter_base (#125616) - exclusion function to minimizer base (#124504) - missing forbidden mutation methods in immutable collections (#125468) - option to turn on return_tuple in _SplitterBase (#123868) - prefix option to CapabilityBasedPartitioner (#126382) - Create block traverse mode in minimizer for graph aware debugging (#125613) - Implement Graph Transform Observer (#127427) - Option to include stride and device annotation in gm.print_readable() (#123690) - Register create_node_hook (#126671) ### Dynamo - We performed a careful audit and fixed all known memory leaks in TorchDynamo. - We hardened `torch.compile` + `__torch_function__` support by onboarding Scaled Dot Product Attention (SDPA) and TensorDict. 
### Inductor - 0 initialization to Triton masked loads (#127311) - HalideCodeCache (#126416) - clone if output is a view from constant (#123200) - config to allow buffer mutation (#126584) - decompose_mem_bound_mm to the customization pre and post grad passes (#123376) - inductor support (#123709) - kernel_code logging artifact (#126631) - lowering for avg_pool{1, 3}d (#116085), cummax, cummin (#120429) - missing files to torch_key (#128230) - mode to MemoryDep to track atomic accumulates (#123223) - pybind for tensor_converter util functions (#121744) - qlinear_pointwise.binary op for X86Inductor backend (#123144) - support for multiple flexattention calls in a single compile (#125516) - tensor_constantX to pass constant buffer update's check (#122562, #122690) - the quant lift up pass in convert phase (#122777) - a decomposition for select_scatter (#124426) - Allow multiple cudagraph recordings per compiled graph (#126822) - Automatic detection for buffer mutation and binary linking (#126706) - Change - OverridesData to take callables instead of strings (#123397) - aot_compile callsites (#122225) - Clean up for removing 2 decompose patterns (#123422) - Codegen runtime asserts in Inductor (#124874) - Customize pre grad and post grad patterns (#121915) - Disallow fusions of foreach and reductions (#127048) - Enable - lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593) - mmaped weights when CUDA is used (#124346) - meta internal AOTInductor compilation on ROCM (#124123) - Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459) - Filter non input symexprs from codecache guards (#128052) - Get PT2 Cutlass backend working under fbcode (#125688) - Hipifying aoti code_wrapper (#124241) - Improve group batch fusion with same parent/users fusion enablement (#127648) - Inductor respects strides for custom ops by default (#126986) - Initial implementation of Inductor FX Graph Remote Cache (#124669) - Make - torch._inductor.dependencies.Dep a proper class (#124407) - c10/util ostream function implementations to their headers (#123847) - some cudagraphs checks into C++ (#122251) - Pass triton kernel info to record function (#123871) - Read the patterns from the config instead of hard-code passes (#125136) - Remove - API that allows for extra deferred runtime asserts during lowering (#124864) - assertion for cat target_func (#125540) - Serialize large weights (#123002) - Specialize on unguarded alignment of example inputs (#123319) - Split cat customization (#123045) - Support - CUDA_INC_PATH env variable when compiling extensions (#126808) - custom op in JIT with cpp wrapper (#122554) - pytrees as associative_scan input (#122137) - use_runtime_constant_folding for CPU (#122563) - Try to reuse old symbol name rather than new symbol name when renaming (#124782) - Update the cpp_wrapper entry function signature (#121745) - Use source code hash instead of torch version (#126092) - Various improvements to error handling during autotuning (#126847) - batch pointwise op + unbind stack pass in post grad (#126959) - config target platform (#126306) - disable comprehensive padding in fbcode (#124191) - enable software pipelining on AMD devices (#125858) - epilogue support for gemm template (#126019) - make mask_rcnn inference work in max-autotune mode (#123008) - pt2 dper passes: run shape prop before each pass (#122451) - remove 2 decompose patterns (#123371) - switch assume_aligned_inputs to False (#124336) - unified the vectorized conversion with 
at::vec::convert for all data types (#119979) ### jit - Shape function fix for _batch_norm_with_update (#122430) - Attach target function to OSError when source can't be found (#125248) - Support getattr/hasattr on NamedTuple (#121863) ### ONNX - Allow fake models to run with ONNXProgram.__call__ (#122230) - Fix ONNX export with print (#123368) - Improve torch.onnx.export runtime from O(n^2) to O(n) (#123025, #123027, #123063, #124909, #123028, #123028, #123029, #123026, #124912) - Make ONNXProgram.model_proto and disk file the same (#122196) - Skip optimizer when it fails (#127349) - Update decomposition table to core ATen ops (#127353) - beartype to emit warning instead of error by default (#123205) ### MPS - Add naive quantized int4_mm, int8_mm and .gputrace capture hooks (#125163) - Better error-check for linear op (#124952) - Enable - index_select for complex types (#122590) - torch.mm and other ops for complex dtypes (#127241) - Implemented isin_Tensor_Tensor_out for MPS backend (#124896) - Improve F.adaptive_avg_pool2d error messages on MPS backend (#124143) - Native non-zero op implementation (#125355) ### XPU - Generalize host allocator to be device-agnostic(#123079) - Make macro with AMP more generic(#124050) - Refactor - CUDA’s AMP autocast policy to be generic(#124051) - gpu trace to be device-agnostic(#121794) - Support generic Stream/Event on CUDA/HIP backend(#125757) ## Bug fixes ### Python frontend fixes - DtoH sync in torch.index_put_ (#125952) - `torch.load` map_location for wrapper subclass and device being serialized through numpy (#126728) - memory leak in torch.dtype.to_complex() (#125154) - nn.Parameter constructor type hint (#125106) - parameter name in torch.can_cast to from_ (#126030) - support of paths with space in torch.utils.cpp_extensions (#122974) - Support numpy array in Tensor.__eq__ (#122249) ### Composability fixes - FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in torch.compile. There were a number of bug fixes improvements this release: - FakeTensor fixes: - Handle symbolic size access in FakeTensor (#124760) - Avoid cuda init in FakeTensorMode (#124413) - Do not run CUDA lazy init if it is triggered with fake mode on (#122636) - Refactor faketensor ops that produce unbacked symints to memoize (#125623) - Meta device fixes: - fix meta tensor set_() incorrectly modifying nbytes of the storage (#123880) - Fix aten._weight_int4pack_mm meta registration for float16 inputs (#124136) - Fixes to python decompositions: - aten.upsample_bicubic2d: support for uint8 (#120411) - aten.upsample_nearest* ops: properly registered decomp to dispatch keys (#122782), (#122783) - _refs.masked_fill: support privateuse1 device when value.device.type is cpu (#124835) - _refs._reshape_view_helper: specialization shortcut for converting n-d to 1-d and 1-d to 2-d views (#127641) - Fix decomp for torch.tensor(...) constructor with nested python lists(#125639) - `Aten.rrellu_`: fix decomp when default values are missing (#126978) - AOTDispatcher is the component of the `torch.compile` stack that functionalizes and normalizes the graph, and adds support for compiling the backward during training. 
There were several bugfixes and improvements to AOTDispatcher: - Fix `torch.compile` used with triton kernels under inference_mode (#124489) - Fix incorrect graph when functionalizing aten.expand followed by mutation (#122114) - Properly keep input mutations in the graph when they are under torch.no_grad, even if there are outstanding aliases (#122433) - Replay original views from the user code instead of falling back to as_strided in a few cases, which can improve performance of the backward pass in cases where `torch.compile` captures small graphs with outputs that alias graph inputs (#121007) - For `__torch_dispatch__`-based tensor subclasses, support custom layout overrides under torch dispatch mode (#125379) ### cuda fixes - cuda array for empty arrays (#121458) - a perf regression in kernel launcher for the foreach_* family of ops (#123566) - CUDA out of memory error message formatting (#123984) - CUblasLt compilation on windows (#125792) ### Autograd frontend fixes - `torch.utils.checkpoint`: Use pytrees to improve determination of what RNG state to stash (#121462) - Fix error message of autograd (#123154) ### Release Engineering fixes - Fix mypy issues in fake_tensor.py (#124428) - Fix running of: lintrunner --all-files --take FLAKE8 (#124771) - Fix libc and libstdcxx installation on conda environments (#121556) - Release engineering tooling and CI fixes. Workflows, Trymerge, Bot Labeler, Mergebot (#125042, #121762, #121920, #124965, #122155, #123301, #121733, #127567, #128080) ### nn frontend fixes - access to unitialized memory in VSX vector functions for quantized values (#122399) - `swap_tensors` path in `nn.Module._apply` for modules that inherit from `RNNBase` (`RNN`, `GRU`, `LSTM`) (#122800) - `ctc_loss` zero/negative length corner cases (#123193) - `_LazyConvXdMixin.initialize_parameters` and add related tests (#123756) - `load_state_dict` with unexpected key whose prefix matches a valid key (#124385) - `requires_grad` propagation in `nn.utils.parametrize` (#124888) - `nan` with large `bfloat16` values for `FlashAttention` backend of `nn.functional.scaled_dot_product_attention` - issue in `affine_grid_backward` when `grad_grid` is non-contiguous (#124370) - Add error checks for invalid inputs on `thnn_conv2d` (#121906)(#122135) ### Optim fixes fixes - Wrong ASGD implementation (#125440, #126375) - loading optimizer options from archive (#125215) ### linalg fixes - `svd_lowrank(..., M)` in the presence of broadcasting (#122681) - `linalg.vector_norm` when used with `autocast(cuda)` (#125175) ### CPP fixes - Handle all types c10::isSigned (#125637) - crash for AVX512 int4 matrix multiplication if weights are unaligned (#124128) - loading custom C++ extension within DataParallel-ized model (#125404) ### Distributed fixes #### c10d - `coalescedCollective` op Flight Recording (#120430) - `group_name/group_desc` set up in eager initialization (#127053) - bug in `_update_process_group` API (#128262) - bug in update_process_group DDP API (#128092) - excepthook crash on exit after destroy_process_group (#126739) - various errors in `TCPStoreLibUvBackend.cpp` (#127230) - work handle for coalescing manager (#122849) - Add check gloo availability when doing `_ProcessGroupWrapper` check (#124233) - Add initialize lastEnqueuedSeq_ and lastCompletedSeq_ in ProcessGroupNCCL (#121980) - Ensured gil is not released when calling to PyBytes (#128212) - Guarded gpu context during abort (#127363) - Make monitorThread sleep when we try to dump flight recorder (#123788) - Only included NCCL 
related header file with macro `USE_C10D_NCCL` (#127501) - Prevented `wait_tensor()` calls on graph inputs from getting DCEd for AsyncCollectiveTensor (#125677) #### DeviceMesh - hash and eq not match (#123572) - device type issue in `_get_device_handle` (#124390) - Enable cache and reuse of sliced result to prevent funky behaviors and NCCL deadlock at large scale (#122975) - Make dtype of mesh tensor from `init_device_mesh()` consistent with directly calling `DeviceMesh()` (#123677) #### DistributedDataParallel (DDP) - DDP `no_sync` when `find_unused_parameters` is True (#124193) #### Distributed Checkpointing (DCP) - to remove non_persistent buffer in distributed state dict (#125337) - `set_optimizer_state_dict()` changes the parameters with some optimizers (#125708) - various bugs for `broadcast_from_rank0` (#127635) - Remove the check of FSDP has root (#121544) - Kept params in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644) #### FullyShardedDataParallel (FSDP) - FSDP 2D state_dict to use run_check=False (#123802) - HSDP: sharding placement (#123778), validation error msg (#123019) - summon_full_params on submodule (#123290) #### TorchElastic - Make `torch.multiprocessing.ProcessContext.join()` wait for all child procs to exit before return (#125969) ### Profiler fixes - an asynchronous trace bug where end timestamp overflows and events are years in the future (#124080) - torch.profiler Schedule Function (Function Event only) to accumulate events (#125510) - Add a sanity test to the unit testing (#124773) - Add missing field device_resource_id in profiler events (#121480) - Cleaned up deprecated use_cuda by default (#126180) - Do not emit a warning when using CPU profiler only (#125654) - Handle more cases of symbolic sizes/strides detection (#123696) - Reduced warning msg in torch.profiler when using AMD (#124469) - Release gil in prepareProfiler (#121949) - Remove a redundant *1000 to timestamp since we already have ns precision (#124374) - Split up profiler test file (#124856) ### Dynamo fixes - 'Could not infer dtype of SymBool' on torch.tensor call (#125656) - 'get_attr' call in dynamo 'run_node' (#127696) - 'get_real_value' on placeholder nodes (#127698) - assume_constant_result for UnspecializedNNModuleVariable methods (#127695) - guard_size_oblivious on non-symbolic expression (#123743) - tvm backend interface (#126529) - Add support for tensor's is_complex method (#124927) - Allow asserts to fail (#126661) - Forward OptimizedModule.__setattr__ to the wrapped module (#122098) - Initial exception handling support in dynamo (#126923) - Keep track of ViewMeta with symbolic inputs (#125876) - Support macOS and Linux/aarch64 platforms (#128124) ### Export fixes - GraphModuleDeserializer handling of signature (#122342) - bug in get_update_constraint (#125194) - conv decomp when decomposing to core-aten (#123283) - mode not on stack error for while loop (#122323) - runtime assertions to add call_function (#125878) - to_copy to be inserted in the exported graph (#125628) - unflattening with duplicate tensors (#125192) - up nn_module_stack for nodes occurred around tracepoint ops (#124457) - leaky fake tensor on attribute assignment, support buffer assignment (#122337) - Allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910) - Allow modules to be created in the forward (#125725) - Correctly serialize empty list based on argument type (#123748) - Forward fix failures for torch.export switch to predispatch (#126081) - Handle param aliasing 
(#127471, #125509, #125758) - Make error name private (#126715) - More strictly respect scope when removing inputs in unflattener (#127607) - Skip nn_module_stack verifier for non-fx.GraphModule modules (#122210) ### Fx fixes - fx graph triton import bug (#122041) - graph partitioner and make runtime assertion work with submodules in export (#125793) - infinite recursion in API BC test (#125706) - mem size mismatch from split/chunk in const folding (#125199) - triton import time cycles (#122059) - Don't intersect when clamping for size oblivious (#123675) - Don't use Proxy torch function in the sym size calls (#121981) - FakeTensorProp assert consistency of sizes when metadata previously existed (#124059) - Keep set_() input mutations in the AOTDispatcher graph, ban other cases (#122981) - Make - check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372) - torch._check understand Eq commutativity (#125629) - Preserve - node.meta when fusing subgraph (#125261) - partitioner order (#122111) - unbacked SymInt on SymNode (#120816) - Remove - duplicated nodes in dfs_iter_find_cycle (#125585) - incorrect check (#123616) - Skip index_put_ in dce (#122683) ### Inductor fixes - AFOC QPS Regression (#122944) - C++ compilation error for tensor array in abi_compatible mode - FakeTensorUpdater logic for updating fake tensors (#116168) - a bool value codegen issue when calling custom ops (#127398) - a bug when mutated buffer meets .to (#127671) - a codegen issue when .item() is used for kernel arg (#126575) - a dynamic shape problem when lowering diagonal (#121881) - an internal test regression (#123481) - another out-of-bounds access (#122580) - cat backwards wrapping on symints (#121527) - compilation_latency regression caused by #127060 (#127326) - constant propagation pass (#114471) - cuda compilation under fbcode remote execution (#126408) - cummax and cummin lowering for empty case (#126461) - cutlass path in inductor (#125463) - edge case in JIT vs. 
AOT fusion after finalizing MultiTemplateBuffer (#126622) - includes to system Python (#125285) - issue with randint + symbolic shapes (#122428) - issues in pre_grad passes (#123181) - mask propagation in the presence of where (#125574) - memory planning compile error (#123867) - missing unbacked def for unbacked in input expr (#127770) - nextafter in inductor CPP codegen (#126876) - ops.scan for non-commutative operators (#126633) - out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511) - scheduler typehints (#127769) - test with inlining flag (#128200) - to #126656 (#127050) - triton codegen main do_bench_gpu import error (#126213) - unbacked symbol in stride when using item() (#122298) - unsupported type of output=s1 (#126797) - ScatterFallback codegen (#124580) - a constant tensor device move issue (#128265) - an assertion for node debug str (#127021) - grid z bug for large grid (#127448) - invalid call to aoti_torch_tensor_copy_ (#126668) - linear_add_bias path (#127597) - loop ordering test (#127807) - miss isa bool check (#128274) - post_grad pattern (#127457) - redis-related env vars in remote_cache.py (#127583) - Add missing acosh op to vec256_float_neon.h (#122513) - Back out - "Added a check in register_lowering to avoid decomposed ops (#117632)" (#122709) - "Precompile triton templates (#121998)" (#123305) - Backport https://github.com/openai/triton/pull/3433 (#122470) - Correctly calculate the numel with symint in DDP fusion (#124422) - Disable stack allocation when there is a fallback op (#122367) - Do not forward parent's value range to CSE variable for variables created within codegen (#123099) - Do not propogate (#124769) - Don't clamp slices generated from cat kernel (#124139) - Enable B019 - flags memory leaks through LRU cache on method (#127686) - FX graph cache: Fix bug handling constants (#121925) - Fall back to eager mode when viewing with differing bitwidths (#120998, #121786) - Implement masked_load for integral types (#122608) - Improve unbacked SymInt input support in Inductor (#124739) - Inductor: fix Conv output stride for dynamic shapes (#121400) - Remove symbol exports in C shim for Windows (#125472) - Revert "Inductor respects strides for custom ops by default (#126986)" (#127923) - Use pexpr, not texpr in Triton launch codegen (#128038) - turn off triton memcache for amd devices (#122560) - typing scheduler.py [1/2]: Bug fix (#126610) - use two pass reduction for deterministic reduction order (#115620) - Forward fixes - for D56289438 (#124882) - for templates + views (#127446) ### ONNX fixes - Fix list dtype finding bug in dispatcher (#122327) - Rename ort to maia in dynamo's ort backend (#124967) - Cast checkpoint weights to match model parameter's dtype (#122100) - Reduce excessive warning to info (#122442) - Prevent dup initializers when ONNXProgram.save is called many times (#122435) ### MPS fixes - FFT descriptor fields to resolve precision issue (#125328) - FFT implementation bug dropping negative frequency components (#123274) - GELU, LeakyRELU and MISH on non-contiguous tensors (#123049) - abs for complex types (#125662) - copies larger than 4GB (#124635) - crash with binary_cross_entropy is invoked for half dtypes (#124258) - for MPS regression in scalar creation (#123234) - for addcdiv contiguous problem (#124442) - naive matmul for BFloat16 (#121731) - nextafter for negative values (#125029) - overflow in cumsum when dtype is bool (#125318) - strided ELU op correctness issue (#125692) and mse_loss correctness issue (#125696) - Fwd-fix 
for clamp regression (#122148) - Remove in place views fixing various crashes (#124895) ### XPU fixes - record issue on XPUGuardImpl (#123523) ## Performance ### Python frontend - Use sleef on macOS Apple silicon by default (#126509) ### cuda - Speed up torch.softmax kernel (#122970) ### nn frontend - Parallelize upsampling ops across the batch/channel dimension (#127082) ### Optim - Add fast fused kernels for Adam, AdamW, SGD, and Adagrad on CPU (#123074, #123629, #124905) ### linalg - Improvements - the CPU performance of `linalg.vector_norm` when reducing over a dimension of length 1 (#122143) - performance of FP16 `gemv` on ARM (#126297, #126745, #126746, #126877, #127033) and BF16 `gemm` fallback on ARM (#126592) - autotuning through `TunableOp` on ROCm (#124362) ### Foreach - Allow int vals to go down the fastpath for _foreach_max (#127303) - `_foreach_copy` now supports different source/dest dtypes on the fast path (#127186) ### Distributed #### C10d - Disable compute of collective duration by default (#122138) #### DTensor - Used str for reduce_op instead of c10d enum (#125172) - Make early return for `_split_tensor` (#125810) - Make directly return` local_tensor` under `no_grad` (#128145) #### Distributed Checkpointing (DCP) - Improve the performance of distributed state_dict (#125501) #### TorchElastic - Changed `monitor_interval` for torchelastic default value to 0.1 sec (#124692) - Add timing events to different stages of rendezvous (#125636) ### jit - Fix exponential memory usage when TorchScript types share the same name (#121874), (#121928) ### Fx - Add side table to FX Graph for O(1) op/target query (#121565) - Apply guard knowledge to all simplifications (#123342) - Do not calculate hint in advice_is_size (#124472) - Enable FX graph and symbolic shape caching (#121697, #125258, #123724, #124610) - Flatten/Unflatten micro optimization in proxy_tensor.py (#121993) - Minor compile time optimization in has_free_symbols (#122144) - Skip assert in check_is_size (#124209) - Teach ShapeEnv that a <= b => a < b + 1 (#123436) - Use sympy xreplace instead of subs (#124208) - `_find` not update unchanged replacements (#124274) - eval_static: guards, unbacked compute once (#124217) ### Inductor - Speedup `convert(Vectorized::loadu(ptr, 8))` on ARM (#125889) - Add more mm kernel choices (#125000) - Add NEON ISA support on - arm64 Macs (#122217) - aarch64 (#123584) ### MPS - Improvements to perf of int4pack_mm (#125983, #127135, #125704) - Making copy_cast, softmax and cat_out unranked (#123191) ### XPU - Intel GPU - Convolution&Deconvolution aten operators(#117529) - Matmul aten operators(addmm, badbmm, etc.)(#117202) - Support xpu host allocator (#123080) - oneDNN - Conv primitive integration (#117512) - Matmul primitive integration (#117112) - library compilation for Intel GPU support (#117098) ## Documentation ### Python frontend - Add doc for - torch.distributions.utils.clamp_probs (#128136) - the legacy constructor for Tensor (#122625) - torch.Size.numel (#124186) - torch.utils.benchmark.utils.compare.Compare (#125009) - torch.utils.collect_env.get_env_info (#128021) - Clarify Security Policy (#120531) - Fixes doc - example of torch.masked_scatter (#123664) - for torch.load map_location (#125473) - Improve doc for - torch.set_default_dtype (#121730) - torch.load weights_only argument (#127575) - Update doc for - functions in torch.multinomial (#125495) - functions in torch.random (#125265) - torch.dot (#125908) ### Composability - Add extended debugging options for troubleshooting 
`torch.compile` issues (#122028) ### cuda - Add doc for torch.cuda.nccl.version (#128022) - Add documentation for nvtx.range (#121699) ### Autograd frontend - `torch.autograd.Function`: update docs for separate context and forward functions (#121955) - `torch.utils.checkpoint`: Improve error message when use_reentrant=True is used with .grad() (#125155) - Improve the clarity of the `torch.Tensor.backward` doc (#127201) - Fix typing for `torch.autograd.Function` with ctx-less forward (#122167) ### Release Engineering - Fix torch and `torch.compile` links (#121823, #121824) - Add - fuzzer instructions to pt2 bug template (#123156) - better instructions for pytorchbot merge command on cancel (#124947) - instructions on how to run doc coverage locally (#123688) ### nn frontend - Fixes - `KLDiv` example (#126857) - `torch.nn.TripletMarginLoss` allowing margin less or equal to 0 (#121978) - example and typo in `nn.ChannelShuffle` and `nn.PReLU` docs (#123959) - redundant tensor in `nn.MaxUnpool2d` example (#127850) - wording in `nn.Linear` docstring (#127240) - Improvements - `NLLLoss` documentation (#127346) - documentation of `torch.nn.utils.rnn` (#123559) - return value documentation for `nn.Module.load_state_dict` (#123637) - the example description for `torch.nn.utils.rnn.pad_sequence` (#123183) - Update the `is_causal` explanation in the `nn.functional.scaled_dot_product_attention` doc (#127209) - Warn SDPA users about dropout behavior (#126294) ### Optim - Document complex optimizer semantic behavior (#121667) - Add missing parameter doc of Adagrad (#125886) ### linalg - Improve docs on the sorting of `eig`/`eigvals` (#127492) ### Distributed #### c10d - Add - a doc page for NCCL ENVs (#128235) - migration notes for --local-rank option style change for torchrun for PyTorch 2.0 onwards (#109480) - Documents - 'tag' limitation for nccl send/recv (#125278) - `destroy_process_group` usage (#122358) - Fixes - example in `torch.distributed.new_subgroups` docstring (#123492) - the document of `distributed.new_group()` (#122703) #### Distributed Checkpointing (DCP) - Corrected typos in assert (#122633) #### DTensor - Add comment on replicate -> partial for _NormPartial (#121976) - Updated public API docs for DTensor (#127340) #### FullyShardedDataParallel (FSDP) - Remove excessive warnings and rewrite FSDP docstrings (#123281) - Fix docs for inter/intra node PG helpers (#126288) - Updated docstring to include `device_mesh` arg (#126589) ### Profiler - Updated PT2+Profiler docs (#122272) ### Export - Fix documentation for register_fake_class (#126422) ### Fx - Document for add_var_to_val (#121850) ### Dynamo - Add a Dynamo deepdive to documentation (#122305) - Update compile doc to suggest Module.compile (#123951) - Fixes - links rendering when surrounding code in Dynamo deepdive (#123427) - the link to torch.compiler_custom_backends (#125865) - typos in torch._dynamo.config.py (#126150) - NumPy + backward example (#126872) ### Inductor - Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672) - documentation for pattern_matcher.py (#127459) ### ONNX - Fix pytorch version for onnx in doc (#124182) - Add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055) - Documenting torch.onnx.operator.shape_as_tensor (#128051) - Init sigmoid comments (#127983) (edited) ### XPU - PyTorch 2.4 XPU Getting Started (#127872) - Update Intel GPU Support on README (#126001) - Tensor (#126383 #127280) - Stream (#121398) - AMP (#127276 #127278) - `torch.compile` with XPU support 
(#127879) ## Developers ### Composability - cpu_fallback for aten::triu_indices on custom device crash (#121306) - API to check whether running in torch_dispatch mode (#122339) - clarify c10::Dispatcher kernel static asserts (#124519) ### Release Engineering - TD (target determination) reorders tests in CI based on heuristics and removes tests it believes to be irrelevant to the changes in the PR. (#121835, #121836, #122279, #122615, #122901, #124082, #122976, #125931) - torchbench on-demand test workflow (#122624). - BE: Ruff lint improvements (#124743, #124570) - ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408) - freezing option for cpu inductor accuracy test in inductor CI (#124715) ### Optim - Modify device check in capturable optimizer to support more devices (#124919) - Improve typing and error messages in LRScheduler (#125556, #127943, #121633, #125161) - Only initialize state if needed in SGD (#123757) - Exempt `torch.compile` from more checks in Adamax (#123498) - Merged the pyi files into py files of optimizer (#125153, #125452) - Tighten fallback conditions for compiled optimizer (#125825) ### Distributed #### c10d - Updated error message for sparse all-reduce (#121644) - Add - generic scuba logging capability into c10d (#121859) - log the target of Flight Recorder dump (#122345) - the source rank in the logs when detecting the timeout (#122850) - more fields for periodic logging (#123860) - `pg_name` and `pg_desc` to logger (#126409) - Work's numel to logger for debugging purposes (#127468) - Allow user to pass process group description for ProcessGroupNCCL (#123472) - Print the duration of the broadcast of `ncclunique_id` (#123963) - Pass and recorded `process_group_name` when creating ProcessGroupNCCL (#123117) - Pass pg name and desc to NCCL communicator (#124149) - Make only PG0 should dump when monitoring thread timed out (#125356) - split seq_id to `collective_seq_id` and `p2p_seq_id` (#125727) - Print certain logs only on the head rank of each node (#125432) - Make warn env vars only once during program (#127046) #### DTensor - Add some initial c10d ops to `CommDebugMode` (#125475) - Remove unused failed_reason (#126710) - Add `all_reduce_coalesced` tracing to CommDebugMode (#127025) #### Distributed Checkpointing (DCP) - additional logging for improved observability in DCP (#121352) #### FullyShardedDataParallel (FSDP) - Remove unnecessary warnings (#126365) - warnings on wrapping `ModuleList`/`ModuleDict` (#124764) #### Miscellaneous - Remove dist_ prefix from TORCH_LOGS shortcuts (#126499) - Make torch.distributed.breakpoint() to work under Python/Meta contexts (#118645) #### TorchElastic - Make log directory creation idempotent (#126496) ### Fx - Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473) ### Inductor - `aoti_torch_item` as a util function (#126352) - model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398) - Change the log for the group batch fusion (#122245) - Do not use `importlib.load_module` (#122542) - Enable FX graph caching on another round of inductor tests (#121994) - Improves - exception typing. Remove NOQAs (#125535) - generate_extern_kernel_out's signature (#123351) - logging (#122932) - the optimus scuba log (#122361) - Misc refactors (#126945) - Only print bw result for the first time we benchmark a kernel (#123568) - Refactor - MultiOutput. 
codegen_list_tuple_access to use subclass type checks (#121662) - indexing() into triton.py - part of `IterationRangesEntry` into triton.py (#126944) - some fallback op util functions (#126182) - is_legacy_abi_kernel and abi_compatible_kernel (#121523) - Renamed `mutationlayout`/`aliasedlayout` (#122474) - Unify val_to_arg_str and val_to_cpp_arg_str (#126916) - Update - `DTYPE_TO_CPP` mapping (#126915) - `opinfo` tests (flattened diff) (#124657) - tensor_converter util functions (#121743) - triton pin (#121268) - Use C++17 helper templates (#122607) - delete inductor `config.trace.compile_profile` (#127143) - log pt2 config dict to signpost from inductor post grad (#124593) - refactor - code to use define_kernel and call_kernel similar to CUDA (#123704) - device dispatch inside `do_bench` (#125736) ### MPS - Reorganize logics and naming in copy.mm (#123310) - Pointer to the non-zero limit ticket#124244 - Introduce MetalShaderLibrary class (#125550) - Include MPSGraphVenturaOps.h for complex types on macOS12 (#127859) - Define _compute_tolerances (#121754) ### XPU - Support general device runtime Interface for Intel GPU (#121883) - Enable triton installation for Intel GPU (#122254) - Reuse inductor test for Intel GPU (#122866, #124147) - Update Intel triton for Pytorch 2.4 release (#128615) - Support reduction split for Intel GPU (#129337) - call empty_cache for dynamo tests (#126377) - Support xpu autocast policy (#124052) ## Security ### Python frontend - warning for weights_only (#129239, #129396, #129509) (see Deprecations section) ### Release Engineering - Vulnerability related updates of packages used in CI (#124614, #124675, #124976, #124983, #125698, #126805, #126989)

PyTorch 2.3.1 Release, bug fix release (2024-06-05)

This release is meant to fix the following issues (regressions / silent correctness):
### Torch.compile:
- Remove runtime dependency on JAX/XLA when importing ``torch._dynamo`` (https://github.com/pytorch/pytorch/pull/124634)
- Hide ``Plan failed with a cudnnException`` warning (https://github.com/pytorch/pytorch/pull/125790)
- Fix CUDA memory leak  (https://github.com/pytorch/pytorch/pull/124238) (https://github.com/pytorch/pytorch/pull/120756)
### Distributed:
- Fix the ``format_utils`` executable, which was running as a no-op (https://github.com/pytorch/pytorch/pull/123407)
- Fix regression with ``device_mesh`` in 2.3.0 during initialization causing memory spikes (https://github.com/pytorch/pytorch/pull/124780)
- Fix crash of ``FSDP + DTensor`` with ``ShardingStrategy.SHARD_GRAD_OP`` (https://github.com/pytorch/pytorch/pull/123617)
- Fix failure with distributed checkpointing + FSDP if at least one forward/backward pass has not been run (https://github.com/pytorch/pytorch/pull/121544) (https://github.com/pytorch/pytorch/pull/127069)
- Fix error with distributed checkpointing + FSDP when ``use_orig_params = False`` and activation checkpointing are used (https://github.com/pytorch/pytorch/pull/124698) (https://github.com/pytorch/pytorch/pull/126935)
- Fix `set_model_state_dict` errors on compiled modules with non-persistent buffers when using distributed checkpointing (https://github.com/pytorch/pytorch/pull/125336) (https://github.com/pytorch/pytorch/pull/125337)
### MPS:
- Fix data corruption when copying large (>4GiB) tensors (https://github.com/pytorch/pytorch/pull/124635)
- Fix ``Tensor.abs()`` for complex (https://github.com/pytorch/pytorch/pull/125662)
### Packaging:
- Fix UTF-8 encoding on Windows ``.pyi`` files (https://github.com/pytorch/pytorch/pull/124932)
- Fix ``import torch`` failure when wheel is installed for a single user on Windows (https://github.com/pytorch/pytorch/pull/125684)
- Fix compatibility with torchdata 0.7.1 (https://github.com/pytorch/pytorch/pull/122616)
- Fix aarch64 docker publishing to https://ghcr.io (https://github.com/pytorch/pytorch/pull/125617)
- Fix performance regression on aarch64 Linux (https://github.com/pytorch/builder/pull/1803)
### Other:
- Fix DeepSpeed transformer extension build on ROCm (https://github.com/pytorch/pytorch/pull/121030)
- Fix kernel crash on ``tensor.dtype.to_complex()`` after ~100 calls in an IPython kernel (https://github.com/pytorch/pytorch/pull/125154)

Release tracker https://github.com/pytorch/pytorch/issues/125425 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.3: User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed (2024-04-24)

# PyTorch 2.3 Release notes



* Highlights
* Backwards Incompatible Changes
* Deprecations
* New Features
* Improvements
* Bug fixes
* Performance
* Documentation


# Highlights

We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing users to migrate their own Triton kernels from eager without experiencing performance complications or graph breaks (a short sketch appears after the feature table below). In addition, Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions, and has been validated on training runs for 100B parameter models.

This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.


| Stable | Beta | Prototype | Performance Improvements |
| --- | --- | --- | --- |
|  | User-defined Triton kernels in torch.compile | torch.export adds new API to specify dynamic_shapes | Weight-Only-Quantization introduced into Inductor CPU backend |
|  | Tensor parallelism within PyTorch Distributed | Asynchronous checkpoint generation |  |
|  | Support for semi-structured sparsity |  |  |

*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).
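To ground the user-defined Triton kernel support highlighted above, here is a minimal sketch (assuming a CUDA device and a working Triton installation) of compiling a function that launches a hand-written Triton kernel; the `add_kernel` itself is the standard vector-add tutorial kernel, not an API introduced by this release.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Standard element-wise add: each program instance handles one block.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@torch.compile
def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    # The kernel launch is traced by torch.compile instead of causing a graph break.
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
torch.testing.assert_close(add(a, b), a + b)
```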
# Tracked Regressions

### **torch.compile on MacOS is considered unstable for 2.3 as there are known cases where it will hang ([#124497](https://github.com/pytorch/pytorch/issues/124497))**

### **torch.compile imports many unrelated packages when it is invoked ([#123954](https://github.com/pytorch/pytorch/issues/123954))**

This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.

### **torch.compile is not supported on Python 3.12 ([#120233](https://github.com/pytorch/pytorch/issues/120233))**

PyTorch support for Python 3.12 in general is considered experimental. Please use a Python version between 3.8 and 3.11 instead. This is an existing issue since PyTorch 2.2.

# Backwards Incompatible Changes

### **Change default torch_function behavior to be disabled when torch_dispatch is defined (#120632)**

Defining a subclass with a `__torch_dispatch__` entry will now automatically disable `__torch_function__`. This aligns better with all the use cases we have observed for subclasses. The main behavior change is that the result of the `__torch_dispatch__` handler no longer goes through the default `__torch_function__` handler, which wrapped it into the current subclass. This in particular allows your subclass to return a plain Tensor or another subclass from any op. The original behavior can be recovered by adding the following to your Tensor subclass:

```python
@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
    return super().__torch_function__(func, types, args, kwargs)
```

### **ProcessGroupNCCL removes multi-device-per-thread support from C++ level (#119099, #118674)**

* Python level support was removed in 2.2.
* To simplify ProcessGroupNCCL's code, we removed support for multiple CUDA devices per thread. To our knowledge, this is not an active use case, but it added a large burden to our codebase. If you are relying on this, there is no workaround other than rewriting your PyTorch program to use one device per process or one device per thread (multiple threads per process are still supported).

### **Removes `no_dist` and `coordinator_rank` from public DCP APIs (#121317)**

As part of an overall effort to simplify our public-facing APIs for Distributed Checkpointing, we've decided to deprecate usage of the `coordinator_rank` and `no_dist` parameters under `torch.distributed.checkpoint`. In our opinion, these parameters can lead to confusion about their intended effect and have limited value to begin with. One concrete example is https://github.com/pytorch/pytorch/issues/118337, where there is ambiguity in which process group is referenced by the coordinator rank. In the case of the `no_dist` parameter, we consider this an implementation detail which should be hidden from the user. Starting in this release, `no_dist` is inferred from the initialized state of the process group: collectives are used if a process group is initialized, and are not used otherwise.
**2.2**

```python
# Version 2.2.2
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
```

**2.3**

```python
# Version 2.3.0
# no_dist is assumed from the process group state, and rank 0 is always the coordinator.
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
```
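As an illustration of the new inference behavior, here is a minimal single-process sketch that never initializes a process group, so DCP assumes non-distributed mode; the toy model and the `/tmp/model_ckpt` path are made up for the example.

```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(8, 4)

# No torch.distributed.init_process_group() call: starting with 2.3, DCP
# infers non-distributed mode from the uninitialized process-group state.
dcp.save(state_dict={"model": model.state_dict()}, checkpoint_id="/tmp/model_ckpt")

# Loading happens in place into the provided state_dict.
to_load = {"model": model.state_dict()}
dcp.load(state_dict=to_load, checkpoint_id="/tmp/model_ckpt")
model.load_state_dict(to_load["model"])
```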
### **Remove deprecated tp_mesh_dim arg (#121432)**

Starting from PyTorch 2.3, the `parallelize_module` API only accepts a DeviceMesh (the `tp_mesh_dim` argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use `mesh_nd["tp"]` to obtain a 1-D DeviceMesh for tensor parallelism.

## **torch.export**

* Users must pass in an nn.Module to torch.export.export. The reason is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function, such as how we guarantee that every call_function node has an nn_module_stack populated, and we offer ways to access the state_dict/parameters/buffers of the exported program. We'd like torch.export to offer strong invariants: the value proposition of export is that you can trade flexibility for stronger guarantees about your model. (#117528)
* Removed constraints in favor of dynamic_shapes (#117573, #117917, #117916, #120981, #120979)
* ExportedProgram is no longer a callable. Instead, users will need to use .module() to call the ExportedProgram. This is to prevent users from treating ExportedPrograms as torch.nn.Modules, as we do not plan to support all features that torch.nn.Modules have, like hooks. Instead, users can create a proper torch.nn.Module through exported_program.module() and use that as a callable (see the sketch after the quantization example below). (#120019, #118425, #119105)
* Remove equality_constraints from ExportedProgram as it is not used or useful anymore. Dimensions with equal constraints will now have the same symbol. (#116979)
* Remove torch._export.export in favor of torch.export.export (#119095)
* Remove CallSpec (#117671)

### **Enable fold_quantize by default in PT2 Export Quantization (#118701, #118605, #119425, #117797)**

Previously, the PT2 Export Quantization flow did not generate quantized weights by default, but instead kept fp32 weights in the quantized model in this pattern: `fp32 weight -> q -> dq -> linear`. With `fold_quantize=True` now the default, `convert_pt2e` produces a graph with quantized weights in the pattern `int8 weight -> dq -> linear`, and users will see a reduction in model size.
**2.2**

```python
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)
```

**2.3**

```python
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)
```
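Returning to the torch.export changes listed above, here is a minimal sketch of exporting an `nn.Module` and running the result through `.module()`; the tiny `MLP` module and its input shapes are made up for the example.

```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

example_inputs = (torch.randn(2, 4),)

# The top-level object passed to torch.export.export must be an nn.Module.
ep = torch.export.export(MLP(), example_inputs)

# ExportedProgram is no longer callable; materialize an nn.Module first.
out = ep.module()(*example_inputs)
```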
### **Remove deprecated torch.jit.quantized APIs (#118406)**

All functions and classes under `torch.jit.quantized` will now raise an error if called/instantiated. This API has long been deprecated in favor of `torch.ao.nn.quantized`.
**2.2**

```python
# torch.jit.quantized APIs
torch.jit.quantized.quantize_rnn_cell_modules
torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules
torch.jit.quantized.QuantizedLinear
torch.jit.QuantizedLinearFP16
torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell
```

**2.3**

```python
# Corresponding torch.ao.quantization APIs
torch.ao.nn.quantized.dynamic.RNNCell
torch.ao.quantization.quantize_dynamic
torch.ao.nn.quantized.dynamic.Linear
torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM
```
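As a rough illustration of the migration path, the sketch below dynamically quantizes an LSTM-plus-Linear model with `torch.ao.quantization.quantize_dynamic`, which covers the use cases of the removed `torch.jit.quantized.quantize_*_modules` helpers; the `TinyRNN` module is made up for the example.

```python
import torch
from torch.ao.quantization import quantize_dynamic

class TinyRNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        self.head = torch.nn.Linear(32, 4)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = TinyRNN().eval()

# Dynamically quantize the LSTM and Linear submodules to int8.
qmodel = quantize_dynamic(model, qconfig_spec={torch.nn.LSTM, torch.nn.Linear}, dtype=torch.qint8)
out = qmodel(torch.randn(2, 8, 16))
```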
### **Remove deprecated fbgemm operators (#112153)**

`TorchScript` models that were exported with the deprecated `torch.jit.quantized` API will no longer be loadable, as the required internal operators have been removed. Please re-export your models using the newer `torch.ao.quantization` API instead.

## **Other**

* Make `List::get() const` match `List::operator[]() const` (#117568)
* Delete `C10_IS_TRIVIALLY_COPYABLE` (#120120)
* Fix synchronization behavior for copies with type change (#121341)

# Deprecations

### **torch.autograd.Function: Using the torch.autograd.function.traceable decorator and getting/setting torch.autograd.Function's is_traceable is now deprecated (#121413)**

These decorators were previously marked for internal use only. They will be removed in version 2.4.

### **torch.utils.checkpoint: not passing use_reentrant explicitly to activation checkpoint and checkpoint_sequential is deprecated (#116710)**

The use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True (a short sketch follows the sdpa_kernel example below). Refer to the docs for more details on the differences between the two variants. (Note that this was already deprecated in a previous release. In this version, we improve the deprecation message.)

### **Deprecated torch.backends.cuda.sdp_kernel and replaced it with torch.nn.attention.sdpa_kernel (#114689)**

This PR deprecates `torch.backends.cuda.sdp_kernel`; users can now use `torch.nn.attention.sdpa_kernel` instead. The old code will raise the following warning: `FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.`
**2.2**

```python
import torch
from torch.backends.cuda import sdp_kernel

with sdp_kernel(enable_math=False, enable_flash=False, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(...)
```

**2.3**

```python
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(backends=[SDPBackend.EFFICIENT_ATTENTION]):
    torch.nn.functional.scaled_dot_product_attention(...)
```
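Relatedly, for the `torch.utils.checkpoint` deprecation above, here is a minimal sketch of passing `use_reentrant` explicitly; the two small linear layers are made up for the example.

```python
import torch
from torch.utils.checkpoint import checkpoint

layer1 = torch.nn.Linear(16, 16)
layer2 = torch.nn.Linear(16, 16)

x = torch.randn(4, 16, requires_grad=True)

# Pass use_reentrant explicitly; use_reentrant=False is the recommended variant.
y = checkpoint(lambda t: layer2(torch.relu(layer1(t))), x, use_reentrant=False)
y.sum().backward()
```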
## Distributed API * [c10d] Deprecate Work.result() (#117565) * Deprecate torch.distributed.pipeline in favor of the [PiPPy](https://github.com/pytorch/PiPPy) library (#121464) * Add deprecation msg for `NO_SHARD` (#119553) * Add composable API `fully_shard` deprecation warning (#120929) * [DTensor] Change distribute_module input/output_fn to accept module, deprecating the input/output_fn that does not accept a module (#120895) ## Releng * Removal of macOS x86 binaries build jobs following their deprecation in 2.2 (#116726) * CircleCI removed (#115701) # New Features ## Autograd API * Add basic autograd TORCH_LOGS support. (#115438) * Autograd attaches logging hooks only in debug level (#116522) * Update torch.autograd.graph logging to not print out grad_output (#116523) * Added `"any"` mode to `register_multi_grad_hook` (#117984) ## CUDA * Add bfloat16 CUDA support for smoothl1loss (#116933), RNN (#116927), gamma unary functions (#116929), binomial distribution (#116932), and multinomial (#116951) * Add float16 support to CUDA logaddexp2 (#116948) * Add bfloat16 + fp16 support to fractional_max_pool for CUDA and CPU (#116950) * Make torch.cuda.has_magma a build time check (#116299) ## Distributed API * C10d: * Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190) * Flight recorder for debugging failed collectives: (#119837, #116905, #114817, #118044, #115139, #118046, #119249, #118047 , #118142, #115771, #120063, #120331, #120544, #120262, #120724, #120270, #120893, #120502) * explicitly abort communicators in destroy_process_group call (#119250) * ProcessGroupGloo::allgather_into_tensor_coalesced (#118910) * ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911) * Create a python c10d API _set_pg_timeout to set timeout (#115453) * Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001) (#116125) * Add complex support for P2P (#121240) * * Distributed Checkpointing (DCP): * Adds Checkpointer Wrapper for DCP [3/N] (#114603) * Adds async save, makes checkpointer private (#116293) * Makes async_save public (#121325) * Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816) * Adds support for meta tensor loading for DCP.load_state_dict() (#113319) * [state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378) * Add tests to demonstrate DCP checkpoint conversion (#117773) * Adds utility for converting dcp to torch save format (#119814), converting torch save to dcp (#119815) * Add distributed checkpoint support for custom device (#120201) * Add a profiler function for benchmarking save and load (#116007) * Enables load balancing duplicates in DCP (#116469) * Enable filesystem/fsspec auto detection (#118888) * DTensor: * Add op support for aten.gather.default (#118513) * Add op support for nll_loss_backward (#119256) * Implement scaled dot product attention (flash-attention, forward only) (#120298) * Add async_op option to redistribute and some refactor (#121477) * [TP] add support for loss parallel (#119877) * [TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295) * allow OpStrategy to represent ops whose return type is a tuple (#115682) * Add op support for nll_loss_forward (#118917) * Supported `foreach=False` for `clip_grad_norm_` (#120238) * Supported `foreach=True` for `clip_grad_norm_` (#120910) * Add layer norm backward support (#115683) * Add Silu operator support 
(#118702) * Enable adadelta foreach optimizer (#115564) * Enable radam foreach optimizer (#115566) * Enable Adamax foreach optimizer (#119850) * Make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294) * [DeviceMesh] Allow 1d slice from 1d mesh (#118895) * Add a implicit replication flag (#115297) * Adds tool to visualize sharding (#114307) * Initialized RNG tracker if needed (#121328) * get CommsDebugMode to work with DTensor (#118769) * [DTensor][XLA] support XLA backend in distribute_module API (#121355) * DDP * Use compiled_autograd to trace DDP backward allreduce (#110662) * Functional Collectives * Support broadcast in native funcol (#119229) * Allow using native c10d_functional via _functional_collectives (#113057) * Implements `permute_tensor` in functional collectives (#115078) * Support tracing native functional collective via python APIs (#119103) * Directly import DeviceMesh to avoid circular dependency (#115649) * Port all_to_all_single to native c10d_functional (#113438) * Make native c10d_functional ops work with AOTInductor (#113735) * fix an issue where mutation on views fails in inductor (#118333) * Run device mesh tests with native funcol enabled (#118437) * Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042 ) * Make tests using CommDebugMode work for both legacy and native funcol (#120070) * Temporarily support ranks + tag as pg identifier in native funcol (#120226) * don't import torchdynamo when running torchdeploy (#120900) * Preliminary DeviceMesh + native c10d functional integration (#118423) * Change native funcol inductor tests to use fake pg (#119104) * Prepare test_inductor_collectives.py for native funcol migration (#120025) * Prepare test_dtensor.py for native funcol migration (#120043) * Disable GroupRegistry's thread isolation by default (#121457) ## FX * Experimental non-strict mode (#114658) * Add _assert_scalar and teach Inductor to codegen it (#114148) * Add symbol_guard_limit_before_specialize (#119347) * Add _assert_scalar and teach Inductor to codegen it (#114148) ## torch.compile ### Dynamo * Mutable variable trackers - TorchDynamo variable trackers are now mutable. Therefore, we can just mutate the variable tracker during symbolic tracing, instead of creating a new variable tracker. This improves compilation time for models with frequent mutations. * (beta feature) Improved automatic deletion of compilation units - TorchDynamo caches the compilation units internally on every compilation event. Earlier, this cache was not cleared automatically when the original nn module went out of scope, holding on to the GPU memory. Therefore, users needed to call `torch._dynamo.reset()` to clear the cache. Now, we automatically clear the cache as soon as the nn module goes out of scope. * (beta feature) torch.compile works with User defined triton kernels - Allows for PyTorch code that contains a triton kernel to be executed natively using torch.compile. This will allow users to migrate code containing triton kernels from eager PyTorch to using torch.compile without running into performance complications or graph breaks. A good entry point to how it works is on https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py * (beta feature) Improved tracing rules infrastructure - Added a central infrastructure to control whether to inline or skip TorchDynamo tracing. 
* Create a sentinel file for each dynamo test skips (Part 1) (#120500) * Windows Dynamo Error Removal CI Check (#115969) * Add TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED (#118750) * feat: Add min, max ranges to mark_dynamic API (#119737) * Add support for dynamic shapes in round (#115259) * Add support for torch.cond in vmap (#114523) * Added dynamic shapes support for math trigo ops: sin(h), cos(h), tan(h) ... (#114866) * Add dynamo support for operator.abs (#117442) * Add support for `operator.truth` (#117463) * Add support for compiling SDPAParams (#117207) * Add more dynamo call_methods and getattr support or Placement (#117733) * Add size(), get_coordinate() support for DeviceMesh in dynamo (#117710) * Initial torchbind support in PT2 (#117697) * Add compilable foreach RAdam support (#117912) * [HigherOrderOp] support while_loop in dynamo (#116913) * Add Support for CausalBias to torch compile (#116071) * support dict.clear() (#119197) * Add support for labels to ttir analysis (#119836) * Add jacrev support in torch.compile (#121146) * Introduce size oblivious guards (#118579) ### Inductor * Some basic support for uint{16,32,64} codegen in CPU inductor (#116810) * Enable for MacOS (#118076) * Add torch.cond support to JIT Inductor (#119759) and AOT Inductor (#121120) * Double buffering for Weights (#114446) * Add Runtime Constant-folding for AOTInductor (#118765) * Intel GPU backend Upstream: Generalize device-bias code in code generation (#116020), Register and add Intel GPU Inductor backend (#116330) ### torch.export * Introduced pre-dispatch IR through `torch.export._trace._export(..., predispatch=True)`. This returns a graph that contains ATen operators at a higher level (aten.linear.default will not become decomposed) and are also safe for training (#117278) * Support higher order op functionalization in predispatch IR (#115314) * Ignore autograd ops for predispatch export (#116527) * Introduced [non-strict export](https://pytorch.org/docs/main/export.html#non-strict-export) (#118607, #118609, #119297, #119446, #119602) * Transformed global state mutating operators torch._C.set_grad_enabled into a higher order operator (#119732, #119736, #119810, #119913, #119915) * Added while_loop support (#116823) * A very preliminary torchbind object tracing support was added. 
(#116985, #117978, #117979, #118158, #118684) ## Linalg * Add HIPSolver backend to linalg.eigh (#115177) * Add MKLDNN backend to matmul for bfloat32 (#116015) * Support torch.mm with conjugate transposed inputs (#117238) ## MPS * Add complex_out to MPS backend (#110851) * Add Conv3D support for MPS (#114183) * Enable select/[broad]cast ops for complex dtypes (#115727) * Add `torch.fft.` support (#119670) * Add support for 64-bit index operations (#116942) ## torch.nn API * Added `swap_tensors` path to `nn.Module._apply` (#117167) * Integrated `swap_tensors` into `nn.Module.load_state_dict` ((#117913) * Added `assign` argument to `torch.Tensor.module_load` (#121158) * Added an attention bias subclass for a lower right causal masking (#114823) * Added config to disable `TransformerEncoder`/`MultiHeadAttention` fastpath (#112212) ## Profiler * Enable profiler initialization to enable on-demand profiling for CPU only mode (#118320) * Add execution_trace_observer as an optional argument to Profiler API (#119912) ## Python API * Add new torch.utils.swap_tensors function that can be used to exchange Tensor while preserving references from other objects (#111747) * Add unsigned integer dtypes with full interop with third party (#116594, #116805, #116806, #116807, #116808, #116803, #116804) * Add {Untyped,Typed}Storage.resizable() to check if a Storage can be resized (#119286) * add Half support for flash attention (#119247) ## Sparse * Add bfloat16 support to torch.sparse.addmm for CPU (#115535) * Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499) * Add values backward support for sparse CSR, CSC, BSR, and BSC tensors (#115586) * Add batched sparse CSR/CSC/BSR/BSC to sparse COO conversion support (#116206) * Add meta device support to sparse compressed tensors (#120498) * Add sparse compressed meta tensor support (#120707) * Add sparse compressed fake tensor support (#120920) * [semi-structured] Enable fp32 support, separate sparse and dense constraints (#115550) * Support `add(sparse_compressed, dense)` (#115432) * Support csc layout for add sparse/dense. (#115433) * Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296) ## Vulkan * Added Vulkan support for 1D for the following ops: Convolution (#117780, #118660, #118833, #118834, #118835) Linear (#118690) ## XPU * Intel GPU Runtime Upstreaming for Device (#116019, #116833, #116850, #116869, #117611, #117619), Event (#117734), Device Allocator (#118091), Guard (#118523), and Generator (#118528, #118613) ## Other * Enable compiled Adam in the benchmarks (#116093) * [ROCm] TunableOp (#114894) * [pytree] Add access api (#117771) # Improvements ## Autograd API * Out-variant of ops for which autograd also does not have a formula now produce an improved error message when the out= argument is a leaf tensor (#121089) * Support custom autograd Function forward AD return non-Tensor in forward (#118234) * In-place foreach to returns the input list as-is instead of None (#121405) ## Composability * FakeTensors and meta tensors are used to perform shape propagating when tracing out a graph in torch.compile. There were a number of op coverage improvements this release: * New metas * _foreach_norm (#119927) * _upsample_bicubic2d_aa (#117347) * Fixes to metas * _efficient_attention_forward for jagged inputs (#118657) * _flash_attention_forward() (#119812) * efficient_attention_forward fix for NT inputs (#120594) * We have python “reference” decompositions for many aten operators. 
These are used during the tracing step of torch.compile in a few ways: sometimes they are used to directly decompose operators in the captured graph. Other times, they are used as an alternative to a shape-propagation rule for an operator. There were several improvements to operator coverage in this release * New decomps * reflection_pad{1, 2, 3}d (#115100) * replication_pad (#115113) * torch.block_diag (#115096) * torch.take (#114813) * rsub (#118288) * roll (#119857) * frexp (#119217) * pad_sequence (#116285) * linalg.cross (#119809) * Fixes to some decomps: * SDPA decomp: actually use attn_mask (#117579) * aten.diag_embed (#120549) * Isin (#120821) * linalg norm (#120993) * Im2col: Remove opmath cast (#121363) * Index_copy: fix 0-dim Index (#117065) * Rrelu_with_noise: add default parameters (#117141) * _refs.linalg.svd: don’t check is_conj() (#117972) * Weight norm interface decomp: fix missing default dim param (#118762) * SDPA decomposition: add tests for different dtypes (#119239) * _to_copy fix (#119868) * torch.Tensor decomp fix to support sequences of tensors (#120872) * Avoid mkldnn ops appearing in graph in some cases (#115448) * Decomps relevant to the Core ATen opset defined [here](https://pytorch.org/executorch/stable/ir-ops-set-definition.html). * Add decomp pixel_shuffle/unshuffle (#118239), (#118921), (#120092) * Modify SDPA decomposition to decompose _scaled_dot_product_flash_attention_for_cpu (#117097) Dynamic shapes: * Improved unbacked SymInt support * Support symbolic min/max on unbacked SymInt (#118953) * Rewrite maybe_reduce more carefully for unbacked SymInt (#119562) * Size oblivious test for slice optimization (#119625) * If data dependent, check if guard_size_oblivious would fix problem and report if so (#121011) * Prevent unbacked symbol reallocation by forcing unification for unbacked symbol def sites (#114368) * Improved dynamic shapes support * Always accept 0-d scalar tensors as int, even if **index** fails (#117451) * SymIntify prod_backward (#120776) * Add is_integer to SymFloat (#114703) * Dedupe symbolic shapes in tracing (#116158) * [Dynamic] Fix dynamic shape size inspection bug (#120341) * Docs / debugging tools * Add basic reference documentation for symbolic_shapes.py (#118997) * Rename unbacked SymInt prefix to u (#117859) * Augment create_symbol with user/infra backtrace fragment (#118215) ## CPP API * Add `TensorIteratorConfig::add_const_input` to avoid COW materialize (#118053) * Reserve sizes in c10::VaryingShape::concrete_sizes(), c10::TensorType::computeStrideProps() (#119189) ## CUDA * Faster gc_count update for CUDACachingAllocator (and avoid nullptr de… (#117064) * Reduce register usage of fused adam(w) (#118361) * Improve CUDACachingAllocator lock contention (#118550) * Back scalar value to pinned memory for .item() (#119202) * Avoid COW materialization for TensorInfo with const type (#119502) * [CUDA graphs] Pool argument for make_graphed_callables (#121475) * [CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050) ## Distributed API * c10d: * ProcessGroup/NCCL logging improvements: (#115801, #116059, #116060, #116520, #117291, #117868, #118335, #113238, #118455, #118582, #118924, #116489 ) * NCCL ProcessGroup watchdog improvements: (#115403, #115577, #116702, #116267, #116717, #116545, #117312, #117093, #117682, #117297, #118016, #115770, #116661, #118344, #117699, #117168, #117738, #121132 * Use TCPStore to record NCCL timeout and dump debug info (#115226) * Extend NCCL 
communicator splitting to more use cases (#114916) * Let all_reduce_coalesced accept one tensor as well (#115650) * Add stream info during nccl comm abort call (#116076) * Pass group global rank information to NCCL PG (#114736) * Store PG global rank information in tracing logs (#115730) * Only open NCCL dump pipe file once per process (#115798) * Dynamo + collectives: allreduce remap (#115950), all_gather_into_tensor remap (#117224) * Expose check method to Python for store via pybind (#116144) * Refactor all_reduce variants as private methods (#120855) * To make ProcessGroupNCCL to use globalStore for coordination (#117075) * Allow nonblocking wrap of ncclCommInitRankConfig (#118256) * Do not print NCCL_DEBUG before NCCL init (#117328) * Update the work progress of PG periodically (#120438) * Add NCCL work sequence number to work info (#120596) * [UCC] Retain CUDA context in progress_loop (#121446) * Change watchdog log from "NCCL" to "Process group" (#118121) * [IntraNodeComm] accept P2P buffer size as constructor argument (#120856) * FSDP: * [torch.compile] FSDP changes (#115497) * Remove unused flat_param_part_view (#117082) * Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184) * Fix deprecation warning on typed storage (#116714) * Pass DTensor shape/stride during tensor unflatten in 2D (#117340) * Cloned unsharded tensor slice in optim state dict load (#117261) * Drop all gather stats to debug not warning (#117669) * Idempotent reshard (#117997) * Removed `.detach` in `clip_grad_norm_` (#120612) * Vlean up unwanted _fsdp_wrapped_module FQNs (#120600) * Distributed Checkpointing (DCP): * Automatically set `no_dist` if distributed is unavailable (#119813) * Let distributed_state_dict filter out the compiler prefix (#119830) * Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716) * Improve the readability of filesystem and fsspec filesystem (#116246) * Uses Serial Loader for DCP.save when more then one thread is used. (#118114) * Skip log line if no tensors were dedupped (DCP) (#119742) * Removes Checkpoint Wrapped Prefix from state dict fqns (#118119) * [DCP] Replaced `storage()` with `untyped_storage()` (#121538) * Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273) * Asserts CPU backend for async_save (#120241) * Allow users to save and load without creating storage reader and writer (#117772) * Passes process group to `_all_gather_keys` in `dcp.load` (#118301) * DTensor: * [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046) * Standardize tuple strategy handling for foreach ops (#120695) * Add mesh_dim_names to DeviceMesh __repr__ if it exists (#115579) * Remove assert to allow tensor sharding dimension < Shard(x).ndim (#115114) * Refactor sharding cost model to count for latency (#119897) * Make tensor_flatten more compatible for dynamo getattr (#118209) * DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. 
aten.clone) (#118667) * [DeviceMesh] Removed print of `self._dim_group_infos` (#118527) * Account for empty list when turning to OpStrategy (#115298) * [debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216) * switch softmax backward ops to OpStrategy (#119255) * Make DTensor `from_local` backward partial() to replicate() pass through (#115967) * Refactor partial redistribution logic (#113334) * Refactor redistribute and fix uneven sharding redistribution (#115525) * Switch softmax forward ops to OpStrategy (#117723) * Simplify outputs wrapping handling (#120297) * Relaxed `to_local` `requires_grad` warning (#118186) * Make local_shard_size_on_dim be staticmethod (#118078) * TorchElastic: * Support for overprovisioning in C10 based rendezvous (#117066) * [rendezvous] Add option to enable libuv for TCPStore based rendezvous backend (#118944) * Create root log directory by default (#121257) * Refactoring to support non-default logging strategy (#120691) * [Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942) * Refactor SubprocessHandler to separate module for easier subclass (#120373) ## torch.compile ### Dynamo * Fewer graph breaks for dicts - ConstDictVariableTracker now supports many more types of keys, earlier it was just string and ints. * Improved TorchDynamo reliability by fixing many bugs, increasing the pass rate of TorchDynamo wrapped PyTorch tests to 90%. * Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412) * Make torch._dynamo.mark_static work inside graph (#118962) * Expand dynamic dims support for traceable subclasses (#114311) * Extend auto_functionalized to support ops that return Tensors (#115135) * Improve support for dynamic shapes str.format and _assert (#115203) * make __lookup_backend return None when cache misses (#114766) * Make subclass type instances constants (like UserDefinedClasses) (#115323) * [HigherOrderOp] make MapHigherOrder use should_flatten_output=True (#115204) * Move the shape env symint cache to a symbol cache, better routing for subclass fakification [re-pr 115227] (#115396) * Remove always restore (#115317) * Remove replace_all and make VTs mutable (#113725) * Support torch function user objects (#111765) * Check tensor subclass when using torch.compile + SAC (#115960) * Ensure wrapping subclasses with `as_subclass` is supported (#116091) * Add CALL_FINALLY opcode (#116159) * Add a wrapper to transform a NumPy function into a PyTorch function (#114610) * [HigherOrderOp] set set_subgraph_inputs to flatten_manual for map (#115853) * Graphbreak when creating a map with unsupported keys (#116460) * Specialize SymNodeVariable when used as module index (#114377) * Impl. 
call_hasattr for BaseUserFunctionVariable (#116049) * [HigherOrderOp] change signature of map_impl (#117161) * Error if compiled nondeterministic backward called in deterministic mode (#114780) * Add hasattr support for TupleVariable (#117694) * Break on unsupported keys for dicts / elements for sets (#117630) * Implement set in terms of dict (#110524) * Add DictView variable tracker (#108420) * add common methods to DistributedVariable (#117590) * make ConstantSource propagate through built-in ops for TensorVariable (#117704) * Add compilable and capturable foreach adamax with tests (#117835) * Enhance torch.vmap support from inside torch.compile (#116050) * Install module globals per output_graph (#117998) * Remove optimizer.step patching for profiler hook (#115772) * add username in debug path (#117820) * avoid graph break on tensor.element_size() (#118229) * avoid graph break on torch.backends.cuda.matmul.allow_tf32 (#118236) * move torch._C._get_cublas_allow_tf32 to constant_fold_functions (#118342) * inline torch.jit._unwrap_optional (#118434) * support inference_mode with no arguments (#118427) * constant fold torch.cuda.get_device_properties to avoid graph break (#118422) * Faster empty LIST_LENGTH guard (#118542) * Expose dynamic_shapes api at multiple levels (#118695) * graph break on isinstance calls if we don't know the type (#118778) * Print the malformed guard when there's a guard error. (#117982) * Use SourcelesBuilder in BuiltinVariable (#118098) * [optim] Place guards on the args before assuming they exist (#117983) * Make variables in dict LazyTrackers (not lazily guarded yet) and avoid using DICT_KEYS guard (#117625) * Make dict guards amenable to the CSE pass (#118194) * Add Typing variable to possible dict keys (#118003) * Don't assume all subclasses of BaseUserFunctionVariable have a fn attribute (#118208) * Add functools.partial and UserDefinedFunction to dict keys (#118199) * [optim] Use the actual sources from the parameters when tracing "params" in an optimizer (#118535) * Optimize dict keys guard when all the keys are constant (#118855) * bypass graph break due to masking if inference mode (#119056) * decrease logging level for graph break in higher order op. (#119079) * Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST (#118979) * support comparing stream with constant (#119199) * Functools partial reconstruct (#118583) * Print the value of constants in **str** (#119276) * inlining into **iter** of user defined object (#119243) * Support kwargs for lazy module (#119445) * In dynamo tracing for index() use None as the default indicator for end and not -1 (#119151) * Capture untyped_storage().resize_() (#119647) * Respect autograd.Function + multiple save_for_backward calls (#117667) * Support attribute access on tensor subclasses without sources (#117666) * [functional_collectives] Add all_to_all_single, all_gather_list, reduce_scatter_list to dynamo remapping (#119683) * [Optimus] Log the optimus graph transformation to the scuba (#119745) * Update tracing rules for new cudnn functions (#120268) * [guards-cpp-refactor] WEAKREF_ALIVE guard (#120344), DictGuardManager (#120359) * derived dim (#118729) * Let torch dynamo inline torch.func.grad (#118407) * Teach dynamo about vjp (#119405) * Support module backwards hooks (#120685) * DICT_CONTAINS guard (#120673) * KeyValueDictGuardManager (#121147) * Re-dispatch `torch.Tensor.new` into `torch.Tensor.new_empty` method. 
(#121075) * guard on grads being `None` in compiled optimizers (#121291) * Use type check for also `is_not` (#113859) * Relax missing symbols runtime assert (#121339) * Add operator length hint support (#121495) * Improve Dynamo support for torch function and class methods in general (#121365) * Switch cudagraph backend to cudagraph trees (#121019) * Support _unsafe_set_version_counter (#121086) ### Inductor * [custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298) * Fix `argument unused during compilation` warning (#118077) * Add a decomposition for isin() (#115390) * Add no weight change version of fuse_parallel_linear (#115791) * Allow user to explicitly specify Device to run on (#117413) * Add torch._export.aot_load (#117610) * Support .item() in the ABI-compatible mode (#117989) * Add _scaled_dot_product_efficient_attention to C shim (#118169) * Support scalar to tensor in the ABI-compatible mode (#118024) * Replicate split_cat from torch IR to predispatch IR" (#118590) * Change the cpp wrapper codegen for sdpa (#120592) * Store OpOverload in ir.ExternKernel (#120629) * Use torchgen to generate C shim functions (#120513) * Update cpp wrapper codegen to use v2 C shim (#120714) * Reuse generated kernels between constant graph and main graph (#121564) * Update AOTI runner util (#116971) * Remove caching for compiled model.so (#117087) * Retrieve original FQNs for weights (#116157) * Enable Dequant Promotion when Linear input dimension size exceeds 2 (#113912) * Enable QLinear weight prepack when input dimension size exceeds 2 (#113928) * Make some improvements to FX graph caching (#117888) * Fuse parallel linear based on pre grad aten IR (#114776) * Fuse pointwise operators in the post grad (#114778) * Enable mkldnn op weight pre-packing on aarch64 (#115037) * Add sub and div pointwise ops to the post grad fusion (#115389) * [optimus] enable smart fusion (#115471) * Parameterize ir.Scan on combine_fn (#109132) * Add input numel assert for minimal arrayref interface (#113577) * Remove ArrayRefTensor::dtype (#113578) * Added non-integer expr support for floordiv in triton codegen (#115751) * Updated upsample_bilinear2d decomposition (#104182) * Consolidate usage of fp8 linears for inference models (#115808) * Add lowerings for reflection_pad{1, 3}d_backward (#115645) * Support Predispatch functionalization (#113728) * Support sym exprs in lowering constant promotion (#116196) * Add Support For Symbolic Shapes in Register_replacement, SDPA Pattern Matching (#115441) * Handle more edge cases in slice and slice_scatter (#117377) * Add statically_known_true utility for SymBool (#117359) * Allow explicit shutdown of the compile-worker pools (#117664) * Add runtime numeric check for pt2 Optimus in the pre grad pass (#115142) * Allow reinplacing before meta-only users (#117121) * Allow reinplacing functionalized scatter ops (#116899) * Handle cum{sum,prod} on zero-dim tensors (#117990) * Use an op counter to decide when to realize a kernel (#117030) * Remove follow_imports = skip from sympy (#118469) * Complete decomposition for aten.round (#118635) * Never reuse accumulated gradients' buffers (#119334) * Vectorization support for int32/int64 (#119001) * Update the compile options for CppPythonBindingsCodeCache (#119415) * Add CUDAEvent recording for constant folding to show up. 
(#119216) * Decompose torch.ops.higher_order.auto_functionalized in Inductor (#118673) * Support storage resizing (#119749) * Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662) * Handle aliases correctly in foreach (#119508) * Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823) * Simplify indexing when doing ModularIndexing + index propagation. (#119863) * Add split cat pattern to remove cat nodes (#115004) * Decompose memory bound mm (#120047) * [cond] make sure subgraphs in cond are decomposed according to current decomp table (#120366) * Use a dtype property in torch inductor nodes (#119227) * Add unbind node normalization (#120253) * Reinplace auto_functionalized (#120829) * Change the split cat log to debug (#120823) * Add decompostition for mm in backward (#120933) * Do not use warm_pool() if TorchTnT is used (#121047) * Triage the remaining fallbacks (#121312) * Enable ABI-compatible mode for cpp-wrapper JIT (#121309) * Change assertion throw to error message for const_run_impl call. (#121396) * Split predispatch pass into multiple passes (#121592) * Port remove_split_ops to PT2 pre-grad passes (#121674) * Replace lld with the default ld linker (#115478) * Emit static constexpr int array vars when possible (#112174) * Avoid aoti_torch_data_ptr calls for constants at inference time (#112405) * Use static_cast, not dynamic_cast (#112798) * Add minimal arrayref interface (#112800) * Add updaing constant buffer to active buffer. (#116001) * Add aoti_torch_view_dtype in C shim (#118705) * Support _embedding_bag in C shim (#118706) * Skip launching kernels with zero grid in AOTInductor when using backed symints (#118654) * Support copy_, _fft_c2c and view_as_real in C shim (#119125) * Migrate fuse_split_linear_add from dper_pass to AOTI based on predispatch IR (#118983) * Add C-shim for index_put (#116667) * Port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617) * [OAT] move matmul precision out of system info (#115242) * [OAT] toggle for forcing matmul precision matching (#115326) * Remove hashing of tensor data for constants (#115356) * [Triton] Replace triton.runtime.jit.get_cuda_stream with torch.cuda.c… (#115397) * De-duplicate triton helper functions (#115546) * Don't print disable_cudagraphs_reason when cudagraphs is disabled (#115489) * Implement a deduplist data structure for name to user tracking (#115609) * SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069) * Serve multistream graph captures from correct pool (#116199) * Add input shape check for quantized conv binary lowering (#115247) * Preserve strides of custom Triton kernel args (#116219) * Add ABI shim function for torch.scatter_reduce (#116700) * Use max sm clock when calculating device tflops (#116754) * Replace recursive stable_topological_sort() with iterative. 
(#116761) * Remove the float16 restriction for cpu cpp_wrapper (#116205) * Sort unbacked symbols before iterating on them (#116421) * Decompose bmm if batch2's last dim size is 1 and coordinate_descent_tuning is enabled (#116582) * Control the cpp_wrapper mode with an env variable (#116615) * Add shape checks to ExpandView (#113839) * Add remaining user check for qconv binary fusion (#115809) * add predispatch_pass to hold pass functions to be run when config.is_predispatch is true (#116788) * [ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329) * Disable pointwise_cat on CPU (#116313) * Don't access cluster_dims for too old version of triton (#117192) * Check nan/inf for graph inputs (#117189) * Iterative percolate tags (#117306) * Update JIT Inductor cpp wrapper entry function signature (#119280) * Make auto_functionalized HOP fallback in inductor (#117084) * Realize non-ReinterpretView Views in custom Triton kernel args (#117468) * Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929) * correctly retrieve the "shared" attribute from a Triton binary (#120666) * Add lowering for adaptive_max_pool2d (#120254) * Track constant's original_fqn mapping (#120524) * enable fp8 cast for inductor CPU (#117737) * Express y grid > 2^16 in terms of z grid (#121554) * Use codegen reference for buffer to string (#117838) * move disable_cudagraph_reason disabling after codecache is accessed (#117823) * Add new pattern matchers for SDPA (#113004) * sympy.Symbol is a subclass of sympy.Expr (#117857) * For View.create(x, sizes) call realize_input() instead of realize() when handling unbacked symints (#117013) * [ac][pattern matcher] Do not percolate tags beyond the inputs of matched portion (#118034) * Add lowering to special.bessel_j0 (2nd try) (#118565) * Dont fuse write into read if indexing differs (#118210) * Handle special values correctly in ir.Scan codegen (#118788) * Limit reductions into pointwise cat fusion (#118452) * Update pointwise concat heuristics (#118453) * Add lowering to special.bessel_j1 (#118992) * Support ProxyExecutor argument codegen for sympy.Expr (#119166) * Implementing missing magic methods on IR values. (#118933) * Support sympy.expr in user-defined Triton kernel grid fn (#119165) * Add lowering to special.modified_bessel_i0 (#118993) * Add split scan kernel (#117992) * Add lowerings to special functions (#119187) * Use list comprehension to initialize unused_views. (#119618) * Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324) * Recursivly unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867) * Replace generators with map. (#119818) * Reorder if check to avoid more expensive check. 
(#119817) * [scheduler] Use set for origin (#119861) * Allow padding mm/bmm/addmm in the presence of dynamic dims (#120073) * Enhance next_power_of_2 function (#120153) * Always allow 64 bit in next_power_of_2 (#120164) * Use two pass reduction for deterministic reduction order (#115620) * Remove redundant to_dtype in Fused Schedular Nodes (#118365) * Remove dependency of triton during inductor codegen (#120193) * Pass device_str for async_compile.triton function (#120202) * Colorization improvements for bandwidth profiler (#120343) * Disable masked load for non-fp data types (#120558) * Apply fx passes recursively to nested subgraphs (#120665) * Make triton_meta be part of user defined triton kernel cache (#120809) * Emit grid wrapper inlined with the user defined triton kernel (#120824) * Add lowering for fraction_max_pool2d (#120460) * Fix accuracy failure for a few models under freezing (#121054) * Add a decomposition for torch.put, 2. (#120179) * Skip foreach kernel for benchmark fusion (#121168) * Move JK check to on-demand (#121182) * Correctly read the cache key for remote cache (#121151) * Make configs hash part of remote cache key (#121152) * Fuse nodes with sizes (s0_s1_...,) and (s0, s1, s2, ...) (#120077) * Use indices for constants in triton_meta (#121427) * Skip welford combine on first reduciton loop iteration (#121488) * Changes to support newer triton pin (#121267) ### torch.export * Added effect token to export (#121424) * Preserve constant fqn (#120664) * Require pytree serialized_type_name (#120636) * Added 'is_lifted_tensor_constant' and 'get_lifted_tensor_constant' utils (#120546) * Use forward hooks to capture module signatures. (#120468) * Allow str inputs in non-strict tracing (#120536) * Support output types that are non tensors (#120804) * Allow None as the meta value for tensor output. (#116664) * Make spec comparison indifferent to fx collections (#118718) * Support non-tensor tuple hoo outputs (#119402) * Make control flow operators respect global decomp table (#120412) * Make balance_gradient preserved in export (#120332) * Ensure optional fields in the schema always have default value. 
(#121163) ## FX * Ignore ill-formed solution of reduce_inequalities (#117310) * Add an option to not retrace when doing op fusion (#118120) * [minimizer] Defined traverse (#118889) * [pytree] Properly register immutable collections (#120036) * Skip less replacements (#119570) * Refine value ranges on inequalities (#120800) * Remove dead get_shape_groups (#120813) * Simplify guards using info from previous guards (#121463) * Inspect get_attr nodes for _decline_if_input_dtype (#118760) * Optimize recursive_add_node in fx splitter (#117969) * Slightly faster FX graph iterator (#121611) * More strong typed codegen for partial specialized code on boolean (#117201) * Add torch.fx.interpreter to uninteresting_files (#117460) * Report function name in stack trace annotations (#117459) * Cache dfs path in propose_partitions and re-use that later when trying to find cycles in the graph (#115943) * [pytree] update treespec `num_children` access (#116370) * [pytree] Allow tree_map_only to support predicate function as filter (#119974) * Register torch.return_types in torch.fx._pytree (#120027) * [pytree] Add key path api (#116786) * [pytree] Reuse `flatten_fn` in `flatten_with_keys_fn` to ensure consistency (#117656) ## JIT * Release the GIL in serialization when it is safe to do so (#120818) * Improve support for boolean inputs to operators in TorchScript (#113835) ## NestedTensors * Support ragged_idx != 1 on aten::is_same_size, aten::_to_copy (#118442) * view: basic support for ragged_idx != 1 and _unsafe_view (#118317) * Support nested tensor in check_trace (#121039) * Add is_nested_int() (#119975) * Rename singleton int to nested int (#119661) ## Linalg * Avoid COW materialization in `at::parallel_for/parallel_reduce` (#120455), TensorAccessors with const type (#119501), and input materialization in more forward ops (#121070) * [executorch] Run llama in xplat (#118831) ## MPS * Add MacOS 14 runtime check (#115512) * Add support for `MPSDataTypeComplexFloat[16|32]` (#115513) * Fix `sum` and `prod` for complex types (#115554) * Add support for `MPSDataTypeComplexFloat[16|32]` (#115513) * Fix `sum` and `prod` for complex types (#115554) * Enable `torch.rand[n]` for complex types (#115514) * Implement aten::upsample_linear1d on mps (#115031) * Speedup addmm (#116548) * Enable `bfloat16` support on MacOS 14 (#119641) * Add naive std_mean implementation (#119777) * Implement aten::upsample_linear1d on mps (#115031) * Use dispatch with rethrow for indexing (#116903) * Add function to materialize COW storages (#117053) * Make addmm support empty matmul (#117223) ## torch.nn API * Add python and C++ support for `LPPool3d` (#114199) * Add compatibility with channels_last_3d for conv3d (#114790) * Add Half support for interpolate operators on CPU (#105648) * Add `nn.Module.to_empty()` suggestion in the error message (#119353) * Explicitly set `nn.Module.set_extra_state` return type to None (#120161) * Updated `nn.Module._apply` to not gate on `should_use_set_data` when `swap_tensors` is set (#120659) * Add Half support for AvgPool2d on CPU (#109578) * Fix `AdaptiveAvgPool1D` to account for shmem limit for certain input sizes (#115231) * Add back initial Flash Attention support on ROCM (#115981) * Add 64-bit indexing for CUDA `avg_pool_backward` (#114193) * Add Half support for `layer_norm` on CPU (#99590) * Add Half support for flash attention on CPU (#118368) ## ONNX * Introduce decomposition skips using custom operator (#117314) * Apply Modularizarion to ExportedProgram during ONNX Export (#119498) * 
Disable opmath type promotion for div (#119112) * Prevent instance_norm decomp for export (#120866) * Allow ONNXProgram.save to use torch.load(..., mmap=True) for large models (#117295) * Enable llama attention with dynamic shapes for onnxrt backend (#117009) * Add bfloat16 support for scaled_dot_product_attention (#117878) * Require Module to be passed to export (#117528) * Improve support to mmap for ONNXProgram.save (#117863) * Use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551) * Remove monkey-patch for torch.utils._rebuild_tensor (#120446) * Enable custom ONNX model transforms in `onnxrt` dynamo backend (#120854) * Add Float8 support to onnx exporter (#121281) * Add support to save safetensors checkpoint directly into onnx (#121001) * Update submodule onnx==1.16.0 (#123125) ## Optimizer * Allow torch.float64 scalars (in addition to torch.float32) for forloop + foreach implementations (#115841) * Add beta1 support to CyclicLR momentum (#113548) * Add guardrails preventing complex params in SparseAdam and add complex support for L-BFGS (#118161, #118184) * Add capturable API for the forloop/single tensor (foreach=False) implementation in Adamax, RAdam, and ASGD (#121183, #121260, #121264) ## Profiler * Clean up first line of element text for readability (#120245) * Add Total memory used after allocation in Trace View (#120339) * Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055) * Add CUDAAllocatorConfig details into snapshot metadata (#119404) * Log process group config information in GPU trace’s distributedInfo field (#119443) * Support GPU annotations for auto-trace jobs similar on-demand support (#114638) * Record nccl version in distributed info (#121044) * Add a function to allow adding preset user-defined metadata to traces (#121487) ## Python API * Enable eye on CPU for bfloat16 dtype (#116616) ## Quantization PT2 Export Quantization Flow: * Relax constraints on dtype and qscheme to allow for customizations (#116287) * Skip conv-bn folding when there are no batchnorm ops (#116440) * Add `move_exported_model_to_train` (#113492) * Allow users to override train/eval behavior (#119091) * Add convert callback to Observer module (#115001) * Fix _disallow_eval_train error message (#119694) * Add `model_is_exported` util function (#119726) * Relax `model_is_exported` input (#120720) * Add the operator of decomposed fake quant per channel (#121297) * Add error check for input_edge annotation in Quantizer (#121536) * Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548) XNNPACKQuantizer: * XNNPACKQuantizer skip inserting observers for non-float Tensors (#114999) * Add support for linear_relu in XNNPACKQuantizer (#117052) * Support custom qmin/qmax for activation and weight for XNNPACKQuantizer (#117305) * Fix module name filter for underscores (#119344) X86 CPU Inductor Backend: * Enable QLinear input with multi dims in x86 CPU inductor backend (#113733) * Add int8 linear op gelu for x86 CPU Inductor backend (#114852) * Add dynamic quantization config for x86 inductor backend (#115337) * Enable QConv2d with hardswish post op (#117487) * Add Hardswish Conv2d Unary Annotation (#117488) DTypes: * Add uint1 to uint7 dtypes (#117208) * Add float8 types to dtypes table (#117375) * Enable cat for cuda torch.bits types (#115044) Others: * Skip privateuse1's checkZeroPoints (#114117) * Support lowering for operator.matmul in fx graph mode quantization (#113954) * Add quantized gelu (#119935) * Update Quantizable LSTM to 
support QAT (#121448) ## Releng * Addition of linux cpu test for 3.12 (#117853) * Bazel CUDA tests timeout increased to 480s (#120443) * Update torchbench commit pin, add sam_fast benchmark (#121420) * Trigger a mergability check on ghstack prs (#115944) * Use matrix generate script for docker release workflows (#115949) * [CI] Addition of initial inductor cpu smoketest for performance (#116456) * [CI] Addition of python test skip logic for XPU (#117621) * [ROCm] upgrade CI to 6.0 (#119495) * Improved Dynamo testing convenience (#116173) * Improved the error message when a PR lacks the necessary approvals (#116161) * [CI] Addition of initial ci build test for XPU (#116100) * Addition of CPU inductor merge rule (#116679) ## ROCm * Add hipblaslt support (#114329) * Initial Flash Attention support on ROCM (#114309) * Make ATen-cpu cuda/rocm agnostic (#121082) * Autocast RNN Support (#121539) * Add Flash Attention support on ROCM (#121561) * Initial ir.Scan/aten.cumsum lowering support on ROCm (#119369) ## Other * Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219) * Explicitly error out if CuDNN older than 8.5 (#118235) * Set maximum supported version of Python as 3.12 (#119743) * Update cslt to 0.5.2.1 (#115988) * Update cutlass from 3.3.0 to 3.4.1 (#120434) * Preserve metadata for `MutableMapping` and `MutableSequence` in `pin_memory` and `collate_fn` (#120553) * Add inf norm support for _foreach_norm (#118441) * Expose aggressive_recomputation as an inductor config (#118943) * Disables denormal floating numbers on ARM CPU (#115184) * Enable TORCH_TRACE by default in all Tupperware like environments (#120915) * Upgrade submodule oneDNN to v3.3.6 (#122164) * Pin protobuf to 3.20.2 on macOS (#121918) * Make PyTorch compilable against upcoming Numpy-2.0 (#121880) * Upgrade submodule pybind to 2.12.0 (#122899) # Bug Fixes ## Autograd API * Fast gradcheck bug no longer ignores specified eps argument when recomputing in slow mode to produce the error message (#115634) * Properly handle retains_grad hook on the base when view of it is mutated (#117552) ## Composability * Fixed hash issue in `fx_graph_cse` graph pass (#119567) * Better support for fakeifying torch_dispatch tensor subclasses that are views (#118405) * Bugfix for where output of a compiled graph is a subclass, but is a view of a graph input (#118191) ## CPP API * Add the bound check for flatten with out_dim (#120894) * torch check the division by zero in batch_norm_update_stats (#120882) * Fixed crash when calling pad_packed_tensor when packed with cuda tensors and ensure_sorted=false due to indexing with tensors on different devices (#115028) * [Caffe2] Fix bug in `str` on wide types (#117531) * Try creating a bf16 tensor as a last resort of `is_bf16_supported()`. (#115924) * Fix admm over empty tensors and broadcastable input (#118619) ## Distributed API * C10d: * [c10d] Fix Store check condition in NCCL PG watchdog (#115475) * [c10d] Fix compilation of NCCL_EXP path (#119805) * Don't add NCCL backend by default without CUDA (#119149) * [IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575) * Add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141) * Fix `get_rank` under a non-default group. 
(#120481) * Fix a bug where nn.functional._AllGather.backward produces wrong gradients (#120582) * Fix default world_size when running on 1 or 0 GPU (#119372) * Fix distributed debug w/ non-equal split (#115483) * [AMD] Fix build for intra_node_comm (#116291) * Fix timeout dump path write path overlap when there are multiple PGs (#116218) * Handle unwaited work objects on process termination (#119881) * Guarantee init cuda before attaching hooks (#120052) * Remove backend_id from pg_info (#120038) * Fix logic for default group=None in _set_pg_timeout (#120686) * Make _set_pg_timeout work with DeviceMesh PG (#120850) * Fix false positive ‘errors’ due to ‘reason’ string (#120863) * Fix the hang issue in store.check(TIMEOUT_DUMP) (#116297) * FSDP: * Sharded grad scaler: copy found_inf after waiting on async reduce_all (#115710) * Fix FSDP + TP state dict in param unflattening (#115105) * Fixed `device_mesh` and auto wrap (#119064) * Added warning about unsupported double backwards (#120926) * Distributed Checkpointing (DCP): * Call os.sync if os.fsync does not work for fsspec (#119287) * Fixes expected behavior when `no_dist=True` in `state_dict_loader.load` (#115660) * Fix no shard state dict loading (#120367) * DTensor: * Force re-compute sharding when normalized_shape differs in fwd layer norm (#115250) * Fix is_tensor_shardable to correctly handle Replicate placement (#117726) * Fix unnecessary redistribute in new_factory_strategy (#118037) * Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847) * DTensor + dynamo: fix is_shard/replicate always inlining to False (#118668) * to_local backward grad placement passthrough (#121474) * nn.Module: use swap_tensors for Tensor subclasses (#122755) * Fix swap_tensors path in _apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800) * DeviceMesh * Fix fsdp device mesh depenency issue (#121061) * Cache and reuse sliced result (#122975) * DDP * Pass inductor strides forward in ddp optimizer (#120523) * Ignore gradient sync if the gradient is not defined (#120419) * DistributedDataParallel._post_forward, fix return (#114678) * Lazily compile submodules - to propagate real tensor strides to backend compiler (#114154) * Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553) * TorchElastic: * Correctly detect SystemExit with code == 0 when using –run-path #119697 * Misc * Fix ChunkShardingSpec metadata offsets for empty shards (#121002) ## torch.compile ### Dynamo * Fix F632 bug in dynamo (if statement is always false) (#116867) * Properly trace into mark_static (#120232) * Handle guard_size_oblivious in user code (#120379) * Do not attempt to make nditer spawned arrays writable (#120868) * Fix autograd.Function x enum input x torch.compile (#115206) * Fix handling of one_hot (#116338) * Fix `functools.reduce()` function with `None` as `initial` (#116398) * Fix `sum()` function with `start` argument (#116389) * Fix torch function kwarg dispatch (#117083) * Fix several bugs related to unbacked SymInt codegen in inductor (#117862) * Fix Auto Functionalize to handle specified default values (#118331) * Fix `__name__` on a reconstructed NestedUserFunctionVariable (#118768) * [HigherOrderOp] fix stack trace to report user stack (#118826) * Fix dupe deprecated warning in dynamo export (#120896) * Fix support for nn.Parameter constructor (part 1) (#120163) * Fix gradient refcounts in pybind and compiled autograd (#118817) ### Inductor * Fix torch.bernoulli decomposition return type (#115699) 
* Fix angle decomposition return type (#115700) * Allow sympy expressions to participate in type promotion (#115676) * Do variance calculation in opmath type (#115181) * Properly unwrap_storage tensors sent to DynamicScalar (#117444) * Catch some missing unbacked symbol dependencies (#117650) * Don't try to directly compare symbols, it won't work (#117674) * Changed return type of randint64_cpu to int64_t to prevent codegen is… (#117443) * Realize inputs to DynamicScalar before unwrapping storage (#118125) * Prevent DCE'ing unbacked SymInt for view outputs (#119552) * Exclude operators that produce unbacked symbols (#120917) * Fix guards for code objects (#120909) * Fix a bug in batch linear fusion in the post grad (#115061) (#115131) * Add missing include to `model.h` (#118075) * Fix a bug in the torch._export.aot_load API (#118039) * Fix a None as index codegen issue (#118187) * Forward fix #117989 (#118291) * Fix a strict-aliasing warning (#120628) * Fix broadcast_tensors with unbacked symints when translation validation is off (#118066) * Fixed issue with true div on integer input with dyn shapes (#115920) * Fix QConv Binary Inplace Layout Issue (#115613) * [Optimus] Fix batch layernorm numerical issue (#117404) * Fix a bug in merge_splits (#117707) * [Optimus] Fix a bug in gradients computation for runtime numeric check (#118105) * Fix key error in pre_grad fx_passes_numeric_check (#118325) * [Runtime numeric check] Fix compatibility issue (#118578) * Fix a RAIIAtenTensorHandle premature deallocation bug (#118963) * Fix FallbackKernel behavior on mutable ops (#118649) * Fix compile error on scan with no mask (#119555) * Fix a bug in merge_splits (#119956) * Fix lint after #105590 (#120461) * Fix "example_value" absent for stack nodes (#120655) * Fix missing "example_value" for nodes introduced by group batch fusion (#120974) * Forward fix lint after 121202 (#121425) * Fix cudagraph check message (#115664) * Fix constant folding and extern kernel mutation tracking bugs (#115908) * Fix cpp_wrapper inputs mismatch (#116197), codegen for ir.ComplexView (#116481), cumsum codegen (#116171) * Fix Conv Binary Inplace Fusion issue (#115153 * Fixed issue in upsample_nearestnd lowering with scales (#117538) * Fix inductor pattern match error for qlinear with bmm (#117633) * Fix cpp backend relu codegen with inf input (#117622) * Fix CPP wrapper codegen for ExternKernel args (#117931) * Fix sympy_subs to preserve integer and non-negative properties. 
(#118150) * Fix Argmax codegen with Nan input (#118358) * Fix constant folding bug with sym size tensor (#118411) * Fix a typo in should_pad_bench (#118598) * Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569) * Fix an internal test issue (#118903) * Fix a cpp kernel missing arg type issue (#119021) * Fix Inductor CSE Across Separate Reductions (#119410) * Fix bandwidth extimation for StarDep (#120266) * Fix bug around out of order constexprs in inductor (#120287) * Fix compiler check (#120492) * Fix q/dq per channel lowering with 64-bit qparams (#120984) * Fix the layout problem for nll_loss2d_backward (#121173) * Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428) * Correct index propagation for % (#119864) * Fix a missing declaration for the result of item() (#115175) * Make sure bitcast input and target type have the same bitwidth (#115619) * Avoid inplace for ComplexView (#115166) * SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069) * Ignore SIGINT in codecache workers (#116380) * Add more alias and mutation check for other input of Conv Binary Inplace fusion (#117330) * Fail Conv Binary Inplace check when act and accum are same tensor (#117331) * [CPU] Disable floating-point contraction when compiling (#116318) * Use wait stream instead of synchronize() in cudagraph warmup (#117578) * Place .lrodata later in the binary (#117575) * Correctly generate grid info for benchmark_kernel (#118202) * Do not reuse buffers across scopes in mem planning (#120777) * Fix profiler (#119959) * Wrap remote cache creation with a try-catch (#121340) * Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297) ### torch.export * Handle transposition pattern seen in SDPA with unbacked SymInts (#121005) * Fix bug removing node from wrong graph (#121574) * Fix graph signature for primitive outputs (#118818) * Fixed bug with user input mutations (#118942) * Prevent specialization on backends (#118683) * Fixed nn_module_stack in retracing (#121423) * Fixed accidental specialization with faketensor input checks (#121460) * Fixed name collision on constant name (#121145) * Don't error if nn_module_stack doesn't contain a class (#119753) * Fixed tuple return with symints (#119829) * Fixed canonicalization for input mutations (#119533) * Fixed getting meta["val"] (#117313) * Add pass to remove auto functionalized hop (#122246) * Fix auto_functionalize (#121990) * Hack skip index_put_ in DCE (#122683) * Various fixes to .module() (#118272) * Do not rewrite state dict when unlifting (#118611) ## FX * Fix pass_manager type annotation (#119499) * Suggested fixes for congruences (#121418) * Fix: set codegen in _SplitterBase partitioner (#120361) * Fixed FxGraphDrawer compat constructor (#119767) * [sigmoid] Fix for FX tracing unflattened modules (#115708) * Fix for subgraph rewriter (#119052) * Fix F821 error in torch/fx/experimental (#116587) * Support printing storage while FakeTensorMode is enabled (#118780) * Don't guard if there are unbacked SymInts (#119312) * Avoid performing replacements when it would unrefine ranges (#117356) * Fix none type comparison (#116399) ## JIT * Fix `RuntimeError: NYI: Named tensors are not supported with the tracer` errors when using torch.jit.trace (#118393) * Fix LLVM18 build (#115652, #117086) * Fix handling of broadcasted inputs in Linear-BN Fusion (#119264) ## Linalg * Fix mm accuracy in ROCm for some inputs (#116537) * Update matmul heuristics in 
the presence of gradients (#117067) (#118617) * Workaround a cusolver bug on CUDA < 12.1 in triangular_solve (#117636) ## MPS * Fix addmm (#116547) * Fix SegFault when torch.all/any dispatched to mps or other backends (#116457) * Increase metal language support to 2.3 (#117472) * Fix `torch.mm` correctness for large matrices (#117549) * Fix lintear for 5D tensors (#117837) * Fix `use_metal_mm` condition (#118830) * Add support for complex scalars (#119318) * Use `dyspatch_sync_with_rethrow` in searchsorted (#119646) * Fix cfloat->chalf conversion on MacOS13 (#119681) * Enable `conj` and `conj_physical` (#119669) * Fix `out` resize logic in `torch.where` (#121476) * Fix CrossEntropyLoss for float16 (#116597) * Do not crash if Metal function can not be found (#116938) * Fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771) * Fix placeholder tensor is empty for relu in mps (#118965) * Fix boundary checks in generateKernelOffsets (#116915) * Fix torch.clamp in MPS to handle NaN correctly (#121381) * Fwd-fix for clamp regression (#122148) * Fix naive matmul for BFloat16 (#121731) * Fix for MPS regression in #122016 and #123178 (#123234) ## torch.nn API * Fixed numpy warning when importing torch without numpy installed (#115867) * Fixed edge case for size 1 channels dim in `AdaptiveMaxPool` (#116482) * Fixed module pre bw hooks when input doesn't require grad but gradients are changed by the user (#116454) * Fixed `TransformerEncoderLayer` for `bias=False` (#116760) * Fixed error checking for `LSTM` with wrong input shape (#115542) * Removed an incorrect type specification from `AdaptiveMaxPool1d` (#118162) * Fixed an illegal memory access in cross entropy loss when using an index that is not a valid class (#117561) * Fixed pool padding assertion to account for dilation (#118897) * Fixed `flash_attn_bw` impl to match meta implementation when `k` and `v` have different strides (#119500) * Fixed `nonlinearity` arg issue in RNN (#120234) * Fixed `requires_grad` preservation for `nn.Module.load_state_dict(assign=True)` (#121157) * Fixed last_dim stride check for singleton dimensions (#117001) * Fixed gradients on cuda for interpolate::trilinear on non-contiguous grad output (#117373) * Added Half support for masked_softmax on CPU (#117028) * Fixed an issue where nn.Linear would cause an internal int underflow (#119221) * Fixed segfault in torch.native_channel_shuffle when input is empty (#121199) ## Nested Tensors * Proper view support for jagged layout NestedTensor (#113279) ## ONNX * Add decomposition for upsample_linear{1d, 3d} (#114774) * Fix upsample_bilinear2d decomp skip with output shape (#118823) * Fix ONNXRT torch.compile backend running with OrtValueVector (#116124) * Fix output mismatch issue of repeat_interleave when dim is None (#116689) * Update initializer path for ONNXProgram.save due to onnx.checker limitation (#117294) * Set proper fqn in lift constant tensor pass (#115222) * Fix type promotion pass (#118246) * Perform implicit casting of constants for the onnx::where operator (#118733) (#120619) * Fix onnxrt backends with inputs on mix devices (#121159) * Fix breaking changes for ONNX Runtime Training (#122000) * beartype to emit warning instead of error by default (#123205) ## Optimizer * Rectify capturable testing and fix load_state_dict bugs with capturable! 
(#118326) * Use torch.no_grad decorator for clip_grad_norm APIs vs local detaches (#120638) * ReduceLROnPlateau allow get_last_lr to not error (#119556) ## Profiler * Stop clearing history when changing context (#120436) * Fix conversion of max memory allocated and reserved from GB to GiB (#120172) * Fix the missing device string in _memory_profiler (#119751) * Log process group id instead of backend id in GPU traces (#120475) * Add kineto init delay when used in daemon mode (#120276) * Fix recorded profiler step number (#121127) * [ET] Fix deadlock in ExecutionTraceObserver (#119242) (#119398) ## Python API * Fix index range checks when index is the minimum int value (#116062) * Fix slots handling in torch.utils.swap_tensor (#116128) * Fix NaN bug in torch.signal.windows.kaiser (#116470) * Fix handling of empty inputs in torch.fft.fftn (#117368) * Fix and/or ops on torch.uint8 tensors only return 0x00 or 0x01 (#117827) * Fix inf handling in torch.nn.functional.scaled_dot_product_attention (#119577) * Fix serialization of torch.complex32 dtype (#120388) * Fix torch.gradient check for spacing arg list length (#115686) * Fix type hints on torch.nn.attention.sdpa_kernel (#119140) * Fix dimension checks in torch.distributions.MixtureSameFamily (#118947) ## Quantization * Make HistogramObserver handle torch.inf and closeby values (#103467) * Fix equal_quantized_cpu for QUInt4x2 and QUInt2x4 (#116307) * Fix XNNPACKQuantizer set_module_type issue (#115252) * Fix a segfault when calling topk on a quantized scalar tensor. (#116337) * Fix a segfault issue when passing an empty kernel to quantized_max_pool1d (#116342) * Fix batchnorm folding in pt2e quantization (#118720) * Update `PerChannelMinMaxObserver` default `_load_from_state_dict` (#118659) ## Releng * Fix for sparse windows on CPU with MKL (#102604) * Fix for ExecuTorch pinned commit update failure (#117518) ## Sparse * Fix a crash in sparse compressed tensor invariants check when nnz == 0 (#115825) * Fix sparse compressed tensor invariants checks when nnz==0 (#115826) * Fix segfault when trying to permute empty tensor (#116335) ## Other * Remove compute capability 3.5 for CUDA 12 (#114930) * VSX: Fix overflow in complex division (#116972) * VSX: Fix vectorized abs function for complex tensors (#116859) * Add complex support to parametrizations.spectral_norm (#121452) * Fix crash in SymInt unary minus (#116160) * Fix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handler (#110301) * Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300) # Performance ## Composability * In torch.compile, avoid allocating extra buffers unnecessarily in cases where the compiled function returns a mutated input directly (#120514) * Min-cut partitioner always saves tensors that are returned as-is in backward (#114970) ## CUDA * Speed up triu_tril_kernel (#115013) ## Inductor * Add an autotune cache for inductor generated kernels (#120963) * Add ArrayRefTensor (#112115) * Replace cached thread_locals with stack allocation in AOTI (#112116) * Autotune with matrix_instr_nonkdim for AMDGPU (#120742) * Enable lowering of dynamic qlinear for X86Inductor (#120605) * [NFC][Autotune] Use device_prop.regsPerMultiprocessor instead of hardcoded reg number. (#115094) * [Autotune] Enable register pressure handling logic for H100. 
(#115295) * Support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545) * Load as scalar for the index invariant in the vector range (#116387) * Inductor qlinear int8_fp32 with bmm (#116599) * Inductor qlinear int8_bf16 with bmm (#116604) * Enable fp16 mkldnn fusion/prepack in inductor (#117206) * Improve vector contiguous checks for FloorDiv and ModularIndexing (#117221) * Load as scalar for the index invariant in the vector range (#116387) * Use sleef implementation for CPP backend acosh codegen (#118350) * Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634) * be more consrevative until regression is debugged (#119583) * Check alignment of ReinterpretView args of custom Triton kernels (#119649) * Apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260) * Apply simplify_index_in_vec_range to vector store and vector transpose (#117263) * Multi-kernel support (#103469) * Enable the Inductor Lowering of QConv2d post op hardswish (#117489) * Change CppWrapperCodeCache to use faster python binding (#117693) * Slightly faster memory allocation on CPU (#118171) * Slightly faster memory allocation on CUDA (#118255) * Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278) * Enable vectorization with constant bool (#118380) * Support scalar value in vec reduction (#118511) * Use at::detail::empty_strided_* in cpp_wraper mode (#118490) * Add equal_to_1 to triton_meta for user-written Triton kernels (#120579) * Add mask_convert_to_lp to support bool->fp16/bf16 convert (#117830) * Optimize transpose_mxn with bf16 data type (#117958) * Add Int8 data type into Inductor CPP backend vectorized code generation (#119179) * [Autotune] Multithreaded Precompilation (#119386) * Add SDPA pattern for HuggingFace models BF16 (#121202) * Support auto-tuned custom PT ops in abi compatible mode (#120877) * Benchmark templates (#118880) ## MPS * Add native lerp support (#119036) ## Optimizer * Replace new().resize_as_() by torch.full_like() in Rprop (#119978) * clip_grad_norm can use fast foreach path for inf norm (#120623) ## Profiler * Only profile when JIT is enabled. 
(#121404) ## Python API * Speed up fp16<->fp32 conversion on ARMV8 platforms (#120012) ## ROCm * CatArrayBatchedCopy performance improvement (#118685) * Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857) ## Other * Add NEON accelerated torch.mv kernel (#119992) # Documentation ## Autograd API * Autograd doc cleanup (#118500) * Deduplicate docs between global and non-global full backward hooks (#119708) * Add missing words to torch.utils.checkpoint doc (#120196) ## CUDA * Include a print for _get_cuda_arch_flags (#118503) * Clarify how to get extra link flags when building CUDA/C++ extension (#118743) * Test seo torch cuda (#119324) ## Distributed API * C10d * Add documentation for the `device_id` parameter for `init_process_group` (#116222) * Add docstrings and tests for src / dst (#118593) * Add device for distributed examples (#118867) * Add Work to distributed docs (#115172) * DDP: * Fix docstring errors in model_averaging (#117038) * Fix docstring errors in ddp_comm_hooks (#116866) * Update DDP dynamo debug docs (#118295) * FSDP * Fix optim_state_dict_to_load doc errors (#118195) * Distributed Checkpointing (DCP): * Fix the documents for distributed_state_dict (#121276) * Update the distributed state_dict document (#121290) * DTensor * Update README to make all example runnable (#115365) * Add torchrec even row-wise sharding example * Add clarification to doc and improve TP examples (#121431, #117618) * Add torch.float64 precision support to the transformer test suite in TP/SP (#116436) * Misc: * Add doc for torch.distributed.breakpoint (#115656) ## FX * Reduce create_env log level to DEBUG (#120772) ## torch.compile ### Inductor * Document and type torch._inductor.virtualized (#117658) * Document OpsHandler protocol (#117790) ### torch.export * Added TORCH_LOGS=export (#116993) * Update _constrain_as_size docs (#120728) * Added docs for 2.3 release (#121466) * Updated docs to not export raw functions (#121272) * Add comments about runtime_var_to_range. 
(#118539) ## Linalg * Fix error in examples of torch.linalg.lu_factor (#120484) * Add links to _ex variants in all linalg functions that support them (#121451) ## torch.nn API * Updated documentation for the constraints of `FractionalMaxPool2d` (#116261) * Updated `BCEWithLogitsLoss` documentation regarding `pos_weight` (#117046) * Fixed typo in `register_state_dict_pre_hook` doc (#118571) * Change the parameter type from int to float in `torch.nn.Softplus` (#120183) * Documented special case in `AvgPool` (#120335) * Added hyperlink to `Transformer` documentation in Transformer-related modules (#120565) * Added type hints to `TransformerEncoder`/`Decoder` (#120550) * Added a note in `Transformer` about difference in masking semantic with `torch.nn.functional.scaled_dot_product_attention` (#120668) * Fixed documentation for mask broadcasting behavior in `torch.nn.functional.scaled_dot_product_attention` (#120859) * Fixed math display in `ChannelShuffle` documentation (#121247) * Documented padding size constraint in `nn.ReflectionPad2d` (#115995) * Fixed documentation of `nn.functional.scaled_dot_product_attention` to indicate scale is a keyword only arg (#119129) ## Optimizer * Added example regarding weight_decay distinction with per-parameter API (#117436) * Fix optim.lr_scheduler examples in doc to use optimizer vs self.opt (#119563) * Clarify decay vs multiply by a constant factor in the constantLR doc (#120852) * Clarify the patience in ReduceLROnPlateau (#119872) ## Other * Fixed render of `tensors` in `backward` (#117994)

PyTorch 2.2.2 Release, bug fix release (2024-03-27)

This release is meant to fix the following issues (regressions / silent correctness):

- Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (https://github.com/pytorch/pytorch/pull/115969)
- Fix mkldnn performance issue on Windows platform (https://github.com/pytorch/pytorch/pull/121618)
- Fix ``RuntimeError: cannot create std::vector larger than max_size()``  in `torch.nn.functional.conv1d` on non-contiguous cpu inputs by patching OneDNN (https://github.com/pytorch/builder/pull/1742) (https://github.com/pytorch/builder/pull/1744)
- Add support for `torch.distributed.fsdp.StateDictType.FULL_STATE_DICT` when using `torch.distributed.fsdp.FullyShardedDataParallel` with the `device_mesh` argument; see the sketch after this list (https://github.com/pytorch/pytorch/pull/120837)
- Fix ``make triton`` command on release branch for users building the release branch from source (https://github.com/pytorch/pytorch/pull/121169)
- Ensure gcc>=9.0 for build from source and cpp_extensions  (https://github.com/pytorch/pytorch/pull/120126)
- Fix cxx11-abi build in release branch (https://github.com/pytorch/builder/pull/1709)
- Fix building from source on Windows with MSVC 14.38 (VS 2022) (https://github.com/pytorch/pytorch/pull/122120)
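
A minimal sketch of the FSDP + `device_mesh` + `FULL_STATE_DICT` combination that this release fixes (illustrative only; it assumes a single node launched with `torchrun`, and the model and mesh shape are placeholders):

```Python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Wrap the model with FSDP using a 1-D device mesh.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = FSDP(torch.nn.Linear(16, 16).cuda(), device_mesh=mesh)

# Gather a full (unsharded) state dict, offloaded to CPU on rank 0 only.
with FSDP.state_dict_type(
    model,
    StateDictType.FULL_STATE_DICT,
    FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
):
    full_sd = model.state_dict()
```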

Release tracker https://github.com/pytorch/pytorch/issues/120999 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2.1 Release, bug fix release (2024-02-22)

This release is meant to fix the following issues (regressions / silent correctness):

- Fix missing OpenMP support on Apple Silicon binaries (https://github.com/pytorch/builder/pull/1697)
- Fix crash when mixing lazy and non-lazy tensors in one operation (https://github.com/pytorch/pytorch/pull/117653) 
- Fix PyTorch performance regression on Linux aarch64 (https://github.com/pytorch/builder/pull/1696)
- Fix silent correctness in DTensor `_to_copy` operation (https://github.com/pytorch/pytorch/pull/116426)
- Fix properly assigning `param.grad_fn`  for next forward (https://github.com/pytorch/pytorch/pull/116792)
- Ensure gradient clear out pending `AsyncCollectiveTensor` in FSDP Extension (https://github.com/pytorch/pytorch/pull/116122)
- Fix processing unflatten tensor on compute stream in FSDP Extension (https://github.com/pytorch/pytorch/pull/116559)
- Fix FSDP `AssertionError` on tensor subclass when setting `sync_module_states=True` (https://github.com/pytorch/pytorch/pull/117336)
- Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (https://github.com/pytorch/pytorch/pull/115592)
- Fix OOM when returning an AsyncCollectiveTensor by forcing `_gather_state_dict()` to be synchronous with respect to the main stream (https://github.com/pytorch/pytorch/pull/118197) (https://github.com/pytorch/pytorch/pull/119716)
- Fix Windows runtime `torch.distributed.DistNetworkError`: [WinError 32] The process cannot access the file because it is being used by another process (https://github.com/pytorch/pytorch/pull/118860)
- Update supported python versions in package description (https://github.com/pytorch/pytorch/pull/119743)
- Fix SIGILL crash during `import torch` on CPUs that do not support SSE4.1 (https://github.com/pytorch/pytorch/issues/116623)
- Fix DCP RuntimeError in `get_state_dict` and `set_state_dict` (https://github.com/pytorch/pytorch/pull/119573)
- Fixes for HSDP + TP integration with device_mesh (https://github.com/pytorch/pytorch/pull/112435) (https://github.com/pytorch/pytorch/pull/118620) (https://github.com/pytorch/pytorch/pull/119064) (https://github.com/pytorch/pytorch/pull/118638) (https://github.com/pytorch/pytorch/pull/119481)
- Fix numerical error with `mixedmm` on NVIDIA V100 (https://github.com/pytorch/pytorch/pull/118591)
- Fix RuntimeError when using SymInt input invariant when splitting graphs (https://github.com/pytorch/pytorch/pull/117406)
- Fix compiling `DTensor.from_local` in the trace rules lookup; see the sketch after this list (https://github.com/pytorch/pytorch/pull/119659)
- Improve torch.compile integration with CUDA-11.8 binaries (https://github.com/pytorch/pytorch/pull/119750)
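
A minimal sketch of the pattern the `DTensor.from_local` fix unblocks (illustrative only; it assumes a process group launched with `torchrun`, and the mesh and shapes are placeholders):

```Python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor, Replicate
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

def wrap_and_add(local):
    # DTensor.from_local can be traced by torch.compile after this fix.
    return DTensor.from_local(local, mesh, [Replicate()]) + 1

compiled = torch.compile(wrap_and_add)
out = compiled(torch.randn(4, 4, device="cuda"))
```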

Release tracker https://github.com/pytorch/pytorch/issues/119295 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2: FlashAttention-v2, AOTInductor (2024-01-30)

# PyTorch 2.2 Release Notes

- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation

# Highlights

We are excited to announce the release of PyTorch® 2.2!  PyTorch 2.2 offers ~2x performance improvements to `scaled_dot_product_attention` via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.
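
As a quick illustration of the new logging knob (a minimal sketch; the component names passed to `TORCH_LOGS` below are examples, not an exhaustive list):

```Python
# Enable TorchDynamo and TorchInductor logs for one run via the environment:
#   TORCH_LOGS="dynamo,inductor" python train.py
#
# The same settings can be applied programmatically:
import logging
import torch

torch._logging.set_logs(dynamo=logging.INFO, inductor=logging.INFO)

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))  # compilation logs are emitted on this first call
```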

**Please note that we are [deprecating macOS x86 support](https://github.com/pytorch/pytorch/issues/114602), and PyTorch 2.2.x will be the last version that supports macOS x64.**

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits from 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.

Summary:
- `scaled_dot_product_attention` (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
- PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-Python server-side deployments.
- `torch.distributed` supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
- PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
- A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers (see the sketch after this list) and improved TorchInductor fusion and layout optimizations.
- Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
- `torch.ao.quantization` now offers a prototype `torch.export` based flow
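
A minimal sketch of the compiled-optimizer support mentioned above (illustrative only; the model, sizes, and choice of optimizer are placeholders, and a CUDA build is assumed):

```Python
import torch

model = torch.nn.Linear(64, 64).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Populate gradients once so the optimizer has something to step on.
model(torch.randn(8, 64, device="cuda")).sum().backward()

@torch.compile
def compiled_step():
    # The optimizer's parameter-update loop is captured by torch.compile.
    opt.step()

compiled_step()
```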


| Stable | Beta | Prototype | Performance Improvements |
|--------|------|-----------|--------------------------|
| | FlashAttentionV2 backend for scaled dot product attention | PT 2 Quantization | Inductor optimizations |
| | AOTInductor | Scaled dot product attention support for jagged layout NestedTensors | aarch64-linux optimizations (AWS Graviton) |
| | TORCH_LOGS | | |
| | torch.distributed.device_mesh | | |
| | torch.compile + Optimizers | | |
\*To see a full list of public 2.2 - 1.12 feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).

# Tracked Regressions

### **Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)**

We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.

### **Poor numeric stability of loss when training with FSDP + DTensor (#117471)**

We observe that the loss will flatline randomly while training with FSDP + DTensor in some instances.

# Backwards Incompatible Changes

### **Building PyTorch from source now requires GCC 9.4 or newer (#112858)**

GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.

### **Updated flash attention kernel in `scaled_dot_product_attention` to use Flash Attention v2 (#105602)**

Previously, the v1 Flash Attention kernel had a Windows implementation, so a user on Windows who explicitly forced the flash attention kernel via the `sdp_kernel` context manager with only flash attention enabled would get a working kernel. In 2.2, if the `sdp_kernel` context manager must be used on Windows, use the memory-efficient or math kernel instead.

```Python
# Forcing the flash attention kernel (worked on Windows in 2.1)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q,k,v)
```
```Python
# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(q,k,v)
```

### **Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)**

In PyTorch 2.1 or before, users could use ParallelStyles like `PairwiseParallel` and specify the input/output layout with functions like `make_input_replicate_1d` or `make_output_replicate_1d`, and default values were provided for `_prepare_input` and `_prepare_output`. The Tensor Parallel UX looked like:

```python
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
    parallelize_module,
)
from torch.distributed.tensor import DeviceMesh

module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...
```

Starting from PyTorch 2.2, we simplified the parallel styles to only `ColwiseParallel` and `RowwiseParallel`, since the other ParallelStyles can be composed from these two. We also deleted the input/output functions and instead use the `input_layouts` and `output_layouts` kwargs to specify the sharding layout of the input/output tensors. Finally, we added the `PrepareModuleInput`/`PrepareModuleOutput` styles, which have no default layout arguments, so users need to specify the layouts explicitly and think about how their tensors are sharded.
```python
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._tensor import init_device_mesh

module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
    module,
    device_mesh,
    {
        "fqn": PrepareModuleInput(
            input_layouts=Shard(0),
            desired_input_layouts=Replicate(),
        ),
        "fqn.net1": ColwiseParallel(),
        "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
    },
)
...
```

### **`UntypedStorage.resize_` now uses the original device instead of the current device context (#113386)**

Before this PR, `UntypedStorage.resize_` would move data to the current CUDA device index (given by `torch.cuda.current_device()`). Now, `UntypedStorage.resize_()` keeps the data on the same device index that it was on before, regardless of the current device index.
**2.1**
```Python
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0
```
**2.2**
```Python
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1
```
### **Wrapping a function with set_grad_enabled will consume its global mutation (#113359)**

This bc-breaking change fixes some unexpected behavior when `set_grad_enabled` is used as a decorator.

**2.1**
```Python
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
def inner_func(x):
    return x.sin()

>>> torch.is_grad_enabled()
True
```
**2.2**
```Python
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
def inner_func(x):
    return x.sin()

>>> torch.is_grad_enabled()
False
```
### **Deprecated `verbose` parameter in `LRScheduler` constructors (#111302)**

As part of our decision to move towards a consolidated logging system, we are deprecating the `verbose` flag in `LRScheduler`. If you would like to print the learning rate during execution, please use `get_last_lr()`.
**2.1**
```Python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
```
**2.2**
```Python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")
```
### **Removed deprecated c10d multi-gpu-per-thread APIs (#114156)**

In PyTorch 2.1 or before, users could use our multi-gpu c10d collective APIs such as `all_reduce_multigpu`:

**2.1**
```Python
import torch.distributed as dist

dist.broadcast_multigpu
dist.all_reduce_multigpu
dist.reduce_multigpu
dist.all_gather_multigpu
dist.reduce_scatter_multigpu
...
```

In PyTorch 2.2, these APIs are removed because PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stood for multiple GPUs per CPU thread) had been deprecated since PyTorch 1.13.

### **Rename `torch.onnx.ExportOutput*` to `ONNXProgram*` (#112263)**

The output of `torch.onnx.dynamo_export` was renamed from `torch.onnx.ExportOutput` to `torch.onnx.ONNXProgram` to better align with the `torch.export.export` API terminology, which returns a `torch.export.ExportedProgram`. With this change, any ambiguity that could arise between the two APIs is eliminated.
**2.1**
```Python
export_output: torch.onnx.ExportOutput = torch.onnx.dynamo_export(...)
```
**2.2**
```Python
onnx_program: torch.onnx.ONNXProgram = torch.onnx.dynamo_export(...)
```
### **Fix `functional::smooth_l1_loss` signatures to not override `beta` (#109798)**

Previously, there were two possible ways to pass `beta` to `smooth_l1_loss`: as a `SmoothL1LossFuncOption` parameter or as a function parameter. The beta specified as a function parameter would override the other beta if it was set, which was unexpected behavior. Now, we throw an error when beta is passed both ways.

# Deprecations

## Autograd API

### **Deprecate not passing the `use_reentrant` kwarg to `torch.utils.checkpoint.checkpoint_sequential` explicitly (#114158)**

The `use_reentrant` parameter should be passed explicitly. In version 2.4 we will raise an exception if `use_reentrant` is not passed. `use_reentrant=False` is recommended, but if you need to preserve the current default behavior, you can pass `use_reentrant=True`. Refer to the docs for more details on the differences between the two variants. Note that not passing the `use_reentrant` kwarg to `torch.utils.checkpoint.checkpoint` was already deprecated in a previous release.
**2.1**
```Python
a = torch.randn(3, requires_grad=True)
modules_list = [
    torch.nn.Linear(3, 3),
    torch.nn.Linear(3, 3),
    torch.nn.Linear(3, 3)
]

# This would produce a warning in 2.2
checkpoint_sequential(modules_list, 3, a)
```
**2.2**
```Python
# Recommended
checkpoint_sequential(modules_list, 3, a, use_reentrant=False)

# To preserve existing behavior
checkpoint_sequential(modules_list, 3, a, use_reentrant=True)
```
### **Deprecate `"fallthrough"` as autograd fallback default (#113166)**

Custom operators that do not have a kernel registered to the Autograd keys (e.g. AutogradCPU and AutogradCUDA) will now produce a warning when used with autograd. If your custom operator previously returned floating-point or complex Tensors that do not require grad, they will now require grad as long as grad mode is enabled and the inputs require grad. For users who would like the old behavior, register `torch::CppFunction::makeFallthrough()` to your Autograd key, as shown [here](https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8). The below example uses the torch library API, but if you are writing an operator in a cpp extension, please read [this](https://docs.google.com/document/d/1W--T6wz8IY8fOI0Vm8BF44PdBgs283QvpelJZWieQWQ/edit#heading=h.pnrqcv6bkfn3) doc for more information.

```Python
import torch
import numpy as np

# Define the operator
torch.library.define("mylibrary::sin", "(Tensor x) -> Tensor")

# Add implementations for the cpu device
@torch.library.impl("mylibrary::sin", "cpu")
def f(x):
    return torch.from_numpy(np.sin(x.detach().numpy()))

x = torch.randn(3, requires_grad=True)
y = torch.ops.mylibrary.sin(x)
y.sum().backward()
```
**2.1**
```Python
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
**2.2**
```Python
UserWarning: mylibrary::sin: an autograd kernel was not registered to the Autograd key(s)
but we are trying to backprop through it. This may lead to silently incorrect behavior.
This behavior is deprecated and will be removed in a future version of PyTorch.
If your operator is differentiable, please ensure you have registered an autograd kernel
to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd).
If your operator is not differentiable, or to squash this warning and use the previous behavior,
please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd.
```
## Linalg

### **Deprecate `torch.cross` default behavior (#108760)**

Calling `torch.cross` without specifying the dim arg is now deprecated. This behavior will be changed to match that of `torch.linalg.cross` in a future release.

## Jit

### **`NVFuser` functionality has been removed from TorchScript (#110124, #111447, #110881)**

Neural Network Compiler (NNC) has replaced NVFuser as the default GPU fuser for TorchScript in PyTorch 2.1, which also added a deprecation warning for NVFuser. The TorchScript functionality for NVFuser has now been fully removed and is no longer supported.

## Optimizer

### **`SparseAdam` constructor will no longer accept raw Tensor type for `params` (#114425)**

`SparseAdam` is now consistent with the rest of our optimizers and only accepts containers instead of individual Tensors/Parameters/param groups.
**2.1**
```Python
import torch

param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```
**2.2**
```Python
optimizer = torch.optim.SparseAdam([param])
```
# New Features ## torch.compile ### Dynamo - Fully enabled compiled optimizers (#115906) - Cudagraphs support for compiled optimizers (#107504) - Experimental support for TorchDynamo tracing with DTensor (#108329) - Experimental support for torch.compile, activation checkpointing and FSDP (#103953) - Dynamo variable trackers are mutable (#113725) - Improves Dynamo compilation time - Reduce cache size limit to 8 - Quickly fallback to eager for non-compile friendly functions (#108526) ### Inductor - CUTLASS Backend with epilogue fusion support (#107802, #107847, #107901, #107931, #108015, #109338, #110890, #112762) - FP8 support (#109168, #112527, #114974, #111122, #112297) - Support user defined triton kernel (#111434, #111627, #111939, #111956, #111970, #112228, #112290, #112523, #112752, #113090, #114002, #114475) - Support registering out of tree customized pre/post grad pattern matchers (#108615, #113823) ### torch.export - Introduce dynamic_shapes API in favor of constraints (#108448, #112298, #110101, #110638, #110276) - Add torch.export.register_dataclass API (#109152) - Expose torch.ops.higher_order.map (#111404) - Change export to return full ATen IR (not Core ATen) and add a run_decomposition() function to allow users to pass in a decomposition table (or by default it will decompose to the core ATen decomposition table) (#111030, 8be26111f93, #110236, #114714) ## Build - Add Hopper (CUDA arch 9.0a) support (#110587) ## Python API - Add torch.unravel_index (#110580) - Add multi-dim reductions for `torch.{any,all} (#110310) - Add file name and size to the serialization metadata logging (#113077) - Add `torch.distributions.InverseGamma` distribution and fix `sign` bug in `torch.distributions.PowerTransform` (#104501) - Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377) ## Profiler - Show shapes for lists of tensors in chrome traces (#109751) - Add `src/dst` information to NCCL `send`/`recv` (#111811) - Populate in/out split size information for NCCL all_to_all from CPU to CUDA kernel (#112308) ## Quantization - Add CUTLASS-based support for mixed dtypes matrix multiplication (#110981) ## Sparse API - Add torch.compile support and padding for semi-structured sparsity (#111049, #110583) - Add CSR tensor with non-contiguous values support to CuSparseSpMatCsrDescriptor (#111742) - Add scatter_mm and bsr_scatter_mm operations. 
(#110396, #111796) - Add is_sparse as a property of MaskedTensor (#110725) ## NestedTensor API - Add unary out-of-place sin / cos support (#107891) - Add binary out-of-place ge.Scalar / eq.Scalar support (#107892) - Add binary op support for (B, C, *, *) NT with (C, 1, 1) dense (#107890) - Add support for cat with dim=0 (#108361) - Add support for matmul of (B, *, C, D) NT with dense (D, E) (#108370) - Add support for narrow() on dim=0 (#108362) - Add support for cat with dim > 0 when representable as jagged (#108428) - Add public API for constructing NT with jagged layout from tensor list (#111078) ## Misc - Python 3.10 Union operator `|` support for JIT (#109293) - Allow specifiying inputs as GradientEdge in autograd APIs (#110867, [dev-discuss](https://dev-discuss.pytorch.org/t/highlighting-a-few-recent-autograd-features-h2-2023/1787)) - Use CapturedTraceback symbolizer for C++ exceptions from Python library (#113207) - Add sparse tensor support to dataloader (#112842) - Add 0dim Tensor overload for _foreach_div (#113688) - Add global_step parameter to SummaryWriter.add_hparams (#109572) ## Fx - Add a matcher that supports name to node mapping (#110743) - Add splitting by tags feature (#109332) - Allow tracing calls with Python Enum values. (#109507) - Add function to port FX minified graph to HLO via StableHLO (#109084) ## ONNX - Add symbolic shape support for torch.onnx.dynamo_export(#112179) - Add optional torch.export.ExportGraphSignature to ONNXProgram (#113477) - Add `ONNXProgram.__call__` API to run model with ONNX Runtime (#113495) - Add decomposition support for dynamo_export + ExportedProgram (#112444) - Add user input mutation support for dynamo_export + ExportedProgram (#114596) - Add mutated buffer support for dynamo_export + ExportedProgram (#112272) - Add FakeTensor support for dynamo_export + ExportedProgram (#114407) ## CPU - Add support for `torch.cpu.set_device()` and `torch.cpu.current_device()` (#110716, #110987) ## MPS - Pixel shuffle unshuffle support (#99306) - Add lgamma, digamma, and polygamma implementations (#106292) - Add support for aten::nextafter (#109685) - Adding weight_norm_interface support for mps (#108008) - Add searchsorted op (#112829) - Add bucketize op (#112830) ## Vulkan - Add Vulkan support for several ATen operators: - `aten::randlike` (#108086) - `aten::randn_like` and `aten::normal` (#109075) - `aten::bmm` (#109360) - `aten::baddbmm` (#109360) - `aten::floor_divide` (#110785, #112190) - `aten::mean.dim` (#111609) - `aten::var.dim` (#111965) - `aten::log` and `aten::log_softmax` (#112828) - `aten::layer_norm` (#112322, #114701) - `aten::native_layer_norm` (#113573) - Partial implementation of 1D convolution (only supports stride=1, padding=0, dilation=1 for now) (#112880) - Add support for 0-size tensors (i.e. a tensor with a size of 0, for example with sizes {2, 1, 0}) (#111512) - Add support for 0-dim tensors (i.e. 
a tensor with sizes of {}) (#111680) # Improvements ## torch.compile ### Dynamo - Dispatch numpy.take_along_axis to torch.take_along_dim (#108880) - Force specialization on INT_LIST (#111216) - Add custom treespec fqn field (#112428) - Better error handling for cond (#108817) - Support 'BaseOutput' and subclasses from 'diffusers' in dynamo (#111978) - Add infinite generators `itertools.{count, repeat, cycle}` (#110967) - Add support for dict.fromkeys() / OrderedDict.fromkeys() / defaultdict.fromkeys() (#115010) - Add support for dict.update(seq2) / OrderedDict.update(seq2) / defaultdict.update(seq2) (#115011) - Add support for dict.copy() / OrderedDict.copy() / defaultdict.copy() (#115012) - Force synced KJT to trace unbacked SymInt (#108960) ### Inductor - `max-autotune` improvements - max-autotune in multi-processes with multi-GPUs (#109126, #109127, #109500) - Allow customizing benchmark input generation (#108242) - Make codegen stateless (#107320, #107617) - Add or improve lowering rules for prims.div, reflection_pad2d, full, sub, _local_scalar_dense, index, reflection_pad2d, index_put, unfold (#102809, #110988, #108166, #108518, #109893, #111015, #111212, #113204, #113259) - Add or improve decomposition rules for grid_sampler_2d, full, torch.ops.quantized.embedding_bag_byte_unpack, amax/amin, native_dropout, bmm, mm, complex dtype addition, upsample_nearest_exactNd (#104710, #108443, #109398, #110311, #115040, #109836, #110740, #113749) - Add reinplacing pass for scatters + incremental fake tensor updating (#106192) - Add meta-registration for _sparse_semi_structured_linear, _cslt_sparse_mm (#114477, #114685 ) - Provide fallback values for unbacked symint (#109893, #110520) - Decompose addmm on cpu for a few special cases (e.g. dot product or small matrix vector multiplication) (#110010, #110456) - Don't tune beyond 32 warps (which is a CUDA limit) for the coordinate descent tuner (#108997) - Avoid special characters in cache_dir path (#110945) - Add DeviceInterface abstraction to make inductor code more device agnostic (#109486 ) - Allow matmul to have flexiable layout when we are not autotuning (#110726) - Allow backend compiler skipping a frame for transient errors (#111153) - Handle item() on boolean tensor (#114157) - Replace `rand[n].generator` with inductor prim if generator=None (#115051) - Support channel last for XPU convolution in inductor layout optimization path (#111018) - Early work to improve the static memory planning algorithm (#111402) - Added support for symbolic shapes in FX graph cache (#111421) - Added config to specify the shape attribute for the generated svg graphs (#114811) - Quantization - Enable quantization dynamic batch size support (#108550) - Enable QConv2d Unary & Binary int8-mixed-bf16 Lowering (#112550, #112551) - Enable QLinear int8-mixed-bf16 Lowering (#112486) - Enable the lowering of quantized reshape (#114443) - Enable the Inductor Lowering of QConv2d post op hardtanh (#114580) - Improve the CPU backend - Support S390X (#111367, #112723) - Support MkldnnRnnLayer (#107858) - Add GIL release and acquire (#111888) - Vectorize support for truediv (#112234) - Support QConv (#112373) - Vectorize embedding lookup (#114062) - Avoid redundant lowp type cast for direct load/store (#115006) - Improve the Fx pattern matching passes - Generalize pointless_cumsum_replacement pattern (#108373) - Improve mem efficiency of constant folding (#108421) - Make sure unfuse_addmm and addmm patterns don't overlap (#110235) - Improve reinplace_scatters pass (#112801) - 
Make pattern-matcher failure diagnostics lazy and add an error message if format string is too long (#112923) - Foreach kernel compilation time improvement - Skip searching getitem in group batch fusion pass reduces optimizer compilation time by 60s (#112088) - Re-inplace foreach when safe and allow aliasing during lowering (#112440) - AOTInductor - ABI-compatible mode support - Add a C shim layer for libtorch (#109391, #109834) - Support _scaled_dot_product_flash_attention fallback (#110085) - Add AOTI ABI shim function for repeat_interleave.Tensor (#110745) - Add size, stride, storage_offset to RAIIAtenTensorHandle (#110764) - Add AOTI ABI shim function for torch.nonzero (#110766) - Enable floor_div indexing to work under ABI-compat mode (#113276) - Add ABI shim function for torch.scatter (#114027) - Support ReinterpretView in ABI mode (#114169) - Support at::convolution for AOTInductor (#114961) - ProxyExecutor for custom ops support - ProxyExecutor skips serializing missing args with default value (#111425) - Support List[Tensor] return type (#110182) - Proxy Executor for Extern Fallback kernels (#108350) - Switch ProxyExecutor to use AtenTensorHandle (#109748) - ProxyExecutor supports custom op with tuple output (#110140) - ProxyExecutor supports Tuple of Tensor and List[Tensor] in returns (#110187) - ProxyExecutor support ReinterpretView inputs (#110451) - ProxyExecutor support Dynamic Shape (#110526) - Allow using ProxyExecutor for ATen fallbacks (#112976) - Use ProxyExecutor for aten op if c-shim is missing (#113918) - CPU performance improvement - Generate reused thread_locals when tensors probably have static shape (#110892) - Cache dtypes and device types at DSO load (#111820) - Emit CACHED_TORCH_TYPE only as needed (#113997) - UX improvement and refactoring - Use array of constants (#111815) - Write weight files only if they do not exist yet (#111379) - Enforce no_grad for 'run' entry points (#111613) - Improve validation for C++ wrapper codegen (#111102) - Avoid generating redundant kernel loading code (#110510) - Group AOTInductor configs under aot_inductor class (#108369) - Include constants in the generated .so file (#108473) - Do not hardcode directory with .cubin files (#109151) - Add is_cpu for AOTInductorModelContainer (#109287) - Pass TorchIR to AOTInductor (#110020) - A lightweight model runner (#110158) - Remove CUDA dependency for cpp backend (#110409) - Delay the fallback kernel naming decision to the codegen time (#113660) - Move constant loading logic from Container to Model (#112197) - Allow specifying a .so name in the aot_inductor.output_path config (#112651) - Improve the two-pass wrapper codegen (#114067) ### torch.export - Address constant tensors in ExportedPrograms (#113689, #108592) - Remove replaced symbols from range_constraints (#110644) - Copy graph module before calling PassManager (#108321) - Made aot_export_module uses dynamo's fake_mode (#114009, #114381) - Core ATen Opset - Registered additional ATen operators as core (#110882) - De-registered full_like and empty_like as core (#110924) - Added div.Tensor_mode, div.Scalar_mode, and copy as core operators (#109812) ## Composability - FakeTensors and meta tensors are used to perform shape propagating when tracing out a graph in torch.compile. 
There were a number of op coverage improvements this release: - `masked_scatter` (#108802), `_segment_reduce` (#109359), `foreach` ops (#112281), `linear_backward` (#114359), `scaled_mm` (#112609) - We have python “reference” decompositions for many aten operators. These are used during the tracing step of torch.compile. In a few ways: sometimes they are used to directly decompose operators in the captured graph. Other times, they are used as an alternative to a shape-propagation rule for an operator. There were several improvements to operator coverage in this release - `linalg.vecdot` (#108188) - `aten.dot`/`vdot` (#108194) - `aten.tensor_split.tensor_indices_or_sections` (#107251) - `view_as_complex` (#108005) - `aten.take_along_dim` (#108185) - `scaled_dot_product_attention` (#108180, #108608, #108371, #113102) - `unsafe_split{,_with_sizes}` (#109668) - `_weight_norm_interface` (#112193) - We also have an opset known as “Core ATen IR” as defined here. Several ops were either added to core ATen, or had decompositions for them added, that decompose into other core ATen operators: - decompositions: - `std.correction` (#108733) - `unsafe_split.Tensor` (#108544) - `clamp_min` (#108717) - `clamp_max` (#108718) - `_unsafe_view` (#108713) - `baddbmm` (#108534) - `floor_divide` (#110046) - `aten.all` (#110093) - `split.Tensor` + `unbind` (#110323) - `aten.sum` + `aten.squeeze` (#110645) - `std` + `std_mean` (#109667) - Scaled dot product attention (#117097) - New core aten ops: - `trunc` (#109319, #109902) - `glu` (#110043) ## Python API - Add a UserWarning when using torch.{std,var,std_mean,std_var} with dof<=0 (#109824) - Add `torch.half` support for `torch.multinomial` on CPU (#104178) - Add support for serializing `torch.float8_*` dtypes (#114662) - Add different out dtypes support to `torch.addc{mul,div}` (#112682) ## torch.nn API - Add `__all__` for `torch.nn.utils` (#111026) - Add Half support for `AdaptiveAvgPool2d` and `AdaptiveMaxPool2d` on CPU (#102079) - Add Half support for `GroupNorm` on CPU (#100234) - Add Half support for `torch.nn.functional.{softmax/log_softmax}` on CPU (#103315) - Add BFloat16 support to `torch.nn.functional.grid_sample` (#112331) - Add BFloat16 support for `nn.utils.parametrizations.weight_norm` (#114785) ## Linalg API - Add fp16 support for gemm on CPU (#99498) - Add quantized int4 support for mm (#111403) ## Optimizer API - Allow torch.float64 scalars for forloop + foreach implementations (#115841, #111008) - Add `NAdam` support for complex dtypes, with `has_complex` shortcut (#110634) - Set default learning rate (`lr`) value of `SGD` to `1e-3` (#114467) - Add capturable ASGD impl (#107857) ## torch.func - Add vmap support for various in-place operations (#110692, #113513) - Add vmap support for `torch.unsafe_chunk` (#110862) - Add vmap support for `Tensor.index_add_` (#112276) - Add vmap support for `torch.linspace` and `torch.logspace` (#105451) - Add vmap support for `torch.linalg.eigh` (#110640) - Add dynamic shapes support for vmap over `torch.squeeze` and alias (#107577) - Add dynamic shapes support for vmap over `torch.is_same_size` and `torch.split_with_sizes` (#111491) ## Misc - Add `torch.utils.checkpoint.set_checkpoint_debug_enabled` (#110728) - StackDataset batched sampling (#110694) - Add option to flop counter formula registration to get raw values (#110591) ## Quantization - Bits Types - Enable `cat` for bits types (e.g.`torch.bits8`) in cuda (#115044) - Enable `copy`/`clone`/`reshape`/`contiguous` operations for bits types (#113508) - 
PyTorch 2 Export Quantization: - Add reference representation for dynamic quantized linear (#108073) - Use `input_qspec_map` for weight quantization of `linear` (#107105) - Make annotation util functions return annotated nodes (#107106) - Add dequantize operator duplication pass (#107900) - Add metadata porting for nodes added by quantization (#107107) - Move to BFS instead of DFS to check for connectedness (#108572) - Support `int16` quantization (#108453) - Support `cat` (#108382), `conv1d` (#109830) and `mul` (#110428) in `XNNPACKQuantizer` - Enable constant folding for quantize ops (#109343) - Add util function to convert scalars to attrs (#110427) - Support `cudnn_batch_norm` (#109908) and `miopen_batch_norm` (#110653) in QAT fusion - Preserve source_fn_stack after QAT fusion (#110899, #111515) - Cleanup observer insertion logic (#111828) (#112453) - Fix QAT conv-bn bias using `DerivedQSpec` (#112159) - Refactor QAT q-dq patterns (#112279) - Add "quantization_tag" as metadata to `fx.Proxy` (#108764) - Inductor cpp wrapper: support QLinear (#112378) - Enable QAT Quantization flow in `X86InductorQuantizer` (#111280) - Add `ConvBNAdd(ReLU)` Annotation (#111281), adaptive_avg_pool2d and flatten (#114442), Hardtanh and ReLU6 for conv2d (#114579) in `X86InductorQuantizer` - Enable `oneDNN` QConv (#112010) and QLinear (#112126) FP32/BF16 output - Enable `oneDNN` QConv2d with hardtanh post op (#114578) - Remove the output Annotation of Conv/Linear in x86InductorQuantizer (#112140) - Enable `quantize_per_tensor`/`quantize_per_channel` to accept `bfloat16` input (#112225) - Support quantized conv bias in QAT fusion (#112528) - Fix custom dtype per channel weight in QAT (#112612) - Support `allow_implicit_sharing` flag (#112929) - Add `transform_for_annotation` method in `Quantizer` (#113115) - Remove add/relu from conv-bn QAT pattern (#113006) - Rewrite QAT annotations using `SubgraphMatcherWithNameNodeMap` (#113709) - Support conv1d-bn QAT fusion (#113714) - `XNNPACKQuantizer` skip quantization for input and output to workaround histogram observer problem (#113405) - Add support for QAT dynamic quantization for linear in `XNNPACKQuantizer` (#113288) - Support Subclasses of `FloatFunctional` in `torch.ao.quantization.prepare` (#109646) - Enable pickling model prepared with QAT `QConfig` (#109288) - Suppress empty translation unit warning in QNNPACK (#111475) - `new_qtensor` support `privateuseone` allocator. (#111464) - Add support for `float8_e4m3fnuz` and `float8_e5m2fnuz` (#107586) - Overload vec::dequantize to eliminate rounding error for quantized sigmoid (#114098) ## NestedTensor API - Pickle support for NT (#110219) - Multiprocessing support for NT (#110292) - pin_memory support for NT (#110404) - Implement split_with_sizes backward for NT (#110647) - Multiprocessing support for NT (#110292) - Support for as_nested_tensor() with jagged layout + fixed nested_tensor() semantics (#112304) - Do not generate zero-numel NT by default in helper and improve to_padded_tensor msg (#113162) - Backward support for broadcasting binary ops (#112519) - Implement narrow from a regular tensor to jagged tensor (#112770) ## Distributed - c10d - Make TCPStore more robust to one-time interruptions. 
(#108425) - Add functional collective `all_to_all_single` and support it in Inductor (#110195) - Set `ProcessGroupNCCL` default timeout to 10 min (#110947) - Add an explicit `_shutdown` method to ProcessGroupNCCL (#111392) - Enable coalescing manager in `DETAIL` debug mode (#111878) - Avoid recording stream for all-gather, reduce-scatter, broadcast and scatter (#111431, #112896) - Relax tensor contiguity requirement for P2P ops (#114982) - Add .boxed() to `c10d::ProcessGroup` and `c10d::Work`'s pybind (#111997) - Make `init_process_group` timeout kwarg override `pg_options` (#112611, #113094) - Use allocator trace callbacks for `ProcessGroupNCCL` register (#112850) - Make `FakeProcessGroup` traceable (#113314) - Add Bfloat16 scalar support to gloo backend (#113557) - Opportunistically use `ncclCommSplit` when creating new NCCL groups (#114385) - Add API `_set_group_name` and `group_name` to track pg names in C++. (#108813) - Distributed Checkpointing (DCP): - Stateful Checkpointing for Distributed (#113867) - DistributedDataParallel (DDP): - Add an API to DDP for dynamically updating the underlying process group. (#113580, #114194) - DTensor - Supported convolution ops (#113123) - Add DTensor constructor: `randn` (#108285) - Add grad placements kwarg to `to_local` API (#110629) - Support `aten.where` and enabled implicit scalar promotion (#110584) - Support lt/gt op (#110585) - Enable DTensor TP in the inference mode (#110751) - Refactor Parallel Style and TP API to improve UX (#111160, #111166, #111176, #111346, #111353, #111625, #111521) - Enable embedding sharding in TP API (#111177) - Enable adagrad foreach support (#114151) - Enable RMSprop optimizer foreach support (#114152) - Introduce `full_tensor` API to DTensor (#112224, #113322) - Enable foreach operators for adam optimizer (#112108) - Don’t use make_fx for strategy propagation (#108262) - Add assert of shard dim to be less than tensor ndim (#112404) - Add `rand_like`, `randn_like`, `randint_like` ops to shard propagation (#112576) - Enable `min`, `max` and `prod` sharding propagation rules (#112403) - Add support for layer norm op in DTensor (#113105, #113244) - Add foreach_zero_ support (#113897) - Make `_Partial`, `Replicate` frozen dataclasses (#113919) - Make replicate -> partial DTensor do division instead (#110898) - Use new placements for neg dim in `redistribute` (#113924) - Use new placements for neg dim in `from_local` (#114134) - Use new placements for neg dim in `distribute_tensor` (#113930) - Ensure `grad_placements` was tuple (#113925) - Support Xla backend in distribute_tensor API (#110275) - FullyShardedDataParallel (FSDP): - Not materialized ignored modules for FSDP (#108032) - Not moved ignored params / buffers to device (#108033) - Make `checkpoint_wrapper` default to `NO_REENTRANT` (#108435) - Make `ModuleWrapPolicy` callable (#109117) - Enable `cpu_offload` config for optimizer state_dict (#108434) - Enable FSDP on CPU when GPU is still available (#112145, #112144) - Add `cpu_only` and `ranks_only` support for `_gather_state_dict` (#112836) - Implement `cpu_offload` and `full_state_dict` for `get_state_dict` (#112837) - TorchElastic: - Ensure grandchild processes are restarted correctly (#113231) - Avoid terminating parent process if exit code from child isn't valid (#111961) ## CPU - Use cpuinfo to determine c10::ThreadPool thread number (#107010) - Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132) - Add Half support for poisson and use float for Half cumulative 
distribution on CPU (#112124) - Remove memory efficient attention checks (#112375) - Enable THP for buffer sizes >=2MB (5a4f1363409) ## CUDA - Add lazy initialization for p2p access function (#108589) - Add support of CudaHostRegister (#108488) - Create per thread task pool for mapping memory space (#111545) - Add AMP support to linalg.vecdot. (#108165) - bfloat16 support in erfinv (#111257) - Add Bfloat16 support to CrossKernel.cu (#108941) - Add bf16 support to replicate padding (#112099) - Preserve operations order between vectorized and non-vectorized in ln grad input (#111488) ## Fx - Add mechanism for make_fx to not error on data-dependent-ops (#114129) - Preserve non-forward method during torch package serialization (#114702) - Add Graph input option for replace_pattern (#112409) - Allow preserving non-forward methods during deepcopy (#114849) - Replace node.meta source_fn with source_fn_stack (#108595) - Fix tree spec matching behavior (#109679) - Assert that output must be the last node of the FX graph (#114973) - Misc improvements to visualization + utility (#114984) - Add stylistic improvements for `fx.split_module` (#113373) ## Jit - Skip builtins while enumerating class methods (#91805) - Support lovelace for NVRTC (#87611) - Add expanded symbolic shape support (movedim) (#91696) ## MPS - Add complex support for fill, mul, add, sub (#111885, #111937, #108395, #108394) - Add support for sgn to MPS backend (#110829) - Add support for new activation functions (Mish, Softshrink) (#109786) (#110814) - Generalize toAccumulateType() (#108248) ## ONNX - torch->onnx export support: quantized::linear_relu (#109755) - Add Half for aten2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138) - Support None in fx.args as torchlib inputs (#108708) - Support attn_mask fp16 type (#110306) - A better way to safe guard 2GB model serialization (#111984) - Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404) - Add 'aten::rsub' type promotion (#113697) - Relax unsupported node analysis on complex dtype (#113785) ## ROCm - enable hipSparse const descriptors for version >= 2.4.0 (#110317) ## Vulkan - Improve binary operators to be able to handle the other argument being a 0-dim tensor (#109035) - Improve binary operators to automatically convert the other argument to float in order to handle mismatched input dtype (#114145) - Improve aten::addmm and aten::linear to be able to broadcast the bias argument (#108199) - Improve aten::layernorm to be able to handle 2D inputs (#110796) - Improve aten::slice to be able to return 0-size output (#112879) # Bug fixes ## Autograd API - Fix in-place custom autograd Functions to not fail when grad returned from backward is undefined (#108353) - Update custom Function preserve torch function when inputs returned as-is (#109825) - Do not error when printing view created in no-grad modified in-place in no-grad (#113716) - Fix `torch.autograd.gradcheck` when `fast_mode=True` and default device is set (#114560) - Fix `torch.prod` double backward when input tensor contains more than one zero (#113969) ## Cpp API - Check results dtype in index_out (#108167) - Add the appropriate check on div_value to the cpp frontend (#114671) - Add input check at the beginning for C++ API `interpolate` (#108506) - Fix the coredump described by #106702 (#108002) - Fix torch.nn.GRUCell segfaulting (#108340) - Add checks to `num_layers` for `RNN`, `LSTM`, `GRU` (#108853) - `torch::nn::AdaptiveLogSoftmaxWithLoss`: check length of `cutoffs` (#106777) ## Foreach API 
- Fix 0-size handling for real (#109402) ## Linalg API - Fallback to GEMM if mkldnn_matmul fails on aarch64 (#115936) - Preserve input's NaN values to prevent undefined behavior for `matrix_exp` function (#111539) ## NestedTensor API - Fix torch.load(..., weights_only=True) for NT (#112516) ## Optimizer API - `ReduceLROnPlateau` now subclasses `LRScheduler` (#113659) - Fix `adagrad` sparse handling due to incorrect early exit (#110454) - Solving pickle error when saving CyclicLR `state_dict` (#110931) ## Python API - Fix type checking of lazy submodule import (#109683) - Fix unhandled exceptions in `torch.{finfo,iinfo}` calls (#109743) - Fix torch.{size|stride}(dim=None)` (#111991) - Fix legacy typed storage warning line pointer (#113601) - Fix cpu detection error handling (#113771) ## Sparse API - Fix semi-structured sparse shape mismatch bug (#110420) ## torch.compile ### Dynamo - Add torch.distributed get_rank and get_world_size to constant_fold_functions (#109029) - Implement traceable torch.tensor when you have SymInt/SymFloat inputs (#109515) - Avoid throwing exception in ClosingTHPObjectPtr (#109758) - Fix inductor CI (by updating graph break count) (#110160) - Error if you try to run Dynamo compiled function under torch.jit.trace (#111321) - Adjust _list_with_default to also work with SymInt input (#113073) - Avoid eager imports of classes with custom VariableTrackers (#112319) - Uniformly use SourcelessBuilder to handle user defined types (#113390) - Register SymInt-aware meta function for mm out, symintify resize (#113202) - use sourceless builder for builtin getattr (#113340) - use sourceless builder for builtin getattr (#113340) - Don't toggle torch logger to NOTSET if it is not set; always use pre-existing (#113842) - Fix dict.get with no default (#115048) - Improve support for list subclasses (#115052) ### Inductor - Properly handle unbacked symint in various scenarios (#109603, #109609, #111803) - Using floating point 0 rather than integer 0 as default value for tl.load (#113047) - Avoid passing a None 'generator' argument to aten.rand which does not accept a generator argument (#112240) - Avoid recursion error because of accumulating too much computation in a pointwise IRNode (#112320) - Fix 0-sized views of tensors in cudagraphs (#109055) - Explicitly use the result's dtype for 'other' values in a masked load to avoid unexpected type promotion (#109325) - Work around triton issue that loading from int1 pointer returns int8 (#110388) - Avoid max-autotune benchmarking messing up the random number generator (RNG) state (#111381) - Fix an out of shared memory issue by avoiding a single invalid triton config causes fatal problem (#112916) - Make TORCH_COMPILE_DEBUG=1 work again (#112917) - Fix inductor <> ddp_optimizer issue (#108081) - Fix visualize_overlap for Inductor comm reordering (#113066) - Fix cat decomp that the first tensor was returned if it is empty and there is only one non-empty tensor (#113514) - Correctly codegen math.inf in Inductor (#114159) - Do not promote int to float for torch.mm (#115043) - Dont pad broadcasting bias dimension in pad mm (#115098) - Bug fix for the CPU backend - Fix argmax with >1 reduction dims (#113168) - Fix add/sub with uint8 dtype to avoid unexpected type promotion (#113253) - Fix non-contiguous reduction store (#113261) - Bug fix for the Fx pattern matching passes - Fix a bug in the merge getitem cat pattern (#113822) - Fix shape mismatch in SDPA pattern matcher (#115038) - AOTInductor - make AOTInductor work with pip installed 
torch (#108319) - Fixing a redefining symbol bug (#110041) - Make freezing work with AOTInductor (#110055) - Make a free function in AOTInductor header file inline to avoid redefining symbol error (#110445) - Fix a weight loading issue when the weight size can be 0 (#114280) - Handle empty input args (#114682) ### torch.export - Fix how pass base uses constant fake tensors (#111140) ## torch.func API - Fix vmap support for `torch.real, torch.imag (#110508) - Fix vmap support for `torch.isfinite`, `torch.isreal`, and `torch.log_sigmoid` (#110896) - Fix vmap support for `torch.movedim`, `torch.tensor_split`, `Tensor.to`, `to.*` (#110999) - Fix vmap support for `torch.flatten`, `torch.linalg.*`, `torch.linear`, `torch.log_softmax`, `torch.logdet`, `torch.special.*` (#110985) ## torch.nn API - Fix precision issues for `torch.nn.LayerNorm` on CPU (#108089) - Madforward outputs of type `collections.namedtuple` are preserved instead of being changed to `tuple` when there are backward hooks on `nn.Module` (#112433) - Fixug in mem_eff kernel with attention mask and MQA (bc244ee2cdc) - Fix allowed dtypes for CUDA devices less than SM80 for `memory_efficient_attention` (#116026) - Enfced that both input tensors to `nn.CosineEmbeddingLoss` have the same size (#112782) - Fix type hints for `nn.Module.to` (#108767) - Fix `torch.nn.utils.rnn.pad_sequence` type hint to allow sequences to be an iterable (#108765) - Fix `num_batches_tracked` of `nn.BatchNorm{*}D` in load_state_dict to not be reset to 0 if the state_dict does not contain `num_batches_tracked` (#110850) - Fix `convert_sync_batchnorm` to return `SyncBatchNorm` layer with same training flag as BatchNorm layer being converted (#111998) - Fix 64-bit indexing support for cross-entropy CUDA kernels (#112096) ## Build - Fix finding Intel MKL, LAPACK, cuDNN and cuSPARSELt on Windows (#108040) - Fix ppc64le clang compilation errors (#106446) - Compile FBGEMM with ASAN (#111266) ## Composability - FakeTensors and meta tensors are used to perform shape propagating when tracing out a graph in torch.compile. There were a number of op coverage improvements this release: - Bugfixes to several aten operator meta implementations. - `index_select.out` (#111364) - `meta_randperm` (#109721) - `qlinear_pointwise` (#112390) - Other meta bugfixes: (#108988, #113634, #108989, #111705, #113635) - There were several bugfixes to our python decompositions and reference implementations of our aten operators this release: - Operator specific - Several FFT operators (#108360, #109083) - `threshold_backward` (#110689) - `aten.add` with int32 + scalar input (#113965) - `aten.add` non-contiguous out handling (#111758) - `aten.normal` with strided layout input (#111205, #112467) - `aten.baddbmm` (#109714) - `index_add` (#108826) - `aten.split_with_sizes` (#113984) - General bugfixes - fix infinite loop with primtorch and .to(meta) (#109632) - Fix registering jit decompositions for jvp for out wrapped decomps (#109367) - Fix python decomps for OpOverloadPackets and add tests (#107707) - fix issue with lift_fresh_copy when using export + compile (#108243) - Removed spurious warnings from calling `torch.overrides.get_overridable_functions` (#109890) ## CPU - Fix NULL dereference in binary CPU ops (e57f089704a) - Fix cpuinfo related crash on ppc64 (#110708) ## CUDA - Release GIL in torch.cuda ops wherever possible. 
(#109159) - Skipped CUDA Flags if C++ Extension Name includes "arch" Substring (#111211) - Don't set CUDA_HOME when not compiled with CUDA support (#106310) ## Distributed - C10d - Fix gloo cuda `sparse_allreduce` dispatch (#111485) - Add timeout for master store if clients do not join (#111805) - Add `cuda` to MPI backend capabilities (#109614) - Fix `send()`/`recv()` to make them respect the timeout same as non-p2p collectives (#109611) - Change default `NCCL_ASYNC_ERROR_HANDLING` to `3:SkipCleanUp` to avoid calling ncclCommAbort which in some cases hangs (#110723) - Distributed Checkpointing (DCP): - Fix `torch.cpu` has no attribute `current_device` in `checkpoint/optimizer.py` (#110299) - DTensor: - Fix `DTensor.from_local()` returns DTensor with wrong size for uneven sharded tensor (#110781) - Make DTensor handle negative dim correctly and fixed TP regression (#111750) - Fix pointwise op strategy linearity (#112107) - Fix empty shape init for DTensor constructors (#115091) - FullyShardedDataParallel: - Fix non-Node 0 unable receive parameters from Node 0 for HSDP (#108331) - Add device to `_shard_utils.py` to explicitly use the correct device from `fsdp_state` (#109631) - Propagate `requires_grad` attribute to unsharded params (#109892) - Fix logics for fsdp exec order pre fwd record (#110138) - Move local optimizer state to FSDP `compute_device` (#110929) - Fix the FSDP to not reshard parameters twice (#110948) - Fix FSDP to reset prefetch flag upon reshard (#111354) - Fix FSDP when `SHARD_GRAD_OP` and `forward_prefetch` is turned on (#110139) - Fix FSDP `summon_full_params(..., with_grads=True)` when grad precision is not fp32 (#112746) - Fix fsdp state_dict to use run_check=False (#114995) - Fix pylance issues for torch.distributed.fsdp (#109922) - RPC: - Fix assertion on vector length during message parsing (#108414) ## Fx - Fix `functorch.compile.minifier` error of “'Node' object is not iterable” (#103011) - Skip mode issue in minimizer (#109399) - Skip the Tensor node in `__annotations__` (#109853) - Fixed dict size change during iteration error (#111267) - Made sure fx code is valid in python (#113345) - Updated symbolic_trace’s nn_module_stack format (#114422) - Fixed missing meta for proxy.node (#114659) - Correctly restore pybind11 error_already_set (#93238) - Remove proxy tensor's check for data dependent output (#93265) - Make ShapeEnv deepcopy-able (#93403) - Fix SubgraphMatcher for case of no anchor found (#86421) - Fix for partitioner with symbolic shapes (#86425) - Fix getitem in partitioner and make metadata storage more consistent (#87012) - Fix magic method try reverse protocol (#88030) - Fix FakeTensorProp on Module with Parameters or Buffers (#88700) - Fix PassManager to not use a class variable mutable list (#89108) - Prevent tracing when we track_tensor_tree (#89139) - Make all `make_fx` invocations isolated (opaque to higher `make_fx` invocations) by default (#93290) - Fix matching args in PatternMatcher (#94375) - Allow FakeTensorProp to run on graphs traced with some None inputs (#94569) - Copy codegen in legalize_graph (#90023) - Fix proxy unwrapping for cond() (#91907) ## Jit - Fix `optimize_for_inference` to support modules that don't have a forward method (#110013) - Fix errors found by fuzzing and sanitizers (#108417, #108413, #110303, #110441) - Fix deprecated python usage for python 3.12 in TorchScript (#113981) - Support newer versions of LLVM (#110200, #113455) ## Lazy - Fix error when inferring shape in `AdaptiveAvgPool3d` (#109822) ## Mps - Fix and 
refactor unary/binary ops with non-zero offset or non-contiguous output (#97085) - Fix memory leak in copy_from_mps_ (#114197) - Fix crash if nonzero is called concurrently (#108996) - Fix nll_loss with default ignore_index (#109574) - Fix sort with empty tensor. (#109584) ## ONNX - Fix module attribute retrieval in ONNX export (#109759) - Add dynamic input support for MaxPool (#113318) - Fix op-level debug for complex dtype (#114885) - Fix indexing for meshgrid op (#109350) - Fix torch.diagonal for torch.onnx.export when dim1<0 or dim2<0 (#111130) - Fix scope name when parent scope is empty for torch.onnx.export (#112654) - Cast ‘scale’ back to float16 after _attention_scale. (#112554) - Fix aten::layer_norm for ONNX opset 17 (#114058) - Add support for negative dim in _index_fill_reshape_helper (#114050) - Disable opmath type promotion (#113780) ## Profiler - Fix missing dependency in torch.utils.tensorboard (#115598) (#115598) - Fix torch.utils.benchmark API while use privateuse1. (#108548) - Ignore some properties when symbolic size/strides exist (#112458) - Fix description to use nelems rather than size (#114735) - Use PyCFunction_Check to check both PyCMethod_Type and PyC… (#110002) - Disable CUPTI Teardown when using CUDA Graphs (#112507) - Fix the Chrome trace loading issue with all_to_all input split length > 30 (#113392) ## Quantization - Make mutation test work with quantized tensors (#108935) - Support transitive sharing for SharedQuantizationSpec (#111172) ## Releng - Fix focus builds of macOS apps on apple silicon. (#96966) (#107816) - Use jemalloc for CUDA builds (#116900) (#116900) ## Visualization - Fix TensorBoard summary writer encoding for torch.bfloat16 tensors (#108351) ## Vulkan - Fix for a bug in aten::sum.dim_IntList where providing negative dims in the opt_dim argument and setting keepdim=false results in wrong inputs (#111586) # Performance ## Autograd API - Avoid saving input for `torch.mean` backward (#109935) ## Cpp API - Add ScalarTensor or 0dim overload for _foreach_add (#111079) - Vectorize torch.exp2 on CPU and add complex support (#92115) - Add various performance fixes to c++ STL usage (#94034) ## Linalg API - Speedup `torch.matrix_exp` performance (#110848) - Improve speedup of `cholesky_solve_backward` using output_mask (#112981) ## NestedTensor API - Reduce overhead in split and chunk for NestedTensor (#108213) ## Optimizer API - Use for loop with shortcut in `Optimizer`s to speedup inductor against list comprehensions (e.g. complex conversion) (#110613, #112722) - Speed up dynamo tracing of optimizer by shortcutting is_sparse iteration in foreach SGD (#110648) ## Sparse API - Add NVIDIA A100 optimized meta parameters to bsr_dense_mm (#111760) - Improve triton bsr_dense_mm performance on column-major ordered inputs with float32 dtype (#108512) - Add bsr_dense_addmm triton kernel (#114595) - Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize is 16. 
(#111489) ## torch.compile API ### Inductor - Support convolution layout optimization for models with SDPA (#112045) - Scaling down XBLOCK/RBLOCK to increase occupancy for kernels exposing large parallelism and having register pressue (#109275, #109839, #113039, #114284) - Horizontal fusion for concat (#111437) - Avoid an extra memory copy for views on ExternKernelAlloc (#108635) - More memory and performance efficient implementation for Conv+BatchNorm block in eval mode (#109398, #109722) - Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233) - Add a way to force fusion of int_mm with mul (#111413) - Add a heuristic to multi-layer reduction to increase the chance that the first splitted reduction can have compatible shape to fuse with a preceding reduction (#111781) - benchmark fusion: either use this to skip slow fusions or analyze patterns from slow fusions (#112450) - optimize sympy expression where div denominator is -1 (#112878) - Use different conv layout optimization heuristics for inference (#114600) - Add or improve Fx pattern matching passes - pre grad batch relu fusion (#111146) - new split cat pattern detection (#110923) - Add split-stack-tahn-unbind pattern detection (#111854) - Remove split nodes with split section size one (#112922) - Normalize nodes created by users (#113179) - post_grad batched linear fusion (#112504) - More SDPA patterns (#109156, #110001) - Horizontally fusing two matmuls in freezing phase (#111232) - A bunch of pattern matcher + indexing fixes (#112476) - CPU Backend - Fallback scatter_add to eager on CPU to avoid bad perf (#108220) - Make OneDNN matmul inputs contiguous to avoid degraded performance (#108560) ## torch.func API - Add vmap batching rule for: `bitwise operators` (#91971), `nansum` & `nanmean` (#91372), `all` & `any` (#91966), `torch.linalg.vander` (#91749), `slogdet` (#86815), `torch.index_fill` (#91364), `narrow_copy` (#88130), `view_copy` (#88150), `greater_equal.Scaler` (#91324) ## CPU - S390x complex division (#108516) - Add Half support for CPU autocast on eager mode (#112484) - Add scalar conversion using avx instructions for half (#102140) ## CUDA - Release the allocator lock on the slow path (#108367) - Faster gc_count update for CUDACachingAllocator (#108071) - baddmm should fall back to addmm for batch=1 (#114992, #114992) - Speed-up casts to FP8 (#110251) - int4 mm kernel enhancement (#111460) - vectorized implementation for layer_norm_grad_input_kernel (#111021) ## Distributed - c10d: - Push TCPStore scalability further by staggering client connection and increasing the backlog to 16k. (#109217) - Make the minimum wait time in `_store_based_barrier` to be adaptative based on the number of ranks. 
(#109218) - DTensor: - Fix and improve the sharding cache behavior (#109306, #109428) - Switch DTensor and Functional Collective to use optree (#110670) - Skip move to device when `device_type` match (#110774) - Skip pytree when not necessary (#110132) - Introduce cost model for sharding (#109145) - Group dispatch unwrapping to a method (#113846) - Cache hash for `DTensorSpec` (#113915) - Compute and recompute `DTensorSpec` hash lazily (#114322, #114379) - Reduce to one `isinstance` call in `is_shard` (#114140) - FullyShardedDataParallel (FSDP): - Fuse allgather for `optim_state_dict` when `use_orig_params` is True (#108298) - Make the new optimizer allgather fusion work with fine-tuning models (#110540) - Skip the parameter in optim state dict if the parameter does not belong to the current FSDP instance (#112804) ## Fx - Use deque instead of list for BFS (#91139) - Refactor the dfs cyclic search from recursive to iterative approach (#91042) ## Vulkan - Improve matrix multiplication shader performance by up to 50% through new packing schemes and algorithmic improvements (#112918, #113627, #113883, #113943) # Documentation ## Autograd API - Improve docstring issues in various places (#113266) ## Dataloader API - Add and update docstrings for `torch.utils.data` (#112244, #112817, #112765) ## Linalg API - Fix typo in example of torch.linalg.solve_triangular (#112361) - Remove duplicate sentences in description of torch.linalg.eig (#108230) - Fix bug in matrix_power documentation (#108585) ## Optimizer API - Update documentation for `PolynomialLR` (#110151) - Clarify `maximize` option in optimizer.py (#112724) - Fix docstring errors inside `torch/cuda/` and `torch/optim/` (#112964) ## Python API - Fix docstring issues in torch.utils (#113335) - Add docstring to `Timer.adaptive_autorange` (#111612) - Fix a typo in `torch.cholesky_inverse` documentation (#110364) - Document `torch.from_file` and `UntypedStorage.from_file` properly (#111688) - Clarify difference between `Tensor.share_memory_` and `torch.from_file` (#111856) - Improve `torch.unique` docs (#113424) - Fix `torch.lgamma` docs (#108719) - Update `torch.take_along_dim` docs to include `dim=None` case (#109120) - Fix `torch.searchsorted` docs (#109364) - Clarify `torch.multinomial` usage (#112892) ## torch.compile API ### Inductor - Add a tutorial for AOTInductor (#112457) - Add document for cudagraph_mark_step_begin API (#111722) ### `torch.export` API - Add `torch.cond` doc (#108691) - Add ir spec (#110394) - Update docs to say that export returns full ATen IR (#111161) ## torch.func API - Fix per-sample-grads notebook (#107988) ## torch.nn API - Add examples for `nn.CosineEmbeddingLoss`(#108215) - Fix `attn_bias` in code block in `scaled_dot_product_attention` documentation (#109086) - Add documentation for `torch.nn.utils.parametrizations.weight_norm` (#113783) - Improve type annotation for device parameters when a device ordinal is allowed (#113647) - Update `scaled_dot_product_attention` documentation to point to flash-attn-v2 (#114124) - Fix extending torch native API docs (#114863) ## Build - Fix doc preview page url at CONTRIBUTING.md (#108580) - Fix typo in cpp/installing when wheel is used (#111143) ## Composability - Fix ScalarTensor **repr** in Extending PyTorch example (#86330) - Fix incorrect wrapping of function decorator (#94446) - Add **all** to torch.{autograd, fx, cuda} submodules (#85343) ## CUDA - Show CUDAExtension example commands as code (#112764) - Rewrite docs to describe CUDACachingAllocator semantics 
(#113282) ## Distributed - c10d: - Fix TCPStore doc for arg `wait_for_workers` (#111807) - Fix the warning messages when `avoidRecordStreams_` so that correct name of Environment variable is shown in the warning message (#108759) - Fix an incorrect indent in documentation for `dist.send` (#108273) - Fix warnings and descriptions about distributed exceptions in the logging section of PyTorch Distributed document (#110157) - Fix `batch_isend_irecv` example incorrect usage (#110408) - Clarify the behavior of `apply_optimizer_in_backward` in the document (#110903) - Correct docstring errors for torch.distributed files (#112735, #113511, #113523, #112693, #113241, #113216) - Print `NCCL_SUFFIX` in NCCL version log at PG init (#112560) - Distributed Checkpointing (DCP): - Fix the comment `no_dist` for in `load_state_dict` (save -> load) (#112217) - Improve DDP checkpoint documentation (#106985) - DistributedDataParallel (DDP): - Fix import in DDP notes (#111833) - Add errors when using `_dynamo.optimize_ddp=True` and `_inductor.keep_output_stride=False` together inside DDP (#108235) - DTensor: - Improve TP documentation (#115880, #115974) - FullyShardedDataParallel (FSDP): - Fix docstring of `FSDP.optim_state_dict_to_load` to reflect right ctors (#108383) - Fix docstring for `FSDP.set_state_dict_type` to contain missing Args (#103864) - Remove "on CPU" in the comment of FSDP initialization doc (#113753) - TorchElastic: - Fix a typo in `rendezvous/registry.py` (#111352) - RPC: - Fix `torch.distributed.rpc` example incorrect usage (#112367) - Activation checkpointing - Clean up comments in activation checkpoint (#86622) - Distributed (c10d) - Improve documentation for various functions (#87018, #94543, #91116,#89905, #86438 ) - DistributedDataParallel - Improve Documentation (#86221, #91832) - RPC - Fix non-existing parameters in docstrings in benchmarks (#91115) - Tensor parallelism and DTensor: - Add more clarifications and fix errors in tensor parallelism docs (#94786) - Update 2D parallelism API naming and docs (#94771) - FullyShardedDataParallel - Add docs to explain the running the forward pass of of submodules in FSDP (#86343) - Clarify warnings to mention collectives (#87478) - Remove HSDP Zero-2 from doc (#90503) - Improve the comments for FSDP (#92359) - Distributed Checkpoint - Enable documentation for Distributed Checkpoint. 
(#92813) - Torch Elastic - Fix a minor typo in documentation (#90667) - Fix `torch.distributed.run` init connect timeout by comparing `host` with the current IP list (#90221) ## Mps - Resolve docstring errors (#113311) ## ONNX - Update exporter issue report instructions for quantized models (#113494) - Fix sample code in onnx_dynamo.rst (#114770) ## Profiler - Improve the docstring for export_memory_timeline (#110949) - Improve torch/csrc/profiler/README.md - stubs, RecordFunction, Autograd interaction (#108470) ## Quantization - Add pt2 export quantization to main doc (#110260) - Add x86 inductor quantization docs (#112648) - Use \odot everywhere instead of mixing \odot and * for the Hadamard product (#111763) - Add documentation for `prepare_pt2e`, `prepare_qat_pt2e` and `convert_pt2e` (#110097) - Docstyle fix for some quantization code (#112992) - Updating docs for embedding_bag support for fx and eager (#107623) - fix docstring errors in quantized modules (#112695) # Security ## Releng - Use secure setup-ssh action from test-infra (#111922) - Automate passing Conda PyTorchBot Test Token for release (#111821) - Migrate MacOS wheel binary builds to ephemeral M1 runners (#110432)

PyTorch 2.1.2 Release, bug fix release (2023-12-15)

This release is meant to fix the following issues (regressions / silent correctness):

- Fix crashes for float16 empty tensors (https://github.com/pytorch/pytorch/pull/115183)
- Fix MPS memory corruption when working with tensor slices (https://github.com/pytorch/pytorch/pull/114838)
- Fix crashes during Conv backward pass on MPS devices (https://github.com/pytorch/pytorch/pull/113398)
- Partially fix nn.Linear behavior on AArch64 platform (https://github.com/pytorch/pytorch/pull/110150)
- Fix cosine_similarity for tensors of different sizes (https://github.com/pytorch/pytorch/pull/109363)
- Package missing headers needed for extension development  (https://github.com/pytorch/pytorch/pull/113055)
- Improve error handling of `torch.set_num_threads` (https://github.com/pytorch/pytorch/pull/113684)
- Fix profiling traces generation (https://github.com/pytorch/pytorch/pull/113763)


The cherry-pick tracker https://github.com/pytorch/pytorch/issues/113962 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.1.1 Release, bug fix release (2023-11-15)

This release is meant to fix the following issues (regressions / silent correctness):

* Remove spurious warning in comparison ops  (#112170)
* Fix segfault in foreach_* operations when input list length does not match (#112349)
* Fix cuda driver API to load the appropriate .so file (#112996)
* Fix missing CUDA initialization when calling FFT operations (#110326)
* Ignore beartype==0.16.0 within the onnx package as it is incompatible (#111861)
* Fix the behavior of torch.new_zeros in onnx due to TorchScript behavior change (#111694)
* Remove unnecessary slow code in `torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict` (#111687)
* Add `planner` argument to `torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict` (#111393)
* Continue if a param does not exist in sharded load in `torch.distributed.FSDP` (#109116)
* Fix handling of non-contiguous bias_mask in `torch.nn.functional.scaled_dot_product_attention`  (#112673)
* Fix the meta device implementation for `nn.functional.scaled_dot_product_attention` (#110893)
* Fix copy from mps to cpu device when storage_offset is non-zero (#109557)
* Fix segfault in `torch.sparse.mm` for non-contiguous inputs  (#111742)
* Fix circular import between Dynamo and einops (#110575)
* Verify flatbuffer module fields are initialized for mobile deserialization (#109794)

The cherry-pick tracker https://github.com/pytorch/pytorch/issues/110961 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing (2023-10-04)

# PyTorch 2.1 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security

# Highlights
We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and `torch.export`-based quantization.
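As a rough, hedged illustration of the prototype `torch.export` mentioned above, the sketch below exports a toy module to an exported program; the module, shapes, and printed output are illustrative choices, and prototype signatures may change in later releases.

```python
# Minimal sketch of the prototype torch.export API (PyTorch 2.1);
# the toy module and shapes are illustrative, not from the release notes.
import torch
from torch.export import export


class SmallModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


example_inputs = (torch.randn(2, 8),)
exported_program = export(SmallModel(), example_inputs)  # sound full-graph capture
print(exported_program)  # inspect the captured ATen-level program
```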

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog. 

This release is composed of 6,682 commits from 784 contributors since PyTorch 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.

Summary: 
- `torch.compile` now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes (see the `torch.compile` sketch after this list).
- `torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology (see the checkpointing sketch after this list).
- `torch.compile` can now compile NumPy operations by translating them into PyTorch-equivalent operations (also shown in the `torch.compile` sketch after this list).
- `torch.compile` now includes improved support for Python 3.11.
- New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
- `torch.export`, a sound full-graph capture mechanism, is introduced as a prototype feature, along with `torch.export`-based quantization.
- `torch.sparse` now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.
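
To make the `torch.compile` bullets above concrete, here is a minimal sketch, assuming a toy row-normalization function and a toy NumPy function (both invented for illustration): with automatic dynamic shapes, varying the batch size is expected to trigger at most one recompilation with a dynamic dimension, and NumPy calls inside the compiled region are translated to PyTorch operations.

```python
# Sketch of automatic dynamic shapes and NumPy support in torch.compile;
# the functions below are toy examples, not taken from the release notes.
import numpy as np
import torch


@torch.compile
def scale_rows(x):
    return x / x.sum(dim=-1, keepdim=True)


# The first call compiles for the observed shape; a later call with a
# different batch size typically triggers a single recompilation with a
# dynamic batch dimension, which subsequent batch sizes then reuse.
scale_rows(torch.randn(4, 16))
scale_rows(torch.randn(8, 16))
scale_rows(torch.randn(32, 16))


@torch.compile
def numpy_row_norms(a):
    # NumPy operations here are traced and mapped to PyTorch equivalents.
    return np.sqrt((a ** 2).sum(axis=-1))


print(numpy_row_norms(np.random.randn(8, 3)))
```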
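And for the `torch.distributed.checkpoint` bullet, a minimal sketch of parallel save and load, assuming a process group is already initialized (for example under `torchrun`) and using a hypothetical checkpoint directory; it uses the 2.1-era `save_state_dict`/`load_state_dict` entry points, which may differ in later releases.

```python
# Hedged sketch of torch.distributed.checkpoint (DCP) usage; run with an
# initialized process group. The path and model variable are hypothetical.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter

CHECKPOINT_DIR = "/tmp/dcp_checkpoint"  # hypothetical location


def save_checkpoint(model):
    state_dict = {"model": model.state_dict()}
    # Each rank writes the shards it owns, in parallel with the other ranks.
    dcp.save_state_dict(state_dict, storage_writer=FileSystemWriter(CHECKPOINT_DIR))


def load_checkpoint(model):
    state_dict = {"model": model.state_dict()}
    # Loading is done in place into state_dict and can reshard across a
    # different number of ranks than the checkpoint was saved with.
    dcp.load_state_dict(state_dict, storage_reader=FileSystemReader(CHECKPOINT_DIR))
    model.load_state_dict(state_dict["model"])
```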



| Stable | Beta | Prototype | Performance Improvements |
| --- | --- | --- | --- |
|  | Automatic Dynamic Shapes | torch.export() | AVX512 kernel support |
|  | torch.distributed.checkpoint | torch.export-based Quantization | CPU optimizations for scaled-dot-product-attention (SDPA) |
|  | torch.compile + NumPy | semi-structured (2:4) sparsity | CPU optimizations for bfloat16 |
|  | torch.compile + Python 3.11 | cpp_wrapper for torchinductor |  |
|  | torch.compile + autograd.Function |  |  |
|  | third-party device integration: PrivateUse1 |  |  |
*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing). For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release. # Backwards Incompatible Changes ### Building PyTorch from source now requires C++ 17 (#100557) The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension. ### Disable `torch.autograd.{backward, grad}` for complex scalar output (#92753) Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call `.real` on your complex outputs before calling `.grad()` or `.backward()`. #### Example ```python def fn(x): return (x * 0.5j).sum() x = torch.ones(1, dtype=torch.double, requires_grad=True) o = fn(x) ``` #### 2.0.1 ```python o.backward() ``` #### 2.1 ```python o.real.backward() ``` ### Update non-reentrant checkpoint to allow nesting and support `autograd.grad` (#90105) As a part of a larger refactor to `torch.utils.checkpoint`, we changed the interaction activation checkpoint and `retain_graph=True`. Previously in 2.0.1, recomputed activations are kept alive if `retain_graph=True`, in PyTorch 2.1, non-reentrant impl now clears recomputed tensors on backward immediately upon unpack, even if `retain_graph=True`. This has the following additional implications: (1) Accessing `ctx.saved_tensor` twice in the same backward will now raise an error. (2) Accessing `_saved_tensors` multiple times will silently recompute forward multiple times. #### 2.1 ```python class Func(torch.autograd.Function): @staticmethod def forward(ctx, x): out = x.exp() ctx.save_for_backward(out) return out @staticmethod def backward(ctx, x); out, = ctx.saved_tensors # Calling ctx.saved_tensors again will raise in 2.1 out, = ctx.saved_tensors return out a = torch.tensor(1., requires_grad=True) def fn(x): return Func.apply(x) out = torch.utils.checkpoint(fn, (a,), use_reentrant=False) def fn2(x): return x.exp() out = torch.utils.checkpoint(fn2, (a,), use_reentrant=False) out.grad_fn._saved_result # Calling _saved_result will trigger another unpack, and lead to forward being # recomputed again out.grad_fn._saved_result ``` ### Only sync buffers when `broadcast_buffers` is True (#100729) * In PyTorch 2.0.1 and previous releases, when users use DistributedDataParallel (DDP), all buffers were synced automatically even if users set flag `broadcast_buffers` to be `False`: ```python from torch.nn.parallel import DistributedDataParallel as DDP module = torch.nn.Linear(4, 8) module = DDP(module) # Buffer is synchronized across all devices. module = DDP(module, broadcast_buffers=False) # Buffer is synchronized across all devices. ... ``` * Starting with PyTorch 2.1, if users specify the flag `broadcast_buffers` to be `False`, we don’t sync the buffer across devices: ```python from torch.nn.parallel import DistributedDataParallel as DDP module = torch.nn.Linear(4, 8) module = DDP(module) # Buffer is synchronized across all devices. module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices ... 
``` ### Remove store barrier after PG init (#99937) * In PyTorch 2.0.1 and previous releases, after we initialize PG, we always call store based barrier: ```python from torch.distributed.distributed_c10d import init_process_group init_process_group(...) # Will call _store_based_barrier in the end. ... ``` * Starting with PyTorch 2.1, after we initialize PG, the environment variable `TORCH_DIST_INIT_BARRIER` controls whether we call store based barrier or not: ```python from torch.distributed.distributed_c10d import init_process_group import os os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior init_process_group(...) # Will call _store_based_barrier in the end. os.environ["TORCH_DIST_INIT_BARRIER"] = "0" init_process_group(...) # Will not call _store_based_barrier in the end. ... ``` ### Disallow non-bool masks in `torch.masked_{select, scatter, fill}` (#96112, #97999, #96594) Finish the deprecation cycle for non-bool masks. Functions now require the `dtype` of the mask to be `torch.bool`. ```python >>> # 2.0.1 >>> inp = torch.rand(3) >>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8) >>> torch.masked_select(inp, mask) UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.) torch.masked_select(inp, mask) >>> torch.masked_select(inp, mask.to(dtype=torch.bool)) # Works fine >>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool) >>> torch.masked_select(inp, correct_mask) # Works fine >>> # 2.1 >>> inp = torch.rand(3) >>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8) >>> torch.masked_select(inp, mask) RuntimeError: masked_select: expected BoolTensor for mask >>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool) >>> torch.masked_select(inp, correct_mask) # Works fine >>> torch.masked_select(inp, mask.to(dtype=torch.bool)) # Works fine ``` ### Fix the result of `torch.unique` to make it consistent with NumPy when `dim` is specified (#101693) The `dim` argument was clarified and its behavior aligned to match the one from NumPy to signify which sub-tensor to consider when considering uniqueness. See the documentation for more details, https://pytorch.org/docs/stable/generated/torch.unique.html ### Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000) Prior to this change, for `torch.nn.functional.grid_sample(mode='nearest')` the forward 2D kernel used `std::nearbyint` whereas the forward 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both used `std::round`. This PR fixes the inconsistencies to use `std::nearbyint` which rounds values that are exactly <>.5 to the nearest even which is consistent with the behavior of `torch.round`. Unnormalized indices that are exactly <>.5 will now be rounded to the nearest even instead of being rounded away from 0. ### Turned input shapes (aka `record_shapes`) off by default for on-demand tracing (#97917) Profiler traces collected by on-demand tracing via IPC Fabric will have `record_shapes` off my default. * In v2.0.1: By default, profiler trace files’ `cpu_op` activities will contain metadata fields: Input Dims, and Input type. * In v2.1.0: By default, profiler trace files’ `cpu_op` activities will no longer contain metadata fields for input shapes. 
If turned on via Kineto config, it will show metadata fields: Input Dims, Input type and Concrete Inputs. ### When called with a 0-dim tensor input, **`torch.aminmax`** would previously inconsistently return a 1D tensor output on CPU, but a 0D tensor output on CUDA. This has been fixed, so we consistently return a 0D tensor in both cases. (#96171). In v2.0.1: ```python >>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True) __main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.) torch.return_types.aminmax( min=tensor([1]), max=tensor([1])) >>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False) torch.return_types.aminmax( min=tensor(1), max=tensor(1)) ``` In v2.1.0: ```python >>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True) torch.return_types.aminmax( min=tensor(1), max=tensor(1)) >>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False) torch.return_types.aminmax( min=tensor(1), max=tensor(1)) ``` ### Change to the default behavior for custom operators registered to the dispatcher, that do not have anything registered to an Autograd dispatch key If you have a custom operator that has a CPU/CUDA kernel registered to the CPU/CUDA dispatch key, but has no implementation at the Autograd key, then: **Old behavior:** When calling this operator with tensor inputs that require gradients, the tensor outputs would silently not require gradients. **New behavior:** When calling this operator with tensor inputs that do require gradients, the tensor outputs would require gradients (as long as the outputs are floating-point or complex), and will error if you try to backpropagate through them. There is more information on how to recover the old behavior in the PR: (#104481, #105078) ### `torch.autograd.Function` Raise an error if input is returned as-is and saved for forward or backward in `setup_context` (#98051) If you are writing a custom autograd Function and you have implemented your autograd Function using `setup_context`, and if your forward function returns an input as-is as output, then saving that tensor for forward or backward now raises an error. You should return an alias of the input instead. #### 2.0.1 ```python class Cube(torch.autograd.Function): @staticmethod def forward(x): return x ** 3, x @staticmethod def setup_context(ctx, inputs, outputs): cube, x = outputs ctx.save_for_backward(x) @staticmethod def backward(ctx, grad_output, grad_x): # NB: grad_x intentionally not used in computation x, = ctx.saved_tensors result = grad_output * 3 * x ** 2 return result ``` #### 2.1 ``` class Cube(torch.autograd.Function): @staticmethod def forward(x): return x ** 3, x.view_as(x) ... ``` # Deprecations ### Deprecate not specifying the `use_reentrant` flag explicitly when using `torch.utils.checkpoint` (#100551) In PyTorch 2.1, if the `use_reentrant` flag is not explicitly passed, a warning is raised. To retain current behavior, pass `use_reentrant=True`. The default value will be updated to `use_reentrant=False` in the future. We recommend using `use_reentrant=False`. 
#### 2.1 ``` torch.utils.checkpoint(fn, (a,)) # Warns in 2.1 ``` ### Deprecate `torch.has_*` attributes (#103279) Use the version in the particular backend module at `torch.backends.*` to access these flags. Also note that we now properly differente `is_built()` (compile time availability) and `is_available()` (runtime availability) in these modules. ### Deprecate `check_sparse_nnz` argument for `torch.autograd.gradcheck` (#97187) #### 2.0.1 ``` torch.autograd.gradcheck(fn, inputs, check_sparse_nnz=True) ``` #### 2.1 ``` torch.autograd.gradcheck(fn, inputs, masked=True) ``` ### `NVFuser` integration with `TorchScript` is deprecated (#105185) `NVFuser` replaced Neural Network Compiler (NNC) as the default GPU fuser for TorchScript in PyTorch 1.13. In PyTorch 2.1, TorchScript switched its default fuser back to NNC. Additionally, `NVFuser` for TorchScript is now deprecated. Currently, users can still manually choose to use `NVFuser` instead of NNC, see [fuser options](https://github.com/pytorch/pytorch/blob/v2.1.0-rc3/torch/csrc/jit/OVERVIEW.md#fusers) for details on how to do this. # New features ## Release Engineering - Adding AArch64 wheel builds (#104109) - CUDA 12.1 support for PyTorch binaries (#107295) - Compile PyTorch on M1 natively (instead of cross compiling from x86) (#95719) - Enable UCC Distributed Communication Backend in CI (#100395) ## Python Frontend - Enable `torch.device` to be used as a context manager to change the default device (#106514) - Add `torch.Tensor.dim_order` field to access current dimension permutation (#106835) - Add `torch.backends.cpu.get_cpu_capability()` to expose cpu properties (#100164) - Add support for `PrivateUse1` device (out of tree device) for untyped storage (#100868) - Add `torch._foreach_pow` (#92303) ## optim - Provide **NAdamW** implementation through the `decoupled_weight_decay` flag (#103881, #107706) - Add xpu support for foreach kernels (#106021) - Add capturable API w/ tests + fix differentiable for **NAdam** (#106615) - Add pre hooks and post hooks for optimizer `load_state_dict()` and `state_dict()` (#105953, #106209) ## torch.compile - Automatic dynamic shapes - `torch.compile` automatically finds dynamic dims and selectively turns on dynamism - (#106188, #98923) - Support Python 3.11 version with TorchDynamo - (#94098, #94099, #94100, #94101, #94102, #96499, #96500, #96501, #96503, #96504, #96495, #96505, #96506, #96508, #96509, #98032, #98364, #96511, #99934) - Introduce `TORCH_LOGS` to enable better logging UX for `torch.compile` - (#94858, #98564, #98776, #98795, #100664) - NumPy support in `torch.compile` (#106211) - Higher Order ops - TorchDynamo supports higher order ops - (#101707, #104685, #106425, #107459, #107461) ## Sparse Frontend - Add prototype implementation of semi-structured (often known as 2:4 sparsity) for NVIDIA's Ampere, and newer, architecture at `torch.sparse.semi_structured` (#100485, #103700, #107398, #103978, #103830, #104608, #101339) ## Autograd - Add backward support for out-of-place foreach functions (#93901) - Add backward support for for in-place foreach functions (#96405) - Add backward support on `_foreach_zero_` (#101149) - Add forward mode AD to out-place foreach functions (#106320) - Add forward over backward support for `torch.nn.functional.logsigmoid` (#99288) - Add forward mode AD to in-place foreach functions (#100695) - Add forward mode AD for `torch.renorm` (#100798) ## torch.nn - Add an optional scale kwarg to `scaled_dot_product_attention` (#95259) - Add non-recursive 
`nn.Module.to_empty` option (#104197) - Add keyword argument to allow disabling bias for `LayerNorm` (#101683) - Add option to always call `nn.Module` global/non-global forward hooks (#104278) - Add keyword argument to allow disabling bias for `Transformer` (#101687) ## torch.export - Add public APIs to `torch.export`: `export()` (#107609), `dynamic_dim()`(#107635), `constrain_as_{size,value}` APIs (#107735) and `ExportedProgram` (#107852) - Add support for saving and loading exported programs (#107309, #102707, #107924, #102708, #103274, #107818, #107938, #107420, #107666, #107386, #102716, #103273, #107888) - Add [ExportDB page](https://pytorch.org/docs/2.1/generated/exportdb/index.html#exportdb) in pytorch.org ([#104288) - Add a verifier for EXIR Aten dialect (#94783, #100019) ## functorch - Add experimental support for functorch transforms and `torch.compile` composition (#98328, #106610, #107462, and others) - Add `functorch.einops.rearrange` (#101957) ## Distributed ### c10d - Add `PrivateUse1` for dispatching PyTorch Distributed Collectives to support custom device. (#98137) - Add new `Store` methods: `append`, `multi_get`, `multi_set`. (#100379) - Implement coalesced `all_gather_into_tensor` (#101157) - Implement coalesced `reduce_scatter_tensor` (#103561) - Add back in `reduce_scatter_tensor_coalesced` (#104345) - Design a new fake process group aimed at running a single rank with a fake process group without needing multiple processes. A fake process group (not related to `FakeTensor`) is a process group which doesn't actually do any communication, but instead just just hallucinates communication. (#102180, #102238, #104213, #104428) - Enable barrier to support the specified device (#99589) - Add xpu to the default device supported by user specified backend (#103410) - Support third-party devices to use the `init_process_group` method without specifying the backend (#107113) ### Distributed Tensor - Add `DTensor` constructor function: `ones`/`empty`/`full` (#100933, #101022, #103165) - Enable `DTensor` based Native sequence parallelism (#94369) - Enable DDP + TP 2D parallelism (#106583) - Enable `deviceMesh` to use dispatchable PG to support custom backend (#102336) - Allow ONNX Runtime (ORT) backend for `DTensor` (#101914) ### FullyShardedDataParallel: - Introduce `CustomPolicy` in FSDP wrapping (#104986) - Add FSDP support for creating hybrid-sharded process group for custom backend (#100622) ### DTensor based Distributed Checkpoint - Add 1D `DTensor` based DCP (#94868) ## Profiler - Add a global flag to record concrete shapes, which are Scalar lists in profiler traces (#101043, #101292) - Add `export_memory_timeline` to save memory timeline plot to file (#96137, #96535) - Reintroduce forward-backward links in profiler traces with a global flag (#102424, #102492) - Add Kineto synchronization events in profiler traces (#105187, #105144) - Add support for `cuLaunchKernel` in profiler traces for triton kernel launches including flow events in profiler traces (#99571) - Add CUDA runtime events up to CUDA 12.0 for traces, and added flow events for H100’s `cudaLaunchKernelExC` (#106293) ## ONNX ### New TorchDynamo ONNX Exporter - Implement a new exporter core infrastructure and expose public APIs for it (#97920, #99940, #99202, #102810, #104736, #104493, #105040, #106228, #99284, #96349, #107245, #100490, #95650, #95676, #96350, #100554, #94878, #99191) - `torch.onnx.dynamo_export` - `torch.onnx.ExportOptions` - `torch.onnx.ExportOutput` - Add an operator registry (#103943, #106140) 
- Add an operator dispatcher (#104679, #105104, #104267, #105972, #106478, #100660, - Add operator validation (#94920, #97494, #105874, #104268) - Add Input/Output adapter (#98421) - Enable `ExportOutput.save` for models larger than 2GB (#107904) - Add Pre-ONNX FX Passes (#95664, #95935, #98633, #97729, #95929, #98760) - Functionalization (#98245, #99667) - Explicit type promotion (#104063, #104064, #104229, #104720, #104491, #106178) - Export module as function (#105618, #107409) - Add fake mode export (#103865, #105246, #105247, #105477, #106930, #107836) - `torch.onnx.enable_fake_mode()` - Add SARIF](https://sarifweb.azurewebsites.net/) diagnostic system ([#99668, #99944, #100219, #100299, #100407, #100451, #105231, #105263, #105886, #105889, #105892, #106048, #106592, #106741, #107165, #107653, #107654, #107158) ### New `torch.compile` ONNX Runtime backend (#107973, #106929, #106589) ```python Usage: `torch.compile(..., backend="onnxrt")` Available when `torch.onnx.is_onnxrt_backend_supported()` returns `True` Additional Python package dependencies: `onnx`, `onnxscript`, `onnxruntime` ``` ### Additional TorchScript ONNX exporter operators: - `aten::_convolution_mode` (#89107) - `aten::unflatten` (#99056) - `aten::scaled_dot_product_attention` (#99658) - `aten::tile` (#99927) - `aten::atan2` (#100040) - `aten::broadcast_to` (#101833) - `aten::scatter_reduce` (#102048) - `aten::logit` (#102377) - `aten::hstack`, `aten::vstack` (#102872) - `aten::atleast_1d`, `aten::atleast_2d`, `aten::atleast_3d` (#103061) - `aten::logical_not` (#96315) - `aten::randint` (#105089) - `aten::norm`: Supports `dtype` argument (#95637) ### Others - Add initial support for FP8 ONNX export (#107962) ## MPS - Add support for `MPSProfiler` (#100635, #101002, #101692) - Enable saved models to be loaded directly to MPS through `torch.jit.load` (#102204) - Introduce `torch.mps.Event()` APIs (#102121) ## torch.fx - Add attribute node matching in the subgraph rewriter (#98604) - Add variadic arg matching in the subgraph matcher (#99431) - Add a flag to ignore literals with subgraph matcher (#97683) - Add a prototype `source_fn` based partitioner to partition modules that were flattened in export (#98628, #101121) - Add `aggressive_merge` to `CapabilityBasedPartitioner` which merges independent subgraphs (#100195) ## Quantization - Add various uninterpreted bit tensor data types (`torch.{bits1x8,bits2x4,bits4x2,bits8,bits16}`) (#95860) - Add basic cuda support for `float8` dtypes (#105807) - Add Autocast Support for XLA (#96370) ### Export Quantization: - Quantizer and Annotation API (#97994, #101708, #101920, #102054, #102184, #102282, #102439, #102582, #105484, #106922, #107833); - `XNNPACKQuantizer` (#98507, #98569, #98560, #99063, #99399, #100566, #101122, #102394, #102395, #102396, #102397, #102398, #102703, #103526, #105551, #106087, #106094, #107872, #107992); - `EmbeddingQuantizer` (#103088); - `ComposableQuantizer` (#102846); - `X86InductorQuantizer` and Kernels (#98730, #98826, #105639, #106836, #106781, #105818, #104580); - Quantization Aware Training (#98568, #100283, #100442, #100525, #100610, #100941, #102224, #102993, #103298, #103731, #103759, #103970, #104110); - Reference Quantized Model Representation (#104130, #105707, #104395, #105708, #105783, #105784, #107810); - Program Capture (#107941) ## JIT - Register ops for `torch.get_cpu_capability`, `Tensor.is_xla`, so they can be captured by torchscript (#100723) - Provide `__prepare_scriptable__` on non-`nn.Module` classes as an escape hatch to provide 
a scriptable alternate implementation (#106229) - Introduce API to `deepcopy` a JIT module onto a new device (#106521) ## Vulkan - Add Vulkan support for the following operators: `aten::unsqueeze`for 2d to 3d (#101719), `aten::cat` operator for 1d, 2d, 3d and 4d (#102128), `aten::expand` (#103930), `aten::flip` (#106628), `gelu` (#102762), `aten::masked_fill` (#104444), `aten::pow` (#105550), `at::softmax` 1,2,3 dimension tensors (#105012), `at::softmax` along all dimensions for 4-dim Tensors (#102988), `sum.dim_IntList` (#105612), `sum.dim_IntList` with keepdim (#106159), `at::select.int` operator, 4 dim/rank tensor case (#96228), `aten::stack` (#103344), `aten::uniform` (#102431), `aten::unsqueeze`, 1d->2d, 3d->4d (#102987), `aten::repeat` (#103255), `aten::tile` (#103944), `aten::zero_` (#103042), `aten::zeros` (#103703), `convert_qconv2d_context` (#97714), "height" and "width" dimension for `select` operator (#94612), `t` and `transpose` operators for 2d, 3d and 4d tensors (#101808), unary ops (#104994), `upsample_bilinear2d` (#98022), `upsample_nearest2d` and `quantized_upsample_nearest2d` (#97467), `quantize_per_tensor` vulkan backend function (#106641), quantized binary ops (`add`/`sub`/`mul`/`div`), and adding graph rewrites for quantized `add`, `mul`, `conv2d` and `conv2d_relu` (#97468) - Add broadcast support for 4D tensors where the batch and channel of a tensor are different (#104718) - Templatize `BinaryOp.cpp` (#105380) # Improvements ## Python Frontend - Support non-ASCII characters in model file paths for `torch.{save,load}` (#99453) - Enable registering fallthroughs via `torch.library` (#106086) - Add support for saving and loading with any Endianness to `torch.{save,load}` (#94503) - Add `torch.storage.UntypedStorage.get_device` method (#99818) - Add type hint for `torch.__init__.py` (#106214, #103807), `torch.Tensor.retains_grad` (#103528) - Add support for HPU device for serialization (#101680) - Add support for XPU device for old-style Tensor classes (#96656), storage resize_ (#105262) - Mark `torch.bincount` deterministic on CUDA if weights are not given (#105244) - Properly expose all constraints on the `torch.distributions` (#106458) - Add `itemsize` and `nbytes` properties to Tensor (#98322) - Add complex support for `torch.expm1` (#96644) - Add `nonzero_static` op to pytorch to unblock export (#97417) - Tweak heuristic for Scaled Dot Product Attention (SDPA) selection based off of data (and a decision tree) (#99644) - Fix print `tensor.type()` issue. 
(#96381) - Improve error messages in `THPVariable_set_grad` (#100683) - Enable `new_full`'s fill_value argument type to be complex, for more accurate type checking (#91345) - Add 0-dim (zero dimension) Tensor overload to `_foreach_mul` (#106677) - Add in-place `_foreach_copy` (#107226) - Support floating point correction value for std/var operators (#94073) ## Dataloader and DataPipe - Fix `validate_input_col` for partial functions (#95067) - Add support for pin memory on custom device (#97621) - Fix collation logic (#97789) - Add context to NotImplementedErrors in dataset.py (#100667) - Add `__getitems__` to description of Dataset API, and also better support within `Subset` (#100375) - Adding `StackDataset` (#101338) - Change DataPipe display name in profiler (#100042) - Add `copy` option to `fork` DataPipe (#96030) - Fix missing imports in DataPipe interface file (#97458) - Do not materialize entire `randperm` in `RandomSampler` (#103339) ## torch.nn - Add check that `embedding_bag`'s weight is 2D (#94931) - Improve error message for instance norm when channels is incorrect (#94624) - Add generator argument to `nn.init.trunc_normal_` (#100810) - Improve `clip_grad_norm` to use `torch.linalg.vector_norm` (#102429) - Add `uint8` support for CPU images in `interpolate(mode='bicubic’)` (#103252) - Allow `nn.ChannelShuffle` to run without error on CUDA tensors (#105351) - Add `nn.CircularPad{3/4/5d` and fixed `no_batch_dim` support for CircularPad (#106632) - Add `reset_parameters` for `torch.nn.PRelu` (#106507) - Use accumulate type to improve accuracy of `grid_sample` on half precision inputs on CUDA (#96586) - Improve performance for vectorized bilinear interpolate `cpu` `uint8` channels last (#96848)) - Add differentiable `mkldnn_rnn_layer_backward` to support double backward of LSTM (#100627) - Add `is_causal` API for `TransformerDecoder` (#97166) - Add `is_causal` hints for `Transformer` (#106143) - Enable channels last for reflection padding on CPU (#102518, #102597) - Add `bfloat16` support for reflection and replication padding (#102949) - Add `SyncBatchNorm` support for custom device (#104250) - Add channels last 3d support for `BatchNorm` on CPU (#97774) ## functorch - Add `torch.vmap` support for `torch.complex` (#96032), overloads of `float_power`, `where`, and `comparison` ops. 
(#96744), `linalg.lu_factor` (#94328), `ldl_factor` (#97518), `torch.copysign` (#96018), `torch.nn.functional.smooth_l1_loss` (#98357), `nn.functional.huber_loss` (#99235, #99236), special bessel functions (#99543), `torch.nn.functional.{max_pool1d, max_pool3d}` batch_rule (#99517, #99522), `Tensor.index_fill` (#99229), `torch.bucketize` (#95783), `smooth_l1_loss_backward` (#99429) ## optim - Merge and improve torch optim optimizer type stubs (#102593) - Allow fused optimizers to call `_foreach_zero_` in `zero_grad` (#97159) - Add multi Stochastic Weight Averaging (SWA) support for custom device (#103297) - Use `torch._foreach_lerp` for SWA update (#103550) - Add XLA friendly codepath to `single_tensor_adamw` (#102858) ## Linear Algebra - `lerp(cpu)`: Add `half` support (#105607) - `norm(cpu)`: Accumulate in `float` when inputs are `half` or `bfloat16` (#95166) - `matmul`: Avoid unnecessary copies (#97355) - `matmul` backwards: Don’t create large intermediary tensors (#95261) - `addmm`: Call to `mkldnn_matmul` on AArch64 (#91763) - `addmm(cuda)`: Enable `addmm` + GELU epilogue fusion (#103811) - `dot/mv/gemm(cpu)`: Accumulate in `float` for `bfloat16` inputs in the fallback path (#96074) - `bmm`: Heuristics for AArch64 (#107167) - `baddbmm(cpu)`: Fix grain size setting (#98297) - `mm(cuda)`: Expose cuBLAS int8@int8 -> int32 (#96685) - `triu/tril`: complete dtype support. (#101414) - Enable hipSOLVER in ROCm builds (#97370) - Improve error message in `ADDMM_META()`. (#105309) - Allow setting `TORCH_LINALG_PREFER_CUSOLVER=1` to prefer cusolver as linear algebra library globally (#106226) - `ldl_factor(cuda)`: Enable hipSOLVER backend in ROCM (#102665) - Add `SymInt` support for `{tensordot,inner,linalg.{matrix_power,tensorinv}}`. (#100356, #101940, #102465) - Add fake tensor support for SVD. 
(#100130) ## Autograd - Improve `torch.utils.checkpoint` with `use_reentrant=False`: - Support recursive checkpointing; allow grad calls within checkpointed function (#90105) - Allow the specification of a pair of context functions via `context_fn=` (#96783) - Stop recomputation early if possible; enabled by default but also expose a way to disable (#96866) - Improve debuggability of activation checkpoint; expose `debug=` and `determinism_check` kwargs (#103859) - Allow `torch.inference_mode`, `torch.no_grad`, `torch.enable_grad` decorators to be used without parens (#107086) - Allow `torch.autograd.set_multithreading_enabled` to act as function and context manager (#105291) - Add `materialize_grads` parameter to `torch.autograd.grad()` (#97015) - Allow `torch.autograd.Function` to save non-input leaf tensors for backward (#104039) - `sampled_addmm`: backward performance improvements (#103544) ## Sparse - Add rudimentary support for addmv(strided, CSR, strided) on CPUs without MKL support (#97353, #97730) - Implement sparse semantics support in gradcheck (#94714, #95405, #96095, #107150) - Add add(COO, COO) for BFloat16 on CPU (#96767) - Add support for negative dim to torch.sparse.softmax for COO (#102172) - Add support for dim to sum for CSR on CPU and CUDA (#99292) - Add integer overflow checks to sparse tensor invariant checker for large compressed tensor dimensions and large nnz (#102530) ## Nested Tensor - Support `zeros_like()` and `randn_like()` for nested tensor (#96527, #96528) - Add backwards for layer norm for nested tensor (#94781) - Support elementwise add / mul for \[B, *\] nested, \[B, 1\] dense (CUDA only) (#95620) - Enabling FlashAttention for SDPA when given NestedTensor (#95438) - Add `sub`, `sgn` `abs` ops for nested tensor (#97837) - Implement last dim `split_with_sizes` for NestedTensor(forward only, non-SymInt-ified) (#97446) ## Foreach Frontend - Move tensor grouping to ATen (#103912) - Disable grouping by dtype and device if compiling (#102771) - Add fused support for XPU devices (#104517) ## Build Frontend - `_mm_prefetch` is for Intel, changed to `__prefetch` for ARM64 (#96638) - Build PyTorch with `-Wnewline-eof` (#99687) - conditional `CMAKE_CUDA_STANDARD` (#104240) - cmake: allow `USE_SYSTEM_ZSTD` (#104611) ## CPU - Introduce fast path for equal and concat: (#100024, #106727) - Add channel last 3d support for `MaxPool3d` on CPU (#97775) - Add Half support for `logsigmoid`, `threshold`, `elu`, `gelu`, `hardtanh`, `hardsigmoid`, `hardswish`, `hardshrink`, `softshrink`, `leakyrelu`, `softplus`, `glu`, `silu`, `mish`, and `prelu` on CPU (#98745) - Make `index_add_` error if input source shape is wrong (#100321) - Enable taskset core pinning in addition to numactl (#96011) - Add explicit vectorization for Half dtype on CPU (#96076) - Add `Half` support for sigmoid on CPU (#96077) - Add `Half` to cat fast path on CPU (#96078) - Use `float` as accumulate type for reduce Ops: `min`, `max`, `minmax` on CPU (#96079) ## CUDA - Support `bf16` dtype for `conv_depthwise3d` and `searchsorted` (#97819, #99426) - Support integer dtypes for padding (cpu and cuda) (#107755) - Support complex dtype for Sigmoid Linear Unit (SILU) (#106854) - Add additional stream priority for cuda streams (#101956) - Prevent grad scale from overflowing (#98876) - `nn.EmbeddingBag` bound check (#96022) - Hide `set_device` change (#94864) ## MPS - Add `lerp` implementation (#105470), `logit` (#95162), `hardsigmoid` (#95164),`hypot` (#95196), `xlogy` (#95213), `log_sigmoid`(#95280), `fmax` and 
`fmin` (#95191), `roll`(#95168), `copysign` (#95552), `pow.Scalar` (#95201), `masked_scatter` (#95743), `index_fill` (#98694), `linalg.vector_norm` (#99811), `histogram` (#96652), `aminmax` (#101691), `aten::erfinv` (#101507), `cumprod` (#104688), `renorm` (#106059), `polar` (#107324) - Add optional minor argument to `is_macos13_or_newer` (#95065) - Allow `float16` input to `float32` `LayerNorm` (#96430) - Add higher order derivatives warning to `max_pool2d` (#98582) - Expose `mps` package in torch (#98837) - Prerequisite for MPS C++ extension (#102483) ## torch.export - Change attributes of ExportedProgram to properties and add BC decorator #106170 - Throw explicit error when constraining on static values (#101655) - Store the arguments used to trace the exported program in itself to facilitate (#107906) - Add kwargs support for export. (#105337) - `ExportedProgram.transform` updates `graph_signature` automatically (#107792) - Support preserving calling convention to some modules so that they can be properly unflattened. (#106798) - Make pass base composable (#103701) - Remove unused flags in export (#106336) - Update the core Aten operator set: - Add 23 ops to core Aten set (#107766) - Remove `split.Tensor` from core Aten (#107938) - Allow registration of dataclass as pytree node (serialization of it not supported yet) (#106160) - Support re-exportability (#106531) ## torch.fx - Rewrote graph traversal to avoid recursion (#95723) - Reorder the Fx execution order to in-time `get_attr` rather than putting all `get_attr` ahead (#95014(https://github.com/pytorch/pytorch/pull/95014 )) - Preserve node.meta for `get_attr` nodes in `fx.Transformer` (#95245) - Preserve output node metadata (#95426) - Copy `nn_module_stack` metadata whenever we create a new node when tracing (#95358) - Prettify assert expr in `self.symbol_to_source` failure (#95972) - Allow `torch.fx` to take Modules that return dataclass (#99576) - Add `has_side_effect` to add to a list of side effecting functions (#97288) - Change placeholder check instanceof PHBase (#102008) - Add metadata to PHBase placeholders (#102195) - Make `fx.wrap` idempotent (#104838) - Enable Python dispatcher when ShapeProp with fake mode (#103512) ## Quantization - Add quantization support for `pixel_shuffle`, `pixel_unshuffle`, `narrow`, ConvTranspose, ConvTranspose3d (#94769, #96160, #97126, #97125, #101926) - Support static quantization for `LSTM` and `MultiheadAttention` (#96343, #96436, #101299, #95636) - Force weight observer/fake_quant to be on the same device as the weight tensor (#106755) - Add serialization method for quantized `hardswish` (#94486) - Enable `quantized_max_pool3d` (#101654) - Quantization oneDNN backend only support VNNI CPU (#103653) - Fix bug in `fuse_modules` (#105069) - Add `torch.matmul` in `FloatFunctional`/`QFunctional` (#106831) - Support quantized Sub, Multiply in XNNPACK (#104090, #104134) ## Profiler ### General Profiling - Profiler permitted CPU events with CUPTI Range Profiler mode (#97048) - Make Profiler API agnostic with respect to target hardware (#101554, #106142) - Improve on-demand profiling options for Input Shapes, Memory, Stack, Flops, and Modules (#97380, #97556) - When `record_inputs=True`, record scalar lists of length <= 30 (#100593) - Disable Kineto event profiler by default--due to flakiness; fixed thread sanitizer issue; and refactored `stress_test` (#105144) - Bump Kineto to C++17 (#106293) - `tb_plugin` to support HDFS and improved memory view (#106672) - Make on-demand update duration 
configurable, and improved start time for on-demand tracing (#101952) ### Memory Profiling - Add support for HTML plot of memory profile via `export_memory_timeline` (#99751, #101316) - Include more uncategorized events in memory profiles (#101200) - Add export of raw memory events with timestamp via `export_memory_timeline_raw` (#105094) ## ONNX ### TorchScript ONNX exporter - Add `Autocast` support to `MatMul` through explicit cast (#98346) - Add missing spaces between sentences in warning text (#105527) - Improve shape inference for `Slice` (#105755) - Do not run `deduplicate_initializers` when `keep_initializers_as_inputs` is True (#96320) - Remove legacy diagnostic printing (#106498) - Re-purpose `name` field of `GraphProto` (#107408) - Add constant folding for `Softmax` op (#102861) - Add `autograd_inlining` flag to `torch.onnx.export` (#104067) - Update opset version warning text (#106830) ## Distributed ### Activation checkpointing - Enable `checkpoint_wrapper` acccept `auto_wrap_policy` (#102672) - Add warns on reentrant use (#102890) ### DistributedDataParallel (DDP) - Enable delayed all reduce in DDP (#96673) - Enable DDP native mixed precision (#92882) - Add an API to remove autograd hooks from DDP (#96490) - Enable fused optimizer for DDP (#98270) - Perform input casting in pre-forward (#100131) - Implement new Store methods in `PrefixStore`. (#100380) - Unify `_cast_forward_inputs` (#102680) - Multiple forward support for static graph (#103487) - Add methods to DDP to check for backward finalization. (#100773) - Support optim in backward after DDP init (#105991, #105995) ### FullyShardedDataParallel (FSDP) - Add alignment padding for `use_orig_params=True` (#97667) - Allow non-uniform `requires_grad` for `use_orig_params=True` (#98221) - Move only current FSDP's states to GPU during init (#98319) - Reshard frozen params in backward (#101982) - Support unfreezing params for reshard-only hook (#104186) - Standardize meta device init within FSDP (#104189) - Annotate modules for `fully_shard` (#104363) - Make `limit_all_gathers=True` default for FSDP (#104900) - Add `record_function` for explicit prefetching (#105985) - Optimize away intermediate `div_` for Hybrid Sharding Data Parallel (HSDP) (#106034) - Check valid param freezing for `ModuleWrapPolicy` (#104427) - Allow `ModuleWrapPolicy` to take Iterable (#104999) - Enable async all-reduce for Hybrid Sharding Data Parallel (HSDP) #106080) - Relax `sharded_grad` assert to allow IDLE state (#96584) - Copy step tensor so that each parameter has its own step (#96313) - Make FSDP `optim_state_dict` aware of DDP prefix (#96415) - Consolidate the arguments and logic of `optim_state_dict` and `optim_state_dict_to_load` (#96534) - Make it full precision in eval mode (#97645) - Include duplicate parameters and modules when calling `named_parameters` and `named_modules` (#99448) - Make `param_groups` optional for FSDP `optim.state_dict` (#99117) - Support `rank0_only` when `use_orig_params` is True (#99624) - Consolidate `rank0_only` load logic (#99647) - Make `fsdp` device-agnostic for custom-backend which implements cuda-semantics (#99024) - Ensure that customized non-tensor optimizer state can be saved (#99214) - Avoid elementwise dispatch of gradient unscaling/validation ops in `_foreach_non_finite_check_and_unscale_cpu_` (#100108) - Do not flatten states when `use_orig_param` is True and sharding is `NO_SHARD` (#100189) - Make `set_state_type` to `SHARDED_STATE_DICT` compatible with `NO_SHARD` sharding_strategy (#100208) - Allow 
each `fully_shard` unit to cast foward inputs for mixed precision config (#100290) - Restore the `state_dict_config` for `NO_SHARD` (#100855) - Skip unshard call during checkpointing for `NO_SHARD` sharding strategy (#101095) - Add `ignored_states` to FSDP/fully_shard (#102056) - Start to generalize modules to ignore for mixed precision (#102010) - Implement a workaround for FSDP init issue for 2D Parallel (#104398) - Improve support for CPU tensors. (#103171) - Avoid calling `optim.state_dict()` to get the initial empty states (#103609) - Use `_get_module_fsdp_state_if_fully_sharded_module` for state_dict (#103783) - Validate `ignored_modules`, `ignored_states` (#104273) - Check `module.training` for `_root_cast_forward_inputs` (#104223) - Add Option for eval in fp32/bf16 (#104682) - The correct way to initialize optimizer states if the corresponding param is empty (#104765) - Make `optim_state_dict_to_load` work with `use_orig_param=False` + `NO_SHARD` (#107185) - Expose optimizer `state_dict` config (#105949) - Enable custom device support in `fsdp` checkpoint (#107289) ### Distributed Tensor (Prototype Release) - Add `_tenso.zero` function (#95863) - Enable the `nn.Embedding` op for `DTensor` (#96702, #104820) - Support creating `DTensor` in submesh (#95458) - Enable correct behavior of random ops in `DTensor` and Tensor Parallel (#98198, #98577, #103235, #103910, #106535) - Implement `aten.equal` sharding prop for `DTensor` (#97170) - Set `cuda` device automatically, and refactor error handling (#97583) - Set default value for `DTensor` ops on non-participating devices (#95852) - Change sharding algorithm to be in line with `torch.chunk` (#98722, #106250) - Add a new `ColwiseParallel` style when `Pairwise` cannot be used directly (#100137) - Enable more generic attention module sharding for PTD Tensor Parallelism (#100508) - Adopt strategy based sharding prop in `DTensor` ops (#100607, #101203) - Support torch.save/load with `DTensor` (#103106) - Allow `DTensor` to support `cuda`-like device (#102468) - Add an input resharding wrapper for TP and unit test for 2D + AC (#103334) - Add validate mesh flag to `DeviceMesh` (#104807) - Improve allgather unpadding logic (#103219) - Use stack to manage mesh resources (#101202) ### Distributed (c10d) - Enable the handling of bool tensors in Gloo. (#105354) - Enable avoid `recordStream` calls in `ProcessGroupNCCL` an option (#89880) - Remove stack trace captures from import (#97274) - Update `_store_based_barrier` implementation to reduce load on rank 0 (#98000) - Remove lock for `nccl` collective launch for `nccl` 2.0+ (#97904) - Figure out the correct device to use for object collectives (#100954) - Start gloo sequence numbers at 0. (#101422) - Add missing `torch.distributed.ReduceOp.AVG` in type stubs (#101534) - Determine collective device from `_get_pg_default_device` rather than from explicit cuda (#101533) - Enable configuration of NCCL communicators (#97394) - Make default backend need to check for `nccl` availability (#102470) - Add `is_backend_available` for c10d backend. 
(#101945) - Add one flag value for direct teardown without comm abort (#102599) - Make it the default that PG do not perform barrier after init (#103033) - Avoid `workEnqueue` when capturing cuda graph for NCCL process group (#103503) - Ensure `ncclCommAbort` can abort stuck `ncclCommInitRank` (#103925) - Allow previously initialized MPI (#105023) - Increase socket buffer size to allow ProcessGroup init up to 12k ranks (#107878) - Change `--standalone` to bind to a random port (#107734) - Initial commit of `collective_utils` (#101037) ### Distributed Checkpoint - Upstream `fsspec` storage read/write to PT (#98387) - Rewrote read slicing to use a wrapper. (#99167) - Consolidate OSS `FsspecWriter`/`Reader` and internal `FsspecWriter`/`Reader` (#104724) ### Torch Elastic - Allow elastic agent to fail fast (#99051) ### RPC - Add size check before calling `.back()` in `rpc/script_call.cpp` (#94297) ## Dynamo - Support `nn.Module` forward hooks in Torch Dynamo (#92125) - Many graph break fixes - (#94949(https://github.com/pytorch/pytorch/pull/94949, #94658, #102247 and others). - Translation validator for dynamo guards (#102563) - Update dynamo `sum` dtype handling to match eager (#103037) - Switch calling convention back to real tensors (#99320) ## Inductor - Support more operators that fallback to eager previously: `rad2deg`, `deg2rad`, `count_nonzero`, `bitwise_right_shift`, `quantized.max_pool2d`, `erfc`, `erfinv`, `all_reduce`, `squeeze_copy`, `aten.prod`, `softshrink`, `aten.unfold`, `diagonal`, `diagonal_copy`, `diagonal_scatter` ( #98994, #98995, #94997, #105906, #101416, #101863, #93111, #96039, #99484, #105603, #105165, #103755 ) - Add decomposition rules for: `aten.normal_`, `lerp`, `aten.angle`, `unfold_copy`, `aminmax`, `nansum`, `fmin`, `fmax`, `narrow_copy`, `expand_copy`, `view_copy`, `smooth_l1_loss`, `full_like`, `affine_grid_generator`, `aten.dist` ( #91207, #104866 , #105609, #96038, #96039, #102077, #101963, #104709, #105586 ) - `cudagraph` and `cudagraph` trees (#97440 , #98254, #89146, #98529, #102273, #105148 ) - Add convolution triton template (#95556) - Pattern matcher infrastructure (#97740) - Use Welford algorithm to compute variance in a single pass (#104725 ) - Value range analysis ( #102611 ) - Do IR validation recursively ( #98887 ) - Merge consecutive splits (#100107 ) - Constant and `index_expr` propagation pass to simplify indexing expressions (#101077 ) - Fallback `max_pool2d_with_indices` to eager rather than fail an assertion if dilation is not 1. 
(#100531 ) - Don't fuse nodes with long distance to avoid increasing memory usage (#104024 ) - Easier to add third-party backend ( #106874 ) - Improvements on the CPU backend - Support vertical reduction (#97644) - Support dynamic shape (#97230, #102068 ) - Support masked load ( #107670 ) - Enable accuracy testing in CI (#94898) - Enable Inductor to support BF16 atomic_add (#96620) - Improvements for AMD - `tl.dot` and MFMA support enabled in ROCm triton for conv/mm lowerings (#107600) - Remove ROCm `miopen_batch_norm` fallback, now lowering to triton (#100089) - Enable "reduce-overhead" compile mode with hipgraph support on ROCm5.6 (#103092) - Align inductor behavior with eager mode for split_with_sizes (#99702) - Avoid decomposing `_unsafe_index` in Inductor (#107882) - Make CI error on inductor fallback when decomp is available (#99473) - Enable weight prepack for `LSTM` (#103071) - Enable `fused_attention` pattern matcher (#107128) - Add `fused_attention` pattern matcher with additional clone (#108141) ## JIT - Include more shape information on tensor values in `jit.trace` functions (#95544) - Allow annotations using generics directly, e.g. `tuple` instead of `Tuple` (#98703) - Improve source attribution for easier debugging (#95761, #96423, #98606, #100171) - Shape functions implemented for `stack`, `squeeze.dims`, `cross_entropy_loss`, `conv_transpose` (#92205, #93919, #98078, #97875, #102139) - Partially support `ForwardRef` type annotations for `NamedTuple` attributes (#96933) - Optionally ignore UTF-8 decoding error when converting `std::string` to python `str`. (#97282) - Improvements to flatbuffer serialization and deserialization (#97190, #97298, #99050) - Support serialization/deserialization of >4GB strings (#99104) - Enable `torch.jit.load` for custom device (#99535) - Allow C++ custom class to define `__repr__` (#100724) ## Misc - Better function annotations for `nn.functional` (#102918) - Add safe `weights_only` option to `load_state_dict_from_url` (#98479) - Enable Transparent Hugepages (THP) for buffer sizes >=2MB (#95963) - Automatic pulling of `ExtraFilesMap` without explicit mapping. 
(#99747) - Remove device transfers from Java Native Interface (JNI) (#105583) - `profile_plot` generates snapshot objects (#103497) - `vmap` Support for `torch.tril` and `torch.triu` (#94287) # Bug fixes ## Python Frontend - Fix docstring setup to allow running PyTorch in python optimize mode (#100750) - Fix deserialization for `UpsamplingBilinear2d` (#101248) - Fix `torch.distributions.Dirichlet.log_prob` when `x=0` and `alpha=1` (#103605) - Fix `torch.distributions.Gumbel.cdf` (#91698 - Fix PEP 484 Violation (#105022) - Fix bitwise shift operations when shifting out of bounds (#97150) - Fix `torch.asarray` to use the default device (#106779) - Fix deepcopy on `torch.Tensor` on MTIA device (#107427) - Add deterministic path for `Tensor.resize_` (#104300) - Fix `torch.pow` to handle real negative base and complex exponent (#95198) - Fix `LayerNorm(bias=False)` error (#108060) - Don't fastpath conj copy when conj/neg bit mismatch (#108881) ## Autograd - Fix `torch.autograd.graph.register_multi_grad_hook` to not keep tensor alive in closure (#102859) - Fix autograd hooks being spuriously garbage collected by removing incorrect `THP{Cpp,}Function_traverse` `PyObject` traversals (#102860) - Fix `Tensor::register_hook` behavior on undefined tensors (#105587) - Handle undefined gradients out-of-place foreach backward (#100256) - Fix codegen logic for foreach derivatives (#95263) - Bump version counter for `torch{resize_, resize_as_}` (#96598) - Bump version counter for `foreach` functions (#93901) ## optim - Fix and enable `complex` x `amsgrad` support for Adam and AdamW (#104989, #104990) - Fix unpicklable object in `AveragedModel` (#95979) - Fix parameter list used in `weight_decay` for Adam (#100973) - Fix `optimizer` `state_dict` casting to allow step to cast for fused/capturable (#102619) - Update `lr_scheduler.py` to check the type of eta_min (#97003) - Fix issue with `lr_scheduler` serialization containing bound methods (#102627) ## torch.nn - Fix `int()` casting in `torch.nn.RNN` to have correctly traced JIT and ONNX graph. 
(#92970) - Fix device handling in `nn.utils.rnn.unpad_sequence` (#98042) - Fix `torch.nn.FractionalMaxPool2d` output_size error (#99507) - Fix inconsistent `torch.nn.MaxPool1d` output on cpu and gpu (#99843) - Fix device of `lengths` in `pack_padded_sequence` when the default device is GPU (#103967) - Fix bias overflow for memory efficient attention in `scaled_dot_product_attention` (#107968) - Update scaled_dot_product_attention dispatch logic to check for sm86 and head_size == 128 for flash attention (#94921) - Raise type error message for `interpolate` if `size` contains non-integer elements (#99243) - Fix a bug in interpolate uint8 AVX2 on non-contiguous input (#101136) - Fix bug in interpolate when interpolation size is larger than max (#101403) - Add error if `stateless.functional_call` is called with `nn.DataParallel` (#107403) - Fixing interpolate on uint8 unsqueezed 3D CL tensor (#100258) - Add check for 0 to 1 inclusive for elements of target tensor in BCE loss (#97814) ## functorch - Fix torch.vmap support for `torch.roll` (#95048), `nn.{PixelShuffle, PixelUnshuffle}`(#96493) - Add better error message for mutating .data under functorch transforms (#94817) - Fix `functorch.jacrev` support for `torch.take` (#95772) - Fix `functorch` support for transforming over Tensor indexing (#98748) - Fix `torch.vmap` support for `torch.searchsorted` (#99698) - Fix `torch.vmap support for `Tensor.index_put` (#100516) - Fix UB in `functorch` infrastructure (#101568) - C++ `autograd.Function` now raises an error with `functorch` transforms instead of being silently incorrect (#103957) - Fix `nll_loss` batch rule with negative ignore_idx (#106118) ## Distributed ### Distributed (c10d) - Fix `kDefaultTimeout` multiple definition build failure in Gloo (#97270) - Delete lengths offset checks (#98368) - Drop the GIL when creating a `TCPSTore` to avoid deadlocks. (#100555) - Fix bug in `process_group_name` when there are duplicate PGs (#100518) - Fix subprocess group handlig in `scatter_object_list` (#100552) - Fix the check message of unsupported collectives ops. 
(#101775) - Fix `netName` assignment for NCCL Config (#105776) - Skip timeout in `FileStore` for Windows if the file path is invalid (#103247) ### FullyShardedDataParallel - Use correct handle training state when prefetching (#98249) - Fix issue where `fully_shard` may determine compute device incorrectly (#98831) - Enable FSDP ``use_orig_params=True`` mixed precision training when some ranks have no (non-zero sized) parameter shards (#99175) - Fix `use_orig_params=True`, CPU offload, `no_sync()` (#100180) - Fix `device_id` when buffer-only module (#103504) - Fix `skip-sharded-views` + mixed precision (#105346) - Ignore buffer type casting in ignored modules (#106766) - Fix train -> EMA -> eval with mixed precision (#106858) - Unblock `ignored_states` + auto wrap (for now) (#104418) - Fix a memory leak in `optim_state_dict` (#96263) - Fix bug in determining whether parameters need to be materialized (#97488) - Fix typo when setting FSDP state dict config (#97110) - Fix osd rank0_only in fsdp (#99136) - Fix decision logic for `should_cast_forward_inputs in _root_pre_forward()` and `_pre_forward()` (#99546) - Fix `ignored_states` when they are passed as generators (#102575) - Fix for optim state dict (#102901) - Handle corner case of load with multi-backend PG for FSDP `state_dict` (#107172) ### Distributed Tensor (Prototype Release) - Fix `DeviceMesh` logics in deciding which PG to use (#96861) - Remove non-generic asserts in `_get_or_create_default_group()` (#96961) - Fix the default PG condition for `DeviceMesh` (#97384) - Fix `DTensor` equal op (#99014) - Use Stride inferred from local tensor in `to_local` bwd (#102630) - Enable partial tensor add without redistribute (#105939) - Get rid of `dim_groups` attribute from `DeviceMesh` (#103105) - Fix `requires_grad` in `distribute_tensor` (#107606) - Fix `new_empty_strided` op’s crash on the shard placement (#108600)) - Fix `new_empty_strided` op (#108600) - Fix `requires_grad` callsite (#108358) ## torch.compile ### Dynamic Shapes A lot of dynamic-shapes bugfixes, too many to enumerate one-by-one. Some important points: - Heavy work our sympy-based symbolic reasoning system, including value ranges analysis for proving redundant constraints (#95174, #105877, #104968, #105138, #105139, #97963, #96121, #104557, #106644(https://github.com/pytorch/pytorch/pull/106644 ), #94944) - Improved symbolic tracing support for operators, including SymInt’ified schemas and SymInt aware operator/backward implementations (#95543, #96100, #97362, #97675) Some specific operators: - tile (#106933) - randint/randperm (#98968) - roll (#99114) - fft ops (#99115) - philox_rand (#99290) - Avoid overspecializing in situations where it is not necessary (#96008) - Don't attempt to use fake tensor fallback to real tensor if there are symbolic sizes (#97148) - Make `Tensor.__contains__` accept `SymInt/Float/Bool`. 
(#98933) - Change Dynamo to not duck-size unspecialized ints (#99010) - Improved mixed type handling for SymInts (#100008, #100328) - Support for annotating that SymInts have a constrained range (#103346) - Don't generate guards that refer to unbacked SymInts (#95732) - Automatically guard when SymInt is converted to int, instead of erroring (#95479) - Z3 based translation validation for debugging symbolic reasoning problems (#104827, #106643, #107523, #106645, #101307, #101607) - Improve constraint violation error reporting, including recommended constraints for export (#102729(https://github.com/pytorch/pytorch/pull/102729 ), #102198, #107470, #107790(https://github.com/pytorch/pytorch/pull/107790 ), #100745, #101636, #101815) - Improve logs for dynamic shapes using TORCH_LOGS=dynamic (#99277, #98941, #107439) - If we can't statically prove 32-bit indexing is OK, only add guard if hint exists (#106004) - Add expect_true for irrefutable guards, greatly improving overall support for error checking involving unbacked ints, and other unbacked symint improvements (#106720, #95216, #106788) - Support for torch.compile with FakeTensor that has SymInt sizes (#107662) ### Other bug fixes In addition, we have the following fixes broken down into roughly 4 parts: - Primtorch and decomposition bugfixes and improvements - FakeTensor and meta function bugfixes and improvements - AOTAutograd bugfixes and improvements - General “composability” bugfixes. The first three cover a large number of general improvements to torch.compile, since torch.compile captures a graph internally by using these major components (fake tensor, prims and decomps, and AOTAutograd, see docs(https://pytorch.org/get-started/pytorch-2.0/)). ### Primtorch and decompositions bugfixes There were a large number of fixes to the primtorch and ref decompositions, which are used in torch.compile during graph capture. These all fixed quite a few bugs in torch.compile: - Sub.scalar decomp: fix primtorch handling for with alpha and float64 arg (#95421) - Embedding_backward_dense decomp: broadcasting fix (#95499) - Upsample_bilinear decomp fix (#101682) - Batch_norm decomp reduce computation when weight or bias is none (#104616) - _unsafe_index decomp (#106814) - `Hardshrink`: make decomp composite implicit (#107039) - normal op decomposition for specializations of the normal op (#106792) - matmul decomp: update to match eager (#105850) - prims.collapse: make it a real prim (#91748) - Diagonal, linalg.diagonal: add refs (#95774) - addmv decomp (#96264) - Minimum_value: fix schema (#97327) - squeeze.dims decomp (#97020) - cumprod decomp: add ref (#98670) - Prims.unbind: fix ref if given dimension size is 0 (#100122) - Aten.arange.default: decompose to to arange.start_step (#99739) - Philox_rand: add decomps (#100206) - Elu_backward: fix bad accuracy in decomp (#100284) - polar decomp: add ref (#100345) - Batch_norm decomp: fix decomp when weight/bias is not flattened (#101059) - aten.fill.Tensor decomp: don’t call .item() (#103880) - Torch.renorm: add decomp (#103858) - multi_margin_loss ops: add decomps (#104578) - aten.logspace decomp: bugfix (#105201) - multilabel_margin_loss_forward op: add decomps (#105302) - Torch.{stft,istft} decomps: add ref (#106400) - Aten.rrelu_with_noise decomp: add ref (#106812) - Misc fixes: - better error message when functionalization cant handle op (#95392) - Simplify some decompositions. 
(#107038) - Make the glue compute short circuit only if possible (#94437) - Don't use PrimTorch decomposition for empty (#94512) - Remove unnecessary TensorMeta rewrap (#95004) ### FakeTensor and Meta function fixes Fake Tensors and meta functions are used internally to perform “shape inference” during graph capture when running torch.compile. In particular: when we capture a graph of pytorch operators, we’d like detailed information on the shapes of intermediate and output tensors in the graph. There were a large number of bugfixes and improvements to these two subsystems over the last release. Operator bugfixes: - _slice_meta (#98326) - embedding bag (#105924) - histogramdd (#100624) - sort (#96719) - mm (#97533) - torch.*_like operators (#98160) - fused_adam (#99436) - sdpa_backward (#101128) - aten.bucketize (#104396) - torch.full (#104451) - Multi_margin_loss_shape_check (#104851) Increased operator coverage: - randperm.default (#99593) - aten.searchsorted.Tensor (#101637) - foreach ops (#102225) - foreach_mul_ (#105107) - foreach_maximum_.List (#105864) - aten.min.dim (#96442) - baddbmm (#96548) - take (#98451) - cummax and cummin (#98552) - logcumsumexp (#98683) - linalg_qr (#100714) - solve_triangular (#100829) - _linalg_eigh (#100964) - linalg_householder_product (#101315) - linalg_ldl_factor_ex (#101362) - linalg_ldl_solve (#101367) - linalg_lu (#101372) - linalg_lu_factor_ex (#101375) - linalg_lu_solve (#101836) - _linalg_slogdet (#102464) - lu_unpack (#102937) - _linalg_solve_ex (#102454) - linalg_matrix_exp (#102945) - avg_pool3d and avg_pool3d_backward (#103392) - rshift and lshift (#103637) - pad ops (#103815) - _pdist_forward and _pdist_backward (#103817) - max_pool ops (#103951) - adaptive_max_pool ops (#104167) - multi_margin_loss ops (#104236) - multilabel_margin_loss ops (#104388) - _foreach_div_.Scalar, sqrt_.default (#104779) - `multi_margin_loss`: check weight shape, make contiguous on CPU, add tests (#104852) - _cholesky_solve_helper and cholesky_solve (#105867) - _adaptive_avg_pool3d_backward (#105816) - `argsort.stable` (#106025) - cholesky (#106115) - cholesky_inverse (#106120) - ormqr (#106278 - mode ops (#106273) - `searchsorted.Scalar` (#106283) - grid_sampler_3d ops (#106261) - _cdist_backward (#106680) - polygamma (#106681) - upsample_nearest (#106675) - triangular_solve (#106682) - special ops (#106683) Other: - Support resize on meta storage (#101988) - meta_tensor] polish error strings in meta registrations ([#95052) - meta] error checking for inplace ops ([#101532) - Implement size checking for copy_ with meta tensors (#107779) - Use safe_is_leaf to test leafness (#102706) - FakeTensor] Workaround FFT ops with incorrect meta strides ([#106319) - Better complex support (#98869) - pt2] remove unused meta_linalg_eigh ([#100965) - pt2] convert out params in register_meta ([#101344) - Add missing decompositons/lowerings for logical/bitwise operators (#102566) - pt2] bug fix: invert condition in checkFloatingOrComplex ([#102944) - err on dot product for tensors of different sizes (#106572) ### AOTAutograd bugfixes AOTAutograd is a major component of the torch.compile stack, and received many bugfixes and improvements over the last release. - AOTAutograd: fix 'Trying to backward through the graph a second time' error (#98960) - Handle tracing foreach ops in ProxyTorchDispatchMode. 
(#99724) - functionalization: error during mutations on mem overlap (#99919) - Functionalization of torch.rand/rand_like ops (#97377) - fix inference mode / PyDispatcher / Functionalize interaction (#103275) - Refactor (#95991, #96235) - Dynamic shapes improvements through AOTAutograd (#95975, #96219, #96300, #96653) - aot_autograd: dont requires_grad on tangents (#96339) - aot autograd] avoid cloning some inputs unnecessarily when they dont require grad ([#96342) - aot] disable inference view tracking ([#96478) - aot autograd: consolidate metadata (#96340) - Add missing aot_autograd_arg_pos_to_source (#97487) - Disable logging in pattern matcher calls to AotAutograd (#98936) - aot_autograd: factor out runtime epilogue from aot_dispatch_base (#100586) - Disallow module forward input mutation in aot_export (#101834) - aot_autograd][functional_rng] Change calling convention ([#102344) - AOTAutograd] add export entrypoints ([#100587) - aotautograd: fix mutation bug when input is noncontiguous (#102767) - AOTAutograd] perform comparisons with stride hints ([#103342) - AOTAutograd] make _unsafe_view() logic happen during the runtime epilogue ([#103919) - Read out real strides from compilation result, rather than real args (#105010) - AOTAutograd: correctness fix when tracing custom autograd functions that alias inputs (#102992) - Add sequence_nr to aot_autograd to map forward ops to their corresponding backward ops (#103129) - AOTAutograd: allow input mutations on inputs that are non-contiguous (#106460) - Add some support for detecting false aliasing in AOTAutograd (#106461) - Add complex dtypes to partitioner (#96297) ## Sparse - Fix an unexpected assertion error when nesting check_sparse_tensor_invariants context managers (#95372) - Fix silent nnz overflow for very large sparse compressed tensors. (#102523) - Fix CSR/BSR invariant validation on 0 sized batched inputs (#101180) - Fix zeros_like CSR and BSR tensors with batch dimensions. (#101215) - Fix autograd issue with identity conversions (#92022) - Set outputs of col_/crow_/ccol_/row_indices methods as non-differentiable. (#107447) - Fix silent index downcast from int64 for int32 for add/add_ on CSR/BSR (#95294) - Fix add/add_ device checks for CSR/BSR (#97520) - Fix incorrect sparse_dim in COO.zero_() and in binary operations with zero-sized COO operands (#98292) - Fix sparse.mm derivatives for non-contiguous inputs on CPU (#106127) ## Linear Algebra - `baddbmm`: Fix when out has `nan` value for `beta=0` (#96086) - Add same dtype checks for {`tensordot`, `addmm(cpu)` (even when input has zero `numel`), `baddbmm`}. 
(#98938, #100274, #102659) ## Profiler - Hand-bound `CapturedTraceback` (#107438) - Fix crash by initializing `kineto_activity` for each event for on-demand profiling (#97550) - Fix CUPTI lazy re-initialization and CUDA Graphs crash in CUDA 11 with workaround (#101879) - Fix CUDA IMA for CUPTI and CUDA 12 by disabling CUPTI lazy re-initialization (#107744) - Fix profiling PT2 w/ dynamic shapes & `record_shapes` (#104320) - Fix profiling shapes with PT2 + lists of dynamic shapes (#105893) - Fix an issue where running Kineto daemon and dynolog in docker fails and UUID generation for IPC fabric (#95535) - Fix static order deinit with `LoggerCollector` (#101952) - Fix issues in `tb_plugin` regarding Distributed View and NCCL events (#103031) - Fix `test_profiler_tree` for HIP and enabled individual activity types for RocTracer (#106293) - Fix flaky `test_memory_timeline_no_id` in `test_memory_profiler.py` (#103441) ## Quantization - Fixing quantized `prelu` workflow (#103455) - Fix issue of lowering weighted functional ops with kwargs (#95865) - Return `zero_point` from `determine_qparams` as a int64 (#98746) - Fix errors in `QuantizeAvx512` (#104400) ## CUDA - Add broadcastable check to `index_put` (#94849) - Fix uniform returning end point for `BFloat16` and `Half` (#96962) - Fix "Cannot assign index like `x[[1,2], :] = 2` when torch.use_deterministic_algorithms(True)" (#105833) - Fixing a bug where allocating a 4GB block results in using 8GB of memory (#95827) - Take `CUDA_VISIBLE_DEVICES` into account for nvml calls (#94568) ## Intel - Avoid FPE when running batch norm with zero batch size. (#95324) - Fix CPU bitwise shifts for out-of-limit shift values (#96659) - Use unordered NEQ comparison for `vec512` `operator!=` implementations (#97466) - Fix `masked_scatter_:` non-contiguous self (#100232) ## MPS - Introduce xfail (#95045) - Error on unsupported types (#95982) - Add type promotion to `torch.addcmul` (#96164) - Add `random_` overload (#98333) - Fix `layer_norm_backward_mps` key (#100295) - Make grid_sampler_2d available (#101108) - Fix `bernoulli` for int types (#100946) - Enable `arange` for `int8` and `uint8` dtypes (#101303) - Handle deserialization more permissively (#98834) - Fix mps unary op issue on non densely stored tensors (#105512) - Fix `torch.std` for negative dimentions (#107754) - Remove mps specialized path in BCE backward (#95220) - Fix type casting copy with storage offset (#95573) - Fix views with 3 or more sliced dimensions (#95762) - Fix bidirectional LSTM & small one-direction LSTM fix (#95563) - Fix in-place add and sub with alpha == 0.0 (#96184) - Fix flip where no dims need to be flipped (#96605) - Fix LSTM grad_y (#96601) - Fix the failure with `ReplicatePad3D` (#96988) - Fix `torch.eye` unsupported bool constant on macOS 12 (#97027) - Add linear inputs check (#99228) - Fix gelu exceptions not raised for error inputs (#99237) - Fix max_pool2d exceptions not raised for error inputs (#99238) - Fix trace exceptions not raised for error inputs (#99239) - Add dot input check (#100099) - Fix index_put with deterministic algorithm enabled (#97660) - Fix embedding cache key (#101857) - Fix `softplus` with `f16` input (#101948) - Fix incorrect distribution of `randperm` with device mps (#104171) - Fix `argmax` and `argmin` clamp value on MPS (#104374) - Make `torch.empty*` deterministic by filling with NaN or max int (#104995) - Correct empty tensor mps all operation (#105218) - Fix upsample output size tensor (incorrect result in MacOS 14.0) (#105677) - Fix MPS 
clamp issue with different dtypes between input and min/max tensors (#105747) - Fix `copy_ broadcast` (#105617) - Fix `clamp` with strided outputs/inputs (#97858) - Restride output strides to contiguous format for inverse op (#102122) - Remove casts from reduction/cumsum/sort ops starting with macOS 13.3 (#95817) - Fix `.item()` for multi-dim scalar (#107913, #108410) ## Vulkan - Ensure dim is `size_t` (#104201) - Fix divide-by-zero with padded tensors (#97698) - Ensure non-zero divisors in Vulkan API Tests [#100909, #100910] - Fix concat op in feature dimension (#101721) - Fix bug of `aten::cat` for concatenation of 3D tensors at channel dim with channels as multiple of 4 (#103718) - Fix the position computation with the consideration of channel padding (#103908) - Fix quantized cpu to vulkan broken by padding (#97372) - Fix broadcasting in quantized elementwise ops (#97554) - Fix lint for `at::softmax` 1,2,3 dimension tensors (#105082) - Fix static analysis errors in `vulkan_quantized_api_test.cpp` (#97400) - Reuse broadcast checks instead of `check_inputs` (#105960) - Fix global and local sizes for `image->bool` copy (#106752) ## Build - `USE_FAST_NVCC` Windows (#95206) - Enable CuDNN v8 frontend in RL (#102284) ## ONNX ### TorchScript ONNX exporter - Fixes for operators: - Add `cast` operator after `reduce` to match desired dtype (#100700) - Simplify `repeat_intereleave` export for scalar-valued `repeat` (#100575) - Fix wrong type when exporting `{zeros, ones, full, empty, rand, randn}_like` ops to onnx (#103048) - Fix `output_padding` for quantized `tconv` (#104207) - Refactor `AvgPool` to support dynamic shapes (#105683) - Fix `expand_as` (#95962) - Add new `aten::device` variant to TorchScript (#97023) - Export dynamic step size for `aten::slice` (#104385) - `STFT` Support (#92087) - Fix `aten::flatten` conversion with 0d input to onnx `Reshape` and 1d to `Identity` (#104089) - Fix output shape mismatch issue of `max_pool` (#106270) - Add quantization support to `reshape` and `size` for the ONNX exporter (#106629) - Return input itself for non-fp inputs and support decimals for `aten::round` op (#107920) - Apply `peephole` for eval mode when constant folding is enabled only (#95801) - Detect `None` constant during jit scalar type analysis (#101608) - Fix onnx `Gather` constant folding (#101329) - Fix third-party custom operator support in torchscript exporter (#104785) - Fix memory leak when exporting models (#107244) - Fix typo `scipt` -> `script` (#97850) - Fix circular padding to support dynamic axes (#95647) - Perform Shape inference on added `Cast` node (#106093) - Cap opset version at 17 for `torch.onnx.export` (#107829) - Make `torch/csrc/jit/passes/onnx/unpack_quantized_weights.cpp` data_ptr-correct (#100681) ### TorchDynamo ONNX exporter - Fixes for operators: - Fix scalar elements in `op.Concat` (#98509) - Fix `aten::cat` export when arg include parameters (#105373) - Remove duplicated code from previous rebase (#99072) - Cover undiscoverable ops `torch.ops.aten` (#99682) - Fix type annotation for `fx_to_onnxscript` (#100050) - Set `tracing_mode` through `options.dynamic_shapes` and enable dynamic tests in test_fx_to_onnx_runtime.py (#100212) - Add `RemoveConstantInputStep` to adapt torch inputs to ONNX inputs (#100252) - Fix exported onnx initializer name (#104741) - Fix `UnsupportedFxNodesAnalysis` after onnx dispatcher changes (#105156) - Support `torch.device` in FX exporter (#105757) - Fix passes to reuse existing fake mode if possible (#105764) - Exclude 
`FXSymbolicTracer` from `_assert_fake_tensor_mode` (#107712) - Fix initializer naming at `torch.onnx.ExportOutput.save_model_with_external_data` (#105002) - Apply `options.dynamic_shapes` to dynamo API usage in fx exporter (#99199) ## torch.fx - Fix `split_module` bug with unused keys (#95493) - Fix tabulate import error (#104468(https://github.com/pytorch/pytorch/pull/104468 )) - Fix issue with SubgraphMatcher when ignoring literals (#98458) - Update InternalMatch in `subgraph_rewriter` after repeated replacements (#99039) - Fix conv+bn folding issue for mixed dtype (#99696) - Fix submodules/parameters/buffers preservation when unpickling graph module (#104115) - Prevent interpreter from altering original node’s meta (#105880) - Fix split module’s interaction with dead code (#104554) - Fix split copying over `node.meta` (#107248) - Fix repr when arg is an `OpOverload` (#102547) ## Dynamo ### Misc TorchDynamo fixes - Correctly use PythonPrinter for generating wrapper code referencing SymPy (#96710) - Fail fast when dynamo attempts to add unspecialized `int`/`float` as additional graph inputs (#96786) - Simplify module_key creation logic (#94945) - Generalize summary script to work with more CSV names (#98500) - Add support for nonzero, some improvements to reduce guards (#95387) - Update Dynamo.export to preserve names of args & kwargs (#95851) - Slight cleanup of VariableBuilder giant if condition (#95471) - Add guards for deterministic algos (#96695) - Add signpost_event to dynamic_shapes (#103882) - Add some missing disabled functions (#103662) - Add support for dictionary with torch object keys. (#103158) - Add timeout for translation validation instances. (#104654) - Add Wav2Vec2 HuggingFace support (#103009) - Add dyanmo backend based on ONNXRuntime (#106589) - Allow for torch.sym_int to return int while tracing (#104837) - Allow NumPy code in torch.compile to run on cuda (#104699) - Avoid cond prefix when naming subgraph of HigherOrderOperators (#101439) - Avoid graph break on repeat_interleave.self_int (#99528) - Change dimension constraint summary to log.info (#101584) - Debug shape guards (#95848) - Disable dynamo on some opt methods and differentiable optimizer tests (#103066) - Disable fused adam op compile (#105256) - Don't apply automatic_dynamic_shapes if we force tensor to be static (#103673) - Don't specialize torch.Size with specialize_int = False (#96419) - Dynamo size dim kwargs (#97450) - Dynamo stride dim kwargs (#97444) - Enable fused foreach Adam compilation (#104121) - Enable torch._C._get_privateuse1_backend_name in Dynamo tracing (#103141) - Ensure optimizer state references are cleared (#100282) - Equality assertions (#102256) - Explicitly fall back to eager with GraphModule with no output for onnx&tvm backends (#99805) - Extend assert statement to include ListVariable (#100841) - Fix disable_saved_tensors_hooks - graph break (#106875) - Fix for tuple construction from tuple iterators (#97862) - Fix graph break on boolean mask better (#103052) - Fix incorrectly getting the name of OrderedDict's index in dynamo (#96940) - Fix isinstance on SymInt in dynamo (#99393) - Fix lineinfo generation on PY3.11+ (#103525) - fix module buffers call (#102251) - Fix number of inputs in onnxrt and tvm backend (#95429) - Fix optimizer cuda health check graph break (can be done in the compiler) (#102765) - Fix optimizer grad mode state interaction with dynamo (#103952) - Fix OrderedDict reconstruction bytecode (#95800) - Fix the compatible issue of the Dynamo and the 
PyDev.Debugger. (#96721) - Fix torch.compile issue with torch.tensor (#96299) - fix torch.distributions lazy_attribute failure (#103208) - Fix usages of contextmanager without finally (#96170) - Flatten exceptions in dynamo (#100779) - Full default dict support in dynamo (#102202) - Generate type match guard for torch.Size input (#96421) - Graph break on differentiable boolean mask setitem (#102843) - Graph break on operators that fake tensor doesn't support (#97708) - Guard on default device (#99551) - Handle calls to typing.cast (#104799) - Handle dim in size kwargs (#96992]) ([#97098) - Initialize optimizer in dynamo to avoid graph break and tracing slowness (#102640) - Keep submodule's name for nn.Sequential when unrolling (#94913) - Make _CURRENT_TRACING_CONTEXT thread local (#105942) - Make int unspecialization actually work (#95621) - Make openxla and opexla_eval backend show up in list_backends (#107905) - Make Openxla dynamo backend take boxed input (#107260) - Manually generate guards for optimizer (#103121) - Node.stack_trace should have innermost frame last (#95592) - Normalize builtin types to dtypes (#106074) - Add a flag that allows breaking on NumPy ops (#107687) - Fix `ndarray.__pow__ `(#107746) - Return `NotImplemented` for `np.sort(complex)`` (#107710) - Support `linalg`, `random` and `fft` module (#105320) - `torch._numpy`: remove noops and half-implemented nan-functions (#107596) - Wrap ndarray dunder methods (#107689) - Pass `torch.compile` mode/options to all backends (#99645) - Update pre_dispatch tracing: support autocast and no_grad/enable_grad ctx managers, add a pre_dispatch_eager dynamo backend (#103024) - Preserve CreationMeta when metafying views (#103152) - Preserve mark_dynamic when cloning inputs (#99617) - Prevent GraphArg from keeping real tensors live (#100515) - Propagate mark_dynamic in dynamo compiled outputs (#99634) - Properly avoid wrapping numbers as tensors before backend (#96193) - Properly parenthesize dynamo_dynamic_indices test (#99823) - Properly respect automatic dynamic config for unspec int (#103321) - Raise warning if user has hooks installed on the module (#94848) - Record caller frame instead of function frame (#96882) - Resolve InlinedClosureVariable in InstructionTranslator stack (#106491) - Rewrite size/stride/numel TensorVariable handling (#103438) - Simulate torch function enablement state (#105091) - Simulate tracing tree_map_only (#104815) - Simulate treespec flattening/unflattening (#101896) - Skip if curr_size is None (#101170) - Support CUDA stream passed from outside of torch.compile decorator (#94627) - Support getattr for ConstantVariable when compiling with Dynamo (#98153) - Support module dict iter (#99503) - Support threading.local getattr (#104292) - Support unary not on lists (#102210) - Support wrapping + returning tensor subclasses (#104802) - Trace through Tensor slots (#107159) - Turn on add_runtime_assertion by default (#102671) - Tweak dynamic=False behavior (#105715) - Update XLA dynamo backend name (#106489) - Update `exir.pass_base` to use export.pass_base (#106647) ### Misc dynamic shapes fixes - Add API to mark input tensors static for cudagraphs (#107154) - Add invariant that all symbolic shapes must be bound in graph (#99089) - Add support for Inductor + symbolic shapes + training (#93059) - Add symbolic tracing support to `torch._dynamo.export` (fake input + weights) (#100017) - Add unbacked symbol support (#98877) - Always create ShapeEnv, always apply unspec logic (#103302) - Do not mutate SymNode 
expressions. (#107492) - Do not track parameters, do not generate guards (#98350) - Add dynamic range constraint API (#98779) - Enable dynamic shapes of torch.nn.Parameter (#105855) - Further improve symbolic shapes logging (#99159) - Group constraints by arg (#102096) - Guard static shapes alongside tensors, instead of from shape_env, in dynamic_shapes=True (#99566) - Make hash_storage work with size 0/1 storage (#100467) - Make unspecified ints to range over negative and positive. (#104658) - Propagate dynamic int on `__setitem__` (#105923) - Remove redundant `dynamic_dim` (#107815) - Support bit shifting `SymInt`s (#104318) - Switch dynamic_shapes to True by default (#103597) - Warn if guards are added to ShapeEnv after we produced guards (#97820) - Don't specialize when indexing by SymInt (#99123) - Fix specialization when you pass an unspec int into slicing on a Python list. (#104142) - Flag guard unbacked SymInt/SymFloat support (#94987) ### Benchmark related bug fixes - Add a flag to benchmarks script to keep the test report directory (#96398) - Fix amp in inference in benchmarking suite (#103220) - Handle new inference csv from CI for benchmarking (#98294) - Small operatorbench improvements (#103110) ### Export related bug fixes - Add `aot_export` (#101490) - Add get buffer from exported program (#107809) - Add support for edge dialect ops in exir/serde (#106371) - Change `torch._dynamo.export(aten_graph=...)` to allow `pre_autograd` tracing (#98031) - Error on closed over variables (#99367) - Enable dynamo export to export identity function (#94962) - Error when constraining on static values (#101655) - ExportedProgram (#102259) - Fix soundness bug with unsupported constraints (#102897) - Fix specify_constraints signature for exporting module (#101831) - Improve error message for IO mismatch (#107907) - Make serializer more composable (#104816) - Persist `torch.assert` in aten graph (#100101) - Preserve `meta"val"]`` on export ([#95314) - Raise error on 3.11 dynamo export (#95088) - Refactor and add same_signature flag to dynamo.export (#106569) - Refactor dynamic dims api, stateless internals, higher level export API (#96699) - Remove eliminate_dead_code (#105875) - Remove fake_mode arg from torch._dynamo.export API (#106345) - Remove setter for graph_module (#106651) - Suggest constraints to specify for export based on generated shape guards (#98463) - Support list output for HigherOrderOperators (#101986) - Support runtime assertion for inline constraints (#100763) - Integrate `torch.ops.call_delegate` into the delegate workflow (#92562) - Wrap more constraint violation cases to UserError (#100897) ### Logger bug fixes - Rename sym_shapes logger to dynamic (#99335) - Raise a NameError when accessing non-existent variable (#96418) - Convert logging f-strings to use % format, part five (#98765) - Enable passing a dict of module names: log level to set_logs python api (#98989) - Expose function to retrieve list of registered loggers (#100776) - Remove unnecessary check when logging artifacts (#99260) - Revamp guard debug logging (#107505) - Add assert + test for artifact log booleans (#104907) - Add fast traceback utilities (#107358) - Add graph break logging option instead of config flag (#103202) - Add verbose_guards logging artifact (#107388) - Do not use unicode quotes (#99446) - Elevate cudagraphs failure to warning, added lineno to recompiles (#105081) - Generate error on bad input to equality constraint (#107311) - Fix outdated log settings in doc (#102285]) ([#102286) - 
Make DimConstraints create actionable message (#100103) - Make sure log tests are run in non-verbose mode (#106496) - Report guard failures with recompiles logging (#105500) - Update error message with torch logging instructions (#102892) - Fix typo in settings regex logging (#97245) - Improve TORCH_LOGS settings error msg (#97264) ### Minifier related bug fixes - Add `--check-str` support to after_aot minifier (#104758) - Teach `requires_bwd_pass` how to interpret int (#98312) - Add `--offload-to-disk` support to minifier (#100546) - Improve minifier printing to be more chatty when it makes sense (#100486) - Make `run_fwd_maybe_bwd` work with int inputs (#99365) - Misc accuracy improvements on minifier (#100447) - Print AOT Autograd graph name when accuracy failed (#99366) - Relax after_aot restriction on no buffers, serialize small constants (#100472) - Cast copied model rather than update the original model (#101901) ## Inductor - Skip triton configs for `mm_plus_mm` that may crash triton ( #96385) - Avoid fusion with indirect indexing (#96273) - Use 64-bit indexing for large tensors in triton codegen (#97447) - Make `aten.constant_pad_nd` always do a copy even when padding is 0 to have consistent behavior (#100082) - Make argmin/max handle duplicate values in a way consistent with eager ( #99920) - Handle negative padding in reflect_pad_backward. ( #100923) - Make `torch.sign` return the same type as input (#101346) - Only reuse aliased buffers if there are no more users ( #100332) - Fix a number of issues with divs in `ValueRangeAnalysis` (#100547) - Prevent pattern matches across mutation ops in inductor pre-grad FX passes (#101144) - Avoid caching stale `inner_fn_str/ReadWrites` objects (#106502) - Correctly infer dtype of `full` (#95593) - Avoid zero division error for `dropout` (#100222) - Fix multi output layout error in indexing dtype calculation (#108085) - Bug fixes for the CPU backend - Fix compilation issues on pre clang-10 (#103347) - Fix compiler error when trying to vectorize `logit_and` and `logit_or` (#95361) - Properly handle 3D tensor for `Conv2d` ( #99601) - Fix reduction crash caused by storing `float` value to `bfloat16` (#102719) - Properly hande broadcasting for `bfloat16` (#104319) - Fix compilation for TIMM `mobilevit_s` model (#100230) - Bug fixes for the AMD backend - Triton wheel support enabled in non-ROCm environments (#95142) - Conditionalise triton mm/conv configs on ROCm to mitigate crashes (#107584) - Dynamic shape related bug fixes - Disable persistent reductions with dynamic shapes since persistent reduction relies on tensor shapes (#98405) - Turn off `divisible_by_16` for dynamic shapes (#98471) - Make `philox_rand_like` work with dynamic shapes (#95461) - Handle `int`/`float` arguments for cpp codegen in inductor (#95533) ## JIT - Mark `torch.cuda._exchange_device` op as having side effects (#96364) - Fix `jit.trace` codegen for out-variants on ops with more than one output (#101563) - Make `NNC` compatible with LLVM 15-17 (#96762), #98811), #101396), #103824) - Fix errors found by fuzzing and sanitizers (#94815), #101400), #102156), #103667), #103969), #106041), #103327), #94300) - Fix handling of >32-bit scalars on 32-bit platforms in `NNC` (#97669) - Fixes for `NNC`’s variable handling and serialization on big-endian systems (#96951), #95881), #104249) - Ignore meta-device tensors instead of erroring when loading a model with a target device (#100495) - Add overloads for `_unsafe_index_put`, `_unsafe_index` (#104127) - Match eager result from 
`torch.round` in `NNC` codegen (#104430) - Fix lifetime of `JITException` binding (#106401) - Fix annotation handling for subclasses in python >= 3.10 (#104485) ## Misc - Stride bugfix: add overflow check for stride calculation (#94900) - Set `SavedVariable.is_output` to `true` for `grad_fn->result_` (#105504) - Handle tail 0-size tensor appropriately in `MultiTensorApply` (#100811) - Fix `UntypedStorage` pin error (#104355) - Fix `validate_input_col` for `nn.Module` or `Callable` (#96213) - Fix segmentation fault in flatbuffers when parsing malformed modules (#95221) - Fix TorchScript support in `as_nested_tensor` (#97960) - Reintroduce s390x SIMD support (#99057) # Performance ## General - Avoid copies in matmul (#76828) - Improve the precision of `abs()` and `sign()` for large values (#99550) - Fuse ops in eager cosine_similarity while keeping the stability and the gradients (#104771) - Add scalar conversion using avx instructions for half (#102140) - enable `Half` for cat serial kernel (#96021) - Re-enable AVX512 ATen kernels for compute-intensive ops (#104165) ## torch.optim - Minimize the use of intermediates across all foreach optimizers (Adam, AdamW, RAdam, NAdam, Adadelta, Adagrad, Adamax, Rprop, ASGD, RMSprop, SGD) to decrease peak memory (#104780, #104898, #104904, #104910, #104983, #104988, #104991, #105193, #105146, #105161, #105599) - Only make a shallow copy when loading optimizer state_dict (#106082) - FusedAdam/FusedAdamW accepts lr: Tensor without h2ds (#106916) - Use plain power operator in Adam/AdamW when capturing (#104254) - Optimize EMA implementation (#94820) ## torch.nn - Optimize reflection padding performance on CPU (#102254) - Optimize replication padding performance on CPU (#102255) - Improve precision and performance for BFloat16 upsampling (#91169) ## Sparse Improved performance in the following: - `mul(COO, COO)` (#96094, #94596 ) - `COO.coalesce` (#94401, #96765) - `COO.sparse_mask(COO)` (#96094, #94406, #94596) - `mm(dense, BSR)`, `addmm(dense, BSR)`, linear with BSR weights (#94825, #94823, #96648, #100634, #100543, #100876, #100882, #98403, #104062) - `sparse.sum` backward (#98838, #94991) - relu/threshold backward for COO (#98935) - `add/add_` for CSR/BSR (#95293) ## torch.compile - Implement CSE for guards (#98488) ## Distributed ### Distributed (c10d) - Enable `store_barrier` only on the ranks that are part of the process group and not the whole world to make it scalable in PG initiation. 
(#99931) ### Distributed Tensor (Prototype Release) - Improve perf to reduce `DTensor` CPU overhead (#106524, #107181, #107305) ### FullyShardedDataParallel: - Speed up first iter order check (#96146, #96220) - Reduce CPU overhead in FSDP (#96958) ## CUDA - Speed up bincount and histc on CUDA (#97090) - Speed up indexing_backward_kernel with duplicates (#100505) - Speed up torch.cat on contiguous tensors with wide loads (#102815) - Speed up LossCTC (#97269) - Speed up prefix scan algorithm (#103314, #103435), #103502) - Speed up vectorized_layer_norm (#107287) ## Intel - Improve mkldnn matmul performance when one input is contiguous tensor but the strides are not default contiguous strides (#99511) # MPS - Implement NEON accelerated implementation of ERF() (#105610) - Add encoder coalescing support for native kernels (#99810) - Add PipelineStateObject caching for advanced indexing kernels (#99855) - Squeeze last dimensions, if possible, for 5D (or bigger) reductions to map them to optimal 4D implementation (#99856) ## Vulkan - Pad channels when using texture storage instead of "tight packing" (#95251) - Introduce GPU Memory Layout qualifier allow more efficient memory layouts when storing Tensors (#106978) ## ONNX - Improve diagnostics performance (#99936, #96348) - Don't duplicate model weights in ONNX export (#101134) - Reduce exporter memory usage by removing intermediate values (#101148) - TorchScript ONNX exporter: - `aten::relu6`: avoid unncessary Relu operation (#99022) ## Inductor - Match decomposed attention patterns and replace them with eager implementation. This improves perf since eager implementation may use flash attention which do comprehensive fusions. ( #97741, #100609, #107578) - `matmul` padding (#101913, #102200, #103600 ) - Fuse type casting with triton matmul kernel (#106443, #106516, #107495 ) - Improve loop ordering to generate more coalesced memory access (#106827) - Enable persistent reductions (#94847, #102444 ) - Layout optimization for convolution ( #99773 ) - Improve max-autotune ( #95554, #95555, #96410, #97219 ) - Coordinate descent tuning: doing coordinate descent search to find promising triton configs that are good for perf. (#99594, #99403, #103660 ) - Inductor Freezing (#100652) - Horizontally Fuse Addmm for inference (#100746 ) - Avoid unnecessary copy (#102089 ) - Convert layout of conv weight to channels last ahead of time for inference (#103642) - Performance improvement for CPU backend: Support mkldnn packed linear to improve bfloat16 performance ( #96954 ) - Performance improvement for dynamic shapes: - Support multilayer reduction (#99475, #101915, #106747 ) - Apply divisible_by_16 flag in more cases for vectorized load and store ( #105743 ) ## Release Engineering - Add workflow for quick perf comparison for inductor (#96166) - Run inference accuracy and performance tests with bfloat16 for inductor (#103535) - Add DALLE2_pytorch to inductor benchmarking workflow with AMP fallback (#104283) - Run the inductor benchmark suite with dynamic batch only (#97912) ## torch.export - Speed up export time by avoiding calling the callable during export. 
(#107249) ## JIT - Improve load times by reducing the number of times re-indexing occurs (#102312) - Skip the source info in the error report if the source code is too large (#105608) # Documentation ## CUDA - Fix `torch.cuda.mem_get_info` doc (#96621) ## DataPipe - Add generated docstring to functional form DataPipe (#100503) - Update docstring for functional form of DataPipes (#100446) ## torch.fx - Update fx.pass.graph_drawer usage doc to draw fx graph (#95534) - Add doc test in graph_drawer.py (#95919) - Add docs for writing ATen IR passes + FX Pattern matching (#100577) - Update torch.fx docs (#97058) - Fix typos under torch/fx directory (#97596) ## torch.export - `torch.export` landing page (#108783) ## Intel - Add best practices doc for CPU backend (#105051) ## Linear Algebra - Fix docs rendering in linalg.{matrix_exp, ldl_factor}. (#101363, #99777) - Fix examples in linalg.tensorinv. (#105911) - Improve error message for crashes related to `linalg.eigh` when input matrix is ill-conditioned, in some cusolver versions (#107082) ## optim - Document optimizer state_dict() better with an example (#105958) - Have SGD summary show up in optimizer overview (#107738) ## Python Frontend - Improve docs for `torch.gradient` (#98824, torch.complex (#99938), `torch.arange` (#99963), torch.asarray (#100971), torch.{cat,stack} (#103421), torch.randn signature (#102075), `torch.bitwise_right_shift` (#103500), `torch.round` (#97227), `torch.fake_quantize_per_tensor_affine` (#104453), `torch.manual_seed` (#105175), torch.resolve_neg (#104151), `torch.fake_quantize_per_channel_affine` (#105241, #105955), `torch.bucketize` (#104474), `torch.slice_scatter` (#107849), `RNN` (#106222), `torch.unique` (#108292) ## Quantization - Fix disbale--and other--typos (#95322) - Fix return values of `_get_name()` in quantized ConvTranspose (#97678) - Fix docs for `prepare_fx`/`prepare_qat_fx` (#105979) - Error when someone calls train/eval on pre_autograd graph (#108143) - Move dropout replacement to `move_model_to_eval` (#108255) - Fix and rename `move_model_to_eval` to `move_exported_model_to_eval` (#109027) ## Inductor - Improve Discoverability of Inductor Optimizations (#95824 ) ## Release Engineering - Fix doc-rendering error and deprecate CircleCI docs scripts (#105678) ## Dynamo - Add a RST doc for the performance dashboard (#100592) - Small doc update for torch_compile_debug (#95809) - Logging documentation updates (#100595) - Move Dynamo IPEX backend to training/inference category (#108643) ## nn_frontend - Remove future deprecation warning from kl_div docs (#96541) - Fix the docs for `cosine_similarity` (#104772) - Correct `HingeEmbeddingLoss` documentation (#95140) - Fix docstring for shape of `target` for MultiLabelSoftMarginLoss (#107817) - Document differing behavior of RReLU between training and evaluation (#95624) ## ONNX - Remove API reference for TorchScript export diagnostics (#107979) - Refactor `torch.onnx` documentation (#108379) ## Distributed ### FullyShardedDataParallel - Update the doc to be more clear that per-device NCCL stream is per PG (#95705) - Re-addd why we register the post-backward hook only on the first forward in the case of multiple forwards (#95326) - Clarify CPU offload implicitly in reshard_doc (#98666) - Document `optim_state_dict_config` in method (#102657) - Document `get_state_dict_type` (#102658) ### Distributed (c10d) - Fix typos in comments under torch/csrc/distributed (#96062) - Update isend/irecv warning messages for nccl (#95236) - Add warning about object-based 
collectives for GPU tensors to docs. (#97702) ### Distributed Checkpoint - Fix documentation for distributed checkpointing for optimizers (#95264) - Add fsdp checkpoint example (#95258) - Update DCP doc to use the updated FSDP optim state_dict APIs (#95303) - Update documentation to read FileSystemReader instead of FileSystemLoader (#102795) - Add documentation for HSDP saving using DCP (#104810) ### RPC - Add missing RRef docs for RPC (#106902) ## Sparse Frontend - Improve error message when expand is called on sparse tensor (#98365) ## Composability ### Dynamic Shapes - Update dynamic shapes documentation (#109764) ## Dynamo - Add docs for `torch.compile(numpy)` (#109710) # Developers ## torch.fx - Fix typos in `torch/fx/_compatibility.py` (#97618(https://github.com/pytorch/pytorch/pull/97618 )) - Add `torch/utils/_stats.py` to stack frame skiplist (#98117) - Add pre_autograd kwarg to make_fx (#97559) - Revert torch.fx.interpreter error printing change (#101462) - Fix pytree error formatting string (#105935(https://github.com/pytorch/pytorch/pull/105935 )) - Assume SymPy is always installed (#94903) - Add a more error checking to minifier (#103057) - Refactor unwrap_proxy() for proxy tensor tracing (#104667) - Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432) - Enable flake8-simplify checks (#97984) ## Inductor - Allow overriding the decomposition table in compile_fx API. ( #95468 ) - Allow saving parameters for compiling a graph and relay later to improve development efficiency (#106952 ) - Support benchmarking kernel perf to gather metrics like latency and memory bandwidth ( #95355, #95506, #95845, #96458, #96461, #97057, #103547) - Tracking operator count ( #100329 ) - Print the path to the generated wrapper code with TORCH_LOGS=output_code (#99038 ) - Provenance tracking for wrapper code (#105717, #95901 ) - Support inductor OSS perf dashboard (#95685, #99387, #99754, #105221) ## Composability - A number of improvements that make it easier for custom backends to integrate as a pytorch eager mode backend out-of-tree, through the PrivateUse1 DispatchKey - Allow privateuse1 key to be used with legacy constructor (#95748) - Add Generator register for the privateuse1 backend (#93920) - Optimize the AMP func name in custom_device_mod (#98052) - Enable dispatch stub for backend PrivateUse1 (#99611) - Support random for custom device (#97420) - Nvfuser python API import fix (#94036) - Add ability to create library fragments (#98439) - Core aten IR: - Tag functions to core IR in native_functions.yaml (#105849) - Add `_native_batch_norm_legit_no_training` to core IR (#107732) - Make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791) - Avoid extra copies in batchnorm decomposition inference by introducing a new op, _native_batch_norm_legit_no_training (#94946) - Add `aten.smooth_l1_loss_backward` to `core_aten_decompositions` (#100267) - Add `empty`/`empty_like` to core aten decomps (#105158) - Fixed missing-prototypes warnings in torch_cpu (Part 1) (#100053) - Fix typos in checkFloatingOrComplex errors (#102456) - Allow existing "Python RAII guards" to be used as context managers (#102579) - Replace _prims_common.check with torch._check* (#103240) - Update core aten decomp table (#105673) - Generate mypy hints for torch.Tag, add a couple of pointwise ops (#106910) - aot_autograd: avoid using intermediate_base logic unnecessarily (#97786) - Fix disable amp for runtime wrapper (#97864) - aot_autograd: 
more logging on metadata asserts (#99177) - Proper handling when outputs are aliased but have identical size/stride/offset metadata (#100430) - Fix de-dupping metadata computation bug (#100431) ## Release Engineering - Rename default branch to main (2418b945763) - Use PyTorch wheel in Windows CI (#94958) - Use GPU machine and run GPU tests with Bazel builds (#95721) - Enable simpler C++ test discovery + running workflow on CI with run_test.py (#99956, #99559) - Run C++ test_api binary directly in CI slow jobs (#101088) ## Autograd Frontend - Add a macro for derivatives formulas that returns multiple outputs and can be specified to save certain tensors conditionally (#103750) - Fix `torch._C.get_current_graph_task_execution_order` `accumulate_grads` ordering (#105353) - `torch.autograd._force_original_view_tracking` to work as both context manager and function (#106706) - Enable `autograd` to be compiled (#103822, #104316) ## JIT - Create public interface for `torch.jit` to reduce `pyright` errors (#101678) ## optim - Change step from 1D to singleton tensor in `Adam` (#96994) ## ONNX - Remove torch dependencies in _beartype (#98958) - Delay torch.onnx import to after all dynamo sub]components ([#99070) - Enable xdoctests in CI (#98546) - Update ONNX submodule from ONNX 1.13.1 with Protobuf 4.21 updates (#96138) - Run ONNX tests as part of standard run_test script (#99215) - Skip flaky dynamic tests before ORT==1.15 in fx exporter (#98856) - Add additional_test_kwargs into test_fx_to_onnx_with_onnxruntime.py (#99434) - Bump onnx-script version with imported module renaming (#99926) - Add test_fx_op_consistency.py (#99465) - Refactor test_op_consistenct.py and test_fx_op_consistency.py (#100172) - Add xfail into subtests of op consistency and retire fixme (#100173) - Skip flaky dynamic test in CI (#100297) - Add supported ops into test_fx_op_consistency - 1st batch (#100265) - Bump onnx submodule to release 1.14.0 (#101809) - Bump ORT version to 1.15.0 (#102248) - Add FX exporter MaxPool tests (#102773) - FX Dispatcher Test (#103971) - Bench torch.onnx.dynamo_export and torch.onnx.export under dynamo bench (#103135) - Separate fx _type_utils from torchscript exporter (#103942) - Use load_model_from_string (#104533) - Enable ruff's UP rules and autoformat onnx/ (#105427) - Suppress ORT warnings in unit tests (#105624) - Add comment on test_view_dynamic_zero_dim (#105950) - Bump numpy from 1.21.6 to 1.22.0 in /benchmarks/dynamo/_onnx (ab9ea0d0f25) - Enable skipped gpt2 test (#94930) - Clean up outdated skip ort < 1.15 decorator in tests (#105951) - Add test support for dynamic shapes for torch.onnx.dynamo_export (#106495) - Update xfail reasons in fx runtime tests (#107257) - Add unittest for exporting embedding_bag (#105862) - Add huggingface models into CI tests (#107247) ## Distributed ### FullyShardedDataParallel - Log FSDP mixed precision (#97367) - Add loggings of modules FSDP hooks firing (#102508) - Print out more useful error message for optim_state_dict (#96860) - Use `INFO` instead of `DETAIL` for warning logs (#102639) - Add a summary log when finishing state_dict (#103784) ### Distributed (c10d) - Fix `sandcastle_skip_if` decorator name is confusing (#95649) - Add sequence number in PG Wrapper (#97462) - Print collective in PG Wrapper (#97544) - Add diff capability in PG Wrapper (#100214) - Add `TDD`, `NCCL_DEBUG` log (#97692) - Don't crash on retrieve NCCL DesyncReport (#98470) - Print stacktrace on collectFullMesh in for Gloo (#98810) - Add fully_shard debug function to print 
sharded tree structure, module names and managed param fqns (#99133) - Enhance error msg in PG Wrapper (#100213) - Make `ProcessGroupNCCL` work.wait() respect timeout (#100162) - Add size info to collective logs (#100413) - Add the logics and interface to log ProcessGroup comms configuration (#104373) - Make NCCL default logging more friendly. (#105695) - Add OnCompletion Hook to ProcessGroup (#106988) (#107233) - Improve input mismatch error msg (#107281) ### DistributedDataParallel - Add debug logging around DDP mixed precision copies (#96438) ### Distributed Tensor (Prototype Release) - Add necessary logging to APIs and components for PTD use cases such as `DTensor,` TP and DCP (#101994, #102209, #102278 ## Sparse Frontend - Expand sparse.softmax zero nnz tests to cover cases of previously reported FPE (#95646) - Use nested namespaces in sparse (#97581) - Fix cuSparse CSR SPMM when using nullptr in csrRowOffsets (#105957) - Remove CUTLASS extensions merged upstream (#107612) - Remove CUTLASS extensions merged upstream (#107612) - Fixes for reference and move (#95942) - Triton kernels without public API - Use missing-prototypes in torch_cpu (#104138) - SDPA: Support frontend for BSR masks (#104042) - sampled_addmm: Support BSR (#101163) - softmax: Support Triton kernel for BSR inputs (#102095) # Security ## Release Engineering - Move mergebot and other CI/CD workflows to its own secure environment (#107060)

PyTorch 2.0.1 Release, bug fix release (2023-05-08)

This release is meant to fix the following issues (regressions / silent correctness):
* Fix `_canonical_mask` throwing a warning when bool masks are passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
* Fix Embedding bag `max_norm=-1` causing a "leaf Variable that requires grad is being used in an in-place operation" error #95980
* Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None #96804
* Fix "Can't convert float to int when the input is a scalar np.ndarray" #97696
* Revisit torch._six.string_classes removal #97863
* Fix module backward pre-hooks to actually update gradient #97983
* Fix load_sharded_optimizer_state_dict error on multi-node #98063
* Warn once for TypedStorage deprecation #98777
* Fix incorrect use of emplace in the cuDNN V8 API benchmark cache #97838
### Torch.compile:
* Add support for Modules with a custom `__getitem__` method to torch.compile #97932
* Fix improper guards on list variables #97862
* Fix Sequential nn module with duplicated submodule #98880
### Distributed:
* Fix distributed_c10d's handling of custom backends #95072
* Fix MPI backend not being properly initialized #98545
### NN Frontend:
* Update Multi-Head Attention's doc string #97046
* Fix incorrect behavior of `is_causal` parameter for torch.nn.TransformerEncoderLayer.forward #97214
* Fix error for SDPA on sm86 and sm89 hardware #99105
* Fix nn.MultiheadAttention mask handling  #98375
### DataLoader:
* Fix regression for pin_memory recursion when operating on bytes #97737
* Fix collation logic #97789 
* Fix potentially backwards incompatible change with DataLoader and is_shardable DataPipes #97287
### MPS:
* Fix LayerNorm crash when input is in float16 #96208
* Add support for cumsum on int64 input  #96733
* Fix issue with setting BatchNorm to non-trainable #98794
### Functorch:
* Fix segmentation fault for vmapped functions accessing BatchedTensor.data #97237
* Fix index_select support when dim is negative [#97916](https://github.com/pytorch/pytorch/pull/97916)
* Improve docs for autograd.Function support #98020
* Fix Exception thrown when running Migration guide example for jacrev #97746
### Releng:
* Fix Convolutions for CUDA-11.8 wheel builds #99451
* Fix Import torchaudio + torch.compile crashes on exit #96231
* Linux aarch64 wheels are missing the mkldnn+acl backend support  - [https://github.com/pytorch/builder/commit/54931c264ed3e7346899f547a272c4329cc8933b](https://github.com/pytorch/builder/commit/54931c264ed3e7346899f547a272c4329cc8933b)
* Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - [https://github.com/pytorch/builder/issues/1375](https://github.com/pytorch/builder/issues/1375)
* Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
* PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
* Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb  - https://github.com/pytorch/builder/pull/1370
### Torch.optim:
* Fix fused AdamW causing NaN loss #95847
* Fix fused AdamW having worse loss than Apex and unfused AdamW for fp16/AMP #98620 (see the sketch below)
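The fused implementation referenced by these fixes is opted into through the optimizer's `fused` flag. A minimal sketch (not taken from these notes; the toy model is illustrative and the fused path requires CUDA tensors):

```Python
import torch
from torch import nn

model = nn.Linear(8, 8).cuda()
# fused=True selects the fused CUDA AdamW implementation whose fp16/AMP
# behavior the fixes above address.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

with torch.autocast("cuda", dtype=torch.float16):
    loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```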

The [release tracker](https://github.com/pytorch/pytorch/issues/97272) should contain all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever (2023-03-15)

# PyTorch 2.0 Release notes

- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation

# Highlights

We are excited to announce the release of PyTorch® 2.0 ([release note](https://github.com/pytorch/pytorch/releases)), which we highlighted during the [PyTorch Conference](https://www.youtube.com/@PyTorch/playlists?view=50&sort=dd&shelf_id=2) on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at the compiler level under the hood, delivering faster performance and support for Dynamic Shapes and Distributed training.

This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, and functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inference, performance, and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 [Get Started page](https://pytorch.org/get-started/pytorch-2.0).

Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, as well as separate libraries such as TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community-supported mode. More details can be found in this [library blog](https://pytorch.org/blog/new-library-updates-in-pytorch-2.0/).

This release is composed of over 4,541 commits from 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.

Summary:

- torch.compile is the main API for PyTorch 2.0: it wraps your model and returns a compiled model (see the short sketch after this summary). It is a fully additive (and optional) feature, and hence 2.0 is 100% backward compatible by definition.
- As the underpinning technology of torch.compile, TorchInductor on NVIDIA and AMD GPUs relies on the OpenAI Triton deep learning compiler to generate performant code and hide low-level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized CUDA libraries such as cuBLAS.
- Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile(), and model developers may also use the [scaled dot product attention](https://pytorch.org/docs/2.0/generated/torch.nn.functional.scaled_dot_product_attention.html) kernels directly by calling the new scaled_dot_product_attention() operator.
- The Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on Mac platforms, with added support for the top 60 most-used ops, bringing coverage to over 300 operators.
- Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based [C7g instances](https://aws.amazon.com/blogs/aws/new-amazon-ec2-c7g-instances-powered-by-aws-graviton3-processors/). PyTorch 2.0 improves inference performance on Graviton compared to previous releases, including improvements for ResNet-50 and BERT.
- New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
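As a quick, hedged illustration of the first and third bullets above (the module, shapes, and names here are invented for the example and are not from these notes), the sketch below compiles a small model with torch.compile and calls the new scaled_dot_product_attention operator directly:

```Python
import torch
from torch import nn
import torch.nn.functional as F

class ToyAttention(nn.Module):
    def __init__(self, head_dim=64):
        super().__init__()
        self.proj = nn.Linear(head_dim, head_dim)

    def forward(self, q, k, v):
        # Fused SDPA kernel exposed in 2.0 as torch.nn.functional.scaled_dot_product_attention.
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out)

model = ToyAttention()
compiled_model = torch.compile(model)  # wraps the model and returns a compiled model

q = k = v = torch.randn(2, 4, 8, 64)   # (batch, heads, seq_len, head_dim)
out = compiled_model(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8, 64])
```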

| Stable | Beta | Prototype | Platform Changes |
|---|---|---|---|
| Accelerated PT 2 Transformers | torch.compile | DTensor | CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6) |
| | PyTorch MPS Backend | TensorParallel | Python 3.8 (deprecating Python 3.7) |
| | Scaled dot product attention | 2D Parallel | AWS Graviton3 |
| | Functorch | Torch.compile (dynamic=True) | |
| | Dispatchable Collectives | | |
| | torch.set_default_device and torch.device as context manager | | |
| | X86 quantization backend | | |
| | GNN inference and training performance | | |
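One of the Beta rows above, `torch.set_default_device` and `torch.device` as a context manager, can be sketched as follows (a minimal, illustrative example; "cpu" is used so it runs anywhere, but any device string works):

```Python
import torch

# Process-wide default device for factory functions such as torch.randn.
torch.set_default_device("cpu")

# Scoped override: tensors created inside the block are placed on this device.
with torch.device("cpu"):
    x = torch.randn(3, 3)

print(x.device)  # cpu
```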
\*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click[ here](https://docs.google.com/spreadsheets/d/1H3jazwO8BBCwK8JwLNYspLiHfUrzshEtyqjL-X93I9g/edit#gid=790902532) # Backwards Incompatible Changes ### **Drop support for Python versions <= 3.7 (#93155)** Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See [Hardware / Software Support ](https://github.com/pytorch/pytorch/blob/893aa5df3f2a475c91ea8eadb1353812e52fb227/RELEASE.md#python) for more information. ### **Drop support for CUDA 10 (#89582)** This PR updates the minimum CUDA version to 11.0. See the [getting-started](https://pytorch.org/get-started/locally/) for installation or [building from source](https://github.com/pytorch/pytorch#from-source) for more information. ### **Gradients are now set to `None` instead of zeros by default in `torch.optim.*.zero_grad()` and `torch.nn.Module.zero_grad()` (#92731)** This changes the default behavior of `zero_grad()` to zero out the grads by setting them to `None` instead of zero tensors. In other words, the `set_to_none` kwarg is now `True` by default instead of `False`. Setting grads to `None` reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling `zero_grad()` as they will now be `None`. To revert to the old behavior, pass in `zero_grad(set_to_none=False)`.
1.13 2.0
```Python >>> import torch >>> from torch import nn >>> module = nn.Linear(2,22) >>> i = torch.randn(2, 2, requires_grad=True) >>> module(i).sum().backward() >>> module.zero_grad() >>> module.weight.grad == None False >>> module.weight.grad.data tensor([[0., 0.], [0., 0.]]) >>> module.weight.grad + 1.0 tensor([[1., 1.], [1., 1.]]) ``` ```Python >>> import torch >>> from torch import nn >>> module = nn.Linear(5, 5) >>> i = torch.randn(2, 5, requires_grad=True) >>> module(i).sum().backward() >>> module.zero_grad() >>> module.weight.grad == None True >>> module.weight.grad.data AttributeError: 'NoneType' object has no attribute 'data' >>> module.weight.grad + 1.0 TypeError: unsupported operand type(s) for +: 'NoneType' and 'float' ```
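To keep the pre-2.0 behavior of materialized zero gradients, `set_to_none=False` can be passed explicitly, as noted above; a minimal sketch:

```Python
import torch
from torch import nn

module = nn.Linear(2, 2)
module(torch.randn(2, 2)).sum().backward()

# Opt back into the 1.13 behavior: grads become zero tensors instead of None.
module.zero_grad(set_to_none=False)
print(module.weight.grad)  # tensor([[0., 0.], [0., 0.]])
```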
### **Update `torch.tensor` and `nn.Parameter` to serialize all their attributes (#88913)** Any attribute stored on `torch.tensor` and `torch.nn.Parameter` will now be serialized. This aligns the serialization behavior of `torch.nn.Parameter`, `torch.Tensor` and other tensor subclasses
1.13 2.0
```Python # torch.Tensor behavior >>> a = torch.Tensor() >>> a.foo = 'hey' >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>> print(a.foo) hey >>> print(b.foo) AttributeError: 'Tensor' object has no attribute 'foo' # torch.nn.Parameter behavior >>> a = nn.Parameter() >>> a.foo = 'hey' >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>> print(a.foo) hey >>> print(b.foo) AttributeError: 'Parameter' object has no attribute 'foo' # torch.Tensor subclass behavior >>> class MyTensor(torch.Tensor): ... pass >>> a = MyTensor() >>> a.foo = 'hey' >>> print(a.foo) hey >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>>print(b.foo) hey ``` ```Python # torch.Tensor behavior a = torch.Tensor() a.foo = 'hey' >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>> print(a.foo) hey >>> print(b.foo) hey # torch.nn.Parameter behavior >>> a = nn.Parameter() >>> a.foo = 'hey' >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>> print(a.foo) hey >>> print(b.foo) hey # torch.Tensor subclass behavior >>> class MyTensor(torch.Tensor): ... pass >>> a = MyTensor() >>> a.foo = 'hey' >>> print(a.foo) hey >>> buffer = io.BytesIO() >>> torch.save(a, buffer) >>> buffer.seek(0) >>> b = torch.load(buffer) >>>print(b.foo) hey ```
If you have an attribute that you don't want to be serialized you should not store it as an attribute on tensor or Parameter but instead it is recommended to use `torch.utils.weak.WeakTensorKeyDictionary` ```Python >>> foo_dict = weak.WeakTensorKeyDictionary() >>> foo_dict[a] = 'hey' >>> print(foo_dict[a]) hey ``` ### **Algorithms `{Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD}` default to faster `foreach` implementation when on CUDA + differentiable=`False`** When applicable, this changes the default behavior of `step()` and anything that calls into `adadelta(...)`, `adagrad(...)`, `adam(...)`, `adamax(...)`, `adamw(...)`, `asgd(...)`, `nadam(...)`, `radam(...)`, `rmsprop(...)`, `rprop(...)`, `sgd(...)` directly to use the `foreach` implementation instead of the for-loop for better performance. However, this change can potentially be backward incompatible since there may be small numerical differences between the results computed with the `foreach` implementation and the previous default. The foreach implementation will be the default only if the following conditions are met. 1. The user has not specified kwargs relating to implementation (`foreach`, `fused`, or `differentiable`), 2. All tensors are native tensors (not subclasses) and on CUDA, 3. `torch.jit.is_scripting` is `False`. When these conditions are satisfied, the implementation used will match the implementation used when one passes `foreach=True`. The user defined flag for `foreach` will NOT be overwritten in order to preserve user selections. For more details, check the [documentation](https://pytorch.org/docs/stable/optim.html#algorithms). There should be no significant differences between the results returned by these optimizers. To revert to the old behavior, say, for `adam`, pass in `adam(..., foreach=False, ...)` or initialize `Adam` with `Adam(..., foreach=False, ...)`. Pull Requests: #92306, #92716, #92723,#92724, #92726, #92727, #92728, #92715, #91896, #92730, #90865, #93184, #92181, #92923, #95415, #95818, #95811 ### **`torch.nn.utils.stateless.functional_call` now respects tied weights (#90477)** Assume a module has two tied weights, x and x_tied. Previously, invoking `functional_call(module, parameters_and_buffers, args, kwargs=None, *, strict=False)` with a parameter dictionary of only one of the tied weights would result in the other one(s) not being updated. We’ve changed the behavior so that providing one of the tied weights in the parameter dictionary will update all other tied weights. If you would like the behavior in previous versions of PyTorch, please set `tie_weights=False`. Please also see the related deprecation section "torch.nn.stateless.functional_call in favor of torch.func.functional_call".
1.13 2.0
```Python >>> class Foo(nn.Module): ... def __init__(self): ... super().__init__() ... self.x = nn.Parameter(torch.zeros([])) ... self.x_tied = self.x ... ... def forward(self, inp): ... return self.x + self.x_tied >>> foo = Foo() >>> params = {'x': torch.ones([])} >>> result = functional_call(foo, params, torch.randn([])) >>> print(result) 1.0 ``` ```Python >>> class Foo(nn.Module): ... def __init__(self): ... super().__init__() ... self.x = nn.Parameter(torch.zeros([])) ... self.x_tied = self.x ... ... def forward(self, inp): ... return self.x + self.x_tied >>> foo = Foo() >>> params = {'x': torch.ones([])} >>> result = functional_call(foo, ... params, ... torch.randn([]), ... tie_weights=False) >>> print(result) 1.0 ```
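Returning to the optimizer `foreach` default described above: to opt out and force the single-tensor for-loop implementation, pass `foreach=False` when constructing the optimizer (a minimal sketch with an arbitrary toy model, not taken from these notes):

```Python
import torch
from torch import nn

model = nn.Linear(4, 4)
# foreach=False reverts to the pre-2.0 per-parameter for-loop implementation.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=False)

model(torch.randn(2, 4)).sum().backward()
opt.step()
opt.zero_grad()
```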
### **Require `return_complex` to be passed explicitly to `torch.stft` for real input (#86724)** `torch.stft` takes an optional return_complex parameter that indicates whether the output should be a floating point tensor or a complex tensor. `return_complex` previously defaulted to False for real input tensors. This PR removes the default and makes `return_complex` a required argument for real inputs. However, complex inputs will continue to default to `return_complex=True`.
1.13 2.0
```Python >>> a = torch.rand(1024) >>> _ = torch.stft(a, n_fft=128) ``` ```Python >>> t = torch.rand(1024) >>> _ = torch.stft(t, n_fft=128, return_complex=False) ```
### **Require inputs to `torch.istft` to be complex valued** `torch.istft` no longer supports input in the form of real tensors with shape `(..., 2)` to mimic complex tensors. Instead, convert inputs to a complex tensor first before calling `torch.istft`.
1.13 2.0
```Python >>> t = torch.rand(65, 33, 2) >>> _ = torch.istft(t, n_fft=128, length=1024) ``` ```Python >>> t = torch.rand(65, 33, 2) >>> _ = torch.istft(t, n_fft=128, length=1024) RuntimeError: istft requires a complex-valued input tensor matching the output from stft with return_complex=True. >>> t_complex = torch.view_as_complex(t) >>> _ = torch.istft(t_complex, n_fft=128, length=1024) ```
### **Change default behavior of sparse tensor construction to not do component verification(#92094)** We now disable the costly component verification of torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor by default. The user can use the new `check_invariants` flag or `torch.sparse.check_sparse_tensor_invariants` to locally enable component verification. This allows users to constrain these costly checks to specific regions of their code and enables better overall performance. Previously users had no access to public constructors that disable these checks.
1.13 2.0
```Python >>> i = [[0, 1, 1], [2, 0, 5]] >>> v = [3, 4, 5] >>> s = torch.sparse_coo_tensor(i, v, (2, 3)) RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5 ``` ```Python >>> i = [[0, 1, 1], [2, 0, 5]] >>> v = [3, 4, 5] >>> s = torch.sparse_coo_tensor(i, ... v, ... (2, 3), ... check_invariants=True) RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5 >>> with torch.sparse.check_sparse_tensor_invariants(): ... s = torch.sparse_coo_tensor(i, v, (2, 3)) ... RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5 ```
### **Remove deprecated functionality from `torch.testing`** Historically, `torch.testing` exposed a lot of private and undocumented functionality publicly. The 2.0 release completes the deprecation cycle for the following items and removes them: - `rand` and `randn` (#87970) - `get_all_device_types` (#87971) - multiple dtype getters (#87972) - `make_non_contiguous` (#87973) ### **Hooks registered on tensor to always run, even if they are the inputs to `.grad()` (#85849)** This is a bug fix. Per the docs, hooks registered to Tensor should fire any time gradients are computed w.r.t. to that tensor. This change corrects the behavior to be consistent with the documentation. See [documentation](https://pytorch.org/docs/2.0/notes/autograd.html#backward-hooks-execution) for more details about backward hooks execution.. **2.0** ```Python a = torch.tensor(1., requires_grad=True) b = a.clone() b.register_hook(hook) # the hook registered here didn't fire before! torch.autograd.grad(b.clone(), inputs=(b,)) ``` ### **`grad_fn` post-hooks can always observe the modifications to gradient by any grad_fn pre-hooks or hooks registered to Tensor, even if this is a leaf tensor (#85849)** This corrects the behavior of hooks to be consistent with the documentation in the case where the tensor is a leaf tensor, i.e. the node is a grad accumulator node. See [documentation](https://pytorch.org/docs/**2.0**/notes/autograd.html#backward-hooks-execution) for more details about backward hooks execution. **2.0** ```Python def hook(grad): # updates grad return grad * 3 def hook2(grad_input, grad_output): # Before this change, grad_output would NOT see the x3 print(grad_output) a = torch.tensor(1., requires_grad=True) b = a.clone() acc_grad = b.grad_fn.next_functions[0][0] acc_grad.register_hook(hook2) b.register_hook(hook) torch.autograd.backward(b.clone(), inputs=(a,)) # hook fire ``` ### **Remove FSDP `params_with_grad` (#87480)** In FSDP, we used to have an API `params_with_grad` for users to get parameters which have gradients from the FSDP module. We decided not to expose this helper because it is not a common paradigm.
1.13 2.0
```Python m = FullyShardedDataParallel(module) m.params_with_grad() ``` ```Python m = FullyShardedDataParallel(module) m.params_with_grad() # Runtime error thrown # For work-around, users can still do [p for p in self.parameters() if p.grad is not None] ```
### **Users doing wildcard import of torch.distributed.fsdp.fully_sharded_data_parallel will no longer get non-public symbols (#87917)** Users could previously import both public and non-public symbols:
1.13 2.0
```Python from torch.distributed.fsdp.fully_sharded_data_parallel import * ShardingStrategy.FULL_SHARD # Non-public API FullyShardedDataParallel(module) # public API ``` ```Python from torch.distributed.fsdp.fully_sharded_data_parallel import * ShardingStrategy.FULL_SHARD # Non-public API, this will fail now Fully`Sharded`DataParallel(module) # public API ... # Users can instead from torch.distributed.fsdp.fully_sharded_data_parallel import ( FullyShardedDataParallel, ShardingStrategy, ) FullyShardedDataParallel(module, sharding_strategy=ShardingStrategy.FULL_SHARD) ```
### **Signature of FSDP `auto_wrap_policy `related APIs were changed in (#88450).**
1.13 2.0
```Python lambda_auto_wrap_policy(m, unwrapped_params=...) transformer_auto_wrap_policy(m, unwrapped_params=...) size_based_auto_wrap_policy(m, unwrapped_params=...) ``` ```Python lambda_auto_wrap_policy(m, nonwrapped_numel=...) transformer_auto_wrap_policy(m, nonwrapped_numel=...) size_based_auto_wrap_policy(m, nonwrapped_numel=...) ```
### **Updated `alltoall` signature to be consistent with other c10d APIs (#90569)** The keyword argument names have been changed.
1.13 2.0
```Python alltoall(output=..., input=...) ``` ```Python alltoall(output_tensors=..., input_tensors=...) ```
### **Remove unused functions in torch.ao.quantization.fx.utils (#90025)** This commit removes the following unused functions from both the torch.quantization and the torch.ao.quantization namespaces: - `graph_pretty_str` - `get_per_tensor_qparams` - `quantize_node` - `get_qconv_op` - `create_qparam_nodes` - `node_return_type_is_int` - `is_get_tensor_info_node` ### **Make `torch.ao.quantization.backend_config.BackendConfig` accept inputs in the right order (#90698)** The existing `BackendConfig` fusion pattern uses a "reversed nested tuple" format that is unintuitive. This pattern format also complicates the signatures of the user specified "fuser methods", which needed to accept arguments in reverse nested order to match the patterns:
1.13 2.0
```Python import torch as nn import torch.ao.nn.intrinsic as nni from torch.ao.quantization.backend_config import ( BackendPatternConfig ) def fuse_linear_relu(is_qat, relu, bn_conv): (bn, conv) = bn_conv return nni.ConvBnReLU2d(conv, bn, relu) config = ( BackendPatternConfig((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) .set_dtype_configs(...) .set_fuser_method(fuse_conv_bn_relu) .set_fused_module(nni.ConvBnReLU2d) ) backend_config.configs # returns Dict[Pattern, BackendPatternConfig] ``` ```Python def fuse_linear_relu(is_qat, conv, bn, relu): return nni.ConvBnReLU2d(conv, bn, relu) config = ( BackendPatternConfig((nn.Conv2d, nn.BatchNorm2d, nn.ReLU)) .set_dtype_configs(...) .set_fuser_method(fuse_conv_bn_relu) .set_fused_module(nni.ConvBnReLU2d) ) # Or for backward-compatibility def fuse_linear_relu(is_qat, relu, bn_conv): (bn, conv) = bn_conv return nni.ConvBnReLU2d(conv, bn, relu) config = ( BackendPatternConfig() ._set_pattern_complex_format((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) .set_dtype_configs(...) .set_fuser_method(fuse_conv_bn_relu) .set_fused_module(nni.ConvBnReLU2d) ) backend_config.configs # returns List[BackendPatternConfig] ```
### **Make the AO codebase compliant with the public vs private API guidelines of pytorch [Public-API-definition-and-documentation](https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation)** If users were using any of the AO private APIs then these would have to be accessed with a preceding `_` to conform with the guidelines.
1.13 2.0
```Python get_observer_dict() ``` ```Python _get_observer_dict() ```
Pull Requests: (#86029, #87515, #87516, #87517, #87518, #87519, #88392, #88394, #88396, #88397, #87521, #88395, #87883, #88399, #88398, #86022, #86023, #86024, #86025, #86026, #86027, #86028, #86030, #86031, #86032, #86033, #86034, #86037, #90315, #88391, #90554, #87520) ### **Remove overwrite_output_observer and represent the observer constraints for fixed qparams ops through the existing DTypeWithConstraints mechanism (#88620)** This commit removes `overwrite_output_observer` and `overwrite_output_fake_quantize` overwrite observer settings in the BackendConfig. Instead, we represent the observer constraints for fixed qparams ops through the existing DTypeWithConstraints mechanism. Note that, however, to be consistent with other DTypeWithConstraints checks, we no longer throw an error if an incorrect observer is specified, but simply ignore the offending QConfig and log a warning instead. This is the BC-breaking part of the change. **1.13** ```Python from torch.ao.quantization.qconfig import default_qconfig from torch.ao.quantization.quantize_fx import prepare_fx model = ModelWithFixedQParamsOps() qconfig_mapping = QConfigMapping().set_global(default_qconfig) example_inputs = ... prepare_fx(model, qconfig_mapping, example_inputs) ``` Before this commit, running the above leads to an exception because the wrong observers are used for fixed qparams ops. After this commit, the above will only encounter a warning,and the fixed qparams ops will not be quantized. In both cases, switching to `get_default_qconfig_mapping` will cause the fixed qparams ops to be quantized. ### **Remove `torch.ao.quantization.quantization_patterns` and `torch.ao.quantization.fusion_patterns`(#89872)** The following classes under the `torch.ao.quantization.fx.quantization_patterns` namespace are migrated to the `torch.ao.quantization.fx.quantize_handler` namespace: - `QuantizeHandler` - `BinaryOpQuantizeHandler` - `CatQuantizeHandler` - `ConvReluQuantizeHandler` - `LinearReLUQuantizeHandler` - `BatchNormQuantizeHandler` - `EmbeddingQuantizeHandler` - `RNNDynamicQuantizeHandler` - `DefaultNodeQuantizeHandler` - `FixedQParamsOpQuantizeHandler` - `CopyNodeQuantizeHandler` - `GeneralTensorShapeOpQuantizeHandler` - `CustomModuleQuantizeHandler` - `StandaloneModuleQuantizeHandler` The following classes under the torch.ao.quantization.fx.fusion_patterns namespace are migrated to the torch.ao.quantization.fx.fuse_handler namespace: - `DefaultFuseHandler` - `FuseHandler` ### **Remove public APIs under the `torch.ao.quantization.fx.backend_config_utils` namespace(#89810)** The following APIs that were mistakenly public under the `torch.ao.quantization.fx.backend_config_utils` namespace are removed in this commit. - `get_quantize_handler_cls` - `get_fusion_pattern_to_fuse_handler_cls` - `get_native_quant_patterns` - `get_pattern_to_quantize_handlers`
1.13 2.0
```Python from torch.ao.quantization.fx.backend_config_utils import ( get_quantize_handler_cls, get_fusion_pattern_to_fuse_handler_cls, get_native_quant_patterns, get_pattern_to_quantize_handlers, ) all_quant_patterns = get_native_quant_patterns() ``` ```Python from torch.ao.quantization.fx.quantization_patterns import ( _get_quantize_handler_cls, _get_pattern_to_quantize_handlers, ) from torch.ao.quantization.fx.fusion_patterns import ( _get_fusion_pattern_to_fuse_handler_cls, ) from torch.ao.quantization.backend_config import ( get_native_backend_config, ) all_quant_patterns = _get_pattern_to_quantize_handlers( get_native_backend_config() ) ```
### **Update torch.{slice|select|diagonal|as_strided}\_scatter ops to preserve input stride/storage_offset (#91029)** These operators are primarily used by the [functionalization pass](https://dev-discuss.pytorch.org/t/functionalization-in-pytorch-everything-you-wanted-to-know/965), used in AOTAutograd. Previously, they would always return contiguous tensors. Now, they return a tensor with the same striding as their first argument.
1.13 2.0
```Python >>> x = torch.ones(2, 2, 2) >>> base = x[:, :, 1] >>> base.stride() (4, 2) >>> x = torch.zeros(2, 2, 2) >>> base = x[:, :, 1] >>> base.stride() (4, 2) >>> torch.diagonal_scatter(base, torch.ones(2)).stride() # returns a tensor with same strides as base. (4, 2) ``` ```Python >>> x = torch.ones(2, 2, 2) >>> base = x[:, :, 1] >>> base.stride() (4, 2) >>> x = torch.zeros(2, 2, 2) >>> base = x[:, :, 1] >>> base.stride() (4, 2) >>> torch.diagonal_scatter(base, torch.ones(2)).stride() # returns a contiguous tensor (2, 1) ```
### **Remove ONNX deprecated monkey patches to torch.Graph (#94747)** The Deprecated monkey patches to `torch.Graph`, `torch.Block` and `torch.Node` are removed Monkey patches to the classes `torch.Graph`, `torch.Block` and `torch.Node` from `torch.onnx` have been removed. This means the methods `torch.Graph.op()`, `torch..Graph.at()`, `torch.Block.op()`, `torch.Graph.constant()`, and `torch.Node.__getitem__` are no longer available. Users creating custom symbolic functions for the `torch.onnx` exporter can continue to assume the `g.op()` interface for creating an operator in the graph, which is now exposed via the `GraphContext` class. Users should not assume any other methods from the `GraphContext` class other than those defined natively by `torch.Graph` and `.op()`. Code change to existing symbolic functions is not expected with this change. ### **Add full checker mode in torch.onnx.export (#83186)** This removes boolean value of `full_check` parameter in TORCH API `check_onnx_proto`, and forces `full_check` with warning messages if it fails. Also, the API didn’t check on types in the graph even with `full_check=True` previously. With the change, a warning message will show if the graph contains type error. ### **C++ API specific BC-Breaking Changes:** #### **Deleted torch::deploy from PyTorch Core (#85953)** `torch::deploy` has been migrated to over to [MultiPy](https://github.com/pytorch/multipy). Ongoing development will continue in this repository. #### **Remove the use of `lazy::View` (#87822)** The view and aliasing infrastructure in lazy tensor core has been deprecated in favor of functionalization. #### **Renamed `c10::fromIntArrayRef` to `c10::fromIntArrayRefSlow` and changed call sites (#86235)** The function has been renamed to more accurately reflect its performance characteristics. # Deprecations ## torch.func aka functorch ### **We’ve deprecated the functorch module in favor of the new torch.func module** We’re excited to announce that, as the final step of upstreaming and integrating functorch into PyTorch, the functorch APIs are now available in the torch.func module. Our function transform APIs are identical to before, but we have changed how the interaction with NN modules work. We’ve deprecated `functorch._` function transforms (e.g. `vmap`, `grad`, `jvp`) in favor of their identical `torch.func._ `counterparts (#92279). PyTorch has consolidated on `torch.func.functional_call` as the NN module functional API. Please migrate from `functorch.{make_functional, make_functional_with_buffers}` to it. For more details see this [Guide](https://pytorch.org/docs/master/func.migrating.html#functorch-make-functional) Please migrate from `functorch.combine_state_for_ensemble` to `torch.func.stack_module_state`. For more details see this [Guide](https://pytorch.org/docs/master/func.migrating.html#functorch-combine-state-for-ensemble) We are no longer supporting functorch.compile (also known as AOTAutograd) as a frontend for compilation in PyTorch; we have integrated AOTAutograd into PyTorch’s compilation story. If you are a user, please use `torch.compile()` instead. ## Python API ### **Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)** Typed storages have been removed from the C++ side and torch.UntypedStorage is used in place. The use of torch.TypedStorage and all of its subclasses is now deprecated.
1.13 2.0
```Python tensor.storage() torch.TypedStorage(...) ``` ```Python tensor.untyped_storage() torch.UntypedStorage(...) ```
If you need to access individual elements in a storage as a particular dtype, you can simply create a tensor to view it: ```Python torch.tensor(storage, dtype=...) ``` ### **Deprecate `tensor.mT`,`tensor.T`,`tensor.mH`,`tensor.H` on 0D-tensors (#92143)**
1.13 2.0
```Python >>> a = torch.tensor(10) >>> a.T >>> a.H ``` ```Python >>> a = torch.tensor(10) >>> a.T UserWarning: Tensor.T is deprecated on 0-D tensors. This function is the identity in these cases. >>> a.H UserWarning: Tensor.H is deprecated on 0-D tensors. Consider using x.conj(). ```
## Autograd API ### **Deprecate decorating classes with torch.no_grad (#89522)** Decorating classes with `torch.no_grad` is now deprecated. You should be decorating its functions or methods instead. To preserve the current behavior of class decoration, you can directly decorate the `__init__` method and nothing else.
1.13 2.0
```Python @torch.no_grad() class Blah(): pass ``` ```Python class Blah(): @torch.no_grad() def __init__(self): pass ```
## Linalg ### **Remove the use of overload at::frobenius_norm(const Tensor&) (#81762)** In continuation with the deprecation process from release 1.12 the tensor overload for this function has been removed. This function was not used in the bindings of Pytorch and should not impact users of `torch.norm`. ## torch.nn API ### **Canceling deprecation of `functional.{tanh, sigmoid}` functions (#86905)** Both these ops are heavily used and so will not be removed. Deprecation warnings have been removed. ### **Deprecated torch.nn.utils.stateless.functional_call in favor of torch.func.functional_call (#92280)** We’ve moved torch.nn.stateless.functional_call under the torch.func module to reflect how it is useful for working with nn.Modules in a functional style. As of PyTorch **2.0**, `torch.func.functional_call` is a drop-in replacement for `torch.nn.stateless.functional_call` and we will remove `torch.nn.utils.stateless.functional_call` in a future version of PyTorch. However, please note that we did change the default behavior of `torch.nn.stateless.functional_call` in PyTorch 2.0 (see “torch.nn.utils.stateless.functional_call now respects tied weights” under BC-breaking notes). ## Releng ### **Deprecated private API torch.\_six (#94709)** Removed the Python 2 and 3 compatibility library six and future and torch.\_six. **2.0** ```Python # from torch._six import string_classes str # from torch._six import int_classes int # from torch._six import inf, nan from torch import inf, nan # torch._six.string_classes str ``` ## Onnx ### **Deprecated Caffe2 ONNX exporter support[ #95071](https://github.com/pytorch/pytorch/pull/95071)** Users must use PyTorch 1.x versions to use Caffe2 ONNX exporter. This capability will be completely removed from PyTorch 2.x series. # New Features ## torch.nn API - Add `torch.nn.functional.scaled_dot_product_attention()` to allow writing fast Transformer-like functions and use it to speed up `nn.Transformer()` ( #91362, #91066, #90413, #87312, #94008, #89470, #90776, #92189) - Add hooks for `Module.register_{buffer,module,parameter}` functions (#86148, #87369) - Add `Module.full_backward_pre_hook` (#86700) - Add `Module.state_dict_pre_hook` (#90435) - Add `Module.call_super_init: bool` flag that can be used to ensure `Module` initialization is properly calling parent’s `__init__` (#91819) ## torch.func - Add `functorch` support [for torch.autograd.Function](https://pytorch.org/docs/master/notes/extending.func.html): one is now able to apply function transformations (e.g. vmap, grad, jvp) over torch.autograd.Function. (#92023, #91452, #91222, #90037, #90077, #90966, #89860, #91211, #92030) - Add support for linearize a-la [jax.linearize](https://jax.readthedocs.io/en/latest/_autosummary/jax.linearize.html#jax.linearize) (#94173) - Add torch.func.functional_call, a new utility function to work with NN modules. 
(#89213) - Add torch.func.stack_module_state, a new utility function to help with model ensembling (#88850) ## Cuda - Introduce CUDA Device Assertions Infrastructure (#84609) - `Logcumsumexp` for complex dtypes for CUDA (build-time optimized) (#94310) - Caching allocator tracing (#86241) - Add Pluggable CUDA allocator backend (#86786) - Add cudaMallocAsync as an alternative backend for the CUDA allocator (#82682) ## Cpp API - Add `set_to_none` flag for C++ optim endpoint (#92989) ## NestedTensor API - Add support for `tensor.to()` for NestedTensor backend (#87146) - Add backwards support for `gelu` and `relu` operators (#94776) - Add support for `torch.neg` operator (#88131) ## Distributed - Distributed Tensor (Prototype Release) - PyTorch [DistributedTensor](https://github.com/pytorch/pytorch/blob/master/torch/distributed/_tensor/README.md) (DTensor) is a prototyping effort with distributed tensor primitives to allow easier distributed computation authoring in the SPMD (Single Program Multiple Devices) paradigm. The primitives are simple but powerful when used to express tensor distributions with both sharded and replicated parallelism strategies. PyTorch DTensor empowered PyTorch [Tensor Parallelism](https://pytorch.org/docs/master/distributed.tensor.parallel.html) along with other advanced parallelism explorations. In addition, it also offers a uniform way to save/load state_dict for distributed checkpointing purposes, even when there’re complex tensor distribution strategies such as combining tensor parallelism with parameter sharding in FSDP. (#88176, #88177, #88178, #88179, #88551, #88549, #88550, #89800, #89967, #89968, #89991, #90106, #90241, #90449, #90731, #90732, #90733, #90734, #90735, #91756, #91783, #91785, #91801, #91802, #92069, #92197, #92198, #92290, #92611, #92651, #92652, #92677, #93040, #93160, #93306, #93832, #93957, #94517, #94524) - We also design and implement Tensor Parallel & 2D Parallel (Tensor Parallel + FSDP) on top of DistributedTensor. (#88180, #89242, #89467, #89535, #89779, #89878, #93029, #93412, #94421, #94748) - Distributed Checkpoint - PyTorch Distributed Checkpointing (DCP) API was first introduced in PyTorch 1.13 and this will be an official prototype release in PyTorch 2.0. The distributed checkpoint API in PT2.0 decouples the storage layer from the checkpoint planning layer. Planner types are introduced to perform the coordination of storage both locally and globally to plan the save/load process. Checkpointing support for FSDP `sharded_state_dict` is added as well. (#87987, #88698, #89256, #89398, #89399, #89501, #89503, #89537, #89542, #89873, #89964, #90212, #91036, #91092, #91209, #91269, #92553, #92705, #92829, #92869, #92933, #94379, #94501) - DistributedDataParallel - Enable DDP for PyTorch 2.0 (#87549, #88523, #89096, #88460, #88480, #88521, #94749, #93162, #89802, #92986) - FullyShardedDataParallel - Add the option to use the original parameters via `use_orig_params=True` in the FSDP constructor (#84911) - Enable the use of TorchDispatch with FSDP (#88014) - Hybrid Sharded Data Parallel (#89915) - Enable FSDP for PyTorch 2.0 (#88781, #89330, #89523) - Distributed (c10d) - Dispatchable collectives: An improvement to the existing `init_process_group` API which changes backend to an optional argument. For users, this feature will allow for code that runs on both GPU and CPU machines without having to change the backend specification. 
The dispatchability feature will also allow users to perform both GPU and CPU collectives using the same ProcessGroup, as PyTorch will automatically find an appropriate backend for the tensor type (as of PyTorch 2.0, the default is NCCL for CUDA and Gloo for CPU). Existing backend specifications by users will be honored and will not require change (#83679, #83735, #83810, #83859, #83876, #83916, #84423, #86166, #86368, #86407, #86408, #86409, #88351, #88846, #88889, #88903, #89317, #89318, #89505, #89813, #88330, #91257, #91172) ## Mps - Add native support for:`torch.nn.functional.group_norm`(#91190), `torch.var_mean` (#91190), `torch.nansym`(#93845), `torch.frac`(#86625), `torch.signbit`(#87214), `torch.exp1m`(#87147), `torch.cumsum`(#88319), `torch.trace`(#87910), `torch.nn.Hardswish` (#87952),`torch.inverse`(#90428), `torch.floor_divide`(#91126), `unfold`(#91266), `bincount`(#91267), `nonzero`(#91616), `norm_dtype`and`cdist`(#91643), `unique`and`unique_consecutive`(#88532), `nan_to_num`(#91110), `torch.linalg.cross`(#91642), `randperm`(#91708), `triangular_solve`(#94345), `grid_sampler2d`(#94273), `remainder`(#92139), `addr`(#94538), `fmod`(#94722), `repeat_interleave` (#88649),`sort`and`argSort`(#94697),`range` (#91075) - Add functions to handle rng and force device synchronization `torch.mps.{get_rng_state, set_rng_state, synchronize, manual_seed, seed}` (#94417) - Add support for the `mps` device for `torch.Generator` (#91348) - Add `torch.int64` support for unary ops (#86615) ## Profiler - Improve Memory Profiler(alpha): enhancement to the existing memory profiler that can attribute memory consumptions to activations, gradients, parameters, and optimizer states (#86802, #86853, #86854, #86880, #87006, #87566, #87567, #87568, #88924, #88925, #88926, #89355, #89356, #86355, #88917, #87133, #86753, #86754, #87096, #86909, #87825) - Add Linux perf event support in profiler (#87866, #87874) ## Foreach API - Implement: - `torch._foreach_lerp` (#87562), - `fused adamw` (#88015) - `_foreach_addc`(div/mul)(\_).Tensor (#88157) - `clamp_min` `clamp_max` (#91384) - `adamw` (#88015) ## Mobile - Add XNNPACK Delegate Framework. - Enable a XNNPACK graph to be built from the torchscript IR and performing checks (#86980, #87128, #87824) - Add flatbuffer serialization support (#87826, #87906, #87907, #87908) - Create `Executor` and `Compiler` classes which compiles the XNNPACK graph and preps for execution (#88779, #88778, #88780, #89090) - Optimize library includes (#88863, #89231) - Add Constant Data which will be used in Convolution (#89445) - Add support for better benchmarking - Add support in lite_predictor benchmark binary to select event lists and perform benchmarking using Linux perf through Kineto profiler (#87876) - List all missing ops at once (#94205) ## Sparse API - Add `torch.sparse.check_sparse_tensor_invariants` context manager that allows users to opt into more checks at runtime for better debugging. (#92094) - Add `check_invariants` flag to `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor `to allow users to verify components at construction time. (#92094) - Add `reduce` flag for CPU to torch.sparse.mm with support for `sum, mean, amax, amin` (#83727) ## Optimizer API - Make `{Adadelta, Adagrad, Adamax, AdamW, ASGD, NAdam, RAdam, RProp}` differentiable (#86096, #86258, #86183) - Publicly expose \_LRScheduler to LRScheduler (#88503) ## Distributions - Add a transform for positive-definite matrices. 
(#76777) ## Signals - Set up new module torch.signal.windows (#85599) - Add the Nuttall window to signals/ (#90103) - Implement old singal/windows in Python (#87082, #87330) ## Quantization - Add support for oneDNN backend for server CPU quantized inference (#91056, #88478, #88665, #88668, #88879, #88923, #89188, #91297, #90262, #90364, #91152, #91153, #91154, #91155, #91934, #88661) - Add new ‘x86’ backend to be used for quantized CPU inference (#91235, #88799) ## Vulkan - Add Vulkan support for several torch operators: - `torch.abs` (#87414) - `torch.select` for height and width dimensions (#94612) - Vulkan optimization passes now automatically apply data transfers between the CPU and GPU for input and output tensors (#87432) - If the `requires_backend_transfers` flag of a model is set to `false`, then input tensors do not to be transferred to the GPU (via `tensor.gpu()`) and output tensors do not to be transferred back to the CPU (via `tensor.cpu()`) since these transfers are inserted into the model - To avoid inserting data transfers into a model, add `MobileOptimizer.VULKAN_AUTOMATIC_GPU_TRANSFER` under `torch.utils.mobile_optimizer` to the `optimization_blocklist` argument of `optimize_for_mobile` (#92081) ## ROCm - `hipGraph` support for pytorch mainline (#88202) ## Fx - Introduce symbolic shape guards (#87570, #90528, #90665, #90679, #90876, #91058, #93894, #94782) - Introduce a match filter for SubgraphRewriter (#86430, #87998, #87257) - Support list-typed args in PatternMatcher (#88656) - Add `any_chain()` in operator support (#90949) - Have replace_pattern return replaced nodes (#90244) ## Jit - Allow freezing JIT modules that contain mutable interfaces (#86039, #91020) - ApplyLinear-BatchNormNd folding during torch.jit.freeze (#86706) - Add an option to skip loading of debug traces, in order to reduce memory usage (#91430) - Introduce torch.jit.\_drop function modifier to avoid compiling a method on a non-nn.Module class (#93012) - Allow providing a kwargs-like dict of example inputs to torch.jit.trace with the new `example_kwarg_inputs` argument (#81623, #94032) - Include example input shapes when serializing jit.traced modules to assist with debugging (#90744) ## Build - Add Ada Lovelace (cuda arch sm8.9) support (#87436) - Add an option to disable TORCH_WARN and TORCH_WARN_ONCE log (#87188) - Enable memory map file support for Android, Apple, and CXX (#88545) - Support DNNL_GRAPH_CPU_RUNTIME=TBB build option (#87512) ## ONNX - Verification tool to find mismatch in model export (#89946,[ #89807](https://github.com/pytorch/pytorch/pull/89807),[ #89808](https://github.com/pytorch/pytorch/pull/89808),[ #89947](https://github.com/pytorch/pytorch/pull/89947),[ #94648](https://github.com/pytorch/pytorch/pull/94648)) ## Cudnn - Add an environment variable to skip cudnn version compatibility check (#89184) - Enable cuDNN Frontend v8 API by Default (#91117) # Improvements ## Python API - Set std/var correction overloads default value to None (#56398) - Implement correction argument in torch.masked.{std,var} (#87118) - Update `torch.squeeze` to allow squeezing multiple dimensions at once (#89017) - Add support for int32 indices in index/index_put ops (#86309) - Enable `where` to have cpu scalar args (#87022) - Add support for NumPy scalars to `torch.tensor.asarray` (#90914) - Update opt_einsum to have more reasonable defaults (#86985) - Improve error message for `Tensor.set_` when dtypes mismatch(#88804) - Enable out variant of `torch.max`(#85926) - Implement faster gradient clipping 
using foreach function (#91846) ## Autograd API - Add backward support for `torch.ormqr` (#86800) - Pre-hooks registered on tensor are guaranteed to run before pre-hooks registered on grad_fn (#85849) - Add a new overridable method `setup_context` (#89859, #92312) - You must use override this method if you plan to use your autograd Function with functorch - If you choose to override this method, `forward` should no longer take ctx as an input. - Add context manager `torch.autograd.set_multithreading_enabled` for disabling multithreading in the autograd engine (#86245) - Add backward AD support for unary foreach functions (#89591) ## torch.nn API - Add `remove_duplicate` flag to `Module.named_buffers()` method (#84984) and `Module.named_parameters()` (#88090) - Add kwarg support for `Module` forward-pre and forward hooks (#89389) - Improve error message for `Transformer()` fast path (#90783) and kernel selection (#90783) - Add support for `torch.bf16` for `Embedding` (#94163) - Add `freeze` argument to `Embedding()` (#86769) - Add `torch.channels_last_3d` support for `SyncBatchNorm()` (#88401) - Add `torch.bfloat16` support on CPU for `functional.{mish,hardtanh,silu}` (#82460) - Add support for inputs with different data types for `LayerNorm()` (#81851, #88064), `BatchNorm{1,2,3}d()` (#84410), `GroupNorm()` (#89485, #81852, #88663, #92671, #92668) - Improve printing of `ModuleList()` (#90452) - Add `torch.uint8` support for `functional.interpolate()` on CPU (#90771) - Make `functional.max_pool1d` error checking consistent between CPU and CUDA (#90211) - Add `SyncBatchNorm()` fallback to `BatchNorm()` when it is used in a non-distributed setting (#89706) - Add channels-last support for `GroupNorm()` on XPU (#87680) - Add `is_causal` kwarg to `TransformerEncoder()` layer (#90508) - Add `prepend` argument to `Module` hooks to register a hook that will be called before the existing ones (#87370) ## Distributed - Activation checkpointing - Return `None` from `apply_activation_checkpointing` (#87871) - Enable non-reentrant support for `checkpoint_sequential` (#86331) - Separate CPU offload activation to its own wrapper (#85459) - DistributedDataParallel - Add `PackedSequence` support when `device_ids` is specified (#86614) - Enable DDP to handle custom dataclass forward outputs (#92334) - Distributed (c10d) - Add sequence number support for UCC PG (#85047) - FullyShardedDataParallel - Default to `BACKWARD_PRE` for the backward_prefetch of FSDP (#88428) - Skip collective communications for `NO_SHARD` in `clip_grad_norm_` (#89137) - Allow handle training state to be both `BACKWARD_PRE` and `BACKWARD_POST` in the post-backward assert (#89791) - Limit all gather after pre-unshard (#89057) - Include module classes in `ModuleWrapPolicy.__repr__` (#89058) - Apply the "largest" dtype across all parameters/gradients as defined by PyTorch's type promotion semantics for the total norm returned in `clip_grad_norm_` for low prec grads (#90028) - Introduce `ModuleWrapPolicy` for simplicity in FSDP autowrap (#88450) - Enable mixed hybrid/non-hybrid sharding strategies (#90846) - Re-support model dtype change after FSDP init (#91192) - Enable `use_orig_params=True`, `no_sync` and mixed precision to work together (#91193) - Enable `summon_full_params(with_grads=True)` (#85738, #87314) - Add `keep_low_precision_grads` support when CPU offloading (#86495) - Consolidate FSDP `state_dict` `offload_to_cpu` settings (#86211) - Add `set_state_dict_type` API to setup `state_dict_type` without using context manager 
(#86243) - Enable the support of `use_orig_param` for FSDP’s `optim_state_dict` (#89898, #89899, #89900) - Enable nested FSDP wrapper to use different mixed precision (#90523) - Enable input cast skip in MixedPrecision (#90620) - Publish `optim_state_dict` and `optim_state_dict_to_load` for FSDP (#90798, #91343, #92744, #92118, #92991, #92992, #93285, #93318, #94109, #94129) - Make default input casting in root module only and enable the ability to set different mixed precisions for different submodules (#91365) - Torch Elastic - Update `torchrun` and `TorchElastic` to take optional `local_addr` param to allow skip local IP lookup if specified (#88922) ## torch.func - Update vmap to accept None(s) in out_dim (#91644) - torch.func.jacrev: Support chunked computation (#89376, #91326) - vmap: chunk_size support (#91157) - torch.vmap: Implement checks (rather than internal asserts) for vmap escaped errors (#89585) - Avoid calling allclose in the backward if there are tensor subclasses (#91444) - Refactor NN stateless APIs by swapping module tensors (#92536) ## Cuda - Use binary units for CUDA memory summary (#91854) - Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350) - Add option to record C++ backtraces in \_record_memory_history (#86145) - Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692) - Add warning if captured graph is empty (#88754) - Add option to dump a captured graph for debugging (#85519) - Add support to foreach torch zero for bfloat16s (#90437) - Enable bfloat16 for hardtanh_backward_cuda (#91511) - Use pytree to allow any input format for cuda graph (#90941) - Add requested_bytes to CUDA Caching Allocator Stats (#88575) - Add an option to disable reduced precision reductions for BF16 GEMM (#89172) - Add an env variable to disable addmm_cuda_lt kernel (#91436) ## Serialization - Add XPU backend to support torch.save and torch.load (#89679) ## Cpp API - Reduce ambiguity in Tensor namespace collisions (#92266) ## Dataloader API - Add support for pin memory on xpu device (#86545) - Add type annotation to `get_worker_info` (#87017) - Allow prefetch factor to be optional (#88972) ## NestedTensor API - Add add/mul for nested dense [B, *, D], [B, 1, D] case (CUDA-only) (#88289) - Add support for torch.select over irregular dimensions (#88585) - Add torch.nested.nested_tensor() constructor (#88213) ## Complex API - Improve complex support for: `torch.nn.functional.conv_transpose3d `(#87967), `torch.log1p` (#89214,#90422), `torch.lerp` (#75584), `torch.logcumsumexp` for CPU (#93153) - Solve under/overflow for complex division (#92539) ## Composability - Improve coverage of primtorch and torch.\_ref decompositions: `prims.clone` (#86705), `ndtr, ndtri, log_ndtr, erfcx` (#86077), `NLL loss` (#81128), `conv backward` (#87047), `xlogy and xlog1py` (#77712), `alpha_dropout` (#87989) - More operations now work with meta tensors: `_adaptive_avg_pool2d_backward` (#86359), (#87074), `avg_pool2d and avg_pool2d_backward` (#87043), `scalar_tensor and argmax` (#88590), `topk` (#88694), `max_pool2d_with_indices_backward` (#88743), `grid_sampler_2d_backward` (#88745), `linalg_cholesky` and `linalg_cholesky_ex` (#89430), `aten._cdist_forward` (#90042), `aten.pixel_shuffle` (#91605) ## Linalg API - Fix typos in messages under aten (#88964) ## Mobile - Improve CoreML logging and dependent libraries. 
- Updated Cocoapods (#88075) - Preserved CoreML errors by using special throw macro when encountering CoreML API errors (#86938) - Clean Up MobileOptimizerType Rewrite Flags Public API and Documentation (#91600) - Clean up flatbuffer lib dependency and fixed its test to match pkl models (#86041, #93022) - Type corrections to avoid unnecessary static_casts (#93898) - Add flake8-logging-format linter (#90805, #94840) ## Sparse API - Add autograd support for `linear` (#86137, #86302), `mm`, `log1p`(#86301, #88155), `to_sparse_*`(#90281) - Improve support for `sparse_dim`, `dense_dim` (#86203, #86203), `torch.sum`(#86300, #92979), torch.sparse.sampled_addmm`(#86401),`frac`, `deg2rad`, `rad2deg`, `relu`(#88153, #88156, #88442, #86749),`conj()`(#91695),`to_sparse`(#90718),`sparse_mask` (#92248, #94829) - Add support for per batch index contiguity in CSR/CSC/BSR/BSC (#91243), non-contiguous values in CSR/CSC/BSR/BSC (#91243), non-zero dense_dim to COO/CSC/BSR/BSC/Strided conversions. (#90177), uncoalesced operands to `sparse_mask` (#91964) - Improve error messages for `indices, values, (c)row_indices, (c)col_indices` (#93149) and `addmm` (#94843) - Extend gradcheck to BSR and BSC inputs. (#90719) - Sort BSR indices as part of CSR to BSR conversion (#90918) ## Cpu - Implement aten::native_batch_norm.out for CPU (#88604) - Log1p for complex in CPU (#89691) - Enable oneDNN implementation for LSTM (#91158) ## Package - Add better debugging for torch.package (#92939) ## Quantization - Remove weight arg from DTypeConfig for non-weighted ops (#86335) - Add get_symmetric_qnnpack_qconfig_mapping for XNNPACK quantized ops (#87002) - Add assert for backend correctness in get_default_qconfig related apis (#86259) - Replacing List[QConfigMapping] in parallel numeric profiler (#86922) - Check the fixedqparam op qconfig based on backend_config (#87425) - Explicitly set default quantized engine instead of relying on the order of supported_qengines (#89804) - Support setting qconfig by module_type in QConfigMapping in PT 2.0 export flow (#92355) - Migration of quantization code from torch._ to torch.ao._ (#86171, #86172) - Improvements to qnnpack fully connected sparse ops (#85243, #85244, #85245, #85246, #85247) - Support lowering of channel shuffle in FX (#83731) - Remove explicitly default QConfigMapping settings (#90066) - quant: make various configs printable (#91419) - Enable FX quant for patterns like x.view(x.size(...), ...) 
(#90001) - X86 qengine always uses fbgemm kernels on OS other than Linux (#93218) - Change prepare_fx and convert_fx to preserve the GraphModule type of input (#94412) - update xnnpack to newer version and update API usage in pytorch (#94330) - Remove \_input_output_observed from backend_config (#92589) - Add support for LSTM Structured Pruning prune_functions + pattern (#90801) - Enable FX static quantization for LSTM (#85068) - Allow setting fixed quantization params for inner LSTM ops (#88456) - Add support for GRU in fx graph mode quantization (#91976) ## ONNX - Operator support `col2im` opset 18 (#84594), `mse_loss` (#90717), `aten::contains` (#91660), src/index dynamic axes support for `aten::scatter_add` (#90090), `aten::zero` (#91731), Raise Unsupported for `GridSample` with volumetric 5D input (#92212) - Pretty print diagnostic logging (#88261) - Bump onnx to 1.13.1, onnxruntime to 1.14.0 (#90332,[ #94767](https://github.com/pytorch/pytorch/pull/94767)) - Add full graph checker option for `torch.onnx.export` API (#83186) - Integrate all ONNX operators with a new `JitScalarType` API (#87245) - Add `share_from_this` to `torch::jit::Graph` (#87343) - Use optional op to keep None in results for ONNX internal tests (#84789) - Add support for autograd function inlining in `ONNX_ATEN_FALLBACK` mode (#85736) - Default runtime type checking to raising errors (#86555) - Remove the `INT64_MAX` magic numbers (#88341) ## Fx - Refactor graph partition to check for cyclic dependency (#86511) - Enable nvprims.transpose fusions for nvFuser (#86967) - Simplify magic method definition code. (#88017) - Add sym_floor, sym_sqrt, sym_int (#88760) - Propagate .meta info when replacing subgraphs in fx (#87255) - Make `torch.fx` compatible with Python-3.11 (#92895) - Add type(module) to be stored in the module stack (#87149) - Ensure that symbolic variables incorporate fresh constraints before they're used (#87254) - Add type annotation to `getitem` node before `split_module` (#88510) - Implement pass for annotating getitem nodes (#90237) - Guard Symbol and ShapeGuardPrinter behind HAS_SYMPY (#90704) - Copy meta field in fx.GraphModule on deepcopy (#92062, #92623) - Match get_attr when comparing nodes (#91657) - Make **deepcopy** of fx.GraphModule handle circular reference. (#93038) - Populate memo in deepcopy BEFORE copying children. 
(#93295) ## Mps - Add fp16 support for `torch.nn.Linear` (#89774), `torch.nn.GELU` (#86218) - Add support for empty Tensors in `torch.bitwise_not` (#87286), `torch.nn.LayerNorm` (#94212), many backward functions (#94343), `torch.nn.functional.hardswish` (#94342), `torch.topk` (#91884), `torch.arange` (#94485), `torch.linal.inv` (#94551), - Improve error message for `nn.Conv2d` when inputs are on different devices (#86303) - Add support via fallback for `torch.nn.{Fold, UnFold}` (#94491) - Add support for reduction ops on multiple axis at a time (#91734) - Add support for `k` greater than 16 for `torch.topk` (#94639) ## Build - Add @pytorch in tools/bazel.bzl (#91424) - Change visibility for //c10:headers (#91422) - Simplify OpenMP detection in CMake (#91576) - Use `@pytorch//` in bazel build files which improves embedding usecases (#89660) - Enable `USE_CUDA `for bazel build (#92640) - Add missing default initializers to class members (#94049) ## Jit - Skip builtins while enumerating class methods (#91805) - Support lovelace for NVRTC (#87611) - Expanded symbolic shape support (movedim) (#91696) ## Releng - Update CI test environment; Add symbolic functions (#94564) - Import `Literal`, `Protocol`, and `Final` from standard library `typing` as of Python 3.8+ (#94490) - Add cpuinfo to collect_env.py for new issues reporting which helps triaging on CPU (#93899) - Refactor nvfuser build (#89621) - Add error checking to flaky test bot platform parser (#86632) - Make LazyGraphExecutor extensible (#87218) - Delete BUILD_SPLIT_CUDA option (#87502) - Use faster cache flush in triton benchmarking (#88557) - Guard global observer init against Edge profiler (#86347) # Bug fixes ## Python API - Fix as_strided_scatter derivative formula(#87646) - Add bfloat16 support to torch.prod (#87205) - Disable dimension wrapping for scalar tensors (#89234) - Fix SIGSEGV on a big-endian machine when reading pickle data (#92810) - Fix BC-breaking change to reduction arguments `amin`/`amax` (#93091) - Fix incorrect tensor storage check (#86845) - Ensure einsum contracts left to right (#87199) - Add nondeterministic error for `torch.tensor.scatter` (#88244) - Fix multi-index for `torch.tensor.index_select` over scalar tensor (#94347) - Add scalar support for `torch.tensor.where` (#92849) - Improve error message for unsupported argument types (#87601) - Change as_strided_scatter’s storage offset default to None from 0 (#87481) - Make `torch.histc` consistent between CPU and CUDA (#87832) - Add float to list of allowed ops for serialization (#94910) - Fix numpy1.24 deprecations in unittests ([#93997] (https://github.com/pytorch/pytorch/pull/93997)) - Properly moving segment_reduce to be private as expected (#93166) ## Autograd API - Fix behavior of hooks registered to Tensors that had previously been modified in-place (#92734) - Previously hooks registered to a tensor after it is modified in-place would erroneously receive the gradients of the output w.r.t. to that tensor before it is modified in-place if that tensor had previously had a hook registered to it before it was modified in-place. - See [documentation](https://pytorch.org/docs/2.0/notes/autograd.html#behavior-of-tensor-hooks-when-tensor-is-modified-in-place) for more details about backward hooks execution when tensors are modified in-place. 
- Update saved variable hooks to no longer trigger on wrapped numbers (#87316) - Modifying a view created in no-grad mode in-place no longer triggers an internal assert (#88243) - Improve error message when saved tensor is detached inplace (#88860) - Prevent module full_backward_hook from erroring in double backward (#88357) - Fix forward AD custom Function non-differentiable outputs (#90787) - Don't materialize forward grad for non-differentiable types (#91183) - Return input as-is if marked dirty even when requires_grad=False (#91214) - Fix saved tensor hooks to propogate errors back to python as-is (#94456) - Fix NumPy broadcasting for backward of `linalg.solve` (#91456), `linalg.lstsq` (#91460) - Fix torch.var backward when input numel == correction (#94546) - Fix CopySlices logic to ensure wrapped node runs properly. (#89812) ## torch.nn API - Fix for RNN-like `Module`s to work with `stateless.functional_call()` (#91111), better error messages (#87442), - Add missing dim checks `EmbeddingBag` (#85433) - Fix `Upsample` and `EmbeddingBag` module printing (#93850) - Fix segfaul in `Conv3D` CPU implementation (#94325) - Fix overflow issue in `Upsample` (#94290) - Fix `functiona.pixel_{shuffle,unshuffle}` to consistently return views or not (#86608) - Fix 64bit indexing `Conv3d()` (#87527), `Upsample()` (#87901) - Fix preserving requires_grad-ness in fusion utils (#89100) - Fix support for empty inputs/outputs for `Conv{1,2,3}d()` (#86521), `functional.adaptive_{avg, max}_pool()` (#88906) - Fix buffer overflow in `Upsample()` (#89252), `MaxUnpool3d()` (#94372) - Fix `functional.grid_sample()` loss of precision for `torch.float16` inputs (#90427) - Fix `functional.interpolate()` bicubic interpolation to properly preserve memory format (#90470) ## torch.func - Fix cross to match unbatched behavior (#86926) - Properly error on complex inputs or outputs in jacrev, jacfwd (#94805) - Fix batching rule for dropout (#92975) - Fix vmap and anomaly mode interaction (#92672) - Fix and update type hints for `make_functional.py` (#91579) - torch.tril & torch.tril : add out of bound checks (#89384) - Fix torch.cat batching rule (#86932) - Fix reduction boxed batching rules (#91109) ## Cuda - Check SM version before calling flash attention with BFloat16 (#86600) - Add range check to multi margin loss target (#89008) - Fix NVML visible device parsing (#92315) - Take `CUDA_VISIBLE_DEVICES` into account for nvml calls (#94568) - Fix topk IMA (#93095) - Fix: half reduction with multiple sub-iterators (#85596) - Fix segfault when swapping custom allocator (#89613) - Conditionally set device in autograd engine (#91191) - Store `autocast_gpu_dtype` in `custom_fwd` and `custom_bwd` for BFloat16 autocast (#88029) - Do not use at::cuda::getDefaultCUDAStream() (#91180) - Ensure that our error handling runs with the GIL enabled (#92848) - Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192) - Fixes a memory leak by making autocast cache global instead of thread-local (#86492) - Take `CUDA_VISIBLE_DEVICES` into account for nvml calls (#94568) - Explicitly set the workspace for cuBLAS handles (#86645) ## Cpp API - Fix CUDNN_PATH handling on Windows (#88898) - Fix typos in warning/error messages(#88961) - Remove uneeded checks from embedding bag impl (#92982) - Fix c++ : segfault in modulelist and moduledict (#93074) ## Visualization - Fix overflow issue in tensorboard image summary (#90423) - Remove deprecated call to tf.io.gfile.get_filesystem (#89832) ## NestedTensor API - Enable 
non-contiguous Nested Tensors for BMM inputs for NT on CUDA (#88108), linear backward (#94317) - Fix bug in unsqueeze_nested stride calculation (#88688) ## Distributed - Distributed(c10d) - Fix a static initialization order fiasco in c10d (#90149) - Fix `send`, `recv` return type (#92152) - Fix MPI backend PG initialization (#92847) - Fix header-filter for clang-tidy c10 and apply some fixes to c10 and c10d (#91178) - Fix `backend_type` for backend/PG plugin (#93129) - Fix UCC PG barrier (#86961) - Properly finalize unsuccessful UCC collective posts (#89306) - Add pre & post processing for UCC CPU collectives (#89030) - Re-enabl `isinstance` with `torch.distributed.ReduceOp` (#87303, #88275) - Ameliorate custom `__eq__` for `ReduceOp` (#90088) - Fix warning if backend registers timer (#91702) - DistributedDataParallel - Fix DDP when the number of output features is zero (#87793) - FullyShardedDataParallel - Fix `use_orig_params=True` for reentrant activation checkpointing by disabling the post-backward hooks (#87413) - Re-establish the wrapped module in `_lazy_init` in case module changing after FSDP constructor (#87837) - Fix the incorrect norm calculation for `NO_SHARD` by handling sharded and non-sharded parameters differently in `FSDP.clip_grad_norm_` (#88955) - Pass through `ActivationWrapper` directly to the inner wrapped module to fix `state_dict` issues (#87950) - Remove the clean of FQNs even for `use_orig_params=True` in FSDP (#91767, #92662) - Restrict meta model check to non ignored modules in FSDP (#86766) - Fix `keep_low_precision_grads=True` for `use_orig_params=True` (#90027) - Fix for `use_orig_params=True` + `no_sync` (#90546) - Fix `no_sync`, `use_orig_params=True`, mixed precision, sharded (#92874) - Fix input grad propagation when using param mixed precision (#90921) - Fix `_mp_shard` in `record_stream` (#91096) - Fix "use-after-free" in reshard logic (#94859) - Fix `clip_grad_norm_` issues (#94835), (#86337) - Fix `load_sharded_state_dict` FQN mismatches for shared parameters (#86524) - Fix grad zero vs. `None` edge case (#87308) - Fix FSDP `state_dict` transformations of modules with persistent buffers failure with mixed precision enabled (#93396) - [FSDP] Fix `nn.Parameter` usage for 2D and `use_orig_params=True` (#89782, #89845, #90562) - RPC - FFixixed use after free in tensorpipe agent (#87627) - Torch Elastic - Make TorchElastic timer importable on Windows (#88522) - Tensor parallel & 2D parallel - Fix the logic to trigger load hooks for 2D parallel integration with FSDP. 
(#86272) ## Profiler - Minor bug fixes for ROCM tracing (#89785, #88207) ## Foreach API - Fix `_foreach_norm` on some tensor sizes (#91844) - Exempt `_foreach_norm` from autograd_not_implemented_fallback check (#93995) ## Complex API - Fix serialization of `conj` and `neg_view` (#88182) ## Linalg API - Add empty tensor check to \_compute_linear_combination (#94245) ## Optimizer API - Fix discrepancy between mt vs st impl (#92699) - Do NOT inplace modify gradients (#92706) - Fix memory leak in \_LRScheduler.step() (#85602) - Look up `group["capturable"]`, not `defaults["capturable"]` in Adam(W) (#94149) - `FusedAdam(W)` should take `OptState` into account before unscaling grads (#94060) - Fix LinearLR scheduler start_factor (#86695) - Keep AveragedModel buffers in sync when use_buffers=False (#84054) - Fix OneCycleLR error log (#92040) - Fix SparseAdam consuming iterator (#86210) - Fix empty grad support for SparseAdam (#86459) ## Serialization - Fix set pickle_module if not specified (#88570) - Explicitly check filelike arg of `torch.save` (#88867) - Fix dtype mismatch for unallocated storage deserialization (#91285) - Add float to list of allowed ops (#94910) ## Composability - Fix segfault in has_torch_function (#88559) - Fix for usages of **torch_dispatch** with operators that take in an OptionalTensorList argument (#88887) - Allow direct Tensor constructor to return preexisting PyObject (#92754) - Add fallthrough kernel for AutogradMeta key (#94603) - Several fixes to existing primtorch and reference decompositions: - `cat`: fix striding (#89332) - `prelu`: Fix prelu ref when a.ndim < 2 (#89809) - `huber_loss_backward` fix (#86955) - `uniform` fix (#90094) - `unfold_copy` fix (#86371) - Fix aliasing for primtorch view meta kernels (#86285) - Properly compute device for elementwise operations with CPU scalar tensor (#93073) - Several fixes to existing operators’ meta tensor kernels: - aten.\_embedding_bag (#92549) - aten.fill\_ (#87493) - `aten.group_norm` type promotion fix (#86607) - aten.\_cudnn_rnn (#91333) - aten.bernoulli (#88676) - unsqueeze\_ (#88675) - Several bug fixes as part of hardening functionalization, which is used in AOTAutograd: - fix detach() in functionalization (#87750) - fix `torch.as_strided_scatter_backward` memory initialization (#88342) - fix functionalization resize stride compute (#94018) - fix x.is_contiguous(channels_last) in functionalization (#94195) - fix set\_() with functionalization (#90722) - check for undefined tensors in advanced indexing during functionalization (#90791) - fix some composite compliance ops for functionalization (#86470) - Make `aten.copy` preserve strides (#89464) ## Sparse API - Fixes to `torch.mm`: (#90763), (#90917), (#91094) - Fix CSR to CSC conversion when given indices of int32 dtype (#91061) - Fix `mul` when given CUDA CSR Tensor and scalar (#91239) - Fix conversion from CSC, BSC to COO to only result in coalesced Tensors when appropriate (#91440) - Fix numel after resizing a CSR/BSR/CSC/BSC tensor. (#91831) - Fix `torch.triangular_solve` for CSR on CPU when `unitriangular=True`. 
(#93352) ## Distributions - Fix philox randn to follow standard normal distribution (#91945) ## Cpu - Fix access to uninitialized memory in VSX vector functions (#89833) - Fix buffer overflow from AddressSanitizer checks due to inaccurate bfloat16 representation of large integer (#89210) - Make torch.histc ignore NaNs on CPU (consistent with CUDA) (#85870) - Fix vectorized trigonometric functions for VSX (#86453) - Call `symint::sizes()` instead of `sizes()` on convolution error messages. (#89549) - Make `torch.linspace` result on CPU consistent with numpy (#89048) - Remove variable_excluded_from_dispatch() assertion from mkldnncommon (#92168) - `exponential_` few fixes (1) lambda > 0 (2) mkl kernel to continuous (3) better error log on dtype (#92891) - Vectorize more stable complex division (#93277) - `cauchy_` few fixes (1) check gamma > 0 (2) better dtype error log (#93314) ## Intel - Fix CPU autocast for torch.cat due to the new type ITensorListRef (#87756) - Add parameters check for torch.\_mkldnn_transpose (#85318) - Fix build with Intel compiler due to c10/util/TypeIndex.h (#89610) ## Package - Treat builtins as default extern module (#88385) - Support pickle version 4 by adding missing ops (#90223) - Check spec for module source before falling back to file in package exporter (#90258) ## Quantization - Fix the call to get_executorch_backend_config (#86338) - Fix weight_dtype and bias_dtype backend_config checks (#86719) - Respect non_leaf_module_list for activation modules (#88498) - Fix incorrect integer cast on histogram observer bounds (#90355) - Improve numerical stability of HistogramObserver (#86522) - Quant_min typo bugfix in utils.py (#88024) - Fix fuse_func method overwrite (#87791) - Fix get_default_qat_qconfig for PT 1.13 (#88876) - Check the value of numel to avoid segfault (#81547) - Fix mkldnn quantization issue for weight reorder error (#86876) - Fix Memory Leak in QNNPACK QSoftmax Op (#89544) - Copy MHA's batch_first attribute in prepare() (#91680) - Fix for swap_custom_module_to_observer doing duplicate swaps on the same node.target (#91905) ## Fx - Correctly restore pybind11 error_already_set (#93238) - Remove proxy tensor's check for data dependent output (#93265) - Make ShapeEnv deepcopy-able (#93403) - Fix SubgraphMatcher for case of no anchor found (#86421) - Fix for partitioner with symbolic shapes (#86425) - Fix getitem in partitioner and make metadata storage more consistent (#87012) - Fix magic method try reverse protocol (#88030) - Fix FakeTensorProp on Module with Parameters or Buffers (#88700) - Fix PassManager to not use a class variable mutable list (#89108) - Prevent tracing when we track_tensor_tree (#89139) - Make all `make_fx` invocations isolated (opaque to higher `make_fx` invocations) by default (#93290) - Fix matching args in PatternMatcher (#94375) - Allow FakeTensorProp to run on graphs traced with some None inputs (#94569) - Copy codegen in legalize_graph (#90023) - Fix proxy unwrapping for cond() (#91907) ## ONNX - Fix `triu`/`tril` operator export with diagonal input (#86843) - Skip tensor printing during model tracing (#86223) - Fix `aten::index_put(self, mask, v)` export when `rank(mask) < rank(self)` (#92862) - Fix 0d-tensor broadcast export (#87211) - Fix device type detection based on strings (#86168) - Fix `scatter_add` with different static shape of src and index (#89787) - Fix `_pad_circular` export (#86984) - Fix concat with empty tensors (#87620) - Disable ONNX `ceil_mode` and `count_include_pad` to align torch `ceil_mode` 
results in corner case (#87892) - Fix ignored small eps in layer normalization in fp16 (#89869) - Fix `unconvertible_ops` as per #89261 (#89299) - Fix `Gather` replacement in `RNN peephole` (#93120) - Fix `cat` operator for tensors with unknown rank (#94870) - Fix scalar type analysis for copied constant (#86716) - Fix scalar type detection for optional tensors (#94427) - Fix 'prim::PackPadded' shape inference (#91829) - Add `onnx::Max` into standard Op for scalar type alignment (#88750) - Add `setType` from user into `InferredType` and `Reliable` in `ConstantValueMap` (#88622) - Integrate ONNX ATen Fallback export with the new operator registry (#87735) - Fix ONNX ATen Fallback integration for `BUILD_CAFFE2=0` builds (#88504) - Fix `torch.autograd.Function.symbolic` method support (#94746) - Fix `FindCommonAncestor` in `function_extraction` (#86650) - Update training state logic to support `ScriptedModule` (#86745) ## ROCm - Fix hipify mapping for cuDeviceGet (#90726) ## Mps - Fix issues with non-contiguous Tensor handling (#86956, #86958) - Fix issues with ops implementation `torch.median` (#90326, #88807), `torch.{std,var}` `correction` argument (#91203), `torch.index_select` (#94117, #91064), `torch.cumsum` (#94119), `torch.where` (#86240), `torch.nn.Embedding` (#82809), `torch.nn.Softplus` (#88555), `torch.nn.functional.pad` (#89864), `torch.max` (#91520), padding functions (#91522), `torch.nn.functional.upsample` (#91669), pooling functions (#91519, #94348), `torch.nn.{NLLLoss,SmoothL1Loss}` (#94226), `torch.nn.SoftPlus` (#94256), `torch.masked_fill` (#94263), `torch.fill_` (#94479), `torch.median` (#94489), `torch.nonzero` (#94442), `torch.nn.BatchNorm` (#94351), `torch.{min,max}` (#94386), `torch.nn.GELU` (#94529), `torch.nn.LSTM` (#94889), #95137),`torch.nn.Conv2d`(#95078),`torch.nn.functional.bilinear`(#94892),`torch.copy\_` (#95272),`torch.max_pool2d`(#94963),`torch.div` (#95769) - Fix issues with `torch.bool` for Unary ops (#91120), scatter ops (#94464), - Fix issues with `torch.float16` for `torch.nan_to_num` (#94220), `torch.nn.HuberLoss` (#94567) - Properly raise error for `torch.int64` inputs for `torch.dot` (#94270), `torch.floor_divide` (#94488), `torch.square` (#94766), - Properly cast `torch.int64` to `torch.int32` for reduction ops and raise warning. 
(#94484) - Properly raise unimplemented error for `torch.nn.Conv3d` (#94492), - Fix data type issues with index_add for non-`torch.float` inputs by casting them to `torch.float` (#88542) - Fix the high watermark value for unified memory allocation on x86 (#91268) - Fix handling of ops taking multiple dtypes as input (#91197, #91514) - Fix handling of channels last for `torch.cat` (#91786, #94662), `torch.Conv2d` (#91822, #94384), `torch.nn.{ELU,ReLU,Hardswish}` (#94664), `torch.nn.BatchNorm` (#94760), `torch.nn.MaxPool2d` (#94877) - Fix view operations handling (#94259, #94278,#95145, #95762, #95905) - Fix numerical stability issues with various ops (#94889) - Fix TORCH_WARN_ONCE (#95559) (#95559) ## Build - Move incorrectly placed closing curly brace of `extern "C"` block (#87853) - Set INTERFACE_LINK_DIRECTORIES on caffe2::mkl (#89359) - Also include MKL_THREAD_LIB in link libraries for caffe2::mkl (#89378) - Fix MSVC compiler error in basic_ops.h (#93322) - Fix a bug that redefines \_\_STDC_FORMAT_MACROS (#89310) - Fix ReplaceWithMaybeCopy test in OSS (#88099) ## Jit - Fix out-of-bounds error in torch.jit.script for functions with many decorators (#87804) - Assorted fixes for NNC cpu fuser (#85056, #86788, #88798, #89978) - Set the correct size of aten tensor in presence of MKL-DNN padding (#86767) - Fix Scalar(bool) handling in toIValue (#87179) ## Vulkan - Fix an issue with Vulkan not being able to be compiled on Windows (#92207) - Fix a possible empty vector dereference in the Vulkan optimization pass (#92918) ## Cudnn - Fix cudnn RNN reproducibility issue (#90522) - Fix `benchmark_limit` ignoring failed kernels in FIND (#91032) ## Releng - Set nvfuser default to disabled, keep CI (#86369) - Add manual cuda deps search logic (#90411) - Workaround for NumPy builds that ship with a broken Dlpack deleter (#89759) - Workaround MSVC ICE due to constexpr char\* template argument (#86288) - Add define to fix issue with compatibility with latest Windows SDK (#85408) - Remove invalid git option when updating submodules (#91132) # Performance ## Python API - Improve torch.lerp performance on cpu (#84845) - Improve torch.istft performance (#88060) - Call view within einsum to remediate MPS regression (#87135) - Remove unnecessary calls to python builtins(#94323) - Improve type hints for Module forward hooks (#92061) ## Autograd API - Use in-place input accumulation fast path for dense Tensors. 
(#90217) ## torch.nn API - Improve `functional.interpolate()` speed for `torch.channels_last` (#86361, #86361, #90302) - Improve performance for `functional.multi_head_attention_forward()` (#93234, #89847) - Improve performance for `TransformerEncoderLayer()` and `MultiheadAttention()` (#87377, #88488, #88831, #88854, #88970, #91171) - Improve `SyncBatchNorm()` performance by using the right gathering ops (#89521) - Improve `ConvTransposed2D()` CPU performance for `torch.{float32, bfloat16}` (#92530) - Improve `functional.local_response_norm()` performance for 3d inputs (#91052) ## torch.func - Add vmap batching rule for: `bitwise operators` (#91971), `nansum` & `nanmean` (#91372), `all` & `any` (#91966), `torch.linalg.vander` (#91749), `slogdet` (#86815), `torch.index_fill` (#91364), `narrow_copy` (#88130), `view_copy` (#88150), `greater_equal.Scaler` (#91324) ## Cuda - Layer norm backward speed gain with warp shuffles (#87445, #87814) - Avoid unnecessary type casts (#86086) - Use `atomicAdd` for `bfloat16` in Ampere and above (#84981) ## Cpp API - Vectorize torch.exp2 on CPU and add complex support (#92115) - Add various performance fixes to c++ STL usage (#94034) ## NestedTensor API - Improve performance for NestedTensor `torch.bmm`(#86856), (#85894) - Remove unnecessary check in `select_nested` (#89150) ## Distributed - Do not call `pad` in no-padding case(#88769) ## Complex API - Improve complex `lerp` performance (#84844) ## Mobile - Passing serialized XNNPACK model by reference (#89089) - Fix to add multiple outputs for the CoreML delegate (#88345) ## Sparse API - Improve performance of `mul` when given COO (#86269) - Improve `to(dtype)` support for all sparse compressed formats (#89055) - Improve conversion of BSR/BSC to COO using `to_sparse` (#91389) - Improve `sparse_mask` (#91964) - Improve `to_dense` backward by removing redundant call to `coalesce` (#92001) - Improve validation of CSR/CSC/BSR/BSC tensors for low dimensional inputs (#94048) - Improve torch.sparse.sampled_addmm performance on CPU for CSR inputs (#90978) ## Optimizer API - Improve foreach implementations by pre-grouping tensors to maximize fast path for `{Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD}`(#92048, #92362, #92363, #92349, #92364, #92365, #92369, #92372, #92338) ## Cpu - Optimizations for flip (#89414, #91806,#88989, #90013) - Add fmsub to vectorization primitives (#86568) - Optimize GELU BFloat16 Impl in CPU path (#79378) - Fix `biasadd` OMP perf issue for the packed MKL SGEMM (#92300) - Optimize LogSoftmax by improving thread-allocation in `_vec_log_softmax_lastdim` (#85398) - BF16 autocast conv transpose 1d/2d/3d for CPU (#92527) - Add mkl implementation for exponential on CPU (#69967) ## Fx - Use deque instead of list for BFS (#91139) - Refactor the dfs cyclic search from recursive to iterative approach (#91042) ## Mps - Increase performance of `torch.add{cmul,cdiv,mm}`(#94214, #94534)`torch.multinomial` (#86342), faster op launch time (#86437), `torch.linear` (#91114), view handling (#91743, #94218), `convolutions`(#94661), `scatter/gather` (#94663) ## Jit - Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591) ## Cudnn - Improve hot path heuristics performance in V8 (#90811) # Documentation ## Python API - Fix various spelling and grammatical errors (#87357, #87583, #88033, #91641, #91871, #86642, #86721, #90110, #87724, #88483, #92049, #92762, #88962) - Fix the documentation of various functions (#88059, #94545, #86593, #93145, #90071, #87870, #91627, 
#89910, #79086) - Fix dev-discuss link in the maintainer docs (#89493) - Add General Project Policies (#87385) ## Autograd API - Improve autograd documentation (#89401, #93065) ## torch.nn API - Improve documentation for: `MaxPool2d` (#86559), `utils.clip_grad_norm_()` (#91312), `Module()` (#87142), `{Unfold,Fold}()` (#88819), `torch.nn.functional.gelu` (#89061), `functional.conv2d` `padding` (#85004), `functional.leaky_relu()` (#94090), `MaxUnpool{1,2,3}D` (#94629) ## NestedTensor API - Update Persons of Interest (#90069) - Fix path to nested_tensor in example (#86891) ## Mps - Add 'mps' to the tensor attributes doc page (#86585) ## Distributed - Activation checkpointing - Clean up comments in activation checkpoint (#86622) - Distributed (c10d) - Improve documentation for various functions (#87018, #94543, #91116,#89905, #86438 ) - DistributedDataParallel - Improve Documentation (#86221, #91832) - RPC - Fix non-existing parameters in docstrings in benchmarks (#91115) - Tensor parallelism and DTensor: - Add more clarifications and fix errors in tensor parallelism docs (#94786) - Update 2D parallelism API naming and docs (#94771) - FullyShardedDataParallel - Add docs to explain the running the forward pass of of submodules in FSDP (#86343) - Clarify warnings to mention collectives (#87478) - Remove HSDP Zero-2 from doc (#90503) - Improve the comments for FSDP (#92359) - Distributed Checkpoint - Enable documentation for Distributed Checkpoint. (#92813) - Torch Elastic - Fix a minor typo in documentation (#90667) - Fix `torch.distributed.run` init connect timeout by comparing `host` with the current IP list (#90221) ## torch.func - Downgrade the warning about forward-mode AD coverage (#87383) - Add version selector back to functorch docs (#86602) - Add documentation for torch.func (#91319) - Fix AOTAutograd tutorial (#87415) - Add migration guide from functorch (#91811) - Improve inplace/view note on copy slices (#89856) - Add more details to the functorch install page (#86823) ## Linalg API - Add a note on the stability of linalg functions. (#88313) - Improve documentation for various linalg functions (#89013,#89383, #91129) ## Composability - Fix ScalarTensor **repr** in Extending PyTorch example (#86330) - Fix incorrect wrapping of function decorator (#94446) - Add **all** to torch.{autograd, fx, cuda} submodules (#85343) ## Dataloader API - Update dataloader docstring mentioning prefetch factor behavior (#89874) ## Sparse API - Extend documentation for `to_sparse` (#89912) - Small correction to `torch.sparse` overview documentation(#93258) ## Optimizer API - Improve documentation for various optimizers (#91195, #91196, #91881, #89575, #86629, #92111) - Add general documentation on our algorithm defaults (#95391) ## Serialization - Fix various spelling and grammatical errors (#90662, #91253) ## Distributions - Improve documentation for various distributions (#91091, #87577) - Add original sources/references to Wishart.py in distributions (#86543) ## Quantization - Improvements to various READMEs (#89319, #86914,#86523, #89795, #90403) - Add docstrings for operators defined in torch.ops.quantized_decomposed namespace (#89547) - Add x86 backend as default backend of server inference (#86794) - Fix non-existing parameters in docstrings in torch/ao (#90875) - Move parts of BackendConfig tutorial (#91999) ## ONNX - Fix non-existing parameters in docstrings in torch/onnx (#90593) - Update diagnostics system (#94565) ## Releng - Enabled xdoctest runner in CI (#83816)

PyTorch 1.13.1 Release, small bug fix release (2022-12-16)

This release is meant to fix the following issues (regressions / silent correctness):
- RuntimeError by torch.nn.modules.activation.MultiheadAttention with bias=False and batch_first=True #88669
- Installation via pip on Amazon Linux 2, regression #88869
- Installation using poetry on Mac M1, failure #88049
- Missing masked tensor documentation #89734
- torch.jit.annotations.parse_type_line is not safe (command injection) #88868
- Use the Python frame safely in _pythonCallstack #88993
- Double-backward with full_backward_hook causes RuntimeError #88312
- Fix logical error in get_default_qat_qconfig #88876
- Fix cuda/cpu check on NoneType and unit test #88854 and #88970
- Onnx ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops #88504
- Onnx operator_export_type on the new registry #87735
- torchrun AttributeError caused by file_based_local_timer on Windows #85427

The [release tracker](https://github.com/pytorch/pytorch/issues/89855) should contain all relevant pull requests related to this release as well as links to related issues.

PyTorch 1.13: beta versions of functorch and improved support for Apple’s new M1 chips are now available (2022-10-28)

# Pytorch 1.13 Release Notes

* Highlights
* Backwards Incompatible Changes
* New Features
* Improvements
* Performance
* Documentation
* Developers

# Highlights

We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration of CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, being included in-tree with the PyTorch release. This release is composed of over 3,749 commits from 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.

Summary:

* The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensor support is now enabled by default.

* Timely deprecating older CUDA versions allows us to introduce the latest CUDA versions as they are released by Nvidia®, and hence allows support for C++17 in PyTorch and new NVIDIA Open GPU Kernel Modules.

* Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user will be able to `import functorch` and use functorch without needing to install another package.

* PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.


You can check the blogpost that shows the new features [here](https://pytorch.org/blog/PyTorch-1.13-release/).

# Backwards Incompatible changes

## Python API

### **uint8 and all integer dtype masks are no longer allowed in Transformer** **(#87106)**

Prior to 1.13, `key_padding_mask` could be set to uint8 or other integer dtypes in `TransformerEncoder` and `MultiheadAttention`, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to `torch.bool` before using.

1.12.1

```python
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
```

1.13

```python
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
```

### **Updated `torch.floor_divide` to perform floor division** **(#78411)**

Prior to 1.13, `torch.floor_divide` erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use `torch.div` with `rounding_mode='trunc'`.

1.12.1

```python
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])
```

1.13

```python
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])
```

### **Fixed `torch.index_select` on CPU to error that index is out of bounds when the `source` tensor is empty (#77881)**

Prior to 1.13, `torch.index_select` would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that `torch.nn.Embedding` which utilizes `index_select` will error out rather than returning an empty tensor when `embedding_dim=0` and `input` contains indices which are out of bounds. The old behavior cannot be reproduced with `torch.nn.Embedding`, however since an Embedding layer with `embedding_dim=0` is a corner case this behavior is unlikely to be relied upon.

1.12.1

```python
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)
```

1.13

```python
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3
```

### Disallow overflows when tensors are constructed from scalars (#82329)

Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.

1.12.1

```python
>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)
```

1.13

```python
>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow
```

### **Error on indexing a cpu tensor with non-cpu indices (#69607)**

Prior to 1.13, `cpu_tensor[cuda_indices]` was a valid program that would return a cpu tensor. The original use case for mixed device indexing was for `non_cpu_tensor[cpu_indices]`, and allowing the opposite was unintentional (`cpu_tensor[non_cpu_indices]`). This behavior appears to be rarely used, and a refactor of our indexing kernels made it difficult to represent an op that takes in (cpu_tensor, non_cpu_tensor) and returns another cpu_tensor, so it is now an error.

To replicate the old behavior for `base[indices]`, you can ensure that either `indices` lives on the CPU device, or `base` and `indices` both live on the same device.

1.12.1

```python
>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
tensor([1., 3.])
```

1.13

```python
>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
# Old behavior can be replicated by moving b to CPU, or a to CUDA
>>> a[b.cpu()]
tensor([1., 3.])
>>> a.cuda()[b]
tensor([1., 3.], device='cuda:0')
```


### Remove deprecated `torch.eig`,` torch.matrix_rank`, `torch.lstsq` (#70982, #70981, #70980)
The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.
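
For reference (not part of the original deprecation notes), migrating to the `torch.linalg` equivalents looks roughly like the sketch below; `A` and `b` are placeholder tensors.

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 2)

# torch.eig(A, eigenvectors=True)  ->  torch.linalg.eig (returns complex-valued results)
eigenvalues, eigenvectors = torch.linalg.eig(A)

# torch.matrix_rank(A)  ->  torch.linalg.matrix_rank
rank = torch.linalg.matrix_rank(A)

# torch.lstsq(b, A)  ->  torch.linalg.lstsq (note the swapped argument order)
solution = torch.linalg.lstsq(A, b).solution
```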

## torch.nn

### Enforce that the `bias` has the same dtype as `input` and `weight` for convolutions on CPU (#83686)

To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the `dtype` of the `bias` matches the `dtype` of the `input` and `weight`.

1.12.1

```python
# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
```

1.13

```python
# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>>    out = torch.nn.functional.conv2d(input, weight, bias, ...)

# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)
```

## Autograd

### Disallow setting the `.data` of a tensor that `requires_grad=True` with an integer tensor (#78436)

Setting the  `.data` of a tensor that `requires_grad` with an integer tensor now raises an error.

1.12.1

```python
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)
```

1.13

```python
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
  File "", line 1, in 
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype
```

### Added variable_list support to ExtractVariables struct (#84583)

Prior to this change, a C++ custom autograd Function considered tensors passed in a TensorList not to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.

1.12.1

```cpp
struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0]};
    }
};
```

1.13

```cpp
struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0], 2 * grad_output[0]};
    }
};
```

### Don't detach when making views; force kernel to detach (#84893)

View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explicitly create a new tensor (e.g., using `.alias()`).

1.12.1

```cpp
torch::Tensor view_op(const torch::Tensor& self) {
  return self;
}
```

1.13

```cpp
torch::Tensor view_op(const torch::Tensor& self) {
  return self.alias();
}
```

## ONNX

### `torch.onnx.register_custom_op_symbolic` now only registers the symbolic function at the specified opset version (#85636)

This updates `register_custom_op_symbolic`'s behavior to *only register the symbolic function at a single version.* This is more aligned with the semantics of the API signature. Previously the API registered a symbolic function to *all* versions up to the specified version. As a result of this change, users need to register a symbolic function at the exact version when they want to override an existing symbolic function. Users are not affected if (1) an implementation does not exist for the op, or (2) the symbolic function is already registered at the exact version used for export.

1.12.1

```python
# Assuming an implemented symbolic function `custom_op_function`
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, 16)
```

1.13

```python
# Assuming an implemented symbolic function `custom_op_function`.
# To preserve the old behavior of covering all opsets up to 16, register each opset explicitly:
for opset in range(1, 17):
    torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, opset)
```

### Default ONNX opset is updated to 14 (#83284)

The update is done regularly to stay in sync with ONNX releases. Users can specify `opset_version` in `torch.onnx.export` to maintain opset version 13.
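
As a minimal sketch (the model and file name below are placeholders, not from the original notes), exporting at the previous default can be done by pinning `opset_version` explicitly:

```python
import torch

model = torch.nn.Linear(4, 2)
example_input = torch.randn(1, 4)

# Pin the previous default opset if downstream tooling still expects opset 13.
torch.onnx.export(model, example_input, "model.onnx", opset_version=13)
```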

### `torch.onnx.symbolic_registry` is removed (#84382)

We removed the `symbolic_registry` module and hid it as an internal implementation detail. Users previously relying on the `register_op` function to register custom symbolic functions should move to use the `torch.onnx.register_custom_op_symbolic` API.
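
A hedged migration sketch (the `aten::foo` op and its symbolic body below are hypothetical placeholders):

```python
import torch.onnx

# Hypothetical symbolic function for a custom `aten::foo` operator;
# the ONNX lowering below is only a placeholder.
def custom_foo_symbolic(g, input):
    return g.op("Relu", input)

# Previously registered through the removed `torch.onnx.symbolic_registry`;
# in 1.13, use the public API and the exact opset version instead.
torch.onnx.register_custom_op_symbolic("aten::foo", custom_foo_symbolic, 16)
```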

### `ScalarType` and global variables in `torch.onnx.symbolic_helper` are removed (#82995)

The `ScalarType` class in `torch.onnx.symbolic_helper`, along with the global variables `cast_pytorch_to_onnx`, `pytorch_name_to_type`, `scalar_name_to_pytorch`, `scalar_type_to_onnx` and `scalar_type_to_pytorch_type` are removed from the module. Users previously using these global variables for PyTorch JIT-ONNX type conversion in symbolic functions should move to use the `torch.onnx.JitScalarType` class.

1.12.1

```python
# 1
torch.onnx.symbolic_helper.scalar_type_to_onnx[
    symbolic_helper.scalar_type_to_pytorch_type.index(x.dtype)
].value

# 2
torch.onnx.symbolic_helper.scalar_name_to_pytorch[element_type] in cast_pytorch_to_onnx.keys()

# 3
torch.onnx.symbolic_helper.cast_pytorch_to_onnx["Long"]

# 4
torch.onnx.symbolic_helper.cast_pytorch_to_onnx[tensor.type().scalarType()]
```

1.13

```python
# 1
torch.onnx.JitScalarType.from_dtype(x.dtype).onnx_type()

# 2
torch.onnx.JitScalarType.from_name(element_type).onnx_compatible()

# 3
torch.onnx.TensorProtoDataType.INT64

# 4
torch.onnx.JitScalarType.from_name(tensor.type().scalarType()).onnx_type()
```

## Distributed

### **In c10d collectives, input tensors dtype must now be the same (#84664)**

We added a check to validate that the dtype is the same across all input tensors. Previously, users were allowed to pass in tensors with different dtypes for c10d collectives. Now, passing in tensors with different dtypes will throw a RuntimeError with the following message: “Invalid usage of tensors with different dtypes Found `torch.float` and `torch.half`”. Users can use `tensor.to(dtype=some_dtype)` to fix this.

1.12.1

```python
# users could pass inputs having different dtypes
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Both cases work.
>>> dist.all_gather(tensor_list, tensor)
>>> dist.all_gather(tensor_list, tensor_h)
...
```

1.13

```python
# all inputs of c10d collectives need to have the same dtype
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Only allow same dtype for all input tensors.
>>> dist.all_gather(tensor_list, tensor) # RuntimeError thrown
...
```

### **Users doing wildcard imports of torch.distributed.distributed_c10d will no longer get non-public symbols (#84872)**

We limit the usage of c10d APIs to public APIs, so if a user does a wildcard import and calls an internal API, it will fail. Please see the example below:

1.12.1

```python
# users could import both public and non-public symbols:
from torch.distributed.distributed_c10d import *
>>> is_nccl_available() # public API
>>> _check_single_tensor(...) # Non-public API
...
```

1.13

```python
# users can only import public symbols
from torch.distributed.distributed_c10d import *
is_nccl_available() # public API
_check_single_tensor(...) # Non-public API, this will fail now
...
```

### [Process Group C++ extensions](https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?highlight=process%20group) must use absolute paths when importing ProcessGroup.hpp (#86257); the ProcessGroup::Work object moved out of ProcessGroup into its own Work class (#83680):

Details of the changes and the updated tutorial can be found in the PyTorch tutorial PR [#2099](https://github.com/pytorch/tutorials/pull/2099)

1.12.1

```cpp
// users use relative path to import C++ headers and Work resides in ProcessGroup class
#include <c10d/ProcessGroup.hpp>
#include <c10d/Store.hpp>
#include <c10d/Types.hpp>
#include <c10d/Utils.hpp>
...
class WorkDummy : public ProcessGroup::Work {
    ...
}
```

1.13

```cpp
// users must use absolute paths to import C++ headers and Work is its own class
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>
...
#include <torch/csrc/distributed/c10d/Work.hpp>
class WorkDummy : public Work {
    ...
}
```

## Quantization

### Add required `example_inputs` argument to `prepare_fx` and `prepare_qat_fx` (#249) (#77608)

We added an additional required `example_inputs` argument to `prepare_fx` and `prepare_qat_fx` APIs, this can be used to do type inference to figure out the type information for each of the fx Node in the graph.

1.12.1

```python
m = resnet18(...)
m = prepare_fx(m, qconfig_dict)
# or
m = prepare_qat_fx(m, qconfig_dict)
```

1.13

```python
m = resnet18(...)
m = prepare_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
# or
m = prepare_qat_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
```

### Stop moving models to CPU in quantization convert (#80555)

Previously, we automatically moved the model to CPU in `torch.ao.quantization.fx.convert` to work around the issue where certain functions called by convert expect CPU arguments. This commit pushes this responsibility to the caller since it is the user's decision of which device to use.

1.12.1

```python
model = resnet18(...)
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
```

1.13

```python
model = resnet18(...)
model.cpu()  # if needed
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
```

### Replace the `is_reference` flag of the `torch.ao.quantize_fx.convert_fx` function with the `convert_to_reference` function (#80091, #81326)

This PR removes the is_reference flag from the existing `convert_fx` API and replaces it with a new `convert_to_reference` function. This separates (1) converting the prepared model to a reference model from (2) lowering the reference model to a quantized model, enabling users to call their custom lowering function for
custom backends.

1.12.1

```python
from torch.ao.quantization.quantize_fx import (
    prepare_fx,
    convert_to_reference,
)

prepared = prepare_fx(model, ...)
reference = convert_to_reference(prepared, ...)
```

1.13

```python
from torch.ao.quantization.quantize_fx import (
    prepare_fx,
    convert_to_reference_fx,
)

prepared = prepare_fx(model, ...)
reference = convert_to_reference_fx(prepared, ...)
```

### Add default configs for fixed qparams ops (#80184)

This commit adds qconfigs with special observers for fixed qparams ops (operators whose corresponding quantized version has fixed quantized parameters for output) like sigmoid in `get_default_qconfig_mapping` and `get_default_qat_qconfig_mapping`. For correctness, we also require users to use these special observers if we detect these fixed qparams ops in prepare.

1.12.1 (fails after this PR):

```python
from torch.ao.quantization.quantize_fx import prepare_fx

model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```

1.13

```python
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx

model = ModelWithFixedQParamsOps()
qconfig_mapping = get_default_qconfig_mapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```

### Replace `qconfig_dict` with a typed `QConfigMapping` object (#78452, #79618)

Previously, FX graph mode quantization configurations were specified through a dictionary of qconfigs. However, this
API was not in line with other core APIs in PyTorch. This commit replaces this dictionary with a config object that users will
create and pass to prepare and convert. This leads to better type safety and better user experience in notebook settings
due to improved auto completion.

1.12.1 (deprecated)

```python
from torch.ao.quantization.quantize_fx import prepare_fx

qconfig_dict = {
    "": qconfig,
    "object_type": [
        (torch.nn.Linear, qconfig),
    ],
    "module_name_regex": [
        ("foo.*bar", qconfig),
    ],
    "module_name": [
        ("mod", qconfig),
    ],
}

prepare_fx(model, qconfig_dict)
```

1.13

```python
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx

qconfig_mapping = QConfigMapping() \
    .set_global(qconfig) \
    .set_object_type(torch.nn.Linear, qconfig) \
    .set_module_name_regex("foo.*bar", qconfig) \
    .set_module_name("mod", qconfig)

prepare_fx(model, qconfig_mapping)
```

### Replace `*custom_config_dict` with typed config objects (#79066)

This commit replaces the following config dicts with python objects:

* prepare_custom_config_dict → PrepareCustomConfig
* convert_custom_config_dict → ConvertCustomConfig
* fuse_custom_config_dict → FuseCustomConfig

This leads to better type safety and better user experience in
notebook settings due to improved auto completion.
1.12.1

```python
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

prepare_custom_config_dict = {
  "float_to_observed_custom_module_class": {
     "static": {
         FloatClass: ObservedClass
     }
  },
  "non_traceable_module_name": ["mod1", "mod2"],
  "non_traceable_module_class": [class1, class2],
  "input_quantized_idxs": [0, 1],
  "output_quantized_idxs": [0],
  "preserved_attributes": ["attr1", "attr2"],
}

convert_custom_config_dict = {
  "observed_to_quantized_custom_module_class": {
     "static": {
         FloatClass: ObservedClass
     }
  },
  "preserved_attributes": ["attr1", "attr2"],
}

model = prepare_fx(
    model,
    qconfig_mapping,
    example_inputs,
    prepare_custom_config_dict=prepare_custom_config_dict)

model(data)

model = convert_fx(model, convert_custom_config_dict=convert_custom_config_dict)
```

1.13

```python
from torch.ao.quantization.fx.custom_config import (
    PrepareCustomConfig,
    ConvertCustomConfig,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

prepare_custom_config = PrepareCustomConfig() \
    .set_float_to_observed_mapping(float_class, observed_class) \
    .set_non_traceable_module_names(["mod1", "mod2"]) \
    .set_non_traceable_module_classes([class1, class2]) \
    .set_input_quantized_indexes([0, 1]) \
    .set_output_quantized_indexes([0]) \
    .set_preserved_attributes(["attr1", "attr2"])

convert_custom_config = ConvertCustomConfig() \
    .set_observed_to_quantized_mapping(observed_class, quantized_class) \
    .set_preserved_attributes(["attr1", "attr2"])

model = prepare_fx(
    model,
    qconfig_mapping,
    example_inputs,
    prepare_custom_config=prepare_custom_config)

model(data)

model = convert_fx(model, convert_custom_config=convert_custom_config)
```

### Remove `remove_quant_dequant_pairs` and fix tests (#84203)

This PR removed some passes in `convert_fx` and also fixed the way we quantize the layer_norm operator, so the `qconfig` for the layer_norm op needs to be updated as well.

1.12.1

```python
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_weight_observer
from torch.ao.quantization.backend_config import (
    DTypeConfig,
    ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# assumes an existing `qconfig` whose activation observer is reused; only the weight observer matters here
qconfig = QConfig(activation=qconfig.activation, weight=default_weight_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
    .set_object_type(torch.nn.functional.layer_norm, qconfig)

# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
```

1.13

```python
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_placeholder_observer
from torch.ao.quantization.backend_config import (
    DTypeConfig,
    ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# assumes an existing `qconfig` whose activation observer is reused; only the weight observer changes
qconfig = QConfig(activation=qconfig.activation, weight=default_placeholder_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
    .set_object_type(torch.nn.functional.layer_norm, qconfig)

# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
```

### Align observer dtype with reference model spec (#85345)

Before this PR, the `dtype` attribute of observers was not clearly defined. It originally meant `interface_dtype` in the eager mode workflow, which is how the codebase before this PR is using it. In the new reference model spec, `dtype` attribute of an observer represents the `dtype` value which needs to be passed into a `quantize` function in the reference model spec. This PR aligns the codebase to this definition of `dtype`.

1.12.1

```python
dynamic_quant_observer = PlaceholderObserver.with_args(
    dtype=torch.float, compute_dtype=torch.quint8)
```

1.13

```python
dynamic_quant_observer = PlaceholderObserver.with_args(
    dtype=torch.quint8, compute_dtype=torch.quint8)
```

## Composability

### **Changed the backend C++ kernel representation for some operators that take in lists of tensors (#73350)**

If an operator in ATen takes in a list of tensors and is marked as “structured” in native_functions.yaml ([example](https://github.com/pytorch/pytorch/blob/c8889f4e109866610bd1981f03deee8f102b5ce6/aten/src/ATen/native/native_functions.yaml#L1205)), then previously its TensorList argument was represented as `at::TensorList` (i.e. `c10::ArrayRef<at::Tensor>`). Now, it is represented as a more efficient type: `const ITensorListRef&`.

1.12.1

```cpp
at::Tensor cat_kernel(at::TensorList tensors, int64_t dim) {
    ...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
    ...
    m.impl("cat", &cat_kernel);
}
```

1.13
```cpp
at::Tensor cat_kernel(const at::ITensorListRef& tensors, int64_t dim) {
    ...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
    ...
    m.impl("cat", &cat_kernel);
}
```

## C++ API

### **Lowered randint default dtype to the C++ API (#81410)**

Prior to 1.13, the default for the `dtype` argument of `torch.randint`, `torch.long`, was set via manual python binding. However, in the C++ API, `torch::randint` would default to the global default data type, which is usually `float`. In 1.13 we changed the default for `dtype` in the C++ API to `int64` in order to match the python API. To reproduce the old behavior, one can set the `dtype` argument.

1.12.1

```cpp
torch::randint(/*low=*/0, /*high=*/10, {2, 3});
```

1.13

```cpp
// assuming default dtype is float
torch::randint(/*low=*/0, /*high=*/10, {2, 3}, torch::kFloat);
```

### **Enabled `dim=None` for `torch.{std, var, std_mean, var_mean}` (#81845, #82765, #82912)**

Prior to 1.13, a C++ API call that has argument types `torch::{std, var, std_mean, var_mean}(Tensor, OptionalIntArrayRef, int64_t, bool)` used to resolve to the `{std, var, std_mean, var_mean}.correction` overload. In this release, it resolves to the `{std, var, std_mean, var_mean}.dim` overload. With the `.correction` overload, the third argument of type `int64_t` could be used to pass a correction *δN* other than 1. In order to call the `{std, var, std_mean, var_mean}.correction` overload in 1.13, the old `int64_t` argument can be wrapped in a `c10::optional`.

1.12.1

```cpp
// using std as an example
int64_t correction = 2;
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/true);
```

1.13

```cpp
// To replicate in 1.13 using std as an example
auto correction = c10::make_optional(2);
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/true);
```

# Deprecations

## Distributed

We are deprecating the following c10d APIs: the `*_coalesced` APIs (#85959), the `*_multigpu` APIs (#85961), and `ProcessGroupRoundRobin` (#85158).

We added warnings when users call c10d’s `*_coalesced`, `*_multigpu` and `ProcessGroupRoundRobin` APIs. Previously, users could call these APIs without any warnings; now they will see warnings like “torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at [https://pytorch.org/docs/master/distributed.html#collective-functions](https://pytorch.org/docs/master/distributed.html#collective-functions)”. There are still workarounds for the `*_coalesced` APIs, but no workarounds will be provided for the other two.

1.12.1

```python
# users could use the following APIs with no warnings:
all_reduce_coalesced(...)
all_gather_coalesced(...)
broadcast_multigpu(...)
all_reduce_multigpu(...)
reduce_multigpu(...)
all_gather_multigpu(...)
reduce_scatter_multigpu(...)
...
```

1.13

```python
# users can still use these APIs but it will come with warnings:
all_reduce_coalesced(...)
# Warnings:
# torch.distributed.all_reduce_coalesced will be deprecated. If you must
# use it, please revisit our documentation later at
# https://pytorch.org/docs/master/distributed.html#collective-functions"

# Potential workaround:
reqs = []
with dist._coalescing_manager(group, reqs):
    reqs.append(dist.all_reduce(tensor1, async_op=True))
    reqs.append(dist.all_reduce(tensor2, async_op=True))
for req in reqs:
    req.wait()
...
```


We are deprecating passing `optim_input` into the FSDP optimizer state checkpointing APIs. Users can simply omit the `optim_input` argument; all behavior is preserved, so no change is required on the user’s side for now.

1.12.1

```python
# the user can use the following APIs with no warnings
full_optim_state_dict(...)
sharded_optim_state_dict(...)
shard_full_optim_state_dict(...)
flatten_sharded_optim_state_dict(...)
scatter_full_optim_state_dict(...)
rekey_optim_state_dict(...)
```

1.13

```python
# users can still use these APIs, but they will come with warnings
# The `optim_input` argument is deprecated and will be removed after PyTorch 1.13.
# You may remove it from your code without changing its functionality.
```
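As a minimal sketch (assuming an FSDP-wrapped module and its optimizer already exist under the hypothetical names `model` and `optim`), the updated call simply drops the argument:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# 1.13: omit optim_input; the full optimizer state dict is gathered as before
osd = FSDP.full_optim_state_dict(model, optim)
```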

## LinAlg

### Deprecate torch.lu in favor of linalg.lu_factor (_#77636_)

The new operation has a cleaner API and better docs. The update rule is as follows:

1.12.1

```python
LU2, pivots2, info = torch.lu(A, compute_pivots, get_infos=True)
LU1, pivots1 = torch.lu(A, compute_pivots)
```

1.13

```python
LU2, pivots2, info = torch.linalg.lu_factor_ex(A, pivot=compute_pivots)
LU1, pivots1 = torch.linalg.lu_factor(A, pivot=compute_pivots)
```

### Deprecate torch.lu_solve in favor of linalg.lu_solve (_#77637_)

The new operation has a notation consistent with `linalg.solve`, and has an extra parameter `adjoint=False`. The update rule is as follows:

1.12.1

```python
X = torch.lu_solve(B, LU, pivots)
```

1.13

```python
X = torch.linalg.lu_solve(LU, pivots, B)
```

## ONNX

### Monkey-patched convenience methods on `torch._C.Graph`, `torch._C.Block` and `torch._C.Node` are deprecated (#83006)

Deprecated methods include `Graph.op()`, `Graph.constant()`, `Graph.at()`, `Block.op()`, and `Node.__getitem__()`. Previously, these methods were patched into the classes above when users called `torch.onnx.export()`, and they are typically used in custom symbolic functions. Users can continue to expect `g.op()` and `g.at()` to work in symbolic functions: the `g` parameter has been substituted by a `GraphContext` object (#84728), which exposes these methods with unchanged APIs. Users should not rely on the `Graph.op()`, `Graph.constant()`, `Graph.at()`, `Block.op()`, or `Node.__getitem__()` methods when interacting with the C classes directly, and should use only the `op()` and `at()` methods of the `GraphContext` object, as other fields in the class will change in future releases.
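For illustration, a hypothetical custom symbolic function (the names `my_gelu_symbolic`, `mynamespace::gelu`, and `com.example::Gelu` are made up for this sketch) keeps working unchanged, because the `g` it receives, now a `GraphContext`, still exposes `op()`:

```python
import torch

# `g` is now a GraphContext rather than a raw torch._C.Graph,
# but g.op() keeps the same calling convention as before
def my_gelu_symbolic(g, self):
    return g.op("com.example::Gelu", self)

torch.onnx.register_custom_op_symbolic(
    "mynamespace::gelu", my_gelu_symbolic, opset_version=13
)
```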

# New features

## Python API

* Added a deterministic implementation of `scatter_add` on CUDA for all input sizes (#79466)
* Added `torch.concatenate` that aliases `torch.cat` (#85073)
* Added `Tensor.is_cpu` that returns whether a tensor is on CPU (#78887)
* Added a `force` kwarg to `Tensor.numpy()` that enables returning a numpy `ndarray` that does not share storage with the tensor (#78564)
* Added `torch.special.{airy_ai, bessel_j0, bessel_j1, bessel_y0, bessel_y1, modified_bessel_i0, modified_bessel_i1, modified_bessel_k0, modified_bessel_k1, scaled_modified_bessel_k0, scaled_modified_bessel_k1, spherical_bessel_j0}` (#78900), (#78901), (#78902), (#78912),  (#78451)
* Added `torch.special.{chebyshev_polynomial_t, chebyshev_polynomial_u, chebyshev_polynomial_v, chebyshev_polynomial_w, hermite_polynomial_h, hermite_polynomial_he, laguerre_polynomial_l, legendre_polynomial_p, shifted_chebyshev_polynomial_t, shifted_chebyshev_polynomial_u, shifted_chebyshev_polynomial_v, shifted_chebyshev_polynomial_w}` (#78196), (#78293), (#78304),  (#78366), (#78352),  (#78357)
* Added `weights_only` option to `torch.load` that restricts load to state_dict only, enabling safe loading. This can also be set using the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable (#86812)
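A brief sketch of a few of the additions above (the file name `weights.pt` is arbitrary):

```python
import torch

x = torch.randn(2, 3, requires_grad=True)

# torch.concatenate is a new alias for torch.cat
y = torch.concatenate([x, x], dim=0)

# force=True detaches, moves to CPU, and copies if needed before converting
arr = y.numpy(force=True)

# weights_only=True restricts unpickling to tensors and primitive containers
torch.save({"w": x.detach()}, "weights.pt")
state = torch.load("weights.pt", weights_only=True)
```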

## Build

* Added `-Werror=unused-but-set-variable` build flag (#79305)
* Added ability to get release versions based on the current tag (#78584)
* Added `-Werror=type-limits` in Bazel CPU build (#79139)
* Added `-Werror=unused-variable` in Bazel CPU build (#79156)
* Added `--config=shell` to the bazelrc file for easier debugging (#79350)
* Added clang `-Wconstant-conversion` to catch errors detected in #75400 (#80461)
* Added `-Werror=non-virtual-dtor` build flag (#81012)
* Turned on pocketfft flag for third-party pocket_fft library (#81670)
* Updated NCCL to v2.13.4-1 (#82775)
* Added `-Wunused-local-typedef` build flag (#86154)
* Increased max python version to include 3.10 (#84815)

## Complex

*  Added complex half support for:
    * [CPU] `torch.{index_select, index_add}` (#79217), (#79897).
    * [CUDA] `torch.roll` (#79970), `torch.fft.{fftshift, ifftshift}` (#79970), `torch.{acos, acosh, asinh, atanh}` (#80030), `torch.{cos, sinh, cosh, tanh}` (#78718), `torch.{sqrt, rsqrt}` (#77490), `torch.{triu, tril, diag, trace}` (#78062).
    * [CPU and CUDA] `torch.{where, pow, masked_fill, sgn, tan, angle}` (#78665)
* Added complex support for `torch.nn.ConvTranspose1d` (#79694).
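For example, a small sketch of the new complex `ConvTranspose1d` support (sizes are arbitrary):

```python
import torch
import torch.nn as nn

m = nn.ConvTranspose1d(2, 3, kernel_size=3, dtype=torch.complex64)
x = torch.randn(1, 2, 8, dtype=torch.complex64)
y = m(x)  # complex64 output of shape (1, 3, 10)
```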

## torch.nn

* Added `pop` function to `nn.Sequential` and `nn.ModuleList` (#81601)
* Added deepcopy support for parametrized `nn.Module` (#80811)
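A short sketch of both additions (module shapes are arbitrary):

```python
import copy
import torch.nn as nn
from torch.nn.utils import parametrizations

seq = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
relu = seq.pop(1)  # new: remove and return a submodule by index

lin = nn.Linear(4, 4)
parametrizations.spectral_norm(lin)  # registers a parametrization on lin.weight
lin_copy = copy.deepcopy(lin)        # deepcopy of parametrized modules now works
```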

## torch.optim

* Added maximization support via the `maximize` kwarg for `optim.SparseAdam` (#80336), `optim.ASGD`  (#81875), `optim.Rprop` (#81864), `optim.RMSprop` (#80326)
* Added support for differentiable optimizers via the `differentiable` kwarg `optim.SGD` (#80938), `optim.Adam` (#82205), `optim.RMSprop` (#83578)
* Added support for complex number for `optim.Adam` (#80279), `optim.AdamW` (#80280), `optim.Adamax` (#80319), `optim.RMSprop` (#83860), `optim.Rprop` (#83858),
* Handled complex params as independent real params in `optim.{RMSprop, ASGD}` (#83860), (#84472)
*  Added `optim.lr_scheduler.PolynomialLR` (#82769)
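For instance, a minimal sketch of the `maximize` flag and the new `PolynomialLR` schedule (parameter values are arbitrary):

```python
import torch

params = [torch.randn(2, 2, requires_grad=True)]

# maximize=True performs gradient ascent instead of descent
opt = torch.optim.RMSprop(params, lr=1e-2, maximize=True)

# new polynomial learning-rate decay schedule
sched = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=10, power=2.0)
```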

## BetterTransformer

* Allowed user to assert no mask contiguous check is necessary (#82533)
* Added support for norm_first in nn.TransformerEncoderLayer fast path (#78269)
* Added custom scaled dot product implementation (dense) (#85984)
* Added Better Transformer fastpath diagnostics (#81013)

## ForEach

* Implemented inplace `foreach` `maximum` and `minimum` (#82523)

## LinAlg

* Added `linalg.lu_solve`, `linalg.solve_ex`, `linalg.vecdot`, `linalg.vander` (_#77634_, _#80073_, _#70542_, _#76303_)
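A rough sketch of a few of these new functions:

```python
import torch

a = torch.randn(3, 4, dtype=torch.complex64)
b = torch.randn(3, 4, dtype=torch.complex64)

d = torch.linalg.vecdot(a, b)                  # conjugate dot product over the last dim
V = torch.linalg.vander(torch.arange(1., 4.))  # Vandermonde matrix from a 1D input

# solve_ex returns an info tensor instead of raising on singular inputs
x, info = torch.linalg.solve_ex(torch.eye(3), torch.ones(3))
```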

## Sparse

* Added `torch.sparse.spdiags` for easier creation of diagonal sparse matrices (#78439)
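For example, a small sketch of `spdiags`:

```python
import torch

# place three diagonals (offsets -1, 0, 1) into a 3x3 sparse COO matrix
diags = torch.tensor([[1., 1., 1.],
                      [2., 2., 2.],
                      [3., 3., 3.]])
A = torch.sparse.spdiags(diags, torch.tensor([-1, 0, 1]), (3, 3))
dense = A.to_dense()
```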

## torch.fx

* Enabled symbolic shapes (#82063, #82317, #82209, #83380, #85808, #84113, #84829, #84918, #85185, #85261, #85260, #85754, #85768, #86050, #86098, #86067)
* Created an improved version of subgraph matcher (#82090, #82853, #85444, #85456, #85617)
* Rewrite subgraph_rewriter with subgraph_matcher (#83717)
* Added PassBase for writing passes, PassResult for the return value of passes, and a PassManager for managing the workflow of passes (#79878, #81366, #80531, #82485, #83933, #84094, #84425, #84232)
* Added an FX graph partitioner and fuser (#79439, #80292)
* Added a reinplacing FX pass (#80897, #83626, #83845, #83846)
* Added a CSE pass to the common passes (#81512, #81530, #81742)
* Created DecompositionInterpreter for decomposing aten → prims after an initial make_fx call (#79989)
* Created a Backend for NvFuser based graph partitioner + Prims (#80591, #81311, #81436, #81911)
* Created a Backend for Cudagraphs from dynamo (#80566)
* Created a type constraint generator to Z3 (#79912, #80084, #80095, #80102, #80110, #80147, #80744, #80799, #80823, #80847, #80909, #80925, #80976, #81159, #81175, #81189, #81190, #81265, #81274, #81344, #81360, #81376, #81445, #81516, #81527, #81714, #82163, #82590, #82597, #82614, #82742, #82856, #82923,#82938,#83087, #83109, #83194, #83334, #83682, #83945)

## JIT

* Added new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)
* Added `torch.ops.nvprims` namespace for nvFuser-specific prims (#82155)
* Enabled fusion of conv with elementwise OP in NNC (#77157)
* Added symbolic shape functions for `conv_transpose2d.input, convolution, convolution_backward` (#77283, #83557, #80860)
* Added support in symbolic shapes for generalized lists of tensor shapes, tuple outputs, optional None, upper and lower bounds (#77389, #83092, #83222, #78679)
* Added support for `aten::_convolution` when it is 2D conv in NNC (#84038)
* Exposed `ProcessGroup::Work.wait()` API to TorchScript (#83303)

## ONNX

* Inlined `prim::PythonOp` for Autograd Function Export (#74765)

## AMD

* Enabled nvfuser (#82498)

## CUDA

* Added CUDA trace Python hooks (#82824)
* Added CUDA Sanitizer (#83984)
* Added support for multiple outputs in python jiterator  (#77921, #78139)

## Intel

* Added a launch script with Best Recipe of Deep Learning on Intel Xeon CPU (_#63932_)
* Enabled Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (_#63289_)
* Added unified x86 quantization backend (_#84329_)

## MPS

* Added `aten::index_add.out` operator for MPS backend (_#79935_)
* Added `aten::prelu operator` for MPS backend (_#82401_)
* Added `aten::bitwise-not` operator native support for MPS backend (_#83678_)
* Added `aten::tensor::index_put` operator for MPS backend (_#85672_)
* Added `aten::upsample_nearest1d` operator for MPS backend (_#81303_)
* Added `aten::bitwise_{and|or|xor}` operators for MPS backend (_#82307_)
* Added `aten::index.Tensor_out` operator for MPS backend (_#82507_)
* Added `aten::masked_select` operator for MPS backend (_#85818_)
* Added `aten::multinomial` operator for MPS backend (_#80760_)

## Profiler

* Integrated Execution Graph Observer into PyTorch Profiler (#75358, #79753, #82895, #84285)
* TorchTidy: experimental tool to identify anti-patterns from traces (#79631, #79874, #79993, #80094, #80108, #80572, #81056, #81273, #81501, #81733, #81740, #81921, #82421, #82248, #82261, #82782)
* Added reporting for OOM events to the Pytorch Profiler. (#80050)

## Vulkan

* Added Vulkan support for the following operators:
    * `torch.cumsum` (#78554, #81107)
    * `torch.nn.LSTM` (#78943, #79702)
    * `torch.nn.ReplicationPad2d` (#79057, #79291)
    * `torch.nn.threshold` (#78654, #79717)
    * `torch.nn.BatchNorm2d` (#80510)
    * `torch.nn.LayerNorm` (#80980)
    * `torch.nn.GLU` (#80910, #81729)
    * `torch.select` (#81771)
    * `torch.stack` (#81064)
* Prototype implementations for Quantized Tensors were added (#81491). These implementations still need to be exposed to Torchscript, but so far prototype implementations for the following ops have been added:
    * `torch.quantize_per_tensor` (#81492)
    * `torch.dequantize` (#81493)
    * Quantized arithmetic ops (#81494, #81632, #81640, #81641)
    * Quantized 2D convolution (#81495, #81496, #81497)
    * Quantized `Upsample2D` (#81720)

## Mobile

* Added support for dtypes and custom classes in model tracer (#84795)
* Extended Flatbuffer to get mobile_info for NMLML workflows (#78306)
* Added serialization/deserialization of Sparse Quantize Linear Packed Params (#80474)
* Added qnnpack bcsr matrix unpacking and use unpacking in Linear module (#80475)
* Added OwnedOrBorrowedVector for QNNPack BCSR Indices/Values (#80476)

## Distributed

#### `Distributed Checkpointing` (Prototyping)
* This is a prototyping effort which enables loading and saving PyTorch models from one or more hosts. Models can use features such as DDP, FSDP and ShardedTensor and they can have a different configuration between saving and loading - for example, save from 4 hosts and load from a single host. Distributed checkpointing has an extensibility API that enables full control of how a model is saved; and a pluggable IO backend. (#83781, #83419, #84952, #84881)

#### `Distributed(c10d)`

* Made c10d collective ops dispatcher passable. It allows tracing mechanisms such as LazyTensor and AOTAutograd to observe communications, e.g., : broadcast(#76722), allreduce(#79582), allgather (#79669), reduce_scatter (#79683), reduce  (#79686), gather (#79687), scatter (#79688), alltoall (#79691), barrier (#79777), send/recv (#79779).
* Added UCC process group (#79918)
* Enabled uneven input support for `all_gather`  (#83713) and uneven output support for `reduce_scatter` (#87010)
* Added NCCL PreMul Sum to c10d `ReduceOp` (#84243)

**`DistributedDataParallel`**

* Made DDP work with Python process group (#79176)
* Enabled Zero1's ddp_with_overlap for hpu backend (#80438)

#### `FullyShardedDataParallel`

* Added forward prefetching option in FSDP API (#85177)
* Added fp16 and bf16 hooks for FSDP (#81711)
* Implemented `sharded_optim_state_dict` and `flatten_sharded_optim_state_dict`. (#77628)
* Added rate limiter (#83917) Thanks to IBM Research team, @lchu-ibm for his contributions to FSDP and @hfwen0502 for the experimental testbed that identified the issues.
* Added an option to keep grads in lower prec (#85223)

#### `torch.distributed.elastic`

* Added watchdog to TorchElastic agent and trainers (#84081)

#### `Activation Memory Management` (Prototyping)

* We offer a new API, `torch.distributed.algorithms.checkpoint.checkpoint_wrapper` to wrap `nn.Modules` with activation checkpointing or activation offloading to easily use and experiment with activation checkpoint techniques without modifying model code. This makes it simpler to leverage activation checkpointing to reduce memory footprint of your training applications and train larger models. (#83035, #78704, #78854, #79830, #80089, #84907, #84908, #85448, #85449)

## Infra (RelEng)

* Enabled multigpu unittests on FSDP (#77947)
* Added feature to do rebase (via comment) onto any branch (#78772)
* Added implementation to allow PR collaborators to revert their PRs (#82360)
* Added torchvision onto the commit pins file (#79151)
* Turned on `-Werror=all` with a few exceptions in Bazel build for CUDA (#79306)
* Prepared for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
* Added ROCm build to pull request jobs (#80149)
* Added dynamo test configuration (#80342)
* Enabled ROCm CI for trunk test (#80920)
* Added linux cuda 11.7 workflows (#81089)
* Updated CI docker images and jobs to ROCm5.2  (#81168)
* Added UCC PG build in CI (#81583)
* Enabled periodic builds for CUDA 11.7 (#81688)
* Enabled distributed tests for ROCm (#81751)
* Added New TORCH_UCC_BLOCKING_WAIT env variable (#81791)
* Change functorch pin mechanism to test functorch in pytorch/pytorch now that functorch is inside pytorch/pytorch (#81918)
* Added Python 3.11 nightlies for Linux PyPi (Please note that 3.11 binaries are not fully functional) (#82302)
* Updated ROCm nightly builds to rocm5.2 (#82353)
* Add functorch target to cmake (#83464)
* Upgraded CUDNN version for cuda 11.7 (#84964)
* Enabled pytest-shard for functorch (#85321)
* Enabled CI to run test_ops in parallel (#85528)
* Updated trunk CUDA-10.2 to CUDA-11.7 (#85943)
* Added support for building and running Metal tests in CI (#86073)
* Bumped nvidia docker version and using python 3.10 for cuda11.7 (#82472)

# Improvements

## Python API

* Added `float16` support for `torch.{arange, linspace}` (#80492)
* Added integer support to `torch.index_reduce` (#80464)
* Added a `stable` kwarg to `torch.argsort`  that controls the relative order of equivalent elements (#75162)
* Improved stability of `torch.distributions.kl_divergence`  for two Bernoulli distributions (#79944)
* Improved type annotations for `torch.{as_tensor, as_subclass}`  (#86105)
* Added type promotion support for `torch.{addcmul, addcdiv}` (#74234)
* Added `bfloat16` support for `torch.save` with XLA/HPU tensors (#77534)
* Improved wrapper subclass detection for serialization (#81105)
* Updated python API `TensorOption` signatures for consistency with JIT schemas (#82241)
* Allowed disabling of`torch.library.Library` with PYTORCH_DISABLE_LIBRARY (#85190)
* Enabled `dim=None` for `torch.{mean, sum, nanmean, nansum}` (#81286), (#79881), (#82912); see the sketch after this list
* Added feature to enable registration of extension device modules as a native module under the torch namespace (#78329)
* Added `logsumexp` to `amp.autocast` (#76330)
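A brief sketch of the `dim=None` and `stable` additions mentioned above:

```python
import torch

x = torch.randn(4, 5)
total = torch.sum(x, dim=None)  # dim=None now reduces over all dimensions

# stable=True keeps the relative order of equal elements
idx = torch.argsort(torch.tensor([2, 1, 2, 1]), stable=True)
```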

## C++ API

* Allowed `const T&` access to `ListElementReference` (#83177)
* Redirected print messages to `stderr` in `torch.utils.cpp_extension` (#82097)
* Updated CUDA compiler matrix in `torch.utils.cpp_extension` (#82860)
* Added `__all__` to `torch.utils.cpp_extension`, `torch.utils.hooks` and `torch.utils.show_pickle` (#85331)

## Autograd

* Added forward AD coverage for `torch.{amin, amax, nansum, nanmean}`  (#80082),  `torch.scatter_reduce` (except `reduction=prod`) (#85000),  `torch.linalg.det` (#79487),  `torch.{elu_, celu_, selu_}` (#83080)
* Added forward-over-reverse AD coverage for `nn.functional.binary_cross_entropy` (#77852), `nn.functional.embedding` (#79699), `nn.functional.{mse_loss, softplus, l1_loss, smooth_l1_loss, prelu, hardswish}` (#78740), `nn.functional.{nll_loss, batch_norm, layer_norm, group_norm, cross_entropy, soft_min}` (#84976), `torch.{log_softmax, softmax}` (#84976), `torch.{amin, amax, nansum}` (#80082)
* Added support a stable double backward on `torch.linalg.det` for real inputs (#80217)
* Added support for kwargs input to function when `torch.utils.checkpoint` with `use_reentrant=False` (#80987)
* Added context manager to disable saved tensor hooks: `torch.autograd.graph.disable_saved_tensors_hooks` (#85971)
* Added new cpp custom function API to inform the backward function whether a gradient is necessary to compute: `ctx->needs_input_grad(idx)` (#82544)
* Added all device types in the pybinded DeviceType enum (#83676)
* Added `check_nan` flag to `torch.autograd.detect_anomaly` which enables users to run anomaly mode without nan checking (#83481)
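For example, a minimal sketch of the new `check_nan` flag:

```python
import torch

# anomaly detection without the extra NaN checks on backward results
with torch.autograd.detect_anomaly(check_nan=False):
    x = torch.randn(3, requires_grad=True)
    x.sin().sum().backward()
```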

## Build

* Specify "Generic" BLAS library name to ensure PyTorch can find the BLAS library (#74269)
* Generate CUDAConfig.h only for CUDA builds (#78218)
* Moved build_variables.bzl and ufunc_defs.bzl from pytorch-root/tools/ to PyTorch root directory (#78542)
* Made lintrunner compatible with M1 (#78628)
* BLAS library is linked privately instead of being linked publicly (#78883)
* Updated build targets to include generated enum_tag.cpp (#79668)
* Use miopen_LIBRARIES and rccl_LIBRARIES directly, when they are valid target for RCCL (#80446)
* Deleted Win specific case for CMake older than 3.1 (#81411)
* Split `.cu` to improve compile times (#81193)
* Added `append_cxx_flag_if_supported` macro (#82883)

## torch.nn

* Improved `groups` argument validation for `nn.Conv{1,2,3}d` modules (#77919)
* Improved error message for convolution backward fallback kernel (#81538)
* Reduced memory usage of `nn.Module` full backward hooks by removing reference cycles (#80139)
* Improved `kl_div` at boundary and its general implementation (#80334)
* Improved input shape validation for MKL-backed convolution operations (#76526)
* Improved input validation for `nn.AdaptiveAvgPool2d` (#84061)
* Improved `groups` argument validation for `nn.Conv{1,2,3}d` (#85248)
* Improved input index validation for `nn.MaxUnpool{2,3}d` (#78280)
* Improved listing of public APIs for `optim` and `nn` (#80237)
* Added new operators for `nn.Sequential`: `+` (#81170), `extend` (#81179), `insert` (#81402), `+=`, `*` and `*=` (#81279); see the sketch after this list
* Added deepcopy support for uninitialized parameters (#83809)
* Added nondeterministic alert for `nn.MaxUnpool{1,2,3}d` (#84766)
* Added Bfloat16 support for the backward pass of `nn.functional.kl_div` on CUDA (#77676)
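A sketch of the new `nn.Sequential` operators referenced above:

```python
import torch.nn as nn

a = nn.Sequential(nn.Linear(4, 4))
b = nn.Sequential(nn.ReLU())

c = a + b                      # concatenate two Sequentials
c.insert(1, nn.Dropout(0.1))   # insert a module at an index
c.extend([nn.Linear(4, 2)])    # append several modules at once
c2 = c * 2                     # repeat the container (modules are shared)
```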

## torch.optim

* Added support for optimizers with more than 2 betas for LRScheduler (#84486)
* Added `fused` kwarg to `optim.Adam` to enable a fused implementation on CUDA (#85739)
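A minimal sketch of the fused implementation (requires a CUDA device; sizes are arbitrary):

```python
import torch

# fused=True selects the fused CUDA implementation of Adam
params = [torch.randn(10, 10, device="cuda", requires_grad=True)]
opt = torch.optim.Adam(params, lr=1e-3, fused=True)
```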

## Composability

* Significant hardening and improvements to the `functionalize()` API that lives with functorch (#77129, #77126, #77125, #78199, #77132, #77713, #77714, #78819, #78820, #82008, #82009, #81702, #80416, #80418, #80251, #80526, #82326, #81454, #81471, #83542, #83701, #85975)
* Allow `__torch_dispatch__` subclasses and modes to override more tensor metadata: device/size/stride/dim (#77684, #77970, #78646, #78691)
* Improvements to the `torch.library` API, for registering python functions to the pytorch dispatcher:
    * Improved error checking in `torch.library` (#77990)
    * Make `torch.library` decorators return function, to allow for chaining (#78996)
* Ported `cholesky`, `linalg_qr`, `linalg_eigh` and `linalg_eighvalsh` to structured kernels, giving them support with meta tensors (#79300, #79054, #79072)
* Added python decompositions for many torch operators. This adds meta tensor coverage for a large number of pytorch operators (#77930, #79768, #79808, #84062, #84350, #80219, #78350, #79667, #81003, #81420, #81113, #81241, #81765, #82284, #80497, #80358, #80182, #80737, #81734, #81826, #78461, #78468, #78525, #78914, #78919, #79900, #79225, #80964, #83235, #84108, #84451, #78602, #78603, #78527, #78604, #78992, #78993, #78997, #79278, #79341, #79311, #79411, #79581, #81800, #79834, #82309, #79975, #82587, #82603, #83191, #84349, #84460, #85793, #86057)
* Beefed up API for printing out operators registered to the dispatcher (#78995)
* Trued up `c10::FunctionSchema::operator<<` to print native_functions.yaml syntax (#79645)
* Made it so that it is valid to set metadata after detach calls, like `x.detach().resize_(...)` (#83590)
* Optimized `torch.ops.ns.opname.overload` accessor in `__torch_dispatch__` (#85132)

## Dataloader

* Added shape checking on argument `weights` for `WeightedRandomSampler` (#78585)
* Added support for `random_split` to accept percentages as `lengths` (#78877); see the sketch after this list
* Extended collate function that can register collate functions to handle specific batch types (#85748)
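A small sketch of the fractional `random_split` lengths (the dataset is arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(torch.arange(10))
# lengths may now be given as fractions that sum to 1
train, val = random_split(ds, [0.8, 0.2])
```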

## Functorch

* `functorch.jacfwd` now accepts a `randomness` kwarg (#84220)
* Improved the error message when using `vmap` on a function with no Tensor inputs (#83016)
* Relaxed the `Tensor.as_strided` batching rule. This is a primitive used in forward-mode AD (among other things) and improves composability of vmap with other transforms (like jvp).
* `functorch.functionalize`: added support for in-place views on inputs (#83993)
* `functorch.functionalize`: moved this API out of the `functorch.experimental` namespace (#85742)
* Added vmap support for `linalg.cholesky`, `linalg.eigvals`, `linalg.eigvalsh`, `linalg.matrix_norm`, `linalg.matrix_power`, `linalg.norm`, `linalg.tensorinv`, `linalg.solve_triangular`  (#82177)
* Added vmap support for `linalg.solve` (#82814)
* Added vmap support for `linalg.cross` (#83759)
* Added vmap support for `linalg.matrix_rank` (#83760)
* Added vmap support for `linalg.pinv` (#83761)
* Added vmap support for `Tensor.fill_` (#84015)
* Added vmap support for `linalg.lstsq` (#82325)
* Added vmap support for `linalg.lu_solve` (#85175)
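For instance, a rough sketch of the new `vmap` coverage for `linalg.lu_solve` (batch and matrix sizes are arbitrary):

```python
import torch
import functorch

A = torch.randn(8, 3, 3)
B = torch.randn(8, 3, 1)
LU, pivots = torch.linalg.lu_factor(A)

# vmap the per-sample solve over the leading batch dimension
x = functorch.vmap(torch.linalg.lu_solve)(LU, pivots, B)
```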

## LinAlg

* Added a `driver=` kwarg to `torch.linalg.svd` and `svdvals`, and added the cuSOLVER gesvdaStridedBatched driver to `linalg.svd` (_#74521_)
* Added opteinsum backend to `torch.einsum` (_#86219_)
* Added path optimize kwarg to `einsum` (#84890)
* Call view instead of sum in `einsum` to remediate MPS regression (#87135)
* Ensure that we contract left to right in `einsum` (#87199)
* Fixed opt_einsum defaults to be more reasonable (#86985)
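A small sketch of the path-optimized `einsum` (assumes the `opt_einsum` package is installed; with fewer than three operands the path optimization is a no-op):

```python
import torch

a = torch.randn(16, 32)
b = torch.randn(32, 64)
c = torch.randn(64, 8)

# with opt_einsum installed, einsum now picks an optimized contraction
# order for 3+ operands (controlled via torch.backends.opt_einsum)
out = torch.einsum("ij,jk,kl->il", a, b, c)
```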

## Sparse

* Added `sparse_dim` and `dense_dim` for batched, hybrid CSR/CSC/BSR/BSC (#80565, #80901)
* Added support for conversion between batched CSR/CSC/BSR/BSC and dense Tensors (#80781, #83084, #83086, #78025, #80354, #82120)
    * Conversion between SparseBsr and Strided (#78025)
    * Added support for BSR <-> Strided Conversion (#80354)
* Added support for conversion between CSR and CSC (#85091)
* Added support for conversion between BSR and BSC (#85091)
* Added partial support for CSR/CSC/BSR/BSC inputs to `mm`, `addmm`, `matmul` and `F.linear` (#85551, #85308, #85379, #85307)
* Added support for COO to `permute` (#79707)
* Added support for ComplexHalf to `torch.nonzero` and `add(dense, CSR)` (#79062)
* Added support for CSC/BSR/BSC to unary zero-preserving functions. (#78173, #85031)
* Added support for batched BSR/BSC to `transpose` (#82122)
* Added support for scalar together with COO inputs to `mul` (#82962)
* Added support for CSC/BSR/BSC to `empty_like` (#82310)
* Added support for batch dims of CSR/CSC/BSR/BSC to `select` (#82119)

## torch.fx

* In constant folding, added `device_for_folded_attrs` parameter and sets the `requires_grad` option for a folded tensor (#79067)
* Mode-based tracing in make_fx (#79638, #84238)
* Made executor handle kwargs (#79858)
* Added `ignore_parameters_and_buffers` flag to FxGraphDrawer (#79982)
* Enabled an `is_fx_tracing` flag in the FX tracer (#80255)
* Attached ProxyTorchDispatchMode to ProxyTensor and use it in `__torch_dispatch__` (#82549)
* Used `enable_tracing` flag for ProxyTorchDispatchMode instead of modifying torch dispatch mode stack inner attributes (#82643)
* Improved legalize_graph pass in FX (#82874)
* Implemented `__deepcopy__` for fx.Tracer (#83130)
* Hacked up make_fx to natively support varargs (#83210)
* Updated proxy_tensor.py to support List input/output (#83302)
* Added *_only and all/any pytree utilities (#83316)
* Deleted ProxyTensor wrapper subclass (#83330, #83646)
* Added support for partial decompositions in make_fx (#83770)
* Added metadata field to fx.GraphModule (#84378)
* Added option to maintain the FX graph execution order after splitting_module (#85188)

## JIT

* Added PReLU to MKLDNN convertible Ops in JIT optimize_for_inference (#79011)
* Enabled `torch._refs.var` for nvFuser executor (#79517)
* Fixed nvFuser's `where` (tensor, python_scalar, tensor) type promotion (#80347)
* Added ComplexDouble scalar creation bindings to nvFuser's Python API (#80522)
* Added real and imag to NVFuser and its python frontend (#79824)
* Added Nvfuser opt in for decomposition (#81134)
* Added `torch.jit.fuser()` option for disabling all fusers (#81731)
* Added support for symbolic diff for `silu` (#81724)
* Added NVFuser support for (`prims.sign, refs.sign, squeeze, native_batch_norm, transpose`) (#83167, #85562, #84629, #84117)
* Use high precision accumulate buffer for bf16 accumulation in NNC (#84402)

## Quantization

* Improved quantization support for `masked_fill` (#78368, #85108)
* Improved quantization support for `index_put` (#78384, #85685)
* Improved quantization support for `LSTM` and `MultiHeadAttention` (#79959, #79956, #79960, #83304, #85068)
* Added support for quantized `matmul` (#83885)
* Introduced a more stable conv_bn fusion for QAT training (#85744)
* Removed warnings from using torch.tensor(value) (#84277)

## ONNX

* Added operator support for `torch.tensor_split` (#77437), `torch.lerp` (#78891), `torch.movedim` and `torch.moveaxis` (#78931), `torch.scatter_add` (#79103), `torch.argsort` (#80234), `aten::native_dropout` (#81743), `aten::native_layer_norm` (#81754), `aten::convolution` (#81815), `aten::_log_softmax` (#81804), `aten::layer_norm` for ONNX opset version 17 using LayerNormalization (#84293), `nn.init.normal` (#84149)
* Added quantization support to more single output ops (#83008) `aten::reshape`, `aten::reshape_as`, `aten::t`, `aten::transpose`, `aten::numpy_T`, `aten::expand`, `aten::expand_as`, `aten::embedding`, `aten::embedding_bag`, `aten::view`, `aten::select`, `aten::eq`, `aten::ne`, `aten::gt`, `aten::lt`, `aten::le`, `aten::ge`, `aten::elu`, `aten::selu`, `aten::hardtanh`, `aten::hardswish`, `aten::as_strided`, `quantized::sigmoid`, `quantized::layer_norm`, `quantized::group_norm`, `quantized::leaky_relu`, `quantized::instance_norm`
* ONNX operators are exported with names containing their associated scope from `nn.module` (#82038), (#82039), (#82040)
* Introduced runtime type checking with the beartype library in all public APIs (#83673), (#84091)
* All `torch.onnx` APIs now support runtime type checking when @beartype is present in the Python environment. A warning is emitted when a type mismatch is detected.
* This feature is experimental. To turn all warnings into errors, set the environment variable `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS`. To disable this behavior, set `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED` which effectively makes it a no-op.
* Improved shape type inference (#78999)
* Turn on ONNX shape inference by default (#82767)
* Enabled data propagation from ONNX (#80730)
* Introduced SARIF (#85428) for `torch.onnx` submodule
* Improved warnings and errors (#78441), (#78309), (#83332), (#85179), (#83007)
* Updated ONNX submodule to 1.12 (#79585)
* Apply Common Subexpression Elimination pass to ONNX export (#85665)

## AMD

* Support benchmark flag for MIOpen (#77438)
* Correctly handle the error codes of hipGetDeviceCount (#80405)
* Use torch._C._cuda_getArchFlags to get list of gfx archs pytorch was built for (#80498)
* `torch.cuda.is_bf16_supported()` returns True (#80410)
* Workaround missing hipProfilerStart/Stop (#82778)
* Enabled jiterator on ROCm (#77982)
* Enabled MIOpen fused convolution relu (#82002)
* Restore MIOpen benchmark flag default to true (#82656)
* embedded_interpreter_hip to enable torch::deploy on AMD (#83329)
* Add HIP libs into torch deploy init list & corresponding dependency for CURE benchmark running on AMD (#83434)

## CUDA

* Added synchronize hooks (#84427)
* Added CSAN support for CPU synchronizations (#84428)
* Return device count using nvml (#84879)
* Reworked printing tensor aliases in CSAN error message (#85008)
* Added jiterator support when dtype is `complex32` for `tan`, `atan`, `sin`, `asin` (#77802),(#77606)
* Added jiterator support when dtype is complex for `logical_{or, xor}` (#75947)
* Reduced overhead of `get_current_stream` (#78066)
* Added an argument to specify warmup iterations in make_graphed_callables (#78124)
* Small improvements to `device_count` (#85192)
* Memoize `torch.cuda.device_count` (#84878)
* Remove the construction of unused tensors in fallback convolution implementation (#79183)
* `__launch_bounds__` for `torch.mode` with CUDA 11.7 (#79710)
* Removed synchronization for D2H copy with a different dtype  (#80607)
* Added nondeterministic alert to CUDA `cumsum` (#75693)
* Annotated CUDACachingAllocator snapshots (#82146)
* CUDACachingAllocator snapshots from C++ (#86190)
* Propagate CUDAOutOfMemoryError to Python. (#83146)
* Set cublas workspace size to 4M (#74159)
* Allow changing the cuda allocator settings even after the process started (#84970)
* Fixed exception handling, improve overheads and avoid constructing storage for element size for DLPack (#84612)
* Added BFloat16 for fast layernorm (#83971)
* Added BFloat16 support for `torch.{im2col,col2im}` on CUDA (#84372)
* Added Bfloat16 support for `ReflectionPad` (#84949)
* Added explicit `__all__` to torch.cuda (#85193)
* Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
* Support cuDNN Errata Filter (#73934)
* Allowed the number of kernels profiled under `torch.backends.cudnn.benchmark = True` to be limited (cuDNN v8 benchmark limit) (#78299)
* Update tests and dispatching for CUDNN V8 API behavior for bfloat16 convs (#81139)

## Intel

* [RFC] Enable oneMKL&oneDNN on-demands verbose functionality (_#63212_)
* Updated ideep for NNC post-op (_#82705_)
* Enabled native 1d spatial input for Intel xpu (_#82301_)
* Added loss operators to fp32 cast policy of AutocastCPU (_#81689_)
* Added bfloat16 support for `lerp` on CPU (_#84327_)
* Added `prelu` op and module for quantized CPU backend (_#73491_)
* Enabled mkldnn matmul for aarch64 bf16 devices (#85546)

## MPS

* Added ranked tensors for addcmul ops in MPS instead of constants and updated the macOS version check (_#78354_)
* Moved MPS compat check into common comparison machinery of `TensorLikePair` (_#77836_)
* Made MPS buildable with either XCode or CommandLineTools (_#79430_)
* Improved MPS `aten::softplus` operator by adding RankedPlaceholder for graph nodes instead of constants (_#81169_)
* Extended MPS Conv1D operation for NHWC format (_#83121_)
* Added support for 1D weights in MPS linear layer (_#85752_)
* Added full support for serialization of MPS Tensors (_#79465_)
* Added support for 1D bias in MPS operation `torch.addmm `(_#81519_)
* Added torch dispatch stub code for MPS backend (_#82612_)
* Use convenience helper function `dispatch1DJob` for MPS native implementations (_#82982_)
* Enabled support in MPS for `torch.adaptive_avgpool_2d` for larger output sizes (_#85726_)
* Extended support in MPS for `torch.constant_pad_nd` for 4D+ padding (_#85991_)

## Profiler

* Propagate metadata into `Engine::evaluate_function` event. (#77696)
* Switched to nanoseconds for Result's internal representation (#77697)
* Made profiler table column widths changeable via arguments (#85203)

## Vulkan

* Enabled higher dimensional input in `torch.nn.linear` (#81773)
* Vulkan tensor views now infer the dim size when -1 is provided as input (#81668)
* Vulkan prepacked op contexts will now release the deserialized CPU tensors from memory upon construction (#83587)
* Vulkan shader codegen is now Windows compatible (#85241)

## Mobile

* Allowed tracing multiple input models at once (#84833)
* Leaky `relu` in metal shader (#78544)
* Added detailed error message for iOS test (#79140)
* Removed code duplication and refactored (#79184)
* Optionally run fbgemm in tracer (#83531)
* Added hardshrink op to metal backend (#82224)
* New flatbuffer_loader functions that do not depend on flatbuffers.h (#82618)
* Added `max_pool2d`, `linear`, `conv2d` FP32 operator tests for XNNPACK (#83131)
* Removed flatbuffer types/headers from flatbuffer_serializer[_jit].h (#82619)
* Migrated remaining pytorch code to use new flatbuffer_loader.h APIs (#82620)
* Remove flatbuffer types/headers from flatbuffer_loader.h (#82893)
* Use flatbuffer of alternate namespace (#82952)
* Hide flatbuffer build dependencies (#82953)
* Renamed flatbuffer_all to flatbuffers_jit (#82826)
* Renamed flatbuffer_serializer to *_mobile or* _full_jit  (#82827)
* Created flatbuffers_mobile (#82828)
* Added API for profiling backend memory events for Edge CPU profiler (#80350)
* Switched mobile targets to flatbuffers_mobile (#82829)
* Added an option to avoid adding base ops to static op library for Edge (#84360)
* Fixed load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)
* Remove unused field 'order_' in nnapi.h (#84067)

## Distributed

#### `Distributed(c10d)`

* c10d API improvements:
    * Introduced util functions in c10d `get_local_rank`, `get_global_rank` and `get_global_ranks` (#82134, #84363)
    * Replaced internal API `_all_gather_base` with a public API `all_gather_into_tensor` (#85686); see the sketch after this list
    * Replaced internal API `_reduce_scatter_base` with a public API `reduce_scatter_tensor` (#85867)
* Improvements to c10d error messages:
    * Added `ncclGetLastError` (#83724, #85825, #85850)
    * Added closing parentheses to the CollectiveFingerprint (#79723)
    * Added tensor deserializer and included rank and collective type to the error messages (#79724)
    * Adopted `ncclRemoteError` (#85887)
* Passed group ranks and options to third party distributed backends (#73164)
* Enabled NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG is set to DETAIL (#83881)
* Added a soft error handling mode `NCCL_ASYNC_ERROR_HANDLING=2` that does not crash the process (#84386)
* Upgraded NCCL to 2.14.3 (#85367)
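A minimal sketch of the new public `all_gather_into_tensor` API mentioned above (assumes the script is launched with `torchrun` so a default process group can be initialized; the gloo backend is used here only for illustration):

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.full((2,), float(rank))
out = torch.empty(2 * world)

# public replacement for the internal _all_gather_base
dist.all_gather_into_tensor(out, t)
```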

#### `Distributed Optimizer`

* Added functionality to save and restore the step counter for the model averager in PostLocalSGDOptimizer (#78988)

#### `DistributedDataParallel`

* Enabled the static graph to print unused parameters in debug mode for DDP. (#81929)
* Enabled stateful PowerSGD communication hook now can be saved and reloaded to resume training (#79334)

#### `FullyShardedDataParallel`

* Allowed different `optim_input` orders across ranks (#78599)
* Added profiling range for FSDP.backward (#78479)
* Enabled NamedTuple support for FSDP (#83055)
* Added FSDP communication hook interface for NO_SHARD strategy (#79833)
* Moved the `sharded_state_dict` logic to the post hook to avoid OOM (#82613)
* Added ability to iterate through dataclasses in fsdp.utils (#82638)
* Enabled passing kwargs to load_state_dict (#83309)
* Used `_init_from_local_tensor` to create ShardedTensor to avoid communication overhead (#82911)
* Added communication hook for sharded strategies (#83254)
* Changed to print exec order only in debug mode (#83868)
* Ensured that all ranks use the same order to iterate through optimizer states (#84654)
* Optimizer states may be on CPU, copied them to GPU before gathering (#84708)
* Handled the `state_dict` on CPU cases (#85640)
* Add `FSDPExtensions` for TP support (#85039)
* Ignored buffers that are non-persistent. (#85740)
* Delayed moving tensor to CPU until necessary for optim_state_dict() (#85761)
* Dequeue one event instead of flushing for rate limit (#86165)

#### `torch.distributed.elastic`

* Implemented a named pipe based watchdog timer (#83695)

## Infra (RelEng)

* Consolidated all python targets in the tools folder (#80408)
* Improved ios simulator test in CI (#80459)
* Add functorch testing shard in CI (#81283)
* Added functorch shards for windows CI (#82161)
* Added functorch shard for mac x86 tests, linux cu102 tests (#82000)
* Added CI workflow to build official docker images with multiarch (#83437)
* Sharded `trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default` from 2 -> 4 (#83424)
* Migrated workflows from 18.04 to 22.04 (#83861)



# Bug fixes

## Python API

* Fixed `dim` out of range check for `logcumsumexp` on CUDA when the source tensor is empty (#78284)
* Added missing `__init__.py` for `torch.utils.jit` (#78629)
* Fixed backward crash for `gather` with an empty index tensor when `sparse_grad=True` (#78698)
* Added type annotations to `torch.distributions.kl_divergence` (#78432)
* Fixed erroneous inclusion of `end` in the output of `torch.arange` for some inputs (#80758)
* Fixed `torch.distributions.Transform` to be pickle-able (#81707)
* Added check that `self` and `mask` are on the same device for `torch.masked_fill` (#82737)
* Fixed potential ref cycle creation in `torch.utils.checkpoint` (#82776)
* Fixed `Tensor.__hash__` for Tensor subclasses (#83174)
* Fixed `torch.cat` for 0-dim tensors with different dtypes (#83391)
* Fixed `torch.equal` on CPU when inputs have different dtypes (#83350)
* Fixed data-dependent shapes in `torch.distributions.{HalfCauchy, HalfNormal}` (#84322)
* Added check that the size of the last dimension of `tau` is less than or equal to that of `input` in `torch.ormqr`  (#85278)
* Added check that `weights` is a 1D tensor in `torch.bincount` (#85881)
* Fixed segfault for `out` arguments that have a large number of dims (#85294)
* Fixed comparison ops with scalar arguments by removing overflow check (#78881)
* Normalized `torch.utils.dlpack` strides to 1 where size of corresponding dimensions < 2 (#83158)
* Added a check in `torch.empty_strided` that `sizes` has the same dimensionality as `strides` (#82422)
* Fixed `torch.istft` default output length to prevent trimming of last element (#80031)

## C++ API

* Fixed missing antialiasing path to the interpolation for bicubic mode (#84599)
* Added `IListRefTag::Materialized` to `IListRefIterator` destructor. (#85467)
* Fixed `im2col` by adding a check that `pad_width` and `pad_height` are non-negative (#85541)
* Fixed `check_compiler_ok_for_platform` on non-English locales in `torch.utils.cpp_extension` (#85891)

## Autograd

* Corrected the forward AD formula of `torch.sgn` which fixed forward-over-backward for `torch.linalg.svd `and other spectral decompositions, and `torch.norm`, `torch.linalg.{norm, matrix_norm}`(#80082)
* Fixed derivatives of convolution overridable backward (#80840)
* Updated setting non-float non-complex values for forward AD dual tensor to properly error(#78361)
* Fixed forward AD to not set tangent as-is in some situations (#79664, #79653)
* Fixed cpp hooks, retains grad, and `backward(inputs=)` behavior in-place (#79996)
* Relaxed storage layout checks for forward AD when zero-numel tensor (#81055)
* Fixed leak when `create_graph=True` and full backward hook registered (#82788)
* Fixed view and in-place interaction when grad_fn is first accessed in no-grad mode (#83872)
* Updated backward of `torch.stack` to correctly handle implicit real->complex casting (#84993)
* Fixed gradients for `torch.nn.functional.{leaky_relu, threshold}` when inplace=True (#85634)
* Corrected autocasting behavior in  `torch.utils.checkpoint` when use_reentrant=False (#81766)
* Fixed gradcheck when outputs that don't require grad precede those that do (#77743)
* Fixed backward and double backward for `nn.functional.binary_cross_entropy_with_logits` (#80083)
* Fixed derivatives of `norm(p=inf)` (#78105)
* Fixed forward AD when conj-ness of the primal and tangent of the dual tensor do not match (#78358)

## Build

* Use C++17 for RocksDB 7 header. (#75741)
* Fixed Windows builds with _DEBUG flag (bbe8d019f2)
* Pass WITH_BLAS option from environment to CMake (#78037)
* Remove `-Wno-unused-but-set-variable` for clang 13.0.0 (#79666)
* Fixed variable typo for USE_SYSTEM_PYBIND11. (#80272)
* Fixed compilation errors during build with clang13 (#80916)
* Added missing -fexceptions flags during PyTorch build (#81394)
* Fixed CMake dev warning (#81580)
* Fixed false positive AVX, AVX2 and AVX512 detection with MSVC (#82554)
* Fixed NCCL detection issues of the Gloo library (#82773)
* Fixed objcopy version detection in NCCL cmake process (#82774)
* Fixed build error by changing COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT in cmake file (#83716)
* Set default value for NCCL make to MAX_JOBS if ProcessorCount returns 0 (#84231)
* Fixed intermittent link errors in NCCL build (#84245)
* Deleted `torch._dl` extension (#84361)
* Used unified source file list for BUCK build (#84770)

## Complex

* Fixed the derivative of `torch.acosh` for complex numbers (#80841).
* Removed unused conjugate kernels for real dtypes (2.2MB reduction in CUDA binary size) (#80374).

## torch.nn

* Fixed `nn.Embedding` ‘s `max_norm` argument when forward mode AD is used (#78560)
* Fixed `nn.ChannelShuffle` when given empty Tensors (#77029)
* Fixed `nn.RReLU` backward on CUDA (#80434)
* Fixed spurious warnings in `torch.nn.parallel.*` APIs (#81476)
* Fixed `nn.Conv2d` fallback implementation for single channel inputs and channels last weight (#82392)
* Fixed segfault in adaptive pooling for specific index values (#84010)
* Fixed type annotation in `nn.Conv{1,2,3}d` for in_channels (#84302)
* Fixed `nn.GeLU` for empty inputs (#84926)
* Fixed correctness issues for `nn.Conv2d` on ARM-based machines (#85711)
* Fixed `nn.ParameterList` printing of Tensors on the “meta” device (#78529)
* Fixed channels-first behavior for `nn.MaxPool3D` on CUDA (#80748)
* Fixed input shape validation for `nn.MaxPool1d` (#85594)
* Fixed `nn.Softmax` for large input tensors (#84182)
* Fixed lower and upper bound checks for `nn.RReLU` (#84996)
* Fixed edge cases in `torch.nn.grad` by calling into the c++ backward kernel directly (#81839)
* Fixed `torch.nn.PixelShuffle` for empty inputs (#86262)
* Fixed consistency of output and input dtypes for `torch.nn.BatchNorm` (#84410)

## torch.optim

* Fixed `optim.SGD` `maximize` flag when `momentum` is involved (#81859)
* Fixed temporary bug where checkpoints from optimizers created with older PyTorch version could not be loaded (#83588)
* Fixed memory leak in `optim.lr_scheduler.CyclicLR` (#85462)
* Fixed initialization of `lr` in `optim.lr_scheduler.SequentialLR`  (#72856)

## BetterTransformer

* Cleaned up native transformer implementation (#78265)
* Added fastpath test for mask check flag (#82999)
* Added check for contiguous well-formed mask (#79927)
* Introduced mask contiguity check function (#79186)
* Fixed issue in softmax.cu with transformer error when mask `seqlen > 1024`  (#83639)
* Disabled Transformer/MHA fast path when autocast is enabled (#84722)
* Moved odd `num_head` in TransformerEncoder to `slow_path` (#83483)

## Composability

* Fixed `__torch_function__` bug in getindex that causes an error not set exception (#78781)
* Fixed `__torch_dispatch__` usage with inplace views (#79902)

## Dataloader

* Fixed `NoneType` object has no attribute `python_exit_status` when `DataLoader` exits (#83985)

## Functorch

* `functorch.grad`: fixed silent correctness issue from calling a view operation on a captured tensor followed by an in-place operation (#85374)
* `functorch.jacrev`, `functorch.jacfwd`: fixed loud in-place errors when passing in inputs to the transforms and mutating them (#84914, #84915)
* `functorch.vmap`: Fixed support for in-place view operations (`Tensor.unsqueeze_`, `Tensor.transpose_`, `Tensor.t_`, `Tensor.squeeze_`) (#82899, #82903, #82972)
* `functorch.vmap`: added an error on incorrect `weight` shape to `torch.nn.functional.prelu` (#83106)
* `functorch.vmap`: fixed support for multinomial (#83838)
* `functorch.vmap`: fixed incorrect support for `conv_transpose` with `groups > 1` (#84938)
* Fixed `vmap` x `vjp` x `vjp` composition for `torch.nn.functional.prelu` (#84939)
* Fixed printing tensors that are not being transformed over inside functorch transforms (#85556)
* Disallowed saved tensor hooks in functorch transforms to avoid silently incorrect behavior(#85972)
* Fixed `cross` to match unbatched behavior (#86926)

## LinAlg

* Strengthen the preconditions of `linalg.cross` (_#83798_)
* Fix memory issues in `linalg.lstsq` (_#85357_)
* Fix `linalg.lu_solve`/`torch.lu_unpack` to prevent bad memory usage on CPU (_#85922_)
* Preserve the dim of the input in `matrix_exp`. (_#81330_)

## Sparse

* Fixed COO Tensors with less than two non-zero elements to always be marked coalesced. (#82426, #82085)
* Fixed CUDA kernel launch misconfiguration for `mul` on tiny COO tensors (#80254)
* Fixed silent type promotion bug in `select` when given all-zero integer COO tensors (#82215)
* Fixed CUDA kernel coverage on 0-sized dense inputs for `torch.sparse.sampled_addmm` (#85194)

## torch.fx

* Fixed bug where curly brackets were not properly escaped in FxGraphDrawer (#83604)
* Fixed torch.fx.wrap to use the callable `function.__name__` rather than `function.__code__.co_name` (#84373)
* Added strictness check and made tensors into leaves if input tensors were leaves (#77474)
* Used getattr_recursive instead of getattr when splitting (#80011)
* Stopped ProxyTensor from turning aten::lift tensors into proxy objects (#81024)
* Fixed named_modules to be subscriptable (#81258)
* Fixed `to_folder` by adding custom_builtins to dump (#81433)
* Correctly unpacked constants when used in multi-return output (#82568)
* Replaced module name for torch.ops (#82395)
* Removed unnecessary `import warnings` (#82760)
* Don't constant propagate through nondeterministic functions (#83650)
* Don't extract tensor metadata from sparse tensors (#83669)
* Skipped folding side-effectful functions (#84016)
* Fixed make_fx issue by introducing get_attr into symbolic tracing (#84011)
* Disabled autocast cache during aotdispatch (#84035)
* Modified split_by_tags to retain output order (#84136)
* Made NormalizeArgs preserve node type (#85637)
* Fixed PyTree unpacking carrying forward type annotations (#81906)

## JIT

* Fixed conv-batchnorm folding for previously-broken datatype inputs during JIT freezing (#78241)
* Fixed lightweight dispatch OOM error by introducing selective build (#79215)
* Used signed integers in `CalculatedNecessaryArgs` to avoid underflow with schemas where all args have defaults. (#79331)
* Fixed indexing into a tensor with a tuple (#79335)
* Propagate `map_location` arg to `torch.jit.load` in `torch.load` (#78733)
* Improved JIT autodiff heuristics for determining whether outputs require gradients (#78392, #79498)
* Used streams for `import_ir_module` for pickle case to reduce memory usage (#80131)
* Added scripting support for "start" kwarg in `enumerate()`  (#80585)
* Turned off arc in CoreML backend, because throwing exceptions in arc code leaks memory (#79928)
* Suppressed virtual-dtor check on llvm_jit to fix NNC build (#81449)
* Fixed annotation extraction for python 3.10 (#81334) (#81334, #81506)
* Fixed `std::out_of_range` when using NNC and `ConstantChunk` input shapes are unknown (#82698)
* Limits constant chunk propagation for pw-node-only in NVFuser (#83083)
* When encountering dynamic types, one should cast it recursively. (#83218)
* Fixed handling of empty dim list in `sum_mean_dim` symbolic shape fn (#83357)
* Check existence of the array ref when tracing `resize_` to avoid `_MapBase::at runtime` error (#81422)
* Fixed `define_constant` pybind signature to match `std::complex` scalar in NVFuser (#83684)
* Cast to signed char to fix aarch64 build (#84429)
* Support `torch.ScriptObject` in `torch::jit::as_object` (#84398)
* NVFuser torchbench patch to take nvprim fallback when no cuda tensors are provided as inputs (#84411)
* Fixed coreml gpu flag not set (#84725)
* Print the real type for function schema arguments (#85103)
* Fixed `torch.jit.trace` check that was causing tracing to fail for MPS inputs (#84850)
* Throw an error instead of segfaulting when passing `None` to futures (#85304)
* Cherry pick sorting patch for NVFuser fusion segmented (#85620)
* Support freezing modules that don't have a forward method (#85779)

## Quantization

* Added channel axis bound checking in `fused_moving_avg_obs_fake_quant_*` (#78148)
* Disable use of qnnpack with `ceil_mode` of the `avgpool` op (#79028)
* Improve subpackage import in `torch.nn.quantized` (#84141)
* Fix segmentation fault in `QTensor.choose_qparams_optimized` (#85552)
* Enhance the `_rebuild_qtensor` function to support device types other than CPU (#78234)
* Fix `at::from_blob_quantized_per_tensor_affine` strides calculation (#79314)
* Fix embedding quantization issue when memory format is not `contiguous` (#82605)
* Fix dispatch declaration bug about quantized op (#83649)
* Moved the order of x86 engine to avoid changing the default qengine (#86631)

## ONNX

* Fixed `aten::mul` with Boolean inputs (#81671)
* Fixed `add` and `sub` for non-tensor inputs (#81736)
* Fixed `RReLU` eval mode behavior (#82678)
* Fixed onnx optional node type in for/if block (#83599)
* Fixed `Interpolate`: use `half_pixel` instead of `pytorch_half_pixel`. (#80003)
* Fixed `argmin` and `argmax` edge case consistency with PyTorch. (#79503)
* Shape Type Inference and Propagation
* Fixed shape inconsistency when exporting scalar `log2` (#78701)
* Fixed inconsistent `rand` dtype (#79193)
* Fixed linalg `norm` output's shapes and dtypes (#79506)
* Fixed `any` and `all` outputs' shape (#79371)
* Fixed `prelu` output's shape (#79846)
* Fixed onnx logical functions' dtype (#79339)
* Fixed `hardshrink` and `softshrink` output's shape (#79695)
* Fixed quantization outputs' dtype (#79690)
* Fixed reduce node shape inference (#85765)
* Fixed bug using `std::copy_if` (#80999)
* Fixed default function value in `_optimize_graph` (#83996)
* Fixed constant folding unexpectedly adding folded constant as initializer (#79552)
* Fixed autograd subgraph recording with nested graphs (#82852)
* Disabled autocast cache in exporter (#84219)
* Removed static None graph output (#82623)
* Fixed float point detection for optional tensor (with unknown rank) within a list (#81386)
* Support `device().type()` string comparison with constant (#86168)
* Fixed `scalar_type_analysis` metadata for copied constant (#86716)
* Fixed triu/tril export with diagonal input (#86843)
* Ignore `print(Tensor)` during tracing (#86223)
* Updated training state logic to support ScriptedModule (#86745)

## AMD

* Fixed memory cross-border access on the ROCM platform (#76100)
* Set nvfuser default to disabled (#86369)

## CUDA

* Fix how we handle host memory in CUDA `getDeviceFromPtr` (#76902)
* Only sync CUDA if the operation is run on GPU (#80328)
* Do not use `thrust::lower_bound` on device (#80746)
* Fix `set_requires_cuda_init` (#81183)
* Fix behaviour of index_add / atomicAdd(bool,bool) (#85100)
* Fix IMA for topk (#83042)
* Use `opmath_t` for activation functions in Activation.cu (#77949)
* Fixed the invalid configuration argument error when running layer norm backward (#80893)
* Support non-standard bools in CUDA unique (#79392)
* Accept non-standard bools in more CUDA kernels (#78957)
* Fix cuda-mode and add more tests (#81898)
* Clear autocast amp cache in CUDA Graphs (#81896)
* Properly compute `batch_element_count` in `warp_softmax`  (#82927)
* Disabled autocast cache in torch.cuda.make_graphed_callables (#84289)
* Store RNG seed for CUDA graphs (#84967)
* Assert `lambda >= 0` in poisson distribution cuda kernel (#85906)
* Work-around 32-bit indexing failures in cuDNN batchnorm (#87861)
* Fixed 3d convolution_add_relu in V8 (#85055)

## Intel

* Fixed bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
* Fixed oneDNN channels_last path issue (#83653)
* Fixed torch.config can't respect USE_MKLDNN flag issue (#75001)
* Made the data types of output and input consistent for batchnorm (#86784)
* Fixed the issue that cat result would be incorrect for channels-last (#85076)
* Fixed the performance issue where the for-loop before ExternalCall could not be parallelized (#85056)
* Fixed another performance issue with the for-loop before ExternalCall (#86516)

## MPS

* Fixed MPS operator torch.full for boolean types (#82575)
* Extended MPS unary operators to handle empty tensors as a no-op (#82650)
* Fixed MPS operator `torch.scatter` for boolean types (#82685)
* Fixed MPS operator `torch.cat` for boolean inputs (#81480)
* Fixed typo in MPS allocator (#83465)
* Fixed MPS operator torch.full to handle uint8 types (#83697)
* Fixed creation of `MPS::Placeholder` behavior for transposed view operations (#85689)
* Fixed handling of output shape for empty inputs to binary ops in MPS backend (#85836)
* Added support for handling scalar inputs to MPS operations of `torch.scatter` and `torch.gather` (#85842)
* Support for handling compatible inputs to MPS operation of torch.where (#85946)
* Added support for inputs with datatypes Short, Byte & Char to torch.dot MPS operation by casting to int32 when needed (#86140)
* Remove incorrect asserts in MPS backend from Copy.mm file (#86184)
* Added support for handling of 1D inputs for MPS operation `torch.nll_loss` (#81290)
* Get correct size of the view tensor when copying from cpu to mps device (#81730)
* Fix issues exposed in MPS testConsistency tests. The fix includes correct handling of types in smooth l1 loss, 0 dimensions for torch.repeat and empty inputs for torch.cat operations (#81735)
* Handled integer inputs for the MPS linear layer by returning an unsupported-data-type error (#82183)
* Workaround int8 datatype outputs in MPS for View operations (gather) by casting it to int8 (#82315)
* Improve handling of empty outputs and fix MPS linear layer’s handling of transposed Tensors in test consistency (#83124)
* Fixed handling of conv1D and conv2D MPS operations with non-matching strides/paddings (#83522)
* Fixed handling of MPS::Placeholder when View operation is missing gather graph (#83744)
* Fixed the index handling in MPS for torch.constant_pad_nd operations with single-dimension input (#83745)
* Handle casting for MPS torch.div operation in case of type mismatch (#84742)
* Fix device (MPS) to host (cpu) copy by casting from a smaller dtype to a bigger dtype (#84928)
* Ensure as_strided_tensorimpl is never called with MPS (#85020)
* Fixed integer rounding crash in torch.div MPS operation on M1 (#85016)
* Fixed crash in MPS bitwise ops on Mac x86 platforms. (#85285)
* Fixed crash in MPS Conv1d backward operation for NHWC (#85283)
* Added support for scalar edge cases in MPS reduction operations (#83743)
* Fixed memory corruption in torch.var operation for MPS (#85571)
* Fixed memory leaks in MPS that cause the MTLBuffers not to be released and cause OOM (#85661)
* Fix test consistency error in MPS due to type mismatch between int8 and uint8 types (#85666)
* Fixed shape issues for torch.clamp op in MPS (#85673)
* Fixed handling of TensorBase shapes for view ops in MPS for case of multiple slices on a Tensor (#85934)
* Fix the dimension of padding to match the input's dimension for MPS Pad operations (#85990)
* Fix non-contiguous to contiguous copy of MPS tensors (#86056)
* Remove `std::cout` from MPS `multinomial` operation (#86246)
* Do not dispatch empty job in bitwise_not (#87286)
* Made copy from CPU always add storageOffset (#86958)
* Revamped `copy_to_mps_` implementation (#86956)

## Package

* Added fix for implicit numpy dependency (#78979)
* Allowed torch._C to be recognized as a module in torch.package (#80917)
* Ignore return value of function declared with 'warn_unused_result' for torch::deploy (#84862)
* Removed torch::deploy from pytorch (#85953)

## Profiler

* Fixed build failure in python 3.10 (#81812)
* Pop `KinetoThreadLocalState` at the start of post processing. (#77996)
* Fixed record function inputs_valid_ check (#78002)
* Weakened ordering check during post processing. (#78563)
* Fixed Python parent id (#79356)
* GIL acquire needed in ValueCache::trimPrefixes (#81061)
* Added ephemeral inputs to the value cache. (#81958)
* Fixed profiling with record_shapes=True and nested tensor (#82854)
* Properly reset execution graph data when removing callback registration (#82910)
* Solved two syntax issues when dumping execution graph result to json file. (#81854)
* Set end time on python events when profiling stops. (#83621)
* Don't try to collect strides for non-strided tensors (#83935)
* Add null handling to `AppendOnlyList::copy` memcpy path. (#83963)
* Add quoted metadata API to remove empty trace cpu_op metadata (#84128)
* Make `RecordQueue` manage the lifetime of `PythonTracer`. (#83964)
* Don't assign in AppendOnlyList::emplace_back (#85716)
* Fixed traversal utility (#85717)
* Fixed python object reference counting (#85847)

## Visualization

* Removed dependency on `torch.onnx` in `graph` (#82628)
* Updated `Image.ANTIALIAS` to `Image.Resampling.LANCZOS` in summary (#85679)

## Vulkan

* Fixed the `aten::cat` operator registration (#78806)
* Fixed a bug in GRU where incorrect behaviour was being observed when `H_in != H_out` (#78945)
* Fixed a possible null pointer dereference in the `aten::mm` operator when passing an empty bias (#79701)
* Code under `ATen/native/vulkan/api` was essentially rewritten (more details below) and as a result of these refactors, it is now possible to concurrently execute multiple Vulkan models due to correct synchronization when recording to a Vulkan command buffer (#80959)

## Mobile

* Moved saving storage to the last step. (#78024)
* Fixed build For Model Tracer (#84755)
* Skip TestNNAPI tests if QNNPACK is not supported (#82882)
* Extended LinearPackedParamsBase `__getstate__`/`__setstate__` deadline in `check_forward_backward_compatibility.py` Allowlist (#81135)
* Removed LinearPackedParamsBase `__getstate__`/`__setstate__` from `check_forward_backward_compatibility.py` Allowlist (#81048)
* Fixed `ao::sparse::BCSR` missing in qlinear serialize and deserialize when USE_FBGEMM and USE_PYTORCH_QNNPACK are not set (#81256)
* Updated `model_ops.yaml` (#82444)
* Fixed signed/unsigned compare for Metal (#86068)
* Re-added benchmarking files to ios TestApp (#85539)

## Distributed

#### `Distributed(c10d)`

* Ensured tensors are contiguous for autograd enabled `all_gather`. (#79747)
* Fixed data race condition of `batch_isend_irecv` (#82450)
* Fixed `distributed_test.py` flakiness by turning off async_error_handling (#78797)
* Reenabled `isinstance` with `torch.distributed.ReduceOp` (#87303)

#### `DistributedDataParallel`

* Enabled `AllReduceCommHook` to accept `intrusive_ptr` (#80975)

#### `FullyShardedDataParallel`

* Fixed `full_optim_state_dict()` hang (#80712)
* Fixed exec order validation for ignored modules across ranks (#79533)
* Cleaned prefixes when searching for params / buffers to ignore (#78278)
* Returned the original module when an FSDP-wrapped model calls `.module` (#78671)
* Fixed a small bug of pre_backward_hook params prefetch (#78851)
* Fixed param name prefixes for ignored modules (#79955)
* Fixed FSDP when not all outputs get gradient in backward (#80245)
* Fixed the MP config not being passed to FSDP (#80869)
* Fixed FSDP device_id when CPU offloading (#82892)
* Fixed FSDP not all outputs used in loss (#83195)
* Fixed the FQN not found issue for load sharded_state_dict when using activation checkpoint (#84253)
* Fixed `pin_memory()` for CPU offloading (#85048)
* Fixed memory regression! (#85087)
* Implemented a short-term fix to remove `optim_input` (#84201)

#### `torch.distributed.elastic`

* Ensured that the exit code is propagated from the child to the parent process (#81408)

#### `torch.distributed.rpc`

* Only initialize CUDA if there are devices specified in `init_rpc` (#80180)
* Fixed the wrong usage of `RRefContext::handleException` by having a new API `RRefContext::handleExceptionSilent` (#83166)
* Changed to avoid initializing storage for empty Optionals (#78947)

## Infra (RelEng)

* Made bazel changes to make “bazel query ...” work (#78870)
* Fixed C API to be compatible with latest Python 3.11 beta (Please note that 3.11 binaries are not fully functional)  (#81242)

# Performance

## Python API

* Fixed use of temporary buffers for tensors in `torch.save`. (#80404)
* Fixed and improved the efficiency of the backward for `torch.xlog{*}` functions. (#82713)
* Vectorized `.copy()` acting between different dtypes on CPU (#80905)
* Vectorized `bfloat16` conversions on CPU (#80906)

## Autograd

* Codegened autograd nodes are now smarter about which gradients to compute (#82544)
* Made the derivative of masked_fill more efficient (#83515)
* `torch.where` no longer materializes a zero-filled tensor in its backward (#83043)

## torch.nn

* Speed up `nn.Module` constructor by not calling custom `setattr` (#77098)
* Speed up CPU `nn.BatchNorm` implementation by using `torch.zeros()` directly (#82558)
* Speed up `nn.Module.load_state_dict` (#85743)

## BetterTransformer

* Added `nn.Module` activation support in BetterTransformer (#78394), in addition to functional support, which is not available in TorchScript
* Added mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
* Added a small fastpath test for native multi-head attention (#81432)

## Composability

* Release GIL when doing shared memory copies on Tensors (#85389)
* Some micro-optimizations in `RecordFunction`, the core util used by the profiler (#76266)
* `c10::detail::ReplaceAll`: avoid some unnecessary allocations (#79915)

## Dataloader

* Moved loop content into a function to ensure we don't preserve `Tensor` in `pin_memory` thread (#83595)

## LinAlg

* Simplified and optimized `linalg.solve` (#74046)
* Improved heuristics for `linalg.lu_solve` when B is a matrix (#79838)
* Small optimization of `linalg.cholesky` (#81316)
* Prefer contiguous output from mkldnn_bf16_gemm (#82968)
* CPUBlas: Use mkldnn optimized BFloat16 matmul for gemm (#65840)
* Updated and improved the heuristics for `linalg.lu_solve` (#73878)
* Optimized `linalg.householder_product` backward to be more memory-efficient (#84627)

## Sparse

* Improved `to_sparse_bsr` for batched dense inputs (#83085)
* Improved `to_dense` for CSC (#79635)
* Improved `index_select` performance for COO input on CUDA (#77551)
* Improved `mul(COO, COO)` performance with broadcasting in dense dims. (#83428, #85336)

## JIT

* Improved Core ML load time by loading the CPU model first while asynchronously loading the Core ML model (#80941)
* Improved `torch::jit::as_{module,object}` performance (#84399)
* Replaced `IValue::toString()->string()` with `IValue::toStringRef()` (#85437)

## Quantization

* Allowed contiguous inputs to go through `qcat_nhwc_stub` when dim is the last dimension (#72575)
* Enable qlinear dynamic parallelization with fbgemm (#84033)

## CUDA

* Fixed perf regression introduced in #70943 (#78588)
* Improved small sort performance on CUDA (#79627)
* Use cub::BlockRadixSort to improve medium length sort performance (#79628)
* Increased size limit on calling CublasLt in addmm by 32x (#82922)
* Don't synchronize single element any/all reductions (#84465)
* Added col2im_batched kernel (#84543)
* Exposed fast get_current_stream (#78165)
* Pool cudaEvents in CUDACachingAllocator (#78279)

## Intel

* Optimize the copy of BFloat16 to Float and Float to BFloat16 (#79685)
* Improve performance of oneDNN backend (#84470)
* Optimize softmax backward and logsoftmax backward (#80114)
* Improve sort multi-core perf by adjusting grain_size w.r.t. dim_size (#74897)
* Add fast path of `qmean`/`qstd` for quantized CPU (#80579)
* Use direct memcpy in `qcat` when all the inputs and output share the same scale and zero_point (#71903)
* Vectorize scalar remainder in quantized kernel for normalization (#79673)
* Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057)

## MPS

* Performance improvements for the MPS backend by changing commitAndWait to commit & fixing high memory consumption for View operations. Also improved scalar handling in MPS Allocator (#81951)
* Improved performance for MPS backend by reducing the number of command buffers created and hence CPU overhead. It uses the commitAndContinue feature in MPS (#81338)
* Added direct MPS implementation for constant_pad_nd operation which improved performance as the generic implementation was heavily reliant on View ops which are slow (#82366)
* Removed checks that incur unnecessary syncs for MPS device with tensor.item() (#82505)
* Enabled Graph caching in MPS for torch random ops with Philox engine (#85833)
* Added specialized memory pool for scalar values in MPS which improved performance in torchbench networks (#85817)
* Improved memory usage and performance by utilizing garbage collector and adaptive commit feature in MPS (#86119)

## Profiler

* Optimize getStepCallbacks for common case of no active callbacks for kineto (#77804)
* Use custom AppendOnlyList for op_events to reduce the number of atomic operations (#78643)

## Vulkan

* When waiting on the result of a `VkFence`, busy polling is now used instead of a single call to `VkWaitForFences` with no timeout. This can improve benchmark performance by up to 50% by ensuring that the CPU stays at a high frequency when waiting for work on the GPU to complete (#81470)

## Mobile

* Added compilation_preference & relax_f32_to_f16 APIs (#78758)
* Made flatbuffer loads faster if loading as mobile module. (#78998)
* Stream pkl (#79931)
* Used Apple's Accelerate framework for blas acceleration (#80449)
* Read via FileAdapter when loading files in torch if not flatbuffer for lite_interpreter (#84028, #84296)

# Documentation

## Python API

* Fixed `torch.as_array` documentation formatting (#78485)
* Fixed default value for `storage_offset` in `torch.as_strided` documentation (#78202)
* Removed warning in documentation that `torch.real` is only supported on complex types (#78644)
* Improved reproducibility documentation for the random number generator and `torch.use_deterministic_algorithms` (#78849)
* Fixed example in documentation for serialization (#79454)
* Fixed `torch.linspace` documentation for dtype (#81371)
* Fixed typo in documentation for `torch.distributions.Dirichlet` (#82062)
* Fixed example  in `torch.dist` documentation (#82104)
* Updated `torch.narrow` documentation to reflect that `start` can be a Tensor (#85180)
* Added documentation for `pin_memory` and `layout` arguments to `torch.new_{zeros, ones, full}` (#85605)
* Added documentation for `pin_memory` argument to `torch.{rand, randn}` (#85219),  (#85221)
* Added argument default values to documentation for `torch.rot90` (#85610)
* Removed `out` argument from documentation for `torch.squeeze` (#85222)
* Fixed `torch.log` example (#78776)
* Fixed `torch.argmin` docs for `keepdim` argument (#78888)
* Updated examples in documentation for `torch.use_deterministic_algorithms` (#82003)
* Changed docstring type `callable` to `Callable` for consistency (#82487)
* Added documentation for `torch.narrow_copy` (#84493)
* Improved documentation for `torch.signbit` (#78349)
* Added doc string for `torch.library.Library.impl` (#81047)
* Renamed `_Typed/_UntypedStorage` to `Typed/UntypedStorage` and updated documentation for `torch.storage` (#82438)
* Added documentation for `torch.unflatten()` (#81399)

## Autograd

* Improved autograd custom function docs (#81340)
* Added randomness case to the autograd notes (#78617)

## Complex

* Added a note on CUDA 11.6 (#80363)

## torch.nn

* Fixed docstring and image for `nn.LeakyReLU` (#78508, #79102), `nn.ELU` (#78909), `nn.GRU` (#79380), `nn.Hardswish` (#70993), `nn.GELU` (#85790)
* Fixed docstring for `nn.CrossEntropyLoss` (#79568 and #82538), `nn.MultiMarginLoss` (#84513)
* Fixed high level `nn.init` module doc to reflect that all functions run with `torch.no_grad` (#80882)
* Fixed docstring for `nn.Module.state_dict` (#83104)
* Updated docstring for `scale_factor` in `nn.functional.interpolate` (#80807)

## torch.optim

* Fixed docstring for `optim.lr_scheduler.ChainedScheduler` (#79775)
* Fixed docstring for `optim.swa_utils.SWALR` (#79836)

## Composability

* Fix `MetadataTensor` example in `__torch_function__` docs (#78073, #78707)

## Functorch

* Fixed the model description in the functorch ensembling notebook (#83603)
* Fixed indentation in functorch limitations docs (#85346)
* Updated functorch installation instructions (#85854)
* Fixed functorch whirlwind tour notebook to be runnable (#85974)
* Documented new installation instructions for functorch (#86823)

## LinAlg

* Improve `torch.lu_unpack` docs (#77635)
* Fix inconsistent default `rcond` value (#82887)

## Sparse

* Updated `scatter_add_` documentation to fix argument name (#80223)
* Updated `torch.sparse` docs to better cover CSR/CSC/BSR/BSC (#82108)
* Added torch.sparse overview section (#85265)
* Updated documentation for `mm` family ops and `F.linear` to note limited sparse support (#86220)

## torch.fx

* Fixed decomposition example (#79807)
* Added `__all__` to various submodules in torch.fx, distributions, distributed, package (#80367)
* Added warning about DCE in FX being unsound with mutation (#81818)

## Quantization

* Replace `qconfig_dict` with `QConfigMapping` in docs (#78533)
* Corrected typo in quantization docs (#81687)
* Additional fixes for `quantize_fx` docs (#84587)
* Add example for the error message for fixed qparam ops (#84666)
* Add types for scale and zero_point tensor for `torch.fake_quantize_per_channel_affine` docs (#85733)
* Updated quantization docs to show per channel support for `conv1d` (#81349)
* Add `torch.ao.nn.quantizable` modules documentation (#79957)
* Add more detailed docs for `torch.ao.quantization.quantize_fx.{prepare_fx|prepare_qat_fx|convert_fx}` (#83132)

## ONNX

* Added a table of unsupported aten operators in the documentation (#84496)

## CUDA

* Fixed jiterator doc format (#78471)
* Use generic amp autocast in example and specified dtype (#79579)
* Fixed small typo in cuda.rst (#84012)
* Added user facing documentation for CSAN (#84689)
* Fixed broken docstring for `set_float32_matmul_precision` (#78949)

## MPS

* Update Persons Of Interest file for MPS (#81757)
* Update backends.rst for MPS (#82525)

## Package

* PackageExporter does not have file_structure (#79948)
* Updated package.rst to not include hermetic claim (#81019)
* Fixed typos in `torch.package` documentation (#82994)
* Fixed typo in torch/package/_mock.py (#84508)

## Distributed

#### `Distributed(c10d)`

* Fixed some links in torch/distributed/CONTRIBUTING.md  (#79855)
* Updated dist.scatter() documentation (#86069)
* Fixed docstring of `scatter_object_list` (#84596)
* Fixed doc string in `reduce_scatter` (#84983)

#### `DistributedDataParallel`

* Corrected the DDP wrap example by removing pg in DDP wrap (#83034)

#### `FullyShardedDataParallel`

* Improved auto wrap policy doc (#78400)
* Corrected comments in FSDP for gradient averaging (#80456)
* Updated `ShardingStrategy` and `_free_full_params()` docs (#80894)
* Added mentioning of `optim_input` to be removed after 1.13 in the BC breakage warning (#85963)

#### `torch.distributed.rpc`

* Updated distributed/CONTRIBUTING.md to remove ProcessGroupAgent references and add test instructions (#78625)

## Infra (RelEng)

* Added some documentation about the stats uploading process for CI (#79504)
* Fixed release doc builds (#79865)
* Updated release.md with release candidate validation steps (#79889)

# Developers

## Autograd

* Added the ability to register a hook to grad_fn with `.register_prehook` (#83226)

## Build

* Modified nccl_dependency to take dev mode (#79169)
* Moved pytorch buck targets to shared build (#79330)
* Added kineto and flatbuffers to OSS BUCK (#79860)
* Updated llvm deps for Buck build (#79919)
* Moved aten targets to shared buck file (#79966)
* Updated buck_setup.sh (#80467)
* Minor fix for shared build (#80739)
* Deleted CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake file (#84007)

## Composability

* `TorchDispatchMode` and `TorchFunctionMode` extension points have been added. They are similar to their `__torch_function__` and `__torch_dispatch__` counterparts, but can be used as context managers that intercept **all** torch operator calls, including factory functions. These APIs are still experimental and aren't quite user facing yet, and we will add more documentation as they are hardened. See [this post](https://dev-discuss.pytorch.org/t/what-and-why-is-torch-dispatch/557) for more details; a short illustrative sketch follows this list. (#78214, #78822, #78847, #84774, #83925, #79143, #77667, #80992, #80995, #80998, #82647, #83372)
* A large amount of hardening to `FakeTensor` and `FakeTensorMode`, a `__torch_dispatch__` style mode that allows you to run shape/dtype/device inference. This is similar to the “meta” device, but fake tensors also faithfully store device metadata, and the logic lives in python. (#77969, #77972, #77971, #78516, #78090, #78836, #78895, #78536, #78677, #78522, #78523, #78972, #79170, #80115, #80193, #80544, #81739, #82281, #82574, #82066, #82449, #82337, #82571, #82593, #82172, #84387, #85065, #82846, #85658, #85759, #85920)
* Added some new tags and beefed up tags support for operators in the dispatcher:
    * Add data_dependent_output tag (#83312)
    * Add nondeterministic tags in tags.yaml and add the nondeterministic_seeded tag to all functions in native_functions.yaml defined as nondeterministic by alias_analysis.cpp (#81440)
    * Allow specifying operator tags when registering an operator to the dispatcher (#79322)
    * add `inplace_view` tag to `resize_()` (#82667)
* Make string serialization of C++ FunctionSchema consistent with torchgen.model.FunctionSchema (#77926)
* Added support for custom namespaces in `torchgen` (#78015, #79733, #81362, #81581)
* Generate kernels for codegen’d `out=` operators (#78626, #81437)
* Added a new alias dispatch key for functional to view op decompositions (#79615)
* Added an env var for dispatcher debug logging (#81846, #82277)
* Fixed printing of DispatchKey in operator not found message (#81637)
* Added test that all BackendComponents are covered by toString (#81713)
* Refactored functionality and backend keys to reduce duplication (#81752)
* Made factory functions `CompositeExplicitAutograd`, so they show up as primitives in `__torch_dispatch__` (#82470)
* Added an `OpOverload.decompose()` API, for running an operator’s decomposition if one exists (#83075)
* Fixed our dispatcher schema parser when parsing tensor list alias annotations (#84005)
* Allowed subclasses of `c10::TensorImpl()` to override non-virtual tensor methods (#84806)
* Made pytorch headers consumable from c++20 code bases (#79985)
* Added meta device support to `_UntypedStorage` and `_TypedStorage` (#78008)
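
As a rough illustration of the `TorchDispatchMode` extension point described at the top of this list, here is a minimal sketch (the API was experimental at the time; the `torch.utils._python_dispatch` import path shown is the one used by current PyTorch):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    """Print every aten op that runs while the mode is active."""
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(f"dispatched: {func}")
        return func(*args, **(kwargs or {}))

with LoggingMode():
    x = torch.randn(2, 2)   # factory functions are intercepted too
    y = (x + x).sum()
```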

## torch.fx

* Added debug statements for small ACC subgraphs elimination (#80117)
* Checked node type before fetching users (#80166)
* Detected ProxyTensor layering violations (#80994)
* Increased stack level for get_attr warning (#81041)
* Preserved a node’s stack trace (#82670, #83050, #83558, #83706, #83960)
* For quantization, removed `WEIGHT_INDEX_DICT` and `BIAS_INDEX_DICT` and replaced with `node_arg_is_weight` and `node_arg_is_bias` (#83263, #83848)
* Asserted that ProxyTensorMode does not accidentally bake in constants (#83297)
* Improvements to FX Minimizer (#83833)
* Ported matmul compositeimplicitautograd impl into core (#85239)
* OpInfo for Slice (#85554)
* Raised errors in fx.Interpreter with Node info (#85810)

## Quantization

* Enabled support for quantized fill of nhwc tensors (#79025)
* Tests for code snippets in quantization docs (#79923)
* Eliminate Named tensor warnings in XNNPACK and QNNPACK (#77762)
* Added earlier termination and improved error message for calling `min` and `max` ops on per channel quantized tensors. (#79036)
* Added warnings to quantized dynamic conv and linear ops when `reduce_range=true` (#79273)
* Add assertions to fix `torch::jit::load` bugs (#79192)
* Optionally clamp weights post quantization (#83438)

## ONNX

* `onnx.verification` Tool to verify exported model discrepancy between sets of inputs (#78323)
* Symbolic function registration is now done via decorators (#84709)
* `g.op` methods now exposed via the GraphContext class (#84728)
* Initial version of diagnostics infrastructure. (#85107)
* Add dtype check in onnx verification (#79263)

## Intel

* Added native impl for group norm on quantized CPU for channels-last inputs (#70520)
* Added `qscheme` check for quantization observer (#80126)
* Added oneDNN graph fuser context API and unittest (#82491)
* Added eltwise OPs for NNC: `mish` and `elu` (#80586)
* Support BF16ImmPtr (#84041)
* Enabled fusion of conv with elementwise OP for NNC (#77157)
* Channels last propagation within NNC fusion group (#76948)
* Lowering function generates the output buffer with the specified stride for NNC (#76529)
* Simplified IfThenElse and CompareSelect within for-loop for NNC (#76793)
* Do not pull in _autocast_* ops into NNC (#85140)

## MPS

* Improve MPS test by extending `test_no_warnings_on_input` to capture any output (#79163)
* Add testcase in test_mps for circular mode in torch.pad (#81455)
* Fixed build warnings while building with MPS on Mac platforms (#83048)
* Add per-op MPS gradient tests and update skips for TestConsistency (#84242)

## Profiler

* New event representation in profiler (#77693, #77694, #77695, #78163, #79173, #81965, #80797, #81319, #81320, #81321, #81322, #80822, #82993)
* Build call tree for profiled events (#77698, #80810)
* Copy rollbear/strong_type to `c10/util` (#78162)
* Collect Layout and expose TensorMetadata (#81155)
* Added support for storing scalar values in profiling (#81843)
* Added support for Device (#82787)
* Added SOFT_ASSERT to gracefully recover from invariant violations (#82689)
* Added support for accessing strides and scalars (#80072)
* Record nn.Module's parameters (#83209)
* Extend Python bindings (#83622)
* Capture storage data pointer (#84276)
* Cleaned up Tensor representation (#85161)
* Compute unique IDs for Tensors (#85162)
* set_class util (part 1 of Record Optimizer) (#84779)
* Tracking Optimizer (part 2 of Record Optimizer) (#84920)
* Optimizer param_groups (part 3 of Record Optimizer) (#85784)
* Optimizer states (part 4 of Record Optimizer) (#85840)
* Extend ID assignment to allocations and frees (#85719)
* Made `name` a property. (#85720)
* Added dtype to `_TensorMetadata` (#85721)
* Updated python binding type annotations (#85722)
* Started moving python bindings out of autograd (#82584)

## Vulkan

* Vulkan operators that use prepacking have switched from individual `OpContext` classes to `PackedContext` classes that inherit from a generic `VulkanOpContext` class, which should reduce boilerplate code when implementing new ops that require prepacking (#78814, #78815, #78816, #78817, #78818, #82730, #83526)
* Code under `ATen/native/vulkan/api` was essentially rewritten to improve code organization and readability. The refactor implements RAII patterns for the classes used to represent Vulkan handles to facilitate proper resource management and re-designed how the `Context` class functions in order to enable concurrent execution of multiple Vulkan models. The stack of PRs containing these refactors can be found at #80699
* Lint is now enforced in the `ATen/native/vulkan` directory (#81390)
* The VulkanMemoryAllocator version used was upgraded to 3.0.1, which now lives under `third_party` (#81472, #83906, #83934)
* Shader layouts are now automatically generated based on the GLSL code (#81715, #81716)

## Distributed

#### `torch.distributed`

* Added `__all__` to torch.distributed and tensorboard submodules (#80444)
* Added `__all__` to torch.{fx, distributed, backends} submodules (#85079)
* Added `__all__` to fx, distributed and cuda submodules (#85080)
* Added `__all__` to torch.distributed, futures, fx, nn, package, benchmark submodules (#80520)
* Added `__all__` to torch.distributed submodules (#80523)
* Eliminated code duplication in distributed rendezvous (#81577)
* Refactored distributed to use absolute header path (#85780)

#### `torch.distributed.elastic`

* Added `__all__` for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
* Fixed macos public bindings failures (#80970)

#### `Distributed(c10d)`

* Logged full rank fingerprint mismatches in ProcessGroupWrapper (#79901)
* Added environment parse function that supports default value (#85563)
* Added host and port to TCPStore pyi definition (#84636)
* Added underlying_store property for PrefixStore (#84640)
* Enabled per-thread ProcessGroup for testing (#84153)
* Moved ProcessGroup::Work into a separate class (#83680)
* Install c10d headers with absolute path (#86257)

## Infra (RelEng)

* Migrated off xenial gcc5.4 from merge rules (#78137)
* Added functionality for rebasebot to rebase onto viable/strict branch (#78276)
* Pinned protobuf version to 3.20.1 in docker CI build (#78369)
* Removed gcc5.4 from docker/build.sh (#78405)
* Removed gcc5.4 jobs from CircleCI config (#78555)
* Added merge rules for “pytorch distributed” module (#78751)
* Added onnx / test to required merge rules (#78790)
* Added userbenchmark support to TorchBench CI (#78794)
* Installed torchdynamo as part of most CI jobs (#79051)
* Removed linux-xenial-py3_7-clang7-asan from merge rules (#79088)
* Ran torchdynamo tests on PyTorch Linux CI (#79099)
* Centralized commit pins in a folder (#79150)
* Moved CUDA flags out of --per_file_copts into the cu_library macro (#79414)
* Moved CI to cuda-11.6 (#79921)
* Enabled pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments (#79898)
* Upgraded pytorch nightly docker python version to 3.8 (#80051)
* Updated Dockerfile to install cmake as part of conda install (#80258)
* Re-enabled vulkan test (#81368)
* Enhanced mergebot with the feature of posting the PR Comment on cancel (#82744)
* Changed nccl build to be single-threaded (#83173)
* Added process for maintaining Build + CI contributors list (#83869)
* Implemented mechanisms to block land checks if the PR hasn't been approved yet (#84239)
* Allowed External Scripts (e.g. vscode) To Discover and Execute unittest Tests (#85584)
* Updated the pinned torchdynamo hash to `6ead5cae0d1234aa64db06fe230ef56e12ec76fe` (#85683)
* Updated the pinned torchvision hash to `d7d90f56117ce0955332846a5f90b8d1346c4c09` (#85776)
* Modified all functions (except factory functions) to support SymInt and update xla hash to `f2b36df6a1a80137eff7644e6d0f4eeb7ff429d6` (#86078)

PyTorch 1.12.1 Release, small bug fix release (2022-08-05)

This release is meant to fix the following issues (regressions / silent correctness):

## Optim
- Remove overly restrictive assert in adam https://github.com/pytorch/pytorch/pull/80222

## Autograd
- Convolution forward over reverse internal asserts in specific case https://github.com/pytorch/pytorch/issues/81111
- 25% Performance regression from v0.1.1 to 0.2.0 when calculating hessian https://github.com/pytorch/pytorch/pull/82504

## Distributed
- Fix distributed store to use add for the counter of DL shared seed  https://github.com/pytorch/pytorch/pull/80348
- Raise proper timeout when sharing the distributed shared seed  https://github.com/pytorch/pytorch/pull/81666

## NN
- Allow register float16 weight_norm on cpu and speed up test  https://github.com/pytorch/pytorch/pull/80600
- Fix weight norm backward bug on CPU when OMP_NUM_THREADS <= 2 https://github.com/pytorch/pytorch/pull/80930 
- Weight_norm is not working with float16 https://github.com/pytorch/pytorch/issues/80599
- New release breaks torch.nn.weight_norm backwards pass and breaks all Wav2Vec2 implementations https://github.com/pytorch/pytorch/issues/80569
- Disable src mask for transformer and multiheadattention fastpath https://github.com/pytorch/pytorch/pull/81277
- Make nn.stateless correctly reset parameters if the forward pass fails  https://github.com/pytorch/pytorch/pull/81262
- torchvision.transforms.functional.rgb_to_grayscale() + torch.nn.Conv2d() don't work on 1080 GPU https://github.com/pytorch/pytorch/issues/81106
- Transformer and CPU path with src_mask raises error with torch 1.12 https://github.com/pytorch/pytorch/issues/81129

## Data Loader
- Locking lower ranks seed recipients https://github.com/pytorch/pytorch/pull/81071

## CUDA
- os.environ["CUDA_VISIBLE_DEVICES"] has no effect https://github.com/pytorch/pytorch/issues/80876
- share_memory() on CUDA tensors no longer no-ops and instead crashes https://github.com/pytorch/pytorch/issues/80733
- [Prims] Unbreak CUDA lazy init https://github.com/pytorch/pytorch/pull/80899
- PyTorch 1.12 cu113 wheels cudnn discoverability issue https://github.com/pytorch/pytorch/issues/80637
- Remove overly restrictive checks for cudagraph https://github.com/pytorch/pytorch/pull/80881

## ONNX
- ONNX cherry picks https://github.com/pytorch/pytorch/pull/82435

## MPS
- MPS cherry picks https://github.com/pytorch/pytorch/issues/80898

## Other
- Don't error if _warned_capturable_if_run_uncaptured not set https://github.com/pytorch/pytorch/pull/80345
- Initializing libiomp5.dylib, but found libomp.dylib already initialized.  https://github.com/pytorch/pytorch/issues/78490
- Assertion error - _dl_shared_seed_recv_cnt - pt 1.12 - multi node  https://github.com/pytorch/pytorch/issues/80845
- Add 3.10 stdlib to torch.package https://github.com/pytorch/pytorch/pull/81261
- CPU-only c++ extension libraries (functorch, torchtext) built against PyTorch wheels are not fully compatible with PyTorch wheels  https://github.com/pytorch/pytorch/issues/80489


PyTorch 1.12: TorchArrow, Functional API for Modules and nvFuser, are now available (2022-06-28)

# PyTorch 1.12 Release Notes

* Highlights
* Backwards Incompatible Change
* New Features
* Improvements
* Performance
* Documentation

# Highlights

We are excited to announce the release of PyTorch 1.12! This release is composed of over 3124 commits from 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, and the FSDP API. We want to sincerely thank our dedicated community for your contributions.

Summary:

* Functional Module API to functionally apply module computation with a given set of parameters
* Complex32 and Complex Convolutions in PyTorch 
* DataPipes from TorchData fully backward compatible with DataLoader 
* Functorch with improved coverage for APIs
* nvFuser a deep learning compiler for PyTorch
* Changes to float32 matrix multiplication precision on Ampere and later CUDA hardware
* TorchArrow, a new beta library for machine learning preprocessing over batch data

# Backwards Incompatible changes

## Python API

**Updated type promotion for `torch.clamp`** ([#77035](https://github.com/pytorch/pytorch/pull/77035))

In 1.11, the ‘min’ and ‘max’ arguments in `torch.clamp` did not participate in type promotion, which made it inconsistent with `minimum` and `maximum` operations. In 1.12, the ‘min’ and ‘max’ arguments participate in type promotion.

1.11

```python
>>> import torch
>>> a = torch.tensor([1., 2., 3., 4.], dtype=torch.float32)
>>> b = torch.tensor([2., 2., 2., 2.], dtype=torch.float64)
>>> c = torch.tensor([3., 3., 3., 3.], dtype=torch.float64)
>>> torch.clamp(a, b, c).dtype
torch.float32
```

1.12

```python
>>> import torch
>>> a = torch.tensor([1., 2., 3., 4.], dtype=torch.float32)
>>> b = torch.tensor([2., 2., 2., 2.], dtype=torch.float64)
>>> c = torch.tensor([3., 3., 3., 3.], dtype=torch.float64)
>>> torch.clamp(a, b, c).dtype
torch.float64
```

## Complex Numbers

### Fix complex type promotion ([#77524](https://github.com/pytorch/pytorch/pull/77524))

Updates the type promotion rule such that, given a complex scalar and a real tensor, the value type of the real tensor is preserved.

1.11

```python
>>> a = torch.randn((2, 2), dtype=torch.float)
>>> b = torch.tensor(1, dtype=torch.cdouble)
>>> (a + b).dtype
torch.complex128
```

1.12

```python
>>> a = torch.randn((2, 2), dtype=torch.float)
>>> b = torch.tensor(1, dtype=torch.cdouble)
>>> (a + b).dtype
torch.complex64
```

## LinAlg

### Disable TF32 for matmul by default and add high-level control of fp32 matmul precision ([#76509](https://github.com/pytorch/pytorch/pull/76509))

PyTorch 1.12 makes the default math mode for fp32 matrix multiplications more precise and consistent across hardware. This may affect users on Ampere or later CUDA devices and TPUs. See the PyTorch [blog](https://dev-discuss.pytorch.org/t/pytorch-and-tensorfloat32/504) for more details. 

## Sparse

### Use ScatterGatherKernel for scatter_reduce (CPU-only) ([#74226](https://github.com/pytorch/pytorch/pull/74226), [#74608](https://github.com/pytorch/pytorch/pull/74608))

In 1.11.0, unlike `scatter` which takes a `reduce` kwarg or `scatter_add`, `scatter_reduce` was not an in-place function. That is, it did not allow the user to pass an output tensor which contains data that is reduced together with the scattered data. Instead, the scatter reduction took place on an output tensor initialized under the hood. Indices of the output that were not scattered to were filled with reduction inits (or 0 for options ‘amin’ and ‘amax’).

In 1.12.0, `scatter_reduce` (which is in beta) is in-place to align with the API of the related existing functions `scatter`/`scatter_add`. For this reason, the argument `input` in 1.11.0 has been renamed `src` in 1.12.0 and the new `self` argument now takes a destination tensor to be scattered onto. Since the destination tensor is no longer initialized under the hood, the `output_size` kwarg in 1.11.0 that allowed users to specify the size of the output at dimension `dim` has been removed. Further, in 1.12.0 we introduce an `include_self` kwarg which determines whether values in the `self` (destination) tensor are included in the reduction. Setting `include_self=True` could, for example, allow users to provide special reduction inits for the scatter_reduction operation. Otherwise, if `include_self=False,` indices scattered to are treated as if they were filled with reduction inits.

In the snippet below, we illustrate how the behavior of `scatter_reduce` in 1.11.0 can be achieved with the function released in 1.12.0.

Example:

```python
>>> src = torch.arange(6, dtype=torch.float).reshape(3, 2)
>>> index = torch.tensor([[0, 2], [1, 1], [0, 0]])
>>> dim = 1
>>> output_size = 4
>>> reduce = "prod"
```

1.11

```python
>>> torch.scatter_reduce(src, dim, index, reduce, output_size=output_size)
tensor([[ 0., 1., 1., 1.],
        [ 1., 6., 1., 1.],
        [20., 1., 1., 1.]])
```

1.12

```python
>>> output_shape = list(src.shape)
>>> output_shape[dim] = output_size
# reduction init for prod is 1
# filling the output with 1 is only necessary if the user wants to preserve the behavior in 1.11
# where indices not scattered to are filled with reduction inits
>>> output = src.new_empty(output_shape).fill_(1)
>>> output.scatter_reduce_(dim, index, src, reduce)
tensor([[ 0., 1., 1., 1.],
        [ 1., 6., 1., 1.],
        [20., 1., 1., 1.]])
```

## torch.nn

### `nn.GroupNorm`: Report an error if `num_channels` is not divisible by `num_groups` ([#74293](https://github.com/pytorch/pytorch/pull/74293))

Previously, `nn.GroupNorm` would error out during the forward pass if `num_channels` is not divisible by `num_groups`. Now, the error is thrown for this case during module construction instead.

1.11

```python
m = torch.nn.GroupNorm(3, 7)
m(...)  # errors during forward pass
```

1.12

```python
m = torch.nn.GroupNorm(3, 7)  # errors during construction
```

### `nn.Dropout2d`: Return to 1.10 behavior: perform 1D channel-wise dropout for 3D inputs

In PyTorch 1.10 and older, passing a 3D input to `nn.Dropout2d` resulted in 1D channel-wise dropout behavior; i.e. such inputs were interpreted as having shape `(N, C, L)` with N = batch size and C = # channels, and channel-wise dropout was performed along the second dimension.

1.10

```python
x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (N, C, L); dropout along the second dim.
```

With the introduction of no-batch-dim input support in 1.11, 3D inputs were reinterpreted as having shape `(C, H, W)`; i.e. an input without a batch dimension, and dropout behavior was changed to drop along the first dimension. This was a silent breaking change.

1.11

```python
x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (C, H, W); dropout along the first dim.
```

The breaking change in 1.11 resulted in a lack of support for 1D channel-wise dropout behavior, so `Dropout2d`  in PyTorch 1.12 returns to 1.10 behavior with a warning to give some time to adapt before the no-batch-dim interpretation goes back into effect.

1.12

```python
x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (N, C, L); dropout along the second dim.
            # throws a warning suggesting nn.Dropout1d for 1D channel-wise dropout.
```

If you want 1D channel-wise dropout behavior, please switch to use of the newly-added `nn.Dropout1d` module instead of `nn.Dropout2d`. If you want no-batch-dim input behavior, please note that while this is not supported in 1.12, a future release will reinstate the interpretation of 3D inputs to `nn.Dropout2d` as those without a batch dimension.

### **`F.cosine_similarity`: Improve numerical stability ([#31378](https://github.com/pytorch/pytorch/pull/31378))**

Previously, we first computed the inner product and then normalized. After this change, we first normalize and then compute the inner product. This should be more numerically stable because it avoids losing precision in the inner product for inputs with large norms. Because of this change, outputs may be different in some cases.
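
The snippet below is only a conceptual sketch of the two orders of computation described above; it is not the actual ATen kernel, and the `eps` handling is simplified.

```python
import torch
import torch.nn.functional as F

def cosine_old(x, y, eps=1e-8):
    # Pre-1.12 order: inner product first, then normalize.
    return (x * y).sum(-1) / (x.norm(dim=-1) * y.norm(dim=-1)).clamp_min(eps)

def cosine_new(x, y, eps=1e-8):
    # 1.12 order: normalize first, then inner product.
    xn = x / x.norm(dim=-1, keepdim=True).clamp_min(eps)
    yn = y / y.norm(dim=-1, keepdim=True).clamp_min(eps)
    return (xn * yn).sum(-1)

x, y = torch.randn(4, 8) * 1e6, torch.randn(4, 8) * 1e6
print(cosine_old(x, y))
print(cosine_new(x, y))
print(F.cosine_similarity(x, y, dim=-1))  # follows the normalize-first order in 1.12+
```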

## Composability

**Functions in torch.ops.aten.{foo} no longer accept `self` as a kwarg**

`torch.ops.aten.{foo}` objects are now instances of `OpOverloadPacket` (instead of a function) that have their `__call__` method in Python, which means that you cannot pass `self` as a kwarg. You can pass it normally as a positional argument instead.

1.11

```python
>>> torch.ops.aten.sin(self=torch.ones(2))
    tensor([0.8415, 0.8415])
```

1.12

```python
# this now fails
>>> torch.ops.aten.sin(self=torch.ones(2))
Traceback (most recent call last):
  File "", line 1, in 
TypeError: __call__() got multiple values for argument 'self'
# this works
>>> torch.ops.aten.sin(torch.ones(2))
tensor([0.8415, 0.8415])
```

**`__torch_dispatch__` now traces individual op overloads instead of op overload packets** ([#72673](https://github.com/pytorch/pytorch/pull/72673))

`torch.ops.aten.add` actually corresponds to a bundle of C++ functions covering all of the overloads of the add operator (specifically, `add.Tensor`, `add.Scalar` and `add.out`). Now, `__torch_dispatch__` will directly take in an overload corresponding to a single aten function.

1.11

```python
class MyTensor(torch.Tensor):
    ....
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Before, func refers to a "packet" of all overloads
        # for a given operator, e.g. "add"
        assert func == torch.ops.aten.add
```

1.12

```python
class MyTensor(torch.Tensor):
    ....
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # After, func refers to an individual operator overload,
        # e.g. "add.Tensor"
        assert func == torch.ops.aten.add.Tensor
        # you can recover the old behavior with func.overloadpacket
        assert func.overloadpacket == torch.ops.aten.add
```

## Profiler

### Disable forward-backward correlation ([#72904](https://github.com/pytorch/pytorch/pull/72904))

The forward-backward correlation is no longer captured, to work around a profiler crash. This feature may be re-enabled in a future release after the underlying issue is fixed.

```python
with torch.profiler.profile() as p:
    loss = model(inputs)
    loss.backward()  # Invoke autograd

# The exported chrome trace will not have forward-backward flow events. (arrows)
p.export_chrome_trace(...)
```

## Mobile

### Remove support for bytecode version 3 ([#57775](https://github.com/pytorch/pytorch/pull/57775))

The minimum supported bytecode version is being bumped from 3 to 4. We no longer support version 3 bytecode models because the bytecode version was bumped from 3 to 4 more than half a year ago, and there was code in operator loading that behaved differently for one operator under global bytecode version 3.

If the model is generated before Oct 5, 2020, please use the following lines to update the model to the latest version:

1.12

```python
import torch
from torch.jit.mobile import _get_model_bytecode_version, _load_for_lite_interpreter

old_model_path = "old_model.ptl"
new_model_path = "new_model.ptl"

# Load full jit model
jit_model = torch.jit.load(old_model_path)
# Save model for mobile 
jit_model._save_for_lite_interpreter(new_model_path)
# Verify the model can be loaded
mobile_m = _load_for_lite_interpreter(new_model_path)

# Get bytecode version from the new model
bytecode_version = _get_model_bytecode_version(new_model_path)
print(f"bytecode version is {bytecode_version}")

```

### Remove redundant FSDP prefix and change default auto wrap policy name to avoid confusion ([#76858](https://github.com/pytorch/pytorch/pull/76858), [#73791](https://github.com/pytorch/pytorch/pull/73791))

`FullyShardedDataParallel`'s optional param `fsdp_auto_wrap_policy` (1.11) was renamed to `auto_wrap_policy` (1.12), and `default_auto_wrap_policy` (1.11) was renamed to `size_based_auto_wrap_policy` (1.12).

In 1.11, wrapping a model with FSDP looked like:

```python
model = MyModel()
wrapped_model = FullyShardedDataParallel(
    model,
    fsdp_auto_wrap_policy=functools.partial(
        default_auto_wrap_policy,
        min_num_params=0,  # wrap all modules
    )
   ...
```

1.12

```python
model = MyModel()
wrapped_model = FullyShardedDataParallel(
    model,
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy,
        min_num_params=0,  # wrap all modules
    )
   ...
```

## Quantization

### TorchScript models exported prior to PyTorch 1.6 using quantized Linear, GRU and LSTM operators will no longer work ([#72680](https://github.com/pytorch/pytorch/pull/72680), [#72522](https://github.com/pytorch/pytorch/pull/72522)) 

TorchScript models created with PyTorch 1.5 or earlier and using the operators `quantized::linear_prepack_legacy`, `linear_prepack_fp16_legacy`, `quantized::linear_unpack.legacy`, or `quantized::linear_unpack_fp16.legacy` will no longer work and need to be re-exported. Please use PyTorch [Quantization](https://pytorch.org/docs/stable/quantization.html) to quantize the Linear module instead.

## ONNX

## Infra (Releng)

* Bump minimum CMake version to 3.13 ([#76312](https://github.com/pytorch/pytorch/pull/76312))

# Deprecations

## Python API

**Deprecated torch.testing.make_non_contiguous** ([#72705](https://github.com/pytorch/pytorch/pull/72705))

`torch.testing.make_non_contiguous` is being deprecated and will be removed in a future release. Depending on the use case, there are different replacement options. If you are using `make_non_contiguous` in the PyTorch test suite, you can use `torch.testing._internal.common_utils.noncontiguous_like`.

1.11

```python
a = torch.randn(1, 2, 3)
torch.testing.make_non_contiguous(a)
```

1.12

```python
a = torch.randn(1, 2, 3)
torch.testing._internal.common_utils.noncontiguous_like(a)
```

If you are using `make_non_contiguous` in combination with a creation function to create a noncontiguous tensor with random values, you can use `make_tensor`.

1.11

```python
a = torch.randn(1, 2, 3)
torch.testing.make_non_contiguous(a)
```

1.12

```python
torch.testing.make_tensor(..., noncontiguous=True)
```

If you are using `make_non_contiguous` with a specific tensor, you can use `torch.repeat_interleave`

1.11

```python
a = torch.tensor([[1., 2.], [1., 2.]])
torch.testing.make_non_contiguous(a)
```

1.12

```python
a = torch.tensor([[1., 2.], [1., 2.]])
torch.repeat_interleave(a, 2, dim=-1)[..., ::2]
```

## Build

## LinAlg

### Deprecate torch.lu ([#73804](https://github.com/pytorch/pytorch/pull/73804))

`torch.lu()` is deprecated in favor of `torch.linalg.lu_factor()` and `torch.linalg.lu_factor_ex()`. `torch.lu()` will be removed in a future PyTorch release. If you were previously using `get_infos=False` (this is the default), you should use `torch.linalg.lu_factor` instead:

1.11

```python
LU, pivots = torch.lu(A, compute_pivots) 
```

1.12

```python
LU, pivots = torch.linalg.lu_factor(A, compute_pivots) 
```

If you were previously using `get_infos=True` you should use `torch.linalg.lu_factor_ex`:

1.11

```python
LU, pivots, info = torch.lu(A, compute_pivots, get_infos=True)
```

1.12

```python
LU, pivots, info = torch.linalg.lu_factor_ex(A, compute_pivots) 
```

### Deprecate torch.lu_solve ([#73806](https://github.com/pytorch/pytorch/pull/73806))

`torch.lu_solve()` is deprecated in favor of `torch.linalg.lu_solve()`. `torch.lu_solve()` will be removed in a future PyTorch release.

1.11

```python
X = torch.lu_solve(B, LU, pivots)
```

1.12

```python
X = torch.linalg.lu_solve(LU, pivots, B) 
```

### Remove deprecated torch.solve ([#70986](https://github.com/pytorch/pytorch/pull/70986))

`torch.solve`, which was deprecated in a previous release, is now removed. You should use `torch.linalg.solve` instead. Note that `torch.linalg.solve` has its arguments reversed and does not return the LU factorization. To get the LU factorization see `torch.lu`, which can be used with `torch.lu_solve` or `torch.lu_unpack`.

1.11

```python
X = torch.solve(B, A).solution
```

1.12

```python
X = torch.linalg.solve(A, B)
```

## torch.nn

### `nn.Module`: Deprecate positional args for `state_dict()` ([#72780](https://github.com/pytorch/pytorch/pull/72780))

`state_dict` can currently be called in two ways: `destination`, `prefix`, and `keep_vars` can be passed as positional arguments, or as kwargs. The ability to do the former is being deprecated and will be removed in a future release. You should pass the arguments in as kwargs only. 
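
For illustration, a minimal sketch (the positional values shown are simply the defaults of `destination`, `prefix`, and `keep_vars`):

1.11

```python
import torch

m = torch.nn.Linear(2, 2)
sd = m.state_dict(None, '', False)  # positional arguments: deprecated
```

1.12

```python
import torch

m = torch.nn.Linear(2, 2)
sd = m.state_dict(prefix='', keep_vars=False)  # pass them as kwargs instead
```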

## Composability

**Deprecated `__torch_function__` as instance method for more functions** ([#74829](https://github.com/pytorch/pytorch/pull/74829))

`__torch_function__` should be defined as a class method. Defining `__torch_function__` as a plain method was already deprecated for the functions handling `__torch_function__` in Python. This change makes that also the case for functions that handle `__torch_function__` in C++.

1.11

```python
class Bad():
    def __torch_function__(self, *args, **kwargs):
        pass
t = Bad()
torch.abs(t)
```

1.12

```python
class Good():
    @classmethod
    def __torch_function__(cls, *args, **kwargs):
        pass
t = Good()
torch.abs(t)
```

## Quantization

### Deprecate `torch.jit.quantized` ([#72690](https://github.com/pytorch/pytorch/pull/72690))

Instead of using functions defined in `torch.jit.quantized`, please use [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html) to dynamically quantize Linear/RNNCell/LSTMCell/GRUCell/LSTM modules. It is supported in both [Eager Mode Quantization](https://pytorch.org/docs/stable/quantization.html#dynamic-quantization) and [FX Graph Mode Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html).

1.11

```python
>>> torch.jit.quantized.QuantizedLSTMCell(...)
```

1.12

```python
>>> torch.jit.quantized.QuantizedLSTMCell(...)
   "torch.jit.QuantizedLSTMCell is deprecated and will be removed in an upcoming
    PyTorch release. Please use the torch.nn.quantized.dynamic.LSTMCell instead."
```

## Infra (Releng)

* Removed CUDA 11.1 binary builds ([#73376](https://github.com/pytorch/pytorch/pull/73376))
* Removed CUDA 11.5 binary builds ([#76257](https://github.com/pytorch/pytorch/pull/76257))

# New features

## Python API

* Added new device `mps` that can be used to leverage GPU acceleration on macOS with Apple silicon (M1) or discrete AMD GPUs ([blogpost with details](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/)); a short usage sketch follows this list.
* Added `torch.special.log_ndtr` ([#74795](https://github.com/pytorch/pytorch/pull/74795))
* Added `torch.distributions.transforms.{SoftplusTransform,CumulativeDistributionTransform}` ([#52300](https://github.com/pytorch/pytorch/pull/52300), [#72495](https://github.com/pytorch/pytorch/pull/72495))
* Promoted `torch.testing` to stable ([#73348](https://github.com/pytorch/pytorch/pull/73348))
* Added `maximize` flag for `optim.Adadelta` ([#75330](https://github.com/pytorch/pytorch/pull/75330))
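
A minimal usage sketch for the new `mps` device (requires a macOS build with MPS support; the availability check shown is the `torch.backends.mps` API):

```python
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.ones(3, device=device)   # tensor lives on the Apple GPU
    print(x * 2)
else:
    print("MPS backend not available on this machine")
```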

## Build

* Distributed torchgen as part of PyTorch package ([#76306](https://github.com/pytorch/pytorch/pull/76306))
* Added BUILD_LAZY_CUDA_LINALG option ([#73447](https://github.com/pytorch/pytorch/pull/73447))
* Introduced an environment variable to change c10 log level ([#71746](https://github.com/pytorch/pytorch/pull/71746))

## Complex Numbers

* Added a new data-type `torch.complex32` to help computing with complex datatype with lower memory usage at the cost of lower precision. Note that this is an experimental feature ([#78245](https://github.com/pytorch/pytorch/pull/78245)) and the major focus in this release was to support operators under `torch.fft` on CUDA.  Besides those operators we have added support and testing for following limited set of ops **(NOTE: few operators are only supported on CUDA)**: `Tensor.copy_, torch.complex, torch.testing.make_tensor, cat, Tensor.fill_, Tensor.item, torch.atleast_1d, torch.atleast_2d, torch.atleast_3d, torch.dsplit, torch.vsplit, torch.hsplit, torch.hstack, torch.dstack, torch.vstack, Tensor.conj, torch.add, torch.sub, torch.mul, torch.sub, torch.div, torch.view, torch.view_as, torch.real, torch.imag, torch.neg, Tensor.__getitem__, torch.sum, torch.prod, torch.abs, torch.sgn, torch.exp, torch.log, torch.eq, torch.masked_fill, torch.index_put, torch.rand, torch.randn, torch.full, torch.empty, torch.ones, torch.zeros, torch.block_diag, Tensor.chunk, Tensor.clone, Tensor.contiguous, torch.diag_embed, torch.diagonal, torch.as_strided, torch.column_stack, Tensor.T, Tensor.H, Tensor.mT, Tensor.mH, Tensor.narrow, torch.isfinite, torch.isinf, torch.isreal, torch.flatten, Tensor.chalf, torch.empty_like, torch.movedim` ( [#73847](https://github.com/pytorch/pytorch/pull/73847), [#74667](https://github.com/pytorch/pytorch/pull/74667), [#74854](https://github.com/pytorch/pytorch/pull/74854), [#75010,](https://github.com/pytorch/pytorch/pull/75010)[#75156](https://github.com/pytorch/pytorch/pull/75156), [#75311](https://github.com/pytorch/pytorch/pull/75311), [#75498](https://github.com/pytorch/pytorch/pull/75498), [#76132](https://github.com/pytorch/pytorch/pull/76132), [#76158](https://github.com/pytorch/pytorch/pull/76158), [#75592](https://github.com/pytorch/pytorch/pull/75592), [#76615](https://github.com/pytorch/pytorch/pull/76615), [#77179](https://github.com/pytorch/pytorch/pull/77179), [#77339](https://github.com/pytorch/pytorch/pull/77339), [#77446](https://github.com/pytorch/pytorch/pull/77446), [#77483](https://github.com/pytorch/pytorch/pull/77483), [#77479](https://github.com/pytorch/pytorch/pull/77479), [#77192](https://github.com/pytorch/pytorch/pull/77192), [#76724](https://github.com/pytorch/pytorch/pull/76724), [#77404](https://github.com/pytorch/pytorch/pull/77404)).
* Operators in `torch.fft` now support tensors with `torch.complex32` dtype (CUDA only) ([#74857](https://github.com/pytorch/pytorch/pull/74857)).
* `torch.complex32` tensors now participate in type promotion ([#76893](https://github.com/pytorch/pytorch/pull/76893))
* Added `torch.chalf` alias for `torch.complex32` and `Tensor.chalf` method ([#75320](https://github.com/pytorch/pytorch/pull/75320)); a short sketch follows this list.
* Added proper print support for `torch.chalf` tensors ([#76614](https://github.com/pytorch/pytorch/pull/76614)).
* Added support for complex convolution (data-types supported: `torch.complex32, torch.complex64, torch.complex128`)
    * `torch.nn.functional.conv1d` and `torch.nn.Conv1d` ([#75310](https://github.com/pytorch/pytorch/pull/75310))
    * `torch.nn.functional.conv2d` and `torch.nn.Conv2d` ([#75412](https://github.com/pytorch/pytorch/pull/75412))
    * `torch.nn.functional.conv3d` and `torch.nn.Conv3d` ([#75581](https://github.com/pytorch/pytorch/pull/75581))
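
A minimal sketch of the new dtype (experimental; per the notes above, some of the listed ops are CUDA-only):

```python
import torch

x = torch.ones(4, dtype=torch.chalf)   # torch.chalf is an alias for torch.complex32
y = x + x                              # torch.add is among the supported ops listed above
print(y.dtype)                         # expected: torch.complex32
```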

## LinAlg

* Added `torch.linalg.ldl_factor_ex` and `torch.linalg.ldl_solve` ([#69828](https://github.com/pytorch/pytorch/pull/69828))
* Added `linalg.vander` ([#76303](https://github.com/pytorch/pytorch/pull/76303))
* Added `linalg.lu` ([#67833](https://github.com/pytorch/pytorch/pull/67833))
* Added `linalg.lu_solve` ([#72935](https://github.com/pytorch/pytorch/pull/72935)); a short usage sketch follows this list
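
A short usage sketch of the new LU entry points (shapes chosen arbitrarily for illustration):

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)

P, L, U = torch.linalg.lu(A)               # full LU decomposition
LU, pivots = torch.linalg.lu_factor(A)     # compact factorization
X = torch.linalg.lu_solve(LU, pivots, B)   # solve A X = B using the factorization
print(torch.allclose(A @ X, B, atol=1e-5))
```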

## Meta API

* Added meta tensor kernels for the following operators (a short illustration follows this list):
    *  `mse_loss` ([#72294](https://github.com/pytorch/pytorch/pull/72294)), `amax` ([#72124](https://github.com/pytorch/pytorch/pull/72124)), `normal` ([#70089](https://github.com/pytorch/pytorch/pull/70089)) `squeeze()` + `unsqueeze` ([#73440](https://github.com/pytorch/pytorch/pull/73440)), `unfold` ([#75717](https://github.com/pytorch/pytorch/pull/75717)), `clamp_min/max`, ([#76926](https://github.com/pytorch/pytorch/pull/76926)), `index_copy` ([#67329](https://github.com/pytorch/pytorch/pull/67329)), `linalg_cross` ([#72413](https://github.com/pytorch/pytorch/pull/72413)), `amin` ([#73581](https://github.com/pytorch/pytorch/pull/73581))
* Enabled the ability to register Python decompositions for operators as meta kernels, get meta support for `where` and `huber_loss` ([#77353](https://github.com/pytorch/pytorch/pull/77353))
* Registered meta functions through Python for `dot/group_norm/instance_norm/var_mean/index_reduce/matmul/bernoulli/adaptive_avg_pool` ([#77499](https://github.com/pytorch/pytorch/pull/77499)), `index_select/abs/min/max` ([#76916](https://github.com/pytorch/pytorch/pull/76916)), `reflection_pad2d` ([#77681](https://github.com/pytorch/pytorch/pull/77681)), `square` ([#77682](https://github.com/pytorch/pytorch/pull/77682)), `log_sigmoid_forward` ([#77739](https://github.com/pytorch/pytorch/pull/77739)), several more ops ([#77362](https://github.com/pytorch/pytorch/pull/77362))
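
For context, a minimal sketch (illustrative, not from the notes) of what a meta kernel buys you: shape and dtype propagation without allocating or computing data, here using `amax` from the list above:

```python
import torch

# Meta tensors carry only metadata (shape, dtype, strides); no storage is allocated.
x = torch.empty(4, 5, device="meta")
y = x.amax(dim=1)          # runs the meta kernel: shape inference only
print(y.shape, y.device)   # torch.Size([4]) meta
```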

## torch.nn

* `nn.Dropout1d`: New module for 1D channel-wise dropout ([#79545](https://github.com/pytorch/pytorch/pull/79545))
* `nn.Module`: Public API for stateless / functional module computation ([#75834](https://github.com/pytorch/pytorch/pull/75834))
* `nn.Module`: Support for hooks that run after state dict loading ([#76823](https://github.com/pytorch/pytorch/pull/76823), [#77392](https://github.com/pytorch/pytorch/pull/77392))
* Added support for tensor subclasses as parameters ([#73459](https://github.com/pytorch/pytorch/pull/73459), [#77655](https://github.com/pytorch/pytorch/pull/77655))
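
A short usage sketch for the new `nn.Dropout1d` module (illustrative; the shapes and probability are arbitrary):

```python
import torch
import torch.nn as nn

# Channel-wise 1D dropout: entire channels of an (N, C, L) input are zeroed at random.
drop = nn.Dropout1d(p=0.5)
x = torch.randn(8, 4, 16)
out = drop(x)   # in training mode, roughly half of the 4 channels are zeroed per sample
```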

## torch.fx

* Core
    * Allowed `Tracer` to record usages of `Buffer`s ([#73612](https://github.com/pytorch/pytorch/pull/73612))
    * Introduced experimental MetaTensorTracer ([#76003](https://github.com/pytorch/pytorch/pull/76003))
    * Gave `Tracer` the ability to trace different forward functions ([#77502](https://github.com/pytorch/pytorch/pull/77502))

## Composability

* Many features, improvements and fixes to Python tensor subclasses based on `__torch_function__` and  `__torch_dispatch__`
    * Added `__torch_function__` mode, which allows you to override the meaning of all `__torch_function__` overrideable functions within a dynamic scope. ([#75154](https://github.com/pytorch/pytorch/pull/75154))
    * Added `enable_torch_dispatch_mode`, which allows nesting of different `__torch_dispatch__` modes. ([#75965](https://github.com/pytorch/pytorch/pull/75965))
    * Added a default implementation of `__torch_dispatch__` ([#73684](https://github.com/pytorch/pytorch/pull/73684))
    * Added support for `super().__torch_dispatch__` with an argument list ([#74509](https://github.com/pytorch/pytorch/pull/74509), [#74720](https://github.com/pytorch/pytorch/pull/74720))
    * Miscellaneous `__torch_function__` fixes ([#75484](https://github.com/pytorch/pytorch/pull/75484), [#75110](https://github.com/pytorch/pytorch/pull/75110))
    * Added `__torch_function__` override protocol support to some factory functions ([#75639](https://github.com/pytorch/pytorch/pull/75639))
    * Fixed propagation of warnings when using `__torch_dispatch__`. ([#74357](https://github.com/pytorch/pytorch/pull/74357))
    * Removed spurious warning when using disabled torch function ([#75826](https://github.com/pytorch/pytorch/pull/75826))
    * Added the ability to snapshot TLS for “has-a” use cases of `__torch_dispatch__` ([#72623](https://github.com/pytorch/pytorch/pull/72623), [#74577](https://github.com/pytorch/pytorch/pull/74577))
    * Fixed serialization and deep copying for wrapper subclasses ([#73078](https://github.com/pytorch/pytorch/pull/73078))
    * Allowed `is_contiguous()` to be overridden in `__torch_dispatch__` ([#77906](https://github.com/pytorch/pytorch/pull/77906))
* Added a “functionalization” program transform that can be used to remove mutation and aliasing ops from PyTorch programs while maintaining program semantics. Most of the logic for the pass currently lives in core, but the pass is exposed as an API through `functorch`; you can run it with `functorch.experimental.functionalize()`. Example usages can be found [here](https://github.com/pytorch/functorch/blob/130582ce47d30aec58713bb25eb2911b908aa616/test/test_eager_transforms.py#L2909). ([#75913](https://github.com/pytorch/pytorch/pull/75913), [#76083](https://github.com/pytorch/pytorch/pull/76083), [#76084](https://github.com/pytorch/pytorch/pull/76084), [#73442](https://github.com/pytorch/pytorch/pull/73442), [#77285](https://github.com/pytorch/pytorch/pull/77285), [#73441](https://github.com/pytorch/pytorch/pull/73441), [#75302](https://github.com/pytorch/pytorch/pull/75302), [#75818](https://github.com/pytorch/pytorch/pull/75818), [#75819](https://github.com/pytorch/pytorch/pull/75819), [#76125](https://github.com/pytorch/pytorch/pull/76125), [#76318](https://github.com/pytorch/pytorch/pull/76318), [#77358](https://github.com/pytorch/pytorch/pull/77358))
* Added a new `torch.library` API to allow users to override kernels for existing C++ ops through Python ([#75905](https://github.com/pytorch/pytorch/pull/75905), [#76892](https://github.com/pytorch/pytorch/pull/76892))
* Allowed creating new libraries and defining new operators from Python ([#76250](https://github.com/pytorch/pytorch/pull/76250), [#77690](https://github.com/pytorch/pytorch/pull/77690))
* Added experimental APIs for registering and looking up Python decompositions for many aten operators: `from torch._decomp import register_decomposition, get_decompositions` (see the sketch after this list). ([#76311](https://github.com/pytorch/pytorch/pull/76311), [#76814](https://github.com/pytorch/pytorch/pull/76814))
    * Many decompositions have also been added to this table ([#76621](https://github.com/pytorch/pytorch/pull/76621), [#77329](https://github.com/pytorch/pytorch/pull/77329), [#77219](https://github.com/pytorch/pytorch/pull/77219), [#76633](https://github.com/pytorch/pytorch/pull/76633), [#76855](https://github.com/pytorch/pytorch/pull/76855), [#76714](https://github.com/pytorch/pytorch/pull/76714), [#76763](https://github.com/pytorch/pytorch/pull/76763), [#77473](https://github.com/pytorch/pytorch/pull/77473), [#77807](https://github.com/pytorch/pytorch/pull/77807), [#77500](https://github.com/pytorch/pytorch/pull/77500))
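
A minimal sketch of the experimental decomposition lookup mentioned above (illustrative; the chosen aten ops are arbitrary and the API is subject to change):

```python
import torch
from torch._decomp import get_decompositions  # register_decomposition adds new entries

aten = torch.ops.aten
# Retrieve the registered Python decompositions for a few aten ops, if available.
decomps = get_decompositions([aten.hardswish, aten.silu])
print(list(decomps.keys()))
```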

## Sparse

* Added factory functions for sparse CSC, BSR, and BSC tensors ([#76634](https://github.com/pytorch/pytorch/pull/76634), [#76623](https://github.com/pytorch/pytorch/pull/76623), [#75946](https://github.com/pytorch/pytorch/pull/75946), [#75961](https://github.com/pytorch/pytorch/pull/75961), [#75831](https://github.com/pytorch/pytorch/pull/75831), [#76651](https://github.com/pytorch/pytorch/pull/76651))
* Added `ccol_indices` and `row_indices` methods for CSC and BSC tensors. ([#77503](https://github.com/pytorch/pytorch/pull/77503))
* Added `to_sparse_csc` with support for 2D Strided and 2D CSC input ([#77521](https://github.com/pytorch/pytorch/pull/77521))
* Added `to_sparse_bsr`  with support for 2D CSR input ([#77366](https://github.com/pytorch/pytorch/pull/77366))
* Added `index_reduce` ([#76997](https://github.com/pytorch/pytorch/pull/76997), [#75981](https://github.com/pytorch/pytorch/pull/75981), [#76296](https://github.com/pytorch/pytorch/pull/76296))
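
An illustrative construction of one of the new sparse layouts (CSC), using the factory function and accessors listed above; the values here are arbitrary:

```python
import torch

# 3x3 CSC tensor with one nonzero per column.
ccol_indices = torch.tensor([0, 1, 2, 3])  # column pointer array (length = ncols + 1)
row_indices = torch.tensor([0, 1, 2])
values = torch.tensor([1., 2., 3.])
csc = torch.sparse_csc_tensor(ccol_indices, row_indices, values, size=(3, 3))
print(csc.ccol_indices(), csc.row_indices())
```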

## CUDA

* Added Jiterator support when dtype is complex for `sigmoid`, `exp`, `sqrt`, `rsqrt`, `log`, `log10`, `log2`, `addcmul`, `abs`, `addcdiv`, `sgn`, `neg`, `logical_and`, `angle` ([#73643](https://github.com/pytorch/pytorch/pull/73643), [#73776](https://github.com/pytorch/pytorch/pull/73776), [#73781](https://github.com/pytorch/pytorch/pull/73781), [#74160](https://github.com/pytorch/pytorch/pull/74160), [#74161](https://github.com/pytorch/pytorch/pull/74161), [#74533](https://github.com/pytorch/pytorch/pull/74533), [#74455](https://github.com/pytorch/pytorch/pull/74455), [#74827](https://github.com/pytorch/pytorch/pull/74827), [#74814](https://github.com/pytorch/pytorch/pull/74814), [#74863](https://github.com/pytorch/pytorch/pull/74863), [#75123](https://github.com/pytorch/pytorch/pull/75123), [#76692](https://github.com/pytorch/pytorch/pull/76692))
* Added Jiterator support when dtype is complex for the backward of `sigmoid` and `tanh` ([#76289](https://github.com/pytorch/pytorch/pull/76289), [#74948](https://github.com/pytorch/pytorch/pull/74948))
* Added Jiterator support for `kaiser_window` and `prod` ([#73734](https://github.com/pytorch/pytorch/pull/73734), [#75231](https://github.com/pytorch/pytorch/pull/75231))
* Enabled simple reductions with Jiterator ([#75231](https://github.com/pytorch/pytorch/pull/75231))
* Updated to cuDNN v8 API with cuDNN benchmark, convolution bwd / transposed convolution fwd, `bfloat16`, conv-bias-activation fusion ([#60755](https://github.com/pytorch/pytorch/pull/60755))
* Added Python Interface for Jiterator ([#76394](https://github.com/pytorch/pytorch/pull/76394))
* Added Jiterator with Python Registration ([#77121](https://github.com/pytorch/pytorch/pull/77121))
* Prepared Jiterator code template for multiple outputs ([#77902](https://github.com/pytorch/pytorch/pull/77902))
* For CUDA graphs, added `torch.cuda.is_current_stream_capturing` ([#77789](https://github.com/pytorch/pytorch/pull/77789))
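
A sketch of the Jiterator Python interface mentioned above (prototype API, hence the underscore prefix; it requires a CUDA device, and the kernel string here is just an example):

```python
import torch

if torch.cuda.is_available():
    # Author an elementwise CUDA kernel from a C++ template string, JIT-compiled on first use.
    code = "template <typename T> T my_relu(T x) { return x > T(0) ? x : T(0); }"
    my_relu = torch.cuda.jiterator._create_jit_fn(code)

    x = torch.randn(1024, device="cuda")
    y = my_relu(x)
```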


## Vulkan

* Added Vulkan support for Gated Recurrent Units (`torch.nn.GRU`) ([#72692](https://github.com/pytorch/pytorch/pull/72692), [#73599](https://github.com/pytorch/pytorch/pull/73599))
* Added Vulkan support for the linear interpolation op (`torch.lerp`) ([#76544](https://github.com/pytorch/pytorch/pull/76544))

## Profiler

* Added support for both global (experimental) and thread-local profiling ([#75525](https://github.com/pytorch/pytorch/pull/75525), [#76078](https://github.com/pytorch/pytorch/pull/76078), [#76239](https://github.com/pytorch/pytorch/pull/76239))

## Mobile

* Added support for different memory formats of Tensors in NNC ([#72873](https://github.com/pytorch/pytorch/pull/72873))
* Upgraded mobile model bytecode to V9 and provided backporting to previous versions ([#71662](https://github.com/pytorch/pytorch/pull/71662))

## Distributed

* `ShardedTensor` and tensor parallelism
    * This is a prototyping effort which consists of having a new class to represent how one `torch.tensor` is being sharded across multiple GPUs or hosts and a high level APIs for users to specify how to shard, enabling basic tensor ops for `ShardedTensor` and enabling optimizer for `ShardedTensor`. In addition, we have added PartialTensor, ReplicatedTensor and checkpoint with ShardedTensor ([#63997](https://github.com/pytorch/pytorch/pull/63997), [#65511](https://github.com/pytorch/pytorch/pull/65511), [#65671](https://github.com/pytorch/pytorch/pull/65671), [#65855](https://github.com/pytorch/pytorch/pull/65855), [#66012](https://github.com/pytorch/pytorch/pull/66012), [#66351](https://github.com/pytorch/pytorch/pull/66351), [#66464](https://github.com/pytorch/pytorch/pull/66464), [#66603](https://github.com/pytorch/pytorch/pull/66603), [#66604](https://github.com/pytorch/pytorch/pull/66604), [#67057](https://github.com/pytorch/pytorch/pull/67057), [#67188](https://github.com/pytorch/pytorch/pull/67188), [#67199](https://github.com/pytorch/pytorch/pull/67199), [#67799](https://github.com/pytorch/pytorch/pull/67799), [#68021](https://github.com/pytorch/pytorch/pull/68021), [#68096](https://github.com/pytorch/pytorch/pull/68096), [#68607](https://github.com/pytorch/pytorch/pull/68607), [#68771](https://github.com/pytorch/pytorch/pull/68771), [#68786](https://github.com/pytorch/pytorch/pull/68786), [#68806](https://github.com/pytorch/pytorch/pull/68806), [#69226](https://github.com/pytorch/pytorch/pull/69226), [#69493](https://github.com/pytorch/pytorch/pull/69493), [#69569](https://github.com/pytorch/pytorch/pull/69569), [#69725](https://github.com/pytorch/pytorch/pull/69725), [#69874](https://github.com/pytorch/pytorch/pull/69874), [#69945](https://github.com/pytorch/pytorch/pull/69945), [#69946](https://github.com/pytorch/pytorch/pull/69946), [#70145](https://github.com/pytorch/pytorch/pull/70145), [#70228](https://github.com/pytorch/pytorch/pull/70228), [#70266](https://github.com/pytorch/pytorch/pull/70266), [#70331](https://github.com/pytorch/pytorch/pull/70331), [#70476](https://github.com/pytorch/pytorch/pull/70476), [#71445](https://github.com/pytorch/pytorch/pull/71445), [#72062](https://github.com/pytorch/pytorch/pull/72062), [#72130](https://github.com/pytorch/pytorch/pull/72130), [#73309](https://github.com/pytorch/pytorch/pull/73309), [#76360](https://github.com/pytorch/pytorch/pull/76360), [#76477](https://github.com/pytorch/pytorch/pull/76477), [#72733](https://github.com/pytorch/pytorch/pull/72733), [#73392](https://github.com/pytorch/pytorch/pull/73392), [#76199](https://github.com/pytorch/pytorch/pull/76199), [#75374](https://github.com/pytorch/pytorch/pull/75374), [#71624](https://github.com/pytorch/pytorch/pull/71624), [#74040](https://github.com/pytorch/pytorch/pull/74040), [#73529](https://github.com/pytorch/pytorch/pull/73529), [#74941](https://github.com/pytorch/pytorch/pull/74941), [#73703](https://github.com/pytorch/pytorch/pull/73703), [#75712](https://github.com/pytorch/pytorch/pull/75712), [#73873](https://github.com/pytorch/pytorch/pull/73873), [#75991](https://github.com/pytorch/pytorch/pull/75991), [#75844](https://github.com/pytorch/pytorch/pull/75844), [#76824](https://github.com/pytorch/pytorch/pull/76824), [#76897](https://github.com/pytorch/pytorch/pull/76897), [#77185](https://github.com/pytorch/pytorch/pull/77185), [#77191](https://github.com/pytorch/pytorch/pull/77191), [#76758](https://github.com/pytorch/pytorch/pull/76758), 
[#77209](https://github.com/pytorch/pytorch/pull/77209), [#77214](https://github.com/pytorch/pytorch/pull/77214), [#77367](https://github.com/pytorch/pytorch/pull/77367), [#77580](https://github.com/pytorch/pytorch/pull/77580), [#77626](https://github.com/pytorch/pytorch/pull/77626), [#77800](https://github.com/pytorch/pytorch/pull/77800), [#77707](https://github.com/pytorch/pytorch/pull/77707), [#78056](https://github.com/pytorch/pytorch/pull/78056))
        * Design proposal: [ShardedTensor](https://github.com/pytorch/pytorch/issues/55207) and [Sharding APIs](https://github.com/pytorch/pytorch/issues/72138). [Example](https://github.com/pytorch/examples/tree/main/distributed/sharded_tensor) of Megatron-LM-style tensor parallelism.
* FullyShardedDataParallel
    * Added `FlatParameter` to track the information of a flat parameter ([#69241](https://github.com/pytorch/pytorch/pull/69241))
    * Enabled `summon_full_params` for FSDP. ([#71225](https://github.com/pytorch/pytorch/pull/71225))
    * Added `no_sync()` context manager ([#72446](https://github.com/pytorch/pytorch/pull/72446))
    * Implemented `apply()` ([#72925](https://github.com/pytorch/pytorch/pull/72925))
    * Implemented local_state_dict and load_local_state_dict ([#73300](https://github.com/pytorch/pytorch/pull/73300))
    * Implemented `full_state_dict` ([#73324](https://github.com/pytorch/pytorch/pull/73324))
    * Implemented `clip_grad_norm` for FSDP ([#73405](https://github.com/pytorch/pytorch/pull/73405))
    * Added grad accumulation without `no_sync()` ([#73535](https://github.com/pytorch/pytorch/pull/73535))
    * Added `full_optim_state_dict` ([#74215](https://github.com/pytorch/pytorch/pull/74215))
    * Implemented `reshard_flatten_tensor` ([#75192](https://github.com/pytorch/pytorch/pull/75192))
    * Added `scatter_full_optim_state_dict()` ([#75517](https://github.com/pytorch/pytorch/pull/75517))
    * Implemented `sharded_state_dict` and `load_sharded_state_dict` ([#77356](https://github.com/pytorch/pytorch/pull/77356))
    * Enabled mixed precision in FSDP ([#75024](https://github.com/pytorch/pytorch/pull/75024))
    * Changed to allow specification of modules to ignore when wrapping with FSDP ([#75431](https://github.com/pytorch/pytorch/pull/75431))
    * Added `FullStateDictConfig` to allow full state dict checkpoint with rank0 only and CPU offload ([#75908](https://github.com/pytorch/pytorch/pull/75908))
    * Added validation to ensure FSDP units execute consistently across ranks ([#75902](https://github.com/pytorch/pytorch/pull/75902))
    * Added support for initialization of modules on meta device ([#75880](https://github.com/pytorch/pytorch/pull/75880))
    * Added support for no sharding config for DDP-style parallelism ([#76736](https://github.com/pytorch/pytorch/pull/76736))
    * Changed to allow FSDP to specify the device that the sharded, wrapped module should be placed on ([#77321](https://github.com/pytorch/pytorch/pull/77321))
    * Enabled FSDP parameter sync ([#77492](https://github.com/pytorch/pytorch/pull/77492))
    * Made sharding strategy configurable and support zero2 algorithm ([#73819](https://github.com/pytorch/pytorch/pull/73819))
    * Added a shard aware grad scaler for FSDP+MixedPrecision ([#76918](https://github.com/pytorch/pytorch/pull/76918))
    * Enabled FSDP full state dict to work for non root FSDP modules via post load hooks ([#76912](https://github.com/pytorch/pytorch/pull/76912))
    * Added `always_wrap` policy ([#73687](https://github.com/pytorch/pytorch/pull/73687))
    * Provided an auto wrap policy for common transformer models ([#76455](https://github.com/pytorch/pytorch/pull/76455))
* DistributedDataParallel
    * Added support for hierarchical model averaging ([#73285](https://github.com/pytorch/pytorch/pull/73285))
* torch.distributed.rpc
    * Changed to allow for optional `world_size` argument in `init_rpc` ([#73372](https://github.com/pytorch/pytorch/pull/73372))
    * Changed to allow newly joined ranks to communicate with existing ranks ([#73373](https://github.com/pytorch/pytorch/pull/73373))
    * Changed to allow existing ranks to communicate with newly joined ranks ([#74035](https://github.com/pytorch/pytorch/pull/74035))
    * Added graceful shutdown for dynamic RPC members ([#74561](https://github.com/pytorch/pytorch/pull/74561))
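
To put the FSDP items above in context, here is a minimal single-GPU sketch (not from the notes); it assumes a `torchrun`-style launch so that the process-group environment variables are set, plus NCCL and one CUDA device:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                      # relies on torchrun-provided env vars
model = FSDP(torch.nn.Linear(8, 8).cuda())
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
optim.step()

# Consolidate a full (unflattened) optimizer state dict, gathered to rank 0 by default.
full_osd = FSDP.full_optim_state_dict(model, optim)
```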

## JIT/TorchScript

* Added autocasting of values from fp32 to lower precision floats in `torch.jit.freeze`  ([#74178](https://github.com/pytorch/pytorch/pull/74178))
* `torch.jit.set_fusion_strategy` is now a public API, allowing one to set if they want fusion based on static or dynamic tensor sizes ([#72639](https://github.com/pytorch/pytorch/pull/72639))
* Added support for compiling `tensor.__getitem__()` ([#73952](https://github.com/pytorch/pytorch/pull/73952))
* TorchScript uses a fuser to combine multiple operator calls into a single kernel. In 1.12 the default fuser for NVIDIA GPUs is switched to NVFuser, which supports a wider range of operators and has demonstrated improved throughput compared to NNC, the previous fuser. ([#74361](https://github.com/pytorch/pytorch/pull/74361), [#77010](https://github.com/pytorch/pytorch/pull/77010), [#77395](https://github.com/pytorch/pytorch/pull/77395), [#72127](https://github.com/pytorch/pytorch/pull/72127), [#73627](https://github.com/pytorch/pytorch/pull/73627), [#75235](https://github.com/pytorch/pytorch/pull/75235), [#75539](https://github.com/pytorch/pytorch/pull/75539), [#75558](https://github.com/pytorch/pytorch/pull/75558), [#75646](https://github.com/pytorch/pytorch/pull/75646), [#76226](https://github.com/pytorch/pytorch/pull/76226), [#76459](https://github.com/pytorch/pytorch/pull/76459), [#76604](https://github.com/pytorch/pytorch/pull/76604), [#76563](https://github.com/pytorch/pytorch/pull/76563), [#77001](https://github.com/pytorch/pytorch/pull/77001), [#77017](https://github.com/pytorch/pytorch/pull/77017), [#77471](https://github.com/pytorch/pytorch/pull/77471), [#77777](https://github.com/pytorch/pytorch/pull/77777), [#77884](https://github.com/pytorch/pytorch/pull/77884), [#76790](https://github.com/pytorch/pytorch/pull/76790), [#76343](https://github.com/pytorch/pytorch/pull/76343), [#76605](https://github.com/pytorch/pytorch/pull/76605), [#76769](https://github.com/pytorch/pytorch/pull/76769), [#77158](https://github.com/pytorch/pytorch/pull/77158), [#74339](https://github.com/pytorch/pytorch/pull/74339))
* Added option to save extra files in `torch.jit.save_jit_module_to_flatbuffer` ([#77870](https://github.com/pytorch/pytorch/pull/77870))
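
A short example of the now-public fusion-strategy API (the depths below are arbitrary illustrative values):

```python
import torch

# Specialize fused kernels for up to 2 static-shape variants, then up to 10
# dynamic-shape variants, before falling back to unfused execution.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])
```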

## Quantization

* Added oneDNN quantization backend ([#69820](https://github.com/pytorch/pytorch/pull/69820))
* Added oneDNN quant backend ([#74137](https://github.com/pytorch/pytorch/pull/74137))
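
Selecting the new backend looks like the following (sketch; whether `onednn` appears depends on how PyTorch was built):

```python
import torch

print(torch.backends.quantized.supported_engines)   # e.g. ['none', 'fbgemm', 'onednn'], build-dependent
if "onednn" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "onednn"       # route quantized ops through oneDNN
```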

## ONNX

* Added support for exporting additional ops:
    * `Cross`, `Cdist` and `Pairwise Distance` ([#75278](https://github.com/pytorch/pytorch/pull/75278))
    * `bucketize` ([#74856](https://github.com/pytorch/pytorch/pull/74856))
    * `pixel unshuffle` ([#72499](https://github.com/pytorch/pytorch/pull/72499))
    * `embedding_renorm` ([#72738](https://github.com/pytorch/pytorch/pull/72738))
    * `aminmax` ([#75714](https://github.com/pytorch/pytorch/issues/75714))
    * `amax` and `amin` ([#75268](https://github.com/pytorch/pytorch/pull/75268))
    * `grid_sample` ([#76159](https://github.com/pytorch/pytorch/issues/76159))
* Added support for exporting quantized models ([#72986](https://github.com/pytorch/pytorch/issues/72986), [#73102](https://github.com/pytorch/pytorch/issues/73102), [#75697](https://github.com/pytorch/pytorch/pull/75697), [#76002](https://github.com/pytorch/pytorch/pull/76002), [#76055](https://github.com/pytorch/pytorch/pull/76055), [#73336](https://github.com/pytorch/pytorch/pull/73336), [#77393](https://github.com/pytorch/pytorch/pull/77393), [#75920](https://github.com/pytorch/pytorch/pull/75920), [#75921](https://github.com/pytorch/pytorch/pull/75921))
* Added support for the Optional type. See tests in the PR for examples. ([#73284](https://github.com/pytorch/pytorch/issues/73284))
* Added support for ATen fallback for non-Caffe2 implementations of ONNX ([#74759](https://github.com/pytorch/pytorch/pull/74759), [#75468](https://github.com/pytorch/pytorch/pull/75468), [#74680](https://github.com/pytorch/pytorch/pull/74680), [#73954](https://github.com/pytorch/pytorch/pull/73954))
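
As an illustration (not from the notes), exporting one of the newly supported ops such as `torch.bucketize` is a plain `torch.onnx.export` call; the module, shapes, and in-memory buffer below are arbitrary choices:

```python
import io
import torch

class Bucketize(torch.nn.Module):
    def forward(self, x, boundaries):
        return torch.bucketize(x, boundaries)

f = io.BytesIO()  # export to an in-memory buffer instead of a file
torch.onnx.export(
    Bucketize(),
    (torch.randn(4), torch.tensor([0.0, 1.0, 2.0])),
    f,
    opset_version=13,
)
```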

## Infra (Releng)

* Added support for ROCm 5.0 ([#72895](https://github.com/pytorch/pytorch/pull/72895))
* Added LibTorch builds for ROCm ([#57506](https://github.com/pytorch/pytorch/pull/57506))
* Added support for CUDA 11.6 ([#75518](https://github.com/pytorch/pytorch/pull/75518))

# Improvements

## Python API

* Improved numerical stability of `torch.distributions.wishart.Wishart` ([#72993](https://github.com/pytorch/pytorch/pull/72993))
* Added `mode` property to `torch.distributions.Distribution` ([#76690](https://github.com/pytorch/pytorch/pull/76690))
* Added `foreach` flag for `torch.optim.{Adadelta, Adagrad, Adamax, Adam, ASGD, NAdam, RAdam, SGD, RMSprop, Rprop, AdamW}` (a usage sketch follows this list) ([#69980](https://github.com/pytorch/pytorch/pull/69980), [#69981](https://github.com/pytorch/pytorch/pull/69981), [#69982](https://github.com/pytorch/pytorch/pull/69982), [#70295](https://github.com/pytorch/pytorch/pull/70295), [#70481](https://github.com/pytorch/pytorch/pull/70481), [#70229](https://github.com/pytorch/pytorch/pull/70229), [#70230](https://github.com/pytorch/pytorch/pull/70230), [#70231](https://github.com/pytorch/pytorch/pull/70231), [#70482](https://github.com/pytorch/pytorch/pull/70482), [#70483](https://github.com/pytorch/pytorch/pull/70483), [#70484](https://github.com/pytorch/pytorch/pull/70484))
* Added out variant for `torch.softmax` and `torch.log_softmax` ([#75833](https://github.com/pytorch/pytorch/pull/75833))
* Added handling for r=0 case for `torch.combinations` ([#70270](https://github.com/pytorch/pytorch/pull/70270))
* Added XPU support for `torch.autocast` ([#75250](https://github.com/pytorch/pytorch/pull/75250))
* Added support for Tensor source for `.set_(storage, offset, size, strides)` ([#77007](https://github.com/pytorch/pytorch/pull/77007))
* Changed to register `torch.return_types.*` as pytree nodes ([#75915](https://github.com/pytorch/pytorch/pull/75915))
* Added typing for `torch.return_type` ([#74199](https://github.com/pytorch/pytorch/pull/74199))
* Set correct module for APIs in the `torch` module ([#75801](https://github.com/pytorch/pytorch/pull/75801))
* Improved `NotImplementedError` verbosity for `torch.distributions.kl_divergence` ([#72845](https://github.com/pytorch/pytorch/pull/72845))
* Added maximize flag to `torch.optim.Adagrad` ([#75968](https://github.com/pytorch/pytorch/pull/75968))
* `optim.{Adagrad, Adam, Adamax, AdamW, RAdam}`: Updated `step` in functional optimizers and pass `state_steps` instead of `state` ([#71333](https://github.com/pytorch/pytorch/pull/71333))
* Improved `torch.lerp` numerical precision by doing intermediate math in opmath_t ([#76062](https://github.com/pytorch/pytorch/pull/76062))
* Changed to alias `torch.finfo.tiny` to `torch.finfo.smallest_normal` ([#76292](https://github.com/pytorch/pytorch/pull/76292))
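
A usage sketch for the new `foreach` flag (illustrative; `Adam` and the hyperparameters are arbitrary choices):

```python
import torch

model = torch.nn.Linear(4, 4)
# foreach=True opts into multi-tensor ("horizontally fused") parameter updates.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()
```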

## C++ API

* Added catch for overflows in calculating storage byte size for `col2im` ([#73719](https://github.com/pytorch/pytorch/pull/73719))
* Implemented center padding for `stft` ([#73432](https://github.com/pytorch/pytorch/pull/73432))

## Autograd

* Added forward AD support for `torch.{atan2, dist, logsumexp, log_softmax, norm, polar, put, softmax}` ([#73741](https://github.com/pytorch/pytorch/pull/73741), [#74205](https://github.com/pytorch/pytorch/pull/74205), [#75027](https://github.com/pytorch/pytorch/pull/75027), [#75326](https://github.com/pytorch/pytorch/pull/75326), [#77421](https://github.com/pytorch/pytorch/pull/77421))
* Added forward AD support for `torch.nn.functional.{cross_entropy, pairwise_distance, nll_loss, normalize}` ([#73741](https://github.com/pytorch/pytorch/pull/73741), [#74205](https://github.com/pytorch/pytorch/pull/74205))
* Added forward AD support for `torch.cholesky_inverse` ([#75033](https://github.com/pytorch/pytorch/pull/75033))
* Added forward AD and forward-over-reverse support for FFTs ([#75326](https://github.com/pytorch/pytorch/pull/75326))
* Added forward AD support for `torch.nn.functional.{embedding,prelu, bilinear, rrelu, logsigmoid}` ([#77421](https://github.com/pytorch/pytorch/pull/77421))
* Added forward AD support for `torch.nn.BCELoss` ([#77755](https://github.com/pytorch/pytorch/pull/77755))
* Added forward AD support for `Tensor.__rsub__` ([#75326](https://github.com/pytorch/pytorch/pull/75326))
* Added forward AD support for `torch.clamp` when bounds are tensors ([#74042](https://github.com/pytorch/pytorch/pull/74042))
* Added forward AD support for `torch.nn.functional.{dropout, glu}`([#75288](https://github.com/pytorch/pytorch/pull/75288), [#77186](https://github.com/pytorch/pytorch/pull/77186))
* Added forward-over-reverse support for `torch.nn.functional.{leaky_relu, glu, elu, selu, celu}` ([#75294](https://github.com/pytorch/pytorch/pull/75294), [#77309](https://github.com/pytorch/pytorch/pull/77309), [#75297](https://github.com/pytorch/pytorch/pull/75297))
* Improved forward and backward derivatives of `torch.{linalg.cholesky, cholesky}` ([#76032](https://github.com/pytorch/pytorch/pull/76032))
* Improved forward and backward derivative of `torch.linalg.qr` ([#76115](https://github.com/pytorch/pytorch/pull/76115))
* Added complex autograd support for  `torch.cholesky_inverse` ([#75033](https://github.com/pytorch/pytorch/pull/75033))
* Added double backward support for `torch.nn.functional.binary_cross_entropy` wrt target ([#77416](https://github.com/pytorch/pytorch/pull/77416))
* Improved error message for `torch.nn.functional.batch_norm` when `running_{mean,var}` have forward grad defined ([#73655](https://github.com/pytorch/pytorch/pull/73655))
* Improved error message when forward AD is not supported ([#75105](https://github.com/pytorch/pytorch/pull/75105))
* Added forward AD and forward-over-reverse support for `torch.nn.functional.max_unpool` ([#68625](https://github.com/pytorch/pytorch/pull/68625))
* Added autograd support for `masked_softmax` ([#71502](https://github.com/pytorch/pytorch/pull/71502))
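
For reference, a minimal forward-AD sketch (not from the notes) using one of the newly supported ops, `log_softmax`; the tangent direction is arbitrary:

```python
import torch
import torch.autograd.forward_ad as fwAD

x = torch.randn(3)
t = torch.ones(3)                      # tangent: direction of differentiation
with fwAD.dual_level():
    dual = fwAD.make_dual(x, t)
    out = torch.log_softmax(dual, dim=0)
    primal, tangent = fwAD.unpack_dual(out)   # tangent = J @ t
```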

## Build

* Fixed pybind deprecation warnings ([#72376](https://github.com/pytorch/pytorch/pull/72376))
* Enabled win-arm64 ([#72424](https://github.com/pytorch/pytorch/pull/72424))
* Moved magma utils to its own header ([#73058](https://github.com/pytorch/pytorch/pull/73058))
* Turned on -Wsign-compare ([#74996](https://github.com/pytorch/pytorch/pull/74996))
* Made all `.pyi.in` files exportable from torch/_C/ folder ([#74962](https://github.com/pytorch/pytorch/pull/74962))
* Updated Jinja2 for the docs/cpp build to version 3.0 ([#74718](https://github.com/pytorch/pytorch/pull/74718))
* Added CMake option for using static MKL libraries ([#73069](https://github.com/pytorch/pytorch/pull/73069))
* CPU Kernel: Changed to use per-operator headers ([#71137](https://github.com/pytorch/pytorch/pull/71137))
* CUDA Kernels: Changed to use per-operator headers ([#71212](https://github.com/pytorch/pytorch/pull/71212))

## Dataloader

* Added `pin_memory_device` to Dataloader to pin `Tensor` to the corresponding GPU device ([#65402](https://github.com/pytorch/pytorch/pull/65402))
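
A sketch of the new argument (illustrative; pinning against `"cuda:0"` assumes a CUDA device is present when the loader is iterated):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(100, 3))
# Pin host memory against a specific device rather than the current default one.
loader = DataLoader(ds, batch_size=10, pin_memory=True, pin_memory_device="cuda:0")
```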

## ForEach

* Improved numerical precision for `ForEach` L1 and L2 norm by using  `OpMathType` tensor for intermediate results ([#68107](https://github.com/pytorch/pytorch/pull/68107))

## Meta API

* Changed to skip superfluous storage allocations while constructing meta tensors ([#65331](https://github.com/pytorch/pytorch/pull/65331))

## torch.nn

* Made `nn.init.orthogonal_` no-op for empty input ([#75553](https://github.com/pytorch/pytorch/pull/75553))
* `nn.{Conv1d, Conv2d, Conv3d}`: Added support for complex datatypes ([#75310](https://github.com/pytorch/pytorch/pull/75310), [#75412](https://github.com/pytorch/pytorch/pull/75412), [#75581](https://github.com/pytorch/pytorch/pull/75581))
* `nn.Conv2d`: Added bfloat16 support for mkl-dnn backend ([#55864](https://github.com/pytorch/pytorch/pull/55864))
* `nn.Conv2d`: Added support for channels last memory format on CPU for mkl-dnn backend, naive algorithm, and dilated algorithm ([#55584](https://github.com/pytorch/pytorch/pull/55584), [#68101](https://github.com/pytorch/pytorch/pull/68101), [#70665](https://github.com/pytorch/pytorch/pull/70665))
* `nn.EmbeddingBag`: Added half precision support on CPU ([#74844](https://github.com/pytorch/pytorch/pull/74844))
* `nn.FractionalMaxPool*d`: Added support for `0`s in `out_size` ([#73634](https://github.com/pytorch/pytorch/pull/73634))
* `nn.Module`: Changed to throw error for non-dict inputs to `load_state_dict()` ([#77197](https://github.com/pytorch/pytorch/pull/77197))
* `nn.{PixelShuffle, PixelUnshuffle}`: Added support for channels last memory format ([#50573](https://github.com/pytorch/pytorch/pull/50573))
* `nn.PReLU`: Enabled fp32/bfloat16 forward and backward for mkl-dnn backend ([#60427](https://github.com/pytorch/pytorch/pull/60427))
* `F.elu`: Improved numerical precision by using `opmath` and `expm1` ([#77062](https://github.com/pytorch/pytorch/pull/77062))
* `F.{hardshrink, hardsigmoid, hardswish, logsigmoid, smooth_l1_loss, softplus, softshrink}, nn.{BatchNorm, GLU, Upsample}`: Added bfloat16 support on CPU ([#62558](https://github.com/pytorch/pytorch/pull/62558), [#63134](https://github.com/pytorch/pytorch/pull/63134), [#77496](https://github.com/pytorch/pytorch/pull/77496), [#61944](https://github.com/pytorch/pytorch/pull/61944), [#76935](https://github.com/pytorch/pytorch/pull/76935))
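
As an illustration of the channels-last support noted above (sketch; shapes are arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
out = conv(x)
print(out.is_contiguous(memory_format=torch.channels_last))  # True
```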

## torch.fx

* FX/graph_drawer
    * Added args/kwargs and users ([#73464](https://github.com/pytorch/pytorch/pull/73464))
    * Added `skip_node_names_in_args` option, default to `True` ([#73815](https://github.com/pytorch/pytorch/pull/73815))
* Core
    * Refactored FX codegen into an extensible Codegen object ([#72566](https://github.com/pytorch/pytorch/pull/72566))
    * Modified `replace_all_uses_with` to allow filtering of nodes to update ([#73763](https://github.com/pytorch/pytorch/pull/73763))
    * Made `immutable_list` and `immutable_dict` work with pytrees ([#73766](https://github.com/pytorch/pytorch/pull/73766))
    * Added `Assert None concrete_args` and improved error messages ([#74662](https://github.com/pytorch/pytorch/pull/74662))
* In minimizer, made args work in the `uru10x10_to_trt_eval` script ([#74707](https://github.com/pytorch/pytorch/pull/74707))
* For split_module, changed to return mapping of qualified names from split_module() ([#73564](https://github.com/pytorch/pytorch/pull/73564))
* For shape propagation, made shapes and args/kwargs concrete for minimizer ([#75291](https://github.com/pytorch/pytorch/pull/75291))

## Sparse

* Added CUDA support for `scatter_reduce` ([#74606](https://github.com/pytorch/pytorch/pull/74606), [#74607](https://github.com/pytorch/pytorch/pull/74607))
* Added 2D Strided, 2D CSR, 2D CSC, 2D COO support to `to_sparse_csr` ([#77521](https://github.com/pytorch/pytorch/pull/77521))
* Added ND Strided, 2D CSC support to `to_dense` ([#74486](https://github.com/pytorch/pytorch/pull/74486), [#77521](https://github.com/pytorch/pytorch/pull/77521))
* Added 2D CSC support to `to_sparse`  ([#73642](https://github.com/pytorch/pytorch/pull/73642), [#77521](https://github.com/pytorch/pytorch/pull/77521))
* Added support for batched CSR to `sparse_csr_tensor` ([#74542](https://github.com/pytorch/pytorch/pull/74542))
* Added support for `__str__` for CSC, BSR, and BSC tensors ([#77530](https://github.com/pytorch/pytorch/pull/77530), [#76650](https://github.com/pytorch/pytorch/pull/76650))
* Updated transpose to return CSC when given CSR ([#77615](https://github.com/pytorch/pytorch/pull/77615))
* Added support for CSR gradients for CSR tensors ([#75435](https://github.com/pytorch/pytorch/pull/75435))
* Added CSC support to `addmm`, `addmv`, `mm` ([#77615](https://github.com/pytorch/pytorch/pull/77615))
* Added autograd for CSR inputs to `torch.sparse.sampled_addmm` ([#68084](https://github.com/pytorch/pytorch/pull/68084))
* Added autograd for CSR inputs to `torch.sparse.addmm` and `torch.sparse.mm` ([#76591](https://github.com/pytorch/pytorch/pull/76591))
* Added Half/BFloat16 support for to_dense and coalesce methods. ([#72397](https://github.com/pytorch/pytorch/pull/72397))
* Added CSR support to `mul` ([#74266](https://github.com/pytorch/pytorch/pull/74266), [#77177](https://github.com/pytorch/pytorch/pull/77177))
* Added CSR support to `sum` ([#74766](https://github.com/pytorch/pytorch/pull/74766))
* Added BSR support to `addmm`, `addmv`, `triangular_solve` ([#77255](https://github.com/pytorch/pytorch/pull/77255))
* Added batched CSR support to `torch.sparse.sampled_addmm` on CUDA ([#77243](https://github.com/pytorch/pytorch/pull/77243))
* Added CSR support for `torch.sparse.sampled_addmm` on CPU ([#76589](https://github.com/pytorch/pytorch/pull/76589))
* Added CSR support to `torch.select` ([#76228](https://github.com/pytorch/pytorch/pull/76228))
* Added CSR support to `Tensor.to` ([#76400](https://github.com/pytorch/pytorch/pull/76400))
* Added CSC support to `torch.empty` ([#77508](https://github.com/pytorch/pytorch/pull/77508))
* Added CSC, BSR, BSC support to `torch.clone` ([#77512](https://github.com/pytorch/pytorch/pull/77512))
* Added CSC, BSR, BSC support for `copy_`  ([#77605](https://github.com/pytorch/pytorch/pull/77605))
* Added (Strided, CSR) input support to `torch.mm` ([#73686](https://github.com/pytorch/pytorch/pull/73686))
* Added CSR support to `torch.sparse.mm` ([#73075](https://github.com/pytorch/pytorch/pull/73075))
* Added (Strided, CSR, CSR) support to `addmm` on CPU ([#73076](https://github.com/pytorch/pytorch/pull/73076))
* Added runtime beta support warning to CSR, CSC, BSR, BSC tensors ([#75594](https://github.com/pytorch/pytorch/pull/75594), [#75865](https://github.com/pytorch/pytorch/pull/75865))
* Added `bool` support to `coalesce` and `to_dense`  ([#74495](https://github.com/pytorch/pytorch/pull/74495))
* Added `half` support to `sparse_mask` ([#76862](https://github.com/pytorch/pytorch/pull/76862))
* Added AMD Navi 21 support to `coalesce` ([#73548](https://github.com/pytorch/pytorch/pull/73548))
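
A small sketch of the CPU `sampled_addmm` path mentioned above (illustrative; the identity mask is arbitrary):

```python
import torch

mask = torch.eye(3).to_sparse_csr()   # sparsity pattern to sample
a = torch.randn(3, 3)
b = torch.randn(3, 3)
# beta*mask + alpha*(a @ b), evaluated only at mask's nonzero locations; result is CSR.
out = torch.sparse.sampled_addmm(mask, a, b)
```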

## AMD

* Enabled `atomicAddNoRet()` for all gfx targets. ([#75451](https://github.com/pytorch/pytorch/pull/75451))
* Enabled miopen for RNNs with dropout. ([#75429](https://github.com/pytorch/pytorch/pull/75429))
* Used `ncclAllToAll` for ROCm ([#75128](https://github.com/pytorch/pytorch/pull/75128))
* Navi21 Enablement: fix TI `num_threads` for ROCm,  Depthwise kernels, Embedding kernels, Normalization kernels, Softmax kernels, Tensor kernels, Index, Repeat and Sort kernels, Range and Multinomial Kernels ([#69942](https://github.com/pytorch/pytorch/pull/69942), [#72682](https://github.com/pytorch/pytorch/pull/72682), [#72809](https://github.com/pytorch/pytorch/pull/72809), [#73543](https://github.com/pytorch/pytorch/pull/73543),  [#73545](https://github.com/pytorch/pytorch/pull/73545), [#73546](https://github.com/pytorch/pytorch/pull/73546), [#73549](https://github.com/pytorch/pytorch/pull/73549), [#73550](https://github.com/pytorch/pytorch/pull/73550))
* Added ROCm version api within CMake ([#69481](https://github.com/pytorch/pytorch/pull/69481))
* Enabled `sort` operator BF16 support ([#72854](https://github.com/pytorch/pytorch/pull/72854))
* Enabled HIP IPC ([#74383](https://github.com/pytorch/pytorch/pull/74383))
* Enabled `topk` operator for `bfloat16` dtype ([#71913](https://github.com/pytorch/pytorch/pull/71913))
* Added HIP_HOME/include.lib in cpp_extensions ([#75548](https://github.com/pytorch/pytorch/pull/75548))

## CUDA

* PyTorch: added support for NVTX `range_start` and `range_end` ([#70030](https://github.com/pytorch/pytorch/pull/70030))
* Show friendly error message when forgetting `init` in `torch.cuda` ([#72404](https://github.com/pytorch/pytorch/pull/72404))
* PyTorch GPU Allocator: better use of blocks with rounding of allocation sizes ([#74213](https://github.com/pytorch/pytorch/pull/74213))
* CUDACachingAlloc/GPUInference: implemented garbage collection without GPU sync ([#74261](https://github.com/pytorch/pytorch/pull/74261))
* CUBLAS/TF32: added environment variable to allow override of `allow_tf32_cublas` ([#77114](https://github.com/pytorch/pytorch/pull/77114))

## Intel 

* Bfloat16
    * Added BFloat16 support for `torch.{nn.PReLU, nn.Upsample, nn.GLU, randperm, multinomial, poisson, nn.ELU, nn.SELU, nn.CELU, nn.LogSigmoid, nn.Hardsigmoid, nn.Hardshrink, nn.Softshrink, nn.Hardswish, nn.Softplus, nn.SmoothL1Loss, histc, atan2, logcumsumexp, diag, fmod, cumsum, cumprod, nn.utils.weight_norm, nn.BatchNorm2d}` and allowed autocast to be enabled for them ([#63634](https://github.com/pytorch/pytorch/pull/63634), [#58297](https://github.com/pytorch/pytorch/pull/58297), [#61944](https://github.com/pytorch/pytorch/pull/61944), [#63215](https://github.com/pytorch/pytorch/pull/63215), [#62546](https://github.com/pytorch/pytorch/pull/62546), [#63134](https://github.com/pytorch/pytorch/pull/63134), [#72694](https://github.com/pytorch/pytorch/pull/72694), [#61897](https://github.com/pytorch/pytorch/pull/61897), [#73845](https://github.com/pytorch/pytorch/pull/73845), [#74410](https://github.com/pytorch/pytorch/pull/74410), [#68725](https://github.com/pytorch/pytorch/pull/68725))
    * Improved `torch.nn.functional.log_softmax` on CPU when dim != -1 on both float32 and bfloat16 ([_#64726_](https://github.com/pytorch/pytorch/pull/64726))
    * Improved `torch.nn.functional.layer_norm` bfloat16 performance on CPU ([_#71376_](https://github.com/pytorch/pytorch/pull/71376))
    * Improved autocast cpu documentation ([_#68567_](https://github.com/pytorch/pytorch/pull/68567))
* Channels last
    * Added channels-last support for `torch.nn.{Conv2d (slow_conv_dilated2d and thnn_conv2d kernels, mkldnn backend), GroupNorm, PixelShuffle, PixelUnshuffle}` ([#70665](https://github.com/pytorch/pytorch/pull/70665), [#68101](https://github.com/pytorch/pytorch/pull/68101), [#55584](https://github.com/pytorch/pytorch/pull/55584), [#50573](https://github.com/pytorch/pytorch/pull/50573), [#555864](https://github.com/pytorch/pytorch/pull/555864))
* OneDNN
    * Upgraded oneDNN to v2.6.0, ([_#75398_](https://github.com/pytorch/pytorch/pull/75398))
    * Added JIT graph fuser for oneDNN Graph API (v0.5) ([_#76622_](https://github.com/pytorch/pytorch/pull/76622))
* Quantization
    * Improved {`qcat_nhwc, qupsample_bilinear2d, qupsample_nearest2d, qbatch_norm2d, qmax_pool2d, qavg_pool2d`} performance on multi-core ([#69667](https://github.com/pytorch/pytorch/pull/69667), [#69601](https://github.com/pytorch/pytorch/pull/69601), [#69600](https://github.com/pytorch/pytorch/pull/69600), [#69599](https://github.com/pytorch/pytorch/pull/69599), [#69598](https://github.com/pytorch/pytorch/pull/69598), [#69517](https://github.com/pytorch/pytorch/pull/69517))
    * Added oneDNN as a backend for quantization ([#69820](https://github.com/pytorch/pytorch/pull/69820))
* Improved `torch.{norm, argmax, argmin, scatter, gather}` performance on CPU ([#64479](https://github.com/pytorch/pytorch/pull/64479), [#64478](https://github.com/pytorch/pytorch/pull/64478))
* Improved `torch.nn.functional.{log_softmax, softmax}` performance on CPU ([#73953](https://github.com/pytorch/pytorch/pull/73953))
* Expanded graph rewrite to handle `conv_transpose3d` ([_#76888_](https://github.com/pytorch/pytorch/pull/76888))
* Expanded coverage of convolution folding in conv→mul→add→bn ([_#75724_](https://github.com/pytorch/pytorch/pull/75724))
* Added MKLDNN support for `PReLU` ([_#60427_](https://github.com/pytorch/pytorch/pull/60427))
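
A short sketch of using these kernels through CPU autocast (illustrative; `Linear` is just one of the ops that runs in bfloat16 under autocast):

```python
import torch

model = torch.nn.Linear(16, 16)
x = torch.randn(2, 16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)  # torch.bfloat16
```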

## Composability 

* Added `torch.nn.init` to list of functions overridable by `__torch_function__` ([#76014](https://github.com/pytorch/pytorch/pull/76014))
* Relaxed dtype restrictions on `torch.Tensor` ([#73850](https://github.com/pytorch/pytorch/pull/73850))

## Profiler

* Enabled iteration tracking for kineto ([#72292](https://github.com/pytorch/pytorch/pull/72292))
* Added support for input sequence ID tracking for NVTX profiler ([#70264](https://github.com/pytorch/pytorch/pull/70264))
* Re-enabled user-annotations in PyTorch ([#75601](https://github.com/pytorch/pytorch/pull/75601))
* Added support to configure Kineto CUPTI profiler from PyTorch profiler interface ([#75616](https://github.com/pytorch/pytorch/pull/75616))

## Vulkan

* Added an interface to obtain execution time data for GPU shader kernels when executing Vulkan operators ([#75829](https://github.com/pytorch/pytorch/pull/75829))

## Mobile

* Improved Android instrumentation test and update README ([#72736](https://github.com/pytorch/pytorch/pull/72736))
* Improved unsupported scalar type error message for Android ([#74660](https://github.com/pytorch/pytorch/pull/74660))



## JIT/TorchScript

* `torch.jit.trace` now treats `tensor.numel()` as `aten::numel`, instead of a constant value ([#74081](https://github.com/pytorch/pytorch/pull/74081))
* When printing the type of a JIT Dict with a tuple key, we now print out the types of the tuple if it is simple ([#76164](https://github.com/pytorch/pytorch/pull/76164))
* Added basic support for complex numbers in JIT; we now support `op(complex, Tensor)` for the following ops: add (+), mul (*), eq (==), ne (!=), sub (-), div (/) ([#73286](https://github.com/pytorch/pytorch/pull/73286))
* TorchScript now preserves the original exception message when rethrowing a Python-based exception ([#77093](https://github.com/pytorch/pytorch/pull/77093))
* Modified the conditions for conv folding in `torch.jit.freeze` to allow for folding arguments that can be promoted to floating point (eg integer tensor arguments) ([#73278](https://github.com/pytorch/pytorch/pull/73278))
* Reduced size of JIT debug.pkl files by only storing unique traces ([#76688](https://github.com/pytorch/pytorch/pull/76688))
* `torch.jit.save` and `torch.jit.load` are now supported for meta tensors ( aka `torch.Tensor(device="meta")`) ([#73435](https://github.com/pytorch/pytorch/pull/73435))

## Architecture Optimization

* Added default symmetric qconfig for QNNPACK ([#74396](https://github.com/pytorch/pytorch/pull/74396))

## Quantization

* Core (Quantized Tensor, Operator, Modules)
    * Added QAT fused `Linear-Bn1d` ([#72431](https://github.com/pytorch/pytorch/pull/72431), [#72796](https://github.com/pytorch/pytorch/pull/72796))
    * Added 4 bit support for embedding quantized module (re-land PR 69769) ([#72276](https://github.com/pytorch/pytorch/pull/72276))
    * Enabled slicing on per-channel quantized tensors (support is limited to a contiguous sliced tensor) and added a corresponding test case ([#71269](https://github.com/pytorch/pytorch/pull/71269))
    * Added `qint32` quantization support ([#72472](https://github.com/pytorch/pytorch/pull/72472))
    * Added explicit entries for functional and module conv and linear support into `get_default_qconfig_dict` & `get_default_qat_qconfig_dict` ([#73528](https://github.com/pytorch/pytorch/pull/73528))
    * Added default symmetric QAT qconfig for QNNPACK ([#74507](https://github.com/pytorch/pytorch/pull/74507))
    * Added Quantized `Matmul` Op (Naive Implementation) ([#71783](https://github.com/pytorch/pytorch/pull/71783))
    * Added Quantized `Softmax` Op (Naive Implementation) ([#75415](https://github.com/pytorch/pytorch/pull/75415))
    * Using QNNPACK in Quantized `Softmax` Op ([#75799](https://github.com/pytorch/pytorch/pull/75799))
* Eager Mode Quantization
    * Added 4 bit support for eager mode quantization flow ([#72277](https://github.com/pytorch/pytorch/pull/72277))
* FX Graph Mode Quantization
    * Added workflow support for `torch.matmul` quantization ([#72444](https://github.com/pytorch/pytorch/pull/72444))
    * Added support `conv1d` and its fusion variants in QAT ([#74506](https://github.com/pytorch/pytorch/pull/74506))
    * Decoupled `prepare_*fx` from training/eval modes ([#75401](https://github.com/pytorch/pytorch/pull/75401))
    * Added quantized Softmax workflow integration ([#75106](https://github.com/pytorch/pytorch/pull/75106))
    * Renamed `default_affine_fixed_qparams_observer` and `default_symmetric_fixed_qparams_observer` ([#76637](https://github.com/pytorch/pytorch/pull/76637))

## ONNX

* Updated default `opset_version` to 13. The previous default was 9. To get the old behavior, just specify `opset_version=9` when calling `torch.onnx.export` (a minimal export call is sketched after this list). Going forward we plan to update the default regularly to "latest as of 18 months ago". ([#73898](https://github.com/pytorch/pytorch/issues/73898))
* De-duplicated initializers to reduce ONNX model size for shared parameters ([#69547,](https://github.com/pytorch/pytorch/pull/69547) [#74247](https://github.com/pytorch/pytorch/pull/74247))
* Changed to capture annotated attributes for local function ([#72883](https://github.com/pytorch/pytorch/pull/72883))
* Improved error and warning messages ([#71342](https://github.com/pytorch/pytorch/pull/71342), [#73255](https://github.com/pytorch/pytorch/pull/73255), [#73770](https://github.com/pytorch/pytorch/pull/73770), [#73265](https://github.com/pytorch/pytorch/pull/73265))
* Added support for exporting `torch.minimum` with different dtype combinations ([#76022](https://github.com/pytorch/pytorch/issues/76022))
* Improved `Expand` shape inference ([#72985](https://github.com/pytorch/pytorch/pull/72985))
* Added broadcast to `matmul` shape inference ([#72990](https://github.com/pytorch/pytorch/pull/72990))
* Rewrote linspace symbolic to improve numerical stability ([#73610](https://github.com/pytorch/pytorch/pull/73610))
* Enabled `topk` export with non-int64 k ([#73761](https://github.com/pytorch/pytorch/pull/73761))
* Enabled `numel` tracing ([#74081](https://github.com/pytorch/pytorch/pull/74081))
* Added constant folding for `onnx::ReduceProd` ([#74082](https://github.com/pytorch/pytorch/pull/74082))
* Added support for equality checks on devices ([#77203](https://github.com/pytorch/pytorch/issues/77203))
* Added support for dynamic dimensions in `Squeeze` and `Unsqueeze` ([#73104](https://github.com/pytorch/pytorch/pull/73104))
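
With the default bump described above, a plain export now targets opset 13 unless `opset_version` is passed explicitly (sketch; the model, input, and in-memory buffers are arbitrary):

```python
import io
import torch

model = torch.nn.Linear(3, 2)
x = torch.randn(1, 3)

torch.onnx.export(model, (x,), io.BytesIO())                   # opset_version=13 by default
torch.onnx.export(model, (x,), io.BytesIO(), opset_version=9)  # old behavior, if needed
```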

## torch.package

* Added Python version to `torch.package` metadata ([#74610](https://github.com/pytorch/pytorch/pull/74610))
* Added utility for determining where bad modules may come from ([#74998](https://github.com/pytorch/pytorch/pull/74998))

## Distributed

* torch.distributed
    * Refactored `TORCH_DISTRIBUTED_DEBUG` implementation ([#73166](https://github.com/pytorch/pytorch/pull/73166))
    * Set default value of TCPStore world_size to None in pybind definition ([#77277](https://github.com/pytorch/pytorch/pull/77277))
    * Added orthogonalization with QR factorization ([#72043](https://github.com/pytorch/pytorch/pull/72043))
    * Added pickling support for WorkerInfo ([#73371](https://github.com/pytorch/pytorch/pull/73371))
    * Added support for RRefs that contain `threading.Thread` ([#74462](https://github.com/pytorch/pytorch/pull/74462))
    * Added check for mismatch in number of parameters in `verify_params_across_processes` ([#74113](https://github.com/pytorch/pytorch/pull/74113))
    * Added support for backend to register reducer timer ([#71700](https://github.com/pytorch/pytorch/pull/71700))
    * Made ProcessGroupNCCL load torch_ucc.so when TORCH_UCC_LIBRARY_PATH is set ([#69552](https://github.com/pytorch/pytorch/pull/69552))
    * Added support for non-contiguous inputs for `nn.functional.all_gather/reducescatter/gather` ([#75276](https://github.com/pytorch/pytorch/pull/75276))
    * Added the use of batched operations for PowerSGD ([#76041](https://github.com/pytorch/pytorch/pull/76041))
    * Changed to create UCC ProcessGroup when `ucc_lib` available ([#69564](https://github.com/pytorch/pytorch/pull/69564))
    * Changed to generalize param verification and broadcast ([#76374](https://github.com/pytorch/pytorch/pull/76374))
    * Changed to use a more reliable signaling mechanism to stop TCPStore background threads ([#76973](https://github.com/pytorch/pytorch/pull/76973))
    * Added support to disabling post-local gradient sync ([#76723](https://github.com/pytorch/pytorch/pull/76723))
    * Removed call into Python API without GIL being held in c10d ([#72928](https://github.com/pytorch/pytorch/pull/72928))
* FullyShardedDataParallel
    * Fixed `summon_full_params` when not sharded ([#72572](https://github.com/pytorch/pytorch/pull/72572))
    * Fixed 0-dim tensor optim state device ([#75243](https://github.com/pytorch/pytorch/pull/75243))
    * Fixed the synchronization of `all_gather` stream in `summon_full_params` ([#73314](https://github.com/pytorch/pytorch/pull/73314))
    * Added state_dict() save/reload in parity test ([#73366](https://github.com/pytorch/pytorch/pull/73366))
    * Changed to use `unflatten_parameter` in `_summon_full_parameters` ([#72467](https://github.com/pytorch/pytorch/pull/72467))
    * Changed to use `summon_full_params` in `get_full_params` ([#73242](https://github.com/pytorch/pytorch/pull/73242))
    * Added generic arguments for `state_dict` ([#73323](https://github.com/pytorch/pytorch/pull/73323))
    * Added generic argument forward for `load_local_state_dict` ([#73325](https://github.com/pytorch/pytorch/pull/73325))
    * Made `summon_full_params` a public method ([#73116](https://github.com/pytorch/pytorch/pull/73116))
    * Generalized `fsdp_modules()` ([#73553](https://github.com/pytorch/pytorch/pull/73553))
    * Introduced a utility API to allow users easily to set `state_dict_type` ([#73716](https://github.com/pytorch/pytorch/pull/73716))
    * Added an option to summon on rank 0 only in  `summon_full_params` ([#73903](https://github.com/pytorch/pytorch/pull/73903))
    * Enabled offload full params to CPU in `summon_full_params` ([#73904](https://github.com/pytorch/pytorch/pull/73904))
    * Removed `_lazy_init()` in rebuild full params ([#74263](https://github.com/pytorch/pytorch/pull/74263))
    * Changed to override `named_parameters()` for clean names in `summon_full_params()` ([#74333](https://github.com/pytorch/pytorch/pull/74333))
    * Changed to strip FSDP info in `summon_full_params` context, similar to `named_params` in `named_buffers` ([#74517](https://github.com/pytorch/pytorch/pull/74517))
    * Changed to use param name as key in `full_optim_state_dict` ([#74879](https://github.com/pytorch/pytorch/pull/74879))
    * Enabled re-key between param names/IDs for `full_optim_state_dict` ([#74912](https://github.com/pytorch/pytorch/pull/74912))
    * Changed to register `state_dict` hooks for `FlatParamsWrapper` even if params_list is empty ([#74860](https://github.com/pytorch/pytorch/pull/74860))
    * Made `apply_to_tensors` support `OrderedDict` type ([#75560](https://github.com/pytorch/pytorch/pull/75560))
    * Added `rank0_only` to `full_optim_state_dict()` ([#75516](https://github.com/pytorch/pytorch/pull/75516))
    * Made `summon_full_params` a static method ([#75423](https://github.com/pytorch/pytorch/pull/75423))
    * Added support for PackedSequence type for `apply_for_tensors` ([#76265](https://github.com/pytorch/pytorch/pull/76265))
    * Made mixed precision API configurable ([#76423](https://github.com/pytorch/pytorch/pull/76423))
    * Validated exec order using `compute_device` ([#76664](https://github.com/pytorch/pytorch/pull/76664))
    * Improved dict inversion in `_get_param_name_to_param` to be faster ([#76665](https://github.com/pytorch/pytorch/pull/76665))
    * Changed to ignore params if not in `Optim` state dict ([#76671](https://github.com/pytorch/pytorch/pull/76671))
    * Changed to include buffers in `ignored_modules` ([#76784](https://github.com/pytorch/pytorch/pull/76784))
    * Moved param/buffer name computation to constructor for `ignored_modules` ([#76994](https://github.com/pytorch/pytorch/pull/76994))
    * Changed to not clone buffers and ensure that we offload buffers to CPU if specified ([#77000](https://github.com/pytorch/pytorch/pull/77000))
    * Added a profiling range for `FSDP.forward` ([#76899](https://github.com/pytorch/pytorch/pull/76899))
    * Disabled the default behavior of moving CPU module to GPU ([#77720](https://github.com/pytorch/pytorch/pull/77720))
    * Fixed `_get_param_to_unflat_param_names()` for shared params ([#75430](https://github.com/pytorch/pytorch/pull/75430))
* ShardedTensor (prototype)
    * Changed to use absolute imports for ShardMetadata instead ([#73678](https://github.com/pytorch/pytorch/pull/73678))
    * Fixed the metadata error in `init_from_local_shards` with deepcopy ([#73400](https://github.com/pytorch/pytorch/pull/73400))
    * Fixed view op and matrix ops unit test ([#77706](https://github.com/pytorch/pytorch/pull/77706))
* torch.distributed.rpc
    * Improved logging for 'unknown destination worker' errors ([#75811](https://github.com/pytorch/pytorch/pull/75811))
    * Improved logging for store.wait error ([#76548](https://github.com/pytorch/pytorch/pull/76548))
    * Added support for RPC Meta device ([#76882](https://github.com/pytorch/pytorch/pull/76882))
    * Changed to keep stacktrace when rewriting AttributeError ([#73720](https://github.com/pytorch/pytorch/pull/73720))
* DistributedDataParallel
    * Improved debug level and logging ([#72455](https://github.com/pytorch/pytorch/pull/72455))
    * Removed bucket replicas ([#73567](https://github.com/pytorch/pytorch/pull/73567))
    * Made `HierarchicalModelAverager` a subclass of `averagers.ModelAverager` ([#74564](https://github.com/pytorch/pytorch/pull/74564))
    * Made code simplification for `_find_process_group` function ([#75007](https://github.com/pytorch/pytorch/pull/75007))
    * Made distributed raise `ImportError` when not available ([#75975](https://github.com/pytorch/pytorch/pull/75975))
* torch.distributed.elastic
    * Created a final agent barrier to shut down the process properly ([#74931](https://github.com/pytorch/pytorch/pull/74931))

# Bug fixes

## Python API

* Fixed type promotion for `torch.where` ([#76691](https://github.com/pytorch/pytorch/pull/76691))
* Fixed `torch.clamp` to correctly propagate nans ([#77306](https://github.com/pytorch/pytorch/pull/77306))
* Fixed `torch.unique` to preserve input size when dim is zero-length ([#75764](https://github.com/pytorch/pytorch/pull/75764))
* Fixed `torch.ravel` to also return contiguous outputs for non-contiguous inputs ([#71771](https://github.com/pytorch/pytorch/pull/71771))
* Fixed `CosineAnnealingLR` to resume last learning rate on restart ([#60339](https://github.com/pytorch/pytorch/pull/60339))
* Fixed invalid shape error for `torch.fft.{irfft2,irfft2} `([#73012](https://github.com/pytorch/pytorch/pull/73012))
* Fixed `torch.set_default_dtype` to no longer crash with invalid dtype ([#72405](https://github.com/pytorch/pytorch/pull/72405))
* Fixed `torch.tril` edge case ([#75335](https://github.com/pytorch/pytorch/pull/75335))
* Fixed `torch.broadcast_shapes` to not handle shapes with negative dimensions. ([#72999](https://github.com/pytorch/pytorch/pull/72999))
* Fixed `torch.logsumexp` integral to float type promotion ([#77480](https://github.com/pytorch/pytorch/pull/77480))
* Fixed `torch.amax` and `torch.amin` for empty tensors if dim arg not provided. ([#73914](https://github.com/pytorch/pytorch/pull/73914))
* Disallowed calling `.tolist` on tensors with nullptr storage ([#75990](https://github.com/pytorch/pytorch/pull/75990))
* Fixed `.tolist` to work correctly for 0-element tensors ([#76335](https://github.com/pytorch/pytorch/pull/76335))
* Adjusted the stubs for PyCharm autocompletion of the `Tensor` methods. ([#76712](https://github.com/pytorch/pytorch/pull/76712))
* Fixed `Optimizer.zero_grad` type annotation ([#76998](https://github.com/pytorch/pytorch/pull/76998))
* Fixed  `torch.distributions.lkj_cholesky` device error ([#73980](https://github.com/pytorch/pytorch/pull/73980))
* Fixed misplaced type annotation for `torch.distributions.transforms.CatTransform` ([#73747](https://github.com/pytorch/pytorch/pull/73747))
* Fixed `torch.clamp` scalar overloads to propagate nan ([#77371](https://github.com/pytorch/pytorch/pull/77371))
* Fixed advanced indexing assignment when  `use_deterministic_algorithms(True)` for non-contiguous tensors ([#76220](https://github.com/pytorch/pytorch/pull/76220))
* Fixed `**=` operator ([#76900](https://github.com/pytorch/pytorch/pull/76900))
* Fixed `to` to properly support permutation ([#77610](https://github.com/pytorch/pytorch/pull/77610))

## C++ API

* Used the same checks in all `grid_sampler` functions ([#75164](https://github.com/pytorch/pytorch/pull/75164))
* Fixed `mean` bug for integral tensors ([#76584](https://github.com/pytorch/pytorch/pull/76584))
* Added missing import to fix crash on loading cpp extension ([#75736](https://github.com/pytorch/pytorch/pull/75736))

## Autograd

* Fixed forward AD formula for `torch.angle` ([#77267](https://github.com/pytorch/pytorch/pull/77267))
* Fixed `torch.{minimum, maximum}` forward AD formula for float32 ([#75277](https://github.com/pytorch/pytorch/pull/75277))
* Fixed forward-mode AD formula for `torch.nn.functional.binary_cross_entropy_with_logits` ([#76322](https://github.com/pytorch/pytorch/pull/76322))
* Fixed gradients for norm related ops at zero when p < 1 to mask out nans ([#75103](https://github.com/pytorch/pytorch/pull/75103))
* Fixed forward-over-reverse for convolution to no longer fail in some cases ([#75298](https://github.com/pytorch/pytorch/pull/75298))
* Fixed `torch.autograd.gradcheck` to run with requires_grad=False when `check_forward_ad=True` ([#72309](https://github.com/pytorch/pytorch/pull/72309))
* Fixed `requires_grad`-ness to be propagated for all backends when tensors are deep-copied ([#76256](https://github.com/pytorch/pytorch/pull/76256))
* Fixed `torch.autograd.grad` to automatically add the extra tuple wrapping needed when handling single outputs with `is_grads_batched=True` ([#75779](https://github.com/pytorch/pytorch/pull/75779))
* Updated forward AD metadata check to skip stride check when size is 0 ([#77269](https://github.com/pytorch/pytorch/pull/77269))
* Fixed a deadlock in an edge case in autograd ([#73961](https://github.com/pytorch/pytorch/pull/73961))
* Allowed forking until a worker thread is created in the autograd engine ([#72689](https://github.com/pytorch/pytorch/pull/72689))
* Removed some spurious warnings in the autograd engine  ([#72542](https://github.com/pytorch/pytorch/pull/72542))
* Fixed issue with `torch.utils.checkpoint.checkpoint` when both `use_reentrant` and `preserve_rng_state` are set to `False` ([#76890](https://github.com/pytorch/pytorch/pull/76890))
* Fixed Python indexing set-item with a scalar tensor to preserve the autograd graph ([#78746](https://github.com/pytorch/pytorch/pull/78746))

## Build

* Added TORCH_CUDA_CU_API to CUDABlas functions ([#72340](https://github.com/pytorch/pytorch/pull/72340))
* Fixed doc build for release branches ([#72567](https://github.com/pytorch/pytorch/pull/72567))
* Moved AndroidNightly to GHA ([#74243](https://github.com/pytorch/pytorch/pull/74243))
* Changed `numModules` type to `unsigned` ([#74978](https://github.com/pytorch/pytorch/pull/74978))
* In Kineto, changed to not search for CUPTI in default paths ([#76188](https://github.com/pytorch/pytorch/pull/76188))
* Changed to use TensorPipe libuv in Gloo ([#77312](https://github.com/pytorch/pytorch/pull/77312))

## Complex Numbers

* Fixed segmentation fault when real and imaginary attributes of a tensor are set to a number ([#73867](https://github.com/pytorch/pytorch/pull/73867))
* Fixed complex to real casting warning in the backward’s pass for Real→Complex `copy` ([#75805](https://github.com/pytorch/pytorch/pull/75805))
* Made `torch.addcmul` and `torch.addcdiv` support mixing complex and non-complex type args ([#74234](https://github.com/pytorch/pytorch/pull/74234))
* Fixed `torch.isfinite` for complex to avoid overflow when real and imaginary values are finite but abs is infinite ([#76606](https://github.com/pytorch/pytorch/pull/76606)).
* Fixed complex abs/angle output format ([#77585](https://github.com/pytorch/pytorch/pull/77585))

## Dataloader

* Reset worker cycle for persistent DataLoader to ensure determinism across epochs ([#73675](https://github.com/pytorch/pytorch/pull/73675))

## LinAlg

* Fixed SVD error code handling for OpenBLAS 0.3.15+ and MKL 2022+ ([#72357](https://github.com/pytorch/pytorch/pull/72357))
* Fixed addmm_cpu for int64 ([#75200](https://github.com/pytorch/pytorch/pull/75200))

## Meta API

* Fixed meta kernel for `normal_` when `std` is equal to 0 ([#70085](https://github.com/pytorch/pytorch/pull/70085))
* Fixed meta kernel for `torch.kaiser_window` when `window_length > 1` ([#73733](https://github.com/pytorch/pytorch/pull/73733))
* Fixed meta kernel for `normal` ([#77740](https://github.com/pytorch/pytorch/pull/77740))

## torch.nn

* `F.pad`: Silence error when unused fill value is zero ([#76307](https://github.com/pytorch/pytorch/pull/76307))
* `nn.{Conv1d, Conv2d, Conv3d}`: Properly initialize `grad_weight` in `raw_cudnn_convolution_backward_weight_out` ([#72157](https://github.com/pytorch/pytorch/pull/72157))
* `nn.Conv2d`: Fix channels last propagation for naive algorithm ([#77347](https://github.com/pytorch/pytorch/pull/77347))
* `nn.ConvTranspose*d`: Fix to support no-batch-dim inputs with `output_size` ([#76151](https://github.com/pytorch/pytorch/pull/76151))
* `nn.CrossEntropyLoss`: Support no-batch-dim input with probability target ([#77653](https://github.com/pytorch/pytorch/pull/77653))
* `nn.CrossEntropyLoss`: Fix to avoid floating point exception for zero-size inputs ([#73837](https://github.com/pytorch/pytorch/pull/73837))
* `nn.GroupNorm`: Ensure `num_groups > 0` in `native_group_norm` ([#75270](https://github.com/pytorch/pytorch/pull/75270))
* `nn.MaxPool2d`: Properly support dilation in channels last kernel ([#76597](https://github.com/pytorch/pytorch/pull/76597))
* `nn.ParameterList`: Fix `__dir__` implementation ([#74997](https://github.com/pytorch/pytorch/pull/74997))
* `nn.{ParameterList, ParameterDict}`: Support containing any kind of object ([#70499](https://github.com/pytorch/pytorch/pull/70499))
* `nn.RReLU`: Fix to support empty tensor inputs ([#70496](https://github.com/pytorch/pytorch/pull/70496))
* `nn.utils.rnn.pad_sequence`: Fix regression; support tensor input for `sequences` ([#72436](https://github.com/pytorch/pytorch/pull/72436))
* `nn.utils.stateless.functional_call`: Properly support setting attributes during forward ([#77137](https://github.com/pytorch/pytorch/pull/77137))

## torch.fx

* Core
    * Made `map_aggregate`/`map_arg` work for NamedTuple ([#73198](https://github.com/pytorch/pytorch/pull/73198))
    * Fixed tracing for OpOverload ([#73940](https://github.com/pytorch/pytorch/pull/73940))
    * Fixed codegen for bare generic type annotations ([#74135](https://github.com/pytorch/pytorch/pull/74135))
    * Modified `__deepcopy__` to also copy _codegen ([#75851](https://github.com/pytorch/pytorch/pull/75851))
    * Fixed unnecessary recursion in `GraphModule.__call__` ([#76068](https://github.com/pytorch/pytorch/pull/76068))
    * Changed to prevent infinite recursion in GraphModule ([#73866](https://github.com/pytorch/pytorch/pull/73866))
    * Changed to preserve codegen on FX graph in transformer ([#74189](https://github.com/pytorch/pytorch/pull/74189))
* operator_schemas
    * Added back check for OpOverload ([#73978](https://github.com/pytorch/pytorch/pull/73978))
    * Fixed normalize_function to consider OpOverloads ([#76469](https://github.com/pytorch/pytorch/pull/76469))
    * Fixed signature normalization for op overloads ([#77182](https://github.com/pytorch/pytorch/pull/77182))
* For testing, added `super()` calls for FX TestCases ([#74216](https://github.com/pytorch/pytorch/pull/74216))
* For split_module, made split_module preserve proper placeholder names ([#74736](https://github.com/pytorch/pytorch/pull/74736))

## Sparse

* Fixed ignored beta value for sparse inputs to `torch.addmm` with non-MKL build ([#72430](https://github.com/pytorch/pytorch/pull/72430))
* Fixed float16/bf16 support for sparse inputs to `torch.addmm` ([#72559](https://github.com/pytorch/pytorch/pull/72559))
* Fixed CUDA error for `torch.mul` when given COO Tensors with zero sized dense dimensions ([#73428](https://github.com/pytorch/pytorch/pull/73428))
* Fixed incorrect results of `torch.sparse.sampled_addmm` for noncontiguous inputs ([#76590](https://github.com/pytorch/pytorch/pull/76590))
* Fixed runtime generation of doc strings for torch._masked functions by making them static instead ([#72865](https://github.com/pytorch/pytorch/pull/72865))

## CUDA

* Created jiterator cache dirs recursively ([#74592](https://github.com/pytorch/pytorch/pull/74592))
* Fixed bincount to use acc scalar for the bounds ([#76979](https://github.com/pytorch/pytorch/pull/76979))
* Avoid `collections` deprecation warning ([#72239](https://github.com/pytorch/pytorch/pull/72239))
* Disabled cuBLASLt when batch is too large. ([#73533](https://github.com/pytorch/pytorch/pull/73533))
* Abated spurious resize warnings in `MultiMarginLoss` on CUDA ([#75000](https://github.com/pytorch/pytorch/pull/75000))
* Added missing AT_CUDA_CHECK in CUDAGraph.cpp ([#74392](https://github.com/pytorch/pytorch/pull/74392))
* CUDA graphs
    * Fixed OOM inside graph capture_begin ([#76247](https://github.com/pytorch/pytorch/pull/76247))
    * Changed to allow Adam and AdamW to be capture-safe ([#77862](https://github.com/pytorch/pytorch/pull/77862))

## Intel

* Fixed Caffe2 convolution issue in AVX512 when using oneDNN v2.5.2 ([_#73290_](https://github.com/pytorch/pytorch/pull/73290))

## Composability 

* Fixed formatting of scalar tensors for the `meta` device (don't call item) ([#74376](https://github.com/pytorch/pytorch/pull/74376))
* Fixed metadata preservation for Python tensor subclasses: preserve Python dispatch keys when copying tensor metadata ([#75644](https://github.com/pytorch/pytorch/pull/75644))
* Fixed data race on `TensorImpl::wns_pyobj_` accesses with non-GIL protected threads ([#75563](https://github.com/pytorch/pytorch/pull/75563))
* Fixed an issue where the Python garbage collector could deallocate a tensor even when C++ still held strong references to it ([#75933](https://github.com/pytorch/pytorch/pull/75933))
* Added better error checking to `TensorImpl::size_between_dim_`. ([#76719](https://github.com/pytorch/pytorch/pull/76719))
* Changed to ensure that `torch.memory_format` instances are singletons ([#77543](https://github.com/pytorch/pytorch/pull/77543))

## Profiler

* Avoided picking up old CUPTI headers ([#72761](https://github.com/pytorch/pytorch/pull/72761))
* Kineto submodule update and fixes ([#75206](https://github.com/pytorch/pytorch/pull/75206))
* Fixed segfault in AppendOnlyList ([#78084](https://github.com/pytorch/pytorch/pull/78084))

## Vulkan

* Fixed a bug in the Vulkan implementation of `aten::tanh` where inputs of large magnitudes would result in numerically unstable results ([#73107](https://github.com/pytorch/pytorch/pull/73107))
* Fixed a bug in the Vulkan implementation of `aten::add`, `aten::sub`, `aten::mul`, and `aten::div` where passing in a single element tensor as a second argument would result in an assertion error ([#73108](https://github.com/pytorch/pytorch/pull/73108))

## Mobile

* Changed to protect against threading errors when tracing models with parallel operators ([#73327](https://github.com/pytorch/pytorch/pull/73327))
* Changed to ensure error messages are preserved from Metal and CoreML Backend ([#77430](https://github.com/pytorch/pytorch/pull/77430), [#76236](https://github.com/pytorch/pytorch/pull/76263))
* Changed to ensure the iOS test app is working correctly ([#74090](https://github.com/pytorch/pytorch/pull/74090))
* Fixed off-by-one error in tupleIndex ([#72447](https://github.com/pytorch/pytorch/pull/72447))
* Fixed error in export of models containing nested NamedTuple ([#75996](https://github.com/pytorch/pytorch/pull/75996))

## Distributed

* torch.distributed
    * Fixed process group wrapper check for Gloo ([#72657](https://github.com/pytorch/pytorch/pull/72657))
    * Changed to catch CUDA library runtime errors (driver shutting down) during ProcessGroup exit ([#74258](https://github.com/pytorch/pytorch/pull/74258))
    * Fixed NCCL version string ([#73333](https://github.com/pytorch/pytorch/pull/73333))
    * Added retries for DNS lookup failures ([#74641](https://github.com/pytorch/pytorch/pull/74641))
    * Validated that tensors are contiguous in ProcessGroupNCCL ([#77809](https://github.com/pytorch/pytorch/pull/77809))
    * Fixed sign-compare in c10d/Utils.hpp ([#75081](https://github.com/pytorch/pytorch/pull/75081))
    * Fixed NCCL gather outputs on non-root ranks ([#75535](https://github.com/pytorch/pytorch/pull/75535))
    * Fixed batch_isend_irecv ([#74701](https://github.com/pytorch/pytorch/pull/74701))
    * Disabled RPC profiling for kineto profilers ([#76234](https://github.com/pytorch/pytorch/pull/76234))
    * Fixed typo in generated module name ([#76880](https://github.com/pytorch/pytorch/pull/76880))
    * Fixed broadcast for channels-last tensors ([#79071](https://github.com/pytorch/pytorch/pull/79071))
* DistributedDataParallel
    * Disabled bucketing for the first iteration ([#72843](https://github.com/pytorch/pytorch/pull/72843))
    * Fixed SyncBatchNorm for empty inputs ([#74944](https://github.com/pytorch/pytorch/pull/74944))
    * Added a guard for non CPU/CUDA devices ([#75247](https://github.com/pytorch/pytorch/pull/75247))
    * Fixed bug where `__getstate__` of DDP looked for `self._replicated_tensor_module` when not using ReplicatedTensor ([#76349](https://github.com/pytorch/pytorch/pull/76349))
    * Fixed post_localSGD_optimizer by calling optim.step only once when there are multiple param groups or params ([#74737](https://github.com/pytorch/pytorch/pull/74737))
    * Fixed PostLocalSGDOptimizer and ModelAverager averaging ([#74894](https://github.com/pytorch/pytorch/pull/74894))
* ShardedTensor (prototype)
    * Fixed sharding spec inference to avoid inferring an invalid chunk sharding as ChunkShardingSpec ([#75296](https://github.com/pytorch/pytorch/pull/75296))
* FullyShardedDataParallel
    * Fixed no_sync() + FULL_SHARD root all-gather behavior ([#75901](https://github.com/pytorch/pytorch/pull/75901))
    * Fixed exec order validation (static variable issue) ([#76273](https://github.com/pytorch/pytorch/pull/76273))
    * Fixed local_state_dict and state_dict_type bugs ([#77101](https://github.com/pytorch/pytorch/pull/77101))
    * Fixed FSDP wrapping for batchnorm when mixed precision is enabled ([#77234](https://github.com/pytorch/pytorch/pull/77234))
    * Fixed CheckpointWrapper state_dict so that wrapped modules can be loaded into a non-checkpointed wrapped module ([#77224](https://github.com/pytorch/pytorch/pull/77224))
    * Changed to relax execution order validation to only the forward pass ([#76556](https://github.com/pytorch/pytorch/pull/76556))
    * Changed to not check forward order in eval mode ([#77195](https://github.com/pytorch/pytorch/pull/77195))
    * Changed to pass device_id into recursive_wrap for FSDP ([#77491](https://github.com/pytorch/pytorch/pull/77491))

## JIT/TorchScript

* `torch.jit.fuser("fuser1")` is supposed to enable NNC fusion, but it previously enabled only GPU fusion; this change enables CPU fusion as well (see the sketch after this list) ([#74078](https://github.com/pytorch/pytorch/pull/74078))
* Fixed bug where a Python ternary expression (`x if y else z`) was not parsed as right-associative by `torch.jit.script` ([#68416](https://github.com/pytorch/pytorch/pull/68416))
* Removed the warning that TorchScript sparse tensor support is experimental ([#73874](https://github.com/pytorch/pytorch/pull/73874))
* Custom post-processing passes registered through `torch::jit::RegisterPass` now have access to profiled Tensor Type Specializations ([#71748](https://github.com/pytorch/pytorch/pull/71748))
* When registering a custom print handler for `prim::print()` inside `torch.deploy`, we restore the default print handler when all Python environments are destroyed to prevent errors from not having a Python environment. ([#74513](https://github.com/pytorch/pytorch/pull/74513))
* When running `torch.jit.freeze` on the backward passes of conv (`conv_bn`) with reduced precision (e.g. `bfloat16`), fusions will respect the precision of the original op instead of promoting to `float32` ([#77042](https://github.com/pytorch/pytorch/pull/77042))
* Loosened `torch.jit.script` type checks that were too strict for the `torch.nn.LPPool2D` and `torch.nn.functional.lp_pool2d` functions ([#73287](https://github.com/pytorch/pytorch/pull/73287))
* `torch.nn.ParameterList` is now subscriptable in TorchScript  ([#75479](https://github.com/pytorch/pytorch/pull/75479))
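
For context, a minimal sketch of selecting the NNC fuser via the `torch.jit.fuser` context manager; the function and tensor shapes here are illustrative only and not taken from the PR:

```python
import torch

@torch.jit.script
def fused_op(a, b):
    return (a * b + b).relu()

a, b = torch.randn(1024), torch.randn(1024)
with torch.jit.fuser("fuser1"):  # "fuser1" selects NNC; with this fix it also applies to CPU graphs
    for _ in range(3):           # a few warm-up runs let the profiling executor apply fusion
        fused_op(a, b)
```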

## Quantization

* Fixed `get_module_type` for fusion ([#72735](https://github.com/pytorch/pytorch/pull/72735))
* Fixed bug in QuantWrapper with DeQuant qconfig ([#73671](https://github.com/pytorch/pytorch/pull/73671))
* Fixed observer insertion through dtype propagation ([#73274](https://github.com/pytorch/pytorch/pull/73274))
* Only do reference module swapping for floating point fused modules ([#74231](https://github.com/pytorch/pytorch/pull/74231))
* Fixed dynamic weighted op lowering when input is used multiple times ([#74364](https://github.com/pytorch/pytorch/pull/74364))
* Fixed `get_default_qconfig_dict` for fused modules ([#75838](https://github.com/pytorch/pytorch/pull/75838))
* Fixed bug for average pooling in FX quant ([#73054](https://github.com/pytorch/pytorch/pull/73054))
* Fixed FX QAT for untraceable modules ([#74277](https://github.com/pytorch/pytorch/pull/74277))
* Fixed `qmin`/`qmax` when using customized ‘qrange’ ([#74717](https://github.com/pytorch/pytorch/pull/74717))

## ONNX

* Fixed repeat interleave when repeats and dim is 1 ([#73760](https://github.com/pytorch/pytorch/pull/73760))
* Fixed ONNX gather shape inference ([#73607](https://github.com/pytorch/pytorch/pull/73607))
* Fixed 1d case flatten export ([#74595](https://github.com/pytorch/pytorch/pull/74595))
* Fixed opset_version checked before set ([#76928](https://github.com/pytorch/pytorch/pull/76928))
* Fixed an assertion failure involving Slice ([#72989](https://github.com/pytorch/pytorch/pull/72989))
* Fixed LSTM reshape shape inference regression ([#72532](https://github.com/pytorch/pytorch/pull/72532))
* Fixed Caffe2 ONNX export for environments with newer ONNX ([#75718](https://github.com/pytorch/pytorch/pull/75718/))
* Refactored test/onnx/test_onnx_export.py for better code reuse ([#76851](https://github.com/pytorch/pytorch/pull/76851))
* Fixed `aten::to("cpu")` and `aten::to(device="cpu")` ([#76498](https://github.com/pytorch/pytorch/pull/76498))
* Fixed BatchNormalization for invalid dtype ([#74875](https://github.com/pytorch/pytorch/pull/74875))
* Added Autocast support for `einsum` ([#71916](https://github.com/pytorch/pytorch/pull/71916))

## torch.package

* Deploy: added dummy metadata for builtin packages ([#76211](https://github.com/pytorch/pytorch/pull/76211))
* Enabled module modification during repackaging ([#71520](https://github.com/pytorch/pytorch/pull/71520))
* Added test case for repackaging parent module ([#72367](https://github.com/pytorch/pytorch/pull/72367))
* Fixed orderedimporter dummy package check ([#72533](https://github.com/pytorch/pytorch/pull/72533))
* Improved error message for module detection on saving pass ([#73106](https://github.com/pytorch/pytorch/pull/73106))
* Changed to allow torch/csrc/deploy/interpreter/Optional.hpp into the wheel distribution ([#74643](https://github.com/pytorch/pytorch/pull/74643))

# Performance

## Python API

* Improved `torch.topk` performance on CUDA ([#74267](https://github.com/pytorch/pytorch/pull/74267))
* Added SIMD horizontal reduce to improve `torch.log_softmax` and `torch.softmax` performance on CPU ([#73953](https://github.com/pytorch/pytorch/pull/73953))
* Made small optimizations for `torch.view` ([#72626](https://github.com/pytorch/pytorch/pull/72626))
* Optimized dim reduce performance on `torch.{norm, argmax, argmin}` ([#72083](https://github.com/pytorch/pytorch/pull/72083))
* Improved CPU performance for `torch.log_softmax` when dim != -1 on both float32 and bfloat16 ([#72163](https://github.com/pytorch/pytorch/pull/72163))
* Improved `torch.softmax` `dim=-1` performance on bfloat16 by adding more fusion ([#76278](https://github.com/pytorch/pytorch/pull/76278))
* Removed duplicate call to objective function in strong Wolfe line search in `L-BFGS` optimizer. ([#72773](https://github.com/pytorch/pytorch/pull/72773))

## Autograd

* Optimized code-generated in-place forward AD formulas ([#74017](https://github.com/pytorch/pytorch/pull/74017))
* Added a fast path for `torch.{stack, cat}` forward AD computation when tangents are zero-filled ([#75590](https://github.com/pytorch/pytorch/pull/75590))
* Reduced forward AD recomputation for `linalg.{eig,eigh,svd}` when function returns multiple outputs ([#75583](https://github.com/pytorch/pytorch/pull/75583))

## Sparse

* Improved performance of `index_select` for COO inputs on CPU ([#72710](https://github.com/pytorch/pytorch/pull/72710))
* Improved performance of `index_add` on CUDA ([#76996](https://github.com/pytorch/pytorch/pull/76996))

## Dataloader

* Improved the performance of `BatchSampler` ([#76951](https://github.com/pytorch/pytorch/pull/76951))

## AMD

* Enabled foreach fast path ([#74417](https://github.com/pytorch/pytorch/pull/74417))
* Reverted cat operator performance work-around ([#74129](https://github.com/pytorch/pytorch/pull/74129))

## CUDA

* Removed sync in embedding ([#70943](https://github.com/pytorch/pytorch/pull/70943))
* Added fused addmm path in linear for contiguous 3D input ([#72728](https://github.com/pytorch/pytorch/pull/72728))
* Changed to use cub 1.15's latest scan-by-key algorithm to replace thrust for `Embedding.cu` and `EmbeddingBag.cu` ([#66580](https://github.com/pytorch/pytorch/pull/66580))
* Changed to use `cub::DeviceSelect::UniqueByKey` for EmbeddingBackward ([#68376](https://github.com/pytorch/pytorch/pull/68376))
* Changed to use cuBLASLt interface for bias fusion ([#72148](https://github.com/pytorch/pytorch/pull/72148))
* Set the workspace size for the cuBLASLt interface to 1M ([#73439](https://github.com/pytorch/pytorch/pull/73439))
* Added fastAtomicAdd to scatter_add [v2] ([#75545](https://github.com/pytorch/pytorch/pull/75545))
* Added a new optimized cuDNN RNN algorithm for small RNN hidden_size ([#73211](https://github.com/pytorch/pytorch/pull/73211))
* Avoided CPU Sync in SyncBatchNorm When Capturing CUDA Graphs ([#78810](https://github.com/pytorch/pytorch/pull/78810)) ([_commit_](https://github.com/pytorch/pytorch/commit/2652da29ab6c0d690bfb543bee958f50c0b86451))
* Added Autocast CPU doc ([#68567](https://github.com/pytorch/pytorch/pull/68567))
* Documented CUDA 11.5 windows issue ([#73013](https://github.com/pytorch/pytorch/pull/73013))
* Added `__all__` for `torch.cuda.memory` ([#76490](https://github.com/pytorch/pytorch/pull/76490))

## Composability 

* Improved performance for forward-mode AD with `at::sub`: added ZeroTensor fast-path ([#75587](https://github.com/pytorch/pytorch/pull/75587))

## torch.nn

* `nn.EmbeddingBag`: Removed out-of-bounds check to improve CUDA performance ([#74767](https://github.com/pytorch/pytorch/pull/74767))
* `nn.GELU`: Added support for tanh-based approximation ([#61439](https://github.com/pytorch/pytorch/pull/61439))
* `nn.GroupNorm`: Improved channels last performance on CPU ([#69067](https://github.com/pytorch/pytorch/pull/69067))
* `nn.LayerNorm`: Improved bfloat16 performance on CPU ([#71376](https://github.com/pytorch/pytorch/pull/71376))
* `nn.LayerNorm`: Added mixed data type mode for forward path ([#73844](https://github.com/pytorch/pytorch/pull/73844))
* `nn.MultiheadAttention`: Fast path using nested tensors for inference under specific conditions ([#77924](https://github.com/pytorch/pytorch/pull/77924), [#77761](https://github.com/pytorch/pytorch/pull/77761))
* `nn.MultiheadAttention`: Fuse the `attn_mask` addition ([#73219](https://github.com/pytorch/pytorch/pull/73219), [#72871](https://github.com/pytorch/pytorch/pull/72871))
* `nn.MultiheadAttention`: Native fast path under specific conditions ([#75809](https://github.com/pytorch/pytorch/pull/75809), [#76333](https://github.com/pytorch/pytorch/pull/76333), [#72944](https://github.com/pytorch/pytorch/pull/72944), [#72941](https://github.com/pytorch/pytorch/pull/72941), [#72671](https://github.com/pytorch/pytorch/pull/72671), [#72375](https://github.com/pytorch/pytorch/pull/72375), [#72458](https://github.com/pytorch/pytorch/pull/72458), [#72464](https://github.com/pytorch/pytorch/pull/72464), [#72463](https://github.com/pytorch/pytorch/pull/72463))
* `nn.MultiheadAttention`: Preserve identity relationships among query, key, and value for `batch_first=True` ([#73053](https://github.com/pytorch/pytorch/pull/73053))
* `nn.utils.weight_norm`: Added native CPU kernel ([#73845](https://github.com/pytorch/pytorch/pull/73845))
* `F.grid_sample`: Improved backward pass scaling with input size for 3d implementation ([#71759](https://github.com/pytorch/pytorch/pull/71759))

## Benchmark

* Added binary to benchmark model load speed ([#74700](https://github.com/pytorch/pytorch/pull/74700))

## Profiler

* Optimized profiler overhead and improved scalability ([#71538](https://github.com/pytorch/pytorch/pull/71538), [#73409](https://github.com/pytorch/pytorch/pull/73409), [#73855](https://github.com/pytorch/pytorch/pull/73855), [#74151](https://github.com/pytorch/pytorch/pull/74151), [#74241](https://github.com/pytorch/pytorch/pull/74241), [#74484](https://github.com/pytorch/pytorch/pull/74484), [#74888](https://github.com/pytorch/pytorch/pull/74888))
* Optimized RecordFunction machinery ([#75807](https://github.com/pytorch/pytorch/pull/75807), [#76017](https://github.com/pytorch/pytorch/pull/76017), [#76016](https://github.com/pytorch/pytorch/pull/76016))

## Mobile

* Reduced unnecessary reference count bumps while parsing ByteCode. ([#72523](https://github.com/pytorch/pytorch/pull/72523))

## Quantization

* Improved multi-core performance of `qavg_pool2d` ([#69517](https://github.com/pytorch/pytorch/pull/69517))
* Improved multi-core performance of `qmax_pool2d` ([#69598](https://github.com/pytorch/pytorch/pull/69598))
* Improved multi-core performance of `qbatch_norm2d` ([#69599](https://github.com/pytorch/pytorch/pull/69599))
* Improved multi-core performance of `qupsample_nearest2d` ([#69600](https://github.com/pytorch/pytorch/pull/69600))
* Improved multi-core performance of `qupsample_bilinear2d` ([#69601](https://github.com/pytorch/pytorch/pull/69601))
* Improved `qcat_nhwc` performance on both multi-core and single-core ([#69667](https://github.com/pytorch/pytorch/pull/69667))
* Added optimized QInt8 tensor quantization for ARM ([#76245](https://github.com/pytorch/pytorch/pull/76245))

# Documentation

## Python API

* Updated `torch.amp` document with CPU Training/Inference Examples ([#77244](https://github.com/pytorch/pytorch/pull/77244))
* Updated `torch.utils.dlpack.from_dlpack` documentation ([#70543](https://github.com/pytorch/pytorch/pull/70543))
* Fixed indexing of class names in docs for `torch.{device, dtype, layout, memory_format}` ([#73632](https://github.com/pytorch/pytorch/pull/73632))
* Fixed `torch.asarray` docs and added a test case ([#73736](https://github.com/pytorch/pytorch/pull/73736))
* Removed misleading statement in `optim.Optimizer` docs ([#76967](https://github.com/pytorch/pytorch/pull/76967))
* Fixed nesterov momentum equation for `torch.optim.SGD` ([#76639](https://github.com/pytorch/pytorch/pull/76639))
* Added missing zero-ing step in `torch.optim.Rprop` algorithm ([#75555](https://github.com/pytorch/pytorch/pull/75555))
* Fixed docs about type promotion of `torch.{bitwise_left_shift, bitwise_right_shift}` ([#77613](https://github.com/pytorch/pytorch/pull/77613))
* Fixed docstring for `torch.roll` ([#74880](https://github.com/pytorch/pytorch/pull/74880))
* Added docs for `torch.scatter_reduce` ([#73125](https://github.com/pytorch/pytorch/pull/73125))
* Automatically generated docstring for `torch.distributions.kl_divergence` ([#72845](https://github.com/pytorch/pytorch/pull/72845))
* Miscellaneous documentation improvements ([#74796](https://github.com/pytorch/pytorch/pull/74796), [#76369](https://github.com/pytorch/pytorch/pull/76369))

## C++ API

* Exposed documentation for `unfold` ([#74224](https://github.com/pytorch/pytorch/pull/74224))

## Autograd

* Fixed error in “Autograd Mechanics” doc’s eval mode section ([#74807](https://github.com/pytorch/pytorch/pull/74807))
* Added “Gradients for non-differentiable functions” section in "Autograd Mechanics" doc to explain how gradients are chosen in edge cases ([#76898](https://github.com/pytorch/pytorch/pull/76898))
* Added link to the "Custom function double backward tutorial" from the "Extending PyTorch" page ([#72584](https://github.com/pytorch/pytorch/pull/72584))
* Documented forward AD interaction with grad mode ([#72216](https://github.com/pytorch/pytorch/pull/72216))
* Fixed code examples to run successfully ([#74044](https://github.com/pytorch/pytorch/pull/74044))

## Dataloader

* Updated the DataLoader docstring about `prefetch_factor` to reflect the correct number of batches prefetched by `DataLoader` ([#74558](https://github.com/pytorch/pytorch/pull/74558))
* Fixed docstring for `collate_fn` ([#76594](https://github.com/pytorch/pytorch/pull/76594))

## LinAlg

* Elaborated on the equivalence between `@` (matrix multiplication) and `solve` in the linalg docs ([#71769](https://github.com/pytorch/pytorch/pull/71769))
* Updated `torch.lu_unpack` docs ([#73803](https://github.com/pytorch/pytorch/pull/73803))

## torch.nn

* `nn.CosineEmbeddingLoss`: Use correct cosine similarity term instead of cosine distance ([#75188](https://github.com/pytorch/pytorch/pull/75188))
* `nn.Hardtanh`: Use `min_val` and `max_val` in function definition ([#75789](https://github.com/pytorch/pytorch/pull/75789))
* `nn.KLDivLoss`: Fixed `log_target` example ([#74945](https://github.com/pytorch/pytorch/pull/74945))
* `nn.LazyModuleMixin`: Fixed typo in docs ([#76269](https://github.com/pytorch/pytorch/pull/76269))
* `nn.LSTM`: Clarified docs for outputs vs. hidden states ([#74291](https://github.com/pytorch/pytorch/pull/74291))
* `nn.Module`: Fixed docs by moving `_version` class variable after docstring ([#72912](https://github.com/pytorch/pytorch/pull/72912))
* `nn.Module`: Fixed docstring typo for `get_submodule()` ([#73018](https://github.com/pytorch/pytorch/pull/73018))
* `nn.Module`: Fixed URL for creating GitHub issues ([#73411](https://github.com/pytorch/pytorch/pull/73411))
* `nn.RNN`: Fixed math notation for linear projections ([#77082](https://github.com/pytorch/pytorch/pull/77082))
* `nn.Transformer`: Detailed 3D tensor shape for masks ([#75552](https://github.com/pytorch/pytorch/pull/75552))
* `nn.TripletMarginLoss`: Fixed formatting error ([#76629](https://github.com/pytorch/pytorch/pull/76629))
* `F.{conv3d, conv_transpose3d, fold, linear}, nn.{AdaptiveAvgPool3d, AvgPool1d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}`: Fixed doc formatting regressions ([#73014](https://github.com/pytorch/pytorch/pull/73014))
* `F.multi_head_attention_forward`: Added to functional rst ([#72675](https://github.com/pytorch/pytorch/pull/72675))
* `F.multi_head_attention_forward`: Fixed math formatting, misc edit ([#74181](https://github.com/pytorch/pytorch/pull/74181))
* `F.pad`: Fixed supported input shapes in docs ([#76117](https://github.com/pytorch/pytorch/pull/76117))
* `nn.init.trunc_normal_`: Added to `nn.init` docs ([#76896](https://github.com/pytorch/pytorch/pull/76896))
* `nn.utils.clip_grad_norm_`: Fixed return value description ([#76230](https://github.com/pytorch/pytorch/pull/76230))
* `nn.Convolution`: Added note on complex support ([#78351](https://github.com/pytorch/pytorch/pull/78351))

## torch.fx

* Added better error message for FX when using concrete_args ([#76600](https://github.com/pytorch/pytorch/pull/76600))

## Composability

* Added docs for Python Registration ([#79481](https://github.com/pytorch/pytorch/pull/79481))

## Sparse

* Added missing entry for `torch.sparse.sampled_addmm` on website ([#72312](https://github.com/pytorch/pytorch/pull/72312))

## Mobile

* Documentation improvement in test_backend_with_compiler (52c516ecb8)
* Added README for mobile model test ([#76385](https://github.com/pytorch/pytorch/pull/76385), [#76409](https://github.com/pytorch/pytorch/pull/76409))

## Distributed

* torch.distributed
    * Clarified the input of PostLocalSGDState ([#72792](https://github.com/pytorch/pytorch/pull/72792))
    * Added a reference to hierarchical SGD for Model Averaging ([#73823](https://github.com/pytorch/pytorch/pull/73823))
    * Updated documentation about NCCL environment variables ([#74006](https://github.com/pytorch/pytorch/pull/74006))
    * Added `TORCH_CPP_LOG_LEVEL` to the docs ([#76625](https://github.com/pytorch/pytorch/pull/76625))
* FullyShardedDataParallel
    * Improved the documentation of state_dict ([#73453](https://github.com/pytorch/pytorch/pull/73453))
    * Updated `full_optim_state_dict` warning ([#75109](https://github.com/pytorch/pytorch/pull/75109))
    * Added warning when fail to clone ([#74946](https://github.com/pytorch/pytorch/pull/74946))
    * Added mixed precision doc ([#76130](https://github.com/pytorch/pytorch/pull/76130))
    * Added warnings for shared params and updated doc ([#77726](https://github.com/pytorch/pytorch/pull/77726))
    * Fixed `state_dict_type()` example ([#77848](https://github.com/pytorch/pytorch/pull/77848))
    * Reworded device placement warning ([#77850](https://github.com/pytorch/pytorch/pull/77850))
    * Updated `state_dict()` docstring ([#77853](https://github.com/pytorch/pytorch/pull/77853))
* torch.distributed.rpc
    * Added note in RPC docs about retries. ([#73601](https://github.com/pytorch/pytorch/pull/73601))
* DistributedDataParallel
    * Updated the comment for Forward and Backward Hook ([#74063](https://github.com/pytorch/pytorch/pull/74063))
    * Added documentation for c10d log levels ([#73361](https://github.com/pytorch/pytorch/pull/73361))
* torch.distributed.elastic
    * Added documentation clarifying that `torchrun` is a console script to `torch.distributed.run` ([#73598](https://github.com/pytorch/pytorch/pull/73598))

## TorchScript

* Corrected torch.jit.Attribute docs to say that it needs to be used in subclasses of torch.jit.ScriptModule, not torch.nn.Module ([#74653](https://github.com/pytorch/pytorch/pull/74653))

## Quantization

* Added docs for `torch.quantize_per_tensor_dynamic` ([#72311](https://github.com/pytorch/pytorch/pull/72311))
* Fixed typo in quantization docs ([#73511](https://github.com/pytorch/pytorch/pull/73511))
* Grammatically updated quantization tech doc ([#74436](https://github.com/pytorch/pytorch/pull/74436))
* Added best practices for quantization accuracy debugging ([#77536](https://github.com/pytorch/pytorch/pull/77536))
* Improved rendered documentation for backend_config_dict ([#77535](https://github.com/pytorch/pytorch/pull/77535))
* Autogenerated quantization backend configs for documentation ([#75126](https://github.com/pytorch/pytorch/pull/75126))
* Added more docs for quantization.rst ([#75998](https://github.com/pytorch/pytorch/pull/75998))
* Fixed formatting for quantization.rst ([#76223](https://github.com/pytorch/pytorch/pull/76223))

## ONNX

* Added the developing PyTorch ONNX exporter wiki doc link ([_#72663_](https://github.com/pytorch/pytorch/pull/72663))
* Added list of supported ATen ops to [_torch.onnx_](https://pytorch.org/docs/master/onnx.html#list-of-supported-operators) page ([_#74397_](https://github.com/pytorch/pytorch/pull/74397))

## Visualization

* `torch.utils.tensorboard.writer`: Added missing `dataformats` argument to `add_image` docs. ([#48834](https://github.com/pytorch/pytorch/pull/48834))

PyTorch 1.11, TorchData, and functorch are now available (2022-03-10)

# PyTorch 1.11 Release Notes

* Highlights
* Backwards Incompatible Change
* New Features
* Improvements
* Performance
* Documentation

# Highlights

We are excited to announce the release of PyTorch 1.11. This release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, we are releasing beta versions of TorchData and functorch. We want to sincerely thank our community for continuously improving PyTorch.


* TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. [_View it on GitHub_](https://github.com/pytorch/data). 
* functorch, a library that adds composable function transforms to PyTorch, is now available in beta (a short example appears below). [_View it on GitHub_](https://github.com/pytorch/functorch).
* Distributed Data Parallel (DDP) static graph optimizations available in stable.

You can check the blogpost that shows the new features [here](https://pytorch.org/blog/pytorch-1.11-released/).
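
As a quick illustration of what "composable function transforms" means, here is a minimal sketch using functorch's `grad` and `vmap`; it assumes the separate `functorch` package is installed alongside this release, and the function below is just an example:

```python
import torch
from functorch import grad, vmap  # beta library released alongside PyTorch 1.11

def f(x):
    return torch.sin(x).sum()  # scalar-valued function of a vector

x = torch.randn(3)
print(grad(f)(x))            # per-element gradient, i.e. cos(x)

batch = torch.randn(5, 3)
print(vmap(grad(f))(batch))  # the same gradient transform, vectorized over a batch
```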

# Backwards Incompatible changes

## Python API

### Fixed python `deepcopy` to correctly copy all attributes on `Tensor` objects ([#65584](https://github.com/pytorch/pytorch/pull/65584))

This change ensures that the `deepcopy` operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).

**1.10.2**

```python
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# Raises AttributeError: "Tensor" object has no attribute "foo"
```

**1.11.0**

```python
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# 3
```

### **`steps` argument is no longer optional in `torch.linspace` and `torch.logspace`**

This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously you would see a deprecation warning if you didn’t explicitly pass in `steps`). In PyTorch 1.11, it is no longer optional.

**1.10.2**

```python
# Works, but raises a deprecation warning
# Steps defaults to 100
a = torch.linspace(1, 10)
# UserWarning: Not providing a value for linspace's steps is deprecated
# and will throw a runtime error in a future release.
# This warning will appear only once per process.
# (Triggered internally at ../aten/src/ATen/native/RangeFactories.cpp:19)
```

**1.11.0**

```python
# In 1.11, you must specify steps
a = torch.linspace(1, 10, steps=100)
```

### Remove `torch.hub.import_module` function that was mistakenly public ([#67990](https://github.com/pytorch/pytorch/pull/67990))

This function is not intended for public use. If you have existing code that relies on it, you can find an equivalent function at `torch.hub._import_module`.

## C++ API

### **We’ve cleaned up many of the headers in the C++ frontend to only include the subset of `aten` operators that they actually used ([#68247](https://github.com/pytorch/pytorch/pull/68247), [#68687](https://github.com/pytorch/pytorch/pull/68687), [#68688](https://github.com/pytorch/pytorch/pull/68688), [#68714](https://github.com/pytorch/pytorch/pull/68714), [#68689](https://github.com/pytorch/pytorch/pull/68689), [#68690](https://github.com/pytorch/pytorch/pull/68690), [#68697](https://github.com/pytorch/pytorch/pull/68697), [#68691](https://github.com/pytorch/pytorch/pull/68691), [#68692](https://github.com/pytorch/pytorch/pull/68692), [#68693](https://github.com/pytorch/pytorch/pull/68693), [#69840](https://github.com/pytorch/pytorch/pull/69840))**

When you `#include` a header from the C++ frontend, you can no longer assume that every `aten` operator is transitively included. You can work around this by directly adding `#include ` in your file, which will maintain the old behavior of including every `aten` operator.

### **Custom implementation for `c10::List` and `c10::Dict` move constructors has been removed ([#69370](https://github.com/pytorch/pytorch/pull/69370))**

The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged".

**1.10.2**

```cpp
c10::List<std::string> list1({"3", "4"});
c10::List<std::string> list2(std::move(list1));
std::cout << list1.size(); // 0
```

**1.11.0**

```cpp
c10::List<std::string> list1({"3", "4"});
c10::List<std::string> list2(std::move(list1)); // calls copy ctor
std::cout << list1.size(); // 2
```

## CUDA

### **Removed `THCeilDiv` function and corresponding `THC/THCDeviceUtils.cuh` header ([#65472](https://github.com/pytorch/pytorch/pull/65472))**

As part of cleaning up `TH` from the codebase, the `THCeilDiv` function has been removed. Instead, please use `at::ceil_div` and include the corresponding `ATen/ceil_div.h` header.

### **Removed `THCudaCheck` ([#66391](https://github.com/pytorch/pytorch/pull/66391))**

You can replace it with `C10_CUDA_CHECK`, which has been available since at least PyTorch 1.4, so simply replacing it is enough even if you support older versions.

### **Removed `THCudaMalloc()`, `THCudaFree()`, `THCThrustAllocator.cuh` ([#65492](https://github.com/pytorch/pytorch/pull/65492))**

If your extension is using `THCThrustAllocator.cuh`, please replace it with `ATen/cuda/ThrustAllocator.h` and the corresponding APIs (see examples in this PR). This PR also removes `THCudaMalloc/THCudaFree` calls. Please use `c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr)`, or, preferably, switch to `c10::cuda::CUDACachingAllocator::allocate`, which manages deallocation. Caching allocator APIs are available since PyTorch 1.2, so simply replacing them is enough even if you support older versions of PyTorch.

## Build

### Stopped building shared library for AOT Compiler, `libaot_compiler.so` ([#66227](https://github.com/pytorch/pytorch/pull/66227))

Building `aot_compiler.cpp` as a separate library is not necessary, as it’s already included in `libtorch.so`. You can update your build system to only dynamically link `libtorch.so`.

## Mobile

### Make `typing.Union` type unsupported for mobile builds ([#65556](https://github.com/pytorch/pytorch/pull/65556))

`typing.Union` support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and the increase in binary size of PyTorch for Mobile builds.

## Distributed

### `torch.distributed.rpc`: Final Removal of ProcessGroup RPC backend ([#67363](https://github.com/pytorch/pytorch/pull/67363))

The ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely. The backend type “PROCESS_GROUP” is now deprecated, e.g. `torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)`, and should be replaced with: `torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)`.

## Quantization

### Disabled the support for `getitem` in FX Graph Mode Quantization ([#66647](https://github.com/pytorch/pytorch/pull/66647))

`getitem` used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.

**1.10.2**

```python
import torch
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5,
#      scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x,
#         linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor)
#     quantize_per_tensor = None
#     stack = torch.stack([linear], 0);  linear = None
#     getitem = stack[0]; stack = None
#     dequantize_2 = getitem.dequantize();  getitem = None
#     return getitem
```

**1.11.0**

```python
import torch
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
#       zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0,
#         linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor);  quantize_per_tensor = None
#     stack = torch.stack([linear], 0);  linear = None
#     dequantize_2 = stack.dequantize();  stack = None
#     getitem = dequantize_2[0];  dequantize_2 = None
#     return getitem
```

### **Users should now use `fuse_modules` for PTQ fusion and `fuse_modules_qat` for QAT fusion ([#69878](https://github.com/pytorch/pytorch/pull/69878), [#71956](https://github.com/pytorch/pytorch/pull/71956))**

There are two types of fusion supported by the fuse_modules API: PTQ and QAT fusion. Previously we relied on `module.training` to decide which mode the user wanted, but this was a misuse of the `training` attribute since that is not its intended purpose. This PR removes the dependency on `module.training` and uses separate APIs to make the fusion requested by the user explicit. Previously, `fuse_modules` supported both cases and distinguished PTQ/QAT fusion based on `module.training`; now `fuse_modules` only supports PTQ fusion. So, when a user wants to do QAT fusion, they need to call `fuse_modules_qat` instead of `fuse_modules`; otherwise they would silently get unwanted fusion results (PTQ fusion), or, if the model is in training mode, it might result in an error.

**Note:** Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.

**1.10.2**

```python
import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)
    def forward(self, x):
        return self.bn(self.conv(x))
m = M().train()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
# <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
# <class 'torch.nn.modules.conv.Conv2d'>
```

**1.11.0**

```python
import torch
from torch.ao.quantization import fuse_modules, fuse_modules_qat
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)
    def forward(self, x):
        return self.bn(self.conv(x))
m = M().train()
# For Quantization Aware Training, use fuse_modules_qat()
m = fuse_modules_qat(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
# Result (doesn't change):
# <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
# <class 'torch.nn.modules.conv.Conv2d'>
```

## ONNX

### Removed `f` arg from `onnx.export_to_pretty_string` ([#69546](https://github.com/pytorch/pytorch/pull/69546))

The arg has always been ignored. Simply remove it from your code.

**1.10.2**

```python
torch.onnx.export_to_pretty_string(model, inputs, "file_name")
```

**1.11.0**

```python
torch.onnx.export_to_pretty_string(model, inputs)
```

### Removed `use_external_data_format` arg from `onnx.export` ([#67809](https://github.com/pytorch/pytorch/pull/67809))

The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.

**1.10.2**

```python
torch.onnx.export(model, inputs, f_name, use_external_data_format=True)
```

**1.11.0**

```python
torch.onnx.export(model, inputs, f_name)
```

### Removed `example_outputs` arg from `torch.onnx.export` ([#67809](https://github.com/pytorch/pytorch/pull/67809))

The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.

**1.10.2**

```python
torch.onnx.export(model, inputs, f_name, example_outputs=(foo,))
```

**1.11.0**

```python
torch.onnx.export(model, inputs, f_name)
```

### Removed `enable_onnx_checker` arg from `onnx.export` ([#67276](https://github.com/pytorch/pytorch/pull/67276))

The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, `onnx.CheckerError` will be raised. Users can catch and ignore that exception.

**1.10.2**

```python
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)
```

**1.11.0**

```python
try:
    torch.onnx.export(model, inputs, f_name)
except torch.onnx.CheckerError:
    pass  # ignore error
```

### Moved and renamed `onnx.utils.ONNXCheckerError` to `onnx.CheckerError` ([#66644](https://github.com/pytorch/pytorch/pull/66644))

Previously the documentation was incorrect and stated `ONNXCheckerError` was in the `onnx` module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.

**1.10.2**

```python
except torch.onnx.utils.ONNXCheckerError:
```

**1.11.0**

```python
except torch.onnx.CheckerError:
```

### Removed `_retain_param_name` arg from `onnx.export` ([#67276](https://github.com/pytorch/pytorch/pull/67276))

The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.

**1.10.2**

```python
# NOTE: No way to get same behavior as _retain_param_name=False.
torch.onnx.export(model, inputs, f_name, _retain_param_name=True)
```

**1.11.0**

```python
torch.onnx.export(model, inputs, f_name)
```

# Deprecations

## Python API

### Deprecated `x.T` on tensors of dimension other than 0 or 2 ([#64180](https://github.com/pytorch/pytorch/pull/64180))

`x.T` is only intended for tensors with 0 or 2 dimensions. Calling `x.T` on tensors with a different number of dimensions has been deprecated.

**1.10.2**

```python
a = torch.ones(2, 3, 4)
a.T.size()
# torch.Size([4, 3, 2])
```

**1.11.0**

```python
a = torch.ones(2, 3, 4)
a.T.size()
# UserWarning: The use of `x.T` on tensors of dimension other than 2
# to reverse their shape is deprecated and it will throw an error in a future release.
# Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))`
# to reverse the dimensions of a tensor. (Triggered internally at
# aten/src/ATen/native/TensorShape.cpp:2386.)
# torch.Size([4, 3, 2])
```

## Quantization

### `torch.ao.quantization.QConfigDynamic` is deprecated and will be removed in the next release; please use `torch.ao.quantization.QConfig` instead ([#69875](https://github.com/pytorch/pytorch/pull/69875), [#69864](https://github.com/pytorch/pytorch/pull/69864))

**1.10.2**

```python
qconfig = torch.ao.quantization.QConfigDynamic(...)
```

**1.11.0**

```python
qconfig = torch.ao.quantization.QConfig(...)
```

# New features ## Python API * Added `set_deterministic_debug_mode` and `get_deterministic_debug_mode` ([#67778](https://github.com/pytorch/pytorch/pull/67778), [#66233](https://github.com/pytorch/pytorch/pull/66233)) * Added n-dimensional Hermitian FFT: `torch.fft.ifftn` and `torch.fft.hfftn` ([#63890](https://github.com/pytorch/pytorch/pull/63890)) * Added `Wishart` distribution to `torch.distributions` ([#70377](https://github.com/pytorch/pytorch/pull/70377)) * Preliminary support for the [Python Array API](https://data-apis.org/array-api/latest/) standard has been added to the `torch` and `torch.linalg` modules. PyTorch implements over 90% of the operators defined by the Python Array API, including the `torch.from_dlpack` operation for improved DLPack support ([#60627](https://github.com/pytorch/pytorch/pull/60627)) * Moved `torch.testing` from prototype to beta ([#69668](https://github.com/pytorch/pytorch/pull/69668)) ## Autograd * Added new `torch.utils.checkpoint` implementation that does not use reentrant autograd (can be toggled with the new `use_reentrant` flag) ([#69508](https://github.com/pytorch/pytorch/pull/69508)) * Added `batched_grad` parameter to `autograd.grad` to allow batched gradient computation ([#65564](https://github.com/pytorch/pytorch/pull/65564)) * Forward mode AD: * Added support for most ops (and many of their backwards as well) ([#71026](https://github.com/pytorch/pytorch/pull/71026), [#69956](https://github.com/pytorch/pytorch/pull/69956), [#70355](https://github.com/pytorch/pytorch/pull/70355), [#71901](https://github.com/pytorch/pytorch/pull/71901), [#69908](https://github.com/pytorch/pytorch/pull/69908), [#69884](https://github.com/pytorch/pytorch/pull/69884), [#67837](https://github.com/pytorch/pytorch/pull/67837), [#68566](https://github.com/pytorch/pytorch/pull/68566), [#69661](https://github.com/pytorch/pytorch/pull/69661), [#69384](https://github.com/pytorch/pytorch/pull/69384), [#68631](https://github.com/pytorch/pytorch/pull/68631), [#70468](https://github.com/pytorch/pytorch/pull/70468), [#70460](https://github.com/pytorch/pytorch/pull/70460), [#67820](https://github.com/pytorch/pytorch/pull/67820), [#70460](https://github.com/pytorch/pytorch/pull/70460), [#65546](https://github.com/pytorch/pytorch/pull/65546), [#67043](https://github.com/pytorch/pytorch/pull/67043), [#67268](https://github.com/pytorch/pytorch/pull/67268), [#67837](https://github.com/pytorch/pytorch/pull/67837), [#69727](https://github.com/pytorch/pytorch/pull/69727)) * *Check the following issue ([#71117](https://github.com/pytorch/pytorch/issues/71117)) to see the list of ops that do not yet support forward AD. 
Please comment there if you run into any ops that don’t support forward AD that you want prioritized or are missing from that list.* * Added `ctx.save_for_forward` function to `autograd.Function` ([#71569](https://github.com/pytorch/pytorch/pull/71569)) * `autograd.forward_ad.unpack_dual` returns a named tuple instead of plain tuple ([#68062](https://github.com/pytorch/pytorch/pull/68062), [#68628](https://github.com/pytorch/pytorch/pull/68628)) * Linear algebra operation support: * Added forward AD support for `torch.linalg.{eig, inverse, householder_product, qr}` and `torch.*_solve` ([#65546](https://github.com/pytorch/pytorch/pull/65546), [#67043](https://github.com/pytorch/pytorch/pull/67043), [#67268](https://github.com/pytorch/pytorch/pull/67268), [#67837](https://github.com/pytorch/pytorch/pull/67837)) * Added forward and backward AD support for `torch.linalg.lstsq` ([#65054](https://github.com/pytorch/pytorch/pull/65054)) * Added support for a wider range of inputs for `linalg.pinv` ([#66092](https://github.com/pytorch/pytorch/pull/66092)) ## Build * Added FlexiBLAS build support ([#64815](https://github.com/pytorch/pytorch/pull/64815)) * Added `IS_LINUX` and `IS_MACOS` global vars for cpp extensions building ([#69093](https://github.com/pytorch/pytorch/pull/69093)) * Added ARC for iOS CMake builds ([#67884](https://github.com/pytorch/pytorch/pull/67884)) * Added support for IBM z14/15 SIMD ([#66407](https://github.com/pytorch/pytorch/pull/66407)) ## Complex Numbers * Added complex number support to `Adagrad` and `Adadelta` optimizers ([#66671](https://github.com/pytorch/pytorch/pull/66671), [#66587](https://github.com/pytorch/pytorch/pull/66587)) ## Dataloader * TorchData library is going to provide modular data loading primitives for easily constructing flexible and performant data pipelines. 
Beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data) ## LinAlg * Added an **experimental** flag that allows specifying a preferred linear algebra library (see the docs [here](https://pytorch.org/docs/master/backends.html?highlight=preferred_linalg_library#torch.backends.cuda.preferred_linalg_library)) ([#67980](https://github.com/pytorch/pytorch/pull/67980)) * Added the `linalg.matrix_exp` operation (see the docs [here](https://pytorch.org/docs/master/generated/torch.linalg.matrix_exp.html?highlight=matrix_exp#torch.linalg.matrix_exp)) ([#62715](https://github.com/pytorch/pytorch/pull/62715)) * Added the `linalg.cross` operation (see the docs [here](https://pytorch.org/docs/master/generated/torch.linalg.cross.html?highlight=linalg%20cross#torch.linalg.cross)) ([#63285](https://github.com/pytorch/pytorch/pull/63285)) * Added the `linalg.diagonal` operation, an alias for torch.diagonal (see the docs [here](https://pytorch.org/docs/master/generated/torch.linalg.diagonal.html?highlight=linalg%20diagonal#torch.linalg.diagonal)) ([#70599](https://github.com/pytorch/pytorch/pull/70599)) * Added the `linalg.lu_factor` operation (see the docs [here](https://pytorch.org/docs/master/generated/torch.linalg.lu_factor.html?highlight=lu_factor#torch.linalg.lu_factor)) ([#66933](https://github.com/pytorch/pytorch/pull/66933)) ## torch.nn * Added `torch.nn.utils.rnn.{unpack_sequence,unpad_sequence}` functions ([#66550](https://github.com/pytorch/pytorch/pull/66550)) ## Sparse * Added `torch.sparse.sampled_addmm` for CSR Tensors on GPU ([#68007](https://github.com/pytorch/pytorch/pull/68007)) ## CUDA * The Jiterator - enables compiling rarely used CUDA kernels at runtime ([#69439](https://github.com/pytorch/pytorch/pull/69439)) * Low precision supported for jiterator ([#70157](https://github.com/pytorch/pytorch/pull/70157)) - enables runtime-compilation of ops on low precision tensors (half and bfloat16) * Enable cpu scalar arguments for jiterator ([#69861](https://github.com/pytorch/pytorch/pull/69861)) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime * The Cacherator ([#71350](https://github.com/pytorch/pytorch/pull/71350)) - caches the jit-compiled kernels on disk, so that they can be reused between different processes * Added complex support for Jiterator, port sinc to Jiterator ([#71577](https://github.com/pytorch/pytorch/pull/71577)) * Jiterates `lcm`, `i0e`, `i1e`, `ndtri`, `efcx`, `digamma`, `trigamma`, `lgamma` ([#70663](https://github.com/pytorch/pytorch/pull/70663)) * Jiterates `exp2`, `erfc`, `erfinv` and `entr` ([#71295](https://github.com/pytorch/pytorch/pull/71295)) * Fixes jiterator cache macro include + updates CUDA note with cache variables ([#71452](https://github.com/pytorch/pytorch/pull/71452)) * Jiterates `polygamma` ([#71162](https://github.com/pytorch/pytorch/pull/71162)) * Added cuSPARSE descriptors and updated CSR addmm ([#60838](https://github.com/pytorch/pytorch/pull/60838)) * Sparse CSR CUDA: added `addmv_out` ([#61407](https://github.com/pytorch/pytorch/pull/61407)) * Added nvidia-smi memory and utilization as native Python API ([#69104](https://github.com/pytorch/pytorch/pull/69104)) ## Vulkan * Added Vulkan support for several torch operators: * `torch.cat` (support for concatenation along the height and channel dimensions for 4-D tensors) ([#66669](https://github.com/pytorch/pytorch/pull/66669), [#66103](https://github.com/pytorch/pytorch/pull/66103), 
[#67207](https://github.com/pytorch/pytorch/pull/67207)) * `torch.nn.ConvTranspose2d` ([#67104](https://github.com/pytorch/pytorch/pull/67104), [#67358](https://github.com/pytorch/pytorch/pull/67358)) * `torch.permute` ([#68274](https://github.com/pytorch/pytorch/pull/68274)) * Tensor indexing (`at::slice`) ([#69382](https://github.com/pytorch/pytorch/pull/69382)) * `torch.clone` ([#69551](https://github.com/pytorch/pytorch/pull/69551)) * Added the `vulkan_perf_test` benchmark binary to benchmark Vulkan ops under various input conditions. ([#67230](https://github.com/pytorch/pytorch/pull/67230)) ## Mobile * Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model's binary size by including only the operators that the model uses. * Build tracer for tracing-based workflow ([#66267](https://github.com/pytorch/pytorch/pull/66267)) * Used operator.yaml to build LibTorch library ([#66237](https://github.com/pytorch/pytorch/pull/66237)) * Unified tracer between internal and external ([#64152](https://github.com/pytorch/pytorch/pull/64152)) * Reorganized model tracer dependency ([#63421](https://github.com/pytorch/pytorch/pull/63421)) * Added support for the `bool` and `int` dtypes in the copy kernel by default when using Tracing Based Selective Build ([#69106](https://github.com/pytorch/pytorch/pull/69106), [#69297](https://github.com/pytorch/pytorch/pull/69297)) * Generic build features for selective build ([#67817](https://github.com/pytorch/pytorch/pull/67817)) * Made more classes selective ([#67397](https://github.com/pytorch/pytorch/pull/67397)) * Added custom classes to selective build and compatibility APIs ([#67004](https://github.com/pytorch/pytorch/pull/67004), [#66972](https://github.com/pytorch/pytorch/pull/66972), [#67340](https://github.com/pytorch/pytorch/pull/67340)) ## Distributed * `FullyShardedDataParallel` * FSDP is a form of data-parallel training that, unlike traditional data parallelism, shards the model's parameters, gradients, and optimizer states across data-parallel workers, and can optionally offload the sharded model parameters to the CPU. The new API helps users scale large model training with minimal code changes when switching from DDP to FSDP (see the sketch below).
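The sketch below is for illustration only (not an official example); it assumes the process group has already been initialized by a launcher such as `torchrun`, that one GPU is available per rank, and the model shapes are arbitrary:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set)
# and one GPU per rank.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).cuda()

# Parameters, gradients, and optimizer state are sharded across ranks;
# the optimizer is built after wrapping so it sees the sharded parameters.
model = FSDP(model)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optim.step()
```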
([#63881](https://github.com/pytorch/pytorch/pull/63881), [#64964](https://github.com/pytorch/pytorch/pull/64964), [#66578](https://github.com/pytorch/pytorch/pull/66578), [#66904](https://github.com/pytorch/pytorch/pull/66904), [#66956](https://github.com/pytorch/pytorch/pull/66956), [#66957](https://github.com/pytorch/pytorch/pull/66957), [#67117](https://github.com/pytorch/pytorch/pull/67117), [#67292](https://github.com/pytorch/pytorch/pull/67292), [#67249](https://github.com/pytorch/pytorch/pull/67249), [#67135](https://github.com/pytorch/pytorch/pull/67135), [#67813](https://github.com/pytorch/pytorch/pull/67813), [#68308](https://github.com/pytorch/pytorch/pull/68308), [#68155](https://github.com/pytorch/pytorch/pull/68155), [#68417](https://github.com/pytorch/pytorch/pull/68417), [#68776](https://github.com/pytorch/pytorch/pull/68776), [#69356](https://github.com/pytorch/pytorch/pull/69356), [#69357](https://github.com/pytorch/pytorch/pull/69357), [#69358](https://github.com/pytorch/pytorch/pull/69358), [#70340](https://github.com/pytorch/pytorch/pull/70340), [#71803](https://github.com/pytorch/pytorch/pull/71803), [#71804](https://github.com/pytorch/pytorch/pull/71804), [#70341](https://github.com/pytorch/pytorch/pull/70341), [#70235](https://github.com/pytorch/pytorch/pull/70235), [#72084](https://github.com/pytorch/pytorch/pull/72084)) * `DistributedDataParallel` * Made static graph stable ([#71459](https://github.com/pytorch/pytorch/pull/71459), [#68413](https://github.com/pytorch/pytorch/pull/68413)) * Made LocalSGD beta and cleaned up some docs ([#71621](https://github.com/pytorch/pytorch/pull/71621)) * Supported custom buffer reduction in DDP via hooks ([#64513](https://github.com/pytorch/pytorch/pull/64513)) ## TorchScript * Enabled running `torch.jit.freeze()` and `torch.jit.optimize_for_inference` on functions other than `forward` ([#68668](https://github.com/pytorch/pytorch/pull/68668), [#69367](https://github.com/pytorch/pytorch/pull/69367)) * Enabled `torch.jit.freeze` to work for sparse COO tensors ([#69614](https://github.com/pytorch/pytorch/pull/69614)) * Enabled `torch.jit.script()`, `torch.jit.freeze()` and serialization for tensors in Compressed Sparse Row (CSR) format ([#69555](https://github.com/pytorch/pytorch/pull/69555)) * Allowed users to set the fusion strategy for `torch.jit.fuser` through the now-public `torch.jit.set_fusion_strategy` (see the sketch below).
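A hedged sketch of the knob; the particular (type, depth) pairs and the scripted function are illustrative, not a recommendation:

```python
import torch

# Each entry is a (fusion type, depth) pair: "STATIC" fusions specialize on
# exact input shapes, while "DYNAMIC" fusions tolerate varying shapes.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])

@torch.jit.script
def fused(x, y):
    return torch.relu(x * y) + x

out = fused(torch.randn(8), torch.randn(8))
```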
([#72937](https://github.com/pytorch/pytorch/pull/72937)) * Enabled Dynamic Shape Fusion For GPU & CPU, configurable via `torch.jit.set_fusion_strategy` ([#72036](https://github.com/pytorch/pytorch/pull/72036)) ## Quantization * Added bilinear quantized implementation of `torch.nn.functional.grid_sample` 2d operator ([#66879](https://github.com/pytorch/pytorch/pull/66879)) * Added the `torch.quantize_per_tensor_dynamic` operator ([#68004](https://github.com/pytorch/pytorch/pull/68004)) * Added Quantization Aware Training support for `torch.nn.Embedding` and `torch.nn.EmbeddingBag` * Added basic EmbeddingBag QAT fakeQuant workflow ([#65443](https://github.com/pytorch/pytorch/pull/65443)) * Added support for quantization of Embedding{Bag} in dynamic quant APIs ([#65674](https://github.com/pytorch/pytorch/pull/65674)) * Eager mode QAT for Embeddings ([#66429](https://github.com/pytorch/pytorch/pull/66429)) * Add benchmarks for QAT Embedding+EmbeddingBag ([#66560](https://github.com/pytorch/pytorch/pull/66560)) * Supported Embedding QAT via FX API ([#69333](https://github.com/pytorch/pytorch/pull/69333)) * Add FX support for QAT EmbeddingBag ([#69334](https://github.com/pytorch/pytorch/pull/69334)) * Added support for depthwise quantized `torch.nn.Conv3d` in qnnpack, for use in quantization * Depthwise Conv3d Indirection Buffer Setup ([#69311](https://github.com/pytorch/pytorch/pull/69311)) * Depthwise Conv3d Weight Packing ([#69312](https://github.com/pytorch/pytorch/pull/69312)) * Depthwise Conv3d mp8x27 (per channel) Neon Kernel ([#69313](https://github.com/pytorch/pytorch/pull/69313)) * Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel ([#69314](https://github.com/pytorch/pytorch/pull/69314)) * Tightened Step Height for Indirection Buffers ([#70530](https://github.com/pytorch/pytorch/pull/70530)) * Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 ([#69315](https://github.com/pytorch/pytorch/pull/69315)) * Implemented 3d convolution in qnnpack ([#66350](https://github.com/pytorch/pytorch/pull/66350)) ## ONNX * Supports opset version 15 ([#67805](https://github.com/pytorch/pytorch/pull/67805)) * Supports exporting `nn.Module` calls as ONNX local functions ([#66140](https://github.com/pytorch/pytorch/pull/66140), [#67803](https://github.com/pytorch/pytorch/pull/67803)) * Supports for exporting new ops: * `tanhshrink`, `hardshrink`, `softshrink` ([#68492](https://github.com/pytorch/pytorch/pull/68492)) * `__xor__` ([#64581](https://github.com/pytorch/pytorch/pull/64581)) * `isfinite` ([#64754](https://github.com/pytorch/pytorch/issues/64754)) * `log10` ([#64374](https://github.com/pytorch/pytorch/pull/64374)) * `diagonal` ([#66144](https://github.com/pytorch/pytorch/pull/66144)) * Added BFloat16 type support ([#66788](https://github.com/pytorch/pytorch/pull/66788)) * Supports exporting with Apex O2 ([#66700](https://github.com/pytorch/pytorch/pull/66700)) ## Infra (Releng) * Added support for ROCm 4.3.1 ([#65624](https://github.com/pytorch/pytorch/pull/65624)) * Added support for ROCm 4.5.2 ([#71064](https://github.com/pytorch/pytorch/pull/71064)) * Added support for CUDA 11.5 ([#69262](https://github.com/pytorch/pytorch/pull/69262)) * Added support for CUDA enabled Bazel builds ([#66241](https://github.com/pytorch/pytorch/pull/66241)) * Added support for Python 3.10 ([#71132](https://github.com/pytorch/pytorch/pull/71132), [#71419](https://github.com/pytorch/pytorch/pull/71419)) # Improvements ## Python API * NumPy compatibility: * Improved `torch.searchsorted` to be more 
consistent with NumPy ([#66818](https://github.com/pytorch/pytorch/pull/66818)) * Added `torch.argwhere` to match NumPy ([#64257](https://github.com/pytorch/pytorch/pull/64257)) * Added an alias for `torch.special.softmax` ([#62251](https://github.com/pytorch/pytorch/pull/62251)) * Improved `torch.Tensor.view(dtype)`: enable all dtype combinations ([#66493](https://github.com/pytorch/pytorch/pull/66493)) * Improved `torch.diff` by adding support for n greater than 1 ([#67260](https://github.com/pytorch/pytorch/pull/67260)) * Improved `torch.movedim` to handle scalar as no-op ([#69537](https://github.com/pytorch/pytorch/pull/69537)) * Improved `cartesian_prod`: fixed a warning in the docs example ([#68753](https://github.com/pytorch/pytorch/pull/68753)) * Improved error messages for `max_unpool{}d` operators ([#67328](https://github.com/pytorch/pytorch/pull/67328)) * `torch.distributions` * Implemented positive-semidefinite constraint in `torch.distributions` ([#71375](https://github.com/pytorch/pytorch/pull/71375)) * Implemented Entropy methods for Binomial and Multinomial distributions ([#67609](https://github.com/pytorch/pytorch/pull/67609)) * Implemented support for `non-negative` constraint in exponential distribution (allowing it to include zero). ([#67184](https://github.com/pytorch/pytorch/pull/67184)) * Implemented `kl divergence` between `normal` and `laplace` distribution. ([#68807](https://github.com/pytorch/pytorch/pull/68807)) * Improved meta tensor support for operators: * `max` ([#61449](https://github.com/pytorch/pytorch/pull/61449)) `min` ([#61450](https://github.com/pytorch/pytorch/pull/61450)) `tril`, `triu` ([#67055](https://github.com/pytorch/pytorch/pull/67055)) `mv` ([#67373](https://github.com/pytorch/pytorch/pull/67373)) `range`, `arange`, `linspace`, `logspace` ([#67032](https://github.com/pytorch/pytorch/pull/67032)) `lerp` ([#68924](https://github.com/pytorch/pytorch/pull/68924)) `smooth_l1_loss` ([#67404](https://github.com/pytorch/pytorch/pull/67404)) `fractional_max_pool2d_backward` ([#68245](https://github.com/pytorch/pytorch/pull/68245)) `linalg.lu_factor` ([#66934](https://github.com/pytorch/pytorch/pull/66934)) `fractional_maxpool3d`: port to structured kernel ([#70414](https://github.com/pytorch/pytorch/pull/70414)) * Added support for `torch.Tensor.real` for real-valued tensors ([#71718](https://github.com/pytorch/pytorch/pull/71718)) * `torch.logaddexp, torch.logaddexp2, torch.remainder`: added BFloat16 support on CPU ([#63621](https://github.com/pytorch/pytorch/pull/63621)) * `torch.bucketize` and `searchsorted`: added Half precision support ([#67077](https://github.com/pytorch/pytorch/pull/67077)) * Added new `torch.slice_scatter`,`torch.select_scatter`, `torch.diagonal_scatter` ops ([#64430](https://github.com/pytorch/pytorch/pull/64430)) * Made `torch.scatter_reduce` a public API ([#68580](https://github.com/pytorch/pytorch/pull/68580), [#73125](https://github.com/pytorch/pytorch/pull/73125/files)) ## C++ API * Added C++ API and docs for `hfftn` ([#66127](https://github.com/pytorch/pytorch/pull/66127)) * Added support for `MaybeOwned` ([#68157](https://github.com/pytorch/pytorch/pull/68157)) * Added `set_to_none` option for `zero_grad()` to C++ API ([#68801](https://github.com/pytorch/pytorch/pull/68801)) * Added an environment variable, `TORCH_CPP_LOG_LEVEL`, that you can use to toggle the log level in the c10 library ([#71746](https://github.com/pytorch/pytorch/pull/71746)) ## Autograd * Added nesting support for 
`torch.autograd.graph.saved_tensor_hooks` ([#70932](https://github.com/pytorch/pytorch/pull/70932)) * Delayed all warnings encountered during the backward pass until the end of backward execution ([#66235](https://github.com/pytorch/pytorch/pull/66235)) * Added complex autograd support to `torch.{col2im,im2col}` ([#68199](https://github.com/pytorch/pytorch/pull/68199)) * Added new reduce options and autograd support for `torch.scatter_reduce` ([#71788](https://github.com/pytorch/pytorch/pull/71788)) * Added derivatives wrt the second argument for `torch.{remainder,fmod}` ([#69908](https://github.com/pytorch/pytorch/pull/69908)) * Added new `strategy` flag to `autograd.functional.{Jacobian, Hessian}` to enable vectorized computation ([#67041](https://github.com/pytorch/pytorch/pull/67041), [#66292](https://github.com/pytorch/pytorch/pull/66292)) * Added `check_backward_ad` flag to `torch.autograd.gradcheck` to be able to skip backward mode AD checks ([#65040](https://github.com/pytorch/pytorch/pull/65040)) * Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 ([#66294](https://github.com/pytorch/pytorch/pull/66294)) ## Build * Improved incremental build times of PyTorch core by removing a dependency on `native_functions.yaml` in many core files ([#64499](https://github.com/pytorch/pytorch/pull/64499), [#66914](https://github.com/pytorch/pytorch/pull/66914), [#64172](https://github.com/pytorch/pytorch/pull/64172), [#64171](https://github.com/pytorch/pytorch/pull/64171), [#66620](https://github.com/pytorch/pytorch/pull/66620), [#66793](https://github.com/pytorch/pytorch/pull/66793), [#66913](https://github.com/pytorch/pytorch/pull/66913), [#66794](https://github.com/pytorch/pytorch/pull/66794), [#64169](https://github.com/pytorch/pytorch/pull/64169), [#64173](https://github.com/pytorch/pytorch/pull/64173), [#64170](https://github.com/pytorch/pytorch/pull/64170), [#67735](https://github.com/pytorch/pytorch/pull/67735)) * Enabled bazel build without glog and gflags ([#70850](https://github.com/pytorch/pytorch/pull/70850)) * Added support for C++ frontend wrapper on Linux ([#69094](https://github.com/pytorch/pytorch/pull/69094)) * Added support for dynamic codegen outputs in CMake ([#68246](https://github.com/pytorch/pytorch/pull/68246)) * Max CMake version is now used by default with setup.py ([#69355](https://github.com/pytorch/pytorch/pull/69355)) * Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together ([#63748](https://github.com/pytorch/pytorch/pull/63748)) * Code base should now be `-Wno-unused-variable` compliant ([#66041](https://github.com/pytorch/pytorch/pull/66041)) * Added lazy import for `packaging` in `torch_version` ([#71345](https://github.com/pytorch/pytorch/pull/71345)) ## Dataloader * Support custom `Sequence` and `Mapping` for `utils.data.default_collate` ([#68779](https://github.com/pytorch/pytorch/pull/68779)) * Allowed specifying `num_samples` to `RandomSampler` when `replacement` is `False` ([#71568](https://github.com/pytorch/pytorch/pull/71568)) * Fixed the warning of shape inconsistency `utils.data.default_collate` ([#71065](https://github.com/pytorch/pytorch/pull/71065)) ## ForEach * Implemented `ForEach` L1 & L2 norm ([#62646](https://github.com/pytorch/pytorch/pull/62646)) ## LinAlg * The `linalg.matrix_rank` ([docs](https://pytorch.org/docs/master/generated/torch.linalg.matrix_rank.html?highlight=matrix_rank#torch.linalg.matrix_rank)) and `linalg.pinv` 
([docs](https://pytorch.org/docs/master/generated/torch.linalg.pinv.html?highlight=pinv#torch.linalg.pinv)) operations now support specifying absolute and relative tolerances for better handling of singular values ([#63102](https://github.com/pytorch/pytorch/pull/63102)) ## torch.nn * Added `channels_last` support for `ChannelShuffle` ([#50247](https://github.com/pytorch/pytorch/pull/50247)) * Added no-batch-dim support for `nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer}` ([#69054](https://github.com/pytorch/pytorch/pull/69054), [#69539](https://github.com/pytorch/pytorch/pull/69539), [#70506](https://github.com/pytorch/pytorch/pull/70506), [#71055](https://github.com/pytorch/pytorch/pull/71055), [#70092](https://github.com/pytorch/pytorch/pull/70092), [#64909](https://github.com/pytorch/pytorch/pull/64909), [#69732](https://github.com/pytorch/pytorch/pull/69732), [#69783](https://github.com/pytorch/pytorch/pull/69783), [#70236](https://github.com/pytorch/pytorch/pull/70236), [#65323](https://github.com/pytorch/pytorch/pull/65323), [#71056](https://github.com/pytorch/pytorch/pull/71056), [#64975](https://github.com/pytorch/pytorch/pull/64975), [#67176](https://github.com/pytorch/pytorch/pull/67176), [#70590](https://github.com/pytorch/pytorch/pull/70590), [#65690](https://github.com/pytorch/pytorch/pull/65690), [#70977](https://github.com/pytorch/pytorch/pull/70977), [#70597](https://github.com/pytorch/pytorch/pull/70597), [#70322](https://github.com/pytorch/pytorch/pull/70322), [#69291](https://github.com/pytorch/pytorch/pull/69291)) * Added `BFloat16` support on CPU to `nn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d}` ([#56902](https://github.com/pytorch/pytorch/pull/56902), [#66929](https://github.com/pytorch/pytorch/pull/66929), [#66927](https://github.com/pytorch/pytorch/pull/66927), [#56903](https://github.com/pytorch/pytorch/pull/56903)) * Added `maximize` support to `optim.{Adam, AdamW, SGD}` ([#68164](https://github.com/pytorch/pytorch/pull/68164), [#70146](https://github.com/pytorch/pytorch/pull/70146), [#67847](https://github.com/pytorch/pytorch/pull/67847), [#68733](https://github.com/pytorch/pytorch/pull/68733), [#71023](https://github.com/pytorch/pytorch/pull/71023)) * `F.interpolate`: Add `nearest-exact` mode to fix off-by-one error in `nearest` mode ([#64501](https://github.com/pytorch/pytorch/pull/64501)) * `F.interpolate`: Added support for anti-aliasing to bilinear and bicubic algorithms ([#70930](https://github.com/pytorch/pytorch/pull/70930), [#68819](https://github.com/pytorch/pytorch/pull/68819), [#65142](https://github.com/pytorch/pytorch/pull/65142), [#69318](https://github.com/pytorch/pytorch/pull/69318)) * `F.interpolate`: Improved error message for invalid shapes ([#66417](https://github.com/pytorch/pytorch/pull/66417)) * `nn.Conv*d`: Accepts 0-sized channel inputs ([#66256](https://github.com/pytorch/pytorch/pull/66256)) * `nn.LogSigmoid`: Used `log1p` for improved precision ([#66441](https://github.com/pytorch/pytorch/pull/66441)) * `nn.Module`: Added flag for removing duplicates from parameters ([#71542](https://github.com/pytorch/pytorch/pull/71542)) * `nn.Module`: Added `register_module` alias for registering a sub-module 
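Two of the items above in a tiny, hedged sketch (module names, shapes, and the learning rate are arbitrary): the `register_module` alias on `nn.Module` and the new `maximize` flag on the listed optimizers:

```python
import torch
from torch import nn

# register_module is an alias for add_module.
container = nn.Module()
container.register_module("encoder", nn.Linear(16, 8))
assert container.encoder is container.get_submodule("encoder")

# maximize=True makes the optimizer ascend the objective instead of descending.
param = torch.randn(4, requires_grad=True)
opt = torch.optim.SGD([param], lr=0.1, maximize=True)
param.sum().backward()
opt.step()  # param moves in the direction that increases param.sum()
```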
([#65174](https://github.com/pytorch/pytorch/pull/65174)) * `nn.ModuleList`: Supported concatenation ([#70887](https://github.com/pytorch/pytorch/pull/70887)) * `nn.MultiheadAttention`: Added flag to optionally average output attention weights across heads ([#70055](https://github.com/pytorch/pytorch/pull/70055)) * `nn.ParameterDict`: Supported full set of `dict` methods ([#69403](https://github.com/pytorch/pytorch/pull/69403)) * `nn.{RNN, GRU}`: Allowed `hidden_size` to be 0 ([#70556](https://github.com/pytorch/pytorch/pull/70556)) * `nn.Sequential`: Added `append` method ([#71326](https://github.com/pytorch/pytorch/pull/71326)) * `nn.Upsample`: Exposed `recompute_scale_factor` ([#66419](https://github.com/pytorch/pytorch/pull/66419)) * `nn.ZeroPad2d`: Added `extra_repr` for printing purposes ([#69206](https://github.com/pytorch/pytorch/pull/69206)) * `optim.{ChainedScheduler, SequentialLR}`: Added `optimizer` attribute ([#67406](https://github.com/pytorch/pytorch/pull/67406), [#69817](https://github.com/pytorch/pytorch/pull/69817)) * `optim.swa_utils.AveragedModel`: Added `use_buffers` flag for averaging buffers in addition to parameters ([#65921](https://github.com/pytorch/pytorch/pull/65921), [#71763](https://github.com/pytorch/pytorch/pull/71763)) ## torch.fx * Improved the customizability of `fx.Graph`’s code generation function, including support for setting a breakpoint in the generated code ([#67139](https://github.com/pytorch/pytorch/pull/67139)) * Supported printing inplace operators in FX ([#71887](https://github.com/pytorch/pytorch/pull/71887)) ## Sparse * Add CSR support for several operators: * `torch.triangular_solve`, `torch.addmv`, `torch.addmm`, `torch.add` for all arguments on CPU [(#62180](https://github.com/pytorch/pytorch/pull/62180), [#61536](https://github.com/pytorch/pytorch/pull/61536), [#65606](https://github.com/pytorch/pytorch/pull/65606), [#64391](https://github.com/pytorch/pytorch/pull/64391)) * `torch.triangular_solve`, `torch.addmv`, `torch.addmm`, `torch.add` for all arguments on GPU ([#61407](https://github.com/pytorch/pytorch/pull/61407), [#61858](https://github.com/pytorch/pytorch/pull/61858), [#63511](https://github.com/pytorch/pytorch/pull/63511), [#63948](https://github.com/pytorch/pytorch/pull/63948)) * zero-preserving unary functions ([#68123](https://github.com/pytorch/pytorch/pull/68123), [#69292](https://github.com/pytorch/pytorch/pull/69292)) * `torch.empty`, `torch.resize_`, `torch.copy_`, `torch.randn_like`, `torch.clone` ([#63509](https://github.com/pytorch/pytorch/pull/63509), [#63510](https://github.com/pytorch/pytorch/pull/63510), [#68083](https://github.com/pytorch/pytorch/pull/68083), [#70581](https://github.com/pytorch/pytorch/pull/70581)) * `transpose` ([#70582](https://github.com/pytorch/pytorch/pull/70582)) * Added torch.sparse_coo Layout support to `zeros_like` ([#68108](https://github.com/pytorch/pytorch/pull/68108)) * Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU ([#59980](https://github.com/pytorch/pytorch/pull/59980)) * Added support for conversion of CSR to COO Tensor to `to_sparse` ([#66774](https://github.com/pytorch/pytorch/pull/66774)) * Added support for empty COO Tensors to sparse.sum ([#71091](https://github.com/pytorch/pytorch/pull/71091)) ## AMD * Added sparse mappings for CUDA->HIP translation ([#67323](https://github.com/pytorch/pytorch/pull/67323)) * Enabled frexp support for ROCm builds ([#67226](https://github.com/pytorch/pytorch/pull/67226)) * Used 
hipCUB/rocPRIM scan algorithms for large index support ([#68487](https://github.com/pytorch/pytorch/pull/68487)) ## CUDA * Allows external CUDA streams to be set as current ([#66324](https://github.com/pytorch/pytorch/pull/66324)) * Added an option to disable reduced precision reductions for FP16 GEMM ([#67946](https://github.com/pytorch/pytorch/pull/67946)) * Improved CUDA memory usage of `nanmedian` result ([#68591](https://github.com/pytorch/pytorch/pull/68591)) * Reduced number of `igamma` kernel instantiations ([#70666](https://github.com/pytorch/pytorch/pull/70666)) * Reduced number of `compare` kernels by unifying them ([#69111](https://github.com/pytorch/pytorch/pull/69111)) * Reduced number of `bernoulli` tensor tensor kernel instantiations ([#70169](https://github.com/pytorch/pytorch/pull/70169)) * Used `cub::FutureValue` to simplify 64bit indexing split of cub scan ([#66711](https://github.com/pytorch/pytorch/pull/66711)) * Added `hascuSOLVER` flag to Context ([#69825](https://github.com/pytorch/pytorch/pull/69825)) * Improved error message from `CUDACachingAllocator` ([#69174](https://github.com/pytorch/pytorch/pull/69174)) * Fixed `masked_softmax` perf for element_size is not 8 ([#70271](https://github.com/pytorch/pytorch/pull/70271)) * Reduced binary size of `TensorCompare.cu` ([#68835](https://github.com/pytorch/pytorch/pull/68835)) * Improved error message for `interpolation` ([#72066](https://github.com/pytorch/pytorch/pull/72066)) * Doesn't compile `pow` kernels for non-existent case ([#70017](https://github.com/pytorch/pytorch/pull/70017)) ## Profiler * Added flop count formulas for `bmm` and `baddbmm` ([#66636](https://github.com/pytorch/pytorch/pull/66636)) ## Vulkan * Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference ([#66477](https://github.com/pytorch/pytorch/pull/66477), [#66478](https://github.com/pytorch/pytorch/pull/66478)) * Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects ([#67733](https://github.com/pytorch/pytorch/pull/67733), [#69576](https://github.com/pytorch/pytorch/pull/69576)) ## Mobile * Introduced multiple improvements for `NNAPI` * Added converters for torchscript ops `quantized::mul` and `quantized::convtranspose2d` to converter (`torch.backends._nnapi.prepare.convert_model_to_nnapi`) ([#63913](https://github.com/pytorch/pytorch/pull/63913), [#63914](https://github.com/pytorch/pytorch/pull/63914)) * Supported `int32` and `qint16` type in Torchscript expressions ([#70197](https://github.com/pytorch/pytorch/pull/70197), [#70621](https://github.com/pytorch/pytorch/pull/70621)) * Supported runtime flexible shapes and return shapes ([#70334](https://github.com/pytorch/pytorch/pull/70334)) * Improved Model Tracer Coverage and Selective Metal Ops ([#68134, #69492, #69328](https://github.com/pytorch/pytorch/pull/68134)) * Introduced multiple improvements for `CoreML` * Fixed error messages ([#67410](https://github.com/pytorch/pytorch/pull/67410)) * Assigned `computationUnit` to executor ([#67411](https://github.com/pytorch/pytorch/pull/67411)) * Cleaned up shape information from `TensorSpec` ([#67412](https://github.com/pytorch/pytorch/pull/67412)) * Type Support in Mobile Lite Interpreter * Extended `type_parser` to handle `NamedTuple` type ([#63130](https://github.com/pytorch/pytorch/pull/63130), 
[#62612](https://github.com/pytorch/pytorch/pull/62612)) ## Distributed * `torch.distributed` * Improvements to error handling in `TCPStore’`s socket implementation (#68225) * Enabled `ncclAvg` for reductions ([#62835](https://github.com/pytorch/pytorch/pull/62835)) * Init dummy `NCCL` comms in constructor ([#65173](https://github.com/pytorch/pytorch/pull/65173), [#66393](https://github.com/pytorch/pytorch/pull/66393)) * Added pybind trampoline for `ProcessGroup` and `Work` ([#66338](https://github.com/pytorch/pytorch/pull/66338)) * Setup `c10d` extension Backend class attr the same way as builtin ones ([#66991](https://github.com/pytorch/pytorch/pull/66991)) * Added barrier to `ProcessGroup` trampoline ([#67236](https://github.com/pytorch/pytorch/pull/67236)) * Raised warning when calling collectives on non-member group objects ([#67639](https://github.com/pytorch/pytorch/pull/67639)) * Patched `bfloat16` support for NCCL ([#67843](https://github.com/pytorch/pytorch/pull/67843)) * Fixed `c10d` TCP store race condition with mutex ([#68499](https://github.com/pytorch/pytorch/pull/68499)) * Surfaced `ncclUniqueId` store broadcast error ([#68597](https://github.com/pytorch/pytorch/pull/68597)) * Checks for file existence before invoking cleanup logic in `FileStore` destructor ([#68603](https://github.com/pytorch/pytorch/pull/68603)) * Implemented gather primitive for `ProcessGroupNCCL` ([#66745](https://github.com/pytorch/pytorch/pull/66745)) * Implemented scatter primitive for `ProcessGroupNCCL` ([#70029](https://github.com/pytorch/pytorch/pull/70029)) * Enabled `gather_object` on `NCCL` ([#71623](https://github.com/pytorch/pytorch/pull/71623)) * Implemented `allreduce_coalesced` for `ProcessGroupNCCL` ([#62140](https://github.com/pytorch/pytorch/pull/62140)) * Set non-default backend names to lower case ([#69400](https://github.com/pytorch/pytorch/pull/69400)) * Added support for `deleteKey` for `FileStore` ([#69953](https://github.com/pytorch/pytorch/pull/69953)) * Fixed `TSAN` issue in `TCPStore` ([#69590](https://github.com/pytorch/pytorch/pull/69590)) * `DistributedDataParallel` * Refactored and removed `sync_params` ([#64514](https://github.com/pytorch/pytorch/pull/64514)) * Used `named_params` and `named_buffers` explicitly ([#65181](https://github.com/pytorch/pytorch/pull/65181)) * Allow await of custom buffer reduction in backward ([#64515](https://github.com/pytorch/pytorch/pull/64515)) * Profiling range for bucket copy ([#65769](https://github.com/pytorch/pytorch/pull/65769)) * Logs iteration in debug mode ([#65770](https://github.com/pytorch/pytorch/pull/65770)) * `torch.distributed.rpc` * Added a timeout argument to RPC shutdown() ([#65425](https://github.com/pytorch/pytorch/pull/65425)) * Released GIL during RPC shutdown. ([#69586](https://github.com/pytorch/pytorch/pull/69586)) * Updated RPC `shutdown()` logic to remove process group usage. ([#65946](https://github.com/pytorch/pytorch/pull/65946)) * Removal of Process Group dependency for TensorPipe Agent. ([#68128](https://github.com/pytorch/pytorch/pull/68128)) * `torch.distributed.autograd` * Made Kineto + distributed a warning rather than an error ([#71120](https://github.com/pytorch/pytorch/pull/71120)) * `torch.distributed.elastic` * Added ability to override sys.executable for `torch.distributed.run` ([#66179](https://github.com/pytorch/pytorch/pull/66179)) ## TorchScript * Several improvements to NVFuser, which is an optimization that speeds up all JIT graphs with a CUDA Tensors on Nvidia GPUs. 
This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launches for a single `CudaFusionGroup`, and adding a graph segmentation cache to the hierarchical caching system. ([#63745](https://github.com/pytorch/pytorch/pull/63745), [#65137](https://github.com/pytorch/pytorch/pull/65137)) * Enabled `profile_ivalue` to convert dynamic scalars (e.g. reduction axes) into compile-time constants in NVFuser ([#63745](https://github.com/pytorch/pytorch/pull/63745), [#65137](https://github.com/pytorch/pytorch/pull/65137)) * Added support in `torch.jit.trace` for tracing already-JITted subgraphs ([#59949](https://github.com/pytorch/pytorch/pull/59949)) * We now provide full types on graph inputs when tracing graphs that are already JITted ([#67424](https://github.com/pytorch/pytorch/pull/67424)) * `torch.jit.freeze` can now preserve attributes of submodules; previously, it was only possible to prevent inlining of attributes of the top-level module ([#66102](https://github.com/pytorch/pytorch/pull/66102)) * The peephole optimizer, which is used in `torch.jit.freeze`, now coalesces consecutive calls to `torch.concat` into a single call ([#67000](https://github.com/pytorch/pytorch/pull/67000)) * Added the ability for the Torch.JIT C dispatch to convert Python `None` into an undefined Tensor ([#67793](https://github.com/pytorch/pytorch/pull/67793)) * `torch.jit.script` now recognizes a union of scalars as a JIT NumberType ([#66591](https://github.com/pytorch/pytorch/pull/66591)) * No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes ([#71170](https://github.com/pytorch/pytorch/pull/71170)) * In `torch.jit.optimize_for_inference`, there is a new graph pass to precompute transposes for linear layers ([#65631](https://github.com/pytorch/pytorch/pull/65631), [#68024](https://github.com/pytorch/pytorch/pull/68024)) * In `torch.jit.freeze`, there is a new pass that concatenates multiple linear layers with the same input Tensor (different weights/biases) ([#63198](https://github.com/pytorch/pytorch/pull/63198), [#68024](https://github.com/pytorch/pytorch/pull/68024)) * Added support for normalizing `torch.Tensor.__rsub__` in the `normalize_ops` JIT pass ([#65014](https://github.com/pytorch/pytorch/pull/65014)) ## Quantization * Quantized op improvements * `torch.ao.FakeQuantize` now supports `fp32/fp16` `zero_point`.
([#65836](https://github.com/pytorch/pytorch/pull/65836)) * `torch.ops.quantized.add` now supports broadcasting ([#66049](https://github.com/pytorch/pytorch/pull/66049)) * `torch.Tensor.dequantize` now supports fp16 + cuda ([#67234](https://github.com/pytorch/pytorch/pull/67234)) * Added quantized CPU support for `torch.nn.GELU` ([#69968](https://github.com/pytorch/pytorch/pull/69968)) * `torch.nn.quantized.functional.hardsigmoid` supports an `inplace` flag ([#65740](https://github.com/pytorch/pytorch/pull/65740)) * Workflow improvements * FX graph mode quantization: enable `torch.nn.Linear + torch.nn.BatchNorm1d` fusion for PTQ ([#66484](https://github.com/pytorch/pytorch/pull/66484)) * Added an option in `torch.ao.quantization.quantize_fx.convert_fx` to accept `qconfig_dict` to skip quantization ([#66878](https://github.com/pytorch/pytorch/pull/66878)) * Added `torch.nn.qat.dynamic.modules.Linear` module ([#67325](https://github.com/pytorch/pytorch/pull/67325)) * Added `torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d` fusion support ([#70022](https://github.com/pytorch/pytorch/pull/70022)) * Extended `torch.ao.quantization.prepare_qat` with `allow_list` argument, to allow custom mapping and custom QAT module ([#65119](https://github.com/pytorch/pytorch/pull/65119)) * Added `torch.ao.quantization.default_replay_qconfig` which allows observer reuse for `torch.reshape` in FX graph mode quantization ([#69249](https://github.com/pytorch/pytorch/pull/69249)) ## ONNX * Set `ir_version` of the exported model based on `opset_version`. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. ([#67803](https://github.com/pytorch/pytorch/pull/67803)) * Preserved op input names when op just passes through the input to the output ([#67275](https://github.com/pytorch/pytorch/pull/67275)) * Shape inference improvements: * Updated slice process shape to support rank only inference ([#66149](https://github.com/pytorch/pytorch/pull/66149)) * Represent symbolic shape as value ([#69545](https://github.com/pytorch/pytorch/pull/69545)) * Included op type in exported models’ input and output names ([#68976](https://github.com/pytorch/pytorch/pull/68976)) * Supports Conv-BatchNorm fusion inside blocks ([#67272](https://github.com/pytorch/pytorch/pull/67272)) * Exported `torch.reciprocal` to ONNX Reciprocal operator instead of `Div(1, x)` ([#67271](https://github.com/pytorch/pytorch/pull/67271)) * Supports `beta!=1` in softplus ([#66146](https://github.com/pytorch/pytorch/pull/66146)) * Added warning for inplace updates on `tensor.shape` in tracing mode ([#66142](https://github.com/pytorch/pytorch/pull/66142)) * Supports `instance_norm` in training mode ([#64375](https://github.com/pytorch/pytorch/pull/64375)) * Allow registration of custom symbolics for ops specifying aten namespace (i.e. `aten::foo` is allowed as well as “foo”). 
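A hedged sketch of registering a custom symbolic using the namespaced `aten::` form; the choice of `aten::hardswish` and its mapping to the ONNX `HardSwish` op (available from opset 14) are purely illustrative:

```python
import torch
from torch.onnx import register_custom_op_symbolic

# Symbolic functions receive the graph context plus the op's inputs and
# return the ONNX node(s) that replace the aten op during export.
def hardswish_symbolic(g, self):
    return g.op("HardSwish", self)

register_custom_op_symbolic("aten::hardswish", hardswish_symbolic, 14)
```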
([#67810](https://github.com/pytorch/pytorch/pull/67810)) * Allow registration of custom symbolics for `prim` namespace ([#66139](https://github.com/pytorch/pytorch/pull/66139)) * Supports dynamic inputs for `OneHot`, bool for `Einsum` ([#66147](https://github.com/pytorch/pytorch/pull/66147)) ## Infra (Releng) * Build with BUILD_SPLIT_CUDA for all 11.X Windows builds ([#70899](https://github.com/pytorch/pytorch/pull/70899)) ## torch.package * Add ability to retrieve the dependency graph via `all_path` function([#65602](https://github.com/pytorch/pytorch/pull/65602)) * Add support for pickle v4 ([#70642](https://github.com/pytorch/pytorch/pull/70642)) * Add better testing support for Package Exporter ([#70641](https://github.com/pytorch/pytorch/pull/70641)) # Bug fixes ## Python API * Fixed scalar inputs for aliased binary ops {`multiply`, `subtract`, `divide`} ([#65937](https://github.com/pytorch/pytorch/pull/65937)) * Fixed `torch.save` when saving storages that view same data with different type ([#66949](https://github.com/pytorch/pytorch/pull/66949)) * Fixed `torch.save` error if storages are unallocated ([#68787](https://github.com/pytorch/pytorch/pull/68787)) * Fixed `k` out-of-bounds in `torch.kthvalue` (cpu kernel) ([#68863](https://github.com/pytorch/pytorch/pull/68863)) * Fixed `inference_mode` decorator: `with inference_mode(mode=False)` used to ignore the `mode` argument and always set inference mode. ([#68617](https://github.com/pytorch/pytorch/pull/68617)) * Fixed `cdist_backward` in the case when `cdist` inputs are not contiguous ([#70016](https://github.com/pytorch/pytorch/pull/70016)) * Fixed `cdist` error message typo ([#70178](https://github.com/pytorch/pytorch/pull/70178)) * Fixed `scatter` for empty indexes ([#70662](https://github.com/pytorch/pytorch/pull/70662)) * Fixed `torch.{unique, unique_consecutive}` out of bound ([#71540](https://github.com/pytorch/pytorch/pull/71540)) * Fixed `torch.isin` in the case when inputs are non-contiguous on CPU ([#70659](https://github.com/pytorch/pytorch/pull/70659)) * Fixed `hsplit vsplit dsplit` crash when section is 0 ([#69342](https://github.com/pytorch/pytorch/pull/69342)) * Fixed: `torch.gradient` ignores dim argument when checking edge_order ([#67926](https://github.com/pytorch/pytorch/pull/67926)) * Fixed: `TransformedDistribution.icdf` should perform validation *after* applying the inverse transformation rather than before. 
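To illustrate the `inference_mode` fix listed above, a tiny hedged sketch of the now-respected `mode=False` argument (the tensor and arithmetic are arbitrary):

```python
import torch

# With mode=False the context manager is a no-op, so autograd still records
# the operations performed inside it.
x = torch.randn(3, requires_grad=True)

with torch.inference_mode(mode=False):
    y = (x * 2).sum()

y.backward()   # works, because inference mode was never actually enabled
print(x.grad)  # tensor([2., 2., 2.])
```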
([#71393](https://github.com/pytorch/pytorch/pull/71393)) * Fixed `torch.all and torch.any` internal assert error with requires_grad=True ([#65714](https://github.com/pytorch/pytorch/pull/65714)) * Fixed `torch.logsumexp` type promotion: promote integral inputs to floating for([#63393](https://github.com/pytorch/pytorch/pull/63393)) ## C++ API * Fixed libtorch `at::Tensor::print()` linking error ([#69615](https://github.com/pytorch/pytorch/pull/69615)) * Avoided UB when indexing into size-0 tensors ([#65878](https://github.com/pytorch/pytorch/pull/65878)) * Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 ([#65655](https://github.com/pytorch/pytorch/pull/65655)) ## Autograd * Fixed autocast state propagation in the `torch.utils.checkpoint` API ([#71169](https://github.com/pytorch/pytorch/pull/71169)) * Fixed `torch.nn.functional.conv_transpose3d` backward when grad_out is non-contiguous ([#67829](https://github.com/pytorch/pytorch/pull/67829)) * Forward mode AD: * Fixed a case where forward AD in-place-over-view silently copies the view ([#67816](https://github.com/pytorch/pytorch/pull/67816)) * Fixed deadlock in forward AD for functions that return multiple outputs ([#67995](https://github.com/pytorch/pytorch/pull/67995)) * Fixed forward AD codegen for functions that have multiple formulas ([#68535](https://github.com/pytorch/pytorch/pull/68535)) * Fixed deadlock when forward and backward AD are used at the same time ([#67360](https://github.com/pytorch/pytorch/pull/67360)) * Fixed `Tensor.copy_` forward AD to handle broadcasting ([#69592](https://github.com/pytorch/pytorch/pull/69592)) * Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function ([#66926](https://github.com/pytorch/pytorch/pull/66926)) * Fixed `autograd.Function` when non-Tensor argument precedes tensor argument ([#71530](https://github.com/pytorch/pytorch/pull/71530)) * Fixed `autograd.Function` forward AD when forward is a no-op to no longer raise an internal error ([#71531](https://github.com/pytorch/pytorch/pull/71531)) ## Build * Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels ([#66703](https://github.com/pytorch/pytorch/pull/66703)) * Disabled SVE when cross-compiling for M1 ([#67114](https://github.com/pytorch/pytorch/pull/67114)) * Added failure if `pocketfft` is not found and `at_mkl` is not enabled ([#67909](https://github.com/pytorch/pytorch/pull/67909)) * Fixed clang issues when compiling with `_GLIBCXX_USE_CXX11_ABI` ([#72081](https://github.com/pytorch/pytorch/pull/72081)) ## Complex Numbers * Fixed `torch.autograd.gradcheck` to generate valid inputs for forward AD computation for complex functions ([#68001](https://github.com/pytorch/pytorch/pull/68001)) * Fixed `torch.Tensor.copy_` transpose path for tensors with conjugate or negative bit set ([#69026](https://github.com/pytorch/pytorch/pull/69026)) * Fixed `torch.Tensor.copy_` behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other ([#68963](https://github.com/pytorch/pytorch/pull/68963)) ## Dataloader * Made `ProcessException` picklable ([#70118](https://github.com/pytorch/pytorch/pull/70118)) * Fixed persistent worker exiting before `pin_memory_thread` ([#71579](https://github.com/pytorch/pytorch/pull/71579)) ## torch.nn * `nn.AdaptiveAvgPool*d`: Throws an error for negative `output_size` 
([#70488](https://github.com/pytorch/pytorch/pull/70488)) * `nn.Conv1d`: Fixed for 1D convolution on MKL-DNN backend ([#68166](https://github.com/pytorch/pytorch/pull/68166)) * `nn.CrossEntropyLoss`: Fixed for usage of `weight`, `ignore_index`, and `label_smoothing` together ([#69511](https://github.com/pytorch/pytorch/pull/69511)) * `nn.Fold`: Checked that block height and width are positive ([#69048](https://github.com/pytorch/pytorch/pull/69048)) * `nn.LayerNorm`: Fixed incorrect result on CUDA when `gamma` or `bias` are missing ([#69210](https://github.com/pytorch/pytorch/pull/69210)) * `nn.LayerNorm`: Avoided overflow by doing computation in `float` for `half` ([#66920](https://github.com/pytorch/pytorch/pull/66920)) * `nn.Module`: Throws a proper error message from `load_state_dict` for non-tensor values ([#70596](https://github.com/pytorch/pytorch/pull/70596)) * `nn.ModuleList`: Fixed incorrect return type in `__getitem__` ([#69083](https://github.com/pytorch/pytorch/pull/69083)) * `nn.MultiheadAttention`: Used query dtype for mask type ([#68077](https://github.com/pytorch/pytorch/pull/68077)) * `nn.NLLLoss`: Fixed backward computation with negative weights ([#64572](https://github.com/pytorch/pytorch/pull/64572)) * `nn.{RNN, GRU}`: Fixed RNN modules with input shapes containing-0 in CUDA ([#71696](https://github.com/pytorch/pytorch/pull/71696)) * `nn.utils.rnn.pad_sequence`: Fix regression to support tuples for padding ([#72436](https://github.com/pytorch/pytorch/pull/72436)) * `optim._LrScheduler`: Fixed print formatting ([#68338](https://github.com/pytorch/pytorch/pull/68338)) * `optim.ChainedScheduler`: Fixed `get_last_lr()` ([#69112](https://github.com/pytorch/pytorch/pull/69112)) * `optim.CosineAnnealingWarmRestarts`: Fixed ordering bug when `last_epoch > 0` ([#64758](https://github.com/pytorch/pytorch/pull/64758)) * `optim.SequentialLR`: Updated `_last_lr` on step ([#70558](https://github.com/pytorch/pytorch/pull/70558)) ## torch.fx * Supported `torch.layout` as arg ([#66048](https://github.com/pytorch/pytorch/pull/66048)) * Specified a default value when possible for placeholders created from `concrete_args` ([#59569](https://github.com/pytorch/pytorch/pull/59569)) * Fixed issue where `GraphModule.delete_all_unused_submodules` deletes submodules from called leaf modules ([#66430](https://github.com/pytorch/pytorch/pull/66430)) * Fixed `torch.fx.subgraph_rewriter.replace_pattern` mechanism so that multiple one-liner instances of the pattern are captured correctly ([#66442](https://github.com/pytorch/pytorch/pull/66442)) * Fixed bug in graph matcher that caused certain nodes to be matched twice ([#69238](https://github.com/pytorch/pytorch/pull/69238)) * Ensured node stack trace survives copying ([#69368](https://github.com/pytorch/pytorch/pull/69368)) * Fixed `to_folder` not saving dtype ([#69983](https://github.com/pytorch/pytorch/pull/69983)) * Added a `default_value` arg to `fx.Graph.placeholder` and fix `split_module` ([#71016](https://github.com/pytorch/pytorch/pull/71016)) ## Sparse * Fixed CSR storage access to throw when used ([#70072](https://github.com/pytorch/pytorch/pull/70072)) * Fixed multiplication of 0-D sparse tensors ([#70749](https://github.com/pytorch/pytorch/pull/70749)) * Fixed result dtype for neg if given sparse Tensor ([#68885](https://github.com/pytorch/pytorch/pull/68885)) ## CUDA * Fixed CUDA vs CPU consistency for index_put_ when accumulating ([#66790](https://github.com/pytorch/pytorch/pull/66790)) * Fixed CUDA vs CPU consistency for index_put_ 
when accumulating (part 2) ([#67189](https://github.com/pytorch/pytorch/pull/67189)) * Fixed error in warning about unsupported GPU ([#67900](https://github.com/pytorch/pytorch/pull/67900)) * Disabled TF32 in `pinv_jvp` and `pinv_backward` ([#67948](https://github.com/pytorch/pytorch/pull/67948)) * Fixed DLPack CUDA stream convention ([#67618](https://github.com/pytorch/pytorch/pull/67618)) * Sets device guard in `_cudnn_impl` functions ([#70406](https://github.com/pytorch/pytorch/pull/70406)) * Fixed `mem_get_info` when querying on a device other than the current device ([#69640](https://github.com/pytorch/pytorch/pull/69640)) ## Benchmark * Fixed divide-by-zero errors in `torch.utils.benchmark.Timer` ([#70050](https://github.com/pytorch/pytorch/pull/70050)) ## Dispatcher * Added explicit `OperatorHandle` destructor, so that the symbol shows up in windows builds ([#70033](https://github.com/pytorch/pytorch/pull/70033)) ## Profiler * Fixed race condition in profiler ([#65812](https://github.com/pytorch/pytorch/pull/65812)) * Fixed TensorBoard memory profiling ([#71417](https://github.com/pytorch/pytorch/pull/71417)) ## Visualization * Fixed `torch.utils.tensorboard` parsing JIT graph incorrectly ([#65692](https://github.com/pytorch/pytorch/pull/65692)) ## Vulkan * Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator ([#69088](https://github.com/pytorch/pytorch/pull/69088)) * Addressed several warnings raised by the Vulkan Validation layers: * Updated all texture resources to have the same dimensionality ([#67647](https://github.com/pytorch/pytorch/pull/67647)) * Added image format qualifier to shader files ([#69330](https://github.com/pytorch/pytorch/pull/69330)) * Disabled SPIR-V compiler size optimization ([#69331](https://github.com/pytorch/pytorch/pull/69331)) ## Mobile * Fixed quantized logistic converter for `NNAPI` ([#70847](https://github.com/pytorch/pytorch/pull/70847)) * Fixed potential crash if `MTLCreateSystemDefaultDevice` returns nil ([#66859](https://github.com/pytorch/pytorch/pull/66859)) * Used full name to look for the promoted prim operator table ([#66081](https://github.com/pytorch/pytorch/pull/66081)) * Fixed function name bug in mobile export ([#66915](https://github.com/pytorch/pytorch/pull/66915)) * Fixed issues with `irange` not having a header included in `Metal` ([#66877](https://github.com/pytorch/pytorch/pull/66877)) * Fixed backward compatibility issue for UnionType on mobile in `type_parser`. ([#71341](https://github.com/pytorch/pytorch/pull/71341)) * Fixed forward flatbuffer type handling with dynamic type in `flatbuffer_loader`. 
([#71500](https://github.com/pytorch/pytorch/pull/71500)) * Fixed type equalities issue in `pytorch_jni_common` ([#71508](https://github.com/pytorch/pytorch/pull/71508)) * Fixed missing properties on the executor in `CoreML` ([#67737](https://github.com/pytorch/pytorch/pull/67737/files)) * Fixed memory computation when both constants and data tensors are present in model_dump ([#66006](https://github.com/pytorch/pytorch/pull/66006)) * Ensured that functions participating in bundled inputs have their `__name__` attribute set ([#65856](https://github.com/pytorch/pytorch/pull/65856)) ## Distributed * `torch.distributed` * Fixed bug on empty `GLOO_SOCKET_IFNAME_ENV` ([#68933](https://github.com/pytorch/pytorch/pull/68933)) * `DistributedDataParallel` * Fixed “Cannot modify in-place due to DDPSink” ([#66015](https://github.com/pytorch/pytorch/pull/66015)) * `torch.distributed.elastic` * Fixed scale-down bug caused by calling `rdzv_handler.shutdown()` on premature agent failures ([#67749](https://github.com/pytorch/pytorch/pull/67749)) ## TorchScript * Fixed a race condition in the JIT interpreter when unpickling source ranges ([5525e9a591](https://github.com/pytorch/pytorch/commit/5525e9a5910a01b880f5f34827c43c29a1473775)) * Fixed a ref-counting loop for `CompilationUnit`, resulting in memory leaks when class objects were in JIT graphs ([#65442](https://github.com/pytorch/pytorch/pull/65442)) * Fixed bug where the output type was discarded after calling SubgraphRewriter in C++ ([#65453](https://github.com/pytorch/pytorch/pull/65453)) * Fixed bug where `torch.jit.optimize_for_inference` did not `torch.jit.freeze` a module when passed a non-frozen module ([#71436](https://github.com/pytorch/pytorch/pull/71436)) * Fixed bug where running module.forward() on a module frozen with `torch.jit.freeze` ran the wrong graph ([#68316](https://github.com/pytorch/pytorch/pull/68316)) * Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of `torch.split`, resulting in invalid optimizations in various JIT optimization passes ([#69745](https://github.com/pytorch/pytorch/pull/69745)) * Fixed places where using `torch.autocast` together with autodiff (calling `backward()` through a JIT graph) produced the wrong number of arguments and would error out (the scenario is sketched below).
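The sketch below is a rough, hedged reconstruction of that scenario; CPU autocast with bfloat16 is used so it runs without a GPU, and the module and shapes are arbitrary:

```python
import torch

# Script a small module, run it under autocast, then call backward through
# the scripted graph.
model = torch.jit.script(torch.nn.Linear(8, 4))
x = torch.randn(2, 8, requires_grad=True)

with torch.autocast("cpu", dtype=torch.bfloat16):
    out = model(x).sum()

out.backward()
print(x.grad.shape)  # torch.Size([2, 8])
```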
([#67648](https://github.com/pytorch/pytorch/pull/67648)) * Forbid propagating gradients through views in JIT graphs as currently it is broken ([#67732](https://github.com/pytorch/pytorch/pull/67732)) * Fixed bug where graph input types were incorrect after running `torch.jit.trace` ([#68242](https://github.com/pytorch/pytorch/pull/68242)) * Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack for when `torch.jit.freeze` ops are converted to MKLDNN([#66628](https://github.com/pytorch/pytorch/pull/66628)) * Raised error instead of segfaulting when passing None into torch.jit.Graph.create ([#68253](https://github.com/pytorch/pytorch/pull/68253)) * Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python `pickle` version.([#69807](https://github.com/pytorch/pytorch/pull/69807)) * Fixed bug where `torch.jit.script` fails when comments in function has less indent than surrounding code ([#70227](https://github.com/pytorch/pytorch/pull/70227)) * Fixed incorrect device type when torch.device is called inside scripted (`torch.jit.script`) code ([#69645](https://github.com/pytorch/pytorch/pull/69645)) * Fixed warning: overloaded virtual function `torch::jit::Function::call` is only partially overridden in class `torch::jit::GraphFunction` ([4bf1be898d](https://github.com/pytorch/pytorch/commit/4bf1be898d)) ## Quantization * Fixed applying non-zero offset 1 to null pointer in `torch.nn.functional.interpolate` for quantized tensors ([#65570](https://github.com/pytorch/pytorch/pull/65570)) * Doesn't assume bias is a keyword argument to `torch.nn.Conv{n}d` ([#61647](https://github.com/pytorch/pytorch/pull/61647), [#71426](https://github.com/pytorch/pytorch/pull/71426)) * Made error message when trying to use `torch.quantize_per_tensor` on non floats more specific ([#66050](https://github.com/pytorch/pytorch/pull/66050)) * Quantized `torch.nn.Embedding` conversion with unsupported dtype: make error message clearer ([#66051](https://github.com/pytorch/pytorch/pull/66051)) * Fixed `torch.nn.qat.EmbeddingBag` from_float error message ([#66989](https://github.com/pytorch/pytorch/pull/66989)) * Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in `torch.nn.Embedding` QAT ([#68852](https://github.com/pytorch/pytorch/pull/68852)) * Fixed scale+zp serialization of `torch.nn.quantized.BatchNorm{2|3}d` ([#70432](https://github.com/pytorch/pytorch/pull/70432)) * Fixed `torch.nn.Dropout` in FX graph mode quantization ([#71043](https://github.com/pytorch/pytorch/pull/71043), [#71438](https://github.com/pytorch/pytorch/pull/71438)) * Fixed `qconfig` setting for fused modules in FX graph mode quantization ([#71254](https://github.com/pytorch/pytorch/pull/71254)) * Removed assumption number of rows is in 32 bit in fbgemm ([#69066](https://github.com/pytorch/pytorch/pull/69066)) * Fixed `reduce_range` warning when using default observers ([#71027](https://github.com/pytorch/pytorch/pull/71027)) ## ONNX * Doesn’t create invalid `index_select` op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. 
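For context on the `Embedding`/Gather item above, a small hedged export sketch; it exercises `nn.Embedding` export (which lowers to ONNX Gather) but does not necessarily reproduce the original constant-folding bug, and the module is made up:

```python
import io
import torch
from torch import nn

# A toy module whose forward looks up an embedding with a 2-D index tensor.
class TinyEmbed(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(100, 16)

    def forward(self, idx):
        return self.emb(idx).sum(dim=-1)

buf = io.BytesIO()
torch.onnx.export(TinyEmbed(), (torch.randint(0, 100, (4, 7)),), buf, opset_version=13)
```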
([#68493](https://github.com/pytorch/pytorch/pull/68493)) * Shape inference: * ConstantMap setters to update existing value instead of emplace, and fix default value of `keepdims` for Reduce ([#67812](https://github.com/pytorch/pytorch/pull/67812)) * Fixed memory leak ([#68210](https://github.com/pytorch/pytorch/pull/68210)) * Fixed reshape shape inference regression affecting LSTM ([#72532](https://github.com/pytorch/pytorch/pull/72532)) * Fixed inplace `fill_` dtype export mismatch ([#64580](https://github.com/pytorch/pytorch/pull/64580)) * Fixed `remainder` ([#64578](https://github.com/pytorch/pytorch/pull/64578)) * Fixed `reciprocal` when input is not floating point ([#67808](https://github.com/pytorch/pytorch/pull/67808)) * Fixed `new_full` and `full_like` for Python 3.9 ([#67806](https://github.com/pytorch/pytorch/pull/67806)) * Fixed reduce ops on `binary_cross_entropy_with_logits` ([#67805](https://github.com/pytorch/pytorch/pull/67805)) * Propagated node metadata across passes ([#45256](https://github.com/pytorch/pytorch/pull/45256)) * Ensured outputs don’t have the same name ([#66137](https://github.com/pytorch/pytorch/pull/66137)) * Fixed `pad` with sequence inputs ([#64377](https://github.com/pytorch/pytorch/pull/64377)) * Fixed `instance_norm` with `track_running_stats=True` ([#64375](https://github.com/pytorch/pytorch/pull/64375)) * Fixed `all` and `any` with `dim` arg ([#67270](https://github.com/pytorch/pytorch/pull/67270)) * Allows autograd functions (`prim::PythonOp`) to be exported with `OperatorExportTypes.ONNX_FALLTHROUGH` ([#67273](https://github.com/pytorch/pytorch/pull/67273)) ## torch.package * Prevent import race condition that leaves `torch.package.PackagePickler` with unwanted dispatch table entries. ([#71025](https://github.com/pytorch/pytorch/pull/71025)) # Performance ## Python API * Speed up pickling for `torch.dtype` ([#65182](https://github.com/pytorch/pytorch/pull/65182)) * Speed up `histogram`: avoid index_put_ overhead in histogram kernel's inner loop ([#67815](https://github.com/pytorch/pytorch/pull/67815)) * Speed up `torch.topk` with sort for some cases ([#68632](https://github.com/pytorch/pytorch/pull/68632)) * Speed up `torch.stack`: don't unsqueeze every stack arg if possible ([#70288](https://github.com/pytorch/pytorch/pull/70288)) * Speed up `LayerNorm` 4-5% ([#71423](https://github.com/pytorch/pytorch/pull/71423)) * Speed up structured kernels: fix some unnecessary refcount bumps ([#71140](https://github.com/pytorch/pytorch/pull/71140)) * Speed up `indexing` functions: release GIL in a few places ([#71728](https://github.com/pytorch/pytorch/pull/71728)) * Speed up `torch.empty` a bit: define check_sizes_nonnegative as inline ([#71640](https://github.com/pytorch/pytorch/pull/71640)) * Speed up `XLA` tensor printing by reducing compilations ([#71147](https://github.com/pytorch/pytorch/pull/71147)) ## C++ API * Updated `c10::SmallVector` from LLVM ([#69110](https://github.com/pytorch/pytorch/pull/69110)) * Reduced some framework overhead in `at::copy_()` ([#68950](https://github.com/pytorch/pytorch/pull/68950)) * Reduced some overhead in `StorageImpl::set_data_ptr` ([#65432](https://github.com/pytorch/pytorch/pull/65432)) * Improved `IValue` performance for tuples by inlining tuple storage ([#64066](https://github.com/pytorch/pytorch/pull/64066)) ## Autograd * Stopped materializing Tensors full of 0s in forward AD when possible ([#64837](https://github.com/pytorch/pytorch/pull/64837)) * Rewrote the backward of `linalg.lu` and `linalg.lu_solve` to 
use `linalg_solve_triangular` ([#63569](https://github.com/pytorch/pytorch/pull/63569)) * Updated `nn.functional.grid_sample` backward to compute input gradient only if required ([#66069](https://github.com/pytorch/pytorch/pull/66069), [#66070](https://github.com/pytorch/pytorch/pull/66070)) * Stopped erroneously saving the output of `torch.softplus` for backward ([#70296](https://github.com/pytorch/pytorch/pull/70296)) ## Complex Numbers * Release GIL when assigning to real or imaginary components of a complex tensor ([#71747](https://github.com/pytorch/pytorch/pull/71747)) * Restored conjugate and negative bits of a tensor when calling `repeat_interleave` ([#68523](https://github.com/pytorch/pytorch/pull/68523)) ## CUDA * Used a better hash table in `CUDACachingAllocator` ([#71667](https://github.com/pytorch/pytorch/pull/71667)) * `TopK` CUDA Optimization: used multiple block per slice ([#71081](https://github.com/pytorch/pytorch/pull/71081)) * Removed sync in `Embedding` caused by `unique` ([#66091](https://github.com/pytorch/pytorch/pull/66091)) * `EmbeddingBackward` exclusive_scan thrust->cub ([#66566](https://github.com/pytorch/pytorch/pull/66566)) * `sort_out_cuda`: Used custom kernels to fill index tensors ([#66668](https://github.com/pytorch/pytorch/pull/66668)) * `masked_scatter`: fuse mask count check into one kernel ([#66871](https://github.com/pytorch/pytorch/pull/66871)) * Enabled better depthwise conv perf on cudnn 8.2+ ([#58749](https://github.com/pytorch/pytorch/pull/58749)) * Improved native `layer_norm` forward perf ([#67977](https://github.com/pytorch/pytorch/pull/67977)) * Improved native `layer_norm` backward perf ([#68238](https://github.com/pytorch/pytorch/pull/68238)) * Fast path for size 0 GPU host malloc ([#68532](https://github.com/pytorch/pytorch/pull/68532)) * Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability ([#69299](https://github.com/pytorch/pytorch/pull/69299)) * Used legacy unrolled kernel for non-trivial offset calc cases ([#71710](https://github.com/pytorch/pytorch/pull/71710)) * Removed `call_once` from `CUDACachingAllocator` ([#71668](https://github.com/pytorch/pytorch/pull/71668)) * Reworked stat collection in `CUDACachingAllocator` ([#71669](https://github.com/pytorch/pytorch/pull/71669)) * Fixed CUDA `LpNormFunctor` ([#70601](https://github.com/pytorch/pytorch/pull/70601)) ## Dispatcher * Made `c10::KernelFunction` struct smaller, which should reduce some memory usage by the dispatcher ([#65618](https://github.com/pytorch/pytorch/pull/65618)) ## torch.fx * Made `torch.fx.symbolic_trace` reuse buffers if they're the same ([#66211](https://github.com/pytorch/pytorch/pull/66211)) ## Profiler * Optimized profiler internals ([#68397](https://github.com/pytorch/pytorch/pull/68397), [#68411](https://github.com/pytorch/pytorch/pull/68411), [#69737](https://github.com/pytorch/pytorch/pull/69737), [#68412](https://github.com/pytorch/pytorch/pull/68412), [#70001](https://github.com/pytorch/pytorch/pull/70001), [#70002](https://github.com/pytorch/pytorch/pull/70002), [#70133](https://github.com/pytorch/pytorch/pull/70133)) ## Mobile * Reduced PyTorch Library startup time by 40% for mobile and edge deployments([#65735](https://github.com/pytorch/pytorch/pull/65735), [#65732](https://github.com/pytorch/pytorch/pull/65732), [#65939](https://github.com/pytorch/pytorch/pull/65939), [#66112](https://github.com/pytorch/pytorch/pull/66112), [#66064](https://github.com/pytorch/pytorch/pull/66064), 
[#66131](https://github.com/pytorch/pytorch/pull/66131)) * Reduced PyTorch Library heap memory utilization by 40% for mobile and edge deployments([#65732](https://github.com/pytorch/pytorch/pull/65732), [#66112](https://github.com/pytorch/pytorch/pull/66112), [#66064](https://github.com/pytorch/pytorch/pull/66064), [#66131](https://github.com/pytorch/pytorch/pull/66131)) * Improve efficiency of IValue and reduce overhead in code paths that use IValue and perform Type Parsing ([#65710](https://github.com/pytorch/pytorch/pull/65710), [#64278](https://github.com/pytorch/pytorch/pull/64278), [#66717](https://github.com/pytorch/pytorch/pull/66717), [#65381](https://github.com/pytorch/pytorch/pull/65381), [#66134](https://github.com/pytorch/pytorch/pull/66134), [#65951](https://github.com/pytorch/pytorch/pull/65951), [#70477](https://github.com/pytorch/pytorch/pull/70477)) ## TorchScript * Improved performance of autodiff on small JIT graphs ([#71666](https://github.com/pytorch/pytorch/pull/71666)) * Enabled autocasting of tensors between fp16, bfloat 16 and fp32 in torchscript models ([#63939](https://github.com/pytorch/pytorch/pull/63939), [#67707](https://github.com/pytorch/pytorch/pull/67707)) * Enables optimizations in more gradSumToSize cases in the JIT Autograd support([#63941](https://github.com/pytorch/pytorch/pull/63941)) * In Unpickling a JIT graph, avoid reading file from a stream for 0 byte tensor storage([#67787](https://github.com/pytorch/pytorch/pull/67787)) ## Quantization * Sped up quantized `torch.nn.functional.interpolate` for channels last ([#66525](https://github.com/pytorch/pytorch/pull/66525)) * Sped up `torch.nn.functional.upsample` for channels last ([#70903](https://github.com/pytorch/pytorch/pull/70903)) * Parallelized computation in `torch.quantize_per_tensor_affine` and `torch.dequantize` ([#65845](https://github.com/pytorch/pytorch/pull/65845)) # Documentation ## Python API * Added docs for `torch.adjoint`. ([#68869](https://github.com/pytorch/pytorch/pull/68869)) * Clarified difference in behavior of `empty_strided` and `as_strided` ([#64568](https://github.com/pytorch/pytorch/pull/64568)) * Added some missing generated doc entries (`torch.select`, `torch.slice_scatter`, `torch.diagonal_scatter`, `torch.select_scatter`) ([#69030](https://github.com/pytorch/pytorch/pull/69030)), `histogramdd` ([#68273](https://github.com/pytorch/pytorch/pull/68273)) * Typo and formatting fixes. `LinearLR` ([#67840](https://github.com/pytorch/pytorch/pull/67840)), `torch.any` ([#65310](https://github.com/pytorch/pytorch/pull/65310), [#70187](https://github.com/pytorch/pytorch/pull/70187)), `torch.futures` ([#70630](https://github.com/pytorch/pytorch/pull/70630)), jit docs ([#68557](https://github.com/pytorch/pytorch/pull/68557)), `Tensor.type` ([#67019](https://github.com/pytorch/pytorch/pull/67019)), `torch.lobpcg` ([#71464](https://github.com/pytorch/pytorch/pull/71464)), `Tensor.triu()`, `Tensor.tril()`, `Tensor.ravel()`. ([#71057](https://github.com/pytorch/pytorch/pull/71057)), `torch.acosh` ([#66814](https://github.com/pytorch/pytorch/pull/66814)), ([#70439](https://github.com/pytorch/pytorch/pull/70439)) * General Doc improvements for individual ops. 
`torch.finfo` (mention `torch.bfloat16`) ([#68496](https://github.com/pytorch/pytorch/pull/68496)), `torch.quantile` interpolation kwarg ([#70637](https://github.com/pytorch/pytorch/pull/70637)), `from_dlpack` and `to_dlpack` ([#70437](https://github.com/pytorch/pytorch/pull/70437)), `set_printoptions` added examples ([#68324](https://github.com/pytorch/pytorch/pull/68324)), `index_add` ([#65806](https://github.com/pytorch/pytorch/pull/65806)), topk doc ([#65938](https://github.com/pytorch/pytorch/pull/65938)), `unique` ([#66132](https://github.com/pytorch/pytorch/pull/66132)), `chi2` ([#67379](https://github.com/pytorch/pytorch/pull/67379)), `torch.histc` ([#64191](https://github.com/pytorch/pytorch/pull/64191)), `empty` and `empty_like` ([#68874](https://github.com/pytorch/pytorch/pull/68874)), `torch.cholesky_inverse` ([#69069](https://github.com/pytorch/pytorch/pull/69069)), `torch.dsplit` ([#70557](https://github.com/pytorch/pytorch/pull/70557)) * Changed README getting started link to explicit instructions ([#66828](https://github.com/pytorch/pytorch/pull/66828)) * Modernized and clarified docs for `torch.tensor` and `torch.as_tensor` ([#63308](https://github.com/pytorch/pytorch/pull/63308)) * Improved `torchhub` docs ([#69970](https://github.com/pytorch/pytorch/pull/69970)) * Updated docs for `torch.Tensor.real` to indicate that it's supported for real tensors ([#71962](https://github.com/pytorch/pytorch/pull/71962)) ## C++ API * Fixed typos in ATen README ([#69170](https://github.com/pytorch/pytorch/pull/69170)) * Mentioned `TORCH_SHOW_CPP_STACKTRACES` in `Contributing.md` docs ([#64052](https://github.com/pytorch/pytorch/pull/64052)) * Updated link to C++ frontend examples ([#66095](https://github.com/pytorch/pytorch/pull/66095)) * Added docs for Visual Studio extension ([#63944](https://github.com/pytorch/pytorch/pull/63944)) * Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows ([#73013](https://github.com/pytorch/pytorch/pull/73013)) ## Autograd * Updated docs for forward AD and make them public ([#71643](https://github.com/pytorch/pytorch/pull/71643), [#71159](https://github.com/pytorch/pytorch/pull/71159)) * Updated “Extending PyTorch” doc to cover forward AD ([#66962](https://github.com/pytorch/pytorch/pull/66962)) * Fixed broken code syntax in autograd.rst ([#69362](https://github.com/pytorch/pytorch/pull/69362)) * Fixed incorrect variable in autograd docs ([#70884](https://github.com/pytorch/pytorch/pull/70884)) * Fixed typo in `torch.autograd.Function` docs that prevented it from compiling ([#66754](https://github.com/pytorch/pytorch/pull/66754)) ## Dataloader * Added docstring for `default_collate` and `default_convert` ([#69862](https://github.com/pytorch/pytorch/pull/69862)) * Updated the documentation for AMP with DataParallel ([#69218](https://github.com/pytorch/pytorch/pull/69218)) ## torch.nn * `F.binary_cross_entropy`: Updated examples to avoid deprecated calls ([#69816](https://github.com/pytorch/pytorch/pull/69816)) * `F.linear`: Fixed shape docs to indicate no-batch-dim support ([#66884](https://github.com/pytorch/pytorch/pull/66884)) * `F.max_pool*d`: Added functional docs ([#63264](https://github.com/pytorch/pytorch/pull/63264)) * `F.multilabel_soft_margin_loss`: Added reduction args to signature ([#70420](https://github.com/pytorch/pytorch/pull/70420)) * `nn.AdaptiveLogSoftmaxWithLoss`: Fixed typo in `log_prob` name ([#68926](https://github.com/pytorch/pytorch/pull/68926)) * `nn.{BatchNorm1d, InstanceNorm1d}`: Fixed input shape 
notation inconsistencies ([#71371](https://github.com/pytorch/pytorch/pull/71371)) * `nn.CrossEntropyLoss`: Corrected typo in formula for class probability targets ([#70220](https://github.com/pytorch/pytorch/pull/70220)) * `nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}`: Made first line of docstring readable for overview docs ([#70574](https://github.com/pytorch/pytorch/pull/70574), [#71012](https://github.com/pytorch/pytorch/pull/71012), [#70987](https://github.com/pytorch/pytorch/pull/70987), [#71100](https://github.com/pytorch/pytorch/pull/71100), [#70576](https://github.com/pytorch/pytorch/pull/70576), [#70577](https://github.com/pytorch/pytorch/pull/70577)) * `nn.Flatten`: Simplified example code ([#67472](https://github.com/pytorch/pytorch/pull/67472)) * `nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}`: Added activation function images ([#65415](https://github.com/pytorch/pytorch/pull/65415)) * `nn.KLDivLoss`: Fixed rendering of `reduction` arg ([#66583](https://github.com/pytorch/pytorch/pull/66583)) * `nn.KLDivLoss`: Rewrote docs to clarify math ([#67443](https://github.com/pytorch/pytorch/pull/67443)) * `nn.MaxUnpool2d`: Changed misleading example to better demonstrate `output_size` usage ([#68936](https://github.com/pytorch/pytorch/pull/68936)) * `nn.Module`: Added note describing required `super().__init__()` call ([#66909](https://github.com/pytorch/pytorch/pull/66909)) * `nn.Module`: Changed `super()` usage to Python 3 syntax in example ([#65748](https://github.com/pytorch/pytorch/pull/65748)) * `nn.Module`: Fixed formatting for `named_modules()` ([#70491](https://github.com/pytorch/pytorch/pull/70491)) * `nn.NLLLoss`: Corrected default value for `reduce` ([#68426](https://github.com/pytorch/pytorch/pull/68426)) * `nn.SmoothL1Loss`: Clarified equivalence with `nn.L1Loss` when `beta == 0` ([#70673](https://github.com/pytorch/pytorch/pull/70673)) * `nn.{TransformerDecoderLayer, TransformerEncoderLayer}`: Clarified default `batch_first=False` dimension format ([#66574](https://github.com/pytorch/pytorch/pull/66574)) * `nn.Upsample`: Indicated that `align_corners` takes effect in `bicubic` mode ([#66756](https://github.com/pytorch/pytorch/pull/66756)) * `nn.utils.clip_grad_norm_`: Fixed rendering of `parameters` in `error_if_nonfinite` arg docs ([#69958](https://github.com/pytorch/pytorch/pull/69958)) * `optim.Adam`: Fixed formatting ([#70387](https://github.com/pytorch/pytorch/pull/70387)) * `optim.AdamW`: Fixed formula ([#68587](https://github.com/pytorch/pytorch/pull/68587)) * `optim.RAdam`: Corrected default value of `lr` arg ([#69186](https://github.com/pytorch/pytorch/pull/69186)) * Removed orphan from cuDNN persistent note ([#65160](https://github.com/pytorch/pytorch/pull/65160)) * Updated link to tutorial on defining NN modules ([#65534](https://github.com/pytorch/pytorch/pull/65534)) * `nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, ``F.{conv3d, conv_transpose3d, fold, linear}`: Fix doc formatting regressions from no-batch-dim support ([#73014](https://github.com/pytorch/pytorch/pull/73014)) ## torch.fx * Fixed for retracing documentation which would break for n-ary operators ([#71599](https://github.com/pytorch/pytorch/pull/71599)) * Updated `torch.fx.passes.split_module` docstring ([#65542](https://github.com/pytorch/pytorch/pull/65542)) * Updated `fx.rst` example outputs ([#68043](https://github.com/pytorch/pytorch/pull/68043)) * Added document gotcha about training flag 
([#68915](https://github.com/pytorch/pytorch/pull/68915)) * Defined `get_dot_``graph` to match documentation ([#70541](https://github.com/pytorch/pytorch/pull/70541)) ## Sparse * Updated sparse.rst to warn about _values() ([#71088](https://github.com/pytorch/pytorch/pull/71088)) ## CUDA * Updated Stream `wait` documentation to reference underlying `cudaStreamWaitEvent` call ([#67973](https://github.com/pytorch/pytorch/pull/67973)) * Documented `torch.cuda.ExternalStream`, `torch.cuda.caching_allocator_alloc` and `torch.cuda.caching_allocator_delete` ([#70126](https://github.com/pytorch/pytorch/pull/70126)) * Updated `CUDA Graphs` docs: Fixed `make_graphed_callables` example typos ([#69379](https://github.com/pytorch/pytorch/pull/69379)) ## Mobile * Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS ([#1709](https://github.com/pytorch/tutorials/pull/1709)) * Added recipe for bundled inputs in TorchScript models ([#1524](https://github.com/pytorch/tutorials/pull/1524/files)) ## Distributed * `DistributedDataParallel` * DDP doc fix ([#71363](https://github.com/pytorch/pytorch/pull/71363)) * Clarified how to check memory saving if using gradient_as_bucket_view ([#71483](https://github.com/pytorch/pytorch/pull/71483)) * `torch.distributed` * Updated distributed.rst to show that CUDA send/recv on GPU is supported ([#65601](https://github.com/pytorch/pytorch/pull/65601)) * Clarified checkpoint support ([#68827](https://github.com/pytorch/pytorch/pull/68827)) * Updated distributed.rst for ProcessGroup Extensions ([#71482](https://github.com/pytorch/pytorch/pull/71482)) * `torch.distributed.elastic` * Made --max_restarts explicit in the quickstart and runner docs ([#65838](https://github.com/pytorch/pytorch/pull/65838)) * `torch.distributed.optim` * Rendered `torch.distributed.optim` members ([#67885](https://github.com/pytorch/pytorch/pull/67885)) * `torch.distributed.rpc` * Deleted distributed optimizer section from RPC and add reference to namespace docs page ([#68068](https://github.com/pytorch/pytorch/pull/68068)) ## TorchScript * Added `typing.Union` to supported types in documentation ([#68435](https://github.com/pytorch/pytorch/pull/68435)) * Added documentation to `torch.jit.is_tracing()` ([#67326](https://github.com/pytorch/pytorch/pull/67326)) * Fixed typos in `jit_language_reference.rst` ([#68706](https://github.com/pytorch/pytorch/pull/68706)) ## Quantization * Added documentation for quantized model save/load instructions ([#69789](https://github.com/pytorch/pytorch/pull/69789)) * Updated link to qnnpack in quantization doc. 
([#66226](https://github.com/pytorch/pytorch/pull/66226)) * Improved quantization API docs ([#66379](https://github.com/pytorch/pytorch/pull/66379)) * Quantization docs: add pages for Numeric Suite (Eager and FX) ([#66380](https://github.com/pytorch/pytorch/pull/66380)) * Documented the quantization custom module APIs ([#67449](https://github.com/pytorch/pytorch/pull/67449)) * Improved quantization documentation ([#68907](https://github.com/pytorch/pytorch/pull/68907)) ## ONNX * Improved documentation of `operator_export_type` and `opset_version` args ([#69549](https://github.com/pytorch/pytorch/pull/69549)) * Fixed documentation for `do_constant_folding` arg default ([#71348](https://github.com/pytorch/pytorch/pull/71348)) * Documented `ExportTypes`, `CheckerError`, and `unregister_custom_op_symbolic` ([#68489](https://github.com/pytorch/pytorch/pull/68489)) * Fixed link to ONNX Runtime custom op documentation ([#67944](https://github.com/pytorch/pytorch/pull/67944)) * Added section “Discovering all unconvertible ATen ops at once” ([#66143](https://github.com/pytorch/pytorch/pull/66143)) * Fixed typos ([#66090](https://github.com/pytorch/pytorch/pull/66090)) * Documented work-arounds for indexing export limitations, and improve error messages ([#64579](https://github.com/pytorch/pytorch/pull/64579)) ## torch.package * Add some docs describing how to debug `torch.package` dependencies ([#65704](https://github.com/pytorch/pytorch/pull/65704))

PyTorch 1.10.2 Release, small bug fix release (2022-01-27)

This release is meant to deploy additional fixes not included in the 1.10.1 release:

* fix pybind issue for get_autocast_cpu_dtype and get_autocast_gpu_dtype #66396
* Remove fgrad_input from slow_conv2d #64280
* fix formatting CIRCLE_TAG when building docs #67026

PyTorch 1.10.1 Release, small bug fix release (2021-12-15)

This release is meant to fix the following issues (regressions / silent correctness):

* torch.nn.functional.cross_entropy silently incorrect in PyTorch 1.10 on CUDA on non-contiguous inputs #67167
* channels_last significantly degrades accuracy #67239
* Potential strict aliasing rule violation in bitwise_binary_op (on ARM/NEON) #66119
* torch.get_autocast_cpu_dtype() returns a new dtype #65786
* Conv2d grad bias gets wrong value for bfloat16 case  #68048

The [release tracker](https://github.com/pytorch/pytorch/issues/69100) should contain all relevant pull requests related to this release as well as links to related issues.

PyTorch 1.10 Release, including CUDA Graphs APIs, Frontend and compiler improvements (2021-10-21)

# 1.10.0 Release Notes

* Highlights
* Backwards Incompatible Change
* New Features
* Improvements
* Performance
* Documentation

# Highlights

We are excited to announce the release of PyTorch 1.10. This release is composed of over 3,400 commits since 1.9, made by 426 contributors. We want to sincerely thank our community for continuously improving PyTorch. 

PyTorch 1.10 updates are focused on improving training and performance of PyTorch, and developer usability. Highlights include:
* CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads.
* Several frontend APIs such as FX, `torch.special`, and `nn.Module` Parametrization have moved from beta to stable.
* Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs.
* Android NNAPI support is now available in beta.

You can check the blogpost that shows the new features [here](https://pytorch.org/blog/pytorch-1.10-released/).
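
Since CUDA Graphs are called out in the highlights, here is a minimal, hedged sketch of the prototype capture/replay API added in this release (`torch.cuda.CUDAGraph` and the `torch.cuda.graph` context manager). It assumes a CUDA device and static input shapes between replays, and follows the warm-up-on-a-side-stream pattern recommended in the CUDA Graphs docs; it is an illustration, not the only supported usage.

```python
# Minimal sketch of CUDA Graph capture/replay (assumes a CUDA device and
# static input shapes between replays).
import torch

model = torch.nn.Linear(64, 64).cuda()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture, as the CUDA Graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then replay the graph.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_output.shape)
```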

# Backwards Incompatible changes

## Python API

### `torch.any`/`torch.all` behavior changed slightly to be more consistent for zero-dimension, `uint8` tensors. ([#64642](https://github.com/pytorch/pytorch/pull/64642))

These two functions match the behavior of NumPy, returning an output dtype of bool for all supported dtypes, except for `uint8` (in which case they return a 1 or a 0, but with `uint8` dtype). In some cases with 0-dim tensor inputs, the returned `uint8` value could mistakenly take on a value > 1. This has now been fixed.

1.9.1:
>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(42, dtype=torch.uint8) # wrong, old behavior
      
1.10.0:
>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(1, dtype=torch.uint8) # new, corrected and consistent behavior
      

### Remove deprecated `torch.{is,set}_deterministic` ([#62158](https://github.com/pytorch/pytorch/pull/62158))

This is the end of the deprecation cycle for both of these functions. You should be using `torch.use_deterministic_algorithms` and `torch.are_deterministic_algorithms_enabled` instead.

## Complex Numbers

### **Conjugate View: [`tensor.conj()`](https://pytorch.org/docs/1.10./generated/torch.conj.html) now returns a view tensor that aliases the same memory and has the conjugate bit set ([#54987](https://github.com/pytorch/pytorch/pull/54987), [#60522](https://github.com/pytorch/pytorch/pull/60522), [#66082](https://github.com/pytorch/pytorch/pull/66082), [#63602](https://github.com/pytorch/pytorch/pull/63602)).**

This means that `.conj()` is now an O(1) operation that returns a tensor viewing the same memory as `tensor`, with the conjugate bit set. This notion of a conjugate bit enables fusion of operations with conjugation, which gives a significant performance benefit for operations like matrix multiplication. All out-of-place operations behave the same as before, but an in-place operation on a conjugated tensor will additionally modify the input tensor.

1.9.1:
>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([1.+2.j])
      
1.10.0:
>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([3.+2.j])
      
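
As a quick illustration of the conjugate-bit behavior described above, here is a minimal sketch using `Tensor.is_conj()` and `Tensor.resolve_conj()`, which the note below also discusses:

```python
import torch

x = torch.tensor([1 + 2j])
y = x.conj()              # O(1): a view of x with the conjugate bit set
print(y.is_conj())        # True
z = y.resolve_conj()      # materializes the conjugate into new memory
print(z.is_conj())        # False
y.add_(2)                 # in-place op on the view also modifies x
print(x)                  # tensor([3.+2.j])
```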

Note: You can verify whether the conj bit is set by calling `tensor.is_conj()`. The conjugation can be resolved at any time, i.e., you can obtain a new tensor that doesn't share storage with the input tensor, by calling `conjugated_tensor.clone()` or `conjugated_tensor.resolve_conj()`. Note that these conjugated tensors behave differently from the corresponding NumPy arrays obtained from `np.conj()` when an in-place operation is performed on them (similar to the example shown above).

### **Negative View: `tensor.conj().neg()` returns a view tensor that aliases the same memory as both `tensor` and `tensor.conj()` and has a negative bit set ([#56058](https://github.com/pytorch/pytorch/pull/56058)).**

`conjugated_tensor.neg()` continues to be an O(1) operation, but the returned tensor shares memory with both `tensor` and `conjugated_tensor`.

1.9.1:
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> z.add_(2)
>>> print(x)
tensor([1.+2.j])
      
1.10.0:
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> print(z.is_neg())
True
>>> z.add_(2)
>>> print(x)
tensor([1.-0.j])
      

### `tensor.numpy()` now throws `RuntimeError` when called on a tensor with the conjugate or negative bit set ([#61925](https://github.com/pytorch/pytorch/pull/61925))

Because the notion of a conjugate or negative bit doesn't exist outside of PyTorch, operations that return a Python object viewing the same memory as the input, such as `.numpy()`, no longer work for tensors with the conjugate or negative bit set.

1.9.1:
>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
[2.]
      
1.10.0:
>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
RuntimeError: Can't call numpy() on Tensor that has negative
bit set. Use tensor.resolve_neg().numpy() instead.
      

## Autograd

### Raise `TypeError` instead of `RuntimeError` when assigning to a Tensor's grad field with the wrong type ([#64876](https://github.com/pytorch/pytorch/pull/64876))

Setting the `.grad` field to a non-None, non-Tensor object used to raise a `RuntimeError`; it now properly raises a `TypeError`. If your code was catching this error, simply update it to catch a `TypeError` instead of a `RuntimeError`.

1.9.1:
try:
    # Assigning an int to a Tensor's grad field
    a.grad = 0
except RuntimeError as e:
    pass
      
1.10.0:
try:
   a.grad = 0
except TypeError as e:
    pass
      

### Raise error when inputs to `autograd.grad` are empty ([#52016](https://github.com/pytorch/pytorch/pull/52016))

Calling `autograd.grad` with an empty list of inputs used to behave the same as `backward()`. To reduce confusion, it now raises the expected error. If you were relying on this, you can simply update your code as follows:

1.9.1:
grad = autograd.grad(out, tuple())
assert grad == tuple()
      
1.10.0:
out.backward()
      

### Optional arguments to `autograd.gradcheck` and `autograd.gradgradcheck` are now kwarg-only ([#65290](https://github.com/pytorch/pytorch/pull/65290))

These two functions now have a significant number of optional arguments controlling what they do (e.g., `eps`, `atol`, `rtol`, `raise_exception`). To improve readability, we made these arguments kwarg-only. If you are passing them to `autograd.gradcheck` or `autograd.gradgradcheck` as positional arguments, you can update your code as follows:

1.9.1:
torch.autograd.gradcheck(fn, x, 1e-6)
      
1.10.0:
torch.autograd.gradcheck(fn, x, eps=1e-6)
      
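
For reference, a minimal runnable sketch of the kwarg-only form (the `fn` and `x` above are placeholders; double-precision inputs are the usual recommendation for `gradcheck`):

```python
import torch

def fn(x):
    return (x ** 2).sum()

x = torch.randn(4, dtype=torch.double, requires_grad=True)
# All optional arguments must now be passed by keyword.
assert torch.autograd.gradcheck(fn, (x,), eps=1e-6, atol=1e-4)
```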

### In-place detach (`detach_`) now errors for views that return multiple outputs ([#58285](https://github.com/pytorch/pytorch/pull/58285))

This change finishes the deprecation cycle for the inplace-over-view logic. In particular, a few behaviors that previously only warned have been updated:

* `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead.
* The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_sizes`, or `chunk` has been changed from "This view is an output of a function..." to "This view is the output of a function...".

1.9.1:
b = a.split(1)[0]
b.detach_()
      
1.10.0:
b = a.split(1)[0]
c = b.detach()
      

### Fix saved variable unpacking version counter ([#60195](https://github.com/pytorch/pytorch/pull/60195))

In-place modifications of unpacked SavedVariables used to be ignored. They are now properly detected, which can lead to errors saying that a variable needed for backward was modified in-place. This is a valid error, and the user should fix it by cloning the unpacked saved variable before using it (see the sketch after the `__torch_function__` example below). No internal formula will trigger this, but it might be triggered by a user's custom `autograd.Function` if the backward modifies a saved Tensor in-place and you perform multiple backwards. This used to silently return the wrong result and will now raise the expected error.

## torch.nn

### Added optional tensor arguments to `__torch_function__` handling checks ([#63967](https://github.com/pytorch/pytorch/pull/63967))

This fixes the `has_torch_function*()` checks throughout `torch.nn.functional` to correctly pass in optional tensor arguments; prior to this fix, `handle_torch_function()` was not called for these optional tensor arguments. Previously, passing a tensor-like object into a function that accepts an optional tensor might not trigger that object's `__torch_function__`. Now, the object's `__torch_function__` will be triggered as expected.

1.9.1:
import torch
import torch.nn.functional as F
class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight
    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)
# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True
# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and False because weight's __torch_function__ is
# called with func as torch.group_norm instead of F.group_norm
      
1.10.0:
import torch
import torch.nn.functional as F
class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight
    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)
# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True
# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and True
      

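The following is a hedged sketch of the saved-variable change referenced above ([#60195]): the `Square` function here is illustrative only. If its backward modified the saved tensor in place, a second backward would now raise the usual "modified by an inplace operation" error instead of silently returning wrong results; cloning before any in-place use avoids this.

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # If we did `x.mul_(2)` here, the saved tensor's version counter would
        # change, and a later unpack (e.g. during a second backward with
        # retain_graph=True) would now raise the expected error.
        x = x.clone()          # clone before any in-place use of a saved tensor
        return grad_out * 2 * x

x = torch.randn(3, requires_grad=True)
y = Square.apply(x).sum()
y.backward(retain_graph=True)
y.backward()                   # safe: backward did not modify the saved tensor
```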
## CUDA

### Removed post-backward syncs on default stream ([#60421](https://github.com/pytorch/pytorch/pull/60421))

Calls to backward() or grad() used to sync only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:

```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:

```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```

Note: this change makes it so that backward() has the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes). In other words, the weird pattern is now unsafe, and the benign-looking pattern is now safe. Implementation-wise, this means backward() should sync its calling thread's current stream, not the default stream, with the leaf streams. This PR deletes the syncs on the default stream.

## torch.package

* Removed verbose mode from PackageExporter ([#61145](https://github.com/pytorch/pytorch/pull/61145))
* PackageExporter loses its “verbose” mode argument, as we have found it is not useful and sometimes confusing. See the following examples on how to modify your code to accommodate this change.

1.9.1:
with PackageExporter(buffer, verbose=False) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)
      
1.10.0:
with PackageExporter(buffer) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)
      

## Quantization

### Added extra observer/fake_quant (the same observer/fake_quant instance as the input) for some operators in prepare_fx, e.g. maxpool, add_scalar and mul_scalar ([#61687](https://github.com/pytorch/pytorch/pull/61687), [#61859](https://github.com/pytorch/pytorch/pull/61859))

Previously, the way we insert observers/fake_quants was specific to the fbgemm/qnnpack backends. As we work on making FX Graph Mode Quantization extensible to custom backends, we are changing some behaviors for the fbgemm/qnnpack path as well. The above changes add an extra observer/fake_quant to the output of some operators to make sure we model the quantized operator more accurately in quantization aware training. The comprehensive list of operators where the behavior changes is the following:

* modules: torch.nn.MaxPool1d, torch.nn.MaxPool2d, torch.nn.MaxPool3d, torch.nn.Identity
* torch functions: torch.nn.functional.max_pool1d, torch.nn.functional.max_pool2d, torch.nn.functional.max_pool3d, torch.chunk, torch.flatten, torch.transpose, torch.repeat_interleave, torch.sort, torch.squeeze, torch.stack, torch.unsqueeze, operator.getitem
* Tensor methods: chunk, contiguous, detach, detach_, numel, permute, repeat, repeat_interleave, reshape, resize_, shape, size, squeeze, squeeze_, transpose, unsqueeze, unsqueeze_, view
* Tensor operations: add scalar and mul scalar (add/mul with a Tensor and a Scalar input)

We will show an example with torch.nn.MaxPool2d:

```python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool2d = torch.nn.MaxPool2d(kernel_size=3)

    def forward(self, x):
        x = self.maxpool2d(x)
        return x

m = M().eval()
m = prepare_fx(m, {"": torch.quantization.default_qconfig})
print(m.code)
```

1.9.1:
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    return maxpool2d
      
1.10.0:
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    maxpool2d_activation_post_process_0 = self.maxpool2d_activation_post_process_0(maxpool2d); maxpool2d = None
    return maxpool2d_activation_post_process_0
      

Note that `self.maxpool2d_activation_post_process_0` and `self.x_activation_post_process_0` refer to the same observer/fake_quant instance; this simulates the numerics of the quantized maxpool implementation, where the output reuses the quantization parameters of the input. Simple illustration with a graph:

Before:
```
observer_0 - maxpool - ...
```
After:
```
observer_0 - maxpool - observer_0 (same observer instance as input observer) - ...
```

## ONNX

### Removed `aten` arg from `torch.onnx.export()` ([#62759](https://github.com/pytorch/pytorch/pull/62759))

The new `OperatorExportTypes.ONNX` removes the need for an explicit `aten` argument. If PyTorch was built with `-DPYTORCH_ONNX_CAFFE2_BUNDLE`, a `None` value means `OperatorExportTypes.ONNX_ATEN_FALLBACK`.

1.9.1:
torch.onnx.export(..., aten=True)
      
1.10.0:
torch.onnx.export(..., operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN)
      

# Deprecations

## Python API

### Deprecate **`__torch_function__`** as a plain method ([#64843](https://github.com/pytorch/pytorch/pull/64843))

The `__torch_function__` function used to create Tensor-like objects did not have any constraint on whether it should be a method, class method or static method. To make it compatible with newer features on Tensor-like objects, we are deprecating setting it as a plain method. You can define it as a class method to get the current class, and scan the argument list if you need an object that is an instance of this class.

## Mobile

### Removed API torch.utils.bundled_inputs.run_on_bundled_input ([#58344](https://github.com/pytorch/pytorch/pull/58344))

This API caused many issues and is not really necessary. The functionality (run the model with a bundled input) can be achieved by using `get_all_bundled_inputs`. For example:

1.9.1:
```python
model.run_on_bundled_input(0)
```
1.10.0:
```python
model(*model.get_all_bundled_inputs()[0])
```

## Distributed

### `torch.distributed.rpc`: Removed ProcessGroup RPC backend ([#62411](https://github.com/pytorch/pytorch/pull/62411), [#62985](https://github.com/pytorch/pytorch/pull/62985))

The ProcessGroup RPC backend has been deprecated and 1.9 was the last release which carried it. The default RPC backend is TensorPipe, which is the recommended backend for RPC. Users who use `torch.distributed.rpc.BackendType.PROCESS_GROUP` will be given an error message to switch to `torch.distributed.rpc.BackendType.TENSORPIPE`.

## ONNX

### Removed the following arguments in torch.onnx.export(): enable_onnx_checker, strip_doc_string, _retain_param_name ([#64369](https://github.com/pytorch/pytorch/pull/64369), [#64371](https://github.com/pytorch/pytorch/pull/64371), [#64370](https://github.com/pytorch/pytorch/pull/64370))

The `enable_onnx_checker` argument is removed. The ONNX checker will now always run by default; users can catch exceptions to ignore raised failures. `strip_doc_string` has been rolled into the `verbose` arg in `torch.onnx.export()`. The `_retain_param_name` argument has been removed from `torch.onnx.export()`; the behavior is now always that of `_retain_param_name=True`. There is no way to get the old behavior of `_retain_param_name=False`, so users should stop setting this arg.
1.9.1: ``` torch.onnx.export(..., enable_onnx_checker=False, strip_doc_string=False) ``` 1.10.0: ``` try: torch.onnx.export(verbose=True) except torch.onnx.utils.ONNXCheckerError: pass ``` ## Infra (Releng) ### Disable ParallelTBB ([#65092](https://github.com/pytorch/pytorch/pull/65092)) `ParallelTBB` config/codepath is no longer actively tested by PyTorch CI and as result is subject to code/functionality degradation # New features ## Python API * Added new functions: * `torch.isin()` ([#53125](https://github.com/pytorch/pytorch/pull/53125)), `torch.bitwise_{left/right}_shift`, `__rlshift__`, `__rrshift__` ([#59544](https://github.com/pytorch/pytorch/pull/59544)), `torch.Tensor.{__rand__, __ror__,__rxor__}` ([#59240](https://github.com/pytorch/pytorch/pull/59240)), `torch.aminmax` ([#62401](https://github.com/pytorch/pytorch/pull/62401)), `torch.new_ones` ([#58405](https://github.com/pytorch/pytorch/pull/58405)) * For numpy compatibility `torch.cov` ([#58311](https://github.com/pytorch/pytorch/pull/58311)), `torch.frombuffer` ([#59077](https://github.com/pytorch/pytorch/pull/59077)), `torch.corrcoef` ([#60420](https://github.com/pytorch/pytorch/pull/60420)), `torch.nanmean` ([#62671](https://github.com/pytorch/pytorch/pull/62671)), `torch.cumulative_trapezoid` ([#61615](https://github.com/pytorch/pytorch/pull/61615)) * The [torch.special module](https://pytorch.org/docs/1.10.0/special.html?highlight=special) is now stable! This module, consistent with SciPy’s special module, has 30 operations including the Hurwitz zeta function and various gamma functions. ([#59623](https://github.com/pytorch/pytorch/pull/59623), [#56352](https://github.com/pytorch/pytorch/pull/56352), [#58126](https://github.com/pytorch/pytorch/pull/58126), [#59141](https://github.com/pytorch/pytorch/pull/59141), [#59143](https://github.com/pytorch/pytorch/pull/59143), [#58650](https://github.com/pytorch/pytorch/pull/58650), [#55878](https://github.com/pytorch/pytorch/pull/55878), [#58838](https://github.com/pytorch/pytorch/pull/58838), [#60512](https://github.com/pytorch/pytorch/pull/60512), [#60641](https://github.com/pytorch/pytorch/pull/60641), [#61633](https://github.com/pytorch/pytorch/pull/61633), [#60519](https://github.com/pytorch/pytorch/pull/60519), [#59691](https://github.com/pytorch/pytorch/pull/59691), [#58194](https://github.com/pytorch/pytorch/pull/58194)) * Added support for slots and subclass magic getstate/setstate method for Tensor serialization ([#62745](https://github.com/pytorch/pytorch/pull/62745)) * `torch.optim`: * Added Nesterov Adam as `NAdam` ([#59009](https://github.com/pytorch/pytorch/pull/59009)) * Added `lr_scheduler.ChainedScheduler` ([#63491](https://github.com/pytorch/pytorch/pull/63491), [#63457](https://github.com/pytorch/pytorch/pull/63457), [#65034](https://github.com/pytorch/pytorch/pull/65034))) * Added `lr_scheduler.SequentialLR` ([#64037](https://github.com/pytorch/pytorch/pull/64037), [#65035](https://github.com/pytorch/pytorch/pull/65035)) * Added `lr_scheduler.{ConstantLR,LinearLR}` ([#64395](https://github.com/pytorch/pytorch/pull/64395)) * `torch.cpu.amp.autocast`: enable new API for CPU autocast ([#57386](https://github.com/pytorch/pytorch/pull/57386), [#63534](https://github.com/pytorch/pytorch/pull/63534)) * Added `BFloat16` support for `torch.{cross, tril, triu, tril_indices, triu_indices, cumsum, cummax, cummin, median, kthvalue, nansum, nextafter, range, sinh, cosh, frexp, nan_to_num, sigmoid, sigmoid_backward, tanh_backward, addcmul, addcdiv, bucketize, bernoulli, 
dropout, fold, unfold, MaxPool2D, AdaptiveAvgPool2D, topk}` on CPU ([#62454](https://github.com/pytorch/pytorch/pull/62454), [#63307](https://github.com/pytorch/pytorch/pull/63307), [#55210](https://github.com/pytorch/pytorch/pull/55210), [#60074](https://github.com/pytorch/pytorch/pull/60074), [#61083](https://github.com/pytorch/pytorch/pull/61083), [#61829](https://github.com/pytorch/pytorch/pull/61829), [#55221](https://github.com/pytorch/pytorch/pull/55221), [#61826](https://github.com/pytorch/pytorch/pull/61826), [#55588](https://github.com/pytorch/pytorch/pull/55588), [#56372](https://github.com/pytorch/pytorch/pull/56372), [#62880](https://github.com/pytorch/pytorch/pull/62880), [#55202](https://github.com/pytorch/pytorch/pull/55202), [#59547](https://github.com/pytorch/pytorch/pull/59547)) * Added `BFloat16` support for `torch.{ceil, floor, frac, round, trunc, sort, topk, aminmax, cumsum, logcumsumexp, cumprod, cummin, cummax}` on CUDA ([#57910](https://github.com/pytorch/pytorch/pull/57910), [#58196](https://github.com/pytorch/pytorch/pull/58196), [#59977](https://github.com/pytorch/pytorch/pull/59977), [#62767](https://github.com/pytorch/pytorch/pull/62767), [#57904](https://github.com/pytorch/pytorch/pull/57904)). * Added `torch.cuda.is_bf16_supported` ([#63798](https://github.com/pytorch/pytorch/pull/63798)) * Added zero rate to Poisson distribution ([#61511](https://github.com/pytorch/pytorch/pull/61511)) * Added `torch.segment_reduce` ([#59951](https://github.com/pytorch/pytorch/pull/59951), [#60018](https://github.com/pytorch/pytorch/pull/60018), [#61141](https://github.com/pytorch/pytorch/pull/61141), [#61266](https://github.com/pytorch/pytorch/pull/61266), [#59521](https://github.com/pytorch/pytorch/pull/59521), [#60379](https://github.com/pytorch/pytorch/pull/60379), [#60379](https://github.com/pytorch/pytorch/pull/60379)) * Added boolean support to `torch.isclose` ([#61271](https://github.com/pytorch/pytorch/pull/61271)) * Added `torch.trapezoid` ([#61475](https://github.com/pytorch/pytorch/pull/61475)). * Added `torch.gradient` support for second order central differences (edge_order=2) ([#58165](https://github.com/pytorch/pytorch/pull/58165)) * `torch.sigmoid`: CUDA support and complex autograd support ([#48647](https://github.com/pytorch/pytorch/pull/48647)) * Added channels-last support for `torch.bilinear` and `torch.nn,MaxUnpool2d` ([#56322](https://github.com/pytorch/pytorch/pull/56322), [#49984](https://github.com/pytorch/pytorch/pull/49984)) ## Autograd * [Experimental] Forward mode AD: * *NOTE: In addition to operators listed below, many simple ops are already supported. If you encounter an operator that does not have a forward-mode AD formula implemented, please file an issue. 
As a workaround, you can use custom `autograd.Function` to implement your own forward-mode-AD-supported operator.* * Added forward-mode AD support for custom `autograd.Function` ([#64061](https://github.com/pytorch/pytorch/pull/64061), [#63434](https://github.com/pytorch/pytorch/pull/63434)) * Added forward-mode AD support for `torch.{acos, add, addbmm, addcdiv, addcmul, addmm, addmv, addr, angle, acosh, asinh, atanh, asin, atan, conj, baddbmm, bmm, cat, ceil, clamp, clamp_min, clamp_max, complex, copy_sign, cos, cosh, cross, cumprod, cumsum, cummax, cummin, deg2rad, div, dot, vdot, exp, exp2, expm1, expand, floor, frac, frexp, gather, hardswish, hstack, hypot, index_add_, index_copy_, index_put_, index_select, kthvalue, lerp, lgamma, digamma, polygamma, log, log10, log1p, log2, logaddexp, logaddexp2, xlogy, masked_fill_, masked_fill_, masked_scatter_, masked_select, max, maximum, fmax, mean, min, mininum, fmin, mm, mode, mul, lu, lu_solve, vstack}` ([#57768](https://github.com/pytorch/pytorch/pull/57768), [#57863](https://github.com/pytorch/pytorch/pull/57863) [#59711](https://github.com/pytorch/pytorch/pull/59711), [#64742](https://github.com/pytorch/pytorch/pull/64742)) * Added Forward AD support for the following element-wise and linear operators `torch.{mvlgamma, nan_to_num, permute, pow, reciprocal, remainder, repeat, round, rsqrt, sigmoid, logit, sign, sgn, sin, sinc, sinh, sqrt, squeeze, sub, sum, t, flip, roll, rot90, take, tan, tanh, trace, transpose, tril, triu, trunc, unfold, unsqueeze, view, zero_, hardshrink} `([#59993](https://github.com/pytorch/pytorch/pull/59993)) * Added Forward AD support for `torch.special.`{`xlog1py, entr}` ([#59711](https://github.com/pytorch/pytorch/pull/59711), [#59993](https://github.com/pytorch/pytorch/pull/59993)) * Added forward AD support for `torch.linalg.{cholesky, cholesky_ex, eigh, inv, inv_ex, solve}` ([#62160](https://github.com/pytorch/pytorch/pull/62160), [#64646](https://github.com/pytorch/pytorch/pull/64646), [#62163](https://github.com/pytorch/pytorch/pull/62163), [#62159](https://github.com/pytorch/pytorch/pull/62159)) * Added forward AD support for `torch.functional.leak_relu` ([#59993](https://github.com/pytorch/pytorch/pull/59993)) * Added saved tensor hooks to customize packing/unpacking behavior of tensors saved for backward ([#60685](https://github.com/pytorch/pytorch/pull/60685), [#60663](https://github.com/pytorch/pytorch/pull/60663), [#62564](https://github.com/pytorch/pytorch/pull/62564), [#60975](https://github.com/pytorch/pytorch/pull/60975), [#62909](https://github.com/pytorch/pytorch/pull/62909), [#62717](https://github.com/pytorch/pytorch/pull/62717)) * Exposed raw saved tensors for custom `autograd.Function` to use with the saved tensor hooks ([#60551](https://github.com/pytorch/pytorch/pull/60551)) * Added default saved tensor hooks ([#61834](https://github.com/pytorch/pytorch/pull/61834), [#62563](https://github.com/pytorch/pytorch/pull/62563), [#62361](https://github.com/pytorch/pytorch/pull/62361)) * Added context manager using default saved tensor hooks to automatically move saved tensors on CPU and back ([#61928](https://github.com/pytorch/pytorch/pull/61928), [#62410](https://github.com/pytorch/pytorch/pull/62410)) * Added C++ and python bindings for `.is_inference()` method ([#58729](https://github.com/pytorch/pytorch/pull/58729)) * `torch.lu_solve`: Implement support for backward AD ([#61681](https://github.com/pytorch/pytorch/pull/61681)). 
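Since the saved-tensor hooks listed above do not come with an example here, the following is a hedged sketch using the APIs as they appear in the 1.10 docs (`torch.autograd.graph.saved_tensors_hooks` and `torch.autograd.graph.save_on_cpu`):

```python
import torch

def pack(t):
    # Called when autograd saves `t` for backward; may return any object.
    print("packing", tuple(t.shape))
    return t

def unpack(obj):
    # Called when backward needs the saved tensor again.
    return obj

a = torch.randn(5, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (a * a).sum()
y.backward()

# Default-hooks variant: keep saved activations on CPU during the forward pass
# and move them back to the original device for backward (most useful for CUDA
# tensors; shown here with a CPU tensor just so the sketch runs anywhere).
b = torch.randn(5, requires_grad=True)
with torch.autograd.graph.save_on_cpu(pin_memory=False):
    (b * b).sum().backward()
```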
## torch.nn * Added new modules: `nn.{ReflectionPad3d, LazyInstanceNorm*d}` ([#59791](https://github.com/pytorch/pytorch/pull/59791), [#60837](https://github.com/pytorch/pytorch/pull/60837), [#61308](https://github.com/pytorch/pytorch/pull/61308), [#60982](https://github.com/pytorch/pytorch/pull/60982)) * `nn.CrossEntropyLoss`: Added support for class probability targets ([#61044](https://github.com/pytorch/pytorch/pull/61044)) * `nn.CrossEntropyLoss`: Added support for label smoothing ([#63122](https://github.com/pytorch/pytorch/pull/63122)) * `nn.Module`: Added support for arbitrary objects in state_dicts via `get_extra_state()` / `set_extra_state()` ([#62976](https://github.com/pytorch/pytorch/pull/62976)) * `nn.utils.skip_init()`: Added function to skip module parameter / buffer initialization ([#57555](https://github.com/pytorch/pytorch/pull/57555)) ## Profiler * Added profiler support for mobile ([#62419](https://github.com/pytorch/pytorch/pull/62419), [#62418](https://github.com/pytorch/pytorch/pull/62418), [#62417](https://github.com/pytorch/pytorch/pull/62417),[#62228](https://github.com/pytorch/pytorch/pull/62228),[#62191,](https://github.com/pytorch/pytorch/pull/62191)[#61792](https://github.com/pytorch/pytorch/pull/61792)) * Ported Nvtx support to new profiler ([#61634](https://github.com/pytorch/pytorch/pull/61634)) * Added Tensor core usage stats and recommendations in Tensorboard ([`#364`](https://github.com/pytorch/kineto/pull/364)[,](https://github.com/pytorch/kineto/pull/402/commits/e435a8f55fdbf2a2331931782404b9020eefa4ba)[`#368`](https://github.com/pytorch/kineto/pull/368)[,](https://github.com/pytorch/kineto/pull/402/commits/d3132ebc51faed586e6699e895fecc6b4d255334)[`#383`](https://github.com/pytorch/kineto/pull/383), [`#422`](https://github.com/pytorch/kineto/pull/422)) ## CUDA * Allow enabling warnings on CUDA synchronization ([#62092](https://github.com/pytorch/pytorch/pull/62092)) * Added CUDA graph Prototype API and documentation ([#63269](https://github.com/pytorch/pytorch/pull/63269)) * Make stream semantics of backward calls consistent with other cuda ops ([#57833](https://github.com/pytorch/pytorch/pull/57833), [#60230](https://github.com/pytorch/pytorch/pull/60230), [#60127](https://github.com/pytorch/pytorch/pull/60127)) * Enabled autocast support for user-specified device and dtype ([#61002](https://github.com/pytorch/pytorch/pull/61002), [#63416](https://github.com/pytorch/pytorch/pull/63416)) ## C++ API * Added C++ API for meta functions. 
They are available in the `at::meta::` namespace ([#58570](https://github.com/pytorch/pytorch/pull/58570)) * Exposed interface to set grain size on `cpu_kernel`, `cpu_kernel_vec` and `cpu_kernel_multiple_outputs` ([#58949](https://github.com/pytorch/pytorch/pull/58949)) * Added `at::native::resize_bytes_cpu` to resize `Storage` in ATen ([#60324](https://github.com/pytorch/pytorch/pull/60324)) * Added `transpose` to PackedTensorAccessor ([#61114](https://github.com/pytorch/pytorch/pull/61114)) * Added `torch::linalg::qr` as the C++ API ([#60529](https://github.com/pytorch/pytorch/pull/60529)) * Exposed `amin` and `amax` to aten symbols ([#61550](https://github.com/pytorch/pytorch/pull/61550)) * Added support to invoke callable activation function for Transformer modules ([#62342](https://github.com/pytorch/pytorch/pull/62342)) * Added support for `c10::optional` to compare with different but comparable types ([#62890](https://github.com/pytorch/pytorch/pull/62890)) * Added a unified API `c10::util::check_env` to check environment variable ([#59052](https://github.com/pytorch/pytorch/pull/59052)) ## TorchScript * Added reference semantics to TorchScript classes ([#44324](https://github.com/pytorch/pytorch/pull/44324)) * Conservatively moved all suitable prim ops from full-jit to mobile, and make them selective. ([#58353](https://github.com/pytorch/pytorch/pull/58353)) * Added change to predicate uses of RPC APIs on `torch.distributed.rpc.is_available()` ([#58887](https://github.com/pytorch/pytorch/pull/58887)) * Added a phase to perform inplace<->functional conversion for activation operators ([#57477](https://github.com/pytorch/pytorch/pull/57477)) * Enabled Profile-Directed Typing in `torch.jit.script` ([#62420](https://github.com/pytorch/pytorch/pull/62420)) * Introduced enhancement for smart serialization for operator schemas with out arg ([#63096](https://github.com/pytorch/pytorch/pull/63096)) * Added a pass to transform better handle concatenation ops ([#59881](https://github.com/pytorch/pytorch/pull/59881)) * Added a new operator for concat that takes in variadic parameters ([#59880](https://github.com/pytorch/pytorch/pull/59880)) * Added support for union in TorchScript ([#64234](https://github.com/pytorch/pytorch/pull/64234)) ## torch.package * Added basic tooling to enable users to see what is inside of a PackageExporter ([#61147](https://github.com/pytorch/pytorch/pull/61147)) * Added hasattr to `torch::deploy` C++ API ([#62669](https://github.com/pytorch/pytorch/pull/62669)) * Added support to re-save a PackageImporter module ([#65101](https://github.com/pytorch/pytorch/pull/65101)) * Added support to make frozen symbol name customizable in `torch::deploy`. 
([#63817](https://github.com/pytorch/pytorch/pull/63817)) ## Mobile * Enabled kineto profiler on mobile via EdgeKinetoProfiler ([#62419](https://github.com/pytorch/pytorch/pull/62419)) * Added support of loading lite interpreter module from assets in Android ([#61609](https://github.com/pytorch/pytorch/pull/61609)) * Enabled tracing based selective build ([#63421,](https://github.com/pytorch/pytorch/pull/63421) [#64087](https://github.com/pytorch/pytorch/pull/64087), [#66237,](https://github.com/pytorch/pytorch/pull/66237) [#66395](https://github.com/pytorch/pytorch/pull/66395)) * built tracer in OSS ([#64087](https://github.com/pytorch/pytorch/pull/64087)) * used operator.yaml to build libtorch library ([#66237)](https://github.com/pytorch/pytorch/pull/66237) * Built tracer and enabled tracing-based build with tracer output ([#66395](https://github.com/pytorch/pytorch/pull/66395)) * NNAPI * Android NNAPI delegate implementation of runtime initialization (compilation) and execution ([#62272](https://github.com/pytorch/pytorch/pull/62272)) * Added `aten::{avgpool2d,softmax,to,div,flatten,detach,slice,log_softmax,conv2d_transpose}` to NNAPI converter ([#58538](https://github.com/pytorch/pytorch/pull/58538), [#58539](https://github.com/pytorch/pytorch/pull/58539), [#58540](https://github.com/pytorch/pytorch/pull/58540), [#58541](https://github.com/pytorch/pytorch/pull/58541), [#60885](https://github.com/pytorch/pytorch/pull/60885), [#58543](https://github.com/pytorch/pytorch/pull/58543), [#59364](https://github.com/pytorch/pytorch/pull/59364), [#61378](https://github.com/pytorch/pytorch/pull/61378), [#59529](https://github.com/pytorch/pytorch/pull/59529) * Added Int32 support for NNAPI ([#59365](https://github.com/pytorch/pytorch/pull/59365)) * Made nnapi `aten::{conv2d,linear,cat,flatten}` converter accept flexible batch ([#61021](https://github.com/pytorch/pytorch/pull/61021), [#61022](https://github.com/pytorch/pytorch/pull/61022), [76c0f223d3](https://github.com/pytorch/pytorch/commit/76c0f223d3), [#61024](https://github.com/pytorch/pytorch/pull/61024)) * Added option to specify custom NNAPI serializer ([#61025](https://github.com/pytorch/pytorch/pull/61025)) * Made Android NNAPI preprocess to accept both single Tensor inputs and Tensor List inputs ([#61752](https://github.com/pytorch/pytorch/pull/61752)) * Added a few improvements in NNAPI delegation ([#63489](https://github.com/pytorch/pytorch/pull/63489)) * Added support const values in binary ops ([2d58f3f56d](https://github.com/pytorch/pytorch/commit/2d58f3f56d)) * Added unary/binary ops necessary and more shape functions for mobilenet ([#56828](https://github.com/pytorch/pytorch/pull/56828), [#58932](https://github.com/pytorch/pytorch/pull/58932)) * Added `aten::{hardswish,tanh,clamp}` for iOS Metal ([#64588](https://github.com/pytorch/pytorch/pull/64588), [#61383](https://github.com/pytorch/pytorch/pull/61383)) * Added CoreML support ([#64521](https://github.com/pytorch/pytorch/pull/64521), [#64522](https://github.com/pytorch/pytorch/pull/64522), [#64523](https://github.com/pytorch/pytorch/pull/64523)) * Added compatibility API ([#61477](https://github.com/pytorch/pytorch/pull/61477), [#57501](https://github.com/pytorch/pytorch/pull/57501)) * Added support operators with default argument in front of out argument ([#63651](https://github.com/pytorch/pytorch/pull/63651), [#63540](https://github.com/pytorch/pytorch/pull/63540)) ## Distributed `DistributedDataParallel` * Local SGD and variants for DDP communication optimization 
([#60303](https://github.com/pytorch/pytorch/pull/60303), [#60320](https://github.com/pytorch/pytorch/pull/60320), [#60632](https://github.com/pytorch/pytorch/pull/60632), [#60891](https://github.com/pytorch/pytorch/pull/60891), [#61206](https://github.com/pytorch/pytorch/pull/61206), [#61207](https://github.com/pytorch/pytorch/pull/61207), [#62105](https://github.com/pytorch/pytorch/pull/62105), [#62111](https://github.com/pytorch/pytorch/pull/62111), [#62131](https://github.com/pytorch/pytorch/pull/62131), [#62132](https://github.com/pytorch/pytorch/pull/62132), [#62392](https://github.com/pytorch/pytorch/pull/62392), [#63277](https://github.com/pytorch/pytorch/pull/63277), [#63340](https://github.com/pytorch/pytorch/pull/63340), [#64885](https://github.com/pytorch/pytorch/pull/64885), [#65197](https://github.com/pytorch/pytorch/pull/65197)) * Provided a noop hook for performance debugging ([#64344](https://github.com/pytorch/pytorch/pull/64344), [#64352](https://github.com/pytorch/pytorch/pull/64352)) * Implemented BF16 allreduce gradient communication hook ([#63260](https://github.com/pytorch/pytorch/pull/63260)) * Allowed retrieval of model parameters in communication hook ([#61637](https://github.com/pytorch/pytorch/pull/61637)) `torch.distributed` * Added a function to create new subgroups of a given size ([#59111](https://github.com/pytorch/pytorch/pull/59111)) * Introduced a new torchrun entry point for elastic ([#64049](https://github.com/pytorch/pytorch/pull/64049)) ## torch.fx * Added APIs to mutate specific args/kwargs ([#58571](https://github.com/pytorch/pytorch/pull/58571)) * Introduced EngineHolder for serializing and running TRT Engines with PyTorch ([06399d441d](https://github.com/pytorch/pytorch/commit/06399d441d)) * Introduced `__fx_create_arg__` dunder method for controlling custom classes are handled as node args ([#61780](https://github.com/pytorch/pytorch/pull/61780)) * Added `autowrap_functions` kwarg to Tracer ([#62106](https://github.com/pytorch/pytorch/pull/62106)) * Gradual typing * Added type annotation field to nodes ([#60621](https://github.com/pytorch/pytorch/pull/60621)) * Added experimental gradual typechecker ([#60805](https://github.com/pytorch/pytorch/pull/60805)) * Extended all experimental type-checking operations to support `conv2d`, `BatchNorm2D`, `ReLU`, `maxpool2D`, `AdaptiveAvgPooling2D`, `flatten` ([#61093](https://github.com/pytorch/pytorch/pull/61093), [#61012](https://github.com/pytorch/pytorch/pull/61012), [#61150](https://github.com/pytorch/pytorch/pull/61150), [#61188](https://github.com/pytorch/pytorch/pull/61188), [#61239](https://github.com/pytorch/pytorch/pull/61239), [#61265](https://github.com/pytorch/pytorch/pull/61265)) * Added experimental refinement types and unification for symbolic shape inference ([#61776](https://github.com/pytorch/pytorch/pull/61776)) * Changed output node handling for typechecker to deal with tuples ([#62582](https://github.com/pytorch/pytorch/pull/62582)) * Added handle of `get_attr` operations in typechecker ([#62682](https://github.com/pytorch/pytorch/pull/62682)) * Added equality constraints for some acc operations for symbolic inference ([#63689](https://github.com/pytorch/pytorch/pull/63689)) * Added inference for algebraic expressions ([#63822](https://github.com/pytorch/pytorch/pull/63822)) * Provided function interface for `remove_duplicate_output_args` ([#65134](https://github.com/pytorch/pytorch/pull/65134)) * Introduced helper function to generate an unique name for an attr in a module 
([#64970](https://github.com/pytorch/pytorch/pull/64970)) ## ONNX * Added support for ONNX op set 14 ([#59486](https://github.com/pytorch/pytorch/pull/59486)) * Added support for GRU RNNs with packed input in scripting mode ([#58691](https://github.com/pytorch/pytorch/pull/58691)) * Enhanced shape inference ([#64585](https://github.com/pytorch/pytorch/pull/64585)) * Added support for `torch.{linspace, new_ones, nn.LSTMCell, bernoulli, dot, nn.utils.spectral_norm, bernoulli, distributions.normal.Normal, roll}` ([#58854](https://github.com/pytorch/pytorch/pull/58854), [#59255](https://github.com/pytorch/pytorch/pull/59255), [#62757](https://github.com/pytorch/pytorch/pull/62757), [#62765](https://github.com/pytorch/pytorch/pull/62765), [#59536](https://github.com/pytorch/pytorch/pull/59536), [#61560](https://github.com/pytorch/pytorch/pull/61560), [#58697](https://github.com/pytorch/pytorch/pull/58697)) ## Infra (Releng) * Default Linux/Windows testing workflows were migrated to GitHub Actions. PyTorch Probot has been extended to support a new set of rerun commands and labels that one can use to opt in to or out of certain types of CI. More information can be found on the [Continuous Integration](https://github.com/pytorch/pytorch/wiki/Continuous-Integration#user-guide) wiki page * Overall statistics and health of the PyTorch CI/CD system can be viewed at [https://metrics.pytorch.org](https://metrics.pytorch.org/) ([#65157](https://github.com/pytorch/pytorch/pull/65157), [#61389](https://github.com/pytorch/pytorch/pull/61389), [#62217](https://github.com/pytorch/pytorch/pull/62217), [#64948](https://github.com/pytorch/pytorch/pull/64948), [#60026](https://github.com/pytorch/pytorch/pull/60026), [#61071](https://github.com/pytorch/pytorch/pull/61071), [#64303](https://github.com/pytorch/pytorch/pull/64303)) * Improved mechanism for disabling tests via issues. Creating an issue whose title begins with “DISABLED” followed by the test name will disable the test in question for all platforms; this can be refined by explicitly specifying a list of platforms in the issue body. A comment from @pytorch-probot indicates that the issue format was recognized by the CI system and the test is now disabled. Closing the issue re-enables the specified test in CI. Disabled tests are temporarily re-enabled while running CI for a PR marked as fixing them ([#61427](https://github.com/pytorch/pytorch/pull/61427)) * New documentation preview and new artifacts frontend. Using [https://hud.pytorch.org](https://hud.pytorch.org/), one can get an overview of PR/commit CI status, download build artifacts, and read the documentation associated with that build.
See the [Using HUD](https://github.com/pytorch/pytorch/wiki/Using-hud.pytorch.org) wiki page for more information ([#60711](https://github.com/pytorch/pytorch/pull/60711), [#60792](https://github.com/pytorch/pytorch/pull/60792), [#60893](https://github.com/pytorch/pytorch/pull/60893)) ## Misc * Added support for `torch.fft.` operators on ARM-based platforms using PocketFFT ([#60976](https://github.com/pytorch/pytorch/pull/60976), [#62222](https://github.com/pytorch/pytorch/pull/62222), [#63714](https://github.com/pytorch/pytorch/pull/63714)) * `torch.einsum`: added support for the “sublist” format, illustrated in the sketch below ([#56625](https://github.com/pytorch/pytorch/pull/56625)) * `torch.linalg.det`: added support for complex autograd ([#58195](https://github.com/pytorch/pytorch/pull/58195)) * Added autograd support for `Tensor.to_sparse` ([#58413](https://github.com/pytorch/pytorch/pull/58413)) * Added more CUDA support for CSR layout: constructors ([#59010](https://github.com/pytorch/pytorch/pull/59010)), sparse_to_dense/add_sparse_csr ([#59011](https://github.com/pytorch/pytorch/pull/59011)), addmm/matvec ([#59012](https://github.com/pytorch/pytorch/pull/59012)) * Vulkan: Added support for `max_pool2d`, `tanh`, `hardshrink`, `log_softmax`, `leaky_relu`, `softmax` ([#58806](https://github.com/pytorch/pytorch/pull/58806), [#60695](https://github.com/pytorch/pytorch/pull/60695), [#62870](https://github.com/pytorch/pytorch/pull/62870), [#63193](https://github.com/pytorch/pytorch/pull/63193), [#62239](https://github.com/pytorch/pytorch/pull/62239)) * Enabled local run of clang-tidy and clang-format lint workflows ([#61121](https://github.com/pytorch/pytorch/pull/61121), [#61797](https://github.com/pytorch/pytorch/pull/61797), [#60745](https://github.com/pytorch/pytorch/pull/60745)) # Improvements ## Python API * Added clearer stack trace for `torch.floor_divide` deprecation warning ([#64034](https://github.com/pytorch/pytorch/pull/64034)) * Used a cascade-summation algorithm to improve `torch.nansum` accuracy ([#61082](https://github.com/pytorch/pytorch/pull/61082)) * `torch.i0`: now promotes integer inputs to float ([#52735](https://github.com/pytorch/pytorch/pull/52735)) * `torch.kthvalue`: added change to adjust output dim size for NumPy compatibility ([#59214](https://github.com/pytorch/pytorch/pull/59214)) * Added reduce variants for the `torch.scatter` operation ([#57015](https://github.com/pytorch/pytorch/pull/57015)) * Added support for quantized tensors in `torch.testing.assert_close` ([#58926](https://github.com/pytorch/pytorch/pull/58926)) * Improved error message for invalid value input to Distribution methods ([#61056](https://github.com/pytorch/pytorch/pull/61056)) * `torch.isclose`: now upcasts inputs to the most precise dtype within their category before the comparison ([#60536](https://github.com/pytorch/pytorch/pull/60536)) * Added change to cast `alpha` to `acc_type` for `torch.add` and `torch.sub` ([#60227](https://github.com/pytorch/pytorch/pull/60227)) * Fixed dimension in the error message for CUDA `torch.cat` shape check and removed unnecessary offending index information ([#64556](https://github.com/pytorch/pytorch/pull/64556)). * Improved DLPack support ([#57110](https://github.com/pytorch/pytorch/pull/57110)). * Added change to raise an error when an empty index tensor is passed to `torch.gather` ([#65006](https://github.com/pytorch/pytorch/pull/65006)). * Added change to store `float64` in `tensorboard` instead of `float32` ([#59435](https://github.com/pytorch/pytorch/pull/59435)).
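A minimal sketch of the “sublist” calling convention for `torch.einsum` mentioned above; the tensors, shapes, and subscript labels here are arbitrary illustrations, not taken from the linked PR:

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)

# Conventional equation-string form.
out_str = torch.einsum("ij,jk->ik", a, b)

# Equivalent "sublist" form: each operand is followed by a list of integer
# subscript labels, and an optional final list selects the output subscripts.
out_sub = torch.einsum(a, [0, 1], b, [1, 2], [0, 2])

assert torch.allclose(out_str, out_sub)
```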
* Added `use_strict_trace` to tensorboard `add_graph` method ([#63120](https://github.com/pytorch/pytorch/pull/63120)). * Added option to skip GitHub validation for `torch.hub` ([#62139](https://github.com/pytorch/pytorch/pull/62139)) * Added a new kwarg `output_size` to `Tensor.repeat_interleave` ([#58881](https://github.com/pytorch/pytorch/pull/58881)) * Added support for `torch.isclose` ([#63571](https://github.com/pytorch/pytorch/pull/63571)) * Made the behavior of `torch.{testing.assert_close, isclose}` consistent with NumPy ([#63841](https://github.com/pytorch/pytorch/pull/63841)) ## Autograd * Added warning about memory leak when `.backward()` is called with `create_graph=True` ([#59412](https://github.com/pytorch/pytorch/pull/59412)) * Added warning when accessing `Tensor::grad()` on a non-leaf Tensor in the C++ API ([#59362](https://github.com/pytorch/pytorch/pull/59362)) * Fixed error message formatting in `grad_output` creation for `.backward()` and `autograd.grad()` ([#59532](https://github.com/pytorch/pytorch/pull/59532)) * Added change to raise `NotImplementedError` for forward and backward-mode AD formulas that are not implemented ([#59482](https://github.com/pytorch/pytorch/pull/59482), [#59483](https://github.com/pytorch/pytorch/pull/59483)) * Reduced memory usage for `torch.relu` for common use cases ([#63089](https://github.com/pytorch/pytorch/pull/63089)) * Added support for non-leaf inputs for `autograd.backward()` function `inputs` argument ([#60521](https://github.com/pytorch/pytorch/pull/60521)) * Improved error message when a tensor with `requires_grad=True` is passed to a non-differentiable function ([#60610](https://github.com/pytorch/pytorch/pull/60610)) * Made `binary_cross_entropy` differentiable w.r.t. `target` ([#59447](https://github.com/pytorch/pytorch/pull/59447)) ## torch.nn * Added support for inputs with no batch dimensions for `nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, AvgPool*d, CosineEmbeddingLoss, Dropout, FractionalMaxPool2d, Linear, LPPool1d, MaxPool*d, MaxUnpool*d, NLLLoss, PairwiseDistance, ReflectionPad*d, ReplicationPad*d, TripletMarginLoss, ZeroPad*d}`, most other loss modules, and all activation modules ([#61264](https://github.com/pytorch/pytorch/pull/61264), [#61847](https://github.com/pytorch/pytorch/pull/61847), [#61860](https://github.com/pytorch/pytorch/pull/61860), [#64590](https://github.com/pytorch/pytorch/pull/64590), [#61911](https://github.com/pytorch/pytorch/pull/61911), [#62490](https://github.com/pytorch/pytorch/pull/62490), [#60992](https://github.com/pytorch/pytorch/pull/60992), [#62190](https://github.com/pytorch/pytorch/pull/62190), [#62206](https://github.com/pytorch/pytorch/pull/62206), [#61984](https://github.com/pytorch/pytorch/pull/61984), [#61310](https://github.com/pytorch/pytorch/pull/61310), [#62651](https://github.com/pytorch/pytorch/pull/62651), [#64882](https://github.com/pytorch/pytorch/pull/64882), [#62183](https://github.com/pytorch/pytorch/pull/62183), [#61060](https://github.com/pytorch/pytorch/pull/61060), [#61262](https://github.com/pytorch/pytorch/pull/61262), [#62729](https://github.com/pytorch/pytorch/pull/62729), [#61300](https://github.com/pytorch/pytorch/pull/61300), [#61461](https://github.com/pytorch/pytorch/pull/61461), [#62726](https://github.com/pytorch/pytorch/pull/62726)) * Added support for inputs with 0 batch size for `nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, Bilinear, FractionalMaxPool*d, LocalResponseNorm, MaxPool*d, MaxUnpool*d, TransformerDecoder, TransformerDecoderLayer, TransformerEncoder,
TransformerEncoderLayer}` ([#62025](https://github.com/pytorch/pytorch/pull/62025), [#62088](https://github.com/pytorch/pytorch/pull/62088), [#47106](https://github.com/pytorch/pytorch/pull/47106), [#62083](https://github.com/pytorch/pytorch/pull/62083), [#62801](https://github.com/pytorch/pytorch/pull/62801), [#64082](https://github.com/pytorch/pytorch/pull/64082), [#62800](https://github.com/pytorch/pytorch/pull/62800)) * Parametrization: Added support for nested parametrizations, parametrizations depending on several inputs, resizing of parametrized tensors, and the orthogonal parametrization ([#65167](https://github.com/pytorch/pytorch/pull/65167), [#60530](https://github.com/pytorch/pytorch/pull/60530), [#60418](https://github.com/pytorch/pytorch/pull/60418), [#62089](https://github.com/pytorch/pytorch/pull/62089)) * `nn.AvgPool2d`: Added `channels_last` support on CPU ([#58725](https://github.com/pytorch/pytorch/pull/58725)) * `nn.BatchNorm`: Use `resize_output` and `empty` instead of `empty_like` to improve flexibility in output memory format choice ([#63084](https://github.com/pytorch/pytorch/pull/63084)) * `nn.Bilinear`: Added support for non-contiguous tensor inputs ([#38409](https://github.com/pytorch/pytorch/pull/38409)) * `nn.GELU`: Added support for fp32/bfloat16 in CPU path using mkldnn implementation ([#58525](https://github.com/pytorch/pytorch/pull/58525)) * `nn.GroupNorm`: Improved numerical stability by using the Welford algorithm and cascade summation ([#54921](https://github.com/pytorch/pytorch/pull/54921)) * `nn.LayerNorm`: Improved numerical stability by using the Welford algorithm and pairwise sums ([#59987](https://github.com/pytorch/pytorch/pull/59987)) * `nn.NLLLoss`: Added support for target of dtype `byte` ([#60308](https://github.com/pytorch/pytorch/pull/60308), [#60650](https://github.com/pytorch/pytorch/pull/60650)) * `nn.SmoothL1Loss`: Added support for integral target within the backward pass ([#61112](https://github.com/pytorch/pytorch/pull/61112)) * `nn.Transformer`: Added configurable pre/post LayerNorm placement ([#60593](https://github.com/pytorch/pytorch/pull/60593), [#61692](https://github.com/pytorch/pytorch/pull/61692)) * Added check to verify non-zero sequence length for `nn.{RNN, LSTM, GRU}` ([#60269](https://github.com/pytorch/pytorch/pull/60269)) * Added support for bfloat16 in CPU path to `nn.{LeakyReLU, RReLU}` ([#61514](https://github.com/pytorch/pytorch/pull/61514)) * Added support for `channels_last` memory format in `nn.{AdaptiveMaxPool2d, GroupNorm}` ([#48920](https://github.com/pytorch/pytorch/pull/48920), [#49821](https://github.com/pytorch/pytorch/pull/49821)) * Added callable activation function support to `nn.{MultiheadAttention, Transformer, TransformerDecoderLayer, TransformerEncoderLayer}` ([#61355](https://github.com/pytorch/pytorch/pull/61355), [#62342](https://github.com/pytorch/pytorch/pull/62342)) ## Profiler * Changed `profiler.profile` argument `with_flops` when set to `True` to report total FLOPs rather than FLOP/s, and support more operators ([#62779](https://github.com/pytorch/pytorch/pull/62779), [#61895](https://github.com/pytorch/pytorch/pull/61895)) * Improved memory profiling and Tensorboard memory view, enabling better understanding of memory usage by showing active memory allocations at various points of your program run as well as a memory usage trend chart. 
([#61282](https://github.com/pytorch/pytorch/pull/61282), [#361](https://github.com/pytorch/kineto/pull/361), [#404](https://github.com/pytorch/kineto/pull/404), [#416](https://github.com/pytorch/kineto/pull/416), [#421](https://github.com/pytorch/kineto/pull/421), [#435](https://github.com/pytorch/kineto/pull/435)) * Added flow arrows between ops in the forward pass and the corresponding ops in the backward pass in the trace view ([#62553](https://github.com/pytorch/pytorch/pull/62553), [#372](https://github.com/pytorch/kineto/pull/372)) * Increased profiling coverage of the backward pass ([#63619](https://github.com/pytorch/pytorch/pull/63619)) * Made threads and GPU streams appear in a consistent sorted order in the trace view ([#399](https://github.com/pytorch/kineto/pull/399)) * Added shapes and register usage to the GPU kernel view ([#351](https://github.com/pytorch/kineto/pull/351), [#402](https://github.com/pytorch/kineto/pull/402)) ## Dataloader * Properly delegated indices called by `Subset` to the dataset ([#59513](https://github.com/pytorch/pytorch/pull/59513)) * Removed the restriction that input datasets in `ConcatDataset` must be `Sized` ([#64114](https://github.com/pytorch/pytorch/pull/64114)) * Allowed annotation of `IterableDataset` to accept keyword-only arguments and `abc` class ([#58450](https://github.com/pytorch/pytorch/pull/58450)) * Changed annotation of `DataLoader` to accept non-integer `Sampler` as input ([#63500](https://github.com/pytorch/pytorch/pull/63500)) ## CUDA * Included function name in the error message for inputs being on different devices ([#58502](https://github.com/pytorch/pytorch/pull/58502)) * Fixed MAGMA initialization ([#58521](https://github.com/pytorch/pytorch/pull/58521)) * Updated NCCL to 2.10 ([#62276](https://github.com/pytorch/pytorch/pull/62276)) * Added deterministic path for `torch.scatter_add` for 1D tensors ([#58761](https://github.com/pytorch/pytorch/pull/58761)) * Added CUDA support for mean reduction ([#59543](https://github.com/pytorch/pytorch/pull/59543)) * Added missing CUDA kernel launch check ([#60114](https://github.com/pytorch/pytorch/pull/60114)) * Improved CUDA extension building error/warning messages ([#59665](https://github.com/pytorch/pytorch/pull/59665), [#60592](https://github.com/pytorch/pytorch/pull/60592)) * Added change to compute CUDA reduction buffer size in elements ([#63969](https://github.com/pytorch/pytorch/pull/63969)) ## TorchScript * Added change to simplify pass on arithmetic expressions for integers
([#61444](https://github.com/pytorch/pytorch/pull/61444)) * Set future's error to current exception as is when `--torch_jit_enable_rethrow_caught_exception=true` ([#63348](https://github.com/pytorch/pytorch/pull/63348)) * Improved TorchScript module getattr() to be same as python class getattr() method ([#61599](https://github.com/pytorch/pytorch/pull/61599)) * Improved slicing for scripted version of `torch.nn.ModuleList` to support arbitrary step size ([#58361](https://github.com/pytorch/pytorch/pull/58361)) * Added parsing logic for `Tuple[()]` annotation ([#58340](https://github.com/pytorch/pytorch/pull/58340)) * Changed list striding kernel implementation to handle optional integers ([#58536](https://github.com/pytorch/pytorch/pull/58536)) * Added support for `torch.nn.Parameter` type for Profile-Directed-Typing ([#59249](https://github.com/pytorch/pytorch/pull/59249)) * Added change to annotate NoneType as Optional[type] ([#60383](https://github.com/pytorch/pytorch/pull/60383)) * Added support for default values on NamedTuple fields ([#54682](https://github.com/pytorch/pytorch/pull/54682)) * Improved JIT support for `torch.einsum` ([#59265](https://github.com/pytorch/pytorch/pull/59265)) * Added change to allow for heterogenous List and Dict values + Improve container typing algorithm ([#57137](https://github.com/pytorch/pytorch/pull/57137)) * Added support for eager mode use of `torch.jit.isinstance` with multiple types ([#60465](https://github.com/pytorch/pytorch/pull/60465)) * Allowed uncompiled strings as input to `checkScriptRaisesRegex` ([#63901](https://github.com/pytorch/pytorch/pull/63901)) * Introduced more robust check of whether a class is defined in torch ([#64083](https://github.com/pytorch/pytorch/pull/64083)) * Added change to preserve types during empty container assignment ([#58911](https://github.com/pytorch/pytorch/pull/58911)) * Made JIT not assume that the device is CUDA. 
([#54238](https://github.com/pytorch/pytorch/pull/54238)) * Updated `optimize_for_mobile` to preserve nodes’ debug information ([#63106](https://github.com/pytorch/pytorch/pull/63106)) * Added support for device as Dict key ([#65079](https://github.com/pytorch/pytorch/pull/65079)) * Added support for Python C extension modules in `torch::deploy` ([#58117](https://github.com/pytorch/pytorch/pull/58117)) * Added a flag to suppress stacktrace in exception messages ([#63073](https://github.com/pytorch/pytorch/pull/63073)) * Added API to change logging levels for JIT ([#58821](https://github.com/pytorch/pytorch/pull/58821)) * Provided API to preserve source range and callstack information during graph rewrite ([#58300](https://github.com/pytorch/pytorch/pull/58300)) * Re-enabled BatchNorm autodiff ([#57321](https://github.com/pytorch/pytorch/pull/57321)) * Extracted element-wise ops supported by the JIT fuser into a separate list ([#59579](https://github.com/pytorch/pytorch/pull/59579)) * Reworked requires_grad on DifferentiableGraphOp ([#57575](https://github.com/pytorch/pytorch/pull/57575)) ## torch.package * Unified three categories of dependency handling errors (broken, denied, unhandled) into a single "error" field in the node, with optional context ([#58572](https://github.com/pytorch/pytorch/pull/58572)) * Renamed MockZipReader to DirectoryReader ([#59107](https://github.com/pytorch/pytorch/pull/59107)) * Added change to silently skip cases where the `__import__` statement cannot be parsed ([#61148](https://github.com/pytorch/pytorch/pull/61148)) * Made `torch::deploy` work with or without CUDA ([#58493](https://github.com/pytorch/pytorch/pull/58493)) ## Mobile * Added check to ensure op name does not contain an open parenthesis ([#58687](https://github.com/pytorch/pytorch/pull/58687)) * Added handling and symbolication of exception callstacks thrown from backends ([#55462](https://github.com/pytorch/pytorch/pull/55462), [#57441](https://github.com/pytorch/pytorch/pull/57441), [#57481](https://github.com/pytorch/pytorch/pull/57481)) * Enabled implicit operator versioning via number of arguments ([#58852](https://github.com/pytorch/pytorch/pull/58852)) * Cleaned up unused APIs and improved debugging experience for iOS GPU ([#60280](https://github.com/pytorch/pytorch/pull/60280), [#60281](https://github.com/pytorch/pytorch/pull/60281), [#60282](https://github.com/pytorch/pytorch/pull/60282)) * Added debug information to track memory allocation exceptions for Metal ([#59112](https://github.com/pytorch/pytorch/pull/59112)) * Added print of IValue type name in error message for Android ([#64602](https://github.com/pytorch/pytorch/pull/64602)) * Added print of error message when failing to load model file ([#63404](https://github.com/pytorch/pytorch/pull/63404)) * Introduced multiple improvements in `torch.utils.model_dump` APIs: * Made stdout argument for main kwarg-only ([#60699](https://github.com/pytorch/pytorch/pull/60699)) * Implemented "Hider" properly ([#57654](https://github.com/pytorch/pytorch/pull/57654)) * Handled `torch.device` objects ([#57656](https://github.com/pytorch/pytorch/pull/57656)) * Handled dict rendering ([#57657](https://github.com/pytorch/pytorch/pull/57657)) * Added a section that summarizes tensor memory usage ([#57658](https://github.com/pytorch/pytorch/pull/57658)) * Handled invalid UTF-8 in pickles ([#57661](https://github.com/pytorch/pytorch/pull/57661)) ## Quantization * Added out variant for int8 `quantized::linear` ([#58282](https://github.com/pytorch/pytorch/pull/58282)) and 
`quantized::embedding_bag_byte_prepack` ([#64081](https://github.com/pytorch/pytorch/pull/64081)) * FX graph mode quantization: improve `qconfig_dict` argument handling ([#59605](https://github.com/pytorch/pytorch/pull/59605), [#58566](https://github.com/pytorch/pytorch/pull/58566)) * Added support to embedding trained in FP16 ([#60736](https://github.com/pytorch/pytorch/pull/60736)) * Added support for `torch.index_select` on quantized tensors ([#61406](https://github.com/pytorch/pytorch/pull/61406)) * Added a new fused MovingAvg Obs + FakeQuant operator ([#61570](https://github.com/pytorch/pytorch/pull/61570), [#61589](https://github.com/pytorch/pytorch/pull/61589), [#61691](https://github.com/pytorch/pytorch/pull/61691), [#62346](https://github.com/pytorch/pytorch/pull/62346), [#62863](https://github.com/pytorch/pytorch/pull/62863), [#62702](https://github.com/pytorch/pytorch/pull/62702), [#63043](https://github.com/pytorch/pytorch/pull/63043), [#64829](https://github.com/pytorch/pytorch/pull/64829)) * Added support for dynamic linear + relu fusion (INT8) ([#63799](https://github.com/pytorch/pytorch/pull/63799),[#63826](https://github.com/pytorch/pytorch/pull/63826)) * Enabled JIT tracing on quantizable LSTM ([#64438](https://github.com/pytorch/pytorch/pull/64438)) ## Distributed `DistributedDataParallel` * Added error logging to DDP logging API ([#59281](https://github.com/pytorch/pytorch/pull/59281), [#59284](https://github.com/pytorch/pytorch/pull/59284), [#59351,](https://github.com/pytorch/pytorch/pull/59351)[#65023](https://github.com/pytorch/pytorch/pull/65023)) * Added `NCCL_ASYNC_ERROR_HANDLING` environment variable to control NCCL error handling ([#59109](https://github.com/pytorch/pytorch/pull/59109)) * Communication hook APIs to always return single tensor ([#62074](https://github.com/pytorch/pytorch/pull/62074), [#62389](https://github.com/pytorch/pytorch/pull/62389), [#62457](https://github.com/pytorch/pytorch/pull/62457)) * Added DDP bucket sizes in DDP logging API ([#62229](https://github.com/pytorch/pytorch/pull/62229), [#62232](https://github.com/pytorch/pytorch/pull/62232), [#62231](https://github.com/pytorch/pytorch/pull/62231), [#62625](https://github.com/pytorch/pytorch/pull/62625), * Improved rebuilding buckets logic ([#62279](https://github.com/pytorch/pytorch/pull/62279), [#58097](https://github.com/pytorch/pytorch/pull/58097)) * Allowed DDP uneven inputs work with communication hooks ([#61017](https://github.com/pytorch/pytorch/pull/61017), [#61018](https://github.com/pytorch/pytorch/pull/61018), [#61019](https://github.com/pytorch/pytorch/pull/61019), [#61020](https://github.com/pytorch/pytorch/pull/61020)) * Added logging if graph is static at end of training ([#61871](https://github.com/pytorch/pytorch/pull/61871)) * Added logging of unused param names under DETAIL debug mode. 
([#62209](https://github.com/pytorch/pytorch/pull/62209)) * Allowed tuning of first bucket in DDP ([#62748](https://github.com/pytorch/pytorch/pull/62748)) * Added gradient ready order, host-side timestamps, and bucket indices to DDP logging ([#62751](https://github.com/pytorch/pytorch/pull/62751), [#62770](https://github.com/pytorch/pytorch/pull/62770)) * Added a debug check in C++ fp16 gradient hook ([#63379](https://github.com/pytorch/pytorch/pull/63379)) * Added a fallback to use `mul` and `copy_` instead of `mul`’s `out=` variant when gradient tensor requires grad in DDP ([#63831](https://github.com/pytorch/pytorch/pull/63831)) * Used `Tensor.set_` instead of directory assigning data in model averaging ([#63895](https://github.com/pytorch/pytorch/pull/63895)) * Added more iterations for DDP logging ([#64071](https://github.com/pytorch/pytorch/pull/64071), [#64411](https://github.com/pytorch/pytorch/pull/64411)) `torch.distributed` * Introduced ProcessGroup wrapper and use it in debug mode([#58224](https://github.com/pytorch/pytorch/pull/58224), [#58281](https://github.com/pytorch/pytorch/pull/58281), [#60237](https://github.com/pytorch/pytorch/pull/60237)) * Made a small change for `torch.distributed` launcher ([#59152](https://github.com/pytorch/pytorch/pull/59152)) * Added complex number support for all_to_all/scatter ([#61299](https://github.com/pytorch/pytorch/pull/61299)) * Made gloo communication profiling more accurate ([#61342](https://github.com/pytorch/pytorch/pull/61342)) * Used generator instead of list to save memory in scatter ([#62516](https://github.com/pytorch/pytorch/pull/62516)) * Provided failure reason from ProcessGroup when aborting NCCL communicator ([#64241](https://github.com/pytorch/pytorch/pull/64241)) * Introduced error raised when capturing uncapturable NCCL in CUDA graphs. ([#64440](https://github.com/pytorch/pytorch/pull/64440)) * Added Single-Machine Model Parallel Support to `torch.distributed.optim.ZeroRedundancyOptimizer` ([#61370](https://github.com/pytorch/pytorch/pull/61370)) `torch.distributed.nn.RemoteModule` * Supported creating a RemoteModule by RRef ([#59242](https://github.com/pytorch/pytorch/pull/59242)) * Supported switching RemoteModule between train/eval ([#59026](https://github.com/pytorch/pytorch/pull/59026)) `torch.distributed.elastic` * Added minor logging and error formatting improvements ([#63214](https://github.com/pytorch/pytorch/pull/63214), [#62823](https://github.com/pytorch/pytorch/pull/62823)) * Improved process termination logic ([#61602](https://github.com/pytorch/pytorch/pull/61602)) * Added fqdn hostname to error printout ([#66662](https://github.com/pytorch/pytorch/pull/66662/)) `torch.distributed.rpc` * Fix RPC initialization to avoid shutdown timeout ([#59801](https://github.com/pytorch/pytorch/pull/59801)) * Supported RRefs that contain `threading.Locks` ([#57943](https://github.com/pytorch/pytorch/pull/57943)), `torch.cuda.Event` ([#61354](https://github.com/pytorch/pytorch/pull/61354)) * Updated rpc tensorpipe logic for sparse tensors ([#64575](https://github.com/pytorch/pytorch/pull/64575)) * Added rpc sparse tensor fix ([#59609](https://github.com/pytorch/pytorch/pull/59609), [#62794](https://github.com/pytorch/pytorch/pull/62794)) * Added change to ensure that future completion doesn't swallow exception. 
([#61094](https://github.com/pytorch/pytorch/pull/61094)) * Set streams when invoking UDFs ([#59210](https://github.com/pytorch/pytorch/pull/59210)) * Set and propagate devices in RRef completion Future ([#59211](https://github.com/pytorch/pytorch/pull/59211)) * Made TensorPipe agent use streams from Future when sending response ([#59212](https://github.com/pytorch/pytorch/pull/59212)) * Added change to leverage TensorPipe's automatic SHM address selection ([#63028](https://github.com/pytorch/pytorch/pull/63028)) * Made Future store Storages instead of references to DataPtrs ([#60470](https://github.com/pytorch/pytorch/pull/60470), [#60943](https://github.com/pytorch/pytorch/pull/60943)) * Added change to avoid re-doing CUDA stream sync in OwnerRRef ([#57355](https://github.com/pytorch/pytorch/pull/57355)) `torch.distributed.Store` * Enhanced connect timeout error message ([#61390](https://github.com/pytorch/pytorch/pull/61390)) * Added minor fixes in c10d for Windows ([#62953](https://github.com/pytorch/pytorch/pull/62953)) `torch.distributed.pipeline` * Supported non-tensor inputs in pipeline parallel API ([#55441](https://github.com/pytorch/pytorch/pull/55441), [#57226](https://github.com/pytorch/pytorch/pull/57226), [#57325](https://github.com/pytorch/pytorch/pull/57325)) * Added a `WithDevice` wrapper to specify device execution for a module. ([#65190](https://github.com/pytorch/pytorch/pull/65190)) ## torch.fx * Added users of a node to the serialized JSON ([#59357](https://github.com/pytorch/pytorch/pull/59357)) * Added requires_grad to TensorMetadata ([#60972](https://github.com/pytorch/pytorch/pull/60972)) * Added change to swap out Python's AnnAssign with an Assign node where the annotation function is called ([#60622](https://github.com/pytorch/pytorch/pull/60622)) * Added type annotations for the `torch.nn.Module` constructor ([#61334](https://github.com/pytorch/pytorch/pull/61334)) * Enabled `torch.deploy` for GraphModules with non-torch dependencies ([#61680](https://github.com/pytorch/pytorch/pull/61680)) * Added change to allow FX tracer to trace control flow (if/while) statements when parameter shapes are in the conditionals ([#61820](https://github.com/pytorch/pytorch/pull/61820)) * Added `torch.memory_format` as a BaseArgumentType ([#62593](https://github.com/pytorch/pytorch/pull/62593)) * Added backwards compatibility guarantees for 1.10 ([#63888](https://github.com/pytorch/pytorch/pull/63888)) * Renamed reduce functions back to their old, public names ([#64324](https://github.com/pytorch/pytorch/pull/64324)) * Added change to ensure BC coverage for all of `torch.fx` passes ([#65081](https://github.com/pytorch/pytorch/pull/65081)) * Add `__matmul__` to the magic methods for FX tracing ([#64512](https://github.com/pytorch/pytorch/pull/64512)) ## Composability * Added meta tensor support for `torch.{any, all, fmax, fmin, remainder, glu, argmax, argmin, avg_pool3d_backward, isposinf, isneginf, fmod, fmin, signbit, slow_conv_transpose2d, nll_loss_backward, cumprod, aminmax, addcmul, addcdiv, gather, hardshrink_backward, softshrink_backward, hardshrink, gelu, gelu_backward, avg_pool2d, avg_pool2d_backward, avg_pool3d, reflection_pad1d_backward, all, any, silu_backward, sgn, softplus, leaky_relu_backward, hardsigmoid_backward, elu_backward, eq, xlogy, ne, lt, gt, le, ge, sigmoid_backward, tanh_backward, logit_backward, bitwise_or, bitwise_xor, bitwise_and, nll_loss_forward, log_softmax, log_softmax_backward_data, prod, norm, sum.dim_IntList, clamp}` 
([#64642](https://github.com/pytorch/pytorch/pull/64642), [#58458,](https://github.com/pytorch/pytorch/pull/58458)[#58732](https://github.com/pytorch/pytorch/pull/58732), [#61800](https://github.com/pytorch/pytorch/pull/61800), [#60363](https://github.com/pytorch/pytorch/pull/60363), [#60364](https://github.com/pytorch/pytorch/pull/60364), [#59084](https://github.com/pytorch/pytorch/pull/59084), [#60633](https://github.com/pytorch/pytorch/pull/60633), [#60809](https://github.com/pytorch/pytorch/pull/60809), [#60810](https://github.com/pytorch/pytorch/pull/60810), [#57936](https://github.com/pytorch/pytorch/pull/57936), [#55503](https://github.com/pytorch/pytorch/pull/55503), [#62144](https://github.com/pytorch/pytorch/pull/62144), [#61899](https://github.com/pytorch/pytorch/pull/61899), [#62401](https://github.com/pytorch/pytorch/pull/62401), [#62318](https://github.com/pytorch/pytorch/pull/62318), [#62319](https://github.com/pytorch/pytorch/pull/62319), [#63312](https://github.com/pytorch/pytorch/pull/63312), [#58662](https://github.com/pytorch/pytorch/pull/58662), [#58663](https://github.com/pytorch/pytorch/pull/58663), [#58664](https://github.com/pytorch/pytorch/pull/58664), [#58665](https://github.com/pytorch/pytorch/pull/58665), [#58987](https://github.com/pytorch/pytorch/pull/58987), [#59082](https://github.com/pytorch/pytorch/pull/59082), [#59083](https://github.com/pytorch/pytorch/pull/59083), [#59103](https://github.com/pytorch/pytorch/pull/59103), [#60360](https://github.com/pytorch/pytorch/pull/60360), [#60361](https://github.com/pytorch/pytorch/pull/60361), [#58661](https://github.com/pytorch/pytorch/pull/58661), [#58197](https://github.com/pytorch/pytorch/pull/58197), [#58482](https://github.com/pytorch/pytorch/pull/58482), [#58483](https://github.com/pytorch/pytorch/pull/58483), [#58484](https://github.com/pytorch/pytorch/pull/58484), [#58660](https://github.com/pytorch/pytorch/pull/58660), [#60177](https://github.com/pytorch/pytorch/pull/60177), [#60814](https://github.com/pytorch/pytorch/pull/60814), [#60942](https://github.com/pytorch/pytorch/pull/60942), [#60815](https://github.com/pytorch/pytorch/pull/60815), [#60816](https://github.com/pytorch/pytorch/pull/60816), [#60817](https://github.com/pytorch/pytorch/pull/60817), [#60811](https://github.com/pytorch/pytorch/pull/60811), [#60812](https://github.com/pytorch/pytorch/pull/60812), [#60813](https://github.com/pytorch/pytorch/pull/60813), [#61443](https://github.com/pytorch/pytorch/pull/61443), [#57374](https://github.com/pytorch/pytorch/pull/57374), [#62372](https://github.com/pytorch/pytorch/pull/62372), [#62024](https://github.com/pytorch/pytorch/pull/62024), [#62711](https://github.com/pytorch/pytorch/pull/62711), [#61642](https://github.com/pytorch/pytorch/pull/61642), [#61361](https://github.com/pytorch/pytorch/pull/61361)) * PyObject preservation: Previously, tensors in python that no longer had any python-side references (but still had references in C++, e.g. if it’s saved for autograd) would get deallocated, and we would create a new Python object to replace it next time it passes from C++ to Python. We now preserve the PyObject as long as there are any references on either the python or C++ side. This ensures that any metadata on the original python object is preserved. For example, tensor subclasses that were saved for autograd now get properly preserved. 
([#56017](https://github.com/pytorch/pytorch/pull/56017)) ## Build_Frontend * Added a new include directory to the BLIS search path ([#58166](https://github.com/pytorch/pytorch/pull/58166)) * Added print to show full Python version in `torch.utils.collect_env` ([#59632](https://github.com/pytorch/pytorch/pull/59632)) * Added change to respect `CMAKE_PREFIX_PATH` choice set by caller ([#61904](https://github.com/pytorch/pytorch/pull/61904)) * Dropped incremental linking on Windows when REL_WITH_DEB_INFO=1 ([#64892](https://github.com/pytorch/pytorch/pull/64892)) * Enabled kineto build for ROCm platform ([#58401](https://github.com/pytorch/pytorch/pull/58401)) * Added support for system-provided Intel TBB ([#61934](https://github.com/pytorch/pytorch/pull/61934)) * Added PyTorch build support for the [Newlib](https://en.wikipedia.org/wiki/Newlib) C library ([#60345](https://github.com/pytorch/pytorch/pull/60345), [#60052](https://github.com/pytorch/pytorch/pull/60052)) * Improved `torch.__version__` comparisons ([#61556](https://github.com/pytorch/pytorch/pull/61556), [#64565](https://github.com/pytorch/pytorch/pull/64565), [#63848](https://github.com/pytorch/pytorch/pull/63848)) * CMake: added optional precompiled header support ([#61940](https://github.com/pytorch/pytorch/pull/61940)) * Removed unnecessary Ubuntu version checks ([#61738](https://github.com/pytorch/pytorch/pull/61738)) * Added GPU support to `bazel` builds ([#63604](https://github.com/pytorch/pytorch/pull/63604)) ## Infra (Releng) * Improved automated test sharding ([#59727](https://github.com/pytorch/pytorch/pull/59727), [#60206](https://github.com/pytorch/pytorch/pull/60206)) * Added change to strictly type everything in .github and tools ([#59117](https://github.com/pytorch/pytorch/pull/59117)) * Upgraded Windows CI Python to 3.8 ([#59729](https://github.com/pytorch/pytorch/pull/59729)) and CUDA to 10.2 ([#65080](https://github.com/pytorch/pytorch/pull/65080)) * Made change to use expecttest from PyPI ([#60658](https://github.com/pytorch/pytorch/pull/60658), [#63320](https://github.com/pytorch/pytorch/pull/63320)) * Added option to run specified tests to run_test.py ([#59649](https://github.com/pytorch/pytorch/pull/59649)) * Enabled Metal in PyTorch macOS/iOS nightly builds ([#63718](https://github.com/pytorch/pytorch/pull/63718), [#65075](https://github.com/pytorch/pytorch/pull/65075)) * Added retries to flaky CI steps 
([#65013](https://github.com/pytorch/pytorch/pull/65013), [#65104](https://github.com/pytorch/pytorch/pull/65104), [#64120](https://github.com/pytorch/pytorch/pull/64120), [#60216](https://github.com/pytorch/pytorch/pull/60216), [#63319](https://github.com/pytorch/pytorch/pull/63319)) * Allowed Docker build on macOS ([#60375](https://github.com/pytorch/pytorch/pull/60375)) ## Misc * Added support for MIOpen channels-last convolution ([#63617](https://github.com/pytorch/pytorch/pull/63617)) * Enabled kernel asserts on ROCm ([#49624](https://github.com/pytorch/pytorch/pull/49624)) * Added bool, float16, bfloat16, and complex support for `to_dense` for CSR sparse Tensors ([#60657](https://github.com/pytorch/pytorch/pull/60657)) * Added complex dtype support for matrix multiplication of two COO sparse Tensors on CPU ([#59554](https://github.com/pytorch/pytorch/pull/59554)) * Added the “upper” kwarg to `torch.linalg.cholesky` ([#62434](https://github.com/pytorch/pytorch/pull/62434)) * Improved error message in ONNX when attempting to export dict modification ([#58696](https://github.com/pytorch/pytorch/pull/58696)) * Migrated `THAllocator` to `MapAllocator` in ATen ([#60325](https://github.com/pytorch/pytorch/pull/60325)) * Converted input type of `TensorOptions.device_index` from `int16_t` to `c10::DeviceIndex` ([#60412](https://github.com/pytorch/pytorch/pull/60412)) # Bug fixes ## Python API * Added fix to recognize transposed dense tensors as a form of partial overlap ([#59014](https://github.com/pytorch/pytorch/pull/59014)) * Fixed incorrect `torch.polygamma` behavior at infinities when n >= 1 ([#61641](https://github.com/pytorch/pytorch/pull/61641)) * Fixed handling of non-contiguous inputs for `torch.{sort,topk}` on CUDA ([#63029](https://github.com/pytorch/pytorch/pull/63029)) and for `torch.tensor_split` indices ([#63390](https://github.com/pytorch/pytorch/pull/63390)) * Fixed the legacy `torch.Tensor` constructor when given a scalar Tensor ([#58885](https://github.com/pytorch/pytorch/pull/58885)) * Added change to not wrap `Tensor.{grad,_base}` by default for Tensor-like objects ([#60464](https://github.com/pytorch/pytorch/pull/60464)) * Fixed `torch.angle` on aarch64 ([#59832](https://github.com/pytorch/pytorch/pull/59832)) * Fixed specialized convolution kernel on arm64 ([#60460](https://github.com/pytorch/pytorch/pull/60460)) * `torch.normal`: fixed RuntimeError when standard deviation named arg is torch.empty ([#66524](https://github.com/pytorch/pytorch/pull/66524/)) * Fixed random sampling on SGX platforms ([#60368](https://github.com/pytorch/pytorch/pull/60368)) * Fixed testing when SciPy is not available ([#61699](https://github.com/pytorch/pytorch/pull/61699)) * Fixed `torch.Tensor.copy_` when using large inputs and broadcasting ([#64425](https://github.com/pytorch/pytorch/pull/64425)) * Fixed broadcasting behavior for `torch.trapezoid` (see the sketch below) ([#64054](https://github.com/pytorch/pytorch/pull/64054)). * Fixed dtype check of comparison ops ([#64267](https://github.com/pytorch/pytorch/pull/64267)).
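A minimal sketch of the `torch.trapezoid` broadcasting behavior touched by the fix above; the sample values are arbitrary and this only illustrates the intended semantics, not the pre-fix failure mode:

```python
import torch

# Two rows of samples, integrated along the last dimension.
y = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

# Uniform spacing given via the dx keyword.
print(torch.trapezoid(y, dx=0.5))  # tensor([2., 5.])

# Non-uniform sample points: the 1-D x is broadcast against every row of y.
x = torch.tensor([0., 1., 3.])
print(torch.trapezoid(y, x))       # tensor([ 6.5000, 15.5000])
```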
* Fixed `torch.median` crash on empty tensor ([#61698](https://github.com/pytorch/pytorch/pull/61698)) * Fixed missing lazy initialization in `torch.get_num_threads` ([#64486](https://github.com/pytorch/pytorch/pull/64486)) * Fixed check for empty named dims list to `torch.flatten` ([#61953](https://github.com/pytorch/pytorch/pull/61953)) * Fixed `torch.hub.{list,help}` functions for Windows ([#63773](https://github.com/pytorch/pytorch/pull/63773)) * Fixed `torch.{istft,rfft}` errors for special inputs ([#63469](https://github.com/pytorch/pytorch/pull/63469), [#63327](https://github.com/pytorch/pytorch/pull/63327)) * Fixed type annotation * `optim.lr_scheduler.CosineAnnealingWarmRestart` ([#61106](https://github.com/pytorch/pytorch/pull/61106)) * Fixed type annotation of `torch.hub.load` ([#63755](https://github.com/pytorch/pytorch/pull/63755)) * `x[index] = value` no longer results in a RuntimeError if `x` and `value` are different devices. ([#61612](https://github.com/pytorch/pytorch/pull/61612)) * Fixed crash while creating new tensor if NumPy is not available ([#66433](https://github.com/pytorch/pytorch/pull/66433)) * Handle exceptions from THPModule_setQEngine ([#60073](https://github.com/pytorch/pytorch/pull/60073)) * Fixed `torch.Tensor.cauchy_` on CUDA for inf values ([#60186](https://github.com/pytorch/pytorch/pull/60186)) ## Autograd * `torch.{signbit,isin}` no longer raise an error when passed a tensor that requires grad ([#62529](https://github.com/pytorch/pytorch/pull/62529)) * Fixed sub-gradient for `torch.a{max,min}` ([#59669](https://github.com/pytorch/pytorch/pull/59669)) * Fixed segfaults when a tensor hook removes itself ([#61250](https://github.com/pytorch/pytorch/pull/61250)) * Fixed double backward for `binary_cross_entropy` loss function when `reduction=sum`. 
([#59479](https://github.com/pytorch/pytorch/pull/59479)) * Made sure that TLS (grad mode, inference mode, dispatcher state, etc.) is properly set in hooks called during the backward pass ([#60067](https://github.com/pytorch/pytorch/pull/60067)) ## torch.nn * `nn.AdaptiveAvgPool2d`: Correctly dispatches to the CUDA implementation ([#61851](https://github.com/pytorch/pytorch/pull/61851)) * `nn.AdaptiveAvgPool3d`: Fixed gradient computation ([#60630](https://github.com/pytorch/pytorch/pull/60630)) * `nn.BatchNorm`: Fixed mixed precision usage when `affine=False` ([#61962](https://github.com/pytorch/pytorch/pull/61962)) * `nn.BatchNorm2d`: Fixed issue when input is non-contiguous ([#63392](https://github.com/pytorch/pytorch/pull/63392)) * Fixed `batch_norm()` to preserve output memory layout based on input ([#62773](https://github.com/pytorch/pytorch/pull/62773)) * `nn.MaxPool2d`: Used `channels_last` memory format for output and indices when input is `channels_last` ([#61245](https://github.com/pytorch/pytorch/pull/61245)) * `nn.Module`: Fixed full backward hook when grad is disabled ([#65335](https://github.com/pytorch/pytorch/pull/65335)) * `nn.Module`: Fixed `get_buffer()` to check buffers by name instead of value ([#61429](https://github.com/pytorch/pytorch/pull/61429)) * `nn.Module`: Fixed pre-forward hooks for Lazy modules ([#60517](https://github.com/pytorch/pytorch/pull/60517)) * `nn.Softmax`: Improved numerical stability by subtracting the max value in the vectorized CPU implementation ([#63132](https://github.com/pytorch/pytorch/pull/63132)) * `F.cosine_similarity`: Fixed type promotion behavior and added input validation checks ([#62054](https://github.com/pytorch/pytorch/pull/62054), [#66191](https://github.com/pytorch/pytorch/pull/66191), [#62912](https://github.com/pytorch/pytorch/pull/62912), [#58559](https://github.com/pytorch/pytorch/pull/58559)) * `F.embedding`: Added check to validate that weights are 2D ([#59314](https://github.com/pytorch/pytorch/pull/59314)) * `F.interpolate`: Fixed output for edge case of single pixel without align_corners ([#61166](https://github.com/pytorch/pytorch/pull/61166)) * `F.nll_loss`: Fixed regression for gradient computation ([#64203](https://github.com/pytorch/pytorch/pull/64203)) * `F.pad`: Fixed type of default pad value to be floating point ([#62095](https://github.com/pytorch/pytorch/pull/62095)) * Fixed issues with printing `torch._ops.ops.{atan, quantized}` modules ([#62447](https://github.com/pytorch/pytorch/pull/62447)) * Fixed `torch.nn.utils.parametrizations.spectral_norm` so that it can be used twice in the same forward pass ([#62293](https://github.com/pytorch/pytorch/pull/62293)) * Disabled cuDNN persistent RNN on A30 to avoid exceptions from hard-to-detect edge cases ([#59830](https://github.com/pytorch/pytorch/pull/59830)) ## Dataloader * Fixed `IterableFetcher` to stop fetching data after `StopIteration` ([#59313](https://github.com/pytorch/pytorch/pull/59313)) * Fixed `ExceptionWrapper` to re-raise Exception with multiple args ([#58131](https://github.com/pytorch/pytorch/pull/58131)) ## AMD * Fixed ROCm compilation by properly marking C++ functions as CPU-only ([#62628](https://github.com/pytorch/pytorch/pull/62628)) * Fixed `torch.{i1,i1e}` ROCm failure: marked array as const so that it is available for host and device ([#59187](https://github.com/pytorch/pytorch/pull/59187)) ## CUDA * Fixed use of deprecated data accessor in IndexKernel.cu ([#62268](https://github.com/pytorch/pytorch/pull/62268)) * Fixed sign comparison 
([#62194](https://github.com/pytorch/pytorch/pull/62194), [#62483](https://github.com/pytorch/pytorch/pull/62483)) * Fixed `torch.manual_seed{_all}` memory leak ([#62534](https://github.com/pytorch/pytorch/pull/62534)) * Fixed CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode ([#62527](https://github.com/pytorch/pytorch/pull/62527)) * Changed to use long index type for `torch.index_add` deterministic implementation ([#59254](https://github.com/pytorch/pytorch/pull/59254)) * Fixed illegal memory access on NHWC BN kernel ([#59981](https://github.com/pytorch/pytorch/pull/59981)) * Fixed typo in Normalization.cu ([#62515](https://github.com/pytorch/pytorch/pull/62515)) * Added change to ignore and clear errors related to cuda not being ready yet ([#61554](https://github.com/pytorch/pytorch/pull/61554)) * Fixed segmentation fault due to access to destroyed global IPC variable([#56141](https://github.com/pytorch/pytorch/pull/56141)) * Fixed reduction launch config ([#64304](https://github.com/pytorch/pytorch/pull/64304)) * Fixed typo embedding_renorm_ cuda implementation ([#64542](https://github.com/pytorch/pytorch/pull/64542)) * Added missing kernel checks ([#60635](https://github.com/pytorch/pytorch/pull/60635)) * CUDA graphs: made sure graph mempool malloc counter pairs with frees for all allocations ([#61567](https://github.com/pytorch/pytorch/pull/61567)) * Fix bug where some kernels would not properly call cuda lazy initialization ([#61882](https://github.com/pytorch/pytorch/pull/61882)) * Added check for contiguous to dispatch to NHWC CUDA template ([#62839](https://github.com/pytorch/pytorch/pull/62839)) * Moved grid_sampler to autocast promote list ([#58618](https://github.com/pytorch/pytorch/pull/58618)) * Added check for memory overlap in sort for large input sizes ([#58327](https://github.com/pytorch/pytorch/pull/58327)) ## C++ API * Fixed `map` function for `vec256` to accept const pointer to function ([#59957](https://github.com/pytorch/pytorch/pull/59957)) * Added `supports_as_strided` method to `Device` and fixed indices of `to_sparse()` contiguous on all devices ([#59370](https://github.com/pytorch/pytorch/pull/59370)) * Removed redundant bitwise-and op in MT19937RNGEngine ([#63219](https://github.com/pytorch/pytorch/pull/63219)) * Fixed subprocess encoding for cpp extension on Windows ([#63756](https://github.com/pytorch/pytorch/pull/63756)) * Define the SYCL device version `__assert_fail` when the NDEBUG defined. ([#58906](https://github.com/pytorch/pytorch/pull/58906)) ## TorchScript * Fixed inconsistency between Python and JIT power operation ([#62842](https://github.com/pytorch/pytorch/pull/62842)) * Added change to convert `__constants__` attribute in model to a set to be consistent ([#60003](https://github.com/pytorch/pytorch/pull/60003)) * Added change to Ignore unsupported attribute checker pass for `torch.jit.trace` ([#60200](https://github.com/pytorch/pytorch/pull/60200)) * Fixed missing element types and shapes when `torch.autograd.Function` has multiple tensor outputs ([#57966](https://github.com/pytorch/pytorch/pull/57966)) * Fixed `Tensor.to` schema to reflect that the output may alias input ([#60001](https://github.com/pytorch/pytorch/pull/60001)) * Added change to turn off layer norm in jit symbolic differentiation ([#63816](https://github.com/pytorch/pytorch/pull/63816)) * Fixed name conflict by using a more specific prefix for lowered module name. 
([#61007](https://github.com/pytorch/pytorch/pull/61007)) * Added change to allow disabling cache in autocast (automatic mixed precision) ([#63552](https://github.com/pytorch/pytorch/pull/63552)) * Fixed concat optimization to handle cases when input list is mutated after cat, using AliasDb ([#60774](https://github.com/pytorch/pytorch/pull/60774)) * Fixed symbolic derivative of hardswish ([#59405](https://github.com/pytorch/pytorch/pull/59405)) ## torch.package * Fixed a bug when using `importlib.resources.path` for Python < 3.8.8 ([#58718](https://github.com/pytorch/pytorch/pull/58718)) * Fixed bugs when using `os` and `os.path` ([#60276](https://github.com/pytorch/pytorch/pull/60276)) * Fixed storage serialization collision when saving a `ScriptModule` and then saving a `Tensor` owned by it ([#61806](https://github.com/pytorch/pytorch/pull/61806)) * Fixed use-after-free during autograd shutdown ([#64620](https://github.com/pytorch/pytorch/pull/64620)) * Fixed non-determinism in naming scheme of serialized storages in export code paths and the ABA storage identity problem during serialization for `torch.package` ([#59735](https://github.com/pytorch/pytorch/pull/59735)) * Fixed GIL issue when acquiring multiple sessions ([#58584](https://github.com/pytorch/pytorch/pull/58584)) ## Mobile * Fixed NNAPI backend dangling pointer bug ([#63092](https://github.com/pytorch/pytorch/pull/63092)) * Fixed missing constants archive in TorchScript model after backport ([#58892](https://github.com/pytorch/pytorch/pull/58892)) * Fixed type hints in `optimize_for_mobile` to be consistent with the default ([#59282](https://github.com/pytorch/pytorch/pull/59282)) * Fixed XNNPACK hardswish memory issue ([#59577](https://github.com/pytorch/pytorch/pull/59577), [#61622](https://github.com/pytorch/pytorch/pull/61622)) * Fixed the issue where `model_dump` didn’t work with delegate models ([#61043](https://github.com/pytorch/pytorch/pull/61043)) * Fixed concat shaders that didn’t work on certain iOS devices ([#61074](https://github.com/pytorch/pytorch/pull/61074)) * Fixed the Metal `torch.clamp` shader function for x86_64 ([#63062](https://github.com/pytorch/pytorch/pull/63062)) * Fixed callstack pointer serialization bug ([#63576](https://github.com/pytorch/pytorch/pull/63576)) * Fixed model loading error for Vulkan backend in Java API ([#63402](https://github.com/pytorch/pytorch/pull/63402)) * Fixed the issue where submodules with the same name were not serialized correctly in bytecode format ([#61933](https://github.com/pytorch/pytorch/pull/61933)) ## Quantization * Fixed crash when model outputs dicts or lists ([#58416](https://github.com/pytorch/pytorch/pull/58416)) * QAT: Fixed the runtime error `cannot resize variables that require grad` ([#57068](https://github.com/pytorch/pytorch/pull/57068)) * Fixed support for custom modules ([#59041](https://github.com/pytorch/pytorch/pull/59041)) * Fixed the "tensors to be on the same device" error in HistogramObserver ([#59234](https://github.com/pytorch/pytorch/pull/59234)) * Fixed dimension for output of batchnorm 1d ([#59264](https://github.com/pytorch/pytorch/pull/59264)) * Fixed quantized mean operator in QNNPACK backend ([#59761](https://github.com/pytorch/pytorch/pull/59761)) * Fixed a bug in `.to` for quantized tensors so that scale/zero_point move too ([#61576](https://github.com/pytorch/pytorch/pull/61576)) * Fixed quantized Conv1d module parameters ([#62356](https://github.com/pytorch/pytorch/pull/62356)) * Fixed quantization for tuple arguments 
([#63376](https://github.com/pytorch/pytorch/pull/63376)) * Fixed fuse qconfig comparison ([#63384](https://github.com/pytorch/pytorch/pull/63384)) * Fixed the conversion of the quantizable RNN ([#63879](https://github.com/pytorch/pytorch/pull/63879)) * Fixed quantization for sub_scalar ([#64603](https://github.com/pytorch/pytorch/pull/64603)) * Fixed a bug for sub ([#65109](https://github.com/pytorch/pytorch/pull/65109)) * Add change to ensure qconfig works for QAT with multiple modules ([#63343](https://github.com/pytorch/pytorch/pull/63343)) ## Distributed `DistributedDataParallel` * Fixed Pipe + DDP for unused parameters, static graph ([#60118](https://github.com/pytorch/pytorch/pull/60118)) * Fixed case where new tensors with no grad_fn are returned in DDP forward. ([#60882](https://github.com/pytorch/pytorch/pull/60882)) * Re-enabled the optimization of fusing copy and division when no comm hook is specified for both dense and sparse tensors ([#61379](https://github.com/pytorch/pytorch/pull/61379), [#61814](https://github.com/pytorch/pytorch/pull/61814)) * Fixed fp16 C++ DDP gradient communication hook ([#63375](https://github.com/pytorch/pytorch/pull/63375)) * Added change to ensure buffers are broadcasted properly when they are reassigned in module ([#64776](https://github.com/pytorch/pytorch/pull/64776)) * Fixed GradBucket.is_last() logic ([#63768](https://github.com/pytorch/pytorch/pull/63768)) `torch.distributed.Store` * torch.distributed and RPC cannot both be initialized with the same host:port pair ([#58328](https://github.com/pytorch/pytorch/pull/58328), [#58329](https://github.com/pytorch/pytorch/pull/58329), [#58330](https://github.com/pytorch/pytorch/pull/58330), [#58331](https://github.com/pytorch/pytorch/pull/58331)) `torch.distributed.rpc` * Added change to run dist_autograd backward RPCs on appropriate CUDA streams. ([#60606](https://github.com/pytorch/pytorch/pull/60606)) * Fixed race condition in TensorPipe agent ([#58753](https://github.com/pytorch/pytorch/pull/58753)) * Fixed issue when some gradients are None for distributed optimizers ([#62249](https://github.com/pytorch/pytorch/pull/62249)) `torch.distributed.elastic` * Added change to ensure rendezvous timeout does not get overwritten ([#61471](https://github.com/pytorch/pytorch/pull/61471)) * Fixed the edge case when no node is alive ([#59663](https://github.com/pytorch/pytorch/pull/59663)) * Added change to cast timestamp type to int ([#59712](https://github.com/pytorch/pytorch/pull/59712)) * Added properly formatted traceback on error ([#65041](https://github.com/pytorch/pytorch/pull/65041)) `torch.distributed.autograd` * Updated GraphTask::owner_ in a single thread for DistEngine. 
([#58625](https://github.com/pytorch/pytorch/pull/58625)) * Introduced the deadlock fix ([#61588](https://github.com/pytorch/pytorch/pull/61588), [#61593](https://github.com/pytorch/pytorch/pull/61593)) `torch.distributed` * Fixed the slowdown of `_object_to_tensor` since 1.9 ([#65721](https://github.com/pytorch/pytorch/pull/65721)) ## torch.fx * Fixed retracing wrapped functions ([#58061](https://github.com/pytorch/pytorch/pull/58061)) * Added override for call_function so that wrapped functions stay wrapped ([#60057](https://github.com/pytorch/pytorch/pull/60057)) * Added fix to retain node.meta after normalizing args ([#60449](https://github.com/pytorch/pytorch/pull/60449)) * Added change to skip the output nodes but process possible nodes after them, when creating a single partition ([#60370](https://github.com/pytorch/pytorch/pull/60370)) * Fixed FX patch module name ([#61062](https://github.com/pytorch/pytorch/pull/61062)) * Fixed graph `copy.deepcopy` to propagate output type ([#61747](https://github.com/pytorch/pytorch/pull/61747)) * Added change to allow starter nodes to depend on `get_attr` node ([#62234](https://github.com/pytorch/pytorch/pull/62234)) * Added change to prevent implicit submodule inlining when submodule is a GraphModule ([#62436](https://github.com/pytorch/pytorch/pull/62436)) * Added change to persist `tracer_cls` on `fx.Graph` when deep copying ([#63353](https://github.com/pytorch/pytorch/pull/63353)) * Fixed GraphModule deepcopy to use deepcopied graph ([#63090](https://github.com/pytorch/pytorch/pull/63090)) * Fixed constant folding for attrs in submodule hierarchies ([#64342](https://github.com/pytorch/pytorch/pull/64342)) * Fixed some const fold cases with deep model hierarchy ([#64945](https://github.com/pytorch/pytorch/pull/64945)) * Fixed tracing of bitwise and/or ([#65196](https://github.com/pytorch/pytorch/pull/65196)) ## ONNX * Added shape type inference fixes for control flow ([#60248](https://github.com/pytorch/pytorch/pull/60248)) * Fixed sum export with attribute `keepdims` ([#60245](https://github.com/pytorch/pytorch/pull/60245)) * Fixed shape inference for large models ([#60244](https://github.com/pytorch/pytorch/pull/60244)) * Fixed split export in op set 13 ([#57605](https://github.com/pytorch/pytorch/pull/57605)) * Fixed control-flow shape inference with contrib op ([#62762](https://github.com/pytorch/pytorch/pull/62762)) * Updated `instance_norm2d` export to handle `track_running_stats=True` ([#58690](https://github.com/pytorch/pytorch/pull/58690)) * Fixed the issue of converting an empty list to a sequence ([#61558](https://github.com/pytorch/pytorch/pull/61558)) * Fixed issue where `sum` could not be exported for an empty tensor ([#59537](https://github.com/pytorch/pytorch/pull/59537)) * Fixed an issue where optimizations might adjust graph inputs unexpectedly ([#62763](https://github.com/pytorch/pytorch/pull/62763)) ## Vulkan * Fixed an issue where comparing equivalent descriptors would evaluate to `false` ([#60199](https://github.com/pytorch/pytorch/pull/60199)) * Fixed asserts in Vulkan JIT passes to actually throw an exception ([#61495](https://github.com/pytorch/pytorch/pull/61495)) ## Performance_as_a_product * Added fix to ensure number-of-threads utilities are initialized before getting the number of threads ([#60185](https://github.com/pytorch/pytorch/pull/60185)) * Added fix to ensure thread id is valid in nested parallel regions ([#60183](https://github.com/pytorch/pytorch/pull/60183)) * Fixed parallel TBB build 
([#60532](https://github.com/pytorch/pytorch/pull/60532)) * Added change to make flags in the pytorch managed thread pool atomic. ([#58457](https://github.com/pytorch/pytorch/pull/58457)) * Set mkl thread locally ([#62891](https://github.com/pytorch/pytorch/pull/62891)) ## Composability * Added a fix to ensure that the C++ API’s that skip the dispatcher (such as `at::cpu::{op}` and `at::cuda::{op}` get external linkage, so they can be used outside of libtorch ([#58569](https://github.com/pytorch/pytorch/pull/58569)) * Fixed bug where shared memory tensor file names can collide ([#60978](https://github.com/pytorch/pytorch/pull/60978)) ## Build_Frontend * Fixed binary building without python ([#66031](https://github.com/pytorch/pytorch/pull/66031)) * Fixed Windows ninja builds when MAX_JOBS is specified ([#65444](https://github.com/pytorch/pytorch/pull/65444)) * Skipped Bfloat16 support when building for VSX ([#61630](https://github.com/pytorch/pytorch/pull/61630)) * Made change to use python3 alias in Makefile ([#58786](https://github.com/pytorch/pytorch/pull/58786)) * Made change to use `pybind11` from `third_party` folder by default ([#58951](https://github.com/pytorch/pytorch/pull/58951)) * Made change to ensure FindLAPACK finds the same BLAS library ([#49647](https://github.com/pytorch/pytorch/pull/49647)) * Improved Python package detection in `torch.utils.collect_env` ([#63321](https://github.com/pytorch/pytorch/pull/63321)) * Skipped SVE acceleration on M1 machine ([#58785](https://github.com/pytorch/pytorch/pull/58785)) * Made `SciPy` dependency optional in PyTorch unary operators tests ([#59304](https://github.com/pytorch/pytorch/pull/59304)) * Fixed error-handling when Python executable can not be found ([#61230](https://github.com/pytorch/pytorch/pull/61230)) * Fixed `setup.py` re-run incremental build logic on Windows ([#59689](https://github.com/pytorch/pytorch/pull/59689)) * Reduced binary size for CUDA-split build by establishing correct linking order ([#58287](https://github.com/pytorch/pytorch/pull/58287)) * Fixed `torch.utils.cpp_extension` behavior when older setuptools are used ([#61484](https://github.com/pytorch/pytorch/pull/61484)) ## Infra (Releng) * Fixed windows ci squid env ([#62353](https://github.com/pytorch/pytorch/pull/62353)) * Introduced CI dependency pinning: ([#64922](https://github.com/pytorch/pytorch/pull/64922), [#65017](https://github.com/pytorch/pytorch/pull/65017)) * Fixed breakpad build and add to more images ([#59236](https://github.com/pytorch/pytorch/pull/59236)) * Updated certificate trust chain CI to depend on the linked commits ([#65934](https://github.com/pytorch/pytorch/pull/65934), [#66004](https://github.com/pytorch/pytorch/pull/66004)) ## LinAlg_Frontend * Fixed an issue where the “info” tensor returned by `torch.linalg.inv_ex` could sometimes be on the wrong device ([#59223](https://github.com/pytorch/pytorch/pull/59223)) * Fixed an issue where `torch.linalg.norm` could return tensors with the wrong shape in some edge cases ([#60273](https://github.com/pytorch/pytorch/pull/60273)) * Fixed an issue where `torch.linalg.svd` could return tensors with the wrong shape in some edge cases ([#62022](https://github.com/pytorch/pytorch/pull/62022)) * Fixed an issue where `torch.matmul` would throw an error when attempting to multiply certain empty tensors ([#63359](https://github.com/pytorch/pytorch/pull/63359)) ## Sparse_Frontend * Fixed dtype inference in sparse_csr_tensor_ctor ([#58631](https://github.com/pytorch/pytorch/pull/58631)) * Fixed 
addmm failure for CSR Tensors when MKL is not available ([#58768](https://github.com/pytorch/pytorch/pull/58768)) * Fixed overflow of numel for sparse COO tensors after calling coalesce ([#57492](https://github.com/pytorch/pytorch/pull/57492)) * Fixed multiplication of 0-dim Tensor and COO sparse Tensor and improved Error message for multiplication of dense and sparse COO tensor ([#61723](https://github.com/pytorch/pytorch/pull/61723)) * Fixed internal assert error for CSR tensors crow_/col_indices methods in Debug build ([#63176](https://github.com/pytorch/pytorch/pull/63176)) * Fixed support of torch.conj for zero-dimensional sparse COO Tensors ([#59553](https://github.com/pytorch/pytorch/pull/59553)) ## Misc * Added change to increase warmup for better steady state measurements. ([#58801](https://github.com/pytorch/pytorch/pull/58801)) * Fixed bad use of channels last kernel in sync batch norm backward ([#64100](https://github.com/pytorch/pytorch/pull/64100)) # Performance ## Python API * `torch.special.{'i0', 'i0e', 'i1', 'i1e'}:` converted floating-point constants to input type in Bessel functions ([#59416](https://github.com/pytorch/pytorch/pull/59416)) * Added change to speed up `torch.unique_consecutive()` ([#64835](https://github.com/pytorch/pytorch/pull/64835)) * Made sure all graphs tests call `torch.cuda.empty_cache()` before capture to fix flaky tests ([#59233](https://github.com/pytorch/pytorch/pull/59233)) * `torch.flip` : improved performance via TensorIterator ([#59509](https://github.com/pytorch/pytorch/pull/59509)) * Added change to parallelize `torch.gelu` via tensoriterator ([#58950](https://github.com/pytorch/pytorch/pull/58950)) * `torch.sum`: added change to accumulate 16-bit float sums in 32-bit accumulators for improved precision and performance ([#60387](https://github.com/pytorch/pytorch/pull/60387)) * Added fast path for conjugated tensors for `torch.`{`dot, vdot, mm, addmm, bmm, baddbmm}` ([#62915](https://github.com/pytorch/pytorch/pull/62915), [#59380](https://github.com/pytorch/pytorch/pull/59380)) ## Autograd * Faster `torch.cum{sum,prod}` backward formulas ([#60642](https://github.com/pytorch/pytorch/pull/60642)) * Reduced overhead from `reshape` call if the tensor already has the right shape ([#61466](https://github.com/pytorch/pytorch/pull/61466)) * Added change to speed up saving variables for backward ([#59837](https://github.com/pytorch/pytorch/pull/59837), [#61927](https://github.com/pytorch/pytorch/pull/61927)) * Reduced number of TLS access when deciding if an op needs to be tracked by autograd or not ([#60740](https://github.com/pytorch/pytorch/pull/60740)) * Improved code that detect when it is valid to re-use existing Tensors during the backward pass ([#59817](https://github.com/pytorch/pytorch/pull/59817)) ## torch.nn * `nn.utils.clip_grad_norm_`: Removed device syncs ([#61042](https://github.com/pytorch/pytorch/pull/61042)) * `nn.BatchNorm2d`: Optimized performance for `channels_last` on CPU ([#59286](https://github.com/pytorch/pytorch/pull/59286)) * `nn.Softmax`: Vectorized softmax calculation for the non-last-dimension case ([#59195](https://github.com/pytorch/pytorch/pull/59195), [#60371](https://github.com/pytorch/pytorch/pull/60371)) * `nn.Transformer`: Faster `generate_square_subsequent_mask` ([#60631](https://github.com/pytorch/pytorch/pull/60631)) ## CUDA * Updated launch bounds for trilinear 3d ([#59999](https://github.com/pytorch/pytorch/pull/59999)) * Migrated Embedding thrust sort to cub sort 
([#62495](https://github.com/pytorch/pytorch/pull/62495)) * Make `unique` call in embedding use cub instead of thrust ([#63042](https://github.com/pytorch/pytorch/pull/63042)) * Migrated masked_scatter to use cub instead of thrust ([#56750](https://github.com/pytorch/pytorch/pull/56750)) * Reverted D28547564: [pytorch][PR] masked_scatter thrust→cub ([9e261de630](https://github.com/pytorch/pytorch/commit/9e261de630)) * Make sort in EmbeddingBag use cub instead of thrust ([#64498](https://github.com/pytorch/pytorch/pull/64498)) * Migrated Embedding thrust sort to cub sort ([#63806](https://github.com/pytorch/pytorch/pull/63806)) * Removed cat, equal, and stack from autocast promote list ([#59497](https://github.com/pytorch/pytorch/pull/59497)) * Add cublas and cusolver paths for LU solve ([#59148](https://github.com/pytorch/pytorch/pull/59148)) * Fixed launch bounds for gathertopk kernel ([#60314](https://github.com/pytorch/pytorch/pull/60314)) * Changed launch bounds, unrolled for loop for grid sampler 2d fwd and bwd ([#60405](https://github.com/pytorch/pytorch/pull/60405)) * Changed launch bound to fix col2im kernel ([#60315](https://github.com/pytorch/pytorch/pull/60315)) * Fixed launch bounds for grid sampler 3d ([#60385](https://github.com/pytorch/pytorch/pull/60385)) * CUDA graphs: added change to not sync between replays for CUDA driver version 11.4+ ([#61063](https://github.com/pytorch/pytorch/pull/61063)) * Changed launch bounds for upsample_linear1d fwd, bwd from 1024 to 512 ([#61307](https://github.com/pytorch/pytorch/pull/61307)) * Added change to reduce max_num_threads for complex double ops in reduce_kernel ([#61438](https://github.com/pytorch/pytorch/pull/61438)) * Added change to use `fastAtomicAdd` in EmbeddingBag (mode "max") backward ([#63298](https://github.com/pytorch/pytorch/pull/63298)) * Added change to use multi-dimensional cuFFT transforms to improve FFT performance ([#61203](https://github.com/pytorch/pytorch/pull/61203)) * `F.avg_pool3d` CUDA backward: use fast atomic adds ([#63387](https://github.com/pytorch/pytorch/pull/63387)) * Add cuSOLVER path for LU factorization in CUDA. 
([#56887](https://github.com/pytorch/pytorch/pull/56887)) * Reverted launch bounds change in topK that induced a regression in perf ([#63431](https://github.com/pytorch/pytorch/pull/63431)) * Added change to bring back old algorithm for sorting on small number of segments ([#64127](https://github.com/pytorch/pytorch/pull/64127)) ## Mobile * Added change to use channel-last to transform the weights for Metal ([#59113](https://github.com/pytorch/pytorch/pull/59113)) * Implemented RoIAlign in Metal shaders using Sampler ([#56075](https://github.com/pytorch/pytorch/pull/56075)) * Added cache operator lambda during model loading ([#61996](https://github.com/pytorch/pytorch/pull/61996)) * Added Operator Call De-dup at TorchScript Serialization Level ([#64269](https://github.com/pytorch/pytorch/pull/64269)) * Added change to speed up model loading by 1directly calling the C file API from FileAdapter ([#61997](https://github.com/pytorch/pytorch/pull/61997)) * Moved from input ivalues in ByteCodeDeserializer ([#64029](https://github.com/pytorch/pytorch/pull/64029)) * Fixed MobileDebugInfo vector copy ([#64030](https://github.com/pytorch/pytorch/pull/64030)) * Added change to gate tls_local_dispatch_key_set off on iOS too ([#64753](https://github.com/pytorch/pytorch/pull/64753)) * Added change to not store multiple kernels per key on mobile ([#64447](https://github.com/pytorch/pytorch/pull/64447)) * Added OpCode cache in ByteCodeDeserializer ([#64110](https://github.com/pytorch/pytorch/pull/64110)) * Reduced mobile model size by reusing constant and bump bytecode to v5 ([#59722](https://github.com/pytorch/pytorch/pull/59722)) ## Distributed * `torch.distributed:` replaced all_gather with more efficient collective api _all_gather_base ([#57769](https://github.com/pytorch/pytorch/pull/57769)) * `torch.distributed.optim.ZeroRedundancyOptimizer: `Sorted params by size (decreasing) ([#59586](https://github.com/pytorch/pytorch/pull/59586)) ## Vulkan * Improved the performance of pointwise convolutions by having each shader invocation calculate a 4x4 output tile ([#60760](https://github.com/pytorch/pytorch/pull/60760)) * Implemented a simple scheme to set the local work group size adaptively ([#61170](https://github.com/pytorch/pytorch/pull/61170)) ## Performance_as_a_product * TensorIterator: added change to reduce serial_for_each static overhead ([#58909](https://github.com/pytorch/pytorch/pull/58909)) * Added change to avoid using `std::regex` for device string parsing ([#63204](https://github.com/pytorch/pytorch/pull/63204)) ## Composability * Introduced some perf improvements for reduction ops ([#58655](https://github.com/pytorch/pytorch/pull/58655)) * Added optimization to some internal representations of sizes ([#59333](https://github.com/pytorch/pytorch/pull/59333)) * Reduced the number of tensor refcount bumps in many existing kernels ([#58303](https://github.com/pytorch/pytorch/pull/58303), [#59827](https://github.com/pytorch/pytorch/pull/59827), [#58273](https://github.com/pytorch/pytorch/pull/58273), [#58272](https://github.com/pytorch/pytorch/pull/58272), [#58276](https://github.com/pytorch/pytorch/pull/58276), [#58277](https://github.com/pytorch/pytorch/pull/58277), [#58279](https://github.com/pytorch/pytorch/pull/58279), [#60546](https://github.com/pytorch/pytorch/pull/60546), [#58280](https://github.com/pytorch/pytorch/pull/58280)) * Added micro-optimizations to improve the time it takes to load pytorch ([#64784](https://github.com/pytorch/pytorch/pull/64784), 
[#64820](https://github.com/pytorch/pytorch/pull/64820), [#64821](https://github.com/pytorch/pytorch/pull/64821), [#64822](https://github.com/pytorch/pytorch/pull/64822), [#64838](https://github.com/pytorch/pytorch/pull/64838), [#64678](https://github.com/pytorch/pytorch/pull/64678), [#64682](https://github.com/pytorch/pytorch/pull/64682), [#64670](https://github.com/pytorch/pytorch/pull/64670)) ## Build_Frontend * Compiled BatchLinearAlgebra CUDA integration routines with host compiler ([#64146](https://github.com/pytorch/pytorch/pull/64146)) * Sped-up compilation by splitting autogenerated files into smaller ones ([#62186](https://github.com/pytorch/pytorch/pull/62186)) * Allowed [ninja-build](https://ninja-build.org/) to dynamically pick best parallel build option ([#64733](https://github.com/pytorch/pytorch/pull/64733), [#65162](https://github.com/pytorch/pytorch/pull/65162)) ## Infra (Releng) * .github: upload /download large artifacts to s3 ([#58506](https://github.com/pytorch/pytorch/pull/58506)) * Made change to only run mem leak check on master ([#60023](https://github.com/pytorch/pytorch/pull/60023)) * Enabled parallel clang-tidy on ec2 runner ([#60870](https://github.com/pytorch/pytorch/pull/60870)) * Made change to skip magma library installation for Windows CPU builds ([#59619](https://github.com/pytorch/pytorch/pull/59619)) ## Sparse_Frontend * Sped up conversion of COO to CSR Tensor `to_sparse_csr` by writing custom CPU/GPU kernels ([#61340](https://github.com/pytorch/pytorch/pull/61340), [#61838](https://github.com/pytorch/pytorch/pull/61838)) * Slightly sped up calculation of number of dense entries for sparse softmax via `c10::multiply_integers` for COO Tensors ([#60872](https://github.com/pytorch/pytorch/pull/60872)) * Slightly sped up sparse softmax for COO Tensors by improve usage of `std::vector` ([#60873](https://github.com/pytorch/pytorch/pull/60873)) * Sped up index_select for sparse COO Tensor ([#63008](https://github.com/pytorch/pytorch/pull/63008)) ## Misc * Greatly reduced the post-processing time of the profiler ([#60432](https://github.com/pytorch/pytorch/pull/60432)) * Saved some little memory in `default_collate` ([#61424](https://github.com/pytorch/pytorch/pull/61424)) * Added new ops to the operator microbenchmark: `gelu`, `bmm`, `mm`, `einsum`, `log1p` ([#59334](https://github.com/pytorch/pytorch/pull/59334), [#59595](https://github.com/pytorch/pytorch/pull/59595), [#63654](https://github.com/pytorch/pytorch/pull/63654), [#64647](https://github.com/pytorch/pytorch/pull/64647), [#64032](https://github.com/pytorch/pytorch/pull/64032), [#64205](https://github.com/pytorch/pytorch/pull/64205)) * Added AVX512 support in ATen & remove AVX support ([#61903](https://github.com/pytorch/pytorch/pull/61903)) You can also find the dev specific and documentation related changes in the forum post [here](https://dev-discuss.pytorch.org/t/pytorch-1-10-dev-release-notes/379)

Small bug fix release (2021-09-22)

# PyTorch 1.9.1 Release Notes

* Improvements
* Bug Fixes
* Documentation

# Improvements

* Stop warning on `.names()` access in `max_pool2d` #60059
* Remove Caffe2 thread-pool leak warning #60318
* Add option to skip GitHub tag validation for `torch.hub.load` #62139 (see the sketch after this list)
* Use `log.warning` in `torch.distributed.run` to print OMP_NUM_THREADS warning #63953
* TorchElastic: Pretty print the failure message captured by @record #64036
* `torch.distributed.run` to set `nproc_per_node` to 1 by default #61552
* Remove experimental API warning from `torch.distributed.elastic.utils.store` #60807
* Deprecate `use_env` in `torch.distributed.run` #59409
* Better engineering changes for the torch.distributed launcher #59152
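
As a quick illustration of the tag-validation option above (#62139), here is a hedged sketch; it assumes the option is exposed as the `skip_validation` keyword of `torch.hub.load`, uses the public `pytorch/vision` repo purely as an example, and needs network access to run.

```python
import torch

# Hedged sketch: `skip_validation=True` is assumed to bypass the GitHub
# branch/tag validation request torch.hub normally makes before downloading.
model = torch.hub.load(
    "pytorch/vision:v0.10.0",  # example repo pinned to a tag
    "resnet18",
    pretrained=True,
    skip_validation=True,      # assumed name of the new option
)
model.eval()
```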

# Bug fixes

## Distributed / TorchElastic

* Make `init_method=tcp://` compatible with `torch.distributed.run` #63910 (see the sketch after this list)
* Fix default parameters (number of restarts, log level, number of processes per node) that regressed with the transition from `torch.distributed.launch` to `torch.distributed.run`, and clarify the documentation accordingly #61294
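
A hedged sketch of what the `init_method=tcp://` fix enables: a script launched with `torch.distributed.run` can read the rank/world-size environment variables the launcher sets and still pass an explicit `tcp://` rendezvous address to `init_process_group`. The file name, host, and port below are illustrative.

```python
# train.py -- launched, for example, with:
#   python -m torch.distributed.run --nproc_per_node=2 train.py
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])              # set by torch.distributed.run
world_size = int(os.environ["WORLD_SIZE"])  # set by torch.distributed.run

dist.init_process_group(
    backend="gloo",
    # Explicit TCP rendezvous instead of the usual env://; pick a port that
    # differs from the launcher's own master port.
    init_method="tcp://127.0.0.1:23456",
    rank=rank,
    world_size=world_size,
)
print(f"initialized rank {rank} of {world_size}")
dist.destroy_process_group()
```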

## Hub

* Fix HTTP/403 error when calling `torch.hub.load` for TorchVision models #62072

## Misc

* `torch.mm` now checks input matrix shapes #61394 (see the sketch below)
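
A minimal sketch of what that check looks like in practice (the shapes below are illustrative):

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(4, 5)

# The inner dimensions (3 vs. 4) do not match, so torch.mm rejects the
# inputs with a shape error.
try:
    torch.mm(a, b)
except RuntimeError as e:
    print(e)
```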

# Documentation

* Fix broken link in elastic launch doc #62378
* Fix typo in `torch.distributed.run` warning message #61127


LTS 1.8.2, Wrap cub in its own namespace (2021-08-17)

# **PyTorch 1.8.2 Release Notes** 

* Highlights
* Bug Fixes

# Highlights

We are excited to announce the release of PyTorch 1.8.2. This is the first release we are making as part of the [PyTorch Enterprise Support Program](https://pytorch.org/enterprise-support-program). This release includes a bug fix requested by a customer in an LTS branch.
We'd like to thank Microsoft for their support and work on this release.

# Bug Fixes
* Wrap cub in its own namespace ([#55292](https://github.com/pytorch/pytorch/pull/55292)) ([#61605](https://github.com/pytorch/pytorch/pull/61605))

PyTorch 1.9 Release, including Torch.Linalg and Mobile Interpreter (2021-06-15)

# **PyTorch 1.9 Release Notes** 

* Highlights
* Backwards Incompatible Change
* Deprecations
* New Features
* Improvements
* Bug Fixes
* Performance
* Documentation

# Highlights

We are excited to announce the release of PyTorch 1.9. The release is composed of more than 3,400 commits since 1.8, made by 398 contributors. Highlights include:

* Major improvements to support scientific computing, including torch.linalg, torch.special, and Complex Autograd
* Major improvements in on-device binary size with Mobile Interpreter
* Native support for elastic fault-tolerant training through the upstreaming of TorchElastic into PyTorch Core
* Major updates to the PyTorch RPC framework to support large scale distributed training with GPU support
* New APIs to optimize performance and packaging for model inference deployment 
* Support for Distributed training, GPU utilization and SM efficiency in the PyTorch Profiler

We’d like to thank the community for their support and work on this latest release. We’d especially like to thank Quansight and Microsoft for their contributions.

You can find more details on all the highlighted features in the [_PyTorch 1.9 Release blogpost_](https://pytorch.org/blog/pytorch-1.9-released/). 

# Backwards Incompatible changes

## Python API

* **`torch.divide` with `rounding_mode='floor'` now returns infinity when a non-zero number is divided by zero** ([#56893](https://github.com/pytorch/pytorch/pull/56893)).
This fixes the `rounding_mode='floor'` behavior to return the same non-finite values as other rounding modes when there is a division by zero. Previously it would always result in a NaN value, but a non-zero number divided by zero should return +/- infinity in IEEE floating-point arithmetic. Note that this does not affect `torch.floor_divide` or the floor division operator, which currently use `rounding_mode='trunc'` (and are also deprecated for that reason).

1.8.1:
```python
>>> a = torch.tensor([-1.0, 0.0, 1.0])
>>> b = torch.tensor([0.0])
>>> torch.divide(a, b, rounding_mode='floor')
tensor([nan, nan, nan])
```
1.9.0:
```python
>>> a = torch.tensor([-1.0, 0.0, 1.0])
>>> b = torch.tensor([0.0])
>>> torch.divide(a, b, rounding_mode='floor')
tensor([-inf, nan, inf])
```

* **Legacy tensor constructors and `Tensor.new` no longer support passing both `Tensor` and `device` as inputs ([#58108](https://github.com/pytorch/pytorch/pull/58108)).** This fixes a bug in which 1-element integer tensors were misinterpreted as specifying tensor size, yielding an uninitialized tensor. As noted in the error message, use the new-style `torch.tensor(...)` or `torch.as_tensor(...)` to copy or alias an existing tensor. If you want to create an uninitialized tensor, use `torch.empty(...)`.

1.8.1:
```python
>>> a = torch.tensor([1])
>>> torch.LongTensor(a, device='cpu') # uninitialized
tensor([7022349217739848992])
>>> a.new(a, device='cpu')
tensor([4294967295]) # uninitialized
```
1.9.0:
```python
>>> a = torch.tensor([1])
>>> torch.LongTensor(a, device='cpu')
RuntimeError: Legacy tensor constructor of the form torch.Tensor(tensor, device=device) is
not supported. Use torch.tensor(...) or torch.as_tensor(...) instead.
>>> a.new(a, device='cpu')
RuntimeError: Legacy tensor new of the form tensor.new(tensor, device=device) is not
supported. Use torch.as_tensor(...) instead.
```
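
Following the comparison above, a minimal migration sketch for code that relied on the legacy constructor forms:

```python
import torch

a = torch.tensor([1])

# Alias an existing tensor (copying only if a conversion is required):
b = torch.as_tensor(a, device='cpu')

# Make an explicit copy:
c = a.clone()

# Create an uninitialized tensor of a given size, which is what the old
# torch.LongTensor(a, device='cpu') call accidentally produced:
d = torch.empty(3, dtype=torch.long)
```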

* **`torch.divide` with `rounding_mode='true'` is replaced with `rounding_mode=None` ([#51988](https://github.com/pytorch/pytorch/pull/51988)).** `torch.divide`'s undocumented `rounding_mode='true'` option has been removed, and instead `rounding_mode=None` should be passed to indicate no rounding should take place. This is equivalent to omitting the argument entirely.

1.8.1:
```python
>>> a, b = torch.full((2,), 4.2), torch.full((2,), 2)
>>> torch.divide(a, b, rounding_mode='true')
tensor([2.1000, 2.1000])
```
1.9.0:
```python
>>> a, b = torch.full((2,), 4.2), torch.full((2,), 2)
>>> torch.divide(a, b, rounding_mode=None) # equivalent to torch.divide(a, b, rounding_mode='true') from the prior release
tensor([2.1000, 2.1000])
```

* **`import torch.tensor as tensor` is no longer supported ([#53424](https://github.com/pytorch/pytorch/pull/53424)).** Instead, use `from torch import tensor`

1.8.1:
```python
>>> import torch.tensor as tensor
>>> torch.tensor(1.)
tensor(1.)
```
1.9.0:
```python
>>> import torch.tensor as tensor
ModuleNotFoundError: No module named 'torch.tensor'
>>> from torch import tensor
>>> tensor(1.)
tensor(1.)
```

* **binary release: `numpy` is no longer a required dependency** If you require `numpy` (and don't already have it installed) you will need to install it separately. ## Autograd * **`torch.autograd.gradcheck.get_numerical_jacobian` and `torch.autograd.gradcheck.get_analytical_jacobian` no longer support functions that return complex valued output as well as any other values of `grad_out` not equal to 1** ([#55692](https://github.com/pytorch/pytorch/pull/55692)). This change is a part of a refactor of `gradcheck`’s internals. Note that `gradcheck` itself still supports functions with complex output. This new restriction only applies to calls to the two internal helper functions. As a workaround, you can wrap your functions to return either the real or imaginary component of its output before calling these functions. Additionally these internal helpers no longer accept any other value except 1 for `grad_out` for any input function. Note that these helper functions are also being deprecated in this release. 1.8.1: ```python get_numerical_jacobian(torch.complex, (a, b), grad_out=2.0) ``` 1.9.0: ```python def wrapped(fn): def wrapper(*input): return torch.real(fn(*input)) return wrapper get_numerical_jacobian(wrapped(torch.complex), (a, b), grad_out=1.0) ``` * **`torch.autograd.gradcheck` now throws `GradcheckError`** ([#55656](https://github.com/pytorch/pytorch/pull/55656)). This change is a part of a refactor of `gradcheck`’s internals. All errors that are able to be silenced by `raise_exception=False` now raise `GradcheckError` (which inherits from `RuntimeError`). If you explicitly check that the type of the error is `RuntimeError` you'll need to update your code to check for `GradcheckError` instead. Otherwise if you use something like `except` or `isinstance`, no changes are necessary. 1.8.1: ```python # An example of a situation that will now return GradcheckError instead of # RuntimeError is when there is a jacobian mismatch, which can happen # for example when you forget to specify float64 for your inputs. try: torch.autograd.gradcheck(torch.sin, (torch.ones(1, requires_grad=True),)) except RuntimeError as e: assert type(e) is RuntimeError # explicitly check type -> NEEDS UPDATE ``` 1.9.0: ```python try: torch.autograd.gradcheck(torch.sin, (torch.ones(1, requires_grad=True),) except RuntimeError as e: # GradcheckError inherits from RuntimeError so you can still catch this # with RuntimeError (No change necessary!) # BUT, if you explicitly check type... assert type(e) is torch.autograd.GradcheckError ``` * **Finished deprecation cycle for in-place view error checks** ([#56093](https://github.com/pytorch/pytorch/pull/56093)). In-place modification of views will now raise an error if that view was created by a custom function or a function that returns multiple views, or if the view was created in no-grad mode. Modifying in-place a view created in the situations above are error-prone and have been deprecated since v1.5.0. Doing these in-place modifications are now forbidden. For more information on how to work around this, see the related sections the release notes linked below: * [v1.5.0](https://github.com/pytorch/pytorch/releases?after=v1.5.1) (view created in custom autograd function, view created in no-grad block) * [v1.7.0](https://github.com/pytorch/pytorch/releases?after=v1.8.0-rc3) (section on `split` and `chunk`, i.e., functions that return multiple views). 
## torch.nn

* **Fixed regression for `nn.MultiheadAttention` to now apply bias flag to both in and out projection layers** ([#52537](https://github.com/pytorch/pytorch/pull/52537)). In PyTorch 1.6, a regression was introduced that caused the `bias` flag of `nn.MultiheadAttention` to apply only to the input projection layer, so the output projection layer always included a `bias` parameter, even with `bias=False` specified. The regression is now fixed in PyTorch 1.9, making the `bias` flag correctly apply to both the input and output projection layers. This fix is BC-breaking for the `bias=False` case as it now results in no `bias` parameter for the output projection layer (a workaround sketch follows the comparison below).

v1.6 - v1.8.1:
```python
>>> mha = torch.nn.MultiheadAttention(4, 2, bias=False)
>>> print(mha.out_proj.bias)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
```
pre-1.6 & 1.9.0:
```python
>>> mha = torch.nn.MultiheadAttention(4, 2, bias=False)
>>> print(mha.out_proj.bias)
None
```
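
If existing code happened to depend on the old behavior of `bias=False` still leaving an output-projection bias, one hedged workaround is to re-register that parameter explicitly:

```python
import torch

mha = torch.nn.MultiheadAttention(4, 2, bias=False)
print(mha.out_proj.bias)  # None under the fixed 1.9 behavior

# Hedged workaround: restore the output-projection bias by hand
# (zero-initialized, matching what the pre-1.9 regression produced).
mha.out_proj.bias = torch.nn.Parameter(torch.zeros(mha.embed_dim))
print(mha.out_proj.bias)
```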

* **Updated `nn.Module` to fire full backward hooks even when no input requires grad** ([#56693](https://github.com/pytorch/pytorch/pull/56693)). Prior to this release, full backward hooks were not fired when no input requires gradients. This has been changed so that full backward hooks will always fire during the backward pass, regardless of whether or not any input requires gradients. If you are using full backward hooks, be aware that they may fire more frequently than pre-1.9 due to this change.

1.8.1:
```python
>>> m = torch.nn.Linear(2, 3)
>>> def hook(mod, grad_input, grad_output):
>>> print('hook called:', grad_input, grad_output)
>>> m.register_full_backward_hook(hook)
>>> input_no_grad = torch.rand(1, 2, requires_grad=False)
>>> m(input_no_grad).sum().backward()
>>> input_grad = torch.rand(1, 2, requires_grad=True)
>>> m(input_grad).sum().backward()
hook called: (tensor([[0.1478, 0.6517]]),) (tensor([[1., 1., 1.]]),)
```
1.9.0:
```python
>>> m = torch.nn.Linear(2, 3)
>>> def hook(mod, grad_input, grad_output):
>>> print('hook called:', grad_input, grad_output)
>>> m.register_full_backward_hook(hook)
>>> input_no_grad = torch.rand(1, 2, requires_grad=False)
>>> m(input_no_grad).sum().backward()
hook called: (None,) (tensor([[1., 1., 1.]]),)
>>> input_grad = torch.rand(1, 2, requires_grad=True)
>>> m(input_grad).sum().backward()
hook called: (tensor([[0.1478, 0.6517]]),) (tensor([[1., 1., 1.]]),)
```

## Dataloader

* **Add NumPy seeding to worker of DataLoader** ([#56488](https://github.com/pytorch/pytorch/pull/56488)). `DataLoader` with `num_workers > 0` will now set an independent random seed for NumPy random functions on each worker by default, so users no longer need to seed NumPy via `worker_init_fn` to make NumPy random operations deterministic and independent across `DataLoader` workers. This change does not affect users who already set a random seed for NumPy random functions using `worker_init_fn` (a sketch of that workaround follows the comparison below).

```python
# dataset returns numpy.random.randint(1, 10000)
ctx = mp.get_context('fork')
gen = torch.Generator().manual_seed(0)
dl = DataLoader(dataset, batch_size=2, num_workers=2, multiprocessing_context=ctx, generator=gen)
for epoch in range(2):
    print("=" * 4, "Epoch", epoch, "=" * 4)
    for batch in dl:
        print(batch)
```

1.8.1:
```python
# When using fork, each worker has same random seed for NumPy random functions at each epoch.
========== Epoch 0 ==========
tensor([[ 0, 340],
        [ 1, 7512]])
tensor([[ 2, 340],
        [ 3, 7512]])
========== Epoch 1 ==========
tensor([[ 0, 340],
        [ 1, 7512]])
tensor([[ 2, 340],
        [ 3, 7512]])
```
1.9.0:
```python
# Random seeds for NumPy are different across `DataLoader` workers in each epoch.
========== Epoch 0 ==========
tensor([[ 0, 8715],
        [ 1, 5555]])
tensor([[ 2, 6379],
        [ 3, 1432]])
========== Epoch 1 ==========
tensor([[ 0, 1374],
        [ 1, 996]])
tensor([[ 2, 143],
        [ 3, 3507]])
```
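
For reference, a hedged sketch of the `worker_init_fn` workaround mentioned above, which continues to work unchanged because the new default does not override an explicit seed; the small dataset class is only there to make the snippet self-contained:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RandIntDataset(Dataset):
    # Minimal stand-in for the dataset in the snippet above: each item is an
    # (index, NumPy random integer) pair.
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx, int(np.random.randint(1, 10000))

def seed_numpy_worker(worker_id):
    # Pre-1.9 workaround: derive a per-worker NumPy seed from the base seed
    # PyTorch assigns to each DataLoader worker (truncated to 32 bits).
    np.random.seed(torch.initial_seed() % 2**32)

if __name__ == "__main__":
    dl = DataLoader(RandIntDataset(), batch_size=2, num_workers=2,
                    worker_init_fn=seed_numpy_worker)
    for batch in dl:
        print(batch)
```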

* **Added static type checking enforce for DataPipe** ([#54020](https://github.com/pytorch/pytorch/pull/54020)). A new attribute named `type` has been introduced for `IterableDataset` using the typing annotation at each class declaration. By adding this attribute, we are able to extend `IterableDataset` to have type inference and lazy initialization to incorporate the new DataLoader architecture. But, several BC-breaking restrictions are introduced due to this feature. 1.8.1: ```python # Users can use string to bypass the invalid type annotation without any error. # And, incorrect type annotations attached to `__iter__` function are ignored. ``` 1.9.0: ```python # The following scenario will now raise different Exceptions # 1) The type annotation is required to be valid now. Previous workaround # like using string to represent the invalid type annotation is not supported now. # Raises Exception from the evaluation `eval("invalid_type", globals, locals)` class DS(IterableDataset["invalid_type"]): ... # Raises TypeError if the return type of __iter__ is not an Iterator class DS(IterableDataset[str]): def __iter__(self) -> str: ... # Raise TypeError if the return type of __iter__ is of the form Iterator[X], # but the argument type X is not a subtype of the IterableDataset.type attribute. class DS(IterableDataset[str]): def __iter__(self) -> Iterator[int]: ... # IterableDatset now has a metaclass, which will conflict with # existing user-defined metaclasses on IterableDatasets class DS(IterableDataset[str], metaclass=MyMeta): ... ``` ## Meta API * **Given Tensor a non-trivial (for now) metaclass _TensorMeta** ([#56147](https://github.com/pytorch/pytorch/pull/56147)). Tensor now has a non-trivial metaclass. This shouldn't be user observable, as Tensor already inherits from a C defined class (and is thus incompatible with other typical metaclasses), but there may be unanticipated interactions with other language features in Python. This PR changes the metaclass of torch.tensor. I.e. `type(type(torch.tensor([1])))` now prints `` (used to be ``) ## C++ API * **Changed in-place resize functions to return const Tensor&** ([#55351](https://github.com/pytorch/pytorch/pull/55351)). The C++ signature for `resize_`, `resize_as_`, `resize_as_sparse_`, `sparse_resize_`, and `sparse_resize_and_clear_` has changed to return a `const Tensor&` instead of a `Tensor&`. This may break users’ TORCH_LIBRARY operators that called these functions but returned a non-const `Tensor&`. Ideally, users can change their operators to also consume and return `const Tensor&`, but simply casting the result of the changed function with `const_cast` is also an option. 1.8.1: ```cpp const at::Tensor a = at::randn({2, 2}); const at::Tensor b = at::ones({1, 4}, at::kInt); at::Tensor& out = at::resize_as_(a, b); # success ``` 1.9.0: ```cpp const at::Tensor b = at::ones({1, 4}, at::kInt); at::Tensor& out = at::resize_as_(a, b); # error: binding value of type 'const at::Tensor' to reference to type 'at::Tensor' drops 'const' qualifier const at::Tensor& out = at::resize_as_(a, b); # Success ``` * **Some ATen Reduction Ops as well as `kron_out` now throw an error when an undefined tensor is passed as input for `out` argument** ([#53218](https://github.com/pytorch/pytorch/pull/53218), [#53640](https://github.com/pytorch/pytorch/pull/53640)). * C++ API for the reductions ops like `sum_out`, `nansum_out`, `prod_out`, `std_var_out` have been changed to require users allocating result Tensor before calling these ops. 
The C++ API `allocate_reduction_result` has changed to `resize_reduction_result` to disallow allocating result Tensor in these reduction ops. * The following code can be compiled, but will raise a `c10::Error` when executed. This code compiled and executed successfully in the prior release. ```cpp at::Tensor out; # Undefined Tensor const at::Tensor a = at::randn({2, 2}); at::IntArrayRef dim = {1}; at::sum_out(out, a, dim); # c10::Error: Expected a Tensor of type Variable but found an undefined Tensor for argument #4 'out' ``` * **The C++ API utility functions `expand_inplace` and `expand_outplace` now return `c10::MaybeOwned` instead of `std::tuple`** ([#55065](https://github.com/pytorch/pytorch/pull/55065), [#55245](https://github.com/pytorch/pytorch/pull/55245)). The rationale for this change is to avoid unnecessary Tensor creation, thus improving performance. Functions in ExpandUtils return `c10::MaybeOwned` because expansion may not actually be needed, in which case we can improve efficiency by returning `c10::MaybeOwned::borrowed(to_expand)`. However, this means that you need to be careful: the returned `c10::MaybeOwned `must not outlive the original `Tensor` object that `to_expand` referred to! The deleted rvalue reference overloads of these functions help with this by preventing trivial use of a temporary resulting from a function call, but it is still possible to make a mistake. ## TorchScript * **Added recursive scripting for class type module attributes** ([#55124](https://github.com/pytorch/pytorch/pull/55124)). * This change is BC-breaking because it will result in class type module attributes being scripted when a module instance is scripted. In previous versions, such attributes were ignored unless their class type was also marked with `@torch.jit.script`. This new feature attempts to script the type, and falls back to the old behaviour of marking the class type attribute as "failed" if scripting fails. However, if the class definition does not have type annotations, the definition of the scripted class can different from users might expect (see code sample). If needed, users can explicitly disable the scripting of a class type attribute by adding its name to the `__jit_ignored_attributes__` class attribute of the module being scripted. 1.8.1: ```python class MyClass: def __init__(self, a): self.attr = a class MyModule(torch.nn.Module): def __init__(self): self.attr = MyClass(4) sm = torch.jit.script(MyModule()) ``` 1.9.0: ```python class MyClass: def __init__(self, a): self.attr = a class MyModule(torch.nn.Module): def __init__(self): self.attr = MyClass(4) # RuntimeError: Could not cast attribute 'attr' to type Tensor: Unable to cast Python instance of type to C++ type 'at::Tensor' sm = torch.jit.script(MyModule()) ``` This error occurs because `MyClass` is automatically scripted, but `self.attr` is inferred to be a `Tensor` instead of an `int` because `a` is not annotated. To fix this, annotate `a` with the right type `int`, or mark `attr` as an attribute that should be ignored by the scripting process and not recursively processed: ```python class MyModule(torch.nn.Module): __jit_ignored_attributes__ = ["attr"] def __init__(self): self.attr = MyClass(4) ``` ## Quantization * **`torch.quantization.quantize_fx.convert_fx`’s `debug` argument has been changed to `is_reference` ([#52179](https://github.com/pytorch/pytorch/pull/52179)).**

1.8.1:
```python
>>> import torch.quantization.quantize_fx as quantize_fx
>>> m = quantize_fx.convert_fx(m, debug=True)
(Runs successfully)
```
1.9.0:
```python
>>> m = quantize_fx.convert_fx(m, is_reference=True) # Runs successfully
>>> m = quantize_fx.convert_fx(m, debug=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: convert_fx() got an unexpected keyword argument 'debug'
```

* **`torch.cat` is now quantized to `torch.cat` instead of `torch.ops.quantized.cat` ([#54924](https://github.com/pytorch/pytorch/pull/54924)).** Previously, we produced torch.ops.quantize.cat which took inputs, dequantized them and requantized them with new qparams. This behavior has been changed to produce `torch.cat` directly. [torch.cat](http://torch.cat/) uses the same observer/fake_quant instance for all inputs and output, assumes all inputs are sharing the same qparam, and produces a quantized Tensor with the same qparam as all inputs. Using torch.cat is expected to be more efficient since it does not introduce extra quant/dequant. * Version 1.8.1: `torch.cat` was quantized to `torch.ops.quantized.cat.` * Version 1.9: `torch.cat` is quantized to `torch.cat` (`torch.cat` works on both floating point and quantized Tensor). ## Distributed * **`DistributedDataParallel`: Removed support for inter-process device replication in DDP ([#54454](https://github.com/pytorch/pytorch/pull/54454), [#54825](https://github.com/pytorch/pytorch/pull/54825), [#54826](https://github.com/pytorch/pytorch/pull/54826), [#55212](https://github.com/pytorch/pytorch/pull/55212), [#55253](https://github.com/pytorch/pytorch/pull/55253)`, `[`#55353`](https://github.com/pytorch/pytorch/pull/55353)).** `DistributedDataParallel` now errors out when users attempt to use it in single-process multi-device mode, where a module is replicated across more than one device in a single process. This mode had been previously deprecated and is now removed. Use cases should switch to spawning a single process for each device that is used in replication, which is the performant way to use `DistributedDataParallel` and supports a variety of newly developed features. 1.8.1: ```python >>> # Assume the below is ran on 2 ranks in a distributed setting. >>> rank_to_devices = { 0: [0, 1], 1: [2, 3] } >>> # Each rank replicates model across 2 GPUs. >>> model_ddp = torch.nn.parallel.DistributedDataParallel( model, device_ids=rank_to_devices[rank] ) >>> # No error is raised, but below warning is produced. >>> UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. ``` 1.9.0: ```python >>> # Assume the below is ran on 2 ranks in a distributed setting. >>> rank_to_devices = { 0: [0, 1], 1: [2, 3] } >>> # Each rank replicates model across 2 GPUs. >>> model_ddp = torch.nn.parallel.DistributedDataParallel( model, device_ids=rank_to_devices[rank] ) >>> # Single process multi-GPU mode now produces an error on initialization. >>> ValueError: device_ids can only be None or contain a single element. ``` * **`torch.distributed.elastic`: Replaced `torch.distributed.launch` with `torch.distributed.elastic_launch` ([#56037](https://github.com/pytorch/pytorch/pull/56037)`, `[`#56214`](https://github.com/pytorch/pytorch/pull/56214)).** * --logdir → —log_dir. The stdout and stderr log dir arg name and destination changed. The file destination changed from `$logdir/node_{}_local_rank_{}_stdout` to `$log_dir/$rank/stdout.log`. If users used the `—logdir` introduced in 1.8 pytorch version, they need to use` —log_dir` parameter now. 1.8.1: ```python #!/bin/bash # Assumes training script train.py exists. 
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py # Logs are written to $logdir/node_{}_local_rank_{}_stdout ``` 1.9.0: ```python #!/bin/bash # Assumes training script train.py exists. python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --log_dir test_logdir train.py # Logs are written to $log_dir/$rank/stdout.log ``` # Deprecations ## Python API * **`torch.floor_divide` has been deprecated in favor of `torch.div(..., rounding_mode=‘floor’)` ([#50281](https://github.com/pytorch/pytorch/pull/50281)).** * `torch.floor_divide` incorrectly divides then truncates (rounds towards zero) instead of dividing then flooring (rounds “down”). Use the `rounding_mode` argument of `torch.div` to indicate if you’d like to continue performing truncation division or floor division, instead, since `torch.floor_divide` will be removed in a future PyTorch release. * **Older linear algebra operations have been deprecated in favor of their new linalg module counterparts. Namely:** * `torch.{cholesky, qr, symeig, chain_matmul, solve, eig, matrix_rank, lstsq}` have been deprecated in favor of `torch.linalg.{cholesky, qr, symeig, chain_matmul, solve, eig, matrix_rank, lstsq}` ([#57725,](https://github.com/pytorch/pytorch/pull/57725)[#57745](https://github.com/pytorch/pytorch/pull/57745), [#57732,](https://github.com/pytorch/pytorch/pull/57732)[#53453](https://github.com/pytorch/pytorch/pull/53453), [#57741](https://github.com/pytorch/pytorch/pull/57741), [#57727](https://github.com/pytorch/pytorch/pull/57727), [#57734](https://github.com/pytorch/pytorch/pull/57734), [#57743](https://github.com/pytorch/pytorch/pull/57743)). * `torch.norm` has been deprecated in favor of the new linalg module norm functions: `torch.linalg.vector_norm`, `torch.linalg.matrix_norm`, and `torch.linalg.norm` ([#57986](https://github.com/pytorch/pytorch/pull/57986)). * Aliased `torch.det`, `torch.slogdet`, `torch.matrix_power`, `torch.inverse`, and `torch.pinverse` to their linalg module counterparts ([#57821](https://github.com/pytorch/pytorch/pull/57821)). ## Autograd * **[cpp] Renamed `AutoNonVariableTypeMode` to `AutoDispatchBelowAutograd` and added a warning. ([#56422](https://github.com/pytorch/pytorch/pull/56422))** `AutoNonVariableTypeMode` is deprecated and will be removed in 1.10 release. For kernel implementations, please use `AutoDispatchBelowAutograd` instead. Check out more details on how to migrate your kernel [here](https://pytorch.org/cppdocs/notes/inference_mode.html#migration-guide-from-autononvariabletypemode). If you are looking for a user-facing API to enable running your inference-only workload, please use `c10::InferenceMode`. Using `AutoDispatchBelowAutogradMode` in user code is under risk of producing silently wrong result for some edge cases. 1.8.1: ```cpp { at::AutoNonVariableTypeMode guard(true); } ``` 1.9.0: ``` { c10::AutoDispatchBelowAutograd guard(true); // for kernel implementations // c10::InferenceMode guard(true); --> consider inference mode if you are looking for a user-facing API } ``` * **Removed logic for old style custom autograd `Function` ([#57357](https://github.com/pytorch/pytorch/pull/57357)).** Instantiating a custom autograd function is now deprecated and will raise a warning. Users should call `.apply()` on the class itself because it is a static method. 
1.8.1: ```python # Instantiating custom function will raise a warning in 1.9 Func().apply ``` 1.9.0: ```python # You should directly call the `apply` (classmethod) on the class Func.apply ``` * **Deprecated `get_analytical_jacobian` and `get_numerical_jacobian` ([#54378](https://github.com/pytorch/pytorch/pull/54378), [#54049](https://github.com/pytorch/pytorch/pull/54049)).** `torch.autograd.gradcheck.get_analytical_jacobian` and `torch.autograd.gradcheck.get_numerical_jacobian` are internal-facing functions that are not a part of our public API. We’ve refactored some PyTorch internals to work without it and will remove it in a future release. For gradient checking purposes, please use `torch.autograd.gradcheck`. ## C++ API * **Removed the redundant `linalg_` prefix from `torch::linalg::linalg_det` and `torch::linalg::linalg_norm` C++ API ([#57464](https://github.com/pytorch/pytorch/pull/57464)).** C++ code that used to call `torch::linalg::{linalg_det, linalg_norm}` should be updated to call `torch::linalg::{det, norm}` ## Distributed * **`torch.distributed.rpc`: Added a warning message to retire ProcessGroup RPC backend ([#55616](https://github.com/pytorch/pytorch/pull/55616))** * ProcessGroup RPC backend is being deprecated and 1.9 is the last release which will carry it. The default RPC backend is TensorPipe which is the recommended backend to use over ProcessGroup. # # New features ### Python API * Added BFloat16 support for `torch.{ceil, floor, frac, round, trunc, lerp, roll, diag, logaddexp, logaddexp2, nan_to_num, exp2, expm1, rsqrt, erfc, atan2, hypot}` on CUDA ([#57910](https://github.com/pytorch/pytorch/pull/57910), [#57907](https://github.com/pytorch/pytorch/pull/57907), [#57916](https://github.com/pytorch/pytorch/pull/57916), [#57908](https://github.com/pytorch/pytorch/pull/57908), [#58063](https://github.com/pytorch/pytorch/pull/58063), [#57913](https://github.com/pytorch/pytorch/pull/57913), [#57905](https://github.com/pytorch/pytorch/pull/57905)). * Added `torch.pow()` for `torch.{float16, BFloat16}` on CPU ([#55280](https://github.com/pytorch/pytorch/pull/55280)). * Added `torch.{index_select, argmax, argmin, min, max, amin, amax}` for `torch.{float16, BFloat16}` ([#53898](https://github.com/pytorch/pytorch/pull/53898), [#52582](https://github.com/pytorch/pytorch/pull/52582), [#51244](https://github.com/pytorch/pytorch/pull/51244), [#52579](https://github.com/pytorch/pytorch/pull/52579)). * Added `torch.dot` for `BFloat16` on CUDA ([#57903](https://github.com/pytorch/pytorch/pull/57903)). * Added support for tensor inputs for `min` and `max` arguments in `torch.clamp` ([#52695](https://github.com/pytorch/pytorch/pull/52695), [#56367](https://github.com/pytorch/pytorch/pull/56367)). * Added a new `torch.special` namespace similar to `scipy.special` ([#52296](https://github.com/pytorch/pytorch/pull/52296)). * Added special.{`entr` ([#53500](https://github.com/pytorch/pytorch/pull/53500)), `xlog1py` ([#55138](https://github.com/pytorch/pytorch/pull/55138)), `i0e` ([#54409](https://github.com/pytorch/pytorch/pull/54409)), `erfc`, `erfinv` ([#53260](https://github.com/pytorch/pytorch/pull/53260))}. * Added aliases for `special.{expm1, exp2}` ([#54670](https://github.com/pytorch/pytorch/pull/54670)). * Added aliases for `special.{sigmoid, logit}` ([#54759](https://github.com/pytorch/pytorch/pull/54759)). 
* Added the following new operators in PyTorch similar to those in NumPy: * `torch.gradient` ([#54617](https://github.com/pytorch/pytorch/pull/54617)) * `torch.{hsplit, vsplit, dsplit}` ([#53536](https://github.com/pytorch/pytorch/pull/53536)) * `torch.positive` ([#55891](https://github.com/pytorch/pytorch/pull/55891)) * `torch.frexp` ([#51097](https://github.com/pytorch/pytorch/pull/51097)) * `torch.take_along_dim` ([#52833](https://github.com/pytorch/pytorch/pull/52833)) * Added a new keyword argument `alpha` to `torch.index_add` ([#54176](https://github.com/pytorch/pytorch/pull/54176)). * Added `torch.assert_async` ([#53086](https://github.com/pytorch/pytorch/pull/53086)) * Added a new keyword argument `interpolation` to `torch.quantile` ([#49267](https://github.com/pytorch/pytorch/pull/49267)). * Add correction parameter to std/var ([#50903](https://github.com/pytorch/pytorch/pull/50903)) * Added overloads for `torch.{std, var, std_mean, var_mean}` with a correction argument specifying the difference between the sample size and number of degrees of freedom. * Add support for integer type for `torch.`{`logit, rad2deg, deg2rad, polygamma}` ([#52028](https://github.com/pytorch/pytorch/pull/52028), [#51853,](https://github.com/pytorch/pytorch/pull/51853)[#57462](https://github.com/pytorch/pytorch/pull/57462)) * Added support for stable sort algorithm on CPU by a new kwarg `stable` ([#51790](https://github.com/pytorch/pytorch/pull/51790)). * The `torch.linalg` module, analogous to NumPy’s linalg module but with several additional functions, is stable! Added `torch.linalg.{multi_dot, lstsq, vector_norm, matrix_norm, matrix_power, det, eig, eigvals, svdvals, cholesky_ex, inv_ex}` ([#51807](https://github.com/pytorch/pytorch/pull/51807), [#49093](https://github.com/pytorch/pytorch/pull/49093), [#51099](https://github.com/pytorch/pytorch/pull/51099), [#57127](https://github.com/pytorch/pytorch/pull/57127), [#52608](https://github.com/pytorch/pytorch/pull/52608), [#53119](https://github.com/pytorch/pytorch/pull/53119), [#52491](https://github.com/pytorch/pytorch/pull/52491), [#56684](https://github.com/pytorch/pytorch/pull/56684), [#56724](https://github.com/pytorch/pytorch/pull/56724), [#58039](https://github.com/pytorch/pytorch/pull/58039)). * Added a new `device=meta` API ([#53143](https://github.com/pytorch/pytorch/pull/53143)) * “meta” is a new device, like CPU/CUDA, that doesn’t allocate any memory for data. Operators that are passed meta tensor inputs will perform shape inference, without running the actually kernel computation. For example, `torch.ones(2, device='meta') + torch.ones(1, 2, device='meta')` will return a new meta tensor of size `[1, 2]` (performing broadcasting), without allocating memory or running an actual kernel. 
* `device=meta` API is implemented for `upsample_linear1d`([#51917](https://github.com/pytorch/pytorch/pull/51917)), `upsample_bilinear2d` and `upsample_bicubic2d` ([#52012](https://github.com/pytorch/pytorch/pull/52012)), `upsample_nearest3d` ([#52065](https://github.com/pytorch/pytorch/pull/52065)), `sin`([#52277](https://github.com/pytorch/pytorch/pull/52277)), `mul`([#52692](https://github.com/pytorch/pytorch/pull/52692)), `pow`([#53669](https://github.com/pytorch/pytorch/pull/53669)), `sub`([#53679](https://github.com/pytorch/pytorch/pull/53679)), `div`([#53680](https://github.com/pytorch/pytorch/pull/53680)), `copysign`([#55040](https://github.com/pytorch/pytorch/pull/55040)), `atan2`([#55130](https://github.com/pytorch/pytorch/pull/55130)), `sinh`([#55538](https://github.com/pytorch/pytorch/pull/55538)), `acosh`([#55540](https://github.com/pytorch/pytorch/pull/55540)), `cosh`([#55563](https://github.com/pytorch/pytorch/pull/55563)), `cos` ([#55564](https://github.com/pytorch/pytorch/pull/55564)), `replication_padding1d` ([#55481](https://github.com/pytorch/pytorch/pull/55481)), `replication_padding3d` ([#55499](https://github.com/pytorch/pytorch/pull/55499)), `replication_pad1d_backward` ([#55537](https://github.com/pytorch/pytorch/pull/55537)), `fractional_max_pool2d` ([#55581](https://github.com/pytorch/pytorch/pull/55581)), `reflection_pad1d` ([#55531](https://github.com/pytorch/pytorch/pull/55531)), `replication_pad2d` ([#55511](https://github.com/pytorch/pytorch/pull/55511)), `addmv` ([#55746](https://github.com/pytorch/pytorch/pull/55746)), all unary float functions ([#56082](https://github.com/pytorch/pytorch/pull/56082)), `adaptive_max_pool2d`([#56317](https://github.com/pytorch/pytorch/pull/56317)), `adaptive_max_pool3d` ([#56320](https://github.com/pytorch/pytorch/pull/56320)), all non-float unary operators (and `rsqrt`) ([#56151](https://github.com/pytorch/pytorch/pull/56151)), `adaptive_max_pool2d_backward` ([#56799](https://github.com/pytorch/pytorch/pull/56799)), `adaptive_max_pool3d_backward` ([#56800](https://github.com/pytorch/pytorch/pull/56800)), `neg`([#57212](https://github.com/pytorch/pytorch/pull/57212)), `max_pool2d_with_indices`([#56459](https://github.com/pytorch/pytorch/pull/56459)), `trunc` ([#57350](https://github.com/pytorch/pytorch/pull/57350)), `floor` ([#57587](https://github.com/pytorch/pytorch/pull/57587)), `sign` ([#57588](https://github.com/pytorch/pytorch/pull/57588)), `ceil` ([#57589](https://github.com/pytorch/pytorch/pull/57589)), `gcd` ([#57624](https://github.com/pytorch/pytorch/pull/57624)), `nextafter` ([#57625](https://github.com/pytorch/pytorch/pull/57625)), `igamma` and `igammac`([#57626](https://github.com/pytorch/pytorch/pull/57626)), `hypot`([#57627](https://github.com/pytorch/pytorch/pull/57627)), `lcm` ([#57628](https://github.com/pytorch/pytorch/pull/57628)), `logaddexp` and `logaddexp2` ([#57629](https://github.com/pytorch/pytorch/pull/57629)), `maximum` and `minimum` ([#57630](https://github.com/pytorch/pytorch/pull/57630)), `topk` ([#57790](https://github.com/pytorch/pytorch/pull/57790)), `max_pool2d_with_indices_backward` ([#57797](https://github.com/pytorch/pytorch/pull/57797)), `threshold` ([#57810](https://github.com/pytorch/pytorch/pull/57810)), `addmm` ([#57417](https://github.com/pytorch/pytorch/pull/57417)), `heaviside` ([#57933](https://github.com/pytorch/pytorch/pull/57933)), `elu`([#57619](https://github.com/pytorch/pytorch/pull/57619)), `softplus` ([#57620](https://github.com/pytorch/pytorch/pull/57620)), 
`leaky_relu` ([#57621](https://github.com/pytorch/pytorch/pull/57621)), `hardsigmoid` ([#57622](https://github.com/pytorch/pytorch/pull/57622)), `softshrink` ([#57623](https://github.com/pytorch/pytorch/pull/57623)), `silu` ([#58050](https://github.com/pytorch/pytorch/pull/58050)), `empty_strided` ([#53397](https://github.com/pytorch/pytorch/pull/53397)), non-composite in-place operators ([#54901](https://github.com/pytorch/pytorch/pull/54901)) ### Complex Numbers * Added complex autograd support for `torch.{masked_fill, polar, cumsum, lerp, prod, rsub, unfold, symeig, index_copy}` ([#52483](https://github.com/pytorch/pytorch/pull/52483), [#52488](https://github.com/pytorch/pytorch/pull/52488), [#53240](https://github.com/pytorch/pytorch/pull/53240), [#53689](https://github.com/pytorch/pytorch/pull/53689), [#48125](https://github.com/pytorch/pytorch/pull/48125), [#53702](https://github.com/pytorch/pytorch/pull/53702), [#52999](https://github.com/pytorch/pytorch/pull/52999), [#55085](https://github.com/pytorch/pytorch/pull/55085), [#52203](https://github.com/pytorch/pytorch/pull/52203)). * Added complex support for torch.lerp ([#54129](https://github.com/pytorch/pytorch/pull/54129)) and torch.sigmoid ([#55975](https://github.com/pytorch/pytorch/pull/55975)) on CUDA. * Added complex support for `torch.index_copy` and `torch.{take}` and `torch.Tensor.put_` on both CPU and CUDA ([#52203](https://github.com/pytorch/pytorch/pull/52203), [#53356](https://github.com/pytorch/pytorch/pull/53356)). * Added complex support to TorchScript. * Added logic to teach TorchScript frontend to parse complex literals, and complex lists. ([#52881](https://github.com/pytorch/pytorch/pull/52881)). * Added TorchScript support for: * complex constructor and `torch.{add, mul, sub, as_tensor}` ([#52881](https://github.com/pytorch/pytorch/pull/52881)). * `cmath` unary ops: `cmath.{phase, log, log10, sqrt, exp, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, asinh, acosh, atanh}` ([#54089](https://github.com/pytorch/pytorch/pull/54089)). * `cmath.`{`infj, nanj}` ([#54328](https://github.com/pytorch/pytorch/pull/54328)). * `cmath.{isinf, isnan, isfinite, rect}` ([#54541](https://github.com/pytorch/pytorch/pull/54541)). * real and imag tensor attributes (`tensor.real/imag`) ([#54692](https://github.com/pytorch/pytorch/pull/54692)). * Fixed `test_variant_consistency_jit_addmm` for complex types ([#54917](https://github.com/pytorch/pytorch/pull/54917), [#57129](https://github.com/pytorch/pytorch/pull/57129)). * Added initial operator support for sparse complex tensors ([#57125](https://github.com/pytorch/pytorch/pull/57125)). * Added complex support for `torch.{sparse_coo_tensor, coalesce, to_dense, to_sparse, sparse_add, sspaddmm, saddmm}`. * Added `torch.Tensor.{cfloat, cdouble}` functions ([#58137](https://github.com/pytorch/pytorch/pull/58137)). * Added complex support for all reductions for `torch.{std, var}` to return a real valued output tensor for complex inputs ([#58066](https://github.com/pytorch/pytorch/pull/58066)) . * Updated autograd formulas for many linear algebra operations support complex tensors: * `eig`: faster and with complex support ([#52875](https://github.com/pytorch/pytorch/pull/52875)) * `lu`: more numerically stable and with complex support. 
### torch.nn

* New `torch.nn` modules: `nn.LazyBatchNorm*d` ([#51862](https://github.com/pytorch/pytorch/pull/51862)), `nn.HuberLoss` ([#50553](https://github.com/pytorch/pytorch/pull/50553)), `nn.Mish` ([#58375](https://github.com/pytorch/pytorch/issues/58375)).
* New parametrization functionality ([#33344](https://github.com/pytorch/pytorch/pull/33344), [#58142](https://github.com/pytorch/pytorch/pull/58142), [#55456](https://github.com/pytorch/pytorch/pull/55456), [#57784](https://github.com/pytorch/pytorch/pull/57784)).
* `nn.Conv*d`: Added `padding='same'` mode for non-strided convolutions ([#45667](https://github.com/pytorch/pytorch/pull/45667)).
* `nn.EmbeddingBag`: Added `padding_idx` support ([#49237](https://github.com/pytorch/pytorch/pull/49237), [#56065](https://github.com/pytorch/pytorch/pull/56065), [#56618](https://github.com/pytorch/pytorch/pull/56618)).
* Added mish activation function ([#58648](https://github.com/pytorch/pytorch/pull/58648)).
* [memory format] Added channels last support for `MaxPool2d` ([#56361](https://github.com/pytorch/pytorch/pull/56361)).
* Added the option to build PyTorch with DNNL + AMD BLIS path ([#54953](https://github.com/pytorch/pytorch/pull/54953)).

### Profiler

* Added `skip_first` parameter to the default schedule ([#58025](https://github.com/pytorch/pytorch/pull/58025)).
* Added support for trace metadata ([#56575](https://github.com/pytorch/pytorch/pull/56575)).
* Added `gzip` format support for chrome tracing ([#56554](https://github.com/pytorch/pytorch/pull/56554)).
* Added `sequenceNr` and `fwdThreadId` to the trace ([#57182](https://github.com/pytorch/pytorch/pull/57182)).
* Enabled Kineto in CPU builds ([#53174](https://github.com/pytorch/pytorch/pull/53174)).

### Autograd

* Added new inference mode both in C++ ([#54403](https://github.com/pytorch/pytorch/pull/54403), [#53343](https://github.com/pytorch/pytorch/pull/53343)) and Python ([#58045](https://github.com/pytorch/pytorch/pull/58045), [#57480](https://github.com/pytorch/pytorch/pull/57480)).
* Added `fast_mode` argument to `autograd.gradcheck` ([#54480](https://github.com/pytorch/pytorch/pull/54480)).
* Added support for non-Tensor inputs and outputs to `torch.utils.checkpoint` functions ([#52422](https://github.com/pytorch/pytorch/pull/52422)).

### Dataloader

* Implemented `FilterIterDataPipe` ([#51783](https://github.com/pytorch/pytorch/pull/51783)).
* Added context manager for runtime type validation ([#55936](https://github.com/pytorch/pytorch/pull/55936)).
* Added typing enforcement for `DataPipe` at construct-time ([#54066](https://github.com/pytorch/pytorch/pull/54066)).
* Added typing enforcement for `DataPipe` at runtime ([#54544](https://github.com/pytorch/pytorch/pull/54544)).
* Implemented `issubtype` for `DataLoader` type hints ([#54299](https://github.com/pytorch/pytorch/pull/54299)).
* Added type hint for SequentialSampler ([#56374](https://github.com/pytorch/pytorch/pull/56374)).
* Added `ConcatDataPipe` ([#53301](https://github.com/pytorch/pytorch/pull/53301)).
* Introduced deterministic context to `DataLoader` ([#53271](https://github.com/pytorch/pytorch/pull/53271)).
* Added `ZipIterDataPipe` ([#53554](https://github.com/pytorch/pytorch/pull/53554)).
* Added a switch to guarantee determinism and an option for non_deterministic ([#53532](https://github.com/pytorch/pytorch/pull/53532)).
* Added `TransformsIterDataPipe` ([#52604](https://github.com/pytorch/pytorch/pull/52604)). * Renamed Callable to `MapIterDataPipe` ([#51879](https://github.com/pytorch/pytorch/pull/51879)). ### CUDA * Added the following new features to CUDA Graphs: * Private mempools ([#54038](https://github.com/pytorch/pytorch/pull/54038)) * Support for RNN capture when cuDNN dropout is used ([#57373](https://github.com/pytorch/pytorch/pull/57373)). * Added support for `'max'` reduction for `torch.segment_reduce` ([#56704](https://github.com/pytorch/pytorch/pull/56704)). * Added support for CUDA allocator to handle multiple streams seamlessly ([#55860](https://github.com/pytorch/pytorch/pull/55860)). ### C++ API * Added `torch::nn::functional::huber_loss` ([#50553](https://github.com/pytorch/pytorch/pull/50553)). * Added learning rate schedulers to C++ API ([#52268](https://github.com/pytorch/pytorch/pull/52268)). * Added `padding='same'` mode to `torch::conv{1,2,3}d` ([#45667](https://github.com/pytorch/pytorch/pull/45667)). * Added `padding_idx` argument to `EmbeddingBag` ([#49237](https://github.com/pytorch/pytorch/pull/49237)). * Added mish activation function ([#58648](https://github.com/pytorch/pytorch/pull/58648)) ([#58940](https://github.com/pytorch/pytorch/pull/58940)). ### TorchScript * Added reductions to NNC python bindings ([#52492](https://github.com/pytorch/pytorch/pull/52492)). * Added Python bindings for ExternalCalls. ([#52905](https://github.com/pytorch/pytorch/pull/52905)). * Added an API to reorder multiple loops ([#55568](https://github.com/pytorch/pytorch/pull/55568)). * Added NNC support for `pow` on CPU ([#56308](https://github.com/pytorch/pytorch/pull/56308)). * Enabled horizontal fusion of all loops ([#56324](https://github.com/pytorch/pytorch/pull/56324)). * Added an API for Buffer Compression ([#55853](https://github.com/pytorch/pytorch/pull/55853)). * Added API to distribute loops ([#53865](https://github.com/pytorch/pytorch/pull/53865)). * Added `matmul` for NNC lowering/unified dtypes ([#56456](https://github.com/pytorch/pytorch/pull/56456)). * Added a method to compute `conv` without bias ([#57512](https://github.com/pytorch/pytorch/pull/57512)). * Added support for computing `conv` with dynamic shapes ([#57514](https://github.com/pytorch/pytorch/pull/57514)). * Added NNC lowerings for `t`/`transpose`/`permute`/`expand` ([#57426](https://github.com/pytorch/pytorch/pull/57426)). * Updated external functions for mobile build ([#56850](https://github.com/pytorch/pytorch/pull/56850)). * Added `GELU` To NNC ([#57753](https://github.com/pytorch/pytorch/pull/57753)). * Implemented `GELU` Backward ([#58249](https://github.com/pytorch/pytorch/pull/58249)). * Added a mobile NNC backend skeleton ([#56852](https://github.com/pytorch/pytorch/pull/56852)). * Added support for `torch.type` ([#51904](https://github.com/pytorch/pytorch/pull/51904)) * Added `dict()` constructor ([#51934](https://github.com/pytorch/pytorch/pull/51934)). * Added a new `torch::deploy` to manage multiple python interpreters in a single process to deploy PyTorch models packaged with torch.package ([#51754](https://github.com/pytorch/pytorch/pull/51754)). * Reintroduced static dispatch ([#51957](https://github.com/pytorch/pytorch/pull/51957)). * Added TS support for `torch.any` ([#52360](https://github.com/pytorch/pytorch/pull/52360)). * Added a demo backend with compiler ([#52603](https://github.com/pytorch/pytorch/pull/52603)). * Added MKLDNN fuser ([#51600](https://github.com/pytorch/pytorch/pull/51600)). 
* Added a context manager for hiding source ranges ([#53188](https://github.com/pytorch/pytorch/pull/53188)). * Implemented `embedding_bag` for SR ([#52429](https://github.com/pytorch/pytorch/pull/52429)). * Allowed the use of `AliasDb` in Python ([#51336](https://github.com/pytorch/pytorch/pull/51336)). * Added support for `DictConstruct` ([#54438](https://github.com/pytorch/pytorch/pull/54438)) * Added `sliceHead`/`sliceTail` APIs with short parameter list ([#55115](https://github.com/pytorch/pytorch/pull/55115)). * Added logic to infer argument types in TorchScript ([#56832](https://github.com/pytorch/pytorch/pull/56832)). * Added support for custom Python classes in `CUDAFuture` ([#56516](https://github.com/pytorch/pytorch/pull/56516)). * Added a concat optimization pass ([#55474](https://github.com/pytorch/pytorch/pull/55474)). * Added initial support for PEP-585 types ([#57363](https://github.com/pytorch/pytorch/pull/57363)). * Added logic to infer types for arguments of methods not invoked directly by `MonkeyType` ([#57202](https://github.com/pytorch/pytorch/pull/57202)). * Added support for `torch.jit.ignore` as a context manager ([#55172](https://github.com/pytorch/pytorch/pull/55172)). * Implemented `hardswish`/`hardsigmoid `on MKLDNN tensors ([#55218](https://github.com/pytorch/pytorch/pull/55218)). * Added `model_dump` tool for model inspection ([#56868](https://github.com/pytorch/pytorch/pull/56868)) * Added static method support for TorchBind ([#51177](https://github.com/pytorch/pytorch/pull/51177)) * Added TS support for `pow` ([#52374](https://github.com/pytorch/pytorch/pull/52374)) * Added support for default argument values to `TorchBind` ([#51253](https://github.com/pytorch/pytorch/pull/51253)). * Added support for AST rewriting for submodules ([#52297](https://github.com/pytorch/pytorch/pull/52297)). * Added `optimize_for_inference` API ([#58193](https://github.com/pytorch/pytorch/pull/58193)). * Registered `aten::index_out` ([#51742](https://github.com/pytorch/pytorch/pull/51742)). * Added `PYTORCH_TENSOREXPR_DONT_FUSE` env variable to disable fusion on specified operators ([#55650](https://github.com/pytorch/pytorch/pull/55650)). ### torch.package * Allow TorchScript models to be contained in the package format ([#54891,](https://github.com/pytorch/pytorch/pull/54891)[#56299,](https://github.com/pytorch/pytorch/pull/56299)[#54893](https://github.com/pytorch/pytorch/pull/54893), [#57573](https://github.com/pytorch/pytorch/pull/57573), [#54894](https://github.com/pytorch/pytorch/pull/54894), [#57678](https://github.com/pytorch/pytorch/pull/57678)). ### Mobile * Added 8x1 block sparse kernels for ARM and AArch64 ([#51118](https://github.com/pytorch/pytorch/pull/51118), [#51119](https://github.com/pytorch/pytorch/pull/51119), [#51120](https://github.com/pytorch/pytorch/pull/51120)). * Made NNAPI converter handle binary ops combining NHWC+NCHW in some cases ([#48812](https://github.com/pytorch/pytorch/pull/48812)). * Improved support for multiple inputs and outputs in NNAPI ([#54697](https://github.com/pytorch/pytorch/pull/54697)). * Added flexible size support for NNAPI ([#54701](https://github.com/pytorch/pytorch/pull/54701)). 
* Added new ops for Metal (concat, mul/sub/div, transpose, view, reshape, mean, chunk, reflection_pad2d) ( [#53950](https://github.com/pytorch/pytorch/pull/53950), [#54107](https://github.com/pytorch/pytorch/pull/54107), [#54522](https://github.com/pytorch/pytorch/pull/54522), [#56073](https://github.com/pytorch/pytorch/pull/56073), [#56074](https://github.com/pytorch/pytorch/pull/56074), [#58263](https://github.com/pytorch/pytorch/pull/58263)). * Added python binding to use mobile cpu allocator ([#52323](https://github.com/pytorch/pytorch/pull/52323)). * Added lightweight RandomSampler for mobile ([#58201](https://github.com/pytorch/pytorch/pull/58201)). * Added support for: * new ops to NNAPI converter (size, unsqueeze, cat, mean) ([#52026](https://github.com/pytorch/pytorch/pull/52026), [#48811](https://github.com/pytorch/pytorch/pull/48811)). * multi-dimension tensors in Metal via MPSImage ([#54106](https://github.com/pytorch/pytorch/pull/54106)). * multiple output tensors in Metal ([#56072](https://github.com/pytorch/pytorch/pull/56072)). * methods other than forward in optimize_for_mobile ([#53314](https://github.com/pytorch/pytorch/pull/53314)). * ChannelsLast in TensorImageUtils on Android ([#48990](https://github.com/pytorch/pytorch/pull/48990)). * loading “extra files” in Java/Android ([#55644](https://github.com/pytorch/pytorch/pull/55644)). * loading “extra files” in Lite interpreter ([#52635](https://github.com/pytorch/pytorch/pull/52635)). * querying bytecode version in Lite interpreter and bytecode models ([#56948](https://github.com/pytorch/pytorch/pull/56948), [#56948](https://github.com/pytorch/pytorch/pull/56948)). * exporting some older bytecode versions for Lite interpreter ([#56802](https://github.com/pytorch/pytorch/pull/56802)). * querying available ops ([#57570](https://github.com/pytorch/pytorch/pull/57570)). * Added SqueezeNet to PyTorch Playground (71d0b5632b). * Added libtorch lite build ([#51419](https://github.com/pytorch/pytorch/pull/51419)). ### Distributed * `torch.distributed.Store` * Added `compare_set` op ([#51815](https://github.com/pytorch/pytorch/pull/51815)). * Added new `watchKey` method to register callbacks on a key ([#56217](https://github.com/pytorch/pytorch/pull/56217)). * `torch.distributed.rpc` * Allowed passing `cpu` to CUDA RPC device maps ([#57019](https://github.com/pytorch/pytorch/pull/57019)). 
* Add a new `devices` argument to TensorPipe options to specify the set of devices for TensorPipe ([#56405](https://github.com/pytorch/pytorch/pull/56405))
* `DistributedDataParallel`
    * Added a flag to the DDP `join` context manager that enables throwing an error across all ranks when this flag is specified ([#56755](https://github.com/pytorch/pytorch/pull/56755))
    * Enable static graph training in DDP ([#55248](https://github.com/pytorch/pytorch/pull/55248), [#54995](https://github.com/pytorch/pytorch/pull/54995))
    * Log unused parameter names in DDP when crashing due to unused parameters ([#55075](https://github.com/pytorch/pytorch/pull/55075))
    * Introduce `torch.distributed.algorithms.default_hooks.fp16_compress_wrapper` wrapper that can be combined with other communication hooks ([#53808](https://github.com/pytorch/pytorch/pull/53808))
    * Support loading a non-DP/DDP model from a DP/DDP state_dict ([#53224](https://github.com/pytorch/pytorch/pull/53224))
    * Enhanced logging in DDP for performance metrics ([#52957](https://github.com/pytorch/pytorch/pull/52957), [#53145](https://github.com/pytorch/pytorch/pull/53145), [#54647](https://github.com/pytorch/pytorch/pull/54647))
* `torch.distributed`
    * Support `work.result` API for MPI backend ([#57168](https://github.com/pytorch/pytorch/pull/57168))
    * Support `work.result` for `ProcessGroupGloo::AsyncWork` objects ([#57565](https://github.com/pytorch/pytorch/pull/57565))
    * Support `work.get_future()` API for ProcessGroupMPI and ProcessGroupGloo ([#57818](https://github.com/pytorch/pytorch/pull/57818), [#57214](https://github.com/pytorch/pytorch/pull/57214))
    * New `torch.distributed.monitored_barrier` API (Gloo-only); a usage sketch follows this section ([#53773](https://github.com/pytorch/pytorch/pull/53773), [#53787](https://github.com/pytorch/pytorch/pull/53787), [#55009](https://github.com/pytorch/pytorch/pull/55009), [#55010](https://github.com/pytorch/pytorch/pull/55010), [#55197](https://github.com/pytorch/pytorch/pull/55197), [#55265](https://github.com/pytorch/pytorch/pull/55265), [#55989](https://github.com/pytorch/pytorch/pull/55989), [#55990](https://github.com/pytorch/pytorch/pull/55990))
    * Allow passing `options` field to process group initialization APIs ([#53662](https://github.com/pytorch/pytorch/pull/53662), [#54090](https://github.com/pytorch/pytorch/pull/54090), [#53663](https://github.com/pytorch/pytorch/pull/53663))
    * Enable profiling for distributed collectives ([#51822](https://github.com/pytorch/pytorch/pull/51822), [#52004](https://github.com/pytorch/pytorch/pull/52004), [#52031](https://github.com/pytorch/pytorch/pull/52031), [#52949](https://github.com/pytorch/pytorch/pull/52949), [#55204](https://github.com/pytorch/pytorch/pull/55204), [#56412](https://github.com/pytorch/pytorch/pull/56412), [#56216](https://github.com/pytorch/pytorch/pull/56216), [#56427](https://github.com/pytorch/pytorch/pull/56427))
    * Allow user to specify `TORCH_DISTRIBUTED_DEBUG` environment variable ([#52481](https://github.com/pytorch/pytorch/pull/52481))
    * Added `compareSet` method for `torch.distributed.{HashStore, FileStore}` ([#53803](https://github.com/pytorch/pytorch/pull/53803)).
* Added new `torch.distributed.elastic` module that upstreams `pytorch/elastic`
    * Introduce RendezvousSettings ([#56537](https://github.com/pytorch/pytorch/pull/56537))
    * Introduce a new from_backend static constructor for DynamicRendezvousHandler ([#57150](https://github.com/pytorch/pytorch/pull/57150))
    * Introduce the implementation of DynamicRendezvousHandler ([#57151](https://github.com/pytorch/pytorch/pull/57151))
    * Add support for the new error file format ([#57084](https://github.com/pytorch/pytorch/pull/57084))
    * Introduce the delay utility function ([#56533](https://github.com/pytorch/pytorch/pull/56533))
    * Make torchelastic launcher compatible with caffe2.distributed.launch ([#55687](https://github.com/pytorch/pytorch/pull/55687))
    * Introduce `PeriodicTimer` ([#55919](https://github.com/pytorch/pytorch/pull/55919))
    * Introduce `DynamicRendezvousHandler` and `RendezvousBackend` ([#55635](https://github.com/pytorch/pytorch/pull/55635))
    * Introduce `C10dRendezvousBackend` ([#55636](https://github.com/pytorch/pytorch/pull/55636))
    * Introduce `EtcdRendezvousBackend` ([#55637](https://github.com/pytorch/pytorch/pull/55637))
    * Added `torch.distributed.elastic.launchers.api`, `torch.distributed.elastic.metrics`, `torch.distributed.events`, `torch.distributed.rendezvous`, `torch.distributed.elastic.agent` modules ([#55471](https://github.com/pytorch/pytorch/pull/55471), [#53870](https://github.com/pytorch/pytorch/pull/53870), [#53574](https://github.com/pytorch/pytorch/pull/53574), [#53760](https://github.com/pytorch/pytorch/pull/53760), [#53172](https://github.com/pytorch/pytorch/pull/53172), [#54343](https://github.com/pytorch/pytorch/pull/54343))
    * Upstreamed timer and multiprocessing classes to `torch.distributed.elastic.timer` and `torch.distributed.elastic.multiprocessing` ([#53574](https://github.com/pytorch/pytorch/pull/53574))
* `torch.distributed.nn.RemoteModule`: Enable RemoteModule to directly send GPU tensors over the wire on TensorPipe RPC backend if a device map is provided ([#57288](https://github.com/pytorch/pytorch/pull/57288))
* `torch.distributed.optim`:
    * Allow `torch.optim.Adamax` to be used as a TorchScript functional optimizer in RPC ([#55833](https://github.com/pytorch/pytorch/pull/55833))
    * Allow `torch.optim.Rprop` to be used as a TorchScript functional optimizer in RPC ([#55834](https://github.com/pytorch/pytorch/pull/55834))
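A minimal sketch of `torch.distributed.monitored_barrier` (Gloo only), assuming each process is launched with the usual `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` environment variables (for example via the torchelastic launcher):

```python
import datetime
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # env:// init; needs RANK/WORLD_SIZE set

# Unlike barrier(), a timed-out rank raises an error that names the ranks
# which failed to reach the barrier, which helps debug stuck collectives.
dist.monitored_barrier(timeout=datetime.timedelta(seconds=30))

dist.destroy_process_group()
```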
### torch.fx

* Added `torch.fx.Node.format_node()` ([#51737](https://github.com/pytorch/pytorch/pull/51737)).
* Added a `Transformer` to normalize args/kwargs of `torch.nn.functional` calls into only kwargs ([#51816](https://github.com/pytorch/pytorch/pull/51816)).
* Added submodule manipulation APIs on `GraphModule` ([#52358](https://github.com/pytorch/pytorch/pull/52358)).
* Added `Graph.eliminate_dead_code` ([#52658](https://github.com/pytorch/pytorch/pull/52658)); see the sketch after this list.
* Added a function to retrieve `inspect.Signature` instances for PyTorch operations ([#53830](https://github.com/pytorch/pytorch/pull/53830)).
* Experimental type annotation pass using Python signatures ([#53831](https://github.com/pytorch/pytorch/pull/53831)).
* Added a transformer to normalize `torch` namespace operations ([#53832](https://github.com/pytorch/pytorch/pull/53832)).
* Extended `NormalizeArgs` to work on `torch` namespace operations ([#54236](https://github.com/pytorch/pytorch/pull/54236)).
* Added FX `optimize_for_inference` for Intel CPUs ([#53805](https://github.com/pytorch/pytorch/pull/53805), [#58293](https://github.com/pytorch/pytorch/pull/58293)).
* Added a metadata dict to `Node` and switched shape-prop to use it ([#54926](https://github.com/pytorch/pytorch/pull/54926)).
* Added C-level monkey patching of `torch.randn` to capture it during tracing ([#54060](https://github.com/pytorch/pytorch/pull/54060)).
* Added a new API `replace_input_with` to `Node` ([#55887](https://github.com/pytorch/pytorch/pull/55887)).
* Added net splitter and net minimizer utilities ([#56201](https://github.com/pytorch/pytorch/pull/56201)).
* Added PyTree support to FX through `concrete_args` ([#55888](https://github.com/pytorch/pytorch/pull/55888)).
* Added support for proxy-able classes ([#56737](https://github.com/pytorch/pytorch/pull/56737)).
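A small sketch combining two of the additions above, `Graph.eliminate_dead_code` and `Node.format_node()`; the module below is made up for illustration:

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        unused = x + 1            # creates a node with no users
        return torch.relu(x) * 2

gm = fx.symbolic_trace(M())
gm.graph.eliminate_dead_code()    # drops the unused add node
gm.recompile()

for node in gm.graph.nodes:
    print(node.format_node())     # one-line, per-node summary
```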
### ONNX

* Support onnxifi interface for set/get options ([#52388](https://github.com/pytorch/pytorch/pull/52388)).
* Support `--onnxifi_min_ops` in AOT flow ([#52380](https://github.com/pytorch/pytorch/pull/52380)).
* Redesign onnx pass to enable shape type dependent pattern conversion - cont ([#51795](https://github.com/pytorch/pytorch/pull/51795)) ([#53304](https://github.com/pytorch/pytorch/pull/53304)).
* Support inplace operations on inplace indexing ([#52063](https://github.com/pytorch/pytorch/pull/52063)) ([#53306](https://github.com/pytorch/pytorch/pull/53306)).
* Added symbolic shape inference ([#51481](https://github.com/pytorch/pytorch/pull/51481)) ([#53307](https://github.com/pytorch/pytorch/pull/53307)).
* Support repeat_interleave symbolic ([#52855](https://github.com/pytorch/pytorch/pull/52855)) ([#53312](https://github.com/pytorch/pytorch/pull/53312)).
* Support primitive type input/outputs and attributes ([#53550](https://github.com/pytorch/pytorch/pull/53550)) ([#54864](https://github.com/pytorch/pytorch/pull/54864)).
* Support `outer` export to ONNX ([#53603](https://github.com/pytorch/pytorch/pull/53603)) ([#54869](https://github.com/pytorch/pytorch/pull/54869)).
* Support hardsigmoid symbolic in opset 9 (#49649) ([#54193](https://github.com/pytorch/pytorch/pull/54193)).
* Support for the `hann_window` operator ([#54587](https://github.com/pytorch/pytorch/pull/54587)) ([#56163](https://github.com/pytorch/pytorch/pull/56163)).
* Enable tensordot symbolic function ([#55654](https://github.com/pytorch/pytorch/pull/55654)) ([#56166](https://github.com/pytorch/pytorch/pull/56166)).
* Support for prim::min ([#55259](https://github.com/pytorch/pytorch/pull/55259)) ([#56168](https://github.com/pytorch/pytorch/pull/56168)).
* Support mv op ([#55470](https://github.com/pytorch/pytorch/pull/55470)) ([#56169](https://github.com/pytorch/pytorch/pull/56169)).
* Support .item() export & NumberType to tensor conversion ([#55697](https://github.com/pytorch/pytorch/pull/55697)) ([#57594](https://github.com/pytorch/pytorch/pull/57594)).
* Support a new operator for the fill_() function ([#56859](https://github.com/pytorch/pytorch/pull/56859)) ([#57596](https://github.com/pytorch/pytorch/pull/57596)).
* Support index_add_ function ([#56867](https://github.com/pytorch/pytorch/pull/56867)) ([#57830](https://github.com/pytorch/pytorch/pull/57830)).
* Support tensor.to(device) ([#56857](https://github.com/pytorch/pytorch/pull/56857)) ([#57599](https://github.com/pytorch/pytorch/pull/57599)).
* Support registering custom export for prim::PythonOp from torch.autograd.Function ([#55630](https://github.com/pytorch/pytorch/pull/55630)) ([#57600](https://github.com/pytorch/pytorch/pull/57600)).

### Vulkan

* Added the `hardswish` and `hardsigmoid` activation functions ([#53362](https://github.com/pytorch/pytorch/pull/53362)).
* Added the `reflection_pad2d` op ([#53604](https://github.com/pytorch/pytorch/pull/53604)).
* Added an implementation of Winograd convolutions ([#54639](https://github.com/pytorch/pytorch/pull/54639)).
* Added the `sigmoid` activation function ([#57867](https://github.com/pytorch/pytorch/pull/57867)).

### Misc

* Android packages are now published to Maven Central ([#53568](https://github.com/pytorch/pytorch/pull/53568)).
* Kineto is now supported on Windows ([#56323](https://github.com/pytorch/pytorch/pull/56323)).
* Added a Gloo `TCP_TLS` transport ([#56442](https://github.com/pytorch/pytorch/pull/56442)).
* Added the ability to collect minidumps after a crash ([#59236](https://github.com/pytorch/pytorch/pull/59236)).

# Improvements

### Python API

* Added nondeterministic alert for `index_put_` when `accumulate=False` ([#55827](https://github.com/pytorch/pytorch/pull/55827)).
* Added deterministic path for `torch.index_add` on CUDA ([#56521](https://github.com/pytorch/pytorch/pull/56521)).
* Added deterministic path for `torch.index_copy` on CPU ([#56900](https://github.com/pytorch/pytorch/pull/56900)).
* Removed beta warning for `use_deterministic_algorithms` ([#58074](https://github.com/pytorch/pytorch/pull/58074)).
* Updated `torch.Tensor.unflatten` to be able to infer size value in `sizes` from -1 ([#51955](https://github.com/pytorch/pytorch/pull/51955)).
* Added a safe cast and copy for `out=` input tensor for `torch.tensordot` ([#56286](https://github.com/pytorch/pytorch/pull/56286)).
* Added cross-device check for `out` and `input` tensors for `torch.cat` ([#53004](https://github.com/pytorch/pytorch/pull/53004)).
* Modified the order of asserts to correct the error message when nan appears in `torch.multinomial` on CUDA ([#53288](https://github.com/pytorch/pytorch/pull/53288)).
* Converted a few more checks for unsupported devices to raise `NotImplementedError` ([#53610](https://github.com/pytorch/pytorch/pull/53610)).
* Made shared cache thread-safe for `torch.multiprocessing` ([#53750](https://github.com/pytorch/pytorch/pull/53750)).
* Added support for `torch.int32` indices in `torch.repeat_interleave` ([#55102](https://github.com/pytorch/pytorch/pull/55102)).
* Added a check to give a clear error message when a binary function is called for non-complex inputs with a complex-valued alpha ([#54964](https://github.com/pytorch/pytorch/pull/54964)).
* Propagate error message from `torch_shm_manager` when running `torch.multiprocessing` ([#57307](https://github.com/pytorch/pytorch/pull/57307), [#57310](https://github.com/pytorch/pytorch/pull/57310)).
* Enabled deterministic path for `index_copy_cuda` with `index_put` ([#58144](https://github.com/pytorch/pytorch/pull/58144)).
* Added support for uppercase letters in `torch.einsum` ([#56475](https://github.com/pytorch/pytorch/pull/56475)).
* Added CUDA support for `torch.orgqr` ([#51348](https://github.com/pytorch/pytorch/pull/51348)) and `torch.ormqr` ([#57316](https://github.com/pytorch/pytorch/pull/57316)).
* Added support for batched as well as complex inputs for `torch.geqrf` on both CPU and CUDA ([#56249](https://github.com/pytorch/pytorch/pull/56249), [#56251](https://github.com/pytorch/pytorch/pull/56251)).

### Complex Numbers

* Fixed `torch.{linspace, logspace}` to correctly infer the complex type and return a complex tensor when the `start` and/or `end` values are complex numbers and the `dtype` value is `None` ([#38875](https://github.com/pytorch/pytorch/pull/38875)).
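A quick sketch of the new inference behavior (assuming the default float32 dtype, so the inferred complex dtype is complex64):

```python
import torch

# With a complex endpoint and dtype=None, a complex dtype is now inferred.
z = torch.linspace(0, 1 + 1j, steps=5)
print(z.dtype)   # torch.complex64
print(z[-1])     # tensor(1.+1.j)
```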
### Autograd

* Added support for single tensor in `inputs` argument for `.backward()` ([#53827](https://github.com/pytorch/pytorch/pull/53827)).
* Added support for C++ optional arguments in autograd custom functions ([#54270](https://github.com/pytorch/pytorch/pull/54270)).
* Added autograd support to `torch.orgqr` ([#52637](https://github.com/pytorch/pytorch/pull/52637)), `torch.segment_reduce` ([#56792](https://github.com/pytorch/pytorch/pull/56792)).
* Added deterministic backward for `torch.gather` for `dim=1` ([#55573](https://github.com/pytorch/pytorch/pull/55573)).
* Make detach return an alias even under inference mode ([#59633](https://github.com/pytorch/pytorch/pull/59633)).

### torch.nn

* Add 3D depthwise separable convolution ([#51027](https://github.com/pytorch/pytorch/pull/51027)).
* Make bias in lazy modules lazy and avoid creating empty tensors ([#52212](https://github.com/pytorch/pytorch/pull/52212)).
* BFloat16: enable prepacked weights' inference ([#48922](https://github.com/pytorch/pytorch/pull/48922)).
* Enable mkldnn conv2d backward to support mkldnn tensor input ([#48994](https://github.com/pytorch/pytorch/pull/48994)).
* Add OneDNN pooling backward ([#49454](https://github.com/pytorch/pytorch/pull/49454)).
* Add 64bit indexing support for softmax ([#52713](https://github.com/pytorch/pytorch/pull/52713)).
* `nn.init._calculate_fan_in_and_fan_out`: Support usage with `__torch_function__` ([#53522](https://github.com/pytorch/pytorch/pull/53522)).
* `nn.Transformer` / `nn.MultiheadAttention`: Add `batch_first` argument ([#55285](https://github.com/pytorch/pytorch/pull/55285)); a short usage sketch appears after the Dataloader list below.
* `nn.Transformer`: Add `layer_norm_eps` arg ([#54494](https://github.com/pytorch/pytorch/pull/54494)).
* `nn.AvgPool2d`: Add channels_last support on CPU ([#48918](https://github.com/pytorch/pytorch/pull/48918)).
* `clip_grad_norm_`: Add `error_if_nonfinite` flag ([#53843](https://github.com/pytorch/pytorch/pull/53843), [#55169](https://github.com/pytorch/pytorch/pull/55169)).
* `Module.train`: Raise nicer error when called with invalid modes ([#58247](https://github.com/pytorch/pytorch/pull/58247)).
* `nn.Linear`: Support 0 `in_features` ([#56505](https://github.com/pytorch/pytorch/pull/56505)).
* `nn.EmbeddingBag`: Support mix of int32 and int64 offsets/indices ([#55189](https://github.com/pytorch/pytorch/pull/55189)).
* `xnnpack::linear`: Handle 1D input ([#54986](https://github.com/pytorch/pytorch/pull/54986)).
* `nn.Module`: Add `allow_duplicate` flag to `named_modules()` ([#54812](https://github.com/pytorch/pytorch/pull/54812)).
* `nn.Module`: Add `to_empty()` function for moving to a device without copying storage ([#56610](https://github.com/pytorch/pytorch/pull/56610)).
* Make `pad_sequence` callable from C++ API ([#57868](https://github.com/pytorch/pytorch/pull/57868)).

### Dataloader

* Added `generate_state` for NumPy seeding ([#56797](https://github.com/pytorch/pytorch/pull/56797)).
* Modified construct_time_validation to argument_validation ([#55836](https://github.com/pytorch/pytorch/pull/55836)).
* Added mode to `LoadFilesFromDisk` ([#57056](https://github.com/pytorch/pytorch/pull/57056)).
* Added the ability to override the `__reduce_ex__` function of `DataPipe` ([#52858](https://github.com/pytorch/pytorch/pull/52858)).
* Added lambda support to `MapIterDataPipe` ([#52856](https://github.com/pytorch/pytorch/pull/52856)).
* Added functional way of stacking DataPipes ([#52885](https://github.com/pytorch/pytorch/pull/52885)).
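A minimal sketch of the new `batch_first` layout for `nn.MultiheadAttention` (shapes chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

# With batch_first=True, inputs and outputs are (batch, seq, feature)
# instead of the default (seq, batch, feature).
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(8, 10, 16)   # (batch, seq, embed_dim)
out, attn = mha(x, x, x)     # self-attention
print(out.shape)             # torch.Size([8, 10, 16])
```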
### C++ API

* Suppressed unsigned comparison warning ([#52653](https://github.com/pytorch/pytorch/pull/52653)).
* Fixed constexpr `__host__` warning ([#52702](https://github.com/pytorch/pytorch/pull/52702)).
* Introduced a fluent API to construct tensors from external data ([#54530](https://github.com/pytorch/pytorch/pull/54530)).

### AMD

* Allow `PYTORCH_ROCM_ARCH` in `cpp_extension` ([#54341](https://github.com/pytorch/pytorch/pull/54341)).
* Added support for `torch.half` dtype RNNs with MIOpen ([#52475](https://github.com/pytorch/pytorch/pull/52475)).
* Added support for the new `hiprtc` precompiler feature ([#54350](https://github.com/pytorch/pytorch/pull/54350)).
* Improved reliability of `hipfft` and `rocfft` detection for ROCm build ([#53408](https://github.com/pytorch/pytorch/pull/53408)).

### CUDA

* Improved warning message when an old GPU is detected ([#56621](https://github.com/pytorch/pytorch/pull/56621)).
* Made `torch.cuda.amp.GradScaler` scale updates in-place for better composability with graph capture ([#55562](https://github.com/pytorch/pytorch/pull/55562)).
* Add `USE_MAGMA` build flag ([#55994](https://github.com/pytorch/pytorch/pull/55994)).
* Change link order for BUILD_SPLIT_CUDA option ([#58437](https://github.com/pytorch/pytorch/pull/58437)).
* Improve CUDA-11.X binary builds ([#58459](https://github.com/pytorch/pytorch/pull/58459)).
* Move CUDA async warning to suffix ([#59467](https://github.com/pytorch/pytorch/pull/59467)).

### torch.fx

* Make `torch.fx.map_arg` require a callable ([#51907](https://github.com/pytorch/pytorch/pull/51907)).
* Generalize dict key check in `torch.fx.Tracer.create_arg` ([#51927](https://github.com/pytorch/pytorch/pull/51927)).
* Customize traceback for calls to symbolically-traced code ([#51648](https://github.com/pytorch/pytorch/pull/51648)).
* Allow `Transformer` to accept output result that is not Proxy ([#52473](https://github.com/pytorch/pytorch/pull/52473)).
* Make `TracerBase._find_user_frame` private ([#53654](https://github.com/pytorch/pytorch/pull/53654)).
* Improve buffer registration during `GraphModule` init ([#53444](https://github.com/pytorch/pytorch/pull/53444)).
* Garbage collect values in `Interpreter` ([#54726](https://github.com/pytorch/pytorch/pull/54726)).
* Improve placeholder matching in subgraph rewriter ([#54958](https://github.com/pytorch/pytorch/pull/54958)).
* Record `stride` on `Node` during `ShapeProp` pass ([#55108](https://github.com/pytorch/pytorch/pull/55108)).
* Record `memory_format` on `Node` during `ShapeProp` pass ([#55815](https://github.com/pytorch/pytorch/pull/55815)).
* Put tensor metadata into a `NamedTuple` in `ShapeProp` ([#55930](https://github.com/pytorch/pytorch/pull/55930)).
* Preserve node meta info in `split_module` ([#56212](https://github.com/pytorch/pytorch/pull/56212)).
* Make `shape_prop` handle targets with aggregate outputs ([#56221](https://github.com/pytorch/pytorch/pull/56221)).
* Make arg normalization a method on `Node` and not a pass (also augment tests to be exhaustive) ([#55992](https://github.com/pytorch/pytorch/pull/55992)).
* Allow for args to be left as args in NormalizeArgs ([#55995](https://github.com/pytorch/pytorch/pull/55995)).
* Maintain submodule references during subgraph rewriting ([#55463](https://github.com/pytorch/pytorch/pull/55463)).
* Changes in order to move `PythonKey` out of tree ([#57427](https://github.com/pytorch/pytorch/pull/57427)). * Handle cases in `GraphDrawer` when shape, type or stride are not present ([#57845](https://github.com/pytorch/pytorch/pull/57845)). * Handle the case when output consumes `get_attr` directly in `split_by_tags` ([#57844](https://github.com/pytorch/pytorch/pull/57844)). * Let submodules be collected as `args/kwargs` in symbolic tracing([#57840](https://github.com/pytorch/pytorch/pull/57840)). ### Profiler * Expanded Kineto platform support ([#56323](https://github.com/pytorch/pytorch/pull/56323)). * Added profiler fallback ([#57612](https://github.com/pytorch/pytorch/pull/57612)). * Added CUDA event fallback ([#58133](https://github.com/pytorch/pytorch/pull/58133)). ### TorchScript * Added a flag to enable CPU fusion in benchmarks ([#48612](https://github.com/pytorch/pytorch/pull/48612)). * Updated fusion to handle loops that have the same bounds as expressions ([#55997](https://github.com/pytorch/pytorch/pull/55997)). * Updated normalization transformation to be in-place ([#56158](https://github.com/pytorch/pytorch/pull/56158)). * Added check to only lower float `conv2d`s ([#56289](https://github.com/pytorch/pytorch/pull/56289)). * Added more python bindings for loopnest ([#56213](https://github.com/pytorch/pytorch/pull/56213)). * Updated `fuseLoops` API to return bool flag and not throw any exceptions ([#56353](https://github.com/pytorch/pytorch/pull/56353)). * Added `unroll` and `flatten` APIs which do not require return stmt pointer ([#56420](https://github.com/pytorch/pytorch/pull/56420)). * Updated `Buf` on mutation instead of creating a new one ([#57513](https://github.com/pytorch/pytorch/pull/57513)). * Updated `flatten` transformation to be in-place ([#56629](https://github.com/pytorch/pytorch/pull/56629)). * Added missing python bindings for NNC Stmts ([#55570](https://github.com/pytorch/pytorch/pull/55570)). * Allowed backend preprocessing to take place outside of the backend interface ([#51757](https://github.com/pytorch/pytorch/pull/51757)) * Added an error message for the case when `with` item is not an object ([#52335](https://github.com/pytorch/pytorch/pull/52335)). * Enabled `ModuleList` non-literal indexing ([#53410](https://github.com/pytorch/pytorch/pull/53410)). * Added recursive scripting for class type module attributes ([#55124](https://github.com/pytorch/pytorch/pull/55124)). * Added support for `mypy` ignore annotation with particular rule specified ([#51675](https://github.com/pytorch/pytorch/pull/51675)). * Added support for comparing two bool variables ([#51844](https://github.com/pytorch/pytorch/pull/51844)). * Added MKLDNN GELU function ([#53615](https://github.com/pytorch/pytorch/pull/53615)). * Added `hardtanh(0,6)` to the set of MKLDNN fusible ops for mobilenetv2 ([#56203](https://github.com/pytorch/pytorch/pull/56203)). * Captured argument names for traced functions and modules ([#51775](https://github.com/pytorch/pytorch/pull/51775)). * Improved `has_bf16_support` ([#57408](https://github.com/pytorch/pytorch/pull/57408)). * Walk Python AST to check for unsupported attribute type annotations ([#51805](https://github.com/pytorch/pytorch/pull/51805)). * Added `out` version for sum ([#52225](https://github.com/pytorch/pytorch/pull/52225)) * Added logic to trace `torch.nn.Linear` as `aten::linear` ([#51897](https://github.com/pytorch/pytorch/pull/51897)). * Made `is_tracing` scriptable ([#49853](https://github.com/pytorch/pytorch/pull/49853)). 
* Added support for builtin `sum` ([#52188](https://github.com/pytorch/pytorch/pull/52188)).
* Fused `clip_ranges` and `gather_ranges` ([#52461](https://github.com/pytorch/pytorch/pull/52461)).
* Added support for features from `to_backend` for the Lite Interpreter ([#52870](https://github.com/pytorch/pytorch/pull/52870)).
* Added a filter to remove mutation ([#51923](https://github.com/pytorch/pytorch/pull/51923)).
* Added logic to functionalize ops to be included in the MKLDNN group ([#51924](https://github.com/pytorch/pytorch/pull/51924)).
* Extended subgraph utils to cover merging a node following a subgraph ([#52513](https://github.com/pytorch/pytorch/pull/52513)).
* Included max pool in fusion groups ([#52613](https://github.com/pytorch/pytorch/pull/52613)).
* Registered both TupleConstruct and ListConstruct as out variants ([#52684](https://github.com/pytorch/pytorch/pull/52684)).
* Added Alias analysis to Memory Management/Planning ([#50060](https://github.com/pytorch/pytorch/pull/50060)).
* Added property binding in TorchBind ([#50670](https://github.com/pytorch/pytorch/pull/50670)).
* Registered `pow` out variant ([#52454](https://github.com/pytorch/pytorch/pull/52454)).
* Made `torch.load()` aware of import path changes ([#53139](https://github.com/pytorch/pytorch/pull/53139)).
* Added `aten::to` copy out variant ([#52343](https://github.com/pytorch/pytorch/pull/52343)).
* Added more variants to `create_empty_from` ([#53333](https://github.com/pytorch/pytorch/pull/53333)).
* Added support for parsing Ellipsis in JIT frontend ([#53576](https://github.com/pytorch/pytorch/pull/53576)).
* Added a bool `is_available()` method to the backend contract ([#53068](https://github.com/pytorch/pytorch/pull/53068)).
* Added parallel support for the LLVM backend ([#53243](https://github.com/pytorch/pytorch/pull/53243), [#54122](https://github.com/pytorch/pytorch/pull/54122)).
* Rewrote `functional.tensordot` to be TorchScript-able ([#53672](https://github.com/pytorch/pytorch/pull/53672)).
* Added python bindings for missing loop transformations in `LoopNest` ([#54355](https://github.com/pytorch/pytorch/pull/54355)).
* Added support for list insertion for mutation removal ([#54271](https://github.com/pytorch/pytorch/pull/54271)).
* Added support for `torch.bfloat16` in the fuser ([#54571](https://github.com/pytorch/pytorch/pull/54571)).
* Added some functions for manipulating MKLDNN tensors to TORCH_API ([#56954](https://github.com/pytorch/pytorch/pull/56954)).
* Merged CUDA Streams and Events ([#53902](https://github.com/pytorch/pytorch/pull/53902)).
* Added python bindings for `TensorExprKernel` ([#54450](https://github.com/pytorch/pytorch/pull/54450)).
* Added support for dtype-specific tensor subclasses (e.g. LongTensor) ([#54817](https://github.com/pytorch/pytorch/pull/54817)).
* Added support for tuple `add` operator ([#52292](https://github.com/pytorch/pytorch/pull/52292)).
* Disambiguated error message for working with not fully refined tuple types ([#55745](https://github.com/pytorch/pytorch/pull/55745)).
* Allowed unpacking tuple and assigning unpacked values to SELECT-type expressions ([#55268](https://github.com/pytorch/pytorch/pull/55268)).
* Made NoneType `annotation_str` emit `NoneType` instead of `None` ([#54746](https://github.com/pytorch/pytorch/pull/54746)).
* Added CUDA device synchronization support in JIT ([#55469](https://github.com/pytorch/pytorch/pull/55469)). * Added `optimize_graph_output_memory` flag ([#55811](https://github.com/pytorch/pytorch/pull/55811)). * Added support for refinement for `torch.jit.Future` ([#56148](https://github.com/pytorch/pytorch/pull/56148)). * Added implicit conversion from null tensor to `NoneType `([#55823](https://github.com/pytorch/pytorch/pull/55823)). * Added `aten::matmul`s to TE fuser ([#54605](https://github.com/pytorch/pytorch/pull/54605)). * Put explicit error message on class attribute accesses ([#55723](https://github.com/pytorch/pytorch/pull/55723)). * Added support for constant tensors in tensorexpr kernel ([#56319](https://github.com/pytorch/pytorch/pull/56319)). * Added native support for `aten::getitem` ([#55310](https://github.com/pytorch/pytorch/pull/55310)). * Added stricter check for function schemas with varargs ([#56509](https://github.com/pytorch/pytorch/pull/56509)). * Added graceful failure handling of DataPtr extraction in CUDAFuture ([#56511](https://github.com/pytorch/pytorch/pull/56511)). * Enabled forward/backward compatibility in TS mobile ([#56079](https://github.com/pytorch/pytorch/pull/56079)). * Added binding for `aten::clamp_min_out` ([#56635](https://github.com/pytorch/pytorch/pull/56635)), `aten::argmin_out` ([#56638](https://github.com/pytorch/pytorch/pull/56638)), and `aten::norm_out` ([#56636](https://github.com/pytorch/pytorch/pull/56636)). * Enhanced error message for `Future.setErrorIfNeeded` ([#56631](https://github.com/pytorch/pytorch/pull/56631)). * Added type inference support for `nn.Module `methods using PDT ([#57165](https://github.com/pytorch/pytorch/pull/57165)). * Disabled conv-add-relu fusion for cuDNN7 when model uses `torch.float16` ([#56579](https://github.com/pytorch/pytorch/pull/56579)). * Enabled conv-add-relu fusion as a part of frozen graph optimization ([#56580](https://github.com/pytorch/pytorch/pull/56580)). * Reduced inline autodiff threshold to enable the capture of smaller fusions ([#57062](https://github.com/pytorch/pytorch/pull/57062)). * Added static runtime support for `aten::matmul` ([#57291](https://github.com/pytorch/pytorch/pull/57291)). * Added `device()` method to `c10::Event` ([#57293](https://github.com/pytorch/pytorch/pull/57293)). * Added support for normalization of `is` op ([#57862](https://github.com/pytorch/pytorch/pull/57862)). * Enabled `cat` without conditionals iff CPU ([#58026](https://github.com/pytorch/pytorch/pull/58026)). * Added `LowerSimpleTuples` for freeze tuples ([#57915](https://github.com/pytorch/pytorch/pull/57915)). * Added support for striding for list slicing ([#49352](https://github.com/pytorch/pytorch/pull/49352)). * Wrapped `torch::deploy` API functions in safe rethrow macros ([#58192](https://github.com/pytorch/pytorch/pull/58192)). * Added binding for `aten::div_out` ([#56653](https://github.com/pytorch/pytorch/pull/56653)) * Added binding for `aten::sub_out` ([#56656](https://github.com/pytorch/pytorch/pull/56656)). * Supported `clamp.Tensor `([#58191](https://github.com/pytorch/pytorch/pull/58191)). * Added an out version for `aten::repeat` ([#57683](https://github.com/pytorch/pytorch/pull/57683)). * Added default arguments to CUDA stream and events ([#53025](https://github.com/pytorch/pytorch/pull/53025)). * Added support for linear in MKLDNN fusion ([#51484](https://github.com/pytorch/pytorch/pull/51484)). 
* Handled MKLDNN broadcasting in MKLDNN fuser ([#51736](https://github.com/pytorch/pytorch/pull/51736)). * Added 0-dim support for binary MKLDNN ops ([#51921](https://github.com/pytorch/pytorch/pull/51921)). * Added OneDNN relu backward and reshape backward ([#49455](https://github.com/pytorch/pytorch/pull/49455)). * Added OneDNN batch_norm backward ([#50460](https://github.com/pytorch/pytorch/pull/50460)). * Added support for `hardshrink` ([#57749](https://github.com/pytorch/pytorch/pull/57749)). * Added non mutator bundled inputs method ([#58408](https://github.com/pytorch/pytorch/pull/58408)). * Added support to compare devices ([#53045](https://github.com/pytorch/pytorch/pull/53045)). * Added support for `memory_arg` in `aten::clone` ([#58100](https://github.com/pytorch/pytorch/pull/58100)). * Implemented `aten::cat` without conditionals ([#53128](https://github.com/pytorch/pytorch/pull/53128)). * Added external function bindings ([#53420](https://github.com/pytorch/pytorch/pull/53420)). * Added out variant of `sigrid_transforms_torch_bind` and `ListUnpack` ([#54761](https://github.com/pytorch/pytorch/pull/54761)). ### torch.package * Added a reliable method for determining if a file is part of Python’s standard library ([#51694](https://github.com/pytorch/pytorch/pull/51694)). * Made package code more composable with other parts of PyTorch (package GraphModule, load non-code files from package) ([#51674](https://github.com/pytorch/pytorch/pull/51674), [#51976](https://github.com/pytorch/pytorch/pull/51976)). * Improved debugging facilities (allow_empty flag, zip file viewer, deny instruction, dependency tracing, query if object is from a package) ([#53232,](https://github.com/pytorch/pytorch/pull/53232)[#53233](https://github.com/pytorch/pytorch/pull/53233), [#52176](https://github.com/pytorch/pytorch/pull/52176), [#55167](https://github.com/pytorch/pytorch/pull/55167), [#56190](https://github.com/pytorch/pytorch/pull/56190), [#56238](https://github.com/pytorch/pytorch/pull/56238), [#56729](https://github.com/pytorch/pytorch/pull/56729)). * Allow save_module to accept module as arg ([#55996](https://github.com/pytorch/pytorch/pull/55996)). * Follow dependencies created by `__import__` calls ([#55153](https://github.com/pytorch/pytorch/pull/55153)). * Added hooks to exporters’ mock and extern calls to take action when a module is matched ([#58000](https://github.com/pytorch/pytorch/pull/58000)) * Turn the default behavior of packaging into an ‘intern’ action so that it can be ordered with repeat to mock, extern, and deny actions ([#57341](https://github.com/pytorch/pytorch/pull/57341)). ### Quantization * Added support for keeping output quantized for list and dict ([#56391](https://github.com/pytorch/pytorch/pull/56391)). * Added `torch.float16` and `torch.float64` support to `fake_quantize_per_channel` ([#56894](https://github.com/pytorch/pytorch/pull/56894)). * Support preserving attributes in deepcopy of observed/quantized graphmodule ([#56550](https://github.com/pytorch/pytorch/pull/56550)). * Added support for packed params in state_dict ([#51639](https://github.com/pytorch/pytorch/pull/51639)). * Added support for fusing `Conv3d + BatchNorm3d + ReLU` operations ([#50003](https://github.com/pytorch/pytorch/pull/50003)). * Change back to `multiple_outputs_gpu_kernel` for learnable fake per-channel quantization ([#52017](https://github.com/pytorch/pytorch/pull/52017)). 
* Added `torch.float16` and `torch.float32` support to `fake_quantize_per_tensor` ([#52612](https://github.com/pytorch/pytorch/pull/52612)). * Support batched embeddings for 8 Bit embedding bag quantization ([#55343](https://github.com/pytorch/pytorch/pull/55343)). * Expose nbins and ratio for `quantized::embedding_bag_4bit_prepack` ([#50398](https://github.com/pytorch/pytorch/pull/50398)). ### Mobile * Removed caching of inflated bundled inputs ([#55181](https://github.com/pytorch/pytorch/pull/55181)). * Improved exception reporting for Lite interpreter ([#54284](https://github.com/pytorch/pytorch/pull/54284), [#55062](https://github.com/pytorch/pytorch/pull/55062), [#55252](https://github.com/pytorch/pytorch/pull/55252)). * Improved forward/backward compatibility in Lite interpreter when adding new optional arguments to ops ([#56845](https://github.com/pytorch/pytorch/pull/56845)). * Added model size to logged metadata when loading a Lite interpreter model ([#53578](https://github.com/pytorch/pytorch/pull/53578)). * Benchmarking binary speed_benchmark_torch now supports Lite interpreter ([#55402](https://github.com/pytorch/pytorch/pull/55402)). ### Distributed `torch.distributed.Store` * Update `compare_set` for other Store implementations to be the same as `TCPStore`. ([#57175](https://github.com/pytorch/pytorch/pull/57175)) * `torch.distributed.Store`: Expose C++ `compare_set` API to python. ([#57191](https://github.com/pytorch/pytorch/pull/57191)) * `torch.distributed.Store`: Add `timeout`, `host`, `port` to TCPStore’s python API as accessors. ([#52784](https://github.com/pytorch/pytorch/pull/52784)) * Allow `world_size` and `is_master` to be optional when constructing TCPStore. ([#51809](https://github.com/pytorch/pytorch/pull/51809)) * Add `wait_for_worker` param to `TCPStore`’s Python API([#52888](https://github.com/pytorch/pytorch/pull/52888)) `torch.distributed.rpc` * Allow `RRef` to be created with a specified set of CUDA devices ([#57085](https://github.com/pytorch/pytorch/pull/57085)) * Correctness fixes for CUDA support in RPC framework ([#54024](https://github.com/pytorch/pytorch/pull/54024), ) * Refactor RPC agent to use `Store` to collect and verify name ([#53209](https://github.com/pytorch/pytorch/pull/53209), [#53202](https://github.com/pytorch/pytorch/pull/53202)) `DistributedDataParallel` * Make unused parameter search show up in profiler output ([#57376](https://github.com/pytorch/pytorch/pull/57376)) * Update DDP communication hooks to divide by world size before all_reduce to avoid overflow ([#57410](https://github.com/pytorch/pytorch/pull/57410)) * Stabilize `torch.distributed.GradBucket` interface for gradient compression ([#53010](https://github.com/pytorch/pytorch/pull/53010), [#53098](https://github.com/pytorch/pytorch/pull/53098), [#53102](https://github.com/pytorch/pytorch/pull/53102), [#53009](https://github.com/pytorch/pytorch/pull/53009), [#53099](https://github.com/pytorch/pytorch/pull/53099)) * Skip CPU to GPU input copy if input is already on the right device. 
([#55624](https://github.com/pytorch/pytorch/pull/55624)) * Record forward pass of `DistributedDataParallel` and `DataParallel` in profiler.([#55578](https://github.com/pytorch/pytorch/pull/55578)) * Make `orthogonalization_epsilon` flag configurable in `torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook.PowerSGDState` ([#55738](https://github.com/pytorch/pytorch/pull/55738)) * Set default value of `start_powerSGD_iter` to 1K iterations in `torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook. `([#55272](https://github.com/pytorch/pytorch/pull/55272)) * Add a minimum compression rate threshold parameter for `torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook` ([#52541](https://github.com/pytorch/pytorch/pull/52541)) * Report compression rate for batched PowerSGD hook ([#55103](https://github.com/pytorch/pytorch/pull/55103)) * Enable gradient compression hook testing on ROCm ([#52403](https://github.com/pytorch/pytorch/pull/52403)) * Enhance warning for unused parameters in `DistributedDataParallel`. ([#52385](https://github.com/pytorch/pytorch/pull/52385)) * Enhance error messages when crashing with unused parameters in `DistributedDataParallel`. ([#52391](https://github.com/pytorch/pytorch/pull/52391)) `torch.distributed` * Add rank information on NCCL communicator abort ([#57974](https://github.com/pytorch/pytorch/pull/57974)) * Enhance exception logging in NCCL ([#54557](https://github.com/pytorch/pytorch/pull/54557), [#54558](https://github.com/pytorch/pytorch/pull/54558), [#54117](https://github.com/pytorch/pytorch/pull/54117)) `torch.distributed.nn.RemoteModule` * Create a separate remote module template when moving CPU tensors to a cuda device is not enabled ([#57413](https://github.com/pytorch/pytorch/pull/57413)) * Allow passing `RemoteModule` as an argument over RPC ([#57695](https://github.com/pytorch/pytorch/pull/57695), [#58345](https://github.com/pytorch/pytorch/pull/58345)) * Support async instantiation of RemoteModule ([#58052](https://github.com/pytorch/pytorch/pull/58052)) * Place inputs on the appropriate devices in `RemoteModule` ([#56943](https://github.com/pytorch/pytorch/pull/56943)) `torch.futures.Future` * Enable `torch.futures.Future` to be created with CUDA support ([#56517](https://github.com/pytorch/pytorch/pull/56517)) * `torch.futures`: Improve error propagation when using `then` API ([#54475](https://github.com/pytorch/pytorch/pull/54475)) `torch.nn.SyncBatchNorm` * Migrate `apex.parallel.SyncBatchNorm` `channels_last` to PyTorch implementation ([#46906](https://github.com/pytorch/pytorch/pull/46906)) * Fix `SyncBatchNorm`’s forward pass to handle optional weight ([#54568](https://github.com/pytorch/pytorch/pull/54568)) `torch.distributed.pipeline` * `torch.distributed.pipeline`: Merge pipeline partitions that are on the same device. ([#55973](https://github.com/pytorch/pytorch/pull/55973)) Added new `torch.distributed.elastic `module that upstreams `pytorch/elastic` * Rename `torch.distributed.elastic_launch` to `torch.distributed.run` ([#56831](https://github.com/pytorch/pytorch/pull/56831)) * make process failure init error non-fatal ([#56739](https://github.com/pytorch/pytorch/pull/56739)) * Reorder type definitions in dynamic_rendezvous.py ([#56534](https://github.com/pytorch/pytorch/pull/56534)) * Revise the rendezvous handler registry logic. ([#55466](https://github.com/pytorch/pytorch/pull/55466)) * Set error code in reply file when child process is terminated by signals. 
(f665a7f8a1) * Make sure torchelastic mp wait for queue to be drained before finishing the process ([#55412](https://github.com/pytorch/pytorch/pull/55412)) * Revise the rendezvous exception types. ([#54803](https://github.com/pytorch/pytorch/pull/54803)) * Expose a `stderr` parameter in `EtcdServer`. ([#54805](https://github.com/pytorch/pytorch/pull/54805)) * Improve the implementation of the utility functions and add their unit tests. ([#54804](https://github.com/pytorch/pytorch/pull/54804)) * Improve the implementation of `RendezvousParameters` and add its unit tests. ([#54807](https://github.com/pytorch/pytorch/pull/54807)) `torch.distributed.optim.ZeroRedundancyOptimizer` * Add an option for buckets to be views of tensors and consolidate public interface ([#52987](https://github.com/pytorch/pytorch/pull/52987)) * Make state dict for ZeroRedundancyOptimizer world size independent ([#52960](https://github.com/pytorch/pytorch/pull/52960)) Combine backtrace print into one string to avoid interleaving ([#56961](https://github.com/pytorch/pytorch/pull/56961)). Raise exception rather than crash if GLOO_DEVICE_TRANSPORT is set to unknown value ([#58518](https://github.com/pytorch/pytorch/issues/58518)). ### ONNX * Updated fuseLogSoftmaxNllLoss function to handle autocasting ([#51729](https://github.com/pytorch/pytorch/pull/51729)) ([#52349](https://github.com/pytorch/pytorch/pull/52349)). * Added support for sequence of tensor mutations in blocks ([#51577](https://github.com/pytorch/pytorch/pull/51577)) ([#52347](https://github.com/pytorch/pytorch/pull/52347)). * Updated LayerNorm symbolic to handle autocasting ([#52199](https://github.com/pytorch/pytorch/pull/52199)) ([#52350](https://github.com/pytorch/pytorch/pull/52350)). * Restored fast path in `OnnxifiOp::adjustOutputBatchSize` ([#52498](https://github.com/pytorch/pytorch/pull/52498)). * Improved index_put symbolic to handle singular Bool updates ([#53690](https://github.com/pytorch/pytorch/pull/53690)) ([#54863](https://github.com/pytorch/pytorch/pull/54863)). * Replaced decomposeLinear pre process pass with a symbolic ([#53077](https://github.com/pytorch/pytorch/pull/53077)) ([#54866](https://github.com/pytorch/pytorch/pull/54866)). * Improved assign input shape for tuple inputs & primitive type inputs ([#54112](https://github.com/pytorch/pytorch/pull/54112)) ([#56164](https://github.com/pytorch/pytorch/pull/56164)). * Updated repeat_interleave symbolic ([#54312](https://github.com/pytorch/pytorch/pull/54312)) ([#56165](https://github.com/pytorch/pytorch/pull/56165)). * Enabled `word_language_model` GRU and LSTM scripting ([#54310](https://github.com/pytorch/pytorch/pull/54310)) ([#56170](https://github.com/pytorch/pytorch/pull/56170)). * Added standardOps match more input type in ORT ([#53813](https://github.com/pytorch/pytorch/pull/53813)) ([#56172](https://github.com/pytorch/pytorch/pull/56172)). * Redesigned in-place conversion ([#55033](https://github.com/pytorch/pytorch/pull/55033)) ([#56173](https://github.com/pytorch/pytorch/pull/56173)). * Handled PackedParams inputs for _propagate_and_assign_input_shapes ([#56449](https://github.com/pytorch/pytorch/pull/56449)) ([#57079](https://github.com/pytorch/pytorch/pull/57079)). * Added a warning for the case when *len* is used to calculate tensor shape ([#55151](https://github.com/pytorch/pytorch/pull/55151)) ([#57595](https://github.com/pytorch/pytorch/pull/57595)). 
* Added special post processing for `onnx::Cast` and `onnx::ConstantOfShape` shape type inference ([#55962](https://github.com/pytorch/pytorch/pull/55962)) ([#57597](https://github.com/pytorch/pytorch/pull/57597)).
* Handled NoneType in Assign Output Shapes ([#54623](https://github.com/pytorch/pytorch/pull/54623)) ([#57602](https://github.com/pytorch/pytorch/pull/57602)).
* Handled `ListUnpack` on a dynamic tensor list ([#56592](https://github.com/pytorch/pytorch/pull/56592)) ([#57603](https://github.com/pytorch/pytorch/pull/57603)).
* Handled mixed mask, index input for index_put ([#57604](https://github.com/pytorch/pytorch/pull/57604)).
* Handled incorrect format for example_outputs ([#55802](https://github.com/pytorch/pytorch/pull/55802)) ([#57829](https://github.com/pytorch/pytorch/pull/57829)).
* Enabled several script unit tests using new JIT passes ([#51722](https://github.com/pytorch/pytorch/pull/51722)) ([#53309](https://github.com/pytorch/pytorch/pull/53309)).

### Vulkan
* Enabled broadcasting for arithmetic ops (add, sub, mul, and div) ([#52842](https://github.com/pytorch/pytorch/pull/52842)).
* Reduced the size of compiled shaders by using the `-Os` flag when calling `glslc` ([#57199](https://github.com/pytorch/pytorch/pull/57199)).
* The Vulkan optimization JIT pass now adds an `optimized_for_vulkan` attribute to the model ([#56414](https://github.com/pytorch/pytorch/pull/56414)).

### Benchmark
* Quality of life improvements to `Timer` ([#53294](https://github.com/pytorch/pytorch/pull/53294))
* Add repeats to `Timer.collect_callgrind(...)` ([#54484](https://github.com/pytorch/pytorch/pull/54484))

### Misc
* Auto-detect ccache to speed up developer builds ([#49389](https://github.com/pytorch/pytorch/pull/49389)).
* Catch and ignore tracebacks for compilation errors ([#55986](https://github.com/pytorch/pytorch/pull/55986)).
* Register DefaultBackend implementations for functional/inplace structured operators ([#53037](https://github.com/pytorch/pytorch/pull/53037)).
* Improved support for oneDNN on AArch64 when building from source ([#55913](https://github.com/pytorch/pytorch/pull/55913)).

# Bug fixes
### Python API
* Updated `torch.lerp` to make the `weights` tensor broadcastable ([#52319](https://github.com/pytorch/pytorch/pull/52319)).
* Fixed print for negative `torch.int8` tensors on ARM64 ([#52616](https://github.com/pytorch/pytorch/pull/52616)).
* Fixed type annotation for `as_tuple` to clearly determine what `torch.nonzero` will resolve to ([#51635](https://github.com/pytorch/pytorch/pull/51635)).
* Fixed `torch.logcumsumexp` to correctly handle infs and nans ([#52947](https://github.com/pytorch/pytorch/pull/52947)).
* Fixed `torch.topk` for `k=0` on CUDA by skipping the kernel launch in this case ([#58086](https://github.com/pytorch/pytorch/pull/58086)).
* Fixed a bug in optimizers so that hyperparameters remain defined when all parameters have no grad ([#52944](https://github.com/pytorch/pytorch/pull/52944)).
* Fixed type promotion issue for `torch.pow` ([#54085](https://github.com/pytorch/pytorch/pull/54085)).
* Fixed `torch.min()` and `torch.max()` to work on a non-empty dimension for tensors with 0 elements ([#52565](https://github.com/pytorch/pytorch/pull/52565)).
* Fixed the upper bound computation for `torch.randperm` ([#56967](https://github.com/pytorch/pytorch/pull/56967)).
* Allowed `std=0` in `torch.normal`, and added checks to consistently error out if `std<0` ([#51317](https://github.com/pytorch/pytorch/pull/51317))
* Fixed `torch.index_fill` to output a 0-dim tensor for a 0-dim input tensor ([#52209](https://github.com/pytorch/pytorch/pull/52209)).
* Fixed `mul_()` to correctly work for MKLDNN tensors ([#51758](https://github.com/pytorch/pytorch/pull/51758)).
* Fixed temp file/bind race condition in `torch_shm_manager` for `torch.multiprocessing` ([#57309](https://github.com/pytorch/pytorch/pull/57309)).
* Fixed tempfile address binding in `torch_shm_manager` to be destructed correctly for `torch.multiprocessing` ([#57566](https://github.com/pytorch/pytorch/pull/57566)).
* Fixed `torch.multinomial` to never select an element with 0 weight for `torch.half` (already works correctly for other datatypes) ([#53480](https://github.com/pytorch/pytorch/pull/53480)).
* Fixed a bug in `assertRaises` `NotImplemented` handling when no exception is thrown ([#54126](https://github.com/pytorch/pytorch/pull/54126)).
* Fixed override for `__iter__` ([#54702](https://github.com/pytorch/pytorch/pull/54702)).
* Fixed segmentation fault for `torch.floor_divide` when compiling on ARM64 ([#55608](https://github.com/pytorch/pytorch/pull/55608)).
* Fixed `torch.digamma`’s inconsistency with SciPy’s digamma ([#56689](https://github.com/pytorch/pytorch/pull/56689)).
* Fixed `torch.cat` to return the correct result for non-contiguous tensors ([#57177](https://github.com/pytorch/pytorch/pull/57177)).
* Fixed distributions whose `torch.distributions.log_prob` did not properly honor `validate_args=False` ([#53600](https://github.com/pytorch/pytorch/pull/53600)).
* De-prioritized `Dimname` and `DimnameList` in Python overload resolution ([#51350](https://github.com/pytorch/pytorch/pull/51350)).
* Fixed the handling of scalar and zero-dimensional inputs to `torch.take()` and `torch.Tensor.put_` on both CPU and CUDA ([#53356](https://github.com/pytorch/pytorch/pull/53356)).
* Fixed a bug so that extensions are not rebuilt on every import ([#56015](https://github.com/pytorch/pytorch/pull/56015)).
* Fixed the error message for `torch.as_strided` ([#53198](https://github.com/pytorch/pytorch/pull/53198)).
* Added correct handling of tensor allocation for large tensors when using `torch.resize` on CUDA ([#52672](https://github.com/pytorch/pytorch/pull/52672)).
* Fixed an illegal memory access that could happen when computing the inverse of a batch of matrices on CUDA ([#53064](https://github.com/pytorch/pytorch/pull/53064)).
* Fixed a bug where `torch.sparse.addmm` would compute the wrong results for CUDA inputs when `beta` was not zero or one ([#56160](https://github.com/pytorch/pytorch/pull/56160)).
* Fixed a bug where `torch.sparse.sparse_coo_tensor`’s gradient could be calculated incorrectly ([#50361](https://github.com/pytorch/pytorch/pull/50361)).
* `pow`: Fixed a bug for mixed CPU/CUDA input tensors ([#53669](https://github.com/pytorch/pytorch/pull/53669)).
* `sub`: Fixed a `sub.Scalar` bug ([#53679](https://github.com/pytorch/pytorch/pull/53679)).
* Fixed `torch.unique` for discontiguous inputs ([#59003](https://github.com/pytorch/pytorch/pull/59003)).
* Fixed `torch.randperm` on CUDA ([#59352](https://github.com/pytorch/pytorch/pull/59352)).
* Fixed `torch.reciprocal` for `torch.float32` on ARMv8 ([#59361](https://github.com/pytorch/pytorch/pull/59361)).
* Disable overloading of `std::max` & `std::min` for inputs of different types, which could cause accuracy loss ([#55638](https://github.com/pytorch/pytorch/pull/55638))

### Complex Numbers
* Added custom implementations of `sqrt` and `acos`, used when building against `libc++`, to reduce numerical error for edge cases ([#52018](https://github.com/pytorch/pytorch/pull/52018), [#54820](https://github.com/pytorch/pytorch/pull/54820), [#52287](https://github.com/pytorch/pytorch/pull/52287)).

### Autograd
* Fixed:
  * `torch.autograd.gradgradcheck` when outputs are independent of the inputs ([#58049](https://github.com/pytorch/pytorch/pull/58049)).
  * `torch.utils.checkpoint` to behave properly when an error happens during forward ([#51746](https://github.com/pytorch/pytorch/pull/51746)).
  * autograd’s graph discovery when the output is a leaf that requires gradients ([#51940](https://github.com/pytorch/pytorch/pull/51940)).
  * some cases where `torch.autograd.gradcheck` did not return the correct value when `raise_exception=False` ([#53916](https://github.com/pytorch/pytorch/pull/53916)).
  * thread local state not being properly propagated for some operations during the backward pass ([#56174](https://github.com/pytorch/pytorch/pull/56174)).
  * the `torch.index_fill_` formula to support duplicate indices ([#57101](https://github.com/pytorch/pytorch/pull/57101)).
  * the derivative of `torch.sinc` around `x=0` ([#56763](https://github.com/pytorch/pytorch/pull/56763), [#56986](https://github.com/pytorch/pytorch/pull/56986)).
  * the `torch.cdist` backward formula to correctly support broadcasting ([#56605](https://github.com/pytorch/pytorch/pull/56605)) and empty inputs ([#56606](https://github.com/pytorch/pytorch/pull/56606)).
  * view creation metadata for functions that return multiple views in `no_grad` or inference mode ([#57842](https://github.com/pytorch/pytorch/pull/57842)).
  * `autograd.functional.*` functions to work in `no_grad` mode ([#47543](https://github.com/pytorch/pytorch/pull/47543)).
  * rare deadlocks on exit due to autograd worker threads ([#53170](https://github.com/pytorch/pytorch/pull/53170)).

### torch.nn
* `nn.AdaptiveAveragePooling`: Fix crash for integral inputs ([#51443](https://github.com/pytorch/pytorch/pull/51443)).
* `F.normalize`: Fix to make it properly scriptable ([#51909](https://github.com/pytorch/pytorch/pull/51909)).
* `nn.parallel.scatter_gather.gather`: Fix to handle `NamedTuple`s and moving output to CPU ([#51104](https://github.com/pytorch/pytorch/pull/51104)).
* `fractional_max_pool{2/3}d`: Fix segfaults for incorrect `kernel_size` and `output_size` ([#51626](https://github.com/pytorch/pytorch/pull/51626)).
* `nn.CosineEmbeddingLoss`: Validate that the target has the correct shape ([#53110](https://github.com/pytorch/pytorch/pull/53110)).
* Fix multiprocessing serialization for integer parameters on CUDA ([#56529](https://github.com/pytorch/pytorch/pull/56529)).
* `nn.Softplus`: Fix backwards computation by comparing `input` against `beta * threshold` ([#56484](https://github.com/pytorch/pytorch/pull/56484)).
* `addmm_`: Add check to disallow resizing the input tensor for the in-place variation on CPU ([#56452](https://github.com/pytorch/pytorch/pull/56452)).
* `nn.InstanceNorm*d`: Fix to perform the correct input size check ([#56659](https://github.com/pytorch/pytorch/pull/56659)).
* `nn.CTCLoss`: Fix backward pass regression on cuDNN ([#56639](https://github.com/pytorch/pytorch/pull/56639)).
* `nn.ConvTranspose*d`: Fix regression that broke padding with a list of values ([#54911](https://github.com/pytorch/pytorch/pull/54911)).
* `F.max_pool3d`: Fix illegal memory access for large inputs on CUDA by doing multiplication in `int64` ([#52828](https://github.com/pytorch/pytorch/pull/52828)).
* `F.embedding`: Support `__torch_function__` ([#54478](https://github.com/pytorch/pytorch/pull/54478)).
* `nn.ChannelShuffle`: Remove `NamedTensor` warnings ([#55911](https://github.com/pytorch/pytorch/pull/55911)).
* `mkldnn_linear`: Fix incorrect results for non-contiguous inputs ([#51713](https://github.com/pytorch/pytorch/pull/51713)).
* `nn.ModuleList` / `nn.ModuleDict`: Raise `NotImplementedError` for `forward()` ([#48785](https://github.com/pytorch/pytorch/pull/48785)).
* Change `maybe_resize_storage_cpu` `new_size` arg to unsigned ([#52671](https://github.com/pytorch/pytorch/pull/52671)).
* `nn.LSTM`: Fix regression that broke loading older serialized modules ([#57558](https://github.com/pytorch/pytorch/pull/57558)).
* `F.reflection_pad2d`: Fix CUDA launch error ([#56451](https://github.com/pytorch/pytorch/pull/56451)).
* Fix wrong detection of depthwise convolution on NEON ([#55794](https://github.com/pytorch/pytorch/pull/55794)).
* Re-enable fast Winograd convolution on iOS ([#56021](https://github.com/pytorch/pytorch/pull/56021)).
* `gaussian_nll_loss`: Fix incorrect `reduction='none'` behavior ([#56469](https://github.com/pytorch/pytorch/pull/56469)).
* Fix misaligned access (#56325) ([#56403](https://github.com/pytorch/pytorch/pull/56403)).
* Use native CTC loss for target length 256 ([#53557](https://github.com/pytorch/pytorch/pull/53557)).
* `register_full_backward_hook`: Fix crash when the first argument doesn't require a gradient ([#57945](https://github.com/pytorch/pytorch/pull/57945)) (see the sketch after this list).
* Remove asserts of Tensor type and ignore mypy checks to support `__torch_function__` usage ([#57458](https://github.com/pytorch/pytorch/pull/57458)).
* Handle stride > 1 with im2col in CUDA thnn conv2d ([#54080](https://github.com/pytorch/pytorch/pull/54080)).
* Add device ID to `ConvolutionParams` ([#50892](https://github.com/pytorch/pytorch/pull/50892)).
* Enable oneDNN for group convolution ([#54890](https://github.com/pytorch/pytorch/pull/54890)).
* `nn.AdaptiveAveragePooling3d`: Add `AccumulateType` for CUDA ([#53607](https://github.com/pytorch/pytorch/pull/53607)).
* Do not use depthwise3x3 conv in grad mode for ARM ([#56889](https://github.com/pytorch/pytorch/pull/56889)).
* Fix type annotations for `state_dict()` override ([#55704](https://github.com/pytorch/pytorch/pull/55704)).
* Pass contiguous weight to NNPACK convolution ([#56569](https://github.com/pytorch/pytorch/pull/56569)).
* `nn.EmbeddingBag`: Mark backward as non-deterministic for max mode rather than all reducing modes ([#55574](https://github.com/pytorch/pytorch/pull/55574)).
* `nn.EmbeddingBag`: Initialize `bag_size` output with zeros to make it deterministic ([#56661](https://github.com/pytorch/pytorch/pull/56661)).
* `nn.EmbeddingBag`: Support the empty bag case on CPU ([#57446](https://github.com/pytorch/pytorch/pull/57446)).
* Fix `nn.MHA` + `quantized` scriptability ([#58727](https://github.com/pytorch/pytorch/pull/58727)).
* Fix cuDNN performance on A100 ([#58287](https://github.com/pytorch/pytorch/pull/58287), [#59721](https://github.com/pytorch/pytorch/pull/59721), [#59744](https://github.com/pytorch/pytorch/pull/59744), [#59802](https://github.com/pytorch/pytorch/pull/59802)).
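Since one of the fixes above touches `register_full_backward_hook`, here is a minimal sketch of how that hook is typically used; the module, tensor shapes, and hook body are illustrative only and are not taken from the PR. The fix above concerns the case where the first forward argument does not require a gradient, which is reproduced here by passing an input with `requires_grad=False`.

```python
import torch
import torch.nn as nn

# Minimal sketch: attach a full backward hook to a small module.
model = nn.Linear(4, 2)

def log_grad_norms(module, grad_input, grad_output):
    # grad_input / grad_output are tuples; entries may be None
    # (e.g. when an input does not require a gradient).
    in_norms = [g.norm().item() if g is not None else None for g in grad_input]
    out_norms = [g.norm().item() for g in grad_output if g is not None]
    print("grad_input norms:", in_norms, "grad_output norms:", out_norms)

handle = model.register_full_backward_hook(log_grad_norms)

# An input that does not require a gradient -- the situation the fix targets.
x = torch.randn(3, 4)  # requires_grad defaults to False
model(x).sum().backward()

handle.remove()
```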
### Dataloader
* Fixed type hints of the callable DataLoader arguments ([#52924](https://github.com/pytorch/pytorch/pull/52924)).
* Added a keyword arg to meta and support `abc` for typing ([#58450](https://github.com/pytorch/pytorch/pull/58450)).
* Fixed a bug to use `generator` instead of `self.generator` in the `RandomSampler` ([#52956](https://github.com/pytorch/pytorch/pull/52956)).

### C++ API
* Fixed the lifetime of `PyTensorType` ([#51649](https://github.com/pytorch/pytorch/pull/51649)).
* Fixed linker failure with ambiguous namespaces ([#45736](https://github.com/pytorch/pytorch/pull/45736)).
* Fix `Scalar` output formatting ([#53229](https://github.com/pytorch/pytorch/pull/53229))
* Fix printing of optional string arguments in schemas ([#55196](https://github.com/pytorch/pytorch/pull/55196))

### AMD
* Fixed `hipfft` transform type error ([#53411](https://github.com/pytorch/pytorch/pull/53411)).
* Load only `hipfft` for ROCm > 4.1 ([#54349](https://github.com/pytorch/pytorch/pull/54349)).

### CUDA
* Added `torch.scatter_add` to the `torch.cuda.amp` promote list ([#52133](https://github.com/pytorch/pytorch/pull/52133)).
* Fixed segfault in distributed process group due to IPC ([#53080](https://github.com/pytorch/pytorch/pull/53080)).
* Fixed multinomial CUDA misalignment and non-deterministic behavior ([#55364](https://github.com/pytorch/pytorch/pull/55364)).
* Replaced raw `cudaMalloc` in `torch.sparse` code ([#57083](https://github.com/pytorch/pytorch/pull/57083)).
* [CUDA graphs] Added proper sync after replay ([#57556](https://github.com/pytorch/pytorch/pull/57556)).
* Fixed NVRTC versioning for CUDA 11.X (X>=3), CUDA 12 and later ([#57204](https://github.com/pytorch/pytorch/pull/57204)).
* Fixed a correctness issue of CUDA channels-last `nn.SyncBatchNorm` ([#57077](https://github.com/pytorch/pytorch/pull/57077)).
* Fixed CUDA caching allocator when trying to allocate ~2^64 memory ([#57571](https://github.com/pytorch/pytorch/pull/57571)).
* Fixed `raw_deleter()` bug with `PYTORCH_NO_CUDA_MEMORY_CACHING=1` ([#54775](https://github.com/pytorch/pytorch/pull/54775)).
* Fixed undefined symbol for CUDA 11.1 Windows ([#52506](https://github.com/pytorch/pytorch/pull/52506)).
* Automatically set `BUILD_SPLIT_CUDA` for cpp extensions ([#52503](https://github.com/pytorch/pytorch/pull/52503)).
* Adds `grid_sampler` to the list of operations that can autocast `torch.float32` ([#58679](https://github.com/pytorch/pytorch/pull/58679)).

### Dispatcher
* Fix boxing/unboxing for `Scalar` bool values ([#53228](https://github.com/pytorch/pytorch/pull/53228))
* Fix inaccurate dispatch table for `fill_` ([#53611](https://github.com/pytorch/pytorch/pull/53611))
* Fix inaccurate dispatch tables ([#54127](https://github.com/pytorch/pytorch/pull/54127))
* Fix issue with dispatch key: `AutogradXPU` ([#56336](https://github.com/pytorch/pytorch/pull/56336))
* Modify `DispatchKeyExtractor` to also work for optional Tensors ([#58283](https://github.com/pytorch/pytorch/pull/58283))
* Extract dispatch keys from optional Tensors (unboxed) ([#58296](https://github.com/pytorch/pytorch/pull/58296))

### torch.fx
* Preserve leaf modules in `Transformer` ([#51998](https://github.com/pytorch/pytorch/pull/51998)).
* Fix tuple type annotations in FX codebase ([#52010](https://github.com/pytorch/pytorch/pull/52010)).
* Fix type correctness on `GraphModule.graph` ([#54305](https://github.com/pytorch/pytorch/pull/54305)).
* Remove `forward` from `forward.__globals__` to facilitate retracing ([#54011](https://github.com/pytorch/pytorch/pull/54011)).
* Fix `ScriptMethod` dispatch on `__torch_function__` ([#56103](https://github.com/pytorch/pytorch/pull/56103)).
* Fix `type_matches` for `Optional[List[int]]` arguments to make `NormalizeArgs` more permissive ([#56790](https://github.com/pytorch/pytorch/pull/56790)).
* Fix `NormalizeArgs` issues with lists of tensors ([#57004](https://github.com/pytorch/pytorch/pull/57004)).
* Changed parametric type error in `NormalizeArgs` to a warning ([#57183](https://github.com/pytorch/pytorch/pull/57183)).
* Make `NormalizeArgs` not save the output node in the `node_map` ([#58058](https://github.com/pytorch/pytorch/pull/58058)).

### Profiler
* Fixed intermittent CUDA activity flush issue ([pytorch/kineto#95](https://github.com/pytorch/kineto/pull/95)).
* Handled empty trace ([#58013](https://github.com/pytorch/pytorch/pull/58013)).
* Added CUDA synchronization points ([#56651](https://github.com/pytorch/pytorch/pull/56651)).
* Removed usage of `onEachDevice` from the legacy profiler ([#54125](https://github.com/pytorch/pytorch/pull/54125)).
* Fixed double printing of FLOPs ([#56974](https://github.com/pytorch/pytorch/pull/56974)).

### TorchScript
* Fixed `jit.trace` mishandling of `InterfaceType` ([#53052](https://github.com/pytorch/pytorch/pull/53052)).
* Made `reshape`/`flatten` deterministic ([#54353](https://github.com/pytorch/pytorch/pull/54353)).
* Added logic to use `is_buffer` in `BufferPolicy::valid` ([#49588](https://github.com/pytorch/pytorch/pull/49588)).
* Updated NNC to sanitize input names ([#52786](https://github.com/pytorch/pytorch/pull/52786)).
* Handled ExternalCalls in LoadStore analysis and Inliner ([#52628](https://github.com/pytorch/pytorch/pull/52628)).
* Fixed output restriding of size-1 dimensions ([#58256](https://github.com/pytorch/pytorch/pull/58256)).
* Handled non-literal constant bounds in Unroll ([#53029](https://github.com/pytorch/pytorch/pull/53029)).
* Fixed a case where inlining wouldn't work because the dim size was 1 ([#53254](https://github.com/pytorch/pytorch/pull/53254)).
* Removed cached `argv` from `LLVMCodeGen` to fix a race condition ([#54286](https://github.com/pytorch/pytorch/pull/54286)).
* Lowered scalar constants as doubles/longs ([#54824](https://github.com/pytorch/pytorch/pull/54824)).
* Added a check to not try to vectorize kernels that use float16 ([#55970](https://github.com/pytorch/pytorch/pull/55970)).
* Added a check to not fuse `torch.float16` on CPU ([#56119](https://github.com/pytorch/pytorch/pull/56119)).
* Fixed `float->bool` conversion on CPU ([#57798](https://github.com/pytorch/pytorch/pull/57798)).
* Fixed handling of the arguments of `aten::to` ([#58028](https://github.com/pytorch/pytorch/pull/58028)).
* Don’t error on 0-dim in convolution ([#51922](https://github.com/pytorch/pytorch/pull/51922)).
* Allow `__exit__` to have a return value ([#52336](https://github.com/pytorch/pytorch/pull/52336)).
* Added metacompile of ternary if ([#51789](https://github.com/pytorch/pytorch/pull/51789)).
* Keep the graph alive when creating iterators from it ([#51951](https://github.com/pytorch/pytorch/pull/51951)).
* Fixed return value of `IValue::to` for Tensor/String ([#51463](https://github.com/pytorch/pytorch/pull/51463)).
* Added a function to check for memory leaks ([#52342](https://github.com/pytorch/pytorch/pull/52342)).
* Ignore user-annotated ignored attributes ([#52367](https://github.com/pytorch/pytorch/pull/52367)).
* Fixed tracing support for TorchBind ([#52884](https://github.com/pytorch/pytorch/pull/52884)).
* Use the correct warning type for tracer warnings ([#53460](https://github.com/pytorch/pytorch/pull/53460)).
* Removed the assumption that `forward` exists in `freeze_module` ([#52918](https://github.com/pytorch/pytorch/pull/52918)).
* Removed the notion of "level" from `Module::dump_to_str` ([#52539](https://github.com/pytorch/pytorch/pull/52539)).
* Made `IValue::toTensor()` inlineable ([#53213](https://github.com/pytorch/pytorch/pull/53213)).
* Consider `normal_` as a special operation in the remove mutation pass ([#52175](https://github.com/pytorch/pytorch/pull/52175)).
* Updated the `set_stream` API to change the device ([#53741](https://github.com/pytorch/pytorch/pull/53741)).
* Only run the `ReplaceWithCopy` pass when `enable_out_variant` is true ([#54111](https://github.com/pytorch/pytorch/pull/54111)).
* Disable fusion groups that are not supported by the XPU device ([#54239](https://github.com/pytorch/pytorch/pull/54239)).
* Don’t require same-sized `src`/`dest` in `reshape_copy` ([#54467](https://github.com/pytorch/pytorch/pull/54467)).
* Fixed `TupleType.annotation_str` to conform to `typing` module syntax for the empty tuple type ([#54641](https://github.com/pytorch/pytorch/pull/54641)).
* Made NoneType `annotation_str` emit `NoneType` instead of `None` ([#54642](https://github.com/pytorch/pytorch/pull/54642)).
* Made sure the copy version of the op exists in `ReplaceWithCopy` ([#55337](https://github.com/pytorch/pytorch/pull/55337)).
* Included `conv3d` in `conv-add-relu` fusion ([#54772](https://github.com/pytorch/pytorch/pull/54772)).
* Added `conv-add-relu` matching pattern to cover in-place ops ([#55458](https://github.com/pytorch/pytorch/pull/55458)).
* Fixed `TupleType.annotation_str` to conform to `typing` module syntax for the empty tuple type ([#54745](https://github.com/pytorch/pytorch/pull/54745)).
* Fixed `Optional[Tensor]` type in autodiff ([#55565](https://github.com/pytorch/pytorch/pull/55565)).
* Raise `TypeError`s when `IValue::getSubValues` fails ([#56510](https://github.com/pytorch/pytorch/pull/56510)).
* Fixed the number of args for `to_copy` ([#56441](https://github.com/pytorch/pytorch/pull/56441))
* Fixed error in JIT CUDA on ROCm ([#55243](https://github.com/pytorch/pytorch/pull/55243)).
* Fixed a bug in `emitUse` to drop all values that are marked as drop ([#56652](https://github.com/pytorch/pytorch/pull/56652)).
* Fixed the default dtype for `randperm` and `triu`/`tril_indices` inside TorchScript ([#57105](https://github.com/pytorch/pytorch/pull/57105)).
* Don't allow `create()` on singleton types ([#56807](https://github.com/pytorch/pytorch/pull/56807)).
* Fix GIL multithreading issue exposed by `torch::jit::toIValue()` ([#57688](https://github.com/pytorch/pytorch/pull/57688)).
* Fold `NaiveSyncBatchNorm` when folding batch norm ([#57823](https://github.com/pytorch/pytorch/pull/57823)).
* Fix UB in `LoopNest::distribute` ([#57883](https://github.com/pytorch/pytorch/pull/57883)).
* Fix a condition when we use a native depthwise `conv2d` lowering ([#57906](https://github.com/pytorch/pytorch/pull/57906)).
* Ensure `torch.save()` has deterministic output ([#57536](https://github.com/pytorch/pytorch/pull/57536))
* Fixed `hasattr` support type ([#57950](https://github.com/pytorch/pytorch/pull/57950))
* Return `nullptr` if the number of input args doesn't match ([#58018](https://github.com/pytorch/pytorch/pull/58018)).
* Added fix for missing op `aten::sorted.str` ([#58339](https://github.com/pytorch/pytorch/pull/58339)).
* Fixed deadlock in `Future` due to lock inversion with the GIL ([#58382](https://github.com/pytorch/pytorch/pull/58382)).
* Added logic to prevent lock inversions with the GIL in `Future` ([#58391](https://github.com/pytorch/pytorch/pull/58391)).
* Fixed `MKLDNN_add` in-place behavior ([#51687](https://github.com/pytorch/pytorch/pull/51687)).
* Use MKLDNN copy for `copy_` when `self` and `src` have MKLDNN layout ([#54248](https://github.com/pytorch/pytorch/pull/54248)).
* Fixed the default to align with the documentation in `fuser.py` ([#53457](https://github.com/pytorch/pytorch/pull/53457)).
* Fixed handling of upcoming changes that are part of ROCm 4.2 and affect the PyTorch JIT ([#57400](https://github.com/pytorch/pytorch/pull/57400)).
* Fix for improper mobile and `torch.package` serialization ([#59642](https://github.com/pytorch/pytorch/pull/59642)).

### torch.package
* Add `cpython` as a dependency for `torch_python_obj` ([#56740](https://github.com/pytorch/pytorch/pull/56740)).
* Catch exceptions where dependency resolution gets invalid imports ([#58573](https://github.com/pytorch/pytorch/pull/58573)).
* Simplifications to broken dependency handling ([#58572](https://github.com/pytorch/pytorch/pull/58572)).

### Quantization
* Fixed conv packed param serialization in `state_dict` ([#52787](https://github.com/pytorch/pytorch/pull/52787)).
* Fixed `torch.float16` dynamic quantization for functional linear ([#52369](https://github.com/pytorch/pytorch/pull/52369)).
* Fixed prepacking for `F.conv1d` ([#55311](https://github.com/pytorch/pytorch/pull/55311)).
* MHA tensor assignment fix ([#53031](https://github.com/pytorch/pytorch/pull/53031)).
* Fixed `conv` transpose with `qconfig == None` ([#52844](https://github.com/pytorch/pytorch/pull/52844)).
* Quantized norm layers: move scale + zero_point to buffers ([#52861](https://github.com/pytorch/pytorch/pull/52861)).
* Handled the case when the observed node has no users ([#53210](https://github.com/pytorch/pytorch/pull/53210)).
* Only insert observers for fixed qparam ops ([#53330](https://github.com/pytorch/pytorch/pull/53330)).
* Fixed a condition check for `CopyNode` ([#53585](https://github.com/pytorch/pytorch/pull/53585)).
* Fix for `x.ndim` followed by `sub` ([#53120](https://github.com/pytorch/pytorch/pull/53120)).
* Fixed using the size of a quantized layer in `torch._assert` ([#53187](https://github.com/pytorch/pytorch/pull/53187)).
* Fixed FX quantization for `quant_layer -> stack -> sum` ([#53196](https://github.com/pytorch/pytorch/pull/53196)).
* Fixed `deepcopy` on quantized `ConvNd` ([#56154](https://github.com/pytorch/pytorch/pull/56154))
* Fixed `getitem` for unmatched nodes ([#57173](https://github.com/pytorch/pytorch/pull/57173)).
* Made quantizable MHA work with `torch.jit.script` ([#57774](https://github.com/pytorch/pytorch/pull/57774)).
* Fixed `quantize_per_tensor` on CUDA ([#57703](https://github.com/pytorch/pytorch/pull/57703)).
* Fixed a bug to handle bias in rowwise quantization of FC ([#58022](https://github.com/pytorch/pytorch/pull/58022)).
* Skipped inserting observers for boolean Tensors ([#57375](https://github.com/pytorch/pytorch/pull/57375)).
* Fixed `torch.float16` reference patterns for linear ([#55727](https://github.com/pytorch/pytorch/pull/55727)).
* FX Quant:
  * Fixed edge case with CopyNode after a user function ([#55710](https://github.com/pytorch/pytorch/pull/55710)).
  * Fixed subtle bug in `BinaryOpQuantizeHandler` matching logic ([#56294](https://github.com/pytorch/pytorch/pull/56294)).
  * Fixed bug with fusion patterns and disabling quantization ([#54654](https://github.com/pytorch/pytorch/pull/54654)).
* Fixed overflow issue in quantized `instance_norm`/`layer_norm`/`group_norm` ([#54872](https://github.com/pytorch/pytorch/pull/54872)).
* Fixed `zero_point` rounding for `_fake_quantize_learnable_per_channel_affine` ([#52290](https://github.com/pytorch/pytorch/pull/52290)).
* Bug fix to update requantization and zero_point parameters of the input ([#52797](https://github.com/pytorch/pytorch/pull/52797)).
* Fix embedding bag bug accessing unaligned memory ([#53300](https://github.com/pytorch/pytorch/pull/53300)).
* Fix out variant for 4-bit embedding bag ([#55096](https://github.com/pytorch/pytorch/pull/55096)).
* Avoid tensor refcount bumps on embedding bag ([#55023](https://github.com/pytorch/pytorch/pull/55023)).

### Mobile
* Fixed some bugs in the implementation of various functions on iOS GPU:
  * `max_pool_2d` when padding is used ([#52431](https://github.com/pytorch/pytorch/pull/52431)).
  * `softmax` ([#54519](https://github.com/pytorch/pytorch/pull/54519)).
  * binary element-wise ops to handle inputs with different numbers of dimensions ([#58262](https://github.com/pytorch/pytorch/pull/58262)).
* Removed duplication of constant tensors in models when using the Lite interpreter ([#58182](https://github.com/pytorch/pytorch/pull/58182), [#56002](https://github.com/pytorch/pytorch/pull/56002)).
* Banned mutating operators in mobile GPU models ([#56070](https://github.com/pytorch/pytorch/pull/56070)).
* Use the lite interpreter as the default and bump the model version ([#58630](https://github.com/pytorch/pytorch/pull/58630))

### Distributed
`torch.distributed.Store`
* Fix flag specifying whether there is more data for `TCPStore` delete key ([#53886](https://github.com/pytorch/pytorch/pull/53886))
* Properly enforce timeout for `PrefixStore` ([#53928](https://github.com/pytorch/pytorch/pull/53928))
* Fix `TCPStore` `wait` hang when the key was previously set ([#53860](https://github.com/pytorch/pytorch/pull/53860))
* Properly order `TCPStore`’s `compare_set` parameters in the Python API ([#52696](https://github.com/pytorch/pytorch/pull/52696))
* Fix resource leak bug in the `TCPStore` constructor ([#52860](https://github.com/pytorch/pytorch/pull/52860))

`torch.distributed.rpc`
* Several fixes for CUDA support in the RPC framework ([#57926](https://github.com/pytorch/pytorch/pull/57926), [#57432](https://github.com/pytorch/pytorch/pull/57432), [#57394](https://github.com/pytorch/pytorch/pull/57394), [#57443](https://github.com/pytorch/pytorch/pull/57443), [#57487](https://github.com/pytorch/pytorch/pull/57487), [#58384](https://github.com/pytorch/pytorch/pull/58384), [#51820](https://github.com/pytorch/pytorch/pull/51820), [#57792](https://github.com/pytorch/pytorch/pull/57792), [#56895](https://github.com/pytorch/pytorch/pull/56895), [#54932](https://github.com/pytorch/pytorch/pull/54932))
* Fix possible reference cycle by passing a reference to the parent future in RPC callbacks ([#57635](https://github.com/pytorch/pytorch/pull/57635))
* Fix RPC `get_worker_info` for rank 0 ([#52804](https://github.com/pytorch/pytorch/pull/52804))
* Fix crash when the TensorPipe agent tries to double-set errors ([#52837](https://github.com/pytorch/pytorch/pull/52837))

`torch.distributed`
* Fix path handling on Windows during the rendezvous process ([#57000](https://github.com/pytorch/pytorch/pull/57000))
* Fix and re-enable `ProcessGroupMPITest` ([#56709](https://github.com/pytorch/pytorch/pull/56709))

`DistributedDataParallel`
* Correct the usage of `min_compression_rate` in gradient compression communication hooks ([#52979](https://github.com/pytorch/pytorch/pull/52979))
* Fix mapping of parameters to parameter names when certain parameters don’t require a gradient ([#57771](https://github.com/pytorch/pytorch/pull/57771))
* Skip rebuilding buckets in `DistributedDataParallel` when running under `no_grad` mode ([#54159](https://github.com/pytorch/pytorch/pull/54159))
* Fix a race condition in `DistributedDataParallel` when all parameters are used but running with `find_unused_parameters=True` ([#53160](https://github.com/pytorch/pytorch/pull/53160))
* In `DistributedDataParallel`, pass the `process_group` argument into `dist.get_rank` calls ([#53793](https://github.com/pytorch/pytorch/pull/53793))
* Fix `DistributedDataParallel`’s process for verifying model consistency during initialization ([#52887](https://github.com/pytorch/pytorch/pull/52887))

`torch.distributed`
* Check vector boundaries in `torch::cuda::scatter` ([#53057](https://github.com/pytorch/pytorch/pull/53057))
* Release the GIL before destructing `ProcessGroup` classes ([#56381](https://github.com/pytorch/pytorch/pull/56381))

`torch.distributed.pipeline`
* Fix hang in `pipeline` destructor by removing `join_workers` ([#53433](https://github.com/pytorch/pytorch/pull/53433))

`torch.distributed.elastic`
* Resolve a bug around incorrect rendezvous handler resolution ([#56386](https://github.com/pytorch/pytorch/pull/56386))

`torch.nn.SyncBatchNorm`
* Ensure `SyncBatchNorm` behaves like a regular `BatchNorm` layer in eval mode ([#56982](https://github.com/pytorch/pytorch/pull/56982))

`torch.distributed.optim.ZeroRedundancyOptimizer`
* Typing fixes ([#53165](https://github.com/pytorch/pytorch/pull/53165))

* Fix `monitored_barrier` with `wait_all_ranks` ([#58702](https://github.com/pytorch/pytorch/pull/58702)) (see the sketch below).
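For context on the `monitored_barrier` fix above, here is a minimal sketch of how `torch.distributed.monitored_barrier` is typically used with the gloo backend; the world size, address/port, and timeout values are illustrative only.

```python
import os
from datetime import timedelta

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # monitored_barrier is only supported by the gloo backend.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # ... setup / training work would go here ...

    # With wait_all_ranks=True, rank 0 reports *all* ranks that failed to
    # reach the barrier within the timeout, instead of only the first one.
    dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```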
### ONNX
* Removed the last Cast in `pow` symbolic_opset9 ([#52646](https://github.com/pytorch/pytorch/pull/52646)) ([#53305](https://github.com/pytorch/pytorch/pull/53305)).
* Fixed export of the `copy_` operator ([#53046](https://github.com/pytorch/pytorch/pull/53046)) ([#53310](https://github.com/pytorch/pytorch/pull/53310)) ([#51938](https://github.com/pytorch/pytorch/pull/51938)) ([#54870](https://github.com/pytorch/pytorch/pull/54870)).
* Fixed export of embedding with `padding_idx` ([#53053](https://github.com/pytorch/pytorch/pull/53053)) ([#53530](https://github.com/pytorch/pytorch/pull/53530)).
* Fixed ONNX warning message ([#54371](https://github.com/pytorch/pytorch/pull/54371)).
* Improved error message during Glow ONNXIFI ([#58069](https://github.com/pytorch/pytorch/pull/58069)).
* Fixed If output shape mismatch error & graph input directly used as output ([#53219](https://github.com/pytorch/pytorch/pull/53219)) ([#54865](https://github.com/pytorch/pytorch/pull/54865)).
* Fixed `ComputeShapeFromReshape` when `input_shape_size < reshape_size` ([#56171](https://github.com/pytorch/pytorch/pull/56171)).
* Fixed `-Wrange-loop-construct` in `onnx_exporter.cc` ([#56759](https://github.com/pytorch/pytorch/pull/56759)).
* Print `onnxifi` failed status code in a readable format ([#53648](https://github.com/pytorch/pytorch/pull/53648)).

### Vulkan
* Fixed kernel registration errors in Vulkan test and benchmark binaries by adding `nonVarTypeModeGuard` ([#52535](https://github.com/pytorch/pytorch/pull/52535)).
* Fixed the `glslc` path in CMake for desktop builds ([#56507](https://github.com/pytorch/pytorch/pull/56507)).
* Fixed build failures caused by warnings treated as errors for Linux builds ([#52781](https://github.com/pytorch/pytorch/pull/52781)).
* Remove constant duplication for Vulkan `optimize_for_mobile` ([#59276](https://github.com/pytorch/pytorch/pull/59276)).

### Benchmark
* Fix timer overflow on small, fast snippets ([#55200](https://github.com/pytorch/pytorch/pull/55200))

### Misc
* [memory format] Fixed channels-last bug in upsample kernels to now correctly pass `memory_format` information from the input to the output tensors ([#53535](https://github.com/pytorch/pytorch/pull/53535)).
* [memory format] Fixed silent correctness bug for CUDA upsample kernels to correctly handle `torch.channels_last` contiguous tensors ([#54744](https://github.com/pytorch/pytorch/pull/54744)).
* Work around an intermittent gcc 7.5 ICE in cpp tests ([#57016](https://github.com/pytorch/pytorch/pull/57016)).
* Improve build quality on Windows ([#52729](https://github.com/pytorch/pytorch/pull/52729), [#53562](https://github.com/pytorch/pytorch/pull/53562), [#54132](https://github.com/pytorch/pytorch/pull/54132), [#55275](https://github.com/pytorch/pytorch/pull/55275)).
* Search for static OpenBLAS compiled with OpenMP ([#59428](https://github.com/pytorch/pytorch/pull/59428)).

# Performance
### Python API
* Optimized memory usage for the `out=` version of `torch.logsumexp` ([#51239](https://github.com/pytorch/pytorch/pull/51239)).
* Added vectorization for `torch.floor_divide` ([#55380](https://github.com/pytorch/pytorch/pull/55380)).
* Reimplemented `torch.flip()` using advanced indexing ([#56713](https://github.com/pytorch/pytorch/pull/56713)).
* Improved performance for `torch.take()` and `torch.Tensor.put_` on both CPU and CUDA ([#53356](https://github.com/pytorch/pytorch/pull/53356))
* Generic performance improvement for operations performed on non-contiguous 2-dimensional tensors ([#53613](https://github.com/pytorch/pytorch/pull/53613)).
* Added vectorization for `torch.copysign` on CPU ([#51792](https://github.com/pytorch/pytorch/pull/51792)).
* Improved performance for bilinear interpolation on CPU ([#51653](https://github.com/pytorch/pytorch/pull/51653)).
* Improved performance for backward computations on `torch.cumsum` and `torch.cumprod` on both CPU and CUDA ([#53711](https://github.com/pytorch/pytorch/pull/53711)).
* Improved performance for `torch.Tensor.copy_` when performing copies between small tensors of `torch.float` and `torch.half` data types ([#53800](https://github.com/pytorch/pytorch/pull/53800)).
* Enabled vectorization for `torch.Tensor.copy_` and `torch.cat` for BFloat16 tensors ([#54671](https://github.com/pytorch/pytorch/pull/54671), [#54674](https://github.com/pytorch/pytorch/pull/54674)).
* Added a fast path for a common case for `torch.addmm` on CUDA ([#55026](https://github.com/pytorch/pytorch/pull/55026)).
* In collaboration with NVIDIA, the CUDA performance of many linear algebra operations has been improved by increasing use of the cuSOLVER and cuBLAS libraries:
  * Added cuBLAS support for `torch.triangular_solve` ([#53147](https://github.com/pytorch/pytorch/pull/53147)) and batched `torch.geqrf` ([#56253](https://github.com/pytorch/pytorch/pull/56253)).
  * Added cuSOLVER support for `torch.linalg.eigh/eigvalsh` ([#53040](https://github.com/pytorch/pytorch/pull/53040)), `torch.cholesky_solve` ([#54315](https://github.com/pytorch/pytorch/pull/54315)), `torch.cholesky_inverse` ([#54676](https://github.com/pytorch/pytorch/pull/54676)), and `torch.linalg.qr` ([#56256](https://github.com/pytorch/pytorch/pull/56256)).
  * Added cuBLAS and cuSOLVER support for `torch.linalg.lstsq` ([#57317](https://github.com/pytorch/pytorch/pull/57317)).
* Improved performance for `torch.nonzero` ([#58468](https://github.com/pytorch/pytorch/pull/58468)).
* Removed device check from a few indexing methods ([#58800](https://github.com/pytorch/pytorch/pull/58800)).

### Complex Numbers
* Added a faster path for `torch.is_complex()` by skipping unnecessary dispatch ([#50054](https://github.com/pytorch/pytorch/pull/50054)).

### Autograd
* Sped up autograd’s graph discovery algorithm by skipping some nodes using the sequence number ([#52180](https://github.com/pytorch/pytorch/pull/52180), [#52057](https://github.com/pytorch/pytorch/pull/52057)).
* Added a new fast gradcheck ([#54480](https://github.com/pytorch/pytorch/pull/54480)).

### torch.nn
* `Module.forward`: Add fast path for the case of no hooks ([#52576](https://github.com/pytorch/pytorch/pull/52576)).
* Fix `mkldnn` heuristic for multithreaded convolution ([#52909](https://github.com/pytorch/pytorch/pull/52909)).
* `linear`: Remove one refcount bump ([#54936](https://github.com/pytorch/pytorch/pull/54936)).
* Improve `native_batch_norm_backward` performance on CUDA ([#58240](https://github.com/pytorch/pytorch/pull/58240)).
* `nll_loss`: Use cascade summation on CPU ([#55841](https://github.com/pytorch/pytorch/pull/55841)).
* `nn.BatchNorm1d`: Improve training performance on CPU ([#57033](https://github.com/pytorch/pytorch/pull/57033)).
* Simplify convolution double backward gradInput formulas ([#54840](https://github.com/pytorch/pytorch/pull/54840)).
* Move RNN cell size check to C++ ([#51964](https://github.com/pytorch/pytorch/pull/51964)).
* Remove syncs in `one_hot` ([#57902](https://github.com/pytorch/pytorch/pull/57902)).
* Enable and enhance bf16 threshold ([#54384](https://github.com/pytorch/pytorch/pull/54384)).
* `nn.Conv3d`: Enable `channels_last_3d` for cuDNN ([#48430](https://github.com/pytorch/pytorch/pull/48430)).
* Increase the token count threshold for calling thrust sort in embedding backward ([#49913](https://github.com/pytorch/pytorch/pull/49913)).
* CPU convolution benchmark harness for some popular models ([#56455](https://github.com/pytorch/pytorch/pull/56455)).
* Improved performance for `torch.nn.BatchNorm1d` on both CPU and CUDA ([#57033](https://github.com/pytorch/pytorch/pull/57033), [#57786](https://github.com/pytorch/pytorch/pull/57786)).
* Added optimized generic interpolation for `torch.nn.functional.{upsample_nearest, upsample_bicubic}` and sped up the channels-first and channels-last cases ([#54500](https://github.com/pytorch/pytorch/pull/54500)).
* Added shape documentation for `CosineEmbeddingLoss` ([#58403](https://github.com/pytorch/pytorch/pull/58403)).

### C++ API
* Fixed nested OpenMP performance bug in `thnn_conv2d` ([#52577](https://github.com/pytorch/pytorch/pull/52577)).
* Added `c10::MaybeOwned` and `Tensor::expect_contiguous` ([#53317](https://github.com/pytorch/pytorch/pull/53317))
* Added `DimVector` variant of `infer_size` ([#54882](https://github.com/pytorch/pytorch/pull/54882))
* Added logic to use `DimVector` for inputs to `as_strided` that don't grow dim ([#55016](https://github.com/pytorch/pytorch/pull/55016)).
* Reduce ref-counting by borrowing in/out Tensors in `TensorIterator` ([#55690](https://github.com/pytorch/pytorch/pull/55690)).
* Reduce ref-counting by migrating add operators to borrow Tensors in `TensorIteratorBase` ([#55691](https://github.com/pytorch/pytorch/pull/55691)).
* Reduce ref-counting by migrating `copy_` operators to borrow input/output Tensors ([#56031](https://github.com/pytorch/pytorch/pull/56031)).
* Added logic to use `expect_contiguous` in `layer_norm` ([#58067](https://github.com/pytorch/pytorch/pull/58067)).

### CUDA
* Construct only necessary elements in `OffsetCalculator` ([#55107](https://github.com/pytorch/pytorch/pull/55107)).
* Migrated `torch.index_put` to use cub instead of thrust ([#55693](https://github.com/pytorch/pytorch/pull/55693)).
* Added cuSOLVER `potrf` and `potrfBatched` to the backend of `torch.cholesky_decomposition` ([#53104](https://github.com/pytorch/pytorch/pull/53104)).
* Implemented `torch.sort` with `cub::DeviceSegmentedRadixSort` ([#56821](https://github.com/pytorch/pytorch/pull/56821)).
* Added cuSOLVER path for `torch.geqrf` ([#56252](https://github.com/pytorch/pytorch/pull/56252)).
* Enabled cuSOLVER `torch.potrf` batched for Cholesky decomposition when CUDA >= 11.3 ([#57788](https://github.com/pytorch/pytorch/pull/57788)).
* Fewer CUDA syncs in `unique` by using cub instead of thrust ([#57323](https://github.com/pytorch/pytorch/pull/57323)).
* Removed sync for `randperm` on small tensors ([#54113](https://github.com/pytorch/pytorch/pull/54113)).
* Simplify convolution double backward gradInput formulas ([#54840](https://github.com/pytorch/pytorch/pull/54840)).

### Composability
We’ve landed lots of performance optimizations for 1.9, both large and small.
See individual PRs for details:
* Inline `tensor.device()` ([#50848](https://github.com/pytorch/pytorch/pull/50848))
* Skip a second call to `shouldUseRecordFunction` for BackendSelect ops ([#50891](https://github.com/pytorch/pytorch/pull/50891))
* Re-order `TensorImpl` fields to save a word ([#50920](https://github.com/pytorch/pytorch/pull/50920))
* Devirtualize `TensorImpl::storage()` ([#51050](https://github.com/pytorch/pytorch/pull/51050))
* Reduce template expansion in `call_functor_with_args_from_stack` (build time) ([#51313](https://github.com/pytorch/pytorch/pull/51313))
* Eliminate `WrapFunctionIntoRuntimeFunctor` use in CppFunction constructors ([#51315](https://github.com/pytorch/pytorch/pull/51315))
* Remove `reference_cast` in `make_boxed_from_unboxed_functor` (build time) ([#51319](https://github.com/pytorch/pytorch/pull/51319))
* Debug-gate `static_assert` in `KernelFunction::makeFromUnboxedFunctor` (build time) ([#51367](https://github.com/pytorch/pytorch/pull/51367))
* Use real `if constexpr` behind macro in hot template (build time) ([#51368](https://github.com/pytorch/pytorch/pull/51368), [#52420](https://github.com/pytorch/pytorch/pull/52420))
* Outline `DispatchStub::get_call_ptr()` ([#51908](https://github.com/pytorch/pytorch/pull/51908))
* Use `torchCheckFail` in `TORCH_INTERNAL_ASSERT` ([#52086](https://github.com/pytorch/pytorch/pull/52086))
* Add `Storage::set_data_ptr_noswap` and use where possible ([#52244](https://github.com/pytorch/pytorch/pull/52244))
* Make shared empty string static instead of `thread_local` ([#52220](https://github.com/pytorch/pytorch/pull/52220))
* Avoid `std::string` in `TORCH_CHECK` when possible ([#52221](https://github.com/pytorch/pytorch/pull/52221))
* Make `c10::str(const char*)` return `const char*` ([#52222](https://github.com/pytorch/pytorch/pull/52222))
* Sync `TORCH_INTERNAL_ASSERT` optimizations with `TORCH_CHECK` ([#52226](https://github.com/pytorch/pytorch/pull/52226))
* Save a single add instruction in the dispatcher ([#52543](https://github.com/pytorch/pytorch/pull/52543))
* Inline `TensorIteratorConfig` setters ([#52661](https://github.com/pytorch/pytorch/pull/52661))
* Use `DimVector` for sizes and strides in `view` ([#53001](https://github.com/pytorch/pytorch/pull/53001))
* Avoid TLS in `has_names` ([#53003](https://github.com/pytorch/pytorch/pull/53003))
* Don't inline `Dispatcher::call` on mobile (binary size) ([#53197](https://github.com/pytorch/pytorch/pull/53197))
* Skip dispatch for `is_floating_point` ([#53242](https://github.com/pytorch/pytorch/pull/53242))
* Move non-template part of `TensorImpl::Resize` to cpp (binary size, build time) ([#53388](https://github.com/pytorch/pytorch/pull/53388))
* Don't copy vector arguments to `Tensor::Resize` ([#53389](https://github.com/pytorch/pytorch/pull/53389))
* Skip dispatch trip for CPU in `resize_` ([#53575](https://github.com/pytorch/pytorch/pull/53575))
* Pass `Scalar` by reference ([#53583](https://github.com/pytorch/pytorch/pull/53583))
* Don't use static for template declarations in headers (binary size) ([#53602](https://github.com/pytorch/pytorch/pull/53602))
* Boxing logic forwards arguments to stack ([#53624](https://github.com/pytorch/pytorch/pull/53624))
* Speed up `Tensor::data_ptr` by using static item size ([#53723](https://github.com/pytorch/pytorch/pull/53723))
* Skip dispatch for `is_signed` ([#53847](https://github.com/pytorch/pytorch/pull/53847))
* Allow inlining of more Tensor methods ([#53905](https://github.com/pytorch/pytorch/pull/53905))
* `Tensor::register_hook`: Avoid wrapping hook in two levels of `std::function` ([#53917](https://github.com/pytorch/pytorch/pull/53917))
* Take advantage of string literals in `TORCH_WARN` ([#54032](https://github.com/pytorch/pytorch/pull/54032))
* Inline `Tensor` keyset-checking methods & similar getters ([#54806](https://github.com/pytorch/pytorch/pull/54806))
* `TensorIterator::output` returns const reference ([#54811](https://github.com/pytorch/pytorch/pull/54811))
* Avoid refcount bump in `TensorArg` ([#54934](https://github.com/pytorch/pytorch/pull/54934))
* Move `Tensor::has_names` inline ([#54965](https://github.com/pytorch/pytorch/pull/54965))
* `OperandInfo` ctor should take rvalue reference ([#54972](https://github.com/pytorch/pytorch/pull/54972))
* Don't bother with `SmallVector` in `TensorMaker` ([#55125](https://github.com/pytorch/pytorch/pull/55125))
* Eliminate device guard in generic dispatch key kernel wrappers ([#55131](https://github.com/pytorch/pytorch/pull/55131))
* Move logic to skip a redispatch directly inside of `resize_output` ([#55162](https://github.com/pytorch/pytorch/pull/55162))
* Use `infer_size_dimvector` in `ExpandUtils` ([#55180](https://github.com/pytorch/pytorch/pull/55180))
* Don't create intermediate Tensor for `at::result_type` w/ Scalar ([#55232](https://github.com/pytorch/pytorch/pull/55232))
* Use `sizes()[x]` instead of `size(x)` in `addr` ([#55247](https://github.com/pytorch/pytorch/pull/55247))
* Add & use `inferExpandGeometry_dimvector` ([#55316](https://github.com/pytorch/pytorch/pull/55316))
* Mark borrowed case as `C10_LIKELY` in `MaybeOwned` ([#55553](https://github.com/pytorch/pytorch/pull/55553))
* Avoid double indirection in `MaybeOwned`'s borrowed state ([#55685](https://github.com/pytorch/pytorch/pull/55685))
* Make `VariableVersion::DISABLED` the default constructor for `VariableVersion` ([#55572](https://github.com/pytorch/pytorch/pull/55572))
* Don't set `version_counter` on inference tensors for `unsafe_` ops ([#55819](https://github.com/pytorch/pytorch/pull/55819))
* Add & document `borrow_from_optional_tensor` ([#56647](https://github.com/pytorch/pytorch/pull/56647))
* Migrate hacky wrapper removal to `borrow_from_optional_tensor` ([#56648](https://github.com/pytorch/pytorch/pull/56648))
* Optimize `at::repeat` ([#56994](https://github.com/pytorch/pytorch/pull/56994))
* Optimize `intrusive_ptr(TTarget*)` ctor (`pybind`) ([#57053](https://github.com/pytorch/pytorch/pull/57053))

### torch.fx
* Use precompiled regex in graph name processing ([#52853](https://github.com/pytorch/pytorch/pull/52853)).
* Optimize module path finding in `Tracer` ([#52990](https://github.com/pytorch/pytorch/pull/52990)).
* Speed up `_Namespace.create_name` ([#55580](https://github.com/pytorch/pytorch/pull/55580)).

### Profiler
* Sped up post-processing ([#58021](https://github.com/pytorch/pytorch/pull/58021)).

### TorchScript
* Generate arithmetic vs. logical right shift as appropriate ([#51749](https://github.com/pytorch/pytorch/pull/51749))
* Introduced likely/unlikely `CompareSelect` hint ([#51751](https://github.com/pytorch/pytorch/pull/51751)).
* Implemented log approximation using the VML approach ([#51752](https://github.com/pytorch/pytorch/pull/51752)).
* Updated `TensorExpr` to use `LLVM` as the default backend ([#52314](https://github.com/pytorch/pytorch/pull/52314)).
* Added support for `aten::hardtanh` (a hot operation in MobileNet v2/v3) ([#52394](https://github.com/pytorch/pytorch/pull/52394))
* Implemented `hardtanh` ([#57750](https://github.com/pytorch/pytorch/pull/57750)).
* Add `aten::batch_norm` into the fuser when in inference mode ([#54204](https://github.com/pytorch/pytorch/pull/54204)).
* NNC:
  * Added a new API to perform loop fusion ([#54461](https://github.com/pytorch/pytorch/pull/54461)).
  * Implemented depthwise `conv2d` ([#54920](https://github.com/pytorch/pytorch/pull/54920)).
  * Integrated NNC `conv2d` with the fuser ([#55213](https://github.com/pytorch/pytorch/pull/55213)).
  * Added logic to use NNC to generate `logit`, `relu` and `tanh` ([#52322](https://github.com/pytorch/pytorch/pull/52322)).
  * Use VML-inspired logarithm with NNC, tweak scheduling ([#52423](https://github.com/pytorch/pytorch/pull/52423)).
  * Generate `sigmoid` with NNC ([#52424](https://github.com/pytorch/pytorch/pull/52424)).
  * Enabled CPU fusion only when `num_threads == 1` ([#56120](https://github.com/pytorch/pytorch/pull/56120)).
  * Use NNC's `call_raw` API to reduce call overheads ([#57553](https://github.com/pytorch/pytorch/pull/57553)).
  * Started codegen’ing some external calls ([#58118](https://github.com/pytorch/pytorch/pull/58118)).
* Reduce memory use for the inference path in `OneDNN MaxPooling` ([#52728](https://github.com/pytorch/pytorch/pull/52728)).
* Removed redundant `gather_ranges` when fusing ([#53323](https://github.com/pytorch/pytorch/pull/53323)).
* Optimized `sigrid_hash` ([#53065](https://github.com/pytorch/pytorch/pull/53065)).
* Updated `create_empty_from` to directly use the native version of `at::empty` ([#53216](https://github.com/pytorch/pytorch/pull/53216)).
* Added a minimum fusion group size ([#50217](https://github.com/pytorch/pytorch/pull/50217)).
* Added cuDNN `Conv-Add-Relu` fusion for frozen model optimization ([#52102](https://github.com/pytorch/pytorch/pull/52102)).
* Avoid dispatch overhead in calls to MKLDNN convolution ([#52614](https://github.com/pytorch/pytorch/pull/52614)).
* Added re-inplacing to MKLDNN subgraphs ([#53908](https://github.com/pytorch/pytorch/pull/53908)).
* Set `requires_gradient` to help autodiff prune unneeded gradients ([#54374](https://github.com/pytorch/pytorch/pull/54374)).
* Use a type cache when erasing shape information ([#55828](https://github.com/pytorch/pytorch/pull/55828)).
* Added a heuristic to avoid perf-incompatible MKLDNN formats for binary ops ([#56089](https://github.com/pytorch/pytorch/pull/56089))
* Added `adaptive_avgpool2d` to the set of fusible ops ([#56180](https://github.com/pytorch/pytorch/pull/56180)).
* Lazily initialize `AliasDb` in the `remove_mutation` optimization ([#55949](https://github.com/pytorch/pytorch/pull/55949))
* Made `DataPtr` extraction in `CUDAFuture` faster for Python values ([#56918](https://github.com/pytorch/pytorch/pull/56918)).
* Lazily initialize `AliasDb` in DCE ([#56649](https://github.com/pytorch/pytorch/pull/56649)).
* Add explicit checks for in-place ops in `ReplaceWithCopy` ([#54657](https://github.com/pytorch/pytorch/pull/54657)).

### Quantization
* Optimized quantized `torch.cat` ([#54813](https://github.com/pytorch/pytorch/pull/54813)).

### Mobile
* Enabled `QNNPACK` for Apple Silicon builds ([#52308](https://github.com/pytorch/pytorch/pull/52308)).
* Sped up model loading for per-channel quantized models using `QNNPACK` ([#53726](https://github.com/pytorch/pytorch/pull/53726)).
* Added `XNNPACK` implementations for various operations (`hardswish`, global average pool) ([#56714](https://github.com/pytorch/pytorch/pull/56714), [#56715](https://github.com/pytorch/pytorch/pull/56715), [#55791](https://github.com/pytorch/pytorch/pull/55791)).
* Made various performance improvements for iOS GPU (Metal) ([#57664](https://github.com/pytorch/pytorch/pull/57664), [#57665](https://github.com/pytorch/pytorch/pull/57665), [#57666](https://github.com/pytorch/pytorch/pull/57666), [#57667](https://github.com/pytorch/pytorch/pull/57667), [#57668](https://github.com/pytorch/pytorch/pull/57668)).

### Distributed
`torch.distributed`
* Avoid 2 extra copies when reducing sparse tensors ([#57822](https://github.com/pytorch/pytorch/pull/57822))

### Vulkan
* Switched to a more performant implementation of matrix multiplication ([#49609](https://github.com/pytorch/pytorch/pull/49609)).
* Updated the version of Vulkan Memory Allocator used ([#52938](https://github.com/pytorch/pytorch/pull/52938)).
* Increased the command buffer submission rate ([#57196](https://github.com/pytorch/pytorch/pull/57196)).
* Updated the Vulkan tensors to use 2D textures whenever possible, instead of always using 3D textures ([#57198](https://github.com/pytorch/pytorch/pull/57198)).
* Updated convolution shaders to receive the bias tensor as a texture as opposed to a buffer ([#57201](https://github.com/pytorch/pytorch/pull/57201)).

# Docs
### Python API
* Added `torch.testing` docs ([#57247](https://github.com/pytorch/pytorch/pull/57247)).
* Updated docs to mention CUDA support for `Future` ([#50048](https://github.com/pytorch/pytorch/pull/50048)).
* Included `memory_format`, an already accepted argument, in the `torch.empty` doc ([#54664](https://github.com/pytorch/pytorch/pull/54664)).
* Improved the documentation for `torch.matrix_exp()` ([#55626](https://github.com/pytorch/pytorch/pull/55626)).
* Updated `use_deterministic_algorithms` docs ([#55413](https://github.com/pytorch/pytorch/pull/55413)).
* Added the `generator` argument to the `torch.rand` and `torch.randn` docs ([#56242](https://github.com/pytorch/pytorch/pull/56242)).
* Added an example to show how to use learning rate schedulers in optimizers ([#56705](https://github.com/pytorch/pytorch/pull/56705)).
* Corrected the `torch.ceil` formula in docs ([#55039](https://github.com/pytorch/pytorch/pull/55039))
* Fixed docs to use autosummary on `tensors.rst` ([#55042](https://github.com/pytorch/pytorch/pull/55042))
* Improved testing documentation in `CONTRIBUTING.md` ([#54904](https://github.com/pytorch/pytorch/pull/54904))
* Updated `torch.fft` docs to include the `out=` argument ([#56732](https://github.com/pytorch/pytorch/pull/56732)).
* Updated `rounding_mode` documentation to remove `"true"` ([#52202](https://github.com/pytorch/pytorch/pull/52202)).
* Added a note about error handling for non-chained futures ([#53212](https://github.com/pytorch/pytorch/pull/53212)).
* Updated `torch.stft` documentation to clarify the output shape ([#54877](https://github.com/pytorch/pytorch/pull/54877)).
* Added an example for `torch.is_tensor` and `torch.is_storage` ([#55052](https://github.com/pytorch/pytorch/pull/55052)).

### Autograd
* Added a note describing gradcheck internals ([#55966](https://github.com/pytorch/pytorch/pull/55966)).
* Split up autograd documentation into separate pages ([#55672](https://github.com/pytorch/pytorch/pull/55672)).
* `torch.utils.checkpoint` : Updated docs to state that `input` flag in `.backward()` is disallowed when checkpointing ([#51746](https://github.com/pytorch/pytorch/pull/51746)). * Added section in autograd mechanics note describing how to use inference/no_grad ([#58513](https://github.com/pytorch/pytorch/pull/58513)). * Added doc string for `torch.is_inference_mode_enabled` and `torch.is_grad_enabled` ([#59047](https://github.com/pytorch/pytorch/pull/59047)). * Added no-grad inference mode note ([#58513](https://github.com/pytorch/pytorch/pull/58513)). * Add docstring for is_inference_mode_enabled ([#59047](https://github.com/pytorch/pytorch/pull/59047)). ### torch.nn * `nn.TripletMarginLoss` / `torch.reciprocal`: Fix formatting in docs ([#51650](https://github.com/pytorch/pytorch/pull/51650)) * `nn.FractionalMaxPool3d`: Add to pooling layer docs ([#52556](https://github.com/pytorch/pytorch/pull/52556)) * `F.fractional_max_pool`: Add to `nn.functional` docs ([#52557](https://github.com/pytorch/pytorch/pull/52557)) * `Module.share_memory`: Add link to `Tensor.share_memory_` in docs ([#52561](https://github.com/pytorch/pytorch/pull/52561)) * `nn.SiLU`: Mention alternative name of Swish within docs ([#53239](https://github.com/pytorch/pytorch/pull/53239)) * Remove redundant hardsigmoid() in docstring to show up `inplace` parameter ([#52559](https://github.com/pytorch/pytorch/pull/52559)) * Clarify docs for lazy modules ([#53495](https://github.com/pytorch/pytorch/pull/53495)) * `torch.nn`: Grammatically update docs ([#54370](https://github.com/pytorch/pytorch/pull/54370)) * `nn.Sequential`: Expand docs, including comparison with `nn.ModuleList` ([#53380](https://github.com/pytorch/pytorch/pull/53380)) * `F.embedding_bag`: Fix formatting in docs ([#54666](https://github.com/pytorch/pytorch/pull/54666)) * `F.group_norm`: Add to docs ([#54673](https://github.com/pytorch/pytorch/pull/54673)) * Add separate autosummary for flatten layer docs ([#54663](https://github.com/pytorch/pytorch/pull/54663)) * `LazyModuleMixin`: Add missing attr in docs to improve formatting ([#53363](https://github.com/pytorch/pytorch/pull/53363)) * `conv1d`: Fix example error in docs ([#57356](https://github.com/pytorch/pytorch/pull/57356)) * `nn.functional`: Split docs into a table-of-contents page and a sub-page per function ([#55038](https://github.com/pytorch/pytorch/pull/55038)) * `nn.LSTM` / `nn.RNN` / `nn.GRU`: Clarify `batch_first` behavior ([#58809](https://github.com/pytorch/pytorch/pull/58809)) * `nn.CosineEmbeddingLoss`: Add shape info to docs ([#58403](https://github.com/pytorch/pytorch/pull/58403)) * Add doc warnings for default SELU gain ([#54057](https://github.com/pytorch/pytorch/pull/54057)). * Clarify batch_first behavior for `nn.LSTM, nn.RNN, and nn.GRU` ([#58809](https://github.com/pytorch/pytorch/pull/58809)). * Add UninitializedBuffer to nn docs ( [#59021](https://github.com/pytorch/pytorch/pull/59021)). * Document factory_kwargs in nn.Quantize + remove Attributes section ([#59025](https://github.com/pytorch/pytorch/pull/59025)). ### Dataloader * Added DataPipes Typing Doc ([#54773](https://github.com/pytorch/pytorch/pull/54773)). * Added docs to document the default NumPy seed for DataLoader workers ([#56528](https://github.com/pytorch/pytorch/pull/56528)). ### AMD * Added HIP semantics doc ([#57871](https://github.com/pytorch/pytorch/pull/57871)). 
### CUDA * Added `scatter_add` to amp docs ([#54908](https://github.com/pytorch/pytorch/pull/54908)) * Added `reset_peak_memory_stats` in cuda.rst ([#54668](https://github.com/pytorch/pytorch/pull/54668)). ### torch.fx * Make some modifications to limitation section ([#51928](https://github.com/pytorch/pytorch/pull/51928)) * Added docstring for concrete_args on `Tracer.trace` ([#53151](https://github.com/pytorch/pytorch/pull/53151)). * Change Dynamic Control Flow example to a *more* dynamic version ([#53250](https://github.com/pytorch/pytorch/pull/53250)). * Render inherited methods in fx.Tracer API reference ([#53630](https://github.com/pytorch/pytorch/pull/53630)). * Add docs for `ShapeProp` ([#54554](https://github.com/pytorch/pytorch/pull/54554)). * Hide module paths leaking in the documentation. ([#54585](https://github.com/pytorch/pytorch/pull/54585)). ### Profiler * Updated profiler recipe doc (https://github.com/pytorch/tutorials/pull/1528). ### TorchScript * Added NNC IR specification ([#52912](https://github.com/pytorch/pytorch/pull/52912)). * Added starter content for new TorchScript language reference ([#53837](https://github.com/pytorch/pytorch/pull/53837)). * Added documentation for `torch.jit.Attribute` and `torch.jit.annotate` ([#54485](https://github.com/pytorch/pytorch/pull/54485)). * Updated TorchScript language reference section for types ([#53673](https://github.com/pytorch/pytorch/pull/53673)). * Documented the TorchScript type system ([#53244](https://github.com/pytorch/pytorch/pull/53244)). * Added language reference for Python builtin functions, statements, and values in TorchScript ([#52847](https://github.com/pytorch/pytorch/pull/52847), [#52830](https://github.com/pytorch/pytorch/pull/52830)). * Added `torch.*` API section for TorchScript language reference ([#53236](https://github.com/pytorch/pytorch/pull/53236)). * Added “Conditionals in TE” doc ([#56949](https://github.com/pytorch/pytorch/pull/56949)). ### torch.package * Added API reference ([#55812](https://github.com/pytorch/pytorch/pull/55812), [#56547](https://github.com/pytorch/pytorch/pull/56547)). * Add explanation, tutorial, and preamble sections for `torch.package` ([#59833](https://github.com/pytorch/pytorch/pull/59833), [#59503](https://github.com/pytorch/pytorch/pull/59503), [#59499](https://github.com/pytorch/pytorch/pull/59499), [#59491](https://github.com/pytorch/pytorch/pull/59491), [#59842](https://github.com/pytorch/pytorch/pull/59842), [#59843](https://github.com/pytorch/pytorch/pull/59843), [#59602](https://github.com/pytorch/pytorch/pull/59602)). * Add pickle security warning to package docs ([#59959](https://github.com/pytorch/pytorch/pull/59959)). ### Quantization * Added docs for storage and tensors for quantized Tensor ([#51817](https://github.com/pytorch/pytorch/pull/51817)). * Fixed FX Graph Mode Quantization tutorial link ([#54715](https://github.com/pytorch/pytorch/pull/54715)). * Added fx graph mode quant api doc ([#55306](https://github.com/pytorch/pytorch/pull/55306)). * FX Graph Mode Quantization - fixed preamble ([#52192](https://github.com/pytorch/pytorch/pull/52192)). * Fixed broken link to fx graph quant guide in quantization.rst ([#56776](https://github.com/pytorch/pytorch/pull/56776)). ### Mobile * Added doc string for lite interpreter related API in Android ([#53136](https://github.com/pytorch/pytorch/pull/53136)). * Improved `export_opnames` Documentation ([#52333](https://github.com/pytorch/pytorch/pull/52333)). 
### Distributed `torch.distributed.Store` * Documentation for TCPStore’s `compare_set` API ([#57203](https://github.com/pytorch/pytorch/pull/57203)) `torch.distributed.optim` * Update distributed optimizer documentation ([#58084](https://github.com/pytorch/pytorch/pull/58084)) * Update and expose ZeroRedundancyOptimizer docs ([#53112](https://github.com/pytorch/pytorch/pull/53112), [#53113](https://github.com/pytorch/pytorch/pull/53113)) `torch.distributed.elastic` * Upstream `torchelastic` documentation to PyTorch. ([#56811](https://github.com/pytorch/pytorch/pull/56811)) * Revise the note section of RendezvousHandler doc ([#57723](https://github.com/pytorch/pytorch/pull/57723)) * Update the rendezvous documentation ([#57973](https://github.com/pytorch/pytorch/pull/57973)) `DistributedDataParallel` * Add register_comm_hook API to DDP communication hooks documentation page ([#51846](https://github.com/pytorch/pytorch/pull/51846),[](https://github.com/pytorch/pytorch/pull/51986)[#51986](https://github.com/pytorch/pytorch/pull/51986)) * Enhance documentation around `DistributedDataParallel` uneven input support ([#57448](https://github.com/pytorch/pytorch/pull/57448)) * Enhance communication hook documentation ([#58170](https://github.com/pytorch/pytorch/pull/58170), [#58168](https://github.com/pytorch/pytorch/pull/58168), [#53253](https://github.com/pytorch/pytorch/pull/53253), [#53855](https://github.com/pytorch/pytorch/pull/53855), [#53596,](https://github.com/pytorch/pytorch/pull/53596)[#53955](https://github.com/pytorch/pytorch/pull/53955), [#54052](https://github.com/pytorch/pytorch/pull/54052). [#55031](https://github.com/pytorch/pytorch/pull/55031)) `torch.distributed.rpc` * Add a disclaimer about limited CUDA support in RPC ([#58023](https://github.com/pytorch/pytorch/pull/58023)) * `torch.distributed.rpc`: Add a link to the tutorial in RemoteModule docstring ([#57875](https://github.com/pytorch/pytorch/pull/57875)) * `torch.distributed.rpc`: Mentioned `RemoteModule` in RPC documentation ([#57876](https://github.com/pytorch/pytorch/pull/57876)) `torch.distributed.nn.RemoteModule` * Add RemoteModule to master RPC docs. ([#53084](https://github.com/pytorch/pytorch/pull/53084)) * Add `remote_parameters` and `get_module_rref` to RemoteModule docs. ([#54645](https://github.com/pytorch/pytorch/pull/54645)) `torch.distributed.pipeline` * Enhance Pipe docs to explicitly mention RPC initialization. ([#55187](https://github.com/pytorch/pytorch/pull/55187)) * Add tutorials to pipeline docs. ([#55209](https://github.com/pytorch/pytorch/pull/55209)) `torch.distributed` * Update documentation for `get_future` support ([#58107](https://github.com/pytorch/pytorch/pull/58107)) * Mention distributed profiling in documentation ([#58286](https://github.com/pytorch/pytorch/pull/58286)) * Update distributed doc table for `alltoall` ([#54277](https://github.com/pytorch/pytorch/pull/54277)) * fix docstring signature in `all_reduce_multigpu` ([#54665](https://github.com/pytorch/pytorch/pull/54665)) * `torch.distributed`: Improve dist.new_group doc ([#55660](https://github.com/pytorch/pytorch/pull/55660)) ### ONNX * Updated ONNX documentation ([#51362](https://github.com/pytorch/pytorch/pull/51362)) ([#53313](https://github.com/pytorch/pytorch/pull/53313)). * Updated scripting docs ([#54634](https://github.com/pytorch/pytorch/pull/54634)) ([#54868](https://github.com/pytorch/pytorch/pull/54868)). * Fixed docstring signature of torch.{onnx,utils} ([#54662](https://github.com/pytorch/pytorch/pull/54662)). 
* onnx.symbolic_helper.parse_args: document and clean up ([#56956](https://github.com/pytorch/pytorch/pull/56956)) ([#57598](https://github.com/pytorch/pytorch/pull/57598)).

Small bug fix release (2021-03-25)

# PyTorch 1.8.1 Release Notes

* New Features
* Improvements
* Bug Fixes
* Documentation

# New Features

### Revamp of profiling tools in `torch.profiler`

The [`torch.profiler`](https://pytorch.org/docs/stable/profiler.html) submodule is now available. It leverages the newly released Kineto library for profiling.
You can find more details in this blog post: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
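
A minimal sketch of the new API, assuming a small CPU-only model (the model and tensor shapes below are illustrative, not from the release notes):

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

model = nn.Linear(128, 64)
inputs = torch.randn(32, 128)

# Collect CPU-side timings (and input shapes) for a few forward passes.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(5):
        model(inputs)

# Print an aggregated table of the most expensive ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```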

### Enable use of autocast for PyTorch/XLA ([#48570](https://github.com/pytorch/pytorch/pull/48570))

The `torch.cuda.amp.autocast` context manager can now be used in conjunction with PyTorch/XLA to provide easy mixed-precision training.
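
For reference, a minimal sketch of the autocast context manager (shown here on CUDA; the XLA integration follows the same pattern but additionally requires the separate `torch_xla` package, which this snippet does not cover):

```python
import torch
from torch import nn

# Illustrative model and data; requires a CUDA device.
model = nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(8, 64, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    out = model(data)        # ops run in reduced precision where it is safe
    loss = out.sum()
loss.backward()              # gradients are computed outside the autocast region
optimizer.step()
```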

# Improvements

* Make `torch.` submodule import more autocomplete-friendly ([#52339](https://github.com/pytorch/pytorch/pull/52339))
* Add support in ONNX for `torch.{isinf,any,all}` ([#53529](https://github.com/pytorch/pytorch/pull/53529))
* Replace thrust with cub in GPU implementation of `torch.randperm` for performance ([#54537](https://github.com/pytorch/pytorch/pull/54537))

# Bug fixes

## Misc

* Fixes for `torch.distributions` validation checks ([#53763](https://github.com/pytorch/pytorch/pull/53763))
* Allow changing the padding vector for `nn.Embedding` ([#53447](https://github.com/pytorch/pytorch/pull/53447))
* Fix TensorPipe for large copies and interoperability with CUDA ([#53804](https://github.com/pytorch/pytorch/pull/53804))
* Properly de-sugar `Ellipsis` in TorchScript ([#53766](https://github.com/pytorch/pytorch/pull/53766))
* Stop using OneDNN for group convolutions when groups size is a multiple of `24` ([#54015](https://github.com/pytorch/pytorch/pull/54015))
* Use `int8_t` instead of `char` in `{load,store}_scalar` ([#52616](https://github.com/pytorch/pytorch/pull/52616))
* Make ideep honor `torch.set_num_threads` ([#53871](https://github.com/pytorch/pytorch/pull/53871))
* Fix dimension out of range in `pixel_{un}shuffle` ([#54178](https://github.com/pytorch/pytorch/pull/54178))
* Update kineto to fix libtorch builds ([#54205](https://github.com/pytorch/pytorch/pull/54205))
* Fix distributed autograd CUDA stream synchronization for send/recv operations ([#54358](https://github.com/pytorch/pytorch/pull/54358))

## ONNX

* Update error handling in ONNX to avoid `ValueError` ([#53548](https://github.com/pytorch/pytorch/pull/53548))
* Update assign output shape for nested structure and dict output ([#53311](https://github.com/pytorch/pytorch/pull/53311))
* Update embedding export wrt `padding_idx` ([#53931](https://github.com/pytorch/pytorch/pull/53931))

# Documentation

* Doc update for `torch.fx` ([#53674](https://github.com/pytorch/pytorch/pull/53674))
* Fix `distributed.rpc.options.TensorPipeRpcBackendOptions.set_device_map` ([#53508](https://github.com/pytorch/pytorch/pull/53508))
* Update example for `nn.LSTMCell` ([#51983](https://github.com/pytorch/pytorch/pull/51983))
* Update doc for the `padding_idx` argument for `nn.Embedding` ([#53809](https://github.com/pytorch/pytorch/pull/53809))
* Update general doc template ([#54141](https://github.com/pytorch/pytorch/pull/54141))


PyTorch 1.8 Release, including Compiler and Distributed Training updates, New Mobile Tutorials and more (2021-03-04)

# PyTorch 1.8.0 Release Notes

* Highlights
* Backwards Incompatible Change
* New Features
* Improvements
* Performance
* Documentation

# Highlights

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

1. Support for doing Python-to-Python functional transformations via `torch.fx` (see the short sketch after this list);
2. Added or stabilized APIs to support FFTs (`torch.fft`), Linear Algebra functions (`torch.linalg`), added support for autograd for complex tensors and updates to improve performance for calculating hessians and jacobians; and
3. Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression. See the full release notes [here](https://github.com/pytorch/pytorch/releases).
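
As a quick taste of the first highlight, a minimal `torch.fx` sketch (the module below is illustrative):

```python
import torch
from torch import fx, nn

class M(nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# Symbolically trace the module into a GraphModule whose graph can be
# inspected and rewritten in Python, then used like a regular module.
gm = fx.symbolic_trace(M())
print(gm.graph)   # the intermediate representation
print(gm.code)    # the regenerated Python source
gm(torch.randn(3))
```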

Along with 1.8, we are also releasing major updates to PyTorch libraries including [TorchCSPRNG](https://github.com/pytorch/csprng), [TorchVision](https://github.com/pytorch/vision), [TorchText](https://github.com/pytorch/text) and [TorchAudio](https://github.com/pytorch/audio). For more on the library releases, see the post [here](http://pytorch.org/blog/pytorch-1.8-new-library-releases). As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post [here](https://pytorch.org/blog/pytorch-1.8-new-library-releases).

You can find more details on all the highlighted features in the [PyTorch 1.8 Release blogpost](https://pytorch.org/blog/pytorch-1.8-released/).

# Backwards Incompatible Changes

### Fix Tensor inplace modulo in python ([#49390](https://github.com/pytorch/pytorch/pull/49390))

In-place modulo in Python (`%=`) was wrongly done out of place for Tensors. This change fixes the behavior.
Code that relied on this operation being done out of place should be updated to use the out-of-place version `t = t % other` instead of `t %= other`.

1.7.1:

```python
>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
```

1.8.0:

```python
>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
```

### Standardize `torch.clamp` edge cases ([#43288](https://github.com/pytorch/pytorch/pull/43288))

For ease of exposition let `a_min` be the value of the "min" argument to clamp, and `a_max` be the value of the "max" argument to clamp. This PR changes the behavior of `torch.clamp` to always compute `min(max(a, a_min), a_max)`. `torch.clamp` currently computes this in its vectorized CPU implementation but uses different approaches for other backends. These implementations are the same when `a_min < a_max`, but divergent when `a_min > a_max`. This divergence is easily triggered:

```python
>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]
tensor(2.)
>>> torch.clamp(t.cuda(), 4, 2)[0]
tensor(4., device='cuda:0')
>>> torch.clamp(torch.tensor(0), 4, 2)
tensor(4)
```

This PR makes the behavior consistent with NumPy's `clip`. C++'s `std::clamp`'s behavior is undefined when `a_min > a_max`. Python has no standard clamp implementation.

### Tensor deepcopy now properly copies the `.grad` field ([#50663](https://github.com/pytorch/pytorch/pull/50663))

The deepcopy protocol will now properly copy the `.grad` field of Tensors when it exists. The old behavior can be recovered by setting the `.grad` field to `None` after doing the deepcopy.

1.7.1:

```python
>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
None
```

1.8.0:

```python
>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
tensor([0.8883, 0.5765])
```
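For code that depended on the old semantics, a minimal sketch of restoring the 1.7.1 behavior after copying (the tensor here is illustrative):

```python
import copy
import torch

t = torch.randn(2, requires_grad=True)
t.sum().backward()        # t.grad is now populated

t2 = copy.deepcopy(t)     # 1.8.0: t2.grad is a copy of t.grad
t2.grad = None            # emulate the pre-1.8 behavior, if your code relied on it
```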

### Fix `torch.fmod` type promotion ([#47323](https://github.com/pytorch/pytorch/pull/47323), [#48278](https://github.com/pytorch/pytorch/pull/48278))

1.7.1: Raises a RuntimeError for an integral tensor and a floating-point tensor. The dtype of the output is determined by the first input.

```python
>>> x = torch.arange(start=1, end=6, dtype=torch.int32)  # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32)  # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
RuntimeError: result type Float can't be cast to the desired output type Int
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64)  # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float32
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```

1.8.0: Supports an integral tensor and a floating-point tensor as inputs. The dtype of the output is determined by both inputs.

```python
>>> x = torch.arange(start=1, end=6, dtype=torch.int32)  # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32)  # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
tensor([1.0000, 0.7000, 0.0000, 0.6000, 1.2000])
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64)  # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float64
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```

### Preserve non-dense or overlapping tensor's layout in `*_like` functions ([#46046](https://github.com/pytorch/pytorch/pull/46046))

All the `*_like` factory functions will now generate the same striding as out-of-place operations would. This means in particular that non-contiguous tensors will produce non-contiguous outputs. If you require a contiguous output, you can pass the `memory_format=torch.contiguous_format` keyword argument to the factory function. Such factory functions include `clone`, `to`, `float`, `cuda`, `*_like`, `zeros`, `rand{n}`, etc.

### Make output of `torch.norm` and `torch.linalg.norm` consistent for complex inputs ([#48284](https://github.com/pytorch/pytorch/pull/48284))

Previously, when given a complex input, `torch.linalg.norm` and `torch.norm` would return a complex output. `torch.linalg.cond` would sometimes return a complex output and sometimes return a real output when given a complex input, depending on its `p` argument. This PR changes this behavior to match `numpy.linalg.norm` and `numpy.linalg.cond`, so that a complex input now results in a real-valued output, consistent with NumPy.

### Make `torch.svd` return `V`, not `V.conj()` for complex inputs ([#51012](https://github.com/pytorch/pytorch/pull/51012))

`torch.svd` added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex `V` tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy. Users that were already using the previous version of `torch.svd` with complex inputs can recover the previous behavior by taking the complex conjugate of the returned `V`.

### `torch.angle`: properly handle pure real numbers ([#49163](https://github.com/pytorch/pytorch/pull/49163))

This PR updates PyTorch's `torch.angle` operator to be consistent with NumPy's. Previously `torch.angle` would return zero for all real inputs (including NaN). Now angle returns `pi` for negative real inputs, zero for non-negative real inputs, and propagates NaNs.
### Enable distribution validation by default for `torch.distributions` ([#48743](https://github.com/pytorch/pytorch/pull/48743))

This may slightly slow down some models. Concerned users may disable validation by using `torch.distributions.Distribution.set_default_validate_args(False)` or by disabling individual distribution validation via `MyDistribution(..., validate_args=False)`.

This may cause new `ValueError`s in models that rely on unsupported behavior, e.g. `Categorical.log_prob()` applied to continuous-valued tensors (only {0,1}-valued tensors are supported). Such models should be fixed, but the previous behavior can be recovered by disabling argument validation using the methods mentioned above.

### Prohibit assignment to a sparse tensor ([#50040](https://github.com/pytorch/pytorch/pull/50040))

Assigning to a sparse Tensor did not work properly and resulted in a no-op. The following code now properly raises an error:

```python
>>> t = torch.rand(10).to_sparse()
>>> t[0] = 42
TypeError: Cannot assign to a sparse tensor
```

### C++ API: operators that take a list of optional `Tensor`s cannot be called with `ArrayRef<Tensor>` anymore ([#49138](https://github.com/pytorch/pytorch/pull/49138))

This PR changes the C++ API representation of lists of optional Tensors (e.g. in the `Tensor::index` method) from `ArrayRef<Tensor>` to `List<optional<Tensor>>`. This change breaks backwards compatibility, since there is no implicit conversion from `ArrayRef<Tensor>` to `List<optional<Tensor>>`.

A common call pattern is `tensor.index({indices_tensor})`, where `indices_tensor` is a `Tensor`. This will continue to work because the `{}` initializer_list constructor for `List<optional<Tensor>>` can take `Tensor` elements that are implicitly converted to `optional<Tensor>`.

However, another common call pattern is `tensor.index(indices_tensor)`, where previously the `Tensor` got implicitly converted to an `ArrayRef<Tensor>`. To implicitly convert `Tensor` -> `optional<Tensor>` -> `List<optional<Tensor>>` would chain two implicit conversions, which C++ doesn't allow. So those call sites should be rewritten to use the `tensor.index({indices_tensor})` pattern.

### Autograd view creation information is now properly propagated when views are chained

After this fix, an error is properly thrown to avoid wrong gradients when an in-place operation is performed on a view of a view, if the in-place operation was not allowed on the first view. This means that code that used to return wrong gradients in 1.7.1 (such as `t.unbind()[0].select(0, 0).add_(1)`) will now properly raise an error.

### End of deprecation cycle for spectral ops in the `torch.` namespace ([#48594](https://github.com/pytorch/pytorch/pull/48594))

This PR removes the deprecated `torch.{fft,rfft,ifft,irfft}` functions and their corresponding methods on `torch.Tensor`. PyTorch programs using these functions must now update to use the `torch.fft` namespace.

### `torch.digamma`: properly handle all inputs ([#48302](https://github.com/pytorch/pytorch/pull/48302))

This PR updates PyTorch's `torch.digamma` function to be consistent with SciPy's `special.digamma` function. This changes the result of `torch.digamma` on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates `torch.digamma` to return `NaN`. For zero, however, it returns `-inf` to be consistent with SciPy.
Interestingly, SciPy made a similar change, which was noticed by at least one user: [scipy/scipy#9663](https://github.com/scipy/scipy/issues/9663#issue-396587679). SciPy's returning of negative infinity at zero is intentional: https://github.com/scipy/scipy/blob/59347ae8b86bcc92c339efe213128f64ab6df98c/scipy/special/cephes/psi.c#L163. This change is consistent with the C++ standard for the gamma function: https://en.cppreference.com/w/cpp/numeric/math/tgamma

### Fix `torch.remainder` type promotion ([#48668](https://github.com/pytorch/pytorch/pull/48668))

1.7.1: In the case where the second argument is a Python number, the result is cast to the dtype of the first argument.

```python
>>> torch.remainder(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```

1.8.0: In the case where the second argument is a Python number, the dtype of the result is determined by type promotion of both inputs.

```python
>>> torch.remainder(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```

### Changes to the ONNX export API to better handle named arguments ([#47367](https://github.com/pytorch/pytorch/pull/47367))

The `args` input argument of the `torch.onnx.export` function is updated to better support optional arguments. An optional dictionary can be passed in addition as the last argument in the `args` tuple, specifying inputs with the corresponding named parameter. Note that this is backward-incompatible for cases where the last input is also of a dictionary type. In the new API, for such cases, it is mandatory to have an empty dictionary as the last argument in the `args` tuple. More details can be found at: https://pytorch.org/docs/1.8.0/onnx.html?highlight=onnx#using-dictionaries-to-handle-named-arguments-as-model-inputs.

### Update signature of `torch.quantization.quantize` function ([#48537](https://github.com/pytorch/pytorch/pull/48537))

The `run_args` argument must now contain a list or tuple holding the positional arguments, even if there is only a single argument. In particular, code like `qmodel = quantize(float_model, default_eval_fn, img_data)` that was working in 1.7.1 will now raise the error `TypeError: default_eval_fn() takes 2 positional arguments but 3 were given`. You should update this code to provide the image in a list, for example: `qmodel = quantize(float_model, default_eval_fn, [img_data])`.

### Change the way we quantize relu, leaky relu and sigmoid ([#47415](https://github.com/pytorch/pytorch/pull/47415), [#48038](https://github.com/pytorch/pytorch/pull/48038), [#45702](https://github.com/pytorch/pytorch/pull/45702), [#45711](https://github.com/pytorch/pytorch/pull/45711), [#45883](https://github.com/pytorch/pytorch/pull/45883), [#45882](https://github.com/pytorch/pytorch/pull/45882), [#47660](https://github.com/pytorch/pytorch/pull/47660))

Starting with version 1.8.0, in the eager mode quantization flow, relu is no longer observed, as observation is not needed for it. In previous versions, quantized `leaky_relu` and `sigmoid` did not require observation and just inherited the quantization parameters from their input, but that does not work very well in eager mode quantization. Starting with version 1.8.0, they are observed operators so that they work better in eager mode quantization.

### Update direction numbers to 21201 dims in the SobolEngine ([#49710](https://github.com/pytorch/pytorch/pull/49710))

This update is BC-breaking because the values drawn by the engine will be different from the ones drawn in 1.7.1, even with the same seed.

1.7.1:

```python
>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.5000],
        [0.7500],
        [0.2500]])
```

1.8.0:

```python
>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.0000],
        [0.5000],
        [0.7500]])
```
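As a quick illustration (illustrative snippet, not from the release notes): seeding the engine makes draws reproducible on a fixed PyTorch version, but the same seed yields different sequences before and after this change.

```python
import torch
from torch.quasirandom import SobolEngine

# Seeded draws are reproducible on a given PyTorch version, but the
# sequence differs between 1.7.1 and 1.8.0 because the direction
# numbers were updated.
eng = SobolEngine(dimension=2, scramble=True, seed=1234)
print(eng.draw(4))
```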

# Deprecations

## Python API

### Deprecate old style `nn.Module` backward hooks ([#46163](https://github.com/pytorch/pytorch/pull/46163))

Old style `nn.Module` backward hooks have been broken for a long time (they do not behave as advertised in the documentation). We now have a new `nn.Module.register_full_backward_hook` that provides a fully working implementation of these hooks. The old function should not be used; migrate to the new full version.

An example of this discrepancy is shown below, where a Linear layer takes as input a single Tensor of size 5 and returns a single Tensor of size 5, but the old style hook returns two gradients with respect to the input for only one input.

1.7.1:

```python
import torch
from torch import nn

mod = nn.Linear(5, 5)

def hook(mod, grad_inp, grad_out):
    print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
    print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))

mod.register_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5])  # One too many
>>> grad output size: torch.Size([5])
```

1.8.0: Old style hooks are deprecated and will warn when providing a wrong result.

```python
import torch
from torch import nn

mod = nn.Linear(5, 5)

def hook(mod, grad_inp, grad_out):
    print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
    print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))

mod.register_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5])  # One too many
>>> grad output size: torch.Size([5])
>>> UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input.
```

Full hooks should be used to get the proper result all the time and avoid warnings:

```python
mod.register_full_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5])
>>> grad output size: torch.Size([5])
```

### `torch.stft`: Deprecate default value of the `return_complex` argument ([#49022](https://github.com/pytorch/pytorch/pull/49022), [#50102](https://github.com/pytorch/pytorch/pull/50102))

Previously `torch.stft` took an optional `return_complex` parameter that indicated whether the output would be a real tensor or a complex tensor. `return_complex` has the default value of `False`. This default value is deprecated (meaning that this optional argument is becoming mandatory) and will be removed in future versions. You can pass this argument explicitly to avoid this deprecation.

### Deprecate `torch.set_deterministic` in favor of `torch.use_deterministic_algorithms` ([#49904](https://github.com/pytorch/pytorch/pull/49904))

This beta feature is being renamed for improved clarity. Users should migrate to use the new name.

### Deprecate `torch.*` linear algebra functions in favor of the `torch.linalg.*` variants for `cholesky` ([#51460](https://github.com/pytorch/pytorch/pull/51460)), `slogdet` ([#51354](https://github.com/pytorch/pytorch/pull/51354)), `inverse` ([#51672](https://github.com/pytorch/pytorch/pull/51672)), `pinverse` ([#51671](https://github.com/pytorch/pytorch/pull/51671))

All of these linear algebra functions are being moved to the `torch.linalg` submodule, which provides a NumPy-compatible API.
These new functions have the same set of features as the `torch.` ones and should be used instead. # New features ### Python API * New functions (most of them to improve numpy compatibility): `torch.nan_to_num` ([#44592](https://github.com/pytorch/pytorch/pull/44592)), `torch.tensor_split` ([#45168](https://github.com/pytorch/pytorch/pull/45168)), `torch.nanmedian` ([#45847](https://github.com/pytorch/pytorch/pull/45847)), `torch.ravel` ([#46098](https://github.com/pytorch/pytorch/pull/46098)), `torch.igamma` ([#46183](https://github.com/pytorch/pytorch/pull/46183)), `torch.igammac` ([#48171](https://github.com/pytorch/pytorch/pull/48171)), `torch.{column_stack,row_stack}` ([#46313](https://github.com/pytorch/pytorch/pull/46313)), `torch.kron` ([#45358](https://github.com/pytorch/pytorch/pull/45358)), `torch.copysign` ([#46396](https://github.com/pytorch/pytorch/pull/46396)), `Tensor.new_empty_strided` ([#47225](https://github.com/pytorch/pytorch/pull/47225)), `torch.{swapdims,swapaxes}` ([#46041](https://github.com/pytorch/pytorch/pull/46041)), `torch.tile` ([#47974](https://github.com/pytorch/pytorch/pull/47974)), `torch.float_power` ([#44937](https://github.com/pytorch/pytorch/pull/44937)), `torch.moveaxis` ([#48581](https://github.com/pytorch/pytorch/pull/48581)), `torch.inner` ([#46716](https://github.com/pytorch/pytorch/pull/46716)), `torch.msort` ([#48440](https://github.com/pytorch/pytorch/pull/48440)), `torch.sinc` ([#48740](https://github.com/pytorch/pytorch/pull/48740)), `torch.broadcast_to` ([#48997](https://github.com/pytorch/pytorch/pull/48997)), `torch.xlogy` ([#48777](https://github.com/pytorch/pytorch/pull/48777)), `torch.f{max,min}` ([#49312](https://github.com/pytorch/pytorch/pull/49312)), `torch.diff` ([#50569](https://github.com/pytorch/pytorch/pull/50569)), `torch.ldexp` ([#45370](https://github.com/pytorch/pytorch/pull/45370)), `torch.broadcast_shapes` ([#43935](https://github.com/pytorch/pytorch/pull/43935)), * `torch.fft` new features: 2D FFT functions ([#45164](https://github.com/pytorch/pytorch/pull/45164)), use new FFT operators in stft ([#47601](https://github.com/pytorch/pytorch/pull/47601)), helper functions ([#44877](https://github.com/pytorch/pytorch/pull/44877)), fuzzing benchmark ([#47872](https://github.com/pytorch/pytorch/pull/47872)) * `torch.linalg` new features: `linalg.tensorsolve` ([#46142](https://github.com/pytorch/pytorch/pull/46142)), `linalg.cholesky` ([#46083](https://github.com/pytorch/pytorch/pull/46083)), `linalg.tensorinv` ([#45969](https://github.com/pytorch/pytorch/pull/45969)), `linalg.{eigh,eigvalsh}` ([#45526](https://github.com/pytorch/pytorch/pull/45526)), `linalg.matrix_rank` ([#48206](https://github.com/pytorch/pytorch/pull/48206)), `linalg.solve` ([#48456](https://github.com/pytorch/pytorch/pull/48456)), `linalg.qr` ([#47764](https://github.com/pytorch/pytorch/pull/47764), [#50046](https://github.com/pytorch/pytorch/pull/50046)), `linalg.svd` ([#45562](https://github.com/pytorch/pytorch/pull/45562)), `linalg.inv` ([#48261](https://github.com/pytorch/pytorch/pull/48261)), `linalg.pinv` ([#48399](https://github.com/pytorch/pytorch/pull/48399)), `linalg.slogdet` ([#49194](https://github.com/pytorch/pytorch/pull/49194)), `linalg.cond` ([#45832](https://github.com/pytorch/pytorch/pull/45832)) * New `torch.nn` Modules: `nn.PixelUnshuffle` ([#49334](https://github.com/pytorch/pytorch/pull/49334)), `nn.GaussianNLLLoss` ([#50886](https://github.com/pytorch/pytorch/pull/50886)) * Automatic shape inference in `torch.nn`: new 
`nn.LazyLinear` ([#44538](https://github.com/pytorch/pytorch/pull/44538)), `nn.LazyConv{1,2,3}d` and `nn.LazyConvTranspose{1,2,3}d` ([#47350](https://github.com/pytorch/pytorch/pull/47350)) * Add channels last support for `torch.nn.AdaptiveAvgPool2d` ([#48916](https://github.com/pytorch/pytorch/pull/48916)) * Add option to produce standalone executable with `cpp_extensions` ([#47862](https://github.com/pytorch/pytorch/pull/47862)) * Add sparse-sparse matrix multiplication support ([#39526](https://github.com/pytorch/pytorch/pull/39526)) * Add `torch.futures.Future.add_done_callback` ([#45675](https://github.com/pytorch/pytorch/pull/45675)) * Add `three_phase` optional argument to `torch.optim.lr_scheduler.OneCycleLR` ([#42715](https://github.com/pytorch/pytorch/pull/42715)) * Add `bicubic` option for the `mode` argument of `torch.nn.functional.grid_sampler` ([#44780](https://github.com/pytorch/pytorch/pull/44780)) * Add new distributions to `torch.distributions`: `Kumaraswamy` ([#48285](https://github.com/pytorch/pytorch/pull/48285)), `LKJCholesky` ([#48798](https://github.com/pytorch/pytorch/pull/48798)) * Add reparameterization support to `torch.distributions.OneHotCategorical` ([#46610](https://github.com/pytorch/pytorch/pull/46610)) * Add new transforms to `torch.distributions`: `CorrCholeskyTransform` ([#48041](https://github.com/pytorch/pytorch/pull/48041)) * Add new constraint to `torch.distributions`: `independent` ([#50547](https://github.com/pytorch/pytorch/pull/50547), [#50302](https://github.com/pytorch/pytorch/pull/50302)) * Add zero annealing epochs to SWA optimizer ([#47579](https://github.com/pytorch/pytorch/pull/47579)) * Add `close` method to `torch.hub.tqdm` mock ([#46040](https://github.com/pytorch/pytorch/pull/46040)) * Add support for pruning based on custom importance scores via the `importance_scores` keyword argument ([#48378](https://github.com/pytorch/pytorch/pull/48378)) * Add torch vitals ([#51047](https://github.com/pytorch/pytorch/pull/51047)) ### Complex Numbers * Complex Number support on CPU and CUDA for `torch.symeig` ([#45121](https://github.com/pytorch/pytorch/pull/45121)), `torch.pinverse` ([#45819](https://github.com/pytorch/pytorch/pull/45819)), `torch.det` ([#45980](https://github.com/pytorch/pytorch/pull/45980)), `torch.diagflat` ([#47564](https://github.com/pytorch/pytorch/pull/47564)), `torch.{addcmul, addcdiv} `([#46639](https://github.com/pytorch/pytorch/pull/46639)), `torch.lu_solve` ([#48028](https://github.com/pytorch/pytorch/pull/48028)), `torch.matrix_exp` ([#48363](https://github.com/pytorch/pytorch/pull/48363)), `torch.eig` ([#49168](https://github.com/pytorch/pytorch/pull/49168)), `torch.{acosh, asinh, atanh}` ([#50387](https://github.com/pytorch/pytorch/pull/50387)), `torch.masked_scatter` ([#51281](https://github.com/pytorch/pytorch/pull/51281)), `torch.bmm` and `torch.baddbmm` ([#42553](https://github.com/pytorch/pytorch/pull/42553)), `torch.orgqr` ([#50502](https://github.com/pytorch/pytorch/pull/50502)), `torch.index_fill_` ([#50578](https://github.com/pytorch/pytorch/pull/50578)), `torch.cholesky_inverse` ([#50269](https://github.com/pytorch/pytorch/pull/50269)) * Complex Number support on CUDA for `torch.qr` ([#45032](https://github.com/pytorch/pytorch/pull/45032)), `torch.lu (`[`#45898`](https://github.com/pytorch/pytorch/pull/45898)`), torch.prod`([#45980](https://github.com/pytorch/pytorch/pull/45980)), `torch.triangular_solve `([#46916](https://github.com/pytorch/pytorch/pull/46916)), `torch.solve 
`([#47045](https://github.com/pytorch/pytorch/pull/47045)), `torch.cholesky_solve` ([#47047](https://github.com/pytorch/pytorch/pull/47047)), `torch.mean` ([#47048](https://github.com/pytorch/pytorch/pull/47048)), `torch.svd` ([#45795](https://github.com/pytorch/pytorch/pull/45795)), `torch.inverse` ([#47595](https://github.com/pytorch/pytorch/pull/47595)), `torch.Tensor.index_put_` ([#51148](https://github.com/pytorch/pytorch/pull/51148)) * Complex Number support on CPU for `torch.trace` ([#50380](https://github.com/pytorch/pytorch/pull/50380)) * Complex Number support for `torch.nn.DataParallel` ([#48686](https://github.com/pytorch/pytorch/pull/48686)), `torch.nn.L1Loss` ([#49912](https://github.com/pytorch/pytorch/pull/49912)), Padding functions ([#50594](https://github.com/pytorch/pytorch/pull/50594)) * Complex Number support for `torch.distributed.{all_reduce, all_gather}` ([#45879](https://github.com/pytorch/pytorch/pull/45879), [#46270](https://github.com/pytorch/pytorch/pull/46270)) * Complex Autograd support for `torch.{atan, log, log10, log1p, log2, reciprocal, tan, pow, rsqrt, tanh, asinh, acosh}` ([#46275](https://github.com/pytorch/pytorch/pull/46275)), `torch.{cholesky, triangular_solve, mm, mv, ger} `([#45737](https://github.com/pytorch/pytorch/pull/45737)), `torch.take(), torch.Tensor.fill_()` ([#46860](https://github.com/pytorch/pytorch/pull/46860)), `torch.matrix_exp` ([#48363](https://github.com/pytorch/pytorch/pull/48363)), `torch.{baddbmm, addbmm, addmm, addmv}` ([#50632](https://github.com/pytorch/pytorch/pull/50632)), `torch.qr` ([#48489](https://github.com/pytorch/pytorch/pull/48489)), `torch.svd` and `torch.pinverse` ([#47761](https://github.com/pytorch/pytorch/pull/47761)), `torch.sqrt` ([#49461](https://github.com/pytorch/pytorch/pull/49461)), `torch.diag` ([#51268](https://github.com/pytorch/pytorch/pull/51268)), `torch.trace` ([#51537](https://github.com/pytorch/pytorch/pull/51537)), `torch.exp` ([#47194](https://github.com/pytorch/pytorch/pull/47194)), `torch.mean` ([#47566](https://github.com/pytorch/pytorch/pull/47566)), `torch.addr` ([#50667](https://github.com/pytorch/pytorch/pull/50667)), torch.{`stack, gather, index_select}, torch.Tensor.index_add_`([#49552](https://github.com/pytorch/pytorch/pull/49552)), `torch.{masked_scatter, masked_select}` ([#51281](https://github.com/pytorch/pytorch/pull/51281)), `torch.{addcmul, addcdiv} `([#46639](https://github.com/pytorch/pytorch/pull/46639)), `torch.{acosh, asinh, atanh}` ([#50387](https://github.com/pytorch/pytorch/pull/50387)), `torch.solve `([#47045](https://github.com/pytorch/pytorch/pull/47045)), `torch.cholesky_solve` ([#47047](https://github.com/pytorch/pytorch/pull/47047)), `torch.inverse` ([#47595](https://github.com/pytorch/pytorch/pull/47595)) * Add complex autograd support for named tensors ([#47289](https://github.com/pytorch/pytorch/pull/47289)) * Allow converting parameters and buffers of `torch.nn.Module` to complex dtypes ([#44788](https://github.com/pytorch/pytorch/pull/44788)) * Add complex support to IValues ([#50883](https://github.com/pytorch/pytorch/pull/50883), [#51476](https://github.com/pytorch/pytorch/pull/51476)) * Add TorchScript type annotation logic for complex numbers ([#50884](https://github.com/pytorch/pytorch/pull/50884)) * Add serialization logic for complex numbers ([#51287](https://github.com/pytorch/pytorch/pull/51287)) * Add support for complex number lists in JIT ([#51145](https://github.com/pytorch/pytorch/pull/51145)) * Add support for complex valued keys for dict in 
TorchScript ([#51472](https://github.com/pytorch/pytorch/pull/51472)) * Add `scalar.conj()` ([#46596](https://github.com/pytorch/pytorch/pull/46596)) * Add `Tensor.copy_()` for `ComplexHalf` tensors ([#45339](https://github.com/pytorch/pytorch/pull/45339)) ### Profiler * New profiler API ([#48280](https://github.com/pytorch/pytorch/pull/48280)) * Use libkineto in profiler ([#46470](https://github.com/pytorch/pytorch/pull/46470)) * Add FLOPS computation support to the new profiler API ([#51734](https://github.com/pytorch/pytorch/pull/51734)) * Add high level profiling trace for dataloading and optimizer ([#47655](https://github.com/pytorch/pytorch/pull/47655)) * Add support for SVG visualization ([#48438](https://github.com/pytorch/pytorch/pull/48438)) ### Autograd * Add `inputs` argument to `autograd.backward()` both in python and c++ ([#46855](https://github.com/pytorch/pytorch/pull/46855), [#47214](https://github.com/pytorch/pytorch/pull/47214)) * Add support for Tensor-like objects in `torch.autograd.gradcheck` ([#45732](https://github.com/pytorch/pytorch/pull/45732)) * Add experimental `vectorize` flag to `torch.autograd.functional.{jacobian, hessian}` ([#50915](https://github.com/pytorch/pytorch/pull/50915), [#51638](https://github.com/pytorch/pytorch/pull/51638)) * Add anomaly mode in C++ API ([#46981](https://github.com/pytorch/pytorch/pull/46981), [#47164](https://github.com/pytorch/pytorch/pull/47164)) * Make `torch.lu` differentiable. ([#46284](https://github.com/pytorch/pytorch/pull/46284)) * Add support for generators in autograd decorators like `torch.no_grad` ([#49017](https://github.com/pytorch/pytorch/pull/49017)) ### Dataloader * Add `BufferedShuffleDataset` ([#45290](https://github.com/pytorch/pytorch/pull/45290)) * Add warning if DataLoader is going to create excessive number of thread ([#46867](https://github.com/pytorch/pytorch/pull/46867)) * Add prototype of `BatchIterDataPipe` ([#49186, #51880](https://github.com/pytorch/pytorch/pull/49186)) * Add prototype of `SamplerIterDataPipe` ([#49363, #52104](https://github.com/pytorch/pytorch/pull/49363)) * Implement `BucketBatchIterDataPipe` ([#51126, #51880](https://github.com/pytorch/pytorch/pull/51126)) * Add Tar DataPipe-s ([#51398](https://github.com/pytorch/pytorch/pull/51398)) * Add `MapIterDataPipe` ([#51488](https://github.com/pytorch/pytorch/pull/51488)[](https://github.com/pytorch/pytorch/commit/9eb70c3c78ca971bf0277b9991f0932b4896bfed)[#51879](https://github.com/pytorch/pytorch/pull/51879)) ### CUDA * Allow user to specify a fraction of the GPU memory with `set_per_process_memory_fraction`. 
([#48172](https://github.com/pytorch/pytorch/pull/48172)) * CUDA BFloat16 TopK ([#44755](https://github.com/pytorch/pytorch/pull/44755)) * Add LazyNVRTC ([#45674](https://github.com/pytorch/pytorch/pull/45674)) * Enable CUDA Fuser for ROCm ([#45965](https://github.com/pytorch/pytorch/pull/45965)) * Define the record_stream method in native_functions.yaml ([#44301](https://github.com/pytorch/pytorch/pull/44301)) * Add CUDA 11.1 docker build ([#46283](https://github.com/pytorch/pytorch/pull/46283)) * Add nvtx.range() context manager ([#42925](https://github.com/pytorch/pytorch/pull/42925)) * CUDA BFloat16 gelu, hardswish, hardsigmoid ([#44997](https://github.com/pytorch/pytorch/pull/44997)) * [ROCm] enable stream priorities ([#47136](https://github.com/pytorch/pytorch/pull/47136)) * Add bfloat support for torch.randn and torch.norm ([#47143](https://github.com/pytorch/pytorch/pull/47143)) * CUDA BFloat16 Dropout ([#45005](https://github.com/pytorch/pytorch/pull/45005)), batchnorm (non-cuDNN) ([#44994](https://github.com/pytorch/pytorch/pull/44994)), backwards ([#48809](https://github.com/pytorch/pytorch/pull/48809)), sparse ([#48807](https://github.com/pytorch/pytorch/pull/48807)), indexing ([#48801](https://github.com/pytorch/pytorch/pull/48801)), embedding ([#44848](https://github.com/pytorch/pytorch/pull/44848)), signal windows ([#45155](https://github.com/pytorch/pytorch/pull/45155)), norm ([#48806](https://github.com/pytorch/pytorch/pull/48806)), isinf and isfinite ([#49356](https://github.com/pytorch/pytorch/pull/49356)), gemms on arch other than ampere ([#50442](https://github.com/pytorch/pytorch/pull/50442)), clamp, remainder, lshift, rshift ([#45247](https://github.com/pytorch/pytorch/pull/45247)) * Make CUDAGeneratorImpl capturable ([#48694](https://github.com/pytorch/pytorch/pull/48694)) * Adding support for CuDNN-based LSTM with projections ([#47725](https://github.com/pytorch/pytorch/pull/47725)) * Add `torch.cuda.can_device_access_peer` ([#50446](https://github.com/pytorch/pytorch/pull/50446)) * Add torch::cuda::ncll::all2all ([#45900](https://github.com/pytorch/pytorch/pull/45900)) ### C++ API * Add distance-agnostic triplet margin loss ([#45377](https://github.com/pytorch/pytorch/pull/45377)) * Add `torch::nn::ModuleDict` ([#47707](https://github.com/pytorch/pytorch/pull/47707)) * Add `torch::cuda::synchronize` ([#50072](https://github.com/pytorch/pytorch/pull/50072)) * Add new XPU backend type for Intel heterogeneous computation platform. 
([#49786](https://github.com/pytorch/pytorch/pull/49786)) ### TorchScript * `torch::jit::freeze` C++ api introduced ([#52337](https://github.com/pytorch/pytorch/pull/52337), [#52392](https://github.com/pytorch/pytorch/pull/52392)) * Add API for ignoring arbitrary module attributes during compilation ([#45262](https://github.com/pytorch/pytorch/pull/45262)) * Support tracing tensor `__setitem__` with dynamic shape ([#45828](https://github.com/pytorch/pytorch/pull/45828)) * Expose script_if_tracing as public API ([#46494](https://github.com/pytorch/pytorch/pull/46494)) * Support %-based string formatting ([#45976](https://github.com/pytorch/pytorch/pull/45976)) * Add `torch.jit.isinstance` support for typed containers ([#46062](https://github.com/pytorch/pytorch/pull/46062)) * Allow for source code comments at any level of indentation ([#46548](https://github.com/pytorch/pytorch/pull/46548)) * Support hashing of various data types by implementing generic hashing for IValues ([#46441](https://github.com/pytorch/pytorch/pull/46441)) * Support doc string for TorchBind custom classes ([#46576](https://github.com/pytorch/pytorch/pull/46576)) * Add API for selective lowering of modules to custom JIT backend ([#43613](https://github.com/pytorch/pytorch/pull/43613)) * add list() support ([#42382](https://github.com/pytorch/pytorch/pull/42382)) * Support using lambda function as TorchBind constructor ([#47819](https://github.com/pytorch/pytorch/pull/47819)) * Support user defined classes as constants ([#45556](https://github.com/pytorch/pytorch/pull/45556)) * Allow del statements with multiple targets ([#48876](https://github.com/pytorch/pytorch/pull/48876)) * Tuple Slice with both negative and positive stepped size ([#48660](https://github.com/pytorch/pytorch/pull/48660)) * Expose run_async function on torch::jit::Method ([#48607](https://github.com/pytorch/pytorch/pull/48607)) * Add flag torch_jit_disable_warning_prints to allow disabling all warnings.warn ([#49313](https://github.com/pytorch/pytorch/pull/49313)) * Add dict comprehension ([#47774](https://github.com/pytorch/pytorch/pull/47774)) * Adding support for bitwise augassignment operators (`+=` style statements) ([#44621](https://github.com/pytorch/pytorch/pull/44621)) * Support the `in` operator with str ([#47057](https://github.com/pytorch/pytorch/pull/47057)) * Adding JIT support for cuda streams and events ([#48020](https://github.com/pytorch/pytorch/pull/48020)) * Add `Type::{castRaw,expectRef}` ([#50061](https://github.com/pytorch/pytorch/pull/50061)) * Allow arbitrary docstrings to be inside torchscript interface methods ([#50271](https://github.com/pytorch/pytorch/pull/50271)) * Change list striding parameters to take optional integer ([#48719](https://github.com/pytorch/pytorch/pull/48719)) * Add support for scripting and running module level hooks in JIT ([#49544](https://github.com/pytorch/pytorch/pull/49544), [#49975](https://github.com/pytorch/pytorch/pull/49975), [#49545](https://github.com/pytorch/pytorch/pull/49545), [#49546](https://github.com/pytorch/pytorch/pull/49546), [#49547](https://github.com/pytorch/pytorch/pull/49547)) * Support default argument values of a method ([#48863](https://github.com/pytorch/pytorch/pull/48863)) * Graceful invalidation of Python Node/Value/Block when C++ object is deleted ([#50326](https://github.com/pytorch/pytorch/pull/50326)) * Support `Union[NoneType, T]` as input type ([#51605](https://github.com/pytorch/pytorch/pull/51605)) * Allow implicit boolean conversion of lists, strings, and 
dictionaries ([#51683](https://github.com/pytorch/pytorch/pull/51683)) ### Mobile * Add instance_key into mobile stats logging. ([#45517](https://github.com/pytorch/pytorch/pull/45517)) * Profiling allocator for mobile. ([#43951](https://github.com/pytorch/pytorch/pull/43951)) * [Metal] Add Metal/MPSCNN support on iOS ([#46112](https://github.com/pytorch/pytorch/pull/46112)) * [Metal] Introduce USE_PYTORCH_METAL ([#46383](https://github.com/pytorch/pytorch/pull/46383)) * [Metal] Support Resnet models (b63ddd6f57) * PyTorch NNAPI integration prototype ([#46780](https://github.com/pytorch/pytorch/pull/46780)) * [Metal] Enable Metal on macosx ([#47635](https://github.com/pytorch/pytorch/pull/47635)) * [Metal] Enable optimize_for_mobile on Linux ([#46384](https://github.com/pytorch/pytorch/pull/46384)) * [Android] Fix YUV camera image to tensor ([#50871](https://github.com/pytorch/pytorch/pull/50871)) * [Android] turn on USE_VULKAN for android builds by default ([#51291](https://github.com/pytorch/pytorch/pull/51291)) * Add windows JNI support ([#44257](https://github.com/pytorch/pytorch/pull/44257)) * * Enable partial loading of GPU models on linux CPU machines ([#51236](https://github.com/pytorch/pytorch/pull/51236)) ### Distributed * Support `send` and `recv` in c10d NCCL backend ([#44921](https://github.com/pytorch/pytorch/pull/44921), [#44922](https://github.com/pytorch/pytorch/pull/44922)) * Add support for NCCL alltoall ([#44374](https://github.com/pytorch/pytorch/pull/44374)) * Upstream `fairscale.nn.Pipe` into PyTorch as `torch.distributed.pipeline` ([#44090](https://github.com/pytorch/pytorch/pull/44090)) * Add a `--logdir` option to log subprocess output to files in DDP launcher. ([#33193](https://github.com/pytorch/pytorch/pull/33193)) * Support `RRef.backward()` for local RRefs. ([#46568](https://github.com/pytorch/pytorch/pull/46568)) and Owner RRefs. ([#46641](https://github.com/pytorch/pytorch/pull/46641)) * Support C++ implementation for DDP communication hook. 
([#46566](https://github.com/pytorch/pytorch/pull/46566)) * Provide 2 default C++ comm hooks for DDP ([#46701](https://github.com/pytorch/pytorch/pull/46701)) * Support remote device format `"worker_name/device"` ([#46773](https://github.com/pytorch/pytorch/pull/46773)) * Enable creation and transfer of `ScriptModule` over RPC ([#48293](https://github.com/pytorch/pytorch/pull/48293)) * Enable TCPStore on Windows ([#47749](https://github.com/pytorch/pytorch/pull/47749)) * Support `torch.distributed.irecv(src=None, ...)` as `recv_anysource` ([#49383](https://github.com/pytorch/pytorch/pull/49383)) * Implement layer-wise PowerSGD as a DDP comm hook ([#49639](https://github.com/pytorch/pytorch/pull/49639)) * Support `alltoall_single` in TorchScript ([#48345](https://github.com/pytorch/pytorch/pull/48345)) * Enable GPU-to-GPU comm in `TensorPipeAgent` ([#44418](https://github.com/pytorch/pytorch/pull/44418)) * Support timeout in `rref._get_type()` ([#50498](https://github.com/pytorch/pytorch/pull/50498)) * Support timeout for RRef proxy functions ([#50499](https://github.com/pytorch/pytorch/pull/50499)) * Add optimizer state sharding as `ZeroRedundancyOptimizer` ([#46750](https://github.com/pytorch/pytorch/pull/46750)) * Add distributed functional `Adam` optimizer ([#50624](https://github.com/pytorch/pytorch/pull/50624)), `sgd` optimizer ([#50618](https://github.com/pytorch/pytorch/pull/50618)), `Adadelta` optimizer ([#50623](https://github.com/pytorch/pytorch/pull/50623)), `RMSprop` optimizer ([#50619](https://github.com/pytorch/pytorch/pull/50619)), l `AdamW` optimizer ([#50620](https://github.com/pytorch/pytorch/pull/50620)) * Create a DDPLoggingData struct and expose it to python interface ([#50622](https://github.com/pytorch/pytorch/pull/50622)) * Implement autograd functions for c10d communication operations ([#40762](https://github.com/pytorch/pytorch/pull/40762)) * Enable TensorPipe's SHM transport ([#50760](https://github.com/pytorch/pytorch/pull/50760)) * Support device map for distributed autograd while using TensorPipe. 
([#44859](https://github.com/pytorch/pytorch/pull/44859)) * Create PyTorch DDP logging APIs for applications to use ([#50637](https://github.com/pytorch/pytorch/pull/50637)) * Add `set_exception` API in `torch.futures.Future` ([#50983](https://github.com/pytorch/pytorch/pull/50983)) * Add `scatter_object_list` API for c10d ([#43930](https://github.com/pytorch/pytorch/pull/43930)) * Provide parameter to pass GPU ID in barrier function ([#49069](https://github.com/pytorch/pytorch/pull/49069)) * Enable TensorPipe CUDA fallback channel ([#50675](https://github.com/pytorch/pytorch/pull/50675)) * Enable TensorPipe's InfiniBand transport ([#50761](https://github.com/pytorch/pytorch/pull/50761)) ### torch.fx * allow custom behavior for args, kwargs, and bool ([#45193](https://github.com/pytorch/pytorch/pull/45193)) * Mutable Graph APIs ([#45227](https://github.com/pytorch/pytorch/pull/45227)) * Make output a non-special Node ([#45599](https://github.com/pytorch/pytorch/pull/45599)) * Make `Tracer.trace()` just return a Graph ([#45704](https://github.com/pytorch/pytorch/pull/45704)) * Preserve type annotations on generated code in Graph ([#45880](https://github.com/pytorch/pytorch/pull/45880)) * Make `graph_copy` examine existing values in val_map ([#46104](https://github.com/pytorch/pytorch/pull/46104)) * Allow tracing free functions ([#46268](https://github.com/pytorch/pytorch/pull/46268)) * Make sure args/kwargs are immutable ([#46325](https://github.com/pytorch/pytorch/pull/46325)) * Make wrapped functions traceable ([#46692](https://github.com/pytorch/pytorch/pull/46692)) * Added `GraphModule.to_folder` ([#47544](https://github.com/pytorch/pytorch/pull/47544)) * Support default args in symbolic tracing ([#47615](https://github.com/pytorch/pytorch/pull/47615)) * Add `Node.all_input_nodes` ([#48270](https://github.com/pytorch/pytorch/pull/48270)) * Support torchbind as attribute in torch.fx symbolic tracing ([#48732](https://github.com/pytorch/pytorch/pull/48732)) * Create subgraph rewriter API ([#49540](https://github.com/pytorch/pytorch/pull/49540)) * Make len traceable and scriptable with wrap ([#50184](https://github.com/pytorch/pytorch/pull/50184)) * Add Interpreter and Transformer APIs ([#50420](https://github.com/pytorch/pytorch/pull/50420)) * Add alternative prettyprinting method to `Graph` ([#50878](https://github.com/pytorch/pytorch/pull/50878)) * Move some heavily used passes out of experimental ([#51392](https://github.com/pytorch/pytorch/pull/51392)) * Added partial concrete values for symbolic tracing ([#51609](https://github.com/pytorch/pytorch/pull/51609)) ### Quantization * Quantized Operators and Modules * Embedding and EmbeddingBag operator support * creating quint4x2 dtype for quantized tensors ([#44678](https://github.com/pytorch/pytorch/pull/44678)) * PerChannelFloatQParams support for quint4x2 dtype ([#45594](https://github.com/pytorch/pytorch/pull/45594)) * Add 4-bit embedding_bag prepack/unpack support using quint4x2 ([#45751](https://github.com/pytorch/pytorch/pull/45751)) * Support 4-bit embedding_bag operators using the dtype quint4x2 ([#45752](https://github.com/pytorch/pytorch/pull/45752)) * Support for 4-bit quantized EmbeddingBag module ([#45865](https://github.com/pytorch/pytorch/pull/45865)) * Refactor qembeddingbag to remove duplicate code ([#45881](https://github.com/pytorch/pytorch/pull/45881)) * Rename the sparse argument for embedding_bag ops ([#46003](https://github.com/pytorch/pytorch/pull/46003)) * Add support for pruned weights in embedding_bag_byte 
* fp16 -> fp32 EmbeddingBag moved into CPU impl ([#47076](https://github.com/pytorch/pytorch/pull/47076)) * Add non-fbgemm fallback implementation for embedding lookup ops ([#50706](https://github.com/pytorch/pytorch/pull/50706)) * Out variant for embedding_bag_4bit_rowwise_offsets ([#51324](https://github.com/pytorch/pytorch/pull/51324)) * Using int32 as indices for embedding_bag operators ([#45878](https://github.com/pytorch/pytorch/pull/45878)) * Add transposed conv support for fbgemm backend for 1d, 2d, 3d ([#46607](https://github.com/pytorch/pytorch/pull/46607), [#46608](https://github.com/pytorch/pytorch/pull/46608)) * Add quantized flip dispatch ([#46235](https://github.com/pytorch/pytorch/pull/46235)) * Add support for ReflectionPad2d ([#48036](https://github.com/pytorch/pytorch/pull/48036)) * Dynamic GRU quantization support ([#49448](https://github.com/pytorch/pytorch/pull/49448)) * Quantizable LSTM ([#49671](https://github.com/pytorch/pytorch/pull/49671)) * Quantization Flow/API * quantization: Linear + BatchNorm1d fusion ([#50748](https://github.com/pytorch/pytorch/pull/50748)) * compare_model_stub_fx API implementation ([#48951](https://github.com/pytorch/pytorch/pull/48951)) * Add additional_fuser_method_mapping to config ([#46355](https://github.com/pytorch/pytorch/pull/46355)) * Compare Weights FX Implementation ([#48056](https://github.com/pytorch/pytorch/pull/48056)) * Numeric Suite: Swap with shadow modules only for quantized part of model ([#51052](https://github.com/pytorch/pytorch/pull/51052)) * FX Graph Mode Quantization (an end-to-end sketch follows below) * Add prepare_custom_config_dict and convert_custom_config_dict ([#46223](https://github.com/pytorch/pytorch/pull/46223), [#46364](https://github.com/pytorch/pytorch/pull/46364)) * Add FixedQParamsFakeQuantize module ([#46657](https://github.com/pytorch/pytorch/pull/46657)) * Add support for additional_fuse_method_mapping ([#46345](https://github.com/pytorch/pytorch/pull/46345)), additional_{fusion/quant}_pattern ([#46346](https://github.com/pytorch/pytorch/pull/46346)) * Support in QAT: sigmoid/hardsigmoid/tanh ([#46871](https://github.com/pytorch/pytorch/pull/46871)), convbn{relu}1d ([#47248](https://github.com/pytorch/pytorch/pull/47248)), FloatFunctional ([#46634](https://github.com/pytorch/pytorch/pull/46634)) * custom_module support static/dynamic/weight_only quant ([#46786](https://github.com/pytorch/pytorch/pull/46786)) * Support standalone_module_class ([#47705](https://github.com/pytorch/pytorch/pull/47705)) * Embedding/EmbeddingBag works in static quant qconfig ([#48062](https://github.com/pytorch/pytorch/pull/48062)) * Add MatchAllNode in pattern matching ([#48979](https://github.com/pytorch/pytorch/pull/48979)) * Add support for dynamic quant for RNN and RNNCell ([#49126](https://github.com/pytorch/pytorch/pull/49126)), ConvTranspose{n}d ([#49717](https://github.com/pytorch/pytorch/pull/49717)), quantizing functional linear + {functional relu/module relu} ([#50975](https://github.com/pytorch/pytorch/pull/50975)), functional conv2d + relu ([#51079](https://github.com/pytorch/pytorch/pull/51079)), functional conv1d and conv3d (#51155) ([#51254](https://github.com/pytorch/pytorch/pull/51254)), Scalar as first input for add/mul ([#46751](https://github.com/pytorch/pytorch/pull/46751)), leaky relu ([#45712](https://github.com/pytorch/pytorch/pull/45712)), Embedding ([#46677](https://github.com/pytorch/pytorch/pull/46677)), EmbeddingBag ([#46678](https://github.com/pytorch/pytorch/pull/46678))
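The FX graph mode quantization entries above and below revolve around the `prepare_fx`/`convert_fx` flow. A minimal sketch using the API of this era (`torch.quantization.quantize_fx` with a `qconfig_dict`); note that later releases moved these entry points to `torch.ao.quantization` and added a required `example_inputs` argument:

```python
import torch
import torch.nn as nn
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}   # "" applies the qconfig globally

prepared = prepare_fx(float_model, qconfig_dict)     # symbolically trace and insert observers
for _ in range(8):                                   # calibrate on representative inputs
    prepared(torch.randn(1, 3, 32, 32))

quantized = convert_fx(prepared)                     # produce the quantized GraphModule
```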
* Remove inplace option for convert_fx ([#46955](https://github.com/pytorch/pytorch/pull/46955)) * Support non_traceable_module/module_class ([#46298](https://github.com/pytorch/pytorch/pull/46298)) * Add additional_object_mapping argument to convert ([#46338](https://github.com/pytorch/pytorch/pull/46338)) * Keep linear op unchanged when qconfig is not supported ([#48067](https://github.com/pytorch/pytorch/pull/48067)) * Move {input|output}_quantized_idxs cfg from convert to prepare ([#49238](https://github.com/pytorch/pytorch/pull/49238)) * Allow user to specify qconfig for call_method ([#49621](https://github.com/pytorch/pytorch/pull/49621)) * Do not observe bias on F.conv and F.linear ([#49623](https://github.com/pytorch/pytorch/pull/49623), [#49628](https://github.com/pytorch/pytorch/pull/49628)) * Linear work with float_qparam_dynamic_qconfig ([#47068](https://github.com/pytorch/pytorch/pull/47068)) * Fix error that DefaultQuantizer is not inserted after a module configured with None qconfig ([#47316](https://github.com/pytorch/pytorch/pull/47316)) * Scope support for call_method in QuantizationTracer ([#50173](https://github.com/pytorch/pytorch/pull/50173)) * Support preserved_attributes in prepare_fx ([#50306](https://github.com/pytorch/pytorch/pull/50306)) * Add option to leave graph inputs and/or outputs quantized ([#48624](https://github.com/pytorch/pytorch/pull/48624)) * Support quantization for custom module ([#44074](https://github.com/pytorch/pytorch/pull/44074)) * Remove `inplace` option for fuse_fx ([#46953](https://github.com/pytorch/pytorch/pull/46953)) and prepare_fx ([#46954](https://github.com/pytorch/pytorch/pull/46954)) * Scope support for call_function in QuantizationTracer ([#51086](https://github.com/pytorch/pytorch/pull/51086)) ### ONNX * Preprocess index_put with bool inputs to `torch.masked_{scatter,fill}` ([#45584](https://github.com/pytorch/pytorch/pull/45584)) * Export `torch.{var,var_mean,std_mean}` ops ([#45678](https://github.com/pytorch/pytorch/pull/45678)) * Enable NoneType inputs to export API ([#45792](https://github.com/pytorch/pytorch/pull/45792)) * Add export of prim::dtype, prim::tolist ([#46019](https://github.com/pytorch/pytorch/pull/46019)) * Enable onnx shape inference in export by default ([#46629](https://github.com/pytorch/pytorch/pull/46629)) * Add `torch.silu` operator support for onnx ([#51519](https://github.com/pytorch/pytorch/pull/51519)) * Support list remove for onnx export ([#51526](https://github.com/pytorch/pytorch/pull/51526)) * Added `torch.hardswish` symbolic in opset 9 ([#48423](https://github.com/pytorch/pytorch/pull/48423)) * Add export of `aten::is_floating_point` ([#46442](https://github.com/pytorch/pytorch/pull/46442)) * Add `torch.logical_{and,or,xor}` torch op support in pytorch exporter ([#50909](https://github.com/pytorch/pytorch/pull/50909)) * Add `torch.binary_cross_entropy_with_logits` op to ONNX opset version 12 ([#50908](https://github.com/pytorch/pytorch/pull/50908)) * Support opset13 `nn.Squeeze` and `nn.Unsqueeze` ([#50906](https://github.com/pytorch/pytorch/pull/50906)) * Add export of `prim::data` ([#45747](https://github.com/pytorch/pytorch/pull/45747)) * Support `torch.nonzero(*, as_tuple=True)` export ([#47421](https://github.com/pytorch/pytorch/pull/47421)) * Update Reducesum operator for opset 13 ([#50907](https://github.com/pytorch/pytorch/pull/50907)) (a generic export sketch follows below) ### Misc * Enable python code coverage on windows ([#44548](https://github.com/pytorch/pytorch/pull/44548)) and onnx ([#47387](https://github.com/pytorch/pytorch/pull/47387))
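All of the ONNX entries in this section funnel through the same `torch.onnx.export` call. A generic, illustrative invocation (the model, file name, and axis names are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "model.onnx",
    opset_version=13,                                    # opset 13 support listed above
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```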
* Fix PyTorch compilation on Apple M1 chips ([#48275](https://github.com/pytorch/pytorch/pull/48275), [#49701](https://github.com/pytorch/pytorch/pull/49701)) # Improvements ### Python API * Add integer support (by promoting integer to float) to `torch.{cos,sin,tan}` ([#45733](https://github.com/pytorch/pytorch/pull/45733), [#46706](https://github.com/pytorch/pytorch/pull/46706)), `torch.log{2,10}` ([#46810](https://github.com/pytorch/pytorch/pull/46810)), `torch.{a}tanh` ([#47064](https://github.com/pytorch/pytorch/pull/47064)), `torch.a{cos, tan}` ([#47005](https://github.com/pytorch/pytorch/pull/47005)), `torch.a{cosh, sinh}` ([#47152](https://github.com/pytorch/pytorch/pull/47152)), `torch.sqrt` ([#47293](https://github.com/pytorch/pytorch/pull/47293)), `torch.log1p` ([#48002](https://github.com/pytorch/pytorch/pull/48002)), `torch.erf{c}` ([#48472](https://github.com/pytorch/pytorch/pull/48472)), `torch.asin` ([#48461](https://github.com/pytorch/pytorch/pull/48461)), `torch.sigmoid` ([#47551](https://github.com/pytorch/pytorch/pull/47551)), `torch.sinh` ([#48644](https://github.com/pytorch/pytorch/pull/48644)), `torch.cosh` ([#48923](https://github.com/pytorch/pytorch/pull/48923)), `torch.exp{2, m1}` ([#48926](https://github.com/pytorch/pytorch/pull/48926)), `torch.reciprocal` ([#49102](https://github.com/pytorch/pytorch/pull/49102)), `torch.erfinv` ([#49155](https://github.com/pytorch/pytorch/pull/49155)), `torch.rsqrt` ([#47909](https://github.com/pytorch/pytorch/pull/47909)), `torch.exp` ([#50093](https://github.com/pytorch/pytorch/pull/50093)), `torch.lgamma` ([#50140](https://github.com/pytorch/pytorch/pull/50140)) * Add optional `dtype` argument to `Tensor.view` ([#47951](https://github.com/pytorch/pytorch/pull/47951)) * Add `out` optional arguments to `torch.{reshape,flatten}` ([#51249](https://github.com/pytorch/pytorch/pull/51249)), `torch.tensordot` ([#47278](https://github.com/pytorch/pytorch/pull/47278)), `torch.fft.*` ([#49335](https://github.com/pytorch/pytorch/pull/49335)), `torch.narrow_copy` ([#49502](https://github.com/pytorch/pytorch/pull/49502)) * Add support for int32 indices and offset in `nn.Embedding` and `nn.EmbeddingBag` ([#46758](https://github.com/pytorch/pytorch/pull/46758)) * Add boolean type support to `torch.where` ([#47454](https://github.com/pytorch/pytorch/pull/47454)), `torch.mul` and `Tensor.__mul__` ([#48637](https://github.com/pytorch/pytorch/pull/48637)), `torch.diag` ([#47455](https://github.com/pytorch/pytorch/pull/47455)), `torch.{all,any}` ([#44790](https://github.com/pytorch/pytorch/pull/44790)), `Tensor.to_dense` ([#50019](https://github.com/pytorch/pytorch/pull/50019)) * Add inplace version of `torch.cum{sum,prod}_` ([#47651](https://github.com/pytorch/pytorch/pull/47651)) * Add sparse support to `torch.sqrt` ([#50088](https://github.com/pytorch/pytorch/pull/50088)) * Add support for both `dtype` and `ord` arguments in `torch.linalg.norm` ([#46637](https://github.com/pytorch/pytorch/pull/46637)) * Make `torch.nn` Module accept batch size of 0: `nn.ReplicationPad` ([#39137](https://github.com/pytorch/pytorch/pull/39137)), `nn.Unfold` ([#40689](https://github.com/pytorch/pytorch/pull/40689)), `nn.PixelShuffle` ([#49187](https://github.com/pytorch/pytorch/pull/49187)), `nn.AvgPool{1,2,3}d` ([#50008](https://github.com/pytorch/pytorch/pull/50008)), `nn.MultiLabelMarginLoss` and `nn.MultiMarginLoss`
([#50007](https://github.com/pytorch/pytorch/pull/50007)) * `utils.cpp_extensions` Ensure default extra_compile_args are properly handled ([#45956](https://github.com/pytorch/pytorch/pull/45956)) * `torch.LongTensor` legacy construction improved error message ([#46147](https://github.com/pytorch/pytorch/pull/46147)) * `torch.utils.checkpoint` allow having Tensors that don’t require gradients ([#45934](https://github.com/pytorch/pytorch/pull/45934)) * `torch.nan_to_num`: fix deprecated warnings ([#46309](https://github.com/pytorch/pytorch/pull/46309)) * Remove more use of “blacklist” ([#45512](https://github.com/pytorch/pytorch/pull/45512), [#45781](https://github.com/pytorch/pytorch/pull/45781)) * Add type annotation to submodules: `torch.nn.cpp` ([#46490](https://github.com/pytorch/pytorch/pull/46490)), `torch.nn.parallel.comm` ([#46736](https://github.com/pytorch/pytorch/pull/46736)), `torch.nn.modules.*` ([#46828](https://github.com/pytorch/pytorch/pull/46828), [#45772](https://github.com/pytorch/pytorch/pull/45772), [#46013](https://github.com/pytorch/pytorch/pull/46013), [#49957](https://github.com/pytorch/pytorch/pull/49957), [#49479](https://github.com/pytorch/pytorch/pull/49479), [#49045](https://github.com/pytorch/pytorch/pull/49045), [#49035](https://github.com/pytorch/pytorch/pull/49035), [#49494](https://github.com/pytorch/pytorch/pull/49494), [#48969](https://github.com/pytorch/pytorch/pull/48969)), autograd functions from c++ ([#46622](https://github.com/pytorch/pytorch/pull/46622)), `torch.distributed` functions from c++ ([#46623](https://github.com/pytorch/pytorch/pull/46623)), `torch.storage` ([#46876](https://github.com/pytorch/pytorch/pull/46876)), `torch._tensor_str` ([#48463](https://github.com/pytorch/pytorch/pull/48463), [#48584](https://github.com/pytorch/pytorch/pull/48584)), `torch.nn.modules.pooling` ([#48412](https://github.com/pytorch/pytorch/pull/48412)), `common_nn` ([#48190](https://github.com/pytorch/pytorch/pull/48190)), `torch.lobpcg` ([#47680](https://github.com/pytorch/pytorch/pull/47680)), `torch.nn.functional` ([#50106](https://github.com/pytorch/pytorch/pull/50106)), `torch.overrides` ([#50824](https://github.com/pytorch/pytorch/pull/50824)), `torch.generate_torch_version` ([#51637](https://github.com/pytorch/pytorch/pull/51637)), `torch.distributions` ([#45689](https://github.com/pytorch/pytorch/pull/45689)), `torch.quantization.quantize_jit` ([#45548](https://github.com/pytorch/pytorch/pull/45548)), `torch.utils.tensorboard` ([#49834](https://github.com/pytorch/pytorch/pull/49834)), `torch.multiprocessing` ([#47756](https://github.com/pytorch/pytorch/pull/47756)), `torch.cuda` ([#47134](https://github.com/pytorch/pytorch/pull/47134)), `torch._C._distributed_rpc` ([#46624](https://github.com/pytorch/pytorch/pull/46624)), `torch.distributed.*` ([#47531](https://github.com/pytorch/pytorch/pull/47531), [#47532](https://github.com/pytorch/pytorch/pull/47532), [#47533](https://github.com/pytorch/pytorch/pull/47533), [#47534](https://github.com/pytorch/pytorch/pull/47534)), `torch.nn.parallel._functions` ([#49687](https://github.com/pytorch/pytorch/pull/49687)) * Make comparison fail when dtypes don’t match ([#47288](https://github.com/pytorch/pytorch/pull/47288)) * Allow large inputs for `torch.svd` ([#47440](https://github.com/pytorch/pytorch/pull/47440)) * Add nondeterministic alerts to `torch.index_copy`, `torch.median` on CUDA and `torch.kthvalue` on CUDA ([#46942](https://github.com/pytorch/pytorch/pull/46942)) * Add float16 and bfloat16 support to 
`torch.where` ([#49004](https://github.com/pytorch/pytorch/pull/49004)), `torch.matmul` ([#47873](https://github.com/pytorch/pytorch/pull/47873)) * Add float16 support for CPU and bfloat16 support for CPU & CUDA to `torch.flip` and `torch.flip{lr, ud}` ([#49895](https://github.com/pytorch/pytorch/pull/49895)) * Add support for providing `indices` as a Tensor for `torch.tensor_split` ([#49169](https://github.com/pytorch/pytorch/pull/49169)) * Add support for SELU activation in `torch.nn.init.calculate_gain` ([#50664](https://github.com/pytorch/pytorch/pull/50664)) * Add functional version of `torch.optim` optimizers and refactor existing classes to use the functional version: SGD ([#45597](https://github.com/pytorch/pytorch/pull/45597)), Adadelta ([#50409](https://github.com/pytorch/pytorch/pull/50409)), RMSProp ([#50410](https://github.com/pytorch/pytorch/pull/50410)), AdamW ([#50411](https://github.com/pytorch/pytorch/pull/50411)) * Improve error message when window is on wrong device for `torch.fft.stft` ([#51128](https://github.com/pytorch/pytorch/pull/51128)) * Add rounding_mode selection to `torch.div` ([#51706](https://github.com/pytorch/pytorch/pull/51706), [#52242](https://github.com/pytorch/pytorch/pull/52242)) * Remove spurious numpy writable warning ([#47271](https://github.com/pytorch/pytorch/pull/47271)) * Enable deterministic mode for rocBLAS ([#48654](https://github.com/pytorch/pytorch/pull/48654)) * Hipify submodule revamp and improved integration with cpp_extensions ([#48715](https://github.com/pytorch/pytorch/pull/48715)) * Remove warning about saving state in `torch.optim.lr_scheduler.LambdaLR` ([#46813](https://github.com/pytorch/pytorch/pull/46813)) * Improve typing of `torch.nn.Unflatten` ([#49838](https://github.com/pytorch/pytorch/pull/49838)) * Add exception classification to `torch.multiprocessing.spawn` ### Autograd * Add double backward checks for the `torch.fft` submodule ([#46004](https://github.com/pytorch/pytorch/pull/46004)) * Detect inplace modifications of views of leaf Tensors earlier to improve error ([#46204](https://github.com/pytorch/pytorch/pull/46204)) ### torch.utils * `data.TensorDataset`: Add more specific error message ([#46905](https://github.com/pytorch/pytorch/pull/46905)) * `data.DistributedSampler`: Additional validation ([#48865](https://github.com/pytorch/pytorch/pull/48865)) ### Complex Numbers * Improve error message thrown by `torch.sign` for complex tensors ([#43280](https://github.com/pytorch/pytorch/pull/43280)) * Remove unnecessary dtype checks for complex types and disable complex dispatch for CPU `torch.{min,max}` pointwise ops ([#50465](https://github.com/pytorch/pytorch/pull/50465)) ### CUDA * Allow consumer ops to sync on autograd engine base gradient ([#45787](https://github.com/pytorch/pytorch/pull/45787)) * Add `torch::cuda::nccl::{send,recv}` ([#45926](https://github.com/pytorch/pytorch/pull/45926)) * Cusolver inverse check info ([#46625](https://github.com/pytorch/pytorch/pull/46625)) * Make numpy optional dependency for `torch.cuda.amp` ([#48154](https://github.com/pytorch/pytorch/pull/48154)) * Support all visible cards when building a cuda extension ([#48891](https://github.com/pytorch/pytorch/pull/48891)) * Enable using `torch.utils.checkpoint.checkpoint` and `torch.cuda.amp` at the same time ([#49757](https://github.com/pytorch/pytorch/pull/49757)) (see the sketch below) * Make `DeviceCachingAllocator`'s error handling more defensive and a bit easier to read ([#51158](https://github.com/pytorch/pytorch/pull/51158))
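A minimal sketch of combining activation checkpointing with automatic mixed precision, as enabled by the entry above; the toy module and shapes are placeholders, and autocast/GradScaler are simply disabled when no GPU is present:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).to(device)
x = torch.randn(32, 256, device=device, requires_grad=True)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

with torch.cuda.amp.autocast(enabled=use_amp):
    y = checkpoint(block, x)          # activations are recomputed during backward
    loss = y.float().pow(2).mean()

scaler.scale(loss).backward()         # scaled loss flows through the checkpointed region
```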
### Distributed * Create NCCL communicator for send/recv on demand ([#44922](https://github.com/pytorch/pytorch/pull/44922)) * Reduce the peak memory of fp16 compression DDP comm hook by avoiding converting to fp32 ([#46078](https://github.com/pytorch/pytorch/pull/46078)) * Allow RPC framework to use rank in addition to `WorkerInfo` and name. ([#46221](https://github.com/pytorch/pytorch/pull/46221)) * Add `getNumKeys()` ([#46048](https://github.com/pytorch/pytorch/pull/46048)) and `deleteKey()` ([#46049](https://github.com/pytorch/pytorch/pull/46049)) to the `HashStore` * Print exception message on both RPC caller and callee ([#46372](https://github.com/pytorch/pytorch/pull/46372)) * Add RRef proxy support for `ScriptModule` methods ([#48339](https://github.com/pytorch/pytorch/pull/48339)) * Support retrieving the RRef to the remote module ([#48983](https://github.com/pytorch/pytorch/pull/48983)) * Add a c++ interface in processGroup to get its backend name ([#51066](https://github.com/pytorch/pytorch/pull/51066)) * Enable `NamedTuple` data type to work with DDP ([#44220](https://github.com/pytorch/pytorch/pull/44220)) * Support send/recv to/from self when communicator is created on demand ([#45873](https://github.com/pytorch/pytorch/pull/45873)) * Add error log when ProcessGroupNCCL takes down a process ([#44988](https://github.com/pytorch/pytorch/pull/44988)) * Provide additional information about NCCL error codes. ([#45950](https://github.com/pytorch/pytorch/pull/45950)) * Avoid scatter for single-device case in DDP ([#46304](https://github.com/pytorch/pytorch/pull/46304)) * Use blocking wait if both blocking wait and async error handling are set ([#47926](https://github.com/pytorch/pytorch/pull/47926)) * Provide more information while crashing a process in async error handling ([#47246](https://github.com/pytorch/pytorch/pull/47246)) * Add PowerSGD comm hook ([#48060](https://github.com/pytorch/pytorch/pull/48060)) * Define a customized state for PowerSGD comm hook ([#48348](https://github.com/pytorch/pytorch/pull/48348)) * Add a random generator to PowerSGD state for initializing low-rank matrix Q ([#48507](https://github.com/pytorch/pytorch/pull/48507)) * Replace the key of `error_dict` in PowerSGD state with bucket index ([#48867](https://github.com/pytorch/pytorch/pull/48867)) * Make `CUDAFuture` remember and restore current device in callback ([#48789](https://github.com/pytorch/pytorch/pull/48789)) * Update pipeline API to accept arbitrary sequence of Tensors and not just Tuple ([#48467](https://github.com/pytorch/pytorch/pull/48467)) * Use `group.WORLD` appropriately in process group initialization. ([#48767](https://github.com/pytorch/pytorch/pull/48767)) * Add error feedback to layerwise PowerSGD ([#49418](https://github.com/pytorch/pytorch/pull/49418)) * Enable warm-start of PowerSGD by reusing states from the previous iteration ([#49451](https://github.com/pytorch/pytorch/pull/49451)) * Change `wait()` to `value()` in some callbacks of PowerSGD communication hook ([#49709](https://github.com/pytorch/pytorch/pull/49709)) * Ensure DDP + Pipe works with `find_unused_parameters`. 
([#49908](https://github.com/pytorch/pytorch/pull/49908)) * Enable TensorPipe CUDA sending to self ([#50674](https://github.com/pytorch/pytorch/pull/50674)) and GDR channel ([#50763](https://github.com/pytorch/pytorch/pull/50763)) * Add warning to distributed optimizer ([#50630](https://github.com/pytorch/pytorch/pull/50630)) * Make python object collective API args consistent ([#50625](https://github.com/pytorch/pytorch/pull/50625)) * Add option to make `rref.get_type` non-blocking. ([#50977](https://github.com/pytorch/pytorch/pull/50977)) * Unescape string in RPC error message ([#49373](https://github.com/pytorch/pytorch/pull/49373)) * Event Logging for NCCL Async Error Handling Process Crash ([#47244](https://github.com/pytorch/pytorch/pull/47244)) * Remove `balance` and `devices` parameter from Pipe. ([#48432](https://github.com/pytorch/pytorch/pull/48432)) * Error feedback for PowerSGD DDP comm hook ([#48670](https://github.com/pytorch/pytorch/pull/48670)) * Add an index field to `GradBucket` for PowerSGD ([#48757](https://github.com/pytorch/pytorch/pull/48757)) * Have `FutureNCCL` record streams w/ allocator in addCallback ([#48496](https://github.com/pytorch/pytorch/pull/48496)) and events in current stream ([#48497](https://github.com/pytorch/pytorch/pull/48497)) * Use fresh stream from pool for each `FutureNCCL` callback ([#48498](https://github.com/pytorch/pytorch/pull/48498)) * Record CUDA events for "follow-up" `FutureNCCL` inside `markCompleted()` ([#48499](https://github.com/pytorch/pytorch/pull/48499)) * Fix `FutureNCCL`'s `completed()` disagreeing with `wait()` ([#48503](https://github.com/pytorch/pytorch/pull/48503)) * Fix `FutureNCCL` not recording `DataPtr`s with caching alloc in `wait()` ([#48563](https://github.com/pytorch/pytorch/pull/48563)) * Add multi-GPU support to `FutureNCCL` ([#48500](https://github.com/pytorch/pytorch/pull/48500)) * Don't store device indices separately on `FutureNCCL` ([#48501](https://github.com/pytorch/pytorch/pull/48501)) * Support wider range of types in `FutureNCCL` ([#48502](https://github.com/pytorch/pytorch/pull/48502)) * Split `FutureNCCL`'s CUDA-specific parts from generic future logic ([#48504](https://github.com/pytorch/pytorch/pull/48504)) * Merge common parts of FutureNCCL into `at::ivalue::Future` ([#48505](https://github.com/pytorch/pytorch/pull/48505)) * Split out reusable `CUDAFuture` from `FutureNCCL` ([#48506](https://github.com/pytorch/pytorch/pull/48506)) * Cache the `DataPtr`s in `CUDAFuture` ([#48788](https://github.com/pytorch/pytorch/pull/48788)) * Modify `Pipe` to return an RRef. ([#47829](https://github.com/pytorch/pytorch/pull/47829)) * Cleanup APIs for pipeline parallelism. ([#48630](https://github.com/pytorch/pytorch/pull/48630)) * Fix TCPStore type coercion ([#49685](https://github.com/pytorch/pytorch/pull/49685)) * Simplify the implementation of error feedback and warm-start ([#50981](https://github.com/pytorch/pytorch/pull/50981)) * Explicitly specify the `dtype` of the error tensor ([#50985](https://github.com/pytorch/pytorch/pull/50985)) * Check `start_PowerSGD_iter > 1` and add guidance on tuning PowerSGD configs. 
([#51427](https://github.com/pytorch/pytorch/pull/51427)) * Check if the backend is NCCL when a DDP communication hook is registered ([#51759](https://github.com/pytorch/pytorch/pull/51759)) ### TorchScript * Add multiline string dedent support ([#45580](https://github.com/pytorch/pytorch/pull/45580)) * Add string versions of argument funcs in jit Node ([#45464](https://github.com/pytorch/pytorch/pull/45464)) * Make sure each `warnings.warn` only executes once inside TorchScript. ([#45382](https://github.com/pytorch/pytorch/pull/45382)) * Allow slicing multiple dimensions with indexes if not Tuple ([#45239](https://github.com/pytorch/pytorch/pull/45239)) * Change type inferred from empty annotation ([#45360](https://github.com/pytorch/pytorch/pull/45360)) * Fix stride printing/parsing formatting ([#45156](https://github.com/pytorch/pytorch/pull/45156)) * Make objects throw Python AttributeError on nonexistent attr access ([#45911](https://github.com/pytorch/pytorch/pull/45911)) * Make InsertInstruction overflow check a warning instead of fatal ([#46369](https://github.com/pytorch/pytorch/pull/46369)) * Add an option to getWriteableTensorData to avoid copying CUDA tensor to CPU ([#46524](https://github.com/pytorch/pytorch/pull/46524)) * Add error messages and workaround for RET failure of containers with a torch class type ([#46543](https://github.com/pytorch/pytorch/pull/46543)) * Correctly mark unannotated NamedTuple field to be inferred TensorType ([#46969](https://github.com/pytorch/pytorch/pull/46969)) * Enable ModuleDict non-literal indexing ([#45716](https://github.com/pytorch/pytorch/pull/45716)) * Add an attribute to the torchscript model exported by metal ([#47174](https://github.com/pytorch/pytorch/pull/47174)) * Print out interface mismatch for prim::ModuleDictIndex ([#47300](https://github.com/pytorch/pytorch/pull/47300)) * Better message for bad type annotation ([#47464](https://github.com/pytorch/pytorch/pull/47464)) * Resolve string literal type annotations using `Resolver::resolveType` ([#47731](https://github.com/pytorch/pytorch/pull/47731)) * Resolve `torch.device` in recursive compilation of classes ([#47734](https://github.com/pytorch/pytorch/pull/47734)) * Metacompile boolean constants ([#46721](https://github.com/pytorch/pytorch/pull/46721)) * Allow JIT unpickler to accept CUDA DataPtr from read_record_ ([#46827](https://github.com/pytorch/pytorch/pull/46827)) * Skip None submodule during JIT-tracing ([#49765](https://github.com/pytorch/pytorch/pull/49765)) * Add `__prepare_scriptable__` duck typing to allow replacing `nn.Module`s with scriptable preparations (#45645) ([#49242](https://github.com/pytorch/pytorch/pull/49242)) * Fix deprecation warning in scalar_type_analysis ([#50218](https://github.com/pytorch/pytorch/pull/50218)) * Support scripting classmethod called with object instances ([#49967](https://github.com/pytorch/pytorch/pull/49967)) * Use FileStore in TorchScript for store registry ([#50248](https://github.com/pytorch/pytorch/pull/50248)) * Treat has_torch_function and object_has_torch_function as static False when scripting ([#48966](https://github.com/pytorch/pytorch/pull/48966)) * Print better error when class attribute IValue conversion fails ([#50255](https://github.com/pytorch/pytorch/pull/50255)) * Clean up some type annotations in test/jit/...../test_class_type.py ([#50156](https://github.com/pytorch/pytorch/pull/50156)) * Type annotations in test/jit ([#50293](https://github.com/pytorch/pytorch/pull/50293)) * Eliminate static 
default_extra_files_mobile from header import.h ([#50832](https://github.com/pytorch/pytorch/pull/50832)) * Dump torch::jit::AliasDb objects as Graphviz files ([#50452](https://github.com/pytorch/pytorch/pull/50452)) * Fix test_jit_cuda_archflags on machine with more than one arch ([#50405](https://github.com/pytorch/pytorch/pull/50405)) * Provide more info when attribute fails to convert ([#50870](https://github.com/pytorch/pytorch/pull/50870)) * Adding correct error message for for..else ([#51258](https://github.com/pytorch/pytorch/pull/51258)) * Handle error during dict expansion ([#51374](https://github.com/pytorch/pytorch/pull/51374)) ### Mobile * Update default output extension in optimize_for_mobile.cc ([#45598](https://github.com/pytorch/pytorch/pull/45598)) * Add named tuple's error message and workaround for RET failure ([#46347](https://github.com/pytorch/pytorch/pull/46347)) * [Metal] Add metal backend type ([#46455](https://github.com/pytorch/pytorch/pull/46455)) * [Metal] Add the Python binding for optimize_for_mobile ([#46456](https://github.com/pytorch/pytorch/pull/46456)) * [Metal] Add pin_memory check in empty_strided ([#47228](https://github.com/pytorch/pytorch/pull/47228)) * [Metal] Calculate strides for metal tensors ([#50309](https://github.com/pytorch/pytorch/pull/50309)) * [Metal] Clean up the operator tests ([#50311](https://github.com/pytorch/pytorch/pull/50311)) * Add an overload for deserialize() that doesn't accept the extra_files map. ([#50932](https://github.com/pytorch/pytorch/pull/50932)) * bundled_inputs: Preserve bundled input related methods when calling optimize_for_mobile ([#49170](https://github.com/pytorch/pytorch/pull/49170)) * bundled_inputs: Preserved all functions generated by bundled inputs ([#51496](https://github.com/pytorch/pytorch/pull/51496)) * bundled_inputs: Expanded Bundled Inputs To Any Public Function ([#51153](https://github.com/pytorch/pytorch/pull/51153)) * Expose _export_operator_list to python ([#51312](https://github.com/pytorch/pytorch/pull/51312)) ### Quantization * Quantized Operators and Modules * Add reflection padding to conv ([#49011](https://github.com/pytorch/pytorch/pull/49011)) * Add support for 2D indices for quantized embedding operators ([#47766](https://github.com/pytorch/pytorch/pull/47766)) * quantize_tensor_per_channel ARM implementation ([#46018](https://github.com/pytorch/pytorch/pull/46018)) * Support either min or max in qclamp ([#45937](https://github.com/pytorch/pytorch/pull/45937)) * Add preliminary support for advanced indexing ([#49346](https://github.com/pytorch/pytorch/pull/49346)) * Add backend_independent option for quantized linear module ([#48192](https://github.com/pytorch/pytorch/pull/48192)) * Add out-variant for the reflection pad ([#48037](https://github.com/pytorch/pytorch/pull/48037)) * Support 2 dim input in quantized batchnorm 1d ([#51597](https://github.com/pytorch/pytorch/pull/51597)) * Typing, Formatting, Error Messages, Logging and Tests * numeric suite: add types to eager ([#51168](https://github.com/pytorch/pytorch/pull/51168)) * Enable type check for torch.quantization.fake_quantize ([#45701](https://github.com/pytorch/pytorch/pull/45701)) * Type check for `torch.quantization.observer` ([#45630](https://github.com/pytorch/pytorch/pull/45630)), `torch.quantization._numeric_suite` ([#46330](https://github.com/pytorch/pytorch/pull/46330)), `torch.quantization.stubs` ([#46475](https://github.com/pytorch/pytorch/pull/46475)), `quantization.fx.Quantizer` 
([#48343](https://github.com/pytorch/pytorch/pull/48343)), `quantization.fx.Quantizer` ([#48350](https://github.com/pytorch/pytorch/pull/48350)), `quantization_mappings.py` ([#49179](https://github.com/pytorch/pytorch/pull/49179)), `fusion_patterns.py` ([#49606](https://github.com/pytorch/pytorch/pull/49606)), `torch/nn/quantized/modules` ([#49941](https://github.com/pytorch/pytorch/pull/49941)), quantization-related files in `torch/jit` ([#49939](https://github.com/pytorch/pytorch/pull/49939)), fuser ([#48844](https://github.com/pytorch/pytorch/pull/48844)), quantization_patterns ([#48851](https://github.com/pytorch/pytorch/pull/48851)), observed_module.py ([#49607](https://github.com/pytorch/pytorch/pull/49607)), quantization ([#49942](https://github.com/pytorch/pytorch/pull/49942)) * Enable mypy on `torch/quantization/fx/*` ([#48331](https://github.com/pytorch/pytorch/pull/48331)) * Make each line of fx/quantize.py <=80 chars ([#48357](https://github.com/pytorch/pytorch/pull/48357)) * Add more typehints ([#48774](https://github.com/pytorch/pytorch/pull/48774), [#48794](https://github.com/pytorch/pytorch/pull/48794), [#48792](https://github.com/pytorch/pytorch/pull/48792)) * Nice error message on convtranspose with per-channel weight ([#49899](https://github.com/pytorch/pytorch/pull/49899)) * Throw a nice error message for allclose with quantized inputs ([#49802](https://github.com/pytorch/pytorch/pull/49802)) * Add type annotations to torch.nn.quantized.modules.conv ([#49702](https://github.com/pytorch/pytorch/pull/49702)) * Add type annotations to conv_fused/blas_compare/blas_compare_setup ([#51235](https://github.com/pytorch/pytorch/pull/51235)) * Add API usage logging to numeric suite ([#46504](https://github.com/pytorch/pytorch/pull/46504)) and quantization ([#46095](https://github.com/pytorch/pytorch/pull/46095)) * Sparsity * Block Sparse kernel ([#50585](https://github.com/pytorch/pytorch/pull/50585)) * Add A matrix pretransformed based sparse kernels for linear ([#50587](https://github.com/pytorch/pytorch/pull/50587)) * Add dynamic linear sparse kernel for arm64 ([#50591](https://github.com/pytorch/pytorch/pull/50591)) * Others * Use tensor's quantized properties directly in pickler ([#46267](https://github.com/pytorch/pytorch/pull/46267)) * Remove register api and rename get_*mapping to get_default*_mapping ([#46337](https://github.com/pytorch/pytorch/pull/46337)) * Update HistogramObserver to be scriptable ([#51081](https://github.com/pytorch/pytorch/pull/51081)) * Support varying size input in numeric suite ([#47391](https://github.com/pytorch/pytorch/pull/47391)) * Backend string for the quantized types ([#49965](https://github.com/pytorch/pytorch/pull/49965)) * Disable pruning on embedding lookup operators when compressed_indices_mapping = {0} ([#48672](https://github.com/pytorch/pytorch/pull/48672)) * Support out variant of embedding_bag_byte_rowwise_offsets_out ([#49561](https://github.com/pytorch/pytorch/pull/49561)) ### ONNX * Update embedding_bag export ([#44693](https://github.com/pytorch/pytorch/pull/44693)) * Improve error handling for adaptive_pool ([#45874](https://github.com/pytorch/pytorch/pull/45874)) * Support nd mask index in opset >= 11 ([#45252](https://github.com/pytorch/pytorch/pull/45252)) * Update peephole pass for prim::ListUnpack ([#46264](https://github.com/pytorch/pytorch/pull/46264)) * Slightly improve indexing with ellipsis under scripting ([#46571](https://github.com/pytorch/pytorch/pull/46571)) * Update batch_norm symbolic to handle 
track_running_stats=False ([#47135](https://github.com/pytorch/pytorch/pull/47135)) * Cast Gather index to Long if needed ([#47653](https://github.com/pytorch/pytorch/pull/47653)) * Handle dynamic input axes for prim_ConstantChunk ([#48176](https://github.com/pytorch/pytorch/pull/48176)) * Remove usage of isCompleteTensor() in symbolic functions ([#48162](https://github.com/pytorch/pytorch/pull/48162)) * Changes to export API to better handle named arguments ([#47367](https://github.com/pytorch/pytorch/pull/47367)) * Modified var_mean symbolic to support more combinations of dims ([#48949](https://github.com/pytorch/pytorch/pull/48949)) * Support gelu for fp16 export ([#50911](https://github.com/pytorch/pytorch/pull/50911)) * Enable Constant Folding for ONNX Opset 13 ([#51523](https://github.com/pytorch/pytorch/pull/51523)) * Export and shape inference for prim uninitialized in If subblock ([#46094](https://github.com/pytorch/pytorch/pull/46094)) * Scripting support for inputs to index_put ([#46866](https://github.com/pytorch/pytorch/pull/46866)) * Track and list model params for scripting ([#47348](https://github.com/pytorch/pytorch/pull/47348)) * Modifications in remove inplace ops passes to better handle binary inplace ops ([#51572](https://github.com/pytorch/pytorch/pull/51572)) * Improve error message for parse_arg in symbolic functions ([#51516](https://github.com/pytorch/pytorch/pull/51516)) * Update error message that displays when encountering an op unsupported for ONNX export ([#51522](https://github.com/pytorch/pytorch/pull/51522)) * Preserve param names during in-place op removal ([#50955](https://github.com/pytorch/pytorch/pull/50955)) * Handle sequence output shape and type inference ([#50599](https://github.com/pytorch/pytorch/pull/50599)) * Update constant-folding of Gather op to include cases where rank of indices input is 0 ([#51514](https://github.com/pytorch/pytorch/pull/51514)) * Update unsafe_chunk() method to support new version 13 of Split operator ([#51524](https://github.com/pytorch/pytorch/pull/51524)) * Replace optional parameters of Resize with placeholder for ops13 ([#50954](https://github.com/pytorch/pytorch/pull/50954)) ### Vulkan This release brings about a complete rewrite of PyTorch’s Vulkan backend with primary focus on improved performance, robustness, and better code structure and organization. These changes are transparent to the end user. Considering that this is a rewrite, many of these changes also qualify as performance improvements. * Add Vulkan Tensor factory. ([#44016](https://github.com/pytorch/pytorch/pull/44016)) * Redo Vulkan command and descriptor pools. 
([#44496](https://github.com/pytorch/pytorch/pull/44496)) * Add low-level utilities: image sampler ([#45037](https://github.com/pytorch/pytorch/pull/45037)), fence ([#45148](https://github.com/pytorch/pytorch/pull/45148)), tensor copy ([#46481](https://github.com/pytorch/pytorch/pull/46481)), job dispatch and flush ([#46008](https://github.com/pytorch/pytorch/pull/46008)) * Add more ops: Add ([#44017](https://github.com/pytorch/pytorch/pull/44017)), Mul ([#47021](https://github.com/pytorch/pytorch/pull/47021)), Mm, Pool, Upsample ([#47063](https://github.com/pytorch/pytorch/pull/47063)), Conv2D ([#46900](https://github.com/pytorch/pytorch/pull/46900), [#48266](https://github.com/pytorch/pytorch/pull/48266), [#48816](https://github.com/pytorch/pytorch/pull/48816)), clamp ([#47196](https://github.com/pytorch/pytorch/pull/47196)), reshape ([#47252](https://github.com/pytorch/pytorch/pull/47252)), mean ([#47312](https://github.com/pytorch/pytorch/pull/47312)) * Add CMake option to enable Vulkan [v2] API. ([#46503](https://github.com/pytorch/pytorch/pull/46503)) * Add `Tensor.is_vulkan` ([#46655](https://github.com/pytorch/pytorch/pull/46655)) ### Misc * Factory operators (at::empty, at::zeros, ...) now have a new overload in the C++ API that takes ScalarType, Layout, Device and pin_memory parameters separately, in addition to the previously existing overload that takes one TensorOptions argument. ([#44087](https://github.com/pytorch/pytorch/pull/44087)) # Bug fixes ### Python API * Fix `torch.nn.BatchNorm{1,2,3}d` channels_last contiguity check ([#50659](https://github.com/pytorch/pytorch/pull/50659)) * Fix `torch.nn.ConstantPadNd` not preserving memory format ([#50898](https://github.com/pytorch/pytorch/pull/50898)) * Fix dtype of first sample in `torch.quasirandom.SobolEngine` ([#51578](https://github.com/pytorch/pytorch/pull/51578)) * Fix bug in `torch.sspaddmm` ([#45963](https://github.com/pytorch/pytorch/pull/45963)) * Check `support_as_strided` before using `torch.empty_strided` ([#46746](https://github.com/pytorch/pytorch/pull/46746)) * Fix internal assert for `torch.heaviside` with cuda tensor and cpu scalar tensor ([#46831](https://github.com/pytorch/pytorch/pull/46831)) * Fix negative column numbers for `torch.eye` ([#46841](https://github.com/pytorch/pytorch/pull/46841)) * Fix segfault with `torch.orgqr` ([#46700](https://github.com/pytorch/pytorch/pull/46700)) * Fix `torch.nn.functional.embedding` padding_idx behavior ([#46714](https://github.com/pytorch/pytorch/pull/46714)) * Fix `torch.nn.Embedding.from_pretrained` to properly handle the `padding_idx` argument ([#47184](https://github.com/pytorch/pytorch/pull/47184)) * Fix functions not handling discontiguous Tensors properly: `torch.dropout` ([#47552](https://github.com/pytorch/pytorch/pull/47552)), `torch.median` ([#46917](https://github.com/pytorch/pytorch/pull/46917)) * Fix max_pool2d with ceil_mode ([#46558](https://github.com/pytorch/pytorch/pull/46558)) * Fix type promotion for `torch.trace` on CPU ([#47305](https://github.com/pytorch/pytorch/pull/47305)) * Fix `torch.kthvalue` error for scalar input ([#47600](https://github.com/pytorch/pytorch/pull/47600)) * Fix multinomial when input has 0 probability ([#47386](https://github.com/pytorch/pytorch/pull/47386)) * Fix incorrect warnings in `torch.nn.Parameter{List,Dict}` ([#48315](https://github.com/pytorch/pytorch/pull/48315)) * Fix printing of `torch.device` ([#48655](https://github.com/pytorch/pytorch/pull/48655)) * Fix parameter generator exhaustion in 
`torch.optim.SparseAdam` ([#47724](https://github.com/pytorch/pytorch/pull/47724)) * Fix `torch.pow` bug for complex exponents ([#49809](https://github.com/pytorch/pytorch/pull/49809)) * Fix gradient for `torch.norm` when `p=+inf` ([#48611](https://github.com/pytorch/pytorch/pull/48611)) * Fix `SyncBatchNorm` when stats tracking is disabled ([#50126](https://github.com/pytorch/pytorch/pull/50126)) * Fix `torch.elu` backward when alpha is negative ([#49272](https://github.com/pytorch/pytorch/pull/49272)) * Fix pickling for Tensor-like objects ([#47732](https://github.com/pytorch/pytorch/pull/47732)) * Fix `torch.distributions.Half{Cauchy,Normal}` support for `validate_args=True` ([#50403](https://github.com/pytorch/pytorch/pull/50403), [#50492](https://github.com/pytorch/pytorch/pull/50492)) * Fix `torch.distributions.CatTransform` for `event_dim` > 0 ([#49111](https://github.com/pytorch/pytorch/pull/49111)) * Fix `torch.distributions.Binomial` to retain lazy logit initialization ([#46055](https://github.com/pytorch/pytorch/pull/46055)) * Fix `torch.pow` when exponent is provided as a scalar Tensor and on different device ([#46185](https://github.com/pytorch/pytorch/pull/46185), [#46320](https://github.com/pytorch/pytorch/pull/46320)) * Fix classmethod override argument passing for Tensor-like objects ([#47114](https://github.com/pytorch/pytorch/pull/47114)) * Fix internal assert when inputs are on the wrong device for `torch.{maximum, minimum}` ([#48446](https://github.com/pytorch/pytorch/pull/48446)) * Fix `torch.distributions.utils.broadcast_all` crashing on Tensor-like objects ([#48169](https://github.com/pytorch/pytorch/pull/48169)) * Fix vectorized conversion of `-nan` from float16 to float32 ([#41280](https://github.com/pytorch/pytorch/pull/41280)) * Fix `torch.silu` backward for all backends other than CPU and CUDA ([#49439](https://github.com/pytorch/pytorch/pull/49439)) * Fix wrong output when `torch.kthvalue` `out=` argument overlaps with input ([#48254](https://github.com/pytorch/pytorch/pull/48254)) * Fix advanced indexing for Tensor-like objects ([#49324](https://github.com/pytorch/pytorch/pull/49324)) * Fix `torch.distributions.TransformedDistribution` shape logic ([#50581](https://github.com/pytorch/pytorch/pull/50581)) * Fix `torch.nn.functional.interpolate` backward on GPU for nearest interpolation ([#51240](https://github.com/pytorch/pytorch/pull/51240)) * Fix `torch.svd` ignoring `some` keyword argument for empty inputs ([#51109](https://github.com/pytorch/pytorch/pull/51109)) * Fix `torch.distributions.Dirichlet` `arg_constraints` ([#51369](https://github.com/pytorch/pytorch/pull/51369)) * Use deterministic implementation of `torch.index_put` and `torch.index` backward CPU in deterministic mode ([#51388](https://github.com/pytorch/pytorch/pull/51388)) * Remove spurious warning in `torch.nonzero` ([#51618](https://github.com/pytorch/pytorch/pull/51618)) * Fix calculation of number of elements to not overflow in many c++ implementations ([#46997](https://github.com/pytorch/pytorch/pull/46997)) * Fix Parameter detection as Tensor in c++ backend ([#48963](https://github.com/pytorch/pytorch/pull/48963)) * Fix bug in miopen findAlgorithm ([#46852](https://github.com/pytorch/pytorch/pull/46852)) ### Autograd * Fix deadlock on Windows due to bad thread termination in autograd engine ([#43532](https://github.com/pytorch/pytorch/pull/43532)) * Fix deadlock in tsan builds due to bad locking in the engine ([#45867](https://github.com/pytorch/pytorch/pull/45867)) * Avoid NaN 
values in `torch.cdist` backward for p<1 ([#45720](https://github.com/pytorch/pytorch/pull/45720)) * Fix handling of `requires_grad` arg for `torch.new_{full,empty,zeros}` ([#46486](https://github.com/pytorch/pytorch/pull/46486)) * Fix inplace check logic to be triggered when written-to Tensor does not require gradients ([#46296](https://github.com/pytorch/pytorch/pull/46296)) * Set proper output differentiability for `torch.unique` ([#47930](https://github.com/pytorch/pytorch/pull/47930)), `torch.count_nonzero` ([#50866](https://github.com/pytorch/pytorch/pull/50866)) * Fix race in autograd engine that can lead to `std::out_of_range` error ([#50164](https://github.com/pytorch/pytorch/pull/50164), [#50372](https://github.com/pytorch/pytorch/pull/50372)) * Fix autograd thread crash on destruction with python-3.9 ([#50998](https://github.com/pytorch/pytorch/pull/50998)) * Fix autograd side effects when printing ([#51364](https://github.com/pytorch/pytorch/pull/51364)) * Fix memory leak in anomaly mode ([#51610](https://github.com/pytorch/pytorch/pull/51610)) * Fix `torch.hardsigmoid` backward at boundary values ([#51454](https://github.com/pytorch/pytorch/pull/51454)) ### CUDA * Fix incorrect CUDA `torch.nn.Embedding` result when `max_norm` is not `None` and indices are not sorted ([#45248](https://github.com/pytorch/pytorch/pull/45248)) * Ensure kernel launches are checked ([#46474](https://github.com/pytorch/pytorch/pull/46474), [#46727](https://github.com/pytorch/pytorch/pull/46727)) * Fix bit math ([#46837](https://github.com/pytorch/pytorch/pull/46837)) * Fix test_inverse_singular for cublas path; fix cusolver inverse multi-stream issue ([#47026](https://github.com/pytorch/pytorch/pull/47026)) * Fix indices computation for trilinear interpolate backwards ([#50084](https://github.com/pytorch/pytorch/pull/50084)) * Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 ([#50110](https://github.com/pytorch/pytorch/pull/50110)) * Disable cuDNN persistent RNN on `sm_86` devices ([#49534](https://github.com/pytorch/pytorch/pull/49534)) * Fix error with `torch.flip` for cuda tensors when `dims=()` ([#50325](https://github.com/pytorch/pytorch/pull/50325)) * Fix replication_pad CUDA launch configuration ([#50565](https://github.com/pytorch/pytorch/pull/50565)) * Workaround for MAGMA accessing illegal memory in batched cholesky ([#50957](https://github.com/pytorch/pytorch/pull/50957)) * Fix `torch.cdist` backward CUDA error due to illegal gridDim setting ([#51569](https://github.com/pytorch/pytorch/pull/51569)) * Prevent CUDAFuture from using uninitialized device index ([#51505](https://github.com/pytorch/pytorch/pull/51505)) * Fix incorrect usage of CUDACachingAllocator ([#48817](https://github.com/pytorch/pytorch/pull/48817)) * Fix `torch.cuda.memory_allocated` to return `{}` if not initialized ([#51179](https://github.com/pytorch/pytorch/pull/51179)) * Fix crash when trying to reset memory stats when no cuda device is available ([#48406](https://github.com/pytorch/pytorch/pull/48406)) ### torch.utils * `data.DistributedSampler`: Fix possible padding length overflow ([#45329](https://github.com/pytorch/pytorch/pull/45329)) * `data.DataLoader`: Fix hang with large sampler ([#48669](https://github.com/pytorch/pytorch/pull/48669)) * `data.DataLoader`: Fix unintended error when worker force kill happens #43455 ([#43462](https://github.com/pytorch/pytorch/pull/43462)) * `data.DataLoader`: Fix persistent_workers + pin_memory 
([#48543](https://github.com/pytorch/pytorch/pull/48543)) ### Complex Number * Make `torch.view_as_real` raise a proper error for backends where it is not supported ([#47018](https://github.com/pytorch/pytorch/pull/47018)) * Fix bug in `toComplexWithDefault` ([#43841](https://github.com/pytorch/pytorch/pull/43841)) * Fix `torch.cat` backward formula to return correct gradient values for R -> C case ([#51681](https://github.com/pytorch/pytorch/pull/51681)) * Update backward formulas for `torch.{add, sub}` to correctly handle R -> C case. ([#46596](https://github.com/pytorch/pytorch/pull/46596)) * Add custom implementation for `torch.csqrt` if libc++ is used ([#52018](https://github.com/pytorch/pytorch/pull/52018)) ### C++ API * Refine `ConvParams::use_nnpack()` to allow NNPACK convolution algorithm only be used for kernels up to 16x16.([#49464](https://github.com/pytorch/pytorch/pull/49464)) ### Distributed * Record FutureNCCL callback stream on CUDA caching allocator ([#45318](https://github.com/pytorch/pytorch/pull/45318)) * Fix object-based collectives API to use `torch.cuda.current_device` instead of rank ([#46897](https://github.com/pytorch/pytorch/pull/46897)) * Explicitly restrict the scope of `torch.cuda.synchronize` to the current device in PowerSGD ([#49711](https://github.com/pytorch/pytorch/pull/49711)) * Fix Hang in Async Error Handling due to Work logging ([#46265](https://github.com/pytorch/pytorch/pull/46265)) * Add missing `recordStream` in `ProcessGroupNCCL::alltoall_base` ([#46603](https://github.com/pytorch/pytorch/pull/46603)) * Allow DataParallel to run zero input Module ([#46565](https://github.com/pytorch/pytorch/pull/46565)) * Fix DDP issue where parameters share same `grad_accumulator` ([#46755](https://github.com/pytorch/pytorch/pull/46755)) * Fix ProcessGroupNCCL profiling when profiler is not run with `use_cuda` ([#48946](https://github.com/pytorch/pytorch/pull/48946)) * Refactor RPC `matchBuiltInOp` to get rid of exception swallowing ([#49009](https://github.com/pytorch/pytorch/pull/49009)) * Solve zombie process problem in DDP launcher ([#49305](https://github.com/pytorch/pytorch/pull/49305)) * Fix memory leak in TensorPipeAgent. 
([#50564](https://github.com/pytorch/pytorch/pull/50564)) * Fix warm-start for PowerSGD layer-wise compression ([#50283](https://github.com/pytorch/pytorch/pull/50283)) * Fix CUDA RPC Stream Synchronization ([#50949](https://github.com/pytorch/pytorch/pull/50949)) * Fix `benchmarks/distributed/ddp/benchmark.py` ([#51095](https://github.com/pytorch/pytorch/pull/51095)) * Fix store based barrier to only use `add` ([#49930](https://github.com/pytorch/pytorch/pull/49930)) ### Mobile * Fix out-of-bounds access for caching allocator calls ([#46439](https://github.com/pytorch/pytorch/pull/46439)) * Fix CPUCaching allocator guard bug ([#46922](https://github.com/pytorch/pytorch/pull/46922)) * [Metal] Make the dst tensor contiguous when copying from metal (25833e5d1c) * [Metal] Fix the broken strides value for 2d transpose ([#50310](https://github.com/pytorch/pytorch/pull/50310)) * [Android] Fix yuv conversion ([#50951](https://github.com/pytorch/pytorch/pull/50951)) ### TorchScript * Fix bugs in a number of ops in CUDA fuser ([#47795](https://github.com/pytorch/pytorch/pull/47795), [#49143](https://github.com/pytorch/pytorch/pull/49143), [#49396](https://github.com/pytorch/pytorch/pull/49396), [#48329](https://github.com/pytorch/pytorch/pull/48329), and others) * Fix dict update ([#45857](https://github.com/pytorch/pytorch/pull/45857)) * Fix Dict bug in constant hashing ([#45929](https://github.com/pytorch/pytorch/pull/45929)) * Fix TypeError when `torch.jit.load` is passed a pathlib.Path ([#45825](https://github.com/pytorch/pytorch/pull/45825)) * Fix missing call to `__setstate__` when cloning modules ([#45858](https://github.com/pytorch/pytorch/pull/45858)) * Prevent caching of `graph` attribute. ([#46960](https://github.com/pytorch/pytorch/pull/46960)) * Fix traced training attribute ([#47211](https://github.com/pytorch/pytorch/pull/47211)) * Correctly compare Stream IValues ([#47303](https://github.com/pytorch/pytorch/pull/47303)) * Correctly print out sign of near-zero double values ([#47081](https://github.com/pytorch/pytorch/pull/47081)) * Properly serialize types that only appear at function input ([#47775](https://github.com/pytorch/pytorch/pull/47775)) * Fix bug in get_annotation_str for ast.Subscript ([#48741](https://github.com/pytorch/pytorch/pull/48741)) * Fix include files for out-of-tree compilation ([#48827](https://github.com/pytorch/pytorch/pull/48827)) * Fix constant propagation schemas ([#49605](https://github.com/pytorch/pytorch/pull/49605)) * Fix return type Any for Ternary ops ([#49165](https://github.com/pytorch/pytorch/pull/49165)) * Fix for module_has_exports ([#50680](https://github.com/pytorch/pytorch/pull/50680)) * Properly convert Python strings implicitly to device ([#51340](https://github.com/pytorch/pytorch/pull/51340)) * Add missing support for `torch.jit.Final` in python 3.6 ([#47393](https://github.com/pytorch/pytorch/pull/47393)) ### torch.fx * Fix recursion depth issue on Graph deepcopy ([#46669](https://github.com/pytorch/pytorch/pull/46669)) * Fix handling of `inf` and `nan` literals ([#46894](https://github.com/pytorch/pytorch/pull/46894)) * Fix corner case in name sanitization ([#46958](https://github.com/pytorch/pytorch/pull/46958)) * Fix submodule naming for subgraph split ([#47869](https://github.com/pytorch/pytorch/pull/47869)) * Fix create_arg for NamedTuple ([#48986](https://github.com/pytorch/pytorch/pull/48986)) * Fix python code having spurious newlines from placeholders ([#49720](https://github.com/pytorch/pytorch/pull/49720)) * Make 
`split_module` results deterministic ([#50470](https://github.com/pytorch/pytorch/pull/50470)) * Fix tracing a free function with embedded constant ([#50639](https://github.com/pytorch/pytorch/pull/50639)) * Fix using `fx.wrap` as a decorator ([#50677](https://github.com/pytorch/pytorch/pull/50677)) * Fix annotation in generated code ([#50777](https://github.com/pytorch/pytorch/pull/50777), [#52021](https://github.com/pytorch/pytorch/pull/52021)) ### Quantization * Remove fake_quant after add/mul nodes during eager mode QAT ([#49213](https://github.com/pytorch/pytorch/pull/49213)) * `torch.mean` add path for unsupported QNNPACK modes ([#45533](https://github.com/pytorch/pytorch/pull/45533)) * Set type for GetAttr nodes in remapTypes ([#46250](https://github.com/pytorch/pytorch/pull/46250)) * Avoid inserting fakequant for sigmoid/hardsigmoid/tanh in eval mode ([#47297](https://github.com/pytorch/pytorch/pull/47297)) * Ensure observer respects device affinity ([#47514](https://github.com/pytorch/pytorch/pull/47514)) * Fix quant type classification for float_qparam qconfig ([#48069](https://github.com/pytorch/pytorch/pull/48069)) * Fix quant_type classification for fp16, fp16 ([#48073](https://github.com/pytorch/pytorch/pull/48073)) * Fix a bug in leakyReLU ([#48265](https://github.com/pytorch/pytorch/pull/48265)) * Fix quantization for qat.ConvBnReLU1d ([#48059](https://github.com/pytorch/pytorch/pull/48059)) * Add bias once in conv_fused ([#48593](https://github.com/pytorch/pytorch/pull/48593)) * Do not return uninitialized qscheme from getQSchemeAndQParamVector ([#49391](https://github.com/pytorch/pytorch/pull/49391)) * Fix quantization for DeQuantStub ([#49428](https://github.com/pytorch/pytorch/pull/49428)) * Ensure observers do not crash for empty Tensors ([#49800](https://github.com/pytorch/pytorch/pull/49800)) * fake_quant: fix device affinity and buffer resizing for state_dict ([#50868](https://github.com/pytorch/pytorch/pull/50868)) * Fix memory leak in qnnpack ops ([#51612](https://github.com/pytorch/pytorch/pull/51612)) * Remove set_quantizer_ from native_functions.yaml ([#49463](https://github.com/pytorch/pytorch/pull/49463)) * Make choose_qparams_optimized return Tensors to preserve dtype ([#45530](https://github.com/pytorch/pytorch/pull/45530)) * Use PlaceholderObserver as default dynamic quant observer ([#45343](https://github.com/pytorch/pytorch/pull/45343)) * FixedQParamsFakeQuantize: adjust default quant_min and quant_max ([#47423](https://github.com/pytorch/pytorch/pull/47423)) * Add bias once in conv_fused (#48593) ([#48661](https://github.com/pytorch/pytorch/pull/48661)) * Fix unused var warning when building for different archs. 
([#48730](https://github.com/pytorch/pytorch/pull/48730)) * Make the CUDA fake quantize logic consistent with CPU fake quantize logic ([#49808](https://github.com/pytorch/pytorch/pull/49808)) * eager quant: fix error with removing forward hooks ([#49813](https://github.com/pytorch/pytorch/pull/49813)) ### ONNX * Fix `torch.flatten` operator ([#45632](https://github.com/pytorch/pytorch/pull/45632)) * Reimplement _var_mean to ensure non-negative ([#47240](https://github.com/pytorch/pytorch/pull/47240)) * Fix scripting of `torch.{rand,randn,where}` ([#45793](https://github.com/pytorch/pytorch/pull/45793)) * Fix `torch.eye` export ([#47016](https://github.com/pytorch/pytorch/pull/47016)) * Fix dtype for log_softmax export ([#46627](https://github.com/pytorch/pytorch/pull/46627)) * Fix graph position to insert clone node for inplace op removal ([#51520](https://github.com/pytorch/pytorch/pull/51520)) * Fix graph sequence output from loop node ([#51521](https://github.com/pytorch/pytorch/pull/51521)) * Do not dereference nullptr in scalar type analysis ([#50237](https://github.com/pytorch/pytorch/pull/50237)) * Fix bug in `torch.unfold` symbolic ([#51515](https://github.com/pytorch/pytorch/pull/51515)) * Fix opset 11 ConstantChunk with negative dim ([#51525](https://github.com/pytorch/pytorch/pull/51525)) * Fix bug in scatter_add ([#51527](https://github.com/pytorch/pytorch/pull/51527)) ### Vulkan * Fix interval midpoint calculation ([#46839](https://github.com/pytorch/pytorch/pull/46839)) * Fix Vulkan `torch.empty` (and family) breakage as a result of API update. ([#47937](https://github.com/pytorch/pytorch/pull/47937)) * Fix Addmm prepacking to persist after GPU flush ([#48313](https://github.com/pytorch/pytorch/pull/48313)) * Properly forbid dilation > 1 for conv2d ([#48800](https://github.com/pytorch/pytorch/pull/48800)) ### Misc * Fix c++ extension ninja CUDA build ([#49344](https://github.com/pytorch/pytorch/pull/49344)) * Only include dataclasses for py < 3.8 to make `setup.py` compatible with older python versions ([#45611](https://github.com/pytorch/pytorch/pull/45611)) # Performance ### Python API * Rewrite `torch.kron` to improve performance and support more dtypes ([#50927](https://github.com/pytorch/pytorch/pull/50927)) * Enable the faster combined weight branch in MHA when query/key/value is the same object with NaN ([#48126](https://github.com/pytorch/pytorch/pull/48126)) ### Autograd * `autograd.gradcheck` update to reduce computations ([#45757](https://github.com/pytorch/pytorch/pull/45757)) * Reduce memory usage for `torch.mm` when only one input requires gradient ([#45777](https://github.com/pytorch/pytorch/pull/45777)) * Reduce autograd engine startup cost ([#47592](https://github.com/pytorch/pytorch/pull/47592)) * Make `torch.svd` backward formula more memory and computationally efficient. ([#50109](https://github.com/pytorch/pytorch/pull/50109)) ### CUDA * Fix performance issue of GroupNorm on CUDA when feature map is small. 
([#46170](https://github.com/pytorch/pytorch/pull/46170)) * Concat fast path with empty tensor ([#46805](https://github.com/pytorch/pytorch/pull/46805)) * Support the strided tensor on input for `torch.cat` ([#46859](https://github.com/pytorch/pytorch/pull/46859)) * Pin destination memory for `cuda_tensor.to("cpu", non_blocking=True)` ([#46878](https://github.com/pytorch/pytorch/pull/46878)) * Add proper maximum number of threads per block for sm_86 as 1536 ([#45889](https://github.com/pytorch/pytorch/pull/45889)) * Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas ([#44778](https://github.com/pytorch/pytorch/pull/44778)) * Improve performance of CUDA trilinear interpolate backward ([#52649](https://github.com/pytorch/pytorch/pull/52649)) ### C++ API * Avoid computing AutogradKey if not needed to speed up low level C++ calls ([#46252](https://github.com/pytorch/pytorch/pull/46252)) * VariableKernel calls into scattered C++ api ([#44158](https://github.com/pytorch/pytorch/pull/44158)) * Make validate debug-only in Device constructor ([#49123](https://github.com/pytorch/pytorch/pull/49123)) * Add macro to optionally devirtualize `TensorImpl::numel()` ([#49766](https://github.com/pytorch/pytorch/pull/49766)) and `TensorImpl::sizes()` ([#50176](https://github.com/pytorch/pytorch/pull/50176)) * Inline access to low level Dispatcher ([#50644](https://github.com/pytorch/pytorch/pull/50644)) ### Distributed * Only track variables with grad accumulator for find_unused_parameters=True in DDP to save memory ([#45942](https://github.com/pytorch/pytorch/pull/45942)) * Benchmark combining Distributed Data Parallel and Distributed RPC ([#46993](https://github.com/pytorch/pytorch/pull/46993)) * Drop FutureNCCL in favor of vanilla CUDAFuture ([#49014](https://github.com/pytorch/pytorch/pull/49014)) * Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) ([#46901](https://github.com/pytorch/pytorch/pull/46901)) ### TorchScript * Optimized hot path in JIT graph executor ([#47465](https://github.com/pytorch/pytorch/pull/47465), [#48061](https://github.com/pytorch/pytorch/pull/48061),[#48034](https://github.com/pytorch/pytorch/pull/48034)) * Added support for `is_nan`, `to`, and `lgamma` in CUDA fuser([#45791](https://github.com/pytorch/pytorch/pull/45791), [#48973](https://github.com/pytorch/pytorch/pull/48973), [#48976](https://github.com/pytorch/pytorch/pull/48976)) * Added additional optimizations as part of `torch.jit.freeze` (Conv-Batchnorm, Conv-Add, and Conv-Mul folding, Dropout Removal) ([#50222](https://github.com/pytorch/pytorch/pull/50222)). 
* Fast TypeMeta/ScalarType conversion ([#45544](https://github.com/pytorch/pytorch/pull/45544)) * Fix getCustomClassType() perf ([#48981](https://github.com/pytorch/pytorch/pull/48981)) * Avoid move-constructing a List in listConstruct ([#49355](https://github.com/pytorch/pytorch/pull/49355)) * Specialize `list_element_from` for `IValue` to avoid extra move/copy ([#50124](https://github.com/pytorch/pytorch/pull/50124)) ### Mobile * Avoid inlining kernel lambdas on mobile ([#46249](https://github.com/pytorch/pytorch/pull/46249)) * Free original weight after prepacking in XNNPACK based op ([#46541](https://github.com/pytorch/pytorch/pull/46541)) * [Metal] Make permuteWeights inline ([#47634](https://github.com/pytorch/pytorch/pull/47634)) * [Metal] Use MPSCNN kernels for binary elementwise ops (c18403a693) ### Vulkan * Enable prepacked addmm/mm for linear layers ([#47815](https://github.com/pytorch/pytorch/pull/47815)) * Tweak memory use. ([#47728](https://github.com/pytorch/pytorch/pull/47728)) * Add linear memory allocator. ([#48569](https://github.com/pytorch/pytorch/pull/48569)) * Optimize Vulkan command buffer submission rate. ([#49112](https://github.com/pytorch/pytorch/pull/49112)) ### torch.fx * Speed up non-parameter tensor lookup ([#47325](https://github.com/pytorch/pytorch/pull/47325)) ### Quantization * Parallelize the quantization conversion operators ([#45536](https://github.com/pytorch/pytorch/pull/45536)) * Add a more memory efficient version of fake quant ([#50561](https://github.com/pytorch/pytorch/pull/50561)) * mem-efficient learnable fake quantization ([#49315](https://github.com/pytorch/pytorch/pull/49315), [#51255](https://github.com/pytorch/pytorch/pull/51255), [#51159](https://github.com/pytorch/pytorch/pull/51159)) * Remove contiguous calls in qembeddingbag ([#48993](https://github.com/pytorch/pytorch/pull/48993)) * Update embedding module to not store qweight ([#50418](https://github.com/pytorch/pytorch/pull/50418)) ### Misc * Extra sampling of record function events for the profiler ([#49114](https://github.com/pytorch/pytorch/pull/49114)) # Documentation ### Python API * Add information how to control randomness in `DataLoader` ([#45749](https://github.com/pytorch/pytorch/pull/45749)) * Revamp reproducibility notes ([#45748](https://github.com/pytorch/pytorch/pull/45748)) * Revamp `torch.optim` doc for better understanding ([#45944](https://github.com/pytorch/pytorch/pull/45944)) * Revamp `torch.sparse` tensor documentation. ([#45400](https://github.com/pytorch/pytorch/pull/45400)) * Add doc for `torch.overrides` submodule. 
([#48170](https://github.com/pytorch/pytorch/pull/48170)) * Add note on `nn.Module` overview and design principles ([#51536](https://github.com/pytorch/pytorch/pull/51536)) * Add helper functions section to `torch.fft` doc ([#46032](https://github.com/pytorch/pytorch/pull/46032)) * Add object-based collective APIs to public docs ([#48909](https://github.com/pytorch/pytorch/pull/48909)) * Fix diverse typos and rendering issues in `torch.` doc ([#46328](https://github.com/pytorch/pytorch/pull/46328), [#46589](https://github.com/pytorch/pytorch/pull/46589), [#47545](https://github.com/pytorch/pytorch/pull/47545), [#48316](https://github.com/pytorch/pytorch/pull/48316), [#48328](https://github.com/pytorch/pytorch/pull/48328), [#48673](https://github.com/pytorch/pytorch/pull/48673), [#48787](https://github.com/pytorch/pytorch/pull/48787), [#47762](https://github.com/pytorch/pytorch/pull/47762), [#48970](https://github.com/pytorch/pytorch/pull/48970), [#49136](https://github.com/pytorch/pytorch/pull/49136), [#49388](https://github.com/pytorch/pytorch/pull/49388), [#49413](https://github.com/pytorch/pytorch/pull/49413), [#49584](https://github.com/pytorch/pytorch/pull/49584), [#49667](https://github.com/pytorch/pytorch/pull/49667), [#41887](https://github.com/pytorch/pytorch/pull/41887), [#50254](https://github.com/pytorch/pytorch/pull/50254), [#51053](https://github.com/pytorch/pytorch/pull/51053), [#51212](https://github.com/pytorch/pytorch/pull/51212), [#51439](https://github.com/pytorch/pytorch/pull/51439), [#51286](https://github.com/pytorch/pytorch/pull/51286), [#49648](https://github.com/pytorch/pytorch/pull/49648)) * Fix diverse typo and rendering issues in `torch.nn` doc ([#45662](https://github.com/pytorch/pytorch/pull/45662), [#45660](https://github.com/pytorch/pytorch/pull/45660), [#45587](https://github.com/pytorch/pytorch/pull/45587), [#45763](https://github.com/pytorch/pytorch/pull/45763), [#46853](https://github.com/pytorch/pytorch/pull/46853), [#48577](https://github.com/pytorch/pytorch/pull/48577), [#48775](https://github.com/pytorch/pytorch/pull/48775), [#49950](https://github.com/pytorch/pytorch/pull/49950), [#50430](https://github.com/pytorch/pytorch/pull/50430), [#48596](https://github.com/pytorch/pytorch/pull/48596)) * Fix diverse typo and rendering issues in `torch.linalg` doc ([#51459](https://github.com/pytorch/pytorch/pull/51459), [#51353](https://github.com/pytorch/pytorch/pull/51353), [#51620](https://github.com/pytorch/pytorch/pull/51620), [#51641](https://github.com/pytorch/pytorch/pull/51641), [#51651](https://github.com/pytorch/pytorch/pull/51651), [#51658](https://github.com/pytorch/pytorch/pull/51658), [#51659](https://github.com/pytorch/pytorch/pull/51659), [#51660](https://github.com/pytorch/pytorch/pull/51660)) * Update docs for `torch.nn`: in-place modification of weight in `nn.Embedding` ([#45595](https://github.com/pytorch/pytorch/pull/45595)) * Update docs for `torch.distributions`: `NegativeBinomial` ([#45693](https://github.com/pytorch/pytorch/pull/45693)), `Categorical` ([#45804](https://github.com/pytorch/pytorch/pull/45804)), `LKJCholesky` ([#52904](https://github.com/pytorch/pytorch/pull/52904)) * Improve `torch.matmul` doc regarding broadcasting ([#45699](https://github.com/pytorch/pytorch/pull/45699)) * Add function signature for `torch.pixel_shuffle` ([#45661](https://github.com/pytorch/pytorch/pull/45661)) * Fix signature for `torch.poisson` ([#45656](https://github.com/pytorch/pytorch/pull/45656)) * Add 3D reduction example to `torch.tensordot` 
([#45697](https://github.com/pytorch/pytorch/pull/45697)) * Fix `torch.matrix_exp` ([#45909](https://github.com/pytorch/pytorch/pull/45909)) * Fix typo in `torch.load` docstring for the `f` parameter ([#49350](https://github.com/pytorch/pytorch/pull/49350)) * Document fix for `torch.logspace` and `torch.linspace` ([#46056](https://github.com/pytorch/pytorch/pull/46056)) * Improve clarity of `torch.norm` ([#42696](https://github.com/pytorch/pytorch/pull/42696)) * Fix info on the shape of pivots in `torch.lu` ([#46844](https://github.com/pytorch/pytorch/pull/46844)) * Add `generator` param in `torch.randperm` doc ([#47231](https://github.com/pytorch/pytorch/pull/47231)) * Updated doc for `torch.{v}dot` ([#47242](https://github.com/pytorch/pytorch/pull/47242)) * Update doc of `torch.eig` about backward([#47598](https://github.com/pytorch/pytorch/pull/47598)) * Fix `torch.swap{dim/axes}` to properly appear in doc ([#48376](https://github.com/pytorch/pytorch/pull/48376)) * Add global `nn.Module` hooks to nn doc ([#48374](https://github.com/pytorch/pytorch/pull/48374)) * Added `torch.linalg.cond` to doc([#48941](https://github.com/pytorch/pytorch/pull/48941)) * Improve new_group example in the context of `torch.nn.SyncBatchNorm` ([#48897](https://github.com/pytorch/pytorch/pull/48897)) * Update `is_floating_point()` docs to mention bfloat16 ([#49611](https://github.com/pytorch/pytorch/pull/49611)) * Improve docs for `torch.{scatter,gather}` ([#49679](https://github.com/pytorch/pytorch/pull/49679)) * Rename "Arguments:" to "Args:" in all doc ([#49736](https://github.com/pytorch/pytorch/pull/49736)) * Fix a KaTeX crash and many docstring issues ([#49684](https://github.com/pytorch/pytorch/pull/49684)) * Improve `torch.flatten` doc ([#49501](https://github.com/pytorch/pytorch/pull/49501)) * Add note about `torch.flip` returning new tensor and not view. 
([#50041](https://github.com/pytorch/pytorch/pull/50041)) * Add instructional error message for cudnn RNN double backward workaround ([#33884](https://github.com/pytorch/pytorch/pull/33884)) * Add centered FFT example to `torch.fft.fftshift` doc ([#51223](https://github.com/pytorch/pytorch/pull/51223)) * Add `torch.sgn` to doc ([#51479](https://github.com/pytorch/pytorch/pull/51479)) ### Autograd * Fix many typos and rendering issues in `torch.autograd` doc ([#48765](https://github.com/pytorch/pytorch/pull/48765), [#45849](https://github.com/pytorch/pytorch/pull/45849), [#50166](https://github.com/pytorch/pytorch/pull/50166), [#51035](https://github.com/pytorch/pytorch/pull/51035), [#51335](https://github.com/pytorch/pytorch/pull/51335)) * Update the error message explaining when to use the `retain_grad` flag ([#47084](https://github.com/pytorch/pytorch/pull/47084)) ### Complex Number * Fix typo in complex autograd docs ([#49755](https://github.com/pytorch/pytorch/pull/49755)) * Doc update for complex numbers ([#51129](https://github.com/pytorch/pytorch/pull/51129), [#51661](https://github.com/pytorch/pytorch/pull/51661)) * Document that `torch.remainder` does not support complex inputs ([#48024](https://github.com/pytorch/pytorch/pull/48024)) ### CUDA * Add a Note on CUDA Stream ([#45754](https://github.com/pytorch/pytorch/pull/45754%20(http:/#45754))), [#45754](https://github.com/pytorch/pytorch/pull/45754)) * Add docs on how to toggle TF32 flags on C++ ([#47331](https://github.com/pytorch/pytorch/pull/47331)) * Fix syntax issue in C++ cuda api note ([#48434](https://github.com/pytorch/pytorch/pull/48434)) * Change “truncating” to “rounding“ in TF32 docs ([#49625](https://github.com/pytorch/pytorch/pull/49625)) * Add docstring to `torch.cuda.get_device_properties` ([#49792](https://github.com/pytorch/pytorch/pull/49792)) * Add doc for `cuda.memory_fraction` and `cuda.gpu_process` ([#51372](https://github.com/pytorch/pytorch/pull/51372)) ### C++ API * Add guide for choosing dispatch keys in `native_functions.yaml` ([#46126](https://github.com/pytorch/pytorch/pull/46126)) * Add a few more comments on dispatch key computation methods ([#46128](https://github.com/pytorch/pytorch/pull/46128)) * Improve error messages for operator registration API ([#47636](https://github.com/pytorch/pytorch/pull/47636)) * Add Math/DefaultBackend to dispatch key guide, introduce `PythonDispatcher` ([#50854](https://github.com/pytorch/pytorch/pull/50854)) ### Distributed * Clarify callback behavior when future is completed ([#50978](https://github.com/pytorch/pytorch/pull/50978)) * Enhance `new_group` doc to mention using NCCL concurrently. 
([#48872](https://github.com/pytorch/pytorch/pull/48872)) * Adding c10d Store API Docs ([#45543](https://github.com/pytorch/pytorch/pull/45543)) * Fix distributed documentation for asynchronous collective Work objects ([#45709](https://github.com/pytorch/pytorch/pull/45709)) * Fix DDP documentation ([#46861](https://github.com/pytorch/pytorch/pull/46861)) * Fix inaccurate note in `DistributedDataParallel` ([#47156](https://github.com/pytorch/pytorch/pull/47156)) * Minor doc fixes for `init_process_group` ([#47644](https://github.com/pytorch/pytorch/pull/47644)) * Docs fixes for `HashStore` API ([#47643](https://github.com/pytorch/pytorch/pull/47643)) * Update links in DDP note ([#47663](https://github.com/pytorch/pytorch/pull/47663)) * Small documentation changes for `RRef` and Dist Autograd ([#48123](https://github.com/pytorch/pytorch/pull/48123)) * Add examples for new object-based c10d APIs ([#43932](https://github.com/pytorch/pytorch/pull/43932)) * Minor update of the comments on PowerSGD. ([#49246](https://github.com/pytorch/pytorch/pull/49246)) * Updating `init_process_group` docs to indicate correct rank range ([#49131](https://github.com/pytorch/pytorch/pull/49131)) * Store Python API Docs Fixes ([#49130](https://github.com/pytorch/pytorch/pull/49130)) * Fix link in distributed contributing doc and add link ([#49141](https://github.com/pytorch/pytorch/pull/49141)) * Updating Docs to Reflect `FileStore` changes ([#49557](https://github.com/pytorch/pytorch/pull/49557)) * Improve documentation for pipeline parallelism. ([#48638](https://github.com/pytorch/pytorch/pull/48638)) * Reorder `torch.distributed.rpc.init_rpc` docstring arguments ([#50419](https://github.com/pytorch/pytorch/pull/50419)) * Add documentation page for pipeline parallelism. 
([#50791](https://github.com/pytorch/pytorch/pull/50791)) * Update the doc of `DistributedOptimizer` ([#51314](https://github.com/pytorch/pytorch/pull/51314)) * Fix doc inconsistency about callback args in `torch.futures.Future` ([#50979](https://github.com/pytorch/pytorch/pull/50979)) ### TorchScript * Added a developer tutorial for tensor expressions - the core technology used in CUDA fuser ([#45527](https://github.com/pytorch/pytorch/pull/45527)) * Fix jit model loading example ([#48104](https://github.com/pytorch/pytorch/pull/48104)) * Fix archive file extension in examples and docs ([#50649](https://github.com/pytorch/pytorch/pull/50649)) * Fix `ScriptModule` docstring ([#48608](https://github.com/pytorch/pytorch/pull/48608)) * Clarify logic in `ir_emitter` ([#51299](https://github.com/pytorch/pytorch/pull/51299)) ### torch.fx * Add `torch.fx` section to doc ([#48814](https://github.com/pytorch/pytorch/pull/48814), [#50291](https://github.com/pytorch/pytorch/pull/50291), [#50562](https://github.com/pytorch/pytorch/pull/50562), [#50896](https://github.com/pytorch/pytorch/pull/50896), [#50966](https://github.com/pytorch/pytorch/pull/50966), [#51728](https://github.com/pytorch/pytorch/pull/51728)) * Add example on how to split up an FX graph into smaller subgraphs with own submodules ([#45404](https://github.com/pytorch/pytorch/pull/45404)) * Shape propagation example ([#45637](https://github.com/pytorch/pytorch/pull/45637)) * Add many docstrings and improve their rendering ([#47719](https://github.com/pytorch/pytorch/pull/47719), [#48100](https://github.com/pytorch/pytorch/pull/48100), [#48738](https://github.com/pytorch/pytorch/pull/48738), [#48871](https://github.com/pytorch/pytorch/pull/48871), [#50145](https://github.com/pytorch/pytorch/pull/50145), [#50396](https://github.com/pytorch/pytorch/pull/50396), [#50555](https://github.com/pytorch/pytorch/pull/50555)) * Document single op replacement ([#50116](https://github.com/pytorch/pytorch/pull/50116), [#50377](https://github.com/pytorch/pytorch/pull/50377)) * Document example of Proxy use ([#50583](https://github.com/pytorch/pytorch/pull/50583)) * Add limitations of symbolic tracing ([#50638](https://github.com/pytorch/pytorch/pull/50638)) * Added how to write transformations section ([#51278](https://github.com/pytorch/pytorch/pull/51278)) * Added invert example ([#51478](https://github.com/pytorch/pytorch/pull/51478)) * Document FX debugging ([#51530](https://github.com/pytorch/pytorch/pull/51530)) * Write FX Subgraph Rewriter tutorial ([#51531](https://github.com/pytorch/pytorch/pull/51531)) * Add note about more use cases of FX ([#51576](https://github.com/pytorch/pytorch/pull/51576)) ### Quantization * Add API summary section in quantization docs ([#45848](https://github.com/pytorch/pytorch/pull/45848), [#50681](https://github.com/pytorch/pytorch/pull/50681), [#50187](https://github.com/pytorch/pytorch/pull/50187)) * Fix misleading doc string in quint8.h ([#48418](https://github.com/pytorch/pytorch/pull/48418)) * Add fx graph mode quantization to quantization docs ([#49515](https://github.com/pytorch/pytorch/pull/49515)) * Add common errors section ([#49902](https://github.com/pytorch/pytorch/pull/49902)) * Adding a table comparing eager and fx graph mode ([#50413](https://github.com/pytorch/pytorch/pull/50413)) * Add docs for embedding/embedding_bag ([#51770](https://github.com/pytorch/pytorch/pull/51770)) * Add fake_quantize functions documentation ([#51748](https://github.com/pytorch/pytorch/pull/51748)) ### ONNX * Update 
ONNX doc for indexing export ([#46349](https://github.com/pytorch/pytorch/pull/46349)) * Update ONNX doc for writing pytorch model ([#46961](https://github.com/pytorch/pytorch/pull/46961)) ### Misc * Add `docs/README.md` to make existing doc build info more discoverable ([#49286](https://github.com/pytorch/pytorch/pull/49286)) * Update CONTRIBUTING for doc build ([#47539](https://github.com/pytorch/pytorch/pull/47539))

Bug fix release with updated binaries for Python 3.9 and cuDNN 8.0.5 (2020-12-10)

# PyTorch 1.7.1 Release Notes

* New Features
* Critical Fixes
* Other Fixes

# New Features

### Add Python 3.9 binaries for linux and macOS ([#48133](https://github.com/pytorch/pytorch/pull/48133)) and Windows ([#48218](https://github.com/pytorch/pytorch/pull/48218))

*NOTE*: Conda installs for Python 3.9 will require the `conda-forge` channel, example:
`conda install -y -c pytorch -c conda-forge pytorch`.

### Upgrade CUDA binaries to use cuDNN 8.0.5 (builder repo [#571](https://github.com/pytorch/builder/pull/571))

This upgrade fixes regressions on Ampere cards introduced in cuDNN 8.0.4.
It improves performance for RTX 3090 cards and may improve performance on other RTX 30-series cards.

# Critical Fixes

### Python 3.9

- Use custom version of pybind11 to work around Python 3.9 issues ([#48312](https://github.com/pytorch/pytorch/pull/48312))
- Fix jit Python 3.9 parsing ([#48744](https://github.com/pytorch/pytorch/pull/48744))
- Fix cpp_extension to work with Python 3.9 ([#48768](https://github.com/pytorch/pytorch/pull/48768))

### Build

- Fix cpp_extension to properly handle env variable on Windows ([#48937](https://github.com/pytorch/pytorch/pull/48937))
- Properly package libomp.dylib for macOS binaries ([#48337](https://github.com/pytorch/pytorch/pull/48337))
- Fix build for statically linked OpenBLAS on aarch64 ([#48819](https://github.com/pytorch/pytorch/pull/48819))

### Misc

- `torch.sqrt`: fix wrong output values for very large complex input ([#48216](https://github.com/pytorch/pytorch/pull/48216))
- `max_pool1d`: fix for discontiguous inputs ([#48219](https://github.com/pytorch/pytorch/pull/48219))
- `collect_env`: fix detection of DEBUG flag ([#48319](https://github.com/pytorch/pytorch/pull/48319))
- `collect_env`: Fix to work when PyTorch is not installed ([#48311](https://github.com/pytorch/pytorch/pull/48311))
- Fix `amp` memory usage when running in `no_grad()` mode ([#48936](https://github.com/pytorch/pytorch/pull/48936))
- `nn.ParameterList` and `nn.ParameterDict`: Remove spurious warnings ([#48215](https://github.com/pytorch/pytorch/pull/48215))
- Tensor Expression fuser bugfixes ([#48137](https://github.com/pytorch/pytorch/pull/48137))

# Other Fixes

- Tensor Expression fix for CUDA 11.0 ([#48309](https://github.com/pytorch/pytorch/pull/48309))
- `torch.overrides`: doc fix ([#47843](https://github.com/pytorch/pytorch/pull/47843))
- `torch.max`: Fix output type for Tensor subclasses ([#47735](https://github.com/pytorch/pytorch/pull/47735))
- `torch.mul`: Add support for boolean Tensors ([#48310](https://github.com/pytorch/pytorch/pull/48310))
- Add user friendly error when trying to compile from source with Python 2 ([#48317](https://github.com/pytorch/pytorch/pull/48317))

PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more (2020-10-27)

# PyTorch 1.7.0 Release Notes

* Highlights
* Backwards Incompatible Change
* New Features
* Improvements
* Performance
* Documentation

# Highlights

The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to [stable](https://pytorch.org/docs/stable/index.html#pytorch-documentation) including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper. 

A few of the highlights include: 

* CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/)
* Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler
* (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft
* (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format 
* (Prototype) Distributed training on Windows now supported

To reiterate, starting [PyTorch 1.6](https://pytorch.org/blog/pytorch-feature-classification-changes/), features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. 

## Front End APIs

### [Beta] NumPy Compatible torch.fft module

FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy.  

This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function.

**Example usage:**

```python
>>> import torch.fft
>>> t = torch.arange(4)
>>> t
tensor([0, 1, 2, 3])

>>> torch.fft.fft(t)
tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j])

>>> t = torch.tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j])
>>> torch.fft.fft(t)
tensor([12.+16.j, -8.+0.j, -4.-4.j,  0.-8.j])
```

* Documentation | [Link](https://pytorch.org/docs/stable/fft.html#torch-fft)

### [Beta] C++ Support for Transformer NN Modules

Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the Python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. Moreover, developers no longer need to save a module from Python/JIT and load it into C++, as it can now be used in C++ directly. 

* Documentation | [Link](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE)

### [Beta] torch.set_deterministic 

Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the  `torch.set_deterministic(bool)` function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically by default. 

More precisely, when this flag is true:

* Operations known to not have a deterministic implementation throw a runtime error;
* Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and
* `torch.backends.cudnn.deterministic = True` is set.

Note that this is necessary, **but not sufficient**, for determinism **within a single run of a PyTorch program**. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior.

See the documentation for `torch.set_deterministic(bool)` for the list of affected operations.
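
**Example usage** (a minimal sketch; the tensors and ops here are placeholders, not from the release notes):

```python
import torch

# Opt in to deterministic algorithms; the flag is False by default.
torch.set_deterministic(True)

x = torch.randn(8, 8)
y = x @ x  # deterministic implementations are selected where available

# Ops that have no deterministic implementation raise a RuntimeError while
# the flag is set, instead of silently producing run-to-run differences.
torch.set_deterministic(False)  # restore the default behavior
```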

* RFC | [Link](https://github.com/pytorch/pytorch/issues/15359)
* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html)

## Performance & Profiling

### [Beta] Stack traces added to profiler

Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: `with_stack` and `group_by_stack_n`. Caution: regular profiling runs should not use this feature as it adds significant overhead. 
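
**Example usage** (a minimal sketch; the model and input are placeholders):

```python
import torch

model = torch.nn.Linear(16, 4)
inp = torch.randn(2, 16)

# with_stack records where each operator was called from; it adds significant
# overhead, so keep it off for regular profiling runs.
with torch.autograd.profiler.profile(with_stack=True) as prof:
    model(inp).sum().backward()

# group_by_stack_n groups the averaged results by the top stack frames.
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total"))
```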

* Details | [Link](https://github.com/pytorch/pytorch/pull/43898/)
* Documentation | [Link](https://pytorch.org/docs/stable/autograd.html)

## Distributed Training & RPC 

### [Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic offers a strict superset of the current `torch.distributed.launch` CLI with added features for fault-tolerance and elasticity. If the user is not interested in fault-tolerance, they can get exact functionality/behavior parity by setting `max_restarts=0`, with the added convenience of auto-assigned `RANK` and `MASTER_ADDR|PORT` (versus manually specified in `torch.distributed.launch`).

By bundling `torchelastic` in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install `torchelastic`. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators.

* Usage examples and how to get started | [Link](https://pytorch.org/elastic/0.2.0/examples.html)

### [Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using `torch.nn.parallel.DistributedDataParallel` to enable training with uneven dataset sizes across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different processes. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.
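
**Example usage** (a minimal sketch; it assumes the process group is already initialized and that `rank` and `uneven_data`, a per-rank iterable whose length differs across processes, are defined elsewhere):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, uneven_data):
    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    # The join() context manager shadows collectives for ranks that run out
    # of data early, so the remaining ranks do not hang.
    with model.join():
        for inp in uneven_data:
            model(inp).sum().backward()
```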

* RFC | [Link](https://github.com/pytorch/pytorch/issues/38174)
* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join)

### [Beta] NCCL Reliability - Async Error/Timeout Handling

In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before).
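
**Example usage** (a minimal sketch; the environment variable name `NCCL_ASYNC_ERROR_HANDLING` is an assumption here — check the linked documentation for the exact opt-in mechanism):

```python
import os
import torch.distributed as dist

# Opt in before creating the NCCL process group; assumed variable name.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# With the opt-in set, a stuck collective is aborted and the process errors
# out after the process group timeout instead of hanging indefinitely.
dist.init_process_group("nccl", init_method="env://")
```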

* Documentation | [Link](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group)
* RFC | [Link](https://github.com/pytorch/pytorch/issues/46874)

### [Beta] TorchScript `remote` and `rpc_sync`

`torch.distributed.rpc.rpc_async` has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality is extended to the remaining two core RPC APIs, `torch.distributed.rpc.rpc_sync` and `torch.distributed.rpc.remote`. This completes the major RPC APIs targeted for support in TorchScript. It allows users to use the existing Python RPC APIs within TorchScript (in a script function or script method, which releases the Python Global Interpreter Lock) and could potentially improve application performance in multithreaded environments.

* Documentation | [Link](https://pytorch.org/docs/stable/rpc.html#rpc)
* Usage examples | [Link](https://github.com/pytorch/pytorch/blob/58ed60c259834e324e86f3e3118e4fcbbfea8dd1/torch/testing/_internal/distributed/rpc/jit/rpc_test.py#L505-L525)

### [Beta] Distributed optimizer with TorchScript support

PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the Python API. However, users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel) or any RPC-based training application. Users previously couldn’t do this with the distributed optimizer because the Python Global Interpreter Lock (GIL) stood in the way.

In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before, but it automatically converts the optimizer within each worker into TorchScript to make it GIL-free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading. 

Currently, the only optimizer that supports automatic conversion with TorchScript is `Adagrad`; all other optimizers will still work as before, without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers and expect more to come in future releases. Enabling TorchScript support is automatic and requires no change to the existing Python APIs; here is an example of how to use this:

```python
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
  # Forward pass.
  rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
  rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
  loss = rref1.to_here() + rref2.to_here()

  # Backward pass.
  dist_autograd.backward(context_id, [loss.sum()])

  # Optimizer, pass in optim.Adagrad, DistributedOptimizer will
  # automatically convert/compile it to TorchScript (GIL-free)
  dist_optim = DistributedOptimizer(
     optim.Adagrad,
     [rref1, rref2],
     lr=0.05,
  )
  dist_optim.step(context_id)
```

* Documentation | [Link](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
* RFC | [Link](https://github.com/pytorch/pytorch/issues/46883)

### [Beta] Enhancements to RPC-based Profiling

Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6. In PyTorch 1.7, the following enhancements have been made:

* Implemented better support for profiling TorchScript functions over RPC
* Achieved parity in terms of profiler features that work with RPC
* Added support for asynchronous RPC functions on the server-side (functions decorated with `rpc.functions.async_execution`).

Users are now able to use familiar profiling tools such as `with torch.autograd.profiler.profile()` and `with torch.autograd.profiler.record_function()`, and these work transparently with the RPC framework with full feature support, including profiling of asynchronous functions and TorchScript functions.
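
**Example usage** (a minimal sketch; it assumes RPC is already initialized and a peer named "worker1" exists):

```python
import torch
import torch.distributed.rpc as rpc

with torch.autograd.profiler.profile() as prof:
    fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1))
    fut.wait()

# Remote operations now show up in the profile alongside local ones.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```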

* Design doc | [Link](https://github.com/pytorch/pytorch/issues/39675)
* Usage examples | [Link](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html)

### [Prototype] Windows support for Distributed Training

PyTorch 1.7 brings prototype support for `DistributedDataParallel` and collective communications on the Windows platform. In this release, the support only covers Gloo-based `ProcessGroup` and `FileStore`.
To use this feature across multiple machines, please provide a file from a shared file system in `init_process_group`. 

```python
# initialize the process group
dist.init_process_group(
    "gloo",
    # multi-machine example:
    # Shared file paths need six "/"
    # init_method="file://////{machine}/{share_folder}/file"
    # A local file path needs three "/"
    init_method="file:///{your local file path}",
    rank=rank,
    world_size=world_size
)

model = DistributedDataParallel(local_model, device_ids=[rank])
```

* Design doc | [Link](https://github.com/pytorch/pytorch/issues/42095)
* Documentation | [Link](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch)
* Acknowledgement | [gunandrose4u](https://github.com/gunandrose4u)

## Mobile

PyTorch Mobile supports both [iOS](https://pytorch.org/mobile/ios) and [Android](https://pytorch.org/mobile/android/) with binary packages available in [Cocoapods](https://cocoapods.org/) and [JCenter](https://mvnrepository.com/repos/jcenter) respectively. You can learn more about PyTorch Mobile [here](https://pytorch.org/mobile/home/). 

### [Beta] PyTorch Mobile Caching allocator for performance improvements

On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults, as PyTorch, being a functional framework, does not maintain state for the operators: for most ops, outputs are allocated dynamically on each execution of the op. To ameliorate the resulting performance penalty, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor size and is currently available only via the PyTorch C++ API. The caching allocator itself is owned by the client, so its lifetime is also maintained by client code. Such a client-owned caching allocator can then be used with a scoped guard, `c10::WithCPUCachingAllocatorGuard`, to enable the use of cached allocations within that scope.

**Example usage:**

```cpp
#include <c10/mobile/CPUCachingAllocator.h>
.....
c10::CPUCachingAllocator caching_allocator;
  // Owned by client code. Can be a member of some client class so as to tie the
  // the lifetime of caching allocator to that of the class.
.....
{
  c10::optional<c10::WithCPUCachingAllocatorGuard> caching_allocator_guard;
  if (FLAGS_use_caching_allocator) {
    caching_allocator_guard.emplace(&caching_allocator);
  }
  ....
  model.forward(..);
}
.....
```

**NOTE**: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective.

* Documentation | [Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43)
* Usage examples | [Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207)



# Backwards Incompatible changes

## Python API

### `torch.conj` now returns the input as-is for real Tensors ([#43270](https://github.com/pytorch/pytorch/pull/43270))

Previously, `torch.conj` and `Tensor.conj` made a clone for Tensors of real dtype. They now return the Tensor as-is to improve performance.
You can recover the original behavior by adding a `.clone()` for real Tensors.
Note that this behavior is different from `numpy` for which `np.conj` returns a new ndarray and `ndarray.conj` returns the ndarray as-is.

1.6.0:
>>> t.is_complex()
False
>>> t.conj() is t
False

1.7.0:
>>> t.is_complex()
False
>>> t.conj() is t
True
>>> t.conj().clone() is t
False

### `torch.tensor`, `torch.as_tensor`, and `torch.sparse_coo_tensor` now use the input Tensor’s device when it is not specified ([#41984](https://github.com/pytorch/pytorch/pull/41984))

This changes the device on which the Tensor is created, so the user can start seeing device mismatch errors. It also means, for sparse Tensors, that both of the provided Tensors must be on the same device if the device is not specified. You can recover the original behavior by passing the `device` argument.

1.6.0:
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
            torch.tensor(([0], [2]), device="cpu"),
            torch.tensor(([1.],), device="cuda"),
            size=(3, 3, 1)).device
device(type='cuda', index=0)

1.7.0:
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cuda:0')
>>> # Specify the device to get the same behavior as 1.6
>>> torch.tensor(t, dtype=torch.float32, device='cpu').device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
            torch.tensor(([0], [2]), device="cpu"),
            torch.tensor(([1.],), device="cuda"),
            size=(3, 3, 1)).device
RuntimeError: backend of indices (CPU) must match backend
of values (CUDA)
>>> # Specify the device to get the same behavior as 1.6
>>> torch.sparse_coo_tensor(
            torch.tensor(([0], [2]), device="cpu"),
            torch.tensor(([1.],), device="cuda"),
            size=(3, 3, 1),
            device="cuda:0").device
device(type='cuda', index=0)

### `torch.nn.utils.pack_padded_sequence`: remove hidden cross-device copy for `lengths` ([#41984](https://github.com/pytorch/pytorch/pull/41984))

In previous versions, when the `lengths` argument was a CUDA tensor, it would incorrectly be moved to the CPU silently. This can lead to surprising performance issues and CPU/GPU syncs when using CUDA, so it has been removed. You need to make sure that the provided `lengths` is a CPU Tensor when it is provided as a Tensor.

1.6.0:
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Implicitly moves lengths to the CPU and runs fine

1.7.0:
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor,
but got 1D cuda:0 Long tensor
>>> # Ensure lengths is already on the right device
>>> lengths = lengths.cpu()
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Runs fine with no implicit move across devices

### Improve `torch.norm` handling of `keepdim=True` ([#41956](https://github.com/pytorch/pytorch/pull/41956))

Before this change, when calling `torch.norm` with `keepdim=True` and `p='fro'` or `p=number`, leaving all other optional arguments at their default values, the `keepdim` argument would be ignored. It is now properly respected. Also, any time `torch.norm` was called with `p='nuc'` and `keepdim=True`, the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. It now properly keeps all the dimensions. You can recover the original behavior by setting `keepdim=False`. **NOTE: this function is now deprecated (see below) and we recommend you use `torch.linalg.norm`, which follows NumPy’s conventions.**

1.6.0:
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1])

1.7.0:
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1, 1])

### `torch.split` and `torch.chunk`: Fix view tracking for the autograd ([#41567](https://github.com/pytorch/pytorch/pull/41567))

The autograd system is able to correctly handle modifications through views of Tensors by explicitly tracking known view operations. In prior releases, `torch.split` and `torch.chunk` were not marked as known view operations, which could lead to silently wrong gradients. Note that since v1.5, inplace modification of views created by functions that return multiple views is deprecated. Such cases are not properly handled by the autograd and can lead to internal errors or wrong gradients. So, as a side effect of this view fix, inplace modifications of the outputs of `torch.split` and `torch.chunk` will now raise a warning and can lead to internal errors or wrong gradients, while they were previously silently computing wrong gradients. If you see such a warning, you should replace the inplace operation with an out-of-place one. You can recover the original behavior by using the new `torch.unsafe_split` and `torch.unsafe_chunk`. Note that these functions are only here to ease the transition and will also be removed in a future version.

### `torch.{argmin,argmax}` now always return the first min/max index ([#42004](https://github.com/pytorch/pytorch/pull/42004))

`torch.argmin` (`torch.argmax`) now always returns the index of the first minimum (maximum) element. This choice is consistent with NumPy. Previously, if there were multiple minima (maxima), the index returned could be the index of any of them. You cannot recover the original behavior as it was platform dependent and not guaranteed. If your code was relying on a specific index for your specific platform, you should update it to work with the first index, and this new code will work on all platforms.

### `torch.{min,max,median}`: Update backward formula when doing full reduction (`dim` argument not provided) ([#43519](https://github.com/pytorch/pytorch/pull/43519))

When no dimension is specified, a full reduction is performed and the gradient will now flow back evenly towards all the inputs that realized the output value. The old behavior was to propagate the gradient only for one such input, selected arbitrarily. This should improve the stability of training by gradient descent. To recover the previous behavior, you can perform the reduction with the `dim=` argument. It will ensure that the gradient only flows back for the input whose index was returned.

1.6.0:
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0, 0, 1])

1.7.0:
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0.5, 0, 0.5])
>>> a.max(dim=0).max(dim=0).max(dim=0).backward()
>>> a.grad
tensor([0, 0, 1])

### `nn.BCELoss` size mismatch warning is now an error ([#41426](https://github.com/pytorch/pytorch/pull/41426))

This is the end of the deprecation cycle for this op, to make sure it does not have different broadcasting semantics compared to NumPy’s broadcasting semantics used everywhere else in PyTorch’s codebase. You need to make sure all inputs are the same size to avoid the error.

1.6.0:
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
UserWarning: Using a target size (torch.Size([25, 1]))
that is different to the input size (torch.Size([25]))
is deprecated. Please ensure they have the same size.
tensor(1.0604)

1.7.0:
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
ValueError: Using a target size (torch.Size([25, 1]))
that is different to the input size (torch.Size([25]))
is deprecated. Please ensure they have the same size.
>>> b = b.reshape(25)
>>> bceloss(a, b)
tensor(1.0604)

### Custom `autograd.Function` stops materializing `None` output Tensors ([#41490](https://github.com/pytorch/pytorch/pull/41490))

To improve performance, a custom `autograd.Function` will no longer create a Tensor full of zeros when an input is differentiable but the user’s `backward` function returns `None` for it. This means that the final `.backward()` or `autograd.grad()` result may now be `None` where it used to be a Tensor full of zeros. You can recover the previous behavior by having your custom `autograd.Function` materialize the zero Tensor with `torch.zeros_like(input)` to replace the `None` output for the `backward` method.

```python
import torch

# Custom Function that returns None for the gradient
class GetTwos(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp.clone().fill_(2)

    @staticmethod
    def backward(ctx, grad_out):
        # To recover the 1.6 behavior, replace the line below with
        # `return torch.zeros_like(grad_out)`
        return None

a = torch.rand(10, requires_grad=True)
b = GetTwos.apply(a)
b.sum().backward()

print(a.grad)
# In PyTorch 1.6 this will print
# tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# In PyTorch 1.7 this will print
# None
```

### Fix inplace detection for non-differentiable outputs ([#41269](https://github.com/pytorch/pytorch/pull/41269))

We fixed a bug in the inplace detection code that was preventing the detection of some inplace operations on outputs that are not differentiable (like integer type Tensors). This can cause code that used to run fine to now throw the error “a Tensor that was needed for backward was modified in an inplace operation”. Such failures are legitimate and the user code must be fixed to compute proper gradients. In general, this involves cloning the Tensor before modifying it inplace to make sure the backward pass can happen safely.

```python
import torch

a = torch.rand(10, requires_grad=True)
with torch.no_grad():
    a[2] = 10
b, ind = a.max(dim=0)  # ind is 2 here

with torch.no_grad():
    t = torch.rand(10)
    t[4] = 10
    res = torch.max(t, dim=0, out=(torch.Tensor(), ind))  # ind becomes 4 here

# This backward runs in 1.6 but will fail in 1.7
b.sum().backward()
print(a.grad)
# tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
# The value is wrongly at index 4 while it should be at index 2.
# The issue is avoided by not modifying ind inplace, i.e. by replacing the line
# above with:
# res = torch.max(t, dim=0, out=(torch.Tensor(), ind.clone()))
```

### Add `__torch_function__` for methods ([#37091](https://github.com/pytorch/pytorch/pull/37091))

Functions, slicing and Tensor methods will now properly preserve the subclass type when possible.

```python
>>> class SubTensor(torch.Tensor):
...     pass
>>> type(torch.add(SubTensor([0]), SubTensor([1]))).__name__
'SubTensor'
>>> type(torch.add(SubTensor([0]), torch.Tensor([1]))).__name__
'SubTensor'
```

The old behavior of “any operation on your subclass produces a torch.Tensor instead of the subclass” can be recovered by doing:

```python
from torch._C import _disabled_torch_function_impl

class SubTensor(torch.Tensor):
    __torch_function__ = _disabled_torch_function_impl
```

For all details on how to use this feature, please refer to the [doc](https://pytorch.org/docs/stable/notes/extending.html#extending-torch) page for it.

### `tensor.__iter__`: Use `torch.unbind` instead of a for loop ([#40884](https://github.com/pytorch/pytorch/pull/40884))

This improves performance significantly, but it changes the behavior of in-place operations on the values returned by the iterator. 
This happens only if either the input Tensor or any argument of the in-place operation is a Tensor that requires gradients. And it will fail with "Output X of UnbindBackward is a view and is being modified inplace". You can recover the previous behavior by manually slicing the Tensor: `[t[i] for i in range(t.size(0))]` as shown in the example below.

1.6.0:
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate(x):
>>>     v.fill_(i)

1.7.0:
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate([x[j] for j in range(x.size(0))]):
>>>     v.fill_(i)

### Updated most functions that take zero, one, or two Tensor arguments, as well as indexing ops, to check for memory overlap in the Tensors being worked on ([#43418](https://github.com/pytorch/pytorch/pull/43418), [#43419](https://github.com/pytorch/pytorch/pull/43419), [#43420](https://github.com/pytorch/pytorch/pull/43420), [#43421](https://github.com/pytorch/pytorch/pull/43421), [#43423](https://github.com/pytorch/pytorch/pull/43423), [#43422](https://github.com/pytorch/pytorch/pull/43422))

This fixes silent correctness errors: something that used to be silently incorrect now errors out. Code that raises this error must be updated to avoid the op that was returning wrong results, as shown in the example below:

```python
>>> x = torch.randn(1, 3)
>>> # Create a tensor that has internal memory overlap
>>> y = x.expand(2, 3)

# In 1.6, this would not error out, but in 1.7, this errors out
>>> torch.nn.functional.elu(y, inplace=True)
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.

# Here is the fix in 1.7
>>> torch.nn.functional.elu(y, inplace=False)
```

C++ API: any external users of `TensorIterator` now always get the memory overlap check. The previous behavior can be recovered by setting `set_check_mem_overlap(false)` when creating the iterator.

## TorchScript

### TorchScript now correctly supports various exception types and custom exception messages ([#41907](https://github.com/pytorch/pytorch/pull/41907))

* Exceptions raised in TorchScript were traditionally replaced with a generic runtime error that doesn’t carry the exception type or message, leading to crashes that are difficult to pinpoint and debug. We improved TorchScript to correctly parse exception types and messages and surface them to users.
* This change is backward incompatible because TorchScript now attempts to compile user code that creates custom exception messages instead of ignoring it. Any TorchScript-incompatible Python features used in those code snippets will lead to failures.
* There is no fixed formula to fix this backward incompatibility failure other than updating the code that generates exceptions to be TorchScript-able.

### TorchScript now supports properties of TorchScript classes and ScriptModules ([#42389](https://github.com/pytorch/pytorch/pull/42389), [#42390](https://github.com/pytorch/pytorch/pull/42390))

* TorchScript added support for `@property` on TorchScript classes and ScriptModules. Custom setters and getters are also supported. Custom deleters are not supported.
* This improvement is backward incompatible because TorchScript now attempts to script properties of existing classes and `Modules`. If these properties use Python or PyTorch features that are not supported in TorchScript, scripting will fail.
* There are two ways of fixing backward incompatibility failures introduced by this change. One is using `@torch.jit.unused` to annotate problematic properties; the other is to update the implementation of the property so that the getter and setter are scriptable.

## Quantization

### The convolution parameters now support versioning.

* This change means that any quantized convolution module **saved** using PyTorch 1.7+ cannot be loaded in v1.6 and lower.
* But this change is backward compatible: if the model (with conv layers) is saved in version 1.6, it can be safely loaded in version 1.7. 
## Some undocumented functions that were mistakenly made public have been removed

* `torch.absolute_` has been removed; the Tensor method (`Tensor.absolute_`) should be used instead, just like all other inplace ops.
* `torch.ExtraFilesMap` is an internal jit construct and should not be used.

## TorchScript Compiler Update

In 1.7, we are enabling a Profiling Executor and a new Tensor-Expressions-based (TE) Fuser. All compilations will now go through one (an adjustable setting) profiling run and one optimization run. For the profiling run, complete tensor shapes are recorded and used by the new Fuser. For the optimization run, the focus is on finding (in `torch.jit.ScriptModule`s) and fusing element-wise operations over CUDA tensors into a single CUDA kernel.

The TE fuser is expected to deliver performance similar to the old fuser used in 1.6. It however unlocks more opportunities for performance improvements in future releases. In rare cases, performance of some models may degrade 5-10%. If you experience any regressions please report it on Github, so we can address them as soon as possible! For 1.7, we are providing an option for our users to revert back to the old fuser by calling `torch._C._jit_set_profiling_executor(False)` in Python and `torch::jit::getExecutorMode() = false;` in C++. For more information, please see the [“Graph Executor” section](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/OVERVIEW.md#graph-executor) in our documentation.

# Deprecations

## Python API

### `torch.norm` and `torch.functional.norm` are deprecated in favor of `torch.linalg.norm` ([#44321](https://github.com/pytorch/pytorch/pull/44321))

The new `torch.linalg.norm` has the same behavior as `numpy.linalg.norm`. Both deprecated functions had odd behaviors for matrix and vector norms. You should refer to the doc [here](https://pytorch.org/docs/stable/generated/torch.norm.html?highlight=norm#torch.norm) to find the exact behavior they had and how to replicate it with the new API.

### Deprecate fft functions in the `torch.` namespace in favor of the `torch.fft.` namespace ([#44876](https://github.com/pytorch/pytorch/pull/44876))

Please use `torch.fft.foo` as a drop-in replacement for `torch.foo` for the following functions: `fft`, `ifft`, `rfft` and `irfft`.

### Warns when some `out=` functions need to resize an output which is not 0-size ([#42079](https://github.com/pytorch/pytorch/pull/42079))

This behavior is dangerous and leads to an API that is hard to use. It is being deprecated to be able to fix that API in future versions. You should resize the output beforehand to avoid any issue in the future:

```python
a = torch.rand(5)
b = torch.rand(25)

# This is deprecated
torch.add(a, a, out=b)

# This has the same behavior but will work in future versions
torch.add(a, a, out=b.resize_(0))
```

### `torch.optim`: Warn for duplicate params in param group ([#41597](https://github.com/pytorch/pytorch/pull/41597))

Providing the same Parameter multiple times in a single param group is most likely a user error and is being deprecated. Please open an issue if you have a valid use case that requires this feature.

### `torch.linspace` and `torch.logspace`: Not giving the step argument is deprecated ([#43860](https://github.com/pytorch/pytorch/pull/43860))

The default `steps` argument that has been used historically in PyTorch is not consistent with other libraries and so is being removed to avoid confusion. For both functions, passing the `steps=100` keyword argument can be used to recover the original behavior.

| 1.6.0 | 1.7.0 |
| --- | --- |
| `>>> torch.linspace(0, 10).size()`<br>`torch.Size([100])` | `>>> torch.linspace(0, 10).size()`<br>`UserWarning: Not providing a value for linspace's steps is deprecated and will throw a runtime error in a future release.`<br>`torch.Size([100])`<br>`>>> torch.linspace(0, 10, steps=100).size()`<br>`torch.Size([100])` |

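As noted in the `torch.norm` deprecation above, the migration is usually a one-line change. Below is a minimal, illustrative sketch (assuming PyTorch 1.7+); for the default Frobenius norm of a matrix the two calls should agree, but consult the linked documentation before porting code that relies on non-default `p`/`dim` arguments:

```python
import torch

x = torch.randn(4, 4)

# Deprecated in 1.7:
old = torch.norm(x)

# Preferred replacement, matching numpy.linalg.norm semantics:
new = torch.linalg.norm(x)

assert torch.allclose(old, new)
```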
## Distributed * Make TensorPipe the default backend for RPC ([#43246](https://github.com/pytorch/pytorch/pull/43246)) * Infer RPC backend type to preserve backward compatibility as we make TensorPipe the default ([#45065](https://github.com/pytorch/pytorch/pull/45065)) * Add deprecation warning to ProcessGroup backend and make TensorPipe backend stable. ([#45356](https://github.com/pytorch/pytorch/pull/45356)) * Add warnings on `ProcessGroup` and `ProcessGroup::Work` APIs which will be retired soon. ([#46366](https://github.com/pytorch/pytorch/pull/46366)) # New features ### Python API New namespaces: * `torch.fft` added ([#41911](https://github.com/pytorch/pytorch/pull/41911)) * `torch.linalg` added ([#42664](https://github.com/pytorch/pytorch/pull/42664)) * `torch.optim.functional` added ([#44715](https://github.com/pytorch/pytorch/pull/44715)) New operators: * `torch.count_nonzero` added ([#39992](https://github.com/pytorch/pytorch/pull/39992)) * `nn.SiLU` activation added ([#41034](https://github.com/pytorch/pytorch/pull/41034)) * `torch.logit` added ([#41062](https://github.com/pytorch/pytorch/pull/41062)) * `torch.gcd`, `torch.lcm` added ([#40651](https://github.com/pytorch/pytorch/pull/40651), [#41552](https://github.com/pytorch/pytorch/pull/41552), [#42254](https://github.com/pytorch/pytorch/pull/42254)) * `torch.functional.atleast_{1d/2d/3d}` added ([#41317](https://github.com/pytorch/pytorch/pull/41317)) * `torch.isreal` added ([#41298](https://github.com/pytorch/pytorch/pull/41298)) * `nn.Unflatten` added ([#41564](https://github.com/pytorch/pytorch/pull/41564)) * `torch.movedim` added ([#41480](https://github.com/pytorch/pytorch/pull/41480)) * `torch.isposinf`, `torch.isneginf` added ([#41588](https://github.com/pytorch/pytorch/pull/41588)) * `torch.signbit` added ([#41589](https://github.com/pytorch/pytorch/pull/41589)) * `torch.absolute` added ([#42586](https://github.com/pytorch/pytorch/pull/42586)) * `torch.clip` alias added ([#42770](https://github.com/pytorch/pytorch/pull/42770)) * `torch.quantile` added ([#42755](https://github.com/pytorch/pytorch/pull/42755)) * `torch.linalg.det` and `torch.outer` alias added ([#42802](https://github.com/pytorch/pytorch/pull/42802)) * `torch.nansum` added ([#38628](https://github.com/pytorch/pytorch/pull/38628)) * `torch.hypot` added ([#42291](https://github.com/pytorch/pytorch/pull/42291)) * `torch.nextafter` added ([#42580](https://github.com/pytorch/pytorch/pull/42580)) * `torch.hstack`, `torch.vstack`, `torch.dstack` added ([#42799](https://github.com/pytorch/pytorch/pull/42799)) * `torch.arccosh` alias added ([#43107](https://github.com/pytorch/pytorch/pull/43107)) * `Tensor.movedim` as a method added ([#43122](https://github.com/pytorch/pytorch/pull/43122)) * `torch.matrix_exp` added ([#40161](https://github.com/pytorch/pytorch/pull/40161)) * `torch.fix` alias added ([#43326](https://github.com/pytorch/pytorch/pull/43326)) * `torch.arccos`, `torch.arcsin`, `torch.arctan` aliases added ([#43319](https://github.com/pytorch/pytorch/pull/43319)) * `torch.negative` alias added ([#43400](https://github.com/pytorch/pytorch/pull/43400)) * `torch.maximum`, `torch.minimum` added ([#42579](https://github.com/pytorch/pytorch/pull/42579)) * `torch.arctanh`, `torch.arcsinh` aliases added ([#43762](https://github.com/pytorch/pytorch/pull/43762)) * `torch.linalg.norm` added ([#42749](https://github.com/pytorch/pytorch/pull/42749), [#43907](https://github.com/pytorch/pytorch/pull/43907)) * `torch.amax`, `torch.amin` added 
([#43819](https://github.com/pytorch/pytorch/pull/43819)) * `torch.heaviside` added ([#42523](https://github.com/pytorch/pytorch/pull/42523)) * `torch.i0` added ([#43132](https://github.com/pytorch/pytorch/pull/43132)) * `torch.not_equal`, `torch.greater`, `torch.greater_equal`, `torch.less`, `torch.less_equal` aliases added ([#43870](https://github.com/pytorch/pytorch/pull/43870)) * `torch.exp2` added ([#44184](https://github.com/pytorch/pytorch/pull/44184)) * `torch.kaiser_window` added ([#44271](https://github.com/pytorch/pytorch/pull/44271)) * `torch.nanquantile` added ([#44393](https://github.com/pytorch/pytorch/pull/44393)) * `torch.multiply`, `torch.divide` aliases added ([#44463](https://github.com/pytorch/pytorch/pull/44463)) * `nn.TripletMarginWithDistanceLoss` added ([#43680](https://github.com/pytorch/pytorch/pull/43680)) * `torch.fft.fft`, `torch.fft.ifft`, `torch.fft.rfft`, `torch.fft.irfft`, `torch.fft.hfft`, `torch.fft.ihfft` added ([#43011](https://github.com/pytorch/pytorch/pull/43011)) * `torch.fft.fftn`, `torch.fft.ifftn`, `torch.fft.rfftn`, `torch.fft.irfftn` added ([#44550](https://github.com/pytorch/pytorch/pull/44550)) * `optim.functional.adagrad` added ([#44715](https://github.com/pytorch/pytorch/pull/44715)) * `optim.functional.adam` added ([#44791](https://github.com/pytorch/pytorch/pull/44791)) * `torch.complex`, `torch.polar` added ([#39617](https://github.com/pytorch/pytorch/pull/39617)) * `Tensor.__complex__` added ([#43844](https://github.com/pytorch/pytorch/pull/43844)) * `torch.vdot` added ([#43004](https://github.com/pytorch/pytorch/pull/43004)) API extension: * `torch.full` added support for bool and integer dtypes ([#41912](https://github.com/pytorch/pytorch/pull/41912)) * `torch.lt` and `torch.masked_select` added support for half dtype ([#43704](https://github.com/pytorch/pytorch/pull/43704)) * `torch.div`, `torch.true_divide`, `torch.atan2` added support for integer to float type promotion in ([#42359](https://github.com/pytorch/pytorch/pull/42359)) * `unflatten` added support for non-named dimensions ([#42563](https://github.com/pytorch/pytorch/pull/42563)) * `torch.polygamma` added support for n >= 2 ([#42499](https://github.com/pytorch/pytorch/pull/42499)) * `torch.qr` added backward support for wide input matrices ([#42216](https://github.com/pytorch/pytorch/pull/42216)) * `nn.Linear` for MKLDNN added support for no-bias ([#43703](https://github.com/pytorch/pytorch/pull/43703)) * `torch.lerp` added support for half dtype ([#43541](https://github.com/pytorch/pytorch/pull/43541)) * Updates `torch.div` to perform true division (end of deprecation cycle) ([#42907](https://github.com/pytorch/pytorch/pull/42907)) * `torch.scatter` added support for reductions on CUDA ([#41977](https://github.com/pytorch/pytorch/pull/41977)) * BFloat16 support type promotion ([#41698](https://github.com/pytorch/pytorch/pull/41698), [#43324](https://github.com/pytorch/pytorch/pull/43324)) * BFloat16 support on CUDA for `torch.pow` ([#44760](https://github.com/pytorch/pytorch/pull/44760)), unary ops and activations ([#44813](https://github.com/pytorch/pytorch/pull/44813), [#44824](https://github.com/pytorch/pytorch/pull/44824), [#44834](https://github.com/pytorch/pytorch/pull/44834)), `torch.i0` ([#44750](https://github.com/pytorch/pytorch/pull/44750)), `softmax` ([#44837](https://github.com/pytorch/pytorch/pull/44837)), `div`, `addcdiv`, `addcmul`, `mean`, `var` ([#44758](https://github.com/pytorch/pytorch/pull/44758)), `layernorm` 
([#45002](https://github.com/pytorch/pytorch/pull/45002)),all pooling layers ([#44836](https://github.com/pytorch/pytorch/pull/44836), [#45151](https://github.com/pytorch/pytorch/pull/45151))), `torch.logspace` (CPU and CUDA) ([#44675](https://github.com/pytorch/pytorch/pull/44675)), random kernels on Windows ([#44918](https://github.com/pytorch/pytorch/pull/44918)), `torch.addmm`, `torch.addmv` ([#44986](https://github.com/pytorch/pytorch/pull/44986)), loss functions ([#45011](https://github.com/pytorch/pytorch/pull/45011)), batched gemm ([#45167](https://github.com/pytorch/pytorch/pull/45167)), nccl path ([#38515](https://github.com/pytorch/pytorch/pull/38515)), binary logical operators ([#42485](https://github.com/pytorch/pytorch/pull/42485)), `torch.neg` ([#45240](https://github.com/pytorch/pytorch/pull/45240)), Conv (non-cuDNN) ([#45007](https://github.com/pytorch/pytorch/pull/45007)), `torch.abs` ([#44804](https://github.com/pytorch/pytorch/pull/44804)), `torch.erfinv` ([#43399](https://github.com/pytorch/pytorch/pull/43399)), comparison ops ([#44748](https://github.com/pytorch/pytorch/pull/44748)) * `torch.asin`, `torch.neg` added support for sparse Tensors ([#44028](https://github.com/pytorch/pytorch/pull/44028)) * `torch.softmax` added support for CUDA ([#42307](https://github.com/pytorch/pytorch/pull/42307)) * `Tensor.{real,imag}` added setter for these attributes ([#39860](https://github.com/pytorch/pytorch/pull/39860)) * `torch.{addmm,addmv}` added support for complex on CUDA ([#40431](https://github.com/pytorch/pytorch/pull/40431), [#43827](https://github.com/pytorch/pytorch/pull/43827)) * `torch.bmm` added support for complex on CPU [#42383](https://github.com/pytorch/pytorch/pull/42383), * `torch.{dot, vdot}` added support for complex ([#42745](https://github.com/pytorch/pytorch/pull/42745)) * `torch.stft`, `torch.istft` added support for complex ([#43886](https://github.com/pytorch/pytorch/pull/43886)) * `torch.cholesky` added support for complex ([#44895](https://github.com/pytorch/pytorch/pull/44895), [#45267](https://github.com/pytorch/pytorch/pull/45267)) * `torch.sgn` added (to support complex) ([#39955](https://github.com/pytorch/pytorch/pull/39955)) * Binary ops added support for complex ([#43174](https://github.com/pytorch/pytorch/pull/43174)) * Add allowlist for complex backward ([#45461](https://github.com/pytorch/pytorch/pull/45461)) ### Autograd * Don't automatically materialize output grads with zeros for `autograd.Function` ([#41821](https://github.com/pytorch/pytorch/pull/41821)) * Benchmark tool for `autograd.functional` API ([#43428](https://github.com/pytorch/pytorch/pull/43428)) * Added `reset_grad` API to remove gradient instead of setting them to zero ([#44423](https://github.com/pytorch/pytorch/pull/44423)) * Allow Tensor-like objects in `torch.autograd.gradcheck` ([#43877](https://github.com/pytorch/pytorch/pull/43877)) * Added support for nested call for `@torch.no_grad()` decorator ([#44633](https://github.com/pytorch/pytorch/pull/44633)) * Added support for `torch.lobpcg` backward ([#43002](https://github.com/pytorch/pytorch/pull/43002)) ### CUDA * Added TF32 support ([#41498](https://github.com/pytorch/pytorch/pull/41498)) * CUDA RTX30 series support ([#45489](https://github.com/pytorch/pytorch/pull/45489), [#45130](https://github.com/pytorch/pytorch/pull/45130)) * **Note: **At the time of the 1.7 release, the currently available and stable Nvidia CUDA libraries are not fully tuned for the RTX 3080 and 3090 so users might see performance 
regressions. * `torch.cuda.amp.GradScaler` now supports sparse gradients ([#36786](https://github.com/pytorch/pytorch/pull/36786)) * Autocast support for cudnn RNNs ([#42385](https://github.com/pytorch/pytorch/pull/42385)) * Support AMP in nn.parallel ([#43102](https://github.com/pytorch/pytorch/pull/43102)) * Support for tf32 in cudnn and `backends.cudnn.allow_tf32` flag to control it ([#40737](https://github.com/pytorch/pytorch/pull/40737)) * Added `torch.cuda.memory.list_gpu_processes` to list running processes on a give GPU ([#44616](https://github.com/pytorch/pytorch/pull/44616)) * Add env variable to bypass CUDACachingAllocator for debugging ([#45294](https://github.com/pytorch/pytorch/pull/45294)) * Add non-deterministic alert to CUDA operations that use `atomicAdd()` ([#41538](https://github.com/pytorch/pytorch/pull/41538)) ### C++ API * `nn::TransformerEncoderLayer` added ([#42633](https://github.com/pytorch/pytorch/pull/42633)) * `nn::TransformerDecoderLayer` added ([#42717](https://github.com/pytorch/pytorch/pull/42717)) * `nn::TransformerEncoder` added ([#43187](https://github.com/pytorch/pytorch/pull/43187)) * `nn::TransformerDecoder` added ([#42886](https://github.com/pytorch/pytorch/pull/42886)) * `nn::Transformer` added ([#44333](https://github.com/pytorch/pytorch/pull/44333)) * `nn::Unflatten` added ([#42613](https://github.com/pytorch/pytorch/pull/42613)) * `nn.ParameterList` added ([#41259](https://github.com/pytorch/pytorch/pull/41259)) * `torch::cuda::manual_seed` and `torch::cuda::manual_seed_all` added ([#42638](https://github.com/pytorch/pytorch/pull/42638)) ### Mobile * Support Tensor MemoryFormat in java wrappers ([#40785](https://github.com/pytorch/pytorch/pull/40785)) * Add `mobile_optimized` boolean flag to optimized model. ([#45479](https://github.com/pytorch/pytorch/pull/45479)) ### Vulkan * Backend added ([#36491](https://github.com/pytorch/pytorch/pull/36491), [#43076](https://github.com/pytorch/pytorch/pull/43076)) * Add many operators `adaptive_avg_pool2d` ([#41220](https://github.com/pytorch/pytorch/pull/41220)), `mm` ([#41221](https://github.com/pytorch/pytorch/pull/41221)), `reshape` ([#41223](https://github.com/pytorch/pytorch/pull/41223)), `max_pool2d` ([#41379](https://github.com/pytorch/pytorch/pull/41379)), `add_` and `relu_` ([#41380](https://github.com/pytorch/pytorch/pull/41380)), `cat` ([#41434](https://github.com/pytorch/pytorch/pull/41434)), `add` and `mul` ([#42674](https://github.com/pytorch/pytorch/pull/42674)) and `avg_pool2d` ([#42675](https://github.com/pytorch/pytorch/pull/42675)). 
* Model preparation via `torch.utils.optimize_for_vulkan` ([#44903](https://github.com/pytorch/pytorch/pull/44903)) * Add to Java API option to load on Vulkan and test app ([#44896](https://github.com/pytorch/pytorch/pull/44896), [#44897](https://github.com/pytorch/pytorch/pull/44897)) ### Distributed * Support alltoall collective in ProcessGroupGloo ([#41424](https://github.com/pytorch/pytorch/pull/41424), [#41690](https://github.com/pytorch/pytorch/pull/41690)) * Add a DDP Communication Hook providing the flexibility to completely override DDP gradient communication ([#40848](https://github.com/pytorch/pytorch/pull/40848)) * Examples on how to use the DDP communication hook ([#43310](https://github.com/pytorch/pytorch/pull/43310)) * Add NCCL Alltoall to NCCL process group ([#42514](https://github.com/pytorch/pytorch/pull/42514)) * Support allgather and gather APIs for Python Objects ([#42189](https://github.com/pytorch/pytorch/pull/42189)) * Join-based API to support uneven inputs in DDP ([#42577](https://github.com/pytorch/pytorch/pull/42577)) * broadcast_object API for c10d ([#43887](https://github.com/pytorch/pytorch/pull/43887)) * Async Error Handling support for ProcessGroupNCCL ([#41050](https://github.com/pytorch/pytorch/pull/41050), [#41051](https://github.com/pytorch/pytorch/pull/41051), [#41052](https://github.com/pytorch/pytorch/pull/41052), [#41053](https://github.com/pytorch/pytorch/pull/41053), [#41054](https://github.com/pytorch/pytorch/pull/41054), [#44163](https://github.com/pytorch/pytorch/pull/44163)) * Add a “gradient_as_bucket_view" parameter to DDP to reduce memory overhead ([#44344](https://github.com/pytorch/pytorch/pull/44344)) * Add getNumKeys API to c10d TCPStore ([#43962](https://github.com/pytorch/pytorch/pull/43962)) * Add DeleteKey API for c10d TCP Store ([#45401](https://github.com/pytorch/pytorch/pull/45401)) ### Quantization * New quantized ops * Adaptive average pooling ([#40271](https://github.com/pytorch/pytorch/pull/40271)) * Max pooling ([#45152](https://github.com/pytorch/pytorch/pull/45152)) * Embedding and EmbeddingBag quantization (8-bit + partial support for 4-bit): ([#40076](https://github.com/pytorch/pytorch/pull/40076), [#41293](https://github.com/pytorch/pytorch/pull/41293), [#41612](https://github.com/pytorch/pytorch/pull/41612), [#42924](https://github.com/pytorch/pytorch/pull/42924), [#42762](https://github.com/pytorch/pytorch/pull/42762), [#42881](https://github.com/pytorch/pytorch/pull/42881), [#43077](https://github.com/pytorch/pytorch/pull/43077), [#43088](https://github.com/pytorch/pytorch/pull/43088), [#43090](https://github.com/pytorch/pytorch/pull/43090), [#43176](https://github.com/pytorch/pytorch/pull/43176), [#43296](https://github.com/pytorch/pytorch/pull/43296), [#43433](https://github.com/pytorch/pytorch/pull/43433), [#43989](https://github.com/pytorch/pytorch/pull/43989), [#44008](https://github.com/pytorch/pytorch/pull/44008), [#44207](https://github.com/pytorch/pytorch/pull/44207), [#44208](https://github.com/pytorch/pytorch/pull/44208), [#44217](https://github.com/pytorch/pytorch/pull/44217), [#45149](https://github.com/pytorch/pytorch/pull/45149), [#44845](https://github.com/pytorch/pytorch/pull/44845), [#44048](https://github.com/pytorch/pytorch/pull/44048), [#42690](https://github.com/pytorch/pytorch/pull/42690), [#42612](https://github.com/pytorch/pytorch/pull/42612)) * QNNPACK Transposed convolution2D and 3D ([#39714](https://github.com/pytorch/pytorch/pull/39714), 
[#40351](https://github.com/pytorch/pytorch/pull/40351), [#40360](https://github.com/pytorch/pytorch/pull/40360), [#40370](https://github.com/pytorch/pytorch/pull/40370), [#40371](https://github.com/pytorch/pytorch/pull/40371), [#44844](https://github.com/pytorch/pytorch/pull/44844), [#45078](https://github.com/pytorch/pytorch/pull/45078), [#45081](https://github.com/pytorch/pytorch/pull/45081)) * Operations on quantized tensors * `aten::repeat` ([#40644](https://github.com/pytorch/pytorch/pull/40644)) * `aten::apend` ([#40743](https://github.com/pytorch/pytorch/pull/40743)) * `stack` ([#42187](https://github.com/pytorch/pytorch/pull/42187)) * `fill_` ([#43303](https://github.com/pytorch/pytorch/pull/43303)) * `clone` for per channel affine quantized tensor ([#44573](https://github.com/pytorch/pytorch/pull/44573)) * `append` (graphmode) ([#44641](https://github.com/pytorch/pytorch/pull/44641)) * 1D batch normalization support ([#42491](https://github.com/pytorch/pytorch/pull/42491)) * N-Dimensional constant padding ([#43304](https://github.com/pytorch/pytorch/pull/43304)) * CELU operator ([#39199](https://github.com/pytorch/pytorch/pull/39199)) * Support for FP16 quantization ([#40708](https://github.com/pytorch/pytorch/pull/40708), [#40709](https://github.com/pytorch/pytorch/pull/40709), [#40710](https://github.com/pytorch/pytorch/pull/40710), [#42147](https://github.com/pytorch/pytorch/pull/42147), [#42221](https://github.com/pytorch/pytorch/pull/42221), [#42222](https://github.com/pytorch/pytorch/pull/42222), [#42348](https://github.com/pytorch/pytorch/pull/42348), [#41049](https://github.com/pytorch/pytorch/pull/41049)) * Add Quantizer support to IValue ([#42438](https://github.com/pytorch/pytorch/pull/42438)) * Custom module support ([#44835](https://github.com/pytorch/pytorch/pull/44835)) * Preserving pre and post forward hooks ([#37233](https://github.com/pytorch/pytorch/pull/37233)) ### Misc * `torch.set_deterministic` and `torch.is_deterministic`: Raise error when the flag is set and a non-deterministic operation is used ([#15359](https://github.com/pytorch/pytorch/issues/15359), [#41377](https://github.com/pytorch/pytorch/issues/41377)) * Add CUDA 11 to nightly binaries ([#44086](https://github.com/pytorch/pytorch/pull/44086), [#43366](https://github.com/pytorch/pytorch/pull/43366)) * Dev Tool: Nightly checkout tool and doc in `CONTRIBUTING.md` ([#42635](https://github.com/pytorch/pytorch/pull/42635), [#43294](https://github.com/pytorch/pytorch/pull/43294)) * Website: Add docs for tagged version (include rc) on the general website ([#45204](https://github.com/pytorch/pytorch/pull/45204)) * Build: Added BUILD_CAFFE2 flag to be able to disable caffe2 compilation ([#43673](https://github.com/pytorch/pytorch/pull/43673)) * Dataloader: Add `prefetch_factor` argument to control the number of batch loaded ahead of time([#41130](https://github.com/pytorch/pytorch/pull/41130)) * Dataloader: Allow handling of `np.memmap` objects ([#39847](https://github.com/pytorch/pytorch/pull/39847)) * ROCm: Add support torch `utils.cpp_extension` ([#41257](https://github.com/pytorch/pytorch/pull/41257), [#43528](https://github.com/pytorch/pytorch/pull/43528)) * ROCm: Enable complex BLAS ([#43744](https://github.com/pytorch/pytorch/pull/43744)) * docker: Add torchelastic to docker image ([#45438](https://github.com/pytorch/pytorch/pull/45438)) * docker: Add CUDA 11 support ([#45071](https://github.com/pytorch/pytorch/pull/45071)) * docker: Use python 3.8 in pytorch docker image 
([#45466](https://github.com/pytorch/pytorch/pull/45466)) # Improvements ### Python API * Use tree-based sum for floats to avoid numerical instability ([#39516](https://github.com/pytorch/pytorch/pull/39516)) * `nn.ReflectionPad`: Add support for 0-dim batch sizes. ([#39231](https://github.com/pytorch/pytorch/pull/39231)) * `torch.scatter`: Add reductions for CPU ([#36447](https://github.com/pytorch/pytorch/pull/36447)) * Allow any valid ASCII python identifiers as dimnames ([#40871](https://github.com/pytorch/pytorch/pull/40871)) * Improve Python warning prints when there is also an error ([#41116](https://github.com/pytorch/pytorch/pull/41116)) * `torch.iinfo`, `torch.finfo`: Improve printing ([#40488](https://github.com/pytorch/pytorch/pull/40488)) * `torch.where`: Add support for scalar input ([#40336](https://github.com/pytorch/pytorch/pull/40336)) * `torch.nonzero`: Remove deprecation warning for `as_tuple` argument ([#45413](https://github.com/pytorch/pytorch/pull/45413)) * `torch.distributions.Categorical`: Clamp logit to avoid `-inf` when calculating entropy ([#41002](https://github.com/pytorch/pytorch/pull/41002)) * `torch.futures.Future`: Add `done` function to query the status of the future ([#42013](https://github.com/pytorch/pytorch/pull/42013)) ### torch.nn * `nn.EmbeddingBag`: Add support for `incude_last_offset=True` when reduction is mean or max ([#42215](https://github.com/pytorch/pytorch/pull/42215)) * `nn.AvgPooling{1,2,3}d`: Ensure all cells are valid in ceil mode to avoid division by 0 ([#41368](https://github.com/pytorch/pytorch/pull/41368)) * `nn,[Adaptive]MaxPool{1,2,3}d`: Handle edge case when input is filled with -inf ([#40665](https://github.com/pytorch/pytorch/pull/40665)) * `nn.Hardsigmoid`, `nn.Hardswish`: Add inplace option ([#42346](https://github.com/pytorch/pytorch/pull/42346)) * `nn.MSELoss`, `nn.L1Loss`, `nn.SmoothL1Loss`: Add support for target that requires gradients. ([#44437](https://github.com/pytorch/pytorch/pull/44437), [#44471](https://github.com/pytorch/pytorch/pull/44471), [#44486](https://github.com/pytorch/pytorch/pull/44486)) * `nn.Parameter{List,Dict}`: Add warning when improperly used (with DataParallel or weight_norm) ([#44405](https://github.com/pytorch/pytorch/pull/44405)) * `nn.functional.smooth_l1`: Add beta parameter ([#44433](https://github.com/pytorch/pytorch/pull/44433)) ### Build * Report error when ATEN_THEADING is OMP and USE_OPENMP is turned off. ([#40146](https://github.com/pytorch/pytorch/pull/40146)) * Raise nice error when trying to build PyTorch on 32-bit Windows system ([#40321](https://github.com/pytorch/pytorch/pull/40321)) * Make setup.py Python-2 syntactically correct and work for version >= 3.9 ([#41960](https://github.com/pytorch/pytorch/pull/41960), [#46388](https://github.com/pytorch/pytorch/pull/46388)) * Don't proceed into setup.py too far if Python version is unsupported ([#42870](https://github.com/pytorch/pytorch/pull/42870)) ### Distributed * Support profiling rpc_async in TorchScript ([#40652](https://github.com/pytorch/pytorch/pull/40652)) * Allow RPC to be initialized again after shutdown. ([#42723](https://github.com/pytorch/pytorch/pull/42723)) * Support rpc_sync, rpc.remote in TorchScript ([#43043](https://github.com/pytorch/pytorch/pull/43043), [#43046](https://github.com/pytorch/pytorch/pull/43046)) * Make async_execution compatible with RRef helpers ([#44666](https://github.com/pytorch/pytorch/pull/44666)) * Extend RPC profiling to support async function execution over RPC. 
([#44664](https://github.com/pytorch/pytorch/pull/44664)) * Support record_shapes in RPC profiling ([#44419](https://github.com/pytorch/pytorch/pull/44419)) * Add variants for cuda.comm.broadcast/gather/scatter which store the result in a provided “out” parameter ([#39681](https://github.com/pytorch/pytorch/pull/39681)) * Explicitly abort NCCL Communicators on ProcessGroupNCCL Destruction ([#40585](https://github.com/pytorch/pytorch/pull/40585)) * Helper function to print out all DDP-relevant env vars ([#41297](https://github.com/pytorch/pytorch/pull/41297)) * Add timeout to ProcessGroup Work Wait ([#40944](https://github.com/pytorch/pytorch/pull/40944)) * Support Wait Timeout in ProcessGroupNCCL ([#40946](https://github.com/pytorch/pytorch/pull/40946)) * Support work-level timeouts in ProcessGroupGloo ([#40948](https://github.com/pytorch/pytorch/pull/40948)) * Support for torch.bool in ProcessGroupNCCL ([#41959](https://github.com/pytorch/pytorch/pull/41959)) * DDP.train() returns self to stay consistent with nn.Module ([#42131](https://github.com/pytorch/pytorch/pull/42131)) * Add a drop_last option in DistributedSampler to drop tail of the data to ensure data is even across ranks ([#41171](https://github.com/pytorch/pytorch/pull/41171)) * Additional error checking for `torch.cuda.nccl` APIs. ([#43247](https://github.com/pytorch/pytorch/pull/43247)) * Support work.result() to get result tensors for allreduce for Gloo, NCCL backends ([#43970](https://github.com/pytorch/pytorch/pull/43970)) * Add a device parameter to RemoteModule ([#44254](https://github.com/pytorch/pytorch/pull/44254)) * Add remote_parameters() API for RemoteModule. ([#43906](https://github.com/pytorch/pytorch/pull/43906)) * Add a warning log when there is high skew of uneven inputs in DDP training ([#45238](https://github.com/pytorch/pytorch/pull/45238)) ### TorchScript * Support string concatenation (cc29c192a6) * Support using Python Enum in TorchScript ([#41390](https://github.com/pytorch/pytorch/pull/41390),[#41965,](https://github.com/pytorch/pytorch/pull/41965)[#42085,](https://github.com/pytorch/pytorch/pull/42085)[#42623,](https://github.com/pytorch/pytorch/pull/42623)[#42661,](https://github.com/pytorch/pytorch/pull/42661)[#42661,](https://github.com/pytorch/pytorch/pull/42661)[#42874,](https://github.com/pytorch/pytorch/pull/42874)[#43460,](https://github.com/pytorch/pytorch/pull/43460)[#43188,](https://github.com/pytorch/pytorch/pull/43188)[#44243,](https://github.com/pytorch/pytorch/pull/44243)[#44891](https://github.com/pytorch/pytorch/pull/44891)) * Support sorting list of strings ([#42398](https://github.com/pytorch/pytorch/pull/42398)) * Support boolean key in dictionary ([#42833](https://github.com/pytorch/pytorch/pull/42833)) * Support `@torch.no_grad` ([#41371](https://github.com/pytorch/pytorch/pull/41371)) * Support `del` to TorchScript classes ([#44352](https://github.com/pytorch/pytorch/pull/44352)) * Speed up saving modules in case of having many classes ([#44589](https://github.com/pytorch/pytorch/pull/44589)) * Support Python Slice class in TorchScript ([#44335](https://github.com/pytorch/pytorch/pull/44335)) * Support sorting a list of tuples ([#43448](https://github.com/pytorch/pytorch/pull/43448)) * Enable `@torch.jit.unused` syntax for ignoring properties ([#45261](https://github.com/pytorch/pytorch/pull/45261)) * Enable ProfilingExecutor + TensorExpression (#45546) ([#45546](https://github.com/pytorch/pytorch/pull/45546)) * Support `@torch.jit.unused` on a `@torch.no_grad` decorated 
function ([#41496](https://github.com/pytorch/pytorch/pull/41496)) * Improve ModuleList indexing error msg ([#43361](https://github.com/pytorch/pytorch/pull/43361)) * Better match behavior of loaded `ScriptModule``s vs. freshly created ones ([#43298](https://github.com/pytorch/pytorch/pull/43298)) * Support backend-lowered submodules ([#41146](https://github.com/pytorch/pytorch/pull/41146)) * Allow freezing of modules containing interface attribute ([#41860](https://github.com/pytorch/pytorch/pull/41860)) * `to_backend` API now accepts wrapped modules ([#43612](https://github.com/pytorch/pytorch/pull/43612)) * Allow submodule methods inference rules to be different ([#43872](https://github.com/pytorch/pytorch/pull/43872)) * Support default values for arguments of class type methods ([#45098](https://github.com/pytorch/pytorch/pull/45098)) * Improve sugared value's error message when closing over global variables ([#42889](https://github.com/pytorch/pytorch/pull/42889)) * Support backend-lowered submodules ([#40841](https://github.com/pytorch/pytorch/pull/40841)) * Turn on non-ASCII string literals serialization ([#40719](https://github.com/pytorch/pytorch/pull/40719)) * Better printing of Tensor stride information (#[45156](https://github.com/pytorch/pytorch/pull/45156)) ### Mobile * Allow specifying PYTHON executable to build_android ([#41927](https://github.com/pytorch/pytorch/pull/41927)) * Include all overloads for OSS custom build (a01e91e6b2) ### Quantization * Change the `whitelist` to `allowlist` ([#41771](https://github.com/pytorch/pytorch/pull/41771), [#41802](https://github.com/pytorch/pytorch/pull/41802)) * `dequantize` now supports list and tuple of tensors ([#41079](https://github.com/pytorch/pytorch/pull/41079)) * User now has a way to add a activation post process hook using `register_activation_post_process_hook` function ([#42342](https://github.com/pytorch/pytorch/pull/42342)) * `add`/`mul` now support different variants ([#42769](https://github.com/pytorch/pytorch/pull/42769)) * Fake quantizer now has more info when printed ([#43031](https://github.com/pytorch/pytorch/pull/43031)) * `OP_LIST_TO_FUSER_METHOD` is exposed to the user ([#43286](https://github.com/pytorch/pytorch/pull/43286)) * `quantize_jit` can handle new upsample overloads ([#43407](https://github.com/pytorch/pytorch/pull/43407)) * Setter/getter method for quantization and fusion mappings ([#43990](https://github.com/pytorch/pytorch/pull/43990)) * fake_quant and observer can be disabled in scriptmodule ([#44773](https://github.com/pytorch/pytorch/pull/44773)) * `convert_jit` can now take `preserved_attrs` argument ([#44490](https://github.com/pytorch/pytorch/pull/44490)) * `SyncBN`: preserve qconfig if it exists ([#45317](https://github.com/pytorch/pytorch/pull/45317)) * Add quant APIs to save/load observer `state_dict` ([#44846](https://github.com/pytorch/pytorch/pull/44846)) * Add version support for the `conv` parameters ([#43524](https://github.com/pytorch/pytorch/pull/43524), [#43086](https://github.com/pytorch/pytorch/pull/43086), [#43651](https://github.com/pytorch/pytorch/pull/43651), [#44671](https://github.com/pytorch/pytorch/pull/44671)) ### ONNX In PyTorch 1.7, we have continued to add and improve PyTorch operator export to ONNX. We have enabled export of 10 new operators, and further enhanced and optimized export of 10+ torch operators to ONNX. We have also focused on improving export of TorchScript modules, in particular laying some groundwork required for better support in near future. 
We have also created an API (torch.onnx.utils._find_missing_ops_onnx_export) as a diagnostic tool (preview only) to get a list of operators in a model that are not supported or implemented by ONNX exporter. Support for export of torch.quantization.FakeQuantize has also been added to help enable some QAT workflows. * Add support to export more torch ops `torch.view_as` ([#40496](https://github.com/pytorch/pytorch/pull/40496)), fake quantize functions ([#39738](https://github.com/pytorch/pytorch/pull/39738)), embedding_bag ([#41234](https://github.com/pytorch/pytorch/pull/41234), [#44693](https://github.com/pytorch/pytorch/pull/44693)), `torch.eye` ([#41357](https://github.com/pytorch/pytorch/pull/41357)), `Tensor.as_strided` ([#41569](https://github.com/pytorch/pytorch/pull/41569)), `torch.tensor` ([#41872](https://github.com/pytorch/pytorch/pull/41872)), addition between list of tensors ([#41888](https://github.com/pytorch/pytorch/pull/41888)), `Tensor.__floordiv__` ([#43022](https://github.com/pytorch/pytorch/pull/43022)), `torch.nn.KLDivLoss` ([#41858](https://github.com/pytorch/pytorch/pull/41858)), `Tensor.new_empty` and `Tensor.new_zeros` ([#43506](https://github.com/pytorch/pytorch/pull/43506)) * Improves existing export logic and optimizing exported ONNX graph * Add warning in ONNX export when constant folding is on in training-amenable mode ([#40546](https://github.com/pytorch/pytorch/pull/40546)) * Fix export of `torch.full_like` ([#40063](https://github.com/pytorch/pytorch/pull/40063)) * Add pass that fuses Conv and BatchNormalization ([#40547](https://github.com/pytorch/pytorch/pull/40547)) * `torch.where` export, add support for ByteTensor ([#42264](https://github.com/pytorch/pytorch/pull/42264)) * Fix scalar type cast for comparison ops ([#37787](https://github.com/pytorch/pytorch/pull/37787)) * `torch.scatter` export, add support for src being scalar or different dtype ([#42765](https://github.com/pytorch/pytorch/pull/42765), [#43440](https://github.com/pytorch/pytorch/pull/43440)) * Fix Squeeze operator when applied to a dimension with shape > 1 ([#38476](https://github.com/pytorch/pytorch/pull/38476)) * Extend support for `torch.where` ([#41544](https://github.com/pytorch/pytorch/pull/41544)) * Update ops `torch.slice` ([#42935](https://github.com/pytorch/pytorch/pull/42935)), `torch.split` ([#43670](https://github.com/pytorch/pytorch/pull/43670)), `torch.repeat` ([#43430](https://github.com/pytorch/pytorch/pull/43430)), `torch.arange` ([#43777](https://github.com/pytorch/pytorch/pull/43777)), `len` ([#43824](https://github.com/pytorch/pytorch/pull/43824)), `torch.narrow` ([#44039](https://github.com/pytorch/pytorch/pull/44039)), flatten ([#40418](https://github.com/pytorch/pytorch/pull/40418)), adaptive_pool ([#46100](https://github.com/pytorch/pytorch/pull/46100)) * Update export to follow pytorch changes * Update div export to perform true divide ([#44831](https://github.com/pytorch/pytorch/pull/)) * Enable true_divide scripting export with ONNX shape inference ([#43991](https://github.com/pytorch/pytorch/pull/43911)) ### Misc * `torch.utils.collect_env`: Collect more informations (python 32/64bit, clang version, CPU architecture, ROCm version) ([#42887](https://github.com/pytorch/pytorch/pull/42887), [#42961](https://github.com/pytorch/pytorch/pull/42961), [#44106](https://github.com/pytorch/pytorch/pull/44106)) * `torch.hub.load_local`: Allow to load models from any local directory ([#44204](https://github.com/pytorch/pytorch/pull/44204)) * Add warning if `import 
torch` is called from the source root ([#39995](https://github.com/pytorch/pytorch/pull/39995)) * Improve Dynamic Library loading for Windows ([#40365](https://github.com/pytorch/pytorch/pull/40365)) * serialization: validate sparse tensors after loading ([#34059](https://github.com/pytorch/pytorch/pull/34059)) * Add `--continue-through-error` option to run_test.sh script ([#41136](https://github.com/pytorch/pytorch/pull/41136)) * Tensorboard: Support custom `run_name` and ``hparam_domain_discrete` in `add_hparams` ([#40660](https://github.com/pytorch/pytorch/pull/40660), [#40720](https://github.com/pytorch/pytorch/pull/40720)) * MKLDNN: Enable conv3d, batchnorm3d, max_pool3d and avg_pool3d ([#40691](https://github.com/pytorch/pytorch/pull/40691), [#40995](https://github.com/pytorch/pytorch/pull/40995), [#40996](https://github.com/pytorch/pytorch/pull/40996)) * Profiler: Do not record zero duration kernel events ([#41540](https://github.com/pytorch/pytorch/pull/41540)) * Profiler: Improve cuda time counting ([#45209](https://github.com/pytorch/pytorch/pull/45209)) * Profiler: Adding `with_source` parameter to enable tracking source code ([#43898](https://github.com/pytorch/pytorch/pull/43898)) * Optim: Add verbose param for all schedulers ([#41580](https://github.com/pytorch/pytorch/pull/41580)) * Pruning: check attributes before deleting ([#41913](https://github.com/pytorch/pytorch/pull/41913)) * Autograd: In `zero_grad`, avoid using inpalce `detach` when it is not required ([#41283](https://github.com/pytorch/pytorch/pull/41283)) * Autograd: Update the `torch.div` backward formula to improve numerical stability ([#43627](https://github.com/pytorch/pytorch/pull/43627)) * Autograd: Print all traceback for higher order backwards in detect_anomaly ([#43626](https://github.com/pytorch/pytorch/pull/43626)) * Autograd: Stop saving input of `torch.repeat` as only `input.dim()` is needed in backward ([#40766](https://github.com/pytorch/pytorch/pull/40766)) * CUDA: Improve cuDNN error messages to include call parameters ([#45023](https://github.com/pytorch/pytorch/pull/45023)) * CUDA: Improve `device_count` and cuda init error detection and messages ([#42249](https://github.com/pytorch/pytorch/pull/42249)) * Improve Tensor layout propagation for pointwise ops to follow input layout more closely ([#42922](https://github.com/pytorch/pytorch/pull/42922)) * Remove blacklist/whitelist references ([#41447](https://github.com/pytorch/pytorch/pull/41447), [#41644](https://github.com/pytorch/pytorch/pull/41644), [#41636](https://github.com/pytorch/pytorch/pull/41636), [#41777](https://github.com/pytorch/pytorch/pull/41777), [#41822](https://github.com/pytorch/pytorch/pull/41822), [#41691](https://github.com/pytorch/pytorch/pull/41691), [#41789](https://github.com/pytorch/pytorch/pull/41789), [#41979](https://github.com/pytorch/pytorch/pull/41979), [#41627](https://github.com/pytorch/pytorch/pull/41627), [#42011](https://github.com/pytorch/pytorch/pull/42011), [#41796](https://github.com/pytorch/pytorch/pull/41796), [#42067](https://github.com/pytorch/pytorch/pull/42067), [#42091](https://github.com/pytorch/pytorch/pull/42091), [#42097](https://github.com/pytorch/pytorch/pull/42097), [#42071](https://github.com/pytorch/pytorch/pull/42071), [#42089](https://github.com/pytorch/pytorch/pull/42089), [#42279](https://github.com/pytorch/pytorch/pull/42279), [#42047](https://github.com/pytorch/pytorch/pull/42047), [#42088](https://github.com/pytorch/pytorch/pull/42088), 
[#45260](https://github.com/pytorch/pytorch/pull/45260)) ### Python Type Annotations * Update some types in top level `torch/*.py` ([#40235](https://github.com/pytorch/pytorch/pull/40235), [#40873](https://github.com/pytorch/pytorch/pull/40873)) * Added typing for `Tensor` attributes and methods: `T` and `grad_fn` ([#40879](https://github.com/pytorch/pytorch/pull/40879)), `Tensor._version` ([#41125](https://github.com/pytorch/pytorch/pull/41125)), `ndim` ([#42909](https://github.com/pytorch/pytorch/pull/42909)), `nonzero` ([#43053](https://github.com/pytorch/pytorch/pull/43053)), [#40499](https://github.com/pytorch/pytorch/pull/40499)) * Added typing for `torch.serialization` ([#40862](https://github.com/pytorch/pytorch/pull/40862)) * Added typing for `torch.tensor` ([#45077](https://github.com/pytorch/pytorch/pull/45077)) * Added typing for `torch.Size` ([#40879](https://github.com/pytorch/pytorch/pull/40879)) * Added typing for `torch.futures` ([#41675](https://github.com/pytorch/pytorch/pull/41675)) * Added typing for `torch.random` ([#42234](https://github.com/pytorch/pytorch/pull/42234)) * Added typing for `torch.hub` ([#42252](https://github.com/pytorch/pytorch/pull/42252)) * Added typing for `collect_env.py` ([#43062](https://github.com/pytorch/pytorch/pull/43062)) * Added typing for `torch.utils` ([#39392](https://github.com/pytorch/pytorch/pull/39392), [#42647](https://github.com/pytorch/pytorch/pull/42647), [#42711](https://github.com/pytorch/pytorch/pull/42711), [#42960](https://github.com/pytorch/pytorch/pull/42960), [#43806](https://github.com/pytorch/pytorch/pull/43806), [#44136](https://github.com/pytorch/pytorch/pull/44136), [#44216](https://github.com/pytorch/pytorch/pull/44216)) * Added typing for `torch.nn` ([#43044](https://github.com/pytorch/pytorch/pull/43044), [#44093](https://github.com/pytorch/pytorch/pull/44093), [#43080](https://github.com/pytorch/pytorch/pull/43080), [#42231](https://github.com/pytorch/pytorch/pull/42231), [#40669](https://github.com/pytorch/pytorch/pull/40669)) * Added typing for `torch.sparse` ([#43108](https://github.com/pytorch/pytorch/pull/43108)) * Added typing for `torch.cuda.nvtx` ([#43443](https://github.com/pytorch/pytorch/pull/43443)) * Added typing for `torch.cuda.memory` ([#43444](https://github.com/pytorch/pytorch/pull/43444)) * Added typing for `torch.functional` ([#43446](https://github.com/pytorch/pytorch/pull/43446)) * Added typing for `torch.autograd` ([#44451](https://github.com/pytorch/pytorch/pull/44451), [#46206](https://github.com/pytorch/pytorch/pull/46206)) * Added typing for `torch.quantization.fuse_modules` ([#43786](https://github.com/pytorch/pytorch/pull/43786)) * Added typing for `torch.nn.quantized` ([#43186](https://github.com/pytorch/pytorch/pull/43186), [#44154](https://github.com/pytorch/pytorch/pull/44154), [#43110](https://github.com/pytorch/pytorch/pull/43110)) * Added typing for `torch.testing._internal` submodules ([#44575](https://github.com/pytorch/pytorch/pull/44575), [#44805](https://github.com/pytorch/pytorch/pull/44805), [#44832](https://github.com/pytorch/pytorch/pull/44832), [#44911](https://github.com/pytorch/pytorch/pull/44911), [#44927](https://github.com/pytorch/pytorch/pull/44927), [#44985](https://github.com/pytorch/pytorch/pull/44985), [#44971](https://github.com/pytorch/pytorch/pull/44971), [#45107](https://github.com/pytorch/pytorch/pull/45107), [#45368](https://github.com/pytorch/pytorch/pull/45368), [#45375](https://github.com/pytorch/pytorch/pull/45375)) * Added typing for 
`torch.backends.quantized` ([#44794](https://github.com/pytorch/pytorch/pull/44794)) * Added typing for `torch.backends.cuda` ([#44916](https://github.com/pytorch/pytorch/pull/44916)) * Added typing for `torch.cuda.{comm,nccl,amp}` ([#45350](https://github.com/pytorch/pytorch/pull/45350), [#45344](https://github.com/pytorch/pytorch/pull/45344), [#45480](https://github.com/pytorch/pytorch/pull/45480)) * Added typing for `torch.quasirandom` ([#45434](https://github.com/pytorch/pytorch/pull/45434)) * Fix typing for `jit.trace` and `onnx.export` ([#41093](https://github.com/pytorch/pytorch/pull/41093)) * Fix typing for `torch/optim/lr_scheduler.pyi` ([#41775](https://github.com/pytorch/pytorch/pull/41775), [#41866](https://github.com/pytorch/pytorch/pull/41866)) # Bug fixes ### Python API * `torch.linspace`: Fix step computation for large integral types ([#40132](https://github.com/pytorch/pytorch/pull/40132)) * `torch.pca_lowrank`: Fix un-expected memory consumption ([#40853](https://github.com/pytorch/pytorch/pull/40853)) * `torch.linspace`: Fix behavior for non-contiguous inputs on CPU ([#41286](https://github.com/pytorch/pytorch/pull/41286)) * `torch.div`: Fix division by low precision scalar ([#41446](https://github.com/pytorch/pytorch/pull/41446)) * `torch.expm1`: disable mkl as it produces wrong values in some cases ([#41654](https://github.com/pytorch/pytorch/pull/41654)) * `torch.utils.data.RandomSampler`: Stop generating samples one at a time when replacement=True ([#41682](https://github.com/pytorch/pytorch/pull/41682)) * `torch.nn.functional.grid_sample`: Fix 64-bit indexing ([#41923](https://github.com/pytorch/pytorch/pull/41923)) * `torch.nn.functional.grid_sample`: Fix crash when `grid` has NaNs ([#42703](https://github.com/pytorch/pytorch/pull/42703)) * `torch.det`: Fix on CPU ([#35136](https://github.com/pytorch/pytorch/pull/35136)) * `torch.interpolate`: Avoid zero division in cubic mode ([#42093](https://github.com/pytorch/pytorch/pull/42093)) * `torch.fmod`: Fix to work with zero divisors consistently ([#41948](https://github.com/pytorch/pytorch/pull/41948)) * `torch.masked_select`: Fix for discontiguous outputs ([#41841](https://github.com/pytorch/pytorch/pull/41841)) * `torch.cummin`, `torch.cummax`: Fix for discontiguous inputs/outputs ([#42507](https://github.com/pytorch/pytorch/pull/42507)) * `torch.einsum`: Fix for discontiguous inputs ([#42425](https://github.com/pytorch/pytorch/pull/42425)) * `torch.orgqr`: Fix input size conditions ([#42825](https://github.com/pytorch/pytorch/pull/42825)) * `torch.manual_seed`: Fix argument unpacking ([#42206](https://github.com/pytorch/pytorch/pull/42206)) * `torch.searchsorted`: Properly mark output as non differentiable ([#42933](https://github.com/pytorch/pytorch/pull/42933)) * `torch.bucketize`: Properly mark output as non differentiable ([#44102](https://github.com/pytorch/pytorch/pull/44102)) * `torch.addmm`: Properly raise error on device mismatch ([#43505](https://github.com/pytorch/pytorch/pull/43505)) * `torch.chain_matmul`: Properly handle empty args ([#43553](https://github.com/pytorch/pytorch/pull/43553)) * `torch.multinomial`: Properly handle 0 size dim ([#43775](https://github.com/pytorch/pytorch/pull/43775)) * `torch.cholesky_solve`: Fix broadcast and error checking ([#43137](https://github.com/pytorch/pytorch/pull/43137)) * `torch.movedim`: Fix uniqueness check ([#44307](https://github.com/pytorch/pytorch/pull/44307)) * `torch.min`, ` torch.max`, `torch.mean`: Properly throw error if dim is repeated 
([#44281](https://github.com/pytorch/pytorch/pull/44281)) * `torch.lerp`: Fix for discontiguous outputs on CUDA ([#44559](https://github.com/pytorch/pytorch/pull/44559)) * `torch.addmv`, `torch.mv`: Fix beta=0 case in slow path ([#44681](https://github.com/pytorch/pytorch/pull/44681)) * `torch.triangular_solve`: Fix error check on CPU ([#44720](https://github.com/pytorch/pytorch/pull/44720)) * `torch.empty_like`, `torch.zeros_like`: Properly raise error if any memory format is provided with sparse input ([#44058](https://github.com/pytorch/pytorch/pull/44058)) * `torch.atan2`: Fix type promotion ([#43466](https://github.com/pytorch/pytorch/pull/43466)) * `torch.repeat`: Fix backward for 0 size repeats ([#45212](https://github.com/pytorch/pytorch/pull/45212)) * `torch.min`, ` torch.max`, `torch.median`: Fix handling of nan in backward ([#45280](https://github.com/pytorch/pytorch/pull/45280)) * `torch.rdiv`: Properly make it consistent with div ([#45407](https://github.com/pytorch/pytorch/pull/45407)) * `torch.std`: Fix hanling of nan in backward ([#45468](https://github.com/pytorch/pytorch/pull/45468)) * `torch.distributions.Binomial`: Fix CUDA sampling at extreme points ([#42702](https://github.com/pytorch/pytorch/pull/42702)) * `torch.dot`, `torch.vdot`: Add complex support ([#45074](https://github.com/pytorch/pytorch/pull/45074)) * `torch.pow`: Fix when scalar base is complex ([#45259](https://github.com/pytorch/pytorch/pull/45259)) * `torch.round`, `torch.abs_`: Disable complex inputs ([#45330](https://github.com/pytorch/pytorch/pull/45330)) * `torch.svd`: Fix memory corruption for complex inputs ([#45486](https://github.com/pytorch/pytorch/pull/45486)) * `torch.view_as_complex`: Fix zero dimensional input ([#44175](https://github.com/pytorch/pytorch/pull/44175)) * `torch.kthvalue`: Fix for non-contiguous input ([#46177](https://github.com/pytorch/pytorch/pull/46177)) * `torch.save`: Fix python binding that could lead to out of bound read ([#46207](https://github.com/pytorch/pytorch/pull/46207)) ### Torch.nn * `nn.ModuleDict`: Fix input dict key ordering ([#40905](https://github.com/pytorch/pytorch/pull/40905)) * `nn.LayerNorm`: Fix handling of `gamma` in the backward when `create_graph=True` ([#41595](https://github.com/pytorch/pytorch/pull/41595)) * `nn.functional.{max,avg}_pool{1,2,3}d`: Raise RuntimeError for zero stride ([#41819](https://github.com/pytorch/pytorch/pull/41819)) * `nn.Module`: Fix missing attribute when loading model from older version ([#42290](https://github.com/pytorch/pytorch/pull/42290)) * `nn.Embedding`: Raise proper error for 0-D weight ([#42550](https://github.com/pytorch/pytorch/pull/42550)) * `nn.SyncBatchNorm`: Fix forward pass for non-default process group ([#43861](https://github.com/pytorch/pytorch/pull/43861)) * `nn.functional.embedding_bag`: Fix for non-contiguous weight ([#44032](https://github.com/pytorch/pytorch/pull/44032)) * `nn.functional.upsample`: Add nondeterministic checks (df6ea62526) * `nn.GroupNorm`: Fix bug when input does not require_grad on CUDA ([#44863](https://github.com/pytorch/pytorch/pull/44863)) * `functional.{l1_loss,smoothl1_loss,mse_loss}`: Properly check that reduction strings are valid ([#43527](https://github.com/pytorch/pytorch/pull/43527)) * `functional.smoothl1_loss`: Properly raise error for negative `beta` values ([#45759](https://github.com/pytorch/pytorch/pull/45759)) * `functional.pad`: Fix extra memory allocation and invalid result for negative or zero pad when using circular padding 
([#39273](https://github.com/pytorch/pytorch/pull/39273)) ### C++ API * `nn::MultiheadAttention`: Ensure all parameters are properly registered ([#42037](https://github.com/pytorch/pytorch/pull/42037)) * `Tensor::grad`: Fix the thread safety issues ([#40887](https://github.com/pytorch/pytorch/pull/40887)) * `Tensor::var`: Ensure that `var(0)` does not call the `var(bool keepdim)` overload but `var(int dim)` ([#40451](https://github.com/pytorch/pytorch/pull/40451)) ### Distributed * Fix RPC and ProcessGroup GIL deadlock ([#45088](https://github.com/pytorch/pytorch/pull/45088)) * Relax size check in flatten_for_scatter_gather ([#40573](https://github.com/pytorch/pytorch/pull/40573)) * BAND, BOR and BXOR for NCCL all_reduce should throw runtime errors ([#42669](https://github.com/pytorch/pytorch/pull/42669)) * Disallow creation of ProcessGroupNCCL without GPUs ([#45642](https://github.com/pytorch/pytorch/pull/45642)) * Fix read/write of bulk data ([#42504](https://github.com/pytorch/pytorch/pull/42504)) * Fix thread safety issue with distributed optimizers and TorchScript ([#46071](https://github.com/pytorch/pytorch/pull/46071)) ### TorchScript * Fix type annotations in select assignments ([#40528](https://github.com/pytorch/pytorch/pull/40528)) * Fix compilation issues with GCC-5.4 ([#41055](https://github.com/pytorch/pytorch/pull/41055), [#41063](https://github.com/pytorch/pytorch/pull/41063), [#43223](https://github.com/pytorch/pytorch/pull/43223)) * Fix JIT not round to even if constant is folded ([#40897](https://github.com/pytorch/pytorch/pull/40897)) * Fix `torch.jit.freeze` import ([#42319](https://github.com/pytorch/pytorch/pull/42319)) * Fix `List[str].index` ([#40348](https://github.com/pytorch/pytorch/pull/40348)) * Fix `torch.jit.is_tracing()` so that it is correctly called rather than returning the method itself ([#42486](https://github.com/pytorch/pytorch/pull/42486)) * Fix Str -> Device implicit conversions ([#43213](https://github.com/pytorch/pytorch/pull/43213)) * Fix `NaN` propagation in fuser's min/max implementation ([#43590](https://github.com/pytorch/pytorch/pull/43590)) * Cast return values of functions returning Any ([#42259](https://github.com/pytorch/pytorch/pull/42259)) * Fix `NaN` propagation in TensorExpression fuser's min/max implementation ([#43609](https://github.com/pytorch/pytorch/pull/43609)) * Fix segfault in attribute lookup on loaded `ScriptModules` ([#43284](https://github.com/pytorch/pytorch/pull/43284)) * Fix casting of `unsigned char`, and `abs(int)` ([#44157](https://github.com/pytorch/pytorch/pull/44157)) * Fix frac in CUDA fuser ([#44152](https://github.com/pytorch/pytorch/pull/44152)) * Fix model_name not logged properly issue. 
([#45488](https://github.com/pytorch/pytorch/pull/45488)) * Fix `len`, `contains`, `getitem` inherited from interface class derived from nn container ([#40789](https://github.com/pytorch/pytorch/pull/40789)) * Fix support for FP16 in CudaCodgen ([#44209](https://github.com/pytorch/pytorch/pull/44209)) * Fix `torch.tensor` for empty multidimensional-typed lists ([#44652](https://github.com/pytorch/pytorch/pull/44652)) * Fix freeze_module pass for sharedtype ([#42457](https://github.com/pytorch/pytorch/pull/42457)) * Correctly clone schema in `insert_observers` ([#40624](https://github.com/pytorch/pytorch/pull/40624)) * Fix value association with dictionaries in the tracer ([#40885](https://github.com/pytorch/pytorch/pull/40885)) * Fix preserve submodule attribute in freezing ([#45143](https://github.com/pytorch/pytorch/pull/45143)) * Fix Half conversion of immediates in NNC Cuda backend ([#45213](https://github.com/pytorch/pytorch/pull/45213)) * Fix a bug in `SplitWithMask` when splitting multiple times ([#45141](https://github.com/pytorch/pytorch/pull/45141)) * Fix inlining interface call in fork subgraph ([#43790](https://github.com/pytorch/pytorch/pull/43790)) * Fix operator order in combineMultilane in TensorExpr fuser([#45157](https://github.com/pytorch/pytorch/pull/45157)) * Correctly mark Tensor types inferred from empty annotation as `inferred=True` (#[45360](https://github.com/pytorch/pytorch/pull/45360)) * Fix some bugs in Round+Mod simplification in NNC ([#42934](https://github.com/pytorch/pytorch/pull/42934)) * Fix `set_grad_enabled` scripted version ([#46060](https://github.com/pytorch/pytorch/pull/46060)) * Fix for `dict.update()` scripted version ([#46105](https://github.com/pytorch/pytorch/pull/46105)) * Fix segfault when scripting nested classes ([#46422](https://github.com/pytorch/pytorch/pull/46422)) * Fix memory leak in Profiling Mode ([#46621](https://github.com/pytorch/pytorch/pull/46621)) ### Quantization * Resolved namespace conflict in qnnpack for init_win symbol (a7e09b8727) * Fix linking of qnnpack params on windows. ([#40920](https://github.com/pytorch/pytorch/pull/40920)) * Adding zero point type check for per channel quantization ([#40811](https://github.com/pytorch/pytorch/pull/40811)) * Remove activation_post_process in qat modules (#42343) ([#43015](https://github.com/pytorch/pytorch/pull/43015)) * `qlinear_dynamic`: Fix ASAN error in QNNPACK's integration. ([#41967](https://github.com/pytorch/pytorch/pull/41967)) * Change quantizer to account for input tensor's memory format. 
([#42178](https://github.com/pytorch/pytorch/pull/42178)) * Fixing the output shape for the linear ([#44513](https://github.com/pytorch/pytorch/pull/44513)) * Ensure observers and fq modules are scriptable ([#44749](https://github.com/pytorch/pytorch/pull/44749)) * histogram observer: ensure buffer shape consistency ([#44956](https://github.com/pytorch/pytorch/pull/44956)) * Attach qconfig to all modules ([#42576](https://github.com/pytorch/pytorch/pull/42576)) * Fix qnnpack quantized activations for NHWC memory format ([#46217](https://github.com/pytorch/pytorch/pull/46217)) ### ONNX * Fix crash when exporting a model with `nn.Sequential` ([#19227](https://github.com/pytorch/pytorch/pull/19227)) * Fix default `ignore_index` for nll loss ([#44816](https://github.com/pytorch/pytorch/pull/44816)) * Rename Black to Block for various files ([#42913](https://github.com/pytorch/pytorch/pull/42913)) * Fix bug in `onnx::SsaRewrite` ([#42148](https://github.com/pytorch/pytorch/pull/42148)) ### Misc * Fix `torch.hub` for new zipfile format. ([#42333](https://github.com/pytorch/pytorch/pull/42333)) * Preserve python backtrace in autograd engine errors. ([#43684](https://github.com/pytorch/pytorch/pull/43684)) * `optim.SparseAdam`: Fix check that params are dense on init ([#43668](https://github.com/pytorch/pytorch/pull/43668)) * Fix clang build ([#44934](https://github.com/pytorch/pytorch/pull/44934)) * `nn::MultiheadAttention:` Fix parameter registration ([#42037](https://github.com/pytorch/pytorch/pull/42037)) * MaxPool2D: Fix memory leak for XNNPACK ([#41874](https://github.com/pytorch/pytorch/pull/41874)) * Numpy scalar detection for bool and complex types fixed ([#43644](https://github.com/pytorch/pytorch/pull/43644)) * Add missing file to `BUILD.bazel` ([#40536](https://github.com/pytorch/pytorch/pull/40536)) * `autograd.gradcheck`: Add support for complex ([#43208](https://github.com/pytorch/pytorch/pull/43208)) * Fix bug in mobile-specific CPU caching allocator ([#43719](https://github.com/pytorch/pytorch/pull/43719)) # Performance ### Python API * `torch.{view_as_complex,view_as_real}`: Remove unnecessary temporary Tensor ([#44908](https://github.com/pytorch/pytorch/pull/44908)) * `tensorboard.SummaryWriter.add_audio`: Remove unnecessary for loops ([#44201](https://github.com/pytorch/pytorch/pull/44201)) * `Conv2d` and `Conv3d`: bypass the im2col for 1x1 conv ([#40324](https://github.com/pytorch/pytorch/pull/40324)) * Fix `max_pool2d` perf regression ([#41174](https://github.com/pytorch/pytorch/pull/41174)) * Disable the mkldnn for `conv2d` in some special cases ([#40610](https://github.com/pytorch/pytorch/pull/40610)) * `addmm`: Reduce constant time overhead ([#41374](https://github.com/pytorch/pytorch/pull/41374)) * `cumsum, cumprod`: Enable non-synchronizing cub scan for cum* operations ([#42036](https://github.com/pytorch/pytorch/pull/42036)) * `max_pool2d`: CUDA NCHW performance improvement ([#42182](https://github.com/pytorch/pytorch/pull/42182)) * `arenge`: Vectorize CPU implementation ([#38697](https://github.com/pytorch/pytorch/pull/38697)) * `istft`: optimize by using col2im ([#42826](https://github.com/pytorch/pytorch/pull/42826)) * `LayerNorm`: improved performance on CPU both forward and backward ([#35750](https://github.com/pytorch/pytorch/pull/35750)) * `silu`: improved performance ([#42976](https://github.com/pytorch/pytorch/pull/42976)) * `addmv`: improved performance for zero sized input cases ([#41824](https://github.com/pytorch/pytorch/pull/41824)) * Mobile: Simple 
* Mobile: Simple caching allocator for CPU ([#42006](https://github.com/pytorch/pytorch/pull/42006))
* `MaxPool1d`: improved performance for cases without indices ([#43745](https://github.com/pytorch/pytorch/pull/43745))
* `adaptive_avg_pool2d`: optimized code path for cases when output size is (1, 1) ([#44211](https://github.com/pytorch/pytorch/pull/44211))
* Vectorized complex copy ([#44722](https://github.com/pytorch/pytorch/pull/44722))
* `cat`: optimized CUDA kernel ([#44833](https://github.com/pytorch/pytorch/pull/44833))
* Vectorized int8_t on CPU ([#44759](https://github.com/pytorch/pytorch/pull/44759))
* Vectorized `bitwise_not` ([#45103](https://github.com/pytorch/pytorch/pull/45103))
* Added stateful XNNPACK deconvolution2d operator to torch ([#43233](https://github.com/pytorch/pytorch/pull/43233))
* Enabled mkldnn dilation convolution ([#40483](https://github.com/pytorch/pytorch/pull/40483))
### Distributed
* Skip allreducing `local_used_maps_dev_` when `find_unused_parameters=False` in DDP to improve performance ([#40407](https://github.com/pytorch/pytorch/pull/40407))
* Remove unnecessary copies in ProcessGroupGloo for multi-input allreduce ([#43543](https://github.com/pytorch/pytorch/pull/43543))
* Add option to run NCCL operations on a high-priority CUDA stream ([#43796](https://github.com/pytorch/pytorch/pull/43796))
* Enhance DistributedOptimizer to be functional and TorchScriptable to avoid the GIL and global lock ([#45221](https://github.com/pytorch/pytorch/pull/45221))
### TorchScript
* JIT pass for add-relu fusion ([#39343](https://github.com/pytorch/pytorch/pull/39343))
* Optimize autodiff subgraph slicing ([#41437](https://github.com/pytorch/pytorch/pull/41437))
* Don't re-run CSE on every block ([#41479](https://github.com/pytorch/pytorch/pull/41479))
* Add loop unroll optimization in NNC ([#42465](https://github.com/pytorch/pytorch/pull/42465))
* Speed up CUDA kernel launch when block/thread extents are statically known ([#42899](https://github.com/pytorch/pytorch/pull/42899))
* Support merging adjacent fusion groups in the TensorExpr fuser ([#43671](https://github.com/pytorch/pytorch/pull/43671))
* Add passes to the profiling executor pipeline ([#43636](https://github.com/pytorch/pytorch/pull/43636))
* Improve performance of `KernelSumMultipleAxes` ([#43905](https://github.com/pytorch/pytorch/pull/43905))
* Latency improvements for pointwise + reduction fusion ([#45218](https://github.com/pytorch/pytorch/pull/45218))
* Add simplification of Loop + Condition patterns in NNC ([#44764](https://github.com/pytorch/pytorch/pull/44764))
* Fix fallback graph in specialize autogradzero ([#44654](https://github.com/pytorch/pytorch/pull/44654))
* Fix masking for all block and thread dimensions in CudaCodeGen ([#44733](https://github.com/pytorch/pytorch/pull/44733))
* Improve performance of simple reduction and softmax in nvFuser ([#40864](https://github.com/pytorch/pytorch/pull/40864))
* Add a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar that is cheaper to write; see the sketch after this list ([#42606](https://github.com/pytorch/pytorch/pull/42606))
* Fuse identical conditions in the NNC simplifier ([#44886](https://github.com/pytorch/pytorch/pull/44886))
* Add `_out` variants and reuse memory in static runtime ([#44128](https://github.com/pytorch/pytorch/pull/44128))
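To make the Registerizer item above concrete, here is a schematic before/after of the kind of rewrite it performs, written as plain Python loops rather than actual NNC IR (illustrative only):

```python
# Before: every loop iteration loads from and stores to the same buffer element.
def sum_into_buffer(a, out):
    out[0] = 0
    for i in range(len(a)):
        out[0] = out[0] + a[i]   # repeated Load/Store of out[0]

# After registerization: the hot buffer element is kept in a local scalar,
# which is cheaper to access, and written back to the buffer once at the end.
def sum_into_buffer_registerized(a, out):
    acc = 0
    for i in range(len(a)):
        acc = acc + a[i]
    out[0] = acc

buf = [0]
sum_into_buffer_registerized([1, 2, 3], buf)
print(buf)  # [6]
```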
### Mobile
* Add add_relu fusion pass to optimize_for_mobile ([#40252](https://github.com/pytorch/pytorch/pull/40252))
* optimize_for_mobile: bring packed params to root module ([#42740](https://github.com/pytorch/pytorch/pull/42740))
* Apply selective build on RNN operators ([#44132](https://github.com/pytorch/pytorch/pull/44132))
* Add NEON backend for vectorization ([#39341](https://github.com/pytorch/pytorch/pull/39341))
### Quantization
* Use the `_min_max` function instead of two separate calls for min and max ([#41570](https://github.com/pytorch/pytorch/pull/41570), [#42957](https://github.com/pytorch/pytorch/pull/42957), [#44537](https://github.com/pytorch/pytorch/pull/44537))
* Improve performance of the QNNPACK kernels ([#41342](https://github.com/pytorch/pytorch/pull/41342), [#42007](https://github.com/pytorch/pytorch/pull/42007), [#42008](https://github.com/pytorch/pytorch/pull/42008))
* Speed up HistogramObserver by vectorizing the critical path ([#41041](https://github.com/pytorch/pytorch/pull/41041))
* Speed up AdaptivePool3d by checking if the input is ChannelsLast or ChannelsLast3d ([#42780](https://github.com/pytorch/pytorch/pull/42780))
* observers: use clamp instead of min/max in calculate_qparams ([#43150](https://github.com/pytorch/pytorch/pull/43150))
* observers: use torch.all to check for valid min and max values ([#43151](https://github.com/pytorch/pytorch/pull/43151))
* Avoid resizing in MinMaxObserver ([#43789](https://github.com/pytorch/pytorch/pull/43789))
* observers: make eps a buffer ([#43149](https://github.com/pytorch/pytorch/pull/43149))
### Misc
* ROCm: Fix performance issues with `torch.cat` ([#46323](https://github.com/pytorch/pytorch/pull/46323))
# Documentation
### Python API
* Numerous typo and grammatical improvements ([#39854](https://github.com/pytorch/pytorch/pull/39854), [#40217](https://github.com/pytorch/pytorch/pull/40217), [#40285](https://github.com/pytorch/pytorch/pull/40285), [#40544](https://github.com/pytorch/pytorch/pull/40544), [#40692](https://github.com/pytorch/pytorch/pull/40692), [#40617](https://github.com/pytorch/pytorch/pull/40617), [#41025](https://github.com/pytorch/pytorch/pull/41025), [#41031](https://github.com/pytorch/pytorch/pull/41031), [#40984](https://github.com/pytorch/pytorch/pull/40984), [#41066](https://github.com/pytorch/pytorch/pull/41066), [#41203](https://github.com/pytorch/pytorch/pull/41203), [#41263](https://github.com/pytorch/pytorch/pull/41263), [#41384](https://github.com/pytorch/pytorch/pull/41384), [#41526](https://github.com/pytorch/pytorch/pull/41526), [#41563](https://github.com/pytorch/pytorch/pull/41563), [#41632](https://github.com/pytorch/pytorch/pull/41632), [#41643](https://github.com/pytorch/pytorch/pull/41643), [#41599](https://github.com/pytorch/pytorch/pull/41599), [#41799](https://github.com/pytorch/pytorch/pull/41799), [#41679](https://github.com/pytorch/pytorch/pull/41679), [#41835](https://github.com/pytorch/pytorch/pull/41835), [#41851](https://github.com/pytorch/pytorch/pull/41851), [#41963](https://github.com/pytorch/pytorch/pull/41963), [#42016](https://github.com/pytorch/pytorch/pull/42016), [#42076](https://github.com/pytorch/pytorch/pull/42076), [#41946](https://github.com/pytorch/pytorch/pull/41946), [#42046](https://github.com/pytorch/pytorch/pull/42046), [#42065](https://github.com/pytorch/pytorch/pull/42065), [#42236](https://github.com/pytorch/pytorch/pull/42236), [#42184](https://github.com/pytorch/pytorch/pull/42184), [#42734](https://github.com/pytorch/pytorch/pull/42734), [#42923](https://github.com/pytorch/pytorch/pull/42923),
[#42891](https://github.com/pytorch/pytorch/pull/42891), [#43063](https://github.com/pytorch/pytorch/pull/43063), [#43131](https://github.com/pytorch/pytorch/pull/43131), [#43395](https://github.com/pytorch/pytorch/pull/43395), [#43588](https://github.com/pytorch/pytorch/pull/43588), [#43583](https://github.com/pytorch/pytorch/pull/43583), [#43697](https://github.com/pytorch/pytorch/pull/43697), [#43779](https://github.com/pytorch/pytorch/pull/43779), [#43569](https://github.com/pytorch/pytorch/pull/43569), [#43893](https://github.com/pytorch/pytorch/pull/43893), [#43695](https://github.com/pytorch/pytorch/pull/43695), [#43973](https://github.com/pytorch/pytorch/pull/43973), [#44667](https://github.com/pytorch/pytorch/pull/44667), [#44753](https://github.com/pytorch/pytorch/pull/44753), [#44740](https://github.com/pytorch/pytorch/pull/44740), [#45045](https://github.com/pytorch/pytorch/pull/45045), [#45192](https://github.com/pytorch/pytorch/pull/45192), [#43308](https://github.com/pytorch/pytorch/pull/43308), [#40334](https://github.com/pytorch/pytorch/pull/40334))
* Remove use of the term “blacklist” ([#41450](https://github.com/pytorch/pytorch/pull/41450))
* Add overflow notice for cuFFT on half precision ([#40551](https://github.com/pytorch/pytorch/pull/40551))
* Add complex note ([#41012](https://github.com/pytorch/pytorch/pull/41012), [#41252](https://github.com/pytorch/pytorch/pull/41252), [#40450](https://github.com/pytorch/pytorch/pull/40450))
* Add documentation about data sharing for Tensors during serialization ([#40412](https://github.com/pytorch/pytorch/pull/40412))
* Add `nn.Module.training` to docs ([#40923](https://github.com/pytorch/pytorch/pull/40923))
* `nn.CrossEntropyLoss`: Clarify that the mean argument is weighted ([#40991](https://github.com/pytorch/pytorch/pull/40991))
* `torch.scatter_`: Update doc with support for reduction methods ([#40962](https://github.com/pytorch/pytorch/pull/40962))
* Fix HTTP links in documentation to HTTPS ([#40878](https://github.com/pytorch/pytorch/pull/40878))
* Fix warnings when building docs ([#41068](https://github.com/pytorch/pytorch/pull/41068), [#41334](https://github.com/pytorch/pytorch/pull/41334), [#41335](https://github.com/pytorch/pytorch/pull/41335), [#44686](https://github.com/pytorch/pytorch/pull/44686))
* Add PyTorch Glossary ([#40639](https://github.com/pytorch/pytorch/pull/40639))
* Fix documentation references following page split ([#39086](https://github.com/pytorch/pytorch/pull/39086))
* Update serialization note to explain versioned symbols and dynamic versioning ([#41395](https://github.com/pytorch/pytorch/pull/41395))
* Make elementwise comparison docs more consistent ([#41626](https://github.com/pytorch/pytorch/pull/41626))
* Update CONTRIBUTING.md to explain how to use ccache ([#41619](https://github.com/pytorch/pytorch/pull/41619))
* Add doc warning for LSTM non-deterministic behavior ([#40893](https://github.com/pytorch/pytorch/pull/40893))
* Document that the default dim for `torch.cross` is None ([#41850](https://github.com/pytorch/pytorch/pull/41850))
* Clarify that Python 3.6 is the minimum supported version in the installation section ([#41937](https://github.com/pytorch/pytorch/pull/41937))
* Split quantization subsection into smaller pages ([#41321](https://github.com/pytorch/pytorch/pull/41321))
* Add documentation for `torch.optim.swa_utils` ([#41228](https://github.com/pytorch/pytorch/pull/41228))
* Improve the documentation of DistributedDataParallel ([#42471](https://github.com/pytorch/pytorch/pull/42471))
* Update docs about CUDA stream priority ([#41364](https://github.com/pytorch/pytorch/pull/41364))
* Update the documentation for `torch.scatter` to include the streams parameter ([#42814](https://github.com/pytorch/pytorch/pull/42814))
* Update `Tensor.clone` doc ([#42931](https://github.com/pytorch/pytorch/pull/42931), [#43098](https://github.com/pytorch/pytorch/pull/43098))
* Update external links in the README.md ([#43100](https://github.com/pytorch/pytorch/pull/43100))
* Update `torch.Tensor.is_set_to` documentation ([#43052](https://github.com/pytorch/pytorch/pull/43052))
* Polish the nightly pull docs in CONTRIBUTING ([#43494](https://github.com/pytorch/pytorch/pull/43494))
* Update the `torch.qr` documentation to include a warning about when `QR.backward` is well-defined ([#43547](https://github.com/pytorch/pytorch/pull/43547))
* Update the instructions to build from source on Windows ([#43479](https://github.com/pytorch/pytorch/pull/43479), [#45553](https://github.com/pytorch/pytorch/pull/45553))
* Document the beta=0 behavior of BLAS functions ([#43823](https://github.com/pytorch/pytorch/pull/43823))
* Fix docs for kwargs-only functions ([#43586](https://github.com/pytorch/pytorch/pull/43586), [#43589](https://github.com/pytorch/pytorch/pull/43589))
* Document `torch.sub` properly and add the `torch.subtract` alias ([#43850](https://github.com/pytorch/pytorch/pull/43850))
* Update determinism documentation ([#41692](https://github.com/pytorch/pytorch/pull/41692))
* Update build instructions ([#42850](https://github.com/pytorch/pytorch/pull/42850))
* Clarify `nn.BatchNorm` `track_running_stats` docs ([#44445](https://github.com/pytorch/pytorch/pull/44445))
* Fix LaTeX error in `torch.heaviside` docs ([#44481](https://github.com/pytorch/pytorch/pull/44481))
* Update `torch.median` doc to explain the returned value for even-sized input; see the example after this list ([#44562](https://github.com/pytorch/pytorch/pull/44562))
* Fix the `nn.ELU` formula in the docs ([#43764](https://github.com/pytorch/pytorch/pull/43764))
* `torch.min`, `torch.max`: remove incorrect warning from docs ([#44615](https://github.com/pytorch/pytorch/pull/44615))
* Reference the `torch.cuda.amp` tutorial from the core amp docs ([#44725](https://github.com/pytorch/pytorch/pull/44725))
* Mention TF32 in related docs ([#44690](https://github.com/pytorch/pytorch/pull/44690))
* Clarify that 5-D 'bilinear' grid_sample is actually trilinear ([#45090](https://github.com/pytorch/pytorch/pull/45090))
* Update linalg warning + docs ([#45415](https://github.com/pytorch/pytorch/pull/45415))
* Update `torch.floor_divide` documentation to clarify that it actually performs truncation division ([#45411](https://github.com/pytorch/pytorch/pull/45411))
* Update `torch.fft` doc and make the warning clearer ([#45409](https://github.com/pytorch/pytorch/pull/45409))
* Update docs for complex autograd ([#45270](https://github.com/pytorch/pytorch/pull/45270), [#46281](https://github.com/pytorch/pytorch/pull/46281))
* Update `nn.Flatten` docs ([#42084](https://github.com/pytorch/pytorch/pull/42084))
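As a quick illustration of the even-length `torch.median` behavior documented above (the lower of the two middle values is returned rather than their mean); the `quantile` line is included here only for contrast:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(torch.median(x))  # tensor(2.) -- lower of the two middle values, not 2.5
print(x.quantile(0.5))  # tensor(2.5000) -- interpolated median, for comparison
```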
### Distributed
* Add a CONTRIBUTING.md for the distributed package ([#44224](https://github.com/pytorch/pytorch/pull/44224))
* Add docs for the Store API ([#45543](https://github.com/pytorch/pytorch/pull/45543))
* Add `all_gather_object` and `gather_object` documentation ([#43772](https://github.com/pytorch/pytorch/pull/43772)); see the usage sketch at the end of this section
### TorchScript
* Fix `torch.jit.trace_module` documentation ([#40248](https://github.com/pytorch/pytorch/pull/40248))
* Fix the docs for the inputs arg of `torch.jit.trace_module` ([#41586](https://github.com/pytorch/pytorch/pull/41586))
* Add documentation for `PYTORCH_JIT_TYPE_VERBOSITY` ([#42241](https://github.com/pytorch/pytorch/pull/42241))
* Grammatical corrections in the JIT overview ([#43473](https://github.com/pytorch/pytorch/pull/43473))
* Update docs for recently added JIT features, including enum support, `torch.no_grad`, etc. ([#45232](https://github.com/pytorch/pytorch/pull/45232))
* Add function signature for `pixel_shuffle` ([#45661](https://github.com/pytorch/pytorch/pull/45661))
* Fix signature for `torch.poisson` in documentation ([#45656](https://github.com/pytorch/pytorch/pull/45656))
### Mobile
* Add fbjni to AAR native linking ([#40578](https://github.com/pytorch/pytorch/pull/40578))
* Fix scripts ([#44464](https://github.com/pytorch/pytorch/pull/44464))
* [PyTorch Mobile] Move some string ops to register_prim_ops.cpp and make them selective ([#44500](https://github.com/pytorch/pytorch/pull/44500))
### Quantization
* Fix several quantization documentation typos ([#40567](https://github.com/pytorch/pytorch/pull/40567), [#43693](https://github.com/pytorch/pytorch/pull/43693))
* Add an API summary section ([#45848](https://github.com/pytorch/pytorch/pull/45848))
* Add documentation for dynamically quantized RNN cells ([#40896](https://github.com/pytorch/pytorch/pull/40896))
### Misc
* Update ONNX docs for release ([#45086](https://github.com/pytorch/pytorch/pull/45086))
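For the `all_gather_object`/`gather_object` documentation item in the Distributed list above, a minimal usage sketch (assumes the default process group has already been initialized, e.g. via a distributed launcher; the helper name is illustrative):

```python
import torch.distributed as dist

def gather_metrics(local_metrics: dict) -> list:
    # One slot per rank; all_gather_object fills each slot with that rank's object.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_metrics)
    return gathered
```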