
PyTorch Cheatsheet - Practical Guide

Introduction

PyTorch is an open-source deep learning library originally developed by Facebook AI Research (FAIR, now Meta AI). It offers great flexibility for building and training neural networks thanks to its automatic gradient computation system (autograd), dynamic computation graphs, and intuitive Pythonic syntax.

Why PyTorch?

  • Dynamic: Define-by-run computational graph
  • Pythonic: Integrates naturally with the Python ecosystem
  • Performant: Native GPU support and advanced optimizations
  • Flexible: Ideal for research and production
  • Rich Ecosystem: TorchVision, TorchText, TorchAudio, etc.

Installation

PyTorch can be installed with or without GPU support. CUDA support significantly accelerates model training on NVIDIA GPUs; the index URL below targets CUDA 11.8 (cu118), so pick the wheel that matches your CUDA setup from pytorch.org.

# CPU only
pip install torch torchvision

# GPU (CUDA)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
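
A quick sanity check after installation (a minimal sketch, run in any Python session):

import torch

print(torch.__version__)          # Installed PyTorch version
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU
print(torch.cuda.device_count())  # Number of visible GPUs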

Tensors

Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with GPU support and automatic gradient computation. A tensor can be a scalar (0D), a vector (1D), a matrix (2D), or a multidimensional array (nD).
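
A quick illustration of these ideas, including round-tripping with NumPy (a minimal sketch; the values are arbitrary):

import numpy as np
import torch

scalar = torch.tensor(3.14)        # 0D (scalar)
vector = torch.tensor([1.0, 2.0])  # 1D (vector)
matrix = torch.ones(2, 3)          # 2D (matrix)

# NumPy interoperability (a CPU tensor created this way shares memory with the array)
arr = np.array([[1, 2], [3, 4]])
t = torch.from_numpy(arr)   # NumPy -> tensor
back = t.numpy()            # tensor -> NumPy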

Creating Tensors

PyTorch offers several methods to create tensors according to your needs:

import torch

# From a list
x = torch.tensor([[1, 2], [3, 4]])

# Empty tensor
x = torch.empty(3, 4)

# Filled with zeros
x = torch.zeros(2, 3)

# Filled with ones
x = torch.ones(2, 3)

# Random values (uniform distribution [0, 1))
x = torch.rand(2, 3)

# Random values (normal distribution)
x = torch.randn(2, 3)

# Sequence of values
x = torch.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
x = torch.linspace(0, 10, 5)  # [0, 2.5, 5, 7.5, 10]

# Identity matrix
x = torch.eye(3)

# Tensor with a specific type
x = torch.tensor([1, 2, 3], dtype=torch.float32)

Tensor Properties

Each tensor has important attributes that define its structure and data type:

x = torch.randn(2, 3, 4)

print(x.shape)        # torch.Size([2, 3, 4])
print(x.size())       # torch.Size([2, 3, 4])
print(x.dtype)        # torch.float32
print(x.device)       # cpu or cuda
print(x.ndim)         # 3 (number of dimensions)
print(x.numel())      # 24 (total number of elements)
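
Changing the dtype is a common follow-up to inspecting these attributes; a small sketch of the usual conversion helpers:

x = torch.randn(2, 3)

y = x.float()             # float32
y = x.double()            # float64
y = x.long()              # int64
y = x.to(torch.float16)   # generic dtype conversion via .to()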

Tensor Operations

Arithmetic Operations

PyTorch supports all classic arithmetic operations. These operations can be performed element-wise:

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])

# Addition
z = x + y
z = torch.add(x, y)

# Subtraction
z = x - y
z = torch.sub(x, y)

# Multiplication (element-wise)
z = x * y
z = torch.mul(x, y)

# Division
z = x / y
z = torch.div(x, y)

# Power
z = x ** 2
z = torch.pow(x, 2)

Matrix Operations

Matrix operations are essential for neural networks. PyTorch offers several optimized methods for matrix multiplication:

x = torch.randn(2, 3)
y = torch.randn(3, 4)

# Matrix multiplication
z = torch.mm(x, y)           # (2, 4)
z = x @ y                    # (2, 4)

# Batch multiplication
x = torch.randn(10, 3, 4)
y = torch.randn(10, 4, 5)
z = torch.bmm(x, y)          # (10, 3, 5)

# Dot product
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.dot(x, y)          # 32

# Transpose
x = torch.randn(2, 3)
y = x.T                      # (3, 2)
y = x.transpose(0, 1)        # (3, 2)

Reshape and View

Reshaping tensors is crucial for adapting data to different network layers. view() requires a contiguous tensor in memory, while reshape() can copy data if necessary:

x = torch.randn(2, 3, 4)

# Reshape
y = x.view(6, 4)           # (6, 4)
y = x.view(-1, 4)          # (-1 computed automatically)
y = x.reshape(2, 12)       # (2, 12)

# Flatten
y = x.flatten()            # (24,)
y = x.view(-1)             # (24,)

# Squeeze/Unsqueeze
x = torch.randn(1, 3, 1, 4)
y = x.squeeze()            # (3, 4) - removes dimensions of size 1
y = x.squeeze(0)           # (3, 1, 4) - removes dimension 0
y = x.unsqueeze(0)         # (1, 1, 3, 1, 4) - adds a dimension
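
The contiguity requirement shows up in practice after operations such as transpose, which return non-contiguous views; a small sketch of the failure mode and the two fixes:

x = torch.randn(2, 3)
y = x.T                      # non-contiguous view, shape (3, 2)

# y.view(6) would raise a RuntimeError because y is not contiguous
z = y.contiguous().view(6)   # copy into contiguous memory, then view
z = y.reshape(6)             # or let reshape handle the copy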

Indexing and Slicing

Indexing works like NumPy, allowing access and modification of specific parts of a tensor:

x = torch.randn(4, 5)

# Basic indexing
y = x[0]              # First row
y = x[:, 0]           # First column
y = x[1:3, 2:4]       # Submatrix

# Boolean indexing
mask = x > 0
y = x[mask]           # All positive elements

# Fancy indexing
indices = torch.tensor([0, 2])
y = x[indices]        # Rows 0 and 2

Concatenation and Split

These operations allow combining or separating tensors along a specific dimension:

x = torch.randn(2, 3)
y = torch.randn(2, 3)

# Concatenation
z = torch.cat([x, y], dim=0)     # (4, 3)
z = torch.cat([x, y], dim=1)     # (2, 6)

# Stack
z = torch.stack([x, y], dim=0)   # (2, 2, 3)

# Split
z = torch.randn(6, 3)
chunks = torch.split(z, 2, dim=0)  # 3 tensors of size (2, 3)
chunks = torch.chunk(z, 3, dim=0)  # 3 tensors of size (2, 3)

Autograd

Autograd is PyTorch’s automatic differentiation engine. It records all operations performed on tensors with requires_grad=True and builds a dynamic computational graph to automatically compute gradients via backpropagation.

Automatic Gradient

Computing gradients is essential for neural network optimization:

# Enable gradient tracking
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Computation
z = x**2 + y**3
z.backward()  # Compute gradients

print(x.grad)  # dz/dx = 2x = 4
print(y.grad)  # dz/dy = 3y² = 27

# Reset gradients
x.grad.zero_()
y.grad.zero_()

no_grad Context

x = torch.randn(3, requires_grad=True)

# Temporarily disable gradient computation
with torch.no_grad():
    y = x * 2
    # y does not track gradients

# Alternative
y = (x * 2).detach()  # Detach y from the computational graph
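
On recent PyTorch versions (1.9+, an assumption about your install), torch.inference_mode() is a stricter, slightly faster alternative to no_grad for inference-only code:

with torch.inference_mode():
    y = x * 2  # no gradient tracking; y cannot re-enter autograd later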

Neural Networks

PyTorch uses nn.Module as the base class for all neural networks. Each model inherits from this class and must implement the forward() method which defines the forward pass.

Simple Model

Here’s an example of a fully-connected network (MLP) for classification:

import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Usage
model = SimpleNet(784, 128, 10)
x = torch.randn(32, 784)  # Batch of 32 images
output = model(x)         # (32, 10)

Common Layers

PyTorch provides a wide range of pre-implemented layers for building different types of networks:

# Linear layer (fully connected)
fc = nn.Linear(in_features=100, out_features=50)

# 2D Convolution
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

# Pooling
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Batch Normalization
bn = nn.BatchNorm2d(num_features=64)

# Dropout
dropout = nn.Dropout(p=0.5)

# Activation
relu = nn.ReLU()
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()

CNN Example

Convolutional Neural Networks (CNNs) are particularly effective for processing images. They use convolutions to extract hierarchical spatial features:

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, num_classes)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # Assumes 3x32x32 inputs (e.g. CIFAR-10)
        x = self.pool(F.relu(self.conv1(x)))  # (32, 16, 16)
        x = self.pool(F.relu(self.conv2(x)))  # (64, 8, 8)
        x = x.view(-1, 64 * 8 * 8)            # Flatten
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = CNN(num_classes=10)

ResNet Block

Residual connections (skip connections) enable training very deep networks by solving the vanishing gradient problem. The idea is to add the input to the output of a block of layers:

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, 
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out

Sequential

nn.Sequential allows you to quickly create models by chaining layers sequentially, without having to define a custom class:

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10)
)

ModuleList and ModuleDict

These containers allow you to dynamically manage lists or dictionaries of modules while correctly registering their parameters:

# ModuleList - list of modules
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10) for _ in range(5)
        ])
    
    def forward(self, x):
        for layer in self.layers:
            x = F.relu(layer(x))
        return x

# ModuleDict - dictionary of modules
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({
            'conv': nn.Conv2d(1, 20, 5),
            'pool': nn.MaxPool2d(2),
            'fc': nn.Linear(320, 10)
        })
    
    def forward(self, x):
        x = self.layers['pool'](F.relu(self.layers['conv'](x)))
        x = x.view(x.size(0), -1)
        x = self.layers['fc'](x)
        return x

RNN and LSTM

Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) are designed to process sequential data such as text, audio, or time series. They maintain a hidden state that captures information from previous time steps.

Simple RNN

The basic RNN is the simplest form of recurrent network, but suffers from the vanishing gradient problem for long sequences:

# Basic RNN
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

# Input: (batch_size, seq_len, input_size)
x = torch.randn(32, 100, 10)
h0 = torch.zeros(2, 32, 20)  # (num_layers, batch_size, hidden_size)

output, hn = rnn(x, h0)
# output: (32, 100, 20) - output at each time step
# hn: (2, 32, 20) - final hidden state

LSTM

Long Short-Term Memory (LSTM) solves the gradient problem by using gates to control information flow. It maintains both a hidden state and a cell state:

# LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, 
               batch_first=True, dropout=0.2)

x = torch.randn(32, 100, 10)
h0 = torch.zeros(2, 32, 20)
c0 = torch.zeros(2, 32, 20)

output, (hn, cn) = lstm(x, (h0, c0))
# output: (32, 100, 20)
# hn: (2, 32, 20) - final hidden state
# cn: (2, 32, 20) - final cell state

GRU

Gated Recurrent Unit (GRU) is a simplified variant of LSTM with fewer parameters. It is often faster to train while offering similar performance:

# GRU
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, 
             batch_first=True, bidirectional=True)

x = torch.randn(32, 100, 10)
h0 = torch.zeros(2*2, 32, 20)  # *2 for bidirectional

output, hn = gru(x, h0)
# output: (32, 100, 40) - 40 because bidirectional (20*2)

Complete LSTM Model

Here’s a complete example of an LSTM classifier for sentiment analysis or text classification. It uses embeddings to represent words and a bidirectional LSTM to capture context in both directions:

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 dropout=0.5, bidirectional=True):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                           batch_first=True, dropout=dropout,
                           bidirectional=bidirectional)
        
        # If bidirectional, hidden_dim * 2
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, text):
        # text: (batch_size, seq_len)
        embedded = self.dropout(self.embedding(text))
        # embedded: (batch_size, seq_len, embedding_dim)
        
        output, (hidden, cell) = self.lstm(embedded)
        # output: (batch_size, seq_len, hidden_dim*2)
        
        # Concatenate forward and backward hidden states of the last layer
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        else:
            hidden = hidden[-1,:,:]
        
        # hidden: (batch_size, hidden_dim*2)
        hidden = self.dropout(hidden)
        output = self.fc(hidden)
        return output

# Usage
model = LSTMClassifier(vocab_size=10000, embedding_dim=100, 
                       hidden_dim=256, output_dim=2, n_layers=2)

Attention Mechanism

The attention mechanism allows the model to focus on the most relevant parts of the input sequence. It computes attention weights for each position and creates a weighted representation:

class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim, 1)
    
    def forward(self, lstm_output):
        # lstm_output: (batch_size, seq_len, hidden_dim)
        
        # Compute attention scores
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1), dim=1
        )
        # attention_weights: (batch_size, seq_len)
        
        # Apply attention
        attention_weights = attention_weights.unsqueeze(1)
        # attention_weights: (batch_size, 1, seq_len)
        
        weighted = torch.bmm(attention_weights, lstm_output)
        # weighted: (batch_size, 1, hidden_dim)
        
        return weighted.squeeze(1), attention_weights.squeeze(1)

class LSTMWithAttention(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.attention = Attention(hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, text):
        embedded = self.embedding(text)
        lstm_output, (hidden, cell) = self.lstm(embedded)
        
        # Apply attention
        attended, attention_weights = self.attention(lstm_output)
        
        output = self.fc(attended)
        return output, attention_weights

Transformers

Transformers have revolutionized natural language processing (NLP) and computer vision. Unlike RNNs, they process the entire sequence in parallel thanks to the self-attention mechanism, which makes training far easier to parallelize and lets every position attend directly to every other position.
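
At the heart of self-attention is scaled dot-product attention, softmax(QKᵀ/√d)·V; a minimal sketch with random tensors (the shapes are arbitrary):

import math
import torch

Q = torch.randn(32, 100, 64)  # (batch, seq_len, d_k)
K = torch.randn(32, 100, 64)
V = torch.randn(32, 100, 64)

scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (32, 100, 100)
weights = torch.softmax(scores, dim=-1)                   # attention weights
attended = weights @ V                                    # (32, 100, 64)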

Transformer Encoder Layer

The Transformer encoder uses multi-head self-attention to capture relationships between all elements of a sequence:

import math

# Transformer Encoder Layer
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,        # Model dimension
    nhead=8,           # Number of attention heads
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True
)

encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Input
src = torch.randn(32, 100, 512)  # (batch, seq_len, d_model)
output = encoder(src)
# output: (32, 100, 512)

Complete Transformer

Here’s a complete implementation of a Transformer with encoder and decoder. Positional Encoding adds information about the position of elements in the sequence:

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, max_seq_length, num_classes):
        super().__init__()
        
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_seq_length)
        
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        
        self.fc = nn.Linear(d_model, num_classes)
    
    def forward(self, src, tgt):
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        
        tgt = self.embedding(tgt) * math.sqrt(self.d_model)
        tgt = self.pos_encoder(tgt)
        
        output = self.transformer(src, tgt)
        output = self.fc(output)
        return output

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

Multi-Head Attention

Multi-head attention allows the model to learn different representations of information in parallel. Each “head” focuses on different aspects of the relationships between tokens:

# Direct use of Multi-Head Attention
multihead_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

query = torch.randn(32, 100, 512)
key = torch.randn(32, 100, 512)
value = torch.randn(32, 100, 512)

attn_output, attn_weights = multihead_attn(query, key, value)
# attn_output: (32, 100, 512)
# attn_weights: (32, 100, 100)

Training

Training a neural network follows an iterative process: forward pass (prediction), loss computation, backward pass (gradient computation), and weight update. PyTorch facilitates this process with its intuitive API.

Complete Training Loop

Here’s the standard pattern for training a model. This loop iterates over epochs and batches, performing backpropagation at each step:

import torch.optim as optim

# Preparation
model = SimpleNet(784, 128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
num_epochs = 10
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, targets)
        
        # Backward pass
        optimizer.zero_grad()  # Reset gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update weights
        
        if batch_idx % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Step [{batch_idx}/{len(train_loader)}], '
                  f'Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for data, targets in test_loader:
        outputs = model(data)
        _, predicted = torch.max(outputs.data, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
    
    accuracy = 100 * correct / total
    print(f'Accuracy: {accuracy:.2f}%')
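
Several later snippets (schedulers, early stopping, SWA) call train_epoch(...) and validate(...) helpers that are never defined in this post; a minimal sketch of what they might look like (names and signatures are illustrative, and criterion is the loss defined above):

def train_epoch(model, loader, optimizer):
    model.train()
    total_loss = 0.0
    for data, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def validate(model, loader):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for data, targets in loader:
            total_loss += criterion(model(data), targets).item()
    return total_loss / len(loader)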

DataLoader

PyTorch’s DataLoader facilitates loading data in batches, shuffling, and parallel loading. To use it, you must first create a Dataset that defines how to access your data.

Custom Dataset

Create your own Dataset by inheriting from torch.utils.data.Dataset and implementing __len__() and __getitem__():

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        
        if self.transform:
            sample = self.transform(sample)
        
        return sample, label

# Usage
dataset = CustomDataset(X_train, y_train)
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

Built-in Datasets

TorchVision provides popular pre-configured datasets (MNIST, CIFAR-10, ImageNet, etc.) that can be downloaded automatically:

from torchvision import datasets, transforms

# Transformations
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# MNIST
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    transform=transforms.ToTensor(),
    download=True
)

# CIFAR-10
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    transform=transform,
    download=True
)

# ImageFolder
train_dataset = datasets.ImageFolder(
    root='./data/train',
    transform=transform
)

Data Augmentation

Data augmentation is an essential technique for improving model generalization by creating artificial variations of training data. This reduces overfitting and improves performance.

Image Transformations

TorchVision offers numerous transformations to augment images. Use them only on training data, not on test data:

from torchvision import transforms

# Basic transformations
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Advanced transformations
transform_advanced = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.33)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Test transformations (without augmentation)
transform_test = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

AutoAugment and RandAugment

These policies remove the need to hand-tune augmentation pipelines: AutoAugment applies strategies found by automated policy search on reference datasets, while RandAugment simply applies a fixed number of randomly chosen operations at a given magnitude:

# AutoAugment
transform = transforms.Compose([
    transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# RandAugment
transform = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

Mixup and CutMix

Mixup linearly combines two images and their labels, creating interpolated training examples. This encourages the model to have linear behavior between classes and improves generalization:

import numpy as np

# Mixup
def mixup_data(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    batch_size = x.size(0)
    index = torch.randperm(batch_size)
    
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Usage in training loop
for data, targets in train_loader:
    data, targets_a, targets_b, lam = mixup_data(data, targets, alpha=1.0)
    optimizer.zero_grad()
    outputs = model(data)
    loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
    loss.backward()
    optimizer.step()
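
CutMix, mentioned in this section's title, follows the same recipe but pastes a rectangular patch from one image into another instead of blending pixels; a minimal sketch (the helper names are ours, and the loss is combined with mixup_criterion exactly as above):

def rand_bbox(size, lam):
    # size: (B, C, H, W); sample a box covering roughly (1 - lam) of the image
    H, W = size[2], size[3]
    cut_rat = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(H * cut_rat), int(W * cut_rat)
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    return y1, y2, x1, x2

def cutmix_data(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0))
    y1, y2, x1, x2 = rand_bbox(x.size(), lam)
    x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
    # Recompute lambda from the exact area that was pasted
    lam = 1 - ((y2 - y1) * (x2 - x1) / (x.size(2) * x.size(3)))
    return x, y, y[index], lam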

Loss Functions

The loss function measures the difference between the model’s predictions and the true values. Choosing the right loss function depends on your task (classification, regression, etc.).

# Binary classification
criterion = nn.BCELoss()                      # Binary Cross Entropy
criterion = nn.BCEWithLogitsLoss()            # BCE with integrated sigmoid

# Multi-class classification
criterion = nn.CrossEntropyLoss()             # Softmax + NLL Loss
criterion = nn.NLLLoss()                      # Negative Log Likelihood

# Regression
criterion = nn.MSELoss()                      # Mean Squared Error
criterion = nn.L1Loss()                       # Mean Absolute Error
criterion = nn.SmoothL1Loss()                 # Huber Loss

# Others
criterion = nn.KLDivLoss()                    # Kullback-Leibler Divergence
criterion = nn.CosineEmbeddingLoss()          # Cosine Similarity Loss

Optimizers

Optimizers update the model’s parameters using the computed gradients. Each optimizer uses a different strategy to adjust weights and accelerate convergence.

Quick comparison:

  • SGD: Simple but robust, often with momentum
  • Adam: Adaptive, very popular, good default choice
  • AdamW: Adam with improved weight decay
  • RMSprop: Good for RNNs
import torch.optim as optim

# SGD
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (Adam with weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)

# Learning Rate Scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Usage
for epoch in range(num_epochs):
    train(...)
    validate(...)
    scheduler.step()  # Update learning rate (ReduceLROnPlateau expects scheduler.step(val_loss) instead)

GPU

Using a GPU can accelerate training by 10x to 100x depending on the model. PyTorch makes it easy to transfer data and models to the GPU with the .to(device) method.

GPU Usage

To use the GPU, you must move both the model and the data to the CUDA device:

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# Move model to GPU
model = model.to(device)

# Move data to GPU
x = x.to(device)
y = y.to(device)

# In training loop
for data, targets in train_loader:
    data = data.to(device)
    targets = targets.to(device)
    
    outputs = model(data)
    loss = criterion(outputs, targets)
    # ...

# Multi-GPU
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

GPU Operations

# Create directly on GPU
x = torch.randn(3, 4, device='cuda')

# GPU information
print(torch.cuda.device_count())              # Number of GPUs
print(torch.cuda.current_device())            # Current GPU ID
print(torch.cuda.get_device_name(0))          # GPU name
print(torch.cuda.memory_allocated())          # Allocated memory
print(torch.cuda.memory_reserved())           # Reserved memory

# Clear GPU cache
torch.cuda.empty_cache()

Transfer Learning

Transfer learning involves reusing a pre-trained model on a large database (like ImageNet) and adapting it to your specific task. This is particularly useful when you have limited data.

Advantages:

  • Faster convergence
  • Better results with limited data
  • Reuse of pre-learned features

Load a Pre-trained Model

Two main strategies: freeze all layers except the last one (feature extraction) or fine-tune the entire network:

import torchvision.models as models

# Load pre-trained ResNet
model = models.resnet50(pretrained=True)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)  # 10 custom classes

# Only the last layer will be trained
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
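
On recent TorchVision releases (0.13 and later, an assumption about your install), pretrained=True is deprecated in favor of explicit weights enums:

from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)  # pre-trained weights
model = resnet50(weights=None)                      # random initialization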

Selective Fine-tuning

# Unfreeze the last layers
for name, param in model.named_parameters():
    if 'layer4' in name or 'fc' in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

# Different learning rates for different layers
optimizer = optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3}
])

Available Pre-trained Models

TorchVision ships many pre-trained architectures for classification, segmentation, and object detection:

# Vision
resnet18 = models.resnet18(pretrained=True)
resnet50 = models.resnet50(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
vgg19 = models.vgg19(pretrained=True)
densenet121 = models.densenet121(pretrained=True)
inception_v3 = models.inception_v3(pretrained=True)
mobilenet_v2 = models.mobilenet_v2(pretrained=True)
efficientnet_b0 = models.efficientnet_b0(pretrained=True)
vit_b_16 = models.vit_b_16(pretrained=True)  # Vision Transformer

# Segmentation
fcn_resnet50 = models.segmentation.fcn_resnet50(pretrained=True)
deeplabv3_resnet50 = models.segmentation.deeplabv3_resnet50(pretrained=True)

# Object detection
faster_rcnn = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
mask_rcnn = models.detection.maskrcnn_resnet50_fpn(pretrained=True)

TorchVision

TorchVision is PyTorch’s official library for computer vision. It provides datasets, pre-trained models, and utilities for manipulating images.

Image Operations

Utility functions for visualizing and manipulating images:

from torchvision import transforms
from torchvision.utils import make_grid, save_image
from torchvision.io import read_image
import torchvision.transforms.functional as TF

# Load an image as a (C, H, W) uint8 tensor
img = read_image('image.jpg')

# Functional operations
img = TF.resize(img, size=(224, 224))
img = TF.rotate(img, angle=30)
img = TF.hflip(img)
img = TF.adjust_brightness(img, brightness_factor=1.5)
img = TF.adjust_contrast(img, contrast_factor=1.5)

# Create an image grid
images = torch.randn(64, 3, 224, 224)
grid = make_grid(images, nrow=8, padding=2)
save_image(grid, 'grid.png')

# Save a batch of images
save_image(images, 'batch.png', nrow=8)

Detection and Segmentation

# Object detection
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.randn(1, 3, 800, 800)
with torch.no_grad():
    predictions = model(image)
    # predictions[0]['boxes'] - box coordinates
    # predictions[0]['labels'] - object labels
    # predictions[0]['scores'] - confidence scores

# Semantic segmentation
model = models.segmentation.deeplabv3_resnet50(pretrained=True)
model.eval()

image = torch.randn(1, 3, 520, 520)
with torch.no_grad():
    output = model(image)['out']
    # output: (1, 21, 520, 520) - 21 PASCAL VOC classes
    predictions = torch.argmax(output, dim=1)

TorchText

TorchText facilitates preprocessing and loading text data for NLP. It handles tokenization, vocabulary creation, and embeddings.

Vocabulary and Tokenization

Tokenization divides text into units (words, subwords). The vocabulary maps these tokens to numerical indices:

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

# Tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(
    yield_tokens(train_data),
    specials=['<unk>', '<pad>', '<bos>', '<eos>'],
    min_freq=2
)
vocab.set_default_index(vocab['<unk>'])

# Convert text -> indices
text = "Hello world"
tokens = tokenizer(text)
indices = [vocab[token] for token in tokens]

# Padding
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor(vocab(tokenizer(text))) for text in texts]
padded = pad_sequence(sequences, batch_first=True, padding_value=vocab['<pad>'])

Pre-trained Embeddings

from torchtext.vocab import GloVe, FastText

# GloVe embeddings
glove = GloVe(name='6B', dim=100)

# Get word embedding
word_embedding = glove['hello']

# Create embedding matrix for your vocabulary
embedding_matrix = torch.zeros(len(vocab), 100)
for i, word in enumerate(vocab.get_itos()):
    if word in glove.stoi:
        embedding_matrix[i] = glove[word]

# Use in a model
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)

TorchAudio

TorchAudio provides tools for loading, transforming, and augmenting audio data. It supports various audio formats and offers common transformations like spectrograms and MFCC.

Audio Loading and Processing

Audio transformations are essential for preparing data for deep learning models:

import torchaudio
import torchaudio.transforms as T

# Load audio file
waveform, sample_rate = torchaudio.load('audio.wav')

# Resampling
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
waveform_resampled = resampler(waveform)

# Spectrogram
spectrogram = T.Spectrogram(
    n_fft=1024,
    win_length=None,
    hop_length=512
)
spec = spectrogram(waveform)

# Mel Spectrogram
mel_spectrogram = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=512,
    n_mels=128
)
mel_spec = mel_spectrogram(waveform)

# MFCC
mfcc_transform = T.MFCC(
    sample_rate=sample_rate,
    n_mfcc=40,
    melkwargs={'n_fft': 400, 'hop_length': 160, 'n_mels': 23}
)
mfccs = mfcc_transform(waveform)

# Audio augmentation
# TimeStretch expects a complex-valued spectrogram (power=None) with matching hop_length and n_freq
complex_spec = T.Spectrogram(n_fft=1024, hop_length=512, power=None)(waveform)
time_stretch = T.TimeStretch(hop_length=512, n_freq=1024 // 2 + 1)
spec_stretched = time_stretch(complex_spec, 1.2)  # stretch rate of 1.2

pitch_shift = T.PitchShift(sample_rate, n_steps=4)
waveform_shifted = pitch_shift(waveform)

Distributed Training

Distributed training allows using multiple GPUs or multiple machines to accelerate training. PyTorch offers two main approaches: DataParallel (simple but limited) and DistributedDataParallel (recommended for performance).

DataParallel (Simple)

The simplest method to use multiple GPUs, but with performance limitations:

# Multi-GPU on a single machine
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

# The rest of the code remains identical

DistributedDataParallel (DDP)

DDP is more efficient than DataParallel because it uses one process per GPU and synchronizes gradients in an optimized way:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)
    
    # Create model and move to GPU
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])
    
    # Distributed sampler
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank
    )
    
    train_loader = DataLoader(
        train_dataset,
        batch_size=32,
        sampler=train_sampler
    )
    
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)
        for data, targets in train_loader:
            data, targets = data.to(rank), targets.to(rank)
            optimizer.zero_grad()
            outputs = model(data)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    
    cleanup()

# Launch training
import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

Quantization

Quantization reduces the precision of weights and activations (from float32 to int8) to decrease model size and accelerate inference, with minimal accuracy loss.

Advantages:

  • Model size reduced by ~75%
  • Inference 2-4x faster
  • Less memory used

Post-Training Quantization

The simplest method: quantize an already trained model without retraining:

import os

# Dynamic quantization (easy, fast)
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear, nn.LSTM},  # Layers to quantize
    dtype=torch.qint8
)

# Static quantization (better accuracy)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)

# Calibration with representative data
with torch.no_grad():
    for data, _ in calibration_loader:
        model_prepared(data)

model_quantized = torch.quantization.convert(model_prepared)

# Save
torch.save(model_quantized.state_dict(), 'model_quantized.pth')

# Size comparison
print(f"Original size: {os.path.getsize('model.pth') / 1e6:.2f} MB")
print(f"Quantized size: {os.path.getsize('model_quantized.pth') / 1e6:.2f} MB")

Quantization-Aware Training

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Train normally
for epoch in range(num_epochs):
    for data, targets in train_loader:
        optimizer.zero_grad()
        outputs = model_prepared(data)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

# Convert to quantized model
model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)

ONNX Export

ONNX (Open Neural Network Exchange) is a standard format for representing deep learning models. Exporting to ONNX allows using your PyTorch model in other frameworks or optimizing inference.

Use cases:

  • Production deployment with ONNX Runtime
  • Interoperability between frameworks (PyTorch → TensorFlow)
  • Hardware-specific inference optimizations

Export to ONNX

ONNX export traces the model with an example input:

import torch.onnx

# Prepare model
model.eval()

# Example input
dummy_input = torch.randn(1, 3, 224, 224)

# Export
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Verify ONNX model
import onnx
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph))

Inference with ONNX Runtime

import onnxruntime as ort
import numpy as np

# Create session
session = ort.InferenceSession('model.onnx')

# Prepare input
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Inference
result = session.run([output_name], {input_name: x})
print(result[0].shape)
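
It is good practice to check that the exported model matches the PyTorch one numerically; a minimal sketch, reusing model, dummy_input, session and the input/output names from above:

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

ort_out = session.run([output_name], {input_name: dummy_input.numpy()})[0]
print(np.allclose(torch_out, ort_out, atol=1e-4))  # True if both outputs agree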

JIT and TorchScript

TorchScript allows creating serializable and optimizable models independent of Python. This is essential for production deployment, especially in non-Python environments (C++, mobile).

Advantages:

  • Independent of Python
  • JIT (Just-In-Time) optimizations
  • Mobile deployment (iOS, Android)
  • Improved inference performance

TorchScript by Tracing

Tracing records operations performed during a forward pass. Simple but doesn’t support dynamic control structures:

# Trace the model
model.eval()
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save
traced_model.save('model_traced.pt')

# Load
loaded_model = torch.jit.load('model_traced.pt')
output = loaded_model(example_input)

TorchScript by Scripting

# Script the model (supports control structures)
scripted_model = torch.jit.script(model)

# Save
scripted_model.save('model_scripted.pt')

# Load and use in C++
# torch::jit::script::Module module = torch::jit::load("model_scripted.pt");

JIT Optimization

import time

# Optimization for inference
with torch.jit.optimized_execution(True):
    traced_model = torch.jit.trace(model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Warm-up
for _ in range(10):
    _ = traced_model(example_input)

# Measure performance
start = time.time()
for _ in range(1000):
    with torch.no_grad():
        _ = traced_model(example_input)
print(f"Time: {(time.time() - start) * 1000:.2f} ms")

Save and Load

Save/Load Model

# Save only weights
torch.save(model.state_dict(), 'model_weights.pth')

# Load weights
model = SimpleNet(784, 128, 10)
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

# Save complete model
torch.save(model, 'model_complete.pth')

# Load complete model
model = torch.load('model_complete.pth')
model.eval()

# Save complete checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

Tips and Best Practices

Performance

# 1. Use DataLoader with num_workers
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)

# 2. Use pin_memory for GPU
train_loader = DataLoader(dataset, batch_size=32, pin_memory=True)

# 3. Use mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for data, targets in train_loader:
    optimizer.zero_grad()
    
    with autocast():
        outputs = model(data)
        loss = criterion(outputs, targets)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# 4. Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Debugging

# Check for NaN/Inf
torch.isnan(x).any()
torch.isinf(x).any()

# Fix seed for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Gradient checking (the function and its inputs must use double precision)
from torch.autograd import gradcheck

input = torch.randn(1, 3, requires_grad=True, dtype=torch.double)
test = gradcheck(model.double(), (input,), eps=1e-6, atol=1e-4)
print(f'Gradient check: {test}')

# Display model architecture
print(model)
from torchsummary import summary
summary(model, input_size=(3, 224, 224))

Hooks for Debugging

# Hook on gradients
def gradient_hook(grad):
    print(f"Gradient shape: {grad.shape}")
    print(f"Gradient mean: {grad.mean()}, std: {grad.std()}")
    return grad

x = torch.randn(3, 3, requires_grad=True)
handle = x.register_hook(gradient_hook)

y = x.sum()
y.backward()

handle.remove()  # Remove hook

# Hook on layers
activations = {}
def get_activation(name):
    def hook(model, input, output):
        activations[name] = output.detach()
    return hook

# Register hooks
model.layer1.register_forward_hook(get_activation('layer1'))
model.layer2.register_forward_hook(get_activation('layer2'))

# After forward pass, activations contains outputs
output = model(x)
print(activations['layer1'].shape)

Profiling

from torch.profiler import profile, record_function, ProfilerActivity

# Simple profiler
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        output = model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Advanced profiler with TensorBoard
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/model'),
    record_shapes=True,
    with_stack=True
) as prof:
    for step, (data, target) in enumerate(train_loader):
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        prof.step()
        if step >= 10:
            break

Memory Profiling

# Monitor memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Memory profiler
with torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True
) as prof:
    output = model(input)

print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=10
))

# Detect memory leaks
torch.cuda.reset_peak_memory_stats()
output = model(input)
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

Advanced Techniques

These techniques help improve training stability, model generalization, and handle hardware constraints.

Gradient Accumulation

Useful when your GPU doesn’t have enough memory for a large batch. Accumulates gradients over multiple small batches before updating weights:

# To simulate larger batch sizes
accumulation_steps = 4

optimizer.zero_grad()
for i, (data, targets) in enumerate(train_loader):
    outputs = model(data)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Gradient Clipping

Prevents gradient explosion by limiting their magnitude. Essential for training RNNs and Transformers:

# Clipping by norm
max_grad_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

# Clipping by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
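
Clipping must happen after loss.backward() (so the gradients exist) and before optimizer.step(); a minimal sketch of the placement, assuming the model, criterion and optimizer from the training section:

for data, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()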

Early Stopping

Stops training when validation performance no longer improves, avoiding overfitting:

class EarlyStopping:
    def __init__(self, patience=7, min_delta=0, verbose=True):
        self.patience = patience
        self.min_delta = min_delta
        self.verbose = verbose
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
    
    def __call__(self, val_loss, model):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.save_checkpoint(model)
        elif val_loss > self.best_loss - self.min_delta:
            self.counter += 1
            if self.verbose:
                print(f'EarlyStopping counter: {self.counter}/{self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss
            self.save_checkpoint(model)
            self.counter = 0
    
    def save_checkpoint(self, model):
        torch.save(model.state_dict(), 'best_model.pth')

# Usage
early_stopping = EarlyStopping(patience=10)
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)
    
    early_stopping(val_loss, model)
    if early_stopping.early_stop:
        print("Early stopping")
        break

Label Smoothing

Reduces model overconfidence by smoothing labels (0 and 1 become for example 0.1 and 0.9). Improves generalization:

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
    
    def forward(self, pred, target):
        n_classes = pred.size(-1)
        log_pred = F.log_softmax(pred, dim=-1)
        
        # Create smoothed distribution
        with torch.no_grad():
            true_dist = torch.zeros_like(log_pred)
            true_dist.fill_(self.smoothing / (n_classes - 1))
            true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        
        return torch.mean(torch.sum(-true_dist * log_pred, dim=-1))

criterion = LabelSmoothingCrossEntropy(smoothing=0.1)
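
On recent PyTorch releases (1.10 and later, an assumption about your version), the standard loss exposes the same idea directly:

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)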

Focal Loss

Designed for class imbalance problems. It gives less weight to easy examples and focuses on hard examples:

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

Custom Learning Rate Warmup

Gradually increases learning rate at the beginning of training to stabilize early iterations, particularly useful for Transformers:

class WarmupScheduler:
    def __init__(self, optimizer, warmup_steps, base_lr):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.base_lr = base_lr
        self.current_step = 0
    
    def step(self):
        self.current_step += 1
        if self.current_step <= self.warmup_steps:
            lr = self.base_lr * self.current_step / self.warmup_steps
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = lr

# Usage
warmup = WarmupScheduler(optimizer, warmup_steps=1000, base_lr=0.001)

for epoch in range(num_epochs):
    for data, targets in train_loader:
        # ... training code ...
        warmup.step()

Stochastic Weight Averaging (SWA)

Averages model weights over multiple epochs at the end of training. Improves generalization with minimal cost:

from torch.optim.swa_utils import AveragedModel, SWALR

# Create averaged model
swa_model = AveragedModel(model)

# Scheduler for SWA
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

# Normal training then SWA
swa_start = 75  # Start SWA at epoch 75

for epoch in range(num_epochs):
    train_epoch(model, train_loader, optimizer)
    
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update batch normalization
torch.optim.swa_utils.update_bn(train_loader, swa_model)

# Use swa_model for inference

Model Ensemble

Combines multiple models to improve performance. The final prediction is typically the average of individual predictions:

class ModelEnsemble(nn.Module):
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)
    
    def forward(self, x):
        predictions = [model(x) for model in self.models]
        # Average predictions
        return torch.mean(torch.stack(predictions), dim=0)

# Usage (any models that share the same output dimension, e.g. TorchVision classifiers)
model1 = models.resnet50(pretrained=True)
model2 = models.efficientnet_b0(pretrained=True)
model3 = models.vgg16(pretrained=True)

ensemble = ModelEnsemble([model1, model2, model3])
output = ensemble(x)

Test-Time Augmentation (TTA)

Applies multiple random transformations to the test image and averages predictions. Improves robustness and accuracy at the cost of longer inference time:

def test_time_augmentation(model, image, transforms, num_augmentations=5):
    model.eval()
    predictions = []
    
    with torch.no_grad():
        for _ in range(num_augmentations):
            augmented = transforms(image)
            pred = model(augmented)
            predictions.append(pred)
        
        # Average predictions
        final_pred = torch.mean(torch.stack(predictions), dim=0)
    
    return final_pred

# Usage
tta_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.1, contrast=0.1)
])

prediction = test_time_augmentation(model, image, tta_transforms)