Introduction

In this notebook we look at different cost (loss) functions that are used when dealing with Regression and Classification learning problems.

Regression

  • We commonly encounter Regression-based problems in unsupervised Deep Learning when dealing with Auto-Encoder based representation learning. For Regression problems we have the L1 and L2 norm based losses for evaluating the reconstruction of an image patch. These are geometric in nature, whereas KL Divergence is a probabilistic loss function. Similarly, we can have information- and structure-based loss functions. (A short PyTorch sketch of these losses follows this list.)
  • A geometric loss looks at a predicted point and measures how far it is from the desired point in geometric space.
  1. L1 Absolute Loss (Geometric)
    $J(\cdot) = \frac{1}{K} \sum_{k=1}^{K} \left| o_k - t_k\right|$
    $o_k : \text{Output of the Neural-Net}$
    $t_k : \text{desired Target response}$
    $K: \text{Total number of Output Neurons}$
  • The absolute value ensures that the loss stays non-negative.
  2. MSE Loss (Geometric)
    $J(\cdot) = \frac{1}{K} \sum_{k=1}^{K} \left| o_k - t_k\right|^2$
    $o_k : \text{Output of the Neural-Net}$
    $t_k : \text{desired Target response}$
    $K: \text{Total number of Output Neurons}$

  3. Kullback-Leibler Divergence Loss (Information-Based)
    $J(\cdot) = \frac{1}{K} \sum_{k=1}^{K} t_k \log\left(\frac{t_k}{o_k}\right)$
    $o_k : \text{Output of the Neural-Net}$
    $t_k : \text{desired Target response}$
    $K: \text{Total number of Output Neurons}$
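  • As a minimal, standalone sketch (illustrative dummy tensors, not the notebook's training code), the three losses above map onto built-in PyTorch criteria; note that nn.KLDivLoss expects log-probabilities as input and probabilities as target.
import torch
import torch.nn as nn
import torch.nn.functional as F

o = torch.rand(1, 10)          # network output (e.g. a decoded patch)
t = torch.rand(1, 10)          # desired target response

l1  = nn.L1Loss()(o, t)        # (1/K) * sum_k |o_k - t_k|
mse = nn.MSELoss()(o, t)       # (1/K) * sum_k (o_k - t_k)^2

# KL divergence compares distributions: nn.KLDivLoss takes log-probabilities
# as input and probabilities as target.
log_p_o = F.log_softmax(o, dim=1)
p_t = F.softmax(t, dim=1)
kl = nn.KLDivLoss(reduction='batchmean')(log_p_o, p_t)

print(l1.item(), mse.item(), kl.item())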

Classification

  • In Classification we deal with identifying the correct class in a Multi-Class problem.
  • Common Classification cost functions are listed below (a short PyTorch sketch follows this list):
  1. Binary Cross-Entropy (Information-Based)
    $J(\cdot) = -\frac{1}{K} \sum_{k=1}^{K} w_k\left( t_k \log(o_k)+(1-t_k)\log(1 - o_k)\right)$
    $w_k: \text{Weight factor for class k}$
    $t_k \in \{0,1\}: \text{Binary label of the target class}$
    $o_k \in [0,1]: \text{Classification probability score from the Neural Network}$
    $K: \text{Total number of classes}$
  • We call this 'binary' because, out of the K classes, the target entry of the class of interest is 1 and the rest are 0.
  • Binary Cross-Entropy therefore requires the output of each neuron to lie in (0, 1), so we need a non-linear transfer function that maps into this range. Hence we use a Sigmoid here; transfer functions such as Tanh and ReLU can produce values outside (0, 1) and cannot be used directly.
  2. Negative Log-Likelihood (Probabilistic)
    $J(\cdot) = -\frac{1}{K}\sum_{k=1}^{K}w_k y_k$
    $w_k: \text{Weight factor for class k}$
    $y_k \in (-\infty, 0]: \text{Log of the response of class k}$
    $K: \text{Total number of classes}$
  • Here the logarithmic transformation scales down the response: probabilities in (0, 1] map to values in $(-\infty, 0]$, and the leading negative sign keeps the loss non-negative.
  3. Margin Loss (Structural)
    $J(\cdot) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0, M - o_k t_k\right)$
    $o_k \in (-1,1): \text{Output of the Neural-Net}$
    $t_k \in \{-1,1\}: \text{desired Target response}$
    $K: \text{Total number of classes}$
  • Here the output lies in the range (-1, +1), and the class of interest is associated with +1. Naturally, this output range suggests applying a Tanh non-linearity.
  • Next, we have a margin criterion $M$: the loss is zero only when $o_k t_k \geq M$, i.e., when the output agrees with the target by at least the margin.
  • Margin Loss is useful in situations where we expect the classification output to lie between -1 and +1. For instance, Ultrasound images and MR signals make a good use case, as they naturally span a negative-to-positive range.
  4. Soft Margin Loss (Structural)
    $J(\cdot) = \frac{1}{K}\sum_{k=1}^{K}\log\left(1 + e^{-o_k t_k}\right)$ (the form used by PyTorch's nn.SoftMarginLoss)
  • While Margin Loss has a hard hinge and is not differentiable at the margin, Soft Margin Loss replaces the hinge with a smooth logistic term, so it is differentiable everywhere.
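  • As a minimal, standalone sketch (illustrative dummy tensors, not the notebook's training code), the classification losses above correspond to built-in PyTorch criteria.
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)             # raw outputs for 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))     # integer class labels

# Binary cross-entropy on sigmoid outputs against one-hot targets
one_hot = F.one_hot(labels, num_classes=10).float()
bce = nn.BCELoss()(torch.sigmoid(logits), one_hot)

# Negative log-likelihood on log-probabilities
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)

# Multi-class margin (hinge) loss, and the smooth soft-margin variant with targets in {-1, +1}
mm = nn.MultiMarginLoss(margin=1.0)(logits, labels)
sm = nn.SoftMarginLoss()(torch.tanh(logits), 2.0 * one_hot - 1.0)

print(bce.item(), nll.item(), mm.item(), sm.item())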

Load Packages

%matplotlib inline
import torch
import torchvision
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision import datasets, transforms
from skimage.metrics import structural_similarity as ssim # Structural similarity index

Load Data

transform = transforms.Compose([transforms.ToTensor()])
BatchSize = 100

trainset = torchvision.datasets.MNIST(root='./MNIST', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=BatchSize,
                                          shuffle=True, num_workers=4) # Creating dataloader

testset = torchvision.datasets.MNIST(root='./MNIST', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=BatchSize,
                                         shuffle=False, num_workers=4) # Creating dataloader
use_gpu = torch.cuda.is_available()
if use_gpu:
    print('GPU is available!')
    device = "cuda"
else:
    print('GPU is not available!')
    device = "cpu"

Regression Losses

Define the Autoencoder

  • We define a simple Auto-Encoder that connects the (28*28) input neurons to a layer of 500 neurons in the Encoder. Similarly, on the Decoder side, we map the 500 neurons back to (28*28) neurons.
  • On the Encoder side, we apply a Tanh non-linearity, i.e., activations lie in (-1, 1).
  • On the Decoder side, we apply a Sigmoid non-linearity, i.e., outputs lie in (0, 1), matching the pixel range of the input images.
class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 500),
            nn.Tanh())
        self.decoder = nn.Sequential(          
            nn.Linear(500, 28*28),
            nn.Sigmoid())

    def forward(self, x):
        out = self.encoder(x)
        out = self.decoder(out)
        return out

net1 = autoencoder().to(device) # Network to be trained using MSE loss
net2 = autoencoder().to(device) # Network to be trained using L1 loss
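A quick sanity check (illustrative, not part of the original notebook): a dummy batch of flattened 28*28 images should come back from the autoencoder with the same shape.
dummy = torch.rand(4, 28*28).to(device)  # 4 flattened dummy images
print(net1(dummy).shape)                 # torch.Size([4, 784])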

Train Autoencoder

def Train(model,optimizer,criterion,datainput,label):
  """Trainer function to observe Loss."""
  model.train()
  optimizer.zero_grad()
  output = model(datainput)
  loss = criterion(output, label)
  loss.backward()
  optimizer.step()
  return loss.item()
  • We define two criteria for observing the loss:
  1. criterion1 corresponds to Mean Squared Error (MSE) loss
  2. criterion2 corresponds to L1 loss
  • For observing the performance of the model, we use the Structural Similarity Index Measure (SSIM), a metric from image processing that quantifies how similar two images are: an SSIM close to 1 means the images are very similar, while an SSIM close to -1 means one image is roughly the inverse of the other.
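  • As a tiny standalone illustration (random arrays, not the MNIST data), SSIM of an image with itself is exactly 1, while comparing it with unrelated noise gives a value near 0.
imgA = np.random.rand(28, 28).astype(np.float32)
imgB = np.random.rand(28, 28).astype(np.float32)
print(ssim(imgA, imgA, data_range=1.0))  # 1.0 for identical images
print(ssim(imgA, imgB, data_range=1.0))  # close to 0 for unrelated images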
iterations = 10
learning_rate = 1e-4

optimizer1 = optim.Adam(net1.parameters(), lr=learning_rate) # Network to be trained using MSE loss
optimizer2 = optim.Adam(net2.parameters(), lr=learning_rate) # Network to be trained using L1 loss

criterion1 = nn.MSELoss()
criterion2 = nn.L1Loss()

Plotssim1 = []
Plotssim2 = []
plotLoss1 = []
plotLoss2 = []

testImage = testloader.dataset[0][0]
testinputs = testImage.view(-1, 28*28).to(device)

for epoch in range(iterations):  # loop over the dataset multiple times
  running_loss1 = 0.0
  running_loss2 = 0.0

  for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs = inputs.view(-1, 28*28).to(device)              
      
    trainLoss1  = Train(net1,optimizer1,criterion1,inputs,inputs) # MSE Loss
    trainLoss2 = Train(net2,optimizer2,criterion2,inputs,inputs)  # L1 Loss  
        
    running_loss1 += trainLoss1
    running_loss2 += trainLoss2
  plotLoss1.append(running_loss1/(i+1))
  plotLoss2.append(running_loss2/(i+1))       

  net1.eval()  
  net2.eval()
  with torch.no_grad():
    outputs = net1(testinputs.to(device))    
    if use_gpu:
      outputs = outputs.cpu()
      testinputs = testinputs.cpu()
    ssim1 = ssim(outputs.data.view(28,28).numpy(),testinputs.data.view(28,28).numpy())
    
    outputs = net2(testinputs.to(device))
    if use_gpu:
      outputs = outputs.cpu()
      testinputs = testinputs.cpu()
    ssim2 = ssim(outputs.data.view(28,28).numpy(),testinputs.data.view(28,28).numpy())

  Plotssim1.append(float(ssim1))
  Plotssim2.append(float(ssim2))
    
  print('At Epoch '+str(epoch+1))
  print('With MSELoss: Loss = {:.6f}, SSIM Index = {:.5f} '.format(running_loss1/(i+1),float(ssim1)))
  print('With L1Loss: Loss = {:.6f}, SSIM Index = {:.5f} '.format(running_loss2/(i+1),float(ssim2)))

fig = plt.figure()        
plt.plot(range(epoch+1),plotLoss1,'r-',label='Mean Square Error')
plt.plot(range(epoch+1),plotLoss2,'g-',label='L1 Loss')   
plt.legend(loc='best')
plt.xlabel('Epochs')
plt.ylabel('Training Loss')  

fig = plt.figure()         
plt.plot(range(epoch+1),Plotssim1,'r-',label='SSIM Index (MSE)')
plt.plot(range(epoch+1),Plotssim2,'g-',label='SSIM Index (L1)')      
plt.legend(loc='best')
plt.xlabel('Epochs')
plt.ylabel('Testing SSIM') 
At Epoch 1
With MSELoss: Loss = 0.066963, SSIM Index = 0.40598 
With L1Loss: Loss = 0.155505, SSIM Index = 0.41019 
At Epoch 2
With MSELoss: Loss = 0.032984, SSIM Index = 0.54397 
With L1Loss: Loss = 0.116320, SSIM Index = 0.53894 
At Epoch 3
With MSELoss: Loss = 0.021482, SSIM Index = 0.62897 
With L1Loss: Loss = 0.099107, SSIM Index = 0.60956 
At Epoch 4
With MSELoss: Loss = 0.015003, SSIM Index = 0.70868 
With L1Loss: Loss = 0.084211, SSIM Index = 0.65826 
At Epoch 5
With MSELoss: Loss = 0.011134, SSIM Index = 0.77156 
With L1Loss: Loss = 0.072214, SSIM Index = 0.71547 
At Epoch 6
With MSELoss: Loss = 0.008638, SSIM Index = 0.81490 
With L1Loss: Loss = 0.062632, SSIM Index = 0.75584 
At Epoch 7
With MSELoss: Loss = 0.006939, SSIM Index = 0.84709 
With L1Loss: Loss = 0.055689, SSIM Index = 0.78653 
At Epoch 8
With MSELoss: Loss = 0.005744, SSIM Index = 0.86917 
With L1Loss: Loss = 0.050113, SSIM Index = 0.80401 
At Epoch 9
With MSELoss: Loss = 0.004875, SSIM Index = 0.88897 
With L1Loss: Loss = 0.046361, SSIM Index = 0.81297 
At Epoch 10
With MSELoss: Loss = 0.004224, SSIM Index = 0.90216 
With L1Loss: Loss = 0.042943, SSIM Index = 0.81811 

Observations:

  • Both the MSE and L1 losses decrease over the epochs.
  • L1 loss grows linearly with the change in gray levels (for gray-scale images), whereas MSE loss grows quadratically with that change. Because large errors are penalised more steeply, MSE pushes the network harder towards convergence when the error is large. We can extend this idea to L3, L4, or any positive power to obtain even steeper losses, as long as the chosen norm remains differentiable. However, raising the power of the norm also adds extra computational cost (no free lunch). A sketch of such a generic L_p loss follows below.
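A minimal sketch (an illustration, not code from the notebook) of such a generic L_p loss; lp_loss is a hypothetical helper, and for p = 1 and p = 2 it matches nn.L1Loss and nn.MSELoss respectively.
def lp_loss(output, target, p=3):
    """Mean of |o_k - t_k|^p over all elements; steeper penalty for larger p."""
    return (output - target).abs().pow(p).mean()

o = torch.tensor([0.2, 0.9, 0.4])
t = torch.tensor([0.0, 1.0, 0.5])
print(lp_loss(o, t, p=1))  # same value as nn.L1Loss()(o, t)
print(lp_loss(o, t, p=2))  # same value as nn.MSELoss()(o, t)
print(lp_loss(o, t, p=3))  # steeper L3-style penalty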

Classification Loss

Neural Network

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.Layer1 = nn.Sequential(
            nn.Linear(28*28, 400),
            nn.ReLU(),
            nn.Linear(400, 256),
            nn.ReLU())
        self.Layer2 = nn.Sequential(
            nn.Linear(256, 10))

    def forward(self, x):
        x = self.Layer1(x)
        x = self.Layer2(x)
        return x
net1 = NeuralNet().to(device) # Network to be trained using cross-entropy loss
net2 = NeuralNet().to(device) # Network to be trained using NLL loss
net3 = NeuralNet().to(device) # Network to be trained using multi-margin loss

Train Classifier

def Train(model,optimizer,criterion,datainput,label,lossType):
    model.train()
    optimizer.zero_grad()
    output = model(datainput)
    if lossType == 'NLL':
        loss = criterion(F.log_softmax(output,dim=1), label)
    else:
        loss = criterion(output, label)
    loss.backward()
    optimizer.step()
    return loss.item()
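  • Note that CrossEntropyLoss applied to raw logits is equivalent to NLLLoss applied to log-softmax outputs, which is why the 'NLL' branch above applies F.log_softmax first. A quick illustrative check (dummy tensors, not part of the training loop):
logits = torch.randn(5, 10)
y = torch.randint(0, 10, (5,))
print(nn.CrossEntropyLoss()(logits, y).item())
print(nn.NLLLoss()(F.log_softmax(logits, dim=1), y).item())  # same value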
iterations = 10
learning_rate = 1e-4
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.NLLLoss()
criterion3 = nn.MultiMarginLoss()

optimizer1 = optim.Adam(net1.parameters(), lr=learning_rate) # Network to be trained using cross-entropy loss
optimizer2 = optim.Adam(net2.parameters(), lr=learning_rate) # Network to be trained using NLL loss
optimizer3 = optim.Adam(net3.parameters(), lr=learning_rate) # Network to be trained using multi-margin loss

Plotacc1 = []
Plotacc2 = []
Plotacc3 = []

plotLoss1 = []
plotLoss2 = []
plotLoss3 = []

for epoch in range(iterations):  # loop over the dataset multiple times

    correct1 = 0
    correct2 = 0
    correct3 = 0
    runningLoss1 = 0
    runningLoss2 = 0
    runningLoss3 = 0
    total = 0
    for i, data in enumerate(trainloader, 0): # i-> batch count
        # get the inputs
        inputs, labels = data
        inputs, labels = inputs.view(-1, 28*28).to(device), labels.to(device)    
        
        trainLoss1 = Train(net1,optimizer1,criterion1,inputs,labels,lossType='CE')
        trainLoss2 = Train(net2,optimizer2,criterion2,inputs,labels,lossType='NLL')   
        trainLoss3 = Train(net3,optimizer3,criterion3,inputs,labels,lossType='MM')    

        runningLoss1 += trainLoss1
        runningLoss2 += trainLoss2
        runningLoss3 += trainLoss3
   
    runningLoss1 = runningLoss1/(i+1)
    runningLoss2 = runningLoss2/(i+1)
    runningLoss3 = runningLoss3/(i+1)          
   
    plotLoss1.append(runningLoss1)
    plotLoss2.append(runningLoss2)
    plotLoss3.append(runningLoss3)
    
    net1.eval()
    net2.eval()
    net3.eval()
    with torch.no_grad():    
        for data in testloader:
            inputs, labels = data
            inputs, labels = inputs.view(-1, 28*28).to(device), labels.to(device)
            total += labels.size(0)

            outputs = net1(inputs)
            _, predicted = torch.max(outputs.data, 1)
            correct1 += (predicted == labels).sum()

            outputs = net2(inputs)
            _, predicted = torch.max(outputs.data, 1)
            correct2 += (predicted == labels).sum()

            outputs = net3(inputs)
            _, predicted = torch.max(outputs.data, 1)
            correct3 += (predicted == labels).sum()

    Plotacc1.append(100*float(correct1)/float(total))
    Plotacc2.append(100*float(correct2)/float(total))
    Plotacc3.append(100*float(correct3)/float(total))
    
    print('At Epoch '+str(epoch+1))
    print('With CrossEntropyLoss: Loss = {:.6f} , Acc = {:.4f}%'.format(runningLoss1,100*float(correct1)/float(total)))
    print('With NegativeLogLikelihoodLoss: Loss = {:.6f} , Acc = {:.4f}%'.format(runningLoss2,100*float(correct2)/float(total)))
    print('With MultiMarginLoss: Loss = {:.6f} , Acc = {:.4f}%\n'.format(runningLoss3,100*float(correct3)/float(total)))
    
fig = plt.figure()        
plt.plot(range(epoch+1),plotLoss1,'r-',label='Cross Entropy Loss')
plt.plot(range(epoch+1),plotLoss2,'g-',label='Negative Log Likelihood Loss')   
plt.plot(range(epoch+1),plotLoss3,'b-',label='Multi Margin Loss')  
plt.legend(loc='best')
plt.xlabel('Epochs')
plt.ylabel('Training Loss')  
    
fig = plt.figure()        
plt.plot(range(epoch+1),Plotacc1,'r-',label='Cross Entropy Loss')
plt.plot(range(epoch+1),Plotacc2,'g-',label='Negative Log Likelihood Loss')   
plt.plot(range(epoch+1),Plotacc3,'b-',label='Multi Margin Loss')  
plt.legend(loc='best')
plt.xlabel('Epochs')
plt.ylabel('Testing Accuracy')  
print('Finished Training')
At Epoch 1
With CrossEntropyLoss: Loss = 0.687404 , Acc = 91.2400%
With NegativeLogLikelihoodLoss: Loss = 0.693948 , Acc = 91.6200%
With MultiMarginLoss: Loss = 0.144733 , Acc = 91.7700%

At Epoch 2
With CrossEntropyLoss: Loss = 0.280389 , Acc = 93.2300%
With NegativeLogLikelihoodLoss: Loss = 0.276354 , Acc = 93.3000%
With MultiMarginLoss: Loss = 0.036266 , Acc = 93.6100%

At Epoch 3
With CrossEntropyLoss: Loss = 0.225201 , Acc = 94.3100%
With NegativeLogLikelihoodLoss: Loss = 0.221525 , Acc = 94.3600%
With MultiMarginLoss: Loss = 0.026817 , Acc = 94.7200%

At Epoch 4
With CrossEntropyLoss: Loss = 0.186385 , Acc = 95.0400%
With NegativeLogLikelihoodLoss: Loss = 0.183985 , Acc = 95.1000%
With MultiMarginLoss: Loss = 0.021182 , Acc = 95.3800%

At Epoch 5
With CrossEntropyLoss: Loss = 0.158245 , Acc = 95.5700%
With NegativeLogLikelihoodLoss: Loss = 0.157139 , Acc = 95.5000%
With MultiMarginLoss: Loss = 0.017428 , Acc = 95.8900%

At Epoch 6
With CrossEntropyLoss: Loss = 0.136643 , Acc = 96.1100%
With NegativeLogLikelihoodLoss: Loss = 0.136461 , Acc = 96.0900%
With MultiMarginLoss: Loss = 0.014613 , Acc = 96.4300%

At Epoch 7
With CrossEntropyLoss: Loss = 0.119658 , Acc = 96.4700%
With NegativeLogLikelihoodLoss: Loss = 0.119741 , Acc = 96.3200%
With MultiMarginLoss: Loss = 0.012542 , Acc = 96.6100%

At Epoch 8
With CrossEntropyLoss: Loss = 0.106166 , Acc = 96.7100%
With NegativeLogLikelihoodLoss: Loss = 0.106018 , Acc = 96.6500%
With MultiMarginLoss: Loss = 0.010779 , Acc = 96.9600%

At Epoch 9
With CrossEntropyLoss: Loss = 0.094124 , Acc = 96.9000%
With NegativeLogLikelihoodLoss: Loss = 0.093815 , Acc = 96.8700%
With MultiMarginLoss: Loss = 0.009347 , Acc = 97.1100%

At Epoch 10
With CrossEntropyLoss: Loss = 0.083534 , Acc = 97.0600%
With NegativeLogLikelihoodLoss: Loss = 0.082979 , Acc = 97.0400%
With MultiMarginLoss: Loss = 0.008052 , Acc = 97.3300%

Finished Training