Data-Manipulation

1.1 Load Excel Data

library("readxl")
data = read_excel('spambase.xlsx')
# or, for the older .xls format:
data = read_excel('spambase.xls')

1.2 Denormalize normalized data

  • Collect $\textit{max}_x$, $\textit{min}_x$ from the original (un-normalized) data
  • Apply the formula on all $\textit{x}$ : $\textit{x} \cdot ( \textit{max}_x - \textit{min}_x ) + \textit{min}_x$ (sketch below)
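A minimal sketch (assuming x_norm holds the normalized values and min_x, max_x were saved from the original data):
denormalize = function(x_norm, min_x, max_x){
  return(x_norm * (max_x - min_x) + min_x)
}
denormalize(0.5, min_x = 10, max_x = 30) # 20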

1.3 Train/Test Split

n = dim(data)[1]
set.seed(12345)
id = sample(1:n, floor(n*0.5))
train = data[id,]
test = data[-id,]

1.4 Train/Test/Validation Split

  • Also available in Lecture: Block 1(e)
  • Contains an example of how to perform cross-validation, given a function loss that estimates the loss.
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(dataframe)), 
  nrow(dataframe)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(dataframe, g)
train = res$train
test = res$test
val = res$validate
# Cross Validation
n_fold_cv = function(X,Y, n_fold){
  #Randomly shuffle the data
  ran_ind = sample(nrow(X))
  X = X[ran_ind,]
  Y = Y[ran_ind]
  #Create n_fold equally sized folds
  folds <- cut(seq(1,nrow(X)),breaks=n_fold,labels=FALSE)
  #Perform n_fold-fold cross validation
  cv_loss = c()
  for(i in 1:n_fold){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds==i,arr.ind=TRUE)
    xTest <- X[testIndexes,]
    yTest <- Y[testIndexes]
    xTrain <- X[-testIndexes,]
    yTrain <- Y[-testIndexes]
    l = loss(x=xTrain, y = yTrain,xTest = xTest, yTest = yTest)
    cv_loss = c(l,cv_loss)
  }
  cv_loss = mean(cv_loss)
  return(cv_loss)
}

1.5 Dichotomization

spam$target = ifelse(spam$Spam==1, 'S','H') # Create a binary variable
spam$target = as.factor(spam$target) # Convert to Factor
contrasts(spam$target) # Check the factor

1.6 Writing a Custom Formula

covariates = names(spam)[1:48] # Collect the first 48 covariates
cov=paste(covariates, collapse='+') # Collapse the covariates
form = as.formula(paste0('target~',cov)) # Create the formula with the target

1.7 Gaussian Kernel

  • h specifies Kernel Width
gaussian_k <- function(x, h) { return (exp(-(x**2)/(2*h*h))) }

1.8 Package Vignette

utils::browseVignettes("gbm") # Package Name

1.9 Metrics

  • Look at data imbalance when quoting misclassification.
  • Misclassification rate : $\textbf{1 - Accuracy OR }\frac{\text{FP+FN}}{total}$ | Accuracy : $\frac{TP+TN}{total}$
1 - mean(data$train == data$prediction)
  • Sensitivity/Recall/True Positive Rate: $\frac{TP}{\text{Actual Yes}}$
  • False Positive Rate (= 1 - Specificity): $\frac{\text{FP}}{\textbf{Actual No}}$
    get.rate = function(predicted.prob, actual, true_label, false_label, threshold){
    prediction = ifelse(predicted.prob>threshold, true_label,false_label)
    TPR = sum(prediction==true_label & actual == true_label)/sum(actual == true_label)
    FPR = sum(prediction==true_label & actual == false_label)/ sum(actual == false_label)
    return(c(TPR, FPR))
    }
    
  • AIC/BIC/$C_p$:
    • Can be used for estimating only ‘In-Sample’ errors for ‘Linear’ parameters.
      ‘In-Sample’ error is not that useful on its own, as it does not represent the generalization error, i.e., how the model will perform on ‘Unseen’ data.
    • AIC Formula: $\frac{-2}{N}\cdot\text{log-likelihood}+ 2\cdot\frac{\text{degrees of freedom}}{N}$ where $N = \text{no. of datapoints}$ (a one-line helper is sketched after this list)
    • BIC Formula: $\textit{Log-Likelihood} - \frac{M}{2} \cdot \ln(N)$ where $M$ is the no. of parameters and $N$ = no. of datapoints.
      • BIC = function(LL,M,N){return(LL - (0.5*M*log(N)))} # multiply by log(N), per the formula
  • Loss Matrix:
    • Make sure the loss matrix and the predicted-probability columns are in the same order. Ex: if the loss matrix follows c("cat","dog"), then the predicted probability columns should also be pred_prob[,c("cat","dog")]
    • Then use apply(pred_prob%*%loss_matrix, MARGIN=1, FUN=which.min) to pick the class with the lowest expected loss (which.max only if the matrix encodes gains instead of losses). The resulting index follows the same c("cat","dog") order.
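A one-line helper for the AIC formula above (a sketch; LL, df and N as in the bullets, mirroring the calculate_AIC function in the Ridge Regression using Optim section):
AIC.manual = function(LL, df, N){ return((-2*LL/N) + (2*df/N)) }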

1.10 Plotting Help

refer: https://bookdown.org/rdpeng/exdata/the-base-plotting-system-1.html#base-plotting-functions

Supervised Learning


Neural-Net Regression

  • STEP-1: Standardize the data. If the test set is unseen, use the ‘mean’ and ‘variance’ of the training data.
  • y~. does not work; the formula needs to be written out explicitly.
  • linear.output=TRUE should be chosen (for regression).
  • threshold sets the stopping criterion on the partial derivatives of the error function (the error itself is given by nn.fit$err.fct) over the epochs (also referred to as ‘Steps’).
  • A custom activation function goes into act.fct:
    RelU = function(x){ ifelse(x>0,x,0)} # RELU Implementation
    nn.fit = neuralnet(form, data = tr, hidden = c(2,2,2), threshold = 0.01,linear.output = TRUE, act.fct = RelU)
    
    library(neuralnet)
    Var <- runif(50, 0, 10)
    trva <- data.frame(Var, Sin=sin(Var))
    tr <- trva[1:25,] # Training
    va <- trva[26:50,] # Validation
    form = as.formula("Sin~Var")
    winit = runif(19, -1,1)
    nn.fit = neuralnet(form, data = tr, rep=10,hidden = c(2,2,2), threshold = 0.01,linear.output = TRUE,
                   startweights = winit)
    plot(nn.fit, show.weights = TRUE)
    ## Prediction
    yTestPred = compute(nn.fit, va)
    plot(prediction(x = nn.fit)$rep1)
    points(trva, col=2)
    ## Test Results
    yTestPred$net.result
    plot(yTestPred$net.result)
    points(va$Sin, col=2)
    

EM : Multivariate Bernoulli

  • In case of Binary Data, use ‘bernoulli’ priors
  • In case of Multiclass data, use ‘dirichlet’ priors!
  • In contrast to the mixture of Gaussians, there are no singularities in which the likelihood function goes to infinity. This can be seen by noting that the likelihood function is bounded, because $0 \le p(\mathbf{x_n|\mu_k}) \le 1$
  • $\text{Log-Likelihood :} \ln p(\textbf{X} | \mu, \pi) = \displaystyle \sum_{n=1}^{N} \ln {\sum_{k=1}^{K} \pi_{k} p(\mathbf{x_n} | \mathbf{\mu_k}) }$ where $p(\mathbf{x_n} | \mathbf{\mu_k}) = \displaystyle \prod_{i=1}^{D} \mu_{ki}^{x_{ni}} \left(1 - \mu_{ki} \right)^{1 - x_{ni}}$ i.e., $\mathbf{x}_n$ is a datapoint with $D$ “binary” variables.
  • Algo Steps:
    1. Use $\text{Eq. 9.54}$ to get the $\textbf{Complete Data Log Likelihood}$
    2. Use $\text{Eq. 9.56}$ to get the responsibility($\gamma(z_{nk})$) of component $k$ given datapoint $\mathbf{x}_n$ for which we get a resp matrix of dim $[N,K]$
    3. Now the M-Step:
      3.1. Calculate $N_k$ using $\text{Eq. 9.57}$: N_k = colSums(resp)
      3.2. Use $\text{Eq. 9.58}$ to get $\bar{x}_k$: x_k = t(resp)%*%x # same dimension as mu
      3.3. Update $\mathbf{\mu}_k = \bar{x}_k$: mu[k,] = x_k[k,]/N_k[k] # Using eq. 9.58 and 9.59
      3.4. Update $\pi_k$ using Eq 9.60: pi = N_k/N # eq. 9.60
library(dplyr) ## For Piping  
set.seed(1234567890)
max_it = 100
min_change = 0.1
N = 1000
D= 10
x = matrix(nrow = N,ncol = D)
true_pi = vector(length = 3)
true_mu = matrix(nrow = 3, ncol=D)
true_pi = c(1/3,1/3,1/3)
true_mu[1,] = c(0.5,0.6,0.4,0.7,0.3,0.8,0.2,0.9,0.1,1)
true_mu[2,] = c(0.5,0.4,0.6,0.3,0.7,0.2,0.8,0.1,0.9,0)
true_mu[3,] = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5)
plot(true_mu[1,], type="o", col="blue", ylim=c(0,1))
points(true_mu[2,], type="o", col="red")
points(true_mu[3,], type="o", col="green")
# Training Data
for(n in 1:N){
   k = sample(1:3, 1, prob = true_pi)
   for(d in 1:D){
     x[n,d] = rbinom(1,1,true_mu[k,d])
   }
}
K = 3 # Guessed components
pi = vector(length = K)
mu = matrix(nrow=K, ncol=D)
llik = vector(length = max_it)
# Random Initialization of the Parameters
pi = runif(K,0.49,0.51)
pi = pi/sum(pi)  # Sum = 1 for 'Mixing Coefficients'
for(k in 1:K){
    mu[k,] = runif(D, 0.49,0.51)
}
for(it in 1:max_it){
  plot(mu[1,], type="o", col="blue", ylim=c(0,1))
  points(mu[2,], type="o", col="red")
  points(mu[3,], type="o", col="green")
  # points(mu[3,], type="o", col="black")
  Sys.sleep(0.5)
  # E-Step
  # Measure Log-Likelihood
  lik1 = function(n,K,pi,mu){
    lik = NA
    # log(pi_k * p(x_n|mu_k)) for each component k
    for(k in 1:K){
      t1 = log(pi[k]) # Log(pi_k)
      t2 = x[n,]*log(mu[k,]) # x_{nd} * log(mu_{kd}) # for d columns
      t3 = (1-x[n,])*log(1 - mu[k,]) # (1 - x_{nd}) * log(1 - mu_{kd}) # for d columns
      lik[k] = t1 + sum(t2+t3)
    }
    # log-sum-exp over components: ln sum_k pi_k p(x_n|mu_k)
    return(log(sum(exp(lik))))
  }
  llik[it] = sapply(1:N, FUN = lik1, K=K,pi=pi,mu=mu) %>% sum ## Observed-data log-likelihood
  cat("iteration: ", it, " Log-Likelihood: ", llik[it], "\n")
  # Calculating responsibility
  resp = matrix(nrow = N, ncol = K) # holds eq. 9.56 variable i.e., responsibility
  for(n in 1:N){
    total_k = NA
    for(k in 1:K){
      t1 = log(pi[k])
      t2 = (x[n,]*log(mu[k,]))
      t3 = (1-x[n,])*log(1 - mu[k,])
      total_k[k] = t1 + sum(t2+t3)
    }
      total_k = exp(total_k)
      total_k = total_k/sum(total_k)
      resp[n,] = total_k
  }
  # M- Step
  N_k = colSums(resp) # 9.57
  cat("N_k: ", N_k, "\n")
  x_k = t(resp)%*%x # same dimension as mu
  # Update mu
  for(k in 1:K){
    mu[k,] = x_k[k,]/N_k[k] # Using eq. 9.58 and 9.59
  }
  # Update pi
  pi = N_k/N # eq. 9.60
  if(it>1){
    if(abs(llik[it-1] - llik[it]) < min_change) break
    else print(round(abs(llik[it-1] - llik[it]), digits = 3))
  }
}

EM: Mixture of Gaussians

  • Initial $\mu$ and $\Sigma$ can be obtained using kmeans
  • We have $\text{Log-Likelihood :} \ln p(\textbf{X}|\mu, \pi,\Sigma) = \displaystyle \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_{k} \, \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \Sigma_k)$ (Bishop Eq. 9.14)
  • E-Step:
    • Evaluate the responsibility using Eq. 9.23. Implemented by the function e_step
  • M-Step:
    • First get $N_k$ for Eq. 9.27 using colSums(e_step(...)) saved in N_k variable
    • Get the updated $\mathbf{\mu}_k^{new}$ using Eq. 9.24 and the function update_mu
    • Get the updated $\mathbf{\pi}_k^{new}$ using Eq. 9.26 and the update_pi function.
    • CAREFUL! For getting $\Sigma_k^{new}$, we have two steps:
      • Use the function cov_k to get $\gamma(\textit{z}_{nk})(\mathbf{x}_n - \mathbf{\mu}_k^{new})\left(\mathbf{x}_n - \mathbf{\mu}_k^{new} \right)^{T}$ for a single datapoint.
      • Now apply cov_k on all datapoints and sum them up: sigmaK = sapply(1:N, FUN = cov_k, df = df, k = k, z = z, mu = mu) %>% rowSums
      • Re-arrange the covariance matrix: sigmaK = matrix(sigmaK, ncol=D, byrow = T)
      • In case we need a diagonal-only matrix: sigmA[k,,] = diag(diag(sigmaK)/N_k[k]) ## Only Diagonals
      • If we want the full (non-diagonal) covariance: sigmA[k,,] = sigmaK/N_k[k] ## We still divide by N_k[k]
library(dplyr) ## For Piping
library(mvtnorm)
dataEM = read.table(paste0(file_path, "dataEM.txt"))
N = dim(dataEM)[1]
D = dim(dataEM)[2] # Can be columns
df  = as.data.frame(dataEM)



x_density = function(n,df,K,pi,mu,sigmA){
  vK = NA
  for(k in 1:K){
    vK[k]= log(pi[k]) + dmvnorm(df[n,], mean = mu[k,], sigma = sigmA[k,,], log = TRUE)
  }
  # log-sum-exp over components: ln sum_k pi_k N(x_n | mu_k, Sigma_k)
  return(log(sum(exp(vK))))  # x_density and CDLL stay outside the EM function as they are helpers
}
CDLL = function(df, K,pi,mu,sigmA){
  loglik = sapply(1:N, FUN = x_density, df,K,pi,mu,sigmA) %>% sum
  return(loglik) # Observed-data log-likelihood, used to monitor convergence
}
# EM-Function
EM = function(df,N,D,K,pi,mu,sigmA){
  # E-Step
  e_step = function(df,N,K, pi, mu, sigmA){
    z = matrix(NA, nrow = N, ncol = K)
    for(n in 1:N){
      for(k in 1:K){
        # z[n,k] is the numerator in eq. 9.23
        z[n,k] =  pi[k]*dmvnorm(df[n,], mean = mu[k,],sigma = sigmA[k,,]) 
      }
      
      z[n,] = z[n,]/sum(z[n,]) # Eq 9.23
    }
    return(z)
  }
  z = e_step(df=df,N, K=K, pi=pi,mu = mu, sigmA = sigmA)
  N_k = colSums(z,na.rm = TRUE)
  ##M - step
  # pi
  update_pi = function(z, N){
    pi_m = colSums(z)/N
    pi_m = pi_m/sum(pi_m)
    return(pi_m)
  }
  pi = update_pi(z,N)
  # mu
  update_mu = function(df, z, Nk){
    mu_m = t(as.matrix(t(df)%*%z)) / Nk # K x D: row k is sum_n gamma_nk x_n / N_k (Eq. 9.24)
    return(mu_m)
  }
  mu = update_mu(df, z, Nk = N_k)
  # Update Sigma
  cov_k = function(n,df, k, z, mu){
    df = as.matrix(df)
    xc = df[n,] - mu[k,] # center by the updated component mean
    val = (xc%*%t(xc))*z[n,k] # gamma(z_nk)(x_n - mu_k)(x_n - mu_k)^T
    return(as.matrix(val))
  }
  for(k in 1:K){
    sigmaK = sapply(1:N, FUN = cov_k, df = df, k = k, z = z, mu = mu) %>% rowSums
    sigmaK = matrix(sigmaK, ncol=D, byrow = T)
    sigmA[k,,] = diag(diag(sigmaK)/N_k[k]) ## Only Diagonals
  }
  llik = CDLL(df,K,pi,mu,sigmA) ## CDLL is designed as a global function
  return(list(pi = pi,mu = mu, sigma = sigmA, LLik = llik))
}
## Function to run EM Algorithm
RUN_EM = function(df,N,D,K,pi,mu,sigmA,max_it,delta){
  llik = vector(length = max_it)
  for(it in 1:max_it){
    if(it==1){
      llik[it] = CDLL(df,K,pi,mu,sigmA)
      result = EM(df,N,D,K,pi,mu,sigmA)
      pi = result$pi
      mu = result$mu
      sigmA = result$sigma
    }
    else{
      result = EM(df,N,D,K,pi,mu,sigmA)
      pi = result$pi
      mu = result$mu
      sigmA = result$sigma 
      llik[it] = result$LLik
      if( (abs(llik[it] - llik[it-1])<delta) )break
    }
  }
  return(list(pi = pi,mu = mu, sigma = sigmA, LLik = llik, iterations = it))
}
## RUN EXPERIMENT
K = 2
D = 2 # Columns 
mu = matrix(nrow = K, ncol = D)
sigmA = array(NA,dim=c(K,D,D)) # K X [D X D]
for(k in 1:K){
  mu[k,] = runif(D, 0,5)
  sigmA[k,,] = diag(rep(1,D))
}
pi = runif(K,0.4,0.6)
pi = pi/sum(pi)
max_it = 100
delta = 0.1
result_2 = RUN_EM(df,N,D,K,pi,mu,sigmA,max_it,delta)
# Plot
par(mfrow=c(1,3))
plot(result_2$LLik[2:result_2$iterations], col=2, type='l') # Plot only Non-Zero Log-Likelihoods
plot(result_3$LLik[1:result_3$iterations], col=3, type='l') # result_3/result_4: analogous RUN_EM results for K=3 and K=4
plot(result_4$LLik[2:result_4$iterations], col=4, type='l')
# Calculate Bayesian Info Criteria
BIC = function(LL,M,N){return(LL - (0.5*M*log(N)))} # penalty multiplies log(N), per the formula in section 1.9
BIC2 = BIC(LL = result_2$LLik[result_2$iterations], M=2, N=600)   
   

GAM(Generalized Additive Models)

  • Notes for Library mgcv
    • Non-Linear Effects can be entered via s or te:
    • Use method = "REML"
    • always look at gam.check()
    • Parameters:
      • s in gam(y~x0+s(x1)+s(x2,x3)) is the smoothing function:
        • k specifies dimensions to represent smoothing term: ALWAYS equal to the length: Mortality~Year+s(Week, k = length(unique(data1$Week)))
        • bs specifies the penalized smoothing basis: Mortality~s(Week, k = length(unique(data1$Week)),bs='cc')
          • bs="tp" : for Thin Plate Splines
          • bs="cr" : for Cubic Splines
          • bs="cc" : for cyclic cubuc splines. Useful ifa cov shows cyclical trend like ‘Year’,’Month’
          • Generalized Cross Validation Score is given within summary(model)’s GCV value. Use method="GCV.Cp" while fitting.
library(mgcv)
form.1 = as.formula("Mortality~Year+s(Week, k = length(unique(data1$Week)))")
fit1 = gam(form.1, family = gaussian(), data=data1, method = 'REML')
form.2 = as.formula("Mortality~Year+s(Week, k = length(unique(data1$Week)))+
                  s(Year, Influenza)")
fit2 = gam(form.2, family = gaussian(), data=data1, method = 'REML')
# Gives significance of smooth terms and variance explained by the model
summary(fit1) 
# Plotting the Residuals
par(mfrow = c(2,2))
gam.check(fit1)
# Plot the Spline Component
plot(fit1)  
plot(fit1, residuals=T) # With Residuals
# Compare Models
anova(fit1, fit2, test = 'Chisq') # add further fitted models (e.g. a fit3) as needed
AIC(fit1, fit2) # Choose the model with the lowest AIC

Kernel Classification(using Kernel Density Estimation)

# January-2021 exam
# data generation
N_class1 <- 1500
N_class2 <- 1000
data_class1 <- NULL
for(i in 1:N_class1){
  a <- rbinom(n = 1, size = 1, prob = 0.3)
  b <- rnorm(n = 1, mean = 15, sd = 3) * a + (1-a)*rnorm(n = 1, mean = 4, sd = 2)
  data_class1 <- c(data_class1,b)
}
data_class2 <- NULL
for(i in 1:N_class2){
  a <- rbinom(n = 1, size = 1, prob = 0.4)
  b <- rnorm(n = 1, mean = 10, sd = 5) * a + (1-a) * rnorm(n = 1, mean = 15, sd = 2)
  data_class2 <- c(data_class2,b)
}
# y=1,0
df1 = data.frame(cbind(data_class1, 1))
df2 = data.frame(cbind(data_class2,0))
# Prior Probabilities
p_1 = 1500/2500
p_2 = 1 - p_1
# Define the Kernel
kernel =  function(x,h){return(exp(- (x/h)^2))}
xs = seq(-5,25,0.1) # Data to be tested on
kern.prob = matrix(NA, nrow = length(xs), ncol = 2) # For Storing Posterior Class probability
for(i in 1: length(xs)){
  x = xs[i]
  diff_1 = x - df1[,1] # For class-1 only use class-1 data
  diff_2 = x- df2[,1] # For class-2 only use class-2 data
  p1 = mean(sapply(diff_1, FUN = kernel, h = 5)) # P(X=x|Y = 'class 1')
  p2 = mean(sapply(diff_2, FUN = kernel, h = 5)) # P(X=x|Y = 'class 2')
  kern.prob[i,1] = (p1*p_1)/((p1*p_1) + (p2*p_2)) # Applying Bayes' theorem
  kern.prob[i,2] = (p2*p_2)/((p1*p_1) + (p2*p_2)) # also equal to 1 - kern.prob[i,1]
}
# Check the Plots
plot(x=xs, y = kern.prob[,1], type='l', ylim=c(0,1))
lines(x=xs, y=kern.prob[,2],type='l', col=3)

LDA(Manual implementation can be extended to QDA and Naive Bayes)

  • Inputs are assumed to be CONTINUOUS
  • All classes share the same covariance matrix (in QDA, they don’t)
  • Because of the ‘common covariance matrix’ assumption, the decision boundaries, along which the posterior probability $p(C_k|\textbf{x})$ is constant, are linear functions of $\textbf{x}$, i.e., “decision boundaries are linear in input space”.
  • Since we estimate the parameters using MLE, the approach is susceptible to ‘outliers’.
  • While plotting, make the xlim and ylim of the generated data the same as for the original data.
  • Use only the diagonal elements of the pooled covariance matrix to perform Naive Bayes.
  • Use a different covariance matrix for each class to perform QDA.
library(MASS) # Step-1: Library
crabs = as.data.frame(read.csv(paste0(file_path,'australian-crabs.csv')))
crabs$sex = as.factor(crabs$sex) # Step-0:Convert the target to factor
# Plot for Classes
with(crabs, plot(CL, RW, main = "Main", type = "n"))
with(subset(crabs, sex=='Male'), points(CL, RW, col='blue')) # '==' (a single '=' would be ignored by subset)
with(subset(crabs, sex!='Male'), points(CL, RW, col='red'))
legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
form = as.formula('sex~CL+RW') # Step-2 formulation
lda.model = lda(form, crabs, prior = c(0.1,0.9)) # Step-3, prior order as 'factor' order
crabs[,'predictions'] = predict(lda.model,crabs)$class # Step-4
# Plot- Multiple Predictions, Multiple CLasses
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(crabs, {
  plot(CL, RW, main = "Prior", type = "n")
  with(subset(crabs, sex=='Male'), points(CL, RW, col='blue'))
  with(subset(crabs, sex!='Male'), points(CL, RW, col='red'))
  legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
  plot(CL, RW, main = "Posterior", type = "n")
  with(subset(crabs, predictions=='Male'), points(CL, RW, col='blue'))
  with(subset(crabs, predictions!='Male'), points(CL, RW, col='red'))
  legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
})
# Draw discriminant Line
np = lda.model$N
nd.x <- seq(from = min(crabs$RW), to = max(crabs$RW), length.out = np)
nd.y <- seq(from = min(crabs$CL), to = max(crabs$CL), length.out = np)
nd <- expand.grid(RW = nd.x, CL = nd.y)
prd <- as.numeric(predict(lda.model, newdata = nd)$class)
plot(crabs[, c('RW','CL')], col = crabs$sex) # Make Sure X and Y are ordered correctly
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np), 
        levels = c(1,2), add = TRUE, drawlabels = FALSE)

# Manual LDA
# Compute Mean, covMat, priorProb
setosa = as.data.frame(iris2[iris2$Species == 'setosa',]) # iris2: the iris data (e.g. iris2 = iris)
versicolor = as.data.frame(iris2[iris2$Species == 'versicolor',])
virginica = as.data.frame(iris2[iris2$Species == 'virginica',])
## Means, COvs and Prior Probs
cov_names = c('Sepal.Width', 'Sepal.Length')
mu_setosa = as.matrix(colMeans(setosa[,cov_names]))
mu_ver = as.matrix(colMeans(versicolor[,cov_names]))
mu_vir = as.matrix(colMeans(virginica[,cov_names]))
cov_setosa = cov(setosa[,cov_names])
cov_versi = cov(versicolor[,cov_names])
cov_virgi = cov(virginica[,cov_names])
prop_setosa = 50/150
prob_ver = 50/150
prop_virg = 1-(prop_setosa+prob_ver)
## Pooled Covariance: Bishop eq 4.78(In case of QDA dont use pooled)
S = as.matrix((prop_setosa*cov_setosa)+(prop_virg*cov_virgi)+(prob_ver*cov_versi))
s_inv = as.matrix(solve(S))
# Calculate Discriminant FUnction
pred = matrix(NA, nrow = 150, ncol = 3)
p_j = c(prop_setosa,prob_ver,prop_virg)
mus = cbind(mu_setosa,mu_ver, mu_vir)
for(i in 1:150){
  for(j in 1:3){
    pred[i,j] =as.matrix(iris2[i,cov_names])%*%s_inv%*%mus[,j] - 0.5*t(mus[,j])%*%s_inv%*%mus[,j] + log(p_j[j])
  }
}
classes = c('setosa','versicolor','virginica')
actual_predictions = matrix(NA,nrow=150,ncol=1)
for(i in 1:150){
  actual_predictions[i] =classes[which.max(pred[i,])]
}
# Sampling From New Sample
library(mvtnorm)
## Generate synthetic data for each class i.e., x ~ N(mu_k,Sigma)
# Ex for setosa: rmvnorm(50, mean = mu_setosa, sigma = S)
syn_data = rbind(cbind(rmvnorm(50, mean = mu_setosa, sigma = S), 'setosa'),
cbind(rmvnorm(50, mean = mu_ver, sigma = S), 'versicolor'),
cbind(rmvnorm(50, mean = mu_vir, sigma = S), 'virginica'))
colnames(syn_data) = c('Sepal.Width','Sepal.Length','Species')
syn_data = as.data.frame(syn_data)
# cbind() with a string coerced everything to character; convert the numeric columns back
syn_data$Sepal.Width = as.numeric(syn_data$Sepal.Width)
syn_data$Sepal.Length = as.numeric(syn_data$Sepal.Length)
## Plot and comparing Synthetic Data with actual data
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(iris, plot(Sepal.Width, Sepal.Length, main = "Main", type = "n"))
with(subset(iris, Species=='setosa'), points(Sepal.Width, Sepal.Length, col='blue'))
with(subset(iris, (Species!='setosa' & Species !='versicolor')), 
     points(Sepal.Width, Sepal.Length, col='red'))
with(subset(iris, (Species!='setosa' & Species !='virginica')), 
     points(Sepal.Width, Sepal.Length, col='green'))
legend("topleft", pch = 1, col = c("blue", "red", 'green'), legend = c("setosa", "virginica",'versicolor'))
with(syn_data, plot(Sepal.Width, Sepal.Length, main = "Synthetic Data", type = "n"))
with(subset(syn_data, Species=='setosa'), points(Sepal.Width, Sepal.Length, col='blue'))
with(subset(syn_data, (Species!='setosa' & Species !='versicolor')), 
     points(Sepal.Width, Sepal.Length, col='red'))
with(subset(syn_data, (Species!='setosa' & Species !='virginica')), 
     points(Sepal.Width, Sepal.Length, col='green'))
legend("topleft", pch = 1, col = c("blue", "red", 'green'), legend = c("setosa", "virginica",'versicolor'))

QDA

  • Refer LDA codes for Templates. For Manual application use LDA template and modify individual sigma, refer ISLR eq. 4.19
### QDA
form = as.formula('sex~CL+RW') # Step-2 formulation
qda.model = qda(form, crabs, prior= c(0.5,0.5)) # Step-3
crabs[,'qdapredictions'] = predict(qda.model,crabs)$class
# Misclassification Error
1 - mean(crabs$sex == crabs$qdapredictions)
# Draw discriminant Line
np = qda.model$N
nd.x <- seq(from = min(crabs$RW), to = max(crabs$RW), length.out = np)
nd.y <- seq(from = min(crabs$CL), to = max(crabs$CL), length.out = np)
nd <- expand.grid(RW = nd.x, CL = nd.y)
prd <- as.numeric(predict(qda.model, newdata = nd)$class)
plot(crabs[, c('RW','CL')], col = crabs$qdapredictions) # Make Sure X and Y are ordered correctly
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np), 
        levels = c(1,2), add = TRUE, drawlabels = FALSE)

Decision Trees: Classification/Regression

  • The lower the residual mean deviance, the better the model. Residual Mean Deviance = $\frac{\text{deviance}}{n - |T_o|}$ where $n = \text{no. of data points}$ and $|T_o| = \text{no. of terminal nodes}$
  • Deviance = $-2 \sum_m \sum_k n_{mk} \cdot \log \hat{p}_{mk}$ where $n_{mk}$ is the no. of obs in the $m$th terminal node belonging to the $k$th class
  • The optimal tree can be selected using:
    1. cv.tree()
    2. Train/Test validation
  • In regression trees, the deviance is the ‘Sum of Squared Errors’ of the tree. (A by-hand residual-mean-deviance check is sketched below.)
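A quick by-hand check of the residual mean deviance (a sketch, assuming fit is a tree object fitted on train, as in the code below):
n_leaves = sum(fit$frame$var == "<leaf>") # |T_o|: no. of terminal nodes
deviance(fit) / (nrow(train) - n_leaves)  # matches 'Residual mean deviance' in summary(fit)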
# Classification
credit = read_excel(paste0(file_path,'creditscoring.xls'))
credit$good_bad = as.factor(credit$good_bad) # Labels need to be factored!
library(tree) # Call the Library 
form = as.formula("good_bad~.") # Create the formula
# NOTE: split = 'gini' for gini-index
train.tree.dev = tree(form, data = train, split = 'deviance') # Fit the model on 'Train'
train.tree.gini = tree(form, data=train, split='gini')
summary(train.tree.dev);summary(train.tree.gini)
## Choosing Optimal Tree
# 1. By Penalizing
# ------------------
fit = tree(form, data = train)
cv.res = cv.tree(fit, FUN = prune.misclass) # By default FUN = 'deviance'
par(mfrow=c(1,2))
plot(cv.res$size, cv.res$dev, type="b",
     col="red", main = 'CV-Error: Misclassification', xlab='Terminal Nodes')
plot(log(cv.res$k), cv.res$dev,
     type="b", col="red", main = 'CV-Error: Misclassification', xlab='Cost-Complexity Parameter')
# 5 Terminal Node looks to provide the lowest Misclassification rate.
# 'cv.res$size' returns the no. of terminal nodes.
# NOTE: 'cv.res$dev' returns the cross-validation error, not the 'deviance'; with
# FUN = prune.misclass this is the misclassification error.
# 2. Train/Val Method
# ------------------
trainScore = testScore = rep(0,9)
for(i in 2:9){
  prunedTree = prune.tree(fit, best = i)
  # Put 'type' = 'class' in case of predictions
  pred = predict(prunedTree, newdata=val, type='tree')# Returns Tree
  trainScore[i] = deviance(prunedTree)
  testScore[i] = deviance(pred)
}
plot(2:9, trainScore[2:9], type="b", col="red", ylim = c(min(trainScore,testScore), 
                                                         max(trainScore,testScore)),main='Train Score', ylab='Deviance', xlab='No. of Nodes')
points(2:9, testScore[2:9], type="b", col="blue")
opTimalTree = prune.tree(fit, best=5)
summary(opTimalTree)
# State the following in the report:Variables selected, # Terminal Nodes, Res. mean deviance, Misclassification error
# Estimate Misclassification Rate for Test data. NOTE: type='class'
testPred = predict(opTimalTree, newdata = test, type='class')
1 - mean(test$good_bad == testPred)

##### Regression Trees
formreg = as.formula("amount~.")
fitReg = tree(formreg, data = train)
trY = predict(fitReg, newdata = val)
sqrt(mean((val$amount - trY)^2)) ## RMSE

Splines: Using base R Functions only

  • Example below uses only basic r functions with lm
  • Step 1: For an order-M spline we create polynomial terms up to degree M-1
      M=5
      X = matrix(data$Day %>% poly(degree = M-1, raw = TRUE), ncol =M-1) ## KEEP `raw=TRUE` when using poly
    
  • Step-2: Create Truncated Power Basis functions $h(x,\xi)$ for each knot!
      # knot 𝜁=75. NOTE: the textbook truncated power basis is (x-𝜁)^(M-1) for x>𝜁,
      # i.e. ifelse(X[,1]>75, (X[,1]-75)^4, 0); the version below gates x^4 instead
      h.75= ifelse(X[,1]>75, X[,4],0)
      # If 𝜁=80
      h.80= ifelse(X[,1]>80, X[,4],0)
    
  • Step-3: Combine Knots and Splines
      # knot 𝜁=75
      X = cbind(X,h.75)
      # If knots: 𝜁=75,𝜁=80
      X = cbind(X, h.75,h.80)
    
# Use basis function approach to fit an order-5 spline with a single knot 𝜁=75 such that Day is the feature and Rate is the target variable.
data = read.csv2(paste0(file_path,"mortality_rate.csv"))
M=5
X = matrix(data$Day %>% poly(degree = M-1, raw = TRUE), ncol =M-1) ## KEEP `raw=TRUE` when using poly
h.75= ifelse(X[,1]>75, X[,4],0)
X = cbind(X,h.75)
colnames(X) = as.character(seq(1:5))
df = data.frame(cbind(X,data$Rate))
colnames(df)
spline.fit = lm("V6~.", data = df)
yPred = predict(spline.fit, newdata = df) # 'newdata', not 'data'
plot(x = data$Day, y = data$Rate)
points(x = data$Day,y = yPred, col=2)

Splines: Using splines library

# Using Library Splines
library(splines)
data = read.csv2(paste0(file_path,"mortality_rate.csv"))
f = as.formula("Rate~bs(Day, df=4, knots = 75)") ## Because order is 5!
spline.fit = lm(f, data = data)
yPred = predict(spline.fit, newdata = data)
plot(x = data$Day, y = data$Rate)
points(x = data$Day,y = yPred, col=2)

Logistic Regression

  • For prediction use predict(glm.fits, type = 'response'). This generates the probability $\textbf{p(Y=1|X)}$ (a sigmoid sketch follows this list)
  • $p(Y=1|\mathbf{w,X}) = \sigma{\left(\mathbf{w^T \cdot X} \right)}$ where $\sigma(a) = \frac{1}{1 + e^{-a}}$, with a linear decision boundary at $\mathbf{w^T\cdot X} = 0$
  • Also $Y \sim Bern(\sigma(\mathbf{w^T \cdot X}))$
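The sigmoid in R (a minimal sketch):
sigmoid = function(a){ return(1/(1 + exp(-a))) }
sigmoid(0) # 0.5, i.e. exactly on the decision boundary w^T x = 0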
spam$target = ifelse(spam$Spam==1, 'S','H') # Create a binary variable
spam$target = as.factor(spam$target) # Convert to Factor
# Check the factor
contrasts(spam$target)
cov=paste(covariates, collapse='+')
form = as.formula(paste0('Spam~',cov))
glm.fits = glm(form, data=train,family="binomial") # Family should be 'binomial'
test_pred = predict(glm.fits, type = 'response', newdata = test)
test_pred = ifelse(test_pred>0.5,'S','H') # Y=1 corresponds to 'S'; labels must match the factor levels
train_pred = predict(glm.fits, type = 'response', newdata = train)
train_pred = ifelse(train_pred>0.5,'S','H')
# Confusion matrix
table(train_pred, train$target)
# Misclassification
1 - mean(train_pred == train$target)
1 - mean(test_pred==test$target)
# TPR,FPR,ROC
pred_tree.prob = predict(opTimalTree, test, type='vector')[,'good'] # Get Class Probability
# Calculate TPR, FPR
get.rate = function(predicted.prob, actual, true_label, false_label, threshold){
  prediction = ifelse(predicted.prob>threshold, true_label,false_label)
  TPR = sum(prediction==true_label & actual == true_label)/sum(actual == true_label)
  FPR = sum(prediction==true_label & actual == false_label)/ sum(actual == false_label)
  return(c(TPR, FPR))
}
thresholds = seq(0.05,0.95,0.05)
cols = c('threshold','TPR', 'FPR')
tree.roc = matrix(data = NA, nrow = length(thresholds), ncol = 3)
colnames(tree.roc) =  cols
for(i in 1:length(thresholds)){
  threshold = thresholds[i]
  # Tree
  res.tree = get.rate(pred_tree.prob, test$good_bad, 'good', 'bad', threshold)
  tree.roc[i, cols] = c(threshold, res.tree)
}
# PLOT ROC
plot(x = tree.roc[,'FPR'], y = tree.roc[,'TPR'], type='l',
     xlab='FPR', ylab='TPR', col='red')

K Nearest Neighbors

- NOTE how 'train' and 'validation' are evaluated. Param `k` sets the no. of neighbors; `distance` sets the Minkowski distance (`distance=2` for Euclidean).
- There's no `predict`; pass the data to be predicted as `test`.
library("kknn")
# k=30
# Target has value 0 or 1
k.30.train = kknn(form, train=train,test=train,k=30,distance=2, kernel='rectangular') # Evaluate on the training data itself
k.30.test = kknn(form, train=train,test = test,k=30,distance=2, kernel='rectangular')
threshold = 0.5
train.pred = ifelse(k.30.train$fit>threshold,'S','H' ) # Script to modify the prediction
test.pred = ifelse(k.30.test$fit>threshold,'S','H' )
print(table(Predicted=train.pred, Actual = train$target)) # Training Confusion
print(table(Predicted=test.pred, Actual = test$target)) # Test Confusion
1 - mean(train.pred==train$target) # Misclassification rate

Ridge Regression

  • In ‘glmnet’, ‘alpha=0’ is for ridge.
  • Make sure that the input data is in matrix format, i.e., glmnet(x = as.matrix(scaled_data[,coefs[2:64]]), y = as.matrix(scaled_data[,'Fat']), alpha=0, lambda = grid)
  • Choosing good grid values: grid <- 10 ^ seq (10, -2, length = 100)
  • Use type.measure="class" to get the misclassification error for a classification task.
  • We use ‘MSE’ because:
    1. Minimizing it is equivalent to MLE under normally distributed errors, which we assume.
    2. MSE is a quadratic function, i.e., it is easily differentiable and always has a minimum.
  • Effective degrees of freedom: $df(\lambda) = \mathrm{tr}\left(X(X^TX + \lambda I_p)^{-1}X^T\right)$ (implemented as ridge.df in the Ridge Regression using Optim section)
# Ridge Regression
grid <- 10 ^ seq (10, -2, length = 100)
ridge.fit = glmnet(x = as.matrix(scaled_data[,coefs[2:64]]), y = scaled_data[,'Fat'],alpha=0, lambda = grid)
# 10-fold CV
cv.5 = cv.glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=0,lambda = grid,nfolds=10,type.measure = "mse")
# Plot CV Results
plot(log(cv.5$lambda), cv.5$cvm,pch=19,col='red',xlab = 'log(lambda)',ylab = cv.5$name)
# Optimal lambda
print(cv.5$lambda.min)

Poisson Regression

  • $p(Y=y|\lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}$
  • For prediction of $\hat{Y}$ we use $\hat{Y} = e^{\mathbf{W^TX}}$ where $\log\lambda = \mathbf{W^TX}$ (log link)
    yTestPred = predict(lasso.fit, newx = as.matrix(test[,c(1:8)]), type = "response")
    yTestPred[1]
    yHat = as.matrix(cbind(1,test[1,c(1:8)]))%*% matrix(coef(lasso.fit), ncol=1)
    prediction = exp(yHat)
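For reference, a plain (unpenalized) Poisson fit with glm follows the same pattern (a minimal sketch; the data frame dat and count target y are assumed names):
pois.fit = glm(y ~ ., data = dat, family = "poisson")
lambda.hat = predict(pois.fit, newdata = dat, type = "response") # = exp(w^T x)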
    

Ridge Regression using Optim

  • Use Block-1, Lab-1 Q2 as the premise. Make sure to SCALE the data!!!
  • Make sure that X, Y and the weights are in as.matrix format.
  • For the log-likelihood, use Bishop eq. (3.11); for the ridge penalty, use the equation from pg. 17 of Lecture 1D, Block 1
  # LogLikelihood = log P(Y|X,w,sigma)
  Loglikelihood = function(x,y,w,sigma){
  wTX = as.matrix(x)%*%w
  loss = sum((wTX - as.matrix(y))^2)
  N = dim(x)[1]
  t1 = (N*0.5*(log(sigma)))
  t2 = (N*0.5*log(2*pi))
  t3 = (sigma*0.5*loss)
  logLik = t1-t2 - t3
  return(logLik)
  }
  ## Ridge Penalty
  ridge = function(w,sigma,lambda, x,y){
    log_lik = Loglikelihood(x,y,w,sigma)
    penalty = lambda*(sum(w^2)) - log_lik
    return(penalty)
  }
  ## Ridge Degree of Freedom
  ridge.df = function(x,lambda){
    x = as.matrix(x)
    N = dim(x)[2] # Covariate count
    Ip = diag(N)
    inv_data = solve(t(x)%*%x + (lambda*Ip))
    df = x%*%inv_data%*%t(x)
    trace = sum(diag(df))
    return(trace)
  }
  ### AIC criteria
  calculate_AIC = function(x,y,w,sigma, lambda){
    log_lik = Loglikelihood(x,y,w,sigma)
    N = dim(x)[1] # No of data points
    df = ridge.df(x,lambda)
    aic = (-2*log_lik/N) + (2*df/N) #(-2*Log-likelihood/N) + 2*(df/N)
    return(aic)
  }
  # Calculate Train,Test Loss
  for(lambda in c(1,100,1000)){
    N = dim(train[,covs])[2]
    initW = as.matrix(rnorm(N))
    sigma=runif(1)
    lambda=lambda
    x = as.matrix(train[,covs])
    y = as.matrix(train[,'motor_UPDRS'])
    xTest = as.matrix(test[,covs])
    yTest = as.matrix(test[,'motor_UPDRS'])
    ridgeOpt = optim(initW, ridge,sigma=sigma,lambda=lambda,x=x,y=y,method=c("BFGS"))
    weights = as.matrix(ridgeOpt$par)
    train_mse = mean((x%*%weights - y)^2);test_mse = mean((xTest%*%weights - yTest)^2)
    print(paste0('Lambda: ', lambda));print(paste0('Train Error: ', train_mse))
    print(calculate_AIC(x,y,w=weights,sigma, lambda)) # For AIC Calculation
  }

LASSO

  • Standardize the data!
  • Almost the same as Ridge, but df (degrees of freedom) is also covered.
  • An example of how to generate synthetic data is below.
  # Lasso Regression
  lasso.fit = glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1,family="gaussian")
  # Plotting Lambda V/S Coefficients
  plot(lasso.fit, xvar = "lambda", label = TRUE)
  # Plot LASSO V/S df
  plot(x = log(lasso.fit$lambda), y = lasso.fit$df)
  ## Cross Validation
  lasso.grid <- 10 ^ seq (10, -2, length = 100)
  # 5 fold cv
  cv.5 = cv.glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1,family="gaussian",
                    lambda = lasso.grid,nfolds=10,type.measure = "mse")
  # Plot CV Results: MSE V/S Log(Lambda)
  plot(log(cv.5$lambda), cv.5$cvm,pch=19,col='red',xlab = 'log(lambda)',ylab = cv.5$name)
  # Optimal lambda
  print(cv.5$lambda.min)
  ### Generating synthetic test data
  X = as.matrix(test[,coefs]);Y = as.matrix(test[,'Fat']) # Actual Test Data
  lasso.fit = glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1, lambda = cv.5$lambda.min) # LASSO with best fit
  intercept = coef(lasso.fit)[1]; betas = as.vector(coef(lasso.fit)[-1,]) # Intercept and beta coefficients
  residual = train$Fat - (intercept + as.matrix(train[,coefs])%*% betas);stdev = sd(residual) # Residual (incl. intercept) and its sd
  yTestHat = predict(lasso.fit, newx = X, type='response')
  n = dim(Y)[1];set.seed(12345)
  yPred = rnorm(n, yTestHat,stdev)
  # PLot Actual vs generated data
  plot(Y, lty=1, col=3)
  lines(yPred, type='b', col=2)
  • LASSO (choosing the best lambda statistically):
    • Checking For Statistically Significantly better: print(lasso.fit)
    • If %Dev explained is similar, choose one with lower ‘df’
    df1 = read.csv2(paste0(file_path, "Dailytemperature.csv"))
    df2 = as.data.frame(df1)
    temp = df2$Temperature
    power_factor = seq(-50,50,1)
    for(i in power_factor){
      df2[,paste0("sin_",as.character(i))] = sin(0.5^i * temp)
      df2[,paste0("cos_",as.character(i))] = cos( 0.5^i* temp)
    }
    ## LASSO
    library(glmnet)
    cols = colnames(df2)[3:204]
    x = as.matrix(df2[,cols])
    y = as.matrix(temp)
    ## Dependence of DF on the penalty factor
    lasso.fit = glmnet(x = x, y = y, family = "gaussian", alpha = 1) # fit over the default lambda path
    plot(x = lasso.fit$lambda, y = lasso.fit$df, main="Degrees of Freedom V/S  Lambda",
        xlab = "Lambda", ylab = "DF")
    ## CV for dependence of MSE on Log-Penalty
    lambdas = seq(-5,5,by = 0.01) 
    lambdas = exp(lambdas)
    ## Ensure -4 is there in log(lambda)
    ## any(log(lambdas) == -4)
    cv.fit = cv.glmnet(x=x,y=y, family="gaussian", type.measure = "mse", nfolds = 10,
                      lambda = lambdas)
    plot(cv.fit) ## Plots MSE V/S log(lambda)
    print(cv.fit$lambda.1se) ## Largest lambda within 1 SE of the minimum CV error
    ## Checking For Statistically Significantly better
    ## If %Dev explained is similar, choose one with lower 'df'
    lasso3 = glmnet(x=x,y=y, family = "gaussian", alpha=1, lambda = exp(-4))
    print(lasso3)
    lasso.min = glmnet(x=x,y=y, family = "gaussian", alpha=1, lambda = cv.fit$lambda.min)
    print(lasso.min)
    ## Extracting Coefficients
    coef.exact <- coef(lasso.min,exact = TRUE, x=x, y=y)
    ## No. of Non Zero Features
    length(coef.exact[which(coef.exact!=0)]) # Includes Intercept!
    # Prediction
    predTemp = predict(lasso.min, newx = x, type = "link")
    predTemp = predict(lasso.min, newx = x, type = "response") ## OR
    

Elastic-Net

  • alpha is between 0 and 1.
      library(glmnet)
      Xtrain = train[,-4703]
      Ytrain = train[,4703]
      Xtest = test[, -4703]
      Ytest = test[,4703]
      grid <- 10 ^ seq (10, -2, length = 100)
      # Classification Cross-Validation
      en.fit = cv.glmnet(x = as.matrix(Xtrain),
                      y = Ytrain, alpha=0.5, lambda = grid, family = 'binomial',
                    type.measure = 'class')
      best_lambda = en.fit$lambda.min
      model = glmnet(x = as.matrix(Xtrain),
                y = Ytrain, alpha=0.5, lambda = best_lambda, family = 'binomial')
      yTest.elastic = predict(model, as.matrix(Xtest), type = "class")
      1 - mean(Ytest==yTest.elastic)
      print(paste0("Regularized Features: ", model$df)) # No. of Features
    

SVM

  • Example also within Nested Cross-Validation section
  library(kernlab)
  form = as.formula("Conference~.")
  svm.fit = ksvm(form, data = train, kernel = "vanilladot")
  yTest.svm = predict(svm.fit, test)
  1 - mean(Ytest == yTest.svm)

Experimental Estimation of Bagging Error

D = 10
mu = as.matrix(runif(D, -10, 10))
cov_diag = runif(D, 1,2)
sd_matrix = as.matrix(diag(cov_diag))
error_data = rmvnorm(100, mean = mu, sigma = sd_matrix)
error_data = as.matrix(error_data)
colnames(error_data)=paste0("Model ", 1:10)
## Average squared error per model
mse_models = vector(length = 10)
for(i in 1:10){
  mse_models[i] = mean(error_data[,i]^2) ## eq 14.9
}
e2 = error_data^2
boot_data = rowMeans(e2)/10 
colMeans(e2)%>%mean ## eq 14.10
mean(boot_data)  ## eq 14.14
hist(boot_data) # Bagging Error Histogram
## part-3
variances = matrix(NA, nrow = 100, ncol=11)
 ## Assuming cov(M1,M2) = 0
for(i in 1:100){
  # Error Data Generation
  B = 10
  mu = as.matrix(runif(B, -10, 10))
  cov_diag = runif(B, 1,2)
  sd_matrix = as.matrix(diag(cov_diag))
  error_data = rmvnorm(100, mean = mu, sigma = sd_matrix)
  ## Bootstrap Variance
  error_data = as.matrix(error_data)
  ## Lecture: 2(a) pg. 10
  error_bag = rowSums(error_data) / B^2
  variances[i,1] = var(error_bag) ## Bagging error variance (mean(x - mean(x)) is always 0)
  # Model Avg Error Variance
  variances[i,2:11] = apply(error_data, FUN=var, MARGIN = 2) ## Variance of Average Individual Error
  
}
colnames(variances) = c("Bootstrap", paste0("Model " ,1:10))
par(mfrow = c(3,4))
for(i in 1:11){
  boxplot(variances[,i], main = colnames(variances)[i])
}

NN from Scratch

  • NOTE: X and Y need to be transposed; also normalize X!
  • The example is for classification
  • In case of regression:
      1. Use A2 = Z2 # Linear output, within the forwardPropagation function.
      2. Modify the cost-computing function computeCost
        computeCost = function(X,Y,cache){
        m = dim(X)[2] # No.of Datapoints
        A2 = cache$A2 # Output
        cost <- mean((A2-Y)^2)
        return (cost)}
        
   data = read.csv2(paste0(file_path,"spambase_exam.csv"))  
   X = as.matrix(data[,c(1:57)], ncol = 57)
   X = scale(X)
   Y = as.matrix(data$Spam)
   ## Now transpose X and Y: Important Step before Implementation
   X = t(X) 
   Y = t(Y)
   # 1. Define Architecture
   # 2. Initialize Model Params
   getLayerSize = function(X,Y,hidden_neurons, train=TRUE){
     n_x = dim(X)[1]
     n_h = hidden_neurons
     n_y = dim(Y)[1]

     size = list(
       "n_x" = n_x,
       "n_h" = n_h,
       "n_y" = n_y
     )
     return(size)
   }
   layers_size = getLayerSize(X,Y,10)
   initializeParameters = function(X, list_layer_size){
     m = dim(data.matrix(X))[2] ## No of Training Points
     n_x <- list_layer_size$n_x
     n_h <- list_layer_size$n_h
     n_y <- list_layer_size$n_y

     W1 = matrix(runif(n_h*n_x), nrow = n_h, ncol=n_x, byrow = TRUE)*0.01
     b1 = matrix(rep(0,n_h), nrow = n_h)

     W2 = matrix(runif(n_y*n_h), nrow = n_y, ncol = n_h, byrow = TRUE)*0.01
     b2 <- matrix(rep(0, n_y), nrow = n_y)

     params <- list("W1" = W1,
                    "b1" = b1, 
                    "W2" = W2,
                    "b2" = b2)

     return (params)
   }
   pars = initializeParameters(X, layers_size)
   ## Activations
   sigmoid <- function(x){
     return(1 / (1 + exp(-x)))
   }
   ## Using Tanh in the hidden layer
   ## Sigmoid in the Output Layer
   forwardPropagation = function(X, params, list_layer_size){
     m <- dim(X)[2]
     n_h <- list_layer_size$n_h
     n_y <- list_layer_size$n_y
     W1 <- params$W1
     b1 <- params$b1
     W2 <- params$W2
     b2 <- params$b2
     b1_new <- matrix(rep(b1, m), nrow = n_h)
     b2_new <- matrix(rep(b2, m), nrow = n_y)
     # Input to Hidden
     Z1 <- W1 %*% X + b1_new
     A1 <- tanh(Z1)
     # A1 <- sigmoid(Z1) # For Sigmoid in input layer
     # Hidden to Output
     Z2 <- W2 %*% A1 + b2_new
     # A2 = Z2 # Linear Output
     A2 <- sigmoid(Z2) # For Sigmoid
     cache <- list("Z1" = Z1,
                   "A1" = A1, 
                   "Z2" = Z2,
                   "A2" = A2)
     return (cache)
   }
   fwd_prop <- forwardPropagation(X, pars, layers_size)
   lapply(fwd_prop, function(x) dim(x))
   ## Classification
   computeCost = function(X,Y,cache){
     m = dim(X)[2] # No.of Datapoints
     A2 = cache$A2 # Output
     # Binary Loss
     logprobs = (log(A2) * Y) + (log(1-A2) * (1-Y))
     cost <- -sum(logprobs/m)
     return (cost)
   }
   ## If Regression
   #computeCost = function(X,Y,cache){
   #  m = dim(X)[2] # No.of Datapoints
   #  A2 = cache$A2 # Output
   #  cost <- mean((A2-Y)^2)
   #  return (cost)
   #}
   cost <- computeCost(X, Y, fwd_prop)
   cost
   backwardPropagation <- function(X, Y, cache, params, list_layer_size){
     m <- dim(X)[2]
     n_x <- list_layer_size$n_x
     n_h <- list_layer_size$n_h
     n_y <- list_layer_size$n_y
     A2 <- cache$A2
     A1 <- cache$A1
     W2 <- params$W2
     # Output Layer
     dZ2 = A2 - Y
     dW2 = 1/m * (dZ2 %*% t(A1)) 
     db2 <- matrix(1/m * sum(dZ2), nrow = n_y)
     db2_new <- matrix(rep(db2, m), nrow = n_y)

     # Input Layer
     dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2) # If tanh
     # dZ1 <- (t(W2) %*% dZ2) * (A1*(1-A1)) # If sigmoid
     dW1 <- 1/m * (dZ1 %*% t(X))
     db1 <- matrix(1/m * sum(dZ1), nrow = n_h)
     db1_new <- matrix(rep(db1, m), nrow = n_h)
     grads <- list("dW1" = dW1, 
                   "db1" = db1,
                   "dW2" = dW2,
                   "db2" = db2)

     return(grads)
   }

   back_prop <- backwardPropagation(X, Y, fwd_prop, pars, layers_size)
   lapply(back_prop, function(x) dim(x))

   updateParameters = function(grads,params, learning_rate){
     W1 <- params$W1
     b1 <- params$b1
     W2 <- params$W2
     b2 <- params$b2

     dW1 <- grads$dW1
     db1 <- grads$db1
     dW2 <- grads$dW2
     db2 <- grads$db2

     W1 <- W1 - learning_rate * dW1
     b1 <- b1 - learning_rate * db1
     W2 <- W2 - learning_rate * dW2
     b2 <- b2 - learning_rate * db2

     updated_params <- list("W1" = W1,
                            "b1" = b1,
                            "W2" = W2,
                            "b2" = b2)

     return (updated_params)
   }
   update_params <- updateParameters(back_prop, pars, learning_rate = 0.01)
   lapply(update_params, function(x) dim(x))

   ## Training
   trainModel <- function(X, y, num_iteration, hidden_neurons, lr){
    # X = t(X) # Transpose if not done already.
    # y = t(y)
     layer_size <- getLayerSize(X, y, hidden_neurons)
     init_params <- initializeParameters(X, layer_size)
     cost_history <- c()
     for (i in 1:num_iteration) {
       fwd_prop <- forwardPropagation(X, init_params, layer_size)
       cost <- computeCost(X, y, fwd_prop)
       back_prop <- backwardPropagation(X, y, fwd_prop, init_params, layer_size)
       update_params <- updateParameters(back_prop, init_params, learning_rate = lr)
       init_params <- update_params
       cost_history <- c(cost_history, cost)

       if (i %% 50 == 0) cat("Iteration", i, " | Cost: ", cost, "\n")
     }

     model_out <- list("updated_params" = update_params,
                       "cost_hist" = cost_history)
     return (model_out)
   }
   ## Learning
   EPOCHS = 500
   HIDDEN_NEURONS = 10
   LEARNING_RATE = 0.1
   train_model <- trainModel(X, Y, hidden_neurons = HIDDEN_NEURONS, num_iteration = EPOCHS, lr = LEARNING_RATE)
   # Making Prediction
   makePrediction <- function(X, y, hidden_neurons){
      # X = t(X) # Transpose if doing in 'trainModel'.
     # y = t(y)
     layer_size <- getLayerSize(X, y, hidden_neurons)
     params <- train_model$updated_params
     fwd_prop <- forwardPropagation(X, params, layer_size)
     pred <- fwd_prop$A2
     return (pred)
   }
   outputProb = makePrediction(X,Y,HIDDEN_NEURONS) 

Unsupervised Learning


Nearest Shrunken Centroid

  • Need to scale the x data
  • Class probability estimate: $\hat{p}_k \left(x^{*} \right) = \frac{\mathrm{e}^{\delta_k (x^{*})}}{\sum_{l=1}^{K} \mathrm{e}^{\delta_l (x^{*})}}$
      # To obtain the class probability estimate
      pamr.predict(model, newx, threshold,type =  "posterior",prior = model$prior, threshold.scale = model$threshold.scale)
    
  library(pamr)
  data = (read.csv2(paste0(file_path, "data.csv")) )
  data$Conference = as.factor(data$Conference) # Categorical as factor
  # Give the rows names, e.g. rownames(data) = 1:nrow(data)
  # Train/Test split (id from a sample() call, as in section 1.3)
  train = data[id,]
  test = data[-id,]
  # Transpose the X matrix!!!
  x = t(train[,c(1:4702)]) 
  y = train[['Conference']]
  # Modify Data
  mydata = list(x=x,y=as.factor(y),geneid= as.character(1:nrow(x)), genenames= rownames(x))
  model = pamr.train(mydata, threshold = seq(0,9,0.1))
  # Cross Validation
  #### SET SEEED!!!!!
  set.seed(12345)
  cv.results = pamr.cv(model, data = mydata, nfold = 5)
  # Plot cross-validation results
  # Pick the threshold for which the error is minimal and the number of parameters small
  pamr.plotcv(cv.results)
  # Compute Confusion matrix for a threshold
  pamr.confusion(model, threshold = 1.9)
  # Plot cross validated class probabilities by class
  pamr.plotcvprob(model,data=mydata, threshold = 1.9)
  # Estimate False discovery rates and plot them
  fdr.obj = pamr.fdr(model, mydata)
  plot(x= fdr.obj$results[,1], y = fdr.obj$results[,4], xlab="Threshold",
      ylab="False Discovery Rate")
  plot(x= fdr.obj$results[,2], y = fdr.obj$results[,4], xlab="Significant Genes",
      ylab="False Discovery Rate")
  # List Significant Genes: ALso find their names
  res = pamr.listgenes(model, mydata,threshold = 1.9)
  names = as.integer(res[1:10,1])
  colnames(train)[names] # Get Column Names
  # Plot Centroids
  pamr.plotcen(model,mydata, threshold = 1.9)
  # Prediction
  x = t(test[,c(1:4702)]) 
  y = test[['Conference']]
  mydata2 = list(x=x,y=as.factor(y),geneid= as.character(1:nrow(x)), genenames= rownames(x))
  yTestHat = pamr.predict(model, mydata2$x, threshold=1.9)
  # Class Posterior Probability
  yTest.Prob = pamr.predict(model, mydata2$x, threshold=1.9, type="posterior")

PCA

  • For PC1: $v_{11}\cdot$ column-1 + $v_{21}\cdot$ column-2 + … where $v_{\cdot 1}$ is the eigenvector belonging to $\lambda_1$
    • r-equivalent: PC1 = x2%*%as.matrix(eig.x$vectors[,1]) where x2 is the N×D data matrix and eig.x$vectors[,1] is the D×1 eigenvector for $\lambda_1$
  ### Manual PCA
  df = mtcars
  X = df[, c("mpg", "hp")]
  x2 = scale(X) # Standardizing (scale() already centers the columns, so no extra mean subtraction is needed)
  cov.x = cov(x2)
  eig.x = eigen(cov.x)
  eig.x$vectors
  ## Ensure that x and y are standardized
  plot(x = x2[,1], y = x2[,2], cex=2, xlab="x-component",ylab="y component",
      main="Actual Data VS PCA Vectors")
  ## Draw PC1
  arrows(x0=0,y0=0,x1 = eig.x$vectors[1,1], y1=eig.x$vectors[2,1], col="red", lwd = 5)
  arrows(x0=0,y0=0,x1 = -eig.x$vectors[1,1], y1=-eig.x$vectors[2,1], col="red", lwd = 5)
  ## Draw PC2
  arrows(x0=0,y0=0,x1 = eig.x$vectors[1,2], y1=eig.x$vectors[2,2], col="blue", lwd = 5)
  arrows(x0=0,y0=0,x1 = -eig.x$vectors[1,2], y1=-eig.x$vectors[2,2], col="blue", lwd = 5)
  ## Reduced Data V/S PCA
  PC1 = x2%*%as.matrix(eig.x$vectors[,1])
  PC2 = x2%*%as.matrix(eig.x$vectors[,2])
  # Reduced Data
  ## Here PC's need to be rotated
  plot(x = PC1, y = PC2, cex=2, xlab="PC1",ylab="PC2",
      main="Data Along PC1 and PC2", xlim = c(-3,3), ylim = c(-3,3))
  arrows(x0=0,y0=0,x1 = -3, y1=0, col="red", lwd = 5)
  arrows(x0=0,y0=0,x1 = 3, y1=0, col="red", lwd = 5)
  arrows(x0=0,y0=0,x1 =0, y1=2, col="blue", lwd = 5)
  arrows(x0=0,y0=0,x1 = 0, y1=-2, col="blue", lwd = 5)

  ##### Communities data example
  communities = read.csv(paste0(file_path, 'communities.csv'))
  covs = colnames(communities)[1:100]
  # Manual PCA
  # Step-1: Scaling
  PCA  = scale(communities[,covs])
  PCA = cbind(PCA, communities$ViolentCrimesPerPop)
  cor.mat = cor(PCA) # Correlation Matrix
  cov.mat = cov(PCA) # Covariance Matrix
  # Step-2: Obtain  eigenvalues and eigenvectors of covariance matrix
  eig = eigen(cov.mat) # Eigen values and vectors
  # Step-3: Calculate Total and Proportional Variance explained
  lambdas = eig$values
  # First 34 will explain 95% Variance in the data
  cum_var = cumsum(lambdas/sum(lambdas))
  sum(cum_var<0.95)
  # Proportion of variance explained by 1st two Principal Components
  lambdas[1:2]/sum(lambdas)
  pc.result = princomp(PCA)
  # pc.result$center: the 'mean' of the variables used for PCA
  # pc.result$sdev: the standard deviations of the principal components (sdev^2 = eigenvalues)
  # Calculate Total and Proportional Variance explained
  pc.var = pc.result$sdev^2
  pve = pc.var/sum(pc.var)
  # Plot 1st 2 Principal Components
  library(ggplot2)
  PC1 = pc.result$scores[,1]
  PC2 = pc.result$scores[,2]
  PCA2 = data.frame(VioLentCrimesPerPop = communities$ViolentCrimesPerPop, PC1,PC2)
  ggplot(PCA2, aes(PC1,PC2, colour = VioLentCrimesPerPop))+
          geom_point() + theme_bw()
  # Second Order Polynomial on PC1
  X = poly(PC1,2, raw=TRUE)  # MAKE SURE 'raw=TRUE'      
  Y = communities$ViolentCrimesPerPop
  data = data.frame(cbind(X,Y))
  # Prediction
  form = as.formula("Y~.")
  fit = lm(form, data=data)
  Yhat = predict(fit, newdata = data)
  plot(x = data$X1, y = data$Y,
      xlab = "PC-1", ylab = 'Violent Crimes', cex=0.5)
  points(x = data$X1, y = Yhat,col='red', cex=0.5)
  legend("topleft", pch = 1, col = c("black", "red"), legend = c("Actual", "Predicted"),
        cex=0.5)
  # Run parametric bootstrap (mle, rng, f1, f2 as in the Parametric Bootstrap sections below; needs library(boot))
  result = boot(data, statistic = f1, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
  result.PI = boot(data, statistic = f2, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
  # Plot Confidence Interval
  env = envelope(result, level=0.95)
  env.PI = envelope(result.PI,level=0.95)
  plot(x = data$X1, y = data$Y, ylim=c(-1,2),
      xlab = "PC-1", ylab = 'Violent Crimes', cex=0.5)
  points(x = data$X1, y = Yhat,col=2, cex=0.5,  lty=1)
  points(x = data$X1, y = env$point[2,], col = 3)
  points(x = data$X1, y = env$point[1,], col = 3)

Other Topics


Non-Parametric Bootstrap

  • Assumes no distribution for the generated data.
  • USE the WHOLE TRAINING DATA
library(boot)
state = read.csv(paste0(file_path,"State.csv"), header = TRUE,sep = ";")
to_numeric = colnames(state)[2:6]
for(col in to_numeric){
  state[,col] = as.numeric(gsub(",",".", state[,col])) # String modification
}
# 1. Re-order the data based on the covariate
  state.order = state[order(state$MET),] # Ascending order
  form = as.formula("EX~MET") # Formula used by the tree fits below
# 2.1 Define function to 'fit' and 'predict' on the sampled data
  f = function(data,index){
    data1 = data[index,]
    fit = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
    prunedTree  = prune.tree(fit, best=3)
    Yhat = predict(prunedTree, newdata = state.order)
    return(Yhat)
  }
# 2.2 For Prediction Intervals
  f2 = function(data,index){
    data1 = data[index,]
    fit = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
    prunedTree  = prune.tree(fit, best=3)
    Yhat = predict(prunedTree, newdata = state.order)
    n = length(state$EX) # equal to #.Training Data points
    # Now generate Y from the distribution
    predictedEX = rnorm(n, Yhat, sd(residuals(prunedTree)))
    return(predictedEX) # Return this
  }
# 3. Perform Non-Parametric Bootstrap
  result = boot(state.order, f, R=1000) # Confidence Interval
  result.PI = boot(state.order, f2, R=1000) # Prediction Interval
# 4. Plot Confidence Intervals
  env = envelope(result,level=0.95)# For Setting the CI
  env.PI = envelope(result.PI, level=0.95)
# Prediction with CI
par(mfrow=c(1,1))
plot(x = state.order$MET, y = state.order$EX, col = "red")
# trY: point predictions, e.g. from the pruned tree fitted on the full ordered data
fit0 = tree(form, state.order, control = tree.control(nobs = dim(state)[1], minsize = 8))
trY = predict(prune.tree(fit0, best = 3), newdata = state.order)
points(x = state.order$MET, y = trY, col = "blue",type='l')
points(x = state.order$MET, y = env$point[2,], col = "blue",type='l')
points(x = state.order$MET, y = env$point[1,], col = "blue",type='l')
points(x = state.order$MET, y = env.PI$point[2,], col = "green",type='l')
points(x = state.order$MET, y = env.PI$point[1,], col = "green",type='l')

Parametric Bootstrap(Regression)

  • Need to modify rng depending upon classification or regression task!
  • For Classification look into: Parametric Bootstrap(Classification)
library(boot)
state = read.csv(paste0(file_path,"State.csv"), header = TRUE,sep = ";")
to_numeric = colnames(state)[2:6]
for(col in to_numeric){
  state[,col] = as.numeric(gsub(",",".", state[,col])) # String modification
}
# 1. Compute 'mle' i.e., of model type
mle = tree(form, state,control = tree.control(nobs= dim(state)[1],minsize=8))
residuals(mle) # Best way to obtain model residuals
# 2. Define 'rng'  for data generation
rng = function(data, mle){
  data1 = data.frame(EX = data$EX,MET= data$MET)
  n = length(data$EX) # For Whole Data
  # Generate New Data: rnorm will be changed for classification task
  data1$EX = rnorm(n, predict(mle, newdata = data1), sd(residuals(mle)))
  return(data1)
}
#  Statistic function: should return the estimator(Yhat)
f1 = function(data1){
  res = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8)) # Fit the Model
  Yhat = predict(res, newdata = state) # newdata = input data
  return(Yhat)
}
# Run Parametric Bootstrap
result = boot(state, statistic = f1, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
## Prediction Interval
f2 = function(data1){
  # Fit the Model
  res = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
  Yhat = predict(res, newdata = state) # newdata = input data
  n = length(state$EX) # equal to no of Training Data points
  # Now generate Y from the distribution
  predictedEX = rnorm(n, Yhat, sd(residuals(mle)))
  return(predictedEX) # Return this
}
# Plot Confidence Interval
env = envelope(result, level=0.95)
result.PI = boot(state, statistic = f2, R=1000, mle = mle,
                 ran.gen = rng,sim = 'parametric')
env.PI = envelope(result.PI,level=0.95)
# Prediction with CI& PI
plot(x = state.order$MET, y = state.order$EX, col = "red",
     ylim = c(min(c(env.PI$point[2,],env.PI$point[1,])),max(c(env.PI$point[2,],env.PI$point[1,]))))    
points(x = state.order$MET, y = trY, col = "blue",type='l')
points(x = state.order$MET, y = env$point[2,], col = "red",type='l')
points(x = state.order$MET, y = env$point[1,], col = "red",type='l')
points(x = state.order$MET, y = env.PI$point[2,], col = "green",type='l') # For Prediction Interval
points(x = state.order$MET, y = env.PI$point[1,], col = "green",type='l') # For Prediction Interval

Parametric Bootstrap(Classification)

  • Pay attention to rng function
    # Perform Parametric Bootstrap
    # Since it is a classification task: 'rng' and f1 need to be modified
    data = data.frame(PC1 = PC_one,Y = digits$Y)
    data$Y = as.factor(data$Y)
    form = as.formula("Y~.")
    # 1. Define 'mle'
    mle = glm(form, data = data, family='binomial')
    # 2. Define 'rng'  for data generation
    rng = function(data, mle){
    data1 = data.frame(PC1 = data[,1], Y= data[,2])
    n = length(data[,2]) # For Whole Data
    # Generate New Data: rbinom for classification task
    data1[,2] = rbinom(n, size = 1, prob = predict(mle, data1, type = 'response'))
    return(data1)
    }
    #  3. Statistic function: should return the estimator(Yhat)
    f1 = function(data1){
    # Fit the Model
    fit = glm(form, data = data1, family='binomial')
    # newdata = input data
    Yhat = predict(fit, newdata = data) # Get the Link
    # Since we are predicting a confidence interval for 'probabilities', it cannot be negative (ISLR pg. 291-292)
    Yhat = exp(Yhat)/(1 + exp(Yhat))
    return(Yhat)
    }
    # Run Parametric Bootstrap
    result = boot(data, statistic = f1, R=1000, mle = mle, ran.gen = rng, sim = 'parametric')
    env = envelope(result, level=0.95) # 95% confidence envelope
    # Plot results ('yT' = the true 0/1 labels, defined in the task's earlier steps)
    plot(x = data$PC1, y = yT, col = "red", ylim=c(0,1))
    points(x = data$PC1, y = env$point[2,], col = "blue")
    points(x = data$PC1, y = env$point[1,], col = "blue")

Bonferroni Method

  • When we perform several tests simultaneously, we must adjust the significance level to control the overall (family-wise) error rate.
  • $H_{0j}$ : Treatment has no effect on $j$ ($j$ indexes a column or row)
  • $H_{1j}$ : Treatment has an effect on $j$
  • We reject $H_{0j}$ if $p_j < \alpha/M$, where $M$ is the number of tests. For large $M$, Bonferroni becomes overly conservative (very few rejections).
  • Choose the features for which we reject the null hypothesis, i.e., those on which the treatment has an effect; see the sketch below.
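
A minimal sketch (not part of the original lab code), reusing the p.CD4 p-values computed in the Benjamini-Hochberg section below; any vector of p-values works, and the built-in stats::p.adjust provides a cross-check.

alpha = 0.05
M = length(p.CD4)                 # Number of simultaneous tests
rejected = which(p.CD4 < alpha/M) # Bonferroni: reject H_0j if p_j < alpha/M
names(rejected)                   # Features with a significant effect
# Equivalent count via the built-in adjustment:
sum(p.adjust(p.CD4, method = "bonferroni") < alpha)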

Benjamini-Hochberg

  • $H_{0j}$ : Treatment has no effect on $j$ ($j$ indexes a column or row)
  • $H_{1j}$ : Treatment has an effect on $j$
  • ISLR (Version 2) has an example on regression-type data. Here we deal with classification data:
    • For classification data, split the data by class and conduct a BH test for each class individually.
  gene_data = read.csv(paste0(file_path,"geneexp.csv"))
  gene_data$CellType = as.factor(gene_data$CellType)
  which(sapply(gene_data, class) !='integer') # Check column-type
  expnames = colnames(gene_data)[2:2086] # X-column names
  gene_data[['isCD4']] = gene_data$CellType == "CD4" # Conditional Column for CD4 Cell type
  gene_data$isCD8 = gene_data$CellType == "CD8"
  gene_data$isCD19 = gene_data$CellType == "CD19"
  ## Benjamini-Hochberg
  # 1. Specify q, the level at which to control the FDR
  q = 0.05
  # 2. Compute p values for m-null hypothesis
  get_p_value = function(name, gene_type, data){
    f = as.formula(paste0(name, "~", gene_type)) # e.g. geneX ~ isCD4
    return(t.test(f, data = data)$p.value)
  }
  ## CD4
  p.CD4 = sapply(expnames, FUN=get_p_value, gene_type = "isCD4", data = gene_data)
  ## CD8
  p.CD8 = sapply(expnames, FUN=get_p_value, gene_type = "isCD8", data = gene_data)
  ## CD19
  p.CD19 = sapply(expnames, FUN=get_p_value, gene_type = "isCD19", data = gene_data)
  # 3. Sort the m p-values in ascending order (required by BH below)
  p.CD4 = sort(p.CD4) # IMPORTANT step
  p.CD8 = sort(p.CD8)
  p.CD19 = sort(p.CD19)
  # 4. Define L
  BH = function(pv, q){
    # L = largest j such that p_(j) <= q*j/m ('pv' must be sorted ascending)
    m = length(pv)
    below = which(pv <= q*(1:m)/m)
    if(length(below) == 0) return(0) # Nothing rejected
    return(max(below))
  }
  # 5. Reject all null hypotheses for which p_j <= p_{(L)}
  BH(p.CD4, q=0.05) ## No. of rejected genes for the CD4 cell type (choose these as the important ones)
  # Important ones
  names(p.CD4)[1:BH(p.CD4, q=0.05)] # These are the important genes (assumes L > 0)
  BH(p.CD8, q=0.05) ## No. of rejected genes for the CD8 cell type
  BH(p.CD19, q=0.05) ## No. of rejected genes for the CD19 cell type
  # Plotting
  m = length(p.CD4)  # Number of tests
  BHLine = q*(1:m)/m # BH-threshold line q*j/m
  plot(x=1:length(p.CD4), y=p.CD4, xlab='Index',
      ylab="P-Value", main = "CD4")
  lines(x=1:length(p.CD4),y = BHLine, col="blue", type='l')
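
As a sanity check (a sketch, not from the original notes), the built-in BH adjustment should agree with the count returned by BH():

  sum(p.adjust(p.CD4, method = "BH") <= q) # Should equal BH(p.CD4, q=0.05)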

N-Fold CV

  1. Calculate the loss using the loss function.
  2. Calculate the N-fold CV loss for the data using ‘n_fold_cv’.
  3. ‘get.subset’ searches over all column subsets (best-subset selection); see the usage sketch after the code. NOTE: Does not create subsets for a single predictor.
loss = function(x,y,xTest,yTest){
  # Function to obtain regression coefficient using x,y
  # Predict y_hat for 'xTest' and then calculate loss
  x = as.matrix(cbind(intercept=1,x)) ### Add Intercept
  xTest = as.matrix(cbind(intercept=1,xTest))
  xtx = solve(t(x)%*%x)   # (X'X)^-1
  beta = xtx%*%(t(x)%*%y) # OLS: beta = (X'X)^-1 X'y
  beta = as.matrix(t(beta))
  colnames(beta) = colnames(x)
  # Alternative loss: mean(abs(yTest - yHat)) ## MAE
  yHat = xTest%*%t(beta)
  testLoss = sqrt(mean((yTest-yHat)**2)) ## RMSE
  return(testLoss)
}
n_fold_cv = function(X,Y, n_fold){
  #Randomly shuffle the data
  ran_ind = sample(nrow(X))
  X = as.matrix(X[ran_ind,])
  Y = Y[ran_ind]
  #Create n equally size folds
  folds <- cut(seq(1,nrow(X)),breaks=n_fold,labels=FALSE)
  # Perform n_fold cross-validation
  cv_loss = c()
  for(i in 1:n_fold){
    # Segment the data by fold using the which() function
    testIndexes <- which(folds==i,arr.ind=TRUE)
    xTest <- X[testIndexes,]
    yTest <- Y[testIndexes]
    xTrain <- X[-testIndexes,]
    yTrain <- Y[-testIndexes]
    l = loss(x=xTrain, y = yTrain,xTest = xTest, yTest = yTest)
    cv_loss = c(l,cv_loss)
  }
  cv_loss = mean(cv_loss)
  return(cv_loss)
}
get.subset = function(X,Y,n_folds){
  N = dim(X)[2] # No. of Co-variates
  
  col_comb = function(i,n){t(combn(1:n,i))} # Function to select 'i' out of 'n'
  best_cols = col_comb(N,N)[1,] # Start with all columns
  loss_best = Inf  # Setting a very high initial value
  
  # i accounts for no of columns
  for(i in 1:N){
    combos = col_comb(i=i,n=N)
    rows = dim(combos)[1]
    for(row in 1:rows){
      columns = combos[row,]
      X_new = as.matrix(X[, columns])
      # n-fold cross-validation on this column subset
      curr_loss = n_fold_cv(X_new, Y, n_folds)
      if (curr_loss<loss_best){
        loss_best = curr_loss
        best_cols = columns
      }
    }
  }
  return(list(best = best_cols, loss =loss_best))
}
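
A minimal usage sketch on simulated data (names and sizes are illustrative assumptions):

set.seed(12345)
X = data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
Y = 2*X$a - X$b + rnorm(100) # 'c' is irrelevant by construction
res = get.subset(X, Y, n_folds = 5)
res$best # Indices of the selected columns
res$loss # Their cross-validated RMSE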

Variable Selection using StepAIC

  1. Can be used for both AIC and BIC (set k = log(n) for BIC).
  2. Useful when asked which variables are important.
  3. ‘AIC’ is an information criterion and can use all of the provided data; thus we can utilize all of it when estimating the generalization capability of a model.
  ## Variable Selection using StepAIC
  library(MASS)
  d3 = data[,-1] # Sample Dataset
  form = "Fat~." # Set Formula
  fit = lm(form, data = as.data.frame(d3))
  step = stepAIC(fit, direction = 'both') # For AIC in both the directions.
  step.BIC = stepAIC(fit, k=log(nrow(d3))) # BIC Method
  # To get selected variables
  get.coef.names = function(model, step.func){
    STEP_COEF = vector("numeric", length(coefficients(model))) # Vector of zeros
    names(STEP_COEF) = names(coefficients(model))              # Same names as the full model
    STEP_COEF[names(coefficients(step.func))] = as.numeric(coefficients(step.func)) # Fill in those kept by step
    coef_names = names(STEP_COEF[STEP_COEF != 0]) # Non-zero entries = selected variables
    return(coef_names)
  }
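
  Usage sketch (assumes the fit, step and step.BIC objects from above):
  get.coef.names(fit, step)     # Variables selected by stepwise AIC
  get.coef.names(fit, step.BIC) # Variables selected by stepwise BIC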

Nested Cross-Validation

  • Randomly split the data into $K$ outer folds.
  • For each hyper-parameter combination, an average error config_error is calculated using only the validation data within the inner loop.
  • The inner loop is itself a cross-validation over innerfolds folds.
  • Look at the average evaluation for each configuration, i.e., config_error, and choose the hyper-parameters with the lowest config_error.
  • With the chosen hyper-parameters, split each outer fold's data into Train and Test, and save the test loss in outer_cv_loss for each outer fold.
  • The average of outer_cv_loss is the model's generalization error.
library(kernlab) # For data(spam) and ksvm()
data(spam)
outerfolds = 5 # No. of outer folds
innerfolds = 3 # No. of inner folds
form = as.formula('type~.')
ran_ind = sample(nrow(spam)) # Reshuffle the data
data = spam[ran_ind,]
# Create equally sized folds depending on the 'outerfolds' value
folds <- cut(seq(1,nrow(data)),breaks=outerfolds,labels=FALSE)
# Perform n-fold cross-validation
outer_cv_loss = NULL # Each outer fold's test error
params = seq(0.1,2.5,0.5) # Candidate values for the SVM cost parameter C
# Outer CV
for(outerfold in 1:outerfolds){
    testIndexes <- which(folds==outerfold,arr.ind=TRUE)
    Test <- data[testIndexes,]
    Train <- data[-testIndexes,]
    # Hyper-parameter search: use only the validation-data error
    config_error = NULL
    # Build the inner folds over the outer-fold Training data only
    inner_idx <- cut(seq(1,nrow(Train)),breaks=innerfolds,labels=FALSE)
     # Each 'i' should contain the required param(s) combo
      for(i in 1:length(params)){
        # Inner-CV
        if_error = NULL # Inner-fold error for the i-th param configuration
        for(fold in 1:innerfolds){
            # Use only the outer-fold Training data here
            ValIndexes <- which(inner_idx==fold,arr.ind=TRUE)
            Val <- Train[ValIndexes,] # kth 'inner-fold' Validation Data
            TraIN <- Train[-ValIndexes,]
            fit = ksvm(form,data=TraIN,kernel="rbfdot",kpar=list(sigma=0.05),C=params[i])
            yValHat = predict(fit, Val)
            misc_error = 1-mean(Val$type == yValHat) # Misclassification Error
            if_error = c(if_error, misc_error)
        }
        config_error[i] = mean(if_error) # Look at avg eval for each config
      }
    best_i = params[which.min(config_error)] # Choose the config with the best config_error
    fit.outer = ksvm(form,data=Train,kernel="rbfdot",kpar=list(sigma=0.05),C=best_i)
    yTestHat = predict(fit.outer, Test) # Prediction on k-th outerfold Test Data
    of_error = 1-mean(Test$type == yTestHat) # Outer Fold Test Error
    outer_cv_loss = c(outer_cv_loss, of_error) # Its Mean is the 'Generalized Error'
}
generalization_error = mean(outer_cv_loss) # Generalization error of the model.

Laplace Prior

  • $Y_i = \mathbf{x}_i \cdot \boldsymbol{\beta} + \epsilon_i, \; i = 1,\dots,n$ where $\epsilon_i \sim \textit{Laplace}(0, b)$; the negative complete-data log-likelihood is then
    • $-\ln p\left( Y \mid \mathbf{X}, \boldsymbol{\beta}, b \right) = n \ln (2b) + \frac{1}{b} \sum_{i=1}^n \left|\mathbf{x}_i \cdot \boldsymbol{\beta} - Y_i \right|$
  • The error term $\epsilon$ has density $p(\epsilon) = \frac{e^{-|\epsilon|}}{2}$, i.e., Laplace with location $0$ and scale $b = 1$.
  • e_s draws from this Laplace error distribution (location $0$, scale $1$); a sketch of fitting $\boldsymbol{\beta}$ by minimizing the negative log-likelihood follows the histogram below.
    # Drawing errors from the Laplace distribution by sampling a dense grid
    # with weights proportional to the density (a discrete approximation)
    error = seq(-5,5, length = 10000)
    dlaplace = function(x){0.5*exp(-abs(x))} # Laplace(0,1) density
    derror = sapply(error, FUN = dlaplace)
    e_s = sample(error, size=25000, prob = derror, replace=TRUE)
    hist(e_s, breaks = 1000, freq = F)
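
  • A minimal sketch (simulated data; names are illustrative assumptions) of fitting $\boldsymbol{\beta}$ by minimizing the negative log-likelihood above with optim; for fixed $b$ this is equivalent to least-absolute-deviations regression.
    set.seed(12345)
    n = 200
    X = cbind(1, rnorm(n))               # Intercept + one covariate
    beta_true = c(1, 2)
    Y = X %*% beta_true + sample(e_s, n) # Laplace(0,1) noise drawn from e_s above
    neg_loglik = function(beta, b = 1){
      n*log(2*b) + sum(abs(X %*% beta - Y))/b # Formula from the bullet above
    }
    fit = optim(par = c(0, 0), fn = neg_loglik)
    fit$par # Should be close to beta_true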
    

Entropy

  • The negative-entropy term $x \cdot \ln x$ is a strictly convex function: its second derivative is $1/x > 0$ on $(0, 1]$, so the entropy $-x \ln x$ is strictly concave.
      x = seq(0,1, length = 10000)
      x = x[x>0]
      plot(x=x, y = x*log(x), col=2)
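      # Numerical convexity check (a sketch, not from the original notes):
      # on a uniform grid, the second differences of a strictly convex
      # function are positive
      all(diff(x*log(x), differences = 2) > 0) # TRUE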