732A99 Machine Learning
Helpful R code snippets to go along with the 732A99 course offered at Linköping University
- Data-Manipulation
- Supervised Learning
- Neural-Net Regression
- EM : Multivariate Bernoulli
- EM: Mixture of Gaussians
- GAM(Generalized Additive Models)
- Kernel Classification(using Kernel Density Estimation)
- LDA(Manual implementation can be extended to QDA and Naive Bayes)
- QDA
- Decision Trees: Classification/Regression
- Splines: Using base R Functions only
- Splines: Using splines library
- Logistic Regression
- K Nearest Neighbors
- Ridge Regression
- Poisson Regression
- Ridge Regression using Optim
- LASSO
- Elastic-Net
- SVM
- Experimental Estimation of Bagging Error
- NN from Scratch
- Unsupervised Learning
- Other Topics
Data-Manipulation
1.1 Load Excel Data
library("readxl")
data = read_excel('spambase.xlsx')
data = read_excel('spambase.xls')
1.2 Denormalize normalized data
- Collect $\textit{max}_x$, $\textit{min}_x$
- Apply the formula on all $\textit{x}$ : $\textit{x}$ * ( $\textit{max}_x$ - $\textit{min}_x$ ) + $\textit{min}_x$
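- A minimal sketch of the formula above (hypothetical `denormalize` helper; `max_x`/`min_x` are the pre-normalization range):
denormalize = function(x_scaled, max_x, min_x){ x_scaled*(max_x - min_x) + min_x } # undo min-max scaling
denormalize(0.5, max_x = 10, min_x = 2) # = 6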
1.3 Train/Test Split
n = dim(data)[1]
set.seed(12345)
id = sample(1:n, floor(n*0.5))
train = data[id,]
test = data[-id,]
1.4 Train/Test/Validation Split
- Also available in Lecture: Block 1(e)
- Contains an example of how to perform cross-validation, given that there is a function `loss` to estimate the loss.
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
seq(nrow(dataframe)),
nrow(dataframe)*cumsum(c(0,spec)),
labels = names(spec)
))
res = split(dataframe, g)
train = res$train
test = res$test
val = res$validate
# Cross Validation
n_fold_cv = function(X,Y, n_fold){
#Randomly shuffle the data
ran_ind = sample(nrow(X))
X = X[ran_ind,]
Y = Y[ran_ind]
#Create n equally size folds
folds <- cut(seq(1,nrow(X)),breaks=n_fold,labels=FALSE)
#Perform n-fold cross validation
cv_loss = c()
for(i in 1:n_fold){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
xTest <- X[testIndexes,]
yTest <- Y[testIndexes]
xTrain <- X[-testIndexes,]
yTrain <- Y[-testIndexes]
l = loss(x=xTrain, y = yTrain,xTest = xTest, yTest = yTest)
cv_loss = c(l,cv_loss)
}
cv_loss = mean(cv_loss)
return(cv_loss)
}
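- A hedged usage sketch: an example `loss` function (RMSE of a linear model) of the kind `n_fold_cv` above expects; the data set and names are illustrative only.
loss = function(x, y, xTest, yTest){
fit = lm(y ~ ., data = data.frame(x, y = y)) # fit on the training fold
yHat = predict(fit, newdata = data.frame(xTest)) # predict on the held-out fold
sqrt(mean((yTest - yHat)^2)) # RMSE on the held-out fold
}
# cv_rmse = n_fold_cv(X = mtcars[, -1], Y = mtcars$mpg, n_fold = 5)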
1.5 Dichotomization
spam$target = ifelse(spam$Spam==1, 'S','H') # Create a binary variable
spam$target = as.factor(spam$target) # Convert to Factor
contrasts(spam$target) # Check the factor
1.6 Writing Custom formula
covariates = names(spam)[1:48] # Collect the first 48 covariate names
cov=paste(covariates, collapse='+') # Collapse the covariates
form = as.formula(paste0('target~',cov)) # Create the formula with the target
1.7 Gaussian Kernel
- `h` specifies the kernel width
gaussian_k <- function(x, h) { return (exp(-(x**2)/(2*h*h))) }
1.8 Package Vignette
utils::browseVignettes("gbm") # Package Name
1.9 Metrics
- Look at data imbalance when quoting misclassification.
- Misclassification rate: $1 - \text{Accuracy}$, i.e. $\frac{\text{FP+FN}}{\text{total}}$ | Accuracy: $\frac{\text{TP+TN}}{\text{total}}$
1 - mean(data$train == data$prediction)
- Sensitivity/Recall/True Positive Rate: $\frac{TP}{\text{Actual Yes}}$
- False Positive Rate ($= 1 - \text{Specificity}$): $\frac{\text{FP}}{\text{Actual No}}$
get.rate = function(predicted.prob, actual, true_label, false_label, threshold){
prediction = ifelse(predicted.prob>threshold, true_label, false_label)
TPR = sum(prediction==true_label & actual == true_label)/sum(actual == true_label)
FPR = sum(prediction==true_label & actual == false_label)/sum(actual == false_label)
return(c(TPR, FPR))
}
- AIC/BIC/$C_p$:
  - Can be used for estimating only ‘In-Sample’ errors for ‘Linear’ parameters. The ‘In-Sample’ error alone is of limited use, as it does not represent the generalization error, i.e., how the model will perform on ‘Unseen’ data.
  - AIC Formula: $\frac{-2}{N}\cdot\text{log-likelihood} + 2\cdot\frac{\text{degrees of freedom}}{N}$ where $N = \text{no. of datapoints}$
  - BIC Formula: $\textit{Log-Likelihood} - \frac{M}{2} \cdot \ln(N)$ where $M$ is the no. of parameters and $N$ = no. of datapoints.
BIC = function(LL,M,N){return(LL - (0.5*M*log(N)))} # LL - (M/2)*ln(N), matching the formula above
- Loss Matrix:
  - Make sure the loss matrix and the prediction column order are the same. Ex: if the loss matrix is ordered as `c("cat","dog")`, then the predicted probability columns should also be `pred_prob[,c("cat","dog")]`.
  - Then the expected loss of each decision is `pred_prob%*%loss_matrix`; pick the decision with `apply(pred_prob%*%loss_matrix, MARGIN=1, FUN=which.min)` (minimum expected loss) and map the resulting index back through the same order `c("cat","dog")` for classification (see the example below).
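- A worked toy example of the loss-matrix decision (hypothetical 'cat'/'dog' labels and loss values):
# rows = true class, cols = predicted class, ordered as c('cat','dog')
loss_matrix = matrix(c(0, 1,
10, 0), nrow = 2, byrow = TRUE, dimnames = list(c('cat','dog'), c('cat','dog')))
pred_prob = matrix(c(0.8, 0.2), nrow = 1, dimnames = list(NULL, c('cat','dog'))) # P(cat), P(dog)
expected_loss = pred_prob %*% loss_matrix # expected loss of each possible decision
c('cat','dog')[apply(expected_loss, MARGIN = 1, FUN = which.min)] # 'dog': mis-calling a dog costs 10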
1.10 Plotting Help
refer: https://bookdown.org/rdpeng/exdata/the-base-plotting-system-1.html#base-plotting-functions
Supervised Learning
Neural-Net Regression
- STEP-1: Standardize the data. If the test set is unseen, use the ‘mean’ and ‘variance’ of the training data (see the sketch after the code below).
- `y~.` does not work; the formula needs to be spelled out explicitly.
- `linear.output=TRUE` should be chosen for regression.
- `threshold` specifies the change in the error (given by `nn.fit$err.fct`) over the epochs (also referred to as ‘Steps’).
- A custom activation function goes into `act.fct`:
RelU = function(x){ ifelse(x>0,x,0)} # ReLU Implementation
nn.fit = neuralnet(form, data = tr, hidden = c(2,2,2), threshold = 0.01, linear.output = TRUE, act.fct = RelU)
library(neuralnet)
Var <- runif(50, 0, 10)
trva <- data.frame(Var, Sin=sin(Var))
tr <- trva[1:25,] # Training
va <- trva[26:50,] # Validation
form = as.formula("Sin~Var")
winit = runif(19, -1, 1) # 19 = total number of weights for a 1-(2,2,2)-1 architecture
nn.fit = neuralnet(form, data = tr, rep=10, hidden = c(2,2,2), threshold = 0.01,
                   linear.output = TRUE, startweights = winit)
plot(nn.fit, show.weights = TRUE)
## Prediction
yTestPred = compute(nn.fit, va)
plot(prediction(x = nn.fit)$rep1)
points(trva, col=2)
## Test Results
yTestPred$net.result
plot(yTestPred$net.result)
points(va$Sin, col=2)
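- A minimal sketch for STEP-1, assuming the `tr`/`va` split above: scale the validation/test set with the TRAINING statistics only.
tr_mean = mean(tr$Var); tr_sd = sd(tr$Var) # statistics from the training data only
tr$Var_sc = (tr$Var - tr_mean)/tr_sd
va$Var_sc = (va$Var - tr_mean)/tr_sd # never use validation/test statistics here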
EM : Multivariate Bernoulli
- In case of Binary Data, use ‘bernoulli’ priors
- In case of Multiclass data, use ‘dirichlet’ priors!
- In contrast to the mixture of Gaussians, there are no singularities in which the likelihood function goes to infinity. This can be seen by noting that the likelihood function is bounded because $0 \le p(\mathbf{x}_n|\boldsymbol{\mu}_k) \le 1$
- $\text{Log-Likelihood :} \ln p(\textbf{X} | \mu, \pi) = \displaystyle \sum_{n=1}^{N} \ln {\sum_{k=1}^{K} \pi_{k} p(\mathbf{x_n} | \mathbf{\mu_k}) }$ where $p(\mathbf{x_n} | \mathbf{\mu_k}) = \displaystyle \prod_{i=1}^{D} \mu_i^{x_i} \left(1 - \mu_i \right)^{1 - x_i}$ i.e., $x_n$ datapoint with $D$ “binary” variables.
- Algo Steps:
  - Use $\text{Eq. 9.54}$ to get the $\textbf{Complete Data Log Likelihood}$
  - Use $\text{Eq. 9.56}$ to get the responsibility ($\gamma(z_{nk})$) of component $k$ given datapoint $\mathbf{x}_n$, which gives a `resp` matrix of dim $[N,K]$
  - Now the M-Step:
    3.1. Calculate $N_k$ using $\text{Eq. 9.57}$: `N_k = colSums(resp)`
    3.2. Use $\text{Eq. 9.58}$ to get $\bar{x}$: `x_k = t(resp)%*%x # same dimension as mu`
    3.3. Update $\boldsymbol{\mu}_k = \bar{x}$: `mu[k,] = x_k[k,]/N_k[k] # Using eq. 9.58 and 9.59`
    3.4. Update $\pi_k$ using Eq. 9.60: `pi = N_k/N # eq. 9.60`
library(dplyr) ## For Piping
set.seed(1234567890)
max_it = 100
min_change = 0.1
N = 1000
D= 10
x = matrix(nrow = N,ncol = D)
true_pi = vector(length = 3)
true_mu = matrix(nrow = 3, ncol=D)
true_pi = c(1/3,1/3,1/3)
true_mu[1,] = c(0.5,0.6,0.4,0.7,0.3,0.8,0.2,0.9,0.1,1)
true_mu[2,] = c(0.5,0.4,0.6,0.3,0.7,0.2,0.8,0.1,0.9,0)
true_mu[3,] = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5)
plot(true_mu[1,], type="o", col="blue", ylim=c(0,1))
points(true_mu[2,], type="o", col="red")
points(true_mu[3,], type="o", col="green")
# Training Data
for(n in 1:N){
k = sample(1:3, 1, prob = true_pi)
for(d in 1:D){
x[n,d] = rbinom(1,1,true_mu[k,d])
}
}
K = 3 # Guessed components
pi = vector(length = K)
mu = matrix(nrow=K, ncol=D)
llik = vector(length = max_it)
# Random Initialization of the Parameters
pi = runif(K,0.49,0.51)
pi = pi/sum(pi) # Sum = 1 for 'Mixing Coefficients'
for(k in 1:K){
mu[k,] = runif(D, 0.49,0.51)
}
for(it in 1:max_it){
plot(mu[1,], type="o", col="blue", ylim=c(0,1))
points(mu[2,], type="o", col="red")
points(mu[3,], type="o", col="green")
# points(mu[3,], type="o", col="black")
Sys.sleep(0.5)
# E-Step
# Measure Log-Likelihood
lik1 = function(n,K,pi,mu){
lik = NA
# Log-likelihood for 1 datapoint
for(k in 1:K){
t1 = log(pi[k]) # Log(pi_k)
t2 = x[n,]*log(mu[k,]) # x_{nd} * log(mu_{kd}) # for d columns
t3 = (1-x[n,])*log(1 - mu[k,]) # (1 - x_{nd}) * log(1 - mu_{kd}) # for d columns
lik[k] = t1 + sum(t2+t3)
}
return(sum(lik))
}
llik[it] = sapply(1:N, FUN = lik1, K=K,pi=pi,mu=mu) %>% sum ## Eq 9.54
cat("iteration: ", it, " Log-Likelihood: ", llik[it], "\n")
# Calculating responsibility
resp = matrix(nrow = N, ncol = K) # holds eq. 9.56 variable i.e., responsibility
for(n in 1:N){
total_k = NA
for(k in 1:K){
t1 = log(pi[k])
t2 = (x[n,]*log(mu[k,]))
t3 = (1-x[n,])*log(1 - mu[k,])
total_k[k] = t1 + sum(t2+t3)
}
total_k = exp(total_k)
total_k = total_k/sum(total_k)
resp[n,] = total_k
}
# M- Step
N_k = colSums(resp) # 9.57
cat("N_k: ", N_k, "\n")
x_k = t(resp)%*%x # same dimension as mu
# Update mu
for(k in 1:K){
mu[k,] = x_k[k,]/N_k[k] # Using eq. 9.58 and 9.59
}
# Update pi
pi = N_k/N # eq. 9.60
if(it>1){
if(abs(llik[it-1] - llik[it])<min_change)break
else{print(round(abs(llik[it-1] - llik[it]), digits = 3))}
}
}
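- Optional sanity check (uses the `llik` and `it` objects left by the loop above): the log-likelihood should be non-decreasing over the EM iterations.
plot(llik[1:it], type = "o", xlab = "Iteration", ylab = "Log-likelihood")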
EM: Mixture of Gaussians
- Initial $\mu$ and $\Sigma$ can be obtained using `kmeans`
- We have $\text{Log-Likelihood :} \ln p(\textbf{X}|\mu, \pi,\Sigma) = \displaystyle \sum_{n=1}^{N} \sum_{k=1}^{K} { \ln \pi_{k} +\ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \Sigma_k) }$
- E-Step:
  - Evaluate the responsibility using Eq. 9.23. Implemented by the function `e_step`.
- M-Step:
  - First get $N_k$ for Eq. 9.27 using `colSums(e_step(...))`, saved in the `N_k` variable.
  - Get the updated $\boldsymbol{\mu}_k^{new}$ using Eq. 9.24 and the function `update_mu`.
  - Get the updated $\pi_k^{new}$ using Eq. 9.26 and the `update_pi` function.
  - CAREFUL! For getting $\Sigma_k^{new}$ there are two steps:
    - Use the function `cov_k` to get $\gamma(\textit{z}_{nk})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})\left(\mathbf{x}_n - \boldsymbol{\mu}_k^{new} \right)^{T}$ for a single datapoint.
    - Now apply `cov_k` to all data-points and sum them up: `sigmaK = sapply(1:N, FUN = cov_k, df = df, K=k, z = z) %>% rowSums`
    - Re-arrange the covariance matrix: `sigmaK = matrix(sigmaK, ncol=D, byrow = T)`
    - In case we need it to be a diagonal-only matrix: `sigmA[k,,] = diag(diag(sigmaK)/N_k[k]) ## Only Diagonals`
    - If we want to keep the full (non-diagonal) covariance: `sigmA[k,,] = sigmaK/N_k[k] ## We need to divide by N_k[k]`
library(dplyr) ## For Piping
library(mvtnorm)
dataEM = read.table(paste0(file_path, "dataEM.txt"))
N = dim(dataEM)[1]
D = dim(dataEM)[2] # Can be columns
df = as.data.frame(dataEM)
x_density = function(n,df,K,pi,mu,sigmA){
vK = NA
for(k in 1:K){
vK[k]= log(pi[k]) + dmvnorm(df[n,], mean = mu[k,], sigma = sigmA[k,,], log = TRUE)
}
return(sum(vK)) # x_density and CDLL need to be outside EM function as they are helper
}
CDLL = function(df, K,pi,mu,sigmA){
loglik = sapply(1:N, FUN = x_density, df,K,pi,mu,sigmA) %>% sum
return(loglik) # Observe Complete Data Log Likelihood
}
# EM-Function
EM = function(df,N,D,K,pi,mu,sigmA){
# E-Step
e_step = function(df,N,K, pi, mu, sigmA){
z = matrix(NA, nrow = N, ncol = K)
for(n in 1:N){
for(k in 1:K){
# z[n,k] is the numerator in eq. 9.23
z[n,k] = pi[k]*dmvnorm(df[n,], mean = mu[k,],sigma = sigmA[k,,])
}
z[n,] = z[n,]/sum(z[n,]) # Eq 9.23
}
return(z)
}
z = e_step(df=df,N, K=K, pi=pi,mu = mu, sigmA = sigmA)
N_k = colSums(z,na.rm = TRUE)
##M - step
# pi
update_pi = function(z, N){
pi_m = colSums(z)/N
pi_m = pi_m/sum(pi_m)
return(pi_m)
}
pi = update_pi(z,N)
# mu
update_mu = function(df, z, Nk){
mu_m = t(as.matrix(t(df)%*%z)) # K x D matrix of responsibility-weighted sums (numerator of eq. 9.24)
mu_m = mu_m/Nk # divide each row k by N_k
return(mu_m)
}
mu = update_mu(df, z, Nk = N_k)
# Update Sigma
cov_k = function(n,df, K, z){
df = as.matrix(df)
# gamma(z_nk) * (x_n - mu_k)(x_n - mu_k)^T for one datapoint (eq. 9.25); mu comes from the enclosing EM scope
val = ((df[n,] - mu[K,])%*%t(df[n,] - mu[K,]))*z[n,K]
return(as.matrix(val))
}
for(k in 1:K){
sigmaK = sapply(1:N, FUN = cov_k, df = df, K=k,z = z) %>% rowSums
sigmaK = matrix(sigmaK, ncol=D, byrow = T)
sigmA[k,,] = diag(diag(sigmaK)/N_k[k]) ## Only Diagonals
}
llik = CDLL(df,K,pi,mu,sigmA) ## CDLL is designed as a global function
return(list(pi = pi,mu = mu, sigma = sigmA, LLik = llik))
}
## Function to run EM Algorithm
RUN_EM = function(df,N,D,K,pi,mu,sigmA,max_it,delta){
llik = vector(length = max_it)
for(it in 1:max_it){
if(it==1){
llik[it] = CDLL(df,K,pi,mu,sigmA)
result = EM(df,N,D,K,pi,mu,sigmA)
pi = result$pi
mu = result$mu
sigmA = result$sigma
}
else{
result = EM(df,N,D,K,pi,mu,sigmA)
pi = result$pi
mu = result$mu
sigmA = result$sigma
llik[it] = result$LLik
if( (abs(llik[it] - llik[it-1])<delta) )break
}
}
return(list(pi = pi,mu = mu, sigma = sigmA, LLik = llik, iterations = it))
}
## RUN EXPERIMENT
K = 2
D = 2 # Columns
mu = matrix(nrow = K, ncol = D)
sigmA = array(NA,dim=c(K,2,2)) # K X [D X D]
for(k in 1:K){
mu[k,] = runif(D, 0,5)
sigmA[k,,] = diag(rep(1,D))
}
pi = runif(K,0.4,0.6)
pi = pi/sum(pi)
max_it = 100
delta = 0.1
result_2 = RUN_EM(df,N,D,K,pi,mu,sigmA,max_it,delta)
# Plot
par(mfrow=c(1,3))
plot(result_2$LLik[2:result_2$iterations], col=2, type='l') # Plot only Non-Zero Log-Likelihoods
# result_3 / result_4: obtained by re-running RUN_EM with K = 3 and K = 4
plot(result_3$LLik[1:result_3$iterations], col=3, type='l')
plot(result_4$LLik[2:result_4$iterations], col=4, type='l')
# Calculate Bayesian Info Criteria
BIC = function(LL,M,N){return(LL - (0.5*M*log(N)))} # LL - (M/2)*ln(N)
BIC2 = BIC(LL = result_2$LLik[result_2$iterations], M=2, N=600)
GAM(Generalized Additive Models)
- Notes for library `mgcv`:
  - Non-linear effects can be entered via `s` or `te`.
  - Use `method = "REML"`.
  - Always look at `gam.check()`.
  - Parameters:
    - `s` in `gam(y~x0+s(x1)+s(x2,x3))` is the smoothing function:
      - `k` specifies the basis dimension used to represent the smooth term: ALWAYS set it equal to the number of unique values, e.g. `Mortality~Year+s(Week, k = length(unique(data1$Week)))`
      - `bs` specifies the penalized smoothing basis: `Mortality~s(Week, k = length(unique(data1$Week)), bs='cc')`
        - `bs="tp"`: for thin plate splines
        - `bs="cr"`: for cubic regression splines
        - `bs="cc"`: for cyclic cubic splines. Useful if a covariate shows a cyclical trend like ‘Year’, ‘Month’
    - The Generalized Cross Validation score is given within `summary(model)`'s GCV value. Use `method="GCV.Cp"` while fitting.
library(mgcv)
form.1 = as.formula("Mortality~Year+s(Week, k = length(unique(data1$Week)))")
fit1 = gam(form.1, family = gaussian(), data=data1, method = 'REML')
for.2 = as.formula("Mortality~Year+s(Week, k = length(unique(data1$Week)))+
s(Year, Influenza)")
fit2 = gam(for.2, family = gaussian(), data=data1, method = 'REML')
# Gives significance of smooth terms and variance of the model
summary(fit1)
# Plotting the Residuals
par(mfrow = c(2,2))
gam.check(fit1)
# Plot the Spline Component
plot(fit1)
plot(fit1, residuals=T) # With Residuals
# Compare Models (fit3: a further candidate model fitted in the same way)
anova(fit1, fit2, fit3, test = 'Chisq')
AIC(fit1, fit2, fit3) # Choose the one with the lowest AIC
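- A small hedged follow-up using the objects above: fitted values from the chosen model plotted against the observed series.
yHat = predict(fit1, newdata = data1, type = "response")
plot(data1$Mortality, type = "l", ylab = "Mortality")
lines(yHat, col = 2) # fitted Mortality from the GAM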
Kernel Classification(using Kernel Density Estimation)
# January-2021 exam
# data generation
N_class1 <- 1500
N_class2 <- 1000
data_class1 <- NULL
for(i in 1:N_class1){
a <- rbinom(n = 1, size = 1, prob = 0.3)
b <- rnorm(n = 1, mean = 15, sd = 3) * a + (1-a)*rnorm(n = 1, mean = 4, sd = 2)
data_class1 <- c(data_class1,b)
}
data_class2 <- NULL
for(i in 1:N_class2){
a <- rbinom(n = 1, size = 1, prob = 0.4)
b <- rnorm(n = 1, mean = 10, sd = 5) * a + (1-a) * rnorm(n = 1, mean = 15, sd = 2)
data_class2 <- c(data_class2,b)
}
# y=1,0
df1 = data.frame(cbind(data_class1, 1))
df2 = data.frame(cbind(data_class2,0))
# Prior Probabilities
p_1 = 1500/2500
P_2 = 1 - p_1
# Define the Kernel
kernel = function(x,h){return(exp(- (x/h)^2))}
xs = seq(-5,25,0.1) # Data to be tested on
kern.prob = matrix(NA, nrow = length(xs), ncol = 2) # For Storing Posterior Class probability
for(i in 1: length(xs)){
x = xs[i]
diff_1 = x - df1[,1] # For class-1 only use class-1 data
diff_2 = x- df2[,1] # For class-2 only use class-2 data
p1 = mean(sapply(diff_1, FUN = kernel, h = 5)) # P(X=x|Y = 'class 1')
p2 = mean(sapply(diff_2, FUN = kernel, h = 5)) # P(X=x|Y = 'class 2')
kern.prob[i,1] = (p1*p_1)/((p1*p_1) + (p2* P_2) ) # Applying Bayes Theorem
kern.prob[i,2] = (p2*P_2)/((p1*p_1) + (p2* P_2) ) # this is also equal to 1 - kern.prob[i,1]
}
# Check the Plots
plot(x=xs, y = kern.prob[,1], type='l', ylim=c(0,1))
lines(x=xs, y=kern.prob[,2],type='l', col=3)
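- Turning the posteriors into hard labels (a small sketch using `kern.prob` from above):
pred_class = ifelse(kern.prob[,1] > kern.prob[,2], 1, 2) # assign each grid point to the class with the larger posterior
table(pred_class)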
LDA(Manual implementation can be extended to QDA and Naive Bayes)
- Inputs are assumed to be CONTINUOUS
- All Classes share the same covariance matrix(In QDA, they don’t)
- Because of the ‘common covariance matrix’ assumption, the decision boundaries corresponding to constant posterior probability $p(C_k|\textbf{x})$ can be represented by a linear function of $\textbf{x}$, i.e., “decision boundaries are linear in input space”.
- Since, we estimate the parameters using MLE, the approach is susceptible to ‘outliers’.
- While plotting, make the `xlim` and `ylim` of the generated data the same as for the old data.
- Use only the diagonal elements of the pooled covariance matrix to perform Naive Bayes (see the sketch after the manual LDA code below).
- Use a different covariance matrix for each class to perform QDA.
library(MASS) # Step-1: Library
crabs = as.data.frame(read.csv(paste0(file_path,'australian-crabs.csv')))
crabs$sex = as.factor(crabs$sex) # Step-0:Convert the target to factor
# Plot for Classes
with(crabs, plot(CL, RW, main = "Main", type = "n"))
with(subset(crabs, sex=='Male'), points(CL, RW, col='blue'))
with(subset(crabs, sex!='Male'), points(CL, RW, col='red'))
legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
form = as.formula('sex~CL+RW') # Step-2 formulation
lda.model = lda(form, crabs, prior = c(0.1,0.9)) # Step-3, prior order as 'factor' order
crabs[,'predictions'] = predict(lda.model,crabs)$class # Step-4
# Plot- Multiple Predictions, Multiple CLasses
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(crabs, {
plot(CL, RW, main = "Prior", type = "n")
with(subset(crabs, sex=='Male'), points(CL, RW, col='blue'))
with(subset(crabs, sex!='Male'), points(CL, RW, col='red'))
legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
plot(CL, RW, main = "Posterior", type = "n")
with(subset(crabs, predictions=='Male'), points(CL, RW, col='blue'))
with(subset(crabs, predictions!='Male'), points(CL, RW, col='red'))
legend("topleft", pch = 1, col = c("blue", "red"), legend = c("Male", "Female"))
})
# Draw discriminant Line
np = lda.model$N
nd.x <- seq(from = min(crabs$RW), to = max(crabs$RW), length.out = np)
nd.y <- seq(from = min(crabs$CL), to = max(crabs$CL), length.out = np)
nd <- expand.grid(RW = nd.x, CL = nd.y)
prd <- as.numeric(predict(lda.model, newdata = nd)$class)
plot(crabs[, c('RW','CL')], col = crabs$sex) # Make Sure X and Y are ordered correctly
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np),
levels = c(1,2), add = TRUE, drawlabels = FALSE)
# Manual LDA
# Compute Mean, covMat, priorProb
# iris2: assumed here to be the iris data restricted to the two covariates below plus Species
setosa = as.data.frame(iris2[iris2$Species == 'setosa',])
versicolor = as.data.frame(iris2[iris2$Species == 'versicolor',])
virginica = as.data.frame(iris2[iris2$Species == 'virginica',])
## Means, COvs and Prior Probs
cov_names = c('Sepal.Width', 'Sepal.Length')
mu_setosa = as.matrix(colMeans(setosa[,cov_names]))
mu_ver = as.matrix(colMeans(versicolor[,cov_names]))
mu_vir = as.matrix(colMeans(virginica[,cov_names]))
cov_setosa = cov(setosa[,cov_names])
cov_versi = cov(versicolor[,cov_names])
cov_virgi = cov(virginica[,cov_names])
prop_setosa = 50/150
prob_ver = 50/150
prop_virg = 1-(prop_setosa+prob_ver)
## Pooled Covariance: Bishop eq 4.78(In case of QDA dont use pooled)
S = as.matrix((prop_setosa*cov_setosa)+(prop_virg*cov_virgi)+(prob_ver*cov_versi))
s_inv = as.matrix(solve(S))
# Calculate Discriminant FUnction
pred = matrix(NA, nrow = 150, ncol = 3)
p_j = c(prop_setosa,prob_ver,prop_virg)
mus = cbind(mu_setosa,mu_ver, mu_vir)
for(i in 1:150){
for(j in 1:3){
pred[i,j] =as.matrix(iris2[i,cov_names])%*%s_inv%*%mus[,j] - 0.5*t(mus[,j])%*%s_inv%*%mus[,j] + log(p_j[j])
}
}
classes = c('setosa','versicolor','virginica')
actual_predictions = matrix(NA,nrow=150,ncol=1)
for(i in 1:150){
actual_predictions[i] =classes[which.max(pred[i,])]
}
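# The two variants mentioned in the notes above, sketched with the objects already computed (a sketch, not the original solution)
## Naive Bayes variant: keep only the diagonal of the pooled covariance S
S_nb = diag(diag(S))
s_inv_nb = solve(S_nb) # then reuse the same discriminant loop with s_inv_nb
## QDA variant: use each class' own covariance (cov_setosa, cov_versi, cov_virgi)
## and the quadratic discriminant of ISLR eq. 4.19 instead of the linear one above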
# Sampling From New Sample
library(mvtnorm)
## Generate synthetic data for each class i.e., x ~ N(mu_k,Sigma)
# Ex for setosa: rmvnorm(50, mean = mu_setosa, sigma = S)
syn_data = rbind(cbind(rmvnorm(50, mean = mu_setosa, sigma = S), 'setosa'),
cbind(rmvnorm(50, mean = mu_ver, sigma = S), 'versicolor'),
cbind(rmvnorm(50, mean = mu_vir, sigma = S), 'virginica'))
colnames(syn_data) = c('Sepal.Width','Sepal.Length','Species')
syn_data = as.data.frame(syn_data)
syn_data$Sepal.Width = as.numeric(syn_data$Sepal.Width)   # cbind() with a string coerced everything to character
syn_data$Sepal.Length = as.numeric(syn_data$Sepal.Length)
## Plot and comparing Synthetic Data with actual data
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(iris, plot(Sepal.Width, Sepal.Length, main = "Main", type = "n"))
with(subset(iris, Species=='setosa'), points(Sepal.Width, Sepal.Length, col='blue'))
with(subset(iris, (Species!='setosa' & Species !='versicolor')),
points(Sepal.Width, Sepal.Length, col='red'))
with(subset(iris, (Species!='setosa' & Species !='virginica')),
points(Sepal.Width, Sepal.Length, col='green'))
legend("topleft", pch = 1, col = c("blue", "red", 'green'), legend = c("setosa", "virginica",'versicolor'))
with(syn_data, plot(Sepal.Width, Sepal.Length, main = "Synthetic Data", type = "n"))
with(subset(syn_data, Species=='setosa'), points(Sepal.Width, Sepal.Length, col='blue'))
with(subset(syn_data, (Species!='setosa' & Species !='versicolor')),
points(Sepal.Width, Sepal.Length, col='red'))
with(subset(syn_data, (Species!='setosa' & Species !='virginica')),
points(Sepal.Width, Sepal.Length, col='green'))
legend("topleft", pch = 1, col = c("blue", "red", 'green'), legend = c("setosa", "virginica",'versicolor'))
QDA
- Refer LDA codes for Templates. For Manual application use LDA template and modify individual sigma, refer ISLR eq. 4.19
### QDA
form = as.formula('sex~CL+RW') # Step-2 formulation
qda.model = qda(form, crabs, prior= c(0.5,0.5)) # Step-3
crabs[,'qdapredictions'] = predict(qda.model,crabs)$class
# Misclassification Error
1 - mean(crabs$sex == crabs$qdapredictions)
# Draw discriminant Line
np = qda.model$N
nd.x <- seq(from = min(crabs$RW), to = max(crabs$RW), length.out = np)
nd.y <- seq(from = min(crabs$CL), to = max(crabs$CL), length.out = np)
nd <- expand.grid(RW = nd.x, CL = nd.y)
prd <- as.numeric(predict(qda.model, newdata = nd)$class)
plot(crabs[, c('RW','CL')], col = crabs$qdapredictions) # Make Sure X and Y are ordered correctly
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np),
levels = c(1,2), add = TRUE, drawlabels = FALSE)
Decision Trees: Classification/Regression
- The lower the residual mean deviance, the better the model. Residual Mean Deviance = $\frac{\text{deviance}}{n- |T_o|}$ where $n= \text{no. of data points}$ and $|T_o|=\text{no. of terminal nodes}$
- Deviance = $-2 \sum_m \sum_k n_{mk} \cdot \log \hat{p}_{mk}$ where $n_{mk}$ is the no. of obs in the $m$th terminal node belonging to the $k$th class
- The optimal tree can be selected using:
  - `cv.tree()`
  - Train/Test validation
- In Regression Trees, the deviance is the ‘Sum of Squared Errors’ for the tree!
# Classification
credit = read_excel(paste0(file_path,'creditscoring.xls'))
credit$good_bad = as.factor(credit$good_bad) # Labels need to be factored!
library(tree) # Call the Library
form = as.formula("good_bad~.") # Create the formula
# NOTE: split = 'gini' for gini-index
train.tree.dev = tree(form, data = train, split = 'deviance') # Fit the model on 'Train'
train.tree.gini = tree(form, data=train, split='gini')
summary(train.tree.dev);summary(train.tree.gini)
## Choosing Optimal Tree
# 1. By Penalizing
# ------------------
fit = tree(form, data = train)
cv.res = cv.tree(fit, FUN = prune.misclass) # By default FUN = 'deviance'
par(mfrow=c(1,2))
plot(cv.res$size, cv.res$dev, type="b",
col="red", main = 'CV-Error: Misclassification', xlab='Terminal Nodes')
plot(log(cv.res$k), cv.res$dev,
type="b", col="red", main = 'CV-Error: Misclassification', xlab='Cost-Complexity Parameter')
# 5 Terminal Node looks to provide the lowest Misclassification rate.
# 'cv.res$size' returns the no of terminal nodes.
# NOTE: 'cv.res$dev' will return the cross-validation error and not 'deviance' which in this case
#is misclassifcation error
# 2. Train/Val Method
# ------------------
trainScore = testScore = rep(0,9)
for(i in 2:9){
prunedTree = prune.tree(fit, best = i)
# Put 'type' = 'class' in case of predictions
pred = predict(prunedTree, newdata=val, type='tree')# Returns Tree
trainScore[i] = deviance(prunedTree)
testScore[i] = deviance(pred)
}
plot(2:9, trainScore[2:9], type="b", col="red", ylim = c(min(trainScore,testScore),
max(trainScore,testScore)),main='Train Score', ylab='Deviance', xlab='No. of Nodes')
points(2:9, testScore[2:9], type="b", col="blue")
opTimalTree = prune.tree(fit, best=5)
summary(opTimalTree)
# State the following in the report:Variables selected, # Terminal Nodes, Res. mean deviance, Misclassification error
# Estimate Misclassification Rate for Test data. NOTE: type='class'
testPred = predict(opTimalTree, newdata = test, type='class')
1 - mean(test$good_bad == testPred)
##### Regression Trees
formreg = as.formula("amount~.")
fitReg = tree(formreg, data = train)
trY = predict(fitReg, newdata = val)
sqrt(mean((val$amount - trY)^2)) ## RMSE
Splines: Using base R Functions only
- The example below uses only base R functions with `lm`.
- Step 1: For an order-M spline, we create up to M-1 polynomial (degree-of-freedom) variables:
M=5
X = matrix(data$Day %>% poly(degree = M-1, raw = TRUE), ncol = M-1) ## KEEP `raw=TRUE` when using poly
- Step 2: Create the truncated power basis function $h(x,\xi) = (x-\xi)^{M-1}_+$ for each knot!
# knot 𝜁=75
h.75 = ifelse(X[,1]>75, (X[,1]-75)^(M-1), 0)
# If 𝜁=80
h.80 = ifelse(X[,1]>80, (X[,1]-80)^(M-1), 0)
- Step 3: Combine the knot terms and the polynomial terms
# knot 𝜁=75
X = cbind(X,h.75)
# If knots: 𝜁=75, 𝜁=80
X = cbind(X, h.75, h.80)
# Use basis function approach to fit an order-5 spline with a single knot 𝜁=75 such that Day is the feature and Rate is the target variable.
data = read.csv2(paste0(file_path,"mortality_rate.csv"))
M=5
X = matrix(data$Day %>% poly(degree = M-1, raw = TRUE), ncol =M-1) ## KEEP `raw=TRUE` when using poly
h.75 = ifelse(X[,1]>75, (X[,1]-75)^(M-1), 0) # truncated power basis (Day - 75)^4_+
X = cbind(X,h.75)
colnames(X) = as.character(seq(1:5))
df = data.frame(cbind(X,data$Rate))
colnames(df)
spline.fit = lm("V6~.", data = df)
yPred = predict(spline.fit, newdata = df)
plot(x = data$Day, y = data$Rate)
points(x = data$Day,y = yPred, col=2)
Splines: Using splines library
# Using Library Splines
library(splines)
data = read.csv2(paste0(file_path,"mortality_rate.csv"))
f = as.formula("Rate~bs(Day, degree = 4, knots = 75)") ## degree = 4 because the spline order is 5
spline.fit = lm(f, data = data)
yPred = predict(spline.fit, newdata = data)
plot(x = data$Day, y = data$Rate)
points(x = data$Day,y = yPred, col=2)
Logistic Regression
- For prediction use `predict(glm.fits, type = 'response')`. This generates the probability $\textbf{p(Y=1|X)}$.
- $p(Y=1|\mathbf{w,X}) = \sigma{\left(\mathbf{w^T \cdot X} \right)}$ where $\sigma(a) = \frac{1}{1 + e^{-a}}$ with a linear decision boundary at $\mathbf{w^T\cdot X}= 0$
- Also $Y \sim Bern(\sigma(\mathbf{w^T \cdot X}))$
spam$target = ifelse(spam$Spam==1, 'S','H') # Create a binary variable
spam$target = as.factor(spam$target) # Convert to Factor
# Check the factor
contrasts(spam$target)
cov=paste(covariates, collapse='+')
form = as.formula(paste0('target~',cov))
glm.fits = glm(form, data=train, family="binomial") # Family should be 'binomial'
train_pred = predict(glm.fits, type = 'response') # P(Y=1|X) on training data
train_pred = ifelse(train_pred>0.5,'S','H') # Y=1 corresponds to 'S'; check contrasts(spam$target)
test_pred = predict(glm.fits, type = 'response', newdata = test)
test_pred = ifelse(test_pred>0.5,'S','H')
# Confusion matrix
table(train_pred, train$target)
# Misclassification
1 - mean(train_pred == train$target)
1 - mean(test_pred==test$target)
# TPR,FPR,ROC
pred_tree.prob = predict(opTimalTree, test, type='vector')[,'good'] # Get Class Probability
# Calculate TPR, FPR
get.rate = function(predicted.prob, actual, true_label, false_label, threshold){
prediction = ifelse(predicted.prob>threshold, true_label,false_label)
TPR = sum(prediction==true_label & actual == true_label)/sum(actual == true_label)
FPR = sum(prediction==true_label & actual == false_label)/ sum(actual == false_label)
return(c(TPR, FPR))
}
thresholds = seq(0.05,0.95,0.05)
cols = c('threshold','TPR', 'FPR')
tree.roc = matrix(data = NA, nrow = length(thresholds), ncol = 3)
colnames(tree.roc) = cols
for(i in 1:length(thresholds)){
threshold = thresholds[i]
# Tree
res.tree = get.rate(pred_tree.prob, test$good_bad, 'good', 'bad', threshold)
tree.roc[i, cols] = c(threshold, res.tree)
}
# PLOT ROC
plot(x = tree.roc[,'FPR'], y = tree.roc[,'TPR'], type='l',
xlab='FPR', ylab='TPR', col='red')
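- A hedged sketch to approximate the AUC from the `tree.roc` matrix above via the trapezoidal rule:
ord = order(tree.roc[,'FPR']) # sort the points by FPR
fpr = tree.roc[ord,'FPR']; tpr = tree.roc[ord,'TPR']
auc = sum(diff(fpr) * (head(tpr,-1) + tail(tpr,-1))/2) # trapezoidal rule
auc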
K Nearest Neighbors
- NOTE how 'train' and 'validation' are evaluated. Param `k` decides the no. of neighbors; `distance` sets the Minkowski distance parameter (`distance=2` for Euclidean).
- There is no `predict` method; pass the data to be predicted as the `test` argument.
library("kknn")
# k=30
# Target has value 0 or 1
k.30.train = kknn(form, train=train,test=train,k=30,distance=2, kernel='rectangular') # Creating Model based on Training data as Validation
k.30.test = kknn(form, train=train,test = test,k=30,distance=2, kernel='rectangular')
threshold = 0.5
train.pred = ifelse(k.30.train$fitted.values>threshold,'S','H' ) # Script to modify the prediction
test.pred = ifelse(k.30.test$fitted.values>threshold,'S','H' )
print(table(Predicted=train.pred, Actual = train$target)) # Training Confusion
print(table(Predicted=test.pred, Actual = test$target)) # Test Confusion
1 - mean(train.pred==train$target) # Misclassification rate
Ridge Regression
- In `glmnet`, `alpha=0` is for ridge.
- Make sure that the input data is in `matrix` format, i.e., `glmnet(x = as.matrix(scaled_data[,coefs[2:64]]), y = as.matrix(scaled_data[,'Fat']), alpha=0, lambda = grid)`
- Choosing good grid values: `grid <- 10 ^ seq (10, -2, length = 100)`
- Use `type.measure="class"` to get the misclassification error in case of a classification task.
- We use ‘MSE’ because:
  - It is the MLE of a normal distribution and we assume the errors to be normally distributed.
  - MSE is a quadratic function, i.e., it is easily differentiable and will always have a minimum.
- Effective degrees of freedom: $df(\lambda) = tr(X(X'X + \lambda I_p)^{-1} X')$
# Ridge Regression
grid <- 10 ^ seq (10, -2, length = 100)
ridge.fit = glmnet(x = as.matrix(scaled_data[,coefs[2:64]]), y = scaled_data[,'Fat'],alpha=0, lambda = grid)
# 10-fold CV (nfolds = 10)
cv.5 = cv.glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=0,lambda = grid,nfolds=10,type.measure = "mse")
# Plot CV Results
plot(log(cv.5$lambda), cv.5$cvm,pch=19,col='red',xlab = 'log(lambda)',ylab = cv.5$name)
# Optimal lambda
print(cv.5$lambda.min)
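- A small hedged follow-up (same assumed `train`/`test`/`coefs` objects): refit at the CV-optimal lambda and estimate the test MSE.
ridge.best = glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']), alpha = 0, lambda = cv.5$lambda.min)
yTestHat = predict(ridge.best, newx = as.matrix(test[,coefs]))
mean((test[,'Fat'] - yTestHat)^2) # test MSE at lambda.min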
Poisson Regression
- $p(y=Y|\lambda) = \frac{e^{-\lambda} \lambda^{Y}}{Y!}$
- For prediction of $\hat{Y}$ we use $\hat{Y} = e^{\mathbf{W^TX}}$, where $\log\lambda = \mathbf{W^TX}$ (log link)
yTestPred = predict(lasso.fit, newx = as.matrix(test[,c(1:8)]), type = "response")
yTestPred[1]
# Manual check: exp(W^T x) for the first test point
yHat = as.matrix(cbind(1,test[1,c(1:8)]))%*% matrix(coef(lasso.fit), ncol=1)
prediction = exp(yHat)
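- A hedged sketch of the plain GLM version (assumed `train`/`test` data with a count response `y`):
pois.fit = glm(y ~ ., data = train, family = poisson(link = "log"))
yHat = predict(pois.fit, newdata = test, type = "response") # equals exp(W^T x)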
Ridge Regression using Optim
- Use Block-1, Lab-1 Q2 as the premise. Make sure to SCALE the data!!!
- Make sure that `X`, `Y` and `Weights` are in `as.matrix` format.
- For the log-likelihood, Bishop eq. (3.11) is used (with `sigma` playing the role of the precision), and for the ridge penalty the equation from pg. 17 of Lecture 1D, Block 1.
# LogLikelihood = log P(Y|X,w,sigma); here 'sigma' is the precision (1/variance), cf. Bishop eq. 3.11
Loglikelihood = function(x,y,w,sigma){
wTX = as.matrix(x)%*%w
loss = sum((wTX - as.matrix(y))^2)
N = dim(x)[1]
t1 = (N*0.5*(log(sigma)))
t2 = (N*0.5*log(2*pi))
t3 = (sigma*0.5*loss)
logLik = t1-t2 - t3
return(logLik)
}
## Ridge Penalty
ridge = function(w,sigma,lambda, x,y){
log_lik = Loglikelihood(x,y,w,sigma)
penalty = lambda*(sum(w^2)) - log_lik
return(penalty)
}
## Ridge Degree of Freedom
ridge.df = function(x,lambda){
x = as.matrix(x)
N = dim(x)[2] # Covariate count
Ip = diag(N)
inv_data = solve(t(x)%*%x + (lambda*Ip))
df = x%*%inv_data%*%t(x)
trace = sum(diag(df))
return(trace)
}
### AIC criteria
calculate_AIC = function(x,y,w,sigma, lambda){
log_lik = Loglikelihood(x,y,w,sigma)
N = dim(x)[1] # No of data points
df = ridge.df(x,lambda)
aic = (-2*log_lik/N) + (2*df/N) #(-2*Log-likelihood/N) + 2*(df/N)
return(aic)
}
# Calculate Train,Test Loss
for(lambda in c(1,100,1000)){
N = dim(train[,covs])[2]
initW = as.matrix(rnorm(N))
sigma=runif(1)
lambda=lambda
x = as.matrix(train[,covs])
y = as.matrix(train[,'motor_UPDRS'])
xTest = as.matrix(test[,covs])
yTest = as.matrix(test[,'motor_UPDRS'])
ridgeOpt = optim(initW, ridge,sigma=sigma,lambda=lambda,x=x,y=y,method=c("BFGS"))
weights = as.matrix(ridgeOpt$par)
train_mse = mean((x%*%weights - y)^2);test_mse = mean((xTest%*%weights - yTest)^2)
print(paste0('Lambda: ', lambda));print(paste0('Train Error: ', train_mse))
print(calculate_AIC(x,y,w=weights,sigma, lambda)) # For AIC Calculation
}
LASSO
- Standardize the data!
- Almost the same as ridge, but `df` (degrees of freedom) is also covered.
- An example of how to generate synthetic data is below.
# Lasso Regression
lasso.fit = glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1,family="gaussian")
# Plotting Lambda V/S Coefficients
plot(lasso.fit, xvar = "lambda", label = TRUE)
# Plot LASSO V/S df
plot(x = log(lasso.fit$lambda), y = lasso.fit$df)
## Cross Validation
lasso.grid <- 10 ^ seq (10, -2, length = 100)
# 10-fold CV (nfolds = 10)
cv.5 = cv.glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1,family="gaussian",
lambda = lasso.grid,nfolds=10,type.measure = "mse")
# Plot CV Results: MSE V/S Log(Lambda)
plot(log(cv.5$lambda), cv.5$cvm,pch=19,col='red',xlab = 'log(lambda)',ylab = cv.5$name)
# Optimal lambda
print(cv.5$lambda.min)
### Generating synthetic test data
X = as.matrix(test[,coefs]);Y = as.matrix(test[,'Fat']) # Actual Test Data
lasso.fit = glmnet(x = as.matrix(train[,coefs]), y = as.matrix(train[,'Fat']),alpha=1, lambda = cv.5$lambda.min) # LASSO with best fit
betas = as.vector(coef(lasso.fit)[-1,]) # Remove intercept and fetch beta coefficients
residual = train$Fat - as.matrix(train[,coefs])%*% betas;stdev = sd(residual) # Residual and st dev
yTestHat = predict(lasso.fit, newx = X, type='response')
n = dim(Y)[1];set.seed(12345)
yPred = rnorm(n, yTestHat,stdev)
# PLot Actual vs generated data
plot(Y, lty=1, col=3)
lines(yPred, type='b', col=2)
- LASSO: (best lambda statistically)
  - Checking whether one fit is statistically significantly better: `print(lasso.fit)`
  - If %Dev explained is similar, choose the one with the lower ‘df’.
df1 = read.csv2(paste0(file_path, "Dailytemperature.csv"))
df2 = as.data.frame(df1)
temp = df2$Temperature
power_factor = seq(-50,50,1)
for(i in power_factor){
df2[,paste0("sin_",as.character(i))] = sin(0.5^i * temp)
df2[,paste0("cos_",as.character(i))] = cos(0.5^i * temp)
}
## LASSO
library(glmnet)
cols = colnames(df2)[3:204]
x = as.matrix(df2[,cols])
y = as.matrix(temp)
lasso.fit = glmnet(x=x, y=y, family="gaussian", alpha=1) # fit over the default lambda path
## Dependence of DF on Penalty factor
plot(x = lasso.fit$lambda, y = lasso.fit$df, main="Degrees of Freedom V/S Lambda", xlab = "Lambda", ylab = "DF")
## CV for dependence of MSE on Log-Penalty
lambdas = seq(-5,5,by = 0.01)
lambdas = exp(lambdas) ## Ensure -4 is there in log(lambda)
## any(log(lambdas) == -4)
cv.fit = cv.glmnet(x=x,y=y, family="gaussian", type.measure = "mse", nfolds = 10, lambda = lambdas)
plot(cv.fit) ## Plots MSE V/S log(lambda)
print(cv.fit$lambda.1se) ## Check the Standard Error
## Checking For Statistically Significantly better
## If %Dev explained is similar, choose one with lower 'df'
lasso3 = glmnet(x=x,y=y, family = "gaussian", alpha=1, lambda = exp(-4))
print(lasso3)
lasso.min = glmnet(x=x,y=y, family = "gaussian", alpha=1, lambda = cv.fit$lambda.min)
print(lasso.min)
## Extracting Coefficients
coef.exact <- coef(lasso.min, exact = TRUE, x=x, y=y)
## No. of Non Zero Features
length(coef.exact[which(coef.exact!=0)]) # Includes Intercept!
# Prediction
predTemp = predict(lasso.min, newx = x, type = "link")
predTemp = predict(lasso.min, newx = x, type = "response") ## OR
Elastic-Net
- `alpha` is between `0` and `1`.
library(glmnet)
Xtrain = train[,-4703]
Ytrain = train[,4703]
Xtest = test[, -4703]
Ytest = test[,4703]
grid <- 10 ^ seq (10, -2, length = 100)
# Classification Cross-Validation
en.fit = cv.glmnet(x = as.matrix(Xtrain), y = Ytrain, alpha=0.5, lambda = grid, family = 'binomial', type.measure = 'class')
best_lambda = en.fit$lambda.min
model = glmnet(x = as.matrix(Xtrain), y = Ytrain, alpha=0.5, lambda = best_lambda, family = 'binomial')
yTest.elastic = predict(model, as.matrix(Xtest), type = "class")
1 - mean(Ytest==yTest.elastic)
print(paste0("Regularized Features: ", model$df)) # No. of non-zero features
SVM
- An example is also within the Nested Cross-Validation section.
library(kernlab)
form = as.formula("Conference~.")
svm.fit = ksvm(form, data = train, kernel = "vanilladot")
yTest.svm = predict(svm.fit, test)
1 - mean(test$Conference == yTest.svm) # misclassification rate
Experimental Estimation of Bagging Error
D = 10
mu = as.matrix(runif(D, -10, 10))
cov_diag = runif(D, 1,2)
sd_matrix = as.matrix(diag(cov_diag))
error_data = rmvnorm(100, mean = mu, sigma = sd_matrix)
error_data = as.matrix(error_data)
colnames(error_data)=paste0("Model ", 1:10)
## Average Error
rmse_models = vector(length = 10)
for(i in 1:10){
rmse_models[i] = mean(error_data[,i]^2) ## eq 14.9
}
e2 = error_data^2
boot_data = rowMeans(e2)/10
colMeans(e2)%>%mean ## eq 14.10
mean(boot_data) ## eq 14.14
hist(boot_data) # Bagging Error Histogram
## part-3
variances = matrix(NA, nrow = 100, ncol=11)
## Assuming cov(M1,M2) = 0
for(i in 1:100){
# Error Data Generation
B = 10
mu = as.matrix(runif(B, -10, 10))
cov_diag = runif(B, 1,2)
sd_matrix = as.matrix(diag(cov_diag))
error_data = rmvnorm(100, mean = mu, sigma = sd_matrix)
## Bootstrap Variance
error_data = as.matrix(error_data)
## Lecture: 2(a) pg. 10
error_bag = rowSums(error_data) / B^2
variances[i,1] = var(error_bag) ## Bagging Error Variance
# Model Avg Error Variance
variances[i,2:11] = apply(error_data, FUN=var, MARGIN = 2) ## Variance of Average Individual Error
}
colnames(variances) = c("Bootstrap", paste0("Model " ,1:10))
par(mfrow = c(3,4))
for(i in 1:11){
boxplot(variances[,i], main = colnames(variances)[i])
}
NN from Scratch
- NOTE: X and Y need to be transposed, and also normalize X!
- The example is for classification.
- In case of regression:
  - Use `A2 = Z2 # Linear Output` within the `forwardPropagation` function.
  - Modify the cost function `computeCost`:
computeCost = function(X,Y,cache){
m = dim(X)[2] # No. of datapoints
A2 = cache$A2 # Output
cost <- mean((A2-Y)^2) # MSE for regression
return (cost)
}
data = read.csv2(paste0(file_path,"spambase_exam.csv"))
X = as.matrix(data[,c(1:57)], ncol = 57)
X = scale(X)
Y = as.matrix(data$Spam)
## Now transpose X and Y: Important Step before Implementation
X = t(X)
Y = t(Y)
# 1. Define Architecture
# 2. Initialize Model Params
getLayerSize = function(X,Y,hidden_neurons, train=TRUE){
n_x = dim(X)[1]
n_h = hidden_neurons
n_y = dim(Y)[1]
size = list(
"n_x" = n_x,
"n_h" = n_h,
"n_y" = n_y
)
return(size)
}
layers_size = getLayerSize(X,Y,10)
initializeParameters = function(X, list_layer_size){
m = dim(data.matrix(X))[2] ## No of Training Points
n_x <- list_layer_size$n_x
n_h <- list_layer_size$n_h
n_y <- list_layer_size$n_y
W1 = matrix(runif(n_h*n_x), nrow = n_h, ncol=n_x, byrow = TRUE)*0.01
b1 = matrix(rep(0,n_h), nrow = n_h)
W2 = matrix(runif(n_y*n_h), nrow = n_y, ncol = n_h, byrow = TRUE)*0.01
b2 <- matrix(rep(0, n_y), nrow = n_y)
params <- list("W1" = W1,
"b1" = b1,
"W2" = W2,
"b2" = b2)
return (params)
}
pars = initializeParameters(X, layers_size)
## Activations
sigmoid <- function(x){
return(1 / (1 + exp(-x)))
}
## Using Tanh in the hidden layer
## Sigmoid in the Output Layer
forwardPropagation = function(X, params, list_layer_size){
m <- dim(X)[2]
n_h <- list_layer_size$n_h
n_y <- list_layer_size$n_y
W1 <- params$W1
b1 <- params$b1
W2 <- params$W2
b2 <- params$b2
b1_new <- matrix(rep(b1, m), nrow = n_h)
b2_new <- matrix(rep(b2, m), nrow = n_y)
# Input to Hidden
Z1 <- W1 %*% X + b1_new
A1 <- tanh(Z1)
# A1 <- sigmoid(Z1) # For Sigmoid in input layer
# Hidden to Output
Z2 <- W2 %*% A1 + b2_new
# A2 = Z2 # Linear Output
A2 <- sigmoid(Z2) # For Sigmoid
cache <- list("Z1" = Z1,
"A1" = A1,
"Z2" = Z2,
"A2" = A2)
return (cache)
}
fwd_prop <- forwardPropagation(X, pars, layers_size)
lapply(fwd_prop, function(x) dim(x))
## Classification
computeCost = function(X,Y,cache){
m = dim(X)[2] # No.of Datapoints
A2 = cache$A2 # Output
# Binary Loss
logprobs = (log(A2) * Y) + (log(1-A2) * (1-Y))
cost <- -sum(logprobs/m)
return (cost)
}
## If Regression
#computeCost = function(X,Y,cache){
# m = dim(X)[2] # No.of Datapoints
# A2 = cache$A2 # Output
# cost <- mean((A2-Y)^2)
# return (cost)
#}
cost <- computeCost(X, Y, fwd_prop)
cost
backwardPropagation <- function(X, Y, cache, params, list_layer_size){
m <- dim(X)[2]
n_x <- list_layer_size$n_x
n_h <- list_layer_size$n_h
n_y <- list_layer_size$n_y
A2 <- cache$A2
A1 <- cache$A1
W2 <- params$W2
# Output Layer
dZ2 = A2 - Y
dW2 = 1/m * (dZ2 %*% t(A1))
db2 <- matrix(1/m * sum(dZ2), nrow = n_y)
db2_new <- matrix(rep(db2, m), nrow = n_y)
# Input Layer
dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2) # If tanh
# dZ1 <- (t(W2) %*% dZ2) * (A1*(1-A1)) # If sigmoid
dW1 <- 1/m * (dZ1 %*% t(X))
db1 <- matrix(1/m * sum(dZ1), nrow = n_h)
db1_new <- matrix(rep(db1, m), nrow = n_h)
grads <- list("dW1" = dW1,
"db1" = db1,
"dW2" = dW2,
"db2" = db2)
return(grads)
}
back_prop <- backwardPropagation(X, Y, fwd_prop, pars, layers_size)
lapply(back_prop, function(x) dim(x))
updateParameters = function(grads,params, learning_rate){
W1 <- params$W1
b1 <- params$b1
W2 <- params$W2
b2 <- params$b2
dW1 <- grads$dW1
db1 <- grads$db1
dW2 <- grads$dW2
db2 <- grads$db2
W1 <- W1 - learning_rate * dW1
b1 <- b1 - learning_rate * db1
W2 <- W2 - learning_rate * dW2
b2 <- b2 - learning_rate * db2
updated_params <- list("W1" = W1,
"b1" = b1,
"W2" = W2,
"b2" = b2)
return (updated_params)
}
update_params <- updateParameters(back_prop, pars, learning_rate = 0.01)
lapply(update_params, function(x) dim(x))
## Training
trainModel <- function(X, y, num_iteration, hidden_neurons, lr){
# X = t(X) # Transpose if not done already.
# y = t(y)
layer_size <- getLayerSize(X, y, hidden_neurons)
init_params <- initializeParameters(X, layer_size)
cost_history <- c()
for (i in 1:num_iteration) {
fwd_prop <- forwardPropagation(X, init_params, layer_size)
cost <- computeCost(X, y, fwd_prop)
back_prop <- backwardPropagation(X, y, fwd_prop, init_params, layer_size)
update_params <- updateParameters(back_prop, init_params, learning_rate = lr)
init_params <- update_params
cost_history <- c(cost_history, cost)
if (i %% 50 == 0) cat("Iteration", i, " | Cost: ", cost, "\n")
}
model_out <- list("updated_params" = update_params,
"cost_hist" = cost_history)
return (model_out)
}
## Learning
EPOCHS = 500
HIDDEN_NEURONS = 10
LEARNING_RATE = 0.1
train_model <- trainModel(X, Y, hidden_neurons = HIDDEN_NEURONS, num_iteration = EPOCHS, lr = LEARNING_RATE)
# Making Prediction
makePrediction <- function(X, y, hidden_neurons){
# X = t(X) # Transpose if doing in 'trainModel'.
# y = t(y)
layer_size <- getLayerSize(X, y, hidden_neurons)
params <- train_model$updated_params
fwd_prop <- forwardPropagation(X, params, layer_size)
pred <- fwd_prop$A2
return (pred)
}
outputProb = makePrediction(X,Y,HIDDEN_NEURONS)
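- A small hedged follow-up: hard labels and the training misclassification rate from `outputProb` (note that `Y` is still in its transposed 1 x N form here).
yPredClass = ifelse(outputProb > 0.5, 1, 0)
1 - mean(yPredClass == Y) # training misclassification rate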
Unsupervised Learning
Nearest Shrunken Centroid
- Need to scale the `x` data
- Class probability estimate: $\hat{p}_k(x^*) = \frac{\mathrm{e}^{\delta_k (x^*)}}{\sum_{l=1}^{K} \mathrm{e}^{\delta_l (x^*)}}$
# To obtain the class probability estimate
pamr.predict(model, newx, threshold, type = "posterior", prior = model$prior, threshold.scale = model$threshold.scale)
library(pamr)
data = (read.csv2(paste0(file_path, "data.csv")) )
data$Conference = as.factor(data$Conference) # Categorical as factor
# Give the rows a name
# Train/Test Split ('id' obtained as in section 1.3)
train = data[id,]
test = data[-id,]
# Transpose the X matrix!!!
x = t(train[,c(1:4702)])
y = train[['Conference']]
# Modify Data
mydata = list(x=x,y=as.factor(y),geneid= as.character(1:nrow(x)), genenames= rownames(x))
model = pamr.train(mydata, threshold = seq(0,9,0.1))
# Cross Validation
#### SET SEEED!!!!!
set.seed(12345)
cv.results = pamr.cv(model, data = mydata, nfold = 5)
# Plot Cross Validation Results
# THreshold for which error is minimum and parameters are less
pamr.plotcv(cv.results)
# Compute Confusion matrix for a threshold
pamr.confusion(model, threshold = 1.9)
# Plot cross validated class probabilities by class
pamr.plotcvprob(model,data=mydata, threshold = 1.9)
# Estimate False discovery rates and plot them
fdr.obj = pamr.fdr(model, mydata)
plot(x= fdr.obj$results[,1], y = fdr.obj$results[,4], xlab="Threshold",
ylab="False Discovery Rate")
plot(x= fdr.obj$results[,2], y = fdr.obj$results[,4], xlab="Significant Genes",
ylab="False Discovery Rate")
# List Significant Genes: ALso find their names
res = pamr.listgenes(model, mydata,threshold = 1.9)
names = as.integer(res[1:10,1])
colnames(train)[names] # Get Column Names
# Plot Centroids
pamr.plotcen(model,mydata, threshold = 1.9)
# Prediction
x = t(test[,c(1:4702)])
y = test[['Conference']]
mydata2 = list(x=x,y=as.factor(y),geneid= as.character(1:nrow(x)), genenames= rownames(x))
yTestHat = pamr.predict(model, mydata2$x, threshold=1.9)
# Class Posterior Probability
yTest.Prob = pamr.predict(model, mydata2$x, threshold=1.9, type="posterior")
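# Test misclassification for the chosen threshold (a small added check)
1 - mean(as.character(yTestHat) == as.character(mydata2$y))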
PCA
- For PC1: $e_{1} \cdot$ column-1 + $e_{2} \cdot$ column-2 + $\dots$, where $e$ is the eigenvector belonging to $\lambda_1$
- R-equivalent: `PC1 = x2%*%as.matrix(eig.x$vectors[,1])` where `x2` is the $N \times D$ data matrix and `eig.x$vectors[,1]` is the $D \times 1$ eigenvector for $\lambda_1$
### Manual PCA
df = mtcars
X = df[, c("mpg", "hp")]
x2 = scale(X) # Standardizing
x3 = sweep(x2, 2, colMeans(x2)) ## Subtracting the column means (already ~0 after scale())
cov.x = cov(x3)
eig.x = eigen(cov.x)
eig.x$vectors
## Ensure that x and y are standardized
plot(x = x2[,1], y = x2[,2], cex=2, xlab="x-component",ylab="y component",
main="Actual Data VS PCA Vectors")
## Draw PC1
arrows(x0=0,y0=0,x1 = eig.x$vectors[1,1], y1=eig.x$vectors[2,1], col="red", lwd = 5)
arrows(x0=0,y0=0,x1 = -eig.x$vectors[1,1], y1=-eig.x$vectors[2,1], col="red", lwd = 5)
## Draw PC2
arrows(x0=0,y0=0,x1 = eig.x$vectors[1,2], y1=eig.x$vectors[2,2], col="blue", lwd = 5)
arrows(x0=0,y0=0,x1 = -eig.x$vectors[1,2], y1=-eig.x$vectors[2,2], col="blue", lwd = 5)
## Reduced Data V/S PCA
PC1 = x2%*%as.matrix(eig.x$vectors[,1])
PC2 = x2%*%as.matrix(eig.x$vectors[,2])
# Reduced Data
## Here PC's need to be rotated
plot(x = PC1, y = PC2, cex=2, xlab="PC1",ylab="PC2",
main="Data Along PC1 and PC2", xlim = c(-3,3), ylim = c(-3,3))
arrows(x0=0,y0=0,x1 = -3, y1=0, col="red", lwd = 5)
arrows(x0=0,y0=0,x1 = 3, y1=0, col="red", lwd = 5)
arrows(x0=0,y0=0,x1 =0, y1=2, col="blue", lwd = 5)
arrows(x0=0,y0=0,x1 = 0, y1=-2, col="blue", lwd = 5)
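## Hedged cross-check of the manual PCA against prcomp() (sign flips between the two are expected)
pc = prcomp(x2, center = FALSE, scale. = FALSE) # x2 is already centered/scaled above
head(cbind(manual = PC1, prcomp = pc$x[,1])) # scores should agree up to sign
summary(pc) # proportion of variance explained per component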
##### Communities data example
communities = read.csv(paste0(file_path, 'communities.csv'))
covs = colnames(communities)[1:100]
# Manual PCA
# Step-1: Scaling
PCA = scale(communities[,covs])
PCA = cbind(PCA, communities$ViolentCrimesPerPop)
cor.mat = cor(PCA) # Correlation Matrix
cov.mat = cov(PCA) # Covariance Matrix
# Step-2: Obtain eigenvalues and eigenvectors of covariance matrix
eig = eigen(cov.mat) # Eigen values and vectors
# Step-3: Calculate Total and Proportional Variance explained
lambdas = eig$values
# First 34 will explain 95% Variance in the data
cum_var = cumsum(lambdas/sum(lambdas))
sum(cum_var<0.95)
# Proportion of variance explained by 1st two Principal Components
lambdas[1:2]/sum(lambdas)
pc.result = princomp(PCA)
# pc.result$center: specifies the 'mean' of the variables used for PCA
# pc.result$sdev: Specifies 'Standard Deviation' of the variables used for PCA
# Calculate Total and Proportional Variance explained
pc.var = pc.result$sdev^2
pve = pc.var/sum(pc.var)
# Plot 1st 2 Principal Components
library(ggplot2)
PC1 = pc.result$scores[,1]
PC2 = pc.result$scores[,2]
PCA2 = data.frame(VioLentCrimesPerPop = communities$ViolentCrimesPerPop, PC1,PC2)
ggplot(PCA2, aes(PC1,PC2, colour = VioLentCrimesPerPop))+
geom_point() + theme_bw()
# Second Order Polynomial on PC1
X = poly(PC1,2, raw=TRUE) # MAKE SURE 'raw=TRUE'
Y = communities$ViolentCrimesPerPop
data = data.frame(cbind(X,Y))
# Prediction
form = as.formula("Y~.")
fit = lm(form, data=data)
Yhat = predict(fit, newdata = data)
plot(x = data$X1, y = data$Y,
xlab = "PC-1", ylab = 'Violent Crimes', cex=0.5)
points(x = data$X1, y = Yhat,col='red', cex=0.5)
legend("topleft", pch = 1, col = c("black", "red"), legend = c("Actual", "Predicted"),
cex=0.5)
# Run Parametric Bootstrap (f1, f2, mle and rng defined as in the 'Parametric Bootstrap' sections below, adapted to this polynomial model)
result = boot(data, statistic = f1, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
result.PI = boot(data, statistic = f2, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
# Plot Confidence Interval
env = envelope(result, level=0.95)
env.PI = envelope(result.PI,level=0.95)
plot(x = data$X1, y = data$Y, ylim=c(-1,2),
xlab = "PC-1", ylab = 'Violent Crimes', cex=0.5)
points(x = data$X1, y = Yhat,col=2, cex=0.5, lty=1)
points(x = data$X1, y = env$point[2,], col = 3)
points(x = data$X1, y = env$point[1,], col = 3)
Other Topics
Non-Parametric Bootstrap
- Assumes no distribution for the generated data.
- USE the WHOLE TRAINING DATA
library(boot)
state = read.csv(paste0(file_path,"State.csv"), header = TRUE,sep = ";")
to_numeric = colnames(state)[2:6]
for(col in to_numeric){
state[,col] = as.numeric(gsub(",",".", state[,col])) # String modification
}
# 1. Re-order the Data based on co-variates
state.order = state[order(state$MET),] # Ascending Order
library(tree) # tree() / prune.tree() used below
form = as.formula("EX~MET") # EX modeled on MET
# 2.1 Define function to 'fit' and 'predict' on the sampled data
f = function(data,index){
data1 = data[index,]
fit = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
prunedTree = prune.tree(fit, best=3)
Yhat = predict(prunedTree, newdata = state.order)
return(Yhat)
}
# 2.2 For Prediction Intervals
f2 = function(data,index){
data1 = data[index,]
fit = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
prunedTree = prune.tree(fit, best=3)
Yhat = predict(prunedTree, newdata = state.order)
n = length(state$EX) # equal to #.Training Data points
# Now generate Y from the distribution
predictedEX = rnorm(n, Yhat, sd(residuals(prunedTree)))
return(predictedEX) # Return this
}
# 3. Perform Non-Parametric Bootstrap
result = boot(state.order, f, R=1000) # Confidence Interval
result.PI = boot(state.order, f2, R=1000) # Prediction Interval
# 4. Plot Confidence Intervals
env = envelope(result,level=0.95)# For Setting the CI
env.PI = envelope(result.PI, level=0.95)
# Prediction with CI & PI
par(mfrow=c(1,1))
plot(x = state.order$MET, y = state.order$EX, col = "red")
points(x = state.order$MET, y = trY, col = "blue",type='l') # trY: point predictions of a pruned tree fitted to the full data
points(x = state.order$MET, y = env$point[2,], col = "blue",type='l')
points(x = state.order$MET, y = env$point[1,], col = "blue",type='l')
points(x = state.order$MET, y = env.PI$point[2,], col = "green",type='l')
points(x = state.order$MET, y = env.PI$point[1,], col = "green",type='l')
Parametric Bootstrap(Regression)
- Need to modify `rng` depending on whether it is a classification or regression task!
- For classification look into: Parametric Bootstrap(Classification)
library(boot)
state = read.csv(paste0(file_path,"State.csv"), header = TRUE,sep = ";")
to_numeric = colnames(state)[2:6]
for(col in to_numeric){
state[,col] = as.numeric(gsub(",",".", state[,col])) # String modification
}
# 1. Compute the 'mle', i.e., fit the chosen model type to the full data
mle = tree(form, state,control = tree.control(nobs= dim(state)[1],minsize=8))
residuals(mle) # Best way to obtain model residuals
# 2. Define 'rng' for data generation
rng = function(data, mle){
data1 = data.frame(EX = data$EX,MET= data$MET)
n = length(data$EX) # For Whole Data
# Generate New Data: rnorm will be changed for classification task
data1$EX = rnorm(n, predict(mle, newdata = data1), sd(residuals(mle)))
return(data1)
}
# Statistic function: should return the estimator(Yhat)
f1 = function(data1){
res = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8)) # Fit the Model
Yhat = predict(res, newdata = state) # newdata = input data
return(Yhat)
}
# Run Parametric Bootstrap
result = boot(state, statistic = f1, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
## Prediction Interval
f2 = function(data1){
# Fit the Model
res = tree(form, data1,control = tree.control(nobs= dim(state)[1],minsize=8))
Yhat = predict(res, newdata = state) # newdata = input data
n = length(state$EX) # equal to no of Training Data points
# Now generate Y from the distribution
predictedEX = rnorm(n, Yhat, sd(residuals(mle)))
return(predictedEX) # Return this
}
# Plot Confidence Interval
env = envelope(result, level=0.95)
result.PI = boot(state, statistic = f2, R=1000, mle = mle,
ran.gen = rng,sim = 'parametric')
env.PI = envelope(result.PI,level=0.95)
# Prediction with CI& PI
plot(x = state.order$MET, y = state.order$EX, col = "red",
ylim = c(min(c(env.PI$point[2,],env.PI$point[1,])),max(c(env.PI$point[2,],env.PI$point[1,]))))
points(x = state.order$MET, y = trY, col = "blue",type='l')
points(x = state.order$MET, y = env$point[2,], col = "red",type='l')
points(x = state.order$MET, y = env$point[1,], col = "red",type='l')
points(x = state.order$MET, y = env.PI$point[2,], col = "green",type='l') # For Prediction Interval
points(x = state.order$MET, y = env.PI$point[1,], col = "green",type='l') # For Prediction Interval
Parametric Bootstrap(Classification)
- Pay attention to the `rng` function
# Perform Parametric Bootstrap
# Since it is a classification task: 'rng' and f1 need to be modified
data = data.frame(PC1 = PC_one,Y = digits$Y)
data$Y = as.factor(data$Y)
form = as.formula("Y~.")
# 1. Define 'mle'
mle = glm(form, data = data, family='binomial')
# 2. Define 'rng' for data generation
rng = function(data, mle){
data1 = data.frame(PC1 = data[,1], Y= data[,2])
n = length(data[,2]) # For Whole Data
# Generate New Data: rbinom for classification task
data1[,2] = rbinom(n, size = 1, prob = predict(mle, data1, type = 'response'))
return(data1)
}
# 3. Statistic function: should return the estimator(Yhat)
f1 = function(data1){
# Fit the Model
fit = glm(form, data = data1, family='binomial')
# newdata = input data
Yhat = predict(fit, newdata = data) # Get the Link
# Since Predicting Confidence Interval of "Probabilities", it cannot be negative(ISLR pg.291-292)
Yhat = exp(Yhat)/(1 + exp(Yhat))
return(Yhat)
}
# Run Parametric Bootstrap
result = boot(data, statistic = f1, R=1000, mle = mle,ran.gen = rng,sim = 'parametric')
env = envelope(result,level=0.95)# For Setting the CI
# Plot results
plot(x = data$PC1, y = yT, col = "red", ylim=c(0,1))
points(x = data$PC1, y = env$point[2,], col = "blue")
points(x = data$PC1, y = env$point[1,], col = "blue")
Bonferroni Method
- When we perform several Tests simultaneously, we need to make an adjustment for risk.
- $H_{0j}$ : Treatment has no effect on $j$(j will be column or row)
- $H_{1j}$ : Treatment has an effect on $j$
- We reject $H_{0j}$ if $p_j < \alpha/M$. Not useful for large M for which Bonferroni gives conservative results
- Choose the features for which the null hypothesis is rejected, i.e., those with a significant effect (a sketch follows below).
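- A hedged sketch of the Bonferroni decision rule, reusing the sorted p-values `p.CD4` computed in the Benjamini-Hochberg example below:
alpha = 0.05
M = length(p.CD4)
bonf_rejected = which(p.CD4 < alpha/M) # null rejected: these features show a significant effect
names(bonf_rejected)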
Benjamini-Hochberg
- $H_{0j}$ : treatment has no effect on $j$(j will be column or row)
- $H_{1j}$ : Treatment has an effect on $j$
- ISLR (Version 2) has an example on regression-type data. Here we deal with classification data:
- In classification data, divide the data by class and conduct the BH test for each class individually.
gene_data = read.csv(paste0(file_path,"geneexp.csv"))
gene_data$CellType = as.factor(gene_data$CellType)
which(sapply(gene_data, class) !='integer') # Check column-type
expnames = colnames(gene_data)[2:2086] # X-column names
gene_data[['isCD4']] = gene_data$CellType == "CD4" # Conditional Column for CD4 Cell type
gene_data$isCD8 = gene_data$CellType == "CD8"
gene_data$isCD19 = gene_data$CellType == "CD19"
## Benjamini-Hochberg
# 1. Specify q, the level at which to control the FDR
q = 0.05
# 2. Compute p values for m-null hypothesis
get_p_value = function(name,gene_type, data){
f = as.formula(paste0(name,"~",gene_type)) #
return(t.test(f,data = data)$p.value)
}
## CD4
p.CD4 = sapply(expnames, FUN=get_p_value, gene_type = "isCD4", data = gene_data)
## CD8
p.CD8 = sapply(expnames, FUN=get_p_value, gene_type = "isCD8", data = gene_data)
## CD19
p.CD19 = sapply(expnames, FUN=get_p_value, gene_type = "isCD19", data = gene_data)
# 3. Order m p-values!!!!!
p.CD4 = sort(p.CD4) # IMPORTANT step
p.CD8 = sort(p.CD8)
p.CD19 = sort(p.CD19)
# 4. Define L
BH = function(pv, q){
m = length(pv)
# Note: strictly L = max{ j : p_(j) < q*j/m }; this loop stops at the first violation
for(i in 1:m){
if(pv[i]>q*i/m)
break
}
return(i-1)
}
# 5. Reject all null hypothesis for which p_j <= p_{(L)}
BH(p.CD4, q=0.05) ## No of Rejected Genes for CD4 Cell type(Choose these as important ones)
# Important Ones
names(p.CD4)[1:BH(p.CD4, q=0.05)] # These are the important ones
BH(p.CD8, q=0.05) ## No of Rejected Genes for CD8 Cell type
BH(p.CD19, q=0.05) ## No of Rejected Genes for CD19 Cell type
# Plotting
BHLine = q*(1:length(p.CD4))/length(p.CD4) # Line for the BH threshold q*j/m
plot(x=1:length(p.CD4), y=p.CD4, xlab='Index',
ylab="P-Value", main = "CD4")
lines(x=1:length(p.CD4),y = BHLine, col="blue", type='l')
N-Fold CV
- Calculate the loss using `loss`
- Calculate the N-fold loss for the data using `n_fold_cv`
- `get.subset` creates the various column subsets. NOTE: does not create subsets for a single predictor.
loss = function(x,y,xTest,yTest){
# Function to obtain regression coefficient using x,y
# Predict y_hat for 'xTest' and then calculate loss
x = as.matrix(cbind(intercept=1,x)) ### Add Intercept
xTest = as.matrix(cbind(intercept=1,xTest))
xtx = solve(t(x)%*%x)
beta = xtx%*%(t(x)%*%y)
beta = as.matrix(t(beta))
colnames(beta) = colnames(x)
# loss = mean(abs(y - y_hat)) ## MAE
yHat = xTest%*%t(beta)
testLoss = sqrt(mean((yTest-yHat)**2))
return(testLoss)
}
n_fold_cv = function(X,Y, n_fold){
#Randomly shuffle the data
ran_ind = sample(nrow(X))
X = as.matrix(X[ran_ind,])
Y = Y[ran_ind]
#Create n equally size folds
folds <- cut(seq(1,nrow(X)),breaks=n_fold,labels=FALSE)
#Perform n-fold cross validation
cv_loss = c()
for(i in 1:n_fold){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
xTest <- X[testIndexes,]
yTest <- Y[testIndexes]
xTrain <- X[-testIndexes,]
yTrain <- Y[-testIndexes]
l = loss(x=xTrain, y = yTrain,xTest = xTest, yTest = yTest)
cv_loss = c(l,cv_loss)
}
cv_loss = mean(cv_loss)
return(cv_loss)
}
get.subset = function(X,Y,n_folds){
N = dim(X)[2] # No. of Co-variates
col_comb = function(i,n){t(combn(1:n,i))} # Function to select 'i' out of 'n'
best_cols = col_comb(N,N)[1,] # Start with all columns
loss_best = Inf # Setting a very high initial value
# i accounts for no of columns
for(i in 1:N){
combos = col_comb(i=i,n=N)
rows = dim(combos)[1]
for(row in 1:rows){
columns = combos[row,]
X_new = as.matrix(X[, columns])
# HERE n-fold validation
curr_loss = n_fold_cv(X_new, Y, n_folds)
if (curr_loss<loss_best){
loss_best = curr_loss
best_cols = columns
}
}
}
return(list(best = best_cols, loss =loss_best))
}
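- A hedged usage sketch on a toy data set (assumes the `loss` and `n_fold_cv` definitions above are loaded; names are illustrative):
set.seed(12345)
best = get.subset(X = mtcars[, c("hp","wt","qsec")], Y = mtcars$mpg, n_folds = 5)
best$best # column indices of the best subset
best$loss # its cross-validated RMSE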
Variable Selection using StepAIC
- Can be used for both `AIC` and `BIC`.
- Useful when asked which variables are the important ones.
- ‘AIC’ is an information criterion and can use all the provided data. Thus, we can utilize all the provided data as information for estimating the generalization capabilities of a model.
## Variable Selection using StepAIC
library(MASS)
d3 = data[,-1] # Sample Dataset
form = "Fat~." # Set Formula
fit = lm(form, data = as.data.frame(d3))
step = stepAIC(fit, direction = 'both') # For AIC in both the directions.
step.BIC = stepAIC(fit, k=log(nrow(d3))) # BIC Method
# To get selected variables
get.coef.names = function(model, step.func){
STEP_COEF = vector("numeric",length(coefficients(model))) # create an empty vector of zeros
names(STEP_COEF) = names(coefficients(model)) # same names
STEP_COEF[names(coefficients(step.func))] = as.numeric(coefficients(step.func)) # fill in the ones kept by stepAIC
coef_names = names(STEP_COEF[round(STEP_COEF, 0) !=0])
return(coef_names)
}
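# Usage sketch with the objects above
get.coef.names(fit, step) # variables kept by AIC
get.coef.names(fit, step.BIC) # variables kept by BIC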
Nested Cross-Validation
- Randomly split the data into $K$ `outerfolds`.
- For each hyper-parameter combination there will be an average configuration error `config_error`, calculated using only the `Val` data within the inner loop.
- The inner loop consists of an inner CV repeated `innerfolds` times.
- Look at the average evaluation for each config, i.e., `config_error`, and choose the hyper-parameters with the lowest `config_error`.
- Now, with the chosen hyper-parameters, divide the data into `Train` and `Test` for each outer fold and save the test loss within `outer_cv_loss` for each outer fold.
- The average of the `outer_cv_loss` is the model's `Generalization` error.
library(kernlab) # provides ksvm() and the 'spam' data
data(spam)
outerfolds = 5 # No. of Outer Folds
innerfolds = 3 # No. of Inner FOlds
form = as.formula('type~.')
ran_ind = sample(nrow(spam)) # Reshuffle the Data
data = spam[ran_ind,]
#Create n equally size folds depending on 'outerfolds' value
folds <- cut(seq(1,nrow(data)),breaks=outerfolds,labels=FALSE)
#Perform n fold cross validation
outer_cv_loss = NULL # Each Models Error
params = seq(0.1,2.5,0.5)
# Outer CV
for(outerfold in 1:outerfolds){
testIndexes <- which(folds==outerfold,arr.ind=TRUE)
Test <- data[testIndexes,]
Train <- data[-testIndexes,]
# Hyper-Parameter Search: Use only Validation Data Error
config_error = NULL
# Each 'i' should contain the required param(s) combo
for(i in 1:length(params)){
# Inner-CV: folds are re-cut on the 'Train' data only
if_error = NULL # Inner-Fold error for 'i' Param configuration
inner_folds <- cut(seq(1,nrow(Train)),breaks=innerfolds,labels=FALSE)
for(fold in 1:innerfolds){
ValIndexes <- which(inner_folds==fold,arr.ind=TRUE)
Val <- Train[ValIndexes,] # kth 'inner-fold' Validation Data
TraIN <- Train[-ValIndexes,]
fit = ksvm(form,data=TraIN,kernel="rbfdot",kpar=list(sigma=0.05),C=params[i])
yValHat = predict(fit, Val)
misc_error = 1-mean(Val$type == yValHat) # Misclassification Error
if_error = c(if_error, misc_error)
}
config_error[i] = mean(if_error) # Look at avg eval for each config
}
best_i = params[which.min(config_error)] # Choose the config with the best config_error
fit.outer = ksvm(form,data=Train,kernel="rbfdot",kpar=list(sigma=0.05),C=best_i)
yTestHat = predict(fit.outer, Test) # Prediction on k-th outerfold Test Data
of_error = 1-mean(Test$type == yTestHat) # Outer Fold Test Error
outer_cv_loss = c(outer_cv_loss, of_error) # Its Mean is the 'Generalized Error'
}
generalization_error = mean(outer_cv_loss) # Generalization error of the model.
Laplace Prior
- $Y_i = \mathbf{x}_i \cdot \boldsymbol{\beta} + \epsilon_i, \ i = 1,\dots,n$ where $\epsilon_i \sim \textit{Laplace}(0,b)$; then the negative log-likelihood is given as,
- $ - \ln p\left( Y| \mathbf{X}, \mathbf{\beta}, b \right) = n \ln (\textit{2b}) + \frac{1}{b} \sum_{i=1}^n \left|\mathbf{x}_i \cdot \mathbf{\beta} - Y_i \right|$
- Creating distribution for error term $\epsilon$ where $p(\epsilon) = \frac{e^{-|\epsilon|}}{2}$
- `e_s` draws from the Laplace error distribution with location $\mu = 0$ and scale $b = 1$:
# Drawing errors from the Laplace distribution
error = seq(-5,5, length = 10000)
dlaplace = function(x){0.5*exp(-abs(x))}
derror = sapply(error, FUN = dlaplace)
e_s = sample(error, size=25000, prob = derror, replace=TRUE)
hist(e_s, breaks = 1000, freq = F)
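- A hedged sketch of the negative log-likelihood above as an R function, e.g. usable as `fn` in `optim()` (`X`, `Y`, `b` are assumed names):
laplace_nll = function(beta, X, Y, b = 1){
n = length(Y)
n*log(2*b) + sum(abs(X %*% beta - Y))/b # n*ln(2b) + (1/b) * sum |x_i' beta - Y_i|
}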
Entropy
- The function $x \cdot \ln x$ (which enters the entropy with a minus sign) is strictly convex.
x = seq(0,1, length = 10000)
x = x[x>0]
plot(x=x, y = x*log(x), col=2)