##### 🍐 我们总结了PSU代写中——CS代写的经典案例，如果你有任何机器学习代写的需要，可以随时联络我们。CoursePear™ From @2009。

Instruction

CMPSC 448: Machine Learning and AI

Homework 4 (Sunday April 3, 11:59 PM)

This HW includes both theory and implementation problems. Please note,

• Your code must work with Python 3.5+ (you may install the Anaconda distribution of Python)
• You need to submit a report in PDF including all written deliverable and plots via Gradescope, and all implementation codes via Canvas so one could regenerate your results.
• For non-linear classifier problem, you need to submit a Jupyter notebook for each algorithm and all complementary python codes so one could reproduce your results. Submit your code in Problem4*.py and replace * witn the name of classifier.
• For clustering problem, you are NOT allowed to use the implementation from scikit-learn or import it and should submit your own implementation. Submit your code in Problem5.py. The submission for this homework should be a single PDF file with solutions of theory problems, figures, and any text explaining your results of programming problems via Gradescope and a single ‘zip‘ file containing all of the relevant code via Canvas. SVM & Kernel Methods Problem 1 (10 points) In this problem we would like to compare the solutions of hard and soft SVMs on a linearly seperable dataset. Let n > 1 be a fixed number. Is it true that there exists C > 0 such that for every sample S of n training examples with binary label, which are linearly separable, the hard-SVM and the soft-SVM (with parameter C) solutions will return exactly the same weight vector. Justify your answer. (Hint: consider n = 2, d = 1 and S = {(x1, y1) , (x2, y2)}. Let a > 0 and consider x1 = a, y1 = 1, x2 = −a, y2 = −1. Derive the optimal solution for hard and soft SVM and compare the results.) Problem 2 (10 points) In this problem you are asked to find the explicit mapping function Φ(·) for the Gaussian kernel k (x, z) = exp 􏰀− ∥x − z∥2 /2γ2􏰁 by showing that it can be expressed as the inner product of an infinite-dimensional feature space by a mapping Φ. a) Assume x, z ∈ R1 (only one feature per training example). Use the Taylor expansion of the kernel function κ (to be precise, the exp(·) function) and derive the explicit mapping Φ the kernel correspond to. b) Answer the above question in general case where x, z ∈ Rd. Assume ∥x∥2 = ∥z∥2 = 1.

1

Problem 3 (20 points) In this problem we aim at utilizing the kerenl trick in Ridge regression and propose its kernalized version. Recall the Ridge regression training objective function:

f(w) = ∥Xw − y∥2 + λ∥w∥2

for λ > 0.
a) Show that for w to be a minimizer of f(w), we must have X⊤Xw + λIw = X⊤y, where X ∈ Rn×d is the data matrix with n samples each with d features, and I is iden- tity matrix (please check lectures for more details). Show that the minimizer of f(w) is

w = 􏰀X⊤X + λI􏰁−1 X⊤y. Justify that the matrix X⊤X + λI is invertible, for λ > 0. (Hint: use SVD decomposition of data matrix X = UΣV ⊤ and show all the eigenvalues of X⊤X + λI are larger than zero).

b) Rewrite X⊤Xw + λIw = X⊤y as w = λ1 􏰀X⊤y − X⊤Xw􏰁. Based on this, show that we can write w = X⊤α for some α ∈ Rn, and give an expression for α.

c) Based on the fact that w = X⊤α, explain why we say w is ”in the span of the data.”

d) Show that α = 􏰀λI + XX⊤􏰁−1 y. Note that XX⊤ is the n × n Gram (kernel) matrix for the standard vector dot product. (Hint: Replace w by X⊤α in the expression for α, and then solve for α.)

e) Give a kernelized expression for the Xw, the predicted values on the training points. (Hint: Replace w by X⊤α and α by its expression in terms of the kernel matrix XX⊤.)

f) Give an expression for the prediction w⊤∗ x for a test sample x, not in the training set, where w∗ is the optimal solution. The expression should only involve x via inner products training data samples xi,i = 1,…,n.

g) Based on (f), propose a kernalized version of the Ridge regression.

Boosting

Problem 4 (15 points) Consider the AdaBoost algorithm we discussed in the class 1. Ad- aBoost is an example of ensemble classifiers where the weights in next round are decided based on the training error of the weak classifier learned on the current weighted training set. We wish to run the AdaBoost on the dataset provided in Table 1.

a) Assume we choose the following decision stump f1 (a shallow tree with a single decision node), as the first predictor (i.e., when training instances are weighted uniformly):

1Please read reading assignement from textbook and (optional) the Introduction chapter from Boosting book for detailed explanation https://bit.ly/2HvKObl

2

Instance Color Size Shape Edible? D1 Yellow Small Round Yes

Table 1: Mushroom data with 16 instances, three categorical features, and binary labels.

```  if(Color is Yellow):
predict  Eadible = Yes
```
```  else:
predict Eadible = No
```

What would be the weight of f1 in final ensemble classifier (i.e., α1 in f(x) = 􏰂Ki=1 αifi(x))?

b) After computing f1, we proceed to next round of AdaBoost. We begin by recomputing data weights depending on the error of f1 and whether a point was (mis)classified by f1. What is the weight of each instance in second boosting iteration, i.e., after the points have been re-weighted? Please note that the weights across the training set are to be uniformly initialized.

c) In AdaBoost, would you stop the iteration if the error rate of the current weak classifier on the weighted training data is 0?

Experiment with non-linear classifiers
Problem 5 (40 points) For this problem, you will need to learn to use software libraries for

at least two of the following non-linear classifier types:

• Boosted Decision Trees (i.e., boosting with decision trees as weak learner)

• Random Forests

• Support Vector Machines with Gaussian Kernel

All of these are available in scikit-learn, although you may also use other external libraries (e.g., XGBoost 2 for boosted decision trees and LibSVM for SVMs). You are

2A simple blog post on how to use XGBoost please check this. 3

welcome to implement learning algorithms for these classifiers yourself, but this is neither required nor recommended.

Pick two different types of non-linear classifiers from above for classification of Adult dataset. You can the download the data from a9a in libSVM data repository. The a9a data set comes with two files: the training data file a9a with 32,561 samples each with 123 features, and a9a.t with 16,281 test samples. Note that a9a data is in LibSVM for- mat. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ….. This format is especially suitable for sparse datasets. Note that scikit-learn includes utility functions (e.g., load svmlight file in example code below) for loading datasets in the LibSVM format.

For each of learning algorithms, you will need to set various hyperparameters (e.g., the type of kernel and regularization parameter for SVM, tree method, max depth, number of weak classifiers, etc for XGBoost, number of estimators and min impurity decrease for Random Forests). Often there are defaults that make a good starting point, but you may need to adjust at least some of them to get good performance. Use hold-out validation or K-fold cross-validation to do this (scikit-learn has nice features to accomplish this, e.g., you may use train test split to split data into train and test data and sklearn.model selection for K-fold cross validation). Do not make any hyperparameter choices (or any other similar choices) based on the test set! You should only compute the test error rates after you have settled on hyperparameter settings and trained your two final classifiers.

What to submit (in PDF file and Jupyter/python codes):

1. Names of the two classifiers you opt to learn and a brief description of each algorithm and how it works.
2. Description of your training methodology, with enough details so that another ma- chine learning enthusiast can reproduce the your results. You need to submit all the codes (python and Jupyter notebooks) to reproduce your code. Please use pre- fix Problem4*.py where you need to replace ‘*‘ with the name of non-linear classifier for your coding files.
3. The list of hyperparameters and breif description of each hyperparameter you tuned in training, their default values, and the final hyperparameter settings you use to get the best result.
4. Training error rates, hold-out or cross-validation error rates, and test error rates for your two final classifiers. You are also encouraged to report other settings you tried with the accuracy it achieved (please make a table with a column with each hyperparamter and accuracy of configuration of parameters.
5. Please do your best to obtain the best achievable accuracy for each classifier on given dataset. Note: The amount of effort you put on tuning the parameters will be deter- mined based on the discrepancy between the accuracy you get and the best achievable accuracy on a9a data for each algorithm.

Parameters to be tuned for XGBoost: 1. n estimators

4

2. max depth
3. lambda
4. learning rate 5. missing
6. objective

Parameters to be tuned for SVM: 1. kernel type
2. gamma
3. C

Parameters to be tuned for Random Forests: 1. n estimators
2. bootstrap
3. max depth

4. min impurity decrease

5. min samples leaf Example code to use XGBoost

```from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_svmlight_file
```
```from xgboost import XGBClassifier
```
```# load data in LibSVM sparse data format
```
```X, y = load_svmlight_file("a9a")
```
```# split data into train and test sets
```
```seed = 6
test_size = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
```
```# fit model on training data
# for simplicity we fit based on default values of hyperparameters
```
```model = XGBClassifier()
model.fit(X_train, y_train)
```
```# make predictions for test data
```

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred] # evaluate predictions
accuracy = accuracy_score(y_test, predictions) print(“Accuracy: %.2f%%” % (accuracy * 100.0))

5

Clustering

Problem 6 (45 points) For this problem, you will implement the k-means++ algorithm in Python. You will then use it to cluster Iris dataset from the UCI Machine Learning Repository. The data is contained the iris.data file under ”Data Folder”, while the file iris.names contains a description of the data. The features x are given as the first four comma-separated values in each row in the data file. The labels y are the last entry in each row, but you do NOT need the class label as it is unknown for clustering.

1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)
5. class: {Iris Setosa, Iris Versicolour, Iris Virginica}

You need to,

• Create a new data set with two features by computing the ratio of raw features (x =
(x1, x2) where x1 = (sepal length/sepal width) and x2 = (petal length/petal width)) and plot the data to observe the clusters in data by yourself (use class label to color
the data points for better illustration of clusters).
• Implement the k-means++ algorithm. You are provided with the skeleton of the code with main functions to be implemented (Problem6.py file in assignment directory). Submit the source code (documented!) of your implementation.
• Cluster the modified Iris dataset with with two features explained above. Run your algorithm 50 times over the data with different values of clusters k = 1,2,…,5 and plot the accuracies (x and y axes should be the number of clusters and the clustering objective, respectively).
• Based on the above plot, decide the number of final clusters and justify your answer. For the chosen number of clusters, 1. Create a plot showing how objective changes with number of iterations.
2. Create a plot with the data colored by assignment, and the cluster centers.