Aim
The purpose of Assignment 3 is to conduct a cluster analysis: we aim to group data objects so that objects in the same group are similar to one another and dissimilar to objects in other groups. The popular k-means clustering algorithm cannot be applied in this assignment because the data include both numerical and categorical variables, and a mean cannot be defined for categorical variables. k-medoids is a related algorithm that partitions data into k distinct clusters by finding medoids that minimize the sum of dissimilarities between each data point and its nearest medoid. The medoid is the most centrally located object in a cluster. The dissimilarity can be calculated with measures such as the Jaccard distance (one minus the Jaccard coefficient) and Gower’s distance. Please watch the videos in the links below for more details.
k-medoids clustering method:
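To make the two ideas in the Aim concrete, here is a minimal sketch of a Gower-style dissimilarity and of picking a medoid. The toy records, the column names, and the helper names `gower_matrix` and `medoid` are assumptions made for illustration; they are not taken from a2_data.csv or from the a3.ipynb notebook.

```python
# Toy records (hypothetical values, not taken from a2_data.csv):
# (age, bmi, sex, smoker) -> two numeric and two categorical attributes.
records = [
    (25, 22.0, "F", "no"),
    (30, 24.0, "F", "no"),
    (62, 31.0, "M", "yes"),
    (58, 29.5, "M", "yes"),
    (45, 27.0, "F", "yes"),
]
NUMERIC = [0, 1]      # column positions of the numeric attributes
CATEGORICAL = [2, 3]  # column positions of the categorical attributes


def gower_matrix(rows, numeric, categorical):
    """Pairwise Gower dissimilarity: range-scaled absolute difference for
    numeric columns, simple mismatch (0 or 1) for categorical columns,
    averaged over all columns."""
    ranges = {}
    for j in numeric:
        col = [r[j] for r in rows]
        ranges[j] = (max(col) - min(col)) or 1.0  # avoid division by zero
    n_cols = len(numeric) + len(categorical)
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = sum(abs(rows[a][j] - rows[b][j]) / ranges[j] for j in numeric)
            total += sum(rows[a][j] != rows[b][j] for j in categorical)
            dist[a][b] = dist[b][a] = total / n_cols
    return dist


def medoid(dist, members):
    """Index of the member with the smallest total dissimilarity to the others."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


D = gower_matrix(records, NUMERIC, CATEGORICAL)
center = medoid(D, range(len(records)))
print("Gower dissimilarity between record 0 and record 1:", round(D[0][1], 3))
print("Medoid of all five records:", center, records[center])
```

Unlike a mean, the medoid is always one of the actual records, which is why it remains well defined when some attributes are categorical.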
Dataset
The data (a2_data.csv) for this assignment have been prepared for you. The data preprocessing steps explained in the linear algorithms notebook (week 4), including combining CHBMIX42 and BMINDX53, removing outliers, and creating dummy variables, have already been applied. This is the same dataset you used for Assignment 1. You should use this file as your data source. Because I selected data instances for patients with asthma, the focus is on finding groups of asthma patients that share certain characteristics. We may then be able to take specific actions for each group depending on its characteristics.
What to Do for the Assignment 3
The primary goal of this assignment is to determine the best number of clusters and to interpret the result of the cluster analysis.
Based on the k-medoids implementation provided in the “a3.ipynb” notebook, extend the implementation to perform parameter tuning that tests different numbers of clusters. For the parameter tuning implementation, please refer to the “K-Means Clustering.ipynb” notebook for week 10. You need to use the silhouette coefficient, an intrinsic quality metric, for quality evaluation. Finding the best number of clusters correctly is worth 70% of the grade.
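The tuning loop described above can be sketched as follows. This is only an illustration under stated assumptions: the toy dissimilarity matrix, the candidate range of k, and the helpers `init_medoids`, `k_medoids`, and `silhouette` are hypothetical stand-ins written for this example, not the implementation in a3.ipynb. In the assignment, the matrix would be the mixed-type dissimilarity matrix computed from the data.

```python
# Toy precomputed dissimilarity matrix (hypothetical): three well-separated
# groups of points on a line, with absolute difference as the dissimilarity.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.5, 8.7]
D = [[abs(p - q) for q in points] for p in points]


def init_medoids(dist, k):
    """Deterministic start: the most central point first, then repeatedly the
    point farthest from the medoids chosen so far."""
    n = len(dist)
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    return medoids


def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids (assign points, then re-pick each medoid).
    A stand-in for the notebook's implementation, not a copy of it."""
    n = len(dist)
    medoids = init_medoids(dist, k)
    labels = [0] * n
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed dissimilarity matrix."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    scores = []
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(dist[i][j] for j in own) / (len(own) - 1)  # mean intra-cluster
        b = min(sum(dist[i][j] for j in other) / len(other)  # nearest other cluster
                for label, other in clusters.items() if label != c)
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


# Parameter tuning: score each candidate k and keep the one with the
# highest mean silhouette coefficient.
scores = {k: silhouette(D, k_medoids(D, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

The silhouette coefficient ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. It needs at least two clusters, so the candidate range starts at k = 2.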
What patterns did you identify? Compare the attribute values with the cluster labels. Did you find any attribute that is particularly associated with the cluster label? Please specify the attribute that contributes most to the cluster labels and explain why this pattern occurs. This part is worth 30% of the grade.
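One possible way to compare attribute values with cluster labels is to summarize each attribute per cluster. The sketch below uses made-up rows, made-up labels, and hypothetical attribute names (`age`, `smoker`); it only illustrates the kind of per-cluster profile you might build, not an expected result.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows and cluster labels (not real assignment output).
rows = [
    {"age": 25, "smoker": "no"},
    {"age": 30, "smoker": "no"},
    {"age": 35, "smoker": "yes"},
    {"age": 58, "smoker": "yes"},
    {"age": 62, "smoker": "yes"},
    {"age": 60, "smoker": "yes"},
]
labels = [0, 0, 0, 1, 1, 1]

# Group the rows by their cluster label.
by_cluster = defaultdict(list)
for row, label in zip(rows, labels):
    by_cluster[label].append(row)

# Per-cluster profile: size, mean of the numeric attribute, and the share of
# each category of the categorical attribute.
profile = {}
for label, members in sorted(by_cluster.items()):
    smoker_counts = Counter(m["smoker"] for m in members)
    profile[label] = {
        "n": len(members),
        "mean_age": mean(m["age"] for m in members),
        "smoker_share": {v: c / len(members) for v, c in smoker_counts.items()},
    }
    print(label, profile[label])
```

An attribute whose mean or category shares differ sharply between clusters is a candidate for the attribute most associated with the cluster label.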
How to write
You need to report the best number of clusters and answer the second question in “What to Do for the Assignment 3.” You can write this report in an MS Word file. You also need to submit the Jupyter notebook you used to identify the best number of clusters.
Grading Criteria
Your answer to the second question in “What to Do for the Assignment 3” accounts for 30% of the grade for the assignment. I will grade correctness as well as comprehensiveness. The other 70% depends on whether you properly conducted the parameter tuning to identify the best number of clusters and on the correctness of the reported best number of clusters.