CSCI 4144 - Assignment 3 - Clustering and classification


  • Assignment title: CSCI 4144 - Data Mining and Data Warehousing Assignment 3 - Clustering and classification
  • Course: Dalhousie University CSCI 4144 Data Mining and Data Warehousing
  • Time to complete: 4 days

Section 1 - Clustering

In this section, you will compare your own implementation of k-means against scikit-learn’s implementation on two small datasets.

Please see scikit-learn’s documentation on K-means here (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). The answer to your question on how to use it is almost certainly there.
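As a rough illustration of the comparison this section asks for, here is a minimal sketch of a hand-written k-means (plain Lloyd's algorithm) next to scikit-learn's KMeans. The helper name simple_kmeans, the synthetic blob data, and the use of the adjusted Rand index to compare the two clusterings are all assumptions made for this example; the assignment itself does not prescribe them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def simple_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; a hypothetical helper, not part of the assignment."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its points;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in data (an assumption): three separated 2-D blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

own_labels, own_centroids = simple_kmeans(X, k=3)
sk = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary, so compare the partitions with a
# permutation-invariant score rather than comparing labels directly.
agreement = adjusted_rand_score(own_labels, sk.labels_)
print(f"Adjusted Rand index between the two clusterings: {agreement:.3f}")
```

A score of 1.0 would mean both implementations produced the same partition. Because this sketch uses a single random initialisation, it can settle in a different local optimum than scikit-learn, which by default uses k-means++ initialisation.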

Dataset

The first dataset, GiveMeSomeCredit (https://www.kaggle.com/c/GiveMeSomeCredit), is a small dataset that a bank could use to decide whether to extend credit to customers. Every row in this dataset represents a hypothetical person. The column meanings should be self-explanatory from their names, but are not important in this assignment. The labels are in column SeriousDlqin2yrs; all other columns are observations.
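Separating the label column from the observations might look like the sketch below. The tiny DataFrame is a toy stand-in with made-up values (an assumption; the real rows come from the Kaggle CSV), and filling missing values with the column median is just one possible choice, not something the assignment requires.

```python
import pandas as pd

# Toy stand-in rows with made-up values; in practice the frame
# would be loaded from the Kaggle file, e.g. with pd.read_csv.
df = pd.DataFrame({
    "SeriousDlqin2yrs": [1, 0, 0, 1],
    "age": [45, 40, 38, 30],
    "MonthlyIncome": [9120.0, 2600.0, None, 3300.0],
})

label_col = "SeriousDlqin2yrs"
y = df[label_col]                   # labels
X = df.drop(columns=[label_col])    # all other columns are observations

# scikit-learn's KMeans rejects NaN inputs, so any missing values
# have to be filled (or the rows dropped) before clustering.
X = X.fillna(X.median())

print(X)
print(y.tolist())
```

The same pattern applies to the Iris data, with species as the label column.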

The second dataset is the famous Iris flower dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set), which is commonly used to introduce concepts of clustering and classification. We already saw this dataset in Lec03.pca.ipynb, which you have already downloaded and played with extensively, so you are already familiar with it. The labels are in column species; all other columns are observations.
Here is one of the types of flower represented in this dataset:

...

Section 2 - Classification

Here, we will actually make use of the labels in our two datasets to do some very simple classification. We will use three built-in classifiers from sklearn, calling fit() to learn the models on training sets and predict() to make predictions on test sets.
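The fit()/predict() workflow could be sketched as follows. The source does not name the three classifiers, so the ones used here (k-nearest neighbours, a decision tree, and Gaussian naive Bayes) are assumptions chosen only for illustration, as are the 70/30 split and the copy of Iris bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Iris as shipped with scikit-learn: 150 rows, 4 features, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as a test set, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Illustrative choice of classifiers (an assumption, not the assignment's list).
classifiers = {
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "Gaussian naive Bayes": GaussianNB(),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)        # learn the model on the training set
    y_pred = clf.predict(X_test)     # predict labels for the unseen test set
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: test accuracy = {accuracies[name]:.3f}")
```

Every scikit-learn classifier exposes the same fit() and predict() methods, so swapping in a different model only changes the entries of the dictionary.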

...

Bonus [5 Marks]

  • We will give up to 5 bonus marks for innovative work going substantially beyond the minimal requirements.
  • These marks can make up for marks lost in other sections of the assignment, but your overall mark for this assignment cannot exceed 100%.
  • You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the lead TA before embarking on such exploration, and the value of bonus work is left to the discretion of the markers.
  • Be sure to document your work sufficiently for the markers to understand what you’re doing. You can add additional Code or MarkDown cells below, as necessary.
  • Certainly, the rest of the assignment takes higher priority.

Author: 量子数字
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-ND 4.0. Please credit the source 量子数字 when reposting.