nus/cs2109s/labs/final 2/main.ipynb
2024-04-28 15:58:30 +08:00

606 lines
48 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "7d017333",
"metadata": {},
"source": [
"# Final Assessment: Making Prediction on a Dataset without Domain Knowledge"
]
},
{
"cell_type": "markdown",
"id": "09648cba",
"metadata": {},
"source": [
"**Release Date:** Saturday, 27 April 2024, 20:00\n",
"\n",
"**Due Date:** Sunday, 28 April 2024, 23:59"
]
},
{
"cell_type": "markdown",
"id": "022cb4cd",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"In this assessment, your goal is to put into practice what you have learnt over the semester by building a model that can accurately predict the outcomes for new data samples from the same distribution from a given data set. Because the dataset's origin and domain are unknown, you cannot rely on any prior domain knowledge. Instead, you will need to analyze and comprehend the data based solely on its inherent characteristics and the given training samples. You may employ any techniques to uncover the data's nature, such as visualization, analysis, or experimentation.\n",
"\n",
"In general, the techniques learnt in class should be sufficient, but you may choose to explore new ones. For fairness and for practical reasons, your solution will be implemented with the libraries that we have used in CS2109S. You may find \"[ML for People in Hurry](https://colab.research.google.com/drive/1yHecT3sXevjdko9KmRSIOOJxXdf43xd6?usp=sharing)\" shared in Lecture 12 useful.\n",
"\n",
"Once you have some understanding of the data, you can create a model (or a set of models) that takes into account the data's characteristics. Your model can perform a series of preprocessing steps before training or making predictions. You will need to contemplate how you will train your model, what objective you will use, and how to evaluate its performance properly. Additionally, you may want to conduct a hyperparameter search to enhance your model's performance.\n",
"\n",
"There is no model answer to the problem. There are many possible approaches and many of them might perform equally well. \n",
"\n",
"### Constraints:\n",
"\n",
"* You are only allowed to use <span style=\"color:red\">**neural networks**</span> as the hypothesis class of your supervised learning models.\n",
"* Your model is allowed to take <span style=\"color:red\">**no more than 3 minutes**</span> on the Coursemology server and use <span style=\"color:red\">**at most 1 GB of memory**</span>.\n",
"* <span style=\"color:red\">**WARNING:</span> The use of any other hypothesis class is <span style=\"color:red\">strictly prohibited</span> and the failure to comply with this requirement <span style=\"color:red\">will result in zero marks for this assessment**</span>.\n",
"* You are allowed and you might find it useful to use unsupervised learning methods such as PCA and K-means clustering to preprocess your data, but preprocessing is optional.\n",
"\n",
"### Required Files:\n",
"\n",
"* `main.ipynb` (this Jupyter notebook)\n",
"* `scratchpad.ipynb` (Jupyter notebook for your \"scratch paper\") \n",
"* `environment.yml` (environment file)\n",
"* `data.npy` (dataset)\n",
"\n",
"\n",
"\n",
"### Policy on Citing AI:\n",
"\n",
"As explained in lecture, we have little choice but to allow the use of AI. Here are some guidelines for citing AI-generated content:\n",
"\n",
"**ChatGPT**: When referencing content generated by ChatGPT, Bing, or something similar, please include the link(s) to the conversation(s) in `scratchpad.ipynb` as part of the supporting evidence for your work.\n",
"\n",
"**GitHub Copilot**: For GitHub Copilot or something similar, please include one of the following in `scratchpad.ipynb`: (i) If your code was generated based on a prompt, include the prompt used along with a link to the generated code (if possible). (ii) If the code was produced through autocomplete without a specific prompt, kindly provide a screenshot of the autocomplete suggestion.\n",
"\n",
"We have caught students in the past attempting to explain away instances of plagiarism by falsely claiming that they received input from ChatGPT. It is important to note that these attempts were identified and flagged by our plagiarism checker. **Disciplinary action was taken against these students** for their actions.\n",
"\n",
"We would like to remind all students that academic integrity is of utmost importance, especially during final assessment. Any attempt to plagiarize will result in **severe disciplinary action**. We urge you to complete this final assessment with honesty and integrity, and refrain from resorting to any unethical practices.\n",
"\n",
"Please be aware that **our plagiarism detection tools are highly effective** and any attempt to cheat or plagiarize will likely be detected. It is in your best interest to demonstrate your own knowledge and skills in completing the final assessment.\n",
"\n",
"### Honour Code: \n",
"\n",
"* Note that plagiarism will not be condoned! \n",
"* Also, because this is an individual assessment, you **MUST NOT** discuss your approach or solution with your classmates. If you are caught doing so, you will be subject to disciplinary action because it will considered an act of academic dishonestly. \n",
"* You may check the internet for references, but if you submit code that is copied directly and subsequently modified from an online source, you need to provide us with the reference and URL! For good measure, please cite all references in your final solution write-up and include all your working in `scratchpad.ipynb`.\n",
"\n",
"### Latest Updates, Announcements, and Clarifications:\n",
"\n",
"We recommend regularly checking the [official forum thread for the final exam](https://coursemology.org/courses/2714/forums/homework/topics/official-final-assessment) to stay up-to-date with the latest updates and announcements. Additionally, you can use this thread to request clarifications regarding any aspect of the exam."
]
},
{
"cell_type": "markdown",
"id": "ef4d73fd",
"metadata": {},
"source": [
"## Jupyter Notebooks\n",
"\n",
"In this final assessment, you will be working with two Jupyter notebooks: `main.ipynb` and `scratchpad.ipynb`.\n",
"\n",
"### `main.ipynb`\n",
"\n",
"This notebook serves as the primary guide to completing the final exam. It includes instructions, guidelines, and a final model code template that you must complete and copy-paste into Coursemology for submission. You can think of this as the question paper. \n",
"\n",
"It is important to note that `main.ipynb` **should not be uploaded**.\n",
"\n",
"### `scratchpad.ipynb`\n",
"\n",
"`scratchpad.ipynb` is your \"scratch paper,\" where you will perform various analyses and transformations on the data we provided you with and experiment with different techniques. You can think of this as your answer booklet. \n",
"\n",
"It is important to document your thoughts in this notebook so that we can understand how you arrived at your final model. Note that your working will not be graded directly. Instead, you will have to summarize your work in a report of not more than 1,000 words. You should follow the format/sections given to help make the grading easier for the profs. Some sections are optional. If they are not applicable, just indicate \"Nil.\"\n",
"\n",
"Mostly, the working submitted will be used as a sanity check that you actually did the work you claim to have done, instead of copying it from some online source, or worse, copied from another student. If we find 2 submissions are suspiciously similar, we will check the working for the 2 students. It is highly implausible that 2 students will come up with the same working if they did not discuss their approach/answers. \n",
"\n",
"Once you have found your best model, you should copy and paste the model and its necessary components from `scratchpad.ipynb` to Coursemology for submission. It is important to **ensure that your model adheres to the model code template provided in `main.ipynb`**. You need to make sure that the code you submit in Coursemology is self-contained. We strongly encourage you to test your model locally using `main.ipynb` by copying the necessary parts from your scratchpad to `main.ipynb` and check that your model runs correctly, to avoid wasting your attempt on Coursemology. You will be giving only a limited number of attempts on Coursemology because it is extreme memory-intensive (aka expensive) to run these models and we don't want students to be spamming our servers. \n",
"\n",
"**Remember to upload `scratchpad.ipynb` when submitting your final exam.** "
]
},
{
"cell_type": "markdown",
"id": "c51cc898",
"metadata": {},
"source": [
"## Compute Resources\n",
"\n",
"### IMPORTANT: Limitations to Compute Resources\n",
"\n",
"At some level, machine learning can generally do better if we throw more resources into the problem. However for reasons of fairness and practicality, we need to limit the compute resources that you can use for this problem. The way that we decide to quantify this is to limit your model to take <span style=\"color:red\">**no more than 3 minutes**</span> on the Coursemology server and uses <span style=\"color:red\">**at most 1 GB of memory**</span>. The Coursemology server also **does not have a GPU,** so you should not use GPU in your model. You should also assume that your model will have access to only 1 CPU and not include multi-processor code in your submission. This means that you should start off with a simple model and then gradually increase the complexity of your model to make sure that you stay within the allocated compute resources.\n",
"\n",
"These limitations exist for 2 reasons: (i) first, we need to contain the costs of running this Final Assessment; and (ii) we need to ensure a fair playing field for all students. We need to make sure no student performs better than other students simply because of access to faster machines or more resources. \n",
"\n",
"### Available Compute Resources\n",
"\n",
"For this assessment, you will need access to some computing resources to run your code and train your machine learning models. Two popular options are Google Colaboratory and the School of Computing Compute Cluster.\n",
"\n",
"* **[Google Collaboratory](https://colab.research.google.com/)**, or \"Colab\" for short, is a free cloud-based platform provided by Google that allows you to write and run Python code using a Jupyter notebook interface. Colab provides access to a virtual machine with a GPU and sometimes even a TPU, which can speed up computation for tasks like training machine learning models. You can use Colab on your own computer without installing any software, and it provides access to a number of libraries and datasets. However, there may be limits on how much time, memory, and storage space you can use, and you may need to reauthorize your session frequently.\n",
"\n",
"* **[The School of Computing Compute Cluster](https://dochub.comp.nus.edu.sg/cf/guides/compute-cluster/start)** is a set of high-performance computing resources that are available to students, faculty, and researchers affiliated with the National University of Singapore's School of Computing. The cluster consists of multiple nodes, each with its own set of CPUs, memory, and storage. You can submit jobs to the cluster using the [Slurm workload manager](https://slurm.schedmd.com/documentation.html), which allocates resources to jobs based on availability and user-specified requirements. The Compute Cluster provides significantly more computing power than Colab, with the ability to scale up to hundreds or even thousands of cores. However, you need to apply for access to the cluster, and there may be limits on the amount of resources that can be used at any given time. Additionally, using the cluster requires some technical expertise and familiarity with the Linux command line interface.\n",
"\n",
"If you prefer not to use Google Colaboratory or the School of Computing Compute Cluster, you can also run your code on your own computer. However, keep in mind that your computer may not have as much processing power or memory as the other options, so your code may run more slowly and you will take more time to complete certain tasks."
]
},
{
"cell_type": "markdown",
"id": "7e323c3c",
"metadata": {},
"source": [
"## Scratch Pad"
]
},
{
"cell_type": "markdown",
"id": "bd2d0eaf",
"metadata": {},
"source": [
"The following are brief descriptions of each step in `scratchpad.ipynb`. Please note that the scratch pad only contains generic steps, and depending on the specific data you are working with, not all steps may be necessary. As a result, some sections of the notebook may potentially be left empty/blank (you can indicate \"Nil\").\n",
"\n",
"You should probably also limit the amount of time that you invest into each of the steps to avoid rushing at the end. We expect students to take abour 3-4 hours to complete this assessment and the suggested amount of times for each of the steps is given as a guide. You do not need to adhere strictly to our suggestions. \n",
"\n",
"You should include all your \"workings\" in scratchpad. Although you will only be graded on your 1,000-word executive summary, we might refer to your workings if there are concerns about plagiarism.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "cbe832b6",
"metadata": {},
"source": [
"### Data Exploration & Preparation (60 mins)\n",
"\n",
"Before starting to create a model, it is important to explore and analyze the characteristics of the data. This can help us make informed decisions regarding the choice of approach, techniques, and models that we will use.\n",
"\n",
"When dealing with data, it is essential to understand the format in which it is presented. In machine learning, tabular data is typically provided in the form of a Pandas DataFrame, while tensor data, such as images, is given in the form of a Numpy ndarray.\n",
"\n",
"To help you get started with these data formats, the following guides can be useful:\n",
"\n",
"* [10 Minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html): This guide provides a quick introduction to the Pandas library, including its key features and how to work with DataFrames.\n",
"* [Numpy Quickstart](https://numpy.org/doc/stable/user/quickstart.html): This guide covers the basics of the Numpy library and how to work with ndarrays, including creating arrays, indexing and slicing, and mathematical operations.\n",
"\n",
"**It is important to note that you have the option to skip most of these data exploration and preparation steps and mostly use the dataset as is (with a rudimentary preprocessing)**. Given that machine learning is machine learning, such a naive approach work (i.e. give you some answer). However, such an approach is unlikely to yield the best outcomes.\n",
"\n",
"\n",
"#### 1. Descriptive Analysis (5 mins)\n",
"\n",
"Descriptive analysis is used to understand the basic characteristics of the data. This includes analyzing the distribution of the data, measuring its central tendency (i.e., mean, median, mode), and checking the variability of the data (i.e., range, standard deviation, variance, interquartile range). This analysis can give us an overview of the data and help us to identify any potential issues or challenges that may need to be addressed.\n",
"\n",
"You may find the following resources helpful:\n",
"* [Pandas: how to calculate statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)\n",
"* [Numpy statistics](https://numpy.org/doc/stable/reference/routines.statistics.html)\n",
"\n",
"#### 2. Detection and Handling of Missing Values (10 mins)\n",
"\n",
"Missing values in the data can cause problems with our machine learning algorithms and may need to be handled. Detecting and handling missing values involves checking if there are any missing values in the data and figuring out the best way to handle them if necessary. This may involve imputing missing values with a certain value or method, or removing the rows or columns that contain missing values. You can follow the \"[10 Minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)\" guide on how to do this.\n",
"\n",
"#### 3. Detection and Handling of Outliers (10 mins)\n",
"\n",
"Outliers are data points that are significantly different from the majority of the data. They can have a significant impact on the performance of your model, and it is important to detect and handle them appropriately. For example, you may use statistical methods such as the interquartile range (IQR) and z-score to detect outliers. Once you have detected outliers, you need to decide how to handle them. After you found them, you can choose to remove them from the dataset, replace them with a more appropriate value (e.g., the mean or median), or leave them in the dataset and use a model that is robust to outliers. You may find [this guide](https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/) useful.\n",
"\n",
"#### 4. Detection and Handling of Class Imbalance (5 mins)\n",
"\n",
"Class imbalance is a problem that occurs when one class has significantly more instances than another. This can make it difficult to build a model that accurately predicts the minority class. Investigating the possibility of class imbalance and figuring out the best way to handle it if necessary is important. This may involve techniques such as oversampling or undersampling the minority class (as seen in PS5), using cost-sensitive learning (as seen in Tutorial 8), or other methods which you can explore yourself.\n",
"\n",
"#### 5. Understanding Relationship Between Variables (20 mins)\n",
"\n",
"Analyzing the relationship between variables in a dataset can reveal potential dependencies and offer insights for building accurate models. There are different ways to explore these dependencies:\n",
"\n",
"* **Linear dependencies:** It's possible to identify linear dependencies by verifying if certain attributes are multiples of other attributes by a constant factor.\n",
"\n",
"* **Correlations:** Another approach is to measure the correlations between variables, which indicates whether certain variables influence others. For instance, high correlation between the target variable and attributes A, B, and C suggests that A, B, and C are important factors for the target variable. Conversely, low correlation implies that they are not critical. [You can measure correlation using pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html).\n",
"\n",
"By performing this analysis, we can determine which variables are most relevant to the problem and should be considered in our model. This can help us build a more effective and accurate machine learning model.\n",
"\n",
"#### 6. Data Visualization (10 mins)\n",
"\n",
"Visualizing the data can help us to see patterns that are visible to human eyes. Data visualization techniques can include scatter plots, histograms, heat maps, and other graphical methods. Visualization can be particularly useful when trying to identify relationships between variables or patterns in the data.\n",
"\n",
"You may use libraries such as [Matplotlib](https://matplotlib.org/stable/tutorials/introductory/pyplot.html) or [Seaborn](https://seaborn.pydata.org/tutorial/introduction.html) to do this.\n",
"\n",
"### Data Preprocessing (70 mins)\n",
"\n",
"Data preprocessing involves cleaning and transforming the data to maximize learning and prediction performance. This can include removing irrelevant variables, normalizing the data, scaling the data, or transforming the data using mathematical techniques.\n",
"\n",
"Scikit-learn website provides a [short guide](https://scikit-learn.org/stable/modules/preprocessing.html) on how to do data preprocessing using their library.\n",
"\n",
"**It is important to note that you can choose to skip more of the steps in feature selection and feature engineering and use all the features in the data set as is**. However, doing so may not yield the best outcomes. \n",
"\n",
"#### 7. General Preprocessing (10 mins)\n",
"\n",
"General preprocessing involves any other preprocessing that is necessary for the data, such as converting the data type of certain attributes or removing duplicates. think of this as implementing what needs to be done based on what you learnt in Step 2 above. \n",
"\n",
"#### 8. Feature Selection (30 mins)\n",
"\n",
"Feature selection is an important step in machine learning that involves identifying a subset of features that are most relevant to the problem. This helps to reduce the dimensionality of the data and improve the accuracy of the models. There are different techniques for feature selection, including:\n",
"\n",
"* **Removing uninformative features:** This involves removing features that are not useful for the task at hand. Two common methods to identify uninformative features are:\n",
" * Linearly dependent: features that are linear combinations of other features can be removed since they add no new information. A linear dependence test can be applied to identify linearly dependent features.\n",
" * Low or no correlation: features that have low or no correlation with the target variable can also be removed as they do not provide valuable information. A correlation analysis can be performed to identify such features \n",
" * See Understanding Relationship Between Variables on the explanation regarding the analysis.\n",
"\n",
"* **Sequentially removing features:** This involves iteratively removing the least significant feature until a desired number of features is reached. The idea is to remove features that have the least impact on the performance of the model. [Learn how to do this using Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html).\n",
"\n",
"* **Principal component analysis:** This is a dimensionality reduction technique that involves transforming the data into a new set of orthogonal variables, called principal components, that capture the most important information in the original data. By selecting a subset of these components, we can reduce the dimensionality of the data while retaining most of the information. [Learn how to do this using Scikit-learn](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html).\n",
"\n",
"By applying feature selection techniques, we can identify the most relevant features for the problem and improve the accuracy and efficiency of our machine learning models.\n",
"\n",
"\n",
"#### 9. Feature Engineering (30 mins)\n",
"\n",
"Feature engineering is the process of creating new features from existing ones to improve the performance of machine learning models. It involves identifying and extracting useful information from the data by applying various techniques such as:\n",
"\n",
"* **Combining features:** This involves combining two or more existing features to create a new feature that captures more information than either of the original features alone. For example, if we have two features representing length and width, we can create a new feature that represents area by multiplying the two. \n",
"\n",
"* **Creating new features:** This involves creating new features from the existing ones using domain knowledge or other insights gained from the data analysis. For example, if we have a dataset of customer transactions, we can create new features such as total spending per customer or the number of items purchased in a single transaction. \n",
"\n",
"* **Mapping functions to features:** This involves applying mathematical functions to the features to create new features with higher-order terms or interactions between features. For example, we can create polynomial features by mapping a feature x to x^2 or x^3.\n",
"\n",
"By applying feature engineering techniques, we can create more informative features that capture the underlying patterns and relationships in the data, leading to better performance in the machine learning models, since compute resources are limited. \n",
"\n",
"\n",
"### Modeling & Evaluation (110 mins)\n",
"\n",
"After completing the above steps, it is time to build and evaluate models. This can involve creating a set of models that best fit the nature of the data, performing model evaluation, and doing hyperparameters search.\n",
"\n",
"The models should be evaluated thoroughly, and the best one should be chosen for submission.\n",
"\n",
"#### 10. Creating Models (30 mins)\n",
"\n",
"In this stage, we create models that are appropriate for the data. Since you are only allowed to use neural networks as the hypothesis class, then the choice of models boils down to the choice of the architectures of the neural networks. For example, you can use feedforward neural networks, convolutional neural networks, recurrent neural networks, and so on. You can even combine multiple architecture together to form a new architecture. Depending on the nature of the data, we can choose one or more of these models to build. We should be careful in selecting the models to ensure that they are suitable for the task we want to accomplish.\n",
"\n",
"Utilizing pre-built models from [PyTorch](https://pytorch.org/docs/stable/nn.html) can be beneficial. This library offers an extensive range of models that can be easily implemented and integrated. \n",
"\n",
"However, if needed, you can also create your own models and algorithms from scratch.\n",
"\n",
"#### 11. Model Evaluation (30 mins)\n",
"\n",
"Once we have created our models, we need to evaluate them to determine their performance. We should use a variety of metrics to evaluate the performance of each model, such as accuracy, precision, recall, F1 score, ROC curve, AUC, and so on. We should also use appropriate techniques to validate the models, such as cross-validation, train-test split, or hold-out validation. By doing this, we can determine which model is the best fit for our data. \n",
"\n",
"It's important to consider multiple models in the evaluation process to ensure that we are choosing the best one for our data. We may create and evaluate several models before selecting the best one. We should also consider the trade-offs between model complexity and accuracy to make sure that it can run in Coursemology.\n",
"\n",
"#### 12. Hyperparameters Search (50 mins)\n",
"\n",
"After choosing a model, we should optimize its hyperparameters to achieve the best performance. Hyperparameters are parameters that are not learned during training, such as the learning rate, the number of hidden layers (if applicable), or the regularization coefficient (if applicable). We can use various methods to search for the optimal hyperparameters, such as grid search, random search, or Bayesian optimization. The choice of the method depends on the complexity of the model and the size of the dataset. By tuning the hyperparameters, we can improve the performance of the model and make it more robust.\n",
"\n",
"There are many libraries to do hyperparameter search. You browse for them on [GitHub](https://github.com/topics/hyperparameter-optimization).\n",
"\n",
"In addition to using optimization libraries and functions, you can also manually perform simple hyperparameter tuning. This involves adjusting the hyperparameters of your model and evaluating its performance repeatedly until the best combination is achieved. However, keep in mind that manual tuning can be time-consuming and may not be as effective as more advanced techniques."
]
},
{
"cell_type": "markdown",
"id": "9045d90d",
"metadata": {},
"source": [
"## Tasks & Submission"
]
},
{
"cell_type": "markdown",
"id": "c950686e",
"metadata": {},
"source": [
"### Task 1: Model Implementation (80% Marks)\n",
"\n",
"Implement your model that you want to submit by completing the following functions:\n",
"* `__init__`: The constructor for Model class.\n",
"* `fit`: Fit/train the model using the input data. You may perform data handling and preprocessing here before training your model.\n",
"* `predict`: Predict using the model. If you perform data handling and preprocessing in the `fit` function, then you may want to do the same here.\n",
"\n",
"#### Dependencies\n",
"\n",
"It is crucial to note that your model may rely on specific versions of Python packages, including:\n",
"\n",
"* Python 3.10\n",
"* Numpy version 1.23\n",
"* Pandas version 1.4\n",
"* Scikit-Learn version 1.1\n",
"* PyTorch version 1.12\n",
"* Torchvision version 0.13\n",
"\n",
"To prevent any compatibility issues or unexpected errors during the execution of your code, ensure that you are using the correct versions of these packages. You can refer to `environment.yml` for a comprehensive list of packages that are pre-installed in Coursemology and can be used by your model. Note that if you do end up using libraries that are not installed on Coursemology, you might see an error like:\n",
"\n",
"\"Your code failed to evaluate correctly. There might be a syntax error, or perhaps execution failed to complete within the allocated time and memory limits.\"\n",
"\n",
"#### Model Template\n",
"\n",
"Note that you should copy and paste the code below *directly* into Coursemology for submission. You should probably test the code in this notebook on your local machine before uploading to Coursemology and using up an attempt.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "a44b7aa4",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-28T04:16:20.742674Z",
"start_time": "2024-04-28T04:16:20.719852Z"
}
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"class CNN(nn.Module):\n",
" def __init__(self, num_classes):\n",
" super(CNN, self).__init__()\n",
"\n",
" self.conv1 = nn.Conv2d(1,32,3,stride=1,padding=0)\n",
" self.conv2 = nn.Conv2d(32,64,3,stride=1,padding=0)\n",
" self.relu = nn.ReLU()\n",
" self.maxpool = nn.MaxPool2d(2)\n",
" self.fc1 = nn.Linear(256, 128) # Calculate input size based on output from conv2 and pooling\n",
" self.fc2 = nn.Linear(128, num_classes)\n",
" self.flatten = nn.Flatten()\n",
"\n",
" def forward(self, x):\n",
" x = self.conv1(x)\n",
" x = self.relu(x)\n",
" x = self.maxpool(x)\n",
" x = self.conv2(x)\n",
" x = self.relu(x)\n",
" x = self.maxpool(x)\n",
" x = self.flatten(x)\n",
" x = self.fc1(x)\n",
" x = self.relu(x)\n",
" x = self.fc2(x)\n",
" return x\n",
"\n",
"# video is a numpy array of shape (L, H, W)\n",
"def clean_batch(batch):\n",
" batch = np.array(batch)\n",
" print(batch.shape)\n",
" temp_x = batch.reshape(-1, 256)\n",
" np.nan_to_num(temp_x, copy=False)\n",
" col_mean = np.nanmean(temp_x, axis=0)\n",
" inds = np.where(np.isnan(temp_x))\n",
" temp_x[inds] = np.take(col_mean, inds[1])\n",
" temp_x = np.clip(temp_x, 1, 255)\n",
" batch = temp_x.reshape(-1, 1, 16,16)\n",
" return torch.tensor(batch, dtype=torch.float32)\n",
"def flatten_data(X, y):\n",
" not_nan_indices = np.argwhere(~np.isnan(np.array(y))).squeeze()\n",
" y = [y[i] for i in not_nan_indices]\n",
" X = [X[i] for i in not_nan_indices]\n",
" flattened_x = []\n",
" flattened_y = []\n",
" for idx, video in enumerate(X):\n",
" for frame in video:\n",
" flattened_x.append(frame)\n",
" flattened_y.append(y[idx])\n",
" flattened_x = clean_batch(flattened_x)\n",
" return flattened_x, torch.Tensor(np.array(flattened_y, dtype=np.int64)).long()\n",
"\n",
"class Model():\n",
" def __init__(self):\n",
" self.cnn = CNN(6)\n",
" def fit(self, X, y):\n",
" self.cnn.train()\n",
" X, y = flatten_data(X, y)\n",
" train_dataset = torch.utils.data.TensorDataset(X, y)\n",
" train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=320, shuffle=True)\n",
" criterion = nn.CrossEntropyLoss()\n",
" optimizer = torch.optim.Adam(self.cnn.parameters(), lr=0.001)\n",
" for epoch in range(70):\n",
" for idx, (inputs, labels) in enumerate(train_loader):\n",
" optimizer.zero_grad()\n",
" outputs = self.cnn(inputs)\n",
" loss = criterion(outputs, labels)\n",
" loss.backward()\n",
" optimizer.step()\n",
" print(f'Epoch {epoch}, Loss: {loss.item()}')\n",
" return self\n",
" def predict(self, X):\n",
" self.cnn.eval()\n",
" results = []\n",
" for idx, batch in enumerate(X):\n",
" batch = clean_batch(batch)\n",
" pred = self.cnn(batch)\n",
" result = torch.argmax(pred, axis=1)\n",
" results.append(torch.max(result))\n",
" return results\n"
]
},
{
"cell_type": "markdown",
"id": "e02178d7",
"metadata": {},
"source": [
"#### Local Evaluation\n",
"\n",
"You may test your solution locally by running the following code. Do note that the results may not reflect your performance in Coursemology. You should not be submitting the code below in Coursemology. The code here is meant only for you to do local testing."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "4f4dd489",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-28T04:09:59.741093Z",
"start_time": "2024-04-28T04:09:59.732247Z"
}
},
"outputs": [],
"source": [
"# Import packages\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error, r2_score\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3064e0ff",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-28T04:10:00.718747Z",
"start_time": "2024-04-28T04:10:00.689200Z"
}
},
"outputs": [],
"source": [
"# Load data\n",
"with open('data.npy', 'rb') as f:\n",
" data = np.load(f, allow_pickle=True).item()\n",
" X = data['data']\n",
" y = data['label']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "27c9fd10",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-28T04:18:50.184449Z",
"start_time": "2024-04-28T04:18:43.527661Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(16189, 16, 16)\n",
"Epoch 0, Loss: 1.0775409936904907\n",
"Epoch 1, Loss: 1.279036283493042\n",
"Epoch 2, Loss: 1.0776251554489136\n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mKeyboardInterrupt\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn[21], line 15\u001B[0m\n\u001B[1;32m 13\u001B[0m \u001B[38;5;66;03m# Train and predict\u001B[39;00m\n\u001B[1;32m 14\u001B[0m model \u001B[38;5;241m=\u001B[39m Model()\n\u001B[0;32m---> 15\u001B[0m \u001B[43mmodel\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX_train\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my_train\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 16\u001B[0m y_pred \u001B[38;5;241m=\u001B[39m model\u001B[38;5;241m.\u001B[39mpredict(X_test)\n\u001B[1;32m 18\u001B[0m \u001B[38;5;66;03m# Evaluate model predition\u001B[39;00m\n\u001B[1;32m 19\u001B[0m \u001B[38;5;66;03m# Learn more: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics\u001B[39;00m\n",
"Cell \u001B[0;32mIn[18], line 68\u001B[0m, in \u001B[0;36mModel.fit\u001B[0;34m(self, X, y)\u001B[0m\n\u001B[1;32m 66\u001B[0m outputs \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcnn(inputs)\n\u001B[1;32m 67\u001B[0m loss \u001B[38;5;241m=\u001B[39m criterion(outputs, labels)\n\u001B[0;32m---> 68\u001B[0m \u001B[43mloss\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mbackward\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 69\u001B[0m optimizer\u001B[38;5;241m.\u001B[39mstep()\n\u001B[1;32m 70\u001B[0m \u001B[38;5;28mprint\u001B[39m(\u001B[38;5;124mf\u001B[39m\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mEpoch \u001B[39m\u001B[38;5;132;01m{\u001B[39;00mepoch\u001B[38;5;132;01m}\u001B[39;00m\u001B[38;5;124m, Loss: \u001B[39m\u001B[38;5;132;01m{\u001B[39;00mloss\u001B[38;5;241m.\u001B[39mitem()\u001B[38;5;132;01m}\u001B[39;00m\u001B[38;5;124m'\u001B[39m)\n",
"File \u001B[0;32m/nix/store/4mv9lb8b1vjx88y2i7px1r2s8p3xlr7d-python3-3.11.9-env/lib/python3.11/site-packages/torch/_tensor.py:522\u001B[0m, in \u001B[0;36mTensor.backward\u001B[0;34m(self, gradient, retain_graph, create_graph, inputs)\u001B[0m\n\u001B[1;32m 512\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m has_torch_function_unary(\u001B[38;5;28mself\u001B[39m):\n\u001B[1;32m 513\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m handle_torch_function(\n\u001B[1;32m 514\u001B[0m Tensor\u001B[38;5;241m.\u001B[39mbackward,\n\u001B[1;32m 515\u001B[0m (\u001B[38;5;28mself\u001B[39m,),\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 520\u001B[0m inputs\u001B[38;5;241m=\u001B[39minputs,\n\u001B[1;32m 521\u001B[0m )\n\u001B[0;32m--> 522\u001B[0m \u001B[43mtorch\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mautograd\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mbackward\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 523\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mgradient\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mretain_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mcreate_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43minputs\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43minputs\u001B[49m\n\u001B[1;32m 524\u001B[0m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m/nix/store/4mv9lb8b1vjx88y2i7px1r2s8p3xlr7d-python3-3.11.9-env/lib/python3.11/site-packages/torch/autograd/__init__.py:266\u001B[0m, in \u001B[0;36mbackward\u001B[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)\u001B[0m\n\u001B[1;32m 261\u001B[0m retain_graph \u001B[38;5;241m=\u001B[39m create_graph\n\u001B[1;32m 263\u001B[0m \u001B[38;5;66;03m# The reason we repeat the same comment below is that\u001B[39;00m\n\u001B[1;32m 264\u001B[0m \u001B[38;5;66;03m# some Python versions print out the first line of a multi-line function\u001B[39;00m\n\u001B[1;32m 265\u001B[0m \u001B[38;5;66;03m# calls in the traceback and some print out the last line\u001B[39;00m\n\u001B[0;32m--> 266\u001B[0m \u001B[43mVariable\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_execution_engine\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrun_backward\u001B[49m\u001B[43m(\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;66;43;03m# Calls into the C++ engine to run the backward pass\u001B[39;49;00m\n\u001B[1;32m 267\u001B[0m \u001B[43m \u001B[49m\u001B[43mtensors\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 268\u001B[0m \u001B[43m \u001B[49m\u001B[43mgrad_tensors_\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 269\u001B[0m \u001B[43m \u001B[49m\u001B[43mretain_graph\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 270\u001B[0m \u001B[43m \u001B[49m\u001B[43mcreate_graph\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 271\u001B[0m \u001B[43m \u001B[49m\u001B[43minputs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 272\u001B[0m \u001B[43m \u001B[49m\u001B[43mallow_unreachable\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mTrue\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 273\u001B[0m \u001B[43m \u001B[49m\u001B[43maccumulate_grad\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mTrue\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 274\u001B[0m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n",
"\u001B[0;31mKeyboardInterrupt\u001B[0m: "
]
}
],
"source": [
"# Split train and test\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)\n",
"\n",
"# Filter test data that contains no labels\n",
"# In Coursemology, the test data is guaranteed to have labels\n",
"not_nan_indices = np.argwhere(~np.isnan(np.array(y_test))).squeeze()\n",
"y_test = [y_test[i] for i in not_nan_indices]\n",
"X_test = [X_test[i] for i in not_nan_indices]\n",
"not_nan_indices = np.argwhere(~np.isnan(np.array(y_train))).squeeze()\n",
"y_train = [y_train[i] for i in not_nan_indices]\n",
"X_train = [X_train[i] for i in not_nan_indices]\n",
"\n",
"# Train and predict\n",
"model = Model()\n",
"model.fit(X_train, y_train)\n",
"y_pred = model.predict(X_test)\n",
"\n",
"# Evaluate model predition\n",
"# Learn more: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics\n",
"print(\"F1 Score (macro): {0:.2f}\".format(f1_score(y_test, y_pred, average='macro'))) # You may encounter errors, you are expected to figure out what's the issue."
]
},
{
"cell_type": "markdown",
"id": "16861aef",
"metadata": {},
"source": [
"#### Grading Scheme\n",
"\n",
"Your code implementation will be graded based on its performance ([**Macro F1 Score**](http://iamirmasoud.com/2022/06/19/understanding-micro-macro-and-weighted-averages-for-scikit-learn-metrics-in-multi-class-classification-with-example/)*) in the contest. Your model will be trained with the data that we provided you with this assesment. We will use score cutoffs that we will decide after the contest to determine your marks.\n",
"\n",
"The performance of your model will be determined by a separate test data set, drawn from the same population as the training set, but not provided to you earlier. The marks you will receive will depend on the **Macro F1 Score** of the predictions:\n",
"\n",
"* If your score is above the mean or median, you can expect to receive decent marks. \n",
"* If your score is higher than the 75th percentile, you are likely to receive good marks. \n",
"* If you achieve a score above the 90th percentile (top 10%), you will likely receive full marks.\n",
"\n",
"Throughout the contest, we will provide periodic updates on the distribution of the score of student submissions in the official forum thread (see Overview) based on the **public test case**, which test the performance of the model on **a small subset of data from the hidden test data**. You can use these updates to estimate your relative standing, compared to your peers. \n",
"\n",
"<b>*) Macro F1 Score:</b> F1 score for multi-class classification computed by taking the average of all the per-class F1 score"
]
},
{
"cell_type": "markdown",
"id": "44c79c17",
"metadata": {},
"source": [
"### Task 2: Scratch Pad (20% Marks)\n",
"\n",
"Fill up the `scratchpad.ipynb` with your working. \n",
"\n",
"In the **\"Report\" section**, write a report that explain the thought process behind your solution, and convince us that you have understood the concepts taught in class and can apply them. The report should cover data exploration and preparation, data preprocessing, modeling, and evaluation. The final solution and any alternative approaches that were tried but did not work may also be documented. The length of the report should be approximately equivalent to **1-2 pages of A4 paper (up to 1,000 words)**.\n",
"\n",
"#### Grading Scheme\n",
"\n",
"The report will be graded based on the reasonability and soundness of the approach you take, your understanding of the data, and your final solution. If you do not make any errors in your approach, reasoning/understanding, and conclusion, you can expect to receive full marks. This part is meant to be \"standard\", and is only for us to do a quick sanity check that you actually did the work required to come up with the model you submitted."
]
},
{
"cell_type": "markdown",
"id": "28b658b4",
"metadata": {},
"source": [
"### Submission\n",
"\n",
"Once you are done, please submit your work to Coursemology, by copying the right snippets of code into the corresponding box that says 'Model Implementation', and click 'Save Draft'. You can still make changes after you save your submission.\n",
"\n",
"When submitting your model, the `fit` function will be called to train your model with the **data that we have provided you**. Due to the inherent stochasticity of the training process, **your model's performance may vary across different runs**. To ensure deterministic results, you can set a fixed random seed in your code. After the training is completed, the `predict` function will be used to evaluate your model. The evaluation of your model will be based on two test cases: \n",
"1. **Public test cases, containing a small portion of the test data**, that allows you to **estimate** your score. \n",
"2. **Evaluation test cases containing the remaining test data** (which you will not be able to see) by which we will evaluate your model. \n",
"\n",
"Your score in the public test case may not reflect your actual score. **Note that running all test cases can take up to 3 minutes to complete, and you have a maximum of 20 attempts.** We only provide you with a limited number of tries because we do not want you to spam our autograder. \n",
"\n",
"Finally, when you are satisfied with your submission, you can finalize it by clicking \"Finalize submission.\". <span style=\"color:red\">**Note that once you have finalized your submission, it is considered to be submitted for grading, and no further changes can be made**.</span>\n",
"\n",
"When nearing the submission deadline for our final assessment, the Coursemology server may experience overloads due to high request volumes. Here are some guidelines for you if you encounter issues close to the deadline:\n",
"\n",
"1. If you notice that no result is returned after running your code, please refrain from rerunning it. Instead, simply refresh the page and await the result.\n",
"\n",
"2. If your code continues running and you cannot finalize your submission, \n",
"* You should be able to finalize it by refreshing the submission page. Please be aware that any input made after running your code will be lost upon refreshing the page. \n",
"* If you didn't make any changes after running your code, your running code will be considered the final submission.\n",
"* If you make changes to your code after refreshing and then press finalize, your final code will be the one you typed just before finalizing.\n",
"\n",
"We highly recommend <span style=\"color:red\">**not waiting until the last moment to submit your final assessment**.</span>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}