nus/cs2109s/labs/final-mock/main.ipynb
2024-04-28 15:58:30 +08:00

{
"cells": [
{
"cell_type": "markdown",
"id": "7d017333",
"metadata": {},
"source": [
"# Final Assessment: Making Prediction on a Dataset without Domain Knowledge"
]
},
{
"cell_type": "markdown",
"id": "09648cba",
"metadata": {},
"source": [
"**Release Date:** -\n",
"\n",
"**Due Date:** -"
]
},
{
"cell_type": "markdown",
"id": "022cb4cd",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"In this assessment, your goal is to put into practice what you have learnt over the semester by building a model that can accurately predict the outcomes for new data samples from the same distribution from a given data set. Because the dataset's origin and domain are unknown, you cannot rely on any prior domain knowledge. Instead, you will need to analyze and comprehend the data based solely on its inherent characteristics and the given training samples. You may employ any techniques to uncover the data's nature, such as visualization, analysis, or experimentation. \n",
"\n",
"In general, the techniques learnt in class should be sufficient, but you may choose to explore new ones. For fairness and for practical reasons, your solution will be implemented with the libraries that we have used in CS2109S. \n",
"\n",
"Once you have some understanding of the data, you can create a model (or a set of models) that takes into account the data's characteristics. Your model can perform a series of preprocessing steps before training or making predictions. You will need to contemplate how you will train your model, what objective you will use, and how to evaluate its performance properly. Additionally, you may want to conduct a hyperparameter search to enhance your model's performance.\n",
"\n",
"There is no model answer to the problem. There are many possible approaches and many of them might perform equally well. \n",
"\n",
"### Required Files:\n",
"\n",
"* `main.ipynb` (this Jupyter notebook)\n",
"* `scratchpad.ipynb` (Jupyter notebook for your \"scratch paper\") \n",
"* `environment.yml` (environment file)\n",
"* `util.py` (utilities)\n",
"* `data/` (dataset folder)\n",
" * `tabular.csv`\n",
" * `images.npy`\n",
"\n",
"### Honour Code: \n",
"\n",
"* Note that plagiarism will not be condoned! \n",
"* Also, because this is an individual assessment, you **MUST NOT** discuss your approach or solution with your classmates. If you are caught doing so, you will be subject to disciplinary actions because it will be an act of academic dishonestly. \n",
"* You may check the internet for references, but you **MUST NOT** submit code/report that is copied directly from online sources! For good measure, please cite all references in your final solution write-up. \n",
"\n",
"### Latest Updates, Announcements, and Clarifications:\n",
"\n",
"We recommend regularly checking the [official forum thread for the mock final exam](https://coursemology.org/courses/2714/forums/homework/topics/official-mock-final-assessment) to stay up-to-date with the latest updates and announcements. Additionally, you can use this thread to request clarifications regarding any aspect of the exam."
]
},
{
"cell_type": "markdown",
"id": "ef4d73fd",
"metadata": {},
"source": [
"## Jupyter Notebooks\n",
"\n",
"In this final assessment, you will be working with two Jupyter notebooks: `main.ipynb` and `scratchpad.ipynb`.\n",
"\n",
"### `main.ipynb`\n",
"\n",
"This notebook serves as the primary guide to completing the final exam. It includes instructions, guidelines, and a final model code template that you must complete and copy-paste into Coursemology for submission. You can think of this as the question paper. \n",
"\n",
"It is important to note that `main.ipynb` **should not be uploaded**.\n",
"\n",
"### `scratchpad.ipynb`\n",
"\n",
"`scratchpad.ipynb` is your \"scratch paper,\" where you will perform various analyses and transformations on the data we provided you with and experiment with different techniques. You can think of this as your answer booklet. \n",
"\n",
"It is important to document your thoughts in this notebook so that we can understand how you arrived at your final model. Note that your working will not be graded directly. Instead, you will have to summarize your work in a report of not more than 1,000 words. You should follow the format/sections given to help make the grading easier for the profs. Some sections are optional. If they are not applicable, just indicate \"Nil.\"\n",
"\n",
"Mostly, the working submitted will be used as a sanity check that you actually did the work you claim to have done, instead of copying it from some online source, or worse, copied from another student. If we find 2 submissions are suspiciously similar, we will check the working for the 2 students. It is highly implausible that 2 students will come up with the same working if they did not discuss their approach/answers. \n",
"\n",
"Once you have found your best model, you should copy and paste the model and its necessary components from `scratchpad.ipynb` to Coursemology for submission. It is important to **ensure that your model adheres to the model code template provided in `main.ipynb`**. You need to make sure that the code you submit in Coursemology is self contained. We strongly encourage you to test your model locally using `main.ipynb` by copying the necessary parts from your scratchpad to `main.ipynb` and check that your model runs correctly, to avoid wasting your attempt on Coursemology. You will be giving only a limited number of attempts on Coursemology because it is extreme memory-intensive (aka expensive) to run these models and we don't want students to be spamming our servers. \n",
"\n",
"**Remember to upload `scratchpad.ipynb` when submitting your final exam.** "
]
},
{
"cell_type": "markdown",
"id": "c51cc898",
"metadata": {},
"source": [
"## Compute Resources\n",
"\n",
"### IMPORTANT: Limitations to Compute Resources\n",
"\n",
"At some level, machine learning can generally do better if we throw more resources into the problem. However for reasons of fairness and practicality, we need to limit the compute resources that you can use for this problem. The way that we can decided to quantify this to limit your model to take **no more than 5 minutes** on the Coursemology server and uses **at most 1 GB of memory**. The Coursemology server also **does not have a GPU,** so you should not use GPU in your model. You should also assume that your model will have access to only 1 CPU and not include multi-processor code in your submission. This means that you should start off with a simple model and then gradually increase the complexity of your model to make sure that you stay within the allocated compute resources.\n",
"\n",
"These limitations exist for 2 reasons: (i) first, we need to contain the costs of running this Final Assessment; and (ii) we need to ensure a fair playing field for all students. We need to make sure no student performs better than other students simply because of access to faster machines or more resources. \n",
"\n",
"### Available Compute Resources\n",
"\n",
"For this assessment, you will need access to some computing resources to run your code and train your machine learning models. Two popular options are Google Colaboratory and the School of Computing Compute Cluster.\n",
"\n",
"* **[Google Collaboratoty](https://colab.research.google.com/)**, or \"Colab\" for short, is a free cloud-based platform provided by Google that allows you to write and run Python code using a Jupyter notebook interface. Colab provides access to a virtual machine with a GPU and sometimes even a TPU, which can speed up computation for tasks like training machine learning models. You can use Colab on your own computer without installing any software, and it provides access to a number of libraries and datasets. However, there may be limits on how much time, memory, and storage space you can use, and you may need to reauthorize your session frequently.\n",
"\n",
"* **[The School of Computing Compute Cluster](https://dochub.comp.nus.edu.sg/cf/guides/compute-cluster/start)** is a set of high-performance computing resources that are available to students, faculty, and researchers affiliated with the National University of Singapore's School of Computing. The cluster consists of multiple nodes, each with its own set of CPUs, memory, and storage. You can submit jobs to the cluster using the [Slurm workload manager](https://slurm.schedmd.com/documentation.html), which allocates resources to jobs based on availability and user-specified requirements. The Compute Cluster provides significantly more computing power than Colab, with the ability to scale up to hundreds or even thousands of cores. However, you need to apply for access to the cluster, and there may be limits on the amount of resources that can be used at any given time. Additionally, using the cluster requires some technical expertise and familiarity with the Linux command line interface.\n",
"\n",
"If you prefer not to use Google Colaboratory or the School of Computing Compute Cluster, you can also run your code on your own computer. However, keep in mind that your computer may not have as much processing power or memory as the other options, so your code may run more slowly and you will take more time to complete certain tasks."
]
},
{
"cell_type": "markdown",
"id": "7e323c3c",
"metadata": {},
"source": [
"## Scratch Pad"
]
},
{
"cell_type": "markdown",
"id": "bd2d0eaf",
"metadata": {},
"source": [
"The following are brief descriptions of each step in `scratchpad.ipynb`. Please note that the scratch pad only contains generic steps, and depending on the specific data you are working with, not all steps may be necessary. As a result, some sections of the notebook may potentially be left empty/blank (you can indicate \"Nil\").\n",
"\n",
"You should probably also limit the amount of time that you invest into each of the steps to avoid rushing at the end. We expect students to take abour 3-4 hours to complete this assessment and the suggested amount of times for each of the steps is given as a guide. You do not need to adhere strictly to our suggestions. \n",
"\n",
"You should include all your \"workings\" in scratchpad. Although you will only be graded on your 1,000-word executive summary, we might refer to your workings if there are concerns about plagiarism.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "cbe832b6",
"metadata": {},
"source": [
"### Data Exploration & Preparation (60 mins)\n",
"\n",
"Before starting to create a model, it is important to explore and analyze the characteristics of the data. This can help us make informed decisions regarding the choice of approach, techniques, and models that we will use.\n",
"\n",
"When dealing with data, it is essential to understand the format in which it is presented. In machine learning, tabular data is typically provided in the form of a Pandas DataFrame, while tensor data, such as images, is given in the form of a Numpy ndarray.\n",
"\n",
"To help you get started with these data formats, the following guides can be useful:\n",
"\n",
"* [10 Minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html): This guide provides a quick introduction to the Pandas library, including its key features and how to work with DataFrames.\n",
"* [Numpy Quickstart](https://numpy.org/doc/stable/user/quickstart.html): This guide covers the basics of the Numpy library and how to work with ndarrays, including creating arrays, indexing and slicing, and mathematical operations.\n",
"\n",
"**It is important to note that you have the option to skip most of these data exploration and preparation steps and mostly use the dataset as is (with a rudimentary preprocessing)**. Given that machine learning is machine learning, such a naive approach work (i.e. give you some answer). However, such an approach is unlikely to yield the best outcomes.\n",
"\n",
"\n",
"#### 1. Descriptive Analysis (5 mins)\n",
"\n",
"Descriptive analysis is used to understand the basic characteristics of the data. This includes analyzing the distribution of the data, measuring its central tendency (i.e., mean, median, mode), and checking the variability of the data (i.e., range, standard deviation, variance, interquartile range). This analysis can give us an overview of the data and help us to identify any potential issues or challenges that may need to be addressed.\n",
"\n",
"You may find the following resources helpful:\n",
"* [Pandas: how to calculate statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)\n",
"* [Numpy statistics](https://numpy.org/doc/stable/reference/routines.statistics.html)\n",
"\n",
"#### 2. Detection and Handling of Missing Values (10 mins)\n",
"\n",
"Missing values in the data can cause problems with our machine learning algorithms and may need to be handled. Detecting and handling missing values involves checking if there are any missing values in the data and figuring out the best way to handle them if necessary. This may involve imputing missing values with a certain value or method, or removing the rows or columns that contain missing values. You can follow the \"[10 Minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)\" guide on how to do this.\n",
"\n",
"#### 3. Detection and Handling of Outliers (10 mins)\n",
"\n",
"Outliers are data points that are significantly different from the majority of the data. They can have a significant impact on the performance of your model, and it is important to detect and handle them appropriately. For example, you may use statistical methods such as the interquartile range (IQR) and z-score to detect outliers. Once you have detected outliers, you need to decide how to handle them. After you found them, you can choose to remove them from the dataset, replace them with a more appropriate value (e.g., the mean or median), or leave them in the dataset and use a model that is robust to outliers. You may find [this guide](https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/) useful.\n",
"\n",
"#### 4. Detection and Handling of Class Imbalance (5 mins)\n",
"\n",
"Class imbalance is a problem that occurs when one class has significantly more instances than another. This can make it difficult to build a model that accurately predicts the minority class. Investigating the possibility of class imbalance and figuring out the best way to handle it if necessary is important. This may involve techniques such as oversampling or undersampling the minority class (as seen in PS4), using cost-sensitive learning (as seen in Tutorial 8), or other methods which you can explore yourself.\n",
"\n",
"Note: Check out SMOTE for this\n",
"\n",
"#### 5. Understanding Relationship Between Variables (20 mins)\n",
"\n",
"Analyzing the relationship between variables in a dataset can reveal potential dependencies and offer insights for building accurate models. There are different ways to explore these dependencies:\n",
"\n",
"* **Linear dependencies:** It's possible to identify linear dependencies by verifying if certain attributes are multiples of other attributes by a constant factor.\n",
"\n",
"* **Correlations:** Another approach is to measure the correlations between variables, which indicates whether certain variables influence others. For instance, high correlation between the target variable and attributes A, B, and C suggests that A, B, and C are important factors for the target variable. Conversely, low correlation implies that they are not critical. [You can measure correlation using pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html).\n",
"\n",
"By performing this analysis, we can determine which variables are most relevant to the problem and should be considered in our model. This can help us build a more effective and accurate machine learning model.\n",
"\n",
"#### 6. Data Visualization (10 mins)\n",
"\n",
"Visualizing the data can help us to see patterns that are visible to human eyes. Data visualization techniques can include scatter plots, histograms, heat maps, and other graphical methods. Visualization can be particularly useful when trying to identify relationships between variables or patterns in the data.\n",
"\n",
"You may use libraries such as [Matplotlib](https://matplotlib.org/stable/tutorials/introductory/pyplot.html) or [Seaborn](https://seaborn.pydata.org/tutorial/introduction.html) to do this.\n",
"\n",
"### Data Preprocessing (70 mins)\n",
"\n",
"Data preprocessing involves cleaning and transforming the data to maximize learning and prediction performance. This can include removing irrelevant variables, normalizing the data, scaling the data, or transforming the data using mathematical techniques.\n",
"\n",
"Scikit-learn website provides a [short guide](https://scikit-learn.org/stable/modules/preprocessing.html) on how to do data preprocessing using their library.\n",
"\n",
"**It is important to note that you can choose to skip more of the steps in feature selection and feature engineering and use all the features in the data set as is**. However, doing so will likely not yield the best outcomes. \n",
"\n",
"Note: Do SVD / PCA for this\n",
"\n",
"#### 7. General Preprocessing (10 mins)\n",
"\n",
"General preprocessing involves any other preprocessing that is necessary for the data, such as converting the data type of certain attributes or removing duplicates. think of this as implementing what needs to be done based on what you learnt in Step 2 above. \n",
"\n",
"#### 8. Feature Selection (30 mins)\n",
"\n",
"Feature selection is an important step in machine learning that involves identifying a subset of features that are most relevant to the problem. This helps to reduce the dimensionality of the data and improve the accuracy of the models. There are different techniques for feature selection, including:\n",
"\n",
"* **Removing uninformative features:** This involves removing features that are not useful for the task at hand. Two common methods to identify uninformative features are:\n",
" * Linearly dependent: features that are linear combinations of other features can be removed since they add no new information. A linear dependence test can be applied to identify linearly dependent features.\n",
" * Low or no correlation: features that have low or no correlation with the target variable can also be removed as they do not provide valuable information. A correlation analysis can be performed to identify such features \n",
" * See Understanding Relationship Between Variables on the explanation regarding the analysis.\n",
"\n",
"* **Sequentially removing features:** This involves iteratively removing the least significant feature until a desired number of features is reached. The idea is to remove features that have the least impact on the performance of the model. [Learn how to do this using Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html).\n",
"\n",
"* **Principal component analysis:** This is a dimensionality reduction technique that involves transforming the data into a new set of orthogonal variables, called principal components, that capture the most important information in the original data. By selecting a subset of these components, we can reduce the dimensionality of the data while retaining most of the information. [Learn how to do this using Scikit-learn](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html).\n",
"\n",
"By applying feature selection techniques, we can identify the most relevant features for the problem and improve the accuracy and efficiency of our machine learning models.\n",
"\n",
"\n",
"#### 9. Feature Engineering (30 mins)\n",
"\n",
"Feature engineering is the process of creating new features from existing ones to improve the performance of machine learning models. It involves identifying and extracting useful information from the data by applying various techniques such as:\n",
"\n",
"* **Combining features:** This involves combining two or more existing features to create a new feature that captures more information than either of the original features alone. For example, if we have two features representing length and width, we can create a new feature that represents area by multiplying the two. \n",
"\n",
"* **Creating new features:** This involves creating new features from the existing ones using domain knowledge or other insights gained from the data analysis. For example, if we have a dataset of customer transactions, we can create new features such as total spending per customer or the number of items purchased in a single transaction. \n",
"\n",
"* **Mapping functions to features:** This involves applying mathematical functions to the features to create new features with higher-order terms or interactions between features. For example, we can create polynomial features by mapping a feature x to x^2 or x^3.\n",
"\n",
"By applying feature engineering techniques, we can create more informative features that capture the underlying patterns and relationships in the data, leading to better performance in the machine learning models, since compute resources are limited. \n",
"\n",
"Note: Look at PS7 for this\n",
"\n",
"### Modeling & Evaluation (110 mins)\n",
"\n",
"After completing the above steps, it is time to build and evaluate models. This can involve creating a set of models that best fit the nature of the data, performing model evaluation, and doing hyperparameters search.\n",
"\n",
"The models should be evaluated thoroughly, and the best one should be chosen for submission.\n",
"\n",
"#### 10. Creating Models (30 mins)\n",
"\n",
"In this stage, we create models that are appropriate for the data. There are different types of models, such as linear regression, logistic regression, decision trees, support vector machines, neural networks, and so on. Depending on the nature of the data, we can choose one or more of these models to build. We should be careful in selecting the models to ensure that they are suitable for the task we want to accomplish.\n",
"\n",
"Utilizing pre-built models from libraries such as [Scikit-Learn](https://scikit-learn.org/stable/supervised_learning.html) and [PyTorch](https://pytorch.org/docs/stable/nn.html) can be beneficial. These libraries offer an extensive range of models that can be easily implemented and integrated.\n",
"\n",
"However, if needed, you can also create your own models and algorithms from scratch.\n",
"\n",
"#### 11. Model Evaluation (30 mins)\n",
"\n",
"Once we have created our models, we need to evaluate them to determine their performance. We should use a variety of metrics to evaluate the performance of each model, such as accuracy, precision, recall, F1 score, ROC curve, AUC, and so on. We should also use appropriate techniques to validate the models, such as cross-validation, train-test split, or hold-out validation. By doing this, we can determine which model is the best fit for our data. [Learn how to do this using Scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html).\n",
"\n",
"It's important to consider multiple models in the evaluation process to ensure that we are choosing the best one for our data. We may create and evaluate several models before selecting the best one. We should also consider the trade-offs between model complexity and accuracy to make sure that it can run in Coursemology.\n",
"\n",
"#### 12. Hyperparameters Search (50 mins)\n",
"\n",
"After choosing a model, we should optimize its hyperparameters to achieve the best performance. Hyperparameters are parameters that are not learned during training, such as the learning rate, the number of hidden layers (if applicable), or the regularization coefficient (if applicable). We can use various methods to search for the optimal hyperparameters, such as grid search, random search, or Bayesian optimization. The choice of the method depends on the complexity of the model and the size of the dataset. By tuning the hyperparameters, we can improve the performance of the model and make it more robust.\n",
"\n",
"Scikit-learn offers a number of [hyperparameter search functions](https://scikit-learn.org/stable/modules/grid_search.html) to help you optimize your model. Aside from Scikit-learn, there are many other options -- you can search for these options on [GitHub](https://github.com/topics/hyperparameter-optimization).\n",
"\n",
"In addition to using optimization libraries and functions, you can also manually perform simple hyperparameter tuning. This involves adjusting the hyperparameters of your model and evaluating its performance repeatedly until the best combination is achieved. However, keep in mind that manual tuning can be time-consuming and may not be as effective as more advanced techniques."
]
},
{
"cell_type": "markdown",
"id": "9045d90d",
"metadata": {},
"source": [
"## Tasks & Submission"
]
},
{
"cell_type": "markdown",
"id": "c950686e",
"metadata": {},
"source": [
"### Task 1: Model Implementation (80% Marks)\n",
"\n",
"Implement your model that you want to submit by completing the following functions:\n",
"* `__init__`: The constructor for Model class.\n",
"* `fit`: Fit/train the model using the input data. You may perform data handling and preprocessing here before training your model.\n",
"* `predict`: Predict using the model. If you perform data handling and preprocessing in the `fit` function, then you may want to do the same here.\n",
"\n",
"#### Dependencies\n",
"\n",
"It is crucial to note that your model may rely on specific versions of Python packages, including:\n",
"\n",
"* Python 3.10\n",
"* Numpy version 1.23\n",
"* Pandas version 1.4\n",
"* Scikit-Learn version 1.1\n",
"* PyTorch version 1.12\n",
"* Torchvision version 0.13\n",
"\n",
"To prevent any compatibility issues or unexpected errors during the execution of your code, ensure that you are using the correct versions of these packages. You can refer to `environment.yml` for a comprehensive list of packages that are pre-installed in Coursemology and can be used by your model. Note that if you do end up using libraries that are not installed on Coursemology, you might see an error like:\n",
"\n",
"\"Your code failed to evaluate correctly. There might be a syntax error, or perhaps execution failed to complete within the allocated time and memory limits.\"\n",
"\n",
"#### Model Template\n",
"\n",
"Note that you should copy and paste the code below *directly* into Coursemology for submission. You should probably test the code in this notebook on your local machine before uploading to Coursemology and using up an attempt. "
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "a44b7aa4",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-27T07:45:57.664982Z",
"start_time": "2024-04-27T07:45:57.652624Z"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import OrdinalEncoder\n",
"from sklearn.linear_model import LinearRegression\n",
"import sklearn.ensemble\n",
"\n",
"\n",
"class Model: \n",
" \"\"\"\n",
" This class represents an AI model.\n",
" \"\"\"\n",
" def __init__(self):\n",
" \"\"\"\n",
" Constructor for Model class.\n",
" \n",
" Parameters\n",
" ----------\n",
" self : object\n",
" The instance of the object passed by Python.\n",
" \"\"\"\n",
" self.model = LinearRegression()\n",
"\n",
" def process_input(self, X):\n",
" images = X['images'].reshape(X['images'].shape[0], -1)\n",
" X = X['tabular']\n",
" X = \n",
" def object_columns(X):\n",
" return X.dtypes[X.dtypes == 'object'].index\n",
"\n",
" def convert_to_ordinal(X, columns):\n",
" encoder = OrdinalEncoder()\n",
" return encoder.fit_transform(X[columns])\n",
"\n",
" obj_cols = object_columns(X)\n",
" ordinal_columns = convert_to_ordinal(X, obj_cols)\n",
" X[obj_cols] = ordinal_columns\n",
" columns_to_drop = ['V40', 'V20', 'V39', 'V15', 'V10', 'V35', 'V2', 'V52', 'V45', 'V7', 'V48', 'V49', 'V43', 'V44', 'V26', 'V41', 'V11', 'V53', 'V42', 'V38']\n",
" X = X.drop(columns_to_drop, axis=1)\n",
" X = X.fillna(X.mean())\n",
" return X\n",
" def fit(self, X_dict, y):\n",
" \"\"\"\n",
" Train the model using the input data.\n",
" \n",
" Parameters\n",
" ----------\n",
" X_dict : dictionary with the following entries:\n",
" - tabular: pandas Dataframe of shape (n_samples, n_features)\n",
" - images: ndarray of shape (n_samples, height, width)\n",
" Training data.\n",
" y : pandas Dataframe of shape (n_samples,)\n",
" Target values.\n",
" \n",
" Returns\n",
" -------\n",
" self : object\n",
" Returns an instance of the trained model.\n",
" \"\"\"\n",
" X = X_dict['tabular']\n",
" X = self.process_input(X)\n",
" self.model.fit(X, y)\n",
" return self\n",
" \n",
" def predict(self, X_dict):\n",
" \"\"\"\n",
" Use the trained model to make predictions.\n",
" \n",
" Parameters\n",
" ----------\n",
" X_dict : dictionary with the following entries:\n",
" - tabular: pandas Dataframe of shape (n_samples, n_features)\n",
" - images: ndarray of shape (n_samples, height, width)\n",
" Input data.\n",
" \n",
" Returns\n",
" -------\n",
" pandas Dataframe of shape (n_samples,)\n",
" Predicted target values per element in X_dict.\n",
" \n",
" \"\"\"\n",
" X = self.process_input(X_dict['tabular'])\n",
" return self.model.predict(X)\n",
" # return [0 for _ in range(len(X_dict['tabular']))]"
]
},
{
"cell_type": "markdown",
"id": "e02178d7",
"metadata": {},
"source": [
"#### Local Evaluation\n",
"\n",
"You may test your solution locally by running the following code. Do note that the results may not reflect your performance in Coursemology. You should not be submitting the code below in Coursemology. The code here is meant only for you to do local testing. "
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "4f4dd489",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-27T07:23:39.732051Z",
"start_time": "2024-04-27T07:23:39.725818Z"
}
},
"outputs": [],
"source": [
"# Import packages\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error, r2_score\n",
"from util import dict_train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "3064e0ff",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-27T07:23:42.216498Z",
"start_time": "2024-04-27T07:23:40.676178Z"
}
},
"outputs": [],
"source": [
"# Load data\n",
"df = pd.read_csv(os.path.join('data', 'tabular.csv'))\n",
"with open(os.path.join('data', 'images.npy'), 'rb') as f:\n",
" images = np.load(f)\n",
" \n",
"# Exclude target column\n",
"X_columns = [col for col in df.columns if col != 'target']\n",
"\n",
"# Create X_dict and y\n",
"X_dict = {\n",
" 'tabular': df[X_columns],\n",
" 'images': images\n",
"}\n",
"y = df['target']"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "27c9fd10",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-27T07:46:01.374238Z",
"start_time": "2024-04-27T07:45:59.640013Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE: 5352.19\n"
]
}
],
"source": [
"# Split train and test\n",
"X_dict_train, y_train, X_dict_test, y_test = dict_train_test_split(X_dict, y, ratio=0.9)\n",
"\n",
"# Train and predict\n",
"model = Model()\n",
"model.fit(X_dict_train, y_train)\n",
"y_pred = model.predict(X_dict_test)\n",
"\n",
"# Evaluate model predition\n",
"# Learn more: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics\n",
"print(\"MSE: {0:.2f}\".format(mean_squared_error(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"id": "16861aef",
"metadata": {},
"source": [
"#### Grading Scheme\n",
"\n",
"Your code implementation will be graded based on its performance (**MSE**) in the contest. Your model will be trained with the data that we provided you with this assesment. We will use score cutoffs that we will decide after the contest to determine your marks.\n",
"\n",
"The performance of your model will be determined by a separate test data set, drawn from the same population as the training set, but not provided to you earlier. The marks you will receive will depend on the **MSE** of the predictions:\n",
"\n",
"* If your score is above the mean or median, you can expect to receive decent marks. \n",
"* If your score is higher than the 75th percentile, you are likely to receive good marks. \n",
"* If you achieve a score above the 90th percentile (top 10%), you will likely receive full marks.\n",
"\n",
"Throughout the contest, we will provide periodic updates on the distribution of the score of student submissions in the official forum thread (see Overview) based on the **public test case**, which test the performance of the model on **a small subset of data from the hidden test data**. You can use these updates to estimate your relative standing, compared to your peers. "
]
},
{
"cell_type": "markdown",
"id": "44c79c17",
"metadata": {},
"source": [
"### Task 2: Scratch Pad (20% Marks)\n",
"\n",
"Fill up the `scratchpad.ipynb` with your working. \n",
"\n",
"In the **\"Report\" section**, write a report that explain the thought process behind your solution, and convince us that you have understood the concepts taught in class and can apply them. The report should cover data exploration and preparation, data preprocessing, modeling, and evaluation. The final solution and any alternative approaches that were tried but did not work may also be documented. The length of the report should be approximately equivalent to **1-2 pages of A4 paper (up to 1,000 words)**.\n",
"\n",
"#### Grading Scheme\n",
"\n",
"The report will be graded based on the reasonability and soundness of the approach you take, your understanding of the data, and your final solution. If you do not make any errors in your approach, reasoning/understanding, and conclusion, you can expect to receive full marks. This part is meant to be \"standard\", and is only for us to do a quick sanity check that you actually did the work required to come up with the model you submitted."
]
},
{
"cell_type": "markdown",
"id": "28b658b4",
"metadata": {},
"source": [
"### Submission\n",
"\n",
"Once you are done, please submit your work to Coursemology, by copying the right snippets of code into the corresponding box that says 'Your answer', and click 'Save'. You can still make changes after you save your submission.\n",
"\n",
"When submitting your model, the `fit` function will be called to train your model with the **data that we have provided you**. Due to the inherent stochasticity of the training process, **your model's performance may vary across different runs**. To ensure deterministic results, you can set a fixed random seed in your code. After the training is completed, the `predict` function will be used to evaluate your model. The evaluation of your model will be based on two test cases: \n",
"1. **Public test cases, containing a small portion of the test data**, that allows you to **estimate** your score. \n",
"2. **Evaluation test cases containing the remaining test data** (which you will not be able to see) by which we will evaluate your model. \n",
"\n",
"Your score in the public test case may not reflect your actual score. **Note that running all test cases can take up to 5 minutes to complete, and you have a maximum of 20 attempts.** We only provide you with a limited number of tries because we do not want you to spam our autograder. \n",
"\n",
"Finally, when you are satisfied with your submission, you can finalize it by clicking \"Finalize submission.\". <span style=\"color:red\">**Note that once you have finalized your submission, it is considered to be submitted for grading, and no further changes can be made**.</span>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}