{ "cells": [ { "cell_type": "markdown", "id": "7d017333", "metadata": {}, "source": [ "# Final Assessment Scratch Pad" ] }, { "cell_type": "markdown", "id": "d3d00386", "metadata": {}, "source": [ "## Instructions" ] }, { "cell_type": "markdown", "id": "ea516aa7", "metadata": {}, "source": [ "1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.\n", "2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.\n", "3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so." ] }, { "cell_type": "markdown", "id": "022cb4cd", "metadata": {}, "source": [ "## Report" ] }, { "cell_type": "markdown", "id": "9c14a2d8", "metadata": {}, "source": [ "**[TODO]**\n", "\n", "Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to provide an overview of your thought process and approach but also concise enough to be easily understandable. Also, please follow the guidelines given in `main.ipynb`.\n", "\n", "This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow instructions and you include too many words here. \n", "\n", "**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**\n", "\n", "##### Overview\n", "**[TODO]**\n", "\n", "##### 1. 
Descriptive Analysis\n", "**[TODO]**\n", "\n", "##### 2. Detection and Handling of Missing Values\n", "**[TODO]**\n", "\n", "##### 3. Detection and Handling of Outliers\n", "**[TODO]**\n", "\n", "##### 4. Detection and Handling of Class Imbalance \n", "**[TODO]**\n", "\n", "##### 5. Understanding Relationship Between Variables\n", "**[TODO]**\n", "\n", "##### 6. Data Visualization\n", "**[TODO]** \n", "##### 7. General Preprocessing\n", "**[TODO]**\n", " \n", "##### 8. Feature Selection \n", "**[TODO]**\n", "\n", "##### 9. Feature Engineering\n", "**[TODO]**\n", "\n", "##### 10. Creating Models\n", "**[TODO]**\n", "\n", "##### 11. Model Evaluation\n", "**[TODO]**\n", "\n", "##### 12. Hyperparameters Search\n", "**[TODO]**\n", "\n", "##### Conclusion\n", "**[TODO]**" ] }, { "cell_type": "markdown", "id": "49dcaf29", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "27103374", "metadata": {}, "source": [ "# Workings (Not Graded)\n", "\n", "You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. " ] }, { "cell_type": "markdown", "id": "0f4c6cd4", "metadata": {}, "source": [ "## Import Packages\n", "\n", "Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`)." 
] }, { "cell_type": "code", "execution_count": 2, "id": "cded1ed6", "metadata": { "ExecuteTime": { "end_time": "2024-04-20T14:33:43.165330Z", "start_time": "2024-04-20T14:33:41.764757Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import numpy as np\n", "from util import show_images, dict_train_test_split" ] }, { "cell_type": "markdown", "id": "748c35d7", "metadata": {}, "source": [ "## Load Dataset\n", "\n", "The dataset provided is multimodal and contains two components: images and tabular data. The tabular dataset `tabular.csv` contains $N$ entries and $F$ columns, including the target feature. The image dataset `images.npy` is of size $(N, H, W)$, where $N$, $H$, and $W$ correspond to the number of samples, image height, and image width, respectively. Each image corresponds to the row at the same index in the tabular dataset. These datasets can be found in the `data/` folder in the given file structure.\n", "\n", "A code snippet that loads and displays some of the data is provided below.\n", "\n", "### Load Tabular Data" ] }, { "cell_type": "code", "execution_count": 3, "id": "a88be725", "metadata": { "ExecuteTime": { "end_time": "2024-04-20T14:33:46.117171Z", "start_time": "2024-04-20T14:33:44.688985Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(357699, 61)\n" ] }, { "data": { "text/plain": " V0 V1 V2 V3 V4 V5 V6 V7 \\\n0 8315.0 1784.0 21994.0 37115.0 317.0 105.016815 296559.0 321602.0 \n1 8315.0 1272.0 11114.0 18683.0 230.0 NaN 340059.0 368602.0 \n2 8315.0 3832.0 65514.0 147707.0 607.0 105.018240 279159.0 302802.0 \n3 8315.0 2296.0 32874.0 55547.0 404.0 NaN 313959.0 340402.0 \n4 11021.0 1784.0 21994.0 37115.0 375.0 105.024985 232701.0 252606.0 \n\n V8 V9 ... V51 V52 V53 V54 V55 V56 V57 V58 \\\n0 2470.0 C1 ... C4 C4 834148.0 C2 C6 1089 293 C2 \n1 2820.0 C0 ... C7 C7 401668.0 C5 C6 9801 1085 C7 \n2 2330.0 C1 ... C7 C7 820948.0 C5 C4 1485 304 C6 \n3 2610.0 C1 ... 
C7 C7 1664548.0 C5 C5 -495 711 C4 \n4 1490.0 C0 ... C7 C7 735748.0 C2 C9 1683 117 C0 \n\n V59 target \n0 7428.249334 300.0 \n1 9693.829502 200.0 \n2 7609.258214 50.0 \n3 4258.532609 140.0 \n4 9492.484802 20.0 \n\n[5 rows x 61 columns]", "text/html": "
| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V51 | V52 | V53 | V54 | V55 | V56 | V57 | V58 | V59 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8315.0 | 1784.0 | 21994.0 | 37115.0 | 317.0 | 105.016815 | 296559.0 | 321602.0 | 2470.0 | C1 | ... | C4 | C4 | 834148.0 | C2 | C6 | 1089 | 293 | C2 | 7428.249334 | 300.0 |
| 1 | 8315.0 | 1272.0 | 11114.0 | 18683.0 | 230.0 | NaN | 340059.0 | 368602.0 | 2820.0 | C0 | ... | C7 | C7 | 401668.0 | C5 | C6 | 9801 | 1085 | C7 | 9693.829502 | 200.0 |
| 2 | 8315.0 | 3832.0 | 65514.0 | 147707.0 | 607.0 | 105.018240 | 279159.0 | 302802.0 | 2330.0 | C1 | ... | C7 | C7 | 820948.0 | C5 | C4 | 1485 | 304 | C6 | 7609.258214 | 50.0 |
| 3 | 8315.0 | 2296.0 | 32874.0 | 55547.0 | 404.0 | NaN | 313959.0 | 340402.0 | 2610.0 | C1 | ... | C7 | C7 | 1664548.0 | C5 | C5 | -495 | 711 | C4 | 4258.532609 | 140.0 |
| 4 | 11021.0 | 1784.0 | 21994.0 | 37115.0 | 375.0 | 105.024985 | 232701.0 | 252606.0 | 1490.0 | C0 | ... | C7 | C7 | 735748.0 | C2 | C9 | 1683 | 117 | C0 | 9492.484802 | 20.0 |

5 rows × 61 columns