{ "cells": [ { "cell_type": "markdown", "id": "7d017333", "metadata": {}, "source": [ "# Final Assessment Scratch Pad" ] }, { "cell_type": "markdown", "id": "d3d00386", "metadata": {}, "source": [ "## Instructions" ] }, { "cell_type": "markdown", "id": "ea516aa7", "metadata": {}, "source": [ "1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.\n", "2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.\n", "3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so." ] }, { "cell_type": "markdown", "id": "022cb4cd", "metadata": {}, "source": [ "## Report" ] }, { "cell_type": "markdown", "id": "9c14a2d8", "metadata": {}, "source": [ "**[TODO]**\n", "\n", "Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to provide an overview of your thought process and approach but also concise enough to be easily understandable. Also, please follow the guidelines given in the `main.ipynb`.\n", "\n", "This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow the instructions or if you include too many words here. \n", "\n", "**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**\n", "\n", "##### Overview\n", "**[TODO]**\n", "\n", "##### 1. 
Descriptive Analysis\n", "**[TODO]**\n", "\n", "##### 2. Detection and Handling of Missing Values\n", "**[TODO]**\n", "\n", "##### 3. Detection and Handling of Outliers\n", "**[TODO]**\n", "\n", "##### 4. Detection and Handling of Class Imbalance \n", "**[TODO]**\n", "\n", "##### 5. Understanding Relationship Between Variables\n", "**[TODO]**\n", "\n", "##### 6. Data Visualization\n", "**[TODO]** \n", "##### 7. General Preprocessing\n", "**[TODO]**\n", " \n", "##### 8. Feature Selection \n", "**[TODO]**\n", "\n", "##### 9. Feature Engineering\n", "**[TODO]**\n", "\n", "##### 10. Creating Models\n", "**[TODO]**\n", "\n", "##### 11. Model Evaluation\n", "**[TODO]**\n", "\n", "##### 12. Hyperparameters Search\n", "**[TODO]**\n", "\n", "##### Conclusion\n", "**[TODO]**" ] }, { "cell_type": "markdown", "id": "49dcaf29", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "27103374", "metadata": {}, "source": [ "# Workings (Not Graded)\n", "\n", "You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. " ] }, { "cell_type": "markdown", "id": "0f4c6cd4", "metadata": {}, "source": [ "## Import Packages\n", "\n", "Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`)." 
] }, { "cell_type": "code", "execution_count": 95, "id": "cded1ed6", "metadata": { "ExecuteTime": { "end_time": "2024-04-27T06:36:12.309324Z", "start_time": "2024-04-27T06:36:12.305262Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import numpy as np\n", "from util import show_images, dict_train_test_split\n", "from sklearn.preprocessing import OrdinalEncoder\n", "import matplotlib.pyplot as plt\n" ] }, { "cell_type": "markdown", "id": "748c35d7", "metadata": {}, "source": [ "## Load Dataset\n", "\n", "The dataset provided is multimodal and contains two components, images and tabular data. The tabular dataset `tabular.csv` contains $N$ entries and $F$ columns, including the target feature. On the other hand, the image dataset `images.npy` is of size $(N, H, W)$, where $N$, $H$, and $W$ correspond to the number of samples, image height, and image width, respectively. Each image corresponds to the row at the same index of the tabular dataset. These datasets can be found in the `data/` folder in the given file structure.\n", "\n", "A code snippet that loads and displays some of the data is provided below.\n", "\n", "### Load Tabular Data" ] }, { "cell_type": "code", "execution_count": 96, "id": "a88be725", "metadata": { "ExecuteTime": { "end_time": "2024-04-27T06:36:13.752562Z", "start_time": "2024-04-27T06:36:12.324501Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(357699, 61)\n", "Object columns 18\n" ] } ], "source": [ "df = pd.read_csv(os.path.join('data', 'tabular.csv'))\n", "print(df.shape)\n", "df.head()\n", "\n", "import math\n", "\n", "# Object columns\n", "object_columns = df.dtypes[df.dtypes == 'object']\n", "print('Object columns', object_columns.shape[0])" ] }, { "cell_type": "markdown", "id": "c09da291", "metadata": {}, "source": [ "### Load Image Data" ] }, { "cell_type": "code", "execution_count": 114, "id": "6297e25a", "metadata": { "ExecuteTime": { "end_time": "2024-04-27T07:48:25.899880Z", 
"start_time": "2024-04-27T07:48:25.398594Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape: (357699, 8, 8)\n" ] }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": "(357699, 64)" }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(os.path.join('data', 'images.npy'), 'rb') as f:\n", " images = np.load(f)\n", " \n", "print('Shape:', images.shape)\n", "show_images(images[:18], n_row=3, n_col=5, figsize=[12,5])\n", "images.reshape(images.shape[0], -1).shape" ] }, { "cell_type": "markdown", "id": "cbe832b6", "metadata": {}, "source": [ "## Data Exploration & Preparation" ] }, { "cell_type": "markdown", "id": "2f6a464c", "metadata": {}, "source": [ "### 1. Descriptive Analysis" ] }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns with high NaNs: Index(['V15', 'V38', 'V39', '0', '19', '24'], dtype='object')\n", "Columns with high zeros Index(['55', '61', '62', '63'], dtype='object')\n", "['42', '60', '35', 'V20', 'V35', '5', '0', '57', '63', '2', '9', '12', '40', 'V41', '53', '33', '20', '29', '50', '58', 'V40', '28', 'V39', '52', '37', '54', 'V52', '61', '56', '39', '21', 'V45', 'V7', '47', '51', '55', 'V26', '16', '41', 'V15', '19', '62', '14', '25', '49', '22', '7', '4', '18', '36', '13', 'V48', 'V49', '23', '1', '48', 'V44', '6', 'V11', 'V53', '32', 'V38', '43', '11', '44', 'V10', '3', 'V2', '15', '38', '30', '45', '26', '24', 'V43', '34', '46', 'V42']\n" ] } ], "source": [ "Y = df['target']\n", "X = df.drop('target', axis=1)\n", "\n", "def nan_columns(X, threshold=0.5):\n", " count = X.shape[0] * threshold\n", " nan_columns = X.isna().sum()\n", " return nan_columns[nan_columns >= count].index\n", "def zero_columns(X, threshold=0.5):\n", " count = X.shape[0] * threshold\n", " zero_cols = (X == 0).sum()\n", " return zero_cols[zero_cols >= count].index\n", "\n", "def object_columns(X):\n", " return X.dtypes[X.dtypes == 'object'].index\n", "\n", "def convert_to_ordinal(X, columns):\n", " encoder = OrdinalEncoder()\n", " return 
encoder.fit_transform(X[columns])\n", "\n", "def correlated_columns(X, threshold=0.99):\n", " corr = X.corr()\n", " upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))\n", " return [column for column in upper.columns if any(upper[column] > threshold)]\n", "\n", "# Identify columns with a high share of NaNs or zeros.\n", "# Results are stored under names that differ from the helper functions above,\n", "# so the helpers are not overwritten and this cell can be re-run.\n", "nan_cols = nan_columns(X, 0.5)\n", "print('Columns with high NaNs:', nan_cols)\n", "zero_cols = zero_columns(X, 0.9)\n", "print('Columns with high zeros', zero_cols)\n", "# Encode the object (string) columns as ordinal values\n", "obj_cols = object_columns(X)\n", "ordinal_columns = convert_to_ordinal(X, obj_cols)\n", "X[obj_cols] = ordinal_columns\n", "\n", "# Flag columns that are highly correlated with an earlier column\n", "correlated_cols = correlated_columns(X, 0.95)\n", "\n", "columns_to_drop = list(set(nan_cols) | set(zero_cols) | set(correlated_cols))\n", "print(columns_to_drop)\n" ], "metadata": { "ExecuteTime": { "end_time": "2024-04-27T07:50:40.607409Z", "start_time": "2024-04-27T07:50:30.338617Z" } }, "id": "3b1f62dd", "execution_count": 121 }, { "cell_type": "code", "outputs": [ { "data": { "text/plain": " V0 V1 V3 V4 V5 V6 V8 V9 \\\n0 8315.0 1784.0 37115.0 317.0 105.016815 296559.0 2470.0 1.0 \n1 8315.0 1272.0 18683.0 230.0 NaN 340059.0 2820.0 0.0 \n2 8315.0 3832.0 147707.0 607.0 105.018240 279159.0 2330.0 1.0 \n3 8315.0 2296.0 55547.0 404.0 NaN 313959.0 2610.0 1.0 \n4 11021.0 1784.0 37115.0 375.0 105.024985 232701.0 1490.0 0.0 \n... ... ... ... ... ... ... ... ... \n357694 8315.0 1272.0 18683.0 230.0 105.012445 270459.0 2260.0 0.0 \n357695 8315.0 2296.0 55547.0 404.0 NaN 244359.0 2050.0 0.0 \n357696 8315.0 1784.0 37115.0 375.0 NaN 348759.0 2890.0 0.0 \n357697 8315.0 1784.0 37115.0 375.0 105.016815 348759.0 2890.0 0.0 \n357698 8315.0 1784.0 37115.0 317.0 NaN 244359.0 2050.0 0.0 \n\n V12 V13 ... V56 V57 V58 V59 8 10 \\\n0 85.0 737.0 ... 1089 293 2.0 7428.249334 0.249110 0.283362 \n1 42.0 585.0 ... 9801 1085 7.0 9693.829502 -1.144696 -1.343454 \n2 335.0 1041.0 ... 1485 304 6.0 7609.258214 0.129641 -0.258910 \n3 113.0 889.0 ... 
-495 711 4.0 4258.532609 0.726987 0.283362 \n4 186.0 737.0 ... 1683 117 0.0 9492.484802 0.249110 0.283362 \n... ... ... ... ... ... ... ... ... ... \n357694 4.0 585.0 ... 6336 1855 2.0 4634.276235 -4.290717 -2.427998 \n357695 110.0 889.0 ... 2970 854 8.0 8379.073980 0.129641 -0.258910 \n357696 163.0 737.0 ... -4257 942 8.0 5359.986193 0.408403 0.283362 \n357697 147.0 737.0 ... 2376 1195 7.0 9095.239127 0.726987 0.283362 \n357698 46.0 737.0 ... 9108 502 3.0 9379.720939 0.129641 -0.258910 \n\n 17 27 31 59 \n0 -1.523953 -0.689523 -0.637881 1.465378 \n1 -0.425715 -1.246596 -1.090949 -0.852887 \n2 0.306444 1.538767 1.627457 -0.080132 \n3 0.672524 -0.132450 -0.184813 -0.080132 \n4 -0.425715 -0.410987 -0.637881 1.465378 \n... ... ... ... ... \n357694 -4.086510 -1.246596 -1.090949 2.238133 \n357695 0.306444 -0.132450 -0.184813 -0.080132 \n357696 0.672524 -0.410987 -0.637881 -0.080132 \n357697 0.672524 -0.410987 -0.637881 -0.080132 \n357698 0.306444 -0.689523 -0.637881 -0.080132 \n\n[357699 rows x 46 columns]", "text/html": "
" }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_dropped = X.drop(columns_to_drop, axis=1)\n", "X_dropped" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-27T07:50:42.584344Z", "start_time": "2024-04-27T07:50:42.498150Z" } }, "id": "b8383cb1d724181c", "execution_count": 122 }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(357699, 46)\n" ] } ], "source": [ "print(X_dropped.shape)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-27T07:54:05.509713Z", "start_time": "2024-04-27T07:54:05.505067Z" } }, "id": "c64798f73ec3412f", "execution_count": 134 }, { "cell_type": "markdown", "source": [ "### 2. Detection and Handling of Missing Values" ], "metadata": {}, "id": "adb61967" }, { "cell_type": "code", "execution_count": 135, "id": "4bb9cdfb", "metadata": { "ExecuteTime": { "end_time": "2024-04-27T07:54:06.587195Z", "start_time": "2024-04-27T07:54:06.478662Z" } }, "outputs": [], "source": [ "# For the columns with nan's that are not the object columns, fill them with mean\n", "# For the object columns, fill them with the mode\n", "X_missing = X_dropped.fillna(X_dropped.mean())\n", "# TODO: Replace with mode for object columns" ] }, { "cell_type": "markdown", "id": "8adcb9cd", "metadata": {}, "source": [ "### 3. 
Detection and Handling of Outliers" ] }, { "cell_type": "code", "outputs": [], "source": [ "# Time to do PCA\n", "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=30)\n", "X_pca = pca.fit_transform(X_missing)\n", "# plt.scatter(X_pca[:, 0], X_pca[:, 1], c=Y)\n", "# plt.colorbar()\n", "# plt.show()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-27T07:59:38.839920Z", "start_time": "2024-04-27T07:59:34.454737Z" } }, "id": "878c95195942e270", "execution_count": 151 }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9999008890228839\n", "2\n" ] } ], "source": [ "# Find the smallest number of leading components that explain at least 99% of the variance\n", "res = 0\n", "variance = pca.explained_variance_ratio_\n", "for i in range(len(variance)):\n", "    if np.sum(variance[0:i]) >= 0.99:\n", "        res = i\n", "        break\n", "print(np.sum(variance[:res]))\n", "print(res)\n" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-27T07:59:42.918071Z", "start_time": "2024-04-27T07:59:42.915297Z" } }, "id": "724586267e51a3c5", "execution_count": 155 }, { "cell_type": "markdown", "id": "d4916043", "metadata": {}, "source": [ "### 4. Detection and Handling of Class Imbalance" ] }, { "cell_type": "code", "execution_count": null, "id": "ad3ab20e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2552a795", "metadata": {}, "source": [ "### 5. Understanding Relationship Between Variables" ] }, { "cell_type": "code", "execution_count": null, "id": "29ddbbcf", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "757fb315", "metadata": {}, "source": [ "### 6. Data Visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "93f82e42", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2a7eebcf", "metadata": {}, "source": [ "## Data Preprocessing" ] }, { "cell_type": "markdown", "id": "ae3e3383", "metadata": {}, "source": [ "### 7. 
General Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "id": "19174365", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fb3aa527", "metadata": {}, "source": [ "### 8. Feature Selection" ] }, { "cell_type": "code", "execution_count": null, "id": "a85808bf", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4921e8ca", "metadata": {}, "source": [ "### 9. Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "id": "dbcde626", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fa676c3f", "metadata": {}, "source": [ "## Modeling & Evaluation" ] }, { "cell_type": "markdown", "id": "589b37e4", "metadata": {}, "source": [ "### 10. Creating models" ] }, { "cell_type": "code", "execution_count": 158, "id": "d8dffd7d", "metadata": { "ExecuteTime": { "end_time": "2024-04-27T08:00:10.769193Z", "start_time": "2024-04-27T08:00:10.657136Z" } }, "outputs": [], "source": [ "# Split the data into train and test\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "X_train, X_test, y_train, y_test = train_test_split(X_missing, Y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: 5311.417393315556\n" ] } ], "source": [ "# Linear Regression\n", "# # Train the model\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "# # Predict\n", "y_pred = model.predict(X_test)\n", "# # Evaluate\n", "mse = mean_squared_error(y_test, y_pred)\n", "print('MSE:', mse)\n" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-27T08:00:12.375826Z", "start_time": "2024-04-27T08:00:11.942486Z" } }, "id": "9864de4426d22d9b", "execution_count": 159 }, { "cell_type": "code", "outputs": [], "source": [], "metadata": { "collapsed": false }, "id": 
"5381f534af74f626" }, { "cell_type": "markdown", "id": "495bf3c0", "metadata": {}, "source": [ "### 11. Model Evaluation" ] }, { "cell_type": "code", "execution_count": null, "id": "9245ab47", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8aa31404", "metadata": {}, "source": [ "### 12. Hyperparameters Search" ] }, { "cell_type": "code", "execution_count": null, "id": "81addd51", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" } }, "nbformat": 4, "nbformat_minor": 5 }