{
"cells": [
{
"cell_type": "markdown",
"id": "7d017333",
"metadata": {},
"source": [
"# Final Assessment Scratch Pad"
]
},
{
"cell_type": "markdown",
"id": "d3d00386",
"metadata": {},
"source": [
"## Instructions"
]
},
{
"cell_type": "markdown",
"id": "ea516aa7",
"metadata": {},
"source": [
"1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.\n",
"2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.\n",
"3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so."
]
},
{
"cell_type": "markdown",
"id": "022cb4cd",
"metadata": {},
"source": [
"## Report"
]
},
{
"cell_type": "markdown",
"id": "9c14a2d8",
"metadata": {},
"source": [
"**[TODO]**\n",
"\n",
"Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to provide an overview of your thought process and approach, but concise enough to be easily understandable. Also, please follow the guidelines given in `main.ipynb`.\n",
"\n",
"This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow these instructions or exceed the word limit.\n",
"\n",
"**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**\n",
"\n",
"##### Overview\n",
"**[TODO]**\n",
"\n",
"##### 1. Descriptive Analysis\n",
"**[TODO]**\n",
"\n",
"##### 2. Detection and Handling of Missing Values\n",
"**[TODO]**\n",
"\n",
"##### 3. Detection and Handling of Outliers\n",
"**[TODO]**\n",
"\n",
"##### 4. Detection and Handling of Class Imbalance\n",
"**[TODO]**\n",
"\n",
"##### 5. Understanding Relationship Between Variables\n",
"**[TODO]**\n",
"\n",
"##### 6. Data Visualization\n",
"**[TODO]**\n",
"\n",
"##### 7. General Preprocessing\n",
"**[TODO]**\n",
"\n",
"##### 8. Feature Selection\n",
"**[TODO]**\n",
"\n",
"##### 9. Feature Engineering\n",
"**[TODO]**\n",
"\n",
"##### 10. Creating Models\n",
"**[TODO]**\n",
"\n",
"##### 11. Model Evaluation\n",
"**[TODO]**\n",
"\n",
"##### 12. Hyperparameters Search\n",
"**[TODO]**\n",
"\n",
"##### Conclusion\n",
"**[TODO]**"
]
},
{
"cell_type": "markdown",
"id": "49dcaf29",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "27103374",
"metadata": {},
"source": [
"# Workings (Not Graded)\n",
"\n",
"You will do your working below. Note that anything below this section will not be graded, but we might cross-check what you wrote in the report above against your workings to make sure that you actually did what you claimed to have done."
]
},
{
"cell_type": "markdown",
"id": "0f4c6cd4",
"metadata": {},
"source": [
"## Import Packages\n",
"\n",
"Here, we import some packages necessary to run this notebook. You may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cded1ed6",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-20T14:33:43.165330Z",
"start_time": "2024-04-20T14:33:41.764757Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"import numpy as np\n",
"from util import show_images, dict_train_test_split"
]
},
{
"cell_type": "markdown",
"id": "748c35d7",
"metadata": {},
"source": [
"## Load Dataset\n",
"\n",
"The dataset provided is multimodal and contains two components, images and tabular data. The tabular dataset `tabular.csv` contains $N$ entries and $F$ columns, including the target feature. On the other hand, the image dataset `images.npy` is of size $(N, H, W)$, where $N$, $H$, and $W$ correspond to the number of samples, image height, and image width, respectively. Each image corresponds to the row at the same index of the tabular dataset. These datasets can be found in the `data/` folder in the given file structure.\n",
"\n",
"A code snippet that loads and displays some of the data is provided below.\n",
"\n",
"### Load Tabular Data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a88be725",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-20T14:33:46.117171Z",
"start_time": "2024-04-20T14:33:44.688985Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(357699, 61)\n"
]
},
{
"data": {
"text/plain": " V0 V1 V2 V3 V4 V5 V6 V7 \\\n0 8315.0 1784.0 21994.0 37115.0 317.0 105.016815 296559.0 321602.0 \n1 8315.0 1272.0 11114.0 18683.0 230.0 NaN 340059.0 368602.0 \n2 8315.0 3832.0 65514.0 147707.0 607.0 105.018240 279159.0 302802.0 \n3 8315.0 2296.0 32874.0 55547.0 404.0 NaN 313959.0 340402.0 \n4 11021.0 1784.0 21994.0 37115.0 375.0 105.024985 232701.0 252606.0 \n\n V8 V9 ... V51 V52 V53 V54 V55 V56 V57 V58 \\\n0 2470.0 C1 ... C4 C4 834148.0 C2 C6 1089 293 C2 \n1 2820.0 C0 ... C7 C7 401668.0 C5 C6 9801 1085 C7 \n2 2330.0 C1 ... C7 C7 820948.0 C5 C4 1485 304 C6 \n3 2610.0 C1 ... C7 C7 1664548.0 C5 C5 -495 711 C4 \n4 1490.0 C0 ... C7 C7 735748.0 C2 C9 1683 117 C0 \n\n V59 target \n0 7428.249334 300.0 \n1 9693.829502 200.0 \n2 7609.258214 50.0 \n3 4258.532609 140.0 \n4 9492.484802 20.0 \n\n[5 rows x 61 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>V0</th>\n <th>V1</th>\n <th>V2</th>\n <th>V3</th>\n <th>V4</th>\n <th>V5</th>\n <th>V6</th>\n <th>V7</th>\n <th>V8</th>\n <th>V9</th>\n <th>...</th>\n <th>V51</th>\n <th>V52</th>\n <th>V53</th>\n <th>V54</th>\n <th>V55</th>\n <th>V56</th>\n <th>V57</th>\n <th>V58</th>\n <th>V59</th>\n <th>target</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>8315.0</td>\n <td>1784.0</td>\n <td>21994.0</td>\n <td>37115.0</td>\n <td>317.0</td>\n <td>105.016815</td>\n <td>296559.0</td>\n <td>321602.0</td>\n <td>2470.0</td>\n <td>C1</td>\n <td>...</td>\n <td>C4</td>\n <td>C4</td>\n <td>834148.0</td>\n <td>C2</td>\n <td>C6</td>\n <td>1089</td>\n <td>293</td>\n <td>C2</td>\n <td>7428.249334</td>\n <td>300.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>8315.0</td>\n <td>1272.0</td>\n <td>11114.0</td>\n <td>18683.0</td>\n <td>230.0</td>\n <td>NaN</td>\n <td>340059.0</td>\n <td>368602.0</td>\n <td>2820.0</td>\n <td>C0</td>\n <td>...</td>\n <td>C7</td>\n <td>C7</td>\n <td>401668.0</td>\n <td>C5</td>\n <td>C6</td>\n <td>9801</td>\n <td>1085</td>\n <td>C7</td>\n <td>9693.829502</td>\n <td>200.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>8315.0</td>\n <td>3832.0</td>\n <td>65514.0</td>\n <td>147707.0</td>\n <td>607.0</td>\n <td>105.018240</td>\n <td>279159.0</td>\n <td>302802.0</td>\n <td>2330.0</td>\n <td>C1</td>\n <td>...</td>\n <td>C7</td>\n <td>C7</td>\n <td>820948.0</td>\n <td>C5</td>\n <td>C4</td>\n <td>1485</td>\n <td>304</td>\n <td>C6</td>\n <td>7609.258214</td>\n <td>50.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>8315.0</td>\n <td>2296.0</td>\n <td>32874.0</td>\n <td>55547.0</td>\n <td>404.0</td>\n <td>NaN</td>\n <td>313959.0</td>\n <td>340402.0</td>\n 
<td>2610.0</td>\n <td>C1</td>\n <td>...</td>\n <td>C7</td>\n <td>C7</td>\n <td>1664548.0</td>\n <td>C5</td>\n <td>C5</td>\n <td>-495</td>\n <td>711</td>\n <td>C4</td>\n <td>4258.532609</td>\n <td>140.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>11021.0</td>\n <td>1784.0</td>\n <td>21994.0</td>\n <td>37115.0</td>\n <td>375.0</td>\n <td>105.024985</td>\n <td>232701.0</td>\n <td>252606.0</td>\n <td>1490.0</td>\n <td>C0</td>\n <td>...</td>\n <td>C7</td>\n <td>C7</td>\n <td>735748.0</td>\n <td>C2</td>\n <td>C9</td>\n <td>1683</td>\n <td>117</td>\n <td>C0</td>\n <td>9492.484802</td>\n <td>20.0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 61 columns</p>\n</div>"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(os.path.join('data', 'tabular.csv'))\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "c09da291",
"metadata": {},
"source": [
"### Load Image Data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6297e25a",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-20T14:34:00.060850Z",
"start_time": "2024-04-20T14:33:59.642045Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (357699, 8, 8)\n"
]
},
{
"data": {
"text/plain": "<Figure size 1200x500 with 15 Axes>",
"image/png": ""
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with open(os.path.join('data', 'images.npy'), 'rb') as f:\n",
"    images = np.load(f)\n",
"\n",
"print('Shape:', images.shape)\n",
"show_images(images[:15], n_row=3, n_col=5, figsize=[12,5])"
]
},
{
"cell_type": "markdown",
"id": "cbe832b6",
"metadata": {},
"source": [
"## Data Exploration & Preparation"
]
},
{
"cell_type": "markdown",
"id": "2f6a464c",
"metadata": {},
"source": [
"### 1. Descriptive Analysis"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3b1f62dd",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-20T14:34:03.954499Z",
"start_time": "2024-04-20T14:34:03.584502Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(357699, 61)\n",
"V5 172305\n",
"V15 191109\n",
"V38 302903\n",
"V39 347413\n",
"shape (357699, 64)\n",
"nan 0 191109\n",
"nan 12 172305\n",
"nan 19 347413\n",
"nan 24 302903\n",
"0 55 357699\n",
"0 61 357699\n",
"0 62 357699\n",
"0 63 357699\n"
]
}
],
"source": [
"targets = df['target']\n",
"print(df.shape)\n",
"# Report tabular columns with more than 100,000 missing values\n",
"for column in df:\n",
"    n_missing = df[column].isna().sum()\n",
"    if n_missing > 100000:\n",
"        print(column, n_missing)\n",
"# V38 and V39 are mostly missing, so remove them; V5 and V15 can be interpolated or removed\n",
"# Flatten each 8x8 image into a 64-element row\n",
"flattened_images = images.reshape(images.shape[0], -1)\n",
"print('shape', flattened_images.shape)\n",
"# Identify uninformative pixel columns (mostly NaN or mostly zero)\n",
"for i, col in enumerate(flattened_images.T):\n",
"    n_nan = np.isnan(col).sum()\n",
"    if n_nan > 100000:\n",
"        print('nan', i, n_nan)\n",
"    n_zero = (col == 0).sum()\n",
"    if n_zero > 100000:\n",
"        print('0', i, n_zero)\n",
"# Pixel columns 19 and 24 are mostly NaN, and 55, 61, 62, 63 are always zero, so they are not useful\n",
"# Pixel columns 0 and 12 are about half NaN and can be interpolated\n"
]
},
{
"cell_type": "markdown",
"id": "adb61967",
"metadata": {},
"source": [
"### 2. Detection and Handling of Missing Values"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "4bb9cdfb",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-16T09:06:12.435405Z",
"start_time": "2024-04-16T09:06:12.381979Z"
}
},
"outputs": [
{
"ename": "KeyError",
"evalue": "\"['V38', 'V39', 'target'] not found in axis\"",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mKeyError\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn[85], line 2\u001B[0m\n\u001B[1;32m 1\u001B[0m dropped_columns \u001B[38;5;241m=\u001B[39m [\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mV38\u001B[39m\u001B[38;5;124m'\u001B[39m, \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mV39\u001B[39m\u001B[38;5;124m'\u001B[39m, \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mtarget\u001B[39m\u001B[38;5;124m'\u001B[39m]\n\u001B[0;32m----> 2\u001B[0m df \u001B[38;5;241m=\u001B[39m \u001B[43mdf\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdrop\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdropped_columns\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;241;43m1\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 3\u001B[0m flattened_images \u001B[38;5;241m=\u001B[39m np\u001B[38;5;241m.\u001B[39mdelete(flattened_images, [\u001B[38;5;241m19\u001B[39m, \u001B[38;5;241m24\u001B[39m,\u001B[38;5;241m55\u001B[39m,\u001B[38;5;241m61\u001B[39m,\u001B[38;5;241m62\u001B[39m,\u001B[38;5;241m63\u001B[39m], axis\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m)\n\u001B[1;32m 4\u001B[0m flattened_images\u001B[38;5;241m.\u001B[39mshape\n",
"File \u001B[0;32m/nix/store/nip0khhq6vhx1cimwz0ap9bzdvqawyg5-python3-3.11.8-env/lib/python3.11/site-packages/pandas/core/frame.py:5347\u001B[0m, in \u001B[0;36mDataFrame.drop\u001B[0;34m(self, labels, axis, index, columns, level, inplace, errors)\u001B[0m\n\u001B[1;32m 5199\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mdrop\u001B[39m(\n\u001B[1;32m 5200\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 5201\u001B[0m labels: IndexLabel \u001B[38;5;241m|\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 5208\u001B[0m errors: IgnoreRaise \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mraise\u001B[39m\u001B[38;5;124m\"\u001B[39m,\n\u001B[1;32m 5209\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m DataFrame \u001B[38;5;241m|\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 5210\u001B[0m \u001B[38;5;250m \u001B[39m\u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[1;32m 5211\u001B[0m \u001B[38;5;124;03m Drop specified labels from rows or columns.\u001B[39;00m\n\u001B[1;32m 5212\u001B[0m \n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 5345\u001B[0m \u001B[38;5;124;03m weight 1.0 0.8\u001B[39;00m\n\u001B[1;32m 5346\u001B[0m \u001B[38;5;124;03m \"\"\"\u001B[39;00m\n\u001B[0;32m-> 5347\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdrop\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 5348\u001B[0m \u001B[43m \u001B[49m\u001B[43mlabels\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlabels\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5349\u001B[0m \u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43maxis\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5350\u001B[0m \u001B[43m 
\u001B[49m\u001B[43mindex\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mindex\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5351\u001B[0m \u001B[43m \u001B[49m\u001B[43mcolumns\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mcolumns\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5352\u001B[0m \u001B[43m \u001B[49m\u001B[43mlevel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlevel\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5353\u001B[0m \u001B[43m \u001B[49m\u001B[43minplace\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43minplace\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5354\u001B[0m \u001B[43m \u001B[49m\u001B[43merrors\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43merrors\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 5355\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m/nix/store/nip0khhq6vhx1cimwz0ap9bzdvqawyg5-python3-3.11.8-env/lib/python3.11/site-packages/pandas/core/generic.py:4711\u001B[0m, in \u001B[0;36mNDFrame.drop\u001B[0;34m(self, labels, axis, index, columns, level, inplace, errors)\u001B[0m\n\u001B[1;32m 4709\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m axis, labels \u001B[38;5;129;01min\u001B[39;00m axes\u001B[38;5;241m.\u001B[39mitems():\n\u001B[1;32m 4710\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m labels \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m-> 4711\u001B[0m obj \u001B[38;5;241m=\u001B[39m \u001B[43mobj\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_drop_axis\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlabels\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mlevel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlevel\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43merrors\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43merrors\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 4713\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m inplace:\n\u001B[1;32m 4714\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_update_inplace(obj)\n",
"File \u001B[0;32m/nix/store/nip0khhq6vhx1cimwz0ap9bzdvqawyg5-python3-3.11.8-env/lib/python3.11/site-packages/pandas/core/generic.py:4753\u001B[0m, in \u001B[0;36mNDFrame._drop_axis\u001B[0;34m(self, labels, axis, level, errors, only_slice)\u001B[0m\n\u001B[1;32m 4751\u001B[0m new_axis \u001B[38;5;241m=\u001B[39m axis\u001B[38;5;241m.\u001B[39mdrop(labels, level\u001B[38;5;241m=\u001B[39mlevel, errors\u001B[38;5;241m=\u001B[39merrors)\n\u001B[1;32m 4752\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m-> 4753\u001B[0m new_axis \u001B[38;5;241m=\u001B[39m \u001B[43maxis\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdrop\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlabels\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43merrors\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43merrors\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 4754\u001B[0m indexer \u001B[38;5;241m=\u001B[39m axis\u001B[38;5;241m.\u001B[39mget_indexer(new_axis)\n\u001B[1;32m 4756\u001B[0m \u001B[38;5;66;03m# Case for non-unique axis\u001B[39;00m\n\u001B[1;32m 4757\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n",
"File \u001B[0;32m/nix/store/nip0khhq6vhx1cimwz0ap9bzdvqawyg5-python3-3.11.8-env/lib/python3.11/site-packages/pandas/core/indexes/base.py:6992\u001B[0m, in \u001B[0;36mIndex.drop\u001B[0;34m(self, labels, errors)\u001B[0m\n\u001B[1;32m 6990\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m mask\u001B[38;5;241m.\u001B[39many():\n\u001B[1;32m 6991\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m errors \u001B[38;5;241m!=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mignore\u001B[39m\u001B[38;5;124m\"\u001B[39m:\n\u001B[0;32m-> 6992\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m(\u001B[38;5;124mf\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mlabels[mask]\u001B[38;5;241m.\u001B[39mtolist()\u001B[38;5;132;01m}\u001B[39;00m\u001B[38;5;124m not found in axis\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 6993\u001B[0m indexer \u001B[38;5;241m=\u001B[39m indexer[\u001B[38;5;241m~\u001B[39mmask]\n\u001B[1;32m 6994\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mdelete(indexer)\n",
"\u001B[0;31mKeyError\u001B[0m: \"['V38', 'V39', 'target'] not found in axis\""
]
}
],
"source": [
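"# Drop the mostly-missing columns and the target; errors='ignore' keeps the cell safe to re-run\n",
"dropped_columns = ['V38', 'V39', 'target']\n",
"df = df.drop(columns=dropped_columns, errors='ignore')\n",
"# Rebuild from `images` so that re-running does not delete further columns\n",
"flattened_images = np.delete(images.reshape(images.shape[0], -1), [19, 24, 55, 61, 62, 63], axis=1)"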
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(357699, 58)\n"
]
}
],
"source": [
"print(flattened_images.shape)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-04-16T09:10:48.089387Z",
"start_time": "2024-04-16T09:10:48.083321Z"
}
},
"id": "d996a04b28b2d1be",
"execution_count": 101
},
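{
"cell_type": "markdown",
"id": "b7e41c90",
"metadata": {},
"source": [
"The notes in the descriptive-analysis cell suggest interpolating or removing `V5` and `V15`. The cell below is a minimal sketch of one possible option, median imputation with a missingness flag, and is not necessarily the approach used for the final model. The helper name `impute_median_with_flag`, the `_was_missing` suffix, and the toy values are assumptions made for illustration; in practice the medians would be computed on the training split only to avoid leakage."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2f5a8d3",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical helper (not part of the provided template): fill NaNs with the\n",
"# column median and record which rows were originally missing.\n",
"def impute_median_with_flag(frame, columns):\n",
"    out = frame.copy()\n",
"    for col in columns:\n",
"        out[col + '_was_missing'] = out[col].isna().astype(int)\n",
"        out[col] = out[col].fillna(out[col].median())\n",
"    return out\n",
"\n",
"# Toy frame standing in for V5 / V15; the values are made up.\n",
"toy = pd.DataFrame({'V5': [105.0, np.nan, 107.0], 'V15': [np.nan, 2.0, 4.0]})\n",
"print(impute_median_with_flag(toy, ['V5', 'V15']))"
]
},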
{
"cell_type": "markdown",
"id": "8adcb9cd",
"metadata": {},
"source": [
"### 3. Detection and Handling of Outliers"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "ed1c17a1",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-16T09:10:46.281705Z",
"start_time": "2024-04-16T09:10:46.278864Z"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "d4916043",
"metadata": {},
"source": [
"### 4. Detection and Handling of Class Imbalance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad3ab20e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "2552a795",
"metadata": {},
"source": [
"### 5. Understanding Relationship Between Variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29ddbbcf",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "757fb315",
"metadata": {},
"source": [
"### 6. Data Visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93f82e42",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "2a7eebcf",
"metadata": {},
"source": [
"## Data Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "ae3e3383",
"metadata": {},
"source": [
"### 7. General Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19174365",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "fb3aa527",
"metadata": {},
"source": [
"### 8. Feature Selection"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a85808bf",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4921e8ca",
"metadata": {},
"source": [
"### 9. Feature Engineering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbcde626",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "fa676c3f",
"metadata": {},
"source": [
"## Modeling & Evaluation"
]
},
{
"cell_type": "markdown",
"id": "589b37e4",
"metadata": {},
"source": [
"### 10. Creating Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8dffd7d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "495bf3c0",
"metadata": {},
"source": [
"### 11. Model Evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9245ab47",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "8aa31404",
"metadata": {},
"source": [
"### 12. Hyperparameters Search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81addd51",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}