{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Set 5: Credit Card Fraud Detection\n", "\n", "**Release Date:** 12 March 2024\n", "\n", "**Due Date:** 23:59, 30 March 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "In class, we discussed logistic regression and how it can be useful as a classification algorithm. In this problem set, we get some hands-on practice by implementing logistic regression on a Credit Card Fraud Detection dataset. Note that for this problem set, you should use the scikit-learn (sklearn) library only for the last part (Tasks 5.x) on SVM.\n", "\n", "**Required Files**:\n", "\n", "* ps5.ipynb\n", "* credit_card.csv\n", "* restaurant_data.csv\n", "\n", "**Honour Code**: Note that plagiarism will not be condoned! You may discuss with your classmates and check the internet for references, but you MUST NOT submit code or reports copied directly from other sources!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Important\n", "\n", "Similar to PS0, your implementation in the following tasks **should NOT involve any iteration, including `map` and `filter`, or recursion**. Instead, you should work with the operations available in NumPy. Solutions that violate this will be penalised.\n", "\n", "- You are allowed to use any mathematical functions, but this does not mean that you are allowed to use any NumPy function (there are NumPy functions that aren’t mathematical functions). For example, `np.vectorize` is not allowed since it is iterative. If you are in doubt about which functions are allowed, you should ask in the forum.\n", "\n", "There is, however, an exception for **Tasks 3.4 and 3.5**. In the pseudo-code for the algorithm required, there is an explicit for loop. Hence, only for these tasks you may use **a single for/while loop** to iterate for the number of epochs required." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Fraudulent credit card transactions are common in many parts of the world and can lead to potentially large losses for both companies and customers. Therefore, we hope to help credit card companies recognize fraudulent transactions so that customers are not charged for items that they did not purchase.\n", "\n", "We are given a dataset of transactions made by credit card holders in `credit_card.csv`. Given this context, we might expect the input variables to include text descriptions, such as shop names or locations. In this problem set, we don't need to worry about language processing, as the data are pre-processed to contain only numeric values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "\n", "Take a look at the columns in the dataset `credit_card.csv`. We have V1-V20, 'Amount', and 'Time' as input features, and 'Class' as the output, which takes the value 1 if the transaction is fraudulent and 0 otherwise. This dataset contains 492 frauds out of 284,807 transactions. That means there are 284,808 rows (including the header) in the csv file.\n", "\n", "We will use this dataset to implement logistic regression using batch and stochastic gradient descent for binary classification." 
] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "# Initial imports and setup\n", "\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "\n", "from sklearn import svm\n", "from sklearn import model_selection\n", "\n", "# Read credit card data into a Pandas dataframe for large tests\n", "\n", "dirname = os.getcwd()\n", "credit_card_data_filepath = os.path.join(dirname, 'credit_card.csv')\n", "restaurant_data_filepath = os.path.join(dirname, 'restaurant_data.csv')\n", "\n", "credit_df = pd.read_csv(credit_card_data_filepath)\n", "# X: every column except the last; y: the last column ('Class'), kept 2-D with shape (n, 1)\n", "X = credit_df.values[:, :-1]\n", "y = credit_df.values[:, -1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basics of Pandas\n", "\n", "[Pandas](https://pandas.pydata.org/) is an open source data analysis and manipulation tool in Python. In this problem set, we read the CSV into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which provides some nifty methods that make it easier for us to handle large amounts of data. You can think of a dataframe as a large table that stores our dataset in a neat and optimized manner, making retrieval and manipulation of data fast. Using Pandas, we can quickly gain an overview of the type and values of the data stored, distributions of values within the dataset, and even ways to perform sampling.\n", "\n", "In the next few sections, we will explore some basic functions of Pandas to help us get started. You do not need to submit any code for this section, and you can simply run the cells to follow along." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the [`head`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method on the dataframe, we can get an overview of the data. By default, the method returns the first 5 rows of the dataframe." 
] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | 69.99 | 0 |

5 rows × 23 columns
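
The default and explicit-count behaviour of `head` described earlier can be sketched on a small stand-in dataframe. The name `toy_df` and all of its values are made up for illustration and are not taken from `credit_card.csv`; only `DataFrame.head` itself is the real Pandas method.

```python
import pandas as pd

# Hypothetical stand-in for credit_df: 8 rows, 3 columns (values are made up).
toy_df = pd.DataFrame({
    'Time': [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 4.0, 7.0],
    'Amount': [10.0, 2.5, 30.0, 12.0, 7.0, 3.0, 5.0, 40.0],
    'Class': [0, 0, 0, 1, 0, 0, 0, 1],
})

first_five = toy_df.head()   # no argument: the first 5 rows
first_two = toy_df.head(2)   # explicit count: the first 2 rows

print(first_five.shape)  # (5, 3)
print(first_two.shape)   # (2, 3)
```

Passing a count, as in `head(2)`, is one way to obtain a two-row preview like a "2 rows × 23 columns" output.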

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | 2.69 | 0 |

2 rows × 23 columns

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | 69.99 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 284802 | 172786.0 | -11.881118 | 10.071785 | -9.834783 | -2.066656 | -5.364473 | -2.606837 | -4.918215 | 7.305334 | 1.914428 | ... | -0.689256 | 4.626942 | -0.924459 | 1.107641 | 1.991691 | 0.510632 | -0.682920 | 1.475829 | 0.77 | 0 |
| 284803 | 172787.0 | -0.732789 | -0.055080 | 2.035030 | -0.738589 | 0.868229 | 1.058415 | 0.024330 | 0.294869 | 0.584800 | ... | 1.214756 | -0.675143 | 1.164931 | -0.711757 | -0.025693 | -1.221179 | -1.545556 | 0.059616 | 24.79 | 0 |
| 284804 | 172788.0 | 1.919565 | -0.301254 | -3.249640 | -0.557828 | 2.630515 | 3.031260 | -0.296827 | 0.708417 | 0.432454 | ... | -0.183699 | -0.510602 | 1.329284 | 0.140716 | 0.313502 | 0.395652 | -0.577252 | 0.001396 | 67.88 | 0 |
| 284805 | 172788.0 | -0.240440 | 0.530483 | 0.702510 | 0.689799 | -0.377961 | 0.623708 | -0.686180 | 0.679145 | 0.392087 | ... | -1.042082 | 0.449624 | 1.962563 | -0.608577 | 0.509928 | 1.113981 | 2.897849 | 0.127434 | 10.00 | 0 |
| 284806 | 172792.0 | -0.533413 | -0.189733 | 0.703337 | -0.506271 | -0.012546 | -0.649617 | 1.577006 | -0.414650 | 0.486180 | ... | -0.188093 | -0.084316 | 0.041333 | -0.302620 | -0.660377 | 0.167430 | -0.256117 | 0.382948 | 217.00 | 0 |

284315 rows × 23 columns
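
The row count above equals 284,807 - 492 = 284,315, which is consistent with keeping only the non-fraudulent transactions (`Class` equal to 0). The expression that produced this output is not shown here, so the following is only a minimal sketch of one plausible way to write such a selection, using a boolean mask on a made-up dataframe `toy_df` (a hypothetical stand-in, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for credit_df (values are made up).
toy_df = pd.DataFrame({
    'Amount': [10.0, 2.5, 30.0, 12.0, 7.0],
    'Class': [0, 1, 0, 0, 1],
})

# Comparing a column with a scalar yields a boolean Series (the mask).
is_legit = toy_df['Class'] == 0

# Indexing with the mask keeps only the rows where it is True;
# the original index labels are preserved.
legit_df = toy_df[is_legit]

# Negating the mask counts the remaining (fraudulent) rows.
n_fraud = int((~is_legit).sum())

print(len(legit_df), n_fraud)  # 3 2
```

The selection is fully vectorised, so it involves no explicit iteration.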

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | 2.69 | 0 |
| 284805 | 172788.0 | -0.240440 | 0.530483 | 0.702510 | 0.689799 | -0.377961 | 0.623708 | -0.686180 | 0.679145 | 0.392087 | ... | -1.042082 | 0.449624 | 1.962563 | -0.608577 | 0.509928 | 1.113981 | 2.897849 | 0.127434 | 10.00 | 0 |
| 284806 | 172792.0 | -0.533413 | -0.189733 | 0.703337 | -0.506271 | -0.012546 | -0.649617 | 1.577006 | -0.414650 | 0.486180 | ... | -0.188093 | -0.084316 | 0.041333 | -0.302620 | -0.660377 | 0.167430 | -0.256117 | 0.382948 | 217.00 | 0 |

4 rows × 23 columns
\n", "