# Final Assessment Scratch Pad

## Instructions

1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.
2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.
3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so.

## Report

**[TODO]**

Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to give an overview of your thought process and approach, but concise enough to be easily understood. Also, please follow the guidelines given in `main.ipynb`.

This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow the instructions or if you exceed this length.

**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**

##### Overview
**[TODO]**

##### 1. Descriptive Analysis
**[TODO]**

##### 2. Detection and Handling of Missing Values
**[TODO]**

##### 3. Detection and Handling of Outliers
**[TODO]**

##### 4. Detection and Handling of Class Imbalance 
**[TODO]**

##### 5. Understanding Relationship Between Variables
**[TODO]**

##### 6. Data Visualization
**[TODO]**

##### 7. General Preprocessing
**[TODO]**
 
##### 8. Feature Selection 
**[TODO]**

##### 9. Feature Engineering
**[TODO]**

##### 10. Creating Models
**[TODO]**

##### 11. Model Evaluation
**[TODO]**

##### 12. Hyperparameters Search
**[TODO]**

##### Conclusion
**[TODO]**

---

# Workings (Not Graded)

You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. 

## Import Packages

Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`).

In [1]:
import pandas as pd
import os
import numpy as np

## Load Dataset

The dataset `data.npy` consists of $N$ grayscale videos and their corresponding labels. Each video has a shape of $(L, H, W)$. $L$ represents the length of the video (its number of frames), which may vary between videos. $H$ and $W$ represent the height and width, which are consistent across all videos.

A code snippet that loads the data is provided below.

### Load Data

In [2]:
with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    X = data['data']
    y = data['label']
    
print('Number of data sample:', len(X))
print('Shape of the first data sample:', X[0].shape)
print('Shape of the third data sample:', X[2].shape)

Number of data sample: 2500
Shape of the first data sample: (10, 16, 16)
Shape of the third data sample: (8, 16, 16)
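
Because the number of frames differs between videos, a quick tally of video lengths and labels can be a useful first look at the data. The helper below is a minimal sketch and not part of the provided template: `summarize_videos` is a hypothetical name, the toy videos are plain nested lists standing in for the real arrays, and it assumes that a missing label is stored as NaN.

```python
import math
from collections import Counter


def is_missing(label):
    """True when a label is NaN (assumed encoding of a missing label)."""
    return math.isnan(float(label))


def summarize_videos(videos, labels):
    """Tally video lengths, label frequencies, and missing labels.

    `videos` is any sequence of frame sequences (len(video) is the length L);
    `labels` is a parallel sequence of numeric labels that may contain NaN.
    """
    length_counts = Counter(len(video) for video in videos)
    label_counts = Counter(int(lab) for lab in labels if not is_missing(lab))
    n_missing = sum(1 for lab in labels if is_missing(lab))
    return length_counts, label_counts, n_missing


def make_toy_video(n_frames, height=16, width=16):
    """Build an all-zero stand-in video of shape (n_frames, height, width)."""
    return [[[0] * width for _ in range(height)] for _ in range(n_frames)]


# Toy data: three videos with 10, 8 and 8 frames; the last label is missing.
toy_videos = [make_toy_video(10), make_toy_video(8), make_toy_video(8)]
toy_labels = [0.0, 2.0, float("nan")]

length_counts, label_counts, n_missing = summarize_videos(toy_videos, toy_labels)
print("Video lengths:", dict(length_counts))
print("Label counts:", dict(label_counts))
print("Missing labels:", n_missing)
```

The same call should work on the loaded `X` and `y`, since `len()` of a NumPy video array returns its number of frames.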


## Data Exploration & Preparation

### 1. Descriptive Analysis

### 2. Detection and Handling of Missing Values

### 3. Detection and Handling of Outliers

### 4. Detection and Handling of Class Imbalance

### 5. Understanding Relationship Between Variables

### 6. Data Visualization

## Data Preprocessing

### 7. General Preprocessing

### 8. Feature Selection

### 9. Feature Engineering

## Modeling & Evaluation

### 10. Creating Models

In [190]:
import torch
from torch import nn
class CNN(nn.Module):
    def __init__(self, num_classes):
        super(CNN, self).__init__()

        self.conv1 = nn.Conv2d(1,32,3,stride=1,padding=0)
        self.conv2 = nn.Conv2d(32,64,3,stride=1,padding=0)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(256, 128)  # 256 = 64 channels * 2 * 2; a 16x16 frame shrinks 16 -> 14 -> 7 -> 5 -> 2 through the conv/pool stages
        self.fc2 = nn.Linear(128, num_classes)
        self.flatten = nn.Flatten()

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

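# Each video is a NumPy array of shape (L, H, W); `batch` is a sequence of 16x16 frames.
def clean_batch(batch):
    batch = np.array(batch, dtype=np.float64)
    print(batch.shape)  # debug: shape of the incoming frames
    temp_x = batch.reshape(-1, 256)  # one row per frame, one column per pixel
    # Impute NaNs with the per-pixel mean across frames. Calling nan_to_num first
    # would zero the NaNs and turn this imputation into a no-op.
    col_mean = np.nan_to_num(np.nanmean(temp_x, axis=0))  # all-NaN pixels fall back to 0
    inds = np.where(np.isnan(temp_x))
    temp_x[inds] = np.take(col_mean, inds[1])
    np.nan_to_num(temp_x, copy=False)  # map any remaining +/-inf to finite values
    temp_x = np.clip(temp_x, 1, 255)  # clamp pixel values to [1, 255]
    batch = temp_x.reshape(-1, 1, 16, 16)  # (N, C=1, H, W), as expected by Conv2d
    return torch.tensor(batch, dtype=torch.float32)
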
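# Expand every labelled video into its frames so that each frame is one training sample.
def flatten_data(X, y):
    # Drop videos whose label is missing (NaN). flatnonzero always returns a 1-D
    # array, whereas argwhere(...).squeeze() collapses to 0-d for a single match.
    not_nan_indices = np.flatnonzero(~np.isnan(np.array(y, dtype=np.float64)))
    y = [y[i] for i in not_nan_indices]
    X = [X[i] for i in not_nan_indices]
    flattened_x = []
    flattened_y = []
    for idx, video in enumerate(X):
        for frame in video:
            flattened_x.append(frame)
            flattened_y.append(y[idx])  # every frame inherits its video's label
    flattened_x = clean_batch(flattened_x)
    # CrossEntropyLoss expects integer (long) class indices.
    return flattened_x, torch.from_numpy(np.array(flattened_y, dtype=np.int64))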

class Model():
    def __init__(self):
        self.cnn = CNN(6)
    def fit(self, X, y):
        self.cnn.train()
        X, y = flatten_data(X, y)
        train_dataset = torch.utils.data.TensorDataset(X, y)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=320, shuffle=True)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.cnn.parameters(), lr=0.001)
        for epoch in range(70):
            for idx, (inputs, labels) in enumerate(train_loader):
                optimizer.zero_grad()
                outputs = self.cnn(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            print(f'Epoch {epoch}, Loss: {loss.item()}')
        return self
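    def predict(self, X):
        self.cnn.eval()
        results = []
        with torch.no_grad():  # inference only, so gradients are not needed
            for video in X:
                frames = clean_batch(video)
                logits = self.cnn(frames)
                frame_preds = torch.argmax(logits, dim=1)
                # Assumed intent: one label per video by majority vote over its frames.
                # torch.max would return the largest class index, not the most frequent one.
                results.append(int(torch.mode(frame_preds).values))
        return results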
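
The input size of `fc1` follows from the standard output-size formula for convolution and pooling layers, `floor((size + 2 * padding - kernel) / stride) + 1`, applied to each stage of the `CNN` above. The sketch below works through that arithmetic for a 16x16 frame; `conv_out` is a hypothetical helper introduced only for illustration, and it relies on `nn.MaxPool2d(2)` using a stride equal to its kernel size, which is the PyTorch default.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 16                                   # H = W = 16 for every frame
size = conv_out(size, kernel=3)             # conv1, 3x3, no padding: 16 -> 14
size = conv_out(size, kernel=2, stride=2)   # maxpool 2x2: 14 -> 7
size = conv_out(size, kernel=3)             # conv2, 3x3, no padding: 7 -> 5
size = conv_out(size, kernel=2, stride=2)   # maxpool 2x2: 5 -> 2

out_channels = 64                           # channels produced by conv2
flat_features = out_channels * size * size
print(flat_features)  # 256
```

If the kernel sizes, padding, or frame size change, rerunning this arithmetic gives the new value to pass as the first argument of `fc1`.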

### 11. Model Evaluation

In [191]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

not_nan_indices = np.flatnonzero(~np.isnan(np.array(y_test, dtype=np.float64)))  # always 1-D, even for a single match
y_test = [y_test[i] for i in not_nan_indices]
X_test = [X_test[i] for i in not_nan_indices]

model = Model()
model.fit(X_train, y_train)
# predictions = model.predict(X_train)
# print(predictions[0])
# print(y_train[0])

(16186, 16, 16)


  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


Epoch 0, Loss: 1.2299271821975708
Epoch 1, Loss: 1.1530401706695557
Epoch 2, Loss: 1.0396554470062256


KeyboardInterrupt: 

In [189]:
from sklearn.metrics import f1_score

y_pred = model.predict(X_test)
result = model.cnn(clean_batch(X_train[0]))
print(result)
print(torch.argmax(result, axis=1))
# print(y_train[0])
print("F1 Score (macro): {0:.2f}".format(f1_score(y_test, y_pred, average='macro'))) # You may encounter errors, you are expected to figure out what's the issue.


(9, 16, 16)
(9, 16, 16)
(8, 16, 16)
...
[further per-video shape lines of the form (L, 16, 16), with L between 6 and 10; the recorded output is cut off here]
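
`f1_score` needs exactly one plain integer prediction per video, while the CNN produces one prediction per frame. One possible way to collapse frame-level predictions is a majority vote, sketched below in plain Python. This is an illustrative assumption rather than the required method: `majority_vote` is a hypothetical helper, and ties are resolved in favour of the class that appears first.

```python
from collections import Counter


def majority_vote(frame_preds):
    """Return the most frequent class index among per-frame predictions."""
    if len(frame_preds) == 0:
        raise ValueError("cannot vote on an empty list of predictions")
    # int() also converts NumPy scalars (and 0-d tensors) to plain Python ints.
    counts = Counter(int(p) for p in frame_preds)
    return counts.most_common(1)[0][0]


# Toy frame-level predictions for two videos of different lengths.
frame_level = [[3, 3, 5, 3, 1], [0, 0, 2]]
video_preds = [majority_vote(preds) for preds in frame_level]
print(video_preds)  # [3, 0]
```

Taking the maximum of the frame predictions instead would give 5 for the first toy video, which is the largest class index rather than the most common one.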

### 12. Hyperparameters Search