# Final Assessment Scratch Pad

## Instructions

1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.
2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.
3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so.

## Report

**[TODO]**

Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to give an overview of your thought process and approach, but concise enough to be easily understood. Also, please follow the guidelines given in `main.ipynb`.

This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow the instructions or if you exceed this limit.

**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**

##### Overview
**[TODO]**

##### 1. Descriptive Analysis
**[TODO]**

##### 2. Detection and Handling of Missing Values
**[TODO]**

##### 3. Detection and Handling of Outliers
**[TODO]**

##### 4. Detection and Handling of Class Imbalance 
**[TODO]**

##### 5. Understanding Relationship Between Variables
**[TODO]**

##### 6. Data Visualization
**[TODO]**

##### 7. General Preprocessing
**[TODO]**
 
##### 8. Feature Selection 
**[TODO]**

##### 9. Feature Engineering
**[TODO]**

##### 10. Creating Models
**[TODO]**

##### 11. Model Evaluation
**[TODO]**

##### 12. Hyperparameters Search
**[TODO]**

##### Conclusion
**[TODO]**

---

# Workings (Not Graded)

You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. 

## Import Packages

Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`).

In [42]:
import pandas as pd
import os
import numpy as np

## Load Dataset

The dataset `data.npy` consists of $N$ grayscale videos and their corresponding labels. Each video has a shape of (L, H, W). L represents the length of the video, which may vary between videos. H and W represent the height and width, which are consistent across all videos. 

A code snippet that loads the data is provided below.

### Load Data

In [43]:
with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    X = data['data']
    y = data['label']


print('Number of data sample:', len(X))
print('Shape of the first data sample:', X[0].shape)
print('Shape of the third data sample:', X[2].shape)

Number of data sample: 2500
Shape of the first data sample: (10, 16, 16)
Shape of the third data sample: (8, 16, 16)
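
Since the video length L varies, it helps to know how the lengths are distributed before choosing a fixed number of frames. The snippet below is a minimal sketch of that check. `length_histogram` and `toy_videos` are hypothetical names used for illustration only; in the notebook you would pass `X` instead of the toy list.

```python
from collections import Counter


def length_histogram(videos):
    """Count how many videos have each length L (number of frames)."""
    counts = Counter(len(video) for video in videos)
    return dict(sorted(counts.items()))


# Toy stand-in for X: each "video" is a list of placeholder frames.
toy_videos = [[0] * 10, [0] * 6, [0] * 8, [0] * 6]

hist = length_histogram(toy_videos)
shortest = min(hist)  # smallest L present in the data

print(hist)
print("Shortest video length:", shortest)
```

The shortest length is the largest number of frames that can be kept from every video without padding.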


## Data Exploration & Preparation

### 1. Descriptive Analysis

In [44]:
# Drop samples whose label is NaN. The model must do the same for its training data.
not_nan_indices = np.flatnonzero(~np.isnan(np.array(y, dtype=float)))
y = [y[i] for i in not_nan_indices]
X = [X[i] for i in not_nan_indices]
y = np.array(y).astype(int)

# Videos vary in length, so keep only the first 6 frames (the minimum length) of each one.
# Once every video has the same shape, the list can be stacked into a single numpy array.
X6 = np.array([video[:6] for video in X])

In [None]:
pd.DataFrame(y).value_counts()
# The classes are imbalanced, so we need to undersample or oversample. We pick undersampling: the dataset is fairly large, and undersampling also reduces training time.

### 2. Detection and Handling of Missing Values

In [None]:
# Count the NaNs in the data. There are quite a few. Rather than tracking down where each one is,
# we replace every NaN with the mean of the frame (image) it appears in.
np.isnan(X6).sum()
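
The per-frame mean imputation described above can also be written without explicit loops. This is a minimal NumPy sketch, assuming the videos have already been cut to a common shape (N, L, H, W). The function name `fill_nan_with_frame_mean` and the toy array are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def fill_nan_with_frame_mean(videos):
    """Return a copy of `videos` (N, L, H, W) with NaNs replaced by their frame's mean."""
    videos = np.array(videos, dtype=np.float32, copy=True)
    # Mean of the non-NaN pixels of every frame, kept broadcastable: (N, L, 1, 1)
    frame_means = np.nanmean(videos, axis=(2, 3), keepdims=True)
    nan_mask = np.isnan(videos)
    videos[nan_mask] = np.broadcast_to(frame_means, videos.shape)[nan_mask]
    return videos


# Toy example: one video with one 4x4 frame holding 0..15, first pixel missing.
toy = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
toy[0, 0, 0, 0] = np.nan

filled = fill_nan_with_frame_mean(toy)
print(filled[0, 0, 0, 0])  # mean of 1..15
```

A frame that consists only of NaNs would stay NaN (and NumPy would emit a warning), so that case needs separate handling if it occurs in the data.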

### 3. Detection and Handling of Outliers

In [None]:
# Check for outliers by looking at the maximum and minimum pixel value of each video.
# np.nanmax / np.nanmin ignore the NaNs that are still present at this point.
print(np.nanmax(X6, axis=(1, 2, 3)))
print(np.nanmin(X6, axis=(1, 2, 3)))
# Some values exceed 255, so we clip the data to the range [0, 255].
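
To make the outlier check concrete, the sketch below lists which videos contain values outside [0, 255] and then clips them. It assumes that 0 to 255 is the intended pixel range; `out_of_range_videos` and the toy array are hypothetical and only serve as an illustration.

```python
import numpy as np


def out_of_range_videos(videos, low=0.0, high=255.0):
    """Indices of videos (N, L, H, W) with any non-NaN pixel outside [low, high]."""
    per_video_min = np.nanmin(videos, axis=(1, 2, 3))
    per_video_max = np.nanmax(videos, axis=(1, 2, 3))
    return np.flatnonzero((per_video_min < low) | (per_video_max > high))


# Toy data: video 0 has a NaN, video 1 a value that is too large, video 2 one that is too small.
toy = np.zeros((3, 2, 2, 2))
toy[0, 0, 0, 0] = np.nan
toy[1, 0, 0, 0] = 300.0
toy[2, 1, 1, 1] = -5.0

bad = out_of_range_videos(toy)
clipped = np.clip(toy, 0, 255)  # NaNs pass through np.clip unchanged

print(bad)
print(np.nanmin(clipped), np.nanmax(clipped))
```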

### 4. Detection and Handling of Class Imbalance

In [83]:
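# Check the class counts to decide whether resampling is needed
pd.DataFrame(y).value_counts()
# The raw labels are imbalanced, so we undersample. The counts shown below are already balanced
# (300 per class), presumably because this cell was last run after the resampling step in section 7.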

0
0    300
1    300
2    300
3    300
4    300
5    300
Name: count, dtype: int64

### 5. Understanding Relationship Between Variables

### 6. Data Visualization

## Data Preprocessing

### 7. General Preprocessing

In [82]:
import torch

# Keep only the first 6 frames of each video
X = np.array([video[:6] for video in X])
tensor_videos = torch.tensor(X, dtype=torch.float32)
# Clip pixel values to the range [0, 255]
tensor_videos = torch.clamp(tensor_videos, 0, 255)
# Replace the NaNs in each frame with the mean of that frame's non-NaN pixels (generated with GPT)
for i in range(tensor_videos.shape[0]):
    for j in range(tensor_videos.shape[1]):
        frame = tensor_videos[i, j]  # a view, so the assignment below edits tensor_videos in place
        nan_mask = torch.isnan(frame)
        frame[nan_mask] = frame[~nan_mask].mean()

# Resample each of the 6 classes to exactly 300 samples.
# Written with the assistance of ChatGPT, with some modifications.
num_samples_to_take = 300
# Indices of the samples belonging to each class
indices = [np.argwhere(y == i).squeeze(1) for i in range(6)]
# replace=True samples with replacement, so a class with fewer than 300 samples still yields 300
# (with repeats), while larger classes are effectively undersampled.
indices_to_take = np.concatenate(
    [np.random.choice(indices[i], num_samples_to_take, replace=True) for i in range(6)]
)
# Select the resampled videos and labels
tensor_videos = tensor_videos[indices_to_take]
y = y[indices_to_take]

print(tensor_videos.shape)
print(y.shape)
print(type(y))


torch.Size([1800, 6, 16, 16])
(1800,)
<class 'numpy.ndarray'>


In [85]:
# Add the extra channel dimension expected by Conv3d
tensor_videos = tensor_videos.unsqueeze(1)
tensor_videos.shape

torch.Size([1800, 1, 6, 16, 16])
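
The resampling step above draws with replacement for every class. A possible variation, sketched below under my own assumptions, draws without replacement whenever a class has enough samples and only falls back to replacement for classes that are too small. `balanced_indices` and the toy labels are hypothetical; this is not the notebook's method, just an alternative to compare against.

```python
import numpy as np


def balanced_indices(labels, per_class, rng):
    """Pick `per_class` indices for every class found in `labels`, skipping NaN labels."""
    labels = np.asarray(labels, dtype=float)
    picked = []
    for cls in np.unique(labels[~np.isnan(labels)]):
        cls_idx = np.flatnonzero(labels == cls)
        # Only allow repeats when the class has fewer samples than requested
        replace = len(cls_idx) < per_class
        picked.append(rng.choice(cls_idx, size=per_class, replace=replace))
    return np.concatenate(picked)


# Toy labels: class 0 has 5 samples, class 1 has 2, and one label is missing.
toy_labels = np.array([0, 0, 0, 0, 0, 1, 1, np.nan])
rng = np.random.default_rng(0)

idx = balanced_indices(toy_labels, per_class=3, rng=rng)
counts = np.bincount(toy_labels[idx].astype(int))
print(counts)
```

Passing a seeded generator also makes the selection reproducible between runs, which `np.random.choice` without a fixed seed does not guarantee.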

### 8. Feature Selection

### 9. Feature Engineering

## Modeling & Evaluation

### 10. Creating models

In [230]:
from torch import nn
class CNN3D(nn.Module):
    def __init__(self):
        super().__init__()
        # in_channels=1, out_channels=12, kernel_size=2, stride=1, padding=2
        self.conv1 = nn.Conv3d(1, 12, 2, 1, 2)
        self.pool = nn.AvgPool3d(2)  # average pooling, not max pooling
        self.relu = nn.LeakyReLU()
        self.flatten = nn.Flatten()
        # 12 channels * 4 * 9 * 9 = 3888 features for a (1, 6, 16, 16) input
        self.fc1 = nn.Linear(3888, 6)

    def forward(self, x):
        x = self.conv1(x)    # (N, 12, 9, 19, 19)
        x = self.pool(x)     # (N, 12, 4, 9, 9)
        x = self.relu(x)
        x = self.flatten(x)  # (N, 3888)
        return self.fc1(x)   # raw class scores (logits)

def train(model, criterion, optimizer, loader, epochs=10):
    model.train()
    for epoch in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # Note: this is the loss of the last batch in the epoch, not the epoch average
        print(f'Epoch {epoch}, Loss: {loss.item()}')
    return model

def process_data(X, y):
    y = np.array(y)
    # Keep only the first 6 frames of each video
    X = np.array([video[:6] for video in X])
    tensor_videos = torch.tensor(X, dtype=torch.float32)
    # Clip pixel values to the range [0, 255]
    tensor_videos = torch.clamp(tensor_videos, 0, 255)
    # Replace the NaNs in each frame with the mean of that frame's non-NaN pixels (generated with GPT)
    for i in range(tensor_videos.shape[0]):
        for j in range(tensor_videos.shape[1]):
            frame = tensor_videos[i, j]  # a view, so the assignment edits tensor_videos in place
            nan_mask = torch.isnan(frame)
            frame[nan_mask] = frame[~nan_mask].mean()
    # Resample each of the 6 classes to exactly 300 samples.
    # Written with the assistance of ChatGPT, with some modifications.
    # Samples with a NaN label never satisfy `y == i`, so they are dropped here as well.
    indices = [np.argwhere(y == i).squeeze(1) for i in range(6)]
    num_samples_to_take = 300
    # replace=True samples with replacement, so small classes still yield 300 samples (with repeats)
    indices_to_take = np.concatenate(
        [np.random.choice(indices[i], num_samples_to_take, replace=True) for i in range(6)]
    )
    # Select the samples and add the channel dimension expected by Conv3d
    tensor_videos = tensor_videos[indices_to_take].unsqueeze(1)
    y = y[indices_to_take]
    return tensor_videos, torch.as_tensor(y).long()

class Model():
    def __init__(self):
        self.model = CNN3D()
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=0.001)
    def fit(self, X, y):
        X, y = process_data(X, y)
        train_dataset = torch.utils.data.TensorDataset(X, y)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
        train(self.model, self.criterion, self.optimizer, train_loader)
    def predict(self, X):
        self.model.eval()
        # Apply the same preprocessing as in training, minus the resampling
        X = np.array([video[:6] for video in X])
        tensor_videos = torch.tensor(X, dtype=torch.float32)
        # Clip pixel values to the range [0, 255]
        tensor_videos = torch.clamp(tensor_videos, 0, 255)
        # Replace the NaNs in each frame with the mean of that frame's non-NaN pixels (generated with GPT)
        for i in range(tensor_videos.shape[0]):
            for j in range(tensor_videos.shape[1]):
                frame = tensor_videos[i, j]
                nan_mask = torch.isnan(frame)
                frame[nan_mask] = frame[~nan_mask].mean()
        X = tensor_videos.unsqueeze(1)  # add the channel dimension expected by Conv3d
        with torch.no_grad():  # gradients are not needed for inference
            logits = self.model(X)
        return logits.argmax(dim=1).numpy()
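
The input size of 3888 for the linear layer follows from the convolution and pooling settings. The short sketch below redoes that arithmetic with the standard output-size formulas, assuming no dilation and a pooling stride equal to the pooling kernel size (the PyTorch defaults). The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output length of a pooling layer whose stride equals its kernel size."""
    return size // kernel


# Input video: 6 frames of 16 x 16 pixels, 1 channel.
# Conv3d(1, 12, kernel_size=2, stride=1, padding=2) followed by AvgPool3d(2).
depth = pool_out(conv_out(6, kernel=2, stride=1, padding=2), kernel=2)
height = pool_out(conv_out(16, kernel=2, stride=1, padding=2), kernel=2)
width = height  # frames are square

flat_features = 12 * depth * height * width
print(depth, height, width, flat_features)
```

If the number of frames, the kernel size, or the padding changes, rerunning this calculation gives the new `in_features` value for `fc1`.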


### 11. Model Evaluation

In [217]:
from sklearn.model_selection import train_test_split

with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    X = data['data']
    y = data['label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

not_nan_indices = np.argwhere(~np.isnan(np.array(y_test))).squeeze()
y_test = [y_test[i] for i in not_nan_indices]
X_test = [X_test[i] for i in not_nan_indices]



In [235]:
model = Model()
model.fit(X_train, y_train)

from sklearn.metrics import f1_score

y_pred = model.predict(X_test)
print("F1 Score (macro): {0:.2f}".format(f1_score(y_test, y_pred, average='macro'))) # You may encounter errors, you are expected to figure out what's the issue.


Epoch 0, Loss: 4.225716590881348
Epoch 1, Loss: 0.9198675155639648
Epoch 2, Loss: 1.7365752458572388
Epoch 3, Loss: 0.4570190906524658
Epoch 4, Loss: 0.11014104634523392
Epoch 5, Loss: 0.24420055747032166
Epoch 6, Loss: 0.03079795092344284
Epoch 7, Loss: 0.07790327817201614
Epoch 8, Loss: 0.07603466510772705
Epoch 9, Loss: 0.04154537618160248
F1 Score (macro): 0.51


F1 Score (macro): 0.60
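
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has. The sketch below computes it by hand on made-up labels. `macro_f1` is a hypothetical helper written for illustration, and classes with no true or predicted samples are given an F1 of 0 here.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

score = macro_f1(toy_true, toy_pred, classes=[0, 1, 2])
print("F1 Score (macro): {0:.2f}".format(score))
```

In this toy case the per-class F1 scores are 0.5, 0.8 and 2/3, which average to roughly 0.66.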


### 12. Hyperparameters Search