# Final Assessment Scratch Pad

## Instructions

1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.
2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.
3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so.

## Report

**[TODO]**

Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to provide an overview of your thought process and approach but also concise enough to be easily understandable. Also, please follow the guidelines given in the `main.ipynb`.

This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow instructions and you include too many words here. 

**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**

##### Overview
**[TODO]**

##### 1. Descriptive Analysis
First step: look at the target values. They are floats with some NaNs, which is interesting; NaNs in the target data are a bit suspicious. Despite being stored as floats, the target values are actually ordinal, so I'll convert them to integers with `Y.fillna(-1).astype(int)`. Value counts then show that there are only 7 distinct values, including NaN. I will regard this as a classification problem with 7 classes.
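
The conversion described above can be sketched on toy data. The label values below are hypothetical placeholders rather than the real targets, and a `Series` is used for brevity (the same calls also work on a `DataFrame`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for the real targets
Y = pd.Series([0.0, 2.0, np.nan, 5.0, 2.0, np.nan])

# Map NaN to the sentinel -1, then cast the float labels to integers
Y_int = Y.fillna(-1).astype(int)

# Count how often each label (including the -1 sentinel) occurs
counts = Y_int.value_counts().sort_index()
print(counts)
```

Using `-1` as the sentinel keeps the missing labels visible in the counts instead of silently dropping them.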

Looking at `X`, I realise each entry in the list is an `n` by 16 by 16 array, with `6 <= n <= 10`. Since each slice is a 16 by 16 matrix, my first idea was to look at the slices as images, but plotting them showed no relevant info.

I then realised this is a video dataset. I'll pad every video to 10 frames so that I'll have a 2500 x 10 x 16 x 16 video dataset.
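
A minimal sketch of this padding step, assuming zero frames are appended at the end and that no video is longer than `max_len`. The helper name `pad_videos` and the synthetic `videos` list are hypothetical:

```python
import numpy as np

def pad_videos(videos, max_len=10):
    """Zero-pad each (L, H, W) video along the time axis to max_len frames."""
    h, w = videos[0].shape[1:]
    out = np.zeros((len(videos), max_len, h, w), dtype=np.float32)
    for i, video in enumerate(videos):
        out[i, :video.shape[0]] = video  # trailing frames stay zero
    return out

rng = np.random.default_rng(0)
# Hypothetical stand-in for X: three videos with 10, 6 and 8 frames
videos = [rng.random((n, 16, 16)) for n in (10, 6, 8)]
padded = pad_videos(videos)
print(padded.shape)  # (3, 10, 16, 16)
```

Padding at the end keeps the real frames in their original order, at the cost of the final time steps of shorter videos being all zeros.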

##### 2. Detection and Handling of Missing Values
**[TODO]**

##### 3. Detection and Handling of Outliers
**[TODO]**

##### 4. Detection and Handling of Class Imbalance 
**[TODO]**

##### 5. Understanding Relationship Between Variables
**[TODO]**

##### 6. Data Visualization
**[TODO]** 
##### 7. General Preprocessing
**[TODO]**
 
##### 8. Feature Selection 
**[TODO]**

##### 9. Feature Engineering
**[TODO]**

##### 10. Creating Models
**[TODO]**

##### 11. Model Evaluation
**[TODO]**

##### 12. Hyperparameters Search
**[TODO]**

##### Conclusion
**[TODO]**

---

# Workings (Not Graded)

You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. 

## Import Packages

Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`).

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt

## Load Dataset

The dataset `data.npy` consists of $N$ grayscale videos and their corresponding labels. Each video has a shape of (L, H, W). L represents the length of the video, which may vary between videos. H and W represent the height and width, which are consistent across all videos. 

A code snippet that loads the data is provided below.

### Load Data

In [2]:
with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    X = data['data']
    y = data['label']
    
print('Number of data sample:', len(X))
print('Shape of the first data sample:', X[0].shape)
print('Shape of the third data sample:', X[2].shape)

Number of data sample: 2500
Shape of the first data sample: (10, 16, 16)
Shape of the third data sample: (8, 16, 16)
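
Since L varies between videos, it may help to tabulate the lengths before deciding how to pad. A minimal sketch on synthetic arrays (the `videos` list is a hypothetical stand-in, not the real `X`):

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for X: videos with 6 to 10 frames of 16 x 16 pixels
videos = [rng.random((n, 16, 16)) for n in (10, 6, 8, 8, 10, 7)]

# Number of frames in each video, and how often each length occurs
lengths = [v.shape[0] for v in videos]
length_counts = Counter(lengths)
print(sorted(length_counts.items()))
print('min length:', min(lengths), 'max length:', max(lengths))
```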


## Data Exploration & Preparation

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# Some Helper Functions
def show_images(images, n_row=5, n_col=5, figsize=(12, 12)):
    # Plot up to n_row * n_col grayscale images in a grid
    _, axs = plt.subplots(n_row, n_col, figsize=figsize)
    axs = axs.flatten()
    for img, ax in zip(images, axs):
        ax.imshow(img, cmap='gray')
    plt.show()

def nan_columns(X, threshold=0.5):
    # Columns where at least `threshold` of the rows are NaN
    count = X.shape[0] * threshold
    nan_counts = X.isna().sum()
    return nan_counts[nan_counts >= count].index

def zero_columns(X, threshold=0.5):
    # Columns where at least `threshold` of the rows are zero
    count = X.shape[0] * threshold
    zero_cols = (X == 0).sum()
    return zero_cols[zero_cols >= count].index

def object_columns(X):
    return X.dtypes[X.dtypes == 'object'].index

def convert_to_ordinal(X, columns):
    encoder = OrdinalEncoder()
    return encoder.fit_transform(X[columns])

def correlated_columns(X, threshold=0.99):
    # Columns whose correlation with an earlier column exceeds `threshold`
    corr = X.corr()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [column for column in upper.columns if any(upper[column] > threshold)]

### 1. Descriptive Analysis

In [4]:
not_nan_indices = np.argwhere(~np.isnan(np.array(y))).squeeze()
print(len(not_nan_indices))
y_filtered = [y[i] for i in not_nan_indices]
x_filtered = [X[i] for i in not_nan_indices]
X = x_filtered
y = y_filtered
# show_images(X[0], 2, 5, [16, 16])
Y = pd.DataFrame(y)
# show_images(X[0], 1, 10, [10, 1])
# show_images(X[1], 1, 10, [10, 1])
# show_images(X[2], 1, 10, [10, 1])
# show_images(X[3], 1, 10, [10, 1])
# Y[:10].T
# print(type(X[0]))

2250


In [5]:
# We can now try to pad the videos to be of size 10

L_max = 10

def process_video(video):
    # Append all-zero frames up to L_max, then flatten each 16 x 16 frame to 256 values
    L = video.shape[0]
    if L < L_max:
        video = np.concatenate([video, np.zeros((L_max - L, 16, 16))])
    return video.reshape(L_max, -1).astype(np.float32)

X_array = np.zeros((len(X), L_max, 256), dtype=np.float32)
for i, video in enumerate(X):
    X_array[i] = process_video(video)
print(X_array.shape)

(2250, 10, 256)


### 2. Detection and Handling of Missing Values

### 3. Detection and Handling of Outliers

### 4. Detection and Handling of Class Imbalance

### 5. Understanding Relationship Between Variables

### 6. Data Visualization

## Data Preprocessing

### 7. General Preprocessing

### 8. Feature Selection

### 9. Feature Engineering

## Modeling & Evaluation

### 10. Creating models

In [6]:
import torch
from torch import nn

### 11. Model Evaluation

In [7]:
from sklearn.model_selection import train_test_split
# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
# Stack into one ndarray first; building a tensor from a list of arrays is slow
X_train = np.stack([process_video(video) for video in X_train])
X_test = np.stack([process_video(video) for video in X_test])

X_tensor = torch.tensor(X_train, dtype=torch.float32)
# nn.CrossEntropyLoss expects integer class indices (assumes the labels are 0..5)
y_tensor = torch.tensor(np.asarray(y_train, dtype=np.int64))

train_dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)



In [8]:
class VideoLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_size = 256   # flattened 16 x 16 frame
        self.hidden_size = 64   # size of the LSTM hidden state
        self.num_layers = 1
        self.num_classes = 6

        self.lstm = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True)
        self.fc = nn.Linear(self.hidden_size, self.num_classes)

    def forward(self, x):
        # x: (batch, frames, 256); start from zero hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)

        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))

        # Classify from the LSTM output at the last time step
        return self.fc(out[:, -1, :])
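
Before training, it can be useful to push a dummy batch through a model of this shape and confirm the output dimensions. The sketch below uses a hypothetical stand-in class, `TinyVideoLSTM`, that mirrors the layer sizes above; it is not the notebook's own model object:

```python
import torch
from torch import nn

class TinyVideoLSTM(nn.Module):
    """Hypothetical stand-in mirroring the layer sizes of the model above."""
    def __init__(self, input_size=256, hidden_size=64, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, frames, hidden_size)
        return self.fc(out[:, -1, :])  # logits from the last time step

model = TinyVideoLSTM()
dummy = torch.zeros(4, 10, 256)  # 4 videos, 10 frames, 256 pixels per frame
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)  # torch.Size([4, 6])
```

Omitting the initial states here relies on `nn.LSTM` defaulting them to zeros, which matches the explicit `h0` and `c0` used in the notebook.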

In [9]:
def train_model(model, loss_fn, optimizer, train_loader, num_epochs=10):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"Epoch {epoch + 1}, Loss: {running_loss / len(train_loader)}")


In [None]:
model = VideoLSTM()
lossFn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_model(model, lossFn, optimizer, train_loader, num_epochs=1)
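
The held-out split created in this section is not yet scored anywhere in the notebook. One possible way to do so is plain classification accuracy. The sketch below is only an illustration under that assumption: the `accuracy` helper, the linear `toy_model`, and the random data are all hypothetical stand-ins:

```python
import torch
from torch import nn

def accuracy(model, inputs, labels):
    """Fraction of samples whose arg-max logit matches the integer label."""
    model.eval()
    with torch.no_grad():
        preds = model(inputs).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Hypothetical stand-ins: a linear classifier and random "test" data
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(10 * 256, 6))
toy_X = torch.randn(20, 10, 256)
toy_y = torch.randint(0, 6, (20,))

acc = accuracy(toy_model, toy_X, toy_y)
print(f"accuracy: {acc:.2f}")
```

With the real data, the same helper could be called on tensors built from `X_test` and `y_test`, provided the labels are integer class indices.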

### 12. Hyperparameters Search