Adrien Ruggiero


Brain Tumors Classification

1. About this project

Everything is explained in the PDF file named “ProjectReport”. However, as it is written in French, this README explains the main points in English. The code is available here. Let’s start with a short presentation of the different files:

  1. Folders
  2. Files

This project was done in a school context with two other classmates. The scope and the subject were free. We decided to focus on Deep Learning for this first experience, rather than doing only a Machine Learning project based on regression or classification like the rest of the class. We also chose a topic related to the medical field because we are interested in these applications and in pursuing careers in data science applied to medicine.

2. Dataset

The data comes from a training set made available on Kaggle, more precisely at the following address: https://www.kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection. There are 253 images available, mainly in .jpg format, although .jpeg and .png are also present. The images fall into two categories: those showing a brain tumour (Yes: 155) and those with no tumour (No: 98). They are all in grayscale. Some additional criteria were also noted.
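
As an illustration, here is a minimal loading sketch, assuming the yes/no folder layout of the Kaggle dataset and an arbitrary 128x128 target size (the folder path and target size are assumptions):

# Illustrative loading sketch; the "yes"/"no" folder layout follows the
# Kaggle dataset, and the 128x128 target size is an assumption.
import pathlib
import numpy as np
from PIL import Image

DATA_DIR = pathlib.Path("brain_tumor_dataset")

def load_class(folder, label):
    samples = []
    for path in sorted((DATA_DIR / folder).iterdir()):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            img = Image.open(path).convert("L")   # force grayscale
            img = img.resize((128, 128))          # uniform input size
            samples.append((np.asarray(img, dtype="float32") / 255.0, label))
    return samples

data = load_class("yes", 1) + load_class("no", 0)         # 155 tumour, 98 no-tumour
X = np.stack([img for img, _ in data])[..., np.newaxis]   # shape (253, 128, 128, 1)
y = np.array([label for _, label in data])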

3. The process

The different stages of our approach are detailed in the report; the main ones are summarised below.

Then, we build our CNN with five layers; an illustrative sketch is shown below.
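
The exact layers are given in the report, so the following is only a rough sketch of a five-layer Keras CNN for this binary task; the filter counts and kernel sizes are assumptions:

# Illustrative five-layer CNN sketch; the exact architecture is described
# in the report, so filter counts and kernel sizes here are assumptions.
from tensorflow.keras import layers, models

modelCNN_examp = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu',
                  input_shape=(128, 128, 1)),        # layer 1: convolution on grayscale input
    layers.MaxPooling2D((2, 2)),                     # layer 2: downsampling
    layers.Conv2D(64, (3, 3), activation='relu'),    # layer 3: convolution
    layers.Flatten(),                                # layer 4: flatten to a vector
    layers.Dense(1, activation='sigmoid'),           # layer 5: binary output (tumour / no tumour)
])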

In order to avoid overfitting, the principle of early stopping is also used (a sketch follows the compilation line below). Here are the compilation parameters of our model for its evaluation:

from tensorflow.keras import optimizers

modelCNN_examp.compile(loss='binary_crossentropy',
                       optimizer=optimizers.SGD(learning_rate=1e-4),  # 'lr' in older Keras versions
                       metrics=['AUC'])
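
A minimal early-stopping sketch, assuming a train/validation split held in X_train/X_val and an illustrative patience value and epoch count:

# Stop training when the validation AUC stops improving. The patience
# value, epoch count and split variables X_train/X_val are assumptions;
# 'val_auc' is the history key produced by metrics=['AUC'].
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_auc', mode='max',
                           patience=10, restore_best_weights=True)

modelCNN_examp.fit(X_train, y_train,
                   validation_data=(X_val, y_val),
                   epochs=100, callbacks=[early_stop])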

4. Methods to deepen our analysis and improve our model

  1. Data Augmentation (DA)

Our dataset contains little data, so we sought to enlarge it. A simple procedure is Data Augmentation (DA), which artificially modifies some of the images in our training dataset to increase the amount of training data. Beyond adding volume, this diversifies the data without actually collecting any. Finally, the objective is to limit overfitting, since our CNN will treat the generated images as distinct samples.

The transformations applied to our images include zooming, cropping and rotation; an illustrative sketch is shown below.
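
A minimal sketch with Keras’s ImageDataGenerator; the exact transformations and ranges used are in the report, so the values below are assumptions, kept deliberately small:

# Illustrative DA configuration; the ranges are assumptions, kept small
# so the transformations do not crop the tumour out (see the caution below).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,         # small rotations
    zoom_range=0.1,            # mild zoom in/out
    width_shift_range=0.05,    # slight horizontal shift (crop-like effect)
    height_shift_range=0.05,   # slight vertical shift
)

train_flow = datagen.flow(X_train, y_train, batch_size=16)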

CAUTION: it is necessary to limit the effects of these transformations so as not to modify the image too much.

Indeed, a combination of zooming, cropping and rotation can make us lose the area of interest within the image, namely the tumour, in particular by cutting out the part of the brain where it is located. The resulting image then looks tumour-free while still being labelled as having a tumour, which effectively adds false negatives.

  2. Repetition and comparison of models

We carried out a manual cross-validation in order to compare several models. We studied four cases: the ‘Adam’ optimizer and the ‘SGD’ optimizer, each with and without DA. In order to make the results more reliable, we repeated these experiments about ten times, which gives us an idea of which configuration makes the best model. A sketch of this protocol is shown below.
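
A sketch of this protocol, assuming hypothetical helpers build_cnn(optimizer) and augment(...) standing in for the model construction and DA steps above; the split ratio, epoch count and repeat count are illustrative:

# Hypothetical protocol sketch: build_cnn() and augment() stand in for the
# model construction and DA steps described above; X, y come from loading.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

results = {}
for optimizer in ['adam', 'sgd']:
    for use_da in (False, True):
        aucs = []
        for seed in range(10):                       # ~10 repetitions per case
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, test_size=0.2, stratify=y, random_state=seed)
            if use_da:
                X_tr, y_tr = augment(X_tr, y_tr)     # hypothetical DA helper
            model = build_cnn(optimizer)             # hypothetical: fresh model per run
            model.fit(X_tr, y_tr, epochs=50, verbose=0)
            probs = model.predict(X_val, verbose=0).ravel()
            aucs.append(roc_auc_score(y_val, probs))
        results[(optimizer, use_da)] = (np.mean(aucs), np.std(aucs))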

  3. Confusion matrix

We use the confusion matrix to retrieve some additional metrics, including recall. In our model, the images showing a tumour are the positive class. Recall therefore tells us what fraction of the people who need treatment are identified, whereas precision tells us how often we would be treating people who do not need it. We want to maximise our chances of treating all the people who have the condition, hence our interest in recall.

We also preferred the ‘AUC’ metric to accuracy (‘acc’) because it is more relevant to our study, in particular for analysing false negatives. Indeed, AUC takes into account the predicted probabilities for each class, and it is more representative and robust for small datasets. A sketch of how these metrics can be computed is shown below.
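
A minimal sketch with scikit-learn, assuming a held-out validation set X_val/y_val and a 0.5 decision threshold (both assumptions):

# Confusion matrix, recall/precision and AUC; y_val, X_val and the 0.5
# threshold are assumptions for illustration.
from sklearn.metrics import confusion_matrix, recall_score, precision_score, roc_auc_score

probs = modelCNN_examp.predict(X_val).ravel()   # predicted tumour probabilities
preds = (probs >= 0.5).astype(int)              # threshold into tumour / no tumour

print(confusion_matrix(y_val, preds))               # rows: true class, columns: predicted
print("recall:   ", recall_score(y_val, preds))     # share of tumours correctly flagged
print("precision:", precision_score(y_val, preds))  # share of flagged cases that are tumours
print("AUC:      ", roc_auc_score(y_val, probs))    # threshold-free, uses probabilities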

5. Results

In our case, the data augmentation was inconclusive. Therefore, our final model uses the ‘SGD’ optimizer (generally better than ‘Adam’ for small datasets) and no DA.