Cassava Leaf Disease Classification

Connor Uzzo
Feb 23, 2021

By Akshay Shah, Harry Chalfin, and Connor Uzzo

Cassava is an important crop in many African households because of its resilience against harsh climates. It is susceptible to several diseases, however, and diagnosing them requires assistance from government-trained agriculture experts. If a deep learning model capable of recognizing these diseases from digital images could be deployed, farmers could decide on treatment for their cassava plants without waiting and paying for one of these experts. Using the Kaggle Cassava Leaf Disease Classification dataset and Python’s TensorFlow package, we are building such a pipeline. So far the highest reported accuracy score for this dataset on the Kaggle website is 0.9132, achieved by the “golddiggaz” team. The top 50 scores are all above 0.90, with many teams tied to four decimal places.

Exploratory Data Analysis:

Pie chart of the class distribution in the cassava leaf disease dataset. The cassava mosaic disease class is by far the largest, encompassing 61.5% of the data points overall. Following this class are healthy, cassava green mottle, cassava brown streak disease, and cassava bacterial blight.
Three instances of each class. Looking over each class, it is easy to see that some diseases can look very similar (for example, class 2 CGM and class 3 CMD), some healthy cassava plants may look sick (such as the first class 4 image), and some sick plants may look healthy (such as the first class 1 image with CBSD).

Our EDA confirmed that it is difficult to confidently classify a cassava leaf image into any class by visual inspection alone. Even healthy plants may look sickly at times, and sickly plants may look perfectly healthy. Cassava mosaic disease (CMD) was by far the largest class, taking up 61.5% of the total dataset, so we expect our model to predict class 3 for most images.
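As a rough illustration of how a class distribution like the one above can be computed, here is a minimal sketch using only the Python standard library. The toy label list is hypothetical and merely chosen to mirror the 61.5% share of class 3; it is not the real dataset. The name `class_distribution` is our own for this example, and the class 0 abbreviation (CBB) is inferred from the class list in this post.

```python
from collections import Counter

# Class numbering as used in this post; the class 0 name is inferred by elimination.
CLASS_NAMES = {0: "CBB", 1: "CBSD", 2: "CGM", 3: "CMD", 4: "Healthy"}

def class_distribution(labels):
    """Return {class_id: fraction of samples} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts.get(cls, 0) / total for cls in sorted(CLASS_NAMES)}

# Hypothetical toy labels: 1000 images, 615 of them in class 3 (CMD).
toy_labels = [0] * 50 + [1] * 100 + [2] * 115 + [3] * 615 + [4] * 120

dist = class_distribution(toy_labels)
for cls, frac in dist.items():
    print(f"class {cls} ({CLASS_NAMES[cls]}): {frac:.1%}")
```

With the real data, the list of labels would come from the label column of the competition's training file rather than from a hand-built list.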

Baseline Model:

A common way to define a baseline accuracy for a classifier is to simply predict that every image belongs to the largest class. In this case, our baseline model would predict class 3 for each image, and since 61.5% of the images belong to class 3, it would have an accuracy of 61.5%. Of course, this means we would never detect any sickness other than cassava mosaic disease, and we would never classify a plant as healthy. Assuming the test set is distributed similarly to the training set, the confusion matrix of this baseline model would have nonzero entries only in the predicted class 3 column, with about 61.5% of the images falling in the true class 3 row.
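The majority-class baseline described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from our notebook: the train and test label lists are hypothetical toy data, and `majority_class` and `confusion_matrix` are helper names made up for this example.

```python
from collections import Counter

def majority_class(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical toy labels with roughly the class balance described in this post.
train_labels = [0] * 50 + [1] * 100 + [2] * 115 + [3] * 615 + [4] * 120
test_labels = [0] * 5 + [1] * 10 + [2] * 12 + [3] * 61 + [4] * 12

baseline = majority_class(train_labels)      # always predict the largest class
preds = [baseline] * len(test_labels)

cm = confusion_matrix(test_labels, preds, n_classes=5)
accuracy = sum(cm[i][i] for i in range(5)) / len(test_labels)

for row in cm:
    print(row)
print(f"baseline class: {baseline}, accuracy: {accuracy:.3f}")
```

Every count lands in the column of the predicted class, so the accuracy equals the share of that class in the test labels and every other class has a recall of zero.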

A Kaggle submission of this baseline model, along with our EDA, can be found at https://www.kaggle.com/connortuzzo/notebookdb57b7a224/edit.

Citations:

Stanfield, Devon. (February, 2021). “Cassava-Infer”, Version 6. Retrieved February 22, 2021 from https://www.kaggle.com/devonstanfield/cassava-infer

“Cassava Leaf Disease Classification”, Version 1. Retrieved February 18, 2021 from https://www.kaggle.com/c/cassava-leaf-disease-classification/data
