Multiclass Classification of Medical Symptoms

I worked on this project during my junior year as an undergraduate, together with classmates in the Statistics major, for the class “Big Data & High Performance Statistical Computing”. The goal was to predict which disease a patient had from their symptoms.

We used hierarchical clustering for exploratory data analysis (EDA) and a Random Forest model for the classification itself. There was a lot of hand-holding in this project: we used a Kaggle dataset, which meant that the data was designed to work very much in our favor. Our Random Forest model achieved extremely high accuracy, far higher than would be realistic in any real-life situation. Since we had worked hard to correct for over-fitting, we do not think it was the main cause; the more likely explanation is how favorable the dataset was. Nevertheless, this was my first introduction to modern machine learning, so I see this project as a kick-starter for my current interests.
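For readers who want a concrete picture of that two-step workflow, here is a minimal sketch. It is not the code from the project: the choice of Python with SciPy and scikit-learn is an assumption, and the data, the four disease profiles, the noise level, and every parameter value are made up for illustration. It clusters patients by symptom similarity as a quick EDA step, then fits a Random Forest and checks it on held-out data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 4 made-up diseases, 12 binary symptoms (1 = present).
# Each row is the "typical" symptom profile of one disease.
profiles = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
])
n_diseases, n_symptoms = profiles.shape
n_per_disease = 60

# Disease label for each synthetic patient.
y = np.repeat(np.arange(n_diseases), n_per_disease)
# Each patient copies their disease profile, with every symptom
# independently flipped 10% of the time to add noise.
flips = rng.random((y.size, n_symptoms)) < 0.10
X = np.where(flips, 1 - profiles[y], profiles[y])

# EDA: agglomerative (hierarchical) clustering on Hamming distance,
# i.e. the fraction of symptoms on which two patients differ.
distances = pdist(X, metric="hamming")
tree = linkage(distances, method="average")
cluster_ids = fcluster(tree, t=n_diseases, criterion="maxclust")

# Classification: hold out a stratified test set so the reported
# accuracy is not measured on the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,      # shallow trees as one simple guard against over-fitting
    oob_score=True,   # out-of-bag estimate as a second check
    random_state=0,
)
forest.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, forest.predict(X_test))

print(f"clusters found: {np.unique(cluster_ids).size}")
print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Comparing the out-of-bag score with the held-out accuracy is one simple way to see whether a high score comes from over-fitting or from the data being easy to separate. On clean synthetic data like this, both tend to be high, which mirrors what happened with the Kaggle dataset.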

Slides can be accessed here.