DataCamp-Ishango.ai Check-in Event
Each month we bring our DataCamp-Ishango.ai scholarship community together to exchange ideas, build connections, and discover the latest developments in data science. At our most recent community event, two of our scholarship recipients, Francis Jeremiah Majawa and Chantelle Amoako-Atta, each presented a data science topic. Read on for more about their presentations and what we learned.
Francis presented on Continuous Integration and Continuous Deployment (CI/CD) for Machine Learning. He started with an overview of the topic and why it matters for data scientists. He also touched on how software was developed before DevOps, and on the challenges in that process that DevOps emerged to address.
He explained the goal of DevOps, which is to bring software development and operations teams together to build applications quickly and reliably. He then introduced the concept of MLOps, which applies DevOps principles to machine learning workflows. He went on to discuss CI/CD, explaining how it automates testing and validation so that high-quality models can be delivered faster. Francis highlighted the benefits of CI/CD, such as fast value delivery, high model quality, better collaboration, and easier maintenance. He also mentioned tools such as Git and Azure DevOps pipelines that can be used in machine learning workflows. Finally, he suggested exploring further resources to gain a deeper understanding of CI/CD in practice.
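To make the idea of automated validation more concrete, here is a minimal sketch in Python of a quality gate that a CI job might run before a retrained model is promoted. It is an illustration only, not the pipeline Francis presented: the function name passes_quality_gate, the metric names, and the thresholds are all hypothetical.

```python
def passes_quality_gate(candidate, baseline, min_f1=0.70, max_regression=0.02):
    """Check a candidate model's metrics before it is promoted.

    candidate and baseline are dicts such as {"f1": 0.81, "precision": 0.78}.
    All names and thresholds here are hypothetical, chosen for illustration.
    Returns (ok, reasons), where reasons lists every failed check.
    """
    reasons = []

    # Absolute floor: reject any model whose F1 score is too low.
    if candidate["f1"] < min_f1:
        reasons.append(f"f1 {candidate['f1']:.2f} is below the floor {min_f1:.2f}")

    # Relative check: no metric may drop much below the deployed baseline.
    for name, base_value in baseline.items():
        drop = base_value - candidate.get(name, 0.0)
        if drop > max_regression:
            reasons.append(f"{name} regressed by {drop:.2f}")

    return (not reasons, reasons)


# Example: a candidate that slightly improves on the current model.
ok, reasons = passes_quality_gate(
    candidate={"f1": 0.81, "precision": 0.78},
    baseline={"f1": 0.79, "precision": 0.77},
)
print("gate passed" if ok else reasons)
```

In a real pipeline, a script like this would typically end with a non-zero exit code when the gate fails, so that the CI system stops the deployment step automatically.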
Chantelle presented on Using Machine Learning Algorithms to Classify Cardiac Arrhythmia. She introduced cardiac arrhythmia and the dataset that she used in her application. She went over the steps she applied to the data, including handling missing values, feature scaling, and feature selection. She then presented the three machine learning models that she used for classification: k-nearest neighbors, logistic regression, and extreme gradient boosting. She evaluated their performance using precision and the F1 score because the dataset is imbalanced, meaning that observations are unequally distributed across the different classes of cardiac arrhythmia. This imbalance can bias model predictions toward the more common classes and makes accuracy a less reliable measure of performance in this context.
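A small worked example shows why accuracy can mislead on imbalanced data. The sketch below uses made-up toy labels (not Chantelle's dataset) and a hypothetical helper, binary_metrics, to compare accuracy with precision, recall, and the F1 score for a model that always predicts the majority class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 95 "normal" cases (0) and 5 "arrhythmia" cases (1).
y_true = [0] * 95 + [1] * 5
# A useless model that labels every patient as normal.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the model scores 95% accuracy while its precision, recall, and F1 score for the arrhythmia class are all zero, because it never identifies a single positive case.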
Chantelle also explained the different measures of performance, including accuracy, the F1 score, and metrics derived from the confusion matrix. She mentioned that sensitivity (also known as recall or the true positive rate) is a measure of performance for classification problems such as cardiac arrhythmia. Chantelle clarified that her model classifies patients into the normal class or a specific class of arrhythmia, but does not make a general determination of whether a person is healthy or sick. After the presentations, Chris Toumping Fotso discussed handling class imbalance in machine learning models and suggested using ensemble techniques or generating synthetic data to address the issue.
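One common way to generate synthetic data for an under-represented class is SMOTE-style interpolation, where new points are placed between existing minority samples and their nearest neighbors. Whether this is the specific technique Chris had in mind is an assumption on our part. The sketch below is a simplified pure-Python version for illustration, with a hypothetical function name and made-up sample values; in practice, a maintained implementation such as SMOTE in the imbalanced-learn library would be the usual choice.

```python
import random


def oversample_minority(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class samples.

    Simplified SMOTE-style sketch: each new point lies on the line segment
    between a randomly chosen sample and one of its k nearest neighbours.
    """
    if len(samples) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by squared Euclidean distance.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up minority-class samples with two features each.
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_points = oversample_minority(minority, n_new=6)
print(len(new_points), "synthetic samples created")
```

Synthetic samples like these should be added to the training split only, never to the test split, so that evaluation still reflects the real class distribution.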