Software Engineer, Data Scientist
Proficient software engineer with seven years of experience, primarily in test development. Currently completing an MS in Data Science at the University of Virginia.
We present two broad deep learning approaches applied to face mask data: classification (mask, no mask) and object detection (locating a potential face mask in an image). First, we obtained data from Kaggle.com to implement two classification models: ResNet50 and VGG16. Both models performed exceedingly well. Second, we scraped images from Google Images and added a few images from the original data set to obtain greater image diversity. After annotating the images, we implemented a YOLOv3 (You Only Look Once) model. Ultimately, the YOLO model also performed exceedingly well.
Topics: Python, NumPy, Keras, TensorFlow, Deep Neural Networks, Convolutional Networks
Given the risk of heart disease in modern society, detecting cardiovascular disease and identifying its risk level in adults is a critical task. Therefore, we implemented three Bayesian classification models to classify whether a patient’s heart is normal or shows the presence of heart disease. Because all three of our models obtain consistent results, we have confidence that specific features from this data have value in predicting the presence of heart disease. Results from these models suggest that oldpeak (ST depression induced by exercise relative to rest), thalach (maximum heart rate achieved), and ca (number of major vessels (0-3) colored by fluoroscopy) play a large role in determining the likelihood that a patient has heart disease. We also found some differences between countries in the likelihood of having heart disease, which is an interesting observation and justifies the use of Hierarchical Bayesian Analysis for problems of this nature.
Topics: Python, pymc, Bayesian Inference, Probability, Statistics
Over a year into the COVID-19 pandemic, it is still unclear if and how weather affects the spread of the SARS-CoV-2 virus. Many believed the virus would not thrive in areas with warmer weather, while others thought it would show some seasonality, like the existing respiratory viruses that spread across the country every winter. In reality, COVID-19 has spread to every corner of the earth and has waxed and waned under various weather conditions in a way that leads to no obvious conclusions about how the weather interacts with this virus. This study uses a data-driven machine learning approach to model reported COVID-19 infections in United States counties using weather and mobility data. Our project explored four models before conducting a grid-based hyperparameter tuning process on our best-performing model, Gradient Boosted Trees. Our resulting model’s R² value is approximately 0.35. Although this suggests limited predictive ability, it was expected: we considered only weather and mobility data, and there are undoubtedly many other important variables that affect COVID-19’s spread. While our models are not particularly useful for prediction, we found a strong association between temperature and COVID-19 spread, even when controlling for population mobility. This study should serve to inform future research into the ways that the weather affects respiratory viruses, and it presents several concrete recommendations for focus areas based on the results of this work.
Topics: Python, Spark, PySpark, MapReduce
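To make the R² value reported for the COVID-19 weather and mobility model easier to interpret, here is a minimal sketch of how the coefficient of determination is computed. The function name `r_squared` and the toy numbers are illustrative assumptions, not the project's code or data; the project itself used Spark, whereas this sketch is plain Python.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    A value of 1.0 means perfect predictions; 0.0 means the model does
    no better than always predicting the mean of the observed values.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the observations around their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical toy values, for illustration only.
    observed = [10.0, 20.0, 30.0, 40.0]
    predicted = [14.0, 18.0, 33.0, 35.0]
    print(round(r_squared(observed, predicted), 3))
```

Read this way, an R² of roughly 0.35 means the model accounts for about a third of the variance in the target around its mean, which is consistent with the limited predictive ability described above.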
Utilize data mining classification methods to solve a real historical data mining problem: locating displaced persons living in makeshift shelters following the destruction caused by the 2010 earthquake in Haiti. Using imagery data collected during the relief efforts, determine which data mining method will locate, as accurately and as quickly as possible, as many of the displaced persons identified in the imagery data as it can, so that they can be provided food and water before their situations become unsurvivable.
Topics: R, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Linear Regression, KNN, Support Vector Machines, Tree-Based Models
Is it possible to assess, through song lyrics, how sentiment varies across genres and across time? This project addresses that question by applying topic models and sentiment analysis to the lyrics of the Billboard Top 100 from 2009 to 2020. Through this pipeline, we determine whether Rap, Country, Pop, and R&B/Hip-Hop lean strongly toward the emotions of anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, while also measuring the polarity of those songs across years.
Topics: Python, scikit-learn, Latent Dirichlet Allocation (LDA), Topic Modeling, Sentiment Analysis, Principal Components Analysis (PCA)
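One common way to score lyrics against the eight emotions named in the lyrics project is a lexicon lookup: count how many words in a song are tagged with each emotion and normalize by song length. The sketch below illustrates that idea only. The project description does not say which lexicon or method was actually used, so `TOY_LEXICON`, `emotion_profile`, and the sample lyric are hypothetical stand-ins.

```python
import re
from collections import Counter

EMOTIONS = (
    "anger", "anticipation", "disgust", "fear",
    "joy", "sadness", "surprise", "trust",
)

# Hypothetical toy lexicon mapping a word to the emotions it evokes.
# A real analysis would load a published emotion lexicon instead.
TOY_LEXICON = {
    "love": {"joy", "trust"},
    "cry": {"sadness"},
    "fight": {"anger", "fear"},
    "tomorrow": {"anticipation"},
}


def emotion_profile(lyrics, lexicon=TOY_LEXICON):
    """Return the share of tokens associated with each emotion."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    total = len(tokens)
    # Normalize by token count so long and short songs are comparable.
    return {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}


if __name__ == "__main__":
    profile = emotion_profile("Love, love, I cry")
    for emotion, share in profile.items():
        print(f"{emotion}: {share:.2f}")
```

Averaging such profiles per genre and per year would give one simple view of how emotional emphasis shifts over time; polarity would need a separate positive/negative score.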