F21-STAT 441/841 CM 763-Proposal: Difference between revisions



--------------------------------------------------------------------
--------------------------------------------------------------------
Project # 1 Group members:


'''Project # 1 Group members:'''
Feng, Jared


Song, Quinn
Huang, Xipeng


Loh, William
Xu, Mingwei


Bai, Junyue
Yu, Tingzhou


Choi, Phoebe
Title: Patch-based classification of lung cancers pathological images using convolutional neural networks


'''Title:''' APTOS 2019 Blindness Detection
In this project, we explore the classification problem of lung cancer pathological images of some patients. The input images are from three categories of tumor types (LUAD, LUSD, and MESO), and the images have been split into patches in order to reduce the computational difficulty. The classification task is decomposed into patch-wise and whole image-wise. We experiment with three neural networks for patch-wise classification, and two classical machine learning models for patient classification. Techniques of feature extraction and sampling methods for training neural networks are also implemented and studied. Our results show that support vector machine (SVM) on extracted feature vectors outperforms all other methods and achieves an accuracy of 67.86% based on DenseNet-121 model for patch-wise classification.


'''Description:'''
Our poster is [https://www.dropbox.com/s/fu6vr2cxcbt4458/Stat_841_poster.pdf?dl=0 here].
--------------------------------------------------------------------
Project # 2 Group members:


Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.
Anderson, Eric


Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.
Wang, Chengzhi


Our team plans to solve this problem by applying our knowledge in image processing and classification.
Zhong, Kai


Zhou, Yi Jing


----
Title: Clean-Label Targeted Poisons for an End-to-End Trained CNN on the MNIST Dataset


'''Project # 2 Group members:'''
Description: Applying data poisoning techniques to the MNIST Dataset


Li, Dylan
--------------------------------------------------------------------
Project # 3 Group members:


Li, Mingdao
Chopra, Kanika


Lu, Leonie
Rajcoomar, Yush


Sharman,Bharat
Bhattacharya, Vaibhav


'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms
Title: Cancer Classification


'''Description:'''
Description: We will be classifying three tumour types based on pathological data.


In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and then use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these algorithms using MAE and RMSE metrics and produce visualizations that explain the results clearly, even to a non-quantitative audience.
--------------------------------------------------------------------
Project # 4 Group members:


Our goal in this project is to learn to apply the algorithms that we learned in class to an industry dataset and produce results that can aid better, data-driven decision making.
Li, Shao Zhong


----
Kerr, Hannah


'''Project # 3 Group members:'''
Wong, Ann Gie


Parco, Russel
Title: Predicting "Pawpularity" of Pets with Image Regression


Sun, Scholar
Description: Analyze raw images and metadata to predict the “Pawpularity” of pet photos, helping shelters and rescuers around the world improve the appeal of their pet profiles so that more animals can get adopted and find their "furever" home faster.


Yao, Jacky
--------------------------------------------------------------------
Project # 5 Group members:


Zhang, Daniel
Chin, Jessie Man Wai


'''Title:''' Lyft Motion Prediction for Autonomous Vehicles
Ooi, Yi Lin


'''Description:'''
Shi, Yaqi


Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.
Ngew, Shwen Lyng


Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.
Title: The Application of Classification in Accelerated Underwriting (Insurance)


Our aim is to apply classification techniques learned in class to optimally predict how these objects move.
Description: Accelerated Underwriting (AUW), also called “express underwriting,” is a faster and easier process for people in good health to obtain life insurance. The traditional underwriting process is often painful for both customers and insurers. Customers have to complete different types of questionnaires and provide different medical tests involving blood, urine, saliva and other medical results. Underwriters, on the other hand, have to manually go through every single policy to assess the risk of each applicant. AUW allows people who are deemed “healthy” to forgo medical exams. Since COVID-19, it has become a more pressing topic, as traditional underwriting cannot be performed due to stay-at-home orders. However, this imposes a burden on the insurance company to better estimate risk with fewer test results.


----
This is where data science kicks in. With different classification methods, we can address the underwriting process’ five pain points: labor, speed, efficiency, pricing and mortality. This allows us to better estimate the risk and classify whether clients are eligible for accelerated underwriting. For the final project, we use data from one of the leading US insurers to analyze how we can classify our clients for AUW. We will be using factors such as health data, medical history, family history, and insurance history to determine eligibility.


'''Project # 4 Group members:'''
--------------------------------------------------------------------
Project # 6 Group members:


Chow, Jonathan
Wang, Carolyn


Dharani, Nyle
Cyrenne, Ethan


Nasirov, Ildar
Nguyen, Dieu Hoa


'''Title:''' Classification with Abstinence
Sin, Mary Jane


'''Description:'''
Title: Pawpularity (PetFinder Kaggle Competition)


We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.
Description: Using images and metadata on the images to predict the popularity of pet photos, which is calculated based on page view statistics and other metrics from the PetFinder website.


Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays].
--------------------------------------------------------------------
Project # 7 Group members:


----
Bhattacharya, Vaibhav


'''Project # 5 Group members:'''
Chatoor, Amanda


Jones, Hayden
Prathap Das, Sutej


Leung, Michael
Title: PetFinder.my - Pawpularity Contest [https://www.kaggle.com/c/petfinder-pawpularity-score/overview]


Haque, Bushra
Description: In this competition, we will analyze raw images and metadata to predict the “Pawpularity” of pet photos. We'll train and test our model on PetFinder.my's thousands of pet profiles.


Mustatea, Cristian
--------------------------------------------------------------------
Project # 8 Group members:


'''Title:''' Combine Convolution with Recurrent Networks for Text Classification
Yan, Xin


'''Description:'''
Duan, Yishu


Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures in a way that merges the outputs of both more flexibly than simple concatenation, through the use of a “neural tensor layer”, in order to improve text classification. In particular, the paper claims that the novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in two pairs, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with TensorFlow 2 and Google Colab, and reproducing the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.
Di, Xibei


----
Title: The application of classification on company bankruptcy prediction


'''Project # 6 Group members:'''
Description: If a company goes bankrupt, all its employees will lose their jobs, and it is hard for them to find another suitable job in a short period. For the individual, an employee who loses a job due to bankruptcy will have no income for a period of time. This may lead to several negative consequences: increased homelessness, as people do not have enough money to cover living expenses, and increased crime rates as poverty increases. For the economy, if many companies go bankrupt at the same time, a huge number of employees will lose their jobs, leading to a higher unemployment rate. This may cause a series of negative impacts on the economy: loss of government tax revenue, since the unemployed have no income and do not pay income taxes, and increased inequality in the income distribution.


Chin, Ruixian
Therefore, company bankruptcy negatively influences the individual, government, society, and the economy, which makes the prediction of company bankruptcy essential. The purpose of the project is to predict whether a company will go bankrupt.
--------------------------------------------------------------------
Project # 9 Group members:


Ong, Jason
Loke, Chun Waan


Chiew, Wen Cheen
Chong, Peter


Tan, Yan Kai
Osmond, Clarice


'''Title:''' Mechanisms of Action (MoA) Prediction
Li, Zhilong


'''Description:'''
Title: Popularity of Shelter Pet Photo Prediction using Varied ML Techniques


Our team chose to participate in a Kaggle research challenge, "Mechanisms of Action (MoA) Prediction". This competition is a project of the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), which present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.
Description: In this Kaggle competition, we will analyze raw images and metadata to predict the “Pawpularity” of pet photos.
----
--------------------------------------------------------------------


'''Project # 7 Group members:'''
Project # 10 Group members:


Ren, Haotian
O'Farrell, Ethan


Cheung, Ian Long Yat
D'Astous, Justin


Hussain, Swaleh
Hamed, Waqas


Zahid, Bin, Haris
Vladusic, Stefan


'''Title:''' Transaction Fraud Detection
Title: Pawpularity (Kaggle)


'''Description:'''
Description: Predicting the popularity of animal photos based on photo metadata
--------------------------------------------------------------------
Project # 11 Group members:


Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and higher speed.
JunBin, Pan
----


'''Project # 8 Group members:'''
Title: Learning from Normality: Two-Stage Method with Autoencoder and Boosting Trees for Unsupervised Anomaly Detection


ZiJie, Jiang
Description: New algorithm for unsupervised anomaly detection
--------------------------------------------------------------------
Project # 12 Group members:


Yawen, Wang
Kar Lok, Ng


DanMeng, Cui
Muhan (Iris), Li


MingKang, Jiang
Title: NFL Health & Safety - Helmet Assignment


'''Title:''' Lyft Motion Prediction for Autonomous Vehicles
Description: Assigning players to helmets in footage of head collisions in football plays.
--------------------------------------------------------------------
Project # 13 Group members:


'''Description:'''
Livochka, Anastasiia


Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.
Wong, Cassandra


----------------------------------------------------------------------
Evans, David


Yalsavar, Maryam


'''Project # 9 Group members:'''
Title: TBD


Banno, Dion
Description: TBD
--------------------------------------------------------------------
Project # 14 Group Members:


Battista, Joseph
Zeng, Mingde


Kahn, Solomon
Lin, Xiaoyu


'''Title:''' Increasing Spotify user engagement through predictive personalization
Fan, Joshua


'''Description:'''
Rao, Chen Min


Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.
Title: Toxic Comment Classification, Kaggle


Description: Using Wikipedia comments labeled for toxicity to train a model that detects toxicity in comments.
--------------------------------------------------------------------
Project # 15 Group Members:


-----------------------------------------------------------------------
Huang, Yuying


'''Project # 10 Group members:'''
Anugu, Ankitha


Qing, Guo
Chen, Yushan


Wang, Yuanxin
Title: Implementation of the classification task between crop and weeds


James, Ni
Description: Our work will be based on the paper ''Crop and Weeds Classification for Precision Agriculture using Context-Independent Pixel-Wise Segmentation''.
--------------------------------------------------------------------
Project # 16 Group Members:


Xueguang, Ma
Wang, Lingshan


'''Title:''' Mechanisms of Action (MoA) Prediction
Li, Yifan


'''Description:'''
Liu, Ziyi


Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.
Title: Implement and Improve CNN in Multi-Class Text Classification
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn various algorithms taught in this course.


Description: We are going to apply Bidirectional Encoder Representations from Transformers (BERT) to classify real-world data (building an efficient classifier for case study interview materials) and improve it algorithm-wise in the context of text classification. Implementing BERT allows us to further analyze the efficiency and practicality of the algorithm when dealing with an imbalanced dataset at both the data input level and the modelling level.
The dataset is composed of case study HTML files containing case information that can be classified into multiple industry categories. We will implement a multi-class classification to break down the information contained in each case material into pre-determined subcategories (e.g., behavior questions, consulting questions, questions for new business/market entry, etc.). We will attempt to process the complicated data into several data types (e.g., HTML, JSON, pandas data frames, etc.) and choose the most efficient raw data processing logic based on runtime and algorithm optimization.
--------------------------------------------------------------------
Project # 17 Group members:


-----------------------------------------------------------------------
Malhi, Dilmeet


'''Project # 11 Group members:'''
Joshi, Vansh


Yang, Jiwon
Syamala, Aavinash


Mahdi, Anas
Islam, Sohan


Thibault, Will
Title: Kaggle project: PetFinder.my - Pawpularity Contest


Lau, Jan
Description: In this competition, we will analyze raw images provided by PetFinder.my to predict the “Pawpularity” of pet photos.
--------------------------------------------------------------------


'''Title:''' Application of classification in human fatigue analysis
Project # 18 Group members:


'''Description:'''
Yuwei, Liu


The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.
Daniel, Mao


Title: Sartorius - Cell Instance Segmentation (Kaggle) [https://www.kaggle.com/c/sartorius-cell-instance-segmentation]


------------------------------------------------------------------------
Description: Detect single neuronal cells in microscopy images


'''Project # 12 Group members:'''
--------------------------------------------------------------------


Xiaolan Xu,
Project #19 Group members:


Robin Wen,  
Samuel, Senko


Yue Weng,  
Tyler, Verhaar


Beizhen Chang
Zhang, Bowen


'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy
Title: NBA Game Prediction


'''Description:'''
Description: We will build a win/loss classifier for NBA games using player and game data and also incorporating alternative data (ex. sports betting data).


Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy in a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.
-------------------------------------------------------------------


In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), which applied a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centered perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly through a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with an error bar. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.
Project #20 Group members:


------------------------------------------------------------------------
Mitrache, Christian


'''Project # 13 Group members:'''
Renggli, Aaron


Saini, Jessica


Zihui (Betty) Qin,
Mossman, Alexandra


Wenqi (Maggie) Zhao,
Title: Classification and Deep Learning for Healthcare Provider Fraud Detection Analysis


Muyuan Yang,
Description: TBD


Amartya (Marty) Mukherjee,
--------------------------------------------------------------------


'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction
Project # 21 Group members:


'''Description:'''
Wang, Kun


Background (why we were interested in classifying based on insiders):  
Title: TBD
The United States is one of the most frequently traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables (such as transaction date, security type, and transaction amount), we could predict the roles code for a new transaction. The reason for the chosen prediction is that the role of the insider gives investors signals of potential internal activities and private information. This is crucial for investors to detect important market signals from those insider trading activities, such that they could benefit from the market.


Goal: To classify the role of an insider in a company based on the data of their trades.
Description : TBD


--------------------------------------------------------------------


------------------------------------------------------------------------
Project # 22 Group members:


'''Project # 14 Group members:'''
Guray, Egemen


Jung, Kyle
Title: Traffic Sign Recognition System (TSRS): SVM and Convolutional Neural Network


Kim, Dae Hyun
Description : I will build a prediction system to predict road signs in the German Traffic Sign Dataset using CNN.
--------------------------------------------------------------------


Lee, Stan
Project # 23 Group members:


Lim, Seokho
Bsodjahi


'''Title:''' Mechanisms of Action (MoA) Prediction Competition
Title: Modeling Pseudomonas aeruginosa bacteria state through its genes expression activity


'''Description:''' The main objective of this Kaggle competition is to help to develop an algorithm to predict a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT441 and apply multi-label classification. Through the process, our team will learn biological knowledge necessary to complete and enhance our classification thought-process. https://www.kaggle.com/c/lish-moa
Description : Label Pseudomonas aeruginosa gene expression data through unsupervised learning (e.g., the EM algorithm) and then model the bacterial state as a function of its gene expression
 
------------------------------------------------------------------------
 
'''Project # 15 Group Members:'''
 
Li, Evan
 
Abuaisha, Karam
 
Vadivelu, Nicholas
 
Pu, Jason
 
'''Title:''' Predict Students Answering Ability Kaggle Competition
 
'''Description:'''
 
https://www.kaggle.com/c/riiid-test-answer-prediction
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.
 
------------------------------------------------------------------------
 
'''Project # 16 Group members:'''
 
Hall, Matthew
 
Chalaturnyk, Johnathan
 
'''Title:'''  Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS
 
'''Description:'''
 
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions released from gas turbine engines. The implementation of this system is reliant on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hopes of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we'll apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in this paper in order to measure the effectiveness of our solutions.
 
------------------------------------------------------------------------
 
'''Project # 17 Group members:'''
 
Yang, Junyi
 
Wang, Jill Yu Chieh
 
Wu, Yu Min
 
Li, Calvin
 
'''Title:'''  Humpback Whale Identification
 
'''Description:'''
 
Our team will participate in the Kaggle challenge, Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that each class has, on average, only 8 training images.
 
------------------------------------------------------------------------
'''Project # 18 Group members:''' 
 
Lian, Jinjiang 
 
Zhu, Yisheng 
Huang, Mingzhe   
 
Hou, Jiawen
 
'''Title:'''  Mechanisms of Action (MoA) Prediction 
 
'''Description:''' 
 
Our team's final project is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA prediction helps scientists design more targeted drug molecules based on the biological mechanism of a disease, which could substantially shorten the drug development cycle. Here, MoA prediction involves applying different drugs to human cells and analyzing the corresponding responses; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs.
 
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, SVM, etc., and find the one that best completes the task. Depending on how we perform, we might utilize other techniques such as model ensembling or stacking.
 
------------------------------------------------------------------------
'''Project # 19 Group members:''' 
 
Fagan, Daniel 
 
Brooke, Cooper 
Perelman, Maya   
 
'''Title:'''  Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)
 
'''Description:''' 
 
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA.
 
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.
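
As a rough illustration of the logarithmic loss mentioned above, the following is a minimal sketch in plain Python. It assumes a multi-label setup in which the loss is averaged over every (sample, label) pair; the function name, the clipping constant <code>eps</code>, and the toy inputs are illustrative assumptions, not part of the competition's official code.

```python
import math


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, label) pair.

    y_true: list of rows of 0/1 labels.
    y_pred: list of rows of predicted probabilities, same shape as y_true.
    eps:    clipping constant (an assumption) that keeps log() finite.
    """
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip away exact 0 and 1
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy example: predicting 0.5 everywhere gives a loss of ln(2).
loss = mean_columnwise_log_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
```

A lower value is better; candidate models can be ranked by this number on a held-out validation set.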
 
------------------------------------------------------------------------
'''Project # 20 Group members:''' 
Cheng, Leyan
 
Dai, Mingyan
 
Jiang, Daniel 
   
Huang, Jerry
 
'''Title:'''  Riiid! Answer Correctness Prediction
 
'''Description:'''
 
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.
 
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.
 
------------------------------------------------------------------------
'''Project # 21 Group members:''' 
 
Carson, Emilee
 
Ellmen, Isaac
 
Mohammadrezaei, Dorsa
 
Budaraju, Sai Arvind
   
 
'''Title:'''  Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence
 
'''Description:'''
 
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.
 
The standard method for classifying viruses based on location is to:
 
- Perform a multiple sequence alignment (MSA)
 
- Build a phylogenetic tree of the MSA
 
- Empirically determine which regions have which sections of the tree
 
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.
 
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.
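The encoding step mentioned above could take many forms. As a minimal sketch only, one option is a one-hot encoding of each alignment column. The alphabet (four nucleotide letters plus the gap character), the function name <code>one_hot_encode</code>, and the toy alignment are assumptions made for illustration; they are not taken from GISAID or from a finalized design. Vectors of this form could then be passed to PCA.

```python
# Hypothetical alphabet: four nucleotide letters plus the alignment gap character.
ALPHABET = "ACGT-"

def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Flatten an aligned sequence into a 0/1 vector of length len(seq) * len(alphabet)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = []
    for symbol in aligned_seq.upper():
        column = [0] * len(alphabet)
        # Ambiguous or unknown symbols (e.g. "N") are left as an all-zero column.
        if symbol in index:
            column[index[symbol]] = 1
        vector.extend(column)
    return vector

# Toy alignment (made up): every row has the same length, as in an MSA.
toy_msa = ["ACGT", "AC-T", "ANGT"]
encoded = [one_hot_encode(seq) for seq in toy_msa]
```

Because all rows of an MSA have equal length, every encoded vector has the same dimension, which is what PCA and LDA require.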
 
------------------------------------------------------------------------
'''Project # 22 Group members:''' 
 
Chang, Luwen
 
Yu, Qingyang
 
Kong, Tao 
   
Sun, Tianrong
 
'''Title:'''  Riiid! Answer Correctness Prediction
 
'''Description:'''
 
For the final project, we chose the featured Kaggle competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model that predicts students' performance on future interactions. (https://www.kaggle.com/c/riiid-test-answer-prediction)
 
We plan to use classification and regression techniques learned in this course to build the model, and to use the area under the ROC curve (AUC) to evaluate it.
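As an illustration of the evaluation metric, the area under the ROC curve can be computed from ranks alone using the Mann-Whitney identity. The sketch below is a plain-Python version under that identity; the function name <code>roc_auc</code> and the example labels and scores are hypothetical and are not competition data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 0/1 values (1 = answered correctly); scores: predicted probabilities.
    """
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank within the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Made-up example: two incorrect and two correct answers.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of correct answers above incorrect ones.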
 
------------------------------------------------------------------------
'''Project # 23 Group members:''' 
 
Han, Jihoon
 
Vera De Casey
 
Jawad Solaiman
 
'''Title:'''  Lyft Motion Prediction for Autonomous Vehicles
 
'''Description:'''
 
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car using our machine learning knowledge and the provided training and testing data sets. The model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not yet sure whether we have to classify the agents into these three categories (cars, cyclists, pedestrians) ourselves. If so, we will start with the single-shot detector algorithm and improve on it.
 
------------------------------------------------------------------------
'''Project # 24 Group members:''' 
 
Guanting Pan

Haocheng Chang
 
Zaiwei Zhang
 
'''Title:'''  Reproducing result in Accelerated Stochastic Power Iteration
 
'''Description:'''
 
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to improve the iteration complexity of power iteration. In doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. If there is additional time, we also hope to explore and discuss the potential for applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf
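To make the acceleration idea concrete, the following is a minimal sketch of deterministic power iteration with a momentum term, w_{t+1} = A w_t - β w_{t-1}. It is not the stochastic, variance-reduced algorithm from the paper. The 2x2 matrix, the helper names, and the momentum value β = λ₂²/4 (with λ₂ the second-largest eigenvalue) are assumptions chosen for illustration and should be checked against the paper.

```python
import math

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def power_iteration_momentum(A, beta, w0, num_iters=100):
    """Deterministic power iteration with momentum: w_{t+1} = A w_t - beta * w_{t-1}."""
    w_prev = [0.0] * len(w0)
    scale = norm(w0)
    w_cur = [x / scale for x in w0]
    for _ in range(num_iters):
        Aw = matvec(A, w_cur)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        scale = norm(w_next)
        # Rescale both iterates by the same factor so the recurrence is preserved.
        w_prev = [x / scale for x in w_cur]
        w_cur = [x / scale for x in w_next]
    return w_cur

# Illustrative symmetric matrix with eigenvalues 3 and 1 (an assumption, not paper data).
A = [[2.0, 1.0], [1.0, 2.0]]
beta = 1.0 ** 2 / 4  # assumed setting: (second-largest eigenvalue)^2 / 4
w = power_iteration_momentum(A, beta, [1.0, 0.0])
rayleigh = sum(x * y for x, y in zip(w, matvec(A, w)))  # estimate of the top eigenvalue
```

Setting <code>beta = 0</code> recovers standard power iteration, which gives a simple baseline for comparing convergence speed.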
 
------------------------------------------------------------------------
'''Project # 25 Group members:''' 
 
Haoran Dong
 
Mushi Wang
 
Siyuan Qiu
 
Yan Yu
 
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles
 
'''Description:'''
 
We want to be involved in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for the self-driving car by machine learning with the datasets they provided.
 
------------------------------------------------------------------------
'''Project # 26 Group members:''' 
 
Sangeeth Kalaichanthiran
 
Evan Peters
 
Cynthia Mou
 
Yuxin Wang
 
'''Title:''' Mechanisms of Action (MoA) Prediction
 
'''Description:'''
 
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. A mechanism of action, MoA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. 
 
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.
 
------------------------------------------------------------------------
'''Project # 27 Group members:''' 
 
Delaney Smith
 
Mohammad Assem Mahmoud
 
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss"
 
'''Description:'''
 
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested against two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate.

Interestingly, focal loss was implemented to prevent bias towards larger classes (normal heartbeats) without needing to augment the smaller class data (diseased heartbeats); however, the authors did not report which of the two approaches actually performs better. We therefore hope to extend their work by answering this question in this project.
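For reference, focal loss in its usual form (Lin et al., 2017) scales the cross-entropy of an example by (1 - p)^γ, where p is the predicted probability of the true class. The sketch below shows how this down-weights easy examples. The values γ = 2 and α = 1 are illustrative defaults, not the settings reported by Romdhane et al., and the function names are hypothetical.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example; p_true is the predicted probability of the true class."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def cross_entropy(p_true):
    """Standard cross-entropy for one example, for comparison."""
    return -math.log(p_true)

# Ratio of focal loss to cross-entropy for an easy and a hard example.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # well-classified (e.g. a normal beat)
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # misclassified (e.g. a rare beat)
```

With γ = 0 the function reduces to ordinary cross-entropy, which makes it straightforward to compare focal loss against a baseline trained with data augmentation.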
------------------------------------------------------------------------
'''Project # 28 Group members:''' 
 
Fang Yuqin
 
Fu Rao
 
Li Siqi
 
Zhou Zeping
 
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network
 
'''Description:'''
Our group aims to investigate single-hidden-layer neural networks further, building on what we have learned in class. We'll focus on Gaussian-distributed data and weights, so that we can provide an expression for the spectrum in the limit of infinite width. We believe that knowledge of the spectrum can improve the efficiency of first-order optimization methods.
------------------------------------------------------------------------
'''Project # 29 Group members:''' 
 
Rui Gong
 
Xuetong Wang
 
Xinqi Ling
 
Di Ma
 
'''Title:''' Riiid! Answer Correctness Prediction
 
'''Description:'''
 
We will take part in the "Riiid! Answer Correctness Prediction" Kaggle competition. We will predict students' performance on a particular question based on their historical performance, the performance of other students on this question, and information about the question itself (such as its difficulty and length). https://www.kaggle.com/c/riiid-test-answer-prediction/overview
------------------------------------------------------------------------
'''Project # 30 Group members:''' 
 
Jiabao Dong
 
Jiaxiang Liu
 
Siyuan Xia
 
Yipeng Du
 
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation
 
'''Description:'''
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation].
 
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.
 
------------------------------------------------------------------------
 
'''Project # 31 Group members:''' 
 
Tompkins, Grace
 
Krikella, Tatiana
 
'''Title:''' A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (2018)

'''Description:'''
We will be reproducing the results of "A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting" by Cannas and Arpino (2018) and, for comparison, applying them to a new dataset: Right Heart Catheterization (RHC), which includes data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT). This paper uses simulated data and several machine learning algorithms to estimate causal effects in observational studies. The machine learning methods used include CART, Bagging, Boosting, Random Forest, Neural Networks, and Naive Bayes. The study also uses several variations of covariate balance measures. The importance of tuning the machine learning algorithms' hyperparameters is also investigated with respect to propensity score estimation.
 
 
We will use R for analysis.
 
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]
 
------------------------------------------------------------------------
 
'''Project # 32 Group members:'''
 
Taohao Wang

Zeren Shen

Zihao Guo

Rui Chen
 
'''Title:''' Google Landmark Recognition 2020
 
'''Description:'''
Our team decided to try the "Google Landmark Recognition 2020" competition on Kaggle, in which competitors are asked to build a model that detects any existing landmarks within the provided test images. This competition is challenging in its own way: the data contains more than 81K classes, so a traditional CNN would very likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.
 
------------------------------------------------------------------------
 
'''Project # 33 Group members:''' 
 
Hansa Halim
 
Sanjana Rajendra Naik
 
Samka Marfua
 
Shawrupa Proshasty
 
'''Title:''' Superhuman AI for multiplayer poker (Brown and Sandholm 2019)
 
'''Description:'''
Our team aims to recreate the paper “Superhuman AI for multiplayer poker” by Noam Brown and Tuomas Sandholm. The paper describes the algorithm used by the authors to train an AI to play poker. They primarily do so using Monte Carlo CFR. Poker is a great example for training AI with incomplete information. Furthermore, since it is a multiplayer game, it presents more complications while training the AI. The authors use abstraction, both information abstraction and action abstraction, to reduce the number of different actions to be considered by the AI.

We aim to replicate this algorithm for at least 2 players to begin with.
 
Link to paper: [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper]

Latest revision as of 08:48, 22 December 2021

Use this format (Don’t remove Project 0)

Project # 0 Group members:

Last name, First name

Last name, First name

Last name, First name

Last name, First name

Title: Making a String Telephone

Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).


Project # 1 Group members:

Feng, Jared

Huang, Xipeng

Xu, Mingwei

Yu, Tingzhou

Title: Patch-based classification of lung cancers pathological images using convolutional neural networks

In this project, we explore the classification problem of lung cancer pathological images of some patients. The input images are from three categories of tumor types (LUAD, LUSD, and MESO), and the images have been split into patches in order to reduce the computational difficulty. The classification task is decomposed into patch-wise and whole image-wise. We experiment with three neural networks for patch-wise classification, and two classical machine learning models for patient classification. Techniques of feature extraction and sampling methods for training neural networks are also implemented and studied. Our results show that support vector machine (SVM) on extracted feature vectors outperforms all other methods and achieves an accuracy of 67.86% based on DenseNet-121 model for patch-wise classification.

Our poster is available at https://www.dropbox.com/s/fu6vr2cxcbt4458/Stat_841_poster.pdf?dl=0.


Project # 2 Group members:

Anderson, Eric

Wang, Chengzhi

Zhong, Kai

Zhou, Yi Jing

Title: Clean-Label Targeted Poisons for an End-to-End Trained CNN on the MNIST Dataset

Description: Applying data poisoning techniques to the MNIST Dataset


Project # 3 Group members:

Chopra, Kanika

Rajcoomar, Yush

Bhattacharya, Vaibhav

Title: Cancer Classification

Description: We will be classifying three tumour types based on pathological data.


Project # 4 Group members:

Li, Shao Zhong

Kerr, Hannah

Wong, Ann Gie

Title: Predicting "Pawpularity" of Pets with Image Regression

Description: Analyze raw images and metadata to predict the “Pawpularity” of pet photos to help guide shelters and rescuers around the world improve the appeal of their pet profiles, so that more animals can get adopted and animals can find their "furever" home faster.


Project # 5 Group members:

Chin, Jessie Man Wai

Ooi, Yi Lin

Shi, Yaqi

Ngew, Shwen Lyng

Title: The Application of Classification in Accelerated Underwriting (Insurance)

Description: Accelerated Underwriting (AUW), also called “express underwriting,” is a faster and easier process for people in good health to obtain life insurance. The traditional underwriting process is often painful for both customers and insurers. Customers have to complete different types of questionnaires and provide different medical tests involving blood, urine, saliva and other medical results. Underwriters, on the other hand, have to manually go through every single policy to assess the risk of each applicant. AUW allows people who are deemed “healthy” to forgo medical exams. Since COVID-19, it has become a more pressing topic, as traditional underwriting cannot be performed due to the stay-at-home order. However, this imposes a burden on the insurance company to better estimate the risk associated with fewer test results.

This is where data science comes in. With different classification methods, we can address the underwriting process’ five pain points: labor, speed, efficiency, pricing and mortality. This allows us to better estimate the risk and classify clients by whether they are eligible for accelerated underwriting. For the final project, we use data from one of the leading US insurers to analyze how we can classify our clients for AUW. We will be using factors such as health data, medical history, family history and insurance history to determine eligibility.


Project # 6 Group members:

Wang, Carolyn

Cyrenne, Ethan

Nguyen, Dieu Hoa

Sin, Mary Jane

Title: Pawpularity (PetFinder Kaggle Competition)

Description: Using images and metadata on the images to predict the popularity of pet photos, which is calculated based on page view statistics and other metrics from the PetFinder website.


Project # 7 Group members:

Bhattacharya, Vaibhav

Chatoor, Amanda

Prathap Das, Sutej

Title: PetFinder.my - Pawpularity Contest [1]

Description: In this competition, we will analyze raw images and metadata to predict the “Pawpularity” of pet photos. We'll train and test our model on PetFinder.my's thousands of pet profiles.


Project # 8 Group members:

Yan, Xin

Duan, Yishu

Di, Xibei

Title: The application of classification on company bankruptcy prediction

Description: If a company goes bankrupt, all its employees will lose their jobs, and it is hard for them to find another suitable job in a short period. For the individual, an employee who loses a job due to bankruptcy will have no income for a period of time. This may lead to several negative consequences: increased homelessness, as people do not have enough money to cover living expenses, and increased crime rates, as poverty increases. For the economy, if many companies go bankrupt at the same time, a huge number of employees will lose their jobs, leading to a higher unemployment rate. This may cause a series of negative impacts on the economy: loss of government tax revenue, since the unemployed have no income and do not pay income taxes, and increased inequality in the income distribution.

Therefore, company bankruptcy negatively influences the individual, government, society, and the economy, which makes the prediction of company bankruptcy essential. The purpose of the project is to predict whether a company will go bankrupt.


Project # 9 Group members:

Loke, Chun Waan

Chong, Peter

Osmond, Clarice

Li, Zhilong

Title: Popularity of Shelter Pet Photo Prediction using Varied ML Techniques

Description: In this Kaggle competition, we will analyze raw images and metadata to predict the “Pawpularity” of pet photos.


Project # 10 Group members:

O'Farrell, Ethan

D'Astous, Justin

Hamed, Waqas

Vladusic, Stefan

Title: Pawpularity (Kaggle)

Description: Predicting the popularity of animal photos based on photo metadata


Project # 11 Group members:

JunBin, Pan

Title: Learning from Normality: Two-Stage Method with Autoencoder and Boosting Trees for Unsupervised Anomaly Detection

Description: New algorithm for unsupervised anomaly detection


Project # 12 Group members:

Kar Lok, Ng

Muhan (Iris), Li

Title: NFL Health & Safety - Helmet Assignment

Description: Assigning players to helmets in given footage of head collisions in football plays.


Project # 13 Group members:

Livochka, Anastasiia

Wong, Cassandra

Evans, David

Yalsavar, Maryam

Title: TBD

Description: TBD


Project # 14 Group Members:

Zeng, Mingde

Lin, Xiaoyu

Fan, Joshua

Rao, Chen Min

Title: Toxic Comment Classification, Kaggle

Description: Using Wikipedia comments labeled for toxicity to train a model that detects toxicity in comments.


Project # 15 Group Members:

Huang, Yuying

Anugu, Ankitha

Chen, Yushan

Title: Implementation of the classification task between crop and weeds

Description: Our work will be based on the paper Crop and Weeds Classification for Precision Agriculture using Context-Independent Pixel-Wise Segmentation.


Project # 16 Group Members:

Wang, Lingshan

Li, Yifan

Liu, Ziyi

Title: Implement and Improve CNN in Multi-Class Text Classification

Description: We are going to apply Bidirectional Encoder Representations from Transformers (BERT) to classify real-world data (an application that builds an efficient classifier for case study interview materials) and improve it at the algorithm level in the context of text classification. Implementing BERT also allows us to analyze the efficiency and practicality of the algorithm when dealing with an imbalanced dataset, at both the data input level and the modelling level. The dataset is composed of case study HTML files containing case information that can be classified into multiple industry categories. We will implement a multi-class classifier to break down the information contained in each case material into pre-determined subcategories (e.g., behavior questions, consulting questions, questions for new business/market entry, etc.). We will attempt to process the complicated data into several data types (e.g., HTML, JSON, pandas data frames, etc.) and choose the most efficient raw data processing logic based on runtime and algorithm optimization.


Project # 17 Group members:

Malhi, Dilmeet

Joshi, Vansh

Syamala, Aavinash

Islam, Sohan

Title: Kaggle project: PetFinder.my - Pawpularity Contest

Description: In this competition, we will analyze raw images provided by PetFinder.my to predict the “Pawpularity” of pet photos.


Project # 18 Group members:

Yuwei, Liu

Daniel, Mao

Title: Sartorius - Cell Instance Segmentation (Kaggle) [2]

Description: Detect single neuronal cells in microscopy images


Project # 19 Group members:

Samuel, Senko

Tyler, Verhaar

Zhang, Bowen

Title: NBA Game Prediction

Description: We will build a win/loss classifier for NBA games using player and game data, also incorporating alternative data (e.g., sports betting data).


Project # 20 Group members:

Mitrache, Christian

Renggli, Aaron

Saini, Jessica

Mossman, Alexandra

Title: Classification and Deep Learning for Healthcare Provider Fraud Detection Analysis

Description: TBD


Project # 21 Group members:

Wang, Kun

Title: TBD

Description: TBD


Project # 22 Group members:

Guray, Egemen

Title: Traffic Sign Recognition System (TSRS): SVM and Convolutional Neural Network

Description: I will build a system that recognizes road signs in the German Traffic Sign Dataset using a CNN.


Project # 23 Group members:

Bsodjahi

Title: Modeling Pseudomonas aeruginosa bacteria state through its genes expression activity

Description: Label Pseudomonas aeruginosa gene expression data through unsupervised learning (e.g., the EM algorithm) and then model the bacterial state as a function of its gene expression.