F21-STAT 441/841 CM 763-Proposal

From statwiki
Revision as of 17:29, 9 October 2020 by Elcarson

Use this format (Don’t remove Project 0)

Project # 0 Group members:

Last name, First name

Last name, First name

Last name, First name

Last name, First name

Title: Making a String Telephone

Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).


Project # 1 Group members:

Song, Quinn

Loh, William

Bai, Junyue

Choi, Phoebe

Title: APTOS 2019 Blindness Detection

Description:

Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.

Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.

Our team plans to solve this problem by applying our knowledge in image processing and classification.



Project # 2 Group members:

Li, Dylan

Li, Mingdao

Lu, Leonie

Sharman, Bharat

Title: Risk prediction in life insurance industry using supervised learning algorithms

Description:

In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will use the Prudential Life Insurance data set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and then use the processed data to perform classification with four algorithms: Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and produce visualizations that explain the results clearly, even to a non-quantitative audience.
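The two comparison metrics named above, MAE and RMSE, can be written out directly. The following is a minimal sketch in plain Python; the function names and the sample values are hypothetical and are not taken from the Prudential data set.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference (penalizes large misses more)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical risk levels, for illustration only.
actual = [1, 4, 8, 2]
predicted = [2, 4, 6, 2]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and a large gap between the two suggests a few large misses.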

Our goal for this project is to learn to apply the algorithms covered in class to an industry dataset and to produce results that can aid better, data-driven decision making.


Project # 3 Group members:

Parco, Russel

Sun, Scholar

Yao, Jacky

Zhang, Daniel

Title: Lyft Motion Prediction for Autonomous Vehicles

Description:

Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.

Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.

Our aim is to apply classification techniques learned in class to optimally predict how these objects move.


Project # 4 Group members:

Chow, Jonathan

Dharani, Nyle

Nasirov, Ildar

Title: Classification with Abstinence

Description:

We seek to implement the algorithm described in Deep Gamblers: Learning to Abstain with Portfolio Theory. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.

Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia chest x-rays.
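The abstention option described above can be illustrated with a small sketch. It assumes the classifier has one extra output for "abstain" and uses a loss of the form -log(p_true + p_abstain / payoff), which is one reading of the approach in Deep Gamblers. The payoff value, the threshold rule, and all function names are assumptions for illustration, not the paper's reference implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gambler_loss(logits, label, payoff):
    """Loss for one example; the last logit is the abstain option.

    Probability placed on abstaining is credited at a reduced rate
    (divided by the payoff), so abstaining is safer than a wrong guess
    but less rewarding than a confident correct prediction.
    """
    probs = softmax(logits)
    return -math.log(probs[label] + probs[-1] / payoff)

def predict_or_abstain(logits, threshold):
    """Return the predicted class index, or None when the abstain score is high."""
    probs = softmax(logits)
    if probs[-1] > threshold:
        return None
    class_probs = probs[:-1]
    return class_probs.index(max(class_probs))

# Two classes (e.g. normal vs. pneumonia) plus the abstain output.
print(predict_or_abstain([2.0, 0.0, 0.0], threshold=0.5))
print(predict_or_abstain([0.0, 0.0, 3.0], threshold=0.5))
```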


Project # 5 Group members:

Jones, Hayden

Leung, Michael

Haque, Bushra

Mustatea, Cristian

Title: Combine Convolution with Recurrent Networks for Text Classification

Description:

Our team chose to reproduce the paper Combine Convolution with Recurrent Networks for Text Classification from arXiv. The paper combines CNN and RNN architectures through a “neural tensor layer”, which merges the outputs of the two architectures more flexibly than simple concatenation, with the aim of improving text classification. In particular, the paper claims that the novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to reproduce the paper by working in two pairs, one pair implementing the CNN pipeline and the other implementing the RNN pipeline. We will work with TensorFlow 2 and Google Colab, and will reproduce the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.
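To make the contrast with simple concatenation concrete, here is a minimal sketch of a generic neural tensor layer in plain Python. It uses the commonly cited form tanh(aᵀ W[k] b + V[k]·[a; b] + bias[k]) for each slice k. Whether the paper uses exactly this form is an assumption, and the vectors and weights below are toy values.

```python
import math

def neural_tensor_layer(a, b, W, V, bias):
    """Combine vectors a and b with one bilinear term per tensor slice.

    W is a list of k matrices (len(a) x len(b)), V is a list of k weight
    vectors over the concatenation [a; b], and bias is a list of k scalars.
    Returns a list of k activations.
    """
    concat = a + b  # plain concatenation [a; b]
    out = []
    for W_k, V_k, b_k in zip(W, V, bias):
        # Bilinear interaction between every pair of entries of a and b.
        bilinear = sum(a[i] * W_k[i][j] * b[j]
                       for i in range(len(a)) for j in range(len(b)))
        # Ordinary linear term on the concatenated vector.
        linear = sum(v * x for v, x in zip(V_k, concat))
        out.append(math.tanh(bilinear + linear + b_k))
    return out

# Toy stand-ins for a CNN feature vector and an RNN feature vector.
cnn_features = [1.0, 0.0]
rnn_features = [0.0, 1.0]
W = [[[0.0, 1.0], [0.0, 0.0]],   # slice 0: couples a[0] with b[1]
     [[0.0, 0.0], [0.0, 0.0]]]   # slice 1: no bilinear interaction
V = [[0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]
bias = [0.0, -1.0]

fused = neural_tensor_layer(cnn_features, rnn_features, W, V, bias)
print(fused)
```

The bilinear term lets each output depend on products of CNN and RNN features, which a linear layer on the concatenated vector cannot express.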


Project # 6 Group members:

Chin, Ruixian

Ong, Jason

Chiew, Wen Cheen

Tan, Yan Kai

Title: Mechanisms of Action (MoA) Prediction

Description:

Our team chose to participate in a Kaggle research challenge "Mechanisms of Action (MoA) Prediction". The challenge is a joint project of the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS), who present it with the goal of advancing drug development through improvements to MoA prediction algorithms.


Project # 7 Group members:

Ren, Haotian

Cheung, Ian Long Yat

Hussain, Swaleh

Zahid, Bin, Haris

Title: Transaction Fraud Detection

Description:

Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and higher speed.


Project # 8 Group members:

ZiJie, Jiang

Yawen, Wang

DanMeng, Cui

MingKang, Jiang

Title: Lyft Motion Prediction for Autonomous Vehicles

Description:

Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.



Project # 9 Group members:

Banno, Dion

Battista, Joseph

Kahn, Solomon

Title: Increasing Spotify user engagement through predictive personalization

Description:

Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.



Project # 10 Group members:

Qing, Guo

Wang, Yuanxin

James, Ni

Xueguang, Ma

Title: Mechanisms of Action (MoA) Prediction

Description:

Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms. Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature and, in doing so, to learn the various algorithms taught in this course.



Project # 11 Group members:

Yang, Jiwon

Mahdi, Anas

Thibault, Will

Lau, Jan

Title: Application of classification in human fatigue analysis

Description:

The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plate data. First, we plan to obtain data from 4 to 6 participants performing squats, with or without weights, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. While the participants are squatting, we will ask them about their fatigue level and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be rated on a scale of 1 to 10 (1 being not fatigued at all and 10 being unable to continue). Once the data is collected, we will classify the motion capture and force plate data into the different levels of fatigue.



Project # 12 Group members:

Xiaolan Xu

Robin Wen

Yue Weng

Beizhen Chang

Title: Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy

Description:

Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submillimetre facilities; that is, we cannot tell which galaxy in a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of the submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on the other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.

In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who tested a set of standard classification algorithms on the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly with a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with error bars. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.


Project # 13 Group members:


Zihui (Betty) Qin

Wenqi (Maggie) Zhao

Muyuan Yang

Amartya (Marty) Mukherjee

Title: Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction

Description:

Background (why we were interested in classifying based on insiders): The United States has one of the most actively traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that, using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. We chose this prediction target because the role of the insider gives investors signals of potential internal activities and private information. Detecting these signals in insider trading activities is crucial for investors who hope to benefit from the market.

Goal: To classify the role of an insider in a company based on the data of their trades.



Project # 14 Group members:

Jung, Kyle

Kim, Dae Hyun

Lee, Stan

Lim, Seokho

Title: Mechanisms of Action (MoA) Prediction Competition

Description: The main objective of this Kaggle competition is to develop an algorithm that predicts a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply the concepts and algorithms learned in STAT441 to perform multi-label classification. Through the process, our team will also acquire the biological knowledge needed to complete and enhance our classification thought process. https://www.kaggle.com/c/lish-moa


Project # 15 Group members:

Li, Evan

Abuaisha, Karam

Vadivelu, Nicholas

Pu, Jason

Title: Predict Students Answering Ability Kaggle Competition

Description:

We plan on tackling this Kaggle competition (https://www.kaggle.com/c/riiid-test-answer-prediction), which revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.


Project # 16 Group members:

Hall, Matthew

Chalaturnyk, Johnathan

Title: Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS

Description:



Project # 17 Group members:

Yang, Junyi

Wang, Jill Yu Chieh

Wu, Yu Min

Li, Calvin

Title: Humpback Whale Identification

Description:

Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that each class has, on average, only about 8 training images.


Project # 18 Group members:

Lian, Jinjiang

Zhu, Yisheng

Huang, Mingzhe

Hou, Jiawen

Title: Mechanisms of Action (MoA) Prediction

Description:

Our team's final project is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA algorithms help scientists find more targeted drug molecules based on the biological mechanism of a disease, which would considerably shorten the drug development cycle. Here, MoA is studied by applying different drugs to human cells and analyzing the corresponding reaction; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs.

To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, SVM, etc., and find the one that best completes the task. Depending on how we perform, we might use other techniques such as model ensembling or stacking.


Project # 19 Group members:

Fagan, Daniel

Brooke, Cooper

Perelman, Maya

Title: Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)

Description:

For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA.

For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.
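For a multi-label problem such as this one, the logarithmic loss can be computed per label and then averaged. The sketch below assumes a plain average over every (sample, label) pair and clips probabilities away from 0 and 1. The function name, the clipping constant, and the toy data are assumptions for illustration, not the competition's official scoring code.

```python
import math

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over all samples and all labels.

    y_true: rows of 0/1 labels, one column per MoA label.
    y_pred: rows of predicted probabilities with the same shape.
    """
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / (n_rows * n_labels)

# Toy example: 2 samples, 2 labels, an uninformative model predicting 0.5.
labels = [[1, 0], [0, 1]]
uninformed = [[0.5, 0.5], [0.5, 0.5]]
print(mean_columnwise_log_loss(labels, uninformed))
```

A model that always predicts 0.5 scores log(2), which gives a simple baseline that any candidate model should beat.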


Project # 20 Group members:

Cheng, Leyan

Dai, Mingyan

Jiang, Daniel

Huang, Jerry

Title: Riiid! Answer Correctness Prediction

Description:

We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.

We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.


Project # 21 Group members:

Carson, Emilee

Ellmen, Isaac

Mohammadrezaei, Dorsa


Title: Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence

Description:

Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.

The standard method for classifying viruses based on location is to:

- Perform a multiple sequence alignment (MSA)

- Build a phylogenetic tree of the MSA

- Empirically determine which regions have which sections of the tree

Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.

Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.
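One candidate for the encoding step is a one-hot encoding of each aligned position, which turns every sequence in the MSA into a fixed-length numeric vector suitable for PCA. The sketch below is only an illustration. The alphabet (four bases plus the gap symbol), the all-zero treatment of ambiguous symbols, and the toy sequences are assumptions, and the sequences are not GISAID data.

```python
ALPHABET = "ACGT-"  # assumed symbols: four bases plus the alignment gap

def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Flatten an aligned sequence into a 0/1 vector.

    Each position contributes len(alphabet) entries, so all sequences of
    the same aligned length map to vectors of the same dimension.
    Symbols outside the alphabet (e.g. the ambiguous base N) become all zeros.
    """
    vector = []
    for symbol in aligned_seq.upper():
        vector.extend(1 if symbol == letter else 0 for letter in alphabet)
    return vector

# Toy alignment of three sequences with four aligned positions each.
alignment = ["ACG-", "ACGT", "ANGT"]
matrix = [one_hot_encode(seq) for seq in alignment]

for seq, row in zip(alignment, matrix):
    print(seq, row)
```

Each row of `matrix` has 4 × 5 = 20 entries. Stacking the rows gives the data matrix on which PCA, followed by LDA or QDA, could be run.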