F18-STAT841-Proposal

'''Use this format (Don’t remove Project 0)'''

'''Project # 0'''
Group members:

Last name, First name

Last name, First name

Last name, First name

Last name, First name

'''Title:''' Making a String Telephone

'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).
 
--------------------------------------------------------------------
 
'''Project # 1'''
Group members:
 
Weng, Jiacheng
 
Li, Keqi
 
Qian, Yi
 
Liu, Bomeng
 
'''Title:'''  RSNA Pneumonia Detection Challenge
 
'''Description:''' 
 
Our team’s project is the RSNA Pneumonia Detection Challenge, a Kaggle competition. The primary goal of this project is to develop a machine learning tool that detects patients with pneumonia based on their chest radiographs (CXR).

Pneumonia is an infection that inflames the air sacs in the lungs, with symptoms such as chest pain, cough, and fever [1]. It can be very dangerous, especially to infants and the elderly: in 2015, 920,000 children under the age of 5 died from the disease [2]. Because it is so often fatal in children, diagnosing pneumonia is a high priority. A common diagnostic method is to obtain a chest radiograph (CXR), a gray-scale x-ray image of the patient’s chest. The region infected by pneumonia usually shows up as one or more areas of increased opacity [3] on the CXR. However, many other factors can also increase opacity on a CXR, which makes the diagnosis very challenging. Diagnosis also requires highly skilled clinicians and a great deal of time spent screening CXRs. The Radiological Society of North America (RSNA®) sees an opportunity to use machine learning to accelerate the initial CXR screening process.

For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members will apply techniques including, but not limited to, logistic regression, random forest, SVM, kNN, and CNN to detect CXRs that show pneumonia.
 
 
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297
 
 
--------------------------------------------------------------------
 
'''Project # 3'''
Group members:
 
Hanzhen Yang
 
Jing Pu Sun
 
Ganyuan Xuan
 
Yu Su
 
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
 
'''Description:'''
 
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] on Kaggle. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of 340 categories.

The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object, and almost certainly differs from person to person. Mislabeled images also present a problem, since they will create outlier points when we train our models.
 
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.
 
--------------------------------------------------------------------
 
'''Project # 4'''
Group members:
 
Snaith, Mitchell
 
'''Title:'''  Exploring Kuzushiji-MNIST, a new classification benchmark
 
'''Description:''' 
 
The paper ''Deep Learning for Classical Japanese Literature'' presents a new classification dataset intended to act as a drop-in replacement for MNIST. The paper's authors believe that this dataset is significantly more difficult than MNIST for typical classification methods, while not "capping" performance due to indiscernible objects, as Fashion-MNIST might.
The goals are to:

- perform a survey of typical machine-learning algorithms on Kuzushiji-MNIST compared to both MNIST and Fashion-MNIST
 
- investigate relevant differences in the structures of the datasets
 
- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST
--------------------------------------------------------------------
 
'''Project # 5'''
Group members:
 
Pei Wei, Wang
 
Daoyi Chen
 
Yiming Li
 
Ying  Chi
 
'''Title:'''  Kaggle Challenge: Airbus Ship Detection Challenge
 
'''Description:''' 
 
Image segmentation is now widely used in many fields, such as medical diagnosis, autonomous driving, and satellite image location. Our project is the Airbus Ship Detection Challenge on Kaggle, which aims to detect and locate ships in satellite images and put an aligned bounding box segment around each ship found. In addition, Airbus is interested in improving detection speed, which it evaluates based on inference time over more than 40,000 image chips.

The goal of our project is to construct one or more models that can accurately find the segmentation of ships in new pictures. We also need to balance accuracy against speed, given the time limit.
--------------------------------------------------------------------
 
'''Project # 6'''
Group members:
 
Ngo, Jameson
 
Xu, Amy
 
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/PLAsTiCC-2018  PLAsTiCC Astronomical Classification ]
 
'''Description:'''
 
We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle. We will explore how feasible it is to classify astronomical bodies based on various factors such as brightness.

These bodies will vary in time and size. Some are unknown! There are over 100 classes that these bodies may belong to, and it will be our job to find the predicted probability that an image belongs to each class.
 
--------------------------------------------------------------------
'''Project # 7'''
Group members:
 
Qianying Zhao
 
Hui Huang
 
Meiyu Zhou
 
Gezhou Zhang
 
'''Title:''' Quora Insincere Questions Classification
 
'''Description:''' 
Our group will participate in the featured Kaggle competition Quora Insincere Questions Classification. For this competition, we must predict whether a question asked on Quora is sincere or not. An insincere question is intended to make a statement rather than to look for useful answers, and is labeled target = 1.
We will analyze the Quora question text in RStudio to predict the characteristics of questions and determine whether they are sincere or insincere. Our presentation report will cover not only how we reached our conclusions by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.
 
--------------------------------------------------------------------
'''Project # 8'''
Group members:
 
Jiayue Zhang
 
Lingyun Yi
 
Rongrong Su
 
Siao Chen
 
 
'''Title:''' Telecom Customer Churn Prediction
 
 
'''Description:''' 
The traditional telecommunication industry is made up of telecommunication companies and internet service providers, which play an important role in daily life. It is crucial for telecommunication companies to analyze and maintain their relationships with existing customers, as well as to win new customers through marketing strategies. However, it costs 5 times as much to attract a new customer as to keep an existing one. Therefore, retaining existing customers and building loyal relationships are key concerns for traditional telecommunication companies that want to stay competitive. This project aims to give telecom companies insight into predicting the chance that a customer will leave the company. We will apply different classification models, such as Random Forest, Gradient Boosting, Logistic Regression, and XGBoost, and then compare each model's performance.
 
 
--------------------------------------------------------------------
'''Project # 9'''
Group members:
 
Brewster, Kristi
 
McLellan, Isaac
 
Hassan, Ahmad Nayar
 
Melek, Marina Medhat Rassmi
 
 
'''Title:''' Quora Insincere Questions Classification: Detect toxic content to improve online conversations
 
'''Description:'''
 
This is a Kaggle Competition.
 
Quora is an online question-and-answer platform with content created by its community of users. Quora prides itself on being a place where users can gain and share knowledge and feel safe doing it. In order to have a safe community, they need to eliminate what they term "insincere" questions.
This competition asks Kagglers to develop models that will flag these types of questions, given a list of both insincere and sincere questions.
 
We intend to use Python and its wide variety of packages as we aim to classify these questions.
 
'''Reference:'''
[1] Kaggle. (2018, Nov 18). Quora Insincere Questions Classification. [https://www.kaggle.com/c/quora-insincere-questions-classification]
 
--------------------------------------------------------------------
'''Project # 10'''
Group members:
 
Lam, Amanda

Huang, Xiaoran

Chu, Qi

Sang, Di
 
'''Title:'''  Kaggle Competition: Human Protein Atlas Image Classification
 
'''Description:'''
 
--------------------------------------------------------------------
 
'''Project # 11'''
Group members:
 
Bobichon, Philomene
 
Maheshwari, Aditya
 
An, Zepeng
 
Stranc, Colin
 
'''Title:'''  Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
 
'''Description:''' 
 
--------------------------------------------------------------------
 
'''Project # 12'''
Group members:
 
Huo, Qingxi
 
Yang, Yanmin
 
Cai, Yuanjing
 
Wang, Jiaqi
 
'''Title:'''  Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
 
'''Description:''' 
 
Our task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its robust applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) & NLP (Natural Language Processing).
 
--------------------------------------------------------------------
 
'''Project # 13'''
Group members:
 
Ross, Brendan
 
Barenboim, Jon
 
Lin, Junqiao
 
Bootsma, James
 
'''Title:'''  Paper reconstruction (Adaptive Blending Units: Trainable Activation Functions For Deep Neural Networks)
 
'''Description:'''  Adaptive Blending Units: Trainable Activation Functions For Deep Neural Networks is a paper introducing activation functions that are weighted sums of commonly used activation functions, in which the weights of the activation function are updated at each training step. First, we reconstructed the models the paper ran and compared the results. The reconstruction verified that trainable activation functions produce more accurate results. Further analysis of the trained activation functions leads to comparisons with common activation functions and to a general shape that the activation function converges to. However, we then discuss a weakness of the model: the training time for the activation weights is very large. We break down the math of the backpropagation step, outlining the computational complexity of the activation weight update.
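
The idea of an activation function built as a weighted sum of common activations can be illustrated with a minimal sketch. The set of base activations, the function names, and the plain gradient-descent update below are illustrative assumptions, not the paper's or the team's implementation.

```python
import math

# Base activations to blend. This particular set is an illustrative assumption.
BASE_ACTIVATIONS = {
    "identity": lambda x: x,
    "relu": lambda x: max(0.0, x),
    "tanh": math.tanh,
}


def blended_activation(x, weights):
    """Weighted sum of the base activations evaluated at x."""
    return sum(weights[name] * fn(x) for name, fn in BASE_ACTIVATIONS.items())


def weight_gradients(x, upstream_grad):
    """Gradient of the loss with respect to each blending weight.

    The output is linear in the weights, so d(output)/d(weight_i) = f_i(x);
    the chain rule multiplies this by the gradient arriving from above.
    """
    return {name: upstream_grad * fn(x) for name, fn in BASE_ACTIVATIONS.items()}


def gradient_step(weights, grads, learning_rate=0.1):
    """One plain gradient-descent update of the blending weights (assumed optimizer)."""
    return {name: w - learning_rate * grads[name] for name, w in weights.items()}
```

Each training step evaluates every base activation once in the forward pass and once more for the weight gradients, which hints at why training the blending weights adds cost.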
 
--------------------------------------------------------------------
 
'''Project # 14'''
Group members:
 
Schneider, Jason
 
Walton, Jordyn
 
Abbas, Zahraa
 
Na, Andrew
 
'''Title:'''  Application of ML Classification to Cancer Identification
 
'''Description:'''  The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve the accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1], which contains gene expression data from tumour biopsies, to train a model that classifies healthy individuals and individuals who have cancer.

One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (each of which represents a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this, including the use of PCA, leave-one-out cross validation, or regularization.
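
Leave-one-out cross validation, one of the mitigations listed above, can be sketched as follows. The nearest-centroid classifier and the toy data are stand-ins chosen for brevity (assumptions, not the team's model or the Golub data); the point is only the hold-one-sample-out loop.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class mean is closest to x (squared Euclidean distance)."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], x))


def leave_one_out_accuracy(X, y):
    """Hold out each sample in turn, fit on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy data: six samples, two features, two well-separated classes.
toy_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
print(leave_one_out_accuracy(toy_X, toy_y))
```

With only about twenty samples, this uses every sample for testing exactly once while training on all the others, at the cost of refitting the model once per sample.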
 
----------------------------------------------------------------------
 
'''Project # 15'''
Group members:
 
Praneeth, Sai
 
Peng, Xudong
 
Li, Alice
 
Vajargah, Shahrzad
 
'''Title:'''  Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition
 
'''Description:'''  Guess which cabin class in airlines is the most profitable? One might guess economy, but in reality it's the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than other classes.

In fact, just like airlines, many companies adopt a business model in which the vast majority of revenue is derived from a minority of customers. As a result, data-intensive promotional strategies are getting more and more attention from marketing teams looking to further improve company returns.

In this Kaggle competition, we are challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.
 
'''Reference:'''
 
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction
 
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes
----------------------------------------------------------------------
 
'''Project # 16'''
Group members:
 
Wang, Yu Hao
 
Grant, Aden
 
McMurray, Andrew
 
Song, Baizhi
 
'''Title:'''  Google Analytics Customer Revenue Prediction - A Kaggle Competition
 
The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
 
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.
 
In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.
 
We will test a variety of classification algorithms to determine an appropriate model.
 
----------------------------------------------------------------------
 
'''Project # 17'''
Group Members:
 
Jiang, Ya Fan
 
Zhang, Yuan
 
Hu, Jerry Jie
 
'''Title:'''  Humpback Whale Identification
 
'''Description:''' We analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors, to identify each individual whale from an image of its tail.
 
----------------------------------------------------------------------
 
'''Project # 18'''
Group Members:
 
Zhang, Ben
 
Mall, Sunil
 
Rees Simmons
 
'''Title:''' Formal Adversary, Towards an Epsilon Free Optimization
 
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.
 
----------------------------------------------------------------------
'''Project # 19'''
Group Members:
 
Yan Yu Chen
 
Qisi Deng
 
Hengxin Li
 
Bochao Zhang
 
'''Description:''' Our team presents the Unsupervised Lexicon-Based Sentiment Topic Model (ULSTM), a sentiment analysis model for reviews on the popular crowd-sourced review forum Yelp. The model applies unsupervised learning because supervised methods have many constraints. Furthermore, instead of employing an existing sentiment lexicon, we developed a sentiment dictionary using the linguistic corpus WordNet; the self-defined lexicon allows more targeted scoring of the evaluated dataset. Finally, the ULSTM adopts the Latent Dirichlet Allocation model to find the most frequently mentioned topics in reviews of individual businesses.
 
'''Dataset''': Yelp Review Dataset from Kaggle
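
The lexicon-based scoring step described above might look roughly like the sketch below. The lexicon entries, the tokenizer, and the averaging rule are hypothetical placeholders; the project builds its own dictionary from WordNet, and its actual scoring rule is not given here.

```python
import re

# Hypothetical sentiment lexicon mapping words to scores in [-1, 1].
# The project derives its own dictionary from WordNet; these entries are made up.
LEXICON = {"great": 1.0, "friendly": 0.5, "slow": -0.5, "terrible": -1.0}


def sentiment_score(review):
    """Average the lexicon scores of the words in a review; 0.0 if none match."""
    tokens = re.findall(r"[a-z']+", review.lower())
    scores = [LEXICON[token] for token in tokens if token in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(sentiment_score("Great food, friendly staff!"))
print(sentiment_score("Terrible and slow service"))
```

No labeled reviews are needed for this step, which is what makes the approach unsupervised.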
----------------------------------------------------------------------
'''Project # 20'''
Group Members:
 
Dong, Yongqi (Michael)
 
Kingston, Stephen
 
Hou, Zhaoran
 
Zhang, Chi
 
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements 
 
'''Description:''' The movement in price of a tradable security, or stock, on any given day is an aggregation of each individual market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of macro- and micro-economic variables that affect a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution.
 
----------------------------------------------------------------------
 
'''Project # 21'''
Group members:
 
Xiao, Alexandre
 
Zhang, Richard
 
Ash, Hudson
 
Zhu, Ziqiu
 
'''Title:'''  Image Segmentation with Capsule Networks using CRF loss
 
'''Description:'''  Investigate the impact of changing the loss function and regularizers on image segmentation tasks with capsule networks.
 
----------------------------------------------------------------------
 
'''Project # 22'''
Group Members:
 
Lee, Yu Xuan
 
Heng, Tsen Yee
 
'''Title:''' Wine Rating Prediction
 
'''Description:''' Predict the rating of bottles of wine with the help of machine learning. Using the variables in the wine review datasets that we found on Kaggle, we are able to show that the points, the price, and the production year of a wine are crucial in determining the value of a bottle. The formula for the price increase per point is taken from www.vivino.com. From this information, we are able to determine which wines are worth buying!
 
 
-------------------------------------------------------------------------
 
'''Project # 23'''
Group Members:
 
Bayati, Mahdiyeh
 
Malek Mohammadi, Saber
 
Luong, Vincent
 
 
'''Title:''' Human Protein Atlas Image Classification
 
 
'''Description:''' The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.
 
-------------------------------------------------------------------------
 
'''Project # 24'''
Group Members:
 
Wu Yutong

Wang Shuyue

Jiao Yan
 
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification
 
'''Description:'''  Quora is a question-and-answer website where users can ask questions and share opinions. For the company, one key challenge is to identify insincere questions, which are defined as those founded upon false premises or intended to make a statement rather than to look for helpful answers. This report is about classifying Quora questions as "Sincere" or "Insincere". The data used in this project was prepared by Quora and can be found on the Kaggle website. We tried Bi-GRU and Capsule Network models, along with a blend of LSTM and CNN models. Experiments demonstrated that they have similar performance.

Latest revision as of 14:41, 13 December 2018

Use this format (Don’t remove Project 0)

Project # 0 Group members:

Last name, First name

Last name, First name

Last name, First name

Last name, First name

Title: Making a String Telephone

Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).


Project # 1 Group members:

Weng, Jiacheng

Li, Keqi

Qian, Yi

Liu, Bomeng

Title: RSNA Pneumonia Detection Challenge

Description:

Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR).

Pneumonia is an infection that inflames the air sacs in human lungs which has symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous especially to infants and elders. In 2015, 920,000 children under the age of 5 died from this disease [2]. Due to its fatality to children, diagnosing pneumonia has a high order. A common method of diagnosing pneumonia is to obtain patients’ chest radiograph (CXR) which is a gray-scale scan image of patients’ chests using x-ray. The infected region due to pneumonia usually shows as an area or areas of increased opacity [3] on CXR. However, many other factors can also contribute to increase in opacity on CXR which makes the diagnose very challenging. The diagnose also requires highly-skilled clinicians and a lot of time of CXR screening. The Radiological Society of North America (RSNA®) sees the opportunity of using machine learning to potentially accelerate the initial CXR screening process.

For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.


[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204 [2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge [3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297



Project # 3 Group members:

Hanzhen Yang

Jing Pu Sun

Ganyuan Xuan

Yu Su

Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge

Description:

Our team chose the Quick, Draw! Doodle Recognition Challenge from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.

The main challenge of the project remains in the training set being very noisy. Hand-drawn artwork may deviate substantially from the actual object, and is almost definitively different from person to person. Mislabeled images also present a problem since they will create outlier points when we train our models.

We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.


Project # 4 Group members:

Snaith, Mitchell

Title: Exploring Kuzushiji-MNIST, a new classification benchmark

Description:

The paper *Deep Learning for Classical Japanese Literature* presents a new classification dataset intended to act as a drop-in replacement for MNIST. The paper authors believe that this dataset is significantly more difficult that MNIST for typical classification methods, while not "capping" performance due to indiscernible objects like Fashion-MNIST might. Goals are to:

- perform survey of typical machine-learning algorithms on Kuzushiji-MNIST compared to both MNIST and Fashion-MNIST

- investigate relevant differences in the structures of the datasets

- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST


Project # 5 Group members:

Pei Wei, Wang

Daoyi Chen

Yiming Li

Ying Chi

Title: Kaggle Challenge: Airbus Ship Detection Challenge

Description:

Image segmentation is now widely used in all kinds of field like medical diagnosis, autonomous driving and satellite image location. Our project is chosen from Kaggle competition - Airbus Ship Detection, which aims to detect, locate ships in satellite images and put an aligned bounding box segment around the ships we locate.What’s more, Airbus is also interested in improving the detection speed via a speed evaluation based upon the inference time on over 40,000 images chips.

The goal of our project is to construct a model(s) that can accurately find the ship's segmentation in new pictures. We also need to balance the accuracy and the speed since the time limitation.


Project # 6 Group members:

Ngo, Jameson

Xu, Amy

Title: Kaggle Challenge: PLAsTiCC Astronomical Classification

Description:

We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle. We will explore how possible it is classify astronomical bodies based on various factors such as brightness.

These bodies will vary in time and size. Some are unknown! There are over 100 classes that these bodies may be and it will be our job to find the predicted probability for an image to be each class.


Project # 7 Group members:

Qianying Zhao

Hui Huang

Meiyu Zhou

Gezhou Zhang

Title: Quora Insincere Questions Classification

Description: Our group will participate in the featured Kaggle competition of Quora Insincere Questions Classification. For this competition, we should predict wether a question asked on Quora is sincere or not. If the question is insincere, it intends to be a statement rather than look for useful answers, and identified as (target = 1). We will analyze the Quora question text to predict the characteristics of questions and define they are sincere or insincere using Rstudio. Our presentation report will include not only how we've concluded by classifying and analyzing provided data with appropriate models, but also how we performed in the contest.


Project # 8 Group members:

Jiayue Zhang

Lingyun Yi

Rongrong Su

Siao Chen


Title: Telecom Customer Churn Prediction


Description: Traditional telecommunication industry is made up of telecommunication companies and internet service providers, which play important role in daily life. It is crucial for the telecommunication companies to analyze and maintain their relationship with existing customers, as well as winning new customers with marketing strategies. However, it costs 5 times as much to attract a new customer than to keep an existing one. Therefore, retaining existing customers and building a loyal relationship are the key concerns for traditional telecommunication companies to stay strong in the competition. This project aims to provide insights for the telecom companies in predicting the chance of a customer leaving the company. We will be applying different classification models such as Random Forest, Gradient boosting, Logistic Regression and XGBoost, and then compare each model's performance.



Project # 9 Group members:

Brewster, Kristi

McLellan, Isaac

Hassan, Ahmad Nayar

Melek, Marina Medhat Rassmi


Title: Quora Insincere Questions Classification: Detect toxic content to improve online conversations

Description:

This is a Kaggle Competition.

Quora is an online question and answer platform with content created by its community of users. Quora prides itself as being a place where users can gain and share knowledge and feel safe doing it. In order to have a safe community, they need to eliminate what they term as "insincere" questions. This competitioon asks Kagglers to develop models that will flag these types of questions given a list of both insincere and sincere questions.

We intend to use Python and its wide variety of packages as we aim to classify these questions.

Reference: [1] Kaggle. (2018, Nov 18). Quora Insincere Questions Classification. [1]


Project # 10 Group members:

Lam, Amanda

Huang, Xiaoran

Chu, Qi

Sang, Di

Title: Kaggle Competition: Human Protein Atlas Image Classification

Description:


Project # 11 Group members:

Bobichon, Philomene

Maheshwari, Aditya

An, Zepeng

Stranc, Colin

Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge

Description:


Project # 12 Group members:

Huo, Qingxi

Yang, Yanmin

Cai, Yuanjing

Wang, Jiaqi

Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge

Description:

Our task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its robust applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) & NLP (Natural Language Processing).


Project # 13 Group members:

Ross, Brendan

Barenboim, Jon

Lin, Junqiao

Bootsma, James

Title: Paper reconstruction (Adaptive Blending Units: Trainable Activation Functions For Deep Neural Networks)

Description: Adaptive Blending Units: Trainable Activation Functions For Deep Neural Networks is a paper introducing activation functions that are weighted sums of commonly used activation functions. In which the of the activation function's weights are updated with each training step. First, we reconstructed the models the paper ran and compared the results. A reconstruction of the model verified that trainable activation functions produce more accurate results. Further analysis of the trained activation function leads to comparisons between common activation functions and a general shape that the activation function converges to. However, we then discuss a weakness in the model such that the training time for the activation weights is very large. We break down the math of the back propagation step outlining the computational complexity of the activation weight iteration step.


Project # 14 Group members:

Schneider, Jason

Walton, Jordyn

Abbas, Zahraa

Na, Andrew

Title: Application of ML Classification to Cancer Identification

Description: The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.

One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (each of which represents a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this, including the use of PCA, leave-one-out cross-validation, or regularization.
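Leave-one-out cross-validation suits very small samples because every observation serves as the test point exactly once. The sketch below uses a nearest-centroid classifier and a made-up six-sample toy dataset purely for illustration; the classifier, the data, and the function names are assumptions, not the group's chosen method.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy "gene expression" rows (hypothetical values, three features each).
X = [[0.1, 0.2, 0.0], [0.0, 0.3, 0.1], [0.2, 0.1, 0.1],
     [1.0, 0.9, 1.1], [0.9, 1.2, 1.0], [1.1, 1.0, 0.8]]
y = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
accuracy = leave_one_out_accuracy(X, y)
```

With n samples this trains n models, which is affordable when n is around twenty but would be costly on large datasets.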


Project # 15 Group members:

Praneeth, Sai

Peng, Xudong

Li, Alice

Vajargah, Shahrzad

Title: Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition

Description: Guess which cabin class in airlines is the most profitable? One might guess economy, but in reality, it's the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than other classes.

In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.

In this Kaggle competition, we are challenged to analyze a Google Merchandise Store's customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.

Reference:

[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction

[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes


Project # 16 Group members:

Wang, Yu Hao

Grant, Aden

McMurray, Andrew

Song, Baizhi

Title: Google Analytics Customer Revenue Prediction - A Kaggle Competition

Description: The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.

In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.

We will test a variety of classification algorithms to determine an appropriate model.


Project # 17 Group Members:

Jiang, Ya Fan

Zhang, Yuan

Hu, Jerry Jie

Title: Humpback Whale Identification

Description: We analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors, to identify each individual whale from an image of its tail.


Project # 18 Group Members:

Zhang, Ben

Mall, Sunil

Rees Simmons

Title: Formal Adversary, Towards an Epsilon Free Optimization

Description: Use news analytics to predict stock price performance. This is subject to change.


Project # 19 Group Members:

Yan Yu Chen

Qisi Deng

Hengxin Li

Bochao Zhang

Description: Our team presents the Unsupervised Lexicon-Based Sentiment Topic Model (ULSTM) as a sentiment analysis model for reviews on the popular crowd-sourced review forum Yelp. The model applies unsupervised learning, since supervised methods have many constraints. Furthermore, instead of employing an existing sentiment lexicon, we developed a sentiment dictionary using the linguistic corpus WordNet; the self-defined lexicon allows more targeted scoring of the evaluated dataset. Finally, the ULSTM adopts the Latent Dirichlet Allocation model to find the most frequently mentioned topics in reviews of individual businesses.
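As a rough illustration of lexicon-based scoring, the sketch below averages the scores of the lexicon words found in a review. The four-word `LEXICON` and its scores are hypothetical placeholders; the group's actual dictionary is derived from WordNet and is not reproduced here, and the topic-modelling step is omitted.

```python
import re

# Hypothetical toy lexicon mapping words to sentiment scores in [-1, 1].
LEXICON = {"great": 1.0, "friendly": 0.5, "slow": -0.5, "terrible": -1.0}

def sentiment_score(review):
    """Average score of the lexicon words in the review; 0.0 if none occur."""
    tokens = re.findall(r"[a-z']+", review.lower())
    scores = [LEXICON[token] for token in tokens if token in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

mixed = sentiment_score("Great food but slow service")
neutral = sentiment_score("We looked at the menu")
```

No labelled reviews are needed for this kind of scoring, which is what makes the approach unsupervised.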

Dataset: Yelp Review Dataset from Kaggle


Project # 20 Group Members:

Dong, Yongqi (Michael)

Kingston, Stephen

Hou, Zhaoran

Zhang, Chi

Title: Kaggle--Two Sigma: Using News to Predict Stock Movements

Description: The movement in price of a tradable security, or stock, on any given day is an aggregation of each individual market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of macro- and micro-economic variables that affect a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution.


Project # 21 Group members:

Xiao, Alexandre

Zhang, Richard

Ash, Hudson

Zhu, Ziqiu

Title: Image Segmentation with Capsule Networks using CRF loss

Description: Investigate the impact of changing the loss function and regularizers on image segmentation tasks with capsule networks.


Project # 22 Group Members:

Lee, Yu Xuan

Heng, Tsen Yee

Title: Wine Rating Prediction

Description: We predict the ratings of bottles of wine with the help of machine learning. Using the variables in the wine review datasets that we found on Kaggle, we are able to show that the points, the price, and the production year of a wine are crucial in determining the value of a bottle. The formula for the price increase per point is taken from www.vivino.com. From this information, we are able to determine which wines are worth buying!



Project # 23 Group Members:

Bayati, Mahdiyeh

Malek Mohammadi, Saber

Luong, Vincent


Title: Human Protein Atlas Image Classification


Description: The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.


Project # 24 Group Members:

Wu Yutong

Wang Shuyue

Jiao Yan

Title: Kaggle Competition: Quora Insincere Questions Classification

Description: Quora is a question-and-answer website where users can ask questions and share opinions. For the company, one key challenge is to identify insincere questions, which are defined as those founded upon false premises or intended to make a statement rather than to look for helpful answers. This report is about classifying Quora questions as "Sincere" or "Insincere". The data used in this project was prepared by Quora and can be found on the Kaggle website. We tried Bi-GRU and Capsule Network models, along with a blend of LSTM and CNN models. Experiments demonstrated that they have similar performance.