F18-STAT841-Proposal: Difference between revisions
Revision as of 20:25, 7 October 2018
Use this format (Don’t remove Project 0)
Project # 0 Group members:
Last name, First name
Last name, First name
Last name, First name
Last name, First name
Title: Making a String Telephone
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).
Project # 1 Group members:
Weng, Jiacheng
Li, Keqi
Qian, Yi
Liu, Bomeng
Title: RSNA Pneumonia Detection Challenge
Description:
Our team’s project is the RSNA Pneumonia Detection Challenge, a Kaggle competition. The primary goal of this project is to develop a machine learning tool that detects pneumonia in patients from their chest radiographs (CXR).
Pneumonia is an infection that inflames the air sacs of the lungs, with symptoms such as chest pain, cough, and fever [1]. It can be very dangerous, especially to infants and the elderly. In 2015, 920,000 children under the age of 5 died from this disease [2]. Because it is so often fatal to children, diagnosing pneumonia is a high priority. A common diagnostic method is to obtain the patient’s chest radiograph (CXR), a gray-scale X-ray image of the chest. The region infected by pneumonia usually appears as one or more areas of increased opacity [3] on the CXR. However, many other factors can also increase opacity on a CXR, which makes diagnosis very challenging. Diagnosis also requires highly skilled clinicians and a great deal of time spent screening CXRs. The Radiological Society of North America (RSNA®) sees an opportunity to use machine learning to accelerate the initial CXR screening process.
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members will apply techniques including, but not limited to, logistic regression, random forest, SVM, kNN, and CNN in order to detect CXRs that show pneumonia.
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297
Project # 2 Group members:
Hou, Zhaoran
Zhang, Chi
Title:
Description:
Project # 3 Group members:
Hanzhen Yang
Jing Pu Sun
Ganyuan Xuan
Yu Su
Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
Description:
Our team chose the Quick, Draw! Doodle Recognition Challenge, a Kaggle competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of 340 categories.
The main challenge of the project lies in the training set being very noisy. Hand-drawn artwork may deviate substantially from the actual object, and almost certainly differs from person to person. Mislabeled images also present a problem since they will create outlier points when we train our models.
We plan to study some of the mature image recognition algorithms currently available to inspire and develop our own model.
Project # 4 Group members:
Snaith, Mitchell
Title: Reproducibility report: *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks*
Description:
The paper *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks* [1] has been submitted to ICLR 2019. It aims to "fix" variational Bayes and turn it into a robust inference tool through two innovations.
Goals are to:
- reproduce the deterministic variational inference scheme as described in the paper without referencing the original author's code, providing a 3rd party implementation
- reproduce experiment results with own implementation, using the same NN framework for reference implementations of compared methods described in the paper
- reproduce experiment results with the author's own implementation
- explore other possible applications of variational Bayes besides heteroscedastic regression
[1] OpenReview location: https://openreview.net/forum?id=B1l08oAct7
Project # 5 Group members:
Rebecca, Chen
Susan,
Mike, Li
Ted, Wang
Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
Description:
Classification has become increasingly prominent, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms optimized for a specific type of image classification.
In this project, we will dig into the base classifiers we learned in class and combine them to find an optimal solution for a particular type of image dataset. Currently, we are looking into a dataset from Kaggle: the Quick, Draw! Doodle Recognition Challenge. The dataset in this competition contains 50M drawings across 340 categories. It is a subset of the world’s largest doodling dataset, which is continually updated by players of the drawing game. Anyone can contribute by joining it (quickdraw.withgoogle.com).
As machine learning students, we are eager to help find a better classification method. By “better”, we mean one that balances simplicity and accuracy. We will start with neural networks that use different activation functions in each layer, and we will also combine base classifiers through ensemble learning with bagging, random forests, and boosting. We will also regularize our parameters to avoid overfitting the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem.
Hopefully, we can not only finish our project successfully but also make a small contribution to the machine learning research field.
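The bagging step mentioned in the description above can be illustrated with a minimal sketch. This is not the team's implementation: the 1-nearest-neighbour base classifier, the one-dimensional toy data, and the function names (bootstrap_sample, majority_vote, bagged_predict) are all assumptions chosen only to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most frequent label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]


def one_nn(train, x):
    """Hypothetical base classifier: label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def bagged_predict(data, x, n_models=25, seed=0):
    """Fit one base classifier per bootstrap sample and combine them by vote."""
    rng = random.Random(seed)
    votes = [one_nn(bootstrap_sample(data, rng), x) for _ in range(n_models)]
    return majority_vote(votes)


# Toy one-dimensional data with two made-up categories (assumption).
data = [
    (0.0, "circle"), (0.1, "circle"), (0.2, "circle"),
    (1.0, "square"), (1.1, "square"), (1.2, "square"),
]
prediction_low = bagged_predict(data, 0.05)
prediction_high = bagged_predict(data, 1.15)
```

Each base classifier sees a different resampled training set, so the vote averages out the quirks of any single one. Random forests and boosting build on the same idea with different base learners and weighting schemes.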
Project # 6 Group members:
Ngo, Jameson
Xu, Amy
Title: Kaggle Challenge: Human Protein Atlas Image Classification
Description:
We will participate in the Human Protein Atlas Image Classification competition featured on Kaggle. We will classify proteins based on patterns seen in microscopic images of human cells.
Historically, methods for classifying proteins have been limited to single patterns in very few cell types at a time. The goal of this challenge is to develop methods that classify proteins based on multiple or mixed patterns across a larger range of cell types.
Project # 7 Group members:
Qianying Zhao
Hui Huang
Meiyu Zhou
Gezhou Zhang
Title: Google Analytics Customer Revenue Prediction
Description: Our group will participate in the featured Kaggle competition Google Analytics Customer Revenue Prediction. In this competition, we will analyze a customer dataset from a Google Merchandise Store, which sells Google-branded merchandise, to predict revenue per customer using RStudio. Our presentation and report will cover not only how we reached our conclusions by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.
Project # 8 Group members:
Jiayue Zhang
Lingyun Yi
Rongrong Su
Siao Chen
Title: Kaggle--Two Sigma: Using News to Predict Stock Movements
Description:
Stock prices are affected by the news to some extent. How does the news influence stock prices, and how much predictive power does the news have?
We will use the content of news articles to predict the trend of stock prices. We will mine the data to find the useful information hidden in this large dataset. As a result, we will predict how stock prices perform when the market reacts to news.
Project # 9 Group members:
Hassan, Ahmad Nayar
McLellan, Isaac
Brewster, Kristi
Melek, Marina Medhat Rassmi
Title: Quick, Draw! Doodle Recognition
Description:
Background
Google’s Quick, Draw! is an online game where a user is prompted to draw an image depicting a certain category in under 20 seconds. As the drawing is being completed, the game uses a model that attempts to correctly identify the image being drawn. With the aim of improving the underlying pattern recognition model this game uses, Google is hosting a Kaggle competition asking the public to build a model to correctly identify a given drawing. The model should classify the drawing into one of the 340 label categories within the Quick, Draw! game in three guesses or fewer.
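The "three guesses" criterion described above can be sketched as follows. The score dictionary and the function names are hypothetical, and the competition's official scoring rule is not given here, so this only checks whether the true label appears among the top three guesses.

```python
def top_k_labels(scores, k=3):
    """Return the k labels with the highest scores, best guess first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def is_hit(scores, true_label, k=3):
    """True if the true label is among the model's top-k guesses."""
    return true_label in top_k_labels(scores, k)


# Hypothetical model scores for one drawing (four labels instead of 340).
scores = {"cat": 0.40, "dog": 0.25, "bird": 0.20, "car": 0.15}
guesses = top_k_labels(scores)
```

Here a drawing of a bird would count as correctly identified because "bird" is the third guess, while a drawing of a car would not.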
Proposed Approach
Each image/doodle (input) is treated as a matrix of pixel values. To classify images, we transform each image’s matrix of pixel values using convolution: small filters slide over the image to produce feature maps. Combined with pooling layers, this reduces the dimensionality of the input significantly, which in turn reduces the number of parameters of any proposed recognition model. After the filters, pooling layers and further convolutions, a final layer called the fully connected layer relates the extracted features to the categories, assigning a probability to each category and hence classifying the image.
This approach to image classification is called a convolutional neural network (CNN), and we propose using it to classify the doodles within the Quick, Draw! dataset.
To control overfitting and underfitting of our proposed model and to minimize the error, we will try different architectures consisting of different types and dimensions of pooling layers and input filters.
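The convolution and pooling steps described above can be sketched in a few lines of plain Python. This is only an illustration of how the two operations shrink the input: the 6x6 toy image, the 3x3 vertical-edge filter, and the function names are assumptions, and a real model would use a deep learning library with learned filters.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most CNN libraries, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 6x6 "doodle": a single vertical stroke in column 2 (assumption).
image = [[1 if c == 2 else 0 for c in range(6)] for r in range(6)]
# Hypothetical hand-picked filter that responds to vertical edges.
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

features = conv2d_valid(image, kernel)  # 6x6 input -> 4x4 feature map
pooled = max_pool(features)             # 4x4 feature map -> 2x2 summary
```

The 36 input pixels are reduced to 4 pooled values, which is the kind of dimensionality reduction that keeps the fully connected layer small.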
Challenges
This project presents a number of interesting challenges:
- The data given for training is noisy in that it contains drawings that are incomplete or simply poorly drawn. Dealing with this noise will be a significant part of our work.
- There are 340 label categories within the Quick, Draw! dataset. This means that the model must be able to classify drawings based on a large pool of information while making effective use of powerful computational resources.
Tools & Resources
- We will use Python & MATLAB.
- We will use the Quick, Draw! Dataset available on the Kaggle competition website. <https://www.kaggle.com/c/quickdraw-doodle-recognition/data>
Project # 10 Group members:
Lam, Amanda
Huang, Xiaoran
Chu, Qi
Sang, Di
Title: Kaggle Competition: Quick, Draw! Doodle Recognition Challenge
Description:
Project # 11 Group members:
Bobichon, Philomene
Maheshwari, Aditya
An, Zepeng
Stranc, Colin
Title: Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge
Description:
Project # 12 Group members:
Huo, Qingxi
Yang, Yanmin
Cai, Yuanjing
Wang, Jiaqi
Title:
Description:
Project # 13 Group members:
Ross, Brendan
Barenboim, Jon
Lin, Junqiao
Bootsma, James
Title: Expanding Neural Network
Description: The goal of our project is to create an expanding neural network algorithm that starts by training a small neural network and then expands it into a larger one. We hypothesize that with the proper expansion method we could decrease training time and prevent overfitting. The method we wish to explore is to link together input dimensions based on covariance. Then, when the small network reaches convergence, we create a larger neural network without the links between dimensions, using the values from the smaller network as starting values.
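One possible reading of the expansion step is sketched below for a single linear unit. It assumes that "linking" input dimensions means they share one weight, and that expansion copies each shared weight to every input in its group. The grouping itself is hard-coded here; how covariance would determine the groups is not shown, and all names are hypothetical.

```python
def small_forward(x, groups, tied_weights, bias):
    """Linear unit in which all inputs of a group share a single weight."""
    return bias + sum(
        tied_weights[g] * sum(x[i] for i in idxs)
        for g, idxs in enumerate(groups)
    )


def expand_weights(groups, tied_weights, n_inputs):
    """Give every input its own weight, initialised from its group's weight."""
    weights = [0.0] * n_inputs
    for g, idxs in enumerate(groups):
        for i in idxs:
            weights[i] = tied_weights[g]
    return weights


def large_forward(x, weights, bias):
    """Ordinary linear unit with one free weight per input."""
    return bias + sum(w * v for w, v in zip(weights, x))


# Hypothetical grouping: inputs 0 and 2 are linked, as are inputs 1 and 3.
groups = [[0, 2], [1, 3]]
tied = [0.5, -1.0]  # weights assumed to come from the converged small model
bias = 0.1
x = [1.0, 2.0, 3.0, 4.0]

expanded = expand_weights(groups, tied, n_inputs=4)
small_out = small_forward(x, groups, tied, bias)
large_out = large_forward(x, expanded, bias)
```

Under these assumptions the larger unit starts out computing exactly the same output as the small one, so further training continues from the converged solution rather than from scratch.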
Project # 14 Group members:
Schneider, Jason
Walton, Jordyn
Abbas, Zahraa
Na, Andrew
Title: Application of ML Classification to Cancer Identification
Description: The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.
One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (each of which represents a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this, including the use of PCA, leave-one-out cross validation, or regularization.
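Leave-one-out cross validation, one of the mitigations listed above, can be sketched as follows. The nearest-centroid classifier and the six two-feature samples are assumptions made for illustration; they are not the Golub et al. data or the team's chosen model.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean feature vector is closest to x."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum(
            (total / counts[label] - value) ** 2
            for total, value in zip(sums[label], x)
        )

    return min(sums, key=squared_distance)


def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up toy samples with two features each (not real gene expression data).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["healthy"] * 3 + ["cancer"] * 3
accuracy = leave_one_out_accuracy(X, y)
```

With only around twenty samples, every sample is used for training in all but one fold, which is why this scheme suits very small datasets despite requiring one model fit per sample.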
Project # 15 Group members:
Praneeth, Sai
Peng, Xudong
Li, Alice
Vajargah, Shahrzad
Title: Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition
Description: Which airline cabin class is the most profitable? One might guess economy, but in reality it is the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than economy class, premium classes end up driving more revenue than other classes.
In fact, just like airlines, many companies adopt a business model in which the vast majority of revenue is derived from a minority of customers. As a result, data-intensive promotional strategies are receiving more and more attention from marketing teams seeking to further improve company returns.
In this Kaggle competition, we are challenged to analyze a Google Merchandise Store's customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.
Reference:
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes
Project # 16 Group members:
Wang, Yu Hao
Grant, Aden
McMurray, Andrew
Song, Baizhi
Title: Two Sigma: Using News to Predict Stock Movements - A Kaggle Competition
Description: By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world.
Data for this competition comes from the following sources:
Market data provided by Intrinio.
News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.