F18-STAT841-Proposal (statwiki, wiki.math.uwaterloo.ca; revision of 2018-12-13 by user Y7chi)
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge, a Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in the lungs, with symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous, especially to infants and the elderly: in 2015, 920,000 children under the age of 5 died from the disease [2]. Because of this, timely diagnosis of pneumonia is a high priority. A common diagnostic method is to obtain a patient’s chest radiograph (CXR), a gray-scale x-ray image of the chest. Regions infected by pneumonia usually appear as one or more areas of increased opacity on the CXR [3]. However, many other factors can also increase opacity on a CXR, which makes the diagnosis very challenging. Diagnosis also requires highly skilled clinicians and a great deal of CXR screening time. The Radiological Society of North America (RSNA®) sees an opportunity to use machine learning to accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
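As a minimal sketch of one such baseline, the snippet below fits a logistic regression on flattened gray-scale images. The random arrays are stand-ins for real CXR data, and the scikit-learn pipeline is an assumption for illustration, not part of the proposal.<br />

```python
# Flatten fake grayscale "radiograph" patches into feature vectors and
# fit a logistic-regression classifier. All data here are synthetic
# stand-ins; a real pipeline would load the challenge's DICOM images.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, side = 200, 16
X = rng.random((n, side * side))          # flattened 16x16 fake images
y = (X.mean(axis=1) > 0.5).astype(int)    # toy "increased opacity" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The other listed models (random forest, SVM, kNN, CNN) would slot into the same train/evaluate loop for comparison.<br />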
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from Kaggle. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy: hand-drawn doodles may deviate substantially from the actual object, and almost certainly differ from person to person. Mislabeled images also present a problem, since they create outliers when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Exploring Kuzushiji-MNIST, a new classification benchmark<br />
<br />
'''Description:''' <br />
<br />
The paper ''Deep Learning for Classical Japanese Literature'' presents a new classification dataset intended to act as a drop-in replacement for MNIST. The authors believe this dataset is significantly more difficult than MNIST for typical classification methods, while not "capping" performance due to indiscernible objects, as Fashion-MNIST might. <br />
Goals are to: <br />
<br />
- perform a survey of typical machine-learning algorithms on Kuzushiji-MNIST, compared to both MNIST and Fashion-MNIST<br />
<br />
- investigate relevant differences in the structures of the datasets<br />
<br />
- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Pei Wei, Wang<br />
<br />
Daoyi Chen<br />
<br />
Yiming Li<br />
<br />
Ying Chi<br />
<br />
'''Title:''' Kaggle Challenge: Airbus Ship Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Image segmentation is now widely used in many fields, such as medical diagnosis, autonomous driving, and satellite image analysis. Our project is the Kaggle competition Airbus Ship Detection Challenge, which aims to detect and locate ships in satellite images and put an aligned bounding box around each ship we find. In addition, Airbus is interested in improving detection speed, evaluated by inference time on over 40,000 image chips.<br />
<br />
The goal of our project is to construct one or more models that can accurately segment ships in new images. We also need to balance accuracy against speed, given the time limit.<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/PLAsTiCC-2018 PLAsTiCC Astronomical Classification ]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle, exploring how feasible it is to classify astronomical objects based on factors such as brightness.<br />
<br />
These objects vary in time and size, and some are of unknown type. There are over 100 classes that an object may belong to, and our job is to predict, for each object, the probability that it belongs to each class.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Quora Insincere Questions Classification<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition Quora Insincere Questions Classification. For this competition, we must predict whether a question asked on Quora is sincere or not. An insincere question is intended as a statement rather than a genuine request for useful answers, and is labeled (target = 1). <br />
We will analyze the Quora question text in RStudio to predict the characteristics of questions and determine whether they are sincere or insincere. Our presentation report will cover not only the conclusions we reach by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Telecom Customer Churn Prediction<br />
<br />
<br />
'''Description:''' <br />
The traditional telecommunications industry is made up of telecommunication companies and internet service providers, which play an important role in daily life. It is crucial for telecommunication companies to analyze and maintain their relationships with existing customers, as well as to win new customers with marketing strategies. However, it costs five times as much to attract a new customer as to keep an existing one. Therefore, retaining existing customers and building loyal relationships are key concerns for traditional telecommunication companies that want to stay strong in the competition. This project aims to give telecom companies insight by predicting the chance of a customer leaving the company. We will apply different classification models, such as random forest, gradient boosting, logistic regression, and XGBoost, and then compare each model's performance. <br />
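The planned comparison could be sketched as follows. The dataset is synthetic; a real churn table (tenure, charges, contract type, and so on) would replace it, and XGBoost is omitted here only because it is a separate third-party package.<br />

```python
# Compare several scikit-learn classifiers with 5-fold cross-validation
# on a synthetic stand-in for a churn dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
cv_scores = {name: cross_val_score(m, X, y, cv=5).mean()
             for name, m in models.items()}
```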
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Brewster, Kristi<br />
<br />
McLellan, Isaac<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quora Insincere Questions Classification: Detect toxic content to improve online conversations<br />
<br />
'''Description:'''<br />
<br />
This is a Kaggle Competition.<br />
<br />
Quora is an online question-and-answer platform with content created by its community of users. Quora prides itself on being a place where users can gain and share knowledge and feel safe doing so. In order to have a safe community, they need to eliminate what they term "insincere" questions.<br />
This competition asks Kagglers to develop models that will flag these types of questions, given a list of both insincere and sincere questions.<br />
<br />
We intend to use Python and its wide variety of packages as we aim to classify these questions.<br />
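One hypothetical shape such a Python pipeline could take is TF-IDF features plus a linear classifier; the six questions and their labels below are invented for illustration, since the real competition provides labelled training data.<br />

```python
# Toy text-classification pipeline: TF-IDF features into logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "How do I learn linear algebra?",
    "What is a good book on statistics?",
    "Why is learning Python useful?",
    "Why are all politicians liars?",
    "Do you agree that everyone from that city is stupid?",
    "Why are people who like that band so dumb?",
]
labels = [0, 0, 0, 1, 1, 1]  # 1 = insincere, following the competition's target

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(questions, labels)
pred = clf.predict(["Why are those people so dumb?"])
```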
<br />
'''Reference:'''<br />
[1] Kaggle. (2018, Nov 18). Quora Insincere Questions Classification. [https://www.kaggle.com/c/quora-insincere-questions-classification]<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Human Protein Atlas Image Classification<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm, which starts by training a small neural network and then expands it into a larger one. We hypothesize that, with a proper expansion method, we can decrease training time and prevent overfitting. The method we wish to explore is to link together input dimensions based on covariance. Then, when the small neural network reaches convergence, we create a larger neural network without the links between dimensions, using starting values taken from the smaller network. <br />
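The expansion step alone (not the covariance-based linking of input dimensions) could look like this NumPy sketch, assuming a one-hidden-layer network: new hidden units get random incoming weights but zero outgoing weights, so the expanded network initially computes exactly the same function as the converged small one.<br />

```python
# Grow the hidden layer of a one-hidden-layer network while preserving
# its function at initialisation.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2):
    """One-hidden-layer network with tanh activation (biases omitted)."""
    return np.tanh(x @ W1) @ W2

d_in, h_small, h_big, d_out = 4, 3, 8, 2
W1 = rng.normal(size=(d_in, h_small))   # "trained" small-network weights
W2 = rng.normal(size=(h_small, d_out))

# Expand: copy the old weights, append new units with zero outgoing weights.
W1_big = np.concatenate([W1, rng.normal(size=(d_in, h_big - h_small))], axis=1)
W2_big = np.concatenate([W2, np.zeros((h_big - h_small, d_out))], axis=0)

x = rng.normal(size=(5, d_in))
same = np.allclose(forward(x, W1, W2), forward(x, W1_big, W2_big))  # True
```

Training would then resume from these starting values, letting the new units' outgoing weights grow away from zero.<br />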
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way the data were collected. Some parts of the dataset have thousands of features (each representing a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this, including the use of PCA, leave-one-out cross-validation, or regularization. <br />
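As a sketch of the PCA option combined with leave-one-out cross-validation, the snippet below uses synthetic data standing in for the Golub et al. gene-expression matrix: 20 samples, 1000 features, with a strong signal planted in the first 50 "genes".<br />

```python
# PCA down to a few components, then leave-one-out CV with logistic
# regression, on a synthetic few-samples / many-features dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 1000
y = np.array([0, 1] * (n_samples // 2))
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :50] += 3.0  # planted class signal in the first 50 "genes"

model = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
```

Fitting PCA inside the pipeline keeps each held-out sample out of the projection, avoiding the leakage that refitting PCA on all 20 samples would introduce.<br />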
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which airline cabin class is the most profitable? One might guess economy, but in reality it's the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than the other classes.<br />
<br />
In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. We will implement a series of data analytics methods, including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction - A Kaggle Competition<br />
<br />
'''Description:''' <br />
<br />
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.<br />
<br />
<br />
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.<br />
<br />
In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:''' We analyze Happywhale’s database of over 25,000 whale images, gathered from research institutions and public contributors, to identify each whale from an image of its tail.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
Mall, Sunil<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
'''Description:''' Our team presents the Unsupervised Lexicon-Based Sentiment Topic Model (ULSTM), a sentiment analysis model for reviews on the popular crowd-sourced review forum Yelp. The model uses unsupervised learning, since supervised methods come with many constraints. Furthermore, instead of employing an existing sentiment lexicon, we develop a sentiment dictionary using the linguistic corpus WordNet; the self-defined lexicon allows scoring that is better targeted to the evaluated dataset. Finally, the ULSTM adopts the Latent Dirichlet Allocation model to find the most-mentioned topics in the reviews of individual businesses.<br />
<br />
'''Dataset''': Yelp Review Dataset from Kaggle<br />
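The lexicon scoring step might look like the following toy sketch, where a four-entry dictionary stands in for the WordNet-derived lexicon; the LDA topic step is not shown.<br />

```python
# Score reviews by summing per-word sentiment values from a tiny
# stand-in lexicon (unknown words contribute 0).
lexicon = {"great": 1.0, "tasty": 0.8, "slow": -0.6, "awful": -1.0}

def sentiment_score(review):
    """Sum lexicon scores over a review's lowercased words."""
    return sum(lexicon.get(word, 0.0) for word in review.lower().split())

reviews = ["Great tasty food", "Awful and slow service"]
scores = [sentiment_score(r) for r in reviews]  # positive, then negative
```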
----------------------------------------------------------------------<br />
'''Project # 20'''<br />
Group Members:<br />
<br />
Dong, Yongqi (Michael)<br />
<br />
Kingston, Stephen<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements <br />
<br />
'''Description:''' The movement in the price of a tradeable security, or stock, on any given day is an aggregation of each individual market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of the macro- and micro-economic variables that affect a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 21'''<br />
Group members:<br />
<br />
Xiao, Alexandre<br />
<br />
Zhang, Richard<br />
<br />
Ash, Hudson<br />
<br />
Zhu, Ziqiu<br />
<br />
'''Title:''' Image Segmentation with Capsule Networks using CRF loss<br />
<br />
'''Description:''' Investigate the impact of changing the loss function/regularizers on image segmentation tasks with capsule networks.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 22'''<br />
Group Members:<br />
<br />
Lee, Yu Xuan<br />
<br />
Heng, Tsen Yee<br />
<br />
'''Title:''' Wine Rating Prediction<br />
<br />
'''Description:''' We predict the rating of bottles of wine with the help of machine learning. Using the variables from a wine-review dataset we found on Kaggle, we show that rating points, price, and year of production are crucial in determining the value of a bottle of wine. The formula for the price increase per rating point comes from www.vivino.com. From this information, we can determine which wines are worth buying!<br />
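The Vivino formula itself is not reproduced in the proposal, so the sketch below uses a least-squares slope of price against rating points as a hypothetical stand-in for "price increase per point"; all numbers are invented.<br />

```python
# Fit a line price = slope * points + intercept; the slope is the
# estimated extra price per additional rating point.
import numpy as np

points = np.array([85, 88, 90, 92, 95], dtype=float)  # hypothetical ratings
price = np.array([12.0, 18.0, 25.0, 35.0, 60.0])      # hypothetical prices

slope, intercept = np.polyfit(points, price, 1)
```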
<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 23'''<br />
Group Members:<br />
<br />
Bayati, Mahdiyeh<br />
<br />
Malek Mohammadi, Saber<br />
<br />
Luong, Vincent<br />
<br />
<br />
'''Title:''' Human Protein Atlas Image Classification<br />
<br />
<br />
'''Description:''' The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 24'''<br />
Group Members:<br />
<br />
Wu Yutong, <br />
<br />
Wang Shuyue,<br />
<br />
Jiao Yan<br />
<br />
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification<br />
<br />
'''Description:''' Quora is a question-and-answer website where users can ask questions and share opinions. For the company, one key challenge is to identify insincere questions, defined as those founded upon false premises, or intended to make a statement rather than look for helpful answers. This report is about classifying Quora questions as "sincere" or "insincere". The data used in this project were prepared by Quora and can be found on the Kaggle website. We tried a Bi-GRU and a Capsule Network model, along with a blend of LSTM and CNN models. Experiments demonstrated that they have similar performance.</div>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in human lungs which has symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous especially to infants and elders. In 2015, 920,000 children under the age of 5 died from this disease [2]. Due to its fatality to children, diagnosing pneumonia has a high order. A common method of diagnosing pneumonia is to obtain patients’ chest radiograph (CXR) which is a gray-scale scan image of patients’ chests using x-ray. The infected region due to pneumonia usually shows as an area or areas of increased opacity [3] on CXR. However, many other factors can also contribute to increase in opacity on CXR which makes the diagnose very challenging. The diagnose also requires highly-skilled clinicians and a lot of time of CXR screening. The Radiological Society of North America (RSNA®) sees the opportunity of using machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project remains in the training set being very noisy. Hand-drawn artwork may deviate substantially from the actual object, and is almost definitively different from person to person. Mislabeled images also present a problem since they will create outlier points when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Exploring Kuzushiji-MNIST, a new classification benchmark<br />
<br />
'''Description:''' <br />
<br />
The paper *Deep Learning for Classical Japanese Literature* presents a new classification dataset intended to act as a drop-in replacement for MNIST. The paper authors believe that this dataset is significantly more difficult that MNIST for typical classification methods, while not "capping" performance due to indiscernible objects like Fashion-MNIST might. <br />
Goals are to: <br />
<br />
- perform survey of typical machine-learning algorithms on Kuzushiji-MNIST compared to both MNIST and Fashion-MNIST<br />
<br />
- investigate relevant differences in the structures of the datasets<br />
<br />
- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Pei Wei, Wang<br />
<br />
Daoyi Chen<br />
<br />
Yiming Li<br />
<br />
Ying Chi<br />
<br />
'''Title:''' Kaggle Challenge: Airbus Ship Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become a more and more eye-catching, especially with the rise of machine learning in these years. Our team is particularly interested in machine learning algorithms that optimize some specific type image classification. <br />
<br />
In this project, we will dig into base classifiers we learnt from the class and try to cook them together to find an optimal solution for a certain type images dataset. Currently, we are looking into a dataset from Kaggle: Airbus Ship Detection Challenge.<br />
<br />
For us, as machine learning students, we are more eager to help getting a better classification method. By “better”, we mean find a balance between simplify and accuracy. We will start with neural network via different activation functions in each layer and we will also combine base classifiers with bagging, random forest, boosting for ensemble learning. Also, we will try to regulate our parameters to avoid overfitting in training dataset. Last, we will summary features of this type image dataset, formulate our solutions and standardize our steps to solve this kind problems <br />
<br />
Hopefully, we can not only finish our project successfully, but also make a little contribution to machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/PLAsTiCC-2018 PLAsTiCC Astronomical Classification ]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle. We will explore how possible it is classify astronomical bodies based on various factors such as brightness.<br />
<br />
These bodies will vary in time and size. Some are unknown! There are over 100 classes that these bodies may be and it will be our job to find the predicted probability for an image to be each class.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Quora Insincere Questions Classification<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition of Quora Insincere Questions Classification. For this competition, we should predict wether a question asked on Quora is sincere or not. If the question is insincere, it intends to be a statement rather than look for useful answers, and identified as (target = 1). <br />
We will analyze the Quora question text to predict the characteristics of questions and define they are sincere or insincere using Rstudio. Our presentation report will include not only how we've concluded by classifying and analyzing provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Telecom Customer Churn Prediction<br />
<br />
<br />
'''Description:''' <br />
Traditional telecommunication industry is made up of telecommunication companies and internet service providers, which play important role in daily life. It is crucial for the telecommunication companies to analyze and maintain their relationship with existing customers, as well as winning new customers with marketing strategies. However, it costs 5 times as much to attract a new customer than to keep an existing one. Therefore, retaining existing customers and building a loyal relationship are the key concerns for traditional telecommunication companies to stay strong in the competition. This project aims to provide insights for the telecom companies in predicting the chance of a customer leaving the company. We will be applying different classification models such as Random Forest, Gradient boosting, Logistic Regression and XGBoost, and then compare each model's performance. <br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Brewster, Kristi<br />
<br />
McLellan, Isaac<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quora Insincere Questions Classification: Detect toxic content to improve online conversations<br />
<br />
'''Description:'''<br />
<br />
This is a Kaggle Competition.<br />
<br />
Quora is an online question-and-answer platform with content created by its community of users. Quora prides itself on being a place where users can gain and share knowledge and feel safe doing so. In order to have a safe community, they need to eliminate what they term "insincere" questions.<br />
This competition asks Kagglers to develop models that will flag these types of questions, given a list of both insincere and sincere questions.<br />
<br />
We intend to use Python and its wide variety of packages as we aim to classify these questions.<br />
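As a toy illustration of such a text classifier, here is a minimal naive Bayes sketch in pure Python (the example questions are invented stand-ins, not real Quora data):<br />

```python
from collections import Counter
import math

# Tiny hand-made corpus standing in for the Quora data (1 = insincere).
train = [
    ("how do I learn python quickly", 0),
    ("what is the best way to study math", 0),
    ("why are all politicians corrupt liars", 1),
    ("why are people from that city so stupid", 1),
    ("how can I improve my writing", 0),
    ("why is everyone on this site so dumb", 1),
]

def tokenize(text):
    return text.lower().split()

# Per-class word counts, for add-one (Laplace) smoothed probabilities.
counts = {0: Counter(), 1: Counter()}
class_totals = Counter()
for text, label in train:
    counts[label].update(tokenize(text))
    class_totals[label] += 1

vocab = set(counts[0]) | set(counts[1])

def predict(text):
    # Pick the class with the highest log-posterior under naive Bayes.
    scores = {}
    for label in (0, 1):
        total_words = sum(counts[label].values())
        score = math.log(class_totals[label] / len(train))
        for w in tokenize(text):
            score += math.log((counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("why are these people so dumb"))   # -> 1 (insincere)
print(predict("what is a good way to learn"))    # -> 0 (sincere)
```

A real entry would replace this with richer features (e.g. word embeddings) and a stronger model, but the scoring loop is the same idea.<br />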
<br />
'''Reference:'''<br />
[1] Kaggle. (2018, Nov 18). Quora Insincere Questions Classification. [https://www.kaggle.com/c/quora-insincere-questions-classification]<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Human Protein Atlas Image Classification<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm, which starts by training a small neural network and then expands it into a larger one. We hypothesize that with the proper expansion method we could decrease training time and prevent overfitting. The method we wish to explore is to link together input dimensions based on covariance. Then, when the neural network reaches convergence, we create a larger neural network without the links between dimensions, using starting values from the smaller neural network. <br />
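A minimal NumPy sketch of one possible function-preserving expansion (an illustrative construction, not necessarily the project's final method): new hidden units receive zero outgoing weights, so the widened network initially computes exactly the same function as the small one.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(params, X):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return h @ W2 + b2

# A small "trained" (here: randomly initialised) network with 4 hidden units.
d_in, h_small, h_big = 3, 4, 8
small = (rng.normal(size=(d_in, h_small)), rng.normal(size=h_small),
         rng.normal(size=(h_small, 1)), rng.normal(size=1))

def expand(params, new_width):
    """Widen the hidden layer: new units get small random incoming weights
    and ZERO outgoing weights, so the expanded net computes the same function."""
    W1, b1, W2, b2 = params
    extra = new_width - W1.shape[1]
    W1e = np.hstack([W1, 0.01 * rng.normal(size=(W1.shape[0], extra))])
    b1e = np.concatenate([b1, np.zeros(extra)])
    W2e = np.vstack([W2, np.zeros((extra, W2.shape[1]))])
    return W1e, b1e, W2e, b2

big = expand(small, h_big)

X = rng.normal(size=(5, d_in))
same = bool(np.allclose(forward(small, X), forward(big, X)))
print("expanded network preserves outputs:", same)
```

Training would then resume from these starting values, letting the new units depart from zero.<br />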
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (which each represent a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this; including the use of PCA, leave-one-out cross validation, or regularization. <br />
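As a toy illustration of combining PCA with leave-one-out cross-validation on synthetic p >> n data (not the Golub dataset; class separation and sizes are invented):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for gene-expression data: 20 samples, 1000 features,
# with two well-separated classes.
n, p = 20, 1000
y = np.array([0] * 10 + [1] * 10)
means = np.where(y[:, None] == 1, 2.0, -2.0)  # class-dependent mean shift
X = means + rng.normal(size=(n, p))

def pca_fit_transform(X_train, X_test, k):
    mu = X_train.mean(axis=0)
    # SVD of the centred training data gives the principal directions.
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return (X_train - mu) @ Vt[:k].T, (X_test - mu) @ Vt[:k].T

def nearest_centroid(Xtr, ytr, xte):
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return int(np.linalg.norm(xte - c1) < np.linalg.norm(xte - c0))

# Leave-one-out cross-validation: PCA is refit inside each fold so the
# held-out sample never leaks into the projection.
correct = 0
for i in range(n):
    mask = np.arange(n) != i
    Ztr, Zte = pca_fit_transform(X[mask], X[i:i + 1], k=3)
    correct += nearest_centroid(Ztr, y[mask], Zte[0]) == y[i]

accuracy = correct / n
print(f"LOOCV accuracy with PCA(k=3): {accuracy:.2f}")
```

Refitting the dimension reduction inside each fold is the key point when samples are this scarce.<br />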
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which airline cabin class is the most profitable? One might guess economy - but in reality, it's the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy cabin, premium classes end up driving more revenue than the other classes.<br />
<br />
In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. We will apply a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction - A Kaggle Competition<br />
<br />
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.<br />
<br />
<br />
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.<br />
<br />
In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:''' We analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors, to identify individual whales from images of their tails.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
Mall, Sunil<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
'''Description:''' Our team presents the Unsupervised Lexicon-Based Sentiment Topic Model (ULSTM), a sentiment analysis model for reviews on the popular crowd-sourced review forum Yelp. The model applies unsupervised learning, since the supervised approach has many constraints. Furthermore, instead of employing an existing sentiment lexicon, we develop a sentiment dictionary using the linguistic corpus WordNet; the self-defined lexicon allows more targeted scoring of the evaluated dataset. Finally, the ULSTM adopts the Latent Dirichlet Allocation model to find the most-mentioned topics in the reviews for individual businesses.<br />
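As a toy illustration of the lexicon-scoring component (a tiny invented lexicon and invented reviews in place of the WordNet-derived dictionary and the Yelp data):<br />

```python
# Toy sentiment lexicon: word -> polarity score (all values invented).
lexicon = {
    "great": 1.0, "delicious": 1.0, "friendly": 0.8, "love": 1.0,
    "terrible": -1.0, "rude": -0.9, "slow": -0.5, "bland": -0.7,
}
negators = {"not", "never", "no"}

def sentiment(review):
    """Sum lexicon scores; a negator flips the sign of the next scored word."""
    score, flip = 0.0, 1.0
    for word in review.lower().split():
        if word in negators:
            flip = -1.0
        elif word in lexicon:
            score += flip * lexicon[word]
            flip = 1.0
    return score

reviews = [
    "great food and friendly staff",
    "the service was terrible and the soup was bland",
    "not great but the staff was friendly",
]
for r in reviews:
    label = "positive" if sentiment(r) > 0 else "negative"
    print(f"{label:8s} {sentiment(r):+.1f}  {r}")
```

The full model would derive the word scores from WordNet and pair this scorer with LDA-discovered topics rather than whole reviews.<br />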
<br />
'''Dataset''': Yelp Review Dataset from Kaggle<br />
----------------------------------------------------------------------<br />
'''Project # 20'''<br />
Group Members:<br />
<br />
Dong, Yongqi (Michael)<br />
<br />
Kingston, Stephen<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements <br />
<br />
'''Description:''' The movement in price of a tradeable security, or stock, on any given day is an aggregation of each individual market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of the macro- and micro-economic variables that affect a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 21'''<br />
Group members:<br />
<br />
Xiao, Alexandre<br />
<br />
Zhang, Richard<br />
<br />
Ash, Hudson<br />
<br />
Zhu, Ziqiu<br />
<br />
'''Title:''' Image Segmentation with Capsule Networks using CRF loss<br />
<br />
'''Description:''' Investigate the impact in changing loss function/regularizers on image segmentation tasks with capsule networks.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 22'''<br />
Group Members:<br />
<br />
Lee, Yu Xuan<br />
<br />
Heng, Tsen Yee<br />
<br />
'''Title:''' Wine Rating Prediction<br />
<br />
'''Description:''' We predict the ratings of bottles of wine with the help of machine learning. Using the variables from the wine-review dataset we found on Kaggle, we are able to show that points, price, and the wine's year of production are crucial in determining the value of a bottle of wine. The formula for the price increase per rating point comes from www.vivino.com. From the information we have, we are able to determine which wines are worth buying!<br />
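A minimal sketch of a rating model of this kind, fit by ordinary least squares on invented data (the feature names, coefficients, and linear form are assumptions for illustration, not values from the Kaggle dataset or vivino.com):<br />

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical wine data: price (USD) and vintage year.
n = 100
price = rng.uniform(10, 100, n)
year = rng.integers(1990, 2018, n).astype(float)
# Assume (for illustration only) the rating depends linearly on both, plus noise.
points = 80 + 0.1 * price + 0.2 * (year - 1990) + rng.normal(0, 0.5, n)

# Ordinary least squares: points ~ intercept + price + (year - 1990).
A = np.column_stack([np.ones(n), price, year - 1990])
coef, *_ = np.linalg.lstsq(A, points, rcond=None)

print(f"intercept={coef[0]:.2f}, per-dollar={coef[1]:.3f}, per-year={coef[2]:.3f}")
```

Recovering the planted coefficients confirms the fitting loop works before pointing it at the real review data.<br />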
<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 23'''<br />
Group Members:<br />
<br />
Bayati, Mahdiyeh<br />
<br />
Malek Mohammadi, Saber<br />
<br />
Luong, Vincent<br />
<br />
<br />
'''Title:''' Human Protein Atlas Image Classification<br />
<br />
<br />
'''Description:''' The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 24'''<br />
Group Members:<br />
<br />
Wu Yutong, <br />
<br />
Wang Shuyue,<br />
<br />
Jiao Yan<br />
<br />
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification<br />
<br />
'''Description:''' Quora is a question-and-answer website where users can ask questions and share opinions. For the company, one key challenge is to identify insincere questions, which are defined as those founded upon false premises, or those that intend to make a statement rather than look for helpful answers. This report is about classifying Quora questions into "Sincere" and "Insincere". The data used in this project was prepared by Quora and can be found on the Kaggle website. We tried a Bi-GRU model and a Capsule Network model, along with a blend of LSTM and CNN models. Experiments demonstrated that they have similar performance.</div>Y7chihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=42363F18-STAT841-Proposal2018-12-09T22:26:54Z<p>Y7chi: </p>
<hr />
<div><br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in the human lungs, with symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous, especially to infants and the elderly. In 2015, 920,000 children under the age of 5 died from this disease [2]. Given its lethality to children, prompt diagnosis of pneumonia is a high priority. A common method of diagnosing pneumonia is to obtain the patient’s chest radiograph (CXR), a gray-scale image of the patient’s chest produced using x-rays. The region infected by pneumonia usually shows as an area or areas of increased opacity [3] on the CXR. However, many other factors can also contribute to increased opacity on a CXR, which makes the diagnosis very challenging. The diagnosis also requires highly skilled clinicians and a great deal of CXR screening time. The Radiological Society of North America (RSNA®) sees the opportunity to use machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
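As a toy illustration of one such classification pipeline (purely synthetic images standing in for CXRs, with a bright patch playing the role of increased opacity; k-nearest neighbours on hand-crafted features):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 16x16 "scans": class 1 contains a bright 4x4 patch of increased
# opacity at a random position; class 0 is background noise only.
def make_image(has_opacity):
    img = rng.normal(0.0, 0.2, size=(16, 16))
    if has_opacity:
        r, c = rng.integers(0, 12, size=2)
        img[r:r + 4, c:c + 4] += 1.5
    return img

def features(img):
    # Hand-crafted features: overall brightness and peak brightness,
    # both insensitive to where the opaque patch appears.
    return np.array([img.mean(), img.max()])

X_train = np.array([features(make_image(i % 2 == 1)) for i in range(40)])
y_train = np.array([i % 2 for i in range(40)])
X_test = np.array([features(make_image(i % 2 == 1)) for i in range(10)])
y_test = np.array([i % 2 for i in range(10)])

def knn_predict(x, k=5):
    # Majority vote among the k nearest training points.
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return int(nearest.sum() > k // 2)

accuracy = float(np.mean([knn_predict(x) == t for x, t in zip(X_test, y_test)]))
print(f"toy k-NN accuracy: {accuracy:.2f}")
```

Real CXRs would of course need learned features (e.g. a CNN) rather than two summary statistics, but the train/evaluate structure carries over.<br />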
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object, and almost certainly differs from person to person. Mislabeled images also present a problem, since they create outliers when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Exploring Kuzushiji-MNIST, a new classification benchmark<br />
<br />
'''Description:''' <br />
<br />
The paper *Deep Learning for Classical Japanese Literature* presents a new classification dataset intended to act as a drop-in replacement for MNIST. The paper's authors believe that this dataset is significantly more difficult than MNIST for typical classification methods, while not "capping" performance through indiscernible objects the way Fashion-MNIST might. <br />
Goals are to: <br />
<br />
- perform survey of typical machine-learning algorithms on Kuzushiji-MNIST compared to both MNIST and Fashion-MNIST<br />
<br />
- investigate relevant differences in the structures of the datasets<br />
<br />
- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Pei Wei, Wang<br />
<br />
Daoyi, Chen<br />
<br />
Yiming, Li<br />
<br />
Ying, Chi<br />
<br />
'''Title:''' Kaggle Challenge: Airbus Ship Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become a more and more eye-catching topic, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms that optimize image classification for specific types of images. <br />
<br />
In this project, we will dig into the base classifiers we learned in class and try to combine them to find an optimal solution for a certain type of image dataset. Currently, we are looking into a dataset from Kaggle: the Airbus Ship Detection Challenge.<br />
<br />
As machine learning students, we are eager to help develop a better classification method. By “better”, we mean finding a balance between simplicity and accuracy. We will start with neural networks using different activation functions in each layer, and we will also combine base classifiers with bagging, random forests, and boosting for ensemble learning. We will also regularize our parameters to avoid overfitting on the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem. <br />
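A minimal sketch of the bagging idea on a 1-D toy dataset (decision stumps as base classifiers; data and parameters are invented for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(4)

# 1-D toy data: two well-separated classes.
X = np.concatenate([rng.normal(-1.0, 0.3, 60), rng.normal(1.0, 0.3, 60)])
y = np.array([0] * 60 + [1] * 60)

def fit_stump(X, y):
    # Best single-threshold classifier (decision stump) on 1-D data.
    best_t, best_acc = X.min() - 1.0, 0.0
    for t in X:
        acc = ((X > t) == y).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagged_stumps(X, y, n_models=15):
    # Bagging: fit one stump per bootstrap resample of the training data.
    thresholds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        thresholds.append(fit_stump(X[idx], y[idx]))
    return thresholds

def predict(thresholds, X):
    # Majority vote over the ensemble's individual predictions.
    votes = np.stack([(X > t).astype(int) for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)

ensemble = bagged_stumps(X, y)
accuracy = float((predict(ensemble, X) == y).mean())
print(f"bagged-stump training accuracy: {accuracy:.2f}")
```

Random forests add feature subsampling on top of this resampling, and boosting reweights the data between rounds instead of resampling it uniformly.<br />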
<br />
Hopefully, we can not only finish our project successfully, but also make a small contribution to the machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/PLAsTiCC-2018 PLAsTiCC Astronomical Classification ]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle. We will explore how feasible it is to classify astronomical bodies based on various factors such as brightness.<br />
<br />
These bodies vary in time and size, and some are unknown! There are over 100 classes that these bodies may belong to, and it will be our job to find, for each object, the predicted probability of its belonging to each class.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements<br />
<br />
<br />
'''Description:''' <br />
Stock prices are affected by the news to some extent. What is the influence of news on a stock's price, and what is the predictive power of the news? <br />
What we are going to do is use the content of news to predict the tendency of the stock price. We will mine the data, finding the useful information behind the big data. As a result, we will predict stock price performance when the market faces news.<br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' Construction of a classifier that can learn from noisy training data and generalize to a clean test set. The training data comes from the Google game "Quick, Draw!".<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
Our team currently has two topics of interest at hand, and we have summarized the objective of each topic below. Please note that we will narrow down our choice after further discussion with the instructor.<br />
<br />
'''Description 1:''' With 14 percent of Americans claiming that social media is their most dominant news source, fake news shared on Facebook and Twitter is invading people’s information-learning experience. Concomitantly, the quality and nature of online news have been gradually diluted by fake news that is sometimes imperceptible. With the aim of creating an unalloyed Internet surfing experience, we seek to develop a tool that performs fake news detection and classification. <br />
<br />
'''Description 2:''' Statistics Canada has recently reported an increasing trend in Toronto’s violent crime score. Though the Royal Canadian Mounted Police has put great effort into tracking crimes, the ambiguous snapshots captured by outdated cameras often hamper investigations. Motivated by this circumstance, our second interest focuses on accurate numeral and letter identification within variable-resolution images.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 22'''<br />
Group Members:<br />
<br />
Lee, Yu Xuan<br />
<br />
Heng, Tsen Yee<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 23'''<br />
Group Members:<br />
<br />
Bayati, Mahdiyeh<br />
<br />
Malek Mohammadi, Saber<br />
<br />
Luong, Vincent<br />
<br />
<br />
'''Title:''' Human Protein Atlas Image Classification<br />
<br />
<br />
'''Description:''' The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 24'''<br />
Group Members:<br />
<br />
Wu, Yutong<br />
<br />
Wang, Shuyue<br />
<br />
Jiao, Yan<br />
<br />
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification<br />
<br />
'''Description:'''</div>
<hr />
<div>== Presented by == <br />
Zhaoran Hou, Pei wei Wang, Chi Zhang, Daoyi Chen, Yiming Li, Ying Chi<br />
<br />
== Introduction ==<br />
In the past two decades, due to their surprising classification capability, support vector machines (SVMs) [1] and their variants [2]–[4] have been extensively used in classification applications.<br />
Least square support vector machine (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. The conventional LS-SVM and PSVM cannot be used directly in regression and multiclass classification applications, although variants of LS-SVM and PSVM have been proposed to handle such cases.<br />
<br />
== Motivation ==<br />
<br />
There are several issues on BP learning algorithms:<br />
<br />
(1) When the learning rate <math>\eta</math> is too small, the learning algorithm converges very slowly. However, when <math>\eta</math> is too large, the algorithm becomes unstable and diverges.<br />
<br />
(2) Another peculiarity of the error surface that impacts the performance of the BP learning algorithm is the presence of local minima [6]. It is undesirable for the learning algorithm to stop at a local minimum that lies far above the global minimum.<br />
<br />
(3) A neural network may be over-trained by BP algorithms, yielding worse generalization performance. Thus, validation and suitable stopping criteria are required in the cost-function minimization procedure.<br />
<br />
(4) Gradient-based learning is very time-consuming in most applications.<br />
<br />
Due to the simplicity of their implementations, least square support vector machine (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. The conventional LS-SVM and PSVM cannot be used directly in regression and multiclass classification applications, although variants of LS-SVM and PSVM have been proposed to handle such cases. This paper shows that both LS-SVM and PSVM can be simplified further, and that a unified learning framework of LS-SVM, PSVM, and other regularization algorithms, referred to as the extreme learning machine (ELM), can be built.<br />
<br />
== Previous Work ==<br />
<br />
As the training of SVMs involves a quadratic programming problem, the computational complexity of SVM training algorithms is usually intensive, at least quadratic in the number of training examples.<br />
<br />
Least square SVM (LS-SVM) [2] and proximal SVM (PSVM) [3] provide fast implementations of the traditional SVM. Both LS-SVM and PSVM use equality optimization constraints instead of inequalities from the traditional SVM, which results in a direct least square solution by avoiding quadratic programming.<br />
<br />
SVM, LS-SVM, and PSVM were originally proposed for binary classification. Different methods have been proposed in order for them to be applied to multiclass classification problems. One-against-all (OAA) and one-against-one (OAO) methods are mainly used in the implementation of SVM in multiclass classification applications [8]. <br />
<br />
The extreme learning machine (ELM) was proposed for single hidden layer feedforward neural networks (SLFNs); it randomly chooses the input weights and analytically determines the output weights of the SLFN. In theory, this algorithm tends to provide the best generalization performance at an extremely fast learning speed. Experimental results based on real-world benchmark function approximation and classification problems, including large complex applications, show that the new algorithm can produce the best generalization performance in some cases and can learn much faster than traditional popular learning algorithms for feedforward neural networks.<br />
<br />
== Model Architecture ==<br />
<br />
The extreme learning machine (ELM) is a particular kind of machine learning setup in which one or more hidden layers are used. The ELM contains a number of hidden neurons whose input weights are assigned randomly. Extreme learning machines use the concept of random projection and early perceptron models to solve specific kinds of problems.<br />
<br />
Given a single hidden layer of ELM, suppose that the output function of the <math>i</math>-th hidden node is <math>h_i(\mathbf{x})=G(\mathbf{a}_i,b_i,\mathbf{x})</math>, where <math>\mathbf{a}_i</math> and <math>b_i</math> are the parameters of the <math>i</math>-th hidden node. The output function of the ELM for SLFNs with <math>L</math> hidden nodes is:<br />
<br />
<math>f_L({\bf x})=\sum_{i=1}^L{\boldsymbol \beta}_ih_i({\bf x})</math>, where <math>{\boldsymbol \beta}_i</math> is the output weight of the <math>i</math>-th hidden node.<br />
<br />
<math>\mathbf{h}(\mathbf{x})=[h_1(\mathbf{x}),\ldots,h_L(\mathbf{x})]</math> is the hidden layer output mapping of ELM. Given <math>N</math> training samples, the hidden layer output matrix <math>\mathbf{H}</math> of ELM is given as: <math>{\bf H}=\left[\begin{matrix}<br />
{\bf h}({\bf x}_1)\\<br />
\vdots\\<br />
{\bf h}({\bf x}_N)<br />
\end{matrix}\right]=\left[\begin{matrix}<br />
G({\bf a}_1, b_1, {\bf x}_1) &\cdots & G({\bf a}_L, b_L, {\bf x}_1)\\<br />
\vdots &\vdots&\vdots\\<br />
G({\bf a}_1, b_1, {\bf x}_N) &\cdots & G({\bf a}_L, b_L, {\bf x}_N)<br />
\end{matrix}\right]<br />
</math><br />
<br />
and <math>\mathbf{T}</math> is the training data target matrix: <math>{\bf T}=\left[\begin{matrix}<br />
{\bf t}_1\\<br />
\vdots\\<br />
{\bf t}_N<br />
\end{matrix}\right]<br />
</math><br />
<br />
Generally speaking, ELM is a kind of regularization neural network but with non-tuned hidden-layer mappings (formed by either random hidden nodes, kernels, or other implementations); its objective function is:<br />
<br />
<math><br />
\text{Minimize: } \|{\boldsymbol \beta}\|_p^{\sigma_1}+C\|{\bf H}{\boldsymbol \beta}-{\bf T}\|_q^{\sigma_2}<br />
</math><br />
<br />
where <math>\sigma_1>0, \sigma_2>0, p,q=0, \frac{1}{2}, 1, 2, \cdots, +\infty</math>. <br />
<br />
Different combinations of <math>\sigma_1</math>, <math>\sigma_2</math>, <math>p</math> and <math>q</math> can be used and result in different learning algorithms for regression, classification, sparse coding, compression, feature learning and clustering.<br />
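For the common choice <math>\sigma_1=\sigma_2=2</math> and <math>p=q=2</math> (a ridge-style penalty), the minimizer has a closed form, which follows from standard regularized least squares:<br />
<br />
<math>{\boldsymbol \beta}=\left(\frac{\bf I}{C}+{\bf H}^T{\bf H}\right)^{-1}{\bf H}^T{\bf T}</math><br />
<br />
so once the random hidden-layer output <math>\bf H</math> is computed, training reduces to a single linear solve.<br />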
<br />
As a special case, the simplest ELM training algorithm learns a model of the form (for a single hidden layer sigmoid neural network):<br />
<br />
:<math>\mathbf{\hat{Y}} = \mathbf{W}_2 \sigma(\mathbf{W}_1 x)</math><br />
<br />
where <math>\mathbf{W}_1</math> is the matrix of input-to-hidden-layer weights, <math>\sigma</math> is an activation function, and <math>\mathbf{W}_2</math> is the matrix of hidden-to-output-layer weights. The algorithm proceeds as follows:<br />
<br />
# Fill <math>\mathbf{W}_1</math> with random values (e.g., Gaussian random noise);<br />
# estimate <math>\mathbf{W}_2</math> by a least-squares fit to a matrix of response variables <math>\mathbf{Y}</math>, computed using the Moore–Penrose pseudoinverse <math>(\cdot)^+</math>, given a design matrix <math>\mathbf{X}</math>:<br />
#:<math>\mathbf{W}_2 = \sigma(\mathbf{W}_1 \mathbf{X})^+ \mathbf{Y}</math><br />
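The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are ours, samples are stored as rows (so the fit is <math>\mathbf{H}^+\mathbf{Y}</math> with <math>\mathbf{H}=\sigma(\mathbf{X}\mathbf{W}_1+\mathbf{b})</math>), and a sigmoid hidden node is assumed.<br />

```python
import numpy as np

def elm_train(X, Y, L=50, seed=0):
    """Minimal single-hidden-layer ELM: random hidden layer, least-squares output.

    X: (N, d) design matrix; Y: (N, m) targets; L: number of hidden nodes.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(X.shape[1], L))    # random input weights (never tuned)
    b = rng.normal(size=L)                   # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b)))  # sigmoid hidden-layer output matrix H
    W2 = np.linalg.pinv(H) @ Y               # Moore-Penrose pseudoinverse solve
    return W1, b, W2

def elm_predict(X, W1, b, W2):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b)))
    return H @ W2
```

For instance, fitting <math>y=x^2</math> on <math>[0,1]</math> with 50 random sigmoid nodes gives a small training error without any iterative tuning of the hidden layer.<br />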
<br />
<br />
<center><br />
[[File:aa.png|800px]]<br />
</center><br />
<br />
== Performance Verification ==<br />
<br />
<center><br />
[[File:bb.png|800px]]<br />
</center><br />
<br />
<center><br />
[[File:cc.png|400px]]<br />
<br />
Fig. 1.<br />
</center><br />
Fig. 1 shows the scalability of different classifiers, using the Letter data set as an example. The training time spent by LS-SVM and ELM (Gaussian kernel) increases sharply as the number of training data increases. However, the training time spent by ELM with sigmoid additive nodes and multiquadric function nodes increases very slowly as the number of training data increases.<br />
<br />
== Conclusion ==<br />
<br />
<center><br />
[[File:dd.png|800px]]<br />
</center><br />
<br />
ELM is a learning mechanism for generalized SLFNs, where learning is performed without iterative tuning. The essence of ELM is that the hidden layer of the generalized SLFN need not be tuned. This paper has shown that both LS-SVM and PSVM can be simplified by removing the bias term b, and the resultant learning algorithms are unified with ELM. Instead of different variants being required for different types of applications, ELM can be applied directly in regression and multiclass classification applications. <br />
<br />
ELM requires less human intervention than SVM and LS-SVM/PSVM. If the feature mappings h(x) are known to users, then in ELM only one parameter C needs to be specified by users. The generalization performance of ELM is not sensitive to the dimensionality L of the feature space (the number of hidden nodes) as long as L is set large enough (e.g., L ≥ 1000 for all the real-world cases tested in our simulations). Different from SVM, LS-SVM, and PSVM, which usually require two parameters (C, γ) to be specified by users, this single-parameter setting makes ELM easy and efficient to use. If the feature mappings are unknown to users then, similar to SVM, LS-SVM, and PSVM, kernels can be applied in ELM as well. Different from LS-SVM and PSVM, ELM does not have constraints on the Lagrange multipliers αi. Since LS-SVM and ELM have the same optimization objective functions and LS-SVM has some optimization constraints on the Lagrange multipliers αi, in this sense LS-SVM tends to obtain a solution that is suboptimal to ELM.<br />
<br />
As verified by the simulation results, compared to SVM and LS-SVM, ELM achieves similar or better generalization performance for regression and binary classification cases, and much better generalization performance for multiclass classification cases. ELM has better scalability and runs at a much faster learning speed (up to thousands of times faster) than traditional SVM and LS-SVM.<br />
<br />
== Critiques ==<br />
<br />
An ELM is basically a two-layer neural net in which the first layer is fixed and random, and the second layer is trained. There are a number of issues with this idea.<br />
<br />
Firstly, algorithms such as SVM and deep learning focus on fitting a complex function with fewer parameters, while ELM uses more parameters to fit a relatively simple function.<br />
<br />
Secondly, the name: an ELM is *exactly* what Minsky & Papert call a Gamba Perceptron (a Perceptron whose first layer is a bunch of linear threshold units). The original 1958 Rosenblatt perceptron was an ELM in that the first layer was randomly connected.<br />
<br />
Thirdly, the method: connecting the first layer randomly is just about the stupidest thing you could do. People have spent the almost 60 years since the Perceptron to come up with better schemes to non-linearly expand the dimension of an input vector so as to make the data more separable (many of which are documented in the 1974 edition of Duda & Hart).<br />
<br />
== References ==<br />
<br />
* <sup>[https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1380068 [1]]</sup>G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: A new learning scheme of feedforward neural networks,” in Proc. IJCNN,Budapest, Hungary, Jul. 25–29, 2004, vol. 2, pp. 985–990.<br />
<br />
* <sup>[https://www.sciencedirect.com/science/article/pii/S0925231210002225 [2]]</sup>G.-B. Huang, X. Ding, and H. Zhou, “Optimization method based extreme learning machine for classification,” Neurocomputing, vol. 74, no. 1–3, pp. 155–163, Dec. 2010.</div>