= stat946w18/Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data =
= Presented by =
 
 
1. Family name, First name
 
 
2. Family name, First name
 
 
3.
 
 
= Introduction =

During emergency 911 calls, knowing the exact position of the victim is crucial for a fast response and a successful rescue. In particular, knowing the victim's floor level can speed up the search by a factor proportional to the number of floors in the building. Problems arise when the caller is unable to state their position accurately, for instance when the caller is disoriented or held hostage, or when a child is calling on behalf of the victim. GPS sensors on smartphones can provide rescuers with the geographic location, but GPS fails to give an accurate floor level inside a tall building. Previous work has explored using Wi-Fi signals or beacons placed inside buildings, but these methods are not self-contained and require prior knowledge of the building's infrastructure.
 
 
Fortunately, today's smartphones are equipped with many more sensors, including barometers and magnetometers, and deep learning can be applied to predict the floor level from these sensor readings.

First, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer readings. Next, an unsupervised clustering algorithm predicts the floor level from the barometric pressure difference. Together, these two parts form a self-contained floor-level prediction system that achieves 100% accuracy without any external prior knowledge.

This paper was published at ICLR 2018. The code, data, and app are open-source on [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data GitHub].
  
= Data Description =

The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. Each recorded sample contains: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, and magnet total.
  
The indoor-outdoor label had to be entered manually whenever the user entered or exited a building. To gather data for floor-level prediction, the authors conducted 63 trials in five different buildings throughout New York City. The actual floor level was recorded manually, but only for validation purposes, since the floor predictor is unsupervised.
=== Note: Barometric formula ===

The barometric formula, sometimes called the exponential or isothermal atmosphere model, describes how the pressure (or density) of air changes with altitude. Near sea level, the pressure drops by approximately 11.3 Pa per meter over the first 1000 meters.

= Methods =
The proposed method first determines whether the user is indoors or outdoors and detects transitions between the two. When an outdoor-to-indoor transition event occurs, the user's elevation, estimated from the phone's barometer, is saved: the barometer reading at the transition serves as the ground-level reference pressure. Finally, the exact floor level is predicted with a clustering technique. Indoor/outdoor classification is critical to this method, since a user detected to be outdoors is assumed to be at ground level, and vertical height and floor estimation are applied only while the user is indoors.

=== Indoor/Outdoor Classification ===

An LSTM network is used to solve the indoor-outdoor classification problem. A diagram of the network architecture is shown below.
  
[[File:lstm.jpg | 500px]]

Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for <math>d</math> consecutive time-steps. Target is <math>y = 1</math> if indoors and <math>y = 0</math> if outdoors.
  
<math>X_i</math> contains a set of <math>d</math> consecutive sensor readings, i.e. <math>X_i = [x_1, x_2, ..., x_d]</math>. The label <math>y</math> is 0 for outdoors and 1 for indoors. <math>d = 3</math> was chosen by random search, so each <math>X_i = [x_{j-1}, x_j, x_{j+1}]</math> and the middle reading <math>x_j</math> determines the label.

The LSTM has three layers. Layers one and two have 50 neurons each, followed by dropout set to 0.2. Layer three has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings, and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true labels and the predictions.
  
\begin{equation}
C(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} -\left(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right)
\label{equation:binCE}
\end{equation}

The main reason the network can predict whether the user is indoors or outdoors is that it learns how building walls interfere with GPS signals. The LSTM finds this pattern in the GPS signal strength, in combination with the other sensor readings, and uses it to give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoors, so a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground-level reference.
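As a concrete reference, the objective above can be computed directly. The following NumPy sketch is illustrative (not the authors' code) and clips predictions to avoid <math>\log(0)</math>:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between true labels y and predicted
    indoor probabilities y_hat, matching the equation above."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

# confident correct predictions give a small loss
print(binary_cross_entropy([1, 0, 1], [0.99, 0.01, 0.95]))
```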
 
  
=== Indoor/Outdoor Transition ===

To determine the exact time at which the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.
\begin{equation}
V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
\end{equation}

\begin{equation}
V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
\end{equation}
  
The Jaccard index measures the similarity of two sets and is calculated with the following equation:

\begin{equation}
J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|}
\label{equation:Jaccard}
\end{equation}
  
If the Jaccard similarity between <math>V_1</math> and a sub-sequence <math>s_i</math> meets or exceeds the threshold of 0.4, there was a transition from indoors to outdoors within the 20-second range of the vector mask. Similarly, a similarity of 0.4 or greater to <math>V_2</math> indicates a transition from outdoors to indoors. Transition windows that occur close together in time are merged, with the average of the two transition times used as the new transition time.

[[File:FindIOIndexes.png | 700px]]
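The mask-matching step can be sketched in plain Python (illustrative names; the merging of nearby windows described above is omitted for brevity):

```python
V1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # indoor -> outdoor template
V2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # outdoor -> indoor template

def jaccard(s, v):
    """Jaccard similarity between two binary sequences, treating each as
    the set of positions where it equals 1."""
    inter = sum(a & b for a, b in zip(s, v))
    union = sum(s) + sum(v) - inter
    return inter / union if union else 0.0

def find_transitions(preds, threshold=0.4):
    """Slide both masks over the LSTM's indoor (1) / outdoor (0)
    predictions and collect candidate transition windows."""
    events = []
    for i in range(len(preds) - len(V1) + 1):
        window = preds[i:i + len(V1)]
        if jaccard(window, V1) >= threshold:
            events.append((i, "indoor->outdoor"))
        if jaccard(window, V2) >= threshold:
            events.append((i, "outdoor->indoor"))
    return events

print(find_transitions([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))  # → [(0, 'indoor->outdoor')]
```

Because adjacent windows can exceed the threshold as well, a real implementation would merge nearby detections as the text describes.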
  
=== Vertical Height Estimation ===

Once the barometric pressure at ground level is known, the user's current relative altitude can be calculated with the international pressure equation, where <math>m_\Delta</math> is the estimated height, <math>p_1</math> is the pressure reading of the device, and <math>p_0</math> is the reference pressure recorded at ground level during the outdoor-to-indoor transition.
  
\begin{equation}
m_\Delta = f_{floor}(p_0, p_1) = 44330 \left(1 - \left(\frac{p_1}{p_0}\right)^{\frac{1}{5.255}}\right)
\label{equation:baroHeight}
\end{equation}
  
In Appendix B.1, the authors acknowledge that for this system to work, pressure variations due to weather or temperature must be accounted for, as those variations are on the same order of magnitude as, or larger than, the pressure variations caused by changing altitude. They suggest continuously measuring and correcting for this effect using a nearby reference station with a known altitude.
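Assuming <math>p_0</math> and <math>p_1</math> are given in consistent units, the equation above translates directly into code (an illustrative sketch, not the authors' implementation):

```python
def pressure_to_altitude(p0, p1):
    """Relative altitude in meters from the international pressure equation,
    with p0 the ground-level reference pressure and p1 the current device
    reading (any consistent unit, e.g. hPa)."""
    return 44330.0 * (1.0 - (p1 / p0) ** (1.0 / 5.255))

# equal pressures -> zero height; lower pressure -> positive height
print(pressure_to_altitude(1013.25, 1013.25))  # → 0.0
```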
  
=== Floor Estimation ===

Given the user's relative altitude, the floor level can be determined. This is not straightforward, because different buildings have different floor heights and different floor labeling conventions (e.g. omitting the 13th floor), and floor heights can vary from floor to floor within the same building. To solve these problems, the collected altitude data are sorted and clustered by grouping points that lie within 1.5 meters of each other. Each cluster represents the approximate altitude of a floor.
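One simple way to implement this grouping is a single pass over the sorted readings (an illustrative sketch; the authors' implementation may differ):

```python
def cluster_altitudes(readings, gap=1.5):
    """Sort altitude readings and start a new cluster whenever the gap to
    the previous point exceeds `gap` meters; return each cluster's mean,
    i.e. one altitude estimate per discovered floor."""
    pts = sorted(readings)
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] <= gap:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return [sum(c) / len(c) for c in clusters]

# readings near 0 m, 4 m and 8 m collapse into three floor centres
print(cluster_altitudes([0.1, -0.2, 4.1, 3.9, 8.3, 8.0, 0.0]))
```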
  
Below is an example of the altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represents the center of a cluster.

[[File:clusters.png | 500px]]

Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is especially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building, learned from the clustered data points.
  
The algorithm for floor-level prediction is shown below.

[[File:PredictFloor.png | 700px]]
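Once per-building cluster centres are known, the final lookup can be sketched as a nearest-centre search (a simplification of the paper's algorithm; names and values are illustrative):

```python
def predict_floor(altitude, floor_centres):
    """Return the floor index (0 = ground) of the cluster centre nearest
    to the current altitude estimate."""
    centres = sorted(floor_centres)
    return min(range(len(centres)), key=lambda i: abs(centres[i] - altitude))

floors = [0.0, 4.0, 8.2, 12.1]  # hypothetical cluster centres for one building
print(predict_floor(7.6, floors))  # → 2
```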
+
= Experiments and Results =

The authors evaluated two tasks: indoor-outdoor classification and floor-level prediction. For indoor-outdoor detection, they compared six models: an LSTM, a feedforward neural network, logistic regression, an SVM, an HMM, and random forests. For floor-level prediction, they evaluated the full system.
  
== Indoor-Outdoor Classification Results ==

The results for the indoor-outdoor classification problem using different machine learning techniques are shown below. The LSTM has the best performance on the test set.

The LSTM was trained for 24 epochs with a batch size of 128. The hyper-parameters, such as the learning rate (0.006), number of layers, window size <math>d</math>, number of hidden units, and dropout rate, were chosen by random search.

[[File:IOResults.png]]
== Floor Level Prediction Results ==
The following are the results for floor-level prediction on the 63 collected samples. Results are reported as the percentage of predictions that matched the floor exactly, were off by one floor, or were off by more than one floor. In each column, the left number is the accuracy using a fixed floor height, and the right number is the accuracy when clustering was used to estimate a variable, building-specific floor height. Clustering produced 100% accuracy on floor predictions, indicating that using building-specific floor heights gives significantly better results.

[[File:FloorLevelResults.png]]
== Floor Level Clustering Results ==

Below is a comparison between the estimated floor heights and the ground truth in the Uris Hall building.

[[File:FloorComparison.png]]
  
= Criticism =

This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical contributions: the machine learning techniques used are fairly standard, the neural network contains only three layers, and the clustering is applied to one-dimensional data. This raises the question of whether deep learning is necessary and suitable for this task.
  
The paper notes that there are many cases where the system does not work, including buildings with glass walls, delayed GPS signals, and pressure changes caused by air conditioning. Other conceivable failure cases are uneven floors with some areas higher than others, rarely visited floors, and tunnels connecting buildings. These cases are not specifically mentioned in the paper, though the authors do note that the pressure difference between the outdoors and pressure-sealed buildings is a problem.
  
Another weakness comes from the clustering technique, which requires a fair amount of training data. The authors suggest two approaches. First, the data can be stored on individual smartphones; this is unrealistic, as most people do not visit every floor of every building, even their own apartment building. Second, a central system (e.g. an emergency department) can collect data from multiple users, which is what the paper's results are based on; however, such data collection would need to comply with local laws. Perhaps a better fallback is to estimate a floor range from the elevation reading and a typical floor height: even a small range of candidate floors could help first responders significantly narrow down their search.
  
The results of the paper pose some interesting questions which are not discussed in the paper itself:
+
Aside from all the technical issues, if knowing the exact floor is required, would it maybe be easier to let the rescuers carry a barometer with them and search for the floor with the transmitted pressure reading?
  
# Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
+
== Real-world Considerations ==
# For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?
 
  
= More Formulations of Recurrent Neural Networks =
+
In the appendices the real-world issues discovered are discussed and possible solutions are proposed.
The standard RNN is formalized as follows
 
  
:<math>\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)</math>
+
'''Pressure Variance'''
:<math>\,o_t=W_{oh}h_t+b_o</math>
 
  
Given sequence of input vectors <math>\,(x_1,\cdots,x_{T})</math>, the RNN computes a sequence of hidden states <math>\,(h_1,\cdots,h_{T})</math> and a sequence of output <math>\,(o_1,\cdots,o_{T})</math> by iterating the above equations. <math>\,W_{hx}</math> is the input to hidden weight matrix, <math>\,W_{hh}</math> is the hidden to hidden weight matrix, <math>\,W_{oh}</math> is the hidden to output weight matrix. Vector <math>\,b_{h}</math> and <math>\,b_{o}</math> are the biases. When t=1, the undefined <math>\,W_{hh}h_{t-1}</math> is replace with a special initial bias vector, <math>\,h_{init}</math>.  
+
Changing weather conditions and geographical locations can greatly affect the barometric pressure from different cases. As a possible solution, gathering the current pressure conditions from a nearby landmark such as an airport can be used to normalize the local pressure. Alternatively, the knowledge of local wi-fi access points can establish if a user is changing locations or if the pressure is naturally changing.
  
It may seem to train RNNs with gradient descent, but in reality, gradient decays exponentially as it is backpropagated through time. The relation between parameter and dynamics of the RNN is highly unstable, which makes gradient descent ineffective. Thus, it argues that RNN can not learn long-range temporal dependencies when gradient descent is used for training. A good way to deal with inability of gradient descent to learn long-range temporal structure in RNN is known as "Long-Short Term memory". (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)
+
Another potential issue when using pressure readings is that different phones were found to read the local pressure at varying offsets from each-other. This shows that some form of calibration of the phone would have to be provided prior to the use of the app.
  
There are different variants of LSTM<ref name=grave>
+
'''Battery Impact'''
</ref><ref>
 
Gers, Felix, and Jürgen Schmidhuber. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861302&tag=1 "Recurrent nets that time and count."] Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000.
 
</ref><ref>
 
Cho, Kyunghyun, et al. [http://arxiv.org/pdf/1406.1078v3.pdf "Learning phrase representations using rnn encoder-decoder for statistical machine translation."] arXiv preprint arXiv:1406.1078 (2014).
 
</ref> other than the original one proposed by Hochreiter et al.<ref name=lstm>
 
</ref> Greff et al. compare the performance of some different popular variants in their work<ref>
 
Greff, Klaus, et al. [http://arxiv.org/pdf/1503.04069.pdf "LSTM: A Search Space Odyssey."] arXiv preprint arXiv:1503.04069 (2015).
 
</ref> and draw the conclusion that they are about the same. While Jozefowicz, et al. suggest that some architecture can perform better than LSTM on certain tasks<ref>
 
Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. [http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf "An Empirical Exploration of Recurrent Network Architectures."] Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
 
</ref>.
 
  
= Criticisms =
+
In having an app regularly collecting data from the GPS and motion sensors, the battery life of the device will be severely impacted. While the motion sensing has already been addressed in iOS systems by running on a dedicated chip, the GPS would need to be sampled far less frequently.
There is some concern regarding whether this model will be able to provide a truly scalable solution to MT. In particular, it is not obvious that this model will be able to sufficiently scale to long sentences as is evident in the reported results. The model is severely limited, in general, by working only in the absence of infrequent words. These theoretical limitations alongside sparse experimental results give rise to skepticism about the overarching validity of the model.  
 
  
= Source =
+
= Conclusion =
Sutskever, I. Vinyals, O. & Le. Q. V. Sequence to sequence learning
+
This paper presented a novel deep learning application in predicting the floor level given sensory data from mobile phones. While there are no new theoretical discoveries, the application is novel and important for 911-responders; indeed, previous studies have shown that survival rates for urgent medical events drop exponentially for each floor increase. Although much of this is attributed to the actual floor height, this situation makes it all the more important to reduce ground-to-floor travel time.
with neural networks. In Proc. Advances in Neural Information
 
Processing Systems 27 3104–3112 (2014).
 
<references />
 

Latest revision as of 15:01, 18 April 2018

Introduction

During emergency 911 calls, knowing the exact position of the victim is crucial to a fast response and a successful rescue. Knowing the victim's floor level in an emergency can speed up the search by a factor proportional to the number of floors in the building. Problems arise when the caller is unable to give their physical position accurately. This can happen, for instance, when the caller is disoriented, held hostage, or a child is calling on behalf of the victim. GPS sensors on smartphones can provide rescuers with the geographic location. However, GPS fails to give an accurate floor level inside a tall building. Previous work has explored using Wi-Fi signals or beacons placed inside buildings, but these methods are not self-contained and require prior knowledge of the building's infrastructure.

Fortunately, today’s smartphones are equipped with many more sensors, including barometers and magnetometers. Deep learning can be applied to predict the floor level from these sensor readings. First, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer readings. Next, an unsupervised clustering algorithm predicts the floor level from the barometric pressure difference. With these two parts working together, the self-contained floor-level prediction system achieved 100% accuracy in the authors' evaluation, without any external prior knowledge.

This paper was published at ICLR 2018. The code, data, and app are open-source on GitHub.

Data Description

The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. The following sensor readings were recorded: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.

The indoor-outdoor label has to be entered manually as soon as the user enters or exits a building. To gather the data for floor-level prediction, the authors conducted 63 trials across five different buildings throughout New York City. The actual floor level was recorded manually for validation purposes only, since the floor predictor is unsupervised.

Note: Barometric formula

The barometric formula, sometimes called the exponential atmosphere or isothermal atmosphere, models how the pressure (or density) of the air changes with altitude. The pressure drops by approximately 11.3 Pa per meter in the first 1000 meters above sea level.
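As a quick numeric sanity check on that figure (my own illustration, not from the paper), the international pressure equation used later for height estimation can be inverted and evaluated over the first 1000 m:

```python
# Pressure as a function of altitude, from the inverse of the
# international pressure equation p(h) = p0 * (1 - h / 44330)**5.255.
P0 = 101325.0  # standard sea-level pressure in Pa

def pressure_at_altitude(h_m, p0=P0):
    """Pressure in Pa at h_m meters above the reference level."""
    return p0 * (1.0 - h_m / 44330.0) ** 5.255

# Average drop over the first 1000 m: roughly 11-12 Pa per meter,
# consistent with the ~11.3 Pa/m figure quoted above.
avg_drop_pa_per_m = (P0 - pressure_at_altitude(1000.0)) / 1000.0
```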

Methods

The proposed method first determines whether the user is indoors or outdoors and detects transitions between the two. When an outdoor-to-indoor transition occurs, the user's elevation is saved using an estimate from the phone's barometer. Finally, the exact floor level is predicted through clustering. Indoor/outdoor classification is critical to this method: once the user is detected to be outdoors, they are assumed to be at ground level, and the vertical height and floor estimation are applied only while the user is indoors. The indoor/outdoor transitions are used to save the barometer reading at ground level for use as the reference pressure.

Indoor/Outdoor Classification

An LSTM network is used to solve the indoor-outdoor classification problem. Here is a diagram of the network architecture.

lstm.jpg

Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for d consecutive time-steps. Target is y = 1 if indoors and y = 0 if outdoors.

[math] X_i[/math] contains a set of [math]d[/math] consecutive sensor readings, i.e. [math] X_i = [x_1, x_2,...,x_d] [/math]. [math]Y[/math] is labelled 0 for outdoors and 1 for indoors. [math]d[/math] is chosen to be 3 by random search, so that [math]X[/math] has 3 points, [math]X_i = [x_{j-1}, x_j, x_{j+1}][/math], and the middle reading [math]x_j[/math] provides the [math]y[/math] label. The LSTM contains three layers. Layers one and two have 50 neurons each, followed by a dropout layer with rate 0.2. Layer 3 has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true labels and the predictions.

\begin{equation} C(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right) \label{equation:binCE} \end{equation}
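The windowing and objective described above can be sketched in NumPy. This is a reconstruction for illustration, not the authors' code; the clipping in the loss is an implementation detail I added for numerical stability:

```python
import numpy as np

def make_windows(readings, labels, d=3):
    """Build overlapping windows of d consecutive sensor readings.

    readings: (T, n_features) array-like; labels: (T,) array-like of 0/1.
    Returns X of shape (T - d + 1, d, n_features) and y, the label of
    the middle reading of each window.
    """
    readings = np.asarray(readings)
    labels = np.asarray(labels)
    X = np.stack([readings[j:j + d] for j in range(len(readings) - d + 1)])
    y = labels[d // 2 : len(labels) - (d - 1 - d // 2)]
    return X, y

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and sigmoid outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```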

The main reason the neural network can predict whether the user is indoors or outdoors is that it learns how the walls of buildings interfere with GPS signals: the LSTM finds patterns in GPS signal strength, in combination with the other sensor readings, that give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoors. Thus, a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground-floor reference.
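This step can be sketched as follows; the exact window handling is my assumption, since the paper only states the 20-second window and the use of the minimum reading:

```python
def ground_pressure(times_s, pressures_pa, t_transition_s, window_s=20.0):
    """Reference pressure p0: the minimum barometer reading within
    window_s seconds after an outdoor-to-indoor transition.

    times_s and pressures_pa are parallel sequences of timestamps
    (seconds) and pressure readings (Pa)."""
    in_window = [p for t, p in zip(times_s, pressures_pa)
                 if t_transition_s <= t <= t_transition_s + window_s]
    return min(in_window)
```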

Indoor/Outdoor Transition

To determine the exact time the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.

\begin{equation} V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] \end{equation}

\begin{equation} V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] \end{equation}

The Jaccard distance (strictly, the Jaccard similarity index, since larger values indicate greater similarity) measures the overlap of two sets and is calculated with the following equation:

\begin{equation} J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|} \label{equation:Jaccard} \end{equation}

If the Jaccard distance between [math]V_{1}[/math] and sub-sequence [math] s_i [/math] is greater than or equal to the threshold 0.4, there was a transition from indoors to outdoors within the 20-second range of the vector mask. Similarly, a distance of 0.4 or greater to [math]V_{2}[/math] indicates a transition from outdoors to indoors. Transition windows that occur close together in time are merged, with the average of the two transition times used as the new transition time.
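A sketch of the transition detector: slide a 10-step window over the binary LSTM predictions and compare each sub-sequence against the two templates. The masks and the 0.4 threshold are from the paper; the sliding-window loop and the treatment of the binary vectors as sets of active positions are my reconstruction:

```python
V1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # indoor -> outdoor template
V2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # outdoor -> indoor template

def jaccard(s, v):
    """Jaccard similarity between two equal-length 0/1 vectors,
    treated as sets of active positions."""
    inter = sum(a & b for a, b in zip(s, v))
    denom = sum(s) + sum(v) - inter
    return inter / denom if denom else 0.0

def find_transitions(preds, threshold=0.4):
    """Return (index, kind) pairs where a window of binary indoor/outdoor
    predictions matches a transition template at or above the threshold."""
    hits = []
    for i in range(len(preds) - len(V1) + 1):
        window = preds[i:i + len(V1)]
        if jaccard(window, V1) >= threshold:
            hits.append((i, "indoor->outdoor"))
        if jaccard(window, V2) >= threshold:
            hits.append((i, "outdoor->indoor"))
    return hits
```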

FindIOIndexes.png

Vertical Height Estimation

Once the barometric pressure of the ground floor is known, the user’s current relative altitude can be calculated by the international pressure equation, where [math]m_\Delta[/math] is the estimated height, [math] p_1 [/math] is the pressure reading of the device, and [math] p_0 [/math] is the reference pressure at ground level while transitioning from outdoor to indoor.

\begin{equation} m_\Delta = f_{floor}(p_0, p_1) = 44330 (1 - (\frac{p_1}{p_0})^{\frac{1}{5.255}}) \label{equation:baroHeight} \end{equation}
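The equation translates directly into code; a round trip through its inverse checks the constants (a plain transcription for illustration, not the authors' implementation):

```python
def height_from_pressure(p0, p1):
    """Estimated height m_delta in meters of pressure p1 above the
    ground-level reference pressure p0 (international pressure equation)."""
    return 44330.0 * (1.0 - (p1 / p0) ** (1.0 / 5.255))

def pressure_from_height(p0, m):
    """Inverse: pressure in Pa at m meters above the reference level."""
    return p0 * (1.0 - m / 44330.0) ** 5.255
```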

In appendix B.1, the authors acknowledge that for this system to work, pressures variations due to weather or temperature must be accounted for as those variations are on the same order of magnitude or larger than the pressure variations caused by changing altitude. They suggest using a nearby reference station with known altitude to continuously measure and correct for this effect.

Floor Estimation

Given the user’s relative altitude, the floor level can be determined. This is not a straightforward task, because different buildings have different floor heights, different floor labellings (e.g. omitting the 13th floor), and floor heights that vary from floor to floor within the same building. To solve these problems, the collected altitude data are clustered by grouping sorted altitude points that are within 1.5 meters of each other. Each cluster represents the approximate altitude of a floor.
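The clustering step can be sketched as a simple one-dimensional grouping: sort the altitudes and start a new cluster whenever the gap to the previous point exceeds 1.5 m. The exact linkage rule is my assumption; the paper only states the 1.5 m grouping:

```python
def cluster_altitudes(altitudes, gap_m=1.5):
    """Group altitude readings (meters) into clusters; a point joins the
    current cluster if it is within gap_m of the previous sorted point.
    Returns the mean altitude of each cluster (one per floor).
    Assumes a non-empty input."""
    pts = sorted(altitudes)
    clusters = [[pts[0]]]
    for p in pts[1:]:
        if p - clusters[-1][-1] <= gap_m:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return [sum(c) / len(c) for c in clusters]
```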

Here is an example of altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represents the center of a cluster.

clusters.png

Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is especially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building, learned from the clustered data points.

Here is the algorithm for the floor level prediction.

PredictFloor.png

Experiments and Results

The authors performed evaluation on two different tasks: indoor-outdoor classification and floor-level prediction. For the indoor-outdoor detection task, they compared six different models: LSTM, feedforward neural networks, logistic regression, SVM, HMM, and Random Forests. For the floor-level prediction task, they evaluated the full system.

Indoor-Outdoor Classification Results

Here are the results for the indoor-outdoor classification problem using different machine learning techniques; the LSTM has the best performance on the test set. The LSTM was trained for 24 epochs with a batch size of 128. All hyper-parameters, such as the learning rate (0.006), number of layers, window size d, number of hidden units, and dropout rate, were tuned by random search.

IOResults.png

Floor Level Prediction Results

The following are the floor-level prediction results from the 63 collected samples. Results are given as the percentage of trials that matched the floor exactly, were off by one, or were off by more than one. In each column, the left number is the accuracy using a fixed floor height, and the right number is the accuracy when clustering was used to learn a building-specific floor height. The clustering technique produced 100% accuracy on floor predictions, indicating that building-specific floor heights give significantly better results than a fixed height.

FloorLevelResults.png

Floor Level Clustering Results

Here is the comparison between the estimated floor height and the ground truth in the Uris Hall building.

FloorComparison.png

Criticism

This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical contributions. The machine learning techniques used are fairly standard: the neural network contains only 3 layers, and the clustering is applied to one-dimensional data. This raises the question of whether deep learning is necessary and well suited for this task.

It was explained in the paper that there are many cases where the system does not work. Cases mentioned include buildings with glass walls, delayed GPS signals, and pressure changes caused by air conditioning. Other plausible failure cases are uneven floors with some areas higher than others, rarely visited floors, and tunnels running from one building to another. These special cases are not specifically mentioned in the paper, though the authors do note that the pressure difference between outdoors and pressure-sealed buildings is a problem.

Another weakness comes from the clustering technique, which requires a fair amount of training data. The authors suggested two approaches. First, the data can be stored on the individual smartphone. This is unrealistic, as most people do not visit every floor of every building, even in their own apartment building. The second approach is to let a central system (e.g. the emergency department) collect data from multiple users, which is what the paper’s results are based on; however, such data collection would need to comply with local laws. Perhaps a better fallback is to use the elevation reading to estimate a floor range based on a typical floor height: even a small range of candidate floors could help first responders significantly narrow down their search.

Aside from all the technical issues: if knowing the exact floor is required, would it perhaps be easier to let the rescuers carry a barometer and search for the floor matching the transmitted pressure reading?

Real-world Considerations

In the appendices the real-world issues discovered are discussed and possible solutions are proposed.

Pressure Variance

Changing weather conditions and geographic locations can greatly affect barometric pressure readings. As a possible solution, current pressure conditions gathered from a nearby landmark such as an airport can be used to normalize the local pressure. Alternatively, knowledge of local Wi-Fi access points can establish whether the user is changing location or the pressure is changing naturally.

Another potential issue when using pressure readings is that different phones were found to read the local pressure at varying offsets from each other. This suggests that some form of per-phone calibration would be needed before the app could be used reliably.

Battery Impact

An app that regularly collects data from the GPS and motion sensors will severely impact the device's battery life. While motion sensing is already handled in iOS by a dedicated low-power chip, the GPS would need to be sampled far less frequently.

Conclusion

This paper presented a novel deep learning application: predicting the floor level from smartphone sensor data. While it offers no new theoretical contributions, the application is novel and important for 911 responders; indeed, previous studies have shown that survival rates for urgent medical events drop exponentially with each floor increase. Although much of this is attributed to the actual floor height, that only makes reducing ground-to-floor travel time all the more important.