Human Activity Recognition using Smartwatch and Smartphone: A Review on Methods, Applications, and Challenges

Human Activity Recognition (HAR) has recently become a popular research field owing to the widespread availability of sensor devices. Sensors embedded in smartwatches and smartphones enable activity recognition applications, for example supporting the daily life of elderly people, although challenges remain. Many approaches to recognizing and analyzing human activity have been reported in the literature. Most published work on HAR used multi-sensor methods in which several sensors are attached at different positions on the body, an arrangement that is unsuitable for many users. Today, smartphones and smartwatches combine different types of sensors, which opens a new area for the analysis of human behavior. This paper reviews the methodologies applied to HAR problems that use the sensors built into smartphones and smartwatches, employing Machine Learning and, more recently, deep learning approaches. The literature is summarized from four aspects: sensor types, applications, Machine Learning (ML) and Deep Learning (DL) models, and results and challenges.


Sensor Modality for HAR
Sensors are embedded in smart devices to improve the ability of those devices to be controlled and managed [16]. Activity recognition performance relies on the sensor modality used. Sensor modalities can be categorized into four types [17]:
1. Ambient sensors, which are embedded in the environment to capture human-environment interactions and can also be used for indoor localization.
2. Object sensors, which are attached to target objects such as books or goods. Radio-Frequency Identification (RFID) sensors are widely used to determine object usage, because discriminating composite activities such as drinking or cooking depends on knowing which objects are used.
3. Wearable sensors, which capture body motion efficiently and are therefore commonly used for HAR. These sensors are integrated into smartphones, smartwatches, or bands. The three essential sensors embedded in smart devices for detecting motion are the accelerometer, the gyroscope, and the magnetometer, collectively called motion sensors. The accelerometer measures acceleration, the rate of change of an object's velocity, in meters per second squared (m/s²). The gyroscope measures orientation and angular velocity, the latter in degrees per second (°/s). The magnetometer is assembled with the gyroscope and accelerometer in an inertial unit and measures the changing magnetic field at a specific position in tesla (T).
4. Application-specific sensors, which are used for particular applications. The audio sensor is one example: it relies on the speaker and microphone built into a mobile device to transmit and receive ultrasound signals, which are modified by human motion. Audio sensors are convenient for fine-grained movement recognition; Lee et al. [18] used ultrasound signals to recognize chewing activities. Another example is the pressure sensor, which relies on mechanical mechanisms that require direct physical contact and can be distributed at different places in a smart environment. Because of this physical contact, it can detect small movements and may therefore be suitable for monitoring exercise and correcting writing posture [19].
Figure 1 summarizes the sensor modalities in a diagram, designed according to our view, that illustrates the categorization of sensors used for HAR.
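To make the units of the motion sensors concrete, the following minimal Python sketch computes the magnitude of a three-axis reading and converts an angular rate from degrees per second to radians per second. The sample readings and the function names are illustrative assumptions and are not taken from any of the cited works.

```python
import math


def magnitude(x, y, z):
    """Euclidean norm of a three-axis sensor reading."""
    return math.sqrt(x * x + y * y + z * z)


def deg_to_rad_per_s(rate_deg_per_s):
    """Convert an angular rate from degrees/s to radians/s."""
    return math.radians(rate_deg_per_s)


# Hypothetical readings: a device at rest senses only gravity (m/s^2),
# and a slow rotation about one axis is reported in degrees/s.
accel_at_rest = (0.0, 0.0, 9.81)
gyro_turn = (0.0, 90.0, 0.0)

print(magnitude(*accel_at_rest))                # acceleration magnitude in m/s^2
print(deg_to_rad_per_s(magnitude(*gyro_turn)))  # angular speed in rad/s
```

The magnitude is often preferred over the individual axes because it does not change when the device is rotated.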

Dataset for HAR
Different modalities have been proposed for data gathering. The datasets used for HAR with smartphones and/or smartwatches were gathered from the different sensors built into these devices. Kwapisz et al. [20] created a dataset from accelerometer data whose collection was supervised by a member of the WISDM team. The standard dataset for smartphone-based HAR was made available in 2012, when it was modeled with machine learning algorithms [21], and was fully described in 2013. The data were gathered from 30 people aged 19 to 48 years, each with a smartphone mounted on the waist, and comprise x, y, and z accelerometer and gyroscope recordings [22]. Later, other standard datasets were built and modeled [23] [24]. Table 1 lists the publicly available HAR datasets collected from the accelerometer or gyroscope sensors of a smartphone or smartwatch.

HAR Features
A feature can be defined as a statistical function that efficiently extracts meaningful information from data. From the HAR point of view, a particular physical movement of a subject generates a particular pattern; for instance, the "running" activity has a pattern different from that of the "walking" activity. Different pattern distributions are produced when the intensity of a physical effort is measured by accelerometer and gyroscope sensors. In the literature, features are classified into two domains: the time domain, which uses mathematical functions to extract statistical information from the signals, and the frequency domain, which uses mathematical functions to capture repetitive patterns in the signals that are often linked to the natural periodicity of the activities. Training an ML model on unsuitable features degrades its performance and decreases its accuracy. Applying a feature selection technique before modeling can therefore reduce overfitting, improve accuracy, and shorten training time [25]. Table 2 presents the features of each domain that are often used in the literature.
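The two feature domains can be illustrated with a short Python sketch that uses only the standard library. The chosen statistics (mean, standard deviation, minimum, maximum, root mean square) and the dominant frequency are common examples, but the exact feature set, the function names, and the synthetic signal are assumptions made for illustration.

```python
import cmath
import math
import statistics


def time_features(signal):
    """Time-domain statistics of one window of sensor samples."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "min": min(signal),
        "max": max(signal),
        "rms": math.sqrt(sum(v * v for v in signal) / len(signal)),
    }


def dominant_frequency(signal, sampling_rate_hz):
    """Frequency (Hz) of the strongest non-DC component, via a naive DFT."""
    n = len(signal)
    mean = statistics.fmean(signal)
    centered = [v - mean for v in signal]
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sampling_rate_hz / n


# Synthetic window: a 2 Hz periodic motion sampled at 50 Hz for 2 seconds.
rate = 50
window = [math.sin(2 * math.pi * 2 * t / rate) for t in range(100)]
print(time_features(window))
print(dominant_frequency(window, rate))
```

The naive DFT is quadratic in the window length, which is acceptable for short windows; a fast Fourier transform would normally be used for longer ones.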

HAR Problems
This section covers the HAR problems and their applications that implement machine learning (ML) and deep learning (DL) algorithms.

Machine Learning
Machine learning involves building mathematical models that help in understanding data. Learning enters when these models are given tunable parameters that can be adapted to observed data, so that the program learns from the data. Models fitted to previously seen data can then be used to understand and predict aspects of newly observed data. A benefit of the machine learning approach is that it generalizes to much larger datasets in many more dimensions [26]. HAR is considered a supervised learning problem: a model captures the relationship between the measured features of the data and their associated labels, and is then used to assign labels to new, unknown data. The most common classical machine learning classification algorithms, all of them supervised, are Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Random Forests (RF), Decision Trees (DT), and k-Nearest Neighbors (k-NN). These algorithms are applied and compared in order to obtain the best classification accuracy on data collected from smart devices [27].
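As a minimal illustration of the supervised setting, the sketch below implements k-NN, one of the classifiers listed above, in plain Python. The two-dimensional feature vectors (mean and standard deviation of acceleration magnitude) and their values are invented for the example and do not come from any dataset discussed in this review.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    ranked = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled windows: (mean magnitude, standard deviation).
train_x = [(9.8, 0.10), (9.7, 0.20), (9.9, 0.15),
           (10.5, 2.0), (10.8, 2.4), (10.2, 1.8)]
train_y = ["sitting", "sitting", "sitting",
           "walking", "walking", "walking"]

print(knn_predict(train_x, train_y, (10.4, 2.1)))  # → walking
```

In practice the features would be scaled before computing distances, since k-NN is sensitive to the range of each feature.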
Several published articles have tried to identify and solve different problems related to HAR, which are:

a. Health-care
One recently considered problem is the use of smartphone connectivity to deliver health care services remotely. Anjum et al. [28] built a smartphone application that tracks users' physical activities during their daily routine and reports estimates of calories burned without user intervention. Everyday physical activity can affect a person's life [29]; quantifying it is therefore a way to encourage people to increase their physical activity. In this regard, Cvetković et al. [30] built an Android application that recognizes and monitors human activity and uses it as input for estimating Energy Expenditure (EE), using a smartphone worn freely on the body and an optional heart rate monitor to increase accuracy. The approach runs in real time and automatically adapts to the presence, orientation, and location of the phone. First, the method detects the presence of the devices with a simple heuristic; then the orientation of the phone is normalized and used for detecting walking and location and for Activity Recognition (AR), which is implemented with machine learning. Mekruksavanich et al. [31] proposed a framework that uses a smartwatch to detect periods of sitting in order to address Office Worker Syndrome (OWS). Ensemble methods, including stacking and base classifiers, are employed. The results showed that stacking combined with ensemble learning methods achieved good accuracy, and that performance is best when the gyroscope and accelerometer are used together rather than separately. Balli et al. [32] worked on recognizing the movements of elderly people and young children to prevent them from falling. The data are obtained from the accelerometer, gyroscope, step counter, and heart rate sensor of a smartwatch. The main novelty of this study is a hybrid method that combines Principal Component Analysis (PCA), an efficient feature extraction method, with Random Forest (RF), an efficient classification algorithm, to produce the most successful result.

b. Sequential Aspect of Data
Time series sensor data need an approach that takes the sequential aspect of the data into account. Lee et al. [33] exploited the hierarchical nature of activities, which can be broken down into simpler ones, to design Hierarchical Hidden Markov Models (HHMMs), which are Markov chains with hidden states, for classifying low-level and high-level activities collected from the 3D accelerometer of a smartphone. However, their approach had difficulty differentiating between going upstairs and going downstairs. Later, Ronao et al. [34] overcame this issue with a two-stage Continuous Hidden Markov Model (CHMM) classifier. In the first stage, features selected using Random Forest (RF) variable importance measures are used for a coarse classification that separates stationary from moving activities; in the second stage, CHMMs are used for fine classification. Bulbul et al. [35] used the accelerometer and gyroscope sensors of a smartphone with different supervised machine learning approaches, such as ensemble methods, to recognize human activity. Efficiency and precision are used as metrics for comparing those approaches.
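The idea of classifying a sensor sequence with hidden Markov models can be sketched as follows: one model is kept per activity class, and a sequence is assigned to the class whose model gives it the highest likelihood. This is a simplified discrete HMM with the scaled forward algorithm, not the hierarchical or continuous models of [33] and [34]; the two models, their probabilities, and the quantization of motion intensity into three symbols are all assumptions made for illustration.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Log-probability of an observation sequence (scaled forward algorithm)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)            # probability of obs[t] given the past
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def classify(obs, models):
    """Return the name of the model under which obs is most likely."""
    return max(models, key=lambda name: log_likelihood(obs, **models[name]))


# Hypothetical models; observation symbols: 0 = low, 1 = medium, 2 = high motion.
models = {
    "stationary": {
        "start": [0.5, 0.5],
        "trans": [[0.9, 0.1], [0.1, 0.9]],
        "emit": [[0.80, 0.15, 0.05], [0.60, 0.30, 0.10]],
    },
    "moving": {
        "start": [0.5, 0.5],
        "trans": [[0.7, 0.3], [0.3, 0.7]],
        "emit": [[0.10, 0.30, 0.60], [0.05, 0.15, 0.80]],
    },
}

print(classify([2, 2, 1, 2], models))  # → moving
```

Rescaling the forward variables at every step avoids numerical underflow on long sequences while still yielding the exact log-likelihood.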

c. Driving Behavior
Recent years have witnessed growing interest in analyzing data related to driving behavior in order to recognize risky driving. Several studies have analyzed driver behavior using mobile sensors. Classification and evaluation depend on combining embedded inertial sensors with auxiliary devices [36] [37]. Liu et al. [38] proposed a system that uses the inertial sensors of a smartphone and a smartwatch to track the steering wheel turning angle and detect unsafe driving activities. The smartwatch makes it possible to discriminate steering movements from other movements such as eating. They showed that combining machine learning models of motion sensing with a classifier provides accurate classification of driving activities. Ferreira et al. [39] evaluated the performance of four ML algorithms on data collected from four Android smartphone sensors for detecting seven driving events. Sun et al. [40] used smartphone acceleration and orientation data to detect driving events by proposing a new bagging tree with the Dynamic Time Warping (DTW) algorithm.
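Dynamic Time Warping compares two sequences that may be stretched or compressed in time, which is useful when the same driving maneuver is performed at different speeds. The sketch below shows only the classic DTW distance on one-dimensional sequences; it is not the bagging tree method of [40], and the sample sequences are invented.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


template = [1, 2, 3]             # hypothetical reference pattern
slow_version = [1, 2, 2, 3]      # same shape, performed more slowly
print(dtw_distance(template, slow_version))  # → 0.0
print(dtw_distance([0, 0], [1, 1]))          # → 2.0
```

A distance of zero for the slowed pattern shows the property that motivates DTW: the alignment absorbs differences in timing, so only differences in shape contribute to the distance.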

d. Feature Selection Problem
Feature selection is one of the basic concepts in machine learning; it strongly affects model performance by selecting the features that contribute most to the variable being predicted. It has therefore proved to be an effective means of data preparation, since irrelevant features are discarded [41]. Some authors have worked on improving recognition accuracy by using or proposing feature selection techniques. For instance, Capela et al. [42] evaluated machine learning algorithms on feature subsets selected from features calculated from smartphone sensor data collected from three populations: able-bodied people, elderly people, and stroke patients. The feature selection methods used were Relief-F, Correlation-based Feature Selection (CFS), and Fast Correlation Based Filter (FCBF); FCBF achieved the highest accuracies. Ahmed et al. [43] proposed a hybrid of filter and wrapper feature selection approaches. First, basic time and frequency functions are applied to extract heterogeneous features; then a hybrid approach is applied in which a sequential floating forward search (SFFS) is used. The new hybrid approach improves average classification performance compared with other feature selection algorithms such as MC-SVM and CAT.
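The filter style of feature selection can be sketched with a simple ranking by absolute Pearson correlation between each feature and the class label. This is a deliberately simpler filter than Relief-F, CFS, or FCBF, and it is not the hybrid method of [43]; the feature names and values below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def rank_features(rows, labels, names):
    """Order feature names from most to least correlated with the labels."""
    scores = {}
    for col, name in enumerate(names):
        column = [row[col] for row in rows]
        scores[name] = abs(pearson(column, labels))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical windows: label 0 = stationary, 1 = moving.
names = ["accel_std", "noise", "constant"]
rows = [(0.1, 5.0, 1.0), (0.2, 3.0, 1.0), (2.0, 4.5, 1.0), (2.2, 3.1, 1.0)]
labels = [0, 0, 1, 1]

print(rank_features(rows, labels, names))
```

A filter of this kind scores each feature independently of any classifier, whereas a wrapper method evaluates candidate subsets by training a model on them, which is more expensive but accounts for interactions between features.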

e. Smart Device Position
The position of the smartphone or smartwatch and the location of the subject play a vital role in activity recognition results. Havinga [44] studied the effect on recognition performance of placing the smartphone at two body positions, in the pocket and on the wrist. Havinga also analyzed the effect of increasing the window size and of combining sensors on various simple and complex activities. The results show that increasing the window size improved performance for simple activities, while sensor combinations improved the recognition of complex activities. Kwon et al. [45] proposed a system that collects data with smartwatch sensors and takes into account location information from three places (office, kitchen, and outdoors) to enhance the HAR system. Two models were evaluated, one with location information and one without. The results demonstrated that different activities can be classified with 95% accuracy. Ramos et al. [46] evaluated the effect of combining sensor data from smartwatches and smartphones on the accuracy of activity recognition approaches, by simultaneously collecting accelerometer data from smartphones and smartwatches as the input source. They concluded that recognition accuracy improves when smartphone and smartwatch data are merged. To eliminate the effect of orientation variations, Chen et al. [47] proposed a HAR system that uses smartphone sensors and accounts for placement, orientation, and subject variations by combining Coordinate Transformation and Principal Component Analysis (CT-PCA).
To overcome the limitation of fixing the smartphone at a specific position on the body in order to facilitate activity classification, Muslim et al. [48] proposed using a smartwatch fixed on the ankle in addition to the smartphone. This combination results in accurate activity recognition. Features are extracted on the smartwatch for each window and sent to the smartphone via Bluetooth.
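Several of the studies above extract features per window, so the window size is a parameter of the whole pipeline. The following sketch shows one common way of cutting a sample stream into fixed-size, overlapping windows; the 50% default overlap and the function name are assumptions, since the cited works do not fix these choices here.

```python
def sliding_windows(samples, window_size, overlap=0.5):
    """Split a sample stream into fixed-size windows with fractional overlap."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window_size * (1 - overlap)))
    last_start = len(samples) - window_size
    # Trailing samples that do not fill a whole window are dropped.
    return [samples[s:s + window_size] for s in range(0, last_start + 1, step)]


stream = list(range(10))  # stand-in for 10 consecutive sensor samples
print(sliding_windows(stream, window_size=4))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Larger windows capture more of a repetitive pattern but delay the decision and yield fewer training examples, which is consistent with the trade-off reported for simple versus complex activities.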

f. Authentication of Person
Weiss et al. [49] compared smartphone-based and smartwatch-based activity recognition for hand-oriented and non-hand-oriented activities, using personal and impersonal models. They investigated how a smartphone and a smartwatch can be used to identify a person from his or her eating habits. The results showed that the smartwatch identifies hand-based activities (e.g., drinking) more accurately than the smartphone; in addition, the watch accelerometer gives much better results than the phone accelerometer, and the watch gyroscope performs much more poorly than the watch accelerometer.
Typing, swiping, moving the arm while walking, and other related activities represent a behavioral footprint and play a useful role in user authentication, as in the work of Zheng et al. [50]. Nevertheless, there are still challenges that have not been investigated, for instance the availability of these footprints during interaction with the smartphone and of labeled data for verification under different contexts of phone usage. Several behavioral footprints have been investigated for continuously authenticating persons [51]. Balagani et al. [52] studied phone movement patterns, concentrating on patterns during walking and sitting for continuous authentication of smartphone users by employing Hand Movement, Orientation, and Grasp (HMOG) as a set of behavioral features, whereas Kumar et al. [53] proposed an authentication system based on phone movement patterns during typing or swiping, collected from a diverse population in unrestricted environments. Their results indicated that such patterns may not be adequate for certain types of user and would yield a high error rate, so authentication systems based on phone movement patterns may not be suitable for every smartphone user. In another work, Tang et al. [54] investigated phone movement patterns under three conditions: static (e.g., standing, sitting), dynamic (e.g., going upstairs, jogging), and total, which merges static and dynamic activities with postural transitions (e.g., stand-to-sit or sit-to-stand). The results demonstrate that static and total activities can be used for person identification by extracting suitable features and applying suitable classifiers. Murmuria et al. [55] investigated continuous authentication based on power consumption, touch gestures, and physical movement. Singha et al. [56] proposed an authentication system that identifies a person using data from the phone's accelerometer.
The system runs in the background without requiring any additional action from the user. It achieved high accuracy, which shows the possibility of combining accelerometer-based person recognition with biometric authentication. Bayat et al. [57] proposed a recognition system with a new digital low-pass filter that separates gravity acceleration from body acceleration in the raw data. The filter separates the AC component from the DC component, and the magnitude Am is calculated; AC and Am are then used for feature construction. Weiss et al. [58] investigated the use of the accelerometer and gyroscope in combination, on a smartphone and a smartwatch, to evaluate activities for biometric identification as well as biometric authentication. Classification algorithms are used to generate the authentication and identification models. The results indicate that the best biometric performance is obtained when the smartwatch and smartphone are used together.
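The separation of gravity (DC) from body acceleration (AC) described for [57] can be sketched with a first-order low-pass filter. The filter actually designed in [57] is not specified in this review, so the exponential smoothing below, the smoothing factor alpha, and the choice to compute the magnitude from the AC component are all assumptions.

```python
import math


def split_gravity(samples, alpha=0.9):
    """Split 3-axis samples into a slow (DC, gravity) and a fast (AC, body) part.

    alpha close to 1 gives a lower cut-off frequency, i.e. a smoother DC part.
    """
    gravity = list(samples[0])
    dc, ac = [], []
    for sample in samples:
        # First-order low-pass filter (exponential smoothing) per axis.
        gravity = [alpha * g + (1 - alpha) * v for g, v in zip(gravity, sample)]
        dc.append(tuple(gravity))
        ac.append(tuple(v - g for v, g in zip(sample, gravity)))
    return dc, ac


def magnitudes(vectors):
    """Euclidean magnitude of each 3-axis vector."""
    return [math.sqrt(sum(c * c for c in v)) for v in vectors]


# Hypothetical recording: gravity on z plus a small oscillation on x.
raw = [(0.5 * math.sin(t / 2.0), 0.0, 9.81) for t in range(40)]
dc, ac = split_gravity(raw)
print(dc[-1])               # slow component, dominated by gravity
print(magnitudes(ac)[-5:])  # magnitude of the body-motion component
```

Features such as those in the time and frequency domains can then be computed on the AC component and on the magnitude series.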
Supervised machine learning is the commonly used approach; recently, however, active learning has proved effective in generating activity recognition models from smaller training datasets. Shahmohammadi et al. [59] showed how a smartwatch can be used with an active learning method to recognize daily human activities. The results showed that this system achieved high accuracy, and that active learning performs better with a smartwatch than with a smartphone. Table 3 lists the machine learning classification approaches used in the literature, whereas Table 4 summarizes the works on HAR with machine learning classification algorithms.
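Active learning reduces the labeling effort by asking the user to label only the samples about which the current model is least certain. The sketch below shows one common query strategy, margin sampling, applied to class probabilities that a model has already predicted for an unlabeled pool. The query strategy used in [59] is not stated in this review, so this choice, the function names, and the probability values are assumptions.

```python
def margin(probabilities):
    """Gap between the two most probable classes; small means uncertain."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_queries(pool_probabilities, budget):
    """Indices of the `budget` most uncertain unlabeled samples."""
    ranked = sorted(
        range(len(pool_probabilities)),
        key=lambda i: margin(pool_probabilities[i]),
    )
    return ranked[:budget]


# Hypothetical predicted probabilities for four unlabeled windows
# over three activity classes.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]
print(select_queries(pool, budget=2))  # → [2, 1]
```

The selected windows would be shown to the user for labeling, added to the training set, and the model retrained before the next round of queries.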

Deep Learning Model
Traditional approaches to the HAR problem consist of two parts: feature extraction and classification. Machine learning techniques rely on hand-crafted features, which depend heavily on human domain expertise; feature engineering is therefore time consuming, and only shallow features can be learned by those approaches [60]. Recently, deep learning approaches have addressed these flaws and shown great success in improving recognition accuracy. Deep learning merges the two steps within a neural network that learns features automatically [61]. The Convolutional Neural Network (CNN) is one of the common deep learning methods; it is well suited to analyzing sequential data, and its success depends on using hierarchies of convolutional filters to extract complex feature representations [62]. With the rise of deep learning, many ideas have been published to address the HAR problem. Several works focused on deep learning approaches and employed machine learning methods for comparison or evaluation. This section summarizes the existing work that applies deep models to HAR.
Location and navigation services have recently come to be regarded as standard smartphone features as a consequence of the growth in smartphone capabilities. Recognition of pedestrian activity is important in pedestrian navigation [63]. Ye et al. [64] presented a strategy for real-time human activity recognition using deep learning algorithms and smartphone MEMS (Micro-Electro-Mechanical System) sensor measurements, with four main experiments: recognition of pedestrian motion mode, smartphone posture, real-time comprehensive pedestrian activity, and pedestrian navigation. Long Short-Term Memory (LSTM) and CNN networks were trained and deployed on an Android smartphone to recognize pedestrian activity in real time.
Other works exploited deep models to solve other HAR problems. Chen et al. [65] proposed a framework that fuses engineered features with features learned automatically by a deep algorithm, and developed a maximum full a posteriori (MFAP) algorithm to improve HAR performance. The proposed method performed well on both self-collected and public data. Radu et al. [66] proposed a deep learning architecture to show the effect of integrating data from multiple sensors using Multi-Modal Restricted Boltzmann Machines (MM-RBMs), which had previously been used to fuse paired modalities involving text, in order to verify whether the approach suits sensing tasks. The MM-RBM is constructed with two hidden layers for each of the acceleration and angular velocity sensor data. Three shallow classifiers are employed for comparison with the proposed architecture. The results show that this performance is achieved without any hand-selected features. Liu et al. [67] proposed a method for recognizing human activity using a smartphone with high accuracy. Two models were implemented. The first is a machine learning model that uses SVM together with linear and Fisher linear discriminants for classification, where prediction is based on the extracted features. The second is a deep learning model in which the raw time-sequence data are normalized and passed through a CNN. The data entering the model have three components (number of samples, window size, and number of channels) and are reshaped before being fed into the model.
Yu et al. [68] proposed a bidirectional Long Short-Term Memory (bi-LSTM) structure for time series data collected from the smartphone accelerometer and gyroscope. The approach overcomes a limitation of baseline LSTM cells, in which the prediction of the current state depends on previous information only: a bi-LSTM can access both past and future information along the horizontal (time) direction, while information from lower layers arrives along the vertical direction. The results show that the new approach performed better than other classification approaches. In further work on bidirectional LSTMs, Zhao et al. [69] proposed a deep network, the Residual Bidirectional Long Short-Term Memory (Res-Bidir-LSTM), which combines Res-LSTM and Bidir-LSTM. Shakya et al. [70] compared HAR performance on two commonly used datasets: WISDM, collected from a smartphone accelerometer, and Shoaib, collected from 5 accelerometers fixed at 5 positions on the body. Different machine learning classifiers (e.g., KNN) and deep learning models (e.g., CNN, RNN) were applied to the two datasets. The results indicated that the DL models, which used no hand-crafted features, outperformed the traditional ML classifiers. Benavidez et al. [71] applied CNN and LSTM approaches to classify activities collected from phone and watch sensors. The comparison shows that the LSTM performs better than the CNN; it also found that neither model can discriminate between similar hand movement activities on the phone dataset, such as eating different things, or between kicking and catching on the watch dataset.
Liu et al. [72] proposed an approach for predicting human activities on a dataset collected from smartphone sensors. Two kinds of features were extracted from the raw data and analyzed with machine learning classifiers, while a CNN model was applied to the raw data alone to analyze recognition performance. The results indicated that using additional features together with the CNN model enhances prediction. Almaslukh et al. [73] proposed an architecture that uses a CNN with statistical features for position detection and position-aware HAR. Position-aware recognition is performed with three classifier levels: the first discriminates between static and dynamic activities, the second detects the sensor position, and the third is a group of classifiers, each representing a specific location. The results showed that the proposed model performed well compared with other machine learning methods. Ronao et al. [74] proposed a convolutional neural network (Convnet) for smartphone-based HAR and studied its effect on extracting robust features. The experimental results showed that altering the structure of the Convnet affects recognition performance: adding layers derives more complex features, and increasing the filter size and adopting a small pooling size enhance accuracy. Table 5 summarizes the works on HAR with deep models.
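The convolution and pooling operations that the CNN-based works above build on can be sketched in plain Python for a single window of shape (window size, number of channels); a batch of such windows gives the (samples, window size, channels) layout mentioned earlier. The kernel weights here are fixed by hand to show the mechanics, whereas a real network learns them during training, and the function names are assumptions.

```python
def conv1d(window, kernel, bias=0.0):
    """Valid 1-D convolution over time followed by ReLU.

    window: list of time steps, each a list of channel values.
    kernel: list of taps, each a list of per-channel weights.
    """
    k = len(kernel)
    out = []
    for start in range(len(window) - k + 1):
        total = bias
        for tap in range(k):
            total += sum(w * x for w, x in zip(kernel[tap], window[start + tap]))
        out.append(max(0.0, total))  # ReLU activation
    return out


def max_pool(values, size=2):
    """Non-overlapping max pooling; a trailing partial block is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical window: 5 time steps, 2 channels (only channel 0 varies).
window = [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]
# Hand-set kernel that responds to an increase in channel 0.
kernel = [[-1, 0], [1, 0]]

feature_map = conv1d(window, kernel)
print(feature_map)            # → [1.0, 1.0, 1.0, 1.0]
print(max_pool(feature_map))  # → [1.0, 1.0]
```

Stacking several such layers lets later filters combine the responses of earlier ones, which is how deeper structures derive more complex features.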

Conclusions and Challenges
This paper presents a comprehensive overview of current work on HAR using smartphones and/or smartwatches with inertial sensors. First, the concept of human activity was explained, followed by a description of the public datasets and of the sensor modalities available in smartwatches and smartphones. The paper then described the traditional methodology, which relies on shallow machine learning algorithms, and the methodology that relies on deep learning algorithms, both commonly used to recognize human activities. The paper thus provides a representative picture of the HAR area in the context of smartwatches and smartphones with inertial sensors. From this review, some gaps and further challenges need to be taken into consideration. For activity recognition there is no single best approach, so various factors need to be weighed when selecting a method for a particular application. Some classification methods, such as decision trees, may overfit, whereas SVM may underfit; the method must therefore be chosen in accordance with the data. For any classification model, reducing time complexity tends to reduce accuracy, whereas good accuracy may come at the cost of unacceptable time complexity. Deep learning methods were adopted to decrease the time and computational complexity of feature engineering in machine learning. The position of the smartphone and/or smartwatch plays a role in activity recognition. Combining sensor data from a smartphone and a smartwatch produces good results. A smartphone is convenient for non-hand-based activities such as "walking", whereas a smartwatch is convenient for hand-based activities such as "eating". More research implements HAR with smartphones than with smartwatches.
There are limitations on using smartwatches and smartphones to track behavior in safety applications such as detecting unsafe driving, since accuracy decreases if the driver steers with the watchless hand, and a misclassification rate of 1 in 10000 is not acceptable if the consequence is loss of life. HAR results are usually obtained on standard datasets and differ when real-time data are used. Despite improvements in the field of HAR, some challenges still need to be explored. With sensor-based activity recognition, feature extraction may be difficult because different activities, such as running and walking, have similar characteristics. Recent research focuses on recognizing simple activities that consist of repeated actions, such as walking; the challenge is therefore to recognize more complex or composite activities. Activities are assumed to be performed sequentially, yet a person can perform more than one activity at the same time, so recognizing concurrent activities is also a challenge. Employing HAR as an authentication system is still an open challenge, because the resulting high error rate has yet to be addressed.