INTRODUCTION
Human activity recognition (HAR) aims to model user behavior and automatically identify the tasks people perform by observing and analyzing their actions (Saini et al., 2018; Uzunovic et al., 2018; Brophy et al., 2018), which enables the recognition of people’s activities, identities, personalities, and psychological states (Vrigkas et al., 2015).
In recent years, HAR has become an area of constant exploration in different fields, and its applications are a current research subject, as it helps automate processes and activities that may go unnoticed by the human eye or constitute tedious tasks. For Lohit et al. (2018), a human pose conveys the configuration of the body parts and carries implicit predictive information on a person’s subsequent movement, dynamic information that may be used in various applications. A review of the literature shows that HAR is in growing demand in the fields of entertainment (Lawrence et al., 2010; Han et al., 2013; Akhavian & Behzadan, 2016); video surveillance systems (Ryoo, 2011; Preis et al., 2012; Y. Liu et al., 2014; Ben Mabrouk & Zagrouba, 2018; Cosar et al., 2017; J.-W. Hsieh et al., 2014); emergency rescue and emergency robotics (Durrant-Whyte et al., 2012); and smart cities, sports performance, military applications, medical monitoring for elderly care, and diverse healthcare applications (Banos et al., 2012; Chen et al., 2012; Avci et al., 2010; Kim et al., 2010; Sazonov et al., 2011; Ismail et al., 2015; Rafferty et al., 2017), among others (Elbasiony & Gomaa, 2020). The common factor of research on HAR is the set of problems under study, which involve recognizing a specific activity under challenging conditions such as weather, object occlusion, and lighting, among others. Moreover, an activity may vary from one person to another (Y. Yang et al., 2019; Kahani et al., 2019). Thus, it is essential to find different ways to optimize the recognition of human activities.
Modern and efficient methods for healthcare are now being proposed, such as the use of blockchain and the Internet of Things (Pava et al., 2021). However, this review prioritizes the works and advances on HAR, specifically elderly fall detection, since, according to the World Health Organization (WHO, 2015), the proportion of the world’s population over 60 years of age will double from 11 to 22% between 2000 and 2050. In absolute numbers, this age group will grow from 605 million to 2 billion over the course of half a century. As a consequence, caring for and monitoring the health of the elderly will become an essential and daunting task. Li et al. (2018) state that approximately 58% of adults over 80 years of age who suffer a severe fall pass away due to physical trauma, mild traumatic brain injury, or hip fracture, among other causes, which may discourage this population from exercising. A sedentary lifestyle in the elderly is another problem that entails further health consequences, such as obesity and cardiovascular disease (Suto & Oniga, 2019).
As exercise and movement are vital for the elderly, monitoring and recognizing their daily activities is essential to provide them with proper healthcare. According to Li et al. (2018), the automatic detection of falls or movements that may affect health can significantly reduce the consequences of an incident. It may also allow tracking and reporting anomalies in the normal daily behavior patterns of adults with a high risk of falling.
This document presents a review of the state of the art regarding 1) the classification of human activities, 2) the methods for HAR information acquisition, and 3) the methods that have been used for feature extraction from videos and images in order to recognize the activities of the elderly.
This research was conducted following the criteria for review methodology and document analysis (Barbosa-Chacón et al., 2013), dividing the research process into the heuristics and hermeneutics of the different information sources.
CLASSIFICATION OF HUMAN ACTIVITIES
Human activity recognition is a current research topic due to its various applications in the entertainment industry, video surveillance, healthcare, robotics, smart cities, sports performance, and military applications. Therefore, the main objective of this work is to review the field of HAR with a focus on elderly care.
Human activities are classified depending on their complexity and duration. Hassan et al. (2018) divide them into three types: short-term activities, such as the transition between sitting and standing up; basic activities, such as walking and reading; and complex activities, which involve scenarios where there is an interaction with objects or people. On the other hand, Vrigkas et al. (2015) propose another classification mechanism that also considers complexity. A summary of this classification is shown in Table 1.
Table 1. Classification of human activities according to their complexity

Classification | Description |
---|---|
Gestures | Primitive movements of a person’s body parts, which correspond to a particular action. |
Atomic actions | A person’s movements that are part of more complex activities. |
Human-object or human-human interaction | Human activities that involve two or more people or objects. |
Group activities | Activities carried out by groups of people. |
Behaviors | Physical activities associated with the feelings, personality, and psychological state of an individual. |
Events | High-level activities that describe social actions between individuals and indicate a person’s social roles. |

Source: Vrigkas et al. (2015)
Research related to recognizing activities carried out by the elderly focuses on identifying short-term and basic, specific activities. An example of this is the work by Khan and Sohn (2011), which aimed to detect six specific activities (forward falls, backward falls, chest pain, fainting, vomiting, and headaches). On the other hand, Ma et al. (2014) attempted to recognize six other activities (falling, bending, sitting, squatting, walking, and lying down). In turn, Amiri et al. (2014) increased the number of activities to be recognized (a person cleaning a table, drinking, taking or dropping an object, reading, sitting, standing up, writing, using a phone, and falling), which also increased the difficulty and ambiguity of the system due to occlusion issues and the similarity between actions (Yu et al., 2013).
INFORMATION ACQUISITION METHODS
The first step to recognize a given human activity is obtaining information for subsequent processing, which may be carried out in different ways. The first method is based on environmental sensors, such as pressure, acoustic, and electromyography sensors, among others, which may be integrated and distributed around the environment where different activities are to be identified (L. Yang et al., 2016). Using multiple sensors may entail high costs and could be an intrusive method. Aspects such as the deployment and availability of different types of sensors should also be taken into account, especially in underdeveloped territories, as discussed by Nivia-Vargas and Jaramillo-Jaramillo (2018).
The second method receives information through wearable sensors such as contact sensors, gyroscopes, and accelerometers (Rosati et al., 2018). Sensor-based methods face a specific set of difficulties when recognizing elderly activities. According to Khan and Sohn (2013), the elderly often forget to wear portable sensors, and wearing them on different parts of the body causes frustration, since it limits their movement.
On the other hand, Kwolek and Kepski (2014) argue that the vast majority of elderly people do not enjoy using sensors, as they generate excessive false alarms. Some daily activities are wrongly detected as falls, which may also frustrate users.
This article delves into the third information acquisition method: incorporating computer vision using cameras, depth sensors, and image processing techniques. According to Yu et al. (2013) and Amiri et al. (2014), this is a non-intrusive method that can extract a large amount of information in comparison with portable sensor methods. Furthermore, it is not easily affected by noise in the environment. On that premise, Panahi and Ghods (2018) highlight the technological progress in extracting images from video using RGB (red, green, and blue) cameras or depth map images to determine the distances of objects or people. L. Yang et al. (2016) divide vision-based methods into three categories: methods using standard RGB cameras, 3D-based methods using multiple cameras, and 3D-based methods using depth cameras.
The vision-based method also has its limitations, which include a lack of privacy, as it implies having a camera in the environment at all times. Moreover, Concone et al. (2019) criticize its computational cost, since this method can rarely run in real time, and they highlight the fact that its performance strongly depends on the position of the cameras.
FEATURE EXTRACTION
Although HAR has been a continuous topic of research over the last decade, different aspects still hinder the accurate recognition of elderly activities. For computer vision methods, these include factors such as weather, object occlusion, lighting conditions, the similarity between some activities, clutter in the image background, privacy problems, and other specific difficulties that may cause false detections. For this reason, it is vital to study the different methods in order to optimize the recognition of these activities, especially regarding fall detection.
Some studies (Yu et al., 2013; Goudelis et al., 2015) argue that the most important step for successful activity recognition is selecting a method for feature extraction from an image or a video. Different methods have been proposed whose purpose is to effectively distinguish non-intentional actions such as falls from other daily activities. This review of the state of the art follows the classification of feature extraction methods presented by S. Zhang et al. (2017), which groups them by approach: local features, global features, and depth-based representations. The current methods based on convolutional neural networks are also discussed.
For Das Dawn and Shaikh (2016), the shape or edges of an object are relevant data that can be used to determine local features, whereas global information involves describing flow or movement in a video.
Global feature extraction
This method extracts global descriptions from videos and images, which, according to Zhang et al. (2017), makes it possible to localize the human subject and isolate them from the background, using subtraction methods to acquire their silhouette and shape. Other global representation methods are 3D space-time volumes, which track a person’s silhouette over a determined period of time. There is also the Fourier Transform method, which is based on monitoring the frequency content of a silhouette for activity recognition.
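As an illustration of this family of techniques, the following minimal Python sketch extracts a binary silhouette through background subtraction with OpenCV; the video file name and parameter values are assumptions for illustration, not the pipeline of any cited work.

```python
# A minimal silhouette extraction sketch via background subtraction (OpenCV).
# The input video path ("hallway.mp4") and parameters are illustrative.
import cv2

cap = cv2.VideoCapture("hallway.mp4")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # foreground mask; shadows appear as gray (127)
    _, silhouette = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    # Morphological opening removes small noise blobs from the silhouette.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    silhouette = cv2.morphologyEx(silhouette, cv2.MORPH_OPEN, kernel)

cap.release()
```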
Various other studies use global feature extraction to recognize human actions, especially for elderly fall detection. In general, research in this field takes advantage of the silhouette of the human body to reach its objective.
Elderly fall detection is the main objective of several works (Khan & Sohn, 2011, 2013; Yu et al., 2012, 2013; Foroughi et al., 2008), which focus on extracting the human silhouette for subsequent processing. These studies have several differences. Khan and Sohn (2011) use the human silhouette to extract information from the elderly through the R-transform, with its scale- and rotation-invariant characteristics, and kernel discriminant analysis (KDA), as they attempt to detect human falls while considering the different distances of people in front of the camera. On the other hand, the works by Yu et al. (2012, 2013) have several common factors: both techniques detect falls in the elderly, extract the adult’s silhouette, and calculate the human figure’s center of mass. Nonetheless, Yu et al. (2013) employ the method presented by Rougier et al. (2007) to extract and delimit people’s silhouettes via ellipse features, looking for the structural characteristics and shape of human actions and locating the silhouette’s centroid as a fall detection mechanism. Meanwhile, Yu et al. (2012) calculate the centroid of the human silhouette and identify the person’s orientation; to this effect, at least two synchronized cameras are needed in order to minimize occlusion. Finally, Foroughi et al. (2008) use the human silhouette as captured from videos or images to compute histograms of its segmented projection, analyzing temporal changes in an elderly person’s head in order to recognize a possible fall.
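The centroid and ellipse features used by several of these works can be computed from a binary silhouette with standard image moments. The following hedged sketch (OpenCV; the function name is ours) illustrates the idea rather than reproducing the cited methods.

```python
# An illustrative computation of silhouette centroid and orientation,
# in the spirit of the centroid/ellipse approaches discussed above.
import cv2
import numpy as np

def silhouette_features(silhouette: np.ndarray):
    """Return the centroid (cx, cy) and ellipse orientation of the largest blob."""
    contours, _ = cv2.findContours(silhouette, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    blob = max(contours, key=cv2.contourArea)
    m = cv2.moments(blob)
    if m["m00"] == 0 or len(blob) < 5:  # fitEllipse needs at least 5 points
        return None
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # center of mass
    (_, _), (_, _), angle = cv2.fitEllipse(blob)       # body orientation in degrees
    return (cx, cy), angle
```

A sudden drop of the centroid toward the bottom of the image, combined with a change in the ellipse orientation, is the kind of cue these fall detectors monitor.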
As the systems discussed above were implemented in different contexts, they reported different performances. However, their study areas were controlled environments such as small apartments with multiple cameras and few lighting changes, and their computational costs were high. For instance, Auvinet et al. (2011) used various cameras to extract 3D images, aiming to detect and analyze the volume of the elderly silhouette along the vertical axis, triggering a fall alarm when the volume distribution was abnormally close to the floor. This method reached a recognition effectiveness of 99.7% over an extended period, albeit using eight simultaneous cameras, which entailed a high computational cost with regard to synchronization and performance and made the system challenging to implement on a daily basis.
On the other hand, V. A. Nguyen et al. (2016) aimed to recognize indoor human actions, testing their method in different environments with natural lighting, varying shadows, and diverse daily activities, which caused several failures. Nonetheless, as their method is based on a single RGB camera, it is easy to implement and entails a low computational cost. Falls in the elderly are detected by analyzing movement orientation and magnitude, changes in the human shape, and movement in the image’s histogram. The authors suggest using additional techniques in future research, including head detection and inactivity zones.
Optical flow is a global extraction technique used to describe motion, even on moving or dynamic backgrounds. Efros et al. (2003) used this method to recognize actions performed by soccer players, tennis players, and ballet dancers in TV broadcasts. The authors suggest applying this technique to separate the dynamic background, focusing only on the athletes’ silhouettes.
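As a brief illustration, a dense optical flow field and a magnitude-weighted orientation histogram, a simple global motion descriptor in the spirit of these works, can be sketched with OpenCV's Farneback method; the frame file names are placeholders.

```python
# A dense optical flow sketch (Farneback's method) producing a simple
# global motion descriptor; frame paths are illustrative.
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Parameters: pyramid scale, levels, window size, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# Histogram of flow orientations weighted by magnitude: a global motion descriptor.
hist, _ = np.histogram(angle, bins=8, range=(0, 2 * np.pi), weights=magnitude)
descriptor = hist / (hist.sum() + 1e-8)
```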
Despite the fact that systems using global feature extraction have performed well in controlled environments, Zhang et al. (2017) have exposed the difficulties of these systems given their sensitivity to noise and viewpoint changes. Furthermore, authors such as Goudelis et al. (2015) have indicated that methods based on silhouettes and figures lack robustness and generalization, as they depend on an accurate extraction of the human silhouette and on different geometric transformations, which may be distorted by the distance and position of the subject.
Local feature extraction
Zhang et al. (2017) explain that this method focuses on specific local patches determined by interest point detectors or dense sampling, which densely cover the content of a video or an image. The first interest point detector was proposed by Harris and Stephens (1988) and is known for being an excellent corner detector; it gave rise to further research such as that of Laptev and Lindeberg (2003), who proposed 3D space-time interest points (STIP). The latter would become one of the main interest point detectors and inspire even further research (Chakraborty et al., 2012; Laptev, 2005; T. V. Nguyen et al., 2015) aimed at optimizing these techniques.
According to Das Dawn and Shaikh (2016), STIP is an essential technique for robust interest point extraction from a video or image in the space-time domain, targeting points such as corners, isolated points where the intensity is at a maximum or minimum, and even the endpoints of lines and curves.
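A minimal two-dimensional illustration of interest point detection, using the Harris detector that STIP extends to space-time, might look as follows; the image path and threshold are assumptions.

```python
# Harris corner detection as a 2D illustration of interest point extraction.
import cv2
import numpy as np

gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
# Keep points whose corner response exceeds 1% of the maximum (illustrative threshold).
interest_points = np.argwhere(response > 0.01 * response.max())  # (row, col) pairs
```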
Amiri et al. (2014) focused on simulating a smart home environment using two cameras and a Kinect sensor placed between them. Local feature extraction with space-time techniques was implemented using the Harris3D algorithm as a feature detector and STIP as a feature descriptor. The system’s main difficulties are occlusion problems and background clutter, since tracking the human body is a challenging and error-prone task. The fact that the Kinect sensor can only recognize skeletal information for objects in the range of 1.2 to 3.5 m may also have caused recognition problems. On the other hand, Berlin and John (2016) used Harris’s corner point detectors differently, including the histogram form of the diverse images in order to recognize different activities performed in two sets with controlled environments. The results showed 95 and 88% recognition rates for Set1 and Set2, respectively.
Conversely, Venkatesha and Turk (2010) attempted online human activity recognition, that is to say, without storing any video. Their system immediately learned the actions in the scene and classified them, considering the shape of human actions. They also used interest point extraction techniques while analyzing the image histogram in order to identify the action performed. This method showed a recognition effectiveness of 87% for non-complex actions. Meanwhile, Peng et al. (2016) obtained a similar recognition rate, albeit by combining local space-time characteristics with the construction of a visual dictionary, proposing a hybrid super vector.
Zhu et al. (2011) presented another technique based on recognizing an action through feature coding of local 3D space-time gradients within a sparse coding framework. Each space-time feature is thereby transformed into a linear combination of a few ‘atoms’ in a dictionary trained to detect local movement and appearance features. This method improves scale invariance, achieving the recognition of some basic activities.
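The sparse coding step can be illustrated with scikit-learn's dictionary learning. In the following hedged sketch, random vectors stand in for the local 3D gradient descriptors, so it shows the coding pattern rather than the implementation of Zhu et al. (2011).

```python
# Sparse coding of local descriptors over a learned dictionary of "atoms".
# Random vectors stand in for real 3D space-time gradient features.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((500, 64))  # 500 local descriptors, 64-D each

dico = DictionaryLearning(n_components=128, max_iter=10,
                          transform_algorithm="lasso_lars",
                          transform_alpha=0.1, random_state=0)
codes = dico.fit(descriptors).transform(descriptors)  # sparse codes, shape (500, 128)
print((codes != 0).mean())  # average fraction of active atoms per descriptor
```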
Considering the above studies and that by H.-B. Zhang et al. (2019), local feature extraction does not require pre-processing steps such as background segmentation or human detection. It also offers scale and rotation invariance, is stable under lighting changes, and is more resistant to occlusion than global feature extraction.
S. Zhang et al. (2017) highlighted the fact that, although these detectors achieve satisfactory results in HAR, they have a significant deficiency: the calculation of stable interest points is often inadequate, as "discriminative" and "correct" interest points are difficult to identify.
Similarly, H.-B. Zhang et al. (2019) point out difficulties with current local feature extraction methods, as they are easily affected by changes in camera view, background movement, and camera movement.
Depth-based feature extraction
The development of depth sensors such as the Microsoft Kinect (Shotton et al., 2011) has allowed greater access to depth maps and the real-time positions of skeletal joints, thus contributing to HAR via computer vision.
Various studies (X. Ma et al., 2014; Planinc & Kampel, 2013; Bogdan Kwolek & Kepski, 2014; Nizam et al., 2017; Mastorakis & Makris, 2014; Yao et al., 2017; Jalal et al., 2012) have used the Kinect sensor as an information acquisition instrument and employed its depth images for HAR. The difference lies in the characteristics that each researcher aimed to extract. For example, Ma et al. (2014) conducted a complex study aiming to recognize six human actions (falling, bending, sitting down, squatting, walking, and lying down) while combining global extraction techniques with depth images and analyzing changes in the human shape over short periods of time. On the other hand, Nizam et al. (2017) focused on extracting the elderly person’s center of mass and added the angle between the human body and the floor plane. If these data are below specific thresholds, a fall is detected.
On the other hand, Kwolek and Kepski (2014) complemented the use of depth images by calculating the distance from the human center of mass to the ground, using an accelerometer for elderly fall detection. In this approach, if the acceleration exceeds a threshold value, the person is considered to be in motion; at that moment, the depth sensor begins to extract information in order to detect a possible fall. However, the process requires calibrating the cameras and accelerometers, which increases its computational cost. Nizam et al. (2017) also used a Kinect sensor to study the speed and position of a person: if a high speed over a short time is detected, it is assumed that a fall has occurred, and the fall is confirmed or discarded by analyzing the position of the body. This system has an average precision of 93.94%.
Mastorakis and Makris (2014) attempted elderly fall detection by using a Kinect sensor to extract a 3D image of the environment, aiming to obtain a 3D bounding box surrounding the older person. When the bounding box changes its width, height, and depth, the speed of the change is analyzed; when the speed is higher than a certain threshold, it is considered that a fall has occurred. In turn, Yao et al. (2017) used depth images to extract information such as the movement of the human torso, the 3D positions of the central hip and shoulder joints, and the height of a person’s centroid. With this method, a fall is identified when the rates of change of these characteristics reach threshold values. Despite the robustness of this method, using only a Kinect makes the system dependent on the distance at which the sensor operates.
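The threshold logic shared by these depth-based approaches can be summarized in a short sketch: track the height of a reference joint (e.g., the center of mass or hip) and flag a fall when a rapid downward velocity ends near floor level. The joint source, thresholds, and frame rate below are illustrative assumptions, not values from the cited studies.

```python
# A hedged composite sketch of threshold-based fall detection on joint heights.
from collections import deque

HEIGHT_THRESHOLD_M = 0.35    # reference joint considered "close to the floor"
VELOCITY_THRESHOLD_MS = 1.0  # downward speed considered abnormal
FPS = 30                     # assumed frame rate of the depth sensor

history = deque(maxlen=FPS)  # heights (in meters) over the last second

def update(joint_height_m: float) -> bool:
    """Feed one frame's joint height; return True if a fall is detected."""
    history.append(joint_height_m)
    if len(history) < 2:
        return False
    window_s = len(history) / FPS
    downward_velocity = (history[0] - history[-1]) / window_s  # m/s
    return (downward_velocity > VELOCITY_THRESHOLD_MS
            and joint_height_m < HEIGHT_THRESHOLD_M)
```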
Unlike the aforementioned studies, Jalal et al. (2012) did not calculate the distance from any part of the human body to the ground. Instead, they combined the extraction of global features with depth data in order to recognize the elderly’s daily activities. To this effect, the R-transform was used to extract depth silhouettes of body parts, and a hidden Markov model was subsequently trained to recognize daily household activities. The results showed an average recognition rate of 96.55%.
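The train-and-score pattern behind such hidden Markov model recognizers can be sketched with the hmmlearn package; here, random vectors stand in for the depth silhouette features, so this illustrates the general approach rather than the system of Jalal et al. (2012).

```python
# One Gaussian HMM per activity; recognition picks the best-scoring model.
# Random sequences stand in for real depth silhouette feature vectors.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def train_activity_model(sequences):
    """Fit one HMM on a list of (frames, feature_dim) observation sequences."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

models = {name: train_activity_model(
              [rng.standard_normal((40, 10)) for _ in range(5)])
          for name in ("walking", "sitting", "lying")}

test_sequence = rng.standard_normal((40, 10))
predicted = max(models, key=lambda name: models[name].score(test_sequence))
```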
According to Ma et al. (2014), lighting is not a problem when extracting silhouettes, since the Kinect sensor uses infrared light. This is very advantageous, as the sensor can recognize human silhouettes even in the dark and extract information from the human skeleton for HAR (Yong Du et al., 2015). This technique has been widely applied in different studies (Keceli & Burak Can, 2013; Pazhoumand-Dar et al., 2015; X. Yang & Tian, 2014; Hbali et al., 2018). However, occlusion represents a problem with this approach, as recognition can be affected if the human body is occluded by any object. Therefore, several studies (Ni et al., 2013; Jalal et al., 2017; Liu & Shao, 2013) have merged spatio-temporal features from RGB cameras with depth data to reduce the occlusion problem. Merging data enlarges the processing volume and increases feature dimensions, which raises the computational complexity of the activity recognition algorithm.
Convolutional neural networks
Finally, the current state of the art highlights the growing importance and impact of convolutional neural networks (CNNs) for HAR, as well as their classification and optimization in recent years. Different authors have adopted CNNs as a recognition method. For instance, Y.-Z. Hsieh and Jeng (2018) applied a feedback CNN over optical flow in video streams, incorporating point estimation histograms and the boundaries of the moving object and of the subject in order to detect falls. Moreover, Yan et al. (2018) proposed a novel model of dynamic skeletons called Spatial Temporal Graph Convolutional Networks, which automatically learns the spatial and temporal patterns of the data, allowing for a higher generalization capacity. Xu et al. (2020) also based their research on mapping the human skeleton to predict falls using OPENPOSE, thus obtaining a skeletal map and transforming it into a dataset to then feed the CNN. Other CNN studies are based on a person’s movement: Wang et al. (2015) extracted trajectories in a determined scenario while attempting to recognize and classify different activities; Núñez-Marcos et al. (2017) used optical flow images as the neural network’s input, followed by a training phase to detect falls; and, similarly, Espinosa et al. (2019) fed optical flow to a CNN so that it learns motion rather than only static information. CNNs have also been used in studies that incorporate depth maps from a Kinect sensor for fall detection (Rahnemoonfar & Alkittawi, 2018; Adhikari et al., 2017). Adhikari et al. (2017) conclude that combining RGB image background subtraction and depth images with CNNs provides a possible solution for monitoring falls based on indoor videos.
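A minimal PyTorch sketch of the stacked-optical-flow idea follows; the architecture, input shapes, and class count are illustrative assumptions and do not reproduce any of the cited networks.

```python
# A small CNN over stacked optical flow frames (2 channels per frame: dx, dy).
import torch
import torch.nn as nn

class FlowCNN(nn.Module):
    def __init__(self, flow_frames: int = 10, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2 * flow_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)  # e.g., fall vs. no fall

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = FlowCNN()
logits = model(torch.randn(4, 20, 224, 224))  # batch of 4 ten-frame flow stacks
```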
Lu et al. (2017) used a three-dimensional convolutional neural network (3D-CNN) to extract spatial characteristics from 2D images, also incorporating video motion information to detect falls, thus reducing the failures caused by image noise, lighting variations, and occlusion. Similarly, C. Ma et al. (2019) employed a 3D-CNN, albeit optically hiding the perceivable facial regions in the video capture phase, thus helping to protect privacy while using surveillance cameras. Khraief et al. (2020) used a CNN with particular characteristics: a multi-stream CNN, that is to say, four CNNs fed by the same images but extracting different features from them (color, texture, depth, shape, and movement). Finally, they concatenated the four CNNs in order to obtain a unique classification of activities for fall detection.
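The basic spatio-temporal building block of such 3D-CNNs can be illustrated as follows; the clip dimensions are illustrative.

```python
# A 3D convolution over a short clip: (batch, channels, frames, height, width).
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # one RGB clip of 16 frames
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), padding=1)  # spatio-temporal kernel
features = conv3d(clip)  # shape: (1, 16, 16, 112, 112)
```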
A different CNN-based method to detect falls was proposed by Sreenidhi (2020), in which features were extracted from images of people falling. The CNN employed facial recognition because, as the author states, human facial expressions when falling are highly distinguishable.
When working with CNNs, a large amount of training data is required, which may be a disadvantage. For that reason, some authors (Cai et al., 2019; Khraief et al., 2019; X. Li et al., 2017) have used networks based on pre-trained architectures such as AlexNet, VGG16 (Krizhevsky et al., 2012), and ResNet (He et al., 2016).
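A common transfer learning recipe of this kind can be sketched with torchvision: load an ImageNet-pretrained backbone, freeze its features, and replace the final layer for binary fall classification. This reflects general practice, not the exact setups of the cited works.

```python
# Transfer learning sketch: pretrained ResNet-18 with a new fall/no-fall head.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the pretrained feature extractor
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # new trainable head
```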
According to El Kaid et al. (2019), although the application of CNNs to activity recognition has been successful, it has taken place in very restricted environments, and none of these networks are flexible enough to work well outside their domain. In this vein, the studies by Debard et al. (2016) and Fan et al. (2017) are concerned with how human action detection algorithms perform when considering real falls, global and local feature extraction, and feature extraction through CNNs.
Accordingly, the vast majority of studies focus on HAR using short video segments captured in artificial environments, under optimal conditions, and with falls simulated by actors. Thereupon, Debard et al. (2016) selected algorithms with a good activity recognition percentage on databases created in controlled or acted scenarios in order to implement them in real environments with real adult falls. The authors concluded that said algorithms did not retain the same efficiency, since they do not consider image quality, overexposure problems, occlusion, and changes in lighting conditions, thus demonstrating that not all the specifications for a robust system in real-world situations are met. Modern clustering techniques, such as the one proposed by Contreras-Contreras et al. (2022), could be applied to such databases.
Table 2 shows popular databases used by the research community that studies human falls, which were created in real or poorly controlled environments.
Table 2. Popular databases for the study of human falls

Database | Videos | Data provided | Environment | Population | Types of falls |
---|---|---|---|---|---|
URFD (Kwolek & Kepski, 2014) | 70 videos: 30 falls and 40 daily activities | RGB images, depth images, and accelerometer signals | Indoors | Adults | People falling while standing and while sitting on a chair |
LE2I (Charfi et al., 2013) | 191 videos: falls and daily activities | RGB images | Realistic indoor home and office environments with variable lighting, occlusion, and cluttered, textured backgrounds | Adults | Falls when walking, stumbling, and falling from chairs |
CMDFALL (COMVIS-PTIT, n.d.) | 600 videos with 20 human actions, including falls | RGB images, depth images, and accelerometer signals | Indoor home simulation | 30 men and 20 women between 21 and 40 years old | Falling backwards, forwards, to the left, and to the right |
FALL-UP (Martínez-Villaseñor et al., 2019) | 361 videos including falls and daily activities | RGB images, accelerometer signals, and signals from different sensors | Indoors | 17 adults between 18 and 24 years old | Different falls |
Multiple Cameras Fall Dataset (Auvinet et al., n.d.) | 192 videos including falls and daily activities | RGB images | Realistic indoor home environments with occlusion, clutter, texture, variable lighting, and background movement | Adults | Backward and forward falls |
UCF101 (Soomro et al., 2012) | 13,000 clips and 27 hours of video with 101 human actions | RGB images | Controlled and realistic environments, moving cameras, and cluttered backgrounds | Adults | Different falls |

Source: Authors
CONCLUSIONS
This work reviewed the progress made on human activity recognition with an emphasis on elderly falls, showing different devices and techniques to acquire data for subsequent processing and recognition. Furthermore, the main feature extraction methods used in the literature to detect human falls were presented.
Despite the fact that the vast majority of the proposed techniques have a high fall detection percentage and perform well in controlled environments, methods such as global feature extraction are highly sensitive to noise, occlusion, and viewpoint changes.
Similarly, local feature extraction shows a high deficiency when calculating correct interest points, which are also affected by changes in camera view, background movement, and camera movement. In addition, the main issue with depth-based feature extraction is occlusion, as it may affect human skeleton extraction. Current applications of convolutional neural networks have been successful; however, they take place in controlled and restricted environments, and their performance outside their domain is poor.
This work demonstrates the importance of an efficient fall detection method, as well as the great potential of this research area going forward. Although the different techniques proposed by the authors discussed in this paper obtain good results, the environments where these techniques have been tested are controlled or unrealistic, or they use simulated falls, which does not contribute to the real advancement of this field. Therefore, evaluating these studies in environments similar to reality would have a positive impact, which is why studies that focus on building databases with real adult falls in non-controlled environments become essential.