¹ This study was approved by the Institutional Review Board at our institution (IRB No. Pro2018001757).

VCHAR: Variance-Driven Complex Human Activity Recognition Framework with Generative Representation

Yuan Sun WINLAB, Rutgers University Piscataway, NJ, USA [email protected] , Navid Salami Pargoo WINLAB, Rutgers University Piscataway, NJ, USA [email protected] , Taqiya Ehsan WINLAB, Rutgers University Piscataway, NJ, USA [email protected] , Zhao Zhang WINLAB, Rutgers University Piscataway, NJ, USA [email protected] and Jorge Ortiz WINLAB, Rutgers University Piscataway, NJ, USA [email protected]

Abstract.

Complex human activity recognition (CHAR) remains a pivotal challenge within ubiquitous computing, especially in the context of smart environments. Existing studies typically require meticulous labeling of both atomic and complex activities, a task that is labor-intensive and prone to errors due to the scarcity and inaccuracies of available datasets. Most prior research has focused on datasets that either precisely label atomic activities or, at minimum, their sequence—approaches that are often impractical in real-world settings. In response, we introduce VCHAR (Variance-Driven Complex Human Activity Recognition), a novel framework that treats the outputs of atomic activities as a distribution over specified intervals. Leveraging generative methodologies, VCHAR elucidates the reasoning behind complex activity classifications through video-based explanations, accessible to users without prior machine learning expertise. Our evaluation across three publicly available datasets demonstrates that VCHAR enhances the accuracy of complex activity recognition without necessitating precise temporal or sequential labeling of atomic activities. Furthermore, user studies confirm that VCHAR’s explanations are more intelligible compared to existing methods, facilitating a broader understanding of complex activity recognition among non-experts.

Explainable IoT, Deep generative model, Hardware deep learning, Deep learning on small dataset

Journal: IMWUT. CCS Concepts: Computing methodologies → Machine learning; Human-centered computing → Visualization.

1. Introduction

In recent years, the proliferation of sensors across diverse settings has become ubiquitous, seamlessly integrating into the very fabric of daily life. These sensors are embedded in a wide array of devices such as smartphones, cameras, clothing, buildings, and vehicles, enabling continuous and pervasive data collection. This expansion has significantly propelled the field of activity recognition, placing it at the forefront of ubiquitous computing research. The potential applications of activity recognition are vast and impactful, encompassing areas such as healthcare (Do et al., 2013), elderly care (Rashidi and Mihailidis, 2012), surveillance, and emergency response (Zhang et al., 2008). The ability to monitor and understand human activities on such a scale holds immense utility, offering advancements in personal health monitoring, enhanced security systems, and more efficient emergency services.

However, alongside its rapid growth and integration into various domains, the field of activity recognition faces several substantial challenges. Chief among these is the issue of sensor data labeling, which is crucial for the development of accurate and reliable models. This process often encounters significant hurdles such as the absence of labels, incorrect labeling, or the extensive manual effort required for annotation (Zhao et al., 2011; Pan and Yang, 2009; Gomes et al., 2012). Furthermore, the models developed for activity recognition frequently remain complex and opaque, functioning as “black boxes” that are difficult for both experts and laypeople to interpret (Atzmueller and Roth-Berghofer, 2010; Ribeiro et al., 2016; Atzmueller, 2017). This complexity hinders transparency and assessment, posing a barrier to trust and understanding among users who are not well-versed in technical details.

To address these issues, there is a growing emphasis on the development of Explainable AI (XAI) methods. Such methods are crucial for demystifying AI decision-making processes, increasing transparency, and building trust among users (Zylowski, 2022). By making AI decisions more accessible and understandable, XAI not only enhances user confidence but also facilitates wider adoption and integration of these technologies across various sectors. This approach aims to ensure that as AI systems become more integrated into our lives, they do so in a manner that is both comprehensible and trustworthy, ultimately leading to more informed and accepting user interactions.

To unlock activity recognition’s full potential in ubiquitous computing, addressing labeling challenges and creating techniques that enable layperson understanding of model insights is paramount. This bridges the gap between raw sensor data and actionable information, facilitating the development of reliable, trustworthy, and deployable real-world activity recognition systems.

1.1. Challenges in Complex Human Activity Recognition

Traditional methods for Complex Human Activity Recognition (CHAR) typically necessitate precise labeling of each atomic activity within specific time slots to effectively train models. While some research attempts to incorporate conceptual frameworks, these approaches often require segmenting the data to enhance accuracy. Such segmentation demands detailed labeling of each atomic activity, including the elimination of transient states, which can be labor-intensive and prone to inaccuracies regarding the exact start and end points of activities.

In practical scenarios, real-life datasets typically categorize types of atomic or complex activities within specific collection intervals (see Fig. 1) (Lago et al., 2020; Shoaib et al., 2016b, a; Chen et al., 2021). While some datasets provide detailed atomic activity labels, these can often be erroneous or unreliable (Inoue et al., 2019; Kwon et al., 2019). Furthermore, some datasets only indicate the type of activities (Fig. 1), encompassing $n$ atomic activities where $m$ activities may occur concurrently, leading to $C(n,m)$ possible combinations. This underscores the combinatorial complexity faced when segments cannot be distinctly separated. Our framework is specifically designed to address such challenges by managing inseparable dataset segments and extending beyond merely detecting $n$ isolated activities. It is important to note that a prevalent assumption in the field suggests that the performance of machine learning models degrades as they become more explainable, especially when the model structures become intricate (Gunning, 2016). Additionally, these datasets often assign uncharacterized activities to generic categories like “others” or exclude them altogether, which presents significant labeling challenges and complicates the development of generalized solutions adaptable to real-world applications.
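To give a sense of scale for the $C(n,m)$ growth mentioned above, the short sketch below counts possible label sets with Python's standard `math.comb`. The values $n = 17$ and $m \le 3$ are illustrative assumptions, not figures taken from any of the datasets discussed here.

```python
import math


def label_combinations(n: int, m: int) -> int:
    """Number of ways m out of n atomic activities can co-occur: C(n, m)."""
    return math.comb(n, m)


def combinations_up_to(n: int, max_m: int) -> int:
    """Total number of non-empty label sets containing at most max_m activities."""
    return sum(math.comb(n, m) for m in range(1, max_m + 1))


if __name__ == "__main__":
    n = 17  # hypothetical number of atomic activities
    for m in range(1, 4):
        print(f"C({n}, {m}) = {label_combinations(n, m)}")
    print("label sets with at most 3 activities:", combinations_up_to(n, 3))
```

Even with these modest assumed values, the number of distinct label sets is far larger than the number of atomic activities, which is why enumerating each combination as its own class quickly becomes impractical.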

Moreover, significant challenges persist in representing the outputs of sensor-based models within the visual domain. Despite increasing interest in transforming sensor data into image representations to enhance layperson understanding of machine learning results, the development of visual domain representations has not kept pace (Arrotta et al., 2022; Hur et al., 2018; Aquino et al., 2023). This discrepancy highlights the urgent need for innovative approaches that effectively bridge the gap between sensor data and visual representation, thereby improving the interpretability and practical utility of Complex Human Activity Recognition (CHAR) systems for everyday users.

These challenges highlight the complexity of CHAR in real-life settings and underscore the necessity for advanced methodologies that can handle imprecise data and develop more intuitive output representations.

Figure 1. Standard complex activity datasets (left) typically provide detailed labels for each time interval to facilitate atomic activity training. In contrast, in-the-wild datasets (middle), constrained by labor capacity and other practical limitations, only specify the types of complex and atomic activities per segment, without specific time intervals or detailed atomic activity labels for time series segmentation. They often feature a greater variety of label combinations (right), reflecting the complexity and unpredictability of real-world scenarios.

1.2. Contributions

We developed the Variance-Driven Complex Human Activity Recognition (VCHAR) framework with Generative Representation to tackle prevalent issues in the recognition of complex human activities. VCHAR overcomes the limitations of traditional CHAR methods that require detailed and labor-intensive segmentation of activities. Instead, it utilizes a variance-driven approach that leverages the Kullback-Leibler divergence to approximate the distribution of atomic activity outputs. This method allows for the recognition of decisive atomic activities within specific time intervals without necessitating the removal of transient states or other irrelevant data, thereby enhancing the detection rates of complex activities even in the absence of detailed labeling of atomic activities. Our experiments demonstrate that even without precise labeling of atomic activities or without sequentially corrected labeling, our model effectively utilizes key concepts to enhance the detection rate of complex activities. Additionally, it provides a promising rate of atomic activity detection, which is crucial for accurately representing the data when transmitting outputs to the decoder.

Moreover, VCHAR introduces a novel generative decoder framework that transforms sensor-based model outputs into integrated visual domain representations. This includes detailed visualizations of both complex and atomic activities, alongside desired sensor-related information from the model. Utilizing a Language Model (LM) agent, the framework organizes diverse data sources and employs a Vision-Language Model (VLM) to generate comprehensive visual outputs. To facilitate rapid adaptation to specific smart space scenarios, we propose a pretrained “sensor-based foundation model” and implement a “one-shot tuning strategy” with masked guidance. Our experiments on three publicly available datasets demonstrate that VCHAR not only performs competitively with traditional methods but also significantly enhances the interpretability and usability of CHAR systems, as confirmed through human evaluation studies.

Our contributions are multi-faceted and can be summarized as follows:

• We introduce VCHAR, a variance-driven framework designed to generate visual domain representations of complex activity recognition. This system aims to make complex activity insights accessible to laypersons by visually representing the data in an intuitive manner.

• We utilize KL divergence as a loss function to model the dynamic relationships among varying combinations of atomic activities across different time intervals for the same type of complex activity.

• Our method proves effective in real-life scenarios with inaccurate or absent specific time labeling of activities. Results demonstrate that multitasking modeling enhances complex activity detection rates. Additionally, this multitasking modeling equips the decoder with features necessary to offer visual domain explanations accessible to laypersons.

• We propose a “sensor-based foundation model” framework with our masked one-shot tuning strategy that quickly adapts to specific smart space scenarios. An LLM agent guides this model to generate accessible visual domain representations, particularly benefiting users without technical expertise.

• We conducted experiments on three publicly available datasets, some with labeling issues. Our method demonstrated competitive results and user preference through a user study, showcasing its effectiveness.

2. Related work

2.0.1. Smart Space Complex Activity Recognition

Significant advancements have been made in recognizing both atomic and complex human activities using various sensor technologies and machine learning models. Bao et al. utilized multiple biaxial accelerometers along with decision tree classifiers, demonstrating that an increase in the number of accelerometers and subject-specific training enhances performance (Bao and Intille, 2004). Dernbach et al. found that while atomic activities could be accurately classified using a smartphone’s triaxial accelerometer and gyroscope with a Multi-layer Perceptron, complex activities posed greater challenges (Dernbach et al., 2012). Mekruksavanich et al. introduced a CNN-BiGRU model that effectively recognizes complex activities from wrist-worn sensors (Mekruksavanich and Jitpattanakul, 2021), while Tahvilian et al. compared the efficacy of CNN-LSTM and CNN-BiLSTM models for varying complexities of activities (Tahvilian et al., 2022). Additionally, Peng et al. proposed a multi-task learning approach using CNNs and LSTMs that improved performance on both atomic and complex activities by sharing a CNN feature extractor (Peng et al., 2018b). They also noted that complex activities, due to their intricate nature, are harder to recognize than atomic ones, resulting in lower performance (Peng et al., 2018a). This body of work collectively emphasizes the potential of integrating advanced computational models and diverse sensor data to enhance activity recognition in ubiquitous computing environments.

2.0.2. Visual Representation of Sensor Data

The transformation of sensor data into visual formats has gained significant attention, enabling the application of image classification techniques to sensor-based human activity recognition (Ha and Choi, 2016; Trabelsi et al., 2022). Researchers have explored various methods to facilitate this transformation. For instance, transforming sensor data into spectrogram images has allowed the use of deep learning models like CNNs for recognizing human activities (Ito et al., 2018; Jiang and Yin, 2015; Chen et al., 2021).

Further advancements in this field have led to the development of methods that enhance interpretability. Arrotta et al. (Arrotta et al., 2022) utilize CNNs and Explainable AI to translate sensor data into semantic maps for transparent activity recognition in smart homes, aiming to boost trust, particularly in healthcare monitoring scenarios. These semantic maps are further processed into natural language explanations, providing clear and understandable insights into the AI’s decision-making processes. Another approach by Jeya et al. (Jeyakumar et al., 2023) applies CTC loss to align detected activities with raw sensor signals, which are visualized on time-acceleration graphs and marked with dashed rectangles to improve the interpretability of complex activities.

Additionally, explainability in AI can be approached through model-agnostic methods such as LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017), which approximate the relationship between input and output predictions without accessing the model’s internal workings. Conversely, model-transparent methods like Grad-CAM++ (Chattopadhay et al., 2018) and saliency maps (Simonyan et al., 2013) provide insights into the internal processes of neural networks by visualizing the importance of input features and the activation values of hidden layers.

In response to the growing interest in making sensor data comprehensible through visual representations (Hur et al., 2018), our work develops vision domain representations directly from sensor data. Specifically, we aim to make the outcomes of activity recognition models accessible to laypersons by applying a video-based representation of sensor activation values. This approach bridges the gap in understanding complex models for users without technical expertise, enhancing user trust and engagement with the technology.

2.0.3. Foundation and Multimodal Models

Foundation models, the latest evolution in AI technologies, are trained on extensive, diverse datasets and are capable of being applied across a wide range of domains (Bommasani et al., 2021). These models, which are adaptable to numerous applications, highlight the forefront of AI research, benefiting from their training on vast and heterogeneous datasets.

Recent advancements have seen various multimodal models that leverage this foundation. Wang et al. (Wang et al., 2022b) developed a sequence-to-sequence framework that unifies diverse tasks across modalities using an instruction-based task representation, pretrained on image-text data for both crossmodal and unimodal tasks. Similarly, Lu et al. (Lu et al., 2022) introduced a transformer sequence-to-sequence model that performs a variety of vision and language tasks without requiring task-specific branches, trained on over 90 datasets related to vision and language.

Furthering the multimodal approach, Singh et al. (Singh et al., 2022) created a model with separate encoders for images, text, and multimodal integration, which was pretrained on unimodal and multimodal losses for 35 tasks across vision, language, and vision-language areas. Wang et al. (Wang et al., 2022a) proposed a multimodal vision-language model with a shared multiway transformer backbone, trained on masked data modeling across modalities, achieving state-of-the-art performance on various vision and vision-language benchmarks. Li et al. (Li et al., 2020) explored cross-modal contrastive learning to unify representations across modalities with a unified transformer on image and text data, enhancing the synergy between these modalities.

In this research, we harness the domain adaptation capabilities of generative models to customize the sensor decoder for specific scenarios of sensor data representation. This approach is designed to significantly enhance the visualization quality of the sensor model’s outputs.

3. Research Methods

This section systematically details the VCHAR framework, commencing with an architectural overview and operational functionality. We begin by outlining the fundamental architecture and principal features of VCHAR, providing a foundation for detailed exploration. This is followed by an in-depth analysis of the conceptual framework that forms the basis of VCHAR, discussing both the formulation of the problem it addresses and the integration of relevant theoretical concepts. Through this structured exposition, we aim to furnish a comprehensive understanding of VCHAR’s underlying principles and its functionality within practical applications.

3.1. Outline of the VCHAR Framework

The Variance-Driven Complex Human Activity Recognition (VCHAR) framework, as depicted in Fig. 3, is an end-to-end model designed to enhance both the prediction and explanation of complex human activities. This model uniquely employs the Kullback-Leibler (KL) divergence to approximate the distribution of atomic activities across various sliding window lengths. By leveraging a variance-driven approach, VCHAR circumvents the need for specific time-unit labels for atomic activities, focusing instead on the types of activities occurring within given time slots. Comparative experiments demonstrate that this approach significantly enhances the accuracy of complex activity detection relative to other methods.

The multitask design of VCHAR not only improves the recognition of complex activities but also establishes an “interface” facilitating detailed visualizations by a generative decoder. For instance, when VCHAR detects complex activities such as making a sandwich, it concurrently identifies related atomic activities like opening a door or turning a switch. It also provides a list of desired sensor-related information from the model, highlighting the significance of each sensor in the smart space environment.

To further enrich the model’s output, a Language Model (LM) agent is integrated to reorganize and elucidate detailed information about these activities. In practical applications, we propose a “sensor-based smart space foundation model” framework, capable of being tailored to specific scenarios through advanced techniques. Techniques such as Denoising Diffusion Implicit Models (DDIM) and Latent Diffusion Models (LDMs), coupled with a masked training strategy, are employed to refine the quality and relevance of the generated outputs.

This structured process ensures that VCHAR is not only effective in recognizing complex human activities but also adept at providing actionable insights into the dynamics of smart spaces, thereby enhancing user interaction and understanding.

3.2. Multi-Task Learning for Complex Activity Recognition

In the proposed model, multi-task learning is employed to facilitate the simultaneous recognition of atomic and complex activities. Atomic activities are defined as discrete actions that occur within a brief time window and are indivisible in the context of our dataset. Each complex activity, on the other hand, is composed of multiple atomic activities. Specifically, a complex activity in our model is defined as having more than 2 atomic activities, providing a more nuanced description of the grouped atomic activities.

For instance, within the Opportunity dataset 1, complex activities such as “making coffee” or “cleaning up” may include the atomic activity “open the door”. This activity is further detailed as “open the door 1” and “open the door 2”, enriching the model’s understanding of the scenario by supplying detailed contextual information. This approach allows the model to capture the intricate relationships and recurring patterns among activities, thereby enhancing its ability to accurately classify complex activity scenarios.

Our model analyzes raw sensor data, denoted by $\mathbf{x}$ , to predict the probabilities of each atomic activity occurring within a specific sliding window, alongside the classification of a complex activity. The goal is to output a probability vector $\mathbf{p}$ for atomic activities, where each element $p_{i}$ represents the likelihood of the $i^{th}$ atomic activity occurring. Additionally, the model outputs a categorical label $C$ that classifies the type of complex activity observed, which integrates the information from the atomic activities.

Formally, $\mathbf{x}$ represents the input vector of sensor readings. The output $\mathbf{p}$ is a vector of probabilities, with length $n$ , where $n$ is the total number of atomic activities the model is trained to recognize. Each element $p_{i}$ in $\mathbf{p}$ is a real number between 0 and 1, inclusive, indicating the probability that the $i^{th}$ atomic activity is present in the sliding window. The complex activity label $C$ , determined by these probabilities, provides a high-level classification based on the pattern of atomic activities.

For instance, if the model considers four atomic activities, and a particular observation through the sliding window suggests varying probabilities of these activities, the output vector $\mathbf{p}$ might look like $[0.95,0.80,0.10,0.05]$ . This vector indicates high probabilities for the first two activities and low probabilities for the others. The label $C$ then contextualizes these activities into a complex activity classification, providing a comprehensive understanding of the scenario captured by $\mathbf{x}$ .
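As a minimal sketch of how such an output vector might be read, the snippet below lists the atomic activities whose probability meets a cutoff. The activity names and the 0.5 threshold are hypothetical choices made for illustration; the source does not specify a thresholding rule.

```python
from typing import List, Sequence, Tuple


def present_activities(
    p: Sequence[float],
    names: Sequence[str],
    threshold: float = 0.5,  # assumed cutoff, not specified in the source
) -> List[Tuple[str, float]]:
    """Return (name, probability) pairs at or above the threshold, most likely first."""
    if len(p) != len(names):
        raise ValueError("p and names must have the same length")
    if any(not 0.0 <= value <= 1.0 for value in p):
        raise ValueError("each probability must lie in [0, 1]")
    picked = [(name, prob) for name, prob in zip(names, p) if prob >= threshold]
    return sorted(picked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical atomic activity names for the four-activity example in the text.
    names = ["open door", "toggle switch", "open drawer", "close fridge"]
    p = [0.95, 0.80, 0.10, 0.05]
    for name, prob in present_activities(p, names):
        print(f"{name}: {prob:.2f}")
```

With the example vector from the text, only the first two activities pass the assumed cutoff, matching the description of two likely and two unlikely activities.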

The primary training objective of our model is to minimize a loss function, $L$ , which effectively combines the losses from predicting the probabilities of atomic activities, $L_{atomic}$ , and from classifying complex activities, $L_{complex}$ . This combined loss function is defined as:

$$L=\alpha L_{\text{atomic}}(p_{\text{predict}},p_{\text{true}})+\beta L_{\text{complex}}(y_{\text{predict}},C_{\text{true}})$$

In this formula, $\alpha$ and $\beta$ are weighting coefficients that balance the importance of each component during the training process. Here, $p_{\text{predict}}$ and $p_{\text{true}}$ are the predicted and true distributions of the atomic activities occurring within a given context, while $y_{\text{predict}}$ and $C_{\text{true}}$ are the predicted class distribution and the actual label of the complex activity. The dual focus of this loss function encapsulates the essence of our multi-task learning approach, promoting an efficient and robust learning process that is well-suited for analyzing the nuanced dynamics of smart space sensor data. This methodology not only improves the predictive accuracy of both atomic and complex activity classifications but also ensures that the model can effectively discern the intricate relationships between these activity layers.

This modeling approach, with its probabilistic output for atomic activities, primarily facilitates the detection and understanding of complex activities, the main objective of our model. By analyzing the likelihoods of various atomic activities within a given context, our model uses these insights as supportive data to enhance the accuracy and reliability of complex activity classification. This method allows for a more nuanced interpretation of sensor data, ensuring that atomic activities serve to inform and refine our understanding of the broader, more intricate behavioral patterns represented by complex activities.

3.3. Loss of Atomic Activity Recognition

Our model’s primary objective is to predict the probability of each atomic activity within a sliding window of sensor data, aligning these predictions as closely as possible with the actual probabilities using the mean Kullback-Leibler (KL) divergence as the loss function. The KL divergence provides a robust metric for the average difference between the predicted probability distribution $p_{\text{predict}}$ and the true probability distribution $p_{\text{true}}$ , making it particularly suitable for datasets with varying class distributions.

Formally, the loss function $L_{\text{atomic}}$ for atomic activities is defined as the mean KL divergence between the true distribution $p_{\text{true}}$ and the predicted distribution $p_{\text{predict}}$, which can be expressed mathematically as:

$$L_{\text{atomic}}=L_{KL}(p_{\text{predict}},p_{\text{true}})=\frac{1}{N}\sum_{i=1}^{N}p_{\text{true},i}\log\frac{p_{\text{true},i}}{p_{\text{predict},i}}$$

where $N$ is the number of classes or atomic activities, $p_{\text{true},i}$ and $p_{\text{predict},i}$ represent the true and predicted probabilities of the $i^{th}$ atomic activity occurring within the sliding window, respectively. This formulation ensures that the model’s performance is evaluated based on the average divergence across all classes, promoting a balanced sensitivity to the accuracy of each class prediction.
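The formula above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: the `eps` clamp on predicted probabilities and the convention that terms with a zero true probability contribute nothing are choices made here for numerical safety.

```python
import math
from typing import Sequence


def mean_kl_divergence(
    p_true: Sequence[float],
    p_predict: Sequence[float],
    eps: float = 1e-12,  # assumed clamp to avoid log(0); not from the source
) -> float:
    """Compute (1/N) * sum_i p_true_i * log(p_true_i / p_predict_i)."""
    if len(p_true) != len(p_predict) or len(p_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, q in zip(p_true, p_predict):
        if t > 0.0:  # terms with p_true_i = 0 are treated as contributing 0
            total += t * math.log(t / max(q, eps))
    return total / len(p_true)


if __name__ == "__main__":
    print(mean_kl_divergence([0.5, 0.5], [0.5, 0.5]))    # identical inputs
    print(mean_kl_divergence([0.5, 0.5], [0.25, 0.75]))  # mismatched inputs
```

Identical inputs give a loss of zero, and the value grows as the prediction drifts from the target. Note that the usual non-negativity of KL divergence is only guaranteed when both inputs are normalized distributions.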

3.4. Loss of Complex Activity Recognition

Our model is specifically designed to classify complex activities by minimizing the cross-entropy loss, which measures the discrepancy between the predicted probabilities and the actual class labels for complex activities. This loss function is crucial for optimizing the model’s ability to accurately categorize complex activities based on sensor data inputs.

The cross-entropy loss for complex activity classification is formally defined as:

$$L_{\text{complex}}(y_{\text{predict}},C_{\text{true}})=-\sum_{j=1}^{M}C_{\text{true},j}\log y_{\text{predict},j}$$

where $y_{\text{predict}}$ is the predicted probability distribution across the complex activity classes, $C_{\text{true}}$ is the one-hot encoded vector of the true class label for the complex activity, and $M$ is the number of possible complex activity classes. Minimizing $L_{\text{complex}}$ during training ensures that the predicted probabilities align closely with the true class labels, effectively enhancing the model’s accuracy in complex activity recognition.
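For illustration, the cross-entropy term and its weighted combination with the atomic loss from Section 3.2 can be sketched as follows. The `eps` clamp, the default weights, and the numeric inputs are assumptions made for this example; the source does not report specific values for $\alpha$ and $\beta$ in this section.

```python
import math
from typing import Sequence


def cross_entropy(
    y_predict: Sequence[float],
    c_true: Sequence[float],
    eps: float = 1e-12,  # assumed clamp to avoid log(0)
) -> float:
    """Compute -sum_j c_true_j * log(y_predict_j) for a one-hot c_true."""
    if len(y_predict) != len(c_true) or len(c_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return -sum(
        c * math.log(max(y, eps)) for c, y in zip(c_true, y_predict) if c > 0
    )


def total_loss(
    l_atomic: float, l_complex: float, alpha: float = 1.0, beta: float = 1.0
) -> float:
    """Weighted sum alpha * L_atomic + beta * L_complex."""
    return alpha * l_atomic + beta * l_complex


if __name__ == "__main__":
    l_complex = cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])
    l_atomic = 0.07  # hypothetical atomic loss value for this example
    print(l_complex, total_loss(l_atomic, l_complex, alpha=0.5, beta=1.0))
```

Because the target is one-hot, the sum reduces to the negative log-probability assigned to the correct class, so a confident correct prediction yields a loss near zero.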

3.5. Sensor Encoder Architecture

To benchmark our approach against established baseline methodologies (Jeyakumar et al., 2023; Chen et al., 2021; Peng et al., 2018b; Singh et al., 2020), we employ the widely recognized ConvLSTM architecture, which is particularly adept at handling sensor time series data. This architecture synergistically combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to effectively extract and temporally analyze features from sensor data.

The ConvLSTM architecture operates in two primary phases:

(1) Feature Extraction: The CNN component is responsible for spatial feature extraction from each time slice of the sensor data. This step is crucial for identifying intricate patterns within the data that are spatially localized but temporally variant.

(2) Temporal Dependency Modeling: The LSTM layer processes the sequence of extracted features to capture temporal dependencies and dynamics, essential for understanding the progression and context of sensor readings over time.

In our enhanced model, we first conduct a channel-wise analysis to separately study the features from different sensor channels. These features are then integrated using a sensor fusion module, which synthesizes information across channels to provide a comprehensive feature set.

Following the fusion step, a bidirectional LSTM (biLSTM) module is employed to further refine the temporal analysis, enhancing the model’s ability to capture both forward and backward dependencies in the time series data. Additionally, a Multi-Layer Perceptron (MLP) module is incorporated to generate the distribution of atomic activities based on the extracted features. Finally, another LSTM layer is tailored to model and predict the complex activity outputs, synthesizing all prior analyses into a coherent activity prediction.
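One way to picture the data flow described above is to trace tensor shapes through each stage. The sketch below is only one plausible reading of the pipeline: every dimension (channel count, window length, filter count, kernel size, hidden sizes) is a hypothetical value, and the exact points at which the MLP and the final LSTM attach are assumptions rather than details confirmed by the text.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def conv1d_output_length(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution without dilation."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_shapes(
    num_channels: int,
    window_len: int,
    conv_filters: int,
    kernel: int,
    fused_dim: int,
    lstm_hidden: int,
    n_atomic: int,
    n_complex: int,
) -> List[Tuple[str, Shape]]:
    """List the (stage, shape) pairs for one sliding window, batch dimension omitted."""
    steps = conv1d_output_length(window_len, kernel)
    return [
        ("raw window (channels, time)", (num_channels, window_len)),
        ("channel-wise CNN features", (num_channels, steps, conv_filters)),
        ("sensor fusion", (steps, fused_dim)),
        ("biLSTM (forward + backward)", (steps, 2 * lstm_hidden)),
        ("MLP: atomic activity distribution", (n_atomic,)),
        ("LSTM head: complex activity scores", (n_complex,)),
    ]


if __name__ == "__main__":
    # All sizes below are hypothetical and chosen only for illustration.
    for stage, shape in trace_shapes(9, 30, 16, 5, 64, 32, 4, 3):
        print(f"{stage:38s} {shape}")
```

The doubling of the hidden size after the biLSTM reflects the concatenation of forward and backward states, which is what lets later stages use context from both directions of the window.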

Example Application: Consider a scenario involving the monitoring of elderly activities in a smart home environment. Our model processes data from various sensors (e.g., motion, door, and appliance usage sensors) through the described architecture. Initially, individual sensor channels are analyzed to detect basic movements and interactions. These are then fused and temporally analyzed to predict more complex activities, such as cooking or cleaning, demonstrating the model’s capability to discern nuanced human behaviors effectively.

This comprehensive approach allows us to not only match but also surpass the performance of existing methods in complex activity recognition, as evidenced by our comparative evaluations. The results confirm the superiority of our model in accurately detecting and predicting both atomic and complex activities, highlighting its potential for real-world application in ubiquitous computing environments.

3.6. Generative Modeling for Enhanced Complex Activity Representation

For users lacking technical expertise, grasping the intricacies of sensor encoder outputs can be challenging. To bridge this gap, we implement a generative modeling approach that transforms the identified atomic and complex activities into visual narratives. This transformation is facilitated by a Language Model (LM) agent, which interprets the sensor data, encompassing the distribution of atomic activities, the classification of complex activities, and the sensor activation patterns within the model.

To enhance the adaptability of our system across various datasets and smart spaces, we initially pre-train our model on a diverse set of scenarios. This pretraining encapsulates a universal concept of relationships and fundamental elements essential for activity recognition. When adapting to a specific dataset or smart environment, we employ the “one-shot tuning strategy”. This approach fine-tunes the pre-trained model, enabling it to generate tailored explanations that align with the unique context and requirements of the given application.

3.6.1. Pretraining a General-Purpose Sensor-Based Foundation Model Framework for CHAR

In line with the development of robust foundation models, our approach involves pretraining a sensor-based foundation model encapsulating a comprehensive suite of elements common to a wide array of scenarios. This model serves as a versatile starting point, designed to be fine-tuned subsequently to accommodate the specificities of distinct scenarios encountered by users.

The CHAR foundation model is imbued with a rich lexicon that describes a multitude of complex scenarios and human activities. It not only captures the essence of activity patterns but also identifies sensor-specific signatures that are pivotal for accurate activity detection. This capability ensures that, upon detection, the most relevant sensors are highlighted, providing intuitive visual cues within the generated explanations.

For instance, the foundation model is pretrained with elaborate activity narratives such as “morning routine”, “making a sandwich”, or “preparing cereal”. These narratives embody complex sequences situated within a broader context of atomic activity representations. Moreover, the model is attuned to discern and emphasize sensors that yield significant information during the activity recognition process. This level of detail furnishes users lacking domain expertise with rich, contextual insights into the sensor data and the detected activities.

The foundation model thus constitutes a preparatory step in building an adaptable, knowledgeable system for CHAR. Through fine-tuning, it can swiftly assimilate into any designated smart space environment, delivering bespoke explanations tailored to the unique configuration and operation of that space.

During the pretraining phase, the foundation model’s objective function aims to minimize the expected discrepancy between the model’s predictions and the ground truth. The foundation model processes three distinct inputs: the visual content $V$ , represented by images or videos; the query $Q$ , which contextualizes the visual content; and the sensor encoder output $E$ , which encapsulates the sensor-derived information. The objective function is thus characterized by the following minimization:

\min\mathbb{E}_{V,Q,E\sim\mathcal{D}}\left[\mathcal{L}\left(\text{GT},\Psi_{\theta}\left(V,Q,E\right)\right)\right]

Here, $\Psi_{\theta}$ denotes the predictive function of the foundation model parameterized by $\theta$ , GT represents the ground truth corresponding to the inputs, and $\mathcal{D}$ signifies the joint distribution of the visual, query, and sensor data. This optimization ensures that the foundation model is capable of integrating and interpreting multimodal inputs to generate a comprehensive and accurate depiction of the observed activities in preparation for fine-tuning to specific smart environment applications.

In our approach, $E$ symbolizes a set of features, including the types of activities identified and the sensor activation values, among others. To enhance interpretability, we convert these features into a structured prompt using a predefined template. For instance, given an input vector $[0.1,0.2,0.8,0.5]$ in which the third value is the highest sensor activation from the model, and knowing that the corresponding sensor is located on the arm, our template highlights this sensor in yellow within the generated video. Concurrently, the model identifies the complex activity, such as “making coffee”, and an atomic activity, such as “opening a door”. These elements are woven into a cohesive prompt: “Someone is opening the door, complex activity ‘making coffee’”. This narrative integration allows for a user-friendly representation of activities, facilitating a clear and intuitive understanding of the ongoing events as detected by the sensor system.
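The template step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_prompt`, the sensor names, and the returned dictionary layout are hypothetical, chosen only to mirror the example vector and prompt wording in the text.

```python
from typing import Dict, Sequence


def build_prompt(
    activations: Sequence[float],
    sensor_names: Sequence[str],
    atomic_activity: str,
    complex_activity: str,
) -> Dict[str, str]:
    """Turn encoder features into a text prompt plus the sensor to highlight.

    Hypothetical helper: the exact template used by VCHAR is not published here.
    """
    if len(activations) != len(sensor_names):
        raise ValueError("one sensor name is required per activation value")
    # Index of the largest activation; this is the sensor to highlight.
    top = max(range(len(activations)), key=lambda i: activations[i])
    prompt = f'Someone is {atomic_activity}, complex activity "{complex_activity}"'
    return {
        "prompt": prompt,
        "highlight_sensor": sensor_names[top],
        "highlight_color": "yellow",
    }


# Example vector from the text; the sensor names are illustrative assumptions.
example = build_prompt(
    [0.1, 0.2, 0.8, 0.5],
    ["torso", "wrist", "arm", "hip"],
    "opening the door",
    "making coffee",
)
```

Keeping the highlighted sensor separate from the prompt string is one possible design; it lets a rendering stage decide how to draw the highlight without parsing the text.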

3.6.2. One-Shot Fine-Tuning for Complex Activity Description Using DDIM

When transitioning our model to specific scenarios, it is common to encounter variations in the manifestation of activities and their corresponding complex contexts. To accommodate these scenario-specific nuances, we adopt a “one-shot tuning” strategy (Wu et al., 2023). This strategy rapidly recalibrates our sensor-based foundation model to align with the new scenario characteristics. This is particularly pertinent for incorporating new activity videos where the contextual dynamics may significantly differ.

To further enhance the model’s adaptability and descriptive capabilities, we introduce a masked training strategy. This approach facilitates the model’s ability to generalize across diverse descriptive modalities. During fine-tuning, carefully designed prompts address the functionalities of specifically masked regions within the video. These prompts serve a dual purpose: they guide the video generation process in the latent space and ensure that the model’s output is both representative and specific to the newly adapted scenario.

Employing this method enables the foundation model to not only recognize and adapt to new scenarios but also to generate activity representations that are rich in detail and contextually relevant.

Latent diffusion models (LDMs) (Rombach et al., 2022a) are pivotal in our approach, functioning directly within the latent space of new video embeddings. These models utilize an encoder to project videos into a latent space, facilitating manipulations within this reduced dimensionality before reconstruction. Specifically, an encoder $\mathcal{E}$ maps a video $V$ onto a latent representation $\mathbf{z}=\mathcal{E}(V)$ , and a decoder $\mathcal{D}$ subsequently restores this latent representation back to the video domain, approximating the target video, i.e., $\mathcal{D}(\mathcal{E}(V))\approx V$ .

In our case, when new activity types are introduced to the model, the LDMs adeptly handle the designated masked regions within the latent embeddings. The decoder $\mathcal{D}$ is then employed to regenerate the video, ensuring that it accurately reflects the complex activity type and the sensor activation values inside the model. The corresponding reconstruction objective is:

\min_{\mathcal{E},\mathcal{D}}\left\|V-\mathcal{D}(\mathcal{E}(V))\right\|^{2}

The Denoising Diffusion Implicit Model (DDIM) is integral to enhancing the efficiency of the video generation process, particularly in maintaining the structural integrity of the activity movements (Wu et al., 2023; Song et al., 2020). In our implementation, the DDIM operates within the latent space of a new video’s embedding. The process utilizes prompts that encapsulate masked patterns, denoted by $T$ , alongside a guiding prompt $T^{*}$ for directed generation. The final video $\mathbf{V}$ is produced by first inverting the latent embedding via $\text{DDIM}_{\text{inv}}$ , using the encoder’s output informed by the mask pattern and guiding prompt, followed by sampling with $\text{DDIM}_{\text{samp}}$ . This sequence can be represented mathematically as:

\mathbf{V}=\mathcal{D}\left(\text{DDIM}_{\text{samp}}\left(\text{DDIM}_{\text{inv}}\left(\mathcal{E}(\mathbf{V_{0}}),T,T^{*}\right)\right)\right).

Here, $\mathcal{E}$ signifies the encoding function, $\mathcal{D}$ the decoding function, and $\mathbf{V_{0}}$ the initial video input. The inclusion of both $T$ and $T^{*}$ ensures that the DDIM not only captures the masked patterns but also adheres to the specified activity prompts. This dual guidance mechanism facilitates precise control over the generative process within the latent space, yielding an output video that accurately reflects the desired activities. By operating in this fashion, our model not only retains the fidelity of the original video but also enriches it with detailed descriptions of activities, providing a comprehensive depiction that encompasses both the observed and inferred aspects of the scene.
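The inversion-then-sampling sequence can be illustrated with a toy NumPy sketch of the deterministic DDIM update (the $\eta=0$ case). Several things here are assumptions made for illustration: the noise schedule `alphas`, the function names, and above all the noise predictor `toy_eps`, which returns a fixed vector instead of the output of a prompt-conditioned network. With such a constant predictor the round trip is exact; with a learned predictor conditioned on $T$ and $T^{*}$, inversion is only approximate.

```python
import numpy as np


def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update between two cumulative noise levels."""
    # Predict the clean latent implied by the current noise estimate.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise the prediction to the target noise level.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def ddim_invert(z0, eps_model, alphas):
    """Walk a clean latent towards noise: alphas[0] -> alphas[-1]."""
    z = z0
    for t in range(len(alphas) - 1):
        z = ddim_step(z, eps_model(z, t), alphas[t], alphas[t + 1])
    return z


def ddim_sample(z_noisy, eps_model, alphas):
    """Walk a noisy latent back towards the clean end: alphas[-1] -> alphas[0]."""
    z = z_noisy
    for t in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, eps_model(z, t), alphas[t], alphas[t - 1])
    return z


rng = np.random.default_rng(0)
alphas = np.linspace(0.9999, 0.02, 101)  # assumed schedule, 100 steps
fixed_noise = rng.normal(size=(4, 8))


def toy_eps(z, t):
    # Stand-in for a prompt-conditioned noise predictor (assumption).
    return fixed_noise


z0 = rng.normal(size=(4, 8))  # plays the role of E(V_0)
z_noisy = ddim_invert(z0, toy_eps, alphas)
z_rec = ddim_sample(z_noisy, toy_eps, alphas)
```

In a full pipeline, `z_rec` would be passed to the decoder $\mathcal{D}$, and the two passes would use different conditioning so that the sampled latent reflects the guiding prompt rather than simply reproducing the input.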

Masked training can markedly expedite the training of substantial diffusion models such as Stable Diffusion (Zheng et al., 2023), without detriment to their generative capabilities. To hasten the learning process, we employ a masked strategy tailored to specific patterns desired for our sensor descriptions. For instance, as depicted in Figure 6, the output frame rate is set at 24 fps, enabling us to direct the model’s output toward varying scenarios. Additionally, a human study is conducted to further evaluate the effectiveness of these strategies.

3.6.3. Implementation Details

Our decoder leverages the Stable Diffusion model (Rombach et al., 2022b), initialized from pretrained weights available online. These weights, however, lack specific patterns crucial for representing complex scenarios in smart spaces, so we further pretrain the model on these missing concepts after loading the weights. Pretraining adjusts all model weights, whereas fine-tuning updates only several attention layers, following the one-shot tuning strategy (Wu et al., 2023), which significantly reduces GPU memory usage. This efficiency allows the use of standard gaming GPUs, such as the RTX series.

Specifically, the complete model requires 10,000 steps for pretraining to incorporate new features, and only 500 steps to fine-tune for video applications, with a batch size of one. For inference, the DDIM sampler requires 100 steps. We utilize an NVIDIA A100 GPU for pretraining and an RTX 6000 for fine-tuning. The entire pretraining process takes approximately 48 hours to integrate a new feature, while finetuning is completed in just 30 minutes. This rapid adaptability is particularly beneficial for individualized smart space applications, where conventional GPUs can effectively handle the adaptation of new features to the model. Such capabilities render our framework practical for real-life implementations.

4. EXPERIMENTS AND RESULTS

In this section, we detail the public datasets utilized in our experiments, providing a foundation for a comprehensive evaluation of our methodology. We will systematically discuss the outcomes of each experiment, highlighting the efficacy of our model across different scenarios. Additionally, results from human studies will be presented to demonstrate the practical effectiveness and user perception of our model. This approach allows us to present a well-rounded assessment of our framework’s performance and its applicability in real-world settings.

4.1. Dataset

For our experiments, we utilized three publicly available datasets: Opportunity (Chavarriaga et al., 2013), FallAllD (Saleh et al., 2020), and Cooking (Lago et al., 2020). Each dataset offers unique characteristics suited to testing the versatility of our model under different conditions. The Opportunity dataset provides labels for atomic activities corresponding to specific time intervals, facilitating precise activity recognition tests. In contrast, the Cooking and FallAllD datasets, which reflect real-life scenarios, label only the types of atomic and complex activities within each time interval, without specifying the exact timing of each atomic activity. These datasets merely indicate that a group of atomic activities occurs during the specified intervals, presenting a challenge in distinguishing individual actions within these periods.

4.1.1. Opportunity

Complex Activities	Atomic Activities
Making Coffee	Open Door 1, Close Drawer 1, Open Door 2, Open Drawer 2
Morning Routine	Close Door 1, Close Drawer 2, Close Door 2, Open Drawer 3
Cleaning Up	Drink, Open Fridge, Close Fridge, Clean Table, Open Dishwasher
Making Sandwich	Close Drawer 3, Close Dishwasher, Toggle Switch, Open Drawer 1

Table 1. Complex activities and their atomic activities in the Opportunity dataset. The dataset presents challenges in identifying activities due to the presence of similar atomic activities with different types and the possibility of the same atomic activity belonging to multiple complex activities. This hierarchical and overlapping nature of activities highlights the complexity of activity recognition in real-world scenarios.

The Opportunity dataset is a publicly accessible benchmark for human activity recognition algorithms, featuring data from 4 subjects out of an original 12 (Ciliberto et al., 2021). It includes 15 networked sensor systems with 72 sensors across ten modalities embedded in the environment, objects, and worn on the body. This study particularly focuses on inertial sensors placed on the left lower arm, left upper arm, right lower arm, right upper arm, and back of the torso, which record data via accelerometers, gyroscopes, and magnetometers.

We analyze 17 micro-activities and 4 complex activities that offer a comprehensive view of human motion (Table 1). The dataset’s design ensures a realistic portrayal of activities, with annotations at multiple abstraction levels suitable for testing advancements in sensor selection, feature extraction, classifier training, multimodal data fusion, segmentation, and hierarchical activity recognition. This rich framework makes the Opportunity dataset an excellent resource for evaluating the effectiveness of different activity recognition strategies under naturalistic conditions. We use 3 subjects for training and 1 subject for testing. We apply a 20-second sliding window for data processing, which yields a training set of shape (3036, 72, 600) and a testing set of shape (1299, 72, 600). This strategy is instrumental in capturing the temporal dynamics essential for our activity recognition tasks.
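A minimal sketch of sliding-window segmentation that produces arrays of the form (segments, channels, time steps) is given below. The 30 Hz rate is inferred from 600 time steps per 20-second window; the stride (50% overlap) and the function name `sliding_windows` are assumptions, since the text does not state them.

```python
import numpy as np


def sliding_windows(signal: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Cut a (channels, time) array into windows of shape (n, channels, window)."""
    channels, total = signal.shape
    if total < window:
        # Recording too short for even one full window.
        return np.empty((0, channels, window), dtype=signal.dtype)
    starts = range(0, total - window + 1, stride)
    return np.stack([signal[:, s:s + window] for s in starts])


SAMPLE_RATE_HZ = 30            # inferred: 600 steps / 20 s
WINDOW = 20 * SAMPLE_RATE_HZ   # 600 time steps per segment
STRIDE = WINDOW // 2           # hypothetical 50% overlap

# Synthetic stand-in for one recording with 72 sensor channels.
recording = np.arange(72 * 3000, dtype=float).reshape(72, 3000)
windows = sliding_windows(recording, WINDOW, STRIDE)
```

Incomplete trailing windows are dropped here rather than padded, which is one of several reasonable choices.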

4.1.2. FallAllD

Complex Activities	Atomic Activities
Falling prediction	stand, walk, sit, trip
Normal ADLs, no Falling detected	recovery, jog, slip, rotate

Table 2. Complex activities and their atomic activities in the FallAllD dataset. The dataset provides information about the types of atomic activities and their high-level activities. However, it does not include specific labeling for the atomic activities at each time unit. Some atomic activities, such as “walk to the chair and sit”, may belong to both the falling prediction and ADL categories, highlighting the hierarchical and overlapping nature of the activities. This poses challenges in accurately labeling and recognizing activities at a granular level.

The FallAllD dataset is a specialized resource tailored for research in fall detection, fall prevention, and human activity recognition, suitable for both classical and deep learning methodologies. Data collection involved three identical data-loggers worn by participants on the neck, wrist, and waist, each outfitted with an inertial module (comprising an accelerometer, gyroscope, and magnetometer) and a barometer.

Data were collected from 15 participants aged 21 to 53, resulting in 26,420 files, each 20 seconds in duration. Within this dataset, complex scenarios such as “fall” and “no fall” detection are highlighted, where atomic activities are indicative of both scenarios. For example, activities might include normal actions such as walking and sitting, as well as the same actions performed with an accidental fall. For our studies, we selected 8 activities (Table 2) that span both complex categories, allocating around 20% of the data for testing and 80% for training. In our methodology, we adopted a 10-second sliding window for data segmentation, resulting in a training set dimensionality of (4964, 12, 4760) and a testing set dimensionality of (1375, 12, 4760). This windowing technique is critical for capturing temporal patterns in the sensor data conducive to recognizing complex activities. This dataset does not include specific labels for atomic activities. Each file only provides information on the type of activities, device type, recording date, and subject number.

4.1.3. Cooking Activity

Complex Activities	Atomic Activities
Making Sandwich	add, cut, mix
Making Cereal	open, peel, pour
Making Fruit Salad	put, take, wash

Table 3. Complex activities and their associated atomic activities in the Cooking dataset. The dataset represents real-life scenarios and provides information about the types of atomic activities and their corresponding complex activities. However, it does not include specific time labels for each activity. The same atomic activity may belong to different complex activities, highlighting the versatility and reusability of actions across various cooking tasks. This lack of temporal information and the presence of overlapping activities pose challenges in precisely identifying and segmenting individual activities within the dataset.

The Cooking Activity Recognition Challenge dataset (Lago et al., 2020) is a multifaceted collection of sensory data, recorded via smartphones, wristwatches, and motion capture systems, designed specifically for the complex task of activity recognition. This dataset, procured from 4 subjects, comprises accelerometer data from the right arm, left hip, and both wrists, in addition to motion capture information from 29 distinct markers. It categorizes activities into three macro activities (making a sandwich, preparing a fruit salad, and preparing cereal) alongside 10 micro activities such as adding, cutting, mixing, and “other”. In our recognition task, we focus on the nine specific activities relevant to our study and exclude the “other” category, as it does not provide meaningful information for our research objectives. The dataset’s primary complexities arise from inconsistencies in sampling rates across different sensors and significant instances of missing data, notably within the left wristwatch accelerometer data, which shows several intervals devoid of readings. Furthermore, the recordings from the accelerometers and the motion capture system exhibit varying sampling rates within the 30-second segments. Specifically, accelerometer sampling rates fluctuate between 50 and 100 Hz, while the motion capture data is consistently sampled at around 100 Hz. These challenges underscore the need for robust processing techniques capable of handling asynchronous data and compensating for informational gaps within the dataset (Alia et al., 2021).

In our study, we utilize sensor data recorded from the arm, hip, and wrist, which serve as the primary data inputs for our model. To ensure uniformity across the dataset, we have interpolated all sensor readings to a consistent sampling rate of 100 Hz. Notably, we do not incorporate motion capture data into our analyses.

Given the dataset’s challenges, including missing readings and inconsistent recording sessions, we have chosen to discretize activities within the same subject. This approach yields a variety of activity sequences, such as “add, cut, open, making cereal” or “cut, take, take, making cereal”, each representing different combinations of micro activities that lead up to a macro activity. The dataset presents variability not only in activity sequences but also in sensor availability per session, with some subjects lacking consistent sensor data across recordings. To address these discrepancies and the issue of missing data, we structured the dataset to ensure diversity in the training and testing sets, with distinct activity combinations and previously unseen activity types.

It is important to note that the dataset provides only a general indication of atomic activities. Hence, most CHAR models that rely on time-specific atomic activity labels or sequential labeling are not suitable for this dataset. Our approach, tailored for this type of loosely labeled data, allows for robust activity recognition despite the absence of precise temporal annotations. In light of the inseparable nature of the segments within the dataset, we have preserved the original 30-second recording window size to maintain the integrity of the data. Consequently, the shape of our training set is (2326, 12, 3000), indicating that it comprises 2326 segments, each with 12 features, across 3000 time steps. Similarly, the testing set is structured as (711, 12, 3000), consisting of 711 such segments. This configuration ensures that the dataset’s original structure is retained, allowing for an authentic evaluation of our model’s performance.
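The resampling to 100 Hz mentioned above can be sketched as follows. The text does not name the interpolation method, so linear interpolation via `numpy.interp` is an assumption, as are the function name `resample_to_rate` and the synthetic 50 Hz input.

```python
import numpy as np


def resample_to_rate(timestamps, values, target_hz=100.0):
    """Linearly interpolate a 1-D signal onto a uniform grid at target_hz.

    timestamps must be increasing and expressed in seconds.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    duration = timestamps[-1] - timestamps[0]
    n_samples = int(round(duration * target_hz)) + 1
    grid = timestamps[0] + np.arange(n_samples) / target_hz
    return grid, np.interp(grid, timestamps, values)


# Synthetic one-second accelerometer channel sampled at 50 Hz.
t_50hz = np.arange(51) / 50.0
x_50hz = 2.0 * t_50hz
grid, x_100hz = resample_to_rate(t_50hz, x_50hz, target_hz=100.0)
```

Linear interpolation also bridges intervals with missing readings using a straight line, so long gaps may be better handled by masking or discarding the affected segments.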

4.2. Evaluation

4.2.1. Atomic Accuracy Score

To assess the precision of our model in detecting atomic activities, we utilize the Atomic Accuracy Score. This metric measures the proportion of atomic activities correctly identified above a specified confidence threshold $\alpha$ relative to the total number of activities detected. The score is mathematically defined as:

\text{Atomic Accuracy}=\frac{\sum_{i=1}^{n}\chi(p_{i}>\alpha)}{n}

where $p_{i}$ denotes the confidence score of the $i$ -th detected activity, $n$ is the total number of detected activities, and $\chi$ is the indicator function that returns 1 if $p_{i}>\alpha$ , and 0 otherwise. In practical terms, we set $\alpha=0.4$ , ensuring that only activities detected with a confidence level above 40% are considered in the accuracy measurement. This strategy focuses on the most reliably detected activities, thus providing a more meaningful assessment of the model’s performance in smart space environments.
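As a worked illustration of this definition, the following minimal sketch computes the score for a list of confidence values. The function name `atomic_accuracy` and the example confidences are hypothetical; the strict inequality and the default $\alpha=0.4$ follow the formula above.

```python
from typing import Iterable


def atomic_accuracy(confidences: Iterable[float], alpha: float = 0.4) -> float:
    """Fraction of detected atomic activities whose confidence exceeds alpha."""
    scores = list(confidences)
    if not scores:
        return 0.0  # convention chosen here for an empty detection list
    # Indicator function: count only confidences strictly above the threshold.
    return sum(1 for p in scores if p > alpha) / len(scores)


# Illustrative values: 0.9 and 0.62 pass; 0.35 and exactly 0.4 do not.
score = atomic_accuracy([0.9, 0.35, 0.4, 0.62])
```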

4.2.2. Complex Activity F1 score

In the evaluation of our model’s performance on complex activity classification, the F1 score is employed as a critical metric. The F1 score is the harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives. This is especially important in our context, where some complex activities may be underrepresented in the dataset.

The F1 score is calculated as follows:

\text{F1 Score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}
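The formula can be made concrete with a short sketch that computes per-class F1 from counts and then averages over classes. The text does not state how per-class scores are aggregated, so the unweighted (macro) average is an assumption, and the function names and labels are illustrative.

```python
from typing import Sequence


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores (aggregation is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(scores)


y_true = ["coffee", "coffee", "sandwich", "sandwich"]
y_pred = ["coffee", "sandwich", "sandwich", "sandwich"]
result = macro_f1(y_true, y_pred)
```

Macro averaging weights every complex activity equally, which fits the concern about underrepresented classes; a support-weighted average would instead favor frequent classes.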

4.2.3. Model Comparison

In this study, we evaluate the performance of our proposed model against several established baselines in the field of CHAR. We provide a detailed description of our model in Subsection 3.5. For a fair comparison, we design the baselines with similar structures to ensure that each comparison model has approximately the same number of parameters. We employ the AdamW optimizer for training, with a maximum of 300 epochs. The following subsections detail the configurations of each comparison group:

• ConvLSTM: Often considered a foundational architecture for addressing CHAR problems (Peng et al., 2018b; Varshney et al., 2022; Singh et al., 2023; Haresamudram et al., 2019), the ConvLSTM serves as a baseline to evaluate the enhancements our method brings to activity recognition. This model integrates a sensor fusion module that uses a CNN to extract primary sensor features, combined with LSTM networks to capture temporal dependencies among these features. The architecture culminates in a temporal convolution layer followed by a dense output layer, which collectively aim to optimize the recognition of complex human activities.

• Concept Bottleneck (Koh et al., 2020): Utilizes a CNN-based structure to detect atomic concepts before advancing to higher-level concepts, applying MSE loss for concept identification. This hierarchical approach can be applied to complex activity recognition.

• PEMM: The Pointwise Error Minimization Method (PEMM) is utilized as a comparative baseline in our study. It addresses potential concerns that Concept Bottleneck outcomes are limited by reliance on CNN architectures. To evaluate our model’s performance enhancements, we incorporate MSE in an ablation study, contrasting it with PEMM to highlight the effectiveness and advancements of our approach.

• DEBONAIR (Chen et al., 2021): This model parallels our basic ConvLSTM structure with a distinct preprocessing module that processes sensor data for atomic and complex activities separately before merging them. DEBONAIR is designed exclusively for complex activity output, focusing solely on the integration of processed data without predicting atomic activities.

• XCHAR (Jeyakumar et al., 2023): Based on a vanilla ConvLSTM architecture, XCHAR differentiates itself by employing Connectionist Temporal Classification (CTC) loss to emphasize the importance of sequence in atomic activities. However, real-life datasets often suffer from errors or omissions in the sequencing of atomic activities, limiting the applicability of this model to datasets where precise sequence labeling is feasible.

Model	Opportunity CHAR F1 Score	Opportunity Atomic Acc. Score	Cooking CHAR F1 Score	Cooking Atomic Acc. Score	FallAllD CHAR F1 Score	FallAllD Atomic Acc. Score
PEMM	0.7321	0.5332	0.7838	0.4456	0.8217	0.7768
ConvLSTM	0.7922	–	0.8135	–	0.8393	–
Concept Bottleneck	0.6479	0.4031	0.4355	0.3728	0.5966	0.4725
XCHAR	0.8357	0.6736	–	–	–	–
DEBONAIR	0.8015	–	0.8128	–	0.8283	–
VCHAR	0.8463	0.6052	0.8198	0.5209	0.8657	0.8153

Table 4. Comparative Analysis of CHAR F1 Scores and Atomic Accuracy Across Models and Datasets. Not all results are available due to certain models’ incompatibility with datasets that lack specific time or sequence labeling, which is essential for their application in real-world settings.

In our study, we utilized a variance-based baseline to assess our model’s performance in recognizing complex activities. As shown in Table 4, our method surpasses other models with CHAR F1 Scores of 0.8463 on the Opportunity dataset, 0.8198 on the Cooking Challenge dataset, and 0.8657 on the FallAllD dataset, demonstrating its effectiveness in complex activity recognition. Nevertheless, the model encounters challenges in precisely detecting atomic activities when specific labels are available, recording a slightly lower Atomic Accuracy score of 0.6052, compared with XCHAR’s 0.6736, on the Opportunity dataset. While our approach mainly focuses on complex activity detection, further investigation into its capabilities for atomic activity recognition remains a point of interest.

Remarkably, our model demonstrates a significant advantage in environments where exact time labeling is absent, such as the Cooking dataset, where our method achieves the highest CHAR F1 Score of 0.8198 and Atomic Accuracy of 0.5209. This suggests that our variance-based approach adaptively handles datasets lacking detailed temporal annotations better than methods relying on point-wise error minimization such as MSE loss, which perform well under conditions of precise labeling but struggle otherwise due to their inherent need for accuracy in individual assessments. The performance on the FallAllD dataset mirrors the trend observed in the Cooking dataset.

Moreover, the comparative performance in the table reveals that some models, including PEMM and Concept Bottleneck, compromise complex activity detection in favor of atomic activity recognition. For instance, while PEMM and Concept Bottleneck are designed to improve atomic detail, they fall short in overall CHAR F1 Scores compared to vanilla ConvLSTM, underscoring a trade-off (Gunning, 2016) that our method avoids. Our model’s multi-task recognition capability does not sacrifice the quality of complex activity detection, affirming its balanced approach in simultaneous task management.

The normalized confusion matrices illustrate distinct performance across datasets. The FallAllD dataset (Figure 9) demonstrates robust detection across all complex activities. For the Opportunity dataset (Figure 9), prediction scores for each complex activity are at least 0.8, with the “early morning” activity achieving the highest accuracy at 0.97 and the “clean up” activity recording the lowest at 0.8. In contrast, the Cooking Challenge dataset (Figure 9) reveals suboptimal performance for the “making cereal” activity, which scores only 0.66, though other activities maintain scores above 0.8.

4.3. Empirical Studies of Explanation Understandability

In our effort to develop a user-friendly framework suitable for everyday use by laypersons, we conducted human evaluations to compare our method against existing approaches. These evaluations included all methods tested across the three datasets used in our quantitative experiments. Our evaluation comprised two distinct groups. The first group assessed the clarity and user preference for the model’s output representation, specifically focusing on how effectively the model’s predictions are described and understood by users. The second group of evaluations aimed to demonstrate to users how the model processes different types of data to make decisions. These studies were designed to ascertain which types of explanations regarding the model’s decision-making processes are most accessible and favored by users without technical expertise.

In our human study, we utilized three distinct datasets, each evaluated by 100 participants who responded to 6 questions, totaling 1,800 effective responses across all datasets. The participants were recruited predominantly from online platforms, with no specific background prerequisites. Some chose to remain anonymous. Educational backgrounds varied among participants: about one-third had a high school diploma, while others held undergraduate or graduate degrees. Participants ranged in age from twenty to fifty years. The questions were randomly selected from two categories related to the model’s outputs: three questions focused on user preferences for output representation, and three on explanations of model decisions. Participants were asked to rate their satisfaction on a modified Decimal Likert Scale (Wuensch, 2005) ranging from 1 to 5, with options to select intermediate values such as 1.25, 1.5, and 1.75, assessing the extent to which the methods enhanced their understanding of the model’s outputs, especially for those without expertise. This enhanced scale provides finer granularity in capturing nuances in participant responses. This comprehensive evaluation aims to assess various scenarios and input types to better understand user interaction with the model.

4.3.1. Activity Recognition Description

We assess the VCHAR, DeXAR, and Concept Bottleneck methods for complex activity recognition, focusing on their ability to represent results effectively within datasets characterized by sparse labeling.

• DeXAR (Arrotta et al., 2022): Initially designed for atomic activity recognition, this method has been modified by us to simultaneously estimate both atomic and complex activities. This adaptation incorporates an NLP-based visual representation to depict the model’s recognition results. In our example, as shown in Figure 11, the semantic visualization employs the DeXAR encoding method to represent levels of confidence through color variations. Darker colors indicate higher confidence. This visualization illustrates the potential time intervals estimated by the model for each atomic activity, accompanied by textual explanations. However, it does not provide descriptions for complex activities.

• Concept Bottleneck (Koh et al., 2020): It merely indicates whether atomic activities are detected or not, without providing detailed information about the recognition results (Figure 10). The output consists only of text descriptions, making it straightforward and concise. Nonetheless, this approach requires that users possess basic reading skills. In Figure 10, we observe that the complex activity is described solely through textual means. While some users may find this approach straightforward and intuitive, it essentially lists a sequence of actions (such as opening and closing a door, toggling a switch, and drinking) that collectively signify a complex activity, in this instance “making coffee”.

• VCHAR: Figure 13 illustrates the VCHAR system, showcasing its capability to represent each atomic activity with a corresponding video and a descriptive label, in this example “making cereal”. The width of the video segment indicates the estimated time interval during which the activity occurs, as detected by VCHAR. Both the atomic and complex activities are labeled at the top of the graph. VCHAR is specifically designed to address issues of label sparsity. The time interval estimation is based on the weights connecting a specific last-layer neuron to all neurons in the time series layer, allowing for a comprehensive representation that integrates visual and textual descriptions for each activity.

As illustrated in Figure 13, VCHAR’s representation receives the highest median evaluation scores across the datasets: 3.88 for Opportunity, 4.37 for Cooking Challenge, and 4.5 for FallAllD. These scores demonstrate a clear preference among users for more detailed and descriptive outputs to better understand complex scenarios, especially in everyday contexts. The distinct advantage of VCHAR is its ability to convey intricate sensor data interactions in a manner that is intuitive for laypersons. This preference underscores the importance of designing AI systems that not only perform well but also communicate their processes and results in ways that enhance user comprehension and trust in technology applications.

4.3.2. Model Explanation

Another aspect of our evaluation focuses on illustrating to users how the model processes various data types to arrive at decisions. We assessed three distinct approaches: our proprietary method, a model-agnostic method, and a model-transparent method. This comparative study seeks to explore how different modeling approaches influence user preferences, particularly in terms of model interpretability and its impact on user satisfaction. Additionally, we analyze various explanation methods to identify which sensors are critical for recognizing atomic activities, further enhancing our understanding of each method’s effectiveness in practical scenarios.

•

Grad-CAM (Chattopadhay et al., 2018): Grad-CAM is a model-transparent method that computes the gradient of a target concept (output) with respect to the feature maps of a designated layer. It produces a heatmap that identifies the sensors in higher layers that are pivotal for class prediction. As illustrated in Figure 14, the heatmap shows how various sensors contribute at different intensities to a particular activity, with brighter areas indicating greater importance.
•

SHAP (Lundberg and Lee, 2017): In contrast, SHAP is a model-agnostic method that approximates the relationship between inputs and outputs without probing the model’s internal mechanisms. It focuses on representing the sensor signals in the input time series data. SHAP estimates the impact of each feature on the prediction, using Shapley values from cooperative game theory to quantify each feature’s contribution. A Shapley value can be positive or negative, depending on whether the feature pushes the model’s prediction for a particular instance above or below the average prediction. This method highlights critical regions relative to the predicted class, as illustrated in Figure 15. Given the inherent complexity of sensor signals, simple descriptions of the input may not fully capture their significance.
•

VCHAR: VCHAR delivers a detailed depiction of sensor contributions by integrating both textual and visual representations. As shown in Figure 16, it visualizes a scenario where a person is attempting to open a fridge; the most significant sensor activation value is on the left foot, which VCHAR highlights in the visualization. The sensor values are derived from gradient values tied to the selected activity type, akin to Grad-CAM’s methodology but extended to cover both atomic and complex activities. VCHAR not only identifies these activities but also labels them in the visual representation, which supports detailed activity analysis.
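The gradient-based weighting that Grad-CAM uses, and that the VCHAR sensor values are described as resembling, can be sketched in a few lines. This is a minimal illustration of the standard Grad-CAM computation on a one-dimensional layout, not the implementation used in this study. The function name `grad_cam_1d` and the toy numbers are hypothetical, and the gradients are hard-coded here, whereas in practice they would come from an automatic differentiation framework.

```python
def grad_cam_1d(feature_maps, gradients):
    """Grad-CAM heatmap over positions for one target class.

    feature_maps[k][t]: activation of channel k at position t.
    gradients[k][t]:    derivative of the class score with respect to
                        feature_maps[k][t].
    """
    num_positions = len(feature_maps[0])
    # Channel weight: the gradient averaged over all positions.
    alphas = [sum(g) / len(g) for g in gradients]
    heatmap = []
    for t in range(num_positions):
        weighted = sum(a * fm[t] for a, fm in zip(alphas, feature_maps))
        # ReLU keeps only positions that support the target class.
        heatmap.append(max(weighted, 0.0))
    return heatmap


# Toy example: 2 channels, 3 positions (for instance, 3 sensors).
feature_maps = [[1.0, 2.0, 0.0],
                [0.0, 1.0, 3.0]]
gradients = [[0.5, 0.5, 0.5],
             [-1.0, -1.0, -1.0]]
heatmap = grad_cam_1d(feature_maps, gradients)
print(heatmap)  # [0.5, 0.0, 0.0]
```

In this toy case only the first position receives a nonzero score, because the second channel has a negative weight and cancels or outweighs the first channel elsewhere.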

In the results shown in Figure 17, VCHAR achieves the highest median scores (4.2, 3.8, and 4.5) along with a lower variance. Interestingly, our results are comparable to those of Grad-CAM. Grad-CAM scored lower than SHAP, primarily because SHAP explains outcomes in terms of the time-series data and thus provides a more detailed informational context. Even so, our method, which takes a model-transparent approach akin to Grad-CAM, received higher scores. This suggests that laypersons may benefit from the more detailed explanations our model provides, compared with explanations aimed at expert users.

5. Conclusions

In this paper, we introduce VCHAR, a variance-based method specifically designed to address label sparsity issues in in-the-wild datasets. VCHAR is capable of simultaneously detecting both complex and atomic activities, without compromising the recognition rate of complex activities. Our results demonstrate a performance improvement over other baseline methods. Additionally, we present a novel decoder that translates the model’s outputs into a visual representation. This enhancement significantly aids laypersons in understanding the mechanisms of the model, compared to methods traditionally tailored for experts. A human study confirms that our method is preferred over others, offering more accessible and detailed insights to layperson users.

5.0.1. Limitations

Despite our efforts, this study has several limitations, outlined below:

•

Atomic Activity: This paper introduces a method designed to address the difficulty of labeling the timing and sequence of activities in in-the-wild datasets, where precise annotation is often lacking. While this approach improves labeling accuracy, it does not yet match the accuracy of methods that rely on precise labels.
•

Real-Time Rendering: Our approach features a generative decoder that visually represents activities detected within smart spaces. However, despite its innovative design, the rendering time is extended relative to traditional end-to-end decoding methods. This delay is largely due to the stable diffusion process employed by the decoder, which requires 100 steps to complete the inference. While this method provides detailed visual outputs, it still needs improvement for real-time applications.
•

Cross Domain Encoder: This paper introduces an approach using a universally pretrained decoder to interpret different scenarios within a single model, enhancing the ability to translate various types of scenarios seamlessly. However, while the decoder supports multi-scenario translation, the encoder processes different types of data separately and struggles to recognize cross-domain sensor data as a unified model, limiting its effectiveness in integrated scenario analysis.

5.0.2. Future Work

In future studies, while our principal focus remains on detecting complex activities, improving the detection rates of atomic activities will also be a key area of research. Our primary objective will be to refine the methods used for identifying atomic activities, aiming for substantial enhancements in accuracy. Additionally, addressing the challenges in real-time rendering is crucial; reducing rendering times is imperative for the practical deployment of our models in real-world settings. To achieve this, we plan to develop more efficient algorithms capable of managing the computational demands of stable diffusion processes. These improvements will aim to optimize the generation of smart space sensor representations, ensuring high-quality outputs without sacrificing speed or efficiency.

To address the complexities of applying these techniques in diverse real-life environments, we also aim to design and implement a unified encoder. This encoder will be capable of processing and translating various types of sensor data across multiple domains into a coherent visual output. The development of such an encoder will facilitate a more seamless integration of our methods into everyday technology, making smart space technologies more adaptable and user-friendly across different settings and applications.

References

Alia et al. (2021) Sayeda Shamma Alia, Paula Lago, Shingo Takeda, Kohei Adachi, Brahim Benaissa, Md Atiqur Rahman Ahad, and Sozo Inoue. 2021. Summary of the cooking activity recognition challenge. Human Activity Recognition Challenge (2021), 1–13.
Aquino et al. (2023) Gustavo Aquino, Marly Guimarães Fernandes Costa, and Cícero Ferreira Fernandes Costa Filho. 2023. Explaining and Visualizing Embeddings of One-Dimensional Convolutional Models in Human Activity Recognition Tasks. Sensors 23, 9 (2023), 4409.
Arrotta et al. (2022) Luca Arrotta, Gabriele Civitarese, and Claudio Bettini. 2022. Dexar: Deep explainable sensor-based activity recognition in smart-home environments. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1–30.
Atzmueller (2017) Martin Atzmueller. 2017. Onto explicative data mining: exploratory, interpretable and explainable analysis. Proceedings of Dutch-Belgian Database Day. TU Eindhoven (2017).
Atzmueller and Roth-Berghofer (2010) Martin Atzmueller and Thomas Roth-Berghofer. 2010. The mining and analysis continuum of explaining uncovered. In International Conference on Innovative Techniques and Applications of Artificial Intelligence. Springer, 273–278.
Bao and Intille (2004) Ling Bao and Stephen S Intille. 2004. Activity recognition from user-annotated acceleration data. In International conference on pervasive computing. Springer, 1–17.
Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
Chattopadhay et al. (2018) Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. 2018. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 839–847.
Chavarriaga et al. (2013) Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R Millán, and Daniel Roggen. 2013. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34, 15 (2013), 2033–2042.
Chen et al. (2021) Ling Chen, Xiaoze Liu, Liangying Peng, and Menghan Wu. 2021. Deep learning based multimodal complex human activity recognition using wearable devices. Applied Intelligence 51, 6 (2021), 4029–4042.
Ciliberto et al. (2021) Mathias Ciliberto, Vitor Fortes Rey, Alberto Calatroni, Paul Lukowicz, and Daniel Roggen. 2021. Opportunity ++: A Multimodal Dataset for Video- and Wearable, Object and Ambient Sensors-based Human Activity Recognition. https://doi.org/10.21227/vd6r-db31
Dernbach et al. (2012) Stefan Dernbach, Barnan Das, Narayanan C Krishnan, Brian L Thomas, and Diane J Cook. 2012. Simple and complex activity recognition through smart phones. In 2012 eighth international conference on intelligent environments. IEEE, 214–221.
Do et al. (2013) Thang M Do, Seng W Loke, and Fei Liu. 2013. Healthylife: An activity recognition system with smartphone using logic-based stream reasoning. In Mobile and Ubiquitous Systems: Computing, Networking, and Services: 9th International Conference, MobiQuitous 2012, Beijing, China, December 12-14, 2012. Revised Selected Papers 9. Springer, 188–199.
Gomes et al. (2012) Joao Bártolo Gomes, Shonali Krishnaswamy, Mohamed M Gaber, Pedro AC Sousa, and Ernestina Menasalvas. 2012. Mobile activity recognition using ubiquitous data stream mining. In Data Warehousing and Knowledge Discovery: 14th International Conference, DaWaK 2012, Vienna, Austria, September 3-6, 2012. Proceedings 14. Springer, 130–141.
Gunning (2016) David Gunning. 2016. Broad agency announcement explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), Tech. Rep. (2016).
Ha and Choi (2016) Sojeong Ha and Seungjin Choi. 2016. Convolutional neural networks for human activity recognition using multiple accelerometer and gyroscope sensors. In 2016 international joint conference on neural networks (IJCNN). IEEE, 381–388.
Haresamudram et al. (2019) Harish Haresamudram, David V Anderson, and Thomas Plötz. 2019. On the role of features in human activity recognition. In Proceedings of the 2019 ACM International Symposium on Wearable Computers. 78–88.
Hur et al. (2018) Taeho Hur, Jaehun Bang, Thien Huynh-The, Jongwon Lee, Jee-In Kim, and Sungyoung Lee. 2018. Iss2Image: A novel signal-encoding technique for CNN-based human activity recognition. Sensors 18, 11 (2018), 3910.
Inoue et al. (2019) Sozo Inoue, Paula Lago, Tahera Hossain, Tittaya Mairittha, and Nattaya Mairittha. 2019. Integrating activity recognition and nursing care records: The system, deployment, and a verification study. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3 (2019), 1–24.
Ito et al. (2018) Chihiro Ito, Xin Cao, Masaki Shuzo, and Eisaku Maeda. 2018. Application of CNN for human activity recognition with FFT spectrogram of acceleration and gyro sensors. In Proceedings of the 2018 ACM international joint conference and 2018 international symposium on pervasive and ubiquitous computing and wearable computers. 1503–1510.
Jeyakumar et al. (2023) Jeya Vikranth Jeyakumar, Ankur Sarker, Luis Antonio Garcia, and Mani Srivastava. 2023. X-char: A concept-based explainable complex human activity recognition model. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 1 (2023), 1–28.
Jiang and Yin (2015) Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM international conference on Multimedia. 1307–1310.
Koh et al. (2020) Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In International conference on machine learning. PMLR, 5338–5348.
Kwon et al. (2019) Hyeokhyen Kwon, Gregory D Abowd, and Thomas Plötz. 2019. Handling annotation uncertainty in human activity recognition. In Proceedings of the 2019 ACM International Symposium on Wearable Computers. 109–117.
Lago et al. (2020) Paula Lago, Shingo Takeda, Kohei Adachi, Sayeda Shamma Alia, Moe Matsuki, Brahim Benai, Sozo Inoue, and Francois Charpillet. 2020. Cooking activity dataset with macro and micro activities. https://doi.org/10.21227/hyzg-9m49
Li et al. (2020) Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020).
Lu et al. (2022) Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. 2022. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations.
Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017).
Mekruksavanich and Jitpattanakul (2021) Sakorn Mekruksavanich and Anuchit Jitpattanakul. 2021. Deep convolutional neural network with rnns for complex activity recognition using wrist-worn wearable sensor data. Electronics 10, 14 (2021), 1685.
Pan and Yang (2009) Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2009), 1345–1359.
Peng et al. (2018a) Liangying Peng, Ling Chen, Menghan Wu, and Gencai Chen. 2018a. Complex activity recognition using acceleration, vital sign, and location data. IEEE Transactions on Mobile Computing 18, 7 (2018), 1488–1498.
Peng et al. (2018b) Liangying Peng, Ling Chen, Zhenan Ye, and Yi Zhang. 2018b. Aroma: A deep multi-task learning based simple and complex human activity recognition method using wearable sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 2 (2018), 1–16.
Rashidi and Mihailidis (2012) Parisa Rashidi and Alex Mihailidis. 2012. A survey on ambient-assisted living tools for older adults. IEEE journal of biomedical and health informatics 17, 3 (2012), 579–590.
Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135–1144.
Rombach et al. (2022a) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022a. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
Rombach et al. (2022b) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022b. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.
Saleh et al. (2020) Majd Saleh, Manuel Abbas, and Regine Bouquin Le Jeannes. 2020. FallAllD: An open dataset of human falls and activities of daily living for classical and deep learning applications. IEEE Sensors Journal 21, 2 (2020), 1849–1858.
Shoaib et al. (2016a) Muhammad Shoaib, Stephan Bosch, Ozlem Durmaz Incel, Hans Scholten, and Paul JM Havinga. 2016a. Complex human activity recognition using smartphone and wrist-worn motion sensors. Sensors 16, 4 (2016), 426.
Shoaib et al. (2016b) Muhammad Shoaib, Hans Scholten, Paul JM Havinga, and Ozlem Durmaz Incel. 2016b. A hierarchical lazy smoking detection algorithm using smartwatch sensors. In 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom). IEEE, 1–6.
Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
Singh et al. (2022) Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15638–15650.
Singh et al. (2020) Satya P Singh, Madan Kumar Sharma, Aimé Lay-Ekuakille, Deepak Gangwar, and Sukrit Gupta. 2020. Deep ConvLSTM with self-attention for human activity decoding using wearable sensors. IEEE Sensors Journal 21, 6 (2020), 8575–8582.
Singh et al. (2023) Upendra Singh, Puja Gupta, Mukul Shukla, Varsha Sharma, Sunita Varma, and Sumit Kumar Sharma. 2023. Acknowledgment of patient in sense behaviors using bidirectional ConvLSTM. Concurrency and Computation: Practice and Experience 35, 28 (2023), e7819.
Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
Tahvilian et al. (2022) Ehsan Tahvilian, Ehsan Partovi, Mehdi Ejtehadi, Parsa Riazi Bakhshayesh, and Saeed Behzadipour. 2022. Accuracy improvement in simple and complex Human Activity Recognition using a CNN-BiLSTM multi-task deep neural network. In 2022 8th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS). IEEE, 1–5.
Trabelsi et al. (2022) Imen Trabelsi, Jules Françoise, and Yacine Bellik. 2022. Sensor-based activity recognition using deep learning: A comparative study. In Proceedings of the 8th International Conference on Movement and Computing. 1–8.
Varshney et al. (2022) Neeraj Varshney, Brijesh Bakariya, Alok Kumar Singh Kushwaha, and Manish Khare. 2022. Human activity recognition by combining external features with accelerometer sensor data using deep learning network model. Multimedia Tools and Applications 81, 24 (2022), 34633–34652.
Wang et al. (2022b) Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022b. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning. PMLR, 23318–23340.
Wang et al. (2022a) Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022a. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442 (2022).
Wu et al. (2023) Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7623–7633.
Wuensch (2005) Karl L Wuensch. 2005. What is a Likert scale? And how do you pronounce ‘Likert’? (2005).
Zhang et al. (2008) Daqing Zhang, Mossaab Hariz, and Mounir Mokhtari. 2008. Assisting elders with mild dementia staying at home. In 2008 Sixth Annual IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 692–697.
Zhao et al. (2011) Zhongtang Zhao, Yiqiang Chen, Junfa Liu, Zhiqi Shen, and Mingjie Liu. 2011. Cross-people mobile-phone based activity recognition. In Twenty-second international joint conference on artificial intelligence. Citeseer.
Zheng et al. (2023) Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. 2023. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305 (2023).
Zylowski (2022) Thorsten Zylowski. 2022. Study on criteria for explainable AI for laypeople. In Proceedings of the Second International Workshop on Explainable and Interpretable Machine Learning (XI-ML 2022) co-located with the 45th German Conference on Artificial Intelligence (KI 2022), Trier (Virtual), Germany. CEUR Workshop Proceedings.