IoT-LM: Large Multisensory Language Models
for the Internet of Things

Shentong Mo, Ruslan Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang
Carnegie Mellon University
[email protected]
Abstract

The Internet of Things (IoT) network integrating billions of smart physical devices embedded with sensors, software, and communication technologies is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, and audio to recognize the states of humans and physical objects. Data-driven tools present a rich opportunity to automatically process IoT data at scale, enabling efficient inference for understanding human wellbeing, controlling physical devices, and interconnecting smart cities. To realize this potential, we introduce IoT-LM, an open-source large multisensory language model tailored for the IoT ecosystem. IoT-LM is enabled by two technical contributions: the first is MultiIoT, the most expansive unified IoT dataset to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks prepared for multisensory pre-training and instruction-tuning. The second is a new multisensory multitask adapter layer to condition pre-trained large language models on multiple multisensory IoT tasks simultaneously, enabling the sharing of information across modalities and tasks for better generalization. Not only does IoT-LM yield substantial improvements on 8 supervised IoT classification tasks, but it also demonstrates new interactive question-answering, reasoning, and dialog capabilities conditioned on IoT sensors. We release IoT-LM’s data sources and new multisensory language modeling framework at https://github.com/Multi-IoT/MultiIoT.

1 Introduction

The digital world is witnessing an unprecedented surge in the realm of the Internet of Things (IoT): the ever-growing system of interlinked devices from individual households to vast industrial complexes [8]. These devices are often embedded with sensors, software, and communication technologies that can safely and privately analyze the human and physical world [36, 50]. For example, high-fidelity sensors can recognize physical activities to inform us of our daily physical wellness [47, 55]; vision, depth, and lidar sensors can navigate self-driving cars and connect them with traffic lights for efficient transport management [27, 31]; and wifi, depth, and camera sensors can detect whether the elderly require assistance in hospitals [2, 34]. As a result, there has been substantial interest in building machine learning systems that can efficiently process these IoT sensors and make predictions, which has great potential for understanding human wellbeing, controlling physical devices, and interconnecting smart cities [1, 20, 52, 56].

However, most existing machine learning approaches in the IoT domain have largely focused on supervised models, trained only on a single input sensory modality and for a single output prediction task [3, 6, 23, 28, 33]. We extend the existing ML for IoT paradigms in two directions. Firstly, in the input space, we investigate how to learn from multiple heterogeneous and interacting IoT sensory modalities simultaneously. This is important since practical IoT scenarios often see multiple sensors used for different use cases, each with its own unique information, structures, and noise topologies. Secondly, in the output space, we study how to ground large language models (LLMs) on IoT sensors to enable simultaneous prediction over many IoT-related real-world tasks. This can enable us to combine the real-world sensing of IoT with the dialog, reasoning, and generalization abilities of large language models. Together, these result in the development of IoT-LM, a large multisensory language model capable of processing many IoT sensors for a range of predictive, question-answering, reasoning, and interactive dialog tasks. Our primary contributions in IoT-LM can be summarized as:

  • IoT-Language resource: To train IoT-LM, we collect and publicly release a dataset with 1.15 million paired IoT sensor and natural language samples covering 12 real-world sensory modalities and 8 IoT tasks. These sensors and tasks are rooted in practical scenarios such as human health and wellness, physical commonsense, and smart cities.

  • IoT-LM architecture: The key innovation in IoT-LM’s architecture is a new multisensory multitask adapter layer to condition pretrained LLMs on multiple multisensory IoT tasks simultaneously, enabling the sharing of information across modalities and tasks for better generalization.

  • IoT-LM: With this new IoT training data and multisensory multitask adapter, we train IoT-LM, the first large multisensory language model that can perceive physical IoT sensors. Through sensor-language pretraining and instruction tuning, IoT-LM displays interactive question-answering, reasoning, and dialog capabilities conditioned on IoT sensor data. We publicly release a set of IoT-LM models spanning 7B to 70B parameters, along with all the curated IoT-language resources and training code.

2 Related Work

We cover related work in the design and applications of IoT sensors, how machine learning can be used to accelerate IoT perception, and related background work in multisensory machine learning and foundation models.

Internet of Things (IoT): The pursuit of extracting meaningful insights from IoT data [7, 30, 35, 13] has led to various innovative approaches that focus on individual modalities and specific tasks [17, 10, 4] and resource-constrained devices [15, 32, 24]. For instance, DIP-IMU [23] effectively fuses depth sensing with IMU data for enhanced pose estimation, demonstrating the potential of multimodal integration. Similarly, EyeMU [33] utilizes time-series neural networks to process IMU sensor data for accurate gaze tracking on mobile devices. TouchPose [3] explores tactile data for pose recognition, while LLVIP [28] focuses on leveraging visual data from IoT environments for dynamic real-world applications. Furthermore, RGBDGaze [6] integrates RGBD data for gaze estimation, highlighting the diversity of sensory data applications. However, these approaches generally remain confined to single-task models and lack the capability to generalize across multiple IoT modalities and tasks, an area where our work with IoT-LM introduces a significant advancement by enabling statistical sharing and generalization across a broad spectrum of IoT data.

Multisensory machine learning: The field of multimodal machine learning has seen substantial growth, with models designed to integrate inputs from various sensory channels to perform more complex interpretation and interaction with the environment [9, 42]. Notable works include multimodal transformers [54, 44] that fuse visual, auditory, and textual data to improve learning efficacy. These multimodal transformers have also been scaled for self-supervised pre-training across an increasing range of modalities including time-series and sensors [25, 41, 49]. While these models offer a foundation for integrating diverse data types, they often do not address the unique challenges posed by IoT environments, such as the integration of non-traditional sensor data (e.g., IMU, thermal sensors) and the need for models to interact dynamically with physical environments.

Foundation models: Recently, the concept of foundation models [11, 12, 5], pre-trained on large-scale datasets and adaptable to a wide range of tasks [57, 29, 21], has gained traction. These models, exemplified by works such as GPT-4 [46] and LLaMA [53], offer a robust starting point for further fine-tuning on individual natural-language tasks. There has also been a recent drive towards large multimodal models, using either LLMs as a starting point and training adapters from other modalities to LLM input space [16, 57], or training multimodal transformers from scratch (sometimes called ‘natively’) with interleaved language tokens, image frames, audio frames, and other modalities [21, 46]. Our approach extends these paradigms into the IoT domain, where IoT-LM adapts a large language model framework to not only understand textual information but also effectively process and reason about multisensory data from diverse IoT sensors. This adaptation enables unprecedented capabilities in performing complex reasoning, dialogue, and interactive question-answering tasks directly related to physical IoT contexts, setting a new benchmark for intelligent IoT systems.

Figure 1: Illustration of the IoT-LM architecture, highlighting the integration of multisensory data through modality-specific encoders and the novel multisensory multitask adapter layer. We illustrate how different sensory inputs are processed, combined, and used to adapt a pre-trained language model so that it can handle and interpret complex, real-world sensor data efficiently.

3 IoT-LM: A New Multisensory IoT Foundation Model

In this section, we describe the architecture, data curation, and training innovations in IoT-LM.

3.1 IoT-LM Architecture: Multisensory Multitask Adapter

IoT-LM’s backbone consists of a variety of IoT sensory inputs, a general-purpose multisensory multitask encoder that fuses the information from multiple sensors and across multiple tasks, a multisensory multitask adapter that transforms encoded representations into pretrained LLM input space, and the pretrained LLM itself. We show an overview of the IoT-LM architecture in Figure 1.

Multisensory multitask encoder for heterogeneous IoT signals

The encoder is trained using a novel approach that effectively manages the heterogeneity of IoT data by combining supervised learning with unsupervised feature extraction techniques. This training involves task-specific adaptations, where the encoder learns to map raw sensor data to an intermediate feature space conducive for language model processing. The encoder handles the multisensory data through a combination of multimodal fusion methods, including early, late, and model-based fusion, which are chosen based on the data characteristics and the specific requirements of each task.

Specifically, the adapters transform the sensor data into a representation that is more conducive to language model processing. For each sensor modality $x_i$, we employ a dedicated encoder $E_i$ that maps raw sensor data to an intermediate feature space. The encoded features from different modalities are then combined using a fusion mechanism within the adapter, allowing the model to harness information from all available sensors. We employ late fusion: features are extracted from each sensor modality, processed independently, and combined only at the decision-making layers, which facilitates specialization in feature extraction while still leveraging multimodal data for final predictions. These fusion techniques are selected and fine-tuned based on the specific characteristics of the data and the tasks at hand. Our fusion innovations involve new model-based fusion techniques that adaptively adjust the weighting of different sensor inputs based on their predictive value for specific tasks, a novel approach to handling IoT data heterogeneity. These components are trained in a multitask supervised learning framework that emphasizes individual task accuracy and enhances generalizability across tasks by sharing representations and leveraging common patterns found in diverse IoT data. This strengthens the model’s ability to perform well across the varied conditions and tasks found in real-world IoT applications.
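The late-fusion and adaptive-weighting ideas described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: it assumes each modality has already produced class logits, that each modality carries a scalar relevance score for the task (a hypothetical stand-in for a learned weight), and that fusion weights are a softmax over those scores. The names `softmax`, `late_fusion`, `logits`, and `relevance` are all made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(per_modality_logits, relevance_scores):
    """Combine per-modality predictions only at the decision layer.

    per_modality_logits: dict mapping modality name -> list of class logits
    relevance_scores: dict mapping modality name -> float
        (hypothetical stand-in for a learned, task-specific weight)
    """
    names = sorted(per_modality_logits)
    # Adaptive weighting: modalities with higher relevance get more say.
    weights = softmax([relevance_scores[n] for n in names])
    num_classes = len(per_modality_logits[names[0]])
    fused = [0.0] * num_classes
    for w, n in zip(weights, names):
        probs = softmax(per_modality_logits[n])  # each modality decides alone
        for c in range(num_classes):
            fused[c] += w * probs[c]
    return fused

# Toy example: a confident IMU branch and an uninformative audio branch.
logits = {"imu": [2.0, 0.5, 0.1], "audio": [0.2, 0.3, 0.1]}
relevance = {"imu": 1.5, "audio": 0.0}
fused = late_fusion(logits, relevance)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the fused output is a convex combination of per-modality probability vectors, it remains a valid probability distribution, and a noisy modality can be down-weighted without retraining its encoder.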

Multisensory multitask adapter layer

We extend the typical use of adapter modules, which are compact, trainable layers inserted between the existing layers of a pre-trained language model [53]. These adapters are designed to fine-tune the pre-existing model to new tasks and modalities without significant modifications to the original model’s weights, thus preserving its general linguistic capabilities while extending its functionality to new IoT-specific domains.

The key difference in IoT-LM is that the multisensory multitask adapter layer conditions pretrained LLMs on multiple multisensory IoT tasks simultaneously, enabling the sharing of information across modalities and tasks for better generalization and holistic understanding of IoT environments. We show an overview of the multisensory multitask adapter layer in Figure 2. The architecture is formalized as follows:

$$y = M_{W+A}(E_1(x_1) \oplus E_2(x_2) \oplus \dots \oplus E_m(x_m)) = M_W(A_{W_A}(E_1(x_1) \oplus E_2(x_2) \oplus \dots \oplus E_m(x_m))) \tag{1}$$

where $M_{W+A}(\cdot)$ and $M_W(\cdot)$ denote the models with combined and original weights, respectively, $A_{W_A}(\cdot)$ denotes the adapter with its own weights $W_A$, and $E_1, E_2, \dots, E_m$ represent the encoders for each sensor modality. This structure allows IoT-LM to maintain the foundational knowledge of the pre-trained model while adapting to the specific nuances of IoT data.
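As a rough illustration of the right-hand side of Eq. (1), the sketch below chains toy encoders, a fusion step, a linear adapter, and a frozen model. Several things here are assumptions rather than details from the paper: the direct-sum operator is taken to be plain concatenation, the adapter is a single linear map, and `encode`, `concat`, `adapter`, and `frozen_llm` are hypothetical placeholders rather than real components.

```python
def encode(x, scale):
    """Stand-in for a modality-specific encoder E_i (here: a simple scaling)."""
    return [scale * v for v in x]

def concat(features):
    """Fusion step; the direct-sum operator is assumed to be concatenation."""
    out = []
    for f in features:
        out.extend(f)
    return out

def adapter(z, weight):
    """Linear adapter A_{W_A}: maps fused features to the LLM input width."""
    return [sum(w * v for w, v in zip(row, z)) for row in weight]

def frozen_llm(h):
    """Placeholder for M_W, whose weights W are never updated."""
    return sum(h)

# Two toy modalities with different input sizes.
x_imu, x_thermal = [1.0, 2.0], [3.0]
z = concat([encode(x_imu, 0.5), encode(x_thermal, 2.0)])

# Only W_A would be trained; it projects 3 fused features to width 2.
W_A = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]
h = adapter(z, W_A)
y = frozen_llm(h)
```

The point of the structure is that the trainable parameters live in the encoders and `W_A`, while the language model itself stays untouched.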

Figure 2: Illustration of the multisensory multitask adapter layer we designed for IoT-LM. The adapter takes multiple sensor features extracted from dedicated input encoders, performs multimodal fusion into higher-order representations, and simultaneously transforms all fused multimodal features into the same representation space for an LLM to process.

3.2 IoT-Language Data Collection

To train IoT-LM, we aggregate and release the largest IoT dataset for machine learning research, comprising 1.15 million samples from 12 different sensory modalities for 8 different real-world IoT prediction tasks related to personal wellness, healthcare, smart cities, and more. These modalities include Inertial Measurement Units (IMU), thermal sensors, GPS, LiDAR, gaze, pose, capacitance sensors, and traditional modalities such as images, audio, and video. Tasks supported include:

  1. Gaze estimation: For human-computer interaction, driver monitoring, and virtual reality applications, our dataset includes RGB images of faces, depth data, and IMU outputs. The model is tasked with predicting X/Y coordinates for gaze tracking, necessitating a deep understanding of the interactions between these modalities.

  2. Depth estimation: This task involves estimating the distance from cameras to objects in images, vital for AR/VR, robotics, and object detection. Our dataset includes RGB images combined with camera parameters, GPS coordinates, and IMU data to create depth maps for various scenarios, including street scenes and robotic hand interactions.

  3. Gesture classification: Key to enhancing human-machine interfaces, this task uses data from gaze tracking and IMU sensors (accelerometer, gyroscope, and orientation) to classify human gestures. The challenge here is to accurately interpret the nuanced cross-modal interactions.

  4. Pose estimation: This task determines the arrangement of human joints, using RGB images and IMU data to predict poses of the human body, including 24 joints with three angles each (yaw, pitch, roll). This requires the model to fuse data from IMUs and visual inputs.

  5. Touch contact: To improve touch-based device interactions, this task classifies the type of touch on capacitive surfaces using RGB and capacitive images, depth maps, and hand poses.

  6. Event detection: In applications ranging from healthcare to smart homes, this task identifies specific occurrences or anomalies in data streams, using audio spectrograms and IMU data to categorize events over time.

  7. Activity recognition: Central to applications in fitness and healthcare, this task uses RGB images, pose data, and IMU outputs to recognize human activities such as walking, running, or jumping.

  8. 3D reconstruction: Significant in gaming, film, and AR/VR, this task involves creating three-dimensional models from RGB images, capacitance images, and depth maps, aimed at reconstructing 3D poses of objects and environments.

These diverse tasks not only prepare IoT-LM to handle complex sensor data but also ensure it can perform a broad range of functions, from classification to complex reasoning and dialogue involving multiple IoT devices.

3.3 Multisensory Multitask Encoder Pretraining

Building on the IoT-LM architecture and data resources, we now describe the pre-training and instruction tuning stages in IoT-LM. The pre-training stage aims to learn a general-purpose multisensory multitask encoder that fuses the information from multiple sensors and across multiple tasks. During pretraining, the multisensory multitask encoder is trained on a combination of supervised learning tasks to extract and fuse features from diverse sensor data effectively. This stage is pivotal in preparing the model to process and understand complex sensory inputs, which can be mathematically described as follows:

$$\Theta_{\text{pre}} = \arg\min_{\theta} \sum_{k=1}^{K} \sum_{(x_k, y_k) \in \mathcal{D}_k} \mathcal{L}_k(M_{\theta}(E_k(x_k)), y_k), \tag{2}$$

where $\theta$ represents the parameters of the entire network including the adapters, $\mathcal{D}_k$ is the dataset consisting of input-output pairs $(x_k, y_k)$, $E_k(x_k)$ represents the encoded inputs from various sensors for task $k$, and $\mathcal{L}_k$ denotes the loss function used to measure the discrepancy between the model’s predictions and the true outputs.
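To make the double sum in Eq. (2) concrete, the following sketch evaluates the multitask objective for one fixed set of parameters. It is a toy under stated assumptions: the model and encoders are one-line stand-ins, the per-task losses (squared error and absolute error) are chosen only for illustration, and the function and variable names are hypothetical. It computes the quantity being minimized; the minimization itself would be done with a gradient-based optimizer.

```python
def squared_error(pred, target):
    return (pred - target) ** 2

def absolute_error(pred, target):
    return abs(pred - target)

def pretraining_objective(model, encoders, datasets, losses):
    """Sum task-specific losses over every task k and every pair in D_k."""
    total = 0.0
    for task, samples in datasets.items():      # outer sum over k = 1..K
        encode, loss = encoders[task], losses[task]
        for x, y in samples:                     # inner sum over (x_k, y_k)
            total += loss(model(encode(x)), y)   # L_k(M_theta(E_k(x_k)), y_k)
    return total

def toy_model(h):
    """Stand-in for M_theta with a single fixed parameter."""
    return 2.0 * h

encoders = {"gaze": lambda x: x + 1.0, "depth": lambda x: x}
losses = {"gaze": squared_error, "depth": absolute_error}
datasets = {
    "gaze": [(0.0, 2.0), (1.0, 3.0)],
    "depth": [(1.0, 2.5)],
}
objective = pretraining_objective(toy_model, encoders, datasets, losses)
```

Since every task contributes to the same scalar, gradients from all tasks flow into the shared parameters, which is what allows information to be shared across tasks.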

Figure 3: Illustration of the IoT-LM instruction tuning paradigm, in which the model learns to perform specific tasks based on directive inputs. Training on a diverse range of input modalities and output tasks enables IoT-LM to process multiple IoT inputs and execute complex tasks.

3.4 IoT-LM Instruction Tuning

During the instruction-tuning phase, IoT-LM is trained to understand and interpret multisensory IoT data contexts and perform specific tasks based on directed language inputs. These tasks can include making a prediction, answering a question, holding a dialog, reasoning about physical properties, and so on. This stage is crucial for refining the model’s ability to follow complex and varied human instructions. The instruction tuning phase is guided by the following optimization:

$$\Theta_{\text{tune}} = \arg\min_{\theta} \sum_{k=1}^{K} \sum_{(x_k, c_k, y_k) \in \mathcal{T}_k} \mathcal{L}_k(M_{\theta}(E_k(x_k), c_k), y_k), \tag{3}$$

where $\mathcal{T}_k$ represents the task-specific dataset with inputs $x_k$, context or commands $c_k$, and outputs $y_k$ for task $k$. Here, $E_k(x_k)$ denotes the encoded sensor data for task $k$, $c_k$ provides specific instructions or task directives, and $M_{\theta}$ is the model parameterized by $\theta$ including the tuned adapters. This phase ensures that IoT-LM not only learns to process a wide array of sensor data but also understands and executes complex commands pertinent to IoT applications, thus achieving high performance across varied IoT tasks.
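The triples $(x_k, c_k, y_k)$ in Eq. (3) can be pictured as ordinary supervised pairs wrapped with a natural-language instruction. The sketch below shows one possible way to build them. Everything specific in it is an assumption made for illustration: the paper does not list its prompt templates, so the strings in `TEMPLATES`, the `<sensor>` placeholder token, and the helper names are all hypothetical.

```python
# Hypothetical instruction templates c_k; the paper does not list its prompts.
TEMPLATES = {
    "activity_recognition": "What activity is the person performing?",
    "touch_contact": "What type of touch is shown on the capacitive surface?",
}

def to_instruction_sample(task, encoded_features, label):
    """Turn one supervised pair (x_k, y_k) into a triple (E_k(x_k), c_k, y_k)."""
    return {
        "task": task,
        "sensor_features": list(encoded_features),  # E_k(x_k)
        "instruction": TEMPLATES[task],              # c_k
        "target": str(label),                        # y_k, expressed as text
    }

def build_prompt(sample, sensor_token="<sensor>"):
    """Place one placeholder token per sensor feature ahead of the instruction."""
    prefix = " ".join([sensor_token] * len(sample["sensor_features"]))
    return prefix + "\n" + sample["instruction"]

sample = to_instruction_sample("activity_recognition", [0.1, 0.2, 0.3], "walking")
prompt = build_prompt(sample)
```

Expressing the target as text is what lets a single language-model loss cover classification, question answering, and dialog in the same training run.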

The combination of pretraining with IoT-specific sensory data and subsequent instruction tuning prepares IoT-LM to handle real-world tasks effectively. This training approach not only enhances the model’s capability to understand complex sensor data but also ensures it can perform a wide array of tasks, from simple classification to complex reasoning and dialogues involving multiple IoT devices.

4 Experiments

Our experiments benchmark the performance of IoT-LM on supervised IoT classification tasks, as well as its reasoning, dialog, interaction, and zero-shot and few-shot transfer abilities across different modalities and tasks.

4.1 Experimental Setup

Experiments were conducted using NVIDIA A100 GPUs, ensuring high-performance computation for our deep learning models. The models were trained for 30 epochs using the Adam optimizer with a learning rate of 1e-4 and a batch size of 16. We compare to the following baselines:

  • Unimodal models [33, 3, 45] processed data from single sensor types using tailored neural architectures for each IoT domain data,

  • Unimodal Adapter models [16] utilized deep architectures such as LLaMA-Adapter [16] with adapter layers, fine-tuning only the adapter modules with a learning rate of 0.0005,

  • Unimodal multi-task models [39] employed shared encoder layers and task-specific decoders, ensuring balanced gradients among tasks,

  • Multisensory models [26], where data fusion can occur at varying levels, from input to decision, and models ensured balanced data representation from each modality,

  • Multisensory multitask models [41] utilized modality-specific encoders followed by task-specific decoders, balancing both modalities and tasks during training. Each method’s efficacy was validated on respective datasets.

To evaluate performance, we employ task-specific metrics. For gaze and pose estimation, we measure the mean Euclidean error in centimeters between predictions and ground truth. Depth estimation uses mean absolute error in millimeters, while gesture classification, touch contact classification, and activity recognition rely on accuracy. Event detection uses the F1 score for confident threshold predictions, and 3D pose reconstruction is assessed using the end-point error in millimeters for joint discrepancies.
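For reference, the metric families named above can be written out in plain Python as follows. This is a minimal sketch, not the paper's evaluation code. In particular, the F1 function assumes a binary setting with a single positive class, since the averaging scheme used for event detection is not specified, and the function names are chosen for this example.

```python
import math

def mean_euclidean_error(preds, targets):
    """Average L2 distance between predicted and true points (gaze, pose)."""
    return sum(math.dist(p, t) for p, t in zip(preds, targets)) / len(preds)

def mean_absolute_error(preds, targets):
    """Average absolute difference between scalars (depth)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def accuracy(preds, targets):
    """Percentage of exact label matches (gesture, touch, activity)."""
    return 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(preds)

def f1_score(preds, targets, positive=1):
    """Binary F1 for one positive class (an assumption for event detection)."""
    pairs = list(zip(preds, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Small worked examples.
gaze_err = mean_euclidean_error([(0, 0), (3, 4)], [(0, 0), (0, 0)])
depth_err = mean_absolute_error([1.0, 3.0], [2.0, 5.0])
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])
```

Note that the first two metrics are errors (lower is better) while the last two are scores (higher is better), which is why the result tables mark each column with a direction arrow.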

Table 1: Multisensory multi-task learning is a particularly effective approach on IoT-LM, enabling information sharing to learn general representations for IoT data. Our IoT-LM achieves the best results across all diverse tasks, benefiting from the multisensory multitask adapter and instruction tuning.

| Method | Gaze est. (cm, ↓) | Depth est. (mm, ↓) | Gesture cls. (%, ↑) | Pose est. (cm, ↓) | Touch cls. (%, ↑) | Event det. (%, ↑) | Activity recog. (%, ↑) | 3D recons. (mm, ↓) |
|---|---|---|---|---|---|---|---|---|
| Domain-specific | 3.76 | 40.9 | 68.2 | 10.86 | 52.6 | 59.3 | 48.5 | 35.6 |
| Unimodal | 2.26 | 20.7 | 97.3 | 6.49 | 88.0 | 86.9 | 79.2 | 22.2 |
| Unimodal Adapter | 2.05 | 18.6 | 97.6 | 5.75 | 88.7 | 87.5 | 82.3 | 21.3 |
| Unimodal Multi-task | 1.95 | 18.2 | 98.2 | 5.36 | 89.3 | 88.1 | 82.5 | 20.5 |
| Multisensory | 1.79 | 17.3 | 98.7 | 4.62 | 91.2 | 89.1 | 83.5 | 19.6 |
| Multisensory Multi-task | 1.08 | 13.6 | 99.3 | 3.85 | 93.8 | 92.7 | 87.5 | 17.5 |
| Multisensory multitask IoT-LM | 0.95 | 11.5 | 99.6 | 3.24 | 94.6 | 93.8 | 89.2 | 16.3 |

4.2 Main quantitative results

Overall performance: Table 1 reports the quantitative results of IoT-LM compared to state-of-the-art domain-specific, single-modality, single-task, multimodal multitask, and adapter and alignment models. As seen in Table 1, IoT-LM consistently outperforms the single-modality and single-task models across all tasks. This can be attributed to its ability to integrate textual information across modalities and tasks, which is especially crucial when one modality might have noisy or incomplete data. While the adapter and multimodal multitask models show commendable performance due to their ability to adapt to new tasks, they often fall short in scenarios where multiple modalities have to be processed simultaneously. IoT-LM improves upon models that only perform multimodal multitask supervised learning, since it inherits the prediction and reasoning capabilities of the pretrained large language model component.
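To put the gap between IoT-LM and the strongest baseline in Table 1 into perspective, the short calculation below computes the relative error reduction on the four error-based (lower-is-better) tasks. The numbers are copied from the "Multisensory Multi-task" and "Multisensory multitask IoT-LM" rows of Table 1; the dictionary keys and the helper name `relative_reduction` are just labels chosen for this example.

```python
# Error metrics copied from Table 1 (lower is better).
baseline = {"gaze_cm": 1.08, "depth_mm": 13.6, "pose_cm": 3.85, "recon_mm": 17.5}
iot_lm = {"gaze_cm": 0.95, "depth_mm": 11.5, "pose_cm": 3.24, "recon_mm": 16.3}

def relative_reduction(before, after):
    """Percentage by which an error metric shrinks."""
    return 100.0 * (before - after) / before

reductions = {
    task: round(relative_reduction(baseline[task], iot_lm[task]), 1)
    for task in baseline
}
```

On these four tasks the relative error reduction ranges from about 7% (3D reconstruction) to about 16% (pose estimation), which gives a sense of scale for the absolute differences in the table.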

Table 2: Multimodal learning enables complementary learning of information for IoT-LM and achieves strong performance. For all tasks, the incorporation of more modalities resulted in more robust and accurate models across diverse IoT applications based on multiple sensors.

| Modality Ratio | Gaze est. (cm, ↓) | Depth est. (mm, ↓) | Gesture cls. (%, ↑) | Pose est. (cm, ↓) | Touch cls. (%, ↑) | Event det. (%, ↑) | Activity recog. (%, ↑) | 3D recons. (mm, ↓) |
|---|---|---|---|---|---|---|---|---|
| single-modality | 1.38 | 15.1 | 98.1 | 5.07 | 91.9 | 91.8 | 86.3 | 18.7 |
| 25% | 1.25 | 14.3 | 98.4 | 4.63 | 92.5 | 92.2 | 86.7 | 18.3 |
| 50% | 1.12 | 13.2 | 98.9 | 4.15 | 93.2 | 92.5 | 87.2 | 17.9 |
| all | 1.08 | 12.9 | 99.2 | 3.76 | 93.5 | 92.9 | 88.1 | 17.5 |

Table 3: Multi-task learning is another effective strategy on IoT-LM, enabling information sharing across tasks. For all tasks, the incorporation of more tasks resulted in more accurate models.

| Task Ratio | Gaze est. (cm, ↓) | Depth est. (mm, ↓) | Gesture cls. (%, ↑) | Pose est. (cm, ↓) | Touch cls. (%, ↑) | Event det. (%, ↑) | Activity recog. (%, ↑) | 3D recons. (mm, ↓) |
|---|---|---|---|---|---|---|---|---|
| single-task | 1.38 | 15.1 | 98.1 | 5.07 | 91.9 | 91.8 | 86.3 | 18.7 |
| 25% | 1.29 | 14.5 | 98.3 | 4.86 | 92.2 | 92.1 | 86.7 | 18.4 |
| 50% | 1.22 | 13.8 | 98.6 | 4.52 | 92.6 | 92.5 | 87.2 | 18.0 |
| all | 1.13 | 13.1 | 99.1 | 4.23 | 93.1 | 92.8 | 87.8 | 17.5 |

Performance across increasing modalities: In this section, we study how adding modalities to IoT-LM impacts performance. From Table 2, we observe consistent improvements on every task when adding multimodal datapoints at increasing ratios (25%, 50%, all) compared to the single-modality setting. This can be attributed to IoT-LM’s ability to tap into complementary information present in different modalities, especially in scenarios where one modality might be ambiguous or noisy.

Performance across increasing tasks: We also analyzed model performance when trained on increasing numbers of tasks, while keeping the modality inputs constant. From Table 3, we see that IoT-LM’s performance steadily increased on every task as we increased the number of tasks during training. This suggests that the multitask instruction tuning in IoT-LM was beneficial since the model learns more general features while also improving computational efficiency.

Table 4: IoT-LM shows the best zero-shot and few-shot generalization capabilities as compared to other supervised unimodal, multimodal, single-task, and multitask variants. As a result, IoT-LM can be a promising approach to deal with limited labeled data often seen in real-world IoT systems.

| Method | Gaze estimation (cm, ↓) | Touch contact classification (%, ↑) |
|---|---|---|
| multisensory IoT-LM | 1.08 | 93.5 |
| multisensory multitask IoT-LM | 1.03 | 94.1 |
| multisensory multitask IoT-LM (zero-shot) | 1.25 | 92.3 |
| multisensory multitask IoT-LM (5-shot) | 1.21 | 92.5 |
| multisensory multitask IoT-LM (10-shot) | 1.13 | 92.9 |
| multisensory multitask IoT-LM (20-shot) | 1.06 | 93.6 |

Zero-shot and few-shot transfer: Furthermore, we study whether IoT-LM trained on certain modalities or tasks can transfer to a new set of target modalities or tasks that it has never seen during training (zero-shot) or has seen with only very few examples (few-shot). We chose the fix-8 dataset as the target, primarily because of its diverse representation of modalities (IMU, capacitance, depth, image) and its challenging tasks (gaze estimation and touch contact classification). From Table 4, we find that even a few examples boosted performance on both tasks compared to the zero-shot setting, which highlights the model’s ability to quickly adapt to new information. Using IoT-LM as a base model for zero-shot and few-shot experiments consistently outperformed other types of models, such as supervised unimodal, multimodal, single-task, and multitask variants. These gains were most pronounced in the 20-shot setting but were noticeably beneficial even in the 5-shot scenario. Our results suggest that IoT-LM excels at zero-shot and few-shot learning not seen in single-modality, single-task, and supervised models, and is a promising approach to deal with limited labeled data often seen in real-world IoT systems.

Table 5: Scaling law of multisensory multi-task adapter is observed on MultiIoT, enabling models with more parameters to learn general representations for IoT data.
Params   Gaze est. (cm, ↓)   Depth est. (mm, ↓)   Gesture cls. (%, ↑)   Pose est. (cm, ↓)   Touch cls. (%, ↑)   Event det. (%, ↑)   Activity recog. (%, ↑)   3D recons. (mm, ↓)
7B       1.03                12.7                 99.4                  3.56                94.1                93.1                88.3                     17.1
13B      0.98                11.3                 99.5                  3.42                94.3                93.3                88.5                     16.6
70B      0.95                11.5                 99.6                  3.24                94.6                93.8                89.2                     16.3

Scaling law of IoT-LM: Finally, we systematically increased the model size of IoT-LM to observe the impact of size on performance and representation learning. We evaluated three configurations of the multisensory multi-task adapter: small (7 billion parameters), medium (13 billion parameters), and large (70 billion parameters); the results are reported in Table 5. Each configuration was trained using a consistent training regime on a curated dataset comprising various IoT sensory inputs, including visual, auditory, and tactile data. We focused on a range of tasks, such as anomaly detection, predictive maintenance, and activity recognition, to test the adaptability and efficiency of the models at different scales. Our experiments show a clear scaling trend: as the number of parameters increases, performance improves on nearly every task, the one exception being depth estimation, where the 13B model (11.3 mm) slightly outperforms the 70B model (11.5 mm). This is quantified not only in terms of accuracy, but also in how effectively the models generalize to unseen data, indicating better learning of underlying representations. The observed scaling trend suggests that larger models are more adept at integrating and processing multisensory data, leading to more robust and general representations. This supports the hypothesis that model capacity plays a crucial role in multisensory learning environments typical of IoT applications.
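The scaling trend can be checked directly against Table 5. The sketch below copies the table values and flags any task whose metric does not improve strictly from 7B to 13B to 70B; the dictionary keys and helper names are illustrative assumptions.

```python
# Values copied from Table 5. Each entry: ((7B, 13B, 70B), lower_is_better).
TABLE5 = {
    "gaze_est_cm":        ((1.03, 0.98, 0.95), True),
    "depth_est_mm":       ((12.7, 11.3, 11.5), True),
    "gesture_cls_pct":    ((99.4, 99.5, 99.6), False),
    "pose_est_cm":        ((3.56, 3.42, 3.24), True),
    "touch_cls_pct":      ((94.1, 94.3, 94.6), False),
    "event_det_pct":      ((93.1, 93.3, 93.8), False),
    "activity_recog_pct": ((88.3, 88.5, 89.2), False),
    "recons_3d_mm":       ((17.1, 16.6, 16.3), True),
}


def improves_monotonically(scores, lower_is_better):
    """True if every larger model is strictly better than the previous one."""
    # Negate error metrics so that "larger is better" holds for all tasks.
    signed = [-s if lower_is_better else s for s in scores]
    return all(a < b for a, b in zip(signed, signed[1:]))


def non_monotonic_tasks(table=TABLE5):
    """List the tasks whose metric does not improve at every scale step."""
    return [
        task
        for task, (scores, lower_is_better) in table.items()
        if not improves_monotonically(scores, lower_is_better)
    ]


if __name__ == "__main__":
    print(non_monotonic_tasks())
```

With the values in Table 5, seven of the eight tasks improve at every step; depth estimation is the only task flagged, because the 70B model (11.5 mm) is slightly worse than the 13B model (11.3 mm).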

Figure 4: Dialog for audio example. Our IoT-LM accurately predicts the activity corresponding to the input audio spectrogram, and gives a reasonable explanation for its prediction.
Figure 5: Dialog for IMU example. IoT-LM accurately predicts the activity corresponding to the input IMU data, and also gives a reasonable explanation for its prediction.

4.3 Qualitative analysis

In this section, we qualitatively demonstrate the dialog and interactive capabilities of IoT-LM on processing audio and Inertial Measurement Unit (IMU) data.

Audio: For audio data, in Figure 4, IoT-LM was presented with an audio waveform and instructed to identify the corresponding activity from a list of categories. The model determined that the audio signal most closely resembled the category "Coughing." It justified this choice by the sudden and sporadic nature of the spikes in the waveform, which align with the acoustic signature of a cough. This example highlights the model’s capability to process temporal acoustic features, categorize them correctly, and reason about them.

IMU: For IMU data, in Figure 5, IoT-LM analyzed a graph with multiple lines, each representing different sensor readings from the IMU. The task was to select an activity from a list that best matched the IMU data pattern. The model identified "Knocking" as the most likely activity, reasoning that the regular intervals of peaks were indicative of a repetitive action with varying force, which is consistent with the pattern of knocking.

These examples demonstrate the multisensory multi-task adapter’s analytical proficiency in interpreting and classifying data from different IoT modalities based on their temporal characteristics. Such capability is instrumental in realizing the full potential of IoT systems, where understanding and acting upon such heterogeneous data in real time is essential for various applications, from smart homes to industrial automation. The examples provided also show the model’s potential in bridging the gap between raw sensor data and meaningful insights, enabling non-expert users to interact with and benefit from complex IoT systems.

5 Conclusion and Broader Impacts

This paper presents IoT-LM, a new large multisensory language model with multisensory perception and natural language interaction capabilities over a spectrum of IoT modalities and applications. Key innovations in IoT-LM include a new multisensory multitask adapter to simultaneously condition pretrained LLMs on multiple multisensory IoT tasks for better generalization, as well as a new resource of 1.15 million paired IoT sensor and natural language samples covering 12 modalities and 8 real-world tasks. Overall, IoT-LM not only advances the state of the art in IoT predictive learning but also enables new question-answering, reasoning, and interactive dialog capabilities on physical sensor data. Future work should focus on expanding the model’s capabilities to more IoT modalities and tasks, improving its reasoning and robustness, and exploring the implications of its deployment in real-world IoT systems. We hope that IoT-LM will inspire further innovations at the intersection of machine learning and IoT, contributing to smarter and more responsive technologies that can understand and interact with their environments.

We are aware of some potential limitations and broader impacts of our work. Firstly, there may be privacy risks associated with making predictions from multimodal data of recorded human behaviors, such as video, audio, activities, poses, and wearable sensors. As such, we made sure the datasets we use are those that collect data from consenting participants. We only use these datasets for research purposes. All data was anonymized and stripped of all personal (e.g., personally identifiable information) and protected attributes (e.g., race, gender). Furthermore, it is also important to keep data and features private on each device without sending it to other locations. There are potential avenues to combine the multisensory and multitask models in our paper with techniques such as federated learning [37, 38], differential privacy [19], or encryption [14] to preserve the privacy of sensor data. In addition to privacy concerns, modern large-scale machine learning models can cause environmental impacts resulting from high carbon footprints [51]. An important direction is to build tiny and efficient models for IoT. The models and resources we have released can help future work explore these efficient methods and quickly benchmark their performance. Finally, we also acknowledge that there is a risk of social biases when human-centric data and possibly sensitive labels are involved, such as models that perform poorly on certain demographic groups. When language model outputs are involved, they can also amplify the underlying social biases [43] and generate harmful content [40]. Future work must study and mitigate social biases present in multisensory models and large language models.

References

  • Adi et al. [2020] Erwin Adi, Adnan Anwar, Zubair Baig, and Sherali Zeadally. Machine learning and data analytics for the iot. Neural computing and applications, 32:16205–16233, 2020.
  • Ahamed and Farid [2018] Farhad Ahamed and Farnaz Farid. Applying internet of things and machine-learning for personalized healthcare: Issues and challenges. In 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), pages 19–21. IEEE, 2018.
  • Ahuja et al. [2021] Karan Ahuja, Paul Streli, and Christian Holz. Touchpose: Hand pose prediction, depth estimation, and touch classification from capacitive images. In Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450386357. URL https://doi.org/10.1145/3472749.3474801.
  • De Livera et al. [2011] Alysha M. De Livera, Rob J. Hyndman, and Ralph D. Snyder. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496):1513–1527, 2011. doi: 10.1198/jasa.2011.tm09771. URL https://doi.org/10.1198/jasa.2011.tm09771.
  • Anil et al. [2023] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • Arakawa et al. [2022] Riku Arakawa, Mayank Goel, Chris Harrison, and Karan Ahuja. Rgbdgaze: Gaze tracking on smartphones with RGB and depth data. In International Conference on Multimodal Interaction, ICMI 2022, Bengaluru, India, November 7-11, 2022, pages 329–336, New York, 2022. ACM. doi: 10.1145/3536221.3556568.
  • Atmoko et al. [2017] R A Atmoko, R Riantini, and M K Hasin. Iot real time data acquisition using mqtt protocol. Journal of Physics: Conference Series, 853(1):012003, may 2017. doi: 10.1088/1742-6596/853/1/012003. URL https://dx.doi.org/10.1088/1742-6596/853/1/012003.
  • Atzori et al. [2010] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer networks, 54(15):2787–2805, 2010.
  • Baltrušaitis et al. [2019] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Billah et al. [2006] Baki Billah, Maxwell L. King, Ralph D. Snyder, and Anne B. Koehler. Exponential smoothing model selection for forecasting. International Journal of Forecasting, 22(2):239–247, 2006. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2005.08.002. URL https://www.sciencedirect.com/science/article/pii/S016920700500107X.
  • Bommasani et al. [2022] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2022.
  • Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Cook et al. [2020] Andrew A. Cook, Göksel Mısırlı, and Zhong Fan. Anomaly detection for iot time-series data: A survey. IEEE Internet of Things Journal, 7(7):6481–6494, 2020. doi: 10.1109/JIOT.2019.2958185.
  • Dankar and El Emam [2013] Fida Kamal Dankar and Khaled El Emam. Practicing differential privacy in health care: A review. Trans. Data Priv., 6(1):35–67, 2013.
  • Ebrahimi et al. [2019] Shahriar Ebrahimi, Siavash Bayat-Sarmadi, and Hatameh Mosanaei-Boorani. Post-quantum cryptoprocessors optimized for edge and resource-constrained devices in iot. IEEE Internet of Things Journal, 6(3):5500–5507, 2019. doi: 10.1109/JIOT.2019.2903082.
  • Gao et al. [2023] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  • Gardner Jr. [1985] Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985. doi: https://doi.org/10.1002/for.3980040103. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/for.3980040103.
  • Geiger et al. [2013] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The kitti dataset. Int. J. Rob. Res., 32(11):1231–1237, sep 2013. ISSN 0278-3649. doi: 10.1177/0278364913491297. URL https://doi.org/10.1177/0278364913491297.
  • Geyer et al. [2017] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • Ghazal et al. [2021] Taher M Ghazal, Mohammad Kamrul Hasan, Muhammad Turki Alshurideh, Haitham M Alzoubi, Munir Ahmad, Syed Shehryar Akbar, Barween Al Kurdi, and Iman A Akour. Iot for smart cities: Machine learning approaches in smart healthcare—a review. Future Internet, 13(8):218, 2021.
  • Google [2024] Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024.
  • Grauman et al. [2022] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina González, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jáchym Kolář, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbeláez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, June 2022.
  • Huang et al. [2018] Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J. Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 37:185:1–185:15, November 2018. First two authors contributed equally.
  • Imteaj et al. [2022] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M. Hadi Amini. A survey on federated learning for resource-constrained iot devices. IEEE Internet of Things Journal, 9(1):1–24, 2022. doi: 10.1109/JIOT.2021.3095077.
  • Jaegle et al. [2021a] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021a.
  • Jaegle et al. [2021b] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver: General perception with iterative attention. arXiv preprint arXiv:2103.03206, 2021b.
  • Javaid et al. [2018] Sabeen Javaid, Ali Sufian, Saima Pervaiz, and Mehak Tanveer. Smart traffic management system using internet of things. In 2018 20th international conference on advanced communication technology (ICACT), pages 393–398. IEEE, 2018.
  • Jia et al. [2021] Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3496–3504, 2021.
  • Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Khan et al. [2018] Nida Saddaf Khan, Sayeed Ghani, and Sajjad Haider. Real-time analysis of a sensor’s data for automated decision making in an iot-based smart home. Sensors, 18(6), 2018. ISSN 1424-8220. doi: 10.3390/s18061711. URL https://www.mdpi.com/1424-8220/18/6/1711.
  • Khayyam et al. [2020] Hamid Khayyam, Bahman Javadi, Mahdi Jalili, and Reza N Jazar. Artificial intelligence and internet of things for autonomous vehicles. Nonlinear Approaches in Engineering Applications: Automotive Applications of Engineering Problems, pages 39–68, 2020.
  • Khor et al. [2021] Jing Huey Khor, Michail Sidorov, and Peh Yee Woon. Public blockchains for resource-constrained iot devices—a state-of-the-art survey. IEEE Internet of Things Journal, 8(15):11960–11982, 2021. doi: 10.1109/JIOT.2021.3069120.
  • Kong et al. [2021] Andy Kong, Karan Ahuja, Mayank Goel, and Chris Harrison. Eyemu interactions: Gaze + imu gestures on mobile devices. In Proceedings of the 2021 International Conference on Multimodal Interaction, ICMI ’21, page 577–585, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384810. doi: 10.1145/3462244.3479938. URL https://doi.org/10.1145/3462244.3479938.
  • Kulkarni et al. [2014] Alok Kulkarni, Sampada Sathe, et al. Healthcare applications of the internet of things: A review. International Journal of Computer Science and Information Technologies, 5(5):6229–6232, 2014.
  • Kumar et al. [2020] Raghavendra Kumar, Pardeep Kumar, and Yugal Kumar. Time series data prediction using iot and machine learning technique. Procedia Computer Science, 167:373–381, 2020. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2020.03.240. URL https://www.sciencedirect.com/science/article/pii/S1877050920307067. International Conference on Computational Intelligence and Data Science.
  • Li et al. [2015] Shancang Li, Li Da Xu, and Shanshan Zhao. The internet of things: a survey. Information systems frontiers, 17:243–259, 2015.
  • Li et al. [2018] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. CoRR, abs/1812.06127, 2018. URL http://arxiv.org/abs/1812.06127.
  • Liang et al. [2020] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. 2020.
  • Liang et al. [2021a] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021a.
  • Liang et al. [2021b] Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR, 2021b.
  • Liang et al. [2022] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, and Russ Salakhutdinov. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning. Transactions on Machine Learning Research, 2022.
  • Liang et al. [2023] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 2023.
  • Lloyd [2018] Kirsten Lloyd. Bias amplification in artificial intelligence systems. CoRR, abs/1809.07842, 2018. URL http://arxiv.org/abs/1809.07842.
  • Lu et al. [2019] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13–23, 2019.
  • Mollyn et al. [2022] Vimal Mollyn, Karan Ahuja, Dhruv Verma, Chris Harrison, and Mayank Goel. Samosa: Sensing activities with motion and subsampled audio. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 6(3), sep 2022. doi: 10.1145/3550284. URL https://doi.org/10.1145/3550284.
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
  • Qi et al. [2015] Jun Qi, Po Yang, Dina Fan, and Zhikun Deng. A survey of physical activity monitoring and assessment using internet of things technology. In 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, pages 2353–2358. IEEE, 2015.
  • Ramezani et al. [2020] Milad Ramezani, Yiduo Wang, Marco Camurri, David Wisth, Matias Mattamala, and Maurice Fallon. The newer college dataset: Handheld LiDAR, inertial and vision with ground truth. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, oct 2020.
  • Reed et al. [2022] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022.
  • Rose et al. [2015] Karen Rose, Scott Eldridge, and Lyman Chapin. The internet of things: An overview. The internet society (ISOC), 80:1–50, 2015.
  • Strubell et al. [2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.
  • Sworna et al. [2021] Nabila Sabrin Sworna, AKM Muzahidul Islam, Swakkhar Shatabda, and Salekul Islam. Towards development of iot-ml driven healthcare systems: A survey. Journal of Network and Computer Applications, 196:103244, 2021.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Tsai et al. [2019] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, 2019.
  • Yuehong et al. [2016] YIN Yuehong, Yan Zeng, Xing Chen, and Yuanjie Fan. The internet of things in healthcare: An overview. Journal of Industrial Information Integration, 1:3–13, 2016.
  • Zantalis et al. [2019] Fotios Zantalis, Grigorios Koulouras, Sotiris Karabetsos, and Dionisis Kandris. A review of machine learning and iot in smart transportation. Future Internet, 11(4):94, 2019.
  • Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

Appendix A Details Regarding IoT Datasets

A.1 Twelve Rich Modalities

We collected diverse data from IoT devices, such as Inertial Measurement Units (IMU), thermal sensors, and Global Positioning Systems (GPS). Furthermore, we include challenging modalities, such as capacitance, depth, gaze, and pose. Finally, we collect common and widely used image, audio, and video modalities. These modalities bring unique challenges since they typically involve noisy real-world sensor measurements that lack the explicit tokenization and alignment with other modalities that we typically expect from conventional multimodal image-text research.

IMU: Inertial Measurement Units capture 3D motion and orientation. This data is fundamental for various applications, including motion tracking and navigation. We collected 2,940 IMU samples from EyeMU [33] for gaze estimation and motion gesture classification; EyeMU uses the raw per-axis accelerometer and gyroscope values sampled at 60 Hz as the IMU values. We include 28,400 IMU instances from SAMoSA [45], which records synchronized streams of 9-axis IMU data (accelerometer, gyroscope, and orientation) at 50 Hz using a Fossil Gen 5 smartwatch running Google Android wearOS 2.23. Further, we sampled 160,120 IMU samples (9-axis) recorded by the device motion sensor using an iOS application [6] on an Apple iPhone X for gaze tracking. For human bodies, we use 330,178 IMU orientation recordings [23] from 17 sensors on different body parts for pose estimation and activity recognition. For first-person videos in Ego4D [22], we used 510,142 timestamp-based IMU samples with the normalized accelerometer and gyroscope values in each video for activity recognition. For IMU data on self-driving cars, we collected 41,000 samples from KITTI [18] for depth estimation.
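Because the IMU sources above are sampled at different rates (60 Hz for EyeMU, 50 Hz for SAMoSA), a common preprocessing option is to resample windows to a shared rate before batching. The paper does not state how, or whether, the streams are resampled, so the following is only a minimal sketch under that assumption; `resample_imu` is a hypothetical helper that uses per-channel linear interpolation.

```python
import numpy as np


def resample_imu(samples, src_hz, dst_hz):
    """Linearly resample a (T, C) IMU window from src_hz to dst_hz.

    Hypothetical helper: the resampling step is an assumption made
    for illustration, not a procedure described in the paper.
    """
    samples = np.asarray(samples, dtype=np.float64)
    n_src = samples.shape[0]
    duration = (n_src - 1) / src_hz            # seconds spanned by the window
    n_dst = int(round(duration * dst_hz)) + 1  # samples covering the same span
    t_src = np.arange(n_src) / src_hz
    t_dst = np.arange(n_dst) / dst_hz
    # np.interp handles 1-D signals, so interpolate each channel separately.
    return np.stack(
        [np.interp(t_dst, t_src, samples[:, c]) for c in range(samples.shape[1])],
        axis=1,
    )


if __name__ == "__main__":
    # One second of synthetic 6-axis data at 60 Hz, resampled to 50 Hz.
    t = np.arange(61) / 60.0
    window = np.stack([np.sin(2 * np.pi * (k + 1) * t) for k in range(6)], axis=1)
    print(resample_imu(window, 60, 50).shape)
```

Linear interpolation is the simplest choice here; when downsampling signals with substantial high-frequency content, a low-pass filter applied before resampling would reduce aliasing.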

Thermal: Thermal modality data provide temperature radiance insights, crucial in surveillance. For collection, we used 12,025 Thermal samples from LLVIP [28] containing many pedestrians and cyclists from different locations on the street between 6 and 10 o’clock in the evening. They used HIKVISION DS-2TD8166BJZFY-75H2F/V2 as the camera equipment, a binocular camera platform consisting of an infrared camera with a wavelength of 8∼14 μm.

GPS: Global Positioning Systems offer location data with high precision. This data is invaluable for tasks like location-based services, asset tracking, and navigation. For GPS data on self-driving cars, we collected 41,000 samples from KITTI [18] using OXTS RT3003 inertial and GPS navigation system for depth estimation. The geographic coordinates include global orientation, altitude, velocities, accelerations, angular rates, and satellite information. Following the original dataset, we applied two specified 3-axis coordinates as accelerations for the vehicle and angular rates to describe the tangent plane of the earth’s surface corresponding to the geographic location.

Camera: Cameras provide visual data, capturing the environment in rich detail. They serve as the backbone for countless computer vision tasks. For Camera data, we collected 41,000 instances from KITTI [18] using a Velodyne laser scanner installed on a vehicle for depth estimation. We stored the 3-axis coordinates and an additional reflectance value for each point. The timestamp-based points can be considered according to the scanner’s continuous rotation about its vertical axis, which provides complementary context to GPS/IMU systems for autonomous driving.
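The per-point layout described above (3-axis coordinates plus a reflectance value) can be sketched as follows. The code assumes a flat buffer of native float32 values with four numbers per point, which is the layout used by the public KITTI Velodyne scan files; whether the released MultiIoT data keeps exactly this layout is an assumption, and `parse_lidar_scan` and `split_scan` are hypothetical helper names.

```python
import numpy as np


def parse_lidar_scan(raw_bytes):
    """Decode a flat float32 buffer into an (N, 4) array: x, y, z, reflectance."""
    points = np.frombuffer(raw_bytes, dtype=np.float32)
    if points.size % 4 != 0:
        raise ValueError("buffer length is not a multiple of 4 floats")
    return points.reshape(-1, 4)


def split_scan(points):
    """Separate the 3-axis coordinates from the per-point reflectance."""
    return points[:, :3], points[:, 3]


if __name__ == "__main__":
    # Synthetic two-point scan standing in for a real scan file.
    fake = np.array(
        [[1.0, 2.0, 0.5, 0.25], [4.0, -1.0, 0.5, 0.75]], dtype=np.float32
    )
    xyz, reflectance = split_scan(parse_lidar_scan(fake.tobytes()))
    print(xyz.shape, reflectance.shape)
```

For a scan stored on disk in the same layout, `np.fromfile(path, dtype=np.float32)` could replace the `np.frombuffer` call.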

Capacitance: Capacitive sensors measure changes in capacitance to detect nearby objects or changes. This is foundational for touchscreen technologies and proximity sensing. For capacitance data, we used 65,374 samples from TouchPose [3] using a 39.6 cm capacitive Crystal Touch panel (Ocular Touch, Dallas, TX), 16-bit touch digitizer, and cameras to record ground-truth data. When fingers approach the lines on the mutual-capacitance touch sensor, it causes a capacitance drop between lines, resulting in the mutual-capacitance image.

Depth: Depth sensors measure distances between the sensor and objects, providing a 3D view of the environment. They play a significant role in tasks like object detection and scene reconstruction. For depth data, we collected 160,120 samples from RGBDGaze [6] using an Apple iPhone X with a TrueDepth camera (640 × 480 depth map interpolated from a 170 × 170 IR dot pattern). The participants were asked to look at a target (red dot) that was moving on the screen. While the user gazed at the target, the depth imagery was logged at approximately 8 Hz, along with the ARKit gaze prediction. For touch depth data, we used 65,374 samples from TouchPose [3] recorded by a ToF depth camera (Azure Kinect) below the surface of the transparent touch panel. The depth modality is essential for touch contact and 3D pose joint reconstruction.

Gaze: Gaze sensors track eye movement and direction, offering insights into user attention and intention. For the gaze modality, we collected 2,940 samples from EyeMU [33] by running an iOS application on an Apple iPhone 12 Pro (screen size is 12.8 × 6.4 cm). The participants were asked to gaze at a single red dot, and the screen advanced to capture a motion gesture and a 2-axis gaze location after 1.2 seconds. Furthermore, we used 160,120 gaze samples from RGBDGaze [6] by running an iOS application on an Apple iPhone X to record gaze tracking data at 35 pre-determined fixed locations on the screen.

Pose: Pose sensors capture the orientation and position of objects or individuals, which is critical for motion analysis and interactive applications. For body pose data, we collected 330,178 samples from DIP-IMU [23] using Xsens IMU sensors containing 3-axis accelerometers, gyroscopes, and magnetometers. They placed the head sensor onto the head of each participant such that the sensor axes aligned with the SMPL body frame to do a calibration. The SMPL pose parameters are stored in the angle-axis format with three joint angles (yaw, pitch, roll) per 24 joints. For touch pose data, we used 65,374 samples in TouchPose [3] from a Leap Motion stereo IR camera, running Orion 4.1 for 3D hand pose tracking. We included 14 different finger and whole-hand touch poses and gestures, representing each pose with 3030 to 5875 samples.

LiDAR: LiDAR sensors emit light to measure distances, generating high-resolution 3D maps of environments. They are central to applications like autonomous driving and topographical mapping. For LiDAR data, we collected 51,000 samples from the Newer College dataset [48] using the Ouster LiDAR with 64 beams, 64 channels, 120 m range, 45° vertical field of view (FoV), and 1024 horizontal resolution. During the collection, the Ouster LiDAR was synchronized with the recording computer using the Precision Time Protocol (PTP) to achieve sub-microsecond accuracy. The accurate prior map of 290 million points was down-sampled to 1 cm resolution and reduced to about 17 million points, allowing for its use without an observable drop in registration accuracy. Further, cropping the reduced point cloud around the sensor’s pose dynamically created the final reference cloud of 100 m by 100 m.
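The down-sampling and cropping steps described for the LiDAR reference cloud can be illustrated with a minimal sketch. Voxel-grid averaging is used here as one common way to reduce a point cloud to a fixed resolution; this is an assumption, since the text does not state which down-sampling method was applied, and the function names `voxel_downsample` and `crop_around` are hypothetical.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Average all points that fall into the same cubic voxel.

    Assumption: voxel-grid averaging stands in for the unspecified
    down-sampling method; voxel_size is in the same unit as the points.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[key].append((x, y, z))
    # One representative (the centroid) per occupied voxel.
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in buckets.values()]


def crop_around(points, center_xy, side):
    """Keep points inside a side-by-side square in the x-y plane around the sensor."""
    cx, cy = center_xy
    half = side / 2.0
    return [p for p in points
            if abs(p[0] - cx) <= half and abs(p[1] - cy) <= half]


# Toy cloud in meters: two points share a 1 cm voxel, one lies outside the crop.
cloud = [(0.001, 0.002, 0.0), (0.003, 0.004, 0.0), (5.0, 5.0, 0.0), (80.0, 0.0, 0.0)]
reduced = voxel_downsample(cloud, voxel_size=0.01)
reference = crop_around(reduced, center_xy=(0.0, 0.0), side=100.0)
```

In this toy example the crop is recomputed from the current sensor position, mirroring the dynamic cropping around the sensor’s pose.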

Video: Video captures sequences of visual frames, providing a dynamic view of the environment. This modality supports tasks ranging from action recognition to anomaly detection. For the video modality, we used 510,142 egocentric videos at 30 FPS in Ego4D [22], which includes a wide range of everyday activities, such as cooking, cleaning, and fishing. These videos also cover diverse geographic locations across the world, and are paired with timestamp-based IMU values from the normalized accelerometer and gyroscope.

Audio: Audio sensors capture sound waves, enabling voice recognition, sound classification, and environmental sound analysis. For audio data, we collected 28,400 samples paired with the IMU modality from SAMoSA [45], where participants wore a smartwatch on their dominant arm and were asked to perform 26 activities across 4 contexts, with each activity repeated 3 times within each context. Because the audio was sampled at only 1 kHz, much of the acoustic detail was lost, and activity classes such as Hand Washing and Toothbrushing became similar and easily confused. In such cases, IMU data can provide valuable information to remove ambiguity.

Image: Image sensors offer static visual captures of the environment, serving as a basis for a myriad of vision tasks. For RGB image data, we collected 160,120 samples from RGBDGaze [6] paired with gaze, depth, and IMU for gaze tracking. To align GPS and Camera modalities with images, we collected 41,000 samples from KITTI [18] for depth estimation and activity recognition. Furthermore, we used 12,025 high-quality images paired with infrared thermal samples in LLVIP [28] from 26 locations. For alignment with body pose, we used 330,178 samples from DIP-IMU [23] for pose estimation and activity recognition. Regarding hand pose images, we collected 65,374 instances from TouchPose [3] for touch contact classification and 3D hand pose joint reconstruction.

A.2 Eight Well-defined and Challenging Tasks

Our benchmark includes tasks that reflect real-world IoT challenges and that will drive the community towards solutions with tangible societal impacts.

Gaze estimation: This task is pivotal for human-computer interaction, driver monitoring, and virtual reality. Given RGB images of faces, depth, and IMUs, our goal is to predict the (X/Y) gaze location of the person. This regression task requires a multisensory understanding of long-range interactions between RGB images and depth, and of heterogeneity in IMUs.

Depth estimation: A cornerstone for AR/VR applications, robotics, and object detection, depth estimation involves predicting the distance between the camera and each pixel in the image. Given RGB images, camera parameters, GPS coordinates, and IMU, we are expected to predict the depth maps of objects, such as cars and pedestrians on the streets. In the touch robots case, given RGB images, capacitive images, and hand poses, our target is to estimate the depth maps of hands. This regression problem requires a multisensory understanding of long-range interactions between RGB images and capacitance, and of heterogeneity in poses.

Gesture classification: Crucial for intuitive human-machine interfaces, gesture classification aims to recognize specific hand or body movements. Given gaze locations and IMU data on accelerometer, gyroscope, and orientation, the task is to classify the gesture of human heads. This classification problem requires cross-modal perception of heterogeneity in gaze and IMUs.

Pose estimation: With applications in AR/VR, gaming, and health, pose estimation focuses on determining the spatial arrangement of human joints. Given RGB images and measured IMU data, our goal is to predict the poses of the human body, including 24 joints with three joint angles (yaw, pitch, roll). This regression problem requires a deeper cross-modal understanding of the heterogeneity in IMUs and RGB pixels.

Touch contact classification: Vital for enhancing user experiences on touch-based devices, this task involves determining the type or nature of touch on capacitive surfaces. Given RGB images, capacitive images, depth maps, and hand poses, we are expected to classify touch contact using diverse modalities. This classification task requires a multimodal understanding of the long-range interactions between RGB images and capacitance, and of heterogeneity in depth maps and poses.

Event detection: A broad area with applications in surveillance, smart homes, and industrial setups, event detection involves identifying specific occurrences or anomalies in the data stream. Given audio spectrograms and IMU data on accelerometer, gyroscope, and orientation, our goal is to predict the categories of events across different timestamps. This classification problem requires a cross-modal understanding of the long-range interactions between audio and IMU. If a predicted activity is above a confidence threshold, we consider it an event. Otherwise, if it is below the confidence threshold or belongs to the Other class, we do not consider it an event.
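The event decision rule above can be sketched as follows. This is a minimal illustration rather than the benchmark’s implementation: the function name `detect_event`, the default threshold of 0.5, and the treatment of a prediction exactly at the threshold as an event are all assumptions, since the text does not specify them.

```python
def detect_event(class_probs, threshold=0.5, other_label="Other"):
    """Return the predicted activity if it counts as an event, else None.

    class_probs maps activity names to predicted probabilities.
    Assumptions: threshold value, and ties at the threshold count as events.
    """
    label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
    if label == other_label or confidence < threshold:
        return None  # low confidence or the Other class: not an event
    return label


confident = detect_event({"Hand Washing": 0.8, "Other": 0.2})
uncertain = detect_event({"Hand Washing": 0.4, "Toothbrushing": 0.35, "Other": 0.25})
background = detect_event({"Other": 0.9, "Hand Washing": 0.1})
```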

Activity recognition: Central to fitness, health, and elder care applications, activity recognition aims to discern human activities like walking, running, or jumping. Given RGB images, poses with three joint angles (yaw, pitch, roll), and IMU data, we are expected to classify the class of actions for the human body. For egocentric cases, we are given video frames and IMU orientation recordings from different sensors to predict the category of activity in the videos. This classification task requires a cross-modal understanding of the heterogeneity in poses, videos, and IMU.

3D reconstruction: With significance in gaming, film, and AR/VR, 3D reconstruction involves creating a three-dimensional model of an environment or object from 2D data. Given RGB images, capacitance image, and depth maps, our target is to reconstruct the 3D poses. This regression problem requires a multimodal understanding of both capacitance images and depth maps.

Appendix B Experimental Setup

B.1 Setup for Unimodal Models

  • Data Preparation: Each modality, e.g., RGB images, capacitive images, or hand pose, is pre-processed independently. The data undergo normalization and any specific transformations tailored to that modality.

  • Network Architecture: Distinct neural architectures optimized for each modality type, such as CNNs for images and RNNs for sequential data.

  • Training Details: Models are trained using a batch size of 128, employing the Adam optimizer with a learning rate of 0.001. Early stopping with a patience of 10 epochs helps prevent overfitting.

  • Evaluation: Each unimodal model is evaluated on its respective validation dataset to gauge performance.
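The early-stopping rule in the training details can be sketched in a framework-independent way. The class name `EarlyStopping` is hypothetical, and monitoring the validation loss is an assumption, since the setup states only the patience of 10 epochs and not the monitored quantity.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")  # lowest validation loss seen so far
        self.bad_epochs = 0       # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a short patience so the stop is visible.
stopper = EarlyStopping(patience=3)
history = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97, 0.96]]
```

In a training loop, `step` would be called once per epoch with the validation loss, and the loop would exit when it returns True.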

B.2 Setup for Adapter Models

  • Data Preparation: Data is fed through a pre-trained network, where only the adapter modules are trainable.

  • Network Architecture: Utilizing deep architectures like LLaMA [16], but with adapter layers inserted in-between the pre-defined layers.

  • Training Details: Since only the adapter layers are trainable, fewer parameters are updated, allowing for a larger batch size of 256. The training uses the Adam optimizer with a learning rate of 0.0005.

  • Evaluation: Model performance is assessed by evaluating the fine-tuned model on the targeted task’s validation set.

B.3 Setup for Unimodal Multi-task Models

  • Data Preparation: Data from different tasks, but the same modality, are concatenated or paired.

  • Network Architecture: Shared encoder layers process the input data, followed by task-specific decoders.

  • Training Details: Gradient balancing techniques are employed to prevent one task from dominating the training process. Training leverages a batch size of 128 and the Adam optimizer with a learning rate of 0.001.

  • Evaluation: Performance is evaluated separately for each task on their respective validation sets.

B.4 Setup for Multisensory Models

  • Data Preparation: Data from different modalities are fused either at the input, feature, or decision level.

  • Network Architecture: Modality-specific encoders process each input type. Fusion layers then combine features from all encoders.

  • Training Details: Models are trained with a batch size of 128 using the Adam optimizer and a learning rate of 0.001. Data balancing techniques ensure equal representation from each modality.

  • Evaluation: The combined model’s efficacy is evaluated using a validation dataset that includes all modalities.
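The difference between feature-level and decision-level fusion mentioned in the data preparation step can be illustrated with a small sketch. Concatenation and probability averaging are common choices used here for illustration; they are assumptions, as the setup does not state which fusion operators were used, and both function names are hypothetical.

```python
def feature_level_fusion(features_by_modality):
    """Concatenate per-modality feature vectors into one joint vector."""
    fused = []
    for name in sorted(features_by_modality):  # fixed order keeps the layout stable
        fused.extend(features_by_modality[name])
    return fused


def decision_level_fusion(probs_by_modality):
    """Average the class probabilities predicted separately from each modality."""
    n = len(probs_by_modality)
    num_classes = len(next(iter(probs_by_modality.values())))
    return [sum(p[c] for p in probs_by_modality.values()) / n
            for c in range(num_classes)]


joint_features = feature_level_fusion({"imu": [1.0, 2.0], "audio": [3.0]})
joint_probs = decision_level_fusion({"imu": [0.2, 0.8], "audio": [0.6, 0.4]})
```

Feature-level fusion hands a single joint vector to a shared predictor, whereas decision-level fusion keeps one predictor per modality and combines only their outputs.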

B.5 Setup for Multisensory Multitask Models

  • Data Preparation: Data from different modalities and tasks are paired or concatenated as required.

  • Network Architecture: Shared modality-specific encoders are followed by task-specific decoders.

  • Training Details: Gradient balancing techniques are applied, along with modality balancing, to ensure fairness in learning. The model trains using a batch size of 128 and the Adam optimizer at a learning rate of 0.001.

  • Evaluation: Each task’s performance is assessed on their respective validation datasets.

For all the methods, the experimental environment remains consistent. All models are trained and evaluated on NVIDIA V100 GPUs, ensuring uniformity in computational power and performance.

Appendix C Evaluation Metrics

To measure performance, we utilize a combination of metrics following prior work on each specific task. For gaze estimation, we use mean Euclidean error in centimeters to measure the positional distance between the predicted gaze and the ground-truth gaze. For depth estimation, we apply mean absolute error in millimeters to calculate the gap between the predicted and ground-truth depth. For gesture classification, we compute accuracy as the ratio of correctly classified samples. For pose estimation, we use mean Euclidean error in centimeters to measure the positional distance between the predicted pose joints and the ground-truth pose. For touch contact classification, we calculate the accuracy of classifying the category of fingers interacting with the touchscreen. For event detection, we apply the F1 score to decide whether a predicted activity above a confidence threshold belongs to an event. For activity recognition, we compute the balanced accuracy for measuring instance-level performance. For 3D pose reconstruction, we use end-point error in millimeters, the mean Euclidean error between all the joints of the annotated and predicted hand pose.
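For reference, the metrics named above can be written out in a few lines. This is a minimal sketch using their standard definitions; the function names are hypothetical, and computing F1 for a single positive class is an assumption, since the text does not say whether F1 is binary or averaged across classes.

```python
import math


def mean_euclidean_error(pred, target):
    """Mean Euclidean distance between predicted and ground-truth points."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred, target):
    """Mean absolute difference between predicted and ground-truth values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def accuracy(pred, target):
    """Fraction of correctly classified samples."""
    return sum(p == t for p, t in zip(pred, target)) / len(pred)


def balanced_accuracy(pred, target):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    recalls = []
    for c in sorted(set(target)):
        idx = [i for i, t in enumerate(target) if t == c]
        recalls.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def f1_score(pred, target, positive=1):
    """F1 for one positive class (assumption: binary F1, not macro-averaged)."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, target))
    fp = sum(p == positive and t != positive for p, t in zip(pred, target))
    fn = sum(p != positive and t == positive for p, t in zip(pred, target))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

The end-point error for 3D pose reconstruction corresponds to `mean_euclidean_error` applied to 3D joint coordinates.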

Appendix D More analysis

Testing long-range interactions: Long-range interactions are critical to many problems in machine learning, particularly in fields like time series forecasting, natural language processing, and signal analysis. Recognizing patterns and relationships over vast sequences or across multiple modalities often requires models to understand and leverage these long-range dependencies. However, capturing these interactions remains a challenge for many conventional models.

In a controlled experiment, we truncated sequences to various lengths and observed how conventional models performed. As the sequence lengths increased, representing longer durations of time or more extensive contexts, there was a marked decline in performance. This showcased the models’ inability to effectively encapsulate and understand interactions beyond a certain range. Multimodal setups further complicate this. The long-range dependencies aren’t just within a modality but can also be across modalities. This inter-modality long-range interaction is a largely uncharted territory, and our experiments showed that it’s an area where even advanced models can falter.

Several directions could help address these limitations. One is to explore architectures that inherently focus on long-range interactions, potentially leveraging self-attention mechanisms but with modifications to handle extremely long sequences. Another is to employ models that operate at different temporal scales, allowing them to summarize information at various levels and potentially capture longer-range interactions more effectively. A third is to use techniques that allow models to allocate more computational resources when faced with potential long-range dependencies, thus emphasizing critical parts of a sequence or modality. For multimodal problems, mechanisms that facilitate better cross-modal attention can be crucial. These would enable models to recognize and act upon dependencies that span different modalities, even if they are separated by considerable temporal or sequential gaps.

Testing heterogeneity in structure and noise: Heterogeneity in data, both in terms of structure and noise, is a pervasive challenge in machine learning. As datasets grow more complex, encompassing a wider variety of sources, the inherent differences in data structure and the presence of various types of noise can significantly hamper the performance of models. Understanding how models grapple with such heterogeneity is vital for real-world applications.

We exposed models to datasets that combined structured data (such as GPS and IMU) with unstructured data (such as images or raw audio). Unimodal baselines often struggled to reconcile these different data forms, leading to a significant drop in accuracy compared to when dealing with homogeneous data types. We also introduced varying degrees of noise into the datasets, for example Gaussian noise added to numerical data. Current methods saw a rapid decline in performance as the noise levels increased, unable to filter out irrelevant information effectively. Heterogeneity challenges underline the importance of robustness in model design. Our experiments highlighted that many models, even those considered state-of-the-art, have vulnerabilities when exposed to unexpected data structures or noise patterns.
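The noise experiment can be sketched as a sweep over noise levels applied to a numerical signal. This is an illustrative sketch only: the function names, the zero-mean additive form, and the particular standard deviations are assumptions, since the text does not report the noise parameters used.

```python
import random


def add_gaussian_noise(values, sigma, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    return [v + rng.gauss(0.0, sigma) for v in values]


def noise_sweep(values, sigmas, seed=0):
    """Return one noisy copy of the signal per noise level."""
    return {sigma: add_gaussian_noise(values, sigma, seed) for sigma in sigmas}


# Hypothetical IMU-like readings corrupted at increasing noise levels.
signal = [0.0, 0.5, 1.0, 0.5, 0.0]
noisy_versions = noise_sweep(signal, sigmas=[0.0, 0.1, 0.5])
```

Each noisy copy would then be passed to the model under test, and the metric recorded per noise level.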

Several directions could improve robustness. One is to explore architectures and training techniques that are inherently more robust to noise and heterogeneity, which might include noise injection during training or techniques like dropout that encourage model generalization. Another is to leverage advanced data augmentation techniques, for both structured and unstructured data, to simulate varied data structures and noise patterns and thus prepare the model for them. A third is to use meta-learning approaches in which models are trained to quickly adapt to new data structures or noise patterns with minimal fine-tuning. Finally, research into advanced denoising mechanisms, especially ones that can handle structured noise, can be invaluable. This includes both pre-processing methods and in-model techniques.

D.1 Analysis of information sharing

Finally, we show examples of how information is shared across modalities and tasks, based on two potential sources of sharing: low-level modality features and high-level semantic concepts.

Low-level modality features: Different sensory modalities often contain unique low-level perceptual features that complement those in other modalities. We illustrate this information sharing across 3 modalities: IMU, video, and pose data for predicting 2 common activities: walking and dancing.

Walking is a common activity with distinctive rhythmic characteristics. Using IMU features, the model learns that rhythmic patterns, particularly in acceleration and deceleration, correspond to each walking step. The cadence, stability, and any irregularities in the walking pattern can also be inferred. Video features capture the holistic visual representation of walking, presenting details such as gait, arm swing, speed, stride length, and frequency. Finally, pose features highlight the specific posture changes during walking, emphasizing leg movement, foot placement, and body alignment.

Dancing requires complex and expressive motions with varying styles and dynamics. IMU data provide dynamic, often non-linear patterns that reflect the dance’s tempo, vigor, and style variations; video captures the dance form, style, synchronization, and expressiveness; and pose data capture the alignment and configuration of body parts, offering insights into dance postures, transitions, and intricate footwork or hand movements.

High-level semantic concepts encapsulate a more general conceptual understanding and reasoning about the environment. We show two examples showing how the audio and IMU modalities share information about two high-level semantic concepts, focusing on ’body pose’ and ’hand pose’.

Body pose represents the spatial arrangement and posture of the entire human body. This can involve stances like standing, sitting, lying down, or even dynamic movements like jumping or running. For Audio, indirect cues such as the sound of footsteps, a person sitting down on a chair, or even the echo in a room (indicating a certain body pose affecting sound propagation) can provide hints about the body’s posture. For IMU, accelerometers capture the directional movement while gyroscopes provide rotational dynamics to distinguish if a person is upright, moving rapidly, or stationary.

Hand pose looks at the orientation, gesture, and spatial arrangement of just the hands, ranging from gestures like waving and gripping to more intricate signs in sign language. In audio, sounds like clapping, snapping, or even the subtle rustling of hands moving through the air can be detected. The distinct sounds made by hand interactions with objects can also hint at specific hand poses. When IMU sensors are placed on the wrist or back of the hand, they can capture detailed dynamics of hand movements, such as tilting, rotation, or swift movements that indicate hand poses.

Appendix E More examples

In this section, we present more examples of diverse IoT modalities, emphasizing their heterogeneity and the implications of their temporal interactions. Each modality contributes uniquely to the understanding and processing of environmental data, pivotal for applications in IoT networks. These examples underscore the heterogeneity in sensor types and data characteristics in IoT systems. Moreover, they highlight the importance of temporal interaction in data processing and application responsiveness.

Figure 6: IMU Visualizations

E.1 IMU

The Inertial Measurement Unit (IMU) is critical for capturing dynamic motion and orientation. An IMU typically combines accelerometers, gyroscopes, and sometimes magnetometers to provide comprehensive motion tracking. For example, in a smartwatch, the IMU captures temporal data on user movement patterns, crucial for activity recognition and health monitoring applications. This modality’s high sampling rate allows for detailed temporal analysis, capturing minute fluctuations in motion, as shown in Figure 6.

Figure 7: Audio Visualizations

E.2 Audio

Audio sensors capture sound waves, converting them into digital signals that represent the acoustic environment. In smart homes, audio sensors can detect various sounds, from spoken commands to the activity noise of household appliances. The temporal granularity of audio data reported in Figure 7 is vital for applications like speech recognition, environmental sound classification, and emergency detection (e.g., breaking glass or alarms).

Figure 8: Capacitance Visualizations

E.3 Capacitance

Capacitive sensing involves the measurement of changes in capacitance in an environment, often used to detect touch or proximity. In an IoT context, capacitive sensors can be embedded in surfaces to create interactive touch interfaces or to monitor object presence and human interaction without direct contact. As shown in Figure 8, the temporal resolution of capacitance can vary, but its real-time response is crucial for interactive applications.

Figure 9: Depth Visualizations

E.4 Depth

Depth sensors measure the distance between the sensor and objects in its environment, typically using technologies such as LIDAR, structured light, or time-of-flight cameras. This modality is essential in scenarios where spatial relationships and object recognition are required, such as in autonomous vehicle navigation or interactive gaming, as illustrated in Figure 9. Temporal interactions in depth sensing are crucial for understanding scene changes and movement dynamics over time.