MultiIoT: Benchmarking Machine Learning
for the Internet of Things

Shentong Mo, Louis-Philippe Morency, Ruslan Salakhutdinov, Paul Pu Liang
Carnegie Mellon University
[email protected]

Abstract

The next generation of machine learning systems must be adept at perceiving and interacting with the physical world through a diverse array of sensory channels. Commonly referred to as the ‘Internet of Things (IoT)’ ecosystem, sensory data from motion, thermal, geolocation, depth, wireless signals, video, and audio are increasingly used to model the states of physical environments and the humans inside them. Despite the potential for understanding human wellbeing, controlling physical devices, and interconnecting smart cities, the community has seen limited benchmarks for building machine learning systems for IoT. Existing efforts are often specialized to a single sensory modality or prediction task, which makes it difficult to study and train large-scale models across many IoT sensors and tasks. To accelerate the development of new machine learning technologies for IoT, this paper proposes MultiIoT, the most expansive and unified IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 real-world tasks. MultiIoT introduces unique challenges involving (1) generalizable learning from many sensory modalities, (2) multimodal interactions across long temporal ranges, (3) extreme heterogeneity due to unique structure and noise topologies in real-world sensors, and (4) complexity during training and inference. We evaluate a comprehensive set of models on MultiIoT, including modality and task-specific methods, multisensory and multitask supervised models, and large multisensory foundation models. Our results highlight opportunities for ML to make a significant impact in IoT, but many challenges in scalable learning from heterogeneous, long-range, and imperfect sensory modalities still persist. We release all code and data at https://github.com/Multi-IoT/MultiIoT to accelerate future research in machine learning for IoT.

Figure 1: MultiIoT is the largest benchmark for machine learning on the Internet of Things (IoT), consisting of 1.15M samples, 12 rich modalities, and 8 challenging tasks such as perceiving the pose, gaze, activities, and gestures of humans as well as the touch, contact, pose, and 3D structure of physical objects. MultiIoT presents new challenges of (1) generalizable learning from many sensory modalities, (2) fine-grained interactions across long temporal ranges, (3) extreme heterogeneity and noise topologies in real-world sensors, and (4) complexity during training and inference.

1 Introduction

The next generation of machine learning systems will need to understand and interact with the physical world through physical sensors. This interconnection of sensors is typically called the Internet of Things (IoT) ecosystem, which includes motion, thermal, geolocation, depth, wireless signals, pose, video, and audio to model the states of physical environments and the humans inside them [7, 35, 52]. These sensing technologies have had great impact in recognizing human physical activities to inform us of our daily physical wellness [38, 49, 57]; navigating self-driving cars and efficiently connecting them with transportation grids [25, 28]; and recognizing if humans require assistance in schools, hospitals, or the workplace [1, 3, 31].

While the field of machine learning for IoT has great potential, existing efforts are often specialized to a single sensory modality or prediction task [6, 27, 32, 11], resulting in limited resources to systematically study large-scale learning across many IoT sensors and tasks. To standardize the benchmarking and development of new machine learning technologies for IoT, this paper proposes MultiIoT, the most expansive and unified IoT benchmark to date, encompassing over 1.15 million samples covering 12 real-world sensory modalities and 8 IoT tasks firmly rooted in practical scenarios such as personal wellness, healthcare, and smart cities. These tasks include perceiving the pose, gaze, activities, and gestures of humans as well as the touch, contact, pose, and 3D structure of physical objects. MultiIoT introduces the following unique challenges to the machine learning community:

1. High-modality multimodal learning: While multimodal representation learning has historically been limited to image, text, video, and audio [42], real-world sensory modalities like IMU, thermal dynamics, GPS, depth, camera captures, audio, and more paint a more realistic picture of our multisensory physical world. These diverse modalities introduce new challenges in generalization across modalities and multitask and transfer learning across different physical sensors.
2. Long-range temporal interactions: The second challenge lies in learning fine-grained multimodal interactions across long temporal ranges. Real-world sensory data is naturally sequential, possibly over extremely long time ranges, and multisensory sequential data often shows interactions between time steps that are not aligned. For example, typical image-text datasets have a sequence length of 77 words or lower [43], video datasets are roughly 10-60 seconds long [58], while MultiIoT displays sequence lengths of up to 100-300 steps.
3. Heterogeneity and robustness: The third challenge lies in handling the extreme heterogeneity in real-world sensors with unique structures and noise topologies [39, 42]. These sensory modalities may be naturally noisy or corrupted, not easily semantically segmented, and may not have natural language correspondences like image and video often do.
4. Real-time complexity: Finally, many IoT devices need to run in real-time for applications in smart cities, security, healthcare, and automation. We therefore need to benchmark the efficiency of multisensory data collection, processing, and prediction as a critical quality alongside performance.

In addition to its diverse data modalities and prediction tasks, MultiIoT also contains a new set of evaluation metrics to study these challenges. Through this holistic benchmark, we evaluate a family of machine learning models spanning unimodal sensor-specific [2, 5] and multisensor fusion approaches [34, 42, 56], multimodal and multitask pre-training [24, 40, 51], and multimodal extensions of large language models [14, 61]. Together, these cover all state-of-the-art frontiers of machine learning, deep learning, and foundation models for IoT. Our results highlight opportunities for ML to make a significant impact in IoT, but many challenges in scalable learning from heterogeneous, long-range, and imperfect sensory modalities are critical directions for future work.

Overall, MultiIoT presents a milestone in unifying disjoint efforts in machine learning and IoT research and paves the way towards a better understanding of the capabilities and limitations of current models, all the while ensuring ease of use, accessibility, and reproducibility. MultiIoT, its evaluation metrics, standardized implementations of various models, and leaderboards are publicly available and will be regularly updated, and we welcome input from the community.

2 MultiIoT benchmark, modalities, and tasks

The rapid expansion of the Internet-of-Things (IoT) landscape necessitates a comprehensive benchmark that captures the richness and variety of IoT sensory modalities and tasks. MultiIoT is the largest and most diverse of its kind, comprising 1.15M samples spanning twelve distinct modalities and geared towards eight challenging tasks, as summarized in Figure 1.

2.1 Twelve diverse modalities

We collect diverse data from IoT devices, such as inertial measurement units (IMU), thermal sensors, global positioning systems (GPS), capacitance, depth, gaze, and pose. We also collect commonly used image, audio, and video modalities in the physical world to bridge conventional multimodal research with the new challenges introduced by MultiIoT.

1. Inertial measurement units capture 3D motion and orientation. This data is fundamental for various applications, including motion tracking and navigation. We include 2,940 IMU gaze samples [30], 28,400 IMU motion instances [46], 160,120 IMU samples [5], 330,178 IMU orientation recordings [22], and 510,142 timestamp-based IMU samples [20].
2. Thermal sensors capture temperature radiance, which is crucial in surveillance. We used 12,025 samples from LLVIP [26] containing pedestrians and cyclists from different locations on the street.
3. Global positioning systems offer location data with high precision. This data is invaluable for tasks like location-based services, asset tracking, and navigation. We include 41,000 GPS samples from self-driving cars in KITTI [17], recorded with an OXTS RT3003 inertial and GPS navigation system, for depth estimation. The geographic coordinates include global orientation, altitude, velocities, accelerations, angular rates, and satellite information.
4. Cameras capture the visual world in rich detail. We include 41,000 instances from the KITTI self-driving car dataset [17], recorded with a Velodyne laser scanner mounted on a car, for depth estimation. The timestamp-based points can be considered according to the scanner’s continuous rotation on its vertical axis, which provides context to GPS/IMU systems for autonomous driving.
5. Capacitance sensors measure changes in capacitance to detect nearby objects or changes and are critical components of touchscreen technologies and proximity sensing. We used 65,374 samples from TouchPose [2], recorded with a 39.6 cm capacitive Crystal Touch panel, a 16-bit touch digitizer, and cameras. When fingers approach the lines on the mutual-capacitance touch sensor, the capacitance between lines drops, producing the mutual-capacitance image.
6. Depth sensors measure distances between the sensor and objects, providing a 3D view of the environment. They play a significant role in tasks like object detection and scene reconstruction. We used 160,120 samples from RGBDGaze [5], captured with an Apple iPhone X with a TrueDepth camera.
7. Gaze sensors track eye movement and direction, offering insights into user attention and intention. We used 2,940 samples from EyeMU [30], collected with an iOS application running on an Apple iPhone 12 Pro. The participants were asked to gaze at a single red dot, and the screen advanced to capture a motion gesture and a 2-axis gaze location after 1.2 seconds.
8. Pose sensors capture the orientation and position of objects or individuals, which is critical for motion analysis and interactive applications. We include 330,178 samples from DIP-IMU [22] using Xsens IMU sensors, and 65,374 samples in TouchPose [2] from a Leap Motion stereo IR camera, running Orion 4.1 for 3D hand pose tracking.
9. LiDAR sensors emit light to measure distances, generating high-resolution 3D maps of environments. They are central to autonomous driving and topographical mapping. We include 51,000 samples from the Newer College dataset [50], recorded with an Ouster LiDAR with 64 beams, 64 channels, 120 m range, 45° vertical field-of-view (FoV), and 1024 horizontal resolution.
10. Video captures sequences of visual frames, providing a dynamic view of the environment. We used 510,142 egocentric videos in Ego4D [20], which include many everyday activities, such as cooking, cleaning, and fishing from diverse geographic locations across the world, and are paired with timestamp-based IMU values from normalized accelerometer and gyroscope readings.
11. Audio sensors capture sound waves, enabling voice recognition, sound classification, and environmental sound analysis. We include 28,400 samples from SAMoSA [46], where participants wore a smartwatch on their dominant arm and were asked to perform 26 activities across 4 contexts, with each activity repeated 3 times within each context.
12. Image sensors offer static visual captures of the environment, serving as a basis for a myriad of vision tasks. We collected 160,120 samples from RGBDGaze [5] paired with gaze, depth, and IMU for gaze tracking, 41,000 samples from KITTI [17], 12,025 high-quality images paired with infrared thermal samples in LLVIP [26], and 65,374 instances from TouchPose [2].

2.2 Eight challenging tasks

Building on these 12 modalities, our benchmark includes 8 tasks that reflect real-world IoT challenges.

1. Gaze estimation: This task is pivotal for human-computer interaction, driver monitoring, and virtual reality. Given RGB images of faces, depth, and IMUs, our goal is to predict the on-screen location (X/Y) of the person's gaze. This regression task requires multisensory understanding of long-range interactions between RGB images and depth and heterogeneity in IMUs.
2. Depth estimation involves predicting the distance between the camera and each pixel in the image and is a cornerstone for AR/VR applications, robotics, and object detection. Given RGB images, camera parameters, GPS coordinates, and IMU, we predict the depth maps of objects, such as cars and pedestrians on the streets. For robots, given RGB images, capacitive images, and hand poses, our target is to estimate the depth maps of left and right hands.
3. Gesture classification: Crucial for human-machine interfaces, gesture classification aims to recognize specific human hand or body movements. Given gaze locations and IMU data from the accelerometer, gyroscope, and orientation sensors, the goal is to classify human gestures. This classification problem requires the cross-modal perception of both gaze and IMUs.
4. Pose estimation focuses on determining the spatial arrangement of human joints and has applications in AR/VR, gaming, and health. Given RGB images and measured IMU data, our goal is to predict the pose of the human body, comprising 24 joints with three joint angles each (yaw, pitch, roll). This regression problem requires fusing IMUs and RGB pixels.
5. Touch contact classification involves determining the type or nature of touch on capacitive surfaces, a vital component for enhancing user experiences on touch-based devices. Given RGB images, capacitive images, depth maps, and hand poses, the goal is to classify touch contact.
6. Event detection: A broad area with applications in health, wellness, smart homes, and the workplace, event detection involves identifying specific occurrences or anomalies in the data stream. Given audio spectrograms and IMU data from the accelerometer, gyroscope, and orientation sensors, our goal is to predict the categories of events across different timestamps. This classification problem requires modeling interactions between audio and IMU.
7. Activity recognition: Central to fitness, health, and elderly care, activity recognition aims to discern human activities like walking, running, or jumping given RGB images, poses with three joint angles (yaw, pitch, roll), IMU data, or egocentric videos.
8. 3D reconstruction involves creating a three-dimensional model of an environment or object from 2D data, an application of huge significance in gaming, film, and AR/VR. Given RGB images, capacitance images, and depth maps, we aim to reconstruct 3D poses.

3 Modeling Paradigms in MultiIoT

The models we include in MultiIoT span conventional IoT processing methods, which we briefly review below, as well as new methods we designed based on large multisensory foundation models.

Domain-specific unimodal models. Over the years, each sensor modality has evolved its own set of algorithms. For instance, IMU data has traditionally been processed using Kalman filters [33] to predict movement, while the thermal modality often relies on image processing techniques for hotspot detection [18, 60, 10]. A majority of IoT sensor data is inherently time-series [6, 27, 32, 11]. Classical statistical methods like AutoRegressive Integrated Moving Average (ARIMA) [45, 55] or exponential smoothing have been employed to forecast, denoise, or detect anomalies in sensor readings [15, 8, 4]. Signal processing methods have also been proposed, ranging from Fourier [59, 47] and wavelet transforms [48] to data compression [21] and feature extraction strategies specific to resource-constrained devices [13, 29, 23]. Many of these methods were designed to function efficiently in real-time scenarios with limited computational resources.

Multitask unimodal models extend unimodal models by having a common backbone for the sensory modality and separate decoder heads, each suitable for predicting a single task [9]. The common backbone can learn general-purpose information about the sensory modality while each decoder is task-specific. Given a dataset $D=\{(x_{i},y_{i1},y_{i2},...y_{in})\}$ where each $x_{i}$ has multiple corresponding labels for different tasks, the model minimizes a combined loss $L$ :

$$L(D,M)=\sum_{i}\sum_{j}\mathcal{L}_{j}(M_{j}(E(x_{i})),y_{ij}) \tag{1}$$

where $\mathcal{L}_{j}$ is the loss for task $j$, $M_{j}(\cdot)$ denotes the $j$-th task-specific decoder head, and $E$ denotes the shared encoder backbone.
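As a rough illustration of Equation (1), the sketch below sums per-task losses over a toy dataset using one shared encoder and one head per task. The encoder, heads, squared-error loss, and data are all hypothetical stand-ins chosen for readability; they are not the architectures used in the benchmark.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Vector, Sequence[float]]  # (input x_i, labels y_i1..y_in)


def squared_error(pred: float, target: float) -> float:
    """Per-task loss L_j; squared error is an illustrative choice."""
    return (pred - target) ** 2


def multitask_loss(
    dataset: Sequence[Sample],
    encoder: Callable[[Vector], Vector],
    heads: Sequence[Callable[[Vector], float]],
    losses: Sequence[Callable[[float, float], float]],
) -> float:
    """Combined loss of Eq. (1): sum over samples i and tasks j."""
    total = 0.0
    for x, labels in dataset:
        z = encoder(x)  # shared representation E(x_i), computed once
        for head, loss_fn, y in zip(heads, losses, labels):
            total += loss_fn(head(z), y)  # L_j(M_j(E(x_i)), y_ij)
    return total


# Hypothetical toy setup: the encoder summarizes a signal by mean and range.
def toy_encoder(x: Vector) -> Vector:
    return [sum(x) / len(x), max(x) - min(x)]


toy_heads = [lambda z: z[0], lambda z: z[1]]  # one head per task
toy_losses = [squared_error, squared_error]
toy_data = [
    ([1.0, 2.0, 3.0], (2.0, 2.0)),
    ([0.0, 4.0], (1.0, 4.0)),
]

loss = multitask_loss(toy_data, toy_encoder, toy_heads, toy_losses)
print(loss)
```

The point of the structure is that `encoder(x)` is evaluated once per sample and reused by every head, which is where the sharing across tasks comes from.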

Multisensory fusion models combine different modalities at some stage in the model – be it early fusion, middle fusion, or late fusion [42]. A common approach is to use separate encoders for each modality and a shared decoder that fuses the representations to produce an output. Given multi-modal data $x=(x_{1},x_{2},...x_{m})$ , the model combines representations:

$$y=T(E_{1}(x_{1})\oplus E_{2}(x_{2})\oplus...\oplus E_{m}(x_{m})) \tag{2}$$

where $T(\cdot)$ denotes the task head, and $E_{1},E_{2},...,E_{m}$ denote the encoders.
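A minimal sketch of Equation (2) follows. It reads $\oplus$ as feature concatenation, which is an assumption (the fusion operator could equally be a sum or an attention layer), and the two encoders and the linear task head are hypothetical toys rather than the benchmark's models.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse(features: Sequence[Vector]) -> Vector:
    """Fusion operator, assumed here to be concatenation."""
    fused: Vector = []
    for f in features:
        fused.extend(f)
    return fused


def linear_head(weights: Vector, bias: float) -> Callable[[Vector], float]:
    """Build a toy task head T: a single linear layer."""
    def head(z: Vector) -> float:
        if len(z) != len(weights):
            raise ValueError("fused feature size does not match head weights")
        return sum(w * v for w, v in zip(weights, z)) + bias
    return head


def predict(
    inputs: Sequence[Vector],
    encoders: Sequence[Callable[[Vector], Vector]],
    head: Callable[[Vector], float],
) -> float:
    """y = T(E_1(x_1) (+) ... (+) E_m(x_m)) with one encoder per modality."""
    features = [encode(x) for encode, x in zip(encoders, inputs)]
    return head(fuse(features))


# Hypothetical modality encoders.
def imu_encoder(x: Vector) -> Vector:
    return [sum(abs(v) for v in x) / len(x)]  # mean absolute motion


def image_encoder(x: Vector) -> Vector:
    return [sum(x) / len(x), max(x)]  # mean and peak intensity


head = linear_head([1.0, 2.0, 2.0], bias=0.5)
y = predict([[1.0, -3.0], [0.5, 1.5]], [imu_encoder, image_encoder], head)
print(y)
```

Swapping `head` for a list of heads applied to the same fused vector gives the multitask variant, at the cost of only the extra head computations.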

Multisensory multitask models leverage data from different modalities to solve more than one task simultaneously [24, 40, 51]. They often benefit from interconnections between tasks. For example, in an IoT setting, a model could use vision and sound to simultaneously predict both the type of event occurring and its intensity, and further use vision and depth to reason about moving objects. For multi-modal data $x=(x_{1},x_{2},...x_{m})$ and multiple tasks, the combined representations are processed as:

$$y_{j}=T_{j}(E_{1}(x_{1})\oplus E_{2}(x_{2})\oplus...\oplus E_{m}(x_{m})) \tag{3}$$

where $T_{j}(\cdot)$ denotes the $j$-th task head, and $E_{1},E_{2},...,E_{m}$ denote the encoders.

Multisensory language models: While the above approaches are primarily based on supervised learning across one or more modalities and tasks, there has been recent interest in grounding large language models on external modalities to take advantage of the general prediction, reasoning, and interaction capabilities of large language model decoders [54]. These methods operate via adapter layers that transform a modality’s features into the original layers of a pre-trained model [14]. Given a pre-trained model with a set of weights $W$ , and an adapter module $A$ with its own set of weights $W_{A}$ , the output $y$ for an input $x$ is:

$$y=M_{W+A}(x)=M_{W}(A_{W_{A}}(x)) \tag{4}$$

where $M_{W+A}(\cdot)$ denotes the pre-trained model augmented with the adapter, and $M_{W}(\cdot)$ denotes the pre-trained model with weights $W$ alone.
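To make the composition in Equation (4) concrete, here is a deliberately tiny sketch in which both the frozen model and the adapter are scalar affine maps. This is a toy assumption for illustration only and is not the LLaMA-Adapter architecture; the point is that gradients flow through the frozen weights $W$ but only the adapter weights $W_{A}$ are updated.

```python
from typing import List, Tuple

# Frozen "pre-trained" model M_W(h) = w * h + c. Never updated below.
FROZEN_W: Tuple[float, float] = (2.0, 0.0)


def frozen_model(h: float) -> float:
    w, c = FROZEN_W
    return w * h + c


def train_adapter(
    data: List[Tuple[float, float]], steps: int = 200, lr: float = 0.05
) -> Tuple[float, float]:
    """Fit adapter A(x) = a * x + b by full-batch gradient descent on
    the mean squared error of M_W(A(x)), keeping M_W fixed."""
    a, b = 1.0, 0.0  # adapter starts as the identity map
    n = len(data)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in data:
            err = frozen_model(a * x + b) - y
            # Chain rule through the frozen model: d M_W / d h = FROZEN_W[0].
            grad_a += 2.0 * err * FROZEN_W[0] * x / n
            grad_b += 2.0 * err * FROZEN_W[0] / n
        a -= lr * grad_a  # only adapter weights W_A = (a, b) change
        b -= lr * grad_b
    return a, b


# Hypothetical target y = 6x + 2, reachable with a = 3, b = 1.
data = [(x, 6.0 * x + 2.0) for x in (-1.0, 0.0, 1.0, 2.0)]
a, b = train_adapter(data)
print(round(a, 6), round(b, 6))
```

In the multisensory multitask setting, the scalar `x` would be replaced by the fused encoder output, with the same split between frozen and trainable parameters.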

Multisensory multitask language models are multitask extensions of single-task adapters, where general representations for many tasks are transformed into the layers of a pre-trained model [14]. In an IoT setting with multi-modal data $x=(x_{1},x_{2},...x_{m})$, given a pre-trained model with a set of weights $W$ and an adapter module $A$ with its own set of weights $W_{A}$, the Multisensory Multitask Adapter computes the output $y$ as:

$$y=M_{W+A}(E_{1}(x_{1})\oplus E_{2}(x_{2})\oplus...\oplus E_{m}(x_{m}))=M_{W}(A_{W_{A}}(E_{1}(x_{1})\oplus E_{2}(x_{2})\oplus...\oplus E_{m}(x_{m}))) \tag{5}$$

where $M_{W+A}(\cdot)$ denotes the pre-trained model augmented with the adapter, $M_{W}(\cdot)$ denotes the pre-trained model with weights $W$ alone, and $E_{1},E_{2},...,E_{m}$ denote the modality encoders shared across tasks.

4 Experiments

Our experiments aim to benchmark existing machine learning paradigms on MultiIoT, including the best task-specific models as well as those designed for multimodal, multitask, long-range, and noisy data settings. We elaborate on the experimental setup and report our findings.

4.1 Experimental Setup

All experiments were conducted on NVIDIA V100 GPUs. For unimodal models, data from each modality was processed independently using optimized neural architectures like CNNs for images and time-series models for sensor data. Models were trained with a batch size of 128, using the Adam optimizer at a learning rate of 0.001. Unimodal multitask models use shared encoder layers and task-specific decoders, and we ensured balanced gradients among tasks for equal training [40]. For multisensory models, we experimented with specialized unimodal models with data fusion occurring at varying levels, from input to decision levels [42]. Multisensory multitask models utilized modality-specific encoders followed by task-specific decoders, enabling sharing across modalities and tasks during training. Finally, multisensory language models utilized adapter-based methods such as LLaMA-Adapter [14], which enables us to keep the LLM frozen and fine-tune only the small adapter modules, and multisensory multitask language models extend adapter-based fine-tuning to multiple tasks at the same time.

To evaluate performance, we employ task-specific metrics following prior practice. For gaze and pose estimation, we measure the mean Euclidean error in centimeters between predictions and ground truth. Depth estimation uses mean absolute error in millimeters, while gesture classification, touch contact classification, and activity recognition are scored by accuracy. Event detection uses the F1 score computed on thresholded confidence predictions, and 3D pose reconstruction is assessed using the end-point error in millimeters between predicted and ground-truth joints.
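The metrics above can be sketched in a few lines of plain Python. These are the standard textbook definitions rather than the benchmark's own evaluation code, and the 0.5 confidence threshold in `f1_score` is an assumed default, since the threshold used in the experiments is not stated here.

```python
import math
from typing import Sequence


def mean_euclidean_error(
    preds: Sequence[Sequence[float]], targets: Sequence[Sequence[float]]
) -> float:
    """Average L2 distance between predicted and true points (gaze, pose)."""
    return sum(math.dist(p, t) for p, t in zip(preds, targets)) / len(preds)


def mean_absolute_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Average absolute difference (depth estimation)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def accuracy(preds: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of exactly matching labels (gesture, touch, activity)."""
    return sum(p == t for p, t in zip(preds, targets)) / len(preds)


def f1_score(
    scores: Sequence[float], targets: Sequence[bool], threshold: float = 0.5
) -> float:
    """Binary F1 after thresholding confidence scores (event detection).
    The default threshold is an assumption for illustration."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and t for p, t in zip(preds, targets))
    fp = sum(p and not t for p, t in zip(preds, targets))
    fn = sum((not p) and t for p, t in zip(preds, targets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gaze_err = mean_euclidean_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
depth_err = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 4.0, 3.0])
acc = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
f1 = f1_score([0.9, 0.8, 0.2, 0.6], [True, False, True, True])
print(gaze_err, depth_err, acc, f1)
```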

Table 1: Multisensory multitask learning and multisensory multitask large language models are particularly effective approaches on MultiIoT, enabling information sharing to learn general representations for IoT data.

Method	Gaze est. (cm, $\downarrow$)	Depth est. (mm, $\downarrow$)	Gesture cls. (%, $\uparrow$)	Pose est. (cm, $\downarrow$)	Touch cls. (%, $\uparrow$)	Event det. (%, $\uparrow$)	Activity recog. (%, $\uparrow$)	3D recons. (mm, $\downarrow$)
Unimodal model	2.26	20.7	97.3	6.49	88.0	86.9	79.2	22.2
Unimodal multitask model	1.95	18.2	98.2	5.36	89.3	88.1	82.5	20.5
Multisensory model	1.79	17.3	98.7	4.62	91.2	89.1	83.5	19.6
Multisensory multitask model	1.08	13.6	99.3	3.85	93.8	92.7	87.5	17.5
Multisensory LM	2.05	18.6	97.6	5.75	88.7	87.5	82.3	21.3
Multisensory multi-task LM	0.95	11.5	99.6	3.24	94.6	93.8	89.2	16.3

4.2 Main quantitative results

Overall performance: Table 1 reports the quantitative results on MultiIoT for unimodal, unimodal multitask, multisensory, multisensory multitask, and language-model-based methods. As seen in Table 1, the multisensory multitask model consistently outperforms the single-modality and single-task models across all tasks. This can be attributed to its ability to integrate information across modalities and tasks, which is especially crucial when one modality might have noisy or incomplete data. While the single-task multisensory language model trails the supervised multisensory model on every task, the multisensory multitask language model achieves the strongest results by combining multimodal inputs and multitask training with the existing reasoning ability of pretrained large language models.

Table 2: Adding more modalities enables complementary learning of information and improves performance on the MultiIoT benchmark.

Modality Ratio	Gaze est. (cm, $\downarrow$)	Depth est. (mm, $\downarrow$)	Gesture cls. (%, $\uparrow$)	Pose est. (cm, $\downarrow$)	Touch cls. (%, $\uparrow$)	Event det. (%, $\uparrow$)	Activity recog. (%, $\uparrow$)	3D recons. (mm, $\downarrow$)
single-modality	2.26	20.7	97.3	6.49	88.0	86.9	79.2	22.2
25%	2.13	19.6	97.5	5.97	88.9	87.3	80.2	21.5
50%	1.95	18.7	98.1	5.38	90.1	88.2	81.3	20.9
all	1.79	17.3	98.7	4.62	91.2	89.1	83.5	19.6

Performance across different modalities: In this section, we study the impact of adding more modalities on task performance. Table 2 compares the unimodal setup against models given 25%, 50%, and all of the available modalities. Performance improves on every task each time more modalities are added; for example, gaze error falls from 2.26 cm with a single modality to 1.79 cm with all modalities. The incorporation of multiple modalities results in more robust and accurate models. This can be attributed to the model’s ability to tap into complementary information present in different modalities, especially in scenarios where one modality might be ambiguous or noisy.

Table 3: Multi-task learning is another effective strategy on the MultiIoT benchmark, enabling information sharing across tasks. Performance consistently improves as more datapoints from related tasks are added during training.

Task Ratio	Gaze est. (cm, $\downarrow$)	Depth est. (mm, $\downarrow$)	Gesture cls. (%, $\uparrow$)	Pose est. (cm, $\downarrow$)	Touch cls. (%, $\uparrow$)	Event det. (%, $\uparrow$)	Activity recog. (%, $\uparrow$)	3D recons. (mm, $\downarrow$)
single-task	2.26	20.7	97.3	6.49	88.0	86.9	79.2	22.2
25%	2.17	19.9	97.5	6.23	88.3	87.1	80.1	21.8
50%	2.09	19.0	97.8	5.86	88.9	87.5	81.2	21.2
all	1.95	18.2	98.2	5.36	89.3	88.1	82.5	20.5

Table 4: Multimodal and multitask training enables zero-shot and few-shot capabilities, which can help when dealing with limited labeled data often seen in real-world IoT systems.

Method	Gaze estimation (cm, $\downarrow$ )	Touch contact classification (%, $\uparrow$ )
IMU	2.65	–
capacitance	–	83.5
depth	2.45	86.2
image	2.26	88.0
multimodal	1.79	91.2
multimodal multitask	1.08	93.8
multimodal multitask (zero-shot)	2.18	88.6
multimodal multitask (5-shot)	1.96	89.5
multimodal multitask (10-shot)	1.89	90.2
multimodal multitask (20-shot)	1.81	91.1

Performance across different tasks: We separately analyze model performance when trained on multiple tasks simultaneously, while keeping the modality inputs fixed. Table 3 shows that the multitask model matches or exceeds models trained solely on individual tasks on every task, and that performance improves steadily as the ratio of training tasks grows from 25% to all. This suggests that the shared representations learned during multitask learning were largely beneficial, since the model learns more generalized and robust features, while also improving computational efficiency.

Zero-shot and few-shot transfer: Furthermore, we study whether models trained on certain modalities or tasks can transfer to a new set of target modalities or tasks they have never seen during training (zero-shot) or have seen with only very few examples (few-shot with 5, 10, or 20 examples). We chose the fix-8 dataset as the target, primarily because of its diverse representation of modalities (IMU, capacitance, depth, image) and its challenging tasks (gaze estimation and touch contact classification). We examined configurations ranging from unimodal models to transferred multimodal multitask models. From the results in Table 4, we find that even zero-shot performance from a transferred multimodal multitask model (2.18 cm gaze error, 88.6% touch accuracy) matches or exceeds supervised training on any single modality among IMU, capacitance, depth, and image. Furthermore, adding just a few examples (5-20) boosted performance over the zero-shot setting, reaching 1.81 cm and 91.1% with 20 examples, which highlights the model’s ability to quickly learn new information. Our results suggest that multimodal and multitask training enables few-shot capabilities that can be helpful for limited-data real-world IoT scenarios.

4.3 Understanding challenges in MultiIoT

Long-range multimodal interactions are critical to many problems in IoT, such as in time series forecasting and signal analysis. In a controlled experiment, we truncated sequences to various lengths and observed how conventional models performed. From Figure 3 (left), as the sequence lengths increased, representing longer durations of time or more extensive contexts, there was a marked decline in performance. This showcased the models’ inability to effectively encapsulate and understand interactions beyond a certain range. Multimodal setups further complicate this when the long-range dependencies aren’t just within a modality but can also be across modalities. Therefore, architectures that can handle both long-range and multisensory data will be critical for progress.

Heterogeneity in structure: Differences in data distributions, due to their natural structure, are a challenge for machine learning models. We evaluated the same models on datasets that combined structured data (such as GPS, IMU) with unstructured data (such as images or raw audio) and found that unimodal baselines often struggled to reconcile these different data forms, leading to a significant drop in accuracy. The main finding from our evaluation is that when models are not tailored to the specific characteristics of each data type, their ability to effectively integrate and interpret data diminishes. The results in both gaze estimation and touch contact classification drop, underscoring the inadequacy of generic models in handling complex, mixed-data scenarios. Therefore, the use of modality-specific encoders is critical in addressing the challenges posed by heterogeneous data. How to best handle these high degrees of heterogeneity, while maintaining efficiency beyond independent models for each sensor, is a critical direction for future work.

Robustness to noisy and missing sensors: In real-world applications, machine learning models often encounter data that is incomplete or corrupted by noise. To assess the robustness of our models against such imperfections, we introduced varying degrees of Gaussian noise into the datasets and systematically dropped sensor data at regular intervals. Both of these types of noise can naturally happen in real-world settings due to white noise and sensor failures respectively. From Figure 3 (right), we can observe the model’s performance as we incrementally increase the noise ratio from 0% to 50%. At 0% noise, the models operate in optimal conditions, showing peak accuracy. As the noise level increases to 10% and 20%, there is a noticeable degradation in performance, illustrating the initial sensitivity to noise. Beyond 20%, the decline becomes more pronounced, with model accuracy dropping below 80% at a 50% noise ratio. In addition to noise, missing sensor data is another common issue, and we find similar patterns when randomly omitting readings from various sensors. These findings indicate that building robust models for IoT is still a challenge.
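The two corruptions described above can be sketched as follows. How the "noise ratio" maps to a Gaussian standard deviation is not specified in the text, so scaling it by the signal's own standard deviation is an assumption of this sketch, as are the function names and the choice of `None` as the marker for a dropped reading.

```python
import random
from typing import List, Optional, Sequence


def add_gaussian_noise(
    seq: Sequence[float], noise_ratio: float, rng: random.Random
) -> List[float]:
    """Add zero-mean Gaussian noise whose standard deviation is
    noise_ratio times the signal's own standard deviation (assumed scaling)."""
    mean = sum(seq) / len(seq)
    std = (sum((v - mean) ** 2 for v in seq) / len(seq)) ** 0.5
    return [v + rng.gauss(0.0, noise_ratio * std) for v in seq]


def drop_at_intervals(
    seq: Sequence[float], every: int, fill: Optional[float] = None
) -> List[Optional[float]]:
    """Simulate sensor failure by replacing every `every`-th reading
    (counting from 1) with `fill`."""
    return [fill if (i + 1) % every == 0 else v for i, v in enumerate(seq)]


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# Sweep the noise ratio from 0% to 50% as in the experiment described above.
noisy_by_ratio = {r: add_gaussian_noise(clean, r, rng) for r in (0.0, 0.1, 0.2, 0.5)}
dropped = drop_at_intervals(clean, every=3)
print(noisy_by_ratio[0.5])
print(dropped)
```

A robustness curve would then be obtained by evaluating a trained model on each corrupted copy and plotting the metric against the noise ratio.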

Table 5: Tradeoff between various models in terms of performance and training cost on MultiIoT. Multisensory multi-task models yield stronger performance but come at the expense of increased training costs.

Method	Touch cls. (%, $\uparrow$)	Event det. (%, $\uparrow$)	Activity recog. (%, $\uparrow$)	Average training cost (hours, $\downarrow$)
Unimodal model	88.0	86.9	79.2	25
Unimodal multi-task model	89.3	88.1	82.5	30
Multisensory model	91.2	89.1	83.5	32
Multisensory multi-task model	93.8	92.7	87.5	38
Multisensory language model	88.7	87.5	82.3	27
Multisensory multi-task language model	94.6	93.8	89.2	39

Complexity during training and inference: One final critical consideration in the development of IoT models is the balance between the model’s performance and its computational cost. Table 5 reports a comparative analysis of the performance across different methods against their respective training costs. There is a clear incremental increase in performance from unimodal to multisensory multitask approaches. The unimodal method, while the least costly in terms of training time (25 hours), offers the lowest performance on all three tasks: touch classification, event detection, and activity recognition. The shift towards multisensory multitask learning increases the training cost (from 25 to 38 hours) but also yields notable gains in performance. Overall, the multisensory multitask model yields the best tradeoff between performance and complexity.

4.4 Analysis of information sharing

Finally, we show visualization examples of how information is shared across modalities and tasks in Figure 4, based on low-level modality features and high-level semantic concepts.

Low-level modality features: Different sensory modalities often contain unique low-level perceptual features that complement those in other modalities. We illustrate this information sharing across 3 modalities: IMU, video, and pose data for predicting 2 common activities: walking and dancing.

Walking is a common activity with distinctive rhythmic characteristics. Using IMU features, the model learns that rhythmic patterns, particularly in acceleration and deceleration, correspond to each walking step. The cadence, stability, and any irregularities in the walking pattern can also be inferred. Video features capture the holistic visual representation of walking, presenting details such as gait, arm swing, speed, stride length, and frequency. Finally, pose features highlight the specific posture changes during walking, emphasizing leg movement, foot placement, and body alignment.

Dancing requires complex and expressive motions with varying styles and dynamics. IMU data exhibits dynamic, often non-linear patterns that reflect the dance’s tempo, vigor, and style variations; video captures the dance form, style, synchronization, and expressiveness; and pose data captures the alignment and configuration of body parts, offering insights into dance postures, transitions, and intricate footwork or hand movements.

High-level semantic concepts encapsulate a more general conceptual understanding and reasoning about the environment. We present two examples of how the audio and IMU modalities share information about high-level semantic concepts, focusing on body pose and hand pose.

Body pose represents the spatial arrangement and posture of the entire human body. This can involve stances like standing, sitting, or lying down, or even dynamic movements like jumping or running. For Audio, indirect cues such as the sound of footsteps, a person sitting down on a chair, or even the echo in a room (indicating a certain body pose affecting sound propagation) can provide hints about the body’s posture. For IMU, accelerometers capture the directional movement while gyroscopes provide rotational dynamics to distinguish if a person is upright, moving rapidly, or stationary.

Hand pose looks at the orientation, gesture, and spatial arrangement of just the hands, ranging from gestures like waving and gripping, to more intricate signs in sign language. In audio, sounds like clapping, snapping, or even the subtle rustling of hands moving through the air can be detected. The distinct sounds made by hands interacting with other objects can also hint at specific hand poses. When IMU sensors are placed on the wrist or back of the hand, they can capture detailed dynamics of hand movements, tilting, rotation, or swift movements.

5 Conclusion and Broader Impacts

This paper proposes MultiIoT, the most expansive IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges involving (1) learning from many sensory modalities, (2) fine-grained multisensory interactions across long temporal ranges, and (3) extreme heterogeneity due to ambiguous semantic abstraction and unique noise topologies in real-world sensors, which inspire several directions for future work not encountered in conventional representation learning research. MultiIoT, our standardized code, and leaderboards are publicly available, will be regularly updated, and welcome inputs from the community.

We are also aware of some potential limitations and broader societal impacts:

1. Data privacy: There may be privacy risks associated with making predictions from multimodal data of recorded human behaviors, such as video, audio, activities, poses, and wearable sensors. Datasets are collected from participants who have consented to data release. We only use these datasets for research purposes. All data was anonymized and stripped of all personal (e.g., personally identifiable information) and protected attributes (e.g., race, gender).
2. Real-world privacy: To deploy these algorithms at scale in the real world, it is also important to keep data and features private on each device without sending it to other locations using techniques such as federated learning [36, 37], differential privacy [19], or encryption [12]. MultiIoT also enables large-scale studies of privacy-preserving machine learning in the IoT domain, which will be a critical direction for future work.
3. Efficiency: Modern ML models can cause environmental impacts resulting from the carbon footprint required to run large-scale models. ML for IoT can inspire the design of lightweight models that can run efficiently on edge devices and low-cost sensors [53].
4. Biases: We also acknowledge risks of exposure bias due to imbalanced datasets, especially when human-centric data and possibly sensitive labels are involved. Models trained on biased data have been shown to amplify the underlying social biases [44]. Future work should quantify the internal learning process of multimodal models [41] to better understand and mitigate social biases across sensory modalities. MultiIoT can be a useful resource to accelerate the study of fairer representation learning methods on real-world sensors.

References

Ahamed and Farid [2018] Farhad Ahamed and Farnaz Farid. Applying internet of things and machine-learning for personalized healthcare: Issues and challenges. In 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), pages 19–21. IEEE, 2018.
Ahuja et al. [2021] Karan Ahuja, Paul Streli, and Christian Holz. Touchpose: Hand pose prediction, depth estimation, and touch classification from capacitive images. In Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450386357. URL https://doi.org/10.1145/3472749.3474801.
Al-Emran et al. [2020] Mostafa Al-Emran, Sohail Iqbal Malik, and Mohammed N Al-Kabi. A survey of internet of things (iot) in education: Opportunities and challenges. Toward social internet of things (SIoT): Enabling technologies, architectures and applications: Emerging technologies for connected and smart social objects, pages 197–209, 2020.
De Livera et al. [2011] Alysha M. De Livera, Rob J. Hyndman, and Ralph D. Snyder. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496):1513–1527, 2011. doi: 10.1198/jasa.2011.tm09771. URL https://doi.org/10.1198/jasa.2011.tm09771.
Arakawa et al. [2022] Riku Arakawa, Mayank Goel, Chris Harrison, and Karan Ahuja. Rgbdgaze: Gaze tracking on smartphones with RGB and depth data. In International Conference on Multimodal Interaction, ICMI 2022, Bengaluru, India, November 7-11, 2022, pages 329–336, New York, 2022. ACM. doi: 10.1145/3536221.3556568.
Atmoko et al. [2017] R A Atmoko, R Riantini, and M K Hasin. Iot real time data acquisition using mqtt protocol. Journal of Physics: Conference Series, 853(1):012003, may 2017. doi: 10.1088/1742-6596/853/1/012003. URL https://dx.doi.org/10.1088/1742-6596/853/1/012003.
Atzori et al. [2010] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer networks, 54(15):2787–2805, 2010.
Billah et al. [2006] Baki Billah, Maxwell L. King, Ralph D. Snyder, and Anne B. Koehler. Exponential smoothing model selection for forecasting. International Journal of Forecasting, 22(2):239–247, 2006. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2005.08.002. URL https://www.sciencedirect.com/science/article/pii/S016920700500107X.
Caruana [1997] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
Chockalingam et al. [2023] Al. Chockalingam, S. Naveen, S. Sanjay, J. Nanthakumar, and V. Praveenkumar. Sensor based hotspot detection and isolation in solar array system using iot. In 2023 9th International Conference on Electrical Energy Systems (ICEES), pages 371–376, 2023. doi: 10.1109/ICEES57979.2023.10110240.
Cook et al. [2020] Andrew A. Cook, Göksel Mısırlı, and Zhong Fan. Anomaly detection for iot time-series data: A survey. IEEE Internet of Things Journal, 7(7):6481–6494, 2020. doi: 10.1109/JIOT.2019.2958185.
Dankar and El Emam [2013] Fida Kamal Dankar and Khaled El Emam. Practicing differential privacy in health care: A review. Trans. Data Priv., 6(1):35–67, 2013.
Ebrahimi et al. [2019] Shahriar Ebrahimi, Siavash Bayat-Sarmadi, and Hatameh Mosanaei-Boorani. Post-quantum cryptoprocessors optimized for edge and resource-constrained devices in iot. IEEE Internet of Things Journal, 6(3):5500–5507, 2019. doi: 10.1109/JIOT.2019.2903082.
Gao et al. [2023] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
Gardner Jr. [1985] Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985. doi: https://doi.org/10.1002/for.3980040103. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/for.3980040103.
Gebru et al. [2018] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
Geiger et al. [2013] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The kitti dataset. Int. J. Rob. Res., 32(11):1231–1237, sep 2013. ISSN 0278-3649. doi: 10.1177/0278364913491297. URL https://doi.org/10.1177/0278364913491297.
George and Thampi [2018] Gemini George and Sabu M. Thampi. A graph-based security framework for securing industrial iot networks from vulnerability exploitations. IEEE Access, 6:43586–43601, 2018. doi: 10.1109/ACCESS.2018.2863244.
Geyer et al. [2017] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
Grauman et al. [2022] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina González, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jáchym Kolář, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbeláez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, June 2022.
Hossain et al. [2019] Kaium Hossain, Mizanur Rahman, and Shanto Roy. Iot data compression and optimization techniques in cloud storage: Current prospects and future directions. Int. J. Cloud Appl. Comput., 9(2):43–59, apr 2019. ISSN 2156-1834. doi: 10.4018/IJCAC.2019040103. URL https://doi.org/10.4018/IJCAC.2019040103.
Huang et al. [2018] Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J. Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 37:185:1–185:15, November 2018. First two authors contributed equally.
Imteaj et al. [2022] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M. Hadi Amini. A survey on federated learning for resource-constrained iot devices. IEEE Internet of Things Journal, 9(1):1–24, 2022. doi: 10.1109/JIOT.2021.3095077.
Jaegle et al. [2021] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
Javaid et al. [2018] Sabeen Javaid, Ali Sufian, Saima Pervaiz, and Mehak Tanveer. Smart traffic management system using internet of things. In 2018 20th international conference on advanced communication technology (ICACT), pages 393–398. IEEE, 2018.
Jia et al. [2021] Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3496–3504, 2021.
Khan et al. [2018] Nida Saddaf Khan, Sayeed Ghani, and Sajjad Haider. Real-time analysis of a sensor’s data for automated decision making in an iot-based smart home. Sensors, 18(6), 2018. ISSN 1424-8220. doi: 10.3390/s18061711. URL https://www.mdpi.com/1424-8220/18/6/1711.
Khayyam et al. [2020] Hamid Khayyam, Bahman Javadi, Mahdi Jalili, and Reza N Jazar. Artificial intelligence and internet of things for autonomous vehicles. Nonlinear Approaches in Engineering Applications: Automotive Applications of Engineering Problems, pages 39–68, 2020.
Khor et al. [2021] Jing Huey Khor, Michail Sidorov, and Peh Yee Woon. Public blockchains for resource-constrained iot devices—a state-of-the-art survey. IEEE Internet of Things Journal, 8(15):11960–11982, 2021. doi: 10.1109/JIOT.2021.3069120.
Kong et al. [2021] Andy Kong, Karan Ahuja, Mayank Goel, and Chris Harrison. Eyemu interactions: Gaze + imu gestures on mobile devices. In Proceedings of the 2021 International Conference on Multimodal Interaction, ICMI ’21, page 577–585, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384810. doi: 10.1145/3462244.3479938. URL https://doi.org/10.1145/3462244.3479938.
Kulkarni et al. [2014] Alok Kulkarni, Sampada Sathe, et al. Healthcare applications of the internet of things: A review. International Journal of Computer Science and Information Technologies, 5(5):6229–6232, 2014.
Kumar et al. [2020] Raghavendra Kumar, Pardeep Kumar, and Yugal Kumar. Time series data prediction using iot and machine learning technique. Procedia Computer Science, 167:373–381, 2020. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2020.03.240. URL https://www.sciencedirect.com/science/article/pii/S1877050920307067. International Conference on Computational Intelligence and Data Science.
Lai et al. [2019] Xiaozheng Lai, Ting Yang, Zetao Wang, and Peng Chen. Iot implementation of kalman filter to improve accuracy of air quality monitoring and prediction. Applied Sciences, 9(9), 2019. ISSN 2076-3417. doi: 10.3390/app9091831. URL https://www.mdpi.com/2076-3417/9/9/1831.
Lee et al. [2020] Michelle A Lee, Brent Yi, Roberto Martín-Martín, Silvio Savarese, and Jeannette Bohg. Multimodal sensor fusion with differentiable filters. IROS, 2020.
Li et al. [2015] Shancang Li, Li Da Xu, and Shanshan Zhao. The internet of things: a survey. Information systems frontiers, 17:243–259, 2015.
Li et al. [2018] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. CoRR, abs/1812.06127, 2018. URL http://arxiv.org/abs/1812.06127.
Liang et al. [2020] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. 2020.
Liang et al. [2021a] Paul Pu Liang, Terrance Liu, Anna Cai, Michal Muszynski, Ryo Ishii, Nick Allen, Randy Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Learning language and multimodal privacy-preserving markers of mood from mobile data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4170–4187, 2021a.
Liang et al. [2021b] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021b.
Liang et al. [2022] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, and Russ Salakhutdinov. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning. Transactions on Machine Learning Research, 2022.
Liang et al. [2023a] Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, Zihao Deng, Xingbo Wang, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multiviz: Towards visualizing and understanding multimodal models. International Conference on Learning Representations, 2023a.
Liang et al. [2023b] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 2023b.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014.
Lloyd [2018] Kirsten Lloyd. Bias amplification in artificial intelligence systems. CoRR, abs/1809.07842, 2018. URL http://arxiv.org/abs/1809.07842.
Lopez-Martin et al. [2020] Manuel Lopez-Martin, Belen Carro, and Antonio Sanchez-Esguevillas. Iot type-of-traffic forecasting method based on gradient boosting neural networks. Future Generation Computer Systems, 105:331–345, 2020. ISSN 0167-739X. doi: https://doi.org/10.1016/j.future.2019.12.013. URL https://www.sciencedirect.com/science/article/pii/S0167739X19322319.
Mollyn et al. [2022] Vimal Mollyn, Karan Ahuja, Dhruv Verma, Chris Harrison, and Mayank Goel. Samosa: Sensing activities with motion and subsampled audio. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 6(3), sep 2022. doi: 10.1145/3550284. URL https://doi.org/10.1145/3550284.
Murugan et al. [2017] Senthil Kumar Murugan, Sishaj P Simon, Kinattingal Sundareswaran, P. Srinivasa Rao Nayak, and Narayana Prasad Padhy. An empirical fourier transform-based power transformer differential protection. IEEE Transactions on Power Delivery, 32(1):209–218, 2017. doi: 10.1109/TPWRD.2016.2575981.
Muthukrishnan et al. [2019] A. Muthukrishnan, J. Charles Rajesh kumar, D. Vinod Kumar, and M. Kanagaraj. Internet of image things-discrete wavelet transform and gabor wavelet transform based image enhancement resolution technique for iot satellite applications. Cognitive Systems Research, 57:46–53, 2019. ISSN 1389-0417. doi: https://doi.org/10.1016/j.cogsys.2018.10.010. URL https://www.sciencedirect.com/science/article/pii/S1389041718305278.
Qi et al. [2015] Jun Qi, Po Yang, Dina Fan, and Zhikun Deng. A survey of physical activity monitoring and assessment using internet of things technology. In 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, pages 2353–2358. IEEE, 2015.
Ramezani et al. [2020] Milad Ramezani, Yiduo Wang, Marco Camurri, David Wisth, Matias Mattamala, and Maurice Fallon. The newer college dataset: Handheld LiDAR, inertial and vision with ground truth. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, oct 2020.
Reed et al. [2022] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022.
Rose et al. [2015] Karen Rose, Scott Eldridge, and Lyman Chapin. The internet of things: An overview. The internet society (ISOC), 80:1–50, 2015.
Strubell et al. [2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Yasnita et al. [2020] Yasnita, Edi Sutoyo, and Ahmad Musnansyah. A hybrid of seasonal autoregressive integrated moving average (sarima) and decision tree for drought forecasting. ICONETSI ’20, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450387712. doi: 10.1145/3429789.3429870. URL https://doi.org/10.1145/3429789.3429870.
Yeong et al. [2021] De Jong Yeong, Gustavo Velasco-Hernandez, John Barry, Joseph Walsh, et al. Sensor and sensor fusion technology in autonomous vehicles: A review. Sensors, 21(6):2140, 2021.
Yuehong et al. [2016] YIN Yuehong, Yan Zeng, Xing Chen, and Yuanjie Fan. The internet of things in healthcare: An overview. Journal of Industrial Information Integration, 1:3–13, 2016.
Zadeh et al. [2018] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, 2018.
Zhang et al. [2017] Lei Zhang, Ayesha Ijaz, Pei Xiao, and Rahim Tafazolli. Channel equalization and interference analysis for uplink narrowband internet of things (nb-iot). IEEE Communications Letters, 21(10):2206–2209, 2017. doi: 10.1109/LCOMM.2017.2705710.
Zhu et al. [2021] Binwu Zhu, Ran Chen, Xinyun Zhang, Fan Yang, Xuan Zeng, Bei Yu, and Martin D.F. Wong. Hotspot detection via multi-task learning and transformer encoder. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), page 1–8. IEEE Press, 2021. doi: 10.1109/ICCAD51958.2021.9643590. URL https://doi.org/10.1109/ICCAD51958.2021.9643590.
Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

In this supplementary material, we provide the following material:

• additional implementation and dataset details in Section A,
• detailed experimental setup in Section B,
• details about evaluation metrics in Section C,
• more experimental analyses in Section D,
• more qualitative visualization results in Section E,
• more dialog examples for language models in Section F,
• dataset documentation and intended uses in Section G.

Appendix A Detailed Benchmark

We introduce the MultiIoT benchmark, the largest and most diverse IoT dataset consisting of 1.15 million samples across twelve distinct modalities, tailored towards eight challenging tasks.

A.1 Technical challenges and selection criterion

In this section, we outline unique challenges and potential real-world applications of representation learning for IoT, highlighting how these differentiate from traditional approaches. Our selection of modalities and tasks is driven by these challenges, detailed subsequently with an in-depth description of the benchmark composition.

1. High Modality Complexity: The primary challenge in IoT representation learning involves handling high-modality data from diverse sensors and environments. Unlike conventional multimodal research limited to visuals and audio, MultiIoT incorporates advanced modalities such as IMU sensors [22], thermal dynamics [26], GPS signals, and camera feeds. This variety ensures comprehensive simulation of the real world, enhancing the robustness and applicability of the learned representations. The integration of these varied modalities necessitates innovative approaches in data fusion, representation learning, and generalization across heterogeneous sensor data.
2. Temporal Dynamics: IoT devices often capture data that embodies complex temporal dynamics over extended periods. Unlike typical multimodal datasets with short sequence lengths (e.g., image-text datasets averaging 77 words [43], or video datasets spanning 10-60 seconds), MultiIoT includes data sequences spanning 100-300 steps, representing a significant leap in capturing long-range temporal interactions. This aspect introduces challenges in modeling sequential interactions that are not temporally aligned, thereby pushing the boundaries of current sequence learning methods.
3. Real-world Variability: The heterogeneity and inherent noise in IoT sensor data pose substantial challenges. MultiIoT encompasses sensors that have diverse noise signatures and lack straightforward natural language equivalents, complicating the direct application of conventional conditioning techniques used in language models. This aspect of the benchmark tests models’ abilities to handle real-world data variability and encourages the development of techniques that enhance noise robustness and semantic interpretation.
4. Real-time Processing: The real-time nature of IoT applications demands models that can process and react to multimodal inputs promptly. This requirement is crucial in fields like healthcare monitoring, home automation, and security systems. The benchmark, therefore, not only measures model accuracy and robustness but also emphasizes efficiency and speed, ensuring that the models are practical for real-time applications.

Reflecting these four core challenges, MultiIoT aggregates data from a wide array of environments and IoT devices, offering an unparalleled resource for advancing IoT research. This benchmark sets new standards in the field by providing a robust platform for developing next-generation IoT technologies that can efficiently handle complex, real-time, and multi-modal data streams.

A.2 Twelve Rich Modalities

The MultiIoT Benchmark integrates a diverse array of modalities, spanning structured and unstructured data, each bringing unique technical challenges and offering distinct perspectives on sensor-based machine learning.

IMU (Inertial Measurement Units): IMUs provide 3D motion and orientation data crucial for applications such as motion tracking and navigation. The challenge lies in accurately interpreting the noisy signals from accelerometers and gyroscopes, which are often affected by drift and require sophisticated filtering techniques to derive precise readings. We incorporated a rich dataset of IMU samples from various sources including EyeMU [30] for gaze estimation and SAMoSA [46] for synchronized 9-axis data, enhancing the benchmark’s depth in capturing real-world motion.

Thermal: Thermal sensors capture temperature variations, essential for applications like surveillance. The primary challenge is processing the subtle thermal changes in diverse environmental conditions without being overwhelmed by background noise. Our collection includes 12,025 thermal images from LLVIP [26], providing a basis for advanced thermal pattern recognition tasks.

GPS: This modality is critical for location-based services and navigation, where the challenge is to deal with signal occlusion and multipath propagation in urban settings. The benchmark includes 41,000 GPS samples from KITTI [17], offering a framework to develop and test algorithms that can robustly estimate location even in less-than-ideal conditions.

Camera: As a cornerstone of computer vision, cameras provide rich visual data but must contend with challenges such as varying lighting conditions, occlusions, and dynamic environments. Our dataset incorporates comprehensive camera data from KITTI, which is instrumental in tasks like depth estimation and object recognition.

Capacitance: These sensors detect touch and proximity by measuring changes in capacitance. One challenge is distinguishing between intentional touch and incidental contact, critical for touch-sensitive applications. We used data from TouchPose [2], which includes detailed interactions captured via a capacitive touch panel.

Depth: Depth sensors, which provide spatial data about the surroundings, are crucial for 3D modeling and interaction systems. The challenge is to derive accurate depth information in cluttered scenes or where depth cues are minimal. Our dataset includes depth data from RGBDGaze [5] and TouchPose, enhancing tasks related to 3D reconstruction and interaction.

Gaze: Tracking where a user is looking offers insights into user intent and focus. The variability in individual gaze patterns and external lighting conditions makes this data challenging to interpret. The benchmark features detailed gaze data collected using iOS devices, facilitating the development of personalized gaze-tracking technologies.

Pose: Pose data is essential for understanding body movements and interactions. Capturing accurate pose information involves challenges related to body occlusions and the 3D nature of human movements. We include extensive pose data from DIP-IMU [22] and TouchPose, providing a foundation for advanced pose estimation algorithms.

LiDAR: Used for generating precise 3D maps, LiDAR data is pivotal for autonomous vehicles and geographic mapping. The challenge is processing the massive point clouds efficiently, especially in dynamic environments. Our inclusion of LiDAR data from the Newer College dataset [50] enriches the benchmark’s utility for high-resolution spatial analysis.

Video: Video data captures dynamic scenes and is crucial for understanding temporal variations. The challenge lies in processing high-volume data streams in real time, which is essential for applications like surveillance and live activity recognition. The Ego4D [20] dataset contributes extensive egocentric video data, pushing forward research in first-person visual understanding.

Audio: This modality is essential for speech and environmental sound analysis, with challenges in noise filtering and sound source separation. We integrate audio data from SAMoSA [46], which, paired with IMU data, enhances the capability to analyze sounds in context.

Image: Static images are fundamental for numerous vision tasks, and challenges include dealing with diverse image qualities and contexts. Our dataset includes high-quality images from various sources, supporting a wide range of image processing and analysis tasks.

These modalities, each with its inherent challenges, make the MultiIoT Benchmark a comprehensive toolkit for advancing IoT device capabilities and multimodal learning research.

A.3 Eight Well-defined and Challenging Tasks

Our benchmark outlines tasks that encapsulate real-world IoT challenges, designed to propel advancements in practical applications with significant societal impacts.

Gaze Estimation: Integral to enhancing human-computer interaction, driver monitoring, and immersive experiences in virtual reality, gaze estimation involves predicting the (X, Y) coordinates of a person’s gaze based on multimodal inputs. Given RGB images of faces alongside depth information and IMU data, this regression task tests the model’s ability to integrate and interpret visual cues with spatial dynamics, addressing the challenge of sensor heterogeneity.

Depth Estimation: Critical for augmented reality, robotics, and autonomous driving, depth estimation requires predicting the distance from the camera to each image pixel. Utilizing inputs such as RGB images combined with camera parameters, GPS, and IMU data, models must generate precise depth maps. This task emphasizes understanding complex spatial relationships and integrating diverse sensory data, crucial for navigating and interacting with real-world environments.

Gesture Classification: Essential for developing intuitive human-machine interfaces, this task involves recognizing specific human gestures using gaze data and IMU outputs. Models must classify movements effectively by synthesizing data from accelerometers, gyroscopes, and orientation sensors, showcasing the need for robust cross-modal integration.

Pose Estimation: This task aims at determining the spatial configuration of human joints, crucial in gaming, AR/VR, and health monitoring. Given RGB images and IMU data, the challenge is to predict human poses, including detailed joint angles. The task demands deep cross-modal insights, especially in blending visual information with physical sensor data.

Touch Contact Classification: In touch-based user interfaces, accurately identifying the nature of touch interactions on capacitive surfaces is vital. This classification task leverages RGB and capacitive images, depth maps, and hand poses to discern touch types, highlighting the importance of multimodal interactions and the complexity of synchronizing disparate data types.

Event Detection: Widely applicable in surveillance and smart environments, this task requires the detection of specific events or anomalies from audio and motion streams. Using audio spectrograms and IMU data, models must discern and categorize events, a process that hinges on the model’s ability to correlate audio signals with physical sensor outputs across varied temporal spans.

Activity Recognition: Critical for applications in fitness and healthcare, this task involves recognizing human activities from multimodal data. Models are challenged to integrate RGB images, video frames, poses, and IMU data to classify actions accurately, demanding a nuanced understanding of both motion and visual cues in dynamic, real-time scenarios.

3D Reconstruction: Important in entertainment and spatial computing, 3D reconstruction involves creating detailed 3D models from 2D data. Given RGB images, capacitance data, and depth maps, the task tests the model’s ability to construct accurate 3D representations, requiring a sophisticated blend of image processing and depth perception skills.

Each task is formulated to push the boundaries of what’s possible with current technology, encouraging innovation in the handling of complex, multimodal datasets. These tasks not only reflect pressing real-world challenges but also serve as a robust platform for developing next-generation machine learning models tailored for the IoT ecosystem.

Appendix B Experimental Setup

In this section, we provide details about the experimental configurations used to evaluate our models across various tasks and modalities, to support reproducibility and fair evaluation.

B.1 Setup for Domain-specific Unimodal Models

• Data Preparation: Each modality (e.g., RGB images, capacitive images, hand pose) is independently processed. Normalization and modality-specific transformations are applied to standardize the input data for optimal model performance.
• Network Architecture: Tailored neural architectures are employed for each modality type. For instance, convolutional neural networks (CNNs) are used for image data and recurrent neural networks (RNNs) for sequential data, optimizing each model’s capacity to extract relevant features effectively.
• Training Details: Models are trained with a batch size of 128 using the Adam optimizer at a learning rate of 0.001. Early stopping is implemented with a patience of 10 epochs to prevent overfitting.
• Evaluation: Performance is methodically evaluated on validation datasets specific to each modality, enabling direct assessment of model efficacy and generalization.
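The training details above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the benchmark's actual training code: the class and field names (`UnimodalTrainConfig`, `EarlyStopping`, `epochs_run`) and the `min_delta` tolerance are assumptions introduced here, while the batch size, optimizer, learning rate, and patience are the values stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnimodalTrainConfig:
    """Hyperparameters stated in B.1; the field names are illustrative."""
    batch_size: int = 128
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    patience: int = 10  # epochs without improvement before stopping


class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed tolerance, not given in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, config=UnimodalTrainConfig()):
    """Return how many epochs are executed before early stopping triggers."""
    stopper = EarlyStopping(config.patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

In a real training loop, `step` would be called once per epoch with the validation loss of the modality-specific model.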

B.2 Setup for Multi-task Unimodal Models

• Data Preparation: Data pertinent to different tasks, but from the same modality, are concatenated or paired, enriching the training set.
• Network Architecture: A shared encoder processes the unified input data, followed by multiple task-specific decoders tailored to address the requirements of each task independently.
• Training Details: Gradient balancing techniques are utilized to ensure no single task dominates the learning process. This balanced approach is critical for maintaining uniform model performance across tasks.
• Evaluation: Each task is separately evaluated on tailored validation sets, highlighting the model’s task-specific competencies and areas for improvement.
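The text does not specify which gradient balancing technique is used. As one simple possibility, the sketch below rescales each task's gradient on the shared encoder to the mean gradient norm before averaging, so that a task with a large loss scale cannot dominate the update. The function names and the task names in the usage note are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def balance_gradients(task_grads, eps=1e-12):
    """Rescale each task's gradient to the mean norm, then average them.

    `task_grads` maps a task name to the flattened gradient of that task's
    loss with respect to the shared encoder parameters.
    """
    norms = [l2_norm(g) for g in task_grads.values()]
    target = sum(norms) / len(norms)
    dim = len(next(iter(task_grads.values())))
    combined = [0.0] * dim
    for grad, norm in zip(task_grads.values(), norms):
        scale = target / (norm + eps)  # eps guards against zero gradients
        for i, g in enumerate(grad):
            combined[i] += scale * g / len(task_grads)
    return combined
```

For example, `balance_gradients({"gaze": [3.0, 4.0], "gesture": [0.0, 0.5]})` gives both tasks an equal-norm contribution even though their raw gradient norms differ by a factor of ten.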

B.3 Setup for Multisensory Fusion Models

• Data Preparation: Different modalities are fused at various levels (input, feature, or decision) based on the nature of the task and the characteristics of the data.
• Network Architecture: Encoders specific to each modality process inputs independently before fusion layers integrate the features, aiming to capture and utilize the comprehensive information available across the modalities.
• Training Details: A uniform training approach using a batch size of 128 and the Adam optimizer is maintained, with particular attention to data balancing to ensure equitable representation from all modalities.
• Evaluation: The efficacy of the combined model is validated on a mixed-modality dataset, testing the model’s ability to synthesize and leverage multimodal information.
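To make the three fusion levels mentioned above concrete, the following sketch contrasts them on plain Python lists. It is a schematic illustration under simplifying assumptions: the encoders and classifiers are arbitrary callables supplied by the caller, concatenation stands in for the fusion layer, and averaging stands in for the decision rule. None of these choices are taken from the benchmark code.

```python
def input_fusion(modalities):
    """Input-level (early) fusion: concatenate raw feature vectors."""
    fused = []
    for name in sorted(modalities):  # fixed order keeps the layout stable
        fused.extend(modalities[name])
    return fused


def feature_fusion(modalities, encoders):
    """Feature-level fusion: encode each modality, then concatenate embeddings."""
    fused = []
    for name in sorted(modalities):
        fused.extend(encoders[name](modalities[name]))
    return fused


def decision_fusion(modalities, classifiers):
    """Decision-level (late) fusion: average per-modality class probabilities."""
    outputs = [classifiers[name](modalities[name]) for name in sorted(modalities)]
    n = len(outputs)
    return [sum(probs) / n for probs in zip(*outputs)]
```

Input-level fusion requires the modalities to be aligned before any modeling, whereas decision-level fusion lets each modality keep its own model and only combines the final predictions.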

B.4 Setup for Multisensory Multitask Models

• Data Preparation: A combination of data from various modalities and tasks is either paired or concatenated, depending on the specific requirements of each task.
• Network Architecture: Shared modality-specific encoders are followed by task-specific decoders, allowing fine-tuned processing paths for each task while leveraging shared learning across modalities.
• Training Details: The training regimen employs gradient and modality balancing techniques to foster a fair learning environment, promoting equal learning opportunities for all tasks and modalities.
• Evaluation: Task and modality-specific performance metrics are used to assess each aspect of the model’s capabilities comprehensively.

B.5 Setup for Multisensory Language Models

• Data Preparation: All data is pre-processed to fit the input requirements of the pre-trained network, which stays frozen so that only the adapter modules are trained.
• Network Architecture: State-of-the-art architectures such as LLaMA are enhanced with adapter layers, strategically inserted to refine the model’s ability to integrate and process multisensory data.
• Training Details: Adapter layers are fine-tuned with a larger batch size of 256, using the Adam optimizer at a reduced learning rate of 0.0005, optimizing for efficient learning dynamics.
• Evaluation: The performance of the adapted model is critically assessed against a validation set designed to challenge its enhanced capabilities, ensuring rigorous testing of its applied enhancements.
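The adapter setup above amounts to a different set of hyperparameters plus a rule for which parameters receive updates. The sketch below captures both without depending on any deep learning framework. The parameter names and the `"adapter"` naming convention are hypothetical and serve only to illustrate the selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterTrainConfig:
    """Hyperparameters stated in B.5; the field names are illustrative."""
    batch_size: int = 256
    optimizer: str = "adam"
    learning_rate: float = 5e-4


def select_trainable(param_names, marker="adapter"):
    """Return only the parameter names that belong to adapter modules.

    Assumes adapter parameters can be recognized by a substring in their
    name; everything else is treated as part of the pre-trained network
    and left out of the optimizer.
    """
    return [name for name in param_names if marker in name]
```

With a framework such as PyTorch, the selected names would determine which parameters are passed to the optimizer.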

B.6 Setup for Multisensory Multitask Language Models

• Data Preparation: Combines multisensory data streams with language data to prepare for multitask processing.
• Network Architecture: Integrates language processing units with sensory data processors within a unified architectural framework, facilitating complex multitask learning.
• Training Details: Employs sophisticated multitask learning algorithms to optimize performance across varied sensory and language tasks.
• Evaluation: Each task is individually assessed to determine the model’s effectiveness across the spectrum of included tasks and modalities.

Throughout all experimental setups, the environment remains consistent. All models are trained and evaluated on NVIDIA V100 & A100 GPUs, ensuring uniformity in computational power and performance, crucial for fair and replicable validation.

Appendix C Evaluation Metrics

To ensure a comprehensive assessment of model performance across diverse tasks, we employ a variety of metrics that are well-suited to the specific challenges posed by each task in our benchmark. These metrics not only align with industry standards but also facilitate a nuanced analysis of model effectiveness in real-world scenarios.

Gaze Estimation: The precision of gaze tracking is quantified using the Mean Euclidean Error in centimeters. This metric measures the average distance between the coordinates of the predicted gaze and the actual gaze points, providing a direct assessment of accuracy in spatial terms.
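As a concrete reading of this metric, the sketch below averages the Euclidean distance between predicted and actual points. The function name is ours, and the inputs are assumed to be equal-length sequences of coordinates already expressed in centimeters.

```python
import math


def mean_euclidean_error(predicted, actual):
    """Average Euclidean distance between paired predicted and actual points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.dist(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)
```

Because `math.dist` accepts points of any dimension, the same function applies to 2D gaze points and to 3D joint positions.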

Depth Estimation: For evaluating the accuracy of depth predictions, we use the Mean Absolute Error (MAE) in millimeters. MAE helps quantify the average magnitude of errors in the predictions without considering their direction, making it particularly useful for depth where exact value prediction is crucial.
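A minimal sketch of MAE over a depth map follows, assuming the predicted and true depth maps are equally shaped nested lists of per-pixel depths in millimeters. The function name is illustrative.

```python
def mean_absolute_error(pred_depth, true_depth):
    """MAE over all pixels of two equally shaped depth maps (rows of values)."""
    total, count = 0.0, 0
    for pred_row, true_row in zip(pred_depth, true_depth):
        for p, t in zip(pred_row, true_row):
            total += abs(p - t)  # magnitude only, the sign is discarded
            count += 1
    if count == 0:
        raise ValueError("depth maps must contain at least one pixel")
    return total / count
```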

Gesture Classification: The effectiveness of gesture recognition is measured by Accuracy, defined as the ratio of correctly classified samples to the total samples. This metric is straightforward and reflects the model’s ability to correctly identify and categorize each gesture, crucial for interactive applications.

Pose Estimation: Similar to gaze estimation, pose accuracy is evaluated using Mean Euclidean Error in centimeters. This metric calculates the average Euclidean distance between predicted and true joint positions, offering a clear measure of spatial accuracy in pose estimation.

Touch Contact Classification: For assessing how accurately the model classifies types of touch interactions, Accuracy is again utilized. This metric is particularly important for applications where precise touch recognition can enhance user interface responsiveness and interactivity.

Event Detection: The F1 Score is used to evaluate event detection, combining the measures of precision and recall. F1 is especially suitable for scenarios where a balance between false positives and false negatives is crucial, such as in surveillance or safety monitoring systems.
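The following sketch computes F1 as the harmonic mean of precision and recall. It assumes a binary setting with a single positive class. The text does not state whether a binary, macro, or micro variant is used, so this choice is an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```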

Activity Recognition: We compute Balanced Accuracy to evaluate activity recognition models. This metric is important in scenarios with imbalanced datasets, as it considers the accuracy of each class, thereby ensuring fairness across less frequent activities.
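Balanced accuracy is commonly defined as the mean of per-class recall, which is the definition assumed in the sketch below. The function name and the activity labels in the usage note are illustrative.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    if not total:
        raise ValueError("inputs must be non-empty")
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With eight "walk" samples all classified correctly and two "run" samples of which one is missed, plain accuracy is 0.9 while balanced accuracy is 0.75, which exposes the weaker performance on the rarer class.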

3D Pose Reconstruction: The End-point Error in millimeters assesses the precision of 3D pose reconstruction by measuring the mean Euclidean distance between all corresponding joints in the predicted and actual models. This metric is critical for applications in AR/VR and animation, where spatial accuracy in three dimensions is paramount.

Each of these metrics has been chosen to reflect both the efficacy and the practical utility of the models in handling real-world tasks, ensuring that our evaluations are both rigorous and relevant to practical applications.

Appendix D More Analysis

In this section, we delve deeper into two critical aspects of machine learning challenges that our experiments focused on: handling long-range interactions and managing heterogeneity in structure and noise.

D.1 Testing Long-range Interactions

Long-range interactions are essential for many machine learning applications, including time-series forecasting, natural language processing, and signal analysis. Recognizing patterns and relationships over extended sequences or across multiple modalities requires models that effectively leverage these long-range dependencies. We conducted experiments in which we systematically truncated sequences to various lengths and analyzed the performance of conventional models. As sequence lengths increased, indicating longer durations or more extensive contexts, model performance declined noticeably, highlighting a deficiency in capturing and understanding distant interactions.
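The truncation protocol described above can be sketched as a small evaluation harness. This is an illustration under stated assumptions, not the code used in our experiments: sequences are truncated by keeping their first time steps, `predict` is any callable standing in for a trained model, and accuracy is used as the score.

```python
def truncate(sequence, length):
    """Keep only the first `length` time steps of a sequence."""
    return sequence[:length]


def sweep_sequence_lengths(sequences, labels, lengths, predict):
    """Return {length: accuracy} when every input is truncated to `length` steps."""
    results = {}
    for length in lengths:
        correct = sum(
            predict(truncate(seq, length)) == label
            for seq, label in zip(sequences, labels)
        )
        results[length] = correct / len(sequences)
    return results
```

Plotting the returned scores against the sequence length gives the performance curve for a given model.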

Complexity is amplified in multimodal contexts, where long-range dependencies exist not only within individual modalities but also across different modalities. Our findings indicate that this cross-modal long-range interaction is particularly challenging, with even advanced models showing limitations. We see several promising directions: (1) developing architectures that inherently focus on long-range interactions, for example by leveraging modified self-attention mechanisms capable of handling extremely long sequences; (2) implementing models that operate at different temporal scales to summarize information effectively across various levels, which could enhance the capture of longer-range interactions; (3) employing dynamic computational resource allocation techniques to emphasize critical parts of a sequence or modality when potential long-range dependencies are detected; and (4) for multimodal problems, enhancing cross-modal attention mechanisms so that models can recognize and utilize dependencies spanning different modalities and temporal gaps.

D.2 Testing Heterogeneity in Structure and Noise

Heterogeneity in data, in terms of both structure and noise, poses significant challenges, especially as datasets become more complex and diverse. Models were exposed to datasets combining structured data (such as GPS and IMU) with unstructured data (like images and raw audio). Unimodal baselines struggled significantly with these mixed data types, leading to notable accuracy reductions. Additionally, introducing varying degrees of Gaussian noise into numerical data demonstrated a rapid performance decline as noise levels increased.
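The noise experiment above can be sketched as follows. The helper names, the seeding scheme, and the use of accuracy as the score are assumptions made for illustration. `predict` stands in for a trained model that takes a list of numerical sensor values.

```python
import random


def add_gaussian_noise(values, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def noise_sweep(samples, labels, sigmas, predict, seed=0):
    """Return {sigma: accuracy} after corrupting every sample at each noise level."""
    results = {}
    for sigma in sigmas:
        rng = random.Random(seed)  # same seed per level for comparability
        correct = 0
        for x, y in zip(samples, labels):
            correct += int(predict(add_gaussian_noise(x, sigma, rng)) == y)
        results[sigma] = correct / len(samples)
    return results
```

Setting `sigma` to 0 leaves the inputs unchanged and recovers the clean baseline against which the noisy settings are compared.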

These challenges underscore the need for enhanced robustness in model design. Our experiments reveal that even state-of-the-art models are vulnerable when confronted with unexpected data structures or noise patterns. Future work could address this in several ways: (1) exploring architectures and training strategies that inherently boost robustness to noise and heterogeneity, such as noise injection during training or dropout to foster generalization; (2) applying advanced data augmentation techniques for both structured and unstructured data to better prepare models for diverse data structures and noise scenarios; (3) utilizing meta-learning approaches to train models that can quickly adapt to new data structures or noise patterns with minimal additional training; and (4) developing sophisticated denoising mechanisms, including pre-processing methods and in-model techniques, particularly those capable of handling structured noise. This analysis and the proposed directions aim to guide future research toward machine learning models that are not only effective but also robust and adaptable to real-world data complexities.

Appendix E More examples

In this section, we provide more examples for an in-depth look at the diverse IoT modalities presented in our benchmark, emphasizing the heterogeneity in sensor types, data characteristics, and their implications for temporal interactions. These examples are instrumental in demonstrating the complex data processing challenges and the necessity for advanced interpretation techniques within IoT systems.

E.1 IMU

The Inertial Measurement Unit (IMU) is crucial for capturing dynamic motion and orientation, combining accelerometers, gyroscopes, and magnetometers. This modality is pivotal in devices like smartwatches, capturing high-resolution temporal data on user movement which is essential for applications in activity recognition and health monitoring. As shown in Figure 5, the IMU’s high sampling rate allows for capturing minute fluctuations in motion, providing a detailed temporal analysis that is vital for accurate modeling of dynamic behaviors.

E.2 Audio

Audio sensors transform sound waves into digital signals, reflecting the acoustic environment. These sensors are extensively used in smart homes to detect various sounds—from spoken commands to household activity noises, as illustrated in Figure 6. The detailed temporal granularity of audio data, essential for precise speech recognition, environmental sound classification, and emergency alert systems, showcases the critical role of temporal resolution in understanding and responding to audio cues effectively.

E.3 Capacitance

Capacitance sensing involves detecting changes in capacitance caused by touch or proximity. This technology enables the development of interactive touch interfaces and non-contact object and human interaction monitoring in IoT applications. The visualizations of capacitive data in Figure 7 illustrate its responsiveness and the importance of real-time processing, crucial for applications requiring immediate interaction feedback.

E.4 Depth

Depth sensors measure distances using technologies like LIDAR, structured light, or time-of-flight cameras, critical for understanding spatial relationships in environments necessary for autonomous navigation or interactive applications. The visualizations in Figure 8 highlight how depth data captures scene dynamics and object relations over time, underpinning the significance of temporal interactions in depth sensing for robust environmental understanding and interaction.

Appendix F More Dialog Examples

In this section, we illustrate the analytical capabilities of our multisensory multi-task language model when applied to two distinct IoT modalities: audio and Inertial Measurement Unit (IMU) data. Through detailed dialog examples, we demonstrate the model’s adeptness at interpreting heterogeneous sensory data and making informed inferences about activities based on these modalities.

F.1 Audio Analysis

Context: In the audio analysis scenario, our model was presented with a visual representation of an audio waveform. The task was to identify the corresponding activity from a predefined list of categories based on the visualized audio pattern.

Input: An audio waveform is visualized, showing varying amplitudes and frequencies over time.

Task: The model is instructed to determine the activity that best matches the waveform from options such as "Talking," "Clapping," "Coughing," and "Music."

Model Output: The multisensory multi-task language model identifies "Coughing" as the most likely activity.

Rationale: The model justifies its choice by highlighting the sudden, sporadic spikes in the waveform, characteristic of the sharp, burst-like sounds typical of coughing. This decision demonstrates the model’s ability to parse and interpret complex temporal acoustic features, translating visual representations of audio data into accurate categorical insights.

F.2 IMU Data Analysis

Context: For the IMU data analysis, the model analyzed a plotted graph displaying time-series data from various IMU sensors, including accelerometers and gyroscopes.

Input: A multi-line graph is shown, where each line represents different aspects of IMU data such as acceleration and rotational motion over time.

Task: The model is to match the pattern displayed in the IMU data to a possible physical activity from choices like "Running," "Jumping," "Knocking," and "Dancing."

Model Output: "Knocking" is selected by the multisensory multi-task language model as the activity that best corresponds to the IMU data presented.

Rationale: The model interprets the regular pattern of sharp peaks at consistent intervals as indicative of knocking. It notes the repetitive nature of the motion and the varying force, typical for an action like knocking on a surface, which would generate a rhythmic and forceful pattern in IMU sensors.

These examples underscore the model’s proficiency in interpreting and classifying data from different IoT modalities based on their temporal and sensory characteristics. The ability to analyze such heterogeneous data effectively is crucial for a variety of applications across smart home systems, healthcare monitoring, and industrial automation. The dialog examples also highlight the model’s potential to bridge the gap between raw, often opaque sensor outputs and actionable insights, making IoT systems more accessible and beneficial for a wider range of users and applications.

Appendix G Dataset Documentation & Intended Uses

In this section, we outline comprehensive documentation and intended uses for our dataset to ensure transparency, accountability, and ease of access for researchers interested in utilizing this benchmark.

G.1 Dataset Documentation

To foster clarity and responsible usage, we adhere to several established documentation frameworks:

• Datasheets for Datasets: We provide a detailed datasheet that includes the dataset’s motivation, composition, collection process, recommended uses, and limitations. This aims to ensure that all potential users have a clear understanding of how the dataset was created and its intended scope of application.
• Dataset Nutrition Labels: Similar to nutrition labels on food products, our dataset nutrition label offers a concise summary of the data contents, including data types, instances count, and a profile of typical data points.
• Data Statements for NLP: For components of our dataset applicable to NLP, we include data statements detailing linguistic demographics and speaker information, ensuring transparency in the representation within the data.
• Data Cards: Each modality within the dataset is accompanied by a data card that outlines specific characteristics, use cases, and processing procedures, which helps in understanding each part of the dataset holistically.
• Accountability Frameworks: Our dataset complies with existing accountability frameworks to ensure ethical use and application, including guidelines for addressing potential biases and misuse.

This document is based on Datasheets for Datasets by Gebru et al. [16].

MOTIVATION

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
The dataset was created to address the lack of large-scale, diverse, multimodal datasets that can be used to improve and evaluate IoT and machine learning models’ ability to interpret and process multimodal data streams in real-world scenarios.

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
This dataset was created by the authors.

What support was needed to make this dataset? (e.g., who funded the creation of the dataset? If there is an associated grant, provide the name of the grantor and the grant name and number, or if it was supported by a company or government agency, give those details.)
No. This dataset was not supported by any grants from research funding agencies.

Any other comments?
No.

COMPOSITION

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
The instances represent a combination of sensor data including audio, visual (image and video), and temporal sensor data (IMU, GPS).

How many instances are there in total (of each type, if appropriate)?
The dataset contains over 1.15 million instances.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
No. The dataset is not a subset of a larger set.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
Each instance consists of raw sensor data along with processed features, including extracted metadata and precomputed sensory features.

Is there a label or target associated with each instance? If so, please provide a description.
Yes, each instance is labeled with activity tags, environmental context, and temporal markers where applicable.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.
Yes, relationships such as sequential and contextual linkages are explicitly defined, enabling the study of interactions across time and modality.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
Yes. The dataset is split into training (70%), validation (15%), and testing (15%) sets, designed to ensure comprehensive coverage of various scenarios and conditions in each split.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
No.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
No.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.
No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
No.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.
No.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.
No.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
No.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.
No.

Any other comments?
No.

COLLECTION

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
No.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. Finally, list when the dataset was first published.
Data collection spanned over half a year.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
Different sensors were used to collect the sensory data.

What was the resource cost of collecting the data? (e.g., what were the required computational resources, and the associated financial costs, and energy consumption - estimate the carbon footprint. See Strubell et al. [53] for approaches in this area.)
We use V100 & A100 GPUs to curate data and train our models.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
No. The dataset is not a subset of a larger set.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
The authors were involved in the data curation process.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
No.

Does the dataset relate to people? If not, you may skip the remainder of the questions in this section.
No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
No.

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.
No.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.
No.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate)
No.

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.
No.

Any other comments?
No.

PREPROCESSING / CLEANING / LABELING

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.
No.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
No.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.
No.

Any other comments?
No.

USES

Has the dataset been used for any tasks already? If so, please provide a description.
No.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
No.

What (other) tasks could the dataset be used for?
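Beyond the eight benchmark tasks, the dataset could support other real-world IoT applications, such as understanding human wellbeing, controlling physical devices, and interconnecting smart cities.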

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
No.

Are there tasks for which the dataset should not be used? If so, please provide a description.
No.

Any other comments?
No.

DISTRIBUTION

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
No.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset is available for download from our website at https://github.com/Multi-IoT/MultiIoT.

When will the dataset be distributed?
The dataset will be available upon publication.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
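Yes. The dataset is released under the Creative Commons Attribution 4.0 International License (see Section G.3).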

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
No.

Any other comments?
No.

MAINTENANCE

Who is supporting/hosting/maintaining the dataset?
The dataset is maintained by the authors.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
The owner of the dataset can be contacted by email.

Is there an erratum? If so, please provide a link or other access point.
No.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?
No.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
No.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.
Yes. Older versions will be archived and accessible for historical comparison and research consistency.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
Yes. Feedback and contributions from the community are highly encouraged and can be facilitated through our repository.

Any other comments?
No.

G.2 Access and Download Links

Dataset Access: The dataset can be viewed and downloaded via our dedicated website, accessible through the following URL: https://github.com/Multi-IoT/MultiIoT. This website provides structured access to the data, along with visualization tools and download options.

Metadata Record: A comprehensive metadata record is available through our Croissant metadata entry, which can be viewed and downloaded using the following link: https://github.com/Multi-IoT/MultiIoT. This metadata follows the structure suggested by the MLCommons Croissant project, ensuring standardized documentation.
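As a rough illustration of how such a metadata record might be inspected, the sketch below parses a Croissant-style JSON-LD document with Python's standard library and lists the files and record sets it declares. The embedded record is a hypothetical, heavily abbreviated placeholder (its dataset name, file names, and record-set name are assumptions made for illustration), not the actual MultiIoT metadata, and the helper `summarize_croissant` is likewise our own naming.

```python
import json

# Hypothetical, abbreviated Croissant-style record (JSON-LD).
# All values are illustrative placeholders, not the real MultiIoT metadata.
EXAMPLE_RECORD = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileObject", "name": "imu.csv", "encodingFormat": "text/csv"},
    {"@type": "cr:FileObject", "name": "audio.zip", "encodingFormat": "application/zip"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "imu_records"}
  ]
}
"""


def summarize_croissant(text):
    """Return the dataset name, license, file names, and record-set names."""
    record = json.loads(text)
    return {
        "name": record.get("name"),
        "license": record.get("license"),
        # "distribution" lists the downloadable files described by the record.
        "files": [f["name"] for f in record.get("distribution", [])],
        # "recordSet" lists the logical tables or groups of examples.
        "record_sets": [r["name"] for r in record.get("recordSet", [])],
    }


summary = summarize_croissant(EXAMPLE_RECORD)
print(summary)
```

In practice the same function could be applied to the text of the metadata file downloaded from the repository, assuming it follows the usual Croissant layout with `distribution` and `recordSet` entries.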

G.3 Legal and Ethical Assurance

Author Responsibility: The authors bear full responsibility for the dataset and confirm that all data was collected and distributed in compliance with all applicable laws and regulations. The dataset does not violate any rights or privacy of individuals or entities.

Data Licensing: The dataset is released under the Creative Commons Attribution 4.0 International License.

G.4 Hosting, Licensing, and Maintenance

Hosting Platform: The dataset is hosted on our website, which ensures reliable and scalable access to the data.

Maintenance Plan: We commit to maintaining the dataset with regular updates and corrections as necessary. The maintenance log will be publicly available to ensure transparency in how the dataset evolves over time.

Community Engagement: We encourage the community to contribute to the dataset’s improvement by submitting issues or suggestions through our repository’s issue tracker on GitHub.
