SignSpeak: Open-Source Time Series Classification for ASL Translation

Aditya Makkar^*
Cheriton School of Computer Science
University of Waterloo
Waterloo, ON

Divya Makkar^*
Cheriton School of Computer Science
University of Waterloo
Waterloo, ON

Aarav Patel
Faculty of Engineering
University of Waterloo
Waterloo, ON

Liam Hebert
Cheriton School of Computer Science
University of Waterloo
Waterloo, ON

All authors contributed equally

Abstract

The lack of fluency in sign language remains a barrier to seamless communication for hearing and speech-impaired communities. In this work, we propose a low-cost, real-time ASL-to-speech translation glove and an exhaustive training dataset of sign language patterns. We then benchmarked this dataset with supervised learning models, such as LSTMs, GRUs and Transformers, where our best model achieved 92% accuracy. The SignSpeak dataset has 7200 samples encompassing 36 classes (A-Z, 1-10) and aims to capture realistic signing patterns by using five low-cost flex sensors to measure finger positions at each time step at 36 Hz. Our open-source dataset, models, and glove designs provide an accurate and efficient ASL translator while maintaining cost-effectiveness, establishing a framework for future work to build on.

1 Introduction

American Sign Language (ASL) is the most prominent sign language in North America [1], yet as of 2021, only 0.15% of Americans are fluent in it [2]. This low figure creates significant challenges for hearing and speech-impaired individuals, including limited access to education, opportunities and essential services, leading to isolation and depression [3].

To address these barriers, prior work using optical-based methods has shown strong results in translating images of ASL gestures to speech; however, these methods are limited in real-world applicability [4, 5]. CNN and vision-based transformer models require a camera pointed at the user’s hands while signing, which is impractical in many contexts. Cameras also present a privacy risk by capturing the user and surrounding individuals, and they require considerable computing resources because frames must be sent to a server. These constraints limit the scope of optical-based ASL translation in a real-world context.

Sensor-based models using embedded devices have been introduced to treat ASL as a time-series multi-label classification problem to address the limitations of optical systems. However, many of these datasets are private [6, 7] and have not been trained on a well-practiced sign-based language such as ASL [8], limiting their applicability. To address this, we introduce SignSpeak, an open-source ASL dataset comprising 7200 recordings of 36 classes. Our dataset was recorded using five low-cost flex sensors, one for each finger, for all letters and numbers in ASL. The scale of our dataset enables researchers to test novel models on a dataset collected in an environment aimed at real-world feasibility to progress ASL-to-speech efforts. We extensively benchmarked various methods on this dataset, with our best result achieving 92% categorical accuracy, which matches or exceeds previous ASL time-series classification work [9]. The GitHub codebase and dataset are available at https://github.com/adityamakkar000/ASL-Sign-Research and https://doi.org/10.7910/DVN/ODY7GH, respectively.

2 Related work

Previous work using a glove-based apparatus involves sensory devices such as flex sensors and inertial measurement units (IMUs). Amin et al. [6] utilize flex sensor gloves to capture 37 hand gestures (numbers 0-10 and letters A-Z) and classify them with MLPs, achieving 97.6% accuracy. However, a fundamental flaw limits real-world applicability: the measurements are static and recorded at only one point during the gesture. This fails to account for ASL’s dynamic nature, since each sign is a sequence of motions that must be continuously measured. Furthermore, the dataset is closed-source, prohibiting others from building on it.

Lee et al. [7] developed a glove taking continuous measurements of 6 inertial measurement units, including an accelerometer, gyroscope, and magnetometer. They report 99.87% accuracy; however, each input is 10-15 seconds long, which is impractical for real-world signing, performed at a significantly faster rate of 4 syllables per second [10]. Similar to the previous work, the dataset was not released publicly.

Tan et al. [5] developed a 28-sensor glove which recorded 63 data channels to train an LSTM model (certain sensors measured multiple data channels; see the referenced paper for more details). The large number of data channels and sensors significantly increases the glove’s cost, decreasing its affordability. In addition, fewer sensors can produce similar results in commercial use. Králik and Šuppa [8] utilized a transformer architecture to achieve over 99% accuracy on a synthetic glove-collected gesture dataset, which limits its applicability to the ASL community.

We differ from previous work by introducing an open-source ASL dataset measuring 5 flex sensor channels. It includes 200 samples for each of the 36 alphanumeric classes (7200 in total), allowing for a cost-effective and resource-efficient glove with broad applicability for the ASL community.

3 Methodology

Figure 1: Circuit of data collection glove.

3.1 Data collection

For this study, a glove was constructed with five flex sensors wired in parallel, one per finger, each in series with a 10,000 $\Omega$ resistor. A voltage of 5 $V$ was applied, and the voltage across each sensor was measured with an Arduino MEGA 2560. We recorded each feature within the standard Arduino 10-bit range of [0, 1023]. Each gesture was recorded at 36 Hz, and only while the sum of the five flex sensor readings was below 5000 (equivalently, a summed voltage of 24.4 $V$). This threshold was determined experimentally and indicates that the fingers were flexed (the sign being performed), allowing for intentional data collection. We retain all gesture recordings between 1.38 and 2.22 seconds (50 to 80 time steps) to ensure that accidental gestures were not added and that the gestures reflect realistic signing patterns.
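The unit conversions in this paragraph can be checked with a few lines of Python. This is a minimal sketch, not the authors' collection firmware: the helper names are hypothetical, and treating both ends of the 50 to 80 step window as inclusive is an assumption.

```python
from typing import List

ADC_MAX = 1023          # 10-bit Arduino reading range is [0, 1023]
V_REF = 5.0             # applied voltage in volts
SAMPLE_RATE_HZ = 36     # recording rate stated in the text
SUM_THRESHOLD = 5000    # summed-reading threshold stated in the text
MIN_STEPS, MAX_STEPS = 50, 80   # accepted window; inclusive bounds are assumed


def counts_to_volts(counts: float) -> float:
    """Convert a raw ADC reading (or a sum of readings) to volts."""
    return counts * V_REF / ADC_MAX


def steps_to_seconds(steps: int) -> float:
    """Duration of a recording with the given number of time steps."""
    return steps / SAMPLE_RATE_HZ


def is_flexed(frame: List[int]) -> bool:
    """A frame counts as part of a gesture when the five readings sum below the threshold."""
    return sum(frame) < SUM_THRESHOLD


def keep_recording(frames: List[List[int]]) -> bool:
    """Keep a recording only if its length falls inside the accepted window."""
    return MIN_STEPS <= len(frames) <= MAX_STEPS


print(round(counts_to_volts(SUM_THRESHOLD), 1))                   # summed voltage at the threshold
print(steps_to_seconds(MIN_STEPS), steps_to_seconds(MAX_STEPS))   # window in seconds
```

The first line printed reproduces the 24.4 V figure, and the second shows that 50 and 80 steps at 36 Hz correspond to roughly 1.39 and 2.22 seconds.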

3.2 Model architecture

Each gesture recording contains $C=5$ channels and has a maximum time dimension of $T=79$, with all inputs zero-padded so that every recording in a batch has the same length. We benchmarked RNN and Transformer-based time series models on the SignSpeak dataset. In particular, we evaluated a 2-layer LSTM [11] and a 2-layer GRU [12] model, where the output from the last cell was fed into a 2-layer MLP Softmax classification layer. We then applied a dropout layer to reduce overfitting [13].
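As an illustration of the padding step, the sketch below pads variable-length recordings to a common shape of $(T, C)$. The function names are hypothetical, and padding on the right (after the last real time step) is an assumption; the text does not say which side is padded.

```python
from typing import List, Tuple

T_MAX = 79   # maximum number of time steps stated in the text
C = 5        # one channel per flex sensor

Recording = List[List[float]]


def pad_recording(frames: Recording, t_max: int = T_MAX) -> Recording:
    """Right-pad a (t, C) recording with all-zero frames up to (t_max, C)."""
    if len(frames) > t_max:
        raise ValueError(f"recording has {len(frames)} steps, more than {t_max}")
    padding = [[0.0] * C for _ in range(t_max - len(frames))]
    return [list(frame) for frame in frames] + padding


def pad_batch(batch: List[Recording]) -> Tuple[List[Recording], List[int]]:
    """Pad every recording to (T_MAX, C) and also return the true lengths."""
    lengths = [len(recording) for recording in batch]
    return [pad_recording(recording) for recording in batch], lengths


short = [[1.0] * C for _ in range(50)]
full = [[2.0] * C for _ in range(79)]
padded, lengths = pad_batch([short, full])
print(len(padded), len(padded[0]), len(padded[0][0]), lengths)
```

Returning the true lengths alongside the padded batch is a design choice that lets a model read the hidden state at the last real time step instead of at a padded one.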

$$\mathbf{y}_{0}=\text{LSTM}(\mathbf{x}) \tag{1}$$

$$\mathbf{y}_{\text{output}}=\text{SOFTMAX}(\text{MLP}(\text{MLP}(\text{LSTM}(\mathbf{y}_{0})^{(T)}))) \tag{2}$$

Eq. 1 and 2 describe the Stacked LSTM model, where $\mathbf{x}\in\mathbb{R}^{T\times C}$ and $\mathbf{y}_{\text{output}}^{\top}\in\mathbb{R}^{\text{classes}}$. Toro-Ossaba et al. [9] presented a dense-LSTM network for EMG-ASL classification, showing that an MLP projection before the RNN could achieve state-of-the-art results. This work used a 2-layer MLP Softmax classifier following the dense-RNN unit.

$$\mathbf{y}_{0}=[\text{MLP}(\mathbf{x}^{(0)}),\text{MLP}(\mathbf{x}^{(1)}),\ldots,\text{MLP}(\mathbf{x}^{(T)})] \tag{3}$$

$$\mathbf{y}_{\text{output}}=\text{SOFTMAX}(\text{MLP}(\text{MLP}(\text{LSTM}(\mathbf{y}_{0})^{(T)}))) \tag{4}$$

Eq. 3 and 4 describe the dense-LSTM. For a dense-stacked RNN, eq. 4 is modified by composing the RNN function with itself for the input $\mathbf{y}_{0}$. Each RNN gate used a Sigmoid activation, while MLPs used a Tanh activation. The hidden size of the RNN cells was $h_{\text{RNN}}=64$, and the hidden size of the dense and output MLPs was $h_{\text{MLP}}=128$.
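A rough parameter count helps make these sizes concrete. The sketch below is not taken from the authors' code and rests on several assumptions: the dense projection maps $C \to h_{\text{MLP}}$, the output classifier is $h_{\text{RNN}} \to h_{\text{MLP}} \to 36$, each recurrent gate carries two bias vectors (the convention used by PyTorch's `nn.LSTM` and `nn.GRU`), and the non-stacked dense models use a single recurrent layer.

```python
C, H_RNN, H_MLP, CLASSES = 5, 64, 128, 36


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of one fully connected layer."""
    return n_in * n_out + n_out


def rnn_layer(n_in: int, hidden: int, gates: int) -> int:
    """One recurrent layer: per gate, input weights, recurrent weights and two bias vectors."""
    return gates * (n_in * hidden + hidden * hidden + 2 * hidden)


def head() -> int:
    """Assumed 2-layer output MLP: H_RNN -> H_MLP -> CLASSES."""
    return linear(H_RNN, H_MLP) + linear(H_MLP, CLASSES)


def model_params(gates: int, rnn_layers: int, dense: bool) -> int:
    """Total parameters; gates=4 for an LSTM, gates=3 for a GRU."""
    total = linear(C, H_MLP) if dense else 0
    n_in = H_MLP if dense else C
    for _ in range(rnn_layers):
        total += rnn_layer(n_in, H_RNN, gates)
        n_in = H_RNN
    return total + head()


for name, gates, layers, dense in [
    ("Dense LSTM", 4, 1, True),
    ("Dense GRU", 3, 1, True),
    ("Stacked LSTM", 4, 2, False),
    ("Stacked GRU", 3, 2, False),
    ("Dense Stacked LSTM", 4, 2, True),
    ("Dense Stacked GRU", 3, 2, True),
]:
    print(f"{name}: {model_params(gates, layers, dense)}")
```

Under these assumptions the totals land within about 1K of the parameter counts reported in Table 1, which suggests the assumed layer shapes are plausible but does not confirm them.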

In recent literature, transformers have matched or exceeded SOTA benchmarks in time-series classification. The WaveGlove Encoder of Králik and Šuppa [8], based on Transformers [14], has surpassed previous SOTA architectures on this task. Inspired by this architecture, we benchmark a slightly modified version of WaveGlove on SignSpeak, adding a classification token ([CLS]) to the start of the input as done with BERT [15]. The input is passed through a learnable embedding and positional embedding table, and the projected input is fed into an Encoder [14] with layer normalization before the self-attention and MLPs, as described by Dosovitskiy et al. [16]. The input $\mathbf{x}\in\mathbb{R}^{T\times C}$ represents 5 flex-sensor channels across time $T$ and is projected into a dimension $D=32$, with the sequential nature encoded by the positional embedding. We utilized the GELU activation function [17], and the number of layers was $L=5$. All Encoder and RNN parameters were found through a Cartesian product (grid) hyperparameter sweep.

$$\mathbf{y}_{0}=[\mathbf{x}^{\text{class}},\mathbf{x}\mathbf{E}_{\text{emb}}]+\mathbf{E}_{\text{pos\_emb}} \tag{5}$$

$$\mathbf{y}_{l}=\text{Encoder}(\mathbf{y}_{l-1}),\quad\text{where }l=1,2,\ldots,L \tag{6}$$

$$\mathbf{y}_{\text{output}}=\text{SOFTMAX}(\text{MLP}(\text{LN}(\mathbf{y}_{L}^{(0)}))) \tag{7}$$

The encoder is described by eq. 5 to 7, where $\mathbf{E}_{\text{emb}}\in\mathbb{R}^{C\times D}$ and $\mathbf{E}_{\text{pos\_emb}}\in\mathbb{R}^{(T+1)\times D}$.
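The shapes in eq. 5 can be traced with a small pure-Python sketch. It only builds the encoder input (class token, embedding projection, positional table) and is not the authors' implementation; the random matrices stand in for learned parameters, and all helper names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]
T, C, D = 79, 5, 32


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Stand-in for a learnable parameter table (random values, for shape checking only)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (n, k) @ (k, m) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def encoder_input(x: Matrix, x_class: List[float], e_emb: Matrix, e_pos: Matrix) -> Matrix:
    """Eq. 5: prepend the class token to x E_emb, then add the positional table."""
    tokens = [x_class] + matmul(x, e_emb)          # shape (T + 1, D)
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, e_pos)]


rng = random.Random(0)
x = rand_matrix(T, C, rng)                 # one recording, shape (T, C)
y0 = encoder_input(
    x,
    x_class=rand_matrix(1, D, rng)[0],     # class token, shape (D,)
    e_emb=rand_matrix(C, D, rng),          # shape (C, D)
    e_pos=rand_matrix(T + 1, D, rng),      # shape (T + 1, D)
)
print(len(y0), len(y0[0]))                 # rows and columns of y_0
```

The extra row contributed by the class token is why the positional table has $T+1$ rows, and row 0 of the final layer output is what eq. 7 passes to the classifier.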

4 Results

Table 1: Model description and results on accuracy, F1-score, $\sigma_{accuracy}$ and $\sigma_{F1Score}$

Model	Parameters	Categorical Accuracy	F1Score	$\sigma_{accuracy}$	$\sigma_{F1Score}$
Dense LSTM	63K	0.8348	0.8301	0.0110	0.0100
Dense GRU	51K	0.6692	0.6574	0.0565	0.0631
Stacked LSTM	64K	0.9167	0.9164	0.0067	0.0068
Stacked GRU	51K	0.9221	0.9218	0.0109	0.0110
Dense Stacked LSTM	96K	0.8876	0.8873	0.0194	0.0196
Dense Stacked GRU	76K	0.9192	0.9188	0.0079	0.0079
Encoder	67K	0.9136	0.8873	0.0078	0.0195

All models were trained with AdamW [18], with $\beta_{1}=0.9$ and $\beta_{2}=0.999$, a weight decay of 0.01, and a plateau learning rate decay on validation loss (decay factor of 0.5, patience of 20 epochs), starting from a learning rate of 0.001 and decaying to a minimum of 0.0001. RNNs were trained with a batch size of 64 for 15 minutes on an M2, and the Encoder was trained with a batch size of 256 for 15 minutes on a T4 GPU. All models used a 0.2 dropout probability. The metrics used to evaluate all models were categorical accuracy and the F1-Score. We utilized stratified 5-fold cross-validation and reported the average and standard deviation of the results on the held-out folds.
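To show how the plateau schedule behaves, here is a minimal pure-Python sketch using the values stated above (start 0.001, factor 0.5, patience 20, floor 0.0001). The class name is hypothetical, and the rule of decaying once the number of non-improving epochs exceeds the patience follows the common reduce-on-plateau convention; it is an assumption about, not a copy of, the authors' training loop.

```python
class PlateauDecay:
    """Halve the learning rate when validation loss stops improving."""

    def __init__(self, lr=1e-3, factor=0.5, patience=20, min_lr=1e-4):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # Decay, but never below the configured floor.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


sched = PlateauDecay()
seen = [sched.lr]
for epoch in range(120):
    lr = sched.step(val_loss=1.0)      # a loss that never improves after the first epoch
    if lr != seen[-1]:
        seen.append(lr)
print(seen)                            # distinct learning rates visited, in order
```

With a loss that never improves, the rate halves three times and is then clamped at the minimum, so at most four decays are possible under this configuration.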

The results in Table 1 indicate that methods developed on private datasets do not generalize to the SignSpeak dataset. This may be due to a reduction in the number of data channels, as can be seen with a model such as the Transformer, where a lack of data channels reduces its performance. In particular, we found that simple models such as a stacked GRU perform the best, whereas models such as the dense LSTM proposed by Toro-Ossaba et al. [9] do not achieve near state-of-the-art results. We believe that the potential of Transformer-based architectures can be unlocked with more training data, after which they can be further fine-tuned on the SignSpeak dataset. Our leading RNN and Encoder models still maintain above 99.5% traditional accuracy, demonstrating performance on par with previous studies.
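The gap between the categorical accuracies in Table 1 and the 99.5% figure is easier to follow with a short derivation. The sketch below assumes, without confirmation from the text, that "traditional accuracy" means one-vs-rest accuracy averaged over the $K=36$ classes; the symbols $a$, $N$, and $K$ are introduced here for illustration only.

```latex
% Assumption: "traditional accuracy" = one-vs-rest accuracy averaged over the K classes.
% a = categorical accuracy, N = number of evaluated samples.
\begin{align*}
\text{acc}_k &= \frac{TP_k + TN_k}{N} = 1 - \frac{FP_k + FN_k}{N}, \\
\sum_{k=1}^{K} (FP_k + FN_k) &= 2N(1 - a)
  \quad \text{(each misclassified sample is one FN and one FP)}, \\
\frac{1}{K}\sum_{k=1}^{K} \text{acc}_k &= 1 - \frac{2(1 - a)}{K}.
\end{align*}
% With K = 36: a = 0.9221 gives 1 - 2(0.0779)/36 \approx 0.9957,
% and a = 0.9136 gives 1 - 2(0.0864)/36 \approx 0.9952.
```

Under this reading, the stacked GRU and Encoder accuracies from Table 1 both map to values just above 99.5%, which would be consistent with the statement above.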

Additionally, the Transformer architecture has the largest difference between F1-score and accuracy, indicating a bias towards certain classes. Figure 2 displays the confusion matrix, and it is evident that the low accuracy is due to specific classes such as 'E' and 'L'. Specifically, the Encoder incorrectly predicts 'L' 36% of the time when the actual label is 'E'. In ASL, these letters do not share the same features; 'E' is instead very similar to a letter such as 'A'. This indicates that the model over-predicts certain classes, which could be caused by outlier patterns in the dataset. For the same class, the stacked GRU and LSTM models predict 'L' instead of 'E' 16% and 13% of the time, respectively. This indicates that the bias is strongest in the Encoder model, but over-fitting is still present in all models.
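The effect described here, where confusing one class with another pulls F1 below accuracy, can be reproduced on a toy confusion matrix. The counts below are made up for illustration and are not taken from the SignSpeak results; macro averaging of the F1-score is also an assumption, since the text does not state which averaging was used.

```python
from typing import List


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of per-class F1 scores."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


def confusion_rate(cm: List[List[int]], true_idx: int, pred_idx: int) -> float:
    """Share of samples with label true_idx that were predicted as pred_idx."""
    return cm[true_idx][pred_idx] / sum(cm[true_idx])


# Toy 3-class matrix (made-up counts): rows and columns are A, E, L.
# 36 of the 100 'E' samples are predicted as 'L'.
cm = [
    [100, 0, 0],
    [4, 60, 36],
    [0, 0, 100],
]
print(round(accuracy(cm), 4), round(macro_f1(cm), 4), confusion_rate(cm, 1, 2))
```

In this toy case the errors concentrated in one row lower the recall of 'E' and the precision of 'L' at the same time, so the macro F1 falls below the overall accuracy.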

5 Future work

The models presented in this study required only a moderate amount of computing power to achieve 92% accuracy. In the future, leveraging more powerful computing resources can enable the implementation of larger-scale architectures to further enhance performance. Additionally, the gestures chosen in this dataset (alphanumeric classes) reflect an extremely limited subsection of real-world ASL; thus, future work is aimed at expanding the dataset by collecting data and creating classes for phrases and actions that resemble daily communication, making the product viable for commercial use. Lastly, while our measurements were recorded at 36 Hz, which is slower than average ASL signing rates, we anticipate that using an improved MCU will allow us to increase this frequency to 200 Hz, aligning with more realistic signing speeds [10]. These advancements will expand on our existing research and contribute to a more refined product that can facilitate the integration of hearing and speech-impaired individuals into society.

6 Conclusion

In this study, we presented SignSpeak, an open-source dataset collected using a custom low-cost glove and benchmarked on time-series classification models to mimic real-time ASL translation. We found that a stacked GRU achieves the strongest results on categorical accuracy. SignSpeak benefits speech and hearing-impaired communities by providing a way to benchmark models on a universal dataset. Our work on SignSpeak can provide a foundation for researchers to build upon our open-source dataset, leveraging supervised learning techniques to deliver assistive and accessible technology to communities in need.

References

[1] National Institute on Deafness and Other Communication Disorders. American sign language, 2021. https://www.nidcd.nih.gov/health/american-sign-language#:~:text=ASL%20is%20expressed%20by%20movements,some%20hearing%20people%20as%20well.
[2] Rakesh Kumar, Vishal Goyal, and Lalit Goyal. Comparative analysis of automatic sign language generation systems. Journal of Scientific Research, 65(5):226–235, 2021.
[3] Janine Verge. Reducing Barriers for People Living with Hearing Loss During In-Person Meetings. Canadian Audiology, 2023. URL https://canadianaudiology.ca/wp-content/uploads/2023/04/CAA_Stay-Connected_In-Person_Booklet_01.pdf.
[4] Atesam Abdullah, Nisar Ali, Raja Hashim Ali, Zain Ul Abideen, Ali Zeeshan Ijaz, and Abdul Bais. American sign language character recognition using convolutional neural networks. In 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pages 165–169, 2023. doi: 10.1109/CCECE58730.2023.10288799.
[5] Chun Keat Tan, Kian Ming Lim, Roy Kwang Yang Chang, Chin Poo Lee, and Ali Alqahtani. Hgr-vit: Hand gesture recognition with vision transformer. Sensors, 23(12), 2023. ISSN 1424-8220. doi: 10.3390/s23125555. URL https://www.mdpi.com/1424-8220/23/12/5555.
[6] Muhammad Saad Amin, Syed Tahir Hussain Rizvi, Alessandro Mazzei, and Luca Anselma. Assistive data glove for isolated static postures recognition in american sign language using neural network. Electronics, 12(8), 2023. ISSN 2079-9292. doi: 10.3390/electronics12081904. URL https://www.mdpi.com/2079-9292/12/8/1904.
[7] Boon Giin Lee, Teak-Wei Chong, and Wan-Young Chung. Sensor fusion of motion-based sign language interpretation with deep learning. Sensors, 20(21), 2020. ISSN 1424-8220. doi: 10.3390/s20216256. URL https://www.mdpi.com/1424-8220/20/21/6256.
[8] Matej Králik and Marek Šuppa. Waveglove: Transformer-based hand gesture recognition using multiple inertial sensors. In 2021 29th European Signal Processing Conference (EUSIPCO), pages 1576–1580, 2021. doi: 10.23919/EUSIPCO54536.2021.9616000.
[9] Alejandro Toro-Ossaba, Juan Jaramillo-Tigreros, Juan Tejada, Alejandro Peña, Alexandro López-González, and Rui Alexandre Castanho. Lstm recurrent neural network for hand gesture recognition using emg signals. Applied Sciences, 12, 09 2022. doi: 10.3390/app12199700.
[10] Ronnie B. Wilbur. Effects of varying rate of signing on asl manual signs and nonmanual markers. Language and Speech, 52(2–3):245–285, Jun 2009. doi: 10.1177/0023830909103174.
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, nov 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
[12] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
[13] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020. URL https://arxiv.org/abs/2010.11929.
[17] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.
[18] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.