\pkgpsifx - Psychological and Social Interactions Feature Extraction Package

Guillaume Rochette* 1,2 (ORCID: https://orcid.org/0000-0001-7842-766X)   Matthew J. Vowels* 1,2,3 (ORCID: https://orcid.org/0000-0002-8811-1156)
1 Institute of Psychology, UNIL, Switzerland
2 CVSSP, University of Surrey, United Kingdom
3 The Sense, CHUV, Switzerland
* These authors contributed equally to this work. Project Page: https://psifx.github.io/psifx
Abstract

\pkgpsifx is a plug-and-play multi-modal feature extraction toolkit, aiming to facilitate and democratize the use of state-of-the-art machine learning techniques for human sciences research. It is motivated by a need (a) to automate and standardize data annotation processes, otherwise involving expensive, lengthy, and inconsistent human labor, such as the transcription or coding of behavior changes from audio and video sources; (b) to develop and distribute open-source community-driven psychology research software; and (c) to enable large-scale access and ease of use for non-expert users. The framework contains an array of tools for tasks such as speaker diarization, closed-caption transcription and translation from audio, as well as body, hand, and facial pose estimation and gaze tracking from video. The package has been designed with a modular and task-oriented approach, enabling the community to add or update tools easily. We strongly hope that this package will provide psychologists and social scientists with a simple and practical solution for efficiently extracting a range of audio, linguistic, and visual features from audio and video, thereby creating new opportunities for in-depth study of real-time behavioral phenomena.

Keywords: multi-modal, video, audio, linguistic, feature extraction, \proglangPython, package

Address

Matthew J. Vowels
The Sense, CHUV, Switzerland
and
Inst. Psychology, University of Lausanne, Switzerland
and
CVSSP, University of Surrey, Guildford, United Kingdom
URL: https://psifx.github.io/psifx/

1 Introduction

The study of human and social interactions is of principal interest to psychologists and social scientists. In order to study such interactions, psychologists need access to meaningful and interpretable features which represent these interactions and which can be used to facilitate reproducible research. Indeed, there exist many questions about human interactions which are simply unanswerable without access to the complex and rich information about behavior as it unfolds over time (Bulling et al., 2023; Gottman and Notarius, 2000). In the context of psychotherapy outcomes and emotion research, for example, around half of the literature integrates a form of behavioral feature extraction to evaluate, understand, and improve mental health treatments, psychotherapeutic processes, and therapist training (Peluso and Freund, 2018).

To arrive at a set of variables or features representing otherwise complex video and audio data, one can deconstruct behavioral interactions into a subset of explainable features. Such features include non-verbal behaviors such as body and head pose, motion, gaze, facial expression, as well as para-verbal and verbal features including pitch and automatic transcription of spoken language, respectively. These features can be considered to be relatively ‘objective’ to the extent that they concern a representation of directly observable events, in contrast with more elusive psychological constructs such as stress, affect, or emotion, which are only indirectly measurable. Access to such low-dimensional, human interpretable representations of video and audio files facilitates a range of downstream research-related tasks (Friedlander et al., 2019; Esplin et al., 2024), including the prediction of emotion from para-verbal features (Biggiogera et al., 2021), modeling empathetic actions/intent from text (Welivita and Pu, 2020), or other statistical approaches intended to test and develop psychological theories.

Currently, the standard for extracting such complex information in the domain of psychology is the human observational coding process (Peluso and Freund, 2018; Hilpert et al., 2019; Gottman and Notarius, 2000; Bulling et al., 2023), in which a team of human coders rates various verbal, para-verbal and non-verbal behaviors as they occur second by second, or minute by minute, over the course of a recorded interaction. Unfortunately, there are three main disadvantages associated with human observational coding approaches. Firstly, and perhaps unsurprisingly, the utilization of human observational coding can be prohibitively expensive in terms of its cost in time and training (Bulling et al., 2023). Manual transcription, for example, takes between five and ten times longer than the duration of the audio itself (Bazillon et al., 2008). This expense has a concomitant negative impact on sample size and statistical power, requiring, as it does, an assignment of funds which could otherwise be used to recruit additional participants. Secondly, it has been known for some time that human behavioral coding is not well standardized, such that individual coders may provide substantively different codes from one another, depending on their particular training, background, personal experiences, etc. (Harris and Lahey, 1982). This variability makes it difficult to meaningfully compare, reproduce or aggregate results across datasets derived using independent coding teams. Thirdly, and finally, human behavioral coding is often narrow in scope, limiting the behaviors being annotated according to the specifics of the coding system. Altogether, human observational coding does not scale well to large datasets, particularly in an era where large-scale datasets feature increasingly often in modern research and for which human observational coding becomes prohibitively impractical.

These problems seriously hinder progress in research involving behaviors associated with the visual, physical, and audible modalities, such as mental health research, thereby also affecting the design of effective interventions, training, and the dissemination and sharing of data. In order to reduce the cost of such observational coding, as well as to increase its consistency and standardization, researchers have just begun to explore the possibility of automating observational coding using various signal processing and machine learning techniques (Imel et al., 2015, 2019; Creed et al., 2022; Atkins et al., 2014; Bulling et al., 2023). Many non-domain-specific, but highly relevant, state-of-the-art machine learning techniques are openly available. These include skeletal pose estimation (Cao et al., 2018), facial keypoints and expression representation (Baltrusaitis et al., 2018), speech features (Eyben et al., 2010), and automatic transcription (Radford et al., 2023).

Sadly, and despite the efficacy of these works, there remain a number of key barriers to uptake in psychology and social science. In terms of the automated coding of audio and video, effective approaches have only just emerged in response to the dramatic progress following the success of deep learning approaches (Alom et al., 2018). As such, convincing and powerful methods for effectively processing audio and video have only become available very recently, and it is perhaps no surprise that these techniques are underappreciated and disparately applied across the applied mental and behavioral health domains. There are technical hurdles to setting up such bleeding-edge methods, such as containerizing the operating system and the stack of core libraries, but also the need to be familiar with technical concepts, such as low-level programming languages, source code compilation, dependency management, etc. Additional barriers appear when using and integrating said methods into a domain-specific workflow, such as high-level programming languages, data serialization, heterogeneous command-line interfaces, etc. These sets of skills lie beyond the typical education of empirical/applied researchers in the behavioral, psychological, or health domains. Finally, assuming that a researcher already knows about a particular method for which an easy-to-use implementation does exist, they may nonetheless find that the method itself lies behind a paywall, thereby again presenting a barrier to use both in terms of the cost itself, but also, in some cases, in terms of the ethical concerns relating to the sharing of sensitive data with third parties. Thus, there exist a number of challenges facing psychologists if they wish to explore complex human interactions which unfold dynamically over time.

The open-source project \pkgpsifx, Psychological and Social Interactions Feature eXtraction, provides:

  • An integrative approach to automate the extraction of ‘objective’ non-verbal (body, hand, and head pose, facial keypoints, eye-gaze), para-verbal (speech features such as pitch, intonation, and speech rate), and verbal (automatic transcription and translation) features. Such features are crucial to moving towards a standardized data processing framework for downstream tasks relevant to the study of complex, dynamic interactions.

  • Efficient parallelism and use of accelerated hardware architectures, enabling psychologists to efficiently process large-scale datasets and dramatically reduce the cost in comparison with the current standard of human behavioral coding.

  • A simple setup and usage, as it takes only a handful of instructions to set up, and provides a homogeneous command-line interface, removing the need for programming language knowledge.

  • A public, community-oriented, repository, as well as free-to-use \proglangPython package and containerized \proglangDocker image. This, in addition to the package’s usability, helps with reproducibility, as well as fostering a transparent approach to package development and inclusivity. Indeed, it is important that psychologists have the opportunity to participate and influence the development and maintenance of the project.

  • A standardization of each task output, favoring open-source and human-readable data formats, as well as easy-to-run visualization tools to verify and explain the outputs.

  • A means to process data locally, minimizing concerns regarding the sharing of sensitive data with third party data processing services.

\pkgpsifx can be used to automatically diarize audio, transcribe language (or translate and transcribe to English), extract audio features such as speech-rate and fundamental frequency, and extract visual features such as hand, body, and head pose, facial expressions and eye gaze. Such features represent rich projections of audio and video data, and can then be used for downstream predictive or exploratory machine learning tasks relevant to psychological research, such as predicting psychotherapy session efficacy. \pkgpsifx currently offers support for two input modalities, video and audio data, as they are minimally invasive means to capture essential behavioral information conveyed during human interactions.

Figure 1: Overview of the video processing functionalities. For each synchronized video stream, we independently extract the entire pose of the body, face and hands, as well as facial action coding system units (FACS), eye-gaze and point distribution model (PDM).
Figure 2: Overview of the audio processing functionalities. The synchronized audio tracks are blended together, the resulting mixed audio track is transcribed and diarized, thus enabling the re-identification of the speakers, yielding a complete speaker transcription, as well as para-verbal speech features extraction.

Finally, we state the principal goal of \pkgpsifx’s creators: To provide empirical researchers with usable, modern, open, and community driven feature extraction tools. Indeed, our goal is not to compete against the authors of the underlying feature extraction technologies, but rather to provide a key interface between these technologies and the researchers who wish to use them. In the remainder of the paper, we provide an overview of each of the included tools and examples of command line usage. We abstain from including reports on computational efficiency, as there is little-to-no overhead associated with the psifx pipeline, above that which is associated with each of the integrated methods. If readers are interested in the processing times associated with the underlying methods, they are encouraged to consult the original contributions.

2 Functionality

In this section, we describe the functionalities, and their respective tasks, contained in the package. An overview of the processing functionalities is given in Figs. 1 and 2, and the overall structure of the package is described in Fig. 3. These figures illustrate the integrative nature of \pkgpsifx; we define tasks providing a higher level of functionality than any of its individual components. The result is an easy-to-use, integrative audio-linguistic-visual processing pipeline and interface for psychologists, which facilitates a wide range of downstream machine learning or statistical analysis tasks.

psifx/
 audio/
   diarization/
     pyannote
   identification/
     pyannote
   manipulation/
     extraction
     conversion
     mixdown
     normalization
   speech/
     opensmile
   transcription/
     whisper
 io/
   json
   rttm
   tar
   video
   vtt
   wav
 video/
   face/
     openface
   manipulation/
     trim
     crop
     resize
   pose/
     mediapipe
    
Figure 3: Structure of the package \pkgpsifx.

2.1 Installation

One of the most important advantages of \pkgpsifx is its simplified installation process. Despite the impressive performance of many open-source projects, the time and expertise required to install even just one such library can be significant. The associated difficulties increase when one wishes to install multiple libraries simultaneously, creating issues with incompatibility between package dependencies. In contrast, \pkgpsifx provides a \proglangPython package installable with \pkgpip, as well as ready-to-use containerized images, in which the external libraries are guaranteed to be compatible with one another. This provides a form of out-of-the-box practicality that most open-source projects simply do not have.

Assuming that a \proglangPython environment is set up, the package is downloaded and installed from PyPI with:

    pip install psifx

To avoid potential conflicts between \pkgpsifx dependencies and already installed software, we recommend using isolated containers, through \proglangDocker for example. Assuming \proglangDocker is set up, the package, and its external dependencies, is downloaded and run with:

docker run \
   --user $(id -u):$(id -g) \
   --mount type=bind,source=/path/to/data,target=/path/to/data \
   --interactive \
   --tty \
   psifx/psifx:latest

In both cases, the user can directly use the command-line interface and/or import and build on top of the \proglangPython library.

In contrast, the installation procedure for a (single) well-known computer vision package may require upwards of 20 commands involving the installation of a series of low-level dependencies.

2.2 Video Processing

The parts of \pkgpsifx which are concerned with the video data modality provide researchers with body, face and hand pose estimation, as well as facial keypoints, facial action units, and a point distribution model. The modules for the extraction of these features are described below. Currently, the two video processing modules are able to process data in real-time, but do not benefit yet from accelerated hardware architectures.

2.2.1 Human Pose Estimation

Human pose estimation aims at inferring the skeletal configuration of humans from images and videos and is relevant to any domain concerned with the tracking of human movement, ‘body language’, or pose, including psychotherapy and physiotherapy. These per-frame pose features have been shown to be useful across a diverse range of research involving motion energy analysis (Ramseyer, 2020), including a study concerning students with attention-deficit/hyperactivity disorder (Sempere-Tortosa et al., 2020), action recognition (Zhou et al., 2023), and the possibility to identify complex behaviors such as sustained attention in classroom settings (Zaletelj and Kosir, 2017).

The task of pose estimation consists in locating semantically meaningful keypoints either in 2D or 3D. In a monocular setting, 2D estimation is usually considered to be easier, as it is essentially a pattern recognition problem, where the machine is tasked with pinpointing the various landmarks on the image plane. Its 3D counterpart, in contrast, is an ill-posed problem because of perspective ambiguities, which make the problem hard even for humans, and it benefits from stereo vision. Given the additional experimental constraints associated with the estimation of 3D keypoints from a calibrated stereo-camera setup, we opt to include monocular 2D skeleton pose estimation as an essential tool for analysing human motion from video, with a view to including 3D pose estimation in the future.

Specifically, we integrate \pkgMediaPipe (MediaPipe, 2024), an open-source pose estimation framework by Google, which estimates the body, face, and hand configurations, and fits a detailed mesh to the face.

An example output can be seen in Fig. 4, and can be produced by,

psifx video pose mediapipe inference \
    --video /path/to/input/video.mp4 \
    --poses /path/to/output/poses.tar.gz \
    --masks /path/to/output/mask.mp4

where the input is an \pkgffmpeg-compatible video, and the output is a TAR archive containing the poses in the JSON format for every frame, as well as a video of the segmented silhouette.
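To make the output format more concrete, the following Python sketch shows one way such an archive of per-frame JSON files could be read back for downstream analysis. It is only an illustration and not part of \pkgpsifx: the member names (`000000.json`, ...) and the `toy_keypoints` key are hypothetical placeholders, since the exact internal layout of the archive is not specified here.

```python
import io
import json
import tarfile


def load_json_members(archive):
    """Read every .json member of a (possibly gzipped) TAR archive.

    `archive` is either a path (str) or a binary file object.
    Returns a dict {member_name: parsed_json}, sorted by member name.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    frames = {}
    with tarfile.open(mode="r:*", **kwargs) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                frames[member.name] = json.load(tar.extractfile(member))
    return dict(sorted(frames.items()))


def make_toy_archive():
    """Build a small in-memory archive with a HYPOTHETICAL layout, for demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for index in range(2):
            payload = json.dumps({"toy_keypoints": [index, index + 1]}).encode()
            info = tarfile.TarInfo(name=f"{index:06d}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    frames = load_json_members(make_toy_archive())
    for name, content in frames.items():
        print(name, content)
```

In practice, one would pass the path given to `--poses` instead of the toy archive, and inspect the first member to discover the actual keys.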

Figure 4: On the right, visualization of the skeletal landmarks of both \pkgMediaPipe (excluding the face mesh) and \pkgOpenFace 2.0. On the left, binary segmentation mask of the silhouette from \pkgMediaPipe.

2.2.2 Facial Expression Analysis

Facial analysis has been shown to be useful and of interest across a broad range of tasks, including the inference of internal states and emotions, attention analysis, and person recognition (Escalera et al., 2018; Mollahosseini et al., 2017; Lucey et al., 2010).

We integrate \pkgOpenFace 2.0 (Baltrusaitis et al., 2018), an open-source framework providing gaze estimation, facial keypoints, facial action coding system units (FACS), a point distribution model, and head pose estimation.

These features can be extracted with,

psifx video face openface inference \
    --video /path/to/input/video.mp4 \
    --features /path/to/output/features.tar.gz

where the input is an \pkgffmpeg-compatible video, and the output is a TAR archive containing the facial features in the JSON format for every frame.
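Because the command-line interface is homogeneous, processing a whole dataset mostly amounts to repeating the same command over many files. The sketch below, which is an illustration rather than a feature of the package, builds the command shown above for every `.mp4` file in a directory using only the `--video` and `--features` options. The directory layout, the output naming scheme, and the helper names are assumptions; `dry_run=True` only returns the commands without executing them.

```python
import subprocess
from pathlib import Path


def build_openface_command(video, output_dir):
    """Build the facial feature extraction command for one video.

    The output file name (<video stem>.tar.gz) is an assumed naming scheme.
    """
    video = Path(video)
    features = Path(output_dir) / f"{video.stem}.tar.gz"
    return [
        "psifx", "video", "face", "openface", "inference",
        "--video", str(video),
        "--features", str(features),
    ]


def run_batch(video_dir, output_dir, dry_run=True):
    """Build (and optionally run) one command per .mp4 file in `video_dir`."""
    commands = [
        build_openface_command(video, output_dir)
        for video in sorted(Path(video_dir).glob("*.mp4"))
    ]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first failing video.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in run_batch("/path/to/input", "/path/to/output"):
        print(" ".join(command))
```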

2.3 Audio Processing

The parts of \pkgpsifx which are concerned with the audio data modality provide researchers with verbal features including speaker diarization and re-identification, automatic transcription and translation into English, as well as para-verbal speech features. The modules for the extraction of these features are described below.

2.3.1 Speaker Diarization

Speaker diarization seeks to answer the question ‘who spoke when?’. To this end, we integrate \pkgpyannote (Bredin, 2023; Plaquet and Bredin, 2023), an open-source toolkit providing pre-trained machine learning pipelines for speaker diarization and audio embeddings, such that,

psifx audio diarization pyannote inference \
    --audio /path/to/input/mixed/mono/audio.wav \
    --diarization /path/to/output/diarization.rttm \
    --num_speakers $NUM_SPEAKERS

where the input is the mixed mono normalized WAV track, and the output is a diarization RTTM file, i.e. Rich Transcription Time Marked format, which contains the start and end timings as well as a unique identifier for the active speaker.

This can be run either on CPU or accelerated with CUDA-enabled hardware. If the number of speakers is not specified, \pkgpyannote will attempt to estimate it.
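As an illustration of how a diarization file can be consumed downstream, the following Python sketch parses RTTM text into (start, end, speaker) segments and sums the speaking time per speaker. It relies on the standard RTTM field order, in which a `SPEAKER` line carries the onset in the fourth field, the duration in the fifth, and the speaker name in the eighth. The example content and the speaker labels are made up, and the helper names are ours.

```python
def parse_rttm(text):
    """Parse RTTM text into a list of (start, end, speaker) tuples, in seconds."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only well-formed SPEAKER records.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append((start, start + duration, fields[7]))
    return segments


def speaking_time(segments):
    """Total speaking time in seconds for each speaker."""
    totals = {}
    for start, end, speaker in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals


# Made-up example content, for demonstration only.
EXAMPLE_RTTM = """\
SPEAKER audio 1 0.50 2.25 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.00 1.50 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 5.00 1.00 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

if __name__ == "__main__":
    print(speaking_time(parse_rttm(EXAMPLE_RTTM)))
```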

2.3.2 Speaker Re-Identification

We integrate a unique functionality into \pkgpsifx which enables users to infer the identity of the diarized speakers in the case where separate microphone channels for each speaker are available.

Since \pkgpyannote uses the mixed audio track, one can only infer the identity of the speaker for that mixed track; it is not possible to directly trace back which of the separate tracks it came from. To overcome this, we propose a speaker re-identification module, which uses the mixed audio track as well as all the original single audio tracks and the diarization results.

Using an ensemble of pre-trained speaker embedding models from \pkgpyannote, for each of the diarized entries, we temporally crop the audio tracks, both the single tracks and the mixed track. For each embedding model and each cropped track, we extract a feature embedding vector. Then, we measure the Euclidean distance between the single embedding vectors and the mixed embedding vectors to determine which single cropped track is most similar to the mixed cropped track. For each embedding model, we establish mappings between the diarized speakers and the single audio cropped tracks. Finally, we select the optimal mapping between speakers and audio sources as the one which receives the highest agreement score amongst the embedding models.
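The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the package's implementation: embeddings are plain tuples of floats rather than model outputs, distances are averaged over all segments of a speaker before picking the closest track, and the agreement score is taken to be the fraction of embedding models that produce the selected mapping. All function and variable names are hypothetical.

```python
from collections import Counter
from math import dist


def map_speakers(segments):
    """Map each diarized speaker to its closest single track for ONE embedding model.

    `segments` is a list of (speaker, mixed_embedding, {track: single_embedding}),
    one entry per diarized segment. Distances are averaged per (speaker, track).
    """
    totals = {}  # speaker -> track -> [sum of distances, number of segments]
    for speaker, mixed, singles in segments:
        for track, single in singles.items():
            entry = totals.setdefault(speaker, {}).setdefault(track, [0.0, 0])
            entry[0] += dist(mixed, single)  # Euclidean distance
            entry[1] += 1
    return {
        speaker: min(tracks, key=lambda t: tracks[t][0] / tracks[t][1])
        for speaker, tracks in totals.items()
    }


def vote(mappings):
    """Select the mapping proposed by most models, with its agreement score."""
    counts = Counter(tuple(sorted(mapping.items())) for mapping in mappings)
    best, votes = counts.most_common(1)[0]
    return dict(best), votes / len(mappings)


if __name__ == "__main__":
    # Toy 2-D embeddings for one model and two diarized segments.
    toy_segments = [
        ("SPEAKER_00", (0.0, 0.0), {"audio_1": (0.0, 0.1), "audio_2": (1.0, 1.0)}),
        ("SPEAKER_01", (1.0, 1.0), {"audio_1": (0.0, 0.0), "audio_2": (1.0, 0.9)}),
    ]
    mapping = map_speakers(toy_segments)
    swapped = {"SPEAKER_00": "audio_2", "SPEAKER_01": "audio_1"}
    # Pretend three models voted: two agree, one disagrees.
    print(vote([mapping, mapping, swapped]))
```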

This functionality enables us to determine the identity of the active speaker with respect to its source audio track, such that,

psifx audio identification pyannote inference \
    --mixed_audio /path/to/input/mixed/mono/audio.wav \
    --mono_audios \
        /path/to/input/single/mono/audio_1.wav \
        ...
        /path/to/input/single/mono/audio_N.wav \
    --diarization /path/to/input/diarization.rttm \
    --identification /path/to/output/identification.json

where the inputs are the mono normalized WAV tracks, both mixed and single, and the diarization RTTM file, and the output is a simple JSON file containing the mapping between the speaker identifiers and the single audio sources, as well as the agreement score.

This can be run either on CPU or accelerated with CUDA-enabled hardware.

2.3.3 Audio Transcription

Audio transcription aims at answering the question ‘what is spoken?’. Manual transcription is time-consuming: it takes humans between five and ten times longer than the duration of the audio recording itself (Bazillon et al., 2008). It is an important task, opening a wide array of downstream natural language processing and text analytic opportunities to researchers, including analyses of communication (Biggiogera et al., 2021) and automatic content coding in psychotherapy (Gaut et al., 2017).

We integrate OpenAI’s \pkgWhisper speech recognition model (Radford et al., 2023), which provides automatic, high-accuracy transcription across a wide range of languages including Dutch, Spanish, English, Italian, French, German, Polish, Russian, Japanese, etc., such that,

psifx audio transcription whisper inference \
    --audio /path/to/input/mono/audio.wav \
    --transcription /path/to/output/transcription.vtt \
    --model_name large

where the input is the mixed mono normalized WAV track, and the output is a VTT file, i.e. Web Video Text Tracks format, which can be opened in any text editor or "dragged-and-dropped" onto its original video in a media player, such as \pkgVLC.
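Since VTT files are plain text, they are also easy to load for further analysis. The sketch below is a deliberately minimal reader, written for illustration only: it extracts the start time, end time, and text of each cue, and ignores other parts of the WebVTT format such as cue settings, styles, and notes. The example transcript is made up.

```python
import re

# Cue timing line, e.g. "00:01:02.500 --> 00:01:04.000" (the hours part is optional).
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, milliseconds):
    """Convert the captured timestamp parts to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(milliseconds) / 1000


def parse_vtt(text):
    """Return a list of (start, end, text) cues from WebVTT content."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line are skipped.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                payload = " ".join(lines[index + 1:]).strip()
                cues.append((to_seconds(*parts[:4]), to_seconds(*parts[4:]), payload))
                break
    return cues


# Made-up example content, for demonstration only.
EXAMPLE_VTT = """WEBVTT

00:00:00.000 --> 00:00:02.500
Hello there.

00:01:02.500 --> 00:01:04.000
How are you?
"""

if __name__ == "__main__":
    for cue in parse_vtt(EXAMPLE_VTT):
        print(cue)
```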

This can be run either on CPU or accelerated with CUDA-enabled hardware. Users can also choose to translate the audio directly to English. Moreover, users can choose the size/performance of the model between tiny (< 1 GB), base (≈ 1 GB), small (≈ 2 GB), medium (≈ 5 GB) and large (≈ 10 GB). Finally, if the language is not declared, \pkgWhisper will attempt to infer the language from the audio.

We integrate another unique functionality into \pkgpsifx which enhances the transcription and translation outputs. Indeed, \pkgWhisper only infers the spoken sentences and their respective start and end timings. However, using both the \pkgpyannote diarization and the re-identification results, we can re-trace the identity of the speaker for each sentence, by computing the intersection-over-union between the sentence and the diarization segments, and assigning the active speaker identity of the best match to that sentence, such that,

psifx audio transcription enhance \
    --transcription /path/to/input/transcription.vtt \
    --diarization /path/to/input/diarization.rttm \
    --identification /path/to/input/identification.json \
    --enhanced_transcription /path/to/output/enhanced/transcription.vtt
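The matching step described above can be sketched as follows. This is a minimal illustration of intersection-over-union (IoU) matching between time intervals, not the package's code: sentences and diarization segments are assumed to be given as simple tuples in seconds, and a sentence that overlaps no segment is left unlabelled (`None`), which is our own choice for the sketch.

```python
def interval_iou(a, b):
    """Intersection-over-union of two time intervals (start, end)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def assign_speakers(sentences, diarization):
    """Label each sentence with the speaker of the best-matching segment.

    sentences:   list of (start, end, text)
    diarization: list of (start, end, speaker)
    Returns a list of (speaker or None, text).
    """
    labelled = []
    for start, end, text in sentences:
        best_speaker, best_iou = None, 0.0
        for seg_start, seg_end, speaker in diarization:
            iou = interval_iou((start, end), (seg_start, seg_end))
            if iou > best_iou:
                best_speaker, best_iou = speaker, iou
        labelled.append((best_speaker, text))
    return labelled


if __name__ == "__main__":
    # Made-up timings and labels, for demonstration only.
    sentences = [(0.0, 2.0, "Hello."), (2.5, 4.0, "Hi!")]
    diarization = [(0.0, 2.2, "SPEAKER_00"), (2.4, 4.1, "SPEAKER_01")]
    print(assign_speakers(sentences, diarization))
```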

2.3.4 Para-Linguistic Speech Analysis

Para-verbal features have been used across a wide range of research including the study of interactions between partners (Hilpert et al., 2019), sentiment analysis (Majumder et al., 2019), and prediction of depressive symptoms (Cohn et al., 2018) and mental health disorders (Naderi et al., 2019).

We integrate \pkgOpenSmile (Eyben et al., 2010), which provides up to 6373 para-verbal speech features, such as pitch (fundamental frequency), speech rate, loudness, etc. Such features have been highlighted as important for extending analyses to ‘how is it spoken?’ rather than ‘what is spoken?’.

These features can be extracted with,

psifx audio speech opensmile inference \
    --audio /path/to/input/mixed/mono/audio.wav \
    --diarization /path/to/input/diarization.rttm \
    --features /path/to/output/speech/features.tar.gz

where the inputs are the mixed mono normalized WAV track and the diarization RTTM file, and the output is a TAR archive containing the features in the CSV format for each diarized audio segment.

2.4 Utilities and Visualizations

For video processing, we offer manipulation utilities, such as temporal trimming, spatial cropping and resizing, as well as out-of-the-box visualization capabilities for the body, face and hand poses, face landmarks and gaze tracking. One can superimpose multiple feature sets on a single video by simply feeding a processed visualization back into a subsequent visualization tool.

For audio processing, we offer manipulation functionalities, such as extracting audio from video, down-mixing multiple tracks into one, and normalizing (e.g. setting the peak level to 0 dBFS), as well as the ability to plot the active speaker segments using the diarization results. Additionally, all produced transcriptions can directly be visualized in media players, such as \pkgVLC.
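To clarify what down-mixing and peak normalization mean numerically, the following Python sketch operates on plain lists of samples in the range [-1, 1]. It is an illustration under our own assumptions, not the package's implementation: tracks are averaged sample by sample, and the peak level in dBFS is computed as 20·log10(peak), so that a peak of 1.0 corresponds to 0 dBFS.

```python
import math


def mixdown(tracks):
    """Average equally long mono tracks sample by sample (assumed mixing rule)."""
    return [sum(samples) / len(tracks) for samples in zip(*tracks)]


def peak_dbfs(samples):
    """Peak level relative to full scale (1.0), in dB."""
    peak = max(abs(sample) for sample in samples)
    return -math.inf if peak == 0 else 20 * math.log10(peak)


def normalize_peak(samples, target_dbfs=0.0):
    """Scale the samples so that their peak reaches `target_dbfs`."""
    peak = max(abs(sample) for sample in samples)
    if peak == 0:
        return list(samples)  # silent track: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak
    return [sample * gain for sample in samples]


if __name__ == "__main__":
    # Two tiny made-up tracks, for demonstration only.
    mixed = mixdown([[0.2, -0.4], [0.0, 0.2]])
    normalized = normalize_peak(mixed)
    print(mixed, normalized, peak_dbfs(normalized))
```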

2.5 Maintenance and Automation

To prevent software decay and automate the deployment of new releases, we have defined Continuous Integration and Continuous Deployment (CI/CD) pipelines using \pkgGitHub Actions. These workflows take care of automatically building and publishing the \proglangPython package to the PyPI repositories, as well as building and pushing the \proglangDocker container to DockerHub. Furthermore, the documentation and the reference pages are automatically generated by parsing the source code and published to the \pkgGitHub Pages associated with the project.

This kind of maintenance functionality and automation is crucial for the lifespan and long-term usability of the package, and fits with the motivational goal of its creators: To provide empirical researchers with usable, useful, open, and community driven feature extraction tools.

3 Data Quality and Package Requirements

Readers are encouraged to consult the original articles relating to the methods implemented within \pkgpsifx for their evaluation and performance metrics. Note that the success of these methods depends strongly on the quality of the data and the recording conditions.

3.1 Recording Environment

The following observations can be made in terms of data quality:

  • In multi-camera, multi-microphone setups, the video and audio should be synchronized, e.g. be gen-locked.

  • Each video should contain no more than a single skeleton, as required by \pkgMediaPipe.

  • A multi-microphone setup is best, with clear amplitude separation between the microphones. Ideally these microphones would be body-worn microphones (i.e. lavalier or lapel microphones), placed on the collar of the participants for best diarization results.

  • The camera resolution and framerate should be adequate to capture the facial and hand keypoints and movements.

  • The lighting in the recording environment should ideally be diffuse and bright.

  • The recording environment should be quiet and free from background noise.

Of course, in practice, one or more of these requirements may not be met, and it is up to the researchers to perform pilot tests to ensure their setup is adequate for the quality of the features being extracted.

3.2 Hardware and Software Requirements

Detailed installation and dependency instructions are continuously updated and published as part of the package documentation. It is important to note that certain modules benefit significantly from CUDA-enabled hardware, such as a Graphics Processing Unit (GPU). While it is strongly recommended to use such hardware, especially for large-scale data processing, it is not essential. All features can still be run on a CPU.

4 Future Work

At the time of writing, a number of current limitations of \pkgpsifx are worth discussing, and these inform ongoing development work. The skeleton pose estimation module only detects one individual per video, so the video cameras should be positioned to capture one individual only, or at least cropped to show a single individual. This limitation informs ongoing work to explore and integrate alternative tools, in addition to MediaPipe, which can handle multiple individuals in a video frame.

Secondly, there are currently no language analysis tools in \pkgpsifx, prompting efforts to integrate ‘text’ as another modality, including a range of Natural Language Processing tools for text embedding and language modeling, such as BERT (Devlin et al., 2019). Additionally, physiology is argued to be an important modality for (e.g.) indications of stress or anxiety (Taelman et al., 2009), and may therefore also be useful for research into psychological and behavioral phenomena. This informs ongoing work to integrate infrared/thermal remote-sensing methods for photoplethysmography (heart rate) and respiration rate estimation.

Furthermore, the current diarization method uses multi-channel audio and is therefore uni-modal, potentially limiting its efficacy. This shortcoming informs ongoing work to develop a multi-modal approach to diarization that leverages both audio embeddings and movements in the keypoints around the mouth. Finally, while the GitHub repository incorporates automation for package and image builds, it does not yet include any unit tests. This omission slows down community-driven development, and integrating unit tests is a priority for the near future.

5 Conclusion

We have presented the open-source project \pkgpsifx, an integrative package for multi-modal estimation of features relevant to psychological and social sciences. The aim of \pkgpsifx is to standardize and streamline the annotation processes of human interactions in order to increase reliability and reproducibility in this field of research, and also to simplify and spread the uptake of state-of-the-art machine learning techniques in the community, by removing a number of technical barriers with respect to setup and ease of use whilst retaining efficiency. Moreover, the open-source and community-driven aspects will help to shape and nurture organic growth and increase the longevity of the project. We hope that \pkgpsifx provides empirical researchers with usable, modern, open, and community-driven extraction tools for non-verbal, para-verbal and verbal features.

Acknowledgements

The funding for this project was awarded to Prof. Dr. Peter Hilpert by the University of Lausanne.

References

  • Alom et al. (2018) Alom M, Taha T, Yakopcic C, Westberg S, Sidike P, et al (2018). “The history began from AlexNet: A comprehensive survey on deep learning approaches.” arXiv preprint, arXiv:1803.01164.
  • Atkins et al. (2014) Atkins D, Steyvers M, Imel Z, Smyth P (2014). “Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification.” Implement Sci., 9(49).
  • Baltrusaitis et al. (2018) Baltrusaitis T, Zadeh A, Lim YC, Morency LP (2018). “OpenFace 2.0: Facial Behavior Analysis Toolkit.” 13th IEEE International Conference on Automatic Face and Gesture Recognition. https://doi.org/10.1109/FG.2018.00019.
  • Bazillon et al. (2008) Bazillon T, Estève Y, Luzzati D (2008). “Manual vs Assisted Transcription of Prepared and Spontaneous Speech.” Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). http://www.lrec-conf.org/proceedings/lrec2008/pdf/277_paper.pdf.
  • Biggiogera et al. (2021) Biggiogera J, Boateng G, Hilpert P, Vowels M, Bodenmann G, Neysari M, Nussbeck F, Kowatsch T (2021). “BERT meets LIWC: Exploring state-of-the-art language models for predicting communication behavior in couples’ conflict interactions.” Companion Publication of the 2021 International Conference on Multimodal Interaction, pp. 385–389. https://doi.org/10.1145/3461615.3485423.
  • Bredin (2023) Bredin H (2023). “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe.” Proc. INTERSPEECH 2023.
  • Bulling et al. (2023) Bulling L, Heyman R, Bodenmann G (2023). “Bringing behavioral observation of couples into the 21st century.” Journal of Family Psychology, 37(1), 1–9. https://doi.org/10.1037/fam0001036.
  • Cao et al. (2018) Cao Z, Hidalgo G, Simon T, Wei SE, Sheikh Y (2018). “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields.” arXiv preprint, arXiv:1812.08008v1.
  • Cohn et al. (2018) Cohn JF, Cummins N, Epps J, Goecke R, Joshi J, Scherer S (2018). “Multimodal assessment of depression from behavioral signals.” Handbook of Multimodal-Multisensor Interfaces, 2, 375–417.
  • Creed et al. (2022) Creed T, Salama L, Slevin R, Tanana M, Imel Z, Narayanan S, Atkins D (2022). “Enhancing the quality of cognitive behavioral therapy in community mental health through artificial intelligence generated fidelity feedback (Project AFFECT): a study protocol.” BMC Health Serv. Res., 22(1). https://doi.org/10.1186/s12913-022-08519-9.
  • Devlin et al. (2019) Devlin J, Chang M, Lee K, Toutanova K (2019). “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint, arXiv:1810.04805v2.
  • Escalera et al. (2018) Escalera S, Baro X, Guyon I, Escalante H, Tzimiropoulos G, Valstar M, Pantic M, Cohn J, Kanade T (2018). “Guest editorial: The computational face.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11), 2541–2545. https://doi.org/10.1109/TPAMI.2018.2869610.
  • Esplin et al. (2024) Esplin S, Anderson S, Bean R, Whiting J (2024). “Behavioral indicators of the therapeutic alliance in relation to discontinuation in couple therapy.” Contemporary Family Therapy, 46, 139–150. https://doi.org/10.1007/s10591-023-09685-6.
  • Eyben et al. (2010) Eyben F, Wöllmer M, Schuller B (2010). “openSMILE: The Munich versatile and fast open-source audio feature extractor.” Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. https://doi.org/10.1145/1873951.1874246.
  • Friedlander et al. (2019) Friedlander M, Lee M, Escudero V (2019). “What we do and do not know about the nature and analysis of couple interaction.” Couple and Family Psychology, 8(1). https://doi.org/10.1037/cfp0000114.
  • Gaut et al. (2017) Gaut G, Steyvers M, Imel Z, Atkins D, Smyth P (2017). “Content coding of psychotherapy transcripts using labeled topic models.” IEEE J. Biomed. Health Inform., 21(2), 476–487. https://doi.org/10.1109/JBHI.2015.2503985.
  • Gottman and Notarius (2000) Gottman J, Notarius C (2000). “Decade review: Observing marital interaction.” Journal of Marriage and the Family, 62(4), 927–947. https://doi.org/10.1111/j.1741-3737.2000.00927.x.
  • Harris and Lahey (1982) Harris F, Lahey B (1982). “Recording system bias in direct observational methodology: A review and critical analysis of factors causing inaccurate coding behavior.” Clinical Psychology Review, 2(4), 539–556. https://doi.org/10.1016/0272-7358(82)90029-0.
  • Hilpert et al. (2019) Hilpert P, Brick TR, Flueckiger C, Vowels MJ, Ceulemans E, Kuppens P, Sels L (2019). “What Can Be Learned From Couple Research: Examining Emotional Co-Regulation Processes in Face-to-Face Interactions.” Journal of Counseling Psychology.
  • Imel et al. (2019) Imel Z, Pace B, Soma C, Tanana M, Hirsch T, Gibson J, Georgiou P, Narayanan S, Atkins D (2019). “Design feasibility of an automated, machine-learning based feedback system for motivational interviewing.” Psychotherapy, 56(2), 318–328. https://doi.org/10.1037/pst0000221.
  • Imel et al. (2015) Imel Z, Steyvers M, Atkins D (2015). “Computational psychotherapy research: Scaling up the evaluation of patient–provider interactions.” Psychotherapy, 52(1), 19–30. https://doi.org/10.1037/a0036841.
  • Lucey et al. (2010) Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010). “The extended Cohn-Kanade dataset: a complete dataset for action unit and emotion-specified expression.” IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 94–101.
  • Majumder et al. (2019) Majumder N, Poria S, Krishnamurthy G, Chhaya N, Mihalcea R, Gelbukh A (2019). “Variational fusion for multimodal sentiment analysis.” arXiv preprint, arXiv:1908.06008.
  • MediaPipe (2024) MediaPipe (2024). URL https://developers.google.com/mediapipe/solutions/vision/pose_landmarker/.
  • Mollahosseini et al. (2017) Mollahosseini A, Hasani B, Mahoor MH (2017). “AffectNet: a database for facial expression, valence, and arousal computing in the wild.” IEEE Transactions on Affective Computing.
  • Naderi et al. (2019) Naderi H, Soleimani B, Matwin S (2019). “Multimodal deep learning for mental health disorders prediction from audio speech samples.” 33rd Conference on Neural Information Processing Systems.
  • Peluso and Freund (2018) Peluso P, Freund R (2018). “Therapist and client emotional expression and psychotherapy outcomes: A meta-analysis.” Psychotherapy, 55(4), 461–472. https://doi.org/10.1037/pst0000165.
  • Plaquet and Bredin (2023) Plaquet A, Bredin H (2023). “Powerset multi-class cross entropy loss for neural speaker diarization.” Proc. INTERSPEECH 2023.
  • Radford et al. (2023) Radford A, Kim J, Xu T, Brockman G, McLeavey C, Sutskever I (2023). “Robust Speech Recognition via Large-Scale Weak Supervision.” Proceedings of the 40th International Conference on Machine Learning, PMLR, 202, 28492–28518.
  • Ramseyer (2020) Ramseyer F (2020). “Motion energy analysis (MEA): A primer on the assessment of motion from video.” Journal of Counseling Psychology, 67(4), 536–549. https://doi.org/10.1037/cou0000407.
  • Sempere-Tortosa et al. (2020) Sempere-Tortosa M, Fernandez-Carrasco F, Mora-Lizan F, Rizo-Maestre C (2020). “Objective analysis of movement in subjects with ADHD. Multidisciplinary control tool for students in the classroom.” International Journal of Environmental Research and Public Health, 17(15). https://doi.org/10.3390/ijerph17155620.
  • Taelman et al. (2009) Taelman J, Vandeput S, Spaepen A, Van Huffel S (2009). “Influence of mental stress on heart rate and heart rate variability.” IFMBE Proceedings. https://doi.org/10.1007/978-3-540-89208-3_324.
  • Welivita and Pu (2020) Welivita A, Pu P (2020). “A taxonomy of empathetic response intents in human social conversations.” Proceedings of the 28th International Conference on Computational Linguistics, pp. 4886–4899. https://doi.org/10.18653/v1/2020.coling-main.429.
  • Zaletelj and Kosir (2017) Zaletelj J, Kosir A (2017). “Predicting students’ attention in the classroom from Kinect facial and body features.” Journal of Image and Video Processing, 80. https://doi.org/10.1186/s13640-017-0228-8.
  • Zhou et al. (2023) Zhou L, Meng X, Liu Z, Wu M, Gao Z, Wang P (2023). “Human pose-based estimation, tracking and action recognition with deep learning: A survey.” arXiv preprint, arXiv:2310.13039.