SLAM-based Joint Calibration of Multiple Asynchronous Microphone Arrays and Sound Source Localization

Jiang Wang, Yuanzheng He, Daobilige Su, Katsutoshi Itoyama, Kazuhiro Nakadai, Junfeng Wu, Shoudong Huang,
Youfu Li, and He Kong
This paper has been accepted by, and will appear in, the IEEE Transactions on Robotics. Jiang Wang, Yuanzheng He, and He Kong (corresponding author) are with the Shenzhen Key Laboratory of Control Theory and Intelligent Systems, Southern University of Science and Technology, No. 1088 Xueyuan Avenue, Shenzhen, China; Email: [email protected]; [email protected]; [email protected]. Daobilige Su is with the College of Engineering, China Agricultural University, Beijing, China; Email: [email protected]. Katsutoshi Itoyama and Kazuhiro Nakadai are with the Department of Systems and Control Engineering, Tokyo Institute of Technology, Tokyo, Japan; Email: itoyama;[email protected]. Junfeng Wu is with the School of Data Science, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China; Email: [email protected]. Shoudong Huang is with the Robotics Institute, University of Technology Sydney, Sydney, Australia; Email: [email protected]. Youfu Li is with the Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China; Email: [email protected].
Abstract

Robot audition systems with multiple microphone arrays have many applications in practice. However, accurate calibration of multiple microphone arrays remains challenging because there are many unknown parameters to be identified, including the relative transforms (i.e., orientation, translation) and asynchronous factors (i.e., initial time offset and sampling clock difference) between microphone arrays. To tackle these challenges, in this paper, we adopt batch simultaneous localization and mapping (SLAM) for joint calibration of multiple asynchronous microphone arrays and sound source localization. Using the Fisher information matrix (FIM) approach, we first conduct the observability analysis (i.e., parameter identifiability) of the above-mentioned calibration problem and establish necessary/sufficient conditions under which the FIM and the Jacobian matrix have full column rank, which implies the identifiability of the unknown parameters. We also discover several scenarios where the unknown parameters are not uniquely identifiable. Subsequently, we propose an effective framework to initialize the unknown parameters, which is used as the initial guess in batch SLAM for multiple microphone arrays calibration, aiming to further enhance optimization accuracy and convergence. Extensive numerical simulations and real experiments have been conducted to verify the performance of the proposed method. The experimental results show that the proposed pipeline achieves higher accuracy and faster convergence than both existing frameworks and methods that use the noise-corrupted ground truth of the unknown parameters as the initial guess in the optimization.

Index Terms:
Robot audition; Simultaneous localization and mapping; Multiple microphone arrays calibration; Sound source localization.

I INTRODUCTION

Microphone array-based robotic auditory systems have many applications such as sound source localization and human-robot interaction [1]-[5]. As with other sensing modalities [6]-[10], precise calibration of robotic auditory system parameters is crucial for achieving satisfactory sound source localization and tracking performance [11]. Hence, the calibration of robotic auditory systems made of single or multiple microphone arrays has received significant attention recently.

Of particular interest in this paper is the parameter calibration of robotic auditory systems that are made of multiple microphone arrays. Compared to single microphone array-based audition systems, there are more parameters to be calibrated for systems with multiple microphone arrays, including the relative transforms (i.e., orientation, translation) and the asynchronous offsets among the arrays. In the following, we first give a brief overview of the relevant literature on calibration of single microphone array-based systems, and then discuss the existing calibration methods for systems with multiple arrays.

I-A Related Work

In [12], based on the time difference of arrival (TDOA) between each pair of microphones, a calibration algorithm was developed to estimate the positions of microphones within a single microphone array. In [13], a bilinear calibration method based on time of flight (TOF) between each sensor source pair was proposed to estimate the microphone and source positions in 3D under the condition that the transmitting time is known. In [14], based on time of arrival (TOA) measurements and assuming knowledge of the distances between the sources and the microphones, a method for joint calibration of the positions of multiple microphones and sound source localization was proposed. In [15], a calibration method using TOA measurements was proposed for the scenario with a planar microphone array and a sound source moving in 3D.

Note that the applicability of the above-mentioned methods is limited in that they all rely on hardware synchronization between microphone channels, which is challenging to implement for robotic platforms in practice due to spatial and cost constraints [11]. Recently, in [16]-[18], a general framework using batch simultaneous localization and mapping has been developed for joint sound source localization and calibration of a single microphone array with asynchronous effects (i.e., clock difference and initial time offset).

Compared to single microphone array-based systems, the calibration of systems with multiple arrays has gained more recent attention. For example, the proposed approach in [19] utilizes direction of arrival (DOA) measurements to determine the sound source location and inter-array TDOA measurements to obtain the microphone array location through exhaustive grid search. The work [20] employs evolutionary algorithms to improve the accuracy and real-time performance of the approach in [19]. Based on DOA and inter-array TDOA measurements, another calibration framework for multiple microphone arrays is proposed in [21] using distributed damped Newton optimization. Note that the above-mentioned methods focus on the 2D case.

For the more general 3D case, there are only a few existing works. In [22], an artificial bee colony algorithm was employed to calibrate the positions and orientation of microphone arrays in 3D. Nevertheless, this method assumes that the sound source position at different moments is partially known and the clocks of the arrays are synchronized using hardware. Simultaneous calibration of positions, orientations, and time offsets of multiple microphone arrays and sound source positions in 3D was explored in [23] and [24].

I-B Motivation

For spatially distributed microphone arrays, it is necessary to consider both the initial time offsets and the sampling clock differences between the arrays [25], especially in the case of asynchronized scenarios based on the USB protocol and wireless acoustic sensor networks. In the above situations, each microphone array captures acoustic signals through its own microprocessor-controlled analog-to-digital converter and has a unique sampling clock source. Therefore, when launching multiple microphone arrays, differences in initialization result in varying initial time offsets between arrays. Moreover, the microprocessors in these microphone arrays often have limited performance, and the oscillators/crystals used to generate clock signals typically drift around their nominal frequencies. As a result, differences in sampling clocks accumulate over time. Not properly handling the above issue will significantly degrade the performance of sound source localization/tracking algorithms embedded in the arrays [17].

To the best of our knowledge, no existing work has addressed the simultaneous calibration of positions, orientations, time offsets and sampling clock differences of multiple microphone arrays and sound source positions in 3D. In fact, as in the case of a single microphone array, the calibration of multiple microphone arrays can be considered as a SLAM problem [26]-[29], where microphone arrays and the moving sound source serve as landmarks in the environment and the robot, respectively. As illustrated in Fig. 1, the acoustic measurements from the microphone arrays and the motion measurements from the robot are utilized in the optimization process, with landmark-robot constraints and robot relative pose constraints enforced, similar to the approach used in full information estimation and batch SLAM [26]-[31]. Then, two important questions arise.

Firstly, it is critical to assess whether the information contained in the measurements is sufficient to estimate the unknown parameters of microphone arrays and sound source locations. This is the so-called observability problem in the SLAM literature [32]-[33]. Although there exist works on observability analysis of SLAM-based calibration of single microphone arrays, in-depth analysis for the case with multiple microphone arrays is lacking.

Secondly, the selection of initial values is crucial because the considered calibration is a nonlinear least squares (NLS) problem, similar to batch SLAM [26], [34]. Many existing algorithms for solving such NLS problems employ the Gauss-Newton method or its variants. These methods typically require reasonable initial guesses; otherwise, the algorithms may converge toward local minima, or in extreme cases, diverge. For some specific problems, novel algorithms with certifiable convergence properties have been proposed in [35]-[37].

I-C Contributions

Motivated by the above observations, in this paper, we adopt batch SLAM as a general framework for the simultaneous calibration of translations, orientations, time offsets and sampling clock differences of multiple microphone arrays, and sound source positions in 3D. Our contributions are two-fold.

Firstly, we concentrate on the parameter identifiability of the corresponding SLAM problem. As discussed in existing works [32]-[33], SLAM is not observable from a control theoretical perspective. Hence, in the SLAM literature, the observability problem of SLAM has been tackled from an information-theoretic perspective, where all the parameters to be identified are taken to be constant but unknown. From the information-theoretic perspective, Fisher information quantifies the amount of information contained in a set of observations about a set of unknown parameters [33]. Following the above line of argument, when the multiple microphone array calibration problem is formulated as an NLS parameter estimation problem, the full rankness of the associated FIM determines the parameter identifiability or observability of the calibration problem.

Hence, in this paper, by leveraging the FIM approach, we thoroughly investigate the identifiability of the unknown parameters, including translations, orientations, and asynchronous factors between the microphone arrays and the sound source positions. We establish necessary/sufficient conditions under which the FIM and the Jacobian matrix have full column rank, which implies the identifiability of the unknown parameters. Furthermore, we identify several scenarios where the unknown parameters are not uniquely identifiable.
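For concreteness, the rank criterion referred to above can be written out as follows. This is only a sketch of the standard FIM expression for a nonlinear measurement model with additive Gaussian noise; it borrows the stacked measurement $\mathbf{z}$, the observation model $\mathbf{g}(\mathbf{x})$, and the noise covariance $\mathbf{W}$ from Section II, while the symbol $\mathbf{J}$ for the Jacobian is a label assumed here for illustration.

```latex
\mathbf{z}=\mathbf{g}(\mathbf{x})+\gamma,\quad
\gamma\sim\mathcal{N}(\mathbf{0},\mathbf{W}),\qquad
\mathbf{J}=\frac{\partial\mathbf{g}(\mathbf{x})}{\partial\mathbf{x}},\qquad
\mathrm{FIM}=\mathbf{J}^{\mathrm{T}}\mathbf{W}^{-1}\mathbf{J}.
```

Since $\mathbf{W}>0$, the FIM is nonsingular exactly when $\mathbf{J}$ has full column rank, which is why the conditions are stated for both matrices.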

Secondly, we propose an effective framework to initialize the unknown parameters from the measurements, which is used as the initial guess in batch SLAM. Specifically, the initialization procedure is composed of the following major steps: (i) estimation of the sound source position by triangulation; (ii) estimation of the distance between the sound source and the microphone arrays using 3D geometry; (iii) estimation of microphone array poses using the iterative closest point (ICP) method; (iv) estimation of the asynchronous factors using linear least squares (LLS). As will be explained later in the paper (see Section IV-A), the microphone array pose estimation problem addressed in step (iii) is conceptually a point-to-point registration problem, and hence can be tackled effectively using ICP [38]. To validate the effectiveness and robustness of the proposed initialization framework, we have conducted extensive numerical simulations and real experiments. Overall, the proposed pipeline achieves higher accuracy with fast convergence, in comparison to methods that use the noise-corrupted ground truth of the unknown parameters as the initial guess in the optimization, and other state-of-the-art methods in the literature [20], [23].
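To illustrate why step (iv) reduces to a linear problem, the following minimal Python sketch fits an initial time offset and a clock difference by ordinary least squares. It is not the authors' implementation. It assumes the inter-array TDOA model of Section II, in which the measured TDOA equals a geometric term plus the offset plus the elapsed time multiplied by the clock difference, and it assumes the geometric term has already been removed using the distances from step (ii). The function name `fit_offset_and_drift` and its arguments are hypothetical.

```python
from typing import List, Tuple


def fit_offset_and_drift(
    elapsed: List[float], residuals: List[float]
) -> Tuple[float, float]:
    """Fit residuals[k] ~= tau + elapsed[k] * delta by ordinary least squares.

    elapsed[k]   : time from the start of recording to the k-th sound event (s)
    residuals[k] : measured TDOA minus the geometric term (d_i - d_1) / c (s)
    Returns (tau, delta): initial time offset (s) and clock difference per second.
    """
    n = len(elapsed)
    if n < 2 or n != len(residuals):
        raise ValueError("need at least two paired samples")
    mean_t = sum(elapsed) / n
    mean_r = sum(residuals) / n
    # Spread of the event times; zero spread makes tau and delta inseparable.
    s_tt = sum((t - mean_t) ** 2 for t in elapsed)
    if s_tt == 0.0:
        raise ValueError("elapsed times must not all be equal")
    s_tr = sum((t - mean_t) * (r - mean_r) for t, r in zip(elapsed, residuals))
    delta = s_tr / s_tt
    tau = mean_r - delta * mean_t
    return tau, delta


# Noise-free synthetic check with illustrative values (not from the paper).
true_tau, true_delta = 0.05, 1e-4
times = [0.0, 1.5, 2.0, 4.5]  # arbitrary, non-uniform event times
res = [true_tau + t * true_delta for t in times]
tau_hat, delta_hat = fit_offset_and_drift(times, res)
```

The event times in the usage example are deliberately non-uniform, in line with the arbitrary time intervals considered in this paper.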

Compared to existing frameworks, the proposed calibration method requires less prior information. More specifically, the knowledge of the source’s position required in [22]-[23], or the distance between the signal source and the microphones needed in [14] is not required in the proposed framework in this paper. It should also be noted that our previous works documented in [16]-[18] primarily focused on calibrating individual microphones within a single array while in this paper we address the more challenging problem of calibrating multiple microphone arrays.

Finally, we remark that the observability analysis reported in Section III has been previously reported in our conference paper [39]. However, the results of [39] are only applicable to the case where the time interval between consecutive sound source events is fixed. In the current paper, we generalize the results in [39] from the scenario of fixed-interval sound source emissions to arbitrary time intervals (i.e., the interval between every two consecutive sound events can be asynchronous and time-varying). More importantly, we have proposed an effective framework for estimating the initial values of the parameters and conducted extensive simulation studies and real experiments to validate the entire calibration pipeline. All the code and the multimodal dataset used in this paper are publicly available at https://github.com/AISLAB-sustech/Calibration_of_Multi_Mic_Arrays.

Notation: Denote $x$, $\mathbf{x}$, and $\mathbf{X}$ as scalars, vectors, and matrices, respectively. $\mathbf{X}^{\mathrm{T}}$ represents the transpose of matrix $\mathbf{X}$. $\mathbf{I}_{n}$ stands for the identity matrix of $n$ dimensions. $\mathbb{R}^{n}$ denotes the $n$-dimensional Euclidean space. $[a_{1};\cdots;a_{n}]$ denotes $[a_{1}^{\mathrm{T}},\cdots,a_{n}^{\mathrm{T}}]^{\mathrm{T}}$, where $a_{1},\cdots,a_{n}$ are scalars/vectors/matrices with proper dimensions. $diag_{n}(\mathbf{A})$ denotes a block diagonal matrix with $\mathbf{A}$ repeated $n$ times as its block diagonal entries; $diag(\mathbf{A},\mathbf{B})$ denotes a block diagonal matrix with $\mathbf{A}$ and $\mathbf{B}$ as its block diagonal entries; and $\mathbf{0}$ denotes a matrix of appropriate dimensions with all entries equal to 0. $\mathbf{X}>0$ means that $\mathbf{X}$ is a positive definite matrix.
We denote $\left\|\mathbf{x}\right\|_{\mathbf{P}}^{2}=\mathbf{x}^{\mathrm{T}}\mathbf{P}\mathbf{x}$. Vectors/matrices, with dimensions not explicitly stated, are assumed to be algebraically compatible.

Figure 1: Geometry of the problem setup and batch SLAM-based framework for multiple microphone arrays calibration and sound source localization.

II PROBLEM FORMULATION

In a calibration scene containing $N$ microphone arrays, as shown in Fig. 1 (with $N=3$ as an example), the arrays capture $K$ consecutive acoustic signals emitted by a single sound source at several spatial positions. $\mathbf{x}_{arr\_i}^{p}$ represents the position of the $i$-th microphone array in the global reference frame and any two arrays are in different positions. We assume that there is a local reference frame $\{\mathbf{x}_{arr\_i}\}$ attached to every microphone array; we choose $\{\mathbf{x}_{arr\_1}\}$ as the global reference frame; $\mathbf{R}_{i}$ is the rotation matrix of reference frame $\{\mathbf{x}_{arr\_1}\}$ to the frame $\{\mathbf{x}_{arr\_i}\}$ with the ZYX Euler angles vector $\mathbf{x}_{arr\_i}^{\theta}$; $\mathbf{s}^{k}$ is the sound source position at time instance $t^{k}$, $k=1,\ldots,K$, with respect to (w.r.t.) $\{\mathbf{x}_{arr\_1}\}$, where $K$ is the total number of time steps. In the calibration process, the arrays remain static while the sound source moves around.

Here we consider the most general scenario with initial time offset and sampling clock difference among microphone arrays (we assume that the configuration of each microphone array itself, including its geometry, is known). When the sound source sends the $k$-th acoustic signal, the DOA information, i.e., the direction vector of the sound source in the $i$-th microphone array frame $\{\mathbf{x}_{arr\_i}\}$, is obtained as:

$$\mathbf{d}_{i}^{k}=\mathbf{R}_{i}^{\mathrm{T}}\frac{\mathbf{s}^{k}-\mathbf{x}_{arr\_i}^{p}}{d_{i}^{k}}. \tag{1}$$

Note that the Euclidean norm of $\mathbf{d}_{i}^{k}$ is 1, i.e., $\mathbf{d}_{i}^{k}$ is a unit vector. Denote $d_{i}^{k}$, for $i=1,2,\ldots,N$, as the distance between the $i$-th microphone array and the sound source at the $k$-th time instant. The inter-array TDOA information between the $i$-th and the first microphone arrays can be expressed as follows:

$$T_{i}^{k}=\frac{d_{i}^{k}}{c}-\frac{d_{1}^{k}}{c}+x_{arr\_i}^{\tau}+\Delta_{k}x_{arr\_i}^{\delta} \tag{2}$$

for $i=1,2,\ldots,N$, where $c$ represents the sound speed in the air; the scalar (unknown) constant variables $x_{arr\_i}^{\tau}$ and $x_{arr\_i}^{\delta}$ represent the initial time offset and the sampling clock difference per second of each microphone array, respectively; $\Delta_{k}$ is the time interval from the beginning to the $k$-th sound signal. Since the first microphone array is used as the reference, we have

$$\mathbf{x}_{arr\_1}^{p}=\mathbf{0},\ \mathbf{x}_{arr\_1}^{\theta}=\mathbf{0},\ x_{arr\_1}^{\tau}=0,\ x_{arr\_1}^{\delta}=0. \tag{3}$$
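As a minimal sketch of the inter-array TDOA model in (2), the following Python function evaluates the noise-free TDOA of one array relative to the reference array placed at the origin, as in (3). The function names, the argument names, and the nominal sound speed of 343 m/s are assumptions made for illustration; the paper only refers to the sound speed as $c$.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s; assumed nominal value, the paper only names it c


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def inter_array_tdoa(
    source: Sequence[float],
    array_pos: Sequence[float],
    tau: float,
    delta: float,
    elapsed: float,
    ref_pos: Sequence[float] = (0.0, 0.0, 0.0),
    c: float = SPEED_OF_SOUND,
) -> float:
    """Noise-free inter-array TDOA following the structure of (2).

    source    : sound source position s^k in the global frame (m)
    array_pos : position of the i-th array in the global frame (m)
    tau       : initial time offset of the i-th array (s)
    delta     : sampling clock difference per second of the i-th array
    elapsed   : time from the beginning to the k-th sound signal (s)
    ref_pos   : position of the reference (first) array, the origin by (3)
    """
    d_i = distance(source, array_pos)
    d_1 = distance(source, ref_pos)
    return d_i / c - d_1 / c + tau + elapsed * delta


# Illustrative values (not from the paper): the source is 5 m from array i
# and 3 m from the reference array.
t_ik = inter_array_tdoa((3.0, 0.0, 0.0), (0.0, 4.0, 0.0),
                        tau=0.01, delta=1e-5, elapsed=10.0)
```

Evaluating the function for the reference array itself (zero position, offset, and clock difference) returns zero, consistent with (3).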

The position and orientation of the $i$-th microphone array (where $i=2,\ldots,N$), i.e., $\mathbf{x}_{arr\_i}^{p}$ and $\mathbf{x}_{arr\_i}^{\theta}$, are:

$$\mathbf{x}_{arr\_i}^{p}=\left[x_{arr\_i}^{x};x_{arr\_i}^{y};x_{arr\_i}^{z}\right],\ \mathbf{x}_{arr\_i}^{\theta}=\left[\theta_{arr\_i}^{x};\theta_{arr\_i}^{y};\theta_{arr\_i}^{z}\right], \tag{4}$$

respectively, where $\theta_{arr\_i}^{x}$, $\theta_{arr\_i}^{y}$, and $\theta_{arr\_i}^{z}$ take values in the ranges $[-\pi,\pi]$, $[-\frac{\pi}{2},\frac{\pi}{2}]$, and $[-\pi,\pi]$, respectively. Denote the unknown parameters w.r.t. the $i$-th microphone array as:

$$\mathbf{x}_{arr\_i}=\left[\mathbf{x}_{arr\_i}^{p};\mathbf{x}_{arr\_i}^{\theta};x_{arr\_i}^{\tau};x_{arr\_i}^{\delta}\right]. \tag{5}$$
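With the pose parameters in (4) and (5) in place, the DOA model in (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. In particular, it assumes that the ZYX Euler angles compose as a rotation about z, then y, then x, i.e., $\mathbf{R}_{i}=\mathbf{R}_{z}(\theta^{z})\mathbf{R}_{y}(\theta^{y})\mathbf{R}_{x}(\theta^{x})$, which the text does not spell out. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rotation_zyx(theta_x: float, theta_y: float, theta_z: float) -> Matrix:
    """R = Rz(theta_z) * Ry(theta_y) * Rx(theta_x) (assumed ZYX convention)."""
    cx, sx = math.cos(theta_x), math.sin(theta_x)
    cy, sy = math.cos(theta_y), math.sin(theta_y)
    cz, sz = math.cos(theta_z), math.sin(theta_z)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def doa(
    source: Sequence[float],
    array_pos: Sequence[float],
    euler_xyz: Sequence[float],
) -> List[float]:
    """Unit direction of the source in the array frame, following (1).

    source    : s^k in the global frame
    array_pos : array position in the global frame
    euler_xyz : [theta_x, theta_y, theta_z], ordered as in (4)
    """
    rot = rotation_zyx(euler_xyz[0], euler_xyz[1], euler_xyz[2])
    diff = [s - p for s, p in zip(source, array_pos)]
    dist = math.sqrt(sum(v * v for v in diff))
    if dist == 0.0:
        raise ValueError("source coincides with the array position")
    # Apply R^T: component j is column j of R dotted with the offset vector.
    return [sum(rot[r][j] * diff[r] for r in range(3)) / dist for j in range(3)]


# An array yawed by 90 degrees sees a source on the global +y axis
# along its own local +x axis (up to floating-point rounding).
d_example = doa((0.0, 3.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2))
```

Because a rotation preserves length, the returned vector has unit norm for any pose, matching the remark that $\mathbf{d}_{i}^{k}$ is a unit vector.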

All the unknown parameters w.r.t. microphone arrays are:

$$\mathbf{x}_{arr}=\left[\mathbf{x}_{arr\_2};\ldots;\mathbf{x}_{arr\_N}\right]. \tag{6}$$

Denote the sound source position at time $t^{k}$, $k=1,\ldots,K$, as:

$$\mathbf{s}^{k}=\left[s_{x}^{k};s_{y}^{k};s_{z}^{k}\right]. \tag{7}$$

Thus, all unknown parameters to be identified are:

$$\mathbf{x}=\left[\mathbf{x}_{arr};\mathbf{s}^{1};\ldots;\mathbf{s}^{K}\right]. \tag{8}$$

We denote the ideal inter-array TDOA and DOA measurements at the $k$-th time instance as:

$$\mathbf{m}^{k}=\left[\mathbf{d}_{1}^{k};T_{2}^{k};\mathbf{d}_{2}^{k};T_{3}^{k};\mathbf{d}_{3}^{k};\ldots;T_{N}^{k};\mathbf{d}_{N}^{k}\right]\in\mathbb{R}^{4N-1}. \tag{9}$$
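The ordering in (9) can be made concrete with a short Python sketch that interleaves the per-array DOA vectors and inter-array TDOA values into a single list of length $4N-1$. The function name `stack_measurement` and its input layout are hypothetical choices for illustration.

```python
from typing import List, Sequence


def stack_measurement(
    doas: Sequence[Sequence[float]], tdoas: Sequence[float]
) -> List[float]:
    """Stack one time step of ideal measurements in the order used in (9).

    doas[i]  : 3-vector DOA of array i+1, for arrays 1..N
    tdoas[j] : inter-array TDOA of array j+2 w.r.t. array 1, for arrays 2..N
    Returns [d_1; T_2; d_2; ...; T_N; d_N] as a flat list of length 4N - 1.
    """
    n_arrays = len(doas)
    if n_arrays < 1 or len(tdoas) != n_arrays - 1:
        raise ValueError("expected N DOA vectors and N - 1 TDOA values")
    if any(len(d) != 3 for d in doas):
        raise ValueError("each DOA must be a 3-vector")
    stacked = list(doas[0])  # the reference array contributes a DOA only
    for t, d in zip(tdoas, doas[1:]):
        stacked.append(t)
        stacked.extend(d)
    return stacked


# Illustrative values for N = 3 arrays (not from the paper).
m_k = stack_measurement(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    [0.002, -0.001],
)
```

The reference array contributes no TDOA entry, which is why the vector has $3N+(N-1)=4N-1$ entries.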

The measurements of DOA and inter-array TDOA at time $k$ are subject to Gaussian noise as follows:

$$\mathbf{y}^{k}=\mathbf{m}^{k}+\mathbf{v}^{k} \tag{10}$$

where $\mathbf{m}^{k}$ is defined in (9), $\mathbf{v}^{k}\sim\mathcal{N}(\mathbf{0},\mathbf{P})$, with $\mathbf{P}=diag(\Lambda,diag_{N-1}(\lambda,\Lambda))$, where $\lambda>0$ is a positive scalar, $\Lambda>\mathbf{0}$, and $\Lambda\in\mathbb{R}^{3\times 3}$. Assume that the sound source relative position between two consecutive time steps can be measured with Gaussian noise, i.e.,

$$\mathbf{s}_{\Delta}^{k}=\mathbf{s}^{k+1}-\mathbf{s}^{k}+\mathbf{w}^{k} \tag{11}$$

where $k=1,\ldots,K-1$, $\mathbf{w}^{k}\sim\mathcal{N}(\mathbf{0},\mathbf{Q})$, with $\mathbf{Q}\in\mathbb{R}^{3\times 3}$ and $\mathbf{Q}>\mathbf{0}$. We combine the relative position measurements, the TDOA, and DOA measurements as:

$$\mathbf{z}=\left[\mathbf{y}^{1};\mathbf{s}_{\Delta}^{1};\mathbf{y}^{2};\mathbf{s}_{\Delta}^{2};\ldots;\mathbf{y}^{K-1};\mathbf{s}_{\Delta}^{K-1};\mathbf{y}^{K}\right]. \tag{12}$$

The models in (10)-(11) can be rewritten compactly as:

𝐳=𝐠(𝐱)+γ𝐳𝐠𝐱𝛾\mathbf{z}=\mathbf{g}(\mathbf{x})+{\gamma}bold_z = bold_g ( bold_x ) + italic_γ (13)

where 𝐠(𝐱)𝐠𝐱\mathbf{g}(\mathbf{x})bold_g ( bold_x ) is the combined observation model, and γ𝒩(𝟎,𝐖)similar-to𝛾𝒩0𝐖{\gamma}\sim\mathcal{N}(\mathbf{0},\mathbf{W})italic_γ ∼ caligraphic_N ( bold_0 , bold_W ) is the noise of combined observations with

𝐖=diag(diagK1(𝐏,𝐐),𝐏).𝐖𝑑𝑖𝑎𝑔𝑑𝑖𝑎subscript𝑔𝐾1𝐏𝐐𝐏\mathbf{W}=diag(diag_{K-1}(\mathbf{P,Q),P}).bold_W = italic_d italic_i italic_a italic_g ( italic_d italic_i italic_a italic_g start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ( bold_P , bold_Q ) , bold_P ) . (14)
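To make the block structure of (14) concrete, the following minimal sketch assembles $\mathbf{P}$ and $\mathbf{W}$ numerically. The helper names (`_block_diag`, `build_P`, `build_W`) and the numerical values chosen for $\lambda$, $\Lambda$, and $\mathbf{Q}$ are illustrative assumptions and are not taken from the formulation above.

```python
import numpy as np

def _block_diag(*blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

def build_P(lam, Lam, N):
    """P = diag(Lam, diag_{N-1}(lam, Lam)): noise covariance of one y^k."""
    blocks = [Lam]
    for _ in range(N - 1):
        blocks += [np.array([[lam]]), Lam]
    return _block_diag(*blocks)

def build_W(P, Q, K):
    """W = diag(diag_{K-1}(P, Q), P): noise covariance of the stacked vector z."""
    blocks = []
    for _ in range(K - 1):
        blocks += [P, Q]
    blocks.append(P)
    return _block_diag(*blocks)

# Hypothetical values, for illustration only.
N, K = 3, 5
lam = 1e-4                 # scalar variance paired with each TDOA entry
Lam = 0.01 * np.eye(3)     # 3x3 covariance paired with each DOA entry
Q = 0.05 * np.eye(3)       # covariance of the relative position measurement

P = build_P(lam, Lam, N)
W = build_W(P, Q, K)
print(P.shape, W.shape)
```

The size of `P` matches the dimension $4N-1$ of $\mathbf{m}^{k}$ in (9), and `W` follows the interleaved ordering of $\mathbf{z}$ in (12).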

As shown in Fig. 1, the batch SLAM framework is a feasible solution to the above problem by treating the moving sound source as a robot and the multiple microphone arrays as landmarks [26]. As in [16]-[17], the problem of joint calibration of multiple asynchronous microphone arrays and sound source localization can be treated as the following NLS using batch SLAM:

$$\min_{\mathbf{x}}\left\|\mathbf{g}(\mathbf{x})-\mathbf{z}\right\|_{\mathbf{W}^{-1}}^{2} \tag{15}$$

The measurements obtained by microphone arrays and robots constitute the spatial constraints and can be included in (15) to improve estimation accuracy.

Given the problem formulation described above, our main objectives are (1) to determine the identifiability of the unknown parameters (microphone array positions, orientations, time offsets, sampling clock differences, and sound source positions) based on the available measurements (DOAs, inter-array TDOAs, and relative position measurements), and (2) to develop an efficient algorithmic pipeline for solving the corresponding NLS in (15).

III OBSERVABILITY ANALYSIS

In this section, by utilizing the FIM method, the observability analysis of the batch SLAM framework for the above calibration problem is performed. More specifically, we have established necessary/sufficient conditions under which the FIM and Jacobian matrix have full column rank (which implies the identifiability of the unknown parameters, including the microphone array positions, orientations, time offsets, sampling clock differences, and sound source positions). In addition, we also discover some scenarios where the FIM and Jacobian matrix cannot have full column rank (in this case, the unknown parameters could not be uniquely identified).

III-A The Fisher Information Matrix and the Jacobian

The covariance matrix $\mathbf{C}_{\hat{x}}$ of the estimation error corresponding to the estimated values $\hat{\mathbf{x}}$ and the true values $\check{\mathbf{x}}$ of the unknown parameters in the observation model (13) is given by

$$\mathbf{C}_{\hat{x}}=E\left[(\hat{\mathbf{x}}-\check{\mathbf{x}})(\hat{\mathbf{x}}-\check{\mathbf{x}})^{\mathrm{T}}\right]. \tag{16}$$

For nonrandom vector parameter estimation, the FIM of an unbiased estimator is defined as:

$$\mathbf{I}_{FIM}=E\left[(\nabla_{x}\ln p(\mathbf{z}|\mathbf{x}))(\nabla_{x}\ln p(\mathbf{z}|\mathbf{x}))^{\mathrm{T}}\right], \tag{17}$$

where $\nabla_{x}$ is the gradient operator w.r.t. the vector $\mathbf{x}$, $p(\mathbf{z}|\mathbf{x})$ is the probability distribution function, and the derivatives are calculated at the true value $\check{\mathbf{x}}$ [40, chap. 2]. It can be shown that the covariance matrix of any unbiased estimator $\hat{\mathbf{x}}$ satisfies

$$\mathbf{C}_{\hat{x}}-\mathbf{I}_{FIM}^{-1}\geq\mathbf{0}, \tag{18}$$

i.e., the inverse of the FIM is the Cramér-Rao lower bound on the estimation error covariance. When $\mathbf{I}_{FIM}$ is singular, the Cramér-Rao lower bound does not exist [40, p. 165], and one or more parameters are unobservable. As in [32], the Fisher information matrix for the models described in (17) can be formulated as:

$$\mathbf{I}_{FIM}=\mathbf{J}^{\mathrm{T}}\mathbf{W}^{-1}\mathbf{J}, \tag{19}$$

where $\mathbf{J}$ is the Jacobian of the observation model $\mathbf{g}(\cdot)$ in (13), and its explicit expression is given in (22). When $\mathbf{W}>\mathbf{0}$, one has that

$$\mathrm{rank}(\mathbf{I}_{FIM})=\mathrm{rank}(\mathbf{J}). \tag{20}$$
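The rank identity (20) can be checked numerically. The sketch below uses a random matrix with one deliberately dependent column as a stand-in for $\mathbf{J}$ (it is not the Jacobian of the calibration problem), together with an arbitrary positive definite $\mathbf{W}$; the sizes, the seed, and the rank tolerance are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "Jacobian" with a deliberately dependent column.
J = rng.standard_normal((12, 5))
J[:, 4] = J[:, 0] + 2.0 * J[:, 1]    # fifth column is a combination of the first two

# An arbitrary symmetric positive definite weighting matrix W.
A = rng.standard_normal((12, 12))
W = A @ A.T + 12.0 * np.eye(12)

# Fisher information matrix J^T W^{-1} J as in (19); solve() avoids forming W^{-1}.
fim = J.T @ np.linalg.solve(W, J)

# An explicit tolerance keeps round-off from being counted as rank.
rank_J = np.linalg.matrix_rank(J, tol=1e-9)
rank_fim = np.linalg.matrix_rank(fim, tol=1e-9)
print(rank_J, rank_fim)
```

Both ranks coincide and are one less than the number of columns, which is the situation in which the FIM is singular and at least one parameter combination cannot be identified.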

Since the first microphone array is viewed as the reference array, its corresponding parameters are all set to zero. The remaining state vector contains only the parameters $\mathbf{x}_{arr}$ of the other $(N-1)$ microphone arrays and the sound source positions $\mathbf{s}^{k}$ at all $K$ time steps. From the definition of the Jacobian matrix [41, p. 569], we know that $\mathbf{J}\in\mathbb{R}^{g_{1}\times g_{2}}$, where

$$g_{1}=4(N-1)K+3(K-1),\quad g_{2}=8(N-1)+3K.$$

From (17)-(20), a necessary and sufficient condition for $\mathbf{I}_{FIM}$ to be nonsingular is that $\mathbf{J}$ has full column rank. For $\mathbf{J}$ to be of full column rank, it is necessary that

$$\begin{array}{l}4(N-1)K+3(K-1)\geq 8(N-1)+3K\\ \implies K\geqslant\left\lceil 2+\dfrac{3}{4(N-1)}\right\rceil,\end{array} \tag{21}$$

where $\left\lceil\cdot\right\rceil$ denotes the ceiling operation, i.e., the least integer not less than its argument. We then have the following results.
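As a quick check of the counting condition (21), the following sketch evaluates the bound for a few values of $N$ and compares the row and column counts $g_{1}$ and $g_{2}$ stated above. The function name `min_time_steps` is a hypothetical helper introduced here.

```python
def min_time_steps(N):
    """Smallest integer K satisfying (21) for N >= 2 microphone arrays."""
    if N < 2:
        raise ValueError("at least two microphone arrays are required")
    denom = 4 * (N - 1)
    # ceil(2 + 3/denom), computed with integers to avoid floating-point rounding
    return 2 + -(-3 // denom)

def jacobian_size(N, K):
    """Row and column counts g1, g2 of the Jacobian as stated in the text."""
    g1 = 4 * (N - 1) * K + 3 * (K - 1)
    g2 = 8 * (N - 1) + 3 * K
    return g1, g2

for N in (2, 3, 6):
    K = min_time_steps(N)
    print(N, K, jacobian_size(N, K))
```

Because $0<\frac{3}{4(N-1)}<1$ for every $N\geq 2$, the bound in (21) evaluates to $K\geq 3$ regardless of the number of arrays. This is only a counting requirement; the rank conditions that follow are stricter.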

Proposition 1: The Jacobian $\mathbf{J}$ can be written as

$$\mathbf{J}=\left[\begin{array}{cccccc}
\mathbf{L}^{1} & \mathbf{T}^{1} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & -\mathbf{I}_{3} & \mathbf{I}_{3} & \cdots & \mathbf{0} & \mathbf{0}\\
\mathbf{L}^{2} & \mathbf{0} & \mathbf{T}^{2} & \cdots & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & -\mathbf{I}_{3} & \cdots & \mathbf{0} & \mathbf{0}\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
\mathbf{L}^{K-1} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{T}^{K-1} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & -\mathbf{I}_{3} & \mathbf{I}_{3}\\
\mathbf{L}^{K} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{T}^{K}
\end{array}\right] \tag{22}$$

where $\mathbf{L}^{k}=\frac{\partial\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})}{\partial\mathbf{x}_{arr}}$ and $\mathbf{T}^{k}=\frac{\partial\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})}{\partial\mathbf{s}^{k}}$, with $\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})$ being the inter-array TDOA and DOA observation model at the $k$-th time instant, $k=1,\ldots,K$ (the expression of $\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})$ can be found in (10); the detailed expressions of $\mathbf{L}^{k}$ and $\mathbf{T}^{k}$ can be found in (47) and (51) in Appendix A, respectively).

Proof. See Appendix A.   
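The block pattern of (22) can be reproduced programmatically. The sketch below assembles a matrix with that pattern from per-step blocks; the blocks are filled with random placeholders because the actual expressions of $\mathbf{L}^{k}$ and $\mathbf{T}^{k}$ are given only in Appendix A. The function name `assemble_jacobian` is hypothetical, and the number of rows per time step is taken as $4(N-1)$ to match the count $g_{1}$ stated in the text, which is an assumption of this example.

```python
import numpy as np

def assemble_jacobian(L_blocks, T_blocks):
    """Stack per-step blocks L^k and T^k following the pattern of (22)."""
    K = len(L_blocks)
    m, n_arr = L_blocks[0].shape       # rows per time step, number of array parameters
    n_cols = n_arr + 3 * K
    rows = []
    for k in range(K):
        # Row block of y^k: [L^k | 0 ... T^k ... 0]
        r = np.zeros((m, n_cols))
        r[:, :n_arr] = L_blocks[k]
        r[:, n_arr + 3 * k:n_arr + 3 * (k + 1)] = T_blocks[k]
        rows.append(r)
        if k < K - 1:
            # Row block of s_Delta^k: [0 | 0 ... -I3  I3 ... 0]
            o = np.zeros((3, n_cols))
            o[:, n_arr + 3 * k:n_arr + 3 * (k + 1)] = -np.eye(3)
            o[:, n_arr + 3 * (k + 1):n_arr + 3 * (k + 2)] = np.eye(3)
            rows.append(o)
    return np.vstack(rows)

# Random placeholder blocks, for checking the layout only.
rng = np.random.default_rng(1)
N, K = 3, 4
m, n_arr = 4 * (N - 1), 8 * (N - 1)
L_blocks = [rng.standard_normal((m, n_arr)) for _ in range(K)]
T_blocks = [rng.standard_normal((m, 3)) for _ in range(K)]

J = assemble_jacobian(L_blocks, T_blocks)
print(J.shape)
```

The resulting shape equals $(g_{1}, g_{2})$ for the chosen $N$ and $K$, and the rows between consecutive measurement blocks carry the $[-\mathbf{I}_{3}\;\;\mathbf{I}_{3}]$ pattern of the relative position model (11).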

Given the equivalence between full rank of the FIM and full column rank of the Jacobian, in the following, we focus on investigating conditions under which the Jacobian derived in (22) can or cannot be of full column rank.

III-B Main Results of Observability

We first have the following result regarding the equivalence of full column rank between the Jacobian in (22) and the matrix $\mathbf{F}$ in (23), which has a much simpler structure.

Theorem 1

The Jacobian matrix $\mathbf{J}$ is of full column rank if and only if the following matrix

$$\mathbf{F}=\underset{\mathbf{L}}{\underbrace{\left[\begin{array}{c}\mathbf{L}^{1}\\ \mathbf{L}^{2}\\ \vdots\\ \mathbf{L}^{K}\end{array}\right.}}\underset{\mathbf{T}}{\underbrace{\left.\begin{array}{c}\mathbf{T}^{1}\\ \mathbf{T}^{2}\\ \vdots\\ \mathbf{T}^{K}\end{array}\right]}} \tag{23}$$

is of full column rank.

Proof. The proof is similar to that of [18, Theorem 1] and is skipped here.   
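Although the proof is skipped, the equivalence can be made plausible directly from the block structure of (22). The following short argument is a sketch added for intuition, not the proof of [18]; the vectors $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{b}_{k}$ are notation introduced here. Write a candidate null vector of $\mathbf{J}$ as $[\mathbf{a};\mathbf{b}_{1};\ldots;\mathbf{b}_{K}]$, with $\mathbf{a}$ multiplying the array parameter columns and $\mathbf{b}_{k}\in\mathbb{R}^{3}$ multiplying the columns of $\mathbf{s}^{k}$. Then

$$\mathbf{J}\left[\begin{array}{c}\mathbf{a}\\ \mathbf{b}_{1}\\ \vdots\\ \mathbf{b}_{K}\end{array}\right]=\mathbf{0}
\iff
\left\{\begin{array}{ll}\mathbf{L}^{k}\mathbf{a}+\mathbf{T}^{k}\mathbf{b}_{k}=\mathbf{0}, & k=1,\ldots,K,\\ -\mathbf{b}_{k}+\mathbf{b}_{k+1}=\mathbf{0}, & k=1,\ldots,K-1.\end{array}\right.$$

The second set of equations forces $\mathbf{b}_{1}=\cdots=\mathbf{b}_{K}=\mathbf{b}$, so the first set reduces to $\mathbf{L}^{k}\mathbf{a}+\mathbf{T}^{k}\mathbf{b}=\mathbf{0}$ for all $k$, i.e., $\mathbf{F}[\mathbf{a};\mathbf{b}]=\mathbf{0}$. Conversely, any null vector $[\mathbf{a};\mathbf{b}]$ of $\mathbf{F}$ yields the null vector $[\mathbf{a};\mathbf{b};\ldots;\mathbf{b}]$ of $\mathbf{J}$. Hence $\mathbf{J}$ has a nontrivial null space exactly when $\mathbf{F}$ does.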

We next present a necessary condition (Theorem 2) and a sufficient condition (Theorem 3) under which the matrix $\mathbf{F}$ in (23) is of full column rank.

Theorem 2

The Jacobian matrix $\mathbf{J}$ is of full column rank only if the matrices $\mathbf{\bar{T}}$ and $\mathbf{\bar{L}}_{i}$, for $i=2,\ldots,N$, are each of full column rank (as shown in the full proof in Appendix A, the submatrices $\mathbf{\bar{T}}$ and $\mathbf{\bar{L}}_{i}$ are obtained from the matrices after applying elementary transformations to $\mathbf{T}$ and $\mathbf{L}$, both defined in (23), respectively), where

$$\mathbf{\bar{T}}=\left[\mathbf{0};\Psi;\mathbf{0}\right],\quad \mathbf{\bar{L}}_{i}=\left[\begin{array}{cc}\mathbf{I}_{2}&\mathbf{0}\\ \mathbf{0}&\Phi_{i}\end{array}\right], \tag{24}$$

with $\Psi$ and $\Phi_{i}$ being defined in (55) and (56), respectively.

Proof. See Appendix A.   

Theorem 3

The Jacobian matrix $\mathbf{J}$ is of full column rank if the following statements hold concurrently:

(i) Any matrix resulting from the horizontal concatenation of $\mathbf{\bar{L}}_{j}$ and $\mathbf{\bar{T}}$ is of full column rank, $2\leq j\leq N$.

(ii) All matrices $\mathbf{\bar{L}}_{i}$, $i=2,\ldots,N$ and $i\neq j$, are of full column rank.

Proof. See Appendix A.   

III-C Special Cases When Observability is Impossible

It can be seen from Proposition 1 and Theorems 1-3 that observability of the considered identification problem is determined both by the configuration of the microphone arrays (i.e., the relative transforms, namely, orientation and translation) and by the sound source positions. This raises the question of which microphone array configurations and sound source trajectories cause the necessary conditions in Theorem 2 to fail. In this section, we focus on this question and identify some special cases where observability is impossible. Our main results are stated in Theorems 4 and 5.

Theorem 4

The matrix $\mathbf{\bar{T}}$ is not of full column rank if one or more of the following conditions hold.

(i) For all microphone arrays, measurements from fewer than five time steps are available (i.e., the value of $K$ in (23) is less than 5).

(ii) The sound source positions at all moments are collinear with the origin of the global frame $\left\{\mathbf{x}_{arr\_1}\right\}$, i.e., $\mathbf{s}^{k}=\lambda_{k-1}\mathbf{s}^{k-1}$ always holds, where $k=2,\ldots,K$, and $\lambda_{k-1}$ is an arbitrary non-zero scalar ($\lambda_{k-1}$ might take different values at different time steps).

(iii) The sound source lies, at all moments, on any one of the Euclidean planes $x+\alpha y=0$, $x+\beta z=0$, and $y+\gamma z=0$ within the three-dimensional $x$-$y$-$z$ Cartesian coordinate frame $\left\{\mathbf{x}_{arr\_1}\right\}$, where $\alpha,\beta,\gamma$ are arbitrary scalars.

Proof. See Appendix A.   
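The three conditions of Theorem 4 depend only on the source trajectory expressed in the reference frame, so a planned trajectory can be screened before data collection. The sketch below is one possible check under stated assumptions: the function name `theorem4_degenerate`, the array layout (one position per row), and the numerical tolerance are all choices made here, and exact equality in the theorem is replaced by a tolerance test.

```python
import numpy as np

def theorem4_degenerate(S, tol=1e-9):
    """Return the labels of the Theorem 4 conditions met by a trajectory.

    S is a K x 3 array whose rows are the source positions s^1, ..., s^K
    in the frame of the reference (first) microphone array.
    """
    S = np.asarray(S, dtype=float)
    hits = []
    # (i) fewer than five time steps
    if S.shape[0] < 5:
        hits.append("i")
    # (ii) all positions collinear with the origin: stacked positions have rank <= 1
    if np.linalg.matrix_rank(S, tol=tol) <= 1:
        hits.append("ii")
    # (iii) u + c*v = 0 at every step, for (u, v) in {(x, y), (x, z), (y, z)}
    for u_idx, v_idx in ((0, 1), (0, 2), (1, 2)):
        u, v = S[:, u_idx], S[:, v_idx]
        vv = float(v @ v)
        c = -float(u @ v) / vv if vv > tol else 0.0   # least-squares coefficient
        if np.allclose(u + c * v, 0.0, atol=tol):
            hits.append("iii")
            break
    return hits

# A trajectory on the plane x - 2y = 0 (alpha = -2), with six time steps.
S_plane = np.array([[2.0, 1.0, 0.3], [4.0, 2.0, 1.0], [-2.0, -1.0, 0.5],
                    [1.0, 0.5, 2.0], [3.0, 1.5, -1.0], [0.4, 0.2, 0.7]])
print(theorem4_degenerate(S_plane))
```

An empty list means none of the listed cases applies; it does not by itself guarantee observability, since Theorem 4 gives sufficient conditions for rank deficiency only.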

Theorem 5

The matrices $\mathbf{\bar{L}}_{i}$, $i=2,3,\cdots,N$, are not of full column rank if one or more of the following conditions hold:

(i) The sound source positions at all moments are collinear with the origin of the frame $\left\{\mathbf{x}_{arr\_i}\right\}$, i.e., $(\mathbf{s}^{k}-\mathbf{x}_{arr\_i}^{p})=\epsilon_{k-1}(\mathbf{s}^{k-1}-\mathbf{x}_{arr\_i}^{p})$ always holds, where $k=2,\ldots,K$ and $\epsilon_{k-1}$ is an arbitrary non-zero scalar ($\epsilon_{k-1}$ might take different values at different time steps).

(ii) For the $i$-th microphone array, one of the Euler angles satisfies $\theta_{arr\_i}^{y}=\pm\frac{\pi}{2}$.

Proof. See Appendix A.   
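The conditions of Theorem 5 involve the position and orientation of the $i$-th array, so they can be screened for a candidate array placement in the same spirit. The following sketch is an illustration under assumptions: `theorem5_degenerate`, `p_i` (standing for $\mathbf{x}_{arr\_i}^{p}$), and `theta_y` (standing for $\theta_{arr\_i}^{y}$) are names introduced here, and a numerical tolerance replaces exact equality.

```python
import numpy as np

def theorem5_degenerate(S, p_i, theta_y, tol=1e-9):
    """Return the labels of the Theorem 5 conditions met for array i.

    S       : K x 3 source positions in the global frame
    p_i     : position of array i in the global frame
    theta_y : Euler angle of array i about the y axis, in radians
    """
    S = np.asarray(S, dtype=float)
    p_i = np.asarray(p_i, dtype=float)
    hits = []
    # (i) all vectors from array i to the source are parallel (rank <= 1)
    if np.linalg.matrix_rank(S - p_i, tol=tol) <= 1:
        hits.append("i")
    # (ii) theta_y equals +pi/2 or -pi/2
    if abs(abs(theta_y) - np.pi / 2) < tol:
        hits.append("ii")
    return hits

# Source moving along a straight line that passes through array i.
p_i = np.array([1.0, 1.0, 1.0])
S_line = p_i + np.outer(np.arange(1, 6), [0.0, 1.0, 2.0])
print(theorem5_degenerate(S_line, p_i, theta_y=0.3))
```

In practice one may prefer a looser tolerance than the default used here, so that configurations close to the degenerate ones are also flagged; that choice is left to the user and is not prescribed by the theorem.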

III-D Discussions

The observability analysis presented in the above subsections refers to conditions concerning the ground truth values of the sound source trajectories or the configurations of the microphone arrays. Hence, the observability analysis is of theoretical interest, as it can serve as a guideline when designing microphone array configurations or sound source trajectories for the calibration process. One can also rely on the results of Section III-C to avoid the unobservable scenarios from a theoretical point of view.

However, during real calibration processes, the measurements contain noise (i.e., the ground truth is not known a priori). Hence, the observability analysis results obtained above are not directly applicable. It is crucial to develop a reliable algorithmic pipeline that can achieve satisfactory convergence and accuracy. This will be discussed in the next section. One should note that the algorithmic pipeline presented in the sequel can also be applied to the unobservable cases, but the calibration results will then be unreliable, because in these scenarios the noisy measurements do not contain enough information to estimate the unknown parameters. This is also why the analysis in Sections III-A to III-C is valuable, as it suggests avoiding such unobservable situations when designing the microphone array configurations or the sound source trajectories.

Based on the above arguments, to validate the theoretical analysis, we will discuss both observable and unobservable situations in the numerical simulations in Section V. In the experimental results of Section VI, we will only design experiments that correspond to observable cases.

IV BATCH SLAM BASED CALIBRATION

In this section, we present our proposed pipeline for batch SLAM based joint calibration of multiple microphone arrays and sound source localization. As illustrated in Fig. 1, we treat the microphone arrays as landmarks and the sound source as a mobile robot in the corresponding batch SLAM problem and utilize Gauss–Newton iterations to solve the corresponding NLS problem. More specifically, we propose an effective framework to initialize the unknown parameters which are used as the initial guess in the Gauss–Newton iterative algorithm.
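For readers unfamiliar with Gauss–Newton iterations for a weighted NLS of the form (15), the following minimal sketch shows the basic update on a toy problem. The toy observation model (ranges from a 2D point to three known anchors), the function names, and all numerical values are assumptions made for illustration; the sketch is not the calibration model or the implementation used in this paper.

```python
import numpy as np

def gauss_newton(g, jac, z, W, x0, max_iter=50, tol=1e-10):
    """Minimize ||g(x) - z||^2 weighted by W^{-1} with Gauss-Newton iterations."""
    x = np.asarray(x0, dtype=float).copy()
    W_inv = np.linalg.inv(W)
    for _ in range(max_iter):
        r = z - g(x)                      # measurement residual
        J = jac(x)                        # Jacobian of g at the current estimate
        H = J.T @ W_inv @ J               # same form as the FIM in (19)
        dx = np.linalg.solve(H, J.T @ W_inv @ r)
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Toy model: ranges from an unknown 2D point to three known anchors.
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

def g(x):
    return np.linalg.norm(x - anchors, axis=1)

def jac(x):
    diff = x - anchors
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

x_true = np.array([1.0, 2.0])
z = g(x_true)                 # noise-free measurements, for illustration
W = 0.01 * np.eye(3)
x_hat = gauss_newton(g, jac, z, W, x0=[1.5, 1.5])
print(x_hat)
```

The linear solve in each iteration requires $\mathbf{J}^{\mathrm{T}}\mathbf{W}^{-1}\mathbf{J}$ to be nonsingular, which is where the observability conditions of Section III enter, and the result depends on the initial guess `x0`, which motivates the initialization procedure described next.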

IV-A The Proposed Initialization Procedure

For notational simplicity, in the sequel, we use $\mathbf{d}_{i}^{k}$ and $T_{i}^{k}$ to denote the Gaussian noise corrupted DOA and inter-array TDOA measurements, respectively. We use $\hat{\cdot}$ to represent the estimates of the unknown scalar/vector/matrix parameters. Our proposed initialization procedure is composed of the following main steps: (i) estimation of the sound source position by triangulation; (ii) estimation of the distance between the sound source and microphone arrays using 3D geometry; (iii) estimation of microphone array poses using ICP; (iv) estimation of the asynchronous factors using LLS.

Figure 2: Initialization process of unknown parameters for the microphone arrays and sound source. (a) Estimation of the initial position of the sound source by triangulation. (b) Estimation of the distances between the sound source and microphone arrays using 3D geometry. (c) Estimation of microphone arrays initial positions and orientations using ICP. (d) Estimation of inter-array initial time offset and sampling clock difference using LLS.

(i) Estimation of the sound source position by triangulation: Without loss of generality, the initial trajectory of the moving sound source in the global frame $\left\{\mathbf{x}_{arr\_1}\right\}$ is illustrated in Fig. 2(a). Then, from geometry, the initial position of the sound source can be obtained by triangulation using the first two consecutive DOA measurements as follows:

$$\hat{d}_{1}^{1}=\frac{L\sin(\left\langle\mathbf{d}_{1}^{2},\mathbf{s}_{\Delta}^{1}\right\rangle)}{\sin(\left\langle\mathbf{d}_{1}^{1},\mathbf{d}_{1}^{2}\right\rangle)},\quad \hat{\mathbf{s}}^{1}=\mathbf{d}_{1}^{1}\cdot\hat{d}_{1}^{1} \tag{25}$$

where $L=\left\|\mathbf{s}_{\Delta}^{1}\right\|_{2}$, i.e., $L$ is the measured distance that the source moves between the first two consecutive moments, $\left\langle\cdot,\cdot\right\rangle$ is the angle between two vectors, and $\hat{d}_{1}^{1}$ is the estimated distance between the first sound source position and the origin. Note that $\mathbf{s}_{\Delta}^{1}$ and $\mathbf{d}_{1}^{1},\mathbf{d}_{1}^{2}$ can be obtained from the relative position and DOA measurements, respectively. Once the initial position of the sound source is obtained as above, the sound source positions at the subsequent time steps can be estimated recursively:

$$\hat{\mathbf{s}}^{k+1}=\hat{\mathbf{s}}^{k}+\mathbf{s}_{\Delta}^{k}. \tag{26}$$
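Step (i) can be sketched in a few lines. The code below applies the law of sines in (25) to the triangle formed by the origin and the first two source positions, then accumulates the relative position measurements as in (26). The function names and the test trajectory are assumptions made for illustration, and noise-free inputs are used so that the result can be compared with the true positions.

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors, in radians."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def init_source_positions(d1_first, d1_second, s_delta):
    """Initial source trajectory from (25)-(26).

    d1_first, d1_second : unit DOA vectors d_1^1 and d_1^2 at the reference array
    s_delta             : (K-1) x 3 relative position measurements s_Delta^k
    """
    s_delta = np.asarray(s_delta, dtype=float)
    L = np.linalg.norm(s_delta[0])
    # (25): law of sines in the triangle (origin, s^1, s^2)
    d_hat = L * np.sin(angle(d1_second, s_delta[0])) / np.sin(angle(d1_first, d1_second))
    s1 = np.asarray(d1_first, dtype=float) * d_hat
    # (26): accumulate the relative position measurements
    return np.vstack([s1, s1 + np.cumsum(s_delta, axis=0)])

# Hypothetical noise-free trajectory in the reference frame.
s_true = np.array([[1.0, 2.0, 0.5], [2.0, 2.5, 1.0], [2.5, 3.5, 1.2], [3.0, 3.0, 2.0]])
doa = s_true / np.linalg.norm(s_true, axis=1, keepdims=True)
s_delta = np.diff(s_true, axis=0)

s_hat = init_source_positions(doa[0], doa[1], s_delta)
print(np.max(np.abs(s_hat - s_true)))
```

The denominator in (25) vanishes when the first two DOAs are parallel, so this step needs a source motion that changes the bearing seen from the reference array. With noisy measurements, errors in $\hat{\mathbf{s}}^{1}$ and in each $\mathbf{s}_{\Delta}^{k}$ accumulate along the trajectory, which is acceptable for an initial guess that is later refined by the batch optimization.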

(ii) Estimation of the distance between the sound source and microphone arrays using 3D geometry: We calculate the distance between each source node and microphone arrays to provide constraints for estimating microphone array poses. One can construct an over-constrained NLS for estimating the distance $\hat{d}_{i}^{k}$ between each source node and microphone arrays by using the law of cosines constraints. To illustrate, as shown in Fig. 2(b), each microphone array and any four source positions $A$, $B$, $C$, and $D$ at the corresponding time instances form a polyhedron (when the four nodes are coplanar, it is tetrahedral, and when the four nodes are on different planes, it forms a five-vertex hexahedral structure). We construct an NLS problem by enforcing the law of cosines for each face of the polyhedron (including the two inner faces). For the scenario shown in Fig. 2(b), denote the estimated squared distance between any two sound source nodes $a$, $b$ among the four source positions $A$, $B$, $C$, and $D$ as:

$$\hat{L}_{ab}^{2}=(\hat{d}_{i}^{a})^{2}+(\hat{d}_{i}^{b})^{2}-2\hat{d}_{i}^{a}\hat{d}_{i}^{b}\cos\left\langle\mathbf{d}_{i}^{a},\mathbf{d}_{i}^{b}\right\rangle,$$

where $\mathbf{d}_{i}^{a}$ and $\mathbf{d}_{i}^{b}$ are the unit direction vectors of the corresponding sides with length $\hat{d}_{i}^{a}$ and $\hat{d}_{i}^{b}$, respectively. Denote the difference between $\hat{L}_{ab}^{2}$ and $L_{ab}^{2}$ as:

$$F_{m}(a,b)=\hat{L}_{ab}^{2}-L_{ab}^{2},$$

where $m=1,2,\cdots,6$ and $L_{ab}=\left\|\hat{\mathbf{s}}^{a}-\hat{\mathbf{s}}^{b}\right\|_{2}$. Consider a system of six nonlinear equations, given by $F(d_{i}^{A,B,C,D})=\left[F_{1};F_{2};\cdots;F_{6}\right]$. We use $d_{i}^{A,B,C,D}$ to collectively denote the distances between the four sound source positions and the $i$-th microphone array, which can be estimated by solving

$$\begin{array}{c}\underset{d_{i}^{A,B,C,D}}{\min}\left\|F(d_{i}^{A,B,C,D})\right\|_{2}^{2}\\ \mathrm{subject\ to}:\ d_{i}^{A,B,C,D}>0\end{array} \tag{27}$$

Note that the nonlinear optimization problem in (27) features a polynomial cost function with a fixed number of unknowns, namely, four edge lengths. In contrast, the batch optimization problem in (15) has a more intricate objective function, incorporating polynomials, exponentials, and trigonometric functions, with $8(N-1)+3K$ optimization variables, where $N$ and $K$ represent the numbers of microphone arrays and time steps, respectively. Hence, in general, the optimization problem in (27) will be much easier to solve (it can be conveniently solved, for instance, using the trust region reflective method [42]) than the entire batch optimization problem in (15). To improve the estimation accuracy of $\hat{d}_{i}^{k}$ at all $K$ time instances, we form combinations by selecting any four sound source positions from all time instances, so that the line from the $i$-th microphone array to the $k$-th source position belongs to multiple polyhedra. This implies that we can leverage multiple estimation results to achieve greater accuracy. By solving for the edge lengths of each polyhedron and employing the well-known interquartile range (IQR) method [43, pp. 236], we calculate the average value of the same edge across the different polyhedra. This average serves as the estimated distance $\hat{d}_{i}^{k}$ between the $i$-th microphone array and the sound source position at the $k$-th time step.
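To make the structure of (27) concrete, the following is a minimal sketch for a single polyhedron, assuming SciPy's `least_squares` with `method="trf"` as the trust region reflective solver. The function names `solve_ranges` and `iqr_mean`, the default initial guess, and the conventional 1.5 IQR fence are assumptions of this example and are not specified in the text.

```python
import itertools
import numpy as np
from scipy.optimize import least_squares

def solve_ranges(doa, src, d0=None):
    """Estimate the four array-to-source distances of one polyhedron, Eq. (27).

    doa: (4, 3) unit DOA vectors d_i^a measured by the i-th array.
    src: (4, 3) estimated source positions s^a in the global frame.
    d0:  optional (4,) initial guess for the distances.
    """
    doa = np.asarray(doa, dtype=float)
    src = np.asarray(src, dtype=float)
    pairs = list(itertools.combinations(range(4), 2))  # the six pairs (a, b)
    ia = np.array([a for a, _ in pairs])
    ib = np.array([b for _, b in pairs])
    cos_ab = np.sum(doa[ia] * doa[ib], axis=1)      # cos<d_i^a, d_i^b>
    L2 = np.sum((src[ia] - src[ib]) ** 2, axis=1)   # L_ab^2 from source estimates

    def residuals(d):
        # F_m(a, b) = d_a^2 + d_b^2 - 2 d_a d_b cos<d_a, d_b> - L_ab^2
        return d[ia] ** 2 + d[ib] ** 2 - 2.0 * d[ia] * d[ib] * cos_ab - L2

    if d0 is None:
        # Hypothetical default: start every distance at the largest source spacing.
        d0 = np.full(4, np.sqrt(L2.max()))
    # The lower bound keeps the distances positive, as required in (27).
    sol = least_squares(residuals, d0, bounds=(1e-9, np.inf), method="trf")
    return sol.x

def iqr_mean(samples, factor=1.5):
    """Average repeated estimates of one edge after discarding IQR outliers."""
    samples = np.asarray(samples, dtype=float)
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    keep = (samples >= q1 - factor * iqr) & (samples <= q3 + factor * iqr)
    return samples[keep].mean()
```

In this sketch, `solve_ranges` would be called once per combination of four source positions, and `iqr_mean` would then fuse all estimates of the same array-to-source edge into a single $\hat{d}_{i}^{k}$.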

(iii) Estimation of microphone arrays positions and orientations using ICP: Note that the positions of the sound source in the frame $\left\{\mathbf{x}_{arr\_i}\right\}$ can be estimated as:

$$\hat{\mathbf{s}}_{i}^{k}=\mathbf{d}_{i}^{k}\cdot\hat{d}_{i}^{k}. \tag{28}$$

We treat the sound source positions as features in each coordinate frame. Finding the transformation that optimally aligns the sound source positions with the reference frame is akin to representing the same features in the reference frame. To this end, we formulate an NLS problem to minimize the mapping error of sound source positions between $\left\{\mathbf{x}_{arr\_i}\right\}$ and $\left\{\mathbf{x}_{arr\_1}\right\}$:

$$\underset{\mathbf{R}_{i},\mathbf{x}_{arr\_i}^{p}}{\min}\sum_{k=1}^{K}\left\|\hat{\mathbf{s}}^{k}-(\mathbf{R}_{i}\hat{\mathbf{s}}_{i}^{k}+\mathbf{x}_{arr\_i}^{p})\right\|_{2}^{2}, \tag{29}$$

which is conceptually a point-to-point registration problem that can be tackled effectively using ICP [38]. Hence, as in [38], let $\mathbf{p}$ and $\mathbf{p}'_{i}$ be the centroids (mean positions) of the source positions in $\left\{\mathbf{x}_{arr\_1}\right\}$ and $\left\{\mathbf{x}_{arr\_i}\right\}$, computed from the estimated sound source positions $\hat{\mathbf{s}}_{1}^{k}$ and $\hat{\mathbf{s}}_{i}^{k}$, respectively. The covariance of the sound source trajectory expressed in the two different frames becomes:

$$\Omega=\sum_{k=1}^{K}\left(\hat{\mathbf{s}}^{k}-\mathbf{p}\right)\left(\hat{\mathbf{s}}_{i}^{k}-\mathbf{p}'_{i}\right)^{\mathrm{T}}. \tag{30}$$

We perform singular value decomposition on this covariance matrix:

$$\Omega=\mathbf{U}\Sigma\mathbf{V}^{\mathrm{T}}. \tag{31}$$

The optimal rotation matrix [38, 44] can be obtained as:

$$\hat{\mathbf{R}}_{i}=\mathbf{U}\mathbf{V}^{\mathrm{T}}. \tag{32}$$

Then, we can transform the rotation matrix $\hat{\mathbf{R}}_{i}$ into the corresponding ZYX Euler angles [45]. Thus, the initial guess of microphone array positions can be expressed as:

$$\hat{\mathbf{x}}_{arr\_i}^{p}=\mathbf{p}-\hat{\mathbf{R}}_{i}\mathbf{p}'_{i}. \tag{33}$$
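The closed-form alignment in (30)-(33) can be sketched in a few lines of NumPy. The function name `align_frames` is an assumption of this example. The determinant check that guards against a reflection is a common safeguard added here; it is not part of (32) as stated.

```python
import numpy as np

def align_frames(src_global, src_local):
    """Closed-form solution of Eq. (29) following Eqs. (30)-(33).

    src_global: (K, 3) source positions s^k in the reference frame {x_arr_1}.
    src_local:  (K, 3) the same positions s_i^k expressed in frame {x_arr_i}.
    Returns (R, t) such that src_global[k] is approximately R @ src_local[k] + t.
    """
    src_global = np.asarray(src_global, dtype=float)
    src_local = np.asarray(src_local, dtype=float)
    p = src_global.mean(axis=0)        # centroid p in the reference frame
    p_i = src_local.mean(axis=0)       # centroid p'_i in the array frame
    # Eq. (30): sum over k of (s^k - p)(s_i^k - p'_i)^T
    omega = (src_global - p).T @ (src_local - p_i)
    U, _, Vt = np.linalg.svd(omega)    # Eq. (31)
    # Assumption beyond Eq. (32): flip the last axis if U V^T is a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                     # Eq. (32) with the reflection guard
    t = p - R @ p_i                    # Eq. (33)
    return R, t
```

With noise-free correspondences the guard is inactive and the result coincides with (32). It only matters when noisy or nearly planar trajectories would otherwise yield a matrix with determinant $-1$.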

(iv) Estimation of microphone arrays asynchronous parameters using LLS: In part (ii), the distances between the sound source and microphone arrays at different time steps have been estimated. By using the inter-array TDOA measurements, the initial guess of the microphone array asynchronous factors can be obtained by solving the following LLS problem:

$$\underset{x_{arr\_i}^{\tau},x_{arr\_i}^{\sigma}}{\min}\sum_{k=1}^{K}\left\|T_{i}^{k}-\left(\frac{\hat{d}_{i}^{k}}{c}-\frac{\hat{d}_{1}^{k}}{c}\right)-x_{arr\_i}^{\tau}-\Delta_{k}x_{arr\_i}^{\sigma}\right\|_{2}^{2}. \tag{34}$$

To identify outliers and improve the estimation accuracy of the inter-array asynchronous factors, we first solve the optimization problem (34). Then, we calculate the residuals, i.e., the differences between the value $T_{i}^{k}-\left(\hat{d}_{i}^{k}-\hat{d}_{1}^{k}\right)/c$ and the corresponding fitted value at each time step, together with their average and standard deviation. Subsequently, we normalize the residuals by dividing each residual by the standard deviation, and use the normalized residuals to identify and exclude the outliers. Using the data with the outliers removed as described above, we solve the optimization problem (34) again to obtain the final estimates of the asynchronous factors.
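The two-pass fit described above can be sketched as follows. The function name `fit_async_params`, the speed of sound of 343 m/s, and the threshold `z_thresh=3.0` on the normalized residuals are assumptions of this example, since the text does not state a threshold. The input `delta` holds the values $\Delta_{k}$ that appear in (34).

```python
import numpy as np

def fit_async_params(tdoa, d_i, d_1, delta, c=343.0, z_thresh=3.0):
    """Two-pass linear least-squares fit of Eq. (34) with outlier rejection.

    tdoa:     (K,) inter-array TDOA measurements T_i^k.
    d_i, d_1: (K,) estimated source distances to array i and to array 1.
    delta:    (K,) the values Delta_k multiplying the clock difference.
    Returns (offset, drift), i.e. estimates of x^tau and x^sigma.
    """
    tdoa, d_i, d_1, delta = (np.asarray(v, dtype=float) for v in (tdoa, d_i, d_1, delta))
    y = tdoa - (d_i - d_1) / c                        # remove the propagation delay
    A = np.column_stack([np.ones_like(delta), delta])  # model: y = x^tau + Delta_k x^sigma

    def solve(mask):
        sol, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        return sol

    first = solve(np.ones(len(y), dtype=bool))         # first pass on all data
    resid = y - A @ first
    sigma = resid.std()
    if sigma == 0.0:                                   # perfect fit, nothing to reject
        return first[0], first[1]
    z = (resid - resid.mean()) / sigma                 # normalized residuals
    keep = np.abs(z) <= z_thresh                       # assumed rejection rule
    second = solve(keep)                               # second pass without outliers
    return second[0], second[1]
```

A single grossly wrong TDOA value inflates the first-pass residual at that time step far beyond the others, so it is dropped before the second pass.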

Algorithm 1 Joint Calibration of Multi-asynchronous Microphone Arrays and Sound Source Localization
Input:  Sensors measurements $\mathbf{z}$
Output:  Estimation of all unknown parameters $\hat{\mathbf{x}}$
  // Initialize $\hat{\mathbf{x}}$
  Compute the sound source positions $\hat{\mathbf{s}}^{k}$ with Eq. (25)-(26);
  for $i\in[2,N]$ do
     for $k\in[1,K]$ do
        Solve for the distance $\hat{d}_{i}^{k}$ and the sound source position $\hat{\mathbf{s}}_{i}^{k}$ in frame $\left\{\mathbf{x}_{arr\_i}\right\}$ via (27) and (28), respectively;
     end for
     $\hat{\mathbf{R}}_{i},\hat{\mathbf{x}}_{arr\_i}^{p}\leftarrow\arg\min\sum_{k=1}^{K}\left\|\hat{\mathbf{s}}^{k}-(\mathbf{R}_{i}\hat{\mathbf{s}}_{i}^{k}+\mathbf{x}_{arr\_i}^{p})\right\|_{2}^{2}$;
     Transform $\hat{\mathbf{R}}_{i}$ into ZYX Euler angles;
     Linear fitting $\hat{x}_{arr\_i}^{\tau},\hat{x}_{arr\_i}^{\sigma}$ with (34);
  end for
  // Error Minimization
  for $iter$ do
     $\mathbf{H}\leftarrow\mathbf{0};\ \mathbf{b}\leftarrow\mathbf{0};$
     for all $\mathbf{z}_{ij}\in\mathbf{z}$ do
        Compute $\mathbf{H}_{ij},\mathbf{b}_{ij}$ with Eq. (35)-(42);
        $\mathbf{H}\leftarrow\mathbf{H}+\mathbf{H}_{ij};\ \mathbf{b}\leftarrow\mathbf{b}+\mathbf{b}_{ij}$;
     end for
     $\mathbf{H}[1:8,1:8]=\mathbf{I}_{8}$; // Fix the global frame $\left\{\mathbf{x}_{arr\_1}\right\}$
     $\triangle\mathbf{x}=\mathbf{H}^{-1}\cdot(-\mathbf{b})$;
     if $\left\|\triangle\mathbf{x}\right\|_{2}<\xi$ then
        break;
     else
        $\hat{\mathbf{x}}\leftarrow\hat{\mathbf{x}}+\triangle\mathbf{x}$;
     end if
  end for

IV-B The Batch Optimization Procedure

As described in (15), we construct a standard NLS problem by considering the microphone arrays as landmarks and the sound source locations as robot positions. For the Gauss-Newton iterations, the increment of each iteration can be obtained by solving:

$$\mathbf{H}\triangle\mathbf{x}=-\mathbf{b},$$

where $\mathbf{H}$ is the Gauss-Newton approximation of the Hessian matrix and $\mathbf{b}$ is the coefficient vector [27]:

$$\begin{array}{c}\mathbf{H}=\sum_{i,j\in\mathcal{C}}\mathbf{H}_{ij}=\sum_{i,j\in\mathcal{C}}\mathbf{J}_{ij}^{\mathrm{T}}\mathbf{W}^{-1}\mathbf{J}_{ij}\\ \mathbf{b}=\sum_{i,j\in\mathcal{C}}\mathbf{b}_{ij}=\sum_{i,j\in\mathcal{C}}\mathbf{J}_{ij}^{\mathrm{T}}\mathbf{W}^{-1}\mathbf{e}_{ij}\end{array} \tag{35}$$

where $i$ and $j$ are the two nodes in the graph (formed by the sound source at different positions and microphone arrays), $\mathcal{C}$ is the full set of measurements, and $\mathbf{J}_{ij}$ is the Jacobian matrix of the error function of the corresponding nodes. For the position-position constraint, denote the error between the expected measurement and real measurement $\mathbf{z}_{p,p}^{k,k+1}$ collected by the robot as:

$$\mathbf{e}_{p,p}^{k,k+1}=\mathbf{s}^{k+1}-\mathbf{s}^{k}-\mathbf{z}_{p,p}^{k,k+1}. \tag{36}$$

The Jacobian matrices w.r.t. position $\mathbf{s}^{k}$ and position $\mathbf{s}^{k+1}$ are:

$$\begin{array}{cc}\mathbf{A}_{p,p}^{k,k+1}=\dfrac{\partial\mathbf{e}_{p,p}^{k,k+1}}{\partial\mathbf{s}^{k}}=-\mathbf{I}_{3},&\mathbf{B}_{p,p}^{k,k+1}=\dfrac{\partial\mathbf{e}_{p,p}^{k,k+1}}{\partial\mathbf{s}^{k+1}}=\mathbf{I}_{3}.\end{array} \tag{37}$$

For the position-landmark constraint, denote the error between the expected measurement and the real measurement $\mathbf{z}_{p,l}^{k}$ collected by microphone arrays as:

$$\mathbf{e}_{p,l}^{k}=\left[\begin{array}{cc}T_{i}^{k};&\mathbf{d}_{i}^{k}\end{array}\right]-\mathbf{z}_{p,l}^{k}. \tag{38}$$

The Jacobian matrices corresponding to landmark $l$ and position $p$ are:

$$\begin{array}{cc}\mathbf{A}_{p,l}^{k}=\dfrac{\partial\mathbf{e}_{p,l}^{k}}{\partial\mathbf{x}_{arr}},&\mathbf{B}_{p,l}^{k}=\dfrac{\partial\mathbf{e}_{p,l}^{k}}{\partial\mathbf{s}^{k}}.\end{array} \tag{39}$$

The structure of the Jacobian matrix is elaborated in Eqs. (47)-(51). For corresponding nodes $i$ and $j$, the Jacobian matrix $\mathbf{J}_{i,j}$ can be succinctly represented as:

$$\mathbf{J}_{i,j}=\left[\mathbf{0};\mathbf{0},\underset{node\ i}{\underbrace{\mathbf{A}_{i,j}}},\mathbf{0},\underset{node\ j}{\underbrace{\mathbf{B}_{i,j}}},\mathbf{0};\mathbf{0}\right]. \tag{40}$$

By omitting the zero blocks, the corresponding sparse block matrix $\mathbf{H}_{ij}$ and the vector $\mathbf{b}_{ij}$ (see Eq. (35)) can be expressed as:

$$\mathbf{H}_{ij}=\left[\begin{array}{ccccc}\ddots\\ &\mathbf{A}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{A}_{i,j}&\cdots&\mathbf{A}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{B}_{i,j}\\ &\vdots&\ddots&\vdots\\ &\mathbf{B}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{A}_{i,j}&\cdots&\mathbf{B}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{B}_{i,j}\\ &&&&\ddots\end{array}\right], \tag{41}$$

$$\mathbf{b}_{ij}=\left[\begin{array}{c}\vdots\\ \mathbf{A}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{e}_{i,j}\\ \vdots\\ \mathbf{B}_{i,j}^{\mathrm{T}}\mathbf{W}_{ij}^{-1}\mathbf{e}_{i,j}\\ \vdots\end{array}\right]. \tag{42}$$

respectively. Combining the initial guess selection pipeline and the Gauss-Newton iteration procedure, we then have the entire calibration algorithm as shown in Algorithm 1.
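The block structure in (41) and (42) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `accumulate_block`, the slice-based state indexing, and the toy dimensions are hypothetical, and the sign convention $\mathbf{H}\Delta\mathbf{x}=-\mathbf{b}$ for the Gauss-Newton step is an assumption, since Eq. (35) is not reproduced in this section.

```python
import numpy as np

def accumulate_block(H, b, A, B, W, e, ia, ib):
    """Add one measurement's blocks from Eqs. (41)-(42) into H and b in place.

    ia and ib are slices selecting the state entries that A and B are
    Jacobians for (hypothetical indexing, not taken from the paper).
    """
    W_inv = np.linalg.inv(W)
    H[ia, ia] += A.T @ W_inv @ A
    H[ia, ib] += A.T @ W_inv @ B
    H[ib, ia] += B.T @ W_inv @ A
    H[ib, ib] += B.T @ W_inv @ B
    b[ia] += A.T @ W_inv @ e
    b[ib] += B.T @ W_inv @ e

# Toy problem: 2 "array" states, 3 "source" states, one 6-dimensional measurement.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((6, 3))
W = 0.1 * np.eye(6)
e = rng.standard_normal(6)

H = np.zeros((5, 5))
b = np.zeros(5)
accumulate_block(H, b, A, B, W, e, slice(0, 2), slice(2, 5))

# Dense reference: J = [A B], H = J^T W^{-1} J, b = J^T W^{-1} e.
J = np.hstack([A, B])
H_dense = J.T @ np.linalg.inv(W) @ J
b_dense = J.T @ np.linalg.inv(W) @ e

# One Gauss-Newton step, assuming the convention H dx = -b.
dx = np.linalg.solve(H, -b)
```

Only the four blocks touched by a given measurement are updated, which is why the zero blocks can be omitted when many measurements are accumulated into one large sparse system.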

Figure 3: The scenarios for microphone array calibration and the corresponding variations in the rank of the $\mathbf{F}$ matrices. (a) The geometric relationships between the moving sound source and multiple microphone arrays in two observable cases. (b) Variation of the $\mathbf{F}$ matrix rank with the movement of the source in two observable cases. (c) The geometric relationships when the moving sound source remains co-linear or co-planar with $\{\mathbf{x}_{arr\_1}\}$. (d) Variation of the $\mathbf{F}$ matrix rank in the corresponding unobservable scenarios. (e) The geometric relationships when the moving sound source remains co-linear with $\{\mathbf{x}_{arr\_2}\}$ or $\theta_{arr\_4,7}^{y}=\pi/2$. (f) Variation of the $\mathbf{F}$ matrix rank in the corresponding unobservable scenarios.

V NUMERICAL SIMULATIONS AND RESULTS

We next present extensive numerical simulations to validate the results in Sections III and IV. Firstly, we verify the observability analysis results, along with intuitive physical interpretations of unobservable scenarios. Secondly, we compare our proposed initialization method (which does not require ground truth) with initialization schemes of adding different levels of noise to the ground truth (GT) and random initialization. Thirdly, we verify the robustness of the calibration algorithm by varying the sound source trajectories.

V-A Observable Cases

We first present two observable scenarios, as shown in Fig. 3(a). Each scenario comprises eight stationary microphone arrays and a moving sound source. In case 1, the source follows a randomly generated 3D trajectory, while in case 2, it moves along a path on a plane that does not coincide with the global reference frame. In both scenarios, the moving sound source emits signals at ten consecutive locations, which are recorded by the microphone arrays.

The rank of the $\mathbf{F}$ matrix in (23) changes over time steps, as illustrated in Fig. 3(b). Based on Theorem 3, since $\mathrm{rank}(\mathbf{M}_{2\_T})=11$ (note that $\mathbf{M}_{2\_T}$ is defined in (59)) and $\mathrm{rank}(\mathrm{diag}(\bar{\mathbf{L}}_{i}))=48$, $i=3,4,\cdots,8$, it is evident that as the time step increases and the source moves along these two trajectories, the $\mathbf{F}$ matrix (with dimensions $336\times 59$) gradually becomes full column rank, i.e., the Jacobian matrix $\mathbf{J}$ (with dimensions $497\times 86$) in (22) has full column rank. This implies that the calibration scenarios are observable. At the time step when the Jacobian matrix becomes full column rank, it can also be verified that $\mathrm{rank}(\mathrm{diag}(\bar{\mathbf{L}}_{i}))=56$, $i=2,3,\cdots,8$, and $\mathrm{rank}(\bar{\mathbf{T}})=3$, so that Theorem 2 holds. Hence, the simulations presented so far behave as predicted by the theoretical analysis. It is worth noting that the sound source positions are not always on the same line as any array frame or on the same plane as the reference array frame. Hence, a sound source trajectory with more motion variety often helps to ensure that the necessary conditions stated in Theorem 2 are met, thereby potentially avoiding the unobservable scenarios.

V-B Unobservable Cases

Several unobservable scenarios are presented in the following to verify the conclusions in Theorems 4-5.

(i) For the Jacobian matrix to have full column rank, it is necessary that the number of time steps is greater than or equal to 3, so that the number of rows of the Jacobian matrix is greater than the number of columns, according to (21). In addition, as can be seen from Fig. 3(b), when the number of time steps is greater than or equal to 3 but less than 5, the Jacobian matrix is not of full column rank. This shows that the system is unobservable when the number of time steps is less than 5.

(ii) For the sound source trajectories shown in Fig. 3(c), the first case is that the sound source stays co-linear with the origin of the global frame $\{\mathbf{x}_{arr\_1}\}$ during the entire process, and $\lambda_{1:9}$ in Theorem 4 (ii) take on the values of $2, \dfrac{3}{2}, \dfrac{4}{3}, \ldots$, and $\dfrac{10}{9}$, respectively. The second case is that the sound source remains co-planar with the global frame $\{\mathbf{x}_{arr\_1}\}$. For this scenario, the sound source positions all lie on the Euclidean plane defined by the equation $x-y=0$ within the three-dimensional $x$-$y$-$z$ Cartesian coordinate frame $\{\mathbf{x}_{arr\_1}\}$. From Fig. 3(d), we can see that both cases are unobservable due to the rank deficiency of matrix $\mathbf{F}$.

(iii) For the sound source trajectories shown in Fig. 3(e), the first case is that the sound source remains co-linear with the origin of $\{\mathbf{x}_{arr\_2}\}$ during the movement, and $\epsilon_{1:9}$ in Theorem 4 (iii) take on the values of $2, \dfrac{3}{2}, \dfrac{4}{3}, \ldots$, and $\dfrac{10}{9}$, respectively. In the second case, the Euler angles $\theta_{arr\_4}^{y}$ and $\theta_{arr\_7}^{y}$ of $\{\mathbf{x}_{arr\_4}\}$ and $\{\mathbf{x}_{arr\_7}\}$ are $\frac{\pi}{2}$, and the sound source travels along the route of the observable scenario mentioned in case 1 of Fig. 3(f). The rotation angle is at the singular point of observation, rendering the system unobservable. Hence, the simulations presented above validate the conclusions in Theorems 4-5.
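The geometric degeneracies in cases (ii) and (iii) can be illustrated with a small sketch that checks only the rank of the stacked source positions, not the rank of $\mathbf{F}$ itself. It assumes that the listed values $2, 3/2, \ldots, 10/9$ are ratios between consecutive source positions, so that the $k$-th position is $k$ times the first one; this reading, the helper `matrix_rank`, and all coordinates are illustrative assumptions rather than definitions from the paper.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with pivoting."""
    m = [list(map(float, r)) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            f = m[r][col] / m[rank][col]
            m[r] = [a - f * p for a, p in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank

# Co-linear case: consecutive positions scaled by 2, 3/2, 4/3, ..., 10/9.
ratios = [(k + 1) / k for k in range(1, 10)]
collinear = [[1.0, 2.0, 0.5]]          # hypothetical first source position
for ratio in ratios:
    collinear.append([ratio * c for c in collinear[-1]])

# Co-planar case: every position satisfies x - y = 0 (hypothetical heights).
coplanar = [[float(k), float(k), 0.3 * ((k * k) % 7)] for k in range(1, 11)]

# Generic 3D positions for comparison (hypothetical values).
generic = [[1.0, 0.0, 0.5], [0.5, 2.0, 1.0], [2.0, 1.0, 3.0]]

rank_collinear = matrix_rank(collinear)   # positions span a line through the origin
rank_coplanar = matrix_rank(coplanar)     # positions span a plane through the origin
rank_generic = matrix_rank(generic)       # positions span the full 3D space
```

A position matrix of rank 1 or 2 corresponds to the co-linear and co-planar trajectories, respectively, while a generic trajectory reaches rank 3.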

TABLE I: NUMERICAL SIMULATIONS EXPERIMENT PARAMETERS

| Parameters | Values |
|---|---|
| Inter-array TDOA noise STD | 0.067 ms |
| Elevation angle (DOA) noise STD | 5 degrees |
| Azimuth angle (DOA) noise STD | 5 degrees |
| Relative position noise STD | $diag_{3}(0.03\,m)$ |
| Max. time offset | 0.1 s |
| Max. clock difference | 0.1 ms |
| Sound speed in air | 346 m/s |
| Max. iterations | 50 |
| Threshold $\xi$ | 1e-5 |

V-C Calibration under Different Initialization Schemes

To validate the initialization pipeline, we employed a predefined trajectory for the sound source, as illustrated in Fig. 4. We utilized our proposed pipeline to initialize the unknown parameters. For comparison, we added varying levels of noises to the true values of the unknown parameters for the same trajectory. These noisy values were then used as initial guesses for the Gauss-Newton iterations.

In detail, we set the base noise standard deviation for the microphone array positions, orientations, asynchronous parameters (time offset and sampling clock difference), and source positions to be $diag_{3}(0.2\,m)$, $diag_{3}(10\ \text{degrees})$, $10^{-2}\,s$, $10^{-5}\,s$, and $diag_{3}(0.2\,m)$, respectively. We selected six sets of initial values, i.e., the ground truth (GT), Random, and Lv1, Lv2, Lv3, Lv4, where Gaussian noises with a standard deviation of 1, 3, 6, and 9 times the base noise are added to the GT. For these different initialization schemes, we conducted 200 Monte Carlo simulations with randomly selected initial values and used the root mean square error (RMSE) to measure the accuracy of the estimated values (the specific formulas are provided in Appendix B).
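The noise-level construction can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the function names, and the example ground-truth values are hypothetical, the RMSE shown is the standard definition (the exact formulas used in the paper are those in Appendix B), and the Random scheme is omitted because its sampling range is not specified here.

```python
import math
import random

# Base noise standard deviations quoted in the text.
BASE_STD = {
    "array_pos_m": 0.2,
    "array_orient_deg": 10.0,
    "time_offset_s": 1e-2,
    "clock_diff_s": 1e-5,
    "source_pos_m": 0.2,
}
# Multipliers of the base noise for each scheme (GT uses no noise).
LEVELS = {"GT": 0, "Lv1": 1, "Lv2": 3, "Lv3": 6, "Lv4": 9}

def perturb(ground_truth, level, rng):
    """Return initial values: ground truth plus zero-mean Gaussian noise."""
    scale = LEVELS[level]
    return {
        name: [v + rng.gauss(0.0, scale * BASE_STD[name]) for v in values]
        for name, values in ground_truth.items()
    }

def rmse(estimates, truth):
    """Root mean square error between two equally long sequences."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ground truth for one array and one source position.
gt = {
    "array_pos_m": [1.0, 2.0, 0.0],
    "array_orient_deg": [0.0, 0.0, 45.0],
    "time_offset_s": [0.05],
    "clock_diff_s": [5e-5],
    "source_pos_m": [0.5, 0.5, 0.3],
}
init_lv2 = perturb(gt, "Lv2", random.Random(0))
init_gt = perturb(gt, "GT", random.Random(0))
pos_error = rmse(init_lv2["array_pos_m"], gt["array_pos_m"])
```

Repeating `perturb` with different seeds yields the randomly selected initial values of one Monte Carlo batch for a given level.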

Furthermore, we also investigated the impact of different initialization schemes on the convergence ratio of the Gauss-Newton algorithm. For each initialization scheme, we define the convergence ratio as the proportion of successfully converged runs to the total number of experiments. During the optimization process, we assessed the convergence of the Gauss-Newton algorithm based on the 2-norm of the optimization step, i.e., $\|\Delta\mathbf{x}\|_{2}$. We classify any of the following three scenarios as divergent: (1) $\|\Delta\mathbf{x}\|_{2}$ exceeds 1e8 in any iteration; (2) $\|\Delta\mathbf{x}\|_{2}$ oscillates above 1e3 and does not fall back below 1e3; (3) $\|\Delta\mathbf{x}\|_{2}$ keeps growing as the iteration count increases. Otherwise, the Gauss-Newton algorithm is deemed convergent.
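One possible reading of these three divergence criteria is sketched below. The text does not fix how "oscillations" or "keeps growing" are evaluated, so the interpretations in the comments (never returning below 1e3 after first exceeding it, and strictly increasing over all recorded iterations) are assumptions, as are the function names.

```python
def classify_run(step_norms, blowup=1e8, osc_level=1e3):
    """Label one Gauss-Newton run from its sequence of step norms."""
    # (1) The step norm exceeds 1e8 in any iteration.
    if any(s > blowup for s in step_norms):
        return "divergent"
    # (2) The step norm rises above 1e3 and never falls back below it (assumed reading).
    above = [i for i, s in enumerate(step_norms) if s > osc_level]
    if above and all(s >= osc_level for s in step_norms[above[0]:]):
        return "divergent"
    # (3) The step norm grows at every iteration (assumed reading).
    if len(step_norms) > 1 and all(b > a for a, b in zip(step_norms, step_norms[1:])):
        return "divergent"
    return "convergent"

def convergence_ratio(runs):
    """Fraction of runs classified as convergent."""
    labels = [classify_run(r) for r in runs]
    return labels.count("convergent") / len(labels)

# Hypothetical step-norm histories.
runs = [
    [10.0, 1.0, 1e-3, 1e-6],      # shrinks steadily
    [1.0, 1e9],                   # blows up
    [5e3, 2e3, 4e3, 3e3],         # oscillates above 1e3
    [1.0, 2.0, 3.0, 4.0],         # keeps growing
    [1e4, 10.0, 1e-6],            # large first step, then settles
]
ratio = convergence_ratio(runs)
```

In a full pipeline the histories would come from the optimizer, stopped at the iteration limit or once the step norm drops below the threshold $\xi$ in Table I.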

Note that, during the numerical experiments, we keep the multiple microphone arrays stationary while the sound source is in motion. The parameters used in the numerical experiments are summarized in Table I (note that in practice, the DOA information can be conveniently indicated by elevation and azimuth angles in 3D. Hence, we will use the latter two angles to represent DOA in the remainder of the paper). Specifically, in our Monte Carlo simulations, the true values of all unknown parameters remain fixed, and the initial values for the Gauss-Newton iterations of each simulation are obtained as described above. Additionally, each simulation utilizes measurements with the same noise level. In other words, noises are added to the theoretical measurement values with standard deviations (STD), as shown in Table I, resulting in the final measurement values used in simulations for inter-array TDOA, DOA, and sound source relative positions.
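The measurement corruption described above, together with the representation of DOA by azimuth and elevation angles, can be sketched as follows. The noise levels are those of Table I; the zero-mean Gaussian model, the elevation convention (measured from the $x$-$y$ plane), the function names, and the example values are assumptions made for illustration.

```python
import math
import random

# Noise standard deviations from Table I.
TDOA_STD_S = 0.067e-3      # inter-array TDOA, seconds
ANGLE_STD_DEG = 5.0        # azimuth and elevation, degrees
REL_POS_STD_M = 0.03       # each axis of the relative position, metres

def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Unit direction vector; elevation is measured from the x-y plane (assumed)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def add_measurement_noise(tdoa_s, azimuth_deg, elevation_deg, rel_pos_m, rng):
    """Corrupt noise-free measurements with zero-mean Gaussian noise."""
    return {
        "tdoa_s": tdoa_s + rng.gauss(0.0, TDOA_STD_S),
        "azimuth_deg": azimuth_deg + rng.gauss(0.0, ANGLE_STD_DEG),
        "elevation_deg": elevation_deg + rng.gauss(0.0, ANGLE_STD_DEG),
        "rel_pos_m": [p + rng.gauss(0.0, REL_POS_STD_M) for p in rel_pos_m],
    }

# Hypothetical noise-free values for one array and one sound event.
noisy = add_measurement_noise(1.2e-3, 30.0, 10.0, [0.5, 0.0, 0.0], random.Random(0))
direction = doa_to_unit_vector(noisy["azimuth_deg"], noisy["elevation_deg"])
```

Seeding the generator per Monte Carlo run keeps each simulated measurement set reproducible.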

TABLE II: THE RMSE OF CALIBRATION RESULTS UNDER VARYING INITIALIZATION NOISE LEVELS: ANALYSIS OF 200 MONTE CARLO SIMULATIONS (BOLD MEANS BETTER)

| Noise Levels | Microphone Array Pos. (m) | Microphone Array Orie. (deg.) | Microphone Array Offset (ms) | Microphone Array Clock (us) | SRC Pos. (m) | Convg. Ratio |
|---|---|---|---|---|---|---|
| GT | 2.796e-02 | 1.173 | 1.078e-01 | 7.579 | 4.228e-02 | 100% |
| Ours | 2.797e-02 | 2.348 | 1.078e-01 | 7.584 | 4.229e-02 | 100% |
| Lv1 | 2.973e-02 | 6.299 | 0.992e-01 | 7.865 | 4.475e-02 | 100% |
| Lv2 | 3.143e-02 | 19.790 | 1.348e-01 | 8.730 | 4.611e-02 | 99.0% |
| Lv3 | 6.026e-02 | 42.860 | 3.010e-01 | 33.196 | 1.011e-01 | 78.5% |
| Lv4 | 7.861e-01 | 64.250 | 3.239 | 68.573 | 2.754e-01 | 44.0% |
| Random | 6.928e-01 | 67.636 | 2.416 | 82.810 | 2.635e-01 | 43.0% |

The results are presented in Table II. It is evident that as the noise level of the initial values of the unknown parameters increases, the final estimation errors gradually increase (except for the time offset, where Lv1 has a negligible advantage over GT), and the convergence ratio decreases. Furthermore, without relying on the GT for initial guess selection, the performance of our calibration algorithm is comparable to the case using the GT as the initial value; only the microphone array orientation estimate is slightly less accurate. This demonstrates the effectiveness of our proposed framework. In contrast, the random initialization method, frequently used in many optimization problems, exhibits inferior performance. Although it outperforms Lv4 in the accuracy of some parameters, it has the lowest convergence ratio, indicating the unreliability of a random strategy. The above comparisons highlight the necessity of an appropriate initialization algorithm in the calibration process and the effectiveness of our proposed pipeline.

V-D Calibration Using Random Trajectories

To verify the robustness of the proposed calibration framework, we generate ten random trajectories, each involving five microphone arrays and 80 sound-emitting events. Take the trajectory shown in Fig. 5 as an example (only the first 40 sound-emitting events are shown for illustration purposes). Even with measurement noise interference, the parameter initialization procedure can obtain initial values that are close to the ground truth. The initialized values are then used in Gauss-Newton iterations to improve calibration accuracy.

Fig. 6 shows the error distribution between the initialized values obtained by our proposed initialization method and the ground truth for ten different trajectories. In the box plot, the blue circle represents the outliers obtained from the interquartile range, while the upper and lower black horizontal lines represent the maximum and minimum values of the non-outlier errors. The upper and lower edges of each box represent the upper and lower quartiles, respectively, and the middle blue line corresponds to the median of the errors. The orange triangle represents the mean of the errors. The errors between our initial values and the ground truth are small, which promotes the convergence of the calibration algorithm.
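As a reading aid for the box plots, the quantities described above can be computed as in the following sketch. It assumes the common 1.5 × IQR whisker rule; the text states only that outliers are obtained from the interquartile range, so the factor, the quartile interpolation method, and the sample errors are assumptions.

```python
import statistics

def box_stats(errors, whisker=1.5):
    """Quartiles, mean, whisker ends, and outliers as drawn in a box plot."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = [e for e in errors if low <= e <= high]
    return {
        "q1": q1,                        # lower edge of the box
        "median": median,                # middle line
        "q3": q3,                        # upper edge of the box
        "mean": statistics.fmean(errors),
        "whisker_low": min(inliers),     # minimum non-outlier error
        "whisker_high": max(inliers),    # maximum non-outlier error
        "outliers": [e for e in errors if e < low or e > high],
    }

# Hypothetical error samples with one clear outlier.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here a single large error is flagged as an outlier and raises the mean well above the median, which is why the plots report both.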

Fig. 7 shows the error distribution between the final estimated values and the ground truth for ten different trajectories. Similar to Table II, the results indicate that while the accuracy of the microphone array orientation estimation is slightly lower than that of other parameters due to the larger DOA measurement noise, the calibration of all parameters is accurate.

Finally, we remark that the relatively poorer accuracy for microphone array orientation shown in Table II and Fig. 7 is mainly attributed to the large magnitude of DOA measurement noise used in the simulation. As indicated in Table I, for our simulations, the elevation and azimuth angle noise STD are both 5°. If we reduce the elevation and azimuth angle noise STD, the accuracy of microphone array orientation will be improved. However, due to limited space, we skip these comparisons and results here.

Figure 4: Estimation results of the preset trajectory with 5 microphone arrays and 24 sound signals. (a) The initial and the true values of microphone array positions, orientations, and sound source positions. (b) The fine-tuned and true values of microphone array positions, orientations, and sound source positions. (c) The initial, fine-tuned, and true values of microphone array time offsets and sampling clock differences between microphone arrays.
Figure 5: Estimation results of the random trajectory with 5 microphone arrays and 80 sound signals. (a) The initial and the true values of microphone array positions, orientations, and sound source positions. (b) The fine-tuned and true values of microphone array positions, orientations, and sound source positions. (c) The initial, fine-tuned, and true values of microphone array time offset and sampling clock differences between microphone arrays.
Figure 6: Error distributions between the initial values and true values for 10 different trajectories. (a) Microphone array positions. (b) Microphone array orientations. (c) Time offsets. (d) Sampling clock differences. (e) Sound source positions.
Figure 7: Error distributions between the estimated values and true values for 10 different trajectories. (a) Microphone array positions. (b) Microphone array orientations. (c) Time offsets. (d) Sampling clock differences. (e) Sound source positions.
Figure 8: Real-world 3D asynchronous microphone array calibration environment setup. (a) Microphone array with pan-tilt head. (b) Turtlebot3 robot with multiple sensors. (c) Typical physical scenario.
Figure 9: Real-environment microphone array calibration results. (a) The initial and the true values of microphone array positions, orientations, and sound source positions. (b) The fine-tuned and true values of microphone array positions, orientations, and sound source positions. (c) The calibration result of microphone array time offsets. (d) The calibration result of microphone array sampling clock differences.

VI REAL-WORLD EXPERIMENTS

In this section, we validate our calibration method using real data. In our experiment, a Turtlebot3 mobile robot moves in an indoor environment, and multiple microphone arrays capture the sound signal emitted by the robot. More specifically, we use the iFLYTEK M160C microphone array consisting of six independent microphones arranged in a circular and evenly distributed configuration, with a diameter of 70.85 mm, a sampling depth of 32 bits, a sampling rate of 16 kHz, and an effective pickup range of 3.5 meters, as shown in Fig. 8(a). The mobile robot is equipped with an Intel D435i camera with an integrated inertial measurement unit (IMU), as shown in Fig. 8(b). It also includes a four-channel trajectory detector for tracking predefined paths and a 3 W, 8 $\Omega$ speaker for sound emission.

In the experimental setup, four microphone arrays are placed in an open area of an academic building. The experimental area is 15.5 meters long, 10 meters wide, and 3.3 meters high, as shown in Fig. 8(c). The microphone arrays remain stationary and receive audio signals while the mobile robot travels along the black trajectory on the ground. When the robot detects the cross-shaped sound markers on the ground, it immediately emits a chirp signal with a frequency of 1000 Hz to 2000 Hz through a speaker driven by a Class-D amplifier, lasting for 300 ms, and then moves on the trajectory. We carry out the following activities to validate the effectiveness and performance of the proposed calibration pipeline across diverse scenarios:

1) Firstly, we compare the calibration results achieved through various initialization strategies (see Section VI.B).

2) Next, we explore the influence of the absence of sound source relative position measurements on the estimation accuracy in the optimization process (see Section VI.C).

3) Moreover, we vary the spacing between the microphone arrays to cover a range of scenarios and scene scales to assess the calibration performance of our method (see Section VI.D).

4) Last but not least, in Section VI.E, we compare the performance of the proposed initialization method (IM) and its fine-tuning (FT) version (i.e., the results are obtained by feeding the initialized values to batch optimization with Gauss-Newton iterations) with those of other existing methods, including the open-source passive geometry calibration method for microphone arrays based on the differential evolution algorithm (PGM) [20] and the two-step calibration method (TSM) based on the L-BFGS algorithm [23].

VI-A Data Collection and Ground Truth

The trajectory of the robot is pre-defined to obtain the ground truth of sound source positions in the global frame. The position of the speaker during audio playback, corresponding to the sound marker’s coordinates and the mobile robot’s height, is regarded as the ground truth for the sound source positions. The microphone arrays were placed w.r.t. each other according to known preset values (i.e., these are taken to be the true values of microphone array positions) before the experiment started. The frame $\{\mathbf{x}_{arr\_1}\}$ attached to the first microphone array is taken to be the global coordinate system. As shown in Fig. 8(a), we affix each microphone array to a pan-tilt head. Subsequently, the pan-tilt head attached to each microphone array (except the first one) is rotated by certain known pre-set angles, which are used to calculate the ground truth value of the Euler angles of the microphone arrays.

We determine the GT values of time offset and sampling clock difference as follows. To compute the ground truth for time offset, we subtract the theoretical time difference (excluding time offset and sampling clock difference) between the first sound marker received by each microphone array and the reference microphone array from the actual time difference. Note that due to the robot’s quick arrival at the first sound marker, the clock difference is so small at this moment that it can be considered negligible. For clock difference, we have recorded 8 hours of audio using multiple microphone arrays placed at the same distance relative to the sound source. This recording includes start and end signals. We calculate the ground truth for sampling clock difference by comparing the number of samples recorded by each microphone array with that of the reference microphone array during this period.
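The two ground-truth computations can be sketched with simple arithmetic. The function names, the sign conventions, the normalization of the clock difference per second of recording, and all numerical values below are assumptions for illustration; the sound speed of 346 m/s is the value listed in Table I for the simulations.

```python
def time_offset_gt(measured_diff_s, dist_array_m, dist_ref_m, sound_speed=346.0):
    """Actual arrival-time difference minus the theoretical propagation difference."""
    theoretical_diff_s = (dist_array_m - dist_ref_m) / sound_speed
    return measured_diff_s - theoretical_diff_s

def clock_difference_gt(n_samples_array, n_samples_ref, fs_hz, duration_s):
    """Clock drift relative to the reference array, in seconds per second (assumed unit)."""
    return (n_samples_array - n_samples_ref) / fs_hz / duration_s

# Hypothetical example: first sound marker 2 m from the array and 1 m from the reference.
offset = time_offset_gt(0.05 + 1.0 / 346.0, 2.0, 1.0)

# Hypothetical 8-hour recording at 16 kHz with 461 extra samples on one array.
duration_s = 8 * 3600
n_ref = 16000 * duration_s
drift = clock_difference_gt(n_ref + 461, n_ref, 16000, duration_s)
```

In this example the array accumulates roughly one microsecond of extra delay per second of recording, which is negligible at the first marker but not over a long run.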

The following three kinds of measurements are obtained during the experiment:

1) For inter-array TDOA measurements between any microphone array and the reference microphone array at the $k$-th sound marker, we employ a sliding window technique to break down the sound signal into short frames. Subsequently, we compute the power spectrum of each frame to determine the valid sound region. Each frame has a duration of 25 ms, and a Hamming window is applied to prevent spectral leakage. For the valid sound region, we apply the GCC-PHAT algorithm [46], widely used in robotic sound localization, to compute the inter-channel time differences for all combinations of 6 channels $\times$ 6 channels. The average time difference is calculated as the inter-array TDOA.

2) For DOA measurements of the microphone array, we employ the Steered Response Power-Phase Transform (SRP-PHAT) algorithm [47] on the obtained signal region, with a discrete search angle resolution of 3 degrees. This technique leverages the spatial filtering capability of the microphone array to estimate the received power from a set of candidate directions. The source is then identified by selecting the location associated with the highest energy. The estimated azimuth and elevation angles are subsequently transformed into three-dimensional unit direction vectors.

3) For the sound source relative position measurements, we utilize a visual-inertial odometry (VIO) method [48] that integrates camera and IMU data. This approach fuses visual information and inertial data, providing more accurate and robust displacement measurements. This allows us to integrate more measurements related to robot motions, thereby enhancing the accuracy and reliability of the sound source relative position measurements.
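The inter-array TDOA step in item 1) can be sketched with a textbook GCC-PHAT estimator in NumPy. This is not the authors' code: the framing, Hamming windowing, and valid-region detection are omitted, the function names are hypothetical, and the test signal is synthetic white noise rather than the recorded chirp.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Delay of `sig` relative to `ref` in seconds (positive if `sig` lags)."""
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def inter_array_tdoa(array_sig, ref_sig, fs):
    """Average GCC-PHAT delay over all channel pairs; inputs are (channels, samples)."""
    return float(np.mean([gcc_phat(a, r, fs) for a in array_sig for r in ref_sig]))

# Synthetic check: the same noise burst arrives 8 samples later at the second array.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
delayed = np.concatenate((np.zeros(8), x[:-8]))
ref_sig = np.stack([x, x])                  # two reference channels
array_sig = np.stack([delayed, delayed])    # two delayed channels
tdoa = inter_array_tdoa(array_sig, ref_sig, fs)
```

With six channels per array, the same call averages over the 36 channel pairs mentioned in item 1).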

TABLE III: THE RMSE OF CALIBRATION RESULTS UNDER VARYING INITIALIZATION NOISE LEVELS USING REAL DATA: ANALYSIS OF 200 MONTE CARLO EXPERIMENTS (BOLD MEANS BETTER)

| Noise Levels | Microphone Array Pos. (m) | Microphone Array Orie. (deg.) | Microphone Array Offset (ms) | Microphone Array Clock (us) | SRC Pos. (m) | Convg. Ratio |
|---|---|---|---|---|---|---|
| GT | 0.233 | 7.936 | 1.514 | 12.712 | 0.156 | 100.0% |
| Ours | 0.233 | 9.650 | 1.515 | 12.749 | 0.156 | 100.0% |
| Lv1 | 0.233 | 8.291 | 1.521 | 12.713 | 0.156 | 99.87% |
| Lv2 | 0.561 | 10.511 | 2.915 | 12.898 | 0.179 | 39.30% |
| Lv3 | 1.068 | 34.799 | 4.886 | 13.819 | 0.419 | 3.53% |
| Random | 0.839 | 78.303 | 20.730 | 56.709 | 0.775 | 0.10% |

VI-B Comparisons between Different Initialization Methods

For the case when four microphone arrays are placed on the corners of a square (2 m $\times$ 2 m), we collect data for five different trajectories, each repeated three times, resulting in a total of 15 datasets. These collected datasets have been used to explore the impact of different initial values in real-world experiments.

Based on the GT and measurement models in Section II (see (10)-(11)), we then calculate the following measurement errors: the inter-array TDOA measurement error has mean value 3.15e-4 seconds with STD of 1.25e-3; the azimuth angle error has mean value of 6.02 degrees with STD of 4.69 degrees; the elevation angle error has mean value of 5.45 degrees with STD of 5.97 degrees, and the VIO measurement error has mean value [2.06e-2, 2.49e-2, 6.13e-3] meters with STD of [9.64e-3, 3.65e-2, 8.44e-3] meters. These errors were obtained by comparing the measured values from the sound signal with the theoretical values.

We set the base noise standard deviation for the microphone array positions, orientations, asynchronous parameters (time offset and sampling clock difference), and source positions to be $diag_{3}(0.2\,m)$, $diag_{3}(10\ \text{degrees})$, $10^{-2}\,s$, $10^{-5}\,s$, and $diag_{3}(0.2\,m)$, respectively. Subsequently, for comparison, we obtain initial values with different levels of errors (GT, Lv1, Lv2, Lv3, and Random), similar to those described in Section V.C. Using each of the 15 datasets, for these different initialization schemes, we conducted 200 Monte Carlo experiments with randomly selected initial values (note that for our proposed initialization method and the initialization using GT, there is only one experiment) and the corresponding real measurements. The results using the above different initialized values are shown in Table III, and the calibration results of our method for one of the datasets are depicted in Fig. 9.

From Table III, it can be observed that for real data, the overall calibration accuracy is slightly lower compared to that of simulation studies, primarily due to noise sources such as motion noise from the mobile robot, sensor measurement noise, and manual interference. However, the effectiveness of our proposed method is evident. In real-world settings, our initialization method produces calibration results almost identical to those obtained using ground truth values directly or Lv1 as initial values, with only slightly reduced orientation accuracy. This is because, unlike simulations, the measurement noises in real-world settings are, in general, not Gaussian and the accuracy of DOA and TDOA measurements is lower. Notably, the noises in DOA measurements (w.r.t. the ground truth values) almost overshadow the performance differences between the initialization methods GT, Lv1, and our initial values. Despite that, the results indicate the effectiveness of our method in scenarios with non-Gaussian and large measurement noises.

Regarding the convergence ratio, both our initialization method and the direct use of ground truth values achieve 100% convergence. As the initial noise level increases, the convergence ratio of initialization using noise-corrupted GT values decreases significantly. For the Lv3 and random initialization methods, the Monte Carlo experiments across all 15 datasets almost always diverge (convergence ratios of 3.53% and 0.10% in Table III, respectively). This underscores the effectiveness and robustness of our proposed method in real-world scenarios.

TABLE IV: THE RMSE OF CALIBRATION RESULTS UNDER VARYING INITIALIZATION NOISE LEVELS WITH ONLY ACOUSTIC MEASUREMENTS: ANALYSIS OF 200 MONTE CARLO EXPERIMENTS (BOLD MEANS BETTER)
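| Noise level | Mic. array pos. (m) | Mic. array orie. (deg.) | Mic. array offset (ms) | Mic. array clock (μs) | SRC pos. (m) | Convg. ratio |
|---|---|---|---|---|---|---|
| GT | 0.425 | 11.580 | 2.015 | 12.064 | 0.226 | 100.0% |
| Lv1 | 0.426 | 12.819 | 2.012 | 12.062 | 0.226 | 99.90% |
| Lv2 | 0.658 | 14.270 | 3.100 | 12.185 | 0.231 | 32.07% |
| Lv3 | - | - | - | - | - | 0.0% |
| Random | - | - | - | - | - | 0.0% |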

VI-C Calibration with Only Acoustic Measurements

To validate the influence of the sound source relative position measurements obtained from the VIO method on the calibration results, this section focuses on conducting calibration experiments using only acoustic measurements obtained from the microphone arrays (inter-array TDOA and DOA measurements). Given that our initialization method relies on relative position measurements of the sound source, we cannot use it for comparison purposes. Hence, we perform Gauss-Newton optimizations initialized by ground truth values corrupted by Gaussian noise across varying levels. Following Section VI.B, Monte Carlo experiments are carried out under varying initialization noise levels using the real measurements from the 15 datasets (excluding the sound source relative position measurements from VIO).

Comparing the results in Table III and Table IV (including and excluding the sound source relative position measurements, respectively), it is evident that without the sound source relative position measurements, the overall parameter estimation results are poorer. In particular, the estimation accuracy of the relative transforms (i.e., orientation, translation) between microphone arrays and sound source positions, and the convergence ratio are lower than the case with sound source relative position measurements. However, as it can also be seen from Table III and Table IV, the absence of sound source relative position measurements from VIO has less impact on the estimation accuracy of asynchronous parameters between the microphone arrays.

Figure 10: Distance’s impact on sound perception. (a) The relationship between the distance of the sound source relative to the microphone array and maximum SNR; the color map illustrates the ratio of having a certain distance (the x-axis) between the sound source and the microphone arrays during the whole experiment process, across the 12 real datasets. (b) The relationship between sound source distance relative to the microphone array and DOA estimation errors, as well as inter-array TDOA estimation errors.
TABLE V: THE RMSE OF CALIBRATION RESULTS UNDER VARYING SCENE SCALES IN REAL-WORLD EXPERIMENTS (BOLD MEANS BETTER)
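| Distance | Mic. array positions (m) | Mic. array orientations (deg.) | Mic. array offset (ms) | Mic. array clock diff. (μs) | SRC position (m) |
|---|---|---|---|---|---|
| 1 m | 0.197 | 7.381 | 0.847 | 2.956 | 0.071 |
| 2 m | 0.173 | 5.825 | 1.272 | 4.327 | 0.116 |
| 3 m | 0.618 | 55.691 | 1.251 | 13.817 | 0.240 |
| 5 m | 2.557 | 81.966 | 3.811 | 20.712 | 0.253 |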

VI-D Calibration Across Varied Scene Scales

In this section, we conducted experiments to investigate the influence of the distances between the microphone arrays and the sound source on the calibration results. This factor plays a pivotal role in the calibration process of microphone arrays, as it impacts the propagation of sound signals and the level of measurement noises. For instance, in scenarios involving long-distance sound propagation, sound signals undergo propagation loss and are subject to noise interference, resulting in signal attenuation and a decrease in the signal-to-noise ratio (SNR) [3]. We consider four scenarios with different microphone array spacings: 1 meter, 2 meters, 3 meters, and 5 meters. Under these varying distances, the Turtlebot3 robot moves in proximity to the microphone arrays, emitting chirp signals with consistent sound intensity. We record data for each setup, with each experiment repeated three times, resulting in a total of 12 datasets.
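To give a rough feel for the attenuation mentioned above, the following Python sketch evaluates the textbook free-field spherical-spreading law for an ideal point source, under which the received level drops by $20\log_{10}(r/r_{\mathrm{ref}})$ dB. This is an idealized model assumed here for illustration only: it ignores reverberation, directivity, and the noise sources of a real room, it is not the SNR measurement procedure used in the experiments, and the function name `spreading_loss_db` is hypothetical.

```python
import math


def spreading_loss_db(distance_m, reference_m=1.0):
    """Level drop (dB) relative to reference_m for an ideal point source
    in a free field: intensity falls as 1/r^2, i.e. 20*log10(r/r_ref)."""
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_m)


# Evaluate the model at source-to-array distances equal to the spacings
# considered in this section (an illustrative choice of distances).
for d in (1.0, 2.0, 3.0, 5.0):
    print(f"{d:.0f} m: {spreading_loss_db(d):.2f} dB below the 1 m level")
```

With a fixed noise floor, the SNR would fall by the same number of decibels, that is, by roughly 6 dB per doubling of distance in this idealized setting.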

Using the datasets collected with different microphone array spacings (1 meter, 2 meters, 3 meters, and 5 meters, respectively), we calculate the SNR that the microphone arrays can capture; the ground truth values of the microphone array positions and the sound marker positions in the global frame are directly measured using a rangefinder. Subsequently, the distances from the microphone arrays to the sound source at different sound marker positions in the 12 collected datasets can be easily calculated. In Fig. 10(a), it can be observed that there is a significant decrease in the maximum SNR that the microphone array can capture as the distance between the mobile robot and the microphone arrays gradually increases. The color map illustrates the ratio of having a certain distance (the x-axis of Fig. 10(a)) between the sound source and the microphone arrays, during the whole experiment process, across the 12 real datasets (footnote 2: for example, in our experiments, there are 4 microphone arrays and, for every dataset, 13 sound events, so in total there are $4 \times 12 \times 13 = 624$ scenarios; if there are 138 scenarios where the sound source is 1m–1.5m away from any microphone array, then the corresponding ratio is $138/624 \approx 0.22$). Meanwhile, Fig. 10(b) clearly shows a significant decrease in the accuracy of both DOA and inter-array TDOA estimations with the increasing distance between the sensor array and the signal source. Table V summarizes the calibration results for different spacing cases. It can be seen from Table V that, compared to the greater spacings of 3 meters and 5 meters, our proposed calibration pipeline achieves better performance for the spacing cases of 1 meter and 2 meters. 
Moreover, one can also notice from Table V that the calibration results for microphone array positions and orientations at 2m were better than those at 1m, because the SNR increases in the 1-2m range (see Fig. 10), while the elevation angle measurement error (w.r.t. the ground truth values) gradually decreases and the azimuth angle measurement error (w.r.t. the ground truth values) almost stays the same. The above results further illustrate the impact of distance on calibration performance.
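The ratio shown by the color map of Fig. 10(a) can be sketched as a simple histogram over all array-source pairs. The snippet below is a minimal illustration under stated assumptions: the helper `bin_ratios` and the 0.5 m bin width are hypothetical (the bin width is merely suggested by the 1m–1.5m example in the footnote), while the counts 4, 12, 13, and 138 are taken from that footnote.

```python
from collections import Counter

N_ARRAYS = 4      # microphone arrays
N_DATASETS = 12   # datasets in the scene-scale study
N_EVENTS = 13     # sound events per dataset
TOTAL = N_ARRAYS * N_DATASETS * N_EVENTS  # number of array-source pairs


def bin_ratios(distances_m, bin_width_m=0.5):
    """Fraction of array-source pairs falling in each distance bin.

    Keys of the returned dict are the lower bin edges in metres.
    """
    counts = Counter(int(d // bin_width_m) * bin_width_m for d in distances_m)
    n = len(distances_m)
    return {edge: c / n for edge, c in sorted(counts.items())}


# Footnote example: 138 of the 624 pairs lie in the 1 m to 1.5 m bin.
print(TOTAL, round(138 / TOTAL, 2))
```

Applying `bin_ratios` to the measured distances of all pairs would yield one ratio per bin along the x-axis of the figure.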

TABLE VI: THE RMSE OF CALIBRATION RESULTS FROM DIFFERENT METHODS IN REAL-WORLD EXPERIMENTS (BOLD MEANS BETTER)
| Method | Mic. array pos. (m) | Mic. array orie. (deg.) | Mic. array offset (ms) | Mic. array clock (μs) | SRC pos. (m) | Average time (s/dataset) |
|---|---|---|---|---|---|---|
| PGM [20] | 1.589 | 45.083 | - | - | - | 3661.152 |
| TSM [23] | 1.227 | 47.461 | 1.671 | - | 1.027 | 48.064 |
| IM (Our) | 0.378 | 11.730 | 1.896 | 18.334 | 0.219 | 6.770 |
| FT (Our) | 0.233 | 9.650 | 1.515 | 12.749 | 0.156 | 2.892 |

VI-E Comparisons with Existing Methods

We next compare our proposed calibration pipeline with the existing algorithms using the datasets collected in Section VI.B. These algorithms include the passive geometry calibration method for microphone arrays based on the differential evolution algorithm (PGM) [20] (footnote 3: the original algorithm in [20] is for the 2D case; for comparison purposes, we have revised it accordingly for the 3D case) and the two-step calibration method based on the L-BFGS algorithm (TSM) [23]. It is worth noting that these two calibration methods do not incorporate relative position measurements, i.e., they overlook the constraints among the positions of the sound source. Additionally, PGM does not include the calibration of time offsets and sampling clock differences among microphone arrays, while TSM disregards sampling clock differences, and both methods lack an effective initialization process.

To showcase the efficiency of each calibration algorithm, we measure the average time required for each of the 15 calibration datasets for different methods on a PC with 32 GB RAM and an Intel Core 3.1 GHz i5-10505 processor. Table VI provides a summary of quantitative comparisons, where the RMSE is calculated based on the metrics listed in Appendix B. The experimental results indicate that our proposed methods (both IM and FT) outperform both PGM and TSM. Besides, the proposed method takes approximately 9 seconds (the total time that both initialization and Gauss-Newton iteration take) to automatically generate a highly accurate calibration of the multiple microphone arrays in 3D, which is faster than TSM and PGM, demonstrating its desirable efficiency.
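The timing comparison can be reproduced from Table VI with a few lines of arithmetic. The sketch below assumes that the IM and FT rows report the run times of the two stages of the proposed pipeline separately, so that their sum is the total time referred to above; this reading of the table is an assumption, and the variable names are illustrative.

```python
# Average run time per dataset, in seconds, as listed in Table VI.
TIME_S = {"PGM": 3661.152, "TSM": 48.064, "IM": 6.770, "FT": 2.892}

# Assumption: the full proposed pipeline runs IM followed by FT.
pipeline_s = TIME_S["IM"] + TIME_S["FT"]

# How many times longer each baseline takes than the assumed pipeline total.
slowdown = {m: TIME_S[m] / pipeline_s for m in ("PGM", "TSM")}

print(f"pipeline: {pipeline_s:.3f} s")
for method, factor in slowdown.items():
    print(f"{method}: {factor:.1f}x the pipeline time")
```

Under this assumption, TSM takes several times longer than the proposed pipeline and PGM takes a few hundred times longer.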

VI-F Discussions

It is evident from the previous simulation and experimental results that the proposed method demonstrates strong robustness, outperforming existing calibration methods in terms of both accuracy and speed. Moreover, one should note that calibration accuracy is critically influenced by the measurement noises of the sensors. It is also worth noting that the SNR decreases as the distances between the sound source and the microphone arrays increase, as pointed out in the existing works [5], [49], and [50]. Consequently, calibration accuracy gradually diminishes with increasing distance. This phenomenon is also observed for our proposed calibration framework, as shown in Section VI.D.

Finally, we remark that while the proposed method can tolerate certain noises such as the robot motion noise and air conditioner noise, it might face challenges in more complex scenarios with diffraction, reflection, and multiple sound sources. In these scenarios, to achieve satisfactory calibration accuracy, one has to incorporate other advanced techniques reported in the literature [5], [50], [51, pp. 217-241].

VII CONCLUSION

This paper is concerned with the joint calibration of multiple asynchronous microphone arrays and sound source localization via batch SLAM. First of all, using the FIM approach, we have conducted a systematic observability analysis of the batch SLAM framework for the above-mentioned calibration problem. More specifically, we have established necessary/sufficient conditions guaranteeing that the FIM and the Jacobian matrix have full column rank, which further implies the identifiability of the unknown parameters. Several scenarios where the unknown parameters are not uniquely identifiable have also been discovered and discussed. Subsequently, for solving the corresponding NLS problem, an effective framework has been proposed to obtain initialized values for the unknown parameters, which are used as the initial guesses in Gauss–Newton types of iterations in batch SLAM and further improve optimization accuracy and convergence. Extensive Monte Carlo simulations and real experiments confirm that the proposed method exhibits high efficiency, accuracy, and robustness in parameter calibration in 3D cases, outperforming the state-of-the-art frameworks for multiple microphone arrays calibration.

The main focus of our current and future work is to consider the active calibration problem of single or multiple microphone arrays where the sound source can optimize its trajectory in real-time to actively collect measurements that contain richer information for improved accuracy and performance, in contrast to the scenarios where the sound source is operated by a human. The calibration problem of moving microphone arrays is also of interest in our future work.

VIII ACKNOWLEDGMENT

The authors would like to thank the reviewers and Editors for their constructive suggestions, which have helped to improve the quality and presentation of this paper significantly. This work was supported by the Science, Technology, and Innovation Commission of Shenzhen Municipality, China, under Grant No. ZDSYS20220330161800001, the Shenzhen Science and Technology Program under Grant No. KQTD20221101093557010, and the National Natural Science Foundation of China (NSFC) under Grant No. 62350055.

References

  • [1] P. Gerstoft, Y. Hu, M. J. Bianco, C. Patil, A. Alegre, Y. Freund, and F. Grondin, Audio scene monitoring using redundant ad-hoc microphone array networks, IEEE Internet of Things Journal, Vol. 9, No. 6, pp. 4259–4268, 2022.
  • [2] H. G. Okuno and K. Nakadai, Robot audition: Its rise and perspectives, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5610–5614, 2015.
  • [3] C. Rascon and I. Meza, Localization of sound sources in robotics: A review, Robotics and Autonomous Systems, Vol. 96, pp. 184–210, 2017.
  • [4] K. Nakadai, M. Kumon, H. Okuno, K. Hoshiba, M. Wakabayashi, K. Washizaki, T. Ishiki, Y. Bando, T. Morito, R. Kojima, and O. Sugiyama, Development of microphone-array-embedded UAV for search and rescue task, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5985–5990, 2017.
  • [5] I. An, Y. Kwon, and S. Yoon, Diffraction-and reflection-aware multiple sound source localization, IEEE Transactions on Robotics, Vol. 38, No. 3, pp. 1925–1944, 2021.
  • [6] J. Zhang, Q. Lyu, G. Peng, Z. Wu, Q. Yan, and D. Wang, LB-L2L-Calib: Accurate and robust extrinsic calibration for multiple 3D LiDARs with long baseline and large viewpoint difference, Proc. of the 2022 International Conference on Robotics and Automation (ICRA), Vol. 22, No. 11, pp. 926–932, 2022.
  • [7] J. Lv, X. Zuo, K. Hu, J. Xu, G. Huang, and Y. Liu, Observability-aware intrinsic and extrinsic calibration of LiDAR-IMU systems, IEEE Transactions on Robotics, Vol. 38, No. 6, pp. 3734–3753, 2022.
  • [8] J. Huai, Y. Lin, Y. Zhuang, C. K. Toth, and D. Chen, Observability analysis and keyframe-based filtering for visual inertial odometry with full self-calibration, IEEE Transactions on Robotics, Vol. 38, No. 5, pp. 3219–3237, 2022.
  • [9] J. Wu, M. Wang, Y. Jiang, B. Yi, R. Fan, and M. Liu, Simultaneous hand–eye/robot–world/camera–IMU calibration, IEEE/ASME Transactions on Mechatronics, Vol. 27, No. 4, pp. 2278–2289, 2022.
  • [10] J. Jiao, Y. Yu, Q. Liao, H. Ye, R. Fan, and M. Liu, Automatic calibration of multiple 3D lidars in urban environments, Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 15–20, 2019.
  • [11] A. Plinge, F. Jacob, R. Haeb-Umbach, and G. A. Fink, Acoustic microphone geometry calibration: An overview and experimental evaluation of state-of-the-art algorithms, IEEE Signal Processing Magazine, Vol. 33, No. 4, pp. 14–29, 2016.
  • [12] F. Perrodin, J. Nikolic, J. Busset and R. Siegwart, Design and calibration of large microphone arrays for robotic applications, Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4596–4601, 2012.
  • [13] M. Crocco, A. Del Bue, and V. Murino, A bilinear approach to the position self-calibration of multiple sensors, IEEE Transactions on Signal Processing, Vol. 60, No. 2, pp. 660–673, 2011.
  • [14] Y. Kuang, S. Burgess, A. Torstensson, and K. Åström, A complete characterization and solution to the microphone position self-calibration problem, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3875–3879, 2013.
  • [15] S. Burgess, Y. Kuang, and K. Åström, TOA sensor network self-calibration for receiver and transmitter spaces with difference in dimension, Signal Processing, Vol. 107, pp. 32–42, 2015.
  • [16] D. Su, T. Vidal-Calleja, and J. V. Miro, Simultaneous asynchronous microphone array calibration and sound source localisation, Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5561–5567, 2015.
  • [17] D. Su, T. Vidal-Calleja, and J. V. Miro, Asynchronous microphone arrays calibration and sound source tracking, Autonomous Robots, Vol. 44, No. 2, pp. 183–204, 2020.
  • [18] D. Su, H. Kong, S. Sukkarieh, and S. Huang, Necessary and sufficient conditions for observability of SLAM-based TDOA sensor array calibration and source localization, IEEE Transactions on Robotics, Vol. 37, No. 5, pp. 1451–1468, 2021.
  • [19] A. Plinge and G. A. Fink, Geometry calibration of multiple microphone arrays in highly reverberant environments, Proc. of the International Workshop on Acoustic Signal Enhancement, pp. 243–247, 2014.
  • [20] A. Plinge, G. A. Fink, and S. Gannot, Passive online geometry calibration of acoustic sensor networks, IEEE Signal Processing Letters, Vol. 24, No. 3, pp. 324–328, 2017.
  • [21] D. Hu, Z. Chen, and F. Yin, Geometry calibration for acoustic transceiver networks based on network Newton distributed optimization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, pp. 1023–1032, 2021.
  • [22] R. Wang, Z. Chen, and F. Yin, DOA-Based three-dimensional node geometry calibration in acoustic sensor networks and its Cramér–Rao Bound and sensitivity analysis, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 9, pp. 1455–1468, 2019.
  • [23] S. Woźniak and K. Kowalczyk, Passive joint localization and synchronization of distributed microphone arrays, IEEE Signal Processing Letters, Vol. 26, No. 2, pp. 292–296, 2019.
  • [24] C. Sugiyama, K. Itoyama, K. Nishida, and K. Nakadai, Assessment of simultaneous calibration for positions, orientations, and time offsets in multiple microphone arrays systems, IEEE/SICE International Symposium on System Integration (SII), pp. 1–6, 2023.
  • [25] L. Wang and S. Doclo, Correlation maximization-based sampling rate offset estimation for distributed microphone arrays, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 3, pp. 571–582, 2016.
  • [26] S. Thrun and M. Montemerlo, The graph SLAM algorithm with applications to large-scale mapping of urban structures, The International Journal of Robotics Research, Vol. 25, No. 5–6, pp. 403–429, 2006.
  • [27] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, A tutorial on graph-based SLAM, IEEE Intelligent Transportation Systems Magazine, Vol. 2, No. 4, pp. 31–43, 2010.
  • [28] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age, IEEE Transactions on Robotics, Vol. 32, No. 6, pp. 1309–1332, 2016.
  • [29] S. M. Nasiri, R. Hosseini, and H. Moradi, Novel parameterization for Gauss–Newton methods in 3D pose graph optimization, IEEE Transactions on Robotics, Vol. 37, No. 3, pp. 780–797, 2021.
  • [30] H. Kong and S. Sukkarieh, Suboptimal receding horizon estimation via noise blocking, Automatica, Vol. 98, pp. 66–75, 2018.
  • [31] H. Kong and S. Sukkarieh, Metamorphic moving horizon estimation, Automatica, Vol. 97, pp. 167–171, 2018.
  • [32] Z. Wang and G. Dissanayake, Observability analysis of SLAM using Fisher information matrix, Proc. of the International Conference on Control, Automation, Robotics, and Vision, pp. 1242–1247, 2008.
  • [33] S. Huang and G. Dissanayake, A critique of current developments in simultaneous localization and mapping, International Journal of Advanced Robotic Systems, Vol. 13, No. 5, pp. 1–13, 2016.
  • [34] S. M. Nasiri, H. Moradi and R. Hosseini, A linear least square initialization method for 3D pose graph optimization problem, IEEE International Conference on Robotics and Automation (ICRA), pp. 2474-2479, 2018.
  • [35] D. M. Rosen, L. Carlone, A. S. Bandeira, and J. J. Leonard, SE-Sync: A certifiably correct algorithm for synchronization over the special Euclidean group, The International Journal of Robotics Research, Vol. 38, No. 2-3, pp. 95–125, 2019.
  • [36] F. Dümbgen, C. Holmes, and T. D. Barfoot, Safe and smooth: Certified continuous-time range-only localization, IEEE Robotics and Automation Letters, Vol. 8, No. 2, pp. 1117–1124, 2023.
  • [37] H. Yang and L. Carlone, Certifiably optimal outlier-robust geometric perception: Semidefinite relaxations and scalable global optimization, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 3, pp. 2816-2834, 2023.
  • [38] F. Pomerleau, F. Colas, and R. Siegwart, A review of point cloud registration algorithms for mobile robotics, Foundations and Trends® in Robotics, Vol. 4, No. 1, pp. 1–104, 2015.
  • [39] Y. He, J. Wang, D. Su, K. Nakadai, J. Wu, S. Huang, Y. Li, and H. Kong, Observability analysis of graph SLAM-based joint calibration of multiple microphone arrays and sound source localization, IEEE/SICE International Symposium on System Integration, pp. 1–8, 2023.
  • [40] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with applications to tracking and navigation: Theory algorithms and software. New York: Wiley, 2004.
  • [41] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: Modeling, planning, and control, Berlin, Germany: Springer, 2009.
  • [42] M. A. Branch, T. F. Coleman, and Y. Li, A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems, SIAM Journal on Scientific Computing, Vol. 21, No. 1, pp. 1–23, 1999.
  • [43] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester, A modern introduction to probability and statistics: Understanding why and how, London: Springer, 2005.
  • [44] K. S. Arun, T. S. Huang, and S. D. Blostein, Least-Squares fitting of two 3-D Point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 5, pp. 698–700, 1987.
  • [45] T. Blesgen, On rotation deformation zones for finite-strain Cosserat plasticity, Acta Mechanica, Vol. 226, No. 7, pp. 2421–2434, 2015.
  • [46] C. Knapp and G. Carter, The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, No. 4, pp. 320–327, 1976.
  • [47] A. Badali, J. M. Valin, F. Michaud, and P. Aarabi, Evaluating real-time audio localization algorithms for artificial audition in robotics, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2033–2038, 2009.
  • [48] T. Qin, P. Li and S. Shen, VINS-Mono: A Robust and Versatile monocular visual-inertial state estimator, IEEE Transactions on Robotics, Vol. 34, No. 4, pp. 1004–1020, 2018.
  • [49] C. Evers and P. A. Naylor, Acoustic SLAM, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 9, pp. 1484–1498, 2018.
  • [50] J. M. Valin, F. Michaud, and J. Rouat, Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering, Robotics and Autonomous Systems, Vol. 55, No. 3, pp. 216–228, 2007.
  • [51] S. Rickard, The DUET blind source separation algorithm. Dordrecht: Springer Netherlands, 2007.

Appendix A

Proof of Proposition 1. Firstly, we note that the relative position of the sound source satisfies $\mathbf{s}_{\Delta}^{k-1}=\mathbf{s}^{k}-\mathbf{s}^{k-1}+\mathbf{w}^{k-1}$, whose corresponding Jacobian matrices are

$$\dfrac{\partial\mathbf{s}_{\Delta}^{k-1}}{\partial\mathbf{s}^{k-1}}=-\mathbf{I}_{3},\quad \dfrac{\partial\mathbf{s}_{\Delta}^{k-1}}{\partial\mathbf{s}^{k}}=\mathbf{I}_{3}.$$

Secondly, for $i=2,\ldots,N$, the distance between the $i$-th microphone array and the sound source at time instance $t^{k}$ can be computed as

$$d_{i}^{k}=\sqrt{(\Delta x_{i}^{k})^{2}+(\Delta y_{i}^{k})^{2}+(\Delta z_{i}^{k})^{2}} \tag{43}$$

where

$$\Delta x_{i}^{k}=s_{x}^{k}-x_{arr\_i}^{x},\quad \Delta y_{i}^{k}=s_{y}^{k}-x_{arr\_i}^{y},\quad \Delta z_{i}^{k}=s_{z}^{k}-x_{arr\_i}^{z}. \tag{44}$$

When $i=1$, i.e., for the first microphone array, we have

$$d_{1}^{k}=\sqrt{(s_{x}^{k})^{2}+(s_{y}^{k})^{2}+(s_{z}^{k})^{2}}. \tag{45}$$
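The distances in (43)-(45) can be evaluated with a short helper. The following Python sketch is illustrative only: the function name `array_source_distance` and the numerical positions are hypothetical, and the default array position at the origin reflects the form of (45), which corresponds to the first array sitting at the origin of the global frame.

```python
import math


def array_source_distance(source, array_pos=(0.0, 0.0, 0.0)):
    """Euclidean distance d_i^k between a source position s^k and the
    position of array i, following (43)-(44). With the default array
    position at the origin, this reduces to (45) for the first array."""
    dx, dy, dz = (s - a for s, a in zip(source, array_pos))
    return math.sqrt(dx * dx + dy * dy + dz * dz)


s_k = (3.0, 4.0, 0.0)                               # hypothetical source position
d_1 = array_source_distance(s_k)                    # first array at the origin
d_2 = array_source_distance(s_k, (3.0, 0.0, 3.0))   # hypothetical second array
```

The coordinate differences computed inside the helper are exactly the quantities $\Delta x_{i}^{k}$, $\Delta y_{i}^{k}$, and $\Delta z_{i}^{k}$ that reappear in the Jacobian entries of this appendix.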

Based on the DOA and TDOA models in (1) and (2), we then have

$$\mathbf{L}^{k}=\dfrac{\partial\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})}{\partial\mathbf{x}_{arr}}=\left[\mathbf{J}_{arr\_2}^{k},\cdots,\mathbf{J}_{arr\_N}^{k}\right]\in\mathbb{R}^{4(N-1)\times 8(N-1)} \tag{46}$$

where, for $i=2,\ldots,N$ and $k=1,\ldots,K$, only the entries of $\mathbf{J}_{arr\_i}^{k}$ in rows $(4i-7:4i-4)$ are nonzero. Then, $\mathbf{L}^{k}$ can be re-expressed as:

$$\mathbf{L}^{k}=diag(\mathbf{H}_{arr\_2}^{k},\mathbf{H}_{arr\_3}^{k},\cdots,\mathbf{H}_{arr\_N}^{k}). \tag{47}$$
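The block-diagonal structure in (47) and the dimensions stated in (46) can be checked with a small sketch. The code below assembles placeholder $4\times 8$ blocks along the diagonal; the `block_diag` helper and the constant-valued blocks are assumptions made for illustration and carry no physical meaning, while $N=4$ matches the number of arrays used in the real experiments.

```python
def block_diag(blocks):
    """Stack equally sized 2-D blocks (lists of rows) along the diagonal."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    n = len(blocks)
    out = [[0.0] * (cols * n) for _ in range(rows * n)]
    for b, block in enumerate(blocks):
        for r in range(rows):
            for c in range(cols):
                out[b * rows + r][b * cols + c] = block[r][c]
    return out


N = 4  # number of microphone arrays
# Placeholder 4x8 blocks standing in for H_{arr_i}^k, i = 2, ..., N.
H_blocks = [[[float(i)] * 8 for _ in range(4)] for i in range(2, N + 1)]
L_k = block_diag(H_blocks)
print(len(L_k), len(L_k[0]))  # rows and columns: 4(N-1) and 8(N-1)
```

Because the first array defines the reference frame, only the $N-1$ remaining arrays contribute a block, which is why the result has $4(N-1)$ rows and $8(N-1)$ columns.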

Denote $\mathbf{h}_{i}^{k}$ and $\mathbf{U}_{i}^{k}$ as the partial derivatives of TDOA and DOA w.r.t. microphone array positions, respectively; denote $\mathbf{V}_{i}^{k}$ as the partial derivative of DOA w.r.t. XYZ Euler angles. We then have:

$$\mathbf{H}_{arr\_i}^{k}\triangleq\mathbf{J}_{arr\_i}^{k}(4i-7:4i-4,:)=\left[\begin{array}{cccc}\mathbf{h}_{i}^{k}&\mathbf{0}&1&\Delta_{k}\\ \mathbf{U}_{i}^{k}&\mathbf{V}_{i}^{k}&\mathbf{0}&\mathbf{0}\end{array}\right]\in\mathbb{R}^{4\times 8} \tag{48}$$

where

$$
\mathbf{h}_{i}^{k}=\left[\dfrac{-\Delta x_{i}^{k}}{c\,d_{i}^{k}},\ \dfrac{-\Delta y_{i}^{k}}{c\,d_{i}^{k}},\ \dfrac{-\Delta z_{i}^{k}}{c\,d_{i}^{k}}\right],
$$

$$
\begin{array}{c}
\mathbf{U}_{i}^{k}=-\mathbf{R}_{i}^{\mathrm{T}}\mathbf{A}\\
=-\mathbf{R}_{i}^{\mathrm{T}}\left[\begin{array}{ccc}
\dfrac{(\Delta y_{i}^{k})^{2}+(\Delta z_{i}^{k})^{2}}{(d_{i}^{k})^{3}} & \dfrac{-\Delta x_{i}^{k}\Delta y_{i}^{k}}{(d_{i}^{k})^{3}} & \dfrac{-\Delta x_{i}^{k}\Delta z_{i}^{k}}{(d_{i}^{k})^{3}}\\
\dfrac{-\Delta x_{i}^{k}\Delta y_{i}^{k}}{(d_{i}^{k})^{3}} & \dfrac{(\Delta x_{i}^{k})^{2}+(\Delta z_{i}^{k})^{2}}{(d_{i}^{k})^{3}} & \dfrac{-\Delta y_{i}^{k}\Delta z_{i}^{k}}{(d_{i}^{k})^{3}}\\
\dfrac{-\Delta x_{i}^{k}\Delta z_{i}^{k}}{(d_{i}^{k})^{3}} & \dfrac{-\Delta y_{i}^{k}\Delta z_{i}^{k}}{(d_{i}^{k})^{3}} & \dfrac{(\Delta x_{i}^{k})^{2}+(\Delta y_{i}^{k})^{2}}{(d_{i}^{k})^{3}}
\end{array}\right],
\end{array}
\tag{49}
$$

and

$$
\mathbf{V}_{i}^{k}=\dfrac{1}{d_{i}^{k}}\left[\begin{array}{c}
\left[\left(\dfrac{\partial\mathbf{R}_{i\_x}^{\mathrm{T}}}{\partial\theta_{x}}\right)\mathbf{R}_{i\_y}^{\mathrm{T}}\mathbf{R}_{i\_z}^{\mathrm{T}}\left(\begin{array}{c}\Delta x_{i}^{k}\\ \Delta y_{i}^{k}\\ \Delta z_{i}^{k}\end{array}\right)\right]^{\mathrm{T}}\\
\left[\mathbf{R}_{i\_x}^{\mathrm{T}}\left(\dfrac{\partial\mathbf{R}_{i\_y}^{\mathrm{T}}}{\partial\theta_{y}}\right)\mathbf{R}_{i\_z}^{\mathrm{T}}\left(\begin{array}{c}\Delta x_{i}^{k}\\ \Delta y_{i}^{k}\\ \Delta z_{i}^{k}\end{array}\right)\right]^{\mathrm{T}}\\
\left[\mathbf{R}_{i\_x}^{\mathrm{T}}\mathbf{R}_{i\_y}^{\mathrm{T}}\left(\dfrac{\partial\mathbf{R}_{i\_z}^{\mathrm{T}}}{\partial\theta_{z}}\right)\left(\begin{array}{c}\Delta x_{i}^{k}\\ \Delta y_{i}^{k}\\ \Delta z_{i}^{k}\end{array}\right)\right]^{\mathrm{T}}
\end{array}\right]^{\mathrm{T}}
\tag{50}
$$

where $\mathbf{R}_{i\_x}$, $\mathbf{R}_{i\_y}$, and $\mathbf{R}_{i\_z}$ are the rotation matrices about the coordinate frame axes $x$, $y$, and $z$, respectively. $\mathbf{R}_{i}^{\mathrm{T}}$ can be expressed as $\mathbf{R}_{i}^{\mathrm{T}}=\mathbf{R}_{i\_x}^{\mathrm{T}}\mathbf{R}_{i\_y}^{\mathrm{T}}\mathbf{R}_{i\_z}^{\mathrm{T}}$, with

$$
\begin{array}{c}
\mathbf{R}_{i\_x}=\left[\begin{array}{ccc}
1 & 0 & 0\\
0 & \cos\theta_{x} & -\sin\theta_{x}\\
0 & \sin\theta_{x} & \cos\theta_{x}
\end{array}\right]\\
\mathbf{R}_{i\_y}=\left[\begin{array}{ccc}
\cos\theta_{y} & 0 & \sin\theta_{y}\\
0 & 1 & 0\\
-\sin\theta_{y} & 0 & \cos\theta_{y}
\end{array}\right]\\
\mathbf{R}_{i\_z}=\left[\begin{array}{ccc}
\cos\theta_{z} & -\sin\theta_{z} & 0\\
\sin\theta_{z} & \cos\theta_{z} & 0\\
0 & 0 & 1
\end{array}\right]
\end{array}.
$$
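The Euler-angle block $\mathbf{V}_{i}^{k}$ can be sanity-checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the DOA observation is the unit vector $\mathbf{R}_{i}^{\mathrm{T}}\boldsymbol{\Delta}/d_{i}^{k}$ with $\boldsymbol{\Delta}=(\Delta x_{i}^{k},\Delta y_{i}^{k},\Delta z_{i}^{k})^{\mathrm{T}}$, and it compares the analytic columns of (50) against central finite differences. The function names and test values are illustrative only.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Element-wise derivatives of the three elementary rotations w.r.t. their angle.
def drot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[0, 0, 0], [0, -s, -c], [0, c, -s]])

def drot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[-s, 0, c], [0, 0, 0], [-c, 0, -s]])

def drot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[-s, -c, 0], [c, -s, 0], [0, 0, 0]])

def doa(theta, delta):
    """Assumed DOA model: R_i^T delta / ||delta||, with R_i^T = Rx^T Ry^T Rz^T."""
    tx, ty, tz = theta
    return rot_x(tx).T @ rot_y(ty).T @ rot_z(tz).T @ delta / np.linalg.norm(delta)

def V_analytic(theta, delta):
    """Columns follow (50): one derivative per Euler angle, scaled by 1/d."""
    tx, ty, tz = theta
    Rx, Ry, Rz = rot_x(tx).T, rot_y(ty).T, rot_z(tz).T
    cols = [drot_x(tx).T @ Ry @ Rz @ delta,
            Rx @ drot_y(ty).T @ Rz @ delta,
            Rx @ Ry @ drot_z(tz).T @ delta]
    return np.column_stack(cols) / np.linalg.norm(delta)

def V_numeric(theta, delta, eps=1e-6):
    """Central finite differences of the assumed DOA model w.r.t. each angle."""
    cols = []
    for j in range(3):
        e = np.zeros(3)
        e[j] = eps
        cols.append((doa(theta + e, delta) - doa(theta - e, delta)) / (2 * eps))
    return np.column_stack(cols)

theta = np.array([0.3, -0.5, 1.1])   # arbitrary test angles (rad)
delta = np.array([1.0, -2.0, 0.5])   # arbitrary test offset vector
max_err = np.max(np.abs(V_analytic(theta, delta) - V_numeric(theta, delta)))
print("max |analytic - numeric| =", max_err)
```

A small `max_err` indicates only that (50) is consistent with the assumed DOA model; it does not validate the model itself.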

Denote $\mathbf{T}^{k}=\dfrac{\partial\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})}{\partial\mathbf{s}^{k}}\in\mathbb{R}^{4(N-1)\times 3}$ as the partial derivative of the TDOA and DOA observations w.r.t. the sound source position at time instance $t^{k}$, for $k=1,\ldots,K$. The expression of $\mathbf{T}^{k}$ is then as follows:

$$
\begin{array}{c}
\mathbf{T}^{k}=\dfrac{\partial\mathbf{y}^{k}(\mathbf{x}_{arr},\mathbf{s}^{k})}{\partial\mathbf{s}^{k}}=\left[\begin{array}{ccc}\mathbf{J}_{x}^{k} & \mathbf{J}_{y}^{k} & \mathbf{J}_{z}^{k}\end{array}\right]\\
=\left[\begin{array}{c}
-\mathbf{h}_{2}^{k}\\
-\mathbf{U}_{2}^{k}\\
\vdots\\
-\mathbf{h}_{N}^{k}\\
-\mathbf{U}_{N}^{k}
\end{array}\right]-\left[\begin{array}{c}
\left(\dfrac{\mathbf{s}^{k}}{c\,d_{1}^{k}}\right)^{\mathrm{T}}\\
\mathbf{0}_{3\times 3}\\
\vdots\\
\left(\dfrac{\mathbf{s}^{k}}{c\,d_{1}^{k}}\right)^{\mathrm{T}}\\
\mathbf{0}_{3\times 3}
\end{array}\right]
\end{array}.
\tag{51}
$$

The results then follow from the definition of the Jacobian matrix [41, p. 569]. This completes the proof.
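The position, clock, and source blocks in (48), (49), and (51) can be checked against finite differences in the same way. The sketch below rests on several assumptions that this appendix does not state explicitly: the reference array sits at the origin, $\boldsymbol{\Delta}=\mathbf{s}^{k}-\mathbf{x}_{i}$, the TDOA model is $(d_{i}^{k}-d_{1}^{k})/c$ plus a time offset plus a clock-drift term multiplied by $\Delta_{k}$, and the DOA model is $\mathbf{R}_{i}^{\mathrm{T}}\boldsymbol{\Delta}/d_{i}^{k}$. The speed of sound, the helper names, and all numeric values are illustrative.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def measurement(x_i, R_i, offset, drift, s, dt):
    """Assumed 4-vector observation [TDOA; DOA] of array i, reference array at origin."""
    delta = s - x_i
    d_i = np.linalg.norm(delta)
    d_1 = np.linalg.norm(s)
    tdoa = (d_i - d_1) / C + offset + drift * dt
    return np.concatenate(([tdoa], R_i.T @ delta / d_i))

def analytic_blocks(x_i, R_i, s, dt):
    """Analytic blocks: [h; U] (4x3), clock columns (4x2), source block (4x3)."""
    delta = s - x_i
    d_i = np.linalg.norm(delta)
    h = -delta / (C * d_i)
    A = (d_i**2 * np.eye(3) - np.outer(delta, delta)) / d_i**3  # matrix A in (49)
    U = -R_i.T @ A
    H_pos = np.vstack((h, U))
    H_clk = np.zeros((4, 2))
    H_clk[0] = [1.0, dt]                       # the "1" and Delta_k entries of (48)
    T = np.vstack((-h - s / (C * np.linalg.norm(s)), -U))  # one row block of (51)
    return H_pos, H_clk, T

def numeric_jac(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x."""
    x = np.asarray(x, float)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.column_stack(cols)

# Arbitrary test configuration (all values hypothetical).
a = 0.4
R_i = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
x_i = np.array([2.0, 1.0, 0.5])
s = np.array([-1.0, 3.0, 1.5])
offset, drift, dt = 0.01, 1e-4, 2.0

H_pos, H_clk, T = analytic_blocks(x_i, R_i, s, dt)
err_pos = np.max(np.abs(H_pos - numeric_jac(lambda x: measurement(x, R_i, offset, drift, s, dt), x_i)))
err_clk = np.max(np.abs(H_clk - numeric_jac(lambda p: measurement(x_i, R_i, p[0], p[1], s, dt), [offset, drift])))
err_src = np.max(np.abs(T - numeric_jac(lambda q: measurement(x_i, R_i, offset, drift, q, dt), s)))
print(err_pos, err_clk, err_src)
```

The Euler-angle block $\mathbf{V}_{i}^{k}$ is left out here to keep the sketch short, so only seven of the eight columns of $\mathbf{H}_{arr\_i}^{k}$ are compared.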

Proof of Theorem 2. By performing elementary row transformations on $\mathbf{F}$, we obtain:

$$
\begin{array}{c}
\overline{\mathbf{F}}=\left[\begin{array}{ccccc}
\mathbf{H}_{arr\_2}^{1} & & & & \mathbf{T}_{arr\_2}^{1}\\
\vdots & & & & \vdots\\
\mathbf{H}_{arr\_2}^{K} & & & & \mathbf{T}_{arr\_2}^{K}\\
 & \mathbf{H}_{arr\_3}^{1} & & & \mathbf{T}_{arr\_3}^{1}\\
 & \vdots & & & \vdots\\
 & \mathbf{H}_{arr\_3}^{K} & & & \mathbf{T}_{arr\_3}^{K}\\
 & & \ddots & & \vdots\\
 & & & \mathbf{H}_{arr\_N}^{1} & \mathbf{T}_{arr\_N}^{1}\\
 & & & \vdots & \vdots\\
 & & & \mathbf{H}_{arr\_N}^{K} & \mathbf{T}_{arr\_N}^{K}
\end{array}\right]\\
=\underset{\overline{\mathbf{L}}}{\underbrace{\left[\begin{array}{cccc}
\mathbf{H}_{arr\_2}\\
 & \mathbf{H}_{arr\_3}\\
 & & \ddots\\
 & & & \mathbf{H}_{arr\_N}
\end{array}\right.}}\underset{\overline{\mathbf{T}}}{\underbrace{\left.\begin{array}{c}
\mathbf{T}_{arr\_2}\\
\mathbf{T}_{arr\_3}\\
\vdots\\
\mathbf{T}_{arr\_N}
\end{array}\right]}}
\end{array}
\tag{52}
$$

where

$$
\begin{array}{c}
\mathbf{H}_{arr\_i}=\left[\mathbf{H}_{arr\_i}^{1};\cdots;\mathbf{H}_{arr\_i}^{K}\right]\in\mathbb{R}^{4K\times 8}\\
\mathbf{T}_{arr\_i}=\left[\mathbf{T}_{arr\_i}^{1};\cdots;\mathbf{T}_{arr\_i}^{K}\right]\in\mathbb{R}^{4K\times 3}
\end{array}
$$

for $i=2,\ldots,N$. Clearly, it holds that $\operatorname{rank}(\mathbf{F})=\operatorname{rank}(\overline{\mathbf{F}})$. Also, due to the structure of the $\mathbf{H}_{arr\_i}$ blocks, their columns are independent of each other. For each microphone array, denote $\mathbf{F}_{arr\_i}=\left[\begin{array}{cc}\mathbf{H}_{arr\_i} & \mathbf{T}_{arr\_i}\end{array}\right]$. We then perform the following elementary transformations on the matrix $\mathbf{F}_{arr\_i}$:

(i) adding the first column block $\left[\mathbf{h}_{i}^{1};\mathbf{U}_{i}^{1};\cdots;\mathbf{h}_{i}^{K};\mathbf{U}_{i}^{K}\right]$ of $\mathbf{H}_{arr\_i}$ to $\mathbf{T}_{arr\_i}$;

(ii) exchanging row blocks to collect all $\mathbf{h}_{i}^{k}$ and $\mathbf{U}_{i}^{k}$ together, respectively, thereby obtaining

$$
\overline{\mathbf{F}}_{arr\_i}=\left[\begin{array}{ccccc}
\mathbf{M}_{h\_i} & \mathbf{0} & \mathbf{1}_{K\times 1} & \varphi_{\mathbf{k}} & -\mathbf{t}_{\mathbf{k}}\\
\mathbf{M}_{U\_i} & \mathbf{M}_{V\_i} & \mathbf{0} & \mathbf{0} & \mathbf{0}
\end{array}\right]\in\mathbb{R}^{4K\times 11}
\tag{53}
$$

where

$$\begin{array}{c}\mathbf{M}_{h\_i}=\left[\mathbf{h}_{i}^{1};\mathbf{h}_{i}^{2};\ldots;\mathbf{h}_{i}^{K}\right],\quad \mathbf{M}_{U\_i}=\left[\mathbf{U}_{i}^{1};\mathbf{U}_{i}^{2};\ldots;\mathbf{U}_{i}^{K}\right],\\ \mathbf{M}_{V\_i}=\left[\mathbf{V}_{i}^{1};\mathbf{V}_{i}^{2};\ldots;\mathbf{V}_{i}^{K}\right],\quad \varphi_{\mathbf{k}}=\left[\Delta_{1};\Delta_{2};\ldots;\Delta_{K}\right],\\ \mathbf{t}_{\mathbf{k}}=\left[\left(\dfrac{\mathbf{s}^{1}}{cd_{1}^{1}}\right)^{\mathrm{T}};\left(\dfrac{\mathbf{s}^{2}}{cd_{1}^{2}}\right)^{\mathrm{T}};\ldots;\left(\dfrac{\mathbf{s}^{K}}{cd_{1}^{K}}\right)^{\mathrm{T}}\right].\end{array}$$

We further perform the following elementary operations on $\overline{\mathbf{F}}_{arr\_i}$, $i=2,3,\cdots,N$:

(i) dividing the fourth column block by $\Delta_{1}$;

(ii) for $k=2,3,\cdots,K$, subtracting the first row from the $k$-th row;

(iii) reducing the elements in the first row (except the third one) to zero using the third column block (whose first element equals 1 while its other elements equal zero after the elementary operations listed above);

(iv) for $k=3,4,\cdots,K$, subtracting the second row multiplied by $\frac{\Delta_{k}-\Delta_{1}}{\Delta_{2}-\Delta_{1}}$ from the $k$-th row;

(v) reducing the elements in the second row (except the fourth one) to zero using the fourth column block (whose second element equals 1 while its other elements equal zero after the elementary operations listed above);

(vi) moving column blocks 3 and 4 to column blocks 1 and 2, respectively.
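To illustrate where the factor $\frac{\Delta_{k}-\Delta_{1}}{\Delta_{2}-\Delta_{1}}$ in step (iv) comes from, the following minimal sketch applies steps (i), (ii), and (iv) to only the two clock-related columns $[\mathbf{1}_{K\times 1},\varphi_{\mathbf{k}}]$ of a toy matrix. The function name `eliminate_clock_columns` and the numerical values of $\Delta_{k}$ are illustrative assumptions, not taken from the paper, and the sketch assumes $\Delta_{1}\neq 0$ and $\Delta_{2}\neq\Delta_{1}$.

```python
import numpy as np


def eliminate_clock_columns(delta):
    """Apply steps (i), (ii) and (iv) to the toy K x 2 matrix [1_K, phi_k]."""
    delta = np.asarray(delta, dtype=float)
    K = delta.size
    M = np.column_stack([np.ones(K), delta])
    M[:, 1] /= delta[0]              # (i) divide the phi column by Delta_1
    M[1:] -= M[0]                    # (ii) subtract row 1 from rows 2..K
    for k in range(2, K):            # (iv) rows 3..K (0-based index k)
        ratio = (delta[k] - delta[0]) / (delta[1] - delta[0])
        M[k] -= ratio * M[1]         # cancels the remaining phi entry of row k
    return M


# Hypothetical, non-uniform values of Delta_k (illustrative only).
delta_demo = [0.2, 0.5, 0.9, 1.4, 2.0]
M_demo = eliminate_clock_columns(delta_demo)
original = np.column_stack([np.ones(len(delta_demo)), delta_demo])
```

In this toy case, rows 3 to $K$ of the two columns vanish after step (iv), and the rank of the two-column matrix is unchanged by the operations.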

After the above operations, we obtain

$$\overline{\mathbf{F}}_{arr\_i}^{\prime}=\left[\begin{array}{cc}\bar{\mathbf{L}}_{i}&\bar{\mathbf{T}}\end{array}\right]\tag{54}$$

where

$$\bar{\mathbf{T}}=\left[\begin{array}{c}\mathbf{0}\\ \Psi\\ \mathbf{0}\end{array}\right]=\left[\begin{array}{c}\mathbf{0}_{2\times 3}\\ \Theta_{1,3}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)-\dfrac{\Theta_{3,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{1,2}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)\\ \Theta_{1,4}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)-\dfrac{\Theta_{4,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{1,2}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)\\ \vdots\\ \Theta_{1,K}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)-\dfrac{\Theta_{K,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{1,2}\left(\dfrac{(\mathbf{s}^{k})^{\mathrm{T}}}{cd_{1}^{k}}\right)\\ \mathbf{0}_{3K\times 3}\end{array}\right]\tag{55}$$

and

$$\begin{aligned}\bar{\mathbf{L}}_{i}={}&\mathrm{diag}(\mathbf{I}_{2},\Phi_{i})\\ ={}&\left[\begin{array}{cccc}1&0&\mathbf{0}&\mathbf{0}\\ 0&1&\mathbf{0}&\mathbf{0}\\ 0&0&\Theta_{3,1}(\mathbf{h}_{arr\_i}^{k})-\dfrac{\Theta_{3,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{2,1}(\mathbf{h}_{arr\_i}^{k})&\mathbf{0}\\ 0&0&\Theta_{4,1}(\mathbf{h}_{arr\_i}^{k})-\dfrac{\Theta_{4,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{2,1}(\mathbf{h}_{arr\_i}^{k})&\mathbf{0}\\ \vdots&\vdots&\vdots&\vdots\\ 0&0&\Theta_{K,1}(\mathbf{h}_{arr\_i}^{k})-\dfrac{\Theta_{K,1}(\Delta_{k})}{\Theta_{2,1}(\Delta_{k})}\Theta_{2,1}(\mathbf{h}_{arr\_i}^{k})&\mathbf{0}\\ \mathbf{0}&\mathbf{0}&\mathbf{U}_{arr\_i}^{1}&\mathbf{V}_{arr\_i}^{1}\\ \mathbf{0}&\mathbf{0}&\mathbf{U}_{arr\_i}^{2}&\mathbf{V}_{arr\_i}^{2}\\ \vdots&\vdots&\vdots&\vdots\\ \mathbf{0}&\mathbf{0}&\mathbf{U}_{arr\_i}^{K}&\mathbf{V}_{arr\_i}^{K}\end{array}\right]\end{aligned}\tag{56}$$

where $\mathbf{h}$, $\mathbf{U}$, and $\mathbf{V}$ are defined in (48), and $\Theta_{m,n}(\boldsymbol{f}(k))$ denotes $\boldsymbol{f}(m)-\boldsymbol{f}(n)$. With the above elementary row and column transformations, we have

$$\overline{\mathbf{F}}\sim\overline{\mathbf{F}}^{\prime}=\underset{\overline{\mathbf{L}}^{\prime}}{\underbrace{\left[\begin{array}{cccc}\bar{\mathbf{L}}_{2}\\ &\bar{\mathbf{L}}_{3}\\ &&\ddots\\ &&&\bar{\mathbf{L}}_{N}\end{array}\right.}}\underset{\overline{\mathbf{T}}^{\prime}}{\underbrace{\left.\begin{array}{c}\bar{\mathbf{T}}\\ \bar{\mathbf{T}}\\ \vdots\\ \bar{\mathbf{T}}\end{array}\right]}}.\tag{57}$$

It holds that $\mathrm{rank}(\mathbf{F})=\mathrm{rank}(\overline{\mathbf{F}})=\mathrm{rank}(\overline{\mathbf{F}}^{\prime})$. From the structure of $\overline{\mathbf{F}}^{\prime}$, we can see that the block columns containing $\bar{\mathbf{L}}_{i}$, $i=2,\ldots,N$, are independent of each other. A necessary condition for $\overline{\mathbf{F}}^{\prime}$ to be of full column rank is that $\bar{\mathbf{L}}_{i}$, $i=2,\ldots,N$, and $\bar{\mathbf{T}}$ are each of full column rank. This completes the proof.
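The necessity argument can be checked numerically on a toy instance. The sketch below assembles a matrix with the same block layout as (57) from small hand-picked blocks and confirms that the assembled matrix loses full column rank as soon as either $\bar{\mathbf{T}}$ or one of the $\bar{\mathbf{L}}_{i}$ is rank deficient. The helper names `assemble_f_prime` and `has_full_column_rank`, the block sizes, and all numerical entries are assumptions made for illustration only; they are not the matrices of the calibration problem.

```python
import numpy as np


def assemble_f_prime(L_blocks, T):
    """Build [blkdiag(L_2, ..., L_N) | T stacked once per block], as laid out in (57)."""
    n_rows = T.shape[0]
    n_cols = sum(L.shape[1] for L in L_blocks)
    F = np.zeros((n_rows * len(L_blocks), n_cols + T.shape[1]))
    col = 0
    for b, L in enumerate(L_blocks):
        rows = slice(b * n_rows, (b + 1) * n_rows)
        F[rows, col:col + L.shape[1]] = L   # diagonal block L_i
        F[rows, n_cols:] = T                # shared last column block
        col += L.shape[1]
    return F


def has_full_column_rank(A):
    return np.linalg.matrix_rank(A) == A.shape[1]


# Toy blocks (hypothetical sizes: every block is 6 x 2).
E = np.eye(6)
L2 = E[:, [0, 1]]
L3 = E[:, [0, 1]] * np.array([1.0, 2.0])
T_good = E[:, [2, 3]]                                # full column rank
T_bad = np.column_stack([E[:, 2], 2.0 * E[:, 2]])    # rank 1
L3_bad = np.column_stack([E[:, 0], E[:, 0]])         # rank 1

F_good = assemble_f_prime([L2, L3], T_good)
F_bad_T = assemble_f_prime([L2, L3], T_bad)
F_bad_L = assemble_f_prime([L2, L3_bad], T_good)

# Subtracting the first row block from the second leaves the rank unchanged.
F_rowop = F_good.copy()
F_rowop[6:] -= F_rowop[:6]
```

Here `F_good` has full column rank, whereas `F_bad_T` and `F_bad_L` do not. The final row-block subtraction is an elementary operation, so `F_rowop` has the same rank as `F_good`.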

Proof of Theorem 3. Here we take $j=2$ as an example. For $\overline{\mathbf{F}}^{\prime}$, we can perform elementary row block operations: for $i=3,\ldots,N$, subtract the first row block from the row block containing $\bar{\mathbf{L}}_{i}$, and obtain:

$$\left[\begin{array}{cccccc}\bar{\mathbf{L}}_{2}&&&&&\bar{\mathbf{T}}\\ -\bar{\mathbf{L}}_{2}&\bar{\mathbf{L}}_{3}&&&&\mathbf{0}\\ \vdots&&\ddots&&&\vdots\\ -\bar{\mathbf{L}}_{2}&&&&\bar{\mathbf{L}}_{N}&\mathbf{0}\end{array}\right].\tag{58}$$

Denote the submatrix of this matrix as:

$$\mathbf{M}_{2\_T}=\left[\begin{array}{cc}\bar{\mathbf{L}}_{2}&\bar{\mathbf{T}}\\ \vdots&\vdots\\ -\bar{\mathbf{L}}_{2}&\mathbf{0}\end{array}\right].\tag{59}$$

From the structure in (58), we can see clearly that if:

(i) $\mathbf{M}_{2\_T}$ is of full column rank, and

(ii) $\mathrm{diag}(\bar{\mathbf{L}}_{3},\ldots,\bar{\mathbf{L}}_{N})$ is of full column rank,

then $\overline{\mathbf{F}}^{\prime}$ will be of full column rank. Since $\mathrm{rank}(\mathbf{F})=\mathrm{rank}(\overline{\mathbf{F}})=\mathrm{rank}(\overline{\mathbf{F}}^{\prime})$, the Jacobian matrix $\mathbf{J}$ is of full column rank. Similarly, the same conditions hold when $j$ equals $3,\ldots,N$. So the Jacobian matrix $\mathbf{J}$ is of full column rank if any matrix consisting of the $(j-1)$-th column block and the last column block in $\overline{\mathbf{F}}^{\prime}$ is of full column rank, $2\leq j\leq N$, and $\bar{\mathbf{L}}_{i}$ are of full column rank, $i=2,\ldots,N$ and $i\neq j$. This completes the proof.

Proof of Theorem 4. (i) $\bar{\mathbf{T}}$ in (55) is of full column rank only if at least one $3\times 3$ matrix formed by three of its rows is of full rank. Since $\left(\mathbf{s}^{k}\right)^{\mathrm{T}}\in\mathbb{R}^{1\times 3}$, $1\leq k\leq K$, and the only nonzero block $\Psi$ of $\bar{\mathbf{T}}$ has $K-2$ rows, a necessary condition for $\bar{\mathbf{T}}$ to be of full column rank is $K\geq 5$. If $K<5$, $\bar{\mathbf{T}}$ cannot be of full column rank.

(ii) Based on (45), when $\mathbf{s}^{k}=\lambda_{k-1}\mathbf{s}^{k-1}$, we can derive $\frac{\mathbf{s}^{k}}{d_{1}^{k}}=\frac{\mathbf{s}^{k-1}}{d_{1}^{k-1}}$. From the expression of $\bar{\mathbf{T}}$, we can see that $\bar{\mathbf{T}}$ cannot be of full rank if the $\mathbf{s}^{k}$, $k=1,\cdots,K$, are proportional to one another. In this case, the sound source is collinear with the origin of the reference microphone array frame $\left\{\mathbf{x}_{arr\_1}\right\}$ at all time steps.

(iii) If the sound source lies on any Euclidean plane of $x+\alpha y=0$, $x+\beta z=0$, and $y+\gamma z=0$ within the three-dimensional $x$-$y$-$z$ Cartesian coordinate frame $\left\{\mathbf{x}_{arr\_1}\right\}$ at all moments, where $\alpha$, $\beta$, and $\gamma$ are arbitrary scalars, the sound source position $\mathbf{s}^{k}$, $1\leq k\leq K$, can be expressed as $\left[-\alpha s_{y}^{k};s_{y}^{k};s_{z}^{k}\right]$, $\left[-\beta s_{z}^{k};s_{y}^{k};s_{z}^{k}\right]$, and $\left[s_{x}^{k};-\gamma s_{z}^{k};s_{z}^{k}\right]$, respectively, and $\bar{\mathbf{T}}$ will not be of full column rank. Specifically, if $\alpha=0$ or $\beta=0$ or $\gamma=0$, the sound source position $\mathbf{s}^{k}$ will have $s_{x}^{k}=0$, $s_{y}^{k}=0$, and $s_{z}^{k}=0$, respectively, i.e., the YOZ, XOZ, and XOY planes in the global frame. This completes the proof.
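The three degenerate cases of Theorem 4 can be reproduced numerically. The following sketch builds the block $\Psi$ of (55) row by row and compares its rank for a generic trajectory, a trajectory collinear with the origin of $\left\{\mathbf{x}_{arr\_1}\right\}$, a trajectory confined to the plane $x+\alpha y=0$, and a trajectory with only $K=4$ time steps. It assumes that $d_{1}^{k}=\|\mathbf{s}^{k}\|$, that $c=343$ m/s, and that the $\Delta_{k}$ are distinct scalars; the names `psi_block` and `psi_rank`, the source positions, the $\Delta_{k}$ values, and the rank tolerance are all hypothetical choices made for this illustration.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s


def psi_block(sources, delta, c=C_SOUND):
    """Rows k = 3..K of Psi in (55), with sources expressed in frame {x_arr_1}."""
    s = np.asarray(sources, dtype=float)                     # shape (K, 3)
    delta = np.asarray(delta, dtype=float)                   # shape (K,)
    u = s / (c * np.linalg.norm(s, axis=1, keepdims=True))   # s^k / (c d_1^k)

    def theta(f, m, n):
        # Theta_{m,n}(f(k)) = f(m) - f(n), with 1-based indices as in the text
        return f[m - 1] - f[n - 1]

    rows = []
    for k in range(3, s.shape[0] + 1):
        ratio = theta(delta, k, 1) / theta(delta, 2, 1)
        rows.append(theta(u, 1, k) - ratio * theta(u, 1, 2))
    return np.array(rows).reshape(-1, 3)


def psi_rank(sources, delta):
    # Explicit tolerance so that rounding noise is not counted as rank.
    return np.linalg.matrix_rank(psi_block(sources, delta), tol=1e-10)


delta = np.array([0.4, 0.9, 1.3, 2.0, 2.6, 3.1])             # illustrative values
generic = np.array([[1.0, 0.5, 0.2], [0.3, 1.2, 0.7], [-0.8, 0.9, 1.1],
                    [1.5, -0.4, 0.6], [0.2, 0.3, 1.4], [-1.0, -0.7, 0.5]])

# (ii) positive multiples of one direction: collinear with the frame origin
scales = np.array([1.0, 1.5, 2.2, 0.7, 3.0, 1.9])
collinear = scales[:, None] * np.array([0.6, -0.3, 0.8])

# (iii) sources confined to the plane x + alpha * y = 0
alpha = 0.7
planar = generic.copy()
planar[:, 0] = -alpha * planar[:, 1]

rank_generic = psi_rank(generic, delta)
rank_collinear = psi_rank(collinear, delta)
rank_planar = psi_rank(planar, delta)
rank_short = psi_rank(generic[:4], delta[:4])                # (i) K = 4 < 5
```

Only the generic trajectory with $K=6$ yields a $\Psi$ of rank 3 in this toy setting; the other three configurations leave $\Psi$, and hence $\bar{\mathbf{T}}$, rank deficient.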

Proof of Theorem 5. (i) If the sound source, at all of the $K\,(K\geq 5)$ time steps, is collinear w.r.t. the origin of the microphone array frame $\left\{\mathbf{x}_{arr\_i}\right\}$, i.e., $(\mathbf{s}^{k}-\mathbf{x}_{arr\_i}^{p})=\epsilon_{k-1}(\mathbf{s}^{k-1}-\mathbf{x}_{arr\_i}^{p})$ always holds true, then for $i\geq 2$, $k=2,3,\ldots,K$, we can get the following expression:

$$
\begin{cases}
\left[\Delta x_{i}^{k};\ \Delta y_{i}^{k};\ \Delta z_{i}^{k}\right]=\epsilon_{k-1}\left[\Delta x_{i}^{k-1};\ \Delta y_{i}^{k-1};\ \Delta z_{i}^{k-1}\right] \\
\mathbf{h}_{i}^{k}=\mathbf{h}_{i}^{k-1},\quad \mathbf{U}_{i}^{k}=\frac{1}{\epsilon_{k-1}}\mathbf{U}_{i}^{k-1},\quad \mathbf{V}_{i}^{k}=\mathbf{V}_{i}^{k-1}
\end{cases}
$$

where $\mathbf{h}$, $\mathbf{U}$, and $\mathbf{V}$ are defined in (48). For an arbitrary single time step, we have $\mathrm{rank}(\mathbf{U}_{i}^{k})=\mathrm{rank}(\mathbf{R}_{i}^{\mathrm{T}}\mathbf{A})$, as shown in (49). Since $\det(\mathbf{A})=0$ and a second-order sub-determinant of $\mathbf{A}$ is not equal to 0, we know that $\mathrm{rank}(\mathbf{A})=2$. Because $\mathbf{R}_{i}^{\mathrm{T}}$ is a rotation matrix, $\mathrm{rank}(\mathbf{R}_{i}^{\mathrm{T}})=3$, and thus $\mathrm{rank}(\mathbf{U}_{i}^{k})=2$. Therefore, $\bar{\mathbf{L}}_{i}$ will not be of full column rank.

(ii) When $\theta_{arr\_i}^{y}=\pm\frac{\pi}{2}$, for the corresponding microphone array at any time step, $\mathbf{V}_{i}^{k}$ defined in (50) has the same structure, i.e.,

$$
\begin{array}{c}
\mathbf{V}_{i}^{k}\left(\theta_{arr\_i}^{y}=\frac{\pi}{2}\right)=
\begin{bmatrix}
0 & \Delta x_{i}^{k}c_{z}+\Delta y_{i}^{k}s_{z} & 0 \\
\Delta y_{i}^{k}s_{x-z}-\Delta x_{i}^{k}c_{x-z} & \Delta z_{i}^{k}s_{x} & -\Delta y_{i}^{k}s_{x-z}+\Delta x_{i}^{k}c_{x-z} \\
\Delta y_{i}^{k}c_{x-z}+\Delta x_{i}^{k}s_{x-z} & \Delta z_{i}^{k}c_{x} & -\Delta y_{i}^{k}c_{x-z}-\Delta x_{i}^{k}s_{x-z}
\end{bmatrix} \\[2ex]
\mathbf{V}_{i}^{k}\left(\theta_{arr\_i}^{y}=-\frac{\pi}{2}\right)=
\begin{bmatrix}
0 & -\Delta x_{i}^{k}c_{z}-\Delta y_{i}^{k}s_{z} & 0 \\
\Delta y_{i}^{k}s_{x+z}+\Delta x_{i}^{k}c_{x+z} & -\Delta z_{i}^{k}s_{x} & \Delta y_{i}^{k}s_{x+z}+\Delta x_{i}^{k}c_{x+z} \\
\Delta y_{i}^{k}c_{x+z}-\Delta x_{i}^{k}s_{x+z} & -\Delta z_{i}^{k}c_{x} & \Delta y_{i}^{k}c_{x+z}-\Delta x_{i}^{k}s_{x+z}
\end{bmatrix},
\end{array}
$$

where $s$ and $c$ represent $\sin$ and $\cos$, respectively, and $\mathrm{rank}(\mathbf{V}_{i}^{k})\equiv 2$, since in both cases the third column equals the first column up to sign. Therefore, the matrix $\bar{\mathbf{L}}_{i}$ in (57) will not be of full column rank. This completes the proof.
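The rank deficiency in case (ii) can be checked numerically. The following is a minimal sketch rather than part of the original analysis: it transcribes the two matrices above, assuming that subscripts such as $s_{x-z}$ denote $\sin(\theta^{x}-\theta^{z})$, and uses the determinant/sub-determinant test from part (i). The function names and the sample values are hypothetical.

```python
import math

def v_matrix(dx, dy, dz, theta_x, theta_z, sign):
    """V_i^k at theta_y = sign * pi/2, transcribed from the matrices above.

    Assumption: s_{x-z} means sin(theta_x - theta_z), s_x means sin(theta_x), etc.
    """
    sx, cx = math.sin(theta_x), math.cos(theta_x)
    sz, cz = math.sin(theta_z), math.cos(theta_z)
    if sign > 0:
        d = theta_x - theta_z
        sd, cd = math.sin(d), math.cos(d)
        return [
            [0.0, dx * cz + dy * sz, 0.0],
            [dy * sd - dx * cd, dz * sx, -dy * sd + dx * cd],
            [dy * cd + dx * sd, dz * cx, -dy * cd - dx * sd],
        ]
    d = theta_x + theta_z
    sd, cd = math.sin(d), math.cos(d)
    return [
        [0.0, -dx * cz - dy * sz, 0.0],
        [dy * sd + dx * cd, -dz * sx, dy * sd + dx * cd],
        [dy * cd - dx * sd, -dz * cx, dy * cd - dx * sd],
    ]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def has_nonzero_2x2_minor(m, tol=1e-12):
    """True if at least one second-order sub-determinant exceeds tol in magnitude."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    for r0, r1 in pairs:
        for c0, c1 in pairs:
            if abs(m[r0][c0] * m[r1][c1] - m[r0][c1] * m[r1][c0]) > tol:
                return True
    return False

def rank_is_two(m, tol=1e-12):
    """Rank equals 2 iff det = 0 and some second-order minor is nonzero."""
    return abs(det3(m)) < tol and has_nonzero_2x2_minor(m, tol)

if __name__ == "__main__":
    for sign in (+1, -1):
        V = v_matrix(1.0, 2.0, 3.0, 0.3, 0.7, sign)
        print(sign, rank_is_two(V))
```

For generic offsets $(\Delta x, \Delta y, \Delta z)$ the check returns true for both signs, which is consistent with $\mathrm{rank}(\mathbf{V}_{i}^{k})\equiv 2$; degenerate offsets (for example, all zeros) would lower the rank further.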

Appendix B

Evaluation Metrics. The errors of the microphone array positions, orientations, time offsets, and clock differences, and of the sound source positions, can be expressed as follows:

$$
E(\mathbf{x}_{arr\_i}^{p})=\left\|\hat{\mathbf{x}}_{arr\_i}^{p}-\mathbf{x}_{0}\right\|_{2},\quad
E(\mathbf{x}_{arr\_i}^{\theta})=\arccos\left(\frac{\hat{\mathbf{R}}_{i}\mathbf{v}\cdot\mathbf{X}_{0}\mathbf{v}}{\left\|\mathbf{v}\right\|_{2}^{2}}\right),
$$
$$
E(x_{arr\_i}^{\tau})=\hat{x}_{arr\_i}^{\tau}-x_{0},\quad
E(x_{arr\_i}^{\delta})=\hat{x}_{arr\_i}^{\delta}-x_{0},\quad
E(\mathbf{s}_{k})=\left\|\hat{\mathbf{s}}_{k}-\mathbf{x}_{0}\right\|_{2},
$$

where $\hat{\cdot}$ represents the estimate of the unknown scalar/vector/matrix parameters, $x_{0}/\mathbf{x}_{0}/\mathbf{X}_{0}$ represents the true value of the corresponding parameter, and $\mathbf{v}=\left[1;1;1\right]$.
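As an illustration of how these metrics could be evaluated, the sketch below implements the position and orientation errors in plain Python. It is not the authors' code: the function names, the helper `rot_z`, and the clamping of the cosine to $[-1, 1]$ (added to guard against round-off) are assumptions made for this example.

```python
import math

def position_error(estimate, truth):
    """Euclidean distance between an estimated and a true 3D position."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))

def _matvec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]

def orientation_error(R_est, R_true, v=(1.0, 1.0, 1.0)):
    """Angle between R_est @ v and R_true @ v, with v = [1; 1; 1] by default."""
    a = _matvec(R_est, v)
    b = _matvec(R_true, v)
    cos_angle = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in v)
    # Clamp to the valid domain of arccos to absorb floating-point round-off.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)

def rot_z(theta):
    """Hypothetical helper: rotation by theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    print(position_error((1.0, 2.0, 2.0), (0.0, 0.0, 0.0)))
    print(orientation_error(rot_z(math.pi / 2), rot_z(0.0)))
```

Because rotations preserve norms, $\|\hat{\mathbf{R}}_{i}\mathbf{v}\|_{2}\|\mathbf{X}_{0}\mathbf{v}\|_{2}=\|\mathbf{v}\|_{2}^{2}$, so the quotient is exactly the cosine of the angle between the two rotated copies of $\mathbf{v}$.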

In the experiments in Sections V and VI, we utilized the root mean square error (RMSE) to evaluate the accuracy of the calibration algorithm for parameter estimation. The RMSE of the parameter $\mathbf{x}$ was calculated as $RMSE(\mathbf{x})=\sqrt{\frac{1}{M}\sum_{i=1}^{M}E_{i}^{2}(\mathbf{x})}$, where $M$ is equal to the total number of the corresponding parameter $\mathbf{x}$.
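The RMSE formula translates directly into a few lines of code. The sketch below is illustrative only; the function name `rmse` and the sample error values are hypothetical and are not taken from the experiments.

```python
import math

def rmse(errors):
    """Root mean square of M per-parameter errors E_i(x)."""
    errors = list(errors)
    if not errors:
        raise ValueError("at least one error value is required")
    return math.sqrt(sum(e * e for e in errors) / len(errors))

if __name__ == "__main__":
    # Hypothetical position errors (in metres) for M = 3 microphone arrays.
    print(rmse([0.02, 0.05, 0.03]))
```

Since each error is squared, the signed time-offset and clock-difference errors defined above can be passed in directly without taking absolute values first.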

Jiang Wang received the B.Eng. in Electrical Engineering and Automation from the Shenyang Agricultural University, Shenyang, China, in 2020. Since September 2021, he has been working towards the M.S. degree in Electronic Science and Technology, Southern University of Science and Technology, Shenzhen, China. His major research interests include sensor calibration, robot audition, SLAM, and sensor fusion.
Yuanzheng He received the B.Eng. in Electronic and Information Engineering from the Henan University of Technology, Zhengzhou, China, in 2021. Since September 2021, he has been working towards the M.S. degree in Electronic Science and Technology, Southern University of Science and Technology, Shenzhen, China. His major research interests include robot audition, robot perception, and multi-sensor fusion.
Daobilige Su received his B. Eng. in Mechatronic Engineering from Zhejiang University, China in 2010, M. Eng. in Automation and Robotics from Warsaw University of Technology, Poland and M. Eng. in Automation from University of Genova, Italy through European Master on Advanced Robotics (EMARO) program in 2012, and Ph. D. in robotics at Centre for Autonomous Systems (CAS), University of Technology Sydney (UTS), Australia in 2017. He was a post-doctoral research associate at Australian Centre for Field Robotics (ACFR), The University of Sydney from 2017 to 2020. He is currently an Associate Professor at College of Engineering, China Agricultural University, China. His current research areas include field robotics, SLAM, robot audition, computer vision, and machine learning.
Katsutoshi Itoyama received the M.S. and Ph.D. degrees in informatics from Kyoto University, Kyoto, Japan, in 2008 and 2011, respectively. He had been an Assistant Professor with the Graduate School of Informatics, Kyoto University, until 2018 and is currently an Associate Professor with the Tokyo Institute of Technology, Tokyo, Japan. His research interests include sound source separation, music listening interfaces, and music information retrieval.
Kazuhiro Nakadai received a B.E. in electrical engineering in 1993, an M.E. in information engineering in 1995, and a Ph.D. in electrical engineering in 2003 from the University of Tokyo. He worked with Nippon Telegraph and Telephone for four years as a system engineer from 1995 to 1999, with the Kitano Symbiotic Systems Project, ERATO, JST as a researcher from 1999 to 2003, and with Honda Research Institute Japan, Co., Ltd. as a principal scientist from 2003 to 2022. Currently he is a professor at the Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology. He has had a concurrent position at Tokyo Institute of Technology, as a visiting associate professor from 2006 to 2010, a visiting professor from 2011 to 2017, and a specially-appointed professor from 2017 to 2022. He also had a concurrent position as a guest professor at Waseda University from 2011 to 2018. His research interests include AI, robotics, signal processing, computational auditory scene analysis, multi-modal integration, and robot audition. He has been an executive board member for JSAI from 2015 to 2016, and for RSJ from 2017 to 2018. He is a Fellow of the IEEE and also a member of JSAI, RSJ, IPSJ, ASJ, HIS, ISCA, ACM.
Junfeng Wu received the B.Eng. degree from the Department of Automatic Control, Zhejiang University, Hangzhou, China, and the Ph.D. degree in electrical and computer engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2009 and 2013, respectively. From 2014 to 2017, he was a Postdoctoral Researcher with the ACCESS (Autonomic Complex Communication nEtworks, Signals and Systems) Linnaeus Center, School of Electrical Engineering, KTH Royal Institute of Technology, Stockholm, Sweden. From 2017 to 2021, he was with the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. His research interests include networked control systems, state estimation, wireless sensor networks, multi-agent systems, and robot perception and localization. He currently serves as an Associate Editor for IEEE Transactions on Control of Network Systems.
Shoudong Huang received the bachelor’s and master’s degrees in mathematics and the Ph.D. in automatic control from Northeastern University, Shenyang, China, in 1987, 1990, and 1998, respectively. He is currently a Professor with the Centre for Autonomous Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW, Australia. His research interests include nonlinear control systems and mobile robots simultaneous localization and mapping (SLAM), exploration, and navigation.
Youfu Li received the PhD degree in robotics from the Department of Engineering Science, University of Oxford in 1993. From 1993 to 1995 he was a research staff member in the Department of Computer Science at the University of Wales, Aberystwyth, UK. He joined City University of Hong Kong in 1995 and is currently a professor in the Department of Mechanical Engineering. His research interests include robot sensing, robot vision, and visual tracking. In these areas, he has published over 400 papers including over 180 SCI listed journal papers. Dr Li has received many awards in robot sensing and vision including IEEE Sensors Journal Best Paper Award by IEEE Sensors Council, Second Prize of Natural Science Research Award by the Ministry of Education, 1st Prize of Natural Science Research Award of Hubei Province, 1st Prize of Natural Science Research Award of Zhejiang Province, China. He was listed among the Top 2% of the world’s most highly cited scientists by Stanford University, 2020, 2021 and Career Long. He has served as an Associate Editor for IEEE Transactions on Automation Science and Engineering (T-ASE), Associate Editor and Guest Editor for IEEE Robotics and Automation Magazine (RAM), and Editor for CEB, IEEE International Conference on Robotics and Automation (ICRA). He is a Fellow of the IEEE.
He Kong received the Bachelor’s degree in Electrical Engineering from China University of Mining and Technology, Xuzhou, China, Master’s degree in Control Science and Engineering from Harbin Institute of Technology, Harbin, China, and the Ph.D. degree in Electrical Engineering from the University of Newcastle, Australia, respectively. He was a Research Fellow at the Australian Centre for Field Robotics, the University of Sydney, Australia, during 2016–2021. In early 2022, he joined the Southern University of Science and Technology, Shenzhen, China, where he is currently an Associate Professor. His research interests include active multi-modal perception, robot audition, state estimation, and control applications. He is currently serving on the editorial board of IEEE Robotics and Automation Letters, IEEE Robotics and Automation Magazine, IEEE Sensors Letters, International Journal of Adaptive Control and Signal Processing. He has also served as an Associate Editor for several international conferences in robotics and automation, including the IEEE ICRA, IEEE/RSJ IROS, IEEE CASE, etc.