Dataset Creation

✅ This folder 📁 contains the train, dev and test splits used for all our experiments, along with the scripts used to generate those splits. More details can be found here. The dataset.tar.gz archive contains the following TSV files: train.tsv, train_random_100h.tsv, train_equi_100h.tsv, dev.tsv, dev_small.tsv, test.tsv and test_small.tsv.

The exact details of how these files are used can be found in our paper.
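To check the contents of the archive before training, something like the following sketch may help. It assumes only that dataset.tar.gz is a gzip-compressed tar file holding tab-separated text files, as described above. The helper names `list_tsv_members` and `preview_tsv` are hypothetical and not part of this repository, and no particular column layout is assumed.

```python
import csv
import io
import tarfile


def list_tsv_members(archive_path):
    """Return the sorted names of all .tsv files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".tsv")
        )


def preview_tsv(archive_path, member_name, max_rows=5):
    """Read the first few rows of one TSV member without unpacking the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        raw = archive.extractfile(member_name)
        if raw is None:
            raise ValueError(f"{member_name} is not a regular file")
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        # QUOTE_NONE keeps any quote characters in transcripts untouched.
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for _, row in zip(range(max_rows), reader)]


# Example usage (assumes dataset.tar.gz is in the current directory):
#   for name in list_tsv_members("dataset.tar.gz"):
#       print(name, preview_tsv("dataset.tar.gz", name, max_rows=2))
```

Reading members through `extractfile` avoids writing the splits to disk, which is convenient for a quick sanity check of file names and row layout.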

Prerequisites

  • Clone the repository
git clone https://github.com/csalt-research/accented-codebooks-asr.git
  • Install the required python packages:
pip install -r data/accented-codebooks-asr/requirements.txt

Running the script

Go to the data folder and execute the create_data.sh script.

cd accented-codebooks-asr/data && ./create_data.sh

Dataset Statistics

The statistics of train, dev and test splits used in our experiments are as follows:

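| Accent      | Train 100h (hours) | Train (hours) | Dev (hours) | Test (hours) |
|-------------|--------------------|---------------|-------------|--------------|
| Australia   | 6.95               | 45.36         | 4.33        | 0.46         |
| Canada      | 6.79               | 41.13         | 1.16        | 1.21         |
| England     | 19.51              | 119.9         | 3.22        | 1.65         |
| Scotland    | 2.69               | 16.21         | 0.23        | 0.16         |
| US          | 64.12              | 400.1         | 8.32        | 4.87         |
| Africa      | -                  | -             | -           | 1.71         |
| Hongkong    | -                  | -             | -           | 0.52         |
| India       | -                  | -             | -           | 0.58         |
| Ireland     | -                  | -             | -           | 1.94         |
| Malaysia    | -                  | -             | -           | 0.39         |
| Newzealand  | -                  | -             | -           | 2.11         |
| Philippines | -                  | -             | -           | 0.90         |
| Singapore   | -                  | -             | -           | 0.64         |
| Wales       | -                  | -             | -           | 0.27         |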
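As a quick consistency check on the table above, the per-split totals and each accent's share of a split can be computed directly from the listed hours. The snippet below is only an illustrative sketch: the dictionary is transcribed by hand from the table, `None` stands for the "-" entries, and the names `HOURS`, `SPLITS`, `split_totals` and `accent_share` are hypothetical.

```python
# Hours per accent, transcribed from the table: (train_100h, train, dev, test).
# None marks accents that appear only in the test split.
HOURS = {
    "Australia": (6.95, 45.36, 4.33, 0.46),
    "Canada": (6.79, 41.13, 1.16, 1.21),
    "England": (19.51, 119.9, 3.22, 1.65),
    "Scotland": (2.69, 16.21, 0.23, 0.16),
    "US": (64.12, 400.1, 8.32, 4.87),
    "Africa": (None, None, None, 1.71),
    "Hongkong": (None, None, None, 0.52),
    "India": (None, None, None, 0.58),
    "Ireland": (None, None, None, 1.94),
    "Malaysia": (None, None, None, 0.39),
    "Newzealand": (None, None, None, 2.11),
    "Philippines": (None, None, None, 0.90),
    "Singapore": (None, None, None, 0.64),
    "Wales": (None, None, None, 0.27),
}
SPLITS = ("train_100h", "train", "dev", "test")


def split_totals(hours=HOURS):
    """Sum the hours of every split, skipping accents absent from that split."""
    totals = {}
    for index, split in enumerate(SPLITS):
        totals[split] = sum(
            row[index] for row in hours.values() if row[index] is not None
        )
    return totals


def accent_share(accent, split, hours=HOURS):
    """Fraction of a split's hours contributed by one accent (0.0 if absent)."""
    value = hours[accent][SPLITS.index(split)]
    if value is None:
        return 0.0
    return value / split_totals(hours)[split]


if __name__ == "__main__":
    for split, total in split_totals().items():
        print(f"{split}: {total:.2f} hours")
```

With the values listed in the table, the totals come to roughly 100.06 hours for Train 100h, 622.70 for Train, 17.26 for Dev and 17.41 for Test.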

Authors

Citation

If you use this code for your research, please consider citing our work.

@misc{prabhu2023accented,
      title={Accented Speech Recognition With Accent-specific Codebooks}, 
      author={Darshan Prabhu and Preethi Jyothi and Sriram Ganapathy and Vinit Unni},
      year={2023},
      eprint={2310.15970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Distributed under the MIT License. See LICENSE for more information.