data

Dataset Creation

✅ This folder 📁 contains the train, dev and test splits used for all our experiments, along with the scripts used to generate those splits. More details can be found here. The dataset.tar.gz archive contains the following TSV files: train.tsv, train_random_100h.tsv, train_equi_100h.tsv, dev.tsv, dev_small.tsv, test.tsv and test_small.tsv.
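
As a quick sanity check, the file list above can be compared against the actual archive contents. The following is a minimal sketch, not part of the repository: the helper names `missing_splits` and `check_archive` are hypothetical, and because the directory layout inside dataset.tar.gz is not documented here, only base file names are compared.

```python
import tarfile
from pathlib import PurePosixPath

# File names taken from the list in this readme.
EXPECTED_SPLITS = [
    "train.tsv",
    "train_random_100h.tsv",
    "train_equi_100h.tsv",
    "dev.tsv",
    "dev_small.tsv",
    "test.tsv",
    "test_small.tsv",
]


def missing_splits(member_names):
    """Return the expected split files that are absent from member_names.

    Only base names are compared, since the internal directory layout
    of the archive is an unknown here.
    """
    present = {PurePosixPath(name).name for name in member_names}
    return [name for name in EXPECTED_SPLITS if name not in present]


def check_archive(archive_path="dataset.tar.gz"):
    """List the archive members and report which expected splits are missing.

    Assumption: archive_path points at a local copy of dataset.tar.gz.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return missing_splits(archive.getnames())
```

Calling `check_archive()` from the data folder should return an empty list if the archive matches the description above.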

The exact details of how these files are used can be found in our paper.

Prerequisites

Clone the repository:

git clone https://github.com/csalt-research/accented-codebooks-asr.git

Install the required python packages:

pip install -r accented-codebooks-asr/data/requirements.txt

Running the script

Go to the data folder and run the create_dataset.sh script:

cd accented-codebooks-asr/data && ./create_dataset.sh

Dataset Statistics

The statistics of train, dev and test splits used in our experiments are as follows:

Accent	Train 100h (in hours)	Train (in hours)	Dev (in hours)	Test (in hours)
Australia	6.95	45.36	4.33	0.46
Canada	6.79	41.13	1.16	1.21
England	19.51	119.9	3.22	1.65
Scotland	2.69	16.21	0.23	0.16
US	64.12	400.1	8.32	4.87
Africa	-	-	-	1.71
Hongkong	-	-	-	0.52
India	-	-	-	0.58
Ireland	-	-	-	1.94
Malaysia	-	-	-	0.39
Newzealand	-	-	-	2.11
Philippines	-	-	-	0.90
Singapore	-	-	-	0.64
Wales	-	-	-	0.27
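
The per-split totals and the set of accents that appear only in the test split can be read off the table with a few lines of Python. This is an illustrative sketch: the numbers are copied from the table above, while the names `HOURS`, `COLUMNS`, `total_hours` and `unseen_accents` are hypothetical and do not come from the repository.

```python
# Hours per accent, copied from the table above.
# Order: Train 100h, Train, Dev, Test. None marks a "-" entry.
HOURS = {
    "Australia": (6.95, 45.36, 4.33, 0.46),
    "Canada": (6.79, 41.13, 1.16, 1.21),
    "England": (19.51, 119.9, 3.22, 1.65),
    "Scotland": (2.69, 16.21, 0.23, 0.16),
    "US": (64.12, 400.1, 8.32, 4.87),
    "Africa": (None, None, None, 1.71),
    "Hongkong": (None, None, None, 0.52),
    "India": (None, None, None, 0.58),
    "Ireland": (None, None, None, 1.94),
    "Malaysia": (None, None, None, 0.39),
    "Newzealand": (None, None, None, 2.11),
    "Philippines": (None, None, None, 0.90),
    "Singapore": (None, None, None, 0.64),
    "Wales": (None, None, None, 0.27),
}

COLUMNS = ("train_100h", "train", "dev", "test")


def total_hours(column):
    """Sum the hours in one column, skipping accents with no data."""
    idx = COLUMNS.index(column)
    return sum(row[idx] for row in HOURS.values() if row[idx] is not None)


def unseen_accents():
    """Accents that have test data but no training data."""
    return [accent for accent, row in HOURS.items() if row[1] is None]


if __name__ == "__main__":
    for column in COLUMNS:
        print(f"{column}: {total_hours(column):.2f} h")
    print("test-only accents:", ", ".join(unseen_accents()))
```

The Train 100h column sums to roughly 100 hours, consistent with its name, and nine of the fourteen accents occur only in the test split.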

Authors

Darshan Prabhu - M.Tech, CSE, IIT Bombay
Preethi Jyothi - Associate Professor, CSE, IIT Bombay
Sriram Ganapathy - Associate Professor, EE, IISc Bangalore
Vinit Unni - Ph.D, CSE, IIT Bombay

Citation

If you use this code for your research, please consider citing our work.

@misc{prabhu2023accented,
      title={Accented Speech Recognition With Accent-specific Codebooks}, 
      author={Darshan Prabhu and Preethi Jyothi and Sriram Ganapathy and Vinit Unni},
      year={2023},
      eprint={2310.15970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Distributed under the MIT License. See LICENSE for more information.
