SPRING-INX Data
We are happy to release 2000 hours of manually transcribed speech data in 10 Indian languages for ASR applications.
Manually Transcribed Multilingual Indian Speech Corpus
We are releasing speech data in 10 Indian languages to encourage members of academia and industry to build speech applications for Indian languages. The total data amounts to 2000 hours. You can find more details in the arXiv paper.
Languages | Data_R1 | Data_R2 |
---|---|---|
Assamese | Download | Download |
Bengali | Download | Download |
Gujarati | Download | Download |
Hindi | Download | Download |
Kannada | Download | - |
Malayalam | Download | Download |
Marathi | Download | Download |
Odia | Download | - |
Punjabi | Download | Download |
Tamil | Download | Download |
Statistics of the Data
Languages | R1 train (hrs) | R1 dev (hrs) | R1 eval (hrs) | Total_R1 (hrs) | R2 train (hrs) | R2 dev (hrs) | R2 eval (hrs) | Total_R2 (hrs) | Total (hrs) |
---|---|---|---|---|---|---|---|---|---|
Assamese | 50.56 | 5.12 | 4.97 | 60.65 | 35.35 | 5.09 | 4.99 | 45.43 | 106.08 |
Bengali | 374.73 | 40.02 | 5.02 | 419.77 | 272.98 | 30.06 | 5.04 | 308.08 | 727.85 |
Gujarati | 175.49 | 19.61 | 4.95 | 200.05 | 58.27 | 7.19 | 5.09 | 70.55 | 270.6 |
Hindi | 316.41 | 29.68 | 5.09 | 351.18 | 137.11 | 16.23 | 5.08 | 158.42 | 509.6 |
Kannada | 82.49 | 9.66 | 4.79 | 96.94 | - | - | - | 0 | 96.94 |
Malayalam | 214.73 | 24.72 | 5.07 | 244.52 | 68.8 | 8.16 | 5.1 | 82.06 | 326.58 |
Marathi | 130.36 | 14.37 | 5.16 | 149.89 | 208.96 | 24.69 | 5.15 | 238.8 | 388.69 |
Odia | 82.49 | 9.26 | 4.73 | 96.48 | - | - | - | 0 | 96.48 |
Punjabi | 138.96 | 15.09 | 5.08 | 159.13 | 216.36 | 24.88 | 5.03 | 246.27 | 405.4 |
Tamil | 200.66 | 19.97 | 5.1 | 225.73 | 128.11 | 15.12 | 5.17 | 148.4 | 374.13 |
Total | 1766.88 | 187.5 | 49.96 | 2004.34 | 1125.94 | 131.42 | 40.65 | 1298.01 | 3302.35 |
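The totals in the statistics table above are sums of the train, dev, and eval hours. The following is a minimal Python sketch for recomputing them, for example to sanity-check a subset of languages before planning an experiment. The names `DATA_R1`, `DATA_R2`, `release_total`, `language_total`, and `grand_total` are illustrative assumptions and are not part of any tooling shipped with the corpus; the numbers are copied from the table.

```python
# Hours per split as (train, dev, eval), copied from the statistics table.
DATA_R1 = {
    "Assamese": (50.56, 5.12, 4.97),
    "Bengali": (374.73, 40.02, 5.02),
    "Gujarati": (175.49, 19.61, 4.95),
    "Hindi": (316.41, 29.68, 5.09),
    "Kannada": (82.49, 9.66, 4.79),
    "Malayalam": (214.73, 24.72, 5.07),
    "Marathi": (130.36, 14.37, 5.16),
    "Odia": (82.49, 9.26, 4.73),
    "Punjabi": (138.96, 15.09, 5.08),
    "Tamil": (200.66, 19.97, 5.1),
}

# Kannada and Odia are absent here because they have no Data_R2 release.
DATA_R2 = {
    "Assamese": (35.35, 5.09, 4.99),
    "Bengali": (272.98, 30.06, 5.04),
    "Gujarati": (58.27, 7.19, 5.09),
    "Hindi": (137.11, 16.23, 5.08),
    "Malayalam": (68.8, 8.16, 5.1),
    "Marathi": (208.96, 24.69, 5.15),
    "Punjabi": (216.36, 24.88, 5.03),
    "Tamil": (128.11, 15.12, 5.17),
}


def release_total(release, language):
    """Hours for one language in one release; 0.0 if the language is missing."""
    return round(sum(release.get(language, ()), 0.0), 2)


def language_total(language):
    """Combined Data_R1 + Data_R2 hours for one language."""
    return round(release_total(DATA_R1, language) + release_total(DATA_R2, language), 2)


def grand_total():
    """Combined hours over all languages and both releases."""
    return round(sum(language_total(lang) for lang in DATA_R1), 2)


if __name__ == "__main__":
    # Print one row per language: Total_R1, Total_R2, and the combined total.
    for lang in DATA_R1:
        print(f"{lang:<10} {release_total(DATA_R1, lang):>8.2f} "
              f"{release_total(DATA_R2, lang):>8.2f} {language_total(lang):>8.2f}")
    print(f"{'Total':<10} {grand_total():>26.2f}")
```

Rounding to two decimals after each sum keeps floating-point noise out of the comparison with the published figures.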
Source of Data
This data was collected on a paid basis through the following vendors: Mediscribe India, Desicrew, and Crescendo. Preliminary quality checks of the transcriptions were done by our partners at KL University and by SPRING Lab members. The data consists mostly of mock conversations and monologues on different topics.
NLTM Pilot and NPTEL Curated Data
Indian Language ASR Challenge Data, 490 Hours: Links
English 190 hrs | Hindi 188 hrs | Tamil 112 hrs |
---|---|---|
Audio | Audio | Audio |
Eval Set Audio | Eval Set Audio | |
Transcripts | Transcripts | Transcripts |
Dictionary | Dictionary | Dictionary |
Indian Language Data Balance (Excluding the Challenge Data), 410 Hours: Links
English 110 hrs | Hindi 262 hrs | Tamil 38 hrs |
---|---|---|
Audio | Audio | Audio |
Transcripts | Transcripts | Transcripts |
NPTEL Data, 1200 Hours: Links
Computer Science 250 hrs | Electrical 250 hrs | Humanities 250 hrs | Mechanical 250 hrs | BioChem 200 hrs |
---|---|---|---|---|
Audio | Audio | Audio | Audio | Audio |
Transcripts | Transcripts | Transcripts | Transcripts | Transcripts |
Funding
This data collection effort was funded by the Ministry of Electronics & Information Technology.
License
This data is released under the CC BY 4.0 license.