SpeeD-IL: Bridging the Speech Technology Gap in India
Creating speech datasets and models for underrepresented Indian languages to foster inclusive speech technology.
Funded by: Mission Bhashini, Government of India
The Data Divide in Indian Speech Technology
Disparities in speech technology support highlight the urgent need for inclusive initiatives.
Dominance of a Few Languages
Most speech technology products support well-resourced languages like English and Hindi, leaving many Indian languages behind due to limited commercial support.
Lack of Speech Datasets
The primary obstacle is the scarcity of speech datasets, especially for non-scheduled Indo-Aryan and Dravidian languages and for scheduled languages from the Tibeto-Burman and Austro-Asiatic families.
Equitable Access Needed
Initiatives that collect and curate speech data for underrepresented languages are needed to ensure equitable access to speech technology.
Addressing the Gap: SpeeD-IL Project
Large-Scale Datasets
Creating transcribed speech datasets for Indian languages.
Model Development
Developing robust speech models for these languages.
Bridging Linguistic Diversity
Focusing on under-resourced languages from Tibeto-Burman and Austro-Asiatic families.
Key Objectives of the SpeeD-IL Project
1
Data Collection
Gathering 100 hours of speech data for each of at least 10 underrepresented languages within each major language family.
2
Phone Set Development
Creating comprehensive phone sets for each language under study to facilitate accurate phonetic representation.
3
ASR System Development
Building baseline Automatic Speech Recognition (ASR) systems for each language to enable practical applications.
4
Language Models
Constructing language models to improve the accuracy and fluency of speech recognition systems.
5
Pre-trained Models
Building pre-trained models for each language family under study from the data collected in the project, to enable further fine-tuning.
Sub-Projects: SpeeD-TB & SpeeD-IA
SpeeD-TB
Focuses on Tibeto-Burman languages, addressing a critical gap and supporting regional diversity.
SpeeD-IA
Dedicated to Indo-Aryan languages, aiming to provide resources for marginalised communities.
Fostering Collaboration and Open Access
Public Availability
Making datasets and pre-trained models publicly available through appropriate platforms.
Licensing
Using the CC BY-NC-SA 4.0 license for the datasets to encourage sharing and adaptation for non-commercial purposes.
Model Licensing
Using the AGPL v3 license for the models to ensure open-source access and community-driven improvements.
Expected Outcomes and Impact
4
Language Families
Focusing on the four major language families of India to ensure broad coverage.
10+
Languages per Family
Including at least 10 underrepresented languages in each family for data collection.
4000+
Hours of Speech Data
Aiming to collect approximately 100 hours of transcribed speech per language, or at least 4000 hours across all four families.
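As a quick check on the headline figure, the sketch below multiplies out the planning targets given in the key objectives: four language families, at least 10 languages per family, and 100 hours per language. The variable and function names are illustrative assumptions and are not part of any project tooling.

```python
# Planning figures quoted in this document; all names here are illustrative.
LANGUAGE_FAMILIES = 4          # major language families of India
MIN_LANGUAGES_PER_FAMILY = 10  # "at least 10" languages per family
HOURS_PER_LANGUAGE = 100       # target hours of speech per language


def minimum_total_hours(families: int, languages_per_family: int, hours_per_language: int) -> int:
    """Return the lower bound on total hours implied by the per-language target."""
    return families * languages_per_family * hours_per_language


total = minimum_total_hours(LANGUAGE_FAMILIES, MIN_LANGUAGES_PER_FAMILY, HOURS_PER_LANGUAGE)
print(f"Minimum target: {total} hours")  # prints "Minimum target: 4000 hours"
```

Because the language count is a minimum, the result is a lower bound, which matches the "4000+" shown above.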
The SpeeD-IL project is poised to make a significant impact on the landscape of speech technology in India, driving inclusivity, fostering innovation, and empowering communities through accessible language technologies.
© 2022-25 UnReaL-TecE LLP. All rights reserved.