MorphGen: Morphological Generators for Indian Languages
MorphGen generates inflectional wordforms and their morphological features for more than 10 Indian languages, supporting synthetic dataset creation and system development.
Supported by: Oxford Languages
Core Objectives
1. Inflectional Generation
Generate all inflectional wordforms from a given lemma or wordform.
2. Morphological Features
Produce complete morphological features for each generated form.
3. Rule Application
Optionally output the specific rule applied during form generation.
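The three objectives can be illustrated with a minimal Python sketch. It is not MorphGen's actual implementation or API: the `Rule` and `Form` classes, the `generate` function, the rule identifiers, and the feature names are all hypothetical, and the example uses a simplified paradigm for the Hindi masculine noun "ladka" (boy) in informal romanization.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    rule_id: str       # hypothetical identifier, not a real MorphGen rule name
    pattern: str       # regular expression matched against the lemma
    replacement: str   # text substituted for the matched part
    features: dict     # morphological features of the resulting form


@dataclass
class Form:
    wordform: str
    features: dict
    rule_id: Optional[str] = None


# Simplified, assumed paradigm for masculine nouns ending in -a.
RULES = [
    Rule("M_A_DIR_SG", r"a$", "a", {"case": "direct", "number": "singular"}),
    Rule("M_A_DIR_PL", r"a$", "e", {"case": "direct", "number": "plural"}),
    Rule("M_A_OBL_SG", r"a$", "e", {"case": "oblique", "number": "singular"}),
    Rule("M_A_OBL_PL", r"a$", "on", {"case": "oblique", "number": "plural"}),
]


def generate(lemma, rules=RULES, with_rule=False):
    """Return every form the rules produce for a lemma, with its features.

    The rule identifier is attached only when with_rule is True.
    """
    forms = []
    for rule in rules:
        surface, matches = re.subn(rule.pattern, rule.replacement, lemma, count=1)
        if matches == 0:
            continue  # this rule does not apply to the lemma
        rule_id = rule.rule_id if with_rule else None
        forms.append(Form(surface, dict(rule.features), rule_id))
    return forms


for form in generate("ladka", with_rule=True):
    print(form.wordform, form.features, form.rule_id)
```

Keeping the rule identifier optional mirrors the third objective: the same call yields plain form and feature pairs by default, and a traceable record when the applied rule is requested.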
Rule-Based System Architecture
In-House Rule Language
Built on regular expressions, the custom rule-expression language simplifies rule editing.
Language Support
Currently supports more than 10 Indian languages, with ongoing expansion.
Applications of Morphological Generators
Synthetic Dataset Generation
Create large-scale synthetic datasets for training NLP models.
Generative System Building
Develop generative systems adaptable to various hardware configurations.
Supported Indian Languages
Hindi
A widely spoken language in North India.
Bengali
Dominant in West Bengal and Bangladesh.
Assamese
Spoken in the Assam region.
Marathi
Predominant in Maharashtra.
Odia
Spoken in Odisha.
Nepali
Official language of Nepal and spoken in Sikkim.
Sanskrit
An ancient and classical language.
Punjabi
Spoken in the Punjab region.
MorphGen Language
The MorphGen language is a custom rule language built on top of regular expressions for describing the morphological rules of Indian languages. It offers a concise, intuitive way to specify complex patterns and transformations: it keeps the power of regular expressions while adding higher-level abstractions tailored to morphological analysis and accessible to non-technical experts.
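The real MorphGen syntax is not shown here, so the following Python sketch only illustrates the general idea of layering a friendlier notation over regular expressions. Everything in it is an assumption made for illustration: the `=>` rule notation, the `{C}` and `{V}` macros, and the `compile_rule` and `apply_rule` helpers are invented and are not part of MorphGen.

```python
import re

# Hypothetical named character classes that a rule author can use
# instead of spelling out the underlying regular expression.
MACROS = {
    "C": "[bcdfghjklmnpqrstvwxyz]",  # any consonant (romanized, simplified)
    "V": "[aeiou]",                  # any vowel (romanized, simplified)
}


def compile_rule(text):
    """Turn 'pattern => replacement' into a (compiled regex, replacement) pair."""
    lhs, rhs = (part.strip() for part in text.split("=>", 1))
    # Expand each {NAME} macro into its regular-expression character class.
    expanded = re.sub(r"\{(\w+)\}", lambda m: MACROS[m.group(1)], lhs)
    return re.compile(expanded), rhs


def apply_rule(rule, word):
    """Apply a compiled rule once; return None when the rule does not match."""
    pattern, replacement = rule
    result, matches = pattern.subn(replacement, word, count=1)
    return result if matches else None


# "A final -a after a consonant becomes -e."
plural = compile_rule(r"({C})a$ => \1e")
print(apply_rule(plural, "ladka"))
print(apply_rule(plural, "ladki"))
```

The design point is that the rule author writes `{C}` rather than a full character class, while the engine still runs ordinary regular expressions underneath.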
Synthetic Data Generation Process
1. Define Lemmas
2. Apply Rules
3. Generate Forms
4. Output Dataset
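The four steps above can be sketched end to end in Python. This is an assumed pipeline, not MorphGen's own: the lemma list, the rule tuples, the record fields, and the JSON Lines output format are hypothetical choices made for illustration, using two romanized Hindi nouns.

```python
import json
import re
import sys

# Step 1: define lemmas (romanized Hindi "boy" and "room").
LEMMAS = ["ladka", "kamra"]

# Hypothetical rules: (rule id, regex pattern, replacement, features).
RULES = [
    ("DIR_SG", r"a$", "a", {"case": "direct", "number": "singular"}),
    ("DIR_PL", r"a$", "e", {"case": "direct", "number": "plural"}),
]


def build_dataset(lemmas, rules):
    """Steps 2 and 3: apply every rule to every lemma and collect the forms."""
    records = []
    for lemma in lemmas:
        for rule_id, pattern, replacement, features in rules:
            wordform, matches = re.subn(pattern, replacement, lemma, count=1)
            if matches:
                records.append({
                    "lemma": lemma,
                    "wordform": wordform,
                    "features": features,
                    "rule": rule_id,
                })
    return records


def write_jsonl(records, fp):
    """Step 4: write one JSON object per line to an open text file."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl(build_dataset(LEMMAS, RULES), sys.stdout)
```

Passing `ensure_ascii=False` keeps native-script wordforms readable in the output file if the lemmas are written in Devanagari or another Indic script rather than in romanization.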
Advantages of Our Approach
1. Accuracy
High precision in morphological generation.
2. Flexibility
Easy to adapt to new languages and rules.
3. Efficiency
Fast generation of morphological forms.
4. Transparency
Rule-based generation, so the rule behind each generated form can be inspected.
© 2022-25 UnReaL-TecE LLP. All rights reserved.