Supervised
Grapheme-to-Phoneme Conversion of
Orthographic Schwas
in Hindi and Punjabi

Aryaman Arora, Luke Gessler, Nathan Schneider

Contributions

  • Motivation: Hindi and Punjabi text-to-speech
  • A state-of-the-art machine learning model for schwa deletion, the hardest part of Hindi G2P conversion
  • The first computational model in the literature for Punjabi schwa deletion
  • Several scripts for parsing entries from the Digital Dictionaries of South Asia datasets
  • Public release of all our code

Abugidas

  • Somewhere between an alphabet and a syllabary
  • Orthographic unit: a consonant and a vowel diacritic, or a vowel by itself
  • Employed in South Asia, Southeast Asia, Ethiopia, and Canada

Brahmi

Devanagari

  • Hindi uses the Devanagari script
  • G2P conversion is mostly trivial: a simple substitution of Devanagari graphemes with phonemes

प ⟨p⟩ + ा ⟨ɑː⟩ = पा ⟨pɑː⟩

प ⟨p⟩ + े ⟨eː⟩ = पे ⟨peː⟩
  • One major exception (i.e. why this paper exists): schwa deletion
If a consonant's vowel marker is optional, what is a consonant with no vowel diacritic?

प ⟨p⟩ + ∅ = प ⟨pə⟩
  • Orthographically, every plain consonant carries an inherent schwa ⟨ə⟩
  • Phonologically, due to historical changes, the inherent schwa is sometimes not pronounced
    • No straightforward rules have been found by linguists

जंगली 'forested' = जं + ग + ली
Orthographic ⟨d͡ʒəŋ gə liː⟩
Phonemic /d͡ʒəŋ g liː/ (the schwa of ग is deleted)
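Below is a minimal Python sketch of the naive substitution rule (the character-to-phoneme tables are illustrative and cover only these examples; anusvara is simplified to [ŋ]). It reproduces the orthographic forms above, including the medial schwa in जंगली that is not actually pronounced.

```python
# Minimal sketch of naive Devanagari G2P by per-character substitution.
# The phoneme tables cover only the examples on these slides; anusvara
# is simplified to [ŋ], which happens to be correct before a velar.

CONSONANTS = {"प": "p", "ज": "d͡ʒ", "ग": "g", "ल": "l"}
VOWEL_SIGNS = {"ा": "ɑː", "े": "eː", "ी": "iː"}
ANUSVARA = {"ं": "ŋ"}
VIRAMA = "्"

def naive_g2p(word: str) -> str:
    phones = []
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else None
            # Inherent schwa unless a vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                phones.append("ə")
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch in ANUSVARA:
            phones.append(ANUSVARA[ch])
    return "".join(phones)

print(naive_g2p("पा"))     # pɑː
print(naive_g2p("प"))      # pə
print(naive_g2p("जंगली"))  # d͡ʒəŋgəliː, but the word is pronounced /d͡ʒəŋgliː/
```

The naive converter outputs the fully orthographic form ⟨d͡ʒəŋgəliː⟩; deciding which of those schwas to drop is the task the rest of the talk addresses.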

Past Approaches

  • Linguistic accounts have explained it in terms of prosody (hierarchical phonological structures) or phonotactics (linear phonological sequences)
  • Computational G2P systems have followed suit

Prosodic Structure

(Tyson and Nagar 2009)

Schwa Deletion in 2020

  • Typical approach to G2P conversion: neural seq2seq
  • We approach schwa deletion as a binary classification problem: either a schwa is deleted or it is not
  • Machine learning!
    • Johny and Jansche (2018) proposed a novel machine learning approach to Bengali schwa realization

Methodology

Training Data and Features

  • Scraped orthographic-phonemic pairs from dictionaries (two Hindi, one Punjabi)
  • Force-aligned each pair to find missing schwas in the phonemic form
  • Featurization (a code sketch follows below)
    • For each phoneme in a k-wide window around the schwa, encode its phonological features (e.g. vowel length, place of articulation)
  • Tuned these parameters using grid search
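A rough Python sketch of the forced alignment and windowed featurization described above; the feature inventory, helper names, and window handling are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: label each orthographic schwa as deleted/retained by aligning
# the orthographic phoneme sequence with the phonemic one, then build a
# windowed feature dict for each schwa instance. FEATURES is a toy
# stand-in for a full phonological feature inventory.

FEATURES = {
    "ə":  {"type": "vowel", "length": "short", "place": "central"},
    "iː": {"type": "vowel", "length": "long",  "place": "front"},
    "d͡ʒ": {"type": "consonant", "length": "n/a", "place": "postalveolar"},
    "ŋ":  {"type": "consonant", "length": "n/a", "place": "velar"},
    "g":  {"type": "consonant", "length": "n/a", "place": "velar"},
    "l":  {"type": "consonant", "length": "n/a", "place": "alveolar"},
}
PAD = {"type": "pad", "length": "pad", "place": "pad"}

def align_schwas(orth, phon):
    """Return (index, label) for each schwa in the orthographic phoneme
    sequence; label 1 = deleted, 0 = retained. Assumes the phonemic form
    differs from the orthographic one only by missing schwas."""
    labels, j = [], 0
    for i, p in enumerate(orth):
        if p == "ə":
            if j < len(phon) and phon[j] == "ə":
                labels.append((i, 0))
                j += 1
            else:
                labels.append((i, 1))
        else:
            j += 1  # non-schwa phonemes are assumed to match
    return labels

def featurize(orth, schwa_index, k=2):
    """Phonological features of every phoneme in a k-wide window on each
    side of the schwa, keyed by relative offset."""
    row = {}
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        i = schwa_index + offset
        feats = FEATURES.get(orth[i], PAD) if 0 <= i < len(orth) else PAD
        for name, value in feats.items():
            row[f"{offset}:{name}"] = value
    return row

# जंगली: orthographic ⟨d͡ʒ ə ŋ g ə l iː⟩ vs. phonemic /d͡ʒ ə ŋ g l iː/
orth = ["d͡ʒ", "ə", "ŋ", "g", "ə", "l", "iː"]
phon = ["d͡ʒ", "ə", "ŋ", "g", "l", "iː"]
for idx, label in align_schwas(orth, phon):   # [(1, 0), (4, 1)]
    print(label, featurize(orth, idx))
```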
Dictionary                                       # Schwa Instances
McGregor (The Oxford Hindi-English Dictionary)             36,183
Bahri (Learners' Hindi-English Dictionary)                 14,082
Google (Johny and Jansche, 2019)                            1,098
Singh (The Panjabi Dictionary)                             34,576

Datasets we scraped and released.

An example entry from the Hindi training dataset.

Models

  • Logistic regression (Sklearn)
  • Multilayer perceptron neural network (Sklearn)
  • Gradient boosted decision trees (XGBoost)
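A hedged sketch of how the three model families could be trained on the windowed feature dicts (toy data and placeholder hyperparameters; the paper's settings came from the grid search mentioned earlier).

```python
# Sketch: vectorize the feature dicts and fit the three model families.
# X_dicts / y are toy placeholders; in practice they come from the
# alignment + featurization step over the dictionary data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X_dicts = [
    {"-1:place": "velar", "1:place": "alveolar", "1:type": "consonant"},
    {"-1:place": "labial", "1:place": "pad", "1:type": "pad"},
]
y = [1, 0]  # 1 = schwa deleted, 0 = retained

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "multilayer perceptron": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
    "gradient boosted trees": XGBClassifier(n_estimators=100, max_depth=4),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict(X))
```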

Results

  • The XGBoost model, with the same hyperparameters as for Hindi, achieves 95.00% accuracy on Punjabi
  • First schwa deletion model for Punjabi

Discussion

Error Analysis

Analysis

  • XGBoost produces best-fit decision trees that are human-readable
  • The system can learn phonotactics because we provide neighbouring phonemes as features
    • Learning prosodic rules is harder because our features do not include e.g. syllable boundaries or weights
    • In practice, phonotactics alone turns out to be more than enough
  • We can use the trees to derive phonotactic rules for schwa deletion (see the sketch below)
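One way to inspect the learned trees, assuming the fitted XGBClassifier and DictVectorizer from the training sketch above (illustrative; not the paper's analysis code).

```python
# Dump the boosted trees as text. Splits appear as f<index> conditions;
# the DictVectorizer's feature names map those indices back to
# phonological conditions such as "1:type=consonant".
booster = models["gradient boosted trees"].get_booster()
names = vec.get_feature_names_out()   # maps f0, f1, ... to feature names
for i, tree in enumerate(booster.get_dump()[:3]):  # first few trees
    print(f"--- tree {i} ---")
    print(tree)
print(list(names))
```

Each dumped tree is a nested list of splits and leaf scores, which can be read off directly as if/else conditions on the neighbouring phonemes.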

Example

Conclusion

  • We presented state-of-the-art schwa deletion models for Hindi and Punjabi (with code!) and accompanying datasets
  • Future research avenues
    • "Weakened schwas" were marked in the McGregor dataset but what that means needs to be investigated (dialectal variation? phonemic or phonetic distinction?)
    • Other Indo-Aryan languages still need schwa deletion models for G2P conversion but datasets are not easily available

Thank you!