Things to do eventually.

This is a list of cool and interesting questions in computational linguistics and/or just linguistics that I don’t think have been addressed (enough) in past research. I’ll probably work on some of these at some point.

Things I need to learn

  • Literally anything substantial about statistics and calculus for ML
  • PyTorch
  • Deep learning
  • Probabilistic machine learning. Van de Meent et al. (2018), Pyro.
  • BERT, transformers, and attention. Vaswani et al. (2017)
  • At least the basics about some grammar frameworks: LFG, dependency grammars, Minimalism, HPSG.
  • Optimality Theory

South Asian language documentation

  • How can we assess where we are so far and what languages need to be studied? There's so much research work that it's hard to make sense of it all.
    Bhāṣācitra is a start.
  • Find speakers of and study as many undocumented languages of the region as possible. Unsure if I will have time for this after university.
    Kholosi is a start. Follow up with Rajaswa Patil on documenting Gujjari.
  • How can computational methods help? Check out Wav2Vec for dealing with speech, and more generally learn about low-resource NLP.

South Asian historical linguistics

  • The SARVA project is dead. Revive it and make it better (make a database that is usable, not a text file, and look towards computational identification of cognates, alignment of phonemes, strata detection, and so on).
    ✅ Created Jambu for the CDIAL. See how to incorporate Suresh Kolichala's DEDR. Write a paper at some point (computational venue and linguistic venue?) and go get more etymologised word lists from documentary linguists.
  • We need a massive diachronic database for all the literary languages of South Asia. How can we figure out anything with high confidence if we haven't seen all the data?
  • What is the linguistic history of Sanskrit? What kinds of language contact (particularly with now lost substrata languages) influenced its development? Can we computationally model this sort of language change?
    See Witzel (1999) for a comprehensive overview of substrata in Vedic Sanskrit. Hock (1975) seems to have first proposed the idea. Modern research seems stagnant.
  • Do the language isolates of South Asia actually provide substrate loans into modern IA/Dravidian/Munda languages or have they been too isolated to do so?
    I'm not sure of the literature here, but I think the signs point to "yes" for Burushaski, "maybe" for Nihali, and murkier for the others.
  • Can substrate vocabulary be automatically identified with phonotactic LMs? Is the time depth too great (presumably, substrate loans are assimilated more readily than adstrate loans)?
    We know that NIA languages have a lot of unidentified borrowings (e.g. Masica's work on Hindi agricultural vocab) so it would be nice for computational tools, like phonotactic language models, to be able to help here.
    ✅ Working on a paper with Chundra Cathcart. Need to actually learn probabilistic ML stuff. This is only a start at tackling the issue.
  • How much of the substrate vocabulary of IA is actually Munda or Dravidian, and how much was contributed to all three by some language X ("Nishadic", which may really be several unrelated languages)?
  • We need a systematic study of toponyms, extending Southworth's (2005) work on Maharashtra, towards the goal of identifying clear substrate languages. This is how e.g. Gaulish has been reconstructed.
  • Can we model sound change computationally when taking into account the ramifications of continuous language contact? Broadly, can we reconcile the issues with the tree model for South Asian language history using computational approaches?
    The work here is being done by Chundra Cathcart and Taraka Rama, mainly the former. Cathcart mentioned in emails a probabilistic ML architecture for this problem, but it does not seem to converge well. Can we take a deep learning approach? Check out Meloni et al. (2019) for Romance reconstruction.
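The phonotactic-LM idea above can be sketched with a toy model. This is a minimal illustration, not the actual model from the paper with Cathcart: an add-alpha smoothed phoneme-bigram LM trained on a made-up "native" wordlist, with high-perplexity words flagged as candidate borrowings. All wordlists here are invented for illustration.

```python
import math
from collections import Counter

def train_bigram(words, alpha=1.0):
    """Train an add-alpha smoothed phoneme-bigram model.
    Words are sequences of phoneme symbols; '#' marks word boundaries."""
    bigrams, unigrams = Counter(), Counter()
    vocab = {"#"}
    for w in words:
        seq = ["#"] + list(w) + ["#"]
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    V = len(vocab)
    def logprob(a, b):
        # P(b | a) with add-alpha smoothing over the phoneme vocabulary.
        return math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
    return logprob

def perplexity(logprob, word):
    """Per-symbol perplexity of a word under the bigram model."""
    seq = ["#"] + list(word) + ["#"]
    lp = sum(logprob(a, b) for a, b in zip(seq, seq[1:]))
    return math.exp(-lp / (len(seq) - 1))

# Toy "native" lexicon (hypothetical CV-shaped words, not real data).
native = ["kata", "pati", "naka", "tani", "kapa", "tika", "pana"]
lm = train_bigram(native)

# A word that violates the training phonotactics scores worse
# than one that conforms to them.
print(perplexity(lm, "paka"))   # conforms to the CV pattern: lower perplexity
print(perplexity(lm, "zdroj"))  # alien phonotactics: higher perplexity
```

A real setup would use a neural LM over IPA segments and a curated inherited-vocabulary training set, but the train-then-flag-outliers workflow is the same.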

Phonotactics

  • Do the perplexities calculated by a phonotactic LM (which I guess is a generative phonology model; it doesn't learn restrictions like the MaxEnt model) correlate with OT weighted violations?
    Nelson and Mayer (2020) is one of the papers that came up with this model.
  • Can we model multilingual schwa deletion across Indo-Aryan with a phonotactic LM? What kinds of linguistic generalisations can we even extract from that kind of model?
  • Can other architectures model phonotactics better than unidirectional RNNs? Especially curious about bidirectional models. I should probably read up on how attention works; it may be useful here too.
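On the OT correlation question above: the MaxEnt side is just a weighted sum of constraint violations, with P(form) ∝ exp(harmony), so the quantity to correlate against LM log-probabilities is easy to compute. A toy sketch with entirely made-up constraints, weights, and words (not from any real grammar):

```python
import math

# Hypothetical MaxEnt phonotactic grammar: constraint -> weight.
# Constraints and weights are invented for illustration.
WEIGHTS = {"*COMPLEX_ONSET": 3.0, "*CODA": 1.5, "*NASAL_CODA": 0.5}

VOWELS = set("aeiou")

def violations(word):
    """Count violations of each toy constraint in a word."""
    v = {c: 0 for c in WEIGHTS}
    # *COMPLEX_ONSET: two or more consonants before the first vowel.
    onset = 0
    for ch in word:
        if ch in VOWELS:
            break
        onset += 1
    if onset >= 2:
        v["*COMPLEX_ONSET"] += 1
    # *CODA / *NASAL_CODA: word ends in a consonant (nasal or otherwise).
    if word[-1] not in VOWELS:
        v["*CODA"] += 1
        if word[-1] in "mn":
            v["*NASAL_CODA"] += 1
    return v

def harmony(word):
    """MaxEnt harmony: negative sum of weighted constraint violations."""
    v = violations(word)
    return -sum(WEIGHTS[c] * v[c] for c in v)

# A form's probability is proportional to exp(harmony), so comparing
# harmony against an LM's log-probability is a simple correlation test.
for w in ["pata", "pat", "pan", "spra"]:
    print(w, harmony(w), math.exp(harmony(w)))
```

The empirical question is then whether the LM's per-word log-probability tracks these harmony scores, even though the LM never sees explicit constraints.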

Compound verbs

  • What the heck are compound verbs doing in South Asian languages?
    Read Benjamin Slade's papers and scour his bibliography.

Case and adpositions

  • SNACS for Hindi-Urdu and eventually other South Asian languages. Can we learn cool linguistic stuff from it? Train a classifier on Hindi data (use several architectures), try to do transfer learning for other languages.
    Arora, Venkateswaran, and Schneider (2021) for guidelines. Classifier paper for EMNLP 2021.
  • What do language models learn about case? Like, do they learn the [+definite] context for the Hindi accusative? Would be a neat linguistic investigation.
  • Is it useful for downstream tasks if models learn this kind of stuff?
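Before any of the neural classifiers above, the standard sanity check is a most-frequent-label baseline. A sketch with made-up Hindi adpositions and SNACS labels (illustrative, not real counts from the annotated corpus):

```python
from collections import Counter, defaultdict

# Toy annotated examples: (adposition, SNACS supersense).
# Markers and label frequencies are invented for illustration.
train = [
    ("ko", "Recipient"), ("ko", "Recipient"), ("ko", "Theme"),
    ("se", "Instrument"), ("se", "Source"), ("se", "Instrument"),
    ("mem", "Locus"), ("par", "Locus"),
]

def fit_baseline(examples):
    """Most-frequent-label-per-adposition baseline for SNACS tagging."""
    counts = defaultdict(Counter)
    for adp, label in examples:
        counts[adp][label] += 1
    # Global most frequent label, used as a back-off.
    global_mfl = Counter(l for _, l in examples).most_common(1)[0][0]
    def predict(adp):
        if adp in counts:
            return counts[adp].most_common(1)[0][0]
        return global_mfl  # unseen adposition: back off to the global label
    return predict

predict = fit_baseline(train)
print(predict("ko"))   # most frequent label seen with "ko" in training
print(predict("se"))
```

Any trained classifier (and any transfer-learning setup for other languages) should beat this baseline by a clear margin, or it isn't learning much beyond lexical identity.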