Aryaman Arora

B.S. in Computer Science and Linguistics
Georgetown University, Class of 2024

Who am I?

I'm Aryaman [आर्यमन /ˈäːɾjəmən/], a freshman at Georgetown! I'm doing computational linguistics research as a member of NERT, Dr. Nathan Schneider's research group at Georgetown. I'm excited about technologies for South Asian languages and linguistic work on Indo-Aryan languages.

I'm an admin on the English Wiktionary where I manage South Asian language documentation. I also do programming competitions and hackathons on occasion.

I like to bicycle (though not very well), read (in English and Hindi, but the latter not very well), and eat unreasonable amounts of food relative to my size. I'd also say I like to go places, but we're in the middle of a pandemic.

CV My Blog

News

  • Feb 15 2021: Presenting "SNACS Annotation of Case Markers and Adpositions in Hindi" at SCiL 2021.
  • Dec 14 2020: Attending COLING.
  • Nov 16 2020: Attending EMNLP and presenting at SIGTYP.
  • Aug 27 2020: My research is in the news.
  • Aug 3 2020: Starting undergrad at Georgetown.
  • Jul 21 2020: Presented at ACL (my first conference!) and wrote about it.
  • Apr 22 2020: Officially registered our nonprofit, Washingtutors, an online tutoring service for DC Public School students.

Research

My research work has included adposition and case supersenses applied to various languages, grapheme-to-phoneme conversion of Indo-Aryan languages, and other work on Hindi from a computational perspective.

I've worked with a number of coauthors on these publications.


2021
  • Adposition and Case Supersenses v1.0: Guidelines for Hindi–Urdu Aryaman Arora, Nitin Venkateswaran, Nathan Schneider

    These are the guidelines for the application of SNACS (Semantic Network of Adposition and Case Supersenses; Schneider et al., 2018) to Modern Standard Hindi of Delhi. SNACS is an inventory of 50 supersenses (semantic labels) for labelling the use of adpositions and case markers with respect to both lexical-semantic function and relation to the underlying context. The English guidelines (Schneider et al., 2020) were used as a model for this document. Besides the case system, Hindi has an extremely rich adpositional system built on the oblique genitive, with productive incorporation of loanwords even in present-day Hinglish. This document is aligned with version 2.5 of the English guidelines.

    @misc{arora-etal-2021-guidelines,
        title = "Adposition and Case Supersenses v1.0: Guidelines for {H}indi–{U}rdu",
        author = "Arora, Aryaman  and
            Venkateswaran, Nitin  and
            Schneider, Nathan",
        year={2021},
        eprint={2103.01399},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2103.01399}
    }
                                        
  • SNACS Annotation of Case Markers and Adpositions in Hindi Aryaman Arora, Nitin Venkateswaran, Nathan Schneider SCiL

    We present in-progress annotation of semantic relations expressed through adpositions and case markers in a Hindi corpus. We used the multilingual SNACS annotation scheme, which has been applied to a variety of typologically diverse languages. Annotation problems in Hindi are examined and used to suggest changes to SNACS. We look towards finalizing the corpus and using it for future work in typology and semantic role-dependent tasks.

    @inproceedings{arora-etal-2021-snacs,
        title = "{SNACS} Annotation of Case Markers and Adpositions in {H}indi",
        author = "Arora, Aryaman  and
            Venkateswaran, Nitin  and
            Schneider, Nathan",
        booktitle = "Proceedings of the Society for Computation in Linguistics",
        volume = "4",
        year = "2021",
        address = "Online",
        publisher = "Society for Computation in Linguistics",
        url = "https://scholarworks.umass.edu/scil/vol4/iss1/57/",
        pages = "454--458",
    }
                                        
2020
  • PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, Nathan Schneider LAW

    We present the Prepositions Annotated with Supersense Tags in Reddit International English (“PASTRIE”) corpus, a new dataset containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish. The annotations are comprehensive, covering all preposition types and tokens in the sample. Along with the corpus, we provide analysis of distributional patterns across the included L1s and a discussion of the influence of L1s on L2 preposition choice.

    @inproceedings{kranzlein-etal-2020-pastrie,
        title = "{PASTRIE}: A Corpus of Prepositions Annotated with Supersense Tags in {R}eddit International {E}nglish",
        author = "Kranzlein, Michael  and
            Manning, Emma  and
            Peng, Siyao  and
            Wein, Shira  and
            Arora, Aryaman  and
            Schneider, Nathan",
        booktitle = "Proceedings of the 14th Linguistic Annotation Workshop",
        month = dec,
        year = "2020",
        address = "Barcelona, Spain",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.law-1.10",
        pages = "105--116"
    }
  • SNACS Annotation of Case Markers and Adpositions in Hindi Aryaman Arora, Nathan Schneider SIGTYP

    The use of specific case markers and adpositions for particular semantic roles is idiosyncratic to every language. This poses problems in many natural language processing tasks such as machine translation and semantic role labelling. Models for these tasks rely on human-annotated corpora as training data. There is a lack of corpora in South Asian languages for such tasks. Even Hindi, despite being a resource-rich language, is limited in available labelled data. This extended abstract presents the in-progress annotation of case markers and adpositions in a Hindi corpus, employing the cross-lingual scheme proposed by Schneider et al. (2017), Semantic Network of Adposition and Case Supersenses (SNACS). The SNACS guidelines we developed also apply to Urdu. We hope to finalize this corpus and develop NLP tools making use of the dataset, as well as promote NLP for typologically similar South Asian languages.

  • Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi Aryaman Arora, Luke Gessler, Nathan Schneider ACL

    Hindi grapheme-to-phoneme (G2P) conversion is mostly trivial, with one exception: whether a schwa represented in the orthography is pronounced or unpronounced (deleted). Previous work has attempted to predict schwa deletion in a rule-based fashion using prosodic or phonetic analysis. We present the first statistical schwa deletion classifier for Hindi, which relies solely on the orthography as the input and outperforms previous approaches. We trained our model on a newly-compiled pronunciation lexicon extracted from various online dictionaries. Our best Hindi model achieves state of the art performance, and also achieves good performance on a closely related language, Punjabi, without modification.

    @inproceedings{arora-etal-2020-supervised,
        title = "Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in {H}indi and {P}unjabi",
        author = "Arora, Aryaman  and
            Gessler, Luke  and
            Schneider, Nathan",
        booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
        month = jul,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.acl-main.696",
        doi = "10.18653/v1/2020.acl-main.696",
        pages = "7791--7795",
    }
  • Quasi-Passive Lower and Upper Extremity Robotic Exoskeleton for Strengthening Human Locomotion Aryaman Arora, John R. McIntyre Sustainable Innovation

    Most of the robotic exoskeletons available today are either lower extremity or upper extremity devices targeting individual orthotic (elbow, knee, and ankle) joints. However, there are a few which target both lower and upper extremities. This chapter aims to propose a design for a wearable quasi-passive lower and upper extremity robotic exoskeleton (QLUE-REX) system, targeting disabled users and aged seniors. This exoskeleton system aims to improve mobility, assist walking, improve and enhance muscle strength, and help people with leg/arm disabilities. QLUE-REX combines elbow, knee, and ankle joints with options to synchronize individual joints’ movements to achieve the following: (1) assist in lifting loads of 30–40 kilograms, (2) assist in walking, (3) be easy and flexible to wear without any discomfort, and (4) be able to learn and adapt along with storing time-stamped sensor data on its exoskeleton storage media for predicting/correcting users’ movements and share data with health professionals. The research’s main objective is to conceptualize a design for the QLUE-REX system. QLUE-REX will be a feasible modular-type wearable system that incorporates orthotic elbow, knee, and ankle joints effectively in either synchronous or asynchronous modes depending on the users’ needs. It will utilize human-walking analysis, data sensing and estimation technology, and measurement of the electromyography signals of users’ muscles, exploiting biomechanical principles of human-machine interface.

In preparation

Current
  • Fieldwork on and linguistic documentation of the marginalised Kholosi language of Iran (website).
  • Wiktionary Data Preparer (WDP): a tool for field linguists for uploading fieldwork lexicons to Wiktionary (collaborating with Luke Gessler and Dr. Hilaria Cruz).
  • Bhāṣācitra: a map of linguistic resources for South Asia and Iran.
  • Bengali G2P: working on computational approaches to schwa deletion in Bengali (collaborating with Arundhati Sengupta).

The background image is from here. It's from one of the edicts of the Mauryan emperor Aśoka of the 3rd century BCE written in Early Middle Indo-Aryan (specifically Aśokan Prakrit).