Automatic generation of lexical recognition tests using natural language processing
Lexical recognition tests (LRTs) are widely used to assess the vocabulary size of language learners based on word recognition. In such tests, learners must differentiate between real words and crafted nonwords that closely resemble real words. A review of the literature shows that (i) LRTs are generally crafted by hand, which is a very time-consuming process, and (ii) compared to English and other European languages, Arabic is under-resourced and has received little attention in modern language learning research. Natural Language Processing (NLP) has been used successfully for a number of tasks related to language learning research. In this thesis, we investigate the use of NLP techniques for the automatic generation of LRTs, for the Arabic language in particular. The main contribution of this thesis is the exploration of the automatic generation of high-quality lexical recognition tests under two aspects: (a) nonword generation and (b) test adaptation to Arabic. Regarding (a), we find that character n-gram language models can be used to distinguish low-quality from high-quality nonwords. More precisely, high-order models incorporating position-specific information work best for the automatic generation of nonwords for English LRTs. Furthermore, we investigate the validity of the automatically generated LRTs. We conduct a user study and find that our automatically generated test yields scores that are highly correlated with those of a well-established, manually created lexical recognition test. Regarding (b), we pave the way for test adaptation to Arabic.
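To make the idea behind (a) concrete, the following is a minimal sketch of a position-aware character n-gram model that scores candidate nonwords by their average log-probability. It is an illustration under stated assumptions, not the model used in the thesis: the class name `PositionalCharNgramModel`, the padding symbols `^` and `$`, the add-alpha smoothing, and the toy word list are all hypothetical choices made for this example.

```python
import math
from collections import Counter


class PositionalCharNgramModel:
    """Character n-gram model whose counts are keyed by character position.

    Hypothetical sketch: smoothing and padding are illustrative assumptions.
    """

    def __init__(self, n=3, alpha=1.0):
        self.n = n                      # n-gram order
        self.alpha = alpha              # add-alpha smoothing constant
        self.ngram_counts = Counter()   # (position, context, char) -> count
        self.context_counts = Counter() # (position, context) -> count
        self.alphabet = set()

    def _events(self, word):
        # Pad with start symbols and one end symbol so that word
        # beginnings and endings are modelled explicitly.
        padded = "^" * (self.n - 1) + word + "$"
        for i in range(self.n - 1, len(padded)):
            position = i - (self.n - 1)
            context = padded[i - (self.n - 1):i]
            yield position, context, padded[i]

    def fit(self, words):
        for word in words:
            for position, context, char in self._events(word):
                self.ngram_counts[(position, context, char)] += 1
                self.context_counts[(position, context)] += 1
                self.alphabet.add(char)
        return self

    def score(self, candidate):
        """Average log-probability per character; higher means more word-like."""
        vocab_size = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        n_events = 0
        for position, context, char in self._events(candidate):
            numerator = self.ngram_counts[(position, context, char)] + self.alpha
            denominator = (self.context_counts[(position, context)]
                           + self.alpha * vocab_size)
            total += math.log(numerator / denominator)
            n_events += 1
        return total / n_events


# Toy training list (an assumption; a real model would use a large lexicon).
toy_words = ["cat", "car", "can", "cap", "bat", "bar", "ban", "man", "mat", "map"]
model = PositionalCharNgramModel(n=3).fit(toy_words)

# A word-like candidate should score higher than an implausible string.
print(model.score("cag") > model.score("xqz"))
```

In such a setup, candidates with scores close to those of real words would be kept as plausible nonwords, while very low-scoring strings would be discarded as too obviously artificial. Where to place that threshold is a design decision this sketch does not address.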
We address some of the NLP challenges posed by Modern Standard Arabic (MSA), in particular by the Arabic script. These challenges can be further split into (i) resource creation, (ii) the role of diacritical marks in designing Arabic LRTs (diacritics are the second class of symbols in the Arabic script, alongside letters), (iii) obtaining reliable frequency counts in Arabic, and (iv) the role of diacritics in adapting the difficulty of Arabic LRTs. Regarding (i), instead of acquiring costly corpora, we consider automatic diacritization as an alternative step towards the creation of annotated (diacritized) Arabic resources. We therefore conduct a comparative study of available tools for the automatic diacritization of Arabic text (vowel restoration). We find that Farasa outperforms all other tools in the comparison, and we use it to create diacritized Arabic resources. Regarding (ii), we observe that existing tests neglect diacritics, a very important feature of the Arabic language that disambiguates Arabic words and whose absence causes many challenges for automatic processing. We enhance Arabic LRTs by adding a new parameter: we are the first to add the lexical diacritics parameter to Arabic LRTs. We find that diacritics have the potential to better control the difficulty of the tests. Regarding (iii), we find that diacritics have a significant influence on obtaining reliable frequency counts in Arabic. We also show that a fairly good approximation can be obtained by applying automatic diacritization to non-diacritized corpora. Thus, automatic diacritization is effective for obtaining reliable frequency counts for Arabic words.
Regarding (iv), we conduct a user study to compare diacritized (using the most frequent diacritized form of a word) and non-diacritized lexical recognition tests and find that they are largely comparable. We then conduct a large-scale user study and compare the test under three conditions: No Diacritics, Frequent Diacritics, and Infrequent Diacritics. We find that diacritics can be used to construct more appropriate Arabic LRTs by using the less frequent diacritized form of a word. Furthermore, we present lugha, a Maven-based tool that covers various Arabic text preprocessing and normalization steps. Lugha can be easily integrated into Java-based NLP pipelines. We also enrich the set of annotated Arabic resources by creating several diacritized corpora for MSA.
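The relationship between diacritized forms and frequency counts described in (iii) and (iv) can be illustrated with a small sketch. It assumes that the input tokens are already diacritized (for example by an automatic diacritizer), and it treats the Unicode range U+064B to U+0652 plus U+0670 as the set of diacritics; both the function names and this character set are assumptions made for illustration and need not match the thesis or the lugha tool.

```python
import re
from collections import Counter, defaultdict

# Assumed diacritic set: fathatan through sukun (U+064B-U+0652)
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(token):
    """Return the token with all assumed diacritic marks removed."""
    return DIACRITICS.sub("", token)


def diacritized_form_counts(tokens):
    """Group diacritized tokens by their bare form and count each variant."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[strip_diacritics(token)][token] += 1
    return forms


def frequent_and_infrequent_form(form_counts):
    """Return the most and the least frequent diacritized variant."""
    ranked = form_counts.most_common()
    return ranked[0][0], ranked[-1][0]


# Toy example: two diacritized readings of the same three-letter bare form.
bare = "\u0639\u0644\u0645"                    # ain, lam, meem
reading_a = "\u0639\u0650\u0644\u0652\u0645"   # with kasra and sukun
reading_b = "\u0639\u064E\u0644\u064E\u0645"   # with two fathas

tokens = [reading_a, reading_a, reading_a, reading_b]
forms = diacritized_form_counts(tokens)

# The bare form occurs 4 times in total, split 3:1 across its readings.
print(sum(forms[bare].values()), forms[bare][reading_a], forms[bare][reading_b])
print(frequent_and_infrequent_form(forms[bare]) == (reading_a, reading_b))
```

Counting on the bare form alone would merge both readings into a single frequency, which is why diacritics matter for reliable counts. Selecting the first or the last entry of the ranked variants corresponds to the Frequent Diacritics and Infrequent Diacritics conditions, respectively.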