Tajik National Corpus (TNC)

This website presents the Tajik National Corpus (TNC), a corpus of written texts with a total size of 31 million words. At present, 96% of the corpus is annotated. Each annotated word form carries grammatical tags and translations into English and Russian.

Funding

The Tajik National Corpus was funded by:

Structure

Currently, the corpus consists of modern Tajik-language texts published in the 20th and 21st centuries. It covers the following genres: fiction, poetry, drama, journalism, scientific and educational texts, (auto)biographies, religious literature, political and legal texts, and newspapers. The share of each genre is as follows:

The complete list of the texts included in the Tajik National Corpus is available via the ‘Select subcorpus’ button.

Corpus characteristics

The corpus is annotated automatically by a morphological analyser. The analyser was designed by T. Arkhangelskiy and has been successfully tested on a number of currently available online linguistic corpora developed since the middle of 2000. Among corpora of Iranian languages, these include the Ossetic National Corpus (http://corpus.ossetic-studies.org/) and the Corpus of the Digor dialect of the Ossetic language (http://corpus-digor.ossetic-studies.org/).

The automatic analysis includes lemmatisation and morphological tagging. Lemmatisation assigns a dictionary form to each word form. In the Tajik National Corpus, each word form is also translated into Russian and English. For the Russian translations we used the following dictionaries: Tajik-Russian Dictionary by M.B. Rahimi and L.V. Uspenskaya (Moscow, 1954) and Tajik-Russian Dictionary (second edition), edited by D. Sajmiddinov, S.D. Holmatova and S. Karimov (Dushanbe, 2006). Lemmatisation required manual processing of the dictionary. Morphological tagging supplies grammatical and syntactic information (tags, see Tagset) such as part of speech, number, tense, aspect, modality, person and so on. The morphological tagging rules were also prepared manually.
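As a rough illustration of what one annotated word form carries (a lemma, grammatical tags, and English and Russian translations), the following Python sketch models a single token. The class name, field names and the tag values "N" and "pl" are hypothetical placeholders, not the corpus's actual data format or tagset.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnnotatedToken:
    """One annotated word form (hypothetical layout, for illustration only)."""
    word_form: str           # the token as it appears in the text
    lemma: str               # dictionary form assigned by lemmatisation
    tags: Tuple[str, ...]    # grammatical tags (placeholder names)
    trans_en: str            # English translation of the lemma
    trans_ru: str            # Russian translation of the lemma


def describe(tok: AnnotatedToken) -> str:
    """Render a token as a one-line, human-readable summary."""
    tags = ",".join(tok.tags)
    return f"{tok.word_form} -> {tok.lemma} [{tags}] '{tok.trans_en}' / '{tok.trans_ru}'"


# Example: the plural form "китобҳо" ("books") of the noun "китоб" ("book").
token = AnnotatedToken(
    word_form="китобҳо",
    lemma="китоб",
    tags=("N", "pl"),        # assumed tag names, not taken from the Tagset
    trans_en="book",
    trans_ru="книга",
)

print(describe(token))
```

Keeping the lemma, tags and translations as separate fields mirrors the separate search parameters (lexeme, word form, translation, grammatical tags) that the search system exposes.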

At present, 96% of word forms are annotated automatically.

The universal search system developed by T. Arkhangelskiy has been adapted for the Tajik National Corpus and was updated by him in 2021. Searches by lexeme, word form, translation and grammatical tags (see Tagset) are available. For an advanced search, several parameters can be combined in a single query. The search system also allows searching for several elements at a specified distance from one another. Any subcorpus can be selected for searching (for example, texts of a certain genre or period, or texts by specific authors). The output display settings can be configured. For more information, use the help button on the Search page.
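The idea of searching for several elements at a certain distance from one another can be sketched in Python as follows. This is only an illustration of the concept under simple assumptions (tokens as dictionaries, distance counted in tokens, the second element following the first); it is not the search system's actual implementation, and the function name, token layout and tag names are hypothetical.

```python
def find_within_distance(tokens, first, second, max_distance):
    """Return (i, j) index pairs where tokens[i] satisfies `first`,
    tokens[j] satisfies `second`, and 1 <= j - i <= max_distance."""
    hits = []
    for i, tok in enumerate(tokens):
        if not first(tok):
            continue
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if second(tokens[j]):
                hits.append((i, j))
    return hits


# "ман китоб мехонам" ("I read a book"), with placeholder tags.
sentence = [
    {"lemma": "ман", "tags": {"PRO"}},
    {"lemma": "китоб", "tags": {"N"}},
    {"lemma": "хондан", "tags": {"V"}},
]


def is_man(tok):
    return tok["lemma"] == "ман"


def is_verb(tok):
    return "V" in tok["tags"]


print(find_within_distance(sentence, is_man, is_verb, max_distance=2))  # [(0, 2)]
print(find_within_distance(sentence, is_man, is_verb, max_distance=1))  # []
```

Passing the two conditions as functions lets one element be matched by lemma and the other by grammatical tag, which corresponds to combining several parameters in one query.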

Whole texts are not available for copyright reasons. The maximum context length is 7 sentences.
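The effect of the context limit can be pictured with a small Python sketch that caps the requested context at 7 sentences around a hit. How the corpus actually positions the window is not documented here, so the centring logic and all names below are assumptions made for illustration.

```python
MAX_CONTEXT_SENTENCES = 7  # limit stated above


def context_window(sentences, hit_index, requested):
    """Return up to `requested` sentences around the hit,
    never more than MAX_CONTEXT_SENTENCES (assumed centring strategy)."""
    size = max(1, min(requested, MAX_CONTEXT_SENTENCES))
    start = max(0, hit_index - size // 2)
    end = min(len(sentences), start + size)
    start = max(0, end - size)  # shift left if the window hit the end of the text
    return sentences[start:end]


text = [f"sentence {i}" for i in range(20)]

window = context_window(text, hit_index=10, requested=15)
print(len(window))            # 7
print(window[0], window[-1])  # sentence 7 sentence 13
```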

Special characters

Keyboard combinations of the Standard Russian keyboard can be used to insert special Tajik letters, provided the "standard" input method is selected in the settings:

A virtual Tajik keyboard was added in 2021. To activate it, click the keyboard button on the Search page and go to the Word or Lemma search field.

Transliteration

In 2021, we developed a system for transliterating Tajik Cyrillic into Latin script. To activate it, go to the Search page, click Options in the upper left corner, select Transliteration and choose latin.

If you want to type in Latin instead of Cyrillic in the Word or Lemma search fields, click Options, select Character input method and choose inputmethod_latin.
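Character-by-character transliteration of this kind can be sketched in Python with a lookup table. The table below is a deliberately tiny, illustrative mapping chosen for this example; it is an assumption and does not reproduce the correspondences used by the corpus.

```python
# Illustrative partial mapping only; NOT the corpus's correspondence table.
ILLUSTRATIVE_MAP = {
    "қ": "q",
    "ғ": "gh",
    "ҳ": "h",
    "а": "a",
    "л": "l",
    "м": "m",
}


def transliterate(text, table=ILLUSTRATIVE_MAP):
    """Replace each Cyrillic letter found in `table` with its Latin
    counterpart; characters without an entry are passed through unchanged."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = table.get(lower, ch)
        if ch != lower and lower in table:
            latin = latin.capitalize()  # preserve an initial capital
        out.append(latin)
    return "".join(out)


print(transliterate("қалам"))  # qalam
print(transliterate("Қалам"))  # Qalam
```

Passing unmapped characters through unchanged keeps punctuation, digits and any letters missing from the table visible in the output instead of silently dropping them.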

Below is the list of correspondences between Latin and Tajik Cyrillic letters:

Tagset

Below is the full list of tags (the Grammar cell in the search) used in the corpus.

ATTENTION! The tags are case-sensitive.
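The practical consequence of case sensitivity is that a tag typed in the wrong case matches nothing. A minimal Python sketch of exact, case-sensitive tag matching follows; the tag names "N" and "pl" and the function name are hypothetical and used only for illustration.

```python
def has_tags(token_tags, query_tags):
    """True if every queried tag is present on the token.
    Comparison is by exact string, so it is case-sensitive."""
    return set(query_tags) <= set(token_tags)


token_tags = {"N", "pl"}  # placeholder tags on one word form

print(has_tags(token_tags, ["N", "pl"]))  # True
print(has_tags(token_tags, ["n", "PL"]))  # False: wrong case, no match
```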

Corpus creators

The Tajik National Corpus is the product of a Tajik-Russian collaboration. The collection, digitisation and processing of the Tajik texts were carried out by researchers from the Russian-Tajik (Slavic) University, Department of Theoretical and Applied Linguistics (D.M. Iskandarova, Kh.D. Shambezoda, M.B. Davlatmirova, O.L. Kozlova) and Department of Informatics and Information Systems (Z.D. Usmanov, M.A. Umarov), together with researchers from Tajikistan State University of Law, Business and Politics in Khujand (G. Dovudov, A. Kosimov). The final editing of the texts before their publication in the corpus was completed by A.P. Vydrin.

The Tajik dictionary was processed by A.P. Vydrin, A.D. Egorova and I.V. Egorov. The Tajik morphological analyser was configured by A.P. Vydrin. The list of grammatical tags (the Grammar cell in the search) and glosses (the Gloss cell in the search) was developed by A.P. Vydrin.

In 2021, A.P. Vydrin, T. Arkhangelskiy and A.V. Panasyuk developed the automatic analysis of almost all Tajik analytical verbal forms.

A.P. Vydrin is in charge of maintaining the percentage of automatically annotated words.

Since 2020, A.V. Panasyuk has provided technical support for the Corpus and updated the newspaper corpora.

Acknowledgments

We express our gratitude to A.A. Melikov, who provided us with a collection of modern Tajik literature (around 200 books) gathered from publishing companies in Dushanbe, Samarkand and Tashkent; to B. Olimov, who shared his private collection of 130 books; and to the graduate students of the Department of Theoretical and Applied Linguistics of the Russian-Tajik (Slavic) University (especially Manizha Sokhibova and Khammod Muboraksho). We also thank T. Arkhangelskiy, who advised us on the automatic analyser and put the Tajik Corpus online.

Contacts

Please send any comments and suggestions to Arseniy Vydrin, senjacom@gmail.com.

Future development of the Corpus

We plan to develop a Corpus of Classic Persian-Tajik literature of the 9th–19th centuries. The current Tajik National Corpus will be enlarged with new texts. The percentage of automatically annotated word forms will be increased and the quality of the annotation will be improved.

The Corpus developers would appreciate any donations of texts published in Tajik. We accept files in any text format (doc, docx, rtf, txt, odt). Please send your texts to lingvistik.rtsu@gmail.com and senjacom@gmail.com. We guarantee copyright compliance and will use the texts only for the development of the Tajik National Corpus.