Tajik National Corpus (TNC)

This website presents the Tajik National Corpus (TNC), a corpus of written texts with a total size of 12 million words. Currently, 91.4% of the corpus is annotated. Each annotated word form has grammatical tags and a Russian translation.

Funding

The Tajik National Corpus was funded by:

Structure

Currently the corpus consists of modern texts in the Tajik language published in the 20th and 21st centuries. It covers the following genres: fiction, poetry, drama, journalism, scientific and educational texts, (auto)biographies, religious literature, political and legal texts, and newspapers. The share of each genre is as follows:

The complete list of the texts included in the Tajik National Corpus is available via the ‘Select subcorpus’ button.

Corpus characteristics

The corpus is annotated automatically by a morphological analyser. The analyser was designed by T. Arkhangelskiy and has been tested successfully on a number of online linguistic corpora developed since the mid-2000s. Among corpora of Iranian languages, these include the Ossetic National Corpus (http://corpus.ossetic-studies.org/) and the Corpus of the Digor dialect of the Ossetic language (http://corpus-digor.ossetic-studies.org/).

The automatic analysis includes lemmatisation and morphological tagging. Lemmatisation assigns a dictionary form (lemma) to each word form. In the Tajik National Corpus, each word form is also translated into Russian (according to the Tajik-Russian Dictionary by M.B. Rahimi and L.V. Uspenskaya, Moscow, 1954). Lemmatisation required manual processing of the dictionary. Morphological tagging supplies grammatical and syntactic information (tags, see Tagset) such as part of speech, number, tense, aspect, modality, person and so on. The rules of morphological tagging were also prepared manually.
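The relation between a word form, its lemma, its Russian translation and its tags can be illustrated with a minimal Python sketch. This is not the analyser used for the corpus: the lexicon, the suffix rule and the tag names "N" and "pl" are hypothetical and chosen only for illustration (the actual tags are listed under Tagset).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Analysis:
    wordform: str            # the form as it occurs in the text
    lemma: str               # dictionary form
    translation_ru: str      # Russian translation of the lemma
    tags: Tuple[str, ...]    # grammatical tags (hypothetical names)


# Hypothetical toy lexicon: lemma -> (part-of-speech tag, Russian translation)
LEXICON = {"китоб": ("N", "книга")}

# Hypothetical suffix rules: suffix -> extra tags. The empty suffix
# (bare stem) comes last so that longer suffixes are tried first.
SUFFIX_RULES = {"ҳо": ("pl",), "": ()}


def analyse(wordform: str) -> Optional[Analysis]:
    """Return an analysis for the word form, or None if it is unknown."""
    for suffix, extra_tags in SUFFIX_RULES.items():
        if not wordform.endswith(suffix):
            continue
        stem = wordform[: len(wordform) - len(suffix)]
        if stem in LEXICON:
            pos, translation = LEXICON[stem]
            return Analysis(wordform, stem, translation, (pos,) + extra_tags)
    return None  # the word form stays unannotated


print(analyse("китобҳо"))
```

A word form that matches no lexicon entry returns None, which corresponds to a word form left without annotation.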

Currently, 91.4% of word forms are annotated automatically.
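As a rough worked example, the coverage figure can be converted into word counts. The sketch below assumes that the total is exactly 12,000,000 words, whereas the figure given on this page is rounded, so the results are approximate.

```python
# Assumption: the corpus size is taken as exactly 12,000,000 words;
# the published figure is rounded, so the results are approximate.
TOTAL_WORDS = 12_000_000
COVERAGE_PER_MILLE = 914  # 91.4% expressed in tenths of a percent

# Integer arithmetic avoids floating-point rounding surprises.
annotated = TOTAL_WORDS * COVERAGE_PER_MILLE // 1000
unannotated = TOTAL_WORDS - annotated

print(f"annotated:   about {annotated:,} word forms")
print(f"unannotated: about {unannotated:,} word forms")
```

Under this assumption, roughly 11 million word forms carry an analysis and roughly 1 million do not.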

The universal search system developed by T. Arkhangelskiy has been adapted for the Tajik National Corpus. Searches by lexeme, word form, translation and grammatical tags (see Tagset) are available. For an advanced search, several parameters can be combined in a single query. The search system also allows searching for several elements at a specified distance from one another. Any subcorpus can be selected for searching (for example, texts of a certain genre or period, or texts by specific authors). The output format can be configured. For more information, use the help button on the Search page.
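The idea of searching for two elements at a given distance can be sketched as follows. This is an illustration only and does not describe how the corpus search engine is implemented. The function name and its parameters are hypothetical, and the sentence is represented simply as a list of lemmas.

```python
from typing import List, Tuple


def find_within_distance(
    lemmas: List[str],
    first: str,
    second: str,
    min_dist: int = 1,
    max_dist: int = 1,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where lemmas[i] == first, lemmas[j] == second
    and j follows i by between min_dist and max_dist positions."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        for dist in range(min_dist, max_dist + 1):
            j = i + dist
            if j < len(lemmas) and lemmas[j] == second:
                hits.append((i, j))
    return hits


# A sentence reduced to its lemmas: "I", "book", "new", "to read".
sentence = ["ман", "китоб", "нав", "хондан"]
print(find_within_distance(sentence, "китоб", "хондан", min_dist=1, max_dist=2))
```

With a distance of 1 to 2, the pair is found because one word stands between the two lemmas. With a maximum distance of 1, the same query returns no hits.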

Whole texts are not available for copyright reasons. The maximum context length is 7 sentences.

Special characters

Keyboard combinations of the Standard Russian keyboard can be used to insert special Tajik letters, provided the "standard" input method is selected in the settings:

Tagset

The full list of tags used in the corpus (entered in the Grammar cell of the search form) is given below.

ATTENTION! The tags are case-sensitive.
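The effect of case sensitivity can be shown with a small sketch. The tag names "N", "n" and "pl" are hypothetical and serve only as an illustration, and the function does not reproduce the actual search engine.

```python
def has_tags(wordform_tags: set, query_tags: set) -> bool:
    """True if every queried tag is present, compared exactly (case-sensitive)."""
    return query_tags <= wordform_tags


# Hypothetical tags attached to one word form.
tags = {"N", "pl"}

print(has_tags(tags, {"N"}))  # exact match on case
print(has_tags(tags, {"n"}))  # differs only in case, so it does not match
```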

Corpus creators

The Tajik National Corpus is the product of a Tajik-Russian collaboration. The collection, digitisation and processing of the Tajik texts were carried out by researchers from the Russian-Tajik (Slavic) University, Department of Theoretical and Applied Linguistics (D.M. Iskandarova, Kh.D. Shambezoda, M.B. Davlatmirova, O.L. Kozlova) and Department of Informatics and Information Systems (Z.D. Usmanov, M.A. Umarov), together with researchers from Tajikistan State University of Law, Business and Politics in Khujand (G. Dovudov, A. Kosimov).

The Tajik dictionary was processed by A.P. Vydrin and I.V. Egorov. The Tajik morphological analyser was configured by A.P. Vydrin.

Acknowledgments

We express our gratitude to A.A. Melikov, who provided us with a collection of modern Tajik literature (around 200 books) gathered from publishing companies in Dushanbe, Samarkand and Tashkent; to B. Olimov, who shared his private collection of 130 books; and to the graduate students of the Department of Theoretical and Applied Linguistics of the Russian-Tajik (Slavic) University (especially Manizha Sokhibova and Khammod Muboraksho). We also thank T. Arkhangelskiy, who advised us on the automatic analyser and put the Tajik Corpus online.

Contacts

Arseniy Vydrin provides technical support for the Corpus. Please send any comments and suggestions to senjacom@gmail.com.

Future development of the Corpus

We plan to develop a Corpus of Classical Persian-Tajik literature of the 9th to 19th centuries. The current Tajik National Corpus will be enlarged with new texts. The share of automatically annotated word forms will be increased and the quality of the annotation will be improved.

The Corpus developers would appreciate any donations of texts published in Tajik. We accept files in any text format (doc, docx, rtf, txt, odt). Please send your texts to lingvistik.rtsu@gmail.com and senjacom@gmail.com. We guarantee copyright compliance and that the texts will be used only for the development of the Tajik National Corpus.