Tajik National Corpus (TNC)
This website presents the Tajik National Corpus (TNC), a corpus of written texts totalling 58.4 million words. At present, 96% of the corpus is annotated: each annotated word form carries grammatical tags and translations into English and Russian.
Funding
The Tajik National Corpus was funded by:
- The state budget of the Republic of Tajikistan (2019-2021): project leader – Prof. D.M. Iskandarova; research advisor – Prof. V.A. Plungian, full member of the Russian Academy of Sciences;
- The development programme of the Russian-Tajik (Slavic) University (2019): project leader – Prof. D.M. Iskandarova;
- Russian Foundation for Basic Research, Project 19-012-00637 (2019-2021): project leader – A.P. Vydrin.
Structure
The corpus currently consists of modern Tajik texts published in the 20th and 21st centuries. It covers the following genres: fiction, poetry, drama, fairy tales, journalism, scientific and educational texts, (auto)biographies, religious literature, political and legal texts, and newspapers. The share of each genre is as follows:
- fiction — 13.5%
- poetry — 3%
- scientific and educational texts — 6%
- (auto)biographies — 2%
- journalism — 0.65%
- religious texts — 1.8%
- juridical literature — 0.7%
- politics — 0.14%
- fairytales — 0.1%
- drama — 0.03%
- newspapers — 72%
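As a rough illustration, the shares listed above can be turned into approximate word counts using the total size of 58.4 million words stated on this page. This is only a sketch: the shares are rounded (they add up to 99.92%), so the results are estimates, and the names TOTAL_WORDS, GENRE_SHARE and approx_words are ours, not part of the corpus software.

```python
TOTAL_WORDS = 58_400_000  # total corpus size stated on this page

# Genre shares in percent, copied from the list above.
GENRE_SHARE = {
    "fiction": 13.5,
    "poetry": 3.0,
    "scientific and educational texts": 6.0,
    "(auto)biographies": 2.0,
    "journalism": 0.65,
    "religious texts": 1.8,
    "juridical literature": 0.7,
    "politics": 0.14,
    "fairytales": 0.1,
    "drama": 0.03,
    "newspapers": 72.0,
}


def approx_words(genre: str) -> int:
    """Estimate the word count of a genre from its rounded share."""
    return round(TOTAL_WORDS * GENRE_SHARE[genre] / 100)


# Print the genres from largest to smallest with their estimated sizes.
for genre, share in sorted(GENRE_SHARE.items(), key=lambda kv: -kv[1]):
    print(f"{genre:35s} {share:6.2f}%  ~{approx_words(genre):,} words")
```

Under these assumptions the newspaper subcorpus comes to roughly 42 million words and drama to under 20 thousand.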
The complete list of the texts included in the Tajik National Corpus is available via the ‘Select subcorpus’ button.
Corpus characteristics
The corpus relies on automatic annotation by a morphological analyser. The analyser was designed by T. Arkhangelskiy and has been successfully tested on a number of online linguistic corpora developed since the mid-2000s. Among corpora of Iranian languages, these include the Ossetic National Corpus (http://corpus.ossetic-studies.org/) and the Corpus of the Digor dialect of the Ossetic language (http://corpus-digor.ossetic-studies.org/).
The automatic analysis includes lemmatisation and morphological tagging. Lemmatisation assigns a dictionary form (lemma) to each word form; this required manual processing of the dictionary. In the Tajik National Corpus, each word form is also translated into Russian and English. For the Russian translations we used the following dictionaries: Tajik-Russian Dictionary by M.B. Rahimi and L.V. Uspenskaya (Moscow, 1954) and Tajik-Russian Dictionary, second edition, edited by D. Sajmiddinov, S.D. Holmatova and S. Karimov (Dushanbe, 2006). Morphological tagging adds grammatical and syntactic information (tags, see Tagset) such as part of speech, number, tense, aspect, modality and person. The rules of morphological tagging were also prepared manually.
At the moment the percentage of automatically annotated word forms is 96%.
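The idea of lemmatisation plus tagging described above can be pictured with a toy example. The sketch below is not the analyser used for the corpus: the two-entry lexicon, the suffix table and the names LEXICON, SUFFIX_RULES, Analysis and analyse are all hypothetical, and only the tag labels (N, pl, obj.def) come from the Tagset.

```python
from dataclasses import dataclass

# Toy lexicon: lemma -> (part of speech, English gloss, Russian gloss).
# Entries are illustrative and not taken from the corpus dictionary.
LEXICON = {
    "китоб": ("N", "book", "книга"),
    "хона": ("N", "house", "дом"),
}

# Toy suffix rules, tried from the end of the word inwards: (suffix, tag).
# The tags exist in the Tagset; the rule table itself is hypothetical.
SUFFIX_RULES = [("ро", "obj.def"), ("ҳо", "pl")]


@dataclass
class Analysis:
    word_form: str
    lemma: str
    tags: list
    trans_en: str
    trans_ru: str


def analyse(word_form: str):
    """Strip known suffixes right to left, then look the stem up in the lexicon."""
    stem, suffix_tags = word_form, []
    for suffix, tag in SUFFIX_RULES:
        if stem.endswith(suffix) and len(stem) > len(suffix):
            stem = stem[: -len(suffix)]
            suffix_tags.insert(0, tag)  # keep tags in left-to-right order
    if stem not in LEXICON:
        return None  # the word form stays unannotated
    pos, en, ru = LEXICON[stem]
    return Analysis(word_form, stem, [pos] + suffix_tags, en, ru)


print(analyse("китобҳоро"))
```

With this toy data, китобҳоро is assigned the lemma китоб and the tags N, pl and obj.def, while a word missing from the lexicon returns None, which corresponds to a word form left without annotation.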
The universal search system developed by T. Arkhangelskiy has been adapted for the Tajik National Corpus; he updated it in 2021. Searches by lexeme, word form, translation and grammatical tags (see Tagset) are available. For an advanced search, several parameters can be combined in one query. The search system also allows searching for several elements at a specified distance from one another. A search can be restricted to any subcorpus (for example, texts of a certain genre or period, or texts by specific authors). The output display settings can be configured. For more information, use the help button on the Search page.
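To make the notion of searching for several elements at a given distance more concrete, here is a minimal sketch over an in-memory list of tokens. It does not reflect how the search system is implemented; the token format, the sample annotation and the names TOKENS, find_pairs, is_noun and is_verb are assumptions made for illustration.

```python
# A hand-annotated sample sentence (illustrative, not taken from the corpus).
TOKENS = [
    {"wf": "ман", "tags": {"PRON", "pers"}},
    {"wf": "китоби", "tags": {"N", "ezf"}},
    {"wf": "навро", "tags": {"ADJ", "obj.def"}},
    {"wf": "хондам", "tags": {"V", "pst", "1", "sg"}},
]


def find_pairs(tokens, first, second, min_dist=1, max_dist=1):
    """Return index pairs (i, j) where tokens[i] satisfies `first`,
    tokens[j] satisfies `second` and min_dist <= j - i <= max_dist.
    Assumes 1 <= min_dist <= max_dist (the second element follows the first)."""
    hits = []
    for i, token in enumerate(tokens):
        if not first(token):
            continue
        last = min(i + max_dist, len(tokens) - 1)
        for j in range(i + min_dist, last + 1):
            if second(tokens[j]):
                hits.append((i, j))
    return hits


def is_noun(token):
    return "N" in token["tags"]


def is_verb(token):
    return "V" in token["tags"]


# A noun followed by a verb one or two positions later.
print(find_pairs(TOKENS, is_noun, is_verb, min_dist=1, max_dist=2))
```

In the sample sentence the noun and the verb are two positions apart, so the query succeeds with a maximum distance of 2 and finds nothing with a maximum distance of 1.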
Whole texts are not available for copyright reasons. The maximum context length is 7 sentences.
Special characters
Keyboard combinations of the Standard Russian keyboard can be used to insert special Tajik letters, provided the "standard" input method is selected in the settings:
- и1 = ӣ
- х1 = ҳ
- к1 = қ
- ч1 = ҷ
- у1 = ӯ
- г1 = ғ
- * = any letter or letters
- | = any of the alternatives separated by the vertical bar (for instance, the query "prox|dist" in the Grammar field matches all proximal or distal pronouns)
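The letter shortcuts and the * wildcard listed above can be imitated outside the search interface, for example when preparing queries in a script. The sketch below is an approximation under stated assumptions: it covers only the lowercase shortcuts given in the list, it assumes that * stands for zero or more letters, and the names SHORTCUTS, expand_shortcuts and wildcard_to_regex are ours.

```python
import re

# Shortcuts from the list above: a Russian-layout letter followed by "1".
SHORTCUTS = {"и1": "ӣ", "х1": "ҳ", "к1": "қ", "ч1": "ҷ", "у1": "ӯ", "г1": "ғ"}


def expand_shortcuts(text: str) -> str:
    """Replace every shortcut with the corresponding Tajik letter."""
    for typed, letter in SHORTCUTS.items():
        text = text.replace(typed, letter)
    return text


def wildcard_to_regex(query: str) -> re.Pattern:
    """Turn a query with * into a regular expression for whole-word matching.
    ASSUMPTION: * matches zero or more word characters."""
    parts = [re.escape(part) for part in expand_shortcuts(query).split("*")]
    return re.compile(r"\w*".join(parts))


print(expand_shortcuts("к1алам"))
pattern = wildcard_to_regex("китоб*")
print(bool(pattern.fullmatch("китобҳо")), bool(pattern.fullmatch("хона")))
```

Here к1алам expands to қалам, and the pattern built from китоб* accepts китобҳо but rejects хона.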
A virtual Tajik keyboard was added in 2021. To activate it, click the keyboard button on the Search page and go to Word or Lemma search.
Transliteration
In 2021, we developed a system for transliterating Tajik Cyrillic into Latin script. To activate it, go to the Search page, click Options in the upper left corner, open Transliteration and choose latin.
To type Latin instead of Cyrillic in the Word or Lemma search fields, click Options, open Character input method and choose inputmethod_latin.
Below is the list of correspondences between Latin and Tajik Cyrillic letters:
- g1 = ғ
- s1 = ш
- z1 = ж
- h1 = х
- ch = ч
- a1 = я
- o1 = ё
- y1 = ю
- i1 = ī
- u1 = ū
- c1 = щ
- y2 = ы
- a = а
- b = б
- v = в
- g = г
- d = д
- z = з
- i = и
- ī = ӣ
- y = й
- k = к
- q = қ
- l = л
- m = м
- n = н
- o = о
- p = п
- r = р
- s = с
- t = т
- u = у
- ū = ӯ
- f = ф
- h = ҳ
- j = ҷ
- ' = ъ
- ė = э
- c = ц
- ` = ь
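The correspondences above can be applied mechanically, as the following sketch shows. It is not the code behind the corpus input method: it simply tries two-character keys (such as s1 or ch) before one-character keys and leaves any character missing from the list unchanged. Because i1 and u1 are listed as producing ī and ū, which in turn correspond to ӣ and ӯ, the sketch maps them straight to the Cyrillic letters. The names LATIN_TO_CYRILLIC and to_cyrillic are ours.

```python
import unicodedata

# Latin -> Tajik Cyrillic pairs copied from the list above.
# "i1" and "u1" are mapped directly to ӣ and ӯ (via ī and ū).
_PAIRS = {
    "g1": "ғ", "s1": "ш", "z1": "ж", "h1": "х", "ch": "ч", "a1": "я",
    "o1": "ё", "y1": "ю", "i1": "ӣ", "u1": "ӯ", "c1": "щ", "y2": "ы",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "z": "з", "i": "и",
    "ī": "ӣ", "y": "й", "k": "к", "q": "қ", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "ū": "ӯ",
    "f": "ф", "h": "ҳ", "j": "ҷ", "'": "ъ", "ė": "э", "c": "ц", "`": "ь",
}
# Normalise keys so that letters with diacritics compare reliably.
LATIN_TO_CYRILLIC = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def to_cyrillic(text: str) -> str:
    """Greedy left-to-right conversion, longest key first."""
    text = unicodedata.normalize("NFC", text)
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in LATIN_TO_CYRILLIC:
            out.append(LATIN_TO_CYRILLIC[pair])
            i += 2
        else:
            # Characters not in the table are passed through unchanged.
            out.append(LATIN_TO_CYRILLIC.get(text[i], text[i]))
            i += 1
    return "".join(out)


for word in ("kitob", "s1ahr", "chor", "tojiki1"):
    print(word, "->", to_cyrillic(word))
```

With this table, kitob becomes китоб, s1ahr becomes шаҳр and chor becomes чор.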
Tagset
Below is the full list of tags used in the corpus (entered in the Grammar cell of the search).
ATTENTION! The tags are case-sensitive.
- abs – abstract noun suffix in -ӣ / -вӣ / -гӣ
- ADJ – Adjective (part of speech)
- adj1 – adjectivizer in -ӣ / -вӣ / -гӣ
- adj2 – adjectivizer in -нок
- adj3 – adjectivizer in -она / -гона
- adj4 – adjectivizer in -онӣ
- adj5 – adjectivizer in -ангӣ
- adj6 – adjectivizer in -ин / -гин
- ADV – Adverb
- ag – agent noun
- ag1 – agent noun in -чӣ
- ag2 – agent noun in -гар
- ag3 – agent noun in -бон
- ag4 – agent noun in -ор
- ag5 – agent noun in -гор
- ag6 – agent noun in -вар
- ag7 – agent noun in -кор
- and – conjunction у / ю / ва
- anim – animate
- bi – verbal prefix би- in Imperative and Subjunctive
- bodypart – body part
- caus – morphologically causative verb
- cnject – Conjective
- cnject,prs – Present Conjective
- cnject,pst – Past Conjective
- cnject.prs.pass – Present Conjective in Passive (дида мешудагистам)
- cnject.prs.prog – Present Progressive Conjective (хонда истодагист)
- cnject.pst.pass – Past Conjective in Passive (дида шудагистам)
- cnject2 – Conjective formed with a short copula
- cnject2,prs – Present Conjective formed with a short copula
- cnject2,pst – Past Conjective formed with a short copula
- cnject2.prs.pass – Present Conjective formed with a short copula in Passive (дида мешудагиям)
- cnject2.prs.prog – Present Progressive Conjective formed with a short copula
- cnject2.pst.pass – Past Conjective formed with a short copula in Passive (дида шудагиям)
- color – color
- compar – comparative in -тар
- CONJ – Conjunction
- conv.prs.prog – Present Progressive Converb (хонда истода)
- cop – copula
- cop.encl – short copula
- cop.v – full copula
- DEM – demonstrative Pronoun
- dimin – diminutive suffix
- dimin1 – diminutive suffix in -ҷон
- dimin2 – diminutive suffix in -ак / -аккак
- dimin3 – diminutive suffix in -ча / -чек / -ичек
- dimin4 – diminutive suffix in -ина
- dist – distal pronoun
- ezf – ezafe
- f – feminine proper noun
- fract – fractional numeral
- fut – Future
- fut.pass – Future in Passive (сохта хоҳад шуд)
- hab – habitual (prefix ме- in past tenses and Perfect)
- hab,prf – Evidential Durative, also called Perfect Durative (мегуфтаанд)
- hab,pst – Imperfect, also called Past Durative (мехобид)
- hab.part.pst – imperfective past participle in -та / -да
- hab.prf.pass – Evidential Durative in Passive (дида мешудааст)
- have – suffix -манд denoting possession of an object or quality
- hon – honorific (verbal inflection 2pl)
- house – compounds with -хона ‘house’
- hum – human
- imp – Imperative
- impf.pass – Imperfect in Passive (дида мешуд)
- indef – indefinite marker -е
- indir – indirect mood
- inf – infinitive
- int – Intention (verbal forms with a future participle and short copula, e.g. рафтаниам)
- INTJ – Interjection
- kinship – kinship term
- m – masculine proper noun
- mod – modal verb
- N – Noun
- neg – negation in на-
- neg2 – negation in ма-
- nonhuman – nonhuman
- NUM – Numeral
- obj.def – definite object marker -ро
- ord – ordinal numeral
- part – participle
- part.fut – future participle in -анӣ
- part.mod – modal participle in -агӣ
- part.mod.prs – modal present participle in ме-...-агӣ
- part.mod.prs.pass – Passive modal present participle in ме-...-агӣ (кашида мешудагӣ)
- part.mod.pst – modal past participle in -агӣ
- part.mod.pst.pass – Passive modal past participle in -агӣ (хонда шудагӣ)
- part.prs – present participle in -анда
- part.prs.prog – present progressive participle (хонда истодагӣ)
- part.pst – past participle in -та / -да
- pass – all finite passive forms
- pass.part – all passive participles
- pass.part.pst – passive past participle in -ташуда / -дашуда
- pers – personal pronoun
- pl – plural
- pl.anim – animate plural in -он / -гон / -вон / -ён
- pl.ar – Arabic plural in -от / -ҷот / -вот
- pl.ar.m – Arabic plural in -ин
- place – suffix denoting place
- place1 – place suffix in -(и)стон
- place2 – place suffix in -зор
- place3 – place suffix in -сор
- place4 – place suffix in -гоҳ
- place5 – place suffix in -дон
- pluprf – Pluperfect (хонда будам)
- pluprf.evid – Evidential Pluperfect (хонда будааст)
- pluprf.evid.pass – Evidential Pluperfect in Passive (фиристода шуда будааст)
- pluprf.pass – Pluperfect in Passive (гирифта шуда буд)
- poss – possessive pronoun
- poss.1 – first person possessive pronoun
- poss.2 – second person possessive pronoun
- poss.3 – third person possessive pronoun
- poss.pl – possessive pronoun plural
- poss.sg – possessive pronoun singular
- POST – Postposition
- PREP – Preposition
- prf – Perfect
- prf.pass – Perfect in Passive (дида шудааст)
- prog – all finite progressive forms
- PRON – Pronoun
- prop – proper noun
- prox – proximal pronoun
- prs – Present
- prs.pass – Present in Passive (дида мешавам)
- prs.prog – Present Progressive (хонда истодаам)
- prs.prog.pass – Present Progressive in Passive (дида шуда истодаам)
- PRTCL – Particle
- pst – Past
- pst.pass – Past in Passive (дида шуд)
- pst.prog – Past Progressive (хонда истода будам)
- pst.prog.pass – Past Progressive in Passive (хонда шуда истода буд)
- rel – relative marker -е
- sbjv – Subjunctive
- sbjv.hab – Durative Perfect Subjunctive (мехонда бошам)
- sbjv.hab.pass – Durative Perfect Subjunctive in Passive (дида мешуда бошам)
- sbjv.pass – Subjunctive in Passive (дида шавам)
- sbjv.pst – Past (Perfect) Subjunctive (дида бошам)
- sbjv.pst.pass – Past (Perfect) Subjunctive in Passive (гирифта шуда бошад)
- sg – singular
- similar – adjective suffix in -гун
- similar2 – adjective suffix in -монанд
- suf.adj – any derivational suffix forming adjectives
- suf.n – any derivational suffix forming nouns
- super – superlative in -тарин
- V – Verb
- v.adv – present participle in -он
- 1 – first person
- 2 – second person
- 3 – third person
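The way a Grammar query relates to the tags of a word form can be sketched as follows. The code is only an approximation of the behaviour described on this page: tags are compared case-sensitively, and | separates alternatives as in "prox|dist". Treating a comma as "all of these tags" is our ASSUMPTION, suggested by entries such as hab,pst; the function name matches and the sample tag sets are hypothetical.

```python
def matches(query: str, tags: set) -> bool:
    """Return True if the tag set of a word form satisfies the query.

    "|" separates alternatives (documented on this page).
    "," is ASSUMED to require all listed tags at once.
    Comparison is case-sensitive, so "ADJ" and "adj1" are unrelated tags.
    """
    for alternative in query.split("|"):
        required = [tag.strip() for tag in alternative.split(",")]
        if all(tag in tags for tag in required):
            return True
    return False


# Hypothetical tag sets for three word forms.
distal_pronoun = {"PRON", "DEM", "dist"}
derived_adjective = {"ADJ", "adj2"}
imperfect_verb = {"V", "hab", "pst", "3", "sg"}

print(matches("prox|dist", distal_pronoun))   # one alternative is present
print(matches("adj", derived_adjective))      # no tag is spelled exactly "adj"
print(matches("hab,pst", imperfect_verb))     # both tags are present
```

The second call returns False because neither ADJ nor adj2 is spelled exactly "adj", which is the practical consequence of the tags being case-sensitive.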
Corpus creators
The Tajik National Corpus is a product of Tajik and Russian collaboration. Researchers from the Russian-Tajik (Slavic) University, Department of Theoretical and Applied Linguistics (D.M. Iskandarova, Kh.D. Shambezoda, M.B. Davlatmirova, O.L. Kozlova) and Department of Informatics and Information Systems (Z.D. Usmanov, M.A. Umarov) and also researchers from Tajikistan State University of Law, Business and Politics in Khujand (G. Dovudov, A. Kosimov) have carried out collection, digitisation and processing of texts in Tajik. The final editing of the texts before the publication in the corpus was completed by A.P. Vydrin.
The Tajik dictionary has been processed by A.P. Vydrin, A.D. Egorova and I.V. Egorov. The Tajik morphological analyser has been configured by A.P. Vydrin. The list of Grammatical tags (the Grammar cell in the search) and glosses (the Gloss cell in the search) was developed by A.P. Vydrin.
In 2021, A.P. Vydrin, T. Arkhangelskiy and A.V. Panasyuk developed the automatic analysis of almost all Tajik analytical verbal forms.
A.P. Vydrin is in charge of maintaining the percentage of automatically annotated words.
Since 2020, A.V. Panasyuk has provided technical support for the Corpus and updated the newspaper corpora.
Acknowledgments
We express our gratitude to A.A. Melikov who has provided us with the collection of modern Tajik literature (around 200 books) collected from publishing companies of Dushanbe, Samarkand and Tashkent; to B. Olimov who shared his private collection of 130 books; to graduate students of the Department of Theoretical and Applied Linguistics of the Russian-Tajik (Slavic) University (especially to Manizha Sokhibova and Khammod Muboraksho). We also thank T. Arkhangelskiy who has consulted us on the automatic analyser and has posted the Tajik Corpus online.
Contacts
Please send any comments and suggestions to Arseniy Vydrin, senjacom@gmail.com.
Future development of the Corpus
We plan to develop a Corpus of Classical Persian-Tajik literature of the 9th-19th centuries. The current Tajik National Corpus will be enlarged with new texts. The share of automatically annotated word forms will be increased and the quality of the annotation will be improved.
The Corpus developers will appreciate any donations of texts published in Tajik. We accept files in any text format (doc, docx, rtf, txt, odt). Please send your texts to lingvistik.rtsu@gmail.com and senjacom@gmail.com. We guarantee copyright compliance and the use of the texts for the Tajik National Corpus development only.