Tajik National Corpus (TNC)

This website presents the Tajik National Corpus (TNC), a corpus of written texts with a total size of 58.4 million words. At present, 96% of the corpus is annotated: each annotated word form carries grammatical tags and translations into English and Russian.

Funding

The Tajik National Corpus was funded by:

The state budget of the Republic of Tajikistan (2019-2021): project leader – Prof. D.M. Iskandarova; research advisor – Prof. V.A. Plungian, full member of the Russian Academy of Sciences;
The development programme of the Russian-Tajik (Slavic) University (2019): project leader – Prof. D.M. Iskandarova;
Russian Foundation for Basic Research, Project 19-012-00637 (2019-2021): project leader – A.P. Vydrin.

Structure

Currently the corpus consists of modern Tajik texts published in the 20th and 21st centuries. It covers the following genres: fiction, poetry, drama, journalism, scientific and educational texts, (auto)biographies, religious literature, political and legal texts, fairytales and newspapers. The share of each genre is as follows:

fiction — 13.5%
poetry — 3%
scientific and educational texts — 6%
(auto)biographies — 2%
journalism — 0.65%
religious texts — 1.8%
juridical literature — 0.7%
politics — 0.14%
fairytales — 0.1%
drama — 0.03%
newspapers — 72%
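
The shares above can be turned into approximate word counts using the total corpus size quoted at the top of this page. The following is only an illustrative sketch: the function name approx_words is made up for this example, and the results are rounded estimates, not official figures.

```python
# Approximate word counts per genre, derived from the figures quoted on this page.
TOTAL_WORDS = 58_400_000  # total corpus size stated above

GENRE_SHARES = {  # percentages exactly as listed above
    "fiction": 13.5,
    "poetry": 3.0,
    "scientific and educational texts": 6.0,
    "(auto)biographies": 2.0,
    "journalism": 0.65,
    "religious texts": 1.8,
    "juridical literature": 0.7,
    "politics": 0.14,
    "fairytales": 0.1,
    "drama": 0.03,
    "newspapers": 72.0,
}


def approx_words(share_percent: float, total: int = TOTAL_WORDS) -> int:
    """Return the rounded number of words for a share given in percent."""
    return round(total * share_percent / 100)


if __name__ == "__main__":
    for genre, share in GENRE_SHARES.items():
        print(f"{genre}: ~{approx_words(share):,} words")
    # The listed shares do not have to add up to exactly 100% (rounding).
    print(f"sum of listed shares: {sum(GENRE_SHARES.values()):.2f}%")
```

For example, the newspaper share of 72% corresponds to roughly 42 million words.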

The complete list of the texts included in the Tajik National Corpus is available via the ‘Select subcorpus’ button.

Corpus characteristics

The corpus relies on automatic annotation by a morphological analyzer. The analyzer was designed by T. Arkhangelskiy and has been successfully tested on a number of online linguistic corpora developed since the mid-2000s. Among the corpora of Iranian languages, these include the Ossetic National Corpus (http://corpus.ossetic-studies.org/) and the Corpus of the Digor dialect of the Ossetic language (http://corpus-digor.ossetic-studies.org/).

The automatic analysis includes lemmatisation and morphological tagging. Lemmatisation assigns a dictionary form (lemma) to each word form. In the Tajik National Corpus, each word form is also translated into Russian and English. For the Russian translations we used the following dictionaries: Tajik-Russian Dictionary by M.B. Rahimi and L.V. Uspenskaya (Moscow, 1954) and Tajik-Russian Dictionary, second edition, edited by D. Sajmiddinov, S.D. Holmatova and S. Karimov (Dushanbe, 2006). Lemmatisation required manual processing of the dictionary. Morphological tagging adds grammatical and syntactic information (tags, see Tagset) such as part of speech, number, tense, aspect, modality, person and so on. The rules of morphological tagging were also prepared manually.
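
To make the annotation described above more concrete, here is a minimal sketch of how one annotated word form could be represented. The class AnnotatedToken, its field names and the helper has_tags are hypothetical and are not the corpus's actual storage format; the example token is hand-made, not analyzer output. The tags N and pl are taken from the Tagset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedToken:
    """One word form with a lemma, tags and translations (hypothetical layout)."""
    wordform: str
    lemma: str
    tags: List[str] = field(default_factory=list)
    trans_en: str = ""
    trans_ru: str = ""


def has_tags(token: AnnotatedToken, *required: str) -> bool:
    """Check that the token carries every required tag (case-sensitive)."""
    return all(tag in token.tags for tag in required)


# Hand-made illustration: the plural noun "китобҳо" ("books").
token = AnnotatedToken(
    wordform="китобҳо",
    lemma="китоб",
    tags=["N", "pl"],
    trans_en="book",
    trans_ru="книга",
)

if __name__ == "__main__":
    print(token.lemma, token.tags, token.trans_en, token.trans_ru)
    print(has_tags(token, "N", "pl"))
```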

At the moment the percentage of automatically annotated word forms is 96%.

The universal search system developed by T. Arkhangelskiy has been adapted for the Tajik National Corpus; he updated it in 2021. Searches by lexeme, word form, translation and grammatical tags (see Tagset) are available. For an advanced search, several parameters can be combined in one query. The search system also allows searching for several elements at a specified distance from each other. Searches can be restricted to a subcorpus (for example, texts of a certain genre or period, or texts by specific authors). The output format can be configured. For more information, use the help button on the Search page.
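
The idea of combining several parameters and constraining the distance between two query elements can be sketched as follows. This is a toy model over a hand-made sentence, not the search system's actual implementation or API; the names Token, Term and find_pairs are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    wordform: str
    lemma: str
    tags: FrozenSet[str]


@dataclass(frozen=True)
class Term:
    """One query element; a criterion left as None is not checked."""
    lemma: Optional[str] = None
    tag: Optional[str] = None

    def matches(self, token: Token) -> bool:
        if self.lemma is not None and token.lemma != self.lemma:
            return False
        if self.tag is not None and self.tag not in token.tags:
            return False
        return True


def find_pairs(sentence: List[Token], first: Term, second: Term,
               min_dist: int = 1, max_dist: int = 1) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with min_dist <= j - i <= max_dist.

    Assumes min_dist >= 1, i.e. the second element follows the first.
    """
    hits = []
    for i, token in enumerate(sentence):
        if not first.matches(token):
            continue
        last = min(i + max_dist, len(sentence) - 1)
        for j in range(i + min_dist, last + 1):
            if second.matches(sentence[j]):
                hits.append((i, j))
    return hits


# Hand-made sentence "ман китоби нав хондам" ("I read a new book").
sentence = [
    Token("ман", "ман", frozenset({"PRON", "pers"})),
    Token("китоби", "китоб", frozenset({"N", "ezf"})),
    Token("нав", "нав", frozenset({"ADJ"})),
    Token("хондам", "хондан", frozenset({"V", "pst", "1", "sg"})),
]

if __name__ == "__main__":
    # The lemma "китоб" followed by any verb at a distance of 1 to 3 words.
    print(find_pairs(sentence, Term(lemma="китоб"), Term(tag="V"), 1, 3))
```

In this sketch a single Term combines a lemma and a tag, which mirrors the idea of combining parameters within one query element.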

Whole texts are not available for copyright reasons. The maximum context length is 7 sentences.

Special characters

Keyboard combinations of the Standard Russian keyboard can be used to insert special Tajik letters, provided the "standard" input method is selected in the settings:

и1 = ӣ
х1 = ҳ
к1 = қ
ч1 = ҷ
у1 = ӯ
г1 = ғ
* = any letter or letters
| = any of the alternatives separated by the vertical bar (for instance, the query "prox|dist" in the Grammar field matches all proximal or distal pronouns)
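
The conventions listed above can be imitated in a few lines of code. The sketch below is an approximation for illustration only and does not reproduce the search engine's own logic; the function names expand_combinations and matches_query are made up. It reads * as "one or more letters", following the wording above; whether * may also match nothing is not stated here.

```python
import re

# Digit combinations listed above (standard Russian keyboard input).
COMBINATIONS = {
    "и1": "ӣ", "х1": "ҳ", "к1": "қ",
    "ч1": "ҷ", "у1": "ӯ", "г1": "ғ",
}


def expand_combinations(text: str) -> str:
    """Replace combinations such as 'ч1' with the Tajik letter they stand for."""
    for combination, letter in COMBINATIONS.items():
        text = text.replace(combination, letter)
    return text


def matches_query(pattern: str, value: str) -> bool:
    """Check a value against a pattern that may contain * and |.

    '|' separates alternatives; '*' is read as one or more letters
    (an assumption based on the description above).
    """
    alternatives = []
    for alternative in pattern.split("|"):
        pieces = [re.escape(piece) for piece in alternative.split("*")]
        alternatives.append(r"\w+".join(pieces))
    regex = "(?:" + "|".join(alternatives) + ")"
    return re.fullmatch(regex, value) is not None


if __name__ == "__main__":
    print(expand_combinations("точ1ики1"))
    print(matches_query("китоб*", "китобҳо"))
    print(matches_query("prox|dist", "dist"))
```

Only lowercase combinations are handled in this sketch.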

A virtual Tajik keyboard was added in 2021. To activate it, click the keyboard button on the Search page and go to the Word or Lemma search field.

Transliteration

In 2021, we developed a system for transliterating Tajik Cyrillic into Latin. To activate it, go to the Search page, click Options in the upper left corner, then Transliteration, and choose latin.

To type Latin instead of Cyrillic in the Word or Lemma search fields, click Options, then Character input method, and choose inputmethod_latin.

Below is the list of correspondences between Latin and Tajik Cyrillic letters:

g1 = ғ
s1 = ш
z1 = ж
h1 = х
ch = ч
a1 = я
o1 = ё
y1 = ю
i1 = ī
u1 = ū
c1 = щ
y2 = ы
a = а
b = б
v = в
g = г
d = д
z = з
i = и
ī = ӣ
y = й
k = к
q = қ
l = л
m = м
n = н
o = о
p = п
r = р
s = с
t = т
u = у
ū = ӯ
f = ф
h = ҳ
j = ҷ
' = ъ
ė = э
c = ц
` = ь
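
As a rough illustration of how the correspondences above can be applied, the sketch below converts Latin input into Tajik Cyrillic with a longest-match-first scan, so that "ch" and "s1" are read as one unit. It is not the code used by the corpus. The function name latin_to_cyrillic is made up, and mapping i1 and u1 directly to ӣ and ӯ (by chaining i1 = ī with ī = ӣ) is an assumption.

```python
import unicodedata

# Correspondences copied from the list above (Latin on the left).
LATIN_TO_CYRILLIC = {
    "g1": "ғ", "s1": "ш", "z1": "ж", "h1": "х", "ch": "ч",
    "a1": "я", "o1": "ё", "y1": "ю", "c1": "щ", "y2": "ы",
    # Assumption: i1 = ī and ī = ӣ are chained (likewise for u1).
    "i1": "ӣ", "u1": "ӯ",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "z": "з",
    "i": "и", "ī": "ӣ", "y": "й", "k": "к", "q": "қ", "l": "л",
    "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "t": "т", "u": "у", "ū": "ӯ", "f": "ф", "h": "ҳ", "j": "ҷ",
    "'": "ъ", "ė": "э", "c": "ц", "`": "ь",
}

# Normalise the keys so that composed and decomposed accents compare equal.
_TABLE = {unicodedata.normalize("NFC", k): v for k, v in LATIN_TO_CYRILLIC.items()}
# Try longer keys first, so that "ch" wins over "c" and "s1" over "s".
_KEYS = sorted(_TABLE, key=len, reverse=True)


def latin_to_cyrillic(text: str) -> str:
    """Convert lowercase Latin input to Tajik Cyrillic using the table above."""
    text = unicodedata.normalize("NFC", text)
    result = []
    position = 0
    while position < len(text):
        for key in _KEYS:
            if text.startswith(key, position):
                result.append(_TABLE[key])
                position += len(key)
                break
        else:
            # Characters without an entry in the list (spaces, digits,
            # capitals and so on) are passed through unchanged.
            result.append(text[position])
            position += 1
    return "".join(result)


if __name__ == "__main__":
    for word in ("kitob", "tojiki1", "bacha", "s1ahr"):
        print(word, "->", latin_to_cyrillic(word))
```

Note that the list above has no entry for the Cyrillic letter е, so this sketch leaves a Latin "e" untouched rather than guessing.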

Tagset

Below is the full list of tags used in the corpus (entered in the Grammar cell of the search).

ATTENTION! The tags are case-sensitive.

abs – abstract noun suffix in -ӣ / -вӣ / -гӣ
ADJ – Adjective (part of speech)
adj1 – adjectivizer in -ӣ / -вӣ / -гӣ
adj2 – adjectivizer in -нок
adj3 – adjectivizer in -она / -гона
adj4 – adjectivizer in -онӣ
adj5 – adjectivizer in -ангӣ
adj6 – adjectivizer in -ин / -гин
ADV – Adverb
ag – agent noun
ag1 – agent noun in -чӣ
ag2 – agent noun in -гар
ag3 – agent noun in -бон
ag4 – agent noun in -ор
ag5 – agent noun in -гор
ag6 – agent noun in -вар
ag7 – agent noun in -кор
and – conjunction у / ю / ва
anim – animate
bi – verbal prefix би- in Imperative and Subjunctive
bodypart – body part
caus – morphologically causative verb
cnject – Conjective
cnject,prs – Present Conjective
cnject,pst – Past Conjective
cnject.prs.pass – Present Conjective in Passive (дида мешудагистам)
cnject.prs.prog – Present Progressive Conjective (хонда истодагист)
cnject.pst.pass – Past Conjective in Passive (дида шудагистам)
cnject2 – Conjective formed with a short copula
cnject2,prs – Present Conjective formed with a short copula
cnject2,pst – Past Conjective formed with a short copula
cnject2.prs.pass – Present Conjective formed with a short copula in Passive (дида мешудагиям)
cnject2.prs.prog – Present Progressive Conjective formed with a short copula
cnject2.pst.pass – Past Conjective formed with a short copula in Passive (дида шудагиям)
color – color
compar – comparative in -тар
CONJ – Conjunction
conv.prs.prog – Present Progressive Converb (хонда истода)
cop – copula
cop.encl – short copula
cop.v – full copula
DEM – demonstrative Pronoun
dimin – diminutive suffix
dimin1 – diminutive suffix in -ҷон
dimin2 – diminutive suffix in -ак / -аккак
dimin3 – diminutive suffix in -ча / -чек / -ичек
dimin4 – diminutive suffix in -ина
dist – distal pronoun
ezf – ezafe
f – feminine proper noun
fract – fractional numeral
fut – Future
fut.pass – Future in Passive (сохта хоҳад шуд)
hab – habitual (prefix ме- in past tenses and Perfect)
hab,prf – Evidential Durative, also called Perfect Durative (мегуфтаанд)
hab,pst – Imperfect, also called Past Durative (мехобид)
hab.part.pst – imperfective past participle in -та / -да
hab.prf.pass – Evidential Durative in Passive (дида мешудааст)
have – suffix -манд denoting possession of an object or quality
hon – honorific (verbal inflection 2pl)
house – compounds with -хона ‘house’
hum – human
imp – Imperative
impf.pass – Imperfect in Passive (дида мешуд)
indef – indefinite marker -е
indir – indirect mood
inf – infinitive
int – Intention (verbal forms with a future participle and a short copula, e.g. рафтаниам)
INTJ – Interjection
kinship – kinship term
m – masculine proper noun
mod – modal verb
N – Noun
neg – negation in на-
neg2 – negation in ма-
nonhuman – nonhuman
NUM – Numeral
obj.def – definite object marker -ро
ord – ordinal numeral
part – participle
part.fut – future participle in -анӣ
part.mod – modal participle in -агӣ
part.mod.prs – modal present participle in ме-...-агӣ
part.mod.prs.pass – Passive modal present participle in ме-...-агӣ (кашида мешудагӣ)
part.mod.pst – modal past participle in -агӣ
part.mod.pst.pass – Passive modal past participle in -агӣ (хонда шудагӣ)
part.prs – present participle in -анда
part.prs.prog – present progressive participle (хонда истодагӣ)
part.pst – past participle in -та / -да
pass – all finite passive forms
pass.part – all passive participles
pass.part.pst – passive past participle in -ташуда / -дашуда
pers – personal pronoun
pl – plural
pl.anim – animate plural in -он / -гон / -вон / -ён
pl.ar – Arabic plural in -от / -ҷот / -вот
pl.ar.m – Arabic plural in -ин
place – suffix denoting place
place1 – place suffix in -(и)стон
place2 – place suffix in -зор
place3 – place suffix in -сор
place4 – place suffix in -гоҳ
place5 – place suffix in -дон
pluprf – Pluperfect (хонда будам)
pluprf.evid – Evidential Pluperfect (хонда будааст)
pluprf.evid.pass – Evidential Pluperfect in Passive (фиристода шуда будааст)
pluprf.pass – Pluperfect in Passive (гирифта шуда буд)
poss – possessive pronoun
poss.1 – first person possessive pronoun
poss.2 – second person possessive pronoun
poss.3 – third person possessive pronoun
poss.pl – possessive pronoun plural
poss.sg – possessive pronoun singular
POST – Postposition
PREP – Preposition
prf – Perfect
prf.pass – Perfect in Passive (дида шудааст)
prog – all finite progressive forms
PRON – Pronoun
prop – proper noun
prox – proximal pronoun
prs – Present
prs.pass – Present in Passive (дида мешавам)
prs.prog – Present Progressive (хонда истодаам)
prs.prog.pass – Present Progressive in Passive (дида шуда истодаам)
PRTCL – Particle
pst – Past
pst.pass – Past in Passive (дида шуд)
pst.prog – Past Progressive (хонда истода будам)
pst.prog.pass – Past Progressive in Passive (хонда шуда истода буд)
rel – relative marker -е
sbjv – Subjunctive
sbjv.hab – Durative Perfect Subjunctive (мехонда бошам)
sbjv.hab.pass – Durative Perfect Subjunctive in Passive (дида мешуда бошам)
sbjv.pass – Subjunctive in Passive (дида шавам)
sbjv.pst – Past (Perfect) Subjunctive (дида бошам)
sbjv.pst.pass – Past (Perfect) Subjunctive in Passive (гирифта шуда бошад)
sg – singular
similar – adjective suffix in -гун
similar2 – adjective suffix in -монанд
suf.adj – any derivational suffix forming adjectives
suf.n – any derivational suffix forming nouns
super – superlative in -тарин
V – Verb
v.adv – present participle in -он
1 – first person
2 – second person
3 – third person
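
Because the tags are case-sensitive, a query such as "n" or "adj" will not match the tags N or ADJ. The sketch below shows one way to check a list of tags before searching. It is only an illustration: KNOWN_TAGS contains a small sample of the tagset, the function name check_tags is made up, and treating the comma as a separator between tags is an assumption based on entries such as "hab,pst" above.

```python
# A small sample of the tags listed above; the full tagset is much longer.
KNOWN_TAGS = {
    "ADJ", "adj1", "ADV", "N", "V", "PRON",
    "pl", "sg", "prs", "pst", "prs.pass",
    "prox", "dist", "1", "2", "3",
}


def check_tags(query: str) -> dict:
    """Return unknown tags mapped to known tags that differ only in case.

    The query is split on commas (an assumption); the comparison with
    KNOWN_TAGS is exact, so "n" is reported while "N" is accepted.
    """
    problems = {}
    for tag in (part.strip() for part in query.split(",")):
        if tag and tag not in KNOWN_TAGS:
            suggestions = sorted(
                known for known in KNOWN_TAGS if known.lower() == tag.lower()
            )
            problems[tag] = suggestions
    return problems


if __name__ == "__main__":
    print(check_tags("N,pl"))   # no problems expected
    print(check_tags("n,pl"))   # "n" is not a tag; "N" is suggested
    print(check_tags("adj"))    # "ADJ" is suggested, "adj1" is not
```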

Corpus creators

The Tajik National Corpus is a product of Tajik-Russian collaboration. The collection, digitisation and processing of Tajik texts were carried out by researchers from the Russian-Tajik (Slavic) University, Department of Theoretical and Applied Linguistics (D.M. Iskandarova, Kh.D. Shambezoda, M.B. Davlatmirova, O.L. Kozlova) and Department of Informatics and Information Systems (Z.D. Usmanov, M.A. Umarov), and by researchers from the Tajikistan State University of Law, Business and Politics in Khujand (G. Dovudov, A. Kosimov). The final editing of the texts before their publication in the corpus was done by A.P. Vydrin.

The Tajik dictionary was processed by A.P. Vydrin, A.D. Egorova and I.V. Egorov. The Tajik morphological analyser was configured by A.P. Vydrin. The lists of grammatical tags (the Grammar cell in the search) and glosses (the Gloss cell in the search) were developed by A.P. Vydrin.

In 2021, A.P. Vydrin, T. Arkhangelskiy and A.V. Panasyuk developed the automatic analysis of almost all Tajik analytical verbal forms.

A.P. Vydrin is in charge of maintaining the percentage of automatically annotated words.

Since 2020, A.V. Panasyuk has provided technical support for the Corpus and updated the newspaper corpora.

Acknowledgments

We express our gratitude to A.A. Melikov, who provided us with a collection of modern Tajik literature (around 200 books) gathered from publishing companies in Dushanbe, Samarkand and Tashkent; to B. Olimov, who shared his private collection of 130 books; and to the graduate students of the Department of Theoretical and Applied Linguistics of the Russian-Tajik (Slavic) University (especially Manizha Sokhibova and Khammod Muboraksho). We also thank T. Arkhangelskiy, who advised us on the automatic analyser and put the Tajik Corpus online.

Contacts

Please send any comments and suggestions to Arseniy Vydrin, senjacom@gmail.com.

Future development of the Corpus

We plan to develop a Corpus of Classical Persian-Tajik literature of the 9th – 19th centuries. The current Tajik National Corpus will be enlarged with new texts. We also intend to increase the share of automatically annotated word forms and to improve the quality of the annotation.

The Corpus developers will appreciate any donations of texts published in Tajik. We accept files in any text format (doc, docx, rtf, txt, odt). Please send your texts to lingvistik.rtsu@gmail.com and senjacom@gmail.com. We guarantee copyright compliance and will use the texts only for the development of the Tajik National Corpus.