The frequency dictionary for Russian

The second version of the frequency list

From this page you can access the frequency list for modern Russian. Until now, Chastotnyj slovarj russkogo jazyka (Zasorina, 1977) has provided the most widely used frequency list for Russian. However, the corpus used by Zasorina is relatively small by modern standards (about 1 million words). It is also outdated: it mostly covers usage from the 1920s to the 1960s and includes a high proportion of ideological sources, such as texts by Lenin and Khrushchev and Soviet newspapers. As a result, its word frequencies are severely biased; e.g. Soviet and comrade are among the first hundred Russian words, on a par with function words. Finally, the list of Zasorina (1977) is not available electronically.

The list accessible from this page includes about 32000 words with a frequency greater than 1 ipm (one instance per million words). A shorter selection of the 5000 most frequent words is also available. The lists use Windows-1251 encoding for Cyrillic and are compressed with WinZip (Linux or Mac users can use StuffIt for decompression).

The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:

word rank, frequency (in ipm), word, part of speech.
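
As an illustration of how such a list might be read, here is a minimal Python sketch. It assumes that the four fields are separated by whitespace and appear in the order given above; the actual delimiter in the files may differ. The names Entry, parse_line and read_list are hypothetical, and the path passed to read_list is whatever file you have unpacked from one of the archives.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    rank: int    # position in the list
    ipm: float   # frequency in instances per million words
    word: str    # lemma or word form
    pos: str     # part-of-speech tag


def parse_line(line: str) -> Entry:
    # Assumption: four whitespace-separated fields in the documented order.
    rank, ipm, word, pos = line.split()
    return Entry(int(rank), float(ipm), word, pos)


def read_list(path: str) -> List[Entry]:
    # The lists use Windows-1251, so the encoding is given explicitly.
    with open(path, encoding="cp1251") as f:
        return [parse_line(line) for line in f if line.strip()]
```

The sample values in any test line are made up; only the field order comes from the description above.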

Lists of 1 ipm words

lemma.al.zip - lemmas sorted in alphabetical order
lemma.num.zip - lemmas sorted by frequency
words.num.zip - word forms sorted by frequency

Lists of the 5000 most frequent words

5000lemma.al.zip - lemmas sorted in alphabetical order
5000lemma.num.zip - lemmas sorted by frequency

Some data about word usage in modern Russian

The average word length is 5.28 characters.
The average sentence length is 10.38 words.
The 1000 most frequent lemmas cover 64.0708% of word forms in texts.
The 2000 most frequent lemmas cover 71.9521% of word forms in texts.
The 3000 most frequent lemmas cover 76.6824% of word forms in texts.
The 5000 most frequent lemmas cover 82.0604% of word forms in texts.

The exact information on the mapping of frequency to coverage is available from here.
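
Coverage figures of this kind can be approximated from the ipm column alone. The sketch below assumes that coverage of the top N lemmas is the sum of their ipm values divided by one million (since ipm counts instances per million running words); this is an assumption about the calculation, not a description of how the figures above were produced. The function name coverage_percent is hypothetical.

```python
def coverage_percent(ipm_values, n):
    """Percentage of running words covered by the n most frequent items."""
    # Sort defensively in case the input is not already ordered by frequency.
    top = sorted(ipm_values, reverse=True)[:n]
    # ipm is a rate per 1,000,000 words, so the denominator is one million,
    # not the sum over the (truncated) list.
    return 100.0 * sum(top) / 1_000_000
```

For example, with the entries read from the lemma list sorted by frequency, coverage_percent([e.ipm for e in entries], 1000) would give an estimate comparable to the first figure above.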

The list is compiled from a corpus of modern Russian. It contains a selection of modern fiction, political texts, newspapers, and popular science (about 50 million words, MW; fiction accounts for about half of the corpus). All texts were originally written in Russian between 1970 and 2002, the majority of them between 1980 and 1995; the newspaper corpus is from 1997-1999.

It is widely known that large texts present a problem for frequency lists, since a large text that contains many instances of a rare word can boost its frequency. If the corpus is based on fiction, large texts are quite frequent. As an example, the corpus contains a huge sequel to Tolkien's "The Lord of the Rings" written by a Russian author (Nick Perumov). Although the sequel is only about 250 kW (thousand words) long, less than one percent of the whole corpus, the word hobbit is used in it so often that, if no precautions against large texts are taken, it ends up among the first thousand most frequent Russian words. For this reason, the frequency list is calculated under the condition that no single text from the corpus contributes more than 10 kW and no author contributes more than 100 kW to the count. Thus, the subset of the whole corpus used for the frequency count is about 16 MW.
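
The capping condition described above can be sketched as follows. The text does not say which part of an oversized text is kept, so this sketch simply takes the first tokens of each text; that choice, the function name capped_counts, and the (author, tokens) input format are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def capped_counts(texts, text_cap=10_000, author_cap=100_000):
    """Count words so that no text and no author exceeds its cap.

    texts: iterable of (author, list_of_tokens) pairs.
    """
    counts = Counter()
    used_by_author = defaultdict(int)
    for author, tokens in texts:
        # Remaining allowance: limited by the per-text cap and by what the
        # author has left of the per-author cap.
        budget = min(text_cap, author_cap - used_by_author[author])
        if budget <= 0:
            continue
        sample = tokens[:budget]  # assumption: keep the start of the text
        counts.update(sample)
        used_by_author[author] += len(sample)
    return counts
```

With the default caps, a 250 kW novel contributes at most 10 kW to the count, so a single book can no longer push a word like hobbit into the first thousand.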

Words are not uniformly distributed in texts. Some of them (like prepositions) occur in many texts at a predictable rate, some (like pronouns or mental verbs) are significantly more frequent for certain writers or genres, while some are "contagious": if a word (e.g. a proper name, a title of nobility or a technical term) occurs once in a text, it tends to be repeated, thus boosting its frequency in that document. The variation can be measured in a variety of ways (Church, K. and Gale, W. (1995) Poisson Mixtures, Journal of Natural Language Engineering 1:2), though the easiest way is to use the coefficient of variation, which is defined as the standard deviation divided by the mean. The standard deviation is a measure of the absolute dispersion of data (it is larger for words with a larger mean frequency), while the coefficient of variation makes it possible to compare the dispersion of words with unequal mean frequencies. The variance data for the 5000 most frequent lemmas are available here. The structure:
lemma, mean frequency (ipm), number of texts in which the lemma occurs, standard deviation of frequency counted for all texts, coefficient of variation, variance.
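
A minimal sketch of these dispersion measures for a single lemma is given below. It assumes that the input is the lemma's frequency in each text (with 0 for texts where it does not occur) and that the population standard deviation is used; the source does not state which variant of the standard deviation was applied. The function name dispersion is hypothetical.

```python
from statistics import mean, pstdev


def dispersion(per_text_freq):
    """Return (mean, standard deviation, coefficient of variation, variance)."""
    m = mean(per_text_freq)
    sd = pstdev(per_text_freq)  # assumption: population standard deviation
    # The coefficient of variation is undefined when the mean is zero.
    cv = sd / m if m else float("nan")
    return m, sd, cv, sd ** 2
```

A lemma with evenly spread occurrences yields a coefficient of variation close to 0, while a "contagious" word concentrated in a few texts yields a much larger value, regardless of its overall frequency.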

Three frequency lists for word classes are also available:

The compilation of the corpora and the development of the respective tools and frequency lists were made possible by a Fellowship awarded to the author by the Alexander von Humboldt Foundation, Germany. Lemmas for word forms in the corpora were produced by the morphological analyzer of Dialing. Since many forms are ambiguous, some lemmas in the dictionary are not optimal; some disambiguation will be required for a future edition of the frequency list. Please send your comments and suggestions to the author: Serge Sharoff, s.sharoffleeds.ac.uk.