Corpus SYN2010

SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It follows the SYN2000 and SYN2005 corpora and, together with them, forms a series of synchronic representative corpora covering three successive periods. The basic features of SYN2010 are identical to those of SYN2005, in particular the concept of representativeness based on the reception of written language and the resulting composition of the corpus. All newspaper and magazine texts included in SYN2010 were published between 2005 and 2009, with each year equally represented, just as in SYN2005. Naturally, the proportions of individual newspaper and magazine titles have changed. The criteria that define a synchronic text in both fiction and professional literature, however, remain unchanged; SYN2010 thus includes only professional texts published after 1989. Some of the fiction texts may have been published earlier, but as a general rule the corpus consists mainly of newer texts, and the proportion of older texts is decreasing. Compared to SYN2005, the lemmatization and morphological tagging of SYN2010 have been significantly improved; both correspond to the processing used for SYN2009PUB.

Structure of corpus SYN2010:

40 % fiction
27 % technical literature
33 % journalism
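
As a rough illustration of what these shares mean in absolute terms, the sketch below converts the percentages into approximate token counts. It assumes the shares apply directly to the stated total of 100 million tokens and that the journalism component is split evenly over the five years 2005 to 2009; the names `TOTAL_TOKENS`, `SHARES` and `category_sizes` are hypothetical and not part of any corpus tool. The published percentages may be rounded, so the results are estimates only.

```python
# Approximate category sizes implied by the stated shares (illustrative only).
TOTAL_TOKENS = 100_000_000  # corpus size given in the description

# Shares in percent, as listed in the structure overview.
SHARES = {"fiction": 40, "technical literature": 27, "journalism": 33}


def category_sizes(total, shares):
    """Return approximate token counts per category using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in shares.items()}


sizes = category_sizes(TOTAL_TOKENS, SHARES)
for name, count in sizes.items():
    print(f"{name}: about {count / 1e6:.0f} million tokens")

# Assumption: journalism is spread evenly over the years 2005 to 2009.
years = range(2005, 2010)
per_year = sizes["journalism"] / len(years)
print(f"journalism per year: about {per_year / 1e6:.1f} million tokens")
```

Under these assumptions, each year of journalism contributes roughly a fifteenth of the whole corpus.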

Structure of technical and other specialised literature according to thematic orientation:
[Chart: no. of words (in mil.)]

Structure of journalism according to the year of issue:
[Chart: no. of words (in mil.)]

Structure of journalism according to the newspaper title:
[Chart: no. of words (in mil.)]