Corpus SYN2009PUB
SYN2009PUB is a synchronic corpus of written journalism and a sequel to SYN2006PUB. It contains exclusively journalistic texts from 1995 to 2007, and its total size is 700 million words (tokens). All the SYN-series corpora are disjoint with respect to the texts used: no text that is part of one corpus is included in any of the others. The corpora SYN2000, SYN2005, SYN2006PUB and SYN2009PUB thus contain a total of 1,200 million text words (tokens).
The lemmatisation and morphological tagging of SYN2009PUB were improved in comparison with the older corpora. The improvements mainly concern the lemmatisation of personal and possessive pronouns, leaving grammatical categories undetermined for abbreviations and foreign words, and tokenisation (detection of word form boundaries), especially for abbreviations and hyphenated word forms. The tagset itself was slightly simplified: values that grouped together several categories were eliminated.
It should be stressed that the SYN2009PUB corpus does not claim to be representative in any way. Although dozens of independent regional newspapers and other titles have been included (in addition to the rather uniform Deníky Bohemia and Deníky Moravia), their overall share is very low. It is clear from the charts below that the corpus composition is balanced neither by year of publication nor by title. The SYN2009PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data.
[Charts: Structure of corpora according to years; Structure of corpora according to titles. Both charts show the no. of words (in mil.).]


