
Corpus SYN

SYN is a non-reference corpus consisting of the texts of all reference synchronic written corpora of the SYN series, i.e. SYN2000, SYN2005, SYN2006PUB, SYN2009PUB and SYN2010. Since the SYN-series corpora are disjoint with respect to the texts they contain, the size of the SYN corpus equals the sum of their sizes, which is currently 1 300 million words (tokens). The SYN corpus is not representative: the vast majority of its texts are newspapers and magazines, a consequence of incorporating the SYN2006PUB and SYN2009PUB corpora.
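Because the component corpora share no texts, the total is plain addition. The short sketch below works through it. The per-corpus figures are the nominal sizes commonly cited for these corpora; they are not stated in this text and should be treated as an assumption, as should the variable names.

```python
# Nominal sizes in millions of tokens. ASSUMPTION: these per-corpus
# figures are the commonly cited round numbers, not values given above.
NOMINAL_SIZES_MILLIONS = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
}

# The corpora are disjoint, so no text is counted twice.
total_millions = sum(NOMINAL_SIZES_MILLIONS.values())

# Share contributed by the two PUB (newspaper and magazine) corpora alone.
pub_millions = sum(
    size for name, size in NOMINAL_SIZES_MILLIONS.items() if name.endswith("PUB")
)
pub_share = pub_millions / total_millions

print(f"total: {total_millions} million tokens")
print(f"PUB corpora alone: {pub_share:.0%} of SYN")
```

Under these assumed figures the two PUB corpora alone make up roughly three quarters of SYN, which is why the corpus as a whole cannot be considered representative.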

Before its publication, each of the SYN-series corpora was processed with the newest versions of the tools available at the time of its compilation: tokenisation (division of the corpus into tokens), segmentation (sentence boundary detection), morphological analysis and disambiguation. At the same time, all the SYN-series corpora were designed as reference corpora, i.e. invariable entities that remain unchanged once published. As a consequence, these corpora preserve the output of older versions of all the tools, so their markup becomes gradually more obsolete. It also makes their markup mutually incompatible, which complicates any comparison of data based on them. The improvements in corpus processing made since 2000 are considerable: many newly recognized word forms, a different approach to certain language phenomena, more reliable disambiguation with the rule-based component, completed and unified bibliographical information, etc. However, these improvements could not be incorporated into the already published corpora without either violating their reference status or introducing revision control, which would confuse most users. That is why the SYN corpus was introduced as a "wrapper" around all the synchronic written corpora, re-processed with state-of-the-art versions of the tools for tokenisation, segmentation, morphological analysis and disambiguation (at the same level as in SYN2010).

In addition to searching the revised texts of all the SYN-series corpora together, users can create subcorpora with the same composition as the original corpora. This is made possible by the attribute opus.syn: for example, a subcorpus corresponding to SYN2005 can be created by applying the condition syn="2005" to the structural attribute opus. This condition may of course be combined with others that specify the required text type, publication date, etc. More information can be found in the manual (Czech only). The SYN corpus can therefore also be used to work with the older representative corpora as re-processed by the latest corpus tools. Naturally, the different processing may cause differences between the original corpora and the corresponding new subcorpora. These differences may include not only different lemmatization, but also different frequencies of word forms or a different number of positions, since both depend on the tokenisation.
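The condition on opus.syn can also be pictured as a restriction attached to a query. The sketch below is an illustration only: it assumes a CQL-style query language in which a structure restriction is written `within <opus syn="2005" />`, and both the helper name `restrict_to_original` and the sample query `[lemma="pes"]` are hypothetical. Only the value "2005" is taken from the text above; the values for the other corpora should be looked up in the manual.

```python
def restrict_to_original(query: str, syn_value: str) -> str:
    """Append a restriction on the structural attribute opus.syn to a query.

    ASSUMPTION: the query language accepts 'within <opus syn="..." />';
    this helper is hypothetical and not part of any corpus tool.
    """
    if '"' in syn_value:
        # A double quote would break out of the attribute value.
        raise ValueError("syn value must not contain a double quote")
    return f'{query} within <opus syn="{syn_value}" />'


# Limit a (hypothetical) lemma query to the texts that made up SYN2005.
example = restrict_to_original('[lemma="pes"]', "2005")
print(example)
```

Keeping the restriction separate from the query itself makes it easy to run the same query against several original-corpus subsets and compare the results.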

As a non-reference corpus, the SYN corpus may be modified in the future for various reasons, e.g. correction of errors, significant improvement of morphological analysis and/or disambiguation, or inclusion of future (so far only planned) synchronic written corpora. Updates will therefore be irregular, but they will not occur more often than once a year. The SYN corpus will thus retain its character as a non-reference unification of all the SYN-series corpora, consistently re-processed with state-of-the-art versions of the available tools, providing CNC users with the following benefits: