
What is a corpus?

A language corpus is an electronic collection of authentic texts (written or spoken) where various language phenomena can be easily searched for and displayed in their natural context.

The CNC corpora include written (printed) contemporary Czech (more than 5 billion tokens), internet Czech (more than 6 billion tokens), spontaneous spoken Czech and historical Czech, as well as the InterCorp parallel corpus, which contains translations from and into more than 60 languages.

Applications

  1. KonText

    The KonText application is a basic query interface for working with corpora. It evaluates simple and complex queries, displays their results as concordance lines, computes frequency distributions, calculates association measures for collocations and supports further work with language data. All functions are clearly described in the manual.

  2. SyD

    The SyD application is designed for versatile exploration of variants from both the synchronic (contemporary language) and the diachronic perspective. Based on the CNC corpora, it summarizes how frequently each particular variant is used in the present as well as in the past. Try it out! Simply enter two or more competing variants of a single linguistic phenomenon, e.g. téměř × skoro.

  3. Morfio

    The Morfio application is aimed at searching for word-formation relations between corpus units, e.g. lovit - úlovek. It enables finding all word pairs formed in the same way and evaluating the morphological productivity of their formation. The application is based on large corpora of written language that cover a wide variety of word-formation possibilities in contemporary Czech.

  4. KWords

    The KWords application provides a fundamental basis for empirical interpretation of texts. It analyzes words in a given text and compares their frequency with that in a reference corpus. The result is the identification of keywords, i.e. units occurring significantly more often in the analyzed text than in the reference corpus, which represents neutral language use.

  5. Treq

    Treq is an easy-to-use application for looking up translation equivalents in bidirectional dictionaries between Czech and foreign languages, automatically extracted from parallel texts in the InterCorp corpus.
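
The keyword comparison described under KWords can be illustrated with a minimal sketch. The text above does not say which statistic KWords uses; the example assumes the log-likelihood keyness score, a common choice in corpus linguistics, and the function names, the threshold of 3.84 (roughly p < 0.05 at one degree of freedom) and the pre-tokenized inputs are all hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score (an assumed statistic, not necessarily KWords' own).

    a: frequency of the word in the analyzed text
    b: frequency of the word in the reference corpus
    c: size of the analyzed text in tokens
    d: size of the reference corpus in tokens
    """
    # Expected frequencies if the word were equally common in both samples.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(text_tokens, ref_tokens, threshold=3.84):
    """Return (word, score) pairs overrepresented in the text, highest score first."""
    text_freq = Counter(text_tokens)
    ref_freq = Counter(ref_tokens)
    n_text, n_ref = len(text_tokens), len(ref_tokens)
    result = []
    for word, a in text_freq.items():
        b = ref_freq.get(word, 0)
        # Keep only words that are relatively more frequent in the analyzed text.
        if a / n_text <= b / n_ref:
            continue
        score = log_likelihood(a, b, n_text, n_ref)
        if score >= threshold:
            result.append((word, score))
    return sorted(result, key=lambda pair: pair[1], reverse=True)
```

Calling `keywords(text_tokens, ref_tokens)` with two token lists returns the candidate keywords ranked by score. A real analysis would also need tokenization and lemmatization, which this sketch leaves out.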

About the CNC

The Czech National Corpus (CNC) project was established at the Faculty of Arts, Charles University in 1994 with the aim of creating general-purpose national language corpora.

In 2012, the importance of the CNC resources and services was recognised by the Ministry of Education, Youth and Sports, and the CNC has since been funded as a Large research infrastructure within the framework of the LM programme, currently as project LM2023044 (2023-2026). This enables the CNC to provide comprehensive user services, including continuous data mapping of Czech, application development and many-faceted user support.

Support and information resources

  1. Wiki

    The CNC web manual, in the form of a wiki, is a comprehensive corpus linguistics knowledge base. It also contains useful information about the CNC tools and resources, and an online tutorial in seven lessons aimed at both beginners and advanced users (Czech only).

  2. Support

    The support centre is a virtual platform accessible to all registered users. It features an advisory centre (with Q&A) and application-related issue tracking for bug reports and feature requests.

  3. Biblio

    Biblio is a repository of CNC-based research papers, books and theses. The repository is publicly available to all visitors of this portal and, at the same time, it serves as a continuously updated corpus linguistics bibliography. Would you like to know more?

  4. Advisory Board

    The Advisory Board is a permanent body of the Czech National Corpus research infrastructure. It monitors the scientific quality of the project, provides feedback concerning short-term and long-term strategy decisions and evaluates the project results.

  5. Language data

    Is access via the query interface insufficient for your research objectives? CNC also provides linguistic data in packages derived from the published corpora while respecting the limitations that result from agreements with text providers, copyright law and other regulations.

  6. For schools

    We are introducing a new repository of corpus-based exercises for language teaching at primary and secondary schools. This regularly updated webpage offers both a variety of worksheets ready to be printed out and handed to students and tips for the hands-on use of corpora in a language-learning environment (Czech only).

  7. CLARIN K-centre

    The CNC-based K-centre provides information, consulting and technical assistance in the area of corpus linguistics, specializing in empirical research on Czech. It is one of the K-centres of CLARIN, an ESFRI infrastructure focusing on digital language resources and tools for the Humanities and Social Sciences.