Available Corpora

Written corpora (synchronic)

corpus name   size (# of words)   lemmatisation   morphological tags   short description
SYN2009PUB    700 mil.            YES             YES                  corpus of newspapers and magazines from 1995 - 2007
SYN2006PUB    300 mil.            YES             YES                  corpus of newspapers and magazines from 1990 - 2004
SYN2005       100 mil.            YES             YES                  balanced corpus; most of the texts are from 2000 - 2004
SYN2000       100 mil.            YES             YES                  balanced corpus; most of the texts are from 1990 - 1999
FSC2000       100 mil.            YES             NO                   modified SYN2000, source of the Frequency Dictionary of Czech
KSK-DOPISY    800 000             NO              NO                   transcriptions of handwritten correspondence from 1990 - 2004
ORWELL        80 000              YES             YES                  Orwell's "1984", manually annotated

Spoken corpora (synchronic)

corpus name   size (# of words)   lemmatisation   morphological tags   short description
ORAL2008      1 mil.              NO              NO                   sociolinguistically balanced corpus of informal spoken Czech
ORAL2006      1 mil.              NO              NO                   corpus of informal spoken Czech
PMK           675 000             NO              NO                   Prague spoken corpus
BMK           490 000             NO              NO                   Brno spoken corpus

Diachronic corpus

corpus name   size (# of words)   lemmatisation   morphological tags   short description
DIAKORP       1.6 mil.            NO              NO                   corpus of the diachronic section of the CNC

Parallel corpus

corpus name   size (# of words)   lemmatisation   morphological tags   short description
InterCorp     44 mil.             YES (partial)   YES (partial)        parallel corpus being compiled as a part of the InterCorp project