Available Corpora
Written corpora (synchronic)
| corpus name | size (# of words) | lemmatisation | morphological tags | short description |
| SYN2009PUB | 700 mil. | YES | YES | corpus of newspapers and magazines from 1995 - 2007 |
| SYN2006PUB | 300 mil. | YES | YES | corpus of newspapers and magazines from 1990 - 2004 |
| SYN2005 | 100 mil. | YES | YES | balanced corpus, most of the texts are from 2000 - 2004 |
| SYN2000 | 100 mil. | YES | YES | balanced corpus, most of the texts are from 1990 - 1999 |
| FSC2000 | 100 mil. | YES | NO | modified SYN2000, source of the Frequency Dictionary of Czech |
| KSK-DOPISY | 800 000 | NO | NO | transcriptions of handwritten correspondence from 1990 - 2004 |
| ORWELL | 80 000 | YES | YES | Orwell's "1984", manually annotated |
Spoken corpora (synchronic)
| corpus name | size (# of words) | lemmatisation | morphological tags | short description |
| ORAL2008 | 1 mil. | NO | NO | sociolinguistically balanced corpus of informal spoken Czech |
| ORAL2006 | 1 mil. | NO | NO | corpus of informal spoken Czech |
| PMK | 675 000 | NO | NO | Prague spoken corpus |
| BMK | 490 000 | NO | NO | Brno spoken corpus |
Diachronic corpus
| corpus name | size (# of words) | lemmatisation | morphological tags | short description |
| DIAKORP | 1.6 mil. | NO | NO | corpus of the diachronic section of the CNC |
Parallel corpus
| corpus name | size (# of words) | lemmatisation | morphological tags | short description |
| InterCorp | 44 mil. | YES (partial) | YES (partial) | parallel corpus being compiled as a part of the InterCorp project |
