The Latvian Prose Counter displays data on Latvian novels of the 19th and 20th centuries. Work on the corpus began with the earliest Latvian works of prose fiction, so early novels are better represented in the data set available in the Counter. All novels published as books by 1920 are included in the present data set.
Creating a corpus involves several steps: scanning the books, segmentation, optical character recognition, error correction, morphological and syntactic tagging, and the creation of metadata. The oldest Latvian novels required the greatest effort: we used machine learning methods to improve the quality of optical character recognition, and we converted the old Fraktur script into modern orthography so that modern language processing tools could be applied. Many early Latvian novels have never been reprinted in modern orthography, and few people today have read them.
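
To give a rough idea of what converting old spelling to modern orthography involves, the following Python sketch applies a handful of substitution rules to already recognized text. It is only an illustration and not the method used for the corpus: the rule table `RULES` and the function `modernize` are hypothetical names, the rules are a small and incomplete subset, and real texts contain ambiguities (for example, the long s of the Fraktur typeface) that simple substitution cannot resolve.

```python
import re

# Hypothetical, deliberately incomplete rule table (old spelling -> modern).
# These few rules are assumptions chosen for illustration only.
RULES = {
    "tsch": "č",
    "sch": "š",   # ambiguous in real texts: may also stand for "ž"
    "ee": "ie",
    "ah": "ā",    # "h" after a vowel marked vowel length
    "eh": "ē",
    "ih": "ī",
    "uh": "ū",
    "w": "v",
}

# Build one alternation with the longest keys first, so that "tsch"
# is tried before "sch". A single pass avoids cascading replacements.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, RULES), key=len, reverse=True)),
    flags=re.IGNORECASE,
)


def modernize(text: str) -> str:
    """Replace old-orthography letter groups using the toy RULES table."""

    def repl(match: re.Match) -> str:
        old = match.group(0)
        new = RULES[old.lower()]
        # Keep an initial capital if the matched group started with one.
        return new.capitalize() if old[0].isupper() else new

    return _PATTERN.sub(repl, text)


print(modernize("Latweeschu"))
```

With these rules, `modernize("Latweeschu")` returns `Latviešu`. A rule table of this kind breaks down quickly on real text, which is one reason error correction and more robust methods are needed in practice.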