At Gerulata Technologies, we believe NLP technologies are an essential part of our toolkit. In 2020, we created the first masked language model pretrained specifically for the Slovak language. Now we are open-sourcing it.
Slovak is what people in NLP would call a "low-resource language", meaning that there are very few published datasets that could be used for training ML models. While it is easy to get large, good-quality datasets for English or Spanish (you can practically download them from the internet), such datasets are not available for smaller languages. Creating your own datasets means a lot of web crawling, text cleaning and pre-processing.
Once you manage to put the training dataset together, you still have to train the model, which is a computationally intensive task that typically requires access to specialized hardware.
However, once you have the model trained, it can run on widely available hardware and be used to perform a variety of useful tasks.
Together with our research partners at KInIT, we are publishing a paper that details the training process and evaluates the performance of our SlovakBERT model, comparing it with leading cross-lingual models. You can download and read the research paper on arXiv: [TODO: Insert Link]
If you are interested in testing and using the model yourself, it is available, together with basic usage examples, from Hugging Face: https://huggingface.co/gerulata/slovakbert
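
To give a feel for what using a masked language model looks like, here is a minimal sketch that asks the model to fill in a missing word. It assumes the Hugging Face transformers library (with a backend such as PyTorch) is installed and that the model uses the RoBERTa-style `<mask>` token. The helper name top_predictions and the example sentence are ours, purely for illustration; the model page remains the authoritative source for usage.

```python
MODEL_ID = "gerulata/slovakbert"  # model name taken from the Hugging Face URL above
MASK_TOKEN = "<mask>"             # assumption: RoBERTa-style mask token


def top_predictions(sentence, k=3):
    """Return the k most likely (token, score) fills for the masked word.

    Hypothetical helper for illustration; not part of any library.
    """
    if sentence.count(MASK_TOKEN) != 1:
        raise ValueError(f"sentence must contain exactly one {MASK_TOKEN}")

    # Imported lazily so the input check above works without transformers.
    from transformers import pipeline

    # Downloads the model weights on first use.
    unmasker = pipeline("fill-mask", model=MODEL_ID)
    results = unmasker(sentence, top_k=k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


# Example call (illustrative sentence, "Children are <mask> on the playground."):
# for token, score in top_predictions("Deti sa <mask> na ihrisku."):
#     print(f"{token}\t{score:.3f}")
```

The pipeline is rebuilt on every call here to keep the sketch short; in a real application you would probably create it once and reuse it.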