Grammarly releases GEC dataset for the Ukrainian language

Grammarly, a Ukrainian company that develops writing tools, has released the first annotated grammatical error correction (GEC) dataset for the Ukrainian language, the company’s press service told AIN.UA.

What is the UA-GEC dataset

Grammatical Error Correction (GEC) is the task of detecting and correcting errors in written text. The GEC dataset is a collection of texts written by volunteers and then annotated by linguists, who corrected stylistic, spelling, and other errors.
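To make the idea of an annotated text concrete, here is a minimal Python sketch of how such a corpus could be read. The inline markup `{original=>correction:::error_type=Category}`, the `Edit` class, and the `split_annotated` function are assumptions made for illustration; the actual dataset may store its annotations differently. The sample sentence is in English for readability.

```python
import re
from dataclasses import dataclass

# Hypothetical inline markup, assumed for illustration only:
#   {original=>correction:::error_type=Category}
ANNOTATION = re.compile(
    r"\{(?P<source>[^{}]*?)=>(?P<target>[^{}]*?):::error_type=(?P<error_type>[^{}]*)\}"
)


@dataclass
class Edit:
    source: str      # what the volunteer wrote
    target: str      # what the linguist corrected it to
    error_type: str  # category assigned by the linguist


def split_annotated(text):
    """Return (original text, corrected text, list of edits)."""
    edits = [Edit(**m.groupdict()) for m in ANNOTATION.finditer(text)]
    # Keep the left side of each annotation to recover the writer's text.
    original = ANNOTATION.sub(lambda m: m.group("source"), text)
    # Keep the right side to recover the linguist's corrected text.
    corrected = ANNOTATION.sub(lambda m: m.group("target"), text)
    return original, corrected, edits


sample = "I {likes=>like:::error_type=Grammar} {turtels=>turtles:::error_type=Spelling}."
original, corrected, edits = split_annotated(sample)
print(original)
print(corrected)
for edit in edits:
    print(edit)
```

Pairs of original and corrected sentences like these are what a GEC system is typically trained and evaluated on, while the error categories allow results to be broken down by error type.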

According to Grammarly, the dataset is intended for both academic and practical study of the language. At launch, it includes more than a thousand texts of various genres, and almost 500 volunteers from Ukraine and abroad took part in creating it. The company plans to expand the dataset.

The GEC corpus for the Ukrainian language can be downloaded here.

Project goals

The company explains that the project:

  • will speed up the development of voice assistants and online grammar correction systems for the Ukrainian language;
  • will promote the use of high-quality Ukrainian on the Internet;
  • will increase the number of open NLP tools for studying Ukrainian.

“We see this project as valuable for the development of Ukrainian computational linguistics and the Ukrainian language online, and that’s why we decided to make it a permanent project for our company,” said Anastasiya Osidach, who manages Grammarly’s computational linguistics team and the GEC corpus project.

Grammarly notes that the GEC dataset for the Ukrainian language will remain a permanent project. To contribute, you can write an essay, translate a text, or share your own material on the project website.