Resources for Language Technologies

As part of our mission, we provide open data and resources for speech and language technologies, specifically automatic speech recognition (ASR), text-to-speech synthesis (TTS) and machine translation (MT), in the languages we work with. The list below gives each resource with a short explanation and references for further information. Some of these resources are also available on Col·lectivaT’s page on Hugging Face.

Name	Language	Type	License	Download
TV3Parla v0.3	Catalan	acoustic model	GNU AGPL-3.0	link
TV3Parla+ParlamentParla v0.2	Catalan	acoustic model	GNU AGPL-3.0	link
TV3Parla Corpus v0.3	Catalan	audio corpus	CC-BY-NC 4.0	link
ParlamentParla Corpus v2.0	Catalan	audio corpus	CC-BY 4.0	link
ParlamentParla Corpus - clean v1.0	Catalan	audio corpus	CC-BY 4.0	link
ParlamentParla Corpus - other v1.0	Catalan	audio corpus	CC-BY 4.0	link
ParlamentParla Corpus - old v0.3	Catalan	audio corpus	CC-BY 4.0	link
Catotron - Ona	Catalan	TTS model	CC-BY 4.0	link
Catotron - Pau	Catalan	TTS model	CC-BY 4.0	link
UPC FestCat Ona - optimized	Catalan	TTS audio corpus	CC BY-SA 3.0 ES	link
UPC FestCat Pau - optimized	Catalan	TTS audio corpus	CC BY-SA 3.0 ES	link
OpenSubtitles LM v1.0	Catalan	language model	CC-BY 4.0	link
Tamazight monolingual and parallel texts	Tamazight	text data	CC-BY 2.0	link
Araina text corpus	Occitan Aranese	text data	CC-0 1.0	link
Şalom articles	Judeospanish	text data	CC-BY 4.0	link
Una Fraza al diya	Judeospanish	text data	CC-BY 4.0	link

Acoustic corpora

Over the course of various projects, we have gathered publicly available speech data and converted it into acoustic training corpora. These data sets are available for download under various open licenses.

TV3Parla

This corpus includes 240 hours of Catalan speech from broadcast material. The details of segmentation, data processing and model training are explained in Külebi, Öktem; 2018. The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and are making it available under their terms of use.

The corpus is available here under a CC BY-NC 4.0 license.

This project was supported by the Softcatalà Association.

ParlamentParla

We gathered this audio corpus from the recordings and transcripts of the Catalan Parliament (Parlament de Catalunya) plenary sessions that took place between 2007 and 2018. We aligned the transcriptions with their respective recordings and segmented them into lengths suitable for ASR development. The content belongs to the Catalan Parliament and the data is released in accordance with their terms of use.

The version 0.3 corpus includes per-intervention full text aligned with audio links.

As of version 1.0, the corpus is available in two parts: 90 hours of clean and 230 hours of other quality segments.

As of version 2.0, the corpus is extended and separated into 211 hours of clean and 400 hours of other quality segments. Furthermore, each speech segment is tagged with its speaker and each speaker with their gender.

Preparation of this corpus was partly supported by the Department of Culture of the Catalan autonomous government, and version 2.0 was supported by the Barcelona Supercomputing Center, within the framework of the AINA project of the Department of Digital Policies.

UPC FestCat TTS Corpora

The FestCat corpus was developed by the TALP Research Center of the Polytechnic University of Barcelona in 2007 for building open source TTS systems for Catalan. We reprocessed this corpus, optimizing it for training Catotron, our neural-network-based TTS system. Long segments were either split or discarded so that no audio segment exceeds 12 seconds. The male voice corpus Pau contains 6 hours 54 minutes of audio and the female voice corpus Ona contains 6 hours 12 minutes. Both are released under the Attribution-ShareAlike 3.0 Spain (CC BY-SA 3.0 ES) license.
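
The 12-second limit described above can be illustrated with a minimal Python sketch. It is not the script used to prepare the corpora: the `Segment` type, the `partition_by_length` function and the example data are hypothetical, and the sketch only separates over-long segments from the rest. Whether an over-long segment is then split or discarded is a separate decision that this example does not model.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One utterance: an identifier plus start and end times in seconds."""
    utt_id: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


MAX_SECONDS = 12.0  # maximum audio length stated in the text


def partition_by_length(segments, max_seconds=MAX_SECONDS):
    """Return (kept, too_long): segments within the limit and those above it."""
    kept = [s for s in segments if s.duration <= max_seconds]
    too_long = [s for s in segments if s.duration > max_seconds]
    return kept, too_long


# Hypothetical example data, not taken from the FestCat corpus.
example = [
    Segment("utt_a", 0.0, 4.5),
    Segment("utt_b", 4.5, 18.0),
    Segment("utt_c", 18.0, 30.0),
]
kept, too_long = partition_by_length(example)
print([s.utt_id for s in kept])      # segments at or under the limit
print([s.utt_id for s in too_long])  # candidates for splitting or discarding
```

Segments returned in `too_long` would then need to be split at suitable points or dropped before training.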

Preparation of this corpus was supported by the Department of Culture of the Catalan autonomous government.

ASR models

These are the ASR models that we trained with the CMUSphinx speech recognition toolkit, using the corpora described above. We continue to maintain and improve the models in our repository. Installation and configuration guides, including tutorials on basic use cases, are available in the wiki.

TV3Parla v0.3: sphinxtrain 5pre-alpha continuous model
TV3Parla+ParlamentParla v0.2: sphinxtrain 5pre-alpha continuous model

TTS models

Catotron is the first free, open speech synthesis system based on neural networks. Col·lectivaT has led the development with funding from the Department of Culture of the Catalan autonomous government, with the participation of researchers from the Natural Language Processing research group (TALN) of Pompeu Fabra University and the Language and Speech Technologies and Applications Center of the Polytechnic University of Catalonia (UPC-TALP).

Official page
Project blog with links to models (Ona, Pau, Waveglow, MelGAN) and samples
Source code for GPU and CPU
Jupyter notebooks for inference and speaker adaptation

For more information, see our paper published at Interspeech 2020.

The preparation of this page was supported by the Department of Culture of the Catalan autonomous government.