Open Research

Overview

SMG has played a leading role internationally in promoting Open Research in linguistics and in improving access to language data for researchers, language communities, and the general public. Our pioneering open databases have promoted the idea that such contributions are fundamental to science.

Data rescue


We are retro-standardizing older resources to ensure their long-term preservation. This often involves normalizing and standardizing them so that they become machine-readable.

Data archives

Data standard

  • Paralex standard: Paralex is a standard for morphological lexicons which document inflectional paradigms.

DeAR data for researchers

We formulated these principles in order to publish high-quality, easily citable, scientifically impactful data that remains useful in the long term:

Decentralized

Data is decentralized, with no single team or institution operating a central database. Standards serve as a format to share data and as a means for researchers to create high-quality, interoperable data. International collaboration must be incentivized for data coverage to scale up to the world's languages.

Automated verification

Data is tested automatically against the descriptions in the metadata in order to guarantee data quality. Moreover, data quality can be checked by writing custom tests (as is done in software development), which are run after each change to the data.
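
As an illustration of testing data against its metadata, here is a minimal sketch in Python. It assumes a CSV table and a hand-written dictionary of column rules; the column names, the rule keys (required, unique, allowed) and the helper functions are hypothetical and are not the Paralex schema or the API of the Paralex package.

```python
import csv
import io

# Hypothetical metadata: one entry per column, with simple constraints.
# This is an illustrative format, not the Paralex metadata schema.
METADATA = {
    "form_id": {"required": True, "unique": True},
    "lexeme": {"required": True},
    "cell": {"required": True, "allowed": {"sg", "pl"}},
    "phon_form": {"required": True},
}


def read_table(text):
    """Parse CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows, metadata):
    """Return a list of error messages; an empty list means the table passes."""
    errors = []
    seen = {name: set() for name, rule in metadata.items() if rule.get("unique")}
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        for name, rule in metadata.items():
            value = (row.get(name) or "").strip()
            if not value:
                if rule.get("required"):
                    errors.append(f"line {line_no}: missing value for '{name}'")
                continue
            if "allowed" in rule and value not in rule["allowed"]:
                errors.append(f"line {line_no}: '{value}' is not allowed for '{name}'")
            if name in seen:
                if value in seen[name]:
                    errors.append(f"line {line_no}: duplicate '{name}' value '{value}'")
                seen[name].add(value)
    return errors


# Two small invented tables: one valid, one with three deliberate problems.
GOOD = "form_id,lexeme,cell,phon_form\n1,cat,sg,kat\n2,cat,pl,kats\n"
BAD = "form_id,lexeme,cell,phon_form\n1,cat,sg,kat\n1,cat,dual,\n"

if __name__ == "__main__":
    for label, table in (("good", GOOD), ("bad", BAD)):
        print(label, validate(read_table(table), METADATA))
```

A check of this kind can be run after every change to the data, in the same way a unit test suite is run after every change to software.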

Revisable pipelines

Dataset authors must be able to continuously update presentations of the data, in particular websites, to reflect the evolving nature of the data. This is achieved by generating those publications automatically and directly from the standardized dataset. Automated tools can be leveraged to produce user-friendly representations of the data (for example static websites, publication-ready PDFs, etc.). These tools can be run again at any point, so it is easy to regenerate the publications from the data edited by the researchers.
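
The idea of regenerating a publication directly from the dataset can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a CSV table, the output is a single HTML page built with the Python standard library, and the function name render_page and the sample data are hypothetical. It does not show how any existing SMG tool works.

```python
import csv
import html
import io


def render_page(table_text, title):
    """Build a single static HTML page (returned as a string) from CSV text."""
    reader = csv.DictReader(io.StringIO(table_text))
    columns = reader.fieldnames or []
    rows = list(reader)
    header = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "\n".join(
        "<tr>"
        + "".join(f"<td>{html.escape(row[c] or '')}</td>" for c in columns)
        + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<table>\n<tr>{header}</tr>\n{body}\n</table>\n"
        "</body></html>\n"
    )


# Invented sample data, standing in for a standardized dataset.
DATA = "lexeme,cell,phon_form\ncat,sg,kat\ncat,pl,kats\n"

if __name__ == "__main__":
    # Re-running this after editing the data regenerates the page from scratch.
    with open("index.html", "w", encoding="utf-8") as out:
        out.write(render_page(DATA, "Demo paradigm"))
```

Because the page is a pure function of the data, the same input always yields the same output, so the site never drifts out of sync with the dataset it presents.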

Both principles A and R fit particularly well with the use of versioning systems such as git, where validation, testing and publishing can be done through continuous development pipelines.

Data tools

  • Gitlab2Zenodo: Automatically send new versions of your data to Zenodo
    pip install gitlab2zenodo
  • Paralex package: Generate metadata for Paralex datasets; check that your data is valid.
  • Coming soon:
    • Paralex sites: Generate sites automatically from Paralex datasets.
    • Paradigms library: Manipulate morphological data easily; create publication-ready tables.