Open Research
Overview
SMG has played a leading role internationally in promoting Open Research in linguistics and improved access to language data for researchers, language communities, and the general public. Our pioneering open databases have promoted the idea that such contributions are fundamental to science.
Data rescue
We are retro-standardizing older resources to ensure their long-term preservation. This often involves normalizing and standardizing the resources to make them machine-readable.
- The Romance Verbal Inflection Dataset 2.0:
  - Sacha Beniamine, Martin Maiden, and Erich Round. 2020. Opening the Romance Verbal Inflection Dataset 2.0: A CLDF lexicon. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3027–3035, Marseille, France. European Language Resources Association.
  - Original database: http://romverbmorph.clp.ox.ac.uk/
  - Data record on Zenodo: http://dx.doi.org/10.5281/zenodo.3552167
- The Surrey Morphological Complexity Database:
  - Baerman, Matthew, Dunstan Brown, Roger Evans, Greville G. Corbett, Lynne Cahill & Sacha Beniamine. 2023. Surrey Morphological Complexity Database. University of Surrey. http://dx.doi.org/10.15126/SMG.23/1
  - Data repository: https://gitlab.com/surrey_morphology_group/smc_data
  - The database website is automatically generated using the MkDocs static site generator.
Data archives
- All of the SMG deposits on Zenodo: Zenodo SMG community
- All of the Paralex deposits on Zenodo: Zenodo Paralex community
Data standard
- Paralex standard: Paralex is a standard for morphological lexicons which document inflectional paradigms.
DeAR data for researchers
We formulated these principles in order to publish high-quality, easily citable, scientifically impactful data that remains useful in the long term:
Decentralized
Data is decentralized, with no single team or institution operating a central database. Standards serve as a format to share data and as a means for researchers to create high-quality, interoperable data. International collaboration must be incentivized for data coverage to scale up to the world's languages.
Automated verification
Data is tested automatically against the descriptions in the metadata in order to guarantee data quality. Moreover, data quality can be checked by writing custom tests (as is done in software development), which are run after each change of the data.
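As an illustration of this principle, here is a minimal sketch of an automated check, written with the Python standard library only. The schema, column names, and sample rows are hypothetical and do not reproduce the Paralex standard or any SMG dataset; a real project would more likely rely on a dedicated validation tool driven by the dataset's own metadata.

```python
import csv
import io

# Hypothetical metadata: each column and the constraints its values must satisfy.
SCHEMA = {
    "form_id": {"required": True, "unique": True},
    "lexeme": {"required": True},
    "cell": {"required": True, "allowed": {"prs.1sg", "prs.2sg", "prs.3sg"}},
    "phon_form": {"required": True},
}

# Hypothetical sample data; the last row is deliberately invalid.
SAMPLE = """form_id,lexeme,cell,phon_form
f1,cantare,prs.1sg,kanto
f2,cantare,prs.2sg,kanti
f2,cantare,prs.4sg,
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows, schema):
    """Return a list of error messages (empty if the data matches the schema)."""
    errors = []
    seen = {col: set() for col, rules in schema.items() if rules.get("unique")}
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col, rules in schema.items():
            value = (row.get(col) or "").strip()
            if not value:
                if rules.get("required"):
                    errors.append(f"line {line_no}: missing value for '{col}'")
                continue
            if "allowed" in rules and value not in rules["allowed"]:
                errors.append(f"line {line_no}: unexpected {col} '{value}'")
            if col in seen:
                if value in seen[col]:
                    errors.append(f"line {line_no}: duplicate {col} '{value}'")
                seen[col].add(value)
    return errors


errors = validate(load_rows(SAMPLE), SCHEMA)
for message in errors:
    print(message)
```

A check of this kind can be run after every change to the data, so that an invalid edit is reported before the dataset is published.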
Revisable pipelines
Dataset authors must be able to continuously update the presentation of their data, in particular websites, to reflect the evolving nature of the data. This is achieved by generating those publications automatically and directly from the standardized dataset. Automated tools can be leveraged to produce user-friendly representations of the data (for example static websites or publication-ready PDFs). These tools can be run again at any point, so that it is easy to regenerate the publications from the data edited by the researchers.
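The idea of regenerating a presentation directly from the data can be sketched as follows. This is an illustrative example only: the rows, the column names, and the `render_paradigms` function are hypothetical and are not taken from any SMG tool. It turns a small table of forms into a Markdown table of the kind a static site generator could publish.

```python
from collections import defaultdict

# Hypothetical rows standing in for a standardized table of inflected forms.
FORMS = [
    {"lexeme": "cantare", "cell": "prs.1sg", "phon_form": "kanto"},
    {"lexeme": "cantare", "cell": "prs.2sg", "phon_form": "kanti"},
    {"lexeme": "dormire", "cell": "prs.1sg", "phon_form": "dormo"},
    {"lexeme": "dormire", "cell": "prs.2sg", "phon_form": "dormi"},
]


def render_paradigms(forms):
    """Render a Markdown table with one row per lexeme and one column per cell."""
    cells = sorted({row["cell"] for row in forms})
    paradigms = defaultdict(dict)
    for row in forms:
        paradigms[row["lexeme"]][row["cell"]] = row["phon_form"]

    lines = [
        "| lexeme | " + " | ".join(cells) + " |",
        "|" + " --- |" * (len(cells) + 1),
    ]
    for lexeme in sorted(paradigms):
        values = [paradigms[lexeme].get(cell, "") for cell in cells]
        lines.append(f"| {lexeme} | " + " | ".join(values) + " |")
    return "\n".join(lines)


page = render_paradigms(FORMS)
print(page)
```

Because the page is a pure function of the data, rerunning the script after each edit keeps the published view in sync with the dataset.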
Both principles A (automated verification) and R (revisable pipelines) fit particularly well with the use of version control systems such as git, where validation, testing, and publishing can be done through continuous integration pipelines.
Data tools
- Gitlab2Zenodo: Automatically send new versions of your data to Zenodo
  pip install gitlab2zenodo
- Paralex package: Generate metadata for Paralex datasets; check that your data is valid.
- Coming soon:
  - Paralex sites: Generate sites automatically from Paralex datasets.
  - Paradigms library: Manipulate morphological data easily; create publication-ready tables.