Open Research

20 years of Open Databases

Two decades of Open Data for language diversity at the Surrey Morphology Group

Professor Erich Round & Dr Sacha Beniamine

The Surrey Morphology Group (SMG) is a world-leading research group in the School of Literature and Languages [1]. Since the 1990s, a key ingredient of SMG’s success has been its commitment to Open Research language data.

Motivation

In the 1990s, databases in the field of linguistics were often created alongside research projects, but not made publicly available. This impeded attempts by linguists to verify claims that were based on the data. From the very start, SMG committed to producing high-value databases that would be freely accessible.

Methods

Through two decades of research projects, we have developed and published 20 Open Research public databases on linguistic complexity and diversity. Rather than targeting big data of middling quality, we targeted defined areas for which we strove to produce very high-quality data. Scientifically, we grappled with linguistic phenomena connected by complex grammatical and cultural relationships. Building databases to foreground these complex relationships was a worthy challenge both conceptually and technically, and led to theoretical advances.

SMG currently maintains 20 web databases [2], created between 1999 (Surrey Person Syncretism Database) and 2021 (Surrey Lexical Splits Database, among others). They include dictionaries with social media interactivity (Nuer Lexicon, Archi dictionary), lexicons of conjugated verbs and declined nouns and adjectives (Skolt Saami Paradigms, Chichimec Paradigms, Oto-Manguean Inflection Class, Saanich verb), cross-linguistic descriptions of linguistic phenomena (Defectiveness, Deponency, Syncretism, Suppletion, Lexical Splits, Periphrasis, Agreement, Prominent Internal Possessors, Owners into Actors), curated collections of cultural artefacts (The Mian & Kilivila Collection, Endangered Languages and Cultures of Siberia) and descriptions of evolutionary phenomena (Morphosyntactic change).

Beyond websites, we have produced CD-ROMs, story books, grammar books and physical dictionaries. In total, SMG databases document 325 languages around the world across 100 language families. Around 200 of these languages are either vulnerable, endangered, or extinct.

When making our databases public, we ensure that users have access not only to our analyses but also to the underlying source information that supports our claims. This is a necessary condition for reproducible science, which has now become a high-priority issue. The language data we collect supports impactful resources and pedagogical material for language communities. Some databases include specific interfaces aimed at native speakers (Skolt Saami Paradigms, Chichimec Paradigms). For the general public we have created word games (Nuerdle, Archidle) based on our databases, and we maintain a blog, Morph, which delves into engaging and accessible snippets of our research findings.

Challenges

Many challenges have proven solvable with the application of ingenuity. Early concerns that data would be used without citation were addressed by the systematic use of DOIs and user-friendly cite buttons. Other challenges have been more durable. Databases demand a longevity that far surpasses the length of funded projects. The challenge of securing resources to ensure their technical continuity still awaits a stable solution.

Results

SMG is now renowned for its outstanding databases, which showcase our excellent research. Around the world, linguists from undergraduates to professors use these resources for learning, teaching and research. A priority looking ahead, as Open Data becomes more ingrained in scientific workflows, is to enhance the interoperability of databases at SMG and beyond.

Conclusions

Early adoption of Open Data was one of the keys to SMG’s success, in great part thanks to prompt institutional support before Open Data became standard research practice. The Surrey Morphology Group has played a leading role internationally in promoting Open Research in linguistics and in improving access to language data for researchers, language communities, and the general public. Our pioneering open databases have promoted the idea that such contributions are fundamental to science.

References

[1] https://www.smg.surrey.ac.uk/
[2] https://www.smg.surrey.ac.uk/databases/
