Whip: Human and machine-readable specifications for data

Van Hoey, Stijn; Desmet, Peter

doi:10.22032/dbt.37799

Talk 2018 CC BY 4.0

Published

Different tools and technologies are available to clean and harmonize data. Whatever tool is used, the ability to assess the quality of a data set and identify potential errors is crucial for harmonization efforts. This need becomes even more apparent in the context of data publication, (re)use and aggregation. Documentation and guidelines about data requirements provide guidance in this process and make it possible to communicate what to expect from the data, but they are mostly intended for humans only. To facilitate the harmonization process, we propose the use of a specification file that describes the constraints with which the data should comply. Its syntax is both human- and machine-readable, so it can be used to communicate the expected data quality or conformity and to validate data automatically. The scope of a set of specifications can be specific to a data set, a researcher or a research community, which allows both bottom-up and top-down adoption. As an example, we apply the specifications to verify data mapped to the biodiversity information standard Darwin Core. In this talk, we will present "whip", a proposed syntax and format to express data specifications. Whip makes it possible to define column-based constraints for tabular (tidy) data using a number of rules. We will also demonstrate a software application, "pywhip", that validates data sets against these specifications. We hope this will trigger a discussion on how to express data specifications and communicate data quality expectations.
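
To make the idea of column-based constraints concrete, here is a minimal Python sketch. It does not use whip or pywhip: the specification layout, the rule names ("allowed", "min", "max"), the helper functions and the sample data are all assumptions chosen for illustration, and the actual whip syntax may differ. It only shows how one specification can serve both as a readable statement of expectations and as input to an automatic check.

```python
import csv
import io

# Hypothetical specification (not actual whip syntax): one entry per
# column, each listing the rules its values should satisfy.
SPECIFICATIONS = {
    "countryCode": {"allowed": ["BE", "NL"]},
    "individualCount": {"min": 1, "max": 100},
}


def check_value(value, rules):
    """Return the names of the rules that a single value violates."""
    violations = []
    if "allowed" in rules and value not in rules["allowed"]:
        violations.append("allowed")
    if "min" in rules or "max" in rules:
        try:
            number = float(value)
        except ValueError:
            # A non-numeric value cannot be compared against min/max.
            violations.append("numeric")
            return violations
        if "min" in rules and number < rules["min"]:
            violations.append("min")
        if "max" in rules and number > rules["max"]:
            violations.append("max")
    return violations


def validate(rows, specifications):
    """Check every row against every column specification.

    Returns a list of (row number, column, violated rule) tuples.
    """
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for column, rules in specifications.items():
            for rule in check_value(row.get(column, ""), rules):
                errors.append((row_number, column, rule))
    return errors


# Made-up sample data using two Darwin Core term names as column headers.
SAMPLE = """countryCode,individualCount
BE,3
FR,2
NL,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
errors = validate(rows, SPECIFICATIONS)
for error in errors:
    print(error)
```

Keeping the rules in a plain data structure rather than in code is what lets the same file be read by people and applied by a validator. A real specification format would need more rule types than the three sketched here.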


Classification

Conference: ICEI 2018 : 10th International Conference on Ecological Informatics - Translating Ecological Data into Knowledge and Decisions in a Rapidly Changing World
Published in: ICEI 2018 : 10th International Conference on Ecological Informatics - Translating Ecological Data into Knowledge and Decisions…
(2018)
Publication date: 2018
DOI: 10.22032/dbt.37799
Language: English
Resource type: Text
Extent: 27 pages
Place of publication: Jena
Keywords: Data Harmonization, Data Quality, Documentation, Specifications
DDC subject group of the DNB: 004 Computer science
DDC subject group of the DNB: 570 Life sciences, biology
DDC subject group of the DNB: 580 Plants (botany)
DDC subject group of the DNB: 590 Animals (zoology)
DDC subject group of the DNB: 600 Technology
DDC subject group of the DNB: 630 Agriculture, veterinary medicine
Institution: Friedrich-Schiller-Universität Jena
