Data package is valid! - Open Knowledge Blog

This post is the second in a series by the Frictionless Data Fellows on how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here: http://fellows.frictionlessdata.io/.

By Ouso Daniel

The last few months have been exciting, to say the least. I dug deep into how to minimise friction in data workflows and promote openness and reproducibility. Through the Frictionless Data (FD) Field Guide, I got to know various FD software tools for improving data publishing workflows. We looked at a number of case studies where FD worked well for reproducibility; one example is the eLife study. We also looked at contributing and coding best practices. Moreover, I found Understanding JSON Schema (by json-schema.org) a great guide to understanding the data package schema, which is JSON-based. It all culminated in the creation of a data package, and I now want to share that experience.

To quality-check your data package, you must, among other things, validate it before downloading it for sharing. The best you can get from that process is “Data package is valid!”. What about before then?

Data package

Simply put, a data package is data coupled to its associated attributes (metadata) in a JSON format. To marry the data to its attributes you will need an FD tool. Here is the data package I created.
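To make "data coupled to its attributes" a little more concrete, here is a minimal sketch in Python that builds a small descriptor and serialises it to JSON. Every name, path and field in it is a hypothetical placeholder, not the content of my actual package, and a real descriptor would normally carry more metadata than this.

```python
import json

# Hypothetical descriptor: all names, paths and fields below are placeholders.
descriptor = {
    "name": "example-package",
    "title": "An example data package",
    "resources": [
        {
            "name": "samples",
            "path": "data/samples.csv",  # where the tabular data lives
            "schema": {
                "fields": [
                    {"name": "sample_id", "type": "string"},
                    {"name": "melting_temperature", "type": "number"},
                ]
            },
        }
    ],
}

# The data package descriptor is this structure written out as JSON.
datapackage_json = json.dumps(descriptor, indent=2)
print(datapackage_json)
```

The data itself stays in the CSV file; the JSON only points to it and describes it.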

Data Package Creator (DPC)

A DPC gives you a data package. The good news is that it caters for both kinds of users: programmers and GUI users. I will describe the latter case. It is a web app with three main components: the Metadata pane on the left, the Resources (a data article) pane in the middle, and the Schema pane on the right (usually hidden, but it can be exposed by clicking the three-dots-in-curly-brackets icon).

The Data

I used data from my project, in which I evaluated the application of a molecular technique, high-resolution melting analysis, to the identification of wildlife species illegally targeted as bushmeat. I had two files containing tabular data: one with information on the samples analysed and the sequences deposited in GenBank, and the other on blind validation of species identification across three mitochondrial markers. My data package thus had two resources. The data lived in my local repository, but I pushed it to GitHub in CSV format for easy accessibility.

Creating the Data Package

You may follow along, in detail, with the data package specifications. In the Resources pane, from left to right, I entered a name for my resource and its path. I pasted the raw GitHub link to my data into the path field and clicked the load button to its right. If your data is local, clicking the load button opens your file system instead. DPC automatically inferred the data structure and prompted me to load the inferred fields (columns). I cross-checked that the data type of each field was correctly inferred, and added titles and descriptions. I left the data format of each field as the default. From the gear-wheel (settings) icon in the resource tab, I gave each of the two resources a title, description, format and encoding. The resource profile is also inferred automatically. All the field and resource metadata that I entered is optional, but filling it in is what makes the package intentionally reproducible and open. On the other hand, the general data package has compulsory metadata, entered in the Metadata pane: the name and the title. Be sure to get the name right: it must match the pattern ^([-a-z0-9._/])+$ for the data package to be valid, and a mismatch is the most probable error you will encounter.
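If you want to check a candidate name before typing it into the DPC, the pattern quoted above can be tried out locally. This is only a sketch using Python's standard re module; the function name and the candidate names are hypothetical examples of mine, and the DPC's own validation may check more than this.

```python
import re

# The pattern quoted in the text for the data package name.
NAME_PATTERN = re.compile(r"^([-a-z0-9._/])+$")


def is_valid_package_name(name):
    """Return True if name uses only lowercase letters, digits, '-', '.', '_' and '/'."""
    # fullmatch also rejects a trailing newline, which '$' alone would tolerate.
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical candidate names, for illustration only.
for candidate in ["hrm-bushmeat-data", "HRM Bushmeat Data", "hrm_data.v1"]:
    print(candidate, is_valid_package_name(candidate))
```

Uppercase letters and spaces are the usual culprits, so a lowercase, hyphenated name is a safe choice.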

The data package allows very rich metadata capture, which is one of its strengths for data reusability. There are three metadata categories, which must not be confused: data package metadata, resource metadata and field (column) metadata, nested in that order. After entering all the necessary details in the DPC, you have to validate your data package before downloading it. The two buttons for these purposes are at the bottom of the Metadata pane. Any errors will be captured and described at the very top of the Resources pane. Otherwise, you will see the title of this post, after which you can download your data package and rename it accordingly, retaining the .json extension.
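Because the three metadata levels are nested, it is easy to fill in a title at one level and forget it at another. The following sketch walks a descriptor from package to resources to fields and reports every level that lacks a title. The descriptor and the helper name missing_titles are hypothetical; this is not how the DPC validates, just a way to see the nesting.

```python
import json

# Hypothetical text standing in for a downloaded datapackage.json.
descriptor_text = """
{
  "name": "example-package",
  "title": "An example data package",
  "resources": [
    {
      "name": "samples",
      "title": "Sample information",
      "schema": {
        "fields": [
          {"name": "sample_id", "title": "Sample ID", "type": "string"},
          {"name": "species", "type": "string"}
        ]
      }
    }
  ]
}
"""


def missing_titles(descriptor):
    """Return a path for every package, resource or field that lacks a 'title'."""
    missing = []
    if not descriptor.get("title"):  # level 1: data package metadata
        missing.append("package")
    for resource in descriptor.get("resources", []):  # level 2: resource metadata
        resource_path = f"resource:{resource.get('name')}"
        if not resource.get("title"):
            missing.append(resource_path)
        for field in resource.get("schema", {}).get("fields", []):  # level 3: field metadata
            if not field.get("title"):
                missing.append(f"{resource_path}/field:{field.get('name')}")
    return missing


gaps = missing_titles(json.loads(descriptor_text))
print(gaps)
```

Here only the species field is reported, since the package, the resource and the other field all have titles.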

Conclusion

I applied DPC first-hand in my research, and so can you. We created a data package starting from and ending with the most widely used data organisation formats, CSV and JSON respectively (interoperability). We gave it adequate metadata to allow a stranger to comfortably make sense of the data (reusability) and provided licence information, CC-BY-SA-4.0 (accessibility). The data package is also uniquely identified and available in a public repository on GitHub (findability). In short, it is a FAIR data package. Moreover, the data package is very light (portable), which makes it easy to share, and it is open and reproducible. The package is holistic, containing metadata, data and a schema (a blueprint for the data structure and metadata). “How do I use the data package?” you may ask.

Way forward

Keep the term goodtables in mind; I will tell you how it is useful with the data package we just created. Until then, you may keep in touch by reading the periodic blogs on the Frictionless Data fellowship, where you will also find work by my colleagues Sele, Monica and Lily. Follow me and OKF on Twitter for quick updates.