Chapter 6 Documentation And Publications

6.1 Citations, Bibliography

For any external literature, data source, please, store the citation information in a well-formatted, thematic .bib files.

We create programatically .bib citation information files for our data products, software releases, and published documents.

Marking up a reference to (Antal 2021b) with [@R-retroharmonize] will place the citation in your selected format (here APA) to the text, and to the references at the end of this documentation.

We would love to cite your work. Please, create a well-formatted .bib file with your publications about the best use of data, creation of data, scientific analysis of data, etc, and save it to this repo as name-surname.bib.

Make sure that each .bib entry contains at least the title, author and year fields.

6.2 Software Releases

We have so far three released software products.

  • The aim of retroharmonize is to provide tools for reproducible retrospective (ex-post) harmonization of datasets that contain variables measuring the same concepts but coded in different ways.
  • The regions package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed parallel with other rOpenGov packages.
  • iotables processes all the symmetric input-output tables of the EU member states, and calculates direct, indirect and induced effects, multipliers for GVA, employment, taxation. These are important inputs into policy evaluation, business forecasting, or granting/development indicator design. iotables is used by about 800 experts around the world.

They are released on CRAN, and they follow the CRAN guidelines for unit-testing and human review. As CRAN relies on extensive automated testing, the formatting standard is very strict both for the software code and its documentation. The slightest deviation results in rejection.

6.3 Metadata

Metadata plays an important role in our databases and in documentation, too. We placed it into a separate chapter.

6.4 Data releases

We currently work with two platforms, and we must maintain compatibility with data repositories that our clients use. Both provide some validation, version control, and give a standard doi to our releases.

6.4.1 Dataverse

dataverse.org/ has a well-supported API, and this is the choice of our first academic partner, IViR.

The Dataverse Project is being developed at Harvard’s Institute for Quantitative Social Science (IQSS), along with many collaborators and contributors worldwide. The Dataverse Project was built on our experience with our earlier Virtual Data Center (VDC) project, which spanned 1997-2006 as a collaboration between the Harvard-MIT Data Center (now part of IQSS) and the Harvard University Library. Precursors to the VDC date to 1987, comprising such entities as pre-web software to automatically transfer cataloging information by FTP to other sites across campus automatically at designated times, and before that to a stand-alone software guide to local data.

6.4.2 Zenodo

zenodo is the choice of the European Union, and it is likely that in the future all EU-financed research data must be published here.

The OpenAIRE project, in the vanguard of the open access and open data movements in Europe was commissioned by the EC to support their nascent Open Data policy by providing a catch-all repository for EC funded research. CERN, an OpenAIRE partner and pioneer in open source, open access and open data, provided this capability and Zenodo was launched in May 2013.

In support of its research programme CERN has developed tools for Big Data management and extended Digital Library capabilities for Open Data. Through Zenodo these Big Science tools could be effectively shared with the long-tail of research.

6.5 Publications

Our work is often embedded in publications. We want to make the application of various formatting guidelines as painless as possible. This is one of the main motivations to use markdown and Rmarkdown to document our work. There are more and more conversion tools that automatically convert our longform documentation from markdown or Rmarkdown to the formatting standards of almost any scientific publisher.

6.6 Namespace

6.6.1 Naming variables, objects, features

In our program code and databases, we are following Open Science naming guidelines. While there are many versions of naming guidelines, we must pick something that works for all and adhere to it.

For ease of programmatic use, we should follow a lowercase, snakecase variable naming.

## Warning: package 'dplyr' was built under R version 4.0.4
original canonical
This Variable this_variable
X1 x_1
VarName var_name
My_Var my_var
that_variable that_variable

Variables should describe things starting with general to specific, when possible, or make group selections, such as starts_with() or ends_with() easy to use, because contains() can be ambiguous.

original canonical
spotify_artist_id spotify_artist_id
deezer_id_artist deezer_id_artist
id_on_bandcamp_track id_on_bandcamp_track
artist_bandcamp_id artist_bandcamp_id
gender_artist gender_artist
  • contains("artist") selects all artist features
  • ends_with("_id") selects all ID’s. Because id is short, it may appear in words like india. But _id is unique, particularly if it is the end of a variable name.
  • starts_with("spotify") selects all features related to Spotify, and starts_with("spotify") & contains("_rec_") selects all recording features.