
All About Dataset Markup

6 Oct 2020
Nirlep Patel
Structured Data

Google Dataset Search helps researchers, scientists, data journalists and others find datasets published online. Datasets become easier to discover when their pages carry the schema.org Dataset markup or other metadata standards used for describing datasets. The main aim of this markup is to make datasets from fields such as the sciences, social sciences, machine learning, and civic and government data easily discoverable. You can search for datasets using the Dataset Search tool.

What Type Of Data Qualifies As Datasets?

The following types of data qualify as datasets:

  • A CSV file or a table containing data.
  • An organised collection of tables.
  • A file in a proprietary format that contains data.
  • A collection of files that together contain meaningful data.
  • Images that capture data.
  • Machine learning files.
  • Anything that looks like a dataset to you.
  • A structured object containing data in another format that you might want to load into a special tool for processing.

 

How Do I Add The Data Set Markup?

You can add the dataset markup by following these steps:

  • Add all the required markup properties using the JSON-LD format.
  • Follow the structured data guidelines.
  • Validate the code using the Rich Results Test.
  • Deploy a few pages that include your structured data and use the URL Inspection tool to check how Google sees the page. Make sure that the page is accessible to Google and not blocked by a robots.txt file, the noindex tag, or login requirements.

If the page looks alright, you can ask Google to recrawl your URLs.

  • Submit a sitemap to keep Google informed about future changes to your website.
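As a rough illustration of the first step, the Python sketch below builds a minimal Dataset description and wraps it in the script tag that carries JSON-LD inside a page. The dataset name, description and URL are made-up placeholders, and to_script_tag is a hypothetical helper written for this post, not part of any Google tooling.

```python
import json

# Placeholder values for illustration only; replace them with your own dataset details.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example city air quality readings",
    "description": (
        "Hourly air quality readings collected at example monitoring "
        "stations. This placeholder text stands in for a real summary."
    ),
    "url": "https://example.com/datasets/air-quality",
}


def to_script_tag(markup: dict) -> str:
    """Wrap a JSON-LD dictionary in the script tag that goes into the page HTML."""
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_script_tag(dataset))
```

The printed tag can be pasted into the head or body of the page that describes the dataset, and then checked with the Rich Results Test.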

PRO TIP: If you want to remove your dataset from the search results, use the robots meta tag to control how it is indexed. However, it may take some time for the change to take effect.

Google’s Approach To Dataset Discovery:

Google understands structured data about datasets expressed either with the schema.org Dataset markup or with equivalent structures in the W3C’s Data Catalog Vocabulary (DCAT) format. To improve the discovery of datasets, Google is also experimenting with support for structured data based on W3C CSVW.

What Are The Guidelines To Follow?

In addition to the structured data guidelines, Google advises following these practices:

  1. Sitemap Practices:

A. Use sitemap files to help Google find your URLs. Using sitemap files together with the sameAs markup helps document how dataset descriptions are published throughout your site.

B. A dataset repository usually has two types of pages: the landing page of each dataset and pages that list multiple datasets.

In such cases, adding the Dataset structured data to the landing pages is recommended.

If structured data is added to multiple pages about the same dataset, use the sameAs property to link them to the landing page.
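To make the sameAs advice concrete, here is a small Python sketch of markup for a secondary page that points back to the landing page. The URLs and the markup_for_secondary_page helper are hypothetical and exist only for this example.

```python
import json

# Hypothetical landing page URL, used only for illustration.
LANDING_PAGE = "https://example.com/datasets/air-quality"


def markup_for_secondary_page(name: str, description: str, page_url: str) -> dict:
    """Describe the dataset on a non-landing page and link it to the landing page."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": page_url,
        # sameAs tells Google which page is the main description of this dataset.
        "sameAs": LANDING_PAGE,
    }


listing_markup = markup_for_secondary_page(
    name="Example city air quality readings",
    description=(
        "Hourly air quality readings collected at example monitoring "
        "stations. This placeholder text stands in for a real summary."
    ),
    page_url="https://example.com/search?q=air+quality",
)
print(json.dumps(listing_markup, indent=2))
```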

 

  2. Source and Provenance Practices:

If a dataset is a copy of, or based on, another dataset, follow the practices listed below:

  • When a dataset, or material related to it, is republished elsewhere, use the sameAs property to indicate the most canonical URLs of the original dataset.
  • If the republished dataset has changed significantly, use the isBasedOn property. Use the same property when a dataset is derived or aggregated from several originals.
  • Use the identifier property to attach any relevant Digital Object Identifiers (DOIs) or Compact Identifiers. If a dataset has more than one identifier, repeat the identifier property. In JSON-LD, this is represented using the JSON list syntax.
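The provenance properties can be sketched in Python as follows. Every URL and identifier here is a made-up placeholder, and derived_dataset is a hypothetical helper; the point is only to show isBasedOn with several originals and identifier written as a JSON list.

```python
import json


def derived_dataset(name: str, description: str,
                    originals: list, identifiers: list) -> dict:
    """Describe a derived dataset and record where it came from."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # One entry per original dataset this one is derived or aggregated from.
        "isBasedOn": originals,
        # More than one identifier is written as a JSON list.
        "identifier": identifiers,
    }


# Placeholder originals and identifiers, not real records.
markup = derived_dataset(
    name="Example combined rainfall dataset",
    description=(
        "A placeholder description of a dataset aggregated from two "
        "hypothetical regional rainfall datasets."
    ),
    originals=[
        "https://example.org/datasets/rainfall-north",
        "https://example.org/datasets/rainfall-south",
    ],
    identifiers=[
        "https://doi.org/10.0000/example.1",
        "https://identifiers.org/example:1234",
    ],
)
print(json.dumps(markup, indent=2))
```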

 

  3. Textual Property Guidelines:

Keep every textual property to 5000 characters or fewer, as Google Dataset Search uses only the first 5000 characters of any textual property. Names and titles should be a few words or a short sentence.
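A quick way to catch overlong text before publishing is a small check like the Python sketch below. It only looks at top-level string values, and overlong_text_properties is a hypothetical helper name chosen for this example.

```python
TEXT_LIMIT = 5000  # the character limit described above


def overlong_text_properties(markup: dict, limit: int = TEXT_LIMIT) -> list:
    """Return the names of top-level string properties longer than the limit."""
    return [
        key
        for key, value in markup.items()
        if isinstance(value, str) and len(value) > limit
    ]


# Placeholder markup with a deliberately overlong description.
sample = {
    "name": "Example dataset",
    "description": "a" * 5001,
    "version": 2,
}
print(overlong_text_properties(sample))
```

Nested values, such as the name of a creator, are not inspected here; extending the check to walk nested objects would be a reasonable next step.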

 

WHAT TO DO IF MY STRUCTURED DATASETS EXPERIENCE ERRORS AND WARNINGS?

You might see warnings or errors in Google’s Structured Data Testing Tool or other validation systems. In particular, these validation systems may suggest that every organization should have contact information properties such as contactType; useful values include customer service, emergency, journalist, newsroom and public engagement. You can ignore errors that report csvw:Table as an unexpected value for the mainEntity property.

WHAT ARE THE VARIOUS PROPERTIES OF THE DATASET MARKUP?

The main types and properties used for structuring dataset data are:

A. Dataset: This type describes a body of structured information on a particular topic. Example: scientific or civic datasets.

Properties such as identifier, license and sameAs carry provenance and license information.

  1. description: A short summary of the dataset, between 50 and 5000 characters long. It may include Markdown syntax, and all embedded images must use absolute URLs. When using the JSON-LD format, denote new lines with \n (two characters: a backslash followed by the letter n).
  2. name: The name of the dataset. Always use unique names for distinct datasets.
  3. alternateName: An alternate name used for referring to the dataset.
  4. creator: The creator or author of the dataset. To identify individuals uniquely, use an ORCID iD as the value of the sameAs property of the Person type.
  5. citation: Identifies creative works or texts, such as academic articles, that the dataset provider recommends citing in addition to the dataset itself. Provide the citation for the dataset itself through other properties such as name, identifier, creator and publisher. This property can identify related academic publications such as a data descriptor, a data paper, or an article related to the dataset.

Guidelines to be followed:

  • Do not use this property for providing citation information for the dataset itself.
  • When populating the citation property with a citation snippet, provide the article identifier whenever possible.
  6. hasPart or isPartOf: When the dataset is a collection of smaller datasets, use the hasPart property; when the dataset is part of a larger dataset, use the isPartOf property to denote the relationship. If the value is itself a dataset, include all the properties required for a standalone dataset.
  7. identifier: An identifier such as a DOI or a Compact Identifier. If the dataset has more than one identifier, repeat the identifier property.
  8. license: The license under which the dataset is distributed.

Always provide a URL that unambiguously identifies the specific version of the license used.

  9. measurementTechnique: The technique, methodology or technology used in the dataset, which can correspond to the variable(s) described in variableMeasured.
  10. sameAs: The URL of a reference web page that unambiguously indicates the identity of the dataset.
  11. spatialCoverage: Information about the spatial aspect of the data. Include this property only when the dataset has a spatial dimension.

Spatial coverage can be specified as a shape, a named location, or points of coverage.

  12. temporalCoverage: The time interval covered by the dataset, specified in ISO 8601 format. Describe it according to the kind of interval the dataset covers. Examples:

Single date: "temporalCoverage": "2008"

Time period: "temporalCoverage": "1950-01-01/2013-12-18"

Open-ended time period: "temporalCoverage": "2013-12-19/.."

  13. variableMeasured: The variable measured by the dataset. Example: temperature or pressure.
  14. version: The version number of the dataset.
  15. url: The location of a page describing the dataset.
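Several of the properties above can be combined as in the following Python sketch. All values are invented placeholders (including the all-zero ORCID URL), and temporal_coverage is a hypothetical helper; it assumes that a closed interval is written as start/end and that an open-ended interval ends with two dots.

```python
import json
from datetime import date
from typing import Optional


def temporal_coverage(start: date, end: Optional[date] = None) -> str:
    """Format an ISO 8601 interval; a missing end date gives an open-ended interval."""
    end_text = end.isoformat() if end else ".."
    return f"{start.isoformat()}/{end_text}"


# Placeholder values for illustration only.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example coastal temperature records",
    "alternateName": "ECTR",
    "description": (
        "Placeholder description of daily temperature records from "
        "hypothetical coastal weather stations."
    ),
    "url": "https://example.com/datasets/coastal-temperature",
    "creator": {
        "@type": "Person",
        "name": "Example Author",
        # Placeholder ORCID URL, not a real researcher profile.
        "sameAs": "https://orcid.org/0000-0000-0000-0000",
    },
    # A license URL that names a specific version of the license.
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": temporal_coverage(date(1950, 1, 1), date(2013, 12, 18)),
    "variableMeasured": "Temperature",
    "version": "1.0",
}

print(json.dumps(dataset, indent=2))
print(temporal_coverage(date(2013, 12, 19)))  # open-ended interval
```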

DataCatalog

  1. DataCatalog: This type represents a collection of datasets.
  2. includedInDataCatalog: The catalogue to which the dataset belongs.

Datasets are usually published in repositories that contain many other datasets. The same dataset can be included in more than one such repository.
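A minimal Python sketch of a dataset that names its catalogue might look like this. The dataset and catalogue names are invented, and in_catalog is a hypothetical helper used only here.

```python
import json


def in_catalog(markup: dict, catalog_name: str) -> dict:
    """Return a copy of the markup with the catalogue it belongs to attached."""
    result = dict(markup)
    result["includedInDataCatalog"] = {
        "@type": "DataCatalog",
        "name": catalog_name,
    }
    return result


# Placeholder dataset and catalogue names.
base = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example city air quality readings",
    "description": (
        "Hourly air quality readings collected at example monitoring "
        "stations. This placeholder text stands in for a real summary."
    ),
}
catalogued = in_catalog(base, "Example Open Data Portal")
print(json.dumps(catalogued, indent=2))
```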

DataDownload:

  • DataDownload: This type describes a dataset in a downloadable form.
  • distribution: The property describes the location of the downloadable dataset and the file format available for download.
  • distribution.contentUrl: The property contains the link to the download.
  • distribution.encodingFormat: The property holds the file format of the distribution.
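The download properties fit together roughly as in this Python sketch. The download links are placeholders, data_download is a hypothetical helper, and the encodingFormat values are written as MIME types on the assumption that this is an acceptable way to state the file format.

```python
import json


def data_download(content_url: str, encoding_format: str) -> dict:
    """Build one DataDownload entry for the distribution property."""
    return {
        "@type": "DataDownload",
        "encodingFormat": encoding_format,
        "contentUrl": content_url,
    }


# Placeholder download links for illustration only.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example city air quality readings",
    "description": (
        "Hourly air quality readings collected at example monitoring "
        "stations. This placeholder text stands in for a real summary."
    ),
    # One entry per downloadable file format.
    "distribution": [
        data_download("https://example.com/data/air-quality.csv", "text/csv"),
        data_download("https://example.com/data/air-quality.json", "application/json"),
    ],
}
print(json.dumps(dataset, indent=2))
```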

WHAT ARE TABULAR DATASETS?

A tabular dataset is a dataset whose information is organised in a grid of rows and columns. Google’s approach to tabular datasets is currently in beta and is subject to change. Use the Dataset markup for structuring the data of tabular datasets. Currently, Google also understands a variation of CSVW provided on the HTML page in parallel to the user-oriented tabular content.

PRO TIP: Please refer to the previous posts of the series for details on monitoring search results and troubleshooting problems.

For analysing your Google Search Traffic, use the performance report.

WHAT DO I DO WHEN A SPECIFIC DATASET IS NOT SHOWING UP IN DATASET SEARCH RESULTS?

What has caused the issue?

Either the page describing the dataset does not use the dataset structured data markup, or the page has not been crawled yet.

How can this issue be fixed?

Copy the page link and paste it into the Rich Results Test. If the message states that the page is not eligible for rich results or that the markup is not eligible for rich results, then either the dataset markup is incorrect or there is no markup on the page. This can be fixed by following the steps for adding the dataset markup described earlier.

If the page does have markup, it may not have been crawled yet. You can check the crawl status with Google Search Console.
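Before reaching for the online tools, you can run a quick local check on the page HTML. The Python sketch below uses only the standard library to look for a JSON-LD block whose type is Dataset. It is a rough first check under simple assumptions (it does not follow @graph structures or validate individual properties) and is not a substitute for the Rich Results Test; JsonLdCollector and has_dataset_markup are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every script element of type application/ld+json."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def _is_dataset(item) -> bool:
    """True when a JSON-LD object declares Dataset as its type or one of its types."""
    if not isinstance(item, dict):
        return False
    types = item.get("@type")
    if isinstance(types, str):
        types = [types]
    return isinstance(types, list) and "Dataset" in types


def has_dataset_markup(html: str) -> bool:
    """Report whether the HTML contains at least one parseable Dataset JSON-LD block."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # broken JSON counts as missing markup
        items = data if isinstance(data, list) else [data]
        if any(_is_dataset(item) for item in items):
            return True
    return False


page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)
print(has_dataset_markup(page))
```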

If the company logo is missing or does not appear correctly next to results

This problem usually occurs when your page is missing the schema.org markup for organisation logos or your business is not established with Google.

How to fix this issue?

  • Get your business established with Google.
  • Add the Logo structured data markup to your page.
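For the second fix, logo markup is typically an Organization object carrying the site URL and the logo image URL. The Python sketch below shows the general shape with placeholder addresses; treat it as an illustration, not as a complete description of the logo requirements.

```python
import json

# Placeholder site and image URLs for illustration only.
logo_markup = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/images/logo.png",
}

print('<script type="application/ld+json">')
print(json.dumps(logo_markup, indent=2))
print("</script>")
```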

In the next post, I will throw light on how to structure Subscriptions and Paywalled Content on your site.