FAIR-Checker: supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards

FAIR-checker metadata analysis workflow

By embedding RDF triples into web pages through JSON-LD, RDFa or HTML microdata, web data providers can expose to search engines metadata describing the content of web resources. This is a particularly attractive way to technically comply with the FAIR principles. The general idea of FAIR-Checker is to promote the use of embedded metadata in web pages to ease the findability and reuse of digital scientific resources. Figure 1 outlines the main steps for gathering, enriching, and analyzing Semantic Web annotations while benefiting from public Knowledge Graphs.

Fig. 1 Gathering, enriching and analyzing Semantic Web annotations in line with FAIR principles

Given a web page URL, the first step consists in extracting semantic annotations based on the JSON-LD, RDFa, or HTML microdata standards (➊). These annotations constitute a minimal starting Knowledge Graph (KG), which is queried with SPARQL for FAIR assessment (➋). Then, for each metadata entity (e.g. a person, a dataset, or a piece of software), public KGs are queried to retrieve relevant associated RDF triples (➌). Since attribution and citation are clear incentives to promote sharing and reuse in open science, we principally target scientific literature KGs such as OpenAIRE or OpenCitation. We also include general knowledge through Wikidata. Following the Linked Data principles, our objective is (i) to limit the effort data publishers spend annotating individual web pages, and (ii) to have Knowledge Graph developers and maintainers contribute to the implementation of the FAIR principles. Next, ontology checks (➍) assess whether the ontology classes and properties used are part of community-agreed standards. By leveraging community-specific registries such as BioPortal (footnote 12) or OLS (footnote 13), or general registries such as LOV (footnote 14), we can evaluate whether common ontology terms are reused. Finally, since more and more web resources are annotated with Schema.org [15], we leverage the Bioschemas [16] community-agreed profiles to assess the completeness of semantic annotations (➎). These Bioschemas profiles are automatically transformed into SHACL constraints, which are used to report missing triples that the community considers mandatory or recommended for describing a given type of resource. This in turn provides users with guidelines for improving the quality of their metadata.
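As an illustration of step ➊, the following minimal Python sketch collects JSON-LD annotations embedded in a web page. It is not FAIR-Checker's actual extraction code: it relies only on the Python standard library, covers JSON-LD but not RDFa or HTML microdata, and the names JsonLdExtractor, extract_jsonld and SAMPLE_PAGE are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False  # True while inside a JSON-LD script element
        self._buffer = []        # text chunks of the current script element
        self.documents = []      # successfully parsed JSON-LD documents

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed annotations in this sketch


def extract_jsonld(html):
    """Return the list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page content, used for illustration only.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body></body></html>"""

annotations = extract_jsonld(SAMPLE_PAGE)
print(annotations[0]["@type"])
```

The extracted documents would then still have to be parsed into RDF triples to form the starting Knowledge Graph queried in step ➋.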

Evaluating FAIR metrics with SPARQL query templates

In this section, we show how SPARQL queries can instrument and operationalize numerous FAIR principles for metadata.

Many of these principles rely on the availability of web-accessible, machine-readable metadata, grounded in community-agreed and shared vocabularies. Being able to automatically parse embedded RDF triples already ensures that metadata is accessible through an open protocol (Accessibility principle A1.1) and in a structured data format that allows knowledge representation (Findability principle F2, Interoperability principle I1). See Table 1 for a brief description of the FAIR principles.

In our workflow (Fig. 1), SPARQL queries are also used in step 4 to check whether RDF entities match ontology properties present in available registries. This helps in assessing FAIR principles I2 and R1.3. Table 2 lists the common ontology properties that we identified as relevant when publishing FAIR resources.

Table 2 Summary of the selected ontology properties relevant to assess three specific FAIR principles in FAIR-Checker

Specific properties have been proposed to identify resources or concepts, such as the DC-Terms (footnote 15) or Schema.org (footnote 16) identifier properties. These properties should be found when assessing the Findability principle F1. In addition, FAIR-Checker evaluates whether common identification schemes, registered through the Identifiers.org [17] resolution service, can be found in the embedded RDF triples.

One of the reuse criteria (R1.1) lies in making data available with a clearly established access license. A number of ontologies and controlled vocabularies make it possible to describe licenses in a machine-readable way. For this purpose, we have identified the license properties defined in the Schema.org, DC-Terms, DOAP (footnote 17), and DBpedia (footnote 18) ontologies.

Another principle of reuse is based on the provision of detailed provenance information (Reuse principle R1.2). This information is needed to identify data sources such as authors and funding organizations, but also potential data transformation steps. For this, we selected three commonly used ontologies: PROV (footnote 19) [18], PAV (footnote 20) [19] and DC-Terms. More precisely, they make it possible to expose time information (e.g. prov:startedAtTime, pav:retrievedOn), versioning information at multiple levels of granularity (e.g. pav:hasCurrentVersion, pav:previousVersion), and multiple authorship roles (e.g. dct:contributor, pav:curatedBy).

For each metric associated with the FAIR principles described above, we propose to automatically generate SPARQL ASK queries from a query template, as shown in Fig. 2.

Fig. 2 SPARQL ASK query template

From a list of target properties, we generate a SPARQL VALUES clause (line 2). When these queries are evaluated on the RDF triples retrieved for a web URL, a positive answer is returned when at least one of the predefined target properties is found.
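The template mechanism can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the exact template of Fig. 2: the function name build_ask_query, the variable names and the query layout are hypothetical, and the example uses three of the license properties mentioned for principle R1.1.

```python
def build_ask_query(target_properties, prefixes):
    """Return a SPARQL ASK query that holds if any target property is used."""
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()
    )
    # The target properties are injected into a single VALUES clause.
    values = " ".join(target_properties)
    return (
        f"{prefix_lines}\n"
        "ASK {\n"
        f"  VALUES ?p {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )


# Namespace prefixes for the vocabularies used in this example.
PREFIXES = {
    "schema": "http://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
}

# License properties checked for the R1.1 principle (subset, for illustration).
LICENSE_PROPERTIES = ["schema:license", "dct:license", "doap:license"]

license_query = build_ask_query(LICENSE_PROPERTIES, PREFIXES)
print(license_query)
```

Because the VALUES clause binds ?p to each target property in turn, the query answers true as soon as one triple uses any of them, which matches the "at least one property" semantics described above.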

Public Knowledge Graphs supporting FAIR assessment

Many public Knowledge Graphs already aggregate, and make accessible, a large amount of metadata associated with digital resources such as databases, scientific literature or software. With FAIR-Checker, we propose to exploit these semantic data sources during the FAIRification process. From the assessed URL, FAIR-Checker generates a SPARQL DESCRIBE query to retrieve the RDF triples already accessible. Wikidata [20] is queried for general knowledge through the wikidata:P356 property, which is used to cross-reference digital object identifiers (DOIs). The properties openaire:resPersistentID and datacite:hasIdentifier are also exploited in SPARQL DESCRIBE queries to query, respectively, the SPARQL endpoints of OpenAIRE and OpenCitation, which target scientific literature metadata. As an example, when inspecting the Schema.org metadata associated with a scholarly article shared through the Datacite repository, we can enrich it by retrieving a set of keywords and subjects from the OpenAIRE SPARQL endpoint.
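A DESCRIBE query of this kind could be assembled as in the following Python sketch for the Wikidata case. The exact queries issued by FAIR-Checker are not reproduced here: the function name build_wikidata_describe and the DOI value are hypothetical, and the sketch assumes that the DOI has already been obtained from the page metadata. The wdt: prefix is the standard Wikidata namespace for direct property claims.

```python
def build_wikidata_describe(doi):
    """Return a DESCRIBE query for the Wikidata item carrying the given DOI."""
    # Escape characters that would break out of the SPARQL string literal.
    literal = doi.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "DESCRIBE ?item WHERE {\n"
        f'  ?item wdt:P356 "{literal}" .\n'
        "}"
    )


# Hypothetical DOI, used for illustration only.
describe_query = build_wikidata_describe("10.1234/example")
print(describe_query)
```

The literal is matched exactly, so any normalisation of the DOI (for instance its letter case) is left to the caller in this sketch.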

Handling missing semantic annotations with a generator of profile-based SHACL shapes

Schema.org is a lightweight, general-purpose controlled vocabulary, initially supported by major web search engines and aimed at semantically annotating web pages. However, since the way major search engines use this metadata is unknown, it is difficult for web data providers to choose which semantic properties to expose. This can lead to wide variation in the quality of semantic annotations and possibly to a lot of missing information.

Community-driven metadata profiles have been proposed to tackle this issue. Targeting the Life Sciences community, Bioschemas (footnote 21) [21] is an active community effort, supported by the Elixir European Bioinformatics research infrastructure, aimed at extending Schema.org and promoting its usage to increase the discoverability of Life Science resources. The Bioschemas community has produced more than 37 Schema.org usage recommendations, also known as profiles. Bioschemas profiles specify which RDF triples should be used to describe a specific type of entity. They specify which ontology classes or properties should be used (mostly from Schema.org), and allow different cardinalities (one or many) as well as different marginalities (minimum, recommended, or optional) to be specified for properties. For instance, the Bioschemas community agreed that a web page describing a gene should at least (minimum marginality) provide both the schema:identifier and schema:name properties, and that it is recommended (recommended marginality) to also provide a description (schema:description) and a reference web page for the gene (schema:url).

Up to now, there is no clear consensus on how to represent such profiles with machine-readable formats. With FAIR-Checker, we propose to rely on SHACL [22] [23] to automatically represent and evaluate the compatibility of semantic annotations against community-agreed profiles. The marginality of semantic properties is represented with SHACL property shapes. Minimal properties are encoded with a sh:Violation severity, whereas recommended properties are encoded with a sh:Warning severity.

Fig. 3

Figure 3 shows a generic SHACL shape template. By providing a list of minimal and recommended properties for each Bioschemas profile to a text template engine, FAIR-Checker is able to instantiate a profile-specific SHACL shape. Lines 6 to 12 show the iteration over the minimal properties, leading to the generation of multiple property shape patterns specified between lines 7 and 11. The specific property to be evaluated is injected in line 8 through the min_prop variable. This pattern is repeated between lines 14 and 20 to address profile-specific recommended properties. The produced shapes are matched against all instances of the target_class variables thanks to the iteration specified between lines 2 and 4.
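The instantiation of a profile-specific shape can be sketched in Python as follows. This is an approximation under stated assumptions rather than the template of Fig. 3: it uses plain string formatting instead of a text template engine, the function name build_shacl_shape, the ex: namespace and the shape name GeneShape are hypothetical, the use of sh:minCount 1 to require a property is an assumption, and schema:Gene is assumed as the target class for the gene example.

```python
def build_shacl_shape(shape_name, target_classes, min_props, rec_props):
    """Render a SHACL node shape (Turtle) for one profile."""
    lines = [
        "@prefix sh: <http://www.w3.org/ns/shacl#> .",
        "@prefix schema: <http://schema.org/> .",
        "@prefix ex: <http://example.org/shapes/> .",  # hypothetical namespace
        "",
        f"ex:{shape_name} a sh:NodeShape ;",
    ]
    # One sh:targetClass statement per class targeted by the profile.
    for target_class in target_classes:
        lines.append(f"    sh:targetClass {target_class} ;")

    # Minimal properties raise violations, recommended ones raise warnings.
    property_shapes = []
    for severity, props in (("sh:Violation", min_props), ("sh:Warning", rec_props)):
        for prop in props:
            property_shapes.append(
                "    sh:property [\n"
                f"        sh:path {prop} ;\n"
                "        sh:minCount 1 ;\n"
                f"        sh:severity {severity}\n"
                "    ]"
            )
    return "\n".join(lines) + "\n" + " ;\n".join(property_shapes) + " .\n"


# Gene example: minimal and recommended properties as described in the text.
gene_shape = build_shacl_shape(
    "GeneShape",
    target_classes=["schema:Gene"],
    min_props=["schema:identifier", "schema:name"],
    rec_props=["schema:description", "schema:url"],
)
print(gene_shape)
```

Validating a page's annotations against such a shape with a SHACL processor would then report a violation for each missing minimal property and a warning for each missing recommended one.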

User recommendations

The reason for a negative result on a given FAIR principle may be difficult for a novice user to understand, and therefore difficult for them to resolve. This motivated us to include a set of accessible, easy-to-understand recommendations in FAIR-Checker, in particular for the “Check” functionality. These recommendations aim to explain how users can improve their metadata so that a failed evaluation later passes. They also provide useful links to training resources such as the FAIR-CookBook (footnote 22). For example, FAIR-CookBook recipe 1 (Findability section) on unique and persistent identifiers gives users the background information needed to assign persistent identifiers to their resources and resolve a failure on the F1B (Persistent IDs) principle. In the same section, FAIR-CookBook recipe 8 on Search Engine Optimisation provides examples of structured metadata in JSON-LD, which can help users encode their metadata in a structured format and resolve a failure on the F2A (Structured metadata) principle. In addition, recipes 3 and 4 in the Interoperability section of the FAIR-CookBook constitute a useful introduction to terminologies and ontologies and a guide for selecting the most appropriate ones. This can help increase compliance with the F2B (Shared vocabularies for metadata) and I2B (Machine-readable vocabularies) principles. Finally, for a failure due to the lack of license information (R1.1 principle), the FAIR-Checker recommendation suggests using one of the following properties: schema:license, dct:license, doap:license, dbpedia-owl:license or cc:license. We are currently collecting user feedback to produce relevant additional recommendations.


JOURNAL OF BIOMEDICAL SEMANTICS
