Automatic transparency evaluation for open knowledge extraction systems

Semantics and linked datasets formalise and classify knowledge in a machine-readable way [1, 2]. This simplifies knowledge extraction, retrieval, and analysis [3, 4]. Open Knowledge Extraction (OKE) is the automatic extraction of structured knowledge from unstructured or semi-structured text and its transformation into linked data [5]. OKE systems are increasingly used as fundamental components of advanced knowledge services [6]. However, like many modern Artificial Intelligence (AI) based systems, most OKE systems include non-transparent processes.

Transparency is defined as the understandability and interpretability of the processes and outcomes of AI systems for humans [7]. Transparency of AI is needed because of the extensive use of black-box algorithms in modern AI systems [8,9,10,11,12,13,14]. Enhancing transparency facilitates scrutability, trust, effectiveness, and efficiency [15]. AI transparency is one of the main components of AI governance and is necessary for accountability [8,9,10, 15]. Transparency is the single most cited principle in the 84 policy documents reviewed by Jobin et al. [16]. The General Data Protection Regulation (GDPR) also requires transparency by affirming “the right to explanation”, mandating accountability mechanisms, and restricting automated decision-making [17].

Automatic transparency evaluation is an important step towards enhancing the transparency of OKE systems. Automation helps with scalability and saves both time and energy, adding to sustainability. Transparency evaluation enables analysis and indicates effective ways to enhance the transparency of the system under evaluation. Transparency is a multidimensional problem that concerns different aspects of a system’s process, inputs, and outputs, such as their quality, security, and ethics. To the best of our knowledge, there is to date no automatic way to evaluate all the transparency dimensions of OKE systems. Accordingly, this paper focuses on automatic transparency evaluation for OKE systems. Our research question is “To what extent can the transparency of OKE systems be evaluated automatically using state-of-the-art tools and metrics?”. The Cyrus transparency evaluation framework describes a comprehensive set of transparency dimensions, includes a transparency testing methodology, and identifies relevant assessment tools for OKE systems.

The contributions of this paper are as follows: i) the transparency problem for OKE systems is formalised; ii) Cyrus, a new transparency evaluation framework for OKE systems, is proposed; iii) state-of-the-art FAIRness assessment [18,19,20] and linked data quality assessment [21] tools capable of evaluating some transparency dimensions are identified; and iv) Cyrus and the assessment tools are applied to evaluate the transparency of three state-of-the-art open-source OKE systems by assessing three linked datasets produced from the same corpus [22].

Open Knowledge Extraction (OKE) systems

OKE is the automatic extraction of structured knowledge from unstructured or semi-structured text and the subsequent representation and publication of that knowledge as linked data [5]. OKE usually consists of three main tasks: entity and relation extraction, text annotation based on vocabularies and ontologies, and conversion to RDF (Resource Description Framework). In this paper, the transparency of three state-of-the-art open-source OKE systems is evaluated. All of these systems create Knowledge Graphs (KGs) from the same corpus, the COVID-19 Open Research Dataset (CORD-19) [22]. CORD-19 is a corpus of scientific papers on COVID-19 and related historical coronavirus research. An overview of each of these OKE systems is provided in the following paragraphs, after a minimal sketch illustrating the three OKE tasks.
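As a minimal illustration of the three tasks, the following Python sketch extracts named entities with spaCy and serialises them as RDF with rdflib. The libraries, the namespace, and the toy entity types are illustrative assumptions; none of the systems evaluated in this paper uses this exact pipeline.

```python
# Minimal sketch of the three OKE tasks (extraction, annotation, RDF conversion).
# spaCy and rdflib are illustrative choices, not the tools used by the evaluated systems.
import spacy
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/kg/")  # hypothetical namespace

def extract_to_rdf(doc_id: str, text: str) -> Graph:
    nlp = spacy.load("en_core_web_sm")          # small English NER model (must be installed)
    graph = Graph()
    graph.bind("ex", EX)
    doc_uri = EX[doc_id]
    graph.add((doc_uri, RDF.type, EX.Document))
    for i, ent in enumerate(nlp(text).ents):    # task 1: entity (and relation) extraction
        ent_uri = EX[f"{doc_id}/entity/{i}"]
        graph.add((ent_uri, RDF.type, EX[ent.label_]))    # task 2: annotate with a (toy) type
        graph.add((ent_uri, RDFS.label, Literal(ent.text)))
        graph.add((doc_uri, EX.mentions, ent_uri))        # task 3: represent as linked data
    return graph

if __name__ == "__main__":
    g = extract_to_rdf("paper1", "Remdesivir was tested against SARS-CoV-2 in Wuhan.")
    print(g.serialize(format="turtle"))
```

Running the sketch prints a small Turtle document in which each extracted entity is typed, labelled, and linked to its source document.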

Booth et al. [23] created CORD-19-on-FHIR, a linked data version of the CORD-19 dataset in FHIR (Fast Healthcare Interoperability Resources) RDF format. It was produced by data mining the CORD-19 dataset and adding semantic annotations, using the NLP2FHIR pipeline [24] and the FHIR-to-RDF converter to create the final linked datasets. The purpose of CORD-19-on-FHIR is to facilitate linkage with other biomedical datasets and to enable answering research questions. Currently, the entity types Condition, Medication, and Procedure are extracted from the titles and abstracts of the CORD-19 dataset using Natural Language Processing (NLP) methods. PubTator [25] is also used to extract the Species, Gene, Disease, Chemical, CellLine, Mutation, and Strain entity types.

CORD19-NEKG, created by Michel et al. [26], is another KG construction pipeline for the CORD-19 dataset. It produces an RDF dataset describing named entities in the CORD-19 corpus, extracted using: i) the DBpedia Spotlight [27] named entity extraction tool, which uses DBpedia entities to annotate text automatically; ii) Entity-fishing, which uses Wikidata entities to annotate text automatically; and iii) the NCBO BioPortal Annotator [28], which annotates text automatically with user-selected ontologies and vocabularies.
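To illustrate the annotation step shared by these pipelines, the sketch below calls DBpedia Spotlight’s public REST endpoint and returns the DBpedia URIs linked to the input text. The endpoint URL, parameters, and response fields reflect Spotlight’s public documentation at the time of writing and may change.

```python
# Hedged sketch: annotate text with DBpedia entities via the public Spotlight REST API.
# Endpoint, parameters, and response fields are assumptions based on public documentation.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text: str, confidence: float = 0.5) -> list[str]:
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [r["@URI"] for r in resources]   # DBpedia URIs of the linked entities

if __name__ == "__main__":
    print(annotate("Coronaviruses cause respiratory infections in humans."))
```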

COVID-KG [29] is another KG based on the CORD-19 dataset. It has been built by transforming the CORD-19 papers (JSON files and their metadata CSV files) into RDF in two steps: a) enriching the JSON files with annotations from DBpedia Spotlight, the BioPortal Annotator, the Crossref API, and the ORCID API; and b) mapping the enriched JSON to RDF using the YARRRML Parser, as sketched below.
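The mapping step can be pictured with the simplified sketch below, which applies a small declarative field-to-predicate mapping to a paper’s JSON metadata and emits RDF with rdflib. It only imitates the spirit of a YARRRML/RML rule set; it is not the YARRRML Parser used by COVID-KG, and the namespace and fields are hypothetical.

```python
# Simplified stand-in for the JSON-to-RDF mapping step (not the actual YARRRML Parser).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/covid-kg/")   # hypothetical namespace

# Declarative field-to-predicate mapping, in the spirit of a YARRRML rule set.
MAPPING = {
    "title": DCTERMS.title,
    "abstract": DCTERMS.abstract,
    "doi": DCTERMS.identifier,
}

def json_to_rdf(paper: dict) -> Graph:
    g = Graph()
    subject = EX[paper["paper_id"]]
    g.add((subject, RDF.type, EX.Paper))
    for field, predicate in MAPPING.items():
        if field in paper:
            g.add((subject, predicate, Literal(paper[field])))
    return g

if __name__ == "__main__":
    paper = {"paper_id": "abc123", "title": "A CORD-19 paper", "doi": "10.1000/xyz"}
    print(json_to_rdf(paper).serialize(format="turtle"))
```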

Existing solutions for the transparency of AI models

AI systems have three important components: 1) input data or resources, 2) the input transformation process, including the algorithms and models used, and 3) outputs. For AI to be transparent, each of these components should be transparent. Explainable AI (XAI) aims to turn a non-transparent machine learning model into a mathematically interpretable one. Several studies have suggested using XAI methods to enhance transparency; however, these methods are often shown to be less accurate than non-transparent algorithms [30,31,32]. XAI also often does not consider whether the explanations are understandable for humans [33,34,35].

Some researchers suggest auditing or risk assessment [8,9,10, 36] to increase transparency, which assesses the inputs and outputs of a model while treating the model itself as a black box. However, auditing is the least powerful of the available methods for understanding black-box models’ behaviours [37], since it does not make the model’s decision process clear. Logging of algorithm executions can also help, by enabling responsible entities to carry out retrospective analyses [38]. Openness of the algorithm’s source code, inputs, and outputs is another way to provide transparency. However, it exposes the system to strategic gaming and does not work for algorithms that change over time or that have random elements [9].

In contrast, metadata-driven approaches that create a framework for disclosing key pieces of information about a model can be more effective in communicating algorithmic performance to the public [15]. Most current transparency solutions are metadata-driven [15, 39,40,41]. Model Cards [42] divide the information about a model into nine groups: model details (basic information such as model developer/s, model date, version, and type), intended use, factors (demographic or phenotypic groups, environmental conditions, and so on), metrics (e.g., model performance measures or decision thresholds), evaluation data (datasets, motivation, preprocessing), training data, quantitative analyses, ethical considerations, and caveats and recommendations. There is no requirement to reveal sensitive information, and organisations only need to disclose basic information about the model.
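A model card can be pictured as a structured record with one field per group; the following sketch uses a Python dataclass with illustrative field names and values, not a normative Model Cards schema.

```python
# Hedged sketch of a model card as a structured record; field names and values are illustrative.
from dataclasses import dataclass

@dataclass
class ModelCard:
    model_details: dict        # developer(s), date, version, type
    intended_use: str
    factors: list              # demographic/phenotypic groups, environmental conditions
    metrics: dict              # performance measures, decision thresholds
    evaluation_data: dict      # datasets, motivation, preprocessing
    training_data: dict
    quantitative_analyses: dict
    ethical_considerations: str
    caveats_and_recommendations: str = ""

card = ModelCard(
    model_details={"developer": "Example Lab", "version": "1.0", "type": "NER model"},
    intended_use="Entity extraction from biomedical abstracts",
    factors=["document language", "publication domain"],
    metrics={"F1": 0.87},
    evaluation_data={"dataset": "held-out sample"},
    training_data={"dataset": "annotated abstracts"},
    quantitative_analyses={"per-entity-type F1": {"Disease": 0.85, "Chemical": 0.90}},
    ethical_considerations="No personal data used.",
)
print(card.intended_use)
```

Such a record can then be serialised (e.g., to JSON) and published alongside the model.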

Inspired by nutrition labels, Yang et al. [43] have suggested a nutrition label for ranking AI systems as a way to make them transparent. Their Ranking Facts label consists of visual widgets that present details of the ranking methodology, or of its output, to users in six groups. This information includes the Recipe (the ranking algorithm, the attributes that matter, and to what extent); the Ingredients (the attributes that most affect the outcome); the detailed Recipe and Ingredients widgets (statistics of the attributes in the Recipe and the Ingredients); the Stability of the algorithm’s output; the detailed Stability widget (the slope of the line fitted to the stability score distribution, at the top-10 and overall); the Fairness widget (whether the ranked output complies with statistical parity); and the Diversity widget (diversity with respect to sensitive features).
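The statistical-parity check behind the Fairness widget can be illustrated by comparing the share of a protected group in the top-k of a ranking with its share overall; this is a minimal sketch, and Ranking Facts’ actual test statistic and thresholds may differ.

```python
# Minimal sketch of a statistical-parity style check for a ranked output.
# Ranking Facts' exact statistic and thresholds may differ.
def parity_gap(ranking: list, group_attr: str, group_value, k: int = 10) -> float:
    """Difference between the protected group's share in the top-k and its share overall."""
    top_k = ranking[:k]
    share_top = sum(r[group_attr] == group_value for r in top_k) / len(top_k)
    share_all = sum(r[group_attr] == group_value for r in ranking) / len(ranking)
    return share_top - share_all

# Toy ranking: every third candidate belongs to the protected group.
ranking = [{"id": i, "gender": "F" if i % 3 == 0 else "M"} for i in range(100)]
print(f"parity gap at top-10: {parity_gap(ranking, 'gender', 'F'):+.2f}")
```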

Another similar approach is FactSheets [44], in which a questionnaire is filled in and published by the stakeholders of an AI service. The questionnaire includes 11 sections: previous FactSheets filled in for the service, a description of the testing done by the service provider, the test results, testing by third parties, safety, explainability, fairness, concept drift, security, training data, and trained models. Each of the reviewed methods defines a set of information that should be made available for its targeted models to be considered transparent. Table 1 shows the differences and commonalities between these approaches.

Table 1 Differences and commonalities between AI transparency methods

Existing solutions for data transparency

Similar to solutions for the transparency of AI models, most of the existing solutions for data transparency are metadata-driven. Some of the most significant approaches are overviewed here.

In Datasheets for Datasets [45], information about a dataset is classified into four groups: composition, collection, preprocessing/cleaning/labelling, and maintenance. The composition section includes information such as missing information, errors, sources of noise, or redundancies in the dataset. The collection section covers information such as data validation/verification, the mechanisms used to collect the data, and the validation of those mechanisms. The preprocessing/cleaning/labelling section includes information about the raw data and its transformation, e.g., discretisation and tokenisation. Lastly, the maintenance section covers information such as the data erratum, applicable limits on the retention of the data, and the maintenance of older versions of the data.

The Data Cards method [46] has a deliberately flexible format so that it can be applied to different kinds of data. Information in Data Cards is roughly divided into nine sections: publishers; licence and access; dataset snapshot (data type, nature of content, known correlations, simple statistics of data features, and training, validation, and testing splits); motivation and use (dataset purposes, key domain applications, and primary motivations); extended use (safe and unsafe use cases); dataset maintenance, versions, and status; data collection methods; data labelling; and, finally, fairness indicators.

Data Nutrition Labels [47] consist of seven modules: metadata, provenance, variables, statistics, pair plots, probabilistic model, and ground truth correlations. The metadata module includes information such as the filename, format, URL, domain, keywords, dataset size, number of missing cells, and licence. The provenance module contains the source and the authors’ contact information, along with the version history. The variables module provides a textual description of each variable/column in the dataset. The statistics module includes simple statistics for the dataset variables, such as min/max, median, and mean. The pair plots module encompasses histograms and heat maps of distributions and linear correlations between two chosen variables. The probabilistic model module contains histograms and other statistical plots for the synthetic data distribution hypotheses. Lastly, the ground truth correlations module provides heat maps of linear correlations between a chosen variable in the dataset and variables from ground truth datasets. One of the interesting contributions of Data Nutrition Labels is a set of visual badges that show information about the dataset. Similar to the AI transparency methods, each of the data transparency methods provides a framework of the information it finds necessary for data transparency; Table 2 shows the differences and commonalities between these approaches. Inspired by the reviewed model and data transparency methods, we propose a comprehensive transparency evaluation and enhancement framework for OKE systems. Several of the label modules described above, such as metadata and statistics, can be generated automatically from the data itself, as sketched below.
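The following pandas sketch computes the kind of dataset-level summary such a label might display; the function and field names are illustrative, not part of any of the cited labelling schemes.

```python
# Hedged sketch: compute label-style summary statistics for a tabular dataset with pandas.
import pandas as pd

def nutrition_summary(df: pd.DataFrame) -> dict:
    numeric = df.select_dtypes("number")
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "per_variable": {
            col: {"min": float(numeric[col].min()),
                  "max": float(numeric[col].max()),
                  "mean": float(numeric[col].mean()),
                  "median": float(numeric[col].median())}
            for col in numeric.columns
        },
    }

df = pd.DataFrame({"age": [34, 51, None, 29], "cases": [10, 3, 7, 12]})
print(nutrition_summary(df))
```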

Table 2 Differences and commonalities between data transparency methods

Existing solutions for the evaluation of AI systems’ transparency

To the best of our knowledge, there are no automatic methods that cover the evaluation of all the transparency dimensions of AI systems. However, there are checklists for measuring the fairness, accountability, and transparency of AI systems, regardless of the techniques used to build them. Shin [7] uses a checklist of 27 measurements on a 7-point scale covering seven criteria, i.e., fairness, accountability, transparency, explainability, usefulness, convenience, and satisfaction, to evaluate user perceptions of algorithmic decisions. However, the checklist itself is not publicly available. In another work, Shin et al. [48] proposed a survey with transparency among its variables; however, it cannot be used independently and needs other approaches to measure these criteria. Jalali et al. [49] evaluated the transparency of the reports of 29 COVID-19 models using 27 Boolean criteria. These criteria were adopted from three transparency checklists [50,51,52] that include reproducibility and transparency indicators for scientific papers and reports. Jalali et al.’s transparency assessment checklist was also used in [53] for transparency evaluation.
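Checklist-based assessment reduces to a simple aggregation of per-criterion judgements; the minimal sketch below scores a set of Boolean criteria, whose names are hypothetical placeholders rather than the published checklist items.

```python
# Minimal sketch of aggregating Boolean transparency criteria into a single score.
# The criteria shown are hypothetical placeholders, not the published checklist items.
criteria = {
    "code_available": True,
    "data_available": False,
    "model_assumptions_documented": True,
    "validation_reported": True,
}

met = sum(criteria.values())
score = met / len(criteria)
print(f"transparency score: {score:.0%} ({met}/{len(criteria)} criteria met)")
```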

Automatic transparency evaluation

Quality and transparency are entangled concepts [12, 15]. In 2012, Zaveri et al. [54] proposed a comprehensive linked data quality evaluation framework consisting of six quality categories and 23 quality dimensions; for each dimension, a number of metrics have been identified in the literature. According to the Data Quality Vocabulary (DQV), a category “represents a group of quality dimensions in which a common type of information is used as quality indicator” and a dimension “represents criteria relevant for assessing quality”; each quality dimension must have one or more metrics to measure it. A number of quality evaluation metrics have been implemented in open-source linked data quality evaluation tools, such as RDFUnit [55] and Luzzu [21].
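Quality results can themselves be published as linked data. The sketch below records one hypothetical metric value for a dataset using DQV terms with rdflib; the dataset URI and the metric are invented for illustration.

```python
# Hedged sketch: express one quality measurement using the W3C Data Quality Vocabulary (DQV).
# The dataset URI and the metric are hypothetical.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
EX = Namespace("http://example.org/quality/")

g = Graph()
g.bind("dqv", DQV)

dataset = URIRef("http://example.org/datasets/example-kg")  # hypothetical dataset
metric = EX.dereferenceabilityMetric                        # hypothetical metric URI

measurement = BNode()
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.computedOn, dataset))
g.add((measurement, DQV.isMeasurementOf, metric))
g.add((measurement, DQV.value, Literal(0.82, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```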

In [55], inspired by test-driven software development, a methodology is proposed for linked data quality assessment based on SPARQL query templates, which are instantiated into concrete quality test queries. Through this approach, domain-specific semantics can be encoded in the data quality test cases, which allows the discovery of data quality problems beyond those detectable with conventional methods. An open-source tool named RDFUnit has been built on this method.
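The idea of instantiating a template into a concrete test can be sketched as follows: a generic “class is missing a required property” pattern is filled in with a class and a property and run against a toy graph with rdflib. The template and data are ours for illustration; RDFUnit ships its own pattern library and test generators.

```python
# Hedged sketch of a test-driven quality check: a SPARQL template instantiated into a test.
# The template and data are illustrative; RDFUnit provides its own pattern library.
from rdflib import Graph

TEMPLATE = """
SELECT ?resource WHERE {{
    ?resource a <{cls}> .
    FILTER NOT EXISTS {{ ?resource <{prop}> ?value . }}
}}
"""

data = """
@prefix ex: <http://example.org/> .
ex:paper1 a ex:Paper ; ex:title "A CORD-19 paper" .
ex:paper2 a ex:Paper .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Instantiate the template: every ex:Paper must have an ex:title.
query = TEMPLATE.format(cls="http://example.org/Paper", prop="http://example.org/title")
violations = list(g.query(query))
print(f"{len(violations)} violation(s):", [str(row.resource) for row in violations])
```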

Debattista et al. [21] propose Luzzu, a conceptual methodology for assessing linked datasets and a framework for linked data quality assessment. Luzzu allows new quality metrics to be defined and creates RDF quality metadata and quality problem reports; it provides scalable dataset processors for data dumps, SPARQL endpoints, and big data infrastructures, as well as a customisable ranking algorithm based on user-defined weights. Luzzu scales linearly with the number of triples in a dataset, is open source, and has 29 quality evaluation metrics already implemented.
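The user-weighted ranking can be pictured as a weighted sum of per-metric scores across datasets; in the minimal sketch below, the metric names, weights, and scores are hypothetical, and the aggregation is a simplification of Luzzu’s configurable ranking algorithm.

```python
# Minimal sketch of ranking datasets by a user-weighted sum of quality metric scores.
# Metric names, weights, and scores are hypothetical.
def weighted_score(metric_scores: dict, weights: dict) -> float:
    total_weight = sum(weights.values())
    return sum(w * metric_scores.get(m, 0.0) for m, w in weights.items()) / total_weight

datasets = {
    "dataset_a": {"dereferenceability": 0.9, "licensing": 0.5, "interlinking": 0.7},
    "dataset_b": {"dereferenceability": 0.8, "licensing": 0.9, "interlinking": 0.6},
    "dataset_c": {"dereferenceability": 0.7, "licensing": 0.8, "interlinking": 0.8},
}
weights = {"dereferenceability": 0.5, "licensing": 0.3, "interlinking": 0.2}

ranking = sorted(datasets, key=lambda d: weighted_score(datasets[d], weights), reverse=True)
print(ranking)
```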

In addition to the above, our prior work has shown that the FAIR principles [56] can be used to evaluate some transparency dimensions [57]. The FAIR principles are well-accepted data governance principles, originally proposed to enhance the usability of scholarly digital resources for humans and machines [58, 59]. They comprise four criteria for findability, two for accessibility, three for interoperability, and one (with three sub-criteria) for reusability. Since their emergence in 2016, several automatic tools [18,19,20] have been proposed to check whether digital objects (resources, datasets) are aligned with the FAIR principles.
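Automated FAIRness tests typically probe machine-actionable signals. The sketch below checks two simple indicators for a dataset URI, whether the identifier resolves over HTTP and whether machine-readable metadata is offered via content negotiation; it is an illustration only, not the test batteries of the cited FAIRness assessment tools.

```python
# Hedged sketch of two simple automated FAIR indicators for a dataset URI.
# This is an illustration, not the test suites of the FAIRness assessment tools cited above.
import requests

def check_fair_signals(dataset_uri: str) -> dict:
    results = {}
    resp = requests.get(dataset_uri, timeout=30, allow_redirects=True)
    results["uri_resolves"] = resp.status_code == 200          # findability/accessibility signal

    rdf_resp = requests.get(
        dataset_uri,
        headers={"Accept": "text/turtle, application/rdf+xml"},  # interoperability signal
        timeout=30,
        allow_redirects=True,
    )
    results["machine_readable_metadata"] = any(
        fmt in rdf_resp.headers.get("Content-Type", "")
        for fmt in ("turtle", "rdf+xml", "ld+json")
    )
    return results

if __name__ == "__main__":
    print(check_fair_signals("https://example.org/dataset/123"))  # hypothetical URI
```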
