Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI)

Ontologies are structured representations of knowledge, consisting of a collection of terms organized using logical relationships and textual information. In the life sciences, ontologies such as the Gene Ontology (GO) [1], Mondo [2], Uberon [3], and FoodOn [4] are used for a variety of purposes such as curation of gene function and expression, classification of diseases, or annotation of food datasets. Ontologies are core components of major data generation projects such as The Encyclopedia of DNA Elements (ENCODE) [5] and the Human Cell Atlas [6]. The construction and maintenance of ontologies is a knowledge- and resource-intensive task, carried out by dedicated teams of ontology editors, working alongside the curators who use these ontologies to curate literature and annotate data. Due to the pace of scientific change, the rapid generation of diverse data, the discovery of new concepts, and the diverse needs of a broad range of stakeholders, most ontologies are perpetual works in progress. Many ontologies have thousands, or tens of thousands of terms, and are continuously growing. There is a strong need for tools that help ontology editors fulfill requests for new terms and other changes.

Currently, most ontology editing workflows involve manual entry of multiple pieces of information (also called axioms) for each term or class in the ontology. This information includes the unique identifier, a human-readable label, a textual definition, as well as relationships that connect terms to other terms, either in the same ontology or a different ontology [7]. For example, the Cell Ontology (CL) [8] term with the ID CL:1001502 has the label “mitral cell”, a subClassOf (is-a) relationship to the term “interneuron” (CL:0000099), a “has soma location” relationship [9] to the Uberon term “olfactory bulb mitral cell layer” (UBERON:0004186), as well as a textual definition: “The large glutaminergic nerve cells whose dendrites synapse with axons of the olfactory receptor neurons in the glomerular layer of the olfactory bulb, and whose axons pass centrally in the olfactory tract to the olfactory cortex.” Most of this information is entered manually, using either a dedicated ontology development environment such as Protégé [10] or using spreadsheets that are subsequently translated into an ontology using tools like ROBOT [11]. In some cases, the assignment of an is-a relationship can be automated using OWL reasoning [12], but this relies on the ontology developer specifying logical definitions (a particular kind of axiom) for a subset of terms in advance. This strategy is used widely in multiple different biological ontologies (bio-ontologies), in particular, those involving many compositional terms, resulting in around half of the terms having subclass relationships automatically assigned in this way [13,14,15,16].

Except for the use of OWL reasoning to infer is-a relationships, the work of creating ontology terms is largely manual. The field of Ontology Learning (OL) aims to use a variety of statistical and Natural Language Processing (NLP) techniques to automatically construct ontologies, but the end results still require significant manual post-processing and manual curation by experts [17], and currently no biological ontologies make use of OL. Newer Machine Learning (ML) techniques such as link prediction leverage the graph structure of ontologies to predict new links, but state-of-the-art ontology link prediction algorithms such as rdf2vec [18] and owl2vec* [19] have low accuracy, and these also have yet to be adopted in standard ontology editing workflows.

A new approach that shows promise for helping to automate ontology term curation is instruction-tuned LLMs [20] such as the gpt-4 model that underpins ChatGPT [21]. LLMs are highly generalizable tools that can perform a wide range of generative tasks, including extracting structured knowledge from text and generating new text [22, 23]. One area that has seen widespread adoption of LLMs is software engineering, where it is now common to use tools such as GitHub Copilot [24] that are integrated within software development environments and perform code autocompletion. We have previously noted analogies between software engineering and ontology engineering and have successfully transferred tools and workflows from the former to the latter [25]. We are therefore drawn to the question of whether the success of generative AI in software could be applied to ontologies.

Here we describe and evaluate DRAGON-AI, an LLM-backed method for assisting in the task of ontology term completion. Given a portion of an ontology term (for example, the label/name, or the definition), the goal is to generate other requisite parts (for example, a textual description, or relationships to other terms). Our method accomplishes this using combinations of latent knowledge encoded in LLMs, knowledge encoded in one or more ontologies, or semi-structured knowledge sources such as GitHub issues, using a Retrieval Augmented Generation (RAG) approach. RAG is a common technique used to enhance the reliability of LLMs by combining them with an existing knowledge base or document store [26]. RAG is typically implemented by indexing documents or records as vectors created from textual embeddings—the most similar documents are retrieved in response to a query, and injected into the LLM prompt. We demonstrate the use of DRAGON-AI to generate both logical relationships and textual definitions over ten different ontologies drawn from the Open Biological and Biomedical Ontologies (OBO) Foundry [27]. To evaluate the automated textual definitions, we recruited ontology editors from the OBO community to rank these definitions according to three criteria.

We demonstrate that DRAGON-AI is able to achieve meaningful performance on both logical relationship and text generation tasks.

Implementation

DRAGON-AI is a method that allows for AI-based auto-completion of ontology objects. The input for the method is a partially completed ontology term (for example, just the term label, such as “hydroxyprolinuria”), and the output is a JSON or YAML object that has all desired fields populated, including the text definition, logical definition, and relationships.

The procedure is shown in Fig. 1. As an initial step, each ontology term and any additional contextual information is translated into a vector embedding, which is used as an index for retrieving relevant terms. Additional contextual information can include the contents of a GitHub issue tracker, which might contain text or semi-structured information of relevance to the request. The main ontology completion step works by first constructing a prompt using relevant contextual information. The prompt is passed as an input to an LLM, and the results are parsed to retrieve the completed term object.

Fig. 1

The DRAGON-AI ontology term completion process. (1) As an initial preprocessing step, knowledge resources (such as ontologies and GitHub issues) are indexed in a vector database. (2) A user provides a partial ontology term object (here, a term with only the label of the desired term “hydroxyprolinuria” is provided). (3) The vector database is queried for similar terms (e.g. cystathioninuria, hydroxyproline) or other relevant pieces of information (e.g. a GitHub issue). (4) A prompt is generated from a template, incorporating the most similar items in the vector database. (5) The prompt is provided as textual input to an LLM, which returns a completed JSON object. Either local or remote LLMs can be used. (6) The parsed object is returned to the user. Note that this figure uses YAML syntax to represent JSON objects, for the sake of compactness

Indexing ontologies and ontology embeddings

As an initial step, DRAGON-AI will create a vector embedding [28] for each term. Each term is represented as a structured object which is serialized using JSON, following a schema with the following properties:

id: a translated identifier for the term, as described below

label: a string with a human readable label or name for the term

definition: an optional string with a human-readable textual definition

relationships: a list of relationship objects

original_id: the original untranslated identifier

logical_definitions: an optional list of relationship objects

A relationship object has the following properties:

predicate: a translated identifier for the relationship type. For bio-ontologies, this is typically taken from the Relation Ontology [29], or is the subClassOf predicate, for is-a relations

target: a translated identifier for the term that the relationship points to, either in the same ontology, or a different ontology

Ontology terms are typically referred to using non-semantic numeric identifiers (for example, CL:1001502). These can confound LLMs, which have a tendency to hallucinate identifiers [30]. In our initial experiments, we found LLMs tend to perform best if presented with information in the same way that information is presented to humans, presumably as the majority of their training data is in this form. Therefore, we chose to transform all identifiers from a non-semantic numeric form (e.g. CL:1001502) to a symbol represented by the ontology term label in camel case format (e.g. MitralCell). An example is shown in Table 1.

Table 1 Example JSON structure used in DRAGON-AI
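For illustration, the sketch below shows a term object of this form in Python, based on the mitral cell example from the Background section (the helper function and the exact field values are illustrative rather than a reproduction of Table 1):

def to_camel_case(label: str) -> str:
    # Translate a human-readable label (e.g. "mitral cell") into the
    # symbolic form used for identifiers in prompts (e.g. "MitralCell").
    return "".join(word.capitalize() for word in label.split())

term = {
    "id": to_camel_case("mitral cell"),          # "MitralCell"
    "original_id": "CL:1001502",                 # original untranslated identifier
    "label": "mitral cell",
    "definition": "The large glutaminergic nerve cells whose dendrites synapse ...",
    "relationships": [
        {"predicate": "subClassOf", "target": to_camel_case("interneuron")},
        {"predicate": to_camel_case("has soma location"),
         "target": to_camel_case("olfactory bulb mitral cell layer")},
    ],
}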

We create a vector embedding for each term by first translating the object to text, and then embedding the text. The text is created by concatenating the label, definition, and relationships as key-value pairs. For this study we used the OpenAI text-embedding-ada-002 text embedding model, accessed via the OpenAI API.
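The sketch below illustrates this step, assuming the openai Python client (version 1.x) and a simplified key-value serialization of the term object:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def term_to_text(term: dict) -> str:
    # Serialize a term object by concatenating label, definition, and
    # relationships as key-value pairs.
    parts = [f"label: {term.get('label', '')}", f"definition: {term.get('definition', '')}"]
    for rel in term.get("relationships", []):
        parts.append(f"relationship: {rel['predicate']} {rel['target']}")
    return "; ".join(parts)

def embed(text: str) -> list[float]:
    # Create a vector embedding with the model used in this study.
    response = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding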

We store objects and their embeddings using the ChromaDB database [31]. This allows for efficient queries to retrieve the top k matching objects for an input object, using the Hierarchical Navigable Small World graph search algorithm [32].
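A minimal sketch of the storage and retrieval step using ChromaDB follows (reusing the term_to_text and embed helpers from the previous sketch; the collection name and persistence path are placeholders):

import chromadb

chroma_client = chromadb.PersistentClient(path="ontology_index")
terms_collection = chroma_client.get_or_create_collection(name="ontology_terms")

def index_terms(terms: list[dict]) -> None:
    # Store each term object together with its embedding and serialized text.
    terms_collection.add(
        ids=[t["original_id"] for t in terms],
        embeddings=[embed(term_to_text(t)) for t in terms],
        documents=[term_to_text(t) for t in terms],
    )

def most_similar_terms(partial_term: dict, k: int = 10) -> list[str]:
    # Retrieve the top-k most similar indexed terms for a (partial) term object.
    result = terms_collection.query(
        query_embeddings=[embed(term_to_text(partial_term))], n_results=k
    )
    return result["documents"][0]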

Indexing unstructured and semi-structured knowledge

Additional contextual knowledge can be included in DRAGON-AI to inform the term completion process – for example, publications from PubMed, articles from Wikipedia, or documentation intended for human ontology editors. One of the most important sources of knowledge for ontology terms is the content of GitHub issue trackers, where new term requests and other term change requests are proposed and discussed. Information in these trackers may be free text, or partially structured.

We used the GitHub API to load GitHub issues and store the resulting JSON objects, which are indexed without any specialized pre-processing. The text-serialized form of the GitHub JSON object is used as input for the embeddings. We store these JSON objects separately from the main ontology term objects.
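As an illustration, issues can be fetched through the GitHub REST API with the requests library and indexed in a separate collection (a sketch reusing the chroma_client and embed helpers from the earlier sketches; the collection name is a placeholder):

import json
import requests

issues_collection = chroma_client.get_or_create_collection(name="github_issues")

def fetch_issues(owner: str, repo: str) -> list[dict]:
    # Retrieve issues for a repository via the public GitHub REST API.
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    response = requests.get(url, params={"state": "all", "per_page": 100})
    response.raise_for_status()
    return response.json()

def index_issues(issues: list[dict]) -> None:
    # Index the text-serialized JSON form of each issue, without specialized pre-processing.
    issues_collection.add(
        ids=[str(issue["number"]) for issue in issues],
        embeddings=[embed(json.dumps(issue)) for issue in issues],
        documents=[json.dumps(issue) for issue in issues],
    )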

Prompt generation using Retrieval Augmented Generation

At the core of the DRAGON-AI approach is the generation of a prompt that is passed as input to an LLM. The prompt includes the partial term and an instruction directing the model to complete the term, filling in missing information and returning the result as a JSON object.

In order to guide the LLM to create a term that is similar in style to existing terms, and to guide the LLM to pick existing terms in relationships, we provide additional context within the prompt. This additional context includes existing relevant terms, provided in the same JSON format as the intended response. When prompting LLMs, it is common to include a small set of examples to help guide the model to provide the best responses (few-shot learning). One approach here is to use a static or fixed set of examples, but the drawback of this is that the pre-selected examples may not be applicable to the specific request from the user. Ideally, examples would be selected based on relevancy.

We use RAG as the general approach to retrieve the most relevant information. As a first step, the partial term object provided by the user is used as a query to the ontology terms loaded into the ontology vector index. An embedding is created from the text fields of the object (using the same embedding model as was used to index the ontology), and this is used to query the top k results (k is 10 by default). These form the in-context examples for the prompt. The intent is to retrieve terms that are similar to the intended term to inform the prediction of the completed term; for example, if the query term is “hydroxyprolinuria”, then similar terms in the ontology such as “cystathioninuria” will be informative.

Each retrieved example forms an input–output training pair which is concatenated directly into the prompt by serializing the JSON object, for example:

input: { "label": … }

output: { "label": …, "definition": …, "relationships": [ … ] }
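The sketch below shows one way such a prompt could be assembled; the instruction wording is illustrative and the actual DRAGON-AI prompt template may differ:

import json

def build_prompt(partial_term: dict, examples: list[dict]) -> str:
    # Render each retrieved term as an input-output pair, then append the
    # user's partial term as a final input with the output left open.
    lines = [
        "Complete the ontology term given as input, filling in any missing "
        "fields, and return the completed term as a JSON object."
    ]
    for example in examples:
        lines.append(f"input: {json.dumps({'label': example['label']})}")
        lines.append(f"output: {json.dumps(example)}")
    lines.append(f"input: {json.dumps(partial_term)}")
    lines.append("output:")
    return "\n".join(lines)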

To diversify search results, we implement Maximal Marginal Relevance (MMR) [33] in order to re-rank results. This helps with inclusion of terms that inform multiple cross-cutting aspects of the requested term, including terms from other relevant ontologies. For example, if the input is “hydroxyprolinuria” then the highest-ranking terms may be other phenotypes involving circulating molecules, but by diversifying search results we also include relevant chemical entities from ChEBI like “hydroxyproline”.
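A standard MMR re-ranking can be sketched as follows; the relevance weight lam = 0.5 is an illustrative default rather than the value necessarily used by DRAGON-AI:

import numpy as np

def mmr(query_vec, candidate_vecs, k=10, lam=0.5):
    # Iteratively select candidates that balance relevance to the query
    # against redundancy with candidates that are already selected.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of candidates, re-ranked for diversity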

Optionally, additional information other than the source ontology can be included in the prompt. This potentially includes GitHub issues (accessed via the GitHub API), documentation written by and for ontology developers, and PubMed articles. For this study we only made use of GitHub issues. For these sources we also use a RAG method to select only the most semantically similar documents.

Different LLMs have different limits on the combined size of prompt and response. In order to stay within these limits, we reduce the number of in-context examples to the maximum number that still fits within the limit, or the number provided by the user, whichever is greater.
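The trimming step can be sketched as follows, assuming token counts from the tiktoken library and the build_prompt helper from the earlier sketch; the token budget shown is a placeholder:

import tiktoken

def trim_examples(partial_term: dict, examples: list[dict],
                  model: str = "gpt-4", budget: int = 6000) -> list[dict]:
    # Drop the least relevant in-context examples until the prompt fits the budget.
    encoding = tiktoken.encoding_for_model(model)
    kept = list(examples)  # assumed to be ordered from most to least relevant
    while kept and len(encoding.encode(build_prompt(partial_term, kept))) > budget:
        kept.pop()
    return kept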

Prompt passing and result parsing

DRAGON-AI allows for a number of different ways to extract structured information as a response. These include using OpenAI function calls, or using a recursive-descent approach via the SPIRES algorithm [34]. For this study we evaluated a pure RAG-based in-context approach, as shown in Fig. 1.

This prompt is presented to the LLM, which responds with a serialized JSON object analogous to the in-context examples. This response is parsed using a standard JSON parser, with additional preprocessing to remove extraneous preamble text, and the results are merged with the input object to form the predicted object.
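A simple version of this parsing step is sketched below; the real preprocessing in DRAGON-AI may handle more cases, and the assumption that user-provided fields take precedence in the merge is ours:

import json

def parse_response(response_text: str, partial_term: dict) -> dict:
    # Strip any preamble or trailing text around the JSON payload, parse it,
    # and merge the result with the user-provided partial term.
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    predicted = json.loads(response_text[start:end])
    return {**predicted, **partial_term}  # fields supplied by the user are kept as given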

Relationship predictions are further post-processed to remove relationships to non-existent terms in the ontology or imported ontologies. Some of these correspond to meaningful relationships to terms that have yet to be added. In the future, the system may be extended to include a step that fills in missing terms, but the current behavior is to be conservative when predicting relationships.
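A sketch of this conservative post-filter, where known_terms is assumed to be the set of translated identifiers for all terms in the ontology and its imports:

def filter_relationships(predicted: dict, known_terms: set[str]) -> dict:
    # Drop predicted relationships whose target term does not exist in the
    # ontology or its imported ontologies.
    predicted["relationships"] = [
        rel for rel in predicted.get("relationships", [])
        if rel["target"] in known_terms
    ]
    return predicted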

Evaluation

We used 10 different ontologies in our evaluation: the Cell Ontology (CL) [8], UBERON, the Gene Ontology (GO), the Human Phenotype Ontology (HP) [35], the Mammalian Phenotype Ontology (MP) [36], the Mondo disease ontology (MONDO), the Environment Ontology (ENVO) [37], the Food Ontology (FOODON), the Ontology of Biomedical Investigations (OBI) [38], and the Ontology of Biological Attributes (OBA) [39]. These were selected because they are widely used and impactful and cover a broad range of domains, from basic science through to clinical practice, with representation outside biology (the Environment Ontology and FoodOn). The selection also represents a broad range of ontology development styles, from highly compositional ontologies that make extensive use of templated design patterns (OBA) to ontologies whose terms are structured more individually. All selected ontologies make use of Description Logic (DL) axiomatizations, allowing reasoning to be used to auto-classify the ontology and providing a baseline for comparison. Table 2 shows which tasks were performed and evaluated on which ontologies. Table 3 lists the models used in the study.

Table 2 Ontologies and ontology versions used for evaluation

Table 3 Models evaluated, plus their versions/checkpoints

We subdivided each ontology into a core ontology plus a testing set of 50 terms. Where possible, we selected test terms from the set of terms that were added to the ontology after November 2022, to minimize the possibility of test data leakage. This was not possible for ENVO, which has a less frequent release schedule, with the most recent release at the time of analysis being from February 2023, so this ontology included terms added in 2021 and 2022. Uberon also had fewer new terms in 2023, so the test set for this ontology was 40 terms.

We chose three tasks: prediction of (1) relationships, (2) definitions, and (3) logical definitions. For each task, the test set consists of ontology term objects with the field to be predicted masked (other fields such as the ontology term identifier were also masked, as these are another source of training data leakage). For example, to predict relationships, the test objects have only labels and textual definitions present. We excluded OBI, HP, and MP from the logical definition analysis because these ontologies have more complex, nested logical definitions that do not conform to the simple style supported in DRAGON-AI. We evaluated textual definitions for only nine of the ten ontologies, based on evaluator expertise.

We tested three models (gpt-4, gpt-3.5-turbo, and nous-hermes-13b-ggml) against all ontologies for the three tasks. The first two models are proprietary closed models accessed via an API; the third is open and was executed locally on an Apple M1 macOS system.

Relationship prediction evaluation

One of the main challenges in ontology learning is evaluation, since the construction of ontologies involves some subjective decisions, and many different valid representations are possible [40]. An additional challenge is that ontologies allow for specification of things at different levels of specificity. For the relationship prediction task, we chose to treat the existing relationships in the ontology as the gold standard, recognizing this may penalize alternative but valid representations.

If a predicted relationship matches a relationship that exists in the ontology, this counts as a true positive. If a predicted relationship is more general than a relationship in the ontology, then we do not count this as a full true positive, but instead treat it as an intermediate between true positive and false negative. We use Information Content (IC) based scores, in the same fashion as Critical Assessment of Function Annotation (CAFA) evaluations [41]. The IC of an ontology term is calculated as the negative log of the probability of observing that term as a subsumer of a random term in the ontology, IC(t) = -log(P(t)). We calculate the IC of the broader predicted term (ICp) and the narrower expected term (ICe), and assign the true positive to be the ratio ICp/ICe, and the false negative as 1-ICp/ICe.

A relationship (s, p, o) is counted as more general if the target node o is reachable from the subject node s by traversing a combination of is-a (subClassOf) edges and edges of type p.
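The scoring of a single predicted relationship target can be sketched as follows; subsumer_counts, n_terms, and the is_more_general traversal function are assumed inputs rather than part of the published implementation:

import math

def ic(term: str, subsumer_counts: dict[str, int], n_terms: int) -> float:
    # IC(t) = -log P(t), where P(t) is the probability that t subsumes a
    # randomly chosen term in the ontology.
    return -math.log(subsumer_counts[term] / n_terms)

def score_target(predicted: str, expected: str, subsumer_counts: dict[str, int],
                 n_terms: int, is_more_general) -> tuple[float, float]:
    # Return (true positive credit, false negative penalty) for one relationship.
    if predicted == expected:
        return 1.0, 0.0
    if is_more_general(predicted, expected):
        ratio = ic(predicted, subsumer_counts, n_terms) / ic(expected, subsumer_counts, n_terms)
        return ratio, 1.0 - ratio
    return 0.0, 1.0  # an unrelated prediction also counts as a false positive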

As a baseline, we also include OWL reasoning results using the Elk reasoner [42]. This is only applicable to subsumption (SubClassOf) relationships. For each subsumption relationship in the ontology, we remove the relationship and use the reasoner to determine whether it is recapitulated, using the OWLTools [43] tag-entailed-axioms command. Because all of these ontologies use OWL reasoning as part of their release process, the precision of reasoning, when measured against the released ontology, is 1.0 by definition. However, recall and F1 [44] remain informative for assessing the breadth of coverage of reasoning.

Definition prediction evaluation

For the definition prediction task, we could not employ the same evaluation strategy, as it is very rare for a predicted definition to be an exact match for the one that was manually authored in the ontology – however, such definitions cannot simply be counted as false positives, as they may still be good definitions. We therefore used two methods for evaluating definitions: (1) measuring the semantic distance between the predicted definition and the curated definition using BERTScore [45]; (2) manual assessment of predicted and curated definitions. For scoring definitions, we used the bert-score package from PyPI with default parameters (English language, roberta-large as the model).
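The BERTScore comparison can be sketched as follows (using the bert-score package; returning the F1 component as the similarity measure is an illustrative choice):

from bert_score import score

def definition_similarity(predicted_definitions: list[str],
                          curated_definitions: list[str]) -> list[float]:
    # Compare each predicted definition against its manually curated counterpart.
    precision, recall, f1 = score(
        predicted_definitions,        # candidate strings
        curated_definitions,          # reference strings
        lang="en",                    # English defaults to roberta-large
        model_type="roberta-large",
    )
    return f1.tolist()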

For the manual evaluation, we enlisted ontology editors and curators to score predicted and curated definitions.

We first aggregated all generated definitions from all models, together with the definitions that had previously been manually curated for the test set terms. We then assigned each evaluator the task of evaluating a set of definitions by scoring them against three different criteria (see supplementary methods for the templates used). The three scoring criteria were:

Biological accuracy: is the textual definition biologically accurate?

Internal consistency: is the structure and content of the definition consistent with other definitions in the ontology, and with the style guide for that ontology?

Overall score: overall utility of the definition.

For each of these metrics, an ordinal scale of 1–5 was used, with 1 being the worst, 3 being acceptable, and 5 being the best. Assigning a consistency score was optional. Evaluators could also choose to use the same score for accuracy and overall score. Additionally, the evaluator could opt to provide a confidence score for their ranking, also on a scale ranging from 1 (low confidence) to 5 (high confidence). We provided a notes column to allow for additional qualitative analysis of the results.

At least two evaluators were assigned to each ontology. Evaluators received individualized spreadsheets and were blinded to the source of each definition. They worked independently and did not see the results of other evaluators until their task was completed. Evaluators were also asked to provide a retrospective qualitative evaluation of the process, which we include in the discussion section.

To measure inter-annotator agreement we calculated the Intraclass Correlation Coefficient (ICC) measure. A one-way analysis of variance (ANOVA) model was fitted to the data, treating the evaluator as a random effect. From the ANOVA table, we extracted the mean squares between evaluators (MSB) and the mean squares within evaluators (MSE). The ICC was then calculated using the formula:

$$\text{ICC}=(\text{MSB}-\text{MSE}) / (\text{MSB}+(k-1)\times \text{MSE})$$

where k is the number of evaluators.

We calculated the ICC for three metrics: accuracy, consistency, and score. As a guideline, values above 0.5 are considered to indicate moderate consistency, with values of 0.75 and above indicating good consistency.

Aggregating ICCs

The overall ICC values for accuracy, consistency, and score were computed by filtering the dataset based on a minimum confidence threshold and then applying the ICC calculation method to each metric. This provided a robust measure of inter-rater reliability for each of the evaluated metrics.
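The sketch below transcribes the confidence filtering and one-way ANOVA ICC calculation described above, assuming a pandas DataFrame with one row per rating and with evaluator, confidence, and per-metric columns (the column names are illustrative):

import pandas as pd

def icc(ratings: pd.DataFrame, metric: str, min_confidence: int = 1) -> float:
    # Filter by a minimum confidence threshold, then compute the one-way
    # ANOVA ICC with the evaluator as the grouping factor, as described above.
    data = ratings[ratings["confidence"] >= min_confidence]
    groups = [g[metric].dropna().to_numpy() for _, g in data.groupby("evaluator")]
    k = len(groups)                              # number of evaluators
    n = sum(len(g) for g in groups)              # total number of ratings
    grand_mean = data[metric].mean()
    # Mean squares between evaluators (MSB) and within evaluators (MSE).
    msb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return (msb - mse) / (msb + (k - 1) * mse)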

Execution

Our workflow is reproducible through our GitHub repository [46], also archived on Zenodo [47]. A Makefile is used to orchestrate extraction of ontologies, splitting test sets, loading into a vector database, and performing predictions. A collection of Jupyter Notebooks is used to evaluate and analyze the results.
