Participation Guidelines

GutBrainIE CLEF 2026

Task #6 of the BioASQ Lab 2026

(An HEREDITARY challenge)


Tasks

GutBrainIE CLEF 2026 is Task #6 of the BioASQ CLEF Lab 2026, proposing a Natural Language Processing (NLP) challenge on biomedical texts within the context of the EU-supported project HEREDITARY.

Specifically, it is focused on extracting structured information from biomedical abstracts related to the gut microbiota and its connections with Alzheimer's, Parkinson's, Multiple Sclerosis, Amyotrophic Lateral Sclerosis, and mental health. The task aims to foster the development of Information Extraction (IE) systems that can support experts in understanding the gut-brain interplay.

The GutBrainIE task is divided into two main subtasks. In the first, participants are asked to identify and classify specific text spans into predefined categories; in the second, they must determine whether a particular relationship, defined between two categories, holds.

The training data for all subtasks are available upon registration for the challenge and the test data will be available around two weeks before the run submission deadline. The training set is divided into four parts:

  • Gold Collection: A highly curated dataset manually annotated primarily by a team of 7 expert annotators from the University of Padua, Italy, with the involvement of 6 additional external expert contributors;
  • Silver Collection: A weakly curated dataset manually annotated by a team of about 55 students of Linguistics and Terminology trained and supervised by the experts;
  • Silver Collection 2025: A weakly curated dataset manually annotated for the 2025 edition by a team of about 40 students of Linguistics and Terminology trained and supervised by the experts, with automatically generated concept-level annotations;
  • Bronze Collection: A distantly supervised dataset comprising automatically generated annotations. Note that no manual revision has been performed on this set.

The test set is a held-out selection of documents from the gold collection, consisting exclusively of the title and abstract of each document and selected to ensure representativeness and full coverage of all entity and relation types.

Please see the Datasets and Important Dates sections for more information.

Subtask 6.1.1 - Named Entity Recognition (NER)

Participants are provided with PubMed abstracts discussing the gut-brain interplay and are asked to identify and classify specific text spans (entity mentions) into one of the 13 predefined categories, such as bacteria, chemical, or microbiome.

Entity mentions are expressed as tuples (entityCategory ; entityLocation (title/abstract) ; startOffset ; endOffset).

Subtask 6.1.2 - Named Entity Recognition and Disambiguation (NERD)

Participants are provided with PubMed abstracts discussing the gut-brain interplay and are asked to identify and classify specific text spans (entity mentions) into one of the 13 predefined categories, such as bacteria, chemical, or microbiome. Each identified entity must also be linked to a concept identifier from one of the defined biomedical reference resources.

Entity mentions are represented as tuples (entityCategory ; entityLocation (title/abstract) ; startOffset ; endOffset ; conceptURI).

Subtask 6.2.1 - Mention-Level Relation Extraction (M-RE)

Participants are provided with PubMed abstracts discussing the gut-brain interplay and are asked to identify relations between specific entity mentions within a document. Each relation must include the two involved entity mentions and the relation predicate connecting them.

Relations are expressed as triples (subjectMention ; relationPredicate ; objectMention).

Subtask 6.2.2 - Concept-Level Relation Extraction (C-RE)

This subtask extends the mention-level setting to the concept level. Participants must identify and classify relations between linked concepts rather than between their textual mentions. Concept-level relations aim to capture knowledge connections abstracted from surface forms and lexical variations.

Relations are expressed as tuples (subjectConceptURI ; subjectCategory ; relationPredicate ; objectConceptURI; objectCategory).


For each task, the test set consists of a collection of documents, each including only the PubMed ID, title, and abstract. Given a test document, participants are required to extract tuples that include all the fields for the respective task. For example, in the NER task, tuples must contain all and only the following fields: (entityCategory ; entityLocation (title/abstract) ; startOffset ; endOffset). For more details, please refer to the submission format examples provided in the Submitting section.

Participating

To participate in GutBrainIE CLEF 2026, groups need to register to the BioASQ Laboratory at the following link:

Register

Data are available! Access to the data will be granted within 24 hours of registration.

Important Dates

  • Training data release: Available Now! (access granted upon registration)
  • Registration closes: April 23, 2026
  • Test data release: April 28, 2026
  • Runs submission deadline: May 7, 2026
  • Evaluation results out: May 19, 2026
  • Participant and position paper submission deadline: May 28, 2026
  • Notification of acceptance for participant and position papers: June 30, 2026
  • Camera-ready participant papers submission: July 6, 2026
  • GutBrainIE CLEF Workshop: September 21-24, 2026 during the CLEF Conference in Jena, Germany

Results

Leaderboards

Available after the systems submission deadline...

Datasets

Data Description

The data for this task is a set of titles and abstracts of biomedical articles retrieved from PubMed, focusing on the gut-brain interplay and its implications in neurological and mental health.

The datasets are organized into:
  • Gold-Standard Annotations: High-quality annotations, expert-curated.
  • Silver-Standard Annotations: Mid-quality annotations, created by trained students under expert supervision. The Silver-Standard collection is divided into:
    • Silver: Mid-quality annotations created by trained students under expert supervision. The students are organized into two clusters:
      • StudentA, including those with more consistent annotation performance,
      • StudentB, including those with less consistent annotation performance.
    • Silver 2025: Same as Silver but for the 2025 edition. Concept-level annotations were automatically generated.
  • Bronze-Standard Annotations: Automatically generated annotations, using fine-tuned GLiNER for Named Entity Recognition (NER) and fine-tuned ATLOP for Relation Extraction (RE).
To foster the development of effective IE systems for the subtasks proposed within GutBrainIE, the provided datasets include:
  • Entity Mentions: Text spans classified into predefined categories and linked to concept URIs from biomedical reference resources.
  • Relations: Associations between entities, specifying that a particular relationship holds between two entities.

Collections

  • Training and Development Data: Available upon registration. After registering, participants will receive an email with a download link (starting from February 2–9, 2026).
  • Test Data: The test set will be available from April 28, 2026. This data will be used for final system evaluation.

Dataset Format

Annotations are provided in JSON format, for ease of use with NLP systems. Each entry in the dataset corresponds to a PubMed article, identified by its PubMed ID (PMID), and includes the following fields:
  • Metadata: Article-related information, including title, author, journal, year, and abstract. It also includes the identifier of the annotator: expert annotators are labeled as expert_1 to expert_7, student annotators are grouped into two clusters, identified as student_A and student_B, and automatically generated annotations are labeled as distant. Participants may use this information to assign different weights to annotations based on who performed them.
  • Entities: An array of objects where each object represents an annotated entity mention in the article, with the following attributes:
    • start and end indices: Character offsets marking the span of the entity mention.
    • location: Indicates if the entity mention is located in the title or in the abstract.
    • text_span: The actual text span of the entity mention.
    • label: The label assigned to the entity mention (e.g., bacteria, microbiome).
    • uri: The concept URI to which the entity mention is linked, taken from one of the biomedical reference resources.
  • Relations: An array of objects where each object represents an annotated relationship between two entity mentions in the article, with the following attributes:
    • subject_start and subject_end indices: Character offsets marking the span of the subject entity mention.
    • subject_location: Indicates if the subject entity mention is located in the title or in the abstract.
    • subject_text_span: The actual text span of the subject entity mention.
    • subject_uri: The concept URI to which the subject entity mention is linked, taken from one of the biomedical reference resources.
    • subject_label: The label assigned to the subject entity mention (e.g., bacteria, microbiome).
    • predicate: The label assigned to the relationship.
    • object_start and object_end indices: Character offsets marking the span of the object entity mention.
    • object_location: Indicates if the object entity mention is located in the title or in the abstract.
    • object_text_span: The actual text span of the object entity mention.
    • object_label: The label assigned to the object entity mention (e.g., bacteria, microbiome).
    • object_uri: The concept URI to which the object entity mention is linked, taken from one of the biomedical reference resources.
  • Mention-level Relations: Relations extracted from the Relations array, formatted as mention-based tuples of subject_text_span, subject_label, predicate, object_text_span, and object_label.
  • Concept-level Relations: Relations extracted from the Relations array, formatted as concept-based tuples of subject_uri, subject_label, predicate, object_uri, and object_label.
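Under these conventions, a dataset entry can be parsed with the standard json module. A minimal sketch, assuming the entity field names mirror those in the submission samples below (start_idx, end_idx, location, text_span, label); verify them against the released files:

```python
import json

def entity_tuples(article: dict) -> list[tuple]:
    """Extract (label, location, start_idx, end_idx) tuples, the NER
    representation described above, from one parsed article entry."""
    return [
        (e["label"], e["location"], e["start_idx"], e["end_idx"])
        for e in article.get("entities", [])
    ]

# Toy entry mirroring the submission samples in the Submitting section.
# Note: in those samples the end offset appears inclusive ("patients"
# spans 75-82); verify this against the released data.
sample = json.loads("""
{
  "34870091": {
    "entities": [
      {"start_idx": 75, "end_idx": 82, "location": "title",
       "text_span": "patients", "label": "human"}
    ]
  }
}
""")

for pmid, article in sample.items():
    print(pmid, entity_tuples(article))
```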

Alternative Formats

For those more familiar with CSV or tabular formats, the dataset is also provided in these formats. In this case, each of the fields mentioned above is stored in a separate file:
  • Metadata file
  • Entities file
  • Relations file
  • Mention-level relations file
  • Concept-level relations file
The CSV files use the pipe symbol (|) as a separator, while tabular files use the tab character (\t) for separation.
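For example, the pipe-separated files can be read with the standard csv module by setting the delimiter accordingly. The column names in this sketch are hypothetical; check them against the actual files:

```python
import csv
import io

# Hypothetical excerpt of the pipe-separated entities file; the real
# column names may differ from this sketch.
csv_text = (
    "pmid|start_idx|end_idx|location|text_span|label\n"
    "34870091|75|82|title|patients|human\n"
)

# csv.DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text), delimiter="|"))
print(rows[0]["text_span"])  # patients
```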

Entity and Relation Labels

The set of entities considered for annotations includes:

Entity Label URI Definition
Anatomical Location NCIT_C13717 Named locations of or within the body.
Animal NCIT_C14182 A non-human living organism that has membranous cell walls, requires oxygen and organic foods, and is capable of voluntary movement, as distinguished from a plant or mineral.
Biomedical Technique NCIT_C15188 Research concerned with the application of biological and physiological principles to clinical medicine.
Bacteria NCBITaxon_2 One of the three domains of life (the others being Eukarya and Archaea), also called Eubacteria. They are unicellular prokaryotic microorganisms which generally possess rigid cell walls, multiply by cell division, and exhibit three principal forms: round or coccal, rodlike or bacillary, and spiral or spirochetal.
Chemical CHEBI_59999 A chemical substance is a portion of matter of constant composition, composed of molecular entities of the same type or of different types. This category also includes metabolites, which in biochemistry are the intermediate or end product of metabolism, and neurotransmitters, which are endogenous compounds used to transmit information across the synapses.
Dietary Supplement MESH_68019587 Products in capsule, tablet or liquid form that provide dietary ingredients, and that are intended to be taken by mouth to increase the intake of nutrients. Dietary supplements can include macronutrients, such as proteins, carbohydrates, and fats, and/or micronutrients, such as vitamins, minerals, and phytochemicals.
Disease, Disorder, or Finding (DDF) NCIT_C7057 A condition that is relevant to human neoplasms and non-neoplastic disorders. This includes observations, test results, history and other concepts relevant to the characterization of human pathologic conditions.
Drug CHEBI_23888 Any substance which when absorbed into a living organism may modify one or more of its functions. The term is generally accepted for a substance taken for a therapeutic purpose, but is also commonly used for abused substances.
Food NCIT_C1949 A substance consumed by humans and animals for nutritional purposes.
Gene SNOMEDCT_67261001 A functional unit of heredity which occupies a specific position on a particular chromosome and serves as the template for a product that contributes to a phenotype or a biological function.
Human NCBITaxon_9606 Members of the species Homo sapiens.
Microbiome OHMI_0000003 This term refers to the entire habitat, including the microorganisms (bacteria, archaea, lower and higher eukaryotes, and viruses), their genomes (i.e., genes), and the surrounding environmental conditions.
Statistical Technique NCIT_C19044 A method of calculating, analyzing, or representing statistical data.

while the defined set of relations includes:

Head Entity Tail Entity Predicate
Anatomical Location Human / Animal Located in
Bacteria Bacteria / Chemical / Drug Interact
Bacteria DDF Influence
Bacteria Gene Change expression
Bacteria Human / Animal Located in
Bacteria Microbiome Part of
Chemical Anatomical Location / Human / Animal Located in
Chemical Chemical Interact / Part of
Chemical Microbiome Impact / Produced by
Chemical / Dietary Supplement / Drug / Food Bacteria / Microbiome Impact
Chemical / Dietary Supplement / Food DDF Influence
Chemical / Dietary Supplement / Drug / Food Gene Change expression
Chemical / Dietary Supplement / Drug / Food Human / Animal Administered
DDF Anatomical Location Strike
DDF Bacteria / Microbiome Change abundance
DDF Chemical Interact
DDF DDF Affect / Is a
DDF Human / Animal Target
Drug Chemical / Drug Interact
Drug DDF Change effect
Human / Animal / Microbiome Biomedical Technique Used by
Microbiome Anatomical Location / Human / Animal Located in
Microbiome Gene Change expression
Microbiome DDF Is linked to
Microbiome Microbiome Compared to
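The head/tail constraints in the table above can serve as a sanity check on predicted triples before submission. A minimal sketch transcribing only a few rows (the lowercase label and predicate casing follows the submission samples and is an assumption; extend the table with the remaining rows):

```python
# Partial transcription of the relation schema above; extend with the
# remaining rows before using it in practice.
SCHEMA = {
    ("bacteria", "microbiome"): {"part of"},
    ("microbiome", "DDF"): {"is linked to"},
    ("drug", "DDF"): {"change effect"},
}

def is_valid_triple(head: str, tail: str, predicate: str) -> bool:
    """True iff the (head, tail, predicate) combination is defined."""
    return predicate in SCHEMA.get((head, tail), set())

print(is_valid_triple("bacteria", "microbiome", "part of"))   # True
print(is_valid_triple("bacteria", "microbiome", "interact"))  # False
```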

For more details on entity and relation labels, refer to the Annotation Guidelines available for download HERE.


Dataset Statistics

The table below provides an overview of the key statistics for each dataset collection:

Collection Num of Documents Total Entities Avg Entities per Doc Total Relations Avg Rels per Doc
Train Gold 639 20'530 32.13 8'556 13.39
Train Silver 811 26'134 32.22 10'907 13.45
Train Silver 2025 499 15'275 30.61 10'616 21.27
Train Bronze 2972 89'987 30.28 29'692 9.99
Development Set 80 2'521 31.51 1'261 15.76

Train Gold Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Gold collection.

The Ratio column represents the proportion of each entity or relation label relative to the total number of entities or relations in the collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 6'909 0.337 10.81
Chemical 3'119 0.152 4.88
Microbiome 1'920 0.094 3.00
Human 1'786 0.087 2.79
Bacteria 1'535 0.075 2.40
Animal 1'094 0.053 1.71
Biomedical Technique 1'093 0.053 1.71
Anatomical Location 993 0.048 1.55
Dietary Supplement 769 0.037 1.20
Drug 379 0.018 0.59
Statistical Technique 361 0.018 0.56
Gene 301 0.015 0.47
Food 271 0.013 0.42

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 1'481 0.173 2.32
Located In 1'282 0.150 2.01
Target 1'177 0.138 1.84
Is Linked To 888 0.104 1.39
Affect 831 0.097 1.30
Interact 524 0.061 0.82
Impact 512 0.060 0.80
Used By 478 0.056 0.75
Change Abundance 297 0.035 0.46
Is A 280 0.033 0.44
Administered 240 0.028 0.38
Part Of 206 0.024 0.32
Strike 132 0.015 0.21
Change Effect 105 0.012 0.16
Change Expression 87 0.010 0.14
Produced By 29 0.003 0.05
Compared To 7 0.001 0.01

Train Silver Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Silver collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 8'504 0.325 10.49
Chemical 4'310 0.165 5.31
Microbiome 2'243 0.086 2.77
Bacteria 2'031 0.078 2.50
Human 2'000 0.077 2.47
Anatomical Location 1'661 0.064 2.05
Biomedical Technique 1'657 0.063 2.04
Animal 1'018 0.039 1.26
Dietary Supplement 831 0.032 1.02
Drug 629 0.024 0.78
Statistical Technique 444 0.017 0.55
Food 414 0.016 0.51
Gene 392 0.015 0.48

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 2'099 0.192 2.59
Located In 1'645 0.151 2.03
Target 1'254 0.115 1.55
Is Linked To 903 0.083 1.11
Affect 899 0.082 1.11
Is A 740 0.068 0.91
Interact 702 0.064 0.87
Impact 521 0.048 0.64
Part Of 441 0.040 0.54
Used By 428 0.039 0.53
Strike 343 0.031 0.42
Administered 318 0.029 0.39
Change Effect 258 0.024 0.32
Change Abundance 237 0.022 0.29
Change Expression 63 0.006 0.08
Produced By 52 0.005 0.06
Compared To 4 0.000 0.00

Train Silver 2025 Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Silver 2025 collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 5'584 0.366 11.19
Chemical 1'871 0.122 3.75
Microbiome 1'599 0.105 3.20
Human 1'199 0.078 2.40
Bacteria 1'129 0.074 2.26
Anatomical Location 856 0.056 1.72
Dietary Supplement 660 0.043 1.32
Biomedical Technique 655 0.043 1.31
Drug 500 0.033 1.00
Animal 483 0.032 0.97
Gene 319 0.021 0.64
Statistical Technique 258 0.017 0.52
Food 162 0.011 0.32

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 2'536 0.239 5.08
Is Linked To 1'710 0.161 3.43
Target 1'540 0.145 3.09
Located In 1'049 0.099 2.10
Impact 835 0.079 1.67
Change Abundance 535 0.050 1.07
Change Effect 469 0.044 0.94
Used By 391 0.037 0.78
Affect 383 0.036 0.77
Part Of 355 0.033 0.71
Interact 288 0.027 0.58
Produced By 247 0.023 0.49
Strike 124 0.012 0.25
Change Expression 92 0.009 0.18
Administered 34 0.003 0.07
Is A 18 0.002 0.04
Compared To 10 0.001 0.02

Train Bronze Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Bronze collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 34'560 0.384 11.63
Chemical 13'054 0.145 4.39
Microbiome 8'625 0.096 2.90
Anatomical Location 6'409 0.071 2.16
Human 5'874 0.065 1.98
Bacteria 4'959 0.055 1.67
Biomedical Technique 3'851 0.043 1.30
Animal 3'090 0.034 1.04
Dietary Supplement 3'017 0.034 1.02
Drug 2'358 0.026 0.79
Food 1'742 0.019 0.59
Gene 1'304 0.014 0.44
Statistical Technique 1'144 0.013 0.38

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 6'652 0.224 2.24
Target 4'163 0.140 1.40
Is Linked To 3'905 0.132 1.31
Located In 3'830 0.129 1.29
Affect 3'078 0.104 1.04
Is A 1'713 0.058 0.58
Impact 1'494 0.050 0.50
Used By 919 0.031 0.31
Strike 850 0.029 0.29
Interact 779 0.026 0.26
Part Of 601 0.020 0.20
Administered 554 0.019 0.19
Change Abundance 498 0.017 0.17
Produced By 317 0.011 0.11
Change Effect 293 0.010 0.10
Change Expression 44 0.001 0.01
Compared To 2 0.000 0.00

Dev Collection Statistics

The tables below provide detailed statistics for entities and relations in the Dev collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 793 0.315 9.91
Chemical 366 0.145 4.58
Microbiome 231 0.092 2.89
Human 192 0.076 2.40
Bacteria 183 0.073 2.29
Anatomical Location 169 0.067 2.11
Animal 152 0.060 1.90
Biomedical Technique 139 0.055 1.74
Drug 75 0.030 0.94
Dietary Supplement 64 0.025 0.80
Gene 63 0.025 0.79
Food 59 0.023 0.74
Statistical Technique 35 0.014 0.44

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Located In 209 0.166 2.61
Influence 177 0.140 2.21
Target 172 0.136 2.15
Affect 142 0.113 1.77
Is Linked To 134 0.106 1.68
Administered 76 0.060 0.95
Used By 74 0.059 0.93
Impact 64 0.051 0.80
Is A 49 0.039 0.61
Interact 43 0.034 0.54
Change Effect 34 0.027 0.42
Part Of 31 0.025 0.39
Strike 23 0.018 0.29
Change Abundance 16 0.013 0.20
Change Expression 7 0.006 0.09
Produced By 6 0.005 0.07
Compared To 4 0.003 0.05

Submitting

Participating teams should satisfy the following guidelines:

  • The runs should be submitted in the format described below;
  • Each group can submit a maximum of 25 runs for each subtask.

Submissions are handled via the BioASQ submission system linked below:

Available after test set release

Submission Guidelines

Participants are invited to submit results for any or all of the four subtasks (NER, NERD, M-RE, C-RE) independently.

In the following, a "run" refers to the predictions made by a single system on the test set.

All runs must be submitted in a single zipped file named "teamID_GutBrainIE_2026.zip". Within this zip archive, each run should be placed in a separate folder named "teamID_taskID_runID_systemDesc", without spaces or special characters, where:

  • teamID is the identifier of your team, chosen when you registered for CLEF 2026;
  • taskID is the identifier of the chosen task, i.e., one of the following tokens: T611 for subtask 6.1.1, T612 for subtask 6.1.2, T621 for subtask 6.2.1, and T622 for subtask 6.2.2;
  • runID is the identifier of the run, chosen by the participants and containing only letters and numbers (a-z, A-Z, 0-9);
  • systemDesc is an optional (short) string that further describes your submission.
The content of each folder consists of two files:
  • teamID_taskID_runID_systemDesc.json: A JSON file with predictions on the test set.
  • teamID_taskID_runID_systemDesc.meta: A metadata file briefly describing the approach, including:
    • Team ID.
    • Task ID.
    • Run ID.
    • Type of training applied.
    • Pre-processing methods.
    • Training data used.
    • Relevant details of the run.
    • A link to a GitHub repository enabling easy reproducibility of the run.

Please ensure that the zip archive you submit strictly adheres to this file-naming structure. The team name must remain consistent across all submissions, and the system name should reflect the approach used by the submitted run. The run number serves as a progressive identifier for multiple submissions of the same system.

Submissions not adhering to these guidelines might be rejected. Please ensure accuracy and completeness of all submitted files to streamline evaluation.

An example of a valid submission can be downloaded from HERE. Please note that the predicted entities and relations included in this example are dummy data, provided solely to illustrate the correct structure and formatting of a valid submission folder.

To ensure your submission follows the required structure, you can use the validation script available HERE.
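The folder and archive naming rules above can also be assembled programmatically. A minimal sketch with hypothetical identifiers (teamX, run1, bertBase) and a placeholder predictions dict; it is not a substitute for the downloadable example or the validation script:

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Hypothetical identifiers; replace with the team ID chosen at registration.
team, task, run, desc = "teamX", "T611", "run1", "bertBase"
folder = f"{team}_{task}_{run}_{desc}"
predictions = {"34870091": {"entities": []}}  # your system's output goes here

with tempfile.TemporaryDirectory() as tmp:
    # One folder per run, holding the .json predictions and the .meta file.
    run_dir = Path(tmp) / folder
    run_dir.mkdir()
    (run_dir / f"{folder}.json").write_text(json.dumps(predictions, indent=2))
    (run_dir / f"{folder}.meta").write_text(
        f"Team ID: {team}\nTask ID: {task}\nRun ID: {run}\n"
    )
    # All runs go into a single archive named teamID_GutBrainIE_2026.zip.
    archive = Path(tmp) / f"{team}_GutBrainIE_2026.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for f in sorted(run_dir.iterdir()):
            zf.write(f, arcname=f"{folder}/{f.name}")
    names = sorted(zipfile.ZipFile(archive).namelist())

print(names)
```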

Example of Submission Format

Submissions should follow the same JSON format used in the provided datasets and must include only the field associated with the specific subtask for which the submission is made:

  • Subtask 6.1.1 (NER): Include the entities field WITHOUT the uri subfield.
  • Subtask 6.1.2 (NERD): Include the entities field WITH the uri subfield.
  • Subtask 6.2.1 (Mention-level RE): Include the mention_level_relations field.
  • Subtask 6.2.2 (Concept-level RE): Include the concept_level_relations field.

Each entry must correspond to the PubMed ID of the article from the test set being considered.

Find below samples of valid entries for each submission type:

Subtask 6.1.1 (NER) Submission Sample

					
{
	"34870091": {
		"entities": [
			{
				"start_idx": 75,
				"end_idx": 82,
				"location": "title",
				"text_span": "patients",
				"label": "human"
			},
			{
				"start_idx": 250,
				"end_idx": 270,
				"location": "abstract",
				"text_span": "intestinal microbiome",
				"label": "microbiome"
			}
		]
	}
}
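Since each subtask admits all and only its own fields, a quick informal check that an entity entry carries exactly the NER fields (in particular, no uri) can complement the official validation script. The helper below is illustrative only:

```python
# The five fields required for a Subtask 6.1.1 (NER) entity entry.
REQUIRED_NER_FIELDS = {"start_idx", "end_idx", "location", "text_span", "label"}

def check_ner_entity(entity: dict) -> bool:
    """True iff the entity carries exactly the NER fields (no extras)."""
    return set(entity) == REQUIRED_NER_FIELDS

entity = {"start_idx": 75, "end_idx": 82, "location": "title",
          "text_span": "patients", "label": "human"}
print(check_ner_entity(entity))                                     # True
print(check_ner_entity({**entity, "uri": "http://example.org/x"}))  # False
```

The uri field would make the same entry valid for Subtask 6.1.2 (NERD) instead.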
				
				

Subtask 6.1.2 (NERD) Submission Sample

					
{
	"34870091": {
		"entities": [
			{
				"start_idx": 75,
				"end_idx": 82,
				"location": "title",
				"text_span": "patients",
				"label": "human",
				"uri": "http://id.nlm.nih.gov/mesh/D010361"
			},
			{
				"start_idx": 250,
				"end_idx": 270,
				"location": "abstract",
				"text_span": "intestinal microbiome",
				"label": "microbiome",
				"uri": "http://purl.obolibrary.org/obo/NCIT_C93019"
			}
		]
	}
}
				
				

Subtask 6.2.1 (Mention-Level RE) Submission Sample

				
{
	"34870091": {
		"mention_level_relations": [
			{
				"subject_text_span": "intestinal microbiome",
				"subject_label": "microbiome",
				"predicate": "located in",
				"object_text_span": "patients",
				"object_label": "human"
			}
		]
	}
}
				
				

Subtask 6.2.2 (Concept-Level RE) Submission Sample

				
{
	"34870091": {
		"concept_level_relations": [
			{
				"subject_uri": "http://purl.obolibrary.org/obo/NCIT_C93019",
				"subject_label": "microbiome",
				"predicate": "located in",
				"object_uri": "http://id.nlm.nih.gov/mesh/D010361",
				"object_label": "human"
			}
		]
	}
}
				
				

Validation Script

A validation script that can be used to verify the correctness of the submission format can be downloaded from HERE.

Participant Papers

Participants are expected to write a report describing their participation in GutBrainIE CLEF 2026, their proposed solution, what features have been used for prediction, the analysis of the experimental results and insights derived from them.

Participant reports will be peer-reviewed and accepted reports will be published in the CLEF 2026 Working Notes at CEUR-WS, indexed in DBLP and Scopus.

Participants are strongly encouraged to use LaTeX, for which a template is available for download HERE. The template follows the CEURART single-column style and not only defines the layout of the report but also provides a suggested structure and outlines the expected content for each section.

It is also possible to use ODT (LibreOffice) or DOCX (Microsoft Word) formats. In these cases, participants should download the official CEURART templates directly from the CEUR website HERE and ensure that the content and structure match those outlined in the provided LaTeX template.

Participant papers are expected to be 10-20 pages in length, excluding references.

The schedule for submission, revision, and camera ready of the participant report is detailed above in the Important Dates section.

Participant papers can be submitted via Easychair HERE.

Baselines & Evaluation

Official Baselines

The official baselines for the GutBrainIE CLEF 2026 challenge can be found and reproduced by accessing this GitHub repository.

The repository provides a complete implementation of the baseline pipeline, allowing participants to reproduce the results from scratch, test the pre-trained models, or evaluate the provided predictions (i.e., runs) of the baselines for all subtasks (NER, NERD, and RE). These runs can also be used by participants as references of valid submission formats.

The repository also contains the official evaluation script, which will be used to assess submitted runs after the system submission deadline. It is crucial that participants ensure their submitted runs are compatible with this script, as submissions that cannot be evaluated with it will be considered invalid. The evaluation script can also be used to obtain preliminary performance results on the development set.

The detailed results of the baselines on the development set can be found in the Results section.

Evaluation Metrics

Submitted runs will be evaluated using standard Information Extraction metrics to assess both per-label and overall system performance. The same metrics are employed for all subtasks (a detailed explanation can be found HERE):

In the equations below:

  • \(TP\) (True Positives) refers to the number of correctly predicted entities or relations.
  • \(FP\) (False Positives) refers to the number of wrongly predicted entities or relations.
  • \(FN\) (False Negatives) refers to the number of entities or relations in the ground truth that were not predicted.
  • \(\mathcal{L}\) is the set of labels referring to:
    • For subtasks 6.1.1 and 6.1.2: entity labels.
    • For subtasks 6.2.1 and 6.2.2: triples of (subject label, predicate, object label).
  • Macro-average Precision: \( P_{\text{macro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \, \frac{TP_l}{TP_l \,+\, FP_l}}{|\mathcal{L}|} \)
  • Macro-average Recall: \( R_{\text{macro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \, \frac{TP_l}{TP_l \,+\, FN_l}}{|\mathcal{L}|} \)
  • Macro-average F1-score: \( F1_{\text{macro}_{\text{avg}}} = 2 \, \times \, \frac{P_{\text{macro}_{\text{avg}}} \; \times \; R_{\text{macro}_{\text{avg}}}}{P_{\text{macro}_{\text{avg}}} \; + \; R_{\text{macro}_{\text{avg}}}} \)
  • Micro-average Precision: \( P_{\text{micro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \; TP_l}{\sum_{l \in \mathcal{L}} \left( TP_l \; + \; FP_l \right)} \)
  • Micro-average Recall: \( R_{\text{micro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \; TP_l}{\sum_{l \in \mathcal{L}} \left( TP_l \; + \; FN_l \right)} \)
  • Micro-average F1-score: \( F1_{\text{micro}_{\text{avg}}} = 2 \, \times \, \frac{P_{\text{micro}_{\text{avg}}} \; \times \; R_{\text{micro}_{\text{avg}}}}{P_{\text{micro}_{\text{avg}}} \; + \; R_{\text{micro}_{\text{avg}}}} \)

For each subtask, the reference metric for the final leaderboard will be the micro-average F1-score, as it better accounts for class imbalances. However, system rankings will also be provided for each of the metrics detailed above.
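For preliminary self-evaluation on the development set, the formulas above can be implemented directly from per-label TP/FP/FN counts. A sketch (returning 0.0 on zero denominators is one common convention; the official evaluation script may handle them differently):

```python
def micro_macro(per_label):
    """Macro- and micro-averaged precision/recall/F1 from per-label
    (TP, FP, FN) counts, following the formulas above."""
    def safe(num, den):
        return num / den if den else 0.0

    counts = list(per_label.values())
    # Macro: average the per-label precision/recall, then combine into F1.
    macro_p = sum(safe(tp, tp + fp) for tp, fp, _ in counts) / len(counts)
    macro_r = sum(safe(tp, tp + fn) for tp, _, fn in counts) / len(counts)
    macro_f1 = safe(2 * macro_p * macro_r, macro_p + macro_r)

    # Micro: pool the counts across labels first.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro_p, micro_r = safe(tp, tp + fp), safe(tp, tp + fn)
    micro_f1 = safe(2 * micro_p * micro_r, micro_p + micro_r)
    return (macro_p, macro_r, macro_f1), (micro_p, micro_r, micro_f1)

# Two hypothetical labels with (TP, FP, FN) counts.
macro, micro = micro_macro({"bacteria": (1, 1, 0), "human": (1, 0, 1)})
print(macro, micro)
```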

Baseline Results

The scores reported in the table below refer to the development set and reflect the inference capabilities of the proposed baseline models across the four subtasks.

Sub-task Macro-P Macro-R Macro-F1 Micro-P Micro-R Micro-F1
NER 0.7114 0.7480 0.7267 0.7782 0.8221 0.7996
NERD 0.3820 0.4045 0.3916 0.4281 0.4522 0.4398
Mention-level RE 0.3660 0.2862 0.3003 0.4462 0.3453 0.3893
Concept-level RE 0.1009 0.1021 0.0966 0.1409 0.1292 0.1348

FAQs

Find answers to common questions about the GutBrainIE challenge, dataset, and submissions.

Coming soon...

If you need any additional information, please get in touch with us by writing to the organizers listed below.

Organizers

(For further details about the organizers, visit the official website of the IIIA Hub Research Group.)

Organizer Support

(Supporters of the annotation effort)
Guglielmo Faggioli
University of Padua
Italy
Benedikt Kantz
Graz University of Technology
Austria
Simone Merlo
University of Padua
Italy
Riccardo Michelotto
University of Padua
Italy
Gaia Tussardi
University of Padua
Italy
Peter Waldert
Graz University of Technology
Austria