Participation Guidelines

GutBrainIE CLEF 2025

Task #6 of the BioASQ Lab 2025

(A HEREDITARY challenge)


Tasks

GutBrainIE CLEF 2025 is Task #6 of the BioASQ CLEF Lab 2025 and proposes a Natural Language Processing (NLP) challenge on biomedical texts within the context of the EU-supported project HEREDITARY.

Specifically, it is focused on extracting structured information from biomedical abstracts related to the gut microbiota and its connections with Parkinson's disease and mental health, aiming to foster the development of Information Extraction (IE) systems that can support experts in understanding the gut-brain interplay.

The GutBrainIE task is divided into two main subtasks: in the first, participants are asked to identify and classify specific text spans into predefined categories, while in the second they have to determine whether given relationships hold between entities.

The training data for all the subtasks are available upon registration for the challenge and the test data will be available around two weeks before the run submission deadline. The training set is divided into four parts:

  • Gold Collection: A highly curated dataset manually annotated by a team of 7 expert annotators from the University of Padua, Italy;
  • Platinum Collection: A subset of the gold annotations further validated by biomedical experts from the Radboud University Medical Center, Netherlands;
  • Silver Collection: A weakly curated dataset manually annotated by a team of about 40 Linguistics and Terminology students, trained and supervised by the experts;
  • Bronze Collection: A distantly supervised dataset comprising automatically generated annotations. Note that no manual revision has been performed on this set.

The test set is a held-out selection of documents from the gold and platinum collections, consisting exclusively of the title and abstract of each document, and is selected to ensure representativeness and full coverage of all entity and relation types.

Please see the Datasets and Important Dates sections for more information.

Subtask 6.1 - Named Entity Recognition

Participants are provided with PubMed abstracts discussing the gut-brain interplay and are asked to classify specific text spans (entity mentions) into one of 13 predefined categories, such as bacteria, chemical, or microbiome.

Entity mentions are expressed as tuples (entityLabel ; entityLocation (title/abstract) ; startOffset ; endOffset).
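
For illustration, the minimal sketch below (with an invented title) shows how such a tuple maps back to the text through character offsets. The end offset is treated as inclusive, mirroring the submission samples in the Submitting section; double-check this convention against the Annotation Guidelines.

# Minimal illustration with an invented title; offsets are character positions.
# The end offset is assumed to be inclusive, as in the submission samples below.
title = "Gut microbiome alterations in Parkinson's disease patients"
label, location, start, end = "human", "title", 50, 57   # (entityLabel ; entityLocation ; startOffset ; endOffset)
assert title[start:end + 1] == "patients"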

Subtask 6.2.1 - Binary Relation Extraction

Participants are provided with PubMed abstracts discussing the gut-brain interplay and are asked to identify which entities are in relation within a document. No relation type needs to be predicted within this subtask.

Relations are expressed as pairs (entityLabel1 ; entityLabel2).

Subtask 6.2.2 - Ternary Tag-Based Relation Extraction

Participants are asked to identify which entities are in relation within a document and predict the type of relation between them.

Relations are expressed as triples (entityLabel1 ; relationLabel ; entityLabel2).

Subtask 6.2.3 - Ternary Mention-Based Relation Extraction

Participants are required to identify the actual entities involved in a relation and predict the type of relation.

Relations are expressed as triples (entityMention1 ; relationLabel ; entityMention2).


For each task, the test set consists of a collection of documents, each including only the PubMed ID, title, and abstract. Given a test document, participants are required to extract tuples that include all the fields for the respective task. For example, in the NER task, tuples must contain all and only the following fields: (entityLabel ; entityLocation (title/abstract) ; startOffset ; endOffset). For more details, please refer to the submission format examples provided in the Submitting section.

Participating

To participate in GutBrainIE CLEF 2025, groups need to register for the BioASQ Lab at the following link:

Register

Access to the data will be given within 24 hours of registration.

Important Dates

  • Registration closes: April 25, 2025
  • Test data release: April 28, 2025
  • Runs submission deadline: May 10, 2025
  • Evaluation results out: May 19, 2025
  • Participant and position paper submission deadline: May 30, 2025
  • Notification of acceptance for participant and position papers: June 27, 2025
  • Camera-ready participant papers submission: July 7, 2025
  • GutBrainIE CLEF Workshop: September 9-12, 2025 during the CLEF Conference in Madrid, Spain

Datasets

Data Description

The data for this task is a set of titles and abstracts of biomedical articles retrieved from PubMed, focusing on the gut-brain interplay and its implications for neurological and mental health.

The datasets are organized into:
  • Platinum-Standard Annotations: Highest-quality annotations, expert-curated and reviewed by external biomedical specialists.
  • Gold-Standard Annotations: High-quality annotations, expert-curated.
  • Silver-Standard Annotations: Mid-quality annotations, created by trained students under expert supervision. The students are organized into two clusters:
    • StudentA, including those with more consistent annotation performance,
    • StudentB, including those with less consistent annotation performance.
  • Bronze-Standard Annotations: Automatically generated annotations, using fine-tuned GLiNER for Named Entity Recognition (NER) and fine-tuned ATLOP for Relation Extraction (RE).
To foster the development of effective IE systems for the subtasks proposed within GutBrainIE, the provided datasets include:
  • Entity Mentions: Text spans classified into predefined categories.
  • Relations: Associations between entities, specifying that a particular relationship holds between two entities.

Collections

  • Training and Development Data: Available upon registration. After registering, participants will receive an email with a download link.
  • Test Data: The test set will be available from April 28, 2025. This data will be used for final system evaluation.

Dataset Format

Annotations are provided in JSON format, for ease of use with NLP systems. Each entry in the dataset corresponds to a PubMed article, identified by its PubMed ID (PMID), and includes the following fields (a minimal loading sketch is shown after this list):
  • Metadata: Article-related information, including title, author, journal, year, and abstract. It also includes the identifier of the annotator: expert annotators are labeled as expert_1 to expert_7, student annotators are grouped into two clusters, identified as student_A and student_B, and automatically generated annotations are labeled as distant. Participants may use this information to assign different weights to annotations based on who performed them.
  • Entities: An array of objects where each object represents an annotated entity mention in the article, with the following attributes:
    • start and end indices: Character offsets marking the span of the entity mention.
    • location: Indicates if the entity mention is located in the title or in the abstract.
    • text_span: The actual text span of the entity mention.
    • label: The label assigned to the entity mention (e.g., bacteria, microbiome).
  • Relations: An array of objects where each object represents an annotated relationship between two entity mentions in the article, with the following attributes:
    • subject_start and subject_end indices: Character offsets marking the span of the subject entity mention.
    • subject_location: Indicates if the subject entity mention is located in the title or in the abstract.
    • subject_text_span: The actual text span of the subject entity mention.
    • subject_label: The label assigned to the subject entity mention (e.g., bacteria, microbiome).
    • predicate: The label assigned to the relationship.
    • object_start and object_end indices: Character offsets marking the span of the object entity mention.
    • object_location: Indicates if the object entity mention is located in the title or in the abstract.
    • object_text_span: The actual text span of the object entity mention.
    • object_label: The label assigned to the object entity mention (e.g., bacteria, microbiome).
  • Binary Tag-based Relations: Relations extracted from the Relations array, formatted as tag-based pairs of subject_label and object_label.
  • Ternary Tag-based Relations: Relations extracted from the Relations array, formatted as tag-based triples of subject_label, predicate, and object_label.
  • Ternary Mention-based Relations: Relations extracted from the Relations array, formatted as mention-based tuples of subject_text_span, subject_label, predicate, object_text_span, and object_label.
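
Below is a minimal Python sketch that loads one training collection and iterates over its annotations. The file name and the exact key names (e.g., start_idx, end_idx, annotator) are assumptions based on the submission samples in the Submitting section; adapt them to the released files.

import json

# Minimal loading sketch (not an official script); file and key names are assumptions.
with open("train_gold.json", encoding="utf-8") as f:
    articles = json.load(f)   # one entry per PubMed article, keyed by PMID

for pmid, article in articles.items():
    meta = article["metadata"]           # title, journal, year, annotator, ...
    annotator = meta.get("annotator")    # e.g., expert_1..expert_7, student_A, student_B, distant

    for ent in article.get("entities", []):
        # Character offsets, location (title/abstract), surface form, and label.
        print(pmid, annotator, ent["label"], ent["location"],
              ent["start_idx"], ent["end_idx"], ent["text_span"])

    for rel in article.get("relations", []):
        # Subject mention, predicate, and object mention of an annotated relation.
        print(pmid, rel["subject_text_span"], rel["predicate"], rel["object_text_span"])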

Alternative Formats

For those more familiar with CSV or tabular formats, the dataset is also provided in these formats. In this case, each of the fields mentioned above is stored in a separate file:
  • Metadata file
  • Entities file
  • Relations file
  • Binary Tag-based file
  • Ternary Tag-based file
  • Ternary Mention-based file
The CSV files use the pipe symbol (|) as a separator, while tabular files use the tab character (\t) for separation.
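
For example, the pipe-separated files can be read with the standard csv module; the file name "entities.csv" and its columns are assumptions, so check the header of the released files.

import csv

# Minimal sketch for reading a pipe-separated file (assumed file name).
with open("entities.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="|"):
        print(row)   # one annotated entity mention per row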

Entity and Relation Labels

The set of entities considered for annotations includes:

Entity Label URI Definition
Anatomical Location NCIT_C13717 Named locations of or within the body. 
Animal NCIT_C14182 A non-human living organism that has membranous cell walls, requires oxygen and organic foods, and is capable of voluntary movement, as distinguished from a plant or mineral.
Biomedical Technique NCIT_C15188 Research concerned with the application of biological and physiological principles to clinical medicine.
Bacteria NCBITaxon_2 One of the three domains of life (the others being Eukarya and Archaea), also called Eubacteria. They are unicellular prokaryotic microorganisms which generally possess rigid cell walls, multiply by cell division, and exhibit three principal forms: round or coccal, rodlike or bacillary, and spiral or spirochetal.
Chemical CHEBI_59999 A chemical substance is a portion of matter of constant composition, composed of molecular entities of the same type or of different types. This category also includes metabolites, which in biochemistry are the intermediate or end product of metabolism, and neurotransmitters, which are endogenous compounds used to transmit information across the synapses.
Dietary Supplement MESH_68019587 Products in capsule, tablet or liquid form that provide dietary ingredients, and that are intended to be taken by mouth to increase the intake of nutrients. Dietary supplements can include macronutrients, such as proteins, carbohydrates, and fats; and/or micronutrients, such as vitamins; minerals; and phytochemicals.
Disease, Disorder, or Finding (DDF) NCIT_C7057 A condition that is relevant to human neoplasms and non-neoplastic disorders. This includes observations, test results, history and other concepts relevant to the characterization of human pathologic conditions.
Drug CHEBI_23888 Any substance which when absorbed into a living organism may modify one or more of its functions. The term is generally accepted for a substance taken for a therapeutic purpose, but is also commonly used for abused substances.
Food NCIT_C1949 A substance consumed by humans and animals for nutritional purposes.
Gene SNOMEDCT_67261001 A functional unit of heredity which occupies a specific position on a particular chromosome and serves as the template for a product that contributes to a phenotype or a biological function.
Human NCBITaxon_9606 Members of the species Homo sapiens.
Microbiome OHMI_0000003 This term refers to the entire habitat, including the microorganisms (bacteria, archaea, lower and higher eukaryotes, and viruses), their genomes (i.e., genes), and the surrounding environmental conditions.
Statistical Technique NCIT_C19044 A method of calculating, analyzing, or representing statistical data.

while the defined set of relations includes:

Head Entity | Tail Entity | Predicate
Anatomical Location | Human / Animal | Located in
Bacteria | Bacteria / Chemical / Drug | Interact
Bacteria | DDF | Influence
Bacteria | Gene | Change expression
Bacteria | Human / Animal | Located in
Bacteria | Microbiome | Part of
Chemical | Anatomical Location / Human / Animal | Located in
Chemical | Chemical | Interact / Part of
Chemical | Microbiome | Impact / Produced by
Chemical / Dietary Supplement / Drug / Food | Bacteria / Microbiome | Impact
Chemical / Dietary Supplement / Food | DDF | Influence
Chemical / Dietary Supplement / Drug / Food | Gene | Change expression
Chemical / Dietary Supplement / Drug / Food | Human / Animal | Administered
DDF | Anatomical Location | Strike
DDF | Bacteria / Microbiome | Change abundance
DDF | Chemical | Interact
DDF | DDF | Affect / Is a
DDF | Human / Animal | Target
Drug | Chemical / Drug | Interact
Drug | DDF | Change effect
Human / Animal / Microbiome | Biomedical Technique | Used by
Microbiome | Anatomical Location / Human / Animal | Located in
Microbiome | Gene | Change expression
Microbiome | DDF | Is linked to
Microbiome | Microbiome | Compared to

For more details on entity and relation labels, refer to the Annotation Guidelines available for download HERE.


Dataset Statistics

The table below provides an overview of the key statistics for each dataset collection:

Collection Num of Documents Total Entities Avg Entities per Doc Total Relations Avg Rels per Doc
Train Platinum 111 3'638 32.77 1'455 13.11
Train Gold 208 5'192 24.96 1'994 9.59
Train Silver 499 15'275 30.61 10'616 21.27
Train Bronze 749 21'357 28.51 8'165 11.90
Development Set 40 1'117 27.93 623 15.58

Train Platinum Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Platinum collection.

The Ratio column represents the proportion of each entity or relation label relative to the total number of entities or relations in the collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 1'232 0.339 11.10
Chemical 421 0.116 3.79
Bacteria 416 0.114 3.75
Human 401 0.110 3.61
Microbiome 379 0.104 3.41
Dietary Supplement 210 0.058 1.89
Biomedical Technique 150 0.041 1.35
Anatomical Location 149 0.041 1.34
Animal 101 0.028 0.91
Gene 64 0.018 0.58
Drug 50 0.014 0.45
Statistical Technique 45 0.012 0.41
Food 20 0.005 0.18

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 249 0.171 2.24
Target 216 0.148 1.95
Is Linked To 214 0.147 1.93
Located In 177 0.122 1.59
Affect 131 0.090 1.18
Impact 105 0.072 0.95
Used By 102 0.070 0.92
Change Abundance 69 0.047 0.62
Part Of 55 0.038 0.50
Change Expression 45 0.031 0.41
Interact 31 0.021 0.28
Strike 19 0.013 0.17
Administered 10 0.007 0.09
Change Effect 10 0.007 0.09
Is A 9 0.006 0.08
Produced By 7 0.005 0.06
Compared To 6 0.004 0.05

Train Gold Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Gold collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 2'027 0.390 9.75
Chemical 621 0.120 2.99
Microbiome 574 0.111 2.76
Human 474 0.091 2.28
Bacteria 281 0.054 1.35
Anatomical Location 262 0.050 1.26
Animal 226 0.044 1.09
Dietary Supplement 185 0.036 0.89
Biomedical Technique 183 0.035 0.88
Statistical Technique 119 0.023 0.57
Drug 118 0.023 0.57
Gene 64 0.012 0.31
Food 58 0.011 0.28

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Target 335 0.168 1.61
Is Linked To 317 0.159 1.52
Influence 303 0.152 1.46
Located In 299 0.150 1.44
Affect 276 0.138 1.33
Used By 96 0.048 0.46
Impact 92 0.046 0.44
Change Abundance 70 0.035 0.34
Interact 65 0.033 0.31
Is A 54 0.027 0.26
Administered 28 0.014 0.13
Part Of 17 0.009 0.08
Change Effect 17 0.009 0.08
Strike 14 0.007 0.07
Produced By 6 0.003 0.03
Change Expression 4 0.002 0.02
Compared To 1 0.001 0.00

Train Silver Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Silver collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 5'584 0.366 11.19
Chemical 1'871 0.122 3.75
Microbiome 1'599 0.105 3.20
Human 1'199 0.078 2.40
Bacteria 1'129 0.074 2.26
Anatomical Location 856 0.056 1.72
Dietary Supplement 660 0.043 1.32
Biomedical Technique 655 0.043 1.31
Drug 500 0.033 1.00
Animal 483 0.032 0.97
Gene 319 0.021 0.64
Statistical Technique 258 0.017 0.52
Food 162 0.011 0.32

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 2'536 0.239 5.08
Is Linked To 1'710 0.161 3.43
Target 1'540 0.145 3.09
Located In 1'049 0.099 2.10
Impact 835 0.079 1.67
Change Abundance 535 0.050 1.07
Change Effect 469 0.044 0.94
Used By 391 0.037 0.78
Affect 383 0.036 0.77
Part Of 355 0.033 0.71
Interact 288 0.027 0.58
Produced By 247 0.023 0.49
Strike 124 0.012 0.25
Change Expression 92 0.009 0.18
Administered 34 0.003 0.07
Is A 18 0.002 0.04
Compared To 10 0.001 0.02

Train Bronze Collection Statistics

The tables below provide detailed statistics for entities and relations in the Train Bronze collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 8'421 0.394 11.24
Chemical 2'719 0.127 3.63
Microbiome 2'299 0.108 3.07
Human 1'674 0.078 2.23
Bacteria 1'336 0.063 1.78
Anatomical Location 1'142 0.053 1.52
Animal 857 0.040 1.14
Dietary Supplement 787 0.037 1.05
Biomedical Technique 770 0.036 1.03
Drug 581 0.027 0.78
Statistical Technique 336 0.016 0.45
Gene 241 0.011 0.32
Food 194 0.009 0.26

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Influence 1'572 0.193 2.10
Target 1'357 0.166 1.81
Is Linked To 1'337 0.164 1.79
Affect 1'189 0.146 1.59
Located In 621 0.076 0.83
Is A 581 0.071 0.78
Impact 344 0.042 0.46
Used By 341 0.042 0.46
Change Effect 258 0.032 0.34
Change Abundance 229 0.028 0.31
Administered 132 0.016 0.18
Strike 109 0.013 0.15
Interact 61 0.007 0.08
Part Of 18 0.002 0.02
Produced By 12 0.001 0.02
Change Expression 3 0.000 0.00
Compared To 1 0.000 0.00

Dev Collection Statistics

The tables below provide detailed statistics for entities and relations in the Dev collection.

Entity Statistics

Entity Label Count Ratio Avg Per Doc
DDF 379 0.339 9.47
Chemical 131 0.117 3.27
Microbiome 127 0.114 3.17
Human 86 0.077 2.15
Anatomical Location 76 0.068 1.90
Animal 73 0.065 1.82
Drug 60 0.054 1.50
Bacteria 54 0.048 1.35
Gene 39 0.035 0.97
Biomedical Technique 36 0.032 0.90
Dietary Supplement 27 0.024 0.68
Food 26 0.023 0.65
Statistical Technique 3 0.003 0.07

Relation Statistics

Relation Label Count Ratio Avg Per Doc
Located In 99 0.159 2.48
Affect 89 0.143 2.23
Target 85 0.136 2.12
Is Linked To 74 0.119 1.85
Influence 68 0.109 1.70
Impact 54 0.087 1.35
Change Effect 32 0.051 0.80
Is A 31 0.050 0.78
Used By 17 0.027 0.42
Administered 16 0.026 0.40
Part Of 13 0.021 0.33
Strike 11 0.018 0.28
Change Abundance 11 0.018 0.28
Interact 9 0.014 0.23
Produced By 6 0.010 0.15
Change Expression 4 0.006 0.10
Compared To 4 0.006 0.10

Submitting

Participating teams should satisfy the following guidelines:

  • The runs should be submitted in the format described below;
  • Each group can submit a maximum of 25 runs for each subtask.

Submission Guidelines

Participants are invited to submit results for any or all of the four subtasks (NER, Binary Tag-Based RE, Ternary Tag-Based RE, Ternary Mention-Based RE) independently.

Submissions will be handled via the BioASQ submission system. More detailed information coming soon...

In the following, a "run" refers to the predictions made by a single system on the test set. Each run must be submitted as a single zipped file named "teamID_taskID_runID_systemDesc.zip", without spaces or special characters, where:

  • teamID is the identifier of your team, chosen when you registered for CLEF 2025;
  • taskID is the identifier of the chosen task, i.e., one of the following tokens: T61 for Subtask 6.1, T621 for Subtask 6.2.1, T622 for Subtask 6.2.2, and T623 for Subtask 6.2.3;
  • runID is the identifier of the run, chosen by the participants and containing only letters and numbers (a-z, A-Z, 0-9);
  • systemDesc is an optional (short) string that further describes your submission.
The content of the zipped file consists of two files:
  • teamID_taskID_runID_systemDesc.json: A JSON file with predictions on the test set.
  • teamID_taskID_runID_systemDesc.meta: A metadata file briefly describing the approach, including:
    • Team ID.
    • Task ID.
    • Run ID.
    • Type of training applied.
    • Pre-processing methods.
    • Training data used.
    • Relevant details of the run.
    • A link to a GitHub repository enabling easy reproducibility of the run.

Each zipped file should carefully follow this filename structure. The team name must remain consistent across all submissions, and the system name should reflect the approach used by the submitted run. The run number serves as a progressive identifier for multiple submissions of the same system. An example of a valid run submission will be available soon...

Submissions not adhering to these guidelines might be rejected. Please ensure accuracy and completeness of all submitted files to streamline evaluation.
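
The following minimal sketch shows one way to package a run according to the naming convention above; all identifiers, file contents, and the predictions dictionary are placeholder values, not prescribed ones.

import json
import zipfile

# Placeholder identifiers; replace with your registered team ID, task token, and run ID.
team_id, task_id, run_id, system_desc = "myTeam", "T61", "run1", "bertNER"
base = f"{team_id}_{task_id}_{run_id}_{system_desc}"

# Predictions keyed by PubMed ID, following the Subtask 6.1 submission format.
predictions = {
    "34870091": {
        "entities": [
            {"start_idx": 75, "end_idx": 82, "location": "title",
             "text_span": "patients", "label": "human"}
        ]
    }
}

with open(f"{base}.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, indent=2)

# Free-text metadata describing the run (team/task/run IDs, training data, repository link, ...).
with open(f"{base}.meta", "w", encoding="utf-8") as f:
    f.write("Team ID: myTeam\nTask ID: T61\nRun ID: run1\n"
            "Training data: Platinum + Gold\nDetails: <short description>\n"
            "Repository: <link to GitHub repository>\n")

# Zip both files using the required naming convention.
with zipfile.ZipFile(f"{base}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(f"{base}.json")
    zf.write(f"{base}.meta")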

Example of Submission Format

Submissions should follow the same JSON format used in the provided datasets and must include only the field associated with the specific subtask for which the submission is made:

  • Subtask 6.1 (NER): Include the entities field.
  • Subtask 6.2.1 (Binary Tag-Based RE): Include the binary_tag_based_relations field.
  • Subtask 6.2.2 (Ternary Tag-Based RE): Include the ternary_tag_based_relations field.
  • Subtask 6.2.3 (Ternary Mention-Based RE): Include the ternary_mention_based_relations field.

Each entry must correspond to the PubMed ID of the article from the test set being considered.

Find below samples of valid entries for each submission type:

Subtask 6.1 (NER) Submission Sample

					
{
	"34870091": {
		"entities": [
			{
				"start_idx": 75,
				"end_idx": 82,
				"location": "title",
				"text_span": "patients",
				"label": "human"
			},
			{
				"start_idx": 250,
				"end_idx": 270,
				"location": "abstract",
				"text_span": "intestinal microbiome",
				"label": "microbiome"
			}
		]
	}
}
				
				

Subtask 6.2.1 (Binary Tag-Based RE) Submission Sample

				
{
	"34870091": {
		"binary_tag_based_relations": [
			{
				"subject_label": "microbiome",
				"object_label": "human"
			}
		]
	}
}
				
				

Subtask 6.2.2 (Ternary Tag-Based RE) Submission Sample

				
{
	"34870091": {
		"ternary_tag_based_relations": [
			{
				"subject_label": "microbiome",
				"predicate": "located in",
				"object_label": "human"
			}
		]
	}
}
				
				

Subtask 6.2.3 (Ternary Mention-Based RE) Submission Sample

				
{
	"34870091": {
		"ternary_mention_based_relations": [
			{
				"subject_text_span": "intestinal microbiome",
				"subject_label": "microbiome",
				"predicate": "located in",
				"object_text_span": "patients",
				"object_label": "human"
			}
		]
	}
}
				
				

Validation Script

A validation script that can be used to verify the correctness of the submission format can be downloaded from HERE.
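
Before running the official script, a quick informal check of a Subtask 6.1 run can catch obvious problems; the file name and required keys below are assumptions based on the samples above, and this sketch does not replace the official validation script.

import json

# Informal sanity check for a Subtask 6.1 run (assumed file name and keys).
REQUIRED = {"start_idx", "end_idx", "location", "text_span", "label"}

with open("myTeam_T61_run1_bertNER.json", encoding="utf-8") as f:
    run = json.load(f)

for pmid, doc in run.items():
    assert pmid.isdigit(), f"{pmid} is not a valid PubMed ID"
    for ent in doc.get("entities", []):
        missing = REQUIRED.difference(ent)
        assert not missing, f"{pmid}: entity missing fields {missing}"
        assert ent["location"] in {"title", "abstract"}, f"{pmid}: invalid location"
        assert ent["start_idx"] <= ent["end_idx"], f"{pmid}: inverted offsets"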

Participant Papers

Participants are expected to write a report describing their participation in GutBrainIE CLEF 2025, including the proposed solution, the features used for prediction, an analysis of the experimental results, and the insights derived from them.

Participant reports will be peer-reviewed and accepted reports will be published in the CLEF 2025 Working Notes at CEUR-WS, indexed in DBLP and Scopus.

Participants are strongly encouraged to use LaTeX, for which a template is available for download HERE. The template follows the CEURART single-column style and not only defines the layout of the report but also provides a suggested structure and outlines the expected content for each section.

It is also possible to use ODT (LibreOffice) or DOCX (Microsoft Word) formats. In these cases, participants should download the official CEURART templates directly from the CEUR website HERE and ensure that the content and structure match those outlined in the provided LaTeX template.

Participant papers are expected to be 10-20 pages in length, excluding references.

The schedule for submission, revision, and camera ready of the participant report is detailed above in the Important Dates section.

Participant papers can be submitted via Easychair HERE.


Baselines & Evaluation

The official baselines for the GutBrainIE CLEF 2025 challenge can be found and reproduced by accessing this GitHub repository.

The repository provides a complete implementation of the baseline pipeline, allowing participants to reproduce the results from scratch, test the pre-trained models, or evaluate the provided predictions (i.e., runs) of the baselines for both NER and RE. These runs can also be used by participants as references for the valid submission format.

The repository also contains the official evaluation script, which will be used to assess submitted runs after the system submission deadline. It is crucial that participants ensure their submitted runs are compatible with this script, as submissions that cannot be evaluated with it will be considered invalid. The evaluation script can also be used to obtain preliminary performance results on the development set.

The detailed results of the baselines on the development set can be found in the Results section.

Evaluation Metrics

Submitted runs will be evaluated using standard Information Extraction metrics to assess both per-label and overall system performance. The same metrics are employed for all the subtasks (a detailed explanation of these can be found HERE):

In the equations below:

  • \(TP\) (True Positives) refers to the number of correctly predicted entities or relations.
  • \(FP\) (False Positives) refers to the number of wrongly predicted entities or relations.
  • \(FN\) (False Negatives) refers to the number of entities or relations in the ground truth that were not predicted.
  • \(\mathcal{L}\) is the set of labels referring to:
    • For subtask 6.1: entity labels.
    • For subtask 6.2.1: pairs of (subject label, object label).
    • For subtasks 6.2.2 and 6.2.3: triples of (subject label, predicate, object label).
  • Macro-average Precision: \( P_{\text{macro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \, \frac{TP_l}{TP_l \,+\, FP_l}}{|\mathcal{L}|} \)
  • Macro-average Recall: \( R_{\text{macro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} \, \frac{TP_l}{TP_l \,+\, FN_l}}{|\mathcal{L}|} \)
  • Macro-average F1-score: \( F1_{\text{macro}_{\text{avg}}} = 2 \, \times \, \frac{P_{\text{macro}_{\text{avg}}} \; \times \; R_{\text{macro}_{\text{avg}}}}{P_{\text{macro}_{\text{avg}}} \; + \; R_{\text{macro}_{\text{avg}}}} \)
  • Micro-average Precision: \( P_{\text{micro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} \left( TP_l \,+\, FP_l \right)} \)
  • Micro-average Recall: \( R_{\text{micro}_{\text{avg}}} = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} \left( TP_l \,+\, FN_l \right)} \)
  • Micro-average F1-score: \( F1_{\text{micro}_{\text{avg}}} = 2 \, \times \, \frac{P_{\text{micro}_{\text{avg}}} \; \times \; R_{\text{micro}_{\text{avg}}}}{P_{\text{micro}_{\text{avg}}} \; + \; R_{\text{micro}_{\text{avg}}}} \)

For each subtask, the reference metric for the final leaderboard will be the micro-average F1-score, as it better accounts for class imbalances. However, system rankings will also be provided for each of the metrics detailed above.
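
For reference, the sketch below computes these metrics exactly as defined above from per-document sets of gold and predicted items; it is an illustrative sketch, not the official evaluation script, and the input representation is an assumption.

from collections import defaultdict

def evaluate(gold, pred):
    """Compute the macro- and micro-averaged metrics defined above.

    `gold` and `pred` map each document ID to a set of (label, details) pairs,
    where `label` is the element of L the metrics are aggregated over (entity
    label, label pair, or label triple) and `details` carries any remaining
    information (e.g., offsets for NER, or None for tag-based RE).
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for doc_id in gold.keys() | pred.keys():
        g, p = gold.get(doc_id, set()), pred.get(doc_id, set())
        for label, _ in p & g:
            tp[label] += 1   # correctly predicted items
        for label, _ in p - g:
            fp[label] += 1   # predicted items not in the ground truth
        for label, _ in g - p:
            fn[label] += 1   # ground-truth items that were not predicted

    labels = set(tp) | set(fp) | set(fn)

    def prf(t, false_pos, false_neg):
        p = t / (t + false_pos) if t + false_pos else 0.0
        r = t / (t + false_neg) if t + false_neg else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Macro-average: mean of per-label precision/recall, then their harmonic mean.
    n = max(len(labels), 1)
    macro_p = sum(prf(tp[l], fp[l], fn[l])[0] for l in labels) / n
    macro_r = sum(prf(tp[l], fp[l], fn[l])[1] for l in labels) / n
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0

    # Micro-average: pool TP/FP/FN over all labels, then compute precision/recall/F1.
    micro_p, micro_r, micro_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))

    return {"macro_P": macro_p, "macro_R": macro_r, "macro_F1": macro_f1,
            "micro_P": micro_p, "micro_R": micro_r, "micro_F1": micro_f1}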

Baseline Results

The scores reported on the development set represent the actual evaluation of the inference capabilities of the proposed baseline models. Scores on the training collections are provided to give a sense of how well the models fit the data they were trained on; they should therefore not be interpreted as evaluation results, but rather as indicators of performance on seen data.

Subtask 6.1 (NER)

Dataset Macro-P Macro-R Macro-F1 Micro-P Micro-R Micro-F1
Dev 0.6627 0.7473 0.6917 0.7561 0.8272 0.7901
Train Platinum 0.7784 0.8967 0.8228 0.8516 0.8994 0.8749
Train Gold 0.8047 0.9226 0.8567 0.8653 0.9368 0.8997
Train Silver 0.7891 0.8347 0.8095 0.8402 0.8729 0.8562
Train Bronze 0.7618 0.8586 0.8023 0.8478 0.8954 0.8710

Subtask 6.2.1 (Binary Tag-Based RE)

Dataset Macro-P Macro-R Macro-F1 Micro-P Micro-R Micro-F1
Dev 0.5181 0.4330 0.4404 0.6585 0.4909 0.5625
Train Platinum 0.8542 0.9298 0.8672 0.8580 0.9430 0.8985
Train Gold 0.8195 0.9559 0.8667 0.7806 0.9680 0.8643
Train Silver 0.8509 0.8462 0.8414 0.8619 0.9027 0.8818
Train Bronze 0.2453 0.2454 0.2257 0.6778 0.6190 0.6471

Subtask 6.2.2 (Ternary Tag-Based RE)

Dataset Macro-P Macro-R Macro-F1 Micro-P Micro-R Micro-F1
Dev 0.4875 0.4231 0.4270 0.6585 0.4696 0.5482
Train Platinum 0.8533 0.9299 0.8677 0.8590 0.9435 0.8993
Train Gold 0.8228 0.9566 0.8696 0.7834 0.9670 0.8656
Train Silver 0.8547 0.8459 0.8431 0.8627 0.9022 0.8820
Train Bronze 0.1126 0.1182 0.1033 0.6618 0.5483 0.5998

Subtask 6.2.3 (Ternary Mention-Based RE)

Dataset Macro-P Macro-R Macro-F1 Micro-P Micro-R Micro-F1
Dev 0.2746 0.1906 0.2111 0.4574 0.2875 0.3531
Train Platinum 0.7741 0.8352 0.7856 0.7800 0.8681 0.8217
Train Gold 0.7586 0.8574 0.7883 0.7440 0.8927 0.8116
Train Silver 0.7997 0.7778 0.7828 0.7939 0.8071 0.8005
Train Bronze 0.0800 0.0867 0.0736 0.5084 0.4130 0.4557

Results

Leaderboards

Available after the systems submission deadline...

Details

Available after the systems submission deadline...

FAQs

Find answers to common questions about the GutBrainIE challenge, dataset, and submissions.

Coming soon...

If you need any additional information, please get in touch with us by writing to:

Organizers

(For further details about the organizers, visit the official website of the IIIA Hub Research Group.)