For example, "data known to influence health" encompasses environmental data, which is already findable and accessible within existing geospatial infrastructures, with datasets typically described in compliance with INSPIRE and mapped to GeoDCAT-AP. The EHDS Regulation will therefore apply to a wide array of datasets that are already catalogued in European data portals. For instance, data.europa.eu currently hosts around 27,000 records labelled as health-related datasets, which may need to be aligned with the EHDS-specific requirements. It is important to keep in mind that HealthDCAT-AP is specifically tailored to address the diverse nature of data covered by the European Health Data Space. As such, it covers various dataset types as well as various data access rights within a single data space. |
EHDS Regulation Recital 56) |
The categories of electronic health data that can be processed for secondary use should be broad and flexible enough to accommodate the evolving needs of health data users, while remaining limited to data related to health or known to influence health. They can also include relevant data from the health system, for example electronic health records, claims data, dispensation data, data from disease registries or genomic data, as well as data with an impact on health, for example data on consumption of different substances, socio-economic status or behaviour, and data on environmental factors such as pollution, radiation or the use of certain chemical substances. The categories of electronic health data for secondary use include some categories of data that were initially collected for other purposes such as research, statistics, patient safety, regulatory activities or policy making, for example, policy-making registries or registries concerning the side effects of medicinal products or medical devices. European databases that facilitate use or reuse of data are available in some areas, such as cancer (the European Cancer Information System) or rare diseases (for example, the European Platform on Rare Disease Registration and European reference networks (ERN) registries). The categories of electronic health data that can be processed for secondary use should also include automatically generated data from medical devices and person-generated data, such as data from wellness applications. Data on clinical trials and clinical investigations should also be included in the categories of electronic health data for secondary use when the clinical trial or clinical investigation has ended, without affecting any voluntary data sharing by the sponsors of ongoing trials and investigations. Electronic health data for secondary use should be made available preferably in a structured electronic format that facilitates their processing by computer systems. 
Examples of structured electronic formats include records in a relational database, XML documents or CSV files and free text, audios, videos and images provided as computer-readable files. |
The High-Value Datasets Implementing Regulation (HVD IR) serves as a key example of how new regulatory frameworks can impose additional requirements on existing DCAT-AP records. The HVD IR introduces a set of rules specifically applicable to datasets classified as high-value, which are organised into six thematic categories: Geospatial, Earth Observation and Environment, Meteorological, Statistics, Companies and Company Ownership, and Mobility. Publishers whose datasets fall within the scope of the HVD IR are obliged to enhance their dataset descriptions to meet these new standards. As the DCAT-AP HVD specification states, "Any improvement of the metadata will immediately flow throughout the European network of (Open) Data Portals and thus increase the level of metadata quality." In other words, enhancements made at the source propagate across the entire network of data portals and raise the overall standard of metadata. The improvements in metadata quality mandated by the HVD IR for existing DCAT-AP records are analogous to the enhancements required by HealthDCAT-AP for open health-related data. Both frameworks underscore the importance of high-quality, interoperable metadata and require at least one dataset Distribution. |
EHDS Regulation Article 60) |
5. Health data holders of non-personal electronic health data shall provide access to data through trusted open databases to ensure unrestricted access for all users and data storage and preservation. Trusted open public databases shall have in place robust, transparent and sustainable governance and a transparent model of user access. |
EHDS Regulation Art. 77) Dataset description and dataset catalogue |
4. ... the Commission shall, by means of implementing acts, set out the minimum elements health data holders are to provide for datasets and the characteristics of those elements. Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 98(2). |
Regarding the scope of HealthDCAT-AP, we recommend the following approach: for EHDS Art.55, HealthDCAT-AP should align with the "minimum elements" established by DCAT-AP HVD. For non-public data (i.e., protected and sensitive health data) and for high-impact datasets (as mentioned in EHDS Article 58), HealthDCAT-AP will introduce additional "minimum elements". For all protected and sensitive datasets that are subject to data application procedures managed by Health Data Access Bodies, data users will benefit from enhanced data descriptions. These enhancements might include mandatory variable descriptions (i.e., data dictionaries) and/or the availability of open synthetic datasets. This approach would enable health dataset catalogues to harvest metadata records from other sources that have an HVD reporting endpoint, with the capability to filter datasets tagged as health-related*. Moreover, the HealthData@EU infrastructure will not only benefit from ongoing improvements introduced by the core DCAT-AP vocabulary but will also gain from the metadata quality enhancements mandated by the HVD IR. |
EHDS Regulation Recital 58) |
In order to increase the effectiveness of the secondary use of personal electronic health data, and to fully benefit from the possibilities offered by this Regulation, the availability in the EHDS of electronic health data described in Chapter IV should be such that the data are as accessible, high-quality, ready and suitable for the purpose of creating scientific, innovative and societal value and quality as possible. Work on the implementation of the EHDS and further dataset improvements should be conducted in a manner that prioritises the datasets that are the most suitable for creating such value and quality. |
EHDS Regulation Art. 80) Minimum specifications for datasets of high impact |
The Commission may, by means of implementing acts, determine the minimum specifications for datasets of high impact for secondary use, taking into account existing Union infrastructures, standards, guidelines and recommendations. Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 98(2). |
Figure 1: HealthDCAT-AP vs Art.51 defining datasets in scope of the EHDS |
Expected Scenario |
When, for instance, a statistical institute complies with the HVD IR, it will also meet the requirements of HealthDCAT-AP, as both application profiles share the same minimum metadata elements for the publication of open data (e.g., statistical data). This dual compliance ensures that Health Data Access Bodies (HDABs) can seamlessly integrate and populate health catalogues with DCAT-AP HVD records that fall under the scope of Article 51. |
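The harvesting scenario described here can be sketched as a simple filter over already-harvested records. This is a minimal illustration, not a harvester implementation: records are modelled as plain dictionaries with hypothetical keys (`applicable_legislation`, `themes`), and the three URIs (the ELI of the HVD IR, the ELI of the EHDS Regulation, and the EU data-theme code for health) are assumptions that should be checked against the published profiles before use.

```python
# Illustrative sketch only: records are plain dicts, not a real harvester API.
# The URIs below are assumptions; verify them against the published profiles.
HVD_ELI = "http://data.europa.eu/eli/reg_impl/2023/138/oj"  # HVD Implementing Regulation
EHDS_ELI = "http://data.europa.eu/eli/reg/2025/327/oj"      # EHDS Regulation
HEALTH_THEME = "http://publications.europa.eu/resource/authority/data-theme/HEAL"


def is_hvd(record):
    """True if the record declares the HVD IR as applicable legislation."""
    return HVD_ELI in record.get("applicable_legislation", [])


def is_health_related(record):
    """True if the record cites the EHDS ELI or carries the health theme."""
    return (EHDS_ELI in record.get("applicable_legislation", [])
            or HEALTH_THEME in record.get("themes", []))


def select_for_health_catalogue(records):
    """Keep HVD records that are also flagged as health-related."""
    return [r for r in records if is_hvd(r) and is_health_related(r)]


# Hypothetical harvested records.
harvested = [
    {"id": "air-quality", "applicable_legislation": [HVD_ELI, EHDS_ELI], "themes": []},
    {"id": "road-network", "applicable_legislation": [HVD_ELI], "themes": []},
    {"id": "mortality-stats", "applicable_legislation": [HVD_ELI], "themes": [HEALTH_THEME]},
]

print([r["id"] for r in select_for_health_catalogue(harvested)])
# → ['air-quality', 'mortality-stats']
```

Because "Theme" is optional in both profiles, the sketch treats the theme only as a secondary signal; the "Applicable Legislation" property is the one a harvester can rely on being present for HVD records.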
* Attention Point: Health-related datasets |
Challenge: The EHDS ELI URI must be used in the "Applicable Legislation" property, which is mandatory in DCAT-AP HVD. Since the "Theme" property is not mandatory in either DCAT-AP or DCAT-AP HVD, the "Applicable Legislation" property must be filled in accurately and completely for High-Value Datasets (HVDs), so that datasets under the scope of EHDS Article 33 can be identified and harvested. Solution: To assist data holders in complying with the EHDS Regulation and other legal frameworks, the European Commission could implement an AI-based metadata enhancement service (e.g., an API service). Such a service would help identify applicable legal frameworks, such as the EHDS, and suggest enhancements to dataset descriptions, so that metadata aligns with the relevant DCAT Application Profiles and supports legal compliance across sectors. A notable precedent is the automatic "Eurovoc" tagging introduced by data.europa.eu, which offers a valuable use case for SEMIC experts to explore further. This tagging system provides a novel approach for populating properties that support faceted search, a concept that should be considered in the future development of DCAT Application Profiles to improve the efficiency of dataset catalogues. |
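To make the idea of a metadata enhancement service more concrete, the sketch below suggests the EHDS ELI for records whose title or description looks health-related but whose "Applicable Legislation" list does not yet contain it. It is a deliberately naive stand-in: plain keyword matching replaces the AI-based classification discussed above, and the keyword list, the dictionary keys and the ELI value are all assumptions made for illustration. Any suggestion produced this way would still need review by the data holder.

```python
import re

# Assumed ELI of the EHDS Regulation; verify before use.
EHDS_ELI = "http://data.europa.eu/eli/reg/2025/327/oj"

# Hypothetical keyword list; a real service would use a trained classifier.
HEALTH_KEYWORDS = {"health", "hospital", "disease", "mortality", "pollution"}


def suggest_applicable_legislation(record):
    """Return ELI URIs that may be missing from the record's metadata."""
    text = f"{record.get('title', '')} {record.get('description', '')}".lower()
    words = set(re.findall(r"[a-z]+", text))
    declared = set(record.get("applicable_legislation", []))
    if words & HEALTH_KEYWORDS and EHDS_ELI not in declared:
        return [EHDS_ELI]
    return []


# Hypothetical record lacking the EHDS reference.
record = {"title": "Hospital admissions by region", "applicable_legislation": []}
print(suggest_applicable_legislation(record))
```

Returning suggestions rather than modifying the record keeps the data holder in control of what is finally declared as applicable legislation.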
Annif is an open-source toolkit for automated subject indexing. It integrates several machine-learning and AI-based algorithms for text classification. This implementation helps to tag datasets with properties drawn from the EU Vocabularies.
Comment: HealthDCAT-AP extends DCAT-AP by introducing two additional properties, hasCodingSystem and hasCodeValues, which enable the tagging of datasets using any controlled vocabulary, such as Eurovoc. |
