When a health dataset is categorised as non-public data (i.e., protected or sensitive data), one requirement raised during the Technical Working Group sessions was the need to provide descriptions for at least one sample distribution of the dataset. The retained approach involves making a subset of the dataset available, using mock-up data in place of the sensitive data. This subset can be anonymised or synthetic. It allows data users to experiment with the data by writing code, preparing scripts and analyses, or training ML/AI models. Additionally, the sample distribution must be accompanied by a data dictionary providing definitions and descriptions of all data elements.
The Population Health Information Research Infrastructure | phiri.eu. PHIRI Task 7.5 'upgrading options' (Project Horizon 2020, grant agreement No 101018317)
Populating the digital twin with mock-up data

Mock-up data is a type of synthetic data designed to mimic the characteristics of real-world data without containing any sensitive information. By using techniques such as data sampling, transformation, generation, and augmentation, it is possible to create mock-up data that closely resembles the original dataset while protecting any sensitive information that may be present. Depending on the goals of the digital twin model and the legal constraints on processing sensitive data, one or another of the following techniques can be considered:

Data Sampling: One approach to creating mock-up data is to randomly sample data from the original dataset, using techniques such as random sampling or stratified sampling. This can help preserve the statistical properties of the original dataset while eliminating any sensitive information (e.g. the 'avatar' method).

Data Transformation: Another approach is to transform the original data in some way, for instance by scaling or re-scaling values, adding noise, or applying mathematical functions. This can help preserve the structure of the original data while obscuring any sensitive information.

Data Generation: In some cases, it may be necessary to generate entirely new data. This can be done using techniques such as data profiling, which captures the statistical properties of a dataset in order to create high-quality, realistic mock-up data. Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. Data profiles are obtained from an attribute-value system such as a flat file or a spreadsheet (rectangular data). For instance, in the case of a relational database, a SQL query extracts the information and exports it as a CSV file.
A data profile can be produced from the CSV file and serves as a reference document to generate mock-up data. More advanced techniques, such as generative adversarial networks (GANs), can generate new data that very closely resembles the original.

Data Augmentation: Data augmentation involves adding new data points to the existing dataset, either by duplicating existing data points or by creating new ones through some other means. This can help increase the size of the dataset and improve the accuracy of the digital twin model.

As already mentioned, all these techniques for creating mock-up data require careful consideration of the legal framework in force in each silo for the sharing and processing of sensitive data. Legal constraints (e.g. GDPR) vary from one silo to another, and data profiling, because it guarantees privacy unequivocally, is the most appropriate technique for implementing the population's digital twin. However, any other technique may be used; it may have the advantage of including patterns in the mock data that enrich the analysis. It will then be important to specify in the digital twin the quality of the mock-up data.

Simplifying data access negotiations

As sensitive personal health data is never directly accessed, the Population Health Digital Twin offers the possibility to avoid the pitfall of long and complex administrative data access negotiations with the data permit authorities.
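As a minimal, illustrative sketch of the sampling, transformation, and profile-based generation techniques described above (column names are hypothetical and only the Python standard library is used, not any specific digital-twin tooling):

```python
import random
import statistics

random.seed(42)

# Hypothetical original records; the column names are illustrative only.
original = [{"age": random.randint(18, 90),
             "systolic_bp": round(random.gauss(125, 15), 1)} for _ in range(1000)]

# 1. Data sampling: draw a random subset of rows.
sample = random.sample(original, 100)

# 2. Data transformation: add Gaussian noise to obscure exact values.
transformed = [{**row, "systolic_bp": row["systolic_bp"] + random.gauss(0, 5)}
               for row in sample]

# 3. Data generation via a simple profile: capture per-column mean and
#    standard deviation, then draw entirely new synthetic values.
profile = {key: (statistics.mean(r[key] for r in original),
                 statistics.stdev(r[key] for r in original))
           for key in original[0]}
generated = [{key: random.gauss(mean, std) for key, (mean, std) in profile.items()}
             for _ in range(200)]

print(len(sample), len(transformed), len(generated))  # → 100 100 200
```

The generated rows preserve the marginal distributions captured in the profile but carry no information about any individual record; real implementations would also profile correlations, value domains, and missingness.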
In analysing the required capabilities of the future EU central health dataset catalogues, two key requirements have been identified:

1- Variable listing in Data Permit/Access Applications: Data users submitting a data permit/access application form must have the option to list the specific variables they wish to access for reuse. In line with the data minimisation principle, they are not intended to access all variables (ref: EHDS Regulation, Article 44 'Data minimisation and purpose limitation').
EHDS Regulation Article 68) Data permit
1. For the purposes of granting access to electronic health data, the health data access bodies shall assess whether all the following criteria are fulfilled:
2- Analysis script submission with Data Request Applications: When submitting a data request, users must be able to attach an analysis script that is meant to produce the anonymised statistical data.
EHDS Regulation Article 69) Health data request
1. The health data applicant may submit a health data request for the purposes referred to in Article 53 with the aim of obtaining a response only in an anonymised statistical format. A health data access body shall not provide a response to a health data request in any other format and the health data user shall have no access to the electronic health data used to provide that response.
Article 33 of the Data Act mandates that the HealthData@EU infrastructure must unequivocally comply with this regulation to function effectively as a Data Space. This includes the requirement to 'describe in a publicly available and consistent manner' the data structures. Publishing data structures, formats, and vocabularies consistently within the European Health Data Space (EHDS) is crucial for handling sensitive health data. Such standardisation facilitates interoperability and ensures uniform data interpretation across different systems. For a clinician or researcher accessing health data across different Member States, harmonised data dictionaries ensure that common terms like 'sex', 'weight', or 'blood pressure' are consistently defined, improving their ability to work with the data seamlessly, no matter where it was generated.
Data Act Article 33
Essential requirements regarding interoperability of data, of data sharing mechanisms and services, as well as of common European data spaces

IATE: Data Structure Definition
A set of structural metadata associated with a data set, which includes information about how concepts are associated with the measures, dimensions, and attributes of a hypercube, along with information about the representation of data and related descriptive metadata.
In HealthDCAT-AP, this can be achieved by publishing at least one sample Distribution, accompanied by a data dictionary. The solution proposed to the Technical Working Group (TWG) to publish and harmonise the production of data dictionaries involves using the 'Tabular Data on the Web' (CSVW) vocabulary. CSVW allows tabular data (e.g., CSV files) to be annotated with metadata that describes the data structure, types, formats, and relationships. CSVW provides namespace vocabulary terms and definitions specifically for tabular data, making it simple to implement. This metadata effectively serves as a data dictionary, providing comprehensive information about the dataset. See example:
Figure 5: Draft version of the HealthDCAT-AP (RDF examples)
W3C Tabular Data on the Web:
W3C CSV on the Web:
By RDF-izing variable descriptions using the CSVW vocabulary, the data dictionary can be presented as RDF in a sample Distribution. This ensures alignment with the RDF framework of DCAT, facilitating the data interoperability and usability expected to support the data access application service of the EU health dataset catalogue. To summarise, HealthDCAT-AP requires data holders to provide a sample distribution of the dataset (e.g., mock-up data, anonymised data, synthetic data) in any computer-readable format (e.g., CSV, JSON). If applicable, a data dictionary should also be published. The data dictionary must be published using CSVW, resulting in an RDF format for the sample distribution. A more complex use case involves merging both requirements by simultaneously producing the dataset sample as tabular data along with its data dictionary using CSVW:
Dataset example: Health Phenotype Data

Dimensions of the dataset:
To fully understand a dataset, both the dimensions and their semantics (e.g. data types and taxonomies) are essential. The CSVW vocabulary addresses these requirements, which are not covered by DCAT-AP. Alternative solutions, such as the RDF Data Cube vocabulary, are also available.
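As an illustration of how such dimensions and their semantics could be captured, below is a minimal CSVW metadata document (JSON-LD) for a hypothetical health phenotype table; all column names, titles, and datatypes are assumptions for the sake of the example, not prescribed by HealthDCAT-AP:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "sample.csv",
  "dc:title": "Health phenotype data (sample distribution)",
  "tableSchema": {
    "columns": [
      {"name": "patient_id", "titles": "Patient identifier",
       "dc:description": "Pseudonymised patient identifier", "datatype": "string"},
      {"name": "sex", "titles": "Sex",
       "dc:description": "Administrative sex of the patient", "datatype": "string"},
      {"name": "weight_kg", "titles": "Weight (kg)",
       "dc:description": "Body weight in kilograms", "datatype": "decimal"}
    ],
    "primaryKey": "patient_id"
  }
}
```

Each entry in `columns` plays the role of a data dictionary record, and CSVW-aware processors can convert the annotated table directly to RDF.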
Real-world example: The dataset "Legati: Misiones diplomáticas hispanomusulmanas (1492-1708)" on the EU Data Portal serves as an illustrative case to demonstrate the advantages of integrating the CSVW vocabulary into HealthDCAT-AP.
The dataset description includes three distributions in various formats, providing a useful example for discussion:
The first distribution, in XLSX format, contains the actual data. This data can be automatically displayed on the EU Data Portal using a CSV reader and the dataset's download link:
Preview of the XLSX distribution provided for the dataset 'Legati: misiones diplomáticas hispanomusulmanas (1492-1708)' (LEGATI.v2.xlsx)
A similar tool could be developed to display an adms:sample distribution, in CSVW format, in any health dataset catalogues.
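A minimal sketch of such a preview tool, assuming hypothetical CSVW metadata and CSV content supplied inline (in a real catalogue both would be fetched from the distribution's download URLs):

```python
import csv
import io
import json

# Hypothetical CSVW metadata and sample CSV content for illustration.
metadata_json = """{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "sample.csv",
  "tableSchema": {"columns": [
    {"name": "sex", "titles": "Sex", "datatype": "string"},
    {"name": "weight_kg", "titles": "Weight (kg)", "datatype": "decimal"}
  ]}
}"""
csv_content = "sex,weight_kg\nF,62.5\nM,81.0\n"

metadata = json.loads(metadata_json)
columns = metadata["tableSchema"]["columns"]

# Render a header from the human-readable titles in the data dictionary,
# then the rows of the sample distribution.
print(" | ".join(col["titles"] for col in columns))
for row in csv.DictReader(io.StringIO(csv_content)):
    print(" | ".join(row[col["name"]] for col in columns))
```

Because the column titles come from the CSVW annotation rather than the raw CSV header, the preview automatically benefits from the data dictionary's harmonised labels.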
The second distribution, in TXT format, is a README file that contains no data and should not be listed as a dataset distribution. However, it is notable that the file includes structural metadata that could be provided through a CSVW distribution:
README TXT file provided as a distribution for the dataset 'Legati: misiones diplomáticas hispanomusulmanas (1492-1708)'
The third distribution, a PDF containing data entry guidelines, represents the dataset's data model. Although this PDF is not a true data distribution, it could be formalised as an entity-relationship diagram or table structure, which CSVW could standardise into a more structured solution.
In summary, we aim to standardise sample distributions for health datasets and incorporate them into the HealthDCAT Application Profile to ensure interoperability in publishing sample datasets and data dictionaries. This principle of harmonised data specifications for health data samples and data dictionaries can be effectively realised using the CSVW vocabulary.
Many discussions during the design of HealthDCAT-AP focused on the definition of 'dataset'. For health data sources that comprise a large data warehouse, this proves to be a particularly challenging exercise. How should one decide which datasets are suitable for publishing? How should large health registries be broken down into logical or 'virtual' datasets that are relevant for secondary use? Recital 56 of the European Health Data Space (EHDS) Regulation provides a definition of a dataset that guides this process:
EHDS Regulation Recital 56)
... Electronic health data for secondary use (x) should (x) be made available preferably in a structured electronic format that facilitates their processing by computer systems. Examples of structured electronic formats include records in a relational database, XML documents or CSV files and free text, audios, videos and images provided as computer-readable files.
A dataset is, in essence, a structured collection of data ready for analysis, while a data source is the provider of that data (e.g. a data warehouse or health registry). The Oxford Dictionary defines a dataset as 'a collection of data, typically in tabular form, especially one that can be accessed and manipulated by computer software'. In summary, the data source supplies the raw data, which is then processed and organised into datasets for various analytical purposes.

During the exercise of mapping the national metadata model of the French Health Data Hub to HealthDCAT-AP (Task 9.4), we extensively discussed the associated challenges. In order to generate HealthDCAT-AP metadata records, the agreed-upon approach was to define coherent data collections from the data warehouses. This strategy ensures that datasets from the data warehouse are consistently prepared and structured, making them ready for analysis.

In the health data landscape, it is common to find data providers such as registries and biobanks curating and storing data in a data warehouse. These organisations will face the challenge of defining coherent data collections for the EHDS. This is the challenge of defining datasets from an IT technical perspective: many health data providers curate their data within the context of data warehousing. Data marts, as subsets of these data warehouses, are specifically designed to address the unique needs of individual departments within an organisation.
Here is a more detailed explanation.

State of the art of data management and business intelligence: A Data Warehouse is a centralised repository that stores integrated data from multiple sources. It is designed to support business analysis and decision-making processes. The data warehouse contains historical data, is optimised for query and analysis, and typically covers a wide range of subject areas. A Data Mart is a smaller, more focused subset of a data warehouse designed to meet the specific needs of a particular department, team, or business function, such as a research project. It contains data relevant to a particular subject area or line of business.

How data warehouse and data mart work together:
1. Data Collection:
2. Data Storage:
3. Data Access via Data Marts:
4. Usage:

By making data accessible in data marts, organisations can ensure that relevant and actionable insights are more readily available to specific departments, improving the efficiency and effectiveness of their business processes.
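The warehouse/data-mart relationship can be sketched with a small, self-contained example; the table, view, and column names below are hypothetical:

```python
import sqlite3

# An in-memory warehouse table and a department-specific "data mart"
# derived from it as a view. All names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_admissions (
        patient_id TEXT, department TEXT, year INTEGER, diagnosis_code TEXT
    );
    INSERT INTO warehouse_admissions VALUES
        ('p1', 'cardiology', 2023, 'I21'),
        ('p2', 'oncology',   2023, 'C50'),
        ('p3', 'cardiology', 2024, 'I50');

    -- A 'virtual' data mart: a view scoped to one subject area, which
    -- could then be described by its own metadata record.
    CREATE VIEW mart_cardiology AS
        SELECT patient_id, year, diagnosis_code
        FROM warehouse_admissions
        WHERE department = 'cardiology';
""")

rows = conn.execute(
    "SELECT patient_id, year FROM mart_cardiology ORDER BY year"
).fetchall()
print(rows)  # → [('p1', 2023), ('p3', 2024)]
```

The view materialises nothing: it is a coherent, analysis-ready slice of the warehouse, which is exactly the role a 'virtual' data mart plays in the discussion that follows.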
![]() |
Figure 6: Data mart vs data warehouse
Within the framework of the HealthData@EU infrastructure, a recommended "user-centric" approach for data warehouse holders is to create "virtual" data marts. Each "virtual" data mart would define a coherent dataset. A HealthDCAT-AP record can then be generated and maintained for each data mart. Additionally, a data dictionary containing definitions and descriptions of all data elements of the "virtual" data mart is produced. This approach enhances data discovery and aligns with a more service-oriented strategy. It supports the EHDS user journey by allowing users to request access to data in a Secure Processing Environment (SPE), which can be considered a "virtual" secured data mart. In Finland, this approach has already been adopted: the catalogue lists a collection of data resources, with each resource offering a set of accessible datasets:
Alternatively, the feasibility of creating a single HealthDCAT-AP metadata record for an entire data warehouse can be questioned. A data warehouse is not a dataset, and a single record lacks the granularity necessary to effectively describe the diverse datasets the warehouse contains. Such a metadata record would likely be too broad and general, making it difficult for users to discover the specific datasets they need. Instead, approaching datasets as products, each with its own HealthDCAT-AP metadata record, would offer a more granular and user-friendly solution. This would allow data users to better understand and access specific datasets tailored to their needs, ensuring they can use the data effectively. Treating datasets as individual products also enables more precise metadata, enhancing data discoverability and usability.
Few resources and studies cover this topic:
Another resource by the same author:
Below is a figure that shows the hierarchical structure of the Norwegian Cause of Death Registry:
In this context, it's not always intuitive to determine what qualifies as a dataset and what does not. In future processes, it will be important to establish a common understanding and develop a method to embed these breakdown structures within dataset descriptions. |
Figure 8: Example of a data warehouse's structure: the Norwegian Cause of Death Registry
Given the federated nature of the HealthData@EU infrastructure, the concept of 'virtual' datasets, as discussed above, can be considered across the EHDS. Through HealthDCAT-AP, common harmonised data models and datasets could be promoted and accurately described. Figure 9 presents a graphical representation of the EHDS components supporting secondary data use, divided into two main categories: on the left, the provision of relevant health data by health data holders and Health Data Access Bodies, and on the right, the consumption of relevant data by health data users. The concept of this figure comes from the JRC Science for Policy report INSPIRE – A public sector contribution to the European Green Deal data space (Fig. 11, p. 36). Additional components have been added, illustrating the data flow to PROCESSING SERVICES (Secure Processing Environments) and common DATA ACCESS APPLICATION forms, and clarifying that the EHDS does not require raw data to be harmonised before publication. While the EHDS does not require generalised data harmonisation, this does not preclude organisations, such as research infrastructures, from coordinating and agreeing on common data models and virtual datasets for the EHDS. This approach supports federated analysis and learning by making the data ready for processing. (Note: interoperability requirements for Processing Services essential for federated analysis and learning fall outside the scope of this document.)
How can such a virtual, federated 'dataset' be promoted?
When a dataset conforms to a specific data model or standard, the dct:conformsTo property in HealthDCAT-AP should ideally reference a URI. For example, a dataset using the OMOP data model could use the URI https://www.wikidata.org/wiki/Q47219554 for OMOP. Additionally, a common HARMONISED DATA VALIDATOR could be implemented as a new component of the HealthData@EU infrastructure to ensure the quality and consistency of datasets.
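For illustration, a dataset declaring conformance to the OMOP common data model could be described in RDF (Turtle) as follows; the dataset URI is a hypothetical example, while the Wikidata URI is the one cited above:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical dataset URI declaring conformance to the OMOP CDM.
<https://example.org/dataset/omop-hospital-data>
    a dcat:Dataset ;
    dct:title "Hospital data in the OMOP common data model"@en ;
    dct:conformsTo <https://www.wikidata.org/wiki/Q47219554> .
```

A harmonised data validator could then resolve the dct:conformsTo target and check the published data against the referenced model.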
![]() |
Figure 9: Overview of the components of the HealthData@EU infrastructure.
Real-world example: |
DARWIN EU (Data Analysis and Real World Interrogation Network) is a European initiative established by the European Medicines Agency (EMA) to harness real-world healthcare data for regulatory decision-making. Its primary objective is to provide timely and reliable evidence on the use, safety, and effectiveness of medicines across the European Union (EU). The network comprises a coordination centre and a growing consortium of data partners, including hospitals, primary care providers, health insurance organisations, patient registries, and biobanks. These partners contribute anonymised healthcare data, which is standardised into a common data model to facilitate efficient analysis.
EHDS Regulation Article 52) Intellectual property rights and trade secrets
2. Health data holders shall inform the health data access body of any electronic health data containing content or information protected by intellectual property rights, trade secrets or covered by the regulatory data protection right laid down in Article 10(1) of Directive 2001/83/EC or Article 14(11) of Regulation (EC) 726/2004.
To comply with the requirements of the Data Governance Act, HealthDCAT-AP has made the 'Conditions for re-use (Rights)' a mandatory property using dct:rights, a dct:RightsStatement. This property refers to 'a statement that specifies rights associated with the Dataset Distribution', allowing data holders to provide the information needed to meet the intellectual property rights assertion requirements under the EHDS Regulation. This requirement also applies to sample Distributions. However, if any electronic health data contains content protected by intellectual property rights, these IPR constraints may prevent the provision of a dataset subset. In such cases, it may still be possible to produce a redacted data dictionary, with health data holders indicating which parts of the datasets are affected by IPRs and justifying why specific protection is required. A sample distribution providing a redacted data dictionary and conditions for re-use would help ensure dataset discoverability. Therefore, even when IPRs apply, a data dictionary should remain mandatory in HealthDCAT-AP to enhance data discoverability.
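As a sketch of such a rights statement, a sample Distribution could carry the following RDF (Turtle); the URI and wording are illustrative assumptions only:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical sample distribution whose conditions for re-use flag
# IPR-protected content, while a redacted data dictionary is still provided.
<https://example.org/dataset/registry-extract/distribution/sample>
    a dcat:Distribution ;
    dct:rights [
        a dct:RightsStatement ;
        dct:description "Parts of this dataset contain content protected by intellectual property rights; the affected variables are redacted in the published data dictionary."@en
    ] .
```

Expressing the conditions for re-use as a dct:RightsStatement keeps the restriction machine-discoverable while the redacted data dictionary preserves discoverability of the dataset itself.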
Real-world example: |
Publishing redacted data - where sensitive or confidential information is edited or censored - is a common practice before sharing or publishing. A notable example comes from biodiversity data management, especially within platforms like the Global Biodiversity Information Facility (GBIF) and similar environmental data repositories. In these cases, datasets containing sensitive information, such as precise locations of endangered species, are redacted to address conservation and ethical concerns. Only non-sensitive metadata and summary statistics, like species counts and general habitat types, are shared, while specific geographic coordinates and other details that could risk exposing protected locations are excluded. Publishing redacted data is also common in marine data management, as seen with some of the European Marine Observation and Data Network (EMODnet) data products. For example, datasets containing sensitive information about submarine power and telecommunication cables are often redacted to protect infrastructure security and integrity. While general information, such as the presence of cables and broad geographical areas, may be shared, the exact coordinates and detailed pathways are withheld. This raw, precise information remains under the copyright of the companies that own or operate the cables, ensuring both data security and compliance with proprietary rights.