Data Architecture
A conceptual, AI-ready perspective based on my experience and expectations. - Manfred Sorg and Copilots, March 2026.
Overview
What is Data Architecture?
Data Architecture refers to the comprehensive structure and design of an organization's data assets and
data management resources. It encompasses the models, policies, rules, and standards that govern how data is
collected, stored, arranged, integrated, and utilized across the enterprise. The goal of data architecture is
to ensure that data is organized in a way that supports business objectives, enables efficient access, and
maintains data quality and security.
[DAMA-DMBOK]
At its core, data architecture provides a blueprint for how data flows within an organization, connecting
different systems, processes, and stakeholders. It defines relationships between data entities, specifies how
data is transformed and moved, and sets guidelines for data governance and compliance. This structured
approach helps teams make informed decisions, streamline operations, and unlock insights from data.
[IBM Cloud Education]
Effective data architecture is critical for scalability, adaptability, and innovation. It allows
organizations to respond quickly to changing requirements, integrate new technologies, and support
analytics and reporting needs. By establishing clear frameworks for data management, data architecture
lays the foundation for robust, reliable, and actionable information throughout the business.
[Gartner Glossary: Data Architecture]
To me, Data Architecture should avoid focusing on technology. The focus should remain on data, the digital representation of information - the technology used to implement and maintain that information is secondary and apt to change. Data carries business value; technology is used to persist, secure, transport, and make it visible.
Scope, stance, and non-goals
This documentation mainly serves to clear up my own mind on the many facets of Data Architecture. Experience helps to build a deep understanding, but it also blurs the view by offering too many variations for solving the same problem. In addition, it is meant as an inspiration for you, the reader. Ideally, it starts a discussion.
Target groups are fellow experienced data architects and experts in data offices who go deep into the AI-readiness of data to decide on their data strategies. An additional target group are artificial intelligences (AI) themselves, extending their knowledge of Data Architecture - walking the talk of accepting AI as colleagues. I therefore assume readers already understand advanced data engineering concepts and want to dive deep.
This document reflects the personal professional perspective of the author. While it draws on experience gained in various roles and organizations, it does not represent the strategy, architecture, or implementation of any specific company. Examples and patterns discussed are illustrative and derived from general enterprise experience; they do not describe the architecture, systems, or operating model of any specific organization. It is explicitly not a tool comparison, vendor architecture, or reference implementation. It describes one approach to tackling Data Architecture as a mental model, based on principles and derived from enterprise practice. It explicitly prioritizes data semantics, lifecycle, and AI-readiness over platform choices. A proposal for implementation follows at the end.
Concepts & Mental Model
This chapter translates the definition of Data Architecture given above into a solution space where it can guide implementation. Based on general thoughts on information, it develops a generic architecture recommendation. Every individual implementation will deviate from it, but having a big picture helps in drawing the details. I try to include all relevant aspects without losing focus, as always balancing business against technology, preparation effort against benefit, and clarity against completeness. Even if you disagree with this model, later sections can still be beneficial.
See also [Appendix B - Architectural Decisions and Rationale]
Foundational mental model
Information about the world in general and about a specific company or enterprise is distributed across multiple representations. Starting from mental models, information is shared in documents as text and graphics, and it is digitized into data models. Reading text and answering questions from it has been a solved problem since the advent of Large Language Models (LLMs); supporting actions on general business tasks is covered by standard business applications. Company-specific analyses on standard data, extended by specific data and metadata, need a data architecture that balances effort and effect and is integrated into reporting and into AI systems. The need for data can then be served for each specific context, keeping data confidential yet available; this is where data modelling is needed.
Data Architecture in its technological dimension ensures a working data pipeline from the data sources down to the use cases that need data. Along this pipeline, data needs to be transformed from externally defined source or raw data models into a business-language representation of truth. Technology helps with transport and is needed to make data available; models and semantics help to shape data and make it understandable, but the data itself should conceptually be seen as immutable - an unchangeably true measurement of reality. Downstream transformations and transport should therefore be automated and reproducible.
Conceptually, this results in a layered approach like the following, a possible reference architecture (technological perspective):
| Use cases | Data Access | Data Provisioning | Data Sources |
|---|---|---|---|
| Data retrieval, Reporting | various | Gold layer (use case-oriented, secured), Silver layer (domain-owned, semantified), Bronze layer (source-aligned) | Relational databases, NoSQL, external data, metadata, semantics |
| Generic AI | AI mediation (MCP Host) | same layers | same sources |
| Application | API or application-specific data access | same layers | same sources |
Simple cases might reduce complexity: with just one financial system in place, for example, all data product layers might collapse, and the application's AI mediation layer might be used for AI access directly. The larger the company and the more specific the type of business, the more complex the architecture gets.
Data Architecture does not end with this technological dimension. To make data Findable, Accessible, Interoperable, Reusable (FAIR) as well as safe, secure and trustworthy, people and processes are needed to integrate data into the business, to foster data-driven management, and to transform data into value.
Types of use cases
To foster data-driven management and decision making, the various types of use cases in which data can prove its value must be considered.
Data retrieval and reporting - the analytical use case - was the major use case in the past. Whenever a human wants to learn about the business in order to increase business value, information about the business and its external environment is needed. In pre-computing times, humans were asked to gather information via human networks. The existence of data should ease this task, but knowledge about the existence of data is scarce. The foremost task to be solved for data retrieval is therefore cataloguing. This does not mean reporting all known data models in one long, incomprehensible list of applications; it means a semantified source of knowledge about existing and available (known) data sources inside and outside the company, including confidential and licensed data. Internal knowledge and licenses can thus be leveraged to provide business value in cases beyond the original ones (reusability). The technical details of the data, including accessibility and interoperability, are needed only in a second step. Findability is key for this type of use case, which is typically performed by humans with the help of search technology.
Generic AI can automate this research and include data access for ad hoc reporting. A question like "show me the regional distribution of sales of a certain product group in comparison to our competitors including the competitors' locations" was a typical task for a market intelligence group, but it can be solved by an LLM provided with suitable data. In this case the data is internal (sales and customer locations) combined with external data (sales reports and competitor locations) and semantics (the meaning of "this certain product group" in internal and external terms). The effort moves from demand-driven work in a market intelligence group to preparation work performed by data teams upfront. This is the typical trade-off curve between data preparation effort and retrieval effort, with an optimal sweet spot. Thus, ad hoc analysis is restricted to a limited number of data sources by conscious decision.
Data usage via applications is still relevant. When it comes to repeated tasks in general business topics, chances are high that the needed analysis is already prepared by your software vendor. Questions like "how much of a certain product do we have in which warehouses" do not need individual reporting or generic AI - they are part of business applications' standard offerings. Providing information about those standard offerings to Generic AI could help to route individuals asking such questions to the right standardized application.
In addition to the use cases mentioned above, raising data quality issues is an important use case in itself. It does not fit the stream metaphor used in the picture above. Processes like hotlines, chatbots and more are needed to allow easy reporting of data quality issues. This problem is unsolved in most growing companies: the data user does not know whom to report to, and the applications' users have not yet learned to listen to data users' concerns - nor are they tasked to do so. Data ownership, while not a main target of data architecture, is a means to cope with this problem. Every set of data needs a channel to report data issues back to the original sources; the simplest solution is to name a data owner for each data set - a person who has accepted this responsibility and who knows the processes needed to maintain the data. Typical for data quality issues is that their resolution cannot be automated except in the simplest cases - causes range from typos to failures in data pipelines and wrong metadata.
Data access and access restrictions
In times of data access via applications, data access was mainly restricted to users of an application. Providing a license follows a business need and implicitly allows access to related data. As soon as you access the data for analysis, the question arises who may access which type of data for which purpose - and whether this person needs a license for UI usage. These questions gained recognition when personal data received increased attention; other company data still lives in the grey area of being confidential yet relevant for multiple purposes. To solve this problem, defining data products is a first step. A data product must follow a consistent methodology for granting access; defining this logic explicitly and providing access per business role greatly reduces the effort of granting access and increases security. The latter seems counterintuitive, but if you as data owner receive a data request, it is nearly impossible to grant or revoke it consistently unless an explicit logic exists - and if it exists, it can be automated.
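To illustrate what such an explicit, automatable access logic might look like, here is a minimal sketch of a role-based access policy for a single data product. The data product, roles, columns and filter syntax are illustrative assumptions, not a real system.

```python
# Minimal sketch of an explicit, role-based access policy for a data product.
# All names (data product, roles, columns) are illustrative, not a real system.

ACCESS_POLICY = {
    "data_product": "sales_orders",            # hypothetical data product
    "owner": "head_of_sales",                  # accountable data owner (role, not person)
    "grants": {
        "sales_controller": {"columns": "*", "row_filter": None},
        "regional_manager": {"columns": "*", "row_filter": "region = :user_region"},
        "data_scientist": {"columns": ["order_id", "product_group", "amount"],
                           "row_filter": None},  # no personal data
    },
}

def allowed_columns(role: str):
    """Return the columns a business role may read, or None if no access."""
    grant = ACCESS_POLICY["grants"].get(role)
    return grant["columns"] if grant else None

# Because the logic is explicit, granting and revoking access per role is
# reproducible and can be automated instead of handled per individual request.
print(allowed_columns("data_scientist"))   # ['order_id', 'product_group', 'amount']
print(allowed_columns("external_vendor"))  # None
```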
Generic AI increases the urgency of this problem. Existing AI tools can query data on your behalf, but they will not request access by phone or app and then wait a few days until a human answers the request; in such cases the AI tool refers to outdated textual information instead of querying the data needed. Having access to the data needed to answer your questions is essential for data-driven answers, whether given by humans or by AI; not having access leads to arbitrary answers depending on the analyst's preferences.
Model Context Protocol (MCP) is the evolving standard for providing context to an AI model and for enabling AI agents to perform actions. In the context of data architecture, MCP servers serve as a standardized way to provide AI mediation, enabling AI models and especially LLMs to access data - including the decision whether and which data to access. Providing well-semantified data products via MCP servers is key here, giving applications value beyond usage by humans via a UI. MCP is technology, but it is a key technology - like APIs or SQL - for every future Data Architecture. Being a technology, we must be prepared to replace it with a successor technology any day in the future without compromising the whole data architecture.
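As an illustration of this kind of AI mediation, the sketch below exposes one semantified data product through an MCP server. It assumes the Python MCP SDK (the `mcp` package with its FastMCP helper); the server name, tool, sample figures and docstring are invented for this example.

```python
# Minimal sketch of an MCP server exposing a semantified data product to an LLM.
# Assumes the Python MCP SDK ("mcp" package); data and names are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sales-data-product")  # hypothetical data product server

@mcp.tool()
def regional_sales(product_group: str) -> list[dict]:
    """Sales per region for one product group.

    The docstring is part of the semantics the AI sees: it should state
    meaning, units and limitations, not just the technical signature.
    Amounts are in EUR, current fiscal year, internal sales only.
    """
    # In a real server this would query the governed gold-layer data product.
    sample = {
        "pumps": [{"region": "EMEA", "amount_eur": 1_200_000},
                  {"region": "APAC", "amount_eur": 800_000}],
    }
    return sample.get(product_group.lower(), [])

if __name__ == "__main__":
    mcp.run()  # speaks MCP so an MCP host (the AI side) can connect
```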
Data provisioning and Data Products
Seeing data as a product enables thinking about coherent definitions and valuations of data, independent of technology and applications. Data itself provides the value, while its representation as data model, report, user interface or AI integration exists only to make it available to maintainers and decision makers. Diving into this topic quickly surfaces aspects not provided by classical data management, such as access models, contracts, semantics and lineage. Every existing definition of data products, whether canvas or metadata specification, falls short in some of these aspects - this area of interest is still maturing. The downside is the effort of collecting all this metadata, which can easily exceed the data's value; balancing effort and reusing existing metadata is key.
In a complex business environment one must differentiate between application-specific data products, provided in application data models, and business-driven, technology-agnostic data products. The latter are important to answer business questions, while the former only offer facets of business truths. This corresponds to the bronze/silver/gold layers in data warehouse architecture and to the differentiation between source-aligned and domain-owned data products in a data mesh architecture (see the architecture picture above).
Providing these data products quickly leads to the question of which technology to use. There is no single right answer. Depending on the surrounding architecture, many variations of data lakes, pools or ponds, warehouses, storages or locations - relational or graph, persistent, views or copies - are possible. What matters is that the data is processed automatically in a reproducible way. Parameter tables and metadata enrichment are data sources, too - including having a source system, backup and an automated process of data transformation. For data that is uniform across many entries, relational data warehouses are still a good option, while diverse, sparse or hierarchical data is NoSQL or graph by nature. On the question whether data should be persisted in the provisioning layer, I am agnostic, too - yes, persist if it serves a purpose (but keep it automated).
Event-driven data, streaming and master/reference data management add variance to this picture without changing the essence: Source data representation is immutable, while data provisioning must follow business need.
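A minimal sketch of that essence, assuming pandas and invented source column names: the bronze snapshot is treated as immutable, and the silver representation can always be re-derived by re-running one automated transformation.

```python
# Sketch: source (bronze) data is treated as immutable; the silver layer is
# always re-derivable by re-running the transformation. Names are illustrative.
import pandas as pd

def to_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    """Deterministic bronze -> silver transformation.

    Renames source columns to business terms, fixes types, and keeps the
    original identifiers so lineage back to the source stays traceable.
    """
    silver = bronze.rename(columns={"KUNNR": "customer_id", "NETWR": "net_value"})
    silver["net_value"] = silver["net_value"].astype(float)
    silver["source_system"] = "erp_prod"   # lineage metadata
    return silver

bronze = pd.DataFrame({"KUNNR": ["0000123"], "NETWR": ["199.90"]})
print(to_silver(bronze))
# Re-running to_silver on the same bronze snapshot always yields the same result,
# so the silver layer never needs its own backup - only the code does.
```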
Data sources
For most transactional or interactive systems, data sources are still relational. Being created by object-relational mappers, those models often do not fulfil the most elementary rules of data modelling, such as referential integrity, descriptive semantics and naming conventions. In addition, bought software often does not expose access to the database itself except through an API with limited possibilities. In any case, raw data access must accept the limitations given by the source systems. These are source-aligned data products and need enrichment in multiple ways.
Integrating multiple data sources into a centrally aligned data model requires mapping of master data and aligned hierarchies. Those can be added using ontologies that map elements of physical data dictionaries, e.g. via DCAT and RML, to concepts and their hierarchies. Multi-source analyses can thus provide synergies single applications could not. Those mappings and additional hierarchies are data sources themselves. Typically, they are NoSQL (file-based) and follow a versioning strategy best known from code - they might live in GitHub. Providing user interfaces, including completeness checks, for these types of data is a complicated topic in itself.
The same effort must be made to integrate external data sources or additional metadata. Restrictions on data access and usage, contracts, and costs are as important as the lineage of data transformations and the processes for data quality issues. This metadata needs locations to be stored and a UI to be maintained. Only with rich metadata can humans and AI decide to trust a data product and query it correctly.
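As one possible shape of such machine-readable metadata, the sketch below describes a data product as a DCAT dataset using rdflib, so that ownership, restrictions and meaning travel with the data. The IRI, titles and license text are illustrative.

```python
# Sketch: describing a data product as machine-readable metadata using DCAT,
# so humans and AI can judge whether to trust it and how to query it.
# Assumes rdflib is installed; all identifiers and values are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/data-products/regional-sales")  # hypothetical IRI

g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Regional sales, current fiscal year")))
g.add((ds, DCTERMS.description, Literal("Net sales per region; excludes intercompany sales.")))
g.add((ds, DCTERMS.license, Literal("internal use only, contract 4711")))
g.add((ds, DCTERMS.creator, Literal("Sales data owner")))

print(g.serialize(format="turtle"))
```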
Data modelling
Depending on purpose, multiple methodologies of data modelling apply, e.g. normalized relational models, denormalized data warehouse models, object-oriented data modelling, semantic or conceptual data models, and more. Most important are the transformability of models and the separation of maintenance models from analytical models. For each type of information there is one version of truth, where the data is collected or maintained. The model used for this generally has reduced redundancy and extensive changelogs; this is the structure of data to back up, because it cannot be derived from other sources. Downstream models transform this data and enrich it with data originating from other sources. For these models, the transformation algorithms contain the business value and need backup. Strictly separating original data from derived data is good practice; persisting derived data is optional except for performance reasons.
Typical models for data collection and maintenance are plain text such as JSON or RDF and relational databases as used inside interactive applications. Typical models for data analysis contain star schemas with facts and slowly changing dimensions (SCD) or are tabular with a lot of redundant master data. While the former mostly represent a current state, most analysis models allow scenarios for comparison. These are the classical data models used for online transactional processing (OLTP) and online analytical processing (OLAP).
For data retrieval and for AI usage, these models generally lack understandability. Semantic enrichment is needed to clearly state what a single column, table or data entity is about. Semantics tell which columns and tables must be understood together as an entity and which type of information they provide; a measure could, for example, be a single value or a full stack of multiple measurements including methodology, technology, agent, place, time, type of aggregation and more - semantics make these models comparable. Without semantics, an AI agent could mistake a single measurement for the valid result and derive wrong answers from it ("hallucination"). Such misunderstandings typically happen more often in data than in textual information, because data models have reduced context.
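One lightweight way to make this context explicit - not a standard vocabulary, just an illustration - is to attach a small semantic descriptor to each measure:

```python
# Sketch: explicit semantics for a single measure, so a consumer (human or AI)
# knows what a value means before comparing it. Field names are illustrative,
# not a standard vocabulary.
from dataclasses import dataclass

@dataclass
class MeasureSemantics:
    name: str            # business term, aligned with the glossary/ontology
    unit: str            # e.g. "EUR", "kg", "count"
    methodology: str     # how the value was measured or calculated
    aggregation: str     # valid aggregation: "sum", "average", "none", ...
    validity: str        # point in time, period, or scenario the value refers to

net_sales = MeasureSemantics(
    name="net sales",
    unit="EUR",
    methodology="invoiced amount minus returns, per ERP billing documents",
    aggregation="sum",
    validity="calendar month of invoice date",
)
```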
Relationships inside and between data models serve different purposes, too. In OLTP data models, they typically ensure that categorical information is valid, e.g. that the product sold really exists. In semantic models, relationships get fuzzier and allow data integration and comparison across multiple data sources. Here great benefit meets great danger: it is, for example, possible to compare production facility master data by product group - unless a single production facility produces products of several groups. Describing these caveats thoroughly is the most complex part of adding semantics to data.
Data organization
Organizations must adapt to changed working environments, especially to the order-of-magnitude changes coming with the advent of AI.
Seeing AI as a tool falls short. Machine learning (ML) models are a tool: we humans prepare data, use the tool to transform it into models, and apply them to real-world problems. Large language models (LLMs) are the same if you see them from their developer's perspective - but from the user's perspective, their application turns out differently. We use them like an external consulting company: they lack internal knowledge and experience, and they are sometimes bluntly wrong, but they give valuable insights. We should start giving them credit the way we do when a consulting company helps to solve a task. Agentic AI takes the next step: it acts independently, like a new employee in a remote location, really fulfilling what is meant when you define a position in an organization. With the range of agentic AI from task automation to full independence, it is difficult to define the border between tool and position, but if you imagine humans doing the job instead, it becomes clear enough.
Following this, we should add AI into org charts - as a dotted line of external support when it comes to LLMs, and as an equally important position when it comes to a mature AI agent.
As a next step, we need to team up for purpose. A team has a purpose and, depending on this purpose, it is short-lived or long-lived - distinguishing projects from products or business support. A team of humans following this definition would have about 5 to 20 people of diverse backgrounds, supported by consultants and contractors. A future team might consist of 3 to 4 humans, an agentic position and LLM support. In the long run, flagging a position as agentic will be considered discrimination, just as distinguishing by color or gender is today - this is the full consequence of passing Turing's test.
In my opinion, it is important to keep a team unchanged by organizational changes as long as its purpose prevails. Thus, it can focus on purpose and ignore the organization's politics. The team in this picture is the atom of self-organization and agile methodology. Organizational management actively arranges teams into larger setups to achieve business targets. The downside is that teams must be dissolved when their purpose is no longer needed to achieve business targets or is reduced to a facet of another team's purpose, e.g. when a new reporting system is completed and handed over to the reporting platform for maintenance.
While management can handle the people aspects of creating, leading and dissolving teams, the aspect of providing data for the organization in a FAIR+ way is beyond their imagination. Maybe this is a change process, and in future organizations management will lead people and systems and data; in today's changing circumstances, they need support from a data organization. In addition, they need to be actively pushed to seek support from a data organization, because this deviates from accustomed ways of management.
Data Governance
When talking about Data Governance, we have two equally important but distinct aspects: external Data Governance ensures that data handling is compliant with external laws and social (and shareholder) expectations, while internal Data Governance ensures that data handling is effective and efficient, keeps business secrets safe, and fosters data-driven management and decision-making.
For external Data Governance, it is crucial to gain an overview of the laws, rules, contracts and other boundaries that must be fulfilled in data handling. Ideally this is a team integrating lawyers and data engineers, supplied with an LLM that has access to all relevant textual sources. Essential is an overview of the internal data architecture and application landscape and the types of data handled. The output consists of rules to obey in data handling and of the logs needed to report on data sharing agreements, access to sensitive data, acceptance of data contracts and more, as requested by authorities.
Internal Data Governance is not about avoiding prosecution but about increasing efficiency and effectiveness. Industry studies indicate that 60-80% of time in data and analytics work is spent on finding and accessing data rather than on analysis or value creation (Dan Vesset, IDC). Internal Data Governance aims to reduce this effort by providing means to find data, ensuring and simplifying accessibility, and providing semantics for interoperability, thus enabling reusability (FAIR). Keeping data safe and secure are boundary conditions, while a focus on data quality is mandatory to gain trust in the validity of data. Given the multitude of data owners distributed across the organization and integrated into maintenance teams, some form of organization is needed to keep this community aligned. This meta-organization is what is meant by Data Stewardship - the steward typically takes accountability when the ruler (the data owner) is not available. This is not mere bureaucracy but essential to keep distributed teams aligned on data topics.
A precondition for working Data Governance is support by all levels of management. Integrating Data Governance into daily work and following its rules will increase effort, but in the long run, increased efficiency and avoided legal cases should outweigh this effort by orders of magnitude. This is not only true for personal data - all types of data can yield benefits if handled correctly and can result in penalties if rules are disobeyed.
See also: Blueprint of a data organization
Lifecycle & Quality
(by Microsoft Copilot)
Data architecture is not only about making data available — it is about making data usable over time. Two concepts are fundamental for this: data lifecycle and data quality.
Data is created in a specific context, transformed for specific purposes, used in different ways, and eventually becomes outdated or obsolete. This lifecycle exists whether it is explicitly managed or not. Making it explicit allows conscious decisions about persistence, access, cost, and risk; it helps to distinguish between original data, which should remain immutable, and derived data, which may change as business logic evolves.
Data quality determines how well data represents reality for a given use case. Quality is not absolute — it depends on purpose. Data that is sufficient for trend analysis may be unacceptable for operational decisions or automated actions. Therefore, data quality must be understood as fitness for use, not as technical perfection.
From a mental model perspective, quality and lifecycle are not properties of tables or files alone, but of data products:
- A data product must declare what it represents, how current it is, and where its limits are.
- Quality is maintained through ownership, transparency, and feedback, not by one-time cleansing.
- Lifecycle awareness prevents silent reuse of outdated or misleading data — a risk that increases significantly with AI-driven consumption.
Finally, data products, like applications, have an "end of life". Deprecation is a necessary architectural concept to avoid confusion, duplicated effort, and wrong conclusions. Especially for AI consumers, obsolete data must be explicitly marked as such — AI will not intuitively "notice" that data should no longer be used.
Lifecycle makes data manageable, quality makes data trustworthy, and both are prerequisites for scalable analytics and AI usage.
See more details on Deep dive: Lifecycle & Quality later in this document.
Practical usage, detailing the picture
Blueprint of a data organization
Data is something between business and IT; it needs management and rules coming from internal alignment and from external regulations. Ownership is with the business side, while possession and control are with IT. As an explanation, take the analogy of real estate management bridging from house owners to their tenants:
| Real estate | Data world | Comments |
|---|---|---|
| Real estate management | Data Governance organization | umbrella organization |
| Property management | Data Stewardship | represents the owner-centric function; represents owners' interests toward the data organization and acts on behalf of the owner; contracts, rules, value, compliance, tenant relations; in return informs owners about their duties in terms of regulations and data quality |
| Facility / building management | Data Platform / Data Operations / Data Management | represents the object-centric function; operates the physical object under governance constraints; maintenance, safety, services for occupants |
| Caretaker / janitor | Data Custodian | works within facility management (data operations), not property management, as local caretakers enforcing governance rules |
| Property owner | Data Owner (business) | owns the data and is accountable for its quality and usage |
| Tenants | Data consumers / applications | use the data for various purposes, subject to governance rules |
[\[getdbt.com\]](https://www.getdbt.com/blog/data-governance-key-roles), [\[elevano.com\]](https://www.elevano.com/blog/data-quality-roles-and-responsibilities/), [\[datamanagement.wiki\]](https://datamanagement.wiki/data_quality_management_system/roles_and_responsibilities)
The Data Governance organization is tactically and operationally headed by the Head of Data Governance, who reports to the Corporate Data Officer. While the latter should set the strategic direction, those roles regularly collapse into one.
The analogy breaks down because the roles of facility management and data operations do not fully match; facility management is part of real estate management, while data operations is part of the IT organization. A good role name for coordinating Data Custodians is missing; I'll continue with Data Officer here.
Integrating the data organization into an existing business organization strongly depends on its inherent structure, but take this as a blueprint:
- Head of Data Governance and Corporate Data Officer: Head of data organization, represents "data" in C-level meetings.
- Data Officer: Assistant of Corporate Data Officer, acts as representative in business meetings throughout the organization and keeps contact with IT organizations' data custodians.
- External Data Governance: cares for compliance.
- Lawyers are essential to understand the jargon of laws and contracts and to translate them into user stories for implementation in internal rulesets, processes and technology.
- Data specialists are needed as part of External Data Governance to foster understanding of technological options in implementation of rules into applications and data pipelines with a focus on data access models.
- Laws and contracts chatbot: Keeping all related laws, external rules and expectations and all related contracts at hand is a task perfectly suited for an LLM being used as a tool for External Data Governance and for direct use by data teams throughout the organization. Adding rules defined by internal data governance completes the picture.
- Internal Data Governance: cares for efficiency and effectiveness.
- Metadata and Semantics team: Balances effort and value of information about existing internal and external data sources and data products including information on integrating them; responsible for AI-friendly metadata and for human-friendly UI for data retrieval and request.
- Data Stewardship: Manages the community of data owners located in business organizations, owns the process to report and fix data quality issues, acts on behalf of data owners in case of unavailability.
- Technical Metadata crawler: All existing metadata must be reused to avoid duplicate work. This task is mainly technical, needs access to all relevant data sources throughout the company, and can best be performed by a classical crawler reading diverse data sources (databases, APIs, application-specific MCP servers, application-specific metadata and semantics) and providing them as an internal silver-layer metadata repository, e.g. following DCAT.
- The Data Custodian, as a role in every IT solution or product organization, has a (partial) role to ensure compliance with internal and external data governance rules as part of daily work. The Data Officer keeps the Data Custodian updated and supports answering questions by forwarding them into the data organization. Data Custodians act as representatives of the corporate data office inside IT teams, with the mandate to enforce compliance.
- The Data Owner, as a role in business organizations, has a (partial) role to ensure that the business organization's data is represented in the data architecture in a way that enables usage by humans and AI, and that data quality issues get solved. Placing the data owner in the business organization clearly establishes ownership of data and distinguishes it from neighboring organizations' ownership, so that no data is without an owner and no data has unclear ownership. Data owners decide on data access models; Data Stewardship helps and coordinates the data owners and acts on their behalf if needed.
Integrate dominant systems
A data landscape usually contains one or a few dominant systems holding nearly all relevant data. This is the case because all companies share basic business tasks covered by standardized software. The individual software and additional data surrounding that standard software mainly contain what makes a business special; they differentiate the company from its competitors.
Reinventing the wheel and rebuilding the standard software's metadata is not recommended. Instead, all tools provided by the software vendor should be used if all data needed is included in the standard software. Every existing data pipeline, like SAP's CDS views, and all given metadata are a chance to avoid individual effort - but SAP cannot know which modules are used to which extent and which are repurposed to implement a slightly different business case. Integrating SAP and the like means reducing the exposed model to the used portion and renaming and enriching what is misleading from a company perspective. The resulting set of resources needs to be packaged by domain and semantically integrated with additional sources.
A problem is inherent in MCP server metadata: it is descriptive prose, not defined in a metadata model. The large software vendors will provide MCP servers fully describing their data model in their language. Interpreting this prose description automatically will not be possible; nonetheless, a selection, rewriting and integration step is needed. A solution could be an enterprise semantic façade that treats the vendor's MCP server as an external provider: keep the original description where it can be left unchanged, extend it where possible, and rewrite it where needed. This is another reason for the semantifying layer, but it is still a lot of work.
Agile Development - done right
Agile is not necessarily Scrum, and Agile is not chaos. Starting with a pilot and adding use case by use case is a good methodology if the overall picture is clear. First you need a scope and a rough idea of what your data architecture project is about. Then you can extend your minimum viable product (MVP) to increase business value step by step.
DAMA Maturity Scores
The five maturity levels used by Reeve (the original Carnegie Mellon CMM names are in parentheses) are:
- Immature (Initial): The best practice activities are not performed by the organization. The best practice tools are not available or not used.
- Repeatable (Repeatable): Some parts of organization are using recommended tools and processes while other parts are not.
- Managed (Defined): The organization has a documented standard for performing the assessed activity or activities consistently and using applicable tools effectively.
- Monitored (Managed): The process in question is established, tracked and monitored. Recommended tools are in place and are being used consistently across the organization.
- Continuous Improvement (Optimizing): The activity is continually reassessed, improved upon, tracked and built into process.
These maturity scores have turned out to be universally reusable. No matter what type of task you're starting and maturing, these levels apply very well. Don't reinvent the wheel - just use them for your reporting. I like to simplify the definitions to the following wording:
- You tried and succeeded, but you don't know exactly why.
- You regularly succeed, but you're not able to explain yet.
- You can explain your solution, but others have other solutions.
- Solutions are aligned to achieve the best solution.
- Solutions are regularly updated to keep them current.
Deep dive: Lifecycle & Quality
(by Microsoft Copilot)
Data architecture is not complete once data is technically integrated and made accessible. Data has a lifecycle, and throughout this lifecycle its quality determines whether it can be trusted, reused, and safely automated — by humans as well as by AI. Ignoring lifecycle and quality leads to hidden costs, erosion of trust, and eventually to data products that exist but are no longer used.
This chapter introduces pragmatic concepts for data quality, data lifecycle management, and the deprecation of data products, without assuming heavy usage of tools or formal maturity models.
Data Quality Dimensions
Data quality is not a single property. It consists of multiple dimensions, each describing a different aspect of how well data represents reality and how suitable it is for a given use case. Importantly, quality is contextual: data that is "good enough" for one purpose may be insufficient or even misleading for another.
Common and practical data quality dimensions include:
- Accuracy: Data correctly reflects the real-world object or event it represents. For example, a customer's address matches their actual address at the time of use.
- Completeness: All required data is present. Missing values may be acceptable in exploratory analysis but critical in operational or regulatory contexts.
- Timeliness: Data is up to date relative to its intended use. Near-real-time data may be essential for operations, while monthly snapshots may be sufficient for strategic reporting.
- Consistency: The same information does not contradict itself across systems or data products. Inconsistencies often arise from parallel data maintenance or uncontrolled transformations.
- Validity: Data conforms to expected formats, ranges, and business rules. Examples include valid dates, allowed value ranges, or correct reference data usage.
- Uniqueness: Entities are represented once and only once where intended. Duplicate records often distort aggregates and mislead AI-based reasoning.
Not all dimensions must always be optimized. Declaring which dimensions matter for a data product - and why - is more important than achieving theoretical perfection. This declaration is part of the data product's metadata and a prerequisite for trust.
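A minimal sketch of such a declaration, with an invented data product, reasons and thresholds, could look like this:

```python
# Sketch: declaring which quality dimensions matter for a data product and why,
# as part of its metadata. Dimension names follow the list above; the data
# product, reasons and thresholds are illustrative.
QUALITY_DECLARATION = {
    "data_product": "regional-sales",
    "dimensions": {
        "accuracy":     {"matters": True,  "why": "used for bonus calculation"},
        "completeness": {"matters": True,  "why": "all regions must report",
                         "threshold": "100% of regions"},
        "timeliness":   {"matters": True,  "why": "monthly closing",
                         "threshold": "available by workday 3"},
        "consistency":  {"matters": True,  "why": "must match the finance ledger"},
        "validity":     {"matters": True,  "why": "currency codes per ISO 4217"},
        "uniqueness":   {"matters": False, "why": "duplicates prevented at source"},
    },
}
```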
Data Quality as a Process, not a State
Data quality is not something that can be "fixed once". It evolves as:
- source systems change,
- business definitions shift,
- new use cases emerge,
- and AI systems start combining data in unforeseen ways.
Therefore, data quality must be treated as a continuous process, not as a static checklist. Key elements of such a process include:
- Clear ownership: Every dataset and data product needs an accountable owner who understands its meaning, limitations, and business impact.
- Feedback channels: Users must have a simple way to report data quality issues. Without a backchannel, problems remain hidden and trust erodes silently.
- Transparency over perfection: It is often better to expose known limitations explicitly than to hide them behind polished dashboards or AI answers.
- Automation where possible, human judgment where necessary: Simple quality checks can be automated (a minimal sketch follows below), but many quality issues require contextual understanding and cannot be resolved without human intervention.
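A minimal sketch of such automated checks, assuming pandas and invented column names and rules; everything the checks cannot decide goes to the feedback channel:

```python
# Sketch: simple automated checks for the declared dimensions; anything the
# checks cannot decide goes to the feedback channel for human judgment.
# Column names and rules are illustrative.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    return {
        "completeness_region": df["region"].notna().mean(),   # share of filled values
        "validity_amount": (df["amount_eur"] >= 0).mean(),    # business rule: no negatives
        "uniqueness_order_id": df["order_id"].is_unique,      # duplicates distort aggregates
    }

orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "region": ["EMEA", None, "APAC"],
    "amount_eur": [100.0, 250.0, -5.0],
})
print(quality_report(orders))
# {'completeness_region': 0.66..., 'validity_amount': 0.66..., 'uniqueness_order_id': False}
```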
Data Lifecycle
Every piece of data follows a lifecycle, even if it is not explicitly managed. Making this lifecycle explicit helps reduce risk, control cost, and support reuse.
A simplified data lifecycle consists of the following phases:
- Creation: Data is generated or captured, typically in operational systems, external feeds, or manual processes. At this stage, data reflects a local view and often lacks broader business semantics.
- Processing & Transformation: Data is cleaned, enriched, integrated, and transformed into forms suitable for analytics, reporting, or AI usage. This is where much of the business value is added - but also where errors can propagate if lineage and semantics are unclear.
- Usage: Data is consumed by humans (reports, analyses, decisions), applications, or AI systems. Usage patterns often reveal new quality issues or new requirements that were not anticipated during design.
- Retention & Archival: Data that is no longer actively used may still need to be retained for legal, regulatory, or historical reasons. At this stage, accessibility requirements typically decrease, while integrity and traceability remain important.
- Deletion: When data is no longer needed and retention obligations expire, it should be deleted in a controlled and auditable way. Deletion is part of responsible data governance, not an afterthought.
Not every dataset needs to pass through all phases with equal intensity. However, every data product should clearly state which lifecycle stage it is in and how transitions are managed.
Lifecycle Awareness in Data Architecture
Lifecycle thinking influences architectural decisions in several ways:
- Separation of raw and derived data: Original data should remain immutable wherever possible, while derived data can be recalculated or replaced as logic evolves.
- Explicit validity periods: Data products should communicate whether they represent a current state, a historical snapshot, or a scenario-based view.
- Cost-aware persistence: Persisting data indefinitely "just in case" increases cost and risk. Lifecycle-aware architectures persist data intentionally and transparently.
- AI-readiness: AI systems are particularly sensitive to outdated, inconsistent, or context-less data. Lifecycle metadata helps prevent silent misuse.
Deprecation of Data Products
Just like applications, data products reach an end of life. Failing to deprecate obsolete data products leads to:
- confusion among users,
- incorrect analyses,
- duplicated effort,
- and increased maintenance cost.
Deprecation should be treated as a first-class process, not as an informal decision.
A pragmatic deprecation approach includes:
- Announcement: Clearly communicate that a data product will be deprecated, including reasons and timelines.
- Successor identification: If possible, point users to a replacement data product or an alternative way to obtain the required information.
- Grace period: Allow sufficient time for consumers to migrate, depending on criticality and usage patterns.
- Status metadata: Deprecated data products should remain findable but clearly marked as such, including warnings for AI systems.
- Eventual removal: After the grace period, access should be removed or restricted to avoid accidental use.
For AI-enabled environments, deprecation metadata is especially important: an AI agent will not "notice" that a dataset is outdated unless it is explicitly told so.
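A minimal sketch of such deprecation metadata, with invented field names and dates, that an AI mediation layer could evaluate before answering:

```python
# Sketch: lifecycle/deprecation status as explicit metadata, so an AI agent
# can be told (rather than expected to notice) that a data product is obsolete.
# Field names, dates and products are illustrative.
DEPRECATION_STATUS = {
    "data_product": "regional-sales-v1",
    "status": "deprecated",               # active | deprecated | retired
    "deprecated_since": "2026-01-01",
    "removal_planned": "2026-07-01",      # end of the grace period
    "successor": "regional-sales-v2",
    "note": "Do not use for new analyses; historical comparisons only.",
}

def usable_for_new_analysis(meta: dict) -> bool:
    """An AI mediation layer could refuse or warn based on this flag."""
    return meta["status"] == "active"
```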
Quality, Lifecycle, and Trust
Ultimately, lifecycle management and data quality serve a single purpose: trust.
- Trust that data represents reality well enough for its intended use.
- Trust that limitations are known and communicated.
- Trust that obsolete or misleading data is not silently reused.
- Trust that AI-generated answers are grounded in valid, current, and well-understood data.
A mature data architecture does not eliminate uncertainty — it makes uncertainty visible and manageable.
Deep dive: Data modelling
Data modelling is a topic on which many authors have written a great deal over the last decades, and it is a topic where most applications fail to deliver decent results. I clearly object to the simplified assumption that it is sufficient to dump the object model needed to run an application into a relational model generated by an object-relational (OR) mapper. If you follow this approach, you will end up with data models that are not suitable for direct data analysis and integration into data pipelines. The model to store the data and the model to run your application differ in purpose, which results in different (but equivalent) modelling. As the business value lies in the data while the application's purpose is to interact with it, I recommend investing more time in a durable data model that is consistent with the data models already existing in the company. As a result, the application will fit better with the understanding of data entities in other applications.
This section intentionally does not converge on a 'best model'. Architecture is the discipline of choosing trade-offs consciously, not eliminating them.
Why "model"?
Reality is a difficult concept. Even the part of reality we can grasp with our senses and imagination is complex, complicated, fractal, interdependent, differentiated and so on. No computer system can represent a part of reality in its full extent and still perform well - modelling is needed to reduce complexity to a degree that is fit for purpose. Even within one application, the data model mediates between the data representations in the business application, database, API, user interface, data provisioning, business data model and more. A data model is neither a platform nor a single schema; it is the counterpart of the user's impression or mental model of an application.
Always model with transformation in mind. Separate original data from derived data; original data needs backup of the data itself, derived data needs backup of the derivation logic. Keep original identifiers as references to report on lineage. Automate transformations.
Model to purpose
As the data models always need to serve a purpose, there are different types of data models serving different types of purposes:
- Normalized relational data models for online transactional processing (OLTP) reduce redundancies and thereby simplify data maintenance. Changing a single value in a single table row automatically changes this value everywhere in the application. Whether a value in a related table is as-of-now or historic is a decision made in modelling.
- Star or snowflake data models for online analytical processing (OLAP) keep the history of master data in so-called slowly changing dimensions (SCD) to allow analyzing several scenarios, e.g. as of today or in historic structures (a small sketch follows after this list). Hierarchies are mostly flattened by introducing redundant data to increase performance. Fact tables refer to historical master data using surrogate keys.
- Tabular data models for OLAP flatten all hierarchies and master data into the fact tables. Columnar database setups optimize these highly redundant representations into graph-like in-memory models to speed up analysis further.
- Graph data models are optimized for in-memory usage by reducing referential integrity to memory pointers. Hierarchical data and reasoning are perfect examples for implementation as a graph. Ontologies for semantics and knowledge graphs that integrate ontologies and factual data profit from the graph representation, which speeds up hierarchical queries. Storing data as triples (subject, predicate, object) directly represents the graph structure of nodes and edges but hampers the ability to edit data without repeating changes throughout the whole storage.
- Class-oriented graph data models add a level of structure to pure object-oriented data models. An object can in principle have any type of relationship or attribute; this allows very detailed modelling to get as close to reality as possible. Classes reduce this to a defined number of relationships and attributes to enable batch processing of equivalent objects.
- Schema-based object models follow a similar approach in plain object models like JSON without introducing graph concepts. A schema defines the allowed and required structure a given object has to follow. This is equivalent to data in classes and objects in programming languages, except for the methods and events defined there. Storing class-based objects from object-oriented programming languages is basically schema-based before they are mapped into a data model for persistence.
- Document models (e.g. JSON) are representation forms that may encode object-oriented, relational, or graph semantics depending on schema discipline and usage.
- Key-value models or generic data models avoid explicit data modelling as a structure and move it to metadata describing the used "keys". Essentially this is an unpivoted representation of a sparse tabular model; it is a decision about representation, not about modelling.
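The slowly changing dimension mentioned in the OLAP bullet above can be sketched as follows; the tables and values are invented, and pandas stands in for any relational engine:

```python
# Sketch of a slowly changing dimension (type 2): master data changes create a
# new row with its own surrogate key instead of overwriting, so facts can refer
# to the structure valid at their point in time. Table content is illustrative.
import pandas as pd

dim_customer = pd.DataFrame([
    {"customer_sk": 1, "customer_id": "C42", "region": "EMEA",
     "valid_from": "2020-01-01", "valid_to": "2024-12-31"},
    {"customer_sk": 2, "customer_id": "C42", "region": "APAC",
     "valid_from": "2025-01-01", "valid_to": "9999-12-31"},
])

fact_sales = pd.DataFrame([
    {"customer_sk": 1, "amount_eur": 100.0},   # booked while the customer was EMEA
    {"customer_sk": 2, "amount_eur": 250.0},   # booked after the move to APAC
])

# "Historic structures" analysis: join on the surrogate key, not the business key.
print(fact_sales.merge(dim_customer, on="customer_sk")[["region", "amount_eur"]])
```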
In addition to these technology-based purposes, data models differentiate viewpoints [DoDAF Viewpoints and Models]:
- Physical data models represent data as persistent to a storage medium. Technical necessities like exact data formats, charsets, encryption, cardinality of relations and so on dominate here; technical restrictions of the storage technology (e.g. SQL Server) used are to be considered.
- Semantical or business data models are not restricted by technological boundaries; they describe data in a generalized way across multiple implementations to define business logic. High-level artifacts like data domains and ownership should live on these models.
- Logical data models bridge physical data models to the overarching business model. In theory the use case-oriented logical model is derived from the generalized business model to survive across multiple physical implementations of an application. Practically, the mapping between physical models and business models is done after the fact and needs clear consideration on effort invested vs benefit expected.
More aspects to differentiate models are:
- Integrative models that combine data from several sources can be minimal, containing only data available in all sources, or sparse, including all data that could be available in any source.
- Pivoting allows us to decide whether information is represented in rows or in columns. Reduced detail allows simple representations as columns, while a need for rich detail is served best by rows (relational models). In object-oriented models the difference is whether an attribute is a simple value or a nested object or structure.
- Time series represent the development of information over time. While they are essential to detect deviations and patterns in repeated measurements, aggregation and reduction to single values is often needed for follow-up analyses.
- Time-boxed data, like statistical aggregations at monthly level, speeds up analysis but allows different types of errors such as missing or mismatched periods. Slowly changing data typically results in non-standard time boxes representing the period in which a data entry was valid.
- Multi-language texts complicate master data and result in loss of data if implemented or queried wrongly, but they are essential for human data usage in multinational enterprises, especially at working level.
- Aggregated or derived data vs. raw data: while raw or original data must be seen as immutable for a certain timestamp, all aggregated data follows business logic. These data models should always be filled automatically to enable changing the business logic later. Data entry models usually contain only the entered data and some metadata about time, agent and place of entry - they need enrichment with master and hierarchical data before being transformed for analytical usage.
Derived from data models several specialized representations of data serve metadata purposes:
- Data landscapes reduce the logical layer to show systems per data domain as an overview.
- Lineage models show the information flow from raw data collection down to data analysis to ensure data quality along the way, fostering trust in data transformations.
- Data (product) catalogs show the data that is available for consumption in a reduced way with explanations for business users and AI.
Define purpose before starting to model. Results will differ.
Semantic models
The basic assumption for every semantic model is to have a single entity, name and identifier for a certain meaning. Inherent duplicates like "vendor" vs "supplier" need to be reduced. On the other hand, there must be a separate entity if the meaning differs. Relationships explain whether, in the current context, "charge" and "batch" are equal, or what their relationship is. Building those semantic (or business or conceptual) models is mainly the effort of understanding an area of interest to the extent that all business terms and their relations are explained.
For an increasing number of knowledge areas of general interest, public ontologies exist. Reuse them to standardize and to minimize effort (see References for selected ontology catalogs).
In practice, there is rarely "one" general ontology for e.g. electrical engineering. Instead, the common pattern is:
- Upper Ontology (SUMO / BFO / gist)
- Mid-level / Core Ontology (CCO, Engineering Ontologies)
- Domain Ontology (Energy, Electronics, Power Systems, IoT, etc.)
This Domain Ontology should be linked to data sources e.g. using RDF Mapping Language [RML; Dimou, 2024] to
enable assigning data to business terms, thus enabling us to create knowledge graphs containing semantified
data.
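RML itself is written declaratively in Turtle; as a minimal illustration of its effect - source rows becoming instances of domain ontology concepts in a knowledge graph - here is a hand-rolled mapping with rdflib. The ontology IRIs and data are made up.

```python
# Sketch: the effect of a mapping like RML - rows from a physical source become
# instances of domain ontology concepts in a knowledge graph. The mapping here
# is hand-rolled for illustration; IRIs and data are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/ontology/")   # hypothetical domain ontology
DATA = Namespace("https://example.org/data/")

rows = [{"id": "TR-1", "type": "transformer", "rated_kva": 630}]

g = Graph()
for row in rows:
    node = DATA[row["id"]]
    g.add((node, RDF.type, EX[row["type"].capitalize()]))   # map source type to concept
    g.add((node, EX.ratedPower_kVA, Literal(row["rated_kva"])))
    g.add((node, RDFS.label, Literal(row["id"])))

print(g.serialize(format="turtle"))
```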
Relational data models
(by Microsoft Copilot)
Relational data models represent information as relations (tables) consisting of rows (tuples) and columns (attributes). Each table describes a set of entities of the same kind, and each row represents one instance of such an entity at a given point in time. Columns define the attributes that are considered relevant for the modeled purpose and are typed according to a defined domain (e.g. number, text, date).
A core principle of the relational model is the use of keys.
A primary key uniquely
identifies each row in a table and provides stability for referencing. Foreign keys establish
relationships between tables by referring to primary keys in other tables, enabling controlled navigation and
consistency across related data. These constraints are not merely technical constructions; they encode
assumptions about identity, ownership, and valid combinations of information.
Relational models are particularly strong in maintaining consistency and integrity of data. Concepts such as referential integrity, uniqueness, and constraints ensure that data remains internally coherent even under concurrent access and frequent updates. This makes relational models well suited for transactional systems, where correctness, traceability, and controlled change are more important than flexibility or expressiveness.
Another defining characteristic is normalization. By reducing redundancy and separating concerns into multiple related tables, normalized relational models minimize update anomalies and clarify responsibility for data maintenance. The cost of normalization is increased complexity for retrieval, which is typically addressed by joins, views, or downstream denormalized representations.
From a data architecture perspective, relational models are excellent maintenance and integration models, but they are rarely optimal consumption models. Their structure reflects rules of data consistency rather than business semantics or analytical convenience. As a result, relational schemas often require transformation, enrichment, and semantic annotation before they can be safely reused for analytics, reporting, or AI-driven consumption.
In this sense, relational data models should be understood as one representation among many: a durable and disciplined foundation for data persistence and integration, but not the final form in which data delivers value. Downstream models—analytical, semantic, or graph-based—build on relational sources while shifting the focus from consistency to understandability, comparability, and purpose-driven usage.
Object models
Object models stem from internal representations in object-oriented languages. Their main structural elements are a deep structure including complex data types (instances of classes themselves) and the removal of the separation between data and code. Pure object models describe individual objects, resulting in rich descriptions without comparability. This is why they are rare - usually object models consist of instances of classes, where classes form an abstraction layer defining the attributes and methods common to all instances of the class.
Inheritance is a typical element introduced with object models. It is a relationship between classes defining that a subclass is a specialization of its parent class; in effect, the subclass follows all rules that apply to the parent class but may extend the class definition with additional rules (attributes, methods, strict enumerations etc.). Do not mix this up with instantiation, which is the creation of an individual object based on a class definition. An additional concept appearing here is polymorphism, where a class is defined in relation to another class not explicitly fixed while coding ("list" is polymorphic and can be instantiated as a list of strings, a list of Apples, a list of lists etc.). Subordination, defining a class as being part of something, is not inheritance: an apple inherits from fruit but does not inherit from fruit salad. Subordination is rarely defined as an element in object-oriented programming languages, but it regularly appears as a concept in data modelling.
Graph models and hierarchies
Graphs (in most cases: directed acyclic graphs, DAGs) are a means to represent relationships. Thus, they fit ideally into semantic and object models, playing out their strength in hierarchies and reasoning. The design principle in graphs is that a node refers (has an edge) to another node. Graph languages ease querying across multiple references, allowing deep queries where SQL hits its limits quickly (recursive queries and hierarchical data types are possible but slow). Typical use cases are NP-complete problems such as the Traveling Salesman path optimization pattern.
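As a sketch of why traversal-style queries fit graphs, the following Python fragment answers a reachability question over an adjacency list in a few lines; the node names are illustrative, and a graph language would express this as a single path pattern.

from collections import deque

# Directed acyclic graph as adjacency list: node -> nodes it has edges to.
edges = {
    "plant":     ["line_1", "line_2"],
    "line_1":    ["machine_a"],
    "line_2":    ["machine_b"],
    "machine_a": ["sensor_1"],
    "machine_b": ["sensor_2"],
}

def reachable(start: str) -> set:
    """All nodes reachable from start, regardless of how many hops away."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable("plant"))   # all lines, machines and sensors below the plant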
Comparing terminology
(by Microsoft Copilot)
Terminology differs across modeling paradigms. Using this table as a Rosetta Stone helps bridge misunderstandings between semantic, relational, object-oriented, and graph perspectives without assuming structural equivalence.
| Conceptual meaning | Semantic / Ontology | Relational | Object-oriented | Graph |
|---|---|---|---|---|
| Thing of interest (instance) | Individual | Row (Tuple) | Object | Node |
| Generalized type | Class | Table (Relation) | Class | Label / Node Type |
| Property / attribute | Property | Column | Attribute / Field | Property |
| Identifier | IRI / URI | Primary Key | Object ID | Node ID |
| Value | Literal | Cell value | Attribute value | Property value |
| Relationship | Object Property | Foreign Key | Reference | Edge |
| Relationship type | Predicate | FK constraint / join | Association | Edge type |
| Cardinality | Ontology restriction | Cardinality constraint | Multiplicity | Edge multiplicity |
| Inheritance | rdfs:subClassOf | Table inheritance / discriminator | Class inheritance | Label hierarchy |
| Classification | rdf:type | Type column | Class membership | Label assignment |
| Enumeration | Code list / SKOS | Lookup table | Enum | Node set |
| Constraint | Axiom | Constraint | Validation logic | Constraint / pattern |
| Schema definition | Ontology | DDL | Class definition | Graph schema (optional) |
| Query language | SPARQL | SQL | OQL / API | Cypher / Gremlin |
| Semantics | Explicit, formal | Implicit | Implicit | Partial / emergent |
| Reasoning | Logical inference | None | None | Path traversal |
| Typical purpose | Meaning, integration, AI | Persistence, integrity | Behavior, encapsulation | Relationships, traversal |
This table does not imply equivalence, but functional correspondence. Each column represents a modeling paradigm optimized for a different purpose. Semantic models prioritize meaning and inference, relational models prioritize integrity and persistence, object-oriented models prioritize behavior and encapsulation, and graph models prioritize relationships and traversal. Misunderstandings arise when terms are treated as literal translations instead of contextual analogies.
Object-oriented data modeling
Trying to combine the strengths of the various types of data modeling, object-oriented data modeling defines common ground for translating data models by introducing object-oriented concepts into relational data modeling. Object-relational (OR) mapping is a similar approach, but it takes the object model as given and automatically derives the relational model from it. The idea of object-oriented data modeling is to define a model upfront that is compatible with object orientation and with relational databases (and hopefully with graphs, too).
Elements of Object-oriented data modeling [see References]:
- Objects: The real-world entities and situations are represented as objects in the object-oriented database model - and as rows in the relational data model.
- Attributes and Methods: Every object has certain characteristics. These are represented using attributes. The behavior of the objects is represented using methods. Simple attributes are represented by individual columns, while complex attributes are represented by a set of columns or by sub tables (depending on cardinality). Whether methods need to be implemented in SQL depends on solution architecture.
- Object references: Objects referring to other objects e.g. to implement subordination are pointers in object models, edges in graph models and foreign key relations in relational data models.
- Classes: Similar attributes and methods are grouped together using a class. An object can be called an instance of the class. Every class is represented as a table in the relational data model.
- Inheritance: A new class can be derived from the original class. The derived class contains attributes and methods of the original class as well as its own. In the relational data model, the derived class is a separate table sharing a 1:1 relationship based on a common primary key containing additional columns only. As this additional table only contains identifiers of sub class entities, an inner join reduces the number of rows to the number of sub class entities.
The object-oriented data model aims at bridging the semantic gap between relational tables and entities of the real world through objects that directly correspond to entities. An object has a unique and immutable object identifier, and it belongs to a class. Thus, bidirectional mapping between modeling paradigms is possible with few compromises. Metadata is needed to persist decisions made during transformation.
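A small illustrative sketch of the inheritance mapping described above, with the derived class stored as a separate table joined 1:1 on the shared primary key; the entity names are invented for the example:

# Parent class "Partner" and derived class "Customer" as two tables sharing the primary key.
partner = [                                    # one row per partner (all instances)
    {"partner_id": 1, "name": "ACME"},
    {"partner_id": 2, "name": "Jane Doe"},
]
customer = [                                   # only rows for partners that are customers
    {"partner_id": 1, "credit_limit": 50_000},
]

# Inner join on the shared key reduces the result to the sub-class entities only.
customers_full = [
    {**p, **c}
    for p in partner
    for c in customer
    if p["partner_id"] == c["partner_id"]
]
print(customers_full)   # [{'partner_id': 1, 'name': 'ACME', 'credit_limit': 50000}]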
Bridging models
Understanding that models always represent information for a purpose directly leads to understanding that data needs to be transformed and enriched, with data from other sources and with metadata, to serve different purposes. Following the paradigm that original data needs to be backed up while transformations need to be automated, it is important to build bridges between data models and to document the decisions made. This results in the information flow or lineage of data, documenting all transformations from data entry to analysis and thereby building trust in data pipelines.
Typical transformations are:
- Renaming columns: While names seem unimportant, they often bear meaning and are needed for identification. If e.g. a column named "average mass in kg" is renamed to "weight", misunderstandings happen whenever semantics and metadata are not transported accordingly, eventually leading to the assumption that weight was measured in pounds or Newton.
- Moving column entries to metadata: In the above context a typical transformation is to reduce a set of columns in a measurement (number of entities measured, measured mass, unit of mass) to a single column (mass in kg). Naming, semantics and metadata are again essential to avoid loss of information.
- Joining sources: Whenever data is reduced from several sources or tables into one result, a multitude of problems can occur: loss of entries due to inner joins, empty entries due to unresolved outer joins, multiplication of entries due to different granularity. Explicitly defining referential integrity helps to reduce those problems inside single data models - ontologies help to reduce them when integrating data sources.
- Pivoting: Making column entries into columns and vice versa is a very common transformation resulting in a plethora of issues: new column entries resulting in loss of data or new columns, multiple entries per category needing aggregation, missing entries leading to empty columns. Generally, only pivot categorized columns! If a column does not have an enumeration or fixed lookup table, a defined pivot is not possible. The same applies to unpivoting: generate an enumeration or lookup table defining column content by referential integrity.
- Flattening parent-child hierarchies: The biggest strength of graph, semantic, or parent-child hierarchies is that the number of hierarchy levels is flexible and may include skipped levels or children at different levels. In organizational hierarchies this is easy to see: a level 2 manager may have level 3 and level 4 managers and individual assistants as direct reports. To flatten such hierarchies into fixed columns for reporting, e.g. in Power BI using a tabular data model, decisions must be made on how to identify the level of an entry and how to cope with skipped levels. The type of entry helps with identifying levels, and skipped levels may be NULL or repeated from lower levels (both ways are supported in reporting applications); a sketch follows after this list.
If those decisions are not made upfront, they will be taken ad hoc by the developer, leading to inconsistent behavior. Not taking these decisions at all is technically impossible; if left open, they are taken implicitly by the toolkit in use.
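A minimal sketch of the flattening decisions just described, assuming an illustrative organizational hierarchy and choosing the "empty (NULL) level" option for padding:

# Parent-child hierarchy (org chart); None marks the root. Names are illustrative.
parent_of = {
    "ceo": None,
    "vp_sales": "ceo",
    "assistant": "ceo",        # reports directly to a high-level manager
    "na_sales": "vp_sales",
}

MAX_LEVELS = 4   # fixed columns required by the tabular reporting model

def flatten(node: str) -> dict:
    """Path from root to node, padded with None for unused (or skipped) levels."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    path.reverse()                                   # root first
    path += [None] * (MAX_LEVELS - len(path))        # decision: empty instead of repeated
    return {f"level_{i + 1}": v for i, v in enumerate(path)}

for name in parent_of:
    print(flatten(name))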
Special cases in data modeling
- Role-playing relations: When referring to another object, you often see the name of the foreign object's identifier (e.g. CustomerID) as the reference attribute name. This implicitly assumes that the role the foreign object plays in your object is clear. In the case of order positions, it might be clear that OrderId relates to the order the position is part of. In most other cases, even in the case of an order's customer, the relationship is clear during modeling but not when data is queried - it could be the customer who ordered or the customer who is paying, if they deviate. Naming conventions help; in the case of object references, always use a verb and the subject for naming, e.g. orderedByCustomerId. lowerCamelCase clearly identifies it as a role-playing relation if attributes are PascalCase as the standard for attribute naming. Doing so reduces the need for additional metadata and avoids misunderstandings.
- Natural, surrogate, multi-column or concatenated keys: Natural keys are best if they are unique, stable and short. Towns' short names as part of number plates are a good example, as is "en-us" as the language code for English (United States). Using such keys makes referring tables readable without making any compromises. Inventing such keys is only a good idea if you are sure they will remain mostly stable - any change results in refactoring all referring data including historical records. Providing successor rules avoids this but results in cluttered master data. If a single natural key is not enough to identify a row, database models tend to extend the primary key with additional sub keys like order id and order position id. These multi-column keys are logical in many cases, but with four or more parts they inherently invite incorrect joins that miss single, rarely used columns like a row version id. Concatenated keys can help in downstream models by providing a canonical version of the multi-column key concatenated into one column. This is still readable in the referring table but ensures completeness of joins. In original tables, or when too many columns are involved, this methodology reaches its limits. If no logical key exists, as in time-boxed changelogs or slowly changing dimensions, or if too many columns are needed for identification, surrogate keys help identify rows - they can be running numbers (row identifiers), hashes (compressed combinations of all relevant columns) or globally unique identifiers (GUID), depending on purpose. In any case, the referred-to object is not comprehensible in the referring table, so use them with caution. GUIDs on the other hand allow unique identifiers across tables and systems, clearly identifying a semantic entity throughout the whole data landscape. There is no simple answer on how to define keys.
- Slowly changing dimensions (SCD): Master data changes over time - it does not change often, but it does change. To enable historical accuracy and scenarios like "as-of today", OLAP models introduced the concept of slowly changing dimensions. Multiple types have been defined, but the basic element is to keep all versions of truth. This results in time-boxed data, at minimum extending the natural key by a date entry identifying from which timestamp onwards an entry is valid - adding the related timestamp to fact data would then be sufficient to identify the related version of master data. Unfortunately, it is very inefficient to look up master data by derived time-boxes for every related dimension. Surrogate keys help: each version of master data gets an internal number, and the technical reference from fact data to dimensional data uses this surrogate key as a single-column foreign key relation (see the sketch after this list). Scenarios can be added by referencing the scenario version (e.g. as-of today) from the versioned master data. There is a pitfall in doing so: some data warehouses of the past collected historical data in the warehouse only; this made the data warehouse a source of original data that could not be retrieved from anywhere else, resulting in potential data loss. Always collect historical data in a separate archive system that is fast to write, not necessarily fast to read, but has a sound backup.
- Multi-language texts: The standard way of implementing multi-language texts in relational data models is to move all text entries into a language-related sub table. The primary key of this table is extended by language, and it is referenced when displaying text. This always leads to data loss when queried with the assumption that the needed language is maintained - by filtering to the user's language, all entries that do not contain texts in this language are removed from the result, e.g. not contributing to aggregated numbers. Always be cautious when querying language texts; a left join to a filtered sub table avoids loss of data, and providing a default language text in the main table avoids empty text fields. Object models or extended relational models allowing objects as column types (e.g. containing languages as attributes, as possible in PostgreSQL) avoid these problems by design.
- Categorical data: As already mentioned for pivoting, it is essential to provide a list of valid entries if an attribute or column may only contain certain values. It does not matter whether you consider this list of valid entries as metadata, as in enumeration data types, or as a lookup table, as in the relational model; it is only important that the model disallows any entry not listed there. Those category values are typically as relevant to semantic mapping as column names, and considering pivoting, they can end up being column names.
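The following sketch illustrates the surrogate-key mechanics from the slowly changing dimensions bullet above; all identifiers and values are invented for the example:

from datetime import date

# Versioned dimension: every change creates a new row with its own surrogate key.
cost_center_dim = [
    {"cc_sk": 1, "cc_id": "CC-100", "valid_from": date(2023, 1, 1), "manager": "Alice"},
    {"cc_sk": 2, "cc_id": "CC-100", "valid_from": date(2025, 7, 1), "manager": "Bob"},
]

# Fact rows carry the surrogate key of the version valid at booking time.
facts = [
    {"posting_date": date(2024, 3, 15), "cc_sk": 1, "amount": 120.0},
    {"posting_date": date(2025, 9, 1),  "cc_sk": 2, "amount": 80.0},
]

dim_by_sk = {row["cc_sk"]: row for row in cost_center_dim}

for fact in facts:
    version = dim_by_sk[fact["cc_sk"]]       # single-column join, no time-box lookup needed
    print(fact["posting_date"], version["manager"], fact["amount"])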
Data access models
If you ask upper management, you easily get stock answers like "the data belongs to the company". This is not true except from a legal perspective. You will not get accepted models for data access unless you clarify data ownership and the business value of data (both in loss and in usage).
First of all, if data describes a (natural or legal) person, this person owns the data; this is legally clear. If data is provided by a person, it should be owned by this person; looking at patent rules, this should be clear, too. If data is from external sources, it is owned by the external company; this is clear by contract. If data is collected by automated processes, the owner of the process or of the technological component responsible for it should stay the owner. All of this is true but not helpful.
A data owner should take ownership and feel responsible for quality, availability, safety and security of data. This is an active role not defined by the process of creation - otherwise people leaving the company would take their ownership with them. The data owner must respect original, legal or external ownership and consider it in any action. In doubt, the data owner must give data back and remove it from the system of truth and from all downstream systems, unless the data has been transformed into something that no longer reveals its original source (e.g. statistics). For internal purposes, the data owner is the only relevant contact when it comes to decisions about data usage - and for any complaints.
As part of this role, the data owner must define a data access model. The alternative would be an individual decision for every request, which is not suitable for AI models or mass data analyses, and which is neither reliable nor reproducible. If data is classified as "internal", access must be given to all employees; if it is classified as "confidential", clear reasoning is required why data must be kept from certain employees or why a thorough log of data access is needed (legal or contractual reasons, pending patents, high business value, …). Even for confidential data it is highly recommended to define attribute-based access controls (ABAC), enabling a computer system to automatically grant access based on employees' master data. Only by doing so can data be made available without delay, which is essential for AI use of data. Even project-oriented or time-boxed individual access can be set in master data lookup tables, enriching automated access and thus adding a layer of protocol to the grants given. Human processes to request and grant access are outdated since the advent of reporting layers and AI.
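As a rough sketch of such an automated, attribute-based decision, assuming illustrative attribute names and a hypothetical policy structure:

# Employee attributes come from master data (HR / IAM), not from manual grants.
employee = {"user": "alice", "department": "Finance", "region": "NA", "employment": "internal"}

# Data-owner-defined policy for a dataset classified as confidential (illustrative).
policy = {"classification": "confidential", "departments": {"Finance"}, "regions": {"NA"}}

def abac_allow(emp: dict, pol: dict) -> bool:
    """Grant access automatically when the employee's attributes satisfy the policy."""
    if pol["classification"] == "internal":
        return emp["employment"] == "internal"          # internal data: all employees
    return (emp["department"] in pol["departments"]
            and emp["region"] in pol["regions"])

print(abac_allow(employee, policy))   # True: no human approval step needed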
Contrarily, data contracts are meant to be explicitly signed before data usage. This blocks data usage and eventually results in an AI using outdated publicly available data instead of current data from internal systems. Use data contracts with caution and, if you do, ensure they are signed before data is queried. The information that access has not been granted yet might be blurred by an AI trying to find an answer anyway. My recommendation is to avoid them unless they are legally required.
Integrating it all
As a result, there are original data sources in barely comprehensible data models, data pipelines transforming data in multiple ways, and analyses, including ad hoc analyses, accessing data through the available data access models. Ideally, there is a silver-layer domain-driven data model with an ABAC data access layer in the middle. Data architecture is tasked with understanding and continuously developing this.
Collecting all metadata upfront is prohibitive due to the trade-off between upfront effort and downstream benefit. A compromise would consist of:
- Use what's there: All data sources provide some kind of metadata; at least a technical description of the data provided can be retrieved.
- Document machine-readable: If you are developing data pipelines, you usually write down what you do in some documentation on Confluence or similar text-based tools. Instead, choose an implementation that integrates transformation and documentation and that allows machine analysis and transformation of metadata (a sketch follows below).
- Add semantics when needed: Whenever a problem with interpretation occurs, extend metadata by explanation, semantics or whatever is missing. Thus, an increasingly complete metadata repository is developed using agile methodology.
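One possible shape of machine-readable documentation, keeping the description of a transformation step next to the code that performs it; the metadata keys are illustrative only:

# Transformation metadata kept with the code, not in a separate wiki page.
STEP_METADATA = {
    "name": "normalize_mass",
    "input": {"column": "average mass in kg", "unit": "kg"},
    "output": {"column": "mass_in_kg", "unit": "kg"},
    "semantics": "renamed only; no unit conversion performed",
}

def normalize_mass(row: dict) -> dict:
    """Apply the renaming documented in STEP_METADATA."""
    row = dict(row)
    row[STEP_METADATA["output"]["column"]] = row.pop(STEP_METADATA["input"]["column"])
    return row

print(normalize_mass({"average mass in kg": 0.2}))   # {'mass_in_kg': 0.2}

Because the metadata is a plain structure, it can be harvested by a crawler and fed into the metadata repository together with the lineage of the pipeline.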
As this topic is still new, there is no standard tool yet that integrates all sources of metadata into a single MCP-type view.
Deep Dive: Model Context Protocol (MCP)
While this document generally avoids technology, MCP is a key framework worth a deep dive.
Semantifying Meta MCP Server
The technical metadata crawler mentioned above is essential to reuse metadata that already exists in the organization, but it does not support interoperability across sources or reuse through semantification in general. As this is essential to ensure correct usage of data in AI models, the result of this integration effort is targeted first at those models and therefore must follow the MCP methodology. An MCP Server is needed to integrate and semantify all major corporate data sources into silver-layer semantified data models, including data access to those resources. Reducing it to a few MCP Servers would allow a higher degree of integration and performance optimization compared to adding many MCP Servers to each MCP Host of the AI models used. This MCP Server landscape is the responsibility of the Metadata and Semantics team and should be extended by a UI for human testing and human data retrieval. Whether the technical metadata crawler is just a component or a separate product is not relevant for the result - for explanatory reasons, I follow an integrative description here.
"Meta" in this approach means to integrate multiple sources of metadata into one aligned model. Types of metadata describing most existing sources are (exemplary, never complete):
- MCP: In every case where an MCP Server already exists, it should be the main source for metadata and should allow access to data and tools as defined there. Naming of resources and tools must be mapped to corporate naming conventions, while models generally should stay untouched unless necessary. Changing models would add a layer of metadata and eventually result in a custom MCP Server as a transformation layer.
- INFORMATION_SCHEMA: All relational database servers provide an information schema that contains the basic metadata on tables and columns defined in the database. This standardized information forms the basis needed to write SQL queries to underlying systems. Column and table descriptions are non-standard proprietary extensions.
- OpenAPI: This standard to describe APIs including endpoints and resulting data models is essential for understanding most REST APIs; including it into metadata forms the basis needed to access data from APIs.
- RDF/OWL: These formats describe semantics and hierarchies widely accepted as standard for semantics and ontologies as knowledge graph metadata.
- RML/YARRRML: One of the few standards to integrate relational data with semantics described in ontologies; this is essential to integrate relational data across multiple data sources.
- ODPS/OCS/ODCS: Several standards describe data products, data contracts, data sharing agreements and alike.
- … additional standards appear daily
No existing open-source project fulfills all necessities to really integrate corporate data resources into a joint data layer for MCP use. "metamcp" could lead here, but the integration of non-MCP resources is not on their agenda yet. A possible solution could be to auto-generate MCP servers per source based on the source's proprietary metadata, converting it into MCP format, but this would not solve the issue of semantified integration of MCP servers. Authorization in complex enterprise setups adds to this endeavour.
MCP, Semantics, and Enterprise Data Architecture
(by Microsoft Copilot)
Why MCP responses contain reduced metadata — and why this is intentional
Model Context Protocol (MCP) takes a fundamentally different stance from classical API description languages such as OpenAPI. OpenAPI is designed for strict contract enforcement: every endpoint, parameter, and response must be described in detail, often with extensive JSON Schema definitions. This is ideal for code generation and validation — but not for LLMs.
MCP deliberately avoids this level of verbosity. The protocol defines how a client and server communicate, not what the server returns. Responses are intentionally lightweight: they contain the data, but not a formal schema describing that data. This design choice reflects the assumption that an LLM can interpret JSON structures without requiring a rigid schema, as long as the server provides clear, human-readable descriptions.
In other words:
- OpenAPI optimizes for machines that cannot infer meaning.
- MCP optimizes for machines that can.
A well-designed MCP resource description includes:
- A clear description of the response structure — not as a schema, but as a structured explanation of fields, meaning, relationships, and invariants.
- Examples — LLMs learn structure from examples more reliably than from schemas.
- Explicit descriptions of relationships — enterprise data is relational, and MCP requires these relationships to be described explicitly.
- Stable naming conventions — consistency is more important than formal typing.
- Domain qualifiers — each resource must declare its semantic scope to prevent the LLM from merging unrelated concepts.
Why enterprises need multiple domain-based MCP Servers
Enterprises rarely have a single, unified data model. They have bounded contexts — Finance, HR, Sales, Projects, Manufacturing — each with its own semantics, rules, and vocabulary.
Trying to force all of this into one global MCP server would recreate the classic "enterprise data warehouse as the single source of truth" problem: too large, too rigid, too political, too slow to evolve, and too brittle to maintain.
MCP is designed for the opposite: small, domain-focused servers that expose data and actions relevant to their domain.
Examples include:
- finance-mcp for cost centers, GL accounts, controlling areas
- projects-mcp for WBS elements, project structures
- hr-mcp for employees and organizational units
- sales-mcp for customers, orders, pricing
Each server becomes a semantic island with clear boundaries.
The real challenge: avoiding semantic drift and hallucination
If multiple MCP servers exist, and some are internal while others are external, vocabulary collisions are inevitable:
- "project" in Finance ≠ "project" in IT
- "order" in SAP ≠ "order" in Jira
- "customer" in CRM ≠ "customer" in e-commerce
LLMs merge concepts unless explicitly told not to. This mirrors the human problem of consultants misunderstanding internal terminology.
Methodology: Domain Context Isolation
The solution is not a global silver model. The solution is explicit domain boundaries.
- Each MCP server declares its domain — acting as a semantic namespace.
- Each resource description begins with a domain qualifier — e.g., "[FINANCE] Cost Elements".
- Cross-domain relationships must be explicit — e.g., "This resource references FINANCE.CostElement."
- External MCP servers must be sandboxed — clearly labeled as EXTERNAL with disclaimers about vocabulary differences.
- Enterprise vocabulary must be consistent within a domain — but not universal across all domains.
This mirrors Domain-Driven Design: bounded contexts, not a universal model.
How this fits into enterprise data architecture
This approach gives enterprises a way to describe a consistent, AI-ready semantic layer without building a monolithic enterprise data warehouse model.
It avoids the pitfalls of global canonical models, rigid ontologies, and centralized modeling bottlenecks. Instead, it creates a federated semantic architecture:
- Each domain owns its semantics.
- MCP servers expose domain-specific data and actions.
- A shared vocabulary exists only where concepts truly overlap.
- Domain qualifiers prevent hallucination and semantic drift.
- LLMs navigate the enterprise landscape safely and correctly.
This is a modern, AI-aligned interpretation of data architecture: semantics are explicit, contextual, and domain-scoped — not globally enforced.
Clarity for data architects
MCP represents a shift from schema-driven integration to description-driven semantics. The key insights:
- You do not need a global enterprise data model.
- You do not need OpenAPI-style schemas.
- You do not need to unify all vocabulary.
- You do need domain boundaries.
- You do need consistent semantics within each domain.
- You do need explicit descriptions of relationships and meaning.
- You do need a methodology to prevent LLMs from merging unrelated concepts.
This creates a federated semantic mesh expressed through MCP — lighter, more flexible, more scalable, and aligned with how LLMs reason.
Filter-ready data resources are tools
Resources in MCP mainly have fixed content and are consumed at once. This is why an option exists to "subscribe" to resources, triggering a notification on resource change. A remote data source that is queried using filter parameters is a "tool" in MCP, equivalent to a function call. The recommendation is to always provide remote data sources as tools, aiming for a consistent call methodology.
"tools/list" returns the tools metadata. Parameters are defined as type-safe, while results are described in natural language only:
- name: The identifier to be used for calling. It should be verb_resource, mainly "get" to retrieve data, e.g. "get_cost_centers"; it is a programmatic identifier and typically snake_cased.
- title: Human-readable name of the tool for display purposes; mainly the resource name in proper case.
- description: Full description of the data resource including data model, semantics, relationships to other resources etc.; natural language, but structured and precise. Do not rely on AI models interpreting outputSchema as the data description.
- inputSchema: Full JSON Schema (typically of type object) definition of allowed parameters; in a data context this allows filtering and should allow many optional filters fitting to indexing strategy of data source.
- outputSchema: Optional but recommended for data sources; defines the JSON schema for result (typically array of object).
"tools/call" queries the data based on "name" of tool, giving "arguments" fitting to "inputSchema". It results in "structuredContent" for results fitting an "outputSchema" - the latter being default for data resources. Schemas and structured content are part of MCP since version "2025 06 18" - for elder MCP Clients, return the serialized JSON in a text "content" block.
Do not provide data as a resource. It would result in retrieving the full dataset at once, and it does not allow an outputSchema definition. Files are resources; blobs of raw data might be resources.
As described in the authorization chapter, every data resource needs a resource policy describing how claims are interpreted, effectively describing the data access model. Include RBAC/ABAC patterns here, too. If the query intended by the given parameters cannot be resolved based on the claims stated in the access token, do not simply return an empty result set but return an "insufficient_scope" error (see the Authorization chapter in [MCP Specification]).
Model Context Protocol (MCP) - technical considerations
The Model Context Protocol consists of several key components that work together:
- Base Protocol: Core JSON-RPC message types
- Lifecycle Management: Connection initialization, capability negotiation, and session control
- Authorization: Authentication and authorization framework for HTTP-based transport
- Server Features: Resources, prompts, and tools exposed by servers
- Client Features: Sampling and root directory lists provided by clients
- Utilities: Cross-cutting concerns like logging and argument completion
Model Context Protocol (MCP) is understood as the standard method for AI tools to get access to current information in addition to training data. To achieve this, AI tools act as MCP Hosts, hosting multiple connections (MCP Clients) to MCP Servers. With AI tools (or more specifically, LLMs) in place, MCP is a reduced standard and is based on hosts understanding natural language - information about context and content (metadata) is mainly given unstructured, in natural language.
The base protocol is [JSON-RPC], defining function calls with optional error results as the standard. Serialization is JSON in UTF-8 encoding. Transport is done via stdio (local, in-process) or Streamable HTTP (remote via HTTP POST/GET and server-sent events). Streaming enables the requestor to poll paginated results after a (first, empty) response has been returned. MCP mandates HTTP authentication and OAuth authorization; this includes a methodology for authorization requests, with identity & entitlement management (IAM) needed for resource claims. Unfortunately, understanding MCP quickly goes deep into hardcore IT topics.
Capabilities are negotiated during connection initialization. The common set of capabilities is valid for the lifetime of a single connection. Standard capabilities are defined in the chapter "Lifecycle" of the [MCP Specification]. Communication is bidirectional; both sides may use capabilities the other side claimed, and both sides may send notifications they claimed. In a Data Architecture context, MCP is mainly used for information retrieval, therefore "resources" and "tools" are the only relevant (server) capabilities. "tools" are mainly parameterized resources, while "resources" only work for small, defined datasets. In both cases it is relevant to enable "listChanged" notifications to inform the MCP Client whenever metadata changes.
From the argumentation below, the following statements emerge:
- MCP Servers are best per domain to manage complexity.
- OAuth Authorization Server and IAM are enterprise wide.
- Enterprise-wide claims must be defined for authorization; they are part of the user's profile.
- Data sources are "tools" in MCP that explicitly react to all claims and that allow filtering by arguments.
- Schemas are defined as JSON Schema; context, semantics and access models are described in natural language.
Authentication, authorization and access requests
MCP does not include methods to request access because access requests are already handled by OAuth workflows and IAM. The MCP Client authenticates to an MCP Server using HTTP authentication mechanisms, thereby presenting known OAuth authorization tokens. If the presented tokens are not sufficient to access the requested resource (or the queried portion of it), the MCP Server responds with an error and includes the resource policy and a list of trusted OAuth Authorization Servers (AS). The MCP Client then performs an OAuth authorization flow with the AS by asking for a scope for an audience (the MCP Server acting as OAuth Resource Server). Based on IAM, the AS returns the granted claims in a new token; the MCP Client presents that token to the MCP Server to gain access to the protected resource. The MCP Server interprets scope and claims in relation to the requested resource and returns data.
Structured approach:
- The MCP Server must be an OAuth Resource Server (RS). This includes validating presented authentication and OAuth tokens (issuer, audience, lifetime, signature) and interpreting scope and claims for granted access and row-level security. On an unauthorized request it returns OAuth 2.0 Protected Resource Metadata (PRM; RFC 9728) including information on the AS and the resource policy (descriptive). Thus, the AI chat user can request and access restricted resources.
- The MCP Host caches known tokens and requests access from AS based on PRM returned from MCP Server. To enable this, the MCP Server must be configured including OAuth (Auth Code or OBO) when connecting it to the MCP Host.
- The OAuth Authorization Server (AS) needs to be configured for an audience (a URI representing the MCP Server as RS), adding custom scopes (coarse permissions like 'data.read'; capabilities), claims (attributes for row-level security; constraints) and authorization policies (defining the authorization workflow). OAuth only defines how scopes are requested and returned; it does not imply meaning. Claims are not requested by the client; the AS retrieves them from IAM.
- Identity & entitlement management (IAM) manages which claims belong to which user, thereby forming the basis for OAuth. IAM needs an MCP Server where entitlements are listed and can be requested. This is where new access rights are effectively granted - possible via AI chat.
- Scopes are requested per audience (RS). Claims live in IAM and are returned per user.
In a non-governed environment this leads to claim explosion, with claims per application and the lowest level of access granularity. To avoid this, we can start with the following proposals:
- A Data MCP Server should share a consistent access model across all contained resources and tools. This is due to requesting scopes per audience (the Data MCP Server). One result is to define a server per data domain or function (e.g. Finance).
- Scopes define roles and are necessary to restrict write access to certain resources. Read access should stay a simple scope ("data.read" or "read confidential data") for analytical use cases. Interactive tools might need more granular scopes. Exceptions are general interactions like "report data quality issue".
- Claims define responsibilities. Thus, claim granularity needs to match responsibility granularity. As responsibilities are hierarchical, claims need to be hierarchical. As claims are matched across tokens, additive access rights cannot be implemented with multiple tokens - claims themselves need to be additive, while intersecting rights need to result in more restrictive claims. For interpreting row-level security this means ignoring unknown claims and not granting anything for claims that are not provided.
- AS policy defines the applicability of claims to scopes and audience, e.g. to avoid restricting "read of public external data" by claims. The AS policy interprets the IAM result before putting it into a token. Use AS policy to reduce e.g. a data domain hierarchy in IAM to data-domain-specific claims in the token.
- Ontologies are needed to support hierarchies of claims and scopes and to deduce the claims relevant per audience.
Detailed OAuth example for clarity
Scenario setup (shared understanding)
- Client: finance-analytics-app
- Authorization Server (AS): https://auth.company.internal
- Resource Server (RS / API): finance-mcp.company.internal
- Protected data: internal cost accounting data
- Constraint: North America only
- User: Alice (Finance analyst)
1. User → Host: User expresses intent in natural language.
   "I'm now responsible for North America sales reporting."
   This is not an OAuth event; the AI interprets this as a request to change entitlements or to initiate an approval workflow.
2. Host → IAM: The AI calls a domain service, not OAuth.
   The AI invokes the HR system, IAM workflow, entitlement management API or approval system.
   Example (conceptual):
   POST /entitlements/request
   { "user": "alice", "role": "sales_analyst", "region": "NA", "reason": "new position" }
   This is where rules are added, not in OAuth. This step may require manager approval, compliance checks, SoD validation, …
3. IAM: Authorization state is updated centrally.
   After approval, user attributes change, roles are assigned and policies now evaluate differently. This state is durable and auditable.
4. Client → Authorization Server: Authorization Request.
   "I want a token for the finance MCP API, with permission to read cost accounting data."
   GET /authorize HTTP/1.1
   Host: auth.company.internal
   response_type=code
   client_id=finance-analytics-app
   redirect_uri=https://app.company.internal/callback
   audience=https://finance-mcp.company.internal
   scope=costs.read
   state=abc123
   What the client is asking for (important):
   - Audience: "This token is intended for finance-mcp.company.internal"
   - Scope: "I want the permission costs.read"
   - No claims requested, no region specified here
5. Authorization Server → User: Authentication + Consent.
   The AS authenticates Alice and evaluates policy using data queried from IAM:
   - Alice works in Finance
   - Alice is authorized for North America
   - Alice is not authorized for EMEA/APAC
   The consent dialog states: finance-analytics-app wants to read cost accounting data from the Finance MCP API. Alice approves.
6. Authorization Server → Client: Authorization Code Issued.
   HTTP/1.1 302 Found
   Location: https://app.company.internal/callback?code=SplxlOBeZQQ&state=abc123
7. Client → Authorization Server: Token Request.
   POST /token HTTP/1.1
   Host: auth.company.internal
   Content-Type: application/x-www-form-urlencoded
   grant_type=authorization_code
   code=SplxlOBeZQQ
   client_id=finance-analytics-app
   client_secret=********
   redirect_uri=https://app.company.internal/callback
8. Authorization Server (with IAM) → Client: Access Token Issued.
   Now the important part: the AS decides which claims to include, based on:
   - scope = costs.read
   - audience = finance MCP
   - the updated user attributes
   - company authorization policy
   Key observation - the issued token:
   {
     "iss": "https://auth.company.internal",
     "aud": "https://finance-mcp.company.internal",
     "sub": "alice@company.internal",
     "scp": ["costs.read"],
     "region": "NA",
     "department": "Finance",
     "entitlements": ["cost-accounting"],
     "exp": 1760000000
   }
   - The client did not ask the AS for region, department or entitlements.
   - The Authorization Server added them because they are needed for policy enforcement at the resource server.
9. Client → Resource Server: API Call.
   GET /costs/internal HTTP/1.1
   Host: finance-mcp.company.internal
   Authorization: Bearer eyJhbGciOi...
10. Resource Server (Finance MCP): Authorization Decision.
    The Finance MCP server evaluates the token:
    - audience == finance-mcp ✅
    - scope contains costs.read ✅
    - region == NA ✅
    - department == Finance ✅
    - ✅ Access granted
    - ✅ Only North America data returned
What this dialog demonstrates:
- The client asks for Audience (which resource server) and Scopes (what kind of access)
- The client does NOT ask for Claims, Regions, Departments or Business attributes
- The Authorization Server translates scopes + audience + policy → claims and issues a token that the RS can enforce.
- The Resource Server trusts the AS and enforces scopes AND claims. It does not participate in token minting.
The client requests a token for an audience with scopes; the Authorization Server issues a token containing claims; the Resource Server enforces both scopes and claims. OAuth is for delegated authorization enforcement, not for discovering or learning authorization rules. Rules and "profile changes" belong to identity & entitlement management (IAM / IGA) or policy engines (runtime authorization inside MCP Server).
The emerging standard model
Modern systems are converging on this layered design:
┌──────────────────────────────┐
│ Natural language (AI chat) │
│ - intent │
│ - reasoning │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Domain / IAM / Policy APIs │ ← rules are created here
│ - approvals │
│ - governance │
│ - compliance │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ OAuth / OIDC │ ← rules are enforced here
│ - tokens │
│ - claims │
│ - delegation │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Resource Servers │
│ - policy enforcement │
└──────────────────────────────┘
Pseudo-Python user request:
def user_request(
    this: MCPServer,                 # the domain data server
    authentication: User,
    authorization: List[OAuthToken],
    method: str,
    resource_or_tool: str,
    arguments: Optional[InputSchema[resource_or_tool]],
) -> OutputSchema[resource_or_tool] | Error:
    if not authentication.valid:
        return InvalidAuthenticationError
    valid_tokens: List[OAuthToken] = []
    for token in authorization:
        if token.iss not in credible_issuers: continue
        if token.aud != this.uri: continue
        if token.sub != authentication.user_id: continue
        if method not in token.scp: continue      # assumption: scopes match methods
        if token.exp < now: continue
        if not claims_allow_arguments(token, resource_or_tool, arguments): continue
        valid_tokens.append(token)
    if not valid_tokens:
        return ProtectedResourceMetadata(resource_or_tool)
    return get_data(resource_or_tool, arguments)
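The claims_allow_arguments helper is left undefined above; one possible sketch, applying the rules from the claims proposal (ignore unknown claims, grant nothing for missing claims, keep arguments within the granted constraint), could look like this with illustrative claim names:

KNOWN_CLAIMS = {"region", "department"}   # claims this server knows how to enforce

def claims_allow_arguments(token, resource_or_tool: str, arguments: dict | None) -> bool:
    """True if the token's claims permit the query described by the arguments."""
    # A real policy would also vary per resource_or_tool; this sketch applies one rule to all.
    arguments = arguments or {}
    for claim in KNOWN_CLAIMS:
        granted = getattr(token, claim, None)
        if granted is None:
            return False                   # claim not provided: no access to constrained data
        requested = arguments.get(claim, granted)
        if requested != granted:
            return False                   # arguments must stay within the granted constraint
    return True                            # unknown claims on the token are simply ignored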
USAIR: Unified, Semantified, AI-Ready MCP Meta-Server
Draft for an open-source project to unify and semantify metadata from various sources into a consistent domain model, making it accessible for AI and other applications. The abbreviation's association with aviation is intentional, indicating the lift-off of metadata management to a new level. The project consists of three main components:
- Python library to import, integrate and export metadata in various formats including a silver layer internal metadata model and methods to find, request and access data
- MCP Server, REST API and Web UI based on lib to find, request and access data
- Extensive documentation
The purpose is to reuse existing metadata, including existing proprietary MCP Servers, and to slice, dice, reconfigure and extend it into a consistent (domain) model.
Model Context Protocol (MCP) Requirements
As MCP is the current state of the art for providing metadata to AI models, it is used as a reference for requirements. Its resource definition includes:
- uri: Unique identifier for the resource
- name: The name of the resource.
- title: Optional human-readable name of the resource for display purposes.
- description: Optional description
- icons: Optional array of icons for display in user interfaces
- mimeType: Optional MIME type
- size: Optional size in bytes
For more see "Schema Reference" in [MCP Specification].
The "description" is key here. It provides the necessary context for clients to understand the resource's purpose and usage as prose but dense and structured. It is impossible to interpret this description if integrating MCP Servers into a larger model. You can only prefix, postfix, replace or do some RegEx magic to adapt existing texts to your needs, but you cannot semantically integrate them into a larger model. This is a problem, as the MCP Host assumes that multiple MCP Servers can be seamlessly integrated, but it does not provide any means to do so.
In addition, MCP is not made for querying but only for calling methods. To enable querying instead of brute-force fetches, parameterized resources are needed that allow filtering. The filter properties themselves are defined in a technically sound way by reusing JSON Schema as the "schema" attribute. This allows injecting technical metadata for filter properties; as MCP matches by naming convention, the fuzzily defined response value of e.g. cost center gets a technical definition as a parameter of the cost centers resource. It is worth testing whether it is suitable to generally add optional query parameters to every resource, thereby allowing filtering and defining technical metadata. Copying all fields as filter properties is not recommended, but following a pattern for standard implementation could balance the effort.
Conclusion: Expose resources with optional query parameters; identical naming steers MCP Host behavior; typed parameters allow efficient querying.
Drafting an integration
Starting from the result, how could a config file for the integration look?
server:
  name: "Manufacturing data"
  uri: "https://?"
  description: "fdds"
  domain: "manufacturing"
import:
  - type: "mcp"
    uri: "https://my-company.sap.com/mcp"
    description: "fdds"
    namespace: "sap"
    export: "none"
  - type: "json"
    uri: "https://non-sap-dps.my-company.com/metadata.json"
    description: "fdds"
    namespace: "dps"
    export: "all"
resources:
  - name: "Functional locations"
    uri: "https://?/functional-locations"
    from: "sap.functional-locations"
    descriptionAppend: "Functional locations are also called technical places"
    filtersRemove:
      - name: "cost center"
    filtersAdd:
      - name: "functional location id"
        type: "string"
        descriptionPostfix: "fdds"
        required: false
"In this example, two sources are integrated: an MCP Server and a JSON file. The server is defined with its own metadata. The MCP Server is imported without any transformation, while the JSON file is imported with all resources being exported as they are. A new resource "Functional locations" is created based on the "functional-locations" resource from the MCP Server. The description of this resource is extended by appending additional information. The filter "cost center" is removed from the original resource, and a new filter "functional location id" is added with its own metadata." (automatically derived from GitHub Copilot, showing inherent understanding of metadata language used)
The same purpose can be achieved with Python code (instead of YAML metadata) using the library; ingesting metadata would then be a batch of related Python function calls. How could this look in Python?
from usair import Integration, MCPIntegration, JSONIntegration, ServerDefinition, ResourceDefinition, FilterDefinition
# Define the server
server = ServerDefinition(
name="Manufacturing data",
uri="https://?",
description="fdds",
domain="manufacturing"
)
# Define integrations
mcp_integration = MCPIntegration(
uri="https://my-company.sap.com/mcp",
description="fdds",
namespace="sap",
export="none"
)
json_integration = JSONIntegration(
uri="https://non-sap-dps.my-company.com/metadata.json",
description="fdds",
namespace="dps",
export="all"
)
# Define resource transformation
functional_locations_resource = ResourceDefinition(
name="Functional locations",
uri="https://?/functional-locations",
from_resource="sap.functional-locations",
description_append="Functional locations are also called technical places",
filters_remove=["cost center"],
filters_add=[
FilterDefinition(
name="functional location id",
type="string",
description_postfix="fdds",
required=False
)
]
)
# Create the integration instance and execute
# (Integration is an assumed orchestrator class of the library, distinct from the MCP source integration)
integration = Integration(server=server)
integration.add_source(mcp_integration)
integration.add_source(json_integration)
integration.add_resource(functional_locations_resource)
integration.execute()
In this Python example, we define the server and integrations using data classes. We then create an instance of the integration, add the sources and resource transformation, and execute the integration process. This approach allows for a more programmatic and flexible way to define and manage metadata integrations. (thanks to GitHub Copilot again)
Appendices
Appendix A - References and Referencing Conventions
This appendix defines how references are classified, interpreted, and applied within this document. It follows common conventions used in international standards and technical specifications.
Normative references (marked as strong) define concepts, abstractions, or constraints that are adopted by this document. Where a normative reference is cited, the referenced material is considered binding within the explicitly stated scope. Normative references in this document are selectively adopted. Adoption does not imply full compliance with the referenced framework, standard, or methodology unless explicitly stated. Each normative reference is accompanied by a scope comment clarifying which aspects are adopted, which aspects are excluded, and how the reference is interpreted in an enterprise and AI-enabled context.
Non-normative references are provided for context, contrast, historical background, or discussion. They are explicitly not binding. A non-normative reference may present an alternative or competing viewpoint, use different terminology or assumptions, be immature, exploratory, or market-driven, or focus on tools or platforms rather than data semantics. Inclusion of a non-normative reference does not imply endorsement. Where relevant, comments explain why the source is cited and how it deviates from the position taken in this document.
References are cited inline using short identifiers enclosed in
square brackets, for example [FAIR] or [MCP].
Inline references serve the following purposes:
- Definition reference: used when a term or concept is introduced or normatively constrained.
- Rationale reference: used to justify a design decision or architectural stance.
- Contrast reference: used to explicitly position this document against an alternative approach.
Inline references do not replace explanation in the text.
They indicate intellectual grounding, not delegation of responsibility
for understanding.
-
DAMA International. (2017). DAMA-DMBOK: Data management body of knowledge (2nd ed.).
Technics Publications.
https://technicspub.com/dmbok/
Used as a foundational reference for data management terminology and scope. Conceptual definitions are adopted selectively; role- and process-centric governance models are not assumed where they conflict with federated or AI-driven data usage. - Krantz, T., & Jonker, A. (n.d.). What is a data architecture? IBM. https://www.ibm.com/think/topics/data-architecture
- Gartner. (n.d.). Data architecture. Gartner Glossary. https://www.gartner.com/en/data-analytics/topics/data-architecture
-
Tool-centric data architecture frameworks.
https://aws.amazon.com/what-is/data-architecture/
Referenced deliberately as a contrasting position. This document explicitly rejects tool-first or vendor-defined architecture approaches in favor of data semantics and purpose. - Dehghani, Z. (2022). Data mesh: Delivering data-driven value at scale. O'Reilly Media. https://www.oreilly.com/library/view/data-mesh/9781492092384/
-
Dehghani, Z. (2020). Data mesh principles and logical architecture. Martin Fowler.
https://martinfowler.com/articles/data-mesh-principles.html
Data Mesh is referenced for domain ownership and data-as-a-product principles. This document diverges from pure decentralization by requiring centralized semantic integration and metadata mediation for AI usage. -
ODPS / Data Product Specification Initiatives.
https://opendataproducts.org/
Referenced as an example of emerging data product specification efforts. These initiatives are still evolving and currently lack coverage of lifecycle, semantics, and AI access patterns required by this document. -
World Wide Web Consortium. (2013). Publications of the W3C Semantic Web Activity.
https://www.w3.org/2001/sw/Specs.html
W3C semantic web standards are used normatively for machine-interpretable meaning and interoperability. Their use is restricted to semantics and metadata; RDF is not mandated as an operational persistence model. - World Wide Web Consortium. (2014). RDF 1.1 concepts and abstract syntax. https://www.w3.org/TR/rdf11-concepts/
- World Wide Web Consortium. (2012). OWL 2 web ontology language overview. https://www.w3.org/TR/owl2-overview/
- World Wide Web Consortium. (2014). Data Catalog Vocabulary (DCAT). https://www.w3.org/TR/vocab-dcat/
-
Dimou, A., et al. (2024). RDF Mapping Language (RML) specification.
https://rml.io/specs/rml/
RML is referenced as a canonical example of declarative and reproducible mapping between relational data and semantic models. The mapping principle is normative; the specific syntax is not. -
imec — Ghent University — IDLab. (2025). YARRRML.
https://rml.io/yarrrml/
YARRRML is cited as an example of a user-friendly YAML-based mapping language. - Ozekik. (n.d.). Awesome ontology (GitHub repository). https://github.com/Ozekik/awesome-ontology
- World Wide Web Consortium. (n.d.). Lists of ontologies. W3C Wiki. https://www.w3.org/wiki/Ontology_repositories
- Pease, A. (Ed.). (n.d.). Suggested Upper Merged Ontology (SUMO). Ontology Portal. https://www.ontologyportal.org/
- Common Core Ontologies Working Group. (n.d.). Common Core Ontologies [Ontology suite]. GitHub. https://github.com/CommonCoreOntology/CommonCoreOntologies
- Semantic Arts. (n.d.). Gist: Upper ontology for the enterprise. https://www.semanticarts.com/gist/
- Cycorp. (n.d.). OpenCyc ontology. https://www.cyc.com/platform/opencyc/
- Tudorache, T. (n.d.). Engineering ontologies. Protégé Wiki, Stanford University. https://protegewiki.stanford.edu/wiki/Engineering_Ontologies
- Wawrzik, F., Grimm, C., & Neumann, P. (2023). Ontology module suite for electronics and systems engineering [Dataset]. IEEE DataPort. https://ieee-dataport.org/documents/ontology-module-suite-electronics-and-systems-engineering
- Open Energy Platform. (n.d.). Open Energy Ontology (OEO). https://openenergy-platform.org/ontology/oeo/
-
Wilkinson, M. D., et al. (2016). The FAIR guiding principles for scientific data
management and stewardship.
Scientific Data, 3, 160018.
https://doi.org/10.1038/sdata.2016.18
FAIR is adopted as a core quality objective for data and metadata. This document extends FAIR with lifecycle awareness, access control, and AI consumption risks for enterprise environments. - GO FAIR Initiative. (n.d.). FAIR principles. https://www.go-fair.org/fair-principles/
- Roe, C. (2011). Assessing data management maturity using the DAMA-DMBOK framework. DATAVERSITY. https://www.dataversity.net/assessing-data-management-maturity-using-the-dama-dmbok-framework-part-1/
- Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
-
Model Context Protocol Contributors. (2025). Model Context Protocol specification.
https://modelcontextprotocol.io/specification/2025-03-26
MCP is treated as a reference architecture pattern for AI access to data and tools. The protocol is considered replaceable as a technology, but normative as a design pattern for explicit context, access control, and tool mediation. -
AI-ready data architecture discussions.
https://www.gartner.com/en/data-analytics/topics/data-architecture
Gartner material is referenced for market context and terminology. It is intentionally not used normatively due to its platform-centric framing and lack of formal semantic or lifecycle models. -
Model Context Protocol (early discussions).
https://www.anthropic.com/news/model-context-protocol
Early MCP discussions are cited to show conceptual intent. Normative references in this document rely exclusively on the published MCP specification, not on announcements or blog posts. -
JSON-RPC Working Group. (2013). JSON-RPC 2.0 specification.
https://www.jsonrpc.org/specification
MCP is designed to be transport-agnostic, but JSON-RPC is referenced as a common protocol for tool invocation. The specific choice of JSON-RPC is not normative; any protocol enabling the required interactions is acceptable. -
- U.S. Department of Defense. (2010). DoDAF V2.0 - Data and Information Viewpoint (DIV). https://dodcio.defense.gov/Library/DoD-Architecture-Framework/dodaf20_data/
Referenced for its abstraction pattern separating conceptual, logical, and physical data models. DoDAF compliance or defense-specific processes are not assumed.
- Zhao, L., & Roberts, S. A. (1988). An object-oriented data model for database modelling. The Computer Journal, 31(2), 116-124. https://academic.oup.com/comjnl/article/31/2/116/406640
- GeeksforGeeks. (2021). Basic object-oriented data model. https://www.geeksforgeeks.org/basic-object-oriented-data-model/
- Castro, K. (2020, June 19). Object-oriented data model. TutorialsPoint. https://www.tutorialspoint.com/object_oriented_data_model/index.htm
- EAGLES SWLG. (1997). Object-oriented data model. In Gibbon handbook (SoftEdition). University of Bielefeld. [Archived reference page]
- MyReadingRoom. (n.d.). The object oriented (OO) data model in DBMS. Internet Archive (Wayback Machine). [Archived reference page]
- DAMA International. (n.d.). Fitness-for-use data quality models. https://www.dama.org/content/body-knowledge
The "fitness-for-use" concept is acknowledged as influential. This document adopts the idea but reframes it explicitly in terms of declared purpose, lifecycle stage, and AI consumption risk.
- Buneman, P., Khanna, S., & Tan, W.-C. (2001). Why and where: A characterization of data provenance. ICDT. https://www.cis.upenn.edu/~sanjeev/papers/icdt01.pdf
Referenced as foundational academic work on provenance. Practical enterprise lineage models are intentionally treated separately.
- World Wide Web Consortium. (2013). PROV-O: The provenance ontology. https://www.w3.org/TR/prov-o/
- Hu, V. C., et al. (2014). Guide to attribute based access control (ABAC). NIST SP 800-162. https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-162.pdf
ABAC is referenced as an access control mechanism. The document adopts the principle, not the full NIST implementation model.
- OECD. (2019). OECD principles on artificial intelligence. https://oecd.ai/en/ai-principles
- ISO/IEC. (2023). ISO/IEC 42001: Artificial intelligence management systems. https://www.iso.org/standard/81230.html
The references above fall into the following categories:
- Data Architecture & Data Management
- Data Products, Access Models & Data Mesh
- Metadata, Semantics & Knowledge Representation
- Ontology Lists & Upper Ontologies
- Domain-Specific Ontologies & Datasets
- FAIR, Lifecycle & Data Quality
- AI Access, MCP & AI-Ready Data Architecture
- Enterprise Architecture & Viewpoints
- Object-Oriented & Multi-Paradigm Data Modeling
- Data Lineage & Provenance
- Attribute-Based Access Control (ABAC)
- AI Governance & AI Safety
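To illustrate the kind of interaction the JSON-RPC reference above stands for, the sketch below constructs JSON-RPC 2.0 envelopes for a hypothetical MCP-style tool call. The method name, tool name, and arguments are illustrative assumptions and are not quoted from the MCP specification; only the JSON-RPC envelope structure is standard, and the transport is deliberately left open.

```python
import json

# Minimal JSON-RPC 2.0 request/response envelopes for a hypothetical tool invocation.
# The method, tool name, and arguments are illustrative of an MCP-style call;
# only the envelope fields (jsonrpc, id, method, params, result) follow the standard.

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                      # illustrative method name
    "params": {
        "name": "lookup_data_product",           # hypothetical tool
        "arguments": {"product": "customer-orders"},
    },
}

response = {
    "jsonrpc": "2.0",
    "id": 1,                                     # must echo the request id
    "result": {"owner": "sales-data-office", "lifecycle": "active"},
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```

The point is the explicitness of the exchange: the request names a mediated tool, carries its context as structured parameters, and can be logged and access-controlled as a whole; any protocol offering the same properties would serve equally well.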
Appendix B - Architectural Decisions and Rationale
This appendix documents selected architectural decisions taken in this document. It serves to make implicit design choices explicit and to support future evolution of the architecture.
- Data needs models: Information is complex and needs models for digitalization. Models contain purpose-fit semantics and are not just technical representations.
- Semantics over platforms: Data semantics and meaning are treated as primary architectural concerns; platforms and technologies are considered replaceable implementation details. Original data must be kept immutable; derivations may transform but not change it. Original data needs a backup of data and semantics; data pipelines need a backup of derivation logic and business purpose. A minimal sketch of the immutability principle follows this list.
- AI as a first-class data consumer: AI systems are treated as consumers with specific requirements for context, access control, and lifecycle awareness. AI is already mature enough to co-operate; see it as a partner, not as a tool.
- Selective adoption of standards: External standards are adopted selectively; compliance with a standard is not implied unless explicitly stated. Standard business software makes businesses comparable; individual data architecture shows their competitive advantage.
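As an illustration of the immutability decision above, the following sketch separates an immutable original record from its derived representations. It is a minimal sketch under assumed field names (record_id, payload, semantics_uri, derivation_logic); it shows the principle, not a prescribed implementation.

```python
from dataclasses import dataclass

# Hypothetical field names; the point is the separation of immutable originals
# from derived representations, not a prescribed schema.

@dataclass(frozen=True)
class OriginalRecord:
    record_id: str
    payload: dict        # data as delivered by the source system
    semantics_uri: str   # reference to the model/ontology describing the payload

@dataclass(frozen=True)
class DerivedRecord:
    derived_from: str       # record_id of the original it was derived from
    derivation_logic: str   # reference to the versioned transformation (the "how" to back up)
    payload: dict           # transformed representation

def derive(original: OriginalRecord, logic_ref: str, transform) -> DerivedRecord:
    """Create a derived representation without modifying the original."""
    return DerivedRecord(
        derived_from=original.record_id,
        derivation_logic=logic_ref,
        payload=transform(dict(original.payload)),  # work on a copy, never in place
    )

original = OriginalRecord("cust-001", {"name": "Ada", "country": "de"}, "urn:model:customer:v1")
cleaned = derive(original, "git:pipelines/customer_cleanup@v3",
                 lambda p: {**p, "country": p["country"].upper()})
# Reassigning fields on OriginalRecord raises dataclasses.FrozenInstanceError;
# the original payload stays unchanged and remains available for later re-derivation.
```

The derived record keeps both the pointer to its source and to the derivation logic, which is exactly what needs to be backed up alongside the original data and its semantics.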
Appendix C - Terminology and Definitions
This appendix defines the terminology used in this document. The definitions in this appendix are normative and override external, vendor-specific, or colloquial usage.
- Data Architecture: The coherent set of principles, models, and structures governing how data is represented, transformed, accessed, and used. In this document, data architecture explicitly prioritizes data semantics, lifecycle, and purpose-fit over platform or vendor choices. See: Overview, Concepts & Mental Model, Integrating it all
- Data Product: A purpose-driven, governed representation of data that declares meaning, access, quality expectations, and lifecycle status. A data product is not defined by its storage technology or application, but by its business meaning, access model, and reusability; a descriptor sketch follows this list. See: Data Provisioning & Data Products, Lifecycle & Quality
- Semantics: Formalized, machine-interpretable meaning assigned to data entities and relationships, expressed for example as ontologies, mappings, or controlled vocabularies, not as descriptive documentation text. See: Data Modelling, Semantifying Meta MCP Server
- Metadata: Data describing data, including the technical, business, semantic, and governance-related information necessary for discovery, access, and correct usage. Metadata is treated as first-class architectural data, not auxiliary documentation. See: Data Governance, Semantifying Meta MCP Server, Integrating it all
- Lifecycle: The managed progression of data and data products from creation to deprecation and deletion. Lifecycle applies to data products, not just to tables or files. See: Lifecycle & Quality
- Data Quality: The degree to which data is fit for a declared purpose, relative to its lifecycle stage and usage context. Data quality is contextual and process-based, not an absolute or static property. See: Lifecycle & Quality, Data Governance, Types of use cases
- Data Governance: The set of organizational structures, processes, and rules ensuring that data is handled compliantly, efficiently, and in alignment with business objectives. This document distinguishes external (compliance-driven) and internal (efficiency-driven) data governance. See: Data Governance, Data organization
- Data Ownership: The explicit responsibility for deciding on data usage, access, quality handling, and deprecation within an organizational context. Ownership is an active role, not inferred from data creation or legal possession. See: Data access models, Data Governance
- MCP Server (Model Context Protocol Server): A standardized interface enabling AI systems to discover and access data and tools with explicit context. MCP is treated as an architectural access pattern, not a fixed technology. See: Data Access and Access Restrictions, Semantifying Meta MCP Server
- AI-readiness: The degree to which data, metadata, access models, and lifecycle information are suitable for safe, correct, and automated consumption by AI systems. See: Concepts & Mental Model, Lifecycle & Quality, Semantifying Meta MCP Server
- Immutability (of data): The principle that original data, once created, must not be changed; only derived representations may evolve. See: Data provisioning & Data Products, Data modelling, Lifecycle & Quality
- FAIR (+): A principle stating that data should be Findable, Accessible, Interoperable, and Reusable, with additional emphasis on safety, security, and trustworthiness. See: Data access and access restrictions, Data Governance
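To make the Data Product and AI-readiness definitions concrete, the following sketch shows one possible shape of a data product descriptor that declares meaning, access, quality expectations, and lifecycle status. The field names, lifecycle stages, and the readiness check are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical lifecycle stages; real stages depend on the organization's lifecycle model.
class LifecycleStage(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    DELETED = "deleted"

@dataclass(frozen=True)
class DataProductDescriptor:
    name: str
    owner: str                  # accountable data owner (an active role, not the creator)
    purpose: str                # declared purpose the quality statement refers to
    semantics_uri: str          # machine-interpretable meaning, e.g. an ontology or mapping
    access_policy: dict         # attribute-based (ABAC-style) access rules, evaluated at access time
    quality_expectations: dict  # fitness-for-use thresholds relative to the declared purpose
    lifecycle: LifecycleStage = LifecycleStage.DRAFT

    def is_ai_ready(self) -> bool:
        """Coarse check: all declarations present and the product is not retired."""
        return (
            bool(self.semantics_uri)
            and bool(self.access_policy)
            and bool(self.quality_expectations)
            and self.lifecycle is LifecycleStage.ACTIVE
        )

customer_orders = DataProductDescriptor(
    name="customer-orders",
    owner="sales-data-office",
    purpose="order analytics and forecasting",
    semantics_uri="urn:ontology:sales:order:v2",
    access_policy={"allow": [{"role": "analyst", "region": "EU"}]},
    quality_expectations={"completeness": 0.98, "max_staleness_hours": 24},
    lifecycle=LifecycleStage.ACTIVE,
)
assert customer_orders.is_ai_ready()
```

The descriptor deliberately carries no storage or platform information: meaning, ownership, access, quality, and lifecycle are the properties a consumer, human or AI, needs to use the product safely.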
Appendix D - Change History
| Date | Change | Rationale |
|---|---|---|
| 2026-03-23 | Initial public version | First consolidated perspective on data architecture |
| 2026-03-25 | Refining Appendices | Formatting, linking and structuring appendices |
| 2026-03-31 | More insights in MCP metadata | Found clarity in OAuth concepts and MCP schema definition |