Our Perspective On Data Architecture
By Joep van Genuchten
- 9 minutes read - 1760 words

We would argue that perhaps the most important reason for the existence of the data architecture capability is to ensure that the data and information that we base our decisions on is, first and foremost, available, and then actually represents reality.
Whether you are publishing scientific or open data, or just trying to get more value out of your organization's confidential data, we believe the FAIR principles are the most concrete proposal for solving the most common data issues. Data quality issues in modern organizations are most often not the consequence of faulty data entry (much of which is automated anyway) but far more often the result of insufficient data modeling and a weak data architecture in general. So what basic functionality should a data architecture support in order to facilitate FAIR data? In this article we will explain what we believe the four basic parts of a FAIR-supporting data architecture are, and how data lineage, both vertical and horizontal, interacts with this architecture to get the most out of any data.
Four functions in a data architecture
We see four main functions in any data architecture. In practice, some will be a lot more complex than others, and one could argue that some should be split up into smaller functions, but for the purpose of this paper we will use these four. You can find an overview of these four functions in the diagram below, along with how they contribute to a FAIR data architecture.
The user, meaning the humans that interact with the data (architecture), whether they are seekers of information or developers of information systems, will mostly interact with the system based on the natural language they know and use. This is why the primary interaction between people and the data architecture is via the business glossary and the data catalog: functions designed to facilitate the ‘translation’ between the natural language that people use and the formal languages that the information systems use.
Data catalog: Findable
Once an organization has multiple information applications, it is easy to lose sight of which data exists and where it exists. Data catalogs solve this issue by publishing which data products are available, what they are about, who is responsible for them, and where they can be accessed.
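As a minimal illustration, a catalog can be sketched as a small lookup structure over published metadata. The fields below loosely follow DCAT-style descriptions (title, description, publisher, access URL); all dataset names and URLs are hypothetical.

```python
# A minimal sketch of a data catalog, assuming DCAT-style metadata fields.
# All names and URLs below are hypothetical.
catalog = [
    {
        "title": "Customer orders",
        "description": "All confirmed orders, refreshed nightly.",
        "publisher": "Sales IT team",
        "keywords": ["customer", "order", "sales"],
        "accessURL": "https://data.example.org/orders",
    },
    {
        "title": "Product master data",
        "description": "Canonical product identifiers and attributes.",
        "publisher": "Product management",
        "keywords": ["product", "master data"],
        "accessURL": "https://data.example.org/products",
    },
]

def find_datasets(catalog, keyword):
    """Findability: locate data products by keyword."""
    return [entry["title"] for entry in catalog if keyword in entry["keywords"]]
```

In a real implementation the metadata would be published in a standard vocabulary such as DCAT rather than ad-hoc dictionaries, so that catalogs themselves become interoperable.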
FAIR reference models for data catalogs
Business glossary: Findable, Reusable
The business glossary describes the terminology that people use in an organization. It is oriented toward natural language use. Because it describes the terminology that people actually use to communicate about their work, it gives us a first insight into an organization's information needs. Defining and publishing terminology is an essential part of any data architecture, as it facilitates reuse of terminology and makes it easier to understand what we are talking about. Terminology defined in the business glossary may be used to annotate data resources contained in the data catalog, it can be used for text mining, and it may also be helpful in onboarding new team members, as it gives them a quick overview of the most important concepts that people in the organization work with.
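A glossary entry can be sketched with the distinctions SKOS makes between a preferred label, alternative labels, and a definition; resolving the words people actually use to a shared entry is the core operation. The term and labels below are illustrative.

```python
# A minimal sketch of a business glossary, loosely following SKOS:
# each entry has a preferred label, alternative labels, and a definition.
# The entries are hypothetical.
glossary = {
    "customer": {
        "prefLabel": "Customer",
        "altLabels": ["client", "account holder"],
        "definition": "A party that purchases, or intends to purchase, our products.",
    },
}

def lookup(glossary, term):
    """Resolve a term people actually use to its glossary entry."""
    term = term.lower()
    for key, entry in glossary.items():
        if term == key or term in (label.lower() for label in entry["altLabels"]):
            return entry
    return None
```

The point of the alternative labels is precisely the annotation use case mentioned above: data resources described with any of the synonyms can still be found under the one preferred concept.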
FAIR Reference standards related to the business glossary
Conceptual model: Interoperable, Reusable
The conceptual model focuses on the formal definitions that we base our data models on. Such a model should be inspired by the concepts we find in the business glossary, but it formalizes those concepts such that the domain may be described by predicate logic and set theory. What this means is that for a given ‘thing’ and a given ‘concept’ we should be able to determine (as ‘true’ or ‘false’) whether the ‘thing’ is in the extent of the ‘concept’. Effectively, this typically results in multiple data definitions for a given business concept. For instance: we can define ‘customer’ as ‘someone who has purchased something with us’ (this can be evaluated to true or false), but for a sales department, someone who hasn't purchased anything yet but has shown interest in doing so might still be considered a ‘customer’. For them, a customer might be ‘someone who has shown interest in or purchased one of our products’ (this can also be evaluated as true or false). This is not wrong; in fact, it is very desirable that anyone can define classifications to suit their tasks and responsibilities, and this typically happens in any organization consisting of more than one person. In natural language these subtle differences in meaning (semantics) are no problem: a single term like ‘customer’ suffices. In the data world, however, these are radically different things. In data, poor differentiation between concepts can be the difference between interoperable and non-interoperable data. A conceptual model helps make these differences in meaning explicit and gives insight into how to make these data sets interoperable and reusable.
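The two ‘customer’ definitions above can be formalized directly as predicates that evaluate to true or false for any given party; the record fields used here are illustrative.

```python
# The two 'customer' definitions from the text, formalized as predicates.
# Each evaluates to True or False for any given party, so each defines a set.
# The dictionary fields are illustrative.
def is_customer_finance(party):
    """'Someone who has purchased something with us.'"""
    return party["purchases"] > 0

def is_customer_sales(party):
    """'Someone who has shown interest in or purchased one of our products.'"""
    return party["purchases"] > 0 or party["shown_interest"]

lead = {"purchases": 0, "shown_interest": True}   # a sales customer only
buyer = {"purchases": 3, "shown_interest": False}  # a customer under both definitions
```

Note that every party satisfying the first definition also satisfies the second: the finance set is a subset of the sales set. Making that subset relation explicit is exactly what the conceptual model contributes.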
FAIR Reference standards related to the conceptual model
Information systems: Findable, Accessible, Interoperable, Reusable
Information systems are any application, registry, database, BI report, or data science algorithm that produces data or information (which they typically all do). These systems need to guarantee the availability of the data, both by producing it and by publishing it. They form the core of any data architecture, as this is where the data lives.
In terms of governing standards, we need to make a distinction here. The data architecture exclusively dictates standards for the information that goes into and comes out of an information system. Any other design parameters, such as technological ones, are not governed by the data architecture; for those, the solution architecture should be leading.
Finally, the information systems are where most development happens. In the early phases of innovation, teams should be free to innovate and explore different conceptual models without being restricted by any governing standards (which, almost by definition, will not meet the information needs of the innovation). So deciding when an information system should comply with the data architecture is a question that each organization needs to answer for itself.
FAIR Reference standards related to information systems
Most schema-specific technology choices will be made through the solution architecture. Examples include:

For conceptual models, the following standards are used:
Vertical lineage: what the data means
Vertical lineage helps us understand what data means by relating our data instances to the conceptual models and business vocabulary.
By modeling the data coming out of any information system according to a common conceptual model, data interoperability is ensured. However, any given use case will never require all the objects that are defined in the conceptual model: you never use an entire language to write an article, only those concepts you want to say something about. So we make a selection out of the conceptual model, and sometimes we wish to constrain certain relationships. This is typically where the closed world assumption enters the stage. We call this a profile. A profile is a technology-agnostic description of what the data for a specific use case looks like. A profile should be transformable to any technology-specific schema, like a database schema for storage, a JSON schema for a REST API, or an Apache Avro schema for publication on a message queue.
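A rough sketch of that idea: a profile selects concepts and properties from the conceptual model, adds closed-world constraints, and can then be rendered to one technology-specific schema (here, a JSON-Schema-like fragment). The model, profile, and field names are all illustrative.

```python
# Sketch: a profile is a selection plus closed-world constraints over a
# conceptual model, renderable to a technology-specific schema.
# Model, profile, and field names are illustrative.
conceptual_model = {
    "Customer": {"name": "string", "email": "string", "purchases": "integer"},
    "Product": {"sku": "string", "label": "string"},
}

profile = {
    "class": "Customer",
    "properties": ["name", "email"],  # selection: not every concept or field is needed
    "required": ["name"],             # constraint on the selected properties
}

def to_json_schema(model, profile):
    """Render the technology-agnostic profile as a JSON-Schema-style fragment."""
    fields = model[profile["class"]]
    return {
        "type": "object",
        "properties": {p: {"type": fields[p]} for p in profile["properties"]},
        "required": profile["required"],
        "additionalProperties": False,  # the closed-world assumption made explicit
    }
```

The same profile object could equally drive a DDL generator or an Avro schema generator; the profile itself stays free of any one technology.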
We realize that what we say here is by no means new: model-driven engineering has been talking about this for decades. Standards like UML and its companion XMI have been strong efforts to realize the promises of model-driven engineering. However, UML is not formal enough to enforce a single interpretation (as it is first and foremost a graphical convention), and the major tooling vendors could never agree on an interoperable XMI syntax, so models produced in one modelling tool could usually not be used in another. This meant that the interoperability problem in the information world was moved from the model level to the meta-model level.
What has changed over the last couple of years is that the semantic models required to fully capture the information requirement have been (like OWL and SHACL), or are being (like DX-PROF), standardized. Whereas UML leaves it to the interpretation of the modeler what exactly a class is and when the subclass/superclass relation should be used, OWL is very clear that a class represents a mathematical set, and that the subclass relation is only to be used when one set is strictly a subset of the other. At the same time, while there are multiple syntaxes for RDF, these are all fully interchangeable, and applications like the OWL API show this time and time again.
Getting vertical lineage right, in a ‘FAIR’ way, was until recently not possible. However, with the availability of open standards that describe this domain, we believe this is now changing.
Horizontal lineage or provenance: where data comes from
The final topic we will discuss in this paper is horizontal lineage. This refers to keeping track of how data originates, how it has been modified, and by whom, by the time it gets to the user. This is important to instill trust in the data, as well as to make more accurate judgments as to whether the data is actually suited for its intended use. Perhaps most important: data quality problems almost never originate within a single application (and if they do, they can typically be fixed like any other bug); rather, data quality problems are most often the result of the complex interactions between applications and people (be it the users or the developers). Having the lineage of data available can help resolve these issues.
Horizontal lineage helps keep a record of the different processes and people that have somehow touched the data. This is not something that is confined to a single application. Horizontal lineage requires many (ideally all) applications to publish which data they ingest, what they do with that data, and what data gets produced or published. As can be seen in the diagram, the semantic model of horizontal lineage (here depicted as a natural language approximation of the PROV-O W3C recommendation) is not complicated. However, implementing it consistently across an organization can be a complex task.
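The shape of such lineage records can be sketched in plain Python using the PROV-O vocabulary informally: each record links an entity to the activity that generated it, the entity it was derived from, and the agent responsible. The file and agent names are made up.

```python
# A plain-Python approximation of PROV-O style lineage records.
# Property names echo PROV-O (wasGeneratedBy, wasDerivedFrom, wasAttributedTo);
# the file and agent names are hypothetical.
provenance = [
    {"entity": "orders_clean.csv", "wasGeneratedBy": "deduplicate",
     "wasDerivedFrom": "orders_raw.csv", "wasAttributedTo": "etl-pipeline"},
    {"entity": "orders_raw.csv", "wasGeneratedBy": "nightly-export",
     "wasDerivedFrom": None, "wasAttributedTo": "webshop-db"},
]

def trace(provenance, entity):
    """Walk wasDerivedFrom links back to the original source entity."""
    by_name = {record["entity"]: record for record in provenance}
    chain = []
    while entity is not None:
        chain.append(entity)
        entity = by_name[entity]["wasDerivedFrom"]
    return chain
```

As the text notes, the model itself is simple; the hard part is getting every application in the chain to actually publish records like these.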
FAIR Reference standards related to horizontal lineage
Conclusion
Data architecture is not a well-understood field, even though we as a species have managed information for millennia. We believe that the FAIR principles, applied to the four basic functions of a data architecture, with (not just documented but) implemented horizontal and vertical lineage, are our best guess for the next step forward in the field of data architecture and information management.