How do we know what we know in data modelling
By Joep van Genuchten
There are many frameworks for Data Architecture, like DM-BOK, FAIR and many others. The prerequisite to making any of them work for your organization is to understand your organization's data. In order to do this, we use semantic models (see here). But how can we be sure that the terms we define in them actually represent real, meaningful things?
Let's start by setting an important context for the field of data architecture: there is no definitive way to be certain about the correctness, the ‘true-ness’, of the semantic models and ontologies that we make. Furthermore, it is in the nature of our job and responsibilities to both:
- Be constantly confronted by that fact through the mismatch between the data and the reality that we encounter, and
- Continuously question our own work and check it against the ever-changing reality in which our organizations exist.
Whereas semantic models give us a way of thinking about what there is and what we might want to know about them, the only time we can actually influence the degree to which these models reflect reality is at the moment we make them. So how these models come to be, the epistemological process of ontology engineering, is crucial.
Many philosophers have said many different things about “epistemology” over the millennia, but they all eventually reach a point where you cannot go any further, cannot “know” any further, and you “just have to take a leap of faith” and make some assumptions which you then hope to be true. So what can you do?
In my opinion, the best way to “learn” or “know” something in data architecture is to be told by someone else, ideally someone who knows about the subject at hand. The more people say the same thing, ideally independently of each other, the more certain you can be that there is an element of truth to it.
In this article we will discuss methods that give us some structure to have faith in the semantic models that we make: A way for us to believe that they are probably true.
The consequences of getting it wrong
The nasty thing about semantic models is that it is incredibly hard to verify them against reality. After all: what is ‘real’?
At the same time, a concept or an object definition might seem absolutely intuitive in one moment, but a month or a year later it turns out to be completely ill-conceived or based on a misunderstanding. Furthermore, formal terms, once defined, tend to take on a life of their own. They seem to provide a solid framework to work from. So in a complex domain where no one person can ever hold all the knowledge, a well-defined term is a very tempting conceptual anchor. And, especially in an office, nothing is more concrete than a table full of data about ‘something’. The more information there is about ‘something’, the more likely we are to believe that that ‘thing’ actually represents something in reality. All too often, obscure data definitions from a vendor have become business terminology used by real people who all have different, typically inconsistent, ideas about what they represent.
To illustrate: consider a square circle, which is a shape with a center point, a constant radius, exactly four corners, and four sides of equal length. I can even define a data object (in the form of a table in this case):
| id | CenterPoint | Radius | SideLength | SurfaceArea |
|---|---|---|---|---|
| dfvsrg3r | 1,5 | 3 | 5.3 | 28.274 |
| sdgwy45y | 3,2 | 8 | 14.1 | 201.06 |
| we1345 | 7,2 | 1 | 1.77 | 3.1415 |
If the premise of this example hadn't been set out as ‘an example of nonsense’, how easy would it have been to believe that this is about something real?
A keen observer will even note that both π · Radius² and SideLength² (approximately) result in the SurfaceArea, so good luck figuring out with data quality checks that this is nonsense. As a matter of fact, data quality measurements (like any other algorithm) are by definition subject to ontological commitment. This means that you cannot use an algorithm that relies on the terms defined in the subject semantic model to falsify that model. The only way (that I know of) to do some verification is to express the model in OWL and use a ‘reasoner’ to check for internal consistency. In this case, if the ontology has already defined a circle as a shape that has no corners, then a specialization of circle that has four of them should trigger an error. However, in everyday life nonsensical definitions are often much harder to spot, so this is not as powerful as you might hope (even though making use of reasoners will definitely improve the quality of any semantic model).
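To see how powerless such checks are here, consider a minimal sketch in Python. The row values come from the table above; the quality rule and the rounding tolerance are my own assumptions, chosen to look like a perfectly reasonable check:

```python
import math

# The 'square circle' rows from the table above (CenterPoint omitted).
rows = [
    {"id": "dfvsrg3r", "radius": 3, "side_length": 5.3, "surface_area": 28.274},
    {"id": "sdgwy45y", "radius": 8, "side_length": 14.1, "surface_area": 201.06},
    {"id": "we1345", "radius": 1, "side_length": 1.77, "surface_area": 3.1415},
]

def passes_quality_check(row, tolerance=0.02):
    """A plausible-looking data quality rule: the recorded surface area
    must agree with both the circle formula (pi * r^2) and the square
    formula (side^2), within a tolerance that absorbs rounding."""
    as_circle = math.pi * row["radius"] ** 2
    as_square = row["side_length"] ** 2
    return (
        math.isclose(as_circle, row["surface_area"], rel_tol=tolerance)
        and math.isclose(as_square, row["surface_area"], rel_tol=tolerance)
    )

# Every row passes, even though a 'square circle' cannot exist: the check
# is committed to the same nonsense ontology as the data it validates.
print(all(passes_quality_check(row) for row in rows))  # → True
```

The check happily reports a clean dataset, because it can only measure the data against the definitions it was built from, never the definitions against reality.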
Obvious or not, these kinds of nonsensical definitions become part of the processes in which people work, and people use this data to make decisions: they become part of the business vocabulary. Very often, we take something to exist just because it has a definition (because someone else has defined it). Be mindful: the method proposed here is by no means immune to nonsense definitions; it just reduces the likelihood, because it is aimed at involving as many minds as possible in the hope that at least one of those minds notices the mistake before implementation.
Basic process
The goal of the process is to be as sure as we can that our semantic model has a meaningful relationship to reality. This touches roughly three levels. The model needs to be:
- Meaningful to us,
- Meaningful to our organization and
- Meaningful to our industry or domain.
Note
Something being meaningful also means that for every element in the semantic model there needs to be at least one person who understands it and can explain it to others. All too often I have seen canonical data models, or other semantic models, that contain parts no one understands anymore.
This is extremely detrimental for two reasons. First of all, it's a waste of modelling effort, both for those who create the models and for those using them, who waste time and energy interpreting models that no one else understands. Second, this type of waste really harms the goodwill of the organization towards the data architecture as a whole.
Typically an organization loses understanding of (parts of) its semantic models when no one feels ownership over the definitions. Should you find yourself in such a situation, consider not modelling at all. If that is what you end up doing, make sure you communicate it clearly to your stakeholders and explain that there is a lack of ownership.
In order to do this, we try to involve as many people as we can, preferably Subject Matter Experts (SMEs), while ensuring that we and our colleagues both understand and are able to explain the models we make.
The why of this process will be discussed in more detail below. For now, it suffices to understand that the basic strategy underlying the flowchart is to ensure that, at every turn, the knowledge of as many people as possible is involved in the creation of the semantic models. Basically: the more often you encounter “Yes”, the better off you are.
The importance of Subject Matter Experts
As a Data Architect, you are, relatively speaking, a generalist. Often, you are asked to make, or assist in making, an information model about a subject you have very little, if any, knowledge of. In this context it is crucial to have access to SMEs.
So what makes an SME? You need someone who:
- Understands the subject
- Understands how the subject matters to the organization
- Understands the organizational processes around the subject
- Who does what, in which order and why?
- Who is responsible?
This knowledge may be present in one person, or you may need to talk to several people. These people will help you navigate the subject domain. You can also ask them to verify your work.
Reference models
Since our goal is to get as many SMEs as possible involved in the creation of our semantic models, our preferred method is to base our models on standardized reference models. This may seem non-obvious, but a standardized reference model has typically been designed by many SMEs from many different organizations, so it meets our most important requirement. All we have to do is use them.
There is, however, a certain hierarchy of models that you should prefer over others.
- Models standardized by the W3C, ISO, IEC or another standardization body should always be preferred.
- When multiple models from different standardization bodies exist, I would suggest selecting the one that you and your colleagues understand most naturally.
- Open source reference models
- Proprietary reference models.
- Depending on the licensing and the risk appetite of the organization, it might be preferable to design an organization-specific semantic model at this point.
Subject Matter Experts in standardization
So all we have to do is use them, but that is easier said than done. In order to grok them fully, it helps to explore these reference models with an SME. But beware: SMEs are typically not trained in ontology or data modelling. When SMEs talk about something as if it is one thing, from an information perspective they are often talking about more than one thing. So they might not always immediately understand why the authors of a reference model have decided to model something in the way that they have. It is then your job to make the translation from the reference model to the way the SME sees the world and help them understand why the authors of the reference model made these decisions.
Very often these modelling decisions can be explained by basic modelling patterns that are often (also) part of upper ontologies, which we'll discuss in the next section.
Using upper ontologies as part of your epistemological process
Upper ontologies, like the Basic Formal Ontology (BFO) or Basic Semantics, provide a basic high-level structure for classifying things. Many of these are derived from (variations of) the basic categories that Aristotle named:
- Physical things
- Activities
- States
- Events
- Spatial Regions
- Temporal Regions
- Information Objects (a more recent addition to the traditional categories)
Note
Aristotle (and many upper ontologies) relied on classifications like Secondary Substance, Quantities, Qualifications and some others. These are typically covered by meta-models (models that describe models) like UML and OWL, which are typically used to describe reference models. UML and OWL are themselves reference models, so they fit within the framework of using reference models wherever possible, as laid out here. For more information, see here.
It is advisable to make an upper ontology part of your organization's semantic models, as it helps you to meaningfully harmonize or relate reference models from different domains.
Weak points in reference models
Using a standardized reference model should always be preferred over designing your own models. However, there are some things to beware of. Applied semantic modelling as a field is still very much in flux. This means that some of the best domain-specific reference models, like IEC-CIM, were first created when we were still trying to figure out best practices. For instance, in the ’90s and 2000s it was common not to understand the difference between, say, an event, status, or physical object and the information object that holds information about it. So in IEC-CIM the Outage class has been modelled as a specialization of Document, because there was certain metadata that the community wanted to attach to outages. So while the intent was to model a situation where no power is delivered, the model treats it as a specialization of ‘a stack of paper’. Similar patterns can be found in many UML/Object-Oriented reference models.
Although these constructs are typically not part of the core model (that is, the part of the domain that the modellers are most interested in), they do pose the question whether you should correct the model and deviate from the standard, or conform to the standard but accept semantic errors in your model, so that it no longer accurately represents reality. There is no one right answer to this question. It will depend on the situation your organization is in, and also on how mature the data architecture of the organization is. In general: if the data architecture is not very mature yet, I would recommend adhering to the standard over altering it.
Building a semantic model based on Subject Matter Experts Knowledge
Sometimes there is no (standardized) reference model available (although you'd be surprised how often a more thorough internet search does turn up something, so are you really sure?). In this case, you will have to build one from scratch. The goal here is to be a facilitator for SMEs to make their knowledge explicit, so you can capture it in both a human- and a machine-readable form.
Involve Subject Matter experts
When building semantic models from scratch, it is once again crucial to involve subject matter experts. They know their subjects best.
Using upper ontologies as part of your epistemological process
If you thought you saw this title already, you are correct. The use of an upper ontology is perhaps even more relevant when making your own model. It helps to identify those things that intuitively feel like one thing, but from a formal perspective decompose into multiple parts. It also helps to prevent the model from effectively creating another silo by defining, at a high level, relationships to other kinds of objects. For example, both BFO and Basic Semantics already tell you that any physical object has a relationship to a spatial region. This relationship helps you to integrate your domain ontology with common geo-spatial ontologies. You just sort of get that out of the box.
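The ‘out of the box’ effect can be sketched in a few lines of Python. The class names below are loosely inspired by upper-ontology categories, not taken from BFO or Basic Semantics themselves, and the `Transformer` domain class and its instances are entirely hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

# Upper-ontology layer (illustrative names, not any standard's actual API).
@dataclass
class SpatialRegion:
    name: str

@dataclass
class PhysicalObject:
    name: str
    # The upper ontology already commits every physical object to
    # occupying some spatial region.
    occupies: Optional[SpatialRegion] = None

# Domain layer: a class only has to declare itself a physical object...
@dataclass
class Transformer(PhysicalObject):
    voltage_kv: float = 0.0

# ...and the link to geo-spatial concepts is inherited 'out of the box'.
substation_plot = SpatialRegion("substation plot 12")
t1 = Transformer("TR-001", occupies=substation_plot, voltage_kv=150.0)
print(t1.occupies.name)  # → substation plot 12
```

The point is not the code itself but the design choice it mirrors: because the spatial-region relationship lives on the upper-ontology class, every domain concept you sort under it can be related to geo-spatial data without any extra modelling.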
Modelling process
When making semantic models from scratch there is a certain order of operations.
- Start by identifying concepts and form a controlled vocabulary.
- The point here is to make people’s intuition about a subject explicit.
- People will have different definitions and/or gut feelings about terminology: zoom in on this. Encourage people to speak their mind and disagree.
- Subject Matter Experts, especially engineers, have a tendency to think up rules (of thumb) about things they work with. These are interesting as they make a first step towards formalization.
- Rules are interesting but very often the exceptions to those rules shed most light on knowledge within a domain. You often have to chase down these exceptions.
- Organizations tend to push SMEs towards simplifying their knowledge, reducing it to easy or simple language. For semantic modelling you don't want SMEs to hold back in that way: encourage them to really dive in and let them guide you through their expertise.
- Once you have a map of the concepts, take your Upper Ontology and sort the concepts under general classes and relationships (now you are working towards creating a conceptual model)
- Certain concepts might not align neatly with your upper ontology. This is often the case because things that can be thought of as ‘one thing’ in natural language tend to be multiple things when capturing them in a formal language.
- When taking a concept, it can help to ask your SMEs where and when (in space and time) an instance of that concept comes into being, and where and when it ceases to exist. This often reveals subtle differences in interpretation and perspective.
- Perspective is important: you will find that there are multiple legitimate perspectives on single concepts within an organization. It is our job to identify them, understand why people look at these things differently (often related to roles and responsibilities), and then define them separately in our ontology, along with how they are meaningfully related.
- Relate your ontological definitions to other (domain specific) reference ontologies. This gives more (standardized) context to the things you define and helps other people understand them. It also ensures you are not just creating another silo. Instead, you integrate your model by design with other domains.
- The making of semantic models is often triggered by the implementation of new applications or algorithms. So you'll often be asked to then use the vocabulary and conceptual model to define one or more profiles and schemas.
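The first two steps of this order of operations can be sketched as plain data structures. Everything here is hypothetical: the terms, definitions, categories and the `iec-cim:` mapping are made-up illustrations, not entries from any real vocabulary or standard:

```python
# Step 1: a controlled vocabulary -- terms plus agreed definitions,
# captured from SME conversations. (All entries are hypothetical.)
vocabulary = {
    "Outage": "A period during which no power is delivered to a consumer.",
    "OutageReport": "A document recording information about an outage.",
    "Feeder": "A power line distributing electricity from a substation.",
}

# Step 2: sort each concept under an upper-ontology category. Note that
# the 'one thing' SMEs call an outage decomposes into a State and a
# separate Information Object about that state.
upper_category = {
    "Outage": "State",
    "OutageReport": "Information Object",
    "Feeder": "Physical Object",
}

# Step 3: relate concepts to external reference ontologies where possible
# (hypothetical mapping, shown only to illustrate the pattern).
reference_mapping = {
    "Feeder": "iec-cim:ACLineSegment",
}

# A cheap sanity check: every vocabulary term must end up categorized,
# so the conceptual model covers the whole vocabulary.
assert set(vocabulary) == set(upper_category)
for term, category in upper_category.items():
    print(f"{term}: {category}")
```

Even at this toy scale, the structure makes the key move visible: the vocabulary records what people say, while the upper-ontology sorting forces the formal distinctions (state versus information object) that natural language glosses over.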
The ultimate leap of faith: Democracy and the wisdom of the (small) crowds
When all else fails…
What else is there to say? Sometimes you don't have access to SMEs, and there are no reference architectures and no reference models. The first thing you should do is stop and wonder whether you should embark on this assignment in the first place: if the stakeholders cannot even provide you with an SME, is the project even on the right track? This is typically a moment to ask critical questions and push back.
But sometimes your organization just needs something… In these situations you will have to make do with your own intuition and that of those around you. Try to avoid making a model by yourself, and always involve others in the discussion to make sure that you are at least not bound to your own perspective. At the very least, define the objects that you do end up defining as subclasses of objects in the upper ontology that your organization has adopted.
In general, for any one topic there is a lot to know, more than can typically be deduced by a few individuals. So if this is the situation you find yourself in, feel free to define the model, but be open to having it changed, sometimes fundamentally, as new insights emerge. Self-defined models are typically not very stable. But, ultimately, a poor model (be it a vocabulary, a conceptual model or a profile) is still preferable to no model at all, as it gives us a reference for comparison and a structure to discuss and improve upon.
In Conclusion
When it comes to ‘knowing’ in Data Architecture, the more minds that have worked on a model, the better. So when it comes to modelling: do not do it yourself unless you have no alternative. In order of preference, your semantic models should be based on:
- A standardized reference model harmonized with an upper ontology.
- A standardized reference model that is not harmonized with an upper ontology.
- An ontology based on expert knowledge, harmonized with an upper ontology.
- A self-developed ontology harmonized with an upper ontology.