Two approaches to modelling meaning
By Joep van Genuchten
- 9 minutes read - 1810 words

In this piece I want to introduce two approaches to modelling meaning that are relevant to data architecture. Both revolve around the Triangle of Reference, introduced in 'The Meaning of Meaning' by Ogden and Richards in 1923.
In a nutshell, the triangle relates the thoughts or concepts we have (top of the triangle) to the things in the world that we have thoughts about (the Referents) and the symbols we use to symbolize our thoughts (like words, pictograms, data, etcetera). An important point the triangle tries to convey is that there is no direct relationship between symbols and referents: that relationship can only be made via our thoughts, our interpretation.
Note
The triangle of reference is, in my opinion, not a very formal model. For instance, in the case of metadata (data about data), the data, which is usually classified as Symbol, becomes the Referent; the metadata is then the symbol we use to refer to it. Having said that, the semantic triangle does give a pretty intuitive model to work from.
The two models focus on different aspects of the triangle. One aims to capture natural language: it tries to capture the different words (symbols) we use to express the same concept, and to make explicit when we use the same term to actually mean different things. The other model aims to capture formality: it tries to classify the things in reality that we think of as similar, and the concepts we form around that similarity.
This part really only scratches the surface of the subject of natural and formal languages. It only looks at it through the lens of the current best practices around data management and data architecture. One piece that has formed a lot of my understanding of this subject is a thesis by Jessica Olsen: Would You Believe That? The prerogative of Assent and the Utility of Disagreement
Natural Languages and Controlled Vocabularies
In my experience, models of natural language focus on the relationship between the words and language we use (the Symbols) and the concepts we have in our head. In models of natural language, the 'Referent' (the things in the real world that we are talking about) is not a focal point. The reason for this is that determining when a thing is really just one thing (see the part on patterns of miscommunication) is a discipline in and of itself, with lots of ways to make mistakes. Focussing on the concepts and the symbols we use to represent them leaves a lot of room for interpretation and intuition about the world. This makes models of natural language particularly suitable for situations around innovation, where we all have a hunch but cannot quite put our finger on it.
Controlled vocabularies, as models of natural language, place a lot of emphasis on uniquely identifying the concepts that we have agreement over. The symbols we use to symbolize our concepts are in service of the concept. So an important capability of controlled vocabularies is being able to identify both synonyms and homonyms.
This means that controlled vocabularies are a great way to facilitate communication between people. They challenge people to pay attention to what they mean to communicate and to the terminology they use to express it. Similarly, the exercise of creating a controlled vocabulary can really help people in the same team or organization understand each other better.
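To make the synonym/homonym capability concrete, here is a minimal sketch of a controlled vocabulary in Python. All names, identifiers, and terms are invented for illustration; real vocabularies typically follow a standard such as SKOS, but the core idea is the same: a stable concept identifier with terms attached to it.

```python
# A minimal controlled-vocabulary sketch (hypothetical names and IDs).
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str       # unique, stable identifier for the agreed-upon concept
    preferred_label: str  # the term the team agrees to use
    synonyms: set = field(default_factory=set)  # other terms for the same concept


vocabulary = [
    Concept("C001", "customer", {"client", "account holder"}),
    Concept("C002", "bank", {"financial institution"}),  # the institution
    Concept("C003", "bank", {"riverbank", "shore"}),     # the landform
]


def lookup(term: str) -> list:
    """Return every concept a term may refer to; more than one hit means a homonym."""
    term = term.lower()
    return [c for c in vocabulary
            if term == c.preferred_label or term in c.synonyms]


print([c.concept_id for c in lookup("bank")])    # two hits: 'bank' is a homonym
print([c.concept_id for c in lookup("client")])  # one hit: synonym of 'customer'
```

Note that the lookup is keyed on terms but resolves to concept identifiers: the symbols are in service of the concept, not the other way around.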
The limitation of natural language models is that the concepts they identify and define are not very formal. People tend to assume that these concepts then make good data definitions. This is unfortunately not so. As a matter of fact, data definitions based on these models tend to leave too much room for interpretation, which causes confusion among those responsible for collecting data. If data definitions are what you need, look at formal languages.
Formal Languages and Ontologies
Whereas natural languages focus on concepts and symbols, formal languages emphasize the relationship between the referents and our concepts. This means that these languages are good at challenging the concepts we hold. To illustrate: if two people look at an aquarium and discuss how many organisms they can see, one might answer 20 and the other 13. This implies either a mismatch in sensory capability (let's assume not) or a mismatch in the assumptions we hold about classifying 'organisms'. Maybe one observer only counted fish and the other counted fish and plants. Or both counted fish and plants, but one also counted patches of algae. Then you can have a discussion about how to count algae: is a patch really one organism, or is it many that we cannot see, and should we therefore not count algae at all?
You probably intuitively have a sense that this perspective is more related to what we typically think of when we think of 'data'. Indeed, especially around (semi-)structured data, formal languages are a much better model for understanding how they behave. Formal languages lend themselves better to modelling human-machine communication.
If we start at the concept, or thought, formal languages pay a lot of attention to which things in the world are represented by that thought; this is called the extent of a concept. So when we think of 'chair', we go around the world pointing at things and wondering "is this a chair?". Formal languages then go on to say: let's treat the totality of things that we have classified as chair as a mathematical set. From a formal language perspective, the concept 'chair' represents the set of things that are chairs. If two observers were to hypothetically compare the sets they have come up with (assuming they could both see 'all' things) and they have the exact same set, we conclude that they must hold the same belief about the concept 'chair'. If this is not the case, they must believe different things about what a chair is. In formal languages the set is the basis for understanding, and this can be a very powerful tool for creating understanding or agreement. We can then go on and ask: if we encounter a new thing, what are the rules to determine whether it is a chair? This yields definitions like: 'A chair is something you can sit on that is made by humans'. We can then wonder whether a toilet seat classifies as a chair and adjust our definition according to our intuition or information requirement.
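The extent-as-set idea can be sketched directly with Python sets. The observers, things, and the membership rule below are invented for illustration; the point is that comparing beliefs reduces to comparing sets, and a definition becomes a membership test.

```python
# Sketch: a concept's extent as a mathematical set (illustrative items only).
observer_a = {"armchair", "office chair", "stool", "bench"}
observer_b = {"armchair", "office chair", "toilet seat"}

# Identical sets would imply identical beliefs about the concept 'chair'.
same_concept = observer_a == observer_b

# The symmetric difference is exactly the set of disputed classifications.
disputed = observer_a.symmetric_difference(observer_b)
print(same_concept)       # False: the extents differ, so the beliefs differ
print(sorted(disputed))   # the things worth discussing

# A rule-based definition acts as a membership test for new things.
made_by_humans = {"armchair", "office chair", "stool", "bench", "toilet seat"}
sittable = made_by_humans | {"rock"}


def is_chair(thing: str) -> bool:
    """'A chair is something you can sit on that is made by humans'."""
    return thing in sittable and thing in made_by_humans


print(is_chair("toilet seat"))  # True under this rule: a prompt to refine it
print(is_chair("rock"))         # False: sittable, but not made by humans
```

The symmetric difference is useful in practice: it points the two observers at exactly the items where their classifications diverge.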
Note
I want to draw attention to the term 'information requirement' here. When designing ontologies, we often assume we model 'natural kinds'. Or rather: we assume that once we have captured the natural kinds in our models, we have a good model. My experience is that, while the concept and the ideas behind it are very useful, this does not typically align with what people and organizations intuitively want to know about the world around them. Thus: most data/information models in existence are not modelled after natural kinds, but rather after a set of things that is somehow important to the processes we have or feel responsible for. Keeping this in mind will help us interpret these models more correctly.
An important thing to note here is that while I have used the word 'chair' (a word that is also part of our natural language), the primary driver of the concept is really the set. The word 'chair' is just a label that I assigned to it for the purpose of this story. In data management, you will often see that these sets are named with a (semi-)meaningless term or identifier. We then assign a human-readable label (Symbol) to them so the information is more naturally consumable for human beings. After all, our language doesn't have enough terms to describe all possible sets of things we think of (chairs: with or without arm support, with or without wheels, with or without back support, etc.). Unfortunately, formal language terms often get interpreted as natural language terms, which leads to miscommunication. If you have ever wondered why things containing data (tables, columns, objects, etcetera) tend to have such seemingly weird and unintuitive names (full of abbreviations, prefixes, and suffixes), it is very much related to this: developers try to hint at the relevant concept in their naming convention and then explicitly distinguish subsets by adding prefixes or suffixes that say something (to them, at least) about the specific subset of data captured in that table or data set.
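The split between a (semi-)meaningless identifier and a human-readable label can be sketched as a simple catalogue mapping. The table names and rules below are entirely invented, in the style of the abbreviated names the text describes; no real system is implied.

```python
# Sketch: cryptic physical names identify exact subsets; labels are attached
# separately for human consumption. All names are made up for illustration.
catalogue = {
    "chr_wh_arm_v2": {  # abbreviated developer name: CHaiR, WHeels, ARM support
        "label": "Chairs (with wheels, with arm support)",
        "extent_rule": "has_wheels and has_arm_support",
    },
    "chr_nb_01": {      # CHaiR, No Back support
        "label": "Chairs (without back support)",
        "extent_rule": "not has_back_support",
    },
}

# A human-facing view resolves identifiers to labels, the way a data
# catalogue or glossary would.
for name, meta in catalogue.items():
    print(f"{name}: {meta['label']}  [{meta['extent_rule']}]")
```

The identifier pins down exactly one set; the label is just a Symbol layered on top, which is why renaming a label should never change which data the identifier refers to.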
However, there are also some shortcomings. Sometimes we have a concept, we hold a belief, but we also know that we are not sure for every instance whether it is part of the set. Sometimes that is because it's an immature concept (we know there is something, but we know we don't fully understand it yet): we wonder if 'stools' are part of the set of 'chairs'. Eventually this is something we can settle, and then the definition will become clearer, more formal. But sometimes it is because the concept does not lend itself to 'counting': it is weird to think of concepts like 'love', 'beauty', 'friendship' or 'evil' as things we can count. Sure, we can point at all the things that we find beautiful, put them into a mathematical set and say: this is beauty to me. But that doesn't capture its meaning, its essence. It says something about the things we find beautiful, not about beauty itself. It certainly isn't something we can define a rule around so that we can decide for the next thing whether it is beautiful or not.
Note
Things like recommendation algorithms sure give this a good try. But their classification is a probabilistic one, not a deterministic one. They make a best effort guess, but we know that there is an uncertainty associated with these classifications.
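The contrast the note draws can be sketched in a few lines: a deterministic rule answers yes or no, while a recommender-style classifier emits a confidence score. The items and scores below are invented for illustration, not the output of any real model.

```python
# Deterministic: membership in an agreed set, nothing more.
def deterministic_is_beautiful(thing: str) -> bool:
    agreed_set = {"sunset", "rose"}
    return thing in agreed_set


# Probabilistic: a best-effort guess with an explicit uncertainty attached.
model_scores = {"sunset": 0.94, "rose": 0.88, "spreadsheet": 0.07}


def probabilistic_is_beautiful(thing: str, threshold: float = 0.5):
    score = model_scores.get(thing, 0.0)
    return score >= threshold, score  # verdict plus the confidence behind it


print(deterministic_is_beautiful("sunset"))       # True or False, never "maybe"
print(probabilistic_is_beautiful("spreadsheet"))  # (False, 0.07): a guess, not a fact
```

The threshold makes the uncertainty explicit: moving it changes the classification without any new knowledge about the thing itself, which is exactly why such classifications are guesses rather than formal definitions.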
The strength of formal languages is that they put some very rigorous checks on how we look at reality. They allow us, to a degree, to check for inconsistencies in our own mental projections of the world. However, there are certain things that are important to us as human beings that do not lend themselves to formalization. When you find yourself struggling to define these kinds of concepts: look at natural languages.
In conclusion
On the spectrum of natural versus formal languages, most people have a preference for one side or the other. In order to capture truth in the data we collect about our world, we have to account for our limited perception of it and the many ways different people look at it. To do this, we need to communicate both through formal languages, to capture things we know and understand pretty well, and through informal languages, to explore new ideas.