What is information
By Joep van Genuchten
Before we talk about data management, and how people and machines interact with data, it is important to take a moment to explore why we have data in the first place. And for that we need to take a step back and explore the nature of information.
This will allow us to better understand the strengths and limitations of data and also put its expressiveness in context.
I’d like to make an important scope note here. This piece aims to place the nature of information, as it relates to data management, in a broader context. There is a lot more to say about information than what I discuss here, but that is beyond the scope of this piece.
Note
Sometimes, while you are writing up what you know, you realize that other people have tried to convey the same knowledge in a different, perhaps much better way. This story by the podcast ‘Hidden Brain’ covers much of the same ground as this piece. It puts emphasis on a different part of information than I do here, so it forms a good complementary perspective.
Forms information might take
In order to avoid the very deep rabbit hole that is studying and understanding the nature of information, and yet hint at how deep it is, let’s have a look at different forms that information may take. If we look at the Information Artifact Ontology, it has in excess of 200 classes defining different sorts of things that contain information. Examples include information contained in:
- data sets
- documents
- narrative structures
- images
- identifiers
- symbols
We can easily think of some more analogue forms, information contained in:
- speech
- gestures (nodding, waving)
- Morse
- Flag Semaphore
A keen observer will realize that all of these have a human origin. This leads us to wonder: are there non-human-made forms that contain information? If we see a dark cloud in the sky, does that contain information? If we hear a loud noise quickly approaching us, is that information? If we feel the sting of a bug, is that information? Let’s refer to these non-human-made forms of information as signals.
How we process signals and data
Some would probably argue that signals are not yet information, that we first need to interpret them before they turn into information: if we see a dark cloud in the sky, we interpret that to mean we are about to get rained on and should look for shelter. If we hear a loud noise quickly approaching, we interpret that as a bus heading right for us and jump out of the way. If we feel the sting of a bug, we interpret that to mean we are standing on an ants’ nest and need to remove ourselves, so that the ants no longer perceive us as a threat and no longer attack us.
I am sympathetic to this distinction. I have also noticed that people using data or other forms of human-made information assume that interpretation of those data is not necessary. We think that because other people have interpreted the signals for us, we don’t need to do any further interpretation. Yet we do interpret. Whether we see a dark cloud in the sky (signal) or look at a precipitation radar (data), we try to guess if the rain will affect us. When we look at a dataset, we look at its form and the totality of the labels that are used to describe it, and we guess if what is in the data will affect us. So while the provenance (how it came to be) of signals is different from the provenance of human-made information forms, what we do to translate them into action (within ourselves or our organization) is the same. This raises the question: if we need to go through the same mental process (interpretation) to translate signals and ‘human-made information forms’ into insight into our own situation, is the distinction that meaningful for determining how we should manage them?
Ironically enough: yes, but for the opposite reason than we initially think. There is an important difference between signals and ‘human-made information forms’ and how we process them. Signals do not have intent behind them; human-made information forms do: the intent and worldview of the author of the data (or the programmer of the application, in the case of automatically generated data). This means that with human-made information forms, we should not only wonder: what does this say about the world, or my world? But also: how did the author affect the shape of this information, and how does that affect the way it represents the world?
‘Human-made information forms’ are extremely useful in helping us make sense of the world around us. They allow for cooperation. We are a social species after all. There is a reason why there are so many books, pictures, art pieces, datasets, etcetera. In the arts, we teach ways to interpret someone else’s creations and to reflect on their meaning and the lens that the author lets us see the world through. When it comes to the formality of structured data, somewhere along the way we started to assume that this lens is perfectly clear. The assumption that we all understand the relationship between ‘human-made information forms’ and the reality they are about is akin to assuming nobody is colorblind (and colorblindness is in my experience a much rarer condition). I see this as the major reason the ‘data driven revolution’ hasn’t fulfilled its promise yet.
Defining data and information
If you google “difference between information and data”, the most common answer is something along the lines of:
- ‘Information is data that was processed so a human can read, understand, and use it’ and
- ‘Data is a raw and unorganized fact’
While these links are among the top results, when you follow them, they go to places like “guru99.com” and “computerhope.com”. That is to say: there are a lot of people trying to (find an) answer (to) this question, but perhaps not very many (what I would consider) reputable sources on the matter.
Of the many attempts at defining this distinction, most align decently with the two definitions that I have copied above. I presume that they therefore represent the popular understanding of these two terms. In conversations with people about data, I find that indeed many of them hold similar definitions.
So, data is ‘raw’, ‘unprocessed’ and ‘unorganized’, whereas information is ‘based on data’ and ‘processed’. Given what I have written above, these definitions seem somewhat naive. After all, data is collected by someone (or by something made by someone) and is therefore already processed. If we look at structured data (like tables, data graphs, or hierarchical structures in an API), these are:
- Examples of data as most people understand the concept.
- Highly processed, through the data designer’s interpretation of signals and of what the data should represent.
- Highly organized (if only because otherwise computers cannot do anything with them).
Clearly, the popular definition is somewhat inconsistent with the intuition we have about what data is (and the things we would identify as such). There is nothing ‘raw’ about data: there has always been at least one human who decided what it would look like.
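To make this concrete, here is a minimal sketch in Python of how even a very simple ‘raw’ record already carries its designer’s choices. Everything in it is hypothetical and made up for illustration: the `RainReading` structure, the `record_reading` function, the calibration factor and the threshold for what counts as rain.

```python
from dataclasses import dataclass

# Hypothetical design decisions made by the author of the data,
# not by the weather: calibration, rounding, and what counts as "rain".
MM_PER_HOUR_PER_VOLT = 4.0     # assumed sensor calibration
RAIN_THRESHOLD_MM_PER_HOUR = 0.5  # assumed cut-off for "it is raining"


@dataclass(frozen=True)
class RainReading:
    """One 'raw' row in a hypothetical precipitation dataset."""
    station_id: str
    rain_mm_per_hour: float
    is_raining: bool


def record_reading(station_id: str, sensor_voltage: float) -> RainReading:
    """Turn a signal (a sensor voltage) into data (a labeled record)."""
    # The unit, the rounding and the label are all interpretations.
    rate = round(sensor_voltage * MM_PER_HOUR_PER_VOLT, 1)
    return RainReading(
        station_id=station_id,
        rain_mm_per_hour=rate,
        is_raining=rate >= RAIN_THRESHOLD_MM_PER_HOUR,
    )


if __name__ == "__main__":
    print(record_reading("station-a", 0.1))
    print(record_reading("station-a", 1.0))
```

A different author could pick another threshold or another unit and produce a different dataset from exactly the same signals, which is why the lens of the author matters when we read the data.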
I like the definitions given by Wikipedia better:
- Information
  - “Information can be thought of as the resolution of uncertainty; it is that which answers the question of “What an entity is” and thus defines both its essence and the nature of its characteristics. The concept of information has different meanings in different contexts. Thus the concept becomes related to notions of constraint, communication, control, data, form, education, knowledge, meaning, understanding, mental stimuli, pattern, perception, representation, and entropy.”
- Data
  - “Data are units of information, often numeric, that are collected through observation”
This basically says something along the lines of: if information is like a physical quantity, data is the unit in which we express it. Both of these definitions refer (by mentioning stimuli, perception, representation and observation) to the concept of signal that I defined above. These will be the definitions we use.
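One way to make ‘resolution of uncertainty’ tangible is Shannon entropy, which expresses uncertainty in bits. The sketch below is only an illustration of that one formalization, not the definition this piece relies on. The function name `entropy_bits` and the probabilities for rain are assumptions made up for the example.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with probability 0 contribute nothing to the uncertainty.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical beliefs about [rain, no rain].
before_looking = [0.5, 0.5]   # no idea whether it will rain
after_looking = [0.9, 0.1]    # after seeing a dark cloud or the radar

uncertainty_before = entropy_bits(before_looking)
uncertainty_after = entropy_bits(after_looking)

# The uncertainty that the observation resolved, expressed in bits.
resolved = uncertainty_before - uncertainty_after

if __name__ == "__main__":
    print(f"before: {uncertainty_before:.3f} bits")
    print(f"after:  {uncertainty_after:.3f} bits")
    print(f"resolved by the observation: {resolved:.3f} bits")
```

Note that the numbers say nothing about whether the observer interpreted the cloud or the radar correctly. They only describe how much the observer’s own uncertainty changed.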