During my morning walk or commute to and from the airport, I’ve got into the habit of listening to audio books. My current book is The Story of Human Language.
I thoroughly recommend it. It has some great stories of how languages change over time. For example, do you know how the French ‘ne…pas’ negation construct came about? It looks so strange when you first learn French as a foreign language. Turns out, the old way of saying ‘not’ was to use only ‘ne’ in front of the verb. At some point, people felt the need to add emphasis and it became fashionable to use constructs like ‘he doesn’t walk a step’, ‘she doesn’t drink a drop’, and ‘they didn’t speak a word’. Over time, ‘step’ (‘pas’ en français) became the standard word for emphasis regardless of whichever verb it qualified. Nowadays, ‘ne’ is more-or-less redundant, and ‘pas’ has become the ‘not’ word in spoken French. Strange but true.
And typical, because natural languages are constantly evolving. Which raises the question of whether a language reference should be prescriptive or descriptive [i]. Or maybe a bit of both.
What’s that got to do with data modelling?
I’m ‘pas’ always thinking about language. Lately, I’ve been spending a lot of time thinking about the future of data management for Oil and Gas subsurface data. Funnily enough, scratch the surface and you’re faced with the same prescriptive / descriptive question because data management is also evolving.
New big data techniques and tools are offering up new implementation approaches. Instead of RDBMS, what about file stores, document stores, key-value stores? And instead of schema on write, why not do schema on read?
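To make the schema-on-write versus schema-on-read distinction a little more concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the record layout, the field names (well, depth, depth_unit) and the function names are assumptions, not taken from any real store or exchange standard.

```python
import json

FT_TO_M = 0.3048  # exact international foot-to-metre factor
REQUIRED_FIELDS = ("well", "depth", "depth_unit")  # hypothetical schema


def write_strict(store, raw):
    """Schema on write: reject a record that does not fit before storing it."""
    record = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"rejected at write time, missing: {missing}")
    store.append(raw)


def write_loose(store, raw):
    """Schema on read: store whatever arrives, untouched."""
    store.append(raw)


def read_depth_m(raw):
    """Apply the schema at read time; return (well, depth in metres) or None."""
    record = json.loads(raw)
    if any(name not in record for name in REQUIRED_FIELDS):
        return None  # this reader cannot interpret the record
    depth = float(record["depth"])
    if record["depth_unit"] == "ft":
        depth *= FT_TO_M
    elif record["depth_unit"] != "m":
        return None  # unknown unit: refuse to guess
    return record["well"], round(depth, 2)


def read_all(store):
    """Return only the records that this particular reader can interpret."""
    return [row for row in map(read_depth_m, store) if row is not None]


# Hypothetical incoming records; the third one is incomplete.
INCOMING = [
    '{"well": "A-1", "depth": "1500", "depth_unit": "m"}',
    '{"well": "B-2", "depth": "4921.26", "depth_unit": "ft"}',
    '{"well": "C-3"}',
]

loose_store = []
for raw in INCOMING:
    write_loose(loose_store, raw)
```

In this sketch the loose store keeps all three records, and the incomplete one is simply skipped by this reader. A different reader with a different schema could still make use of it later. The strict writer would have refused it at the door.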
“But what about the rules?” I hear you say. Surely we need to be prescriptive here. Subsurface data is valuable data, often costing millions of dollars to acquire, and we don’t want to risk losing or breaking it.
Not lost in translation
Luckily, we have industry standards which keep us safe during data exchange. Good or bad, data-exchange standards will always survive because they are fundamental to our business. And the modern standards are pretty good.
I’m currently looking into Kylo as a framework for ingest [ii] of subsurface data. Clearly, ingest of industry-standard exchange formats is the place to start. And at that point we can apply quality control to our data.
But then what?
Transforming the data to a more flexible, more integratable [iii] format has to be the second step. And that requires modelling.
Which is not as easy as it sounds. Subsurface data is big – too big – with many versions of the ‘truth’, each based on different uncertainties or assumptions. And, critically, we need to be meticulous about our measurement data (units of measure, geodetic references, and quality indicators, please) otherwise our data becomes meaningless.
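As a small illustration of what being meticulous might look like in a model, here is a hedged Python sketch of a measurement that cannot exist without its unit, reference and quality indicator. The class name, the field names and the example values ("MSL", "checked") are my own assumptions for illustration, not part of any industry standard.

```python
from dataclasses import dataclass, replace

FT_TO_M = 0.3048  # exact international foot-to-metre factor


@dataclass(frozen=True)
class Measurement:
    """A value that always travels with its context (hypothetical model)."""
    value: float
    unit: str       # unit of measure, e.g. "m" or "ft"
    reference: str  # geodetic or vertical reference, e.g. "MSL"
    quality: str    # quality indicator, e.g. "checked" or "raw"

    def __post_init__(self):
        # Refuse to create a measurement whose context is missing.
        for name in ("unit", "reference", "quality"):
            if not getattr(self, name):
                raise ValueError(f"measurement is meaningless without {name}")

    def to_metres(self):
        """Return a copy in metres; reference and quality are carried along."""
        if self.unit == "m":
            return self
        if self.unit == "ft":
            return replace(self, value=self.value * FT_TO_M, unit="m")
        raise ValueError(f"unknown unit: {self.unit}")


depth = Measurement(4921.26, "ft", "MSL", "checked")
depth_m = depth.to_metres()
```

The point of the sketch is only that the context is part of the value. A bare number like 4921.26 never gets stored on its own, so it can never be misread later.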
If we create only logical models, we’re in danger of coming up with complex models that, because of our data volumes, will be expensive or slow when implemented. But approaching modelling in a purely performance-based manner (as many subsurface software vendors have done in the past) creates rigid systems built for known, linear workflows – prescriptive systems that slam the door on discovering new ways of integrating and accessing data.
Loosening the reins a little
I don’t believe we need to choose one way or the other and, from what I see around me, I don’t believe we are. As Gartner stated years ago, the future is a Logical Data Warehouse made up of a selection of different technologies, integrated together. And if we accept that, I think we need a hybrid data model. Something that bridges the gap between logical and physical, and takes advantage of big data technologies.
More questions than answers
I came across this post this morning in my LinkedIn feed. It mentions a new modelling technique called Concept & Object Modelling Notation (COMN) that takes the best from E-R modelling and object modelling. It supports modelling from conceptual through to physical, for both SQL and NoSQL stores. I haven’t researched it enough to state that COMN is the answer but I believe, wholeheartedly, that the author is asking the right question.
So, stay tuned. There are more musings on the future of subsurface data management in the pipeline.
Mais pas en français.
[i] Prescriptive grammar applies a set of explicit rules. Descriptive grammar analyses how people speak and deduces the rules.
[ii] Yes, ‘ingest’ is what we used to call ‘data loading’ - isn’t the advancement of language charming?
[iii] I invented a new word there - did you see that? This is exactly how it happens.