Why should we care about data variety — Semi Structured Data
In the earlier article, I talked about structured data and why it is important to understand it not only at a high level but also in depth, so that the data model will scale. Now let's look at semi-structured data.
There are several ways to represent semi-structured data. The most commonly used representations are JSON and XML. For the sake of simplicity, we will use JSON.
From an implementation perspective, as the name suggests, semi-structured data will not fit the underlying data model unless business transformations are applied. So, what is the benefit of using this kind of data? Source systems that produce data can scale quickly, because applications do not need to re-format the data they emit. Instead, it is the downstream consumers who format the data as per their needs.
This may not be the best example, but let's take application click metrics. Depending on the click type, the structure of the data will change. However, there will be certain key elements, like
browser, client_id, timestamp, click_type etc…
For example, if there is a button_click event, the structure could be
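A minimal sketch of such an event, written as Python dictionaries and printed as JSON, might look like the following. Every value here is made up for illustration, and the keys inside metadata (button_id, page, href, opens_new_tab) as well as the second link_click event are assumptions. session_id is included at the top level only because it is used as the key later in this article; its exact placement is also an assumption.

```python
import json

# Hypothetical button_click event: all values and metadata keys are illustrative.
button_click = {
    "session_id": "s-001",
    "browser": "Chrome",
    "client_id": "c-42",
    "timestamp": "2020-01-01T10:00:00Z",
    "click_type": "button_click",
    "metadata": {"button_id": "checkout", "page": "/cart"},
}

# Hypothetical second click type: same top-level keys, different metadata shape.
link_click = {
    "session_id": "s-001",
    "browser": "Chrome",
    "client_id": "c-42",
    "timestamp": "2020-01-01T10:00:05Z",
    "click_type": "link_click",
    "metadata": {"href": "/help", "opens_new_tab": True},
}

# One collection ("entity") holding both shapes, as a document store would allow.
click_events = [button_click, link_click]

for event in click_events:
    # The top-level keys are stable; only the keys inside metadata vary.
    print(event["click_type"], sorted(event["metadata"]))

print(json.dumps(button_click, indent=2))
```

Both events share the same top-level keys, so the application can keep writing to one collection while new click types add their own metadata fields.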
In the above example, the key elements are browser, client_id, timestamp, click_type and metadata. However, depending on click_type, the structure of metadata will change. From the application's perspective, if the data is stored in a NoSQL/document database, one entity can store different click_types. This allows rapid enhancement and scaling of application features.
But wait, what happens to downstream consumers such as the data lake or data warehouse? Data is pulled from the source systems as is into a staging area. Once the data is in the staging area, transformations are applied and, depending on the design of the data warehouse, the data can be pushed to one or more tables.
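The flow described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names load_to_staging and transform, the sample events, and the choice of one target table per click_type are all hypothetical, and in-memory lists stand in for real staging and warehouse tables.

```python
from collections import defaultdict


def load_to_staging(raw_events):
    """Copy source events into staging unchanged (the "as is" pull)."""
    return [dict(event) for event in raw_events]


def transform(staged_events):
    """Split staged events into one list of flat rows per click_type.

    Metadata keys are merged into the row; this sketch assumes they do not
    collide with the top-level keys.
    """
    tables = defaultdict(list)
    for event in staged_events:
        row = {k: v for k, v in event.items() if k != "metadata"}
        row.update(event.get("metadata", {}))
        tables[event["click_type"]].append(row)
    return dict(tables)


# Hypothetical source events with different metadata shapes.
source_events = [
    {"client_id": "c-42", "click_type": "button_click",
     "metadata": {"button_id": "checkout"}},
    {"client_id": "c-42", "click_type": "link_click",
     "metadata": {"href": "/help"}},
]

staging = load_to_staging(source_events)
warehouse_tables = transform(staging)

for table_name, rows in warehouse_tables.items():
    print(table_name, rows)
```

Here each click_type ends up in its own table with its own columns, which is the "one or more tables" case; the cost is a new table (or new columns) whenever the application adds a click type.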
So, how should one design the data model? Again, it depends on the business requirements. The idea behind semi-structured data is to make new enhancements faster. So, does it really make sense for downstream consumers to build multiple tables? Maybe, maybe not. One approach is to create a key-value pair table. In our example above, session_id will be the key. Based on that key, the elements can be transposed and stored in a table.
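As a rough sketch of the key-value approach, the function below transposes one event into (key, attribute, value) rows. The function name to_key_value_rows, the dotted attribute naming for metadata fields, and the sample event are assumptions for illustration, and only one level of nesting is flattened.

```python
def to_key_value_rows(event, key_field="session_id"):
    """Transpose one event into (key, attribute, value) rows."""
    key = event[key_field]
    rows = []
    for name, value in event.items():
        if name == key_field:
            continue  # the key itself is not stored as an attribute
        if isinstance(value, dict):
            # Flatten one level of nesting, e.g. metadata.button_id
            for sub_name, sub_value in value.items():
                rows.append((key, f"{name}.{sub_name}", sub_value))
        else:
            rows.append((key, name, value))
    return rows


# Hypothetical event; the values are made up.
sample_event = {
    "session_id": "s-001",
    "click_type": "button_click",
    "metadata": {"button_id": "checkout"},
}

for row in to_key_value_rows(sample_event):
    print(row)
# ('s-001', 'click_type', 'button_click')
# ('s-001', 'metadata.button_id', 'checkout')
```

With this layout a new click type or a new metadata field only adds rows, not columns or tables. If a session can contain several events, the table would probably need an additional column such as the timestamp to tell them apart.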