Importance of data infrastructure architecture — Part 1
I have spent my entire career in the data world and have worked with many different tools and technologies. Over the years, I have gained a good amount of experience in understanding how organically an organization grows and how sustainable its current data infrastructure is when it comes to scaling and adaptability.
Of course, when it comes to picking tools, a data engineer is limited by company policy. However, it is within the data engineer’s ability to understand the anticipated growth and whether the existing tools can adapt to it. The data world changes fast, and the latest and greatest tools available today may be obsolete tomorrow. So, how can a data engineer ensure the infrastructure can scale and adapt?
One needs a solid blueprint, which is the architecture. Let’s take home construction as an analogy. The engineer provides a blueprint that specifies the loads a wall can bear, how much wind the structure can resist, and so on. When construction starts, the contractor knows exactly what the home is and is not capable of, and can complete the construction safely. A year or so later, the homeowner wants some upgrades. Now, it’s not necessary to bring the whole house down; only the specific areas of interest need to be upgraded.
The same applies to any data infrastructure. Design the blueprint properly. In order to do that, one needs to understand which problem spaces the infrastructure is going to support and what the anticipated trajectory is (no one knows for sure what will happen a year from now). Any data infrastructure will have the following patterns:
- Data Ingress Point
- Data Staging
- Data Transformation
- Data Storage
- Data Access (Be it direct access or visualization)
However, the blueprint is not easy to define until one knows, or at least gauges, the 4 Vs of data (volume, velocity, variety and veracity) that can be anticipated. Once these are understood and a blueprint is defined, the tools of choice will depend on the use case.
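To make the idea concrete, here is a minimal sketch of how the blueprint and a rough 4 Vs estimate could be written down before any tool is chosen. Everything in it is an assumption for illustration: the `DataProfile` fields, the two example sources and the rule in `needs_validation_step` are hypothetical, not a prescribed method.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Layer(Enum):
    """The five blueprint patterns, in the order data flows through them."""
    INGRESS = 1
    STAGING = 2
    TRANSFORMATION = 3
    STORAGE = 4
    ACCESS = 5


@dataclass(frozen=True)
class DataProfile:
    """Rough 4 Vs estimate for one source (field names are hypothetical)."""
    source: str
    volume_gb_per_day: float   # volume
    arrival: str               # velocity: "batch" or "streaming"
    formats: Tuple[str, ...]   # variety
    validated_upstream: bool   # veracity


def needs_validation_step(profile: DataProfile) -> bool:
    """Illustrative rule: unvalidated or multi-format sources get extra checks."""
    return (not profile.validated_upstream) or len(profile.formats) > 1


# Two made-up sources with very different profiles.
clickstream = DataProfile("web_clickstream", 250.0, "streaming", ("json",), True)
vendor_feed = DataProfile("vendor_feed", 2.0, "batch", ("csv", "xlsx"), False)

print([layer.name for layer in Layer])
print(needs_validation_step(clickstream), needs_validation_step(vendor_feed))
```

The point of the sketch is that the layers stay fixed while the profiles change per source, so tool decisions can be revisited without redrawing the blueprint.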
For instance, the data ingress point will potentially be different for data coming from different systems. Ingress for in-house data can be standardized. With external data, however, there will be less control. For example, external datasets may arrive through a file share, FTP, S3 or some other cloud-specific technology. In the architecture, though, we simply define an external ingress and an internal ingress. Having an SOP (standard operating procedure) with external data providers will further standardize the ingress. All 4 Vs will impact the ingress. A single ingress might end up publishing data to multiple areas in the staging layer.
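As a rough sketch of that last point, the snippet below models an ingress as either internal or external and lets one ingress publish the same record to several staging areas. The `Ingress` class, the `publish` function, the `_ingress` tag and the area names are all hypothetical, and the staging layer is stood in for by a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List

VALID_KINDS = ("internal", "external")


@dataclass
class Ingress:
    name: str
    kind: str                  # "internal" or "external"
    staging_areas: List[str]   # one ingress may feed several staging areas


def publish(ingress: Ingress, record: dict, staging: Dict[str, List[dict]]) -> None:
    """Copy one record into every staging area the ingress feeds."""
    if ingress.kind not in VALID_KINDS:
        raise ValueError(f"unknown ingress kind: {ingress.kind}")
    for area in ingress.staging_areas:
        # Tag each record with its origin so staged data stays traceable.
        staging.setdefault(area, []).append({**record, "_ingress": ingress.name})


# In-memory stand-in for the staging layer.
staging: Dict[str, List[dict]] = {}
vendor_sftp = Ingress("vendor_sftp", "external", ["vendor_raw", "billing_raw"])
publish(vendor_sftp, {"id": 1}, staging)
print(sorted(staging))
```

Whether the external ingress is backed by FTP, S3 or something else is a detail hidden behind the `Ingress` definition, which is what keeps the blueprint stable when the transport changes.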
Next comes data staging. Where should the data be staged? Most data nowadays lives in distributed file systems, and no downstream users should have access to staged data. Only the core engineers and the transformation processes should have access.
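The access rule can be sketched as a simple allow-list. This is only an illustration: the role names and the `read_staged` helper are hypothetical, and in a real setup the rule would normally be enforced by the storage system's own permissions rather than by application code.

```python
# Hypothetical role names; only these two may touch staged data.
STAGING_ROLES = frozenset({"core_engineer", "transformation_process"})


def can_access_staging(role: str) -> bool:
    """Return True only for roles on the staging allow-list."""
    return role in STAGING_ROLES


def read_staged(path: str, role: str) -> str:
    """Refuse to read staged data for any role outside the allow-list."""
    if not can_access_staging(role):
        raise PermissionError(f"role '{role}' may not read staged data")
    return f"reading {path}"


print(can_access_staging("core_engineer"))
print(can_access_staging("business_analyst"))
```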
Data transformation, as the name suggests, builds curated datasets or computes metrics. Once these are processed, the data goes into data storage. Unlike the staged data, these datasets mean something to the business.
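A minimal sketch of such a transformation is shown below, assuming a made-up set of staged order records with `order_date` and `amount` fields. It drops rows that fail a basic quality check and turns the rest into a business-facing metric, revenue per day.

```python
from collections import defaultdict
from typing import Dict, Iterable


def daily_revenue(staged_orders: Iterable[dict]) -> Dict[str, float]:
    """Aggregate raw staged orders into revenue per order date."""
    totals: Dict[str, float] = defaultdict(float)
    for row in staged_orders:
        # Basic veracity check: skip rows without a usable amount.
        if row.get("amount") is None:
            continue
        totals[row["order_date"]] += float(row["amount"])
    return dict(totals)


# Hypothetical staged records, including one bad row.
staged_orders = [
    {"order_date": "2024-01-01", "amount": 10.0},
    {"order_date": "2024-01-01", "amount": 5.5},
    {"order_date": "2024-01-02", "amount": None},
    {"order_date": "2024-01-02", "amount": 7.0},
]
curated = daily_revenue(staged_orders)
print(curated)
```

The staged rows are only meaningful to engineers, while the curated result is something a business user can read directly.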
Data access comes in different types. Some users may want to use a visualization tool, while others might want to write their own queries to get the data they want.
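Both access styles can be sketched against the same curated storage. The example below uses an in-memory SQLite database purely as a stand-in for the storage layer; the `daily_revenue` table, the `revenue_summary` view and the `ad_hoc` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the curated data storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (order_date TEXT PRIMARY KEY, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?)",
    [("2024-01-01", 15.5), ("2024-01-02", 7.0)],
)
# A fixed view of the kind a visualization tool might read from.
conn.execute(
    "CREATE VIEW revenue_summary AS "
    "SELECT COUNT(*) AS days, SUM(revenue) AS total FROM daily_revenue"
)


def ad_hoc(sql: str, params: tuple = ()) -> list:
    """Direct access: run a user-written query against curated storage."""
    return conn.execute(sql, params).fetchall()


one_day = ad_hoc("SELECT revenue FROM daily_revenue WHERE order_date = ?", ("2024-01-01",))
summary = ad_hoc("SELECT days, total FROM revenue_summary")
print(one_day, summary)
```

Both paths read from the curated tables only; neither touches the staging layer.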
At a high level, this is what the architecture looks like.
In the next part, I will walk through a real-world example and how I would implement the architecture.