Introduction to data architecture
A few years ago Forbes stated that 2.5 quintillion bytes of data are created every day. The same article said that 90% of the world’s data was created in the last few years.
In 2018 IDC predicted that the global datasphere, which is the amount of data that is created, captured or replicated, would increase from 33 zettabytes in 2018 to 175 zettabytes in 2025.
In 2020 there were approximately 44 zettabytes of data in the world. For the uninitiated, a zettabyte is a billion terabytes, or a trillion gigabytes. There are numerous reasons why so much data exists. The question this article addresses is where enterprises should store their share of this data so that they can capitalise on its rapid growth and availability.
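To put those figures in context, the short sketch below checks the unit conversion and works out the annual growth rate implied by IDC's 33 to 175 zettabyte forecast. It assumes decimal (SI) units, and the helper names are illustrative rather than taken from any of the reports quoted here.

```python
# Decimal (SI) storage units expressed in bytes.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21


def zettabytes_to(unit_bytes: int, zettabytes: int = 1) -> int:
    """Convert a whole number of zettabytes into a smaller unit."""
    return zettabytes * BYTES_PER_ZB // unit_bytes


def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


tb_per_zb = zettabytes_to(BYTES_PER_TB)  # a billion terabytes
gb_per_zb = zettabytes_to(BYTES_PER_GB)  # a trillion gigabytes

# IDC forecast quoted above: 33 ZB in 2018, 175 ZB in 2025.
growth = compound_annual_growth(33, 175, 2025 - 2018)

print(f"1 ZB = {tb_per_zb:,} TB = {gb_per_zb:,} GB")
print(f"Implied annual growth 2018-2025: {growth:.1%}")
```

On these numbers the datasphere would need to grow by roughly a quarter every year, which is why the question of where to store the data matters.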
In research completed by IDG as part of a data & analytics survey, enterprises said:
“On average, 54% of the data their organization views and analyzes is generated internally, while 25% is generated externally and 21% comes from a combination of the two”
What types of data are organisations using?
The types of data collected by large organisations tend to be:
- Transactional data
- Machine-generated or sensor data
- Government & public domain data
- Security monitoring data
The top three sources of data are:
- Sales & financial transactions (56%)
- Leads & sales contacts from customer databases (51%)
- Email and productivity apps (39%)
45% of the organisations surveyed said that unstructured data is one of their biggest challenges. 31% said it is a problem they have under control and only 17% said that it is a primary focus for them.
In 2016 the average enterprise held 347.56TB of data, and this was expected to increase to 461.25TB. IDC has also predicted that 49% of the world’s stored data will be kept in public cloud environments by 2025. Within this same time frame, 30% of the global datasphere will be real-time. So with increasing data storage, access and processing requirements, organisations need to consider not only where they will store their data but also how they will store it.
Data Storage
Historically, data has been processed and stored in large monolithic data warehouses. Organisations now need to make significant improvements to their data architecture to build the capability required for a new set of demands. The pre-processed, formally structured data in traditional data warehouses can’t meet those demands alone. The growing use of unstructured data from a large number of different sources has given rise to data lakes. While the term has steadily made its way into discussions amongst business leaders, it’s clear that not everyone really understands what a data lake is or, more specifically, how it differs from a data warehouse.
The following table shows 4 of the key differences between data warehouses and data lakes:
Characteristic | Data Warehouse | Data Lake |
Data Structure | Processed & structured | Raw & unstructured |
Purpose of data | Predefined | Undefined |
Data Access | Complex | Easily accessible |
Ability to update | Difficult & costly | Quick & easy with minimal/limited cost |
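One way to picture the first two rows of the table is as "schema-on-write" versus "schema-on-read". The sketch below is a deliberately simplified, in-memory illustration and not a real warehouse or lake product: the `Warehouse` and `Lake` classes, the `SALES_SCHEMA` fields and the sample records are all hypothetical.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema for a sales record: field name -> required type.
SALES_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def conforms(record: dict, schema: dict) -> bool:
    """True when the record has exactly the schema's fields and types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


@dataclass
class Warehouse:
    """Schema-on-write: records are validated before they are stored."""
    schema: dict
    rows: list = field(default_factory=list)

    def load(self, record: dict) -> None:
        if not conforms(record, self.schema):
            raise ValueError(f"record rejected by schema: {record}")
        self.rows.append(record)


@dataclass
class Lake:
    """Schema-on-read: raw payloads are stored as-is, interpreted later."""
    objects: list = field(default_factory=list)

    def ingest(self, raw: str) -> None:
        self.objects.append(raw)  # no validation at write time

    def read(self, schema: dict) -> list:
        parsed = []
        for raw in self.objects:
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # not JSON; another reader may still use it
            if isinstance(record, dict) and conforms(record, schema):
                parsed.append(record)
        return parsed


warehouse = Warehouse(SALES_SCHEMA)
warehouse.load({"order_id": 1, "amount": 9.99, "currency": "GBP"})

lake = Lake()
lake.ingest('{"order_id": 2, "amount": 5.0, "currency": "USD"}')
lake.ingest("sensor-7 temp=21.4")  # unstructured line, accepted anyway
sales = lake.read(SALES_SCHEMA)
```

The warehouse refuses anything that does not match its predefined purpose, while the lake accepts everything and leaves the question of structure to whoever reads the data later.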
What do we need to do moving forward?
To meet both legacy requirements and emerging needs around data analytics, organisations need a modern data architecture. In a paper from earlier this year, McKinsey put forward a blueprint for a reference data architecture. This blueprint has been tried and tested in many global organisations across a number of industries, so organisations can use it to help de-risk the cost and effort of building a modern data architecture. A key element of this approach is for organisations to have data engineering at their core in order to drive the necessary level of change. As mentioned in a previous article, this means having a chief data officer on the executive team to guide and steer the organisation towards the most appropriate data strategy.
As well as providing a blueprint reference architecture, McKinsey has also outlined 6 steps to help organisations build a data architecture that meets their current and future needs. The steps are:
- Migrate from on-premises to cloud-based data platforms
- Transition from batch to real-time data processing
- Have a modular, multi-platform solution that addresses specific needs, rather than a pre-integrated, one-size-fits-all commercial solution
- Decouple data access
- Create a domain-based architecture rather than an enterprise warehouse
- Use flexible data models rather than rigid ones
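As a rough illustration of the "decouple data access" step, the sketch below has consuming code depend on a small access contract rather than on a particular storage platform. This is an assumption about what decoupling can look like in practice, not McKinsey's blueprint: `CustomerSource`, `WarehouseSource`, `LakeSource` and the sample records are all hypothetical names invented for the example.

```python
import json
from typing import Protocol


class CustomerSource(Protocol):
    """Hypothetical access contract that consumers depend on."""

    def get_customer(self, customer_id: str) -> dict: ...


class WarehouseSource:
    """Serves customers from pre-structured rows keyed by id."""

    def __init__(self, rows: list[dict]) -> None:
        self._by_id = {row["id"]: row for row in rows}

    def get_customer(self, customer_id: str) -> dict:
        return self._by_id[customer_id]


class LakeSource:
    """Serves customers by parsing raw JSON lines at read time."""

    def __init__(self, raw_lines: list[str]) -> None:
        self._raw_lines = raw_lines

    def get_customer(self, customer_id: str) -> dict:
        for line in self._raw_lines:
            record = json.loads(line)
            if record.get("id") == customer_id:
                return record
        raise KeyError(customer_id)


def customer_email(source: CustomerSource, customer_id: str) -> str:
    """Consumer code: knows the contract, not the storage platform."""
    return source.get_customer(customer_id)["email"]


from_warehouse = customer_email(
    WarehouseSource([{"id": "c1", "email": "a@example.com"}]), "c1"
)
from_lake = customer_email(
    LakeSource(['{"id": "c2", "email": "b@example.com", "tier": "gold"}']), "c2"
)
```

Because `customer_email` only sees the contract, the platform behind it can be swapped or migrated without changing the consumer. The extra `tier` field in the lake record also hints at the last step: a flexible model tolerates fields that a rigid one would reject.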
So at a high level, the debate isn’t about whether to use a data lake or a data warehouse. There are a number of different technologies and platforms that organisations can use to build and implement a data architecture.