
Derek Strauss says DW2.0 is the architecture for the next generation of data warehousing.
Data warehousing has been around for almost two decades and has become an essential part of the information technology infrastructure. A data warehouse contains integrated, granular, historical data. First generation warehouses have addressed many information requirements, but they face several challenges today. For first generation data warehouses, deriving value meant integrating numeric, transaction-based data. Today, deriving maximum value means drawing on all corporate data: textual, unstructured data as well as numeric, transaction data.
In the 1990s, the cost of data warehousing was almost a non-issue. Today it is a burning issue, and with volumes of data increasing exponentially, better data management solutions are vital to cost containment. Metadata, master data and data quality were neglected in early warehouses. Today, data quality, the business meaning of the data, the business rules related to the data, and the cohesive management of enterprise master data (such as customer data) and reference data are all critical to the data warehouse. Data warehouses are now recognised as the foundation for competitive use of information, not simply from a strategic decision support perspective but also as a key component in operating the business. The rate of change in the business environment has consistently increased year on year, and it is now recognised that the data warehouse needs to be malleable so that it can keep up with changing business requirements.
DW2.0 – addressing the challenges
DW2.0 recognises the data lifecycle. Once data enters the data warehouse it starts to age, and in time the probability of its being accessed diminishes. This has profound implications for the technology that is appropriate to the management of the data. Another notable phenomenon is that as data ages, its volume increases (in most cases, dramatically). DW2.0 highlights the special design considerations for handling large volumes of data with a decreasing probability of access.
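The lifecycle idea can be illustrated with a minimal Python sketch that assigns a record to a storage tier by its age. The tier names, the age thresholds and the `storage_tier` function are assumptions made for illustration; they are not taken from the DW2.0 specification.

```python
from datetime import date

# Hypothetical age thresholds (in days) and tier names; illustrative only.
TIERS = [
    (90, "high-performance disk"),   # recent data, frequently accessed
    (730, "lower-cost disk"),        # older data, occasionally accessed
]
ARCHIVE_TIER = "archival storage"    # oldest data, rarely accessed


def storage_tier(record_date: date, today: date) -> str:
    """Return the storage tier for a record based on its age in days."""
    age_days = (today - record_date).days
    for max_age_days, tier in TIERS:
        if age_days <= max_age_days:
            return tier
    return ARCHIVE_TIER
```

Calling `storage_tier(date(2020, 1, 1), date(2024, 2, 1))` would place a four-year-old record in the archival tier. In practice the thresholds would be driven by observed access patterns rather than fixed constants.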
The data warehouse is most effective when it contains both structured and unstructured data. Classical first generation data warehouses consisted entirely of transaction-oriented structured data. Unstructured data is textual data that appears in medical records, contracts, emails, spreadsheets and many other documents. There is a wealth of information in unstructured data, but unlocking that information has been a real challenge. DW2.0 describes in detail what is required to create a data warehouse containing both structured and unstructured data.
For a variety of reasons metadata was not considered a significant part of the early data warehouses. DW2.0 recognises the importance and role of metadata. The issue is not the need for metadata. There is metadata everywhere - in local repositories such as DBMS directories, ETL tools, data modeling tools and so on. What is needed is a cohesive enterprise view of metadata. All of the local metadata stores need to be coordinated such that they can work together harmoniously. In addition, there is a need for the support of both technical and business metadata in the DW2.0 environment to enable such things as data lineage (understanding the origins of the data), voice of data (focusing on data quality) and change impact analysis.
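Data lineage and change impact analysis can both be read as traversals of the same metadata graph, in opposite directions. The following Python sketch shows this under stated assumptions: the `MetadataCatalog` class, its methods and the element names in the usage note are hypothetical, and a real enterprise metadata repository would hold far richer information.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy catalogue recording which data elements feed which others."""

    def __init__(self):
        self._sources = defaultdict(set)  # target -> its direct sources
        self._targets = defaultdict(set)  # source -> its direct targets

    def add_flow(self, source, target):
        """Record that `source` feeds `target` (for example, via an ETL job)."""
        self._sources[target].add(source)
        self._targets[source].add(target)

    def _walk(self, start, edges):
        # Depth-first traversal collecting every element reachable from start.
        seen = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def lineage(self, element):
        """All upstream elements: where the data originated."""
        return self._walk(element, self._sources)

    def impact(self, element):
        """All downstream elements: what a change here could affect."""
        return self._walk(element, self._targets)
```

For example, after recording the flows `crm.customer` to `staging.customer`, `staging.customer` to `dw.customer_dim` and `dw.customer_dim` to `report.churn`, calling `lineage("dw.customer_dim")` returns the two upstream elements, while `impact("staging.customer")` returns the warehouse table and the report that depend on it.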
Data warehouses are ultimately built on a foundation of technology. The data warehouse is driven by a set of business requirements against the backdrop of a corporate data model. Over time the business requirements of the enterprise change, but the technical environment and its data structures have traditionally not been easy to change. DW2.0 provides two solutions to this dilemma. One solution is software based, using technologies that provide a malleable foundation for the data warehouse. The other is the design practice of separating static data and temporal data at the point of database definition. Either approach allows the technical foundation of the data warehouse to change in step with business changes.
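One possible reading of the second approach is sketched below using Python's built-in `sqlite3` module. The table names, columns and sample rows are invented for illustration and are not taken from the DW2.0 book. Attributes that are not expected to change go in one table; attributes that vary over time go in another, where each change adds a row with its own validity period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Static attributes: not expected to change once recorded.
    CREATE TABLE customer_static (
        customer_id   INTEGER PRIMARY KEY,
        date_of_birth TEXT NOT NULL
    );
    -- Temporal attributes: each change adds a row with a validity period.
    CREATE TABLE customer_address (
        customer_id INTEGER NOT NULL REFERENCES customer_static(customer_id),
        valid_from  TEXT NOT NULL,
        valid_to    TEXT,          -- NULL means the row is still current
        city        TEXT NOT NULL,
        PRIMARY KEY (customer_id, valid_from)
    );
""")

# Hypothetical sample data.
conn.execute("INSERT INTO customer_static VALUES (1, '1980-05-17')")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?, ?, ?)",
    [
        (1, "2015-01-01", "2020-06-30", "Cape Town"),
        (1, "2020-07-01", None, "Chicago"),
    ],
)


def city_on(customer_id, as_of):
    """Return the customer's city on an ISO date, or None if unknown."""
    row = conn.execute(
        """SELECT city FROM customer_address
           WHERE customer_id = ? AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to >= ?)""",
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None
```

With this split, a new time-varying attribute can be introduced as a further temporal table without altering `customer_static`, which is one way the technical foundation might absorb business change. Dates are stored as ISO 8601 strings so that plain string comparison orders them correctly.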
Derek Strauss is founder, CEO and a Principal Consultant of Gavroshe. He has over 30 years of IT industry experience, including 24 years working in the data resource management and business intelligence/data warehousing fields. He is a co-author of the book DW2.0 - The Architecture for the Next Generation of Data Warehousing, Morgan Kaufmann, 2008 (W.H. Inmon, D. Strauss, G. Neushloss).