According to The Open Group Architecture Framework (TOGAF), data architecture describes the structure of an organisation’s logical and physical data assets and data management resources.
Thus, data architecture defines how data is acquired, stored, processed, distributed and consumed. It lays the groundwork for an environment where data scientists, data engineers and data analysts operate, making decisions around data architecture inherently strategic.
This is because successful data architectures need to deal with a company’s present challenges as well as accommodate future developments. For example, which data needs to be stored today, and which will be needed in a few years? What is the best way to persist each kind of data? To what extent should consistency be enforced across data sets and projects? And what about data governance: who is allowed to access which data?
Answering these questions is all about balancing trade-offs, such as:
ETL vs. ELT:
- Transform and then Load (ETL) reduces the number of times data must be moved, which lowers cost, increases data freshness and improves enterprise agility
- Load and then Transform (ELT) keeps the raw data (e.g. JSON files) close at hand, so the transformation logic can be adapted easily if needed – this requires more storage, but offers more agility
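The ELT side of this trade-off can be sketched in a few lines: the load step persists raw payloads untouched, and the transform step runs afterwards and can be rewritten at any time without re-extracting. The record fields and the in-memory store below are illustrative assumptions, not part of any real pipeline.

```python
import json

# Hypothetical raw store: in ELT, payloads are persisted exactly as received.
RAW_STORE = []

def load(raw_json: str) -> None:
    """Load step: keep the raw payload as-is, with no transformation."""
    RAW_STORE.append(json.loads(raw_json))

def transform(records: list) -> list:
    """Transform step, run after loading; can be re-run or changed later
    because the raw data is still available."""
    return [{"user": r["name"].lower(), "amount_eur": r["amount"]} for r in records]

load('{"name": "Alice", "amount": 10}')
load('{"name": "Bob", "amount": 25}')
print(transform(RAW_STORE))
```

If the transformation requirements change, only `transform` needs to be rewritten; the raw JSON is still there to be reprocessed, which is exactly the agility the bullet above describes.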
Autonomy in Projects vs. Common Vocabulary Across Projects:
- Data modelling should happen autonomously within projects, where the requirements are known, giving users maximum flexibility to develop tailor-made solutions, but risking data sets that are no longer consistent across the organisation
- The same entities should have consistent naming schemes across data sets and projects, making it easier to navigate the data architecture, but providing less flexibility since any development needs to be coordinated with other projects
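One lightweight way to reconcile these two bullets is a shared mapping from project-local field names to an organisation-wide vocabulary: projects model freely, then translate at the boundary. The field names below are made up for illustration.

```python
# Hypothetical organisation-wide vocabulary: several project-local names
# map onto one canonical entity name.
CANONICAL_FIELDS = {
    "cust_id": "customer_id",   # project A's local name
    "client": "customer_id",    # project B's local name
    "ts": "event_timestamp",
}

def to_canonical(record: dict) -> dict:
    """Rename project-local keys to the shared vocabulary; unknown keys pass through."""
    return {CANONICAL_FIELDS.get(k, k): v for k, v in record.items()}

# Two projects with different local schemas end up consistent:
print(to_canonical({"cust_id": 7, "ts": "2024-01-01"}))
print(to_canonical({"client": 7, "region": "EU"}))
```

The mapping itself still has to be coordinated across projects, but it confines that coordination to one small artefact rather than to every data model.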
Easy Data Access vs. Data Governance:
- Data scientists need easy access to data sets for exploratory data analysis, testing and experimentation, but the data produced by an ETL process may have undergone too much processing to be useful
- Each transformation step makes it easier to comply with data governance demands by tightly controlling which information is present in the data at each stage and who may see it, ensuring, for example, that no personally identifiable information (PII) remains in the data
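A minimal governance transform of the kind described above might do nothing more than strip PII fields while leaving the analytic fields raw. Which columns count as PII is an assumption made up for this sketch.

```python
# Assumed PII columns for this example; in practice this list would come
# from a governance policy, not be hard-coded.
PII_FIELDS = {"name", "email", "phone"}

def strip_pii(record: dict) -> dict:
    """Remove PII fields so the record can be shared with data scientists."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

record = {"name": "Alice", "email": "a@example.com", "amount": 10}
print(strip_pii(record))  # {'amount': 10}
```

Because the rest of the record stays untouched, the output is still close to raw data, which is what makes it useful for exploratory analysis.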
Using a microservices architecture is one way of addressing the trade-offs mentioned above, as it allows for more customised configurations. This architecture style structures a process as a collection of services that are maintainable, loosely coupled and independently deployable.
When processing data, microservices can be configured more flexibly: one service first stores the data after import from the respective data source, and another then transforms it for a specific use case, such as a machine learning model in production.
Thus, microservices perform fine-grained, lightweight and clearly defined tasks; they can, for example, provide raw data to data scientists with PII removed, complying with data governance standards.
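The pattern of the last two paragraphs can be sketched as two loosely coupled "services" that communicate only through a queue: an ingest service that stores raw data, and a transform service that removes PII before handing the data on. In production these would be separate deployables behind an HTTP or message-broker interface; the names and fields here are illustrative.

```python
from queue import Queue

# Shared channel between the two services; stands in for a message broker.
raw_queue = Queue()

def ingest_service(payload: dict) -> None:
    """First service: store the imported data unchanged and hand it on."""
    raw_queue.put(payload)

def transform_service() -> dict:
    """Second service: consume raw data and prepare it for a downstream
    use case, here by stripping an assumed PII field before exposure."""
    record = raw_queue.get()
    return {k: v for k, v in record.items() if k != "email"}

ingest_service({"email": "a@example.com", "amount": 10})
print(transform_service())  # {'amount': 10}
```

Because each service has one clearly defined task and they share only the queue's message format, either side can be redeployed or replaced independently, which is the loose coupling the paragraph above refers to.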