By Michael Schroeder, ERNI Switzerland
This second article in our series on data-driven projects explores common challenges that arise during their execution. To illustrate these concepts, we focus on one of ERNI’s latest projects, GeoML, and specifically on its second part: Idea2Proof.
Introduction
Data quality is a crucial factor in data-driven projects, and its significance is closely tied to the research question at hand. Good data quality is relative: it must always be evaluated against the requirements of the data-driven model. Data and model should be seen as a tandem, with their characteristics often described using the four Vs:
Volume
This refers to the quantity of data available. In our project, high-resolution satellite images covering the entirety of Switzerland amount to over 10 TB of data, which incurs significant storage costs. To address this, we decided to reduce the image resolution. Processing the individual images and aligning them to the desired resolution and position requires appropriate big data tools.
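To make the resolution reduction concrete, here is a minimal sketch that averages non-overlapping pixel blocks of a small in-memory raster. It is an illustration only: the function name `downsample_mean`, the block-averaging method and the toy tile are assumptions, not the project's actual pipeline, which would rely on dedicated geospatial and big data tooling.

```python
def downsample_mean(raster, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks.

    `raster` is a list of equal-length rows of numbers. Trailing rows or
    columns that do not fill a whole block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    rows = len(raster) // factor
    cols = (len(raster[0]) // factor) if raster else 0
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect all pixels belonging to the current block.
            block = [
                raster[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Toy 4 x 4 tile (hypothetical pixel values).
tile = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 8, 8],
    [2, 2, 8, 8],
]
small = downsample_mean(tile, 2)
print(small)  # [[2.0, 6.0], [2.0, 8.0]]
```

Halving the resolution along each axis, as in this example, reduces the number of pixels to a quarter.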
Variety
Different data sources and types present challenges in data-driven projects. In our case, we integrate various data sources, including satellite images (in raster format), road accident data (spatial and temporal points), road networks (spatial lines), and traffic load data (sparse spatial and temporal data). Conceptual competence is required to combine these different data types effectively and assess their relevance to the original question.
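One way to picture the combination of point data and line data is to assign each accident to its nearest road segment. The sketch below assumes planar (projected) coordinates and straight two-point segments; the function names, road names and coordinates are hypothetical and serve only as an illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: both end points coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line and clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_segment(point, segments):
    """Return the id of the segment closest to `point`."""
    return min(
        segments,
        key=lambda sid: point_segment_distance(point, *segments[sid]),
    )


# Hypothetical road network and accident locations.
roads = {
    "main_street": ((0.0, 0.0), (10.0, 0.0)),
    "side_street": ((5.0, 0.0), (5.0, 10.0)),
}
accidents = [(2.0, 1.0), (5.5, 6.0), (9.0, -0.5)]
assigned = [nearest_segment(p, roads) for p in accidents]
print(assigned)  # ['main_street', 'side_street', 'main_street']
```

With real data volumes, a spatial index would typically replace this brute-force search over all segments.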
Velocity & Veracity
Velocity and veracity also play crucial roles in ML-based projects. Processing and adapting data in a timely manner (velocity) and assessing its accuracy and reliability (veracity) are essential to the overall success and performance of the ML algorithm.
Data metric definition
Once the research question is clear and the data has been cleansed, the next step is to effectively utilise the data. Many projects encounter difficulties at this stage, either by choosing an algorithm that does not align with the data or by selecting one that cannot address the research question adequately. Therefore, we need to define a risk metric applicable to the project based on the available data.
Not all accidents are equal: they differ in severity, and raw accident counts depend on how much traffic a road carries. We therefore normalised the number of accidents by factors such as road infrastructure and traffic volume, and gave greater weight to accidents resulting in severe injuries or fatalities. From these factors, we defined a risk index for each road segment, which we then categorised into five levels that serve as our target variable.
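As a rough illustration of such a metric, the sketch below computes severity-weighted accidents per unit of exposure and bins the result into five levels. The severity weights, the exposure formula (segment length times daily vehicles) and the equal-frequency binning are all assumptions made for this example; the article does not specify the actual definitions used in GeoML.

```python
from dataclasses import dataclass

# Hypothetical severity weights. The article only states that severe and
# fatal accidents carry greater weight, not the actual values.
SEVERITY_WEIGHTS = {"light": 1.0, "severe": 5.0, "fatal": 10.0}


@dataclass
class Segment:
    segment_id: str
    length_km: float
    daily_vehicles: float
    accidents: dict  # severity -> number of accidents


def risk_index(seg):
    """Severity-weighted accidents per daily vehicle-kilometre (assumed)."""
    weighted = sum(SEVERITY_WEIGHTS[s] * n for s, n in seg.accidents.items())
    exposure = seg.length_km * seg.daily_vehicles
    return weighted / exposure if exposure > 0 else 0.0


def to_levels(values, n_levels=5):
    """Map each value to a level 1..n_levels using equal-frequency bins.

    Levels are assigned by rank, so tied values may land in adjacent levels.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * n_levels // len(values) + 1
    return levels


# Hypothetical road segments.
segments = [
    Segment("a", 1.0, 1000, {"light": 1}),
    Segment("b", 2.0, 1000, {"light": 2, "severe": 1}),
    Segment("c", 1.0, 5000, {"fatal": 1}),
    Segment("d", 0.5, 2000, {}),
    Segment("e", 1.0, 500, {"severe": 1}),
]
indices = [risk_index(s) for s in segments]
levels = to_levels(indices)
print(levels)  # [2, 4, 3, 1, 5]
```

Equal-frequency bins keep the five classes balanced, which can simplify training a classifier; fixed thresholds would instead keep the levels comparable across datasets.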
Algorithm development
After defining our target variable, the next step is to choose a suitable algorithm. There are no strict guidelines for this decision, as it varies case by case. Depending on the problem’s complexity, involving an experienced data scientist is crucial. The choice of algorithm class (e.g., regression, classification, clustering) should align with the defined research question, while model selection and parametrisation require expertise and experience. It is important to ensure that the model has an appropriate size (e.g., number of parameters) so that it can distinguish artefacts from genuine statistical patterns. Furthermore, systematic differences between the distributions of the training data, the testing data and the data seen during model operation pose an additional challenge that must be mitigated.
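A simple first check for such distribution differences is to compare how often each target level occurs in two datasets. The sketch below uses the total variation distance between the class shares. The choice of this measure, the function names and the toy labels are assumptions for illustration; any threshold for what counts as a worrying shift would have to be chosen per project.

```python
from collections import Counter


def class_shares(labels, classes):
    """Relative frequency of each class in `labels`."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def total_variation(labels_a, labels_b):
    """Total variation distance between two label distributions.

    0.0 means identical class shares, 1.0 means no overlap at all.
    """
    classes = sorted(set(labels_a) | set(labels_b))
    p = class_shares(labels_a, classes)
    q = class_shares(labels_b, classes)
    return 0.5 * sum(abs(p[c] - q[c]) for c in classes)


# Hypothetical risk levels (1 to 5) in a training and a testing set.
train = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
test = [1, 1, 1, 1, 2, 2, 3, 3, 4, 5]
shift = total_variation(train, test)
print(round(shift, 3))  # 0.2
```

The same comparison can be repeated for input features and for data collected during operation, to notice drift after deployment.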
Conclusion
Data quality is paramount in data-driven projects, and its assessment must be tailored to the specific question at hand. Overcoming challenges related to data quality, volume, variety, and algorithm selection is vital for successful project outcomes. By addressing these challenges in our case study, we strive to develop an effective algorithm that can leverage diverse data sources and provide meaningful insights for decision-making purposes.