Wednesday 2nd December 2020, 16:45–17:00 (Australia/Sydney), Zoom Breakout Room 1
Continuous Integration (CI) is a common software development practice in which developers regularly merge code changes into a central repository, where each change is validated with automated tests. Continuous Delivery (CD) extends continuous integration by automating the release process, so that software can be deployed on demand. Continuous deployment goes further by removing the remaining manual intervention from the release process.
Although CI/CD is generally associated with software development, it is an approach that can be leveraged to better manage and QA regularly updated datasets.
Building a CI/CD framework for data processing poses additional challenges compared with software development. Automation flows need to be triggered by new data inputs or changes to data, not only by changes to the code base. Decisions about what constitutes a data trigger can vary significantly depending on the use case.
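One possible definition of a data trigger is "the contents of an input file changed since the last run". The sketch below illustrates that idea only; it is not the framework described in the talk. The function names, the `*.csv` pattern, and the JSON state file are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_inputs(data_dir: Path, state_file: Path) -> list[str]:
    """List CSV files that are new or modified since the last recorded run.

    Hypothetical helper: the state file layout (file name -> digest) is an
    assumption, not part of any particular CI/CD product.
    """
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {p.name: file_fingerprint(p) for p in sorted(data_dir.glob("*.csv"))}
    changed = [name for name, fp in current.items() if previous.get(name) != fp]
    # Record the current fingerprints so the next run compares against them.
    state_file.write_text(json.dumps(current, indent=2))
    return changed
```

A scheduler or CI job could call `changed_inputs` and start the pipeline only when the returned list is non-empty. Other use cases might instead trigger on row counts, schema changes, or upstream timestamps.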
Many cloud-based CI/CD services may not be appropriate for a data pipeline, because it may not be possible to transfer confidential data or personally identifiable information (PII) to some of these platforms. Data validation services also need to be monitored carefully: they can charge per processing unit, so steps to minimize re-processing costs are required.
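One way to limit re-processing costs is to validate only records that have not been validated before. The following is a minimal sketch under stated assumptions: `validate` is a hypothetical callable standing in for whatever per-unit-billed check is used, and `cache` is a plain dictionary that a real pipeline would need to persist between runs.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable


def record_key(record: dict) -> str:
    """Return a stable SHA-256 digest for a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def validate_incrementally(
    records: Iterable[dict],
    validate: Callable[[dict], bool],
    cache: Dict[str, bool],
) -> int:
    """Validate only records not already in the cache.

    Returns the number of calls made to `validate`, i.e. the number of
    processing units that would be billed in this hypothetical setup.
    """
    calls = 0
    for record in records:
        key = record_key(record)
        if key not in cache:
            cache[key] = validate(record)
            calls += 1
    return calls
```

With this approach, a re-run over an unchanged dataset makes no validation calls, and a dataset with a few new rows is billed only for those rows. Whether record-level caching is appropriate depends on the validation rules: checks that compare records with each other would need a different strategy.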
This presentation discusses these challenges using example use cases in a survey research environment.
Alistair works as Manager, Data Science at the Social Research Centre.