Alistair works as Manager, Data Science at the Social Research Centre
Continuous Integration and Delivery pipelines for data processing
Continuous Integration (CI) is a common software development practice in which developers regularly merge code changes into a central repository, where each change is validated with automated tests. Continuous Delivery (CD) extends continuous integration by automating the release process, allowing software to be deployed on demand. Continuous deployment goes further by removing any manual intervention from the release process.
Although CI/CD is generally associated with software development, the approach can also be leveraged to better manage and quality-assure regularly updated datasets.
Building a CI/CD framework for data processing presents additional challenges compared with developing software. Automation flows need to be triggered by new data inputs or changes to data, not only by changes to the code base, and decisions about what constitutes a data trigger can vary significantly depending on the use case.
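As an illustrative sketch (not taken from the presentation itself), one simple form of data trigger fingerprints the input data and re-runs the pipeline only when the fingerprint changes. In Julia, using the bundled SHA standard library; the helper names here are hypothetical:

```julia
using SHA

# Hypothetical helper: fingerprint a data file by hashing its contents,
# so a pipeline run can be keyed to the data itself rather than the code.
data_fingerprint(path) = bytes2hex(sha256(read(path)))

# Hypothetical trigger check: compare against the fingerprint recorded for
# the last successful pipeline run, and re-process only on a real change.
data_changed(path, last_fingerprint) = data_fingerprint(path) != last_fingerprint
```

A scheduled job could call `data_changed` against each registered input and enqueue only the affected pipelines, one way of limiting the re-processing costs discussed below.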
Many cloud-based CI/CD services may not be appropriate for a data pipeline, as confidential data or personally identifiable information (PII) often cannot be transferred to these platforms. Data validation services also need to be monitored carefully, as they can charge per processing unit, so steps to minimize re-processing costs are required.
This presentation discusses these challenges using example use cases in a survey research environment.
Challenges of adopting open source software for survey research in practice
Research organisations embracing open source software can rarely escape comparisons to proprietary offerings and the need to integrate proprietary software into their workflows, whether due to internally driven requirements or external demands. Whilst there are many benefits to the adoption of open source tools in a survey research environment, these benefits come with their own unique challenges.
Proprietary products will often advertise integration with open source tools, but these can be pseudo-integrations that offer little practical utility. Likewise, support for proprietary tools in open source frameworks can be haphazard and can lag meaningfully behind changes to the proprietary software.
We began using R at the Social Research Centre in 2013, and over time it has become our primary data processing tool and supports much of our data analysis work. Alongside the move to R we have adopted many other open source tools with varying levels of success.
We discuss some of the unique practical challenges faced when adopting open source tools, including:
- integrating open source software with existing proprietary tools
- usability challenges
- user support
- tools with hybrid open/closed source offerings
Deep learning using Julia
Julia is an open source, high-performance programming language for numerical and scientific computing. Its launch billed the language as having “the speed of C with the usability of Python, the dynamism of Ruby, the mathematical prowess of MatLab, and the statistical chops of R”. Julia aims to solve the “two-language” problem, and because it was designed from the outset for high performance in technical and numerical computing, it is well suited to a range of data science applications.
Julia is very much in its infancy compared with R and Python, having launched in 2012 and reached a stable 1.0 release in 2018. Julia provides support for modern machine learning frameworks such as TensorFlow and MXNet, making it easy to adapt to existing workflows, and can interoperate with statistical and data science packages built in R and Python. While the Julia community is growing, many frameworks have yet to be ported natively to Julia.
This presentation provides a brief introduction to the Julia programming language and explores some deep learning implementations including some examples using Flux.
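To give a flavour of the examples, the following is a minimal sketch of a Flux model, assuming the implicit-parameters Flux API; the batch here is random data standing in for a real dataset such as MNIST:

```julia
using Flux

# A small feed-forward classifier for flattened 28x28 inputs (e.g. MNIST digits).
model = Chain(
    Dense(784, 32, relu),  # hidden layer with ReLU activation
    Dense(32, 10),         # one output unit per digit class
    softmax)

# Cross-entropy loss over a batch of inputs x and one-hot labels y.
loss(x, y) = Flux.crossentropy(model(x), y)

# One random batch, purely to demonstrate the training call.
x = rand(Float32, 784, 8)
y = Flux.onehotbatch(rand(0:9, 8), 0:9)

# A single gradient step with the ADAM optimiser.
Flux.train!(loss, Flux.params(model), [(x, y)], ADAM())
```

The same `Chain`/`Dense` building blocks scale to the deeper architectures covered in the presentation.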