SRG Seminar: “Evaluating Data Linkage: Creation and use of synthetic data for comprehensive linkage evaluation” by Tom Dalton and “Container orchestration” by Uchechukwu Awada

The abstract of Tom’s talk:

“Data linkage approaches are often evaluated with small or few data sets. If a linkage approach is to be used widely, quantifying its performance with varying data sets would be beneficial. In addition, given a data set needs to be linked, the true links are by definition unknown. The success of a linkage approach is thus difficult to comprehensively evaluate.

This talk focuses on the use of many synthetic data sets for the evaluation of linkage quality achieved by automatic linkage algorithms in the domain of population reconstruction. It presents an evaluation approach which considers linkage quality when characteristics of the population are varied. We envisage a sequence of experiments where a set of populations are generated to consider how linkage quality varies across different populations: with the same characteristics, with differing characteristics, and with differing types and levels of corruption. The performance of an approach at scale is also considered.

The approach to generate synthetic populations with varying characteristics on demand will also be addressed. The use of synthetic populations has the advantage that all the true links are known, thus allowing evaluation as if with real-world ‘gold-standard’ linked data sets.

Given the large number of data sets evaluated against we also give consideration as to how to present these findings. The ability to assess variations in linkage quality across many data sets will assist in the development of new linkage approaches and identifying areas where existing linkage approaches may be more widely applied.”

The abstract of Awada’s talk:

“Over the years, there has been rapid development in the area of software development. A recent innovation in software or application deployment and execution is the use of Containers. Containers provide a lightweight, isolated and well-defined execution environment. Application container like Docker, wrap up a piece of software in a complete file-system that contain everything it needs to run: code, runtime, system tools, system libraries, etc. To support and simplify large-scale deployment, cloud computing providers (i.e., AWS, Google, Microsoft, etc) have recently introduced Container Service Platforms (CSPs), which support automated and flexible orchestration of containerised applications on container-instances (virtual machines).

Existing CSP frameworks do not offer any form of intelligent resource scheduling: applications are usually scheduled individually, rather than taking a holistic view of all registered applications and available resources in the cloud. This can result in increased execution times for applications, and resource wastage through under utilised container-instances; but also a reduction in the number of applications that can be deployed, given the available resources. In addition, current CSP frameworks do not currently support: the deployment and scaling of containers across multiple regions at the same time; merging containers into a multi-container unit in order to achieve higher cluster utilisation and reduced execution times.

Our research aims to extend the existing system by adding a cloud-based Container Management Service (CMS) framework that offers increased deployment density, scalability and resource efficiency. CMS provides additional functionalities for orchestrating containerised applications by joint optimisation of sets of containerised applications and resource pool in multiple (geographical distributed) cloud regions. We evaluate CMS on a cloud-based CSPs i.e., Amazon EC2 Container Management Service (ECS) and conducted extensive experiments using sets of CPU and Memory intensive containerised applications against the custom deployment strategy of Amazon ECS. The results show that CMS achieves up to 25% higher cluster utilisation and up to 70% reduction in execution times.”