The demand for stream processing of unbounded data feeds from scientific instruments, simulations, and sensors is increasing as more applications critically depend on timely insights from incoming data. Implementing streaming applications often requires integrating a diverse set of frameworks and runtime systems. For example, simulations are often implemented using HPC techniques such as MPI, while analytics components often rely on Spark. Further, these application components need to be coupled via a brokering infrastructure (such as Kafka) that ensures a balance between data production and consumption.
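The coupling role of the broker can be illustrated with a conceptual sketch (an in-memory stand-in, not Kafka itself): a bounded buffer sits between a fast producer (e.g. a simulation) and a slower consumer (e.g. an analytics job), and its capacity bound exerts back-pressure that keeps production and consumption in balance. All names and the capacity value are illustrative.

```python
import queue
import threading

def run_pipeline(n_events, capacity=8):
    """Couple a producer and a consumer through a bounded broker buffer."""
    broker = queue.Queue(maxsize=capacity)  # stands in for a Kafka topic
    consumed = []

    def producer():
        for i in range(n_events):
            broker.put(i)        # blocks when the buffer is full (back-pressure)
        broker.put(None)         # end-of-stream marker

    def consumer():
        while True:
            event = broker.get()
            if event is None:
                break
            consumed.append(event)

    t_prod = threading.Thread(target=producer)
    t_cons = threading.Thread(target=consumer)
    t_prod.start(); t_cons.start()
    t_prod.join(); t_cons.join()
    return consumed
```

In a real deployment the buffer is replaced by Kafka topics, but the balancing principle is the same.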
The aim of this master's thesis is to evaluate different streaming frameworks qualitatively and quantitatively, in particular the message broker Kafka and the processing frameworks Spark Streaming and Flink, on HPC and cloud infrastructures. In this thesis, we develop and extend a set of tools for running data streaming frameworks and applications on heterogeneous infrastructure. Using a representative set of example applications (e.g. machine learning algorithms such as K-Means, regression, and outlier detection), we characterize their performance and resource needs.
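What distinguishes such streaming workloads from their batch counterparts can be sketched with K-Means: in the streaming setting, each incoming point updates the nearest centroid via an incremental mean, so the model is maintained without storing the stream. This is a minimal one-dimensional sketch for illustration; the centroid seeds and data are assumptions, not part of the thesis.

```python
def nearest(centroids, x):
    """Index of the centroid closest to point x (1-D for simplicity)."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - x))

def streaming_kmeans(stream, centroids):
    """Update centroids one point at a time with a running-mean rule."""
    centroids = list(centroids)
    counts = [0] * len(centroids)
    for x in stream:
        i = nearest(centroids, x)
        counts[i] += 1
        # incremental mean: c_i += (x - c_i) / n_i
        centroids[i] += (x - centroids[i]) / counts[i]
    return centroids
```

Frameworks such as Spark Streaming and Flink apply the same update pattern over partitioned, windowed streams, which is what makes their per-event cost a useful performance metric.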
A particular challenge is the identification of a good resource configuration. Typically, a complex set of factors needs to be considered, e.g. the incoming data rate and the complexity of the processing workload. To address this challenge, we will develop a performance model that aids the understanding of the resource requirements of streaming applications and enables recommendations of resource configurations.
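As a toy illustration of how such a model could relate the incoming data rate to a resource recommendation (an assumption for illustration, not the model developed in the thesis): treat each parallel task as a server processing events at rate mu, and for an arrival rate lam provision enough tasks to keep per-task utilization below a target rho.

```python
import math

def recommended_parallelism(arrival_rate, service_rate, target_utilization=0.7):
    """Smallest task count keeping utilization lam/(n*mu) below the target.

    arrival_rate: incoming events/s across the stream (lam)
    service_rate: events/s one task can process (mu)
    """
    if not 0 < target_utilization < 1:
        raise ValueError("target utilization must lie in (0, 1)")
    return math.ceil(arrival_rate / (target_utilization * service_rate))
```

For example, 10,000 events/s against tasks that each handle 2,000 events/s at 70% target utilization yields 8 tasks; a real model would additionally account for workload complexity, skew, and framework overheads.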
Prof. Dr. D. Kranzlmüller
Duration of the thesis:
Number of students: 1