The PROCESS demonstrators will pave the way towards exascale data services that accelerate innovation and maximise the benefits of these emerging data solutions. The main tangible outputs of PROCESS are five very large data service prototypes, implemented using a mature, modular, generalizable open-source solution for user-friendly exascale data services. The services will be thoroughly validated in real-world settings, both in scientific research and in industry pilot deployments.
To achieve these ambitious objectives, the project consortium brings together the key players in the new data-driven ecosystem: top-level HPC and big-data centres; communities with unique data challenges that current solutions are unable to meet, such as the Square Kilometre Array (SKA) project; and experienced e-Infrastructure solution providers with an extensive track record of rapid application development.
In addition to providing service prototypes that can cope with very large data, PROCESS addresses the work programme goals by applying the tools and services to heterogeneous use cases, including medical informatics, airline revenue management and open data for global disaster risk reduction. This diversity of user communities ensures that, in addition to supporting communities that push the envelope, the solutions will also ease the learning curve for the broadest possible range of user communities. Finally, the chosen open-source strategy, together with mature software engineering practices that minimise the effort needed to set up and maintain services based on the PROCESS software releases, maximises the potential for uptake and reuse.
Project website: http://www.process-project.eu/
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement 777533.
In SecConNet we research novel container network architectures that utilize programmable infrastructures and virtualisation technologies across multiple administrative domains, while maintaining the security and quality requirements of the requesting parties for both private-sector and scientific use cases. To this end, we exploit semantically annotated infrastructure information together with information on the business and application logic, and apply policy engines and encryption to enforce the intents of the data owners in the infrastructure, thus increasing trust.
Containers are lightweight alternatives to full-fledged virtual machines. Containers provide scientific, industrial and business applications with versatile computing environments suitable to handle Big Data applications. A container can operate as a secure, isolated and individual entity that on behalf of its owner manages and processes the data it is given.
Containers can exploit policy engines and encryption to protect algorithms and data. However, in multi-organisation (chain) applications, groups of containers need access to the same data and/or need to exchange data among themselves. Technologies for connecting containers are developed primarily with attention to performance, but the greatest challenge is the creation of secure and reliable multi-site, multi-domain container networks.
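To make the policy-engine idea concrete, the following is a minimal, illustrative sketch (not SecConNet's actual implementation; all names are hypothetical): a data owner registers allow-rules, and a policy engine decides whether a container in one administrative domain may send a dataset to a container in another.

```python
# Illustrative sketch only: a toy policy engine enforcing data-owner intents
# on cross-domain container data exchange. Names and rule model are invented
# for this example, not taken from the SecConNet architecture.

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    dataset: str      # dataset identifier
    src_domain: str   # administrative domain of the sending container
    dst_domain: str   # administrative domain of the receiving container


class PolicyEngine:
    def __init__(self, rules):
        self.rules = set(rules)

    def may_exchange(self, dataset, src_domain, dst_domain):
        """Allow a transfer only if the data owner registered a matching rule."""
        return Rule(dataset, src_domain, dst_domain) in self.rules


# Example: the owner of "genome-A" allows transfer from hospital to lab only.
engine = PolicyEngine([Rule("genome-A", "hospital.example", "lab.example")])
print(engine.may_exchange("genome-A", "hospital.example", "lab.example"))   # True
print(engine.may_exchange("genome-A", "hospital.example", "cloud.example"))  # False
```

In a real deployment the rules would be derived from semantically annotated infrastructure and business-logic descriptions, and the decision would additionally gate decryption of the data.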
The project will deliver multiple models of container infrastructures as archetypes for Big Data applications. SecConNet will show that containers can efficiently map to available clouds and data centers, and can be interconnected to deliver these different operational models; these in turn can support a plethora of Big Data applications in domains such as life sciences, health and industrial applications.
Computers are going through a radical redesign process, leading to novel architectures with large numbers of small cores. Examples of such "many-cores" are Graphics Processing Units and the Intel Xeon Phi, which are used in about 65% of the 50 fastest supercomputers. Many-cores can give spectacular performance results, but their programming model is totally different from that of traditional CPUs. It currently takes an unacceptable amount of time for application programmers to obtain sufficient performance on these devices. The key problem is the lack of a methodology for easily developing efficient many-core kernels.
We will therefore develop a programming methodology and compiler ecosystem that guide application developers in writing efficient scientific programs for many-cores, starting from a methodology and compiler that we have recently developed. We will apply this methodology to two highly diverse applications for which performance is currently key: bioinformatics and Natural Language Processing (NLP).
We will extend our compiler ecosystem to address the applications’ requirements in three directions: kernel fusion, distributed execution, and generation of human-readable target code. The project should provide applications and eScientists with a sound methodology and the relevant understanding to enable practical use of these game-changing many-cores, boosting the performance of current and future programs.
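Kernel fusion, one of the three compiler extensions named above, can be illustrated with a plain-Python sketch (the real transformation targets many-core kernels; the functions here are toy stand-ins). Two separate elementwise "kernels" each make a full pass over the data and materialize an intermediate array; the fused version computes the same result in a single pass, which on a GPU saves a kernel launch and a round trip through device memory.

```python
# Toy illustration of kernel fusion. On a many-core device, scale() and add()
# would each be a kernel launch with the intermediate written to memory;
# the fused kernel does one pass with no temporary.

def scale(xs, a):
    # kernel 1: y[i] = a * x[i]
    return [a * x for x in xs]

def add(xs, ys):
    # kernel 2: z[i] = x[i] + y[i]
    return [x + y for x, y in zip(xs, ys)]

def scale_add_fused(xs, ys, a):
    # fused kernel: z[i] = a * x[i] + y[i], single pass, no intermediate list
    return [a * x + y for x, y in zip(xs, ys)]

xs, ys = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
# the fused kernel must be semantically identical to the unfused pipeline
assert add(scale(xs, 2.0), ys) == scale_add_fused(xs, ys, 2.0)
print(scale_add_fused(xs, ys, 2.0))  # [12.0, 24.0, 36.0]
```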
Image by: Robert Howie
Next to research papers, software is rapidly becoming one of the two prime outputs of scientific advancement in practically every field of research. While research papers are passive, research software is active: reusable, reproducible and transferable. The demand for the valuable software products, services and know-how of the eScience Center is evidence of this development.
Academic software producers like the NLeSC, and in a broader sense any academic group that produces software, need to make their performance indicators observable: the impact of the developed software should be measured and reported. This is essential to get recognition and credit for academic contributions in software, as well as to secure continued financial support for the future. Measured impact is also key intelligence for strategic decision-making on the maintenance of eScience software (more than 50% of the cost of software is in maintenance).
Software analytics research, pioneered by CWI SWAT and its partners over the last decade, is the application of data analytics to source code versions, installations, reuse, issue tracking, online discussions and similar sources, turning this data into actionable insights. Our goal is to transfer this software-analytics tooling and knowledge to the eScience Center: to set up an infrastructure for monitoring and assessing the impact of its software and that of its partners. This infrastructure works directly with eStep to support the upcoming scientific evaluation of the center. Moreover, measuring software impact will be an incentive for partners to join eStep (in addition to increased visibility).
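The kind of query such an infrastructure runs can be sketched in a few lines. The example below is a deliberately tiny, hypothetical illustration (the data and project names are invented; this is not the CWI SWAT tooling): raw version-control history is aggregated into one simple impact indicator, the number of distinct contributors per project.

```python
# Hypothetical software-analytics example: turn (project, author) pairs,
# as they might be extracted from version-control logs, into a simple
# impact indicator. Data is invented for illustration.

from collections import defaultdict

commits = [
    ("toolkit-a", "alice"), ("toolkit-a", "bob"), ("toolkit-a", "alice"),
    ("toolkit-b", "carol"), ("toolkit-b", "carol"),
]

contributors = defaultdict(set)
for project, author in commits:
    contributors[project].add(author)

# indicator: how many distinct people contributed to each project
impact = {project: len(authors) for project, authors in contributors.items()}
print(impact)  # {'toolkit-a': 2, 'toolkit-b': 1}
```

Real software analytics combines many such signals (releases, downloads, issue activity, online discussion) into a dashboard rather than a single number.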
Image: Map of scientific collaboration between researchers by Olivier H. Beauchesne
Phenology studies the timing of recurring plant and animal biological phases, their causes, and their interrelations. This seasonal timing varies from year to year and from place to place because it is strongly influenced by weather and climatic variability.
Understanding phenological variability is critical to quantify the impact of climate change on the global biogeochemical cycles (e.g. changes in the carbon and water cycles) as well as to manage natural resources (e.g. timing of animal migration), food production (e.g. timing of agricultural activities), public health (e.g. timing of hay fever), and even for tourism (e.g. timing of excursions).
A major obstacle in phenological modelling is the computational intensity and extreme data size when working at continental scale with high-spatial-resolution grids of explanatory variables (e.g. weather and remotely sensed data). We believe that moving our phenological modelling workflows to a modern big-data platform such as Spark will allow us to experiment more easily with novel analytical methods, to generate phenological metrics at high spatial resolution (1 km), and to identify phenoregions (i.e. regions with similar phenology) by clustering time series of phenological metrics.
Image by: Chris Devers
Data produced by imaging systems is ever growing in size and complexity. Extracting, presenting, and communicating information from “big” (large, complex, heterogeneous) imaging data is a fundamental problem, which is relevant in many areas, ranging from medical care to high-tech information systems. The main research question in this project is how to develop IT support for diagnostic and decision-making processes based on large and complex imaging data. The approach is based on developing novel graphics, visualization, and interaction methods for the exploration of imaging data. A key element is the use of storytelling as a means of visual data communication.
Visual storytelling is an innovative approach for visual presentation and communication that is especially important in situations where the data analyst is not the same person as the decision-maker, and information needs to be exchanged in an intuitive and easy-to-remember way. Diagnostics in radiology will be the primary use case to test the developed approaches. Because of recent developments in information technology and big data, the focus of radiology is shifting and disruptive technologies are required to allow radiology to position itself as a future-proof specialty in Healthcare.
Image: NASA Goddard Space Flight Center – Flowers Fields as Seen by NASA Satellite
FPGAs excel at performing simple operations on high-speed streaming data at high (energy) efficiency. So far, however, their difficult programming model and poor floating-point support have prevented wide adoption for typical HPC applications. This is changing, due to recent and near-future FPGA technology developments: support for the high-level OpenCL programming language, hard floating-point units, and tight integration with (Xeon) CPU cores. Combined, these are game changers: they dramatically reduce development times and allow using FPGAs for applications that were previously deemed too complex. Another technology advance, 3D XPoint memory, allows new ways to deal with large amounts of data. Together, these developments will have a disruptive impact on tomorrow's data centers and blur the borders between embedded computing and HPC.
With support from Intel, we will explore these disruptive technologies in critical parts of radio-astronomical processing pipelines, so that they can be applied in future and upgraded telescopes. This should lead to shorter development times, more performance, higher energy efficiency, lower costs, lower risks, and eventually more astronomical science.
Image: Peter Gerdes – Telescope Dwingeloo (CC License)
We are witnessing an increased significance of point clouds for societal and scientific applications, such as in smart cities, 3D urban modeling, flood modeling, dike monitoring, forest mapping, and digital object preservation in history and art. Modern Big Data acquisition technologies, such as laser scanning from airborne, mobile, or static platforms, dense image matching from photos, or multi-beam echo-sounding, have the potential to generate point clouds with billions (or even trillions) of elevation/depth points. One example is the height map of the Netherlands (the AHN2 dataset), which consists of no fewer than 640 billion height values.
Simply too big
The main problem with these point clouds is that they are simply too big (several terabytes) to be handled efficiently by common ICT infrastructures. At this moment researchers are unable to use this point cloud Big Data to its full potential because of a lack of tools for data management, dissemination, processing, and visualization.
“At this moment researchers are unable to use point cloud Big Data to its full potential.”
Within this project several novel and innovative eScience techniques will be developed. Our work will also result in proposals for new standards to the Open Geospatial Consortium (OGC) and/or the International Organisation for Standardisation/Technical Committee 211:
- Database SQL (Structured Query Language) extension for point clouds
- Web Point Cloud Services (WPCS) supporting progressive transfer of point clouds
Enjoy our AHN2-webviewer at http://ahn2.pointclouds.nl/
A scalable solution
The goal is a scalable (more data and users without architectural change) and generic solution: keep the current standard object-relational database management system (DBMS) architecture and integrate with the existing spatial vector and raster data functionality. Core support for point cloud data types is needed in the DBMS, alongside the existing vector and raster data types. Furthermore, a new web-services protocol specific to point cloud data is investigated, supporting progressive transfer based on multi-resolution. Based on a user requirements analysis, a point cloud benchmark is specified. Oracle, PostgreSQL, MonetDB and file-based solutions are analyzed and compared. After identifying weaknesses in the existing DBMSs, R&D activities will be conducted to realize improved solutions, in close cooperation with the various DBMS developers. The non-academic partners in this project (Rijkswaterstaat, Fugro and Oracle) will deliver their services and expertise and provide access to data and software (development).
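The kind of query that point-cloud support in a DBMS must make fast can be shown with a minimal, stdlib-only sketch. This uses SQLite and a handful of invented points purely for illustration; the project targets Oracle, PostgreSQL and MonetDB at terabyte scale, where the naive table scan below is exactly what dedicated point cloud types, blocking/compression and spatial indexing must replace.

```python
# Illustrative only: a tiny in-memory "point cloud" table and a bounding-box
# selection. A real point-cloud DBMS extension would store points in
# compressed blocks with a spatial index instead of one row per point.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE height_points (x REAL, y REAL, z REAL)")
conn.executemany(
    "INSERT INTO height_points VALUES (?, ?, ?)",
    [(1.0, 1.0, 3.2), (2.0, 2.0, 4.1), (9.0, 9.0, 1.5)],  # invented points
)

# bounding-box query: all points inside [0,5] x [0,5]
rows = conn.execute(
    "SELECT x, y, z FROM height_points "
    "WHERE x BETWEEN 0 AND 5 AND y BETWEEN 0 AND 5"
).fetchall()
print(len(rows))  # 2
```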
At the frontiers of contemporary science, many if not all of the quantitative research and engineering challenges with high socioeconomic impact – such as climate, energy, materials, health and disease, urbanization, economy, psychology, or sociology – are essentially multiscale system problems. Progress in most of these societal grand challenges is determined by our ability to design and implement multiscale models and simulations of the particular systems under study.
Generic methods and efficient algorithms
This project will develop generic methods and efficient algorithms for sensitivity analysis and uncertainty quantification in multiscale modelling & simulation, implement these algorithms as high-quality modules of the publicly available Multiscale Modelling and Simulation Framework, and test, validate and apply the methods on a sufficiently large portfolio of multiscale applications.
Supporting the whole range of computing infrastructure
The framework must support the whole range of computing infrastructure, from the desktop, via clusters and clouds, to high-end HPC machines. The research and development will be executed in close collaboration with the recently started FET-HPC ComPat project. With few exceptions, sensitivity analysis and uncertainty quantification for multiscale modelling and simulation is currently lacking, yet very much needed.
“Sensitivity analysis and uncertainty quantification for multiscale modelling and simulation is currently lacking, yet very much needed.”
This project will have a significant impact by filling this gap and studying in detail the behaviour of error propagation in coupled single scale models, taking into account the different kinds of scale bridging that can be identified, and then applying that knowledge for sensitivity analysis and uncertainty quantification.
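The simplest form of such uncertainty propagation through coupled models is black-box Monte Carlo sampling, sketched below with two toy single-scale models (the functions and distribution are invented for illustration, not taken from a real multiscale application): sample the uncertain input, run it through the coupled chain, and inspect the spread of the quantity of interest.

```python
# Hedged sketch of Monte Carlo uncertainty propagation through two coupled
# single-scale models. Both models are toy functions chosen for illustration.

import random
import statistics

def micro_model(x):
    # fine-scale model (toy): quadratic response
    return x * x

def macro_model(y):
    # coarse-scale model (toy) consuming the micro-scale output
    return 3.0 * y + 1.0

random.seed(42)  # reproducible sampling
# uncertain input: x ~ N(1.0, 0.1); propagate each sample through the chain
samples = [macro_model(micro_model(random.gauss(1.0, 0.1)))
           for _ in range(10_000)]

mean = statistics.mean(samples)
std = statistics.stdev(samples)
print(round(mean, 1))  # ≈ 4.0, since E[x^2] ≈ 1.01 and macro(1.01) ≈ 4.03
```

Sensitivity analysis then asks how this output spread decomposes over the uncertain inputs, and the project's interest is precisely how such error estimates behave across different kinds of scale bridging.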
Available to the scientific community
The impact will be amplified by making these new developments available to the scientific community as high quality and computationally very efficient modules in the Multiscale Modelling and Simulation Framework.
Image: Argonne National Laboratory – Multiscale Blood Flow Simulations.
In modern radio telescopes, System Health Management (SHM) systems are crucial for the (early) detection of errors and for remedying them. Due to the increasing scale and complexity of the systems involved, the effectiveness and efficiency of present-day SHM approaches are limited. Intelligent, automated SHM approaches would therefore significantly improve the quality and availability of the observational systems.
“Intelligent automated approaches to improve the quality and availability of observational systems.”
Crucial for scientific results
This is beneficial not only for maintenance, operations, and cost. It is also crucial for the scientific results, as accurate knowledge of the state of the telescope is essential for calibrating the system. Data analytics, and more specifically Machine Learning (ML), has been shown to be able to “learn” from data.
The purpose of this project is to investigate the applicability of novel approaches such as ML for SHM in radio astronomy. Although this project focuses on application of this technology in radio astronomy, similar problems arise in scientific instruments across many disciplines, such as high-energy physics, ecology, life sciences and urban planning.
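As a minimal illustration of "learning" normal system behaviour from data (a toy z-score detector with invented sensor readings, far simpler than the ML methods the project will investigate): fit the normal range of one sensor from history, then flag readings that deviate strongly.

```python
# Toy ML-for-SHM illustration: learn the normal behaviour of one hypothetical
# telescope sensor from historical readings, then flag strong deviations.
# Real SHM would use richer models over many correlated sensors.

import statistics

history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]  # invented values
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_anomalous(reading, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from normal."""
    return abs(reading - mu) / sigma > threshold

print(is_anomalous(20.2))  # False: within the learned normal range
print(is_anomalous(25.0))  # True: far outside it, worth an alert
```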
A generic methodology
Similar problems also occur in large-scale simulations, for example in water management, computational chemistry and climate science. In this alliance, a generic methodology will be developed which is also applicable in these fields.
Image: Afshin Darian – The eight radio telescopes of the Smithsonian Submillimeter Array, located at the Mauna Kea Observatory in Hawaii