Principles and Practices – Real-time Linked Dataspaces

Dataspaces

A dataspace is an emerging approach to data management that recognises that in large-scale integration scenarios, involving thousands of data sources, it is difficult and expensive to obtain an upfront unifying schema across all sources (Franklin, Halevy and Maier, 2005). Within dataspaces, datasets co-exist but are not necessarily fully integrated or homogeneous in their schematics and semantics. Instead, data is integrated on an “as-needed” basis, with the labour-intensive aspects of data integration postponed until they are required. Dataspaces reduce the initial effort required to set up data integration by relying on automatic matching and mapping generation techniques. This results in a loosely integrated set of data sources. When tighter semantic integration is required, it can be achieved in an incremental “pay-as-you-go” fashion by adding more detailed mappings among the required data sources.
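The pay-as-you-go idea can be illustrated with a minimal Python sketch. It is not taken from the RLD itself: the function names `auto_match` and `refine`, the field names, and the similarity threshold are all hypothetical, and name similarity stands in for whatever automatic matching technique a real dataspace would use. A cheap automatic pass proposes loose mappings, and curated mappings are added later only for the fields that need them.

```python
from difflib import SequenceMatcher


def auto_match(source_fields, target_fields, threshold=0.6):
    """Propose field mappings by name similarity (a cheap, automatic first pass)."""
    mappings = {}
    for field in source_fields:
        best, best_score = None, 0.0
        for candidate in target_fields:
            score = SequenceMatcher(None, field.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
        # Keep only matches that clear the (assumed) confidence threshold.
        if best_score >= threshold:
            mappings[field] = best
    return mappings


def refine(mappings, curated):
    """Overlay hand-curated mappings on the automatic ones, only where needed."""
    merged = dict(mappings)
    merged.update(curated)
    return merged


# Hypothetical field names from two sources in a smart building.
sensor_fields = ["room_temp", "ts", "device"]
bms_fields = ["room_temperature", "timestamp", "device_id"]

loose = auto_match(sensor_fields, bms_fields)   # automatic, loosely integrated
tight = refine(loose, {"ts": "timestamp"})      # manual effort spent only on "ts"
```

Here the automatic pass links `room_temp` and `device` but leaves `ts` unmapped, so the labour-intensive step is deferred until that field is actually required.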

Real-time Linked Dataspaces

Within the dataspace paradigm, there has been limited work on the requirements of real-time processing of events and streams, and limited research into the relevant support services. The Real-time Linked Dataspace (RLD) has been created as a data platform for intelligent systems within smart environments. The RLD combines the pay-as-you-go paradigm of dataspaces with linked data, knowledge graphs, and real-time stream and event processing capabilities to support a large-scale, distributed, heterogeneous collection of streams, events and data sources (Curry et al., 2019). This work builds on past efforts to use dataspaces in Building Data Management (Curry et al., 2013), Energy Data Management (Curry, Hasan and O’Riáin, 2012), and System of Systems (Curry, 2012). The goal is to support a principled approach to incremental real-time data management, based on a set of support services with tiered levels of support, and to provide a unified entity-centric query framework over real-time and historical data streams in a smart environment.

This section details the foundations of the Real-time Linked Dataspace approach and describes how its architectural components meet the key requirements for real-time information processing (as identified by Stonebraker et al.) and for a data platform for smart environments.

Definition and Principles

A Real-time Linked Dataspace is a specialised dataspace that manages and processes a large-scale, distributed, heterogeneous collection of streams, events and data sources (Curry et al., 2019). It manages the sources without presuming a pre-existing semantic integration among them, uses linked data and knowledge graphs to coordinate the dataspace, and operates under a 5-star model for “pay-as-you-go” data management. The RLD adapts the dataspace principles set out by Halevy et al. (Halevy, Franklin and Maier, 2006) to describe the specific requirements of a real-time dataspace setting:

A Real-time Linked Dataspace must deal with many different formats of streams and events.
A Real-time Linked Dataspace does not subsume the stream and event processing engines; they still provide individual access via their native interfaces.
Queries in the Real-time Linked Dataspace are answered on a best-effort and approximate basis.
The Real-time Linked Dataspace must provide pathways to improve the integration among the data sources, including streams and events, in a pay-as-you-go fashion.
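As a rough illustration of the second and third principles, the sketch below issues a query through each source's own interface and returns whatever could be answered, recording which sources were skipped. The `Source` and `BestEffortResult` classes, the `best_effort_query` function, and the two example sources are assumptions made for this example, not part of any RLD API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Source:
    """A participant that keeps its own native interface (here, a plain callable)."""
    name: str
    native_query: Callable[[str], List[dict]]


@dataclass
class BestEffortResult:
    rows: List[dict] = field(default_factory=list)
    answered_by: List[str] = field(default_factory=list)
    skipped: Dict[str, str] = field(default_factory=dict)  # source name -> reason


def best_effort_query(sources, entity_id):
    """Ask every source; keep partial answers instead of failing the whole query."""
    result = BestEffortResult()
    for source in sources:
        try:
            rows = source.native_query(entity_id)
        except Exception as exc:  # a source that cannot answer is skipped, not fatal
            result.skipped[source.name] = str(exc)
            continue
        result.rows.extend(rows)
        result.answered_by.append(source.name)
    return result


# Hypothetical sources: one answers, one is currently unavailable.
def stream_engine(entity_id):
    return [{"entity": entity_id, "temperature": 21.5}]


def offline_archive(entity_id):
    raise ConnectionError("archive unavailable")


result = best_effort_query(
    [Source("stream-engine", stream_engine), Source("archive", offline_archive)],
    "room-101",
)
```

The caller receives the stream engine's rows together with a note that the archive was skipped, so the answer is explicitly approximate rather than silently incomplete.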

To enable these principles to support real-time data processing, we propose a set of specialised dataspace support services that meet the requirements of loose administrative proximity and loose semantic integration for event and stream systems. Loose coupling of event processing systems on the semantic dimension reflects a low cost of defining and maintaining rules concerning the use of terms, and a low cost of building and agreeing on the event semantic model. This requirement forms the foundation of the techniques and models used to process events and streams within the Real-time Linked Dataspace.

Architecture

The RLD contains all the relevant information within a data ecosystem, including things, sensors, and data sources, and is responsible for managing the relationships among these participants.

Figure: Real-time Linked Dataspace architecture

The architecture of the RLD comprises the following main concepts:

Support Platform: Responsible for providing the functionalities and services essential for managing the dataspace. Support services are grouped into data services and stream and event services.
Things / Sensors: Produce real-time data streams that need to be processed and managed. Things in a smart environment range from connected devices and energy and water sensors to connected cars and manufacturing equipment.
Data Sources: Available in a wide variety of formats and accessible through different system interfaces. Example data sources include building management systems, energy and water management systems, passenger information systems, financial data, weather, and (linked) open datasets.
Managed Entities: Actively managed entities within the data ecosystem, including their relationship to participating things, data sources, and other entities in the RLD.
Intelligent Applications, Analytics, & Users: Interact with the RLD and leverage its data and services to provide data analytics, decision support tools, user interfaces, and data visualisations. Applications/Users can query the RLD in an entity-centric manner, while users can be enlisted in the curation of the data and entities via the Human Task service.
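The entity-centric view described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the RLD implementation: the `ManagedEntity` and `Dataspace` classes, the `query_entity` method, the sliding-window size, and all identifiers and readings are hypothetical. A managed entity links to things and data sources, and a single query by entity returns both recent stream readings and historical records.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManagedEntity:
    entity_id: str
    things: List[str] = field(default_factory=list)        # linked sensor ids
    data_sources: List[str] = field(default_factory=list)  # linked dataset ids


class Dataspace:
    def __init__(self, window=3):
        self.entities = {}
        # Recent readings per thing, kept in a bounded sliding window.
        self.streams = defaultdict(lambda: deque(maxlen=window))
        # Historical records per data source.
        self.history = defaultdict(list)

    def register(self, entity):
        self.entities[entity.entity_id] = entity

    def publish(self, thing_id, value):
        self.streams[thing_id].append(value)

    def query_entity(self, entity_id):
        """Return real-time and historical data for one entity in a single view."""
        entity = self.entities[entity_id]
        return {
            "entity": entity_id,
            "real_time": {t: list(self.streams[t]) for t in entity.things},
            "historical": {d: list(self.history[d]) for d in entity.data_sources},
        }


# Hypothetical room with one temperature sensor and one historical source.
ds = Dataspace(window=2)
ds.register(ManagedEntity("room-101", things=["temp-1"], data_sources=["bms"]))
ds.history["bms"].append({"date": "2020-01-01", "avg_temp": 20.5})
for reading in (20.9, 21.3, 21.8):
    ds.publish("temp-1", reading)

view = ds.query_entity("room-101")
```

With a window of two, `view["real_time"]["temp-1"]` holds only the two most recent readings, while `view["historical"]["bms"]` carries the stored record, so an application asks about the room rather than about individual sensors or databases.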

Excerpt from: Curry E. (2020) Fundamentals of Real-time Linked Dataspaces. In: Real-time Linked Dataspaces. Springer, Cham.

References

Curry, E. (2012) ‘System of Systems Information Interoperability using a Linked Dataspace’, in IEEE 7th International Conference on System of Systems Engineering (SOSE 2012). Genoa, Italy: IEEE, pp. 101–106. Available at: http://www.edwardcurry.org/publications/Curry_LinkedDataspaceForSOS_SOSE.pdf.

Curry, E. et al. (2013) ‘Linking building data in the cloud: Integrating cross-domain building data using linked data’, Advanced Engineering Informatics, 27(2), pp. 206–219. Available at: http://www.edwardcurry.org/publications/Curry_AEI_2013.pdf.

Curry, E. et al. (2019) ‘A Real-time Linked Dataspace for the Internet of Things: Enabling “Pay-As-You-Go” Data Management in Smart Environments’, Future Generation Computer Systems, 90, pp. 405–422. doi: 10.1016/j.future.2018.07.019.

Curry, E., Hasan, S. and O’Riáin, S. (2012) ‘Enterprise Energy Management using a Linked Dataspace for Energy Intelligence’, in The Second IFIP Conference on Sustainable Internet and ICT for Sustainability (SustainIT 2012). Pisa, Italy: IEEE. Available at: http://www.edwardcurry.org/publications/curry_SustainIT_2012.pdf.

Franklin, M., Halevy, A. and Maier, D. (2005) ‘From databases to dataspaces: a new abstraction for information management’, ACM SIGMOD Record. ACM Press, 34(4), pp. 27–33. doi: 10.1145/1107499.1107502.

Halevy, A., Franklin, M. and Maier, D. (2006) ‘Principles of dataspace systems’, in 25th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems – PODS ’06. New York, New York, USA: ACM Press, pp. 1–9. doi: 10.1145/1142351.1142352.