1. Introduction

The Open Geospatial Consortium (OGC®) is releasing this Call for Participation ("CFP") to solicit proposals for the OGC Testbed-16 (also called "Initiative" or just "Testbed"). The goal of the initiative is to evaluate, in a real-world environment, the maturity of the Earth Observation Cloud Architecture that has been developed over the last two years as part of various OGC Innovation Program (IP) initiatives.


1.1. Background

The OGC Testbed is an annual research and development program that explores geospatial technology from various angles. It takes the OGC Baseline into account while allowing selected aspects to be explored with a fresh pair of eyes. The Testbeds integrate requirements and ideas from a group of sponsors, which leverages symbiotic effects and makes the overall initiative more attractive to both participants and sponsoring organizations.

1.2. OGC Innovation Program Initiative

This Initiative is being conducted under the OGC Innovation Program. The OGC Innovation Program provides a collaborative agile process for solving geospatial challenges. Organizations (sponsors and technology implementers) come together to solve problems, produce prototypes, develop demonstrations, provide best practices, and advance the future of standards. Since 1999 more than 100 initiatives have taken place.

1.3. Benefits of Participation

This initiative provides an outstanding opportunity to engage with the latest research on geospatial system design, concept development, and rapid prototyping. The initiative provides a business opportunity for stakeholders to mutually define, refine, and evolve service interfaces and protocols in the context of hands-on experience and feedback. The outcomes are expected to shape the future of geospatial software development and data publication. The Sponsors are supporting this vision with cost-sharing funds to partially offset the costs associated with development, engineering, and demonstration of these outcomes. This offers selected Participants a unique opportunity to recoup a portion of their initiative expenses.

1.4. Master Schedule

The following table details the major Initiative milestones and events. Dates are subject to change.

Table 1. Master schedule

Milestone | Date | Event
M01 | 23 December 2019 | Release of Call for Participation (CFP)
M02 | 21 January 2020 | Questions for the CFP Bidders Q&A Webinar due
M03 | 28 January 2020 | CFP Bidders Q&A Webinar, starting at 10am EST. Register at https://attendee.gotowebinar.com/register/5009861363812363533
M04 | 09 February 2020 | CFP Proposal Submission Deadline (11:59pm U.S. Pacific Time)
M05 | 31 March 2020 | All CFP Participation Agreements signed
M06 | 6-8 April 2020 | Kickoff Workshop at US Geological Survey (USGS) National Center, 12201 Sunrise Valley Drive, Reston, Virginia 20192 (https://www2.usgs.gov/visitors/)
M07 | 31 May 2020 | Initial Engineering Reports (IERs)
M08 | June 2020 | Interim Workshop at the TC Meeting in Montreal, Canada; participation not mandatory but appreciated
M09 | 30 September 2020 | TIE-tested component implementations completed; preliminary DERs complete and clean, ready for internal reviews
M10 | 31 October 2020 | Ad hoc TIE demonstrations (as requested during the month) and demo assets posted to Portal; near-final DERs posted to Pending and WG review requested
M11 | November 2020 (specific date TBD) | Final DERs (incorporating WG feedback) posted to Pending to support WG & TC vote
M12 | December 2020 (specific date TBD) | Final demonstration at TC meeting
M13 | 15 December 2020 | Participant Final Summary Reports due

2. Technical Architecture

This section provides the technical architecture and identifies all requirements and corresponding work items. It references the OGC standards baseline, i.e. the complete set of member approved Abstract Specifications, Standards including Profiles and Extensions, and Community Standards where necessary. Further information on the OGC standards baseline can be found online.

Note

Please note that some documents referenced below may not have been released to the public yet. These reports require a login to the OGC portal. If you don’t have a login, please contact OGC at techdesk@opengeospatial.org.

Testbed Threads

The Testbed is organized in a number of threads. Each thread combines a number of tasks that are further defined in the following chapters. The threads integrate both an architectural and a thematic view, which keeps related work items closely together and removes dependencies across threads.

Figure 1. Testbed Threads

The threads include the following tasks:

2.1. Aviation

The goals of the Aviation task in Testbed-16 are: to evaluate emerging architectural and technological solutions for the FAA SWIM services, and to further advance the usage of linked data for information integration in the SWIM context.

The API modernization work shall evaluate solutions for data distribution that complement those currently used by FAA SWIM. Particular emphasis in this context shall be on OpenAPI-based Web APIs. OGC is actively developing Web APIs for various geospatial resource types, such as features, coverages, maps, tiles, and processes. There are many documented benefits of using Web APIs in the context of geospatial data retrieval and processing, including faster time to market for products, more flexibility in deployment models, and straightforward upgrade paths as standards evolve.
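To illustrate the resource-oriented pattern, the sketch below composes a request against a hypothetical OGC API - Features endpoint fronting a SWIM feed. The base URL and collection name are invented for this example; only the `/collections/{collectionId}/items` path pattern and the `bbox`/`limit` query parameters follow the OGC API - Features conventions.

```python
from urllib.parse import urlencode

# Hypothetical endpoint fronting a SWIM data feed (not a real FAA service).
BASE = "https://example.org/swim/ogcapi"

def items_url(collection, bbox=None, limit=10):
    """Compose an OGC API - Features request for items in one collection."""
    params = {"limit": limit, "f": "json"}
    if bbox:
        params["bbox"] = ",".join(str(c) for c in bbox)
    return f"{BASE}/collections/{collection}/items?{urlencode(params)}"

# Request up to 5 flight-position features near Washington Dulles (IAD).
url = items_url("flight-positions", bbox=(-77.5, 38.8, -77.0, 39.1), limit=5)
print(url)
```

Because the interface is plain HTTP with JSON responses, any generic client can consume it without SOA-specific tooling.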


The linked data aspect shall explore the use and benefits of semantic web technologies in the context of FAA SWIM services. Linked data shall help query and access all data required for any given task and help address the heterogeneous semantics introduced by the various ontologies and taxonomies used within the aviation community. These include, for example, the SWIM Controlled Vocabulary (SWIM CV), the Web Service Description Ontological Model (WSDOM), and semantics.aero.

Background

The System-Wide Information Management (SWIM) program, established and maintained by the FAA, supports the sharing of Air Traffic Management (ATM) information by providing communications infrastructure and architectural solutions for identifying, developing, provisioning, and operating a network of highly distributed, interoperable, and reusable services.

As part of the SWIM architecture, data providers create services to access their data. For the FAA, these services are published in the NAS Services Registry/Repository (NSRR). The NSRR is a catalog of all SWIM services and provides documentation on various aspects of each service, including its provider, functionality, quality characteristics, interface, and implementation.

One of the SWIM challenges is the handling of semantics across all participants. A diverse set of ontologies, controlled vocabularies, and taxonomies has been developed over the last decade. A good overview is provided in OGC 18-035.

The SWIM Controlled Vocabulary (SWIM CV) provides SWIM organizations, support contractors, vendors, and business partners with a uniform understanding of terms employed in the SWIM environment. The CV contains a comprehensive list of terms with clear and unambiguous definitions. Each term is globally uniquely identified by a dereferenceable URI so that it can be related semantically to other terms, vocabularies, or resources. The SWIM CV is now part of semantics.aero.

semantics.aero is an open repository for use by the international aviation community to publish artifacts developed using Semantic Web technologies. Besides the SWIM CV, artifacts include taxonomies for classifying services by product type, availability, flight phase, ICAO region, etc. and are available in both human-readable (HTML) and machine-readable (RDF) versions.

The Web Service Description Ontological Model (WSDOM) is an ontology intended to be a basis for model-driven implementation of SOA-related artifacts. The ontology has been developed in Web Ontology Language (OWL) version 1.1; it consists of several files and is currently available in a single downloadable zip file. WSDOM can be considered an "RDF" realization of the Service Description Conceptual Model (SDCM). WSDOM standardizes information and metadata pertinent to describing SWIM services and facilitates interchange of service data between service providers and the FAA. The intent behind the ontology is to make service definitions clear, unambiguous, and discoverable by both humans and computer systems. WSDOM consists of ontology classes covering the key notions of service profile, service interface, service implementation, stakeholder, and document. WSDOM is patterned after the OWL-S semantic web services description ontology.

WSDOM was developed before SDCM and was written in OWL. For many people in the industry, an OWL ontology was too technical to be understood by a broad audience. To address this gap and make the service description more "readable", the Service Description Conceptual Model was created. That said, SDCM 2.0 is more recent than WSDOM 1.0 and therefore better reflects the current NSRR data structure. WSDOM 2.0 is currently under development and has not yet been aligned with the latest version of SDCM.

2.1.1. Problem Statement and Research Questions

FAA has invested in setting up SWIM feeds that are accessible on a feed-by-feed basis. Each feed was designed as stand-alone. However, the value of data increases when it is combined with other data. In addition, real-world situations often do not map one-to-one to a single SWIM feed (or a single data source, for that matter). Therefore, Testbed-16 shall investigate data integration options based on semantic web technologies and analyze the current status of achievable interoperability. The latter is the basis for analytics, querying, and visualization of data coming from distributed sources.

Research questions:

  1. Should the existing SWIM architecture be modernized with resource-oriented Web APIs?

  2. What role can OGC Web APIs play for modernized SWIM services?

  3. How can OGC APIs be used to address the heterogeneous semantic SWIM landscape?

  4. What impact do linked data principles and requirements have on OGC Web APIs?

  5. How to deal with the various ontologies and taxonomies used in SWIM?

  6. How to best enhance the various ontologies and how to build a scalable geospatial definition server?

  7. How to best combine data from various SWIM data feeds to make it available for multi-source and linked data based analytics?

2.1.2. Aim

This Testbed-16 task aims at a better understanding of the value of modern Web APIs in the context of SWIM service integration and of the potential of semantic web technologies to solve complex queries.

2.1.3. Previous Work

The topic of semantic enablement of models used in the aviation domain has been explored in previous OGC Testbeds (OGC 16-039, OGC 17-036). In past demonstrations, analyses recommended the use of run-time registries and complex use cases for service discovery and data taxonomy/ontology. However, much of the information exchanged within the System-Wide Information Management (SWIM) network is made up of various data models using XML Schema encodings (such as AIXM), which address only the structure and syntax of the information exchanged between systems, but not the semantic aspects of the model. Testbed-12 and Testbed-13 made progress toward the semantic enablement of the controlled vocabularies using the Simple Knowledge Organization System (SKOS) encoding, but these vocabularies were still referenced from the XML documents based on structure and syntax. This hybrid approach does not allow the usage of off-the-shelf Linked Data solutions for linking heterogeneous domain entities, deductive reasoning, and unified access to information. Systems are currently built around specific data models and are unable to communicate and link to each other, causing duplication of information and making it difficult to search and discover information relevant to users.

Testbed-14 formulated an approach to semantically enable the different data models, taxonomies, and service descriptions so that they can incorporate semantic metadata. This metadata includes descriptive metadata, geospatial-temporal characteristics, quality information, and fitness-for-use information about services and data. This additional metadata enables the integration of information and services, improves search and discovery, and increases the level of possible automation (e.g. by reasoning or access and processing by agents).

Aviation activities have been part of several initiatives in the past. The following list provides an overview. Links to the relevant Engineering Reports are provided further below.

  1. Testbed-15

    1. OGC Testbed-15: Semantic Web Link Builder and Triple Generator

  2. Testbed-14:

    1. SWIM Information Registry

    2. Semantically Enabled Aviation Data Models

  3. Testbed-13:

    1. Aviation Abstract Quality Model - Data Quality Specification

    2. Quality Assessment Service

    3. Geospatial Taxonomies

  4. Testbed-12:

    1. Aviation Semantics

    2. Catalog Services for Aviation

The following Engineering Reports are relevant in the context of this task.

  • OGC 18-022r1, OGC Testbed-14: SWIM Information Registry Engineering Report

  • OGC 18-035, OGC Testbed 14: Semantically Enabled Aviation Data Models Engineering Report

  • OGC 17-036, OGC Testbed-13: Geospatial Taxonomies Engineering Report

  • OGC 17-032r2, OGC Testbed-13: Aviation Abstract Quality Model Engineering Report

  • OGC 16-018, OGC Testbed-12: Aviation Architecture Engineering Report

  • OGC 16-024r2, OGC Testbed-12: Catalog Services for Aviation Engineering Report

  • OGC 16-028r1, OGC Testbed-12: FIXM GML Engineering Report

  • OGC 16-039r2, OGC Testbed-12 Aviation Semantics Engineering Report

  • OGC 16-061, OGC Testbed-12 Aviation SBVR Engineering Report

Reports that addressed linked data and semantic web technologies include:

  • OGC 19-021, OGC Testbed-15: Semantic Web Link Builder and Triple Generator (draft version available on OGC portal or upon request)

  • OGC 18-094r1, OGC Testbed-14: Characterization of RDF Application Profiles for Simple Linked Data Application and Complex Analytic Applications Engineering Report

  • OGC 18-032r2, OGC Testbed-14: Application Schema-based Ontology Development Engineering Report

  • OGC 17-018, OGC Testbed-13: Data Quality Specification Engineering Report

  • OGC 16-046r1, OGC Testbed-12 Semantic Enablement Engineering Report

2.1.4. Scenario & Requirements

Figure 2 illustrates the current SWIM situation. A number of services are available that use different taxonomies, vocabularies, and ontologies. SWIM services are described in the service registry. All services are built on the Service Oriented Architecture (SOA).

Figure 2. Aviation scenario, current situation

The current situation illustrated above shall be explored along two axes: first, the value and role of Web APIs, and more specifically OGC APIs, for SWIM; second, the handling of semantics and the integration of data from various services, as illustrated in Figure 3.

Figure 3. Example of deploying API and Semantic Technology in today’s Global SWIM

Figure 3 depicts a scenario where diversified SWIM initiatives use API and semantic mediation for integrating service meta-information collected by their respective registries.

The Testbed-16 aviation scenario will address the integration of SWIM data from various sources to answer complex queries, such as:

  • Which flights from IAD to any airport in Europe have not been subject to GDP advisories in the last 2 hours?

  • What is the closest airport in Florida to land a flight from IAD, given a Temporary Flight Restriction due to a hurricane?
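The first query above could be expressed against a triple store roughly as sketched below. Every prefix, class, and property name here (swim:Flight, swim:origin, and so on) is a hypothetical placeholder; the actual SWIM ontologies and taxonomies would supply the real terms, and a client would bind the time window at query time.

```python
# Illustrative SPARQL held as a Python string; the vocabulary is invented.
QUERY = """
PREFIX swim: <https://example.org/swim#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?flight WHERE {
  ?flight a swim:Flight ;
          swim:origin      swim:IAD ;
          swim:destination ?airport .
  ?airport swim:icaoRegion "EUR" .
  # Exclude flights covered by a GDP advisory issued in the last 2 hours;
  # ?twoHoursAgo would be bound by the client when the query is submitted.
  FILTER NOT EXISTS {
    ?advisory a swim:GDPAdvisory ;
              swim:affectsFlight ?flight ;
              swim:issuedAt ?t .
    FILTER (?t >= ?twoHoursAgo)
  }
}
"""
```

The crux is the `FILTER NOT EXISTS` block: negation of this kind is straightforward over linked data, but hard to express across isolated, feed-by-feed XML services.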

2.1.5. Work Items & Deliverables

The following figure illustrates the work items and deliverables of this task.

Figure 4. Aviation task architecture and deliverables

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions. Thread assignment and funding status are defined in section Deliverables Summary & Funding Status.

Engineering Reports

  • D001 Aviation Engineering Report - Engineering Report capturing all results and experiences from this task. The report shall provide answers to all research questions and document implementations.

Components

  • D100 OGC API - Service endpoint for a SWIM service. The endpoint shall support sufficient semantics to allow a triple builder to link data from various API endpoints.

  • D101 OGC API - Similar to D100.

  • D103 Triple Builder - The triple builder shall use all OGC APIs to generate links between data and services. It shall support answering complex queries such as those provided above. The triple builder shall store all triples in a triple store providing at least a GeoSPARQL interface. Additional entry paths between the triple store and the client D106 can be defined during the project.

  • D105 Semantic Web Client - Client application to interact with the triple store to answer complex queries as provided above. Ideally, the client provides a graphical user interface that illustrates results on a map.

  • D106 SWIM OGC API Client - Client application to interact with the OGC APIs that front-end SWIM services. Ideally, the client provides a graphical user interface that illustrates results on a map.
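The linking step performed by a triple builder such as D103 can be sketched, independently of any particular RDF library, as follows. The feed contents, identifiers, and predicate names are invented for illustration; a real implementation would emit RDF into a GeoSPARQL-capable store rather than a Python set.

```python
# Records as they might be fetched from two hypothetical OGC API endpoints.
flights = [  # e.g. from D100
    {"id": "UA123", "origin": "IAD", "destination": "LHR"},
    {"id": "DL456", "origin": "IAD", "destination": "CDG"},
]
advisories = [  # e.g. from D101
    {"id": "GDP-7", "affects": "DL456"},
]

# Build subject-predicate-object triples, linking the feeds on flight ids.
triples = set()
for f in flights:
    triples.add((f["id"], "origin", f["origin"]))
    triples.add((f["id"], "destination", f["destination"]))
for a in advisories:
    triples.add((a["id"], "affectsFlight", a["affects"]))  # cross-feed link

# Query: flights departing IAD with no advisory linked to them.
affected = {o for (s, p, o) in triples if p == "affectsFlight"}
clear = sorted(s for (s, p, o) in triples
               if p == "origin" and o == "IAD" and s not in affected)
print(clear)  # ['UA123']
```

Once both feeds share the triple representation, the "complex query" reduces to a join plus a negation, which is exactly what SPARQL provides at scale.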

2.2. Machine Learning

The Machine Learning task focuses on understanding the potential of existing or new OGC standards for supporting Machine Learning (ML) applications in the context of wildland fire safety and response. In this context, the integration of ML models into standards-based data infrastructures, the handling of ML training data, and the integrated visualization of ML data with other source data shall be explored. Emphasis is on the integration of data from the Canadian Geospatial Data Infrastructure (CGDI), the handling of externally provided training data, and the provisioning of results to end-users without specialized software.

Figure 5. Photo by Matt Howard on Unsplash

Wildland fires are those that occur in forests, shrublands and grasslands. While representing a natural component of forest ecosystems, wildland fires can present risks to human lives and infrastructure. Being able to properly plan for and respond to wildland fire events is thus a critical component of forestry management and emergency response.

Appropriate responses to wildland fire events benefit from planning activities undertaken before events occur. ML presents a new opportunity to advance wildland fire planning using diverse sets of geospatial information (e.g. satellite imagery, Light Detection and Ranging (LiDAR) data, land cover information, building footprints). As much of the required geospatial information is made available using OGC standards, a requirement exists to understand how well these standards can support ML in the context of wildland fire planning.

Testbed-16 will explore how to leverage ML, cloud deployment and execution, and geospatial information, provided through OGC standards, to improve planning approaches for wildland fire events. Findings of the work will also inform future improvement and/or development activities for OGC standards, leading to improved potential for the use of OGC compliant data within ML applications.

Advanced planning for wildland fire events can greatly improve the ability of first responders to address a situation. However, it is very difficult to account for the many variables (e.g. wind, dryness, fuel loads) and their combinations that will be present at the exact time of an event. As such, there is an opportunity to evaluate how ML approaches, combined with geospatial information delivered using OGC standards, can improve response planning throughout the duration and aftermath of wildland fire occurrences.

Thus, in addition to planning related work, Testbed-16 shall explore how to leverage ML technologies for dynamic wildland fire response. It will also provide insight into how OGC standards can support wildland fire response activities in a dynamic context. Any identified limitations of existing OGC standards will be used to plan improvements to these frameworks. An opportunity also exists to explore how OGC standards may be able to support the upcoming Canadian WildFireSat mission.

The Canadian Wildland Fire Information System (CWFIS) provides further information about wildland fires in Canada. It creates daily fire weather and fire behavior maps year-round and hot spot maps throughout the forest fire season, generally between May and September.

Important
Though this task uses a wildland fire scenario, the emphasis is not on the quality of the modelled results, but on the integration of externally provided source and training data, the deployment of the ML model on remote clouds through a standardized interface, and the visualization of model output!

2.2.1. Problem Statements, Requirements, and Research Questions

Testbed-16 shall address the following three challenges.

  1. Discovery and reusability of training data sets

  2. Integration of ML models and training data into standards-based data infrastructures

  3. Cost-effective visualization and data exploration technologies based on Map Markup Language (MapML)

Figure 6. ML and EO integration challenges; various training data sets that need to be discovered, loaded, and interpreted (left); integration of (live) sources from Web APIs, event streams, or Web Services (right); and visualization (top)
Training Data Sets

We currently have unprecedented Earth Observation (EO) capabilities at hand. To combine these with the major advances in artificial intelligence in general and ML in particular, we need to close the gap between ML on one side and Earth observation data on the other. In this context, two aspects need to be addressed. First, the extremely limited discoverability and availability of training and test datasets, and second, interoperability challenges to allow ML systems to work with available data sources and live data feeds coming from a variety of systems and APIs.

In this context, training datasets are pairs of EO data samples (the independent variables) and corresponding labels (the dependent, or target, variable). Together, these are used to train an ML model that is then used to predict the target variable for previously unseen EO data. Test data is a set of observations used to evaluate the performance of the model using some performance metric. In addition to the training and test data, a third set of observations, called a validation or hold-out set, is sometimes required. The validation set is used to tune variables called hyperparameters, which control how the model learns. In the following paragraphs, the training data, test data, and validation data are together referred to simply as Training Data Sets (TDS).
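The three-way split described above can be sketched as follows; the file names, labels, and 70/15/15 proportions are placeholders chosen for illustration.

```python
import random

# Each sample pairs an EO input (independent variable) with a label (target).
samples = [{"eo": f"scene_{i}.tif", "label": i % 2} for i in range(100)]

rng = random.Random(42)  # fixed seed so the split is reproducible
rng.shuffle(samples)

n = len(samples)
train      = samples[: int(0.70 * n)]                # fit model parameters
validation = samples[int(0.70 * n): int(0.85 * n)]   # tune hyperparameters
test       = samples[int(0.85 * n):]                 # final performance metric

print(len(train), len(validation), len(test))  # 70 15 15
```

A curated TDS would additionally record provenance and licensing for each subset, which is precisely the discoverability and reusability gap this task addresses.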

To address the general lack of training data discoverability, accessibility, and reusability, Testbed-16 shall develop solutions that describe how training data sets shall be generated, structured, described, made available, and curated.

Integration of (Live) Data

The second aspect addresses the integration of data in ML model runs. This includes data available at Web APIs or Web services, and event streams.

Geospatial information required for wildland fire planning and response is commonly obtained from central data repositories. In Canada, large and well-known geospatial repositories, such as the Earth Observation Data Management System (EODMS), the Federal Geospatial Platform / Open Maps and Canada’s National Forest Information System (NFIS) provide vast quantities and types of reputable geospatial data through OGC standards. However, these systems have generally not been designed to support advanced ML applications, especially within an emergency planning/response context. This component of the work aims to determine how well these systems can support ML applications in the context of OGC standards. It will also provide initial insight into the readiness of fundamental components of the Canadian Geospatial Data Infrastructure (CGDI) for supporting new technologies such as ML. It will give actionable recommendations as to how CGDI geospatial information repositories can be improved to better support ML applications. Potential improvements to OGC standards in the context of geospatial data repositories and extreme events will also be identified.

Deployment and Execution of ML Models

All machine learning models shall be deployed and executed on cloud platforms offering a specialized WPS interface known as the Application Deployment and Execution Service (ADES). The ADES was developed in Testbed-13/14 and allows any type of application that is packaged as a Docker container and made available on a Docker hub to be deployed and executed in a cloud environment. The WPS will be made available to participants. To use it, model providers need to package their model together with all necessary auxiliary data (training data, configuration files, etc.) in a Docker container, create an Application Package description of that container, and submit it to the transactional ADES. Support for deployment and operation as well as all necessary cloud resources will be provided by Testbed-16 Sponsors. Further information on the ADES is provided in section Previous Work, Earth Observation Cloud Processing.
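The kind of information an Application Package must convey can be sketched as below. The field names are simplified placeholders and the image reference is hypothetical; the normative encoding is defined in the Testbed-13/14 Engineering Reports referenced in the Previous Work section.

```python
import json

# Illustrative (non-normative) description of a containerized ML model
# to be registered with a transactional ADES.
app_package = {
    "id": "wildfire-fuel-classifier",
    "dockerImage": "example/fuel-classifier:1.0",  # hypothetical image on a Docker hub
    "inputs": [
        {"id": "scene", "title": "EO input scene", "format": "image/tiff"},
    ],
    "outputs": [
        {"id": "fuel_map", "title": "Classified fuel map", "format": "image/tiff"},
    ],
}
body = json.dumps(app_package, indent=2)
print("dockerImage" in body)  # True
```

The key design point is that the ADES only sees the container reference plus typed inputs and outputs, so the model's internals (framework, training data baked into the image) stay opaque to the platform.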

Visualization of ML Results

For planning and response activities around wildland fire events, it is critical that stakeholders (e.g. planners, first responders, residents, policy makers) are able to visualize related geospatial information quickly and accurately. Currently, such visualization requires users to have access to specialized software and skills. These barriers shall be reduced using tools supporting MapML. When implemented, MapML allows geospatial information to be viewed and interacted with within web browsers. With the widespread availability of web browsers on multiple devices, intuitive user interfaces, and no licensing costs, making geospatial information available through MapML has the potential to revolutionize how we interact with geographic information.

This Testbed-16 task shall determine the utility of MapML for providing a geospatial information visualization and interaction interface in the context of wildland fire planning and response. Findings will allow the sponsor to determine if MapML would provide a benefit to wildland fire stakeholders, including what improvements may be required. It will also aim to increase the visibility of MapML as a practical tool for geospatial visualization and interaction. Potential improvements to OGC standards to further leverage MapML capabilities will also be identified.

To be more precise, all geospatial information results from the wildland fire planning and response components shall be published so that they incorporate MapML capability. The delivery of MapML has been explored within Testbed-13 and Testbed-14. The Testbed-13 MapML Engineering Report (OGC 17-019) recommends using the Web Map Service (WMS) or Web Map Tile Service (WMTS), with small modifications, as services to deliver MapML documents. Other options arise from using OGC Web APIs as developed in Testbed-15. The sponsor requires that MapML be implemented in a Web browser to publish results. Several open source web browser engines exist that can be leveraged for this work (e.g. WebKit, Gecko, and Blink). JavaScript implementations of MapML are not to be used for final result delivery.
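Following the Testbed-13 recommendation, a MapML document could be obtained from a WMS through an ordinary GetMap request whose output format asks for MapML. The endpoint and layer name below are invented, and the media type ("text/mapml") should be checked against the current MapML specification; only the GetMap parameter set itself is standard WMS 1.3.0.

```python
from urllib.parse import urlencode

def mapml_getmap(base, layer, bbox, width=512, height=512):
    """Compose a WMS 1.3.0 GetMap request asking for a MapML document."""
    params = {
        "SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetMap",
        "LAYERS": layer, "CRS": "EPSG:4326",
        "BBOX": ",".join(str(c) for c in bbox),
        "WIDTH": width, "HEIGHT": height,
        "FORMAT": "text/mapml",  # assumed MapML media type
    }
    return f"{base}?{urlencode(params)}"

# Hypothetical fire-risk layer over an area near Ottawa.
url = mapml_getmap("https://example.org/wms", "fire_risk",
                   (45.0, -76.0, 46.0, -75.0))
print(url)
```

Because the only change from a conventional GetMap is the output format, an existing WMS stack needs minimal modification to serve MapML to browser clients.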

Furthermore, Testbed-16 shall investigate the ability of MapML to operate on mobile devices and in mobile environments.

A comparison of MapML to other visualization and interaction tools shall complement this work. Here, Testbed-16 shall compare and contrast the ability of MapML to act as an operational stakeholder geospatial information visualization and interaction tool with current approaches used within a wildland fire context. The sponsor is particularly interested in exploring the ability of MapML to federate between authoritative sources (e.g. municipal/territorial/Indigenous/provincial/state/federal governments, international organizations, etc.).

If applicable, Testbed-16 shall provide actionable recommendations for MapML improvement to better support extreme event advanced planning and active response.

Overview of Task Activities

The following diagram provides an overview of the main work items for this Testbed-16 task, with the training data at the bottom, existing platforms and corresponding APIs to the left, and machine learning models and visualization efforts to the right.

Figure 7. Major components and research aspects of the Machine Learning task

The platforms to the left in the ML scenario are all operational. EODMS supports multiple services, including WCS, CSW, WFS, and WPS; the FAQ "Can I access EODMS using an API?" provides further information. FGP / Open Maps supports WFS, WMS, and WMTS, as well as ESRI REST. NFIS supports various interfaces; documentation for all platforms is accessible online.

Research Questions

The following overarching research questions shall further help to guide the work in this task:

  • Does ML require "data interoperability"? Or can ML enable "data interoperability"? How do existing and emerging OGC standards contribute to a data architecture flow towards "data interoperability"?

  • Where do trained datasets go and how can they be re-used?

  • How can we ensure the authenticity of trained datasets?

  • Is it necessary to have analysis ready data (ARD) for ML? Can ML help ARD development?

  • What is the value of datacubes for ML?

  • How do we address interoperability of distributed datacubes maintained by different organizations?

  • What is the potential of MapML in the context of ML? Where does it need to be enhanced?

  • How to discover and run an existing ML model?

2.2.3. Scenario

The Machine Learning task scenario addresses two phases of wildland fire management, i.e. Wildland Fire Planning and Wildland Fire Response. For both phases, various steps of training and analysis data integration, processing, and visualization shall be executed as outlined below. The scenario serves to guide participants through the various steps in the two phases of wildland fire planning and response and helps to ground all work in a real-world setting. Focus shall remain on the requirements listed above.

Important
Though this task uses a wildland fire scenario, the emphasis is not on the quality of the modeled results, but on the integration of externally provided source and training data and the visualization of model output!

Annotated training data sets will be provided. Additional datasets can be provided during the Testbed.

  • RADARSAT-1 open Synthetic Aperture Radar (SAR) imagery through EODMS (see this link for more information).

  • NRCan National Air Photo Library through EODMS (see this link for more information).

  • Sample LiDAR datasets covering the Charles H. Herty Pines Nature Preserve in Statesboro, Georgia. The data contains 3D point files in ASCII format, which can easily be converted to other formats as needed.

  • Additional LiDAR datasets to be provided by NRCan partners in New Brunswick.
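The ASCII point files mentioned above can be read into simple x/y/z tuples, the form in which point-based deep learning architectures consume point clouds. The coordinate values and the assumed column layout (x y z intensity) are illustrative only; the actual sample data may use a different layout.

```python
# Inline stand-in for a few lines of an ASCII LiDAR point file.
sample = """\
368001.12 3578900.45 112.3 201
368001.84 3578901.02 112.9 198
368002.31 3578899.77 111.7 205
"""

points = []
for line in sample.splitlines():
    # Assumed columns: x y z intensity (whitespace-separated).
    x, y, z, intensity = line.split()
    points.append((float(x), float(y), float(z)))

print(len(points), points[0][2])  # 3 112.3
```

A real pipeline would read such records in chunks from file and normalize coordinates before feeding fixed-size point sets to the network.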

The scenario includes the following major steps:

Wildland fire planning:

  1. Investigate the application of different ML frameworks (e.g. Mapbox RoboSat, Azavea’s Raster Vision, NRCan’s GeoDeepLearning) to multiple types of remotely sensed information (i.e. synthetic aperture radar and optical satellite imagery, LiDAR), provided through OGC standards, to identify fuel availability within targeted forest regions.

  2. Explore interoperability challenges of training data. Develop solutions that allow the wildland fire training data, test data, and validation data to be structured, described, generated, discovered, accessed, and curated within data infrastructures.

  3. Explore the interoperability and reusability of trained ML models to determine their potential for applications using different types of geospatial information. Interoperability, reusability, and discoverability are essential elements for cost-efficient ML. The structure and content of a trained ML model have to provide information about its purpose. Questions such as “What is it trained to do?”, “What data was it trained on?”, or “Where is it applicable?” need to be answered sufficiently. Interoperability of training data should be addressed equivalently.

  4. Deep Learning (DL) architectures can use LiDAR data to classify field objects (e.g. buildings, low vegetation, etc.). These architectures mainly use the TIFF and ASCII image formats. Other DL architectures use 3D data stored in a rasterized or voxelized form. However, 3D voxels or rasterized forms may have many approximations that make classification and segmentation vulnerable to errors. Therefore, Testbed-16 shall apply advanced DL architectures directly to the raw point cloud to classify points and segments of individual items (e.g. trees, etc.). Participants shall use the PointNET architecture for this or propose different approaches. If different DL architectures are proposed, the sponsor will consider them as an alternative to PointNET. Sponsor approval will be required before a different architecture can be used.

  5. Leverage outcomes from the previous steps to predict wildland fire behavior within a given area through ML. Incorporate training of ML using historical fire information and the Canadian Forest Fire Danger Rating System (fire weather index, fire behaviour prediction), leveraging weather, elevation model, and fuels data.

  6. Use ML to discover and map suitably sized and shaped water bodies for water bombers and helicopters.

  7. Investigate the use of ML to develop smoke forecasts based on weather conditions, elevation models, vegetation/fuel and active fires (size) based on distributed data sources and datacubes using OGC standards.
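Step 3 above asks what information a trained model must carry with it. The sketch below is purely illustrative: the field names and values are hypothetical assumptions, not a Testbed deliverable; the actual metadata model is an expected outcome of this task.

```python
import json

# Hypothetical metadata record for a trained ML model. Each field answers one
# of the questions posed above; nothing here is part of any OGC specification.
model_metadata = {
    "id": "fuel-availability-classifier-v1",
    "purpose": "Classify fuel availability in targeted forest regions",   # "What is it trained to do?"
    "training_data": {                                                    # "What data was it trained on?"
        "sources": ["RADARSAT-1 SAR imagery", "NRCan air photos", "LiDAR point clouds"],
        "spatial_extent": [-141.0, 41.7, -52.6, 83.1],                    # bbox (WGS84)
        "temporal_extent": ["2015-01-01", "2019-12-31"],
    },
    "applicability": "Boreal and temperate forest regions",               # "Where is it applicable?"
    "framework": "PointNET",
}

print(json.dumps(model_metadata, indent=2))
```

A record like this could accompany a model on deployment so that catalogues can discover it and users can judge its applicability before reuse.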

Wildland fire response:

For the wildland fire response phase, the following aspects should be considered:

  1. Explore ML methods for identifying active wildland fire locations through analysis of fire information data feeds (e.g. the Canadian Wildland Fire Information System, the United States Geological Survey LANDFIRE system) and aggregation methods. Explore the potential of MapML as an input to the ML process and the usefulness of a structured Web of geospatial data in this context.

  2. Implement ML to identify potential risks to buildings and other infrastructure given identified fire locations, and explore the potential for estimating damage costs.

  3. Investigate how existing standards related to water resources (e.g. WaterML, Common Hydrology Features (CHyF)), in conjunction with ML, can be used to locate potential water sources for wildland fire event response.

  4. Develop evacuation and first responder routes based on ML predictions of active fire behaviour and real-time conditions (e.g. weather, environmental conditions).

  5. Based on smoke forecasts and suitable water bodies, determine if suitable water bodies are accessible to water bombers and helicopters.

  6. Explore the communication of evacuation and first responder routes, as well as other wildland fire information, through Publication/Subscription (Pub/Sub) messaging.

  7. Examine how ML can be used to identify watersheds/water sources that will be more susceptible to degradation (e.g. flooding, erosion, poor water quality) after a fire has occurred.

  8. Identify how OGC standards and ML may be able to support the goals of the upcoming Canadian WildFireSat mission.

2.2.4. Work Items & Deliverables

The following figure illustrates the work items of this task and identifies deliverables.

mlDeliverables
Figure 8. Machine Learning work items (client, machine learning tools, training data) and deliverables (green with numbered identifiers)

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions and to provide contributions to the Engineering Reports. Thread assignment and funding status are defined in section Summary of Testbed Deliverables. Some work items are identical to facilitate Technology Integration Experiments (TIEs).

Engineering Reports

  • D015 Machine Learning Engineering Report - Engineering Report capturing all results and experiences from this task. It shall respond to all requirements listed above. The Engineering Report shall contain a plain language executive summary to clearly outline the motivations, goals, and critical outcomes of this task, taking into account the mandates of the OGC and the sponsor(s).

  • D016 Machine Learning Training Data Engineering Report - Engineering Report describing the training data metadata model, structure, file format, media type, and its integration into Spatial Data Infrastructures and SDI-based Machine Learning tools, which includes discovery, access, and authenticity evaluation.

Components

  • D130 MapML Client 1 - MapML client, to be provided either as a server proxy in combination with a Web browser frontend, or as a Web App supporting MapML. JavaScript implementations are not to be used.

  • D131 MapML Client 2 - Similar to D130.

  • D132 Machine Learning Environment 1 - An ML framework (e.g. Mapbox RoboSat, Azavea’s Raster Vision, NRCan’s GeoDeepLearning) with support for OGC Web Services or OGC Web APIs to retrieve externally provided data as described above, and to provide results at OGC Web APIs or OGC Web service interfaces in a form that allows MapML clients to explore results. The model shall be deployed and executed on a cloud platform that provides an ADES interface for easy deployment and execution. Preference will be given to contractors that make the configured ML framework available to the sponsor at the end of the project. Ideally, this happens in the form of scripts that build a Docker instance or process, which initializes the ML runs.

  • D133 Machine Learning Environment 2 - Similar to D132

  • D134 Deep Learning Environment - A DL framework with an architecture based on PointNET or a sponsor-approved equivalent. The model shall be deployed and executed on a cloud platform that provides an ADES interface for easy deployment and execution. Preference will be given to contractors that make the configured Deep Learning Environment available to the sponsor at the end of the project. The environment shall be capable of:

    • Direct application to raw LiDAR point clouds.

    • Classifying points and segments for individual field objects (e.g. trees, etc.)

  • D135 Training Data Set 1 - Training data set including training data, test data, and validation data compliant with the model and definitions defined in D016. The data shall be made available at Web API endpoints.

  • D136 Training Data Set 2 - Similar to D135

2.3. Data Access and Processing API (DAPA) for Geospatial Data

In the past, data retrieval mechanisms for geospatial data have been defined from a provider-centric point of view. As a result, many data access mechanisms are built around powerful web services with rich query languages or file-based download services. Both may result in a sub-optimal experience for the end-user, who would often prefer an approach similar to a local function call on a dataset in memory. Testbed-16 shall explore this end-user-centric perspective and develop proposals for future standardization work in the context of end-user-centric data retrieval and processing APIs. For the end-user, a function call to calculate the minimum temperature value for a target area shall look the same regardless of the data's location, be it a local file, an in-memory structure (e.g. an xarray), or a remote dataset stored on a cloud.

dapaTeaser

Several data encoding formats are currently in use in OGC, such as NetCDF, GeoTIFF, HDF, GML/Observations and Measurements or variations thereof, or, increasingly often, JSON-encoded data. The different formats exist for various reasons, such as efficiency enhancements for specific domains or use cases, interoperability efforts, or simply historical reasons. JSON, for example, is the first format that comes to mind when thinking about sharing data on the Web: it is simple to understand and code against, but it may be a bad idea to use it with medium-sized matrices, since the result can be several times the size of a comparable uncompressed GeoTIFF.

Testbed-16 shall develop recommendations for data encoding formats that fit the use cases described below. Data encoding formats are used to exchange data; data storage formats are out of scope for this Testbed, and data can be stored in any format on the provider side. The long-term vision of the overall effort is to provide an API that shall be supported by all data providers. In this context, additional aspects such as cloud-friendly or cloud-optimized formats (see, for example, Cloud Optimized GeoTIFF (COG) or Zarr) need to be considered, but these are out of scope for this Testbed. A study on this topic is currently being executed by NASA and ESA, with results expected in early 2020. The study is looking at several EO data formats, both legacy (e.g., NetCDF) and cloud-optimized (COG, Zarr), to see what they offer with respect to supporting analysis in the cloud, as well as their suitability as storage formats. Results from that study should be taken into account.

Though Testbed-16 definitely references the OGC baseline, there is freedom to explore data formats with a fresh pair of eyes. Recommended data formats do not need to be extremely generic and applicable to all possible situations, but shall be easy to handle for end-users and support the requirements defined by the environmental data retrieval use cases.

2.3.1. Problem Statement & Research Questions

This task shall address the following research questions:

  1. What does a resource model look like that binds specific functions to specific data?

  2. How can an end-user-optimized data access and processing API (DAPA) be realized whose calls look like local function calls?

  3. Which data encoding formats work best in which data retrieval situations for the use cases defined below?

The first question addresses the need to bind specific data, for example weather data, to a specific set of functions, e.g. access and analytical functions. Not all data is equally suited for all types of data processing. If the data does not support bands, all band algebra that is traditionally applied to multi-band satellite imagery is inapplicable. On the other hand, nominally or ordinally scaled data may require different classification and interpolation techniques than ratio-scaled data and does not allow the same set of mathematical operations. In short, the type of data needs to fit the set of operations. To meet this need, each endpoint that provides access to data and analytics needs to express the specific combination in the list of data resources advertised by the endpoint. As Web API endpoints work on resources, the underlying resource data model needs to express valid combinations of data and operations.
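The resource-model idea can be sketched as a structure in which every advertised dataset lists the operations that are valid for it. All names below are illustrative assumptions, not part of any OGC specification.

```python
# Sketch of a resource model that binds data to valid operations. A client can
# check the advertised combinations before issuing a processing request.
resources = {
    "sentinel2-l2a": {
        "data_type": "multi-band raster",
        "operations": ["min", "max", "mean", "ndvi"],  # band algebra applies here
    },
    "landcover-classes": {
        "data_type": "nominally scaled raster",
        "operations": ["mode", "count"],               # no mean/NDVI on nominal data
    },
}

def supported(resource_id, operation):
    """Return True if the endpoint advertises this data/operation combination."""
    return operation in resources.get(resource_id, {}).get("operations", [])

print(supported("sentinel2-l2a", "ndvi"))      # band algebra on imagery: True
print(supported("landcover-classes", "mean"))  # invalid on nominal data: False
```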

The second question addresses the need to allow processing that is data-storage agnostic. An end-user would like to make the same call for local data as for remote data. Figuratively speaking, an operation to calculate the minimum value within some data structure looks similar for MyLocalInMemoryData.MIN(), MyLocalFile.MIN(), and http://your.data/subset.MIN(). The DAPA shall make full use of the capabilities of the OpenAPI specification, which allows e.g. CoverageJSON to provide a fully implemented schema of the returned data (less so for large formats, such as NetCDF, or binary formats). That helps, first, to automate code generation and, second, to validate whether the returned data follows a specific model.
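A minimal sketch of this storage-agnostic idea, assuming hypothetical class names: the same .min() call works whether the data lives in memory, in a local file, or behind a remote endpoint. Only the in-memory and file-backed variants are runnable here; the remote variant would require a live DAPA endpoint.

```python
# Three data locations, one uniform interface: the caller never needs to know
# where the values are stored. All class and endpoint names are illustrative.
class InMemoryData:
    def __init__(self, values):
        self.values = list(values)
    def min(self):
        return min(self.values)

class FileData:
    def __init__(self, path):
        self.path = path
    def min(self):
        # One numeric value per line, as a stand-in for any local file format
        with open(self.path) as f:
            return min(float(line) for line in f if line.strip())

class RemoteData:
    """Would issue e.g. GET http://your.data/subset/min (endpoint is assumed)."""
    def __init__(self, url):
        self.url = url
    def min(self):
        raise NotImplementedError("requires a live DAPA endpoint")

data = InMemoryData([21.0, 18.5, 19.2])
print(data.min())  # 18.5
```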

The third research question addresses the need to better understand how to encode data for transport and exchange. Given that there is no universal answer to this question, the goal is to discuss encodings in the context of different scenarios and user groups.

2.3.2. Aim

The aim of this task is to simplify access to environmental and earth observation data by providing a Data Access and Processing API (DAPA). The API shall be evaluated by scientists and tested with Jupyter Notebook implementations, which can serve as examples for further research work.

2.3.3. Background & Previous Work

The Data Access and Processing API development takes into account several developments and existing best practices for both APIs and data encodings.

On the API side, there are for example openEO, GeoTrellis, and GeoAPI. The European Commission-funded research project openEO is currently developing an open API to connect R, Python, JavaScript, and other clients to big Earth observation cloud back-ends in a simple and unified way. GeoTrellis is a geographic data processing engine for high-performance applications. It is implemented as a Scala library and framework that uses Apache Spark to work with raster data and supports many map algebra operations as well as vector-to-raster and raster-to-vector operations. The OGC GeoAPI Implementation Standard defines the GeoAPI library, a Java language API that includes a set of types and methods for manipulating geographic information represented according to ISO and OGC standards.

These APIs are complemented by a set of emerging OGC API standards to handle geospatial data and processes. The OGC API family of (mostly emerging) standards is organized by resource type. So far, OGC API - Features has been released as a standard that specifies the fundamental API building blocks for interacting with features. The spatial data community uses the term 'feature' for things in the real world that are of interest. OGC API standards define modular API building blocks to spatially enable Web APIs in a consistent way. The OpenAPI specification is used to define the API building blocks.
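For illustration, the OGC API building blocks follow a common resource-oriented path pattern, with filters expressed as query parameters. The server URL and collection name below are hypothetical.

```python
from urllib.parse import urlencode

base = "https://example.org/ogcapi"   # hypothetical server
collection = "wildfire_perimeters"    # hypothetical feature collection

# OGC API - Features: retrieve features from a collection, filtered by
# bounding box and time range.
params = urlencode({
    "bbox": "-120.0,49.0,-110.0,60.0",                        # spatial filter (WGS84)
    "datetime": "2019-06-01T00:00:00Z/2019-09-30T23:59:59Z",  # temporal filter
    "limit": 100,
})
items_url = f"{base}/collections/{collection}/items?{params}"
print(items_url)

# Other standard resources follow the same pattern:
#   {base}/               landing page
#   {base}/conformance    conformance classes
#   {base}/collections    list of feature collections
```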

On the data encoding side, there are several existing standards that are frequently used for Earth observation, environmental, ecological, or climate data. These include NetCDF, GeoTIFF, HDF, GML/Observations and Measurements or variations thereof, or, increasingly often, JSON-encoded data. Testbed-16 shall explore existing solutions as well as emerging specifications and provide recommendations with a focus on the end-user, i.e. the data or earth scientist.

The OGC-ESIP Coverage and Processing API Sprint at ESIP winter meeting in January 2020 performs an analysis on coverages beyond the current WCS capabilities. This effort takes into account various elements that need to be developed for an API approach based on the abstract specifications for Coverages and Processing as well as OPeNDAP, GeoXarray/ZARR, R-spatial and other modern software development environments. The Geospatial Coverages Data Cube Community Practice document describes community practices for Geospatial Coverage Data Cubes as implemented by multiple communities and running as operational systems.

2.3.4. Scenario & Requirements

Testbed-16 shall address three different use cases. All use cases shall be implemented and executed in Jupyter Notebooks that interact with OGC Web APIs as illustrated in the figure below.

dapaFlow
Figure 9. DAPA architecture

The use cases describe different data retrieval requests from the end-user’s point of view. The end-user wants to execute a Jupyter Notebook, which executes a function call on a Web API. The Web API shall then interact with the actual data platform, though Testbed-16 abstracts this last step away and concentrates on the Jupyter Notebook and Web API instead.

Use-Case 1: Data Retrieval

The user wants to access geospatial data for a specific area in a simple function call. The function call shall identify the data and allow the user to define the discrete sampling geometry. Valid geometries shall include point locations (x, y, and optional z), bounding boxes, and polygons. All geometries shall be provided either in-line or by reference, as shown exemplarily below:
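Purely as illustration, the in-line and by-reference options could look as follows; the URLs and parameter names are assumptions, since the actual DAPA design is a Testbed-16 outcome.

```python
from urllib.parse import urlencode

base = "https://example.org/dapa/collections/temperature"  # hypothetical endpoint

# 1. Geometry in-line: point location, bounding box, or polygon coordinates
point_call = base + "/retrieve?" + urlencode({"coords": "POINT(6.95 50.94)"})
bbox_call = base + "/retrieve?" + urlencode({"bbox": "5.8,50.3,7.2,51.1"})

# 2. Geometry by reference: a link to an OGC API - Features item that
#    provides the sampling geometry
feature_ref = "https://example.org/features/collections/districts/items/42"
ref_call = base + "/retrieve?" + urlencode({"geometry-ref": feature_ref})

print(point_call)
print(ref_call)
```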

The latter shall allow requesting data for a specific sampling geometry by providing a call to an OGC API - Features endpoint. The various encoding options for sampling geometries provided by OGC API - Features instances shall be discussed.

Users shall be enabled to access original data. The word “original” is a bit tricky in this context, as data often undergoes some processing on its way from original reading to final product. As an example, imagine a digital temperature sensor. The actual reading performed in the sensor is some form of electric signal, but the value provided at the sensor interface is 21 °C. Thus, some form of calibration curve has been applied to the original reading, which might not be available at all. In this case, the value 21 °C can be considered “original”. The same principles apply to satellite data. The original raw data readings are often not accessible. Instead, the data underwent some correction process before being made available. Higher product levels may include orthorectification or re-gridding processes. In any case, data providers shall provide a description of the performed processing together with the actual data. In addition, data should be available as raw as possible.

End-users want to retrieve all data that exists within the provided target geometry. In the case of a polygon geometry, the end-user shall receive all data that is located within that polygon. In the case of a point geometry, the end-user shall retrieve the value exactly at that point.

In addition, end-users shall have the option to define the (interpolation) method for value generation. If no option is selected, the Web API shall indicate how a given value was produced. Testbed-16 shall develop a set of frequently used production options, including for example “original value”, “interpolation method”, “re-gridding”, or any combination thereof.

This use case differentiates the following data requests:

  1. Synopsis/Time-Averaged Map: The end-user wants to retrieve data for a single point in time or as an average value over a time period. The figure below is an example of visualized time-averaged data for a number of sampling locations.

dapaTAM

  2. Area-Averaged Time Series: The end-user wants to retrieve a single value that averages all data in the target geometry for each time step. The figure below is an example of visualized area-averaged data for a number of time steps.

dapaATS

  3. Time Series: The end-user wants to retrieve the full time series for each data point. The figure below is an example of a visualized full time series data set that includes a number of time steps.

dapaTimeSeries

Testbed-16 shall explore these use-cases in combination with additional processing steps. For example, the end-user requests synoptic, map, or time series data, that is interpolated to a grid.

Testbed-16 does not address the data storage side. The stored data can be point cloud, gridded data set, datacube, or anything else.
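The three request types can be sketched client-side on a toy dataset; a DAPA server would perform the equivalent computation remotely, but the semantics are the same.

```python
from statistics import mean

# Toy dataset with dimensions (time, location): 3 time steps x 4 sampling locations
values = [
    [21.0, 19.5, 18.0, 20.5],
    [22.5, 20.0, 19.0, 21.0],
    [20.0, 18.5, 17.5, 19.5],
]

# 1. Time-averaged map: one value per location, averaged over all time steps
time_averaged_map = [mean(col) for col in zip(*values)]

# 2. Area-averaged time series: one value per time step, averaged over all locations
area_averaged_series = [mean(row) for row in values]

# 3. Full time series: every value, per location and time step
full_series = values

print(area_averaged_series)  # [19.75, 20.625, 18.875]
```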

Use-Case 2: Data Processing

Testbed-16 shall explore simple data processing functions. These include calculating the minimum, maximum, and average values for any given data retrieval subset accessible in the Data Retrieval use case.
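A minimal sketch of these aggregate functions applied to a retrieved subset; they are evaluated locally here, whereas a DAPA server would evaluate them server-side on the requested subset.

```python
from statistics import mean

def process(subset, functions=("min", "max", "mean")):
    """Apply the Use-Case 2 aggregate functions to a retrieved data subset."""
    ops = {"min": min, "max": max, "mean": mean}
    return {name: ops[name](subset) for name in functions}

# Example: temperature values retrieved for some target geometry
print(process([21.0, 18.5, 19.2, 22.3]))
```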

Use-Case 3: API Evaluation

The third use case is orthogonal to the first two. It does not add any additional requirements on the API itself, but evaluates the API from an end-user point of view. This third use case will be implemented in the form of a full-day workshop to which several data scientists or earth scientists are invited to evaluate the API regarding:

  • Learning curve to use the API

  • Richness of accessible functionality

  • Amount of code needed to execute some common analyses

It is currently planned to organize the workshop in conjunction with another major event, such as the ESIP summer meeting (July 2020, Burlington, Vermont) or the OGC Technical Committee meeting (June 2020, Montreal, Canada). API developers are not required to attend the workshop physically; remote participation will be provided. Note that a workshop in June or July requires early design and implementation work to be finished.

The workshop shall allow the API developers and endpoint providers to further refine the DAPA and increase ease of use based on the feedback provided by the scientists. It is therefore expected that early versions of the API and corresponding implementations are available in time for the mid-term evaluation workshop.

2.3.5. Work Items & Deliverables

The following figure illustrates the work items and deliverables of this task.

dapaDeliverables
Figure 10. Deliverables of the Environmental Data Retrieval and Processing task

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions. Thread assignment and funding status are defined in section Deliverables Summary & Funding Status.

Engineering Reports

  • D005 Data Access and Processing Engineering Report - Engineering Report capturing all results and experiences from this task, including descriptions of all implementations and feedback from scientists.

  • D026 Data Access and Processing API Engineering Report - Engineering Report describing the DAPA. The report shall be complemented by an OpenAPI example available on Swagger Hub.

Components

  • D107 Jupyter Notebook - Jupyter Notebook that interacts with the API endpoints D165-D167. The concrete data retrieval and processing scenarios shall be defined at the kick-off meeting. At minimum, the use cases Data Retrieval and Data Processing shall be supported. The Jupyter Notebook can use any supported programming language.

  • D108 Jupyter Notebook - similar to D107.

  • D109 Jupyter Notebook - similar to D107.

  • D165 API Endpoint - API endpoint implementation that provides the frontend to some data store. Any data store can be used, but operational data stores that are publicly available are preferred. The API endpoint shall implement the API defined in D026.

  • D166 API Endpoint - similar to D165.

  • D167 API Endpoint - similar to D165.

  • D110 Expert - A data scientist or earth scientist helping to evaluate the API as described in the Evaluation use case. Experts are expected to provide their recommendations in written form so that they can be integrated into D005.

  • D111 Expert - similar to D110.

  • D112 Expert - similar to D110.

2.4. Earth Observation Application Packages with Jupyter Notebooks

Testbeds 13, 14, and 15 developed an architecture that allows deploying and executing arbitrary applications next to the physical location of the data to be processed. The architecture builds on specialized WPS interfaces that allow submitting so-called Application Packages to Exploitation Platforms. An Exploitation Platform is a cloud-based virtual environment that provides users with access to Earth observation data and tools for their manipulation. An Application Package contains all information, data, and software required to execute the application, packaged inside a Docker container, on a remote platform. The architecture is described, and currently under evaluation, in the OGC Earth Observation Applications Pilot. Testbed-16 shall now complement this approach with applications based on Project Jupyter. The goal of Project Jupyter is to improve the workflows of researchers, educators, scientists, and other practitioners of scientific computing.

jupyter

Testbed-16 shall explore how programming code developed by a scientist can be shared with other scientists in an efficient and secure way based on Jupyter Notebooks and related technology. The actual processing shall take place on exploitation platforms, which are data platforms that provide additional processing capacities. One of the key challenges in this context is retrieval of the data that shall be analyzed and processed. This data is stored in a variety of storage formats, such as individual files on clouds, datacubes, or databases, and needs to be made available to the Jupyter kernel. To facilitate data access, the data stores are ideally fronted with a standardized data access and processing API. The Testbed-16 task Data Access and Processing API (DAPA) for Geospatial Data has the goal to develop such an API and is therefore closely related. While sharing Jupyter Notebooks is the focus of this task, both tasks need to merge towards the end of the Testbed to allow for shared Technology Integration Experiments (TIEs).

Testbeds 13-15 have explored various mechanisms to link individual applications into a chain of applications, where the output of one application serves as input for the next one. Though Testbed-13 favored BPMN as the preferred approach, Testbed-14 identified CWL as a simpler and more appropriate approach. Other initiatives favor process graphs expressed in JSON, with JSON Schema for graph validation. In any case, and independently of terminology, several aspects have not been addressed in full detail yet and need further research. This includes in particular error handling in federated cloud environments, combined with appropriate roll-back and clean-up mechanisms in case some part of a chain fails.
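The roll-back concern can be illustrated with a small sketch: each completed step registers a cleanup action that is executed in reverse order if a later step in the chain fails. Step names and the simulated failure below are illustrative only; real chains would be expressed in CWL, BPMN, or an openEO process graph.

```python
def run_chain(steps):
    """steps: list of (name, run_fn, cleanup_fn) tuples executed in order.
    On failure, completed steps are rolled back in reverse order."""
    completed = []
    try:
        for name, run, cleanup in steps:
            run()
            completed.append((name, cleanup))
    except Exception:
        # Undo every step that already produced intermediate results
        for _name, cleanup in reversed(completed):
            cleanup()
        raise
    return [name for name, _ in completed]

log = []

def ingest():
    log.append("ingest")

def undo_ingest():
    log.append("undo-ingest")

def process_step():
    raise RuntimeError("processing node failed")  # simulated federated-cloud failure

try:
    run_chain([("ingest", ingest, undo_ingest), ("process", process_step, lambda: None)])
except RuntimeError:
    pass

print(log)  # ['ingest', 'undo-ingest']
```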

2.4.1. Problem Statement and Research Questions

Jupyter Notebooks shall be shared with other scientists. However, notebook documents are JSON documents that contain text, source code, rich media output, and metadata, with each segment of the document stored in a cell. Sharing these JSON files is one option for exchange, but not ideal, given the envisioned automated deployment and execution on exploitation platforms.
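For reference, a minimal notebook document skeleton (nbformat 4) built with nothing but the standard library illustrates this cell structure; the cell contents are placeholders.

```python
import json

# Minimal skeleton of a Jupyter notebook document: a JSON file whose cells
# each carry a type, metadata, and source lines.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Fuel analysis"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["result = 21.0 - 2.5\n", "result"]},
    ],
}

doc = json.dumps(notebook, indent=1)  # this string is what a .ipynb file contains
cells = json.loads(doc)["cells"]
print([c["cell_type"] for c in cells])  # ['markdown', 'code']
```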

Simple copying of notebooks between Jupyter installations allows some interactive usage. On Exploitation Platforms, the goal is to go further and support orchestration and execution of notebooks in batch mode as part of workflows (see Testbed-14 results) and notebook discovery supported by catalogue technology (see Testbed-15 results). The goal also includes code inspection by authorized users and on-the-fly code changes to exploit the full potential of Jupyter Notebooks.

Testbed-16 shall develop recommendations on how to handle Jupyter Notebooks similar to Application Packages; or expressed differently: how to include Jupyter Notebooks into Application Packages, so that they can be executed securely on exploitation platforms with support for code change and workflow orchestration and execution.

There are several elements to be considered in this context:

  • It is possible to create, list, and load GitHub Gists from notebook documents. Gists are a way to share code because they allow sharing single files, parts of files, or full applications.

  • With JupyterHub, it is possible to spawn, manage, and proxy multiple instances of a single-user Jupyter notebook server. In other words, it is a platform for hosting notebooks on a server with multiple users, which allows providing notebooks to other scientists.

  • Binder and tmpnb provide temporary environments to reproduce notebook execution, but are now superseded by JupyterHub.

  • nbconvert (https://nbconvert.readthedocs.io/en/latest/) or papermill allow executing a Jupyter Notebook from the command line.

  • Tools such as nbviewer allow to render notebooks as static web pages.

  • Jupyter Dashboards allows displaying notebooks as interactive dashboards, though this functionality is now superseded by Voilà.

  • A big challenge with sharing notebooks is the security model. How to offer the interactivity of a notebook, making use of e.g. Jupyter widgets, without allowing arbitrary code execution by the end-user?

  • Jupyter Notebooks shall be shareable with non-technical persons. In particular, the browser-based read–eval–print loop (REPL) notebook shall be presentable as a web application that hides all programming code fields.

  • Voilà supports interactive widgets, including roundtrips to the kernel. It does not permit arbitrary code execution by consumers of dashboards, is built on Jupyter standard protocols and file formats, and includes a template system to produce rich application layouts.

  • How to chain a set of specific processes in a processing graph? Testbeds 13-15 experimented with CWL and BPMN, but other approaches exist such as openEO process graphs. Testbed-16 shall provide recommendations on the preferred solution and compare the advantages and disadvantages of the various solutions.

2.4.2. Aim

The aim of this task is to extend the Earth Observation Applications architecture developed in Testbeds 13-15, and further evaluated in the OGC Earth Observation Applications Pilot, with support for shared and remotely executed Jupyter Notebooks. The notebooks shall make use of the Data Access and Processing API (DAPA) developed in the Data Access and Processing API (DAPA) for Geospatial Data task and tested in joint Technology Integration Experiments (TIEs).

2.4.3. Previous Work

OGC Testbed activities in Testbed-13, Testbed-14, and the ongoing Testbed-15 have developed an architecture that allows the ad-hoc deployment and execution of applications close to the physical location of the source data. The goal is to minimize data transfer between data repositories and application processes. The following Engineering Reports describe the work accomplished in Testbeds 13 and 14:

  • OGC Testbed-14: Application Package Engineering Report (18-049r1)

  • OGC Testbed-14: ADES & EMS Results and Best Practices Engineering Report (18-050r1)

  • OGC Testbed-14: Authorisation, Authentication, & Billing Engineering Report (18-057)

  • OGC Testbed-14: Next Generation Web APIs - WFS 3.0 Engineering Report (18-045)

  • OGC Testbed-13: EP Application Package Engineering Report (17-023)

  • OGC Testbed-13: Application Deployment and Execution Service Engineering Report (17-024)

  • OGC Testbed-13: Cloud Engineering Report (17-035)

Testbed-13 reports are referenced to provide the background of this work and design decisions in context, but they are mostly superseded by Testbed-14 reports.

Testbed-15 has explored the discovery aspect of processes, applications, and data. The results will be made publicly available in OGC Testbed-15: Catalogue and Discovery Engineering Report (OGC 19-020r1).

A good summary of the current architecture is provided in the OGC Earth Observation Applications Pilot call for participation. The goal of the pilot is to evaluate the maturity of the Earth Observation Applications-to-the-Data specifications that has been developed over the last two years as part of various OGC Innovation Program (IP) initiatives in a real world environment. ‘Real world’ includes integration of the architecture in an environment requiring authenticated user identity, access controls and billing for resources consumed.

At the same time, significant progress towards more Web-oriented interfaces has been made in OGC with the emerging OGC APIs: Core, Features, Coverages, and Processes. All of these APIs use OpenAPI. These changes have not been fully explored in the current architecture, which provides additional ground for experimentation.

2.4.4. Scenario & Requirements

Testbed-16 envisions multiple scenarios that build on each other. All are fictitious and can be modified during the Testbed as long as the basic characteristics, i.e. data access, processing, and chaining, are preserved.

The first scenario explores the handling of Jupyter Notebooks and the interaction with data and processing capacities through the Data Access and Processing API (DAPA). As stated above, the scenario is fictitious and can be replaced by any other scenario as long as the individual steps, i.e. discovery, exploration, data requests and processing for both raster and vector data, and result representation, are preserved.

  1. Notebook cell searches a catalog via OpenSearch with bounding box, time range, and keywords. The notebook receives a list of data collections with WMS or OGC API endpoints for sample browsing and displays them to the user

  2. Notebook makes GetMap requests to WMS endpoints or retrieves maps from OGC API endpoints to see sample pictures of the data collections over the bounding box and shows the results on a map in the notebook

  3. User selects one data collection for deeper exploration

  4. Notebook queries the catalog via OpenSearch for the selected collection’s data items given bounding box and time range

  5. Notebook receives an initial page of resource items with WCS/WFS or other data providing endpoints such as a Data Access and Processing API (DAPA) endpoint

  6. Notebook selects one specific data item and requests data

  7. Notebook receives data for desired space and time, and computes a time-averaged map and an area-averaged time series and displays them

  8. Notebook continues to select, query, and process other data. This step needs to be repeated to illustrate how different types of data (vector and raster at different endpoints and serialized in different formats including data from DAPA) can be used with Jupyter notebooks

  9. Notebook queries shapes from one endpoint to be used at another endpoint for area-of-interest selection

  10. Notebook integrates various data sets and represents results.
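The core of steps 1, 4, and 7 can be sketched in Python with mocked data. The catalog URL and query parameter names below are illustrative assumptions, not part of any published interface:

```python
import statistics
import urllib.parse

# Steps 1/4: build an OpenSearch-style query with bounding box, time range,
# and keywords (hypothetical endpoint and parameter names).
def search_url(bbox, start, end, keywords):
    params = {"bbox": ",".join(map(str, bbox)),  # minLon,minLat,maxLon,maxLat
              "start": start, "end": end, "q": " ".join(keywords)}
    return "https://catalog.example.com/search?" + urllib.parse.urlencode(params)

# Step 7: given a (time, y, x) data cube retrieved via DAPA, derive both
# a time-averaged map and an area-averaged time series.
def time_averaged_map(cube):
    nt = len(cube)
    return [[sum(cube[t][i][j] for t in range(nt)) / nt
             for j in range(len(cube[0][0]))]
            for i in range(len(cube[0]))]

def area_averaged_series(cube):
    return [statistics.mean(v for row in step for v in row) for step in cube]

cube = [[[1.0, 2.0], [3.0, 4.0]],   # t = 0
        [[3.0, 4.0], [5.0, 6.0]]]   # t = 1
print(search_url((5.8, 47.2, 15.1, 55.1), "2020-01-01", "2020-12-31",
                 ["precipitation"]))
print(time_averaged_map(cube))      # [[2.0, 3.0], [4.0, 5.0]]
print(area_averaged_series(cube))   # [2.5, 4.5]
```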

The second scenario extends the first one by making the Jupyter Notebook or web application available to other users on an Earth Observation exploitation platform. The notebook shall be available in both batch mode (the full notebook is executed and only final results are provided) and interactive mode, where the user interacts with the notebook. The scenario includes the following components.

Table 2. Scenario components
Component Definition

EO Exploitation Platform (EP)

A cloud-based virtual environment that provides users with access to EO data and tools for their manipulation

Thematic Exploitation Platform (TEP)

An EO Exploitation Platform focused on a specific theme

ENV-TEP

Fictitious TEP focused on environmental topics

User Application hosting service

A platform service that allows users to upload their application (that implements a data processing algorithm) to the platform. The upload takes the form of an Application Package that is then automatically integrated into the platform. It is assumed that this service is based on the results of OGC Testbed-14, i.e. the Application Deployment and Execution Service (ADES) and Execution Management Service (EMS) respectively

Application execution service

A Platform service that allows users to execute applications available on the platform. It is assumed that this service is based on the results of OGC Testbed-14

Data Hosting service

A Platform service that allows users to store data on the platform

Data Access service

A Platform service that allows users to access the data available on the platform

ENV-TEP provides the following platform level services:

  • User Application hosting

  • Application execution

  • Data hosting

  • Data access

Alice is a user of ENV-TEP and wants to code an algorithm that calculates an environmental quality index (EQIX) for major cities around the world. This quality index is based on a variety of data, e.g. average wind speed, annual rainfall, solar radiation data, population density, and air quality. The data is provided partly on the ENV-TEP and partly by other platforms, and is accessible via the Data Access and Processing API (DAPA).

Alice creates a Jupyter Notebook based application that is deployed on the platform using the User Application hosting service. It is executed by the Application execution service, with data being accessed through DAPA.

The algorithm is made available to others in two forms. First, as a Jupyter Notebook with source code available for step-wise execution and manipulation. Second, as a web application that only requires an area of interest as input and provides the quality index as output.

This scenario shall explore and develop recommendations for notebooks that can be used in both interactive step-wise execution mode and batch mode. The interactive step-wise execution mode displays intermediate results to the user, who can make decisions that influence the next steps (e.g. by selecting one out of many offered data sets). In batch mode, some cells that are used for visualization or user interaction need to be handled differently, since the user expects only the final result.
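A minimal sketch of such dual-mode behavior, under the assumption (hypothetical, not a Testbed requirement) that the execution mode is signalled by an environment flag:

```python
import os

# Hypothetical convention: a notebook cell checks a flag to decide whether to
# interact with the user or fall back to a deterministic batch default.
def choose(candidates, batch):
    if batch:
        return candidates[0]               # batch: silent, deterministic choice
    for i, name in enumerate(candidates):  # interactive: offer the options
        print(f"[{i}] {name}")
    return candidates[0]                   # placeholder for widget/input selection

BATCH = os.environ.get("NB_MODE") == "batch"   # invented flag name
dataset = choose(["era5-precip", "gpm-imerg"], BATCH)
print("selected:", dataset)
```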

The third scenario then adds an application chaining element. Bob is introduced, who creates an application that takes EQIX results from several cities in an area and creates a liveable area index (LAIX). For that purpose, Bob runs the EQIX application for several cities, combines the results for selected cities that fall within a target area, adds additional data such as annual mean temperature for that area, and eventually produces a liveable area index. Applications to be chained in this scenario should be a combination of Jupyter-notebook-based applications and Application Packages that link Docker containers with arbitrary applications. Bob should have read access to the EQIX application, but should not be allowed to delete it.
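The chaining idea can be illustrated with OGC API - Processes style execute requests. All process identifiers, input names, and URLs below are invented for this sketch; the actual EMS interface follows the Testbed-14 ADES/EMS results:

```python
import json

cities = ["berlin", "paris", "warsaw"]

# Bob first runs EQIX once per city ...
eqix_requests = [{"process": "eqix", "inputs": {"city": c}} for c in cities]

# ... then feeds the published result links, plus extra data, into LAIX.
laix_request = {
    "process": "laix",
    "inputs": {
        "eqix_results": [
            {"href": f"https://ems.example.com/jobs/eqix-{c}/results"}
            for c in cities
        ],
        "annual_mean_temperature": {
            "href": "https://data.example.com/temp/annual-mean"
        },
    },
}
print(json.dumps(laix_request, indent=2))
```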

2.4.5. Work Items & Deliverables

The following figure illustrates the work items and deliverables of this task.

jupyterDeliverables
Figure 11. Earth Observation Application Processing with Jupyter Notebooks task architecture and deliverables

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions. Thread assignment and funding status are defined in section Deliverables Summary & Funding Status.

Engineering Reports

  • D027 EOApps with Jupyter Engineering Report - Engineering Report capturing all results and experiences from this task. The report shall provide answers to all research questions and document implementations.

Components

  • D168 Jupyter NB - Jupyter notebook (ideally also available as a web application) that makes a complex application available for chaining at ADES/EMS.

  • D169 Jupyter NB - similar to D168

  • D170 ADES/EMS w. Jupyter kernel - ADES/EMS implementation with Jupyter support that allows registration of D168/169 and supports chaining. The platform shall interact with Data Access and Processing APIs as provided by the Data Access and Processing API (DAPA) for Geospatial Data task.

  • D171 ADES/EMS w. Jupyter kernel - similar to D170

2.5. GeoPackage

GeoPackage is an OGC standard that has grown substantially in popularity within the geospatial community. The GeoPackage Encoding Standard was developed to provide an open, standards-based format for transferring geospatial information that is platform-independent, portable, self-describing, and compact.

The goal of this testbed is to advance the discoverability of the contents of a GeoPackage through the concept of metadata profiles, and to improve the efficiency of using large-scale vector datasets in GeoPackage.

gpTeaser

2.5.1. Problem Statement and Research Questions

The GeoPackage Encoding Standard has proven to be an effective "container" mechanism for bundling and sharing geospatial data in a variety of operational use cases. GeoPackage is an open, standards-based, platform-independent format for transferring geospatial information.

The work in this testbed focuses on improvements to GeoPackage through two use cases:

(1) Metadata Profiles

Discoverability of what is in a GeoPackage is important to enable a developer to quickly assess the type of data contained in a GeoPackage and to determine how it should be processed effectively. There is currently no agreement on the meaning and significance of metadata in GeoPackage or how that metadata should be used to serve any particular purpose. Manually opening a GeoPackage provides no way of recognising whether the file contains any particular type of metadata without inspecting the row entries in the GeoPackage tables. The OGC document ‘Proposed OGC GeoPackage Enhancements’ introduces the concept of Metadata Profiles for GeoPackage in two parts:

  • Creating a new extension that defines a new extension “scope” (i.e., the gpkg_extensions.scope column) of Metadata.

  • Creating an extension for each metadata profile that describes the meaning and significance of a particular type of metadata.

Metadata profiles will improve the functionality in GeoPackage and will be implemented for Vector, Raster and Imagery datasets.
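A minimal sketch of how such a profile registration could look, using reduced versions of the gpkg_extensions and gpkg_metadata tables. The profile URI and table layouts below are simplified assumptions; the normative definitions are in OGC 12-128r15:

```python
import sqlite3

# Reduced table layouts for illustration only; a real GeoPackage defines the
# full gpkg_extensions and gpkg_metadata schemas in OGC 12-128r15.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE gpkg_extensions (
    table_name TEXT, column_name TEXT,
    extension_name TEXT NOT NULL,
    definition TEXT NOT NULL,
    scope TEXT NOT NULL
);
CREATE TABLE gpkg_metadata (
    id INTEGER PRIMARY KEY,
    md_scope TEXT NOT NULL,
    md_standard_uri TEXT NOT NULL,
    mime_type TEXT NOT NULL,
    metadata TEXT NOT NULL
);
""")

# 1) Register an extension using the proposed new 'metadata' scope.
db.execute(
    "INSERT INTO gpkg_extensions VALUES (?, ?, ?, ?, ?)",
    ("gpkg_metadata", None, "sample_vector_profile",
     "https://example.com/profiles/vector", "metadata"),  # hypothetical profile
)

# 2) Store a profile-conformant metadata entry describing the vector contents.
db.execute(
    "INSERT INTO gpkg_metadata (md_scope, md_standard_uri, mime_type, metadata)"
    " VALUES (?, ?, ?, ?)",
    ("dataset", "https://example.com/profiles/vector", "application/json",
     '{"dataType": "vector", "provenance": "survey 2020"}'),
)

# A client can now discover the profiles without scanning feature tables:
profiles = [row[0] for row in db.execute(
    "SELECT definition FROM gpkg_extensions WHERE scope = 'metadata'")]
print(profiles)
```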

(2) Large Vector Datasets

Previous work has suggested that GeoPackage performance is poor when handling large vector datasets. This work is to examine and demonstrate improved efficiency and effectiveness of storing large vector datasets in a GeoPackage. This may be achieved through proposed extension(s) to GeoPackage, which may include recommending open-source tooling and software libraries for optimising SQLite, the underlying database that GeoPackage utilises. This approach may consider indexing, and Geohash could be considered to improve indexing functionality in GeoPackage.

“Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer, which encodes a geographic location into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.” (Wikipedia)
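For illustration, a compact Geohash encoder following the standard public-domain algorithm: alternating longitude/latitude bisection, packed five bits at a time into base-32 characters:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash base-32 alphabet

def geohash_encode(lat, lon, precision=9):
    """Encode a lat/lon pair as a Geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even = 0, 0, True  # even bits refine longitude, odd bits latitude
    result = []
    while len(result) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:               # five bits complete one base-32 character
            result.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(result)

# Canonical example from the Wikipedia article:
print(geohash_encode(57.64911, 10.40744, 11))  # u4pruydqqvj
```

Because nearby locations share Geohash prefixes, such strings can serve as a sortable spatial index column in SQLite, which is one way the indexing idea above could be explored.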

The diagram below shows GeoPackage interactions between the client and the server (GeoPackage Builder).

gpOverview
Figure 12. Overview of the GeoPackage task

2.5.2. Aim

Client and server implementations should demonstrate improved discoverability of content in GeoPackage through the implementation of Metadata Profiles and improved efficiency when distributing large vector datasets.

2.5.3. Previous Work

  • OGC 12-128r15, OGC® GeoPackage Encoding Standard Version 1.2.1

  • OGC 18-000, OGC® GeoPackage Related Tables Extension

  • OGC 19-047, OGC® Proposed OGC GeoPackage Enhancements

2.5.4. Scenario & Requirements

The following requirements shall be met:

  • Develop a client and server implementation that demonstrates the discoverability of the contents of a GeoPackage through Metadata Profiles. Metadata profiles should be developed for different types of data that are stored in a GeoPackage, such as raster, vector, image, and elevation data. Metadata Profiles should include the use, size, data type, and provenance of the GeoPackage.

  • Investigate potential causes of poor GeoPackage efficiency when used with large vector datasets to establish whether the problem is with the format or with the software tools being used. Develop a client and server implementation that demonstrates the improved efficiency and effectiveness of storing large vector datasets in a GeoPackage. Open-source tooling and libraries may be considered as an approach to achieve this. This work should document the approach and make recommendations for a GeoPackage extension or profile for large vector datasets. Portrayal of vector data should reference the Open Portrayal Framework work in Testbed-15.

The implementation should focus on client and server implementations. All results and lessons learned shall be captured in an Engineering Report.

2.5.5. Work Items & Deliverables

The following figure illustrates the work items and deliverables of this task.

gpDeliverables
Figure 13. Deliverables of the GeoPackage task

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions. Thread assignment and funding status are defined in section Deliverables Summary & Funding Status.

Engineering Reports

  • D010 GeoPackage Engineering Report - Engineering Report capturing all results and experiences from this task. It should also capture and make recommendations for OGC standardisation and extensions.

Components

  • D119 GeoPackage Client - Client implementation that supports the GeoPackage scenario documented above. The client can be implemented as a desktop application, mobile application, or browser based. Implementation is to be based upon Open source where possible and a build script for working implementation in Docker or VM environment should be provided.

  • D118 GeoPackage Server - Server implementation building GeoPackage with support for all requirements as defined above. Implementation is to be based upon open source where possible and a build script for a working implementation in a Docker or VM environment should be provided.

2.6. Data Centric Security

What is Data Centric Security?

Data Centric Security is an approach that emphasizes the security of the data itself rather than the security of networks, servers, or applications. The approach achieves this by:

  • Encrypting the data at all times unless it is being manipulated by an authorized entity.

  • Having a robust Attribute Based Access Control (ABAC) mechanism in place to provide attribution of all entities that wish to access the data.

  • A Policy Engine that determines whether or not access to a data element by an entity is permissible based upon policies stored within the Policy Engine.

This results in:

    • The integrity of the data being verifiable.

    • Entities that produce/edit the data are verifiable and authorized.

    • Security measures required by/applicable to the data element(s) are attached to the data and remain with the data element(s) while it is being stored or in transit.

    • Security measures related to confidentiality are greatly reduced within the systems through which the data is transferred and the services that the data passes through. Measures related to the delivery of the Integrity and Availability requirements of the systems being used are unchanged.

dcsTeaser

Why is Data Centric Security important?

The Data Centric Security approach delivers a number of benefits, such as:

  • Data may reside on a service that the author of the data does not control (i.e. Cloud Storage).

  • Data may transit through networks and be proxied by services that the author of the data does not control.

  • Only an authorized entity may access the data element(s), thus providing the data owner with confidence that the data is appropriately protected.

  • The use of Cryptographic techniques to bind the Metadata to the data provides a mechanism to verify that the data is authentic, has not been tampered with and is provided by the author they believe is providing the data.

2.6.1. Problem Statement and Research Questions

A fundamental requirement for Data Centric Security is that the data is always in an encrypted form until an authorized entity makes use of the data. An entity may be either a human or a system entity within the processing system.

As the data could pass through systems that don’t belong to the data consumer or producer, the data must remain encrypted throughout the geospatial environment. The geospatial environment includes all infrastructure that touches the geospatial data (services, networks, storage, clients, etc.). When looking at utilizing OGC standards such as OGC API - Features in a data centric security scenario, standards need to include ways to classify the security requirements around data access. This classification can exist as additional metadata fields. The requirement stems from the need to limit different entities to a different subset(s) of the data. Additional requirements include the need for representation of the source of the information as well as an assurance that the information has not been tampered with.

The Data Centric Security work in Testbed-15 illustrated that it is possible to support Data Centric Security within the OGC API family of standards. Testbed-15 explored three scenarios, where a security proxy intercepted requests from a client. Data was then filtered, encrypted, and signed in three different ways. All three scenarios are described in detail in the Testbed-15 Data Centric Security Engineering Report.

  1. In the first scenario, the security proxy forwards the request and modifies the response from a vanilla OGC API Features service to a STANAG 4774 and 4778 output format.

  2. In the second scenario, the security proxy additionally performs temporal and spatial filtering. To do so, the security proxy contains a geospatial policy of classified and unclassified areas.

  3. In the third scenario, the security proxy forwards the request and response from an OGC API Features service that understands the STANAG 4774 and 4778 output format (and has access to data in these formats). In this scenario, the OGC API Features service returns a feature collection with STANAG 4774/4778 encoded feature objects.

In the Testbed-15 Engineering Report, the following topics were noted for future work:

  • Testbed-15 used the STANAG 4774 and 4778 format, which is XML-based. Other encoding formats exist, and some applications, particularly commercial ones, may be less keen to support XML. The report proposed that the DCS solution should not be constrained to the use of XML and that an alternative, JSON-based container format should be investigated.

  • A key management scenario that stores the keys in a key management service and requires the client to fetch the key via a key identifier stored in the metadata should also be implemented.

2.6.2. Aim

Further develop a Data Centric Security implementation in the OGC API family of standards, including a Data Centric Security JSON implementation.

Demonstrate a JSON Data Centric Security implementation using STANAG 4774 and 4778. NATO STANAG 4774 defines the Confidentiality Metadata Label Syntax for Data Centric Security. NATO STANAG 4778 is the Metadata Binding Mechanism for Joint Coalition Information Sharing.
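As a purely illustrative sketch of what a JSON label-and-binding container might look like: the field names below are invented, the actual STANAG structures are XML, and defining a normative JSON encoding is precisely what this task is to investigate. An HMAC stands in for the cryptographic binding of label to data:

```python
import base64
import hashlib
import hmac
import json

# Invented field names; real deployments would obtain keys from the key
# management service and encrypt the payload, not merely base64-encode it.
SHARED_KEY = b"demo-key"

label = {"classification": "RESTRICTED", "policy": "DEMO"}          # 4774-like label
payload = b'{"type": "Feature", "geometry": null, "properties": {}}'

binding = {                                                          # 4778-like binding
    "label": label,
    "data": base64.b64encode(payload).decode(),
    # Bind label and data cryptographically so tampering is detectable.
    "mac": hmac.new(SHARED_KEY,
                    json.dumps(label, sort_keys=True).encode() + payload,
                    hashlib.sha256).hexdigest(),
}

def verify(container, key):
    """Recompute the MAC over label + data and compare in constant time."""
    data = base64.b64decode(container["data"])
    expect = hmac.new(key,
                      json.dumps(container["label"], sort_keys=True).encode() + data,
                      hashlib.sha256).hexdigest()
    return hmac.compare_digest(expect, container["mac"])

print(verify(binding, SHARED_KEY))  # True
```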

An additional implementation should demonstrate a key management scenario, where keys are stored in a key store.

2.6.3. Previous Work

  • OGC 19-016, OGC Testbed-15 Data Centric Security Engineering Report

  • STANAG 4774 – Metadata Confidentiality Label Syntax

  • STANAG 4778 – Metadata Binding Mechanism

2.6.4. Scenario & Requirements

The overall goal is to continue the work of Testbed-15 and implement in this Testbed several of the recommendations identified in the Engineering Report.

Develop a JSON client and server implementation based on OGC API - Features and the Data Centric Security work in OGC Testbed-15. This work will reference STANAG 4774 – Metadata Confidentiality Label Syntax and STANAG 4778 – Metadata Binding Mechanism and implement this in a JSON language encoding. The Engineering Report should address questions such as: What do we learn from this implementation and what are the implications to Data Centric Security?

Develop a client and a server implementation in a key management scenario that stores encryption keys in a key management service. The client fetches the key via a key identifier, which is stored in the metadata. Any key management approach proposed should provide for the federation of multiple systems and the ability to transfer encryption key material, if so required. These implementation(s) should evaluate IdAM and ABAC Policy Engine protocols in client and server Data Centric Security implementation(s). The Axiomatics Abbreviated Language for Authorization (ALFA) should be used as the policy engine protocol in any implementation(s) as well.
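The key management flow can be mocked as follows. The class and method names are invented, and the boolean clearance check stands in for a real ABAC/ALFA policy evaluation:

```python
import secrets

# Minimal mock: the key identifier travels in the metadata, while the key
# itself stays in the (here in-memory) key management service.
class KeyService:
    def __init__(self):
        self._keys = {}

    def create(self):
        kid = secrets.token_hex(8)
        self._keys[kid] = secrets.token_bytes(32)  # e.g. an AES-256 key
        return kid

    def fetch(self, kid, subject_cleared):
        # A real service would evaluate an ABAC policy (e.g. written in ALFA)
        # against the requesting subject's attributes.
        if not subject_cleared:
            raise PermissionError("policy denies key release")
        return self._keys[kid]

kms = KeyService()
kid = kms.create()
metadata = {"keyId": kid, "cipher": "AES-256-GCM"}  # key id only, never the key

key = kms.fetch(metadata["keyId"], subject_cleared=True)
print(len(key))  # 32
```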

Address the content negotiation issue where container content formats need to be specified in addition to the container formats themselves. The STANAG 4778 output format is a container format that contains encrypted portions of sensitive data and associated metadata. This poses an extra challenge for the OGC API family of standards. Given the nested structure, the API standards need a way to specify both the container encoding and the format of the data in the container. Once standards such as OGC API - Features support the documentation of containers and data and gain agreement from the implementing community, interoperability is possible. However, this may not be the only factor in interoperability. STANAG 4778 may not be an appropriate output format, especially when there may be a variety of different DCS formats in the future. One of the issues that different DCS formats may expose in the future is how to express a feature collection where items could be of different DCS formats. This could be caused by different authors contributing to the feature collection.
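One conceivable (not standardized) approach is a media-type parameter that names the inner format; the media types below are invented for illustration:

```python
# Hypothetical Accept header: the container format plus an 'inner' parameter
# naming the format of the data wrapped inside the container.
accept = 'application/dcs+json; inner="application/geo+json"'

def parse_accept(value):
    """Split a media type from its parameters (simplified, no RFC edge cases)."""
    media, _, params = value.partition(";")
    out = {"media": media.strip()}
    for p in params.split(";"):
        if "=" in p:
            k, v = p.split("=", 1)
            out[k.strip()] = v.strip().strip('"')
    return out

print(parse_accept(accept))
# {'media': 'application/dcs+json', 'inner': 'application/geo+json'}
```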

The OGC API – Records implementation allows geospatial data to be discovered. In a Data Centric Security implementation this is challenging. Work in this testbed is to explore how Geospatial data can be discovered in a Data Centric Security implementation.

In summary, OGC API family implementations shall be developed, which use a Data Centric Security approach based upon features and metadata:

  • JSON Implementation of STANAG 4774 and 4778 should focus on client and server

  • Key management scenario that stores encryption keys in a key management service

  • Resolve the content negotiation issue with nested formats

  • Explore how Geospatial data can be discovered in a Data Centric Security implementation.

  • Consider the implication of Data Centric Security in feature and metadata implementation as an increased data burden on the network.

  • Deliver OGC API implementation code

  • Implementation is to be based upon Open source where possible

  • Provide a build script for a working implementation in a Docker or Virtual Machine (VM) environment

  • Capture in an Engineering Report the implementation and make recommendations for OGC standardization

Desktop/client/server and cell-phone implementation scenarios

The various features shall be explored in two different scenarios. The first scenario assumes reliable and high speed internet access. The second scenario addresses cell-phone based data centric security.

Desktop/client/server scenario

The desktop/client/server scenario does not put any specific constraints on the implementation setup. The DCS server can be provided as a single or distributed physical instance serving data and providing key management support. Connectivity between servers and clients can always be assumed.

Cell-phone scenario

A Sergeant in the U.S. National Guard has been deployed on a disaster recovery mission. He carries with him a smart phone which contains sensitive data. When meeting with first responders, how does he share critical information with them without compromising sensitive information? How does internet connectivity affect that scenario?

Hypothesis: Use of the Data Centric Security techniques developed in Testbed-16 could address this problem. All sensitive data is encapsulated in a Data Centric Security package. Security policies are defined using GeoXACML. A Policy Enforcement Point (PEP) applet only allows access to data permitted under the currently active security policy. Authorized users can set the active security policy.

End State: The Sergeant selects the security policy appropriate for the intended audience. He can now access data on his smart phone without worrying about exposing sensitive information.

Task: Validate or invalidate this hypothesis. Demonstrate what is possible.

2.6.5. Work Items & Deliverables

The following figure illustrates the work items and deliverables of this task.

dcsDeliverables
Figure 14. Data Centric Security task work items and deliverables

The following list identifies all deliverables that are part of this task. Detailed requirements are stated above. All participants are required to participate in all technical discussions. Thread assignment and funding status are defined in section Deliverables Summary & Funding Status.

Engineering Reports

  • D011 Data Centric Security Engineering Report - Engineering Report capturing all results and experiences from this task, including the JSON DCS specification. It should also make recommendations for OGC standardization and extensions.

Components

  • D121 Data Centric Security Client - Client implementation that supports the Data Centric Security desktop/client/server scenario documented above. Ideally, implementation is based upon open source where possible and build script for working implementation in Docker or VM environment should be delivered.

  • D120 Data Centric Security Server - Server implementation with support for the data centric security desktop/client/server scenario. The server shall support key management. Implementation is to be based upon open source where possible and a build script for a working implementation in a Docker or VM environment should be delivered.

  • D143 PEP on Cellphone - PEP implementation on cell-phone that supports cellphone scenario documented above.

  • D144 PEP on Cellphone - Similar to D143

  • D145 Key Management Server - Implementation of the key management server as required in both scenarios. Ideally, implementation is based upon open source where possible and build script for working implementation in Docker or VM environment should be delivered.

  • D146 Key Management Server - Similar to D145

  • D147 DCS App - Implementation of cellphone application to explore cellphone scenario as described above.

  • D148 DCS App - Similar to D147

2.7. Discrete Global Grid System (DGGS)

A Discrete Global Grid System (DGGS) represents a spherical partitioning of the Earth’s surface into a grid of cells (Wikipedia). The OGC maintains an Abstract Specification (OGC 15-104r5) that captures the foundational concepts for DGGS. This Testbed task aims to begin the process of moving towards an OGC Implementation Standard for DGGS through the creation of an open-source-based DGGS reference implementation. Testbed-16 represents the initial effort of what is expected to be a multi-initiative process.

dggsTeaser

Discrete Global Grid Systems (DGGS) offer a new way for geospatial information to be stored, visualized, and analyzed. Based on a partitioning of the Earth’s surface into a spherical grid, DGGS allows geospatial information to be represented in a way that more intuitively reflects relationships between data and the Earth’s surface. With DGGS, providers and consumers of geospatial information can eliminate many of the uncertainties and distortions inherently present with traditional coordinate systems. To fully realize the benefits of DGGS, standard-compliant implementations are required to allow cell-id management across DGGS with varying structure and alignment.

2.7.1. Problem Statement and Research Questions

DGGS presents an opportunity for the geospatial community to implement a representation of Earth that is vastly different from traditional coordinate system approaches. DGGS has the potential to enable storage, analysis and visualization of geospatial information in a way that more accurately reflects the relationship between data and the Earth. While the OGC abstract specification captures fundamental DGGS concepts, there is a need to more concretely demonstrate DGGS to drive its adoption. Testbed-16 shall contribute to this advancement through development of a DGGS reference implementation.

Key questions for this work include the following:

  • What DGGS structure would be best for developing a reference implementation (e.g. Uber’s Hexagonal Hierarchical Spatial Index or the Open Equal Area Global Grid (OpenEAGGR))?

  • What is a simple application that could be used to demonstrate the value of the reference implementation?

  • What should be considered for future work oriented towards operational implementation of DGGS?

It is expected that results from this task will form the basis for future initiatives to fully enable DGGS through an OGC Implementation Standard.

2.7.2. Aim

This task aims to get server-side DGGS implementation work started that supports a DGGS API. The API shall support two core functions, i.e. geographic location to cell-ID and cell-ID to geographic location, and optionally cell-ID to cell-ID conversion to support multiple DGGSs. The API shall be in line with the OGC API family of standards.
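To make the two core functions concrete, here is a toy equal-angle grid. This is deliberately not a real DGGS (it lacks the equal-area property), and the cell-ID format is invented; it only illustrates location-to-cell-ID and the reverse:

```python
# Toy equal-angle grid for illustration only: at refinement level L the globe
# is split into n rows by 2n columns of (180/n)-degree cells, n = 2**L.
def location_to_cell(lat, lon, level):
    n = 2 ** level
    size = 180.0 / n
    row = min(int((lat + 90.0) / size), n - 1)       # clamp lat = 90 edge
    col = min(int((lon + 180.0) / size), 2 * n - 1)  # clamp lon = 180 edge
    return f"L{level}-{row}-{col}"                   # invented cell-ID format

def cell_to_location(cell_id):
    level, row, col = (int(p) for p in cell_id.lstrip("L").split("-"))
    size = 180.0 / (2 ** level)
    # Return the cell centre as the representative geographic location.
    return (-90.0 + (row + 0.5) * size, -180.0 + (col + 0.5) * size)

cid = location_to_cell(52.5, 13.4, 4)   # a point near Berlin, level 4
print(cid)                              # L4-12-17
print(cell_to_location(cid))
```

A real reference implementation would wrap functions like these behind OpenAPI-described resources, with cell-ID to cell-ID conversion bridging between different grid structures.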

2.7.3. Previous Work

2.7.4. Scenario & Requirements

The goal is to develop a reference DGGS implementation with a demonstrable use-case to highlight the potential and value of DGGS. The following requirements shall be met:

  • The server-side implementation must be fully open source.

  • The server-side implementation shall include a library that encapsulates the actual DGGS functionality. That library should be usable for existing tools and services and shall support the geographic location to cell-ID(s) and reverse conversion.

  • All server-side development must be completed using open source tools, with all outputs made available in a public, freely accessible format.

  • All aspects of the implementation (e.g. underlying code) must be made available through an open license. An example is the Government of Canada’s Open Government License. Other licenses will be considered by the sponsor if they contain similar characteristics.

  • The client-side demonstration application shall support features that highlight DGGS aspects and demonstrate the advantages a DGGS provides to consumers who are not geospatial experts. The client shall be available as a browser-based solution. Open source is preferred for the client. The client shall visualize the DGGS at various zoom levels and interact with DGGS-enabled data services, i.e. OGC API endpoints that understand cell-IDs as spatial filters. Though appreciated, a globe-like visualization is not required. Simple (i.e. two-dimensional) visualizations that demonstrate the capabilities of DGGS are welcome.

The following figure illustrates possible implementation scenarios.