
Streaming Data Platform 1.5 Developer's Guide

Version 1.5

December 2022 Rev. 1.0

Notes, cautions, and warnings

NOTE: A NOTE indicates important information that helps you make better use of your product.

CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid

the problem.

WARNING: A WARNING indicates a potential for property damage, personal injury, or death.

2022 Dell Inc. or its subsidiaries. All rights reserved. Dell Technologies, Dell, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

Contents

Figures .......... 8

Tables .......... 9

Chapter 1: Introduction .......... 10
    Product summary .......... 10
    Product components .......... 11
        About Pravega .......... 11
        About analytic engines and Pravega connectors .......... 12
        Management plane .......... 13
    Deployment options .......... 13
    Component deployment matrix .......... 14
    Architecture and supporting infrastructure .......... 15
    Product highlights .......... 17
    More features .......... 19
    Basic terminology .......... 21
    Interfaces .......... 21
        User Interface (UI) .......... 22
        Grafana dashboards .......... 23
        Apache Flink Web UI .......... 23
        Apache Spark Web UI .......... 24
        JupyterHub overview .......... 25
        APIs .......... 26
    What you get with SDP .......... 26
    Use case examples .......... 26
    Documentation resources .......... 27

Chapter 2: Using the Streaming Data Platform (SDP) .......... 29
    About installation and administration .......... 29
    User access roles .......... 29
    About project isolation .......... 30
    Log on as a new user .......... 30
    Change your password or other profile attributes .......... 32
    Overview of the project-member user interface .......... 32
    Analytics projects and Pravega scopes views .......... 33
        View projects and the project dashboard .......... 33
        Pravega scopes view .......... 34
        Pravega streams view .......... 34
    Create analytics projects and Flink clusters .......... 35
        About analytics projects .......... 35
        About Flink clusters .......... 36
        Components of a Flink setup .......... 36
        About Flink jobs .......... 36
        Create Flink clusters .......... 37
        Change Flink cluster attributes .......... 39


        Delete a Flink cluster .......... 39
        GPU-accelerated workload support for Flink .......... 39
    About stream processing analytics engines and Pravega connectors .......... 43
        About the Apache Flink integration .......... 44
        About the Apache Spark integration .......... 44

Chapter 3: Working with Pravega Streams .......... 45
    About Pravega streams .......... 45
    About stateful stream processing .......... 46
    About event-driven applications .......... 46
    About exactly once semantics .......... 46
        Pravega exactly once semantics .......... 47
        Apache Flink exactly once semantics .......... 47
    Understanding checkpoints and savepoints .......... 48
    About Pravega scopes .......... 48
        Accessing Pravega scopes and streams .......... 48
        About cross project scope sharing .......... 48
    About Pravega connectors and streams .......... 50
    Create data streams .......... 51
        Create a stream from an existing Pravega scope .......... 51
        Stream access rules .......... 51
    Stream configuration attributes .......... 52
    List streams (scope view) .......... 53
    View streams .......... 54
    Edit streams .......... 55
    Delete streams .......... 56
    Start and stop stream ingestion .......... 56
    Monitor stream ingestion .......... 57
    About Grafana dashboards to monitor Pravega .......... 57

Chapter 4: Working with Apache Flink Applications .......... 58
    About hosting artifacts in Maven repositories .......... 58
        About artifacts for Flink projects .......... 58
        About Maven repositories .......... 59
        Upload Flink application artifacts .......... 61
        Deploy new Flink applications .......... 62
        View and delete artifacts .......... 65
    About Apache Flink applications .......... 65
    Application lifecycle and scheduling .......... 65
        Apache Flink life cycle .......... 65
    View status and delete applications .......... 67
        View and edit Flink applications .......... 67
        View application properties .......... 67
        View application savepoints .......... 67
        View application events .......... 68
        View application deployment logs .......... 69
        View deployed applications in the Apache Flink UI .......... 69

Chapter 5: Working with Apache Spark Applications .......... 70


    About Apache Spark applications .......... 70
    Apache Spark terminology .......... 71
    Apache Spark life cycle .......... 71
    About Spark clusters .......... 72
    Create a new Spark application .......... 72
    Create a new artifact .......... 74
    View status of Spark applications .......... 76
    View the properties of Spark applications .......... 76
    View the history of a Spark application .......... 77
    View the events for Spark applications .......... 77
    View the logs for Spark applications .......... 77
    Deploying Spark applications .......... 77
        Uploading common artifacts to your analytics project .......... 78
        Deploying Python applications using the SDP UI .......... 79
        Deploying PySpark applications using Kubectl .......... 79
        Deploying Java or Scala applications using the SDP UI .......... 81
    GPU-accelerated workload support for Spark .......... 82
    Troubleshooting Spark applications .......... 86

Chapter 6: Working with JupyterHub .......... 87
    JupyterHub overview .......... 88
    JupyterHub installation .......... 89
    Starting a Jupyter notebook server .......... 89
    Stopping a Jupyter Notebook Server .......... 92
    Persistent file storage .......... 92
    Pravega Python client demo notebook .......... 92
    Spark installation demo notebook .......... 93

Chapter 7: Working with Analytic Metrics .......... 94
    About analytic metrics for projects .......... 94
    Enable analytic metrics for a project .......... 95
    View analytic metrics .......... 95
    Add custom metrics to an application .......... 96
    Analytic metrics predefined dashboard examples .......... 97

Chapter 8: Working with Video .......... 100
    Overview of video features .......... 100
    Video streaming terminology .......... 100
    Camera recorder pipelines .......... 101
        How video is stored in Pravega .......... 102
        Configure SDP to record from cameras .......... 102
        Create a camera recorder pipeline .......... 102
        Control camera recorder pipelines .......... 106
        Monitor camera recorder pipelines .......... 107
        Automate recording from many cameras .......... 109
        RTSP camera simulator .......... 110
        Troubleshooting camera recorder pipelines .......... 110
    GStreamer pipelines .......... 112
        Start the Interactive Shell with GStreamer .......... 113


        Use the Interactive Shell with GStreamer tools .......... 114
        Run a continuous GStreamer pipeline .......... 116
        GStreamer plugins for Pravega elements .......... 118
        Create custom GStreamer applications .......... 127
        Connect to a remote SDP server .......... 127
    Object Detection with NVIDIA DeepStream .......... 128
        Build and run a DeepStream development pod .......... 130
        Build and run object detection in a DeepStream development pod .......... 131
    SDP Video Architecture .......... 131
        Video compression and encoding .......... 131
        MP4 media container format .......... 131
        Stream truncation and retention .......... 131
        Seeking in a video stream .......... 132
        Changing video streaming parameters .......... 132
        Identifying Pravega video streams .......... 132
        Timestamps .......... 132
        Storing and retrieving video in Pravega .......... 133
        Pravega video server .......... 135
        APIs to retrieve video data from Pravega .......... 135

Chapter 9: Working with Pravega Schema Registry .......... 136
    About Pravega schema registry .......... 136
    Create schema group .......... 141
    List and delete schemas .......... 142
    Add schema to streams .......... 143
    Edit schema group and add codecs .......... 144

Chapter 10: Working with Pravega Search (PSearch) .......... 147
    Overview .......... 147
        Two query types .......... 148
        Additional Pravega Search functionality .......... 148
        Limitations .......... 149
    Using Pravega Search .......... 149
        Create Pravega Search cluster .......... 150
        Make a searchable stream .......... 151
        Continuous queries .......... 154
        Search queries .......... 160
    Health and maintenance .......... 163
        Monitor Pravega Search cluster metrics .......... 163
        Delete index documents .......... 163
        Scale the Pravega Search cluster .......... 163
    Pravega Search REST API .......... 168
        REST head .......... 168
        Authentication .......... 169
        Resources .......... 170
        Methods documentation .......... 171

Chapter 11: Troubleshooting Application Deployments .......... 185
    Apache Flink application failures .......... 185


        Review the full deployment log .......... 187
    Application logging .......... 189
        Review application logs .......... 189
    Application issues related to TLS .......... 191
    Apache Spark Troubleshooting .......... 192

Chapter 12: SDP Code Hub Applications and Sample Code .......... 193
    Develop an example video processing application .......... 193

Chapter 13: References .......... 195
    Keycloak and Kubernetes user roles .......... 195
    Additional resources .......... 196
    Where to go for support .......... 196
    Provide feedback about this document .......... 197


Figures

1  SDP Core architecture .......... 16
2  SDP Edge architecture .......... 17
3  Initial administrator UI after login .......... 22
4  Apache Flink Web UI .......... 24
5  Apache Spark Web UI .......... 24
6  Jupyter Notebook .......... 25
7  Project Dashboard .......... 33
8  Flink connector reader app example code: Ingestion and reader apps shared between different projects .......... 49
9  Example in SDP: Ingestion and reader apps shared between different projects .......... 49
10 Upload artifact .......... 62
11 Create new Spark app .......... 72
12 Spark app form .......... 73
13 Create Maven artifact .......... 75
14 Maven artifacts .......... 75
15 Create application file artifact .......... 75
16 Application file artifacts .......... 76
17 Jupyter Notebook .......... 88
18 JupyterHub installation .......... 89
19 Ingress link .......... 90
20 JupyterHub server options .......... 90
21 Jupyter Notebook interface .......... 91
22 Python 3 notebook .......... 91
23 Stopping a Jupyter Notebook Server .......... 92
24 Sample Spark notebook .......... 93
25 Spark Jobs .......... 98
26 Location of Search Cluster tab in the UI .......... 150
27 Sample application for video processing with Pravega .......... 194

Tables

1  Component deployment matrix for SDP Core and SDP Edge .......... 14
2  SDP Micro sizes .......... 15
3  Interfaces in SDP .......... 21
4  SDP documentation set .......... 27
5  Application state management .......... 68
6  Pravega Search procedures summary .......... 149
7  Pravega Search cluster services .......... 163

Chapter 1: Introduction

This Developer's Guide is intended for application developers and data analysts with experience developing streaming analytics applications. It provides guidance on how to use the Dell Technologies Streaming Data Platform and Pravega streaming storage.

Topics:

● Product summary
● Product components
● Deployment options
● Component deployment matrix
● Architecture and supporting infrastructure
● Product highlights
● More features
● Basic terminology
● Interfaces
● What you get with SDP
● Use case examples
● Documentation resources

Product summary

Dell Technologies Streaming Data Platform (SDP) is an autoscaling software platform for ingesting, storing, and processing continuously streaming unbounded data. The platform can process both real-time and collected historical data in the same application.

SDP ingests and stores streaming data from sources such as Internet of Things (IoT) devices, web logs, industrial automation, financial data feeds, live video, social media feeds, and applications. It also ingests and stores event-based streams. It can process multiple data streams from multiple sources while ensuring low latencies and high availability.

The platform manages stream ingestion and storage and hosts the analytic applications that process the streams. It dynamically distributes data processing and analytical jobs over the available infrastructure, and it dynamically autoscales storage resources in real time as the streaming workload changes.

SDP supports the concept of projects and project isolation or multi-tenancy. Multiple teams of developers and analysts all use the same platform, but each team has its own working environment. The applications and streams that belong to a team are protected from write access by other users outside of the team. Cross-team stream data sharing is supported in read-only mode.

SDP integrates the following capabilities into one software platform:

● Stream ingestion: The platform is an autoscaling ingestion engine. It ingests all types of streaming data, including unbounded byte streams and event-based data, in real time.

● Stream storage: Elastic tiered storage provides instant access to real-time data, access to historical data, and near-infinite storage.

● Stream processing: Real-time stream processing is possible with an embedded analytics engine. Your stream processing applications can perform functions such as:
    ○ Process real-time and historical data.
    ○ Process a combination of real-time and historical data in the same stream.
    ○ Create and store new streams.
    ○ Send notifications to enterprise alerting tools.
    ○ Send output to third-party visualization tools.

● Platform management: Integrated management provides data security, configuration, access control, resource management, an easy upgrade process, stream metrics collection, and health and monitoring features.


● Run-time management: A web-based User Interface (UI) allows authorized users to configure stream properties, view stream metrics, run applications, view job status, and monitor system health.

● Application development: The product distribution includes APIs. The web UI supports application deployment and artifact storage.

Product components

SDP is a software-only platform consisting of integrated components, supporting APIs, and Kubernetes Custom Resource Definitions (CRDs). This product runs in a Kubernetes environment.

Pravega: Pravega is the stream store in SDP. It handles ingestion and storage for continuously streaming unbounded byte streams. Pravega is an Open Source Software project, which is sponsored by Dell Technologies.

Unified Analytics: SDP includes the following embedded analytic engines for processing your ingested stream data.

Apache Flink: Apache Flink is an embedded stream processing engine in SDP. Dell Technologies distributes Docker images from the Apache Flink Open Source Software project. SDP ships with images for Apache Flink. It also supports custom Flink images.

Apache Spark: Apache Spark is a unified analytics engine for large-scale data processing. SDP ships with images for Apache Spark.

Pravega Search: Pravega Search provides query features on the data in Pravega streams. It supports filtering and tagging incoming data as it is ingested, as well as searching stored data.

GStreamer: GStreamer is a pipeline-based multimedia framework that links together video processing elements. The GStreamer Plugin for Pravega is open-source software used to capture video, perform video compression, read and write Pravega stream data, and enable NVIDIA DeepStream inference for object detection.

JupyterHub: JupyterHub provides an interactive web-based environment for data scientists and data engineers to write and immediately run Python code.

For supported analytic engine image versions for this SDP release, see Component deployment matrix on page 14.

Management platform: The management platform is Dell Technologies proprietary software. It integrates the other components and adds security, performance, configuration, and monitoring features.

User interface: The management plane provides a comprehensive web-based user interface for administrators and application developers.

Metrics stacks: SDP deploys InfluxDB databases and Grafana instances for metrics visualization. Separate stacks are deployed for Pravega and for each analytic project.

Pravega schema registry: The schema registry provides a serving and management layer for storing and retrieving schemas for Pravega streams.

APIs: Various APIs are available in the SDP distributions. APIs for Spark, Flink, Pravega, Pravega Search, and Pravega Schema Registry are bundled in this SDP release.

About Pravega

The Open Source Pravega project was created specifically to support streaming applications that handle large amounts of continuously arriving data.

In Pravega, the stream is a core primitive. Pravega ingests unbounded streaming data in real time and coordinates permanent storage.

Pravega user applications are known as Writers and Readers. Pravega Writers are applications using the Pravega API to ingest collected streaming data from several data sources into SDP. The platform ingests and stores the streams. Pravega Readers read data from the Pravega store.
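As a simple illustration, the following Java sketch shows what a minimal Pravega Writer can look like when built directly on the open-source Pravega client API. The controller URI, scope name, and stream name are placeholder assumptions, and the security configuration (for example, Keycloak credentials and TLS settings) that an SDP deployment typically requires is omitted for brevity.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;
import java.time.Duration;

public class ExampleSensorWriter {
    public static void main(String[] args) {
        // Placeholder values: replace with the controller URI, scope, and stream for your deployment.
        URI controller = URI.create("tcp://pravega-controller.example.com:9090");
        String scope = "examples";
        String stream = "sensor-data";

        // Create the scope and stream if they do not exist; a stream can carry scaling and retention policies.
        try (StreamManager streamManager = StreamManager.create(controller)) {
            streamManager.createScope(scope);
            streamManager.createStream(scope, stream, StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1))
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build());
        }

        // Write one event; events with the same routing key are kept in order.
        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     stream, new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            writer.writeEvent("sensor-42", "{\"temperature\": 21.5}").join();
        }
    }
}

In SDP, a Pravega scope is normally associated with an analytics project and is created through the platform, so in practice an application often needs only the stream-creation and writing portions of this sketch.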


Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage. Pravega handles all types of streams, including:

● Unbounded or bounded streams of data
● Streams of discrete events or a continuous stream of bytes
● Sensor data, server logs, video streams, or any other type of information

Pravega seamlessly coordinates a two-tiered storage system for each stream. Apache BookKeeper (called Tier 1) temporarily stores the recently ingested tail of a stream. Long-term storage (sometimes called Tier 2) occurs in a configured alternate location. You can configure streams with specific data retention periods.

An application, such as a Java program reading from an IoT sensor, writes data to the tail of the stream. Apache Flink applications can read from any point in the stream. Multiple applications can read and write the same stream in parallel. Some of the important design features in Pravega are:

● Elasticity, scalability, and support for large volumes of streaming data
● Preserved ordering and exactly-once semantics
● Data retention based on time or size
● Durability
● Transaction support

Applications can access data in real time or past time in a uniform fashion. The same paradigm (the same API call) accesses both real-time and historical data in Pravega. Applications can also wait for data that is associated with any arbitrary time in the future.
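To illustrate this uniform access model, the following hedged Java sketch uses the open-source Pravega client API to create a reader group that starts at the head of the stream; the same read loop therefore first replays stored historical events and then continues with newly arriving ones. The endpoint and names are the same placeholder assumptions used in the Writer sketch above, and security configuration is again omitted.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class ExampleSensorReader {
    public static void main(String[] args) {
        URI controller = URI.create("tcp://pravega-controller.example.com:9090"); // placeholder
        String scope = "examples";
        String stream = "sensor-data";
        String readerGroup = "sensor-readers";

        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();

        // A reader group tracks position in the stream; by default it starts at the head,
        // so historical and real-time data are read through the same API call.
        try (ReaderGroupManager rgManager = ReaderGroupManager.withScope(scope, clientConfig)) {
            rgManager.createReaderGroup(readerGroup, ReaderGroupConfig.builder()
                    .stream(Stream.of(scope, stream))
                    .build());
        }

        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", readerGroup, new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> event;
            // readNextEvent blocks for up to the given timeout (in milliseconds);
            // a null event means no data arrived within that window.
            while ((event = reader.readNextEvent(2000)).getEvent() != null) {
                System.out.println("Read: " + event.getEvent());
            }
        }
    }
}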

Specialized software connectors provide access to Pravega. For example, a Flink connector provides Pravega data to Flink jobs. Because Pravega is an Open Source project, it can potentially connect to any analytics engine with community-contributed connectors.

Pravega is unique in its ability to handle unbounded streaming bytes. It is a high-throughput, autoscaling real-time store that preserves key-based ordering of continuously streaming data and guarantees exactly-once semantics. It continuously tiers ingested data into long-term storage, providing effectively unlimited retention.

For more information about Pravega, see http://www.pravega.io.

About analytic engines and Pravega connectors

SDP includes analytic engines and connectors that enable access to Pravega streams.

Analytic engines run applications that analyze, consolidate, or otherwise process the ingested data.

Apache Flink: Apache Flink is a high-throughput, stateful analytics engine with precise control of time and state. It is an emerging market leader for processing stream data. Apache Flink provides a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It performs computations at in-memory speed and at any scale, and it guarantees that the order of data is preserved during processing.

The Flink engine accommodates many types of stream processing models, including:

● Continuous data pipelines for real-time analysis of unbounded streams
● Batch processing
● Publisher or subscriber pipelines

The SDP distribution includes Apache Flink APIs that can process continuous streaming data, sets of historical batch data, or combinations of both.
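As an illustration only, the following Java sketch shows how a Flink job might consume a Pravega stream through the open-source Pravega Flink connector (FlinkPravegaReader). The controller URI, scope, and stream names are placeholder assumptions; in an SDP analytics project, connection and credential details are normally supplied by the platform rather than hard-coded as shown here.

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.net.URI;

public class ExamplePravegaFlinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder connection settings for a Pravega controller, scope, and stream.
        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .withDefaultScope("examples");

        // The connector exposes a Pravega stream as a Flink source.
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-data")
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        DataStream<String> events = env.addSource(source, "pravega-source");

        // A trivial transformation for illustration; real jobs apply windows, state, and sinks.
        events.filter(line -> line.contains("temperature"))
              .print();

        env.execute("example-pravega-flink-job");
    }
}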

For more information about Apache Flink, see https://flink.apache.org/.

Apache Spark: Apache Spark provides a dataflow engine in which the user expresses the required flow using transformations and actions. Data is handled through a Resilient Distributed Dataset (RDD), an immutable, partitioned dataset that transformations and actions operate on. Application dataflow graphs are broken down into stages, and each stage creates an RDD. The RDD is not a materialized view on disk but an in-memory representation of the data held within the Spark cluster that later stages can process.

The SDP distribution includes Apache Spark APIs that can process streaming data, sets of historical batch data, or combinations of both. Spark has two processing modes: batch processing and micro-batch streaming.
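The following short, self-contained Java sketch (not SDP-specific, and using local mode purely so it can run on its own) illustrates the transformation-and-action model described above: transformations such as map and filter lazily define new RDDs, and an action such as count triggers execution in stages.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class ExampleRddFlow {
    public static void main(String[] args) {
        // Local mode is used here only so the sketch is runnable on its own.
        SparkConf conf = new SparkConf().setAppName("example-rdd-flow").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            long evenSquares = numbers
                    .map(n -> n * n)          // transformation: lazily defines a new RDD
                    .filter(n -> n % 2 == 0)  // transformation: still no computation yet
                    .count();                 // action: triggers staged execution

            System.out.println("Even squares: " + evenSquares);
        }
    }
}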


For more information about Apache Spark, see https://spark.apache.org.

Other analytic engines and Pravega connectors

You may develop custom Pravega connectors to enable client applications to read real-time data from and write real-time data to Pravega streams. For more information, see the SDP Code Hub.

The Pravega Flink and Spark connectors enable data ingestion and consumption between Pravega and these analytics engines.

Management plane

The SDP management plane coordinates the interoperating functions of the other components.

The management plane deploys and manages components in the Kubernetes environment. It coordinates security, authentication, and authorization. It manages Pravega streams and the analytic applications in a single platform.

The web-based UI provides a common interface for all users. Developers can upload and update application images. All project members can manage streams and processing jobs. Administrators can manage resources and user access.

Some of the features of the management plane are:

● Integrated data security, including TLS encryption, multilevel authentication, and role-based access control (RBAC)
● Project-based isolation for team members and their respective streams and applications
● Possibility of sharing stream data in read-only mode across projects
● Flink cluster and application management
● Spark application management
● Pravega streams management
● Stream data schema management and evolution history with a Schema Registry
● DevOps-oriented platform for modern software development and delivery
● Integrated Kubernetes container environment
● Application monitoring and direct access to the Apache Flink or Apache Spark web-based UIs
● Direct access to predefined Grafana dashboards for Pravega
● Direct access to project-specific predefined Grafana dashboards showing operational metrics for Flink, Spark, and Pravega Search clusters

Deployment options

Streaming Data Platform supports options for deployment at the network edge and in the data center core.

SDP Core: SDP Core provides all the advantages of on-premises data collection, processing, and storage. SDP Core is intended for data center deployment with full-size servers. It handles larger data ingestion needs and also accepts data that is collected by SDP Edge and streamed up to the Core.

High availability (HA) is built into all deployments. Deployments start with a minimum of three nodes and can expand up to 12 nodes, with integrated scaling of added resources.

Recommended servers have substantially more resources than SDP Edge. Long-term storage is Dell Technologies PowerScale clusters or Dell Technologies ECS appliances. Multitenant use cases are typical. Other intended use cases build models across larger datasets and ingest large amounts of data.

SDP Edge: Deploying at the Edge, near gateways or sensors, has the advantage of local ingestion, transformation, and alerting. Data is processed, filtered, or enriched before transmission upstream to the Core.

SDP Edge is a small footprint deployment, requiring fewer minimum resources (CPU cores, RAM, and storage) than SDP Core. SDP Edge supports configurations of one or three nodes.

Single-node deployment is a low-cost option for development, proof of concept, or for production use cases where high availability (HA) is not required. Single-node deployment operates without any external long-term storage, using only node disks for storage.

Three-node deployments provide HA at the Edge for local data ingestion that cannot tolerate downtime. This deployment can use PowerScale for long-term storage.


SDP Edge Container Native Storage (CNS)

NOTE: CNS is not a generally supported feature in SDP; it is available only on an RPQ basis.

SDP supports Longhorn as the CNS solution.

CNS is a software-defined data storage solution that runs in containers in Kubernetes environments. Everything that is needed within an enterprise storage environment is isolated in the container, without external dependencies.

In a CNS environment, underlying hardware, such as HDDs and NVMe SSDs, is virtualized into a pool that provides persistent storage for applications running inside the containers.

Benefits of CNS:

Portability: When running in containers, storage units can be transported between data center environments.

Isolation: Because containers do not have external dependencies, they run applications with everything the workload needs, including storage volumes.

Container native storage enables stateful workloads to run within containers by providing persistent volumes.

Combined with Kubernetes primitives such as StatefulSets, it delivers the reliability and stability to run mission-critical workloads in production environments.

SDP Edge Starter Pack

SDP Edge Starter Pack is a new low-cost deployment option that is based on the SDP Edge deployment option. The option is limited to a maximum of 1 MB per second of throughput. SDP does not limit ingestion but alerts the user when cumulative ingestion exceeds 1 MB per second.

The minimum Starter Pack license option available is one core. The one-core SDP Edge Starter Pack license uses as many cores as are assigned to Kubernetes, but the throughput is limited to 1 MB per second.

This deployment option is designed for:

● Users getting started with SDP.
● Users who want to deploy large numbers of inexpensive HA deployments, with as many cores as needed but limited in capabilities; the throughput limit reduces usable functionality (for example, no real-time streaming video analysis).

If these systems need to be expanded later, you can upgrade the license to the full version of SDP Edge.

SDP Micro: SDP Micro is a lightweight version of SDP Edge. It is intended for low-volume data ingestion and movement, and it offers limited analytics capabilities. Here is a summary of SDP Micro deployment and functionality:

● SDP Micro is supported as a single-node deployment only, with a small number of cores.
● Long-term storage (LTS) is on local storage only.
● Kubespray is the only supported Kubernetes environment.
● For data handling and analytics, SDP Micro supports only Pravega and Flink. SDP Micro can perform basic data transformations with Flink.

SDP Micro can act as a low-cost introduction into the SDP environment. It has a smaller download package than SDP Edge.

Upgrade from SDP Micro to SDP Edge is not supported.

Component deployment matrix

This matrix shows the differences between SDP Core, SDP Edge, and SDP Micro deployments.

Table 1. Component deployment matrix for SDP Core and SDP Edge

Apache Flink
    SDP Core 1.5: Ships with versions 1.14.6 and 1.15.2
    SDP Edge 1.5: Ships with versions 1.14.6 and 1.15.2

Apache Spark
    SDP Core 1.5: Ships with versions 2.4.8 and 3.3.0
    SDP Edge 1.5: Ships with versions 2.4.8 and 3.3.0

Kubernetes Platform
    SDP Core 1.5: OpenShift 4.10.9
    SDP Edge 1.5: Kubespray 2.18.1
    SDP Micro 1.5: Kubespray

Number of nodes supported
    SDP Core 1.5: 3 to 12
    SDP Edge 1.5: 1 or 3
    SDP Micro 1.5: One node, bare metal or VM

Number of processor cores supported
    SDP Edge 1.5: SDP Edge: unlimited; SDP Edge Starter Pack: 1; SDP Edge Container Native Storage (CNS): 3
    SDP Micro 1.5: See SDP Micro sizes below

Maximum throughput
    SDP Core 1.5: Unlimited
    SDP Edge 1.5: SDP Edge: unlimited; SDP Edge Starter Pack: 1 MB per second of throughput
    SDP Micro 1.5: Unlimited

Container Runtime
    SDP Core 1.5: CRI-O version 1.19.0
    SDP Edge 1.5: Docker version 19.03

Operating System
    SDP Core 1.5: RHEL 8.6, CoreOS 4.10.3
    SDP Edge 1.5: Ubuntu 20.04
    SDP Micro 1.5: Ubuntu 20.04, RHEL 8.3 or higher

Long-term storage option: Dell Technologies PowerScale
    SDP Core 1.5: Gen5 or later hardware; OneFS 8.2.x or 9.x software with NFSv4.0 enabled
    SDP Edge 1.5: Gen5 or later hardware; OneFS 8.2.x or 9.x software with NFSv4.0 enabled
    SDP Micro 1.5: See SDP Micro sizes below

Long-term storage option: Dell Technologies ECS
    SDP Core 1.5: ECS object storage appliance with ECS 3.7 and later
    SDP Edge 1.5: Not supported
    SDP Micro 1.5: See SDP Micro sizes below

Embedded Service Enabler (ESE)
    SDP Core 1.5: ESE 2.4.10.0

Table 2. SDP Micro sizes

SDP Micro Small: 48 GB memory; 1 x 1 TB long-term storage (local disk or vSAN); 6 total virtual cores; 1 vCPU reserved for analytics
SDP Micro Medium: 64 GB memory; 2 x 1 TB long-term storage (local disk or vSAN); 12 total virtual cores; 2 vCPUs reserved for analytics
SDP Micro Large: 128 GB memory; 2 x 1 TB long-term storage (local disk or vSAN); 24 total virtual cores; 4 vCPUs reserved for analytics
SDP Micro X-Large: 192 GB memory; 2 x 1 TB long-term storage (local disk or vSAN); 36 total virtual cores; 12 vCPUs reserved for analytics

Architecture and supporting infrastructure

The supporting infrastructure for SDP supplies storage and compute resources and the network infrastructure.

SDP is a software-only solution. The customer obtains the components for the supporting infrastructure independently.

For each SDP version, the reference architecture includes specific products that are tested and verified. The reference architecture is an end-to-end enterprise solution for stream processing use cases. Your Dell Technologies sales representative can provide appropriate reference architecture solutions for your expected use cases.

A general description of the supporting infrastructure components follows.

Reference Hardware: SDP runs on bare metal servers. SDP Edge runs on Ubuntu. SDP Core runs on the Red Hat Enterprise Linux CoreOS operating system.

Network: A network is required for communication between the nodes in the SDP cluster and for external clients to access cluster applications.


Local storage Local storage is required for various system functions. The Dell Technologies support team helps size storage needs based on intended use cases. The storage capacity that is specified for a persistent volume is not enforced by SDP: if the backend has adequate capacity, it is possible to write beyond the specified volume size. Enforcing the capacity is up to the underlying storage backend.

Long-term storage

Long-term storage for stream data is required and is configured during installation. Long-term storage is any of the following:

SDP Core production solutions require an elastic scale-out storage solution. You may use either of the following for long-term storage:
  A file system on a Dell Technologies PowerScale cluster
  A bucket on the Dell Technologies ECS appliance
SDP Edge production solutions use a file system on a Dell Technologies PowerScale cluster.
For testing, development, or use cases where only temporary storage is needed, long-term storage may be defined as a file system on a local mount point.

Kubernetes environments supported by SDP

SDP runs in a Kubernetes container environment. The container environment isolates projects, efficiently manages resources, and provides authentication and RBAC services. The supported Kubernetes environments are as follows:
  SDP Edge runs in Kubespray.
  SDP Core runs in Red Hat OpenShift.

The following figures show the supporting infrastructure in context with SDP.

Figure 1. SDP Core architecture


Figure 2. SDP Edge architecture

Product highlights SDP includes the following major innovations and unique capabilities.

Enterprise-ready deployment

SDP is a cost effective, enterprise-ready product. This software platform, running on a recommended reference architecture, is a total solution for processing and storing streaming data. With SDP, an enterprise can avoid the complexities of researching, testing, and creating an appropriate infrastructure for processing and storing streaming data. The reference architecture consists of both hardware and software. The resulting infrastructure is scalable, secure, manageable, and verified. Dell Technologies defines the infrastructure and provides guidance in setting it up. In this way, SDP dramatically reduces time to value for an enterprise.

SDP provides integrated support for a robust and secure total solution, including fault tolerance, easy scalability, and replication for data availability.

With SDP, Dell Technologies provides the following deployment support:

Recommendations for the underlying hardware infrastructure
Sizing guidance for compute and storage to handle your intended use cases
End-to-end guidance for setting up the reference infrastructure, including switching and network configuration (trunks, VLANs, management and data IP routes, and load balancers)
Comprehensive image distribution, consisting of customized images for the operating system, supporting software, SDP software, and API distributions for developers
Integrated installation and configuration for underlying software components (Docker, Helm, Kubernetes) to ensure alignment with SDP requirements

The result is an ecosystem ready to ingest and store streams, and ready for your developers to code and upload applications that process those streams.

Unbounded byte stream ingestion, storage, and analytics

Pravega was designed from the outset to handle unbounded byte stream data.

In Pravega, the unbounded byte stream is a primitive structure. Pravega stores each stream (any type of incoming data) as a single persistent stream, from ingestion to long-term storage, like this:

Recent tail: The real-time tail of a stream exists on Tier 1 storage.
Long-term: The entire stream is stored on long-term storage (also called Tier 2 storage in Pravega).

Applications use the same API call to access real-time data (the recent tail on Tier 1 storage) and all historical data on long-term storage.
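
The following sketch, written with the Pravega Java client, illustrates that point. It is not taken from this guide: the controller URI, scope (examples), stream (sensor-data), and reader group name are placeholder values. A reader group that starts at the head of the stream first reads historical data from long-term storage and then continues with the real-time tail, using the same readNextEvent() call.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class TailAndHistoryReader {
    public static void main(String[] args) throws Exception {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090")) // placeholder endpoint
                .build();

        // A reader group that starts at the head of the stream reads all historical
        // data and then keeps reading the real-time tail with the same API.
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope("examples", clientConfig)) {
            rgm.createReaderGroup("demo-group",
                    ReaderGroupConfig.builder().stream("examples/sensor-data").build());
        }

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "demo-group",
                     new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            while (true) {
                EventRead<String> event = reader.readNextEvent(2000); // 2-second timeout
                if (event.getEvent() != null) {
                    System.out.println("read: " + event.getEvent());
                }
            }
        }
    }
}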

In Apache Flink or Spark applications, the basic building blocks are streams and transformations. Conceptually, a stream is a potentially never-ending flow of data records. A transformation is an operation that takes one or more streams as input and produces one or more output streams. In both applications, non-streaming data is treated internally as a stream.

By integrating these products, SDP creates a solution that is optimized for processing unbounded streaming bytes. The solution is similarly optimized for bounded streams and more traditional static data.
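
As a hedged illustration of streams and transformations, the following Flink sketch maps a stream of text readings to numbers and filters them. The input values and job name are invented for the example; a production job would read from a source such as the Pravega connector instead of a fixed collection.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ThresholdFilterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A small bounded collection keeps the sketch self-contained.
        DataStream<String> readings = env.fromElements("12.1", "47.9", "18.3");

        // A transformation takes one stream as input and produces another stream as output.
        DataStream<Double> alerts = readings
                .map(Double::parseDouble)   // parse each record
                .filter(v -> v > 40.0);     // keep only out-of-range readings

        alerts.print();
        env.execute("threshold-filter");
    }
}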

High throughput stream ingestion

Pravega enables the ingestion capacity of a stream to grow and shrink according to workload. During ingestion, Pravega splits a stream into partitions to handle a heavy traffic period, and then merges partitions when traffic is lighter. Splitting and merging occur automatically and continuously as needed. Throughout, Pravega preserves the order of data.
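
For example, automatic splitting and merging is enabled by creating the stream with an event-rate scaling policy. The sketch below uses the Pravega Java client; the controller URI, scope, stream name, and rate targets are assumptions for illustration.

import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import java.net.URI;

public class CreateScalingStream {
    public static void main(String[] args) {
        try (StreamManager streamManager =
                     StreamManager.create(URI.create("tcp://pravega-controller:9090"))) {
            streamManager.createScope("examples");

            // Scale segments up or down based on the observed event rate:
            // target of 100 events/second per segment, scale factor 2, minimum 3 segments.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 3))
                    .build();

            streamManager.createStream("examples", "sensor-data", config);
        }
    }
}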

Stream filtering on ingestion

PSearch continuous queries process data as it is ingested, providing a way to filter out unwanted data before it is stored, or to enrich the data with tagging before it is stored.

Stream search PSearch queries can search an entire stored stream of structured or unstructured data.

Exactly-once semantics

Pravega is designed with exactly-once semantics as a goal. Exactly-once semantics means that, in a given stream processing application, no event is skipped or duplicated during the computations.
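
In a Flink pipeline, exactly-once writes into Pravega are typically requested through the writer mode of the Pravega Flink connector together with Flink checkpointing. The following fragment is only a sketch under that assumption; the scope, stream, routing key, and checkpoint interval are placeholders, and the connector builder details can vary between connector versions.

// Sketch only: assumes the Pravega Flink connector (FlinkPravegaWriter) is on the classpath.
import io.pravega.connectors.flink.FlinkPravegaWriter;
import io.pravega.connectors.flink.PravegaConfig;
import io.pravega.connectors.flink.PravegaWriterMode;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoints drive the transactional commits

        DataStream<String> results = env.fromElements("result-A", "result-B");

        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withDefaultScope("examples"); // placeholder scope

        FlinkPravegaWriter<String> sink = FlinkPravegaWriter.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-results")                     // placeholder stream
                .withSerializationSchema(new SimpleStringSchema())
                .withEventRouter(event -> "batch-1")             // routing key for ordering
                .withWriterMode(PravegaWriterMode.EXACTLY_ONCE)  // transactional, exactly-once writes
                .build();

        results.addSink(sink);
        env.execute("exactly-once-sink");
    }
}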

Key-based guaranteed order

Pravega guarantees key-based ordering. Information in a stream is keyed in a general way (for example, by sensor or other application-provided key). SDP guarantees that values for the same key are stored and processed in order. The platform, however, is free to scale the storage and processing across keys without concern for ordering.

The ordering guarantee supports use cases that require order for accurate results, such as in financial transactions.
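
The routing key is supplied by the writer when each event is written. The following sketch uses the Pravega Java client; the controller URI, scope, stream, and routing key are placeholder values.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class OrderedWriter {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "sensor-data", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {

            // Events written with the same routing key ("sensor-42") are stored
            // and delivered to readers in the order in which they were written.
            writer.writeEvent("sensor-42", "temperature=71.3");
            writer.writeEvent("sensor-42", "temperature=71.9");
            writer.flush();
        }
    }
}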

Massive data volume

Pravega accommodates massive data ingestion. In the reference architecture, Dell Technologies hardware solutions support the data processing and data storage components of the platform. All the processing and storage reference hardware is easily scaled out by adding nodes.

Batch and publish/subscribe models supported

Pravega, Apache Spark, and Apache Flink support the more traditional batch and publish/subscribe pipeline models. Processing for these models includes all the advantages and guarantees that are described for the continuous stream models.

Pravega ingests and stores any type of stream, including:

Unbounded byte streams, such as data streamed from IoT devices
Bounded streams, such as movies and videos
Unbounded append-type log files
Event-based input, streaming or batched

In Apache Flink and Apache Spark, all input is a stream. Both process table-based input and batch input as a type of stream.

ACID-compliant transaction support

The Pravega Writer API supports Pravega transactions. The Writer can collect events, persist them, and decide later whether to commit them as a unit to a stream. When the transaction is committed, all data that was written to the transaction is atomically appended to the stream.

The Writer might be an Apache Flink or other application. As an example, an application might continuously process data and produce results, using a Pravega transaction to durably accumulate the results. At the end of a time window, the application might commit the transaction into the stream, making the results of the processing available for downstream processing. If an error occurs, the application cancels the transaction and the accumulated processing results disappear.
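
A minimal sketch of this pattern with the Pravega Java client follows. The stream name, routing key, and event payloads are invented for illustration.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Transaction;
import io.pravega.client.stream.TransactionalEventStreamWriter;
import io.pravega.client.stream.TxnFailedException;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class TransactionalResults {
    public static void main(String[] args) throws TxnFailedException {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             TransactionalEventStreamWriter<String> txnWriter =
                     factory.createTransactionalEventWriter("results",
                             new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            Transaction<String> txn = txnWriter.beginTxn();
            try {
                // Accumulate results durably; readers cannot see them yet.
                txn.writeEvent("window-1", "aggregate=42");
                txn.writeEvent("window-1", "aggregate=97");
                // Atomically append everything to the stream at the end of the window.
                txn.commit();
            } catch (Exception e) {
                // On failure, discard the accumulated results.
                txn.abort();
                throw e;
            }
        }
    }
}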

Developers can combine transactions and other features of Pravega to create a chain of Flink jobs. The Pravega-based sink for one job is the source for a downstream Flink job. In this way, an entire pipeline of Flink jobs can have end-to-end exactly-once semantics with guaranteed ordering of data processing.

In addition, applications can coordinate transactions across multiple streams. A Flink job can use two or more sinks to provide source input to downstream Flink jobs.

Pravega achieves ACID compliance as follows:

Atomicity and Consistency are achieved in the basic implementation. A transaction is a set of events that is collectively either added into a stream (committed) or discarded (aborted) as a batch.

Isolation is achieved because the transactional events are never visible to any readers until the transaction is committed into a stream.


Durability is achieved when an event is written into the transaction and acknowledged back to the writer. Transactions are implemented in the same way as stream segments. Data that is written to a transaction is as durable as data written directly to a stream.

Security Access to SDP and the data it processes is strictly controlled and integrated throughout all components.

Authentication is provided through both Keycloak and LDAP.
Kubernetes and Keycloak role-based access control (RBAC) protect resources throughout the platform.
TLS controls external access.
Within the platform, the concept of a project defines and isolates resources for a specific analytic purpose. Project membership controls access to those resources.

For information about these and other security features, see the Dell Technologies Streaming Data Platform Security Configuration Guide.

More features Here are additional important capabilities in SDP.

Fault tolerance The platform is fault tolerant in the following ways:

All components use persistent volumes to store data.
Kubernetes abstractions organize containers in a fault-tolerant way. Failed pods restart automatically, and deleted pods are re-created automatically.
Certain key components, such as Keycloak, are deployed in HA mode by default. In the Keycloak case, three Keycloak pods are deployed and clustered together to provide near-uninterrupted access even if a pod goes down.

Data retention and data purge

Pravega includes the following ways to purge data, per stream:

A manual trigger in an API call specifies a point in a stream; data older than that point is purged.
An automatic purge may be based on the size of the stream.
An automatic purge may be based on time.
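
As an illustration, the manual trigger corresponds to truncating a stream at a StreamCut, and automatic purging corresponds to a retention policy on the stream configuration. The sketch below uses the Pravega Java client; the scope and stream names are placeholders, and the StreamCut is assumed to have been captured earlier (for example, from a reader group).

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.StreamCut;
import java.net.URI;
import java.time.Duration;

public class PurgeExamples {
    // streamCut is assumed to have been captured earlier, for example from a reader group.
    static void purge(StreamCut streamCut) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();

        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            // Manual trigger: data older than the given StreamCut is purged.
            streamManager.truncateStream("examples", "sensor-data", streamCut);

            // Automatic purge by time: keep roughly the last 7 days of data.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(3))
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build();
            streamManager.updateStream("examples", "sensor-data", config);
        }
    }
}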

Historical data processing

Historical stream processing supports:

Reading the whole contents of a stream.
Reading a custom slice of a stream by defining start and/or end read boundaries (StreamCuts).
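
A hedged sketch of defining such a slice with the Pravega Java client follows; the scope, stream, and the startCut and endCut StreamCuts are placeholders that would normally be obtained elsewhere, for example from checkpoints or the batch API.

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import java.net.URI;

public class HistoricalSlice {
    // startCut and endCut are assumed to have been captured earlier.
    static void createSliceReaderGroup(StreamCut startCut, StreamCut endCut) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();

        // Readers in this group see only the slice between the two StreamCuts.
        ReaderGroupConfig slice = ReaderGroupConfig.builder()
                .stream(Stream.of("examples", "sensor-data"), startCut, endCut)
                .build();

        try (ReaderGroupManager rgm = ReaderGroupManager.withScope("examples", clientConfig)) {
            rgm.createReaderGroup("history-slice", slice);
        }
    }
}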

Apache Flink job management

Authorized users can monitor, start, stop, and restart Apache Flink jobs from the SDP UI. The Apache Flink savepoint feature permits a restarted job to continue processing a stream from where it left off, guaranteeing exactly-once semantics.

Apache Spark job management

Authorized users can monitor, start, stop, and restart Apache Spark jobs from the SDP UI.

Monitoring and reporting

From the SDP UI, administrators can monitor the state of all projects and streams. Other users (project members) can monitor their specific projects.

Dashboard views on SDP UI show recent Pravega ingestion metrics, read and write metrics on streams, and long-term storage metrics.

Heat maps of Pravega streams show segments as they are split and merged, to help with resource allocation decisions.

Stream metrics show throughput, reads and writes per stream, and transactional metrics such as commits and cancels.

Latencies at the segment store host level are available, aggregated over all segment stores.

The following additional UIs are linked from the SDP UI.

Project members can browse directly to the Flink Web UI that shows information about their jobs. The Apache Flink Web UI monitors Flink jobs as they are running.

Project members can browse directly to the Spark Web UI that shows information about their jobs. The Apache Spark Web UI monitors Spark jobs as they are running.


In SDP, there are predefined Grafana dashboards for Pravega. Administrators can view Pravega JVM statistics, and examine stream throughputs and latency metrics. To view Pravega Grafana UI, go to SDP UI > Pravega Metrics.

Administrators can browse Grafana dashboards in the Monitoring Grafana UI to see the Kubernetes cluster metrics for SDP components. To view the UI, use the Monitoring Metrics link in the SDP UI.

Project members can browse directly to the OpenSearch UI from a Pravega Search cluster page.

Logging Kubernetes logging is implemented in all SDP components.

Remote support ESE SupportAssist is supported for SDP.

Event reporting Services in SDP collect events and display them in the SDP UI. The UI offers search and filtering on the events, including a way to mark them as acknowledged. In addition, some critical events are forwarded to the Secure Remote Services or SCG Gateway.

In addition to Events, SDP reports Issues.

An Issue can be in one of several states. The state of an Issue is determined by the type of the most recent Event corresponding to that Issue.

Some Issues are cleared automatically after a predetermined time interval (no manual acknowledgment required).

Important Events and Issues are reported to the notifiers that were set up during installation (for example, streamingdata-snmp-notifier, streamingdata-supportassist-ese, or nautilus-notifier).

SDP Code Hub The SDP Code Hub is a centralized portal that helps application developers get started with SDP applications. Developers can browse and download example applications and code templates, download Pravega connectors, and view demos. Applications and templates from Dell Technologies teams include Pravega samples, Flink samples, Spark samples, and API templates. See the SDP Code Hub at https://streamingdataplatform.github.io/code-hub/.

Schema Registry Schema Registry provides a serving and management layer for storing and retrieving schemas for application metadata. A shared repository of schemas allows applications to flexibly interact with each other and store schemas for Pravega streams.

GStreamer support

GStreamer is a pipeline-based multimedia framework that links together video processing elements. The GStreamer Plugin for Pravega is open-source software that is used to capture video, perform video compression, and read and write Pravega stream data, and it enables NVIDIA DeepStream inference for object detection.

GPU support for the GStreamer framework

A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory that conventionally performs floating point operations that are required for rendering graphics. SDP takes advantage of GPUs for image and video processing, stream processing, and machine learning. GPU-accelerated workload support applies to both Apache Flink and Apache Spark applications, which enables data scientists to use machine learning on analytic workloads. During installation, the NVIDIA GPU operator updates the SDP node base operating system and the Kubernetes environment with the appropriate drivers and configurations for GPU access. The GPU Operator deploys Node Feature Discovery (NFD) to identify nodes that contain GPUs and installs the GPU driver on GPU-enabled nodes.

MQTT support MQ Telemetry Transport (MQTT) is a lightweight publish/subscribe messaging transport protocol that is used for connecting remote IoT devices and optimizing network bandwidth. MQTT makes it easy to encrypt messages using Transport Layer Security (TLS) and to authenticate clients using protocols such as OAuth. The Pravega MQTT broker, introduced in SDP 1.3 and later versions, can be used by MQTT clients to publish events to Pravega streams. MQTT subscribers are not supported in SDP 1.3. MQTT is intended for ingesting events into SDP with high throughput using Quality of Service (QoS) 0 (at-most-once) semantics, and it supports only TLS connections.
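
Because the broker speaks standard MQTT, any MQTT client library can publish to it. The following sketch uses the Eclipse Paho Java client; the broker endpoint, credentials, and topic are placeholders, and the authentication details depend on how your SDP installation is configured.

import java.nio.charset.StandardCharsets;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class MqttIngestSketch {
    public static void main(String[] args) throws Exception {
        // Only TLS connections are supported, so the URI uses the ssl scheme.
        MqttClient client = new MqttClient("ssl://sdp-mqtt.example.com:8883", "sensor-42");

        MqttConnectOptions options = new MqttConnectOptions();
        options.setUserName("device-user");                       // placeholder credentials
        options.setPassword("device-password".toCharArray());
        client.connect(options);

        MqttMessage message = new MqttMessage("temperature=71.3".getBytes(StandardCharsets.UTF_8));
        message.setQos(0);   // ingestion uses QoS 0 (at-most-once) semantics

        client.publish("examples/sensor-data", message);          // placeholder topic
        client.disconnect();
    }
}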

TLS 1.3 support In SDP (1.3), you can specify which Transport Layer Security (TLS) protocol versions to enable when you install SDP to use SSL or TLS for communication over public endpoints.

JupyterHub JupyterHub provides an interactive web-based environment for data scientists and data engineers to write and immediately run Python code.


Basic terminology The following terms are basic to understanding the workflows supported by SDP.

Pravega scope The Pravega concept for a collection of stream names. RBAC for Pravega operates at the scope level.

Pravega stream A durable, elastic, append-only, unbounded sequence of bytes that has good performance and strong consistency. A stream is uniquely identified by the combination of its name and scope. Stream names are unique within their scope.

Pravega event A collection of bytes within a stream. An event has identifying properties, including a routing key, so it can be referenced in applications.

Pravega writer A software application that writes data to a Pravega stream.

Pravega reader A software application that reads data from a Pravega stream. Reader groups support distributed processing.

Flink application An analytic application that uses the Apache Flink API to process one or more streams. Flink applications may also be Pravega Readers and Writers, using the Pravega APIs for reading from and writing to streams.

Flink job Represents an executing Flink application. A job consists of many executing tasks.

Flink task A Flink task is the basic unit of execution. Each task is executed by one thread.

Spark application An analytic application that uses the Apache Spark API to process one or more streams.

Spark job Represents an executing Spark application. A job consists of many executing tasks.

Spark task A Spark task is the basic unit of execution. Each task is executed by one thread.

RDD Resilient Distributed Dataset. The basic abstraction in Spark that represents an immutable, partitioned collection of elements that can be operated on in parallel.

Project An SDP concept. A project defines and isolates resources for a specific analytic purpose, enabling multiple teams of people to work within SDP in separate project environments.

Project member An SDP user with permission to access the resources in a specific project.

Kubernetes environment

The underlying container environment in which all SDP services run. The Kubernetes environment is abstracted from end-user view. Administrators can access the Kubernetes layer for authentication and authorization settings, to research performance, and to troubleshoot application execution.

Schema registry A registry service that manages schemas and codecs. It also stores schema evolution history. Each stream is mapped to a schema group. A schema group consists of schemas and codecs that are associated with applications.

Pravega Search cluster

Resources that process Pravega Search indexing, searches, and continuous queries.

Interfaces SDP includes the following interfaces for developers, administrators, and data analysts.

Table 3. Interfaces in SDP

Interface Purpose

SDP User Interface Configure and manage streams and analytic jobs. Upload analytic applications.

Pravega Grafana custom dashboards Drill into metrics for Pravega.

Apache Flink Web User Interface Drill into Flink job status.

Apache Spark Web User Interface Drill into Spark job status.

Keycloak User Interface Configure security features.

Pravega and Apache Flink APIs Application development.


Project-specific Grafana custom dashboards Drill into metrics for Flink, Spark, and Pravega Search clusters.

Project-specific OpenSearch Web User Interface Submit Pravega Search queries.

Monitoring Grafana dashboards Administrators can browse Grafana dashboards in the Monitoring Grafana UI to see the Kubernetes cluster metrics for SDP components.

JupyterHub JupyterHub provides an interactive web-based environment for data scientists and data engineers to write and immediately run Python code.

In addition, users may download the Kubernetes CLI (kubectl) for research and troubleshooting of the SDP cluster and its resources. This includes support for the SDP custom resources, such as projects.

User Interface (UI)

The Dell Technologies Streaming Data Platform provides the same user interface for all personas interacting with the platform.

The views and actions available to a user depend on that user's RBAC role. For example:
Logins with the admin role see data for all existing streams and projects. In addition, the UI contains buttons that let them create projects, add users to projects, and perform other management tasks. Those options are not visible to other users.
Logins with specific project roles can see their projects and the streams, applications, and other resources that are associated with their projects.

Here is a view of the initial UI window that administrators see when they first log in. Administrators see all metrics for all the streams in the platform.

Figure 3. Initial administrator UI after login

Project members (non-admin users) do not see the dashboard. They only see the Analytics and the Pravega tabs for the streams in their projects.


Grafana dashboards

SDP includes the collection, storage, and visualization of detailed metrics.

SDP deploys one or more instances of metrics stacks. One instance is for gathering and visualizing Pravega metrics. There is a separate metrics stack (Prometheus + Grafana) for Kubernetes cluster monitoring. Additional project-specific metrics stacks are optionally deployed.

A metrics stack consists of an InfluxDB database and Grafana.

InfluxDB is an open-source database for storing time series data.
Grafana is an open-source metrics visualization tool. Grafana deployments in SDP include predefined dashboards that visualize the collected metrics in InfluxDB.

Developers can create their own custom Grafana dashboards as well, accessing any of the data stored in InfluxDB.

Pravega metrics

In SDP, InfluxDB stores metrics that are reported by Pravega. The Dashboard page on the SDP UI shows some of these metrics. More details are available on the Pravega Grafana dashboards. Links to the Pravega Metrics and Monitoring Metrics Grafana dashboards are available to SDP administrators from any page of the SDP UI. Administrators can use these dashboards to drill into problems or identify developing memory problems, stream-related inefficiencies, or problems with storage interactions.

The SDP UI Dashboards page contains a link to the Pravega Grafana instance. The Dashboards page and the Pravega Grafana instance are available only to administrators.

The navigation bar contains the link to Pravega Grafana Instance. It is available to admin users from any SDP UI page.

Project metrics

Metrics is a Project Feature. It can be added to (or removed from) the Project at any time (either on Project creation or at a later point).

If the Metrics feature is removed from the project, all previous data stored in its InfluxDB databases is lost.

If the project is created through the UI, the Metrics feature is selected by default, and a user must deselect that feature if it is not required.

The Metrics feature is available to both project members and administrators.

Application specific analytics

For projects that have metrics enabled, developers can add new metrics collections into their applications, and push the metrics to the project-specific InfluxDB instance. Any metric in InfluxDB is available for use on customized Grafana dashboards.
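
As a sketch of that pattern, the fragment below writes a single point with the influxdb-java client. The InfluxDB endpoint, credentials, database name, measurement, and field are assumptions that depend on the project's metrics configuration.

import java.util.concurrent.TimeUnit;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

public class CustomMetricSketch {
    public static void main(String[] args) {
        // Placeholder endpoint and credentials for the project-specific InfluxDB instance.
        InfluxDB influxDB = InfluxDBFactory.connect("http://project-influxdb:8086", "user", "password");
        influxDB.setDatabase("project_metrics");

        Point point = Point.measurement("events_processed")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .tag("job", "threshold-filter")
                .addField("count", 1234L)
                .build();

        influxDB.write(point);   // the point becomes available to custom Grafana dashboards
        influxDB.close();
    }
}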

Apache Flink Web UI

The Apache Flink Web UI shows details about the status of Flink jobs and tasks. This UI helps developers and administrators to verify Flink application health and troubleshoot running applications.

To view the Apache Flink Web UI from the SDP UI:

Analytics > Project > Flink Clusters > Overview > Flink cluster name.


Figure 4. Apache Flink Web UI

Apache Spark Web UI

The Apache Spark Web UI shows details about the status of Spark jobs and tasks. This UI helps developers and administrators to verify Spark application health and troubleshoot running applications.

The SDP UI contains direct links to the Apache Spark Web UI. From the Analytics Projects page, go to a project and click Spark Apps (on the Overview tab). The application name is a link to the Spark Web UI, which opens in a new browser tab and displays the Overview screen for the Spark application that you selected. From here, you can drill into status for all jobs and tasks.

Figure 5. Apache Spark Web UI


JupyterHub overview

JupyterHub provides an interactive web-based environment for data scientists and data engineers to write and immediately run Python code.

Jupyter Notebook

Figure 6. Jupyter Notebook

Jupyter Notebook is a single-user web-based interactive development environment for code and data. It supports a wide range of workflows in data science, scientific computing, and machine learning.

JupyterHub

JupyterHub is a multiuser version of Jupyter Notebook that is designed for companies, classrooms, and research labs. JupyterHub spawns, manages, and proxies multiple instances of the single-user Jupyter Notebook server.

Jupyter Notebook Single User Pod

The Jupyter Notebook Single User Pod runs a Jupyter Notebook server for a single user. All Python code that runs in a user's notebooks runs in this pod.


JupyterHub on SDP

JupyterHub can be enabled for an Analytics Project in SDP with a few clicks. It uses the single-sign-on functionality of SDP.

Each Analytics Project has its own deployment of JupyterHub. Each user can have up to one Jupyter pod per Analytics Project.

APIs

The following developer resources are included in an SDP distribution.

SDP includes these application programming interfaces (APIs):

Pravega APIs, required to create the following Pravega applications:
  Writer applications, which write stream data into the Pravega store.
  Reader applications, which read stream data from the Pravega store.

Apache Flink APIs, used to create applications that process stream data.
Apache Spark APIs, used to create applications that process stream data.
PSearch APIs, used to register continuous queries or run searches against streams.
Schema Registry APIs, used to perform schema registry operations, such as retrieving schemas.

Stream processing applications typically use these APIs to read data from Pravega, process or analyze the data, and perhaps even create new streams that require writing into Pravega.

What you get with SDP The SDP distribution includes the following software, integrated into a single platform.

Dell Technologies Streaming Data Platform management plane software
Keycloak software and an integrated security model
Pravega data store and API
Schema registry for managing schemas and codecs
Pravega Search (PSearch) framework, query processors, and APIs
Apache Flink framework, processing engine, and APIs
Apache Spark framework, processing engine, and APIs
InfluxDB for storing metrics
Grafana UI for presenting metrics
SDP installer, scripts, and other tools
JupyterHub, an interactive web-based development environment

Use case examples Following are some examples of streaming data use cases that Dell Technologies Streaming Data Platform is especially designed to process.

Industrial IoT Detect anomalies and generate alerts.
Collect operational data, analyze the data, and present results to real-time dashboards and trend analysis reporting.
Monitor infrastructure sensors for abnormal readings that can indicate faults, such as vibrations or high temperatures, and recommend proactive maintenance.
Collect real-time conditions for later analysis. For example, determine optimal wind turbine placement by collecting weather data from multiple test sites and analyzing comparisons.

Streaming Video Store and analyze streaming video from drones in real time. Conduct security surveillance. Serve on-demand video.

Automotive Process data from automotive sensors to support predictive maintenance.
Detect and report on hazardous driving conditions that are based on location and weather.
Provide logistics and routing services.

Financial Monitor for suspicious sequences of transactions and issue alerts.
Monitor transactions for legal compliance in real-time data pipelines.
Ingest transaction logs from market exchanges and analyze for real-time market trends.

Healthcare Ingest and save data from health monitors and sensors. Feed dashboards and trigger alerts for patient anomalies.

High-speed events

Collect and analyze IoT sensor messages. Collect and analyze Web events. Collect and analyze logfile event messages.

Batch applications

Batch applications that collect and analyze data are supported.

Documentation resources Use these resources for additional information.

Table 4. SDP documentation set

Subject Reference

Dell Technologies Streaming Data Platform documentation

Dell Technologies Streaming Data Platform Documentation InfoHub: https://www.dell.com/support/article/us/en/19/sln319974/dell-emc-streaming-data-platform-infohub

Dell Technologies Streaming Data Platform support site: https://www.dell.com/support/home/us/en/04/product-support/product/streaming-data-platform/overview

Dell Technologies Streaming Data Platform Developer's Guide
Dell Technologies Streaming Data Platform Installation and Administration Guide
Dell Technologies Streaming Data Platform Security Configuration Guide
Dell Technologies Streaming Data Platform Release Notes

NOTE: You must log in to a Dell support account to access release notes.

SDP Code Hub Community-supported public GitHub portal for developers and integrators. The SDP Code Hub includes Pravega connectors, demos, sample applications, API templates, and Pravega and Flink examples from the open-source SDP developer community: https://streamingdataplatform.github.io/code-hub/

Pravega concepts, architecture, use cases, and Pravega API documentation

Pravega open-source project documentation: http://www.pravega.io

Apache Flink concepts, tutorials, guidelines, and Apache Flink API documentation

Apache Flink open-source project documentation: https://flink.apache.org/
https://github.com/pravega/flink-connectors/tree/master/documentation/src/docs
https://github.com/pravega/pravega-samples/tree/master/flink-connector-examples
https://github.com/StreamingDataPlatform/workshop-samples/

Apache Spark concepts, tutorials, guidelines, and Apache Spark API documentation

Apache Spark open-source project documentation: https://spark.apache.org
https://github.com/StreamingDataPlatform/workshop-samples/tree/master/spark-examples
https://github.com/pravega/spark-connectors/tree/master/documentation/src/docs
https://github.com/pravega/pravega-samples/tree/master/spark-connector-examples


Using the Streaming Data Platform (SDP)

Topics:

About installation and administration
User access roles
About project isolation
Log on as a new user
Change your password or other profile attributes
Overview of the project-member user interface
Analytics projects and Pravega scopes views
Create analytics projects and Flink clusters
About stream processing analytics engines and Pravega connectors

About installation and administration Your platform administrator installs and administers all components of the Streaming Data Platform.

Distributed data processing frameworks like Apache Flink or Apache Spark need to be set up to interact with several components, such as resource managers, file systems, and services for distributed coordination. Your platform administrator might install these processing engines on more than one node for additional processing power. Each installation of Flink or Spark is licensed and tracked, and Dell Technologies can help you determine how many licenses to include in your configuration.

For information about installing and administering Flink and Spark clusters and Pravega streams, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.

User access roles The SDP provides one user interface (UI) for all users interacting with stream analytics. Authorization roles control the views and actions available to each user login.

For example:

Platform administrator roles have authorization to create analytics projects, add users to projects, and manage storage resources. Those options are not visible to logins with a user role.

Project member roles are assigned to projects by platform administrators as follows:
Project members see only the streams and applications that are associated with their projects. Users assigned to a project member role are typically developers or data analysts.
Project members can view project-related streams, create and update Flink or Spark clusters, and upload applications and artifacts.
Project members must be added to the platform and to analytics projects by administrators. For more information, see the section on adding new members in the Streaming Data Platform Installation and Administration Guide.

The SDP supports the following types of activities for administrators, developers, and data analysts.

Role Persona Activities

Administrator Security Administrator Configure logins. View security-related logs.

Administrator Platform Administrator Create projects and add users to projects. Monitor system throughput, storage, and resource allocation.

Project-member Developer Add streams and applications to projects. Define streams and configure stream properties. Create and update Flink clusters. Upload production-ready application images, applications, and associated artifacts. Start, stop, and troubleshoot application jobs.

Project-member Data Analyst Work within a project. Start and stop applications. Add members to existing scopes and streams. Monitor application job status. Research stream metrics.

Platform administrators have access to all projects by default, and projects are created and managed by administrators. Administrators ensure that a project's members include the set of users who need access to the project. Administrators monitor resources associated with stream storage and application processing. They may also monitor stream ingestion and scaling.

A project-member is usually a developer or data analyst.

Developers typically upload their application artifacts, choose or create the required streams associated with the project, and run and monitor applications. They may monitor streams as well.

Data analysts may run and monitor applications. They may also monitor or analyze metrics for the streams in the project.

About project isolation Groups of users with differing functions work within a Streaming Data Platform project. A project defines and isolates resources for a specific analytics purpose. Each project has a dedicated Kubernetes namespace, which enhances project isolation.

A project's resources are protected using RBAC by the following:

Project resources are protected at the Keycloak level. For example, access to the project's Pravega scope and its streams, and to Maven artifacts.

Project resources are protected at the Kubernetes level. For example, any namespace resources in the Kubernetes cluster for this project, such as FlinkCluster, FlinkApplication, and Pods.

For additional information about RBAC and security, see the Dell Technologies Streaming Data Platform Security Configuration Guide and Dell Technologies Streaming Data Platform Installation and Administration Guide .

Log on as a new user You need a user account that is configured by your platform administrator to log on to the SDP. Ask your administrator for the URL. After logging in, a user can change the password or other profile attributes for the username that they are currently logged in under.

Steps

1. Type the URL of the SDP UI in a web browser.

NOTE: Obtain the URL from your Platform Administrator.


The Streaming Data Platform login window displays.

2. Using one of the following credential sets, provide a Username and Password.

If your platform administrator provided credentials, use those. For environments where an existing Identity Provider is integrated, use your enterprise credentials.

3. Click Log In.

If your user name and password are valid, you are authenticated to SDP. Depending on your authorization, you see one of the following windows:

If you see this window Explanation

The Dashboard section of the SDP UI. The dashboard appears if the username has authorizations associated with it. It shows streams and metrics that you are authorized to view.

A welcome message asking you to see your Administrator.

The welcome message appears if there are no authorizations associated with the username. You are authenticated but not authorized to see any data. Ask an Administrator to provide authorizations to your username.

4. If you see the welcome message directing you to an Administrator, ask for one of the following authorizations:

Purpose Authorization Request

View and administer all projects and streams. Platform Administrator

View or develop applications or streams within a project. Project member

View or develop applications or streams within a project that does not yet exist.

A Platform Administrator must create the new project and add you as a member.

View or develop streams in a scope that is independent of any project and does not yet exist.

A Platform Administrator must create the new scope and add you as a member.


Change your password or other profile attributes A user can change the password or other profile attributes for the username that they are currently logged in under.

Steps

1. NOTE: When federation is enabled, password changes must be made in the federated identity provider. The federated credentials are used for authentication for all operations in the SDP and kubectl. In order to use basic authentication with Gradle to publish artifacts to a project's Maven repository, you must set a local password in Keycloak for the local/shadow user that was created when the federated user logged in to the SDP.

Log in to the SDP using the username whose password or other profile attributes you want to change.

2. In the top banner, click the User icon.

3. Verify that the username at the top of the menu is the username whose profile you want to change.

4. Choose Edit Account.

5. To change the password, complete the password-related fields.

6. Edit other fields in the profile if needed.

7. Click Save.

8. To verify a password change, log out and log back in using the new password. To log out, click the User icon and choose Logout.

Overview of the project-member user interface This section summarizes the top-level screens and menus in the SDP UI available to users in the project-member role.

The top banner contains icons for navigating to the portions of the UI that a project-member has access to.

Icon Description

Analytics Displays all the projects that include the login account as a project-member. You can:
View summary information about all the analytics projects assigned to you.
Drill into a project page.
NOTE: Only platform administrators can create new projects.

For platform administrator information, see Dell Technologies Streaming Data Platform Installation and Administration Guide

Pravega Displays all the scopes that your login account has permission to access. You can:
View summary information about the scopes assigned to you.
Drill into a scope page, and from there, into streams.
Create new streams.
Create new scopes, if you are a platform administrator.

User The User icon displays a drop-down menu with these options:
The user name that you used to log in to the current session. (This item is not clickable.)
Product Support: Opens the SDP documentation InfoHub. From there, you can access Dell Technologies product documentation and support.
Logout.


Analytics projects and Pravega scopes views The Analytics Projects and Pravega Scopes views are available to users in the project-member role.

View projects and the project dashboard

Log into the SDP and click the Analytics icon to navigate to the Analytics Projects view. This lists the projects you have access to.

Navigate using the links at the top of the project dashboard to view Apache Flink clusters, applications, and artifacts associated with the project. Later sections of this guide provide information about how to create Flink clusters, create applications, and upload artifacts.

Users with project-member credentials can view the following:

Project-members (non-administrative users) can only see the projects assigned to them.
Project-members cannot create analytics projects. Only users with platform administrator credentials have the rights to create analytics projects.
Project-members have permissions to create and update Flink clusters, Flink and Spark applications, artifacts, Pravega Search (PSearch) clusters, and Schema Registry components, such as schema groups, schemas, continuous queries (CQs), and codecs assigned to their projects.

Click a project name to view the Dashboard for that project.

Figure 7. Project Dashboard

The page header includes information about the project:

Created Date
State
Description
If your system is configured for Dell EMC ECS storage, the project's system-generated bucket name and access credential display.

Click the tabs along the top of the dashboard to drill down into the project and view the following:

Link name Description

Flink Clusters View the configured Flink clusters.

Flink Apps View the applications defined in the project. Project-members can use this tab to create new applications, remove applications, edit application configurations, and update artifacts.

Spark Apps View the applications defined in the project. Project-members can use this tab to create new applications, remove applications, edit application configurations, and update artifacts.

Artifacts View the application artifacts uploaded to the project's Maven repository.

Project-members can upload, update, and remove artifacts.

Members Add members.

Pravega Search Search the configured clusters, searchable streams, and continuous queries.

Video View, add, edit, and delete GStreamer pipelines, Camera Recorder Pipelines, and Camera Secrets, as well as play video streams.

Additionally, the project dashboard displays the following:

Number of task slots used/number of task slots available
Number of streams associated with the project
Number of jobs running, finished, cancelled, and failed
Kubernetes event messages related to the project in the message section

Pravega scopes view

A Pravega scope is a collection of projects and associated streams. RBAC authentication for Pravega operates at the scope level.

To navigate to the Pravega Scopes view, log on to the SDP and click the Pravega icon in the top banner.

Project-members (non-administrative users) have access to the following:

Project-members can view project-related scopes.
Project-members cannot create new Pravega scopes or view any scopes that are not assigned to their projects.
Project-members cannot create scopes outside of a project or manage scope membership in the SDP UI.

For more information about managing Pravega scopes, see the Dell Technologies Streaming Data Platform Installation and Administration Guide .

Pravega streams view

A stream is uniquely identified by the combination of its name and scope. The Pravega streams view lists the streams available within a specific project scope. Project-members typically create the data streams associated with a project. They may also monitor those streams.

To view a project's Pravega streams, log on to the SDP and click the Pravega icon in the top banner. Click the name of a Project Scope. From here, you can view streams and add/view schema registry groups and codecs. For more information, see the "About Pravega schema registry" section in this guide.

NOTE: You can also access the details for a single stream from the Pravega > Pravega Streams view.

View the following details about the stream:
Name of the stream
Scaling policy
Retention policy
Bytes written
Bytes read
Transactions open, created, or aborted
Stream segment count
Segment heat chart

For more information about streams, see the Working with Pravega streams section.

The views and actions available to a user in the Dell Technologies Streaming Data Platform depend on a user's RBAC role:

Users in the platform administrator role can see data for all existing streams and projects. In addition, the UI displays icons that allow admins to create projects, add users to projects, and create Pravega scopes. Those options are not visible to project-members.

Users in the project-member role, for example developers and data analysts, can see the projects and streams, applications, and other resources that are associated with their projects.

Create analytics projects and Flink clusters Only platform administrators can create analytics projects in the Streaming Data Platform. Both platform administrators and project-members can create Flink clusters.

Unlike Apache Flink, Apache Spark defines its cluster in the application resource, so there is no need to create a Spark cluster before deploying the application. A Spark cluster consists of a Spark operator, driver, and executor pods.

About analytics projects

An SDP analytics project is a Kubernetes custom resource of kind project. A project resource is a Kubernetes namespace enhanced with resources and services. An analytics project consists of Flink clusters, applications, artifacts, Pravega Search (PSearch) clusters, and project members.

Only platform administrators can create projects and add members to projects. Project members can see information about the projects that they belong to.

Platform administrators can view all projects and create new projects in the UI. If more control of project definition is required, administrators can create projects manually using kubectl commands. For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide .

All analytic capabilities are contained within projects. If enabled upon project creation by platform administrators, analytic metrics are enabled for a project.

Projects provide support for multiple teams working on the same platform, while isolating each team's resources from the others. Project members can collaborate in a secure way, as resources for each project are managed separately.

When an analytics project is created through the Dell Technologies Streaming Data Platform UI, the platform automatically creates the following:

Project features are components that can be added to a project, such as the Artifact Repository, Metrics Stack, JupyterHub, and Pravega Ingestion Gateway. If a feature is selected, SDP provisions the components within the project namespace to provide that service.

Resources and services Function

Maven repository Hosts job artifacts

Zookeeper cluster Allows fault tolerant Flink clusters

Project storage A persistent volume claim (PVC) that provides storage shared between Flink clusters

Kubernetes namespace Houses the project resources and services

Keycloak credentials Configured to allow authorized access to the project's Pravega scope and to allow analytics jobs to communicate with Pravega

Pravega scope A Pravega scope is created with the same name as the project to contain all the project streams. Scopes created this way automatically appear in the list of scopes. Keycloak credentials allow access to the Pravega scope

Security Security features allow applications belonging to the project to access the Pravega streams within the project scope.


Resources and services Function

For more information about security features, see the Dell Technologies Streaming Data Platform Security Configuration Guide and the Dell Technologies Streaming Data Platform Installation and Administration Guide .

About Flink clusters

Flink clusters provide the compute capabilities for analytics applications. Using the SDP UI, one or more fault-tolerant Flink clusters can be deployed to a project.

Both platform administrators and project-members can deploy Flink clusters.

Flink clusters can be deployed into a project as follows:

Administrators and project members can deploy Flink clusters using the SDP UI.
If customized configurations of compute resources are required, you can deploy clusters manually using Kubernetes kubectl commands. To use kubectl commands, project-members must be federated users. For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.

Components of a Flink setup

Distributed data processing frameworks like Apache Flink need to be set up to interact with several components, such as resource managers, file systems, and services for distributed coordination. Platform administrators manage these components using the SDP UI.

Apache Flink consists of components that work together to execute streaming applications. For example, the job manager and task managers have the following functions:

Job Manager

The job manager is the master process that controls the execution of a single application. An application consists of a logical data flow graph and a JAR file that bundles all the required classes, libraries, and other resources.

The job manager requests the necessary resources (Task Manager slots) to execute the tasks from the Resource Manager. Once it receives enough task manager slots, it distributes the tasks to the task managers that execute them. During execution, the job manager is responsible for all actions that require central coordination, such as the coordination of checkpoints.

Task Managers

Task managers are the worker processes of Flink. Typically, there are multiple Task Managers running in a Flink setup. Each task manager provides a certain number of slots. The number of slots limits the number of tasks a task manager can execute. When instructed by the Resource Manager, a task manager offers one or more of its slots to the job manager. The job manager can then assign tasks to the slots to execute them.

About Flink jobs

Analytics applications define the parameters and required environment for a deployment.

For Apache Flink jobs to be deployed, they must be scheduled on Flink clusters as standard Flink jobs. Each start and stop of an analytics application results in a new Flink job being deployed to a cluster.


Create Flink clusters

After a platform administrator creates an analytics project, project-members can create Flink clusters using the SDP UI.

Steps

1. Determine if the cluster requires a custom configuration. The SDP UI uses default values for most configuration attributes when creating Flink clusters. If a custom configuration is required, a project-member can create a Flink cluster using the Kubernetes command line.

Custom configurations are required in the following situations:

Custom Flink images for processing

Developers may want to process their applications using a custom Flink image. SDP ships with images for 1.12.7 and 1.13.6. Any other image would be considered custom. Dell Technologies recommends that you use the latest Apache Flink version available when creating new SDP Flink applications.

The supported versions are prebuilt Apache Flink images that are included with the SDP.

Custom labels for scheduling

Developers may want to use Flink cluster labels to assign jobs to specific Flink clusters for processing. The SDP uses the Flink image version number to assign the cluster to a job. For more flexibility, developers can incorporate labels.

NOTE: Registering and using custom Flink images is supported by creating a RuntimeImage resource. To view the default cluster Flink images installed, run the following Kubernetes command:

kubectl get runtimeimages

2. To create a Flink cluster using the Dell Technologies Streaming Data Platform UI, log on and navigate to Analytics > project-name > Flink Clusters.

3. Click Create Flink Cluster.

4. Complete the cluster configuration screen. The following attributes are configurable:

Section Attribute Description

General Name Labels are optional. The label name must conform to Kubernetes naming conventions.

Label Enter a custom label key name and value.

Flink Image Choose the Flink image for this cluster. The SDP presents options taken from the registered ClusterFlinkImage resources.


Task Manager Number of Replicas You can configure multiple task managers. Enter the number of Apache Flink task managers in the cluster in the Replicas field.

NOTE: A standalone Apache Flink cluster consists of at least one Job Manager (the master process) and one or more task managers (worker processes) that run on one or more machines.

Number of Task Slots Enter the number of task slots per task manager (at least one). Each task manager is a JVM process that can execute one or more subtasks in separate threads. The number of task slots controls how many tasks a task manager accepts.

NOTE: The task slots represent the equal dividing-up of the resources a task manager provides. For example, if a task manager was configured with 2 CPU cores, and 2048 MB of memory, then each task slot would get roughly 1 CPU and 1024 MB of heap. For more information, see the Apache Flink documentation: https://ci.apache.org/projects/ flink/flink-docs-stable/concepts/ runtime.html#task-slots-and- resources.

Memory Specify the memory allocation. The default value is 1024 MB.

CPU Enter the number of cores. The default value is 0.5 cores. Dell Technologies recommends 1 CPU per task slot.

GPUs Enter the number of NVIDIA GPUs installed. GPU-accelerated support for Flink clusters enables customers to use video processing, stream processing, and machine learning on analytic workloads.

Local Storage Volume Size Specify the Storage Type as either Ephemeral or Volume Based.

Ephemeral temporary storage is provided automatically to your instance by the host node. This is the preferred storage option for task managers. Every time the task manager starts, it starts with clean temporary storage.

Volume based storage is provided by persistent storage PVCs. This storage is not overwritten/wiped when the task manager restarts, and is always available, regardless of the state of the running instance.

Configuration Add Configuration Add user-defined key/value pair parameter as a string/string.

For more information about Apache Flink task managers and task slots, see the Apache Flink documentation: https:// ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html

5. Click Save.

In the Flink Cluster view that appears, the cluster State is initially Deploying. After a few seconds, the State changes to Ready. The cluster is now ready for project-members to create applications and upload artifacts.

Change Flink cluster attributes

Project-members can edit Flink clusters to change the number of Apache Flink task manager replicas.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name > Flink Clusters.

3. On the Edit Flink Cluster page, change the number of task manager replicas, and then click Save.

Delete a Flink cluster

You can delete a Flink cluster and define a new cluster within a project without having to remove any applications.

Prerequisites

Before you delete a Flink cluster, check for the impact on applications that are currently running on the cluster. If you delete a Flink cluster that has applications deployed and running on it, then the SDP first safely stops all Flink applications (the Flink cluster goes into a "draining" state while this is happening). Once all Flink applications have been stopped, the cluster is deleted. In the meantime, any Flink application that was ejected from the Flink cluster immediately goes into a Scheduling state, while searching for a new cluster to deploy onto. Once one is found, it deploys and continues executing from where it left off.

About this task

Deleting a cluster does not delete any underlying Flink applications associated with that cluster.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name.

3. Click the Flink Clusters tab.

4. Identify the cluster to delete and then click Delete under the Action column. A popup menu displays two buttons: Delete and Cancel.

5. Confirm that you want to delete the cluster by clicking Delete. Click Cancel to keep the cluster.

GPU-accelerated workload support for Flink

GPU-accelerated workload support for Apache Flink, introduced in SDP 1.3 and later versions, enables data scientists to use machine learning on analytic workloads.

Specialized GPU (graphics processing unit) processors with dedicated memory are used to process intensive deep learning AI workloads. SDP takes advantage of GPUs for video processing, stream processing, and machine learning. The GPU-accelerated workload adds support for Apache Flink and Apache Spark applications.

During SDP installation, the NVIDIA GPU operator updates the SDP node base operating system and the Kubernetes environment with the appropriate drivers and configurations for GPU access. The GPU Operator deploys Node Feature Discovery (NFD) to identify nodes that contain GPUs and installs the GPU driver on GPU-enabled nodes. For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.

CUDA libraries for GPU workloads

CUDA libraries are the API between applications and the GPU driver.

SDP (1.3 and later versions) ships with developer packages containing a Dockerfile and the binaries required for GPU workload support. The developer package contains an example CUDA Dockerfile. All workloads within SDP run within Docker containers, which must contain the CUDA toolkit libraries for GPU support.

NOTE: CUDA toolkit libraries must be compatible with the CUDA driver version. The version of the CUDA libraries depends on your environment, and the required libraries depend on the workload (for example, JCUDA libraries). For more information, see https://docs.nvidia.com/deploy/cuda-compatibility/

Example: Flink CUDA

ARG CUDA_VERSION=11.2.0

FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu20.04

ARG FLINK_VERSION=1.13.2-spring1
ARG SCALA_VERSION=2.12

# Install Java
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        # utilities for keeping Debian and OpenJDK CA certificates in sync
        ca-certificates p11-kit \
    ; \
    rm -rf /var/lib/apt/lists/*

. . .

# Install Flink
RUN set -ex; \
    tar -xf flink.tgz --strip-components=1; \
    rm flink.tgz; \
    chown -R flink:flink .

. . .

# Configure container
COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
EXPOSE 6123 8081
CMD ["help"]

Example runtimes in the SDP UI


System > Runtimes > Create Runtime

Kubernetes pod scheduling

Kubernetes pods use resources to specify GPU requirements. If you are using CUDA libraries for GPU workloads, pods must specify both the requests and limits for GPU resources as shown:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        requests:
          nvidia.com/gpu: 1 # requesting 1 GPU
        limits:
          nvidia.com/gpu: 1 # limiting to 1 GPU


Specify GPUs when creating Flink cluster in the SDP UI

In SDP (1.3 and later versions), you can specify the number of GPUs under the Task Manager section of the Create Flink Cluster page as shown:

Example GPUs in Task Manager specification

spec:
  taskManager:
    cpu: "0.5"
    gpu: 1    # number of GPUs per task manager
    memory: 1525M
    numberOfTaskSlots: 5
    replicas: 2
    resources:
      limits:
        ephemeral-storage: 10Gi
      requests:
        ephemeral-storage: 10Gi

If there are not enough resources to create Task Managers, the SDP UI will display a Provision-Pending state.

Monitoring GPUs

Monitoring Metrics > Dashboards


About stream processing analytics engines and Pravega connectors

Specialized connectors for Pravega enable applications to perform read and write operations over Pravega streams. Streaming Data Platform includes two built-in analytics engines for Pravega, the Apache Flink and Apache Spark connectors, which are integrated into the SDP UI.

Because Pravega is open-source software, you may develop your own connectors and contribute them to the Pravega community on the SDP Code GitHub site. All connectors are reviewed and approved by Dell Technologies.

For more information about developing Pravega connectors, see http://www.pravega.io and the SDP code hub: https://streamingdataplatform.github.io/code-hub/.


About the Apache Flink integration

Apache Flink job management for Pravega is an integrated component in the SDP UI. Flink allows users to build an end-to-end stream processing pipeline that can read and write Pravega streams using the Flink stream processing framework. By combining the features of Apache Flink and Pravega, it is possible to build a pipeline of multiple Flink applications that can be chained together to give end-to-end exactly-once guarantees across the chain of applications.

Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. The Flink integration into SDP:

Ensures parallel reads and writes
Exposes Pravega streams as Table API records
Allows checkpointing
Guarantees exactly-once processing with Pravega
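As an illustration of these integration points, the following is a minimal sketch of a Flink job that reads a Pravega stream through the Flink connector. The controller URI, scope, and stream names are placeholders, and the exact builder methods may vary between connector versions.

import java.net.URI;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;

public class PravegaFlinkReadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder controller URI and scope; inside SDP these values are
        // normally provided to the job environment for the project.
        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://pravega-controller:9090"))
                .withDefaultScope("my-project");

        // Source that reads events from the stream in parallel across task slots.
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("my-stream")
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        DataStream<String> events = env.addSource(source).name("pravega-source");
        events.print();

        env.execute("pravega-flink-read");
    }
}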

About the Apache Spark™ integration

Apache Spark™ job management for Pravega is an integrated component in the Streaming Data Platform.

The Apache Spark integration allows users to manage applications and view Spark analytic metrics from within the SDP UI.

The integration includes support for Java, Scala, and Python (PySpark), and for Spark 2.4.8 and 3.3.0.

To learn about developing stream processing applications using Apache Spark, see the Apache Spark open-source documentation at https://spark.apache.org.
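For orientation only, the sketch below shows the general shape of a Java Structured Streaming job that reads a Pravega stream through the Spark connector. The "pravega" format name and the controller, scope, and stream option keys follow the open-source connector's documented pattern but are assumptions here; check them against the connector version shipped with your SDP release.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class PravegaSparkReadJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("pravega-spark-read")
                .getOrCreate();

        // Placeholder connection settings; option keys are assumptions based on
        // the open-source Pravega Spark connector and may differ by version.
        Dataset<Row> events = spark.readStream()
                .format("pravega")
                .option("controller", "tcp://pravega-controller:9090")
                .option("scope", "my-project")
                .option("stream", "my-stream")
                .load();

        // Write the raw events to the console for demonstration purposes.
        StreamingQuery query = events.writeStream()
                .format("console")
                .start();
        query.awaitTermination();
    }
}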


Working with Pravega Streams

A Pravega stream is an unbounded stream of bytes or stream of events. The Pravega engine ingests the unbounded streaming data and coordinates permanent storage. Storage is configured by the platform administrator using the SDP UI. This section describes how to create, modify, and monitor Pravega streams.

Topics:

About Pravega streams
About stateful stream processing
About event-driven applications
About exactly once semantics
Understanding checkpoints and savepoints
About Pravega scopes
About Pravega connectors and streams
Create data streams
Stream configuration attributes
List streams (scope view)
View streams
Edit streams
Delete streams
Start and stop stream ingestion
Monitor stream ingestion
About Grafana dashboards to monitor Pravega

About Pravega streams

Pravega writer RESTful APIs write streaming data to the Pravega stream store. Pravega is an open-source project sponsored by Dell Technologies. Customized software connectors, such as the integrated Apache Flink and Apache Spark connectors, give stream processing applications such as Apache Flink and Apache Spark access to Pravega.

Pravega handles all types of streams, including:

Unbounded streams
Bounded streams
Event-based messages
Batched logs
Streaming video
Any other type of stream

Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage.

Pravega handles unbounded streaming bytes and seamlessly coordinates a two-tiered storage system for each stream:

Pravega manages temporary storage for the recently ingested tail of a stream.
Pravega infinitely tiers ingested data into long-term storage, such as a Dell Technologies PowerScale cluster or ECS object storage.

Before running applications that write or read streams to Pravega, the stream must be defined and configured in the SDP as described in the Create data streams section.

For more information about Pravega streams, see http://www.pravega.io.
For more information about long-term storage, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.


About stateful stream processing

Stateful stream processing is an application design pattern for processing unbounded streams of events and is applicable to many different use cases. Any application that processes a stream of events and does not perform record-at-a-time transformations must be stateful with the ability to store and access intermediate data.

When an application receives an event, it can perform arbitrary computations that involve reading data from or writing data to the state. In principle, state can be stored and accessed in many different places including program variables, local files, or embedded or external databases.

Apache Flink or Spark stores the application state locally in memory or in an embedded database. Flink and Spark, as distributed systems, periodically write application state checkpoints to remote storage. The local state is protected against failures and data loss.

If a failure occurs, Flink and Spark can recover a stateful streaming application by restoring its state from a previous checkpoint and resetting the read position on the event log. The application replays and fast forwards the input events from the event log until it reaches the tail of the stream. This technique is used to recover from failures but can also be leveraged to update an application, fix issues, and repair previously emitted results, or migrate an application to a different cluster.
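To make the pattern concrete, here is a minimal, hypothetical Flink function that keeps per-key state (a running count), which Flink includes in its periodic checkpoints and restores after a failure; the class and state names are illustrative only.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the count survives failures because Flink
// includes this state in its periodic checkpoints.
public class EventCounter extends RichFlatMapFunction<String, String> {
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("event-count", Long.class));
    }

    @Override
    public void flatMap(String event, Collector<String> out) throws Exception {
        Long count = countState.value();
        long next = (count == null) ? 1L : count + 1L;
        countState.update(next);
        out.collect("seen " + next + " events for this key");
    }
}

The function would typically be applied after a keyBy() so that the state is scoped per key.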

About event-driven applications

Event-driven applications are stateful streaming applications that ingest event streams and process the events with application-specific logic. Depending on the logic, an event-driven application can trigger actions, such as sending an alert or an email, or write events to an outgoing event stream to be consumed by another event-driven application.

Typical use cases for event-driven applications include:

Anomaly detection
Pattern detection or complex event processing
Real-time recommendations

Event-driven applications communicate using event logs instead of RESTful calls and hold application data as local state instead of writing it to and reading it from an external datastore, such as a relational database or key-value store.

Exactly-once state consistency and the ability to scale an application are fundamental requirements for event-driven applications.

About exactly once semantics

Stream processing applications are prone to failures because of the distributed nature of the deployments they must support. Tasks are expected to fail. The processing framework must provide guarantees that it maintains the consistency of the internal state of the stream processor despite crashes.

The SDP delivers end-to-end exactly once guarantees. Exactly once semantics mean that each incoming event affects the final result exactly once. If a machine or software fails, there is no duplicate data and no data that goes unprocessed. Exactly once semantics consumes less storage than traditional systems because duplication is avoided. There is one version of the data--one source of truth--that multiple analytic applications can access.

These exactly once guarantees are meaningful since you should always assume the possibility of network or machine failures that might result in data loss. Exactly once semantics means that the streaming application must observe a consistent state after recovery compared to the state before the crash.

Many traditional streaming data systems duplicate data for usage in different pipelines. With Pravega, data is ingested and protected in one place, with one single source of truth for all processing data pipelines.

A stream processing application can provide end-to-end exactly once guarantees only when the external data source (with which the source and sink operators interact) supports the following:

An event is not delivered more than once (no duplicate events) to any source operator. For example, if there are multiple source operators (parallelism > 1) reading from the external data source, then each source operator is expected to see a unique set of events. Events are not duplicated across the source operators.
The source operator can consume data from the external data source from a specific offset.
The sink operator can request the external data source to perform commit or rollback operations.


Pravega exactly once semantics

Streaming applications process large amounts of continuously arriving data. In Pravega, the stream is a storage primitive for continuous and unbounded data. Pravega ingests unbounded streaming data in real time and coordinates permanent storage.

Once the writes are acknowledged to the client, the Pravega model guarantees data durability, message ordering, and exactly once support. Duplicate event writes are not allowed. With message ordering, events within the same routing key are delivered to readers in the order they were written.
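As a small illustration of per-routing-key ordering and acknowledged writes, the following sketch uses the open-source Pravega Java client to write an event with a routing key. The scope, stream, and controller URI are placeholders.

import java.net.URI;
import java.util.concurrent.CompletableFuture;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class SensorWriter {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090")) // placeholder
                .build();

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("my-project", clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "sensor-readings", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {

            // Events written with the same routing key ("sensor-42") are
            // delivered to readers in the order they were written.
            CompletableFuture<Void> ack = writer.writeEvent("sensor-42", "temp=21.5");
            ack.join(); // durability is guaranteed once the write is acknowledged
        }
    }
}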

Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage. Pravega handles all types of streams, including:

Unbounded streams in real time (continuous flow of data where the end position is unknown)
Bounded streams (start and end boundaries are defined and well-known)
Event-based messages
Batched logs
Any other type of stream

Pravega seamlessly coordinates a two-tiered storage system for each stream.

Temporary storage for the recently ingested tail of a stream
Long-term storage on Dell Technologies PowerScale or ECS

Users can configure data retention periods.

Applications, such as a Java program reading from an IoT sensor, write data to the tail of the stream. Analytics applications, such as Flink, Spark, or Hadoop jobs, can read from any point in the stream. Many applications can read and write the same stream in parallel. Pravega design highlights include elasticity, scalability, and support for large volumes of streamed data and applications.

The same API call accesses both real-time and historical data stored in Pravega. Applications can access data in near-real time, past and historical time, or at any arbitrary time in the future in a uniform fashion.

Pravega streams are durable, ordered, consistent, and transactional. Pravega is unique in its ability to handle unbounded streaming bytes. As a high-throughput store that preserves ordering of continuously streaming data, Pravega infinitely tiers ingested data into Dell Technologies PowerScale long-term storage.

Apache Flink exactly once semantics

Specialized connectors enable applications to perform read and write operations over Pravega streams.

The Pravega Flink Connector is a data integration component that can read and write Pravega streams with the Apache Flink stream processing framework. The Apache Flink connector offers seamless integration with the Flink infrastructure, thereby ensuring parallel reads/writes, exposing Pravega streams as Table API records, allowing checkpointing, and guaranteeing exactly once processing with Pravega.

Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. By combining the features of Apache Flink and Pravega, you can build a pipeline of multiple Flink applications. This pipeline can be chained together to give end-to-end exactly once guarantees across the chain of applications.

Flink provides a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It performs computations at in-memory speed and at any scale. The order of data during processing is guaranteed. The SDP Flink integration adds enterprise-ready functionality to the open-source Apache Flink project, including:

Integration with container orchestration
Continuous integration/continuous delivery (CI/CD) job pipelines
Logging and metrics
State storage
Streaming ledger, which adds serializing ACID transaction capability to Flink

A checkpoint feature allows stopping a running application and restarting at the saved state. Flink supports exactly once guarantee with the use of distributed snapshots. With the distributed snapshot and state checkpointing method of achieving exactly once, all the states for each operator in the streaming application are periodically checkpointed. If a failure occurs anywhere in the system, all the states for every operator are rolled back to the most recently globally consistent checkpoint. During the rollback, all processing is paused. Sources are also reset to the correct offset corresponding to the most recent checkpoint. The whole streaming application is rewound to its most recent consistent state, and processing can then restart from that state.


Flink draws a consistent snapshot of all its operator states periodically (its checkpoint configuration intervals). Flink then stores the state information in a reliable distributed state backend storage. By doing so, it allows the Flink application to recover from any failures by restoring the application state to the latest checkpoint.
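A minimal sketch of how a Flink application opts in to this behavior; the interval and retention setting here are chosen purely for illustration.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a consistent snapshot of all operator state every 30 seconds.
        env.enableCheckpointing(30_000);

        // Exactly-once is the default mode; shown here for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Keep the latest completed checkpoint when the job is cancelled, so it
        // remains available as a recovery point (retained checkpoints).
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the rest of the pipeline, then call env.execute(...)
    }
}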

Flink includes APIs that can process continuous near-real-time data, sets of historic data, or combinations of both.

For additional information about Flink streaming analytics applications, see the section in this guide: Working with Apache Flink Applications. Also see Pravega Flink Connector.

Understanding checkpoints and savepoints

A savepoint is a consistent image of the execution state of a streaming job, created using the Apache Flink checkpointing mechanism. You can use savepoints to stop-and-resume, fork, or update Flink jobs.

Savepoints consist of two parts: a directory with binary files on storage and a metadata file.

The files on storage represent the net data of the execution state image of the job.
The metadata file of a savepoint contains pointers to all files on storage that are part of the savepoint, in the form of absolute paths.

Conceptually, Flink savepoints are different from checkpoints in a similar way that backups are different from recovery logs in traditional database systems.

The primary purpose of checkpoints is to provide a recovery mechanism if there is an unexpected job failure. Flink manages the checkpoint life cycle. A checkpoint is created, owned, and released by Flink without user interaction.

In contrast, savepoints are created, owned, and deleted by the user.

Each time the application is stopped, its state is recorded in a Flink savepoint. Restarting the job deploys a new Flink job to the cluster, restoring the state from the savepoint.

NOTE: Since all clusters within an SDP analytics project have access to the savepoints, the applications can be restarted from any cluster.

About Pravega scopes

A Pravega (project) scope is a collection of streams.

Pravega scopes are automatically created when someone in the platform administrator role creates an analytics project using the SDP. When a Pravega scope is created in SDP, a new Kubernetes service-instance of the same name is also created. The service-instance provides the means for access control to the scope and its streams through Keycloak security.

Accessing Pravega scopes and streams

Use these steps to list the Pravega scopes and streams for analytics projects that you have been granted access to as a project member in SDP.

Steps

1. Log on to the SDP.
2. Click the Pravega icon in the banner to view the project scopes and streams to which you have been granted access.

SDP administrators:

Refer to the Dell Technologies Streaming Data Platform Installation and Administration Guide for information about configuring project scopes and adding project-members to a Pravega scope.

About cross project scope sharing

A Pravega scope can be shared in read-only mode with one or many other projects.

The Streaming Data Platform cross project scope sharing feature permits applications running in one project to have READ access to scopes from other projects. For example, a team working on ingesting data, such as stock prices, might be different from the analysts who analyze that same streaming data. Also, several distinct data analyst teams might want to consume that data but do not necessarily want to share their Flink or Spark applications or Maven artifacts. The cross project scope sharing feature allows these objects to remain hidden from other project members.

Platform administrators manage cross project scope sharing in the SDP UI, granting and removing project scope READ access to project members. When access is shared, applications in one project have READ access to streams in the shared scopes.

For more information about how administrators can configure cross project scope sharing, see the "Manage cross project scope sharing" section of the Dell Technologies Streaming Data Platform Installation and Administration Guide.

Configuring applications to read a specific stream or set of streams from a shared scope

Reader applications that run in the context of a Pravega scope are run as a project-wide service account and can consume Pravega stream data in that scope. Applications must be configured to read a specific stream, or set of streams, from that shared scope.

First, a platform administrator grants shared access at the scope level. Then authorizations are created for the project member's applications to read the specified project scope and streams that are associated with that scope.

For example, the following Flink reader application is coded with:

The fully qualified stream name for the source (marketdata-0/stockprices-0)
The current project scope (compute-0) for the reader group storage/reader synchronization state

Figure 8. Flink connector reader app example code: Ingestion and reader apps shared between different projects
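The figure itself is not reproduced here. As a rough sketch under the same assumptions (shared scope marketdata-0, current project scope compute-0), the connector configuration in such a reader might look as follows; the job skeleton is otherwise the same as the earlier Flink connector sketch, and the builder methods may differ slightly by connector version.

import java.net.URI;

import org.apache.flink.api.common.serialization.SimpleStringSchema;

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;

public class SharedScopeReaderConfig {
    public static FlinkPravegaReader<String> buildSource() {
        // The default scope is the current project (compute-0); it is used for
        // the reader-group synchronization state.
        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://pravega-controller:9090")) // placeholder
                .withDefaultScope("compute-0");

        // The source stream is fully qualified with the shared scope.
        return FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("marketdata-0/stockprices-0")
                .withDeserializationSchema(new SimpleStringSchema())
                .build();
    }
}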

In the SDP UI, the compute-reader-0 single reader application is shown reading data from the Pravega stream marketdata-0/stockprices-0.

Figure 9. Example in SDP: Ingestion and reader apps shared between different projects

Troubleshooting

If your application reader is failing with PERMISSION DENIED errors, have a platform administrator perform the following:

1. Check that the cross project binding is present and is not in a failed state.
   a. Check the SDP UI to ensure that cross project access is fully granted.
   b. Log in to SDP and go to the Pravega tab.
   c. Click the scope that should be shared.
   d. Click Manage cross project access to verify that the target project to be shared appears selected in the multiselect drop-down menu.
   e. Log in to the console as an admin, and use the kubectl CLI to check that all service bindings in the system show Ready by running:

      kubectl get servicebindings -n nautilus-pravega

2. Check the Keycloak admin console by going to the Evaluate tool under the Authorization tab of the Pravega-controller client.

About Pravega connectors and streams

Pravega is a storage system that exposes Stream as the main primitive for storing and serving continuous and unbounded data. Pravega writer RESTful APIs write the streaming data to the Pravega store.

Developers work with specialized Pravega connectors, for example Apache Flink and Apache Spark, and other components, such as Keycloak, to write applications that are compatible with SDP. You may also develop custom Pravega connectors for client applications to read and write data to Pravega streams.

Pravega connectors are used to build end-to-end stream processing pipelines that use Pravega as the stream storage and message bus, and analytic engines for computation over the streams. For more information, see Pravega concepts: https://pravega.io/docs/nightly/pravega-concepts/.

Supported Pravega connectors

Currently, Pravega offers the following connectors:

The Flink Connector enables building end-to-end stream processing pipelines with Pravega in Apache Flink. This also allows reading and writing data to external data sources and sinks via Flink Connector.

The Spark Connector to read and write Pravega Streams with Apache Spark, a high-performance analytics engine for batch and streaming data. The connector can be used to build end-to-end stream processing pipelines that use Pravega as the stream storage and message bus, and Apache Spark for computation over the streams.

The Hadoop Connector implements both the input and the output format interfaces for Apache Hadoop. It leverages the Pravega batch client to read existing events in parallel; and uses write API to write events to Pravega streams.

The Presto Connector is a distributed SQL query engine for big data. Presto uses connectors to query storage from different storage sources. This connector allows Presto to query storage from Pravega streams.

The Boomi Connector is a Pravega connector for the Boomi AtomSphere.
The NiFi Connector is a Pravega connector to read and write Pravega streams with Apache NiFi.

Pravega Keycloak client for Maven Central Repository

The Pravega Keycloak Client implements the client_credentials OAuth2 flow and communicates with Keycloak to obtain JWT tokens.

Creating Pravega streams in SDP

Before applications can read or write streams to Pravega, streams must be defined and configured in the SDP. Defining and configuring a stream is most easily done in the UI, but streams can also be created manually.

A stream's full name consists of scope-name/stream-name. Therefore, a stream name must be unique within its scope and conform to Kubernetes naming conventions.

Users with the platform administrator role or project-members can create streams within existing scopes in the UI or manually using kubectl commands.

Users can select existing streams or define new ones for the project. Streams defined this way automatically appear in the list of streams under the project scope in the Pravega section of the UI.

Streams defined within a project-specific scope can only be referenced by applications in that project.

See the Dell Technologies Streaming Data Platform Installation and Administration Guide for information on creating scopes, resource management, and administrative tasks to ensure that adequate storage and processing resources are available to handle stream volumes and analytics jobs.


Create data streams

Both platform administrators and project-members can create data streams. All stream names must conform to Kubernetes naming conventions.

A stream name must be unique within the scope. All user-provided identifiers in the SDP must conform to the Kubernetes naming conventions as follows:

Names must consist of lowercase alphanumeric characters or a dash (-)
Names must start and end with an alphanumeric character
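For reference, these rules correspond to the Kubernetes DNS-1123 label pattern; a quick, illustrative check could look like this:

import java.util.regex.Pattern;

public class StreamNameCheck {
    // Lowercase alphanumerics or dashes, starting and ending with an alphanumeric.
    private static final Pattern KUBE_NAME =
            Pattern.compile("^[a-z0-9]([-a-z0-9]*[a-z0-9])?$");

    public static boolean isValid(String name) {
        return KUBE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("sensor-readings")); // true
        System.out.println(isValid("Sensor_Readings")); // false: uppercase and underscore
    }
}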

Create a stream from an existing Pravega scope

You can create a data stream from an existing scope. Platform administrators must first create the scope before project-members can create streams.

Steps

1. Log in to the SDP.

2. Click the Pravega icon in the banner.

A list of existing scopes appears.

3. Click a scope.

A list of existing streams in that scope appears.

4. Click Create Stream.

5. Complete the configuration screen as described in Stream configuration attributes.

Red boxes indicate errors.

6. When there are no red boxes, click Save.
The new stream appears in the Pravega Streams table. The new entry includes the following available actions:
Edit: Use this action to change the stream configuration.
Delete: Use this action to remove the stream from the scope.

Stream access rules

Streams are protected resources. Access to streams is protected by the Pravega scope and project membership.

Stream access is protected as shown in the following table:

Access Description

Scope access If the scope was created in the Pravega section of the SDP UI, it has a list of project-members associated with it. This is a project-independent scope, and only those project-members and platform administrator users listed can access the streams in the scope.

Project membership If the scope was created under a project and an application in the Analytics section of the UI, it has project-members associated with it.

Only project-members and platform administrator users can add and view streams in the scope.

Stream configuration attributes

The following tables describe stream configuration attributes, including segment scaling attributes and retention policy.

General

Property Description

Name Identifies the stream. The name must be unique within the scope and conform to Kubernetes naming conventions. The stream's identity is:

scopename/streamname

Scope The scope field is preset based on the scope you selected on the previous screen and cannot be changed.

Segment Scaling

A stream is divided into segments for processing efficiency. Segment scaling controls the number of segments used to process a stream.

There are two scaling types:

Dynamic: With dynamic scaling, the system determines when to split and merge segments for optimal performance. Choose dynamic scaling if you expect the incoming data flow to vary significantly over time. This option lets the system automatically create additional segments when data flow increases and decrease the number of segments when the data flow slows down.

Static: With static scaling, the number of segments is always the configured value. Choose static scaling if you expect a uniform incoming data flow.

You can edit any of the segment scaling attributes at any time. It takes some minutes for changes to affect segment processing. Scaling is based on recent averages over various time spans, with cool down periods built in.

Scaling type Scaling attributes Description

Dynamic Trigger Choose one of the following as the trigger for scaling action:
Incoming Data Rate: Looks at incoming bytes to determine when segments need splitting or merging.
Incoming Event Rate: Looks at incoming events to determine when segments need splitting or merging.

Minimum number of segments The minimum number of segments to maintain for the stream.

Segment Target Rate Sets a target processing rate for each segment in the stream.
When the incoming rate for a segment consistently exceeds the specified target, the segment is considered hot, and it is split into multiple segments.
When the incoming rate for a segment is consistently lower than the specified target, the segment is considered cold, and it is merged with its neighbor.

Specify the rate as an integer. The unit of measure is determined by the trigger choice:
MB/sec when Trigger is Incoming Data Rate. Typical segment target rates are between 20 and 100 MB/sec. You can refine your target rate after performance testing.
Events/sec when Trigger is Incoming Event Rate. Settings would depend on the size of your events, calculated with the MB/sec guidelines above in mind.

To figure out an optimal segment target rate (either MB/sec or events/sec), consider the needs of the Pravega writer and reader applications:
For writers, you can start with a setting and watch latency metrics to make adjustments.
For readers, consider how fast an individual reader thread can process the events in a single stream. If individual readers are slow and you need many of them to work concurrently, you want enough segments so that each reader can own a segment. In this case, you need to lower the segment target rate, basing it on the reader rate, and not on the capability of Pravega. Be aware that the actual rate in a segment might exceed the target rate by 50% in the worst case.

Scaling Factor Specifies how many colder segments to create when splitting a hot segment.

Scaling factor should be 2 in nearly all cases. The only exception would be if the event rate can increase 4 times or more in 10 minutes. In that case, a scaling factor of 4 might work better. A value higher than 2 should only be entered after performance testing shows problems.

Static Number of segments Sets the number of segments for the stream. The number of segments used for processing the stream does not change over time, unless you edit this attribute. The value can be increased and decreased at any time.

We recommend starting with 1 segment and increasing only when the segment write rate is too high.

Retention Policy

The toggle button at the beginning of the Retention Policy section turns retention policy On or Off. It is Off by default.

Off (Default): The system retains stream data indefinitely.
On: The system discards data from the stream automatically, based on either time or size.

Retention Type Attribute Description

Retention Time Days The number of days to retain data. Stream data older than Days is discarded.

Retention Size MBytes The number of MBytes to retain. The remainder at the older end of the stream is discarded.
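For orientation, these attributes also map onto the open-source Pravega Java client API. The following is a minimal sketch with placeholder names and rates; note that within SDP, streams are normally defined through the UI or Kubernetes resources rather than directly through this API.

import java.net.URI;
import java.time.Duration;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class CreateStreamExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090")) // placeholder
                .build();

        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            // Dynamic scaling by event rate: target 1000 events/sec per segment,
            // scaling factor 2, minimum of 1 segment. Retain 7 days of data.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(1000, 2, 1))
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build();

            // Assumes the scope already exists (created by the platform administrator).
            streamManager.createStream("my-project", "sensor-readings", config);

            // Static scaling would instead use ScalingPolicy.fixed(numberOfSegments).
        }
    }
}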

List streams (scope view)

List the Pravega streams for the project scopes you have access to.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope displays.


Video index streams are named with an "-index" suffix.

NOTE: The Retention Policy must be "infinite" for video and video index streams.

View streams

View Pravega stream writers and readers for the project scopes you have access to.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A view of existing writers/readers, transactions, and segment scaling displays.

With data flowing through, streaming data with no transactions looks similar to the following in the SDP UI (this is a truncated view).


Edit streams

Edit data streams using the SDP UI.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope appears.

3. Click the stream name.

4. On the Create Stream view, edit the configuration screen as detailed in the Stream configuration attributes section.


5. After editing the settings, wait until there are no red boxes and then click Save.
The new stream displays and you have the following available actions:
Edit: Change the stream configuration.
Delete: Remove the stream from the scope.

6. Click the stream name to view the Pravega stream.

Delete streams

Delete a Pravega stream using the SDP UI.

Steps

1. Log in to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope appears.

3. Click the stream name, and then click Delete under the Actions column.

Start and stop stream ingestion

Native Pravega, Apache Flink, or Apache Spark applications control stream ingestion.

The SDP creates and deletes scope and stream entities and monitors various aspects of streams. The system does not control stream ingestion.


Monitor stream ingestion

Monitor performance of stream ingestion and storage statistics using the Pravega stream view in the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name.

View the following:

Ingestion rates
General stream parameter settings
Segment heat charts showing segments that are hotter or colder than the current trigger rate for segment splits. For streams with a fixed scaling policy, the colors on the heat chart can indicate ingestion rates. The redder the segment is, the higher the ingestion rate.

About Grafana dashboards to monitor Pravega

Only users in the platform administrator role can access Grafana dashboards.

Stream metrics displayed on the Pravega dashboard include throughput, readers and writers per stream (byte and event write rates as well as byte read rates), and transactional metrics like commits and cancels. Reader and writer counts, which would be the application-specific metrics, are not available.

For more information about Grafana dashboards, Pravega, and administrator access, see the Dell Technologies Streaming Data Platform Installation and Administration Guide .


Working with Apache Flink Applications

This section describes using Apache Flink applications with the SDP.

Topics:

About hosting artifacts in Maven repositories
About Apache Flink applications
Application lifecycle and scheduling
View status and delete applications

About hosting artifacts in Maven repositories

SDP supports Custom Path as a source option, which allows users to reference an application file already found on the Flink image. Each analytics project has a local Maven repository that is deployed within its namespace for hosting application artifacts that are defined in the project.

Apache Flink and Spark look at the SDP to resolve artifacts when launching an application. The SDP also supports external Enterprise Repositories, allowing the SDP to pull artifacts from Maven repositories external to SDP.

Project-members can upload/download applications and artifacts to the Maven repository for the project(s) they have been provided access to using the SDP UI or through Maven/Gradle using HTTP basic authentication.

Platform administrators can upload/download applications and artifacts across all projects and Maven repositories.

The Maven repository API is protected using HTTP basic authentication. Both Maven and Gradle can be configured to supply basic authentication credentials when accessing the API. For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide

Applications must be built with the following runtime dependency, which is the library that allows authentication to Keycloak.

...
<repository>
    <url>https://oss.jfrog.org/artifactory/jfrog-dependencies</url>
</repository>
...
<properties>
    <pravega.keycloak.version>0.6.0-6.1a4bae7-SNAPSHOT</pravega.keycloak.version>
</properties>
...

Coordinates:

<dependency>
    <groupId>io.pravega</groupId>
    <artifactId>pravega-keycloak-client</artifactId>
    <version>${pravega.keycloak.version}</version>
</dependency>

About artifacts for Flink projects

Artifacts can be of two types: Maven or File. Before applications can be deployed to a cluster, the JAR files must be uploaded.

The SDP platform administrator is allowed to upload and download artifacts for any project. A project-member can upload and download artifacts as well, but only for the projects they are members of.

You can upload artifacts by directly publishing using the Maven CLI mvn tool or Gradle using basic authentication, or by uploading the files using the SDP UI.

There are two types of repositories:

Local repositories can be internal repositories set up on a file or HTTP server within your company. These repositories are used to share private artifacts between development teams and for releases.


Remote repositories refer to any other type of repository accessed by various protocols, such as http://. These repositories might be a remote repository set up by a third party to provide their artifacts for downloading. For example, repo.maven.apache.org is the location of the Maven central repository. Within SDP we refer to these as External Maven repositories. They are external to the SDP cluster.

External Maven repositories are configured as part of the project definition.

SDP attempts to resolve artifacts in the following order with the first match being used:

1. Project Artifact Repository
2. Project External Maven Repositories, in the order specified in the project configuration

Local and remote repositories should be structured similarly, so that scripts can run on either side and can be synchronized for offline use. The layout of the repositories should be transparent to the user.

NOTE: A user in the platform administrator role configures the volume size during creation of an analytics project. The Maven volume size is the anticipated space requirement for storing application artifacts that are associated with the project. The SDP provisions this space in the configured vSAN.

For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide

About Maven repositories

Each Analytics Project has a Maven repository that is deployed inside its namespace. This repository is where Apache Flink and Java-based Spark applications look when resolving artifacts while launching an application. The Maven repository API is protected using basic access authentication (Basic Auth). You can configure both mvn and Gradle to supply Basic Auth credentials when accessing the API.

Project members can upload and download applications and artifacts to the Maven repository for their projects. Platform administrators can upload and download applications and artifacts across all projects and Maven repositories. The standard Maven-publish Gradle plugin must be added to the project in addition to a credentials section.

Accessing and configuring Maven repositories

About this task

There is direct external access to the Maven repository from outside the cluster, with appropriate credentials.

Use the following steps to configure a mvn project with a pom.xml file.

Steps

1. Edit your ~/.m2/settings.xml file to contain a server entry similar to the following sample. Credentials are for a project member.

<servers>
    <server>
        <id>nautilus</id>
        <username>user1</username>
        <password>user1-password</password>
    </server>
</servers>

2. In the pom.xml file of the project, add a distributionManagement section similar to the following.

<distributionManagement>
    <repository>
        <id>nautilus</id>
        <name>nautilus</name>
        <url>https://repo.example.lab.com/maven2</url>
    </repository>
</distributionManagement>


NOTE: In the url field, the hostname must include the prefix repo, as shown in the External access and network exposure table in the Dell Technologies Streaming Data Platform Security Configuration Guide.

Results

Now you can publish a project using the mvn deploy command.

Configuring Maven for a .jar file with no pom.xml file

If you must publish a .jar file without a pom.xml file, use the following procedure.

About this task

NOTE: This procedure applies to .jar files which you are not compiling and for which you do not have a pom.xml file.

Steps

1. Use the :deploy-file functionality of the mvn deploy plug-in with -D arguments to supply the request parameters to mvn.

2. The command should be similar to the following:

mvn deploy:deploy-file \
    -DgroupId=com.my.org \
    -DartifactId=helloworld \
    -Dversion=1.0 \
    -Dpackaging=jar \
    -Dfile=helloworld-params-1.0-SNAPSHOT.jar \
    -DrepositoryId=nautilus \
    -Durl=https://repo.example.lab.com/maven2

In the -Durl argument, the hostname must include the prefix repo, as shown in the table in External access and network exposure.

Results

Now you can publish a .jar file.

Configuring a Gradle project to publish to a Maven repository

About this task

To configure a Gradle project to publish to a Maven repository using Basic Auth, add a publishing section to your build.gradle file, and include a credentials section.

Steps

1. Open your build.gradle file.

2. Add a publishing section similar to the following sample. The credentials are for a project member.

build.gradle

apply plugin: "maven-publish" publishing { repositories {

60 Working with Apache Flink Applications

maven { url = "https://repo.example.lab.com/maven2" credentials { username "user1" password "mypassword" } authentication { basic(BasicAuthentication) } } }

publications { maven(MavenPublication) { groupId = 'com.dell.com' artifactId = 'anomaly-detection' version = '1.0'

from components.java } } }

NOTE: In publishing.repositories.maven.url, the hostname must include the prefix repo, as shown in the External access and network exposure table in the Dell Technologies Streaming Data Platform Security Configuration Guide.

Results

Now you can publish a project using the gradle command, similar to the following:

./gradlew :anomaly-detection:publishMavenPublicationToMavenRepo

Upload Flink application artifacts

Perform the following steps to upload Flink application artifacts using the SDP UI.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name and click the Artifacts tab.

3. Select either Maven or File to view those artifacts.

4. Click Upload Artifact to upload a Maven artifact or Upload File to upload a file.

The upload button displayed depends on the choice of Maven or Files button.


Figure 10. Upload artifact

5. Click Upload Artifact to upload the artifact.

Deploy new Flink applications

Perform the following steps to deploy a Flink application using the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Analytics > Analytics Projects > Flink > Apps > project-name > Create New App.


In the form that appears, specify the name for the application, the artifact source and main class, and the following configuration information:

Field Description

General Provide the name for the application.

Language Either Java, Scala or Python. SDP provides experimental support for Python Flink applications.

Type Either Streaming or Batch. Provides a hint to SDP on how it should manage the application.

Source Source selections:
Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link and upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button to the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu.
File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose any type of file to upload other than JAR.
Custom Path: The Main Application File can be a URL, image path, or other location.

Main Class This choice is optional. This class is the program entry point. This class is only needed if it is not specified in the JAR file manifest.

Configuration Parallelism: Enter the number of tasks to be run in parallel.

NOTE: This number is the default parallelism if it has not been explicitly set in the application.

Flink Version: Choose the version of the Flink cluster where the application is to run.
Add Property: Associate key-value pairs.
Add Stream: Enter a key for the values to be extracted from a specified Pravega scope and stream. You can specify an existing stream or create a new stream.

State Strategy Either Savepoint or None.
Stateful applications should use Savepoint, which tells SDP to create a savepoint whenever the application is stopped.
Stateless applications should use None, which tells SDP to simply cancel the current Flink job without saving state when the application is stopped.

Restore Strategy Either Latest, Latest Savepoint, or None. Informs SDP what state it should use when starting an application.
Latest uses the latest savepoint regardless of the origin (either savepoint or checkpoint). This is the default.
Latest Savepoint tells SDP to use the latest savepoint that has an origin of Savepoint.
Stateless applications should use None to start the application from no previous state.

Allow Non-Restore State Either True or False.
All operators within a Flink application have a unique ID which is used when saving and restoring state. If an application is modified by adding a new operation, the Flink job graph will have a new ID which is not in the saved state.
By default, Flink will reject starting an application if an operator is found in the job that has no state in the savepoint. This can be overridden by setting Allow Non-Restore State to true.
For details, see https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/savepoints/#allowing-non-restored-state
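Because this behavior depends on stable operator IDs, it is common to assign them explicitly in the application rather than relying on Flink's auto-generated IDs. A minimal, hypothetical illustration:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StableOperatorIds {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("a", "b", "c");

        // Explicit uid() values keep operator state addressable across job
        // changes, so savepoints taken before an upgrade still match.
        input.map(String::toUpperCase)
             .uid("uppercase-mapper")
             .print()
             .uid("console-sink");

        env.execute("stable-operator-ids");
    }
}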

3. If you do not see the version of the artifact in the Artifact drop-down, click the Upload link to upload a new version of your application. This choice takes you to the Upload Artifact view. When the upload is complete, you can change the version of the artifact by selecting a different version in the Artifacts drop-down menu.

4. Specify a Main Class if you did not specify one in the JAR manifest file.

5. Under the Configuration section, under Parallelism, enter the number of tasks that run in parallel.

6. Click Add Property to provide additional properties, such as key pairs. For example, the Key might be City and the Value might be the name of a city.

7. Click Add Stream to add a new stream to the analytics application. Enter your stream name, segment scaling, and retention policy choices in the Create Stream form. For details about completing this form, see the stream configuration attributes section of this guide.

8. Click Save to save your stream settings. This choice creates a new stream within your project and takes you back to the Create New App form.

9. Ensure that all the sections of the form have been completed as required and then click Create.


10. You are returned to the Analytics Project screen, where you can view the status of the deployed application and its properties, savepoints, events, and logs views. NOTE: While the application is running, you cannot start or stop it; however, you have the option under the Actions column to Edit or Delete it.

View and delete artifacts

Perform the following procedure to view and delete project artifacts.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name and click the Artifacts tab. The Artifacts view is displayed.

3. Click the Maven or the Files button to display artifacts or files.

4. Under the Actions column, click Delete to delete the artifact or file from the project.

5. Click Download to download a copy.

About Apache Flink applications

Stream processors, such as Apache Flink, support event-time processing to produce correct and deterministic results with the ability to process large amounts of data in real time.

These analytics engines are embedded in the SDP, providing a seamless way for development teams to deploy analytics applications.

The SDP provides a managed environment for running Apache Flink Big Data analytics applications, providing:

Consistent deployment
A UI to stop and start deployments
An easy way to upgrade clusters and applications

NOTE: This guide explicitly does not go into how to write streaming applications. To learn about developing stream processing applications using Apache Flink or Apache Spark, see the following:

Software Link

SDP Code Hub code samples, demos, and connectors For a description, see the section in this guide: SDP Code Hub streaming data examples and demos

Apache Flink open-source documentation https://flink.apache.org

Ververica Platform documentation https://www.ververica.com/

Pravega concepts documentation and Pravega workshop examples

http://pravega.io/docs/latest/pravega-concepts/

https://streamingdataplatform.github.io/code-hub/

Application lifecycle and scheduling

A streaming data application flows through several states during its lifecycle.

Apache Flink life cycle

Once a streaming data application is created using the SDP UI, the Apache Flink stream processing framework manages it.

This framework determines the state of the application and what action to take. For example, the system can either deploy the application or stop a running application.


Lifecycle Description

Scheduling Scheduling is the process of locating a running Flink cluster on which to deploy the application as a Flink job. If multiple potential target clusters are found, then one of the clusters is chosen at random. The application remains in a scheduling state until a target cluster is identified.

The scheduler uses two pieces of information to match an application to a cluster:

1. Flink Version: For a cluster to be considered a deployment target for an application, the Flink versions must match. The flinkVersion specified in the FlinkApplication descriptor and the flinkVersion in the image that the FlinkCluster is based on must be the same.

2. Labels: The Flink version is enough to match an application with a cluster; however, labels can also be used to provide more precise cluster targeting. Standard Kubernetes resource labels can be specified on the Flink resource. An optional clusterSelector label can be specified within an application descriptor to target a cluster.

Starting Starting is an important state indicating that the application has been deployed on the Flink cluster but not yet fully running. Flink is still scheduling tasks.

Started Once a target cluster has been identified, the application is deployed to the target cluster. The process requires two steps and can take several minutes as follows:

1. The Maven artifact is first resolved against the Maven repository to retrieve the JAR file.

2. The application is then deployed to the target Flink cluster.

NOTE: If an application has been stopped and is started again, the Flink job is started according to the configured restore strategy (recoveryStrategy).

Savepointing If an application that has been started is modified, the framework responds by first stopping the corresponding Flink job. What happens after a running Flink job is stopped depends upon the stateStrategy.

A Flink job that has state is canceled with a savepoint.

When an application is stateless, there is no state to save.

The None state can be used in situations where the application is stateless.

Stopped Once savepointing is completed, the application is returned to a stopped state. The state field in the application specification determines what happens next.

If the state is stopped, then the application remains in a stopped state until the specification is updated.

If the state is started, the application returns to the scheduling state and the deployment process restarts.

Error If the application is placed in an error state, see more about the error using the SDP web interface or application resource. For more information about debugging applications in an error state, see the section in this guide Troubleshooting deployments.


Lifecycle Description

Cancelling The user has set the goal of the application to Cancelled, indicating that they want to stop the job while ignoring the savepoint strategy. The job is currently being stopped on the Flink cluster.

Cancelled The application has been successfully cancelled.

View status and delete applications

After you have deployed Flink clusters and applications, you can view their properties or delete them.

View and edit Flink applications

Log in to the SDP and go to Analytics > project-name. Click the Apps tab. You can also select Flink Apps from the Dashboard.

View and edit the following:

Application name
State of the application, for example, Scheduling or Stopped
Name of the Flink cluster the application is running on
Cluster selectors, such as custom labels
Flink version
The time and date the application was created
Available actions that you can perform on the application, including Start, Stop, Edit, and Delete

View application properties

Log on to the SDP and navigate to Analytics > project-name. Click the Properties tab.

Navigate the blue links at the top to view events, logs, the Flink UI, and the following:

Category Description

Source The Main class from the JAR manifest

Configuration Flink version
Cluster Selectors, such as custom labels
Job ID
Parallelism
Parameters

View application savepoints

Log on to the SDP and navigate to Analytics > project-name. Click the Savepoints tab.

Navigate the blue links at the top to view application properties, events, logs, view running applications in the Flink UI, and the following:

Name of savepoint
Path to savepoint
When the savepoint was created
State of the savepoint
Available actions, for example, delete


Table 5. Application state management

Category Description

Checkpoint Status As an stateful application progresses through the stream it will continually take periodic checkpoints. These checkpoints contain the state at a point in time and can be used by Flink to recover the application to a last known successful state in the event of failure.

NOTE: In the case of Flink applicaitons processing Pravega Streams state management is crucial for correctly processing the stream. During checkpointing the current position in the stream is saved to the state and reset if the application is recovered.

Flink will continue to process the stream even if checkpoints fail, however if too many checkpoints fail since the last successful checkpoint, it could result in a large amount of the streaming having to be reprocessed if a failure happens.

In the Application list, SDP shows a Checkpoint column that indicates the current status of checkpoints with the following values:

OK: All five of the last checkpoints succeeded
Warning: One of the last five checkpoints failed
Error: Five or more of the last consecutive checkpoints failed

Checkpoint Tracking If an application is using Retained Checkpoints, SDP tracks the latest retained checkpoints as Application Savepoints. These Savepoints have an origin of checkpoint and are constantly updated as the application progresses.

This allows users to recover from situations where the application could not be savepointed (perhaps due to some environment or application issue) and had to be cancelled.

Savepoints Savepoints now have an Origin field indicating what created the savepoint, either Savepoint for a manually requested savepoint or checkpoint for an automatically created savepoint from a retained checkpoint.

Metrics If a project has the Metrics feature added, SDP will create a new dashboard for tracking the Flink Applications. These metrics (available from the Project Feature Metrics link) provide further insight into checkpointing activity with information such as checkpoint time and checkpoint size.

Flink UI Detailed information about the checkpoint state can be found in the Flink UI.

For more information about savepoints, see Understanding checkpoints and savepoints

View application events

Log on to the SDP and navigate to Analytics > project-name. Click the Events link.

View all application event messages, the type and reason for the event. For example, Launch or Stop.

You can also view when the event was first and last seen, and the count of events.


View application deployment logs

Log on to the SDP and navigate to Analytics > project-name > app-name. Click the Logs link to view the deployment logs.

All logging by task managers is contained in the logs of the target Flink cluster.

For information about troubleshooting logs, see the section Application logging.

Users in the platform administrator role can access all of the system logs. For more information, see the section "Monitor application health" in the Dell Technologies Streaming Data Platform Installation and Administration Guide .

View deployed applications in the Apache Flink UI

The SDP UI contains direct links to the Apache Flink UI.

Apache Flink provides its own monitoring interface, which is available from within the SDP UI. The Apache Flink UI shows details about the status of jobs and tasks within them. This interface is useful for verifying application health and troubleshooting.

To access the Apache Flink task managers view from the SDP UI, log on to the SDP and navigate to Analytics > Analytics Projects > project_name > flink_cluster_name.

View the Flink UI

You can view a deployed application in the SDP and Apache Flink UI.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > Analytics Projects > project-name > app-name.

3. Click the Flink tab to open the job view in the Flink UI.

You can review:
Status of running task managers
Utilized and available task slots
Status of running and completed jobs


Working with Apache Spark Applications This chapter describes using Apache Spark™ applications with the Streaming Data Platform (SDP).

Topics:

About Apache Spark applications
Apache Spark terminology
Apache Spark life cycle
About Spark clusters
Create a new Spark application
Create a new artifact
View status of Spark applications
View the properties of Spark applications
View the history of a Spark application
View the events for Spark applications
View the logs for Spark applications
Deploying Spark applications
GPU-accelerated workload support for Spark
Troubleshooting Spark applications

About Apache Spark applications The SDP 1.2 release introduces support for Apache Spark™. The Apache Spark integration allows users to manage applications through Kubernetes resources or within the SDP UI.

Apache Spark integration with SDP

Highlights include:

Support for Java, Scala, and Python (PySpark)
Spark analytic metrics within the SDP UI
Seamless integration with Apache Spark checkpoints
Parallel Readers and Writers supporting high-throughput and low-latency processing
Exactly-once processing guarantees for both Readers and Writers for micro-batch streaming

Developer community support

If you need help developing Spark applications with Pravega, contact the developer community on the Pravega Slack channel. To learn more about developing stream processing applications using Apache Spark:

See the Apache Spark open-source documentation at https://spark.apache.org.
See SDP PySpark workshop examples at https://github.com/StreamingDataPlatform/workshop-samples/tree/master/spark-examples.
For information about developing Spark applications locally and testing with Pravega in a sandbox environment before deploying to SDP, see: https://github.com/pravega/pravega-samples/blob/spark-connector-examples/spark-connector-examples/README.md.



Apache Spark terminology The following terms are useful in understanding Apache Spark applications.

Spark application An analytic application that uses the Apache Spark API to process one or more streams.

Spark job A running Spark application. A Spark job consists of many running tasks.

Spark task A Spark task is the basic unit of work in a job. Each task is performed by a single thread.

Batch processing The input data is partitioned and queried on executors before the result is written. The job ends when the query has finished processing the input data.

Streaming micro-batch A Spark micro-batch reader allows Spark streaming applications to read Pravega streams and creates a micro-batch of events that are received during a particular time interval.

SDP supports micro-batch streaming that can use either exactly-once or at-least-once semantics.

Pravega stream cuts (offsets) are used to reliably recover from failures and provide exactly-once semantics.

A query processes the micro-batch, and the results are sent to an output sink. Spark starts processing the next micro-batch when the previous micro-batch ends, as illustrated in the sketch after this table.

Streaming continuous SDP 1.2 and Pravega do not support continuous streaming with Apache Spark applications.

Block manager Responsible for tracking and managing blocks of data between executor nodes. Data is transferred directly between executors.

Spark SQL A layer built on standard Spark RDDs and DataFrames that allows applications to use SQL to perform queries. Known as Structured Streaming when used with Apache Spark streaming.
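The micro-batch model described in the table can be illustrated with a short, generic Structured Streaming sketch. It uses only Spark's built-in rate source and console sink, so it is not SDP- or Pravega-specific; it simply shows how each trigger interval produces one micro-batch that a query processes and sends to an output sink.

# Minimal micro-batch Structured Streaming sketch using only built-in
# Spark sources and sinks (not SDP- or Pravega-specific).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

# The rate source generates rows with a timestamp and an increasing value;
# each trigger interval produces one micro-batch of these rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A query processes each micro-batch and sends the result to a sink
# (the console here). Spark starts the next micro-batch when the
# previous one ends.
query = (events
         .selectExpr("timestamp", "value")
         .writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="3 seconds")
         .start())

query.awaitTermination()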

Apache Spark life cycle The Spark application life cycle consists of goal and transient states.

Once a streaming data application is created, the Spark stream processing framework determines the state of the application and what action to take. For example, the system can either deploy the application or stop a running application. The following shows the life cycle of a SparkApplication.

The SparkApplication state field informs SDP of the intended state of the application. SDP attempts to get the application into the wanted state based on its current state. Transient states are stages that the application goes through when attempting to reach a goal state.

Life cycle Description

Submitting Submitting is the first transient state. This state is repeated if the application fails.

Running If the state is running, SDP deploys the application.

Failed

Stopping Stopping is a transient state with the goal state of stopped.

Stopped If the state is stopped, the application is not deployed. SDP does nothing to the application.

Finished Finished is the goal state.

The goal state of finished is intended for batch applications. A user can resubmit an application in the finished state by first changing the state to stopped and then back to finished.


About Spark clusters The Spark cluster is the application.

With Spark, there is no concept of a session cluster as there is with Flink. The Spark cluster is split into the driver and the executors. The driver runs the main application and manages the scheduling of tasks. The executors are workers that perform tasks. SparkContext is the cluster manager and is created as the first action in any Spark application. SparkContext communicates with Kubernetes to create executor pods. Application actions are split into tasks and scheduled on executors.
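The driver/executor split can be sketched with a few lines of PySpark. This is a minimal, generic sketch rather than an SDP-specific program: on SDP the session is normally created without any explicit cluster configuration, because the platform supplies it when the application is submitted.

# Minimal sketch: the driver runs this main program and creates the
# SparkSession (and with it the SparkContext, which talks to the cluster
# manager and requests executor pods).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()

# Transformations only describe work; the count() action below is split
# into tasks that the driver schedules on the executors.
df = spark.range(0, 1_000_000)
even_count = df.filter("id % 2 = 0").count()

print(f"even numbers: {even_count}")
spark.stop()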

Create a new Spark application Perform the following steps to create a new Spark application using the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Analytics > Analytics Projects > project-name > Spark > Create New Spark App. Click the Create New Spark App button.

Figure 11. Create new Spark app

The Spark app form displays.


Figure 12. Spark app form

In the form that is displayed, specify the following configuration information:

Field Description

General Provide the type, name for the application, and its state.

Source Source selections:
Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link. Upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button to the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu.
File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose any type of file to upload other than JAR.
Custom Path: The Main Application File can be a URL, image path, or other location.
Main Class: This choice is optional. This class is the program entry point. This class is only needed if it is not specified in the JAR file manifest.

Configuration
Runtime: Select the Spark version from the drop-down menu.
Executors: Choose the number of executors.
Executor Memory: The amount of memory is in GB.
Executor Cores: This selection is the number of executor cores.
Executor GPUs: Enter the number of NVIDIA GPUs installed. GPU-accelerated support for Spark applications enables customers to use video processing, stream processing, and machine learning on analytic workloads.
Add Argument: This list of string arguments is passed to the application.
Add Configuration: Add a user-defined key/value pair parameter as a string/string.

Local Storage
Storage Type: Select Ephemeral or Volume Based.
Volume Size: This size is in GB.

Driver
Memory: This entry is the amount of memory in GB.
Cores: This entry is the number of cores.
Storage Type: Select Ephemeral or Volume Based.
Volume Size: This entry is the size in GB.

Working Directory
File Artifacts: Select the Spark version from the drop-down menu.
Additional Custom Path: This entry is the custom path JAR dependencies to be added to the classpath.

Additional Jars
For additional jars to be added to the classpath:
Maven Jars: Select the Spark version from the drop-down menu.
File Jars: Choose the number of executors.
Additional Custom Path: Custom path JAR dependencies to be added to the classpath.

3. Ensure that all the sections of the form have been completed as required and then click Create.

4. You are returned to the Analytics Project screen where you can view the status of the Spark application. Note: While the application is running, you cannot start or stop it; however, you have the option under the Actions column to Edit or Delete it.

Create a new artifact

About this task

Artifacts are added to an analytics project on the Create New Spark App form when you define an application in the SDP UI. You can upload Maven JAR files, individual application files, or custom paths to the SDP UI.

To manage Maven dependency artifacts, use one of the following two methods, as outlined in Steps #1 and #2.

Steps

1. Method A: Git clone the repository for spark-connectors and pravega-keycloak:

Run the following gradlew command to build the artifacts:

./gradlew install

and then upload the JAR file using the SDP UI as shown in Step #3.

2. Method B: Download the pre-built Spark connector for Pravega artifacts from Maven Central and/or JFrog:
pravega-connectors-spark-3.0
pravega-connectors-spark-2.4
JFrog (snapshots from jfrog.org)


and then upload the JAR file using the SDP UI as shown in Step #3.

3. Upload artifacts using the Create New Spark App form under the Source section. Choose either Maven (for JAR files) or File (for Python or Scala application files) as follows.
Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link. Upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button to the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu. Click the Maven button.

Figure 13. Create Maven artifact

A screen similar to the following displays after uploading Maven artifacts:

Figure 14. Maven artifacts

File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose an application file to upload.

Figure 15. Create application file artifact

The selected file is uploaded when the artifact is created. A screen similar to the following displays after uploading application file artifacts:


Figure 16. Application file artifacts

4. Change any configurations as required, such as the Spark version, dependencies, the application name of the deployment yaml file, and so on.

5. Run the following kubectl commands on the cluster to apply the changes and deploy the application:

kubectl apply -f generate_data_to_pravega.yaml -n namespace
kubectl apply -f pravega_to_console_python.yaml -n namespace

NOTE: You must be a federated user with LDAP as the identity provider to run Kubernetes commands on the cluster. For more information, see the Configure federated user accounts section in the Dell Technologies Streaming Data Platform Installation and Administration Guide.

View status of Spark applications Select the Spark Apps tab to view the status of Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, select Spark Apps. You can also select Spark Apps from the Dashboard. View the following:

Application name
Type
State of the application, for example, Submitting, Running, or Stopped
Goal state
Runtime
Executors
Available actions that you can perform on the application. For example, Start, Stop, Edit, and Delete.

View the properties of Spark applications Select the Spark Apps tab to view the properties of Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application properties:

1. Select Spark Apps.
2. Select an app.
3. Select the Properties tab.


View the history of a Spark application Select the Spark Apps tab to view the history of a Spark application.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application history:

1. Select Spark Apps.
2. Select an app.
3. Select the History tab.

View the events for Spark applications Select the Spark Apps tab to view the events for Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application events:

1. Select Spark Apps.
2. Select an app.
3. Select the Events tab.

View the logs for Spark applications Select the Spark Apps tab to view the logs for Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application logs:

1. Select Spark Apps.
2. Select an app.
3. Select the Logs tab.

Deploying Spark applications To run on SDP, Apache Spark applications must be configured using the options described in the following table. All Spark applications must have the environment variables set appropriately for your environment. However, it is up to your application to retrieve the environment variables and provide them as options to the Spark session, as illustrated in the sketch after the table.

Apache Spark option Environment variable Purpose

controller PRAVEGA_CONTROLLER_URI This is the Pravega controller URI for your SDP cluster. For example, tcp://nautilus-pravega-controller.nautilus-pravega.svc.cluster.local:9090

scope PRAVEGA_SCOPE This is the Pravega scope, which is equal to the Analytics Project name.



allow_create_scope N/A This must be set to false since applications in SDP cannot create Pravega scopes.

checkpointLocation CHECKPOINT_DIR Stateful operations in Spark must periodically write checkpoints which can be used to recover from failures. The checkpoint directory identified by this environment variable should be used for this purpose. It will be highly available until the Spark application is deleted. This should be used even for Spark applications which do not use Pravega.
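The following hedged PySpark sketch shows one way an application can retrieve these environment variables and pass them as options. The stream name and the rate-source input are hypothetical, and the option names assume the open-source Spark connector for Pravega; treat it as a sketch rather than the definitive SDP sample.

# Sketch only: read the SDP-provided environment variables and pass them as
# options to the streaming query. The stream name and the rate source are
# hypothetical; option names assume the Spark connector for Pravega, and the
# stream is assumed to already exist in the project scope.
import os
from pyspark.sql import SparkSession

controller = os.environ["PRAVEGA_CONTROLLER_URI"]
scope = os.environ["PRAVEGA_SCOPE"]
checkpoint_dir = os.environ["CHECKPOINT_DIR"]

spark = SparkSession.builder.appName("pravega-writer-sketch").getOrCreate()

# Generate one timestamp event per second and write it to a Pravega stream.
(spark.readStream
      .format("rate")
      .option("rowsPerSecond", 1)
      .load()
      .selectExpr("cast(timestamp as string) as event")
      .writeStream
      .format("pravega")
      .option("controller", controller)
      .option("scope", scope)
      .option("stream", "streamprocessing1")         # hypothetical stream name
      .option("allow_create_scope", "false")         # SDP applications cannot create scopes
      .option("checkpointLocation", checkpoint_dir)  # required for recovery
      .outputMode("append")
      .start()
      .awaitTermination())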

Writing Apache Spark applications for deployment on SDP

Use the following example applications as a template when writing Apache Spark applications:

Python: stream_generated_data_to_pravega.py

Scala: StreamPravegaToConsole.scala

Uploading common artifacts to your analytics project

This task must be completed before running Apache Spark applications with SDP.

Steps

1. Download the Pravega Connectors for Spark 3.0 JAR from Maven Central.

2. Download the Pravega Connectors for Spark 2.4 JAR from Maven Central.

3. Download the Pravega Keycloak Client JAR from Maven Central.

4. Create an Analytics Project in SDP if you do not already have one. An Analytics Project can have any number of Flink and Spark applications.

NOTE: In the following steps, be careful to upload the correct file corresponding to the Maven artifact.

5. Upload the Pravega Connectors for Spark 3.0 JAR to SDP. a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following:

Maven Group: io.pravega
Maven Artifact: pravega-connectors-spark-3-0-2-12
Version: 0.9.0
File: pravega-connectors-spark-3.0_2.12-0.9.0.jar

c. Click Upload.

6. Upload the Pravega Connectors for Spark 2.4 JAR. a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following:

Maven Group: io.pravega
Maven Artifact: pravega-connectors-spark-2-4-2-11
Version: 0.9.0
File: pravega-connectors-spark-2.4_2.11-0.9.0.jar

c. Click Upload.

7. Upload the Pravega Keycloak Client JAR.
a. Click Analytics > Artifacts > Maven > Upload Artifact.
b. Enter the following:
Maven Group: io.pravega
Maven Artifact: pravega-keycloak-client
Version: 0.9.0
File: pravega-keycloak-client-0.9.0.jar
c. Click Upload.

8. You should now see the artifacts in the SDP UI.

Deploying Python applications using the SDP UI

Use the steps in this section to deploy Apache Spark applications in the SDP UI.

Steps

1. Create your Python Spark application file. For example, stream_generated_data_to_pravega.py will continuously write a timestamp to a Pravega stream. Save this file on your local computer.

2. Upload your Python Spark application.

a. Click Analytics > Artifacts > Files > Upload File. b. Enter the following:

Path: (leave blank)
File: stream_generated_data_to_pravega.py
Click Upload.

3. Create the Spark application.

a. Click Spark > Create New Spark App. b. Enter the following:

General
Type: Python
Name: stream-generated-data-to-pravega

Source
File
Main Application File: stream_generated_data_to_pravega.py

Configuration
Runtime: Spark 3.0.1
Executors: 1

Additional Jars
Maven Jars:
io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0
io.pravega:pravega-keycloak-client:0.9.0

c. Click Create.

4. You may view the application logs by clicking the Logs link.

NOTE: You can also run an application directly from a URL. Use the Source type of Custom Path, then enter a URL such as https://github.com/pravega/pravega-samples/blob/spark-connector-examples/spark-connector-examples/src/main/python/stream_generated_data_to_pravega.py

CAUTION: You should only run an application from a URL if you trust the owner of the URL.

Deploying PySpark applications using Kubectl

Much of the Spark application deployment process can be scripted with Kubectl.

Steps

1. Create your Python Spark application file. For example, stream_generated_data_to_pravega.py will continuously write a timestamp to a Pravega stream. Save this file on your local computer.

2. Use the SDP UI to create and upload your Python Spark application file; see the instructions in the previous section for details.


3. Create a YAML file that defines how the Spark application should be deployed. The following is an example YAML file that can be used to deploy stream_generated_data_to_pravega.py. Save this file as stream_generated_data_to_pravega.yaml on your local computer.

apiVersion: spark.nautilus.dellemc.com/v1beta1
kind: SparkApplication
metadata:
  name: stream-generated-data-to-pravega
spec:
  # Language type of application, Python, Scala, Java (optional: Defaults to Java if not specified)
  type: Python
  # Can be Running, Finished or Stopped (required)
  state: Running
  # Main Java/Scala class to execute (required for type: Scala or Java)
  # mainClass: org.apache.spark.examples.SparkPi
  # Signifies what the `mainApplicationFile` value represents.
  # Valid Values: "Maven", "File", "Text". Default: "Text"
  mainApplicationFileType: File
  # Main application file that will be passed to Spark Submit
  # (interpreted according to `mainApplicationFileType`)
  mainApplicationFile: "stream_generated_data_to_pravega.py"
  # Extra Maven dependencies to resolve from Maven Central (optional: Java/Scala ONLY)
  jars:
    - "{maven: io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0}"
    - "{maven: io.pravega:pravega-keycloak-client:0.9.0}"
  parameters:
    - name: timeout
      value: "100000"
  # Single value arguments passed to application (optional)
  arguments: []
  # - "--recreate"
  # Directory in project storage which will be handed to Application as CHECKPOINT_DIR environment
  # variable for Application checkpoint files (optional, SDP will use Analytic Project storage)
  # checkpointPath: /spark-pi-checkpoint-d5722b45-8773-41db-93a4-bab2324d74d0
  # Number of seconds SDP will wait after signalling shutdown to an application before forcibly
  # killing the Driver POD (optional, default: 180 seconds)
  gracefulShutdownTimeout: 180
  # When to redeploy the application
  # (optional, default: OnFailure, failureRetries: 3, failureRetryInterval: 10, submissionRetries: 0)
  retryPolicy:
    # Valid values: Never, Always, OnFailure
    type: OnFailure
    # Number of retries before Application is marked as failed
    failureRetries: 5
    # Number of seconds between each retry during application failures
    failureRetryInterval: 60
  # Defines shape of cluster to execute application
  # Reference to a RuntimeImage containing the Spark Runtime to use for the cluster
  runtime: spark-3.0.1
  # Driver Resource settings
  driver:
    cores: 1
    memory: 512M
    # Kubernetes resources that should be applied to the driver POD (optional)
    resources:
      requests:
        cpu: 1
        memory: 1024Mi
      limits:
        cpu: 1
        memory: 1024Mi
  # Executor Resource Settings
  executor:
    replicas: 1
    cores: 1
    memory: 512M
    # Kubernetes resources that should be applied to each executor POD
    resources:
      requests:
        cpu: 1
        memory: 1024Mi
      limits:
        cpu: 1
        memory: 1024Mi
  # Extra key value/pairs to apply to configuration (optional)
  configuration:
    spark.reducer.maxSizeInFlight: 48m
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
  # Custom Log4J Logging Levels (optional)
  logging:
    org.myapp: DEBUG
    io.pravega: DEBUG
    io.pravega.connectors: DEBUG

4. Apply the YAML file with kubectl.

NAMESPACE=sparktest1
kubectl apply -n $NAMESPACE -f stream_generated_data_to_pravega.yaml

5. Monitor the Spark application.

kubectl get -n $NAMESPACE SparkApplication/stream-generated-data-to-pravega

Related information

Deploying Python applications using the SDP UI on page 79

Deploying Java or Scala applications using the SDP UI

Use the steps in this section to deploy Java or Scala applications in the SDP UI.

Steps

1. Prepare your Java/Scala build environment.

2. Compile your Java or Scala source code to produce a JAR file. The example Scala application StreamPravegaToConsole.scala will read strings from a Pravega stream and log them. This will build the file pravega-samples/spark-connector-examples/build/libs/pravega-spark-connector-examples-0.9.0.jar.

git clone https://github.com/pravega/pravega-samples
cd pravega-samples
git checkout spark-connector-examples
cd spark-connector-examples
../gradlew build

3. Upload your Spark application JAR.

a. Click Analytics > Artifacts > Maven > Upload Artifact.
b. Enter the following. It is a best practice for the group, artifact, and version to match what is in the file, but this is not required.
Maven Group: io.pravega
Maven Artifact: pravega-spark-connector-examples
Version: 0.11.0
File: pravega-spark-connector-examples-0.11.0.jar

c. Click Upload.

4. Create the Spark application.

a. Click Spark > Create New Spark App.


b. Enter the following:
General
Type: Scala
Name: pravega-to-console-scala

Source
Maven Artifact: io.pravega:pravega-spark-connector-examples:0.9.0
Main Class: io.pravega.example.spark.StreamPravegaToConsole

Configuration
Runtime: Spark 3.0.1
Executors: 1

Additional Jars
Maven Jars:
io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0
io.pravega:pravega-keycloak-client:0.9.0

c. Click Create.

5. You may view the application logs by clicking the Logs link.

When running the above example, you will see output similar to the following every three seconds.

-------------------------------------------
Batch: 29
-------------------------------------------
+---------------------+--------+-----------------+----------+------+
|event                |scope   |stream           |segment_id|offset|
+---------------------+--------+-----------------+----------+------+
|2021-03-12 23:37:44.1|examples|streamprocessing1|0         |62582 |
|2021-03-12 23:37:46.1|examples|streamprocessing1|0         |62611 |
|2021-03-12 23:37:45.1|examples|streamprocessing1|0         |62640 |
+---------------------+--------+-----------------+----------+------+

GPU-accelerated workload support for Spark GPU-accelerated workload support for Spark applications introduced in SDP (1.3 and later versions) enables data scientists to use machine learning on analytic workloads.

Specialized GPU (graphics processing unit) processors with dedicated memory are used to process intensive deep learning AI workloads. SDP takes advantage of GPUs for video processing, stream processing, and machine learning. The GPU-accelerated workload adds support for Apache Flink and Apache Spark applications.

During SDP installation, the NVIDIA GPU operator updates the SDP node base operating system and the Kubernetes environment with the appropriate drivers and configurations for GPU access. The GPU Operator deploys Node Feature Discovery (NFD) to identify nodes that contain GPUs and installs the GPU driver on GPU-enabled nodes. For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.

CUDA libraries for GPU workloads

CUDA libraries are the API between the application and the GPU driver.

SDP (1.3 and later versions) ships with developer packages containing a Dockerfile and the binaries required for GPU workload support. The developer package contains an example CUDA Dockerfile. All workloads within SDP run within Docker containers, which must contain the CUDA toolkit libraries for GPU support.

NOTE: CUDA toolkit libraries must be compatible with the CUDA driver version. The version of the CUDA libraries depends on your environment, and the required libraries depend on the workload (for example, JCUDA libraries). For more information, see https://docs.nvidia.com/deploy/cuda-compatibility/


Example: Spark CUDA

ARG CUDA_VERSION=11.2.0

FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu20.04

# Install Java
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        # utilities for keeping Debian and OpenJDK CA certificates in sync
        ca-certificates p11-kit \
    ; \
    rm -rf /var/lib/apt/lists/*
...
ARG SPARK_VERSION=3.1.2
ARG HADOOP_MAJOR_VERSION=3.2
ARG HADOOP_VERSION=3.2.0
ARG SCALA_VERSION=2.12

RUN set -x && apt-get update && \
    apt install -y wget curl zip

RUN mkdir /opt/spark && \
    wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}.tgz && \
    tar -xzvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}/* /opt/spark && \
    rm -rf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}*
...
ENTRYPOINT [ "/opt/spark/kubernetes/dockerfiles/spark/entrypoint.sh" ]

ARG spark_uid=185
USER ${spark_uid}

Using custom images for GPU workloads

To use custom images distributed in the SDP developer package:

1. Unzip the developer package distributed with SDP.
2. Update the Dockerfile with any required libraries.
3. Build the Dockerfile by running the command $ docker build -f dockerfile.cuda --tag image-name . (where image-name is the tag to apply to your image).
4. Push the image to the SDP Docker Registry by running the command $ docker push image-name.

5. Create the RuntimeImage resource with the type spark.

6. List runtime images by running the command $ kubectl get runtimeimage.

Example RuntimeImage

apiVersion: nautilus.dellemc.com/v1beta1
kind: RuntimeImage
metadata:
  name: spark-3.1.2
spec:
  description: Spark 3.1.2 with Python 3.7
  displayName: Spark 3.1.2
  docker:
    image: spark:3.1.2-3.2.0-1.2-14-57c8cc5
    sdpRegistry: true
  environment:
    JAVA_HOME: /usr/local/openjdk-11
  sparkVersion: 3.1.2
  type: spark
  version: 3.1.2

Example runtimes in the SDP UI

System > Runtimes > Create Runtime

Kubernetes pod scheduling

Kubernetes pods use resources to specify GPU requirements. If you are using CUDA libraries for GPU workloads, pods must specify both the Request and Limits for GPU workloads as shown:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        requests:
          nvidia.com/gpu: 1 # requesting 1 GPU
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

SparkApplication Task Managers

If there are not enough resources to create Task Managers, the SDP UI will display a provision-pending state as shown.

Monitoring GPUs

Monitoring Metrics > Dashboards


Troubleshooting Spark applications This section includes common error messages when troubleshooting Spark applications.

Authentication failed

Symptom:

io.pravega.shaded.io.grpc.StatusRuntimeException: UNAUTHENTICATED: Authentication failed

Causes:
The Keycloak JAR is not included.
If this error occurs when attempting to create a scope, make sure the option allow_create_scope is set to false.


Working with JupyterHub This chapter describes using JupyterHub with the Streaming Data Platform (SDP).

Topics:

JupyterHub overview
JupyterHub installation
Starting a Jupyter notebook server
Stopping a Jupyter Notebook Server
Persistent file storage
Pravega Python client demo notebook
Spark installation demo notebook



JupyterHub overview JupyterHub provides an interactive web-based environment for data scientists and data engineers to write and immediately run Python code.

Jupyter Notebook

Figure 17. Jupyter Notebook

Jupyter Notebook is a single-user web-based interactive development environment for code, and data. It supports a wide range of workflows in data science, scientific computing, and machine learning.

JupyterHub

JupyterHub is a multiuser version of Jupyter Notebook that is designed for companies, classrooms, and research labs. JupyterHub spawns, manages, and proxies multiple instances of the single-user Jupyter Notebook server.

Jupyter Notebook Single User Pod

The Jupyter Notebook Single User Pod runs a Jupyter Notebook server for a single user. All Python code that is run in a user's notebooks runs in this pod.


JupyterHub on SDP

JupyterHub can be enabled for an Analytics Project in SDP with a few clicks. It uses the single-sign-on functionality of SDP.

Each Analytics Project has its own deployment of JupyterHub. Each user can have up to one Jupyter pod per Analytics Project.

JupyterHub installation

Steps

1. Open the SDP UI, locate your Analytics Project, and then click Edit.

2. In the list of Features, select JupyterHub.

Figure 18. JupyterHub installation

3. Click Save.

Starting a Jupyter notebook server

Steps

1. Open your Analytics Project, and then click the Features tab.

2. Click the Ingress link of the JupyterHub feature.


Figure 19. Ingress link

3. If you have recently used JupyterHub in this Analytics Project with this user account, and an existing Jupyter pod exists, you are connected to it directly. Otherwise, you are presented with a choice of server options.

Figure 20. JupyterHub server options

The choice of server options determines the following:

a. Docker image including Python version and Python libraries b. CPU and memory resources

The Jupyter Notebook interface opens, which provides a file list on the left, and notebooks and terminals on the right.


Figure 21. Jupyter Notebook interface

4. Click Python 3 to open a new Python 3 notebook.

Figure 22. Python 3 notebook

NOTE: When the first cell is run in each notebook (tab), Jupyter Notebook launches a Python process within the Jupyter pod to run the Python code of the user. This process is known as a kernel.

NOTE: You may close your browser and reopen it later. It resumes exactly where you left off, including any running kernels and unsaved notebooks. However, a Jupyter pod that has been disconnected for more than 24 hours is terminated. Any files that are saved in the project-data folder remain, but all other files in the pod are deleted.


NOTE: If JupyterHub is removed from a project, any running Jupyter pods remain running. In this case, JupyterHub does not automatically terminate pods that were created by a prior installation of JupyterHub.

Stopping a Jupyter Notebook Server

Steps

1. To open the Hub Control Panel, click File > Hub Control Panel.

Figure 23. Stopping a Jupyter Notebook Server

2. Click Stop My Server to terminate your Jupyter pod and take you to the Server Options screen.

Persistent file storage

Notebook (.ipynb) and other files should be stored in shared storage that is provided for the Analytics Project. This guarantees persistence until the Analytics Project is deleted. Files in this directory will be shared with other users in the same Analytics Project. Within the pod, the path /home/jovyan/project-data is mounted to the Analytics Project shared NFS Persistent Volume Claim named data-project. JupyterLab is configured to display this directory in the file browser. This directory can easily be used by your Python code since it is a standard file system.

NOTE: Persistent file storage is not available when using S3 object storage.
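Because project-data is an ordinary mounted file system, any standard Python file I/O in a notebook can target it. The following is a minimal sketch; the file name is hypothetical.

# Sketch: write to and read from the shared project-data directory,
# which is backed by the Analytics Project NFS Persistent Volume Claim.
from pathlib import Path

project_data = Path("/home/jovyan/project-data")
results_file = project_data / "results.txt"   # hypothetical file name

results_file.write_text("experiment 1: 0.93\n")
print(results_file.read_text())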

Pravega Python client demo notebook

The following Python code can be copied to a Jupyter notebook. It demonstrates how to use the Pravega Python client to write events to a Pravega stream and then read the events.

import os
import pravega_client
import uuid

os.environ["pravega_client_auth_method"] = "Bearer"
os.environ["pravega_client_auth_keycloak"] = "/var/run/secrets/nautilus.dellemc.com/serviceaccount/keycloak.json"
os.environ["pravega_client_tls_cert_path"] = "/etc/ssl/certs/ca-certificates.crt"

pravega_controller_uri = os.environ["PRAVEGA_CONTROLLER_URI"][6:]
pravega_scope = os.environ["PRAVEGA_SCOPE"]
pravega_stream = "demo-" + str(uuid.uuid4())
initial_segments = 1

stream_manager = pravega_client.StreamManager(pravega_controller_uri, True, True)
stream_manager.create_stream(pravega_scope, pravega_stream, initial_segments)

writer = stream_manager.create_writer(pravega_scope, pravega_stream)
for i in range(0, 10):
    writer.write_event("sample event %d" % i, routing_key="%d" % i)

reader_group_name = str(uuid.uuid4())
reader_group = stream_manager.create_reader_group(reader_group_name, pravega_scope, pravega_stream)
reader = reader_group.create_reader("0")

# Acquire a segment slice to read.
slice = await reader.get_segment_slice_async()
for event in slice:
    print(event.data())

# After calling release segment, data in this segment slice will not be read again by
# readers in the same reader group.
reader.release_segment(slice)

# Remember to mark the finished reader as offline.
reader.reader_offline()

Spark installation demo notebook

Spark is one of the important use cases for Jupyter notebooks. To demonstrate how to install and use Spark, a sample notebook has been added in /home/jovyan/samples. This notebook runs the necessary download and installation commands for Spark; a sketch of the general approach follows the figure below.

Figure 24. Sample Spark notebook
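The exact contents of the sample notebook are not reproduced here. As a minimal sketch of the general idea, assuming PySpark has already been installed in the Jupyter pod (for example by the sample notebook's download and installation commands) and that a Java runtime is available, a notebook cell can verify the installation like this:

# Sketch: verify a local (in-pod) Spark installation from a notebook cell.
# Assumes PySpark is already installed and a Java runtime is available.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")          # run Spark inside the Jupyter pod
         .appName("jupyter-spark-check")
         .getOrCreate())

print(spark.range(5).count())         # should print 5
spark.stop()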


Working with Analytic Metrics

Topics:

About analytic metrics for projects
Enable analytic metrics for a project
View analytic metrics
Add custom metrics to an application
Analytic metrics predefined dashboard examples

About analytic metrics for projects SDP includes capabilities for automatic collection and visualization of detailed metrics for project applications. SDP generates separate dashboards for each Flink cluster, Spark application, and Pravega Search cluster defined within a project.

Description Analytic metrics is an optional feature within a project. At project creation, the administrator chooses whether to enable or disable project-level analytic metrics. The feature is enabled by default.

Project metrics are valuable for:

historical view of long running applications
performance analysis
insight into application conditions
troubleshooting

The Metrics feature is enabled or disabled per project, at project creation time. When Metrics is enabled, SDP creates a metrics stack for the project. The stack contains:
Project-specific InfluxDB database
Project-specific Grafana deployment and access point
Separate Flink, Spark, and PSearch dashboards

SDP visualizes metrics in dashboards based on predefined templates. Separate dashboards are used for each Flink cluster, each Spark application, and the Pravega Search cluster created in the project. The default dashboards include the following:

Cluster type

Dashboard contents

Flink clusters

Shows metrics for Flink jobs, tasks, and task managers. When you delete the Flink cluster, the system deletes the cluster's analytic dashboard and metrics.

Spark applications

Shows processing progress and other metrics. The Spark integration uses a Telegraf server to collect metrics. If you delete the Spark application, the system also deletes the application's analytic dashboard and metrics.

Pravega Search cluster

Shows metrics for Pravega Search components, such as indexworkers and shardworkers. When you delete the Pravega Search cluster, the system deletes the cluster's dashboard and metrics.

Customizations Two types of customizations are supported:
Advanced Grafana users may create their own customized dashboards.
Application developers may include customized metrics in their applications. Applications can emit custom metrics to InfluxDB, and those metrics can be visualized by customized Grafana templates.

Metrics retention period

The retention period for metrics data is two weeks.

Dashboard access

Access to project dashboards and metrics is controlled by project membership.



Enable analytic metrics for a project The analytic metrics feature is enabled separately for each project, at project creation time.

About this task

SDP platform administrators create analytic projects. They can choose to enable or disable Metrics.

NOTE: It is not possible to enable metrics after project creation.

Steps

1. Log in to the SDP UI as an admin.

2. Create a project as described in the Dell Technologies Streaming Data Platform Installation and Administration Guide .

3. In the Metrics field, ensure Enabled is selected.

Enabled is the default selection.

Results

If project creation is successful, the metrics stack is deployed. Metrics are collected for each cluster in the project.

View analytic metrics Project members can access a project's Grafana UI from the Features tab in the SDP UI.

Prerequisites

At project creation time, the Metrics field must be set to Enabled.

About this task

To access the metrics dashboards for a Flink project, follow these steps:

Steps

1. Go to Analytics > project-name.

2. Click the Features tab.

3. Click the metrics-link in the Endpoint column. The link opens the Grafana UI.


Add custom metrics to an application Application developers may deploy custom metrics in applications by specifying InfluxDB and Grafana dashboard resources within their Helm chart.

Prerequisites

The Metrics option must be enabled at project creation time by the platform administrator. Users who are members of the project have access.

About this task

In the following Flink longevity metrics example, the developer would need to know what the secret is, and how to use it, in order to connect to the InfluxDB database. The developer would add the secret and bind the connection details to environment variables within the application, which the application then uses to set up the metrics connection (a sketch of emitting a metric follows the steps below).

The order in which the components are created is resolved by Kubernetes, since the pod cannot start until the secret, and thus the InfluxDB database, has been created.

The application would need to do the following:

Steps

1. Include an InfluxDBDatabase resource in the Helm chart.

apiVersion: metrics.dellemc.com/v1alpha1
kind: InfluxDBDatabase
metadata:
  name: flink-longevity-metrics
spec:
  influxDBRef:
    name: project-metrics

2. Include a GrafanaDashboard resource in the Helm chart.

apiVersion: metrics.dellemc.com/v1alpha1
kind: "GrafanaDashboard"
metadata:
  name: "flink-cluster"
spec:
  dashboard: |
    {
      ....
      "panels": [
        {
          "aliasColors": {},
          "bars": false,
          "dashLength": 10,
          "dashes": false,
          "datasource": "flink-longevity-metrics",
          ....
        },
      ....
    }

3. Configure the application parameters to pick up the details from the secret.

apiVersion: flink.nautilus.dellemc.com/v1beta1
kind: "FlinkApplication"
metadata:
  name: "flink-longevity-duplicate-check-0"
spec:
  ...
  parameters:
    - name: readerParallelism
      value: "1"
    - name: checkpointInterval
      value: "10"
    - name: metrics
      type: secret
      value: "flink-longevity-metrics"
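The metric write itself is ordinary InfluxDB client usage. The following hedged Python sketch assumes the connection details (host, port, credentials, database name) have already been surfaced to the application, for example through environment variables bound from the project metrics secret; the environment variable names, measurement, and tags shown here are hypothetical, not names defined by SDP.

# Sketch only: emit a custom metric point to the project's InfluxDB database.
# The environment variable names and the measurement/tags are hypothetical;
# the actual connection details come from the secret bound by your Helm chart.
import os
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(
    host=os.environ.get("METRICS_INFLUXDB_HOST", "project-metrics"),
    port=int(os.environ.get("METRICS_INFLUXDB_PORT", "8086")),
    username=os.environ.get("METRICS_INFLUXDB_USERNAME", ""),
    password=os.environ.get("METRICS_INFLUXDB_PASSWORD", ""),
    database=os.environ.get("METRICS_INFLUXDB_DATABASE", "flink-longevity-metrics"),
)

# Write one point; a Grafana panel with this database as its datasource
# can then chart the measurement.
client.write_points([{
    "measurement": "duplicate_check",          # hypothetical measurement
    "tags": {"application": "flink-longevity-duplicate-check-0"},
    "fields": {"duplicates_found": 0},
}])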

Analytic metrics predefined dashboard examples Dashboards provide preconfigured visualizations of an InfluxDB database. The following example shows predefined dashboard templates from analytic engine providers that are included with Streaming Data Platform.

Flink cluster metrics dashboard

The default Flink dashboard shows metrics for Flink jobs, tasks, and task managers. You can select the job of interest from the dropdown in the upper left. The following image shows only a few of the graphs.

Spark application metrics dashboard

The default Spark metrics dashboard lists jobs, stages, storage, environment, executors, and SQL.


Figure 25. Spark Jobs

Pravega Search cluster metrics dashboard

The Pravega Search metrics dashboard includes graphs for each of the service types (controller, resthead, indexworker, queryworker, shardworker). It also includes graphs for CPU and memory usage. The following image shows only a few of the graphs.



Working with Video Streaming Data Platform can record video from supported network-connected cameras to Pravega and perform real-time GPU-accelerated object detection inference on the recorded video. This chapter provides an overview of the SDP streaming video features.

Topics:

Overview of video features
Video streaming terminology
Camera recorder pipelines
GStreamer pipelines
Object Detection with NVIDIA DeepStream
SDP Video Architecture

Overview of video features Object detection is a type of deep learning inference that can identify the coordinates (left, top, width, height) and class (for example, car, bike, person) of objects in an image. Object detection is a core requirement for many advanced video intelligence tasks, such as counting the number of people, measuring automobile traffic flow rates, and reading street signs.

SDP provides a web UI, which allows users to specify the connectivity parameters for network-connected cameras, such as IP addresses and login credentials. You can also view historical and near-live video.

Retention policies can be applied to video streams, enabling automatic deletion of older videos based on bytes used or age.

To provide object detection, SDP developers can use the NVIDIA DeepStream SDK. The SDK provides GPU-acceleration of common video processing tasks, such as video decoding and object detection.

Both camera recording and object detection processes use the GStreamer multimedia framework, which allows developers to build flexible and high-performance video processing pipelines.

The core video components of SDP are 100% open source and available under the Apache 2 license. For more information, see: https://github.com/pravega/gstreamer-pravega.

Video streaming terminology The following terms are referenced in this chapter.

Term Description

deep learning Deep learning is a subset of machine learning that uses artificial neural networks in many layers. It has been used extensively in computer vision. GPUs are used extensively to accelerate deep learning tasks.

Frames per Second (FPS) Frames per second

H.264 A video compression standard, also referred to as Advanced Video Codec (AVC) or MPEG-4.

HTTP Live Streaming, HLS An HTTP-based protocol for playing near-live video. Camera to screen latency is generally 10 seconds or higher.

inference Inference means to use a previously trained model to make predictions about new data. It is usually less compute-intensive than training, but it usually must be done in real time. GPUs can be used to accelerate many inference workloads, including inference of deep learning models.

I-frame In video compression, I-frames are the least compressible, but do not require other video frames to decode.

Instantaneous Decoder Refresh (IDR) IDR frames are always I-frames. The distinction is that no frames after an IDR frame may depend on frames prior to the IDR frame, but this is not necessarily true for I-frames.

key frame Same as I-frame

object detection Object detection is a type of deep learning inference that can identify the coordinates (left, top, width, height) and class (e.g. car, bike, person) of objects in an image.

Presentation Timestamp (PTS) The Presentation Timestamp for an image from a camera is the precise time when the image was captured by the camera sensor.

Real-Time Protocol (RTP) Real-Time Protocol is a low level media transport protocol used by IP (network-enabled) cameras. It can use either UDP or TCP transports.

Real-Time Streaming Protocol (RTSP) Real-Time Streaming Protocol is the network protocol used by IP (network-enabled) cameras.

International Atomic Time (TAI) International Atomic Time is a high-precision atomic coordinate time standard and is the basis for Coordinated Universal Time (UTC).

training Training a machine learning model means determining the model parameters using historical data. It is often a very compute-intensive task, requiring GPUs for acceptable performance.

Video Management System (VMS) A computer system for recording video from security cameras.

Camera recorder pipelines SDP can record video from common network video cameras. For each camera, SDP establishes a connection and records compressed video to a Pravega stream.

Supported cameras

SDP connects to video cameras using Real-Time Streaming Protocol (RTSP). RTSP cameras often support both TCP and UDP transport modes of Real-Time Protocol (RTP). SDP only supports the TCP transport mode because it is more reliable and can easily travel through firewalls.
Cameras must be configured to provide video compression using H.264.
Although GStreamer and the GStreamer Plugins for Pravega allow pipelines with arbitrary data types, including audio, audio is not supported in SDP. If a camera provides both video and audio, the audio will not be recorded.


How video is stored in Pravega

A GStreamer playback pipeline can play a video frame as soon as it is read from the data stream.

The index stream is used only for seeking and building an HTTP Live Streaming playlist. Only key frames are indexed. Both streams are Pravega byte streams and are written to long-term storage (LTS) without additional framing.

Configure SDP to record from cameras

This section describes how to determine and configure video camera connection parameters.

Determine your camera connection parameters

RTSP cameras can be accessed using a URI such as rtsp://user:password@mycamera.example.com:554/camera-path.

The protocol must be rtsp.

The host can be a DNS name, such as mycamera.example.com or an IP address.

The standard port for RTSP is 554. In some network configurations, port 80 can be used. The correct value for the path component (camera-path above) depends on the camera manufacturer and model.

Contact your camera manufacturer for the correct path for your camera. If this path controls the compression algorithm, choose H.264.

Create a camera recorder pipeline

This section describes how to create a camera recorder pipeline in the SDP UI.

Steps

1. In the SDP UI, create an Analytics Project.

2. Add Video Server to the analytics project.

analytics-project > Edit > Video Server > Save


3. Create Camera Secrets.

analytics-project > Video > Camera Secrets > Create Camera Secret

A Camera Secret specifies the username and password that is used to authenticate to RTSP cameras. If a group of cameras shares a single set of credentials, you can create a single Camera Secret for this group of cameras. It is recommended that the Camera Secret user have view-only permissions to this camera.


Specify the following parameters:

Name Type a name to identify this Camera Secret.

Username Type a username used to authenticate to the camera using RTSP.

Password Type a password used to authenticate to the camera using RTSP.

4. Create Camera Recorder Pipelines.

Video > Camera Recorder Pipelines > Create Camera Pipeline


Specify the following parameters:

Name Choose a name to identify the Camera Recorder Pipeline.

Host Address The DNS name or IP address of your camera.

Port The RTSP port used by your camera. The standard port is 554.

URL Path The path component of the RTSP URI. The correct value depends on the camera manufacturer and model. Contact your camera manufacturer for the correct path for your camera.

Secret Choose the Camera Secret used to authenticate to the camera using RTSP.

Existing Streams Only one camera at a time can record to a single stream. If you previously created a Camera Recorder Pipeline to record to a stream, and later deleted the Camera Recorder Pipeline, you can configure a new Camera Recorder Pipeline to append to that existing stream. Otherwise, choose New Stream.

Retention Policy Type The retention policy can be used to automatically delete the oldest video content, where:
None: Do not automatically delete video content.
Days: Delete video content older than this many days. The age of the video content is based on the timestamp that is stored in the video. You may specify fractional days such as 0.5.
Bytes: Delete the oldest video content until the number of bytes used is less than this many MiB (1,048,576 bytes).
Days and Bytes: This combines Days and Bytes. Video content older than the specified number of days will be deleted. However, video content is deleted earlier than the specified number of days if the number of bytes used exceeds the specified MiB.

By default, the retention policy is evaluated every 15 minutes. Additionally, files on long-term storage (LTS) are created and deleted in chunks of 128 MiB. When sizing disk capacity, you should plan on an extra 15 minutes plus 256 MiB per camera.

The retention policy may be changed at any time. The retention policy is applied by the Camera Recorder Pipeline only when it is running.
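As a rough, hedged sizing sketch, the guidance above can be turned into simple arithmetic. The 4 Mbps camera bitrate below is an assumed example value, not an SDP figure; substitute your camera's actual bitrate.

# Rough capacity sketch for one camera with a days-based retention policy.
# The 4 Mbps bitrate is an assumed example value. The 15-minute and 256 MiB
# allowances come from the guidance above (policy evaluation interval and
# LTS chunking).
bitrate_mbps = 4.0          # assumed camera bitrate
retention_days = 180.0      # from the retention policy

bytes_per_day = bitrate_mbps * 1_000_000 / 8 * 86_400
retained = bytes_per_day * retention_days
extra_15_min = bitrate_mbps * 1_000_000 / 8 * 15 * 60
extra_chunks = 256 * 1024 * 1024

total_gib = (retained + extra_15_min + extra_chunks) / 2**30
print(f"Plan roughly {total_gib:,.0f} GiB of LTS capacity for this camera")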

WARNING: Do not start two GStreamer Camera Pipelines that write to the same Pravega stream. This results in one or both pipelines continuously restarting.

WARNING: Do not use the retention policy settings under the Pravega tab of the SDP UI. Doing so may cause video playback errors. The Retention Policy in the Pravega tab should be disabled.

5. Add Environment Property: Environment properties can be added to customize the behavior of the camera recorder process as follows:
TIMESTAMP_SOURCE: A value of rtcp-sender-report is the most accurate, since the camera effectively timestamps each frame. However, for cameras that are unable to send RTSP Sender Reports, or have unreliable clocks, local-clock can be used, in which case the time offset is calculated when the first frame is received. This results in timestamps being incorrect by up to a few seconds. The default is local-clock.

6. Click Save.

7. Click Start to begin recording from this camera. The state changes to Running once SDP connects to the camera and begins recording. If this does not occur, see the Troubleshooting camera recorder pipelines section in this chapter.

Control camera recorder pipelines

All configuration settings for a camera recorder pipeline can be changed in the SDP UI under Analytics Projects.

Starting

A Camera Recorder Pipeline can be started at any time by clicking Start which starts a new recording process.

If the Pravega stream exists, newly recorded video is appended to the Pravega stream. If the Pravega stream does not exist, it is created by the recording process.

Restarting a Camera Recorder Pipeline causes a gap in the timeline of the Pravega stream. When using the SDP UI to play a video stream with a gap, a solid blue video appears for five seconds when the gap occurs.


Stopping

You can stop the Camera Recorder Pipeline at any time by clicking Stop.

Stopping a Camera Recorder Pipeline terminates the recording process, however the Pravega stream is not deleted. While stopped, any video retention policy does not take effect, as it is applied by the recording process.

Editing

All configuration settings for a Camera Recorder Pipeline can be changed at any time by clicking Edit.

If the recording process is running when edited, it is stopped for a few seconds and then restarted automatically. Video is not recorded when stopped.

Deleting

A Camera Recorder Pipeline can be deleted at any time by clicking Delete. This has the same behavior as stopping the pipeline.

The configuration for the Camera Recorder Pipeline is deleted. The Pravega stream is not deleted.

Monitor camera recorder pipelines

You can play historical and live video directly in the SDP UI as described in this section.

Playing video in the SDP UI with Pravega video server

Play video from a camera recorder pipeline within the SDP UI from either the Camera Recorder Pipelines tab or the Video Streams tab.


When the player is first opened, it begins playing near the end of the video, approximately 5 to 15 seconds from live. You can use the slider control to position playback to any point within the current playback window. The hh:mm:ss counter near the slider counts the elapsed time of recorded video since the begin date/time at the top of the window. This date/time may differ from the elapsed time since midnight if gaps occur in the recording.

If you do not see video, try adjusting the start date. Use the calendar control to select a different playback window. The playback window is adjusted automatically so that a maximum of 24 hours is available in the slider control. The Video Streams tab shows all streams with a tag of "video."

Enabling Pravega Video Server

Pravega Video Server is a component of SDP that allows all major web browsers to play historical and live video. It is an HTTP web service that supports HTTP Live Streaming (HLS). It can be enabled at any time by adding the feature to an Analytics Project under Edit Project.


Automate recording from many cameras

You can use the Kubernetes API and kubectl to automate the creation of Camera Recorder Pipelines.

Steps

1. Create a file named camera-recorder-pipeline.yaml as shown below.

apiVersion: gstreamer.dellemc.com/v1alpha1
kind: CameraRecorderPipeline
metadata:
  name: camera001-recorder
spec:
  camera:
    address: "10.1.2.3"              # Replace with RTSP IP or FQDN.
    path: "/media"                   # Replace with your RTSP URI path.
    port: "554"
    secret: "camera-group-1-secret"  # Replace with the name of a Camera Secret.
  pravega:
    retentionPolicy:
      retentionBytes: 0
      retentionDays: 180.0
      retentionPolicyType: days      # Can be days, bytes, daysAndBytes, or none.
    stream: "camera001"
  state: Running

2. Create the Kubernetes resource.

kubectl apply -n ${NAMESPACE} -f camera-recorder-pipeline.yaml

Results

You will see the new Camera Recorder Pipeline in the SDP UI.

A variety of methods can be used to generate and apply multiple YAML files. Helm is commonly used for automation in Kubernetes. You may want to refer to the sample Helm chart rtsp-camera-to-pravega.
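For example, a simple shell loop can stamp out one manifest per camera from the template above. This is only a sketch; the camera addresses, secret name, stream names, and namespace are placeholders for your environment.

# Hypothetical sketch: generate and apply a CameraRecorderPipeline per camera.
NAMESPACE=examples
for i in 1 2 3; do
  cat <<EOF | kubectl apply -n ${NAMESPACE} -f -
apiVersion: gstreamer.dellemc.com/v1alpha1
kind: CameraRecorderPipeline
metadata:
  name: camera00${i}-recorder
spec:
  camera:
    address: "10.1.2.${i}"
    path: "/media"
    port: "554"
    secret: "camera-group-1-secret"
  pravega:
    retentionPolicy:
      retentionBytes: 0
      retentionDays: 180.0
      retentionPolicyType: days
    stream: "camera00${i}"
  state: Running
EOF
done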


RTSP camera simulator

The RTSP Camera Simulator is an application which simulates an RTSP camera.

RTSP clients such as the Camera Recorder Pipeline can connect to it as if it were a real camera. It generates standard color bars with timestamps as shown below.

The RTSP Camera Simulator is expected to be used only in a test environment.

It can be started on any Linux host using the steps below.

git clone --recursive https://github.com/pravega/gstreamer-pravega
cd gstreamer-pravega
export CAMERA_PORT=8554
export CAMERA_USER=user
export CAMERA_PASSWORD=mypassword
scripts/rtsp-camera-simulator-docker.sh

You can then use an RTSP player, such as VLC, to play the URL rtsp://user:mypassword@127.0.0.1:8554/cam/realmonitor?width=640&height=480.
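If an RTSP player is not available, a minimal GStreamer pipeline can also be used to check the simulator. This is only an illustration and requires a host with a display for the video window.

# Hypothetical check of the simulator URL using GStreamer instead of VLC.
gst-launch-1.0 playbin uri="rtsp://user:mypassword@127.0.0.1:8554/cam/realmonitor?width=640&height=480"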

Alternatively, it can be started on Kubernetes, including SDP, using the Helm chart rtsp-camera-simulator.

Troubleshooting camera recorder pipelines

This section provides debugging suggestions to assist in troubleshooting issues.

Pipeline state shows error or starting

This condition has many possible causes. Click on the name of the Camera Recorder Pipeline, then Logs. Search the log for ERROR or WARN messages.

Common messages are described as follows.


Debug Logging

Camera Recorder Pipelines can be configured to log highly detailed debug messages to assist in troubleshooting. Logging can be controlled by setting environment variables. Be aware that excessive logging may cause poor performance.

Logs are available in the SDP UI. View large logs using kubectl commands.
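For example, assuming the recorder runs as a StatefulSet named after the pipeline (as with the GStreamer Pipelines shown later in this chapter), a command along these lines can save the full log to a file; the namespace and pipeline name are placeholders.

# Hypothetical example: save the complete log of a camera recorder pipeline to a file.
kubectl logs -n examples statefulset/camera001-recorder > camera001-recorder.log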

GST_DEBUG: The GStreamer framework and plugins will log according to this environment variable. The default is:

WARNING,rtspsrc:INFO,rtpbin:INFO,rtpsession:INFO,rtpjitterbuffer:INFO,h264parse:WARN,qtmux:FIXME,fragmp4pay:INFO,timestampcvt:DEBUG,pravegasink:DEBUG

Available logging levels are none, ERROR, WARNING, FIXME, INFO, DEBUG, LOG, TRACE, MEMDUMP.

Examples:

INFO - Logs all informational messages.

DEBUG - Logs all debug messages.

WARN,pravegasrc:LOG,pravegasink:LOG - The elements pravegasrc and pravegasink will log all log messages. All other elements will log warning messages only.

Learn more about debug logging with GStreamer.

LOG_LEVEL: Controls the logging from Python applications such as rtsp-camera-to-pravega.py.

The default is 20 which logs at the info level. Set to 10 to enable debug logging.

PRAVEGA_VIDEO_LOG: Controls logging by the Pravega Rust Client and its dependencies. The default is info. Available logging levels are error, warn, info, debug, and trace.

The logging level can be set for specific modules with a syntax like pravega_video=debug,info
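For example, when troubleshooting, the three variables might be set as follows, either as Environment Properties on the pipeline or as exports in an interactive shell. The exact values shown are only illustrative.

# Illustrative logging settings for troubleshooting.
export GST_DEBUG="WARN,pravegasrc:LOG,pravegasink:LOG"
export LOG_LEVEL=10              # debug logging for the Python applications
export PRAVEGA_VIDEO_LOG=debug   # debug logging for the Pravega Rust client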

gst_rtsp_connection_connect_with_response_usec: failed to connect: Could not connect to...: Socket I/O timed out

Example message

0:00:22.749267925 1 0x28df860 ERROR default gstrtspconnection.c:1052:gst_rtsp_connection_connect_with_response_usec: failed to connect: Could not connect to 108.185.180.25: Socket I/O timed out
0:00:22.749300278 1 0x28df860 ERROR rtspsrc gstrtspsrc.c:5163:gst_rtsp_conninfo_connect: Could not connect to server. (Generic error)
0:00:22.749309854 1 0x28df860 WARN rtspsrc gstrtspsrc.c:8025:gst_rtspsrc_retrieve_sdp: error: Failed to connect. (Generic error)
0:00:22.749364555 1 0x28df860 WARN rtspsrc gstrtspsrc.c:8111:gst_rtspsrc_open: can't get sdp
0:00:22.749405114 1 0x28df860 WARN rtspsrc gstrtspsrc.c:8772:gst_rtspsrc_play: failed to open stream
0:00:22.749412988 1 0x28df860 WARN rtspsrc gstrtspsrc.c:6143:gst_rtspsrc_loop: we are not connected
ERROR:root:gst-resource-error-quark: Could not open resource for reading and writing. (7): ../gst/rtsp/gstrtspsrc.c(8025): gst_rtspsrc_retrieve_sdp (): /GstPipeline:pipeline0/GstRTSPSrc:rtspsrc: Failed to connect. (Generic error)
INFO:root:Stopping pipeline

This error indicates that a TCP connection could not be established to the RTSP camera. Check the camera host address and port.

rtspsrc...Unauthorized (401)

Example message

0:00:03.761097358 1 0x2f6b860 WARN rtspsrc gstrtspsrc.c:6651:gst_rtspsrc_send: error: Unauthorized (401)


This error indicates the recorder process cannot authenticate to the RTSP camera. Check the camera host address, port, URL path, user name, and password. In the settings for the camera itself, ensure that the specified user has permissions to view the live video stream. You may want to test that you have the correct RTSP parameters by using an application such as VLC Media Player.

thread 'unnamed panicked at should be valid path'

Example message

thread '<unnamed>' panicked at 'should be valid path: Os { code: 2, kind: NotFound, message: "No such file or directory" }', pravega-client-rust-fa7c139c5174e088/0833868/config/src/lib.rs:128:49
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
   1: core::panicking::panic_fmt
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/panicking.rs:92:14
   2: core::result::unwrap_failed
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/result.rs:1355:5
   3: core::result::Result<T,E>::expect
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/result.rs:997:23
   4: pravega_client_config::ClientConfigBuilder::extract_trustcerts
             at pravega-client-rust-fa7c139c5174e088/0833868/config/src/lib.rs:128:20

This error indicates that the file referenced by the environment variable pravega_client_tls_cert_path was not found. This environment variable is automatically set to /etc/ssl/certs/ca-certificates.crt, but it could have been overwritten by the user. Ensure that it is not set in Camera Recorder Pipeline.

thread tokio-runtime-worker panicked at 'Cannot drop a runtime in a context where blocking is not allowed'

Example message

thread 'tokio-runtime-worker' panicked at 'Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.5.0/src/runtime/blocking/shutdown.rs:51:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This message can safely be ignored as it does not impact a running application. It is caused by Issue 267 that occurs during the application shutdown process. If you do not expect your application to have shut down, look for messages prior to this one to identify the root cause.

Video player displays blue video

Stopping and starting a Camera Recorder Pipeline causes a gap in the timeline of the Pravega stream. Camera and network faults may also produce gaps or discontinuities in the recording process.

When using the SDP UI to play a video stream with a gap, a solid blue video appears for five seconds when the gap occurs.

GStreamer pipelines

GStreamer is a pipeline-based multimedia framework that links together a wide variety of media processing elements.

GStreamer is free and open source software subject to the terms of the GNU Lesser General Public License (LGPL), although some plugins have proprietary licenses. GStreamer Plugin for Pravega is 100% open source and available under Apache 2 License.


You can start simple pipelines using command-line tools such as gst-launch-1.0. For example, the following command will capture video from a camera, compress it, and write to a file.

gst-launch-1.0 v4l2src ! x264enc ! filesink location=myvideo.h264

More sophisticated applications can use GStreamer as a library.

GStreamer applications can be written in Python, C, C++, Rust, Java, and more. Developers can build video players, media servers, DVRs, and video management servers (VMS).

GStreamer is extendable with plugins. There are nearly 2000 plugins available. Commonly-used plugins are:

RTSP source (rtspsrc): Capture from a network camera.
H.264 video compression and decompression: CPU and GPU-accelerated elements are available.
Pravega Sink: Write to a Pravega stream.
Pravega Source: Read from a Pravega stream.
NVIDIA DeepStream Inference: GPU-accelerated AI deep learning inference for video analytics such as object detection.

Start the Interactive Shell with GStreamer

The Interactive Shell with GStreamer is a Kubernetes pod that provides a Bash prompt. Users can run a variety of GStreamer applications for inspecting, copying, exporting, and importing Pravega video streams.

Steps

1. Create a local file named gstreamer-interactive.yaml using the following template, substituting your environment- specific settings in the sections noted with "CHANGE REQUIRED."

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gstreamer-interactive
spec:
  serviceName: gstreamer-interactive
  replicas: 1
  selector:
    matchLabels:
      app: gstreamer-interactive
  template:
    metadata:
      labels:
        app: gstreamer-interactive
    spec:
      # CHANGE REQUIRED:
      # Change serviceAccount to the format NAMESPACE-pravega, where NAMESPACE is the Kubernetes namespace
      # where this pod will be deployed. This is the same as the SDP Analytics Project name.
      serviceAccount: examples-pravega
      containers:
      - name: gstreamer
        image: "{runtime: gstreamer-1.18.4}"
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "100m"
            memory: "1Gi"
        command: ["bash", "-c", "--", "sleep infinity"]
        env:
        - name: pravega_client_tls_cert_path
          value: "/etc/ssl/certs/ca-certificates.crt"
        # If using NFS, this will mount your project volume in the pod for convenient file import and export.
        # Otherwise, remove the volumeMounts and volumes sections.
        volumeMounts:
        - name: data-project
          mountPath: /mnt/data-project
      volumes:
      - name: data-project
        persistentVolumeClaim:
          claimName: data-project

2. Create the pod.

kubectl apply -n ${NAMESPACE} -f gstreamer-interactive.yaml

3. Open a Bash shell in the pod.

kubectl exec -n ${NAMESPACE} -it statefulset/gstreamer-interactive bash
root@gstreamer-interactive-0:/#

Use the Interactive Shell with GStreamer tools

The Interactive Shell is a Kubernetes pod that provides gst-launch-1.0 and other GStreamer tools used for testing, troubleshooting, and iterative development.

gst-launch-1.0

The gst-launch-1.0 tool accepts a textual description of a simple GStreamer pipeline and runs it.

gst-inspect-1.0

The gst-inspect-1.0 tool provides usage information for GStreamer elements. It has three modes of operation:

1. Without arguments, it lists all available element types you can use to instantiate new elements.
2. With a file name as an argument, it treats the file as a GStreamer plugin, tries to open it, and lists all the elements described inside.
3. With a GStreamer element name as an argument, it lists all information regarding that element.
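For example, the three modes can be invoked as follows; the plugin file path shown is the one reported later in this chapter and may differ on your system.

gst-inspect-1.0                                                           # mode 1: list all elements
gst-inspect-1.0 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so  # mode 2: inspect a plugin file
gst-inspect-1.0 pravegasink                                               # mode 3: inspect one element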

The following lists all available elements. Only Pravega elements are shown below.

pravega:  fragmp4pay: Fragmented MP4 Payloader
pravega:  pravegasink: Pravega Sink
pravega:  pravegasrc: Pravega Source
pravega:  pravegatc: Pravega Transaction Coordinator
pravega:  timestampcvt: Convert timestamps
...

Total count: 242 plugins (1 blacklist entry not shown), 1427 features

To learn more about GStreamer tools, see: GStreamer Basic Tutorial 10.

Launch the Interactive Shell with GStreamer

1. Create gstreamer-interactive.yaml to define a StatefulSet

2. Run kubectl apply -f gstreamer-interactive.yaml
3. Run kubectl exec -it statefulset/gstreamer-interactive bash

Export a Pravega stream to a fragmented MP4 file

This example uses the gst-launch-1.0 tool to run a GStreamer pipeline. This pipeline reads video content from a Pravega stream, starting and stopping at specific timestamps, and exports the video in Fragmented MP4 format to a file in the shared project directory.

Run the following command in the Interactive Shell with GStreamer.

gst-launch-1.0 -v \
  pravegasrc \
    stream=examples/my-stream \
    start-mode=timestamp \
    start-utc=2021-08-13T21:00:00.000Z \
    end-mode=timestamp \
    end-utc=2021-08-13T21:01:00.000Z \
  ! filesink \
    location=/mnt/data-project/export.mp4

If end-mode is unbounded, this will run continuously until the Pravega stream is sealed or deleted. Otherwise, the application will terminate when the specified time range has been exported.

Run gst-inspect-1.0 pravegasrc to see the list of available properties for the pravegasrc element.

Export a Pravega stream to a GStreamer Data Protocol (GDP) file

The GStreamer Data Protocol (GDP) file format preserves buffer timestamps and other metadata. When a Pravega stream is exported to a GDP file and later imported to a new Pravega stream, the two Pravega streams will be identical.

gst-launch-1.0 -v \
  pravegasrc \
    stream=examples/my-stream \
    start-mode=timestamp \
    start-utc=2021-08-13T21:00:00.000Z \
    end-mode=timestamp \
    end-utc=2021-08-13T21:01:00.000Z \
  ! "video/quicktime" \
  ! gdppay \
  ! filesink \
    location=/mnt/data-project/export.gdp

Import a GStreamer Data Protocol (GDP) file to a Pravega stream

When a Pravega stream is exported to a GDP file, and later imported to a new Pravega stream, the two Pravega streams will be identical.

gst-launch-1.0 -v \
  filesrc \
    location=/mnt/data-project/export.gdp \
  ! gdpdepay \
  ! pravegasink \
    stream=examples/my-stream \
    sync=false \
    timestamp-mode=tai

Continuously capture from a network camera and write to a fragmented MP4 file

gst-launch-1.0 \
  rtspsrc location=rtsp://10.1.2.3/cam \
  ! rtph264depay \
  ! h264parse \
  ! mp4mux fragment-duration=1 \
  ! filesink location=myvideo.mp4

Transcode a media file

gst-launch-1.0 \
  uridecodebin uri=https://example.com/my-video.webm \
  ! x264enc \
  ! mp4mux \
  ! filesink location=output.mp4


Run a continuous GStreamer pipeline

Steps in this section demonstrate how you can create GStreamerPipeline custom resources to run simple GStreamer pipelines.

In the following examples, the pipeline will continuously read video from the Pravega stream named my-stream and write to the Pravega stream named copy-of-my-stream. The property start-mode=latest indicates that it will start reading from the latest position in the stream when the pipeline starts.

Create and run a continuous GStreamer pipeline using the SDP UI

You can use the SDP UI to create GStreamerPipeline custom resources to run GStreamer pipelines.

Steps

1. Log in to SDP and open the Analytics Project. Click Video > GStreamer Pipelines > Create GStreamer Pipeline.

2. Enter the following:
a. Name: Type a name to identify the GStreamer Pipeline.
b. Image: Enter the name of your custom Docker image. To use the built-in GStreamer image, enter {runtime: gstreamer-1.18.4}, which will be resolved to the local Docker registry used by SDP.
c. Replicas: Replicas should always be 1, unless your custom application handles high availability.
d. Pull Policy: Enter Always.
e. Liveness Probe Enabled: Enter False, unless your custom application supports a liveness probe, which sends a request to /ishealthy on port 8080.

f. Add Environment Property: Provide application parameters through the following environment variables:

ENTRYPOINT: When using the GStreamer image, the command specified in this environment variable will be executed. For example:

gst-launch-1.0 -v pravegasrc stream=examples/my-stream start-mode=latest ! pravegasink stream=examples/copy-of-my-stream sync=false retention-type=days retention-days=1.0

g. GST_DEBUG: GStreamer applications will log according to this environment variable. For example:
i. INFO: Logs all informational messages.
ii. DEBUG: Logs all debug messages.
iii. WARN,pravegasrc:LOG,pravegasink:LOG: The elements pravegasrc and pravegasink will log all log messages. All other elements will log warning messages only.
h. Resource Requests: Enter the number of CPU cores and memory that should be reserved for the application.
i. Resource Limits: Enter the maximum number of CPU cores and memory that your application is allowed to use. If the application attempts to use more than the memory limit, it will be terminated. Read the following Kubernetes resources page to learn more about resource requests and limits: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

3. Click Save.

4. Click Start to start the GStreamer Pipeline.


Create and run GStreamer pipelines using the command line

You can create simple GStreamer pipelines using kubectl commands on the command line.

Steps

1. Create a local file named copy-my-stream.yaml with contents similar to the following:

apiVersion: gstreamer.dellemc.com/v1alpha1
kind: GStreamerPipeline
metadata:
  name: copy-my-stream
spec:
  image: "{runtime: gstreamer-1.18.4}"
  pullPolicy: Always
  env:
  - name: GST_DEBUG
    value: WARN,pravegasrc:LOG,pravegasink:LOG
  - name: ENTRYPOINT
    value: >
      gst-launch-1.0 -v
      pravegasrc stream=examples/my-stream start-mode=latest
      ! pravegasink stream=examples/copy-of-my-stream sync=false
      retention-type=days retention-days=1.0
  resources:
    requests:
      cpu: "0.1"
      memory: 256Mi
    limits:
      cpu: "0.2"
      memory: 512Mi
  state: Running

2. Create the GStreamer Pipeline.

kubectl apply -n ${NAMESPACE} -f copy-my-stream.yaml

3. View the log.

kubectl logs -n ${NAMESPACE} statefulset/copy-my-stream


GStreamer plugins for Pravega elements

GStreamer plugins are written in the Rust language and use the Pravega Rust Client.

The GStreamer plugins for Pravega are open source, Apache licensed, and available for download at: https://github.com/pravega/gstreamer-pravega.

Pravega Sink (pravegasink)

Writes video and/or audio to a Pravega byte stream. Pravega Sink receives a series of byte buffers from an upstream element and writes the bytes to a Pravega byte stream.

Each buffer is framed with the buffer size and the absolute timestamp (nanoseconds since 1970-01-01 00:00:00 International Atomic Time). This can be used for storing a wide variety of multimedia content including H.264 video, AC3 audio, and multimedia containers such as MPEG transport streams and MP4.

Writes of buffers 8 MiB or less are atomic.
Writes a time index to an auxiliary Pravega stream to enable efficient seeking. Writes an index stream associated with each data stream. The index stream consists of 20-byte records containing the absolute timestamp and the byte offset. A new index record is written for each key frame.
Pravega data and index streams can be truncated, which means that all bytes earlier than a specified offset can be deleted.
Pravega Sink can be stopped (gracefully or ungracefully) and restarted, even when writing to the same stream. Since Pravega provides atomic appends, it is guaranteed that significant corruption will not occur.
Arbitrary GStreamer buffers can be stored and transported using Pravega by utilizing the gdppay and gdpdepay elements.

The following gst-inspect-1.0 output shows the plugin details for the GStreamer Plugin for Pravega.

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

  fragmp4pay: Fragmented MP4 Payloader
  pravegasink: Pravega Sink
  pravegasrc: Pravega Source
  pravegatc: Pravega Transaction Coordinator
  timestampcvt: Convert timestamps

  5 features:
  +-- 5 elements

The following shows the Pravega Sink details for the GStreamer plugin for Pravega.

Factory Details:
  Rank                     none (0)
  Long-name                Pravega Sink
  Klass                    Sink/Pravega
  Description              Write to a Pravega stream

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----GstBaseSink
                         +----PravegaSink

Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      ANY

Clocking Interaction:
  element requires a clock
  element is supposed to provide a clock but returned NULL

Element has no URI handling capabilities.

Pads:
  SINK: 'sink'
    Pad Template: 'sink'

Element Properties: allow-create-scope : If true, the Pravega scope will be created if needed. flags: writable Boolean. Default: true Write only async : Go asynchronously to PAUSED flags: readable, writable Boolean. Default: true blocksize : Size in bytes to pull per buffer (0 = default) flags: readable, writable Unsigned Integer. Range: 0 - 4294967295 Default: 4096 buffer-size : Size of buffer in number of bytes flags: writable Unsigned Integer. Range: 0 - 4294967295 Default: 131072 Write only controller : Pravega controller. If not specified, this will use the value of the environment variable PRAVEGA_CONTROLLER_URI. If that is empty, it will use the default of tcp:// 127.0.0.1:9090. flags: writable String. Default: null Write only enable-last-sample : Enable the last-sample property flags: readable, writable Boolean. Default: true index-max-sec : Force index record if one has not been created in this many seconds, even at delta frames. flags: writable Double. Range: 0 - inf Default: 10 Write only index-min-sec : The minimum number of seconds between index records flags: writable Double. Range: 0 - inf Default: 0.5 Write only keycloak-file : The filename containing the Keycloak credentials JSON. If not specified, this will use the value of the environment variable KEYCLOAK_SERVICE_ACCOUNT_FILE. If that is empty, authentication will be disabled. flags: writable String. Default: null Write only last-sample : The last sample received in the sink flags: readable Boxed pointer of type "GstSample" max-bitrate : The maximum bits per second to render (0 = disabled) flags: readable, writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 max-lateness : Maximum number of nanoseconds that a buffer can be late before it is dropped (-1 unlimited) flags: readable, writable Integer64. Range: -1 - 9223372036854775807 Default: -1 name : The name of the object flags: readable, writable, 0x2000


String. Default: "pravegasink0" parent : The parent of the object flags: readable, writable, 0x2000 Object of type "GstObject" processing-deadline : Maximum processing time for a buffer in nanoseconds flags: readable, writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 20000000 qos : Generate Quality-of-Service events upstream flags: readable, writable Boolean. Default: false render-delay : Additional render delay of the sink in nanoseconds flags: readable, writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 retention-bytes : The number of bytes that the video stream will be retained. flags: writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 Write only retention-days : The number of days that the video stream will be retained. flags: writable Double. Range: 0 - inf Default: 0 Write only retention-maintenance-interval-seconds: The oldest data will be deleted from the stream with this interval, according to the retention policy. flags: writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 900 Write only retention-type : If 'days', data older than 'retention-days' will be deleted from the stream. If 'bytes', the oldest data will be deleted so that the data size does not exceed 'retention-bytes'. If daysAndBytes, the oldest data will be deleted if it is older than retention-days or the data size exceeds retention-bytes. flags: writable Enum "GstRetentionType" Default: 0, "none" (0): none - If 'none', no data will be deleted from the stream. (1): days - If 'days', data older than 'retention-days' will be deleted from the stream. (2): bytes - If 'bytes', the oldest data will be deleted so that the data size does not exceed 'retention-bytes'. (3): daysAndBytes - If 'daysAndBytes', the oldest data will be deleted if it is older than 'retention-days' or the data size exceeds 'retention-bytes'. Write only seal : Seal Pravega stream when stopped flags: writable Boolean. Default: false Write only stats : Sink Statistics flags: readable Boxed pointer of type "GstStructure" average-rate: 0 dropped: 0 rendered: 0 stream : scope/stream flags: writable String. Default: null Write only sync : Sync on the clock flags: readable, writable Boolean. Default: true throttle-time : The time to keep between rendered buffers (0 = disabled) flags: readable, writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 timestamp-mode : Timestamp mode used by the input flags: writable Enum "GstTimestampMode" Default: 2, "tai"


(0): realtime-clock - (DEPRECATED) Pipeline uses the realtime clock which provides nanoseconds since the Unix epoch 1970-01-01 00:00:00 UTC, not including leap seconds. This mode is deprecated. Instead, use the timestampcvt element with input-timestamp-mode=relative. (1): ntp - (DEPRECATED) Input buffer timestamps are nanoseconds since the NTP epoch 1900-01-01 00:00:00 UTC, not including leap seconds. Use this for buffers from rtspsrc (ntp-sync=true ntp-time-source=running-time). This mode is deprecated. Instead, use the timestampcvt element with input-timestamp-mode=ntp. (2): tai - Input buffer timestamps are nanoseconds since 1970-01-01 00:00:00 TAI International Atomic Time, including leap seconds. Write only ts-offset : Timestamp offset in nanoseconds flags: readable, writable Integer64. Range: -9223372036854775808 - 9223372036854775807 Default: 0

Pravega Source (pravegasrc)

Reads video written by Pravega Sink. It reads a series of byte buffers from a Pravega byte stream and delivers them to downstream components.

Guaranteed to read the byte buffers in the same order in which they were written by Pravega Sink. Buffer timestamps (PTS) are also maintained.

Seekable by absolute time. The index is used to efficiently identify the offset at which to begin reading. Pravega Source will respond to seekable queries by providing the first and last timestamps in the time index.

It is common to have one process write to a Pravega Sink while one or more other processes across a network read from the same Pravega stream using the Pravega Source.

Tail reads are able to achieve around 20 ms of end-to-end latency (less than 1 frame).
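As a hedged sketch of a tail read, the following pipeline starts at the most recent random-access point and appends the fragmented MP4 output to a local file (stop it with Ctrl+C); the stream name and output path are placeholders.

gst-launch-1.0 -v \
  pravegasrc stream=examples/my-stream start-mode=latest \
  ! filesink location=/mnt/data-project/tail.mp4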

Factory Details:
  Rank                     none (0)
  Long-name                Pravega Source
  Klass                    Source/Pravega
  Description              Read from a Pravega stream

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----GstBaseSrc
                         +----GstPushSrc
                               +----PravegaSrc


Pad Templates:
  SRC template: 'src'
    Availability: Always
    Capabilities:
      ANY

Element has no clocking capabilities.
Element has no URI handling capabilities.

Pads:
  SRC: 'src'
    Pad Template: 'src'

Element Properties: allow-create-scope : If true, the Pravega scope will be created if needed. flags: writable Boolean. Default: true Write only blocksize : Size in bytes to read per buffer (-1 = default) flags: readable, writable Unsigned Integer. Range: 0 - 4294967295 Default: 4096 buffer-size : Size of buffer in number of bytes flags: writable Unsigned Integer. Range: 0 - 4294967295 Default: 131072 Write only controller : Pravega controller. If not specified, this will use the value of the environment variable PRAVEGA_CONTROLLER_URI. If that is empty, it will use the default of tcp:// 127.0.0.1:9090. flags: writable String. Default: null Write only do-timestamp : Apply current stream time to buffers flags: readable, writable Boolean. Default: false end-mode : The position to end reading the stream at flags: writable Enum "GstEndMode" Default: 0, "unbounded" (0): unbounded - Do not stop until the stream has been sealed. (1): latest - Determine the last byte in the data stream when the pipeline starts. Stop immediately after that byte has been emitted. (2): latest-indexed - Search the index for the last record when the pipeline starts. Stop immediately before the located position. (3): timestamp - Search the index for the record on or immediately after the specified end-timestamp or end-utc. Stop immediately before the located position. Write only end-timestamp : If end-mode=timestamp, this is the timestamp at which to stop, in nanoseconds since 1970-01-01 00:00 TAI (International Atomic Time). flags: writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 18446744073709551615 Write only end-utc : If end-mode=utc, this is the timestamp at which to stop, in RFC 3339 format. For example: 2021-12-28T23:41:45.691Z flags: writable String. Default: null Write only keycloak-file : The filename containing the Keycloak credentials JSON. If not specified, this will use the value of the environment variable KEYCLOAK_SERVICE_ACCOUNT_FILE. If that is empty, authentication will be disabled. flags: writable


String. Default: null Write only name : The name of the object flags: readable, writable, 0x2000 String. Default: "pravegasrc0" num-buffers : Number of buffers to output before sending EOS (-1 = unlimited) flags: readable, writable Integer. Range: -1 - 2147483647 Default: -1 parent : The parent of the object flags: readable, writable, 0x2000 Object of type "GstObject" start-mode : The position to start reading the stream at flags: writable Enum "GstStartMode" Default: 1, "earliest" (0): no-seek - This element will not initiate a seek when starting. It will begin reading from the first available buffer in the stream. It will not use the index and it will not set the segment times. This should generally not be used when playing with sync=true. This option is only useful if you wish to read buffers that may exist prior to an index record. (1): earliest - Start at the earliest available random-access point. (2): latest - Start at the most recent random-access point. (3): timestamp - Start at the random-access point on or immediately before the specified start-timestamp or start-utc. Write only start-timestamp : If start-mode=timestamp, this is the timestamp at which to start, in nanoseconds since 1970-01-01 00:00 TAI (International Atomic Time). flags: writable Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 Write only start-utc : If start-mode=utc, this is the timestamp at which to start, in RFC 3339 format. For example: 2021-12-28T23:41:45.691Z flags: writable String. Default: null Write only stream : scope/stream flags: writable String. Default: null Write only typefind : Run typefind before negotiating (deprecated, non-functional) flags: readable, writable, deprecated Boolean. Default: false

Pravega Transaction Coordinator (pravegatc)

The GStreamer element pravegatc can be used in a pipeline with a pravegasrc element to provide failure recovery. A pipeline that includes these elements can be restarted after a failure and the pipeline will resume from where it left off.

The pravegatc element periodically writes the timestamp of the current buffer to a Pravega table. When the pravegatc element starts, if it finds a PTS in this Pravega table, it sets the start-timestamp property of the pravegasrc element.

NOTE: In the SDP 1.3 implementation, some buffers may be processed more than once or never at all.
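A minimal sketch of its use is shown below; the table name examples/copy-my-stream-state is a placeholder that you choose, and the exact placement of pravegatc in a production pipeline may differ. On restart, pravegatc looks up the saved position and adjusts pravegasrc as described above.

gst-launch-1.0 -v \
  pravegasrc stream=examples/my-stream \
  ! pravegatc table=examples/copy-my-stream-state \
  ! pravegasink stream=examples/copy-of-my-stream sync=false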

The following template describes the elements of the Pravega Transaction Coordinator GStreamer plugin.

Factory Details:
  Rank                     none (0)
  Long-name                Pravega Transaction Coordinator
  Klass                    Generic
  Description              This element can be used in a pipeline with a pravegasrc element to provide failure recovery. A pipeline that includes these elements can be restarted after a failure and the pipeline will resume from where it left off. The current implementation is best-effort which means that some buffers may be processed more than once or never at all. The pravegatc element periodically writes the PTS of the current buffer to a Pravega table. When the pravegatc element starts, if it finds a PTS in this Pravega table, it sets the start-timestamp property of the pravegasrc element.

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----PravegaTC

Pad Templates:
  SRC template: 'src'
    Availability: Always
    Capabilities:
      ANY

  SINK template: 'sink'
    Availability: Always
    Capabilities:
      ANY

Element has no clocking capabilities.
Element has no URI handling capabilities.

Pads:
  SINK: 'sink'
    Pad Template: 'sink'
  SRC: 'src'
    Pad Template: 'src'

Element Properties: controller : Pravega controller. If not specified, this will use the value of the environment variable PRAVEGA_CONTROLLER_URI. If that is empty, it will use the default of tcp:// 127.0.0.1:9090. flags: writable String. Default: null Write only keycloak-file : The filename containing the Keycloak credentials JSON. If not specified, this will use the value of the environment variable KEYCLOAK_SERVICE_ACCOUNT_FILE. If that is empty, authentication will be disabled. flags: writable String. Default: null Write only name : The name of the object flags: readable, writable, 0x2000 String. Default: "pravegatc0" parent : The parent of the object flags: readable, writable, 0x2000


Object of type "GstObject" table : The scope and table name that will be used for storing the persistent state. The format must be 'scope/table'. flags: writable String. Default: null Write only

Timestamp Convert (timestampcvt)

This element converts PTS timestamps for buffers. Use this for pipelines that will eventually write to pravegasink. This element drops any buffers without PTS. Additionally, any PTS values that decrease will have their PTS corrected.
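A hedged sketch of its use with an RTSP camera is shown below; the exact element ordering used by the product's own camera recorder pipeline may differ, and the camera address and stream name are placeholders.

gst-launch-1.0 -v \
  rtspsrc location=rtsp://10.1.2.3/cam ntp-sync=true ntp-time-source=running-time \
  ! rtph264depay \
  ! h264parse \
  ! timestampcvt input-timestamp-mode=ntp \
  ! mp4mux streamable=true fragment-duration=1 \
  ! fragmp4pay \
  ! pravegasink stream=examples/camera001 sync=false timestamp-mode=tai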

Factory Details:
  Rank                     none (0)
  Long-name                Convert timestamps
  Klass                    Generic
  Description              This element converts PTS timestamps for buffers. Use this for pipelines that will eventually write to pravegasink (timestamp-mode=tai). This element drops any buffers without PTS. Additionally, any PTS values that decrease will have their PTS corrected.

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----TimestampCvt

Pad Templates:
  SRC template: 'src'
    Availability: Always
    Capabilities:
      ANY

  SINK template: 'sink'
    Availability: Always
    Capabilities:
      ANY

Element has no clocking capabilities.
Element has no URI handling capabilities.

Pads:
  SINK: 'sink'
    Pad Template: 'sink'
  SRC: 'src'
    Pad Template: 'src'

Element Properties: input-timestamp-mode: Timestamp mode used by the input flags: writable Enum "GstInputTimestampMode" Default: 0, "ntp" (0): ntp - Input buffer timestamps are nanoseconds since the NTP epoch 1900-01-01 00:00:00 UTC, not including leap seconds. Use this for buffers from rtspsrc (ntp-sync=true


ntp-time-source=running-time) with an RTSP camera that sends RTCP Sender Reports. (1): relative - Input buffer timestamps do not have a known epoch but relative times are accurate. The offset to TAI time will be calculated as the difference between the system clock and the PTS of the first buffer. This element will apply this offset to produce the PTS for each output buffer. (2): tai - Input buffer timestamps are nanoseconds since 1970-01-01 00:00:00 TAI International Atomic Time, including leap seconds. Use this for buffers from pravegasrc. Write only name : The name of the object flags: readable, writable, 0x2000 String. Default: "timestampcvt0" parent : The parent of the object flags: readable, writable, 0x2000 Object of type "GstObject"

Fragmented MP4 Payloader (fragmp4pay)

This element accepts fragmented MP4 input from mp4mux and emits buffers suitable for writing to pravegasink. Each output buffer will contain exactly one moof and one mdat atom in their entirety. Additionally, output buffers containing key frames will be prefixed with the ftype and moov atoms, allowing playback to start from any key frame.

Factory Details:
  Rank                     none (0)
  Long-name                Fragmented MP4 Payloader
  Klass                    Generic
  Description              This element accepts fragmented MP4 input from mp4mux and emits buffers suitable for writing to pravegasink. Each output buffer will contain exactly one moof and one mdat atom in their entirety. Additionally, output buffers containing key frames will be prefixed with the ftype and moov atoms, allowing playback to start from any key frame.

Plugin Details:
  Name                     pravega
  Description              GStreamer Plugin for Pravega
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstpravega.so
  Version                  0.7.0-RELEASE
  License                  unknown
  Source module            gst-plugin-pravega
  Source release date      2021-08-15
  Binary package           gst-plugin-pravega
  Origin URL               https://github.com/pravega/gstreamer-pravega

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----FragMp4Pay

Pad Templates:
  SRC template: 'src'
    Availability: Always
    Capabilities:
      ANY

  SINK template: 'sink'
    Availability: Always
    Capabilities:
      ANY

Element has no clocking capabilities.
Element has no URI handling capabilities.

Pads:
  SINK: 'sink'
    Pad Template: 'sink'
  SRC: 'src'
    Pad Template: 'src'

Element Properties:
  name                : The name of the object
                        flags: readable, writable, 0x2000
                        String. Default: "fragmp4pay0"
  parent              : The parent of the object
                        flags: readable, writable, 0x2000
                        Object of type "GstObject"

Create custom GStreamer applications

By creating a custom Docker image, you can deploy more sophisticated applications as described in this section.

About this task

The Python pravega-to-pravega.py application (https://github.com/pravega/gstreamer-pravega/blob/master/python_apps/pravega-to-pravega.py) is similar to the copy-my-stream pipeline shown in the previous section; however, this example includes improved failure recovery.

Steps

1. Create your custom Python application.

Refer to the following examples:

pravega-to-pravega.py
rtsp-camera-to-pravega.py
pravega-to-object-detection-to-pravega.py

2. Create a Dockerfile similar to the following:

FROM docker-repo.example.com/nautilus/gstreamer:1.3
COPY my_app.py /
CMD ["/my_app.py"]

3. Run the command docker build --tag docker-repo.example.com/my-app:0.0.1 . to build your image (note the trailing dot, which specifies the build context).

4. Run docker push docker-repo.example.com/my-app:0.0.1 to push your image to a Docker repository.

5. Create the GStreamer pipeline as shown above, but specify the image docker-repo.example.com/my-app:0.0.1.

Connect to a remote SDP server

When used in a GStreamer Pipeline in SDP, the GStreamer Plugin for Pravega obtains connection details for Pravega in the same SDP cluster through environment variables. It is also possible for GStreamer applications running outside of SDP to connect to the Pravega instance in SDP.

Steps

1. Obtain the Keycloak credentials file from the SDP Pravega cluster.

NAMESPACE=examples
export PRAVEGA_CONTROLLER_URI=tls://$(kubectl get -n nautilus-pravega ingress/pravega-controller -o jsonpath="{.spec.tls[0].hosts[0]}"):443
export KEYCLOAK_SERVICE_ACCOUNT_FILE=${HOME}/keycloak.json
kubectl get secret ${NAMESPACE}-pravega -n ${NAMESPACE} -o jsonpath="{.data.keycloak\.json}" | base64 -d > ${KEYCLOAK_SERVICE_ACCOUNT_FILE}
echo ${PRAVEGA_CONTROLLER_URI}
echo ${KEYCLOAK_SERVICE_ACCOUNT_FILE}
cat ${KEYCLOAK_SERVICE_ACCOUNT_FILE}

2. Run the remaining commands on the GStreamer application host. Create the file ${HOME}/keycloak.json containing the output from the command cat ${KEYCLOAK_SERVICE_ACCOUNT_FILE} that you ran in the previous step.

3. Configure the GStreamer application.

export PRAVEGA_CONTROLLER_URI=...
export KEYCLOAK_SERVICE_ACCOUNT_FILE=${HOME}/keycloak.json

4. (Optional) To create a GStreamer pipeline that connects to more than one Pravega cluster, specify the properties controller and keycloak-file for the pravegasrc, pravegasink, and pravegatc elements, as in the sketch after these steps.

5. (Optional) If you want to run the GStreamer application in a Kubernetes cluster, store the Keycloak credentials file as a Kubernetes Secret and mount it as a file, which requires you to directly create a StatefulSet.
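For step 4, connection details can be passed directly as element properties rather than through environment variables. The following is only a sketch; the controller URI shown is a placeholder for the value printed in step 1.

# Hedged sketch: read from a remote SDP cluster using explicit element properties.
gst-launch-1.0 -v \
  pravegasrc stream=examples/my-stream \
    controller=tls://pravega-controller.sdp.example.com:443 \
    keycloak-file=${HOME}/keycloak.json \
  ! filesink location=./export.mp4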

Object Detection with NVIDIA DeepStream

Object detection algorithms can detect objects, such as vehicles and people, in video images using NVIDIA's DeepStream SDK.

As an SDP developer, you create your own Kubernetes containers for your applications and models. These can run in SDP as continuous GStreamer pipelines.

Real-time video object detection architecture

What is NVIDIA DeepStream SDK?

NVIDIA's DeepStream SDK delivers a complete streaming analytics toolkit for AI-based multi-sensor processing, video, and image understanding. It is built around GStreamer and provides GStreamer plugins, such as GPU accelerated H.264 video encoding/decoding and deep learning inference.


NVIDIA DeepStream requires an NVIDIA GPU, such as an embedded edge device running the Jetson platform, or a larger edge or datacenter GPU such as the T4.

Pravega protocol adapter for NVIDIA DeepStream

The Pravega protocol adapter provides an implementation of NVIDIA DeepStream message broker as follows:

Existing DeepStream applications only need to modify their configurations to use Pravega.
Used to write metadata, such as inferred bounding boxes, to a Pravega event stream.
Not intended to write video or audio data (you can use Pravega Sink for that).
Written in the Rust language and uses the Pravega Rust client.
Available for download here: https://github.com/pravega/gstreamer-pravega/tree/dev2/deepstream/pravega_protocol_adapter

NVIDIA transfer learning toolkit

Training deep learning models and using the NVIDIA Transfer Learning Toolkit is out of scope for this guide. We describe it in this section so that users understand the complete process.

NVIDIA Transfer Learning Toolkit consists of a set of pre-trained deep learning models and tools to help data scientists train deep learning models. Pre-trained models include TrafficCamNet (detect and track cars), PeopleNet (people counting, heatmap generation, social distancing), License Plate Detection, Face Detection, and more. These models can be deployed out of the box for applications in smart city, retail, public safety, healthcare, and others. All models are trained on thousands of proprietary images and can achieve very high accuracy on NVIDIA test data.

Data scientists can also use the NVIDIA Transfer Learning Toolkit to create their own deep learning models (referred to as retraining or transfer learning), export them to the appropriate format, and then import into their DeepStream application for real-time inference. This workflow is shown below.


Build and run a DeepStream development pod

You can use the Docker container in this section for interactive development environments, including Visual Studio Code, to build and run a DeepStream development pod.

Steps

1. Create the file ~/deepstream-dev-pod.yaml with the following contents.

image:
  repository: devops-repo.isus.emc.com:8116/nautilus/deepstream-dev-pod:faheyc
ssh:
  authorized_keys: |
    ssh-rsa AAAABC...JmQfr
    ssh-rsa AAAABD...uwxyz

2. The field authorized_keys should contain one or more lines from ~/.ssh/id_rsa.pub. If you do not have the file ~/.ssh/id_rsa.pub, create one by running the command ssh-keygen -t rsa -b 4096.

cd
git clone --recursive https://github.com/pravega/gstreamer-pravega
cd gstreamer-pravega
export NAMESPACE=examples
k8s/deepstream-dev-pod.sh ~/deepstream-dev-pod.yaml

Results

The exact command to SSH to the pod will be printed by the previous command.


Build and run object detection in a DeepStream development pod

Object detection is a type of deep learning inference that can identify the coordinates and class of objects in an image. This example shows you how to run object detection in the DeepStream development pod that you created in the previous section using the DeepStream SDK.

Prerequisites

Run the steps in the previous section to build the development pod.

Steps

1. SSH to the DeepStream development pod.

2. Move the gstreamer-pravega directory to the shared and persistent project directory.

mv -v gstreamer-pravega data-project/
cd data-project/gstreamer-pravega
deepstream/scripts/build-and-test.sh

Creating an object detection model is a complex process. Refer to the following NVIDIA documentation for deployment details: NVIDIA TAO Toolkit and DeepStream documentation.

SDP Video Architecture

This section details the video storage and analytics architecture of the Streaming Data Platform.

Video compression and encoding

To understand the architecture of video with SDP, it is useful to know how video is compressed.

A video is simply a sequence of images, often 30 images per second. An image in a video is often referred to as a frame. An image compression format such as JPEG can compress a single image by removing redundant information. However, in a sequence of images, there is also redundant information between successive images.

Nearly all video compression algorithms achieve high compression rates by identifying and removing the redundant information between successive images and within each image. Therefore, a single frame of a compressed video is usually a delta frame, which means that it only contains the differences from the previous frame. A delta frame by itself cannot be used to reconstruct a complete frame. However, a typical video stream will occasionally (perhaps once per second) contain key frames, which contain all the information needed to reconstruct a complete frame.

The task of a video decoder is to begin decoding at a key frame and then apply successive delta frames to reconstruct each frame. A video decoder must necessarily maintain a state consisting of one or sometimes more frame buffers.

For more information, see A Guide to MPEG Fundamentals and Protocol Analysis.

MP4 media container format

SDP will write video in the common fragmented MP4 container format. This is an efficient and flexible format and allows storage of H.264 video and most audio formats.

The built-in MP4 mux in GStreamer (mp4mux) can output fragmented MP4. However, the output is not well-suited for Pravega, because it only writes important headers at the start of a pipeline, making truncation and seeking challenging.

SDP provides a new GStreamer element called fragmp4pay that will duplicate the required headers at every indexed position. This makes truncation and seeking trivial, and the resulting MP4 will remain playable with standard MP4 players.

Stream truncation and retention

Stream truncation can occur during writing and/or reading of the stream.

Truncating a stream deletes all data in the stream prior to a specified byte offset. Subsequent attempts to read the deleted data will result in an error. Reads of non-truncated data will continue to succeed, using the same offsets used prior to the truncation.


Video streams written by GStreamer consist of a data stream and an index stream. These must be truncated carefully to ensure consistency.

Truncation will be periodically performed by the Pravega Sink GStreamer element as it writes video to the Pravega stream. Video streams can have a retention policy by age, size, or both. The generic Pravega retention policy mechanism will not be used for video streams written by GStreamer. To conform with the HLS spec, the start of each fragment, and therefore each index position, must contain all video headers. This constraint is satisfied by careful indexing so that it does not impact truncation.

Seeking in a video stream

A common requirement for all video solutions is to allow seeking to a particular position in a video stream. For instance, a video player will often provide a seek control allowing the user to navigate to any time in the video.

Seeking will be by time in nanoseconds. In the case of Pravega, it is appropriate to seek by UTC or TAI, specified as the number of nanoseconds since the UTC epoch 1970-01-01 00:00:00 (either excluding or including leap seconds).

When seeking, the Pravega Source GStreamer element will locate a nearby timestamp in the index, obtain the offset, and then position the data reader at that offset. Because the element fragmp4pay was used to write the stream, it is guaranteed that playback can begin at this position.

Changing video streaming parameters

During the lifetime of a video stream, the video parameters will likely need to be changed.

A Pravega video stream will typically be long lasting. A stream duration of several years would be reasonable. During this lifetime, it is possible that the video parameters (resolution, frame rate, codec, bit rate, and so on) will need to be changed. This is accommodated by requiring all random-access points to start with the necessary headers.

New encoding sessions will start with the discontinuity bit set to TRUE.

Identifying Pravega video streams

When the Pravega Sink GStreamer element writes a video stream, it will contain the metadata tag video.

A Pravega stream can have any number of metadata tags associated with it. The metadata tag video is used by the SDP UI to select Pravega streams that contain video.

Timestamps

The GStreamer Plugin for Pravega stores timestamps as the number of nanoseconds since 1970-01-01 00:00 TAI (International Atomic Time), including leap seconds. This convention allows video samples, audio samples, and other events to be unambiguously represented, even during a leap second.

GStreamer represents a time duration as the number of nanoseconds (1e-9 seconds), encoded with an unsigned 64-bit integer. This provides a range of 584 years which is well within the expected lifetime of any GStreamer application. Unfortunately, there are multiple standards for the epoch (time zero) and whether leap seconds are counted.

By far, the most common representation of time in computers is the POSIX clock, which counts the number of seconds since 1970-01-01 00:00:00 UTC, except leap seconds. Thus, most computer clocks will go backward for 1 second when leap seconds occur.

It is quite possible that a backward moving timestamp will cause problems with a system that demands frame-level precision. Leap seconds are scheduled 6 months in advance by an international organization when the Earth's rotation slows down (or speeds up) relative to the historical average.

Although leap seconds will continue to be a challenge for all other components (for example, cameras, temperature sensors, Linux hosts), by using TAI in Pravega, we can at least unambiguously convert time stored in Pravega to UTC.

It is convenient to think of leap seconds much like Daylight Savings Time (DST). Most computer systems avoid the 1 hour jumps in time during DST by storing time as UTC and converting to the user's local time only when displaying the time. When this same concept is used to handle leap seconds, we get TAI.

As of 1 January 2017, when another leap second was added, TAI is exactly 37 seconds ahead of UTC.


As a consequence of using TAI in GStreamer Plugin for Pravega, it will need to know the leap second schedule. As of the current version, it can assume a fixed 37 second offset but if a new leap second is scheduled, then it will need to be updated with the leap second schedule. As of 2021-08-25, a leap second has not been scheduled, and it is possible that leap seconds will not be scheduled for years or even decades.

Storing and retrieving video in Pravega

This section describes how the Pravega Sink plugin for GStreamer writes video to a Pravega byte stream.

Data stream frame format

The Pravega Sink plugin for GStreamer writes video to a Pravega byte stream using the encoding below, which is defined in the event_serde.rs resource file. The entire frame is appended to the Pravega byte stream atomically.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|           type_code (32-bit BE signed int, set to 0)          |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             event_length (32-bit BE unsigned int)             |
|     number of bytes from reserved to the end of the payload   |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                         |D|R|I|
|                   reserved (set to 0)                   |I|A|N|
|                                                         |S|N|D|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                                                               |
|              timestamp (64-bit BE unsigned int)               |
+              nanoseconds since 1970-01-01 00:00 TAI           +
|              including leap seconds                           |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                   payload (variable length)                   |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

One tick mark represents one bit position.

Element Description

Type code The type code must be 0 which corresponds to pravega_wire_protocol::wire_commands::EventCommand.TYPE_CODE. This makes this byte stream compatible with a Pravega event stream reader.

Event length This is the number of bytes from reserved to the end of the payload. Encoded as a 32-bit big-endian unsigned int.

Reserved All reserved bits must be 0. These may be utilized in the future for other purposes.

DIS - discontinuity indicator True (1) if this event is or may be discontinuous from the previous event. This should usually be true for the first event written by a new process. It has the same meaning as in an MPEG transport stream.

RAN - random access indicator True (1) when the stream may be decoded without errors from this point. This is also known as IDR (Instantaneous Decoder Refresh). Usually, MPEG I-frames will have a true value for this field and all other events will have a false value.


Element Description

IND - include in index If true (1), this event should be included in the index. Typically, this will equal random_access but it is possible that one may want to index more often for Low-Latency HLS or less often to reduce the size of the index.

Timestamp The timestamp counts the number of nanoseconds since the epoch 1970-01-01 00:00 TAI (International Atomic Time). This definition is used to avoid problems with the time going backwards during positive leap seconds. If the timestamp is unknown or if there is ambiguity when converting from a UTC time source in the vicinity of a positive leap second, timestamp can be recorded as 0. As of 2021-08-25, TAI is exactly 37 seconds ahead of UTC. This offset will change when additional leap seconds are scheduled. This 64-bit counter will wrap in the year 2554. This timestamp reflects the sampling instant of the first octet in the payload, as in RFC 3550. For video frames, the timestamp will reflect when the image was captured by the camera. If DTS can differ from PTS, this timestamp should be the PTS. This allows different streams to be correlated precisely.

Payload Can be 0 or more fragmented MP4 atoms, or any other payload. Writes of the entire frame (type code through payload) must be atomic, which means it must be 8 MiB or smaller.
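
The following minimal Java sketch shows how a reader might decode the fixed-length header described in the table above from a buffer. It is not part of any SDP SDK; the class name is illustrative, and the exact bit positions of the DIS, RAN, and IND flags are an assumption based on the diagram.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative decoder for the data stream frame header described above.
public class FrameHeader {
    public final int typeCode;        // must be 0
    public final int eventLength;     // bytes from reserved to the end of the payload
    public final boolean discontinuity, randomAccess, includeInIndex;
    public final long timestampTaiNanos;

    public FrameHeader(ByteBuffer buf) {
        buf.order(ByteOrder.BIG_ENDIAN);          // all integers are big-endian
        typeCode = buf.getInt();
        eventLength = buf.getInt();
        int flags = buf.getInt();                 // reserved bits plus DIS, RAN, IND
        includeInIndex = (flags & 0x1) != 0;      // assumed bit positions: IND is the last bit,
        randomAccess   = (flags & 0x2) != 0;      // preceded by RAN and DIS
        discontinuity  = (flags & 0x4) != 0;
        timestampTaiNanos = buf.getLong();        // nanoseconds since 1970-01-01 00:00 TAI
        // a payload of (eventLength - 12) bytes follows
    }
}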

Data stream payload

We recommend that you store MP4 fragments in the payload of the data stream events. This provides the following features:

Multiplexing of any number of video and audio channels in the same byte stream
An additional time source mechanism to deal with clock drift
Truncation and concatenation at any point

Index stream frame format

When the Pravega Sink GStreamer element writes a video stream, it will also periodically (usually once per second) write records to an index stream. The index stream is a Pravega byte stream. The index stream has the same name as the video stream, but with "-index" appended to it. The index provides a mapping from the timestamp to the byte offset. It is used for seeking, truncation, failure recovery, and efficiently generating an HTTP Live Streaming (HLS) playlist.

The index must be reliable in the sense that if it has a {timestamp, offset} pair, then a reader must be able to read from this offset in the video stream. When possible, SDP attempts to gracefully handle violations of these constraints.

The index and related data stream must satisfy the following constraints.

1. If the first record in the index has timestamp T1 and offset O1 (T1, O1), and the last record in the index has timestamp TN and offset ON (TN, ON), then the data stream can be read from offset O1 inclusive to ON exclusive. The bytes prior to O1 may have been truncated. All bytes between O1 and ON have been written to the Pravega server and, if written in a transaction, the transaction has been committed. However, it is possible that reads in this range may block for a short time due to processing in the Pravega server. Reads in this range will not block due to any delays in the writer.

2. All events in the data stream between O1 and ON will have a timestamp equal to or greater than T1 and strictly less than TN.
3. If there are no discontinuities, the samples in the stream were sampled beginning at time T1 and for a duration of TN - T1.
4. If index records 2 through N have DIS of 0, then it is guaranteed that the bytes between O1 and ON were written continuously.

The index uses the encoding below, which is defined in the index.rs resource file.

The entire frame is appended to the Pravega byte stream atomically.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                         |D|R|R|
|           reserved (set to 0)                           |I|A|E|
|                                                         |S|N|S|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|           timestamp (64-bit BE unsigned int)                  |
+           nanoseconds since 1970-01-01 00:00 TAI              +
|           including leap seconds                              |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|           offset (64-bit BE unsigned int)                     |
+           byte offset into Pravega stream                     +
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

One tick mark represents one bit position where:

reserved, RES: All reserved bits must be 0. These may be utilized in the future for other purposes.
DIS: discontinuity indicator
RAN: random access indicator

See the event_serde.rs resource file for definitions of common fields.
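
As an illustration of how this index can support seeking, the following Java sketch decodes 20-byte index records (flags, timestamp, offset) and binary-searches them for the record at or before a requested TAI timestamp. It is a sketch based on the layout above, not part of any SDP SDK.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;

// Illustrative use of the index: find the byte offset to start reading from
// for a requested TAI timestamp.
public class IndexSeek {
    public record IndexRecord(int flags, long timestampTaiNanos, long offset) {}

    // Decodes one 20-byte index record: flags, timestamp, offset (all big-endian).
    public static IndexRecord decode(ByteBuffer buf) {
        buf.order(ByteOrder.BIG_ENDIAN);
        return new IndexRecord(buf.getInt(), buf.getLong(), buf.getLong());
    }

    // Returns the offset of the last index record whose timestamp is <= target,
    // assuming the records are sorted by timestamp.
    public static long seekOffset(List<IndexRecord> index, long targetTaiNanos) {
        int lo = 0, hi = index.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).timestampTaiNanos() <= targetTaiNanos) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return index.get(best).offset();
    }
}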

Pravega video server

The Pravega Video Server is a component of SDP that allows all major web browsers to play historical and live video.

The Pravega Video Server is an HTTP web service that supports live video streaming. By default, it is not enabled, but it can be enabled at any time by editing an Analytics Project.

The UI can be accessed with a URL such as https://video.your-project.sdp-demo.org/.

APIs to retrieve video data from Pravega

A browser retrieves video data from Pravega using the APIs described in this section.

Get HLS play list

Request: GET /scopes/my_scope/streams/my_stream/m3u8?begin=2021-04-19T00:00:00Z&end=2021-04-20T00:00:00Z

Response: m3u8 text file

To avoid very large responses, requests should include begin and end timestamps with a time span of 24 hours or less. Requests without a begin timestamp will start at the first index record. Requests without an end timestamp will end at the last index record.

The play list will be generated on-demand based on data in the video index.

Get media (video data)

Request: GET /scopes/my_scope/streams/my_stream/media?begin=0&end=12345

Response: 1 or more MP4 fragments

Requests must include a byte range. Allowed byte ranges are provided in the HLS play list.
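
A minimal sketch of requesting an HLS playlist from the Pravega Video Server with the Java HTTP client is shown below. The host name, scope, and stream are placeholders taken from the examples above; authentication and certificate handling for a real SDP deployment are omitted.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchPlaylist {
    public static void main(String[] args) throws Exception {
        // Placeholder host, scope, and stream; substitute your project's video server URL.
        String url = "https://video.your-project.sdp-demo.org"
                + "/scopes/my_scope/streams/my_stream/m3u8"
                + "?begin=2021-04-19T00:00:00Z&end=2021-04-20T00:00:00Z";   // keep the span to 24 hours or less
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // m3u8 playlist text
    }
}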


Working with Pravega Schema Registry

Pravega schema registry stores and manages schemas for the unstructured data (raw bytes) stored in Pravega streams. It allows application developers to define standard schemas for events, share them across an organization, and evolve them in a safe way.

Topics:

About Pravega schema registry
Create schema group
List and delete schemas
Add schema to streams
Edit schema group and add codecs

About Pravega schema registry

Pravega schema registry service is not limited to data stored in Pravega. It can serve as a general-purpose management solution for storing and evolving schemas in a wide variety of streaming and non-streaming use cases. Along with providing a storage layer for schemas, the schema registry service stores and manages encoding information in the form of codec information. Codecs can correspond to different compression or encryption that is used while encoding the serialized data at rest. The service generates unique identifiers for schemas and codec information pairs that users may use to tag their data.

Pravega schema registry stores schemas in a logical concept, known as a schema group. Application developers create the group in the Streaming Data Platform (SDP) with a set of compatibility policies, which store a versioned history of all schemas. Policies allow the evolution of schemas according to the configured compatibility settings.

The registry service also provides a RESTful interface to store and manage schemas under schema groups. Users can evolve their schemas within the context of the schema group based on schema compatibility policies that are configured at a group level.

The registry service supports serialization formats in Avro, Protobuf, and JSON schemas. Users can also store and manage schemas from any serialization system. The service allows users to specify wanted compatibility policies for evolution of their schemas, but these policies are employed only for the natively supported serialization systems.

The SDP installer automatically installs the schema registry Helm chart and service accounts. The security configurations are the same as are used in Pravega.

Schema registry in SDP

This guide provides an overview of the registry service and describes Pravega schema registry in the context of its usage within the SDP UI. Management of the registry service is not supported from within the SDP UI; therefore, Pravega schema management is not documented in this guide. For detailed information about schema management, see: https://github.com/pravega/schema-registry/wiki/PDP-1:-Schema-Registry.

Application developers can use the registry service in multiple ways to fit their requirements. Developers can create and upload schemas that are associated with Pravega streams using the SDP UI into a cluster registry. Developers can configure applications to read shared schemas between multiple versions.

Schema management in the SDP is performed in the context of a Pravega stream and has the following broad phases in the SDP.

1. Create schema group.
   a. Add schema group to an existing Pravega stream.
   b. Add schemas and codecs to the schema group.
2. List schemas.
   a. List schemas and codecs that are used for a Pravega stream.
3. Edit schema group.
   a. Update compatibility policy.
   b. Add schemas.
   c. Add codecs.
4. Delete schema group.

Schema registry overview

Schema registry can be used as a general-purpose service for centrally storing, serving, and managing schemas for a host of services. Each service instance is stateless, and all stream schemas and metadata are durably stored in Pravega Tables. Each service instance can process all queries.

NOTE: Only Java applications can currently use schema registry as of SDP 1.2.

As shown in the following figure, the schema registry is decoupled from Pravega. Pravega applications that are using both the registry service and schema storage link schema registry and Pravega. This linkage allows Pravega to be optimized as a streaming storage system for writing and reading raw bytes. In addition, the schema registry REST APIs ensure that its scope is not limited to Pravega use cases.

Schema groups

Conceptually, schema registry stores a versioned history of schemas that are scoped to a logical group. A schema group is a logical collection where schemas for a Pravega stream are stored. A group is the top-level resource that is supported by the registry and is essentially a name under which all related schemas are grouped.

Application developers can create a schema group in the SDP UI identified by a unique group name. Group names can include an optional qualifier called namespace. Each namespace and group combination uniquely identifies a resource in schema registry.

A schema group is analogous to a Pravega stream. Similarly, a namespace is analogous to a Pravega scope. For each stream whose schemas users intend to manage, a corresponding group is created with the same name (group name = stream name and namespace = scope name). All schemas defining the structures of data in the Pravega stream are stored and evolved under this group.

The schema registry service manages the sharing of schemas between components so that schemas that are used in Pravega streams can evolve over time. The schema serialization and compatibility framework supports schema evolution, allowing applications to use different schemas during writes and reads as follows:

Serialization binary formats (for how the data is transmitted and stored) include Avro, Protobuf, JSON schemas, and custom formats.

Serialization and deserialization are handled behind the scenes in the SDP.


Schema registry SDK libraries provide the serializers. An application instantiates the serializer. All communication with the schema registry is hidden from users on both the read and write sides. As events are deserialized, the structure is returned to the streaming application.

For detailed information about how schema registry works behind the scenes, see: https://github.com/pravega/schema-registry/wiki/PDP-1:-Schema-Registry

Configuration compatibility settings include AllowAny and DenyAll.

The schema registry service includes:

Centralized registry: Allows reusable schemas between streaming data applications.
Version management: Stores a versioned history and allows schema versions to evolve with the data over time. A version number identifies each schema. Schemas can be backward, forward, or bidirectionally compatible. As schemas are evolved, they are captured in a monotonically increasing chronological version history.

Schema validation: Ensures that downstream applications do not break due to incompatible data formats or encoding. The service divides schemas into logical validation groups and validates schemas against previous schemas that share the same type in the history of the group.

Message encoding: The schema registry service generates unique identifiers for message encoding. Users can register different codecTypes under a group, and for each unique codecType and schema pair, the service generates unique encoding identifiers.

Support for multiple types

Schema registry allows for writing events of multiple types into the same Pravega stream. An example use case is event sourcing, where applications might try to record their events describing different actions in a specific order. Pravega streams provide strong consistency and ordering guarantees for the events that are written to Pravega.

The schema registry service does not impose a restriction on schemas within a group to be of a single type. Instead, Pravega applications can register schemas under the "type" and evolve schemas within this type.

The service divides schemas into logical validation groups and validates schema against previous schemas in the history of the group that share the same type. For example, if a stream had events of type UserAdded, UserUpdated, and UserRemoved, you register schemas under three different types within the same group. Each type evolves independently. The service assigns unique identifiers for each schema version in the group. The service also assigns type-specific version numbers for the order in which they were added to the group.

Schema terminology

Schema Schema describes the structure of user data.

Schema binary Schema binary is the binary representation of the schema. For Avro, Protobuf, and JSON, the service parses this binary representation and validates the schema to be well-formed and compatible with the compatibility policy. The schema binary can be up to 8 MB in size.

Codecs Schema registry includes support for registering codecs for a group. Codecs describe how data is encoded after serializing it. Typical usage of encoding would be to describe the compression type for the data. CodecType defines different types of encodings that can be used for encoding data while writing it to the stream. The schema registry serializer library has codec libraries for performing gzip and Snappy compression.

Schema event types

Schema registry allows multiple types of events to be managed under a single group. Any compatibility option allows you to use multiple data or event types within the same stream.

Each event or object type is a string that identifies the object that the schema represents. A group allows multiple types of schemas to be registered, evolves schemas of each type, and assigns them versions independently.

Schema compatibility

Schemas are considered compatible if data written using one schema can be read using another schema.

The registry defines rules for how the structure of data should be evolved while ensuring business continuity.


Compatibility policies determine how schemas are versioned within a group as they evolve. Compatibility types can be backward, forward, or bi-directional.

Backward compatibility allows readers using a newer schema to read data written with an older schema. Forward compatibility allows readers using an older schema to read data written with a newer schema.

The service includes the options AllowAny and DenyAll. The AllowAny option allows any changes to schemas without performing any compatibility checks. The DenyAll option prevents new schema registration.

The compatibility policy choice is only applicable for formats that the service implicitly understands. So for formats other than Avro, JSON, and Protobuf, the service only allows the policy choice of AllowAny or DenyAll. A client-side compatibility check is sketched after this terminology list.

Schema encoding identifier

A schema encoding identifier is a unique identifier for a combination of schema version and codec type.

Users can request the registry service to generate a unique encoding ID (within the group) that identifies a schema version and codecType pair. The service generates a unique 4-byte identifier for the codecType and schema version.

The service indexes the encoding ID to the corresponding encoding information and also creates a reverse index of encoding information to the encoding ID.

Schema evolution Schema evolution defines changes to the structure of the data. Schema evolution allows using different schemas during writes and reads.

Schema object type

The type of object that the schema represents. For example, if a schema represents an object that is called "User," the type should be set to "User."

The service evolves all schemas for User within the type context User.

Schema group A logical collection where schemas for a Pravega stream are registered and stored. A schema group is a resource where all schemas for all event types for a stream are organized and evolved. Unlike Pravega streams, which are created under a namespace that is called a scope, an SDP Schema Registry group has a flat structure. A group ID identifies each group. For Pravega streams, this group ID is derived from the fully qualified scoped name of the Pravega stream. The group name is the same as the Pravega stream, and the namespace is the same as the Pravega scope or project-name.

Each Pravega stream can map to a group in a registry service, however, a group can be used outside the context of Pravega streams as well.

Schema group namespace

The namespace is analogous to the Pravega scope. Namespace is optional and can be provided at the time of group creation.

Schema group properties

Each group is set up with a configuration that includes the serialization format for the schemas and the wanted compatibility policy. Only schemas that satisfy the serialization format and compatibility policy criteria for a group can be registered to it.

Schema identifier When the service registers a schema, it is assigned a unique 4-byte ID to identify the schema within the group. If there is only a single type of object within the group, then the schema ID is identical to the version number. If there are multiple object types, the schema ID represents the same schema that is represented by "type + version number." The schema ID is unique for a group and is generated from a monotonically increasing counter in the group. If a group has "n" schemas, a newly added schema has the ID "n+1." If the same schema is added to multiple groups, it can have different IDs assigned in each group.

Schema information

Schemas that are registered with the service are encapsulated in an object called SchemaInfo that declares the serialization format, type, and schema binary for the schema. The serialization format describes the serialization system where this schema is to be used.

Schema metadata

Metadata that is associated with a named schema.

Schema metadata properties map

User-defined additional metadata properties that are assigned to a group as a String/String map. These values are opaque to the service, which stores and makes them available. Each key and value string can be up to 200 bytes in length.


Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

Schema version Each update to a schema is uniquely identified by a monotonically increasing version number. As schemas are evolved for a message type, they are captured as a chronological history of schemas.

Schema serialization format

Serialization format can be one of Avro, Protobuf, JSON, or a custom format. The schema registry service has parsers for Avro, Protobuf, and JSON schemas and validates the supplied schema binary to be well formed for these known formats. When you specify a custom format, you must specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Schema version info or version number

If a schema is registered successfully, the service generates a schema ID that identifies the schema within the group. It also assigns the schema a version number, which is a monotonically increasing number representing the version for the schema type.

As a new schema is registered, it is compared with previous schemas of the same type. The new schema is assigned a version number in the context of the type. For example, the first schema for object User is assigned version 0, and the nth schema for object User is assigned version n-1.

The schema ID, object type, and version number are encapsulated under a VersionInfo object for each registered schema.
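
As referenced in the Schema compatibility entry above, compatibility for the natively supported formats can also be checked client-side. The following sketch uses the Apache Avro library for illustration (an assumption made here for clarity; the registry service performs its own checks when a schema is registered) to confirm that adding a field with a default value is a backward-compatible change.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

// Client-side backward compatibility check with Apache Avro (illustrative only).
public class CompatibilityCheck {
    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");
        // Can a reader using v2 read data written with v1?
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType());   // COMPATIBLE
    }
}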

Schema Registry REST APIs

Pravega schema registry exposes a REST API interface that allows developers to define standard schemas for their events and evolve them in a way that is backwards compatible and future proof. Schema registry stores a versioned history of all schemas and allows the evolution of schemas according to the configured compatibility settings. You can use the APIs to perform all the activities for the schema and groups if you have the prerequisite admin permissions.

There are two chief artifacts that Pravega applications interact with while using schema registry: the REST service endpoint and the registry-aware serializers. Any non-Pravega stream actions on schema registry can be performed using the schema registry's REST APIs directly, either programmatically or through the schema registry admin CLI.

Application development and programming are beyond the scope of this guide. For more information about streaming data application development, see the following repository:

Pravega schema registry REST APIs: https://github.com/pravega/schema-registry/wiki/REST-documentation
REST API Usage Examples: https://github.com/pravega/schema-registry/wiki/REST-API-Usage-Samples

Error codes

The following error codes may be returned to your running application:

500: Internal service error

Typically a transient error, such as intermittent network errors. Typical causes are connectivity issues with Pravega. Ensure that Pravega is running and that the schema registry service is running and is configured correctly.

If the issue persists, look in the registry service logs for WARN or ERROR level messages. Run kubectl logs -n nautilus-pravega

400: Bad argument

Typically due to an ill-formed schema file.

401: Unauthorized

403: Forbidden
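
A hedged sketch of how a client might react to these status codes is shown below; the endpoint URL, retry policy, and exception types are illustrative and not part of the schema registry SDK.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative handling of the status codes listed above; the endpoint is a placeholder.
public class RegistryCall {
    public static HttpResponse<String> getWithRetry(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        for (int attempt = 1; ; attempt++) {
            HttpResponse<String> resp = client.send(request, HttpResponse.BodyHandlers.ofString());
            switch (resp.statusCode()) {
                case 200: return resp;
                case 500:                                    // usually transient; retry a few times
                    if (attempt < 3) { Thread.sleep(1000L * attempt); continue; }
                    throw new RuntimeException("Registry unavailable after retries");
                case 400: throw new IllegalArgumentException("Ill-formed schema or request");
                case 401: throw new SecurityException("Unauthorized: check credentials");
                case 403: throw new SecurityException("Forbidden: insufficient permissions");
                default:  throw new RuntimeException("Unexpected status " + resp.statusCode());
            }
        }
    }
}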


Create schema group

Project members who have been granted access to a Pravega scope can create schema groups.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click the Create Schema link. The Create Schema Group screen displays.

Section Description

General Group Name

Name of an existing Pravega scope or project-name.

Configuration Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Compatibility


Section Description

Compatibility policies for schema evolution include: Backward, Forward, and bi-directional, such as BackwardTransitive, ForwardTransitive, BackwardTill, ForwardTill, Full, FullTransitive, BackwardTillAndForwardTill, AllowAny, DenyAll.

The compatibility policy choice is only applicable for formats that the service implicitly understands. For formats other than schema group types of Avro, JSON, and Protobuf, the registry service only allows the policy choice of AllowAny or DenyAll.

The AllowAny option allows any changes to schemas without performing any compatibility checks.

The DenyAll option prevents new schema registration.

Allow Multiple Serialization Formats

Enable the Allow Multiple Serialization Formats option to support multiple serialization formats and allow for custom schema formats. This allows multiple schemas to co-exist within a group, for example, event sourcing support.

Properties Click Add Property to add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

3. Click Save to save the schema group configuration settings. Click Cancel to return to the streams list.

List and delete schemas

Project members with Pravega scope access can list and delete existing schemas for that stream.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click Edit Schema.

The Edit Schema Group screen displays with the names of schemas, serialization format, properties, creation date, and version.

3. You can also delete schemas from this screen by clicking the Delete link for the schema group under the Actions column. Follow the UI prompts to delete the schema.


Add schema to streams

Add schema to existing Pravega streams.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name and then click the name of an existing Pravega stream to which you want to add a schema.

3. On the Streams view, click the Create Schema link to add schema to a stream.

4. The Create Schema Group screen displays with the following fields. Complete the configuration and then click Save.

Section Description

General Group Name

Name of an existing Pravega scope or project-name.

Configuration Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Compatibility

Compatibility policies for schema evolution include: Backward, Forward, and bi-directional, such as BackwardTransitive, ForwardTransitive, BackwardTill, ForwardTill, Full, FullTransitive, BackwardTillAndForwardTill, AllowAny, DenyAll.

The compatibility policy choice is only applicable for formats that the service implicitly understands. For formats other than schema group types of Avro, JSON, and Protobuf, the registry service only allows the policy choice of AllowAny or DenyAll.

The AllowAny option allows any changes to schemas without performing any compatibility checks.

The DenyAll option prevents new schema registration.


Section Description

Allow Multiple Serialization Formats

Enable the Allow Multiple Serialization Formats option to support multiple serialization formats and allow for custom schema formats. This allows multiple schemas to co-exist within a group, for example, event sourcing support.

Properties Click Add Property to add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

Edit schema group and add codecs

Project members with Pravega scope access can edit schema groups and add codecs.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click the Schemas tab. The Edit Schema Group screen displays.

3. Click the Add Schema button.

4. In the Add Schema panel, enter an optional schema name, serialization format, schema file, and additional metadata properties.


Configuration field Description

Add Schema Name Name of an existing Pravega scope or project-name

Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Schema File Browse to an existing Schema File on your local machine and then click Save. Simple example Avro schema file:

{ "type": "record", "name": "Type1", "fields": [ {"name": "a", "type": "string"}, {"name": "b", "type": "int"} ] }


Configuration field Description

You can view the schema by clicking the name link in the SDP UI after uploading the schema file.

Add Property

Add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

5. Click Save when you are finished.

6. Select the Codecs tab on the Edit Schema Group screen, and then click the Add Codec button to register codecs for a schema group. The Add Codec panel displays.

7. Browse on your local machine to upload a codec library. SDP supports gzip or Snappy compression. Click Add Property to add user-defined metadata key/value pairs that are assigned to a group as a string/string map.

8. Click Save when you are finished.


Working with Pravega Search (PSearch)

Topics:

Overview
Using Pravega Search
Health and maintenance
Pravega Search REST API

Overview

Pravega Search (PSearch) provides search functionality against Pravega streams.

Pravega Search enables robust querying against the data that is stored in Pravega streams. Pravega Search supports two types of queries:

Search queries: queries against all data that is stored in the stream and indexed.
Continuous queries: real-time queries against data as it is ingested into the stream.

Pravega Search can query highly structured data and full text data.

Pravega Search understands the structure of the data in a stream that is based on mapping information that you provide when you make a stream searchable. Using that structure definition, Pravega Search can process continuous queries and index all stored data in a stream. Indexing generates comprehensive Index documents, a type of metadata that helps to efficiently locate data occurrences in response to search queries.

Both existing and new streams can be configured as searchable. On existing streams with a large volume of data, the initial indexing may take some time.

Relationship to Elasticsearch/OpenSearch

NOTE: Elasticsearch, Logstash, and Kibana are trademarks of Elasticsearch BV, registered in the U.S. and in other countries. OpenSearch is a fully open-source search and analytics suite. OpenSearch includes OpenSearch (derived from Elasticsearch 7.10.2) and OpenSearch Dashboards (derived from Kibana 7.10.2).

Similar to OpenSearch, Pravega Search builds a distributed search engine that is based on the Apache Lucene index. Pravega Search also provides a REST API similar to the OpenSearch API, and works with OpenSearch Dashboards similar to the way that OpenSearch works with them.

For querying, Pravega Search supports the following OpenSearch query languages:

DQL as defined by OpenSearch
SQL as defined by Elasticsearch

See https://opensearch.org/ for information about OpenSearch.

Integration into SDP

Pravega Search is available only as part of SDP. It is not available with Open Source Pravega. SDP provides all Pravega Search functionality in addition to comprehensive deployment, management, security, and access to the PSearch REST API.

Pravega Search is implemented in the context of an SDP project. It runs in its own cluster that is associated with a project. A project has only one PSearch cluster.

Access to a PSearch cluster is based on project membership. Only project members and SDP Administrators can make streams searchable, submit queries, and view query results. Applications using the Pravega Search REST API must obtain project credentials.



With Pravega Search that is deployed in its own separated clusters, the stream indexing and query processing do not affect the efficiency of other stream processing functions. Each PSearch cluster has its own CPU and memory resources. The resources used for stream ingestion, stream storage, and analytic operations are not affected by the volume or timing of indexing and querying.

Pravega Search maintains index metadata in an infinitely expanding series of index documents, which are stored in system-level Pravega streams. SDP manages the index documents and all the indexing worker threads that perform the indexing and query processing.

Autoscaling for Pravega Search resources is built into SDP.

What is included with Pravega Search

Pravega Search licensing includes the following:

Pravega Search functionality is enabled in the SDP UI. Pravega Search is deployed separately in each project, in project-specific PSearch clusters. The Pravega Search REST API is available. Each PSearch cluster has a unique REST head for access. OpenSearch Dashboards are deployed in each project-specific PSearch cluster. A Pravega Search metrics stack (InfluxDB and Grafana dashboards) is deployed in each project-specific PSearch cluster.

Two query types

Users can submit search queries on the stored data that has finished indexing. They can register continuous queries on real-time data.

SDP supports two types of Pravega Search queries. Each query type solves a different type of use case.

Search queries Search queries search indexed data. In an optimized and correctly sized Streaming Data Platform, indexed data is available soon after ingestion. In addition to searching the stream, these queries can perform aggregations.

Search queries are submitted using REST API calls or using the embedded Kibana UI.

An example use case for a search query is to collect aggregated data over a time range, such as the average machine temperatures over the last ten minutes. A custom application or Kibana could monitor the averages and catch a developing abnormal condition.

Continuous queries

A continuous query is applied in real time against stream data as the data is ingested. Continuous queries can act as a filter, eliminating unwanted data, or perform data enrichment. Continuous queries can write matching, filtered, or enriched (tagged) data to output streams.

Project members can register continuous queries against a stream in the SDP UI or by submitting REST API calls. A query is registered once and runs continuously. The Pravega Search cluster scales up and down automatically based on real-time load.

Use cases for continuous queries include:

Alerting, based on system logs, device data, or transactions
Monitoring and enriching (tagging) original data
Analytics and aggregations

An example use case for a continuous query is to search for an anomaly condition on machinery, such as exceeding a temperature threshold. A query that defines the threshold is applied against the incoming stream of data. An output stream collects the matches, and a custom application can analyze the output stream.

Additional Pravega Search functionality

Here are some additional Pravega Search features.

Seamless installation and upgrades

Pravega Search is a fully integrated component of SDP. The Pravega Search software is included by default in the SDP distribution and is installed with the initial SDP deployment. Licensing determines whether Pravega Search features are available for use.

SDP upgrade images include the latest supported Pravega Search and Pravega upgrades. Installers do not need to check component version compatibility when performing SDP upgrades.


Alerts and health Pravega Search produces operational metrics that are collected, aggregated, and visualized in the project-level analytic metrics provided by SDP. See About analytic metrics for projects on page 94 for usage information.

Pravega Search generates alerts based on predefined rules that are forwarded to KAHM, as described in the Dell Technologies Streaming Data Platform Installation and Administration Guide.

Each Pravega Search pod maintains a service log. The logs are stored in the Kubernetes persistent volume store.

Autoscaling Pravega Search automatically scales its components. Scaling is infinite and automatic within the resources available in the Pravega Search cluster. The number of Pravega Search service pods is autoscaled up or down depending on current processing needs.

REST API The Pravega Search REST API is included in the SDP distribution. For more information, see Pravega Search REST API on page 168.

Limitations

The following limitations apply to Pravega Search in SDP 1.2.

Automatic scaling in the Pravega Search cluster does not include the case in which the Pravega Search cluster runs out of resources. In that case, you must manually add more nodes to the Pravega Search cluster.
The number of continuous queries that you can register on a stream is limited. The limit is 10 continuous queries per stream.
In a continuous query result stream, order is not guaranteed by default. The order may not be preserved due to parallel processing. However, you can enable exactly-once mode over the searchable stream using the REST API.

Using Pravega Search

The following table summarizes Pravega Search usage steps.

Table 6. Pravega Search procedures summary

1. Create a Pravega Search cluster for the project. There is one Pravega Search cluster per project. Method: UI or REST API.

2. Configure searchable streams. You can configure new or existing streams as searchable. As part of this configuration, you provide the structure of the stream data, in JSON format. Method: UI or REST API.

3. Register continuous queries. Continuous queries operate on incoming data in real time. An output stream stores the results of continuous queries. Method: UI or REST API.

4. Submit search queries. Search queries operate on stored stream data and on incoming data as it finishes indexing. Submit queries and view, analyze, save, and share results in Kibana. Method: Kibana or REST API.

5. Make a stream not searchable any longer. The index documents for that stream are removed. Method: UI or REST API.

Getting started

Pravega Search is implemented within the context of an SDP project. You must be a project member or SDP administrator to use Pravega Search.

To get started with Pravega Search in the SDP UI, click Analytics > project-name > Pravega Search.

NOTE: If the Pravega Search tab is not visible on a project page, it means that Pravega Search is not licensed for your SDP implementation.


Figure 26. Location of Search Cluster tab in the UI

Create Pravega Search cluster

A Pravega Search cluster must be created in a project before you can index or search streams in the project.

Prerequisites

The project must exist and appear on the Analytics page in the UI. You must be a member of the project or an Administrator.

About this task

A Pravega Search cluster is a collection of processing nodes that are dedicated to Pravega Search indexing and searching functions. A project has only one Pravega Search cluster, which manages multiple searchable streams in the project and all querying on those streams.

The Pravega Search cluster is separate from the analytic clusters (for example, the Flink clusters) in the project. Adding and deleting a Pravega Search cluster does not affect analytic processing or stream data storage.

Steps

1. Go to Analytics > project-name.

2. Click Search Cluster.

NOTE: If the Search Cluster tab is missing, it indicates that Pravega Search is not licensed for this SDP deployment.

3. Click Create Cluster.

NOTE: If the Create Cluster button is missing, it indicates that a Pravega Search cluster is already deployed for the

project.

4. On the Cluster Size page, select the size of the cluster you want to create.

General guidelines are:

Small: For 1 to 10 searchable streams.
Medium: For 10 to 50 searchable streams.
Large: For 50 to 100 searchable streams.

Customized sizes are possible. Contact Dell Customer Support.

As you click each size, the UI shows estimated resource usage, based on estimated number of services and pods that would be deployed.

5. Ensure the size that you want is highlighted and click Save. The Search Cluster tab contains a Processing message in the State column. Deployment may take some time.

6. When the State column shows Ready, you may continue to Make a searchable stream on page 151.


Results

Adding a Pravega Search cluster also creates a system-level Pravega scope and many system-level streams in the scope. These streams store the index documents that Pravega Search creates as it indexes data.

NOTE: These system-level scopes and streams are hidden from project members in the SDP UI. They are visible to Administrators in the UI (on the Pravega page) and visible to all users on the Kubernetes command line. You should be aware of their existence and purpose. Do not remove them.

Resize or delete a Pravega Search cluster

You can resize the cluster at any time without affecting the contents of the cluster. You may delete the cluster only if there are no searchable streams.

Prerequisites

The delete action is not implemented if there are any searchable streams in the project. The action returns a message indicating that you must first remove the searchable configuration from all the streams in the project before you can delete the Pravega Search cluster.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Cluster.

2. To resize the cluster:

a. On the cluster row, click Edit.
b. On the Cluster Size page, select the new size. You may upsize or downsize the cluster.
c. Click Save.

3. To delete the cluster:

a. On the cluster row, click Delete.
b. If you receive a message indicating that there are searchable streams in the cluster, see Make searchable stream not searchable on page 154.

Make a searchable stream

You can make an existing stream searchable or create a new stream and make it searchable.

About this task

A stream must be searchable before you can search it. When you make a stream searchable, the index is configured and the indexing process is initiated.

If there is existing data in the stream, Pravega Search indexes all the existing data. This may take some time, depending on how much data is in the stream. When this initial index processing is finished, all data in the stream is available for search queries.

New data ingested into the stream is indexed after all existing data has finished indexing.

A searchable stream has one index definition mapped to it. All the data in the stream is indexed according to its associated index definition.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Searchable Streams.

2. Click .

The Create Searchable Stream page appears.

3. In the Stream dropdown, select a stream name from the list of project streams, or select Create New Stream. If you select Create New Stream, the UI places you on the project's stream creation page. Create the new stream. When you are finished, the UI returns you to the Create Searchable Stream page, with the new stream name selected in the dropdown list.

4. Select Basic or Advanced configuration to configure shards.

A shard is a processing unit that works on indexing data. Pravega Search autoscales shards and other processing.


Option Description

Basic Starts index processing with default basic settings. As processing load increases, Pravega Search automatically scales the number of shards.

Advanced Starts index processing with a configured number of shards.

Advanced mode is suggested for users who have experience with Elasticsearch and Lucene, or who have consulted Dell Technologies support. Otherwise, choose Basic.

5. In the Identifier field, type a unique identifier for the searchable stream.

This value identifies the searchable stream when a user performs queries. It is similar to index name in ElasticSearch. Each searchable stream must have a unique identifier.

6. In the Index Mapping text box, provide a JSON object named Properties that describes the structure of the data in the stream.

You may copy the JSON from elsewhere and paste it into the text box.

See PSearch index mapping format on page 152 for the JSON object requirements and syntax.

7. Click Save.

NOTE: If the Save button is shaded, it means that the Index Mapping text box does not contain valid JSON. Correct the JSON to make the Save button active.

Results

When a stream is successfully made searchable, the following occurs:

The searchable stream appears on the Searchable Streams tab on the Project page. To view the indexing configuration of the stream, click the stream name.

Indexing of stream data starts. The indexing occurs in the background and is not visible in the UI.

PSearch index mapping format

An index mapping is required when you make a stream searchable. The index mapping defines the structure of the stream's incoming data.

Background information

Applications outside of SDP capture data as Java objects, JSON, or other formats. Pravega Writer applications serialize the object data into byte streams for efficient ingestion, and then Pravega deserializes the data into its original object format for storage. The index mapping that is required when a stream is made searchable defines the structure of the incoming data for PSearch. The required format for the index mapping is JSON.

Based on the index mapping that you provide, Pravega Search indexes the existing data in the stream and continues to index all newly arriving data. Pravega Search also uses the index mapping to process any continuous queries that are registered for the stream.

Index mapping format

The index mapping is a JSON object named properties:

{ "doc": {"properties": { "field-name-1": { "type": "data-type" }, "field-name-2": { "type": "data-type" }, "field-name-3": { "type": "data-type" } } } }


Each data-type must match the definition of the data in the Pravega Writer to prevent errors. For valid values for type, see https://www.elastic.co/guide/en/elasticsearch/reference/7.7/mapping-types.html#mapping-types.

Index mapping example

Consider a constant stream of CPU and memory usage measurements collected from many machines. A Pravega Writer application writes the measurements as events, using the machine name as event-id. Here is the index mapping:

Example index mapping

{ "doc": { "properties": { "machine_id": { "type": "keyword" }, "cpu_usage": { "type": "long" }, "mem_usage": { "type": "long" } } } }

Here is a sample event as written into the stream:

Example event contents

{ machine_id: machine-ubuntu-1, cpu_usage: 90, mem_usage: 80 }

Using the mapping, Pravega Search indexes the incoming stream data and can subsequently provide results to search queries about cpu and memory usage. It could also capture a continuous query intended to tag CPU and memory usage readings above some threshold values.
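
For illustration, a writer producing events like the example above could use the open-source Pravega Java client to write JSON strings to the stream. The controller URI, scope, and stream names below are placeholders, and a production application would typically use the schema registry serializers instead of raw strings.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class MetricsWriter {
    public static void main(String[] args) {
        // Placeholder controller URI, scope, and stream names.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("my_scope", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "machine-metrics", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            String event = "{\"machine_id\":\"machine-ubuntu-1\",\"cpu_usage\":90,\"mem_usage\":80}";
            writer.writeEvent("machine-ubuntu-1", event).join();   // routing key = machine name
        }
    }
}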

View searchable stream configuration

You can view the searchable stream configuration properties.

About this task

You cannot change the configuration of the searchable stream. Rather, you must make the stream not searchable, and then make the stream searchable again with a new configuration.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Searchable Streams.

A list of the searchable streams in the project appears.

2. To view the searchable stream configuration, including the index mapping definition, click the stream name.

A popup window shows the current configuration.

You may copy the index mapping and other information from the popup window.


Make searchable stream not searchable

You can make a searchable stream not searchable.

About this task

NOTE: This action does not affect the stream data. It affects only the ability to index and search the stream.

Making a stream not searchable does the following:

Deletes the index documents related to the stream. You can recreate index documents by making the stream searchable again.
Removes the index mapping definition.
Removes the stream name from registered continuous query configurations, if any.

Steps

1. Go to Analytics > project-name > Search Cluster > Searchable Streams.

2. Optionally save the index mapping definition for future use with these steps:

a. Click the stream name. b. In the popup window that appears, copy the properties object and save it elsewhere.

c. Close the popup window.

3. Click Detach. The stream is detached from the Pravega Search cluster.

Continuous queries

A continuous query processes realtime data and provides results in realtime. The results are stored in an output stream.

You can register a continuous query in the SDP UI or by using REST API calls.

A continuous query may be registered for one or multiple searchable streams in the project. With exactly-once and strict ordering semantics guaranteed, every incoming event in a stream is processed by each continuous query in strict order. Each event is processed by all the queries that are registered against the stream.

Each continuous query has a result stream. Queries may share result streams. The result stream indicates whether the event matches the query or not. Two main usage strategies for result streams are:

1. Filter: The result stream contains only those events that matched one of the continuous queries that are registered against the stream.

2. Tag: The result stream contains events that matched at least one continuous query. The output event can have multiple tags on it.

Pravega Search and the schema registry

Pravega Search is integrated with the Pravega schema registry.

Schema registry provides interfaces for developers to define standard schemas for their events, share them across the organization and evolve them in a safe way. The schema registry stores a versioned history of all schemas associated with streams and allows the evolution of schemas according to the configured compatibility settings.

Pravega Search uses the schema registry to advantage in the following ways:

Pravega Search can support various serialization formats. The Pravega Search event reader can deserialize source stream events using the same deserializer that the event writer used when writing to the source stream.

Pravega Search can support various data schemas. Pravega Search can get a deserialized object encapsulated with its schema information, so that the deserialized object can be indexed, queried, and written to output streams with its original schema, regardless of what that schema is.

In the SDP UI, developers define schema groups to store variations of schemas. A Pravega stream is optionally associated with a schema group. For Pravega Search, the continuous query output streams can also optionally be associated with schema groups. You can explicitly assign the output stream to a schema group or allow Pravega Search to make assumptions and use default settings. For details, see Continuous query output streams on page 156 and Example: Schema groups with continuous query output streams on page 159.


Evolving schemas

Changing the schema of the input stream is transparent to Pravega Search. The Pravega readers automatically adapt to read the new events with the new schema.

Evolving schemas also have no impact on the output stream. A schema group can contain multiple schemas. When a new schema is found, it is appended to the existing schema group for the output stream.

Register a continuous query

You can register a continuous query in the SDP UI. The query can operate against one or more streams in the project.

About this task

The continuous query page in the UI lets you define new streams and make existing streams searchable if you have not previously completed those steps.

A continuous query writes to a specified output stream. The continuous query page also lets you define a new stream for the output stream if you do not have one. The output stream does not have to be searchable. Multiple continuous queries can share the same output stream.

Steps

1. Go to Analytics > > Search Cluster > Continuous Queries.

2. Click Create Continuous Query.

3. In Name, type a unique name for this query.

4. In Query, type or paste the query statement.

See Continuous query construction on page 156 for information and examples.

5. In Searchable Streams, choose a stream from the drop-down list. You may also choose Create new stream. After choosing one stream, you may optionally choose additional streams. The list of streams that you select appears horizontally in the field. When you choose Create new stream, the UI places you on the Searchable Streams page. You can make an existing stream searchable or create a new stream and make it searchable. The UI returns you to this continuous query page when you are done.

6. In Output Stream, choose an existing stream from the list or create a new stream using Create new stream.

If you choose Create new stream, the UI lets you define the new stream and then returns you to this page to continue defining the continuous query.

The output stream may be a searchable stream but it does not have to be. It can be a normal stream.

A continuous query writes results to only one output stream.

Multiple continuous queries can share the same output stream, with limitations. See Continuous query output streams on page 156.

7. In Output Stream Format, choose one of the following:

Tag: The output stream contains the events that matched the query. Events have a _tag field whose value is the names of all continuous queries that matched that event. This output format is typically used when multiple queries are associated with the same output stream and all results are written to the same output.

Filter: The output stream contains the events that matched the query, with no additional fields. Typical use cases for the filter option are: when there is only one query associated with one output stream, or when there are multiple filter queries on one searchable stream, with the results for each redirected to different output streams.

8. Click Save.

If Save is not enabled, it means that there is an error on the input page.

Check that all fields are completed. Check that the query format is valid.

Results

When Save is successful, the continuous query is active. If the input streams are ingesting data, the query is applied against that data.


Continuous query construction

Continuous queries are query statements that run against incoming data in real time.

The supported syntax for continuous queries in SDP v1.2 is Elasticsearch DSL version 7.7.0, as described at https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl.html.

Example

The following query is in DSL format. It searches for events whose CPU usage (the cpu field) exceeds 95.

{ "query": { "range": { "cpu": { "gt": 95 } } } }

Continuous query output streams

The results of a continuous query are captured in the output stream that you specify when you register the query.

The output stream format depends on the tag or filter option chosen.

Tag: Pravega Search writes stream events that match a query to the output stream. The events are enriched with an additional field, _tag, which is an array of tags. Tag values are the names of all the continuous queries that match the event. Events without any tags are not written to the output stream.

Filter: Pravega Search writes the stream events that match the continuous query to the output stream.

Schema versus json output

Source stream has a schema and output is Filter

If the query's source stream has a schema associated with it, and the continuous query uses the Filter output option, Pravega Search writes the output as serialized data associated with the same schema as the source stream. This means that the source stream and output stream must be compatible as follows:

If you associated a schema group with the output stream, then the properties in the schema group for the source stream must match the properties in the schema group for the output stream. The schema group properties are the serialization format and a compatibility setting.

If the output stream does not have an associated schema group, then Pravega Search assigns the schema group ANY to the output stream and a compatibility of AllowAny.

Source stream does not have a schema or output option is Tag

If the query's source stream does not have a schema associated with it, or if the output stream uses the Tag option, Pravega Search writes the output events in json string format.

Limitations on sharing output streams

Multiple continuous queries can write to the same output stream, with the following limitations:

Multiple continuous queries with different output formats (Filter and Tag) cannot share the same output stream. The continuous query that breaks the rule is rejected at creation or update time.

Multiple continuous queries that are registered against different source streams, where some source streams are associated with schemas and some are not, cannot share the same output stream. The continuous query that breaks the rule is rejected at creation or update time.


Edit a continuous query

You can edit a continuous query configuration.

About this task

You can change the following fields:

Query statement
The list of streams that the query operates on
The output stream

Steps

1. Go to Analytics > > Search Cluster > Queries.

2. In the table of existing queries, locate the query that you want to change and click Edit.

3. Make the changes, and click Save.

Results

The changes are effective on new ingested data shortly after the change is saved. The system requires only a small amount of time to react to the change.

Deregister a continuous query

Deregister a continuous query to stop the query.

Steps

1. Go to Analytics > > Search Cluster > Queries.

2. In the table of existing queries, locate the query that you want to change and click Delete.

3. Confirm the deletion.

Results

The continuous query is removed from the system and cannot be recalled. The output stream is still available.

Example: Continuous query using the REST API

This example uses the REST API to create a continuous query that filters data as it is ingested.

Prerequisites

This example assumes the following:

A project is previously created.
A stream named demoStream is previously created in the project.
A Pravega Search cluster is previously created in the project.
The API call to authenticate to the cluster is previously submitted.

Steps

1. Declare a stream as searchable.

The following request declares demoStream as searchable. It creates an associated index named demoIndex. The incoming data is structured as one integer and one text field.

POST http://localhost:9098/_searchable_streams

{ "streamName": "demoStream", "indexConfig": { "name": "demoIndex",

Working with Pravega Search (PSearch) 157

"mappings": { "properties": { "key1": { "type": "integer" }, "key2": { "type": "text" } } } } }

2. Register a continuous query instance.

The query name is demoCQ. It examines incoming data for two conditions and writes events that satisfy both conditions to a stream named outputStream. Events that do not satisfy the conditions are filtered out of the output. All incoming data is still written to the original stream.

The query sets up a boolean AND condition. The first data value (key1) must be greater than 100, and the second data value (key2) must contain the string "python".

NOTE: To set up a boolean OR condition, use the "should" keyword.

POST http://localhost:9098/_cqueries

{ "name": "demoCQ", "query": { "bool": { "must": [ { "range": { "key1": { "gt": 100 } } }, { "term": { "key2": "python" } } ] } }, "searchableStream": ["demoStream"], "outputStream": { "name": "outputStream", "format": "filter" } }

3. Write test events to demoStream.

One way to write test events into a stream is in Kibana.

a. In the SDP UI, go to Analytics > > Psearch cluster.
b. Click Kibana.
c. Using the API in Kibana, write event 1 to demoStream.

{ "key1": 110, "key2": "Tony loves python" }

d. Write event 2.

{ "key1": 90, "key2": "Tony loves python" }

e. Write event 3.

{ "key1": 110,

158 Working with Pravega Search (PSearch)

"key2": "Tony loves java" }

4. Look in demoStream.

In step 1, demoStream was declared searchable. Therefore, you can examine that stream in Kibana.

You should find all three events there.

5. Verify the contents of the output stream.

In step 1, the output stream was not declared searchable. Therefore, you cannot examine that stream in Kibana. You can use a raw stream reader, or log in to the SDP UI and make outputStream searchable.

Only the first event is in outputStream. The others were filtered out.

Event 1 matches both conditions.
Event 2 does not match the greater-than-100 condition for key1.
Event 3 does not match the contains-python condition for key2.
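One way to use a raw stream reader for this check is with the Pravega Java client. The following is a minimal sketch, not a prescribed procedure: the scope name myproject, the reader group and reader IDs, and the in-cluster controller address are illustrative assumptions, and it assumes the output events are plain UTF-8 JSON strings (the case described earlier when the source stream has no schema).

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;
import java.util.UUID;

public class OutputStreamDumper {
    public static void main(String[] args) {
        // Illustrative values: project scope, output stream name, and controller URI.
        String scope = "myproject";
        String streamName = "outputStream";
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://nautilus-pravega-controller.nautilus-pravega.svc.cluster.local:9090"))
                .build();

        // Create a throwaway reader group that covers the whole output stream.
        String readerGroup = "cq-verify-" + UUID.randomUUID().toString().replace("-", "");
        try (ReaderGroupManager rgManager = ReaderGroupManager.withScope(scope, clientConfig)) {
            rgManager.createReaderGroup(readerGroup,
                    ReaderGroupConfig.builder().stream(Stream.of(scope, streamName)).build());
        }

        // Read events back as UTF-8 strings and print them.
        try (EventStreamClientFactory clientFactory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamReader<String> reader = clientFactory.createReader(
                     "reader-1", readerGroup, new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> event = reader.readNextEvent(2000);
            while (event.getEvent() != null || event.isCheckpoint()) {
                if (event.getEvent() != null) {
                    System.out.println("CQ output event: " + event.getEvent());
                }
                event = reader.readNextEvent(2000);
            }
        }
    }
}

If the events were written with a schema registry serializer, substitute the matching schema registry deserializer for UTF8StringSerializer.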

Example: Schema groups with continuous query output streams

This example shows how to use the schema registry with continuous query output streams.

About this task

Steps

1. Create a project named schema-registry-test.

2. Create a stream in the project scope named schemaRegistryTestStream.

3. Create a schema group for the stream.

Choose the following settings:

Serialization Format: Json
Compatibility: AllowAny

4. Make schemaRegistryTestStream a searchable stream.

Name the index the same as the stream name. The names are not required to match, but it is convenient if they do.

5. Create another stream that is a continuous query output stream.

Name this stream schemaRegistryTestOutputStream.

6. Create a schema group for the output stream, with the same settings that you used for the source stream.

NOTE: This step is optional. If you do not declare a schema group for the output stream, the schema of the source stream is used.

7. Create a continuous query that uses the input and output streams.

The query in this example matches all events:

{ "query": { "match_all":{} } }

8. Ingest data with schema.

Here is an excerpt from the Pravega writer.

SerializerConfig serializerConfig = SerializerConfig.builder()
        .registryClient(schemaRegistryClient)
        .groupId(schemaRegistryTestStream)
        .registerSchema(true)
        .build();

// JSON serializer that registers and uses the schema of SchemaData.
serializer = SerializerFactory.jsonSerializer(serializerConfig, JSONSchema.of(SchemaData.class));

EventStreamWriter writer = clientFactory.createEventWriter(
        schemaRegistryTestStream,
        serializer,
        EventWriterConfig.builder().build());

// Write one event to the stream.
writer.writeEvent(data);

9. Use Kibana to search data in the source stream and in the continuous query output stream.

The data is the same in both streams because the query was to match all.

Search queries

A search query searches all stored data in the stream that has finished indexing.

Each Pravega Search cluster deployment includes a Kibana instance. You can access the Pravega Search cluster's Kibana instance from the project page in the SDP UI. Then, use Kibana features to query the project's searchable streams, view the query results, analyze the results, and save the results.

Submit historical query and view results

This procedure describes how to go to the Kibana UI from the SDP UI and submit historical queries.

Prerequisites

Searchable streams must be defined in the cluster.
There must be some data in the searchable streams.
If a stream was made searchable, wait until the indexing has finished running on existing data. Otherwise, the query results are incomplete. In a stream with a large amount of data already stored, indexing could take some time.

NOTE: You can view indexing progress using the stream watermark API in Kibana.

Familiarity with Kibana is helpful, or access the Kibana documentation at https://www.elastic.co/guide/.

Steps

1. In the SDP UI, go to Analytics > > Search cluster.

2. Click the Kibana tab.

The Kibana UI opens in a new window.

3. Create an index pattern that represents the streams to include in your historical query.

a. In the left panel, click Management > Kibana > Index Patterns.
b. Click Create index pattern.
c. Start typing in the index pattern field.

Kibana lists all the index mapping names in the search cluster.

The index names are the names that you used for the index mapping definitions when you created each searchable stream.

d. Define the streams to search as follows:

To search only one stream, select that stream's index mapping name.
Use * as a wildcard in an index naming pattern.
Use - to exclude a pattern.
List multiple patterns or names separated with a comma and no spaces. For example: test*,-test3

e. If Kibana detects an index with a timestamp, it asks you to choose a field to filter your data by time. A time field is required to use the time filter.

4. Submit queries.

a. In the left pane, click Discover.
b. Select the index name pattern.

The index name pattern defines the streams to include in the search.

c. Optionally set a time filter.

The time filter defines the time range of data to include in the search.

d. Create the query.


See the Kibana documentation for supported query languages and syntax.

e. Examine the results, filter them, add and remove fields, save the search, and share the search.

See the Kibana documentation for instructions.

5. Create visualizations and dashboards.

a. In the left pane, click Dashboard.
b. See the Kibana documentation for instructions to create, save, and share dashboards.

6. To return to the SDP UI, click the SDP tab in your browser.

The SDP and Kibana UIs are simultaneously available in separate tabs in your browser.

Example: Search query using Kibana

This example uses Kibana for search queries.

Prerequisites

This example assumes the following:

A project is previously created.
A stream that is named taxi-demo is previously created in the project and has been ingesting data over time.

The data in the stream looks like this:

{ "tolls_amount":"0", "pickup_datetime":"2015-01-07 14:58:07", "trip_distance":"1.58", "improvement_surcharge":"0.3", "mta_tax":"0.5", "rate_code_id":"1", "dropoff_datetime":"2015-01-07 15:10:17", "vendor_id":"2", "tip_amount":"1", "dropoff_location":[ -73.97346496582031, 40.78977584838867 ], "passenger_count":2, "pickup_location":[ -73.9819564819336, 40.76926040649414 ], "store_and_fwd_flag":"N", "extra":"0", "total_amount":"11.3", "fare_amount":"9.5", "payment_type":"1", "event_time":1607330906767 }

Steps

1. Create the Pravega Search cluster if it does not yet exist in the project.

a. In the SDP UI, go to Analytics > .
b. Click Search Cluster.
c. If a cluster name appears on the Cluster subtab, the search cluster exists and you can go to the next step. Otherwise, click Create Search Cluster.
d. Choose a size, and click Save.
e. Wait until the Cluster state is Running.

2. Declare the stream searchable.

a. Click the Searchable Streams subtab.
b. Choose taxi-demo from the dropdown list of streams.
c. Click Create Searchable Stream in the upper right.
d. Name the index so that you recognize it in Kibana, or accept the default name that shows in the field.


e. Provide the index mapping. The following index mapping matches the data in the example above.

{ "properties":{ "surcharge":{ "type":"float" }, "dropoff_datetime":{ "type":"date", "format":"yyyy-MM-dd HH:mm:ss" }, "trip_type":{ "type":"keyword" }, "mta_tax":{ "type":"float" }, "rate_code_id":{ "type":"keyword" }, "passenger_count":{ "type":"integer" }, "pickup_datetime":{ "type":"date", "format":"yyyy-MM-dd HH:mm:ss" }, "tolls_amount":{ "type":"float" }, "tip_amount":{ "type":"float" }, "payment_type":{ "type":"keyword" }, "extra":{ "type":"float" }, "vendor_id":{ "type":"keyword" }, "store_and_fwd_flag":{ "type":"keyword" }, "improvement_surcharge":{ "type":"float" }, "fare_amount":{ "type":"float" }, "ehail_fee":{ "type":"float" }, "cab_color":{ "type":"keyword" }, "dropoff_location":{ "type":"geo_point" }, "vendor_name":{ "type":"text" }, "total_amount":{ "type":"float" }, "trip_distance":{ "type":"float" }, "pickup_location":{ "type":"geo_point" }, "event_time":{

162 Working with Pravega Search (PSearch)

"format":"yyyy-MM-dd HH:mm:ss || epoch_millis", "type":"date" } } }

f. Typically, choose Basic configuration. Knowledgeable users can click Advanced and specify the initial shard number.
g. Click Save.

The stream appears in the list of Searchable Streams.

3. Click Search Metrics to open OpenSearch Dashboards.

4. On the OpenSearch Dashboards, in the index pattern field, select the index name that you specified when making the stream searchable.

Although it is possible to specify name patterns or a list of index names, this example uses only one index name.

5. Create charts and dashboards.

See the OpenSearch Dashboards documentation for details.

Health and maintenance

You can monitor the Pravega Search cluster to ensure health and resource availability.

Monitor Pravega Search cluster metrics

If project metrics are enabled, the Pravega Search cluster has a metrics stack.

See About analytic metrics for projects on page 94 for information about collecting and viewing metrics.

See Edit the autoscaling policy on page 164 for a description of how the Pravega Search automatic scaling uses the collected metrics to trigger scaling actions.

Delete index documents

As indexes grow, you may want to delete index documents to reclaim space.

Use the Pravega Search REST API to delete unwanted index documents.

Scale the Pravega Search cluster

Most Pravega Search scaling occurs automatically based on configured values.

Services that consume resources

A Pravega Search cluster has three main functions: indexing incoming stream data, processing continuous queries, and responding to search queries. The services in a Pravega Search cluster that perform those functions and consume resources are summarized here.

Table 7. Pravega Search cluster services

Controller: Manages the cluster and the indexes.

index worker: Reads events from a Pravega source stream, processes continuous queries, and sends events to shard workers.

resthead: Receives query requests and forwards them to either query workers (if they exist) or shard workers.

query worker: An optional component. Manages query requests and responses for heavy query workloads. When query workers do not exist, queries go directly from the resthead to shard workers.

shard worker: Accepts processing from other components. Shard workers consume most of the CPU and memory resources in the cluster.
From index workers: Shard workers index the events and save the Lucene indices into Pravega bytestreams. Each shard worker has zero or more shards, each of which is a Lucene directory.
Querying: Shard workers process search queries forwarded to them from query workers or the resthead.

Indications that scaling is needed

Workload affects resource consumption in the following ways:

Data ingestion rate: As the stream ingestion rate increases, the number of stream segments in a stream increases, leading to more Pravega readers in the index workers. The increased number of readers requires more CPU and memory, which creates the need for more index workers. Finally, the heavier workload on index workers requires more shard workers.

Number of search requests: As the search workload increases, the load on the resthead, query workers, and shard workers increases.

For projects that have Metrics enabled, you can monitor metrics about the components, the workload, and the resources. Monitoring stream segment numbers, CPU and memory usage of index workers, shard numbers, and indexes can point out trends and help predict scaling needs. When CPU and memory resources are approaching limits, scaling is required.

Types of scaling

Horizontal pod scaling: Changes the number of pod replicas.
Automatic: Scaling is automatic based on current metrics and configured values.
Manual: Scaling is controlled by manually setting explicit values that override calculations.

Vertical pod scaling: Changes the CPU and memory resources available to pod containers.
Manual: Add resources to the configuration.

Cluster scaling: Scales the Kubernetes cluster at the node level.
See "Expansion and Scaling" in the Dell Technologies Streaming Data Platform Installation and Administration Guide.

Edit the autoscaling policy

Typically, users do not need to edit the autoscaling policy.

About this task

Autoscaling is performed by the psearch-scaler cronjob. This job runs every 5 minutes. It automatically increases or decreases the number of service replicas based on current metrics values and scaling policies. You may adjust the scaling policies to customize your automatic scaling results.

Changing the autoscaling policy is useful in some specific scenarios such as:

CPU and memory usage is high but the number of pod replicas has reached the upper bounds. If the cluster has sufficient resources, you can increase the maxScaleFactor. This change allows autoscaling to create more replicas.

A busy node is running a shardworker that is handling more than one large shard. The busy node is running low on available CPU and memory. Without intervention, the shardworker may be stopped with out of memory errors. If other nodes have extra resources, an adjustment to the autoscaling policy can help. In this case, change the shardworker configuration, increasing maxScaleFactor and decreasing shard_num. These changes result in more shardworkers and better distribution of the large shards. The workload is distributed among all the nodes.

Performance for indexing or continuous queries needs improvement, but CPU and memory usage has not reached the scaling thresholds. In this case, decrease the cpu_m and memory_M thresholds in the Indexworker configuration. This change causes more aggressive scaling up.


Steps

1. Edit the psearch-scaling-policy configmap.

kubectl edit cm psearch-scaling-policy -n $project

2. Change settings for the various services according to need.

Common edits are:

Enlarge maxScaleFactor to increase the number of replicas that autoscaling can create.

Reduce the values in any of the targetAverageValue settings to increase the optimal number of replicas, which makes it easier to satisfy the scale-up condition.

Change minScaleFactor and maxScaleFactor to 1 to disable autoscaling.

minScaleFactor: A factor used to calculate the minReplicas for autoscaling.

maxScaleFactor: A factor used to calculate the maxReplicas for autoscaling. Guidelines for the various services are:
An efficient number of controllers is related to the total workload across the cluster. The controller manages the cluster.
An efficient number of indexworkers is related to the number of stream segments being processed per indexworker.
An efficient number of shardworkers is related to the average number of shards per shardworker and sometimes the size of shards. Shardworkers are CPU and memory intensive. Too many shards running on one shardworker lowers performance.

shard_num: The shardworker service includes an additional field named shard_num. For an index with very large shards, one or two shards can consume a few gigabytes of memory. In that case, you can decrease the value of shard_num to make the shardworker scale up more easily. For small indices, 10 shards per shardworker is reasonable.

metrics[...]: Metrics fields determine the current optimal number of replicas. Current values for the listed metrics and their target values are used to calculate an optimal number of replicas. When multiple metrics are listed, autoscaling uses the largest calculated result as the optimal number of replicas. By reducing the values in any of the targetAverageValue settings, you make it easier to satisfy the scale-up condition.
The shardworker has an additional metric, shard_num. In an index with very large shards, one or two shards can consume a few gigabytes of memory. In this case, decreasing the targetAverageValue for shard_num triggers scaling up more easily. For small indexes, a targetAverageValue of 10 shards per shardworker is reasonable.

minReplicas: The current minimum number of replicas for autoscaling purposes. This value is calculated using the minScaleFactor times the initial number of replicas.

maxReplicas: The current maximum number of replicas for autoscaling purposes. This value is calculated using the maxScaleFactor times the initial number of replicas.

Here is a view of the configuration:


Disable auto-scaling

You can disable horizontal autoscaling.

Steps

1. Edit the psearch-scaling-policy configmap.

kubectl edit cm psearch-scaling-policy -n $project

2. Change minScaleFactor and maxScaleFactor to 1 for every service.

Change to manual horizontal scaling

Manual scaling uses your explicit value for number of replicas, and scales up to the specified number.

About this task

Typical scenarios when manual scaling is useful are:

If the workload has a large amount of incoming data but few query requests, you might improve performance by decreasing the replicas for the resthead and queryworker services, and also increasing the replicas for indexworkers and shardworkers.


If you want to create a smaller or larger cluster.

Steps

1. Edit the PravegaSearch resource.

kubectl edit pravegasearch psearch -n $project

2. Edit the values in the replica labels. Set the specific number of replicas to create for each service.

3. Under spec, add source: manual.

Example

Here is a view of the PravegaSearch resource:

Vertical pod scaling

Vertical pod scaling adds cpu and memory to existing pods.

About this task

Use vertical pod scaling in these scenarios:

A service is running out of memory. Large shards are running in shard workers.

For performance reasons, some index data are cached in the shard worker. The shard worker might run out of memory if very large shards are running on it. In that case, you can increase the required memory and java heap size.

NOTE: This procedure does not affect existing statefulsets or pods. When you upgrade to a new Pravega Search version, statefulsets and pods are recreated using the new configurations.

Steps

1. Edit the PravegaSearch custom resource.

a. Run:

kubectl edit pravegasearch psearch -n $project

b. Change some or all the following values for indexworker or shardworker, as needed:


Under resources, change cpu and memory requests.

Under jvmOpts, change the Java heap size (the first value).

2. For existing pods, edit the configmaps of the components you want to change.

For example, edit the psearch-shardworker-configmap:

kubectl edit cm psearch-shardworker-configmap -n $project

You can enlarge the max heap value (-Xmx) or add other jvm options if needed.

3. Edit the statefulset of the component by increasing the resource request values (cpu or memory). The pod restarts after you save the settings.

kubectl edit sts psearch-shardworker -n $project

You can increase cpu or memory as follows:

Pravega Search REST API

The Pravega Search REST API is available for use in applications that are deployed in an SDP project.

REST head

The root URL of the Pravega Search REST API is project-specific.

The Pravega Search API is not exposed for external connectivity. An application that uses the Pravega Search REST API must be deployed within SDP, in the same project that contains the streams that are being searched.

Each project has a Pravega Search API root URL, as follows:

http://psearch-resthead-headless.{namespace}.svc.cluster.local:9098/

Where:

{namespace} is the project name.

For example:

http://psearch-resthead-headless.myproject.svc.cluster.local:9098/


Authentication

Introduction

To expose the Pravega Search REST API, REST authentication is required in the Pravega Search RestHead. All components that access the Pravega Search REST API must add authentication to the request header. A REST ingress is created for the RestHead service, so users do not have to port-forward the RestHead.

When the RestHead starts, the interceptor for Pravega Search REST authentication is registered. All REST requests are sent to the interceptor first. Two types of authentication are supported.

Basic authentication

A username and password are used for authentication. Every Pravega Search cluster has a username and password that are generated dynamically and randomly. If the username and password in the request header match the username and password in the environment variables that are created by the PSearch-Operator, access is granted.

Bearer authentication

The RPT token from Keycloak is used for authentication. The RPT token is generated by Keycloak. Users have to get the RPT token from Keycloak and add it to the request header. The interceptor parses the request header to get the RPT token and checks the access role associated with the token. The role must be either admin or the appropriate project member role. Otherwise, the 401 error response is returned.

How to get the ingress hosts for Pravega Search REST access

Use the following command:

kubectl get ingress psearch-resthead-headless -n <projectName>

How to get the credentials for Basic authentication

The default username/password is admin/admin. Users can also create users or update passwords in the UI (OpenSearch Dashboards).

To get the value for basic_credentials:

echo -n "username:password" | base64

How to get the credentials for Bearer authentication

To get the RPT_Token:

sh ./pravega-search/tools/keycloakToken/token.sh \
  -p (projectName) -c (clientSecret) -i (keycloak ingress)

To get the clientSecret, use the Keycloak UI or the secret file from the Kubernetes pod.

To get the Keycloak ingress:

kubectl get ingress -A | grep keycloak

Example: How to get access to the Pravega Search REST API with authentication

The following example shows how to get access to the Pravega Search REST API with Basic authentication:

curl --location --request GET 'https://(ingress_hosts)/_cluster/state' \
  --header 'Authorization: Basic (basic_credentials)' --insecure


The following example shows how to get access to the Pravega Search REST API with Bearer authentication:

curl --location --request GET 'https://(ingress_hosts)/_cluster/state' \
  --header 'Authorization: Bearer (RPT_Token)' --insecure
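For an application deployed inside the project, the same request can be issued in Java against the in-cluster REST head root URL documented earlier in this chapter. This is a minimal sketch: the project name and credentials are placeholders, error handling is omitted, and TLS handling would be needed if you call through the ingress instead of the in-cluster service.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PsearchRestClient {
    public static void main(String[] args) throws Exception {
        // Placeholders: project namespace and the Basic credentials for the cluster.
        String namespace = "myproject";
        String credentials = Base64.getEncoder()
                .encodeToString("username:password".getBytes(StandardCharsets.UTF_8));

        // In-cluster REST head root URL, following the pattern documented above.
        String rootUrl = "http://psearch-resthead-headless." + namespace + ".svc.cluster.local:9098";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(rootUrl + "/_cluster/state"))
                .header("Authorization", "Basic " + credentials)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}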

Component Impact

The following components require access to the Pravega Search REST API:

Nautilus UI for Pravega Search - Bearer
OpenSearch Dashboards - Bearer and Basic

For Basic authentication, Kibana gets username and password from the rest-auth-secret secret in the project namespace.

When users open the OpenSearch Dashboards, the request with the RPT token is redirected to do the authentication first, using Bearer authentication. After that, when users access the Pravega Search REST API in the OpenSearch Dashboards, the username and password that were created by the PSearch-Operator are added to all request headers. This is configured in the OpenSearch Dashboards initialization script.

Resources

The API provides access to Pravega Search clusters, searchable streams, and continuous queries.

Pravega Search cluster

Pravega Search indexing and querying runs in a Pravega Search cluster. The Pravega Search cluster is in a project namespace. Each project can have only one Pravega Search cluster, which handles indexing and querying operations for all the streams in the project.

The Pravega Search REST API supports the following actions for Pravega Search clusters:

Create a Pravega Search cluster for a project.
Change a Pravega Search cluster configuration. PUT requires the entire resource. PATCH requires only the field that you want to change.
Delete a Pravega Search cluster.
List the Pravega Search cluster in a project namespace.
List all Pravega Search clusters in all namespaces.
Get information about a Pravega Search cluster.

Searchable stream

A searchable stream is any Pravega stream that is marked as searchable by Pravega Search. When a stream is made searchable, it has an index mapping that is associated with it that defines the stream data structure. Pravega Search uses the index mapping to index existing data of the stream and all new data as the data is ingested. You can register continuous queries and submit historical queries against a searchable stream.

The Pravega Search REST API supports the following actions for searchable streams:

Declare a stream searchable.
Get information about a searchable stream.
List searchable streams.
Remove the searchable property from a stream.

Continuous query

A continuous query operates on one or more streams in real time as the data is ingested. It writes results to an output stream.

The Pravega Search REST API supports the following actions for continuous queries:

Create a continuous query, registering it to operate on one or more streams.
Get information about a continuous query.
Update a continuous query definition.
Delete a continuous query.
Get a list of continuous queries.

Search query

Search queries operate on a stream after it is indexed. The search query (historical query) resource lets a user perform a query in OpenSearch Dashboards.


Common object descriptions

The following table describes the objects that appear in many of the Pravega Search REST API requests.

Object Description

{namespace} A Kubernetes namespace specific to a project. A project namespace is the same as the project name.

{psearch-name} The name that was assigned to the Pravega Search cluster when the cluster was created. Because there is only one cluster in a project, the name is fixed as psearch.

{streamname} The name of a stream.

{queryname} The name of a saved continuous query.

Methods documentation

The Pravega Search REST API documentation is available for download as a zip file.

Download the Pravega Search REST API documentation. Extract all files, and open the index.html file.

The above documentation does not include the CREATE psearch cluster API. See the next section for that method.

Cluster resource operations

The following table summarizes cluster resource operations.

Create a cluster
POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/
Body: See https://eos2git.cec.lab.emc.com/NAUT/psearch-operator/tree/master/charts/psearch for a body template. Many configurations are supported when you create the cluster with the REST API. Note that the SDP UI offers only a few of the configurations.

Delete a cluster
DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

List all clusters
GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches

Retrieve information about a specific cluster
GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Edit a cluster using PATCH
PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}
Header: content-type:application/merge-patch+json
Body: { "spec": { }}

Edit a cluster using PUT
PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}
Body: Use the response payload from GET, editing fields as needed.

Create a cluster

Create a PSearch cluster in a project namespace. A project namespace may have only one PSearch cluster.

Request

POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/

For a template, see https://eos2git.cec.lab.emc.com/NAUT/psearch-operator/tree/master/charts/psearch.

The REST API is more flexible than the SDP UI for cluster configurations. The limited selections available in the SDP UI accommodate the most typical customer requirements.

Object descriptions

See the template for object descriptions.

Example request

The following example request creates a small cluster.

POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "name": "psearch"

"namespace":"project-1" }, "spec": { "source": "manual", "imageRef": "psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ],


"replicas": 4, "resources": { "requests": { "cpu": "500m", "memory": "1Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "kibana": { "replicas": 1 }, "options": { "psearch.metricsEnableStatistics": "true", "psearch.metricsEnableStatsDReporter": "false", "psearch.metricsInfluxDBUri": "http://psearch-influxdb:8086", "psearch.metricsInfluxDBUserName": "admin" }, "pSearchController": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 3, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "queryWorker": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" }


}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "resthead": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "1", "memory": "3Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" }


} } }

}


Example responses

Description Response body

Created

HTTP Status: 201 { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-04T03:52:59Z", "generation": 1, "name": "psearch", "namespace": "ccc", "resourceVersion": "4654248", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/ccc/ pravegasearches/psearch", "uid": "08100ca4-541c-41c8-a03e-0e6bfd414d3d" }, "spec": { "externalAccess": { "enabled": false, "type": "ClusterIP" }, "imageRef": "cluster-psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "kibana": { "replicas": 1 }, "options": { "psearch.leaderElectionType": "fast", "psearch.luceneDirectoryType": "pravega_bytestream", "psearch.metricsEnableStatistics": "false", "psearch.metricsInfluxDBName": "psearch", "psearch.metricsInfluxDBPassword": "password", "psearch.metricsInfluxDBUri": "http://pravega-search- influxdb.default.svc.cluster.local:8086", "psearch.metricsInfluxDBUserName": "admin",



"psearch.perfCountEnabled": "true" }, "pSearchController": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "pravegaControllerAddress": "tcp://nautilus-pravega- controller.nautilus-pravega.svc.cluster.local:9090", "queryWorker": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "resthead": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "securityEnabled": true, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g"



], "replicas": 2, "resources": { "requests": { "memory": "3330Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } } } }

Already Exists

HTTP Status: 409 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"psearch\" already exists", "reason": "AlreadyExists", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 409 }

Delete a cluster

Delete an existing cluster from a project namespace.

Request

DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Example request

The following example request deletes the PSearch cluster named psearch.

DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/project-1/pravegasearches/psearch


Example responses

Description Response body

Deleted

HTTP Status: 200 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Success", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches", "uid": "08100ca4-541c-41c8-a03e-0e6bfd414d3d" } }

Not Found

HTTP Status: 404 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"psearch\" not found", "reason": "NotFound", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 404 }

List clusters

List PSearch clusters and their status.

Request

List PSearch clusters in a namespace. There is always only one PSearch cluster per namespace.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches

List all PSearch clusters in all namespaces.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches

Example for one project namespace

The following request lists the PSearch cluster and its status for the bbb project.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches

The response format is:

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "items": [ {


"apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/ psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": {

...

}, "status": { "ingress": { "url": "" }, "ready": true, "reason": "" } } ], "kind": "PravegaSearchList", "metadata": { "continue": "", "resourceVersion": "4687091", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/" } }

Example for all projects

The following request lists all PSearch clusters and their status in all namespaces.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches

The response format is:

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "items": [ { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/ psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": {

...

}, "status": { "ingress": { "url": "" }, "ready": true, "reason": "" }


} ], "kind": "PravegaSearchList", "metadata": { "continue": "", "resourceVersion": "4687681", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches/" } }

Response codes

Get a cluster

Retrieve information about a specific cluster and its status.

Request

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Example request

The following request gets information about the cluster named psearch in the bbb project.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/psearch

Example responses

The status information appears at the end of the payload.

Description Response body

Get resource

HTTP Status: 200 { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/ namespaces/bbb/pravegasearches/psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": { "externalAccess": { "enabled": false, "type": "ClusterIP" }, "imageRef": "cluster-psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" }



}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "kibana": { "replicas": 1 }, "options": { "psearch.leaderElectionType": "fast", "psearch.luceneDirectoryType": "pravega_bytestream", "psearch.metricsEnableStatistics": "false", "psearch.metricsInfluxDBName": "psearch", "psearch.metricsInfluxDBPassword": "password", "psearch.metricsInfluxDBUri": "http://pravega-search- influxdb.default.svc.cluster.local:8086", "psearch.metricsInfluxDBUserName": "admin", "psearch.perfCountEnabled": "true" }, "pSearchController": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "pravegaControllerAddress": "tcp://nautilus-pravega- controller.nautilus-pravega.svc.cluster.local:9090", "queryWorker": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" }



}, "storageClassName": "standard" } } }, "resthead": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "securityEnabled": true, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g" ], "replicas": 2, "resources": { "requests": { "memory": "3330Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } } }, "status": { "ingress": { "url": "" }, "ready": true, "reason": "" } }

Not Found

HTTP Status: 404 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"pseawrch\" not found", "reason": "NotFound",



"details": { "name": "pseawrch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 404 }

Edit a cluster using PATCH

Change a PSearch cluster configuration by providing only the changed information in the request body.

Request

PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Header

Header: content-type:application/merge-patch+json

Body

Supply only the object that you are changing.

Example request

The following example request changes the number of indexworkers.

PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Header

Header: content-type:application/merge-patch+json

Body

{ "spec": { "indexworker": { "replicas" : 2 } }}

Example responses

Description Response body

HTTP Status: xxx HTTP responses


Edit a cluster using PUT

Change a PSearch cluster configuration by providing the entire spec object in the request body.

Request

PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Body

Use the response payload from GET, editing fields as needed.

Example request

The following example request changes the configuration of the cluster.

PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

{
  "apiVersion": "search.nautilus.dellemc.com/v1alpha1",
  "kind": "PravegaSearch",
  "metadata": {
    "creationTimestamp": "2020-03-03T05:28:58Z",
    "generation": 3,
    "name": "psearch",
    "namespace": "bbb",
    "resourceVersion": "4338841",
    "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/psearch",
    "uid": "7748428b-6c45-486d-8649-94f61f9c4545"
  },
  "spec": { ...}
}

Example responses

Description Response body

Created or other message

HTTP Status: xxx

HTTP responses.


Troubleshooting Application Deployments

You might encounter errors during deployment and execution of Apache Flink and Spark applications. Logging is an essential tool for debugging and understanding the behavior of your applications.

Users in the platform administrator role can access all system logs using native Kubernetes commands.

Topics:

Apache Flink application failures
Application logging
Application issues related to TLS
Apache Spark Troubleshooting

Apache Flink application failures

If a deployment fails, an Apache Flink application reaches an error state.

Application failed to resolve Maven coordinates

If the mavenCoordinate is incorrect and cannot be found in the local analytics project Maven repository, you get errors. To resolve errors, review the full or partial log files.

Application failed to deploy

If one of the following exists, a Flink application might fail to deploy:

The mainClass cannot be found in the JAR file.
The Flink driver application fails with an application exception error.

To identify the cause of a failure, run the kubectl describe command to retrieve a partial log. This command displays the last 50 lines of the log, which is usually enough to identify the cause of the failure.

$> kubectl describe FlinkApplication -n myproject helloworld

...

... Status: Deployment: Flink Cluster: Image: devops-repo.isus.emc.com:8116/nautilus/flink:0.0.1-017.d7d7da1-1.6.2 Name: mycluster URL: mycluster-jobmanager.myproject:8081 Job Id: Job State: Savepoint: file:/mnt/flink/savepoints/savepoint-3d2dfb-b37f0c8ad3a5 Deployment Number: 1 Error: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 5 more

2019-02-04 16:02:16,759 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.

2019-02-04 16:02:17,024 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.

2019-02-04 16:02:17,047 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.

2019-02-04 16:02:17,106 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.

2019-02-04 16:02:17,111 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file

2019-02-04 16:02:17,158 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException:

The program's entry point class 'com.dellemc.flink.error.StreamingJob' was not found in the jar file.

org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)

at org.apache.flink.client.program.PackagedProgram. (PackagedProgram.java:199) at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java :30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)

Caused by: java.lang.ClassNotFoundException: com.dellemc.flink.error.StreamingJob

at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)

... 7 more

------------------------------------------------------------ The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.flink.error.StreamingJob' was not found in the jar file.

at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617) at org.apache.flink.client.program.PackagedProgram. (PackagedProgram.java:199) at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java :30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)

Caused by: java.lang.ClassNotFoundException: com.dellemc.flink.error.StreamingJob

at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method)


at java.lang.Class.forName(Class.java:348) at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)

... 7 more

Observable Generation: 4 Reason: Flink Driver Failed State: error
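The ClassNotFoundException in this log means that the configured mainClass does not exist in the application JAR. For reference, a minimal sketch of a valid Flink entry point class is shown below; the package and class names are illustrative and must match the mainClass value in your deployment.

// Illustrative package; must match the mainClass configured for the deployment.
package com.example.streaming;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment provided by the Flink cluster.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A trivial pipeline; a real application would read from and write to Pravega streams.
        env.fromElements("hello", "world").print();

        // Submit the job; the name appears in the Flink Web UI.
        env.execute("StreamingJob");
    }
}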

Review the full deployment log

If you need to review the full deployment log, first identify the deployment pod and then use the kubectl logs command to retrieve the log output.

Steps

1. Identify the deployment pod using the kubectl get pods command, which returns the deployment pod name in the format -app- - .

> kubectl get pods -n taxidemo

NAME                                   READY   STATUS      RESTARTS   AGE
iot-gateway                            1/1     Running     0          27m
repo-5689d8c9b-ccs2p                   1/1     Running     0          34m
taxidemo-jobmanager-0                  1/1     Running     0          16m
taxidemo-rawdata-app-v1-1-mllnb        0/1     Completed   0          16m   <<<<<<<<<<<
taxidemo-rawdata-app-v2-1-gdfjl        0/1     Completed   0          2m    <<<<<<<<<<<
taxidemo-stats-app-v1-1-ldtks          0/1     Completed   0          16m   <<<<<<<<<<<
taxidemo-taskmanager-b596fcc55-j6jtd   1/1     Running     2          16m
zookeeper-0                            1/1     Running     0          34m
zookeeper-1                            1/1     Running     0          34m
zookeeper-2                            1/1     Running     0          33m

2. Use the kubectl logs command to obtain the full log output. For example:

> kubectl logs taxidemo-rawdata-app-v2-1-gdfjl -n taxidemo Executing Flink Run ------------------- 2019-02-07 16:20:25,285 INFO org.apache.flink.client.cli.CliFrontend - -------------------------------------------------------------------------------- 2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - Starting Command Line Client (Version: 1.6.2, Rev:3456ad0, Date:17.10.2018 @ 18:46:46 GMT)

2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - OS current user: root

2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - Current Hadoop/Kerberos user:

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - Maximum heap size: 878 MiBytes

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JAVA_HOME: /docker-java-home/jre

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - No Hadoop Dependency available

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JVM Options:


2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlog.file=/opt/flink/log/flink--client-taxidemo-rawdata-app-v2-1-gdfjl.log

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlog4j.configuration=file:/etc/flink/log4j-cli.properties

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlogback.configurationFile=file:/etc/flink/logback.xml

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - Program Arguments:

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - run
2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -d

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -s

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - file:/mnt/flink/savepoints/savepoint-18b407-f5f5a79d567f

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -c
2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - com.dellemc.BadDriver
...

2019-02-07 16:20:25,320 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-02-07 16:20:25,470 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.

2019-02-07 16:20:25,513 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.

2019-02-07 16:20:25,568 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.

2019-02-07 16:20:25,572 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file

2019-02-07 16:20:25,670 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.BadDriver' was not found in the jar file.
at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)
at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:199)
at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: java.lang.ClassNotFoundException: com.dellemc.BadDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)


at java.lang.Class.forName(Class.java:348)
at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)
... 7 more

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.BadDriver' was not found in the jar file.
at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)
at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:199)
at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: java.lang.ClassNotFoundException: com.dellemc.BadDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)
... 7 more

...

... Error Running Flink Driver Application

Application logging

Flink applications consist of Flink tasks and TaskManagers. Any logging that an operator performs is captured in the TaskManager logs of the target cluster.

Flink operator tasks can run on any of the TaskManagers within the Flink cluster. You must first identify the TaskManagers of the target cluster before you can review logs.

Review application logs

Identify the TaskManagers of the target cluster and view logs.

Steps

1. Identify the TaskManagers of the target cluster. These pods are named <project>-taskmanager-<pod-id>. For example:

> kubectl get pods -n taxidemo

NAME                                   READY   STATUS      RESTARTS   AGE
iot-gateway                            1/1     Running     0          42m
repo-5689d8c9b-ccs2p                   1/1     Running     0          50m
taxidemo-jobmanager-0                  1/1     Running     0          32m
taxidemo-rawdata-app-v1-1-mllnb        0/1     Completed   0          32m
taxidemo-rawdata-app-v2-1-gdfjl        0/1     Completed   0          17m
taxidemo-stats-app-v1-1-ldtks          0/1     Completed   0          32m
taxidemo-taskmanager-b596fcc55-j6jtd   1/1     Running     2          32m   <<<<<<<<<
zookeeper-0                            1/1     Running     0          50m
zookeeper-1                            1/1     Running     0          49m
zookeeper-2                            1/1     Running     0          49m

2. Run the kubectl logs command to view the logs. For example:

> kubectl logs taxidemo-taskmanager-b596fcc55-j6jtd -n taxidemo
Starting Task Manager
config file:
jobmanager.rpc.address: taxidemo-taskmanager-b596fcc55-j6jtd
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
rest.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskexecutor as a console application on host taxidemo-taskmanager-b596fcc55-j6jtd.

2019-02-07 16:06:06,165 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.6.2, Rev:3456ad0, Date:17.10.2018 @ 18:46:46 GMT)

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: flink

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user:

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /docker-java-home/jre

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/etc/flink/log4j-console.properties

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/etc/flink/logback-console.xml


2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:

3. You can also follow new log entries using the kubectl logs command with the -f parameter. For example:

> kubectl logs taxidemo-taskmanager-b596fcc55-j6jtd -n taxidemo -f
...
...
...
2019-02-07 16:41:38,678 INFO io.pravega.client.netty.impl.RawClient - Closing connection with exception: null

2019-02-07 16:41:39,695 INFO io.pravega.client.netty.impl.ClientConnectionInboundHandler - Connection established ChannelHandlerContext(ClientConnectionInboundHandler#0, [id: 0x742c134f])

2019-02-07 16:41:39,698 INFO io.pravega.client.netty.impl.RawClient - Closing connection with exception: null

2019-02-07 16:41:44,425 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Attempting to refresh the controller server endpoints

2019-02-07 16:41:44,425 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Attempting to refresh the controller server endpoints

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Updating client with controllers: [[addrs=[nautilus-pravega-controller.nautilus-pravega.svc.cluster.local/ 10.100.200.14:9090], attrs={}]]

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Updating client with controllers: [[addrs=[nautilus-pravega-controller.nautilus-pravega.svc.cluster.local/ 10.100.200.14:9090], attrs={}]]

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Rescheduling ControllerNameResolver task for after 120000 ms

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Rescheduling ControllerNameResolver task for after 120000 ms

2019-02-07 16:41:49,757 INFO io.pravega.client.netty.impl.ClientConnectionInboundHandler - Connection established ChannelHandlerContext(ClientConnectionInboundHandler#0, [id: 0xf2faca14])

2019-02-07 16:41:49,759 INFO io.pravega.client.netty.impl.RawClient

Application issues related to TLS

The SDP supports native Transport Layer Security (TLS) for external connections to the platform. This section describes TLS-related connection information in Pravega and Flink applications.

The TLS feature is optional and is enabled or disabled with a true/false setting in the configuration values file during installation. For more information, see the TLS configuration details section in the Dell Technologies Streaming Data Platform Installation and Administration Guide.

When TLS is enabled in the SDP, Pravega applications must use a TLS endpoint to access the Pravega datastore.

The URI used in the Pravega client application, in the ClientConfig class, must start with:

tls://<pravega-endpoint>:443

NOTE: If the URI starts with tcp://, the application fails with a javax.net.ssl.SSLHandshakeException error.


To obtain the Pravega ingress endpoint, run the following command:

kubectl get ing pravega-controller -n nautilus-pravega

The HOST column in the output shows the Pravega endpoint.
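For reference, the following is a minimal sketch of a Pravega client that uses the TLS endpoint. It is not part of the product samples; the endpoint, scope, and stream names are placeholders, and Keycloak credential setup is omitted.

import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class TlsClientExample {
    public static void main(String[] args) {
        // The controller URI must use the tls:// scheme when TLS is enabled in SDP.
        // Replace pravega.example.com with the HOST value shown by the ingress command above.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tls://pravega.example.com:443"))
                .build();

        // Write one event to an existing scope and stream (placeholder names).
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "my-stream", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            writer.writeEvent("hello over TLS").join();
        }
    }
}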

Apache Spark Troubleshooting

Authentication failed is a common error message.

Symptom: io.pravega.shaded.io.grpc.StatusRuntimeException: UNAUTHENTICATED: Authentication failed

Causes:
The Pravega Keycloak client JAR is not included with the application.
The application attempts to create a scope. Make sure that the option allow_create_scope is set to false.
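As an illustration of the second cause, a Spark read through the Pravega Spark connector can disable scope creation as shown below. This is a sketch only: the controller endpoint, scope, and stream names are placeholders, and the exact option names depend on the connector version (see the Pravega Spark Connector reference in the Additional resources section). The Pravega Keycloak client JAR (for example, pravega-keycloak-client-<version>.jar) must also be on the application classpath, for example through the spark-submit --jars option.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PravegaSparkReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pravega-read-example")
                .getOrCreate();

        // The scope must already exist in SDP, so scope creation is disabled on the connector.
        Dataset<Row> events = spark.read()
                .format("pravega")
                .option("controller", "tls://pravega.example.com:443")
                .option("scope", "examples")
                .option("stream", "my-stream")
                .option("allow_create_scope", "false")
                .load();

        events.show();
        spark.stop();
    }
}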


SDP Code Hub Applications and Sample Code

You can develop streaming data applications using sample code templates available on the SDP Code Hub. Dell Technologies monitors this hub, which is a community-supported public GitHub portal for developers and integrators. It includes Pravega connectors, demos, sample applications, and API templates, as well as Pravega, Spark, and Flink examples from the open-source SDP developer community.

The SDP Code Hub includes three main categories:

Code samples--Example projects that demonstrate a single concept or feature, for example, exactly-once processing using the Pravega client.

Connectors--Libraries or components that can be used to integrate Pravega with other technologies.

Demos--Small but complete applications that tell a story and demonstrate a use case, for example, iDRAC telemetry and video facial recognition.

Visit the SDP Code Hub at https://streamingdataplatform.github.io/code-hub/ to browse articles about open-source SDP projects and download examples. The SDP Platform Developer repository includes additional workshop code samples: https://github.com/StreamingDataPlatform/workshop-samples

The SDP Code Hub portal does not contain any product downloads, advertising, commercial, or proprietary software.

Topics:

Develop an example video processing application

Develop an example video processing application

The SDP Code Hub includes a video processing code example that demonstrates how to perform data ingestion, analytics, and visualization, with methods to store, process, and read video using Pravega and Apache Flink. The SDP Code Hub also includes an application that performs object detection using TensorFlow and YOLO.

The video processing example project on the SDP Code Hub shows how to:

Start with local-only development to avoid complications and iterate quickly. Flink and Pravega run on your local machine for this example.
Install a local stand-alone copy of Pravega in Docker.
Create a Pravega scope using the Pravega RESTful APIs (a local alternative is sketched after this list).

It requires you to:

Set up application parameters using the IntelliJ IDE from JetBrains (available for free download).
Connect a USB camera to a VMware virtual machine.
Run the example application.
Understand the example Java source code.
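Although the example project drives scope creation through the Pravega REST API, a quick local alternative is the Pravega Java StreamManager API against the standalone instance. The sketch below is not part of the video-samples project; the controller URI, scope, and stream names are placeholders.

import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class CreateScopeAndStream {
    public static void main(String[] args) {
        // A local standalone Pravega (for example, the Docker image) typically listens on tcp://localhost:9090.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();

        try (StreamManager streamManager = StreamManager.create(config)) {
            streamManager.createScope("examples");                 // no-op if the scope already exists
            streamManager.createStream("examples", "video-stream",
                    StreamConfiguration.builder()
                            .scalingPolicy(ScalingPolicy.fixed(1)) // a single segment is enough for local testing
                            .build());
        }
    }
}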

To get the most out of this project, you need a basic understanding of the high-level architecture of the SDP including: Pravega for processing and storage, Apache Flink or Spark for analysis, Kubernetes, authentication and metrics, and back-end storage.

Download the installation files from https://github.com/pravega/video-samples. Use the examples, along with the following required components, available for download from the sites listed below:

Java 8 (JDK 1.8)--The Java programming language and Java development life cycle, including writing, compiling, creating JAR files, and running with the JDK or JRE.

Kubernetes--Kubernetes (K8s) for automating deployment, scaling, and management of containerized applications. A Kubernetes tutorial contains scripts and code to deploy all components on Kubernetes. See https://kubernetes.io/.

Docker--Install Docker and Docker Compose to simplify the deployment of the components: https://docs.docker.com/v17.12/install/



Pravega--For additional SDP developer code examples, see the SDP Code Hub: https://streamingdataplatform.github.io/code-hub/. For information about Pravega, see http://pravega.io.

Apache Flink--See https://flink.apache.org for more information.

Streaming Data Generator--A streaming data generator has been created for this example that can generate synthetic sensor and video data and send it to the Pravega Gateway. These Python scripts are in the streaming_data_generator directory.

Pravega Gateway--This GRPC server provides a gateway to a Pravega server. It provides limited Pravega functionality to any environment that supports GRPC, including Python. For more information, see Pravega Gateway.

Maven 3.6
Gradle 4.0
A webcam installed on your local workstation.

Overview of a video processing application

The SDP is not required for local application development and testing. You should start development on your local machine because it is easier to iterate quickly and troubleshoot applications when Flink or Spark applications and Pravega are installed locally. When you install and develop locally, your log files are also local.

Once you have an application developed, you can debug it locally using Docker and then deploy and test it on the SDP.

It usually takes a few minutes to deploy an application to the Platform, including uploading a JAR file, starting your clusters, and running your applications.

In this example, an application called Camera Recorder reads frames from a locally installed web camera and writes the data to a Pravega stream called the Video Stream.

Figure 27. Sample application for video processing with Pravega

The Video Data Generator Job generates random images that go into the same video stream. These random images do not have metadata associated with them, so you can distinguish those images from the ones coming from the camera recorder, which are tagged with metadata. There is a timestamp associated with each image in the video.

A Flink job called Multi-Video Grid Job processes the images. In a real-world situation, images from different servers would have different timestamps. As a result, this application aligns images in windows of time (10 milliseconds) to create a single image grid output of four images. This scenario could be used if you wanted to record four camera outputs at a time from independent video streams. It could also be used for real-time machine learning based on four different camera inputs. It aligns the images by time and then displays multi-video streaming output.

The Multi-Video Grid Job performs several steps:

Resizes images (scales multiple images down in parallel).
Aligns images by their time window. Because the image timestamps are slightly off, the job aligns them to be within 10 milliseconds of each other.
Joins the images into a single grid image and writes it to a new stream called the Grid Stream.

The Video Player application then reads the Grid Stream and displays the combined video on your screen.

For more information about building and running the video samples, browse and download the samples from https://github.com/pravega/video-samples.
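The outline below is a simplified sketch of the read-process-write pattern used by this pipeline, written with the Flink Pravega connector. It is not the video-samples code itself: the scope, stream names, and the VideoFrame type are placeholders, and some connector versions require additional builder options (for example, an event router).

import java.net.URI;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.FlinkPravegaWriter;
import io.pravega.connectors.flink.PravegaConfig;
import io.pravega.connectors.flink.serialization.PravegaSerialization;

public class VideoPipelineSketch {

    // Placeholder event type; the real project uses richer VideoFrame/ChunkedVideoFrame classes.
    public static class VideoFrame implements java.io.Serializable {
        public int camera;
        public long timestamp;
        public byte[] data;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://localhost:9090")) // local standalone Pravega
                .withDefaultScope("examples");

        // Source: read video frames from the Video Stream.
        FlinkPravegaReader<VideoFrame> reader = FlinkPravegaReader.<VideoFrame>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("video-stream")
                .withDeserializationSchema(PravegaSerialization.deserializationFor(VideoFrame.class))
                .build();

        // Sink: write the processed frames to the Grid Stream.
        FlinkPravegaWriter<VideoFrame> writer = FlinkPravegaWriter.<VideoFrame>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("grid-stream")
                .withSerializationSchema(PravegaSerialization.serializationFor(VideoFrame.class))
                .build();

        // The resize, time-window alignment, and grid-join operators would go between source and sink.
        env.addSource(reader).addSink(writer);

        env.execute("multi-video-grid-sketch");
    }
}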


References

This section includes Keycloak and Kubernetes user role definitions and additional support resources.

Topics:

Keycloak and Kubernetes user roles
Additional resources
Where to go for support
Provide feedback about this document

Keycloak and Kubernetes user roles

This section describes the roles used within the Streaming Data Platform.

Streaming Data Platform user roles

Role name in Keycloak Description

admin Users with complete wild-card access to all resources across all services. These users have wildcard access to all projects and all Pravega scopes. One user with this role is created by the installation. You may create additional usernames with the admin role.

<project-name>-member Users or clients in Keycloak with access to specifically protected resources for the named project. Access includes but is not limited to: Maven artifacts, Pravega scopes in the project's namespace, and Flink clusters defined in the project. These role names are created whenever a new project is created.

Kubernetes user roles

Role name in Kubernetes Description

cluster-admin Users with complete access to all K8s resources within the cluster. If you follow recommended procedures for creating users and assigning roles, the admin users in the previous section become cluster-admin users in the cluster, through impersonation.

<project-name>-member Users with curated access to most resources in the project namespace. Project members are aggregated to the common edit role.

NOTE: To use kubectl commands, admin and non-admin users must be UAA users.

For more information, see the Dell Technologies Streaming Data Platform Installation and Administration Guide.


Additional resources

This section provides additional resources for application developers using the SDP.

Topic Reference

SDP Code Hub SDP Code Hub: https://streamingdataplatform.github.io/code-hub/

SDP Documentation InfoHub Documentation resources from Dell Technologies: https://www.dell.com/support/article/us/en/19/sln319974/dell-emc-streaming-data-platform-infohub

Apache Flink concepts, tutorials, guidelines, and Flink API documentation.

Apache Flink Open Source Project documentation: https://flink.apache.org/

Ververica Flink documentation: https://www.ververica.com/

Apache Spark concepts, tutorials, guidelines, and Spark API documentation.

Apache Spark Open Source Project documentation: https://spark.apache.org

Pravega Spark Connector: https://github.com/pravega/spark-connectors

Spark 2.4.5 Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html

Spark 2.4.5 Streaming Documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Tuning: https://spark.apache.org/docs/latest/tuning.html

Pravega concepts Pravega Open source project documentation: http://www.pravega.io

Pravega concepts: https://pravega.io/docs/nightly/pravega-concepts/

Data flow from Sensors to the Edge and the Cloud using Pravega

Blog post: https://pravega.io/blog/2021/03/23/data-flow-from-sensors-to-the-edge-and-the-cloud-using-pravega/

Introducing Pravega 0.9.0: New features, improved performance and more

Blog post: https://pravega.io/blog/2021/03/10/introducing-pravega-0-9-0-new-features-improved-performance-and-more/

When speed meets parallelism -- Pravega performance under parallel streaming workloads

Blog post: https://pravega.io/blog/2021/03/10/when-speed-meets-parallelism-pravega-performance-under-parallel-streaming-workloads/

When speeding makes sense -- Fast, consistent, durable and scalable streaming data with Pravega

Blog post: https://pravega.io/blog/2020/10/01/when-speeding-makes-sense-fast-consistent-durable-and-scalable-streaming-data-with-pravega/

Pravega Client API 101 Blog post: https://pravega.io/blog/2020/09/22/pravega-client-api-101/

Where to go for support

Dell Technologies Secure Remote Services (SRS) or Secure Connect Gateway (SCG) and call home features are available for the SDP. These features require an SRS or SCG Gateway server configured on-site to monitor the platform. The SDP installation process configures the connection to the SRS or SCG Gateway. Detected problems are forwarded to Dell Technologies as actionable alerts, and support teams can remotely connect to the platform to help with troubleshooting.


Dell Technologies support

Support tab on the Dell homepage: https://www.dell.com/support/incidents-online. Once you identify your product, the "How to Contact Us" section gives you the option of email, chat, or telephone support.

SDP support landing page

Product information, drivers and downloads, and knowledge base articles: SDP product support landing page.

Telephone support
United States: 1-800-SVC-4EMC (1-800-782-4362)
Canada: 1-800-543-4782
Worldwide: 1-508-497-7901
Local phone numbers for a s
